Raptor v2 #11254

Open
electrum wants to merge 4 commits into master from electrum:raptorx

Conversation

5 participants
@electrum
Collaborator

electrum commented Aug 13, 2018

No description provided.

@electrum electrum force-pushed the electrum:raptorx branch from 83755d6 to b334737 Aug 14, 2018

@zhaotao1986

Thanks, @electrum. Having worked on a few projects, I have to say this is the most elegant codebase I've seen.

I mostly read the transaction part and have some questions. You may have discussed them before; I don't have the big picture yet, so some questions may not be relevant. Please help me warm up :)

First question: it seems we build a transaction layer on top of the MySQL transaction layer? For example, DatabaseMetadataWriter.writeMaster calls masterDbi.useTransaction.
If so, I think we could just use MySQL transactions for both writing and reading, right? One benefit is that we might be able to get rid of the "active_commit" table, and "chunksAdded" and "chunksRemoved" inside Transaction.java, since MySQL has its own MVCC logic.

@zhaotao1986

zhaotao1986 commented Aug 14, 2018

Another question, related to the above. I think we are doing MVCC: for example, in table “schemata”, a row’s lifetime is bounded by the interval [start_commit_id, end_commit_id]. To clarify, we never update in place; we only insert and delete.
But do we need both a CommitID and a TransactionID? I see both in MasterWriterDao.

I think the last TransactionID with successful == TRUE is just the last CommitID. We only write into table “schemata” at commit time, so we could use the committed TransactionID as the CommitID.
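To make sure I've got the interval model right, here's a minimal sketch of the visibility rule as I understand it (hypothetical code, not the actual Raptor classes; I'm assuming end_commit_id == 0 means the row has not been deleted yet):

```java
// Hypothetical sketch of the [start_commit_id, end_commit_id] visibility
// rule described above -- not the actual Raptor code. A row is visible at a
// snapshot if it was created at or before the snapshot commit and was not
// yet deleted as of that commit. end_commit_id == 0 means "still live".
public class RowVisibility
{
    public static boolean isVisible(long startCommitId, long endCommitId, long snapshotCommitId)
    {
        // created after the snapshot: not visible
        if (startCommitId > snapshotCommitId) {
            return false;
        }
        // not deleted yet, or deleted strictly after the snapshot
        return (endCommitId == 0) || (snapshotCommitId < endCommitId);
    }
}
```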

I also added some inline comments.

@electrum

Collaborator

electrum commented Aug 16, 2018

There are several reasons why we need to build our own transaction layer:

  • Raptor v1 relied on doing everything in a single MySQL transaction, which ran into problems:
    • It hit the size limit for a single MySQL transaction when inserting too many rows at once.
    • MySQL does not have transactional DDL, so index tables have to be updated separately, and they can become inconsistent (corrupted or orphaned).
  • Using MySQL would mean keeping a single MySQL transaction open for the duration of the Presto transaction. This is unlikely to work well due to MySQL idle timeouts, network disconnects, etc.
  • Raptor v2 supports sharded metadata, rather than relying on a single large MySQL database, so we need our own transaction layer to keep the shards consistent.
@electrum

Collaborator

electrum commented Aug 16, 2018

A commit ID keeps track of committed data. It is allocated sequentially at commit time with the invariant that commits are sequential. This has two important properties:

  1. Everything in the database is either valid, committed data, or it belongs to the single committing transaction. This lets readers do a simple range check for visibility; they don't need to filter out data from uncommitted or failed transactions.

  2. A single commit ID identifies an immutable database snapshot forever. This doesn't have any immediate use, but it would make time travel queries consistent. Using non-sequential transaction IDs would allow data to appear from transactions that started earlier but committed later.

A transaction ID is used to keep track of chunks that are being created so that they can be cleaned up if the transaction fails.
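A toy illustration of how the two identifiers interact (hypothetical names, not the actual metadata layer): transaction IDs are handed out at begin time and only tag in-flight chunks for cleanup, while commit IDs are allocated sequentially at commit time, so a reader at snapshot N sees exactly commits 1..N.

```java
import java.util.*;

// Toy sketch of the two IDs described above -- not the real Raptor metadata
// code. Commit IDs are dense and sequential because they are allocated at
// commit time; transaction IDs exist so a failed transaction's chunks can
// be found and deleted.
public class CommitLog
{
    private long nextCommitId;
    private long nextTransactionId;
    private final Map<Long, List<String>> pendingChunks = new HashMap<>();
    private final List<String> committedChunks = new ArrayList<>();

    public long beginTransaction()
    {
        long txId = ++nextTransactionId;
        pendingChunks.put(txId, new ArrayList<>());
        return txId;
    }

    public void addChunk(long txId, String chunk)
    {
        pendingChunks.get(txId).add(chunk);
    }

    // commit ID allocated only now, preserving the sequential invariant
    public long commit(long txId)
    {
        committedChunks.addAll(pendingChunks.remove(txId));
        return ++nextCommitId;
    }

    // cleanup path: drop the chunks created by a failed transaction
    public void abort(long txId)
    {
        pendingChunks.remove(txId);
    }

    public List<String> committedChunks()
    {
        return committedChunks;
    }
}
```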

@zhaotao1986

zhaotao1986 commented Aug 16, 2018

Thank you for clarifying!

@electrum electrum force-pushed the electrum:raptorx branch 3 times, most recently from c8ead36 to f9667dc Aug 22, 2018

@electrum electrum force-pushed the electrum:raptorx branch from f9667dc to 7ce6ade Aug 30, 2018

@electrum electrum force-pushed the electrum:raptorx branch from 7ce6ade to f511423 Sep 13, 2018

@electrum electrum force-pushed the electrum:raptorx branch 5 times, most recently from 4f8b48d to 0b22d6c Sep 18, 2018

@electrum electrum force-pushed the electrum:raptorx branch from 0b22d6c to 1e842c6 Sep 26, 2018

@tdcmeehan tdcmeehan self-requested a review Sep 28, 2018

@electrum electrum force-pushed the electrum:raptorx branch from 1e842c6 to 5000aa9 Oct 2, 2018

@electrum electrum requested a review from dain Oct 15, 2018

@tdcmeehan

Contributor

tdcmeehan commented Oct 18, 2018

@electrum I'll begin to review this now. This may take some time.

@tdcmeehan

Looked at ChunkDelta and OrcStorageManager. Looks great.

@facebook-github-bot

facebook-github-bot commented Nov 17, 2018

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.

@facebook-github-bot

facebook-github-bot commented Nov 28, 2018

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@dain dain self-assigned this Nov 30, 2018

@dain

Some comments

(5 resolved review threads on ...o-raptorx/src/test/java/com/facebook/presto/raptorx/TestRaptorSmoke.java)
```java
    }
}
if ((totalBytes > maxBufferBytes) && (maxBuffer != null)) {
```

@dain

dain Nov 30, 2018

Contributor

Shouldn't you run the flush until totalBytes <= maxBufferBytes?

@electrum

electrum Dec 12, 2018

Collaborator

This is tricky, and it's the same as Raptor v1, so I'm going to leave it alone for now.

As long as the added page is less than the buffer size, we will stay balanced, since we can flush a buffer for each added page.

I'm hesitant to add a loop, since the current code is O(n) to find the largest buffer to flush. We could sort the buffers to allow efficiently flushing multiple, but that's substantially more expensive in the common case that only one is flushed.
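The policy described above might look roughly like this (a hypothetical sketch with made-up names, not the actual storage manager code): when the total exceeds the limit, do a single O(n) scan for the largest buffer and flush just that one, which keeps the total bounded as long as each appended page is smaller than the limit.

```java
import java.util.*;

// Hypothetical sketch of the single-flush-per-append policy discussed
// above -- not the actual Raptor v2 code. Each append may trigger at most
// one flush: an O(n) scan picks the largest buffer, which is cheaper than
// keeping buffers sorted when only one flush per append is expected.
public class BufferPool
{
    private final long maxBufferBytes;
    private final Map<String, Long> bufferSizes = new HashMap<>();
    private long totalBytes;

    public BufferPool(long maxBufferBytes)
    {
        this.maxBufferBytes = maxBufferBytes;
    }

    // returns the buffer flushed by this append, if any
    public Optional<String> append(String bucket, long pageBytes)
    {
        bufferSizes.merge(bucket, pageBytes, Long::sum);
        totalBytes += pageBytes;

        if (totalBytes <= maxBufferBytes) {
            return Optional.empty();
        }
        // O(n) scan for the single largest buffer to flush
        String largest = Collections.max(bufferSizes.entrySet(), Map.Entry.comparingByValue()).getKey();
        totalBytes -= bufferSizes.remove(largest);
        return Optional.of(largest);
    }

    public long totalBytes()
    {
        return totalBytes;
    }
}
```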

(resolved, outdated review threads on ...torx/src/main/java/com/facebook/presto/raptorx/util/PageListBuilder.java and presto-raptorx/src/main/java/com/facebook/presto/raptorx/util/OrcUtil.java)
```java
public static OrcRecordReader createOrcRecordReader(OrcReader reader, Map<Integer, Type> includedColumns)
{
    return reader.createRecordReader(includedColumns, OrcPredicate.TRUE, UTC, newSimpleAggregatedMemoryContext(), MAX_BATCH_SIZE);
}
```

@dain

dain Nov 30, 2018

Contributor

Using MAX_BATCH_SIZE can be dangerous if you have rows containing large collections.

@electrum

electrum Dec 12, 2018

Collaborator

How much of a performance hit is changing this to INITIAL_BATCH_SIZE (1)?
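For context, the usual way to get the best of both is adaptive batch sizing: start at the initial size and grow geometrically toward the maximum, so the first batches bound memory for rows with large collections while steady-state throughput approaches the fixed-size case. A minimal sketch under that assumption (hypothetical class, not the actual ORC reader internals):

```java
// Hypothetical sketch of adaptive batch sizing -- start at 1 row and double
// toward the maximum on each batch. The first batch can't blow up memory on
// rows with large collections, and after ~log2(max) batches the reader runs
// at the full batch size.
public class AdaptiveBatchSize
{
    public static final int INITIAL_BATCH_SIZE = 1;
    public static final int MAX_BATCH_SIZE = 1024;

    private int nextBatchSize = INITIAL_BATCH_SIZE;

    public int nextBatchSize()
    {
        int size = nextBatchSize;
        // geometric growth, capped at the maximum
        nextBatchSize = Math.min(nextBatchSize * 2, MAX_BATCH_SIZE);
        return size;
    }
}
```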

@electrum electrum force-pushed the electrum:raptorx branch from 5000aa9 to 526481f Dec 12, 2018

@electrum

Collaborator

electrum commented Dec 12, 2018

All comments should be addressed, with the exception of better handling of choosing bucket count for non-bucketed tables.

@electrum electrum force-pushed the electrum:raptorx branch from 526481f to 7ce3639 Dec 13, 2018

@electrum

Collaborator

electrum commented Dec 13, 2018

This includes fixes from #12068 which will be merged first.
