Changed LoadDataWriter to load data in batches #81
Conversation
With the default
I did perf testing with tests from this repo. Batch size showed the same performance. 100k may work badly for INSERT queries, as there should be a limit on statement length; I ran into it with the Fivetran connector and UPDATE queries.
We have a connection pool, so creating a new statement is just sending a small LOAD DATA query.
For INSERT queries 100k is indeed too much; I'm referring to LOAD DATA queries. You're saying that sending batches of 10k, 100k, and 1M rows shows the same performance?
10k and 1M (no batches) show the same performance. Here queries are not parallelized (inside of one job), so 100k should also have the same performance.
The performance testing we did before parallelized writes and created a separate connection for each query. That caused a situation where ~100k was the optimal batch size.
I checked several more parameters.
Ignore my previous comment, it refers to rows per partition, so if there are at least 2 partitions, 10k should be ok :)
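To make the batching discussed above concrete, here is a minimal sketch of per-partition batching of LOAD DATA writes. All names (LoadDataBatchingSketch, writeBatch, writePartition, batchSize) are illustrative, not the connector's actual API; the real writer streams rows over a pooled connection rather than printing.

```scala
// Illustrative sketch only: group a partition's rows into fixed-size batches
// and issue one (here, simulated) LOAD DATA query per batch.
object LoadDataBatchingSketch {
  // Stand-in for sending one LOAD DATA query over a pooled connection.
  def writeBatch(rows: Seq[Seq[Any]]): Unit =
    println(s"LOAD DATA batch with ${rows.size} rows")

  // Batches are written sequentially within a partition (not parallelized
  // inside one job, as noted in the discussion above).
  def writePartition(rows: Iterator[Seq[Any]], batchSize: Int): Unit =
    rows.grouped(batchSize).foreach(writeBatch)

  def main(args: Array[String]): Unit = {
    val rows = Iterator.tabulate(25000)(i => Seq(i, s"value_$i"))
    writePartition(rows, batchSize = 10000) // emits batches of 10k, 10k, 5k
  }
}
```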
def tempColName(colName: String) = s"@${colName}_tmp"
nit: delete tab
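For context on the helper above: it simply prefixes the column name with `@` and appends `_tmp`, producing a user-variable-style name (presumably used in the generated LOAD DATA statement; that purpose is an inference, not stated in the diff). A self-contained illustration:

```scala
object TempColNameExample {
  // Same helper as in the diff above.
  def tempColName(colName: String): String = s"@${colName}_tmp"

  def main(args: Array[String]): Unit =
    println(tempColName("price")) // prints "@price_tmp"
}
```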
Force-pushed from 43d0c05 to 8a30b37
This change will allow us to avoid table locking. The insertBatchSize config will be used to set the batch size. Previously, we used it only when the OnDuplicateKey configuration was set.
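As a rough sketch of how the option might be supplied from the Spark side (the insertBatchSize key comes from this description; the format name, table name, and sample data are placeholders for whatever your setup already uses):

```scala
import org.apache.spark.sql.SparkSession

object InsertBatchSizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("insert-batch-size-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    df.write
      .format("memsql")                   // placeholder data source name
      .option("insertBatchSize", "10000") // rows per LOAD DATA batch
      .mode("append")
      .save("testdb.example_table")       // placeholder target table

    spark.stop()
  }
}
```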