
Changed LoadDataWriter to load data in batches #81

Merged: 1 commit into master on Feb 23, 2024

Conversation

AdalbertMemSQL
Collaborator

This change will allow us to avoid table locking.
The insertBatchSize config option will be used to set the batch size. Previously, it was used only when the OnDuplicateKey configuration was set.
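A minimal sketch of the batching idea, assuming a hypothetical writeBatch helper (the names below are illustrative stand-ins, not the connector's actual internals). Each batch becomes its own LOAD DATA statement, so no single statement touches more than insertBatchSize rows:

```scala
// Illustrative sketch only: Row and writeBatch are hypothetical,
// not the connector's real types or methods.
object BatchedLoadSketch {
  type Row = Seq[Any]

  // Stub standing in for issuing one LOAD DATA query per batch
  // over the pooled connection.
  def writeBatch(batch: Seq[Row]): Unit =
    println(s"LOAD DATA: ${batch.size} rows")

  def write(rows: Iterator[Row], insertBatchSize: Int): Unit =
    // grouped() chunks the row stream lazily, so memory stays bounded
    // to one batch and no single LOAD DATA statement touches enough
    // rows to escalate to a full-table lock.
    rows.grouped(insertBatchSize).foreach(batch => writeBatch(batch.toSeq))
}
```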

@pmishchenko-ua

With the default insertBatchSize of 10,000, can these changes negatively affect performance with the default configuration? My guess is that at least 100,000 rows per batch are needed to get close to the best performance.

@AdalbertMemSQL
Collaborator Author

> With the default insertBatchSize of 10,000, can these changes negatively affect performance with the default configuration? My guess is that at least 100,000 rows per batch are needed to get close to the best performance.

I did perf testing with the tests from this repo. Both batch sizes showed the same performance. 100k may work badly for insert queries, as there is a limit on statement length; I ran into it with the Fivetran connector and update queries.

@AdalbertMemSQL
Collaborator Author

We have a connection pool, so creating a new statement is just sending a small LOAD DATA query.

@pmishchenko-ua

For insert queries 100k is indeed too much; I'm referring to LOAD DATA queries. You're saying that sending batches of 10k, 100k, and 1M rows shows the same performance?

@AdalbertMemSQL
Collaborator Author

10k and 1M (no batching) show the same performance. Queries here are not parallelized (within one job), so 100k should also have the same performance.

@AdalbertMemSQL
Collaborator Author

The performance testing we did previously parallelized writes and created a separate connection for each query. That's why ~100k was the optimal batch size in that setup.

@AdalbertMemSQL
Collaborator Author

AdalbertMemSQL commented Feb 22, 2024

I checked several more parameters. By default, default_columnstore_table_lock_threshold will cause the whole table to be locked if a query touches more than 5k rows (https://docs.singlestore.com/cloud/create-a-database/columnstore/locking-in-columnstores/#overriding-default-locking), so it makes sense to use an even smaller batch size.
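For reference, the current value of that engine variable can be inspected over any SQL connection. A small sketch assuming a MySQL-protocol JDBC driver on the classpath; the URL and credentials are placeholders, not project config:

```scala
// Sketch: read the engine variable mentioned above via JDBC.
import java.sql.DriverManager

object LockThresholdCheck {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/?user=root&password=")
    try {
      val rs = conn.createStatement().executeQuery(
        "SHOW VARIABLES LIKE 'default_columnstore_table_lock_threshold'")
      // Prints variable name and value, e.g. "... = 5000".
      while (rs.next())
        println(s"${rs.getString(1)} = ${rs.getString(2)}")
    } finally conn.close()
  }
}
```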

@AdalbertMemSQL
Collaborator Author

Ignore my previous comment; the threshold refers to rows per partition, so if there are at least 2 partitions, 10k should be ok :)
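To make the arithmetic concrete (assuming rows spread evenly across partitions): a 10k-row batch written to a database with 2 partitions touches about 10,000 / 2 = 5,000 rows per partition, which does not exceed the 5k threshold, and with 8 partitions it drops to roughly 1,250 rows per partition.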

}

def tempColName(colName: String) = s"@${colName}_tmp"


nit: delete tab

@AdalbertMemSQL AdalbertMemSQL merged commit c0955b9 into master Feb 23, 2024
0 of 2 checks passed