Changed LoadDataWriter to load data in batches #81
Conversation
With the default
I did perf testing with tests from this repo. Batch size showed the same performance. 100k may work badly for INSERT queries, as there should be a limit on statement length; I ran into it with the Fivetran connector and UPDATE queries.
We have a connection pool, so creating a new statement is just sending a small LOAD DATA query.
For INSERT queries 100k is indeed too much; I'm referring to LOAD DATA queries. You're saying that sending batches of 10k, 100k, and 1M rows shows the same performance?
10k and 1M (no batches) show the same performance. Here queries are not parallelized (inside of one job), so 100k should also have the same performance.
The performance testing we did before parallelized writes and created a separate connection for each query. That caused a situation where ~100k was the optimal batch size.
I checked several more parameters.
Ignore my previous comment, it refers to rows per partition, so if there are at least 2 partitions, 10k should be ok :)
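To make the batching discussed above concrete, here is a minimal sketch of per-partition batching of LOAD DATA writes. All names (LoadDataBatchingSketch, writeBatch, writePartition, batchSize) are illustrative, not the connector's actual API; the real writer streams rows over a pooled connection rather than printing.

```scala
// Illustrative sketch only: group a partition's rows into fixed-size batches
// and issue one (here, simulated) LOAD DATA query per batch.
object LoadDataBatchingSketch {
  // Stand-in for sending one LOAD DATA query over a pooled connection.
  def writeBatch(rows: Seq[Seq[Any]]): Unit =
    println(s"LOAD DATA batch with ${rows.size} rows")

  // Batches are written sequentially within a partition (not parallelized
  // inside one job, as noted in the discussion above).
  def writePartition(rows: Iterator[Seq[Any]], batchSize: Int): Unit =
    rows.grouped(batchSize).foreach(writeBatch)

  def main(args: Array[String]): Unit = {
    val rows = Iterator.tabulate(25000)(i => Seq(i, s"value_$i"))
    writePartition(rows, batchSize = 10000) // emits batches of 10k, 10k, 5k
  }
}
```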
def tempColName(colName: String) = s"@${colName}_tmp"
nit: delete tab
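For context on the helper above: it simply prefixes the column name with `@` and appends `_tmp`, producing a user-variable-style name (presumably used in the generated LOAD DATA statement; that purpose is an inference, not stated in the diff). A self-contained illustration:

```scala
object TempColNameExample {
  // Same helper as in the diff above.
  def tempColName(colName: String): String = s"@${colName}_tmp"

  def main(args: Array[String]): Unit =
    println(tempColName("price")) // prints "@price_tmp"
}
```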
Force-pushed from 43d0c05 to 8a30b37
This change will allow us to avoid table locking. The insertBatchSize config will be used to set the batch size. Previously, we used it only when the OnDuplicateKey configuration was set.
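As a rough sketch of how the option might be supplied from the Spark side (the insertBatchSize key comes from this description; the format name, table name, and sample data are placeholders for whatever your setup already uses):

```scala
import org.apache.spark.sql.SparkSession

object InsertBatchSizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("insert-batch-size-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    df.write
      .format("memsql")                   // placeholder data source name
      .option("insertBatchSize", "10000") // rows per LOAD DATA batch
      .mode("append")
      .save("testdb.example_table")       // placeholder target table

    spark.stop()
  }
}
```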