Optimize scanning of string offsets to skip pages we don't need to read #2575
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master    #2575      +/-   ##
==========================================
+ Coverage   89.24%   93.38%   +4.13%
==========================================
  Files        1220     1063     -157
  Lines       45297    39371    -5926
==========================================
- Hits        40427    36767    -3660
+ Misses       4870     2604    -2266
```

☔ View full report in Codecov by Sentry.
Force-pushed from df711a2 to 8dedd93
https://github.com/kuzudb/kuzu/actions/runs/7186804200/job/19573028840?pr=2575
Clang-tidy on macOS doesn't seem to recognize that
Can you share more details on your benchmark? What dataset is used, how large is the dictionary, and what are the numbers like?
I've updated the description to include benchmark details, as well as more details about the change.
src/storage/store/string_column.cpp (outdated)

```cpp
    }
    return std::ranges::subrange(data.begin() + startPos, data.begin() + endPos);
}

void StringColumn::scanValuesToVector(Transaction* transaction, node_group_idx_t nodeGroupIdx,
```
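As context for the fragment above: `std::ranges::subrange` produces a non-owning view over a half-open iterator range. Here is a minimal, self-contained illustration; the container and bounds are hypothetical stand-ins, not the PR's actual data:

```cpp
// Slicing a half-open window [startPos, endPos) out of a vector without
// copying, using std::ranges::subrange (C++20). Values are made up.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <ranges>
#include <vector>

int main() {
    std::vector<uint64_t> offsets{10, 11, 12, 40, 41, 90};
    std::size_t startPos = 1, endPos = 4;
    auto window = std::ranges::subrange(offsets.begin() + startPos, offsets.begin() + endPos);
    for (auto v : window) {
        std::cout << v << ' '; // prints: 11 12 40
    }
    std::cout << '\n';
}
```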
This code block is becoming complex now. Can you write a more detailed comment summarizing the flow of scanning values when there is (or isn't) a lot of duplication?
src/storage/store/string_column.cpp (outdated)

```cpp
@@ -28,6 +28,9 @@ StringColumn::StringColumn(std::string name, std::unique_ptr<LogicalType> dataTy

void StringColumn::scanOffsets(Transaction* transaction, const ReadState& state,
    string_offset_t* offsets, uint64_t index, uint64_t numValues, uint64_t dataSize) {
    // TODO(bmwinger): this causes an issue with structs where a read is attempted on an empty chunk
```
I don't follow here; what's the connection with structs?
I think it was that, for some reason, structs read the first value even when the numValues in the metadata is 0, so the assertion fails for strings stored in a struct. As long as everything is zero, that value just ends up being an empty string, but it shouldn't be read at all. I'm not sure why it happened only for structs.
Do you have a test case or example that can reproduce that? It looks like something we should fix now or later; at the least, it would be better to document it in an issue.
Un-commenting that line causes only EmptyCreateNodeTest.CreateStructWithWriteTransaction to fail. That said, after creating the node the string chunk should contain a single string, so maybe it's the metadata that's wrong; probably something to do with the scan occurring before the commit.
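To make the failure mode concrete, here is a minimal hypothetical reduction of the situation being described; all names are illustrative, and the real check lives in `StringColumn::scanOffsets`:

```cpp
// A hypothetical reduction of the struct issue: metadata reports zero values,
// yet index 0 is still read. Because the chunk is zero-initialized, the read
// "succeeds" and yields offset 0, which decodes to an empty string.
#include <cassert>
#include <cstdint>
#include <vector>

struct ChunkMetadata {
    uint64_t numValues = 0; // what the metadata claims is stored
};

uint64_t readOffset(const ChunkMetadata& meta, const std::vector<uint64_t>& offsets,
    uint64_t index) {
    // The commented-out assertion from the TODO would fire here on the struct
    // path, where index == 0 arrives while meta.numValues == 0:
    // assert(index < meta.numValues);
    return index < offsets.size() ? offsets[index] : 0;
}
```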
Force-pushed from 3a8a041 to d0bb44a
Force-pushed from d0bb44a to 2d2aa22
Before, we scanned all offsets at once to reduce the number of scans (with dictionary compression, offsets may be out of order even during unfiltered scans, so there's no guarantee they will be adjacent). That had a bad worst case: for chunks with a large dictionary, you might end up scanning the entire chunk just to read a few pages.
So instead, this scans in batches, computed by finding the next range of values in which no two values are further apart than one page (using the number of values per page derived from the chunk metadata). It will still scan unneeded values, but no longer unneeded pages. This performed better than scanning only adjacent values.
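To illustrate the batching described above, here is a minimal sketch; it is not the actual kuzu implementation, and all names (`scanInPageBatches`, `valuesPerPage`, `scanRange`) are assumptions:

```cpp
// Batch sorted offsets so that consecutive offsets within a batch are at most
// one page apart; each batch is then read with a single contiguous scan.
// Unneeded values inside a batch may still be read, but any gap wider than a
// page starts a new batch, so pages with no needed value are skipped entirely.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using string_offset_t = uint64_t;

void scanInPageBatches(const std::vector<string_offset_t>& sortedOffsets,
    uint64_t valuesPerPage,
    const std::function<void(string_offset_t, string_offset_t)>& scanRange) {
    std::size_t start = 0;
    while (start < sortedOffsets.size()) {
        // Grow the batch while the gap to the next offset fits within a page.
        std::size_t end = start + 1;
        while (end < sortedOffsets.size() &&
               sortedOffsets[end] - sortedOffsets[end - 1] <= valuesPerPage) {
            end++;
        }
        // One contiguous scan covers the whole batch [first, last].
        scanRange(sortedOffsets[start], sortedOffsets[end - 1]);
        start = end;
    }
}
```

For example, with offsets {0, 3, 200} and 64 values per page, this issues two scans: one over [0, 3] and one over [200, 200], skipping the pages in between.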
I'm concerned that this is a little complicated, but it gives a moderate improvement to string scanning performance (roughly 10% on columns with a moderate amount of duplication).
Here are two sets of benchmarks on the LDBC-10 comment table (I did two runs to make it clearer when differences were due to externalities like system noise; on one of them, the two runs are superimposed). Note that these are without clearing OS filesystem caches.
I'd also done some cold runs in the same circumstances on the `locationIP` and `content` tables, and got a speedup of about 7% (the only numbers I have to hand are 859ms average compared to 924ms average, for the `content` column with the string IDs; non-string IDs were slightly faster, but still a comparable difference).

Oddly, the q05 benchmark (scanning `browserUsed`) is faster with this change, which I think is suspicious, as the `browserUsed` column only has 5 distinct values and should get no benefit from skipping pages.