
Optimize scanning of string offsets to skip pages we don't need to read #2575

Open · wants to merge 1 commit into base: master

Conversation

@benjaminwinger (Collaborator) commented Dec 12, 2023

Before, we were scanning all offsets at once to reduce the number of scans (since with dictionary compression, offsets may be out of order even when doing unfiltered scans, so there's no guarantee that they will be adjacent). That had a bad worst case: for chunks with a large dictionary, you might end up scanning the entire chunk when only a few pages were actually needed.

So instead, this scans in batches, calculated by finding the next range of values in which no two values are more than one page apart (using the number of values per page calculated from the chunk metadata). It will still scan some unneeded values, but no longer scans unneeded pages. This performed better than scanning only adjacent values.
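The batching described above can be sketched roughly as follows. This is an illustrative stand-in, not Kuzu's real API: `nextBatch`, `offsets`, and `valuesPerPage` are hypothetical names, and the real implementation in `string_column.cpp` differs in detail. The idea is to extend a batch while the next value's page stays within one page of the pages already covered, so each batch maps to a contiguous page range containing no page we don't need:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: given the offsets to read and the number of values
// that fit on one page (derived from chunk metadata), find the next batch
// [start, end) in which no two values are more than one page apart.
std::pair<size_t, size_t> nextBatch(const std::vector<uint64_t>& offsets,
                                    size_t start, uint64_t valuesPerPage) {
    uint64_t minPage = offsets[start] / valuesPerPage;
    uint64_t maxPage = minPage;
    size_t end = start + 1;
    while (end < offsets.size()) {
        uint64_t page = offsets[end] / valuesPerPage;
        // Stop if reading this value would pull in a page more than one
        // page away from the range we already cover.
        if (page + 1 < minPage || page > maxPage + 1) {
            break;
        }
        minPage = std::min(minPage, page);
        maxPage = std::max(maxPage, page);
        ++end;
    }
    return {start, end};
}
```

With `valuesPerPage = 4`, offsets `{0, 5, 6, 20, 21}` fall on pages `0, 1, 1, 5, 5`, so the first batch covers indices `[0, 3)` and the second `[3, 5)`; the gap between pages 1 and 5 is never read.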

I'm concerned that this is a little complicated, but it gives a moderate improvement to string scanning performance (roughly 10% on columns with a moderate amount of duplication).

Here are two sets of benchmarks on the LDBC-10 comment table (I did two runs to make it clearer when differences were due to externalities like system noise; on one of them the two runs are superimposed). Note that these are without clearing OS filesystem caches.

I'd also done some cold runs in the same circumstances on the locationIP and content tables, and got a speedup of about 7% (the only numbers I have to hand are 859 ms average compared to 924 ms average, for the content column with the string IDs; non-string IDs were slightly faster, but showed a comparable difference).

Oddly, the q05 benchmark (scanning browserUsed) is faster with this change, which I find suspicious, as the browserUsed column has only 5 distinct values and should get no benefit from skipping pages.


codecov bot commented Dec 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.38%. Comparing base (2b924e5) to head (d0bb44a).
Report is 6 commits behind head on master.

Current head d0bb44a differs from the pull request's most recent head 2d2aa22.

Please upload reports for commit 2d2aa22 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2575      +/-   ##
==========================================
+ Coverage   89.24%   93.38%   +4.13%     
==========================================
  Files        1220     1063     -157     
  Lines       45297    39371    -5926     
==========================================
- Hits        40427    36767    -3660     
+ Misses       4870     2604    -2266     


@benjaminwinger (Collaborator, Author):

https://github.com/kuzudb/kuzu/actions/runs/7186804200/job/19573028840?pr=2575

Clang-tidy on macOS doesn't seem to recognize that std::ranges::subrange exists, even though the code did compile in the macOS build job.

@ray6080 (Contributor) commented Dec 13, 2023

> I'm concerned that this is a little complicated, but it gives a moderate improvement to string scanning performance (roughly 10% on columns with a moderate amount of duplication).

Can you share more details on your benchmark? What dataset is used, how large the dictionary is and what are the numbers like?

@benjaminwinger (Collaborator, Author):

> Can you share more details on your benchmark? What dataset is used, how large the dictionary is and what are the numbers like?

I've updated the description to include benchmark details, as well as more details about the change.

src/storage/store/string_column.cpp (review threads, outdated and resolved)
```cpp
    }
    return std::ranges::subrange(data.begin() + startPos, data.begin() + endPos);
}

void StringColumn::scanValuesToVector(Transaction* transaction, node_group_idx_t nodeGroupIdx,
```

Contributor:

This code block is becoming complex now. Can you write a more detailed comment summarizing the flow of scanning values when there is (or isn't) a lot of duplication?

```
@@ -28,6 +28,9 @@ StringColumn::StringColumn(std::string name, std::unique_ptr<LogicalType> dataTy

void StringColumn::scanOffsets(Transaction* transaction, const ReadState& state,
    string_offset_t* offsets, uint64_t index, uint64_t numValues, uint64_t dataSize) {
    // TODO(bmwinger): this causes an issue with structs where a read is attempted on an empty chunk
```

Contributor:

I don't follow here; what's the connection with structs?

Collaborator (Author):

I think it was that, for some reason, structs read the first value even when the numValues in the metadata is 0, so the assertion fails for strings stored in a struct. As long as everything is zero, that value just ends up being an empty string, but it shouldn't be read at all. I'm not sure why this happens only for structs.

Contributor:

Do you have a test case or example that can reproduce that? It looks like we should either fix it now, or document it in an issue to fix later.

Collaborator (Author):

Un-commenting that line causes only EmptyCreateNodeTest.CreateStructWithWriteTransaction to fail. That said, after creating the node the string chunk should contain a single string, so maybe it's the metadata that's wrong; probably something to do with the scan occurring before the commit.
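The guard being discussed could look roughly like the following. This is a hypothetical illustration with stand-in types (`ChunkMeta` and this `scanOffsets` signature are not Kuzu's real API): when the chunk metadata reports zero values, the scan returns early, so an empty chunk (e.g. a struct child before the commit) never attempts to read a nonexistent first value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the chunk metadata discussed above (hypothetical type).
struct ChunkMeta {
    uint64_t numValues;
};

// Hypothetical scan: copies up to numValues offsets into out, but skips the
// read entirely for an empty chunk instead of reading a meaningless value.
size_t scanOffsets(const ChunkMeta& meta, const std::vector<uint64_t>& data,
                   std::vector<uint64_t>& out) {
    if (meta.numValues == 0) {
        out.clear();
        return 0; // empty chunk: nothing to scan, the assertion is never reached
    }
    size_t n = std::min<size_t>(meta.numValues, data.size());
    out.assign(data.begin(), data.begin() + n);
    return n;
}
```

Whether the right fix is such a guard or correcting the metadata written before the commit is exactly the open question in this thread.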

@benjaminwinger force-pushed the string-scan-opt branch 6 times, most recently from 3a8a041 to d0bb44a on January 10, 2024.

Benchmark Result

Master commit hash: 91f55e66c53022cfec0a2b1005fbf5d56e6ab2ca
Branch commit hash: b856d3d4403e7fcafa151a344eaae5a6eada6a73

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 648.14 | 641.43 | 6.71 (1.05%) |
| aggregation | q28 | 14738.20 | 13869.30 | 868.90 (6.26%) |
| copy | node-Comment | 51474.79 | N/A | N/A |
| copy | node-Forum | 4785.93 | N/A | N/A |
| copy | node-Organisation | 1647.98 | N/A | N/A |
| copy | node-Person | 2896.29 | N/A | N/A |
| copy | node-Place | 1687.03 | N/A | N/A |
| copy | node-Post | 19426.16 | N/A | N/A |
| copy | node-Tag | 1533.47 | N/A | N/A |
| copy | node-Tagclass | 1494.01 | N/A | N/A |
| copy | rel-comment-hasCreator | 46697.71 | N/A | N/A |
| copy | rel-comment-hasTag | 66742.12 | N/A | N/A |
| copy | rel-comment-isLocatedIn | 47884.87 | N/A | N/A |
| copy | rel-containerOf | 11568.87 | N/A | N/A |
| copy | rel-forum-hasTag | 2825.13 | N/A | N/A |
| copy | rel-hasInterest | 2096.95 | N/A | N/A |
| copy | rel-hasMember | 37978.32 | N/A | N/A |
| copy | rel-hasModerator | 1370.12 | N/A | N/A |
| copy | rel-hasType | 506.38 | N/A | N/A |
| copy | rel-isPartOf | 385.79 | N/A | N/A |
| copy | rel-isSubclassOf | 440.84 | N/A | N/A |
| copy | rel-knows | 4429.24 | N/A | N/A |
| copy | rel-likes-comment | 82935.59 | N/A | N/A |
| copy | rel-likes-post | 27085.85 | N/A | N/A |
| copy | rel-organisation-isLocatedIn | 583.80 | N/A | N/A |
| copy | rel-person-isLocatedIn | 767.04 | N/A | N/A |
| copy | rel-post-hasCreator | 11463.35 | N/A | N/A |
| copy | rel-post-hasTag | 15527.90 | N/A | N/A |
| copy | rel-post-isLocatedIn | 13063.42 | N/A | N/A |
| copy | rel-replyOf-comment | 53259.49 | N/A | N/A |
| copy | rel-replyOf-post | 39105.13 | N/A | N/A |
| copy | rel-studyAt | 683.44 | N/A | N/A |
| copy | rel-workAt | 842.34 | N/A | N/A |
| filter | q14 | 132.73 | 126.00 | 6.73 (5.34%) |
| filter | q15 | 134.09 | 124.53 | 9.56 (7.67%) |
| filter | q16 | 309.70 | 294.06 | 15.64 (5.32%) |
| filter | q17 | 452.51 | 444.80 | 7.71 (1.73%) |
| filter | q18 | 1903.02 | 1854.05 | 48.96 (2.64%) |
| fixed_size_expr_evaluator | q07 | 568.72 | 566.26 | 2.47 (0.44%) |
| fixed_size_expr_evaluator | q08 | 798.79 | 792.65 | 6.14 (0.77%) |
| fixed_size_expr_evaluator | q09 | 798.66 | 788.81 | 9.85 (1.25%) |
| fixed_size_expr_evaluator | q10 | 246.90 | 243.45 | 3.45 (1.42%) |
| fixed_size_expr_evaluator | q11 | 241.41 | 239.01 | 2.39 (1.00%) |
| fixed_size_expr_evaluator | q12 | 240.76 | 253.45 | -12.69 (-5.01%) |
| fixed_size_expr_evaluator | q13 | 1473.13 | 1478.34 | -5.21 (-0.35%) |
| fixed_size_seq_scan | q23 | 124.02 | 123.78 | 0.24 (0.20%) |
| join | q29 | 703.67 | 697.16 | 6.51 (0.93%) |
| join | q30 | 1494.82 | 1514.18 | -19.36 (-1.28%) |
| join | q31 | 20.16 | 16.79 | 3.37 (20.08%) |
| ldbc_snb_ic | q35 | 935.12 | 1060.74 | -125.61 (-11.84%) |
| ldbc_snb_ic | q36 | 63.84 | 54.57 | 9.28 (17.00%) |
| ldbc_snb_is | q32 | 16.25 | 18.21 | -1.97 (-10.80%) |
| ldbc_snb_is | q33 | 23.91 | 25.48 | -1.58 (-6.18%) |
| ldbc_snb_is | q34 | 15.33 | 16.57 | -1.24 (-7.50%) |
| order_by | q25 | 139.50 | 127.14 | 12.36 (9.72%) |
| order_by | q26 | 449.42 | 435.54 | 13.89 (3.19%) |
| order_by | q27 | 1388.65 | 1377.52 | 11.13 (0.81%) |
| scan_after_filter | q01 | 177.65 | 169.11 | 8.54 (5.05%) |
| scan_after_filter | q02 | 157.33 | 150.35 | 6.98 (4.64%) |
| shortest_path_ldbc100 | q39 | 58.09 | 99.96 | -41.87 (-41.89%) |
| var_size_expr_evaluator | q03 | 2016.45 | 2006.73 | 9.72 (0.48%) |
| var_size_expr_evaluator | q04 | 2225.84 | 2167.93 | 57.91 (2.67%) |
| var_size_expr_evaluator | q05 | 2627.00 | 2564.23 | 62.76 (2.45%) |
| var_size_expr_evaluator | q06 | 1343.76 | 1353.49 | -9.73 (-0.72%) |
| var_size_seq_scan | q19 | 1430.53 | 1442.15 | -11.62 (-0.81%) |
| var_size_seq_scan | q20 | 3037.97 | 3042.38 | -4.41 (-0.14%) |
| var_size_seq_scan | q21 | 2090.64 | 2339.69 | -249.05 (-10.64%) |
| var_size_seq_scan | q22 | 130.44 | 127.52 | 2.92 (2.29%) |
