-
Notifications
You must be signed in to change notification settings - Fork 323
perf: Improve getKeyValues query performance for JSON keys #1284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🦋 Changeset detectedLatest commit: 4945803 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
PR Review: Performance Optimization for getKeyValuesSummaryThis PR significantly improves query performance (from 15s to 0.443s - 97% improvement) by optimizing how JSON fields are selected in ClickHouse queries. The optimization is well-implemented with comprehensive integration tests. ✅ No critical issues found. What Was Changed
Code Quality Observations✅ Strengths:
Minor Suggestions (non-blocking):
Testing Notes
Recommendation: Approve and merge ✅ The performance improvement is substantial, the implementation is solid, and the test coverage is excellent. The minor suggestions above are optional refinements that do not block merging. |
E2E Test Results✅ All tests passed • 26 passed • 3 skipped • 197s
|
d2aaa67 to
b332faa
Compare
b332faa to
4c54135
Compare
4c54135 to
5c6327b
Compare
5c6327b to
1d8c5f0
Compare
1d8c5f0 to
3297abd
Compare
| const selectClause = keys | ||
| .map((k, i) => `groupUniqArray(${limit})(${k}) AS param${i}`) | ||
| .join(', '); | ||
| if (keys.length === 0) return []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the functional changes are in this file.
This check was added because previously, the query would generate an empty select clause when no keys were provided, resulting in a query error. (eg. SELECT FROM table...)
3297abd to
23f37db
Compare
| .PHONY: dev-int-common-utils | ||
| dev-int-common-utils: | ||
| docker compose -p int -f ./docker-compose.ci.yml up -d | ||
| npx nx run @hyperdx/common-utils:dev:int $(FILE) | ||
| docker compose -p int -f ./docker-compose.ci.yml down | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@teeohhem You mentioned our filter queries are fragile - this PR adds integration tests so that we can actually test the queries against real ClickHouse data. Hopefully that will help reduce some of the fragility. I think it would be great if we could extend these tests to cover more of our query generation code from common-utils in the future (eg. all of renderChartConfig).
| databaseName: chartConfig.from.databaseName, | ||
| tableName: chartConfig.from.tableName, | ||
| connectionId: chartConfig.connection, | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning behind this change (just so I understand)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider a case where we are trying to get filter values for stringCol and jsonCol.nested.field.
Before these changes, the query would have been:
WITH sampledData AS (
SELECT
`stringCol`,
`jsonCol`, -- This is bad for performance
... every other column in the table
FROM table
...sampling condition
)
SELECT
groupUniqArray(20)(stringCol) as param0,
groupUniqArray(20)(jsonCol.nested.field) as param1
-- None of the other columns are used out here, so they don't need to be selected in the CTE
FROM sampledData
There's no need to select ... every other column in the table in the CTE, and selecting an entire JSON column instead of just the sub-column / path we need is bad for performance.
So now with this change we do:
WITH sampledData AS (
SELECT
stringCol as param0,
jsonCol.nested.field as param1 -- This is better for performance
FROM table
...sampling condition
)
SELECT
groupUniqArray(20)(param0) as param0,
groupUniqArray(20)(param1) as param1
FROM sampledData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great info! thanks! This seems like an important thing to comment in the code (and also remove the comments below)
// Build select expression that includes all columns by name
// This ensures materialized columns are included
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out! The comments have been fixed.
Closes HDX-2623
Summary
This change improves the performance of
getKeyValueswhen getting values of a JSON key.Generally, columns that are not referenced outside of a CTE will be pruned by the query planner. For JSON however, if the outer select references one field in a JSON column, then the inner select will read (it seems) the entire JSON object.
This PR also adds integration tests for
getKeyValuesto ensure that the function generates queries that work as expected in ClickHouse.Performance impact (on single JSON Dashboard Filter)