Conversation

Contributor

@pulpdrew pulpdrew commented Oct 21, 2025

Closes HDX-2623

Summary

This change improves the performance of getKeyValues when getting values of a JSON key.

Generally, the query planner prunes columns that are not referenced outside of a CTE. For JSON columns, however, if the outer SELECT references a single field of a JSON column, the inner SELECT appears to read the entire JSON object.

This PR also adds integration tests for getKeyValues to ensure that the function generates queries that work as expected in ClickHouse.

Performance impact (on single JSON Dashboard Filter)

  • Original: 15.03s
  • Optimized: 0.443s
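
As a quick sanity check on the two timings above, the arithmetic can be sketched as follows. This only restates the numbers quoted in the description; they come from a single dashboard filter, so the result should be read as indicative rather than as a benchmark.

```typescript
// Timings quoted in the PR description, in seconds.
const originalSeconds = 15.03;
const optimizedSeconds = 0.443;

// Speedup factor: how many times faster the optimized query ran.
const speedup = originalSeconds / optimizedSeconds;

// Relative latency reduction, as a percentage of the original time.
const reductionPct = (1 - optimizedSeconds / originalSeconds) * 100;

console.log(`speedup: ${speedup.toFixed(1)}x`);
console.log(`latency reduction: ${reductionPct.toFixed(1)}%`);
```

That works out to roughly a 34x speedup, or about a 97% reduction in query time.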

changeset-bot bot commented Oct 21, 2025

🦋 Changeset detected

Latest commit: 4945803

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@hyperdx/common-utils Patch

vercel bot commented Oct 21, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
hyperdx-v2-oss-app Ready Preview Comment Oct 27, 2025 4:46pm

claude bot commented Oct 21, 2025

PR Review: Performance Optimization for getKeyValues

Summary

This PR significantly improves query performance (from 15.03s to 0.443s, roughly a 97% reduction) by optimizing how JSON fields are selected in ClickHouse queries. The optimization is well-implemented with comprehensive integration tests.

No critical issues found.

What Was Changed

  • Core optimization (metadata.ts:683-688): Changed from selecting all columns to selecting only requested keys in the CTE, avoiding full JSON column reads
  • Integration tests added: Comprehensive test suite covering JSON columns, materialized columns, and edge cases
  • Unit tests cleaned up: Removed obsolete tests for old column expansion logic
  • CI infrastructure: Added integration test support for common-utils package

Code Quality Observations

Strengths:

  • Excellent test coverage with real ClickHouse integration tests
  • Clear comments explaining the performance rationale
  • Proper separation of unit vs integration tests
  • Early return for empty keys list prevents unnecessary queries

Minor Suggestions (non-blocking):

  1. metadata.ts:688 - The fallback to || '*' appears unreachable since keys.length === 0 returns early (line 671). Consider removing the fallback.

  2. Integration test cleanup - metadata.int.test.ts:79: Consider awaiting client.close() in an afterAll at the outer describe level to ensure proper cleanup even if tests fail.

Testing Notes

  • Integration tests properly handle both disableRowLimit modes
  • Test data includes edge cases (empty strings, missing fields, empty JSON objects)
  • Proper test isolation with table creation/teardown

Recommendation: Approve and merge

The performance improvement is substantial, the implementation is solid, and the test coverage is excellent. The minor suggestions above are optional refinements that do not block merging.

Contributor

github-actions bot commented Oct 21, 2025

E2E Test Results

All tests passed • 26 passed • 3 skipped • 197s

Status Count
✅ Passed 26
❌ Failed 0
⚠️ Flaky 0
⏭️ Skipped 3

const selectClause = keys
  .map((k, i) => `groupUniqArray(${limit})(${k}) AS param${i}`)
  .join(', ');
if (keys.length === 0) return [];
Contributor Author

All the functional changes are in this file.

This check was added because the query previously generated an empty select clause when no keys were provided, which resulted in a query error (e.g. SELECT FROM table...).
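
To illustrate the guard, here is a minimal sketch of the idea. `buildAggregateSelect` is a hypothetical helper name used only for this example, not the function in metadata.ts, and it returns `null` instead of an empty result array so that it stays self-contained.

```typescript
// Hypothetical helper for illustration; not the actual code in metadata.ts.
function buildAggregateSelect(keys: string[], limit: number): string | null {
  // With no keys, the select clause would be empty and the generated SQL
  // would look like "SELECT FROM table", which is invalid. Bail out early
  // so the caller can skip the query entirely.
  if (keys.length === 0) return null;

  // One aggregate per key, aliased param0, param1, ...
  return keys
    .map((k, i) => `groupUniqArray(${limit})(${k}) AS param${i}`)
    .join(', ');
}

console.log(buildAggregateSelect([], 20));
console.log(buildAggregateSelect(['stringCol', 'jsonCol.nested.field'], 20));
```

Note that the sketch interpolates keys into the SQL string without any quoting or validation; that is an assumption made for brevity, not a recommendation.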

@pulpdrew pulpdrew force-pushed the drew/optimize-filter-sampling branch from 3297abd to 23f37db Compare October 23, 2025 09:16
@pulpdrew pulpdrew marked this pull request as ready for review October 23, 2025 09:16
@pulpdrew pulpdrew requested review from a team and teeohhem and removed request for a team October 27, 2025 12:28
Comment on lines +48 to +53
.PHONY: dev-int-common-utils
dev-int-common-utils:
	docker compose -p int -f ./docker-compose.ci.yml up -d
	npx nx run @hyperdx/common-utils:dev:int $(FILE)
	docker compose -p int -f ./docker-compose.ci.yml down

Contributor Author

@teeohhem You mentioned that our filter queries are fragile. This PR adds integration tests so that we can test the generated queries against real ClickHouse data, which should help reduce some of that fragility. I think it would be great to extend these tests to cover more of the query generation code in common-utils in the future (e.g. all of renderChartConfig).

  databaseName: chartConfig.from.databaseName,
  tableName: chartConfig.from.tableName,
  connectionId: chartConfig.connection,
});
Contributor

What's the reasoning behind this change (just so I understand)?

Contributor Author

Consider a case where we are trying to get filter values for stringCol and jsonCol.nested.field.

Before these changes, the query would have been:

WITH sampledData AS (
  SELECT
    `stringCol`,
    `jsonCol`, -- This is bad for performance
    ... every other column in the table
  FROM table
  ...sampling condition
)

SELECT
  groupUniqArray(20)(stringCol) as param0,
  groupUniqArray(20)(jsonCol.nested.field) as param1
  -- None of the other columns are used out here, so they don't need to be selected in the CTE
FROM sampledData

There's no need to select every other column in the table in the CTE, and selecting an entire JSON column instead of just the sub-column or path we need is bad for performance.

So now with this change we do:

WITH sampledData AS (
  SELECT
    stringCol as param0,
    jsonCol.nested.field as param1 -- This is better for performance
  FROM table
  ...sampling condition
)

SELECT
  groupUniqArray(20)(param0) as param0,
  groupUniqArray(20)(param1) as param1
FROM sampledData
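
For illustration, the shape of the optimized query can be generated with something like the sketch below. Everything here is an assumption made for the example: `buildKeyValuesQuery`, its input type, and the `sampleClause` placeholder are hypothetical names, the sampling condition elided above is not reproduced, and identifiers are interpolated without quoting or validation.

```typescript
// Hypothetical input shape for this sketch; not the real metadata.ts types.
interface KeyValuesQueryInput {
  table: string;
  keys: string[]; // column names or JSON paths, e.g. "jsonCol.nested.field"
  limit: number; // max distinct values collected per key
  sampleClause?: string; // stand-in for the elided sampling condition
}

// Hypothetical builder that mirrors the "after" query shown above.
function buildKeyValuesQuery(input: KeyValuesQueryInput): string | null {
  const { table, keys, limit, sampleClause = '' } = input;

  // No keys means an empty select clause, so skip the query.
  if (keys.length === 0) return null;

  // Inner select: project only the requested keys, each under a stable alias,
  // so the CTE never reads a whole JSON column or unrelated columns.
  const innerSelect = keys.map((k, i) => `${k} AS param${i}`).join(',\n    ');

  // Outer select: aggregate by alias only.
  const outerSelect = keys
    .map((_, i) => `groupUniqArray(${limit})(param${i}) AS param${i}`)
    .join(',\n  ');

  return [
    'WITH sampledData AS (',
    '  SELECT',
    `    ${innerSelect}`,
    `  FROM ${table}`,
    ...(sampleClause ? [`  ${sampleClause}`] : []),
    ')',
    'SELECT',
    `  ${outerSelect}`,
    'FROM sampledData',
  ].join('\n');
}

const exampleQuery = buildKeyValuesQuery({
  table: 'logs', // placeholder table name
  keys: ['stringCol', 'jsonCol.nested.field'],
  limit: 20,
});
console.log(exampleQuery);
```

The point of aliasing in the inner select is that the outer select refers only to `param0`, `param1`, and so on, so the JSON path is resolved once, inside the CTE.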

Contributor

Great info, thanks! This seems important enough to document in a code comment (and the comments below should be removed):

// Build select expression that includes all columns by name
// This ensures materialized columns are included

Contributor Author

Thanks for pointing that out! The comments have been fixed.

@teeohhem teeohhem self-requested a review October 27, 2025 16:22
@kodiakhq kodiakhq bot merged commit 8190ee8 into main Oct 27, 2025
8 of 9 checks passed
@kodiakhq kodiakhq bot deleted the drew/optimize-filter-sampling branch October 27, 2025 16:46
3 participants