feat: enable dataset stats collection and retrieval #1529
Conversation
Server always collects field-level stats. Adds an endpoint to fetch stats for field(s) along with offset and limit.
Walkthrough
Adds a dataset statistics API (types, SQL builder, handler, and routing), introduces …
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Client
    participant Handler as Handler\n(get_dataset_stats)
    participant SQLBuilder as SQL Builder\n(build_stats_sql)
    participant Query as Query Engine
    participant Transform as Transform\n(transform_query_results)
    participant Response as JSON Response

    Client->>Handler: POST /dataset_stats\n(DataSetStatsRequest)
    Handler->>SQLBuilder: build_stats_sql(dataset, fields, offset, limit)
    SQLBuilder-->>Handler: SQL string
    Handler->>Query: Execute SQL
    Query-->>Handler: Rows (QueryRow[])
    Handler->>Transform: transform_query_results(rows)
    Transform-->>Handler: HashMap<String, FieldStats>
    Handler->>Response: Serialize to JSON
    Response-->>Client: FieldStats JSON (ordered distinct values)
```
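The diagram implies roughly the following request and response shapes. This is a minimal sketch inferred from the flow above; the struct fields and types are assumptions, not the actual definitions in the PR.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical request body for POST /dataset_stats, mirroring the
/// build_stats_sql(dataset, fields, offset, limit) call in the diagram.
#[derive(Deserialize)]
struct DataSetStatsRequest {
    dataset: String,
    /// Optional field filter; omitted means all fields.
    fields: Option<Vec<String>>,
    offset: Option<u64>,
    limit: Option<u64>,
}

/// Hypothetical per-field entry; the handler returns a map of field name -> FieldStats.
#[derive(Serialize)]
struct FieldStats {
    count: u64,
    distinct_count: u64,
    /// Distinct values ordered by frequency, per the diagram's note.
    distinct_values: Vec<DistinctValue>,
}

#[derive(Serialize)]
struct DistinctValue {
    value: String,
    count: u64,
}
```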
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around line 614-621: The aggregation in the CTE field_totals is using
SUM(DISTINCT field_stats_count) which undercounts when multiple parquet files
produce identical counts; change the aggregation to SUM(field_stats_count) in
the query that references DATASET_STATS_STREAM_NAME and dataset_name so
total_field_count correctly sums all field_stats_count rows (look for the CTE
named field_totals and the SUM(DISTINCT field_stats_count) expression and
replace it with SUM(field_stats_count)).
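For illustration, a hedged sketch of the kind of change the prompt describes; the CTE, column, and stream names are taken from the review text, not from the actual `build_stats_sql`.

```rust
// Illustrative only: not the actual build_stats_sql. CTE and column names
// (field_totals, field_stats_count) are taken from the review comment.
fn field_totals_cte(dataset_name: &str) -> String {
    format!(
        r#"field_totals AS (
    SELECT field_name,
           -- SUM over every row: SUM(DISTINCT field_stats_count) collapses
           -- identical per-file counts and undercounts the total.
           SUM(field_stats_count) AS total_field_count
    FROM "pstats"
    WHERE dataset_name = '{dataset_name}'
    GROUP BY field_name
)"#
    )
}
```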
🧹 Nitpick comments (3)
src/handlers/http/health_check.rs (1)
65-70: Double sync for stats data flush is intentional but warrants clarification. The first `perform_sync_operations()` call triggers field stats calculation during parquet upload. The generated stats are buffered in the `pstats` stream. The second call is necessary to flush and upload this stats data. The comment could be more explicit about this two-phase dependency.

📝 Consider clarifying the comment:

```diff
 // Perform sync operations
 perform_sync_operations().await;
-// This is to ensure that all stats data is synced before the server shuts down
+// Second sync required: the first sync generates field stats during parquet uploads,
+// which are buffered in the pstats stream. This second sync flushes those stats.
 perform_sync_operations().await;
```

src/handlers/http/modal/server.rs (1)
731-735: Consider renaming for consistency with return type.
`get_dataset_stats_webscope()` returns `Resource`, but the `*_webscope` naming convention in this file typically returns `Scope` (e.g., `get_demo_data_webscope()`, `get_metrics_webscope()`). Consider renaming to `get_dataset_stats_factory()` to match the pattern used by other `Resource`-returning methods like `get_query_factory()` and `get_about_factory()`.
537-547: Consider preserving error context in conversions. The error handling chain loses context by using generic error messages. Consider including the original error details for better debugging.

♻️ Suggested improvement:

```diff
 let body_bytes = response.into_body().try_into_bytes().map_err(|_| {
-    QueryError::CustomError("error in converting response to bytes".to_string())
+    QueryError::CustomError("failed to extract response body bytes".to_string())
 })?;
-let body_str = std::str::from_utf8(&body_bytes).map_err(|_| {
-    QueryError::CustomError("error in converting response bytes to string".to_string())
+let body_str = std::str::from_utf8(&body_bytes).map_err(|e| {
+    QueryError::CustomError(format!("invalid UTF-8 in response: {e}"))
 })?;
 let rows: Vec<QueryRow> =
-    serde_json::from_str(body_str).map_err(|e| QueryError::CustomError(e.to_string()))?;
+    serde_json::from_str(body_str).map_err(|e| {
+        QueryError::CustomError(format!("failed to parse query response: {e}"))
+    })?;
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around line 622-636: The CTE ranked_values uses
MAX(field_stats_distinct_count) which yields the largest per-file distinct_count
instead of the true distinct count across files; update the aggregation to
compute the true distinct count by replacing MAX(field_stats_distinct_count)
with COUNT(DISTINCT field_stats_distinct_value) (or by deriving
distinct_count from the grouped distinct_value rows) and ensure the alias
field_stats_distinct_count or distinct_count reflects that change; if you
intentionally want the per-file max as an approximation, add a clear
comment/docstring near the ranked_values CTE explaining the limitation instead.
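A hedged sketch of the alternative aggregation, assuming the column names quoted in the prompt; this is not the actual query in the PR.

```rust
// Illustrative only: CTE and column names follow the review comment, not the file.
// One way to get the true cross-file distinct count is a dedicated CTE that counts
// the grouped distinct values per field, instead of taking MAX of per-file counts.
const DISTINCT_COUNTS_CTE: &str = r#"
distinct_counts AS (
    SELECT field_name,
           COUNT(DISTINCT field_stats_distinct_value) AS distinct_count
    FROM "pstats"
    GROUP BY field_name
)"#;
```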
🧹 Nitpick comments (2)
src/storage/field_stats.rs (2)
508-553: Consider direct query execution to avoid double serialization. The current implementation serializes the query result to JSON (in `query()`), converts to bytes, then deserializes back into `QueryRow`. While functional, this adds unnecessary overhead. Consider using the query session directly to execute the SQL and process `RecordBatch` results, similar to how `calculate_single_field_stats` works. This would avoid the intermediate JSON round-trip.
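A rough sketch of what direct execution could look like, assuming the query session exposes a DataFusion `SessionContext` (the project's actual query API may differ):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::prelude::SessionContext;

// Hypothetical helper: run the stats SQL on a DataFusion session and work with
// RecordBatches directly, skipping the JSON round-trip through the HTTP query path.
async fn run_stats_sql(
    ctx: &SessionContext,
    sql: &str,
) -> Result<Vec<RecordBatch>, DataFusionError> {
    ctx.sql(sql).await?.collect().await
}
```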
650-651: Missing tests for new API functions. The new functions `get_dataset_stats`, `transform_query_results`, and `build_stats_sql` lack unit tests. Consider adding tests for:

- `build_stats_sql`: SQL generation with various field filters, pagination values, and special characters
- `transform_query_results`: aggregation logic, sorting behavior, empty input handling

Would you like me to generate unit tests for these functions?
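As a starting point, a hedged sketch of such tests; the signatures of `build_stats_sql` and `transform_query_results` are assumptions based on the sequence diagram and would need to match the real code.

```rust
#[cfg(test)]
mod stats_sql_tests {
    use super::*;

    // Assumed signature: build_stats_sql(dataset, fields, offset, limit) -> String,
    // per the call shown in the sequence diagram; adjust to the real signature.
    #[test]
    fn build_stats_sql_embeds_dataset_fields_and_pagination() {
        let sql = build_stats_sql("web_logs", &["status".to_string()], 0, 50);
        assert!(sql.contains("web_logs"));
        assert!(sql.contains("status"));
        assert!(sql.contains("50"));
    }

    // Assumed behavior: no input rows yields an empty map rather than an error.
    #[test]
    fn transform_query_results_handles_empty_input() {
        let stats = transform_query_results(Vec::new());
        assert!(stats.is_empty());
    }
}
```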
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around line 607-640: The SQL string injects dataset_name into a literal but
only escapes double quotes; fix by escaping single quotes in dataset_name before
the format! call (e.g., replace each ' with '' or chain the existing replace) so
the formatted SQL uses a safely-escaped dataset_name; update the dataset_name
variable used in the format! invocation (the one right before the big
format!(...) block) to the escaped form to prevent SQL injection when building
the query.
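A minimal sketch of the escaping the prompt asks for; the function name here is hypothetical and the surrounding query construction is illustrative only.

```rust
/// Sketch only: escape both quote styles before interpolating dataset_name
/// into a SQL string literal. Parameterized queries would be the stronger fix
/// if the query path supports them.
fn escape_dataset_name(dataset_name: &str) -> String {
    dataset_name
        .replace('"', "\"\"")  // existing double-quote escape noted in the comment
        .replace('\'', "''")   // added: double single quotes for the SQL literal
}
```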
🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)
512-523: Consider clamping `limit` (and possibly `offset`) to a safe upper bound. Unbounded limits can drive heavy scans and large responses for popular datasets. A small guard using an existing config (if available) would protect the endpoint from accidental or abusive large requests.
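For example, a small guard along these lines (the constant and helper are illustrative, not taken from the codebase):

```rust
// Illustrative cap; a real implementation might read this from server config.
const MAX_STATS_LIMIT: u64 = 1_000;

// Clamp caller-supplied pagination values before building the stats SQL.
fn clamp_pagination(offset: Option<u64>, limit: Option<u64>) -> (u64, u64) {
    let limit = limit.unwrap_or(100).min(MAX_STATS_LIMIT);
    let offset = offset.unwrap_or(0);
    (offset, limit)
}
```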