[DataLoader] Add OpenTelemetry metrics support#575
Merged
Conversation
Resolved conflict in pyproject.toml by combining: - Upstream: dynamic versioning (hatch-vcs), license, keywords, twine dev dep - Local: OpenTelemetry dependencies (opentelemetry-api, opentelemetry-sdk)
Resolve conflicts in dataloader between metrics instrumentation and upstream ArrivalOrder / picklable-split changes: - _table_scan_context.py: keep table_name field; extend __reduce__ to include it so the scan context remains picklable for distributed execution. - data_loader_split.py: thread metrics timing/recording around the new ArrowScan.to_record_batches(order=ArrivalOrder(...)) call. - pyproject.toml: union both dep sets (pyiceberg fork, requests, tenacity, opentelemetry-api; mypy, responses, types-requests, opentelemetry-sdk in dev). - data_loader.py: pass table_name=str(self._table_id) when building the scan context so metric attributes are populated. - metrics/_split.py: add Iterator[None] return annotation to satisfy mypy (now wired into make check on main). - tests/test_metrics.py: drop the unsupported SELECT-1 Substrait plan and construct DataLoaderSplit with the no-plan path; the metrics tests don't exercise plan execution.
Reconcile OTEL metrics work with upstream's recent dataloader refactor
(catalog rewrite, scan optimizer, transform_sql split, JVM worker args).
Conflict resolutions in integrations/python/dataloader:
- _table_scan_context.py: drop the metrics-only `table_name` field;
use upstream's `table_id: TableIdentifier` (which already carries
the fully-qualified name) plus `worker_jvm_args`. __reduce__ now
pickles the upstream shape.
- data_loader.py: take upstream's TableScanContext construction
(passes table_id and worker_jvm_args) unchanged.
- data_loader_split.py: rebuild on top of upstream's multi-file split
(file_scan_tasks, transform_sql, _TimedBatchIter, transform session).
Layer OTEL instrumentation around it:
* record_bytes_read with the sum across all tasks
* timed_split_iteration wraps the whole __iter__
* record_batch per yielded batch, on both the passthrough and
transform paths
* `table` attribute is str(ctx.table_id)
- pyproject.toml: take upstream's deps (datafusion 53.0.0, li-pyiceberg
0.11.5, sqlglot, jfrog index for li-pyiceberg) and add
opentelemetry-api (runtime) and opentelemetry-sdk (dev). Drop the
pyiceberg git+ source.
- uv.lock: rebased on upstream's lock so sqlglot stays pinned at 29.0.1
(sqlglot 30.x changed Expr/Expression typing and breaks
scan_optimizer.py mypy). OTEL packages added via targeted upgrade.
- tests/test_metrics.py: updated `_make_split` to construct
TableScanContext with `table_id=TableIdentifier(...)` and
DataLoaderSplit with `file_scan_tasks=[task]` to match the new
signatures. `test_attributes_include_table_name` now asserts
`attrs["table"] == str(table_id)`.
Verified: ruff check, ruff format, mypy, and 269 pytest cases all pass.
Drop the SplitMetrics instruments and their integration in DataLoaderSplit so this PR introduces only the metrics package scaffolding (DataLoaderMetrics container + get_metrics accessor + opentelemetry-api dependency). Specific instrument groups will be added in follow-up PRs.
robreeves
reviewed
May 11, 2026
Address PR linkedin#575 review feedback: - Remove the DataLoaderMetrics/get_metrics passthrough — OTEL's MeterProvider.get_meter() already memoizes by name. Keep only METER_NAME so future instrument groups call get_meter(METER_NAME) directly. - Bump opentelemetry-api and opentelemetry-sdk minimums from 1.20.0 to 1.41.1 (current latest, matches what's installed).
robreeves
approved these changes
May 12, 2026
17 tasks
maluchari
added a commit
that referenced
this pull request
May 29, 2026
Reverts the 17 commits that landed on main after v0.5.417, bringing the tree back to exactly the v0.5.417 state. Squashed into a single revert commit for reviewability and to allow reinstating everything as one unit (revert this commit to bring all 17 changes back). Reverted commits (v0.5.417..main, newest first): - Revert #579 (HTS fields in table list api) (#610) - feat(optimizer): [3/N] Analyzer (#533) - [DataLoader] Handle Cast(Literal, TIMESTAMP/DATE/TIME) in scan optimizer (#569 follow-up) (#583) - Skip metadata.json parse in drop path (#589) - feat(optimizer): [2/N] Optimizer REST Service and Controller (#531) - [BDP-102028] feat(optimizer): [1/N] Optimizer Database (#530) - [RTAS]: Fix bug - remove fs scheme from tableLocation in commit (cont) (#594) - Trigger ELR process (#593) - [BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model (#527) - Fail retention app when the columnPattern mismatch partition spec (#552) - [DataLoader] Drop OpenTelemetry minimum version to 1.38.0 (#590) - [DataLoader] Emit OpenTelemetry metrics for read operations (#582) - Cache iceberg metadata to reduce redundant requests to storage (#509) - bump iceberg 1.2 version to 1.2.0.17 (#587) - Support returning HTS fields in table list api (#579) - [DataLoader] Add unique id property to OpenHouseDataLoader (#580) - [DataLoader] Add OpenTelemetry metrics support (#575) ## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> [Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly discuss the summary of the changes made in this pull request in 2-3 lines. ## Changes - [ ] Client-facing API Changes - [ ] Internal API Changes - [ ] Bug Fixes - [ ] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [ ] Manually Tested on local docker setup. Please include commands ran, and their output. - [ ] Added new tests for the changes made. - [ ] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Introduces the OpenTelemetry metrics scaffolding for the Python dataloader. No instruments are emitted yet — this PR only adds the plumbing so future PRs can register metric groups without touching dependency, wiring, or test-harness code.
Changes
openhouse.dataloader.metricspackage withDataLoaderMetrics(holds a shared OTELMeter) andget_metrics()(returns the default or a custom-provider instance).opentelemetry-apias a runtime dependency (no-op until an SDK is configured by the consumer) andopentelemetry-sdkas a dev dependency.Testing Done
Additional Information