Skip to content

[DataLoader] Add OpenTelemetry metrics support#575

Merged
ShreyeshArangath merged 8 commits into
linkedin:mainfrom
ShreyeshArangath:feat/metrics
May 12, 2026
Merged

[DataLoader] Add OpenTelemetry metrics support#575
ShreyeshArangath merged 8 commits into
linkedin:mainfrom
ShreyeshArangath:feat/metrics

Conversation

@ShreyeshArangath
Copy link
Copy Markdown
Collaborator

@ShreyeshArangath ShreyeshArangath commented May 11, 2026

Summary

Introduces the OpenTelemetry metrics scaffolding for the Python dataloader. No instruments are emitted yet — this PR only adds the plumbing so future PRs can register metric groups without touching dependency, wiring, or test-harness code.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
    • Adds openhouse.dataloader.metrics package with DataLoaderMetrics (holds a shared OTEL Meter) and get_metrics() (returns the default or a custom-provider instance).
    • Adds opentelemetry-api as a runtime dependency (no-op until an SDK is configured by the consumer) and opentelemetry-sdk as a dev dependency.
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.
    • Follow-up PRs will introduce concrete instrument groups (e.g. split iteration metrics) on top of this scaffolding.

Resolved conflict in pyproject.toml by combining:
- Upstream: dynamic versioning (hatch-vcs), license, keywords, twine dev dep
- Local: OpenTelemetry dependencies (opentelemetry-api, opentelemetry-sdk)
Resolve conflicts in dataloader between metrics instrumentation and
upstream ArrivalOrder / picklable-split changes:

- _table_scan_context.py: keep table_name field; extend __reduce__ to
  include it so the scan context remains picklable for distributed
  execution.
- data_loader_split.py: thread metrics timing/recording around the new
  ArrowScan.to_record_batches(order=ArrivalOrder(...)) call.
- pyproject.toml: union both dep sets (pyiceberg fork, requests,
  tenacity, opentelemetry-api; mypy, responses, types-requests,
  opentelemetry-sdk in dev).
- data_loader.py: pass table_name=str(self._table_id) when building the
  scan context so metric attributes are populated.
- metrics/_split.py: add Iterator[None] return annotation to satisfy
  mypy (now wired into make check on main).
- tests/test_metrics.py: drop the unsupported SELECT-1 Substrait plan
  and construct DataLoaderSplit with the no-plan path; the metrics
  tests don't exercise plan execution.
Reconcile OTEL metrics work with upstream's recent dataloader refactor
(catalog rewrite, scan optimizer, transform_sql split, JVM worker args).

Conflict resolutions in integrations/python/dataloader:

- _table_scan_context.py: drop the metrics-only `table_name` field;
  use upstream's `table_id: TableIdentifier` (which already carries
  the fully-qualified name) plus `worker_jvm_args`. __reduce__ now
  pickles the upstream shape.

- data_loader.py: take upstream's TableScanContext construction
  (passes table_id and worker_jvm_args) unchanged.

- data_loader_split.py: rebuild on top of upstream's multi-file split
  (file_scan_tasks, transform_sql, _TimedBatchIter, transform session).
  Layer OTEL instrumentation around it:
    * record_bytes_read with the sum across all tasks
    * timed_split_iteration wraps the whole __iter__
    * record_batch per yielded batch, on both the passthrough and
      transform paths
    * `table` attribute is str(ctx.table_id)

- pyproject.toml: take upstream's deps (datafusion 53.0.0, li-pyiceberg
  0.11.5, sqlglot, jfrog index for li-pyiceberg) and add
  opentelemetry-api (runtime) and opentelemetry-sdk (dev). Drop the
  pyiceberg git+ source.

- uv.lock: rebased on upstream's lock so sqlglot stays pinned at 29.0.1
  (sqlglot 30.x changed Expr/Expression typing and breaks
  scan_optimizer.py mypy). OTEL packages added via targeted upgrade.

- tests/test_metrics.py: updated `_make_split` to construct
  TableScanContext with `table_id=TableIdentifier(...)` and
  DataLoaderSplit with `file_scan_tasks=[task]` to match the new
  signatures. `test_attributes_include_table_name` now asserts
  `attrs["table"] == str(table_id)`.

Verified: ruff check, ruff format, mypy, and 269 pytest cases all pass.
Drop the SplitMetrics instruments and their integration in
DataLoaderSplit so this PR introduces only the metrics package
scaffolding (DataLoaderMetrics container + get_metrics accessor +
opentelemetry-api dependency). Specific instrument groups will be
added in follow-up PRs.
@ShreyeshArangath ShreyeshArangath marked this pull request as ready for review May 11, 2026 16:38
Comment thread integrations/python/dataloader/pyproject.toml Outdated
Comment thread integrations/python/dataloader/src/openhouse/dataloader/metrics/__init__.py Outdated
Address PR linkedin#575 review feedback:
- Remove the DataLoaderMetrics/get_metrics passthrough — OTEL's
  MeterProvider.get_meter() already memoizes by name. Keep only
  METER_NAME so future instrument groups call get_meter(METER_NAME)
  directly.
- Bump opentelemetry-api and opentelemetry-sdk minimums from 1.20.0
  to 1.41.1 (current latest, matches what's installed).
@ShreyeshArangath ShreyeshArangath changed the title [DataLoader] Add OpenTelemetry metrics infrastructure [DataLoader] Add OpenTelemetry metrics support May 12, 2026
@ShreyeshArangath ShreyeshArangath merged commit 9392027 into linkedin:main May 12, 2026
2 checks passed
maluchari added a commit that referenced this pull request May 29, 2026
Reverts the 17 commits that landed on main after v0.5.417, bringing the
tree back to exactly the v0.5.417 state. Squashed into a single revert
commit for reviewability and to allow reinstating everything as one unit
(revert this commit to bring all 17 changes back).

Reverted commits (v0.5.417..main, newest first):
- Revert #579 (HTS fields in table list api) (#610)
- feat(optimizer): [3/N] Analyzer (#533)
- [DataLoader] Handle Cast(Literal, TIMESTAMP/DATE/TIME) in scan
optimizer (#569 follow-up) (#583)
- Skip metadata.json parse in drop path (#589)
- feat(optimizer): [2/N] Optimizer REST Service and Controller (#531)
- [BDP-102028] feat(optimizer): [1/N] Optimizer Database (#530)
- [RTAS]: Fix bug - remove fs scheme from tableLocation in commit (cont)
(#594)
- Trigger ELR process (#593)
- [BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model
(#527)
- Fail retention app when the columnPattern mismatch partition spec
(#552)
- [DataLoader] Drop OpenTelemetry minimum version to 1.38.0 (#590)
- [DataLoader] Emit OpenTelemetry metrics for read operations (#582)
- Cache iceberg metadata to reduce redundant requests to storage (#509)
- bump iceberg 1.2 version to 1.2.0.17 (#587)
- Support returning HTS fields in table list api (#579)
- [DataLoader] Add unique id property to OpenHouseDataLoader (#580)
- [DataLoader] Add OpenTelemetry metrics support (#575)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants