Add JoinExporter: cross-collection hash-join export with PII redaction and memory budget#4297

Merged
makr-code merged 5 commits into develop from
copilot/v180-huggingface-hub-client-back-off
Mar 16, 2026
Copilot AI commented Mar 16, 2026

Implements JoinExporter, a new exporter that performs an in-memory hash-join across two collections and writes the merged result as JSONL. The previous agent session created the core implementation files but was interrupted before adding the CI workflow; this PR completes that work.

Description

Core Implementation

  • include/exporters/join_exporter.h — JoinExportConfig (join keys, predicate, field selection, PII config, 1 GiB right-side memory limit) + JoinExporter class extending IExporter
  • src/exporters/join_exporter.cpp — hash-join: right side loaded into unordered_map keyed on right_key_field; AQL predicate evaluated on merged record; output_fields with src:alias renaming and left.<f>/right.<f> qualifiers; PII detect/redact on serialised JSON; memory budget aborts on ERR_EXPORT_JOIN_MEMORY_LIMIT
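
The build/probe structure described above can be sketched roughly as follows. This is a minimal illustration and not the code from join_exporter.cpp: the `Record` alias, `approxBytes`, and `hashJoin` are hypothetical names, records are simplified to flat string maps, the sketch uses `std::unordered_multimap` so that duplicate right-side keys are kept (the PR only says `unordered_map`), and the AQL predicate and PII steps are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical record type: a flat map of field name -> string value.
using Record = std::map<std::string, std::string>;

// Rough per-record footprint (assumption: key and value bytes only).
std::size_t approxBytes(const Record& r) {
    std::size_t n = 0;
    for (const auto& [k, v] : r) n += k.size() + v.size();
    return n;
}

// Inner hash-join: build an index on the right side, probe with the left side.
std::vector<Record> hashJoin(const std::vector<Record>& left,
                             const std::vector<Record>& right,
                             const std::string& left_key,
                             const std::string& right_key,
                             std::size_t memory_limit_bytes) {
    assert(!left_key.empty() && !right_key.empty());

    // Build phase: index right-side rows by join key and track the budget.
    std::unordered_multimap<std::string, const Record*> index;
    std::size_t used = 0;
    for (const auto& r : right) {
        auto it = r.find(right_key);
        if (it == r.end()) continue;  // rows without the key can never match
        used += approxBytes(r);
        if (used > memory_limit_bytes)
            throw std::runtime_error("ERR_EXPORT_JOIN_MEMORY_LIMIT");
        index.emplace(it->second, &r);
    }

    // Probe phase: emit one merged record per matching (left, right) pair.
    // Unmatched left rows are skipped, which gives inner-join semantics.
    std::vector<Record> out;
    for (const auto& l : left) {
        auto it = l.find(left_key);
        if (it == l.end()) continue;
        auto [lo, hi] = index.equal_range(it->second);
        for (auto m = lo; m != hi; ++m) {
            Record merged;
            for (const auto& [k, v] : l) merged["left." + k] = v;
            for (const auto& [k, v] : *m->second) merged["right." + k] = v;
            out.push_back(std::move(merged));
        }
    }
    return out;
}
```

Building on the right side and streaming the left side keeps peak memory proportional to the right collection only, which is why the budget is checked during the build phase.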

Tests (tests/exporters/test_join_exporter.cpp)

24 tests across 4 suites covering all ACs:

  • Basic inner join, unmatched-row skipping
  • output_fields aliasing and left./right. qualifiers
  • AQL join_predicate filtering
  • Error paths: empty collection names, invalid predicate, ambiguous fields, memory limit exceeded
  • PII detection/redaction and fail_on_pii enforcement
  • Throughput ≥ 50 000 merged docs/sec
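
For the aliasing and qualifier cases listed above, one way to interpret an output_fields entry is sketched below. The names `Side`, `FieldSpec`, and `parseFieldSpec` are hypothetical, and defaulting the alias to the bare field name is an assumption, not something the PR states.

```cpp
#include <cassert>
#include <string>

// Which input the field is taken from; Any means no qualifier was given.
enum class Side { Any, Left, Right };

// Hypothetical parsed form of one output_fields entry.
struct FieldSpec {
    Side side = Side::Any;
    std::string field;  // source field name without qualifier
    std::string alias;  // name written to the output record
};

// Parses "src", "src:alias", "left.src", or "right.src:alias".
FieldSpec parseFieldSpec(const std::string& spec) {
    assert(!spec.empty());
    FieldSpec out;
    std::string src = spec;

    // Split "src:alias" on the last ':'.
    const auto colon = spec.rfind(':');
    if (colon != std::string::npos) {
        src = spec.substr(0, colon);
        out.alias = spec.substr(colon + 1);
    }

    // Strip an optional "left." or "right." qualifier.
    if (src.rfind("left.", 0) == 0) {
        out.side = Side::Left;
        src = src.substr(5);
    } else if (src.rfind("right.", 0) == 0) {
        out.side = Side::Right;
        src = src.substr(6);
    }

    out.field = src;
    // Assumption: without an explicit alias, the bare field name is used.
    if (out.alias.empty()) out.alias = out.field;
    return out;
}
```

With this reading, an unqualified name that exists on both sides would be the ambiguous case that the error-path tests exercise.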

CI & Build

  • .github/workflows/join-exporter-ci.yml — matrix over ubuntu-22.04/gcc-12, ubuntu-22.04/clang-15, ubuntu-24.04/gcc-14; builds test_join_exporter_focused and runs JoinExporterFocusedTests via ctest
  • tests/CMakeLists.txt — registers test_join_exporter_focused via add_exporter_focused_test; removes orphaned comment fragment on line 11374

Type of Change

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Other:

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed

📚 Research & Knowledge (if applicable)

  • Is this PR based on scientific paper(s) or best practices?
    • If YES: Research files created in /docs/research/?
    • If YES: Linked in the module README under "Wissenschaftliche Grundlagen"?
    • If YES: Recorded in /docs/research/implementation_influence/?

Relevant sources:

  • Paper:
  • Best Practice:
  • Architecture Decision:

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Documentation updated (if needed)
  • No new warnings introduced
Original prompt

This section details the original issue you should resolve

<issue_title>HuggingFace Hub Client: HTTP 429 Back-Off</issue_title>
<issue_description>### Context

This issue implements the roadmap item 'HuggingFace Hub Client: HTTP 429 Back-Off' for the exporters domain. It is sourced from the consolidated roadmap under 🟡 Medium Priority — Near-term (v1.5.0 – v1.8.0) and targets milestone v1.8.0.

Primary detail section: HuggingFace Hub Client: HTTP Rate Limit Handling (429 Back-Off)

Goal

Deliver the scoped changes for HuggingFace Hub Client: HTTP 429 Back-Off in src/exporters/ and complete the linked detail section in a release-ready state for v1.8.0.

Detailed Scope

HuggingFace Hub Client: HTTP Rate Limit Handling (429 Back-Off)

Priority: Medium
Target Version: v1.8.0

huggingface_hub_client.cpp implements exponential retry back-off for file uploads (line 423) but does not check HTTP 429 (Too Many Requests) or the Retry-After response header before retrying. Retrying immediately on a 429 wastes the retry budget and may result in account throttling.

Implementation Notes:

  • [ ] After each curl response, check HTTP status 429; if present, parse the Retry-After header (seconds or HTTP-date format) and sleep for that duration before retrying.
  • [ ] Cap total sleep from Retry-After at config_.timeout_seconds to prevent indefinite blocking.
  • [ ] Emit an exporters.huggingface.rate_limit_hit metric via ExporterMetrics whenever a 429 is received.
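
The two Retry-After forms named in the notes above (delay in seconds, or an HTTP-date) could be handled along these lines. This is a sketch under assumptions, not the code in huggingface_hub_client.cpp: the function names and signatures are illustrative, only the IMF-fixdate form (for example `Sun, 06 Nov 1994 08:49:37 GMT`) is recognised, the zone is assumed to be GMT, and the current time is passed in as `now_epoch` so the function stays deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdio>
#include <optional>
#include <string>

// Days since 1970-01-01 for a Gregorian date (Howard Hinnant's algorithm).
long long daysFromCivil(long long y, unsigned m, unsigned d) {
    y -= m <= 2;
    const long long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<long long>(doe) - 719468;
}

// Returns the number of seconds to wait, or nullopt if the value is unusable.
std::optional<long long> parseRetryAfterSeconds(const std::string& raw,
                                                long long now_epoch) {
    const auto b = raw.find_first_not_of(" \t\r\n");
    if (b == std::string::npos) return std::nullopt;
    const auto e = raw.find_last_not_of(" \t\r\n");
    const std::string v = raw.substr(b, e - b + 1);

    // Form 1: delay-seconds (digits only). Overlong values are rejected.
    if (std::all_of(v.begin(), v.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; })) {
        if (v.size() > 9) return std::nullopt;
        return std::stoll(v);
    }

    // Form 2: IMF-fixdate, zone assumed to be GMT.
    int day = 0, year = 0, hh = 0, mm = 0, ss = 0;
    char mon[4] = {0};
    if (std::sscanf(v.c_str(), "%*3s, %d %3s %d %d:%d:%d",
                    &day, mon, &year, &hh, &mm, &ss) != 6)
        return std::nullopt;

    const std::string months = "JanFebMarAprMayJunJulAugSepOctNovDec";
    const std::string m(mon);
    const auto pos = m.size() == 3 ? months.find(m) : std::string::npos;
    if (pos == std::string::npos || pos % 3 != 0) return std::nullopt;
    if (year < 1970 || day < 1 || day > 31 || hh < 0 || hh > 23 ||
        mm < 0 || mm > 59 || ss < 0 || ss > 60)
        return std::nullopt;

    const long long when =
        daysFromCivil(year, static_cast<unsigned>(pos / 3 + 1),
                      static_cast<unsigned>(day)) * 86400LL +
        hh * 3600LL + mm * 60LL + ss;
    return std::max(0LL, when - now_epoch);  // dates in the past mean "retry now"
}

// Caps the wait at the configured timeout, as the notes above require.
long long capRetryAfter(long long wait_seconds, long long timeout_seconds) {
    assert(timeout_seconds >= 0);
    return std::clamp(wait_seconds, 0LL, timeout_seconds);
}
```

Passing the clock value in rather than reading it inside the parser makes the date branch easy to unit-test without sleeping or mocking time.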

Priority: Low
Target Version: v2.0.0 (Issue: #1722)

Export a joined view of two or more collections (e.g., documents JOIN annotations) into a single JSONL or Parquet output file. Uses the AQL engine to evaluate the join predicate.

Implementation Notes:

  • Add JoinExportConfig struct with left_collection, right_collection, join_predicate (AQL expression), and output_fields.
  • Implement as a new exporter class JoinExporter that opens two AqlPredicateFilter cursors and merges record batches.
  • PII detection runs on the merged record before serialization.
  • Error cases: collection not found, join predicate parse failure, ambiguous field names (rename via output_fields alias map).

Performance Targets:

  • Join export throughput ≥ 50 000 merged docs/sec (hash-join on in-memory right side ≤ 10 M rows).
  • Memory budget for right-side hash table ≤ 1 GB configurable.

Acceptance Criteria

  • After each curl response, check HTTP status 429; if present, parse the Retry-After header (seconds or HTTP-date format) and sleep for that duration before retrying.
  • Cap total sleep from Retry-After at config_.timeout_seconds to prevent indefinite blocking.
  • Emit an exporters.huggingface.rate_limit_hit metric via ExporterMetrics whenever a 429 is received.
  • Add JoinExportConfig struct with left_collection, right_collection, join_predicate (AQL expression), and output_fields.
  • Implement as a new exporter class JoinExporter that opens two AqlPredicateFilter cursors and merges record batches.
  • PII detection runs on the merged record before serialization.
  • Error cases: collection not found, join predicate parse failure, ambiguous field names (rename via output_fields alias map).
  • Join export throughput ≥ 50 000 merged docs/sec (hash-join on in-memory right side ≤ 10 M rows).
  • Memory budget for right-side hash table ≤ 1 GB configurable.

Relationships

References

  • src/ROADMAP.md
  • src/exporters/FUTURE_ENHANCEMENTS.md#huggingface-hub-client-http-rate-limit-handling-429-back-off
  • Source key: roadmap:169:exporters:v1.8.0:huggingface-hub-client-http-rate-limit-handling-429-back-off

Generated from the consolidated source roadmap. Keep the roadmap and issue in sync when scope changes.

</issue_description>

Comments on the Issue (you are @copilot in this section)


…After handling

- Add recordRateLimitHit()/getRateLimitHits() to ExporterMetrics
- Add metrics field to HubUploadConfig for rate-limit metric emission
- Add headerCaptureCb, extractRetryAfterHeader, parseRetryAfterSeconds helpers
- Handle HTTP 429 in uploadDataset/uploadShards retry loops:
  * Parse Retry-After header (integer seconds or HTTP-date)
  * Cap sleep at config_.timeout_seconds per spec
  * Skip exponential back-off for 429-triggered retries
  * Emit exporters.huggingface.rate_limit_hit metric via ExporterMetrics
- Add 5 unit tests covering new metrics and config field behavior
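
The delay selection described in this commit message might look roughly like the sketch below. `RateLimitCounter` is a hypothetical stand-in for the new ExporterMetrics methods, `nextRetryDelaySeconds` is an illustrative name, and two behaviours are assumptions rather than facts from the PR: a 429 without a usable Retry-After value falls back to exponential back-off, and the exponential delay is also capped at the timeout.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the rate-limit counter on ExporterMetrics.
struct RateLimitCounter {
    std::atomic<std::uint64_t> hits{0};
    void recordRateLimitHit() { hits.fetch_add(1, std::memory_order_relaxed); }
    std::uint64_t getRateLimitHits() const {
        return hits.load(std::memory_order_relaxed);
    }
};

// Chooses how long to sleep before the next retry attempt.
long long nextRetryDelaySeconds(int http_status,
                                std::optional<long long> retry_after_seconds,
                                int attempt,
                                long long base_delay_seconds,
                                long long timeout_seconds,
                                RateLimitCounter& metrics) {
    assert(attempt >= 0 && base_delay_seconds >= 0 && timeout_seconds >= 0);

    if (http_status == 429) {
        metrics.recordRateLimitHit();  // counted for every 429 response
        if (retry_after_seconds) {
            // The server-provided delay replaces exponential back-off,
            // capped at the configured timeout.
            return std::clamp(*retry_after_seconds, 0LL, timeout_seconds);
        }
        // Assumption: no usable header, so fall through to back-off.
    }

    // Exponential back-off: base * 2^attempt, with the shift bounded.
    const int shift = std::min(attempt, 20);
    return std::min(base_delay_seconds << shift, timeout_seconds);
}
```

Keeping the decision in a pure function separates it from the curl calls and the actual sleep, so the 429 path can be covered by tests that never touch the network.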

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
…sts.txt

Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copilot AI changed the title feat(exporters): HuggingFace Hub Client HTTP 429 back-off with Retry-After handling Add JoinExporter: cross-collection hash-join export with PII redaction and memory budget Mar 16, 2026
@makr-code makr-code marked this pull request as ready for review March 16, 2026 20:19
@makr-code makr-code merged commit 06be524 into develop Mar 16, 2026
11 checks passed
Development

Successfully merging this pull request may close these issues.

HuggingFace Hub Client: HTTP 429 Back-Off