Skip to content

Harden telemetry delivery and refine recipe metadata#2441

Closed
bmehta001 wants to merge 19 commits intomicrosoft:mainfrom
bmehta001:bhamehta/fix-telemetry-deadlock
Closed

Harden telemetry delivery and refine recipe metadata#2441
bmehta001 wants to merge 19 commits intomicrosoft:mainfrom
bmehta001:bhamehta/fix-telemetry-deadlock

Conversation

@bmehta001
Copy link
Copy Markdown
Contributor

@bmehta001 bmehta001 commented May 4, 2026

Summary

  • harden Olive telemetry delivery and cache behavior, including Windows file locking, cache-path handling, unreadable flush preservation, and CI recipe-only emission
  • simplify and scope telemetry plumbing while addressing PR review and GitHub Advanced Security feedback
  • refine OliveRecipe metadata so target fields are emitted only for explicit targets, host fields are tracked separately, and recipe/config telemetry captures redacted explicit overrides
  • replace opaque package-config hashing with redacted package_config_overrides, while keeping package_config out of recipe_hash
  • add or update focused telemetry and workflow tests and refresh the privacy documentation

Why

We want telemetry to be reliable under cache contention, offline retry, and CI execution, and we want the recipe signal to be useful for analyzing explicit overrides without mixing host/target semantics or relying on opaque hashes.

Validation

  • pytest test\\test_telemetry.py -q
  • pytest test\\workflows\\test_workflow_run.py -q
  • lintrunner f -a -m main
  • confirmed .azure_pipelines and .github do not set OLIVE_DISABLE_TELEMETRY, disable_telemetry, or --disable_telemetry

bmehta001 and others added 5 commits April 9, 2026 13:35
Replace separate _lock and _callback_condition with a single
_condition (threading.Condition) that protects all shared state:
_shutdown, _is_flushing, _callbacks_item_count, _events_logged.

The two-lock design had a lock order inversion: on_payload_transmitted
acquired _lock then _callback_condition, while wait_for_callbacks
acquired _callback_condition then _lock (via is_flushing). This could
deadlock under concurrent flush + callback scenarios.

Using one lock eliminates the ordering issue entirely and simplifies
the code — the on_payload_transmitted callback no longer needs nested
lock acquisition.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wait_for_callbacks: Hold _condition lock continuously from condition
check through wait() to prevent missing notifications. Previously the
lock was released between checking and waiting, allowing notify_all()
to fire in the gap.

TelemetryLogger: Add RLock to __new__ and get_default_logger to
prevent race conditions when multiple threads create the singleton
simultaneously. Uses RLock because get_default_logger calls __new__
which both need the same lock.

Files changed:
- olive/telemetry/telemetry.py
- olive/telemetry/library/telemetry_logger.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Remove base64 encoding from cache: write raw JSON lines instead.
   The base64 layer added 33% size overhead and prevented human
   inspection during debugging with no security benefit (cache dir
   is user-owned). Cache file is now directly readable.

2. Simplify cache flush: remove the .flush file rename dance
   (atomic rename, restore on failure, stale file cleanup). For ~5
   events/session, simple lock-read-send-delete is sufficient. On
   failure the cache file persists for next retry.

3. Simplify Telemetry singleton: remove double-checked locking in
   __new__. The lock is cheap and called once at startup; the outer
   check saved a lock acquisition but added complexity.

Net: -83 lines.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 1: Remove duplicate callback count increment in
on_payload_transmitted when _is_flushing is true. The count was
incremented both in the early-return path and in the finally block,
causing wait_for_callbacks to think events completed before they
actually did. Now only the finally block increments (which always
runs, even on return).

Fix 2: Replace flush_path.replace(cache_path) with new
_restore_flush_file() method that appends flush entries into the
cache file instead of overwriting it. This prevents losing events
written by another process while a flush was in progress.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Emit a single OliveRecipe event per workflow so recipe usage and failures can be measured without double-counting nested action failures. Keep CI in recipe-only mode, make Olive app naming explicit, update the telemetry ingestion key, and harden the cache/locking path that the new telemetry depends on.

Files changed:
- docs/Privacy.md
- olive/cli/base.py
- olive/cli/run.py
- olive/telemetry/constants.py
- olive/telemetry/library/options.py
- olive/telemetry/library/telemetry_logger.py
- olive/telemetry/telemetry.py
- olive/telemetry/telemetry_extensions.py
- olive/telemetry/utils.py
- olive/workflows/run/run.py
- test/test_telemetry.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 4, 2026 18:40
@bmehta001 bmehta001 self-assigned this May 4, 2026
Replace optional runtime imports with importlib-based lookups so the recent telemetry changes stay lint-clean without adding new noqa markers. Keep the focused telemetry tests import-sorted and ready for CI.

Files changed:
- olive/cli/base.py
- olive/workflows/run/run.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens Olive telemetry reliability (cache/locking, service naming) and introduces a dedicated OliveRecipe event intended to be the only telemetry emitted in CI.

Changes:

  • Add recipe-boundary telemetry (OliveRecipe) and log recipe success/failure from workflow runs.
  • Harden telemetry cache flushing and multi-process safety (including Windows file locking), and make cache location overrideable.
  • Make telemetry app naming explicit (service.name=Olive) and update the ingestion connection string; suppress non-recipe events in CI.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
olive/workflows/run/run.py Adds recipe result logging with hashed/normalized metadata derived from run configs.
olive/telemetry/telemetry.py Introduces OliveRecipe, CI filtering, cache path override, and modifies cache flush/write behavior.
olive/telemetry/utils.py Reworks cross-platform exclusive file locking; uses LockFileEx on Windows.
olive/telemetry/telemetry_extensions.py Adds log_recipe_result(...) helper that emits OliveRecipe.
olive/telemetry/library/telemetry_logger.py Adds thread-safe singleton init and supports explicit service.name.
olive/telemetry/library/options.py Extends exporter options with service_name.
olive/telemetry/constants.py Updates the telemetry ingestion connection string.
olive/cli/run.py Passes recipe telemetry metadata for WorkflowRun.
olive/cli/base.py Centralizes recipe telemetry metadata for generated CLI commands; refactors dynamic imports.
docs/Privacy.md Documents CI behavior: only OliveRecipe is emitted in CI.
test/test_telemetry.py Adds tests for cache path override, explicit service.name, CI filtering, and Windows file locking.
test/workflows/test_workflow_run.py Adds tests ensuring recipe result logging on success/failure.

Comment thread olive/telemetry/telemetry.py
Comment thread olive/telemetry/telemetry.py Outdated
Comment thread olive/telemetry/telemetry.py Outdated
Comment thread olive/workflows/run/run.py Outdated
Keep exporter options focused on transport/export concerns and move service.name defaults into the logger/resource layer where they belong. This keeps Olive's explicit app name override separate from the shared logger fallback and removes unnecessary plumbing.

Files changed:
- olive/telemetry/library/options.py
- olive/telemetry/library/telemetry_logger.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread olive/telemetry/telemetry.py Fixed
Comment thread olive/telemetry/utils.py Fixed
bmehta001 and others added 3 commits May 4, 2026 14:03
Keep Olive''s explicit service-name override in the logger path, but restore the previous shared-library fallback and compatibility behavior so this cleanup does not broaden unrelated API or default changes.

Files changed:
- olive/telemetry/library/options.py
- olive/telemetry/library/telemetry_logger.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Trim the branch back to telemetry behavior and the minimal plumbing needed to support it. Restore the original local-import patterns in CLI and workflow code, keep only targeted lint suppressions where those restored lines are required, and simplify the telemetry logger app-name plumbing without changing the feature behavior.

Files changed:
- olive/cli/base.py
- olive/cli/run.py
- olive/telemetry/library/telemetry_logger.py
- olive/workflows/run/run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix the correctness and security issues raised on PR microsoft#2441 by handling empty telemetry cache-dir overrides safely, preserving non-empty unreadable flush files instead of deleting them, restoring the legacy .json.flush naming pattern, handling non-pathlike recipe config inputs without masking the original error, and cleaning up the Windows ctypes import pattern for CodeQL.

Files changed:
- olive/telemetry/telemetry.py
- olive/telemetry/utils.py
- olive/workflows/run/run.py
- test/test_telemetry.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread test/test_telemetry.py Fixed
Comment thread test/test_telemetry.py Fixed
Comment thread test/test_telemetry.py Fixed
Comment thread test/test_telemetry.py Fixed
Comment thread test/test_telemetry.py Fixed
Comment thread olive/cli/run.py Fixed
Comment thread olive/cli/run.py Fixed
Comment thread olive/cli/base.py Fixed
Comment thread olive/cli/base.py Fixed
Comment thread test/test_telemetry.py Fixed
bmehta001 and others added 8 commits May 4, 2026 15:46
Drop the explicit service.name wiring test from test/test_telemetry.py since it is mostly implementation-detail coverage and does not protect the higher-value telemetry behavior changes on this branch.

Files changed:
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve the remaining github-advanced-security findings by removing unused PLC0415 noqa markers, keeping the optional-import behavior via minimal import cleanup, and updating the telemetry tests to satisfy the protected-access and consider-using-with lint comments without changing the tested behavior.

Files changed:
- olive/cli/base.py
- olive/cli/run.py
- olive/workflows/run/run.py
- test/test_telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove dead base64 cache helpers from olive/telemetry/utils.py and move exception formatting next to the telemetry extension code that actually uses it. Keep the Win32 locking and cache-dir logic intact while reducing unrelated utility clutter.

Files changed:
- olive/telemetry/utils.py
- olive/telemetry/telemetry_extensions.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute the CI environment flag once during Telemetry initialization and reuse it for recipe-only gating and heartbeat suppression instead of calling the check twice back-to-back.

Files changed:
- olive/telemetry/telemetry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep target telemetry fields only for explicitly configured targets, add separate host fields, remove the ambiguous input model name hash, and add redacted config-overrides plus package-config hash metadata so recipe telemetry can show which overrides users actually provide without folding environment-specific package config into recipe_hash.

Files changed:
- docs/Privacy.md
- olive/telemetry/telemetry.py
- olive/workflows/run/run.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log redacted package-config overrides instead of an opaque hash so Olive telemetry captures the specific package settings users changed, while still excluding package_config from recipe_hash and avoiding raw module-path leakage.

Files changed:
- docs/Privacy.md
- olive/telemetry/telemetry.py
- olive/workflows/run/run.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skip Hugging Face and Docker logins when PR secrets are unavailable or unresolved so Azure unit-test jobs do not fail before tests start on fork-style runs, while still preserving normal login behavior when valid credentials are present.

Files changed:
- .azure_pipelines/job_templates/build-docker-image-template.yaml
- .azure_pipelines/job_templates/huggingface-login-template.yaml
- .azure_pipelines/job_templates/olive-test-linux-gpu-template.yaml
- .azure_pipelines/scripts/run_test.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 changed the title Harden telemetry and add recipe-only CI telemetry Harden telemetry delivery and refine recipe metadata May 5, 2026
Revert the Azure pipeline login guard changes because the pipeline behavior should match main and those changes are not necessary for the telemetry PR.

Files changed:
- .azure_pipelines/job_templates/build-docker-image-template.yaml
- .azure_pipelines/job_templates/huggingface-login-template.yaml
- .azure_pipelines/job_templates/olive-test-linux-gpu-template.yaml
- .azure_pipelines/scripts/run_test.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bmehta001 bmehta001 closed this May 5, 2026
bmehta001 added a commit that referenced this pull request May 9, 2026
Fix the correctness and security issues raised on PR #2441 by handling empty telemetry cache-dir overrides safely, preserving non-empty unreadable flush files instead of deleting them, restoring the legacy .json.flush naming pattern, handling non-pathlike recipe config inputs without masking the original error, and cleaning up the Windows ctypes import pattern for CodeQL.

Files changed:
- olive/telemetry/telemetry.py
- olive/telemetry/utils.py
- olive/workflows/run/run.py
- test/test_telemetry.py
- test/workflows/test_workflow_run.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants