Log missing/invalid edx_module_id in content-file API requests#3539

Merged
mbertrand merged 7 commits into main from mb/log_missing_edx_module_id on Jun 30, 2026

@mbertrand mbertrand commented Jun 29, 2026

What are the relevant tickets?

Closes mitodl/hq#11873

Description (What does it do?)

Logs an error whenever an API request references an edx_module_id that has no backing content, so we get a signal when AskTIM and other consumers are degraded by missing content.

For every requested id, existence is probed directly against the source of truth, never inferred from empty search results. Two reasons are distinguished:

  • not_in_db — no ContentFile row (ETL/scrape gap)
  • not_in_index — row exists but isn't embedded in Qdrant (embedding gap)
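The two reasons can be illustrated with a minimal sketch. It assumes the sets of ids present in the database and in the index have already been looked up; `MissingReason` and `classify_missing` are hypothetical names used only for illustration, not helpers from this PR.

```python
from enum import Enum


class MissingReason(str, Enum):
    """Reason labels taken from the PR description."""

    NOT_IN_DB = "not_in_db"
    NOT_IN_INDEX = "not_in_index"


def classify_missing(requested, ids_in_db, ids_in_index):
    """Map each requested id that lacks backing content to its reason.

    Hypothetical helper: ids_in_db and ids_in_index are plain sets that the
    caller obtained by querying each source of truth directly.
    """
    missing = {}
    for module_id in requested:
        if module_id not in ids_in_db:
            # No ContentFile row at all (ETL/scrape gap)
            missing[module_id] = MissingReason.NOT_IN_DB
        elif module_id not in ids_in_index:
            # Row exists but is not embedded (embedding gap)
            missing[module_id] = MissingReason.NOT_IN_INDEX
    return missing
```

Ids that are present in both places are omitted from the result, which matches the "existing id is silent" behavior described in the testing steps.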

How can this be tested?

  • Pick a real edx_module_id from your local data, and also deindex a

    from learning_resources.models import ContentFile
    from vector_search.tasks import generate_embeddings, remove_embeddings

    # Two content files that have a non-empty edx_module_id
    real_cf = ContentFile.objects.exclude(edx_module_id='').first()
    unindexed_cf = ContentFile.objects.exclude(edx_module_id='').last()

    # (Re)generate embeddings so the first id is present in Qdrant
    generate_embeddings([real_cf.run.id], "content_file", False)
    real_edx_module_id = real_cf.edx_module_id
    print(real_edx_module_id)

    # Deindex the second file so it stays in the DB but is absent from Qdrant
    remove_embeddings([unindexed_cf.id], "content_file")
    unindexed_edx_module_id = unindexed_cf.edx_module_id
    print(unindexed_edx_module_id)
  • Log in as an admin user and try the following URLs. The edx_module_id values must be URL-encoded.

    REST endpoint — existing id is silent, bogus id logs not_in_db:

    http://open.odl.local:8065/api/v1/contentfiles/?edx_module_id=<REAL_ID>

    non-existent -> empty results AND an error log
    http://open.odl.local:8065/api/v1/contentfiles/?edx_module_id=does-not-exist

    Vector endpoint — only probes when there are no hits:

    non-existent id, no hits -> logs not_in_db
    http://open.odl.local:8065/api/v0/vector_content_files_search/?edx_module_id=does-not-exist

    real id that is in the DB but absent from Qdrant -> logs not_in_index
    http://open.odl.local:8065/api/v0/vector_content_files_search/?edx_module_id=<REAL_UNINDEXED_ID>
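Since the ids must be URL-encoded, one way to build the test URLs is with the standard library. This is only a sketch: `contentfiles_url` is a hypothetical helper, and the example id is made up to show characters such as `+`, `:` and `@` that need escaping (an unescaped `+` in a query string is typically decoded as a space).

```python
from urllib.parse import urlencode

# Base URL copied from the testing steps above
BASE_URL = "http://open.odl.local:8065/api/v1/contentfiles/"


def contentfiles_url(edx_module_id: str) -> str:
    """Return the REST URL with the id safely percent-encoded."""
    return f"{BASE_URL}?{urlencode({'edx_module_id': edx_module_id})}"


# Made-up id, for illustration only
example_id = "block-v1:Org+Course+Run+type@vertical+block@abc123"
print(contentfiles_url(example_id))
```

The same approach should work for the vector endpoint by swapping the base URL.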

Watch the web container logs for lines like:

Missing ContentFile (not_in_db) for edx_module_id=does-not-exist [source=contentfiles_api]
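A log line of this shape could be produced by something like the sketch below. The function name `log_missing_content_file` comes from this PR, but its signature and the exact template constant here are assumptions inferred from the sample line. The values are passed as logging arguments instead of being formatted into the string first, so the message template stays constant across occurrences.

```python
import logging

log = logging.getLogger(__name__)

# Assumed template, reconstructed from the sample log line above
MISSING_TEMPLATE = "Missing ContentFile (%s) for edx_module_id=%s [source=%s]"


def log_missing_content_file(reason, edx_module_id, source):
    """Log one error per occurrence; no throttling or caching involved."""
    log.error(MISSING_TEMPLATE, reason, edx_module_id, source)


log_missing_content_file("not_in_db", "does-not-exist", "contentfiles_api")
```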

mbertrand and others added 4 commits June 29, 2026 15:35
Sentry already groups these by the stable message template and applies its
own rate limiting, so the Redis-backed per-id throttle was redundant. Removing
it drops the caches["redis"] coupling (and a 500 risk on the contentfiles
endpoint), the identifier hashing, the throttle setting, and the real-redis
test fixture. log_missing_content_file is now a plain log.error; every
occurrence is logged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 29, 2026 20:48
github-actions Bot commented Jun 29, 2026


OpenAPI Changes

3 changes: 0 error, 3 warning, 0 info

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

Copilot AI left a comment


Pull request overview

Adds observability for degraded consumers (e.g., AskTIM) by logging (and emitting Sentry events via LoggingIntegration) when API requests reference edx_module_id values that are missing in the DB or missing from the Qdrant index, without breaking search behavior.

Changes:

  • Introduces shared logging utilities to record missing ContentFile backing for requested edx_module_id values (not_in_db, not_in_index).
  • Instruments the vector content-files search endpoint to probe DB/Qdrant when edx_module_id is provided and the search returns no hits (probe failures are swallowed to preserve endpoint availability).
  • Instruments the REST ContentFileFilter.edx_module_id filter to log any requested IDs that have no backing ContentFile row, and adds automated test coverage for both REST and vector behavior.
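The "probe only on zero hits, and never let the probe break the request" behavior described for the vector endpoint can be sketched as follows. This is a simplified synchronous illustration; the PR's probe is described as async, and `maybe_probe` and its parameters are hypothetical names.

```python
import logging

log = logging.getLogger(__name__)


def maybe_probe(edx_module_ids, hits, probe):
    """Run `probe` only for id-filtered searches that returned no hits.

    Returns True if the probe was attempted. Any exception raised by the
    probe is logged and swallowed so the search response is unaffected.
    """
    if not edx_module_ids or hits:
        # No id filter, or the search found something: nothing to probe
        return False
    try:
        probe(edx_module_ids)
    except Exception:
        log.exception("Missing-content probe failed")
    return True
```

Catching a broad `Exception` is usually discouraged, but it fits here because the probe is purely diagnostic and endpoint availability matters more than the log signal.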

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| vector_search/views.py | Calls the missing-content probe when edx_module_id is provided and the vector search returns zero hits. |
| vector_search/views_test.py | Adds endpoint-level tests for logging, probe failure isolation, and skipping the probe when hits exist. |
| vector_search/utils.py | Implements the async probe that checks DB presence and (when applicable) Qdrant index presence via count(exact=True). |
| vector_search/utils_test.py | Adds unit tests covering not_in_db, not_in_index, silent success, and “unknown collection” skip behavior. |
| learning_resources/utils.py | Adds log_missing_content_file and present_edx_module_ids helpers used by REST and vector instrumentation. |
| learning_resources/utils_test.py | Adds a unit test asserting the log template/arguments for log_missing_content_file. |
| learning_resources/filters.py | Adds LoggedEdxModuleIdFilter and wires it into ContentFileFilter.edx_module_id. |
| learning_resources/filters_test.py | Adds REST filter tests to ensure missing IDs are logged and present IDs are not. |
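The index-presence check via count(exact=True) can be illustrated with a stand-in client. Everything below is an assumption made for the sake of a runnable sketch: `FakeVectorClient` is not the real Qdrant client, and the filter is simplified to a (key, value) tuple, whereas the real qdrant-client expects a filter object for `count_filter`. Only the idea of asking for an exact count and comparing it to zero is taken from the PR summary.

```python
from dataclasses import dataclass


@dataclass
class CountResult:
    count: int


class FakeVectorClient:
    """Stand-in for a Qdrant-like client, holding payloads in memory."""

    def __init__(self, payloads):
        self._payloads = payloads  # list of payload dicts

    def count(self, collection_name, count_filter, exact=True):
        # The fake always counts exactly; `exact` is accepted for API shape
        key, value = count_filter
        matches = sum(1 for p in self._payloads if p.get(key) == value)
        return CountResult(count=matches)


def is_indexed(client, collection_name, edx_module_id):
    """Return True if at least one point carries this edx_module_id."""
    result = client.count(
        collection_name=collection_name,
        count_filter=("edx_module_id", edx_module_id),
        exact=True,
    )
    return result.count > 0
```

An exact count avoids the false negatives an approximate count might give for very small result sets, at some extra cost per query.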

@mbertrand mbertrand added the Needs Review label and removed the Work in Progress label Jun 30, 2026
@shanbady shanbady self-requested a review June 30, 2026 14:07

@shanbady shanbady left a comment


Approving since it does what it needs to.

I did leave a small comment about performance when using exact=True. However, since edx_module_id is an indexed payload field, I think the impact will be minimal.

I also tried manually running this count with exact=True on prod, and it was pretty fast.

@shanbady shanbady added the Waiting on author label and removed the Needs Review label Jun 30, 2026
@mbertrand mbertrand merged commit 7ee90c5 into main Jun 30, 2026
13 checks passed
@mbertrand mbertrand deleted the mb/log_missing_edx_module_id branch June 30, 2026 19:05
@odlbot odlbot mentioned this pull request Jul 1, 2026
3 tasks