@shanbady shanbady commented Oct 1, 2025

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/8766

Description (What does it do?)

This PR fixes an issue where we ran OCR on problem file PDFs even when their content had not changed. In practice, that meant OCR ran again on every webhook sync sent by the data platform for Canvas courses that have PDF problem files.
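
For readers who want a picture of the general idea, here is a minimal sketch of skipping OCR when file content is unchanged, by comparing a checksum of the incoming bytes against the stored one. This is an illustration only and not the code in this PR: `ProblemFileRecord`, `sync_problem_file`, and the `run_ocr` callback are hypothetical names, and the real TutorProblemFile model and ETL task may detect changes differently.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProblemFileRecord:
    """Hypothetical stand-in for a stored problem file row (not the real model)."""

    checksum: Optional[str] = None
    content: Optional[str] = None


def sync_problem_file(
    record: ProblemFileRecord,
    pdf_bytes: bytes,
    run_ocr: Callable[[bytes], str],
) -> bool:
    """Run OCR only when the PDF bytes differ from what was last processed.

    Returns True if OCR ran, False if it was skipped.
    """
    new_checksum = hashlib.sha256(pdf_bytes).hexdigest()

    # Same bytes as last time and we already have transcribed text: skip the LLM call.
    if record.checksum == new_checksum and record.content is not None:
        return False

    # Content is new or changed: transcribe and remember the checksum.
    record.content = run_ocr(pdf_bytes)
    record.checksum = new_checksum
    return True
```

With a record that starts empty, the first call for a given PDF runs `run_ocr`, and a second call with identical bytes returns `False` without invoking it, which mirrors the behavior described in the testing steps.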

How can this be tested?

  1. Check out this branch.
  2. In settings_course_etl.py, set CANVAS_PDF_TRANSCRIPTION_MODEL to an Ollama model that can process images (in my case phi4-mini), or to gpt-4o-mini if you are set up with OpenAI.
  3. Restart your celery container.
  4. Run a sync for a Canvas course that has problem files and keep an eye on the celery logs. You should see output indicating that the task is calling an LLM to OCR the files: python manage.py backpopulate_canvas_courses --canvas-ids 14566 --overwrite
  5. Run the same command again and note that the LLM is not called to process the problem files.

@shanbady shanbady changed the title from "canvas - skip TutorProblemFile OCR if content is the same" to "Canvas - skip TutorProblemFile OCR if content is the same" Oct 1, 2025
@shanbady shanbady marked this pull request as ready for review October 1, 2025 15:21
@shanbady shanbady added the "Needs Review" label (An open Pull Request that is ready for review) Oct 1, 2025
@dsubak dsubak left a comment

LGTM

@shanbady shanbady merged commit 75a0686 into main Oct 1, 2025
13 checks passed
@shanbady shanbady deleted the shanbady/canvas-pdf-ocr-fix branch October 1, 2025 17:14