Optimize memory footprint of pdf problem transcription task #2433
cache: "poetry"

- name: Install poetry with pip
  run: python -m pip install poetry
This seems unrelated to the rest of the PR. Is it an intentional change?
This may be unnecessary, but it was introduced to fix a strange issue that adding pypdfium2 uncovered with the Python version our CI runner uses (it had been using 3.10 all along, not the pinned 3.12). I will see if I can take this step out.
pyproject.toml
django-filter = "^2.4.0"
django-guardian = "^3.0.0"
django-health-check = { git = "https://github.com/revsys/django-health-check", rev="b0500d14c338040984f02ee34ffbe6643b005084" } # pragma: allowlist secret
pypdfium2 = { git = "https://github.com/pypdfium2-team/pypdfium2"}
pypdfium2 is available from PyPI. Is there a reason we are installing the git version? Also, the git dependency in this PR isn't pinned to a revision.
Yeah, adding pypdfium2 uncovered an issue with the Python version our GitHub CI used. I can revert this to the PyPI version now.
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/8079
Also includes changes to resolve https://github.com/mitodl/hq/issues/8074
Description (What does it do?)
This PR reduces the memory usage of the problem set PDF -> Markdown conversion, which currently leads to an OOM kill on our deployed instances.
The main changes:
How can this be tested?
pip install memory-profiler
from memory_profiler import profile
and then decorate the sync_canvas_archive method with the @profile decorator:
python -m filprofiler run memory_test.py
BEFORE:
AFTER:
Note the consistent memory usage in the AFTER run vs. the 240.5 spike in the BEFORE run.
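For a quick sanity check without extra dependencies, peak allocation can also be approximated with the standard-library `tracemalloc` module. This is only a rough sketch and not the method used in this PR: `peak_kib`, `load_all_pages`, and `stream_pages` are hypothetical stand-ins, not code from this repository. `tracemalloc` only sees allocations made through Python's allocators, so memory held by native libraries may not show up.

```python
import tracemalloc


def peak_kib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in KiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1024


# Hypothetical stand-in: keep every page buffer alive at the same time.
def load_all_pages(n_pages, page_size):
    pages = [bytearray(page_size) for _ in range(n_pages)]
    return sum(len(page) for page in pages)


# Hypothetical stand-in: handle one page buffer at a time and let it go.
def stream_pages(n_pages, page_size):
    total = 0
    for _ in range(n_pages):
        page = bytearray(page_size)
        total += len(page)
    return total


if __name__ == "__main__":
    _, all_peak = peak_kib(load_all_pages, 50, 100_000)
    _, stream_peak = peak_kib(stream_pages, 50, 100_000)
    print(f"all at once: {all_peak:.0f} KiB, streamed: {stream_peak:.0f} KiB")
```

Wrapping the real task function in `peak_kib` before and after a change gives a single number to compare, which complements the line-by-line report from the profiler.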
Additional Context
The final test of this will be in a deployed environment. We should have enough headroom to complete the processing.