Conversation


@shanbady shanbady commented Aug 13, 2025

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/8079

Also includes changes to resolve https://github.com/mitodl/hq/issues/8074

Description (What does it do?)

This PR reduces the memory usage of the problem-set PDF-to-markdown conversion, which currently leads to an OOM kill on our deployed instances.

The main changes:

  • Use a generator to yield base64-encoded image strings.
  • Use pymupdf to render the page images one at a time as needed, instead of pdf2image, which only provides a method that generates all pages at once.

How can this be tested?

  1. Check out main.
  2. Make sure all the settings required to fetch Canvas courses are set:
    • CANVAS_COURSE_BUCKET_NAME=ol-data-lake-landing-zone-production
    • CANVAS_COURSE_BUCKET_PREFIX=canvas/course_content
    • active AWS/S3 keys
  3. Make sure you have either Ollama set up or an OpenAI key. Also set CANVAS_PDF_TRANSCRIPTION_MODEL to "gpt-4o" (might be gpt-5 now) or the name of an Ollama model you have locally.
  4. Create a file "memory_test.py" with the following contents:
  4. create a file "memory_test.py" with the following contents:
import os, django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "main.settings")
django.setup()

from learning_resources.etl.canvas import sync_canvas_archive
from learning_resources.etl.constants import ETLSource
from learning_resources.etl.utils import get_learning_course_bucket

bucket = get_learning_course_bucket(ETLSource.canvas.name)

path = "canvas/course_content/14566/d799e9bdc2d0f99644004e4eb651e3141b8ae917ebbd9dd73c466a75d785e017.imscc"

sync_canvas_archive(bucket, path, False)
  5. Go into the web container and install memory-profiler: pip install memory-profiler
  6. Modify learning_resources/etl/canvas.py: add from memory_profiler import profile, then decorate the sync_canvas_archive function with @profile:
@profile
def sync_canvas_archive(bucket, key: str, overwrite):
  7. Run the script: python memory_test.py (the @profile decorator prints the line-by-line table when sync_canvas_archive runs)
  8. Note the memory stats, in particular the "Increment" column, which shows how much usage increases from baseline.
  9. Check out this branch.
  10. Rebuild your web container to install the new dependencies.
  11. Repeat steps 5-8.

BEFORE:

Line    Mem usage    Increment  Occurrences   Line Contents
---------------------------------------------------
    43    382.1 MiB    382.1 MiB           1   @profile
    44                                         def sync_canvas_archive(bucket, key: str, overwrite):
    45                                             """
    46                                             Sync a Canvas course archive from S3
    47                                             """
    48    382.1 MiB      0.0 MiB           1       from learning_resources.etl.loaders import load_content_files, load_problem_files
    49                                         
    50    382.1 MiB      0.0 MiB           1       course_folder = key.lstrip(settings.CANVAS_COURSE_BUCKET_PREFIX).split("/")[0]
    51                                         
    52    715.8 MiB      0.0 MiB           2       with TemporaryDirectory() as export_tempdir:
    53    382.1 MiB      0.0 MiB           1           course_archive_path = Path(export_tempdir, key.split("/")[-1])
    54    384.1 MiB      2.0 MiB           1           bucket.download_file(key, course_archive_path)
    55    386.0 MiB      1.9 MiB           2           resource_readable_id, run = run_for_canvas_archive(
    56    384.1 MiB      0.0 MiB           1               course_archive_path, course_folder=course_folder, overwrite=overwrite
    57                                                 )
    58    386.0 MiB      0.0 MiB           1           checksum = calc_checksum(course_archive_path)
    59    386.0 MiB      0.0 MiB           1           if run:
    60    475.3 MiB     89.3 MiB           2               load_content_files(
    61    386.0 MiB      0.0 MiB           1                   run,
    62    386.0 MiB      0.0 MiB           2                   transform_canvas_content_files(
    63    386.0 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    64                                                         ),
    65                                                     )
    66                                         
    67    715.8 MiB    240.5 MiB           2               load_problem_files(
    68    475.3 MiB      0.0 MiB           1                   run,
    69    475.3 MiB      0.0 MiB           2                   transform_canvas_problem_files(
    70    475.3 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    71                                                         ),
    72                                                     )
    73    715.8 MiB      0.0 MiB           1               run.checksum = checksum
    74    715.8 MiB      0.0 MiB           1               run.save()
    75                                         
    76    715.8 MiB      0.0 MiB           1       return resource_readable_id, run

AFTER:

    43    406.7 MiB    406.7 MiB           1   @profile
    44                                         def sync_canvas_archive(bucket, key: str, overwrite):
    45                                             """
    46                                             Sync a Canvas course archive from S3
    47                                             """
    48    406.7 MiB      0.0 MiB           1       from learning_resources.etl.loaders import load_content_files, load_problem_files
    49                                         
    50    406.7 MiB      0.0 MiB           1       course_folder = key.lstrip(settings.CANVAS_COURSE_BUCKET_PREFIX).split("/")[0]
    51                                         
    52    521.3 MiB      0.0 MiB           2       with TemporaryDirectory() as export_tempdir:
    53    406.7 MiB      0.0 MiB           1           course_archive_path = Path(export_tempdir, key.split("/")[-1])
    54    408.3 MiB      1.6 MiB           1           bucket.download_file(key, course_archive_path)
    55    410.5 MiB      2.2 MiB           2           resource_readable_id, run = run_for_canvas_archive(
    56    408.3 MiB      0.0 MiB           1               course_archive_path, course_folder=course_folder, overwrite=overwrite
    57                                                 )
    58    410.5 MiB      0.0 MiB           1           checksum = calc_checksum(course_archive_path)
    59    410.5 MiB      0.0 MiB           1           if run:
    60    501.3 MiB     90.8 MiB           2               load_content_files(
    61    410.5 MiB      0.0 MiB           1                   run,
    62    410.5 MiB      0.0 MiB           2                   transform_canvas_content_files(
    63    410.5 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    64                                                         ),
    65                                                     )
    66                                         
    67    521.3 MiB     20.1 MiB           2               load_problem_files(
    68    501.3 MiB      0.0 MiB           1                   run,
    69    501.3 MiB      0.0 MiB           2                   transform_canvas_problem_files(
    70    501.3 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    71                                                         ),
    72                                                     )
    73    521.3 MiB      0.0 MiB           1               run.checksum = checksum
    74    521.3 MiB      0.0 MiB           1               run.save()
    75                                         
    76    521.3 MiB      0.0 MiB           1       return resource_readable_id, run

Note the consistent memory usage after the change: load_problem_files now adds 20.1 MiB vs the 240.5 MiB spike before.
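A toy illustration (not the PR's actual code) of why yielding pages one at a time flattens the profile: a list materializes every encoded page at once, while a generator holds only the page currently being processed.

```python
import base64
import sys

PAGE_BYTES = b"\x00" * 4096  # stand-in for one rendered PDF page


def encode_all_pages(n):
    """Eager: builds a list of n base64 strings; peak memory grows with n."""
    return [base64.b64encode(PAGE_BYTES) for _ in range(n)]


def encode_pages_lazily(n):
    """Lazy: yields one base64 string at a time; peak memory is one page."""
    for _ in range(n):
        yield base64.b64encode(PAGE_BYTES)


eager = encode_all_pages(100)
lazy = encode_pages_lazily(100)
# The generator object itself is tiny; the list holds 100 live references.
assert sys.getsizeof(lazy) < sys.getsizeof(eager)
```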

Additional Context

The final test of this will be in a deployed environment; we should have enough headroom to complete the processing.

@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Aug 13, 2025
@shanbady shanbady marked this pull request as ready for review August 13, 2025 19:27
@abeglova abeglova self-assigned this Aug 14, 2025
cache: "poetry"

- name: Install poetry with pip
run: python -m pip install poetry
Contributor


This seems unrelated to the rest of the PR. Is it an intentional change?

Contributor Author


This could be unnecessary, but it was introduced to fix a strange issue that adding pypdfium2 uncovered with the Python version our CI runner uses (it had been using 3.10 all along, not the pinned 3.12). I'll see if I can take this step out.

pyproject.toml Outdated
django-filter = "^2.4.0"
django-guardian = "^3.0.0"
django-health-check = { git = "https://github.com/revsys/django-health-check", rev="b0500d14c338040984f02ee34ffbe6643b005084" } # pragma: allowlist secret
pypdfium2 = { git = "https://github.com/pypdfium2-team/pypdfium2"}
Contributor


pypdfium2 is available from PyPI. Is there a reason we are installing the git version? Also, it looks like the git version you are installing in this PR isn't pinned to a specific revision.
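For reference, the two options the reviewer is contrasting would look roughly like this in pyproject.toml (the version constraint and commit sha are placeholders, not values from this PR):

```toml
# Preferred: released version from PyPI
pypdfium2 = "^4.0"

# Or, if the git version is really needed, pin it to a revision:
# pypdfium2 = { git = "https://github.com/pypdfium2-team/pypdfium2", rev = "<commit-sha>" }
```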

Contributor Author


Yeah, there was an issue that adding pypdfium2 uncovered with the Python version our GitHub CI used. I can revert this to using the PyPI version now.

@shanbady shanbady merged commit 8813d37 into main Aug 15, 2025
13 checks passed
@shanbady shanbady deleted the shanbady/canvas-pdf-memory-opt2 branch August 15, 2025 15:39
@odlbot odlbot mentioned this pull request Aug 15, 2025
4 tasks