Conversation


@shanbady shanbady commented Aug 13, 2025

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/8079

Also includes changes to resolve https://github.com/mitodl/hq/issues/8074

Description (What does it do?)

This PR reduces the memory usage of the problem-set PDF-to-markdown conversion, which currently leads to an OOM kill on our deployed instances.

The main changes:

  • Use a generator to yield base64-encoded image strings.
  • Use pymupdf to render the page images one at a time as needed, instead of pdf2image, which only provides a method that generates all pages at once.

How can this be tested?

  1. Check out main.
  2. Make sure all the settings required to fetch Canvas courses are set:
    • CANVAS_COURSE_BUCKET_NAME=ol-data-lake-landing-zone-production
    • CANVAS_COURSE_BUCKET_PREFIX=canvas/course_content
    • active AWS/S3 keys
  3. Make sure you have either Ollama set up or an OpenAI key. Also set CANVAS_PDF_TRANSCRIPTION_MODEL to "gpt-4o" (might be gpt-5 now) or the name of an Ollama model you have locally.
  4. Create a file "memory_test.py" with the following contents:
  4. create a file "memory_test.py" with the following contents:
import os, django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "main.settings")
django.setup()

from learning_resources.etl.canvas import sync_canvas_archive
from learning_resources.etl.constants import ETLSource
from learning_resources.etl.utils import get_learning_course_bucket

bucket = get_learning_course_bucket(ETLSource.canvas.name)

path = "canvas/course_content/14566/d799e9bdc2d0f99644004e4eb651e3141b8ae917ebbd9dd73c466a75d785e017.imscc"

sync_canvas_archive(bucket, path, False)
  5. Go into the web container and install memory-profiler: pip install memory-profiler
  6. Modify learning_resources/etl/canvas.py: add from memory_profiler import profile, then decorate the sync_canvas_archive function with @profile:
@profile
def sync_canvas_archive(bucket, key: str, overwrite):
  7. Run the script: python memory_test.py (the @profile decorator prints the line-by-line table when sync_canvas_archive runs)
  8. Note the memory stats, in particular the "Increment" column, which shows how much usage increases from baseline.
  9. Check out this branch.
  10. Rebuild your web container to install the new dependencies.
  11. Repeat steps 5-8.

BEFORE:

Line    Mem usage    Increment  Occurrences   Line Contents
---------------------------------------------------
    43    382.1 MiB    382.1 MiB           1   @profile
    44                                         def sync_canvas_archive(bucket, key: str, overwrite):
    45                                             """
    46                                             Sync a Canvas course archive from S3
    47                                             """
    48    382.1 MiB      0.0 MiB           1       from learning_resources.etl.loaders import load_content_files, load_problem_files
    49                                         
    50    382.1 MiB      0.0 MiB           1       course_folder = key.lstrip(settings.CANVAS_COURSE_BUCKET_PREFIX).split("/")[0]
    51                                         
    52    715.8 MiB      0.0 MiB           2       with TemporaryDirectory() as export_tempdir:
    53    382.1 MiB      0.0 MiB           1           course_archive_path = Path(export_tempdir, key.split("/")[-1])
    54    384.1 MiB      2.0 MiB           1           bucket.download_file(key, course_archive_path)
    55    386.0 MiB      1.9 MiB           2           resource_readable_id, run = run_for_canvas_archive(
    56    384.1 MiB      0.0 MiB           1               course_archive_path, course_folder=course_folder, overwrite=overwrite
    57                                                 )
    58    386.0 MiB      0.0 MiB           1           checksum = calc_checksum(course_archive_path)
    59    386.0 MiB      0.0 MiB           1           if run:
    60    475.3 MiB     89.3 MiB           2               load_content_files(
    61    386.0 MiB      0.0 MiB           1                   run,
    62    386.0 MiB      0.0 MiB           2                   transform_canvas_content_files(
    63    386.0 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    64                                                         ),
    65                                                     )
    66                                         
    67    715.8 MiB    240.5 MiB           2               load_problem_files(
    68    475.3 MiB      0.0 MiB           1                   run,
    69    475.3 MiB      0.0 MiB           2                   transform_canvas_problem_files(
    70    475.3 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    71                                                         ),
    72                                                     )
    73    715.8 MiB      0.0 MiB           1               run.checksum = checksum
    74    715.8 MiB      0.0 MiB           1               run.save()
    75                                         
    76    715.8 MiB      0.0 MiB           1       return resource_readable_id, run

AFTER:

    43    406.7 MiB    406.7 MiB           1   @profile
    44                                         def sync_canvas_archive(bucket, key: str, overwrite):
    45                                             """
    46                                             Sync a Canvas course archive from S3
    47                                             """
    48    406.7 MiB      0.0 MiB           1       from learning_resources.etl.loaders import load_content_files, load_problem_files
    49                                         
    50    406.7 MiB      0.0 MiB           1       course_folder = key.lstrip(settings.CANVAS_COURSE_BUCKET_PREFIX).split("/")[0]
    51                                         
    52    521.3 MiB      0.0 MiB           2       with TemporaryDirectory() as export_tempdir:
    53    406.7 MiB      0.0 MiB           1           course_archive_path = Path(export_tempdir, key.split("/")[-1])
    54    408.3 MiB      1.6 MiB           1           bucket.download_file(key, course_archive_path)
    55    410.5 MiB      2.2 MiB           2           resource_readable_id, run = run_for_canvas_archive(
    56    408.3 MiB      0.0 MiB           1               course_archive_path, course_folder=course_folder, overwrite=overwrite
    57                                                 )
    58    410.5 MiB      0.0 MiB           1           checksum = calc_checksum(course_archive_path)
    59    410.5 MiB      0.0 MiB           1           if run:
    60    501.3 MiB     90.8 MiB           2               load_content_files(
    61    410.5 MiB      0.0 MiB           1                   run,
    62    410.5 MiB      0.0 MiB           2                   transform_canvas_content_files(
    63    410.5 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    64                                                         ),
    65                                                     )
    66                                         
    67    521.3 MiB     20.1 MiB           2               load_problem_files(
    68    501.3 MiB      0.0 MiB           1                   run,
    69    501.3 MiB      0.0 MiB           2                   transform_canvas_problem_files(
    70    501.3 MiB      0.0 MiB           1                       course_archive_path, run, overwrite=overwrite
    71                                                         ),
    72                                                     )
    73    521.3 MiB      0.0 MiB           1               run.checksum = checksum
    74    521.3 MiB      0.0 MiB           1               run.save()
    75                                         
    76    521.3 MiB      0.0 MiB           1       return resource_readable_id, run

Note the consistent memory usage after the change: load_problem_files now adds 20.1 MiB vs the 240.5 MiB spike before.
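A toy illustration (not the PR's actual code) of why yielding pages one at a time flattens the profile: a list materializes every encoded page at once, while a generator holds only the page currently being processed.

```python
import base64
import sys

PAGE_BYTES = b"\x00" * 4096  # stand-in for one rendered PDF page


def encode_all_pages(n):
    """Eager: builds a list of n base64 strings; peak memory grows with n."""
    return [base64.b64encode(PAGE_BYTES) for _ in range(n)]


def encode_pages_lazily(n):
    """Lazy: yields one base64 string at a time; peak memory is one page."""
    for _ in range(n):
        yield base64.b64encode(PAGE_BYTES)


eager = encode_all_pages(100)
lazy = encode_pages_lazily(100)
# The generator object itself is tiny; the list holds 100 live references.
assert sys.getsizeof(lazy) < sys.getsizeof(eager)
```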

Additional Context

The final test of this will be in a deployed environment; we should have enough headroom to complete the processing.

@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Aug 13, 2025
@shanbady shanbady marked this pull request as ready for review August 13, 2025 19:27
@abeglova abeglova self-assigned this Aug 14, 2025
cache: "poetry"

- name: Install poetry with pip
run: python -m pip install poetry
Contributor


This seems unrelated to the rest of the PR. Is it an intentional change?

Contributor Author


This could be unnecessary, but it was introduced to fix a strange issue that adding pypdfium2 uncovered with the Python version our CI runner uses (it had been using 3.10 all along, not the pinned 3.12). I'll see if I can take this step out.

pyproject.toml Outdated
django-filter = "^2.4.0"
django-guardian = "^3.0.0"
django-health-check = { git = "https://github.com/revsys/django-health-check", rev="b0500d14c338040984f02ee34ffbe6643b005084" } # pragma: allowlist secret
pypdfium2 = { git = "https://github.com/pypdfium2-team/pypdfium2"}
Contributor


pypdfium2 is available from PyPI. Is there a reason we are installing the git version? Also, it looks like the git version you are installing in this PR isn't pinned to a specific revision.
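For reference, the two options the reviewer is contrasting would look roughly like this in pyproject.toml (the version constraint and commit sha are placeholders, not values from this PR):

```toml
# Preferred: released version from PyPI
pypdfium2 = "^4.0"

# Or, if the git version is really needed, pin it to a revision:
# pypdfium2 = { git = "https://github.com/pypdfium2-team/pypdfium2", rev = "<commit-sha>" }
```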

Contributor Author


Yeah, there was an issue that adding pypdfium2 uncovered with the Python version our GitHub CI used. I can revert this to using the PyPI version now.

@shanbady shanbady merged commit 8813d37 into main Aug 15, 2025
13 checks passed
@shanbady shanbady deleted the shanbady/canvas-pdf-memory-opt2 branch August 15, 2025 15:39
@odlbot odlbot mentioned this pull request Aug 15, 2025
4 tasks