Allow large files to process without running out of memory #698

Merged (1 commit) on Mar 12, 2024

@sandbergja sandbergja commented Mar 12, 2024

This patch replaces various occurrences where huge MARC files are read into memory as Strings or StringIOs. Instead, it uses:

  • Tempfiles on disk
  • A SAX parser that allows us to read XML files into memory one element at a time, rather than all at once.
  • Nokogiri in one case where we were using REXML

Related to #695

I tried this out on staging, and also the main branch, with 4 files:

| branch | peak memory consumption | memory consumption pattern | time |
| --- | --- | --- | --- |
| this one | 0.4 GB | steady during the whole run | 4 minutes 33 seconds |
| main | 1.2 GB | growing at a regular pace during the whole run | 4 minutes 7 seconds |

For what it's worth, there are 475 files that need to be processed in prod.

This patch replaces various occurrences where huge MARC files
are read into memory as Strings or StringIOs.  Instead, it
uses:
* Tempfiles on disk
* A SAX parser that allows us to read XML files into memory
one element at a time.
* Nokogiri in one case where we were using REXML
@sandbergja sandbergja marked this pull request as ready for review March 12, 2024 01:28

@christinach christinach left a comment

thanks @sandbergja 🍄 !

@christinach christinach merged commit 96f5ff0 into main Mar 12, 2024
7 checks passed
@christinach christinach deleted the can-the-job-complete branch March 12, 2024 13:13