# MSLP - Multiprocessing SQLAlchemy Largefile Processing

Example using Python 3 multiprocessing to parse large files with SQLAlchemy across multiple processors: http://blog.jeremyaldrich.net/en/latest/multiprocessing_sqlalchemy_largefile_processing.html

- Parses files of any size (1GB+)
- Inserts file contents into MySQL using SQLAlchemy, with each worker process performing batch inserts
- Processes 10 million+ rows of JSON (reading the file, filtering through SQLAlchemy models, and inserting into MySQL) in about 2 minutes while consistently using 70-100MB of memory
- Processing speed scales with the number of CPUs available (`from multiprocessing import cpu_count; cpu_count()`); see the snippet below
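
A tiny illustration of sizing the consumer pool from the core count (the constant name below is just illustrative, not from this repo):

```python
from multiprocessing import cpu_count

# One consumer process per available core; more cores mean more throughput.
NUM_CONSUMERS = cpu_count()
```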

This takes advantage of multiple processors. It can parse a file of any size by splitting it into separate chunks and yielding each line to a consumer queue.

The main process parses the file and acts as the producer. Due to disk I/O limitations, only one producer reads the file, but the producer count can easily be increased if you run this on an SSD, for example.
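
A minimal sketch of what the single-producer loop might look like; the function name, arguments, and sentinel handling are illustrative assumptions rather than the repo's actual code:

```python
from multiprocessing import Queue

def produce(path: str, queue: Queue, num_consumers: int) -> None:
    """Read the large file line by line and feed it to the consumer queue."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # put() blocks when a bounded queue is full, which keeps memory
            # flat regardless of the input file size.
            queue.put(line)
    # One sentinel per consumer so every worker knows when to stop.
    for _ in range(num_consumers):
        queue.put(None)
```

Because a bounded queue applies backpressure, the producer never reads far ahead of the consumers, which is what keeps memory usage flat on multi-gigabyte files.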

For every core you have, a separate process is spawned to handle SQLAlchemy integrity checks, parsing, and insertion into MySQL using batch inserts.
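
A hedged sketch of what each consumer process might do; the `ChatMessage` model, batch size, and connection handling below are assumptions for illustration, not the repo's actual models:

```python
import json
from multiprocessing import Queue
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class ChatMessage(Base):
    """Hypothetical model; the repo defines its own SQLAlchemy models."""
    __tablename__ = "chat_messages"
    id = Column(Integer, primary_key=True)
    sender = Column(String(64))
    body = Column(Text)

BATCH_SIZE = 1000  # assumed batch size, not taken from the repo

def consume(queue: Queue, db_url: str) -> None:
    """Drain JSON lines from the queue and batch-insert them into MySQL."""
    engine = create_engine(db_url)        # each process creates its own engine
    session = sessionmaker(bind=engine)()
    batch = []
    while True:
        line = queue.get()
        if line is None:                  # sentinel: the producer is done
            break
        batch.append(json.loads(line))    # dict of column name -> value
        if len(batch) >= BATCH_SIZE:
            session.bulk_insert_mappings(ChatMessage, batch)
            session.commit()
            batch.clear()
    if batch:                             # flush the final partial batch
        session.bulk_insert_mappings(ChatMessage, batch)
        session.commit()
    session.close()
```

Creating the engine inside the worker matters: SQLAlchemy engines and their connection pools should not be shared across forked processes.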

Parses and stores 10 million messages in a transactional database in around 2-3 minutes and generates results in about 1 minute. Processor usage is around 7%-20% per core, and total memory usage stays between 70-100MB.

Generates a result file for you to check against.

- If using VirtualBox, allocate multiple processors to your VM: Settings > System > Processor, then increase the processor count.