Memory issues on very large PDFs #193

@SpencerNorris

I'm trying to extract tables from a ~28,000-page PDF (not a typo) and am running up against memory limits when I process the pages in a loop.

import pandas as pd
import pdfplumber
from os import path

#Read in data
pdf = pdfplumber.open("data/my.pdf")

#Create settings for extraction
table_settings = {
    "vertical_strategy": "text", #No lines on table
    "horizontal_strategy": "text", #No lines on table
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
COLUMNS = [
    'Work Date',
    'Employee Number',
    'Pay Type',
    'Hours',
    'Account Number',
    'Hourly Rate',
    'Gross',
    'Job Code',
    'Activity Code'
]
# Extract pages one at a time
for page_num, page in enumerate(pdf.pages):
    try:
        # Pull the table
        table = page.extract_table(table_settings)
        # Drop the first row
        table = table[1:]
        # Read into a DataFrame
        df = pd.DataFrame(table, columns=COLUMNS)
        # Output to CSV
        df.to_csv(path.join('data', 'output', str(page_num) + '.csv'))
    # There's bound to be a billion issues with the data
    except Exception as ex:
        print("Error on page ", page_num, ".")
        print(ex)

I'm handling this one page at a time because if the script bombs out partway through, I'd otherwise lose all my work.
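Since each page is written to its own CSV, a rerun could skip pages whose output already exists. A minimal sketch, assuming the `data/output/<page_num>.csv` naming used in the loop; `pending_pages` is a hypothetical helper, not part of pdfplumber:

```python
from os import path


def pending_pages(output_dir, total_pages):
    """Return the page indices that have no CSV in output_dir yet."""
    return [
        page_num
        for page_num in range(total_pages)
        if not path.exists(path.join(output_dir, f"{page_num}.csv"))
    ]
```

The loop could then iterate over the returned indices instead of every page, so a crash only costs the page in progress. Pages that raised an error have no CSV, so this sketch would retry them on the next run.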

As the loop runs, memory consumption keeps growing until it hits about 5 GB (roughly all the memory I have left on my machine).
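To see how much each iteration retains, the standard-library `tracemalloc` module can be sampled after every step. This is only a sketch: `measure_growth` and the `work` callable are hypothetical stand-ins for the per-page extraction, and `tracemalloc` only sees allocations made through Python's allocators, so memory held by native code would not show up.

```python
import tracemalloc


def measure_growth(work, iterations):
    """Call work(i) repeatedly; return traced bytes in use after each call."""
    tracemalloc.start()
    try:
        samples = []
        for i in range(iterations):
            work(i)
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
        return samples
    finally:
        tracemalloc.stop()


# Example: a workload that keeps data alive on every call shows steady growth.
retained = []
samples = measure_growth(lambda i: retained.append(bytearray(100_000)), 5)
print(samples)
```

A flat series would suggest memory is being released between iterations; a steadily rising one suggests something is still holding references.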

I suspect a memory leak, but I'm not sure. I'd expect memory to be released as the loop iterates.
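Growth like this does not necessarily mean a leak in the strict sense. If the document object keeps a list of every page, and each page caches what it parsed, then nothing becomes unreachable as the loop advances and the garbage collector has nothing to free. The toy classes below are not pdfplumber code: `ToyPage`, `ToyDocument`, and `flush_cache` are illustrative names, and whether pdfplumber behaves this way is an assumption to check against the documentation for the installed version.

```python
class ToyPage:
    """Stand-in for a page that lazily parses and caches its contents."""

    def __init__(self, number):
        self.number = number
        self._cache = None

    def extract(self):
        # Parse once and keep the result on the page object.
        if self._cache is None:
            self._cache = [self.number] * 10_000
        return len(self._cache)

    def flush_cache(self):
        # Drop the cached data so it can be garbage collected.
        self._cache = None


class ToyDocument:
    """Stand-in for a document that holds a reference to every page."""

    def __init__(self, page_count):
        self.pages = [ToyPage(n) for n in range(page_count)]

    def cached_pages(self):
        # Count pages that are still holding parsed data.
        return sum(1 for page in self.pages if page._cache is not None)


doc = ToyDocument(5)
for page in doc.pages:
    page.extract()
print(doc.cached_pages())  # every visited page still holds its cache

for page in doc.pages:
    page.flush_cache()
print(doc.cached_pages())  # caches released
```

If the real page objects offer a comparable cache-clearing call, invoking it at the end of each iteration would be the analogous fix. Opening the PDF separately for successive page ranges is another option.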
