I'm currently trying to extract tables from a ~28,000-page PDF (not a typo) and am running up against memory limits when I process the pages in a loop.
```python
import pandas as pd
import pdfplumber
from os import path

# Read in data
pdf = pdfplumber.open("data/my.pdf")

# Create settings for extraction
table_settings = {
    "vertical_strategy": "text",    # No lines on table
    "horizontal_strategy": "text",  # No lines on table
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}

COLUMNS = [
    'Work Date',
    'Employee Number',
    'Pay Type',
    'Hours',
    'Account Number',
    'Hourly Rate',
    'Gross',
    'Job Code',
    'Activity Code',
]

# Extract pages one at a time
for page_num, page in enumerate(pdf.pages):
    try:
        # Pull the table
        table = page.extract_table(table_settings)
        # Drop the header row
        table = table[1:]
        # Read into a DataFrame
        df = pd.DataFrame(table, columns=COLUMNS)
        # Output to CSV
        df.to_csv(path.join('data', 'output', str(page_num) + '.csv'))
    # There's bound to be a billion issues with the data
    except Exception as ex:
        print("Error on page", page_num)
        print(ex)
```
I'm handling this one page at a time because if the extraction bombs out at any point, I lose all my work. As the loop runs, memory consumption keeps growing until it hits about 5 GB (about all the headroom I have on my machine). I suspect a memory leak, but I'm not sure; I would have expected memory to be released as the loop iterates.
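One workaround worth trying, as a sketch rather than anything from pdfplumber's documentation: process the file in fixed-size batches, closing and reopening the PDF between batches so that any per-page objects cached on the `PDF` instance can be garbage-collected. The helper names (`batch_ranges`, `extract_in_batches`) and the assumption that discarding the `PDF` object frees the accumulated page caches are mine, not pdfplumber's.

```python
def batch_ranges(total_pages, batch_size):
    """Yield (start, stop) index pairs covering range(total_pages)."""
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def extract_in_batches(pdf_path, handle_page, batch_size=500):
    """Call handle_page(page_num, page) for every page, re-opening the
    PDF between batches to cap memory growth (hypothetical workaround)."""
    # Deferred import so batch_ranges stays usable without pdfplumber.
    import pdfplumber

    # First pass: just count the pages.
    with pdfplumber.open(pdf_path) as pdf:
        total = len(pdf.pages)

    for start, stop in batch_ranges(total, batch_size):
        # Re-open per batch: dropping the previous PDF object lets
        # Python reclaim whatever was cached during that batch.
        with pdfplumber.open(pdf_path) as pdf:
            for page_num in range(start, stop):
                handle_page(page_num, pdf.pages[page_num])
```

With a 28,000-page file and `batch_size=500`, at most ~500 pages' worth of cached objects should be alive at once; each page's CSV is still written independently, so a crash only costs the current page.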