extract data from tables not properly aligned #943
chanpreet90 started this conversation in "Ask for help with specific PDFs"
That's an interesting PDF. The page reports no rects, lines, or curves:

>>> page.rects
[]
>>> page.lines
[]
>>> page.curves
[]

Does anybody know how those lines are drawn?

Perhaps you could use the column names as markers and create lines based on their position:

import re

def extract_table_with_column_names(page, columns):
    # Escape the names so characters like "(" and ")" match literally.
    columns_re = [re.escape(col) for col in columns]
    # Locate the header row by searching for all column names in sequence.
    header = page.search(' '.join(columns_re))[0]
    # The table area runs from the header down to the bottom of the page.
    bbox = header['x0'], header['top'], header['x1'], page.height
    table_area = page.crop(bbox)
    # Each column starts at the left edge of its name in the header.
    explicit_vertical_lines = [
        table_area.search(column)[0]['x0'] for column in columns_re
    ]
    # Close the last column at the right edge of the header.
    explicit_vertical_lines.append(header['x1'])
    return table_area.extract_table(dict(
        explicit_vertical_lines=explicit_vertical_lines,
        horizontal_strategy='text'
    ))

columns = ['Date', 'Description', 'Withdrawals (S)', 'Deposits (S)', 'Balance (S)']
extract_table_with_column_names(page, columns)

Output:

[['Date', 'Description', 'Withdrawals (S)', 'Deposits (S)', 'Balance (S)'],
['', '', '', '', ''],
['', 'Opening Balance', '', '', '17,928.29'],
['', '', '', '', ''],
['?Jun', 'Fees/Dues YORK', '704.55', '', '17,223.74'],
['', '', '', '', ''],
['12Jun', 'e-Transfer received', '', '', ''],
['', 'CAbxE4hY', '', '2,600.00', '19,823.74'],
['', '', '', '', ''],
...
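The output above still contains blank spacer rows, and the e-Transfer entry is split across two rows. A possible clean-up step is sketched below. The helper name `clean_rows` is hypothetical and not part of pdfplumber, and the merge rule is an assumption based only on the rows shown here: a row without a date is treated as a continuation when the row above it has no balance yet. Other statements may need a different rule.

```python
def clean_rows(rows):
    """Drop blank rows and fold continuation rows into the row above."""
    cleaned = []
    for row in rows:
        # extract_table can return None for empty cells, so normalise first.
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # spacer row produced by the 'text' strategy
        prev = cleaned[-1] if cleaned else None
        # Assumption: no date on this row and no balance on the previous
        # row means this row continues the previous entry.
        if prev is not None and not cells[0] and not prev[-1]:
            cleaned[-1] = [" ".join(filter(None, pair)) for pair in zip(prev, cells)]
        else:
            cleaned.append(cells)
    return cleaned


# Rows copied from the output shown in this thread.
rows = [
    ['Date', 'Description', 'Withdrawals (S)', 'Deposits (S)', 'Balance (S)'],
    ['', '', '', '', ''],
    ['', 'Opening Balance', '', '', '17,928.29'],
    ['', '', '', '', ''],
    ['?Jun', 'Fees/Dues YORK', '704.55', '', '17,223.74'],
    ['', '', '', '', ''],
    ['12Jun', 'e-Transfer received', '', '', ''],
    ['', 'CAbxE4hY', '', '2,600.00', '19,823.74'],
    ['', '', '', '', ''],
]

for row in clean_rows(rows):
    print(row)
```

With the sample rows, the "Opening Balance" line stays separate because the header row above it has a non-empty last cell, while the "CAbxE4hY" line is joined onto the 12Jun entry.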
Hello,
I have a PDF with data in tabular format. The table spans several pages, but the columns are not properly aligned. How can I extract the data correctly?
sample.pdf
Thanks in advance