Dealing with merged table cells #79
Comments
Thank you, @blmoistawinde. And better handling of merged cells is a good idea. I'll consider it for a future release. In the meantime, could you provide the following?
|
Ok, below is a similar example I've made up with Microsoft Word. The code I used:
# coding=utf-8
import pdfplumber
import pandas as pd
import numpy as np
pdf = pdfplumber.open("example_merged.pdf")
p0 = pdf.pages[0]
table2df = lambda table: pd.DataFrame(table[1:], columns=table[0])
tables = table2df(p0.extract_table())
tables
By the way, the final structure I'd like to extract from this complex table is:
Thanks a lot! |
Thanks! This will be a helpful reference when I attempt to improve the handling of merged cells. |
Hi jsvine, is there any page merge operation available? |
@binzhouchn No, currently there is not a page merge operation available. |
Hi, can someone please confirm the merged-cell capability has been included in the releases after July 2020? Thanks |
@rohaan2614 There is no specific merged-cell capability at the moment. |
+1 |
+1 |
@jsvine Is the merged-cell capability available in 2024? |
@jsvine Is merged cell parsing available in 2024? |
I'm doing table extraction on a PDF generated by Word, and there are many merged cells in the table.
Now if I use extract_table, I get a table like this, with many None cells.
I want to praise this tool for its excellent accuracy on OCR! And I would be even more grateful if there were a new, easy function that could deal with these merged cells properly.
For example, I would expect the 3rd column to also be “指标”, the cell at loc[0,"项目"] to also be "项目", and the cell at loc[2,None] to also be 8.0, etc. To achieve this now, I must use complex fillna logic that takes the positions of the cells into account.
Thanks a lot!
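For anyone needing a workaround in the meantime, here is a minimal sketch of the kind of fill logic described above, written in plain Python against the list-of-rows shape that extract_table returns. The helper names `fill_from_left` and `fill_from_above` and the sample data are hypothetical and not part of pdfplumber. It also assumes you already know which None cells come from horizontal merges and which from vertical ones, since the extracted table alone does not say.

```python
from typing import List, Optional

Row = List[Optional[str]]


def fill_from_left(row: Row) -> Row:
    """Copy each non-None value into the None cells to its right.

    Assumption: a None in this row marks a cell merged with its left neighbour.
    """
    filled: Row = []
    last: Optional[str] = None
    for cell in row:
        if cell is not None:
            last = cell
        filled.append(last)
    return filled


def fill_from_above(table: List[Row], columns: List[int]) -> List[Row]:
    """Copy values downward into None cells, only in the given column indexes.

    Assumption: a None in these columns marks a cell merged with the one above.
    Returns a new table; the input rows are not modified.
    """
    result = [list(row) for row in table]
    for col in columns:
        last: Optional[str] = None
        for row in result:
            if row[col] is not None:
                last = row[col]
            else:
                row[col] = last
    return result


# Hypothetical data shaped like extract_table() output (not the table from this issue).
sample = [
    ["item", "metric", None],   # header: "metric" spans two columns
    ["A", "1.0", "2.0"],
    [None, "3.0", "4.0"],       # first column: "A" spans two rows
]

header = fill_from_left(sample[0])
body = fill_from_above(sample[1:], columns=[0])
```

The filled header and body can then be passed to pd.DataFrame(body, columns=header) as in the code I posted. Which rows get filled from the left and which columns from above still has to be decided per table, so this is a stopgap rather than real merged-cell detection.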