Dealing with merged table cells #79

Closed
blmoistawinde opened this issue Aug 23, 2018 · 11 comments

Comments

@blmoistawinde

I'm doing table extraction on a PDF generated by Word, and there are many merged cells in the table.
1

Now if I use extract_table, I would get a table like this, with many None cells.
2
I want to praise this tool for its excellent accuracy on text extraction! I would be even more grateful if there were a new, simple function that handled these merged cells properly.

For example, I would expect the 3rd column to also be “指标” ("indicator"), the cell at loc[0,"项目"] to also be "项目" ("item"), the cell at loc[2,None] to also be 8.0, and so on. To achieve this now, I have to write complex fillna logic that accounts for the position of each cell.
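The position-aware filling described above can be sketched in plain Python, without pandas. This is only an illustration under stated assumptions: the helper names `fill_right` and `fill_down` are hypothetical (they are not part of pdfplumber), the sample data is made up rather than taken from the screenshots, and it assumes that `extract_table` marks the covered part of a merged cell with `None`.

```python
def fill_right(row):
    """Copy each value into the None cells that follow it in the same row."""
    filled, last = [], None
    for value in row:
        if value is None:
            value = last
        filled.append(value)
        last = value
    return filled


def fill_down(table):
    """Copy each value into the None cells directly below it."""
    filled, previous = [], None
    for row in table:
        new_row = list(row)
        if previous is not None:
            for i, value in enumerate(new_row):
                if value is None and i < len(previous):
                    new_row[i] = previous[i]
        filled.append(new_row)
        previous = new_row
    return filled


# Made-up data shaped like extract_table() output; not the table from this issue.
sample = [
    ["项目", "指标", None],  # "指标" is merged across two columns
    [None, "A", "B"],        # "项目" is merged across two rows
    ["x", "1.0", "2.0"],
]

# Assumption: the header row is merged horizontally, everything else vertically.
header = fill_right(sample[0])
filled = fill_down([header] + sample[1:])
print(filled)
```

The caller has to decide which direction applies to which rows, because the text grid alone does not say whether a `None` belongs to the cell on its left or the cell above it. That ambiguity is the reason the fill logic gets complicated.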

Thanks a lot!

@jsvine
Owner

jsvine commented Sep 11, 2018

Thank you, @blmoistawinde. Better handling of merged cells is a good idea, and I'll consider it for a future release. In the meantime, could you provide the following?

  • A PDF containing merged cells (such as the one you have screenshotted above)
  • The code you are using to extract the table
  • A screenshot or text file demonstrating how you would prefer the extracted table to appear

@blmoistawinde
Author

Ok, below is a similar example I've made up with Microsoft Word.
example_merged.pdf
image

The code I used:

import pdfplumber
import pandas as pd

def table2df(table):
    # Use the first extracted row as the header and the remaining rows as data.
    return pd.DataFrame(table[1:], columns=table[0])

# Extract the first table found on the first page of the PDF.
with pdfplumber.open("example_merged.pdf") as pdf:
    p0 = pdf.pages[0]
    tables = table2df(p0.extract_table())
tables

And I got this.
image

The expected output is:
image

By the way, the final structures I'd like to extract from this complex table are:
A pivot table like this:
image
and a contingency table like this:
image
I wonder whether there is a more elegant solution for the merged cells, but I think the one I've presented above is reasonable, especially for the data[8.0].
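One possible alternative to direction-based filling is to use cell geometry: for every missing grid position, look up the real cell whose bounding box covers it and copy that cell's text. The sketch below is a hedged illustration, not pdfplumber functionality. The function `fill_from_bboxes` and the sample coordinates are hypothetical, and it assumes bounding boxes of the form `(x0, top, x1, bottom)` with `None` at positions covered by a merged cell.

```python
def fill_from_bboxes(text_grid, bbox_grid, eps=0.1):
    """Fill None positions with the text of the cell whose bbox covers them.

    text_grid: rows of cell text (None where no cell starts at that position).
    bbox_grid: rows of (x0, top, x1, bottom) tuples, None at the same positions.
    """
    n_cols = max(len(row) for row in bbox_grid)
    col_x0 = [None] * n_cols            # left edge of each column
    row_top = [None] * len(bbox_grid)   # top edge of each row
    cells = []                          # (bbox, text) for every real cell
    for r, row in enumerate(bbox_grid):
        for c, bbox in enumerate(row):
            if bbox is not None:
                col_x0[c], row_top[r] = bbox[0], bbox[1]
                cells.append((bbox, text_grid[r][c]))

    filled = [list(row) for row in text_grid]
    for r, row in enumerate(bbox_grid):
        for c, bbox in enumerate(row):
            if bbox is not None or col_x0[c] is None or row_top[r] is None:
                continue
            # A point just inside the top-left corner of the missing position.
            px, py = col_x0[c] + eps, row_top[r] + eps
            for (x0, top, x1, bottom), text in cells:
                if x0 <= px < x1 and top <= py < bottom:
                    filled[r][c] = text
                    break
    return filled


# Made-up geometry: "项目" spans two rows, "指标" spans two columns.
bbox_grid = [
    [(0, 0, 10, 20), (10, 0, 30, 10), None],
    [None, (10, 10, 20, 20), (20, 10, 30, 20)],
]
text_grid = [
    ["项目", "指标", None],
    [None, "A", "B"],
]
result = fill_from_bboxes(text_grid, bbox_grid)
print(result)
```

If I read the pdfplumber API correctly, the two inputs could be built from a `Table` object returned by `page.find_tables()`, using `table.extract()` for `text_grid` and `[row.cells for row in table.rows]` for `bbox_grid`. Treat that wiring as an assumption to verify against the installed version.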

Thanks a lot!

@jsvine
Owner

jsvine commented Sep 21, 2018

Thanks! This will be a helpful reference when I attempt to improve the handling of merged cells.

@binzhouchn

Hi jsvine, is there any page merge operation available?

@jsvine
Owner

jsvine commented Jul 30, 2019

@binzhouchn No, currently there is not a page merge operation available.

@rohaan2614

Hi, can someone please confirm whether the merged-cell capability has been included in the releases after July 2020?

Thanks

@jsvine
Owner

jsvine commented Mar 3, 2022

@rohaan2614 There is no specific merged-cell capability at the moment.

@pribadihcr

+1

@jvfd3

jvfd3 commented Feb 18, 2024

+1

@Guyodub

Guyodub commented Mar 18, 2024

@jsvine Is the merged-cell capability available in 2024?

@san2sreshta

@jsvine Is merged cell parsing available in 2024?

Repository owner deleted a comment from rohaan2614 Apr 8, 2024

8 participants