Dealing with merged table cells #79

blmoistawinde · 2018-08-23T12:15:52Z

I'm doing table extraction on a PDF generated by Word, and there are many merged cells in the table.

When I use extract_table, I get a table like this, with many None cells.

I want to praise this tool for its excellent accuracy on OCR! And I would be even more grateful if there were a new, easy function that handles these merged cells properly.

For example, I would expect the 3rd column to also be “指标” (indicator), the cell at loc[0,"项目"] to also be "项目" (item), the cell at loc[2,None] to also be 8.0, and so on. To achieve this now, I have to write complex fillna logic that accounts for the positions of the cells.
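
A minimal sketch of the kind of position-aware fill described above, written against the plain list-of-lists that extract_table returns. It assumes that cells swallowed by a merge come back as None, and it uses a simple heuristic: copy the value from the left neighbor, and if there is none, copy the value from the cell above. The function name fill_merged_cells and the sample rows are hypothetical, made up for illustration; they are not part of pdfplumber, and the heuristic cannot tell a vertical merge in a middle column from a horizontal one.

```python
def fill_merged_cells(table):
    """Return a copy of `table` with None cells filled from the left, then from above.

    Assumption: None marks a cell that was absorbed by a merged cell.
    """
    filled = [list(row) for row in table]  # copy so the input is not modified
    for r, row in enumerate(filled):
        for c, value in enumerate(row):
            if value is not None:
                continue
            if c > 0 and row[c - 1] is not None:
                # Treat as a horizontal merge: repeat the value to the left.
                row[c] = row[c - 1]
            elif r > 0:
                # Otherwise treat as a vertical merge: repeat the value above.
                row[c] = filled[r - 1][c]
    return filled


# Hypothetical rows shaped like the table discussed in this issue.
sample = [
    ["项目", "指标", None],
    [None, "A", "B"],
    ["x", "8.0", None],
]
result = fill_merged_cells(sample)
for row in result:
    print(row)
```

Because left-fill takes priority, a table whose merges are mostly vertical would need the two branches swapped, or a per-column rule.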

Thanks a lot!

jsvine · 2018-09-11T13:15:42Z

Thank you, @blmoistawinde. Better handling of merged cells is a good idea, and I'll consider it for a future release. In the meantime, could you provide the following?

A PDF containing merged cells (such as the one you have screenshotted above)
The code you are using to extract the table
A screenshot or text file demonstrating how you would prefer the extracted table to appear

blmoistawinde · 2018-09-12T01:36:27Z

Ok, below is a similar example I've made up with Microsoft Word.
example_merged.pdf

The code I used:

# coding=utf-8
import pdfplumber
import pandas as pd

# Open the example PDF and take its first page
pdf = pdfplumber.open("example_merged.pdf")
p0 = pdf.pages[0]

# Use the first extracted row as the header and the remaining rows as data
def table2df(table):
    return pd.DataFrame(table[1:], columns=table[0])

tables = table2df(p0.extract_table())
tables

And I got this.

The expected output is:

By the way, the final structure I'd like to extract from this complex table is:
A pivot table like this:

and a contingency table like this:

I wonder whether there is a more elegant solution for the merged cells, but I think the one I've presented above is reasonable, especially for the data[8.0].

Thanks a lot!

jsvine · 2018-09-21T03:01:13Z

Thanks! This will be a helpful reference when I attempt to improve the handling of merged cells.

binzhouchn · 2019-07-29T05:56:55Z

Hi jsvine, is there any page merge operation available?

jsvine · 2019-07-30T02:05:39Z

@binzhouchn No, currently there is not a page merge operation available.

rohaan2614 · 2022-02-24T07:37:14Z

Hi, can someone please confirm whether the merged-cell capability has been included in the releases after July 2020?

Thanks

jsvine · 2022-03-03T02:14:02Z

@rohaan2614 There is no specific merged-cell capability at the moment.

pribadihcr · 2023-07-12T09:53:40Z

+1

jvfd3 · 2024-02-18T15:09:00Z

+1

Guyodub · 2024-03-18T07:39:23Z

@jsvine Is the merged-cell capability available in 2024?

san2sreshta · 2024-04-08T21:57:57Z

@jsvine Is merged cell parsing available in 2024?

jsvine closed this as completed Jul 18, 2020

guo1017138 mentioned this issue Nov 1, 2020

Add support fill value to all the cells belongs to the same merged cell #302

Closed

Repository owner deleted a comment from rohaan2614 Apr 8, 2024
