extract data from tables not properly aligned #943

chanpreet90 · 2023-07-22T12:25:44Z

Hello,
I have a PDF with data in tabular format. The table spans several pages, but the columns are not properly aligned. How can I extract the data correctly?

sample.pdf

Thanks in advance

cmdlineluser · 2023-07-22T13:31:09Z

That's an interesting PDF

>>> page.rects
[]
>>> page.lines
[]
>>> page.curves
[]

Anybody know how those lines are drawn?

Perhaps you could use the column names as markers and create lines based on their position:

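import re

def extract_table_with_column_names(page, columns):
    # Escape regex metacharacters such as "(" and ")" in the column names
    columns_re = [re.escape(col) for col in columns]
    # Locate the header row by searching for all column names in order
    header = page.search(' '.join(columns_re))[0]

    # Crop from the header row down to the bottom of the page
    bbox = header['x0'], header['top'], header['x1'], page.height
    table_area = page.crop(bbox)

    # Use the left edge of each column name as a vertical separator
    explicit_vertical_lines = [
        table_area.search(column)[0]['x0'] for column in columns_re
    ]

    # Close the last column at the right edge of the header
    explicit_vertical_lines.append(header['x1'])

    return table_area.extract_table(dict(
        explicit_vertical_lines = explicit_vertical_lines,
        horizontal_strategy = 'text'
    ))


columns = ['Date', 'Description', 'Withdrawals (S)', 'Deposits (S)', 'Balance (S)']

extract_table_with_column_names(page, columns)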

[['Date', 'Description', 'Withdrawals (S)', 'Deposits (S)', 'Balance (S)'],
 ['', '', '', '', ''],
 ['', 'Opening Balance', '', '', '17,928.29'],
 ['', '', '', '', ''],
 ['?Jun', 'Fees/Dues YORK', '704.55', '', '17,223.74'],
 ['', '', '', '', ''],
 ['12Jun', 'e-Transfer received', '', '', ''],
 ['', 'CAbxE4hY', '', '2,600.00', '19,823.74'],
 ['', '', '', '', ''],
 ...
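
If the blank rows are not wanted, they can be filtered out afterwards. This is only a minimal sketch, assuming the table is a list of lists like the output above; `drop_blank_rows` is a hypothetical helper, not part of pdfplumber:

```python
def drop_blank_rows(rows):
    """Keep only rows that have at least one non-empty cell.

    Cells may be None or strings, so None is treated as an empty string.
    """
    return [
        row for row in rows
        if any((cell or '').strip() for cell in row)
    ]


# First few rows copied from the output above
rows = [
    ['Date', 'Description', 'Withdrawals (S)', 'Deposits (S)', 'Balance (S)'],
    ['', '', '', '', ''],
    ['', 'Opening Balance', '', '', '17,928.29'],
    ['', '', '', '', ''],
    ['?Jun', 'Fees/Dues YORK', '704.55', '', '17,223.74'],
]

cleaned = drop_blank_rows(rows)
```

Rows that continue a transaction onto a second line (such as the `CAbxE4hY` row) are left untouched, since merging them depends on how the statement is laid out.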

11 replies

88arvin Jul 26, 2023

Can you tell me how you found out that there is no space in this column name?
Withdrawals($)

cmdlineluser Jul 27, 2023

The output I showed came from a different tool, mutool trace: https://mupdf.readthedocs.io/en/latest/mutool-trace.html

PDF1

<ignore_text transform="1 0 0 -1 0 792">
    <span font="CZQUFL+Helvetica" wmode="0" bidi="0" trm="8.2 0 0 8.2">
        <g unicode="W" glyph="W" x="289.9726" y="636.76" adv=".94384768"/>
    </span>
</ignore_text>
<ignore_text transform="1 0 0 -1 0 792">
    <span font="CZQUFL+Helvetica" wmode="0" bidi="0" trm="8.2 0 0 8.2">
        <g unicode="i" glyph="i" x="297.8446" y="636.76" adv=".22216797"/>
        <g unicode="t" glyph="t" x="299.90364" y="636.76" adv=".27783204"/>
    </span>
</ignore_text>
<ignore_text transform="1 0 0 -1 0 792">
    <span font="CZQUFL+Helvetica" wmode="0" bidi="0" trm="7.9976 0 0 8.2">
        <g unicode="h" glyph="h" x="302.85" y="636.76" adv=".55615237"/>
        <g unicode="d" glyph="d" x="307.01676" y="636.76" adv=".55615237"/>
        <g unicode="r" glyph="r" x="311.1835" y="636.76" adv=".3330078"/>
        <g unicode="a" glyph="a" x="313.56678" y="636.76" adv=".55615237"/>
        <g unicode="w" glyph="w" x="317.73353" y="636.76" adv=".72216799"/>
        <g unicode="a" glyph="a" x="323.22789" y="636.76" adv=".55615237"/>
    </span>
</ignore_text>
<ignore_text transform="1 0 0 -1 0 792">
    <span font="CZQUFL+Helvetica" wmode="0" bidi="0" trm="8.2 0 0 8.2">
        <g unicode="l" glyph="l" x="328.35" y="636.76" adv=".22216797"/>
        <g unicode="s" glyph="s" x="330.0802" y="636.76" adv=".5"/>
        <g unicode=" " glyph="space" x="334.09" y="636.76" adv=".27783204"/>
    </span>
</ignore_text>
<ignore_text transform="1 0 0 -1 0 792">
    <span font="KCCOXI+Times-Roman" wmode="0" bidi="0" trm="7.9429 0 0 8.3">
        <g unicode="(" glyph="parenleft" x="363.41" y="340.25" adv=".3330078"/>
        <g unicode="S" glyph="S" x="365.77699" y="340.25" adv=".55615237"/>
        <g unicode=")" glyph="parenright" x="369.91523" y="340.25" adv=".3330078"/>
        <g unicode=" " glyph="space" x="372.2822" y="340.25" adv=".25"/>
    </span>
</ignore_text>

PDF2

<g unicode="W" glyph="W" x="290.88" y="636.72006" adv=".832"/>
<g unicode="i" glyph="i" x="297.536" y="636.72006" adv=".274"/>
<g unicode="t" glyph="t" x="299.728" y="636.72006" adv=".341"/>
<g unicode="h" glyph="h" x="302.456" y="636.72006" adv=".565"/>
<g unicode="d" glyph="d" x="306.97599" y="636.72006" adv=".549"/>
<g unicode="r" glyph="r" x="311.36799" y="636.72006" adv=".347"/>
<g unicode="a" glyph="a" x="314.14399" y="636.72006" adv=".512"/>
<g unicode="w" glyph="w" x="318.24" y="636.72006" adv=".699"/>
<g unicode="a" glyph="a" x="323.832" y="636.72006" adv=".512"/>
<g unicode="l" glyph="l" x="327.928" y="636.72006" adv=".281"/>
<g unicode="s" glyph="s" x="330.176" y="636.72006" adv=".481"/>
<g unicode="(" glyph="parenleft" x="335.28599" y="636.72006" adv=".312"/>
<g unicode="$" glyph="dollar" x="337.78199" y="636.72006" adv=".584"/>
<g unicode=")" glyph="parenright" x="342.45399" y="636.72006" adv=".31"/>
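
To compare the two traces without reading them glyph by glyph, the `unicode` attributes can be joined back into text. This is a rough sketch that assumes the trace looks like the excerpts above; `glyph_text` is a hypothetical helper, and it uses a regular expression rather than an XML parser because the excerpts are fragments without a single root element:

```python
import re
from html import unescape

# Matches the unicode attribute of each <g .../> glyph element
GLYPH_RE = re.compile(r'<g unicode="([^"]*)"')


def glyph_text(trace_fragment):
    """Concatenate the characters of all <g> elements in stream order."""
    return ''.join(unescape(m.group(1)) for m in GLYPH_RE.finditer(trace_fragment))


# Last glyphs of "Withdrawals" in PDF1: a space glyph follows the "s"
pdf1_fragment = '''
<g unicode="l" glyph="l" x="328.35" y="636.76" adv=".22216797"/>
<g unicode="s" glyph="s" x="330.0802" y="636.76" adv=".5"/>
<g unicode=" " glyph="space" x="334.09" y="636.76" adv=".27783204"/>
'''

# Same position in PDF2: "(" follows the "s" directly
pdf2_fragment = '''
<g unicode="s" glyph="s" x="330.176" y="636.72006" adv=".481"/>
<g unicode="(" glyph="parenleft" x="335.28599" y="636.72006" adv=".312"/>
<g unicode="$" glyph="dollar" x="337.78199" y="636.72006" adv=".584"/>
<g unicode=")" glyph="parenright" x="342.45399" y="636.72006" adv=".31"/>
'''

pdf1_text = glyph_text(pdf1_fragment)
pdf2_text = glyph_text(pdf2_fragment)
```

In these excerpts PDF1 has a space glyph after `Withdrawals`, while PDF2 goes straight from `s` to `(`, which is why the column name there has no space.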

88arvin Jul 28, 2023

Okay, thank you @cmdlineluser

88arvin Aug 10, 2023

Hi again @cmdlineluser
I have a similar kind of PDF, but the names of two of the columns are different. Can we just give alternative column names (either this one or that one) and leave the rest of the code the same?
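
One way this might be approached is to build a regex alternation for each column, so that either name matches. This is only a sketch: `column_patterns` is a hypothetical helper, and `'Debits (S)'` and `'Credits (S)'` are made-up alternative names used purely for illustration:

```python
import re


def column_patterns(columns):
    """Return one regex per column.

    Each entry in `columns` is either a single name or a tuple/list of
    alternative names for the same column.
    """
    patterns = []
    for col in columns:
        alternatives = [col] if isinstance(col, str) else list(col)
        escaped = [re.escape(name) for name in alternatives]
        # Non-capturing group so the alternation stays local to this column
        patterns.append('(?:' + '|'.join(escaped) + ')')
    return patterns


columns = [
    'Date',
    'Description',
    ('Withdrawals (S)', 'Debits (S)'),   # hypothetical alternative name
    ('Deposits (S)', 'Credits (S)'),     # hypothetical alternative name
    'Balance (S)',
]

columns_re = column_patterns(columns)
# Same header pattern as in the function above: column patterns joined by a space
header_pattern = ' '.join(columns_re)
```

The resulting `columns_re` list could then stand in for the `columns_re` built inside `extract_table_with_column_names`, with the rest of that function unchanged.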

88arvin Aug 15, 2023

I tried the code above on this PDF and it worked fine, but I am running into this issue.

Statement.pdf
