In a full-lined table, two lines in one cell are recognized as two items #285

Closed
playgithub opened this issue Oct 16, 2020 · 15 comments
Labels
troubleshooting Issues that seek assistance with parsing specific PDFs

Comments

@playgithub

playgithub commented Oct 16, 2020

image

The result for the row is:
["以公允价值计量且其变动计入", None, None, None, None, None]
["当期损益的金融资产", "四(3)", "902,072", "—", "-", "—"]

The expected result is:
["以公允价值计量且其变动计入当期损益的金融资产", None, None, None, None, None]

Can this be solved with parsing parameters, and if so, how? Thanks.

@playgithub playgithub added the troubleshooting Issues that seek assistance with parsing specific PDFs label Oct 16, 2020
@samkit-jain
Collaborator

Hi @playgithub, I appreciate your interest in the library. Could you please provide the PDF, the table settings, the pdfplumber version, and a sample of reproducible code so that we can look into it?

@samkit-jain samkit-jain added the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Oct 16, 2020
@playgithub
Author

playgithub commented Oct 16, 2020

pdfplumber 0.5.23

pdf: http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=9900005970&stockCode=601668&announcementId=1207611180&announcementTime=2020-04-25
(click the button on the top right to download the pdf, which has a download icon)

code:

import pdfplumber

path = '中国建筑:2019年年度报告.PDF'

pdf = pdfplumber.open(path)

txt_file = open("result.txt", "wb")

for page in pdf.pages:
    for table in page.extract_tables():
        # print(table)
        for row in table:
            print(row)
            row = [(item or "") for item in row]
            row_str = "|".join(row) + "\n"
            txt_file.write(row_str.encode())
        print('\n------------------------------------------------------------\n')

txt_file.close()

pdf.close()

@samkit-jain
Collaborator

Thanks for sharing the requested details, @playgithub. The code you shared can be adjusted as follows:

import pdfplumber

path = '中国建筑:2019年年度报告.PDF'

pdf = pdfplumber.open(path)

txt_file = open("result.txt", "wb")

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 8,
}

for page in pdf.pages:
    for table in page.extract_tables(table_settings=ts):
        merged_table = []
        for row in table:
            # A row containing None is treated as a continuation of the row above it
            if None in row and merged_table:
                merged_table[-1] = [(i or "") + j if j is not None else i for i, j in zip(merged_table[-1], row)]
            else:
                merged_table.append(row)
        for row in merged_table:
            print(row)
            row = [(item or "") for item in row]
            row_str = "|".join(row) + "\n"
            txt_file.write(row_str.encode())
        print('\n------------------------------------------------------------\n')

txt_file.close()

pdf.close()

Note that I have specified the table settings to use, and that I only tested this on page 128. Drawing the detected table on that page gives the following:
image
As you can see, certain cells containing multiline text are split by hidden lines, which causes the library to return 2 rows instead of 1. To account for that, I added a new variable, merged_table, which is similar to table but combines those erroneously split rows.
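The merging step described above can also be sketched as a standalone function. This is only an illustration under assumptions: the name merge_split_rows and the sample rows are hypothetical, and the rule (any row containing None is folded into the row above) is a heuristic that will also merge rows that legitimately have empty cells.

```python
from typing import List, Optional

Row = List[Optional[str]]


def merge_split_rows(table: List[Row]) -> List[Row]:
    """Fold every row that contains None into the row above it.

    Hypothetical helper: text is concatenated without a separator,
    mirroring the `i + j` expression in the loop above.
    """
    merged: List[Row] = []
    for row in table:
        if None in row and merged:
            previous = merged[-1]
            merged[-1] = [
                prev if cell is None else (prev or "") + cell
                for prev, cell in zip(previous, row)
            ]
        else:
            # First row, or a complete row: start a new output row
            merged.append(list(row))
    return merged


# Invented sample data, not taken from the PDF in this issue
sample = [
    ["Cash", "1", "100"],
    ["Financial assets at fair", "2", "200"],
    ["value through profit", None, None],
]
for merged_row in merge_split_rows(sample):
    print(merged_row)
```

Because the rule cannot tell a split row from a row with genuinely empty cells, it is worth checking the output against the visible table before relying on it.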

@playgithub
Author

playgithub commented Oct 16, 2020

Does this mean that if a row contains at least one None value, the row is merged into the row above?
If so, that is not the result I want. I want the rows separated by the visible lines only.

@playgithub
Author

playgithub commented Oct 16, 2020

image

I'm curious: where do the lines in the yellow rectangles come from? They are not visible when I read the PDF. Can these extra lines be excluded when parsing the table?

@samkit-jain
Collaborator

@playgithub Yes, it means that the row containing None would be merged to the row above. To ignore the hidden lines, you can use the following code:

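def keep_visible_lines(obj):
    # Drop rects whose fill color is not 0; keep every other object unchanged
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

# p is the pdfplumber Page being processed
p = p.filter(keep_visible_lines)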

Explanation of the above code: on page 128, I noticed that the hidden lines had "non_stroking_color" == 1, so I created a filter function that keeps only the visible lines.
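For context, here is a minimal sketch of how the filter could be combined with table extraction from start to finish. The function name extract_table_without_hidden_lines, the file name, and the page index in the usage comment are assumptions for illustration. The comparison with 0 matches what was observed in this particular PDF; the way pdfplumber reports colors may differ between versions and files, so inspect the rect objects in your own document first.

```python
def keep_visible_lines(obj):
    """Keep all objects except rects whose fill color is not 0."""
    if obj["object_type"] == "rect":
        return obj["non_stroking_color"] == 0
    return True


def extract_table_without_hidden_lines(path, page_index, table_settings=None):
    """Hypothetical wrapper: filter one page, then extract its table."""
    # Imported here so the predicate above can be used without the library
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index].filter(keep_visible_lines)
        return page.extract_table(table_settings or {})


# Usage (assumes "report.pdf" exists and that page 128 is a 1-based number):
# table = extract_table_without_hidden_lines("report.pdf", 127)
```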

The image result
image

The table result

['资产', '附注', '2019年12月31日 \n合并', '2018年12月31日 \n合并', '2019年12月31日 \n公司', '2018年12月31日 \n公司']
['流动资产', '', '', '', '', '']
['货币资金', '四(1)', '292,441,419', '317,500,675', '21,561,651', '24,120,165']
['交易性金融资产', '四(2)', '902,072', '—', '-', '—']
['以公允价值计量且其变动计入\n当期损益的金融资产', '四(3)', '—', '4,622,633', '—', '-']
['应收票据', '四(4)', '26,918,443', '21,438,282', '102,188', '196,649']
['应收账款', '四(5)、十八(1)', '153,961,875', '167,552,941', '28,086,588', '25,461,372']
['应收款项融资', '四(6)', '3,674,166', '—', '6,100', '—']
['预付款项', '四(7)', '55,084,548', '48,611,609', '7,039,642', '7,536,344']
['其他应收款', '四(8)、十八(2)', '53,186,521', '56,489,193', '22,569,711', '21,029,153']
['存货', '四(9)', '578,917,620', '634,967,094', '340,527', '8,626,605']
['合同资产', '四(10)、十八(3)', '150,975,326', '9,078,328', '7,495,304', '—']
['持有待售资产', '', '-', '100', '-', '-']
['一年内到期的非流动资产', '四(11)', '57,463,704', '53,517,559', '4,291,107', '3,267,987']
['其他流动资产', '四(12)', '87,980,288', '48,230,077', '4,132,472', '3,418,616']
['流动资产合计', '', '1,461,505,982', '1,362,008,491', '95,625,290', '93,656,891']
['', '', '', '', '', '']
['非流动资产', '', '', '', '', '']
['债权投资', '四(13)', '17,759,804', '—', '8,115,503', '—']
['可供出售金融资产', '四(14)', '—', '10,049,736', '—', '1,310,892']
['其他债权投资', '', '612,106', '—', '-', '—']
['长期应收款', '四(15)', '164,825,662', '281,480,771', '-', '21,270,975']
['长期股权投资', '四(16)、十八(4)', '74,916,901', '65,993,999', '170,723,729', '163,531,321']
['其他权益工具投资', '四(17)', '8,069,043', '—', '1,837,882', '—']
['其他非流动金融资产', '', '50,510', '—', '-', '—']
['投资性房地产', '四(18)', '76,301,157', '68,650,183', '621,752', '652,407']
['固定资产', '四(19)', '37,554,496', '35,679,994', '839,905', '830,809']
['在建工程', '四(20)', '10,085,813', '8,293,383', '53,957', '52,410']
['无形资产', '四(21)', '16,409,157', '11,594,195', '107,442', '91,534']
['商誉', '四(22)', '2,347,428', '2,293,058', '-', '-']
['长期待摊费用', '', '935,800', '736,192', '3,037', '16,462']
['递延所得税资产', '四(23)', '15,129,128', '12,639,243', '333,973', '778,891']
['其他非流动资产', '四(24)', '147,948,942', '2,421,053', '7,879,408', '70,345']
['非流动资产合计', '', '572,945,947', '499,831,807', '190,516,588', '188,606,046']
['', '', '', '', '', '']

@samkit-jain samkit-jain removed the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Oct 16, 2020
@playgithub
Author

playgithub commented Oct 16, 2020

Thanks, it's cool.

@situchen

image
image

How can I extract the text of a merged cell as a single value? It is currently split apart. For example, "Base" and "Material" come out separately, but I want them joined together. The expected result is "Base Material".

@samkit-jain
Collaborator

Hi @situchen Could you please provide some more information? Like the table extraction strategy you are using and the PDF

@situchen

situchen commented Apr 12, 2021

@samkit-jain

def extract_table_cell_text(page):
    # Remove duplicated characters first; fall back to the raw page on failure
    try:
        page_object = page.dedupe_chars()
    except Exception as err:
        print(f'remove chars, error msg is {err}')
        page_object = page

    row_num = 0
    pdf_result_list = []
    # No table settings are passed, so the default extraction strategy is used
    for table in page_object.extract_tables():
        for row in table:
            cell_num = 1
            for cell in row:
                if cell:
                    cell = cell.strip()
                    pdf_result_list.append([1, row_num, cell_num, row_num, cell_num, ['加工要求说明', 'Text', cell]])
                    cell_num += 1
            row_num += 1
    return pdf_result_list

Thank you!
test_table_merged.pdf

@situchen

The current result is:
image

The result I expect is:
image

@samkit-jain
Collaborator

Thanks for sharing the PDF, @situchen. Using the default table extraction settings, this is the result I get:
image
You'll notice that there is a hidden horizontal line separator in the "Base Material" cell, which is why the text comes out in 2 cells.

To discard those invisible lines, you may use the code at #311 (comment). It will give you the following result:
image

@situchen

Thank you for your answers, thank you very much!

@situchen

When I applied the suggested filter, I found that the lines inside the yellow box were lost. Please see the picture for details:
image

test_table_merged_pdf_2.pdf

@jsvine
Owner

jsvine commented Apr 20, 2021

To me, this seems like a difficult problem — you want to ignore non-black lines in most cases ... but not all of them. It may be possible to do so, but you'll have to find a set of rules for what non_stroking_color will lead rects to be retained or filtered out.
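One way to experiment with such rules is a small filter factory that takes the fill colors to treat as hidden. This is a sketch under assumptions, not a tested solution for the PDFs in this thread: the names as_color_tuple and make_rect_filter are hypothetical, the example colors are guesses, and it assumes non_stroking_color arrives as None, a single value, or a list/tuple of values.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def as_color_tuple(color: Any) -> Tuple[Any, ...]:
    """Normalize a color (None, a single value, or a sequence) to a tuple."""
    if color is None:
        return ()
    if isinstance(color, (list, tuple)):
        return tuple(color)
    return (color,)


def make_rect_filter(hidden_colors: Iterable[Any]) -> Callable[[Dict[str, Any]], bool]:
    """Build a predicate that drops rects filled with any of hidden_colors."""
    hidden = {as_color_tuple(c) for c in hidden_colors}

    def keep(obj: Dict[str, Any]) -> bool:
        # Non-rect objects (chars, curves, ...) always pass through
        if obj.get("object_type") != "rect":
            return True
        return as_color_tuple(obj.get("non_stroking_color")) not in hidden

    return keep


# Assumed example: treat grayscale white and RGB white fills as hidden
keep = make_rect_filter([1, (1, 1, 1)])
# filtered_page = page.filter(keep)  # page would be a pdfplumber Page
```

Listing the colors to drop, rather than keeping only black, lets other non-black rects that do act as visible borders survive. The right list still has to be found by inspecting each PDF.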
