In full-lined table, two lines in one cell is recognized as two items #285

playgithub · 2020-10-16T07:01:43Z

The result for the row is:
["以公允价值计量且其变动计入", None, None, None, None, None]
["当期损益的金融资产", "四(3)", "902,072", "—", "-", "—"]

What I expect is:
["以公允价值计量且其变动计入当期损益的金融资产", None, None, None, None, None]

Can this be solved with the table-extraction parameters? If so, how? Thanks.

samkit-jain · 2020-10-16T07:17:19Z

Hi @playgithub, I appreciate your interest in the library. Could you please provide the PDF, the table settings, the pdfplumber version, and sample code that reproduces the problem so that we can look into it?

playgithub · 2020-10-16T07:45:13Z

pdfplumber 0.5.23

pdf: http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=9900005970&stockCode=601668&announcementId=1207611180&announcementTime=2020-04-25
(click the button on the top right to download the pdf, which has a download icon)

code:

import pdfplumber

path = '中国建筑：2019年年度报告.PDF'

pdf = pdfplumber.open(path)

txt_file = open("result.txt", "wb")

for page in pdf.pages:
    for table in page.extract_tables():
        # print(table)
        for row in table:
            print(row)
            row = [(item or "") for item in row]
            row_str = "|".join(row) + "\n"
            txt_file.write(row_str.encode())
        print('\n------------------------------------------------------------\n')

txt_file.close()

pdf.close()

samkit-jain · 2020-10-16T12:16:11Z

Thanks for sharing the requested details, @playgithub. The code you shared can be adjusted as follows:

import pdfplumber

path = '中国建筑：2019年年度报告.PDF'

pdf = pdfplumber.open(path)

txt_file = open("result.txt", "wb")

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 8,
}

for page in pdf.pages:
    for table in page.extract_tables(table_settings=ts):
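        merged_table = []
        for row in table:
            # A row containing None is treated as a continuation of the row above.
            # The merged_table check avoids an IndexError when the first row contains None.
            if None in row and merged_table:
                merged_table[-1] = [(i or "") + j if j is not None else i for i, j in zip(merged_table[-1], row)]
            else:
                merged_table.append(row)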
        for row in merged_table:
            print(row)
            row = [(item or "") for item in row]
            row_str = "|".join(row) + "\n"
            txt_file.write(row_str.encode())
        print('\n------------------------------------------------------------\n')

txt_file.close()

pdf.close()

Note that I have specified the table settings to use. I only tested this on page 128. Drawing the detected table on that page gives the following result:

As you can see, certain cells containing multiline text are separated by hidden lines and that is causing the library to return 2 rows instead of 1. To account for that, I added a new variable merged_table that is similar to table but correctly combines those erroneously split rows together.
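The merging step described above can also be sketched as a standalone function that needs no PDF. The name `merge_split_rows` and the sample rows are hypothetical and only illustrate the heuristic; a row is treated as a continuation of the previous one whenever it contains None.

```python
def merge_split_rows(table):
    """Merge rows that contain None into the row above them.

    Heuristic only: it assumes that a row with at least one None cell
    is the lower half of a cell split by a hidden line.
    """
    merged = []
    for row in table:
        if merged and None in row:
            previous = merged[-1]
            # Concatenate cell by cell; a None cell leaves the upper cell unchanged.
            merged[-1] = [
                (upper or "") + lower if lower is not None else upper
                for upper, lower in zip(previous, row)
            ]
        else:
            merged.append(list(row))
    return merged


# Hypothetical rows, shaped like a cell that was split in two.
sample = [
    ["以公允价值计量且其变动计入", "四(3)", "100"],
    ["当期损益的金融资产", None, None],
    ["应收票据", "四(4)", "200"],
]
result = merge_split_rows(sample)
```

Because the rule looks only at None values, it also merges rows that are legitimately separate but happen to have empty cells, so it suits tables where every real row is fully populated.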

playgithub · 2020-10-16T12:48:15Z

Does this mean that if a row contains at least one None value, it is merged into the row above?
If so, that is not the result I want. I want the rows separated by the visible lines.

playgithub · 2020-10-16T13:07:32Z

I'm curious where the lines in the yellow rectangles come from. They are not visible when I read the PDF. Can these extra lines be excluded when parsing the table?

samkit-jain · 2020-10-16T15:20:49Z

@playgithub Yes, it means that a row containing None is merged into the row above. To ignore the hidden lines, you can use the following code:

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

# p is the pdfplumber page being processed
p = p.filter(keep_visible_lines)

Explanation of the above code: on page 128, I noticed that the hidden lines had "non_stroking_color" == 1, so I created a filter function that keeps only the rects whose "non_stroking_color" is 0, i.e. the visible lines.
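For reference, here is a slightly more defensive sketch of the same filter together with the extraction call. It assumes that the fill color may be reported either as a bare number or as a sequence such as (0,), which can vary between pdfplumber versions and PDFs and is worth checking on your own file. The helper names `as_color_tuple` and `extract_tables_without_hidden_rects` are hypothetical.

```python
def as_color_tuple(color):
    """Normalize a color so that 0, [0] and (0,) all compare equal."""
    if color is None:
        return None
    if isinstance(color, (int, float)):
        return (color,)
    return tuple(color)


def keep_visible_lines(obj, visible_fills=((0,),)):
    """Keep every non-rect object, and only rects whose fill is in visible_fills."""
    if obj.get("object_type") != "rect":
        return True
    return as_color_tuple(obj.get("non_stroking_color")) in visible_fills


def extract_tables_without_hidden_rects(path, page_index):
    """Open the PDF, filter one page, and extract its tables."""
    import pdfplumber  # imported lazily; only this helper needs it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index].filter(keep_visible_lines)
        return page.extract_tables()
```

The default `visible_fills` mirrors the observation above that visible lines had a fill of 0 on page 128; other documents may need a different set of values.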

The image result

The table result

['资产', '附注', '2019年12月31日 \n合并', '2018年12月31日 \n合并', '2019年12月31日 \n公司', '2018年12月31日 \n公司']
['流动资产', '', '', '', '', '']
['货币资金', '四(1)', '292,441,419', '317,500,675', '21,561,651', '24,120,165']
['交易性金融资产', '四(2)', '902,072', '—', '-', '—']
['以公允价值计量且其变动计入\n当期损益的金融资产', '四(3)', '—', '4,622,633', '—', '-']
['应收票据', '四(4)', '26,918,443', '21,438,282', '102,188', '196,649']
['应收账款', '四(5)、十八(1)', '153,961,875', '167,552,941', '28,086,588', '25,461,372']
['应收款项融资', '四(6)', '3,674,166', '—', '6,100', '—']
['预付款项', '四(7)', '55,084,548', '48,611,609', '7,039,642', '7,536,344']
['其他应收款', '四(8)、十八(2)', '53,186,521', '56,489,193', '22,569,711', '21,029,153']
['存货', '四(9)', '578,917,620', '634,967,094', '340,527', '8,626,605']
['合同资产', '四(10)、十八(3)', '150,975,326', '9,078,328', '7,495,304', '—']
['持有待售资产', '', '-', '100', '-', '-']
['一年内到期的非流动资产', '四(11)', '57,463,704', '53,517,559', '4,291,107', '3,267,987']
['其他流动资产', '四(12)', '87,980,288', '48,230,077', '4,132,472', '3,418,616']
['流动资产合计', '', '1,461,505,982', '1,362,008,491', '95,625,290', '93,656,891']
['', '', '', '', '', '']
['非流动资产', '', '', '', '', '']
['债权投资', '四(13)', '17,759,804', '—', '8,115,503', '—']
['可供出售金融资产', '四(14)', '—', '10,049,736', '—', '1,310,892']
['其他债权投资', '', '612,106', '—', '-', '—']
['长期应收款', '四(15)', '164,825,662', '281,480,771', '-', '21,270,975']
['长期股权投资', '四(16)、十八(4)', '74,916,901', '65,993,999', '170,723,729', '163,531,321']
['其他权益工具投资', '四(17)', '8,069,043', '—', '1,837,882', '—']
['其他非流动金融资产', '', '50,510', '—', '-', '—']
['投资性房地产', '四(18)', '76,301,157', '68,650,183', '621,752', '652,407']
['固定资产', '四(19)', '37,554,496', '35,679,994', '839,905', '830,809']
['在建工程', '四(20)', '10,085,813', '8,293,383', '53,957', '52,410']
['无形资产', '四(21)', '16,409,157', '11,594,195', '107,442', '91,534']
['商誉', '四(22)', '2,347,428', '2,293,058', '-', '-']
['长期待摊费用', '', '935,800', '736,192', '3,037', '16,462']
['递延所得税资产', '四(23)', '15,129,128', '12,639,243', '333,973', '778,891']
['其他非流动资产', '四(24)', '147,948,942', '2,421,053', '7,879,408', '70,345']
['非流动资产合计', '', '572,945,947', '499,831,807', '190,516,588', '188,606,046']
['', '', '', '', '', '']

playgithub · 2020-10-16T16:02:40Z

Thanks, it's cool.

situchen · 2021-04-12T05:47:14Z

How can I extract the text of a merged cell as a single item? It is currently split, for example into "Base" and "Material". I would like these two to be joined; the expected result is "Base Material".

samkit-jain · 2021-04-12T11:28:54Z

Hi @situchen, could you please provide some more information, such as the table extraction strategy you are using and the PDF?

situchen · 2021-04-12T11:47:07Z

@samkit-jain

def extract_table_cell_text(page):
    # Remove duplicated characters first; fall back to the original page on failure.
    try:
        page_object = page.dedupe_chars()
    except Exception as err:
        print(f'remove chars, error msg is {err}')
        page_object = page

    row_num = 0
    pdf_result_list = []
    for table in page_object.extract_tables():
        for row in table:
            cell_num = 1
            for cell in row:
                if cell:
                    cell = cell.strip()
                    # '加工要求说明' means "processing requirements description"
                    pdf_result_list.append([1, row_num, cell_num, row_num, cell_num, ['加工要求说明', 'Text', cell]])
                    cell_num += 1
            row_num += 1
    # Return the collected cells (the original snippet built the list but did not return it).
    return pdf_result_list

Thank you!
test_table_merged.pdf

situchen · 2021-04-12T11:59:03Z

The current result is:

The result I expect is:

samkit-jain · 2021-04-12T12:19:45Z

Thanks for sharing the PDF, @situchen. Using the default table extraction settings, this is the result I get:

You'll notice that there is a hidden horizontal line separator in the "Base Material" cell, and that is why the text comes out in 2 cells.

To discard those invisible lines, you may use the code at #311 (comment). It gives the following result:

situchen · 2021-04-12T13:12:53Z

Thank you for your answers, thank you very much!

situchen · 2021-04-13T02:24:55Z

When I applied this filter, I found that the lines inside the yellow frame are lost as well. Please see the picture for details.

test_table_merged_pdf_2.pdf

jsvine · 2021-04-20T13:27:11Z

To me, this seems like a difficult problem: you want to ignore non-black lines in most cases, but not in all of them. It may be possible to do so, but you'll have to find a set of rules for which non_stroking_color values lead rects to be retained or filtered out.
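One way to start looking for such rules is to count which fill colors the rects on a page actually use. The sketch below is only a starting point; `count_rect_fill_colors` is a hypothetical helper that works on plain dicts shaped like the entries of pdfplumber's `page.rects`, and the sample values are made up.

```python
from collections import Counter


def count_rect_fill_colors(rects):
    """Count how often each non_stroking_color value occurs among rects."""
    counts = Counter()
    for rect in rects:
        color = rect.get("non_stroking_color")
        # Lists are not hashable, so store sequences as tuples.
        if isinstance(color, (list, tuple)):
            color = tuple(color)
        counts[color] += 1
    return counts


# Made-up sample; with a real PDF you would pass page.rects instead.
sample_rects = [
    {"non_stroking_color": 0},
    {"non_stroking_color": 1},
    {"non_stroking_color": 1},
    {"non_stroking_color": [0.5, 0.5, 0.5]},
]
color_counts = count_rect_fill_colors(sample_rects)
```

Comparing these counts with what is visible on the rendered page can suggest which color values to keep and which to filter out.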

playgithub added the "troubleshooting" label (Issues that seek assistance with parsing specific PDFs) Oct 16, 2020

samkit-jain added the "awaiting-code-or-pdf" label (Issues and PRs awaiting code and/or a PDF from issue/PR-author) Oct 16, 2020

samkit-jain removed the "awaiting-code-or-pdf" label (Issues and PRs awaiting code and/or a PDF from issue/PR-author) Oct 16, 2020

playgithub closed this as completed Oct 16, 2020
