In full-lined table, two lines in one cell is recognized as two items #285
Comments
Hi @playgithub, I appreciate your interest in the library. Could you please provide the PDF, the table settings, the pdfplumber version and a small reproducible code sample so that we can look into it?
pdfplumber 0.5.23
pdf: http://www.cninfo.com.cn/new/disclosure/detail?plate=sse&orgId=9900005970&stockCode=601668&announcementId=1207611180&announcementTime=2020-04-25
code:

import pdfplumber
path = '中国建筑:2019年年度报告.PDF'
pdf = pdfplumber.open(path)
txt_file = open("result.txt", "wb")
for page in pdf.pages:
    for table in page.extract_tables():
        # print(table)
        for row in table:
            print(row)
            row = [(item or "") for item in row]
            row_str = "|".join(row) + "\n"
            txt_file.write(row_str.encode())
        print('\n------------------------------------------------------------\n')
txt_file.close()
pdf.close()
Thanks for sharing the requested details @playgithub. The code you shared can be adjusted as follows:

import pdfplumber
path = '中国建筑:2019年年度报告.PDF'
pdf = pdfplumber.open(path)
txt_file = open("result.txt", "wb")
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 8,
}
for page in pdf.pages:
    for table in page.extract_tables(table_settings=ts):
        merged_table = []
        for row in table:
            # A row with a None cell continues the row above: append its text cell by cell.
            # The merged_table check avoids an IndexError when the first row contains None.
            if None in row and merged_table:
                merged_table[-1] = [(i or "") + j if j is not None else i for i, j in zip(merged_table[-1], row)]
            else:
                merged_table.append(row)
        for row in merged_table:
            print(row)
            row = [(item or "") for item in row]
            row_str = "|".join(row) + "\n"
            txt_file.write(row_str.encode())
        print('\n------------------------------------------------------------\n')
txt_file.close()
pdf.close()

Note that I have specified the table settings to use. I only tested on page 128. The result of drawing the table on that page looks like the following: [screenshot]
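The merging step can be tried without the PDF. The following is a minimal sketch on made-up rows; the function name `merge_continuation_rows` and the sample data are assumptions for illustration and are not part of pdfplumber.

```python
from typing import List, Optional

Row = List[Optional[str]]


def merge_continuation_rows(table: List[Row]) -> List[Row]:
    """Fold each row that contains None into the nearest kept row above it."""
    merged: List[Row] = []
    for row in table:
        if None in row and merged:
            # Append the continuation text cell by cell; None cells add nothing.
            merged[-1] = [
                (top or "") + bottom if bottom is not None else top
                for top, bottom in zip(merged[-1], row)
            ]
        else:
            # Complete rows (and a leading row with None) start a new entry.
            merged.append(list(row))
    return merged


# Made-up rows that mimic a cell whose text wraps onto a second line.
sample = [
    ["Item", "Note", "2019"],
    ["Financial assets at fair value ", "4(3)", "902,072"],
    ["through profit or loss", None, None],
]
result = merge_continuation_rows(sample)
for row in result:
    print(row)
```

This only handles the case where the continuation line comes after the complete row. If the incomplete row comes first, the rule would need to merge downward instead.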
Does it mean that if a row contains at least one "None" value, the row should be merged into the row above?
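@playgithub Yes, it means that a row containing None is merged into the row above. The hidden lines can also be filtered out of the page before extracting:

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == 0
    return True

p = p.filter(keep_visible_lines)

Explanation of the above code: on page 128, I noticed that the hidden lines had a non_stroking_color other than 0, so the filter keeps only the rects whose non_stroking_color is 0.
The table result: [screenshot]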
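To see what the predicate keeps and drops, it can be applied to plain dictionaries. This is a small sketch, assuming hand-written objects that only mimic the `object_type` and `non_stroking_color` keys used above; real pages carry many more keys, and the representation of `non_stroking_color` may differ between PDFs and pdfplumber versions.

```python
def keep_visible_lines(obj):
    # Rects are kept only when their fill colour is 0; everything else passes through.
    if obj["object_type"] == "rect":
        return obj["non_stroking_color"] == 0
    return True


# Hypothetical stand-ins for page objects, not real pdfplumber output.
objects = [
    {"object_type": "char", "text": "A"},
    {"object_type": "rect", "non_stroking_color": 0},
    {"object_type": "rect", "non_stroking_color": 1},
]

kept = [obj for obj in objects if keep_visible_lines(obj)]
print(len(kept), "of", len(objects), "objects kept")
```

Printing the distinct `non_stroking_color` values found on a page first is a cheap way to check which value the visible lines actually use before hard-coding `0`.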
Thanks, it's cool.
Hi @situchen, could you please provide some more information, such as the table extraction strategy you are using and the PDF?
Thank you!
Thanks for sharing the PDF @situchen. Using the default table extraction settings, this is the result I get: [screenshot]
To discard those invisible lines, you may use the code at #311 (comment). It will give you the result as: [screenshot]
Thank you for your answers, thank you very much!
When I processed it, I found that the lines inside the yellow box were lost; please see the picture for details.
To me, this seems like a difficult problem: you want to ignore non-black lines in most cases ... but not all of them. It may be possible to do so, but you'll have to find a set of rules for what
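One possible shape for such a rule set, as a sketch only: keep every black rect, and keep a non-black rect only when it lies inside a region that is known to need it. The helper names `inside` and `make_line_filter`, the region coordinates and the sample objects are all assumptions for illustration; the `x0`, `top`, `x1` and `bottom` keys follow the way pdfplumber describes object positions.

```python
def inside(obj, region):
    """Return True when the object's bounding box lies fully within the region."""
    x0, top, x1, bottom = region
    return (
        obj["x0"] >= x0
        and obj["top"] >= top
        and obj["x1"] <= x1
        and obj["bottom"] <= bottom
    )


def make_line_filter(keep_regions):
    """Build a predicate: black rects always pass, other rects only inside keep_regions."""
    def test(obj):
        if obj["object_type"] != "rect":
            return True
        if obj["non_stroking_color"] == 0:
            return True
        return any(inside(obj, region) for region in keep_regions)
    return test


# Hypothetical objects; the "id" key exists only to make the result easy to read.
objects = [
    {"id": "char", "object_type": "char"},
    {"id": "black", "object_type": "rect", "non_stroking_color": 0,
     "x0": 0, "top": 0, "x1": 500, "bottom": 1},
    {"id": "colored_inside", "object_type": "rect", "non_stroking_color": 1,
     "x0": 60, "top": 120, "x1": 290, "bottom": 121},
    {"id": "colored_outside", "object_type": "rect", "non_stroking_color": 1,
     "x0": 60, "top": 400, "x1": 290, "bottom": 401},
]

line_filter = make_line_filter([(50, 100, 300, 200)])
kept_ids = [obj["id"] for obj in objects if line_filter(obj)]
print(kept_ids)
```

With pdfplumber, a predicate like this could be passed to the page's `filter` method in the same way as `keep_visible_lines` earlier in the thread. Whether a fixed region is a workable rule depends on how consistent the layout is across pages.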
The result for the row is:
["以公允价值计量且其变动计入", None, None, None, None, None]
["当期损益的金融资产", "四(3)", "902,072", "—", "-", "—"]
What is expected is
["以公允价值计量且其变动计入当期损益的金融资产", None, None, None, None, None]
If this can be solved with parsing parameters, how? Thanks.