Pdfplumber not detecting edges of table without solid border lines #972
-
Hi @newusername123123123123, I appreciate your interest in the library. Could you please provide the PDF so that we can assist you better? If sharing the PDF is not possible, can you try the following:
-
Thank you for your response. I have attached the PDF.
-
Perhaps this example could also be of use:

As the column names are known, you could search for them and use their positions to create vertical lines.

Another interesting issue with this PDF is that the footer text is merged with (or hidden behind) the last row on page 1. The approach used here is to search for the footer text and to filter out chars on the page whose size is <= the footer text size.

import pdfplumber

# File name taken from the question in this thread.
pdf = pdfplumber.open("Hickam_72HR_15AUG2023 v2.pdf")

for page in pdf.pages:
    # Locate the header row and crop the page to it.
    row = page.search(r'(?i)roll call destination seats')[0]
    bbox = pdfplumber.utils.obj_to_bbox(row)
    crop = page.crop(bbox)

    # Use the left and right edges of the known column names as vertical lines.
    explicit_vertical_lines = []
    for name in ['roll call', 'seats']:
        col = crop.search(rf'(?i){name}')[0]
        sides = col['x0'], col['x1']
        explicit_vertical_lines.extend(sides)

    # Collect the ids of all chars no larger than the footer text.
    footer = page.search(r'(?is)destinations listed in alphabetical order.*without notice')[0]
    size = footer['chars'][0]['size']
    ids = set(id(char) for char in page.chars if char['size'] <= size)

    # Also drop the "updated ..." line that sits directly above the footer.
    updated = page.search(r'(?i)(updated .*\n)(?=destinations listed in alphabetical order)')
    for line in updated:
        ids.update(id(char) for char in line['chars'])

    # Start the table area just below the header row.
    bbox = list(page.bbox)
    bbox[1] = row['bottom'] + 1

    filtered_page = page.filter(lambda obj: id(obj) not in ids)
    filtered_page = filtered_page.crop(bbox)

    table = filtered_page.extract_table(dict(
        explicit_vertical_lines = explicit_vertical_lines,
        vertical_strategy = 'explicit'
    ))
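The two ideas in this reply (column edges from header positions, and dropping footer-sized chars) can also be looked at in isolation from pdfplumber. The following is only a minimal sketch, not the code from the reply: it assumes the objects are plain dicts with `x0`, `x1`, and `size` keys, as pdfplumber's search matches and chars are, and the helper names `column_edges` and `drop_small_chars` as well as the coordinates are made up for illustration.

```python
def column_edges(header_matches):
    """Collect the left and right x-coordinates of each header match.

    Each match is assumed to be a dict with 'x0' and 'x1' keys, like the
    items returned by pdfplumber's page.search().
    """
    edges = set()
    for match in header_matches:
        edges.add(match["x0"])
        edges.add(match["x1"])
    return sorted(edges)


def drop_small_chars(chars, footer_size):
    """Keep only chars strictly larger than the footer font size."""
    return [c for c in chars if c["size"] > footer_size]


# Made-up coordinates and sizes, for illustration only.
headers = [{"x0": 480.0, "x1": 520.0}, {"x0": 40.0, "x1": 95.5}]
chars = [
    {"text": "A", "size": 10.0},
    {"text": "b", "size": 6.0},  # footer-sized, so it is dropped
]

edges = column_edges(headers)
kept = drop_small_chars(chars, 6.0)
print(edges)
print([c["text"] for c in kept])
```

Sorting and de-duplicating the edges is a design choice of this sketch; the code in the reply passes the edges to `explicit_vertical_lines` in the order they were found.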
-
I am attempting to parse the timetable shown in the above PDF. My issue is that my program does not detect the lines at the bottom of the table (see below), so I'm missing data rows in my table. I have included my code below. Can anyone help me make my program find the edges of the table in the attached PDF? Thank you!
import pdfplumber

pdf = pdfplumber.open("Hickam_72HR_15AUG2023 v2.pdf")
p0 = pdf.pages[0]

table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_tolerance": 15,
    "join_tolerance": 200,
}

im = p0.to_image(resolution=250, antialias=True)
im.reset().debug_tablefinder(table_settings)
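Because the table has no solid border lines, the "lines" strategies may have nothing to latch onto at the bottom edge. One thing that might be worth trying, sketched here under assumptions, is to switch the horizontal strategy to "text" and supply the vertical lines explicitly. `build_table_settings` is a hypothetical helper and the x-coordinates are made up; the dictionary keys are existing pdfplumber table settings, but the values are untested against this PDF and would need tuning.

```python
def build_table_settings(explicit_vertical_lines, snap_tolerance=3):
    """Return table settings for a table without ruled horizontal borders.

    explicit_vertical_lines: x-coordinates of the column boundaries.
    snap_tolerance: 3 is pdfplumber's default; adjust as needed.
    """
    if len(explicit_vertical_lines) < 2:
        raise ValueError("need at least two vertical lines to form a column")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": sorted(explicit_vertical_lines),
        "horizontal_strategy": "text",
        "snap_tolerance": snap_tolerance,
    }


# Made-up x-coordinates, for illustration only.
settings = build_table_settings([520.0, 40.0, 95.5, 480.0])
print(settings)
```

The resulting dictionary could then be passed to `im.reset().debug_tablefinder(settings)` to inspect the detected cells before calling `p0.extract_table(settings)`.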