Pdfplumber not detecting edges of table without solid border lines #972

newusername123123123123 · 2023-08-25T19:01:36Z


I am attempting to parse the timetable shown in the above PDF. My issue is that my program does not detect the lines at the bottom of the table (see below), so data rows are missing from my extracted table. I have included my code below. Can anyone help me get my program to find the edges of the table in the attached PDF? Thank you!

import pdfplumber

pdf = pdfplumber.open("Hickam_72HR_15AUG2023 v2.pdf")
p0 = pdf.pages[0]

table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "intersection_tolerance": 15,
    "join_tolerance": 200,
}
im = p0.to_image(resolution=250, antialias=True)
im.reset().debug_tablefinder(table_settings)

samkit-jain · 2023-09-01T09:53:54Z

Collaborator

Hi @newusername123123123123, I appreciate your interest in the library. Could you please provide the PDF so that we can assist you better? If sharing the PDF is not possible, can you try the following:

1. Use "text" as the strategy for both horizontal and vertical lines.
2. Use "text" for horizontal lines and take the vertical lines from the table's header. To get them, run your current table extraction settings, read the coordinates of the vertical lines in the first row, and pass those to explicit_vertical_lines after setting "explicit" as the vertical strategy.
3. Use page.curves, page.edges, or page.curves + page.edges as the explicit lines, with "explicit" as both the vertical and the horizontal strategy.
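
The three suggestions above map onto table settings roughly as follows. This is a minimal sketch, not the collaborator's code: the helper names (text_settings, header_settings, drawn_object_settings) and the sample x-coordinates in the comments are hypothetical, and only the setting keys and the page.curves / page.edges attributes come from pdfplumber.

```python
def text_settings():
    """Suggestion 1: infer both row and column boundaries from word positions."""
    return {"vertical_strategy": "text", "horizontal_strategy": "text"}


def header_settings(header_xs):
    """Suggestion 2: rows from text, columns from x-coordinates read off the header row."""
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "text",
        "explicit_vertical_lines": list(header_xs),
    }


def drawn_object_settings(objects):
    """Suggestion 3: use drawn objects (curves, edges) as explicit lines in both directions."""
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": list(objects),
        "explicit_horizontal_lines": list(objects),
    }


# Hypothetical usage with a pdfplumber page object `p0`:
#   p0.extract_tables(text_settings())
#   p0.extract_tables(header_settings([50, 150, 450, 560]))  # made-up coordinates
#   p0.extract_tables(drawn_object_settings(p0.curves + p0.edges))
```

Each dict can also be passed to debug_tablefinder to see which lines are actually picked up. When objects rather than numbers are given as explicit lines, pdfplumber is expected to pick out the vertical or horizontal edges from them itself, which is why the same list is reused for both directions here.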


newusername123123123123 · 2023-09-11T17:16:42Z

Author

Thank you for your response. I have attached the PDF.

Hickam_72HR_15AUG2023 v2.pdf

1 reply

samkit-jain Sep 14, 2023
Collaborator

Thanks for sharing the PDF. You can first find the vertical line positions from the header and then run the table extraction. Example:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
p = pdf.pages[0]


def get_vertical_lines_from_header(page):
    """
    In this PDF, the header row is a table enclosed in lines-lines. Instead of defining
    custom hand drawn vertical line segments, use that table's coords for explicit vertical lines.

    It is done only once and then the coords are kept for reference.
    """
    column_count = 3  # 3 because the table has 3 headers.

    # Find tables.
    tables = page.find_tables(table_settings={"vertical_strategy": "lines", "horizontal_strategy": "lines"})

    # No table found.
    if len(tables) == 0:
        return []

    # Find the header row: the first row, across all tables, that has exactly
    # column_count cells and no None cells.
    for table in tables:
        for row in table.rows:
            if any(cell is None for cell in row.cells):
                continue

            if len(row.cells) != column_count:
                continue

            # Each cell is an (x0, top, x1, bottom) tuple: take the left edge of
            # every cell plus the right edge of the last cell.
            return [cell[0] for cell in row.cells] + [row.cells[-1][2]]

    return []

ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "snap_tolerance": 5,
    "explicit_vertical_lines": get_vertical_lines_from_header(p),
    "intersection_x_tolerance": 10,
}

# Visual debugging.
im = p.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

# Output.
tables = p.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

This will give you the output as

However, if you use "lines" as the horizontal strategy, you'll get cleaner output, but it would require some post-processing cleanup or pre-processing page cropping.
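
The thread does not say what the post-processing cleanup would involve. As one plausible sketch, assuming the unwanted output consists of rows whose cells are all empty or None, a small helper (clean_table is a hypothetical name, and the sample rows are only illustrative) could strip cells and drop blank rows from the list-of-lists that extract_tables returns:

```python
def clean_table(table):
    """Strip whitespace from each cell and drop rows that contain no text at all."""
    cleaned = []
    for row in table:
        # extract_tables can yield None for empty cells, so normalise to "".
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


# Illustrative input: two data rows with a blank row between them.
raw = [
    ["0300", "KWAJALEIN ATOLL", "41F"],
    [None, "", "  "],
    ["0510", "CLARK INTL, PHILIPPINES\nKADENA AB, JAPAN", "53T"],
]
cleaned = clean_table(raw)
```

Newlines inside a cell are left alone, so multi-destination cells keep one destination per line.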

cmdlineluser · 2023-09-15T17:57:07Z


Perhaps this example could also be of use:

As the column names are known, you could search for them and use their positions to create vertical lines.

Another interesting issue with this PDF is that the footer text is merged with, or hidden behind, the last row on page 1.

The approach used here is to search for the footer text and to filter out chars on the page whose size is <= the footer text size.

import pdfplumber

pdf = pdfplumber.open("Hickam_72HR_15AUG2023 v2.pdf")

for page in pdf.pages:
    # Locate the header row and crop to it, so the column-name searches stay inside it.
    row = page.search(r'(?i)roll call destination seats')[0]
    bbox = pdfplumber.utils.obj_to_bbox(row)
    crop = page.crop(bbox)

    # Use the left and right sides of the outer column names as vertical lines.
    explicit_vertical_lines = []
    for name in ['roll call', 'seats']:
        col = crop.search(rf'(?i){name}')[0]
        sides = col['x0'], col['x1']
        explicit_vertical_lines.extend(sides)

    # Collect every char that is no larger than the footer text.
    footer = page.search(r'(?is)destinations listed in alphabetical order.*without notice')[0]
    size = footer['chars'][0]['size']
    ids = set(id(char) for char in page.chars if char['size'] <= size)

    # Also collect the "updated ..." line that precedes the footer.
    updated = page.search(r'(?i)(updated .*\n)(?=destinations listed in alphabetical order)')
    for line in updated:
        ids.update(id(char) for char in line['chars'])

    # Keep only the area below the header row.
    bbox = list(page.bbox)
    bbox[1] = row['bottom'] + 1

    filtered_page = page.filter(lambda obj: id(obj) not in ids)
    filtered_page = filtered_page.crop(bbox)

    table = filtered_page.extract_table(dict(
        explicit_vertical_lines=explicit_vertical_lines,
        vertical_strategy='explicit'
    ))

┌───────────┬──────────────────────────────────────┬─────────┐
│ ROLL CALL │             DESTINATION              │  SEATS  │
│  varchar  │               varchar                │ varchar │
├───────────┼──────────────────────────────────────┼─────────┤
│ 2040      │ NORTH ISLAND NAS, CA\nTRAVIS AFB, CA │ 54T     │
└───────────┴──────────────────────────────────────┴─────────┘
┌───────────┬──────────────────────────────────────────────────────────────────────────────┬─────────┐
│ ROLL CALL │                                 DESTINATION                                  │  SEATS  │
│  varchar  │                                   varchar                                    │ varchar │
├───────────┼──────────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0300      │ KWAJALEIN ATOLL                                                              │ 41F     │
│ 0510      │ CLARK INTL, PHILIPPINES\nKADENA AB, JAPAN                                    │ 53T     │
│ 0505      │ COLORADO SPRINGS, CO\nTRAVIS AFB, CA\nVANDENBERG AFB, CA                     │ 73T     │
│ 0640      │ ELMENDORF AFB, AK\nKADENA AB, JAPAN                                          │ 10T     │
│ 0719      │ ANDERSEN AFB, GUAM\nKADENA AB, JAPAN\nOSAN AB, SOUTH KOREA\nYOKOTA AB, JAPAN │ 73T     │
│ 0910      │ ANDERSEN AFB, GUAM\nOSAN AB, SOUTH KOREA\nYOKOTA AB, JAPAN                   │ 53T     │
│           │                                                                              │         │
└───────────┴──────────────────────────────────────────────────────────────────────────────┴─────────┘
┌───────────┬────────────────────────┬─────────┐
│ ROLL CALL │      DESTINATION       │  SEATS  │
│  varchar  │        varchar         │ varchar │
├───────────┼────────────────────────┼─────────┤
│ 0440      │ JOINT BASE ANDREWS, MD │ TBD     │
└───────────┴────────────────────────┴─────────┘
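
The footer-filtering step can be seen in isolation with plain dicts standing in for pdfplumber char objects. This is only a sketch of the idea: the sample chars, their sizes, and the helper name ids_at_or_below are made up for illustration, and the only thing it shares with the code in this reply is the "size <= footer size" test and the use of id() to mark objects for removal.

```python
def ids_at_or_below(chars, max_size):
    """Return the id() of every char whose font size is at most max_size."""
    return {id(char) for char in chars if char["size"] <= max_size}


# Hypothetical chars: a table cell ("41F") with small footer chars mixed in.
chars = [
    {"text": "4", "size": 10.0},
    {"text": "D", "size": 7.0},  # footer char sitting behind the row
    {"text": "1", "size": 10.0},
    {"text": "e", "size": 7.0},  # footer char sitting behind the row
    {"text": "F", "size": 10.0},
]

footer_size = 7.0  # in the real code this is read from the footer match
hidden = ids_at_or_below(chars, footer_size)

# Same predicate shape as the lambda passed to page.filter in this reply.
kept = [char for char in chars if id(char) not in hidden]
kept_text = "".join(char["text"] for char in kept)
```

Matching on id() rather than on text or position removes exactly the objects that were selected, which matters when footer characters overlap table characters. The size test assumes no genuine table text is as small as the footer.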
