Opaque/ curious behavior with extract_tables and extract_text parameters #1006

RobotDodo · 2023-10-09T12:15:18Z


Thanks very much for a great library. I'm just beginning and struggling to understand and manage the parameters which govern the extract_tables and extract_text methods, but I can't figure out what to change in order to fix some puzzling behaviour.

In the table I'm using as a test (first page of attached pdf), I'm getting all the bounding box coordinates and the cell contents with no problem, using

table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_x_tolerance": 10,
    "snap_y_tolerance": 0
}

I was unable to get any sensible output from extract_text for each cell when iterating through the given bounding boxes, so I gave up on that. Part of the problem might have been that none of the parameters I tried changing seemed to do much (to be fair, my changes were pretty random, as I couldn't understand the documentation for them).

I now apply extract_tables to single-cell 'tables' that are iteratively cropped from the original table, using the same parameters as above.

def extract_text_from_cell(page, cell_bbox):
    cropped_page = page.crop(cell_bbox)
    return cropped_page.extract_text(**text_settings)

This works a bit better, in that I get all the text and can associate it with a cell/ row/ column. However, I get occasional duplicates, seemingly from cells that have multi-line text in them, and the second line of the text is identified as a cell-within-a-cell. For example, in the top right-most table cell, there is the text:

Blog 1 - Introduce Yourself
to the Class

This text is successfully retrieved, but the extract_tables method gives me another cell directly below containing

to the Class

and all the other fields/ columns in that row are blank - clearly an erroneous repeat of the second line.

I am currently post-processing the data and I can remove the rows with nothing in the first column, but I feel like I'm missing something here! Should I be using another technique? What should I do to dial down the sensitivity? All I want to do is extract the text cleanly and uniquely from the table cells that are already accurately identified and I need to minimise layout specific post-processing as this is the first of several table formats I plan to be processing.

Thanks in advance for any comments, suggestions.

Justin

Tech Writing Schedule.pdf

cmdlineluser · 2023-10-09T14:36:55Z


Hey Justin.

Can you perhaps show how/where you are seeing duplication of "to the Class"?

I can't seem to see it using the example PDF.

import pdfplumber
import pandas as pd

pdf = pdfplumber.open("Downloads/Tech.Writing.Schedule.pdf")

table_settings = {
   "snap_x_tolerance": 10,
}

# using pandas just to "pretty-print" the table
pd.DataFrame(pdf.pages[0].extract_tables(table_settings)[0])

                                  0                      1                                                  2                                                  3
0                              Week                   Date                                           Readings                                        Assignments
1                            Week 1              August 22                                                            Blog 1 -- Introduce Yourself\nto the Class
2                              None              August 24  PSTC pp. 2-16; HTW pp. 511-512, 427-\n428, 40-...        Discussion Board -- PSTC\np. 15 Exercise #3
3                            Week 2              August 29  PSTC pp. 50-71, Standing Rock press\nrelease o...                        Blog 2 -- practice analysis
4                              None              August 31                                                                          Documents for Assignment\n#1
5    Week 3\nConferences\nNo Class!  September 5\nNo Class                                 HTW pp. xvii-xxiii
6                              None  September 7\nNo Class                                                            Blog 3 -- response on the\nwriting process
7                            Week 4           September 12                                                                  Assignment #1 Draft\nand Peer Review
8                              None           September 14   Richard Straub article on BB; PSTC pp.\n102-116.  PSTC pp. 146-147, ex. 1-7.\nAssignment #1 fina...
9                            Week 5           September 19  PSTC pp. 27-31, 256; HTW pp. 481-485,\n153-156...
10                             None           September 21                                                                                Assignment #2 proposal
11                           Week 6           September 26                         PSTC pp. 133-145, 236-250.                PSTC pp. 148-149,\nexercises #16-29
12                             None           September 28                                                                  Assignment #2 Draft\nand Peer Review
13                           Week 7              October 3  PSTC pp. 116-133; HTW pp. 521-522,\n544-546, 5...                 PSTC pp. 147-148,\nexercises #8-15
14                             None              October 5                                                     Assignment #2 final\ndraft submission\n(Saturday)
15                           Week 8             October 10  PSTC pp. 291-299; HTW pp. 409-416.\n"The Black...                    Blog 4 - Reflection on\narticle
16                             None             October 12                                                                              Assign. #3 proposal form
17                           Week 9             October 17                              No Class! Fall Break!                                               None
18                             None             October 19      PSTC pp. 33-48; HTW pp. 67-69, 376,\n451-457.      Assign. #3 Annotated\nBibliography submission
19  Week 10\nConferences\nNo Class!   October 24\nNo Class  PSTC pp. 299-307, 307-314; HTW pp.\n298-304; s...
20                             None   October 26\nNo Class                                                          Discussion Board: Critique\nSample Proposals
21                          Week 11             October 31                                                                     Assign. #3 Draft and\nPeer Review
22                             None             November 2                   PSTC pp. 50-64; HTW pp. 479-480.  Assignment #3 final\ndraft submission\n(Saturday)
23                          Week 12             November 7       PSTC pp. 338-354; HTW pp. 166-168,\n183-205.                 Evaluate Assign. #3\ncollaboration
24                             None             November 9                                                     Discussion Board: PSTC p.\n382 #2\nAssign. #4 ...


RobotDodo · 2023-10-10T03:23:08Z

Author

Hey cmdlineluser,

Many thanks for looking into this.

I'm using two methods to extract different types of calendar and date event information from PDF tables. One is to extract the table itself and manage the content by columns (best for lists of dated events); the other is required for normal horizontal or vertical calendars. In the latter method, the position of each box relative to the hours/time slots must be deduced to work out when the event starts and ends. For this, we need to pull out the coordinates of all the table cell bounding boxes (making sure to omit all text bboxes), identify the reference row/column, and compare the coordinates of the two to see when events start and finish.

For date lists like the Tech Writing PDF example, the table extraction works OK, but for the other type, I'm going to need the box coords. That's the explanation for the code I'm attaching here. It's probably a mess, but I just started using Python a couple of weeks ago!

I'm trying to fully understand the PDFplumber parameters so I can use it properly in extracting various types of PDF tables.

I've commented out the 'cleaning' code that skips the repeating lines (identified by having nothing in the first column) as well as some other post-processing clean up, so you can see the behavior. I am aware of pandas for prettying up data, but I haven't yet tried using it.

If you're up for checking it out, you should be able to run this file (you'll need to change the file path to ingest the PDF), select '1' to discard the first column, 'y' to continue writing the second page to the output file and '1' again to discard the first column on the second page. The output file you should look at is "_table_data_by_row.csv". For me it looks like this when ingested into a spreadsheet:

Hope that makes sense. Huge thanks for any tips or suggestions on this.

231009_2 pdfplumber multipage.zip


cmdlineluser · 2023-10-10T15:09:36Z


Ah okay. I now see the duplicates.

I wasn't seeing them because I used neither "snap_y_tolerance": 0 nor .crop().

With "snap_y_tolerance": 0, the cell for the first "to the Class" duplicate is:

(440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001)

If we save it as an image to inspect it, we see no text.

bbox = 440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001
page.crop(bbox).to_image(300).save('crop.png')

I was a bit confused by this behaviour and posted a similar question recently: #930

>>> page.crop(bbox).extract_text()
'to the Class'
>>> page.within_bbox(bbox).extract_text()
''

Because part of the text falls within the bbox, .crop() includes the whole object, whereas .within_bbox() is the "strict" version, which only includes objects that are fully contained.

(Sort of like an intersection vs. subset comparison.)
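To make the intersection vs. subset comparison concrete, here is a minimal sketch in plain Python. It is a simplified model of the two behaviours, not pdfplumber's actual implementation; the helper names `overlaps` and `fully_inside` are made up for illustration. The text bbox is the one that `page.search("to the Class")` reports for this PDF, and the cell bbox is the thin "duplicate" cell.

```python
from typing import Tuple

# (x0, top, x1, bottom), the same ordering pdfplumber uses for bboxes
BBox = Tuple[float, float, float, float]


def overlaps(obj: BBox, box: BBox) -> bool:
    """Intersection-style test: True if obj shares any area with box."""
    return obj[0] < box[2] and obj[2] > box[0] and obj[1] < box[3] and obj[3] > box[1]


def fully_inside(obj: BBox, box: BBox) -> bool:
    """Subset-style test: True only if obj lies entirely within box."""
    return obj[0] >= box[0] and obj[1] >= box[1] and obj[2] <= box[2] and obj[3] <= box[3]


# Bounding box of the string "to the Class" on the page
text_bbox = (445.2, 78.79584, 496.330656, 89.83583999999996)
# The thin "duplicate" cell found with snap_y_tolerance = 0
sliver_cell = (440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001)

print(overlaps(text_bbox, sliver_cell))      # True  -> a crop-like test picks the text up
print(fully_inside(text_bbox, sliver_cell))  # False -> a within_bbox-like test leaves it out
```

The text pokes into the thin cell by less than a tenth of a point, which is enough for the overlap test but not for the containment test.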

That should at least explain why this is happening.

It seems in this particular case, you need snap_y_tolerance > 0 but I'm not sure if that answers things.

[CELL]: (440.1681355932205, 64.55999999999995, 573.8242622950826, 89.75999999999999)
[CELL]: (440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001)
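As a rough illustration of why a non-zero snap_y_tolerance helps here: the two horizontal edges at roughly 89.76 and 90.24 are only about 0.48 apart, so any snapping step with a tolerance above that would treat them as one line and the thin cell between them would disappear. The sketch below is a simplified stand-in for that idea, not pdfplumber's own snapping code, and `snap_values` is a hypothetical helper.

```python
from typing import List


def snap_values(values: List[float], tolerance: float) -> List[float]:
    """Group sorted values whose neighbours differ by <= tolerance,
    then replace every value in a group with the group's average."""
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters for _ in c]


# y-positions of the horizontal edges around the problem cell
edges = [64.55999999999995, 89.75999999999999, 90.24000000000001]

# Tolerance 0: all three edges stay distinct, so a ~0.48pt-high cell survives
print(len(set(snap_values(edges, 0))))  # 3

# Tolerance 1: the two close edges collapse into one, leaving a single real cell
print(len(set(snap_values(edges, 1))))  # 2
```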

7 replies

jsvine Oct 13, 2023
Maintainer

Thanks for digging into this, @cmdlineluser. This behavior does seem to stem from .crop(...) being used in table extraction and .crop(...) including any object with any amount of overlap. In an ideal world, .extract_table(...) would not duplicate any text. Perhaps it should either (a) "keep track" of characters it has already assigned to a cell (and then not use them again in another cell), or (b) only assign characters to cells if they are more than 50% inside? I've opened an issue here on the topic: #1013
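For what it's worth, option (b) could look something like the sketch below. This is purely hypothetical and not anything that exists in pdfplumber; `fraction_inside` is an invented helper, and the numbers are the bboxes of the "to the Class" text and the two cells discussed in this thread.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def fraction_inside(obj: BBox, cell: BBox) -> float:
    """Return the share of obj's area that lies inside cell (0.0 to 1.0)."""
    width = min(obj[2], cell[2]) - max(obj[0], cell[0])
    height = min(obj[3], cell[3]) - max(obj[1], cell[1])
    if width <= 0 or height <= 0:
        return 0.0
    obj_area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return (width * height) / obj_area if obj_area > 0 else 0.0


text_bbox = (445.2, 78.79584, 496.330656, 89.83583999999996)
real_cell = (440.1681355932205, 64.55999999999995, 573.8242622950826, 89.75999999999999)
thin_cell = (440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001)

# Under a "more than 50% inside" rule the text belongs to the real cell only
print(fraction_inside(text_bbox, real_cell) > 0.5)  # True
print(fraction_inside(text_bbox, thin_cell) > 0.5)  # False
```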

cmdlineluser Oct 14, 2023

Hey @jsvine - just to clarify:

From @RobotDodo's code: https://github.com/jsvine/pdfplumber/files/12852363/231009_2.pdfplumber.multipage.zip

There is manual looping through .cells and calling .crop() e.g.

table = page.find_table()
for row in table.rows:
    for cell in row.cells:
        page.crop(cell).extract_text()

I am not seeing any duplication using the .extract*() methods directly.

(This may not have been clear from reading the previous comments.)

jsvine Oct 14, 2023
Maintainer

Ah, that was a too-quick-to-assume bit on my end. Thanks for flagging, @cmdlineluser. It turns out we do already handle this in a reasonable-seeming way in Table.extract(...):

pdfplumber/pdfplumber/table.py

Lines 399 to 410 in 94da66c

    
def extract(self, **kwargs: Any) -> List[List[Optional[str]]]:
    chars = self.page.chars
    table_arr = []

    def char_in_bbox(char: T_obj, bbox: T_bbox) -> bool:
        v_mid = (char["top"] + char["bottom"]) / 2
        h_mid = (char["x0"] + char["x1"]) / 2
        x0, top, x1, bottom = bbox
        return bool(
            (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
        )

Maybe that code is useful for @RobotDodo as well.
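Here is a small standalone sketch of that midpoint rule, in case it helps to see it run outside the library. The `char_in_bbox` logic mirrors the snippet above; the `char` dict is a stand-in that reuses the bbox of the whole "to the Class" string rather than a real single character, so treat the numbers as illustrative.

```python
from typing import Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def char_in_bbox(char: Dict[str, float], bbox: BBox) -> bool:
    """Assign a char to a cell only if the char's midpoint falls inside it."""
    v_mid = (char["top"] + char["bottom"]) / 2
    h_mid = (char["x0"] + char["x1"]) / 2
    x0, top, x1, bottom = bbox
    return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)


# Stand-in "char" using the bbox of the duplicated string (an assumption for
# illustration; real chars come from page.chars)
char = {"x0": 445.2, "top": 78.79584, "x1": 496.330656, "bottom": 89.83583999999996}

real_cell = (440.1681355932205, 64.55999999999995, 573.8242622950826, 89.75999999999999)
thin_cell = (440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001)

print(char_in_bbox(char, real_cell))  # True:  midpoint sits inside the real cell
print(char_in_bbox(char, thin_cell))  # False: midpoint is above the thin cell's top
```

Because each midpoint can only fall in one cell, a character is never assigned twice, even when its edges spill slightly over a cell border.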

RobotDodo Oct 17, 2023
Author

Many thanks both and sorry for the delay in replying; had trouble getting in front of this project for a few days.

I really appreciate the explanation, but I'm sorry to say that I don't fully understand. I've looked at one of the papers referenced somewhere describing how edges are found (checking whether the neighbouring cells increase or decrease in intensity, to find where it tapers off, etc.), but I don't understand how the text in this cell could be identified twice, especially since, to my eye, it seems that the text is well within the cell. In this image, where I draw rectangles around the bounding boxes just to check there weren't any cells within cells, there is a pixel's width of space even accounting for the width of the yellow lines. All of the text
"Blog 1 -- Introduce yourself
to the Class"
is completely within the yellow box.

(This is obviously a low-res version of the data that the code is working on, but it shows the idea).

W.R.T. that particular box, and the most recent code @jsvine posted, I guess this 'overlap' could occur because the limits of the character are inferred from the center of the character and the tolerance settings for how far it extends from its center? But why is it counting the text twice in the same cell? And more practically, what settings should I be changing to minimise that behaviour and maximise accurate text extraction?

This may seem obvious to more seasoned eyes like yours, but I do appreciate the simple explanations!

cmdlineluser Oct 20, 2023

These are the bbox coords of the 2 cells:

(440.1681355932205, 64.55999999999995, 573.8242622950826, 89.75999999999999) # "real" cell
(440.1681355932205, 89.75999999999999, 573.8242622950826, 90.24000000000001) # "duplicate" cell

This is the bbox info for the duplicated string:

>>> page.search("to the Class", return_chars=False)[0]
{'text': 'to the Class',
 'x0': 445.2,
 'top': 78.79584,
 'x1': 496.330656,
 'bottom': 89.83583999999996,
 'groups': ()}

It's probably impossible to see by eye, but there is an overlap:

>>> 89.83583999999996 - 89.75999999999999 # text['bottom'] - cell['top']
0.07583999999997104

Because you're not using the Table.extract() method, you miss out on the extra logic that @jsvine pointed us to in the previous example.

RobotDodo · 2023-10-25T05:16:02Z

Author

That's really helpful, thanks. I did use the within_bbox() instead of crop() and it has helped with the duplicates. I also took your idea of saving each cell 'snapshot' as an image and the extracted cell text in a text file - it was an extremely efficient debugging method (I paired the two files with a timestamp and put in a slight time delay to allow it to write the files). I found that it was reading the cells properly.

The reason I'm using find_tables and not extract_table/s is because a) I need the table cell coordinates so that I can map the cells to the corresponding weekdays and time periods for the weekly calendar type formats and b) I want to use the same approach for both formats (weekly calendar vs list of date events). The extract_table/s methods don't provide the cell coordinates, correct?

My goal is to 1) code the table recognition as generally as possible to allow the capture of a variety of formats and 2) avoid file specific post-processing.

2 replies

cmdlineluser Oct 25, 2023

> The extract_table/s methods don't provide the cell coordinates, correct?

Correct, they return just the contents.

What I meant was Table.extract() - i.e. the .extract() method on the Table object:

>>> page.find_table().extract
<bound method Table.extract of <pdfplumber.table.Table object at 0x12e0cf400>>

.extract_table() is essentially shorthand for .find_table().extract()

So you find the table, get the cell info then call .extract() on the same table object.
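Roughly, that could look like the sketch below. `cells_with_text` is a hypothetical helper, not part of pdfplumber; it assumes `page` is a pdfplumber page and relies on `table.rows`, `row.cells` and `table.extract()` lining up row by row and cell by cell.

```python
from typing import Any, Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def cells_with_text(
    page: Any, table_settings: Optional[Dict[str, Any]] = None
) -> List[Tuple[BBox, Optional[str]]]:
    """Pair each cell bbox of the page's table with the text Table.extract() assigns to it."""
    table = page.find_table(table_settings)
    if table is None:  # no table found on this page
        return []
    texts = table.extract()  # list of rows, each a list of str or None
    pairs = []
    for row, row_texts in zip(table.rows, texts):
        for bbox, text in zip(row.cells, row_texts):
            if bbox is not None:  # None marks a missing cell in the grid
                pairs.append((bbox, text))
    return pairs
```

Usage would be something like `cells_with_text(pdf.pages[0], {"snap_x_tolerance": 10})`, which gives you the coordinates and the de-duplicated text from the same table object without calling .crop() on each cell.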

RobotDodo Oct 27, 2023
Author

Nice. I was starting to think that, but it's good to know that find_tables() and extract_tables() are essentially different facets of the same thing. I was concerned that one might find different objects than the other, thereby foiling my plans to use the output of both for the same object.
