Text capturing without tabular #242

ibrahimshuail · 2020-08-02T08:38:39Z

Is there any way to extract only the text, without the tabular data?

jsvine · 2020-08-02T13:16:28Z

Hi @ibrahimshuail, I’m not sure whether I’m understanding your question, since there is not much detail or any example, but I think you’re looking for page.extract_text(...) and/or page.extract_words(...). See here: https://github.com/jsvine/pdfplumber#the-pdfplumberpage-class

ibrahimshuail · 2020-08-02T13:45:52Z

With extract_text, we also get the tabular data, converted to plain text. I don't want that tabular data. I want to extract everything in the PDF other than the table data.

samkit-jain · 2020-08-02T16:09:17Z

Hi @ibrahimshuail, this is not natively supported, but there are a few ways to achieve the desired result. For example, you could run page.find_tables() and store the coordinates of each identified table. Then, run page.extract_words() on the full page and discard all the words that fall within a tabular region.
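
A minimal sketch of that idea, assuming each word is a dict with "x0", "x1", "top" and "bottom" keys (the shape page.extract_words() returns) and each table bounding box is an (x0, top, x1, bottom) tuple. The helper name words_outside_tables is hypothetical, not part of pdfplumber, and treating a word as "inside" when its midpoint falls in the box is one possible choice:

```python
def words_outside_tables(words, table_bboxes):
    """Keep the words whose midpoint lies outside every table bounding box."""

    def in_bbox(word, bbox):
        x0, top, x1, bottom = bbox
        h_mid = (word["x0"] + word["x1"]) / 2
        v_mid = (word["top"] + word["bottom"]) / 2
        return x0 <= h_mid < x1 and top <= v_mid < bottom

    return [w for w in words if not any(in_bbox(w, b) for b in table_bboxes)]


# Hypothetical usage with pdfplumber (not executed here):
#   page = pdfplumber.open("file.pdf").pages[0]
#   bboxes = [table.bbox for table in page.find_tables()]
#   kept = words_outside_tables(page.extract_words(), bboxes)
#   print(" ".join(w["text"] for w in kept))
```

Joining the kept words with spaces loses the original line breaks, so this is best suited to cases where only the words themselves matter.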

ibrahimshuail · 2020-08-02T17:17:06Z

Then the reported issue should be moved to enhancement. It shouldn't be closed. Correct me if I'm wrong.

jsvine · 2020-08-02T17:49:03Z

Thanks, @samkit-jain, I think that's a good solution! Another approach, depending on the particular example: After selecting the table through page.find_tables(...), you could crop the page to that bounding box, with cropped = page.crop(...), and then run cropped.extract_text(...) or cropped.extract_words(...).

@ibrahimshuail: Thank you for your interest in pdfplumber and its features. I closed the issue because it lacked a clear description and contained no specific examples. As best I could ascertain, however, the problem you are solving does not seem to be a very common one. (Typically, people are extracting tables precisely for the tabular data.) If you'd like to propose an enhancement, feel free to open a new issue with a fuller explanation, a specific example, and some explanation of the motivation for such a feature. (Per the issue template for feature requests: "Please describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.")

samkit-jain · 2020-08-02T18:09:40Z

@jsvine, just a correction to your proposed solution: it would return the text in the tabular region. To get the text not in the tabular region, you would have to run text extraction on the full page and then do a replace (full_page_text.replace(cropped_page_text, "")). This assumes that there is no text to the right or left of the table.
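
A rough sketch of that replace-based variant, operating on plain strings. The helper name strip_table_text is hypothetical. It only works when each table's cropped text appears verbatim in the full-page text, which is exactly the "no text to the left or right of the table" assumption above:

```python
def strip_table_text(full_page_text, table_texts):
    """Remove each table's extracted text from the full-page text."""
    remaining = full_page_text
    for table_text in table_texts:
        if table_text:  # skip empty or missing extractions
            remaining = remaining.replace(table_text, "")
    return remaining


# Hypothetical usage with pdfplumber (not executed here):
#   full_page_text = page.extract_text()
#   table_texts = [page.crop(t.bbox).extract_text() for t in page.find_tables()]
#   print(strip_table_text(full_page_text, table_texts))
```

If a table shares lines with surrounding text, the cropped text will not match any substring of the full-page text and nothing is removed, so filtering by coordinates is the more robust route.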

jsvine · 2020-08-02T18:52:00Z

Ah yes! My apologies, I misunderstood the request. I think your original solution is the most direct and useful.

ibrahimshuail · 2020-08-03T13:00:41Z

@samkit-jain are there any working examples? I tried, but I wasn't able to get it working: the alignment of the text changes completely, so the tabular data doesn't get fully cropped.

samkit-jain · 2020-08-03T13:08:37Z

@ibrahimshuail there's not much I can do without the PDF. If you can share, I can try and help.

ibrahimshuail · 2020-08-03T13:40:32Z

@samkit-jain please find the pdf (https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf). I want all text other than the tabular data

samkit-jain · 2020-08-04T08:05:28Z

Hi @ibrahimshuail You can play around with this piece of code I wrote:

import pdfplumber

def curves_to_edges(cs):
    """See https://github.com/jsvine/pdfplumber/issues/127"""
    edges = []
    for c in cs:
        edges += pdfplumber.utils.rect_to_edges(c)
    return edges

# Import the PDF.
pdf = pdfplumber.open("file.pdf")

# Load the first page.
p = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": curves_to_edges(p.curves + p.edges),
    "explicit_horizontal_lines": curves_to_edges(p.curves + p.edges),
    "intersection_y_tolerance": 10,
}

# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]

def not_within_bboxes(obj):
    """Return True if the object lies outside every table's bbox."""
    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())

Result on page 1:

NATIONAL PARTNERSHIP FOR QUALITY AFTERSCHOOL LEARNING
www.sedl.org/afterschool/toolkits
Tutoring to Enhance Science Skills
Tutoring Two: Learning to Make Data Tables
..............................................................................................
Sample Data for Data Tables
Use these data to create data tables following the Guidelines for Making a Data Table and 
Checklist for a Data Table.
Example 1: Pet Survey (GR 2–3)
Ms. Hubert’s afterschool students took a survey of the 600 students at Morales Elementary 
School. Students were asked to select their favorite pet from a list of eight animals. Here 
are the results. 
Lizard 25, Dog 250, Cat 115, Bird 50, Guinea pig 30, Hamster 45, Fish 75, 
Ferret 10 
Example 2: Electromagnets—Increasing Coils (GR 3–5)
The following data were collected using an electromagnet with a 1.5 volt battery, a switch, 
a piece of #20 insulated wire, and a nail. Three trials were run. Safety precautions in 
repeating this experiment include using safety goggles or safety spectacles and avoiding 
short circuits.  
 
 
       
Example 3: pH of Substances (GR 5–10)
The following are pH values of common household substances taken by three different 
teams using pH probes. Safety precautions in repeating this experiment include hooded 
ventilation, chemical-splash safety goggles, gloves, and apron. Do not use bleach, 
ammonia, or strong acids with children.
Lemon juice 2.4, 2.0, 2.2; Baking soda (1 Tbsp) in Water (1 cup) 8.4, 8.3, 8.7; 
Orange juice 3.5, 4.0, 3.4; Battery acid 1.0, 0.7, 0.5; Apples 3.0, 3.2, 3.5; 
Tomatoes 4.5, 4.2, 4.0; Bottled water 6.7, 7.0, 7.2; Milk of magnesia 10.5, 10.3, 
10.6; Liquid hand soap 9.0, 10.0, 9.5; Vinegar 2.2, 2.9, 3.0; Household bleach 
12.5, 12.5, 12.7; Milk 6.6, 6.5, 6.4; Household ammonia 11.5, 11.0, 11.5;
Lye 13.0, 13.5, 13.4; and Sodium hydroxide 14.0, 14.0, 13.9; Anti-freeze 10.1, 
10.9, 9.7; Windex 9.9. 10.2, 9.5; Liquid detergent 10.5, 10.0, 10.3; and 
Cola 3.0, 2.5, 3.2
Teaching tip: The pH scale is from 0 to 14. Have students make two data tables, one 
with the data as given and one with the pH scale 0 to 14 with the substances’ average 
pH in rank order on the scale (Battery acid at the lower end and Sodium hydroxide at 
the upper end) or create a pH graphic organizer.
1

Result on page 2:

Example 4: Automobile Land Speed Records (GR 5-10)
In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of 
Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour 
(mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across 
frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American 
Eagle is trying to break a land speed record of 800 mph. The Federation International de 
L’Automobile (FIA), the world’s governing body for motor sport and land speed records, 
recorded the following land speed records. (Retrieved on February 5, 2006, from 
http://www.landspeed.com/lsrinfo.asp.)
Example 5: Distance and Time (GR 8-10)
The following data were collected using a car with a water clock set to release a drop in 
a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were 
run. Create a data table with an average distance column and an average velocity column, 
create an average distance-time graph, and draw the best-fit line or curve. Estimate the 
car’s distance traveled and velocity at six drops of water. Describe the motion of the car. Is 
it going at a constant speed, accelerating, or decelerating? How do you know?
 
   
         
© 2006 WGBH Educational Foundation. All rights reserved.
2

This uses the .filter() method provided by pdfplumber to create a filtered version of the page, dropping any objects that fall inside the bounding box of any of the tables.

ibrahimshuail · 2020-08-04T11:08:52Z

Thanks a lot @samkit-jain, one of the best solutions!

thefirebanks · 2023-02-22T14:08:23Z

Thanks for this solution @samkit-jain !

I think this use case may be more common than expected - for example, if a user wants to use a different table extractor other than pdfplumber but still wants to use the library's text extraction features, having an easy way to extract the text without the tables would be convenient. Regardless, this seems to work for me as well.

pcschreiber1 · 2023-08-21T15:04:50Z

Updating after pdfplumber changes (post 0.6.0)

@samkit-jain brilliant solution!

When I tried to re-use this solution yesterday, I ran into a KeyError inside rect_to_edges from pdfplumber.utils.geometry, which is used in the curves_to_edges function above. From my understanding, the reason is that page.edges now also contains curve_edge objects, which do not have information for the y-axis. The objects under the rect_edges attribute conform to the required form. Hence, the solution runs when the table settings passed to curves_to_edges are adapted as follows:

# Table settings.
ts = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": curves_to_edges(p.curves + p.rect_edges),
    "explicit_horizontal_lines": curves_to_edges(p.curves + p.rect_edges),
    "intersection_y_tolerance": 10,
}

However, in my case this specification of table settings actually undermines the quality of the results, i.e. text that is not part of the table is also mistakenly filtered out. Instead, what works well for me is simply relying on the defaults under pdfplumber 0.10.0:

from functools import partial

import pdfplumber


def filter_tables(page: pdfplumber.page.Page) -> pdfplumber.page.Page:
    # Get the bounding boxes of the tables on the page.
    # Adapted from
    # https://github.com/jsvine/pdfplumber/issues/242#issuecomment-668448246
    bboxes = [table.bbox for table in page.find_tables()]
    if bboxes:
        bbox_not_within_bboxes = partial(not_within_bboxes, bboxes=bboxes)

        # Filter out the table objects from the page.
        page = page.filter(bbox_not_within_bboxes)

    return page


def not_within_bboxes(obj, bboxes):
    """Return True if the object lies outside every table's bbox."""

    def obj_in_bbox(_bbox):
        """Define objects in box.

        See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404
        """
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)

    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

If you have any feedback or comments, this would be greatly appreciated. And thank you for this fantastic package!

lawrencenika · 2023-10-25T14:40:59Z

What if the PDF has a table with no ruling lines, such as the one I have here: https://drive.google.com/file/d/167Y6KKW5cv0-7r8FV830iofWvYZWP4b6/view?usp=sharing? I tried running your code, but it fails with KeyError: 'y1'. @samkit-jain @pcschreiber1

Since you mentioned that the default table settings should work under pdfplumber 0.10.0, I also tried them on this PDF, but they did not filter out the table 1 content.

samkit-jain · 2023-10-30T11:27:58Z

@lawrencenika For the PDF you shared, it is not going to be a simple affair. What you can try is writing your own logic to detect where a table starts and ends. The horizontal lines at the top and bottom of a table will be useful for determining that.
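
One way such custom logic might look, as a sketch only: it assumes a single table per page, framed by horizontal rules whose vertical positions are already known, and drops every word whose vertical midpoint lies between the topmost and bottommost rule. The helper name drop_words_between_rules is hypothetical, and this has not been tried on the PDF shared above:

```python
def drop_words_between_rules(words, rule_tops):
    """Drop words whose vertical midpoint lies between the outermost rules.

    `words` are dicts with "top" and "bottom" keys; `rule_tops` are the
    vertical positions of the horizontal rules assumed to frame one table.
    """
    if len(rule_tops) < 2:
        # Without both a top and a bottom rule there is no region to remove.
        return list(words)

    table_top, table_bottom = min(rule_tops), max(rule_tops)

    def v_mid(word):
        return (word["top"] + word["bottom"]) / 2

    return [w for w in words if not (table_top <= v_mid(w) <= table_bottom)]


# Hypothetical usage with pdfplumber (not executed here):
#   rule_tops = [l["top"] for l in page.lines if l["top"] == l["bottom"]]
#   kept = drop_words_between_rules(page.extract_words(), rule_tops)
```

Pages with several tables, or with decorative rules elsewhere on the page, would need the rules grouped into pairs first.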

ibrahimshuail added the feature-request label Aug 2, 2020

jsvine closed this as completed Aug 2, 2020

jsvine removed the feature-request label Aug 2, 2020

samkit-jain mentioned this issue Sep 21, 2020

Fetch non tabular data from PDF #276

Closed

samkit-jain mentioned this issue Oct 28, 2020

How to extract text and tables from pdf pages and delete duplicate text of tables from the result text? #300

Closed

samkit-jain added the tips-and-tricks label Oct 29, 2020

This was referenced Nov 13, 2020

extract_text() from pages should not extract data within tables #313

Closed

Extract text without tables #314

Closed

jsvine mentioned this issue Feb 23, 2021

Is there a way to extract only text without tables ? #357

Closed

gjreda mentioned this issue Jun 30, 2023

Extract raw PDF text even if Grobid parsing fails refstudio/refstudio#175

Closed

a0js mentioned this issue Mar 12, 2024

pdfreader as a new file format reader redstreet/beancount_reds_importers#93

Closed
