How to exclude header and footer while extracting text ? #968

Laxmi530 · 2023-08-11T14:49:36Z


Hi,
I am extracting text from a PDF file and processing it, but I noticed that if the PDF has a header and footer on every page, both are included in the output for each page, which I don't want. Is there a way to skip the header and footer, or to extract only the header and footer so that I can replace them with an empty string?
Below is the code I am using for text extraction.

import pdfplumber

def pdf_text_extract(file):
    text = ""
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            # Older pdfplumber versions may return None for a page with no text
            text += page.extract_text(x_tolerance=1) or ""
    return text

Could someone please guide me on how to do this?

Thank you in advance.

Answered by jsvine


jsvine · 2023-08-11T15:56:30Z

Maintainer

Hi @Laxmi530, and thanks for your interest in pdfplumber. PDFs don't have a specific concept of a header or footer; whatever looks like a header or footer to a human is a design decision made by the PDF's creator. That said, if you know where the header ends and the footer begins, you can use page.crop((x0, top, x1, bottom)).extract_text(...) to get just the text in the core region.
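
A minimal sketch of the cropping approach described above. The helper names (`body_bbox`, `extract_body_text`) and the default band heights of 50 points are illustrative assumptions, not part of pdfplumber; the real header and footer heights have to be measured for each document, for example by inspecting the `top` and `bottom` values of the words on a sample page.

```python
def body_bbox(page_bbox, header_height, footer_height):
    """Shrink a page's (x0, top, x1, bottom) box to exclude header and footer bands."""
    x0, top, x1, bottom = page_bbox
    new_top = top + header_height
    new_bottom = bottom - footer_height
    if new_top >= new_bottom:
        raise ValueError("header and footer heights leave no body region")
    return (x0, new_top, x1, new_bottom)


def extract_body_text(path, header_height=50, footer_height=50, **extract_kwargs):
    """Extract text from every page, skipping the header and footer bands.

    header_height and footer_height are in PDF points and are assumed
    to be the same on every page (hypothetical defaults).
    """
    import pdfplumber  # imported lazily so body_bbox() works without pdfplumber

    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page.bbox is (x0, top, x1, bottom) in the coordinates crop() expects
            body = page.crop(body_bbox(page.bbox, header_height, footer_height))
            # Older pdfplumber versions may return None for a region with no text
            parts.append(body.extract_text(**extract_kwargs) or "")
    return "\n".join(parts)
```

With the code from the question, this could be called as `extract_body_text(file, x_tolerance=1)`. Starting from `page.bbox` rather than `(0, 0, width, height)` keeps the crop valid for pages whose bounding box does not start at the origin.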

3 replies

Laxmi530 Aug 13, 2023
Author

Thanks @jsvine for the solution; I will try this out.

dhdaines Aug 17, 2023

Note that some PDFs do have a specific concept of headers and footers, which show up as marked content sections with the tag Artifact and a Subtype attribute of Header or Footer. This is part of the Tagged PDF standard; see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#page=583

The Artifact tag is accessible by checking the tag field of any object if you are using this branch (@jsvine could you review this soon if you have time): #961

dhdaines Aug 17, 2023

... but the attributes for a marked content section aren't accessible. I know how to get them, but I'm not totally sure how to expose them in pdfplumber's API.
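
A rough sketch of how the tag-based approach might look, assuming a pdfplumber build in which each object dict carries a `tag` key, as described for the branch above. Whether that key is present depends on the installed version, and the helper names here are hypothetical. Because the Subtype attribute is not exposed, this drops every Artifact (which can include page numbers and background decoration), not only headers and footers.

```python
def is_artifact(obj):
    """True if a pdfplumber object dict carries the marked-content tag 'Artifact'.

    Assumes objects expose a 'tag' key; objects without one are kept.
    """
    return obj.get("tag") == "Artifact"


def extract_text_without_artifacts(path, **extract_kwargs):
    """Extract text from every page after filtering out Artifact-tagged objects."""
    import pdfplumber  # imported lazily so is_artifact() works without pdfplumber

    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page.filter() keeps only the objects for which the test returns True
            kept = page.filter(lambda obj: not is_artifact(obj))
            parts.append(kept.extract_text(**extract_kwargs) or "")
    return "\n".join(parts)
```

This only helps for PDFs that were produced with tagging; for untagged files nothing is filtered and the output matches plain `extract_text`.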
