Generalize Parser to handle all types of PDFs (2-cols, 3-cols, or Combination) #974

Closed Answered by samkit-jain
DevanshuBrahmbhatt asked this question in Q&A

Hi @DevanshuBrahmbhatt, thanks for your interest in the library. Could you please provide more information on what you want to achieve here? Assuming you want to read an N-column PDF column by column, you can refer to the code I shared at #975 (comment), which gives you all the vertical lines that divide the PDF into columns. Once you have those, you can recursively crop the page and extract the text. Something like:

import math

import pdfplumber

pdf = pdfplumber.open("tests/pdfs/federal-register-2020-17221.pdf")  # https://github.com/jsvine/pdfplumber/blob/stable/tests/pdfs/federal-register-2020-17221.pdf
page = pdf.pages[0]

# Crop the top and bottom 5% of the page.
page = page.crop((0, 0.05 * float(page.height), page.width, 0.95 * float(page.height)))
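
A minimal sketch of the crop-and-extract step described above, assuming the x-positions of the column dividers have already been found (for example with the code referenced at #975). The helper names `column_bboxes` and `extract_columns` and the `margin_frac` parameter are hypothetical and not part of pdfplumber. The only pdfplumber calls used are `page.crop(bbox)` and `extract_text()`. It loops over the columns instead of recursing, which should give the same result for a flat N-column layout.

```python
def column_bboxes(width, height, dividers, margin_frac=0.05):
    """Return one (x0, top, x1, bottom) box per column, left to right.

    `dividers` are the x-coordinates of the vertical lines between columns
    (assumed to be known already). `margin_frac` trims that fraction of the
    page height from both the top and the bottom.
    """
    top = margin_frac * height
    bottom = (1 - margin_frac) * height
    # Keep only dividers strictly inside the page, sorted and de-duplicated.
    xs = sorted({float(x) for x in dividers if 0 < x < width})
    edges = [0.0] + xs + [float(width)]
    return [(x0, top, x1, bottom) for x0, x1 in zip(edges, edges[1:])]


def extract_columns(page, dividers, margin_frac=0.05):
    """Crop `page` once per column and return the text of each column."""
    boxes = column_bboxes(float(page.width), float(page.height),
                          dividers, margin_frac)
    # extract_text() can return None for an empty region, so fall back to "".
    return [page.crop(box).extract_text() or "" for box in boxes]
```

With a page opened as above, `"\n".join(extract_columns(page, dividers))` would give the text in reading order, where `dividers` is the list of x-positions you computed. For a page that mixes 2-column and 3-column regions, you would likely need to split it into horizontal bands first and run this on each band with its own dividers.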

Answer selected by samkit-jain