Text extraction #377
-
How can I extract text from a particular area without the surrounding lines?
Replies: 3 comments 1 reply
-
Hi @erkin98, could you please share the PDF and give more details on where that area is in the PDF?
-
https://drive.google.com/file/d/18smAX6VTvqbfyEQ6e20Tg6iOl3mDBMZF/view?usp=sharing
-
Thanks for sharing the PDF, @erkin98. Since that top-left text is not wrapped in a rect object, there is no straightforward way to extract it. One workaround is to get the coordinates of the horizontal lines at the top and bottom of the text, crop the page, and then extract text from the cropped page. You can do so by running the following code:

import pdfplumber

pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]
top_line = page.horizontal_edges[2]  # The top line is the 3rd horizontal edge.
bottom_line = page.horizontal_edges[4]  # The bottom line is the 5th horizontal edge.
# Build a bounding box (x0, top, x1, bottom) from the coordinates of the two horizontal edges.
page = page.crop(
    (top_line["x0"], top_line["top"], top_line["x1"], bottom_line["bottom"])
)
print(page.extract_text())
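The edge indexes in the code above are specific to this PDF, so it may help to separate the bounding-box arithmetic from the edge lookup. The following is a minimal sketch, not part of pdfplumber: `bbox_between_edges` is a hypothetical helper, and it assumes each edge is a dict with `x0`, `x1`, `top`, and `bottom` keys, as the entries of `page.horizontal_edges` are in the code above. Unlike the original snippet, it takes the widest horizontal extent of the two edges and accepts them in either order.

```python
def bbox_between_edges(edge_a, edge_b):
    """Return (x0, top, x1, bottom) covering the area between two horizontal edges.

    Hypothetical helper: each edge is assumed to be a dict with
    "x0", "x1", "top" and "bottom" keys.
    """
    # Sort by vertical position so the caller may pass the edges in any order.
    upper, lower = sorted((edge_a, edge_b), key=lambda edge: edge["top"])
    # Use the widest horizontal extent of the two edges.
    x0 = min(upper["x0"], lower["x0"])
    x1 = max(upper["x1"], lower["x1"])
    return (x0, upper["top"], x1, lower["bottom"])


# Made-up coordinates standing in for two entries of page.horizontal_edges.
top_line = {"x0": 30.0, "x1": 200.0, "top": 50.0, "bottom": 50.5}
bottom_line = {"x0": 30.0, "x1": 200.0, "top": 120.0, "bottom": 120.5}

bbox = bbox_between_edges(bottom_line, top_line)
```

With a real page, the result could be passed straight to `page.crop(bbox)` before calling `extract_text()`, with the two edges chosen as in the code above.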