Text orientation #1071

matteodefra · 2022-07-07T10:02:04Z

Explanation

Hello, I have some PDF documents that contain normal text in portrait orientation, plus lateral text along the side of the page that is oriented in landscape mode.
How can I tell PyPDF2 to extract only the text that follows the page's portrait orientation and to ignore the landscape text?

I attach an image example

Basically what I want is to extract the series of "example" above and ignore the three "example" rotated by 90 degrees

Thank you in advance for your help

MartinThoma · 2022-07-10T11:41:29Z

Some more evidence that people want this feature: https://stackoverflow.com/q/52530293/562769

pubpub-zz · 2022-07-27T19:36:16Z

@MartinThoma, @MasterOdin, @mtd91429,
the current parameters are the following:
extract_text(self, Tj_sep: str = "", TJ_sep: str = "", space_width: float = 200.0) -> str

Tj_sep and TJ_sep are no longer used. I would propose taking the opportunity of introducing the orientation to remove them:
extract_text(self, orientations: Union[int, Tuple[int]] = (0,90,270,360), space_width: float = 200.0) -> str

What is your opinion?
Edited to add: a single int is also acceptable.

pubpub-zz · 2022-07-27T20:54:43Z

some examples of calls:
page.extract_text(0) => extract all text strings oriented up
page.extract_text((0,)) => extract all text strings oriented up (synonym)
page.extract_text((0,180)) => extract all text strings oriented up or down
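
To make the call forms above concrete, here is a minimal, self-contained sketch of how filtering by orientation could work. It is not PyPDF2's implementation: `orientation_of`, `select_text`, and the `(text, matrix)` fragment layout are hypothetical names invented for this illustration. It assumes the angle is measured counterclockwise from the first two entries of a text matrix, and it defaults to keeping all four directions.

```python
import math
from typing import List, Sequence, Tuple, Union

# Hypothetical layout: a piece of text plus the (a, b, c, d) part of its text matrix.
Fragment = Tuple[str, Tuple[float, float, float, float]]


def orientation_of(matrix: Sequence[float]) -> int:
    """Return the rotation of a text matrix, snapped to 0, 90, 180 or 270."""
    a, b = matrix[0], matrix[1]
    # For a pure rotation by theta, a = cos(theta) and b = sin(theta).
    angle = math.degrees(math.atan2(b, a))
    # Snap to the nearest multiple of 90 and normalize into [0, 360).
    return (int(round(angle / 90.0)) * 90) % 360


def select_text(
    fragments: Sequence[Fragment],
    orientations: Union[int, Tuple[int, ...]] = (0, 90, 180, 270),
) -> List[str]:
    """Keep only the fragments whose orientation is in `orientations`."""
    # Accept a single int as shorthand for a one-element tuple.
    if isinstance(orientations, int):
        orientations = (orientations,)
    # Normalizing with % 360 lets 360 act as a synonym for 0.
    wanted = {o % 360 for o in orientations}
    return [text for text, matrix in fragments if orientation_of(matrix) in wanted]


fragments = [
    ("example", (1, 0, 0, 1)),        # upright text
    ("side note", (0, 1, -1, 0)),     # rotated by 90 degrees
    ("upside down", (-1, 0, 0, -1)),  # rotated by 180 degrees
]

print(select_text(fragments, 0))
print(select_text(fragments, (0,)))
print(select_text(fragments, (0, 180)))
```

As in the calls listed above, passing `0` and passing `(0,)` select the same fragments, and `(0, 180)` additionally keeps text that is upside down.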

MartinThoma · 2022-07-30T06:38:33Z

Thank you so much @pubpub-zz ! I didn't think that this was possible 😲

@matteodefra We will have a release tomorrow with this change.

matteodefra · 2022-07-30T10:12:49Z

Thank you so much @pubpub-zz and @MartinThoma !

Liin159 · 2022-12-07T15:30:18Z

@matteodefra did this solution work for you in the end?

matteodefra assigned MartinThoma Jul 7, 2022

MartinThoma removed their assignment Jul 9, 2022

pubpub-zz mentioned this issue Jul 27, 2022

ENH: Add orientation param for text_extraction (# 1071) #1175

Merged

MartinThoma added the is-feature A feature request label Jul 29, 2022

MartinThoma closed this as completed in 8a27fa4 Jul 30, 2022
