Exceptions / missing spaces in extract_text() method #17

Closed
mstamy2 opened this issue Jul 30, 2013 · 13 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@mstamy2
Collaborator

mstamy2 commented Jul 30, 2013

extractText() method isn't broken, but throws some exceptions in these cases:

http://doctor12wer.blogspot.com/2013/06/extracttext-function-in-pypdf2-throws.html

http://stackoverflow.com/questions/17270387/pypdf2-typeerror-when-trying-to-extract-text

@tnorth

tnorth commented Nov 7, 2013

Hello,

Works for me, but the extracted text contains no spaces :/

from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("foo.pdf", "rb"))
print(reader.getPage(0).extractText())

Is that a known issue?

@tnorth

tnorth commented Nov 7, 2013

To be clearer, the issue seems to appear for two-column papers, this one for example:
www.rowland.harvard.edu/rjf/vollmer/images/vollmer_fischer.pdf

@mstamy2
Collaborator Author

mstamy2 commented Nov 7, 2013

The extractText method is probably a little crude, and it definitely doesn't work well for PDFs with complicated text layouts. It could use some work to return text in a more orderly fashion that more closely resembles what you see in a PDF viewer.

@alisufian

Another PDF where whitespace is not preserved in the extracted text:
http://webapp.psc.state.md.us/Intranet/Casenum/NewIndex3_VOpenFile.cfm?ServerFilePath=C:\Casenum\9100-9199\9155\\354.pdf

@kursataker

I tried to extract Arabic text from a PDF file using the extractText() method. However, the Arabic text is missing from the output.

@Lerchensporn

Lerchensporn commented May 14, 2016

To resolve the problem of missing whitespace, I propose the following for-loop in the extractText method. The part below “text += i” is new. The limit “i < -100”, below which a spacing value is treated as whitespace, is chosen arbitrarily; in a typical Springer PDF book, values of -300 to -200 mark a whitespace. Although this may look like a hack, I can think of no other criterion for whitespace in such documents.
Edit: Furthermore, I suggest removing “text += "\n"” after the TJ operator, because it breaks words in some documents.
Handling of the TD, Td, and Tm operators still needs refinement.

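        for operands, operator in content.operations:
            if operator == b_("Tj"):
                # Tj: show a single text string
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                # T*: move to the start of the next line
                text += "\n"
            elif operator == b_("'"):
                # ': move to the next line, then show a text string
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_('"'):
                # ": set word/character spacing, move to the next line, show operands[2]
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                # TJ: array of strings interleaved with positioning adjustments
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                    elif isinstance(i, (FloatObject, NumberObject)):
                        # new: treat a large negative adjustment as a word gap
                        if i < -100:
                            text += " "
            elif operator == b_("TD") or operator == b_("Tm"):
                # new: a repositioning separates words unless whitespace is already there
                if len(text) > 0 and text[-1] != " " and text[-1] != "\n":
                    text += " "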
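
The threshold heuristic can be tried in isolation, without PyPDF2. The following is a minimal sketch, assuming the operand array of a TJ operator is modeled as a plain Python list of strings and numbers. The names `join_tj_array` and `threshold` are hypothetical and used only for illustration; they are not part of PyPDF2.

```python
from typing import List, Union

# Assumption: a TJ array is modeled as plain strings and numbers.
TJItem = Union[str, int, float]


def join_tj_array(items: List[TJItem], threshold: float = -100) -> str:
    """Join TJ pieces, turning adjustments below `threshold` into one space.

    Numbers in a TJ array are positioning adjustments; a negative value
    widens the gap before the next string.
    """
    text = ""
    for item in items:
        if isinstance(item, str):
            text += item
        elif item < threshold:
            # Large negative adjustment: assume a word boundary.
            text += " "
    return text


# -250 is below the threshold and becomes a space; -20 is ignored as kerning.
print(join_tj_array(["Hello", -250, "world", -20, "!"]))
```

The threshold is a tuning knob rather than a constant: as noted above, the value that separates kerning from a word gap varies between documents.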

@mborus

mborus commented Nov 23, 2016

@woho's idea worked for me.
I got too many spaces, so I changed the code slightly...

                # add spaces
                # q&d - https://github.com/mstamy2/PyPDF2/issues/17
                elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                    if text and (not text[-1] in " \n"):
                        text += " "
        elif operator == b_("TD") or operator == b_("Tm"):
            if text and (not text[-1] in " \n"):
                text += " "
        # end add spaces
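
As a standalone illustration of this variant, here is a minimal sketch that again models a TJ array as a plain Python list. The function name `join_tj_array_dedup` is hypothetical. Unlike a thresholded version, it adds a space for any numeric adjustment, but never directly after existing whitespace.

```python
from typing import List, Union

# Assumption: a TJ array is modeled as plain strings and numbers.
TJItem = Union[str, int, float]


def join_tj_array_dedup(items: List[TJItem]) -> str:
    """Insert one space per run of numeric adjustments, never doubling up."""
    text = ""
    for item in items:
        if isinstance(item, str):
            text += item
        elif text and text[-1] not in " \n":
            # Any adjustment counts as a gap, unless whitespace is already there.
            text += " "
    return text


# Two consecutive adjustments still produce a single space.
print(join_tj_array_dedup(["Hello", -250, -300, "world"]))
```

Because there is no threshold, small kerning adjustments inside a word also produce a space in this sketch, so it may split words in PDFs that kern heavily.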

@chrisjcameron

chrisjcameron commented Dec 20, 2017

Some PDFs apparently generate empty operands. If this condition is explicitly checked, then I can avoid some thrown exceptions:

_text = operands[0] throws an exception if operands is empty.

Quick fix:

for operands, operator in content.operations:
    if not operands:  # Empty operands list contributes no text
        operands = [""]
    if operator == b_("Tj"):
        _text = operands[0]
        if isinstance(_text, TextStringObject):
            text += _text
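
The same guard can be sketched without PyPDF2. This is an illustrative example, assuming operations are modeled as `(operands, operator)` tuples with byte-string operators; `extract_simple_text` is a hypothetical name. Instead of substituting `[""]`, this sketch checks for emptiness only where an operand is actually indexed, so operators that legitimately take no operands (such as `T*`) keep working.

```python
from typing import List, Sequence, Tuple

# Assumption: an operation is (operands, operator), with operator as bytes.
Operation = Tuple[Sequence[object], bytes]


def extract_simple_text(operations: List[Operation]) -> str:
    """Collect text from Tj and T* operations, tolerating empty operands."""
    text = ""
    for operands, operator in operations:
        if operator == b"Tj":
            # Guard: index operands[0] only when an operand exists.
            if operands and isinstance(operands[0], str):
                text += operands[0]
        elif operator == b"T*":
            # T* takes no operands and just starts a new line.
            text += "\n"
    return text


ops = [(["Hello"], b"Tj"), ([], b"Tj"), ([], b"T*"), (["world"], b"Tj")]
print(repr(extract_simple_text(ops)))
```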

@Tom-Evers

There should be a newline somewhere:

        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
                elif isinstance(i, FloatObject) or isinstance(i, NumberObject):
                    if text and (not text[-1] in " \n"):
                        text += " "
            text += "\n"

@Tom-Evers

It seems that the value of the Float/NumberObject directly encodes the distance between two pieces of text, with the width of one space equaling -600:

                if text and (not text[-1] in " \n"):
                    text += " " * int(i / -600)
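
To see how the proportional rule behaves on concrete values, here is a minimal standalone sketch. It assumes a TJ array modeled as a plain Python list; `join_tj_proportional` and `space_width` are hypothetical names. The default of 600 is only the value observed in this comment; the real width of a space depends on the font.

```python
from typing import List, Union

# Assumption: a TJ array is modeled as plain strings and numbers.
TJItem = Union[str, int, float]


def join_tj_proportional(items: List[TJItem], space_width: float = 600) -> str:
    """Convert each negative adjustment into a proportional number of spaces."""
    text = ""
    for item in items:
        if isinstance(item, str):
            text += item
        elif text and text[-1] not in " \n":
            # int() truncates toward zero, so gaps narrower than one
            # space_width add nothing; positive adjustments are clamped to 0.
            count = max(0, int(-item / space_width))
            text += " " * count
    return text


# -1200 is two space widths, so two spaces are inserted.
print(repr(join_tj_proportional(["a", -1200, "b"])))
```

A gap narrower than one full `space_width` yields no space at all in this sketch, which is the opposite trade-off from inserting a space for every adjustment.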

@MartinThoma
Member

A lot of the whitespace issues got fixed via #569

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Apr 16, 2022
@MartinThoma
Member

#924 Improved further on the whitespace issue

@MartinThoma
Member

I think it is fixed.

Minimal example

from PyPDF2 import PdfReader

reader = PdfReader("vollmer_fischer.pdf")  # www.rowland.harvard.edu/rjf/vollmer/images/vollmer_fischer.pdf
text = reader.pages[0].extract_text()

text now is:

Ring-resonator-based frequency-domain opticalactivity measurements of a chiral liquid
Frank Vollmer and Peer Fischer
The Rowland Institute at Harvard, Harvard University, Cambridge, Massachusetts 02142Received September 22, 2005; revised November 11, 2005; accepted November 12, 2005; posted November 16, 2005 (Doc. ID 64961)
Chiral liquids rotate the plane of polarization of linearly polarized light and are therefore optically active.Here we show that optical rotation can be observed in the frequency domain. A chiral liquid introduced in afiber-loop ring resonator that supports left and right circularly polarized modes gives rise to relative fre-quency shifts that are a direct measure of the liquid’s circular birefringence and hence of its optical activity.The effect is in principle not diminished if the circumference of the ring is reduced. The technique is simi-larly applicable to refractive index and linear birefringence measurements.
© 2006 Optical Society ofAmericaOCIS codes:260.1440, 120.5410
.Natural optical activity arises because a medium hasdifferent refractive indices for left (/H11002) and right (/H11001)circularly polarized light. The optical rotation, in ra-dians, developed over a path lengthlis a function ofthe wavelength/H9261and is given by
/H9258=/H9266l

/H9261/H20851n/H20849−/H20850−n/H20849+/H20850/H20852./H208491/H20850The circular birefringence,n
/H20849−/H20850−n/H20849+/H20850, is, however,even in a pure chiral liquid small and at most a fewparts in 10
6. It is thus desirable to increase the effec-tive path length through the optically active mediumwithout the need for large sample volumes. This canbe achieved in an optical cavity as long as one en-sures that the optical rotation does not cancel on theround trip, which in practice one can accomplish byplacing quarter-wave plates in the cavity.
1Signifi-cant enhancements in sensitivity compared withsingle-pass instruments have been reported for mea-surements that make use of Fabry–Perotresonators,
1–3including polarization-sensitive imple-mentations of cavity-ringdown spectroscopy,4,5aswell as laser cavities.6,7Both single-pass and multi-pass techniques typically determine the rotation inEq. (1) via intensity measurements that either re-quire rotating polarization optics or separate the or-thogonally polarized components of the light andtherefore require a balanced detection scheme.In this Letter we show that circular birefringence(optical rotation) can also be determined by fre-quency measurements. Left and right circularly po-larized modes acquire unequal phases when a chiralliquid is introduced into a resonator such that theirresonance frequencies shift relative to each other. Wedemonstrate the method, using a fiber optic ringresonator in combination with a narrow-linewidth cwlaser.A fiber-loop resonator
8,9may be considered to be afiber- or waveguide-based Fabry–Perot resonatorthat consists of a closed fiber loop in contact with alinear waveguide via a variable (directional) coupler.A resonance in the ring requires that the optical pathlength be a multiple of the wavelength of the light.Resonances are observed as minima in a transmis-sion spectrum whenever an integral multiple of thewavelength in the ring equals the circumference ofthe fiber loop. A shift in the resonance wavelength oc-curs if either the path length or the refractive indexchanges. Refractive indices may be measured by tun-ing the frequency of a laser with a sufficiently narrowlinewidth.Introduction of a sample with refractive indexnsinto the ring resonator will cause a wavelength shiftof the resonances relative to the reference mediumwith refractive indexn
0, which may, for instance, beair:/H9004/H9261
/H9261=ns−n0

nefff,/H208492/H20850wherefis the fraction of the total ring circumferencethat contains the optically active sample.n
effis an ef-fective refractive index used to describe the entirefiber-loop resonator in the presence of the referencemedium and corresponds to the round-trip phase2
/H9266neffL//H9261acquired by a resonant mode at wave-length/H9261, where the circumference (fiber and free-space part) isL.The inherent birefringence of a bent optical fiberwill in general give rise to resonant modes with dif-ferent polarization states.
10These modes may beused to generate circularly polarized modes that aresensitive to chirality. A wavelength shift that is equalin magnitude and opposite in sign for the two circu-larly polarized modes is a direct function of the liq-uid’s circular birefringence and hence of its opticalactivity. Thus, particular interest are relativechanges in the resonance wavelengths of a pair of leftand right circularly polarized modes centered at/H9261:
/H20879/H9004/H9261/H20849−/H20850−/H9004/H9261/H20849+/H20850

/H9261/H20879=n/H20849−/H20850−n/H20849+/H20850

nefff,/H208493/H20850where any common mode noise is automaticallyeliminated. It can also be seen that the equation de-scribing optical activity in a ring resonator is inde-pendent of the actual dimension of the ring. For agiven finesse and a given fractionf, a reduction in thesize of the ring does not lead to a loss of sensitivity.February 15, 2006 / Vol. 31, No. 4 / OPTICS LETTERS4530146-9592/06/040453-3/$15.00 © 2006 Optical Society of America

@MartinThoma MartinThoma changed the title extractText() method Exceptions in extract_text() method Jun 6, 2022
@MartinThoma MartinThoma changed the title Exceptions in extract_text() method Exceptions / missing spaces in extract_text() method Jun 6, 2022
MartinThoma added a commit that referenced this issue Jun 6, 2022
The highlight of the 2.1.0 release is the most massive improvement to the
text extraction capabilities of PyPDF2 since 2016 🥳🎊 A very big thank you goes
to [pubpub-zz](https://github.com/pubpub-zz) who invested a lot of time and
knowledge of the PDF format to finally get those improvements into PyPDF2.
Thank you 🤗💚

In case the new function causes any issues, you can use `_extract_text_old`
for the old functionality. Please also open a bug ticket in that case.

Several people have attempted to bring similar improvements to
PyPDF2. All of those attempts were valuable. The main reason they didn't get merged
is the large number of open PRs / issues. pubpub-zz's PR was the most comprehensive
one and also incorporated the latest changes of PyPDF2 2.0.0.

Thank you to [VictorCarlquist](https://github.com/VictorCarlquist) for #858 and
[asabramo](https://github.com/asabramo) for #464 🤗

New Features (ENH):
-  Massive text extraction improvement (#924). Closed many open issues:
    - Exceptions / missing spaces in extract_text() method (#17) 🕺
      - Whitespace issues in extract_text() (#42) 💃
      - pypdf2 reads the hifenated words in a new line (#246)
    - PyPDF2 failing to read unicode character (#37)
      - Unable to read bullets (#230)
    - ExtractText yields nothing for apparently good PDF (#168) 🎉
    - Encoding issue in extract_text() (#235)
    - extractText() doesn't work on Chinese PDF (#252)
    - encoding error (#260)
    - Trouble with apostophes in names in text "O'Doul" (#384)
    - extract_text works for some PDF files, but not the others (#437)
    - Euro sign not being recognized by extractText (#443)
    - Failed extracting text from French texts (#524)
    - extract_text doesn't extract ligatures correctly (#598)
    - reading spanish text - mark convert issue (#635)
    - Read PDF changed from text to random symbols (#654)
    - .extractText() reads / as 1. (#789)
-  Update glyphlist (#947) - inspired by #464
-  Allow adding PageRange objects (#948)

Bug Fixes (BUG):
-  Delete .python-version file (#944)
-  Compare StreamObject.decoded_self with None (#931)

Robustness (ROB):
-  Fix some conversion errors on non conform PDF (#932)

Documentation (DOC):
-  Elaborate on PDF text extraction difficulties (#939)
-  Add logo (#942)
-  rotate vs Transformation().rotate (#937)
-  Example how to use PyPDF2 with AWS S3 (#938)
-  How to deprecate (#930)
-  Fix typos on robustness page (#935)
-  Remove scripts (pdfcat) from docs (#934)

Developer Experience (DEV):
-  Ignore .python-version file
-  Mark deprecated code with no-cover (#943)
-  Automatically create Github releases from tags (#870)

Testing (TST):
-  Text extraction for non-latin alphabets (#954)
-  Ignore PdfReadWarning in benchmark (#949)
-  writer.remove_text (#946)
-  Add test for Tree and _security (#945)

Code Style (STY):
-  black, isort, Flake8, splitting buildCharMap (#950)

Full Changelog: 2.0.0...2.1.0
@MartinThoma MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Jan 14, 2023

9 participants