Fixed extractText()-Not returning text with spaces #569

inboxsgk · 2020-07-25T02:45:36Z

Previously, .extractText() read the text in the PDF and returned it without any spaces.
This fix modifies pdf.py to add a space (" ") between words.

Here is an example:
Original Sentence : "The quick brown fox jumps over the lazy dog"

Previous Output : "Thequickbrownfoxjumpsoverthelazydog"

Output After fix : "The quick brown fox jumps over the lazy dog"

MartinThoma · 2022-04-06T05:45:44Z

Thank you for the contribution! I'm sorry that it took so long. I'll try to be quicker in the future 🤞

Features:
- Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (#612)
- Add whitespace between words for extractText() (#569, #334)
- "invalid escape sequence" SyntaxError (#522)
- Avoid error when printing warning in pythonw (#486)
- Stream operations can be List or Dict (#665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):
- Add GitHub Action which automatically runs unit tests via pytest and static code analysis with Flake8 (#660)
- Add several unit tests (#661, #663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (#288)
- Use setuptools instead of distutils (#599)
- Improvements for the PyPI page (#644)
- Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0

Viennoiserie · 2022-06-02T16:05:52Z

Could you please tell me which directory contains the PyPDF2 source file with the extractText() method?

MartinThoma · 2022-06-02T16:08:43Z

It's in _page.py

Viennoiserie · 2022-06-02T16:17:38Z

I just found the -page.py in the "PyPDF2" files (outside of the pycache folder) ... the problem is that it only has 1000 or so lines, whereas the one modified on git has about 3000 ... maybe I don't have the right file or version (yet I installed the package yesterday :/)

Viennoiserie · 2022-06-02T16:18:00Z

_page.py *
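
When the installed file looks different from the one in the repository, one way to check what Python is actually importing is to ask the import system for the module's path and the installed distribution's version. This is a minimal sketch using only the standard library; `locate_module` and `installed_version` are hypothetical helper names, and the sketch assumes Python 3.8 or later.

```python
import importlib.util
from importlib import metadata


def locate_module(name):
    """Return the file path a module would be imported from, or None if not found."""
    try:
        # Note: for a dotted name, find_spec imports the parent package first.
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        return None
    return spec.origin if spec is not None else None


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Both calls print None when PyPDF2 is not installed in this environment.
    print(locate_module("PyPDF2._page"))
    print(installed_version("PyPDF2"))
```

Comparing the printed version with the release the pull request shipped in shows whether the local file is expected to match the one on GitHub.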

Viennoiserie · 2022-06-02T16:24:37Z

I just replaced my "_page.py" file with a copy of the one on git ... still not working. If you don't mind, could you tell me where the problem might be?

pubpub-zz · 2022-06-02T16:41:25Z

@Viennoiserie / @inboxsgk ,
can you provide an example PDF file where you are getting the issue, for analysis?

Viennoiserie · 2022-06-02T16:50:02Z

TEST.pdf

I am trying to write a function (for my web app) that appends all the words contained in the PDF to an array. The app then finds the words requested by the user ... so I went into Word, wrote text that would be "hard" for Python to work with, and the results aren't the ones I wanted:

I expect: ['Thomas', 'Vienot', 'CACA', 'Partie']

but I get: ['Thomas', 'VienotCACA', 'Partie']

Viennoiserie · 2022-06-02T16:52:01Z

If you want I can also provide my code (nothing too complex, but I think it should work):

from PyPDF2 import PdfFileReader

def pdf_to_words(file_name):
    # Read the extracted text of every page, then close the file.
    pdf_obj = open(file_name + '.pdf', 'rb')
    pdf_array = []
    word_array = []

    pdf_reader = PdfFileReader(pdf_obj)
    nb_page = pdf_reader.numPages

    for i in range(nb_page):
        pdf_array.append(pdf_reader.getPage(i).extractText())

    pdf_obj.close()

    # Build words character by character: any ASCII character that is
    # not a letter (A-Z, a-z) ends the current word.
    for i in range(len(pdf_array)):
        word = ""

        for j in range(len(pdf_array[i])):
            if(ord(pdf_array[i][j]) not in range(0,65) and ord(pdf_array[i][j]) not in range(91,97) and ord(pdf_array[i][j]) not in range(123,128)):
                word += pdf_array[i][j]
            else:
                if(word != ""):
                    word_array.append(word)
                word = ""

    return(word_array)

def main():
    file_name = "TEST"
    # file_name = "VIENOT_Thomas"

    word_array = pdf_to_words(file_name)

    print(word_array)

if __name__ == "__main__":
    main()
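
One property of the range checks above may be worth testing in isolation: they only treat ASCII non-letters (code points below 128) as separators, so any character with a code point of 128 or higher, such as a non-breaking space (U+00A0), is kept as part of the word. The sketch below applies the same test to a plain string, without PyPDF2. `split_words` is a hypothetical name, and unlike the code above it also flushes a trailing word at the end of the text. Whether this is what happens with TEST.pdf is not established in this thread.

```python
def split_words(text):
    """Split text into words using the same ASCII-range test as pdf_to_words."""
    words = []
    word = ""
    for ch in text:
        code = ord(ch)
        # Separators: ASCII characters that are not letters.
        is_separator = (
            code in range(0, 65)
            or code in range(91, 97)
            or code in range(123, 128)
        )
        if not is_separator:
            word += ch
        elif word:
            words.append(word)
            word = ""
    if word:
        # Flush the last word (the code above drops it if no separator follows).
        words.append(word)
    return words


if __name__ == "__main__":
    print(split_words("Thomas Vienot CACA Partie"))  # ['Thomas', 'Vienot', 'CACA', 'Partie']
    print(split_words("Vienot\u00a0CACA"))           # ['Vienot\xa0CACA']
```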

pubpub-zz · 2022-06-02T18:36:05Z

@Viennoiserie,
The issue is not within PyPDF2.
If you just run extract_text on your PDF you get :
' Thomas \n \n Vienot CACA\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n '
There seem to be lots of empty lines, but I've checked and they are part of your file.
I would propose this solution:
[x for x in "".join([x if x.isalnum() else " " for x in PyPDF2.PdfReader("TEST(3).pdf").pages[0].extractText().replace("\n", "")]).split(" ") if x != ""]
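
For readability, the one-liner above can be sketched as a small function that works on an already extracted string, so it can be tried without a PDF. `words_from_text` is a hypothetical name, and the sample string is a shortened version of the output quoted above.

```python
def words_from_text(text):
    """Replace every non-alphanumeric character with a space, then split into words."""
    cleaned = "".join(
        ch if ch.isalnum() else " " for ch in text.replace("\n", "")
    )
    # split() with no argument drops the empty strings between repeated spaces.
    return cleaned.split()


if __name__ == "__main__":
    sample = " Thomas \n \n Vienot CACA\n \n "
    print(words_from_text(sample))  # ['Thomas', 'Vienot', 'CACA']
```

Because every non-alphanumeric character is turned into a space anyway, the `replace("\n", "")` step is not strictly needed; removing it would also avoid merging two words that are separated only by a newline.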

Viennoiserie · 2022-06-02T19:20:15Z

Thank you, indeed, I have tried my program on other PDFs and there was no problem :/
Sorry for bothering !

MartinThoma added the is-bug (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and PdfReader (the PdfReader component is affected) labels Apr 6, 2022

MartinThoma merged commit 02cc54b into py-pdf:master Apr 6, 2022

MartinThoma mentioned this pull request Apr 16, 2022

Exceptions / missing spaces in extract_text() method #17

Closed

MartinThoma added the whitespace label Jan 14, 2023 ("While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.")
