Fixed extractText()-Not returning text with spaces #569
Previously, .extractText() read the text in the PDF and returned it without any spaces. This fix modifies pdf.py to add a space (" ") between words.
Here is an example:
Original sentence: "The quick brown fox jumps over the lazy dog"
Previous output: "Thequickbrownfoxjumpsoverthelazydog"
Output after the fix: "The quick brown fox jumps over the lazy dog"
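To illustrate the general idea of restoring word gaps, here is a minimal sketch. It is not the code from this PR: `join_tj_operands` and its `gap_threshold` parameter are hypothetical names, and the input is a simplified stand-in for the operand array of a PDF `TJ` operator (strings interleaved with numeric position adjustments, where a strongly negative number moves the next glyph to the right). The threshold value is an assumption chosen for the example only.

```python
def join_tj_operands(operands, gap_threshold=-200):
    """Join a simplified TJ operand array into text (hypothetical helper).

    Strings are appended as-is. A numeric adjustment at or below
    gap_threshold is treated as a word gap and becomes a single space.
    """
    parts = []
    for item in operands:
        if isinstance(item, str):
            parts.append(item)
        elif item <= gap_threshold and parts and not parts[-1].endswith(" "):
            # Large rightward shift: assume it separates two words.
            parts.append(" ")
        # Small adjustments are ordinary kerning and are ignored.
    return "".join(parts)


if __name__ == "__main__":
    print(join_tj_operands(["The", -300, "quick", -300, "brown", -300, "fox"]))
```

A heuristic like this can misfire in both directions: some PDFs position every glyph individually, and others use small adjustments between words, so the threshold would need tuning per document.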
Thank you for the contribution! I'm sorry that it took so long. I'll try to be quicker in future 🤞
Features:
- Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (#612)
- Add whitespace between words for extractText() (#569, #334)
- "invalid escape sequence" SyntaxError (#522)
- Avoid error when printing warning in pythonw (#486)
- Stream operations can be List or Dict (#665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):
- Add GitHub Action which automatically runs unit tests via pytest and static code analysis with Flake8 (#660)
- Add several unit tests (#661, #663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (#288)
- Use setuptools instead of distutils (#599)
- Improvements for the PyPI page (#644)
- Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0
Could you please tell me in which directory I can find the PyPDF2 source file containing the extractText() method?
It's in _page.py
I just found _page.py in the "PyPDF2" files (outside of the pycache folder). The problem is that it only has 1000 or so lines, whereas the one modified on git has about 3000. Maybe I don't have the right file or version (yet I installed the package yesterday :/)
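One way to check which copy of a package Python is actually importing, and which version is installed, is sketched below. `describe_package` is a hypothetical helper built only on the standard library (Python 3.8+); it assumes the import name and the distribution name are the same, which is the case for PyPDF2.

```python
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return the import path and installed version of a package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # the package is not importable from this interpreter
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None  # importable, but no installed distribution metadata
    return {"path": spec.origin, "version": version}


if __name__ == "__main__":
    print(describe_package("PyPDF2"))
```

If the reported path or version differs from what you expected, the file you are editing is probably not the one being imported.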
_page.py
I just replaced my _page.py file with a copy of the one on git, and it is still not working. If you don't mind, could you tell me where the problem might be?
@Viennoiserie / @inboxsgk , |
I am trying to write a function (for my web app) that appends all the words contained in a PDF to a list, so that the app can then find the words the user asks for. I went into Word and wrote text that would be "hard" for Python to work with, and the results aren't the ones I wanted.
I expect: ['Thomas', 'Vienot', 'CACA', 'Partie']
but I get: ['Thomas', 'VienotCACA', 'Partie']
If you want, I can also provide my code (nothing too complex, but I think it should work):
from PyPDF2 import PdfFileReader
def pdf_to_words(file_name):
def main():
if __name__ == "__main__":
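The function bodies of the snippet above did not survive, so the following is only a sketch of what such a function might look like, not the poster's actual code. It assumes the PyPDF2 1.x API discussed in this thread (`PdfFileReader`, `getNumPages`, `getPage`, `extractText`); `text_to_words` is a hypothetical helper added for illustration.

```python
import re


def text_to_words(text):
    """Split extracted text into words on any run of whitespace."""
    return re.findall(r"\S+", text)


def pdf_to_words(file_name):
    """Collect every word from every page of a PDF (PyPDF2 1.x API assumed)."""
    from PyPDF2 import PdfFileReader  # imported here so the helper above works without PyPDF2

    words = []
    with open(file_name, "rb") as fh:
        reader = PdfFileReader(fh)
        for page_number in range(reader.getNumPages()):
            page = reader.getPage(page_number)
            words.extend(text_to_words(page.extractText()))
    return words


if __name__ == "__main__":
    # Splitting cannot restore a space that the extraction step dropped.
    print(text_to_words("Thomas VienotCACA Partie"))
```

If extractText() returns "VienotCACA" as one token, no amount of splitting afterwards will separate it. The space has to be recovered during extraction.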
@Viennoiserie, |
Thank you. Indeed, I have tried my program on other PDFs and there was no problem :/