[extracting words from table] #914
Comments
A few suggestions:
1 - Don't use an output type of XML; the comments in high_level.py specifically say that only 'text' works properly.
2 - If you are extracting text with extract_text_to_fp as a library call, the exact call signature would help diagnose what is happening. In particular, which LAParams are you passing to the call? Typically you would tune the LAParams to the nature of your document, in your case the narrow spacing. The parameters you can set on an LAParams object include char_margin, word_margin, and line_margin, among others.
3 - If you are uncomfortable with the above, pdfminer comes with a CLI tool, pdf2txt.py, which you can run on the file whose text you wish to extract, with parameters such as --word-margin, which controls when pdfminer interprets a physical gap between characters as whitespace. In general, when dealing with idiosyncratic documents, you simply have to try different input parameters until you find the combination that extracts the text from your particular document in a sensible way.
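As a rough illustration of the "try different parameters" advice, here is a minimal sketch that sweeps a few LAParams combinations and extracts the text once per combination. It assumes pdfminer.six is installed and uses its `extract_text` helper and `LAParams` class; the helper names `laparams_grid` and `extract_with_each` and the candidate values are made up for this example and are not part of pdfminer.

```python
from itertools import product


def laparams_grid(word_margins=(0.05, 0.1, 0.2), char_margins=(1.0, 2.0)):
    """Return one keyword-argument dict per parameter combination.

    The candidate values are arbitrary starting points (hypothetical),
    chosen around pdfminer's defaults of word_margin=0.1, char_margin=2.0.
    """
    return [
        {"word_margin": wm, "char_margin": cm}
        for wm, cm in product(word_margins, char_margins)
    ]


def extract_with_each(pdf_path, grid):
    """Yield (kwargs, text) for every LAParams combination in grid."""
    # Imported lazily so the grid helper works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    for kwargs in grid:
        # Unspecified LAParams fields keep their library defaults.
        yield kwargs, extract_text(pdf_path, laparams=LAParams(**kwargs))
```

Calling `extract_with_each("document.pdf", laparams_grid())` and printing the first few hundred characters of each result should make it easy to compare the settings side by side. A smaller word_margin generally makes pdfminer insert spaces more readily, so that is probably the first value to lower for tightly spaced table cells.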
Issue: Words Extracted Too Closely Together in Tables with PDFMiner
Problem Description
When extracting text from a PDF document containing tables with PDFMiner, the words inside the tables are extracted incorrectly: they come out so closely spaced together that the extracted text is unusable. This prevents us from properly processing the information within these tables.
Steps to Reproduce
document.pdf
Expected Behavior
We expect PDFMiner to correctly extract text inside tables, preserving the spacing between words, so that the extracted text is usable.
This issue significantly impacts our ability to correctly extract and use data contained within PDF document tables. Any assistance in resolving this problem would be greatly appreciated.
I thank you for the time spent studying this case and hope that a solution can be put in place :)