Repeating characters #71
Comments
On doing
And repeating characters are still present for some words,
|
That's strange, indeed. My hunch is that there really are two copies of each letter in the PDF. (One set of letters might be transparent, perhaps?) What happens if you try extracting the text with another tool, such as |
No such problem with No repeating lines
No repeating characters
|
I've encountered this problem as well. In my case it was cropping up in fillable PDFs, and I theorized that the folks filling out the PDF were somehow resaving it on top of the original text. I found it was easier to just remove duplicate characters via script than make sense of the PDF. I dunno for sure, but I suspect that other PDF output tools are removing duplicate characters. I'm not really sure what the right solution is, but possibly adding a 'remove duplicate characters' option would make this more manageable? My case involved exact matches (characters occurring at exactly the same spot), so a fix was easy... I suppose if they were slightly offset it would be more challenging. |
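For readers who want to try the "exact match" cleanup described above, here is a minimal sketch. It assumes each character is a dict with `text`, `x0`, and `top` keys, as in the entries of pdfplumber's `page.chars`; the helper name `dedupe_exact` and the sample data are made up for illustration and are not part of pdfplumber.

```python
def dedupe_exact(chars):
    """Keep only the first character seen at each exact (text, x0, top) spot.

    `chars` is assumed to be a list of dicts shaped like pdfplumber char
    objects; only the "text", "x0", and "top" keys are used here.
    """
    seen = set()
    unique = []
    for ch in chars:
        key = (ch["text"], ch["x0"], ch["top"])
        if key not in seen:
            seen.add(key)
            unique.append(ch)
    return unique


# Hypothetical sample: every glyph was drawn twice at the same coordinates.
chars = [
    {"text": "H", "x0": 10.0, "top": 50.0},
    {"text": "H", "x0": 10.0, "top": 50.0},  # exact duplicate
    {"text": "i", "x0": 16.0, "top": 50.0},
    {"text": "i", "x0": 16.0, "top": 50.0},  # exact duplicate
]
deduped_text = "".join(ch["text"] for ch in dedupe_exact(chars))
print(deduped_text)
```

This only handles copies drawn at identical coordinates; slightly offset copies need a tolerance, which is the harder case mentioned above.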
I'm getting the same issue. Please share a resolution if there is one. |
AFAIK, duplicated characters are also used to render bold text, so there will be cases with a small offset. |
Any solution to it? I have the same issue. |
I recently stumbled across this issue - just tossing it out there to let folks know it's a continuing thing. |
@hannylicious and other watchers of this issue, if you have a PDF with this issue that you can share publicly, please do so that this issue can be investigated in further detail. I am pretty sure I have a PDF with this issue but it will take me some time to find it. |
Unfortunately - I dabble with PDFs very infrequently. I just happened across it this time because another library (PyPDF2) didn't see any text at all - whereas pdfplumber saw the text, but it was duplicated. The PDF I'm working with at this time has some information that I can't publicly display so I won't be of much assistance I'm afraid. I resolved my use case simply by grabbing the first of the results and using that. Pdfplumber is a great tool - I will most likely be using this from now on! If I run across this issue on a PDF that I can link up - I definitely will! |
Thanks, @hannylicious! If you have the time, you could try using https://github.com/JoshData/pdf-redactor to remove the sensitive information without altering the PDF structure. If the result still produces the same character-duplication, then it could be very helpful for resolving this issue. |
Thanks @jsvine - I will definitely have a look at that pdf-redactor library. If it works - I'll be sure and post that PDF here! |
I would like to help, but my file has confidential content. Does anyone have a file that reproduces the issue? |
Same issue here. |
@pajaskowiak Can you share a PDF that demonstrates the issue? |
repeat.pdf |
The duplicate text indeed is drawn twice in the PDF, the second time with a small horizontal offset to create the appearance of a bold font. |
Many thanks @xv44586 and @mkl-public. This is helpful. Given the way |
Indeed, there are many PDFs out there drawing text twice for some visual effect (bold, shadow, ...) but by far not all of them use ActualText to mark one copy as ignorable like @xv44586's example file does. Thus, finding duplicates explicitly will help more often in this regard than checking the ActualText. |
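To make the "find duplicates explicitly" idea concrete, here is a rough sketch of tolerance-based deduplication. It is an illustration, not pdfplumber's implementation: the function name `dedupe_with_tolerance`, the default tolerance of 1.0, and the sample coordinates are assumptions, and characters are assumed to be dicts with `text`, `x0`, and `top` keys.

```python
def dedupe_with_tolerance(chars, tolerance=1.0):
    """Drop a character if an already-kept character has the same text and
    lies within `tolerance` of it on both axes.

    This is a simple O(n^2) scan, which is fine for a sketch but may be
    slow on pages with many thousands of characters.
    """
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Hypothetical "fake bold" text: each glyph is drawn twice, the second
# copy shifted 0.3 units to the right.
bold_chars = []
for i, letter in enumerate("Hello"):
    x0 = 10.0 + 6.0 * i
    bold_chars.append({"text": letter, "x0": x0, "top": 50.0})
    bold_chars.append({"text": letter, "x0": x0 + 0.3, "top": 50.0})

bold_text = "".join(ch["text"] for ch in dedupe_with_tolerance(bold_chars))
print(bold_text)
```

The tolerance has to stay well below the normal glyph spacing, otherwise genuinely repeated letters (the two l's in "Hello") would be merged as well.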
I'm really sorry but I can't. It contains sensitive information. |
I did something similar to this. Anyway, I could fix the duplicates in my own code. Having the text from the PDF, even with occasional duplicates, is a big help already! Thank you for the project! |
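For anyone fixing duplicates in their own code after extraction, one possible text-level fallback is sketched below. It is a heuristic, not something the library provides: the helper names are invented for this example, and it only collapses a word when every character in it appears as an adjacent identical pair, so a genuinely doubled token such as "aa" would be collapsed too.

```python
def collapse_doubled_word(word):
    """Return `word` with each doubled character halved, but only if the
    whole word consists of identical adjacent pairs; otherwise return it
    unchanged."""
    if len(word) < 2 or len(word) % 2 != 0:
        return word
    pairs = [word[i:i + 2] for i in range(0, len(word), 2)]
    if all(pair[0] == pair[1] for pair in pairs):
        return "".join(pair[0] for pair in pairs)
    return word


def collapse_doubled_text(text):
    """Apply collapse_doubled_word to each space-separated word."""
    return " ".join(collapse_doubled_word(w) for w in text.split(" "))


# The doubled string reported in the issue description.
cleaned = collapse_doubled_text("SSttaatteemmeenntt ooff AAccccoouunnttss")
print(cleaned)
```

Because the check is per word, text that was extracted correctly (for example "book" or "see") passes through untouched, which matters when only part of a PDF is affected.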
Commit 04fd56a (available in |
When using pdfplumber, some documents may be parsed incorrectly, resulting in duplicated characters. Add a `dedupe` parameter to remove duplicated characters. Refer to Issue #71 of pdfplumber: jsvine/pdfplumber#71
…ader` (#10165)

(Reopen PR #7706, hoping this problem can be fixed.)

When using `pdfplumber`, some documents may be parsed incorrectly, resulting in **duplicated characters**.

Taking the [linked](https://bruusgaard.no/wp-content/uploads/2021/05/Datasheet1000-series.pdf) document as an example:

## Before

```python
from langchain.document_loaders import PDFPlumberLoader

pdf_file = 'file.pdf'
loader = PDFPlumberLoader(pdf_file)
docs = loader.load()
print(docs[0].page_content)
```

Results:

```
11000000 SSeerriieess PPoorrttaabbllee ssiinnggllee ggaass ddeetteeccttoorrss ffoorr HHyyddrrooggeenn aanndd CCoommbbuussttiibbllee ggaasseess TThhee RRiikkeenn KKeeiikkii GGPP--11000000 iiss aa ccoommppaacctt aanndd lliigghhttwweeiigghhtt ggaass ddeetteeccttoorr wwiitthh hhiigghh sseennssiittiivviittyy ffoorr tthhee ddeetteeccttiioonn ooff hhyyddrrooccaarrbboonnss.. TThhee mmeeaassuurreemmeenntt iiss ppeerrffoorrmmeedd ffoorr tthhiiss ppuurrppoossee bbyy mmeeaannss ooff ccaattaallyyttiicc sseennssoorr.. TThhee GGPP--11000000 hhaass aa bbuuiilltt--iinn ppuummpp wwiitthh ppuummpp bboooosstteerr ffuunnccttiioonn aanndd aa ddiirreecctt sseelleeccttiioonn ffrroomm aa lliisstt ooff 2255 hhyyddrrooccaarrbboonnss ffoorr eexxaacctt aalliiggnnmmeenntt ooff tthhee ttaarrggeett ggaass -- OOnnllyy ccaalliibbrraattiioonn oonn CCHH iiss nneecceessssaarryy.. 44 FFeeaattuurreess TThhee RRiikkeenn KKeeiikkii 110000vvvvttaabbllee ssiinnggllee HHyyddrrooggeenn aanndd CCoommbbuussttiibbllee ggaass ddeetteeccttoorrss.. TThheerree aarree 33 ssttaannddaarrdd mmooddeellss:: GGPP--11000000:: 00--1100%%LLEELL // 00--110000%%LLEELL ›› LLEELL ddeetteeccttoorr NNCC--11000000:: 00--11000000ppppmm // 00--1100000000ppppmm ›› PPPPMM ddeetteeccttoorr DDiirreecctt rreeaaddiinngg ooff tthhee ccoonncceennttrraattiioonn vvaalluueess ooff ccoommbbuussttiibbllee ggaasseess ooff 2255 ggaasseess ((55 NNPP--11000000))..
EEaassyy ooppeerraattiioonn ffeeaattuurree ooff cchhaannggiinngg tthhee ggaass nnaammee ddiissppllaayy wwiitthh 11 sswwiittcchh bbuuttttoonn.. LLoonngg ddiissttaannccee ddrraawwiinngg ppoossssiibbllee wwiitthh tthhee ppuummpp bboooosstteerr ffuunnccttiioonn.. VVaarriioouuss ccoommbbuussttiibbllee ggaasseess ccaann bbee mmeeaassuurreedd bbyy tthhee ppppmm oorrddeerr wwiitthh NNCC--11000000.. www.bruusgaard.no postmaster@bruusgaard.no +47 67 54 93 30 Rev: 446-2
```

We can see that there are a large number of duplicated characters in the text, which can cause issues in subsequent applications.

## After

Therefore, based on the [solution](jsvine/pdfplumber#71) provided by the `pdfplumber` source project, I added the `"dedupe_chars()"` method to address this problem. (Just set the `dedupe` parameter to `True`.)

```python
from langchain.document_loaders import PDFPlumberLoader

pdf_file = 'file.pdf'
loader = PDFPlumberLoader(pdf_file, dedupe=True)
docs = loader.load()
print(docs[0].page_content)
```

Results:

```
1000 Series Portable single gas detectors for Hydrogen and Combustible gases The Riken Keiki GP-1000 is a compact and lightweight gas detector with high sensitivity for the detection of hydrocarbons. The measurement is performed for this purpose by means of catalytic sensor. The GP-1000 has a built-in pump with pump booster function and a direct selection from a list of 25 hydrocarbons for exact alignment of the target gas - Only calibration on CH is necessary. 4 Features The Riken Keiki 100vvtable single Hydrogen and Combustible gas detectors. There are 3 standard models: GP-1000: 0-10%LEL / 0-100%LEL › LEL detector NC-1000: 0-1000ppm / 0-10000ppm › PPM detector Direct reading of the concentration values of combustible gases of 25 gases (5 NP-1000). Easy operation feature of changing the gas name display with 1 switch button. Long distance drawing possible with the pump booster function. Various combustible gases can be measured by the ppm order with NC-1000.
www.bruusgaard.no postmaster@bruusgaard.no +47 67 54 93 30 Rev: 446-2
```

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
I'm facing a weird problem wherein characters are repeated when using `extract_text()` or `extract_tables()`. Example, `SSttaatteemmeenntt ooff AAccccoouunnttss` is printed instead of `Statement of Accounts`. Sometimes, it happens in a portion of the PDF and sometimes in the whole PDF. When this happens in a portion of PDF, it is fixable (not completely) via `extract_text(x_tolerance=0, y_tolerance=0)` but not when the issue affects the whole PDF. Also, note that I do not face this issue in all PDFs but in some. Lines are also repeated. Example,