Repeating characters #71

Closed
samkit-jain opened this issue Jul 31, 2018 · 22 comments

@samkit-jain
Collaborator

samkit-jain commented Jul 31, 2018

I'm facing a weird problem wherein characters are repeated when using extract_text() or extract_tables(). For example, SSttaatteemmeenntt ooff AAccccoouunnttss is printed instead of Statement of Accounts.

Sometimes, it happens in a portion of the PDF and sometimes in the whole PDF. When this happens in a portion of PDF, it is fixable (not completely) via extract_text(x_tolerance=0, y_tolerance=0) but not when the issue affects the whole PDF. Also, note that I do not face this issue in all PDFs but in some.

Lines are also repeated. Example,

Year-to-date totals do not reflect any fee or interest refunds
Year-to-date totals do not reflect any fee or interest refunds
you may have received.
you may have received.
@samkit-jain
Collaborator Author

Running first_page.extract_words(x_tolerance=0, y_tolerance=0) returns two instances of a single word:

{'x0': Decimal('231.532'), 'x1': Decimal('252.251'), 'top': Decimal('916.343'), 'bottom': Decimal('925.422'), 'text': 'reflect'}
{'x0': Decimal('231.533'), 'x1': Decimal('252.252'), 'top': Decimal('916.383'), 'bottom': Decimal('925.462'), 'text': 'reflect'}

And repeating characters are still present for some words,

{'x0': Decimal('489.040'), 'x1': Decimal('506.160'), 'top': Decimal('269.320'), 'bottom': Decimal('277.480'), 'text': 'ttooddaayy'}
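
When the doubling survives inside a single extracted word, as in 'ttooddaayy' here, a text-level fallback is possible. The following is only a heuristic sketch: the helper names `undouble` and `undouble_text` are hypothetical, and it assumes every character in an affected word is doubled exactly once. Genuine words made up purely of doubled letters (for example "oo") would be collapsed too, so it is best applied only to text known to be affected.

```python
def undouble(word):
    """Collapse a word whose characters all appear in identical pairs."""
    is_doubled = (
        len(word) >= 2
        and len(word) % 2 == 0
        and all(word[i] == word[i + 1] for i in range(0, len(word), 2))
    )
    # Keep every second character only when the whole word is doubled.
    return word[::2] if is_doubled else word


def undouble_text(text):
    """Apply undouble() to each space-separated word in a line of text."""
    return " ".join(undouble(word) for word in text.split(" "))


print(undouble_text("SSttaatteemmeenntt ooff AAccccoouunnttss"))
print(undouble("ttooddaayy"))
```

Words that are not fully doubled, such as "today", pass through unchanged.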

@jsvine
Owner

jsvine commented Aug 1, 2018

That's strange, indeed. My hunch is that there really are two copies of each letter in the PDF. (One set of letters might be transparent, perhaps?) What happens if you try extracting the text with another tool, such as poppler-utils's pdftotext? (https://en.wikipedia.org/wiki/Pdftotext)

@samkit-jain
Collaborator Author

No such problem with pdftotext. This is the output,

No repeating lines

Year-to-date totals do not reflect any fee or interest refunds
you may have received.

No repeating characters

today
Statement of Accounts

@jsfenfen
Contributor

jsfenfen commented Aug 1, 2018

I've encountered this problem as well. In my case it was cropping up in fillable PDFs, and I theorized that the folks filling out the PDF were somehow resaving it on top of the original text. I found it was easier to just remove duplicate characters via script than to make sense of the PDF. I don't know for sure, but I suspect that other PDF output tools are removing duplicate characters.

I'm not really sure what the right solution is, but possibly adding a 'remove duplicate characters' option would make this more manageable? My case involved exact matches--characters occurring at exactly the same spot--so a fix was easy... I suppose if they were slightly offset it would be more challenging.
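
For the exact-match case described above, a script along these lines might be enough. This is a minimal sketch, not the script actually used: the function name `drop_exact_duplicates` is hypothetical, and the input is assumed to be a list of char dicts carrying `text`, `x0`, and `top` keys, as in pdfplumber's `page.chars`.

```python
def drop_exact_duplicates(chars):
    """Keep only the first character seen at each (text, x0, top) position."""
    seen = set()
    unique = []
    for char in chars:
        key = (char["text"], char["x0"], char["top"])
        if key not in seen:
            seen.add(key)
            unique.append(char)
    return unique


# Illustrative sample: each letter is drawn twice at exactly the same spot.
sample_chars = [
    {"text": "S", "x0": 10.0, "top": 5.0},
    {"text": "S", "x0": 10.0, "top": 5.0},
    {"text": "t", "x0": 16.0, "top": 5.0},
    {"text": "t", "x0": 16.0, "top": 5.0},
]
print("".join(c["text"] for c in drop_exact_duplicates(sample_chars)))
```

Because it relies on exact equality, this approach does nothing for copies that are slightly offset.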

@NaveenBandi

I'm getting the same issue. Is there a resolution?

@BryanKoo

AFAIK, duplicated characters are also used to render bold text, so there will be cases where the copies are separated by a small offset.
Deduplication is possible by checking the overlap ratio between characters using their coordinates.

@samkit-jain samkit-jain added the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Aug 21, 2020
@tiagosamaha

Any solution to it? I have the same issue.

@hannylicious

I recently stumbled across this issue - just tossing it out there to let folks know it's a continuing thing.

@samkit-jain
Collaborator Author

@hannylicious and other watchers of this issue, if you have a PDF with this issue that you can share publicly, please share it so that this issue can be investigated in further detail.

I am pretty sure I have a PDF with this issue but it will take me some time to find it.

@hannylicious

Unfortunately - I dabble with PDFs very infrequently. I just happened across it this time because another library (PyPDF2) didn't see any text at all - whereas pdfplumber saw the text, but it was duplicated. The PDF I'm working with at this time has some information that I can't publicly display so I won't be of much assistance I'm afraid.

I resolved my use case simply by grabbing the first of the results and using that.

Pdfplumber is a great tool - I will most likely be using this from now on! If I run across this issue on a PDF that I can link up - I definitely will!

@jsvine
Owner

jsvine commented Sep 1, 2020

Thanks, @hannylicious! If you have the time, you could try using https://github.com/JoshData/pdf-redactor to remove the sensitive information without altering the PDF structure. If the result still produces the same character-duplication, then it could be very helpful for resolving this issue.

@hannylicious

Thanks @jsvine - I will definitely have a look at that pdf-redactor library. If it works - I'll be sure and post that PDF here!

@tiagosamaha

I would like to help, but my file has confidential content. Does anyone have a shareable file that shows the issue?

@pajaskowiak

Same issue here.

@jsvine
Owner

jsvine commented Sep 26, 2020

@pajaskowiak Can you share a PDF that demonstrates the issue?

@xv44586

xv44586 commented Sep 28, 2020

I'm getting the same issue. The PDF is attached: repeat.pdf

@mkl-public

The duplicate text indeed is drawn twice in the PDF, the second time with a small horizontal offset to create the appearance of a bold font.
Actually, though, this PDF gives a hint that the second copy shall be ignored by marking it with an empty ActualText property. By evaluating that property, therefore, pdfplumber could correctly extract this PDF.

@jsvine
Owner

jsvine commented Sep 29, 2020

Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates — perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.
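
The option floated here could look roughly like the following. It is a minimal sketch under stated assumptions, not pdfplumber's implementation: the function name `dedupe_within_tolerance` is hypothetical, chars are assumed to be dicts with `text`, `fontname`, `size`, `x0`, and `top` keys, and the sample values are illustrative.

```python
def dedupe_within_tolerance(chars, tolerance=1.0):
    """Drop chars matching a kept char's text, fontname, and size within tolerance."""
    kept = []
    for char in chars:
        is_duplicate = any(
            k["text"] == char["text"]
            and k["fontname"] == char["fontname"]
            and k["size"] == char["size"]
            and abs(k["x0"] - char["x0"]) <= tolerance
            and abs(k["top"] - char["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


def _char(fontname, x0, top):
    # Helper for the illustrative sample: same letter and size throughout.
    return {"text": "e", "fontname": fontname, "size": 9.0, "x0": x0, "top": top}


sample = [
    _char("Helvetica", 231.532, 916.343),       # the "original" character
    _char("Helvetica", 231.533, 916.383),       # near-identical copy: dropped
    _char("Helvetica-Bold", 231.532, 916.343),  # different font: kept
    _char("Helvetica", 240.0, 916.343),         # too far away: kept
]
print(len(dedupe_within_tolerance(sample)))
```

Requiring the font name and size to match keeps legitimately overlapping characters, such as an overlay in another font, from being removed.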

@mkl-public

Indeed, there are many PDFs out there drawing text twice for some visual effect (bold, shadow, ...) but by far not all of them use ActualText to mark one copy as ignorable like @xv44586's example file does. Thus, finding duplicates explicitly will help more often in this regard than checking the ActualText.

@pajaskowiak

> @pajaskowiak Can you share a PDF that demonstrates the issue?

I'm really sorry but I can't. It contains sensitive information.

@pajaskowiak

Many thanks @xv44586 and @mkl-public. This is helpful. Given the way pdfminer.six parses PDFs, I'm not sure we can easily pick up on that ActualText property. And I'm not certain all PDFs would provide the same hinting. But one way we could potentially handle this is by adding an option to remove all characters that are, effectively, duplicates — perhaps those with same text, fontname, and size and within a few x/y points of the "original" character.

I did something similar to this. Anyway, I was able to fix the duplicates in my own code. Having the text from the PDF, even with occasional duplicates, is already a big help! Thank you for the project!

jsvine added a commit that referenced this issue Oct 3, 2020
h/t @xv44586 for the initial inspiration 👍

These new methods return a version of the chars/page with duplicate
chars — those sharing the same text, fontname, size, and positioning
(within `tolerance` x/y) as other characters — removed.
@jsvine
Owner

jsvine commented Oct 3, 2020

Commit 04fd56a (available in develop and in the next release) provides a Page.dedupe_chars(...) method that should address this general type of character duplication. (Thanks to @xv44586 for the PDF and test!) I'm closing this issue for now, but if anyone encounters character-duplication issues that the new method does not solve, feel free to comment on this thread. Priority will be given to comments containing a specific PDF and code that demonstrate the problem.
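
For readers landing here, usage might look like the sketch below. `Page.dedupe_chars` and its `tolerance` argument come from the commit described above. The helper names and the `statement.pdf` path are hypothetical, and pdfplumber is imported lazily so the first helper can be exercised without the library installed.

```python
def extract_text_deduped(page, tolerance=1):
    """Return the page's text after removing near-duplicate characters."""
    # dedupe_chars() returns a filtered version of the page, so the usual
    # extraction methods can be chained onto its result.
    return page.dedupe_chars(tolerance=tolerance).extract_text()


def extract_pdf_text_deduped(path, tolerance=1):
    """Open a PDF and return the de-duplicated text of each page as a list."""
    import pdfplumber  # requires `pip install pdfplumber`

    with pdfplumber.open(path) as pdf:
        return [extract_text_deduped(page, tolerance) for page in pdf.pages]


# Example call (hypothetical file name):
# pages = extract_pdf_text_deduped("statement.pdf")
```

If duplicates remain, raising `tolerance` slightly may help when the second copy is offset by more than a point.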

@jsvine jsvine closed this as completed Oct 3, 2020
@jsvine jsvine removed the awaiting-code-or-pdf Issues and PRs awaiting code and/or a PDF from issue/PR-author label Apr 27, 2022
Lin-jun-xiang added a commit to Lin-jun-xiang/langchain that referenced this issue Sep 4, 2023
When using pdfplumber, some documents may be parsed incorrectly, resulting in duplicated characters.

Add a `dedupe` parameter to remove duplicated characters.

Refer to issue #71 of pdfplumber:
jsvine/pdfplumber#71
baskaryan added a commit to langchain-ai/langchain that referenced this issue Sep 4, 2023
…ader` (#10165)

(Reopens PR #7706; hopefully this fixes the problem.)

When using `pdfplumber`, some documents may be parsed incorrectly,
resulting in **duplicated characters**.

Taking the
[linked](https://bruusgaard.no/wp-content/uploads/2021/05/Datasheet1000-series.pdf)
document as an example:

## Before
```python
from langchain.document_loaders import PDFPlumberLoader

pdf_file = 'file.pdf'
loader = PDFPlumberLoader(pdf_file)
docs = loader.load()
print(docs[0].page_content)
```

Results:
```
11000000 SSeerriieess
PPoorrttaabbllee ssiinnggllee ggaass ddeetteeccttoorrss ffoorr HHyyddrrooggeenn aanndd CCoommbbuussttiibbllee ggaasseess
TThhee RRiikkeenn KKeeiikkii GGPP--11000000 iiss aa ccoommppaacctt aanndd
lliigghhttwweeiigghhtt ggaass ddeetteeccttoorr wwiitthh hhiigghh sseennssiittiivviittyy ffoorr
tthhee ddeetteeccttiioonn ooff hhyyddrrooccaarrbboonnss.. TThhee mmeeaassuurreemmeenntt
iiss ppeerrffoorrmmeedd ffoorr tthhiiss ppuurrppoossee bbyy mmeeaannss ooff ccaattaallyyttiicc
sseennssoorr.. TThhee GGPP--11000000 hhaass aa bbuuiilltt--iinn ppuummpp wwiitthh
ppuummpp bboooosstteerr ffuunnccttiioonn aanndd aa ddiirreecctt sseelleeccttiioonn ffrroomm
aa lliisstt ooff 2255 hhyyddrrooccaarrbboonnss ffoorr eexxaacctt aalliiggnnmmeenntt ooff tthhee
ttaarrggeett ggaass -- OOnnllyy ccaalliibbrraattiioonn oonn CCHH iiss nneecceessssaarryy..
44
FFeeaattuurreess
TThhee RRiikkeenn KKeeiikkii 110000vvvvttaabbllee ssiinnggllee HHyyddrrooggeenn aanndd
CCoommbbuussttiibbllee ggaass ddeetteeccttoorrss..
TThheerree aarree 33 ssttaannddaarrdd mmooddeellss::
GGPP--11000000:: 00--1100%%LLEELL // 00--110000%%LLEELL ›› LLEELL ddeetteeccttoorr
NNCC--11000000:: 00--11000000ppppmm // 00--1100000000ppppmm ›› PPPPMM
ddeetteeccttoorr
DDiirreecctt rreeaaddiinngg ooff tthhee ccoonncceennttrraattiioonn vvaalluueess ooff
ccoommbbuussttiibbllee ggaasseess ooff 2255 ggaasseess ((55 NNPP--11000000))..
EEaassyy ooppeerraattiioonn ffeeaattuurree ooff cchhaannggiinngg tthhee ggaass nnaammee
ddiissppllaayy wwiitthh 11 sswwiittcchh bbuuttttoonn..
LLoonngg ddiissttaannccee ddrraawwiinngg ppoossssiibbllee wwiitthh tthhee ppuummpp
bboooosstteerr ffuunnccttiioonn..
VVaarriioouuss ccoommbbuussttiibbllee ggaasseess ccaann bbee mmeeaassuurreedd bbyy tthhee
ppppmm oorrddeerr wwiitthh NNCC--11000000..
www.bruusgaard.no postmaster@bruusgaard.no +47 67 54 93 30 Rev: 446-2
```

We can see that there are a large number of duplicated characters in the
text, which can cause issues in subsequent applications.

## After

Therefore, based on the
[solution](jsvine/pdfplumber#71) provided by
the `pdfplumber` project, I used its `dedupe_chars()` method
to address this problem. (Just set the `dedupe` parameter to `True`.)

```python
from langchain.document_loaders import PDFPlumberLoader

pdf_file = 'file.pdf'
loader = PDFPlumberLoader(pdf_file, dedupe=True)
docs = loader.load()
print(docs[0].page_content)
```

Results:

```
1000 Series
Portable single gas detectors for Hydrogen and Combustible gases
The Riken Keiki GP-1000 is a compact and
lightweight gas detector with high sensitivity for
the detection of hydrocarbons. The measurement
is performed for this purpose by means of catalytic
sensor. The GP-1000 has a built-in pump with
pump booster function and a direct selection from
a list of 25 hydrocarbons for exact alignment of the
target gas - Only calibration on CH is necessary.
4
Features
The Riken Keiki 100vvtable single Hydrogen and
Combustible gas detectors.
There are 3 standard models:
GP-1000: 0-10%LEL / 0-100%LEL › LEL detector
NC-1000: 0-1000ppm / 0-10000ppm › PPM
detector
Direct reading of the concentration values of
combustible gases of 25 gases (5 NP-1000).
Easy operation feature of changing the gas name
display with 1 switch button.
Long distance drawing possible with the pump
booster function.
Various combustible gases can be measured by the
ppm order with NC-1000.
www.bruusgaard.no postmaster@bruusgaard.no +47 67 54 93 30 Rev: 446-2
```

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>