Problems reading scientific papers using pdfminer #354

ceilbeck · 2021-02-16T17:08:30Z

I am trying to identify some key words in a large collection of PDF files for a neuro-rehab study. As an example, my Python program miner1.py, which analyses Fujita2016.pdf, is:

#!/usr/bin/env python3

import re

import pdfplumber

# Case-insensitive keyword stems to search for
pattern = 'Sever|Mild|Moder|Level|Impair|Stratif|Degree|Low|High|Group'
fileName = 'Fujita2016.pdf' #input("Enter file path and name: ")
print(fileName)

with pdfplumber.open(fileName) as pdf:

    numPages = len(pdf.pages)
    print(f"No of Pages {numPages}")

    for i in range(0, numPages):
        print(f"Page {i}")
        p0 = pdf.pages[i]
        # Guard against extract_text() returning None for a page with no text
        text = p0.extract_text() or ""
        #print(text)
        for match in re.finditer(pattern, text, re.I):
            i1 = match.start()
            i2 = match.end()
            # Clamp the left edge: a negative start index would slice from the end of the text
            str1 = text[max(0, i1-20):i1]+"["+ text[i1:i2]+"]" +text[i2:i2+20]
            str1 = str1.replace('\n','\\n ')
            print(f"Match: {i} {i1} {i2} {str1}")

The input file is Fujita2016.pdf (attached).

The key words get printed out together with 20 characters on either side. The program works to some extent, but only recognises the first page of the PDF. More minor problems are that some of the text gets printed out without white space separators, and the double-column format causes some confusion.

I am running Python 3.8.5.
I'd be most grateful for any suggestions.

Chris Eilbeck

jsvine (Maintainer) · 2021-02-23T14:28:56Z

Hi @ceilbeck, and thanks for your interest in this library. In this case, it appears that the PDF is malformed — something that's not related to pdfplumber specifically. There are various tools to fix malformed PDFs; I happen to like cpdf: https://community.coherentpdf.com/

Running that tool like so:

cpdf Fujita2016.pdf -o Fujita2016-fixed.pdf

... produces this output:

Attempting to reconstruct the malformed pdf Fujita2016.pdf...
list length 0
list length 0
list length 0
Read 143 objects
Malformed PDF reconstruction succeeded!

Then, running the same code as above, but swapping in Fujita2016-fixed.pdf seems to produce the results you expected.
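Since the original question concerns a large collection of PDFs, the repair step could also be scripted. The following is a minimal sketch, assuming cpdf is installed and available on the PATH. Only the `cpdf in.pdf -o out.pdf` invocation comes from the command above; the helper names (`cpdf_command`, `repair_pdf`, `repair_all`) and the `-fixed` naming convention are illustrative choices, not part of cpdf or pdfplumber.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def cpdf_command(src: Path, dst: Path) -> List[str]:
    """Build the same invocation as shown above: cpdf <src> -o <dst>."""
    return ["cpdf", str(src), "-o", str(dst)]


def repair_pdf(src: Path) -> Path:
    """Write a repaired copy next to src (hypothetical '-fixed' naming)."""
    if shutil.which("cpdf") is None:
        raise RuntimeError("cpdf was not found on PATH")
    dst = src.with_name(f"{src.stem}-fixed{src.suffix}")
    # check=True raises CalledProcessError if cpdf exits with an error
    subprocess.run(cpdf_command(src, dst), check=True)
    return dst


def repair_all(folder: Path) -> List[Path]:
    """Repair every PDF in folder, skipping outputs from a previous run."""
    sources = sorted(p for p in folder.glob("*.pdf")
                     if not p.stem.endswith("-fixed"))
    return [repair_pdf(p) for p in sources]
```

Calling `repair_all(Path("."))` would then produce a `-fixed` copy of each PDF in the current directory. Files that were already well-formed are simply rewritten by cpdf.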

ceilbeck (Author) · Feb 23, 2021

Many thanks, I was starting to think I was just unlucky in choosing this PDF as my test case! My code seems to work well on other examples I tried later. It's good to hear about cpdf; I'll try it if I have similar problems elsewhere. I managed to fix the other problem I reported, that of the output not recognising the space between words. If I replace the call

p0.extract_text() with p0.extract_text(x_tolerance=1.5)

then the problem goes away on my example. It would be good to know what the units are for this type of function.
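To see why a smaller x_tolerance can restore missing spaces, here is a toy model of the idea. It is not pdfplumber's actual code: it simply emits a space whenever the horizontal gap between consecutive characters exceeds the tolerance. The character tuples and the function name `join_chars` are made up for illustration; pdfplumber's documented default for x_tolerance is 3.

```python
from typing import List, Tuple

# (text, x0, x1): a character and its left/right edges (hypothetical values)
Char = Tuple[str, float, float]


def join_chars(chars: List[Char], x_tolerance: float) -> str:
    """Join characters left to right, adding a space across wide gaps."""
    out = []
    prev_x1 = None
    for text, x0, x1 in chars:
        if prev_x1 is not None and x0 - prev_x1 > x_tolerance:
            out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)


# Tightly typeset line: the word gap between "o" and "a" is only 2 units wide
line = [("t", 0.0, 3.0), ("o", 3.2, 6.2), ("a", 8.2, 11.2), ("t", 11.4, 14.4)]
print(join_chars(line, x_tolerance=3))    # gap of 2 is under the tolerance
print(join_chars(line, x_tolerance=1.5))  # gap of 2 now counts as a space
```

With the larger tolerance the two words run together; with the smaller one they are separated. Going too low has the opposite risk, since ordinary letter spacing may start to be read as word breaks.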

That still leaves one last problem, identifying and parsing two-column text. If anyone has any suggestions for this I'd be very grateful to hear from them - it must be a common problem since many journal articles use this format.

jsvine (Maintainer) · 2021-02-25T13:32:08Z

> It would be good to know what the units are for this type of function.

The units, there and throughout pdfplumber and pdfminer.six, are "points," which for PDFs default to 1/72 of an inch.
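As a quick sanity check on the scale involved, the conversion from points can be sketched as below. This assumes the default of 72 points per inch stated above; the helper names are illustrative.

```python
POINTS_PER_INCH = 72.0  # default PDF unit, as stated above
MM_PER_INCH = 25.4


def points_to_inches(points: float) -> float:
    return points / POINTS_PER_INCH


def points_to_mm(points: float) -> float:
    return points_to_inches(points) * MM_PER_INCH


# The x_tolerance of 1.5 used earlier in the thread, expressed in millimetres
print(f"{points_to_mm(1.5):.3f} mm")
```

So an x_tolerance of 1.5 corresponds to a gap of roughly half a millimetre on the page.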

> That still leaves one last problem, identifying and parsing two-column text. If anyone has any suggestions for this I'd be very grateful to hear from them - it must be a common problem since many journal articles use this format.

If the layout is predictable between pages, I think the simplest approach would be to use page.crop(...) to divide the page into two halves, and then extract the text from those halves. If the layout becomes more complex, though, you may need to start coding custom heuristics for detecting the imaginary dividing line between the columns.
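The crop-based approach described above might look something like the following. This is a minimal sketch under stated assumptions: the columns are assumed to divide at a fixed fraction of the page width (the middle by default), and the helper names `column_bboxes`, `extract_two_columns` and `read_pdf` are hypothetical. `page.bbox`, `page.crop(...)` and `extract_text(...)` are the pdfplumber calls it relies on.

```python
def column_bboxes(page_bbox, split=0.5):
    """Split (x0, top, x1, bottom) into left and right boxes.

    split is the assumed position of the dividing line, as a
    fraction of the page width.
    """
    x0, top, x1, bottom = page_bbox
    mid = x0 + (x1 - x0) * split
    return (x0, top, mid, bottom), (mid, top, x1, bottom)


def extract_two_columns(page, split=0.5, **kwargs):
    """Read the left column, then the right, to keep reading order."""
    parts = []
    for bbox in column_bboxes(page.bbox, split):
        # Guard against extract_text() returning None for an empty crop
        parts.append(page.crop(bbox).extract_text(**kwargs) or "")
    return "\n".join(parts)


def read_pdf(path, **kwargs):
    """Return one two-column-aware text string per page."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return [extract_two_columns(page, **kwargs) for page in pdf.pages]
```

For example, `read_pdf("Fujita2016-fixed.pdf", x_tolerance=1.5)` would combine this with the tolerance setting found earlier in the thread. Full-width elements such as titles, abstracts or wide tables will be cut in two by a fixed split, so pages containing them (often the first page) may need separate handling.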
