Is there a way to get something similar to pdftotext's layout? #10

Closed
dannguyen opened this issue Mar 15, 2016 · 11 comments

Comments

@dannguyen

Is there an option similar to pdftotext's -layout flag, which "maintain[s] original physical layout"? I understand that's probably up to the pdfminer engine...

Here's what I mean:

Original PDF

http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf


pdftotext with -layout

$ curl -O http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf
$ pdftotext -f 1 -l 1 -layout 07-1315.pdf -

Output:

                        Official - Subject to Final Review


 1      IN THE SUPREME COURT OF THE UNITED STATES

 2   - - - - - - - - - - - - - - - - - x

 3   MICHAEL A. KNOWLES,                            :

 4   WARDEN,                                        :

 5              Petitioner                          :

 6         v.                                       :        No. 07-1315

 7   ALEXANDRE MIRZAYANCE.                          :

 8   - - - - - - - - - - - - - - - - - x

 9                              Washington, D.C.

10                              Tuesday, January 13, 2009

11

12                  The above-entitled matter came on for oral

13   argument before the Supreme Court of the United States

14   at 1:01 p.m.

15   APPEARANCES:

16   STEVEN E. MERCER, ESQ., Deputy Attorney General, Los

17     Angeles, Cal.; on behalf of the Petitioner.

18   CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf

19     of the Respondent.

20

21

22

23

24

25


                                        1

                           Alderson Reporting Company

pdfplumber (built on pdfminer)

import pdfplumber
import requests

fname = "/tmp/whatev.pdf"
resp = requests.get("http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf")
with open(fname, "wb") as f:
    f.write(resp.content)

# Extract the text of the first page
pdf = pdfplumber.open(fname)
print(pdf.pages[0].extract_text())

Output:

Official - Subject to Final Review 
1 IN THE SUPREME COURT OF THE UNITED STATES 
2 - - - - - - - - - - - - - - - - - x 
3 MICHAEL A. KNOWLES,               : 
4 WARDEN,                           :
5  Petitioner            :
6  v.                         :  No. 07-1315 
7 ALEXANDRE MIRZAYANCE.             : 
8 - - - - - - - - - - - - - - - - - x
9  Washington, D.C.
10  Tuesday, January 13, 2009
11
12  The above-entitled matter came on for oral 
13 argument before the Supreme Court of the United States 
14 at 1:01 p.m. 
15 APPEARANCES: 
16 STEVEN E. MERCER, ESQ., Deputy Attorney General, Los
17  Angeles, Cal.; on behalf of the Petitioner. 
18 CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf
19  of the Respondent. 
20
21
22
23
24
25
1


Alderson Reporting Company 
@jsfenfen
Contributor

I was able to do something like this using an alternate utils.collate_line_layout.

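from operator import itemgetter

# decimalize is assumed to be in scope, as it is in the utils module this function replaces.
def collate_line_layout(line_chars, tolerance=0, line_width=612, fixed_width=90):
    tolerance = decimalize(tolerance)
    # Width, in points, of one output column (612 pt / 90 columns by default).
    space_width = decimalize((line_width + 0.0) / fixed_width)
    coll = ""
    last_x1 = 0
    for char in sorted(line_chars, key=itemgetter("x0")):
        # A gap since the previous character becomes a run of spaces.
        if last_x1 is not None and char["x0"] > (last_x1 + tolerance):
            delta_x = char["x0"] - last_x1
            # Always emit at least one space for a detected gap.
            num_spaces = max(1, int(round(delta_x / space_width)))
            coll += " " * num_spaces
        last_x1 = char["x1"]
        coll += char["text"]
    return coll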

Output is below (I used the default of 90 chars wide). Not as pretty -- and you probably would wanna remove the left margin. How do you think this should work? One could, of course, just use pdftotext -layout, but I've come across stuff where just being able to grab a region and deal with it in a way that spaces matter seems like a good idea...

                                  Official - Subject to Final Review 
          1      IN THE SUPREME COURT OF THE UNITED STATES 
          2   - - - - - - - - - - - - - - - - - x 
          3   MICHAEL A. KNOWLES,               : 
          4   WARDEN,                           :
          5                Petitioner            :
          6           v.                         :  No. 07-1315 
          7   ALEXANDRE MIRZAYANCE.             : 
          8   - - - - - - - - - - - - - - - - - x
          9                           Washington, D.C.
         10                           Tuesday, January 13, 2009
         11
         12                  The above-entitled matter came on for oral 
         13   argument before the Supreme Court of the United States 
         14   at 1:01 p.m. 
         15   APPEARANCES: 
         16   STEVEN E. MERCER, ESQ., Deputy Attorney General, Los
         17      Angeles, Cal.; on behalf of the Petitioner. 
         18   CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf
         19      of the Respondent. 
         20
         21
         22
         23
         24
         25
                                           1


                                   Alderson Reporting Company 

@jsfenfen
Contributor

It would be more consistent to use utils.extract_words and then, for each line, start each word at exactly the right column, padding with spaces as needed. Doing that for just a piece of the page would work if one knew the width, though I guess you'd have to know it, realistically.
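A minimal sketch of that word-based idea, assuming each word is a dict with `text` and `x0` keys (as `extract_words` returns) and plain floats for coordinates. The function name `layout_line` and the sample coordinates are made up for illustration; this is not pdfplumber's code.

```python
def layout_line(words, line_width=612.0, fixed_width=90):
    """Place each word at the text column implied by its x0 coordinate."""
    points_per_col = line_width / fixed_width  # page points per output column
    out = ""
    for word in sorted(words, key=lambda w: w["x0"]):
        target_col = int(round(word["x0"] / points_per_col))
        pad = target_col - len(out)
        # Never let two words touch, even if the previous word overran the column.
        if out and pad < 1:
            pad = 1
        out += " " * max(pad, 0) + word["text"]
    return out


# Hypothetical words from one line of the transcript (coordinates invented).
sample_words = [
    {"text": "6", "x0": 68.0},
    {"text": "v.", "x0": 136.0},
    {"text": ":", "x0": 340.0},
]
print(repr(layout_line(sample_words)))
```

Because every word is positioned from the left edge rather than from the previous character, rounding errors do not accumulate along the line, which is the consistency gain over the per-character version.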

@jsvine
Owner

jsvine commented Mar 17, 2016

🎉 This seems like a great feature to have/add. And seems quite doable. There are (at least) two options for implementation:

  • As an optional flag to .extract_text(...), e.g., preserve_layout=True.
  • As a method unto its own, e.g., .extract_layout(...)

My slight preference is for the first option, since that seems like the logical place where people would look. Thoughts?

And, yep @jsfenfen, seems like .extract_words(...) would be a good underlying approach.

@dannguyen: Are you looking for something similar to pdftotext's layout, or exactly the same as it?

Should be able to knock this out some evening or weekend soon.

@dannguyen
Author

@jsvine Uh... I was only interested in it on the assumption that it was some kind of "standard", but you'd have to re-implement it yourself from scratch? I don't know why I assumed that; maybe I was just really optimistic and intoxicated at the time. Mostly, I was hoping for a cross-platform way to parse PDFs to text with the layout preserved, just for regex exercises, that didn't require installing poppler. But the layout option was never perfect and would sometimes mangle lines anyway.

@jsvine
Owner

jsvine commented Mar 17, 2016

Intoxicated Optimism will be the name of my next band. But to your point:

  • There's no standard approach/spec for extracting that sort of preserved-layout text.
  • I think it's worth doing anyhow.
  • Given the current architecture, seems straightforward enough.
  • I'll regret writing the previous bullet point about 30 minutes after I try implementing this feature.

@abhishek-jain-infrrd

Did anyone get the time to work on this feature? Having extract_text(...) with preserve_layout=True seems like the best way. Can anyone confirm whether this was implemented?

@oliverbj

oliverbj commented Jul 4, 2019

I am also interested in this. Did you give this a try @jsvine ?

@toby2

toby2 commented Feb 4, 2021

I am also interested in this. Did you give this a try @jsvine ?

@jsvine
Owner

jsvine commented Feb 4, 2021

I've sketched out some possible implementations, but haven't made the time yet to code/debug them. Thank you for noting your interest, however; it's useful information.

@jigsawcoder

Regarding @jsfenfen's collate_line_layout function above: can you please explain how I can use it for my use case?
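A minimal, self-contained sketch of how the function might be called. It assumes plain floats instead of `decimalize`, and it fakes the character dicts with a hypothetical `make_chars` helper; with a real page, the input would be the character dicts for a single line (each with `text`, `x0`, and `x1`), grouped by vertical position beforehand.

```python
from operator import itemgetter


def collate_line_layout(line_chars, tolerance=0, line_width=612, fixed_width=90):
    """Float-based variant of the function above (no decimalize)."""
    space_width = line_width / fixed_width
    coll = ""
    last_x1 = 0
    for char in sorted(line_chars, key=itemgetter("x0")):
        if char["x0"] > last_x1 + tolerance:
            gap = (char["x0"] - last_x1) / space_width
            coll += " " * max(1, int(round(gap)))
        last_x1 = char["x1"]
        coll += char["text"]
    return coll


def make_chars(text, x0, char_width=6.8):
    """Hypothetical helper: build fake char dicts for `text`, starting at x0."""
    return [
        {"text": c, "x0": x0 + i * char_width, "x1": x0 + (i + 1) * char_width}
        for i, c in enumerate(text)
    ]


# One line of characters: "v." near the left, ":" further right.
line_chars = make_chars("v.", 68.0) + make_chars(":", 340.0)
print(repr(collate_line_layout(line_chars)))
```

The function handles one line at a time, so a full page would need one call per line, with the results joined by newlines.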

jsvine added a commit that referenced this issue Oct 22, 2021
See the docstring in utils.words_to_layout for details on the
implementation.

Addresses issue #10 and related issues.
jsvine added a commit that referenced this issue Nov 3, 2021
See the docstring in utils.words_to_layout for details on the
implementation.

Addresses issue #10 and related issues.
@jsvine
Owner

jsvine commented Dec 24, 2021

More than five years after this issue was first opened (thank you @dannguyen), pdfplumber now has an experimental version of this feature! It uses a similar approach to @jsfenfen's suggestion (thank you!), with a few tweaks. Documentation in README.md:

[Screenshot of the README.md documentation]

... and more extensively in the code itself:

def words_to_layout(
    words,
    x_density=DEFAULT_X_DENSITY,
    y_density=DEFAULT_Y_DENSITY,
    x_shift=0,
    y_shift=0,
    y_tolerance=DEFAULT_Y_TOLERANCE,
    presorted=False,
):
    """
    Given a set of word objects generated by `extract_words(...)`, return a
    string that mimics the structural layout of the text on the page(s), using
    the following approach:

    - Sort the words by (doctop, x0) if not already sorted.

    - Calculate the initial doctop for the starting page.

    - Cluster the words by doctop (taking `y_tolerance` into account), and
      iterate through them.

    - For each cluster, calculate the distance between that doctop and the
      initial doctop, in points, minus `y_shift`. Divide that distance by
      `y_density` to calculate the minimum number of newlines that should come
      before this cluster. Append that number of newlines *minus* the number of
      newlines already appended, with a minimum of one.

    - Then for each cluster, iterate through each word in it. Divide each
      word's x0, minus `x_shift`, by `x_density` to calculate the minimum
      number of characters that should come before this cluster. Append that
      number of spaces *minus* the number of characters and spaces already
      appended, with a minimum of one. Then append the word's text.

    Note: This approach currently works best for horizontal, left-to-right
    text, but will display all words regardless of orientation. There is room
    for improvement in better supporting right-to-left text, as well as
    vertical text.
    """
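The steps in that docstring can be sketched in plain Python as follows. This is an illustration written from the docstring, not pdfplumber's actual implementation. The default values (7.25, 13, 3) stand in for the `DEFAULT_*` constants and are assumptions, as are the choices to take the page's starting doctop from the first word's `doctop - top` and to skip the one-newline and one-space minimums at the very start of the output and of each line.

```python
def words_to_layout_sketch(words, x_density=7.25, y_density=13,
                           x_shift=0, y_shift=0, y_tolerance=3):
    """Render word dicts (text, x0, top, doctop) as layout-preserving text."""
    if not words:
        return ""
    words = sorted(words, key=lambda w: (w["doctop"], w["x0"]))
    # Assumed: doctop of the top edge of the starting page.
    doctop_start = words[0]["doctop"] - words[0]["top"]

    # Cluster consecutive words whose doctops differ by at most y_tolerance.
    clusters = [[words[0]]]
    for word in words[1:]:
        if word["doctop"] - clusters[-1][-1]["doctop"] <= y_tolerance:
            clusters[-1].append(word)
        else:
            clusters.append([word])

    rendered = ""
    newlines_so_far = 0
    for i, cluster in enumerate(clusters):
        y_dist = (cluster[0]["doctop"] - (doctop_start + y_shift)) / y_density
        # At least one newline between clusters (none forced before the first).
        n_newlines = max(min(i, 1), round(y_dist) - newlines_so_far)
        rendered += "\n" * n_newlines
        newlines_so_far += n_newlines

        line_len = 0
        for word in sorted(cluster, key=lambda w: w["x0"]):
            x_dist = (word["x0"] - x_shift) / x_density
            # At least one space between words (none forced at line start).
            n_spaces = max(min(line_len, 1), round(x_dist) - line_len)
            rendered += " " * n_spaces + word["text"]
            line_len += n_spaces + len(word["text"])
    return rendered


# Hypothetical word dicts (coordinates invented for illustration).
demo_words = [
    {"text": "1", "x0": 72.5, "top": 26.0, "doctop": 26.0},
    {"text": "IN", "x0": 145.0, "top": 26.0, "doctop": 26.0},
    {"text": "2", "x0": 72.5, "top": 52.0, "doctop": 52.0},
]
print(words_to_layout_sketch(demo_words))
```

Dividing by a density rather than measuring gaps between neighbours means each word's column depends only on its own x0, so vertically aligned words on different lines end up in the same text column.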

jsvine closed this as completed Dec 24, 2021