Is there a way to get something similar to pdftotext's layout? #10

dannguyen · 2016-03-15T07:12:17Z

Is there an option similar to pdftotext's -layout flag, which "maintain[s] original physical layout"? I understand that's probably up to the pdfminer engine...

Here's what I mean:

Original PDF

http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf

pdftotext with `-layout`

$ curl -O http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf
$ pdftotext -f 1 -l 1 -layout 07-1315.pdf -

Output:

                        Official - Subject to Final Review


 1      IN THE SUPREME COURT OF THE UNITED STATES

 2   - - - - - - - - - - - - - - - - - x

 3   MICHAEL A. KNOWLES,                            :

 4   WARDEN,                                        :

 5              Petitioner                          :

 6         v.                                       :        No. 07-1315

 7   ALEXANDRE MIRZAYANCE.                          :

 8   - - - - - - - - - - - - - - - - - x

 9                              Washington, D.C.

10                              Tuesday, January 13, 2009

11

12                  The above-entitled matter came on for oral

13   argument before the Supreme Court of the United States

14   at 1:01 p.m.

15   APPEARANCES:

16   STEVEN E. MERCER, ESQ., Deputy Attorney General, Los

17     Angeles, Cal.; on behalf of the Petitioner.

18   CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf

19     of the Respondent.

20

21

22

23

24

25


                                        1

                           Alderson Reporting Company

pdfplumber (via pdfminer)

import pdfplumber
import requests

fname = "/tmp/whatev.pdf"
resp = requests.get("http://www.supremecourt.gov/oral_arguments/argument_transcripts/07-1315.pdf")
with open(fname, "wb") as f:
    f.write(resp.content)
# Open the PDF in a context manager so the file handle is closed afterward
with pdfplumber.open(fname) as pdf:
    print(pdf.pages[0].extract_text())

Output:

Official - Subject to Final Review 
1 IN THE SUPREME COURT OF THE UNITED STATES 
2 - - - - - - - - - - - - - - - - - x 
3 MICHAEL A. KNOWLES,               : 
4 WARDEN,                           :
5  Petitioner            :
6  v.                         :  No. 07-1315 
7 ALEXANDRE MIRZAYANCE.             : 
8 - - - - - - - - - - - - - - - - - x
9  Washington, D.C.
10  Tuesday, January 13, 2009
11
12  The above-entitled matter came on for oral 
13 argument before the Supreme Court of the United States 
14 at 1:01 p.m. 
15 APPEARANCES: 
16 STEVEN E. MERCER, ESQ., Deputy Attorney General, Los
17  Angeles, Cal.; on behalf of the Petitioner. 
18 CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf
19  of the Respondent. 
20
21
22
23
24
25
1


Alderson Reporting Company


jsfenfen · 2016-03-15T23:49:17Z

I was able to do something like this using an alternate utils.collate_line_layout.

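from operator import itemgetter

# decimalize is assumed to be the helper from pdfplumber's utils module,
# which is where this alternate function is meant to live.
def collate_line_layout(line_chars, tolerance=0, line_width=612, fixed_width=90):
    tolerance = decimalize(tolerance)
    # Width of one output column, in points (612 / 90 = 6.8 by default)
    space_width = decimalize((line_width + 0.0) / fixed_width)
    coll = ""
    last_x1 = 0
    for char in sorted(line_chars, key=itemgetter("x0")):
        if last_x1 is not None and char["x0"] > (last_x1 + tolerance):
            delta_x = char["x0"] - last_x1
            # Convert the gap into spaces, always emitting at least one
            num_spaces = max(1, int(round(delta_x / space_width)))
            coll += " " * num_spaces
        last_x1 = char["x1"]
        coll += char["text"]
    return coll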

Output is below (I used the default of 90 chars wide). Not as pretty -- and you probably would wanna remove the left margin. How do you think this should work? One could, of course, just use pdftotext -layout, but I've come across stuff where just being able to grab a region and deal with it in a way that spaces matter seems like a good idea...

                                  Official - Subject to Final Review 
          1      IN THE SUPREME COURT OF THE UNITED STATES 
          2   - - - - - - - - - - - - - - - - - x 
          3   MICHAEL A. KNOWLES,               : 
          4   WARDEN,                           :
          5                Petitioner            :
          6           v.                         :  No. 07-1315 
          7   ALEXANDRE MIRZAYANCE.             : 
          8   - - - - - - - - - - - - - - - - - x
          9                           Washington, D.C.
         10                           Tuesday, January 13, 2009
         11
         12                  The above-entitled matter came on for oral 
         13   argument before the Supreme Court of the United States 
         14   at 1:01 p.m. 
         15   APPEARANCES: 
         16   STEVEN E. MERCER, ESQ., Deputy Attorney General, Los
         17      Angeles, Cal.; on behalf of the Petitioner. 
         18   CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf
         19      of the Respondent. 
         20
         21
         22
         23
         24
         25
                                           1


                                   Alderson Reporting Company

jsfenfen · 2016-03-16T00:00:35Z

It would be more consistent if one used utils.extract_words and then, for each line, put the start at exactly the right number of spaces. Doing that for just a piece of the page would work if one knew the width--though I guess you'd have to know it, realistically.
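A minimal sketch of that word-based idea, under some assumptions: the input is a list of dicts with "text" and "x0" keys (the shape produced by extract_words), the region width is known, and the helper name place_words is hypothetical. The 612-point width and 90 columns are simply carried over from the function above.

```python
def place_words(words, line_width=612.0, fixed_width=90):
    """Render one line of words, starting each word at a column derived from its x0.

    `words` is a list of dicts with "text" and "x0" keys. `place_words` is a
    hypothetical helper written for this illustration only.
    """
    points_per_col = line_width / fixed_width  # 612 / 90 = 6.8 points per column
    line = ""
    for word in sorted(words, key=lambda w: w["x0"]):
        target_col = round(word["x0"] / points_per_col)
        # Pad out to the target column, but always leave at least one space
        # between consecutive words so they never run together.
        pad = target_col - len(line)
        if line and pad < 1:
            pad = 1
        line += " " * max(pad, 0) + word["text"]
    return line


# Synthetic words loosely modeled on line 6 of the transcript page.
words = [
    {"text": "6", "x0": 68.0},
    {"text": "v.", "x0": 136.0},
    {"text": ":", "x0": 340.0},
    {"text": "No.", "x0": 408.0},
]
print(place_words(words))
```

Because each word is placed from its own x0 rather than from the gap to the previous character, rounding errors do not accumulate along the line. For a cropped region, one would presumably subtract the region's left edge from each x0 first.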

jsvine · 2016-03-17T13:02:56Z

🎉 This seems like a great feature to have/add. And seems quite doable. There are (at least) two options for implementation:

1. As an optional flag to .extract_text(...), e.g., preserve_layout=True.
2. As a method unto its own, e.g., .extract_layout(...)

My slight preference is for the first option, since that seems like the logical place where people would look. Thoughts?

And, yep @jsfenfen, seems like .extract_words(...) would be a good underlying approach.

@dannguyen: Are you looking for something similar to pdftotext's layout, or exactly the same as it?

Should be able to knock this out some evening or weekend soon.

dannguyen · 2016-03-17T15:15:52Z

@jsvine Uh... I was only interested in it on the assumption that it was some kind of "standard", but you'd have to re-implement it yourself from scratch? I don't know why I assumed that; maybe I was just really optimistic and intoxicated at the time. Mostly, I was hoping for a cross-platform way to parse PDFs to text with the layout preserved, just for regex exercises, that didn't require installing poppler. But the layout option was never perfect and would sometimes mangle lines anyway.

jsvine · 2016-03-17T22:47:05Z

Intoxicated Optimism will be the name of my next band. But to your point:

- There's no standard approach/spec for extracting that sort of preserved-layout text.
- I think it's worth doing anyhow.
- Given the current architecture, seems straightforward enough.
- I'll regret writing the previous bullet point about 30 minutes after I try implementing this feature.

abhishek-jain-infrrd · 2018-09-12T11:33:23Z

Did anyone get the time to work on this feature? Having extract_text(...) with preserve_layout=True seems the best way. Can anyone confirm whether this was implemented?

oliverbj · 2019-07-04T12:32:27Z

I am also interested in this. Did you give this a try @jsvine ?

toby2 · 2021-02-04T10:46:34Z

I am also interested in this. Did you give this a try @jsvine ?

jsvine · 2021-02-04T13:56:38Z

I've sketched out some possible implementations, but haven't made the time yet to code/debug them. Thank you for noting your interest, however; it's useful information.

jigsawcoder · 2021-07-21T09:08:47Z

Quoting @jsfenfen's collate_line_layout comment above: can you please explain how I can use this function for my use case?
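One way the function might be applied, as a rough sketch rather than anything pdfplumber ships: group the page's character objects into lines, then collate each line. The version below is a float-only variant of collate_line_layout (no decimalize), and the helpers chars_to_layout and make_chars are hypothetical. Grouping lines by rounded "top" is a simplifying assumption, and the synthetic characters stand in for what one would get from page.chars on a real PDF.

```python
from itertools import groupby
from operator import itemgetter


def collate_line_layout(line_chars, tolerance=0.0, line_width=612.0, fixed_width=90):
    """Float-only variant of the function quoted in this thread (no decimalize)."""
    space_width = line_width / fixed_width  # points per output column
    coll = ""
    last_x1 = 0.0
    for char in sorted(line_chars, key=itemgetter("x0")):
        if char["x0"] > last_x1 + tolerance:
            # Convert the horizontal gap, in points, into a run of spaces.
            num_spaces = max(1, round((char["x0"] - last_x1) / space_width))
            coll += " " * num_spaces
        last_x1 = char["x1"]
        coll += char["text"]
    return coll


def chars_to_layout(chars, **kwargs):
    """Hypothetical helper: group chars into lines by rounded `top`, then collate each."""
    keyed = sorted(chars, key=lambda c: (round(c["top"]), c["x0"]))
    lines = groupby(keyed, key=lambda c: round(c["top"]))
    return "\n".join(collate_line_layout(list(group), **kwargs) for _, group in lines)


def make_chars(text, x0, top, width=6.8):
    """Hypothetical helper: build synthetic char dicts, each glyph `width` points wide."""
    return [
        {"text": ch, "x0": x0 + i * width, "x1": x0 + (i + 1) * width, "top": top}
        for i, ch in enumerate(text)
    ]


chars = (
    make_chars("v.", 68.0, 100.0)
    + make_chars("No.", 136.0, 100.0)
    + make_chars("x", 34.0, 112.0)
)
print(chars_to_layout(chars))
```

With the defaults, one output column covers 612 / 90 = 6.8 points, so a character starting 68 points from the left edge lands after 10 spaces. Passing a smaller line_width or a different fixed_width changes that ratio, which is the knob to adjust when working with a cropped region.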


jsvine · 2021-12-24T03:08:21Z

More than five years after this issue was first opened (thank you @dannguyen), pdfplumber now has an experimental version of this feature! It uses an approach similar to @jsfenfen's suggestion (thank you!), with a few tweaks. Documentation in README.md:

... and more extensively in the code itself:

pdfplumber/pdfplumber/utils.py

Lines 345 to 382 in 95c049d

def words_to_layout(
    words,
    x_density=DEFAULT_X_DENSITY,
    y_density=DEFAULT_Y_DENSITY,
    x_shift=0,
    y_shift=0,
    y_tolerance=DEFAULT_Y_TOLERANCE,
    presorted=False,
):
    """
    Given a set of word objects generated by `extract_words(...)`, return a
    string that mimics the structural layout of the text on the page(s), using
    the following approach:

    - Sort the words by (doctop, x0) if not already sorted.

    - Calculate the initial doctop for the starting page.

    - Cluster the words by doctop (taking `y_tolerance` into account), and
      iterate through them.

    - For each cluster, calculate the distance between that doctop and the
      initial doctop, in points, minus `y_shift`. Divide that distance by
      `y_density` to calculate the minimum number of newlines that should come
      before this cluster. Append that number of newlines *minus* the number of
      newlines already appended, with a minimum of one.

    - Then for each cluster, iterate through each word in it. Divide each
      word's x0, minus `x_shift`, by `x_density` to calculate the minimum
      number of characters that should come before this cluster.  Append that
      number of spaces *minus* the number of characters and spaces already
      appended, with a minimum of one. Then append the word's text.

    Note: This approach currently works best for horizontal, left-to-right
    text, but will display all words regardless of orientation. There is room
    for improvement in better supporting right-to-left text, as well as
    vertical text.
    """

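The steps in that docstring can be illustrated with a simplified, single-page sketch. This is not pdfplumber's code: the function name layout_sketch is hypothetical, the density and tolerance values are picked for illustration, the initial doctop is assumed to be 0, and the first cluster and first word are allowed zero leading newlines or spaces (a choice made here so the output does not start with forced padding).

```python
def layout_sketch(words, x_density=7.25, y_density=13.0,
                  x_shift=0.0, y_shift=0.0, y_tolerance=3.0):
    """Simplified, single-page illustration of the docstring's steps."""
    words = sorted(words, key=lambda w: (w["doctop"], w["x0"]))

    # Cluster words into lines: a word joins the current cluster when its
    # doctop is within y_tolerance of the cluster's first word.
    clusters = []
    for word in words:
        if clusters and word["doctop"] - clusters[-1][0]["doctop"] <= y_tolerance:
            clusters[-1].append(word)
        else:
            clusters.append([word])

    initial_doctop = 0.0  # assumption: one page whose top edge is at doctop 0
    out = ""
    newlines_so_far = 0
    for i, cluster in enumerate(clusters):
        y_dist = (cluster[0]["doctop"] - initial_doctop - y_shift) / y_density
        # At least one newline between clusters; none is forced before the first.
        n_newlines = max(min(i, 1), round(y_dist) - newlines_so_far)
        out += "\n" * n_newlines
        newlines_so_far += n_newlines

        line_len = 0
        for j, word in enumerate(sorted(cluster, key=lambda w: w["x0"])):
            x_dist = (word["x0"] - x_shift) / x_density
            # At least one space between words; none is forced before the first.
            n_spaces = max(min(j, 1), round(x_dist) - line_len)
            out += " " * n_spaces + word["text"]
            line_len += n_spaces + len(word["text"])
    return out


# Synthetic words: two lines, the second with slightly uneven doctop values.
words = [
    {"text": "1", "x0": 29.0, "doctop": 26.0},
    {"text": "IN", "x0": 72.5, "doctop": 26.0},
    {"text": "2", "x0": 29.0, "doctop": 52.0},
    {"text": "x", "x0": 145.0, "doctop": 53.0},
]
print(layout_sketch(words))
```

In pdfplumber itself, the commits referenced in this thread expose the feature as .extract_text(layout=True), so a sketch like this is only useful for understanding how positions map to newlines and spaces.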
jsvine added the enhancement label Jul 30, 2020

samkit-jain mentioned this issue Aug 12, 2020

TXWylie01a-FIN.pdf #250

Closed

jsvine mentioned this issue Oct 20, 2020

Issue : Can't able to extract line spacing between paragraph #295

Closed

samkit-jain mentioned this issue Jan 6, 2021

Extract paragraphs #331

Closed

samkit-jain mentioned this issue Jul 21, 2021

Layout Detection similar to pdfminer.six #476

Closed

jsvine mentioned this issue Oct 16, 2021

Is there any way to include blank lines when extracting texts? #516

Closed

jsvine added a commit that referenced this issue Oct 22, 2021

Add experimental .extract_text(layout=True)

6236483

See the docstring in utils.words_to_layout for details on the implementation. Addresses issue #10 and related issues.

jsvine added a commit that referenced this issue Nov 3, 2021

Add experimental .extract_text(layout=True)

d235d4b

See the docstring in utils.words_to_layout for details on the implementation. Addresses issue #10 and related issues.

jsvine mentioned this issue Nov 3, 2021

Add experimental .extract_text(layout=True) #532

Merged

jsvine closed this as completed Dec 24, 2021
