page dimensions #3

Closed
jsfenfen opened this issue Mar 3, 2016 · 3 comments

@jsfenfen
Contributor

jsfenfen commented Mar 3, 2016

Is there a way to return overall page dimensions? Or alternatively, relative positions (i.e. in fractional terms--so like 0.56 of the page)?

The use case is comparing or finding words at comparable positions in documents that have different page sizes because of different prior processing. (This is also required for accurately displaying word positions as overlays on a PDF.) One could get at this by using relative positions (and I guess doctop would be prior_pages + current relative position). If you captured only relative positions, you'd probably also want to add an orientation variable, though I guess that could be determined from the page's aspect ratio.

Having page_width and page_height in every line in the csv seems awkward--but would work. In json output one could just add it as a variable outside the rest of the data. Maybe that's the cleanest approach. Do you have any thoughts?

@jsvine
Owner

jsvine commented Mar 4, 2016

> Is there a way to return overall page dimensions?

In the library, you should be able to get these from mypdf.pages[0].width and mypdf.pages[0].height. When I update the docs for the forthcoming v0.3.0, I'll document that.
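For the relative positions asked about in the original question, here is a minimal sketch of how page width and height could be used. It assumes each word or object is a dict with `x0`, `x1`, `top`, and `bottom` keys in page units; that shape, the helper names (`to_relative`, `orientation`, `relative_doctop`), and the sample word are all hypothetical and only for illustration.

```python
def to_relative(obj, page_width, page_height):
    """Return the object's bounding box as fractions of the page size."""
    return {
        "x0": obj["x0"] / page_width,
        "x1": obj["x1"] / page_width,
        "top": obj["top"] / page_height,
        "bottom": obj["bottom"] / page_height,
    }


def orientation(page_width, page_height):
    """Classify the page by its proportions (square pages count as portrait)."""
    return "landscape" if page_width > page_height else "portrait"


def relative_doctop(prior_pages, obj, page_height):
    """Document-level position: number of prior pages plus the relative top."""
    return prior_pages + obj["top"] / page_height


# Hypothetical word on a 612 x 792 page; the dict shape is an assumption.
word = {"text": "Total", "x0": 306.0, "x1": 336.6, "top": 396.0, "bottom": 408.0}
rel = to_relative(word, 612, 792)
print(rel["x0"], rel["top"])  # 0.5 0.5
```

In practice the two page arguments would come from the `width` and `height` attributes mentioned above, so the same word on differently sized scans maps to the same fractions.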

In the command-line tool, there's currently no way to output page dimensions. However, adding a subcommand to do that wouldn't be terribly difficult. I hope to get this into v0.3.0.

> Having page_width and page_height in every line in the csv seems awkward--but would work. In json output one could just add it as a variable outside the rest of the data. Maybe that's the cleanest approach. Do you have any thoughts?

Yeah, I think it might be worth adding some more structure to the JSON output, so that it looks something like:

{
  "metadata": { ... },
  "pages": [
    { 
       "width": 800,
       "height": 600,
       "objects": { "chars": [], "rects": [], ... }
    }
    ...
  ]
}

Thoughts on that?
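As a rough illustration of the proposed layout, the sketch below assembles a dict of that shape and serializes it with the standard `json` module. The `build_document` helper and its input format (a list of `(width, height, objects)` tuples) are hypothetical; this is not how the tool itself produces its output.

```python
import json


def build_document(metadata, pages):
    """Nest per-page dimensions alongside each page's objects.

    `pages` is an iterable of (width, height, objects) tuples, where
    `objects` maps an object type (e.g. "chars") to a list of records.
    """
    return {
        "metadata": metadata,
        "pages": [
            {"width": width, "height": height, "objects": objects}
            for (width, height, objects) in pages
        ],
    }


# Placeholder values mirroring the structure sketched above.
doc = build_document(
    {"Title": "example"},
    [(800, 600, {"chars": [], "rects": []})],
)
text = json.dumps(doc, indent=2)
print(text)
```

Keeping `width` and `height` at the page level means a consumer can normalize every object on that page without the dimensions being repeated on each record.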

In CSV output, maybe there's just a separate command, e.g., pdfplumber pageinfo, to provide id/width/height?
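A sketch of what such id/width/height output might look like, written with the standard `csv` module. The `pageinfo_csv` function, the column order, and the 1-based page ids are assumptions for illustration, not the behavior of an existing command.

```python
import csv
import io


def pageinfo_csv(page_sizes):
    """Render one CSV row per page: id, width, height.

    `page_sizes` is an iterable of (width, height) pairs; ids are
    assigned in order starting at 1 (an assumption of this sketch).
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "width", "height"])
    for page_id, (width, height) in enumerate(page_sizes, start=1):
        writer.writerow([page_id, width, height])
    return buf.getvalue()


out = pageinfo_csv([(612, 792), (792, 612)])
print(out, end="")
```

A separate table like this can be joined to the per-object CSV on the page id, which avoids repeating the dimensions on every row.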

@jsvine jsvine mentioned this issue Mar 7, 2016
@jsvine
Owner

jsvine commented Mar 7, 2016

Done! Richer JSON representation now in v0.3.0. #4

@jsfenfen
Contributor Author

jsfenfen commented Mar 7, 2016

That's awesome. Thanks again!

@jsfenfen jsfenfen closed this as completed Mar 7, 2016