-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass in options that tell pdf2json what to output #21
Comments
I like the idea. However I have two things I consider important: First, there need to need to be reasonable defaults which allow you to use pdf2json as it was meant to be without getting a headache about whether and if so which options you have to pass to achieve this.. Second, options - where suitable - should not be exclusive. What if I was interested in getting out both, pixel and PostScript coordinates? |
Agree on both counts. This is why I wanted to get feedback before building anything :) Regarding the 'reasonable defaults'; perhaps if the And as for inclusive/exclusive, I agree, we should be able to get all possible properties returned. So instead of properties Speaking of which, I've been thinking about the JSON object that gets returned as output. Properties with names like I'm thinking of going even one step further: within this
Which return only the horizontal lines, vertical lines, and fills (but in an array named Why do something so elaborate? Two reasons: so that the output is ready to be processed and inserted into a user's database, which is easier when the property names already match the database's field names; or so that the output is ready to be processed and turned into HTML to be output on a page. I understand the desire to have short property names to reduce the size of the payload, and using this implementation that can still be achieved (even better than the current output, in fact). Other than database field names, perhaps Which raises yet another point: perhaps all the attribute names that have equivalent CSS properties should have their default names as such? So a few questions:
|
For output format: currently, pdf2json is deployed in production that converts ~250 PDFs daily and the resulting JSON is feeding a HTML5 renderer without big issues. I understand your intention on idiomatic JavaScript, but it's nice to have, not essential to the parser. In case you do need to have a different or improved output format, I'd suggest to create a new node module that runs on top of pdf2json, then you can get exactly what you want, like pdf2csv: https://www.npmjs.org/package/pdf2csv. I've been trying to avoid using "option" object to drive different outputs, mainly because this "derived configuration" way really creates mess for maintainability in the long run, especially when options object grows bigger over time and having multiple valid and invalid combinations. I'm in the camp of Ruby's "convention over configuration", not a fan of Java EE style configurations and options. What's essential to pdf2json is parsing and packaging PDF in a clean and efficient way. #18 is the biggest issue so far, it makes the output JSON bigger and causes headaches to client renderer, not clean and definitely not efficient either. If you have some bandwidth to make pdf2json better, I'd recommend to work on issues like this. As for the code structure, you are right, "base" is where the fork of pdf.js is, with documented modifications, since I'm running it outside of browser. Please checkout the file base/display/canvas.js, methods like showText and showSpacedText, I've attempted to fix #18 by auto-merging text blocks, but didn't work well in different test cases. What's above is just some thoughts, if you think differently, let's talk more. |
I understand the concern about an Here's the thing though: if there's no way to tell pdf2json what to output, then it will always need to output everything, and it will be up to a wrapper module (like pdf2csv) or the parent app itself to filter out the unneeded items. For example @modesty needs the pixel coordinates and @RST-J needs PostScript coordinates; well, if there's no way to tell pdf2json which to output, it will need to output both sets of coordinates so that both @modesty and @RST-J get what they need and then it's up to each user's app to filter out the properties it doesn't need. Likewise the dictionary would have to be scrapped entirely; many apps would need the original font face etc. data that the dictionary would be eliding, and if there's no way to tell pdf2json whether or not to use the dictionary, it would have to never use the dictionary and always output the verbose version of all that data (or even worse, output all that detail and the dictionary short version). Personally I think it makes more sense to put this "reformat the output" code as one standalone function within pdf2json, as a final step before returning the output object to the caller. You can still run all your tests as is, because if no If you agree with me that we should add this functionality, then we should go on to specify its design; what options should we allow, etc. As a separate question, what do you guys think about refactoring the output to be in idiomatic JavaScript? Things like And @modesty I understand your feelings about #18, I'd love to get that fixed too :) I have to deal with this for my app, though, while I don't necessarily need to fix #18 for my app to work. I'll definitely devote my attention to it after I get this output format stuff sorted. |
Building off of the 'add PostScript coordinates' idea in #12 , perhaps we should support for an
options
object passed as a second parameter toloadPDF
? And thisoptions
object could include key-value pairs likecoordinates: 'PostScript'
oruseDictionary: false
orexcludeTextsProperties: ['clr', 'oc', 'A', 'R.S']
etc. This could also be an easy way to implement #20 .I understand @modesty 's comment in closing #20 that an output format other than what he has produced is not what he had in mind, but surely we can make pdf2json more flexible to support more varied projects. I think it can produce the current output as well as others as different projects may desire. I'm happy to try to tackle this if others think it's a good idea, and I welcome ideas on the best implementation.
The text was updated successfully, but these errors were encountered: