-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract texkeys from PDFs #18
Conversation
xpos_count[xpos] = xpos_count.get(xpos, 0) + 1 | ||
if xpos_count[xpos] >= cutoff: | ||
return True | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a collections.Counter
for this:
xpositions = ...
xpositions_counts = Counter(xpositions)
return any(map(lambda count: count >= cutoff, xpositions_counts.values()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but this would break compatibility with Python 2.6 that does not have collections.Counter
, and the plan is to backport this to legacy. Plus Counter
would build the whole dict while here we return early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but this would break compatibility with Python 2.6 that does not have collections.Counter, and the plan is to backport this to legacy.
Aww, ok.
Plus
Counter
would build the whole dict while here we return early.
TBH this matters only if it is code in a tight loop which is expected to run as fast as possible. In all other cases, I would call this a Premature Optimization: http://wiki.c2.com/?PrematureOptimization
@kaplun waiting for you to merge, @jacquerie is afraid he will get all dirty by getting too close to the mess of refextract :D |
ahahah |
contents in from the file. In the case of a PDF/PostScript however, | ||
this means converting the document to plaintext. | ||
@param fpath: (string) - the path to the fulltext file | ||
@return: (list) of strings - each string being a line in the document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not corresponds to the current return
, which also return a status
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can fix this, but wrong docstrings are the norm rather than the exception here
pdf.getDestinationPageNumber(destination) | ||
).cropBox.lowerRight[0] | ||
# assuming max 2 columns | ||
column = (2*destination.left)//pagewidth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not PEP8 friendly. Can you add spacing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where? around \\
? pylint
was fine with it
LGTM! Just very small silly comments. Can you also rebase (I have merged previous PRs). |
f74d638
to
d633a2a
Compare
d633a2a
to
af17ff1
Compare
@kaplun I addressed your comments (and much more) |
af17ff1
to
686cd74
Compare
686cd74
to
21858e4
Compare
@michamos can you just rebase so that we can test on top of Python 2.6 as well? |
* refactor in order to decouple filetype detection from plaintext extraction Signed-off-by: Micha Moskovic <michamos@gmail.com>
* INCOMPATIBLE the api now only returns the references, not the stats * parse_references returns a tuple (references, stats) instead of a dictionary with those two keys. Signed-off-by: Micha Moskovic <michamos@gmail.com>
* texkeys are extracted from the named destinations in the PDF in extract_texkeys_from_url and extract_texkeys_from_file and if the number of texkeys matches with the number of references found from text parsing
* Add tests for extract_texkeys_from_pdf, both for one- and two-column layouts * Refactor tests to get the pdf files as a fixture Signed-off-by: Micha Moskovic <michamos@gmail.com>
Signed-off-by: Micha Moskovic <michamos@gmail.com>
* Make refextract more idiomatic, raising exceptions instead of having (result, error) return values in functions. * INCOMPATIBLE FullTextNotAvailable is renamed to FullTextNotAvailableError. * NEW There are two new exceptions, UnknownDocumentTypeError when the file/URL is not a PDF or plain text and GarbageFullTextError when the PDF fulltext extraction gives garbage. * The exception raised when 'pdftotext' is not found is now FileNotFoundError instead of Exception. * Fix the utterly broken error handling in extract_references_from_url. * Add tests for UnknownDocumentTypeError and FullTextNotAvailableError. Signed-off-by: Micha Moskovic <michamos@gmail.com>
No description provided.