Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract texkeys from PDFs #18

Merged
merged 6 commits into from
Jan 5, 2017
Merged

Conversation

michamos
Copy link
Contributor

@michamos michamos commented Dec 6, 2016

No description provided.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.4%) to 77.945% when pulling 105982c on michamos:pdf_texkeys into c697194 on inspirehep:master.

xpos_count[xpos] = xpos_count.get(xpos, 0) + 1
if xpos_count[xpos] >= cutoff:
return True
return False
Copy link
Contributor

@jacquerie jacquerie Dec 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a collections.Counter for this:

xpositions = ...
xpositions_counts = Counter(xpositions)

return any(map(lambda count: count >= cutoff, xpositions_counts.values()))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but this would break compatibility with Python 2.6 that does not have collections.Counter, and the plan is to backport this to legacy. Plus Counter would build the whole dict while here we return early.

Copy link
Contributor

@jacquerie jacquerie Dec 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but this would break compatibility with Python 2.6 that does not have collections.Counter, and the plan is to backport this to legacy.

Aww, ok.

Plus Counter would build the whole dict while here we return early.

TBH this matters only if it is code in a tight loop which is expected to run as fast as possible. In all other cases, I would call this a Premature Optimization: http://wiki.c2.com/?PrematureOptimization

@michamos
Copy link
Contributor Author

michamos commented Dec 8, 2016

@kaplun waiting for you to merge, @jacquerie is afraid he will get all dirty by getting too close to the mess of refextract :D

@kaplun
Copy link
Contributor

kaplun commented Dec 8, 2016

ahahah

@michamos michamos mentioned this pull request Dec 14, 2016
contents in from the file. In the case of a PDF/PostScript however,
this means converting the document to plaintext.
@param fpath: (string) - the path to the fulltext file
@return: (list) of strings - each string being a line in the document.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not corresponds to the current return, which also return a status flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can fix this, but wrong docstrings are the norm rather than the exception here

pdf.getDestinationPageNumber(destination)
).cropBox.lowerRight[0]
# assuming max 2 columns
column = (2*destination.left)//pagewidth
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not PEP8 friendly. Can you add spacing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where? around \\? pylint was fine with it

@kaplun
Copy link
Contributor

kaplun commented Dec 14, 2016

LGTM! Just very small silly comments. Can you also rebase (I have merged previous PRs).

@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 78.31% when pulling d633a2a on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 78.31% when pulling af17ff1 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 78.31% when pulling af17ff1 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

@michamos
Copy link
Contributor Author

@kaplun I addressed your comments (and much more)

@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 78.319% when pulling 686cd74 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 78.319% when pulling 21858e4 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

@kaplun
Copy link
Contributor

kaplun commented Jan 5, 2017

@michamos can you just rebase so that we can test on top of Python 2.6 as well?

* refactor in order to decouple filetype detection from plaintext
extraction

Signed-off-by: Micha Moskovic <michamos@gmail.com>
* INCOMPATIBLE the api now only returns the references, not the stats
* parse_references returns a tuple (references, stats) instead of a
dictionary with those two keys.

Signed-off-by: Micha Moskovic <michamos@gmail.com>
* texkeys are extracted from the named destinations in the PDF in
extract_texkeys_from_url and extract_texkeys_from_file and if the number
of texkeys matches with the number of references found from text parsing
* Add tests for extract_texkeys_from_pdf, both for one- and two-column
layouts
* Refactor tests to get the pdf files as a fixture

Signed-off-by: Micha Moskovic <michamos@gmail.com>
Signed-off-by: Micha Moskovic <michamos@gmail.com>
* Make refextract  more idiomatic, raising exceptions instead of
having (result, error) return values in functions.
* INCOMPATIBLE FullTextNotAvailable is renamed to
FullTextNotAvailableError.
* NEW There are two new exceptions, UnknownDocumentTypeError when the
file/URL is not a PDF or plain text and GarbageFullTextError when the
PDF fulltext extraction gives garbage.
* The exception raised when 'pdftotext' is not found is now
FileNotFoundError instead of Exception.
* Fix the utterly broken error handling in extract_references_from_url.
* Add tests for UnknownDocumentTypeError and
FullTextNotAvailableError.

Signed-off-by: Micha Moskovic <michamos@gmail.com>
@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 78.328% when pulling 350b4e8 on michamos:pdf_texkeys into c7a18bd on inspirehep:master.

@kaplun kaplun merged commit 3a5c0ee into inspirehep:master Jan 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants