Extract texkeys from PDFs #18

michamos · 2016-12-06T14:50:22Z

No description provided.

coveralls · 2016-12-06T23:05:31Z

Coverage increased (+0.4%) to 77.945% when pulling 105982c on michamos:pdf_texkeys into c697194 on inspirehep:master.

jacquerie · 2016-12-07T15:34:03Z

refextract/references/pdf.py

+        xpos_count[xpos] = xpos_count.get(xpos, 0) + 1
+        if xpos_count[xpos] >= cutoff:
+            return True
+    return False


You can use a collections.Counter for this:

    xpositions = ...
    xpositions_counts = Counter(xpositions)
    return any(map(lambda count: count >= cutoff, xpositions_counts.values()))

I know, but this would break compatibility with Python 2.6 that does not have collections.Counter, and the plan is to backport this to legacy. Plus Counter would build the whole dict while here we return early.

I know, but this would break compatibility with Python 2.6 that does not have collections.Counter, and the plan is to backport this to legacy.

Aww, ok.

Plus Counter would build the whole dict while here we return early.

TBH this matters only if it is code in a tight loop which is expected to run as fast as possible. In all other cases, I would call this a Premature Optimization: http://wiki.c2.com/?PrematureOptimization
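
For comparison, the two approaches discussed in this thread can be sketched side by side. This is only an illustration, not the code in refextract/references/pdf.py: the function names and the sample x positions are made up, and the Counter variant needs Python 2.7 or later.

```python
from collections import Counter


def has_repeated_xpos_early_return(xpositions, cutoff):
    """Return True as soon as any x position has been seen `cutoff` times."""
    xpos_count = {}
    for xpos in xpositions:
        xpos_count[xpos] = xpos_count.get(xpos, 0) + 1
        if xpos_count[xpos] >= cutoff:
            return True  # stop without counting the remaining positions
    return False


def has_repeated_xpos_counter(xpositions, cutoff):
    """Same check with collections.Counter, which counts every item first."""
    xpositions_counts = Counter(xpositions)
    return any(count >= cutoff for count in xpositions_counts.values())


# Hypothetical x positions, for illustration only.
sample = [72, 72, 310, 72, 310]
```

Both functions give the same answer for the same input; they differ only in whether counting stops at the first position that reaches the cutoff.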

michamos · 2016-12-08T15:12:55Z

@kaplun waiting for you to merge, @jacquerie is afraid he will get all dirty by getting too close to the mess of refextract :D

kaplun · 2016-12-08T16:09:25Z

ahahah

kaplun · 2016-12-14T08:47:15Z

refextract/references/engine.py

+       contents in from the file. In the case of a PDF/PostScript however,
+       this means converting the document to plaintext.
+       @param fpath: (string) - the path to the fulltext file
+       @return: (list) of strings - each string being a line in the document.


This does not correspond to the current return value, which also includes a status flag.

I can fix this, but wrong docstrings are the norm rather than the exception here

kaplun · 2016-12-14T08:50:37Z

refextract/references/pdf.py

+        pdf.getDestinationPageNumber(destination)
+        ).cropBox.lowerRight[0]
+    # assuming max 2 columns
+    column = (2*destination.left)//pagewidth


This is not PEP8 friendly. Can you add spacing?

Where? Around //? pylint was fine with it.
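
The spacing being asked for is presumably whitespace around both arithmetic operators. A sketch of how that could look, assuming a hypothetical helper column_index and made-up page measurements (only the expression itself comes from the diff):

```python
def column_index(left, pagewidth):
    """Return 0 for the left column and 1 for the right, assuming max 2 columns."""
    return (2 * left) // pagewidth


# 612 points is the width of a US Letter page; the x offsets are made up.
left_column = column_index(100, 612)
right_column = column_index(400, 612)
```

Note that PEP 8 treats whitespace around arithmetic operators as a matter of judgment, which may be why the linter did not complain about the unspaced form.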

kaplun · 2016-12-14T08:51:39Z

LGTM! Just a few very small, silly comments. Can you also rebase? (I have merged the previous PRs.)

coveralls · 2016-12-14T15:10:15Z

Coverage increased (+0.8%) to 78.31% when pulling d633a2a on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

coveralls · 2016-12-14T15:11:40Z

Coverage increased (+0.8%) to 78.31% when pulling af17ff1 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

coveralls · 2016-12-14T15:13:45Z

Coverage increased (+0.8%) to 78.31% when pulling af17ff1 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

michamos · 2016-12-14T15:14:09Z

@kaplun I addressed your comments (and much more)

coveralls · 2016-12-16T14:52:01Z

Coverage increased (+0.8%) to 78.319% when pulling 686cd74 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

coveralls · 2016-12-16T14:57:24Z

Coverage increased (+0.8%) to 78.319% when pulling 21858e4 on michamos:pdf_texkeys into 47710a2 on inspirehep:master.

kaplun · 2017-01-05T10:28:00Z

@michamos can you just rebase so that we can test on top of Python 2.6 as well?

* Make refextract more idiomatic, raising exceptions instead of having (result, error) return values in functions.
* INCOMPATIBLE FullTextNotAvailable is renamed to FullTextNotAvailableError.
* NEW There are two new exceptions, UnknownDocumentTypeError when the file/URL is not a PDF or plain text and GarbageFullTextError when the PDF fulltext extraction gives garbage.
* The exception raised when 'pdftotext' is not found is now FileNotFoundError instead of Exception.
* Fix the utterly broken error handling in extract_references_from_url.
* Add tests for UnknownDocumentTypeError and FullTextNotAvailableError.

Signed-off-by: Micha Moskovic <michamos@gmail.com>
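
The move from (result, error) return values to exceptions can be illustrated with a small sketch. Only the exception names come from the commit message; their base class, the helper read_fulltext_lines, and its detection logic are assumptions made for this example and do not reflect refextract's actual implementation.

```python
import os


class FullTextNotAvailableError(Exception):
    """Raised when the fulltext file cannot be found (base class is an assumption)."""


class UnknownDocumentTypeError(Exception):
    """Raised when the file is neither a PDF nor plain text (base class is an assumption)."""


def read_fulltext_lines(path):
    """Hypothetical helper: return the lines of a plain-text file or raise."""
    if not os.path.isfile(path):
        raise FullTextNotAvailableError(path)
    with open(path, "rb") as handle:
        data = handle.read()
    if data.startswith(b"%PDF"):
        # A real implementation would convert the PDF to text here.
        raise NotImplementedError("PDF conversion is outside this sketch")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        raise UnknownDocumentTypeError(path)
    return text.splitlines()


# The caller handles failures with try/except instead of inspecting a status flag.
try:
    lines = read_fulltext_lines("no-such-file-for-this-sketch.txt")
except FullTextNotAvailableError:
    lines = []
```

With this style the return value is always the list of lines, and each failure mode has its own exception type that callers can catch selectively.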

coveralls · 2017-01-05T10:33:38Z

Coverage increased (+0.8%) to 78.328% when pulling 350b4e8 on michamos:pdf_texkeys into c7a18bd on inspirehep:master.

jacquerie reviewed Dec 7, 2016

kaplun added the Need: Review label Dec 8, 2016

michamos mentioned this pull request Dec 14, 2016

Use logging library #19

Closed

kaplun suggested changes Dec 14, 2016

michamos force-pushed the pdf_texkeys branch from f74d638 to d633a2a on December 14, 2016 15:08

michamos force-pushed the pdf_texkeys branch from d633a2a to af17ff1 on December 14, 2016 15:11

michamos force-pushed the pdf_texkeys branch from af17ff1 to 686cd74 on December 16, 2016 14:49

michamos force-pushed the pdf_texkeys branch from 686cd74 to 21858e4 on December 16, 2016 14:55

kaplun assigned michamos Jan 5, 2017

michamos added 6 commits January 5, 2017 11:29

engine: separate function for filetype detection

7a25430

* refactor in order to decouple filetype detection from plaintext extraction

Signed-off-by: Micha Moskovic <michamos@gmail.com>

api: remove stats

f197cb2

* INCOMPATIBLE the api now only returns the references, not the stats
* parse_references returns a tuple (references, stats) instead of a dictionary with those two keys.

Signed-off-by: Micha Moskovic <michamos@gmail.com>

api: extraction of texkeys from PDF

4101a11

* texkeys are extracted from the named destinations in the PDF in extract_texkeys_from_url and extract_texkeys_from_file and if the number of texkeys matches with the number of references found from text parsing
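
The commit message above is cut off, but it appears to describe pairing texkeys with references only when the two counts agree. A minimal sketch of that check follows; the helper attach_texkeys, the dict layout of a reference, and the 'texkey' and 'raw_ref' field names are all assumptions for illustration.

```python
def attach_texkeys(references, texkeys):
    """Attach texkeys to references only when the two lists have equal length."""
    if len(texkeys) != len(references):
        # Counts disagree, so the pairing would be unreliable; change nothing.
        return references
    # Build new dicts so the input references are left unmodified.
    return [dict(ref, texkey=key) for ref, key in zip(references, texkeys)]


# Made-up sample data.
references = [{"raw_ref": "[1] First reference"}, {"raw_ref": "[2] Second reference"}]
matched = attach_texkeys(references, ["Author:2016abc", "Other:2015xyz"])
unmatched = attach_texkeys(references, ["Author:2016abc"])
```

Here matched carries one texkey per reference, while unmatched is returned unchanged because only one texkey was found for two references.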

tests: new tests for texkey extraction

73f09b4

* Add tests for extract_texkeys_from_pdf, both for one- and two-column layouts
* Refactor tests to get the pdf files as a fixture

Signed-off-by: Micha Moskovic <michamos@gmail.com>

filetype detection through python-magic

10f0ccf

Signed-off-by: Micha Moskovic <michamos@gmail.com>

michamos force-pushed the pdf_texkeys branch from 21858e4 to 350b4e8 on January 5, 2017 10:31

kaplun approved these changes Jan 5, 2017

kaplun merged commit 3a5c0ee into inspirehep:master Jan 5, 2017

kaplun removed the Need: Review label Jan 5, 2017

This was referenced Jan 10, 2017

backport improvements to refextract inspirehep/inspire#256

Closed

backport improvements to refextract inspirehep/invenio#375

Closed

michamos mentioned this pull request Apr 24, 2017

Are TeXkeys really in 999C5k? inspirehep/inspire-schemas#146

Closed
