Replace PDFContentImporter by another library #169
As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or just as text on the title page (author, title, DOI, ..., possibly embedded BibTeX, see https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.
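To illustrate the title-page case, here is a minimal sketch of pulling a DOI out of already-extracted first-page text. It is not how PDFContentImporter or any of the libraries discussed here work; the class name `DoiSketch`, the method `findDoi`, and the regular expression are assumptions made for illustration, and the pattern only covers the common modern DOI shape.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JabRef: finds the first DOI-like string in plain text.
final class DoiSketch {
    // Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix of common DOI characters.
    private static final Pattern DOI =
            Pattern.compile("\\b10\\.\\d{4,9}/[-._;()/:A-Z0-9]+", Pattern.CASE_INSENSITIVE);

    static Optional<String> findDoi(String firstPageText) {
        Matcher matcher = DOI.matcher(firstPageText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String doi = matcher.group();
        // Drop sentence punctuation that the suffix character class may have swallowed.
        while (".,;:".indexOf(doi.charAt(doi.length() - 1)) >= 0) {
            doi = doi.substring(0, doi.length() - 1);
        }
        return Optional.of(doi);
    }
}
```

Calling `DoiSketch.findDoi(text)` on the text of the first page returns the DOI if one is printed there; the DOI could then be used to fetch the full entry instead of guessing author and title from the layout.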
Currently, a self-written functionality is employed. This works OK for LNCS and IEEE papers, but not for other publishers.
We already have GROBID in place; it should be used. Apache Tika should be checked, too.
Offer a merge dialog for the results of the different sources (e.g., XMP + PDF scraping via GROBID).
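As a rough sketch of what such a merge dialog would have to work with, the following compares two field maps, one per source, and reports which fields disagree. The class `MergeSketch`, both method names, and the plain `Map<String, String>` representation are assumptions for illustration only; JabRef's own entry model is not used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of JabRef: compares metadata from two extraction sources.
final class MergeSketch {
    // Fields present in both sources with different values; these need a user decision.
    static Set<String> conflictingFields(Map<String, String> xmp, Map<String, String> scraped) {
        Set<String> conflicts = new TreeSet<>();
        for (Map.Entry<String, String> field : xmp.entrySet()) {
            String other = scraped.get(field.getKey());
            if (other != null && !other.equals(field.getValue())) {
                conflicts.add(field.getKey());
            }
        }
        return conflicts;
    }

    // Union of both sources; on conflict the value from `preferred` wins.
    static Map<String, String> mergePreferFirst(Map<String, String> preferred, Map<String, String> fallback) {
        Map<String, String> merged = new LinkedHashMap<>(fallback);
        merged.putAll(preferred);
        return merged;
    }
}
```

Only the fields returned by `conflictingFields` would need to be shown side by side; everything else can be taken over automatically.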
Check current drag'n'drop behavior. In 3.8.2, the user was asked whether (s)he wants to create a new entry or link the PDF.
In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.
Depends on the definition of "easily". https://github.com/jruby/jruby/wiki/DirectJRubyEmbedding
(i) Being the initial author of PDFContentImporter, (ii) seeing that no one has taken it over in recent years, and (iii) knowing the issues and (iv) limitations of the current implementation, I think it is easier to integrate the library than to fix PDFContentImporter.
Side story: Colleagues from another department use the PDFContentImporter successfully with Springer and IEEE papers, which are the two publishers it was designed for.
There are questions on StackExchange asking for PDF2Bib:
Can we use Python to implement the PDF-to-BibTeX conversion?