PDF Scrapper: Look-up existing .bib files during "BibTeX mode" #144

Open
j-steinbach opened this issue Dec 8, 2020 · 5 comments
Labels
1. enhancement New feature or request :pdf_scrapper

Comments

@j-steinbach

At the moment, the PDF Scrapper looks up extracted cite-keys after it has finished its process and then sorts them into in-roam and in-bib.

It makes sense to compare the "BibTeX mode" references with the existing bibliographic files during the extraction process.

Ideally, it would compare the extracted keys with existing keys, so that the user can immediately identify erroneous or duplicate keys. This is very helpful if the user relies on an external reference management tool (Zotero, papis, ...) that auto-generates keys in a different way.


Example (split pane view)

(extracted)               |    (global .bib file)
adam2003eve               |    adam2002eva
                          |    adam2003event
bert2004egon              |  
charles2997manson         |   
dagmar1002duck            |    dagmar1002duck_tales

For a more long-term perspective, there should also be ways to insert those references into the (global) .bib file(s).

@myshevchuk
Member

Have you checked the orb-autokey functionality? It lets you configure key generation to your liking and should be able to cover the format of the keys listed in the right pane. The keys initially presented in the buffer are generated by AnyStyle, not ORB, so there is no control over them. You can then press C-c C-u to generate keys according to orb-autokey-format. This can of course be automated so that the generation happens immediately after the entries have been extracted. I will add an option for that.

Similarity search is a bit trickier than simple exact matching of the keys. I will need to check whether a general Elisp library for text similarity search exists. If not, implementing it from scratch is not viable, at least not for now.
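To make the difference from exact matching concrete, here is a minimal sketch of key similarity search. It is written in Python only for illustration and relies on the standard-library `difflib` module; it is not ORB code, and whether a comparable Elisp library exists is still the open question. The function name `similar_keys` and the `cutoff` and `limit` values are assumptions chosen for this example.

```python
import difflib

def similar_keys(extracted, library, cutoff=0.6, limit=3):
    """Map each extracted key to library keys that look similar to it.

    cutoff and limit are illustrative defaults, not tuned values.
    """
    return {
        key: difflib.get_close_matches(key, library, n=limit, cutoff=cutoff)
        for key in extracted
    }

# Keys taken from the split-pane example in this issue.
extracted = ["adam2003eve", "bert2004egon", "dagmar1002duck"]
library = ["adam2002eva", "adam2003event", "dagmar1002duck_tales"]

matches = similar_keys(extracted, library)
for key, candidates in matches.items():
    print(key, "->", candidates or "no similar key")
```

With these inputs, `adam2003eve` is paired with both `adam` keys from the library, while `bert2004egon` gets no candidates. An exact-match lookup would report all three extracted keys as missing.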

The split pane view should probably be a separate buffer invoked with its own command where those keys are listed in a table. This will also need to wait.

What can be done quickly and reliably is inserting a comment above the BibTeX entry with the key matched from the library.
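A rough sketch of that idea, in Python for illustration only (this is not how ORB implements or would implement it): each extracted entry whose key also occurs in the library gets a marker line above it. The function name `annotate_entries`, the `% in library:` wording, and the simplified entry regex are all assumptions made for this example, and only exact key matches are handled.

```python
import re

# Simplified pattern for the start of a BibTeX entry, e.g. "@article{key,".
ENTRY_START = re.compile(r"^@\w+\{\s*([^,\s]+)\s*,", re.MULTILINE)

def annotate_entries(bibtex_text, library_keys):
    """Insert a '% in library: KEY' line above entries whose key is in library_keys."""
    def add_comment(match):
        key = match.group(1)
        if key in library_keys:
            return f"% in library: {key}\n{match.group(0)}"
        return match.group(0)  # unknown key: leave the entry untouched
    return ENTRY_START.sub(add_comment, bibtex_text)

extracted = (
    "@article{ryan2000self,\n  title = {Self-Determination Theory}\n}\n"
    "@misc{gartner2013,\n  title = {Hype cycle}\n}\n"
)
annotated = annotate_entries(extracted, {"ryan2000self"})
print(annotated)
```

Only the first entry is annotated here, because `gartner2013` is not in the assumed library key set.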

In any case, thank you very much for your interest and ideas!

@j-steinbach
Author

Yes, I am using (setq orb-autokey-format "%a%y%t") in ORB and [auth:lower][year][shorttitle1_1:lower] in Zotero (BBT), but sometimes they produce different titles. My guess is that they "ignore" different words when creating the short title.

I know (and use) the C-c C-u generation. It does not always work; in those cases I have to manually "create" a title. Here it would be very helpful to know whether that paper is already in my database.
Note: Even if I know that a title is a duplicate/already exists, I still generate the key, so that I have a local, note-only list of extracted cite keys.

Is it not possible to re-use the "fuzzy search" features from an existing package (I don't know the name; Helm? Ivy?)?

Also, I am not sure how much of a "similarity" search is needed. The first word (usually the author) should do the job. In my example, it lists the keys/entries from the global .bib file and puts them at the same "height" as the first letter that matches. There is no need to list the whole bibliography on the right.
I had the "git compare sources" view in mind when I wrote double-pane. Maybe that helps.
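To illustrate what I mean by matching on the first word only, here is a small sketch in Python (purely for illustration, not a proposal for the actual Elisp implementation). It assumes that the author part of a key is everything before the first digit; the function names `author_prefix` and `side_by_side` and the unrelated library key are made up for the example.

```python
import re

def author_prefix(key):
    """Return the part of a key before its first digit (assumed to be the author)."""
    return re.match(r"[^\d]*", key).group(0).lower()

def side_by_side(extracted, library):
    """Build (extracted_key, library_key) rows, showing only library keys
    whose author prefix equals that of an extracted key."""
    rows = []
    for key in extracted:
        candidates = [k for k in library if author_prefix(k) == author_prefix(key)]
        rows.append((key, candidates[0] if candidates else ""))
        # Additional candidates go on their own rows with an empty left cell.
        rows.extend(("", other) for other in candidates[1:])
    return rows

extracted = ["adam2003eve", "bert2004egon", "charles2997manson", "dagmar1002duck"]
library = ["adam2002eva", "adam2003event", "dagmar1002duck_tales", "zoe1999unrelated"]

rows = side_by_side(extracted, library)
for left, right in rows:
    print(f"{left:<26}|    {right}")
```

This reproduces the rows of the split-pane example from my first comment, and the unrelated `zoe1999unrelated` key never shows up on the right.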


And I am enthusiastic about the project because I currently have to read lots of papers. The PDF Scrapper fits perfectly into my workflow and saves me lots of time - time I "give back" by reporting every minor annoyance :)

@myshevchuk
Member

myshevchuk commented Dec 9, 2020

Yes, I am using (setq orb-autokey-format "%a%y%t") in ORB and [auth:lower][year][shorttitle1_1:lower]

I haven't used Zotero for years now; what exactly does shorttitle1_1 do? Is it the first two words separated by an underscore? If so, you can achieve the same in ORB with %t[2][][_]. See also orb-autokey-titlewords-ignore for a list of words in titles to be ignored during autokey generation. You can add words to or remove words from this variable to match BBT's behaviour.

I will look at how Helm and Ivy implement fuzzy search and check what else is available in the Emacs ecosystem. There definitely should be something. The split pane view will require a major effort, so I can't promise that it will be done quickly.

And I am enthusiastic about the project because I currently have to read lots of papers. The PDF Scrapper fits perfectly into my workflow and saves me lots of time - time I "give back" by reporting every minor annoyance :)

Great! I'm glad you find it useful, and your input is very much appreciated.

@j-steinbach
Author

From the BBT docs it is simply

shorttitleN_M: The first N (default: 3) words of the title, apply capitalization to first M (default: 0) of those


Here are two keys I observed to get generated differently:


author: Ryan
year: 2000
title: Self-Determination Theory and the Facilitation of Intrinsic Motivation, Social Development, and Well-Being

gets turned into

Zotero/BBT: ryan2000selfdetermination
ORB Autokey: ryan2000self

and

author: Gartner
year: 2013
title: Gartner’s 2013 hype cycle for emerging technologies

gets turned into

Zotero/BBT: gartner2013gartner
ORB Autokey: gartner2013
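
One possible explanation for the first case, which I have not verified against either tool's source: the two generators may treat the hyphen differently when picking the first title word. The Python sketch below is only an illustration of that guess. Both function names are hypothetical, and neither function is taken from BBT or ORB. It does not attempt to explain the Gartner case.

```python
import re

def first_word_strip_punct(title):
    """Take the first whitespace-separated word, then drop punctuation inside it."""
    first = title.split()[0]
    return re.sub(r"[^A-Za-z0-9]", "", first).lower()

def first_word_split_punct(title):
    """Treat every non-alphanumeric character as a word boundary, then take the first word."""
    return re.split(r"[^A-Za-z0-9]+", title.strip())[0].lower()

title = ("Self-Determination Theory and the Facilitation of Intrinsic "
         "Motivation, Social Development, and Well-Being")

key_a = "ryan2000" + first_word_strip_punct(title)  # hyphen removed, word kept whole
key_b = "ryan2000" + first_word_split_punct(title)  # hyphen acts as a separator
print(key_a)
print(key_b)
```

The first variant yields `ryan2000selfdetermination` and the second yields `ryan2000self`, which matches the two keys I observed, but that alone does not prove this is what happens internally.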

@myshevchuk
Member

gets turned into

Zotero/BBT: ryan2000selfdetermination
ORB Autokey: ryan2000self
and

author: Gartner
year: 2013
title: Gartner’s 2013 hype cycle for emerging technologies
gets turned into

Zotero/BBT: gartner2013gartner
ORB Autokey: gartner2013

This can be considered a bug; I filed a new issue for it.


2 participants