Autobib is a Python script that helps organize your bibliography by automating mundane tasks such as querying .bib information from the web.
The way it works is simple: give autobib.py the root of the folder tree containing all the PDFs you want to index, and it will automagically query information from the internet and create a nice biblio.bib out of it.
It can also do other things, such as autoformatting your .bib files, generating unique keys, and so on.
This program has been tested with Python 3.5, and I don't know if it works with Python 2.7.
./autobib.py /home/username/biblio -cgf
Will find all PDFs in subdirectories of /home/username/biblio, retrieve BibTeX entries from the online databases Crossref and Google Scholar, and produce a nicely formatted biblio.bib file in every folder containing PDFs.
Note that it only uses filename information (no PDF metadata query yet), so each article needs to be named roughly like so:
(Author1, Author2) Title of the paper.pdf
This can of course be customized, but you'll need to adapt the function gen_filename to your liking.
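As an illustration of the naming convention above, here is a minimal sketch of how such a filename could be split into authors and title. The names `FILENAME_PATTERN` and `parse_filename` are hypothetical and are not taken from autobib.py; the actual parsing code may differ.

```python
import re
from pathlib import Path

# Hypothetical pattern for "(Author1, Author2) Title of the paper.pdf".
FILENAME_PATTERN = re.compile(r"^\((?P<authors>[^)]*)\)\s*(?P<title>.+)$")


def parse_filename(path):
    """Split a PDF filename into a list of authors and a title.

    Returns None when the name does not follow the expected convention.
    """
    stem = Path(path).stem  # drop the directory and the ".pdf" extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        return None
    authors = [a.strip() for a in match.group("authors").split(",") if a.strip()]
    return authors, match.group("title").strip()


print(parse_filename("(Author1, Author2) Title of the paper.pdf"))
```

Returning None for a non-matching name, instead of raising, lets a caller report the offending file and move on.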
./autobib.py /home/username/biblio -srm
Will synchronize the entries in your BibTeX file with the files in each folder (by updating the file field, or removing the entry if the file does not exist). Then -r will rename the files according to the BibTeX entries (the program always asks for confirmation before doing anything). Finally, -m creates a master.bib file at the root of the folder being processed (effectively merging and sorting the entries from all subfolders).
pip3 install --user habanero scholarly bibtexparser latexcodec titlecase termcolor colorama
sudo pacman -S python-termcolor python-colorama
yaourt -S python-bibtexparser
pip install --user habanero scholarly titlecase latexcodec
usage: autobib.py [-h] [-b] [-c] [-g] [-f] [-s] [-r] [-m] [-d] [filename]

positional arguments:
  filename        input file/folder

optional arguments:
  -h, --help      show this help message and exit
  -b, --backup    backup files upon writing
  -c, --crossref  query missing from Crossref
  -g, --google    query missing from Google Scholar
  -f, --format    format biblio
  -s, --sync      sync filenames and bib entries
  -r, --rename    rename files
  -m, --merge     merge bib files in subfolders into a master bib file
  -d, --delete    delete backuped files
How It Works
This program relies heavily on the python-bibtexparser library for reading/writing formatted BibTeX files.
The queries are currently done using 2 different backends: Crossref and Google Scholar. Crossref provides a nice API, and returns results with very few false negatives [1]. False positives are discarded according to the confidence score returned by Crossref.
Papers that were not matched using Crossref can be queried on Google Scholar, which doesn't offer a query API and might block you if you send too many requests. But it usually finds the more obscure references you might have in your library.
[1]: This can happen if a paper was published at a little-known conference and later republished in a higher-impact journal, by the same authors and under almost the same title. So always check the results returned by the online queries.
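The score-based filtering described above could look roughly like the sketch below. The function name `best_match` and the threshold values are placeholders chosen for illustration; the cut-off autobib.py actually uses is not stated here.

```python
def best_match(items, min_score=1.0):
    """Return the highest-scoring item, or None if no item reaches min_score.

    `items` is a list of dicts with a numeric "score" key, in the shape of
    Crossref search results. `min_score` is an arbitrary placeholder value.
    """
    scored = [item for item in items if item.get("score", 0.0) >= min_score]
    if not scored:
        return None
    return max(scored, key=lambda item: item["score"])


# With habanero, the candidates could come from something like:
#   from habanero import Crossref
#   items = Crossref().works(query="Title of the paper", limit=5)["message"]["items"]
# The sample data below is made up so that the sketch runs offline.
candidates = [
    {"title": ["Title of the paper"], "score": 42.5},
    {"title": ["An unrelated paper"], "score": 3.1},
]
print(best_match(candidates, min_score=10.0))
```

Even a high score does not rule out the republished-paper case from the note above, so the returned entry is still worth checking by hand.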
Sometimes it happens that a result produced by Crossref or Google Scholar doesn't quite correspond to what you had in mind. Maybe the match is wrong, or maybe it's missing a piece of information. In that case, you can override the matched entry by providing your own record for the offending file.
By default, the results of the online queries are stored in a .queried.bib file in each processed folder. You can provide your own .manual.bib file, which contains entries that you wrote manually and that will override the results found in .queried.bib. When invoking ./autobib.py --format, the program will read both .queried.bib and .manual.bib in each folder, and override the entries from .queried.bib that are similar to the ones in .manual.bib.
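The override step can be pictured with the sketch below, which works on plain dictionaries of the kind python-bibtexparser exposes for entries. How autobib.py decides that two entries are "similar" is not specified here; matching on the file field, and appending unmatched manual entries, are both assumptions made for this example, and `apply_manual_overrides` is a hypothetical name.

```python
def apply_manual_overrides(queried, manual, key="file"):
    """Return the queried entries, replaced by manual ones where `key` matches.

    Assumption: two entries are considered similar when they share the same
    value for `key` (the "file" field by default).
    """
    manual_by_key = {entry[key]: entry for entry in manual if key in entry}
    used = set()
    merged = []
    for entry in queried:
        value = entry.get(key)
        if value in manual_by_key:
            merged.append(manual_by_key[value])  # the manual entry wins
            used.add(value)
        else:
            merged.append(entry)
    # Assumption: manual entries that matched nothing are kept as well.
    merged.extend(entry for entry in manual if entry.get(key) not in used)
    return merged


queried = [
    {"ID": "a", "file": "x.pdf", "title": "Wrong match"},
    {"ID": "b", "file": "y.pdf", "title": "Correct match"},
]
manual = [{"ID": "a2", "file": "x.pdf", "title": "Hand-written entry"}]
print(apply_manual_overrides(queried, manual))
```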
Several aspects of autobib.py can be customized by modifying two files.
gen_filename defines the expected naming convention of a file. By default it is something like (Author1, Author2) Title of the paper.pdf, but maybe you want to include the year, or keep only the author initials.
gen_bibkey defines how bibkey identifiers are generated when invoking ./autobib.py --format. By default it uses ACM style.
The second file defines capitalization and pattern substitution rules. You can extend its word lists, such as lowercase_words. For now, pattern substitutions are hardcoded, but this will change in the future.
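To show what a list such as lowercase_words is for, here is a simplified sketch of title capitalization driven by such a list. It is not the rule set autobib.py applies (which relies on the titlecase package); the sample words and the function name `capitalize_title` are made up for the example.

```python
# Hypothetical sample; the real list lives in the file described above.
lowercase_words = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def capitalize_title(title):
    """Capitalize each word, keeping listed words lowercase except at the start."""
    result = []
    for index, word in enumerate(title.split()):
        if index > 0 and word.lower() in lowercase_words:
            result.append(word.lower())
        else:
            # Uppercase only the first letter and leave the rest untouched,
            # so acronyms such as "PDF" are preserved.
            result.append(word[:1].upper() + word[1:])
    return " ".join(result)


print(capitalize_title("a survey of the methods"))
```

Adding a word to the set is enough to keep it lowercase in every title processed afterwards.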
Right now the whole situation is a little bit messy. Problematic characters are replaced using hardcoded substitutions everywhere. The current rule is to use UTF-8 encoding to write and read every file, even on Windows. So please make sure you do not have accented characters in a file that uses an exotic encoding, or it might cause trouble.
Storing non-ASCII characters as their LaTeX equivalents is the best option for maximum compatibility, but I would also like to keep the possibility of using Unicode characters, which pose no problem with modern alternatives such as biber. Eventually I should provide an option for both strategies. But this will come later when I have the time to implement it correctly.
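The "UTF-8 everywhere" rule amounts to passing an explicit encoding on every file access instead of relying on the platform default (which on Windows is often not UTF-8). A minimal sketch, with hypothetical helper names `write_text` and `read_text`:

```python
import os
import tempfile


def write_text(path, text):
    """Write text to path as UTF-8, regardless of the platform default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text(path):
    """Read path as UTF-8, regardless of the platform default."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Round trip of an entry containing accented characters.
sample = "author = {Müller, Jérôme}\n"
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "biblio.bib")
    write_text(path, sample)
    restored = read_text(path)
print(restored == sample)
```

A file saved in another encoding will typically raise a UnicodeDecodeError when read this way, which is easier to diagnose than silently garbled characters.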
Q1: The program crashed when parsing the filename of my paper?
Are you sure your file is named according to the format
(%authors) %title.pdf? Right now this is the only supported naming scheme, but this will be improved in the future. Please see the customization section for more information.
Q2: Can I skip a sub-folder that I do not want to be processed?
Sure! Simply put an empty file named .nobib in this folder. There could be other options in the future, but for now this should be sufficient.
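For reference, a folder walk that honors such a marker file can be sketched as follows. The function name `folders_with_pdfs` is hypothetical, and whether the marker also excludes nested subfolders in autobib.py is an assumption of this sketch.

```python
import os


def folders_with_pdfs(root, marker=".nobib"):
    """Yield each folder under root that holds a PDF, skipping marked folders."""
    for folder, subfolders, files in os.walk(root):
        if marker in files:
            # Assumption: the marker also excludes everything below this folder.
            subfolders[:] = []
            continue
        if any(name.lower().endswith(".pdf") for name in files):
            yield folder
```

Clearing `subfolders` in place is how `os.walk` is told not to descend further when walking top-down.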
Q3: There are some weird characters in my results?
Ah, this is probably the result of some encoding madness. Right now there are some hardcoded character substitutions written here and there in the code. This will be improved in the future, but for now all I can do is add more quick fixes that work for you.
Q4: I have trouble with unicode characters on my Windows terminal.
This is very much a work in progress, so use at your own risk. I wrote this program to help me organize papers for my PhD thesis, so I know it works for me, but maybe you will not like the way I sort my bib entries, or the way I generate my bib keys, etc. In that case I encourage you to contribute or modify the script directly to fit your needs.
Also, there is no guarantee that the Google Scholar backend will still work tomorrow, because Google doesn't seem to like scripts too much and will probably block it eventually.
Cf. the TODO list.