Paste your BibTeX entries into input.bib (UTF-8 encoded), then run the cells below.

Note: A small number of references may still not get a URL automatically. For these, please add the URL manually.

Author: Hakaze Cho.

If you have a Semantic Scholar API key, you can enter it in the following cell.

If you don't have a Semantic Scholar API key, you can skip this step. The key only **accelerates** the process; it does not affect accuracy.

In [1]:
semantics_scholar_api_key = None
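
If you prefer not to type the key into the notebook, one option is to read it from an environment variable. The following is a minimal sketch, not part of the original notebook: the variable name `SEMANTIC_SCHOLAR_API_KEY` and the helper `build_headers` are hypothetical names chosen for illustration. Only the `x-api-key` header name comes from the search cell further down.

```python
import os

# Hypothetical environment variable name; pick whatever fits your setup.
ENV_VAR = "SEMANTIC_SCHOLAR_API_KEY"

def build_headers(api_key=None, env=os.environ):
    """Return request headers, adding x-api-key only when a key is available."""
    # An explicitly passed key wins; otherwise fall back to the environment.
    key = api_key or env.get(ENV_VAR)
    return {"x-api-key": key} if key else {}
```

With no key anywhere, `build_headers()` returns an empty dict, which corresponds to running the search without a key.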

In [2]:
%pip install bibtexparser
%pip install requests

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:
import requests
import bibtexparser
import xml.etree.ElementTree as ET
import time
import os

In [7]:
import os
try:
    # Switch to the project folder; use '.' if input.bib is in the current folder
    os.chdir('/home/s2320415/Auto_Bib_Link')
    #print(os.getcwd())
except OSError:
    pass

In [8]:
with open('input.bib', encoding='utf-8') as bibtex_file:
    library = bibtexparser.load(bibtex_file)

In [9]:
# arXiv title search
count = 0
succeed = 0
for paper in library.entries:
    count += 1
    if 'url' not in paper:
        print(f"Missing URL for {paper['ID']}, \"{paper['title']}\". Searching for URL...")
        # search for URL
        title = paper['title']
        # Pass the query via params so that requests URL-encodes the title
        params = {"search_query": f"ti:{title}", "sortBy": "relevance", "max_results": 50}
        response = requests.get("http://export.arxiv.org/api/query", params=params)
        root = ET.fromstring(response.content)
        entrys = root.findall('{http://www.w3.org/2005/Atom}entry')
        for entry in entrys:
            link = entry.find('{http://www.w3.org/2005/Atom}link')
            titlefound = entry.find('{http://www.w3.org/2005/Atom}title')
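            if link is not None and titlefound is not None:
                # Compare case-insensitively, ignoring BibTeX braces and line wrapping in the feed
                if " ".join(title.replace("{", "").replace("}", "").lower().split()) != " ".join(titlefound.text.lower().split()):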
                    print(f"Title found for {paper['ID']} does not match: \"{titlefound.text}\", rejected.")
                else:
                    paper['url'] = link.attrib['href']
                    succeed += 1
                    print(f"Found URL for {paper['ID']}, \"{paper['title']}\": {paper['url']}, title found: \"{titlefound.text}\", accepted.")
                    break
            else:
                print(f"Could not find URL for {paper['ID']}.")
        else:
            print(f"Could not find URL for {paper['ID']}.")
    else:
        print(f"URL already exists for {paper['ID']}, \"{paper['title']}\".")
        succeed += 1
    print("\n")
print(f"Found {succeed} URLs out of {count} papers.")


URL already exists for abbas2024enhancing, "Enhancing in-context learning via linear probe calibration".


URL already exists for basile-etal-2019-semeval, "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter".


URL already exists for biderman2023pythia, "Pythia: A suite for analyzing large language models across training and scaling".


URL already exists for bruneticl, "ICL-Markup: Structuring In-Context Learning using Soft-Token Tags".


URL already exists for chen2023relation, "On the Relation between Sensitivity and Accuracy in In-Context Learning".


URL already exists for cho2024token, "Token-based Decision Criteria Are Suboptimal in In-context Learning".


URL already exists for collins2024context, "In-context learning with transformers: Softmax attention adapts to function lipschitzness".


URL already exists for commitmentbank, "The commitmentbank: Investigating projection in naturally occurring discourse".


URL already e
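
The title search accepts a result only when the titles agree exactly, so differences such as BibTeX protective braces (for example `{S}em{E}val`) or line wrapping in the arXiv feed can cause a correct hit to be rejected. A minimal sketch of a more tolerant comparison follows. The helper names `normalize_title` and `titles_match` are hypothetical and not part of the notebook.

```python
def normalize_title(title):
    """Lowercase a title, drop BibTeX braces, and collapse all whitespace runs."""
    without_braces = title.replace("{", "").replace("}", "")
    # split() with no arguments also removes newlines and repeated spaces
    return " ".join(without_braces.lower().split())

def titles_match(bibtex_title, arxiv_title):
    """Return True when both titles are equal after normalization."""
    return normalize_title(bibtex_title) == normalize_title(arxiv_title)
```

This is still an exact match after normalization, so it should not accept a different paper with a merely similar title.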

In [10]:
# arXiv number search
count = 0
succeed = 0
for paper in library.entries:
    count += 1
    if 'url' not in paper:
        print(f"Missing URL for {paper['ID']}, \"{paper['title']}\". Searching for URL...")
        # search for URL
        title = paper['title']
        if 'journal' in paper:
            journal = paper['journal']
            if 'arXiv:' in journal:  # require the colon, otherwise the split below raises IndexError
                arxiv_number = journal.split('arXiv:')[1].split(' ')[0]
                print(f"Arxiv number found for {paper['ID']}: {arxiv_number}.")
                paper['url'] = f"https://arxiv.org/abs/{arxiv_number}"
                print(f"Found URL for {paper['ID']}, \"{paper['title']}\": {paper['url']}, accepted.")
                succeed += 1
            else:
                print(f"Arxiv number not found for {paper['ID']}.")
        else:
            print(f"Journal not found for {paper['ID']}.")
    else:
        print(f"URL already exists for {paper['ID']}, \"{paper['title']}\".")
        succeed += 1
    print("\n")
print(f"Found {succeed} URLs out of {count} papers.")


URL already exists for abbas2024enhancing, "Enhancing in-context learning via linear probe calibration".


URL already exists for basile-etal-2019-semeval, "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter".


URL already exists for biderman2023pythia, "Pythia: A suite for analyzing large language models across training and scaling".


URL already exists for bruneticl, "ICL-Markup: Structuring In-Context Learning using Soft-Token Tags".


URL already exists for chen2023relation, "On the Relation between Sensitivity and Accuracy in In-Context Learning".


URL already exists for cho2024token, "Token-based Decision Criteria Are Suboptimal in In-context Learning".


URL already exists for collins2024context, "In-context learning with transformers: Softmax attention adapts to function lipschitzness".


URL already exists for commitmentbank, "The commitmentbank: Investigating projection in naturally occurring discourse".


URL already e
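
The number search splits the journal field on `arXiv:` and takes everything up to the next space, which keeps trailing punctuation such as a comma. A regular expression is one way to make the extraction stricter. This is a sketch under assumptions: the pattern covers the common new-style identifiers (like `2401.12345`) and old-style ones (like `hep-th/9901001`), the helper name `arxiv_url_from_journal` is hypothetical, and the identifiers in the comments are made up for illustration.

```python
import re

# New-style IDs (YYMM.NNNNN) or old-style IDs (archive/NNNNNNN), optional version suffix
ARXIV_ID = re.compile(
    r"arXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?|[a-z\-]+(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?)",
    re.IGNORECASE,
)

def arxiv_url_from_journal(journal):
    """Return the arXiv abstract URL found in a journal field, or None."""
    match = ARXIV_ID.search(journal)
    if match is None:
        return None
    return f"https://arxiv.org/abs/{match.group(1)}"

# Illustrative inputs only:
# arxiv_url_from_journal("arXiv preprint arXiv:2401.12345,")  gives the /abs/2401.12345 URL
# arxiv_url_from_journal("arXiv e-prints")                    gives None
```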

In [14]:
# Semantic Scholar search
count = 0
succeed = 0

if semantics_scholar_api_key is not None:
    headers = {
        "x-api-key": semantics_scholar_api_key
    }
else:
    headers = {}

for paper in library.entries:
    count += 1
    if 'url' not in paper:
        print(f"Missing URL for {paper['ID']}, \"{paper['title']}\". Searching for URL...")
        # search for URL
        title = paper['title']
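        query = "https://api.semanticscholar.org/graph/v1/paper/search/match"
        # Retry once per second until the API gives a final answer (200 or 404), e.g. after rate limiting
        while True:
            time.sleep(1)
            response = requests.get(query, params={"query": title}, headers=headers)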
            if response.status_code == 200 or response.status_code == 404:
                break
        if response.status_code == 404:
            print(f"Could not find URL for {paper['ID']}.\n")
            continue
        response_json = response.json()
        print(response_json)
        if response_json['data'] == []:
            print(f"Could not find URL for {paper['ID']}.\n")
            continue
        else:
            paperID = response_json['data'][0]['paperId']
            query_new = f"https://api.semanticscholar.org/graph/v1/paper/{paperID}?fields=url,openAccessPdf"
            while True:
                time.sleep(1)
                response_new = requests.get(query_new, headers=headers)
                if response_new.status_code == 200 or response_new.status_code == 404:
                    break
            if response_new.status_code == 404:
                print(f"Could not find URL for {paper['ID']}.\n")
                continue
            response_new_json = response_new.json()
            scholar_url = response_new_json['url']
            if "openAccessPdf" in response_new_json and response_new_json['openAccessPdf'] is not None:
                paper['url'] = response_new_json['openAccessPdf']['url']
                print(f"Found URL for {paper['ID']}, \"{paper['title']}\": {paper['url']}, accepted.")
            else:
                paper['url'] = scholar_url
                print(f"Found URL for {paper['ID']}, \"{paper['title']}\": {paper['url']}, accepted.")
            succeed += 1
    else:
        print(f"URL already exists for {paper['ID']}, \"{paper['title']}\".")
        succeed += 1
    print("\n")
print(f"Found {succeed} URLs out of {count} papers.")

URL already exists for abbas2024enhancing, "Enhancing in-context learning via linear probe calibration".


URL already exists for basile-etal-2019-semeval, "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter".


URL already exists for biderman2023pythia, "Pythia: A suite for analyzing large language models across training and scaling".


URL already exists for bruneticl, "ICL-Markup: Structuring In-Context Learning using Soft-Token Tags".


URL already exists for chen2023relation, "On the Relation between Sensitivity and Accuracy in In-Context Learning".


URL already exists for cho2024token, "Token-based Decision Criteria Are Suboptimal in In-context Learning".


URL already exists for collins2024context, "In-context learning with transformers: Softmax attention adapts to function lipschitzness".


URL already exists for commitmentbank, "The commitmentbank: Investigating projection in naturally occurring discourse".


URL already e
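
The search cell retries each request until it receives status 200 or 404, which never terminates if the server keeps answering with another status. A bounded retry with exponential backoff is one possible alternative. The sketch below is an assumption-laden illustration, not the notebook's method: `get_with_retries` is a hypothetical helper, and the demonstration uses a fake fetcher instead of a real request so that it runs without network access.

```python
import time
from types import SimpleNamespace

def get_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or 404, or attempts run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, for example:
        lambda: requests.get(url, params=params, headers=headers)
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code in (200, 404):
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1 s, 2 s, 4 s, ...
    return response  # last response, even if it was not 200 or 404

# Demonstration with a fake fetcher: two rate-limit answers, then success.
_codes = iter([429, 429, 200])
recorded_delays = []
final = get_with_retries(
    lambda: SimpleNamespace(status_code=next(_codes)),
    sleep=recorded_delays.append,
)
```

The caller still has to check `final.status_code`, because the helper returns the last response when all attempts fail.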

In [15]:
# List the papers that are still missing URLs

for paper in library.entries:
    if 'url' not in paper:
        print(f"Missing URL for {paper['ID']}, \"{paper['title']}\". Please add manually.")

Missing URL for falcon40b, "{Falcon-40B}: an open large language model with state-of-the-art performance". Please add manually.
Missing URL for RTE5, "The Fifth PASCAL Recognizing Textual Entailment Challenge.". Please add manually.
Missing URL for wang2021gpt, "GPT-J-6B: A 6 billion parameter autoregressive language model". Please add manually.
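
For the entries listed as missing, you can either edit the .bib file by hand or fill in the URLs in code before saving. The following sketch shows the second option. The helper `apply_manual_urls`, the entry IDs, and the example.org URLs are placeholders for illustration. It assumes entries are dicts with an `ID` key and an optional `url` key, as used in the cells above.

```python
def apply_manual_urls(entries, manual_urls):
    """Set 'url' on entries that lack one, using an ID -> URL mapping.

    Returns the IDs that still have no URL afterwards.
    """
    still_missing = []
    for entry in entries:
        if 'url' in entry:
            continue  # never overwrite a URL that is already present
        if entry['ID'] in manual_urls:
            entry['url'] = manual_urls[entry['ID']]
        else:
            still_missing.append(entry['ID'])
    return still_missing

# Placeholder entries and URLs, for illustration only
entries = [
    {"ID": "paperA", "title": "A"},
    {"ID": "paperB", "title": "B", "url": "https://example.org/b"},
    {"ID": "paperC", "title": "C"},
]
missing = apply_manual_urls(entries, {"paperA": "https://example.org/a"})
```

In the notebook, the first argument would be `library.entries`, and the call would go before the cell that writes output.bib.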


In [16]:
print(f"Found {succeed} URLs out of {count} papers.")
with open('output.bib', 'w', encoding='utf-8') as bibtex_file:
    bibtexparser.dump(library, bibtex_file)
print("Done. Output saved as output.bib.")

Found 76 URLs out of 79 papers.
Done. Output saved as output.bib.
