# Notebook #3 for AHED Project

___
## Task 3: Match AHED to ADS Items by Title
Match the remaining unmatched papers to ADS bibcodes by title, using the ADS API.

Outline:

- Step 1: Format titles to query the ADS API
- Step 2: Query the ADS API with titles, return bibcodes
- Step 3: Match the bibcodes back to the paper list

### Step 1: Format file with titles list

In [None]:
import pandas as pd
import numpy as np

# Open my Excel sheet as a data frame
df = pd.read_excel("AHED/refs_matched.xlsx")

# Grab only rows where bibcode is null
# (.copy() so adding the QUERY column below does not raise SettingWithCopyWarning)
dt = df[df['BIBCODE'].isna()].copy()

# Create title & year query strings
dt['QUERY'] = ('(title: "' + dt['TITLE'].astype(str) + '" AND year:' + dt['YEAR'].astype(str) + ')')

# Format query list
titles = dt['QUERY'].to_list()

dt
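
The string concatenation above assumes clean inputs. Two cases can quietly break a query: a title containing a double quote, and a `YEAR` column that pandas loaded as floats (`2015.0`). The helper below is a minimal sketch of one way to guard against both; `build_query` is a hypothetical name, not part of the notebook, and the backslash escaping assumes Solr-style quoting inside a phrase. It keeps the same `(title: "..." AND year:YYYY)` layout used above.

```python
import math


def build_query(title, year):
    """Return a query string for one paper, or None if the title is missing."""
    if title is None or (isinstance(title, float) and math.isnan(title)):
        return None
    # Collapse runs of whitespace, then backslash-escape backslashes and
    # double quotes so a quote in the title cannot end the phrase early.
    # (Assumption: the API accepts Solr-style escapes inside a quoted phrase.)
    clean = " ".join(str(title).split())
    clean = clean.replace("\\", "\\\\").replace('"', '\\"')
    query = 'title: "' + clean + '"'
    # Years read from Excel may be floats (2015.0); cast to int before formatting.
    try:
        query += " AND year:" + str(int(float(year)))
    except (TypeError, ValueError, OverflowError):
        pass  # no usable year: fall back to a title-only query
    return "(" + query + ")"
```

In the notebook this could replace the concatenation with `dt['QUERY'] = [build_query(t, y) for t, y in zip(dt['TITLE'], dt['YEAR'])]`, followed by dropping any `None` entries before building the `titles` list.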

### Step 2: API Connection

In [None]:
# Loop through the query list in chunks of 25, query the API once per chunk,
# collect the bibcodes and titles it returns, and then build a data frame from the results.

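import os
import requests

# --- API REQUEST --- 
# Read the ADS API token from the environment instead of hard-coding it in the notebook
# (the variable name ADS_API_TOKEN is an assumption; use whatever name you export)
token = os.environ["ADS_API_TOKEN"]
url = "https://api.adsabs.harvard.edu/v1/search/query"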

data=[]

for i in range(0, len(titles), 25):
    # Join up to 25 title/year queries into a single OR query
    chunk = titles[i:i + 25]
    query = ' OR '.join(chunk)

    params = {"q": query, "fl": "title,bibcode", "rows": 200}
    headers = {'Authorization': 'Bearer ' + token}
    response = requests.get(url, params=params, headers=headers)
#     print(response.text, '\n')

    # Report failed requests instead of skipping them silently
    if not response.ok:
        print('Chunk starting at', i, 'failed with status', response.status_code)
        continue

    from_solr = response.json()
    # Keep the bibcode and first title of every document returned
    for doc in from_solr.get('response', {}).get('docs', []):
        if doc.get('title'):
            data.append((doc['bibcode'], doc['title'][0]))
#     print(data)

titles_matched = pd.DataFrame(data, columns = ['bibcode','TITLE'])
titles_matched
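
The chunking and response handling in this step can be checked without calling the API. The sketch below pulls both into small functions; `chunked` and `extract_matches` are hypothetical helper names, and the response layout (`response` → `docs`, with `title` returned as a list) is taken from the loop above rather than from the API documentation.

```python
def chunked(items, size=25):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_matches(from_solr):
    """Return (bibcode, first title) pairs from a Solr-style response dict."""
    docs = from_solr.get("response", {}).get("docs", [])
    pairs = []
    for doc in docs:
        titles = doc.get("title") or []
        # Skip documents that lack either field rather than raising KeyError
        if "bibcode" in doc and titles:
            pairs.append((doc["bibcode"], titles[0]))
    return pairs


def or_query(chunk):
    """Combine the per-paper query strings of one chunk into a single query."""
    return " OR ".join(chunk)
```

With these, the loop body reduces to building `or_query(chunk)` for each `chunk in chunked(titles)`, sending the request, and extending `data` with `extract_matches(response.json())`.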

### Step 3: Merge list to original data frame

In [None]:
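# Merge/join new table to original on 'TITLE' (exact string match).
# Drop repeated titles first so the left join cannot duplicate rows of df.
merged = df.merge(titles_matched.drop_duplicates(subset='TITLE'), on='TITLE', how='left')

# Combine bibcode columns, keeping any bibcode that was already present
merged['BIBCODE'] = merged['BIBCODE'].fillna(merged['bibcode'])
merged = merged.drop(columns='bibcode')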

# Count my running total of bibcodes matched (run before the NaN cleanup below)
# print(merged['BIBCODE'].notna().sum())

# Clean up nulls
merged = merged.replace(np.nan,'NA')

# Export merged data to new excel file
merged.to_excel("AHED/final_matched_2.xlsx", index=False)

merged
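
Because the join above is an exact string match, a title that the API returns with different capitalization, spacing, or trailing punctuation will not line up with the spreadsheet. The sketch below shows one possible workaround: join on a normalized key instead. `title_key` and `merge_on_normalized_title` are hypothetical names, and the normalization rule (lower-case, non-word characters collapsed to spaces) is an assumption that may be too loose or too strict for some titles.

```python
import re

import pandas as pd


def title_key(title):
    """Lower-case a title and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", str(title).lower()).strip()


def merge_on_normalized_title(papers, matches):
    """Left-join `matches` onto `papers` by normalized title, filling BIBCODE gaps."""
    left = papers.assign(_key=papers["TITLE"].map(title_key))
    right = matches.assign(_key=matches["TITLE"].map(title_key))
    # One row per key so the left join cannot multiply rows of `papers`
    right = right.drop_duplicates(subset="_key")[["_key", "bibcode"]]
    out = left.merge(right, on="_key", how="left")
    # Existing bibcodes win; only empty cells take the newly matched value
    out["BIBCODE"] = out["BIBCODE"].fillna(out["bibcode"])
    return out.drop(columns=["_key", "bibcode"])
```

Called as `merge_on_normalized_title(df, titles_matched)`, it returns a frame with the same columns as `df`. Looser matching raises the chance of false positives, so spot-checking the newly filled rows is advisable.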

Status/Summary:

- We started with 862 papers from the provided AHED spreadsheets
- Refined the list to 797 papers by removing duplicates

- Matched 156 DOIs to existing ADS Bibcodes
- Matched 397 additional papers by refstrings
- Matched 192 additional papers by Title

= Total: ~730 bibcodes matched out of a possible 797
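
The running total can also be recomputed from the merged table rather than tallied by hand. The function below is a small sketch under the assumption that unmatched rows hold `'NA'` (as written by the cleanup step), `None`, or NaN; `match_summary` is a hypothetical helper name.

```python
def match_summary(bibcodes, missing="NA"):
    """Count entries that carry a bibcode; `missing` marks an unmatched paper."""
    total = len(bibcodes)
    # `b == b` is False only for NaN, so NaN cells are treated as unmatched too
    matched = sum(1 for b in bibcodes if b is not None and b == b and b != missing)
    percent = round(100 * matched / total, 1) if total else 0.0
    return {"matched": matched, "total": total, "percent": percent}
```

For the final table this would be called as `match_summary(merged['BIBCODE'].tolist())`.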