# Notebook #2 for ARC/SSAD Project
___

## Task 2: Match ARC/SSAD to ADS Items by Reference Strings
Match papers without DOIs to ADS bibcodes via the Reference Service.

Outline:
- Step 1: Format file of papers into reference strings
- Step 2: Query the Reference API with reference strings, return bibcodes
- Step 3: Match the bibcodes back to the paper list
___

### Step 1: Format file with reference list
Take my new publications list and format each paper without a DOI as a reference string (authors, year, journal) that can be queried against the Reference Service API.

In [None]:
import pandas as pd
import numpy as np

# Open my excel sheet as a data frame
df = pd.read_excel("AHED/dois_matched.xlsx")

# String together the fields into single reference strings (Authors, Year, Journal)
df['REFS'] = df['AUTHORS'].astype(str) + ', ' + df['YEAR'].astype(str) + ', ' + df['JOURNAL'].astype(str)

# Grab only rows where DOI is null
dt = df[df['DOI'].isna()]

# Export my reference strings to text file
dt['REFS'].to_csv("AHED/ref_list.txt", index=False, header=False, sep='\t')

dt
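
To make the string-building step concrete, here is a minimal sketch on a toy data frame. The column names mirror the ones used above, but the rows are made up for illustration and are not from the project data. One thing worth checking on the real sheet: `astype(str)` turns a missing value into the literal text `nan`, and a `YEAR` column that contains blanks is typically read as floats, so years may come out as `2015.0`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the publications sheet (rows are invented for illustration)
toy = pd.DataFrame({
    "AUTHORS": ["Smith, J.", "Lee, K."],
    "YEAR": [2001, 2015],
    "JOURNAL": ["ApJ", "MNRAS"],
    "DOI": ["10.0000/example", np.nan],
})

# Same concatenation as in the cell above: Authors, Year, Journal
toy["REFS"] = (
    toy["AUTHORS"].astype(str) + ", "
    + toy["YEAR"].astype(str) + ", "
    + toy["JOURNAL"].astype(str)
)

# Keep only the rows that still lack a DOI
no_doi = toy[toy["DOI"].isna()]
ref_strings = no_doi["REFS"].tolist()
print(ref_strings)
```

Only the second row survives the filter, since it is the only one without a DOI.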

### Step 2: Connect to Reference Service API

In [None]:
import os
import json
import requests

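# ADS Prod API token, read from the environment instead of being stored in the notebook
# (ADS_API_TOKEN is an assumed variable name; set it before running this cell)
token = os.environ['ADS_API_TOKEN']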
domain = 'https://api.adsabs.harvard.edu/v1/'

## REFERENCE SERVICE ##

# --- Function to read my reference strings file and make a list called 'references'
def read_file(filename):

    references = []
    with open(filename, "r") as f:
        for line in f:
            references.append(line)
    return references

# --- Function to connect to Reference Service API, querying my 'references' list
def resolve(references):
    
    payload = {'reference': references}

    response = requests.post(
        url = domain + 'reference/text',
        headers = {'Authorization': 'Bearer ' + token,
                 'Content-Type': 'application/json',
                 'Accept':'application/json'},
        data = json.dumps(payload)
    )
    
    if response.status_code == 200:
        return json.loads(response.content)['resolved'], 200

    # Any other status: report it and return no results
    print('Reference service returned status_code', response.status_code)
    return None, response.status_code

In [None]:
# Read my reference strings file
references = read_file("/Users/sao/Documents/Python-Projects/AHED/ref_list.txt")
references = [ref.replace('\n','') for ref in references]

In [None]:
# Resolve my references in batches of 16; collect everything in the 'total_results' list
total_results = []

for i in range(0, len(references), 16):
    results, status = resolve(references[i:i+16])
    if results:
        total_results += results
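
The batching loop above can be sketched without touching the network. In this example `chunked` and `fake_resolve` are hypothetical helpers, not part of the notebook or the ADS API; `fake_resolve` just pretends that one batch fails, to show how failed batches could be recorded for a retry instead of being skipped silently.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_resolve(batch):
    """Hypothetical stand-in for resolve(); the batch starting at 'ref 16' fails."""
    if batch[0] == "ref 16":
        return None, 500
    return [{"refstring": ref} for ref in batch], 200


demo_refs = [f"ref {n}" for n in range(40)]
batch_sizes = [len(batch) for batch in chunked(demo_refs, 16)]
print(batch_sizes)

collected, failed = [], []
for batch in chunked(demo_refs, 16):
    results, status = fake_resolve(batch)
    if results:
        collected += results
    else:
        # Remember the status and the batch so it can be re-queried later
        failed.append((status, batch))

print(len(collected), "resolved;", len(failed), "batch(es) to retry")
```

With 40 demo strings the batches have 16, 16, and 8 items, and the simulated failure leaves 24 resolved records plus one batch to retry.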

In [None]:
# Count how many bibcodes were matched (a run of dots means no match was found)
bibcodes = []
for record in total_results:
    if record['bibcode']!='...................':
        bibcodes.append(record['bibcode'])

print('Matched',len(bibcodes),'bibcodes')

### Step 3: Join to original data frame

In [None]:
# Convert my reference results to a data frame and drop null values
ref_results = pd.DataFrame(total_results)
ref_results = ref_results.replace('...................', np.nan)
ref_results = ref_results.dropna(subset=['bibcode'])
ref_results
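
As a small illustration of the placeholder cleanup, the sketch below runs the same replace-then-drop steps on two made-up records. The field names follow the ones the notebook uses (`refstring`, `bibcode`, `score`); the values, including the bibcode, are invented, and `UNRESOLVED` is assumed to be the same run of dots that the cells above filter on.

```python
import numpy as np
import pandas as pd

# Assumed placeholder for an unmatched reference: 19 dots, the length of a bibcode
UNRESOLVED = "." * 19

# Invented records shaped like the results used in this notebook
toy_results = [
    {"refstring": "Lee, K., 2015, MNRAS", "bibcode": "2015MNRAS.fake.001L", "score": "1.0"},
    {"refstring": "Doe, A., 1999, Unknown", "bibcode": UNRESOLVED, "score": "0.0"},
]

frame = pd.DataFrame(toy_results)

# Turn the placeholder into a real null, then drop rows with no bibcode
frame = frame.replace(UNRESOLVED, np.nan).dropna(subset=["bibcode"])

matched = frame["bibcode"].tolist()
print("Matched", len(matched), "bibcodes")
```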

In [None]:
# Merge my new ref service results with my original paper list, join by the refstrings
merged = pd.merge(df, ref_results, how='left', left_on='REFS', right_on='refstring')
merged

# Combine bibcode columns
merged['BIBCODE'] = merged['bibcode_x'].fillna(merged['bibcode_y'])

# Cleanup; drop unneeded columns in one call
merged = merged.drop(columns=['refstring', 'REFS', 'bibcode_x', 'bibcode_y', 'score', 'comment'])

# Optional: keep only rows with a bibcode to count my running total of matches
# merged = merged.dropna(subset=['BIBCODE'])

# Clean up nulls
merged = merged.replace(np.nan,'NA')

# Export merged data to new excel file
merged.to_excel("AHED/refs_matched.xlsx", index=False)

merged
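
The merge-and-combine step can be checked on toy data. This is a sketch under the assumption that the paper list already carries a `bibcode` column from the DOI matching, which is why pandas adds the `_x` and `_y` suffixes after the merge. All rows and bibcodes below are invented.

```python
import numpy as np
import pandas as pd

# Paper list: one row already has a bibcode (from DOI matching), two do not
papers = pd.DataFrame({
    "REFS": ["A, 2001, ApJ", "B, 2015, MNRAS", "C, 1999, AJ"],
    "bibcode": ["2001ApJ...fake.001A", np.nan, np.nan],
})

# Reference-service results: only one of the missing papers was resolved
resolved = pd.DataFrame({
    "refstring": ["B, 2015, MNRAS"],
    "bibcode": ["2015MNRAS.fake.002B"],
})

merged_toy = pd.merge(papers, resolved, how="left",
                      left_on="REFS", right_on="refstring")

# Prefer the existing bibcode, fall back to the newly resolved one
merged_toy["BIBCODE"] = merged_toy["bibcode_x"].fillna(merged_toy["bibcode_y"])
merged_toy = merged_toy.drop(columns=["refstring", "bibcode_x", "bibcode_y"])

combined = merged_toy["BIBCODE"].tolist()
print(merged_toy)
```

A left join keeps every paper, so the row count should equal the original list as long as each reference string appears once on the right-hand side. Repeated reference strings on either side change the row counts, so comparing `len(merged)` with `len(df)` is a quick sanity check.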

Status/Summary:
- Matched 156 DOIs to Bibcodes
- Matched 397 additional bibcodes via ref strings
- My total is 550 after merging (lost 2, probably duplicates)
- Still have about 250 unmatched to go...