## NASA Astrobiology - SciX/ADS Curation Notebook
Updated 02/2025.

This notebook consists of Python scripts that can be run section by section alongside manual maintenance of monthly publication metadata. The goal is to use the master spreadsheet ("NASA_Astrobiology.xlsx") to keep track of new publications and to help curate them for ingest into the SciX/ADS repositories. The 'bibcode' column indicates which records already exist in ADS and which have yet to be included. If an ADS bibcode is matched, we add it to the matching row, and it is then sent to the ADS Library ([NASA Astrobiology Library](https://scixplorer.org/public-libraries/UTViEyO9T7izQP7i_r6yqA)). Anything that is not matched and has been published in a journal (not in preprint, in press, or early view) can then be curated and submitted to ADS for indexing.

<b>1. Data maintenance:</b>

* Manually insert new publications to the spreadsheet with as much metadata as possible.
  
<b>2. Bibcode Matching:</b>

* Run the 'Reference resolver' script, which queries the ADS API (Reference Resolver service) with publication metadata to see if any publications already exist.
   
* After reviewing the reference resolver results ("scix_refs_resolved.xlsx"), insert any bibcode matches into the 'bibcode' column of the master spreadsheet (Note: anything with a score of 1.0 should be an exact match).
   
<b>3. Updating the SciX/ADS Library:</b>
  
* Next, run the code that adds the bibcodes from the 'bibcode' column to the ADS Library.
   
<b>4. Data Curation:</b>

* Finally, curate any unmatched/new records to submit to ADS, either by:
    * Submitting directly to ADS using the [online submission form](https://ui.adsabs.harvard.edu/feedback/correctabstract)
    * Sending a text file of the records in [ADS Tagged Format](https://ui.adsabs.harvard.edu/help/data_faq/tagged-format). If there are a large number of records, run the Data Curation section of this notebook, which takes the metadata from the spreadsheet and generates all the records automatically using the ADS Pyingest Manual Parser service. Email the file, named '{date}_records.tag', to the SciX/ADS Curation team (Jenny Koch).

Please email any questions to Jenny Koch (SciX/ADS Librarian) at jennifer.koch@cfa.harvard.edu. 
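
Step 2 above treats a score of 1.0 as an exact match. The review of the resolver results could be narrowed with a small filter like the sketch below. It assumes each result is a dictionary with 'score' and 'bibcode' keys, and the sample values are made up for illustration; check the keys against the actual "scix_refs_resolved.xlsx" columns before relying on it.

```python
def exact_matches(results, threshold=1.0):
    """Return the resolver results whose score meets the threshold."""
    matches = []
    for item in results:
        try:
            # Scores may arrive as strings, so convert before comparing
            score = float(item.get("score", 0))
        except (TypeError, ValueError):
            # Skip results with a missing or non-numeric score
            continue
        if score >= threshold:
            matches.append(item)
    return matches

# Made-up results for illustration only (key names are assumptions)
sample_results = [
    {"refstring": "ref A", "score": "1.0", "bibcode": "bibcode-A"},
    {"refstring": "ref B", "score": "0.8", "bibcode": "bibcode-B"},
    {"refstring": "ref C", "score": None, "bibcode": ""},
]
print([m["bibcode"] for m in exact_matches(sample_results)])
```

Lower-scoring results are left out rather than deleted, so they can still be reviewed by hand.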

In [None]:
import pandas as pd
import re
import json
import requests
from pyingest.serializers.classic import Tagged
import datetime

# Set up your path to your local directory and the file where your project data is saved
filepath = "" # Insert a local filepath if necessary
filename = "NASA_Astrobiology.xlsx"
api_token = "my_api_token" # Insert your API token here

# Get today's date in the desired format
todays_date = datetime.datetime.now().strftime("%y%m%d")

# Name the .tag file with today's date
tagged_output = f"{todays_date}_records.tag"

## 2. Bibcode matching
Use this section to find new bibcodes for records where one is not yet associated.
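
The cell below sends references to the Reference Service in batches of 16. As a minimal sketch, that batching step could be pulled into a reusable helper. The name `chunked` is hypothetical and not part of the notebook or of any ADS library.

```python
def chunked(items, size=16):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Illustration with 40 placeholder references
placeholder_refs = [{"doi": f"doi-{n}"} for n in range(40)]
batch_sizes = [len(batch) for batch in chunked(placeholder_refs)]
print(batch_sizes)  # [16, 16, 8]
```

Keeping the batch size as a parameter makes it easy to adjust if the service limit differs from what is assumed here.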

In [None]:
# Read the spreadsheet into a Data Frame
input_data = pd.read_excel(filepath + filename)
data = pd.DataFrame(input_data)

# Initialize an empty list for reference strings
ref_list = []

# Iterate through the rows and build a reference for each record without a bibcode
for index, row in data.iterrows():
    authors = row["authors"] if pd.notna(row["authors"]) else ""
    title = row["title"] if pd.notna(row["title"]) else ""
    doi = str(row["doi"]) if pd.notna(row["doi"]) else ""
    date = row["date"] if pd.notna(row["date"]) else ""
    # Take the year from the first four characters; str() also handles Timestamp values
    year = str(date)[:4] if date != "" else ""

    # If 'bibcode' is empty or labeled 'preprint', create a reference from the metadata
    if row["bibcode"] == "preprint" or pd.isna(row["bibcode"]):
        if authors and title and year and doi:
            ref = {
                "refstr": f"{authors}, {title}, {year}, {doi}",
                "authors": authors,
                "title": title,
                "year": year,
                "doi": doi
            }
        elif title and year and doi:
            ref = {
                "refstr": f"{title}, {year}, {doi}",
                "title": title,
                "year": year,
                "doi": doi
            }
        elif doi:
            ref = {
                "refstr": f"{doi}",
                "doi": doi
            }
        else:
            # Without a DOI there is nothing to resolve, so skip this row
            # (previously an empty or stale reference was appended here)
            continue
        ref_string = json.dumps(ref, ensure_ascii=False)
        ref_list.append(ref_string)

# Reference Service API request for a batch of parsed references
# Base URL of the ADS production API
domain = 'https://api.adsabs.harvard.edu/v1/'
def resolve(references):
    payload = {'parsed_reference': references}
    response = requests.post(
        url = domain + 'reference/xml',
        headers = {'Authorization': 'Bearer ' + api_token,
                 'Content-Type': 'application/json',
                 'Accept':'application/json'},
        data = json.dumps(payload))
    if response.status_code == 200:
        return json.loads(response.content)['resolved'], 200
    else:
        print('From reference status_code is', response.status_code)
        return None, response.status_code

# Resolve my references, results in 'total results' list
references = [json.loads(ref) for ref in ref_list]
total_results = []
print('Querying %d references with the Reference Service ...'%len(references))
for i in range(0, len(references), 16):
    results, status = resolve(references[i:i+16])
    if results:
        total_results += results

# Save the results to excel
dt = pd.DataFrame(total_results)
refs_outfile = "scix_refs_resolved.xlsx"
dt.to_excel(refs_outfile, index=False)
print(f"Saved results to {refs_outfile}")
dt

## 3. Update the SciX/ADS Library
Use this section to send bibcodes from the 'bibcode' column to the library.
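
Before sending the list, it may help to drop placeholders and duplicates. The sketch below relies on the fact that ADS bibcodes are 19 characters long and begin with a four-digit year. The function name `clean_bibcodes` is hypothetical, and the sample values are made up rather than real bibcodes.

```python
def clean_bibcodes(values):
    """Return unique, plausibly valid bibcodes in their original order."""
    seen = set()
    cleaned = []
    for value in values:
        code = str(value).strip()
        # ADS bibcodes are 19 characters long and start with a four-digit year
        if len(code) != 19 or not code[:4].isdigit():
            continue
        if code not in seen:
            seen.add(code)
            cleaned.append(code)
    return cleaned

# Made-up 19-character values for illustration only
fake_a = "2020" + "X" * 15
fake_b = "2021" + "Y" * 15
sample = [fake_a, "preprint", "." * 19, fake_b, f" {fake_a} "]
print(clean_bibcodes(sample))
```

This check only tests the shape of each value; it does not confirm that a bibcode actually exists in ADS.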

In [None]:
# Read the spreadsheet into a Data Frame
input_data = pd.read_excel(filepath + filename)
data = pd.DataFrame(input_data)

# -- Update/Add Bibcodes to Library
biblib = "UTViEyO9T7izQP7i_r6yqA" # NASA Astrobiology

# Get bibcodes where bibcodes is not null
filtered_df = input_data.dropna(subset=['bibcode']) # drop null values

# Define conditions to exclude
conditions = ['…................','...................','preprint']

# Filter out unwanted values
filtered_df = filtered_df[~filtered_df['bibcode'].isin(conditions)]

# Extract the 'bibcode' column as a list
biblist = filtered_df['bibcode'].tolist()
      
# My ADS API token, and the base url for the ADS Libraries API
url = "https://api.adsabs.harvard.edu/v1/biblib/documents/" + biblib
data = { 
    "bibcode": biblist,
    "action": "add"
        }
headers = {'Authorization': 'Bearer ' + api_token}
    
# Send the API request
response = requests.post(url=url, data=json.dumps(data), headers=headers)
if response.status_code == 200:
    print(f'Success: Added {len(set(biblist))} bibcodes to Library')
else:
    print(f'From SciX/ADS status_code is {response.status_code}. No bibcodes were added to the library at this time.')

## 4. Data Curation (Pyingest / ADS Tagged formatter)
Use this section to generate a .tag file of records from the spreadsheet (where 'bibcode' is empty/null).
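
The publication string in the cell below is assembled from the journal, volume, issue, and pages columns, and a trailing ".0" from numeric cells is removed with a plain string replacement. As an alternative sketch, the same formatting could be done in a helper that converts whole-number floats individually, so other text containing ".0" is left alone. The names `as_text` and `format_publication` are hypothetical, and unlike the cell below this version includes whichever of volume, issue, and pages are present.

```python
def as_text(value):
    """Render a spreadsheet cell as text, writing whole-number floats without '.0'."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value).strip()

def format_publication(journal, volume="", issue="", pages=""):
    """Build a string such as 'Journal, Volume 12, Issue 3, pp. 45-67'."""
    if not journal:
        return ""
    parts = [as_text(journal)]
    if volume != "":
        parts.append(f"Volume {as_text(volume)}")
    if issue != "":
        parts.append(f"Issue {as_text(issue)}")
    if pages != "":
        page_text = as_text(pages)
        # A hyphen indicates a page range; otherwise treat it as a single page
        label = "pp." if "-" in page_text else "page"
        parts.append(f"{label} {page_text}")
    return ", ".join(parts)

print(format_publication("Astrobiology", 12.0, 3.0, "45-67"))
print(format_publication("Astrobiology", 10.0, "", 101.0))
```

Empty strings stand for missing cells here, matching how the cell below replaces null values with "".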

In [None]:
# Read the spreadsheet into a Data Frame
input_data = pd.read_excel(filepath + filename)
data = pd.DataFrame(input_data)

# Initialize list for curated records
ingest_records = []

for index, row in data.iterrows():

    # Extract the metadata from columns
    bibcode = row["bibcode"] if pd.notna(row["bibcode"]) else ""
    authors = row["authors"] if pd.notna(row["authors"]) else ""
    affs = row["affiliations"] if pd.notna(row["affiliations"]) else ""
    title = row["title"] if pd.notna(row["title"]) else ""
    pubdate = row["date"] if pd.notna(row["date"]) else ""
    journal = row["journal"] if pd.notna(row["journal"]) else ""
    vol = row["volume"] if pd.notna(row["volume"]) else ""
    issue = row["issue"] if pd.notna(row["issue"]) else ""
    pages = row["pages"] if pd.notna(row["pages"]) else ""
    abstract = row["abstract"] if pd.notna(row["abstract"]) else ""
    doi = row["doi"] if pd.notna(row["doi"]) else ""

    if bibcode == "":

        # Only process affiliations if they are not missing
        if pd.notna(affs) and affs.strip() != '':
            if "AA(" in affs:
                affiliations = affs.replace("; ", "_sepchar_ ").replace("<ORCID>", "<ID system=\"ORCID\">").replace("</ORCID>", "</ID>")
            else:
                affiliations = affs
        else:
            affiliations = ''  # Leave empty if no affiliations are present
        
        # Format pages: a hyphen indicates a page range
        p = ""
        if pages and "-" in str(pages):
            p = "pp. " + str(pages)
        elif pages:
            p = "page " + str(pages)

        # Format the publication field with journal, volume, issue, and pages
        pub = ""
        if journal and vol and issue and pages:
            pub = f"{journal}, Volume {str(vol)}, Issue {str(issue)}, {str(p)}"
        elif journal and vol and pages:
            pub = f"{journal}, Volume {str(vol)}, {str(p)}"
        elif journal:
            pub = f"{journal}"

        properties = ""
        if doi:
            properties = f"DOI: {str(doi)}"
            
        r = {
            "bibcode": "",
            "authors": authors.split("; "),
            "affiliations": affiliations.split(": "),
            "pubdate": pubdate,
            "title": title,
            "publication": pub.replace(".0",""),
            "abstract": abstract,
            "properties": properties
        }
        ingest_records.append(r)

# Pyingest Serializer - Transform records into tagged format
outputfp = open(filepath + tagged_output, 'a')
try:
    for record in ingest_records:
        try:
            serializer = Tagged()
            serializer.write(record, outputfp)
        except Exception as e:
            print(f"Serializer failed for record: {record}, Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    outputfp.close()
print(f"Saved {len(ingest_records)} records to {tagged_output}")

# Read the contents of the .tag file
with open(filepath + tagged_output, 'r') as file:
    data = file.read()

# Define the pattern to match %F AA(A[A-Z]( and remove the initial AA(
pattern1 = r'(%F )AA\(A([A-Z]\()'
pattern2 = r'\)\)\n%D '
pattern3 = r'%F AA\(\)\n%D '

# Perform the replacements and restore the ';' separators
data = re.sub(pattern1, r'\1A\2', data)
data = re.sub(pattern2, ')\n%D ', data)
data = re.sub(pattern3, '%D ', data)
data = re.sub('_sepchar_', ';', data)

# Write the modified content back to the file
with open(filepath + tagged_output, 'w') as file:
    file.write(data)
print("Tagged file updated successfully.")