# Library of Congress

In this tutorial we will use the Library of Congress linked data service, id.loc.gov, to look up more bibliographic information for records that have an ISBN or OCLC number. If you already have an LCCN you could use that instead, but this tutorial assumes you have either the ISBN or the OCLC number. While there are millions of records in LC's system, it does not include every ISBN and OCLC number in the world.


Getting started checklist:
1. A CSV or TSV file with metadata. At minimum it needs to contain the ISBN or OCLC number.
2. Python with the Pandas and Requests modules installed, and an internet connection.


Our first step is to define some variables we will be using. Set the variables below based on your setup:


`path_to_tsv` - the path to the TSV/CSV file you want to run it on

`id_column_name` - the name of the column that contains the OCLC or ISBN number

`user_agent` - this value is sent in the headers of each request; it is good practice to identify your client/project

`pause_between_req` - number of seconds to wait between each API call, if you want the script to run slower


We also will load the modules we will be using.

In [None]:
path_to_tsv = "/path/to/the_file.tsv"

id_column_name = 'oclc'
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0

import pandas as pd
import requests
import time


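We will look up each identifier at id.loc.gov and, when a match is found, download the BIBFRAME instance record and extract the information we want, such as the LCCN. We will then follow the link from the instance to its Work record and store the id.loc.gov identifier of a contributor.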
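
The `.bibframe.json` responses used below are lists of JSON-LD nodes, each a dictionary keyed by full property URIs. As a minimal sketch of how one value is pulled out of such a list, the example below runs against a hand-written sample rather than a live response. The sample data and the `find_lccn` helper are hypothetical and only mirror the shape the `lookup()` function expects; unlike `lookup()`, the helper stops at the first match.

```python
LCCN_TYPE = "http://id.loc.gov/ontologies/bibframe/Lccn"
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"

# hypothetical, hand-written sample in the shape lookup() expects;
# the identifiers here are made up for illustration
sample_graph = [
    {
        "@id": "http://id.loc.gov/resources/instances/0000001",
        "@type": ["http://id.loc.gov/ontologies/bibframe/Instance"],
    },
    {
        "@id": "_:b1",
        "@type": [LCCN_TYPE],
        RDF_VALUE: [{"@value": "00000001"}],
    },
]


def find_lccn(graph_data):
    """Return the value of the first Lccn node in the list, or None."""
    for node in graph_data:
        # nodes without an @type key are skipped
        if LCCN_TYPE in node.get("@type", []):
            values = node.get(RDF_VALUE, [])
            if values:
                return values[0].get("@value")
    return None


print(find_lccn(sample_graph))
```

Using `.get()` with a default avoids a `KeyError` when a node lacks `@type` or the value property.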

In [None]:
def lookup(d):


    
    d[id_column_name] = int(d[id_column_name])
        
    url = f"https://id.loc.gov/resources/instances/identifier/{d[id_column_name]}"
    
    # here we use a HEAD request to get just the headers, which will include the LC identifier for the ISBN/OCLC number if found
    r = requests.head(url, headers={'Accept': 'application/json', 'User-Agent': user_agent})
    print(url)
    if 'Location' in r.headers:
        # this means we found a hit; the Location header points to the record
        # it will look something like this: https://id.loc.gov/resources/instances/#######
        # and we can append .bibframe.json (https://id.loc.gov/resources/instances/#######.bibframe.json)
        # to get the instance level data, which is data about the manifestation

        r2 = requests.get(r.headers['Location'] + '.bibframe.json', headers={'Accept': 'application/json', 'User-Agent': user_agent})
        try:
            graph_data = r2.json()
        except ValueError:  # the response body was not valid JSON
            print("Error with json:",r.headers['Location'] + '.bibframe.json')
            return d

        # we can pull out the LCCN of the book for example
        for graph in graph_data:

            if '@type' in graph:
                if 'http://id.loc.gov/ontologies/bibframe/Lccn' in graph['@type']:

                    d['instance_lccn'] = graph['http://www.w3.org/1999/02/22-rdf-syntax-ns#value'][0]['@value']

        # there is a lot of other information at the instance level you could get,
        # but let's jump to the Work level to get more info
        for graph in graph_data:
            # make sure it is the right URI, normalize the http 
            if graph['@id'].replace("https://",'http://') == r.headers['Location'].replace("https://",'http://'):
                if 'http://id.loc.gov/ontologies/bibframe/instanceOf' in graph:
                    work_uri = graph['http://id.loc.gov/ontologies/bibframe/instanceOf'][0]['@id']

                    # and we can do the same and grab the work info
                    
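                    r3 = requests.get(work_uri + '.bibframe.json', headers={'Accept': 'application/json', 'User-Agent': user_agent})
                    try:
                        work_graph = r3.json()
                    except ValueError:
                        # the work record was not valid JSON, so skip the contributor lookup
                        print("Error with json:", work_uri + '.bibframe.json')
                        work_graph = []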
                    # lets say we want to find the contributor 
                    for w in work_graph:
                        if '@type' in w:
                            if 'http://id.loc.gov/ontologies/bibframe/Contribution' in w['@type']:
                                if 'http://id.loc.gov/ontologies/bibframe/agent' in w:
                                    agent_id = w['http://id.loc.gov/ontologies/bibframe/agent'][0]['@id']
                                    
                                    if 'id.loc.gov' in agent_id:
                                        # the URI to the agent, split off the LCCN and store it
                                        agent_lccn = agent_id.split("/")[-1]
                                        d['author_lccn'] = agent_lccn

                    # at this point you could also look for Subject headings and other relationships
                                    
   
    # if we need the script to run slower, we can configure that by setting the pause_between_req variable above
    time.sleep(pause_between_req)

    return d
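
The `lookup()` function converts the identifier with `int()`. That suits OCLC numbers that pandas has read as floats, but it would raise an error for an ISBN-10 ending in `X` and would drop leading zeros from a string. The following is a minimal sketch of an alternative cleanup step, assuming identifiers may arrive as floats, integers, or strings. `normalize_identifier` is a hypothetical helper, not part of the tutorial's code, and whether id.loc.gov matches every resulting form is not verified here.

```python
import math


def normalize_identifier(value):
    """Return a cleaned identifier string, or None if the cell is empty."""
    # pandas represents empty cells as float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # a numeric column with gaps is read as floats, e.g. 12345.0
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    # otherwise keep the text, dropping spaces and hyphens
    text = str(value).strip().replace("-", "").replace(" ", "")
    return text or None


print(normalize_identifier(12345.0))
print(normalize_identifier("0-8044-2957-X"))
```

Returning `None` for empty cells lets the caller skip the request entirely instead of failing on the conversion.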

Our next step is to load the data we are using. You can adjust the `sep` argument to change the delimiter (for example, if you are using a CSV file, change it to ","). Once the data is loaded, we pass each record to the `lookup()` function to add the new data to the record.

The code below breaks the CSV/TSV into multiple chunks and writes the file out after each chunk. This allows recovery from errors, such as an internet timeout, that would otherwise cause you to lose all progress unless the script ran flawlessly. You would likely want to use this approach for larger datasets. Note that the results are written back to `path_to_tsv`, replacing the original file.
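
To make the chunking concrete, here is a small sketch of the row ranges that a `range(0, total, n)` loop produces. `chunk_bounds` is a hypothetical helper written for illustration, and the row count of 250 is an arbitrary example.

```python
def chunk_bounds(total_rows, chunk_size):
    """Return (start, end) row positions for each chunk, end exclusive."""
    return [
        (start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    ]


print(chunk_bounds(250, 100))  # [(0, 100), (100, 200), (200, 250)]
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size. Slicing a DataFrame with `df[i:i+n]` clips at the end in the same way, so no special handling is needed for the final chunk.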



In [None]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# we are going to split the dataframe into chunks so we can save our progress as we go, without saving the entire file after every record
n = 100  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    # if you want to skip chunks that were already processed, uncomment this; the number is the index of the chunk to resume from
    # if idx < 10:
    #     continue

    print("Working on chunk ", idx, 'of', len(list_df))
    list_df[idx] = list_df[idx].apply(lookup, axis=1)

    reformed_df = pd.concat(list_df)
    # index=False keeps the row index from being written out as an extra unnamed column
    reformed_df.to_csv(path_to_tsv, sep='\t', index=False)
