In this tutorial we will use the HathiTrust identifier to download the corresponding MARC record, extract identifiers from it, and reconcile the name of the author.

Getting started checklist:
1. A CSV or TSV file with metadata. At minimum it needs to contain the HathiTrust record id or volume id.
2. Python with the Pandas, Requests and pymarc modules installed, and an internet connection.


Our first step is to define the variables we will be using. Set them below based on your setup:


`path_to_tsv` - the path to the TSV/CSV file you want to run it on

`id_column_name` - the name of the column that contains the HathiTrust volume id or record id

`user_agent` - this value is sent in the headers of each request; it is good practice to identify your client/project

`pause_between_req` - number of seconds to wait between each API call, if you want the script to run slower


We also will load the modules we will be using.

In [2]:
path_to_tsv = "/path/to/the_file.tsv"
id_column_name = 'recordid'
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0

import pandas as pd
import requests
import time
import pymarc
import io


Let's take a look at a sample of the data you would run through this script. Here are a few rows from a dataset that would be used with this approach. In the dataframe below the `docid` field contains the volume id and `recordid` is the HathiTrust record id.

Unnamed: 0,id,docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,...,place,recordid,enumcron,volnum,title,parttitle,shorttitle,instances,juvenileprob,nonficprob
90,91,uc1.$b300982,"Iverson, Andrima","Iverson, Andrima",1911-,1946,1946,s,1946,,...,nyu,480675,,0,"The gifts of love, | $c: a novel by Andrina Iv...",,The gifts of love,2,0.010292,0.0731028284985105
91,92,uc1.b3470626,"Jackson, Charles","Jackson, Charles",1903-1968.,1946,1946,s,1946,,...,nyu,480847,,0,The fall of valor; | a novel.,,The fall of valor; a novel,3,0.002537,0.0815045571435812
92,93,uc1.$b243768,"Isherwood, Christopher","Isherwood, Christopher",1904-1986.,1946,1946,s,1946,,...,ctu,481696,,0,The memorial; portrait of a family,,The memorial; portrait of a family,1,0.350944,0.0812269250070957
93,94,mdp.39015014138898,"Huxley, Aldous","Huxley, Aldous",1894-1963.,1946,1946,s,1946,,...,enk,481928,,0,Limbo | $c: [by] Aldous Huxley.,,Limbo,1,0.084448,0.3556799432362225
94,95,mdp.39015002305061,"Huxley, Aldous","Huxley, Aldous",1894-1963.,1946,1932,r,1946,1932.0,...,nyu,481932,,0,"Brave new world, | $c: a novel by Aldous Huxle...",,Brave new world,3,0.050987,0.2182374183827324


We will download each MARC record from HathiTrust and then extract the information we want. We will then try to reconcile the name of the author against the id.loc.gov data service.
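
As a minimal sketch of the first step, the request URL can be built separately from the download itself. The helper name `build_hathi_url` is hypothetical and not part of the original notebook. It assumes that a purely numeric identifier is a record number and that anything else is a volume id, and it uses the same URL pattern as the function below.

```python
def build_hathi_url(identifier):
    """Return the HathiTrust API URL for a record number or a volume id."""
    identifier = str(identifier).strip()
    # assumption: purely numeric values are record numbers, everything else is a volume id
    field = "recordnumber" if identifier.isdigit() else "htid"
    return f"https://catalog.hathitrust.org/api/volumes/full/{field}/{identifier}.json"


print(build_hathi_url(480675))
print(build_hathi_url("mdp.39015014138898"))
```

Keeping the URL logic in its own function makes it easy to check against a handful of identifiers before running the whole file.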

In [None]:
def add_hathi(d):


    # the API call differs for volume ids and record ids; a purely numeric value is a record number
    # (checking the string form also covers numpy integer types, which are not the built-in int)
    field = "htid"
    if str(d[id_column_name]).isdigit():
        field = 'recordnumber'


    # if there is already a value skip it
    if 'hathi_marc' in d:
        if type(d['hathi_marc']) == str:        
            print('Skip',d[id_column_name])
            return d
        
    url = f"https://catalog.hathitrust.org/api/volumes/full/{field}/{d[id_column_name]}.json"
    r = requests.get(url, headers={'Accept': 'application/json', 'User-Agent': user_agent})
    try:
        data = r.json()
    except ValueError:  # requests raises a ValueError subclass when the body is not valid JSON
        print("JSON decode error with:",d[id_column_name])
        return d

    if 'records' not in data:
        print("No record response found in:",d[id_column_name])
        return d

    for recordid in data['records']:        
        d['hathi_marc'] = data['records'][recordid]['marc-xml']

    if 'items' not in data:
        print("No items response found in:",d[id_column_name])
        return d

    rights_codes = []
    for item in data['items']:
        rights_codes.append(item['rightsCode'])

    rights_codes = list(set(rights_codes))
    code = "|".join(rights_codes)
    
    d['hathi_rights'] = code


    # we can now extract what we want from the marc record
    # the pymarc library expects a file to open, we don't have files we have strings, 
    # so make a file like object and put our string into it so we can parse it
    with io.StringIO() as f:
        f.write(d['hathi_marc'])
        f.seek(0)
        # parse it; it returns a list of records, but we only have one, so take index 0
        record = pymarc.marcxml.parse_xml_to_array(f)[0]

        if '001' in record:                
            if 'ocm' in record['001'].value():
                d['oclc'] = record['001'].value().replace("ocm",'')
            if 'ocn' in record['001'].value():
                d['oclc'] = record['001'].value().replace("ocn",'')
            if 'on' in record['001'].value():
                d['oclc'] = record['001'].value().replace("on",'')                    

        if '035' in record:
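            # use a distinct loop variable so we do not shadow the file object `f` above
            for field_035 in record.get_fields('035'):
                if 'a' in field_035:
                    if 'OCoLC' in field_035['a']:
                        d['oclc'] = field_035['a'].split(")")[1]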



            
        if '020' in record:
            if 'a' in record['020']:
                i = record['020']['a']
                i = i.split("(")[0].strip()
                i = i.split(":")[0].strip()
                d['isbn'] = i

        # we'll grab the first contributor from the 7xx fields if there is no 1xx author
        # there will likely be other 7xx contributors, though we only use the first one, which is often the editor
        field = None
        if '100' in record:
            field = record['100']
        elif '110' in record:
            field = record['110']
        elif '111' in record:
            field = record['111']
        elif '700' in record:
            field = record['700']
        elif '710' in record:
            field = record['710']
        elif '711' in record:
            field = record['711']                                
        else:
            print("No Author found!:", d['hathi_marc'])
            return d

        # assemble the heading in the correct order 
        name = field['a']
        if 'b' in field:
            name = name + ' ' + field['b']
        if 'c' in field:
            name = name + ' ' + field['c']
        if 'q' in field:
            name = name + ' ' + field['q']                  
        if 'd' in field:
            name = name + ' ' + field['d']   
        if 'g' in field:
            name = name + ' ' + field['g']   

        # have seen empty "" 100 fields
        if len(name.strip()) == 0:
            print("No Author found!:", d['hathi_marc'])
            return d

        # remove the optional trailing period on all headings if there
        if name[-1] == '.':
            name = name[:-1]

        d['author_marc'] = name
        

    # once you have the name you could reconcile it against id.loc.gov to get an LCCN identifier for use in Wikidata and VIAF
    # this would likely be a different script, but it is included here to show how it could be done

    if 'author_marc' in d:
        if type(d['author_marc']) != str:        
            print('No author:',d[id_column_name])
            return d
            
        params = {
                    'q' : d['author_marc'],
                    'count': 5
                }
        
        headers={'Accept': 'application/json', 'User-Agent': user_agent}
        url = "https://id.loc.gov/authorities/names/suggest2/"

        r = requests.get(url,params=params,headers=headers)
        try:
            data = r.json()
        except ValueError:  # requests raises a ValueError subclass when the body is not valid JSON
            print("JSON decode error with:",d[id_column_name])
            return d            

        results = data['hits']

        # loop through each result and test the name
        for hit in results:
            if hit['suggestLabel'] == name:
                d['author_lccn'] = hit['uri'].split('/')[-1]
                d['author_authorized_heading'] = hit['aLabel']
                return d
        # check the main variant label 
        for hit in results:
            if hit['vLabel'] == name:
                d['author_lccn'] = hit['uri'].split('/')[-1]
                d['author_authorized_heading'] = hit['aLabel']
                return d

        # if there is only one hit, the name has unclosed life dates and it partially matches, then select it
        if name[-1] == '-':
            if len(results) == 1:
                # refer to the single result explicitly instead of the leftover loop variable `hit`
                only_hit = results[0]
                if name in only_hit['aLabel'] or name in only_hit['vLabel']:
                    d['author_lccn'] = only_hit['uri'].split('/')[-1]
                    d['author_authorized_heading'] = only_hit['aLabel']
                    return d



    # if we need the script to run slower we can configure it by setting the pause_between_req variable above
    time.sleep(pause_between_req)

    return d
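
The matching rules used at the end of `add_hathi()` can also be read as a small standalone function, which is easier to test without calling the API. This is a sketch only: `pick_hit` is a hypothetical name, the `.get()` calls assume that a key may be absent from a hit, and the sample hits are hand-written for illustration rather than real id.loc.gov responses (the `Doe, Jane` record and its identifier are made up).

```python
def pick_hit(name, hits):
    """Return (lccn, authorized_heading) for the first acceptable hit, or None."""
    # 1. exact match on the suggested label, then 2. exact match on the variant label
    for key in ("suggestLabel", "vLabel"):
        for hit in hits:
            if hit.get(key) == name:
                return hit["uri"].split("/")[-1], hit["aLabel"]
    # 3. open life dates such as "1900-": accept a single hit whose label contains the name
    if name.endswith("-") and len(hits) == 1:
        only = hits[0]
        if name in (only.get("aLabel") or "") or name in (only.get("vLabel") or ""):
            return only["uri"].split("/")[-1], only["aLabel"]
    return None


# hand-written sample hits using the keys read by add_hathi()
huxley = [{
    "uri": "http://id.loc.gov/authorities/names/n80057246",
    "aLabel": "Huxley, Aldous, 1894-1963",
    "suggestLabel": "Huxley, Aldous, 1894-1963",
    "vLabel": "",
}]
doe = [{
    "uri": "http://id.loc.gov/authorities/names/n00000000",  # made-up identifier
    "aLabel": "Doe, Jane, 1900-1980",
    "suggestLabel": "Doe, Jane, 1900-1980",
    "vLabel": "",
}]

print(pick_hit("Huxley, Aldous, 1894-1963", huxley))  # exact match
print(pick_hit("Doe, Jane, 1900-", doe))              # open dates, single hit
print(pick_hit("Nobody, Known", huxley))              # no match
```

Returning `None` when nothing matches leaves the decision about unmatched names to the caller, for example to collect them for manual review.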

Our next step is to load the data with Pandas. You can adjust the `sep` argument to change the delimiter (for example, if you are using a CSV file, change it to ","). Once loaded, we pass each row to the `add_hathi()` function to add the new data to the record.



In [None]:
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)
df = df.apply(lambda d: add_hathi(d),axis=1 )  


# we are writing the file out to the same location here; you may want to modify the filename to create a new file, and change the sep argument if using a CSV
df.to_csv(path_to_tsv, sep='\t')
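
One detail worth knowing when a file is written and read back repeatedly: by default `to_csv` also writes the dataframe index, and on the next read that column comes back named `Unnamed: 0`, like the first column in the sample rows. The sketch below uses a tiny in-memory TSV in place of your file and shows that passing `index=False` avoids this. Whether you want to keep the index is a choice for your own dataset.

```python
import io

import pandas as pd

# a tiny in-memory TSV standing in for the real file
tsv = "recordid\ttitle\n480675\tThe gifts of love\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")

with_index = df.to_csv(sep="\t")                   # default: the index is written too
without_index = df.to_csv(sep="\t", index=False)   # only the data columns

# reading the default output back adds an extra unnamed column
reloaded = pd.read_csv(io.StringIO(with_index), sep="\t")
print(list(reloaded.columns))
print(without_index)
```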




The code below does the same thing as the block above, but it breaks the CSV/TSV into multiple chunks and writes the file out after each chunk. This allows for recovery from errors, such as an internet timeout or other problems that would otherwise cause you to lose all progress unless the script runs flawlessly. You would likely want to use this approach for larger datasets.



In [None]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# we are going to split the dataframe into chunks so we can save our progress as we go, without saving the entire file on every record operation
n = 100  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    # if you want to skip the first X chunks uncomment this; the number is the chunk index to resume from
    # if idx < 10:
    #     continue

    print("Working on chunk ", idx, 'of', len(list_df))
    list_df[idx] = list_df[idx].apply(lambda d: add_hathi(d),axis=1 )  

    reformed_df = pd.concat(list_df)
    reformed_df.to_csv(path_to_tsv, sep='\t')
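
To see what the chunking does without making any API calls, here is a small sketch on a seven-row dataframe. `fake_enrich` is a hypothetical stand-in for `add_hathi()` that only marks the row as processed. The chunk size of 3 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"recordid": range(1, 8)})

n = 3  # chunk row size
chunks = [df[i:i + n] for i in range(0, df.shape[0], n)]
print([len(c) for c in chunks])


def fake_enrich(row):
    # hypothetical stand-in for add_hathi(): no network access, just flags the row
    row = row.copy()
    row["seen"] = True
    return row


for idx in range(len(chunks)):
    chunks[idx] = chunks[idx].apply(fake_enrich, axis=1)
    # processed and not yet processed chunks are stitched back together each time,
    # so the combined frame always contains every row and could be saved at this point
    reformed = pd.concat(chunks)

print(len(reformed))
```

Because the combined dataframe is rebuilt after every chunk, an interruption costs at most one chunk of work.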


After running the data through this process you will have a few new fields added to the dataset, the most important being the authorized heading and the LCCN identifier for the author:
 

Unnamed: 0,id,docid,author_authorized_heading,author_lccn,oldauthor,author,author_marc,authordate,inferreddate,latestcomp,...,recordid,enumcron,volnum,hathi_rights,title,parttitle,shorttitle,instances,juvenileprob,nonficprob
90,91,uc1.$b300982,"Iverson, Andrina, 1911-",no97037668,"Iverson, Andrima","Iverson, Andrima","Iverson, Andrina, 1911-",1911-,1946,1946,...,480675,,0,ic,"The gifts of love, | $c: a novel by Andrina Iv...",,The gifts of love,2,0.010292,0.0731028284985105
91,92,uc1.b3470626,"Jackson, Charles, 1903-1968",n86069287,"Jackson, Charles","Jackson, Charles","Jackson, Charles, 1903-1968",1903-1968.,1946,1946,...,480847,,0,ic,The fall of valor; | a novel.,,The fall of valor; a novel,3,0.002537,0.0815045571435812
92,93,uc1.$b243768,"Isherwood, Christopher, 1904-1986",n79073501,"Isherwood, Christopher","Isherwood, Christopher","Isherwood, Christopher, 1904-1986",1904-1986.,1946,1946,...,481696,,0,ic,The memorial; portrait of a family,,The memorial; portrait of a family,1,0.350944,0.0812269250070957
93,94,mdp.39015014138898,"Huxley, Aldous, 1894-1963",n80057246,"Huxley, Aldous","Huxley, Aldous","Huxley, Aldous, 1894-1963",1894-1963.,1946,1946,...,481928,,0,ic,Limbo | $c: [by] Aldous Huxley.,,Limbo,1,0.084448,0.3556799432362225
94,95,mdp.39015002305061,"Huxley, Aldous, 1894-1963",n80057246,"Huxley, Aldous","Huxley, Aldous","Huxley, Aldous, 1894-1963",1894-1963.,1946,1932,...,481932,,0,ic,"Brave new world, | $c: a novel by Aldous Huxle...",,Brave new world,3,0.050987,0.2182374183827324
