In this tutorial we will get biographical information for a group of authors from Wikidata using their LCCN, which was added to the data by previous scripts. You can use other identifiers instead if you have already reconciled them for the authors, such as a Wikidata QID or VIAF ID.

Getting started checklist:
1. A CSV or TSV file with metadata. At minimum it needs to contain the LCCN for each author.
2. Python with the Pandas and Requests modules installed, and an internet connection.


Our first step is to define some variables we will be using. Set the variables below based on your setup:


`path_to_tsv` - the path to the TSV/CSV file you want to run it on

`lccn_column_name` - the name of the column that contains the LCCN

`user_agent` - this is the value put into the headers on each request; it is good practice to identify your client/project

`pause_between_req` - number of seconds to wait between each API call, if you want the script to run slower


We also will load the modules we will be using.

In [1]:
path_to_tsv = "/path/to/the_file.tsv"
# example: path_to_tsv = "/Users/m/Downloads/hathitrust_post45fiction_metadata.tsv"
lccn_column_name = 'author_lccn'
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0

import pandas as pd
import requests
import time
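
Before sending any queries it can help to confirm that the column named in `lccn_column_name` is actually present in the file. The following is a minimal sketch and not part of the original notebook: `has_column` is a hypothetical helper, and the sample data is made up for illustration. It reads only the header row (`nrows=0`), so it stays cheap even for large files.

```python
import io
import pandas as pd

def has_column(path_or_buffer, column_name, sep="\t"):
    """Return True if column_name appears in the file's header row."""
    # nrows=0 parses the header only, so no data rows are loaded
    header = pd.read_csv(path_or_buffer, sep=sep, header=0, nrows=0)
    return column_name in header.columns

# Made-up two-column TSV standing in for the real metadata file
sample = io.StringIO("title\tauthor_lccn\nSome Novel\tn00000000\n")
print(has_column(sample, "author_lccn"))
```

With the real file you would pass `path_to_tsv` and `lccn_column_name` instead of the sample buffer.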

We will read in all the records and, for each one, send a SPARQL query to Wikidata for information about that author and store the result in the record. You can look for a large number of properties, but there is no guarantee that a given property exists for a person (QID), so we wrap each one in an `OPTIONAL { }` block. You can find the list of possible properties for people here: https://prop-explorer.toolforge.org/. Just as a test, let's look for Instagram (P2003) handles for these authors. You could of course look for more substantial properties such as nationality (P27, country of citizenship) or education (P69).

In [2]:
def add_wikidtat_info(d):

    if lccn_column_name not in d:
        print("No lccn_column_name field found!")
        return d

    # missing values are read in as NaN (a float), so skip anything that is not a string
    if not isinstance(d[lccn_column_name], str):
        return d

    sparql = f"""
        SELECT ?item ?itemLabel ?instagram
        WHERE 
        {{
    
        ?item wdt:P244 "{d[lccn_column_name]}".  #P244 is the LCCN number property
        OPTIONAL {{
            ?item wdt:P2003 ?instagram .
        }}
        
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
        }}       
    """
    params = {
        'query' : sparql
    }

    headers = {
        'Accept' : 'application/json',
        'User-Agent': user_agent
    }
    url = "https://query.wikidata.org/sparql"

    r = requests.get(url, params=params, headers=headers)
    # stop with a clear error if the request failed (for example when rate limited)
    r.raise_for_status()

    data = r.json()

    # copy the value over if any result has an Instagram handle
    for result in data['results']['bindings']:
        if 'instagram' in result:
            d['instagram'] = result['instagram']['value']



    # if we need the script to run slower we can configure it by setting the pause_between_req variable above
    time.sleep(pause_between_req)

    return d
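
The function above asks for a single optional property. If you want several, one way to keep the query readable is to generate the `OPTIONAL { }` blocks from a dictionary. This is a sketch under assumptions, not the notebook's own method: `build_author_query` is a hypothetical helper, the variable names in the dictionary are arbitrary, and the LCCN shown is a placeholder.

```python
def build_author_query(lccn, properties):
    """Build a SPARQL query that finds an item by LCCN (P244) and
    requests each property in `properties` as an OPTIONAL match.

    `properties` maps a variable name to a Wikidata property ID,
    for example {"instagram": "P2003"}.
    """
    select_vars = " ".join(f"?{name}" for name in properties)
    optional_blocks = "\n".join(
        f"  OPTIONAL {{ ?item wdt:{pid} ?{name} . }}"
        for name, pid in properties.items()
    )
    return (
        f"SELECT ?item ?itemLabel {select_vars}\n"
        "WHERE {\n"
        f'  ?item wdt:P244 "{lccn}" .\n'
        f"{optional_blocks}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }\n'
        "}"
    )

# Placeholder LCCN; the property IDs are the ones mentioned in the text
query = build_author_query("n00000000", {"instagram": "P2003", "citizenship": "P27"})
print(query)
```

Keep in mind that a property with several values for one person produces several result rows, so the code that reads the bindings has to decide whether to keep the first value, the last, or all of them.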

Our next step is to load the data with Pandas. You can adjust the `sep` argument to change the delimiter (for example, if you are using a CSV file, change it to ","). Once loaded, we pass each record to the `add_wikidtat_info()` function to add the data to the record.



In [3]:
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)
df = df.apply(add_wikidtat_info, axis=1)


# we are writing out the file to the same location here; you may want to modify the filename to create a new file, and change the sep argument if using a CSV
# index=False keeps the row index from being written as an extra column
# df.to_csv(path_to_tsv, sep='\t', index=False)
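
The same author often appears on many rows, so the `apply` call above may send an identical query many times. A possible way to avoid that, sketched here under assumptions, is to wrap the lookup in a small cache keyed on the LCCN. `cached_lookup` and `fake_lookup` are hypothetical names; `fake_lookup` stands in for a real Wikidata request so the example runs offline.

```python
def cached_lookup(lookup_func):
    """Wrap lookup_func so each distinct key is only looked up once."""
    cache = {}

    def wrapper(key):
        if key not in cache:
            cache[key] = lookup_func(key)
        return cache[key]

    wrapper.cache = cache  # exposed so the caller can inspect or save it
    return wrapper

calls = []

def fake_lookup(lccn):
    # Stand-in for a network request; records each call it receives
    calls.append(lccn)
    return {"lccn": lccn}

lookup = cached_lookup(fake_lookup)
for lccn in ["n001", "n002", "n001", "n001"]:
    lookup(lccn)

print(len(calls))  # 2: one call per distinct LCCN
```

To use this idea with the notebook's function you would need to split the network request out of `add_wikidtat_info()` into its own function that takes only the LCCN.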





The code below does the same thing as the block above, but it breaks the CSV/TSV into multiple chunks and writes the file out after each chunk. This allows for recovery from errors, such as an internet timeout, that would otherwise cause you to lose all progress unless the script ran flawlessly. You would likely want to use this approach for larger datasets.



In [None]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# we are going to split the dataframe into chunks so we can save our progress as we go without saving the entire file after every record
n = 100  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    # if you want it to skip X number of chunks uncomment this, the number is the index of the chunk to resume from
    # if idx < 10:
    #     continue

    print("Working on chunk ", idx, 'of', len(list_df))
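    list_df[idx] = df_chunk.apply(add_wikidtat_info, axis=1)

    # write out the whole file after each chunk; index=False keeps the row index
    # out of the file, so re-reading it after a restart does not add an extra column
    reformed_df = pd.concat(list_df)
    reformed_df.to_csv(path_to_tsv, sep='\t', index=False)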
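
Saving after each chunk protects the work already done, but a single timeout still stops the run. One option, sketched here as an assumption rather than as part of the original notebook, is to retry a failed step a few times before giving up. `with_retries` is a hypothetical helper, and `flaky` is a made-up function that simulates a connection that fails twice and then succeeds.

```python
import time

def with_retries(func, attempts=3, wait_seconds=1.0):
    """Call func(); on an exception, wait and try again, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as err:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            print(f"Attempt {attempt} failed ({err}); retrying")
            time.sleep(wait_seconds * attempt)  # wait a little longer each time

# Made-up flaky function: fails twice, then succeeds
state = {"count": 0}

def flaky():
    state["count"] += 1
    if state["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"

result = with_retries(flaky, attempts=3, wait_seconds=0)
print(result)
```

In the chunk loop, the call that processes one chunk could be passed in as `func`, for example as a `lambda` around the `apply` call, so that a temporary network problem only repeats the current chunk.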
