# OCLC Classify

In this tutorial we will use the OCLC Classify service to reconcile bibliographic titles and authors. In this situation we have an identifier for the instance (manifestation) level of the work, such as an OCLC number or ISBN. Our goal is to enrich the dataset with unique persistent identifiers for the names and titles provided. The approach is to start with the Work, which is the highest level of the hierarchy. You can think of a Work as having many instances, or editions, that all belong to it. To get that instance-level information, we first need to reconcile each record to its OCLC Classify Work.

Getting started checklist:
1. A CSV or TSV file with metadata. At minimum it needs to contain an OCLC number or ISBN.
2. An OCLC WSKey for the Classify service. This can be generated if you have a subscription or membership with OCLC; you will need to speak with someone at your institution who has access to your organization's account at http://platform.worldcat.org/wskey/
3. Python with the Pandas, Requests, and BeautifulSoup modules installed, and an internet connection.


Our first step is to define the variables we will be using. Set them below based on your setup:


`path_to_tsv` - the path to the TSV/CSV file you want to run it on

`id_oclc_number` - the name of the column header in the file that contains the OCLC number

`id_isbn_number` - the name of the column header in the file that contains the ISBN number

`user_agent` - the value put into the headers of each request; it is good practice to identify your client/project

`pause_between_req` - number of seconds to wait between each API call, if you want the script to run slower

`WSkey` - the OCLC WSKey, an 80-character alphanumeric code

We also will load the modules we will be using.

In [1]:
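path_to_tsv = "/path/to/the_file.tsv"
id_oclc_number = 'oclc'
id_isbn_number = None
user_agent = 'YOUR PROJECT NAME HERE'
pause_between_req = 0
WSkey = "WSKeyXXXXXXXXXX"

# column names used by add_classify() for author data; these values are assumptions
# based on the example output shown at the end, change them to match your file
id_author_name = 'author'
id_author_authorized_heading = 'author_authorized_heading'
id_author_lccn = 'author_lccn'
id_author_viaf = 'author_viaf'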

import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
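
Before making any API calls, it can help to confirm that the configured identifier column actually exists in the file. The following is a minimal sketch using only the standard library; `find_identifier_column` is a hypothetical helper that is not part of the original script, and the sample data is made up for illustration.

```python
import csv
import io


def find_identifier_column(file_obj, oclc_col=None, isbn_col=None, delimiter="\t"):
    """Return the first configured identifier column found in the header row.

    Hypothetical helper: the OCLC column is checked before the ISBN column,
    the same order add_classify() uses when choosing a parameter.
    """
    header = next(csv.reader(file_obj, delimiter=delimiter), [])
    for col in (oclc_col, isbn_col):
        if col is not None and col in header:
            return col
    raise ValueError(f"No configured identifier column found in header: {header}")


# Made-up sample standing in for the real TSV file
sample = io.StringIO("id\toclc\ttitle\n41\t12345\tAn example title\n")
found_column = find_identifier_column(sample, oclc_col="oclc", isbn_col=None)
print(found_column)
```

With a real file you would pass `open(path_to_tsv, newline="")` instead of the in-memory sample, along with `id_oclc_number` and `id_isbn_number`.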


Let's look at a sample of the data you would run through this script. Here are a few rows from a dataset that would be used with this approach:

Unnamed: 0,id,docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,...,place,recordid,enumcron,volnum,title,parttitle,shorttitle,instances,juvenileprob,nonficprob
40,41,mdp.39015003344424,"Fast, Julius","Fast, Julius",1919-2008.,1946,1946,s,1946,,...,nyu,389033,,0,The bright face of danger.,,The bright face of danger,1,0.030208,0.091144873997209
41,42,mdp.39015031233847,"Fisher, Vardis","Fisher, Vardis",1895-1968.,1946,1946,s,1946,,...,nyu,389084,,0,"Intimations of Eve, | $c: by Vardis Fisher.",,Intimations of Eve,2,0.016686,0.1069321426830509
42,43,uc1.b3687383,"Fast, Howard","Fast, Howard",1914-2003.,1946,1946,s,1946,,...,nyu,389225,,0,"The American, | a Middle Western legend, | $c:...",,"The American, a Middle Western legend",3,0.003327,0.3327389537722788
43,44,mdp.39015003345868,"Faure, Raoul C. (Raoul Cohen)","Faure, Raoul C. (Raoul Cohen)",1909-,1946,1946,s,1946,,...,nyu,389413,,0,"The spear in the sand, | a novel by Raoul C. F...",,"The spear in the sand, a novel by Raoul C. Faure",2,0.04396,0.2356557539136781
44,45,mdp.39015003345694,"Fearing, Kenneth","Fearing, Kenneth",1902-1961.,1946,1946,s,1946,,...,nyu,389521,,0,"The big clock, | $c: by Kenneth Fearing.",,The big clock,2,0.020603,0.2676949271031792
45,46,mdp.39015003341677,"Field, Ben","Field, Ben",1901-,1946,1946,s,1946,,...,nyu,389616,,0,"Piper Tompkins, | $c: by Ben Field.",,Piper Tompkins,1,0.041732,0.1406484842794505
46,47,mdp.39015005095388,"Finney, Charles G. (Charles Grandison)","Finney, Charles G. (Charles Grandison)",1905-1984.,1946,1946,s,1946,,...,nyu,389654,,0,The circus of Dr. Lao. | $c: With drawings by ...,,The circus of Dr. Lao,2,0.553365,0.3165050259244235
47,48,mdp.39015030738366,"Fisher, Anne Benson","Fisher, Anne (Benson) Mrs",1898-,1946,1946,s,1946,,...,cau,389837,,0,"No more a stranger, | $c: by Anne B. Fisher.",,No more a stranger,3,0.096911,0.3995581826347221
48,49,mdp.39015003684068,"Frank, Waldo David","Frank, Waldo David",1889-1967.,1946,1946,s,1946,,...,nyu,390099,,0,"Island in the Atlantic, | a novel | $c: by Wal...",,"Island in the Atlantic, a novel",2,0.029437,0.1368402262314863
49,50,mdp.39015003929620,"Gibbs, Philip","Gibbs, Philip",1877-1962.,1946,1946,s,1946,,...,nyu,391195,,0,"Through the storm, | $c: a novel by Philip Gibbs.",,Through the storm,1,0.005444,0.1875203063918635


Before we start working with the file we need to define two helper functions. The first function (`add_classify`) is what each row of the file is run through to modify its values. The second helper function (`extract_classify`) parses the results returned from the Classify service.


In [2]:
def add_classify(d):

    # You can add some logic here to skip rows that already have some data if you previously ran it
    # if 'oclc_classify' in d:
    #     if type(d['oclc_classify']) == str:        
    #         print('Skip',d[id_author_name])
    #         return d


    # make the call out to the Classify service, identifying our client with the user_agent set above
    headers = {'X-OCLC-API-Key': WSkey, 'User-Agent': user_agent}
    
    if id_oclc_number is not None:
        params = {'oclc': d[id_oclc_number],'summary' : 'false', 'maxRecs':100}
    elif id_isbn_number is not None:
        params = {'isbn': d[id_isbn_number],'summary' : 'false', 'maxRecs':100}
    else:
        print("No identifier defined!")
        return d
    r = requests.get('https://metadata.api.oclc.org/classify/', params=params,headers=headers)

    work_parsed = None
    work_unparsed = None

    # different response codes mean different things:
    # 0:	Success. Single-work summary response provided.
    # 2:	Success. Single-work detail response provided.
    # 4:	Success. Multi-work response provided.
    # 100:	No input. The method requires an input argument.
    # 101:	Invalid input. The standard number argument is invalid.
    # 102:	Not found. No data found for the input argument.
    # 200:	Unexpected error.

    if r.text.find('<response code="2"/>') > -1:
        work_parsed = extract_classify(r.text)
        work_unparsed = r.text

    elif r.text.find('<response code="4"/>') > -1:
        
        # this response returned multiple possibilities, so look through them and select the one with the most holdings
        soup = BeautifulSoup(r.text, 'html.parser')
        work_soup = soup.find("works")
        largest_count = -1
        largest_work = None
        if work_soup is not None:
            for work in work_soup.find_all("work"):
                if int(work['holdings']) > largest_count:
                    largest_count = int(work['holdings'])
                    largest_work = work

        # once we have the one we want to use, make the request again with its ID
        if largest_work is not None:
            params = {'owi': largest_work['owi'], 'summary' : 'false', 'maxRecs':100}
            r = requests.get('https://metadata.api.oclc.org/classify/', params=params,headers=headers)

            work_parsed = extract_classify(r.text)
            work_unparsed = r.text
        else:
            print(params, 'Multi-work response contained no works.')

    elif r.text.find('<response code="100"/>') > -1 or r.text.find('<response code=\\"100\\"/>') >-1:
        print(params,'100: No input. The method requires an input argument.')
    elif r.text.find('<response code="101"/>') > -1 or r.text.find('<response code=\\"101\\"/>') >-1:
        print(params,'101: Invalid input. The standard number argument is invalid.')
    elif r.text.find('<response code="102"/>') > -1 or r.text.find('<response code=\\"102\\"/>') >-1:
        print(params,'102: Not found. No data found for the input argument.')
    elif r.text.find('<response code="200"/>') > -1 or r.text.find('<response code=\\"200\\"/>') >-1:
        print(params,'200: Unexpected error.')
    else:
        print("unknown Problem:",r.text)


    if work_parsed is not None:
        # we are going to store the raw XML from the response in the file as well as the extracted information
        d['oclc_classify'] = work_unparsed

        # add in the author info if it is missing; d.get() returns None when the column is not in the file yet
        main_author = work_parsed['work_author']
        if pd.isnull(d.get(id_author_viaf)) and main_author is not None:
            if main_author['viaf'] is not None:
                d[id_author_authorized_heading] = main_author['name']
                d[id_author_viaf] = main_author['viaf']
        
        if pd.isnull(d.get(id_author_lccn)) and main_author is not None:
            if main_author['lccn'] is not None:
                d[id_author_lccn] = main_author['lccn']
                d[id_author_authorized_heading] = main_author['name']

        d['oclc_eholdings'] = work_parsed['work_eholdings']
        d['oclc_holdings'] = work_parsed['work_holdings']
        d['oclc_owi'] = work_parsed['work_owi']


    # if we need the script to run slower we can configure it by setting the pause_between_req variable above
    time.sleep(pause_between_req)

    return d
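
The response handling above matches on literal strings such as `<response code="2"/>`. As an alternative, the code can be read from the parsed XML. The sketch below uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, and also shows the "pick the work with the most holdings" step in isolation. The helper names are hypothetical, and the XML is a hand-written, simplified sample, not real Classify output.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix, e.g. '{uri}work' -> 'work'."""
    return tag.rsplit("}", 1)[-1]


def response_code(root):
    """Return the integer code of the <response> element, or None if absent."""
    for el in root.iter():
        if local_name(el.tag) == "response":
            return int(el.get("code"))
    return None


def work_with_most_holdings(root):
    """Return the <work> element with the highest holdings count, or None."""
    works = [el for el in root.iter() if local_name(el.tag) == "work"]
    return max(works, key=lambda w: int(w.get("holdings", 0)), default=None)


# Hand-written, simplified multi-work response (an assumption for illustration)
sample_xml = """
<classify>
  <response code="4"/>
  <works>
    <work owi="1111" holdings="12" title="Example title"/>
    <work owi="2222" holdings="340" title="Example title"/>
    <work owi="3333" holdings="7" title="Example title"/>
  </works>
</classify>
"""

root = ET.fromstring(sample_xml)
code = response_code(root)
best = work_with_most_holdings(root)
print(code, best.get("owi"))
```

Reading the code as an integer avoids having to match several escaped variants of the same string, at the cost of failing outright when the response is not well-formed XML.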

This is the second helper function. The Classify service returns an XML blob with data that can be parsed, and we use the BeautifulSoup module to help read and parse that data.
The function returns a dictionary with the following keys:

`work_statement_responsibility` - the string statement of responsibility

`work_editions` - the total number of editions this work has

`work_eholdings` - the total electronic holdings this work has

`work_format` - what format the work is, likely "Book"

`work_holdings` - how many instances of this work exist across all the OCLC member libraries

`work_itemtype` - likely "itemtype-book"

`work_owi` - a Work identifier from OCLC, mostly only used inside this Classify service

`work_title` - title of the work

`authors` - a list of dictionaries that contain `name` `lccn` `viaf` for each contributor: their authorized heading and identifiers

`work_author` - the "main" contributor, such as the author as opposed to an illustrator or other contributor

`normalized_ddc` - the most common Dewey Decimal value for this work

`normalized_lcc` - the most common Library of Congress Classification number for this work

`editions` - a list of dictionaries for all the editions (instances) that represent this work in OCLC institutions, each dict contains: `author` `eholdings` `format` `holdings` `itemtype` `language` `oclc` `title`

You can view the Classify documentation for more information on the fields returned: https://classify.oclc.org/classify2/api_docs/classify.html


In [3]:
def extract_classify(xml):

		# we are parsing the XML with BeautifulSoup to make it easier to query
		# naming the parser explicitly avoids the "no parser was explicitly specified" warning
		soup = BeautifulSoup(str(xml), 'html.parser')

		# search for the <work> element
		work_soup = soup.find("work")

		if work_soup is None:
			# print("can not parse xml:")
			# print(xml)
			return None

		results = {}

		# each one of these are attributes on the <work> element, if they are present add them otherwise set it to None/null
		results['work_statement_responsibility'] = None if work_soup.has_attr('author') == False else work_soup['author']
		results['work_editions'] = None if work_soup.has_attr('editions') == False else int(work_soup['editions'])
		results['work_eholdings'] = None if work_soup.has_attr('eholdings') == False else int(work_soup['eholdings'])
		results['work_format'] = None if work_soup.has_attr('format') == False else work_soup['format']
		results['work_holdings'] = None if work_soup.has_attr('holdings') == False else int(work_soup['holdings'])
		results['work_itemtype'] = None if work_soup.has_attr('itemtype') == False else work_soup['itemtype']
		results['work_owi'] = None if work_soup.has_attr('owi') == False else work_soup['owi']
		results['work_title'] = None if work_soup.has_attr('title') == False else work_soup['title']
		results['main_oclc'] = work_soup.text

		# the authors are nested in their own elements, so find all of them
		authors_soup = soup.find_all("author")
		results['authors'] = []
		# find the attributes for each author and save these pieces of info for each one
		for a in authors_soup:
			results['authors'].append({
					"name" : a.text.split('[')[0].strip(),
					"lccn" : None if a.has_attr('lc') == False else a['lc'],
					"viaf" : None if a.has_attr('viaf') == False else a['viaf']
				})
		# clean up them if they were "null" strings in the data
		for a in results['authors']:
			if a['lccn'] == 'null':
				a['lccn'] = None	
			if a['viaf'] == 'null':
				a['viaf'] = None	

		# try to find the main contributor
		# the main contributor is usually the first name in work_statement_responsibility, but that is just a string,
		# so try to match that name against the <author> elements to get their other information and mark them as
		# the "main" contributor
		results['work_author'] = None
		if results['work_statement_responsibility'] is not None:
			first_author = results['work_statement_responsibility'].split("|")[0].strip()
			for a in results['authors']:
				if a['name'].strip() == first_author:
					results['work_author'] = a
					break


		# no classifications by default and populate if they exist later
		results["normalized_ddc"] = None
		results["normalized_lcc"] = None

		# both Dewey and Library of Congress Classification are added, these are the most common 
		# out of all of the holdings in OCLC
		ddc_soup = soup.find("ddc")
		if ddc_soup != None:
			ddc_soup = soup.find("ddc").find("mostpopular")
			if ddc_soup != None:
				if ddc_soup.has_attr('nsfa'):
					results["normalized_ddc"] = ddc_soup['nsfa']

		lcc_soup = soup.find("lcc")
		if lcc_soup != None:
			lcc_soup = soup.find("lcc").find("mostpopular")
			if lcc_soup != None:
				if lcc_soup.has_attr('nsfa'):
					results["normalized_lcc"] = lcc_soup['nsfa']

		# headings are subject headings, and are in FAST headings
		results["headings"] = []
		heading_soup = soup.find_all("heading")
		for h in heading_soup:
			results["headings"].append({
					"id" : h['ident'],
					"src": h['src'],
					"value" : h.text
				})
			
		# number of editions 
		edition_soup = soup.find_all("edition")
		# print(isbn,len(edition_soup))
		results["editions"] = []
		# for each "edition" pull out the info for it and store it
		for e in edition_soup:
			edition = {}
			edition['author'] = None if e.has_attr('author') == False else e['author']
			edition['eholdings'] = None if e.has_attr('eholdings') == False else int(e['eholdings'])
			edition['format'] = None if e.has_attr('format') == False else e['format']
			edition['holdings'] = None if e.has_attr('holdings') == False else int(e['holdings'])
			edition['itemtype'] = None if e.has_attr('itemtype') == False else e['itemtype']
			edition['language'] = None if e.has_attr('language') == False else e['language']
			edition['oclc'] = None if e.has_attr('oclc') == False else e['oclc']
			edition['title'] = None if e.has_attr('title') == False else e['title']
			results["editions"].append(edition)
			
		# the first one is always the one with the largest holdings so mark that one as largest
		if len(results["editions"]) > 0:
			results["largest_holding_oclc"] = results["editions"][0]['oclc']

		return results
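
To make the author handling in `extract_classify` easier to follow, here is a small self-contained sketch of just that step: cleaning the name, treating the literal string "null" as missing, and matching the first name in the pipe-separated statement of responsibility. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, `parse_authors` is a hypothetical name, and the XML is a hand-written, simplified sample rather than a real response.

```python
import xml.etree.ElementTree as ET


def parse_authors(work_xml):
    """Return (authors, main_author) from a simplified Classify-style snippet."""
    root = ET.fromstring(work_xml)
    work = root.find("work")

    authors = []
    for a in root.iter("author"):
        lccn = a.get("lc")
        viaf = a.get("viaf")
        authors.append({
            # drop the trailing role, e.g. "Name [Author]" -> "Name"
            "name": (a.text or "").split("[")[0].strip(),
            # identifiers may arrive as the literal string "null"
            "lccn": None if lccn in (None, "null") else lccn,
            "viaf": None if viaf in (None, "null") else viaf,
        })

    # the first name in the pipe-separated statement is treated as the main contributor
    first_name = work.get("author", "").split("|")[0].strip()
    main_author = next((a for a in authors if a["name"] == first_name), None)
    return authors, main_author


# Hand-written sample; the second contributor is invented for illustration
sample_xml = """
<classify>
  <work author="Fast, Julius, 1919-2008 | Example, Illustrator" title="The bright face of danger"/>
  <authors>
    <author lc="n79027137" viaf="null">Fast, Julius, 1919-2008 [Author]</author>
    <author lc="null" viaf="null">Example, Illustrator [Illustrator]</author>
  </authors>
</classify>
"""

authors, main_author = parse_authors(sample_xml)
print(main_author)
```

Because the match is an exact string comparison, a main contributor whose name is formatted differently in the two places will not be found, and `main_author` stays `None`.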

Our next step is to load the data we are using. You can adjust the `sep` argument to change the delimiter (for example, if you are using a CSV file, change it to ","). Once loaded, we pass each record to the `add_classify()` function to add the data to the record.



In [4]:
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)
df = df.apply(add_classify, axis=1)

# we are writing out the file to the same location here, you may want to modify the filename to create a new file, and change the sep argument if using a CSV
# index=False keeps pandas from adding an extra "Unnamed: 0" index column each time the file is saved and reloaded
df.to_csv(path_to_tsv, sep='\t', index=False)
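
If you would rather not overwrite the input file, one option is to derive a new filename from the original path. This is a minimal sketch with the standard library's `pathlib`; the `enriched_path` helper and the `_classified` suffix are hypothetical choices, not something the original script defines.

```python
from pathlib import Path


def enriched_path(input_path, suffix="_classified"):
    """Build an output path next to the input, e.g. data.tsv -> data_classified.tsv."""
    p = Path(input_path)
    return p.with_name(p.stem + suffix + p.suffix)


output_path = enriched_path("/path/to/the_file.tsv")
print(output_path.name)
```

The result can be passed to `df.to_csv()` in place of `path_to_tsv`, which leaves the source file untouched.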





The code below does the same thing as the block above, but it breaks the CSV/TSV into multiple chunks and writes the file out after each chunk. This allows recovery from errors, such as an internet timeout, that would otherwise cause you to lose all progress unless the script ran flawlessly. You will likely want to use this approach for larger datasets. Also uncomment the starting check in `add_classify`:
```
def add_classify(d):

    # You can add some logic here to skip rows that already have some data if you previously ran it
    # if 'oclc_classify' in d:
    #     if type(d['oclc_classify']) == str:        
    #         print('Skip',d[id_author_name])
    #         return d

```

This allows it to check whether it needs to skip the record.
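
The skip check can also be written as a small standalone function. The sketch below is an illustration, not the original code: `already_processed` is a hypothetical name, and unlike the commented-out check it also treats an empty string as "not processed". Plain dictionaries stand in for rows here; a pandas row also supports `.get()`.

```python
def already_processed(row, column="oclc_classify"):
    """Return True when the row already holds a stored Classify response."""
    value = row.get(column)
    # a missing column gives None, and an empty cell is typically NaN (a float),
    # so only a non-empty string counts as processed
    return isinstance(value, str) and value.strip() != ""


rows = [
    {"id": 41, "oclc_classify": "<classify>...</classify>"},
    {"id": 42, "oclc_classify": float("nan")},
    {"id": 43},
]
to_process = [r["id"] for r in rows if not already_processed(r)]
print(to_process)
```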

In [None]:
# load the tsv
df = pd.read_csv(path_to_tsv, sep='\t', header=0, low_memory=False)

# we are going to split the dataframe into chunks so we can save our progress as we go, without saving the entire file after every record
n = 100  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

# loop through each chunk
for idx, df_chunk in enumerate(list_df):

    # if you want it to skip X number of chunks uncomment this, the number is the chunk index to skip to
    # if idx < 10:
    #     continue

    print("Working on chunk ", idx, 'of', len(list_df))
    list_df[idx] = list_df[idx].apply(lambda d: add_classify(d),axis=1 )  

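    # rebuild the full dataframe and save it after every chunk; index=False avoids adding an extra index column on each save
    reformed_df = pd.concat(list_df)
    reformed_df.to_csv(path_to_tsv, sep='\t', index=False)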
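
Chunked saving limits how much work is lost when a request fails, but the failure still stops the run. One complementary option is to retry a failed call a few times before giving up. The following is a minimal sketch under that assumption; `with_retries` is a hypothetical helper that is not part of the original script, and the flaky function only simulates a timeout.

```python
import time


def with_retries(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func(), retrying on the given exceptions up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts, let the error surface
            time.sleep(delay)


calls = {"count": 0}


def flaky_request():
    """Simulated request that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


result = with_retries(flaky_request, attempts=3, delay=0, exceptions=(ConnectionError,))
print(result, calls["count"])
```

In `add_classify` this could wrap the `requests.get(...)` call, for example by passing a `lambda` and `exceptions=(requests.exceptions.RequestException,)`.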


After running the data through this process you will have a few new fields added to the dataset, the most important being the authorized heading, LCCN, VIAF, and some Work-level information. Now that we have some base identifiers in the dataset, we can use the other [scripts found here](https://github.com/Post45-Data-Collective/data-utilities/tree/main/enrichment) that use this same approach to add even more identifiers for the author and Work/Instance. Here is an example of the output from running this dataset through all of these scripts:


Unnamed: 0,id,docid,author_authorized_heading,author_lccn,author_wikidata_qid,oldauthor,author,author_marc,authordate,inferreddate,...,recordid,enumcron,volnum,hathi_rights,title,parttitle,shorttitle,instances,juvenileprob,nonficprob
40,41,mdp.39015003344424,"Fast, Julius, 1919-2008",n79027137,Q1813909,"Fast, Julius","Fast, Julius","Fast, Julius, 1919-2008",1919-2008.,1946,...,389033,,0,ic,The bright face of danger.,,The bright face of danger,1,0.030208,0.091144873997209
41,42,mdp.39015031233847,"Fisher, Vardis, 1895-1968",n79045232,Q2510607,"Fisher, Vardis","Fisher, Vardis","Fisher, Vardis, 1895-1968",1895-1968.,1946,...,389084,,0,ic,"Intimations of Eve, | $c: by Vardis Fisher.",,Intimations of Eve,2,0.016686,0.1069321426830509
42,43,uc1.b3687383,"Fast, Howard, 1914-2003",n79043401,Q380202,"Fast, Howard","Fast, Howard","Fast, Howard, 1914-2003",1914-2003.,1946,...,389225,,0,und,"The American, | a Middle Western legend, | $c:...",,"The American, a Middle Western legend",3,0.003327,0.3327389537722788
43,44,mdp.39015003345868,"Faure, Raoul C. (Raoul Cohen), 1909-1987",nr96025904,,"Faure, Raoul C. (Raoul Cohen)","Faure, Raoul C. (Raoul Cohen)","Faure, Raoul C. (Raoul Cohen), 1909-",1909-,1946,...,389413,,0,ic,"The spear in the sand, | a novel by Raoul C. F...",,"The spear in the sand, a novel by Raoul C. Faure",2,0.04396,0.2356557539136781
44,45,mdp.39015003345694,"Fearing, Kenneth, 1902-1961",n50001146,Q6390093,"Fearing, Kenneth","Fearing, Kenneth","Fearing, Kenneth, 1902-1961",1902-1961.,1946,...,389521,,0,ic,"The big clock, | $c: by Kenneth Fearing.",,The big clock,2,0.020603,0.2676949271031792
45,46,mdp.39015003341677,"Field, Ben, 1901-",no2003103718,Q27839130,"Field, Ben","Field, Ben","Field, Ben, 1901-",1901-,1946,...,389616,,0,pd,"Piper Tompkins, | $c: by Ben Field.",,Piper Tompkins,1,0.041732,0.1406484842794505
46,47,mdp.39015005095388,"Finney, Charles G. (Charles Grandison), 1905-1984",n83017287,Q4894197,"Finney, Charles G. (Charles Grandison)","Finney, Charles G. (Charles Grandison)","Finney, Charles G. (Charles Grandison), 1905-1984",1905-1984.,1946,...,389654,,0,ic,The circus of Dr. Lao. | $c: With drawings by ...,,The circus of Dr. Lao,2,0.553365,0.3165050259244235
47,48,mdp.39015030738366,"Fisher, Anne B. (Anne Benson), 1898-1967",n89125974,Q4768143,"Fisher, Anne Benson","Fisher, Anne (Benson) Mrs","Fisher, Anne B. (Anne Benson), 1898-1967",1898-,1946,...,389837,,0,ic,"No more a stranger, | $c: by Anne B. Fisher.",,No more a stranger,3,0.096911,0.3995581826347221
48,49,mdp.39015003684068,"Frank, Waldo David, 1889-1967",n50025719,Q1858007,"Frank, Waldo David","Frank, Waldo David","Frank, Waldo David, 1889-1967",1889-1967.,1946,...,390099,,0,ic,"Island in the Atlantic, | a novel | $c: by Wal...",,"Island in the Atlantic, a novel",2,0.029437,0.1368402262314863
49,50,mdp.39015003929620,"Gibbs, Philip, 1877-1962",n85232177,Q5758296,"Gibbs, Philip","Gibbs, Philip","Gibbs, Philip, 1877-1962",1877-1962.,1946,...,391195,,0,ic,"Through the storm, | $c: a novel by Philip Gibbs.",,Through the storm,1,0.005444,0.1875203063918635
