# OhioLINK Data Extraction

## Attributions

This work contains data from [OhioLINK Circulation Data](https://www.oclc.org/research/areas/systemwide-library/ohiolink/circulation.html), which is made available by OCLC Online Computer Library Center, Inc. and OhioLINK under the [ODC Attribution License](https://www.oclc.org/research/areas/systemwide-library/ohiolink/odcby.html).

The 3 files used are:
- OhioLINK_1.marc
- OhioLINK_2.marc
- OhioLINKCirc.fil

They should be stored in the folder Library Bias Analysis/Data/OhioLINK Data.

## Imports

- [*pymarc*](https://pypi.org/project/pymarc/) handles the bibliographic data stored in .marc files 

In [1]:
import pickle
import random
from pymarc import MARCReader

## MARC Bibliographic Data

Extract bibliographic information from MARC records. Bibliographic data is stored in .marc files that follow the MARC 21 standard for bibliographic data. The files are parsed to extract the Library of Congress Classification (LCC) and Dewey Decimal Classification (DDC) numbers for each item with a unique OCLC number. A unique OCLC number corresponds to a unique library resource (e.g. book, encyclopedia, article).

#### Summary of relevant tags 
- '050': Library of Congress Classification Number 
- '090': Locally assigned LCC number (not supposed to be used anymore but still exists)
- '082': Dewey Decimal Classification Number
- '092': Locally assigned DDC number
- '001': OCLC number (for this dataset)
- '100': The main author or the person chiefly responsible for the work 

#### Missing LCC and DDC Data
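- Some records do not contain a field with the tag '050' and/or the tag '082'.
- Some of these records instead contain a field with the tag '090' or '092', holding a locally assigned LCC or DDC number respectively. The local number is used when the standard field is missing; if neither is present, the value is stored as `None`.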

    
#### Items with more than one LCC or DDC number

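Some MARC records can have multiple LCC or DDC fields (multiple entries tagged with '050' or '082'). For simplicity, only the first is considered. `getMarcData` keeps every `$a` value in a list, so this corresponds to the first entry of that list.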
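
As a minimal sketch of the "first number only" rule combined with the fallback to locally assigned numbers, the hypothetical helper below works on plain lists of `$a` values. The name `first_class_number` and the sample values are assumptions made for illustration; they are not part of the original notebook.

```python
def first_class_number(standard, local):
    """Return the first classification number, preferring the standard field.

    `standard` holds the $a values from '050' (or '082');
    `local` holds the $a values from '090' (or '092').
    Returns None when both lists are empty.
    """
    for candidates in (standard, local):
        if candidates:
            return candidates[0].strip()
    return None


# Hypothetical values, for illustration only.
print(first_class_number(['PZ5', 'PZ7'], ['PZ5.1']))  # PZ5
print(first_class_number([], ['QV 4']))               # QV 4
print(first_class_number([], []))                     # None
```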

### Loading Bibliographic Data

In [2]:
def getMarcData(filepath, bibList):
    i = 0
    example = None  # first record in the file, returned for inspection
    with open(filepath, 'rb') as f:
        reader = MARCReader(f, to_unicode=True, force_utf8=True, utf8_handling='ignore')
        for record in reader:
            if i == 0:
                example = record 
            # Library of Congress Call Number 
            lcc = [subfield for field in record.get_fields('050') for subfield in field.get_subfields('a')]
            if lcc == []:
                lcc = [subfield for field in record.get_fields('090') for subfield in field.get_subfields('a')]   
            # Dewey Decimal Classification Number
            ddc = [subfield for field in record.get_fields('082') for subfield in field.get_subfields('a')] 
            if ddc == []:
                ddc = [subfield for field in record.get_fields('092') for subfield in field.get_subfields('a')] 

            if lcc == []: lcc = None
            if ddc == []: ddc = None
            
            main_auth = [subfield for field in record.get_fields('100') for subfield in field.get_subfields('a')]
            oclc = int(record['001'].value()[3:])
            bibList.append({'title': record.title, 
                'auth': main_auth, 
                'lcc': lcc, 
                'ddc': ddc, 
                'pub': record.pubyear,
                'oclc': oclc})
            i += 1
            # Print progress every 500,000 records.
            if i % 500000 == 0:
                print(f'{i} files read.')

    return (bibList, example)
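
The `[3:]` slice in `getMarcData` assumes that every `001` value starts with a three-character prefix such as `ocm`, as in the example record. OCLC control numbers can also carry prefixes of other lengths (for example `on`), in which case the slice would drop a digit. A slightly more defensive sketch, assuming the value is an optional alphabetic prefix followed by digits, is shown below. `parse_oclc_number` is a hypothetical helper and not part of the original notebook.

```python
import re

# Optional letters, then digits, with optional surrounding whitespace.
_OCLC_PATTERN = re.compile(r'^\s*[A-Za-z]*(\d+)\s*$')


def parse_oclc_number(control_number):
    """Return the numeric part of an OCLC control number, or None if it does not match."""
    match = _OCLC_PATTERN.match(control_number)
    if match is None:
        return None
    return int(match.group(1))


print(parse_oclc_number('ocm00000001 '))  # 1
print(parse_oclc_number('on1234567890'))  # 1234567890
print(parse_oclc_number('no digits'))     # None
```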

In [3]:
filePath1 = 'OhioLINK Data\\OhioLINK_1.marc'
filePath2 = 'OhioLINK Data\\OhioLINK_2.marc'

bibList = []
print('Reading MARC records from file 1')
bibList, example = getMarcData(filePath1, bibList)
print('\nReading MARC records from file 2')
bibList, _ = getMarcData(filePath2, bibList)

Reading MARC records from file 1
500000 files read.
1000000 files read.
1500000 files read.
2000000 files read.
2500000 files read.
3000000 files read.
3500000 files read.

Reading MARC records from file 2
500000 files read.
1000000 files read.
1500000 files read.
2000000 files read.
2500000 files read.


### Example of bibliographic data

- The first is an example of an entry in a .marc file.
- The second is a dictionary containing the relevant information extracted from a MARC record
    - keys: 'title', 'auth', 'lcc', 'ddc', 'pub', 'oclc'
- The third is an additional, randomly chosen example of such a dictionary

In [4]:
print(example)
print(bibList[1], "\n")
print(random.choice(bibList))

=LDR  01075cam  2200289 a 4500
=001  ocm00000001\
=003  OCoLC
=005  20061229000001.0
=008  690526s1963\\\\ilua\\\j\\\\\\000\1\eng\\
=010  \\$a   63011276 
=040  \\$aDLC$cDLC$dIUL$dOCL$dOCLCQ$dTML$dOCL$dOCLCQ$dBTCTA
=019  \\$a6567842$a9987701$a53095235
=042  \\$alcac
=050  00$aPZ5$b.R1924 
=082  00$a[Fic]
=096  \\$aQV 4 An78 v.46 2006
=245  04$aThe Rand McNally book of favorite pastimes /$cillustrated by Dorothy Grider.
=246  30$aFavorite pastimes
=260  \\$aChicago :$bRand McNally,$c[1963]
=300  \\$a110 p. :$bcol. ill. ;$c33 cm.
=520  \\$aBoys and girls in these four stories work hard to master ballet dancing, riding, baton twirling, and swimming.
=505  0\$aLittle ballerina / by D. Grider -- Little horseman / by M. Watts -- Little majorette / by D. Grider -- Little swimmers / by V. Hunter.
=650  \1$aShort stories.
=700  1\$aGrider, Dorothy.
=700  1\$aHunter, Virginia.
=938  \\$aBaker and Taylor$bBTCP$n63011276
=994  \\$a11$bOCL$i00466

{'title': 'Mudlumps at the mouth of South Pass, Mis

## OhioLINK Circulation Data

Extract circulation statistics from the OhioLINK circulation data. This dataset contains usage data on books in the Ohio academic libraries from 2007-2008. Four key pieces of information were extracted for each unique OCLC number:
- The number of books with this OCLC number (i.e. how many copies of a book there are across the Ohio Academic libraries)
- The number of books with this OCLC number that were in circulation (available for borrowing)
- The total number of times books with this OCLC number were borrowed (total circulation)
- The number of times books with this OCLC number were borrowed in 2007 (annual circulation)
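
As a small illustration of these four aggregates, the sketch below sums them over a few synthetic rows. The column positions (1, 9, 10, 11) are the ones used by `gatherCircStats` in this notebook; the helper `make_row`, the sample values, and the reading of column 9 as a 0/1 value per copy are assumptions made for illustration.

```python
from collections import defaultdict

# Column positions as used in gatherCircStats.
OCLC, IN_CIRC, TOTAL, ANNUAL = 1, 9, 10, 11


def make_row(oclc, in_circ, total, annual):
    """Build a synthetic 12-column row; columns not used here are left blank."""
    row = [''] * 12
    row[OCLC], row[IN_CIRC], row[TOTAL], row[ANNUAL] = map(str, (oclc, in_circ, total, annual))
    return row


# Two copies of OCLC number 1 and one copy of OCLC number 2.
rows = [make_row(1, 1, 5, 2), make_row(1, 0, 0, 0), make_row(2, 1, 3, 1)]

# [copies, copies in circulation, total circulation, 2007 circulation]
stats = defaultdict(lambda: [0, 0, 0, 0])
for row in rows:
    entry = stats[int(row[OCLC])]
    entry[0] += 1
    entry[1] += int(row[IN_CIRC])
    entry[2] += int(row[TOTAL])
    entry[3] += int(row[ANNUAL])

print(dict(stats))  # {1: [2, 1, 5, 2], 2: [1, 1, 3, 1]}
```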

In [5]:
filePath = 'OhioLINK Data\\OhioLinkCirc.fil'
with open(filePath, 'r') as f:
    circDat = f.read().splitlines()
circDat = [item.split('\t') for item in circDat]

In [6]:
# Store each resource's circulation data in a dictionary.
# Each key corresponds to a unique OCLC number.
def gatherCircStats(itemList):
    # {oclcNum: [numItems, numItemsInCirc, totalCirculation, annualCirc]}
    itemDic = {}
    for item in itemList:
        if int(item[1]) not in itemDic:
            itemDic[int(item[1])] = [1, int(item[9]), int(item[10]), int(item[11])]
        else:
            itemDic[int(item[1])][0] += 1  # copies
            itemDic[int(item[1])][1] += int(item[9]) # copies in circulation
            itemDic[int(item[1])][2] += int(item[10]) # total circulation
            itemDic[int(item[1])][3] += int(item[11]) # circulation in 2007
    return itemDic

def addCircStats(books, circStats):
    for item in books:
        if item['oclc'] in circStats:
            num = item['oclc']
            item['copies'] = circStats[num][0]
            # Despite the key name, this stores the 2007 (annual) circulation, the last entry.
            item['total_circ'] = circStats[num][-1]
            item['circ_status'] = circStats[num][1] # copies in circulation; 0 only if none circulated
        else:
            # The record came from the MARC files but has no circulation entry.
            print('Not in circulation data')

circStats = gatherCircStats(circDat)
addCircStats(bibList, circStats)
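
`addCircStats` prints one line for every bibliographic record that has no circulation entry, which can be hard to read for millions of records. A minimal sketch of an alternative check, assuming the same dictionary layouts as above, collects the unmatched OCLC numbers instead. `unmatched_oclc` and the sample data are hypothetical and not part of the original notebook.

```python
def unmatched_oclc(books, circ_stats):
    """Return the OCLC numbers of bibliographic records with no circulation entry."""
    return [book['oclc'] for book in books if book['oclc'] not in circ_stats]


# Synthetic data, for illustration only.
books = [{'oclc': 1}, {'oclc': 2}, {'oclc': 3}]
circ_stats = {1: [2, 1, 5, 2], 3: [1, 0, 0, 0]}

missing = unmatched_oclc(books, circ_stats)
print(len(missing), missing)  # 1 [2]
```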

In [9]:
with open('OhioLINK Data\\marcData.pk', 'wb') as f:
    pickle.dump(bibList, f)
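
To reuse the extracted records in a later session, the pickle can be read back with `pickle.load`. The sketch below round-trips a tiny synthetic list through a temporary file rather than the real `marcData.pk`; the record contents are made up for illustration.

```python
import os
import pickle
import tempfile

# Synthetic record with the same keys the notebook stores.
records = [{'title': 'Example title', 'auth': [], 'lcc': None, 'ddc': None,
            'pub': None, 'oclc': 1, 'copies': 2, 'total_circ': 2, 'circ_status': 1}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'marcData.pk')
    with open(path, 'wb') as f:
        pickle.dump(records, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == records)  # True
```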