# MARC Data Extraction

Extract bibliographic information from MARC records.


## Attributions

This report contains information from [OhioLINK Circulation Data](https://www.oclc.org/research/areas/systemwide-library/ohiolink/circulation.html), which is made available by OCLC Online Computer Library Center, Inc. and OhioLINK under the [ODC Attribution License](https://www.oclc.org/research/areas/systemwide-library/ohiolink/odcby.html).


## Imports

- *pymarc* handles the bibliographic data stored in .marc files 

In [2]:
import pickle
import random
from pymarc import MARCReader
import sys
#!{sys.executable} -m spacy download en


## Bibliographic Data

Bibliographic data is stored in .marc files that follow the MARC 21 standard for bibliographic data. The MARC files are parsed to extract the Library of Congress Call Numbers (LCCNs) for each item with a unique OCLC number. A unique OCLC number corresponds to a unique library resource (e.g. a book, encyclopedia, or article).


#### Missing LCCN Data
Items without an LCCN did not contain a field with the tag '050'.
- Some records contain a field with tag '090' for locally assigned LCCNs.
- Records with neither a '090' nor a '050' field appear to contain no LCCN data within the MARC file.
    - On [Classify](http://classify.oclc.org/classify2/ClassifyDemo?&startRec=0), some holdings of these items had LCCNs while others were unclassified.
    - For example, see the entry for the item with OCLC # [6728](http://classify.oclc.org/classify2/ClassifyDemo?search-standnum-txt=6728&startRec=0).
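
As a rough check on how many items are affected, the extracted tuples can be scanned for empty call number lists. This is a minimal sketch, assuming each entry has the shape built by `getMarcData` in this notebook, with the list of call numbers at index 1. The names `sample_entries` and `count_missing` are hypothetical, and the sample values are made up for illustration.

```python
# Made-up stand-in for bibList entries:
# (OCLC number, call numbers, year published, title, summary)
sample_entries = [
    (1, ['PZ5'], '[1963]', 'Title A', []),
    (2, [], 'c1955.', 'Title B', []),
    (3, ['QA76'], '1992', 'Title C', []),
]

def count_missing(entries):
    """Return (entries without any call number, total entries)."""
    missing = sum(1 for entry in entries if not entry[1])
    return missing, len(entries)

missing, total = count_missing(sample_entries)
print(f"{missing} of {total} entries have no call number")
```

Run against the real `bibList`, the same function would give the overall count of items that have neither a '050' nor a '090' value.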

#### Items with more than one LCCN

Some MARC records have multiple LCCN fields (multiple entries tagged '050'). For now, only the first is considered.
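
One way to apply this rule is to take the first element of the list of call numbers collected for a record. The helper below is a hypothetical sketch, not part of the original notebook; it assumes the call numbers are held in a plain Python list, as in the tuples built by `getMarcData`.

```python
def first_call_number(call_numbers):
    """Return the first call number in the list, or None if the list is empty."""
    return call_numbers[0] if call_numbers else None

# Illustrative inputs (made up): a record with two values and one with none
print(first_call_number(['PZ5', 'PZ7']))
print(first_call_number([]))
```

Returning `None` for an empty list keeps records without a call number distinguishable from records that have one.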

#### Summary of relevant tags 
- '050': Library of Congress Call Number (LCCN)
- '090': Locally assigned LCCN (not supposed to be used anymore but still exists)
- '001': OCLC number (for this dataset)
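
In this dataset the '001' value carries a letter prefix before the number (for example `ocm00000001`), and the loading code in this notebook strips it with a fixed `[3:]` slice. The sketch below is a more defensive alternative, assuming the prefix is one of `ocm`, `ocn`, or `on`; the helper name `parse_oclc_number` is hypothetical and not part of the original notebook.

```python
import re

# Assumption: an optional 'ocm', 'ocn', or 'on' prefix followed by digits;
# anything after the digits (such as trailing whitespace) is ignored.
_OCLC_PATTERN = re.compile(r'^\s*(?:ocm|ocn|on)?(\d+)')

def parse_oclc_number(control_number):
    """Return the numeric part of an OCLC control number as an int."""
    match = _OCLC_PATTERN.match(control_number)
    if match is None:
        raise ValueError(f"unrecognized OCLC control number: {control_number!r}")
    return int(match.group(1))

print(parse_oclc_number('ocm00000001 '))
print(parse_oclc_number('on1234567890'))
```

A fixed three-character slice would leave a stray digit-less character behind for a two-letter prefix, so matching the prefix explicitly avoids a failed `int` conversion in that case.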

### Loading Bibliographic Data

In [5]:
filePath1 = 'OhioCirculationData/OhioLINK_1.marc'
filePath2 = 'OhioCirculationData/OhioLINK_2.marc'

def getMarcData(filepath, bibList, noLCC):
    """Append one tuple per MARC record to bibList; return (bibList, noLCC, example)."""
    example = None  # first record in the file, kept for display
    with open(filepath, 'rb') as f:
        reader = MARCReader(f, to_unicode=True, force_utf8=True, utf8_handling='ignore')
        for record in reader:
            if record is None:
                # some pymarc versions yield None for records they cannot parse
                continue
            if example is None:
                example = record
            # Library of Congress Call Number: prefer '050', fall back to the local '090'
            lcc = [subfield for field in record.get_fields('050') for subfield in field.get_subfields('a')]
            if not lcc:
                lcc = [subfield for field in record.get_fields('090') for subfield in field.get_subfields('a')]
                if not lcc:
                    # keep roughly 1 in 1000 records without a call number for inspection
                    if random.randint(1, 1000) == 1:
                        noLCC.append(record)
            # summary
            summary = [subfield for field in record.get_fields('520') for subfield in field.get_subfields('a')]
            # OCLC number (3-character prefix stripped), LCCN, year published, title, summary
            bibList.append((int(record['001'].value()[3:]), lcc, record.pubyear(), record.title(), summary))
    return (bibList, noLCC, example)

try:
    with open('ohioLCCData_withSummary.pkl', 'rb') as f:
        bibList = pickle.load(f)
        example = pickle.load(f)
        noLCC = pickle.load(f)
        
except FileNotFoundError:
    bibList = []
    noLCC = []
    bibList, noLCC, example = getMarcData(filePath1, bibList, noLCC)
    bibList, noLCC, _ = getMarcData(filePath2, bibList, noLCC)

    with open('ohioLCCData_withSummary.pkl', 'wb') as f:
        pickle.dump(bibList, f)
        pickle.dump(example, f)
        pickle.dump(noLCC, f)
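
The cache file holds three objects written one after another, so they have to be read back in the same order they were dumped (`bibList`, `example`, `noLCC`). The following sketch illustrates that behaviour with an in-memory buffer standing in for the .pkl file; the placeholder values are made up for illustration.

```python
import io
import pickle

buffer = io.BytesIO()                # in-memory stand-in for the .pkl file
pickle.dump(['entry'], buffer)       # first object written
pickle.dump('example', buffer)       # second object written
pickle.dump([], buffer)              # third object written

buffer.seek(0)                       # rewind before reading
first = pickle.load(buffer)          # objects come back in write order
second = pickle.load(buffer)
third = pickle.load(buffer)
print(first, second, third)
```

A further `pickle.load` call on the exhausted stream raises `EOFError`, which is one way a mismatch between the number of dumps and loads shows up.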

### Example of bibliographic data

- The first is an example of a record in a .marc file.
- The second is a tuple containing the relevant information extracted from that record
    - (OCLC number, LCCN, year published, title, summary)
- The third is an additional, randomly chosen example of the data stored in such a tuple

In [6]:
print(example)
print(bibList[0], "\n")
print(random.choice(bibList))  # randint(0, len(bibList)) could index one past the end

=LDR  01075cam  2200289 a 4500
=001  ocm00000001\
=003  OCoLC
=005  20061229000001.0
=008  690526s1963\\\\ilua\\\j\\\\\\000\1\eng\\
=010  \\$a   63011276 
=040  \\$aDLC$cDLC$dIUL$dOCL$dOCLCQ$dTML$dOCL$dOCLCQ$dBTCTA
=019  \\$a6567842$a9987701$a53095235
=042  \\$alcac
=050  00$aPZ5$b.R1924 
=082  00$a[Fic]
=096  \\$aQV 4 An78 v.46 2006
=245  04$aThe Rand McNally book of favorite pastimes /$cillustrated by Dorothy Grider.
=246  30$aFavorite pastimes
=260  \\$aChicago :$bRand McNally,$c[1963]
=300  \\$a110 p. :$bcol. ill. ;$c33 cm.
=520  \\$aBoys and girls in these four stories work hard to master ballet dancing, riding, baton twirling, and swimming.
=505  0\$aLittle ballerina / by D. Grider -- Little horseman / by M. Watts -- Little majorette / by D. Grider -- Little swimmers / by V. Hunter.
=650  \1$aShort stories.
=700  1\$aGrider, Dorothy.
=700  1\$aHunter, Virginia.
=938  \\$aBaker and Taylor$bBTCP$n63011276
=994  \\$a11$bOCL$i00466

(1, ['PZ5'], '[1963]', 'The Rand McNally book of fa

### Examples of records missing LCCNs

In [6]:
for i in range(2):
    # random.choice avoids the off-by-one IndexError possible with randint(0, len(noLCC))
    print(random.choice(noLCC))

=LDR  00851cam  2200193Ia 4500
=001  ocm11129990\
=003  OCoLC
=005  20060407213023.0
=008  840907s1955\\\\nyua\\\\\\\\\\000\1\eng\d
=040  \\$aTVP$cTVP$dOCLCQ
=029  1\$aNLGGC$b268913269
=092  \\$aFIC$bIRV
=100  1\$aIrving, Washington,$d1783-1859.
=245  10$aRip van Winkle, and other stories /$cby Washington Irving ; illustrated by Susanne Suba.
=260  \\$aGarden City, N.Y. :$bNelson Doubleday,$cc1955.
=300  \\$a285 p.$billus.$c22 cm.
=505  0\$aRip van Winkle -- The legend of Sleepy Hollow -- Dolph Heyliger -- The legend of the storm-ship -- Kidd the pirate -- The devil and Tom Walker -- Philip of Pokanoket -- The early experiences of Ralph Ringwood -- The phantom island -- The adalantado of the seven cities.
=700  1\$aSuba, Susanne,$d1913-
=994  \\$a11$bOCL$i02162

=LDR  01082cam  2200253Ia 4500
=001  ocm27787457\
=003  OCoLC
=005  20050323183833.0
=007  he\bmb---buuu
=008  930324s1992\\\\dcua\\\\bb\\\f000\0\eng\d
=040  \\$aGPO$cGPO$dOCL$dSPI$dOCL$dOCLCQ
=074  \\$a0830-D (MF)
=086  0\$aNA