# Basic MARC Parser

This notebook parses a MARC record stored as a single string
(the "communications," or transmission, format).
The parsing approach follows the
instructions and example presented by Betty Furrie
in the essay "Understanding MARC Bibliographic"
at https://www.loc.gov/marc/umb/um11to12.html

In [1]:
marc_record = '01041cam  2200265 a 4500001002000000003000400020005001700024008004100041010002400082020002500106020004400131040001800175050002400193082001800217100003200235245008700267246003600354250001200390260003700402300002900439500004200468520022000510650003300730650001200763^###89048230#/AC/r91^DLC^19911106082810.9^891101s1990####maua###j######000#0#eng##^##$a###89048230#/AC/r91^##$a0316107514 :$c$12.95^##$a0316107506 (pbk.) :$c$5.95 ($6.95 Can.)^##$aDLC$cDLC$dDLC^00$aGV943.25$b.B74 1990^00$a796.334/2$220^10$aBrenner, Richard J.,$d1941-^10$aMake the team.$pSoccer :$ba heads up guide to super soccer! /$cRichard J. Brenner.^30$aHeads up guide to super soccer.^##$a1st ed.^##$aBoston :$bLittle, Brown,$cc1990.^##$a127 p. :$bill. ;$c19 cm.^##$a"A Sports illustrated for kids book."^##$aInstructions for improving soccer skills. Discusses dribbling, heading, playmaking, defense, conditioning, mental attitude, how to handle problems with coaches, parents, and other players, and the history of soccer.^#0$aSoccer$vJuvenile literature.^#1$aSoccer.^\\'


In [2]:
len(marc_record)

1041

The "data" here is a long sequence of letters and numbers, which Python stores as a string datatype. You can do certain things with string data in Python, including "slicing." Slicing is a way to extract or isolate a portion of the string based on its position in the data. To make a slice, you provide a starting index (strings start with index `0`) and an ending index (the slice goes up to, but doesn't include, the ending index). So, to "slice" the first 24 characters (the length of the leader), ask for the index `[0:24]`. (Even though it may seem that this will return 25 characters, it returns 24: the characters at positions 0 through 23.)

In [3]:
marc_record[0:24]

'01041cam  2200265 a 4500'

According to Furrie, "The first 24 positions are the leader. In this example the leader fills approximately 1/3 of the first line and ends with '4500.'" Since the slice above also ends in 4500, it appears to isolate the leader data correctly.
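
The leader also records where things are. In MARC 21, leader positions 00-04 hold the total record length and positions 12-16 hold the base address of the data (the offset at which the data fields begin). A minimal sketch, using the leader string isolated above; the variable names are only illustrative:

```python
# Leader copied from the sample record above
leader = '01041cam  2200265 a 4500'

# Positions 00-04: total record length, in characters
record_length = int(leader[0:5])

# Positions 12-16: base address of data (offset of the first data field)
base_address = int(leader[12:17])

print('record length:', record_length)
print('base address:', base_address)
```

The first value matches the `len(marc_record)` result above. The second offers one way to find where the data fields start without counting directory entries.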

The leader is followed by the directory. Because the number of fields varies from record to record, the directory has no fixed length. But we do know that the caret `^` character is used to separate (or "delimit") the directory and the fields that follow it. So to isolate the directory, try splitting the string on that delimiter. The first piece holds the leader and the directory together; since you know the leader is 24 characters long, use that to get the indexes needed to "slice" out just the directory.

In [4]:
broken_record = marc_record.split('^')

ldr_dir = broken_record[0]

print(ldr_dir)

01041cam  2200265 a 4500001002000000003000400020005001700024008004100041010002400082020002500106020004400131040001800175050002400193082001800217100003200235245008700267246003600354250001200390260003700402300002900439500004200468520022000510650003300730650001200763


In [5]:
len_ldr = 24

len_dir = len(broken_record[0])

directory = marc_record[len_ldr:len_dir]

print(directory)

001002000000003000400020005001700024008004100041010002400082020002500106020004400131040001800175050002400193082001800217100003200235245008700267246003600354250001200390260003700402300002900439500004200468520022000510650003300730650001200763


Alternatively, you could simply count the characters in the sample record. The leader is always 24 characters. In this case there are 20 directory entries of 12 characters each, for a total of 240. Adding the leader and the directory together, the directory should end at index 264.

Test it:

In [6]:
leader = marc_record[:24]
directory = marc_record[24:264]
print('leader:',leader)
print('directory:',directory)

leader: 01041cam  2200265 a 4500
directory: 001002000000003000400020005001700024008004100041010002400082020002500106020004400131040001800175050002400193082001800217100003200235245008700267246003600354250001200390260003700402300002900439500004200468520022000510650003300730650001200763
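
The same arithmetic can be checked in code. This sketch hard-codes the numbers from the sample record (a 24-character leader, 12-character directory entries, a 240-character directory); the constant names are made up for illustration:

```python
LEADER_LEN = 24   # the leader is always 24 characters
ENTRY_LEN = 12    # each directory entry: 3-character tag, 4-digit length, 5-digit start

directory_len = 240  # length of the directory in the sample record

# A well-formed directory divides evenly into 12-character entries
entry_count, leftover = divmod(directory_len, ENTRY_LEN)
print('entries:', entry_count, 'leftover:', leftover)

# The leader and directory together end at this index (exclusive)
print('directory ends at:', LEADER_LEN + directory_len)
```

A nonzero `leftover` would be a sign that the slice boundaries are off.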


Given the above, even without knowing the delimiter character, you could assume that the record data begins at the 266th character, which is index 265 in Python (that is, the 24 characters of the leader, the 240 characters of the directory, plus one for the delimiter).

In [7]:
fields = marc_record[265:]
print(fields)

###89048230#/AC/r91^DLC^19911106082810.9^891101s1990####maua###j######000#0#eng##^##$a###89048230#/AC/r91^##$a0316107514 :$c$12.95^##$a0316107506 (pbk.) :$c$5.95 ($6.95 Can.)^##$aDLC$cDLC$dDLC^00$aGV943.25$b.B74 1990^00$a796.334/2$220^10$aBrenner, Richard J.,$d1941-^10$aMake the team.$pSoccer :$ba heads up guide to super soccer! /$cRichard J. Brenner.^30$aHeads up guide to super soccer.^##$a1st ed.^##$aBoston :$bLittle, Brown,$cc1990.^##$a127 p. :$bill. ;$c19 cm.^##$a"A Sports illustrated for kids book."^##$aInstructions for improving soccer skills. Discusses dribbling, heading, playmaking, defense, conditioning, mental attitude, how to handle problems with coaches, parents, and other players, and the history of soccer.^#0$aSoccer$vJuvenile literature.^#1$aSoccer.^\


To actually parse the directory, you need a way to step through the directory string in units of 12 characters. Python's built-in `range()` function helps here: it generates a sequence of index numbers and, critically, lets you define how big the "steps" are (in this case, 12), so each number marks the start of one directory entry:

In [8]:
for entry in range(0, len(directory), 12):
    # each directory entry is 12 characters long
    toc = directory[entry:entry+12]
    print(toc)
    # keep the entry for the title statement (tag 245)
    if toc.startswith('245'):
        dir_tstmt = toc

print('\nTitle Statement directory:',dir_tstmt)

001002000000
003000400020
005001700024
008004100041
010002400082
020002500106
020004400131
040001800175
050002400193
082001800217
100003200235
245008700267
246003600354
250001200390
260003700402
300002900439
500004200468
520022000510
650003300730
650001200763

Title Statement directory: 245008700267
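
Each 12-character entry packs three values: a 3-character tag, a 4-digit field length, and a 5-digit starting position. As a rough sketch, the loop can unpack them into named values. `DirEntry` and `parse_directory` are hypothetical names, and the two entries are copied from the output above:

```python
from collections import namedtuple

# Hypothetical container for one directory entry
DirEntry = namedtuple('DirEntry', ['tag', 'length', 'start'])

def parse_directory(directory):
    """Split a directory string into 12-character entries."""
    entries = []
    for i in range(0, len(directory), 12):
        chunk = directory[i:i+12]
        entries.append(DirEntry(tag=chunk[:3],
                                length=int(chunk[3:7]),
                                start=int(chunk[7:12])))
    return entries

# Two entries from the sample directory (fields 245 and 246)
sample = '245008700267246003600354'
for e in parse_directory(sample):
    print(e.tag, e.length, e.start)
```

Converting the length and start to integers at this stage avoids repeated `int()` calls later.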


Finally, with a bit more work, use the directory entry to get the field tag and the indices needed to slice the title data out of the full string.

Hint: the starting positions in the directory are counted from the first character of the data fields, not from the start of the record. So you have to add the combined length of the leader and directory, for which you can reuse the `len_dir` value determined and saved above.

In [11]:
# len_dir counts everything before the first '^': the 24-character leader plus the directory
len_dir = len(broken_record[0])
print('Full directory length:',len_dir)

print('Field:',dir_tstmt[:3])

tstmt_start = dir_tstmt[7:]
print('Starts at:',tstmt_start)

tstmt_len = dir_tstmt[3:7]
print('Length:',tstmt_len)

# convert the directory strings to numbers so that Python understands how to add them;
# the +1 skips the '^' that ends the directory, and the -1 leaves off the '^' that ends the field
start = len_dir + 1 + int(tstmt_start)
end = start + int(tstmt_len) - 1
tstmt = marc_record[start:end]

print('\nFinally, here is the data you were looking for, the 245 field:\n',tstmt)

Full directory length: 264
Field: 245
Starts at: 00267
Length: 0087

Finally, here is the data you were looking for, the 245 field:
 10$aMake the team.$pSoccer :$ba heads up guide to super soccer! /$cRichard J. Brenner.
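
The index arithmetic is easier to follow with the numbers written out. This sketch recomputes the slice boundaries for the 245 entry without touching the record itself. It assumes, as in this sample, that the data fields begin at index 265 and that the length stored in the directory counts the field's closing `^`:

```python
entry = '245008700267'   # directory entry for the title statement
base = 265               # index of the first data character in this record

length = int(entry[3:7])  # 87, including the field's closing delimiter
offset = int(entry[7:])   # 267, counted from the start of the data

start = base + offset         # first character of the field
end = start + length - 1      # stop just before the closing delimiter

print('slice:', start, end)
print('characters returned:', end - start)
```

So `marc_record[532:618]` returns the 86 characters of the field and leaves off the `^` that ends it.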


It's a bit complicated, right? But the steps above illustrate some of the logical rules that you could use to build up more complete parser code that would process the full record.
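
As one possible next step, the same rules can be wrapped in a function that walks the whole directory. The sketch below is not a complete MARC parser: it assumes the text conventions used in this notebook (`^` as the field delimiter, and a leader whose positions 12-16 hold the offset of the first data field), and it builds a tiny made-up record so that it runs on its own. `build_toy_record` and `parse_fields` are hypothetical helper names.

```python
def build_toy_record(fields):
    """Assemble a small record string from (tag, data) pairs, for testing only."""
    directory, body, offset = '', '', 0
    for tag, data in fields:
        length = len(data) + 1                      # +1 for the closing '^'
        directory += f'{tag}{length:04d}{offset:05d}'
        body += data + '^'
        offset += length
    base = 24 + len(directory) + 1                  # leader + directory + '^'
    total = base + len(body) + 1                    # +1 for the end-of-record mark
    leader = f'{total:05d}cam  22{base:05d} a 4500'
    return leader + directory + '^' + body + '\\'

def parse_fields(record):
    """Return a list of (tag, data) pairs using the leader and directory."""
    base = int(record[12:17])                       # where the data fields begin
    directory = record[24:base - 1]                 # drop the '^' after the directory
    parsed = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i+12]
        start = base + int(entry[7:12])
        end = start + int(entry[3:7]) - 1           # leave off the closing '^'
        parsed.append((entry[:3], record[start:end]))
    return parsed

# Made-up two-field record; the 245 text is borrowed from the sample title
toy = build_toy_record([('001', 'demo0001'), ('245', '10$aMake the team.')])
for tag, data in parse_fields(toy):
    print(tag, data)
```

Reading the data offset from the leader, rather than splitting on `^`, means the function does not depend on the delimiter never appearing inside a field.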

Fortunately, you don't need to do that. Why not? Because the MARC Bibliographic data format is well documented and well known, and other people have already done this for us. Yay! Some tools you can use include a Python library called [`pymarc`](https://pypi.org/project/pymarc/), or the widely used [MarcEdit](https://marcedit.reeset.net/downloads).