# Pymarc Patterns

[Pymarc Documentation](https://pymarc.readthedocs.io/en/latest/)

This notebook covers common patterns for working with MARC records in Python. It starts with the basics like getting fields and moves into more complex examples. It uses the latest version of Pymarc as of this writing (5.1.2). There were two fairly major changes from 4.x, both of which I mention when relevant.

The example records come from [Harvard's bibliographic records](https://archive.org/download/harvard_bibliographic_metadata) on the Internet Archive. They are in the docs/assets directory.

## Reading Records from Files

Use the `MARCReader` class to read records from a file. It accepts an open file handle and returns an iterator of `Record` objects. Make sure each record is not `None` when iterating.

In [1]:
from pathlib import Path
from pymarc import MARCReader

with open(Path('assets', 'example.mrc'), 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        if record:
            print("Got record:", record.title)
            # save this global variable for use in later code blocks
            global venetian
            venetian = record
        else:
            print("No record found.")

Got record: Photographs of Venetian villas /


Note that the file is opened in read-binary mode (`rb`). Read mode is sufficient because we are not modifying the file. We use binary mode because Pymarc handles decoding the strings in records itself; we don't want Python to do it. Try deleting the `b` in `rb`: what happens? What would happen if we didn't have an `if record` condition?
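To see the difference between the two modes without touching a real MARC file, here is a small standard-library sketch. The byte string is made up for illustration (it is not a valid MARC record); it simply contains a byte, `0xE9`, that is not valid UTF-8 on its own, as can happen with records that are not UTF-8 encoded.

```python
import os
import tempfile

# Made-up bytes for illustration only (not a valid MARC record).
# 0xE9 followed by a space is not a valid UTF-8 sequence.
raw = b"Caf\xe9 Venezia"

with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode hands back the bytes untouched.
    with open(path, "rb") as fh:
        binary_data = fh.read()
    print(type(binary_data).__name__, len(binary_data))

    # Text mode makes Python decode the bytes itself, which fails here.
    try:
        with open(path, "r", encoding="utf-8") as fh:
            fh.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True
    print("Text mode raised UnicodeDecodeError:", text_mode_failed)
finally:
    os.remove(path)
```

In binary mode the reader receives exactly the bytes on disk and can decide how to decode them, which is the behavior we want when handing a file to `MARCReader`.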

There are several gotchas we can run into with encoding issues. The [`MARCReader`](https://pymarc.readthedocs.io/en/latest/#pymarc.reader.MARCReader) class has a `to_unicode` parameter to return UTF-8 strings as well as a `force_utf8` parameter which coerces the data to UTF-8 (useful if we have records with inaccurate encodings). These parameters were more commonly needed under Python 2 and Pymarc 4. In all my recent scripts, I have not needed them. Sometimes life does get easier!



## Simple Ways to View Record Data

Pymarc comes with convenience properties for accessing common MARC fields on a record:

- `record.title`
- `record.author`
- `record.isbn` and `record.issn`
- `record.publisher`
- `record.pubyear`

**Pymarc 4.x Note**: prior to version 5, these properties were methods, so they looked like `record.title()` instead.
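If a script has to run under both Pymarc 4.x and 5.x, one option is a small wrapper that calls the attribute only when it is callable. The `convenience_value` helper below is hypothetical (it is not part of Pymarc), and the two stand-in classes are assumptions that exist only to make the sketch runnable without a MARC file.

```python
def convenience_value(record, name):
    """Return a convenience value whether it is a property (5.x) or a method (4.x).

    Hypothetical helper, not part of Pymarc.
    """
    attr = getattr(record, name)
    return attr() if callable(attr) else attr


# Stand-ins that mimic the two styles, for illustration only.
class PropertyStyleRecord:
    title = "Photographs of Venetian villas /"


class MethodStyleRecord:
    def title(self):
        return "Photographs of Venetian villas /"


print(convenience_value(PropertyStyleRecord(), "title"))
print(convenience_value(MethodStyleRecord(), "title"))
```

With a real record this would be called as `convenience_value(record, 'title')`.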

Also, if we print a record, its string representation is the "mnemonic MARC" format, a human-readable version of the MARC data where each field is printed on a new line with its tag, indicators, and subfields visible.

In [34]:
print('Title:', venetian.title)
print('Author:', venetian.author) # properties are None if there are no matching fields
print('Publisher:', venetian.publisher)
print('Year:', venetian.pubyear)
print(venetian)

Title: Photographs of Venetian villas /
Author: None
Publisher: The Institute,
Year: 1954.
=LDR  00774nam a22002057u 4500
=001  000000010-8
=005  20020606090541.3
=008  821202|1954\\\\|||||||\\||||\|0||||eng|d
=035  0\$aocm78684367
=245  00$aPhotographs of Venetian villas /$cRoyal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.
=246  3\$aVenetian villas.
=260  0\$aLondon, England :$bThe Institute,$c1954.
=300  \\$a39 p. ;$c21 cm.
=500  \\$aCover title : Venetian villas.
=500  \\$aCatalogue of an exhibition held at Royal Institute of British Architects, Feb. 25-Mar. 27, 1954.
=650  \0$aArchitecture, Domestic$zItaly$zVenice.
=700  1\$aMazzotti, Giuseppe.
=710  2\$aRoyal Institute of British Architects.
=988  \\$a20020608
=906  \\$0MH



As we can see, the properties are `None` if they do not exist, like the Author in our example. What will `record.isbn` return?

For serious work with records, **do not use these convenience properties**. They only find the first instance of a field and return the text of certain subfields. They are useful for quick peeks at data, but not functional for most purposes. Instead, we will typically use the `get_fields()` method and iterate over all existing fields.

There are also `record.series`, `record.subjects`, `record.physicaldescription` (all 300 fields), and `record.notes` (all 5XX fields) properties. Since these return a list of actual `Field` objects, they are more useful, though we should be careful that they're using the MARC fields we care about.

## Writing Records

The basic steps to modify MARC records with Pymarc are:

- Read the records in with `MARCReader`
- Modify them in place—assign values to fields and subfields
- Write the records out with `MARCWriter`

Below, we prefix the example record's title `$a` subfield with "Great".

In [21]:
from pymarc import MARCReader, MARCWriter

with open('assets/example.mrc', 'rb') as fh:
    reader = MARCReader(fh)
    with open('assets/great.mrc', 'wb') as out:
        writer = MARCWriter(out)
        for record in reader:
            if record:
                record["245"]["a"] = f'Great {record["245"]["a"]}'
                writer.write(record)
                print(record.title)

Great Photographs of Venetian villas


## Getting Fields

Pymarc's [Record](https://pymarc.readthedocs.io/en/latest/index.html#module-pymarc.record) object provides three ways to retrieve fields that are better than the convenience properties:

- `get` which is like `dict`'s `get` method in that it lets us define a default value if the field doesn't exist
- `get_fields` which returns a list of fields with a given tag
- bracket notation, which returns the first field with a given tag

In general, `get_fields` is probably the most foolproof method.

For all methods, field names are strings, not numbers. Fields that begin with a 0, like 020, would be awkward otherwise. We will talk more about `Field` objects later, but below we must use the `value()` method to return a string representation of a field.

In [2]:
# record.get(field, default)
# if we can't find a uniform title in 130 return 245
title = venetian.get('130', venetian['245'])
print('130 or, if not present, 245:', title.value())

# record.get_fields(field) -> list[Field]
print('\n500 fields:')
for field in venetian.get_fields('500'):
    print(field.value())

# we can pass multiple fields to get a list of all of them
print('\nAll 2XX fields:')
for field in venetian.get_fields('245', '246', '260'):
    print(field.value())
# we could pass a list & use * to unpack it, too:
# venetian.get_fields(*['245', '246', '260'])
# or actually ALL 2XX fields using `range`:
# venetian.get_fields(*[str(i) for i in range(200, 300)])

# though there are two 500 fields, bracket notation only returns the first one
print('\nFirst 500:', venetian['500'].value())

# if we try to access a field that doesn't exist, we get a KeyError
# this is why I do not recommend bracket notation
try:
    print(venetian['999'])
except KeyError as e:
    print("\nKeyError from accessing a non-existent field")

130 or, if not present, 245: Photographs of Venetian villas / Royal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.

500 fields:
Cover title : Venetian villas.
Catalogue of an exhibition held at Royal Institute of British Architects, Feb. 25-Mar. 27, 1954.

All 2XX fields:
Photographs of Venetian villas / Royal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.
Venetian villas.
London, England : The Institute, 1954.

First 500: Cover title : Venetian villas.

KeyError from accessing a non-existent field


Because bracket notation throws errors if the field doesn't exist, and only returns the first instance of a field, it's not very useful. It seems like a rare scenario to have a default field value handy for use with `get`. In general, using `get_fields` with for-in loops (which will simply not execute if there are no matching fields) is the most foolproof way to access record fields.
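One way to package that pattern is a small generator that yields the string value of every matching field. `field_values` is a hypothetical helper, not part of Pymarc; it relies only on `get_fields()` and `value()` as used above. The stand-in classes are assumptions made so the sketch runs on its own; in practice we would pass a real `Record` such as `venetian`.

```python
def field_values(record, *tags):
    """Yield the string value of every field matching the given tags.

    Hypothetical helper: yields nothing when no field matches, so callers
    never have to handle a KeyError or a None.
    """
    for field in record.get_fields(*tags):
        yield field.value()


# Minimal stand-ins so the sketch is runnable without a MARC file.
class FakeField:
    def __init__(self, tag, text):
        self.tag = tag
        self.text = text

    def value(self):
        return self.text


class FakeRecord:
    def __init__(self, fields):
        self.fields = fields

    def get_fields(self, *tags):
        return [f for f in self.fields if f.tag in tags]


record = FakeRecord([
    FakeField("500", "Cover title : Venetian villas."),
    FakeField("650", "Architecture, Domestic Italy Venice."),
])

print(list(field_values(record, "500")))
print(list(field_values(record, "999")))  # no such field: empty list, no error
```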

## Field Objects

Pymarc has a [Field](https://pymarc.readthedocs.io/en/latest/#module-pymarc.field) object for representing MARC fields. The `get_fields` method returns a list of these objects, not strings.

In [23]:
field = venetian['245']
print('Tag:', field.tag)
# Fields support bracket notation just like Records do for Fields
# with the same caveat: we get a KeyError for non-existent subfields
print('Subfield A:', field['a'].rstrip(' /'))
print('Subfield B with fallback:', field.get('b', 'Remainder of title')) # dict-like get for a specific subfield with fallback value

# format_field() applies extra formatting for subject fields while value() returns all subfield values concatenated
print(venetian.subjects[0].format_field(), 'versus', venetian.subjects[0].value())

Tag: 245
Subfield A: Photographs of Venetian villas
Subfield B with fallback: Remainder of title
Architecture, Domestic -- Italy -- Venice. versus Architecture, Domestic Italy Venice.


There are several ways to get sets of subfields:

- `get_subfields()` returns a list of subfield _values_ for the codes we pass in
- `subfields` is a property that's a list of actual Subfield objects which each have `code` and `value` properties
- `subfields_as_dict()` returns a dictionary of subfield codes and values, but because subfields can be repeated the values are a _list_ of strings

In [24]:
print(field.get_subfields('a', 'b', 'c')) # get list of values of specific subfields
print(field.subfields) # get all subfields as objects
print(field.subfields_as_dict()) # dict of {subfield code: [subfield values]}

['Photographs of Venetian villas /', 'Royal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.']
[Subfield(code='a', value='Photographs of Venetian villas /'), Subfield(code='c', value='Royal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.')]
{'a': ['Photographs of Venetian villas /'], 'c': ['Royal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.']}


The most appropriate subfield method depends on our use case.

## Modifying Fields

Record's `add_field` method lets us add a field but we need to construct it using the Field class. We can also append to the `fields` list. Both methods add the field _to the end of the record_ in case we care about field order, which I will mostly not discuss here.

**Pymarc 4.x Note**: prior to version 5, the `Field` object's `subfields` argument was a list of strings which alternated between subfield codes and values, e.g. `['a', 'Title', 'c', 'Responsibility']`. It's now a list of `Subfield` objects.
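When porting a 4.x script, the old flat list can be converted by pairing up every two items. This is a minimal sketch: the `to_subfields` helper is hypothetical, and the fallback `Subfield` definition is an assumption that only kicks in when Pymarc 5 is not installed, so the block runs either way.

```python
from typing import NamedTuple

try:
    from pymarc import Subfield  # Pymarc 5.x
except ImportError:
    # Fallback stand-in so the sketch runs without Pymarc 5 installed.
    class Subfield(NamedTuple):
        code: str
        value: str


def to_subfields(flat):
    """Convert ['a', 'Title', 'c', 'Responsibility'] into Subfield objects."""
    if len(flat) % 2:
        raise ValueError("expected alternating codes and values")
    return [Subfield(code=code, value=value)
            for code, value in zip(flat[0::2], flat[1::2])]


subfields = to_subfields(['a', 'Title /', 'c', 'Responsibility'])
print([(sf.code, sf.value) for sf in subfields])
```

Pymarc 5's own documentation describes a `Field.convert_legacy_subfields` helper for the same job, so check the installed version before rolling our own.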

In [39]:
from pymarc import Field, Subfield

# adding a subject heading to the record
print('Number of subjects before:', len(venetian.subjects))
subject = Field(tag='650', indicators=[' ', '0'], subfields=[
    Subfield(code='a', value='Venice (Italy)'),
    Subfield(code='v', value='Non-fiction.'),
])
venetian.add_field(subject)
print('Number of subjects after:', len(venetian.subjects))
# we can also append directly to the Record.fields list
venetian.fields.append(subject)
print('Number of subjects after append:', len(venetian.subjects))

Number of subjects before: 3
Number of subjects after: 4
Number of subjects after append: 5


There are two ways to remove fields which differ in how they reference fields. The most useful method is `remove_field` which removes all `Field` objects passed to it. It can target specific fields, e.g. the second of three subjects.

The `remove_fields` method removes all fields with the given tags. It's more concise in situations where we want to remove _all_ of a particular field without caring about values.

These methods return `None`; they do not return the removed field(s).

In [26]:
# remove all 906 fields
venetian.remove_fields('906')
# remove only subject fields with the 'Venice (Italy)' value we added earlier
for subject_field in venetian.subjects:
    if 'Venice (Italy)' in subject_field.value():
        venetian.remove_field(subject_field)
print('Number of subjects after removal:', len(venetian.subjects))

Number of subjects after removal: 1


We can modify a field directly wherever one is returned by a `Record` method or property, for instance by interacting with the `fields` list, but it can be difficult to get the field we want. Remember that the convenience properties like `record.isbn` are read-only _strings_, so code like `record.isbn.subfields.append(Subfield('z', 'new subfield'))` throws an error. `record['020'].subfields.append(Subfield('z', 'new subfield'))` works but will throw an error on records lacking an 020 field.

A more common pattern is to iterate over `get_fields`, create a modified copy of the field, `remove_field` the original, and `add_field` the modified one.

In [27]:
# add a subfield to the last field
# generally would not recommend this method of modifying Record.fields entries directly
venetian.fields[-1].subfields.append(Subfield(code='z', value='Additional subfield'))
print(venetian.fields[-1])

# safer and more generally useful way to modify a field
for field in venetian.get_fields('245'):
    if 'Venetian' in field.value():
        # build a real copy; plain assignment (new_title = field) would only alias the original
        new_title = Field(tag=field.tag, indicators=field.indicators, subfields=list(field.subfields))
        new_title['a'] = new_title['a'].replace('Venetian', 'Sicilian')
        venetian.remove_field(field)
        venetian.add_field(new_title)
print(venetian['245'])


=988  \\$a20020608$zAdditional subfield
=245  00$aPhotographs of Sicilian villas /$cRoyal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.


Like `add_field`, `add_subfield` adds a subfield _onto the end_ of the field by default, but accepts a `pos` parameter for which position the subfield should be inserted into. If we care about subfield order, which tends to affect record display much more than field order, we have to be deliberate in constructing the modified field.

In [28]:
# adding a subfield in a specific position
new_title = venetian['245']
new_title.add_subfield(code='b', value='Subtitle ', pos=1)
# this also works:
# new_title.subfields.insert(1, Subfield('b', 'Subtitle '))
print(new_title)

=245  00$aPhotographs of Sicilian villas /$bSubtitle $cRoyal Institute of British Architects ; detailed information compiled by Giuseppe Mazzotti.
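When the right position is not known in advance, it can be computed from the subfields already present. This is a minimal sketch, assuming we want the new subfield directly after the last subfield with a given code; `insert_after` is a hypothetical helper, and the fallback `Subfield` is a stand-in used only if Pymarc 5 is not installed.

```python
from typing import NamedTuple

try:
    from pymarc import Subfield  # Pymarc 5.x
except ImportError:
    # Fallback stand-in so the sketch runs without Pymarc 5 installed.
    class Subfield(NamedTuple):
        code: str
        value: str


def insert_after(subfields, new_subfield, after_code):
    """Insert new_subfield right after the last subfield coded after_code.

    Appends to the end when after_code is absent. Returns the index used.
    """
    position = len(subfields)
    for index, sf in enumerate(subfields):
        if sf.code == after_code:
            position = index + 1
    subfields.insert(position, new_subfield)
    return position


subfields = [Subfield('a', 'Title /'), Subfield('c', 'Responsibility')]
position = insert_after(subfields, Subfield('b', 'Subtitle'), 'a')
print(position, [sf.code for sf in subfields])
```

With a real field we would pass `field.subfields`, the same list that the commented-out `insert` call in the cell above operates on.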


We could also add the subfield and then sort the field's subfields list, because we might not be able to determine the position of a new subfield without knowing the structure of the entire field. This works well for a `245` with `a`, `b`, and `c` subfields but as soon as there are other subfields (`h` for Medium/GMD for instance, which does not go in alphabetical order) it breaks down.

In [40]:
title = Field(tag='245', indicators=['0', '0'], subfields=[Subfield('a', 'Title /'), Subfield('c', 'Responsibility')])
print(title)
title.subfields.append(Subfield('b', 'Subtitle'))
title.subfields.sort(key=lambda sf: sf.code) # sort subfields alphabetically by code
print(title)

=245  00$aTitle /$cResponsibility
=245  00$aTitle /$bSubtitle$cResponsibility


We can see this still isn't great because adding a subtitle actually should change the `a` subfield (to end in a colon and not a forward slash `$aTitle : $bSubtitle /`). Basically, the point is that subfield lists are tricky and require a lot of forethought to get right with records of varying structure. If we know the structure of our records, we can be more confident in our manipulations. We can, of course, use some of the `get`-type methods to confirm our assumptions first, such as "our 245 fields only have these subfields", "records only have one 856 field", etc.
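Both ideas, confirming assumptions and ordering subfields deliberately, can be sketched together. The order `a`, `h`, `b`, `c` below is an illustrative assumption rather than a cataloging rule, both helper functions are hypothetical, and the fallback `Subfield` is a stand-in used only if Pymarc 5 is not installed. The sketch does not attempt to repair punctuation.

```python
from typing import NamedTuple

try:
    from pymarc import Subfield  # Pymarc 5.x
except ImportError:
    # Fallback stand-in so the sketch runs without Pymarc 5 installed.
    class Subfield(NamedTuple):
        code: str
        value: str

# Assumed display order for a 245; an illustrative choice, not a rule.
TITLE_ORDER = "ahbc"


def unexpected_codes(subfields, allowed):
    """Return the set of subfield codes that are not in `allowed`."""
    return {sf.code for sf in subfields} - set(allowed)


def sorted_by_order(subfields, order):
    """Return a new list sorted by each code's position in `order`.

    sorted() is stable, so repeated codes keep their relative order.
    """
    return sorted(subfields, key=lambda sf: order.index(sf.code))


subfields = [
    Subfield("a", "Title"),
    Subfield("c", "Responsibility"),
    Subfield("h", "[videorecording] /"),
]

# Confirm the assumption before sorting; order.index() would raise a
# ValueError on a code that is missing from TITLE_ORDER.
if not unexpected_codes(subfields, TITLE_ORDER):
    subfields = sorted_by_order(subfields, TITLE_ORDER)

print([sf.code for sf in subfields])
```

On a real field, the in-place equivalent would be `field.subfields.sort(key=lambda sf: TITLE_ORDER.index(sf.code))`, guarded by the same check.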

We also delete subfields with, you guessed it, `delete_subfield`. It returns _the value of the subfield_ (not a subfield object!!) or `None` if there was no subfield with the provided code.

In [31]:
responsibility = title.delete_subfield('c')
print('Title with no c subfield:', title, '| C subfield value:', responsibility)
# this also works, to remove the last subfield (can provide a specific position to pop() method, too)
title.subfields.pop()
print('Title with only $a subfield:', title)

Title with no c subfield: =245  00$aTitle /$bSubtitle | C subfield value: Responsibility
Title with only $a subfield: =245  00$aTitle /


## Conclusion

Pymarc provides a number of ways of accessing record data, but the quickest, most convenient ways (convenience properties, bracket notation) tend to be the most error prone, especially when record structures vary or are unknown. We usually want to use methods that return lists of the fields and subfields we are interested in and iterate over these lists, which naturally works around absent data.

We've seen that Pymarc lets us operate on fields/subfields either with its own specific object methods or by interacting with list properties on their parent objects. In general, the object methods are preferable and more targeted, but sometimes the lists are useful when we care about order (e.g. "delete last field/subfield").