# Sample code for Lab 8

This notebook includes sample code for the questions in Lab 8.

## Q1. Is it working?

In [2]:
# Sample import statement that runs lxml,
# but if lxml isn't available will load 
# the built-in ElementTree library for XML parsing
try:
    from lxml import etree
    print("running with lxml.etree")
except ImportError:
    import xml.etree.ElementTree as etree
    print("running with Python's xml.etree.ElementTree")
from pathlib import Path

MODS_collection_25 = Path('..','data','xml','2018_lcwa_MODS_25.xml')

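# use `etree.XMLParser()` to strip ignorable whitespace, which improves printing later on
# NOTE: only safe if whitespace is not significant in the XML input;
# `remove_blank_text` is an lxml option, so this line fails under the xml.etree.ElementTree fallback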
parser = etree.XMLParser(remove_blank_text=True)

metadata = etree.parse(MODS_collection_25, parser=parser).getroot()

el_count = 0 

for element in metadata.iter():
    el_count += 1
    # uncomment the next line to print each element's tag
    #print(element.tag)

print(el_count)

running with lxml.etree
1358
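
Beyond a single total, a per-tag tally can help confirm that the parse picked up the expected structure. The following is a minimal sketch rather than part of the lab solution: it uses the standard library parser and a made-up two-record sample (`SAMPLE`) so that it runs without the lab data file, and `local_name` is a hypothetical helper. With lxml, `.iter()` also yields comments, whose `.tag` is not a string, so the sketch skips those.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical miniature MODS collection, for illustration only
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>One</title></titleInfo><identifier>lcwaN0000001</identifier></mods>
  <mods><titleInfo><title>Two</title></titleInfo><identifier>lcwaN0000002</identifier></mods>
</modsCollection>"""

root = ET.fromstring(SAMPLE)

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]

# Count elements per tag; skip nodes whose tag is not a string (lxml comments)
tag_counts = Counter(
    local_name(el.tag) for el in root.iter() if isinstance(el.tag, str)
)

for name, n in tag_counts.most_common():
    print(name, n)

print('total elements:', sum(tag_counts.values()))
```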


## Q2. Look for `subject` tags

These records contain `<subject>` elements, but only some of them correspond to authorized headings in the Library of Congress Subject Headings (LCSH). Those are marked with the attribute `authority='lcsh'` on the element. Loop through the `<subject>` elements, select only the ones that carry the LCSH attribute, and print the content of those subject headings. Use an XPath expression here.

In [3]:
# define dictionary nsmap as namespace map with mods
nsmap = {
    'mods' : 'http://www.loc.gov/mods/v3',
    'ead3'  : 'http://ead3.archivists.org/schema/',
}

count = 0

for subject in metadata.findall('.//mods:subject[@authority="lcsh"]', nsmap):
    count += 1
    print(subject.tag, subject.attrib)
    # display the element
    print(etree.tostring(subject).decode())

print('Subject tags:',count)

{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
<subject xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" authority="lcsh"><name authority="naf" type="corporate"><namePart><!-- TODO: Insert name authority here (can be same as name authority above, under title). --></namePart></name></subject>
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
<subject xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" authority="lcsh"><topic>Animals</topic><genre>Pictorial works</genre></subject>
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
<subject xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" authority="lcsh"><name authority="naf" type="corporate"><namePart><!-- TODO: Insert name authority here (can be same as name authority 
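
If only the heading text is wanted rather than the serialized element, `.itertext()` can collect it. This is a minimal sketch under stated assumptions: the `SAMPLE` records are invented for illustration, the standard library parser stands in for lxml, and joining the parts with `--` is just one display choice. Subjects that contain only an empty placeholder produce no text and are skipped.

```python
import xml.etree.ElementTree as ET

nsmap = {'mods': 'http://www.loc.gov/mods/v3'}

# Hypothetical sample: two LCSH subjects (one empty) and one uncontrolled subject
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <subject authority="lcsh"><topic>Animals</topic><genre>Pictorial works</genre></subject>
    <subject><topic>Local keyword</topic></subject>
    <subject authority="lcsh"><name><namePart></namePart></name></subject>
  </mods>
</modsCollection>"""

root = ET.fromstring(SAMPLE)

headings = []
for subject in root.findall('.//mods:subject[@authority="lcsh"]', nsmap):
    # itertext() yields every text node below the element, in document order
    parts = [text.strip() for text in subject.itertext() if text.strip()]
    if parts:  # ignore subjects that hold no text at all
        headings.append('--'.join(parts))

print(headings)
```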

## Q3. Validating call numbers

Check the local call number references to ensure that they are in the proper format (e.g., `lcwaNddddddd` or `lcwaEddddddd`, where each `d` is a digit). Try adapting the [regular expression implementation illustrated in the sample MODS notebook under activities 6 & 7](https://github.com/morskyjezek/si676-2025-data/blob/main/examples/xml-working-with-MODS.ipynb). **Hint:** this is similar to what is demonstrated there, but you will need to modify the regex. Look carefully at the identifiers, because some will be similar but won't match the expression we built in class. Question: how many correctly formatted identifiers for the LC web archives are there?

In [4]:
# import regular expressions library
import re

# the following code assumes the MODS records have been parsed & assigned to "metadata"
# the following code assumes you have established a dictionary called "nsmap" for namespaces, including "mods"

# set up regex pattern to identify the call number:
call_num_pattern = re.compile(r'lcwa[NE]\d{7}')

# a counter for lc IDs
lc_ids = 0

for identifier in metadata.findall('.//mods:mods/mods:identifier', nsmap): 
    # `identifier.text` is None for an empty element, so fall back to an empty string
    if call_num_pattern.match(identifier.text or ''):
        lc_ids += 1
        print(identifier.text, identifier.attrib)
    else:
        print(f'not an LC number: { identifier.text }, {identifier.attrib}')

print(f'Found {lc_ids} correctly formatted LC identifiers')

lcwaN0010234 {}
not an LC number: 85999, {'invalid': 'yes', 'type': 'database id'}
not an LC number: 109353, {'invalid': 'yes', 'type': 'database id'}
lcwaN0001999 {}
not an LC number: 91224, {'invalid': 'yes', 'type': 'database id'}
not an LC number: 109272, {'invalid': 'yes', 'type': 'database id'}
lcwaN0003238 {}
not an LC number: 91275, {'invalid': 'yes', 'type': 'database id'}
not an LC number: 109273, {'invalid': 'yes', 'type': 'database id'}
not an LC number: 96782, {'invalid': 'yes', 'type': 'database id'}
lcwaN0010144 {}
not an LC number: nan, {'invalid': 'yes', 'type': 'database id'}
lcwaN0010145 {}
not an LC number: 82949, {'invalid': 'yes', 'type': 'database id'}
not an LC number: 109227, {'invalid': 'yes', 'type': 'database id'}
lcwaN0012178 {}
not an LC number: 85778, {'invalid': 'yes', 'type': 'database id'}
lcwaN0012179 {}
not an LC number: 85779, {'invalid': 'yes', 'type': 'database id'}
lcwaN0012180 {}
not an LC number: 85780, {'invalid': 'yes', 'type': 'database id'}
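
One detail worth knowing when validating: a pattern's `match` anchors only at the start of the string, whereas `fullmatch` requires the whole string to fit. The sketch below illustrates the difference on made-up identifier strings (none are taken from the lab data); `is_lcwa_id` is a hypothetical helper that also tolerates missing text.

```python
import re

call_num_pattern = re.compile(r'lcwa[NE]\d{7}')

# Hypothetical identifier values, including near misses and a missing value
candidates = ['lcwaN0010234', 'lcwaE0000123', 'lcwaN00102345',
              'lcwaS0010234', 'nan', None]

def is_lcwa_id(text):
    """Return True only if the entire string is a well-formed identifier."""
    return text is not None and call_num_pattern.fullmatch(text.strip()) is not None

valid = [c for c in candidates if is_lcwa_id(c)]
print('fullmatch accepts:', valid)

# match() accepts an eight-digit lookalike because it ignores trailing characters
too_long = 'lcwaN00102345'
print('match() on', too_long, '->', bool(call_num_pattern.match(too_long)))
print('fullmatch() on', too_long, '->', bool(call_num_pattern.fullmatch(too_long)))
```

Whether the stricter check changes the count depends on the data, so it is worth comparing both against the actual identifiers.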

## Q4. Data modification/addition

This step builds on the previous validation question. Once you have identified the local call numbers and confirmed that they are properly formatted (validation), add a `type` attribute to indicate their source authority.

The following code demonstrates how you might add information based on the successful validation.
In this case, if the LC web archives identifier number is correctly formatted,
the code adds a new attribute of `type="lcwa"` using the `.set()` method.

_The code below then prints the new identifier element. In production, you would more likely save the result to an XML file or a database._

In [5]:
# import regular expressions library
import re

# the following code assumes the MODS records have been parsed & assigned to "metadata"
# the following code assumes you have established a dictionary called "nsmap" for namespaces, including "mods"

# set up regex pattern to identify the call number:
call_num_pattern = re.compile(r'lcwa[NE]\d{7}')

for identifier in metadata.findall('.//mods:mods/mods:identifier', nsmap): 
    # `identifier.text` is None for an empty element, so fall back to an empty string
    if call_num_pattern.match(identifier.text or ''):
        print('LC identifier identified:',identifier.text)
        print(f'ID element and attribs before change:\n\t{identifier.text} {identifier.attrib}\n')
        # when you run the above, note that these identifiers do not have attributes. A valid identifier should have a type, so add "type" as "lcwa" - see https://www.loc.gov/standards/mods/userguide/identifier.html#type
        # set a "type" attribute to note this is an LC web archives identifier aka "lcwa"
        identifier.set('type','lcwa')
        print(f'ID element with attribs after change:\n\t{identifier.text} {identifier.attrib}\n')
        print('full XML element after validation and attribute addition:\n',etree.tostring(identifier).decode(),'\n\n')


LC identifier identified: lcwaN0010234
ID element and attribs before change:
	lcwaN0010234 {}

ID element with attribs after change:
	lcwaN0010234 {'type': 'lcwa'}

full XML element after validation and attribute addition:
 <identifier xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="lcwa">lcwaN0010234</identifier> 


LC identifier identified: lcwaN0001999
ID element and attribs before change:
	lcwaN0001999 {}

ID element with attribs after change:
	lcwaN0001999 {'type': 'lcwa'}

full XML element after validation and attribute addition:
 <identifier xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="lcwa">lcwaN0001999</identifier> 


LC identifier identified: lcwaN0003238
ID element and attribs before change:
	lcwaN0003238 {}

ID element with attribs after change:
	lcwaN0003238 {'type': 'lcwa'}

full XML element 
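
The validate-then-modify step can also be wrapped in a function that reports how many elements it changed, which makes it easy to confirm that running it twice is harmless. This is a minimal sketch, not the lab's required solution: `tag_lcwa_identifiers` and the `SAMPLE` records are hypothetical, the standard library parser stands in for lxml, and the sketch assumes an identifier that already has a `type` should be left alone.

```python
import re
import xml.etree.ElementTree as ET

nsmap = {'mods': 'http://www.loc.gov/mods/v3'}
call_num_pattern = re.compile(r'lcwa[NE]\d{7}')

# Hypothetical sample: one LC web archives ID and one database ID
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <identifier>lcwaN0000001</identifier>
    <identifier invalid="yes" type="database id">85999</identifier>
  </mods>
</modsCollection>"""

def tag_lcwa_identifiers(root):
    """Set type="lcwa" on well-formed, untyped identifiers; return the number changed."""
    changed = 0
    for identifier in root.findall('.//mods:mods/mods:identifier', nsmap):
        text = identifier.text or ''  # empty elements have text of None
        if call_num_pattern.fullmatch(text) and 'type' not in identifier.attrib:
            identifier.set('type', 'lcwa')
            changed += 1
    return changed

root = ET.fromstring(SAMPLE)
first_pass = tag_lcwa_identifiers(root)
second_pass = tag_lcwa_identifiers(root)  # nothing left to change
print(first_pass, second_pass)
```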

## Q5. Write to a file

Save the modifications to a file.

Note: An element on its own has no method for writing itself to a file,
so this block wraps the root element in an `ElementTree` and calls its `.write()` method.
Because lxml implements the ElementTree API, `etree.ElementTree` should
be available without an additional import.

In [None]:
# the following code assumes:
#     lxml or ET is running as etree
#     25 MODS records have been parsed & assigned to "metadata"
#     you have established a dictionary called "nsmap" for namespaces, including "mods"
#     you have made the modifications to identifiers in Q4

updated_file = Path('..','data','xml','lcwa_MODS_25_updated.xml')

# assign the updated metadata to an ElementTree (full XML doc)
tree = etree.ElementTree(metadata)
#tree.parse(parser)

# NOTE: `pretty_print` is an lxml-only keyword; remove it when running with xml.etree.ElementTree
tree.write(str(updated_file), xml_declaration=True, encoding='utf-8', method='xml', pretty_print=True)
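
After writing, it can be reassuring to read the file back and check that the change survived. The sketch below is one way to do that under stated assumptions: it writes to a temporary directory instead of the lab's data folder, uses an invented one-record `SAMPLE`, and relies only on the standard library so it runs without lxml. The `register_namespace` call asks the standard library serializer to keep MODS as the default namespace instead of generating an `ns0:` prefix.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET

MODS_NS = 'http://www.loc.gov/mods/v3'
nsmap = {'mods': MODS_NS}

# Serialize MODS elements without a generated prefix (standard library only)
ET.register_namespace('', MODS_NS)

# Hypothetical one-record sample
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><identifier>lcwaN0000001</identifier></mods>
</modsCollection>"""

root = ET.fromstring(SAMPLE)
root.find('.//mods:identifier', nsmap).set('type', 'lcwa')

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp, 'sample_updated.xml')
    # Wrap the root in an ElementTree so that it can be written as a full document
    ET.ElementTree(root).write(str(out_path), xml_declaration=True, encoding='utf-8')
    # Parse the file again and pull out the attribute that was added
    reloaded = ET.parse(str(out_path)).getroot()

types = [el.get('type') for el in reloaded.findall('.//mods:identifier', nsmap)]
print(types)
```

For readable output without lxml's `pretty_print`, `ET.indent()` (Python 3.9 and later) can be applied to the tree before writing.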