## Setup

We'll be using a Python library that helps us to parse markup languages like HTML and XML called BeautifulSoup. We will be using an additional library called `lxml`, which helps BeautifulSoup (aka BS4) to search and build XML. It is possible that you may need to do an extra step to install `lxml` if you have not used it before, and those steps are [outlined in the BS4 documentation here](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=namespace#installing-a-parser).

In [1]:
from bs4 import BeautifulSoup

If you use the cells below that use LXML on its own, then you'll also need to import
the library so you can call it directly:

In [2]:
import xml.etree.ElementTree as ET

try:
    from lxml import etree
    print('running with lxml.etree')
except ImportError:
    print('you\re not running with lxml active')

running with lxml.etree


Later on, we will use regular expressions to identify strings that match certain patterns. 
To do this, you also need to install the `re` library:

In [3]:
import re

## First Steps to Navigating the Tree: Beautiful Soup

### Load the records

This activity is designed to have the data included alongside our notebook, so the files are already included in this repository. This should allow you to download and run the notebook yourself, using the same commands and finding the same results. There could be other ways to do this if you are working in a different context - for example, if you are working with records you're pulling from the web, you might want to pull them dynamically using the `requests` library (that allows you to make HTML requests). 

Here is how we can parse one of the XML records using BeautifulSoup:

In [4]:
MODS_collection = open('2018_lcwa_MODS_5.xml', 'r')

Let's find a bit of information in the "soup" (that is, the file loaded as data). 
To do that, we can use the BS4 library to call items by name. (This, of course,
requires a knowledge of what tags and items you would expect to find in the record,
which we will look at later.)

In [5]:
soup = BeautifulSoup(MODS_collection, 'lxml')

In [6]:
print(soup.text[:100])


lcwaN001023485999109353Slate Magazineengelectronictext/htmlborn digitalgeneraltextweb siteUnited St


The above cell prints a string of text from the XML object that we've loaded, 
in order to demonstrate that, yes, we have loaded content. 

Later we will pull more meaningful information. For now, let's quickly 
pull out some of the titles, this time using the tag names: 

In [7]:
for tag in soup.find_all('title'):
    print(tag.name, tag.text)

title Slate Magazine
title General News on the Internet Web Archive
title Serial and Government Publications Division
title Raw Story
title General News on the Internet Web Archive
title Serial and Government Publications Division
title Huffington Post
title General News on the Internet Web Archive
title Serial and Government Publications Division
title BuzzFeed
title General News on the Internet Web Archive
title Serial and Government Publications Division
title Drudge Report
title General News on the Internet Web Archive
title Serial and Government Publications Division


### Navigating and Exploring the Tree

To get a list of all the tags in the document, try something like this (using `True` to demonstrate the existence of each tag in the file). Here, note the use of the `limit` argument to return only 10 instances. We don't need the whole list here for the purposes of demonstration:

In [9]:
for tag in soup.find_all(True, limit=10):
    print(tag.name)

html
body
modscollection
mods
identifier
identifier
identifier
titleinfo
title
language


Or, we could look for each of the attributes on the top-level `mods` tags. We can see that 
they are stored in a dictionary-like object, which is indicated with curly braces `{}`):

In [10]:
for tag in soup.find_all('mods'):
    print(tag.name, tag.attrs)

mods {'version': '3.4', 'xmlns': 'http://www.loc.gov/mods/v3', 'xmlns:xlink': 'http://www.w3.org/1999/xlink', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', 'xsi:schemalocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
mods {'version': '3.4', 'xmlns': 'http://www.loc.gov/mods/v3', 'xmlns:xlink': 'http://www.w3.org/1999/xlink', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', 'xsi:schemalocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
mods {'version': '3.4', 'xmlns': 'http://www.loc.gov/mods/v3', 'xmlns:xlink': 'http://www.w3.org/1999/xlink', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', 'xsi:schemalocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
mods {'version': '3.4', 'xmlns': 'http://www.loc.gov/mods/v3', 'xmlns:xlink': 'http://www.w3.org/1999/xlink', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance', 'xsi:schemalocation': 

Finally, don't forget to close the file:

In [11]:
MODS_collection.close()

### Namespaces

In some cases, you may be working with data that has tags from various namespaces (that is, 
basically different tag schemas, such as MODS or EAD). For specificity, it can be 
important to have a list of the namespaces that you will reference. In this case, 
that list is a python dictionary named `ns`: 

In [12]:
ns = {
    'mods' : 'http://www.loc.gov/mods/v3',
    'ead3'  : 'http://ead3.archivists.org/schema/',
}

## First Steps to Navigate the Tree Using LXML / XPath

Load records, list tags and child elements, see subelements, display tags and attributes... 

## Examples

Here are some things we'll do using the BeautifulSoup library,
to be developed below in the notebook following:
 
1. Looking in a single XML file with multiple `<mods>` subelements, use a loop to count the elements. 
1. Using a similar loop to the above, can you go through the individual `<mods>` and extract the record's identifier?  
1. Let's get a bit deeper into MODS. These records use the `<titleInfo>` and `<title>` tags. Use the `.findall()` function to look into each of these and pull out the text of each `<title>` element. 
1. The above pulls out all the titles, including related titles. Use the `.find()` funtion to search for just the first instance and pull out only the main titles.  
1. These records contain `<subject>` designations, but only some of these correspond to headings that are authorized headings in the Library of Congress Subject Headings. Those are marked with an attribute `authority='lcsh'`, which is indicated as an embedded attribute in the tag. Look through `<subject>` tags, identify only the ones that include an LCSH attribute, then print the content of those subject headings.  
1. Data addition or modification: identify the local call numbers, then check to make sure all of them have appropriate attribute data attached.
1. Data validation: check the local call number references to ensure that they are in the proper format (e.g., _lcwaAddddddd_).
1. Save the updated metadata. In this case, write the udpated metadata to a new file.

#### Counting Records in the Set

Activity 1: How many metadata records for discrete items are included in the set? A compound MODS file may create multiple records in one file; in this scenario, the tag `<mods>` encloses each individual record, and the list of records is enclosed in a `<modsCollection>` tag. Use a loop to count the `<mods>` elements.

In [13]:
#BS4
record_count = 0

with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        print(mods.name, mods.title)
        record_count += 1

print(record_count)

mods <title>Slate Magazine</title>
mods <title>Raw Story</title>
mods <title>Huffington Post</title>
mods <title>BuzzFeed</title>
mods <title>Drudge Report</title>
5


#### Extract Item Identifiers

Activity 2: Each individual metadata record has at least one `<identifier>` element; this element is used to include a reference to the item, such as a URI, or another identifier that a system may use to locate an item. Using a loop similar to the example above, how would you print each record's identifier(s)?  

In [14]:
#BS4
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            print(identifier.name, identifier.text)

identifier lcwaN0010234
identifier 85999
identifier 109353
identifier http://www.slate.com/
identifier 15046
identifier lcwaN0001999
identifier 91224
identifier 109272
identifier http://rawstory.com/
identifier 2771
identifier lcwaN0003238
identifier 91275
identifier 109273
identifier 96782
identifier http://www.huffingtonpost.com/
identifier 4619
identifier lcwaN0010144
identifier nan
identifier https://medium.com/buzzfeed-collections
identifier 24463
identifier http://www.buzzfeed.com/
identifier 14906
identifier lcwaN0010145
identifier 82949
identifier 109227
identifier http://www.drudgereport.com/
identifier 14951


There are clearly different types of identifiers here, and when we check the identifier attributes, 
it is clear that some of these will be more useful than others. Below, use the `.attrs` method to see the dictionary that each element carries:

In [15]:
#BS4
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            print(identifier.name, identifier.text, identifier.attrs)

identifier lcwaN0010234 {}
identifier 85999 {'invalid': 'yes', 'type': 'database id'}
identifier 109353 {'invalid': 'yes', 'type': 'database id'}
identifier http://www.slate.com/ {'displaylabel': 'Access URL', 'type': 'uri'}
identifier 15046 {'type': 'database id'}
identifier lcwaN0001999 {}
identifier 91224 {'invalid': 'yes', 'type': 'database id'}
identifier 109272 {'invalid': 'yes', 'type': 'database id'}
identifier http://rawstory.com/ {'displaylabel': 'Access URL', 'type': 'uri'}
identifier 2771 {'type': 'database id'}
identifier lcwaN0003238 {}
identifier 91275 {'invalid': 'yes', 'type': 'database id'}
identifier 109273 {'invalid': 'yes', 'type': 'database id'}
identifier 96782 {'invalid': 'yes', 'type': 'database id'}
identifier http://www.huffingtonpost.com/ {'displaylabel': 'Access URL', 'type': 'uri'}
identifier 4619 {'type': 'database id'}
identifier lcwaN0010144 {}
identifier nan {'invalid': 'yes', 'type': 'database id'}
identifier https://medium.com/buzzfeed-collections {'

Since some of the elements do not have attributes, we need a try-except loop to look at each dictionary, 
and to generate a "Blank" value for elements without attributes:

In [16]:
#BS4
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            tag = identifier.name
            content = identifier.text
            try:
                type_ = identifier.attrs['type']
            except:
                type_ = "Blank type"
            print(tag, content, type_)

identifier lcwaN0010234 Blank type
identifier 85999 database id
identifier 109353 database id
identifier http://www.slate.com/ uri
identifier 15046 database id
identifier lcwaN0001999 Blank type
identifier 91224 database id
identifier 109272 database id
identifier http://rawstory.com/ uri
identifier 2771 database id
identifier lcwaN0003238 Blank type
identifier 91275 database id
identifier 109273 database id
identifier 96782 database id
identifier http://www.huffingtonpost.com/ uri
identifier 4619 database id
identifier lcwaN0010144 Blank type
identifier nan database id
identifier https://medium.com/buzzfeed-collections uri
identifier 24463 database id
identifier http://www.buzzfeed.com/ uri
identifier 14906 database id
identifier lcwaN0010145 Blank type
identifier 82949 database id
identifier 109227 database id
identifier http://www.drudgereport.com/ uri
identifier 14951 database id


Finally, let's print a clean list of only the URI identifiers:

In [17]:
#BS4
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier', type="uri"):
            print(identifier.attrs['type'], identifier.text)

uri http://www.slate.com/
uri http://rawstory.com/
uri http://www.huffingtonpost.com/
uri https://medium.com/buzzfeed-collections
uri http://www.buzzfeed.com/
uri http://www.drudgereport.com/


Now, we could try that another way using the `lxml` XML library directly:

In [18]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_5.xml')

for identifier in xml_records.findall('.//mods:identifier', namespaces=ns): 
    element = identifier
    print(element.tag, element.text, element.attrib)

{http://www.loc.gov/mods/v3}identifier lcwaN0010234 {}
{http://www.loc.gov/mods/v3}identifier 85999 {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier 109353 {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier http://www.slate.com/ {'displayLabel': 'Access URL', 'type': 'uri'}
{http://www.loc.gov/mods/v3}identifier 15046 {'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier lcwaN0001999 {}
{http://www.loc.gov/mods/v3}identifier 91224 {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier 109272 {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier http://rawstory.com/ {'displayLabel': 'Access URL', 'type': 'uri'}
{http://www.loc.gov/mods/v3}identifier 2771 {'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier lcwaN0003238 {}
{http://www.loc.gov/mods/v3}identifier 91275 {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}identifier 109273 {'

And, filter to identify only the URI elements...

Note the commands are similar, but the process is not exactly the same. Some of the 
methods to show elements, attributes, and other elements differ, and 
the syntax for searching and navigating the XML tree is slightly different, too!
The display is also slightly different when looking at the element tags, since
the `lxml` parser is very specific about the "namespace" (in this case, that is,
the rules that are specifying what goes in the MODS record and how it is
structured) of each tag.

In [19]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_5.xml')

for identifier in xml_records.findall('.//mods:identifier', namespaces=ns): 
    element = identifier
    attribs = element.attrib
    type = attribs.get('type')
    if type == 'uri':
        print(element.tag, type, element.text)

{http://www.loc.gov/mods/v3}identifier uri http://www.slate.com/
{http://www.loc.gov/mods/v3}identifier uri http://rawstory.com/
{http://www.loc.gov/mods/v3}identifier uri http://www.huffingtonpost.com/
{http://www.loc.gov/mods/v3}identifier uri https://medium.com/buzzfeed-collections
{http://www.loc.gov/mods/v3}identifier uri http://www.buzzfeed.com/
{http://www.loc.gov/mods/v3}identifier uri http://www.drudgereport.com/


#### Extract the record titles 

Activity 3: Each individual metadata record has at least one `<title>` element; this element is used to identyify an item's title. Using a loop similar to the example above, how would you print each record's title(s)?  

In [20]:
#BS4
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        for title in mods.find_all('title'):
            print(title.name, title.find_parent())

title <titleinfo><title>Slate Magazine</title></titleinfo>
title <titleinfo><title>General News on the Internet Web Archive</title></titleinfo>
title <titleinfo><title>Serial and Government Publications Division</title></titleinfo>
title <titleinfo><title>Raw Story</title></titleinfo>
title <titleinfo><title>General News on the Internet Web Archive</title></titleinfo>
title <titleinfo><title>Serial and Government Publications Division</title></titleinfo>
title <titleinfo><title>Huffington Post</title></titleinfo>
title <titleinfo><title>General News on the Internet Web Archive</title></titleinfo>
title <titleinfo><title>Serial and Government Publications Division</title></titleinfo>
title <titleinfo><title>BuzzFeed</title></titleinfo>
title <titleinfo><title>General News on the Internet Web Archive</title></titleinfo>
title <titleinfo><title>Serial and Government Publications Division</title></titleinfo>
title <titleinfo><title>Drudge Report</title></titleinfo>
title <titleinfo><title>

Note above that the `.find_parent()` method can be used to look "up" the tree, 
in this case displaying the parent element of the `title` element.

A similar result can be produced using the `lxml` library directly. In this case, 
the `.findall()` method is similar, but notice that the request can be given
using XPATH references and while specifying namespaces:

In [21]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_5.xml')

for titleInfo in xml_records.findall('.//mods:title', namespaces=ns): 
    element = titleInfo
    print(element.text)

Slate Magazine
General News on the Internet Web Archive
Serial and Government Publications Division
Raw Story
General News on the Internet Web Archive
Serial and Government Publications Division
Huffington Post
General News on the Internet Web Archive
Serial and Government Publications Division
BuzzFeed
General News on the Internet Web Archive
Serial and Government Publications Division
Drudge Report
General News on the Internet Web Archive
Serial and Government Publications Division


#### Extract only the Main Titles

Activity 4: Notice in the previous activity that even though we are working with only five
records, there are well more than five titles. Each of these records has multiple `title` elements,
some of which are for `relatedItem` elements. If we want only the main titles, use the `.find()` function 
to search for only the first instance and print out only the main title. An alternative way to do this in `lxml`
is to use a more specific XPath selector.

In [22]:
#BS4
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    for mods in metadata.find_all('mods'):
        title = mods.find('title')
        print(title.name, title.text)

title Slate Magazine
title Raw Story
title Huffington Post
title BuzzFeed
title Drudge Report


Try `lxml` to make a more specific XPath request. (Note: you can also use the `.find()` method in `lxml` to return only the first result.)

In [23]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_5.xml')

for title in xml_records.findall('.//mods:mods/mods:titleInfo/mods:title', namespaces=ns): 
    element = title
    print(element.text, element.tag)

Slate Magazine {http://www.loc.gov/mods/v3}title
Raw Story {http://www.loc.gov/mods/v3}title
Huffington Post {http://www.loc.gov/mods/v3}title
BuzzFeed {http://www.loc.gov/mods/v3}title
Drudge Report {http://www.loc.gov/mods/v3}title


Above, the query specifically asks for the `title` elements that are direct 
child elements of a `titleInfo` element, which is a child of the `mods` element. 
This is necessary to filter out any `titleInfo` elements that are actually under a
`relatedItem` element. With less specificity, the query will return numerous elements 
that "related" but not the title of the actual item: 

#### Exploring the Subject Element

Activity 5: These records contain `<subject>` designations, but only some of these correspond to headings that are authorized headings in the Library of Congress Subject Headings (LCSH). Those are indicated with an attribute `authority='lcsh'`, which is indicated as an embedded attribute in the tag. Look through the `<subject>` tags and identify only the ones that include an LCSH attribute, then print the content of those subject headings.

Note: this activity requires using the twenty-five record set rather than the one with five records.

As LCSH headings are generally constructed as a main topic word, followed by descriptors that indicate further
topical, geographic, or chronological details, note that the structure is mimicked here, with `<topic>` and
various `<genre>`, `<geographic>`, or other specifiers.

In [24]:
#BS4
metadata = BeautifulSoup(open('2018_lcwa_MODS_25.xml'), 'lxml')

for mods in metadata.find_all('mods'):
    for subject in mods.find_all('subject', authority="lcsh"):
        print(subject, subject.attrs, '\n')

<subject authority="lcsh">
<name authority="naf" type="corporate">
<namepart><!-- TODO: Insert name authority here (can be same as name authority above, under title). --></namepart>
</name>
</subject> {'authority': 'lcsh'} 

<subject authority="lcsh">
<topic>Animals</topic>
<genre>Pictorial works</genre>
</subject> {'authority': 'lcsh'} 

<subject authority="lcsh">
<name authority="naf" type="corporate">
<namepart><!-- TODO: Insert name authority here (can be same as name authority above, under title). --></namepart>
</name>
</subject> {'authority': 'lcsh'} 

<subject authority="lcsh">
<topic>Memes</topic>
</subject> {'authority': 'lcsh'} 

<subject authority="lcsh">
<name authority="naf" type="corporate">
<namepart><!-- TODO: Insert name authority here (can be same as name authority above, under title). --></namepart>
</name>
</subject> {'authority': 'lcsh'} 

<subject authority="lcsh">
<topic>Memes</topic></subject> {'authority': 'lcsh'} 

<subject authority="lcsh">
<name authority="

In [25]:
# demonstrating navigate in BS4 using dot notation (subject.topic) to go down in the tree
#BS4
metadata = BeautifulSoup(open('2018_lcwa_MODS_25.xml'), 'lxml')

for mods in metadata.find_all('mods'):
    for subject in mods.find_all('subject', authority="lcsh"):
        print(subject.topic)

None
<topic>Animals</topic>
None
<topic>Memes</topic>
None
<topic>Memes</topic>
None
<topic>Memes</topic>
None
<topic>Web portals</topic>
<topic>Political candidates</topic>
<topic>Elections</topic>
<topic>Politics and government</topic>
None
None
<topic>Political candidates</topic>
<topic>Elections</topic>
<topic>Politics and government</topic>
None
None
<topic>Political candidates</topic>
<topic>Elections</topic>
<topic>Politics and government</topic>
None
None
<topic>Political candidates</topic>
<topic>Elections</topic>
<topic>Politics and government</topic>
None
None
<topic>Political candidates</topic>
<topic>Elections</topic>
<topic>Politics and government</topic>
None
None


Using `lxml`, we can use XPath queries. (Again, remember to use the 25-record file, not the 5-record file.)

In [26]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_25.xml')

for subject in xml_records.findall('.//mods:mods/mods:subject', namespaces=ns): 
    element = subject
    print(element.tag, element.attrib)

{http://www.loc.gov/mods/v3}subject {'authority': 'keyword'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcwabt'}
{http://www.loc.gov/mods/v3}subject {'authority': 'keyword'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcwabt'}
{http://www.loc.gov/mods/v3}subject {'authority': 'keyword'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcwabt'}
{http://www.loc.gov/mods/v3}subject {'authority': 'keyword'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject {'authority': 'lcwabt'}
{http://www.loc.gov/mods/v3}subject {'authority': 'k

Similarly, we can filter to view only those with `lcsh` subject authorities:

In [27]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_25.xml')

for subject in xml_records.findall('.//mods:mods/mods:subject', namespaces=ns): 
    if subject.attrib['authority'] == 'lcsh':
        print(subject.tag, len(subject), subject.attrib)

{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 2 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 2 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 2 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 2 {'authority': 'lcsh'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh', 'displayLabel': 'United States Elections, 2014'}
{http://www.loc.gov/mods/v3}subject 1 {'authority': 'lcsh', 'displayLabel': 'United States Elections, 2014'}
{h

You may notice that the lxml tools are more literal, in a sense, 
meaning that they really only usually give you what you ask for. So, 
for example, unlike using BeautifulSoup when we can ask for the contents 
of a tag (that is, all of the text that is enclosed by the tag), 
`lxml` treats the metadata more like data. In this case, the `subject` tags don't
strictly contain any actual text (that is, nothing the parser recognizes as 
a string of characters), in fact they only contain more sublements, which then contain text. 
The structure looks something like this: 

```xml
<subject authority="lcsh">
    <topic>Animals</topic>
    <genre>Pictorial works</genre>
</subject>
```
So, as the `lxml` parser sees it, the "contents" of the `<subject>` tag 
are two subelements: `<topic>` and `<genre>`. 
To get the "text" or content of the element, we need to look at the attributes
(in this case, the authority type), and then to extract the subelements. 
Only when the subelements are obtained can we retrieve their text.

First, look for the contents of the `subject` element, the list of its subelements,
here identified as the "child" elements since they are further "down" in the tree:

In [28]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_25.xml')
count = 0 

for subject in xml_records.findall('.//mods:mods/mods:subject', namespaces=ns): 
    if subject.attrib['authority'] == 'lcsh':
        count += 1
        print(subject.tag, count, 'children:')
        for subelement in subject:
            print('  ',subelement.tag)
        print('\n')
        if count > 4:
            break

{http://www.loc.gov/mods/v3}subject 1 children:
   {http://www.loc.gov/mods/v3}name


{http://www.loc.gov/mods/v3}subject 2 children:
   {http://www.loc.gov/mods/v3}topic
   {http://www.loc.gov/mods/v3}genre


{http://www.loc.gov/mods/v3}subject 3 children:
   {http://www.loc.gov/mods/v3}name


{http://www.loc.gov/mods/v3}subject 4 children:
   {http://www.loc.gov/mods/v3}topic


{http://www.loc.gov/mods/v3}subject 5 children:
   {http://www.loc.gov/mods/v3}name




Finally, to extract the actual subject terms, we can request the text of the subelements:

In [29]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_25.xml')
count = 0

for subject in xml_records.findall('.//mods:mods/mods:subject', namespaces=ns): 
    if subject.attrib['authority'] == 'lcsh':
        count += 1
        print(subject.tag, count, 'children:')
        for subelement in subject:
            print('  {} - {}'.format(subelement.tag, subelement.text))
        print('\n')
        if count > 4:
            break

{http://www.loc.gov/mods/v3}subject 1 children:
  {http://www.loc.gov/mods/v3}name - 
            


{http://www.loc.gov/mods/v3}subject 2 children:
  {http://www.loc.gov/mods/v3}topic - Animals
  {http://www.loc.gov/mods/v3}genre - Pictorial works


{http://www.loc.gov/mods/v3}subject 3 children:
  {http://www.loc.gov/mods/v3}name - 
            


{http://www.loc.gov/mods/v3}subject 4 children:
  {http://www.loc.gov/mods/v3}topic - Memes


{http://www.loc.gov/mods/v3}subject 5 children:
  {http://www.loc.gov/mods/v3}name - 
            




#### Data Addition or Modification

Activity 6: Now that we can request things in the tree, let's look for 
more specific things, like content strings that meet certain criteria, 
then add or modify content to enhance them. 

Let's return to the `identifier` elements. Some of these are structured with
local call numbers, but those don't appear to be identified with any additional 
attributes:

```xml
  <identifier>lcwaN0010234</identifier>
  <identifier invalid="yes" type="database id">85999</identifier>
  <identifier invalid="yes" type="database id">109353</identifier>

```

Would it be possible to modify these and add a "type" attribute for those
local numbers? Let's start with BeautifulSoup:

In [30]:
#BS4

#BS4 - show the identifiers and their attributes
with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')

    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            print(identifier.name, identifier.text, identifier.attrs)

identifier lcwaN0010234 {}
identifier 85999 {'invalid': 'yes', 'type': 'database id'}
identifier 109353 {'invalid': 'yes', 'type': 'database id'}
identifier http://www.slate.com/ {'displaylabel': 'Access URL', 'type': 'uri'}
identifier 15046 {'type': 'database id'}
identifier lcwaN0001999 {}
identifier 91224 {'invalid': 'yes', 'type': 'database id'}
identifier 109272 {'invalid': 'yes', 'type': 'database id'}
identifier http://rawstory.com/ {'displaylabel': 'Access URL', 'type': 'uri'}
identifier 2771 {'type': 'database id'}
identifier lcwaN0003238 {}
identifier 91275 {'invalid': 'yes', 'type': 'database id'}
identifier 109273 {'invalid': 'yes', 'type': 'database id'}
identifier 96782 {'invalid': 'yes', 'type': 'database id'}
identifier http://www.huffingtonpost.com/ {'displaylabel': 'Access URL', 'type': 'uri'}
identifier 4619 {'type': 'database id'}
identifier lcwaN0010144 {}
identifier nan {'invalid': 'yes', 'type': 'database id'}
identifier https://medium.com/buzzfeed-collections {'

In [None]:
#BS4 - use regular expressions to identify the local identifiers or "call numbers"

with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    
    # set up a regex pattern
    call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')
    # alternatively, be more specific and look for the lcwa string at the beginning:
    # call_num_pattern = re.compile(r'^blcwaN\d{7}')

    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            if re.match(call_num_pattern, identifier.text):
                print(identifier.name, identifier.text, identifier.attrs)    

In [32]:
#BS4

# now, add in new attributes for these "local call number" elements
#BS4 - use regular expressions to identify the local identifiers or "call numbers"

with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    
    # set up a regex pattern
    call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')
    # alternatively, be more specific and look for the lcwa string at the beginning:
    # call_num_pattern = re.compile(r'^blcwaN\d{7}')

    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            if re.match(call_num_pattern, identifier.text):
                # add attributes by assigning values
                identifier['type'] = 'local_call_number'
                identifier['invalid'] = 'no'
                identifier['displaylabel'] = 'Local Call Number'
                #print, to make sure that these were added
                print(identifier.name, identifier.text, identifier.attrs)               

identifier lcwaN0010234 {'type': 'local_call_number', 'invalid': 'no', 'displaylabel': 'Local Call Number'}
identifier lcwaN0001999 {'type': 'local_call_number', 'invalid': 'no', 'displaylabel': 'Local Call Number'}
identifier lcwaN0003238 {'type': 'local_call_number', 'invalid': 'no', 'displaylabel': 'Local Call Number'}
identifier lcwaN0010144 {'type': 'local_call_number', 'invalid': 'no', 'displaylabel': 'Local Call Number'}
identifier lcwaN0010145 {'type': 'local_call_number', 'invalid': 'no', 'displaylabel': 'Local Call Number'}


In [33]:
#BS4 - check to make sure they look okay as XML

with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    
    call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')

    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            if re.match(call_num_pattern, identifier.text):
                # add attributes by assigning values
                identifier['type'] = 'local_call_number'
                identifier['invalid'] = 'no'
                identifier['displaylabel'] = 'Local Call Number'
                print(identifier.prettify())

<identifier displaylabel="Local Call Number" invalid="no" type="local_call_number">
 lcwaN0010234
</identifier>

<identifier displaylabel="Local Call Number" invalid="no" type="local_call_number">
 lcwaN0001999
</identifier>

<identifier displaylabel="Local Call Number" invalid="no" type="local_call_number">
 lcwaN0003238
</identifier>

<identifier displaylabel="Local Call Number" invalid="no" type="local_call_number">
 lcwaN0010144
</identifier>

<identifier displaylabel="Local Call Number" invalid="no" type="local_call_number">
 lcwaN0010145
</identifier>



In [34]:
#lxml identify the local call number identifiers

xml_records = ET.parse('2018_lcwa_MODS_5.xml')

# regex pattern to identify the call number:
call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')

for identifier in xml_records.findall('.//mods:mods/mods:identifier', namespaces=ns): 
    if re.match(call_num_pattern, identifier.text):
        print(identifier.text)

lcwaN0010234
lcwaN0001999
lcwaN0003238
lcwaN0010144
lcwaN0010145


In [35]:
#lxml insert attributes to make a more complete metadata record

xml_records = ET.parse('2018_lcwa_MODS_5.xml')

# regex pattern to identify the call number:
call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')

for identifier in xml_records.findall('.//mods:mods/mods:identifier', namespaces=ns): 
    if re.match(call_num_pattern, identifier.text):
        print(identifier.text)
        identifier.attrib['displaylabel'] = 'Local Call Number'
        identifier.attrib['invalid'] = 'no'
        identifier.attrib['type'] = 'local_call_number'
        print('  ',identifier.attrib)

lcwaN0010234
   {'displaylabel': 'Local Call Number', 'invalid': 'no', 'type': 'local_call_number'}
lcwaN0001999
   {'displaylabel': 'Local Call Number', 'invalid': 'no', 'type': 'local_call_number'}
lcwaN0003238
   {'displaylabel': 'Local Call Number', 'invalid': 'no', 'type': 'local_call_number'}
lcwaN0010144
   {'displaylabel': 'Local Call Number', 'invalid': 'no', 'type': 'local_call_number'}
lcwaN0010145
   {'displaylabel': 'Local Call Number', 'invalid': 'no', 'type': 'local_call_number'}


#### Data validation - ADD new sample file TODO

7 - Data validation: check the reference IDs to ensure that they are in the proper format, then identify for correction as needed. 

In [36]:
#lxml
#filter the IDs, then do a regex match?

xml_records = ET.parse('2018_lcwa_MODS_25.xml')

# previously we used a regex pattern to identify the call number:
# call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')
# this time, be more specific and look for the lcwa string at the beginning:
call_num_pattern = re.compile(r'^blcwa[A-Z]{1}\d{7}')

for identifier in xml_records.findall('.//mods:mods/mods:identifier', namespaces=ns): 
    print(identifier.text)
    if re.match(call_num_pattern, identifier.text):
        print(identifier.tag, identifier.text, identifier.attrib)

lcwaN0010234
85999
109353
lcwaN0001999
91224
109272
lcwaN0003238
91275
109273
96782
lcwaN0010144
nan
lcwaN0010145
82949
109227
lcwaN0012178
85778
lcwaN0012179
85779
lcwaN0012180
85780
lcwaN0012184
85784
lcwaN0012195
85795
lcwaN0010932
110933
lcwaN0010933
110934
lcwaN0010936
110937
lcwaN0010937
110938
lcwaN0010940
nan
lcwaN0010888
lcwaN0010226
lcwaN0009692
lcwaN0009700
lcwaN0010401
lcwaE0008846
lcwaE0008263
lcwaE0008338
lcwaE0008918
lcwaE0008001


#### Saving the Updated Metadata

Activity 8: Now, let's write the updated metadata to a new file. 

In [37]:
#BS4 - write out to a new file... 

newfile_name = '2018_lcwa_MODS_5_updated.xml'

with open('2018_lcwa_MODS_5.xml', 'r') as xml_records:
    metadata = BeautifulSoup(xml_records, 'lxml')
    
    call_num_pattern = re.compile(r'[a-z]{4}N\d{7}')

    for mods in metadata.find_all('mods'):
        for identifier in mods.find_all('identifier'):
            if re.match(call_num_pattern, identifier.text):
                # add attributes by assigning values
                identifier['type'] = 'local_call_number'
                identifier['invalid'] = 'no'
                identifier['displaylabel'] = 'Local Call Number'

    #new file
    with open(newfile_name, 'w') as updated_records:
        updated_records.write(metadata.prettify(formatter="minimal"))
        print("Wrote a new file, you're welcome!")

Wrote a new file, you're welcome!


This is the end of the activities section. 

Materials below this cell are draft code blocks that were either modified or prepared
for interactive discussions in class. 

############

OLD 

#### Data Addition or Modification

Old Activity 6: Now that we can request things in the tree, let's look for 
more specific things, like content strings that meet certain criteria, 
then add or modify content to enhance them. 

Above you may have noticed that some of the subject terms are blank. 
Looking at the tree, it is clear that some were left to complete later:

```xml
<subject authority="lcsh">
    <name authority="naf" type="corporate">
        <namePart><!-- TODO: Insert name authority here (can be same as name authority above, under title). --></namePart>
    </name>
</subject>
```

Let's find these and then replace the comments with different content:

======

These cells attempted to look for comments in the XML using the LXML parser:

In [None]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_25.xml')
count = 0

for subject in xml_records.findall('.//mods:mods/mods:subject', namespaces=ns): 
    if subject.attrib['authority'] == 'lcsh':
        count += 1
        print(subject.tag, count, 'children:')
        for subelement in subject:
            print('  {} - {}'.format(subelement.tag, subelement.text))
        print('\n')
        if count > 4:
            break

In [None]:
#lxml
xml_records = ET.parse('2018_lcwa_MODS_25.xml')
count = 0

for subject in xml_records.findall('.//mods:mods/mods:subject', namespaces=ns): 
    if subject.attrib['authority'] == 'lcsh':
        count += 1
        print(subject.tag, count, 'children:')
        for subelement in subject:
            print('  {} - {}'.format(subelement.tag, subelement.text))
            for subsubelement in subelement:
                print(subsubelement.tag, subsubelement.text)
        print('\n')
        if count > 4:
            break