# Notes on Things to do with MODS

This notebook walks through some basic operations for working with an XML document in Python. It uses XPath expressions and a dictionary to manage the XML namespaces within these records. XPath and namespaces are outside the scope of the notebook's exercises, but the sample code provided fills in all the XPath and handles namespace management. The steps include the following:

* Setup
* Examples, in the sequence they are illustrated in the notebook following.
  1. Looking in a single XML file with multiple `<mods>` subelements, use a loop to count the elements.
  2. Using a similar loop to the above, can you go through the individual `<mods>` and extract the record's identifier? 
  3. Let's get a bit deeper into MODS. These records use the `<titleInfo>` and `<title>` tags. Use the `.findall()` function to look into each of these and pull out the text of each `<title>` element.
  4. The above pulls out all the titles, including related titles. Use the `.find()` function to search for just the first instance and pull out only the main titles.
  5. These records contain `<subject>` designations, but only some of these correspond to authorized headings in the Library of Congress Subject Headings. Those are marked with the attribute `authority='lcsh'` on the tag. Look through the `<subject>` tags, identify only the ones that carry the LCSH attribute, then print the content of those subject headings.

## Setup

First, you'll want to get the libraries you need to parse XML in Python. That would be the ElementTree API. Basic documentation can be found here: https://docs.python.org/3/library/xml.etree.elementtree.html. You may also find uses for the lxml library, which allows for certain other options and has additional parsing advantages. That library is documented here: https://lxml.de/tutorial.html. Here is how to call the libraries:

In [1]:
import xml.etree.ElementTree as ET

try:
    from lxml import etree
    print('running with lxml.etree')
except ImportError:
    print("you're not running with lxml active")


running with lxml.etree


To do things with the XML material, we need to tell Python where it is, then bring it in as an object that Python can work with. In this example, we give the file path for the XML directly to Python (line 4 of the cell below), but in a more advanced, extensible example you would want to use an input to get the file or use a module like `os` to look for the XML files that you want.

Below, the code uses the ElementTree API to read the XML file into an object called `tree`. The `.getroot()` function then returns the top-level element of that tree, which is assigned to the `root` variable.

In [2]:
# You need to get the file. One way to do this would be to dynamically get it, from an input.
# But here, we will just pull it directly 

tree = ET.parse('2018_lcwa_MODS_25.xml')
root = tree.getroot()

## Namespaces can be tricky in XML. This establishes a dictionary that
## can be used later to prefix the appropriate namespaces using simple keys.
## For example, when the ns dictionary is invoked, mods: can be used as 
## a shorthand reference for the long URL to the mods namespace.
nspace = 'http://www.loc.gov/mods/v3'
ns = {'mods' : nspace}
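
To see what the `ns` dictionary is doing, here is a minimal sketch. It does not use the real file: `SAMPLE` and `sample_root` are hypothetical stand-ins, built from a small inline string so the cell runs on its own. It compares the long form of a namespaced tag, the `mods:` shorthand, and a bare tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record sample standing in for the real MODS file
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>Example One</title></titleInfo></mods>
  <mods><titleInfo><title>Example Two</title></titleInfo></mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
ns = {'mods': 'http://www.loc.gov/mods/v3'}

# Without the dictionary, the full namespace has to be spelled out in braces
long_form = sample_root.findall('{http://www.loc.gov/mods/v3}mods')

# With the dictionary, the 'mods:' prefix expands to the same namespace
short_form = sample_root.findall('mods:mods', namespaces=ns)

# A bare tag name matches nothing, because every element here is namespaced
no_namespace = sample_root.findall('mods')

print(len(long_form), len(short_form), len(no_namespace))
```

The first two searches find the same elements; the third comes back empty, which is the usual symptom of a forgotten namespace.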


## Examples: Pulling from XML

### Basics

* Basics: pull out title, ID, URL, seed
* Loop through files
* Pull elements into a list or dictionary
* Write information to a CSV
* detect if information is present/not present.

Look for title, ID, URL. 

In [3]:
countMODS = 0 

## The root tag, now assigned to "root" variable is <modsCollection>
## To iterate through the 25 records in the file, you can try the following.
## This code will loop through the <mods> child tags, print the name of the tag
## and also print a dictionary containing all the attributes of the <mods> tag.
for mods in root:
    countMODS = countMODS + 1
    print(mods.tag, mods.attrib)
    print(countMODS)

{http://www.loc.gov/mods/v3}mods {'version': '3.4', '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
1
{http://www.loc.gov/mods/v3}mods {'version': '3.4', '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
2
{http://www.loc.gov/mods/v3}mods {'version': '3.4', '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
3
{http://www.loc.gov/mods/v3}mods {'version': '3.4', '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
4
{http://www.loc.gov/mods/v3}mods {'version': '3.4', '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd'}
5
{http://www.loc.gov/mods/
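
As an alternative to keeping a manual counter, the sketch below collects the `<mods>` elements into a list first, so `len()` gives the count and `enumerate()` gives the running number. It assumes a small inline sample (`SAMPLE`, a hypothetical stand-in for the real file) so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-record sample standing in for the real MODS file
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods version="3.4"/>
  <mods version="3.4"/>
  <mods version="3.4"/>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
ns = {'mods': 'http://www.loc.gov/mods/v3'}

# findall() returns a list, so the count is just its length
records = sample_root.findall('mods:mods', namespaces=ns)
record_count = len(records)

# enumerate() supplies the running number that the manual counter provided
for position, record in enumerate(records, start=1):
    print(position, record.tag, record.attrib)

print('total records:', record_count)
```

Searching for `mods:mods` by name also skips any child of the root that is not a `<mods>` record, which a plain `for mods in root` loop would not.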

Let's try something a bit more complicated. Can you write a loop that goes through the individual records, as above, but pulls out each record's identifier?

Hint: knowing the MODS structure is very helpful! You can use a series of index references to count down to the content...

In [4]:
countMODS2 = 0

## If we take the first child element of each <mods> record,
## we can print its text and get a list of the
## record identifiers.
for mods in root:
    countMODS2 = countMODS2 + 1
    print(mods[0].text)
    if countMODS2 > 3:
        break

lcwaN0010234
lcwaN0001999
lcwaN0003238
lcwaN0010144
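
Index references like `mods[0]` depend on the identifier always being the first child. A slightly more defensive sketch looks the element up by name instead. This assumes the identifier lives in a `<mods:identifier>` tag directly under each record; the inline `SAMPLE` and its ID values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second record lists its identifier last,
# so an index lookup like record[0] would return the wrong element
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <identifier>sampleID0001</identifier>
    <titleInfo><title>Example One</title></titleInfo>
  </mods>
  <mods>
    <titleInfo><title>Example Two</title></titleInfo>
    <identifier>sampleID0002</identifier>
  </mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
ns = {'mods': 'http://www.loc.gov/mods/v3'}

identifiers = []
for record in sample_root.findall('mods:mods', namespaces=ns):
    # find() returns None when the record has no identifier child
    id_element = record.find('mods:identifier', namespaces=ns)
    if id_element is not None:
        identifiers.append(id_element.text)

print(identifiers)
```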


How could you pull out all the elements that are indicated as `<title>`?

In [5]:
## Pulling out information with .findall() on a known child
countMODS3 = 0
print(countMODS3)

## Note the following uses an XPath construction:
## './/' searches all descendants of the root at any depth, not just its direct children.
## For documentation on using the ns dictionary, see https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
for titleInfo in root.findall('.//mods:titleInfo/mods:title', namespaces=ns): 
    countMODS3 = countMODS3 + 1
    print(countMODS3)
    element = titleInfo
    print(element.text)

print(countMODS3)

0
1
Slate Magazine
2
General News on the Internet Web Archive
3
Serial and Government Publications Division
4
Raw Story
5
General News on the Internet Web Archive
6
Serial and Government Publications Division
7
Huffington Post
8
General News on the Internet Web Archive
9
Serial and Government Publications Division
10
BuzzFeed
11
General News on the Internet Web Archive
12
Serial and Government Publications Division
13
Drudge Report
14
General News on the Internet Web Archive
15
Serial and Government Publications Division
16
Life in this Girl's Army / New Lives - Blog
17
Iraq War 2003 Web Archive
18
Research and Reference Services Division
19
Sgt. Missick, A Line In The Sand - Blog
20
Iraq War 2003 Web Archive
21
Research and Reference Services Division
22
OIF - Operation Iraqi Freedom - Blog
23
Iraq War 2003 Web Archive
24
Research and Reference Services Division
25
Doc in the Box - Blog
26
Iraq War 2003 Web Archive
27
Research and Reference Services Division
28
Intel Dump - Blog
29
Ir
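
The repeated titles come from the `.//` at the start of the path. Here is a small sketch of the difference between `./` and `.//`, run against a made-up record. It assumes the related titles sit inside a `<relatedItem>` element, which is a guess about the structure of the real records.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with one main title and one nested, related title
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <titleInfo><title>Main Title</title></titleInfo>
    <relatedItem>
      <titleInfo><title>Related Title</title></titleInfo>
    </relatedItem>
  </mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
ns = {'mods': 'http://www.loc.gov/mods/v3'}
record = sample_root.find('mods:mods', namespaces=ns)

# './' only follows direct children of the record
direct = record.findall('./mods:titleInfo/mods:title', namespaces=ns)

# './/' searches every level beneath the record
any_depth = record.findall('.//mods:titleInfo/mods:title', namespaces=ns)

print([t.text for t in direct])     # ['Main Title']
print([t.text for t in any_depth])  # ['Main Title', 'Related Title']
```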

Note that in the above, there are multiple titles in some of the records. What if we just wanted to pull out the first instance? In these records, that would be the main title of the website. We can use the `.find()` function, which returns only the first match (`.findall()` returns every match).

In [6]:
## Block 4: Pull out the first instance of the title, so we can only get the main title
## This block uses .findall() to pull all of the mods subelements;
## Then, within each subelement, we use .find() to look for the first instance
## of titleInfo/title, which gives the first main title. We use the 'ns' dictionary
## to provide the namespacing since all of the mods subelements are within the 
## mods namespace, unless otherwise indicated. 
countMODS4 = 0

for item in root.findall('./mods:mods', namespaces=ns):
#    print(item)
    countMODS4 = countMODS4 + 1
    title = item.find('./mods:titleInfo/mods:title', namespaces=ns)
    print(countMODS4, title.text)

1 Slate Magazine
2 Raw Story
3 Huffington Post
4 BuzzFeed
5 Drudge Report
6 Life in this Girl's Army / New Lives - Blog
7 Sgt. Missick, A Line In The Sand - Blog
8 OIF - Operation Iraqi Freedom - Blog
9 Doc in the Box - Blog
10 Intel Dump - Blog
11 Official Campaign Web Site - Maithripala Sirisena
12 Tamil National Alliance (TNA) - Sri Lanka
13 Rajiva Wijesinha Blog - Sri Lanka
14 Liberal Party of Sri Lanka
15 Sri Lanka Guardian
16 Cute Overload! ;)
17 Homepage | Meme Generator
18 Internet Meme Database | Know Your Meme
19 YTMND: You're the man now dog!
20 Metafilter | Community Weblog
21 Official Campaign Web Site - Gregory John Orman
22 Official Campaign Web Site - Joan Elizabeth Farr
23 Official Campaign Web Site - Danny Page
24 Official Campaign Web Site - C. Salekin
25 Official Campaign Web Site - Scott J. Barnhart
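
One thing to watch: `.find()` returns `None` when nothing matches, so `title.text` would raise an error on a record without a title. All 25 records here have one, but a sketch of one way to detect missing information follows. It uses `.findtext()` with a default value; the inline sample and the `'[no title]'` placeholder are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second record has no titleInfo at all
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><titleInfo><title>Example One</title></titleInfo></mods>
  <mods><identifier>sampleID0002</identifier></mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
ns = {'mods': 'http://www.loc.gov/mods/v3'}

titles = []
for record in sample_root.findall('mods:mods', namespaces=ns):
    # findtext() returns the text of the first match,
    # or the default when no element matches
    title_text = record.findtext('mods:titleInfo/mods:title',
                                 default='[no title]', namespaces=ns)
    titles.append(title_text)

print(titles)
```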


In [7]:
## Block 5: Identify LCSH subject authority tags

countMODS5 = 0

# Loop through each of the MODS in the modsCollection root
for item in root.findall('./mods:mods', namespaces=ns):
    countMODS5 = countMODS5 + 1
    
    # Get the title
    title = item.find('./mods:titleInfo/mods:title', namespaces=ns)
    print(countMODS5, title.text)

    # Get the topic child element of subject subelements with authority="lcsh" attributes
    subject = item.findall('.//mods:subject[@authority="lcsh"]/mods:topic', namespaces=ns)
    if subject:
        # Print each LCSH topic with its position in the list.
        # A separate loop variable avoids overwriting the outer 'item'.
        for i, topic in enumerate(subject):
            print('term', i, topic.text)
    else:
        print('No subject tags found.')

1 Slate Magazine
No subject tags found.
2 Raw Story
No subject tags found.
3 Huffington Post
No subject tags found.
4 BuzzFeed
No subject tags found.
5 Drudge Report
No subject tags found.
6 Life in this Girl's Army / New Lives - Blog
No subject tags found.
7 Sgt. Missick, A Line In The Sand - Blog
No subject tags found.
8 OIF - Operation Iraqi Freedom - Blog
No subject tags found.
9 Doc in the Box - Blog
No subject tags found.
10 Intel Dump - Blog
No subject tags found.
11 Official Campaign Web Site - Maithripala Sirisena
No subject tags found.
12 Tamil National Alliance (TNA) - Sri Lanka
No subject tags found.
13 Rajiva Wijesinha Blog - Sri Lanka
No subject tags found.
14 Liberal Party of Sri Lanka
No subject tags found.
15 Sri Lanka Guardian
No subject tags found.
16 Cute Overload! ;)
term 0 Animals
17 Homepage | Meme Generator
term 0 Memes
18 Internet Meme Database | Know Your Meme
term 0 Memes
19 YTMND: You're the man now dog!
term 0 Memes
20 Metafilter | Community Weblog
term 0 W
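
The Basics list above mentions pulling elements into a list or dictionary and writing them to a CSV. Here is one possible sketch using the standard `csv` module. The sample records, the column names, and the `'; '` separator are assumptions for illustration. The output goes to an in-memory buffer so nothing is written to disk; an opened file (with `newline=''`) could be swapped in for `buffer`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample with invented identifiers
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <identifier>sampleID0001</identifier>
    <titleInfo><title>Example One</title></titleInfo>
    <subject authority="lcsh"><topic>Animals</topic></subject>
  </mods>
  <mods>
    <identifier>sampleID0002</identifier>
    <titleInfo><title>Example Two</title></titleInfo>
  </mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)
ns = {'mods': 'http://www.loc.gov/mods/v3'}

# Build one dictionary per record
rows = []
for record in sample_root.findall('mods:mods', namespaces=ns):
    topics = record.findall('.//mods:subject[@authority="lcsh"]/mods:topic',
                            namespaces=ns)
    rows.append({
        'identifier': record.findtext('mods:identifier', default='', namespaces=ns),
        'title': record.findtext('mods:titleInfo/mods:title', default='', namespaces=ns),
        # Join multiple topics into one cell; empty string when there are none
        'lcsh_topics': '; '.join(t.text or '' for t in topics),
    })

# Write the rows as CSV text into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['identifier', 'title', 'lcsh_topics'])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```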

### More complicated: multiple items

* use for loops, conditionals?
* QR: check to see if something is there
  * write a basic function that looks for lcsh?
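
For the last idea, a basic function that looks for LCSH headings might look something like the sketch below. The function name `lcsh_topics` and the inline sample are hypothetical; the XPath expression is the same one used in Block 5.

```python
import xml.etree.ElementTree as ET

ns = {'mods': 'http://www.loc.gov/mods/v3'}


def lcsh_topics(record, namespaces):
    """Return the text of every LCSH-authorized topic in one <mods> record."""
    path = './/mods:subject[@authority="lcsh"]/mods:topic'
    return [topic.text for topic in record.findall(path, namespaces=namespaces)]


# Hypothetical sample: one LCSH subject, one subject with no authority, one bare record
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><subject authority="lcsh"><topic>Memes</topic></subject></mods>
  <mods><subject><topic>Blogs</topic></subject></mods>
  <mods/>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)

results = []
for record in sample_root.findall('mods:mods', namespaces=ns):
    found = lcsh_topics(record, ns)
    results.append(found)
    # An empty list is falsy, so it doubles as a present/not present check
    print('has LCSH' if found else 'no LCSH', found)
```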

### Looking through multiple files (MODS in tree structure)

Here are some possible scenarios:

* practice for looping could be useful for practicing os work, too; reading in text from files in a list of directories
* alternative title is not in minimal records MODS
* check if there's related item, and if there's a handle (minimal MODS do not have a handle)
* QR on thumbnails. Look for identifier (LCWAN00XX.xml), and check if the thumbnail is there. The thumbnail ID may match the 
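
As a starting point for the looping practice, here is a rough sketch of reading every XML file in a folder. So that it runs anywhere, it first writes two tiny made-up files into a temporary directory; with real data you would point `Path(...)` at your own folder instead. It also assumes each file has a single `<mods>` root, which may not match how your files are organized.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

ns = {'mods': 'http://www.loc.gov/mods/v3'}

# Hypothetical one-record file template
RECORD = '<mods xmlns="http://www.loc.gov/mods/v3"><identifier>{id}</identifier></mods>'

found = {}
with tempfile.TemporaryDirectory() as folder:
    # Create two throwaway sample files to loop over
    for sample_id in ('sampleA', 'sampleB'):
        Path(folder, sample_id + '.xml').write_text(
            RECORD.format(id=sample_id), encoding='utf-8')

    # glob('*.xml') lists every XML file in the folder; sorted() fixes the order
    for xml_path in sorted(Path(folder).glob('*.xml')):
        file_root = ET.parse(xml_path).getroot()
        found[xml_path.name] = file_root.findtext('mods:identifier', namespaces=ns)

print(found)
```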


## Inserting things into XML

* working with strings, xml appends
* inserting information, say where there is not an lcsh
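
Here is a minimal sketch of the second idea, using `ET.SubElement()` to append a `<subject authority="lcsh">` to any record that lacks one. The sample records are made up, and `'PLACEHOLDER TERM'` is not a real heading; choosing the actual term would be a cataloging decision.

```python
import xml.etree.ElementTree as ET

MODS_NS = 'http://www.loc.gov/mods/v3'
ns = {'mods': MODS_NS}

# Ask ElementTree to write 'mods:' rather than 'ns0:' when serializing
ET.register_namespace('mods', MODS_NS)

# Hypothetical sample: only the first record has an LCSH subject
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><subject authority="lcsh"><topic>Animals</topic></subject></mods>
  <mods><titleInfo><title>Example Two</title></titleInfo></mods>
</modsCollection>"""

sample_root = ET.fromstring(SAMPLE)

added = 0
for record in sample_root.findall('mods:mods', namespaces=ns):
    existing = record.find('mods:subject[@authority="lcsh"]', namespaces=ns)
    if existing is None:
        # New elements need the full namespace in braces; the ns dictionary
        # only applies to searches, not to element creation
        subject = ET.SubElement(record, f'{{{MODS_NS}}}subject', {'authority': 'lcsh'})
        topic = ET.SubElement(subject, f'{{{MODS_NS}}}topic')
        topic.text = 'PLACEHOLDER TERM'
        added += 1

print('subjects added:', added)
print(ET.tostring(sample_root, encoding='unicode'))
```

The change only exists in memory at this point; saving it would take a further step such as wrapping the root in `ET.ElementTree(...)` and calling `.write()`.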