# Library of Congress MODS Exercise

## Lesson Outline
* What is XML?
* Introduction to BeautifulSoup
* Parse MODS XML with BeautifulSoup
* Loop through directory of XML files
* Find and replace in XML files
* Introduction to ElementTree 
* Learn how to explore XML using functions, loops, XPath

## What is XML?
* Let's briefly describe what it is.
* Let's look at an example.
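
As a minimal sketch of the vocabulary, the snippet below parses a tiny, made-up record with Python's built-in ElementTree (covered later in this lesson) and prints the three things that make up most XML: element tags, attributes, and text. The record and its values are hypothetical and are not taken from the MODS files.

```python
import xml.etree.ElementTree as ET

# A made-up record for illustration only (not a real MODS file).
sample = """<record>
  <title lang="eng">Example Blog</title>
  <genre authority="marcgt">web site</genre>
</record>"""

root = ET.fromstring(sample)
parts = []
for child in root:
    # tag = element name, attrib = dict of attributes, text = element content
    parts.append((child.tag, child.attrib, child.text))
    print(child.tag, child.attrib, child.text)
```

Elements nest inside one another, which is why the same tag/attribute/text pattern shows up at every level of the MODS file below.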

In [232]:
## let's view a MODS XML file
## import BeautifulSoup, open the file, read it into a string, print the XML using prettify
## more on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
from bs4 import BeautifulSoup

## the with statement closes the file automatically
with open('MODS/lcwaN0012195.xml', 'r') as file:
    xml = file.read()
print(BeautifulSoup(xml, "xml").prettify())

<?xml version="1.0" encoding="utf-8"?>
<mods version="3.4" xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">
 <identifier>
  lcwaN00 USA\
 </identifier>
 <identifier invalid="yes" type="database id">
  USA\
 </identifier>
 <titleInfo>
  <title>
   Intel Dump - Blog
  </title>
 </titleInfo>
 <language>
  <languageTerm authority="iso639-2b" type="code">
   eng
  </languageTerm>
 </language>
 <physicalDescription>
  <form authority="marcform">
   electronic
  </form>
  <internetMediaType>
   text/html
  </internetMediaType>
  <digitalOrigin>
   born digital
  </digitalOrigin>
 </physicalDescription>
 <targetAudience>
  general
 </targetAudience>
 <typeOfResource>
  text
 </typeOfResource>
 <genre authority="marcgt">
  web site
 </genre>
 <originInfo>
  <place>
   <placeTerm type="text">
    United Sta

## What is BeautifulSoup?
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python package for parsing HTML and XML documents. It builds a parse tree from a document, which you can then search and navigate to extract data; this makes it useful for web scraping. It is also widely used, so many tutorials and examples are available. For instance, the Programming Historian has a lesson:
https://programminghistorian.org/en/lessons/intro-to-beautiful-soup

Let's use it to parse XML...

In [218]:
## import BeautifulSoup, open and read XML file, use BeautifulSoup to find all title text
from bs4 import BeautifulSoup
with open("MODS/lcwaN0012195.xml", "r") as infile:
    contents = infile.read()
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
for title in titles:
    print(title.get_text())

Intel Dump - Blog
Iraq War 2003 Web Archive
Research and Reference Services Division


## Exercises
* Print only the first title from the list returned by 'soup.find_all("title")'.
* Use the script above to get the text of other elements.
* Can you use the same script to get the values of element attributes? Try the 'authority' attribute of 'languageTerm'.
* Extra credit: Can you use the same script to search by a particular attribute value? Try 'languageTerm' and 'authority' again, with the value 'iso639-2b'.
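
As a hedged starting point for the last two exercises, the sketch below reads an attribute value and then searches by attribute value. It uses a small inline fragment modeled on the 'languageTerm' element shown earlier rather than the full file, and, like the cells above, it assumes the lxml package is installed so that the "xml" parser is available.

```python
from bs4 import BeautifulSoup

# Inline fragment modeled on elements from the MODS file shown earlier.
sample = """<mods xmlns="http://www.loc.gov/mods/v3">
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
  <genre authority="marcgt">web site</genre>
</mods>"""

soup = BeautifulSoup(sample, "xml")

# Attribute lookup: a tag behaves like a dictionary of its attributes.
term = soup.find("languageTerm")
authority = term.get("authority")  # returns None if the attribute is absent
print(authority)

# Search by attribute value: pass the wanted attributes as a dictionary.
matches = soup.find_all("languageTerm", attrs={"authority": "iso639-2b"})
codes = [m.get_text(strip=True) for m in matches]
print(codes)
```

Using 'term.get("authority")' instead of 'term["authority"]' avoids a KeyError when an element lacks the attribute.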

## Loop through directory of XML files

In [220]:
## import operating system interface, BeautifulSoup, set path to files, loop through directory
## conditional for .xml, join path and filename, open and read XML file with BeautifulSoup
## find all titles in file but only print first title
import os
from bs4 import BeautifulSoup

path = 'MODS'
for filename in os.listdir(path):
    if filename.endswith('.xml'): 
        fullname = os.path.join(path, filename)
        with open(fullname, "r") as infile:
            contents = infile.read()
        soup = BeautifulSoup(contents, 'xml')
        titles = soup.find_all('title')
        ## guard against files that have no title element
        if titles:
            print(filename, titles[0].get_text())

lcwaN0009700.xml YTMND: You're the man now dog!
lcwaN0003238.xml Huffington Post
lcwaN0010888.xml Cute Overload! ;)
lcwaE0008001.xml Official Campaign Web Site - Scott J. Barnhart
lcwaN0012195.xml Intel Dump - Blog
lcwaE0008338.xml Official Campaign Web Site - Danny Page
lcwaE0008846.xml Official Campaign Web Site - Gregory John Orman
lcwaN0010226.xml Homepage | Meme Generator
lcwaE0008918.xml Official Campaign Web Site - C. Salekin
lcwaN0010144.xml BuzzFeed
lcwaN0010145.xml Drudge Report
lcwaN0012178.xml Life in this Girl's Army / New Lives - Blog
lcwaE0008263.xml Official Campaign Web Site - Joan Elizabeth Farr
lcwaN0012179.xml Sgt. Missick, A Line In The Sand - Blog
lcwaN0012184.xml Doc in the Box - Blog
lcwaN0010234.xml Slate Magazine
lcwaN0010401.xml Metafilter | Community Weblog
lcwaN0010940.xml Sri Lanka Guardian
lcwa00097019.xml PMDB : O PARTIDO DO BRASIL
lcwaN0010933.xml Tamil National Alliance (TNA) - Sri Lanka
lcwaN0010932.xml Official Campaign Web Site - Maithripala Sirisen

## Exercises
* Use the script above to return one or two more elements from each XML file.
* Now save the list of elements returned as a CSV file.
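
One possible approach to the CSV exercise is sketched below with Python's built-in csv module. The two rows are copied from the output above so the example stands on its own; in practice they would be appended to the list inside the directory loop. The output filename 'titles.csv' and the column names are assumptions made for illustration.

```python
import csv

# Two (filename, title) pairs copied from the output above; in practice
# these rows would be collected inside the directory loop.
rows = [
    ("lcwaN0003238.xml", "Huffington Post"),
    ("lcwaN0010888.xml", "Cute Overload! ;)"),
]

out_path = "titles.csv"  # hypothetical output filename

# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["filename", "title"])  # header row
    writer.writerows(rows)
```

The csv module handles quoting for you, which matters for titles that contain commas or quotation marks.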

## Find and replace in XML files

In [227]:
import os
from bs4 import BeautifulSoup
import re

path = 'MODS'
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        with open(fullname, "r") as infile:
            contents = infile.read()
        ## replace 'http' only when it is not already followed by 's',
        ## so re-running the cell does not produce 'httpss'
        filedata = re.sub(r'http(?!s)', 'https', contents)
        with open(fullname, 'w') as outfile:
            outfile.write(filedata)
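
A plain string replacement of 'http' also matches the 'http' inside 'https', so a text that already contains secure URLs, or a cell that is run twice, ends up with 'httpss'. The sketch below compares that naive replacement with a regular expression that uses a negative lookahead. The sample sentence is made up for illustration.

```python
import re

text = 'see http://www.loc.gov/ and https://www.w3.org/'

# Naive replacement also rewrites the "http" inside "https".
naive = text.replace('http', 'https')

# Guarded replacement: only match "http" when it is not followed by "s".
guarded = re.sub(r'http(?!s)', 'https', text)

# Running the guarded version a second time changes nothing further.
guarded_twice = re.sub(r'http(?!s)', 'https', guarded)

print(naive)
print(guarded)
```

Note that a blanket replacement like this also rewrites the namespace URIs in the 'mods' element, so it is worth checking the result before overwriting files you care about.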

## Exercise & Question
* Add 'USA' in 'physicalLocation' if it is missing from the address.
* Can you use something like what we have been exploring in your work?

## More on BeautifulSoup, Searching, and Modifying XML
https://code.tutsplus.com/tutorials/scraping-webpages-in-python-with-beautiful-soup-search-and-dom-modification--cms-28276

## Intro to ElementTree
Since XML is structured, you can work with it programmatically. Python has a built-in library called ElementTree, which provides functions for reading, searching, and modifying XML. Let's look at the Python documentation for [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html).

In [223]:
## import package and give it an alias
import xml.etree.ElementTree as ET

In [224]:
## get the root element tag
etree = ET.parse('MODS/lcwaN0012195.xml')
root = etree.getroot()
root.tag
## now try the attribute

'{http://www.loc.gov/mods/v3}mods'

In [42]:
## try looping through the children
for child in root:
    print(child.tag, child.attrib)

{http://www.loc.gov/mods/v3}identifier {}
{http://www.loc.gov/mods/v3}identifier {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}titleInfo {}
{http://www.loc.gov/mods/v3}language {}
{http://www.loc.gov/mods/v3}physicalDescription {}
{http://www.loc.gov/mods/v3}targetAudience {}
{http://www.loc.gov/mods/v3}typeOfResource {}
{http://www.loc.gov/mods/v3}genre {'authority': 'marcgt'}
{http://www.loc.gov/mods/v3}originInfo {}
{http://www.loc.gov/mods/v3}abstract {}
{http://www.loc.gov/mods/v3}relatedItem {'type': 'host'}
{http://www.loc.gov/mods/v3}relatedItem {'type': 'host'}
{http://www.loc.gov/mods/v3}relatedItem {'displayLabel': 'URL', 'type': 'constituent'}
{http://www.loc.gov/mods/v3}location {}
{http://www.loc.gov/mods/v3}location {}
{http://www.loc.gov/mods/v3}accessCondition {'type': 'restrictionOnAccess'}
{http://www.loc.gov/mods/v3}recordInfo {}


In [158]:
## iterate through all the children or subelements
for elem in root.iter():
    print(elem.tag)
    ## try to get the text from each element as well

{http://www.loc.gov/mods/v3}mods
{http://www.loc.gov/mods/v3}identifier
{http://www.loc.gov/mods/v3}identifier
{http://www.loc.gov/mods/v3}titleInfo
{http://www.loc.gov/mods/v3}title
{http://www.loc.gov/mods/v3}language
{http://www.loc.gov/mods/v3}languageTerm
{http://www.loc.gov/mods/v3}physicalDescription
{http://www.loc.gov/mods/v3}form
{http://www.loc.gov/mods/v3}internetMediaType
{http://www.loc.gov/mods/v3}digitalOrigin
{http://www.loc.gov/mods/v3}targetAudience
{http://www.loc.gov/mods/v3}typeOfResource
{http://www.loc.gov/mods/v3}genre
{http://www.loc.gov/mods/v3}originInfo
{http://www.loc.gov/mods/v3}place
{http://www.loc.gov/mods/v3}placeTerm
{http://www.loc.gov/mods/v3}abstract
{http://www.loc.gov/mods/v3}relatedItem
{http://www.loc.gov/mods/v3}titleInfo
{http://www.loc.gov/mods/v3}title
{http://www.loc.gov/mods/v3}relatedItem
{http://www.loc.gov/mods/v3}titleInfo
{http://www.loc.gov/mods/v3}title
{http://www.loc.gov/mods/v3}relatedItem
{http://www.loc.gov/mods/v3}identifier
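
For the "get the text from each element" prompt, one possible sketch is shown below. It runs on a small inline fragment instead of the full file, and it introduces a hypothetical helper, 'local_name', that strips the '{namespace}' prefix ElementTree puts in front of every tag so the output is easier to read.

```python
import xml.etree.ElementTree as ET

# Inline fragment modeled on elements from the MODS file shown earlier.
sample = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Intel Dump - Blog</title></titleInfo>
  <genre authority="marcgt">web site</genre>
</mods>"""

def local_name(tag):
    """Hypothetical helper: drop the '{namespace}' prefix from a tag."""
    return tag.split('}', 1)[-1]

root = ET.fromstring(sample)
pairs = []
for elem in root.iter():
    text = (elem.text or '').strip()  # elem.text is None for empty elements
    if text:  # skip container elements that hold only whitespace
        pairs.append((local_name(elem.tag), text))
        print(local_name(elem.tag), text)
```

Container elements such as 'titleInfo' have no text of their own, which is why the sketch skips anything that is empty after stripping whitespace.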

In [159]:
for url in root.iter('{http://www.loc.gov/mods/v3}url'):
    print(url.text)
## try another child element and print out the tag, attrib, text

http://cdn.loc.gov/service/webcapture/project_1/thumbnails/lcwaS0017477.jpg
http://www.loc.gov/item/lcwaN0012195


In [169]:
nspace = 'http://www.loc.gov/mods/v3'
ns = {'mods' : nspace}

for title in root.findall('.//mods:titleInfo/mods:title', namespaces=ns):
    print(title.text)

Intel Dump - Blog
Iraq War 2003 Web Archive
Research and Reference Services Division
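
The './/' search above returns every title in the record, including titles nested inside 'relatedItem'. The sketch below shows two ways to narrow it with the XPath subset that ElementTree supports: a path without './/' that matches direct children only, and an attribute predicate. The inline fragment is a simplified assumption about how the titles are nested, not a copy of the real file.

```python
import xml.etree.ElementTree as ET

# Simplified fragment; the nesting of titles is assumed for illustration.
sample = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Intel Dump - Blog</title></titleInfo>
  <relatedItem type="host">
    <titleInfo><title>Iraq War 2003 Web Archive</title></titleInfo>
  </relatedItem>
</mods>"""

ns = {'mods': 'http://www.loc.gov/mods/v3'}
root = ET.fromstring(sample)

# Direct children only: the record's own title, not titles of related items.
own = [t.text for t in root.findall('mods:titleInfo/mods:title', ns)]

# [@type='host'] keeps only relatedItem elements with that attribute value.
host = [t.text for t in root.findall(
    ".//mods:relatedItem[@type='host']/mods:titleInfo/mods:title", ns)]

print(own)
print(host)
```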


In [178]:
for child_of_root in root:
    print(child_of_root.tag, child_of_root.attrib)

{http://www.loc.gov/mods/v3}identifier {}
{http://www.loc.gov/mods/v3}identifier {'invalid': 'yes', 'type': 'database id'}
{http://www.loc.gov/mods/v3}titleInfo {}
{http://www.loc.gov/mods/v3}language {}
{http://www.loc.gov/mods/v3}physicalDescription {}
{http://www.loc.gov/mods/v3}targetAudience {}
{http://www.loc.gov/mods/v3}typeOfResource {}
{http://www.loc.gov/mods/v3}genre {'authority': 'marcgt'}
{http://www.loc.gov/mods/v3}originInfo {}
{http://www.loc.gov/mods/v3}abstract {}
{http://www.loc.gov/mods/v3}relatedItem {'type': 'host'}
{http://www.loc.gov/mods/v3}relatedItem {'type': 'host'}
{http://www.loc.gov/mods/v3}relatedItem {'displayLabel': 'URL', 'type': 'constituent'}
{http://www.loc.gov/mods/v3}location {}
{http://www.loc.gov/mods/v3}location {}
{http://www.loc.gov/mods/v3}accessCondition {'type': 'restrictionOnAccess'}
{http://www.loc.gov/mods/v3}recordInfo {}


In [188]:
print(root[5].tag, root[5].text)

{http://www.loc.gov/mods/v3}targetAudience general
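
Indexing by position, as in 'root[5]', depends on the order of elements in one particular file. As a sketch of a more robust alternative, the snippet below looks elements up by name with 'find' and 'findtext'. It uses a small inline fragment so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Inline fragment modeled on elements from the MODS file shown earlier.
sample = """<mods xmlns="http://www.loc.gov/mods/v3">
  <genre authority="marcgt">web site</genre>
  <targetAudience>general</targetAudience>
</mods>"""

ns = {'mods': 'http://www.loc.gov/mods/v3'}
root = ET.fromstring(sample)

# find() returns the first matching child, or None if there is no match.
audience = root.find('mods:targetAudience', ns)
audience_text = audience.text if audience is not None else None

# findtext() does the lookup and the None check in one call.
genre_text = root.findtext('mods:genre', default='', namespaces=ns)

print(audience_text, genre_text)
```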


**Interacting with MODS using ElementTree:**  
https://github.com/morskyjezek/LCWA-MODS/blob/master/Pulling-from-LCWA-MODS.ipynb  
https://github.com/morskyjezek/LCWA-MODS/blob/master/Get-Some-LC-Web-Archive-MODs.ipynb  

**Python XML with ElementTree: Beginner's Guide:**  
https://www.datacamp.com/community/tutorials/python-xml-elementtree

**Parsing Wikipedia XML**  
https://www.heatonresearch.com/2017/03/03/python-basic-wikipedia-parsing.html

**Webscraping:**  
https://librarycarpentry.github.io/lc-webscraping/02-xpath/index.html

**Requests tutorial:**  
https://www.dataquest.io/blog/python-api-tutorial/