# [LEGALST-190] Lab 3/13: Parsing XML Data

This lab will cover parsing XML and attribute lookup, XPath, and web scraping.

*Estimated Time: 45 Minutes*

### Topics Covered:
- XML syntax
- Locating content with XPath
- Web scraping

### Table of Contents
[The Data](#section data)<br>
1 - [XML Syntax](#section 1)<br>
2 - [Using XPath and ElementTree to parse XML](#section 2)<br>
3 - [Web Scraping](#section 3)<br>
4 - [Putting it all in a dataframe](#section 4)<br>

**Dependencies:**

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET #XML Parser (cElementTree was removed in Python 3.9)
from lxml import etree #ElementTree and lxml allow us to parse the XML file.
import requests #make request to server
import time #pause loop

----
## The Data<a id='section data'></a>

In this notebook, you'll be working with XML files from the Old Bailey API (https://www.oldbaileyonline.org/obapi/). These files contain the proceedings of all trials from 1674 to 1913. For this lab, we'll go through the trials from 1754-1756 and 1824-1826. XML (eXtensible Markup Language) provides a hierarchical representation of data contained within different tags and nodes. We'll go over XML syntax later. We will learn how to parse through these XML files from Old Bailey and grab information from sections of an XML file.

---

## Section 1: XML Syntax<a id='section 1'></a>

First, we'll go over the syntax of an XML file. The basic unit of XML code is called an "element" or "node" and has a start tag and an end tag. The tags for each element look something like this:

<p style="text-align: center;"> `<exampletag>some text</exampletag>`  </p>

Run the next cell to look at the XML file of one of the cases from the Old Bailey API!

In [2]:
#Don't worry about the code for now, we'll go through it later.
example = requests.get('https://www.oldbaileyonline.org/obapi/text?div=t17031013-13')
print(example.text)

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17031013-13">
               <interp inst="t17031013-13" type="collection" value="BAILEY"></interp>
               <interp inst="t17031013-13" type="year" value="1703"></interp>
               <interp inst="t17031013-13" type="uri" value="sessionsPapers/17031013"></interp>
               <interp inst="t17031013-13" type="date" value="17031013"></interp>
               <join result="criminalCharge" id="t17031013-13-off60-c52" targOrder="Y" targets="t17031013-13-defend52 t17031013-13-off60 t17031013-13-verdict64"></join>
         
               <p>
            
                  <persName id="t17031013-13-defend52" type="defendantName">
                  Samuel 
                  Davis
               <interp inst="t17031013-13-defend52" type="surname" value="Davis"></interp>
                     <interp inst="t17031013-13-defend52" type="given" value="Samuel"></interp>
                     <interp inst="t17031013-13-d

The `interp` tags at the beginning of the file are elements that don't have any plain text content. Elements are allowed to be empty like this. If an element is empty, its tag may also be written as `<exampletag/>`, which is equivalent to `<exampletag></exampletag>`.

Elements may also contain other elements, which we call "children". Most children are indented, but the indents aren't necessary in XML and are used for clarity to show nesting. For example, if we go down to `<persName id="t17031013-13-defend52" type="defendantName">`, we see that the `interp` tags are children of `persName`. We will explore children in XML more in the next section.

Lastly, elements may have attributes, which are in the format `<exampletag name_of_attribute="somevalue">`. Attributes are designed to store data related to a specific element. Attribute values **must** always be quoted (`name="value"`). As you can tell, in this XML file, attributes are everywhere!
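To make these three ideas concrete (text content, empty elements, attributes), here is a minimal sketch. The snippet is a hand-written, hypothetical fragment modeled on the `persName` element above, not a real API response, and it uses the `ElementTree` parser that the next section introduces properly.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the persName element shown above.
# One interp uses the short empty form <.../>, the other uses <...></...>.
snippet = (
    '<persName id="d1" type="defendantName">Samuel Davis'
    '<interp inst="d1" type="given" value="Samuel"/>'
    '<interp inst="d1" type="surname" value="Davis"></interp>'
    '</persName>'
)
person = ET.fromstring(snippet)

print(person.tag)     # the element name
print(person.attrib)  # the attributes, as a dictionary
print(person.text)    # the plain text that comes before the first child

# Both spellings of an empty element parse the same way: no text at all.
for child in person:
    print(child.tag, child.attrib['type'], child.attrib['value'], child.text)
```

Both `interp` children report `None` for `.text`, which is why their information has to be read from the attributes instead.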

-----
**Question 1.1:** What was the verdict of this case? Was there a punishment and if so, what was it? List both and state whether you found it as plain text content or as an attribute.

<b>The verdict was "guilty," which is both an attribute and is visible shortly thereafter as plain text. The punishment is an attribute, which was probably found in the Ordinary's Account--it was clearly reduced from death.</b>

----
## Section 2: Using XPath and `ElementTree` to parse XML<a id='section 2'></a>

Now that we know the syntax and structure of an XML file, let's figure out how to parse one! We are going to load the same file from the first section and use XPath (XML Path Language) to navigate through elements in this file.

XPath is designed to locate content in an XML file and uses a ["tree" structure](https://www.researchgate.net/profile/Roger_Moussalli/publication/257631377/figure/fig8/AS:297441854279689@1447927072768/Example-XML-Document-and-XML-Path-Queries-a-Example-XML-Document-b-XML-Tree.png) to extract specific chunks. XPath expressions are made up of "location steps" which are separated by forward slashes.

First, we need to import the file into an ElementTree instance. The ElementTree format will allow us to go through each element, sorting through tags so we can extract the data we want.

In [3]:
xml_file = 'data/old-bailey-example.xml'
tree = ET.ElementTree(file=xml_file)
tree

<xml.etree.ElementTree.ElementTree at 0x7ff96ac71080>

We're going to start working from the root of the tree as XML files have a tree structure. Let's load the root of our tree. 

In [4]:
root = tree.getroot()
print(root)

<Element 'div1' at 0x7ff934f28728>
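The root does not have to come from a file on disk. As a minimal sketch, assuming you already hold the XML as text (for example from a downloaded response), `ET.fromstring` gives you the root element directly. The `xml_text` string below is a shortened, hypothetical stand-in for a real trial file.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the text of a trial file.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<div1 type="trialAccount" id="t17031013-13"><p>some text</p></div1>'
)
xml_bytes = xml_text.encode('utf-8')

# Option 1: parse in-memory data; this returns the root element itself.
root_from_text = ET.fromstring(xml_bytes)

# Option 2: parse a file (here a file-like object); this returns an
# ElementTree, so the root still has to be requested with .getroot().
tree_from_file = ET.parse(io.BytesIO(xml_bytes))
root_from_file = tree_from_file.getroot()

print(root_from_text.tag, root_from_text.attrib['id'])
print(root_from_file.tag, root_from_file.attrib['id'])
```

Either way you end up with the same kind of `Element` object, so everything that follows in this section applies to both.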


Now that we have the root, we can start working down the tree! With the root, we can find each child of the root by printing the tags. This will also help us for future reference, if we ever want to go through other children in the XML file.

In [5]:
#get child tags from root
for child in root:
    print(child.tag)

interp
interp
interp
interp
join
p
p


Now that we have a list of children to work with, let's select one using `.find`. `.find` takes an XPath expression, which navigates through the hierarchical structure of the XML and helps us keep track of the path we are taking through this file.

In [6]:
choose_p = root.find('p')
for child in choose_p:
    print(child.tag)

persName
placeName
interp
interp
join
rs
persName
rs
join
rs
join
rs


This isn't very helpful, since we're still left with a bunch of tags and on top of that, we have a lot of repeating tags and names. Let's choose `placeName` as our next tag and see what happens. Notice that in our XPath expression, we are using forward slashes to navigate to the next child.

In [7]:
place_name = root.find('p/placeName')
for child in place_name:
    print(child.tag)

Nothing was printed, so it looks like we hit the end! Let's use `.text` to examine the data in this element, following the `.find` path we used to get here.

In [8]:
print(root.find('p/placeName').text)
#alternatively, print(place_name.text)

St. James Westminster


Looking back at the file from earlier, we found where the defendant was from. Let's see another feature of XPath we can use if, for instance, we know all of the possible children in the XML file.

With XPath, you can use a forward slash to move to the next element or child. So in our expression earlier, by following `p/placeName`, we located a `placeName` element that is a child of `p`. Another way to navigate using XPath is using a period and a double forward slash (`.//`), which looks anywhere down the tree from your current element. So, if we start at the root and want to find any element with the tag `placeName`, we can do the following:

In [9]:
print(root.find('.//placeName').text)

St. James Westminster


In [10]:
print(root.find('//placeName').text)

SyntaxError: cannot use absolute path on element (<string>)

**Question 2.1:** What happens if you don't have the period before the double slash? What happens if you change the starting element or use the whole XML file?

If you don't have the period, then it thinks it is the absolute path and gives you a syntax error, "cannot use absolute path on element". If you change the starting element, it would work depending on where you start (since it traverses from root to leaf, if you are below what you are looking for it won't find it); it should work on the whole tree, I think.

**Question 2.2:** Find the defendant's name by traversing through the correct elements. You can check your answer in the printed XML file from [section 1](#section 1).

**Tip:** `print` your final expression so that it looks pretty!

In [15]:
print(root.find('.//persName').text) 


                  Samuel 
                  Davis
               


***WARNING*:** If you want to use `//` to find matching elements anywhere below your current node, you need to add a period (`.//`), since the node you're currently at is most likely not the absolute element (the whole tree). If you want to try it out yourself, `root.find('//placeName')` should give you an error but `root.find('.//placeName')` should give you what you want.
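The path styles discussed in this section can be compared side by side on a small document. This is a minimal sketch: the `doc` string is a hypothetical, simplified stand-in for the trial file, and `.findall` (which returns every match rather than only the first) is included for contrast with `.find`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the trial file.
doc = ET.fromstring(
    '<div1>'
    '<p><persName>Samuel Davis</persName>'
    '<placeName>St. James Westminster</placeName></p>'
    '<p><persName>Catherine</persName></p>'
    '</div1>'
)

direct = doc.find('p/placeName')       # step by step: root -> p -> placeName
anywhere = doc.find('.//placeName')    # anywhere below the current element
missing = doc.find('placeName')        # None: placeName is not a direct child of the root
everyone = doc.findall('.//persName')  # a list of every match, not just the first

# An absolute path is rejected when searching from an element.
try:
    doc.find('//placeName')
except SyntaxError as err:
    error_message = str(err)

print(direct.text, '|', anywhere.text, '|', missing)
print([person.text for person in everyone])
print(error_message)
```

Note that `.find` returns `None` when nothing matches, so calling `.text` on the result of a wrong path fails with an `AttributeError` rather than a helpful message.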

----
Luckily, we can use `.getiterator()`, a really helpful method from ElementTree. It creates an object which will let us iterate through all elements in the file. Using this method is powerful, as we can print each element name utilizing `.tag` or see the data for each element with `.text` and `.attrib`. Note that `.getiterator()` was removed in Python 3.9; on newer versions, use `.iter()`, which takes the same optional tag argument.

We can use `.getiterator()` on `tree`, our ElementTree instance. We call it in the form:

<p style="text-align: center;"> `tree.getiterator(tag=None)`  </p>

If you don't specify what tag you want, it'll go through the first element it comes across in `tree` and then through its children and their children, etc. If you only want elements with a specific tag name, like `placeName`, you can pass it as the argument.
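As a minimal sketch of the difference between iterating over everything and iterating over one tag, the example below uses `.iter()`, the current name for the same traversal. The `demo_tree` document is a hypothetical, shortened stand-in for the trial file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the trial file.
demo_tree = ET.ElementTree(ET.fromstring(
    '<div1 id="t1">'
    '<p><persName type="defendantName">Samuel Davis</persName>'
    '<placeName id="loc1">St. James Westminster</placeName></p>'
    '</div1>'
))

# No tag given: every element, starting at the root, in document order.
all_tags = [element.tag for element in demo_tree.iter()]

# Tag given: only the elements with that tag.
place_texts = [element.text for element in demo_tree.iter('placeName')]

print(all_tags)
print(place_texts)
```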

Let's see how helpful `.getiterator()` can be! We'll call it on `tree` and print out the tag and attributes of each element.

In [16]:
iterator = list(tree.iter()) #same result as tree.getiterator(), which was removed in Python 3.9
for element in iterator:
    print(element.tag)
    print(element.attrib)
    print()

div1
{'type': 'trialAccount', 'id': 't17031013-13'}

interp
{'inst': 't17031013-13', 'type': 'collection', 'value': 'BAILEY'}

interp
{'inst': 't17031013-13', 'type': 'year', 'value': '1703'}

interp
{'inst': 't17031013-13', 'type': 'uri', 'value': 'sessionsPapers/17031013'}

interp
{'inst': 't17031013-13', 'type': 'date', 'value': '17031013'}

join
{'result': 'criminalCharge', 'id': 't17031013-13-off60-c52', 'targOrder': 'Y', 'targets': 't17031013-13-defend52 t17031013-13-off60 t17031013-13-verdict64'}

p
{}

persName
{'id': 't17031013-13-defend52', 'type': 'defendantName'}

interp
{'inst': 't17031013-13-defend52', 'type': 'surname', 'value': 'Davis'}

interp
{'inst': 't17031013-13-defend52', 'type': 'given', 'value': 'Samuel'}

interp
{'inst': 't17031013-13-defend52', 'type': 'gender', 'value': 'male'}

placeName
{'id': 't17031013-13-defloc59'}

interp
{'inst': 't17031013-13-defloc59', 'type': 'placeName', 'value': 'St. James Westminster'}

interp
{'inst': 't17031013-13-defloc59', 't

**Question 2.3:** Using iterator and the information of the tags above, find the names of the defendant and the plaintiff by getting the text out of each element. You can either use a conditional to specify a tag and use `.tag` for some element, or specify a tag in `.getiterator()`.

***Note:*** Because of the formatting in the XML file, you should only get the plaintiff's first name.

In [18]:
for element in iterator:
    if element.tag=='persName':
        print(element.text)


                  Samuel 
                  Davis
               

                  Catherine 
                  


What are their names? <b>Samuel Davis and Catherine Herbert, whose surname wasn't recorded in the proceedings but was interpellated later.</b>

**Question 2.4:** How do you think we can use `.attrib` to find their names? You don't have to code anything, just explain how you can using `.attrib`.

<b>Step through the tree. From the nodes 'persName' select for the value of the attribute "type" to be "defendantName" or "victimName" and then go to the child nodes and select for the value of the attributes "type" to be "given" and "surname",</b> but I am not at all sure how to code it or what the attributes of an element are--they look like a dictionary.
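One possible way to code the approach described above is sketched here. `.attrib` is indeed a regular Python dictionary, so `.get` works on it. The `sample` document is hypothetical and simplified, and the `victimName` type value and the helper name `names_by_role` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document; 'victimName' is an assumed type value.
sample = ET.fromstring(
    '<p>'
    '<persName id="def1" type="defendantName">Samuel Davis'
    '<interp inst="def1" type="surname" value="Davis"/>'
    '<interp inst="def1" type="given" value="Samuel"/>'
    '</persName>'
    '<persName id="vic1" type="victimName">Catherine'
    '<interp inst="vic1" type="surname" value="Herbert"/>'
    '<interp inst="vic1" type="given" value="Catherine"/>'
    '</persName>'
    '</p>'
)

def names_by_role(root):
    """Map each persName 'type' attribute to a 'given surname' string."""
    names = {}
    for person in root.iter('persName'):
        role = person.attrib.get('type')
        # Collect the interp children as {type: value}, e.g. {'given': 'Samuel'}.
        parts = {i.attrib.get('type'): i.attrib.get('value')
                 for i in person.findall('interp')}
        # Skip any part that is missing instead of printing 'None'.
        full_name = ' '.join(p for p in (parts.get('given'), parts.get('surname')) if p)
        names[role] = full_name
    return names

print(names_by_role(sample))
```

Reading the names from attributes avoids the whitespace and line breaks that come with `.text`, and it recovers a surname even when the running text does not contain it.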

**Question 2.5:** Use `.getiterator()` again, and a new method, `.itertext()`, to get the entire text of the proceeding. The `.itertext()` method returns all the inner text of an element and of every child nested inside it.

**Hint:** Find the tag that will return you the entire text of the trial and a way to join all the text from the file together.

<sub>***Note:*** The text in these XML files is a little wonky, so if the printed text doesn't look well formatted, it's ok.</sub>

In [19]:
iterator = list(tree.iter()) #same result as tree.getiterator(), which was removed in Python 3.9
for element in iterator:
    if element.tag == 'p':
        print(''.join(list(element.itertext())))
# this is what I hate about python! you are iterating, putting in a list, and then making a string starting 
#    with an empty string, all at the same time, and it is just not how I think


            
                  
                  Samuel 
                  Davis
               
                     
                     
                  
            , of the Parish of St. James Westminster
                  
                  
                  , was indicted for 
                     
                     
               feloniously Stealing 58 Diamonds set in Silver gilt, value 250 l.
             the Goods of the Honourable 
               
                  Catherine 
                  Lady
                      
                  Herbert
               
                     
                     
                     
                  
            , on the 28th of July
                   last. It appeared that the Jewels were put up in a Closet, which was lockt, and the Prisoner being a Coachman
                   in the House, took his opportunity to take them; the Lady, when missing them, offered a Reward of Fourscore Pounds to any that could give any 

**Question 2.6:** Since the textual data is pretty messy in the XML files of these proceedings, where do you think the data you need might be held and how might you go about extracting this data? 

It depends on what you are doing; some of the data are in the attributes dictionary for each node, and some might actually be in the text and so you would need to strip the text out and put it somewhere as in a csv column
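A minimal sketch of that idea: pull the structured values out of the `interp` attributes and keep a cleaned-up copy of the running text next to them, so that each trial becomes one row. The `trial` document below is a hypothetical, shortened version of the example file, and the `record` layout is only one possible choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the example trial, messy whitespace included.
trial = ET.fromstring(
    '<div1 type="trialAccount" id="t17031013-13">'
    '<interp inst="t17031013-13" type="year" value="1703"></interp>'
    '<interp inst="t17031013-13" type="date" value="17031013"></interp>'
    '<p>\n   <persName>\n  Samuel \n  Davis\n </persName>\n , of the Parish of '
    '<placeName>St. James Westminster</placeName>\n , was indicted</p>'
    '</div1>'
)

record = {'id': trial.attrib['id']}

# Structured data: the interp elements directly under the root.
for interp in trial.findall('interp'):
    record[interp.attrib['type']] = interp.attrib['value']

# Free text: join every text fragment, then collapse runs of whitespace.
raw_text = ''.join(trial.find('p').itertext())
record['text'] = ' '.join(raw_text.split())

print(record)
```

A list of such dictionaries can later be handed to `pd.DataFrame` to get one column per key.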

----
## Section 3: Web Scraping<a id='section 3'></a>

We learned how to parse one XML file. The Old Bailey API has a total of **197751** cases. Fortunately, we are only going to use the ones from 1754-1756 and 1824-1826, but that still leaves 6506 cases!

Don't worry though, you're not going to manually download each case yourself. This is where web scraping comes into play. With web scraping, we can automate data collection to get all 6506 cases. 

Before we start scraping, we need to know how `requests` works. The `requests` library sends a request to a web server (`.get`!) and gives you back a response object. It automatically decodes the content from the server, so you can use `.text` to see the document! Requests for a trial's text through the Old Bailey API return an XML file, which we can then write to disk and save.

Let's take a look at all of the terms we can use to choose the specific cases we want. We use `.json()` here since the parameters are stored as a JSON object.

In [20]:
requests.get('http://www.oldbaileyonline.org/obapi/terms').json()

[{'name': 'trialtext', 'type': 'text'},
 {'name': 'defgen',
  'terms': ['female', 'indeterminate', 'male'],
  'type': 'select'},
 {'name': 'offcat',
  'terms': ['breakingPeace',
   'damage',
   'deception',
   'kill',
   'miscellaneous',
   'royalOffences',
   'sexual',
   'theft',
   'violentTheft'],
  'type': 'select'},
 {'name': 'offsubcat',
  'terms': ['',
   'animalTheft',
   'arson',
   'assault',
   'assaultWithIntent',
   'assaultWithSodomiticalIntent',
   'bankrupcy',
   'barratry',
   'bigamy',
   'burglary',
   'coiningOffences',
   'concealingABirth',
   'conspiracy',
   'embezzlement',
   'extortion',
   'forgery',
   'fraud',
   'gameLawOffence',
   'grandLarceny',
   'habitualCriminal',
   'highwayRobbery',
   'housebreaking',
   'illegalAbortion',
   'indecentAssault',
   'infanticide',
   'keepingABrothel',
   'kidnapping',
   'libel',
   'mail',
   'manslaughter',
   'murder',
   'other',
   'perjury',
   'pervertingJustice',
   'pettyLarceny',
   'pettyTreason',
   '

If you wanted to explore the full list in your web browser, click [this link](https://www.oldbaileyonline.org/obapi/terms). 

Now that you've had a chance to look through some of the terms, let's see how to grab the specific XML files.

Clicking the URL below returns a JSON object describing every trial that contains the term "sheffield", falls under the offence category "deception", and dates from June 14th, 1847 onward. The object reports the number of matching trials and how often each term of the requested breakdown (here, the offence subcategory) occurs among them. It also lists the matching trial IDs; the `count` parameter in this case returns 10 trial IDs, but if left unspecified, the API will return a maximum of 1000 IDs.

https://www.oldbaileyonline.org/obapi/ob?term0=trialtext_sheffield&term1=offcat_deception&term2=fromdate_18470614&breakdown=offsubcat&count=10&start=0

Although the terms for time are listed as numbers, the format for the term is
`fromdate_(starting date)` and `todate_(ending date)` without the parentheses.
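Query URLs like the one above can also be assembled in code instead of by hand. The sketch below follows the pattern visible in the example URL (numbered `term0`, `term1`, ... parameters plus `count` and `start`); the helper name `build_search_url` is hypothetical, and only the standard library is used, so nothing is sent to the server.

```python
from urllib.parse import urlencode

OB_SEARCH = 'https://www.oldbaileyonline.org/obapi/ob'

def build_search_url(filters, count=10, start=0):
    """Build a search URL from filters such as 'fromdate_17540101'.

    Each filter is numbered term0, term1, ... in the order given,
    mirroring the example URL above.
    """
    params = {'term' + str(i): term for i, term in enumerate(filters)}
    params['count'] = count
    params['start'] = start
    return OB_SEARCH + '?' + urlencode(params)

url = build_search_url(['fromdate_17540101', 'todate_17561231'])
print(url)
```

The resulting string can be passed to `requests.get(...)` in the same way as a hand-written URL.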

**Question 3.1:** Use `requests.get(...)` to get all the trial IDs between the years 1754 and 1756 and return them as a JSON object.

In [21]:
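# Caution: without the term0=/term1= prefixes used in the example URL, the date filters appear not to be applied (the hits below start in 1674).
# Filtered form: 'http://www.oldbaileyonline.org/obapi/ob?term0=fromdate_17540101&term1=todate_17561231'
trials = requests.get('http://www.oldbaileyonline.org/obapi/ob?fromdate_17540101&todate_17561231').json()
trials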

{'hits': ['t16740429-1',
  't16740429-2',
  't16740429-3',
  't16740429-4',
  't16740429-5',
  't16740429-6',
  't16740429-7',
  't16740429-8',
  't16740429-9',
  't16740717-1',
  't16740717-2',
  't16740717-3',
  't16740717-4',
  't16740717-5',
  't16740717-6',
  't16740909-1',
  't16740909-2',
  't16740909-3',
  't16740909-4',
  't16740909-5',
  't16740909-6',
  't16741014-1',
  't16741014-2',
  't16741014-3',
  't16741014-4',
  't16741014-5',
  't16741014-6',
  't16741014-7',
  't16741014-8',
  't16741212-1',
  't16741212-2',
  't16741212-3',
  't16741212-4',
  't16741212-5',
  't16741212-6',
  't16741212-7',
  't16750115-1',
  't16750115-2',
  't16750115-3',
  't16750115-4',
  't16750219-1',
  't16750219-2',
  't16750219-3',
  't16750219-4',
  't16750414-1',
  't16750414-2',
  't16750414-3',
  't16750414-4',
  't16750414-5',
  't16750414-6',
  't16750414-7',
  't16750707-1',
  't16750707-2',
  't16750707-3',
  't16750707-4',
  't16750707-5',
  't16750707-6',
  't16750707-7',
  't16

Now, let's pick some trials from `trials['hits']`, so we have a list of IDs we can work with.

**Question 3.2:** Select the first 10 trials by slicing the list that we retrieved from the previous cell.

In [22]:
first_10 = trials['hits'][:10]
first_10

['t16740429-1',
 't16740429-2',
 't16740429-3',
 't16740429-4',
 't16740429-5',
 't16740429-6',
 't16740429-7',
 't16740429-8',
 't16740429-9',
 't16740717-1']

Using the trial IDs from the previous cell, we are going to format the URL in a way so that we can get the XML file for each trial. In order to get the XML file using the Old Bailey API, we must follow this URL format:

<p style="text-align: center;">`http://www.oldbaileyonline.org/obapi/text?div=(enter trial ID here without parentheses)`  </p>

For example, http://www.oldbaileyonline.org/obapi/text?div=t16740429-1 gives you the link to the XML file of the first proceeding in the database.
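The URL format can be wrapped in a small helper so that every trial ID is turned into its link (and a matching file name) the same way. This is only a sketch: the function names and the `data/old-bailey` folder are assumptions for illustration, and no request is made here.

```python
from pathlib import Path

TEXT_ENDPOINT = 'http://www.oldbaileyonline.org/obapi/text?div='

def trial_url(trial_id):
    """Return the URL of the XML file for one trial ID."""
    return TEXT_ENDPOINT + trial_id

def trial_path(trial_id, folder='data/old-bailey'):
    """Return a local path for the trial; the folder name is an assumed example."""
    return Path(folder) / ('old-bailey-' + trial_id + '.xml')

sample_ids = ['t16740429-1', 't16740429-2']
urls = [trial_url(trial_id) for trial_id in sample_ids]
paths = [trial_path(trial_id) for trial_id in sample_ids]

for url, path in zip(urls, paths):
    print(url, '->', path.name)
```

Deriving the file name from the same ID that built the URL keeps each saved file labeled with the trial it actually contains.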


**Question 3.3:** Get the XML file of the first trial in `first_10`. A successful `.get` request returns `<Response [200]>`.

In [25]:
response = requests.get('http://www.oldbaileyonline.org/obapi/text?div=t16740429-1')

Run the next cell to see the XML format of the text! 

In [26]:
print(response.text)

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t16740429-1">
               <interp inst="t16740429-1" type="collection" value="BAILEY"></interp>
               <interp inst="t16740429-1" type="year" value="1674"></interp>
               <interp inst="t16740429-1" type="uri" value="sessionsPapers/16740429"></interp>
               <interp inst="t16740429-1" type="date" value="16740429"></interp>
               <join result="criminalCharge" id="t16740429-1-off1-c3" targOrder="Y" targets="t16740429-1-defend3 t16740429-1-off1 t16740429-1-verdict4"></join>
         
               <p>
                  <xptr type="pageFacsimile" doc="16740429003"></xptr>The first thing I shall shew thee is about a <rs id="t16740429-1-off1" type="offenceDescription">
                     <interp inst="t16740429-1-off1" type="offenceCategory" value="violentTheft"></interp>
                     <interp inst="t16740429-1-off1" type="offenceSubcategory" value="highwayRobbery"></interp>
   

We can save the XML file:

In [27]:
trial_number = 't16740429-1' #trial ID of the response above (make sure it's a string)
with open('data/old-bailey/old-bailey-' + trial_number + '.xml', 'w', encoding='utf-8') as file:
    file.write(response.text)

### Challenge: Scraping all trials from 1754 - 1756

Now that you know how to find the trial IDs for certain parameters as well as get an XML file using `requests.get(some_url)`, iterate through each ID in the list of trials (use `trials['hits']` for the list of IDs) we got from 1754-1756 earlier. You can choose how many trials you want to save.

In [28]:
len(trials['hits'])

1000
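A length of exactly 1000 matches the maximum number of IDs the API returns per request, so a larger result set would have to be collected in several requests. The sketch below only computes the `start` values for those requests; it assumes `start` is a zero-based offset, as the `start=0` in the example URL suggests, and the helper name `page_starts` is hypothetical.

```python
def page_starts(total, page_size=1000):
    """Return the start offsets needed to cover `total` results."""
    return list(range(0, total, page_size))

# 6506 is the number of cases quoted earlier for the two periods combined.
starts = page_starts(6506)
print(starts)
print(len(starts), 'requests needed')
```

Each offset would then go into the `start` parameter of one query URL, with the `hits` lists of all responses joined together afterwards.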

In [31]:
#SOLUTION
for case in trials['hits']:
    #format URL
    case_url = 'http://www.oldbaileyonline.org/obapi/text?div=' + case

    #get text from URL
    case_text = requests.get(case_url)
    print('getting case url: ', case_url)
    #save the file **store in data/old-bailey/file_name
    with open('data/old-bailey/old-bailey-' + case + '.xml', 'w', encoding='utf-8') as file:
        file.write(case_text.text)
    #one second pause so servers aren't overloaded
    time.sleep(1)

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740429-9
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740717-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740717-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740717-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16740717-4

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-9
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-10
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-11
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-12
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16760405-13
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t167604

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770711a-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770711a-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770711a-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770711a-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770711a-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770711a-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770906-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770906-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770906-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770906-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770906-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16770906-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t1677

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-9
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-10
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-11
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16781016-12
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t1678101

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790605-12
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790605-13
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790605-14
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790605-15
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790605-16
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790716-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16790

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800421-10
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800421-11
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800421-12
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800421-13
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800421-14
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800526-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16800

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810228-17
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-9
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-10
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16810413-11
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t1681041

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820116a-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820116a-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820116a-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820116a-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820223-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t168202

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-11
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-12
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-13
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-14
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-15
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-16
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-17
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-18
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-19
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16820906a-20
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16821206-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16821206-2
getting case url:  http://www.oldbaileyonline.org/obapi/

getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418-14
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-1
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-2
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-3
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-4
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-5
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-6
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-7
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-8
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-9
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-10
getting case url:  http://www.oldbaileyonline.org/obapi/text?div=t16830418a-11
getting case url:  http://www.oldbaileyonline.org/obapi/text?d

You can check if you saved the XML files by executing the cell below!

In [32]:
!ls data/old-bailey/

old-bailey-s16781211e-1.xml   old-bailey-t16800910-3.xml
old-bailey-t16740429-1.xml    old-bailey-t16800910-4.xml
old-bailey-t16740429-2.xml    old-bailey-t16800910-5.xml
old-bailey-t16740429-3.xml    old-bailey-t16800910-6.xml
old-bailey-t16740429-4.xml    old-bailey-t16800910-7.xml
old-bailey-t16740429-5.xml    old-bailey-t16800910-8.xml
old-bailey-t16740429-6.xml    old-bailey-t16800910-9.xml
old-bailey-t16740429-7.xml    old-bailey-t16800910a-1.xml
old-bailey-t16740429-8.xml    old-bailey-t16800910a-2.xml
old-bailey-t16740429-9.xml    old-bailey-t16800910a-3.xml
old-bailey-t16740717-1.xml    old-bailey-t16800910a-4.xml
old-bailey-t16740717-2.xml    old-bailey-t16800910a-5.xml
old-bailey-t16740717-3.xml    old-bailey-t16801013-1.xml
old-bailey-t16740717-4.xml    old-bailey-t16801013-2.xml
old-bailey-t16740717-5.xml    old-bailey-t16801013-3.xml
old-bailey-t16740717-6.xml    old-bailey-t16801013-4.xml
old-bailey-t16740909-1.xml    old-bailey-t16801013-5.xml
old-baile
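The same check can be done from Python rather than with a shell command. This is a minimal sketch: the helper name `list_saved_trials` is hypothetical, and it assumes the files follow the `old-bailey-<id>.xml` naming shown in the listing above.

```python
from pathlib import Path

def list_saved_trials(directory="data/old-bailey"):
    """Return the sorted names of the saved trial XML files in `directory`."""
    folder = Path(directory)
    # If the directory does not exist yet, nothing has been saved
    if not folder.is_dir():
        return []
    # Keep only files that match the naming pattern used when scraping
    return sorted(p.name for p in folder.glob("old-bailey-*.xml"))

saved = list_saved_trials()
print(len(saved), "files saved")
print(saved[:5])  # peek at the first few names
```

An empty list here would suggest that the scraping cell did not write any files, or that the notebook is running from a different working directory.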

The next cell prints the contents of one of the saved XML files.

In [33]:
!cat data/old-bailey/old-bailey-t17540116-1.xml

<?xml version="1.0" encoding="UTF-8"?>
<div1 type="trialAccount" id="t17540116-1">
               <interp inst="t17540116-1" type="collection" value="BAILEY"></interp>
               <interp inst="t17540116-1" type="year" value="1754"></interp>
               <interp inst="t17540116-1" type="uri" value="sessionsPapers/17540116"></interp>
               <interp inst="t17540116-1" type="date" value="17540116"></interp>
               <join result="criminalCharge" id="t17540116-1-off2-c29" targOrder="Y" targets="t17540116-1-defend30 t17540116-1-off2 t17540116-1-verdict4"></join>
         
               <p>80. 
               
                  <persName id="t17540116-1-defend30" type="defendantName">
                     Hannah 
                     Ash 
                  <interp inst="t17540116-1-defend30" type="surname" value="Ash"></interp>
                     <interp inst="t17540116-1-defend30" type="given" value="Hannah"></interp>
                     <interp inst="t

----
## Section 4: Putting it all in a dataframe<a id='section 4'></a>

Now that we have a set of XML files and know how to parse them, let's put their data into a dataframe. As you probably saw earlier when printing the text of a court proceeding, the text is very messy. Feel free to process the text yourself, but for this last section we'll build the dataframe from the attribute values only.

**Question 4.1:** Complete the body of the function `table_of_cases`, which returns a dataframe with each attribute "type" as a column label and the corresponding "value" in that column. Make sure to account for trials that have more or fewer attributes than others (e.g. one trial has two defendants, while another has only one). The body of the function is structured for you.

**Tips:** Open up different trials to see all the "type" keys in the attributes. Which tag contains the attributes with information you can use? And how will you handle "type" keys that appear more than once (e.g. surname, given, etc.) so that you don't overwrite the value already stored in the column with the same key?
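One way to explore the tips above is to count how often each "type" key occurs in a trial. The sketch below uses a short, hand-made XML string in the style of an Old Bailey trial (it is not a real file from the API), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made sample in the style of an Old Bailey trial (an assumption, not real data)
sample = """<div1 type="trialAccount" id="t-example">
  <interp inst="t-example" type="collection" value="BAILEY"></interp>
  <interp inst="t-example" type="year" value="1754"></interp>
  <p>
    <persName id="d1" type="defendantName">
      <interp inst="d1" type="surname" value="Ash"></interp>
      <interp inst="d1" type="given" value="Hannah"></interp>
    </persName>
    <persName id="d2" type="defendantName">
      <interp inst="d2" type="surname" value="Smith"></interp>
      <interp inst="d2" type="given" value="Oliver"></interp>
    </persName>
  </p>
</div1>"""

root = ET.fromstring(sample)
# Count every "type" key found on an <interp> element, at any depth
type_counts = Counter(el.attrib["type"] for el in root.iter("interp"))
# Keys that occur more than once will need unique column labels
repeated = sorted(t for t, n in type_counts.items() if n > 1)
print(type_counts)
print("repeated:", repeated)
```

In this sample, `surname` and `given` each occur twice because there are two defendants, which is exactly the situation the function has to handle.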


In [37]:
def table_of_cases(xml_file_name):
    #load file
    file = ET.ElementTree(file=xml_file_name)
    #create empty dataframe
    table = pd.DataFrame()
    #iterate over every element in the tree
    #(.iter() replaces .getiterator(), which was removed in Python 3.9)
    for element in file.iter():
        if element.tag == 'interp':
            #get attrib
            t = element.attrib['type']
            #get value of type, wrapped in a list so it fills a one-row column
            val = [element.attrib['value']]
            #labels of columns in table
            labels = list(table.columns.values)
            #If the "type" is already a column label, append the smallest
            #unused counter (1, 2, 3, ...) to make a unique label
            label = t
            i = 1
            while label in labels:
                label = t + str(i)
                i += 1
            table[label] = val
    return table
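Adding columns to a dataframe one at a time works, but an alternative is to collect the type/value pairs in a plain dictionary first and build the one-row dataframe at the end. The sketch below assumes that approach; the helper name `interp_row` and the sample XML are hypothetical and are not part of the lab's solution.

```python
import io
import xml.etree.ElementTree as ET
import pandas as pd

def interp_row(xml_source):
    """Collect every <interp> type/value pair into a one-row dataframe."""
    tree = ET.parse(xml_source)  # accepts a file name or an open file object
    row = {}
    for element in tree.iter("interp"):
        t = element.attrib["type"]
        # Find the first unused label: t, t1, t2, ...
        label, i = t, 1
        while label in row:
            label = t + str(i)
            i += 1
        row[label] = element.attrib["value"]
    return pd.DataFrame([row])

# Hand-made sample with three defendants (an assumption, not real data)
sample = io.StringIO("""<div1 type="trialAccount" id="t-example">
  <interp inst="t-example" type="collection" value="BAILEY"></interp>
  <interp inst="d1" type="surname" value="Ash"></interp>
  <interp inst="d2" type="surname" value="Smith"></interp>
  <interp inst="d3" type="surname" value="Carey"></interp>
</div1>""")

df = interp_row(sample)
print(df)
```

Because the counter keeps increasing until it finds a free label, a third or fourth repeat of the same key gets its own column instead of overwriting an earlier one.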

**Question 4.2:** Now, use `table_of_cases` to load the attribute data from each XML file that you scraped. Start with a blank dataframe and add the table of information to it after each call using `pd.concat`. Pass the argument `ignore_index=True` so that the rows are numbered consecutively.

**Note:** Use the same file name format that you used when scraping these files, and load from the correct directory, or else you won't be able to load the data.

In [38]:
table = pd.DataFrame()
for i in trials['hits'][:30]:
    raw_data = 'data/old-bailey/old-bailey-' + i + '.xml' #leave it as file name
    data_to_table = table_of_cases(raw_data)
    #pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    table = pd.concat([table, data_to_table], ignore_index=True)
table

| | collection | date | gender | gender1 | gender2 | given | given1 | given2 | offenceCategory | offenceCategory1 | ... | punishmentSubcategory | surname | surname1 | surname2 | type | uri | verdictCategory | verdictCategory1 | verdictSubcategory | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAILEY | 16740429 | male | male | | | | | violentTheft | | ... | | Stutely | | | crimeLocation | sessionsPapers/16740429 | guilty | | | 1674 |
| 1 | BAILEY | 16740429 | male | male | | Oliver | | | theft | | ... | | Smith | | | | sessionsPapers/16740429 | guilty | | | 1674 |
| 2 | BAILEY | 16740429 | male | male | male | | | | theft | | ... | | Bradbourn | | | crimeLocation | sessionsPapers/16740429 | guilty | | | 1674 |
| 3 | BAILEY | 16740429 | male | female | | | | | sexual | | ... | | | | | | sessionsPapers/16740429 | notGuilty | | | 1674 |
| 4 | BAILEY | 16740429 | female | female | | | | | theft | | ... | | | | | | sessionsPapers/16740429 | guilty | | | 1674 |
| 5 | BAILEY | 16740429 | male | male | male | Walter | James | | theft | | ... | | Carey | Slader | | | sessionsPapers/16740429 | guilty | | | 1674 |
| 6 | BAILEY | 16740429 | male | male | male | | | | theft | | ... | | | | | | sessionsPapers/16740429 | guilty | | | 1674 |
| 7 | BAILEY | 16740429 | male | male | | | | | theft | | ... | | | | | | sessionsPapers/16740429 | guilty | | | 1674 |
| 8 | BAILEY | 16740429 | indeterminate | male | | Thomas | | | violentTheft | | ... | | Feild | | | crimeLocation | sessionsPapers/16740429 | guilty | | | 1674 |
| 9 | BAILEY | 16740717 | male | | | Thomas | | | theft | | ... | | Whitehead | | | | sessionsPapers/16740717 | guilty | | | 1674 |
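The empty cells in the table come from combining one-row dataframes that do not share the same columns. The sketch below shows this with two small hand-made dataframes (illustrative values, not loaded from the scraped files): `pd.concat` takes the union of the column labels and fills whatever is missing with `NaN`.

```python
import pandas as pd

# Two hand-made one-row tables: the second trial has an extra defendant column
one_defendant = pd.DataFrame({"year": ["1674"], "surname": ["Whitehead"]})
two_defendants = pd.DataFrame(
    {"year": ["1674"], "surname": ["Carey"], "surname1": ["Slader"]}
)

# Collect the pieces in a list and concatenate once at the end
frames = [one_defendant, two_defendants]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the pieces in a list and calling `pd.concat` once is generally cheaper than growing a dataframe inside a loop, since each concatenation copies the data gathered so far.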


That's it! Now you know how to parse XML files using XPath and how to scrape the web using the `requests` library!

## Bibliography

 - All files from Old Bailey API - https://www.oldbaileyonline.org/obapi/
 - ElementTree information adapted from Driscoll, Mike. (2013, April). Python 101 – Intro to XML Parsing with ElementTree.
 https://www.blog.pythonlibrary.org/2013/04/30/python-101-intro-to-xml-parsing-with-elementtree/

 - Web Scraping code adapted from MEDST-250 Notebook developed by Tejas Priyadarshan.
 https://github.com/ds-modules/MEDST-250/tree/master/04%20-%20XML_Day_1
 
 - Image source from https://www.researchgate.net/publication/257631377_Efficient_XML_Path_Filtering_Using_GPUs

----
Notebook developed by: Jason Jiang

Data Science Modules: http://data.berkeley.edu/education/modules