# Loading and Processing TEI-Encoded XML
Because I wasn't able to get everyone working with the lxml library, I've simplified the process of loading XML and then processing it. Follow along below and let me know if you have any problems planning or implementing your assignment.

## Step 1: Read XML from a URL


In [9]:
import urllib.request

# assign your url to the variable url_to_load
# you can try http://papyri.info/ddbdp/bgu;1;133/source .
url_to_load = "http://papyri.info/ddbdp/bgu;1;133/source" # url here

f = urllib.request.urlopen(url_to_load)
tei_as_string = f.read().decode('utf-8')

print(tei_as_string)
# if the above was successful you should see a TEI document below:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.stoa.org/epidoc/schema/8.16/tei-epidoc.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     n="0001;1;133"
     xml:id="bgu.1.133"
     xml:lang="en">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>bgu.1.133</title>
         </titleStmt>
         <publicationStmt>
            <authority>Duke Collaboratory for Classics Computing (DC3)</authority>
            <idno type="filename">bgu.1.133</idno>
            <idno type="ddb-perseus-style">0001;1;133</idno>
            <idno type="ddb-hybrid">bgu;1;133</idno>
            <idno type="HGV">8910</idno>
            <idno type="TM">8910</idno>
            <availability>
               <p>© Duke Databank of Documentary Papyri. This work is licensed under a <ref type="license" target="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution 3.0 License</ref>.</p>


## Step 2: We can already do things
The find() method of strings will return the index of a substring. Try it:

In [10]:
url_to_load = "http://papyri.info/ddbdp/bgu;1;133/source" # url here

f = urllib.request.urlopen(url_to_load)
tei_as_string = f.read().decode('utf-8')

start_index = tei_as_string.find('<lb') # find the index of the first lb element

# why is the '[start_index:]' range important here?
tmp_end_index = tei_as_string[start_index:].find('>')

# what happens if we don't add 1 at the end? try it
end_index = start_index + tmp_end_index + 1

print("The first lb element begins at index " + str(start_index))
print("The first lb element ends at index " + str(end_index))

The first lb element begins at index 3674
The first lb element ends at index 3685


## Quiz!
Print the first lb element by using the start_index and end_index.

In [11]:
print(tei_as_string[start_index:end_index])

<lb n="1"/>
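
The same find-and-slice idea can be repeated to collect every lb element, not just the first. The cell below is a minimal sketch, not part of the original exercise: the `sample` string is a made-up stand-in for a downloaded document, and `find_all_tags` is a hypothetical helper name. It relies on the optional second argument of find(), which tells Python where to start searching.

```python
# A tiny made-up stand-in for a downloaded TEI document (illustration only).
sample = '<ab><lb n="1"/>alpha <lb n="2"/>beta <lb n="3"/>gamma</ab>'

def find_all_tags(text, tag_start):
    """Return every substring that begins with tag_start and ends at the next '>'."""
    found = []
    position = text.find(tag_start)
    while position != -1:
        end = text.find('>', position) + 1    # +1 so the slice includes the '>'
        if end == 0:                          # no closing '>' was found: stop
            break
        found.append(text[position:end])
        position = text.find(tag_start, end)  # resume searching after this tag
    return found

lb_tags = find_all_tags(sample, '<lb')
print(lb_tags)
```

Passing a start position to find() avoids the '[start_index:]' slice and the index arithmetic that goes with it.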


## More complex elements
lb elements are zero-length (empty), which means they can take the form '&lt;lb n="#" />'. Note the '/>' at the end of the element. This is equivalent to '&lt;lb n="#">&lt;/lb>' but is easier to write.

Next we want to capture elements of the form '&lt;persName type="a type">Tom&lt;/persName>'. You already know all the Python code needed to do this. We just put things together a little differently.

## Quiz
The cell below is almost ready to load the sample URL from the first cell, then to find and print out the first 'expan' element. Assign a value to end_tag to make the cell work.


In [17]:
url_to_load = "http://papyri.info/ddbdp/bgu;1;133/source"

f = urllib.request.urlopen(url_to_load)
tei_as_string = f.read().decode('utf-8')

# start_tag
start_tag = '<expan' # why leave off the '>'?
end_tag = '</expan>'

start_index = tei_as_string.find(start_tag)

tmp_end_index = tei_as_string[start_index:].find(end_tag)

end_index = start_index + tmp_end_index + len(end_tag) # why '+ len(end_tag)'?

print(tei_as_string[start_index:end_index])

<expan>στρ<ex>ατηγῷ</ex></expan>


## Quiz!
Copy and paste the working code from the above cell and adapt it so that it finds the first supplied element. Be careful: supplied elements can have the form '&lt;supplied reason="..."> ... &lt;/supplied>'. This means you can't just search for '&lt;supplied>'.

In [23]:
url_to_load = "http://papyri.info/ddbdp/bgu;1;133/source"

f = urllib.request.urlopen(url_to_load)
tei_as_string = f.read().decode('utf-8')

# start_tag
start_tag = '<supplied' # why leave off the '>'?
end_tag = '</supplied>'

start_index = tei_as_string.find(start_tag)

tmp_end_index = tei_as_string[start_index:].find(end_tag)

end_index = start_index + tmp_end_index + len(end_tag) # why '+ len(end_tag)'?

print(tei_as_string[start_index:end_index])

<supplied reason="lost">ρα</supplied>
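
Since the expan and supplied cells differ only in the tag name, the steps can be wrapped in a function. This is a sketch under assumptions: `first_element` is a hypothetical helper name, and `sample` is a short made-up string that reuses the two elements printed above.

```python
def first_element(text, name):
    """Return the first <name ...>...</name> substring of text, or None if absent."""
    start_tag = '<' + name           # no '>' so that tags with attributes match too
    end_tag = '</' + name + '>'
    start = text.find(start_tag)
    if start == -1:                  # find() returns -1 when nothing matches
        return None
    end = text.find(end_tag, start)  # look for the end tag after the start tag
    if end == -1:
        return None
    return text[start:end + len(end_tag)]

# Made-up sample built from the two elements printed in the cells above.
sample = '<ab><expan>στρ<ex>ατηγῷ</ex></expan> <supplied reason="lost">ρα</supplied></ab>'

print(first_element(sample, 'expan'))
print(first_element(sample, 'supplied'))
print(first_element(sample, 'persName'))
```

Two limits are worth knowing. Searching for '<ex' would also match '<expan', because the search only checks the beginning of the tag. And an element nested inside another element of the same name would be cut off at the inner end tag. Both are reasons to prefer a real XML library.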


## A Better Way
Building your own indexes into a string to find start and end tags gets complicated very quickly. Let's use a library. In class we used the 'lxml' library, and I recommend sticking with it if you're going to process XML in Python as part of your final project. Here we'll use the 'xml' library, which is part of Python's standard library and so should work on everyone's machine. In the following cells, read the comments to follow along.

In [24]:
# import what we need from the xml library
import xml.etree.ElementTree as ET

# some simple xml as a test
xml_str = "<doc><p>Hello World!</p><p>Goodbye!!!</p></doc>"

# parse the string. ET.fromstring returns the root Element (here, <doc>).
# Think of an Element as a very specialized version of a Python list: it contains its child elements.
xml_elements = ET.fromstring(xml_str)

# Elements have a .findall method that takes a (limited) XPath expression.
# IMPORTANT: begin your XPath with '.'
# You can iterate over the results with a for loop.
for p in xml_elements.findall('.//p'):
    print(p.text)

Hello World!
Goodbye!!!
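
The path you give to findall matters once elements are nested. A bare name such as 'p' matches only direct children of the element you call it on, while './/p' matches p elements at any depth. The small document below is made up to show the difference.

```python
import xml.etree.ElementTree as ET

# Made-up document: one p directly under doc, one nested inside a div.
nested = ET.fromstring("<doc><p>outer</p><div><p>inner</p></div></doc>")

direct_children = [p.text for p in nested.findall('p')]     # direct children only
at_any_depth = [p.text for p in nested.findall('.//p')]     # all descendants

print(direct_children)
print(at_any_depth)
```

TEI documents are deeply nested, so './/' is usually the form you want.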


In [25]:
# now let's access attributes
# we don't need to import the xml library again

# some simple xml, but now with attributes
xml_str = '<doc><p n="1" type="salutation">Hello World!</p><p n="2" type="farewell">Goodbye!!!</p></doc>'

# parse the string. ET.fromstring returns the root Element (here, <doc>).
# Think of an Element as a very specialized version of a Python list: it contains its child elements.
xml_elements = ET.fromstring(xml_str)

# Elements have a .findall method that takes a (limited) XPath expression.
# IMPORTANT: begin your XPath with '.'
# You can iterate over them with a for loop.
for p in xml_elements.findall('.//p'):
    print(p.attrib['type']) # p.attrib on its own returns a dictionary.

salutation
farewell


## Quiz!
Change the above code so that it prints out the n attribute of each p element.

In [27]:
# now let's access attributes
# we don't need to import the xml library again

# some simple xml, but now with attributes
xml_str = '<doc><p n="1" type="salutation">Hello World!</p><p n="2" type="farewell">Goodbye!!!</p></doc>'

# parse the string. ET.fromstring returns the root Element (here, <doc>).
# Think of an Element as a very specialized version of a Python list: it contains its child elements.
xml_elements = ET.fromstring(xml_str)

# Elements have a .findall method that takes a (limited) XPath expression.
# IMPORTANT: begin your XPath with '.'
# You can iterate over them with a for loop.
for p in xml_elements.findall('.//p'):
    print(p.attrib['n']) # p.attrib on its own returns a dictionary.

1
2
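
One thing to watch for in your own files: p.attrib['type'] raises a KeyError if an element lacks that attribute. Elements also have a .get method that returns a default value instead. The sample below is made up, and 'unknown' is just a placeholder default chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the second p has no type attribute.
doc = ET.fromstring('<doc><p n="1" type="salutation">Hello</p><p n="2">Bye</p></doc>')

# .get(name, default) returns the default when the attribute is missing.
types = [p.get('type', 'unknown') for p in doc.findall('.//p')]
print(types)
```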


## An example with real TEI

In [33]:
url_to_load = "http://papyri.info/ddbdp/bgu;1;133/source" 
f = urllib.request.urlopen(url_to_load)
tei_as_string = f.read().decode('utf-8')
xml_elements = ET.fromstring(tei_as_string)

# IMPORTANT: In the findall(...) statements that follow,
# you'll see '{http://www.tei-c.org/ns/1.0}'. Leave it.
# But note that you can change the element name.

# if you are using your own TEI, you might need to change the XPath so that it is meaningful for your data

print("\nList found elements:")
# element.tag returns the name of the current element
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}lb')
for element in elements:
    print(element.tag)

print("\nList the text of found elements:")
# element.text returns the text at the start of the current element, before its first child element.
# It is None when the element begins with a child element. Here 'expan' elements are found.
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}expan')
for element in elements:
    print(element.text)

print("List the value attributes of found elements")
# as you've seen, element.attrib returns the attributes
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}num')
for element in elements:
    print(element.attrib['value'])


List found elements:
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb
{http://www.tei-c.org/ns/1.0}lb

List the text of found elements:
στρ
Ἀρσι
Ἡρακ
μερίδο
ἀμφόδο
None
αἶγ
None
List the value attributes of found elements
4
7
104
6
10
14
100
8
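
The truncated words and the None values in the expan output come from how .text works: it holds only the text that appears before an element's first child. To get all the text inside an element, children included, you can join the pieces returned by .itertext(). The sketch below uses a made-up sample (the second expan is invented to reproduce the None case). It also shows an optional shorthand for the namespace: a dictionary passed to findall, where the prefix 'tei' is an arbitrary name chosen here.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the TEI namespace; the second expan starts with a child element.
sample = ('<ab xmlns="http://www.tei-c.org/ns/1.0">'
          '<expan>στρ<ex>ατηγῷ</ex></expan> '
          '<expan><ex>ἔτους</ex></expan>'
          '</ab>')
root = ET.fromstring(sample)

# Map a prefix of our choosing to the TEI namespace so paths stay readable.
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

results = []
for expan in root.findall('.//tei:expan', ns):
    full_text = ''.join(expan.itertext())  # text of the element and all its descendants
    results.append((expan.text, full_text))
    print(expan.text, '->', full_text)
```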


## A little bit of bad news
The xml library is simpler than the lxml library. It does not support XPath queries of the form .//num/@value, which select attribute values directly. Instead, you have to find all the num elements and loop through them, reading each value attribute with .attrib.
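
Here is one way to get the same result in two steps. The xml library does accept the predicate form [@value], which keeps only elements that have a value attribute, so the loop never hits a missing key. The sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the middle num has no value attribute.
sample = '<ab><num value="4">δ</num><num>?</num><num value="104">ρδ</num></ab>'
root = ET.fromstring(sample)

# Step 1: find the num elements that have a value attribute.
# Step 2: read the attribute from each one.
values = [num.get('value') for num in root.findall('.//num[@value]')]
print(values)
```

Note that attribute values always come back as strings. Use int(...) if you need to do arithmetic with them.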

## Assignment!
You can now fetch your TEI from GitHub, find certain elements within it, and then write a for loop to access those elements individually. The last step is like the 'for color in colors:' loop in Chapter 1.

Your assignment is to adapt the code above to do something interesting with your TEI-encoded XML file that is in the GitHub repository. You could find shared words in paragraphs, or list all rulers, dates, places (organs?). How about making links to the Wikipedia articles for rulers, or other proper nouns, mentioned in a text? How would you do that?
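
As a starting point for the Wikipedia idea, here is a rough sketch. It assumes that each name is tagged with persName and that the English Wikipedia article title is the same as the tagged name, which will not always be true. The sample string is made up.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Made-up sample; in your assignment this would be your own TEI file.
sample = ('<ab xmlns="http://www.tei-c.org/ns/1.0">'
          '<persName>Titan</persName> <persName>Phoebe</persName> '
          '<persName>Amphitrite</persName></ab>')
root = ET.fromstring(sample)

links = []
for pers in root.findall('.//{http://www.tei-c.org/ns/1.0}persName'):
    name = ''.join(pers.itertext()).strip()
    # ASSUMPTION: the article title equals the name, with spaces as underscores.
    title = urllib.parse.quote(name.replace(' ', '_'))
    links.append('https://en.wikipedia.org/wiki/' + title)

for link in links:
    print(link)
```

A more reliable approach would be to record the link yourself in the TEI, for example in an attribute, and read it with .attrib.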

And as I mentioned, you might have to make changes to your TEI so that Python can easily work with the elements within your document. That is OK. Just make sure the file remains valid XML.
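
A quick way to check your file after editing it is to try parsing it. The sketch below uses a hypothetical helper, `is_well_formed`, that reports whether a string parses. It only checks that the XML is well-formed (tags closed and properly nested). It does not check the file against the TEI schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for problems such as unclosed or mismatched tags.
        return False

print(is_well_formed('<doc><p>ok</p></doc>'))
print(is_well_formed('<doc><p>broken</doc>'))
```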


In [117]:
url_to_load = "https://isaw-ga-3024.github.io/maticic-del-dm3769/XML_Assignment_Week3.xml" 

f = urllib.request.urlopen(url_to_load)
tei_as_string = f.read().decode('utf-8')

xml_elements = ET.fromstring(tei_as_string)

print("\nMeter:")
# element.attrib gives access to the n and rend attributes of each lb element
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}lb')
for element in elements:
    print(element.attrib['n'])
    print(element.attrib["rend"])
    

print("\nBody Metaphors:")
# print the n attribute of each rs element
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}rs')
for element in elements:
    print(element.attrib['n'])
    
        
print("\nDeity Names")
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}persName')
for element in elements:
    print(element.text)
    
    
print("\nNotes")
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}quote')
for element in elements:
    print(element.text)
    

print("\nBibliography")
# element.text returns the text at the start of each p element
elements = xml_elements.findall('.//{http://www.tei-c.org/ns/1.0}p')
for element in elements:
    print(element.text)



Meter:
5
-oo|--|--|-oo|-oo|--
6
-oo|--|-oo|-oo|-oo|--
7
--|-oo|-oo|--|-oo|--
8
--|-oo|-oo|-oo|-oo|--
9
-oo|--|--|-oo|-oo|--
10
--|--|--|-oo|-oo|--
11
-oo|--|-oo|--|-oo|--
12
--|--|--|-oo|-oo|--
13
-oo|--|-oo|--|-oo|--
14
-oo|--|--|-oo|--|--
15
-oo|--|--|--|-oo|--
16
-oo|-oo|--|--|-oo|--
17
-oo|--|--|-oo|-oo|--
18
--|-oo|-oo|-oo|-oo|--
19
-oo|--|-oo|--|-oo|--
20
-oo|--|-oo|-oo|-oo|--
21
-oo|-oo|--|--|-oo|--
22
--|--|--|--|-oo|--
23
-oo|--|-oo|--|-oo|--
24
--|--|--|--|-oo|--
25
-oo|-oo|--|--|-oo|--
26
-oo|--|--|-oo|-oo|--
27
-oo|--|-oo|-oo|-oo|--
28
-oo|-oo|--|--|-oo|--
29
-oo|--|-oo|-oo|-oo|--
30
--|-oo|-oo|--|-oo|--
31
-oo|--|-oo|-oo|-oo|--
32
-oo|-oo|--|-oo|-oo|--
33
-oo|-oo|--|-oo|-oo|--
34
-oo|--|--|--|-oo|--
35
-oo|--|-oo|-oo|-oo|--

Body Metaphors:
ln. 6: unus erat toto naturae vultus in orbe
lns. 13-14: nec bracchia longo / margine terrarum porrexerat Amphitrite
ln. 33: congeriem secuit sectamque in membra coegit

Deity Names
Titan
Phoebe
Amphitrite

Notes
Line 9 Note: With semi

## How to turn in your assignment

First, execute the cell with your code in it and make sure the output is what you want.

Then choose 'Save and checkpoint' from the "File" menu of this notebook.

Copy this file, which is 'tei-processing.ipynb', into your folder of the class repository. As usual, commit and sync changes to GitHub. That's it. When loaded from GitHub, the notebook will appear essentially the same as it does to you now.