**By Peter A. Stokes, École Pratique des Hautes Études – Université PSL**

_These are brief notes and exercises on working with TEI XML files using Python. They are intended as teaching aids for the course on 'Image Processing with Python' which is part of the Atelier de formation annuel du Consortium Cahier on the topic of 'Exploiter les corpus d'auteurs' in Poitiers, 18--20 June 2019. For more details see https://cahier.hypotheses.org/4662_

These notes assume a good knowledge of TEI XML but assume **no experience or knowledge at all** in programming. This notebook also assumes that [Beautiful Soup 4](https://pypi.org/project/beautifulsoup4/) has already been installed in your Python system, and that the file [here](http://xtf.bvh.univ-tours.fr/xtf/data/tei/B330636101_S1238/B330636101_S1238_tei.xml) has already been downloaded and saved in the same directory as this notebook.

_If you are viewing this in Jupyter then you can edit the code simply by typing in the boxes. You can also execute the code in any box by clicking on the box and typing SHIFT + ENTER or using the 'Run' button in the menubar above._

# Import Libraries

One of the main strengths of Python is that other people have written code that we can import and use for ourselves. A good example of this is the Beautiful Soup library, which is designed for working with HTML and related formats such as XML. This means that the first thing we often do in Python is tell the system to load the libraries that we need, so that we can then use them.

In [None]:
from bs4 import BeautifulSoup

# Load in the file

The next step is to load the file into the system and give it to Beautiful Soup. To do this, we need the following steps:

1. Tell Python where the file is located, relative to our current location in the file system. The best way to do this is to store it in memory where we can find it again later. To do this we need to give our location a name so that we can find it again; here I have chosen to call it `path`.
1. Open the file at location `path`. To do this we use the Python command `open()`. Again, we need to store the reference to the open file somewhere so we can get it again; I have chosen to call it `inFile`.
1. Read in the contents of the file (using `read()`) and store it in memory. Here I have chosen to call it `fileXML`.
1. Send the raw XML to Beautiful Soup so that it can process it. We tell Beautiful Soup that this is an XML file by passing `"xml"`, which relies on the lxml library also being installed. Without this, Beautiful Soup would treat the file as HTML and, among other things, convert all tag names to lower case. We store the results of the Beautiful Soup creation in memory and call this `bs`.
1. Tell the operating system to `close` the file. This isn't essential but is good practice, as the operating system can then free up memory, allow another program to access the file and so on, which would be important in a big online system, for instance.

In [None]:
# Store the location of the file in a variable for Python
# (change this path if you saved the file somewhere else)
path = "../Example Texts/B330636101_S1238_tei.xml"

# Now open the file and store the reference to the file in another variable
inFile = open(path)

# Read in the contents (and store it in yet another variable)
fileXML = inFile.read()

# Send the contents to Beautiful Soup, and store the results in another variable
bs = BeautifulSoup(fileXML, "xml")

# Close the original file
inFile.close()
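
The same steps can also be written with Python's `with` statement, which closes the file automatically at the end of the indented block. The following is only a minimal sketch under stated assumptions: it first writes a tiny stand-in file (the invented `example_tei.xml`) to a temporary folder so that it can run on its own, and it assumes that the file is encoded in UTF-8. The resulting string could then be passed to `BeautifulSoup(fileXML, "xml")` exactly as above.

```python
import os
import tempfile

# Create a tiny stand-in XML file so that this sketch can run on its own
# (the file name and its contents are invented for illustration)
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example_tei.xml")
with open(path, "w", encoding="utf-8") as outFile:
    outFile.write("<TEI><text><body><p>Éloge</p></body></text></TEI>")

# Open the file, read its contents, and let Python close it for us
with open(path, encoding="utf-8") as inFile:
    fileXML = inFile.read()

# Outside the 'with' block the file has already been closed
print(inFile.closed)
print(fileXML)
```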

# View Parts of the File

We should now have our XML file stored in a variable (i.e. a place in memory with a name) so that we can now process it. We can find specific parts of the XML structure in two ways. The first is using `select()`, where we can use the syntax of CSS selectors (see further https://www.w3schools.com/cssref/css_selectors.asp). For instance, if we want to find all elements at `TEI/teiHeader/revisionDesc/change/persName` then we can do it like this:

In [None]:
bs.select("TEI > teiHeader > revisionDesc > change > persName")

We can also use CSS selectors and `select()` to find only those elements with specific attributes. Let's look for all `<supplied>` elements where the `@resp` is equal to `#MD`. This is how we do it:

In [None]:
bs.select("supplied[resp='#MD']")

**Warning**: If you are used to XPath then be careful here, as CSS selectors look similar but behave differently in the details! Here is a quick summary of some of the main CSS selector patterns:

Selector | Result
---------|--------------------------
#id      | Elements with an @id of id
e1, e2   | All e1 elements _and_ all e2 elements
e1 e2    | All e2 elements _inside_ e1 elements
e1 > e2  | All e2 elements which are immediate children of an e1 element
\[attr = v] | All elements where the attribute @attr has value v
\[attr ~= v] | All elements where the attribute @attr _contains_ the _word_ v
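
To make the differences in the table concrete, here is a small sketch that tries several of these selectors on an invented fragment. The XML and the variable names are made up for illustration only, and the sketch assumes that Beautiful Soup and lxml are installed as described above.

```python
from bs4 import BeautifulSoup

# A small invented fragment, used only to illustrate the selectors
xml = """<text>
  <body>
    <lg>
      <l n="1"><name type="person">Marguerite</name></l>
      <l n="2"><seg><name type="place royal">Navarre</name></seg></l>
    </lg>
    <note>A note</note>
  </body>
</text>"""
soup = BeautifulSoup(xml, "xml")

# e1 e2: every <name> anywhere inside an <l>
inside = soup.select("l name")

# e1 > e2: only a <name> that is an immediate child of an <l>
children = soup.select("l > name")

# e1, e2: all <l> elements and all <note> elements
either = soup.select("l, note")

# [attr=v]: the whole attribute value must be equal to v
exact = soup.select("name[type='royal']")

# [attr~=v]: v must be one of the space-separated words in the value
word = soup.select("name[type~='royal']")

print(len(inside), len(children), len(either), len(exact), len(word))
```

Note that the exact match finds nothing here, because the full value of the attribute is `place royal`, whereas the word match finds the second `<name>`.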

## Exercise

Now try the same thing yourself to find all `<supplied>` elements. Click into the box below, type your answer, and run it to see if you get the right result.

Now write code in the box below to find the following:

1. All `<supplied>` content inside the `<front>` element
1. All text marked as being in a `<foreign>` language (anywhere in the document)
1. All text in the `<body>` in its `<orig>` form (as opposed to its `<reg>` form)
1. All `<supplied>` content in the `<body>` which includes #IN40413 as a `@source`
1. The `<pb>` element for which `@n` is [4v]

# Counting Results

Now that we have our XML and can find specific parts of it, we want to do some interesting things with it. Let's try counting. To do this we can do the following:

1. Ask Beautiful Soup to find all the `<lb/>` elements in the XML document using `select()`. As we have seen, Beautiful Soup will give us a list of all the `<lb/>`s that match.
1. Store this list in a variable (i.e. in memory with a name).
1. Ask Python to tell us the `len`gth of our list.

Here is an example that counts how many `<supplied>` elements have `@resp` equal to #MD:

In [None]:
sup = bs.select("supplied[resp='#MD']")
len(sup)

Now do the same yourself to find how many `<pb/>` elements there are. How many have `@facs` attributes? How many do _not_ have `@facs` attributes?

(Hint: to subtract one number from another, simply use the `-` (minus) sign. So `10 - 7` will give `3`, and so on.)
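
As an illustration of the same approach on different data, the sketch below counts elements with and without a given attribute. The fragment and the variable names are invented for this example; the selector `name[ref]` (an attribute name with no value) matches every `<name>` that has a `@ref` attribute, whatever its value.

```python
from bs4 import BeautifulSoup

# An invented fragment: two <name> elements have @ref and one does not
xml = """<p>
  <name ref="#p1">Marguerite</name>
  <name>Navarre</name>
  <name ref="#p2">François</name>
</p>"""
soup = BeautifulSoup(xml, "xml")

# All <name> elements, then only those that have a @ref attribute
all_names = soup.select("name")
with_ref = soup.select("name[ref]")

# The difference between the two counts gives those without @ref
without_ref = len(all_names) - len(with_ref)
print(len(all_names), len(with_ref), without_ref)
```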

# Getting and Working with a Single Result

As we have seen, `select()` gives us a list of results, but sometimes we want a specific one. In Python, Lists are actually very powerful and allow us to do many different things. We have already seen how to count the length of the list, but we can also get the first item, the last item, add and remove items, and so on.

To get the first, third, fifth, etc. item in a list, we use the name of the list followed by the number of the item in square brackets. So above we created our list and stored it in a variable that we called `sup`, so we can display different items in our list of results by using `print` and the square bracket notation like this:

In [None]:
print(sup[1])

print(sup[10])

There are some important things to notice here:

* **The first item in the list is always number 0**. This is not specific to Python: it is also the case in many (but not all!) programming languages. This means that to get the first item in the list stored in variable `sup`, we use `sup[0]`.
* **Getting the last item in the list is not immediately obvious**. We could put in the number of the last item, but we don't know in advance how many items there are in the list. There are therefore two ways of doing this:
  * One is to use `len()` (How? Try it! Be careful, though, as there is a trap!)
  * Another is that `-1` in Python means 'the last'. So here `sup[-1]` means 'give me the _last_ element in the list stored in my variable `sup`'
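
The numbering is easier to see on a short list that we write out by hand. Here is a small sketch using an ordinary Python list of words (invented for this example) in place of a list of results:

```python
# A plain Python list, standing in for a list of results
words = ["quant", "roy", "estoit", "ung"]

print(words[0])    # the first item
print(words[1])    # the second item
print(words[-1])   # the last item
print(len(words))  # the number of items in the list
```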
  
Try it now, to print the first and last item in our list `sup` in the box below:

Once we have a single result, we can then do some more interesting things with it.

One is to get the attribute of the result. To do this, we put the name of the attribute into square brackets, so for instance if we have a single result stored in a variable `a`, then to get the `@type` attribute we use `a["type"]`. Here is another example:

In [None]:
# Store the first element in a new variable (let's call it s)

s = sup[0]

# Now use the [] notation to get the value of the @resp attribute and print it

print(s["resp"])
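
One thing to be aware of is that the square bracket notation raises an error (a `KeyError`) if the element does not have the attribute we ask for. The sketch below, on an invented fragment, shows two related tools: `.attrs`, which holds all the attributes of an element, and `.get()`, which gives `None` instead of an error when the attribute is missing.

```python
from bs4 import BeautifulSoup

# An invented fragment: the second <name> has no @ref attribute
xml = '<p><name type="person" ref="#p1">Marguerite</name> <name type="place">Navarre</name></p>'
soup = BeautifulSoup(xml, "xml")

names = soup.select("name")
first = names[0]
second = names[1]

print(first["ref"])        # the value of an attribute that exists
print(first.attrs)         # all the attributes of the element
print(second.get("ref"))   # get() gives None if the attribute is missing
```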

Now use this method to get the `@facs` attribute of the *last* `<pb>` element in the document. To do this, you will need to:

1. Use `bs.select()` to get a list of all `<pb>` elements in the document, and store this in a variable.
1. Get the last element in the list and store that in a new variable
1. Use the `[]` notation to get the attribute value and print it out.

If it works then you should see a link to the image of the last page in the document. Click on the link: you should then see an image of [this page](http://gallica.bnf.fr/ark:/12148/bpt6k11718168/f1022.highres).

There are many other things we can do with a single result like this. One is to format the XML nicely (more or less) for printing and display. To do this we use the `prettify()` method like this: 

In [None]:
rd = bs.select("TEI > teiHeader > revisionDesc")
print(rd[0].prettify())

# Finding Text

To find specific text content (not names of tags or attributes), we can use `find_all()`. This works similarly to `select()` but is not based on CSS selectors. Instead it is more precise and in many ways more powerful, but it's also a bit harder to use. Here are some examples of how it works:

In [None]:
# Find all examples of the <reg> element

bs.find_all("reg")

In [None]:
# Find all examples of the word 'quand' *inside* a <reg> element

bs.find_all("reg", string="quand")

In [None]:
# NB that 'string' only searches the immediate text node, so this returns no results:
bs.find_all("choice", string="quand")

In [None]:
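# If we want to find text anywhere in a descendant of the current node then we have to search inside each result.
# find_all() returns a list, so we check each <choice> in turn and keep those that contain the text:
[c for c in bs.find_all("choice") if c.find_all(string="quand")]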
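
The behaviour of `string` can be checked on a very small invented fragment. The sketch below is only an illustration under that assumption. It also shows one further option: passing a regular expression (from Python's built-in `re` module) matches _part_ of a text node rather than the whole of it.

```python
import re
from bs4 import BeautifulSoup

# An invented fragment with two <choice> elements
xml = """<p><choice><orig>quant</orig><reg>quand</reg></choice> le
<choice><orig>roy</orig><reg>roi</reg></choice> vint</p>"""
soup = BeautifulSoup(xml, "xml")

# A <reg> whose only content is exactly the text 'quand'
regs = soup.find_all("reg", string="quand")

# <choice> has two child elements, so it has no single string to compare
direct = soup.find_all("choice", string="quand")

# Keep each <choice> that contains the text 'quand' at any depth
nested = [c for c in soup.find_all("choice") if c.find(string="quand")]

# A regular expression matches part of a text node rather than the whole
partial = soup.find_all(string=re.compile("qua"))

print(len(regs), len(direct), len(nested), partial)
```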

# Modifying the XML

It's also possible to modify the contents and even the structure of the XML with Beautiful Soup:

In [None]:
# Create a new very simple XML document, just to demonstrate the principle
soup = BeautifulSoup('<p>This text is <b class="boldest">extremely bold</b></p>', "xml")
tag = soup.find('b')
print(tag)

In [None]:
# Let's change the tag name and add some attributes
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)

In [None]:
# Now remove some attributes

del tag['class']
del tag['id']
print(tag)

In [None]:
# Let's add some text to the tag's contents

tag.append(' hello')
print(tag)

In [None]:
# And change the contents entirely

tag.string = 'And now goodbye'
print(tag)

This gives us many different possibilities. One, for instance, is to automatically add `id` attributes to all the elements in a given document. Here is the algorithm to follow:

1. Load in the XML file
1. Set an ID counter to 0 (to ensure that all IDs are unique)
1. `Find all` tags in the document to which we want to add IDs.
1. `for` each element `in` the list that results from the previous step:
   1. Set the `id` attribute to the value of the ID counter
   1. Increment the value of the ID counter
1. Save the result

Try writing the code yourself. You may also want to use a different form for the ID rather than a simple number; you could, for instance, add a string as a prefix (to give IDs such as no1, no2, no3 etc.), or use something else.
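
If you get stuck, here is one possible sketch of the loop, applied to a tiny invented document rather than to a file. It is not the only way to do it, and several choices here are assumptions: the attribute is called `xml:id` (as is usual in TEI), the prefix is `no`, and the counter starts at 1 so that the IDs are no1, no2, no3. A prefix is needed in any case because an `xml:id` may not begin with a digit.

```python
from bs4 import BeautifulSoup

# A tiny invented document, standing in for the XML file
xml = "<body><p>First</p><p>Second</p><p>Third</p></body>"
soup = BeautifulSoup(xml, "xml")

# Start the counter at 1 so that the IDs are no1, no2, no3 ...
counter = 1

for p in soup.find_all("p"):
    # Build the ID from a prefix and the counter, then store it as an attribute
    p["xml:id"] = "no" + str(counter)
    # Increment the counter so that the next ID is different
    counter = counter + 1

# str() turns the modified document back into text, which could be written to a file
result = str(soup)
print(result)
```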

# An Extended Example

Let's now do an extended example of a (potential) real case. Very often in TEI files, we have internal or even external references to documents, that is, pointers from elements to other elements. The problem is that it's often very difficult to be sure that a pointer does indeed point to something, and that there isn't a mistake somewhere, and so it would be useful to have a program that can look at all the pointers and make sure that they do indeed point to something.

This is what the program would need to do:

1. Load in the XML file from the disk (or the internet via a URL)
1. Create the Beautiful Soup object
1. Create a list of all elements with pointers
1. Create a list of elements being pointed to.
1. For each item in the list of elements with pointers:
   1. Check that the pointer does indeed exist in the list of elements being pointed to
   1. If it is not there then print some sort of error
   
Let's try this with the [Proust Prototype](http://research.cch.kcl.ac.uk/proust_prototype/) by Elena Pierazzo and Julie André. In this case, there are `@change` attributes which should point to the `@xml:id` attributes in `change` elements. Here is how it works:

In [None]:
# Store the location of the file in a variable for Python
path = "../Example Texts/Proust_tei_C46.xml"

# Now open the file and store the reference to the file in another variable
inFile = open(path)

# Read in the contents (and store it in yet another variable)
fileXML = inFile.read()

# Send the contents to Beautiful Soup, and store the results in another variable
bs = BeautifulSoup(fileXML, "xml")

# Close the original file
inFile.close()

# Select and store the list of all elements containing @change attributes
change_list = bs.select("[change]")

# Let's also store the list of <change> elements that have an @xml:id attribute
change_ids = bs.find_all("change", attrs={"xml:id": True})

# Print one item from each list to show that it has worked
print(change_list[20])

print(change_ids[20])

In [None]:
# For each node in the change list:
for c in change_list:

    # Get the value of the @change attribute and store it temporarily
    # (under a new name, so that we do not overwrite the change_ids list)
    change_value = c["change"]
    
    print(change_value)
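
The remaining steps of the algorithm (checking each pointer and printing an error) could be sketched as follows. This is a minimal illustration on a small invented document, not on the Proust file itself, and it rests on two assumptions that should be checked against the real data: that an `@change` attribute may hold several pointers separated by spaces, and that each pointer is an `@xml:id` value preceded by `#`.

```python
from bs4 import BeautifulSoup

# A small invented document: the pointer #c3 has no matching <change>
xml = """<TEI>
  <teiHeader><revisionDesc>
    <change xml:id="c1"/>
    <change xml:id="c2"/>
  </revisionDesc></teiHeader>
  <text><body>
    <p><add change="#c1">a</add> <del change="#c2 #c3">b</del></p>
  </body></text>
</TEI>"""
soup = BeautifulSoup(xml, "xml")

# Collect the IDs that can be pointed to, as a set for quick lookup
known_ids = set()
for change in soup.find_all("change", attrs={"xml:id": True}):
    known_ids.add(change["xml:id"])

# Keep a record of every pointer that does not lead anywhere
missing = []
for element in soup.select("[change]"):
    # Assumption: pointers are separated by spaces and each starts with '#'
    for pointer in element["change"].split():
        target = pointer.lstrip("#")
        if target not in known_ids:
            missing.append((element.name, pointer))
            print("No <change> found for", pointer, "in", element.name)
```

Storing the IDs in a set rather than a list makes each lookup fast, which matters little here but helps with large documents.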

---
![Licence Creative Commons](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work (the contents of this Jupyter Python notebook) is licenced under a [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)