**By Peter A. Stokes, École Pratique des Hautes Études – Université PSL**

_These are brief notes and exercises on working with TEI XML files using Python. They are intended as teaching aids for the course on 'Image Processing with Python' which is part of the Atelier de formation annuel du Consortium Cahier on the topic of 'Exploiter les corpus d'auteurs' in Poitiers, 18–20 June 2019. For more details see https://cahier.hypotheses.org/4662_

These notes assume a good knowledge of TEI XML but assume **no experience or knowledge at all** in programming.

_If you are viewing this in Jupyter then you can edit the code simply by typing in the boxes. You can also execute the code in any box by clicking on the box and typing SHIFT + ENTER or using the 'Run' button in the menubar above._

# Python basics

This section gives a very brief summary of the basics of the Python 3 programming language. If you already know Python 3 then you can skip this section. If you know an earlier version of Python such as Python 2.7, then you can also skip this section, but be aware that `print 'hello'` is no longer valid: in Python 3 you must always include the parentheses, so write `print('hello')` instead.

## Variables

To create a new variable for storing data in memory, simply provide a unique name for that variable and use `=` to assign the content. Note that you can re-assign different content to an existing variable, in which case the new content will simply replace the old (hence the name 'variable'). You can have as many different variables as you like, as long as your computer doesn't run out of RAM. This is unlikely with modern computers, but it is possible if you have very large images.

Notice also the `#` symbol. This is to signal a 'comment': i.e. everything after `#` on that line is a comment for us humans to read and so will be ignored by Python. It is very good practice to add comments as a reminder to you and a message to others of what your code does. You will be grateful when you come back to your code in a year's time!

In [6]:
a = 1      # Stores the integer (whole number) value 1 in the variable a
b = 2.0    # Stores the decimal value 2.0 in the variable b

c = a + b  # Stores the decimal value 3.0 (1 + 2.0) in the variable c
c = c + 1  # Stores the decimal value 4.0 (3.0 + 1) in the variable c

d = c / b  # Stores the decimal value of c divided by b into the variable d
e = b * c  # Stores the decimal value of b multiplied by c into the variable e

print(c)   # Prints the value currently stored in c (i.e. 4.0)
print(d)
print(e)


4.0
2.0
8.0
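
If you are ever unsure what kind of value a variable holds, Python's built-in `type()` function will tell you. The short sketch below uses new, purely illustrative variable names (`whole`, `decimal`, `total`) to show that adding an integer to a decimal number gives a decimal number:

```python
whole = 1          # An integer (illustrative name, not used elsewhere in these notes)
decimal = 2.0      # A decimal number (called a 'float' in Python)

total = whole + decimal   # Mixing an integer and a float gives a float

print(type(whole))
print(type(decimal))
print(type(total))
print(total)
```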


Variables do not have to contain numbers: they can contain many other things, including images (as we will see soon). Another common type of data is a string, namely a series of characters:

In [7]:
s1 = 'This is a string. It must be enclosed in single quotes.'
s2 = 'The single quotes tell Python that it is a string.'
s3 = 'Otherwise, Python might think that it is the name of a variable.'

print(s1)
print(s2)
print(s3)

print('You can also print a string directly without storing it first.')
print(s1, s2, s3) # Notice what happens here

This is a string. It must be enclosed in single quotes.
The single quotes tell Python that it is a string.
Otherwise, Python might think that it is the name of a variable.
You can also print a string directly without storing it first.
This is a string. It must be enclosed in single quotes. The single quotes tell Python that it is a string. Otherwise, Python might think that it is the name of a variable.
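
Strictly speaking, Python accepts either single or double quotes around a string, as long as the opening and closing quotes match. The following is a minimal sketch with made-up variable names, showing that both forms give an ordinary string and that double quotes are convenient when the text itself contains an apostrophe:

```python
single = 'A string in single quotes.'
double = "A string in double quotes."
apostrophe = "Double quotes are handy when the text contains an apostrophe: l'auteur."

print(single)
print(double)
print(apostrophe)
```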


## Libraries

If you want to use a library of existing code then you **must** first tell Python to load it into your system **before** you use the library code. You can import an entire library, but normally you only import specific parts from that library. For this you use the `import` or `from ... import` command. You can also add `as` to give the library a short name if you want, as we do in the example below: `matplotlib.pyplot` is a pain to type so we give it the name `plt` for short. These are the libraries that we will be using today:

In [8]:
from PIL import Image
from PIL import ImageOps
from PIL import ImageChops
import matplotlib.pyplot as plt
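
To see the three forms of `import` side by side without needing any extra installation, here is a small sketch using `math`, a library that comes with Python. The choice of `math` and the short name `m` are only for illustration and are not used elsewhere in these notes:

```python
import math                  # Import the whole library
from math import sqrt        # Import one specific function from the library
import math as m             # Import the library under a shorter name

print(math.pi)               # Use the full library name
print(sqrt(16))              # Use the function directly, without the library name
print(m.floor(2.7))          # Use the short name instead of the full one
```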

## Lists

Lists are a more complex type of data, but one that is very useful as it allows us to store a list of things in a single variable. To do this we use square brackets, with the contents of the list in the brackets separated by commas. We can have lists of anything we want: integers, decimal numbers, strings, images, ...

In [9]:
list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
list2 = [1.0, 2.6, 3.3, 4.7, 5.1, 6.7]
list3 = ['a', 'b', 'c', 'd', 'e']
list4 = ['bonjour', 'au revoir', 'ça va ?', 'très bien']

print(list1)
print(list2)
print(list3)
print(list4)

[1, 2, 3, 4, 5, 6, 7, 8, 9]
[1.0, 2.6, 3.3, 4.7, 5.1, 6.7]
['a', 'b', 'c', 'd', 'e']
['bonjour', 'au revoir', 'ça va ?', 'très bien']


At times we may want to access specific items in the list. To do this we use the following system:

In [10]:
list1[0]      # Gives us the first item in the list.
list2[-1]     # Gives us the last item in the list
list3[0:3]    # Gives us the first three items in the list
list3[:3]     # Also gives us the first three items in the list
list4[-2:]    # Gives us the last two items in the list
list1[2:5]    # Gives us the third (!) through fifth items in the list

# Let's test it:
print(list1[0])      # Gives us the first item in the list.
print(list2[-1])     # Gives us the last item in the list
print(list3[0:3])    # Gives us the first three items in the list
print(list3[:3])     # Also gives us the first three items in the list
print(list4[-2:])    # Gives us the last two items in the list
print(list1[2:5])    # Gives us the third (!) through fifth items in the list



1
6.7
['a', 'b', 'c']
['a', 'b', 'c']
['ça va ?', 'très bien']
[3, 4, 5]
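
The reason for the '(!)' above is that Python starts counting at 0, and that the number after the colon is the position where the slice stops, which is itself not included. A short sketch, using an illustrative list called `letters`, may make this clearer:

```python
letters = ['a', 'b', 'c', 'd', 'e']   # Illustrative list with the same contents as list3

print(len(letters))      # How many items the list contains
print(letters[1])        # Counting starts at 0, so position 1 is the second item
print(letters[1:3])      # Items at positions 1 and 2; position 3 is NOT included
print(letters[-1])       # Negative numbers count back from the end of the list
```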


## Loops

One of programming's biggest strengths is being able to do things again and again, automatically and very quickly. To do this, we need to use loops: that is, we tell Python:

For every item i in a list l:
   * Do this
   * Do that
   * Etc.

Let's demonstrate this with a simple list. Let's say we want to do the following:

1. Create a list of five numbers and save it in a variable.
1. `For` each number `in` the list:
   1. Add ten to it
   1. Print it out
1. After we've gone through the whole list then print a message ("Finished!")
   
This is what the code looks like:

In [11]:
numbers = [2, 4, 6, 8, 10]   # We avoid the name 'list', which Python already uses itself

for n in numbers:
    temp = n + 10
    print(temp)
    
print("Finished!")

12
14
16
18
20
Finished!
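
Often we want to keep the results of a loop rather than just print them. One possible way to do this, sketched below, is to start with an empty list and add each result to it with the list's `append()` method. The names `numbers` and `results` are chosen here only for illustration:

```python
numbers = [2, 4, 6, 8, 10]
results = []                  # Start with an empty list

for n in numbers:
    results.append(n + 10)    # Add each new value to the end of the results list

print(results)
```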


# Reading a File from Disk

Another useful task is to read in a file from disk. There are various ways of doing this, but the easiest for us is to use two functions, `open()` to create a connection between Python and the file, and then `read()` to read in the contents. More specifically, the steps are as follows:

1. Tell Python where the file is located, relative to our current location in the file system. The best way to do this is to store it in memory where we can find it again later. To do this we need to give our location a name so that we can find it again; here I have chosen to call it `path`.
1. Open the file at location `path`. To do this we use the Python command `open()`, and we tell Python that we want to `r`ead the contents (as opposed to `w`riting to the file). Again, we need to store the reference to the open file somewhere so we can get it again; I have chosen to call it `in_file`.
1. Read in the contents of the file (using `read()`) and store it in memory. Here I have chosen to call it `fileXML`.
1. Tell the operating system to `close` the file. This isn't essential but is good practice as the operating system can then free up memory, allow another program to access the file and so on, which would be important in a big online system, for instance.

Now we can do something with our file. To test it, let's try reading in an XML file in the 'Example Texts' folder:

In [12]:
# Store the location of the file in a variable for Python
path = "../Example Texts/302-1-3-063.tei.xml"

# Now open the file and store the reference to the file in another variable
in_file = open(path, "r")

# Read in the contents (and store it in yet another variable)
fileXML = in_file.read()

# Close the connection to the file on disk
in_file.close()

# Now print out the first 500 characters of the file contents to show that it worked
print(fileXML[0:500])

<?xml version="1.0" encoding="UTF-8"?>
<TEI><teiHeader facs="302-1-3-063.jpg"><fileDesc><seriesStmt><biblScope unit="cote">302</biblScope><biblScope unit="numero_ordre_dans_registre">60</biblScope><biblScope unit="numero_page">395</biblScope><biblScope unit="volume">1, tome 3</biblScope></seriesStmt><sourceDesc><msDesc type="recto"><physDesc><handDesc><handNote scribe=""/></handDesc><objectDesc><supportDesc><support>feuille_pliee</support><dimensions><height>219</height><width>172</width></dimen
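
Many Python programmers prefer to open files with a `with` block, which closes the file automatically when the block ends, so that the `close()` step cannot be forgotten. The sketch below is one way of doing this. So that it can run anywhere, it first creates its own small sample file in a temporary folder (writing files is explained later in these notes). The file name, its contents and the variable names are all made up for illustration, and the `encoding="utf-8"` argument assumes that the file is stored as UTF-8:

```python
import os
import tempfile

# Create a small sample file so that this example runs anywhere.
# (The file name and contents are made up for illustration.)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(sample_path, "w", encoding="utf-8") as out_file:
    out_file.write("<TEI><text><body><p>Bonjour</p></body></text></TEI>")

# Read it back. The file is closed automatically at the end of the 'with' block.
with open(sample_path, "r", encoding="utf-8") as in_file:
    sample_xml = in_file.read()

print(sample_xml[0:20])   # Print the first 20 characters to show that it worked
```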


# Reading a Text File from the Web

This is useful, but often we want to read a file directly from the Web. This could be an HTML file, a JSON file, XML, TXT, and so on. The procedure is very similar to reading from the disk, but we first have to `import` a module called `urllib.request`. Here are the steps:

1. Tell Python the URL of our file. The best way to do this again is to store it in memory where we can find it later. To do this we need to give our location a name so that we can find it again; here I have chosen to call it `url`.
1. Open a connection to the file at the URL. To do this we use the Python command `urllib.request.urlopen()`. Again, we need to store the reference to the open connection somewhere so we can get it again; I have chosen to call it `conn`.
1. Read in the contents of the file (using `read()`) and store it in memory. Here I have chosen to call it `text`.
1. Tell the operating system to close the connection. This isn't essential but is good practice as the operating system can then free up memory, which would be important in a big online system, for instance.

And here is an example that downloads Book 1 of _Le comte de Monte Cristo_ from Project Gutenberg:

In [13]:
# Import the url library so that we can use it to connect to the Web.
import urllib.request

# Save the URL of the file we want to download
url = "http://www.gutenberg.org/cache/epub/17989/pg17989.txt"

# Open a connection to the URL
conn = urllib.request.urlopen(url)

# Read the text into memory and store it in a variable
text = conn.read()

# Close the connection
conn.close()

# Now print the start of the downloaded data to see that it worked.
# Counting starts at 0, so [1:100] skips the very first byte and prints the next 99.
print(text[1:100])

b"\xbb\xbfProject Gutenberg's Le comte de Monte-Cristo, Tome I, by Alexandre Dumas\r\n\r\nThis eBook is for the"


**Note that this method is for text and text-like files only**. We will see later that downloading images from the Web is a little bit different.
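
The `b` at the start of the output above shows that `read()` has given us raw bytes rather than an ordinary string. To work with the content as text we need to decode it. The sketch below assumes that the data is encoded as UTF-8 with a marker at the very start of the file (a 'byte-order mark'), which is what the `\xbb\xbf` in the output suggests but which you should check for your own files. It uses a short made-up sample in place of a real download so that it runs without an Internet connection:

```python
# A short made-up sample standing in for the bytes returned by conn.read()
raw = b"\xef\xbb\xbfLe comte de Monte-Cristo\r\n"

# 'utf-8-sig' decodes UTF-8 and drops the marker bytes at the start, if present
decoded = raw.decode("utf-8-sig")

print(type(raw))        # The raw data is of type 'bytes'
print(type(decoded))    # After decoding we have an ordinary string ('str')
print(decoded[0:24])
```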

# Saving Files to Disk

Often we also want to save our work to the disk, so that we can use it later, or send it to another programme or a colleague, or any other reason. The basic process is similar to reading a file, but with some minor differences. Here is the procedure:

1. Tell Python where we want the file to be located, relative to our current location in the file system. The best way to do this is to store it in memory where we can find it again later. To do this we need to give our location a name so that we can find it again; here I have chosen to call it `file`.
1. Open the file at location `file`. To do this we use the Python command `open()`, and we tell Python that we want to `w`rite the data to the file (as opposed to `r`eading the contents from the file). Again, we need to store the reference to the open file somewhere so we can get it again; I have chosen to call it `out_file`.
1. Write out the contents to the file (using `write()`).
1. Tell the operating system to `close` the file. This **is** normally essential, as it tells the system that no new data will be added and so the file system can do what is necessary to complete the file.

As an example, let's save a message in a file:

In [14]:
file = "test_file.txt"

out_file = open(file, "w")

out_file.write("Bonjour, voici mon premier fichier créé avec Python !")

out_file.close()

Have a look in your file-system: you should now find a file called `test_file.txt` in your directory, with the message inside it.

**Warning: be very careful creating files!** If you create a new file with the same path as an existing file, then **your existing file will automatically be deleted and overwritten with the new file**.
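
One possible safeguard against this, sketched below, is to open the file with the mode `"x"` instead of `"w"`. With `"x"`, Python creates the file only if it does not already exist, and otherwise stops with a `FileExistsError`. The example works in a temporary folder so that it cannot touch your real files. The folder, the messages and the variable `overwritten` are made up for illustration, and `try` ... `except` is a Python construction for handling errors that is not otherwise covered in these notes:

```python
import os
import tempfile

# Work in a temporary folder so that the example cannot touch any real files
folder = tempfile.mkdtemp()
path = os.path.join(folder, "test_file.txt")

# First write: the file does not exist yet, so this succeeds
with open(path, "x", encoding="utf-8") as out_file:
    out_file.write("First version")

# Second write: the file now exists, so Python refuses instead of overwriting it
try:
    with open(path, "x", encoding="utf-8") as out_file:
        out_file.write("Second version")
    overwritten = True
except FileExistsError:
    overwritten = False
    print("The file already exists, so nothing was written.")

# Check the contents: the first version is still there
with open(path, "r", encoding="utf-8") as in_file:
    print(in_file.read())
```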

**Warning: be very careful looping with files!** It's very easy to create loops accidentally which write hundreds or thousands or millions of new files, potentially very large ones, and this can overload your system or even fill your hard disk.

**Be sure to test your code very carefully before writing files to disk!**

---
![Licence Creative Commons](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work (the contents of this Jupyter Python notebook) is licensed under a [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/) licence.