**By Peter A. Stokes, École Pratique des Hautes Études – Université PSL**

_These are brief notes and exercises on working with TEI XML files using Python. They are intended as teaching aids for the course on 'Image Processing with Python' which is part of the Atelier de formation annuel du Consortium Cahier on the topic of 'Exploiter les corpus d'auteurs' in Poitiers, 18--20 June 2019. For more details see https://cahier.hypotheses.org/4662_

These notes assume a good knowledge of TEI XML but assume **no experience or knowledge at all** in programming. This notebook also assumes that the [Python Imaging Library (PIL)](https://pythonware.com/products/pil/) has already been installed in your Python system, and that an adapted version of the file [here](http://xtf.bvh.univ-tours.fr/xtf/data/tei/B330636101_S1238/B330636101_S1238_tei.xml) has already been downloaded and saved in the same directory as this notebook.

_If you are viewing this in Jupyter then you can edit the code simply by typing in the boxes. You can also execute the code in any box by clicking on the box and typing SHIFT + ENTER or using the 'Run' button in the menubar above._

# **\[Add intro/context\]**

1. IIIF: NB relationship between TEI XML @facs and the IIIF, e.g. https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f17/full/pct:25/0/native.jpg
   * So could easily load images with IIIF
   * Could also use TEI documentary view to load sections of images with Python & IIIF
   
**Remember: already have slides on IIIF!**

# Getting the Image Addresses

In order to manipulate the image, we first need to get all the addresses. This is the procedure we want to follow:

1. `import` the Beautiful Soup library
1. `open()` the _Essais_ file and store the result in a variable
1. `read()` the file into another variable
1. Process the results of the previous step using `BeautifulSoup` with `"xml"` format and save the results in a new variable
1. `close()` the file
1. `select()` all the `pb` elements and store the resulting list in a variable
1. Loop `for` each pb element `in` the list variable:
   1. `print()` the value of the `"facs"` attribute
   
Try to write the code to do this yourself. You will want to look back at material from the previous sessions, particularly [1. Getting Started with XML in Python](1.%20Getting%20Started%20with%20XML%20in%20Python.ipynb) 

In [None]:
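# urllib.request must be imported explicitly: 'import urllib' alone does not guarantee it is loaded
import urllib.request, json, io, requests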
from PIL import Image

manifest_url = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/manifest.json"

conn = urllib.request.urlopen(manifest_url)

manifest = json.loads(conn.read())

Now we read the manifest to find the URLs of the images. An IIIF manifest is a relatively complex file, so it is not obvious how to extract the image addresses from it. In principle, though, it is structured as follows:

1. Each manuscript can have one or more sequences of pages. This allows different page orders to be stored, for instance the current and original page ordering. In practice the vast majority of manifests have only one sequence, so we just take the first.
1. Each sequence then contains a number of 'canvases'. In practice, for us, each canvas corresponds to a page, so this is effectively the list of pages.
1. We then go through and loop `for` each `canvas` in our list of canvases:
   1. We get the list of images for each canvas. In principle we can have more than one image but again in practice this is very rare, so we just take the first image record in the list.
   1. The `image` record contains a `resource` record, and the `resource` record contains the `@id` field. The `@id` is in fact the full URL to our image, so we save it.
   1. We now print out the URL
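
The structure described above can be tried out without touching the network. The following is only a minimal sketch: `sample_manifest` is a tiny hand-made dictionary with invented `example.org` addresses, laid out in the way the list above describes. It is not real data from the BnF.

```python
# A tiny, hand-made stand-in for a manifest (hypothetical addresses)
sample_manifest = {
    "sequences": [
        {
            "canvases": [
                {"images": [{"resource": {"@id": "https://example.org/iiif/f1/full/full/0/native.jpg"}}]},
                {"images": [{"resource": {"@id": "https://example.org/iiif/f2/full/full/0/native.jpg"}}]},
            ]
        }
    ]
}

# Take the first (and here only) sequence, then its list of canvases
sample_canvases = sample_manifest["sequences"][0]["canvases"]

# Collect the address of the first image on each canvas
sample_addresses = []
for canvas in sample_canvases:
    sample_addresses.append(canvas["images"][0]["resource"]["@id"])

print(sample_addresses)
```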
   
**You must be very careful when looping like this**. It's easy to make a mistake and suddenly send hundreds or thousands of download requests to the BnF website. This is likely to get you in trouble, as it could be interpreted as a common hacking activity (a ['denial of service'](https://en.wikipedia.org/wiki/Denial-of-service_attack) or DoS attack). You are then likely to be blocked from the BnF website, and it could theoretically even result in a criminal investigation! For this reason, we test our code first by printing the addresses to the screen and checking that they are correct. Only then do we really connect to the actual website.
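
One way to build this caution into the code itself is sketched below. It is only an illustration under stated assumptions: the function `fetch_all`, its `limit`, `delay` and `dry_run` parameters, the `example.org` addresses and the stand-in `fake_fetch` are all invented for this example. In real use, `fetch` would be something like `requests.get`.

```python
import time

def fetch_all(urls, fetch, limit=5, delay=1.0, dry_run=True):
    """Fetch at most `limit` URLs, pausing `delay` seconds after each request.

    With dry_run=True nothing is requested: the URLs are only printed.
    """
    results = []
    for url in urls[:limit]:
        if dry_run:
            print(url)
        else:
            results.append(fetch(url))
            time.sleep(delay)
    return results

# Hypothetical addresses, used only to show the behaviour
urls = ["https://example.org/page" + str(n) for n in range(100)]

# Step 1: a dry run, which prints five addresses and contacts nothing
dry_results = fetch_all(urls, fetch=None)

# Step 2: the same loop with a stand-in for a real download function
def fake_fetch(url):
    return "content of " + url

pages = fetch_all(urls, fake_fetch, delay=0, dry_run=False)
print(len(pages))
```

Even though the list holds 100 addresses, the `limit` means that only five are ever processed, and the pause between requests keeps the load on the server low.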

In [None]:
# Get the sequence of pages from the manifest
sequence = manifest["sequences"][0]

# Get the list of pages ('canvases') from the sequence
canvases = sequence["canvases"]

# Now go through the list of canvases
for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    print(im_addr)

You should get five URLs, and you can click on them to make sure that the addresses are correct.

You will notice that the images are very large and very high quality. This is nice, but we really don't need such high quality for our work. We want to be good citizens and not load the BnF site any more than we need to, so we can change the URL to tell the IIIF server that we only want an image at 25% of full size. To do this, we need to change each address: specifically, the part that says `/full/full/0` must become `/full/pct:25/0` (for '25 percent'). Fortunately this is easy to do in Python using the `replace()` method. Here is how it works:

In [None]:
for canvas in canvases[20:25]:
    # Get the full-resolution address of the image from the canvas record
    im_addr = canvas["images"][0]["resource"]["@id"]
    
    # Change the resolution from 'full' to 25%
    im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    
    # Download the image and store it in a PIL Image variable
    res = requests.get(im_addr)
    image = Image.open(io.BytesIO(res.content))
    
    # Show the image to be sure it worked
    image.show()
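
The effect of `replace()` can also be checked on a single address without downloading anything. This is a small sketch: `full_addr` is a hypothetical full-size address modelled on the Gallica addresses used in this notebook, not one read from the manifest.

```python
# Hypothetical full-size address, modelled on the Gallica addresses above
full_addr = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/full/full/0/native.jpg"

# Ask for 25% of full size instead
small_addr = full_addr.replace("/full/full/0", "/full/pct:25/0")
print(small_addr)

# replace() leaves the text untouched if the pattern is not found,
# so comparing before and after tells us whether anything changed
if small_addr == full_addr:
    print("Warning: the address was not changed")
```

The final check is worth keeping in mind: if a server writes its addresses differently, `replace()` does not raise an error, and the loop would quietly download the full-size images.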

This is good, but it's very inefficient to download the image every time we want to use it. Instead, we can keep a copy in memory by storing it in a variable. The problem here is that we have more than one image. We could store them in many different variables, such as `image1`, `image2` etc., but this quickly becomes unmanageable (what happens if we change the number of images? what happens if we want to download hundreds of images? etc.). Instead, we can create a list of images, and simply add each new image to the list using the `append()` method. Here's how it works:

In [None]:
# To begin, we have to create an empty list
images = []

# Now we go through our loop again
for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    res = requests.get(im_addr)
    image = Image.open(io.BytesIO(res.content))
    
    # Here, instead of showing the image, we add it to our list
    images.append(image)
    
# Now we should have a list of images.
# Let's test by looping through and showing each one
for image in images:
    print(image.size)
    image.show()

Now that we have our list, we can use it again and again to do different things with it. For instance, we can loop `for` each image `in` our list `images` and print the `size`. Try writing the code to do this into the box below:

Another interesting possibility of IIIF is that it allows us to access specific regions of the page (see the [IIIF specifications](https://iiif.io/api/image/2.1/#region) for more details). In this case, we have to change the first occurrence of 'full' in the URL, replacing it with the coordinates of the region we want in the form `x,y,w,h`: the left edge, the top edge, the width and the height, measured in pixels on the full-size image. For instance, let's compare the following two URLs, one with the full image and one with a section. You should see two images, one of the full page and one of some text in line 7.

In [None]:
url_full = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/full/pct:25/0/native.jpg"

res = requests.get(url_full)
image = Image.open(io.BytesIO(res.content))
image.show()

url_region = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/810,1250,2925,185/pct:25/0/native.jpg"

res = requests.get(url_region)
image = Image.open(io.BytesIO(res.content))
image.show()
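
The two addresses above differ only in the region part. As a sketch of how such addresses are assembled, the hypothetical helper `iiif_image_url` below (not part of any library) joins the parts in the order used by these URLs: region, size, rotation, then quality and format.

```python
def iiif_image_url(base, region="full", size="full", rotation=0, quality="native", fmt="jpg"):
    """Join the parts of an IIIF image address in the order the server expects."""
    return base + "/" + region + "/" + size + "/" + str(rotation) + "/" + quality + "." + fmt

# The part of the address that identifies the page (here f23)
base = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23"

# The whole page at 25% of full size
whole_page = iiif_image_url(base, size="pct:25")
print(whole_page)

# Only the region x=810, y=1250, width=2925, height=185, also at 25%
one_region = iiif_image_url(base, region="810,1250,2925,185", size="pct:25")
print(one_region)
```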

In [None]:
images2 = []

for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    im_addr = im_addr.replace("/full/full/0", "/810,1250,2925,185/pct:25/0")
    res = requests.get(im_addr)
    image = Image.open(io.BytesIO(res.content))
    images2.append(image)

In [None]:
for image in images2:
    print(image.size)
    image.show()

In [None]:
im_addr = canvases[22]["images"][0]["resource"]["@id"]

line_height = 140
col_width = 2950

start_x = 660
start_y = 435

for lineno in range(33):
    coord = "/" + str(start_x) + "," + str(start_y + lineno*line_height) + "," + str(col_width) + "," + str(line_height)
    line_addr = im_addr.replace("/full/full/0", coord + "/pct:25/0")
    print(line_addr)
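
The arithmetic in the loop above assumes that every line has the same height and that the lines follow one another without gaps, so line number `lineno` starts `lineno * line_height` pixels below the first. The sketch below isolates that calculation in a hypothetical helper, `line_region`, so that it can be checked before any address is built.

```python
# The same measurements as above, in pixels on the full-size image
line_height = 140
col_width = 2950
start_x = 660
start_y = 435

def line_region(lineno):
    """Return the IIIF region 'x,y,w,h' for a given line (counting from 0)."""
    top = start_y + lineno * line_height
    return str(start_x) + "," + str(top) + "," + str(col_width) + "," + str(line_height)

# The first two lines: only the second number (the top edge) changes
print(line_region(0))
print(line_region(1))
```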

In [None]:
line_imgs = []

for lineno in range(33):
    coord = "/" + str(start_x) + "," + str(start_y + lineno*line_height) + "," + str(col_width) + "," + str(line_height)
    line_addr = im_addr.replace("/full/full/0", coord + "/pct:25/0")
    print(line_addr)
    res = requests.get(line_addr)
    image = Image.open(io.BytesIO(res.content))
    line_imgs.append(image)

In [None]:
# Show the line images that we have just downloaded
for image in line_imgs:
    image.show()

In [None]:
from bs4 import BeautifulSoup

path = "../Example Texts/Montaigne f22.xml"

inFile = open(path)
fileXML = inFile.read()

bs = BeautifulSoup(fileXML, "xml")

inFile.close()

In [None]:
linelist = bs.select("surface > zone > line")
no_lines = len(linelist)

print(linelist[2].select("reg"))
line_imgs[2].show()

In [None]:
print(linelist[8].get_text())

In [None]:
searchword = input("Enter a word ")

for lineno in range(no_lines):
    if searchword in linelist[lineno].get_text():
        line_imgs[lineno].show()
        

In [None]:
image_f22 = Image.open("../Example Texts/Montaignef22_25pct.jpg")
image_f22.show()

In [None]:
from PIL import ImageDraw

draw = ImageDraw.Draw(image_f22)

# The image is at 25% of full size, so the full-size coordinates are divided by 4
draw.rectangle([start_x/4, start_y/4, (start_x + col_width)/4, (start_y + line_height)/4], width=5)

image_f22.show()

In [None]:
image_f22 = Image.open("../Example Texts/Montaignef22_25pct.jpg")

draw = ImageDraw.Draw(image_f22)

searchword = input("Enter a word ")

for lineno in range(no_lines):
    if searchword in linelist[lineno].get_text():
        draw.rectangle([start_x/4, start_y/4 + (line_height/4 * lineno),
                        start_x/4 + col_width/4, start_y/4 + (line_height/4 * (lineno + 1))],
                       width=5)

image_f22.show()
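
The divisions by 4 in the code above are there because the line measurements were taken on the full-size image, while the saved image is at 25% of that size. As a sketch, the hypothetical helper `scaled_line_box` below makes the conversion explicit, with the scale as a parameter in case an image at a different percentage is used.

```python
# Measurements in pixels on the full-size image, as above
line_height = 140
col_width = 2950
start_x = 660
start_y = 435

def scaled_line_box(lineno, scale=0.25):
    """Return [left, top, right, bottom] for a line on a reduced-size image."""
    left = start_x * scale
    top = (start_y + lineno * line_height) * scale
    right = (start_x + col_width) * scale
    bottom = (start_y + (lineno + 1) * line_height) * scale
    return [left, top, right, bottom]

# The box around the first line on the 25% image
first_box = scaled_line_box(0)
print(first_box)
```

A list like this is in the order that `draw.rectangle()` expects for its box: left, top, right, bottom.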

---
![Licence Creative Commons](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work (the contents of this Jupyter Python notebook) is licenced under a [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/)