**By Peter A. Stokes, École Pratique des Hautes Études – Université PSL**

_These are brief notes and exercises on working with TEI XML files using Python. They are intended as teaching aids for the course on 'Image Processing with Python' which is part of the Atelier de formation annuel du Consortium Cahier on the topic of 'Exploiter les corpus d'auteurs' in Poitiers, 18–20 June 2019. For more details see https://cahier.hypotheses.org/4662_

These notes assume a good knowledge of TEI XML but **no experience or knowledge at all** of programming. The notebook also assumes that [Beautiful Soup 4](https://pypi.org/project/beautifulsoup4/) is already installed in your Python environment, and that [this TEI file](http://xtf.bvh.univ-tours.fr/xtf/data/tei/B330636101_S1238/B330636101_S1238_tei.xml) has been downloaded and saved in the same directory as this notebook. The image cells below additionally import `requests` and `PIL` (Pillow).

_If you are viewing this in Jupyter then you can edit the code simply by typing in the boxes. You can also execute the code in any box by clicking on the box and typing SHIFT + ENTER or using the 'Run' button in the menubar above._

# **\[Add intro/context\]**

1. IIIF: note the relationship between the TEI XML `@facs` attribute and IIIF image URLs, e.g. https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f17/full/pct:25/0/native.jpg
   * So images could easily be loaded via IIIF
   * The TEI documentary view could also be used to load sections of images with Python and IIIF
   
**Remember: already have slides on IIIF!**
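
The example URL above follows the IIIF Image API pattern of `identifier/region/size/rotation/quality.format`. As a minimal sketch of that structure, the hypothetical helper `iiif_image_url` below assembles such a URL from its parts. Treating `ark:/12148/bpt6k11718168/f17` as the identifier is an assumption based on reading the Gallica URL, and the function name and defaults are illustrative only.

```python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="native", fmt="jpg"):
    """Assemble an IIIF Image API URL from its path segments.

    Hypothetical helper for illustration; it does no validation.
    """
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


base = "https://gallica.bnf.fr/iiif"
page = "ark:/12148/bpt6k11718168/f17"  # assumed identifier: document ark plus folio

# Whole page, scaled to 25% of its original size
url = iiif_image_url(base, page, size="pct:25")
print(url)
```

Changing only `region` or `size` is then enough to request a crop or a different scale of the same page.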

In [15]:
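import io
import json
import urllib.request  # "import urllib" alone does not reliably load the request submodule

import requests
from PIL import Image

manifest_url = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/manifest.json"

# Download the IIIF manifest and parse it as JSON
with urllib.request.urlopen(manifest_url) as response:
    manifest = json.loads(response.read())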

In [34]:
canvases = manifest["sequences"][0]['canvases']

for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    #im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    print(im_addr)

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f21/full/full/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f22/full/full/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/full/full/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f24/full/full/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f25/full/full/0/native.jpg


In [17]:
for canvas in canvases[0:5]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    # Ask the server for the whole page at 25% of its size instead of full size
    im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    res = requests.get(im_addr)
    image = Image.open(io.BytesIO(res.content))
    image.show()

In [27]:
images = []

for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    res = requests.get(im_addr)
    image = Image.open(io.BytesIO(res.content))
    images.append(image)
    
for image in images:
    print(image.size)
    image.show()

(1170, 1519)
(1205, 1535)
(1170, 1519)
(1205, 1535)
(1170, 1519)


In [24]:
for image in images:
    print(image.size)

(1147, 1519)
(1208, 1535)
(1147, 1519)
(1208, 1535)
(1147, 1519)


In [35]:
images2 = []

for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    # Request only the region x=810, y=1250, width=2925, height=185 (pixels), scaled to 25%
    im_addr = im_addr.replace("/full/full/0", "/810,1250,2925,185/pct:25/0")
    res = requests.get(im_addr)
    image = Image.open(io.BytesIO(res.content))
    images2.append(image)

In [36]:
for image in images2:
    print(image.size)
    image.show()

(731, 46)
(731, 46)
(731, 46)
(731, 46)
(731, 46)
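
Every image has the same size here because the region is given in pixels of the full-size page and then scaled by `pct:25`. The sketch below reproduces that arithmetic. The helper name `expected_size` is hypothetical, and rounding to the nearest pixel is an assumption: a server may round differently, and a region extending beyond the page would be clipped.

```python
def expected_size(region_w, region_h, pct):
    """Estimate the pixel size of a region after scaling by pct percent.

    Assumes rounding to the nearest whole pixel.
    """
    return round(region_w * pct / 100), round(region_h * pct / 100)


# Region 2925 x 185 pixels requested at 25%
print(expected_size(2925, 185, 25))  # (731, 46)
```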


In [54]:
im_addr = canvases[22]["images"][0]["resource"]["@id"]

line_height = 140
col_width = 2950

start_x = 660
start_y = 435

for lineno in range(33):
    coord = f"/{start_x},{start_y + lineno*line_height},{col_width},{line_height}"
    line_addr = im_addr.replace("/full/full/0", coord + "/pct:25/0")
    print(line_addr)

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,435,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,575,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,715,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,855,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,995,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1135,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1275,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1415,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1555,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1695,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f2

In [64]:
line_imgs = []

im_addr = canvases[22]["images"][0]["resource"]["@id"]

line_height = 140
col_width = 2950

start_x = 660
start_y = 435

for lineno in range(33):
    coord = f"/{start_x},{start_y + lineno*line_height},{col_width},{line_height}"
    line_addr = im_addr.replace("/full/full/0", coord + "/pct:25/0")
    print(line_addr)
    res = requests.get(line_addr)
    image = Image.open(io.BytesIO(res.content))
    line_imgs.append(image)

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,435,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,575,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,715,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,855,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,995,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1135,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1275,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1415,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1555,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/660,1695,2950,140/pct:25/0/native.jpg
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f2
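
The two cells above build the same list of regions, so the calculation could be moved into a small function. This is only a sketch: `line_regions` is a hypothetical name, and it keeps the simplifying assumption made above that all lines are evenly spaced and of equal height.

```python
def line_regions(start_x, start_y, col_width, line_height, n_lines):
    """Yield one IIIF region string 'x,y,w,h' per line, from top to bottom."""
    for lineno in range(n_lines):
        y = start_y + lineno * line_height
        yield f"{start_x},{y},{col_width},{line_height}"


regions = list(line_regions(660, 435, 2950, 140, 33))
print(len(regions))
print(regions[0])
print(regions[-1])
```

Each string can then replace the `full` region in an image address, as in the `replace` calls above.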

In [47]:
for image in images2:
    image.show()

In [55]:
from bs4 import BeautifulSoup

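path = "../Example Texts/Montaigne f22.xml"

# Read the TEI file (assuming it is UTF-8); the with-block closes it automatically
with open(path, encoding="utf-8") as inFile:
    fileXML = inFile.read()

# Parse as XML (the "xml" parser requires lxml to be installed)
bs = BeautifulSoup(fileXML, "xml")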

In [68]:
linelist = bs.select("surface > zone > line")

print(linelist[2].select("reg"))
line_imgs[2].show()

[<reg>advantage</reg>, <reg>rapportant</reg>, <reg>tousjours</reg>]
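
The notes at the top suggest using the TEI documentary view to load sections of images. One way this might work is to turn the coordinates of each `<zone>` into an IIIF region. The sketch below assumes that zones carry the TEI attributes `ulx`, `uly`, `lrx` and `lry` in image pixels; the fragment is invented for illustration and is not taken from the Montaigne file. It uses the standard library's `ElementTree` rather than Beautiful Soup only so that it runs without extra installs; with Beautiful Soup the same values could be read as `zone["ulx"]` and so on.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment (no namespace, invented coordinates)
tei = """<surface>
  <zone ulx="660" uly="435" lrx="3610" lry="575"><line>first line</line></zone>
  <zone ulx="660" uly="575" lrx="3610" lry="715"><line>second line</line></zone>
</surface>"""


def zone_to_region(zone):
    """Convert upper-left/lower-right corners into an IIIF 'x,y,w,h' string."""
    ulx, uly = int(zone.get("ulx")), int(zone.get("uly"))
    lrx, lry = int(zone.get("lrx")), int(zone.get("lry"))
    return f"{ulx},{uly},{lrx - ulx},{lry - uly}"


surface = ET.fromstring(tei)
zone_regions = [zone_to_region(z) for z in surface.iter("zone")]
print(zone_regions)
```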
