**By Peter A. Stokes, École Pratique des Hautes Études – Université PSL**

These are brief notes and exercises on working with images using Python, intended as the practical component of a larger course. They assume a basic knowledge of Python, and that the scikit-image library has already been installed in your Python system.

_If you are viewing this in Jupyter then you can edit the code simply by typing in the boxes. You can also execute the code in any box by clicking on the box and typing SHIFT + ENTER or using the 'Run' button in the menubar above._

# Setting the Scene

As we have seen in the lecture/presentation part of this course, the [International Image Interoperability Framework](https://iiif.io) (IIIF) is a framework that includes standard ways of linking to and specifying images and regions of images. It is now widely used by many libraries (for a small sample see the [IIIF Collections Portal](https://iiif.biblissima.fr/collections/) provided by Biblissima). Given the importance of digital images, the objective of this worksheet is to give you a glimpse into the enormous potential of IIIF and Python together.

# Getting the IIIF Manifest

In order to begin working, we want to download and process the IIIF manifest file for our manuscript. This file contains all the information about our manuscript, such as the addresses of all the images. To do this, we read in the file as usual. However, the manifest comes in JSON format, and since this is a standard format we can use existing libraries to process it for us and save it in a structured Python object as follows:

In [None]:
import json
import urllib.request  # needed explicitly for urllib.request.urlopen()

import requests
import matplotlib.pyplot as plt
from skimage import io

manifest_url = \
    "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/manifest.json"

conn = urllib.request.urlopen(manifest_url)

manifest = json.loads(conn.read())

Experiment a bit by looking at the structure of the `manifest` object. There are a couple of examples to start, but play around yourself to see if you can understand how it works and how it corresponds to the [IIIF standard for manifests](https://iiif.io/api/presentation/2.0/#manifest).

In [None]:
print(manifest["sequences"][0]["canvases"][0]["images"][0]["resource"]["@id"])

# Now add your own tests here:

#print(manifest["sequences"][0]["canvases"])
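
To see the overall shape of a manifest without touching the network, here is a minimal sketch using a tiny hand-written dictionary that imitates the Presentation 2.0 layout. The identifiers, labels and URLs in `toy_manifest` are invented for illustration and are not taken from Gallica; the helper name `summarise` is likewise hypothetical.

```python
# A toy manifest in the IIIF Presentation 2.0 shape.
# All values below are invented for illustration only.
toy_manifest = {
    "@id": "https://example.org/iiif/book1/manifest.json",
    "label": "Example book",
    "sequences": [
        {
            "canvases": [
                {
                    "label": "page 1",
                    "width": 2000,
                    "height": 3000,
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/book1/f1/full/full/0/native.jpg"}}
                    ],
                },
                {
                    "label": "page 2",
                    "width": 2000,
                    "height": 3000,
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/book1/f2/full/full/0/native.jpg"}}
                    ],
                },
            ]
        }
    ],
}


def summarise(manifest):
    """Return (label, width, height) for each canvas in the first sequence."""
    canvases = manifest["sequences"][0]["canvases"]
    return [(c["label"], c["width"], c["height"]) for c in canvases]


# Top-level keys, then one line per canvas
print(sorted(toy_manifest.keys()))
for label, width, height in summarise(toy_manifest):
    print(label, width, height)
```

The same function can be tried on the real `manifest` object, although a real manifest may omit or structure some of these fields differently.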


Now we read the manifest file to find the URLs of the images. The IIIF manifest is a relatively complex file, so it's not very obvious how to extract the images from it. In principle, though, it's structured as follows:

1. Each manuscript can have one or more sequences of pages. This allows different page orders to be stored, for instance the current and original page ordering. In practice the vast majority of manifests have only one sequence, so we just take the first.
1. Each sequence then contains a number of 'canvases'. In practice, for us, each canvas corresponds to a page, so this is effectively the list of pages.
1. We then go through and loop `for` each `canvas` in our list of canvases:
   1. We get the list of images for each canvas. In principle we can have more than one image but again in practice this is very rare, so we just take the first image record in the list.
   1. The `image` record contains a `resource` record, and the `resource` record contains the `@id` field. The `@id` is in fact the full URL to our image, so we save it.
   1. We now print out the URL so that we can check it.
   
**You must be very careful when looping like this**. It's easy to make a mistake and suddenly send hundreds or thousands of download requests to the BnF website. This could get you into trouble, as it might be interpreted as a common hacking activity (a ['denial of service'](https://en.wikipedia.org/wiki/Denial-of-service_attack) or DoS attack). If so then you could be blocked from the BnF website, and theoretically you could even be subject to a criminal investigation! For this reason, we test our results first by printing out to the screen and making sure that our code is working. Only then do we really connect to the actual website.

In [None]:
# Get the sequence of pages from the manifest
sequence = manifest["sequences"][0]

# Get the list of pages ('canvases') from the sequence
canvases = sequence["canvases"]

# Now go through the list of canvases
for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    print(im_addr)

You should get five URLs, and you can click on them to make sure that the addresses are correct.
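
The caution about runaway loops can also be built into the code itself. The following is a minimal sketch, assuming a hypothetical helper `fetch_politely` that refuses to run if the list of URLs is longer than a chosen limit and that pauses between requests. The limit of 5 and the one-second delay are arbitrary assumptions, not values recommended by the BnF, and the example URLs are invented.

```python
import time


def fetch_politely(urls, fetch, max_requests=5, delay=1.0):
    """Call fetch(url) for at most max_requests URLs, pausing between calls."""
    if len(urls) > max_requests:
        raise ValueError(
            f"Refusing to send {len(urls)} requests (limit is {max_requests})"
        )
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Dry run: 'fetch' here only returns the URL instead of downloading anything.
test_urls = [
    f"https://example.org/iiif/f{n}/full/pct:25/0/native.jpg" for n in range(1, 4)
]
dry_run = fetch_politely(test_urls, fetch=lambda url: url, delay=0.0)
print(dry_run)
```

Once the dry run prints the addresses you expect, the same function could be called with `io.imread` as the `fetch` argument and a non-zero delay.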

You will notice that the images are very large and very high quality. This is nice but we really don't need such high quality for our work. We want to be good citizens and not load the BnF site any more than we need to, so we can change the URL to tell the IIIF server that we only want an image at 25% of full size. To do this, we need to change part of each URL: specifically, we want to change the part that says `/full/full/0` to `/full/pct:25/0` (for '25 percent'). Fortunately this is easy to do in Python using the `replace()` method. Here is how it works:

In [None]:
for canvas in canvases[20:25]:
    # Get the full-resolution address of the image from the canvas record
    im_addr = canvas["images"][0]["resource"]["@id"]
    
    # Change the resolution from 'full' to 25%
    im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    print(im_addr)
    
    # Download the image and store it in an skimage variable
    image = io.imread(im_addr)
    
    # Some systems *might* give an error here. If so, use the alternative in the next cell.
    
    # Show the image to be sure it worked
    # NOTE the use of plt.figure() and plt.imshow() here (rather than io.imshow).
    # Check the documentation online and see if you can tell what's happening
    plt.figure()
    plt.imshow(image)

Some systems (particularly Windows) might give an error here, because of the way that the software interprets the information from the internet. If this happens to you, then use the code in the following cell to load the image instead of what's above. In that case, **be sure to change every occurrence of io.imread() throughout this worksheet**, updating the image address and image variable each time. You only need to import the library once.

Note also that if this applies to you then you will need to use two different libraries with the same name: one is the system library `io`, and the other is the skimage library also called `io`. Normally, we would call the first one simply `io` and the second one `skimage.io`, but this would then be inconsistent with the other worksheets in this series, where the skimage one is called `io`. For this reason, we will use the alias `sysio` for the system one, but in general this is not really best practice, as renaming a library like this makes it harder for people to understand when they come to read the code.

In [None]:
import io as sysio

# Change io.imread to these two lines throughout this worksheet.
res = requests.get(im_addr)
image = io.imread(sysio.BytesIO(res.content))

The code we have above is good, but it's very inefficient to download the image every time we want to use it. Instead, we can save a copy in memory, by storing it in a variable. The problem here is that we have more than one image. We could store them in many different variables, such as `image1`, `image2` etc. but this is very inefficient (what happens if we change the number of images? what happens if we want to download hundreds of images? etc.). Instead, we can create a list of images, and simply add each new image to the list using the `append()` method. Here's how it works:

In [None]:
# To begin, we have to create an empty list
images = []

# Now we go through our loop again
for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    im_addr = im_addr.replace("/full/full/0", "/full/pct:25/0")
    print(im_addr)
    image = io.imread(im_addr)

    # Here, instead of showing the image, we add it to our list
    images.append(image)
    
# Now we should have a list of images.
# Let's test by looping through and showing each one
for image in images:
    plt.figure()
    plt.imshow(image)

Now that we have our list, we can use it again and again to do different things with it. For instance, we can loop `for` each image `in` our list `images` and print the `size`. Try writing the code to do this.
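
As a hint: images read by skimage are NumPy arrays, which have both a `shape` attribute (rows, columns and, for colour images, channels) and a `size` attribute (the total number of values). The sketch below uses small synthetic arrays in a hypothetical list `fake_images`, not the downloaded pages, so the numbers are purely illustrative.

```python
import numpy as np

# Two synthetic 'images': 30 rows x 20 columns x 3 colour channels,
# and 10 rows x 40 columns x 3 colour channels
fake_images = [
    np.zeros((30, 20, 3), dtype=np.uint8),
    np.zeros((10, 40, 3), dtype=np.uint8),
]

for img in fake_images:
    # shape lists rows (height) first, then columns (width)
    height, width = img.shape[:2]
    print("height:", height, "width:", width, "total values:", img.size)
```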

Another interesting possibility of IIIF is that it allows us to access specific regions of the page (see the [IIIF specifications](https://iiif.io/api/image/2.1/#region) for more details). In this case, we have to change the first occurrence of 'full' in the URL, replacing it with the coordinates of the region we want. For instance, let's compare the following two URLs, one with the full image and one with a section. You should see two images, one of the full page and one of some text in line 7.

In [None]:
url_full = \
"https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/full/pct:25/0/native.jpg"

url_region = \
"https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/810,1250,2925,185/pct:25/0/native.jpg"

image_full = io.imread(url_full)
image_region = io.imread(url_region)

plt.imshow(image_full)
plt.figure()
plt.imshow(image_region)
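
An IIIF image URL ends with the segments `{region}/{size}/{rotation}/{quality}.{format}`, so instead of `replace()` the region and size can be swapped by position. The following is a minimal sketch under that assumption; the helper names `set_region_and_size` and `region_string` are hypothetical, and `full_url` is assumed to follow the same pattern as the addresses printed from the manifest.

```python
def set_region_and_size(image_url, region="full", size="full"):
    """Return image_url with its region and size segments replaced.

    Assumes the URL ends with .../{region}/{size}/{rotation}/{quality}.{format}
    """
    # Split off the last four '/'-separated segments, counting from the right
    base, _region, _size, rotation, filename = image_url.rsplit("/", 4)
    return "/".join([base, region, size, rotation, filename])


def region_string(x, y, width, height):
    """Format pixel coordinates as the IIIF region 'x,y,w,h'."""
    return f"{x},{y},{width},{height}"


full_url = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k11718168/f23/full/full/0/native.jpg"

# Whole page at 25% size
print(set_region_and_size(full_url, size="pct:25"))

# The same region of the page as in the cell above, at 25% size
print(set_region_and_size(full_url, region_string(810, 1250, 2925, 185), "pct:25"))
```

Working by position does not depend on the URL containing exactly `/full/full/0`, but it does rely on the URL having the standard five trailing segments.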

For the exercise, let's now apply this same region to images of several different pages. To do this, we use the following algorithm:

1. Create a new empty list to hold our images
1. `for` each canvas `in` the list of canvases:
   1. Get the image URL from the IIIF manifest and save it in a local variable
   1. Replace the full URL with the one for the region (and the reduced image size)
   1. Read in the image data and create a new image
   1. Add this new image to our list of images

Here is the code to do this:

In [None]:
images2 = []

for canvas in canvases[20:25]:
    im_addr = canvas["images"][0]["resource"]["@id"]
    im_addr = im_addr.replace("/full/full/0", "/810,1250,2925,185/pct:25/0")
    image = io.imread(im_addr)
    
    images2.append(image)

Now let's have a look at the images to make sure this worked:

In [None]:
for image in images2:
    plt.figure()
    plt.imshow(image)

What we've just done is nice, but frankly it's not very interesting. What would be much better is to have an image of each line of text on a given page. We can do this, but first we need to think a little bit about some basic arithmetic and geometry.

In order to get the image of a line of text, we need to calculate the coordinates of each line. This is relatively easy for a printed book, since the lines are very regular (compared to a notebook, for instance). Ideally we would use Python to automatically find the lines for us, but this is much more advanced (though we will see a simple version of this in the last 'Going Further' section of these notes). Instead of doing it automatically, we can do the following:

Using any imaging software, we need to measure the following:

1. The height of each line of text, in pixels
1. The width of each column, again in pixels
1. The x and y coordinates of the start of the first line of text

**Attention!** The scikit-image library (and indeed most computer imaging systems) places the origin (coordinates 0,0) in the top left corner, and the y axis counts down. This is different from the 'standard' Cartesian coordinate system that you may have learned in school, where the origin is in the middle and the y axis counts up.
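
To make this concrete, here is a small sketch using a synthetic NumPy array rather than a real page. It illustrates that arrays are indexed as `[row, column]`, that is `[y, x]`, with row 0 at the top. The array `img` and the crop coordinates are invented for illustration.

```python
import numpy as np

# A synthetic greyscale image: 4 rows (height) by 6 columns (width), all black
img = np.zeros((4, 6), dtype=np.uint8)

# Arrays are indexed [row, column], i.e. [y, x], with (0, 0) at the top left
img[0, 0] = 255  # top-left pixel
img[3, 5] = 128  # bottom-right pixel: y = 3 (last row), x = 5 (last column)

print(img)

# Cropping the region x=2, y=1, width=3, height=2: rows (y) first, then columns (x)
x, y, w, h = 2, 1, 3, 2
crop = img[y:y + h, x:x + w]
print(crop.shape)
```

Note that an IIIF region is written as x,y,width,height, whereas array slicing puts the y range first.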

Open the image yourself, measure the details and insert the values into the code here:

In [None]:
im_addr = canvases[22]["images"][0]["resource"]["@id"]

start_x =
start_y =

line_height = 
col_width = 

lines_per_page = 

Now that we have these values, we can use IIIF to get the line images.

To do this, IIIF needs two pairs of coordinates for each image region:

1. The x and y coordinates of the starting corner of the image region (normally the upper left corner)
1. The width and height of the box (always in pixels)

For the first line on the page, this is easy:

1. The x and y coordinates of the starting corner are `start_x` and `start_y`
1. The width and height of the box are `col_width` and `line_height` respectively

For all the other lines, this is slightly more complicated:

1. The width and height of the box are always `col_width` and `line_height`: this never changes.
1. The x coordinate of the starting corner is always `start_x`: this never changes either. (But what would happen if we had two columns of text?)
1. The complicated part is the y value of the starting corner of our region. For any given line of text, we need to:
   1. Take the starting y coordinate (`start_y`)
   1. Add the `line_height` once for every line of text above the one we want; in other words, multiply `line_height` by the number of the line that we are looking for, counting from zero (0 for the first line, 1 for the second line etc.)
   1. If the number of the line is stored in a variable `line_no`, then the value that we need is therefore `start_y + line_height * line_no`
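
As a worked example of this arithmetic, the sketch below computes the starting y coordinate of each line. The measurements are hypothetical and should be replaced with your own. Note that the count starts at zero here, so that the first line begins exactly at the starting y coordinate.

```python
# Hypothetical measurements in pixels (replace them with your own)
example_start_y = 100
example_line_height = 50
example_lines = 4

# Line 0 starts at start_y, line 1 one line_height lower, and so on
y_values = [
    example_start_y + example_line_height * n for n in range(example_lines)
]
print(y_values)  # [100, 150, 200, 250]
```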

Let's test this. As usual, we will print out the URLs first to make sure it looks right before running our code. Here is what we want to do:

1. `for` each `line_no` in the list of line numbers from zero up to (but not including) the number of lines per page:
   1. Make a string containing each coordinate, separated by a comma, according to the instructions above
   1. Replace the relevant part of the URL with these coordinates (and also the 25% scaling)
   1. Print the URL so we can check that it's right
  
**Attention!** The URL needs to be a string, but our values of start_x, start_y etc. are numbers. We therefore need to tell Python to convert our numbers into strings, and we do this using the `str()` function.
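
For comparison, here is a small sketch of two equivalent ways of turning numbers into the comma-separated text that IIIF expects: concatenation with `str()`, and an f-string, which converts the values automatically. The four values are hypothetical and chosen only to show the formatting.

```python
# Hypothetical values in pixels, chosen only to show the formatting
x, y, w, h = 10, 20, 300, 40

# Explicit conversion with str(), joined with '+'
coord_concat = str(x) + "," + str(y) + "," + str(w) + "," + str(h)

# The same result with an f-string
coord_fstring = f"{x},{y},{w},{h}"

print(coord_concat)
print(coord_fstring)
print(coord_concat == coord_fstring)
```

The worksheet's own code uses `str()`, which makes the conversion step explicit.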

Here is the code:

In [None]:
for line_no in range(lines_per_page):
    coord = str(start_x) + "," + str(start_y + line_no*line_height) \
    + "," + str(col_width) + "," + str(line_height)
    line_addr = im_addr.replace("/full/full/0", "/" + coord + "/pct:25/0")
    print(line_addr)

Assuming that this worked, you can now change the code to actually download the images and store them in a list. Here is the code again, as well as a new list variable. You now need to replace the `print()` function with the instructions to download the image and store it in the list.

**Hints**:
1. You will need to use `io.imread()` and `append`.
1. We did almost exactly this just a little bit earlier, when we downloaded images of lines from different pages. Have another look at the code there, and you will see that you can copy and paste the necessary parts with only very small changes.

In [None]:
line_imgs = []

for lineno in range(lines_per_page):
    coord = "/" + str(start_x) + "," + str(start_y + lineno*line_height) \
    + "," + str(col_width) + "," + str(line_height)
    line_addr = im_addr.replace("/full/full/0", coord + "/pct:25/0")
    
    # Change this line to download the images and add them to line_imgs
    print(line_addr)

At this point we should have the images stored in `line_imgs`, which is a list. We now want to see that this worked! To do this, we need to loop through `line_imgs`, and `for` each image `in` our list, we should show it with `plt.imshow()`. Type in the code here to do this (hint: it's only two or three lines, and once again you can copy and paste it almost exactly from an earlier example).

# Taking it Further

We have now seen some fairly powerful techniques, insofar as we can harvest the file for an entire manuscript, and find all the images associated with that manuscript. With this information we could even write our own simple version of the [Mirador Viewer](https://demos.biblissima.fr/mirador/) if we wanted! There are many other things we could do, though, for instance searching the metadata for specific content (for an example see the [Biblissima Collections](https://iiif.biblissima.fr/collections/)). Play around a bit and see what you can do (but be careful to test your loops before hitting the IIIF servers!).
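
As one possible starting point for searching metadata, here is a minimal sketch that looks up entries in a manifest's `metadata` list, which in Presentation 2.0 is a list of records each with a `label` and a `value`. The helper `find_metadata` is hypothetical and the data in `toy_manifest` is invented. Real manifests may store labels and values as lists of language-tagged entries rather than plain strings, which this sketch does not handle.

```python
def find_metadata(manifest, wanted_label):
    """Return the values of all metadata entries whose label matches wanted_label."""
    matches = []
    for entry in manifest.get("metadata", []):
        # Compare labels case-insensitively; assumes labels are plain strings
        if str(entry.get("label", "")).lower() == wanted_label.lower():
            matches.append(entry.get("value"))
    return matches


# Invented example data in the Presentation 2.0 'metadata' shape
toy_manifest = {
    "label": "Example book",
    "metadata": [
        {"label": "Title", "value": "An example title"},
        {"label": "Date", "value": "1850"},
        {"label": "Language", "value": "French"},
    ],
}

print(find_metadata(toy_manifest, "date"))  # ['1850']
```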

---
![Licence Creative Commons](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work (the contents of this Jupyter Python notebook) is licenced under a [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/) licence.