**By Peter A. Stokes, École Pratique des Hautes Études – Université PSL**

Updated October 2022.

These are brief notes and exercises on working with TEI XML files using Python. They are intended as a practical component to a larger taught course. These notes assume a good knowledge of TEI XML and basic knowledge of Python. This notebook also assumes that [NumPy](https://www.numpy.org), [SciPy](https://www.scipy.org/scipylib/index.html), [Scikit-Image](https://scikit-image.org) and Matplotlib have already been installed in your Python system.

_If you are viewing this in Jupyter then you can edit the code simply by typing in the boxes. You can also execute the code in any box by clicking on the box and typing SHIFT + ENTER or using the 'Run' button in the menubar above._

# Setting the Scene

This exercise is relatively advanced and will require you to think carefully about images as arrays of numbers. It shows a (rough) way of counting the number of lines of text in the image of a page. It uses a fairly simple technique and only really works for very clean manuscripts with pretty regular lines of text.

In order to do this, we will again use the well-established libraries for numerical and data processing, NumPy, SciPy and Scikit-Image.

First, we import our libraries and set up our variables as usual.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from skimage import filters, segmentation
from skimage.io import imread
from skimage.color import rgb2gray
from scipy.signal import argrelmax, argrelmin, savgol_filter

# Create an empty list variable for use later
lines_per_page = []

# Store the path to the image file
f = "Montaignef22_25pct.jpg"

# Pre-Processing

Almost always when we work with images, we need to go through some pre-processing. This usually involves things like turning colour images into black and white, potentially clearing out any 'noise', and so on.

The first thing we want to do here is threshold the image. This means that we turn the image into a simple two-valued ('binary') image, in which dark sections (i.e. ink) have the value `True` and light sections (i.e. background) have the value `False`. The method we use here is reproduced from 'Simple Image Segmentation with Scikit-Image': http://douglasduhaime.com/posts/simple-image-segmentation-with-scikit-image.html It works using the same approach that we saw in the previous worksheet on 'Finding Rubrics in a Manuscript Page'. The details of how it works are a bit complicated, so don't worry if you don't understand it fully. In summary, though, it does the following:

1. Convert the image from colour to gray (since we don't need colour anymore).
1. Calculate the threshold, that is, how dark a pixel needs to be to count as ink rather than background. To do this we use a built-in function, `filters.threshold_otsu`.
1. Create a 'mask', that is, a map of all the pixels with value `True` if the pixel value is below the threshold, and `False` if it is above the threshold.
1. Use the built-in `segmentation.clear_border` function to remove any `True` pixels at the border of the image, as these are almost certainly just rubbish.

**You may receive a warning when you run this code**. If so then just ignore it: it's not our problem, and the code will still work.

This may also take some time, depending on the size of your image and the speed of your computer.

In [None]:
im_rgb = imread(f)
im = rgb2gray(im_rgb)

print('Grayscale image:')
plt.imshow(im, cmap='Greys_r')
print(im)

In [None]:
# Now we find a dividing line between 0 (black) and 1 (white), which is
# the range of values that rgb2gray gives us. Pixels below this value
# count as dark (ink), and pixels above this value count as light.

val = filters.threshold_otsu(im)

# The mask object converts each pixel in the image to True or False
# to indicate whether the given pixel is black or white
mask = im < val

# Now we remove any True regions that touch the border of the image
im_final = segmentation.clear_border(mask)

print('Masked binary image:')
plt.imshow(im_final, cmap='binary')

The line `mask = im < val` may not be clear to you, but in fact it's relatively simple: it applies the 'less-than' comparison `<` to each element of the NumPy array `im` in turn, comparing it with the single number `val`. The same thing works between two arrays of the same size, in which case each element in the first array is compared with the corresponding element in the second. Here is a smaller example to make this clearer. The first value in the result is the result of `l1[0] < l2[0]`, which in turn is `1 < 9`, or `True`. The second value is `l1[1] < l2[1]`, or `8 < 2`, which is `False`, and so on. Here is the example:

In [None]:
l1 = np.array([1,8,3,4,5])
l2 = np.array([9,2,7,6,5])

print(l1 < l2)

In 'normal' Python we would have to write a `for` loop to compare each element at a time, but when we use NumPy this is done for us. Not only is it easier to code, but normally it also runs much faster since NumPy can use internal tricks to be more efficient.
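
In the thresholding cell itself, `val` is a single number rather than an array, and NumPy compares every pixel against it. The following is a minimal sketch of that case. The array `tiny_im` and the threshold `tiny_val` are invented for illustration and do not come from the manuscript image.

```python
import numpy as np

# A hypothetical 3x4 'grayscale image': 0 is black, 1 is white
tiny_im = np.array([
    [0.9, 0.8, 0.9, 0.9],
    [0.1, 0.2, 0.9, 0.1],
    [0.9, 0.9, 0.8, 0.9],
])

# A made-up threshold (the notebook calculates the real one with Otsu's method)
tiny_val = 0.5

# Every element of tiny_im is compared with the single number tiny_val
tiny_mask = tiny_im < tiny_val
print(tiny_mask)

# True counts as 1 when summed, so this is the number of 'ink' pixels
print(tiny_mask.sum())
```

Summing a mask counts its `True` values, which is the property that the row and column sums in the next section rely on.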

# Finding Rows and Columns of Text

Now, finally, we have a nice binarised image, and so we can start our analysis. Specifically, we want to find the rows and columns of text. This is a surprisingly difficult job for a computer, and there are many very sophisticated methods around. We will use a very simple one here, but one that does still work at least in very easy cases.

To do this, first we want to add up all the pixels in each row and save it, and do the same per column. Fortunately this is very easy with NumPy.

In [None]:
row_vals = im_final.sum(axis=1)
col_vals = im_final.sum(axis=0)

# Show the outputs
print('Column values')
plt.plot(col_vals)
plt.show()

print('Row values')
plt.plot(row_vals)
plt.show()
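
If the `axis` argument is unclear, a very small example may help. This is only a sketch: `tiny_page` is an invented binary 'page' with two short 'lines of text', not real data.

```python
import numpy as np

# A hypothetical 4x4 binary page: True marks ink
tiny_page = np.array([
    [False, False, False, False],
    [True,  True,  True,  False],
    [False, False, False, False],
    [True,  True,  False, False],
])

# axis=1 adds up along each row: one total per row of pixels
tiny_row_vals = tiny_page.sum(axis=1)
# axis=0 adds up down each column: one total per column of pixels
tiny_col_vals = tiny_page.sum(axis=0)

print('Row totals:', tiny_row_vals)
print('Column totals:', tiny_col_vals)
```

The rows that contain ink have high totals and the empty rows have a total of zero, which is why lines of text appear as peaks in the plot of row values.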

These plots are useful, but they're very noisy: there are lots of 'ups and downs'. It would be better if we could smooth out the lines a bit. Let's use a fancy function called a 'Savitzky-Golay filter'. You don't really need to understand all the details of how it works; we can just take it and use it. That's the beauty of using libraries that other people have created!

The only hard part here is that the Savitzky-Golay filter needs two parameters, and it's difficult to figure out what they should be, partly because they depend on the size of your image. The first number (the length of the smoothing window) must be an odd number; experimenting suggests to me that it should be about 1/50 of the total height. The second number (the order of the polynomial used for the smoothing) seems to work with '3'.

In [None]:
win_length = int(len(row_vals) / 50)

# Remember that the window length must be an odd number
# The % is the 'modulo' or 'remainder' operator: here it gives
# the remainder if win_length is divided by two.
# The += operator means 'add this to the current value of the variable'
# (This could be done in a single line: do you see how?)

if win_length % 2 == 0:
    win_length += 1

print(win_length)


smoothed = savgol_filter(row_vals, win_length, 3)
plt.plot(smoothed)
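
To get a feel for what the filter does, here is a small sketch on an artificial signal rather than on the page image. Everything in it is an assumption made for illustration: the sine wave stands in for three regular lines of text, and the noise level, window length (51) and random seed are arbitrary choices.

```python
import numpy as np
from scipy.signal import argrelmax, savgol_filter

# An artificial 'row profile': three regular bumps plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 6 * np.pi, 600)
clean = np.sin(x)
noisy = clean + rng.normal(0, 0.3, x.size)

# Smooth with an odd window length and a polynomial of order 3
demo_smoothed = savgol_filter(noisy, 51, 3)

# Count the local maxima before and after smoothing
n_noisy_peaks = len(argrelmax(noisy)[0])
n_smooth_peaks = len(argrelmax(demo_smoothed)[0])
print('Local maxima before smoothing:', n_noisy_peaks)
print('Local maxima after smoothing:', n_smooth_peaks)
```

The smoothed signal has far fewer local maxima than the noisy one, which is the property we need before we start looking for peaks.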

If you look carefully at the plot of `smoothed`, you will see that it comprises a number of peaks. Each peak here corresponds to a row of text, with its position on the horizontal axis giving the y-coordinate of the line of text. This means that to find the lines, we want to find the peaks in the row values. This again is fairly easy to do with the SciPy signal-processing library (`scipy.signal`):

In [None]:
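# Largest difference (in pixels) allowed between two consecutive
# line spacings for the spacing to count as 'regular'
min_diff = 1.5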

peaks, = argrelmax(smoothed, order=10)  # NOTE THE COMMA AFTER 'peaks'!
print(peaks)

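# Keep each peak whose distance to the next peak is almost the same as
# the distance between the following two peaks (i.e. regular spacing)
good_val_list = []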
for i in range(len(peaks)-2):
    diff = peaks[i+1] - peaks[i]
    diff2 = peaks[i+2] - peaks[i+1]
    if abs(diff2 - diff) < min_diff:
        print("Line", peaks[i], "Regular")
        print(diff2-diff)
        good_val_list.append(peaks[i])
        
print(good_val_list)
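
The comma after `peaks` is needed because `argrelmax` returns a tuple containing one array of positions for each dimension of the input, even when the input has only one dimension. A minimal sketch, using an invented signal called `demo_signal`:

```python
import numpy as np
from scipy.signal import argrelmax

# A made-up signal with three obvious peaks
demo_signal = np.array([0, 2, 0, 1, 5, 1, 0, 3, 0])

# Without the comma we get a tuple that contains a single array
result = argrelmax(demo_signal)
print(type(result), result)

# With the comma, Python unpacks that one-element tuple for us
demo_peaks, = argrelmax(demo_signal)
print(demo_peaks)
```

The `peaks` array printed for the real page is read in the same way: each number is the position of one local maximum in `smoothed`.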

However, some of these are 'false' peaks, namely only small peaks caused by other things on the page. Let's count only those peaks which are greater than a particular value: we can try only those peaks that are at least one third of the highest peak. You may need to change this depending on your image.

In [None]:
min_peak_height = smoothed.max() / 3

are_true_peaks = smoothed[peaks] > min_peak_height
row_peaks = peaks[are_true_peaks]

print('Your script has found', len(row_peaks), 
      'lines of text in your image.')
print('The y-coordinates of the lines of text are', row_peaks)
print("Height of page in pixels:", row_peaks[-1] - row_peaks[0])
print("Top margin in pixels:", row_peaks[1] - row_peaks[0])
print("Bottom margin:", row_peaks[-1] - row_peaks[-2])
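
The two lines that build `are_true_peaks` and `row_peaks` use a NumPy technique called boolean indexing. Here is a small sketch with invented numbers (`demo_signal` and `demo_peaks` are not taken from the page):

```python
import numpy as np

# A made-up smoothed signal, and the positions of its three peaks
demo_signal = np.array([0.0, 9.0, 0.0, 2.0, 0.0, 6.0, 0.0])
demo_peaks = np.array([1, 3, 5])

# One third of the highest value, as in the cell above
demo_min_height = demo_signal.max() / 3

# demo_signal[demo_peaks] looks up the height of each peak;
# the comparison then gives one True or False per peak
keep = demo_signal[demo_peaks] > demo_min_height
print(keep)

# Indexing with the True/False array keeps only the True entries
demo_true_peaks = demo_peaks[keep]
print(demo_true_peaks)
```

The small peak at position 3 is dropped because its height (2) is below the cut-off (3), while the other two positions are kept.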

There can be a small problem here, depending on the image, namely that the system often identifies the top and bottom edges of the page as lines of text. (Can you see why? Hint: look closely at the results of the segmentation image above.) This is no problem, though: we can use list slicing to remove the first and last element of the list. The code to do this is as follows:

In [None]:
row_peaks = row_peaks[1:-1]
print('The y-coordinates of the lines of text are', row_peaks)
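
As a reminder of how this slice works, here is a tiny sketch. The values in `demo_rows` are invented, with the first and last standing in for the page edges.

```python
import numpy as np

# Hypothetical peak positions: 12 and 505 are 'page edges', not text
demo_rows = np.array([12, 140, 190, 240, 290, 505])

# Start at index 1 (the second element) and stop before the last element
inner_rows = demo_rows[1:-1]
print(inner_rows)
```

Note that the slice always removes the first and last values, so it is worth checking the plot for your own image to make sure that they really are page edges.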

## Finding the Text Column

Now we need to detect the coordinates of the column. The process is similar, in that we smooth the `col_vals` and then look for certain results. Let's start by smoothing the signal as we did before:

In [None]:
win_length = int(len(col_vals) / 30)

# Remember that the window length must be an odd number
# The % is the 'modulo' or 'remainder' operator: here it gives
# the remainder if win_length is divided by two.
# The += operator means 'add this to the current value of the variable'

if win_length % 2 == 0:
    win_length += 1

col_smoothed = savgol_filter(col_vals, win_length, 3)
plt.plot(col_smoothed)

Look carefully at the result. You will see that there is a big wide section in the middle which corresponds to the column of text. We need to find the start and end of this wide peak. Notice, also, that the value is very low just before the big jump to the wide section. This suggests that the easiest way is to look for the _minimum_ value rather than the _maximum_:

In [None]:
peaks, = argrelmin(col_smoothed, order=10)  # NOTE COMMA AFTER 'peaks'!

are_true_peaks = col_smoothed[peaks] < 0
col_peaks = peaks[are_true_peaks]

print(col_peaks)
print("Column width in pixels:", col_peaks[1] - col_peaks[0])
print("Left margin width in pixels:", col_peaks[0])
print("Right margin width in pixels:", col_peaks[2] - col_peaks[1])
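
The test `col_smoothed[peaks] < 0` may look odd, since a count of pixels can never be negative. The smoothed curve, however, can dip below zero where it undershoots next to a sharp rise, and the cell above uses those dips as markers for the edges of the column. The sketch below shows the logic on an invented profile called `demo_col`; the numbers are made up so that the dips are easy to see.

```python
import numpy as np
from scipy.signal import argrelmin

# A made-up smoothed column profile: dips below zero at positions 1, 6
# and 8, and a shallow dip *inside* the column at position 4
demo_col = np.array([3.0, -1.0, 8.0, 9.0, 8.0, 9.0, -2.0, 4.0, -1.0, 2.0])

# Find every local minimum (note the comma, as before)
demo_minima, = argrelmin(demo_col)
print(demo_minima)

# Keep only the minima that fall below zero
demo_col_peaks = demo_minima[demo_col[demo_minima] < 0]
print(demo_col_peaks)

print('Column width:', demo_col_peaks[1] - demo_col_peaks[0])
```

The minimum at position 4 lies inside the column and is discarded because it is not below zero. Whether your own image produces such dips depends on the page and on the smoothing window, so check the plot before relying on this.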

Now that we've found the column and rows, we want to convert them into the `start_x`, `start_y` etc. that we need for our IIIF code from Worksheet 4. Most of this is very easy: the only slightly complicated bit is finding the height of each line. This can vary slightly, so in order to get the best results let's find the average height and go from there:

In [None]:
# To find the line height, calculate the average difference between lines
line_heights = []
for i in range(len(row_peaks)-1):
    h = row_peaks[i+1] - row_peaks[i]
    line_heights.append(h)

line_height = np.mean(line_heights)

print("Average line height is", line_height, "pixels")

start_x = col_peaks[0]
col_width = col_peaks[1] - start_x
# NB that the values here measure from the *middle* of each line,
# so for the *top* of the line we have to subtract half the line height
start_y = row_peaks[0] - (line_height / 2)
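
The same arithmetic can be checked by hand on a few invented numbers. This sketch uses `np.diff`, a NumPy function that returns the differences between neighbouring values, in place of the `for` loop; the values in `demo_row_peaks` are made up.

```python
import numpy as np

# Hypothetical y-coordinates of the middles of four lines of text
demo_row_peaks = np.array([100, 150, 201, 250])

# Differences between neighbours: 150-100, 201-150, 250-201
demo_heights = np.diff(demo_row_peaks)
print(demo_heights)

demo_line_height = demo_heights.mean()
print('Average line height is', demo_line_height, 'pixels')

# The top of the first line is half a line height above its middle
demo_start_y = demo_row_peaks[0] - (demo_line_height / 2)
print('start_y is', demo_start_y)
```

Here the three gaps are 50, 51 and 49 pixels, so the average line height is 50 and the top of the first line is at 100 - 25 = 75.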

# Putting it Together

We can put all of this together into a single process that reads in the image, finds the text block and the lines, and calculates the different coordinates of the text, column, lines etc. In order to do this more efficiently we can use _functions_. This means that we can define a set of instructions and re-use them later, rather than typing out the same thing again and again. In this case, we have called the function `process()`, and it takes one parameter, namely the filename `f`. To use the function after we have defined it, we simply store the filename in a variable (e.g. `filename`) and then tell Python to `process(filename)`.

In [None]:
def process(f):
    # Threshold and mask the image. The code here is reproduced from 
    # 'Simple Image Segmentation with Scikit-Image': 
    # http://douglasduhaime.com/posts/ [...]
    # [...] simple-image-segmentation-with-scikit-image.html
    im = rgb2gray(imread(f))

    # find a dividing line between 0 (black) and 1 (white)
    # pixels below this value count as dark (ink)
    # pixels above this value count as light (background)
    val = filters.threshold_otsu(im)
    
    # the mask object converts each pixel in the image to True or False
    # to indicate whether the given pixel is black/white
    mask = im < val

    # Remove any border noise
    imfinal = segmentation.clear_border(mask)

    row_vals = imfinal.sum(axis=1)
    col_vals = imfinal.sum(axis=0)
    
    # About 1/30 of the total seems to work for the window length.
    # Remember that the win length must be odd
    win_row_length = int(len(row_vals) / 30)
    if win_row_length % 2 == 0:
        win_row_length += 1

    win_col_length = int(len(col_vals) / 30)
    if win_col_length % 2 == 0:
        win_col_length += 1

    row_smoothed = savgol_filter(row_vals, win_row_length, 3)
    col_smoothed = savgol_filter(col_vals, win_col_length, 3)
    
    # TODO: need a way of calculating the order parameters
    row_peaks, = argrelmax(row_smoothed, order=10)
    col_peaks, = argrelmin(col_smoothed, order=10)
    
    min_row_peak_height = row_smoothed.max() / 3

    are_true_row_peaks = row_smoothed[row_peaks] > min_row_peak_height
    row_peaks = row_peaks[are_true_row_peaks]
    row_peaks = row_peaks[1:-1]
    
    are_true_col_peaks = col_smoothed[col_peaks] < 0
    col_peaks = col_peaks[are_true_col_peaks]
    
    lines_per_page = len(row_peaks)
        
    line_heights = []
    for i in range(lines_per_page - 1):
        h = row_peaks[i+1] - row_peaks[i]
        line_heights.append(h)

    line_height = np.mean(line_heights)
    start_x = col_peaks[0]
    start_y = row_peaks[0] - (line_height / 2)
    col_width = col_peaks[1] - start_x

    
    return (start_x, start_y, col_width, line_height, lines_per_page)

Now that we have it in a function, it's very easy to use and reuse: 

In [None]:
(start_x, start_y, col_width, line_height, lines_per_page) = process(f)

print('Results for image', f)
print('\tLines per page\t\t', lines_per_page)
print('\tText-block start (x,y)\t', start_x, start_y)
print('\tColumn width (px)\t', col_width)
print('\tLine height (px)\t', line_height)
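
The way `process()` hands back several values at once relies on Python tuples. Here is a much smaller sketch of the same pattern. The function `measure()` is hypothetical and has nothing to do with images; it is only there to show how the values come back out.

```python
def measure(values):
    """Hypothetical helper: return the smallest and largest value."""
    smallest = min(values)
    largest = max(values)
    # Several results are returned together as one tuple
    return (smallest, largest)

# We can keep the tuple as a single object...
result = measure([4, 9, 2, 7])
print(result)

# ...or unpack it straight into separate variables, as we did above
low, high = measure([4, 9, 2, 7])
print(low, high)
```

The order matters: the variables on the left are filled in the same order as the values in the `return` statement.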

# Further Steps

Look very closely at the last function and its results, and see if you can understand it all. In particular, pay attention to how the results of the function are passed back out to the rest of the code. This is definitely more advanced and may be too much for now. However, if this does make sense to you then it opens up some very interesting possibilities. From here you could easily do the following:

* Write software which takes a TEI XML document, marked up in the documentary view, with a URL to the IIIF manifest in the header. From there it could:
  * Do a search of the contents of the TEI
  * Automatically download the image(s) of matching page(s) from the IIIF server
  * Automatically detect the lines of text on the page(s)
  * Find the coordinates of all lines of text containing the search term
  * Display the image(s) of the page(s), with boxes drawn around the corresponding lines of text
  * And/or, display the lines of text alongside the images of those lines, like we saw in [Models of Authority](https://www.modelsofauthority.ac.uk/digipal/search/facets/?text_type=Transcription&page=1&img_is_public=1&locus=face&result_type=clauses&view=images).
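
As a starting point for the coordinate step, the values returned by `process()` are enough to work out a box for every line. The following is a rough sketch under stated assumptions, not a finished tool: the names `line_boxes` and `iiif_region` are invented here, every line is assumed to have the same height and width, and the numbers in the example are made up. The `x,y,w,h` string is the form that the IIIF Image API uses for a region given in pixels.

```python
def line_boxes(start_x, start_y, col_width, line_height, lines_per_page):
    """Hypothetical helper: one (x, y, width, height) box per line."""
    boxes = []
    for i in range(lines_per_page):
        x = int(round(start_x))
        # Each line starts one line height below the previous one
        y = int(round(start_y + i * line_height))
        w = int(round(col_width))
        h = int(round(line_height))
        boxes.append((x, y, w, h))
    return boxes

def iiif_region(box):
    """Hypothetical helper: format a box as an 'x,y,w,h' region string."""
    return ",".join(str(v) for v in box)

# Made-up values standing in for the results of process()
demo_boxes = line_boxes(100, 75.0, 400, 50.0, 3)
for box in demo_boxes:
    print(iiif_region(box))
```

If the image that you analysed was a reduced-size copy, the coordinates would need to be scaled up before being used against the full-size image on a IIIF server.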
  
There are a few other things that could be done here:
* First, the function above is very 'monolithic', meaning that it does everything and is a bit repetitive. It would be much better to break it up into different functions.
* The style of programming here isn't very good, in that it's a bit clumsy and does not use more advanced Python features such as list comprehensions. You could easily rewrite it to be much more elegant as you learn more Python.
* The system for detecting lines and columns of text is very simplistic. It works relatively well for simple printed books like the one we've been using, but it fails very quickly when it comes to more complex or irregular cases. There are _much_ more advanced methods out there which you can find very easily if you look around on the internet. You could easily improve the methods here by implementing some of these more advanced techniques.
* There are, of course, many other possibilities here, depending on your imagination!
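
As one small example of breaking things up, the window-length calculation appears twice in `process()` and could be moved into a function of its own. This is only a sketch of one possible refactoring: the name `odd_window_length` and its default divisor of 30 are choices made here for illustration.

```python
def odd_window_length(n_values, divisor=30):
    """Hypothetical helper: an odd window length of about n_values / divisor."""
    # // is integer division, so no call to int() is needed
    win_length = n_values // divisor
    # The window length must be odd, so add one to any even number
    if win_length % 2 == 0:
        win_length += 1
    return win_length

# Made-up sizes for illustration
print(odd_window_length(1500, 50))
print(odd_window_length(1530))
print(odd_window_length(600))
```

Inside `process()`, the two blocks that calculate `win_row_length` and `win_col_length` could then each become a single call to this function.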

If these last steps are a bit too much then you really shouldn't worry. It's meant to give you a taste of what is possible, and the fact that you have got this far is a good reason to celebrate!

So, most of all, go, play with Python, and have fun!

---
![Licence Creative Commons](https://i.creativecommons.org/l/by/4.0/88x31.png)
This work (the contents of this Jupyter Python notebook) is licensed under a [Creative Commons Attribution 4.0 International](http://creativecommons.org/licenses/by/4.0/) licence.