Demonstration/proof-of-concept code to take a document from an existing instance of eScriptorium and extract images of all the letters for palaeographical analysis.
The code assumes the following:
- You must have an account on an instance of [eScriptorium](https://escriptorium.readthedocs.io).
- You must set the URL of your specific instance of eScriptorium in the variable `base_url` below.
- You must have a document already in that instance and already transcribed automatically **without any manual correction**. The code will list available transcriptions using their names as stored in eScriptorium and ask which one you wish to use. Note that it will not offer any manual transcriptions.
- You must know your API key (if not, see https://escriptorium.readthedocs.io/en/latest/users/#review-and-edit-your-profile), and it is recommended that you set this as an environment variable using `export ESCRIPTORIUM_KEY={key value}` (for Linux/Mac) on the command line. Otherwise the code will ask you for this API key when you run the relevant cell. **You must exercise caution here to make sure that your key is not left visible in the notebook**, as it will be if you put it directly in the code, or if you enter it into the Jupyter input field and then don't clear the cell output. **If you do leave it there then anyone who has access to the notebook will also effectively have your password to eScriptorium.**
- You must know the ID number (pk) of the document you want to analyse, and this should be stored in the `doc_id` variable below. The easiest way to find this is to go to your document in eScriptorium and find the number in the URL after 'document' (e.g. for https://msia.escriptorium.fr/document/4982/images/ it would be 4982).
- For the first part of the notebook you must also pick one page (image) for analysis from your document and put the relevant ID in the `part_id` variable below. Again you can get this from the URL of your page in eScriptorium (it will be the second number, after the document ID). At the end of the notebook you will analyse an entire document rather than just one page, but it's always better to start with a sample to make sure that you understand what the code does and that it does what you want.
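
The rule for reading the IDs off a URL can be sketched in a few lines. This is only an illustration: `ids_from_url` is a hypothetical helper, not part of eScriptorium or of this notebook, and it simply assumes that the first number in the URL path is the document ID and the second, if present, is the part ID.

```python
from urllib.parse import urlparse

def ids_from_url(url):
    """Return (doc_id, part_id) from the numeric path segments of a URL.

    Hypothetical helper: assumes the first number in the path is the
    document ID and the second (if any) is the part ID.
    """
    path = urlparse(url).path
    numbers = [int(seg) for seg in path.split('/') if seg.isdecimal()]
    if not numbers:
        raise ValueError(f"No numeric ID found in {url!r}")
    doc_id = numbers[0]
    part_id = numbers[1] if len(numbers) > 1 else None
    return doc_id, part_id

# The example URL from the list above contains only a document ID
print(ids_from_url("https://msia.escriptorium.fr/document/4982/images/"))
# prints (4982, None)
```

The values returned could then be copied into `doc_id` and `part_id` by hand, which keeps the choice of document explicit.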

Peter Stokes, EPHE-PSL, March 2025

In [None]:
import os
import json
import requests
from requests.compat import urljoin
from skimage import io
from matplotlib import pyplot as plt

In [None]:
# Default verbose setting (can be overridden as required)
isVerbose = False

In [None]:
# Set up authentication headers to access eScriptorium (see explanation above)

api_token = os.getenv('ESCRIPTORIUM_KEY')

if not(api_token):
    api_token = input("No ESCRIPTORIUM_KEY env variable found. Please enter your key here:")
    
headers = {'Content-type':'application/json', 'Accept':'application/json', 'Authorization': f'Token {api_token}'}
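
One possible way to reduce the risk of leaving the key visible is to read it with `getpass`, which does not echo what is typed (in a Jupyter kernel this normally appears as a masked input field). The sketch below is an alternative under that assumption, not the method used in this notebook; `get_api_token` and `build_headers` are hypothetical names.

```python
import os
from getpass import getpass

def get_api_token(env_var='ESCRIPTORIUM_KEY', prompt=getpass):
    """Read the key from the environment, falling back to a hidden prompt."""
    token = os.getenv(env_var)
    if not token:
        # getpass does not echo the input, so the key is not stored in the cell output
        token = prompt(f"No {env_var} env variable found. Please enter your key here: ")
    return token.strip()

def build_headers(token):
    """Build the same request headers as above from a given token."""
    return {
        'Content-type': 'application/json',
        'Accept': 'application/json',
        'Authorization': f'Token {token}',
    }
```

With these in place, `headers = build_headers(get_api_token())` would replace the lines above. The key is still held in memory for as long as the kernel runs.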

In [None]:
# CHANGE THESE VALUES according to your instance of eScriptorium, and the ID of the document and page you want to use.
# See above for details.

base_url = "https://msia.escriptorium.fr/api/"

# Capitoli dei frati
#doc_id = 3048
#part_id = 559323

# St Gallen MS
#doc_id = 2913
#part_id = 544917

doc_id = 2917
part_id = 545037

In [None]:
# A helper function to get all the elements in a page of JSON results from the API

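def get_page(part_url, page=1, verbose=isVerbose):
    if verbose:
        print('fetching', part_url, 'page', page)

    # Let requests build the query string rather than joining it by hand
    res = requests.get(part_url, headers=headers, params={'page': page})

    # Fail early with a clear HTTP error (e.g. for a wrong API key or a wrong ID)
    # instead of returning None and failing later in the calling code
    res.raise_for_status()

    return res.json()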

In [None]:
# A helper function to get all the pages of results from the API
# Note that 'page' here refers to paginated API results and has nothing to do with pages of a document!

def get_paged_elements(element_url):
    elems = []

    page_no = 0
    has_next_page = True

    while has_next_page:
        page_no += 1
        data = get_page(element_url, page_no)
        elems.extend(data['results'])
        has_next_page = data['next']
            
    return elems
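
An alternative to counting page numbers is to follow the `next` value directly. The sketch below assumes that `next` holds the full URL of the following page (or nothing on the last page), as is common for paginated REST APIs; this has not been checked against eScriptorium here. `iter_paged` and `fake_api` are hypothetical names, and the fake dictionary only stands in for the real API so that the example runs offline.

```python
def iter_paged(first_url, fetch):
    """Yield every item from a paginated listing by following its 'next' links.

    fetch is any callable that takes a URL and returns the decoded JSON page.
    """
    url = first_url
    while url:
        data = fetch(url)
        yield from data['results']
        url = data.get('next')

# Hypothetical stand-in for the API: two pages of results
fake_api = {
    'page1': {'results': ['a', 'b'], 'next': 'page2'},
    'page2': {'results': ['c'], 'next': None},
}

print(list(iter_paged('page1', fake_api.__getitem__)))
# prints ['a', 'b', 'c']
```

Against a real instance, `fetch` could be something like `lambda url: requests.get(url, headers=headers).json()`. Being a generator, this does not need to hold all the results in memory at once.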

In [None]:
# A helper function to download the image of a given document and part

def get_image(doc_nu, part_nu, verbose=isVerbose):
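    part_data = get_page(urljoin(base_url, f"documents/{doc_nu}/parts/{part_nu}/"), verbose=verbose)
    
    # The image URI is given relative to the site root, so resolve it against base_url
    im_addr = urljoin(base_url, part_data['image']['uri'])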
    
    if verbose:
        print("Image found at", im_addr)
    
    im = io.imread(im_addr)

    return im

In [None]:
# A helper function to get all the transcription lines for a given document, part (image) and transcription

def get_transcriptions(doc_nu, part_nu, transcr_nu, verbose=isVerbose):
    tr_data = get_paged_elements(urljoin(base_url, f"documents/{doc_nu}/parts/{part_nu}/transcriptions/"))
    
    transcriptions = []
    for t in tr_data:
        if t['transcription'] == transcr_nu:
            transcriptions.append(t)

    if verbose:
        print(f"\nFound {len(transcriptions)} lines for this transcription:")
        print(" / ".join([t['content'] for t in transcriptions]))

    return transcriptions

In [None]:
# Pick out a character image based on its position from the automatic transcription (the kraken cuts).
# Note that the kraken cuts give only the approximate position, as kraken is not a character-level OCR system (it doesn't specifically identify characters),
# so we need to fiddle a bit to get useful coordinates.
# Normally the vertical coordinates are pretty good but the horizontal ones are much too small, so
# try to infer position based on the midpoint between the kraken cuts of this and the previous and following characters.
# Also add a bit of margin, assuming that it's better to have too big cuts than too small ones.
# TODO: better to have fixed-size images (will see why in next step)

def get_graph_image(line_graphs, g_idx, img, verbose=isVerbose):
    g = line_graphs[g_idx]

    # If we don't have a regular box for polygon then we just leave it for now
    if len(g['poly']) != 4:
        return None

    y_min = g['poly'][0][1]
    y_max = g['poly'][1][1]

    # Look for midpoint between this and the previous cut, unless this is the first character in the line
    if (g_idx > 0) and len(line_graphs[g_idx-1]['poly']) >= 4:
        x_min = (g['poly'][0][0] + line_graphs[g_idx-1]['poly'][3][0]) // 2
    else:
        x_min = g['poly'][0][0]

    # Look for midpoint between this and the next cut, unless this is the last character in the line
    if (g_idx < len(line_graphs) - 1) and len(line_graphs[g_idx+1]['poly']) >= 1:
        x_max = (g['poly'][3][0] + line_graphs[g_idx+1]['poly'][0][0]) // 2
    else:
        x_max = g['poly'][3][0]

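    # Add some horizontal margin for good measure
    # Compute the margin once so that it is the same on both sides, and don't go past the left edge of the image
    margin = (x_max - x_min) // 3
    x_min = max(0, x_min - margin)
    x_max += margin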

    if verbose:
        print(g['c'], x_min, y_min, x_max, y_max)

    # Slice rows and columns only, so that greyscale (2D) images work as well as colour ones
    return img[y_min:y_max, x_min:x_max]
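
The horizontal adjustment described in the comments above can be tried out on plain numbers, without any image. This is a minimal sketch of the midpoint-plus-margin idea with a symmetric margin; `widen_cut` and its parameters are hypothetical names introduced only for illustration.

```python
def widen_cut(left, right, prev_right=None, next_left=None, img_width=None):
    """Widen a character cut to the midpoints with its neighbours, then add a margin.

    left/right are the x coordinates of this cut; prev_right and next_left are
    the facing edges of the neighbouring cuts (None at the ends of the line).
    """
    x_min = left if prev_right is None else (left + prev_right) // 2
    x_max = right if next_left is None else (right + next_left) // 2

    # One third of the widened box, added equally on both sides
    margin = (x_max - x_min) // 3
    x_min = max(0, x_min - margin)
    x_max = x_max + margin
    if img_width is not None:
        x_max = min(img_width, x_max)
    return x_min, x_max

print(widen_cut(100, 110, prev_right=90, next_left=120))
# prints (89, 121)
```

Here the midpoints are 95 and 115, the margin is 20 // 3 = 6, and the final box runs from 89 to 121.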

In [None]:
# Function to get all character images from a given set of transcriptions (normally lines from a given page)
# Adds the character images to an existing dictionary keyed to the character

def get_char_imgs(transcriptions, char_imgs, img, verbose=isVerbose):
    
    for l in transcriptions:
        for idx, g in enumerate(l['graphs']):
            newchar = g['c']
            newim = get_graph_image(l['graphs'], idx, img, verbose=verbose)
            
            if newchar in char_imgs:
                char_imgs[newchar].append(newim)
            else:
                char_imgs[newchar] = [newim]
    return char_imgs



In [None]:
# Display a list of images in a grid with a set number of columns.
# Intended to be used for a list of images of letters (graphs), though could be used for any list of images.

def plot_chars(imgs, cols=12):
    # Number of images
    num_images = len(imgs)
    if num_images == 0:
        print("No character images found")
        return
    
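    # Determine the number of rows necessary for the given number of images and columns (rounding up)
    rows = (num_images + cols - 1) // cols
    
    # Create a figure with subplots; squeeze=False keeps axes as a 2D array even for a single row or column
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows+1), squeeze=False)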
    
    # Flatten the axes array for easy iteration
    axes = axes.flatten()
    
    # Iterate over the images and corresponding axes to display each image
    # Be careful: we can sometimes have missing (None) or zero-sized images
    for i in range(num_images):
        if (imgs[i] is not None) and (imgs[i].size > 0):
            axes[i].imshow(imgs[i])
        axes[i].axis('off')  # Hide the axis, even when there is nothing to show
    
    # Hide any remaining empty subplots
    for j in range(num_images, rows * cols):
        axes[j].axis('off')
    
    # Display the grid of images
    plt.tight_layout()
    plt.savefig(f"doc{doc_id}.jpg")
    plt.show()
    
    return plt

In [None]:
# Get all the character images for a given page and store in a dictionary keyed to the character

def get_chars_per_page(doc_id, part_id, transcr_id, verbose=isVerbose):
    img = get_image(doc_id, part_id, verbose=verbose)
    trans = get_transcriptions(doc_id, part_id, transcr_id, verbose=verbose)
    
    char_imgs = {}
    return get_char_imgs(trans, char_imgs, img)

In [None]:
# Get the list of transcriptions and ask the user which one they want
# Assume we only have one page of transcriptions.
# Only include transcriptions which have an average confidence, assuming that these should be kraken generated.

def select_transcription(doc_id):
    transcr_list_url = urljoin(base_url, f"documents/{doc_id}/transcriptions/")
    
    transcr_full_list = get_page(transcr_list_url)
    transcr_list = [t for t in transcr_full_list if t['avg_confidence'] and not t['archived']]
    
    print("Please select one of the following transcriptions by number in the list (0, 1, 2...):")
    for idx, t in enumerate(transcr_list):
        print(f"  {idx}. {t['name']}")
    
    # Treat anything that is not a whole number as an invalid choice rather than crashing
    try:
        trans_idx = int(input("\nTranscription no. "))
    except ValueError:
        trans_idx = -1
    
    if 0 <= trans_idx < len(transcr_list):
        transcr_id = int(transcr_list[trans_idx]['pk'])
        print(f"Transcription id. {transcr_id} selected")
    else:
        transcr_id = None
        print("Can't find selected transcription")

    return transcr_id
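
To see what the filter in `select_transcription` keeps, it can be run on a few made-up records. The sample data below is invented for illustration and only mimics the three fields used above (`name`, `avg_confidence`, `archived`); `automatic_transcriptions` is a hypothetical name.

```python
def automatic_transcriptions(records):
    """Keep transcriptions that have an average confidence and are not archived."""
    return [t for t in records if t.get('avg_confidence') and not t.get('archived')]

# Invented sample records, not real API output
sample = [
    {'pk': 1, 'name': 'manual', 'avg_confidence': None, 'archived': False},
    {'pk': 2, 'name': 'kraken run', 'avg_confidence': 0.93, 'archived': False},
    {'pk': 3, 'name': 'old kraken run', 'avg_confidence': 0.88, 'archived': True},
]

print([t['name'] for t in automatic_transcriptions(sample)])
# prints ['kraken run']
```

Note that a test for truthiness also drops a transcription whose average confidence is exactly 0, which is probably acceptable here.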

In [None]:
# Ask the user for the transcription and then load all the graph images for the selected page into memory...

transcr_id = select_transcription(doc_id)
char_imgs = get_chars_per_page(doc_id, part_id, transcr_id)

In [None]:
# ... and show the results for a character

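# Use .get() so that a character which does not occur on this page gives a message rather than a KeyError
plot_chars(char_imgs.get('&', []))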

In [None]:
# Now use get_paged_elements to loop through all the document parts and generate images for the entire document
# Storing all the images in memory may be an issue here, so will probably need something more sophisticated for larger documents

# First get all the part records for the document and filter out those parts which don't have the relevant transcription
#doc_id = 4982
doc_id = 2917
transcr_id = select_transcription(doc_id)

part_list = get_paged_elements(urljoin(base_url, f"documents/{doc_id}/parts/"))
part_list_transcribed = [p for p in part_list if p['transcription_progress']==100]

In [None]:
# Now go through all the parts, extract the cuts and store them in memory
# Note this mixes everything indiscriminately; we may prefer to keep the cuts by page so we can find them afterwards

full_char_dict = {}

for p in part_list_transcribed:
    print(f"Processing page {p['title']} ({p['pk']})")
    try:
        tempdict = get_chars_per_page(doc_id, int(p['pk']), transcr_id)
        
        for key in tempdict:
            if key in full_char_dict:
                full_char_dict[key].extend(tempdict[key])
            else:
                full_char_dict[key] = tempdict[key]
    except Exception as e:
        # Catch any exception and print the error message
        print(f"An error occurred: {e}")
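
If the cuts need to be traced back to their page afterwards, one option is to store the part ID and position alongside each image rather than the image alone. The following is a sketch of that idea only; `merge_page` is a hypothetical helper, the second part ID is invented, and the strings stand in for real image arrays.

```python
from collections import defaultdict

def merge_page(full_dict, part_pk, page_dict):
    """Add one page's cuts to full_dict, remembering the page and position of each cut."""
    for char, imgs in page_dict.items():
        for idx, img in enumerate(imgs):
            full_dict[char].append((part_pk, idx, img))
    return full_dict

cuts = defaultdict(list)
merge_page(cuts, 545037, {'a': ['img0', 'img1'], 'b': ['img2']})
merge_page(cuts, 545038, {'a': ['img3']})  # 545038 is an invented part ID

print([(pk, idx) for pk, idx, _ in cuts['a']])
# prints [(545037, 0), (545037, 1), (545038, 0)]
```

The images for plotting could then be recovered with something like `[img for _, _, img in cuts['a']]`.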

In [None]:
# Test by taking a character
char = 'a'
samplesize = 1000

print(f"Found {len(full_char_dict[char])} samples of character {char}; showing first {samplesize}")

theplot = plot_chars(full_char_dict[char][0:samplesize])

In [None]:
# Now do the same but this time save the images into a subfolder by the identified character for later processing
# Name the image by docid, partid then incremental number (the character is given by the subfolder). This way we can find the character in the page relatively easily
# Note that some file systems (e.g. the defaults on macOS and Windows) are not case sensitive for folder names, in which case minuscule and majuscule forms will be mixed.

from PIL import Image

output_path = os.path.join(os.path.expanduser('~'), 'Image_data', 'escr_chars', str(doc_id))

os.makedirs(output_path, exist_ok=True)

for p in part_list_transcribed:
    print(f"Processing page {p['title']} ({p['pk']})")
    try:
        char_dict = get_chars_per_page(doc_id, int(p['pk']), transcr_id)
    except Exception as e:
        # Report the error and skip this page; otherwise char_dict would be undefined
        # or would still hold the cuts from the previous page
        print(f"An error occurred: {e}")
        continue
    
    for key in char_dict:
        subdir = os.path.join(output_path, key)
        os.makedirs(subdir, exist_ok=True)
        for idx, char in enumerate(char_dict[key]):
            # Skip cuts that could not be extracted or that came out empty
            if char is None or char.size == 0:
                continue
            try:
                img = Image.fromarray(char)
                img.save(os.path.join(subdir, f"char-{doc_id}-{p['pk']}-{idx}.png"))
            except Exception as e:
                # Report the error but carry on with the remaining images
                print(f"An error occurred: {e}")
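
Using the character itself as a folder name can cause trouble beyond case sensitivity, for example with `/` or with a space. One possible workaround, sketched here as an assumption rather than as part of the code above, is to name each folder after the Unicode code point of the character. `char_folder_name` is a hypothetical helper.

```python
def char_folder_name(char):
    """Build a folder name from Unicode code points, e.g. 'U+0061' for 'a'."""
    return '_'.join(f"U+{ord(c):04X}" for c in char)

for c in ['a', 'A', '/', '&']:
    print(repr(c), char_folder_name(c))
# prints:
# 'a' U+0061
# 'A' U+0041
# '/' U+002F
# '&' U+0026
```

With this, `subdir = os.path.join(output_path, char_folder_name(key))` would keep minuscule and majuscule forms apart on any file system, at the cost of less readable folder names.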