## Handwritten Text Recognition with Deep Learning

This notebook explores the development of a Handwritten Text Recognition (HTR) service using deep learning techniques. The service aims to accurately transcribe handwritten text from images, enabling various applications such as document processing, form automation, and data extraction.

**Inspiration and Architecture:**

The HTR service is based on the architecture presented in the Keras Handwriting Recognition example ([source](https://keras.io/examples/vision/handwriting_recognition/)). That architecture combines convolutional layers for feature extraction, bidirectional LSTM layers for sequence modeling, and a CTC loss that allows training without per-character alignment. The model in this notebook has been further customized for deployment.

This notebook delves into the step-by-step process of building, training, and evaluating the HTR model. We'll explore essential components like data preparation, model architecture design, training techniques, and performance evaluation metrics. By following along, you'll gain a comprehensive understanding of how deep learning can be harnessed for effective handwritten text recognition.

## Online Demo

An online demo is available at https://hamiddamadi.ir/app/textRecognition.

## Essential Libraries for Handwritten Text Recognition

This cell imports several essential libraries for building and manipulating the handwritten text recognition (HTR) model:

* **`os`:** Provides operating system functionalities, potentially useful for tasks like:
    * File path manipulation during data loading from the file system.
    * Saving and managing model checkpoints during training.
* **`numpy`:** Offers powerful array manipulation and mathematical operations, crucial for various tasks like:
    * Data pre-processing, such as image resizing and normalization.
    * Representing and manipulating image data as NumPy arrays.
    * Performing calculations within the model, such as matrix multiplications.
* **`tensorflow`:** Serves as the core deep learning framework for building and training the HTR model. It provides tools for:
    * Defining the model architecture, including layers and their connections.
    * Performing computations on tensors (multidimensional data arrays) during training and inference.
    * Optimizing the model's performance through techniques like backpropagation.
* **`sklearn.model_selection`:** Offers functionalities for splitting data into training and testing sets. This is essential for:
    * Evaluating the model's performance on unseen data during training.
    * Ensuring the model generalizes well to new handwritten text samples.
* **`matplotlib.pyplot`:** Enables data visualization using plots and charts. This can be helpful for:
    * Visualizing data distributions, such as the distribution of characters in the training data.
    * Understanding training progress, such as plotting the model's loss over training epochs.
    * Analyzing results, such as visualizing predicted vs. ground truth text for evaluation purposes.

These libraries will be utilized throughout the notebook, playing crucial roles in building, training, and evaluating the HTR model to effectively recognize handwritten text.

In [None]:
import os
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

This cell defines various constants and hyperparameters that will be used throughout the notebook for the handwritten text recognition (HTR) model:

* **`IMAGE_SIZE`:** This tuple represents the fixed size to which all input images will be resized. In this case, images will be resized to 128 pixels wide and 32 pixels high. This ensures consistency in the input data and simplifies processing within the model.
* **`BATCH_SIZE`:** This integer specifies the number of images processed by the model in each training iteration (one weight update, not one epoch). A batch size of 64 means the model will update its weights based on the gradients calculated from 64 images at a time. Choosing an appropriate batch size can impact training speed, memory usage, and convergence.
* **`EPOCHS`:** This integer represents the maximum number of times the entire training dataset will be passed through the model during training. In this case, the model will be trained for at most 50 epochs; the early stopping callback configured later can end training sooner. More epochs give the model more opportunities to learn patterns from the data, at the cost of longer training and a higher risk of overfitting.
* **`PADDING_TOKEN`:** This integer represents a special token used for padding sequences during text processing. It is often used to ensure all sequences have the same length, which is necessary for certain model architectures. A value of 99 is assigned as the padding token in this case.

These constants and hyperparameters will be used in various parts of the notebook, including data pre-processing, model building, and training. Carefully choosing and adjusting these values can significantly impact the model's performance and effectiveness in recognizing handwritten text.


In [None]:
IMAGE_SIZE = (128, 32)
BATCH_SIZE = 64
EPOCHS = 50
PADDING_TOKEN = 99
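
To make the relationship between batch size, iterations, and epochs concrete, the sketch below counts the weight updates implied by the settings above. The dataset size `num_train_samples` is a hypothetical value chosen for illustration, not the actual size of the training split.

```python
import math

BATCH_SIZE = 64
EPOCHS = 50


def steps_per_epoch(num_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


num_train_samples = 10_000  # hypothetical dataset size, for illustration only
steps = steps_per_epoch(num_train_samples, BATCH_SIZE)

print(steps)           # 157 iterations per epoch (the last batch is partial)
print(steps * EPOCHS)  # 7850 updates at most, if early stopping never triggers
```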

## Creating Datasets

### Preprocessing the Handwritten Text Dataset

This cell defines and executes the `preprocess_dataset` function, which is responsible for loading and processing the handwritten text dataset:

**Data Path:**

* **`DATA_INPUT_PATH`:** This variable stores the path to the dataset, assumed to be located in the `/kaggle/input/iam-handwriting-word-database` directory. This path needs to be adjusted if the data is located elsewhere.

**Function Details:**

1. **Open the labels file:** The function opens the `words.txt` file located in the `iam_words` subdirectory of the data directory. This file contains information about each image and its corresponding label (the handwritten text).
2. **Iterate over lines:** It iterates through each line in the file, skipping comments and empty lines.
3. **Extract information:** 
    * **Word ID:** Identifies the unique identifier for the word.
    * **Image filename:** Constructs the filename for the corresponding image based on the word ID.
    * **Label:** Extracts the handwritten text associated with the image.
4. **Process image and label:** 
    * **Image path:** Constructs the full path to the image file using the data directory and filename.
    * **Check image existence:** Verifies if the image file exists and has a non-zero size to avoid processing non-existent or empty images.
    * **Add to lists:** 
        * Appends the image path to the `images_path` list.
        * Appends the label to the `labels` list.
        * Adds each unique character from the label to the `characters` set.
    * **Update max length:** Tracks the maximum length (number of characters) across all labels to be used later for padding.

**Character Mapping:**

* After processing all lines, the function:
    * Sorts the unique characters encountered in the labels to create a consistent order.
    * Defines two `StringLookup` layers from TensorFlow Keras:
        * `char_to_num`: Maps each unique character to a unique integer, enabling efficient processing within the model.
        * `num_to_char`: Maps the integer back to the original character, useful for decoding predictions later.

**Running the Function:**

* The `preprocess_dataset` function is called at the end of the cell to initiate the data preprocessing process.

**Printing Information:**

* The function prints the following information after processing the data:
    * List of unique characters (`characters`)
    * Maximum length of any label (`max_len`)

This preprocessed data, including the image paths, labels, character mapping, and maximum sequence length, will be used for subsequent steps like loading images, preparing sequences, and building the HTR model.
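
The line-parsing steps described above can be sketched in isolation with the standard library. The function name `parse_words_line`, the `root` argument, and the sample line are hypothetical and only illustrate the assumed layout: the first field is the word ID and the last field is the transcription.

```python
import os


def parse_words_line(line, root="iam_words/words"):
    """Return (image_path, label) for one line, or None for comments and blank lines."""
    if line.startswith("#") or not line.strip():
        return None

    parts = line.strip().split()
    word_id = parts[0]                    # e.g. "a01-000u-00-00"
    form, page = word_id.split("-")[:2]   # "a01", "000u"

    # Images are nested as <root>/<form>/<form>-<page>/<word_id>.png
    image_path = os.path.join(root, form, f"{form}-{page}", f"{word_id}.png")
    label = parts[-1]                     # last field holds the transcription
    return image_path, label


# Illustrative line in the general shape of a words.txt entry (not copied from the dataset).
sample = "a01-000u-00-00 ok 154 408 768 27 51 AT A"
print(parse_words_line(sample))
```

Keeping the parsing separate from the file-existence check makes it easy to test on a single line before running it over the whole file.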


In [None]:
DATA_INPUT_PATH = "/kaggle/input/iam-handwriting-word-database"

images_path = []
labels = []

def preprocess_dataset():
    characters = set()
    max_len = 0
    with open(os.path.join(DATA_INPUT_PATH, 'iam_words', 'words.txt'), 'r') as file:
        lines = file.readlines()

        for line_number, line in enumerate(lines):
            # Skip comments and empty lines
            if line.startswith('#') or line.strip() == '':
                continue

            # Split the line and extract information
            parts = line.strip().split()

            # The first field is the word ID; its dash-separated prefixes name the image folders
            word_id = parts[0]

            first_folder = word_id.split("-")[0]
            second_folder = first_folder + '-' + word_id.split("-")[1]

            # Construct the image filename
            image_filename = f"{word_id}.png"
            image_path = os.path.join(
                DATA_INPUT_PATH, 'iam_words', 'words', first_folder, second_folder, image_filename)

            # Check if the image file exists
            if os.path.isfile(image_path) and os.path.getsize(image_path):

                images_path.append(image_path)

                # Extract labels
                label = parts[-1].strip()
                for char in label:
                    characters.add(char)

                max_len = max(max_len, len(label))
                labels.append(label)

    characters = sorted(list(characters))

    print('characters: ', characters)
    print('max_len: ', max_len)
    # Mapping characters to integers.
    char_to_num = tf.keras.layers.StringLookup(
        vocabulary=list(characters), mask_token=None)

    # Mapping integers back to original characters.
    num_to_char = tf.keras.layers.StringLookup(
        vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True
    )
    return characters, char_to_num, num_to_char, max_len
    
characters, char_to_num, num_to_char, max_len = preprocess_dataset()
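
As a rough mental model of what the two lookup layers do, the following plain-Python sketch builds a character-to-integer map and its inverse. It assumes index 0 is reserved for unknown characters, mirroring a lookup with one out-of-vocabulary slot and no mask token; the helper names are hypothetical and this is not how `StringLookup` is implemented internally.

```python
def build_char_maps(labels):
    """Build char -> int and int -> char maps, reserving index 0 for unknown characters."""
    characters = sorted(set("".join(labels)))
    char_to_index = {ch: i + 1 for i, ch in enumerate(characters)}
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    index_to_char[0] = "[UNK]"
    return char_to_index, index_to_char


def encode(text, char_to_index):
    """Map each character to its integer, using 0 for characters never seen."""
    return [char_to_index.get(ch, 0) for ch in text]


def decode(indices, index_to_char):
    """Map integers back to characters."""
    return "".join(index_to_char[i] for i in indices)


char_to_index, index_to_char = build_char_maps(["cat", "act", "tab"])
encoded = encode("cab", char_to_index)
print(encoded)                         # [3, 1, 2]
print(decode(encoded, index_to_char))  # cab
```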

### Preprocessing Functions for Images and Labels

This cell defines three functions used for pre-processing images and labels:

**1. `distortion_free_resize`:**

* This function resizes an image while preserving its aspect ratio using TensorFlow's `tf.image.resize` function with `preserve_aspect_ratio=True`.
* It then calculates the amount of padding needed to make the image size match the target size defined by `img_size` (a tuple of width and height).
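* The function splits the padding between the two sides of each axis; when the amount is odd, the top or left side receives the extra pixel.
* Finally, it performs the following operations:
    * Pads the image with the calculated padding amounts.
    * Transposes the image so that the width becomes the first axis, which the model later treats as the time dimension.
    * Flips the transposed image with `tf.image.flip_left_right`. Every image receives the same flip, so this is a fixed preprocessing step carried over from the Keras example rather than data augmentation.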

**2. `preprocess_image`:**

* This function takes an image path and the target image size (`img_size`) as input.
* It reads the image file using `tf.io.read_file`.
* Decodes the PNG image into a grayscale image using `tf.image.decode_png` with a channel dimension of 1.
* Calls the `distortion_free_resize` function to resize and pad the image.
* Normalizes the image pixel values to the range [0, 1] by dividing by 255.0.
* Returns the preprocessed image as a floating-point tensor.

**3. `vectorize_label`:**

* This function takes a label string (the handwritten text) as input.
* It maps each character in the label to its corresponding integer using the `char_to_num` StringLookup layer defined earlier.
* It calculates the length of the label sequence.
* It pads the label sequence with the `PADDING_TOKEN` (value of 99) up to the maximum length (`max_len`) defined earlier. This ensures all labels have the same length, which is necessary for certain model architectures.
* Returns the padded label sequence as a tensor of integers.

These functions will be instrumental in transforming the raw image data and labels into a format suitable for training the HTR model. The preprocessed images and labels will be used in the next steps for data loading and model building.
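
The resize-then-pad arithmetic can be checked without TensorFlow. The sketch below computes the resized dimensions and the padding on each side for a given original image size. It assumes the aspect-preserving resize scales by the smaller of the two ratios and rounds to the nearest pixel; the exact rounding inside `tf.image.resize` may differ, and the function name is hypothetical.

```python
def resize_and_pad_plan(orig_h, orig_w, target_w=128, target_h=32):
    """Return ((new_h, new_w), (top, bottom), (left, right)) for an aspect-preserving resize."""
    # Scale by the tighter constraint so the image fits inside the target box.
    scale = min(target_h / orig_h, target_w / orig_w)
    new_h, new_w = round(orig_h * scale), round(orig_w * scale)

    pad_h = target_h - new_h
    pad_w = target_w - new_w

    # When the padding is odd, the top/left side receives the extra pixel.
    top, bottom = pad_h - pad_h // 2, pad_h // 2
    left, right = pad_w - pad_w // 2, pad_w // 2
    return (new_h, new_w), (top, bottom), (left, right)


print(resize_and_pad_plan(64, 200))  # ((32, 100), (0, 0), (14, 14))
print(resize_and_pad_plan(50, 300))  # ((21, 128), (6, 5), (0, 0))
```

In both cases the resized size plus padding adds up to the 32 x 128 target, which is the property the model input depends on.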

In [None]:
def distortion_free_resize(image, img_size):
    w, h = img_size
    image = tf.image.resize(image, size=(h, w), preserve_aspect_ratio=True)

    # Check the amount of padding needed to be done.
    pad_height = h - tf.shape(image)[0]
    pad_width = w - tf.shape(image)[1]

    # Only necessary if you want to do same amount of padding on both sides.
    if pad_height % 2 != 0:
        height = pad_height // 2
        pad_height_top = height + 1
        pad_height_bottom = height
    else:
        pad_height_top = pad_height_bottom = pad_height // 2

    if pad_width % 2 != 0:
        width = pad_width // 2
        pad_width_left = width + 1
        pad_width_right = width
    else:
        pad_width_left = pad_width_right = pad_width // 2

    image = tf.pad(
        image,
        paddings=[
            [pad_height_top, pad_height_bottom],
            [pad_width_left, pad_width_right],
            [0, 0],
        ],
    )

    image = tf.transpose(image, perm=[1, 0, 2])
    image = tf.image.flip_left_right(image)
    return image

def preprocess_image(image_path, img_size):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_png(image, 1)
    image = distortion_free_resize(image, img_size)
    image = tf.cast(image, tf.float32) / 255.0
    return image

def vectorize_label(label):
    label = char_to_num(tf.strings.unicode_split(
        label, input_encoding="UTF-8"))
    length = tf.shape(label)[0]
    pad_amount = max_len - length
    label = tf.pad(label, paddings=[[0, pad_amount]],
                   constant_values=PADDING_TOKEN)
    return label
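
The effect of label padding is easy to see on plain lists. This is a minimal sketch with hypothetical helper names; it mirrors the right-padding done by `vectorize_label` and shows how the padding can be stripped again when reading labels back.

```python
PADDING_TOKEN = 99


def pad_label(encoded, max_len, padding_token=PADDING_TOKEN):
    """Right-pad an encoded label so that every label has length max_len."""
    if len(encoded) > max_len:
        raise ValueError("label is longer than max_len")
    return encoded + [padding_token] * (max_len - len(encoded))


def strip_padding(padded, padding_token=PADDING_TOKEN):
    """Remove padding tokens to recover the original encoded label."""
    return [idx for idx in padded if idx != padding_token]


padded = pad_label([3, 1, 2], max_len=6)
print(padded)                 # [3, 1, 2, 99, 99, 99]
print(strip_padding(padded))  # [3, 1, 2]
```

The padding value has to be one that can never appear as a real character index, which is why a value well above the vocabulary size is used.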

### Preparing the Handwritten Text Dataset

This cell defines two functions to prepare the dataset for training the HTR model:

**1. `process_images_labels`:**

* This function takes an image path and its corresponding label as input.
* It utilizes the previously defined functions:
    * `preprocess_image`: Reads, resizes, normalizes, and preprocesses the image.
    * `vectorize_label`: Converts the label string into a padded sequence of integers.
* It returns a dictionary containing the preprocessed image and label.

**2. `prepare_dataset`:**

* This function takes lists of image paths and labels as input.
* It creates a TensorFlow dataset using `tf.data.Dataset.from_tensor_slices`.
* It applies the `process_images_labels` function to each element (image path, label) in the dataset using parallel processing (controlled by `num_parallel_calls=AUTOTUNE`). This improves efficiency by utilizing multiple CPU cores.
* It uses the following techniques for further optimization:
    * **Batching:** Groups elements into batches of size `BATCH_SIZE` (defined earlier) for efficient processing during training.
    * **Caching:** Stores the preprocessed data in memory to avoid redundant processing on subsequent epochs.
    * **Prefetching:** Overlaps data preprocessing with training to improve training speed.

**Data Size Information:**

* The function prints the lengths of the `image_paths` and `labels` lists to verify dataset size.

**Output:**

* This function returns the prepared dataset, which is now ready to be used for training the HTR model.

By preparing the dataset in this way, we ensure that the images and labels are efficiently processed and fed into the model during training, leading to improved performance and faster training times.

In [None]:
def process_images_labels(image_path, label):
    image = preprocess_image(image_path, IMAGE_SIZE)
    label = vectorize_label(label)
    return {"image": image, "label": label}

def prepare_dataset(image_paths, labels):
    AUTOTUNE = tf.data.AUTOTUNE
    print('len(image_paths): ', len(image_paths))
    print('len(labels): ', len(labels))
    dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).map(
        process_images_labels, num_parallel_calls=AUTOTUNE
    )
    return dataset.batch(BATCH_SIZE).cache().prefetch(AUTOTUNE)

### Splitting the Dataset

This cell defines a function `split_dataset` and calls it to split the prepared dataset into training, validation, and test sets:

**1. `split_dataset` function:**

* This function performs the following steps:
    * **Initial split:** Splits the entire dataset (image paths and labels) into training and testing sets using `sklearn.model_selection.train_test_split`. The `test_size` parameter is set to 0.2, allocating 20% of the data for the testing set. The `random_state` parameter is set to 42 for reproducibility.
    * **Further split:** Splits the held-out set further into validation and final test sets using another `train_test_split` call with `test_size=0.5`, so each receives half of the held-out 20% (10% of the original dataset).
* **Prepare datasets:** Uses the `prepare_dataset` function defined earlier to prepare separate datasets for training, validation, and testing.

**2. Splitting and preparing:**

* The cell calls `split_dataset()` to execute the splitting and preparation steps.
* The cell assigns the resulting training, validation, and test sets to the respective variables:
    * `train_set`: Contains the training data for training the model.
    * `val_set`: Contains the validation data for monitoring model performance during training.
    * `test_set`: Contains the final test data for evaluating the model's generalization ability on unseen data.

**Output:**

* This cell does not explicitly print any output, but it assigns the split and prepared datasets to variables for use in subsequent parts of the notebook, namely model building and training.

By splitting the data into separate training, validation, and test sets, we:

* Ensure the model is evaluated during training on validation data it has not been trained on, giving a less biased view of its performance.
* Reserve a portion of the data for final testing to assess the model's ability to generalize to completely new handwritten text samples.

In [None]:
def split_dataset():
    # Split the data into training, validation, and test sets using train_test_split
    train_images, test_images, train_labels, test_labels = train_test_split(
        images_path, labels, test_size=0.2, random_state=42
    )

    # Further split the test set into validation and final test sets
    val_images, test_images, val_labels, test_labels = train_test_split(
        test_images, test_labels, test_size=0.5, random_state=42
    )

    train_set = prepare_dataset(train_images, train_labels)
    val_set = prepare_dataset(val_images, val_labels)
    test_set = prepare_dataset(test_images, test_labels)
    
    return train_set, val_set, test_set

train_set, val_set, test_set = split_dataset()
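
The 80/10/10 proportions can be illustrated with a standard-library sketch. It shuffles with a fixed seed and slices the result; it is not the algorithm `train_test_split` uses internally and will not reproduce its exact assignment of samples. The function name is hypothetical.

```python
import random


def three_way_split(items, holdout_fraction=0.2, seed=42):
    """Shuffle items, hold out a fraction, then halve the holdout into validation and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]

    n_val = len(holdout) // 2
    return train, holdout[:n_val], holdout[n_val:]


train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed makes the split reproducible, which matters when results from different training runs are compared.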

## Building the Handwritten Text Recognition Model (HTR)

### Custom CTC Layer

This code defines a custom layer named `CTCLayer` that inherits from `tf.keras.layers.Layer`. This layer is specifically designed for handling the Connectionist Temporal Classification (CTC) loss function commonly used in sequence recognition tasks like handwritten text recognition (HTR).

**Initialization:**

* The `__init__` method initializes the layer and sets the internal loss function to `tf.keras.backend.ctc_batch_cost`. This function calculates the CTC loss, which is essential for training the HTR model.

**Call method:**

* This method defines the forward pass of the layer, taking the ground truth labels (`y_true`) and model predictions (`y_pred`) as input.
* It performs the following steps:
    * **Calculate shapes:** Extracts relevant dimensions from the input shapes:
        * Batch size (`batch_len`) from the number of samples in the ground truth labels.
        * Predicted sequence length (`input_length`) from the shape of the predictions.
        * Ground truth label length (`label_length`) from the shape of the labels.
    * **Reshape lengths:** Reshapes the sequence lengths to have a batch dimension (size of batch x 1) for compatibility with the CTC loss function.
    * **Calculate loss:** Calculates the CTC loss using the internal `loss_fn` and adds it to the model's total loss using `self.add_loss`.
    * **Return predictions:** Returns `y_pred` unchanged. The loss is computed on every call, so inference uses a separate prediction model that stops at the softmax output and never invokes this layer.

**Key Points:**

* This custom layer simplifies the integration of the CTC loss into the HTR model by encapsulating the loss calculation logic within the layer itself.
* The layer handles reshaping the sequence lengths to the appropriate format required by the CTC loss function.
* Because it returns the predictions unchanged, the softmax output feeding it can be reused directly for decoding and evaluation.

By incorporating this layer into the model, we can effectively train the HTR model using the CTC loss function, enabling it to learn to recognize handwritten text sequences.

In [None]:
class CTCLayer(tf.keras.layers.Layer):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = tf.keras.backend.ctc_batch_cost

    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

        input_length = input_length * \
            tf.ones(shape=(batch_len, 1), dtype="int64")
        label_length = label_length * \
            tf.ones(shape=(batch_len, 1), dtype="int64")
        loss = self.loss_fn(y_true, y_pred, input_length, label_length)
        self.add_loss(loss)

        # At test time, just return the computed predictions.
        return y_pred
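
CTC relates a per-timestep output sequence to a shorter label by merging consecutive repeats and then removing a special blank symbol. The sketch below shows that collapse rule and a simple best-path (greedy) decoder on plain Python lists. Which class index plays the blank role depends on the implementation, so it is passed explicitly here; the function names and the toy probabilities are hypothetical.

```python
def ctc_collapse(path, blank):
    """Merge consecutive repeated indices, then drop blanks."""
    collapsed = []
    previous = None
    for idx in path:
        if idx != previous and idx != blank:
            collapsed.append(idx)
        previous = idx
    return collapsed


def greedy_decode(probs, blank):
    """Pick the most likely class at each timestep, then apply the CTC collapse rule."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return ctc_collapse(path, blank)


BLANK = 3  # toy setup: classes 0..2 are characters, class 3 is the blank

# A blank between two identical indices keeps both of them.
print(ctc_collapse([1, 1, 3, 1, 2, 2, 3], blank=BLANK))  # [1, 1, 2]

toy_probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.2, 0.6, 0.1],
]
print(greedy_decode(toy_probs, blank=BLANK))  # [1, 2]
```

The CTC loss sums the probability of every per-timestep path that collapses to the ground-truth label, which is why no per-character alignment is needed in the training data.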

This cell defines and builds the HTR model using TensorFlow Keras. Here's a breakdown of the code:

**Model Inputs:**

* **Image input:** The model takes an image as input using a `tf.keras.Input` layer named "image". The input shape is defined as `(IMAGE_SIZE[0], IMAGE_SIZE[1], 1)`, where:
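    * `IMAGE_SIZE[0]`: Width of the image (128 pixels in this case). It comes first because preprocessing transposes each image, so this axis acts as the time dimension.
    * `IMAGE_SIZE[1]`: Height of the image (32 pixels in this case).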
    * The final dimension of 1 indicates a grayscale channel.
* **Label input:** The model also takes the corresponding label (the handwritten text) as input using another `tf.keras.Input` layer named "label". This layer has a shape of `(None,)`, indicating a sequence of characters with variable length.

**Model Architecture:**

The model follows a sequence of convolutional and recurrent layers:

* **Convolutional layers:**
    * Two convolutional layers with 3x3 kernels and ReLU activation are used to extract features from the input image.
    * Max pooling layers are used after each convolutional layer to reduce the dimensionality and introduce translation invariance.
* **Reshaping layer:**
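    * The output of the second max pooling layer is reshaped to prepare it for the recurrent layers.
    * The two 2x2 pooling steps shrink each spatial dimension by a factor of 4, so the new shape is `(128 // 4, (32 // 4) * 64) = (32, 512)`: 32 timesteps, each with the 8 remaining height positions merged with the 64 filters.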
* **Dense layer:**
    * A dense layer with 64 units and ReLU activation is added for further feature extraction.
* **Dropout layer:**
    * A dropout layer with a rate of 0.2 is used to prevent overfitting.
* **Bidirectional LSTMs:**
    * Two bidirectional LSTMs with 128 and 64 units are used, respectively.
        * Bidirectional LSTMs process the sequence in both directions, capturing dependencies from both past and future elements in the sequence.
        * Return sequences are set to `True` to allow processing the entire sequence at once.
        * Dropout is applied to each LSTM layer (with a rate of 0.25) to further prevent overfitting.
* **Output layer:**
    * A dense layer with a number of units equal to the size of the `char_to_num` vocabulary plus 2 is used.
        * The Keras example this model follows reserves the two extra units for special tokens introduced by the CTC loss, such as the blank label.
    * The output layer uses the softmax activation function to predict the probability distribution over the characters for each timestep in the sequence.

**CTC Layer:**

* A custom `CTCLayer` defined earlier is used as the final layer.
    * It takes the ground truth labels and the model predictions as input.
    * It calculates the CTC loss and adds it to the model's total loss.
    * During testing, it simply returns the predictions.

**Model Compilation:**

* The model is compiled using the Adam optimizer with a learning rate of 0.001.
* A summary of the model architecture is printed using `model.summary()`.

**Calling the Function:**

* The `build_model` function is called, which defines and compiles the HTR model.

This model architecture utilizes convolutional layers to extract features from the images and recurrent layers (LSTMs) to learn the temporal dependencies between characters in the sequence. The CTC loss function is used during training to guide the model towards learning accurate representations of handwritten text sequences.

By building and compiling this model, we are now ready to train it on the prepared dataset using the CTC loss function, enabling it to learn and recognize handwritten text.

In [None]:
def build_model():
    input_img = tf.keras.Input(
        shape=(IMAGE_SIZE[0], IMAGE_SIZE[1], 1), name="image")
    labels = tf.keras.layers.Input(name="label", shape=(None,))

    x = tf.keras.layers.Conv2D(
        32,
        (3, 3),
        activation="relu",
        kernel_initializer="he_normal",
        padding="same",
    )(input_img)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(
        64,
        (3, 3),
        activation="relu",
        kernel_initializer="he_normal",
        padding="same",
    )(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    new_shape = ((IMAGE_SIZE[0] // 4), (IMAGE_SIZE[1] // 4) * 64)
    x = tf.keras.layers.Reshape(target_shape=new_shape)(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True, dropout=0.25)
    )(x)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True, dropout=0.25)
    )(x)
    x = tf.keras.layers.Dense(
        len(char_to_num.get_vocabulary()) + 2, activation="softmax", name="dense2"
    )(x)
    output = CTCLayer(name="ctc_loss")(labels, x)
    model = tf.keras.models.Model(
        inputs=[input_img, labels], outputs=output
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    model.summary()
    return model
    
model = build_model()
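
To see how the tensor shapes evolve through the network, the sketch below traces them by hand (batch dimension omitted). It assumes `padding="same"` convolutions that keep the spatial size, 2x2 pooling that halves it, and bidirectional layers that concatenate both directions. The value of `num_classes` is hypothetical, since the real one depends on the vocabulary found in the dataset.

```python
IMAGE_SIZE = (128, 32)


def trace_shapes(image_size, num_classes):
    """Return (layer description, output shape) pairs for the architecture above."""
    w, h = image_size
    return [
        ("input", (w, h, 1)),
        ("conv 32 + max pool", (w // 2, h // 2, 32)),
        ("conv 64 + max pool", (w // 4, h // 4, 64)),
        ("reshape", (w // 4, (h // 4) * 64)),
        ("dense 64", (w // 4, 64)),
        ("bidirectional LSTM 128", (w // 4, 2 * 128)),
        ("bidirectional LSTM 64", (w // 4, 2 * 64)),
        ("softmax output", (w // 4, num_classes)),
    ]


# num_classes=81 is a made-up value used only to complete the trace.
for name, shape in trace_shapes(IMAGE_SIZE, num_classes=81):
    print(f"{name:<24} {shape}")
```

The first axis stays at 32 from the reshape onward, so the model emits 32 per-timestep predictions per image regardless of how long the word is.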

### Edit Distance Callback Class

This code defines a custom callback class named `EditDistanceCallback` that inherits from `tf.keras.callbacks.Callback`. This callback calculates and monitors the edit distance between predicted and ground truth labels during training:

**Initialization:**

* Takes the following arguments in its constructor:
    * `pred_model`: The model used for making predictions (typically the built HTR model).
    * `max_len`: The maximum length of the label sequences.
    * `validation_images`: Validation images used for calculating edit distance.
    * `validation_labels`: Corresponding validation labels for the images.

**`calculate_edit_distance` Method:**

* This method calculates the average edit distance between a batch of predictions and their corresponding ground truth labels.
* It performs the following steps:
    * Converts the ground truth labels to sparse tensors.
    * Makes predictions using the provided model and converts them to sparse tensors.
    * Calculates the edit distance between each predicted sequence and its corresponding label using the `tf.edit_distance` function.
    * Calculates and returns the average edit distance across the entire batch.

**`on_epoch_end` Method:**

* This method is called at the end of each training epoch.
* It iterates through the validation batches stored in `validation_images` and `validation_labels`:
    * For each batch:
        * Makes predictions using the `prediction_model`.
        * Calculates the batch's average edit distance using the `calculate_edit_distance` method.
    * Calculates the mean of these per-batch averages for the current epoch.
    * Prints the mean edit distance for the current epoch.

**Purpose:**

* This callback provides a way to monitor the model's performance on the validation set in terms of edit distance, which reflects the number of edits (insertions, deletions, substitutions) required to transform the predicted sequence into the ground truth label.
* By tracking the edit distance over time, we can gain insights into the model's ability to learn and recognize handwritten text accurately.

By incorporating this callback during training, we can monitor the model's progress beyond just the training and validation losses, providing a more comprehensive understanding of its performance on unseen data.

In [None]:
class EditDistanceCallback(tf.keras.callbacks.Callback):
    def __init__(self, pred_model, max_len, validation_images, validation_labels):
        super().__init__()
        self.prediction_model = pred_model
        self.max_len = max_len
        self.validation_images = validation_images
        self.validation_labels = validation_labels

    def calculate_edit_distance(self, labels, predictions, max_len):
        # Get a single batch and convert its labels to sparse tensors.
        sparse_labels = tf.cast(tf.sparse.from_dense(labels), dtype=tf.int64)

        # Make predictions and convert them to sparse tensors.
        input_len = np.ones(predictions.shape[0]) * predictions.shape[1]
        predictions_decoded = tf.keras.backend.ctc_decode(
            predictions, input_length=input_len, greedy=True
        )[0][0][:, :max_len]
        sparse_predictions = tf.cast(
            tf.sparse.from_dense(predictions_decoded), dtype=tf.int64
        )

        # Compute individual edit distances and average them out.
        edit_distances = tf.edit_distance(
            sparse_predictions, sparse_labels, normalize=False
        )
        return tf.reduce_mean(edit_distances)

    def on_epoch_end(self, epoch, logs=None):
        edit_distances = []

        for i in range(len(self.validation_images)):
            labels = self.validation_labels[i]
            predictions = self.prediction_model.predict(
                self.validation_images[i])
            edit_distances.append(self.calculate_edit_distance(
                labels, predictions, self.max_len).numpy())

        print(
            f"Mean edit distance for epoch {epoch + 1}: {np.mean(edit_distances):.4f}"
        )

## Training the Handwritten Text Recognition (HTR) Model

This code block defines and executes the training process for the HTR model:

**Preparing Validation Data:**

* Retrieves validation images and labels from the `val_set` dataset and stores them in separate lists for later use in the custom callback.

**Creating a Prediction Model:**

* Creates a new model, `prediction_model`, using the layers of the original model needed for inference. This model accepts images as input and outputs the character probabilities before the CTC layer.
* Extracts these layers using `model.get_layer(name="image").input` and `model.get_layer(name="dense2").output`.

**Setting Up Callbacks:**

* **EditDistanceCallback:** 
    * An instance of the custom `EditDistanceCallback` class is created, providing the prediction model, maximum label length, validation images, and labels.
    * This callback monitors the edit distance on the validation set during training.
* **EarlyStopping:** 
    * A standard `EarlyStopping` callback is instantiated to monitor validation loss and stop training if it doesn't improve for 10 epochs, restoring the best weights found so far.

**Training the Model:**

* The `model.fit` function initiates the training process with the following parameters:
    * `train_set`: The training dataset for model training.
    * `validation_data=val_set`: The validation dataset for evaluating performance during training.
    * `epochs=EPOCHS`: The maximum number of epochs to train (unless interrupted by early stopping).
    * `callbacks=[edit_distance_callback, early_stopping]`: The specified callbacks to monitor training and potentially stop it early.

**Return Value:**

* The function returns a tuple: the training history object `hist`, which records the loss values for each epoch, and `prediction_model`, the inference model built earlier in the function.

**Execution:**

* Calling `train_model()` starts training with the specified settings and callbacks. Its two return values are unpacked into the `history` and `prediction_model` variables.

In [None]:
def train_model():

    validation_images = []
    validation_labels = []

    for batch in val_set:
        validation_images.append(batch["image"])
        validation_labels.append(batch["label"])

    prediction_model = tf.keras.models.Model(
        model.get_layer(name="image").input, model.get_layer(
            name="dense2").output
    )
    edit_distance_callback = EditDistanceCallback(
        prediction_model, max_len, validation_images, validation_labels)
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    )
    # Train the model.
    hist = model.fit(
        train_set,
        validation_data=val_set,
        epochs=EPOCHS,
        callbacks=[edit_distance_callback, early_stopping],
    )
    return hist, prediction_model

history, prediction_model = train_model()

### Visualizing the Training History

This code block defines a function `visualize_train_history` and calls it to visualize the training history stored in the `history` object:

**Function Definition:**

* The function takes the `history` object returned from the training process as input.
* It creates a Matplotlib figure with a specific size (12 inches wide and 4 inches high).

**Plotting the Loss:**

* The function draws a single loss plot:
    * **Loss:**
        * Plots the training and validation loss values retrieved from the `history` object using `history.history['loss']` and `history.history['val_loss']`.
        * Sets the title, labels the axes, and adds a legend identifying the training and validation curves.

* The `plt.tight_layout()` function adjusts the spacing between subplots for better readability.
* Finally, `plt.show()` displays the generated plot.

**Calling the Function:**

* The `visualize_train_history` function is called, passing the `history` object obtained from the training process. This triggers the creation and display of the visualization, allowing you to visually inspect the model's performance during training.

By visualizing the training history, you can gain insight into the model's learning behavior.
* The training loss curve shows how well the model fits the training data; it should decrease as training progresses.
* Ideally, the validation loss decreases along with it, or at least does not rise significantly, indicating that the model is generalizing to unseen data. A validation loss that climbs while the training loss keeps falling suggests overfitting. Only loss curves are plotted here; accuracy is not part of the figure.
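
The same curves can also be summarized numerically. The sketch below is only an illustration: it assumes a dictionary shaped like `history.history` with `loss` and `val_loss` lists, the helper name `summarize_history` is hypothetical, and the numbers are made up.

```python
def summarize_history(history_dict):
    """Locate the epoch with the lowest validation loss.

    `history_dict` is assumed to look like `history.history`:
    a dict with equal-length "loss" and "val_loss" lists.
    """
    train_loss = history_dict["loss"]
    val_loss = history_dict["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_index + 1,  # 1-based epoch number
        "best_val_loss": val_loss[best_index],
        # Positive gap: validation loss is higher than training loss.
        "gap_at_best": val_loss[best_index] - train_loss[best_index],
    }


# Made-up numbers for illustration only.
example = {
    "loss": [9.0, 5.0, 3.0, 2.0, 1.5],
    "val_loss": [8.5, 5.5, 3.5, 3.75, 4.0],
}
summary = summarize_history(example)
print(summary)
```

In this made-up run the validation loss bottoms out at the third epoch while the training loss keeps falling, which is the overfitting pattern to watch for in the plot.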

In [None]:
def visualize_train_history(history):
    plt.figure(figsize=(12, 4))

    # Plot training & validation loss values in a single panel
    plt.subplot(1, 1, 1)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(['Train', 'Validation'], loc='upper left')

    plt.tight_layout()
    plt.show()

visualize_train_history(history)

## Evaluating the HTR Model on the Test Set

This code block defines and executes the evaluation process for the trained HTR model on the unseen test set:

**Function Definition:**

* The `evaluate_model` function evaluates the model on the test set and prints the result.

**Evaluation:**

* The `model.evaluate` method is called with the `test_set` dataset as input. In Keras, the first value it returns is always the loss; metrics such as accuracy are included only if they were passed to `model.compile`.
* The return value is printed as-is. If the model was compiled without metrics, as in the Keras example this notebook builds on, that value is the CTC loss on the test set, so lower is better.

**Execution:**

* The `evaluate_model` function is called, which triggers the evaluation process and prints the test result for the trained model.

**Interpretation:**

* The test result reflects how well the model generalizes to unseen data not encountered during training. Ideally, the test loss should be comparable to the validation loss, indicating that the model has learned robust features and can perform well on new handwritten text samples.

By evaluating the model on the test set, you can gauge its ability to recognize handwritten text beyond the data it was trained on, providing valuable insights into its real-world applicability.

In [None]:
def evaluate_model():
    # evaluate() returns the loss first; further values appear only if
    # metrics were passed to model.compile().
    results = model.evaluate(test_set)
    print("Test results (loss first):", results)

evaluate_model() 
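
An accuracy-style number can also be computed directly from decoded strings, such as those returned by `decode_batch_predictions` in the Inference section. The following is a minimal sketch under that assumption; the helper name `exact_match_accuracy` and the sample strings are hypothetical and not part of the original notebook.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of samples whose predicted string equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    matches = sum(pred == label for pred, label in zip(predictions, labels))
    return matches / len(labels)


# Hypothetical decoded predictions and ground-truth labels.
predicted = ["hello", "wor1d", "text", "line"]
expected = ["hello", "world", "text", "lime"]
accuracy = exact_match_accuracy(predicted, expected)
print(f"Exact-match accuracy: {accuracy:.2f}")
```

Exact match is strict: a single wrong character counts the whole sample as an error, so it is usually read alongside the edit distance reported during training.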

## Saving the Trained HTR Model

This code block defines and calls a function to save the trained HTR model:

**Function Definition:**

* The `save_model` function is defined to save the trained model to disk.

**Saving the Model:**

* It creates the directory specified in `MODEL_OUTPUT_PATH` if it doesn't exist using `os.makedirs(MODEL_OUTPUT_PATH, exist_ok=True)`.
* It constructs the full file path for the model by combining the `MODEL_OUTPUT_PATH` and the model name (`MODEL_NAME`) with the `.keras` extension using `os.path.join`.
* Finally, it calls `prediction_model.save` to save the model to the constructed file path.

**Execution:**

* The `save_model` function is called, which triggers the saving process. Note that it saves `prediction_model`, the inference model that ends at the `dense2` layer, so the saved file can be loaded later and used for prediction directly.

**Key Points:**

* The model is saved in the Keras format, which allows for easy loading and use in future applications.
* The directory structure and naming convention (`MODEL_OUTPUT_PATH` and `MODEL_NAME`) can be adjusted to your preference and project requirements.

By saving the trained model, you can:

* Reuse it for making predictions on new handwritten text samples without retraining the entire model.
* Share the model with others for further evaluation or integration into applications.
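* Continue training on additional data in the future if needed. Because the saved file holds the inference model rather than the full training model, this requires rebuilding the training model with its CTC layer around the loaded weights.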

In [None]:
MODEL_NAME = 'MODEL_NAME'
MODEL_OUTPUT_PATH = '/kaggle/working/'
        
def save_model():
    """
    Save the trained HTR model.
    """
    os.makedirs(MODEL_OUTPUT_PATH, exist_ok=True)
    prediction_model.save(os.path.join(
        MODEL_OUTPUT_PATH, f'{MODEL_NAME}.keras'))
    
save_model()

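## Inference

This code block decodes the model's raw predictions into text and displays the results for a few test samples:

**Decoding Predictions:**

* The `decode_batch_predictions` function runs greedy CTC decoding with `tf.keras.backend.ctc_decode`, truncates each result to `max_len`, removes the `-1` padding values, and maps the remaining indices back to characters with `num_to_char`.

**Displaying Results:**

* For two batches taken from `test_set`, the first 16 images of each batch are shown in a 4-by-4 grid, with the decoded prediction as the title of each panel. Each image is flipped and transposed before display, which presumably undoes the orientation change applied during preprocessing. The loop assumes every batch contains at least 16 images.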

In [None]:
def decode_batch_predictions(pred):
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    # Use greedy search. For complex tasks, you can use beam search.
    results = tf.keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
        :, :max_len
    ]

    # Iterate over the results and get back the text.
    output_text = []
    for res in results:
        res = tf.gather(res, tf.where(tf.math.not_equal(res, -1)))
        res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
        output_text.append(res)
    return output_text


#  Let's check results on some test samples.
for batch in test_set.take(2):
    batch_images = batch["image"]
    batch_labels = batch["label"]
    _, ax = plt.subplots(4, 4, figsize=(15, 8))

    preds = prediction_model.predict(batch_images)
    pred_texts = decode_batch_predictions(preds)

    for i in range(16):
        img = batch_images[i]
        img = tf.image.flip_left_right(img)
        img = tf.transpose(img, perm=[1, 0, 2])
        img = (img * 255.0).numpy().clip(0, 255).astype(np.uint8)
        img = img[:, :, 0]

        title = f"Prediction: {pred_texts[i]}"
        ax[i // 4, i % 4].imshow(img, cmap="gray")
        ax[i // 4, i % 4].set_title(title)
        ax[i // 4, i % 4].axis("off")

plt.show()
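
To make the greedy decoding step more concrete, here is a minimal pure-Python sketch of greedy CTC decoding: pick the most likely class at each time step, merge consecutive repeats, and drop the blank. It is a simplified illustration, not the code behind `tf.keras.backend.ctc_decode`. The function name `greedy_ctc_decode`, the two-letter alphabet, and the probabilities are hypothetical, and the blank is assumed to be the last class.

```python
def greedy_ctc_decode(timestep_probs, alphabet):
    """Decode one sequence of per-time-step class probabilities.

    Assumption: class i < len(alphabet) maps to alphabet[i], and the
    extra last class (index len(alphabet)) is the CTC blank.
    """
    blank = len(alphabet)
    # Best path: the most probable class at every time step.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in timestep_probs]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a character only when the class changes and is not blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Hypothetical output for 6 time steps over the classes ["a", "b", blank].
probs = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a (repeat, merged)
    [0.1, 0.1, 0.8],  # blank (separates the two a's)
    [0.6, 0.3, 0.1],  # a
    [0.2, 0.7, 0.1],  # b
    [0.1, 0.8, 0.1],  # b (repeat, merged)
]
text = greedy_ctc_decode(probs, "ab")
print(text)
```

The blank between the second and third time steps is what allows the doubled letter to survive the merge of repeats.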