# Matrices OCR training

This tutorial shows how to train the OCR module of the [Matrices repository](https://github.com/merialdo/research.matrices) in Google Colab.



## 1 Disclaimer

This Colab reads data from, and saves data to, your Google Drive main folder.

Please make sure that your dataset (in HDF5 format) and the base model to fine-tune are in your Google Drive main folder.

When this Colab asks for permission to access your Google Drive data, grant it.

The training output (e.g., summary files and checkpoints of the training process) and any evaluation result files will be saved in a new folder created in your Google Drive main folder.
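
Before starting a long run, it can help to confirm that the required files are where the notebook expects them. The following is a minimal sketch, not part of the Matrices code: `missing_files` is a hypothetical helper, and a temporary folder stands in for the Drive root (which in this notebook would be `/content/drive/MyDrive` once Drive is mounted).

```python
import os
import tempfile

def missing_files(folder, filenames):
    """Return the names in `filenames` that do not exist as files in `folder`."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(folder, name))]

# A temporary folder stands in for the Drive root so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as drive_root:
    # Simulate a Drive that contains the dataset but not the base model.
    open(os.path.join(drive_root, "biagini.hdf5"), "wb").close()
    demo_missing = missing_files(drive_root,
                                 ["biagini.hdf5", "multilingual_model.hdf5"])
    print("Missing files:", demo_missing)
```

In the notebook itself, the same helper could be called with the mounted Drive folder and the `dataset_name` and `base_model` values defined in the Parameters section.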

## 2 Parameters

The first cell below reports the available system and GPU memory; the second cell defines the main parameters of the training process.

In [None]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# Colab provides at most one GPU, and it is not guaranteed to be available
gpu = GPUs[0] if GPUs else None

def printm():
    """Print the free system RAM, the process size and the GPU memory usage."""
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    if gpu is None:
        print("No GPU detected")
        return
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()


Gen RAM Free: 10.8 GB  |     Proc size: 5.2 GB
GPU RAM Free: 357MB | Used: 11084MB | Util  97% | Total     11441MB


In [None]:
import string 

base_model = "multilingual_model.hdf5"  # name and path of the base model to fine-tune
target_model_name = "biagini_ft"        # name of the model to create
dataset_name = "biagini.hdf5"           # complete name of the dataset

epochs = 100                       # number of training epochs
batch_size = 16                    # number of samples per mini-batch
#learning_rate=0.001
learning_rate=0.0005               # learning rate of the training process

input_size = (1024, 128, 1)       # input size of the images to transcribe 
#input_size = (640, 64, 1)

max_text_length = 128             # max number of characters for each transcribed line
#max_text_length = 180

validation_interval = 1            # number of training epochs between validations

charset_base = string.printable[:95]#+"€"     # alphabet of available characters
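
To make the `charset_base` value concrete: the first 95 characters of `string.printable` are the digits, the ASCII letters, the ASCII punctuation and the space, without tabs or newlines. The sketch below checks this and adds a hypothetical helper, `unsupported_chars`, that lists the characters of a transcription that fall outside the alphabet. How the Matrices tokenizer treats such characters is not stated here, so the helper only reports them.

```python
import string

charset_base = string.printable[:95]

print(len(charset_base))                            # 95
print(repr(charset_base[-1]))                       # ' '
print("\t" in charset_base, "\n" in charset_base)   # False False

def unsupported_chars(text, charset):
    """Return the sorted, unique characters of `text` that are not in `charset`."""
    return sorted(set(text) - set(charset))

print(unsupported_chars("perché costa 5€?", charset_base))   # ['é', '€']
```

If a dataset contains accented letters or symbols such as "€", they would need to be appended to `charset_base` (as the commented-out `+"€"` suggests) to be part of the alphabet.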

## 3 Google Colab Environment


### 3.1 TensorFlow 2.x

Make sure the Jupyter notebook is running in GPU mode.

In [None]:
!nvidia-smi

Sat Jan 22 10:50:36 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0    74W / 149W |  11084MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Install TensorFlow 2.x and verify that it has been installed correctly.

In [None]:
#!pip install -q tensorflow-gpu

%tensorflow_version 2.x

import tensorflow as tf

device_name = tf.test.gpu_device_name()

if device_name != "/device:GPU:0":
    raise SystemError("GPU device not found")

print(f"Found GPU at: {device_name}")

Found GPU at: /device:GPU:0


### 3.2 Google Drive

Mount your Google Drive; after this step, you should be able to see your Google Drive files in the project folder, under the path /content/drive.

Note: the project folder is temporary storage exclusive to this Colab notebook, and only the /content/drive path is connected to your Google Drive. Therefore:
- interacting with files in /content/drive changes the actual content of your Google Drive;
- any changes to other paths in the project folder will likely be erased when the notebook is reloaded.
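
The distinction between persistent and temporary paths can be checked programmatically. This is a minimal sketch with a hypothetical helper, `is_on_drive`, that assumes the mount point `/content/drive` used in this notebook; it only compares paths and does not touch the file system.

```python
import os

DRIVE_MOUNT = "/content/drive"  # mount point used in this notebook

def is_on_drive(path, mount_point=DRIVE_MOUNT):
    """Return True if `path` is the mount point itself or lies below it."""
    path = os.path.abspath(path)
    mount_point = os.path.abspath(mount_point)
    # commonpath compares whole path components, so "/content/drive2"
    # is not mistaken for a child of "/content/drive".
    return os.path.commonpath([path, mount_point]) == mount_point

print(is_on_drive("/content/drive/MyDrive/biagini_ft"))  # True: persists on Drive
print(is_on_drive("/content/matrices/data"))             # False: temporary storage
```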

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### 3.3 Clone the Matrices Repository from GitHub

Clone the research.matrices repository into the project folder, at /content/matrices.

In [None]:
import os
import shutil

if os.path.isdir("matrices"):
  shutil.rmtree("matrices")
os.mkdir("matrices")

!git clone https://github.com/merialdo/research.matrices matrices

Cloning into 'matrices'...
remote: Enumerating objects: 472, done.
remote: Counting objects: 100% (241/241), done.
remote: Compressing objects: 100% (146/146), done.
remote: Total 472 (delta 119), reused 196 (delta 93), pack-reused 231
Receiving objects: 100% (472/472), 10.22 MiB | 22.16 MiB/s, done.
Resolving deltas: 100% (190/190), done.


Install the backend requirements for Matrices:

In [None]:
!pip install -r /content/matrices/backend/requirements.txt



Create a "data" folder in the Matrices project and copy the dataset from Google Drive into it.

In [None]:
matrices_data_folder = "/content/matrices/data"

# define paths
dataset_path = os.path.join(matrices_data_folder, dataset_name)
output_path = os.path.join("/content/drive/MyDrive", target_model_name)
checkpoint_path = os.path.join(output_path, "checkpoint_model.{epoch:02d}.hdf5")

os.makedirs(output_path, exist_ok=True)

In [None]:
if os.path.isdir(matrices_data_folder):
  shutil.rmtree(matrices_data_folder)
os.mkdir(matrices_data_folder)

shutil.copyfile(os.path.join("/content/drive/MyDrive", dataset_name), os.path.join(matrices_data_folder, dataset_name))

'/content/matrices/data/biagini.hdf5'

## 4 Run the Training Process

Move to the Matrices folder and run the actual training process.
At regular intervals, defined by the validation_interval parameter, a validation run is executed on the validation set; if the model loss on the validation set has improved since the last validation run, the checkpoint of the current epoch is saved on Google Drive.

The training process keeps track of which epoch has been the "best" one in terms of validation loss.
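
The bookkeeping for the best epoch can be sketched in isolation. The helpers below are hypothetical and not part of the Matrices code. They assume that the i-th recorded validation loss was measured after epoch `(i + 1) * validation_interval`, which is how Keras fills its history when `validation_freq` is an integer, and that checkpoints follow the `checkpoint_model.{epoch:02d}.hdf5` template defined earlier. The loss values are made up for illustration.

```python
def best_epoch_from_history(val_losses, validation_interval=1):
    """Return the 1-based epoch with the lowest validation loss."""
    # On ties, min() keeps the first occurrence, i.e. the earliest epoch.
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (best_index + 1) * validation_interval

def checkpoint_name(epoch, template="checkpoint_model.{epoch:02d}.hdf5"):
    """Build the checkpoint file name for a given 1-based epoch."""
    return template.format(epoch=epoch)

val_losses = [12.4, 9.8, 10.1, 9.9]   # made-up values, one per validation run
epoch = best_epoch_from_history(val_losses, validation_interval=2)
print(epoch, checkpoint_name(epoch))  # 4 checkpoint_model.04.hdf5
```

Note the zero padding in the file name: epoch 4 is stored as `04`, so the name has to be built with the same format specifier that was used when saving.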

In [None]:
%cd '/content/matrices/'

import sys
sys.path.insert(0,"/content/matrices/backend/ocr_service/")

from backend.ocr_service.model import HTRModel
import backend.ocr_service.evaluation as evaluation
from backend.ocr_service.dataset import HDF5Dataset

import time
import logging
import datetime

try:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
    logging.disable(logging.WARNING)
except AttributeError:
    pass

device_name = tf.test.gpu_device_name()

print("Dataset: ", dataset_path)
print("Output model folder: ", output_path)

# load the dataset to use in training, validation and testing
dataset = HDF5Dataset(source_path=dataset_path,
                      batch_size=batch_size,
                      charset=charset_base,
                      max_text_length=max_text_length)
print(f"Train images:      {dataset.training_set_size}")
print(f"Validation images: {dataset.valid_set_size }")
print(f"Test images:       {dataset.test_set_size}")

# create and compile HTRModel
htr_model = HTRModel(input_size=input_size,
                     vocabulary_size=dataset.tokenizer.vocab_size,
                     beam_width=10,
                     stop_tolerance=25,
                     reduce_tolerance=20)

htr_model.compile(learning_rate=learning_rate)
htr_model.summary(output_path, "summary.txt")

# load model
resumed_model = os.path.join("/content/drive/MyDrive", base_model) 
htr_model.load_checkpoint(target=resumed_model)
callbacks = htr_model.get_callbacks(logdir=output_path,
                                    checkpoint=checkpoint_path,
                                    verbose=1)

# to calculate total and average time per epoch
start_time = datetime.datetime.now()


htr_model_history = htr_model.fit(x=dataset.training_data_generator,
                                  epochs=epochs,
                                  validation_data=dataset.valid_data_generator,
                                  validation_freq=validation_interval,
                                  callbacks=callbacks,
                                  verbose=1)

training_time = datetime.datetime.now() - start_time

loss = htr_model_history.history['loss']
val_loss = htr_model_history.history['val_loss']

min_val_loss = min(val_loss)
min_val_loss_i = val_loss.index(min_val_loss)

avg_epoch_time = (training_time / len(loss))
best_epoch = (min_val_loss_i + 1) * validation_interval

t_corpus = "\n".join([
    # report the number of validation images, not the generator object
    f"Total validation images: {dataset.valid_set_size}",
    f"Batch Size:              {dataset.training_data_generator.batch_size}\n",
    f"Total Training Time:     {training_time}",
    f"Time per epoch:          {avg_epoch_time}",
    f"Total epochs:            {len(loss)}",
    f"Best epoch:              {best_epoch}",
    # loss has one entry per epoch, so index it by epoch, not by validation run
    f"Training loss:           {loss[best_epoch - 1]:.8f}",
    f"Validation loss:         {min_val_loss:.8f}"
])


with open(os.path.join(output_path, "train.txt"), "w") as lg:
    lg.write(t_corpus)
    print(t_corpus)
  
print("The best epoch for validation loss has been Epoch " + str(best_epoch))

/content/matrices
Dataset:  /content/matrices/data/biagini.hdf5
Output model folder:  /content/drive/MyDrive/biagini_ft
Train images:      1179
Validation images: 460
Test images:       371
Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input (InputLayer)          [(None, 1024, 128, 1)]    0         
                                                                 
 conv2d_18 (Conv2D)          (None, 1024, 64, 32)      320       
                                                                 
 p_re_lu_18 (PReLU)          (None, 1024, 64, 32)      32        
                                                                 
 batch_normalization_18 (Bat  (None, 1024, 64, 32)     224       
 chNormalization)                                                
                                                                 
 full_gated_conv2d_15 (FullG  (None, 1024, 64, 32)     18496     
 

ResourceExhaustedError: ignored

Run the evaluation on the test set and extract:
- global metrics:
  - Character Error Rate
  - Word Error Rate
  - Sequence Error Rate
- a visual representation of the line image, the original text and the obtained transcription for each sample in the test set.
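
To illustrate what the three global metrics measure, here is a minimal pure-Python sketch based on the Levenshtein (edit) distance. It is not the `evaluation.ocr_metrics` implementation, which also normalizes accentuation and punctuation; normalizing each distance by the length of the reference is an assumption made here for simplicity, and `edit_distance` and `error_rates` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def error_rates(predictions, references):
    """Return (CER, WER, SER), each averaged over the given line pairs."""
    if not references or len(predictions) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    cer, wer, ser = [], [], []
    for pred, ref in zip(predictions, references):
        ref_words = ref.split()
        cer.append(edit_distance(pred, ref) / max(len(ref), 1))
        wer.append(edit_distance(pred.split(), ref_words) / max(len(ref_words), 1))
        ser.append(0.0 if pred == ref else 1.0)
    n = len(references)
    return sum(cer) / n, sum(wer) / n, sum(ser) / n

cer, wer, ser = error_rates(["the cat sat", "hello world"],
                            ["the cat sat", "hallo world"])
print(f"CER={cer:.4f} WER={wer:.4f} SER={ser:.4f}")  # CER=0.0455 WER=0.2500 SER=0.5000
```

In this example one character out of 11 is wrong in the second line, so one of its two words and the whole line count as errors, while the first line is perfect; averaging over both lines gives the values printed above.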

In [None]:
from google.colab.patches import cv2_imshow

start_time = time.time()

# load the checkpoint of the best epoch
# use the same zero-padded format as checkpoint_path ("{epoch:02d}")
best_checkpoint_path = os.path.join(output_path, f"checkpoint_model.{best_epoch:02d}.hdf5")
htr_model.load_checkpoint(target=best_checkpoint_path)

# predict() returns the predictions together with their probabilities
predicts, prob = htr_model.predict(x=dataset.test_data_generator,
                                   steps=1,
                                   ctc_decode=True,
                                   verbose=1,
                                   use_multiprocessing=False)

print("--- %s seconds ---" % (time.time() - start_time))

predicts = tf.sparse.to_dense(predicts[0]).numpy()
prob = prob.numpy()

# decode to string
predicts = [dataset.tokenizer.decode(x) for x in predicts]
ground_truth = [x.decode() for x in dataset.test_data_generator.labels]

# write the ground truth (TE_L) and the prediction (TE_P) of each line to a file
with open(os.path.join(output_path, "predict.txt"), "w") as lg:
    for pd, gt in zip(predicts, ground_truth):
        lg.write(f"TE_L {gt}\nTE_P {pd}\n")
print(len(predicts), len(ground_truth), len(prob))

evaluate = evaluation.ocr_metrics(predicts,
                                  ground_truth,
                                  norm_accentuation=True,
                                  norm_punctuation=True)

e_corpus = "\n".join([
    f"Metrics:",
    f"Character Error Rate: {evaluate[0]:.8f}",
    f"Word Error Rate:      {evaluate[1]:.8f}",
    f"Sequence Error Rate:  {evaluate[2]:.8f}"
])

with open(os.path.join(output_path, "evaluate.txt"), "w") as lg:
    lg.write(e_corpus)
    print(e_corpus)

from backend.ocr_service.image_processing import adjust_to_see

for j, item in enumerate(dataset.test_data_generator.samples):
    print("=" * 256, "\n")
    cv2_imshow(adjust_to_see(item))
    print(ground_truth[j])
    print(predicts[j])
   

NameError: ignored