<img src="https://i.imgur.com/gb6B4ig.png" width="40%" alt="Weights & Biases Logo" />

# Introduction

In this notebook we'll take a common OCR library, EasyOCR, and fine-tune the model that it uses for prediction. You can then load the trained model through the EasyOCR API to make predictions on images of text.

## Training a custom model

#### Using open-source data
To train your EasyOCR model you can use your own data or generate a dataset with a tool like [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).

The default network in the EasyOCR library is `None-VGG-BiLSTM-CTC`. In addition to a trained model file in the form of a `.pth` file you will need two files: a network architecture file and a model configuration file. You can see some sample files on the EasyOCR Model Hub page: https://jaided.ai/easyocr/modelhub/. Under the Second Gen Models it offers English, Latin, Chinese, Japanese, Korean, Telugu, and Kannada. Under the First Generation Models you can choose between Latin, Chinese (Simple), Chinese (Traditional), Japanese, Korean, Thai, Devanagari, Cyrillic, Arabic, Tamil, and Bengali.

#### Via JaidedAI

If you'd like to pay for a web-based service to fine-tune an EasyOCR model you can use JaidedAI's training service.


# Steps taken for open-source training of your own recognition model

We'll go with the free option. We've outlined the steps taken to fine-tune EasyOCR below after reading through and synthesizing the EasyOCR documentation:

1. Generate a dataset of text images, either with a tool like [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator) or by bringing your own data (BYOD) if you already have a corpus of images of text and their labels.
2. Once you have your dataset, train your model using the `deep-text-recognition-benchmark` library: https://github.com/clovaai/deep-text-recognition-benchmark. This is a **PyTorch**-based library that supports many benchmark datasets: IIIT5K, ICDAR, etc. The network needs to be fully convolutional in order to predict variable-length text. EasyOCR uses `None-VGG-BiLSTM-CTC` for its model architecture.
3. Once you have a trained model (and the `.pth` file that training produces) you will need **two** additional files: one file describing the network architecture and one file describing the model configuration. The **Custom Model** entry on the EasyOCR Model Hub page contains an example of the two files: https://jaided.ai/easyocr/modelhub/
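
The three files from step 3 need to end up where EasyOCR looks for custom models. The sketch below assumes the layout described in the EasyOCR custom-model instructions: weights under `~/.EasyOCR/model`, and the architecture `.py` and configuration `.yaml` under `~/.EasyOCR/user_network`. The helper names and the model name `custom_example` are illustrative only, so check the paths against your EasyOCR version.

```python
from pathlib import Path


def custom_model_paths(name, root=None):
    """Return the assumed locations of the three files for a custom model."""
    # Assumption: EasyOCR keeps its files under ~/.EasyOCR unless told otherwise.
    root = Path(root) if root is not None else Path.home() / ".EasyOCR"
    return {
        "weights": root / "model" / f"{name}.pth",
        "network": root / "user_network" / f"{name}.py",
        "config": root / "user_network" / f"{name}.yaml",
    }


def missing_files(name, root=None):
    """List which of the three files are not present yet."""
    paths = custom_model_paths(name, root)
    return [kind for kind, path in paths.items() if not path.is_file()]


if __name__ == "__main__":
    # "custom_example" is a placeholder model name.
    print("Missing:", missing_files("custom_example") or "nothing")
```

Once nothing is reported missing, the model can typically be loaded with `easyocr.Reader(['en'], recog_network='custom_example')`.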

# Which model should I fine-tune?

Depending on which model you train or fine-tune, you'll see a range of performance scores. Below is the performance curve taken from the paper "What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis" (https://arxiv.org/abs/1904.01906).

<img src="https://raw.githubusercontent.com/clovaai/deep-text-recognition-benchmark/master/figures/trade-off.png" width="80%" alt="Proposed combinations and performance curves" />
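
Model names such as `None-VGG-BiLSTM-CTC` list the four stages of the network in order: transformation, feature extraction, sequence modeling, and prediction. As a minimal sketch (the helper `stage_flags` is hypothetical and not part of either library), such a name can be split into the four `train.py` flags used in the training command later in this notebook:

```python
# The four stages, in the order they appear in a model name.
STAGES = ("Transformation", "FeatureExtraction", "SequenceModeling", "Prediction")


def stage_flags(architecture):
    """Turn a name like 'None-VGG-BiLSTM-CTC' into train.py style flags."""
    parts = architecture.split("-")
    if len(parts) != len(STAGES):
        raise ValueError(f"expected {len(STAGES)} stages, got {len(parts)}: {architecture!r}")
    flags = []
    for stage, module in zip(STAGES, parts):
        flags += [f"--{stage}", module]
    return flags


print(" ".join(stage_flags("None-VGG-BiLSTM-CTC")))
# --Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
```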


# Getting started

To fine-tune our model we follow the getting-started instructions of the `deep-text-recognition-benchmark` repo: https://github.com/clovaai/deep-text-recognition-benchmark#getting-started

First we `pip install` a few libraries (most of which are already installed in Colab by default). Then we'll edit the `train.py` script to include `wandb` so that training progress is written to our Weights and Biases dashboard.


In [None]:
!pip3 install -qqq lmdb pillow torchvision nltk natsort wandb gdown

import wandb
wandb.login()

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True
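
If you are running outside Colab, a quick sanity check can confirm that the packages from the install cell are importable. This is only a sketch: the mapping from pip names to import names is an assumption for illustration (Pillow is imported as `PIL`), and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# pip package name -> module name used in `import`
REQUIRED = {
    "lmdb": "lmdb",
    "pillow": "PIL",
    "torchvision": "torchvision",
    "nltk": "nltk",
    "natsort": "natsort",
    "wandb": "wandb",
    "gdown": "gdown",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found, without importing them."""
    return [pkg for pkg, module in required.items()
            if importlib.util.find_spec(module) is None]


print("Missing packages:", missing_packages() or "none")
```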

In [None]:
# Download the repo that has the code that you can reference to fine-tune / train
!git clone https://github.com/clovaai/deep-text-recognition-benchmark.git

Cloning into 'deep-text-recognition-benchmark'...
remote: Enumerating objects: 495, done.
remote: Total 495 (delta 0), reused 0 (delta 0), pack-reused 495
Receiving objects: 100% (495/495), 3.07 MiB | 31.09 MiB/s, done.
Resolving deltas: 100% (302/302), done.


# Downloading Data

Per the instructions, download the training data which is located in a [Dropbox folder](https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0). Note that the `data_lmdb_release.zip` file contains the training, validation **and** test datasets. It is, however, ~18GB in size, so it may take some time to download depending on your connection.

```
- data_lmdb_release.zip contains training, validation, and evaluation sets.

- validation.zip contains only validation set.

- evaluation.zip contains only evaluation set.

- ST_spe.zip contains word images, which include special characters in SynthText (ST) dataset.
check this issue https://github.com/clovaai/deep-text-recognition-benchmark/issues/7#issuecomment-511727025
```

In [None]:
%cd deep-text-recognition-benchmark/

/content/deep-text-recognition-benchmark


It's easiest to work with the `deep-text-recognition-benchmark` tool if you simply download the LMDB dataset so that it is inside of the `deep-text-recognition-benchmark` directory.

Having the dataset outside of that directory will require you to do a considerable amount of editing of training and validation scripts to allow the `deep-text-recognition-benchmark` to run properly and 'find' the datasets.
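
To confirm that the dataset landed where the training script expects it, you can look for LMDB folders (directories containing a `data.mdb` file) under each split. This is a minimal sketch: the root name `data_lmdb_release` and the split names come from the archive, while the helper functions are hypothetical.

```python
from pathlib import Path


def find_lmdb_dirs(root):
    """Return every directory under `root` that holds a data.mdb file."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.parent.relative_to(root).as_posix() for p in root.rglob("data.mdb"))


def group_by_split(root="data_lmdb_release",
                   splits=("training", "validation", "evaluation")):
    """Group the LMDB folders by the split directory they sit under."""
    found = find_lmdb_dirs(root)
    return {split: [d for d in found if d.split("/")[0] == split] for split in splits}


if __name__ == "__main__":
    # Run from inside deep-text-recognition-benchmark/ after unzipping.
    for split, dirs in group_by_split().items():
        print(f"{split}: {len(dirs)} LMDB folder(s)")
```

An empty list for a split usually means the archive was unpacked somewhere else or the download was incomplete.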

Note that instead of `wget`-ing the dataset from the authors you could download a copy from Weights and Biases Artifacts here: https://wandb.ai/andrea0/deep-text-recognition-benchmark/artifacts/compressed-dataset/lmdb-dataset-zip/68b56d59f046d42ea5ce

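To download an artifact and use it in a run:
```python
import wandb

# Start a run so that there is a `run` object to attach artifacts to
run = wandb.init(project="deep-text-recognition-benchmark")

# Pull down the dataset artifact logged in an earlier run
# (use the full `entity/project/name:alias` path if it lives in another project)
artifact = run.use_artifact('lmdb-dataset-zip:latest')
artifact_dir = artifact.download()

# Save a model after training
model = wandb.Artifact('my-model', type='model')
model.add_file('my-model.txt')
run.log_artifact(model)

wandb.finish()
```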

In [None]:
!wget https://www.dropbox.com/sh/i39abvnefllx2si/AABX4yjNn2iLeKZh1OAwJUffa/data_lmdb_release.zip

--2022-06-01 01:51:43--  https://www.dropbox.com/sh/i39abvnefllx2si/AABX4yjNn2iLeKZh1OAwJUffa/data_lmdb_release.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /sh/raw/i39abvnefllx2si/AABX4yjNn2iLeKZh1OAwJUffa/data_lmdb_release.zip [following]
--2022-06-01 01:51:43--  https://www.dropbox.com/sh/raw/i39abvnefllx2si/AABX4yjNn2iLeKZh1OAwJUffa/data_lmdb_release.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb6bbbf7a5596319cbe64ade88b.dl.dropboxusercontent.com/cd/0/inline/BmWQA_IPII9LvMMHCnoAjs9mHsLxGaMziUSvGIrLSxDBT_Q132sj5nc-7EwPxbaOCc06Ogxriiwl3LzBQ8LXhx9GejGyHr306jqRmVG29SlAG-s7hKOP6eWEDPLC882VsEiGMzABOT7D-QbNFrFu_VUy9Oj8YBARQxL5o0vWOfUZMg/file# [following]
--2022-06-01 01:51:44--  https://ucb6bbbf7a5596319cbe64ad

In [None]:
!unzip ./data_lmdb_release.zip

Archive:  ./data_lmdb_release.zip
   creating: data_lmdb_release/
   creating: data_lmdb_release/evaluation/
   creating: data_lmdb_release/evaluation/IC03_860/
  inflating: data_lmdb_release/evaluation/IC03_860/data.mdb  
  inflating: data_lmdb_release/evaluation/IC03_860/lock.mdb  
   creating: data_lmdb_release/evaluation/IC03_867/
  inflating: data_lmdb_release/evaluation/IC03_867/data.mdb  
  inflating: data_lmdb_release/evaluation/IC03_867/lock.mdb  
   creating: data_lmdb_release/evaluation/IC13_1015/
  inflating: data_lmdb_release/evaluation/IC13_1015/data.mdb  
  inflating: data_lmdb_release/evaluation/IC13_1015/lock.mdb  
   creating: data_lmdb_release/evaluation/IC13_857/
  inflating: data_lmdb_release/evaluation/IC13_857/data.mdb  
  inflating: data_lmdb_release/evaluation/IC13_857/lock.mdb  
   creating: data_lmdb_release/evaluation/IC15_1811/
  inflating: data_lmdb_release/evaluation/IC15_1811/data.mdb  
  inflating: data_lmdb_release/evaluation/IC15_1811/lock.mdb  
   cr

# Integrating `wandb` into training

Now that we have the repository with the training code (`deep-text-recognition-benchmark`) and the training dataset downloaded we'll need to edit our `train.py` file to include some Weights and Biases logging functionality.


### Edit `train.py`

Add the following near the start of your script:

```python
import wandb
wandb.init()
```

Find where `model.eval()` is called (inside the validation loop, around line 192; not in the multi-GPU section, unless you are using multiple GPUs to train your model) and insert `wandb.watch(model, criterion, log="all")`. You would usually also pass a logging frequency (`log_freq`), but because of the way the codebase is written the `wandb` settings conflict, so we do not set the frequency for now:
```python
    # Tell wandb to watch what the model gets up to: gradients, weights, and more!
    wandb.watch(model, criterion, log="all")
```

We'll also provide a version of the `train.py` script with these edits for you [here](https://gist.github.com/ap-wb/25737c98a1d52fc36220bffa0248c271).

Note that `num_iter` defaults to 300,000 iterations. Since we're working on a toy example and do not want to wait for hours, we'll specify a much smaller `num_iter`:

In [None]:
!CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data ./data_lmdb_release/training --valid_data ./data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 --num_iter 100 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn

[34m[1mwandb[0m: Currently logged in as: [33mandrea0[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.12.17
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/deep-text-recognition-benchmark/wandb/run-20220601_024601-15x3cmga[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33msplendid-hill-27[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/andrea0/deep-text-recognition-benchmark[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/andrea0/deep-text-recognition-benchmark/runs/15x3cmga[0m
Filtering the images containing characters which are not in opt.character
Filtering the images whose label is longer than opt.batch_max_length
--------------------------------------------------------------------------------
dataset_root: ./data_lmdb_release/training
opt.select_data: ['MJ', 'ST']
opt.batch_ratio: ['0.5', '0.5']
----
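
The log shows `--select_data MJ-ST` and `--batch_ratio 0.5-0.5` being split on `-` into two parallel lists. As a rough sketch of what that implies for each batch, assume each dataset contributes `batch_size * ratio` samples. The rounding rule and the default batch size of 192 are assumptions here, not taken from the output above, and `per_dataset_batch_sizes` is a hypothetical helper.

```python
def per_dataset_batch_sizes(select_data, batch_ratio, batch_size=192):
    """Split the '-' separated options and apportion one batch across datasets."""
    names = select_data.split("-")
    ratios = [float(r) for r in batch_ratio.split("-")]
    if len(names) != len(ratios):
        raise ValueError("select_data and batch_ratio must have the same number of parts")
    # Assumption: each dataset gets at least one sample per batch.
    return {name: max(round(batch_size * ratio), 1) for name, ratio in zip(names, ratios)}


print(per_dataset_batch_sizes("MJ-ST", "0.5-0.5"))
# {'MJ': 96, 'ST': 96}
```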

# Results

You can experiment with logging different parameters. For now, we log the gradients. We'll leave logging the loss and accuracy curves as an exercise for the reader.


*   Gradients from third run, `sandy-glitter-3` here: https://wandb.ai/andrea0/deep-text-recognition-benchmark/runs/2kppjp09?workspace=user-andrea0
*   Hardware usage - CPU, GPU, etc. here: https://wandb.ai/andrea0/deep-text-recognition-benchmark?workspace=user-andrea0



<img src="https://i.imgur.com/8f68ADy.png" width="80%" alt="Weights & Biases Logo" />
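
For the loss and accuracy curves left as an exercise above, one possible starting point is a small helper that packs the values into a dictionary and hands it to a logging function. This is a sketch under assumptions: the metric names are made up for illustration, and the variables holding the loss and accuracy in `train.py` depend on the version of the script. The iteration is stored as an ordinary metric rather than as the `step` argument, to avoid clashing with the steps that `wandb.watch` already records.

```python
def log_metrics(log_fn, iteration, train_loss, valid_loss, accuracy):
    """Build a metrics dictionary and pass it to `log_fn` (for example wandb.log)."""
    metrics = {
        "iteration": int(iteration),
        "train/loss": float(train_loss),
        "valid/loss": float(valid_loss),
        "valid/accuracy": float(accuracy),
    }
    log_fn(metrics)
    return metrics


if __name__ == "__main__":
    # Stand-in logger so the helper can be tried without a W&B run.
    log_metrics(print, iteration=100, train_loss=1.5, valid_loss=1.25, accuracy=42.0)
```

Inside the validation block of `train.py` you would call `log_metrics(wandb.log, ...)` with the script's own loss and accuracy values.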


# Bonus: Working with `wandb` Artifacts

As you may have noticed, while fine-tuning the EasyOCR model we had to retrieve files from several different locations. It's not ideal to download some files from GitHub, which has [bandwidth and storage limits for large files](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage), and then go to another file-sharing service such as Google Drive or Dropbox for additional files.

To log a file (upload a file or set of files) to the Weights and Biases Artifacts tool, specify the project name and the name of the file that you want to save as an Artifact.

You can optionally pass in parameters such as the Artifact type, e.g., `dataset`, `script`, `model-weights`, etc.

```python
import wandb
import os

# Initialize a new W&B run to track this job
run = wandb.init(project="deep-text-recognition-benchmark", job_type="dataset-creation")
# Create a new artifact (type dataset)
dataset = wandb.Artifact('my-dataset', type='dataset')
# Add files to the artifact: the train, test, and eval data
dataset.add_file('lmdb-dataset.zip')
# Log the artifact to save it as an output of this run
run.log_artifact(dataset)

wandb.finish()
```


The Artifact will begin syncing and a Weights and Biases link with yellow text will appear in your Colaboratory Notebook. Click on that link, then click on the database icon (looks like a cylindrical can made up of three slices), and then click on your dataset name to be taken to the Overview page in the Artifacts section.

Instead of clicking through, you can navigate directly to your project's Artifacts page at a URL of the form:

`https://wandb.ai/USERNAME/PROJECT_NAME/artifacts`

For example:

`https://wandb.ai/andrea0/deep-text-recognition-benchmark/artifacts`
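
If you want to print that link from a notebook cell, the URL pattern can be filled in with a few lines of Python. The helper `artifacts_url` is hypothetical and simply follows the pattern shown here:

```python
from urllib.parse import quote


def artifacts_url(username, project):
    """Build the Artifacts page URL for a W&B username and project name."""
    # quote() escapes characters that are not safe inside a URL path segment.
    return f"https://wandb.ai/{quote(username, safe='')}/{quote(project, safe='')}/artifacts"


print(artifacts_url("andrea0", "deep-text-recognition-benchmark"))
# https://wandb.ai/andrea0/deep-text-recognition-benchmark/artifacts
```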




<img src="https://i.imgur.com/X5oWFME.png" width="80%" alt="Screenshot of Weights and Biases Artifact tool showing a dataset" />
