## Install and import the library

In [None]:
!pip install kraken
import kraken

Collecting kraken
  Downloading kraken-4.1.2-py3-none-any.whl (5.4 MB)
Collecting coremltools>=3.3
  Downloading coremltools-5.2.0-cp37-none-manylinux1_x86_64.whl (1.6 MB)
Collecting click>=8.1
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting rich
  Downloading rich-12.4.1-py3-none-any.whl (231 kB)
Collecting pytorch-lightning
  Downloading pytorch_lightning-1.6.3-py3-none-any.whl (584 kB)
Collecting python-bidi
  Downloading python_bidi-0.4.2-py2.py3-none-any.whl (30 kB)
Collecting pyDeprecate<0.4.0,>=0.3.1
  Downloading pyDeprecate-0.3.2-py3-none-any.whl (10 kB)
Collecting fsspec[http]!=2021.06.0,>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)

## Create the working directories where we will store:
* the images to process (`img`)
* the OCR model(s) (`material`)
* the binarized images (`bin`)
* the image segmentations (`seg`)
* the OCR results (`results`)

In [None]:
!mkdir img
!mkdir img/bin
!mkdir img/seg
!mkdir material
!mkdir results
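
The same folder layout can also be created from Python, which avoids the error `mkdir` reports when a folder already exists. This is a minimal sketch, assuming the five folder names from the list above; `make_workdirs` is a hypothetical helper name and is not part of Kraken.

```python
from pathlib import Path

# Folder names taken from the list above; `bin` and `seg` live inside `img`.
WORKDIRS = ["img", "img/bin", "img/seg", "material", "results"]


def make_workdirs(root="."):
    """Create the working directories under `root`; safe to run repeatedly."""
    created = []
    for name in WORKDIRS:
        path = Path(root) / name
        # parents=True creates `img` before `img/bin`;
        # exist_ok=True makes a second run a no-op instead of an error.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

In Colab, calling `make_workdirs("/content")` would create the folders next to the notebook's default working directory.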

## Download an existing Kraken model
* [Kraken Models](https://gitlab.inria.fr/dh-projects/kraken-models)
* [OCR17](https://drive.google.com/file/d/1DfYmJjSeImsU0XyPPVQcwtG92U_bztGp/view)
* [LECTAUREP Contemporary French Model (Administration)](https://zenodo.org/record/6542744#.Yoj8wS8it-V)

Then store the downloaded model(s) in the `material` folder.

In [None]:
from google.colab import files
files.upload('/content/material/')
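
`files.upload()` returns a dictionary that maps each uploaded file name to its content as bytes. If the installed Colab version saves uploads to the current directory rather than to `material`, that dictionary can be written to the right folder by hand. The sketch below assumes this situation; `save_uploads` is a hypothetical helper, not a Colab function.

```python
from pathlib import Path


def save_uploads(uploaded, dest_dir="material"):
    """Write a {filename: bytes} mapping into `dest_dir` and return the new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, data in uploaded.items():
        # Keep only the base name so a stray path component cannot escape dest_dir.
        target = dest / Path(name).name
        target.write_bytes(data)
        saved.append(target)
    return saved
```

In a Colab cell this could be used as `save_uploads(files.upload(), "/content/material")`.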

## Binarize the images

In [None]:
!kraken -I ./img/M1119_03_R416_005r.jpg -o .png binarize

Binarizing ✓


##### Move the binarized `.png` images into the `bin` folder

In [None]:
!mv ./img/*.png ./img/bin/

## Binarize + segment the images

In [None]:
!kraken -I img/*.jpg -o .json binarize segment

Binarizing ✓
Segmenting ✓


##### Move the `.json` segmentations into the `seg` folder

In [None]:
!mv ./img/*.json ./img/seg/
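
The move steps in this notebook all follow the same pattern: take every file with a given extension from one folder and put it in another. A minimal Python sketch of that pattern is given here; `move_by_suffix` is a hypothetical helper name, and the folder names are the ones used in this notebook.

```python
import shutil
from pathlib import Path


def move_by_suffix(src_dir, dst_dir, suffix):
    """Move every file in `src_dir` whose name ends with `suffix` into `dst_dir`."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    # glob is not recursive here, so files already in subfolders are left alone.
    for path in sorted(src.glob(f"*{suffix}")):
        if path.is_file():
            target = dst / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

For example, `move_by_suffix("img", "img/seg", ".json")` would mirror the `mv` command for the segmentation files.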

#### View the image segmentation

In [None]:
!cat ./img/seg/M1119_03_R416_005r.json

{"text_direction": "horizontal-lr", "boxes": [[573, 144, 2520, 347], [767, 319, 2468, 442], [2552, 0, 2809, 122], [1114, 507, 2103, 621], [564, 594, 2824, 723], [562, 715, 2840, 823], [559, 815, 2635, 907], [491, 907, 2752, 1023], [553, 1012, 2821, 1128], [550, 1117, 2809, 1233], [1082, 1218, 2043, 1314], [48, 1288, 2840, 1411], [4, 1395, 2836, 1521], [32, 1486, 2381, 1619], [1180, 1597, 1648, 1693], [438, 1687, 2781, 1811], [2217, 1781, 2521, 1851], [537, 1796, 1522, 1915], [1249, 1856, 1684, 1984], [535, 1948, 2802, 2109], [534, 2072, 1497, 2207], [1432, 2152, 1781, 2283], [531, 2272, 2613, 2417], [529, 2384, 2389, 2521], [54, 2564, 2652, 2691], [36, 2677, 2816, 2790], [44, 2752, 1713, 2976], [1347, 2876, 1634, 2949], [330, 2964, 2840, 3101], [523, 3077, 2825, 3188], [462, 3174, 938, 3291], [1220, 3251, 1651, 3369], [521, 3352, 2751, 3483], [1265, 3461, 1640, 3565], [519, 3552, 2781, 3682], [525, 3646, 2735, 3777], [939, 3747, 1092, 3830], [1323, 3770, 1669, 3834], [470, 3853, 2682, 
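
The raw JSON is hard to read, so a short summary can help when checking a segmentation. The sketch below assumes each entry of `boxes` is `[x0, y0, x1, y1]` in pixels, which is an assumption about the format rather than something the output states; `summarize_segmentation` is a hypothetical helper. The sample reuses only the first three boxes from the output above.

```python
import json


def summarize_segmentation(seg):
    """Return text direction, line count and (width, height) of each box.

    Assumes every box is [x0, y0, x1, y1].
    """
    sizes = [(x1 - x0, y1 - y0) for x0, y0, x1, y1 in seg["boxes"]]
    return {
        "direction": seg["text_direction"],
        "n_lines": len(sizes),
        "sizes": sizes,
    }


# First three boxes copied from the output above; the real file contains more.
sample = json.loads(
    '{"text_direction": "horizontal-lr", '
    '"boxes": [[573, 144, 2520, 347], [767, 319, 2468, 442], [2552, 0, 2809, 122]]}'
)
summary = summarize_segmentation(sample)
print(summary["n_lines"], summary["sizes"][0])  # prints: 3 (1947, 203)
```

For a real file, the dictionary would come from `json.load(open("./img/seg/M1119_03_R416_005r.json"))` instead of the inline sample.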

## Binarize + segment + transcribe

In [None]:
!kraken -I ./img/*.jpg -o .txt binarize segment ocr -m ./material/lectaurep_base.mlmodel

Loading ANN default ✓
Binarizing ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 40/40 0:00:00 0:00:26
Writing recognition results for ./img/M1119_03_R416_005r.jpg ✓


## Move the `.txt` transcriptions into the `results` folder

In [None]:
!mv ./img/*.txt ./results

## Train the model

### Creating the training data (_ground truth_)

##### Generating the transcription interface

In [None]:
!ketos transcribe -o ./material/test.html ./img/M1119_03_R416_005r.jpg

Reading images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00 0:00:55
Writing output ✓


#### Download the HTML file of the transcription interface

In [None]:
from google.colab import files
files.download('/content/material/test.html')

##### Prefill the transcription interface to speed up ground-truth creation
⚠️ **BUG** The transcribed lines in the HTML `<li>` elements are not displayed in the browser.

In [None]:
!ketos transcribe -o material/test_prefill.html --prefill /content/material/lectaurep_base.mlmodel /content/img/M1119_03_R416_005r.jpg

Loading ANN ✓
Reading images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00 0:01:22
Writing output ✓


In [None]:
from google.colab import files
files.download('/content/material/test_prefill.html')

## Batch processing (OCR)

In [None]:
!kraken -I "./img/*.jpg" -o .txt binarize segment ocr -m ./material/lectaurep_base.mlmodel

Loading ANN default ✓
Binarizing ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 22/22 0:00:00 0:00:12
Writing recognition results for ./img/M1119_03_R416_006v.jpg ✓
Binarizing ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 17/17 0:00:00 0:00:12
Writing recognition results for ./img/M1119_03_R416_007r.jpg ✓
Binarizing ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 37/37 0:00:00 0:00:28
Writing recognition results for ./img/M1119_03_R416_008v.jpg ✓
Binarizing ✓
Segmenting ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 45/45 0:00:00 0:00:24

## Move the `.txt` transcriptions into the `results` folder

In [None]:
!mv ./img/*.txt ./results
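
An alternative to the quoted glob is to launch one `kraken` process per image, so that a failure on one page does not stop the others. The sketch below only builds the command lines, reusing exactly the options from the cell above (`-I`, `-o`, `binarize segment ocr -m`); `build_ocr_commands` is a hypothetical helper and the per-image loop is an assumption about how one might organize the batch, not Kraken's own batch mechanism.

```python
from pathlib import Path


def build_ocr_commands(img_dir, model, pattern="*.jpg"):
    """Return one kraken command (as an argument list) per image in `img_dir`."""
    commands = []
    for image in sorted(Path(img_dir).glob(pattern)):
        commands.append([
            "kraken", "-I", str(image), "-o", ".txt",
            "binarize", "segment", "ocr", "-m", str(model),
        ])
    return commands
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, for example inside a `try`/`except` block that records which images failed.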

## _Fine-tune_ the model

In [None]:
!ketos transcribe --prefill ./material/riant_ftmrs15_12.mlmodel -o ./material/test_prefill.html ./img/M1119_03_R416_005r.jpg

Loading ANN ✓
Reading images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00 0:01:21
Writing output ✓


In [None]:
!ketos transcribe -o ./material/test_prefill.html --prefill ./material/riant_ftmrs15_12.mlmodel ./img/M1119_03_R416_005r.jpg

Loading ANN ✓
Reading images ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00 0:01:16
Writing output ✓
