### Simple visualization

This is just one very simple way to visualize the book content.

In [1]:
from mikatools import json_load  # only json_load is used below
from pathlib import Path
import re
from operator import itemgetter
import xml.etree.ElementTree as ET

Ocropy's generated HTML does not display the lines in the correct order, so to be safe we reconstruct the order from its content. We parse the HTML, sort the lines, and compare the result with what we have in Calamari's JSON files. All of this rests on the assumption that sorting the lines by their top position within each page puts them in the desired order.

In [2]:
tree = ET.parse('book-v02.html')
root = tree.getroot()

table_nodes = []

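for table in root.findall('./body/table'):

    content = {}

    # The first cell holds the line identifier, which encodes page and position
    text = table.find('./tr/td').text

    content['line_id'] = text
    content['page_id'] = re.search(r'book/(.+?)-block', text).group(1)
    # Raw strings keep \d from being treated as an invalid string escape
    content['line_top'] = int(re.search(r'-block_\d+-\d+-(\d+)-', text).group(1))
    content['table'] = table
    # The third row holds the text of the line
    content['text'] = table.find('./tr[3]/td').text
    table_nodes.append(content)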

Here we sort the lines by page and top position.

In [3]:
text_ocropy_html = sorted(table_nodes, key=itemgetter('page_id', 'line_top'))
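
To make the sort key concrete, here is a minimal sketch that applies the same two regular expressions to a few made-up line identifiers and then sorts the result. The sample identifiers are invented so that they fit the patterns; the real identifiers in `book-v02.html` may look different.

```python
import re
from operator import itemgetter

# Hypothetical identifiers, shaped only to satisfy the two patterns used above
sample_ids = [
    "book/0002-block_1-10-300-90.png",
    "book/0001-block_1-10-250-90.png",
    "book/0001-block_0-10-40-90.png",
]

lines = []
for line_id in sample_ids:
    lines.append({
        "line_id": line_id,
        "page_id": re.search(r"book/(.+?)-block", line_id).group(1),
        # int() matters: as strings, "250" would sort before "40"
        "line_top": int(re.search(r"-block_\d+-\d+-(\d+)-", line_id).group(1)),
    })

# Sort by page first, then by top position within the page
ordered = sorted(lines, key=itemgetter("page_id", "line_top"))
order_summary = [(line["page_id"], line["line_top"]) for line in ordered]
print(order_summary)
```

Note that `page_id` stays a string, so pages are ordered lexicographically. That works as long as the page identifiers are zero-padded to the same width.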

Here we pick up the Calamari JSON files.

In [5]:
text_calamari_json = []

for json_file in sorted(Path().glob('mdf-44_JHN-1901/*json')):
    
    for line in json_load(json_file):
    
        text_calamari_json.append(line['text_pred'])
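
If `mikatools` is not available, the same collection step can be sketched with the standard library alone. `load_predictions` is a hypothetical helper name, and the sketch assumes that each file holds a JSON list of objects with a `text_pred` key, as the loop above implies.

```python
import json
from pathlib import Path


def load_predictions(directory):
    """Collect the 'text_pred' value of every line in every JSON file, in filename order."""
    predictions = []
    # sorted() makes the page order depend on the filenames, as in the cell above
    for json_file in sorted(Path(directory).glob("*json")):
        with open(json_file, encoding="utf-8") as handle:
            for line in json.load(handle):
                predictions.append(line["text_pred"])
    return predictions
```

Calling `load_predictions('mdf-44_JHN-1901')` should then give the same kind of list as `text_calamari_json`, provided the filenames sort in page order.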


Here we inspect the positions where the line from Ocropy's HTML and the line from Calamari's JSON do not match.

In [6]:
for line_1, line_2 in zip(text_ocropy_html, text_calamari_json):

    # Compare by value; `is` would test object identity instead
    if line_1['text'] != line_2:

        print(f"Mismatch between {line_1['text']} and {line_2}")

The comparison has to use `!=` rather than `is`. The `is` operator tests whether two names refer to the same object, not whether the strings are equal, and CPython typically reuses a single object for each one-character string. An `is` check therefore fires on exactly the short lines such as `-`, `2` and `4`, even though both sources agree on them. A small visual inspection also suggests that the two sources are in agreement.
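
Because `zip` stops at the shorter input, a pairwise comparison can silently skip trailing lines when the two sources differ in length. The following is a minimal sketch of a value-based comparison that also reports such missing lines. `find_mismatches` is a hypothetical helper and the sample data is made up.

```python
from itertools import zip_longest


def find_mismatches(lines_a, lines_b):
    """Return (index, a, b) for every position where the two sequences differ by value."""
    mismatches = []
    for index, (a, b) in enumerate(zip_longest(lines_a, lines_b)):
        # zip_longest pads the shorter input with None,
        # so a missing line also counts as a mismatch
        if a != b:
            mismatches.append((index, a, b))
    return mismatches


# Made-up sample data for illustration
html_lines = ["-", "2", "кь", "extra"]
json_lines = ["-", "2", "кь"]
print(find_mismatches(html_lines, json_lines))
```

With the notebook's data this could be called as `find_mismatches([line['text'] for line in text_ocropy_html], text_calamari_json)`. One caveat: a line whose text is genuinely `None` cannot be told apart from padding, so an empty result is the only unambiguous outcome.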

In [16]:
for line_1, line_2 in zip(text_ocropy_html, text_calamari_json):

    print(f"{line_1['text']} \t\t {line_2}")

оъ он 		 оъ он
свто вагԙл 		 свто вагԙл
кь 		 кь
а центральномъ Мошансомъ нарѣіи 		 а центральномъ Мошансомъ нарѣіи
Мордцовсаго зьа. 		 Мордцовсаго зьа.
ел ьсмнгфорсъ. 		 ел ьсмнгфорсъ.
аданіе Врмтансаго и Мносграннаго 		 аданіе Врмтансаго и Мносграннаго
бейскаго Сбщества. 		 бейскаго Сбщества.
а.кль,ць 		 а.кль,ць
шкайстьнь нькля 		 шкайстьнь нькля
Мокшънь кԙльс кепедіец шԙнԙрԙень поп 		 Мокшънь кԙльс кепедіец шԙнԙрԙень поп
н. а с о в. 		 н. а с о в.
отъ о 		 отъ о
святок квангшлв 		 святок квангшлв
-=-.-- 		 -=-.--
=:.. 		 =:..
;.:- 		 ;.:-
На центральномъ Мокшанскомъ нарѣчіи 		 На центральномъ Мокшанскомъ нарѣчіи
ордовскаго язька. 		 ордовскаго язька.
Перевелъ села С. шенева енз. г. Псарскаго у. 		 Перевелъ села С. шенева енз. г. Псарскаго у.
Свапеннкъ Николай арсовъ. 		 Свапеннкъ Николай арсовъ.
зданіе рптапскаго и ностраншаго 		 зданіе рптапскаго и ностраншаго
цблейскаго Обпцества. 		 цблейскаго Обпцества.
ешо в ь . -еербуръ: 		 ешо в ь . -еербуръ:
ово-Псаакісвска улц:а . 		 ово-П

In [None]:
for json_file in sorted(Path().glob('mdf-44_JHN-1901/*json')):

    print(f'Page {json_file}')
    
    print()
    
    for line in json_load(json_file):
        print(line['text_pred'])
        
    print()