# NER evaluation + OCR clean-up explorations

## NER evaluation

The model was trained on external data: ca. 500 entries of partially cleaned data from the Tallinn University project.

NER categories:

- LOC
- DAT (date)
- WEA (weather phenomenon)
- MEA (measurement)
- PER
- ORG (incl. newspapers)
- MISC (incl. ship names)

In [1]:
import pandas as pd
import spacy
import numpy as np

In [2]:
from climdist.ocr.spellcorrection import spelling_correction
from climdist.ner.evaluation import *

In [14]:
nlp = spacy.load('../data/models/storms_ner_model/')

In [15]:
df = pd.read_excel('../data/processed/LNB_processed.xlsx')

In [56]:
#testbatch = generate_ocr_testbatch(df, 20, maxwords=250)
testbatch = df.iloc[list(results.keys())]

In [57]:
evaluate_spelling_for_ner(testbatch, nlp)

Starting entry 13437: Rigasche Zeitung, 1870.09.15
http://periodika.lv/viewerOpener?issue=/p_003_rzei1870s01n213&article=DIVL17&query=Sturm


DAT
False Negatives: 0
True Negatives: 192
DAT [0, 0, 192, 0]


LOC
True Positives: 16
False Positives: 3
LOC
False Negatives: 5
True Negatives: 171
LOC [16, 3, 171, 5]


MEA


KeyboardInterrupt: Interrupted by user

### Precision and recall for all NER categories

In [58]:
pr_all_categories = get_precision_recall(results)
print('\n')
print(pr_all_categories)
print('\n')
print('F-score:', get_fscore(pr_all_categories))

305 true positives
122 false positives
29346 true negatives
155 false negatives


From 20 entries, for labels ['WEA', 'LOC', 'DAT', 'PER', 'MISC', 'ORG', 'MEA']: precision: 0.7142857142857143, recall: 0.6630434782608695


(0.7142857142857143, 0.6630434782608695)


F-score: 0.6877113866967305
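
As a cross-check on the figures above, the sketch below recomputes precision, recall and F-score directly from the printed counts. It is not the code of `get_precision_recall` or `get_fscore`; the function name `precision_recall_f1` is hypothetical, and the F-score is assumed to be the balanced F1 (harmonic mean of precision and recall), which is consistent with the printed value.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    True negatives are deliberately not used: with 29346 true negatives
    against a few hundred entity tokens, they would swamp any metric
    that includes them.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts printed above for all seven labels
p, r, f = precision_recall_f1(tp=305, fp=122, fn=155)
print(f'precision: {p:.4f}, recall: {r:.4f}, F1: {f:.4f}')
```

The same function applied to the counts of a label subset gives the scores for that subset.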


### Precision and recall for WEA, LOC, DAT
These three categories matter most for text classification. MISC and ORG are also somewhat difficult to evaluate, so scores that include them are less reliable.

In [59]:
pr_wea_loc_dat = get_precision_recall(results, ['WEA', 'LOC', 'DAT'])
print('\n')
print(pr_wea_loc_dat)
print('\n')
print('F-score:', get_fscore(pr_wea_loc_dat))

245 true positives
88 false positives
12460 true negatives
69 false negatives


From 20 entries, for labels ['WEA', 'LOC', 'DAT']: precision: 0.7357357357357357, recall: 0.7802547770700637


(0.7357357357357357, 0.7802547770700637)


F-score: 0.7573415765069551


In [63]:
print(testbatch.iloc[17].full_text)
print(testbatch.iloc[17].href)

Witterungsbeobachtungen in Riga

«MM » Ut wtattta Ы *«t«rf>rf<»aftmtil.
~c veebachrmtg«- «1««"« r»»«à. »«»«à
« аай *■ — ■ I
I Р**' ». в. MM«, g.
"7 Owgotl 7 Uhr 743.2 + 7,8
Dcitiosil 1 „ 742,3 -f 7.2
Abacs 9 , 743,6 -f- 8,9 -f. 9.9
8 Morgen» 7 Uhr 745.3 -f 3.4
Mittag« 1 . 745,7 -f- b,9
«bmd« 9 . 746,0 + 3,1 +1,
/Morgen« 7 Übt 87 8 6,4
Mittag« 1 , /9 10 6,1 i 6,3 3,8
«bend, 9 . 94 10 60,7 1,3
8 Morgen« 7 Uhr 84 9 6,7
«à» 1 „ 78 8 639,8 7.0
Abend« 9 . 88 10 6« 0.9
«emerfonT. A» 7. «ptll Bormittag« ein wenig
und Abend« g Uhr Regen. «« 3. «pril vormittag«
Schnee und Hagel, Nachmittag« sehr wenig Regen.
Für bis Redaction veranwortlich: t. Pejold.
http://periodika.lv/viewerOpener?issue=/p_003_rzei1879s01n080&article=DIVL123&query=Hagel


## OCR clean-up attempts
### Two relevant Python modules: pyspellchecker and symspellpy

In [35]:
from spellchecker import SpellChecker
from symspellpy import SymSpell
from climdist.ocr import spellcorrection

In [31]:
# Pick one random text with fewer than 300 words
short_texts = df[df.w_count < 300]
randomtext = short_texts.iloc[np.random.randint(0, high=len(short_texts))].full_text

#### Try default dictionaries

In [40]:
pyspell = SpellChecker(language='de', case_sensitive=True)
symspell = SymSpell()
symspell.load_dictionary('../pipeline/02_ocr/spell_dicts/symspell_default_dict_de.txt', 0, 1, encoding='utf8')

True

In [64]:
# Print each word pyspellchecker does not recognise, followed by its suggested correction
for word in spellcorrection.get_unknown_words(randomtext, pyspell):
    print(word, '-----', pyspell.correction(word))

The text has about 80 words
45 words are not recognized by pyspellchecker
London, ----- londons
Unheil ----- heil
angerichtet. ----- angerichtet
28- ----- 28-
Barkschiff ----- Barkschiff
Whiimore, ----- whitmore
vcn ----- von
Antwerpen ----- antworten
die" ----- die
Handwerkern ----- Handwerkern
Bord, ----- lord
Longsands ----- longshanks
Personen ----- personal
Gräfin ----- träfen
Rosfi ----- rosa
(Henriekle ----- (Henriekle
I. ----- in
Slernschen ----- Slernschen
classische ----- klassische
Mustk ----- musik
Piecen, ----- piepen
i». ----- ich
Mendelsiohnschen ----- Mendelsiohnschen
März. ----- märz
London, ----- londons
März. ----- märz
 ----- i
Gewässern ----- bewässern
„Flor«dian", ----- „Flor«dian",
E-apitoin ----- E-apitoin
kommend, ----- kommende
tnn ----- tun
Auswanderern, ----- Auswanderern,
Landleuten ----- landeten
Fa- ----- far
Milien ----- milden
gestranbel ----- gestrandet
worden. ----- worden
 ----- i
Sonnraa) ----- Sonnraa)
Gesangverein ----- Gesangverein
na» ----- na
m
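
Several tokens in the list above (`angerichtet.`, `März.`, `worden.`, `London,`) carry trailing punctuation, which may be the only reason they are flagged. The sketch below shows one way to strip edge punctuation before the dictionary lookup. It uses only the standard library; `split_edges`, `unknown_cores` and the tiny `known` word set are hypothetical stand-ins, not part of `climdist` or pyspellchecker.

```python
import string

# Characters peeled off token edges, including quotation marks seen in the OCR output
EDGE_CHARS = string.punctuation + '„“”«»'


def split_edges(token):
    """Split a token into (leading punctuation, core word, trailing punctuation)."""
    core = token.strip(EDGE_CHARS)
    if not core:
        return token, '', ''
    start = token.index(core)
    return token[:start], core, token[start + len(core):]


def unknown_cores(tokens, known_words):
    """Return the core words that are neither numbers nor in the dictionary."""
    unknown = []
    for token in tokens:
        _, core, _ = split_edges(token)
        if core and not core.isdigit() and core.lower() not in known_words:
            unknown.append(core)
    return unknown


# Toy dictionary standing in for a real German word list (assumption)
known = {'london', 'angerichtet', 'märz', 'worden', 'von'}
tokens = ['London,', 'angerichtet.', 'März.', 'vcn', '28-', 'worden.']
print(unknown_cores(tokens, known))
```

With this pre-processing only the genuinely misrecognised `vcn` remains unknown in the toy example; the stripped punctuation is returned separately so it can be re-attached after correction.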

In [65]:
symspell_default_corrected_text = spelling_correction(randomtext, symspell, transfercasing=True)
print(symspell_default_corrected_text)

London vom 3 März lOndon vom 3 März Die letzten Stürme haben in den Britischen Gewässern viel Unheil angerichtet lEider ist am 28 fEbruar das bAr schiff florian e api in wie mode von antwerpen kommend tUn 170 die 200 Deutschen Auswanderern lAndsleuten und Handwerkern nebst ihren FamiLien am boRd auf von sands gestrandet Und nur vier Personen davon gerettet worden beRlin Gräfin Rolf heRr ekle Sonntag ist hier in dem i stErnchen gesAng verein für klassische musik in mehren Piepen namentlich i mendels ohmschen kompost ionen aufgetreten
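
The corrected text above contains words such as `lOndon`, `lEider` and `fEbruar`, where the capital lands on the second character. One possible workaround, sketched here under the assumption that each corrected word can be paired with its original token, is to copy only the word-level casing pattern instead of casing character by character. `transfer_word_casing` is a hypothetical helper, not a symspellpy or `climdist` function.

```python
def transfer_word_casing(original, corrected):
    """Apply the casing pattern of `original` to `corrected` at word level."""
    if not original or not corrected:
        return corrected
    if len(original) > 1 and original.isupper():
        # Fully upper-case original: keep the corrected word upper-case
        return corrected.upper()
    if original[0].isupper():
        # Capitalised original (e.g. a German noun): capitalise the first letter only
        return corrected[0].upper() + corrected[1:].lower()
    return corrected.lower()


pairs = [('London', 'london'), ('Mustk', 'musik'), ('vcn', 'von')]
print([transfer_word_casing(o, c) for o, c in pairs])
```

A token that starts with punctuation (for example an opening quotation mark) would be treated as lower-case by this sketch, so edge punctuation should be stripped first.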


In [55]:
from spacy import displacy

# Colour for each entity label in the displaCy visualisation
displacy_color_code = {'WEA': '#4cafd9',
                       'PER': '#ffb366',
                       'DAT': '#bf80ff',
                       'LOC': '#a88676',
                       'MISC': 'grey',
                       'MEA': '#85e085',
                       'ORG': '#5353c6'}

displacy_options = {'ents': ['WEA', 'PER', 'DAT', 'LOC', 'MISC', 'MEA', 'ORG'], 'colors': displacy_color_code}
    

for text in [randomtext.replace('\n', ' '), symspell_default_corrected_text]:
    doc = nlp(text)
    displacy.render(doc, style='ent', jupyter=True, options=displacy_options)
    print('\n')









In [6]:
# Results for the test batch.
# They were not assigned to a variable when first computed, so they are pasted here;
# the keys identify the exact texts that were used for the evaluation.
results = {13437: {'DAT': [0, 0, 192, 0],
  'LOC': [15, 4, 170, 7],
  'MEA': [0, 0, 192, 0],
  'MISC': [0, 1, 192, 0],
  'ORG': [0, 1, 192, 0],
  'PER': [0, 0, 192, 0],
  'WEA': [1, 0, 191, 0]},
 10711: {'DAT': [0, 0, 204, 0],
  'LOC': [16, 8, 186, 2],
  'MEA': [0, 0, 204, 0],
  'MISC': [1, 1, 203, 0],
  'ORG': [1, 2, 202, 1],
  'PER': [0, 0, 199, 5],
  'WEA': [1, 0, 203, 0]},
 24081: {'DAT': [7, 0, 225, 0],
  'LOC': [7, 4, 221, 4],
  'MEA': [0, 0, 232, 0],
  'MISC': [0, 0, 232, 0],
  'ORG': [1, 1, 231, 0],
  'PER': [0, 0, 230, 2],
  'WEA': [1, 0, 231, 0]},
 969: {'DAT': [2, 0, 276, 0],
  'LOC': [4, 1, 271, 3],
  'MEA': [0, 0, 278, 0],
  'MISC': [0, 0, 278, 0],
  'ORG': [0, 0, 278, 0],
  'PER': [1, 0, 276, 1],
  'WEA': [1, 1, 277, 0]},
 26801: {'DAT': [1, 0, 291, 1],
  'LOC': [6, 1, 285, 2],
  'MEA': [0, 0, 293, 0],
  'MISC': [4, 0, 287, 2],
  'ORG': [1, 0, 291, 1],
  'PER': [1, 0, 292, 0],
  'WEA': [2, 1, 290, 1]},
 19923: {'DAT': [0, 0, 169, 0],
  'LOC': [1, 3, 168, 0],
  'MEA': [0, 0, 169, 0],
  'MISC': [1, 0, 168, 0],
  'ORG': [0, 2, 169, 0],
  'PER': [7, 0, 158, 4],
  'WEA': [0, 2, 169, 0]},
 24113: {'DAT': [0, 0, 244, 0],
  'LOC': [0, 1, 244, 0],
  'MEA': [0, 0, 244, 0],
  'MISC': [0, 1, 244, 0],
  'ORG': [0, 1, 244, 0],
  'PER': [0, 0, 244, 0],
  'WEA': [3, 1, 241, 0]},
 371: {'DAT': [2, 0, 282, 1],
  'LOC': [17, 1, 261, 7],
  'MEA': [0, 0, 285, 0],
  'MISC': [0, 0, 285, 0],
  'ORG': [0, 0, 285, 0],
  'PER': [15, 0, 264, 6],
  'WEA': [0, 2, 285, 0]},
 11159: {'DAT': [0, 0, 194, 0],
  'LOC': [24, 2, 164, 6],
  'MEA': [0, 0, 194, 0],
  'MISC': [0, 0, 194, 0],
  'ORG': [0, 0, 194, 0],
  'PER': [1, 2, 193, 0],
  'WEA': [1, 0, 193, 0]},
 20285: {'DAT': [4, 7, 259, 0],
  'LOC': [7, 2, 256, 0],
  'MEA': [0, 2, 263, 0],
  'MISC': [0, 0, 263, 0],
  'ORG': [1, 1, 262, 0],
  'PER': [18, 0, 234, 11],
  'WEA': [0, 2, 263, 0]},
 6855: {'DAT': [1, 0, 329, 0],
  'LOC': [0, 16, 330, 0],
  'MEA': [0, 0, 330, 0],
  'MISC': [0, 0, 330, 0],
  'ORG': [0, 0, 330, 0],
  'PER': [4, 2, 314, 12],
  'WEA': [0, 2, 330, 0]},
 4806: {'DAT': [0, 0, 90, 7],
  'LOC': [0, 0, 97, 0],
  'MEA': [0, 1, 97, 0],
  'MISC': [0, 0, 97, 0],
  'ORG': [0, 1, 97, 0],
  'PER': [0, 0, 97, 0],
  'WEA': [6, 1, 87, 4]},
 31670: {'DAT': [0, 0, 46, 0],
  'LOC': [1, 2, 45, 0],
  'MEA': [0, 0, 46, 0],
  'MISC': [0, 0, 46, 0],
  'ORG': [0, 0, 46, 0],
  'PER': [0, 0, 46, 0],
  'WEA': [0, 0, 45, 1]},
 9485: {'DAT': [6, 0, 131, 0],
  'LOC': [10, 2, 127, 0],
  'MEA': [0, 0, 137, 0],
  'MISC': [0, 0, 136, 1],
  'ORG': [0, 0, 137, 0],
  'PER': [0, 0, 137, 0],
  'WEA': [1, 1, 136, 0]},
 31185: {'DAT': [2, 0, 262, 0],
  'LOC': [12, 2, 246, 6],
  'MEA': [0, 0, 263, 1],
  'MISC': [0, 1, 264, 0],
  'ORG': [0, 0, 264, 0],
  'PER': [0, 0, 264, 0],
  'WEA': [11, 1, 253, 0]},
 2644: {'DAT': [2, 0, 223, 1],
  'LOC': [10, 0, 213, 3],
  'MEA': [0, 0, 226, 0],
  'MISC': [0, 0, 226, 0],
  'ORG': [0, 0, 226, 0],
  'PER': [0, 1, 226, 0],
  'WEA': [8, 0, 218, 0]},
 4296: {'DAT': [4, 1, 276, 0],
  'LOC': [5, 5, 275, 0],
  'MEA': [0, 1, 280, 0],
  'MISC': [1, 0, 279, 0],
  'ORG': [0, 1, 280, 0],
  'PER': [1, 1, 272, 7],
  'WEA': [1, 3, 279, 0]},
 30345: {'DAT': [0, 0, 190, 6],
  'LOC': [1, 0, 195, 0],
  'MEA': [1, 0, 165, 30],
  'MISC': [0, 1, 196, 0],
  'ORG': [0, 8, 196, 0],
  'PER': [0, 0, 196, 0],
  'WEA': [4, 0, 192, 0]},
 11798: {'DAT': [0, 0, 203, 0],
  'LOC': [31, 7, 166, 6],
  'MEA': [0, 0, 203, 0],
  'MISC': [0, 0, 203, 0],
  'ORG': [0, 1, 202, 1],
  'PER': [0, 0, 202, 1],
  'WEA': [1, 2, 202, 0]},
 832: {'DAT': [1, 0, 124, 0],
  'LOC': [4, 0, 120, 1],
  'MEA': [0, 0, 125, 0],
  'MISC': [0, 0, 125, 0],
  'ORG': [0, 0, 125, 0],
  'PER': [0, 0, 125, 0],
  'WEA': [0, 0, 125, 0]}}
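
The per-entry lists appear to be ordered `[TP, FP, TN, FN]`, judging by the interactive output earlier in the notebook (`LOC [16, 3, 171, 5]` printed next to its labelled counts). Under that assumption, the sketch below sums the counts over entries and, optionally, over a subset of labels. `sum_counts` is a hypothetical helper and `sample_results` holds only two entries, with four of their labels, copied from the dictionary above.

```python
# Two entries copied from the results dictionary above (four labels each).
# Each list is assumed to be ordered [TP, FP, TN, FN].
sample_results = {
    13437: {'DAT': [0, 0, 192, 0], 'LOC': [15, 4, 170, 7],
            'WEA': [1, 0, 191, 0], 'PER': [0, 0, 192, 0]},
    832: {'DAT': [1, 0, 124, 0], 'LOC': [4, 0, 120, 1],
          'WEA': [0, 0, 125, 0], 'PER': [0, 0, 125, 0]},
}


def sum_counts(results, labels=None):
    """Sum [TP, FP, TN, FN] over all entries, restricted to `labels` if given."""
    totals = [0, 0, 0, 0]
    for per_label in results.values():
        for label, counts in per_label.items():
            if labels is None or label in labels:
                totals = [t + c for t, c in zip(totals, counts)]
    return totals


tp, fp, tn, fn = sum_counts(sample_results, labels=['WEA', 'LOC', 'DAT'])
print(f'{tp} true positives, {fp} false positives, '
      f'{tn} true negatives, {fn} false negatives')
```

Summing the first column of the full dictionary for WEA, LOC and DAT gives 245, which matches the true-positive total printed in the evaluation section.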