# Step 2: Train a spaCy model with NE samples extracted from TEI files and evaluate the results

The notebook used here comes from the workshop "Information Extraction aus frühneuhochdeutschen Texten" (https://informationsmodellierung.uni-graz.at/de/neuigkeiten/detail/article/workshop-information-extraction-aus-fruehneuhochdeutschen-texten/). It was modified and adapted for this project.

In [None]:
from spacytei.train import batch_train
from spacytei.data_prep import csv_to_traindata, clean_train_data


## 1) Load a csv with training data
A CSV file with training data was created with `step1_preprocessing_data_for_NER.ipynb`.

In [None]:
TRAIN_DATA = csv_to_traindata('output_csv/samples_out_sents.csv')
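
For orientation: the `evaluate` function in section 4 reads every item of `TRAIN_DATA` as `(text, {'entities': [...]})`, which matches spaCy's usual character-offset format. Below is a minimal sketch of one such item. The sentence and the labels `PER` and `LOC` are invented for illustration and are not taken from the project's CSV.

```python
# Hypothetical example item; sentence and labels are invented for illustration.
text = "Anna Müller reiste nach Wien."
example = (text, {"entities": [(0, 11, "PER"), (24, 28, "LOC")]})

# Each entity is (start_char, end_char, label). Slicing the text with the
# offsets should give back exactly the annotated surface form.
surface_forms = [text[start:end] for start, end, _label in example[1]["entities"]]
for (start, end, label), surface in zip(example[1]["entities"], surface_forms):
    print(label, repr(surface))
```

Printing the slices like this is a quick way to spot offsets that are shifted by a character or two.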

In [None]:
len(TRAIN_DATA)

## 2) Clean train data

Remove all empty examples and keep only samples with a text length greater than 15.

In [None]:
TRAIN_DATA = clean_train_data(TRAIN_DATA, min_ents=1, min_text_len=15, lang=[])
len(TRAIN_DATA)
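
To make the filtering step easier to follow, here is a rough sketch of what such a cleaning function might do. `filter_examples` is a hypothetical stand-in, not the actual `clean_train_data` implementation from `spacytei`. Whether the real function compares the length with `>` or `>=`, and what its `lang` parameter does, is not covered here.

```python
def filter_examples(examples, min_ents=1, min_text_len=15):
    """Hypothetical filter: keep examples with enough entities and long enough text."""
    kept = []
    for text, annotations in examples:
        entities = annotations.get("entities", [])
        # assumption: "greater than 15" is a strict comparison
        if len(entities) >= min_ents and len(text) > min_text_len:
            kept.append((text, annotations))
    return kept


# Invented sample data for illustration only.
samples = [
    ("Kurz.", {"entities": [(0, 4, "X")]}),                        # too short
    ("Ein längerer Satz ohne Entitäten.", {"entities": []}),       # no entities
    ("Anna Müller reiste nach Wien.", {"entities": [(0, 11, "PER")]}),
]
filtered = filter_examples(samples)
print(len(samples), "->", len(filtered))
```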

In [None]:
print(TRAIN_DATA)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(TRAIN_DATA)
df.info()
df.to_csv('output_csv/samples_out_sents.csv', index=False)  # save the cleaned samples to output_csv/samples_out_sents.csv

# Watch out! Do not overwrite "samples_out_sents_clean.csv", or the training won't work!
# samples_out_sents_clean.csv was corrected by hand because the automatic conversion does not work properly.

In [None]:
TRAIN_DATA = csv_to_traindata('output_csv/samples_out_sents_clean.csv') 

## 3) Train the model

Unfortunately, I found out that the training data was not converted correctly into the spaCy format: some entities were not assigned correctly, and these errors had to be fixed by hand in the training data. Some entities in the training data are still not aligned correctly, but they are ignored during the training process.
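
A simple sanity check can help to find such broken offsets before training. The sketch below is only an approximation in plain Python. It is not spaCy's own alignment check, which depends on the tokenizer, and `find_suspicious_entities` is a hypothetical helper that is not part of `spacytei`.

```python
def find_suspicious_entities(examples):
    """Return (example_index, entity, reason) for offsets that look wrong."""
    problems = []
    for i, (text, annotations) in enumerate(examples):
        previous_end = 0
        for start, end, label in sorted(annotations.get("entities", [])):
            reason = None
            if not (0 <= start < end <= len(text)):
                reason = "out of bounds"
            elif text[start:end] != text[start:end].strip():
                reason = "leading or trailing whitespace"
            elif start < previous_end:
                reason = "overlaps previous entity"
            elif (start > 0 and text[start - 1].isalnum()) or (
                end < len(text) and text[end].isalnum()
            ):
                # heuristic: a letter or digit right next to the span
                reason = "starts or ends inside a word"
            if reason:
                problems.append((i, (start, end, label), reason))
            previous_end = max(previous_end, end)
    return problems


# Invented sample data: the second example cuts "Müller" in half.
check_samples = [
    ("Anna Müller reiste nach Wien.", {"entities": [(0, 11, "PER"), (24, 28, "LOC")]}),
    ("Anna Müller reiste nach Wien.", {"entities": [(0, 9, "PER")]}),
]
problems = find_suspicious_entities(check_samples)
for index, entity, reason in problems:
    print(index, entity, reason)
```

Anything reported here still needs a manual look, since the word-boundary test is only a heuristic.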

At the end of the training, the F-score of the new model is reported.

In [None]:
batch_train(model='de_core_news_sm', train_data=TRAIN_DATA, output_dir='custom_model')

## 4) Evaluation of the German standard models

For comparison, spaCy's standard German models can be evaluated here.

In [None]:
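import spacy
# note: spacy.gold and GoldParse are part of the spaCy 2.x API
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    """Score ner_model on (text, {'entities': [...]}) examples and return the scores."""
    scorer = Scorer()
    for x in examples:
        # tokenize only, so the gold annotation is built on an unprocessed doc
        doc_gold_text = ner_model.make_doc(x[0])
        gold = GoldParse(doc_gold_text, entities=x[1]['entities'])
        # run the full pipeline to get the predicted entities
        pred_value = ner_model(x[0])
        scorer.score(pred_value, gold)
    return scorer.scores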

# example run

examples = TRAIN_DATA

# evaluate standard models
ner_model = spacy.load('de_core_news_sm')  # another model such as "de_core_news_md" can be loaded here to check its F-score
results = evaluate(ner_model, examples)

results
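
To help read the result: in spaCy 2.x the returned dictionary should contain, among others, the entity scores `ents_p` (precision), `ents_r` (recall) and `ents_f` (F-score). The sketch below shows how these three values relate to each other for exact-match entities. It is a simplified illustration with invented spans, not spaCy's `Scorer` implementation, and `prf` is a hypothetical helper.

```python
def prf(gold_entities, predicted_entities):
    """Exact-match precision, recall and F-score over (start, end, label) tuples."""
    gold = set(gold_entities)
    pred = set(predicted_entities)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented spans: one prediction is correct, one has the wrong offsets.
gold_spans = [(0, 11, "PER"), (24, 28, "LOC")]
pred_spans = [(0, 11, "PER"), (19, 23, "LOC")]
p, r, f = prf(gold_spans, pred_spans)
print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

An entity only counts as correct if start, end and label all match, which is why a prediction that is off by a few characters lowers both precision and recall.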