# Extract NEs from TEIs

The notebook used here comes from the workshop "Information Extraction aus frühneuhochdeutschen Texten" (https://informationsmodellierung.uni-graz.at/de/neuigkeiten/detail/article/workshop-information-extraction-aus-fruehneuhochdeutschen-texten/). It was modified and adapted for this project.

In [None]:
from spacytei.tei import TeiReader

In [None]:
file = '../data/traindata/goldstandard.xml'  # path to your file

In [None]:
teidoc = TeiReader(file)

### Map your TEI encoding to NE tags

In [None]:
NER_TAG_MAP = {
    "persName": "PER",
    "placeName": "LOC",
}

### Define the tags you used for NEs via XPath
* Be aware that these XPaths are relative to a parent node (defaults to `tei:p`).

In [None]:
ne_xpath = './/tei:persName | .//tei:placeName'

In [None]:
ner_samples = teidoc.extract_ne_offsets(ne_xpath=ne_xpath, NER_TAG_MAP=NER_TAG_MAP)

In [None]:
ner_samples[:5]
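To make the idea of "NE offsets" concrete, here is a minimal sketch of how character offsets can be derived from inline TEI markup. It uses only the standard library and is not how `spacytei` implements `extract_ne_offsets`. The helper name `paragraph_to_sample` and the sample shape `(text, {"entities": [(start, end, label), ...]})` are assumptions for illustration, so compare them with the actual output of the cell above.

```python
import xml.etree.ElementTree as ET

NER_TAG_MAP = {"persName": "PER", "placeName": "LOC"}


def paragraph_to_sample(p, tag_map):
    """Flatten one paragraph element into (text, {"entities": [...]}).

    Hypothetical helper: the output shape is an assumption, not
    necessarily what spacytei returns.
    """
    parts = []      # text fragments in document order
    entities = []   # (start, end, label) character offsets
    pos = 0         # current length of the flattened text

    def add(fragment):
        nonlocal pos
        if fragment:
            parts.append(fragment)
            pos += len(fragment)

    def walk(node):
        add(node.text)
        for child in node:
            start = pos
            walk(child)
            local_name = child.tag.split('}')[-1]  # drop the namespace
            if local_name in tag_map:
                entities.append((start, pos, tag_map[local_name]))
            add(child.tail)  # text after the child's closing tag

    walk(p)
    return "".join(parts), {"entities": entities}


xml = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Herr '
    '<persName>Hans Müller</persName> reiste nach '
    '<placeName>Graz</placeName>.</p>'
)
text, annotations = paragraph_to_sample(ET.fromstring(xml), NER_TAG_MAP)
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

Slicing the flattened text with each `(start, end)` pair should give back exactly the string that was wrapped in `persName` or `placeName`, which is a quick way to check any sample by hand.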

## Extract NEs from TEIs (with sent-splitting)
* The samples above are built per paragraph. For long(er) paragraphs, you can create NE samples split by sentence instead.
* Sentence splitting is done by a spaCy model (default `de_core_news_sm`), so make sure spaCy and the model you'd like to use are properly installed.

In [None]:
ner_samples = teidoc.ne_offsets_by_sent(ne_xpath=ne_xpath, NER_TAG_MAP=NER_TAG_MAP)

If the cell above raised an error about a missing model, try to
* install the German model with `!python -m spacy download de`
* and pass the model name to `teidoc.ne_offsets_by_sent(ne_xpath=ne_xpath, NER_TAG_MAP=NER_TAG_MAP, model='de')`

In [None]:
#!python -m spacy download de

In [None]:
#ner_samples = teidoc.ne_offsets_by_sent(ne_xpath=ne_xpath, NER_TAG_MAP=NER_TAG_MAP, model='de')
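Before re-running the sentence-splitting call, it can help to check whether a model package is importable at all. The sketch below relies on the fact that spaCy models such as `de_core_news_sm` are installed as regular Python packages. The helper name `package_available` is made up for this example, and the check only covers models installed as packages, not shortcut links such as `de`.

```python
import importlib.util


def package_available(name):
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


# Assumption: the model was installed as a pip package under this name.
print(package_available("de_core_news_sm"))
```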

In [None]:
ner_samples[:100]
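When a paragraph is split into sentences, the entity offsets have to be re-based so that they count from the start of each sentence. The following sketch shows that step in isolation. It is an illustration, not the `spacytei` implementation: the function name `split_sample_by_sentences` is hypothetical, the sample shape `(text, {"entities": [...]})` is assumed, and the sentence spans are given by hand here, whereas with spaCy they would come from `sent.start_char` and `sent.end_char` of each sentence in `doc.sents`.

```python
def split_sample_by_sentences(text, entities, sent_spans):
    """Turn one paragraph-level sample into one sample per sentence.

    text       -- the paragraph text
    entities   -- list of (start, end, label) offsets into `text`
    sent_spans -- list of (start, end) character spans of the sentences
    """
    samples = []
    for sent_start, sent_end in sent_spans:
        sent_entities = [
            (start - sent_start, end - sent_start, label)
            for start, end, label in entities
            # keep only entities that lie completely inside this sentence
            if start >= sent_start and end <= sent_end
        ]
        samples.append((text[sent_start:sent_end], {"entities": sent_entities}))
    return samples


paragraph = "Hans kam aus Graz. Anna blieb in Wien."
paragraph_entities = [(0, 4, "PER"), (13, 17, "LOC"), (19, 23, "PER"), (33, 37, "LOC")]
sentence_spans = [(0, 18), (19, 38)]  # hand-written for this example

sentence_samples = split_sample_by_sentences(
    paragraph, paragraph_entities, sentence_spans
)
for sent_text, annotations in sentence_samples:
    print(sent_text, annotations)
```

Entities that cross a sentence boundary are dropped in this sketch. Whether that is acceptable depends on the data.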

## Extract NEs from TEIs in bulk
and save results to file

In [None]:
import os
import glob

In [None]:
from spacytei.tei_process import teis_to_traindata, teis_to_traindata_sents

In [None]:
tei_dir = '../data/traindata'  # path to the directory containing the TEI files

In [None]:
files = glob.glob(os.path.join(tei_dir, '*.xml'))  # relative paths of all TEI files in tei_dir

In [None]:
samples = teis_to_traindata(files, parent_node='.//tei:body//tei:p', ne_xpath=ne_xpath, NER_TAG_MAP=NER_TAG_MAP)

In [None]:
print(samples)
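Printing the full list gets hard to read for larger corpora. A compact summary, as sketched below, can be more useful: it counts entities per label and flags spans that are empty or start or end with whitespace. The function name `summarize_samples` is hypothetical, and the code assumes each sample has the shape `(text, {"entities": [(start, end, label), ...]})`. Adjust it if the printed samples look different.

```python
from collections import Counter


def summarize_samples(samples):
    """Count entities per label and collect suspicious-looking spans."""
    label_counts = Counter()
    suspicious = []
    for text, annotations in samples:
        for start, end, label in annotations["entities"]:
            label_counts[label] += 1
            surface = text[start:end]
            # empty spans or leading/trailing whitespace hint at offset problems
            if not surface or surface != surface.strip():
                suspicious.append((surface, label))
    return label_counts, suspicious


# Hand-made demo data; the second sample contains a deliberately bad span.
demo_samples = [
    ("Hans kam aus Graz.", {"entities": [(0, 4, "PER"), (13, 17, "LOC")]}),
    ("Anna blieb in Wien.", {"entities": [(0, 5, "PER")]}),
]
label_counts, suspicious = summarize_samples(demo_samples)
print(label_counts)
print(suspicious)
```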

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(samples, columns=['text', 'entities'])

In [None]:
df.info()

In [None]:
os.makedirs('output_csv', exist_ok=True)  # make sure the output directory exists
df.to_csv('output_csv/samples_out.csv', index=False)

In [None]:
samples = teis_to_traindata_sents(files, parent_node='.//tei:body//tei:p', ne_xpath=ne_xpath, NER_TAG_MAP=NER_TAG_MAP)

In [None]:
print(samples)

In [None]:
df = pd.DataFrame(samples)
df.info()
df.to_csv('output_csv/samples_out_sents.csv', index=False)
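One caveat when saving samples as CSV: nested annotations are written as their string representation, so they come back as plain strings when the file is read again. The sketch below shows the round trip with the standard library `csv` module and `ast.literal_eval`. It assumes samples of the shape `(text, {"entities": [...]})` and uses an in-memory buffer instead of a real file. With pandas, the same parsing step could be passed to `pd.read_csv` through its `converters` argument.

```python
import ast
import csv
import io

# Hand-made sample in the assumed (text, {"entities": [...]}) shape.
samples = [
    ("Hans kam aus Graz.", {"entities": [(0, 4, "PER"), (13, 17, "LOC")]}),
]

# Write: the dict in the second column is stored as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "entities"])
writer.writerows(samples)

# Read: parse the string back into Python objects.
buffer.seek(0)
restored = [
    (row["text"], ast.literal_eval(row["entities"]))
    for row in csv.DictReader(buffer)
]
print(restored == samples)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.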