## Wiener Diarium Sterbelisten
-----------------------------------------------------------------------------------------------

### Data

The digitised Wiener Zeitung is partially available on GitHub as TEI XML and as extracted HTML. Representative data is available from 1703 onwards. This data set contains only a few newspaper issues per year, starting from 1706.

GitHub repository: https://github.com/acdh-oeaw/sterbelisten

### Task

Extract toponyms from the Sterbelisten. This is a sequence labeling task: we first tokenize the text and then apply the rules (patterns and keywords) written by the team (Nina) to assign each token to one of our classes (Toponym or no Toponym).


##### What we have 
There is a list of successfully extracted toponyms, already in normalised orthography.
There is a list of rules that combine regular expressions with other expressions.
There is an HTML file that has \<mark> tags around most toponyms, but sometimes around person names and other names too.
There are defined classes (Toponym/no Toponym).
There are .csv files with toponyms extracted from the Wien Geschichte Wiki.

##### What we want
A training set and a test set, each containing three matrices:
T matrix containing the classes and the rules (np.matrix, sparse).
Z matrices, each containing the samples and the rules (np.matrix, sparse).
X matrix containing the samples (data frame).


##### What to do

**1. Read in the html file**
Strip the tags and clean the text with re
Expand abbreviations, if possible?
Create a list with each sentence as an item

**2. Read in already extracted place names**
Extract all \<mark> words from the HTML
using their code
Clean the place names
Create a set from the existing place names

**3. Building training data**
As there is no clean, manually annotated training data, it has to be built:

**X Matrix**
pandas df or list with all sentences

**T Matrix**

np.sparse matrix: one column for class 0, one column for class 1; rows are keywords

**Z Matrices**
numpy.ndarray
columns = keywords
rows = words in sentence
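
To make the intended shapes concrete, here is a minimal toy sketch of how a Z matrix and the T matrix could fit together. The tokens are taken from the sample output of this notebook; the two rules, the column order of T (column 0 = Toponym, column 1 = no Toponym, following the class order listed above) and the names `votes` and `weak_labels` are assumptions made only for illustration.

```python
import numpy as np

# Toy data (illustrative only): one sentence of four tokens, two place-name rules
tokens = ["im", "Barbieris", "Hauß", "alt"]
rules = ["Barbieris Hauß", "Laimgrueben"]

# Z: words x rules, 1 where a rule matched a word
Z = np.array([
    [0, 0],
    [1, 0],
    [1, 0],
    [0, 0],
])

# T: rules x classes; assumption: column 0 = Toponym, column 1 = no Toponym.
# Every rule here is a place name, so every rule points to the Toponym class.
T = np.array([
    [1, 0],
    [1, 0],
])

# Multiplying Z by T gives, per word, the number of rule votes for each class
votes = Z @ T

# A word is weakly labeled as a toponym if at least one rule voted for it
weak_labels = (votes[:, 0] > 0).astype(int)
print(list(zip(tokens, weak_labels.tolist())))
```

With these shapes, the number of columns of Z always has to equal the number of rows of T.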



**5. Building test data** 

Manually checked!
There is the file 'timemachine_evaluation_v1_edited_corrected.jsonl', which should contain manually corrected place names, but I don't understand where the actual word is tagged or how to extract that information.

**Imports**

In [None]:
import os
import re
import codecs
import pandas as pd
import numpy as np
from typing import List

### 1. Read in HTML

In [1]:
# Load the HTML file
with codecs.open('annotations_3-21_v4.html', "r", "utf-8") as file:
    sterbelisten_html = file.read()

# Clean the text in two steps: first remove the <h2>/<h3> heading lines,
# then remove the remaining HTML tags and stray characters.
# The <mark> tags are kept on purpose, they are needed for the labels later.
sterbelisten_strip_html_1 = re.sub(r"<h2>.*</h2><h3>.*xml \|.*\d+</h3><br/>", "", sterbelisten_html)

# The parentheses are escaped so that literal "(" and ")" are removed
sterbelisten_strip_html1 = re.sub(r"<hr/>|<p>|</p>|#+|=| =|<html>|</html>|\r|\b \.|\b# \.|\(|\)", "", sterbelisten_strip_html_1)

# Remove runs of four spaces that directly precede a word
sterbelisten_strip_html2 = re.sub(r"    \b", "", sterbelisten_strip_html1)

In [2]:
# Split the text at triple line breaks, creating one list element for each sentence
sterbeliste = sterbelisten_strip_html2.split("\n\n\n")
#print(sterbeliste)

## 2. Create labeled sentences and dictionary of place names


Open questions:

How to handle toponyms with more than 2 words? Most of them are not really toponyms but sentences where the closing \</mark> tag is missing in the HTML file.

One solution could be to exclude the 170 items from the toponym_set, or to check them manually.
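
The first option could be sketched roughly as follows. This assumes that the place names are available as whitespace-separated strings; the helper name `split_by_length`, the `max_words` threshold and the example entries are hypothetical and only illustrate the idea of separating short names from suspiciously long ones that need a manual check.

```python
from typing import Iterable, List, Tuple


def split_by_length(toponyms: Iterable[str], max_words: int = 2) -> Tuple[List[str], List[str]]:
    """Separate short place names from long candidates that need a manual check."""
    kept = []
    to_review = []
    for name in toponyms:
        # Count whitespace-separated words in the candidate
        if len(name.split()) <= max_words:
            kept.append(name)
        else:
            to_review.append(name)
    # Sort both lists so that the result does not depend on set ordering
    return sorted(kept), sorted(to_review)


# Illustrative entries only, not taken from the real toponym_set
toponym_examples = {
    "Laimgrueben",
    "Diener Gäßl",
    "Diener Gäßl sein Kind Frantz alt 6 viertl Jahr",
}
kept, to_review = split_by_length(toponym_examples, max_words=2)
print(kept)
print(to_review)
```

The long candidates end up in `to_review`, so they can either be dropped or corrected by hand before they are added back.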


In [109]:
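def get_location_id(location: List[str], locations: dict) -> int:
    # Use the whole (possibly multi-word) place name as the key, so that
    # an ID is always returned, also for names that were seen before
    key = " ".join(location)
    if key not in locations:
        locations[key] = len(locations)
    return locations[key]

locations = {}          # place name -> ID
samples = []            # one list of labels per sentence
clean_sterbeliste = []  # tokenized sentences without <start>/<end> markers
for sentense in sterbeliste[:3]:
    labels = []
    # Surround the markers with spaces so that they always become separate tokens
    sentense = re.sub(r"<mark>", " <start> ", sentense)
    sentense = re.sub(r"</mark>", " <end> ", sentense)
    sentense = re.sub(r"\(|\)|/", "", sentense)
    sentense = sentense.split(" ")
    sentense = list(filter(None, sentense))
    if_loc = False
    location = []
    #print(f"Sentence: {sentense}")
    for word in sentense:
        if word == '<start>':
            # Words left over from an unclosed <mark> stay unlabeled
            labels += [None] * len(location)
            if_loc = True
            location = []
        elif word == '<end>':
            if location:
                # All words of one place name share the same ID
                labels += [get_location_id(location, locations)] * len(location)
            if_loc = False
            location = []
        else:
            if if_loc:
                location.append(word)
            else:
                labels.append(None)
    # Closing </mark> tag is missing: keep the collected words unlabeled
    labels += [None] * len(location)
    samples.append(labels)
    sentense = [word for word in sentense if word != "<start>" and word != "<end>"]
    clean_sterbeliste.append(sentense)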

print(f"Length of the sentence: {len(sentense)}")
print(f"Sample_tagging:{samples[0]}")
print(f"Number of place names: {len(locations)}")
print(f"=======================================")
assert len(sentense) == len(labels)
print(f"Sentence and labels: {[(word, label) for (word, label) in zip(sentense, labels)]}")
print(f"Cleaned sentence: {clean_sterbeliste[0]}")

Length of the sentence: 13
Sample_tagging:[None, None, None, None, None, None, 0, 0, None, 1, 1, None, None, None, None, None, None, None]
Number of place names: 6
Sentence and labels: [('Christina', None), ('Kochin', None), ('ein', None), ('Wittib', None), ('im', None), ('Barbieris', 4), ('Hauß', 4), ('auf', None), ('der', None), ('Laimgrueben', 5), ('alt', None), ('67', None), ('Jahr', None)]
Cleaned sentence: ['\nDem', 'Peter', 'Frost', 'einem', 'Cammer', 'im', 'Greiseckeris', 'Hauß', 'im', 'Diener', 'Gäßl', 'sein', 'Kind', 'Frantz', 'alt', '6', 'viertl', 'Jahr']


## 3.Building training data

In [118]:
# X matrix
# One row per sentence and a single column "sample" holding the tokens of that sentence
X = pd.DataFrame({"sample": clean_sterbeliste})

**T Matrix**

np.sparse matrix: one column for class 0, one column for class 1; rows are keywords


In [117]:
# T matrix
# Dimensions: keywords x 2
import numpy as np
a = np.zeros(len(locations))
b = np.ones(len(locations))

# Stack along axis 1 so that T itself has one row per keyword and one column per class
T = np.stack((b, a), axis=1)
T.shape


(6, 2)

**Z Matrices**
Each sentence is going to be its own Z matrix.
columns = place names
rows = words in sentence

In [119]:
# Build the Z matrices by looping through all samples.
# Each label in a sample is compared with every location ID: if they match,
# the list "matched_loc" gets a 1, otherwise a 0.
# The list is then turned into a numpy array and reshaped to
# number of words x number of place names.
locations_list = list(range(len(locations)))


for sample in samples:
    matched_loc = [1 if label == loc_id else 0 for label in sample for loc_id in locations_list]
    Z = np.array(matched_loc)
    Z = np.reshape(Z, (len(sample), len(locations_list)))
# Note: Z is overwritten in every iteration, so only the last matrix survives.
# I haven't found a way to save them with their index number!
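
One possible way to keep every Z matrix together with its index is to collect them in a dictionary keyed by the position of the sample. The following is only a sketch under that assumption; the function name `build_z_matrices` and the toy samples are hypothetical, and the labels are expected to be either `None` or an integer location ID, as in the `samples` list above.

```python
import numpy as np


def build_z_matrices(samples, n_locations):
    """Return a dict mapping the sample index to its Z matrix (words x place names)."""
    z_by_index = {}
    for idx, sample in enumerate(samples):
        Z = np.zeros((len(sample), n_locations), dtype=int)
        for row, label in enumerate(sample):
            # A label is either None (no place name) or a location ID
            if label is not None:
                Z[row, label] = 1
        z_by_index[idx] = Z
    return z_by_index


# Toy samples (illustrative only): two sentences, two known place names
toy_samples = [[None, 0, 0, None], [1, None]]
z = build_z_matrices(toy_samples, n_locations=2)
print(z[0].shape, z[1].shape)
```

A list would work just as well as a dictionary here; the dictionary only makes the link between sample index and matrix explicit.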

In [120]:
"""create a tokenized sentenses with spacy"""
#import spacy 
#from spacy.lang.de.examples import sentences

#nlp = spacy.load("de_core_news_sm")

#for i in sterbeliste:
    #doc = nlp(i)
    #print(doc.text)
    #for token in doc:
        #print(token.text, token.pos_, token.dep_)

'create a tokenized sentenses with spacy'

## 5. Building test data

In [None]:
# Load the manually checked test data
import jsonlines as jsl

with jsl.open('timemachine_evaluation_v1_edited_corrected.jsonl', 'r') as reader:  # read in jsonl with jsonlines
    df = pd.json_normalize(list(reader))

### I don't understand where the tagged toponym is or whether there is any
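
A first step towards answering this could be to look at which keys the records in the file actually contain. The sketch below makes no assumption about the field names; it only counts the top-level keys of the first records using the standard library. The helper name `summarize_jsonl_keys` and the `max_records` parameter are hypothetical.

```python
import json
from collections import Counter


def summarize_jsonl_keys(path, max_records=100):
    """Count the top-level keys found in the first lines of a JSONL file."""
    key_counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            # Stop after max_records lines (blank lines count towards the limit)
            if line_no >= max_records:
                break
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return key_counts


# Usage, assuming the file is in the working directory:
# print(summarize_jsonl_keys('timemachine_evaluation_v1_edited_corrected.jsonl'))
```

If one of the keys holds character offsets or a list of annotations, that is likely where the tagged words are stored, but this has to be checked against the actual file.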