## Wiener Diarium Sterbelisten: Data Preprocessing for Sequence Labeling
-----------------------------------------------------------------------------------------------

### Data

The Wiener Zeitung is the oldest newspaper in the world that is still being published. It was founded in 1703 as the Wiener Diarium and was renamed Wiener Zeitung in 1780. Through its central role in the Habsburg monarchy, it distributed knowledge across all disciplines throughout Europe, which also helped to give Vienna a more prominent position.


The Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences (ACDH-CH) started a project to digitise this historically important material and make it available. The digitised newspapers and current developments can be found here:
https://digitarium.acdh.oeaw.ac.at/willkommen/

https://www.oeaw.ac.at/ihb/forschungsbereiche/kunstgeschichte/forschung/habsburgische-repraesentation/das-wiennerische-diarium


This valuable resource is useful for multiple disciplines, as it holds historical data on science, politics, culture and other areas. It is interesting for linguistic research because the normalisation of German orthography only started at the end of the 18th century and only came fully into force in the late 19th century.

For this task we use data prepared by the research group of the ACDH-CH. The digitised Wiener Zeitung is partially available on GitHub as TEI XML and as extracted HTML. This data set contains only a few newspaper issues per year, starting from 1706.

The GitHub repository of the ACDH-CH research group shows their current work process as well as the original data used for this tutorial: https://github.com/acdh-oeaw/sterbelisten



### Task

In this tutorial we prepare the data provided by the ACDH-CH research group so that sequence labeling can be applied to it and the data fits into the Knodle framework.

Our goal is to automatically find the place names in the obituaries. These lists appear in every newspaper issue and contain several death notices. Usually they contain the name, age, cause and place of death. By extracting the place names automatically, one can gain historical knowledge about the city of Vienna and its development, as well as about orthographic changes and thus historical linguistics.

For this sequence labeling task we need to structure the data in a way that is suitable for a weakly supervised machine learning approach. We will read the .html file and remove all additional characters, as illustrated later in this tutorial. We will also tokenise the words and translate the provided \<mark> tags into a matrix format that encodes the location of each toponym within the sentence. Our rules are the tagged place names, which result from the work of the ACDH-CH team. We want to train a model that can detect place names using this data. As the data is extremely difficult to organise because of the large spelling variation, our goal is to try a machine learning approach rather than a rule-based one.


##### What we have: provided by the research group of the ACDH-CH
The research group of the ACDH-CH takes a rule-based approach to finding place names. They create patterns to identify candidate place names, compare the candidates to historical dictionaries, and collect the place names in order to link them to other sources.

For example, regular expressions containing prepositions and other words that occur in the obituaries (Sterbelisten) are used as patterns. The matched words are then extracted and sent to DTA-CAB (https://www.deutschestextarchiv.de/doku/software), an online tool that resolves historical spelling variants by comparing them to historical dictionaries and using distance measures such as the Levenshtein distance.
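
The pattern-plus-distance idea can be illustrated with a minimal sketch. The regular expression `CANDIDATE` and the helper `levenshtein` below are assumptions made up for illustration; they are neither the ACDH-CH group's actual rules nor the DTA-CAB implementation.

```python
import re

# Hypothetical pattern: a preposition (optionally with an article) followed by
# one or two capitalised tokens, e.g. "im Diener Gäßl".
CANDIDATE = re.compile(r"\b(?:im|in der|auf der)\s+((?:[A-ZÄÖÜ]\w+\s?){1,2})")


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


sentence = "einem Cammer im Diener Gäßl sein Kind Frantz"
candidates = [m.group(1).strip() for m in CANDIDATE.finditer(sentence)]
print(candidates)                   # ['Diener Gäßl']
print(levenshtein("Hauß", "Haus"))  # 1
```

A small distance between a historical spelling and a dictionary entry suggests that both refer to the same word, which is the intuition behind comparing the matched words to historical dictionaries.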


- A .html file that has \<mark> tags on most toponyms, but sometimes also on persons and other names.

- A .xml file containing the Sterbelisten.

- A .csv file with toponyms extracted from the Wien Geschichte Wiki.

- A .txt file containing a list of successfully extracted toponyms, which are already normalised to contemporary orthography by comparing the results with the Wien Geschichte Wiki (https://www.geschichtewiki.wien.gv.at/Wien_Geschichte_Wiki).

- Some notebooks and other files.

As the ACDH-CH research group is actively working on their project, the GitHub repository is constantly changing. For our tutorial we will only use the .html file, as it contains the marked place names we need to build our training data.

##### What we want: Knodle-compatible input
###### A training and test set for our sequence labeling task containing three matrices:
- T matrix containing the classes and the rules (NumPy array with the dimensions number of place names x 2).
- Z matrices, one for each sentence, each with one row per word in the sentence and one column per rule (= place name) (NumPy array, sparse).
- X matrix containing the samples (pandas data frame).
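
How the T and Z matrices relate can be shown on a toy example. The sketch below is only an illustration under assumptions: the tokens, the rule IDs and the number of rules are made up, and T is read as a rule-to-class mapping in which every place-name rule votes for the first column.

```python
import numpy as np

# Toy sentence with one tagged place name ("Diener Gäßl"); all values are made up.
tokens = ["im", "Diener", "Gäßl", "sein", "Kind"]
labels = [None, 1, 1, None, None]   # rule ID per token, None = no rule matched
n_rules = 3                         # pretend that three place names are known

# T: one row per rule, one column per class; here every rule votes for column 0.
T = np.zeros((n_rules, 2), dtype=int)
T[:, 0] = 1

# Z: one row per token, one column per rule; 1 where the rule fired on that token.
Z = np.zeros((len(tokens), n_rules), dtype=int)
for row, rule_id in enumerate(labels):
    if rule_id is not None:
        Z[row, rule_id] = 1

# Z @ T counts, for each token, how many matching rules vote for each class.
votes = Z @ T
print(votes[:, 0].tolist())   # [0, 1, 1, 0, 0]
```

The product shows why the two matrices are kept separate: Z records which rule matched where, and T translates rule matches into class votes.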


##### Pipeline

**1. Read in the html file**

- Strip off the tags and clean the text with regular expressions.

- Create a list with each sentence as a sample.

We decided to keep the abbreviations as the expansion would already involve normalisation and we want to try to work with the 'raw' data.


**2. Read in already extracted placenames**
- Extract all \<mark> words from the .html file provided by the ACDH-CH research group and assign an ID.
- Loop through all samples and create lists containing either None if the word is not a place name or the ID of the place name if the word is part of a place name.


**3. Building training data**

**X Matrix**
pandas data frame or list with all cleaned sentences.

**T Matrix**

NumPy matrix: one column 0, one column 1; the rows are the place names.

**Z Matrices**
numpy.ndarray
columns = keywords
rows = words in sentence


**5. Building test data** 

manually checked! 
There is the file 'timemachine_evaluation_v1_edited_corrected.jsonl', which should contain manually corrected place names, but it is not usable for our purposes.

**Imports**

In [67]:
import os
import re
import codecs
import pandas as pd
import numpy as np
import jsonlines as jsnl
from typing import List
from tqdm import tqdm
from joblib import dump

from minio import Minio
client = Minio("knodle.cc", secure=False)

In [3]:
# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/wiener_diarum_toponyms"
os.makedirs(data_path, exist_ok=True)
os.path.join(data_path)

'../../../data_from_minio/wiener_diarum_toponyms'

In [86]:
files = [
    "annotations_3-21_v4.html","timemachine_evaluation_v1_edited_corrected.jsonl"
]

# download each file from the "datasets/wiener_diarum_toponyms" folder of the bucket
for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        # object names always use forward slashes, so the name is built without os.path.join
        object_name=f"datasets/wiener_diarum_toponyms/{file}",
        file_path=os.path.join(data_path, file))

### 1. Read in .html 

In [78]:
# Load the html file that was downloaded into data_path

html_path = os.path.join(data_path, "annotations_3-21_v4.html")
with codecs.open(html_path, "r", "utf-8") as html_file:
    sterbelisten_html = html_file.read()

The .html file contains many tags and much information that is redundant and not needed for our NLP task. One example:

``` 
<h2>Lista aller Verstorbenen in und vor der Stadt .  | Den 3 . Februarii 1706 .  | Den 4 . dito .  | Den 5 . Dito . </h2><h3>/db/apps/edoc/data/170x/1706/02/1706-02-03.xml | i51</h3><br/>\r\n    <p>Dem Peter# Frost / einem Cammer im <mark>Greiseckeris  Hauß</mark> im <mark>Diener  Gäßl</mark> / sein Kind Frantz / alt 6 . viertl Jahr .####################### </p>\r\n<hr/>\r\n<h2>Lista aller Verstorbenen in und vor der Stadt .  | Den 3 . Februarii 1706 .  | Den 4 . dito .  | Den 5 . Dito . </h2><h3>/db/apps/edoc/data/170x/1706/02/1706-02-03.xml | i52</h3><br/>\r\n    <p>Der Maria# Nauitschanin / einer Burgerl . Wittib im <mark>Primis  Hauß</mark> auf der <mark>Wen=delstadt</mark> / ihr Kind Carl / alt 5 . Jahr .########## </p>\r\n<hr/>\r\n

```
We want to clean this document in order to obtain the actual sentence, like here: 

```
\ Dem Peter Frost / einem Cammer im <mark>Greiseckeris  Hauß</mark> im <mark>Diener  Gäßl</mark> / sein Kind Frantz / alt 6 viertl Jahr
```
Note that we don't want to lose the \<mark> and \</mark> tags, as we will need them later to collect all place names. The Wiener Diarium also contains many / characters, some of which mark a line break in the sentence or a noun or verb that is separated into two words.

We clean the expression using regex as follows:

In [27]:
# Clean the text in two steps: first remove the issue header, then the remaining tags and special characters

# remove the title of the issue and the xml path by replacing the whole header with ""
sterbelisten_strip_html_1 = re.sub(r"<h2>.*</h2><h3>.*xml \|.*\d+</h3><br/>","",sterbelisten_html)

# remove the tags around the sentences and the special characters within the words;
# <mark> and </mark> are deliberately kept because they are needed later
sterbelisten_strip_html1 = re.sub(r"<hr/>|<p>|</p>|#+|=| =|<html>|</html>|\r|\b \.|\b# \.|\(|\)","",sterbelisten_strip_html_1)

# remove the four-space indentation in front of each sentence
sterbelisten_strip_html2 = re.sub(r"    \b","",sterbelisten_strip_html1)
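
To see what these substitutions do, they can be tried on a single record. The sketch below wraps a trimmed copy of the patterns in a hypothetical helper `clean_record` and applies it to a shortened version of the example record; both the helper name and the shortened record are assumptions made for illustration.

```python
import re


def clean_record(raw: str) -> str:
    """Remove the issue header, structural tags and filler characters from one record."""
    # drop the <h2> title and the <h3> xml path in front of the sentence
    text = re.sub(r"<h2>.*</h2><h3>.*xml \|.*\d+</h3><br/>", "", raw)
    # drop structural tags, '#' runs, '=' hyphenation marks and isolated ' .'
    text = re.sub(r"<hr/>|<p>|</p>|#+|=| =|\r|\b \.", "", text)
    # drop the four-space indentation in front of the sentence
    text = re.sub(r"    \b", "", text)
    return text.strip()


record = (
    "<h2>Lista aller Verstorbenen in und vor der Stadt . </h2>"
    "<h3>/db/apps/edoc/data/170x/1706/02/1706-02-03.xml | i51</h3><br/>\r\n"
    "    <p>Dem Peter# Frost / einem Cammer im <mark>Diener  Gäßl</mark>"
    " / sein Kind Frantz / alt 6 . viertl Jahr .### </p>\r\n<hr/>\r\n"
)
cleaned = clean_record(record)
print(cleaned)
```

The \<mark> tags survive the cleaning, while the header, the `#` runs and the isolated full stops are gone.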

In [28]:
# split the text at the triple line breaks, creating one list element for each sentence

sterbeliste = sterbelisten_strip_html2.split("\n\n\n")

## 2. Create labeled sentences and a dictionary of place names

Open questions:

This code works on the assumption that both \<mark> and \</mark> tags are present.


In [38]:
print(f"Number of <mark> tags: {len(re.findall('<mark>',sterbelisten_html))}")
print(f"Number of </mark> tags: {len(re.findall('</mark>',sterbelisten_html))}")
print(f"Number of unclosed tags: {len(re.findall('<mark>',sterbelisten_html))-len(re.findall('</mark>',sterbelisten_html))}")


# Keep only the sentences that contain at least one closing </mark> tag
sterbeliste_clean= [sentence for sentence in sterbeliste if re.search("/mark", sentence)]
print(f"Number of sentences in the original data set: {len(sterbeliste)}")
print(f"Number of sentences with usable <mark> tags in our new data set: {len(sterbeliste_clean)}")


Number of <mark> tags: 14870
Number of </mark> tags: 11646
Number of unclosed tags: 3224
Number of sentences in the original data set: 13163
Number of sentences with usable <mark> tags in our new data set: 7412


The file contains 14870 \<mark> but only 11646 \</mark> tags, which can be explained by errors in the preprocessing.
We handle the 3224 unclosed tags by keeping only the sentences that contain a closing tag; alternatively, the affected sentences could be checked manually and a cleaned html file used instead. Note that this filter does not guarantee that every remaining sentence is balanced, since a sentence may contain a closed tag as well as an unclosed one.
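
A stricter filter would check that the tags of each sentence are properly paired. The following is a minimal sketch of such a check; the helper `has_balanced_marks` and the example sentences are hypothetical, and applying it to the real data would change the counts reported above.

```python
import re


def has_balanced_marks(sentence: str) -> bool:
    """True if the sentence has at least one <mark> and every tag is properly closed."""
    depth = 0
    seen = False
    for tag in re.findall(r"</?mark>", sentence):
        if tag == "<mark>":
            depth += 1
            seen = True
        else:
            depth -= 1
        if depth not in (0, 1):   # nested opening tag, or closing tag without an opening one
            return False
    return seen and depth == 0


examples = [
    "im <mark>Diener Gäßl</mark> sein Kind",                  # balanced
    "im <mark>Primis Hauß auf der <mark>Wendelstadt</mark>",  # first tag never closed
    "sein Kind Carl alt 5 Jahr",                              # no tags at all
]
print([has_balanced_marks(s) for s in examples])   # [True, False, False]
```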

In [87]:
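from typing import Dict, List, Optional

def get_location_id(location: List[str], locations: Dict[str, int]) -> Optional[int]:
    '''
    this function looks up the ID of a place name in the dictionary of place names,
    if it doesn't find the place name it adds it with a new ID

    note: the place name is keyed by its first word only, so multi-word
    place names that share their first word receive the same ID

    input: the list of words that form a place name, the dictionary of place names
    output: the ID of the place name (None if the list of words is empty)
    '''
    if not location:
        return None
    key = location[0]
    if key not in locations:
        locations[key] = len(locations)
    return locations[key]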

# we create a dictionary locations that contains the place name as a key and the ID as a value

locations={}

# we collect our tagged sentences in samples, these contain None for each word that is no place name and the corresponding ID to each word that is part of a place name 
samples = []

# we collect a list with each sentence that is a list containing each word as its elements
clean_sterbeliste=[]


# we loop through each sentence
for sentence in sterbeliste_clean:
    labels=[]
    sentence = re.sub(r"<mark>","<start> ",sentence)
    sentence = re.sub(r"</mark>"," <end>",sentence)
    sentence = re.sub(r"\(|\)|/","",sentence)
    sentence = sentence.split(" ")
    sentence = list(filter(None,sentence))
    if_loc= False
    for word in sentence:
        if word=='<start>':
            if_loc= True
            location = []
        elif word=='<end>':
            labels +=[get_location_id(location, locations)] * len(location)
            if_loc= False
        else:
            if if_loc:
                location.append(word)
            else:
                labels.append(None)
    samples.append(labels)
    sentence= [word for word in sentence if word !="<start>" if word!="<end>"]
    clean_sterbeliste.append(sentence)

# len(sentence) and the assert refer to the last sentence processed in the loop
print(f"Length of the sentence: {len(sentence)}")
print(f"Sample_tagging:{samples[0]}")
print(f"Number of place names: {len(locations)}")
print("=======================================")
assert len(sentence) == len(labels)
#print(f"Sentence and labels: {[(word, label) for (word, label) in zip(sentence, labels)]}")
print(f"Cleaned Sentence: {clean_sterbeliste[0]}")

Length of the sentence: 12
Sample_tagging:[None, None, None, None, None, None, 0, 0, None, 1, 1, None, None, None, None, None, None, None]
Number of place names: 2911
Cleaned Sentence: ['\nDem', 'Peter', 'Frost', 'einem', 'Cammer', 'im', 'Greiseckeris', 'Hauß', 'im', 'Diener', 'Gäßl', 'sein', 'Kind', 'Frantz', 'alt', '6', 'viertl', 'Jahr']


## 3. Building training data

In [88]:
# saving the training data, leaving 100 samples as test data

d = {"sample": clean_sterbeliste, "labels":samples}

df = pd.DataFrame(d)

df_train = df[100:]
df_train.to_csv(os.path.join(data_path, 'df_train.csv').replace('\\','/'))


In [89]:
'''
X Matrix

Dimensions: # sentences x length of the longest sentence
(one row per sentence, one column per word position; shorter sentences are padded with missing values)

'''
x_matrix = pd.DataFrame(clean_sterbeliste)




**T Matrix**

NumPy matrix: one column 0, one column 1; the rows are the keywords.


In [90]:
'''
T Matrix
Dimensions: # place names x 2

'''
a= np.zeros(len(locations))
b= np.full(len(locations),1)

# stack the two vectors as columns so that t_matrix itself has the documented shape
t_matrix = np.stack((b,a), axis=1)
t_matrix.shape


(2911, 2)

**Z Matrices**
Each sentence is going to be its own Z Matrix.
columns = place names
rows = words in sentence

In [91]:
'''
Z Matrices

In the next cell, we build the Z Matrices by looping through all samples.
Each label of a sample is compared with every location ID: if they are equal,
the list "matched_loc" gets an entry 1, otherwise a 0 is added.
Then the list is transformed into a numpy array and reshaped:

Dimensions
number of words x number of place names

'''
locations_list = list(range(0,(len(locations))))

## create a list and turn to np.array 3d 

collected_z_matrices=[]

for sample in samples:
    matched_loc= [1 if i == j else 0 for i in sample for j in locations_list]
    i=len(sample)
    Z = np.array(matched_loc)
    Z = np.reshape(Z, (i,len(locations_list)))
    collected_z_matrices.append(Z)

print(Z.shape)
    
#padding_length= max([len(i) for i in samples])

#z_matrices= np.dstack(collected_z_matrices)

(12, 2911)
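
The nested comparison above touches every combination of word and place name. The same matrix can also be filled directly at the matching positions, as in the following sketch; the helper `build_z` and the toy sample are hypothetical and only illustrate the idea.

```python
import numpy as np
from typing import List, Optional


def build_z(sample: List[Optional[int]], n_rules: int) -> np.ndarray:
    """Return a (words x rules) 0/1 matrix for one labelled sentence."""
    z = np.zeros((len(sample), n_rules), dtype=np.int8)
    # positions of the words that carry a place-name ID, and the IDs themselves
    rows = np.array([row for row, rule_id in enumerate(sample) if rule_id is not None], dtype=int)
    cols = np.array([rule_id for rule_id in sample if rule_id is not None], dtype=int)
    z[rows, cols] = 1
    return z


toy_sample = [None, 0, 0, None, 2]
z = build_z(toy_sample, n_rules=3)
print(z.shape)                  # (5, 3)
print(z.sum(axis=0).tolist())   # [2, 0, 1]
```

Because only the tagged words are visited, the cost per sentence no longer grows with the number of place names, apart from allocating the zero matrix.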


In [92]:
# saving the z_matrices

'''
dump(train_rule_matches_sparse_z, os.path.join(data_path, "train_rule_matches_z.lib"))
dump(test_rule_matches_sparse_z, os.path.join(data_path, "test_rule_matches_z.lib"))

'''

# saving the t_matrix

dump(t_matrix,  os.path.join(data_path, "mapping_rules_labels_t.lib").replace('\\','/'))

['../../../data_from_minio/wiener_diarum_toponyms/mapping_rules_labels_t.lib']

## 5. Building test data

The provided test data has information about the character position of each place name.
These positions also count tags, special characters and other non-word characters.
For our use, this test data cannot be used, as we want to train our model to look for place names at the word level within a sentence and not at the character level.
Instead, we take a part of our data, manually check each sentence and its tagging, and use it as our test data.

In [93]:
# Building test data, the first 100 samples have been manually checked

df_test = df[:100]

# Saving them

df_test.to_csv(os.path.join(data_path, 'df_test.csv').replace('\\','/'))
