## Wiener Diarum Sterbelisten: Data Preprocessing for Sequence Labeling
-----------------------------------------------------------------------------------------------

### Data

The Wiener Zeitung is the oldest running newspaper in the world. It started in 1703 as Wiener Diarum and changed in 1780 to Wiener Zeitung. Through its central role in the Habsburg monarchy it distributed knowledge across all disciplines throughout Europe. This also helped to put Vienna into a more prominent position.


The Austrian Centre for Digital Humanities and Cultural Heritage of the Austrian Academy of Sciences (ACDH-CH) started a project with the goal of digitalising and providing this historically important data. The digitised newspapers as well as current developements can be found here: 
https://digitarium.acdh.oeaw.ac.at/willkommen/

https://www.oeaw.ac.at/ihb/forschungsbereiche/kunstgeschichte/forschung/habsburgische-repraesentation/das-wiennerische-diarium


This valuable resource is useful for multiple disciplines as it holds historical data on science, politics, cultural and other sections. It is interesting for linguistic research as the normalisation of german orthography only started at the end of the 18th century and was only fully set in force in the late 19th century.  

For this task there is data prepared by the research group of the ACDH-CH. The digitalised Wiener Zeitung is partially available as TEI XML as well as an extracted HTML on Github. This data set contains only a few newspapers per year starting from 1706.

The link to the Github Repository of the ACDH-CH research group shows their current work process as well as the original data used for this tutorial: https://github.com/acdh-oeaw/sterbelisten



### Task

In this tutorial we want to prepare the data provided by the research group of the ACDH-CH in order to apply sequence labeling to it and fit the data into the Knodle framework.

Our goal is to automatically find the place names in the obituaries. These lists are in every newspaper issue and contain several notes about the death of people. Usually they contain the name, age, reason and place of death. By filtering out the place names automatically, one can extract historical knowledge about the city of Vienna and its developement as well as orthographical changes and therewith historical linguistical knowledge.

For this sequence labelling task we need to build the data in a way that is suitable for a weakly supervised machine learning approach. We will need to read the .html file and clean all the additional characters as illustrated later on in this tutorial. We will also tokenise the words given and translate the provided \<mark> tags into a matrix format that provides the location of the toponym within the sentence. Our rules are the tagged place names, which are a result of the work made by the team of the ACDH-CH. We want to train a model that can detect place names by using the data we are provided with by the research group of the ACDH-CH. As this data is extremly difficult to organise due to the big spelling variation, it is our goal to try a machine learning approach rather than a rule-based approach.


##### What we have: provided by the research group of the ACDH-CH
The approach of the research group of the ACDH-CH is a rule-based approach on finding place names.  By creating patterns with which they would identify place names and compare them to historical dictionaries, they gather place names in order to link them to other sources. 

For example regular expressions containing prepositons and other words that would occur in those obituaries (Sterbelisten) are used as patterns. The matched words are then extracted and sent to DTA-CAB (https://www.deutschestextarchiv.de/doku/software); another online tool that compares historically varying spelling by comparing it to historical dictionaries and using sound-distance measures like the Levenshtein distance measure. 


- A .html file that has \<mark> on most toponyms but sometimes on persons and other names too.

- A .xml file containing the sterbelisten.

- A .csv files with Toponyms extracted from Wien Geschichte Wiki.

- A .txt file containing a list with sucessfully extracted toponyms, which are already normalised into contemporary orthography by comparing the results with the Wien Geschichte Wiki(https://www.geschichtewiki.wien.gv.at/Wien_Geschichte_Wiki).

- Some notebooks and other files.

As the ACDH-CH research group is actively working on their project, the Github repository is constantly changing and developing. For our Tutorial we will only use the .html file as it contains the marked place names we need to build our training data.

##### What we want: Knodle-compatible input
###### A training and test set for our sequence labeling task containing three matrices:
- T Matrix containing the classes and the rules (panda data frame with the dimensions 2xnumber of place names).
- Z Matrices one for each sentence each containing the words in the sentence and all our rules (=place names)(np.matrix, sparse).
- X Matrix containing the samples (pandas data frame).


##### Pipeline

**1. Read in the html file**

- Strip of tags and clean the text with regular expressions.

- Create a list with each sentence as a sample.

We decided to keep the abbreviations as the expansion would already involve normalisation and we want to try to work with the 'raw' data.


**2. Read in already extracted placenames**
- Extract all \<mark> words from the .html file provided by the ACDH-CH research group and assign an ID.
- Loop through all samples and create lists containing either None if the word is not a place name or the ID of the place name if the word is part of a place name.


**3. Building training data**

**X Matrix:**
pandas data frame or list with all  cleaned sentences.

**T Matrix**
np.sparse matrix, one column 0, one column 1, rows are place names.

**Z Matrices**
numpy.narray
columns= keywords 
rows = words in sentence


**5. Building test data** 

manually checked! 
There is the 'timemachine_evaluatuin_v1_edited_corrected.jsonl which should contain manually corrected place names but it is not usable for our purposes.

##### Please check requirements are in the requirements.txt 

The code in this tutorial is also available as a data_preprocessing_wiener_diarum_toponym.py.

This folder contains: 
- data_preprocessing_sterbelisten_wiener_diarum.ipn
- data_preprocessing_wiener_diarum_toponym.py
- requirements.ipynb



**Imports**

In [1]:
import os
import re

import pandas as pd
import numpy as np
from typing import List, Tuple, Dict

from tqdm import tqdm
from joblib import dump
from minio import Minio
client = Minio("knodle.cc", secure=False)

In [2]:
# define the path to the folder where the data will be stored
data_path = "../../../data_from_minio/wiener_diarum_toponyms"
os.makedirs(data_path, exist_ok=True)
os.path.join(data_path)

'../../../data_from_minio/wiener_diarum_toponyms'

In [3]:
files = ["annotations_3-21_v4.html"]


for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=os.path.join("datasets/wiener_diarum_toponyms/",file),
        file_path=os.path.join(data_path, file[-1]))

100%|████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it]


### 1. Read in .html 

In [4]:
# Load html file

with open(os.path.join(data_path, file[-1]), 'r', encoding = "utf-8" ) as f:
    sterbelisten_html = f.read()

The .html file contains many tags and information that is redundant and not needed for our nlp task, one example : 

``` 
<h2>Lista aller Verstorbenen in und vor der Stadt .  | Den 3 . Februarii 1706 .  | Den 4 . dito .  | Den 5 . Dito . </h2><h3>/db/apps/edoc/data/170x/1706/02/1706-02-03.xml | i51</h3><br/>\r\n    <p>Dem Peter# Frost / einem Cammer im <mark>Greiseckeris  Hauß</mark> im <mark>Diener  Gäßl</mark> / sein Kind Frantz / alt 6 . viertl Jahr .####################### </p>\r\n<hr/>\r\n<h2>Lista aller Verstorbenen in und vor der Stadt .  | Den 3 . Februarii 1706 .  | Den 4 . dito .  | Den 5 . Dito . </h2><h3>/db/apps/edoc/data/170x/1706/02/1706-02-03.xml | i52</h3><br/>\r\n    <p>Der Maria# Nauitschanin / einer Burgerl . Wittib im <mark>Primis  Hauß</mark> auf der <mark>Wen=delstadt</mark> / ihr Kind Carl / alt 5 . Jahr .########## </p>\r\n<hr/>\r\n

```
We want to clean this document in order to obtain the actual sentence, like here: 

```
\ Dem Peter Frost / einem Cammer im <mark>Greiseckeris  Hauß</mark> im <mark>Diener  Gäßl</mark> / sein Kind Frantz / alt 6 viertl Jahr
```
Note that we don't want to lose the \<mark> and \<\mark> tags as we will need them later to collect all place names. The Wiener Diarum also contains a lot of / partially as markers to signal that there is a line break in the sentence or a noun or verb that is seperated into two words. 

We clean the expression using regex as follows:

In [5]:
# Cleaning the text first from both html tags and xml tags and then in a second step clean the intro and replace with /n

# get rid of the title of the issue and the xml tags by just replacing the whole intro with ""
sterbelisten_strip_html_1 = re.sub(r"<h2>.*</h2><h3>.*xml \|.*\d+</h3><br/>","",sterbelisten_html)

# now working on all the tags around the sentences and weird characters within the words
sterbelisten_strip_html1 = re.sub(r"<hr/>|<p>|</p>|#+|=| =|<html>|</html>|\r|\b \.|\b# \.|(|)","",sterbelisten_strip_html_1)#<mark>|</mark>

sterbelisten_strip_html2 = re.sub(r"    \b","",sterbelisten_strip_html1)

In [6]:
# create an empty list and split the text at line break, creating an element for each sentence

sterbeliste = []
sterbeliste = sterbelisten_strip_html2.split("\n\n\n")
len(sterbeliste)

13163

## 2.Create labeled sentences and a dictionary of place names

Open questions:

This code works on the assumption that both \<mark> and \<\mark> tags are present.


In [7]:
print(f"Number of <mark> tags: {len(re.findall('<mark>',sterbelisten_html))}")
print(f"Number of </mark> tags: {len(re.findall('</mark>',sterbelisten_html))}")
print(f"Number of unclosed tags: {len(re.findall('<mark>',sterbelisten_html))-len(re.findall('</mark>',sterbelisten_html))}")


Number of <mark> tags: 14870
Number of </mark> tags: 11646
Number of unclosed tags: 3224


However, there are 14870 \<mark> and 11646 \<\mark> tags, what can be explained by some errors in preprocessing. 
We solve this by exluding the 3224 items and their samples from our list, or check them manually and use a cleaned html file. 

In [14]:
def get_location_id(curr_loc: List, all_loc: Dict) -> Tuple[int,Dict]:
    '''
    Descr: This function loops through the list of place names, 
    if it doesn't find a given place name it adds the place name to the list
    and returns the new list 
    
    Args: the list or single word that is part of a place name
    Returns: list of place names 
    '''
    for word in curr_loc:
        if word not in all_loc:
            all_loc[word]= len(all_loc)
        return all_loc[word]

    
    
def preprocess_sentence(sentence):
    '''
    This function replaces the <mark> tag with <start> and the </mark> tag with <end> in order
    to clean the remaining / out of the senctenses. It then splits the sentence into seperate words
    and returns a list with all words in the senctense.
    
    Args: 
        "sentece" is the sentence as a string containing <mark> and </mark> and several "/".
    Returns: 
        A list with all words in the sentence split by " " cleaned from "/" and 
        containing <start> and <end> tag instead of <mark>
    '''
    sentence = re.sub(r"<mark>","<start> ",sentence)
    sentence = re.sub(r"</mark>"," <end>",sentence)
    sentence = re.sub(r"\(|\)|/","",sentence)
    sentence = sentence.split(" ")
    sentence = list(filter(None,sentence))
    return sentence


def create_labels(sterbeliste: List):
    '''
    Loops through the sentences and creates a list with Labels for each word in a sentence.
    It also creates a dictionary with an ID for each place name
    and collects all cleaned and preprocessed sentences as a list of lists.
    
    Args: 
        sentence= a sentence as a string
        current_loc= a list with all words that are within one place name tag.
        labels=  list with labels for each word, None if it is not a place name and the
        ID of the occuring place name if a word belongs to a place name.
    Return:
        samples = a list with all label lists for all sentences.
        clean_sterbeliste= a list each sentence as a list of words.
        all_loc= a dictionary with all place names and its ID number.
    
    '''
    # we create a dictionary locations that contains the place name as a key and the ID as a value
    all_loc={}

    # we collect our tagged sentences in samples, these contain None for each word that is no place name and the corresponding ID to each word that is part of a place name 
    samples = []

    # we collect a list with each sentence that is a list containing each word as its elements
    clean_sterbeliste=[]


    for sentence in sterbeliste:
        if re.search(r"/mark", sentence) is None:
            continue
        labels=[]
        sentence = preprocess_sentence(sentence)
        if_loc= False
        for word in sentence:
            if word=='<start>':
                if_loc= True
                curr_loc = []
            elif word=='<end>':
                labels +=[get_location_id(curr_loc, all_loc)] * len(curr_loc)
                if_loc= False
            else:
                if if_loc:
                    curr_loc.append(word)
                else:
                    labels.append(None)
        samples.append(labels)
        sentence= [word for word in sentence if word !="<start>" if word!="<end>"]
        clean_sterbeliste.append(sentence)
    return samples, clean_sterbeliste, all_loc
        
        
samples, clean_sterbeliste, all_loc = create_labels(sterbeliste)

In [16]:
print(f"Sample_tagging:{samples[0]}")
print(f"Cleaned Sentence: {clean_sterbeliste[0]}")
print(f"=======================================")
print(f"Number of place names: {len(all_loc)}")
print(f"Number of sample sentences: {len(clean_sterbeliste)}")
print(f"=======================================")

Sample_tagging:[None, None, None, None, None, None, 0, 0, None, 1, 1, None, None, None, None, None, None, None]
Cleaned Sentence: ['\nDem', 'Peter', 'Frost', 'einem', 'Cammer', 'im', 'Greiseckeris', 'Hauß', 'im', 'Diener', 'Gäßl', 'sein', 'Kind', 'Frantz', 'alt', '6', 'viertl', 'Jahr']
Number of place names: 2911
Number of sample sentences: 7412


## 3.Building training data

In [17]:
# saving the training data, leaving 100 samples as test data

d = {"sample": clean_sterbeliste, "labels":samples}

df = pd.DataFrame(d)

df_train = df[100:]
df_train.to_csv(os.path.join(data_path, 'df_train.csv'))


df_train

Unnamed: 0,sample,labels
100,"[Joseph, Gaßrann, Wais, in, dem, Englischen, H...","[None, None, None, None, None, 123, 123, None,..."
101,"[Christina, Zaplerin, lediges, Mensch, bey, de...","[None, None, None, None, None, None, 8, 8, Non..."
102,"[Dem, Herrn, Frantz, v, ,, Königl, Spanischer,...","[None, None, None, None, None, None, None, Non..."
103,"[Herr, Wolf, Joseph, Hofmandl, v, ,, StadtHaup...","[None, None, None, None, None, None, None, Non..."
104,"[Dem, Herrn, Johann, Fridmann, Kaiserl, ,, s, ...","[None, None, None, None, None, None, None, Non..."
...,...,...
7407,"[Philipp, Spitzer, ,, led, alt, 18, J, im, Jud...","[None, None, None, None, None, None, None, Non..."
7408,"[Hr, Bernard, Abbe, v, Berzoni, ,, d, h, r, N,...","[None, None, None, 131, 131, None, None, None,..."
7409,"[Der, Fr, ., Cath, Resch, ,, bg, Kaffeehausinh...","[None, None, None, None, None, None, None, Non..."
7410,"[Hr, Daniel, Freyhr, Rottern, v, und, zu, Kost...","[None, None, None, None, None, None, None, 291..."


In [18]:
'''
X Matrix

Dimensions # sentences x # sentences (one column samples df, len(sample))

'''
x_matrix = pd.DataFrame(clean_sterbeliste)

print(x_matrix.shape)

x_matrix

(7412, 66)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,56,57,58,59,60,61,62,63,64,65
0,\nDem,Peter,Frost,einem,Cammer,im,Greiseckeris,Hauß,im,Diener,...,,,,,,,,,,
1,Der,Maria,Nauitschanin,einer,Burgerl,Wittib,im,Primis,Hauß,auf,...,,,,,,,,,,
2,Christina,Kochin,ein,Wittib,im,Barbieris,Hauß,auf,der,Laimgrueben,...,,,,,,,,,,
3,Gottlieb,Rabel,ein,gewester,Haußmeister,beyn,3,Mohren,in,der,...,,,,,,,,,,
4,Dominica,Stephanebrin,ein,lediges,Mensch,beym,weissen,Ochsen,bey,St,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7407,Philipp,Spitzer,",",led,alt,18,J,im,Judensp,in,...,,,,,,,,,,
7408,Hr,Bernard,Abbe,v,Berzoni,",",d,h,r,N,...,,,,,,,,,,
7409,Der,Fr,.,Cath,Resch,",",bg,Kaffeehausinhab,Wit,i,...,,,,,,,,,,
7410,Hr,Daniel,Freyhr,Rottern,v,und,zu,Kostenthal,",",Weltpriest,...,,,,,,,,,,


**T Matrix**

np.sparse matirx 1 colum 0, 1 colum 1, rows are keywords


In [19]:
'''
T Matrix
Dimensions: # place names x # 2

'''
a= np.zeros(len(all_loc))
b= np.full(len(all_loc),1)

t_matrix = np.stack((b,a.T))
print(t_matrix.T.shape)

t_matrix

(2911, 2)


array([[1., 1., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 0., 0.]])

**Z Matrices**
Each sentense is going to be its own Z Matrix.
colums = place names
rows = words in sentense

In [22]:
'''
Z Matrices

In the next function, we are building the Z Matrices by looping through all samples.
Each sample is compared with the list of locations, if the location is in the sample,
the new list "matched_loc" gets an entry 1 otherwise a 0 is added
Then the list is transformed to an numpy array and brought into shape:

Dimensions
number of place names x number of words

'''
locations_list = list(range(0,(len(all_loc.keys()))))

## create a list and turn to np.array 3d 
collected_z_matrices=[]

for sample in samples:
    matched_loc= [1 if i == j else 0 for i in sample for j in locations_list]
    i=len(sample)
    Z = np.array(matched_loc)
    Z = np.reshape(Z, (i,len(locations_list)))
    collected_z_matrices.append(Z)

print(Z.shape)


print(len(collected_z_matrices))


(12, 2911)
7412


In [23]:
print(collected_z_matrices[0].shape)

(18, 2911)


In [24]:
#spilt the test and training z_matrices
test_rule_matches_sparse_z_list = collected_z_matrices[:100]

train_rule_matches_sparse_z_list = collected_z_matrices[100:]

# These lists can't be stacked to a tensor yet, as they need padding. 

In [25]:
# saving the z_matrices


dump(train_rule_matches_sparse_z_list, os.path.join(data_path, "train_rule_matches_z_list.lib"))
dump(test_rule_matches_sparse_z_list, os.path.join(data_path, "test_rule_matches_z_list.lib"))


# saving the t_matix

dump(t_matrix,  os.path.join(data_path, "mapping_rules_labels_t.lib").replace('\\','/'))

['../../../data_from_minio/wiener_diarum_toponyms/mapping_rules_labels_t.lib']

## 5. Building test data

The provided test data has information about the character position of each place name. 
The data also considers tags, special characters and other non-verbal characters.
For our use, this test data can not be used as we want to train our model to look for place names within a sentence and not on a character level.
We might need to take a part of our training data, manually check each sentence and its tagging and use it as our test data. 

In [26]:
# Building test data, the first 100 samples have been manually checked

df_test = df[:100]

# Saving them

df_test.to_csv(os.path.join(data_path, 'df_test.csv'))


### 6. Overview 

In this tutorial we have created a data set that is suitable to train a sequence labeler as well as the matrices needed for the Knodle framework. 

We have dealt with messy data that contains unnormalised historical language. 

We started with the .html file (="l" containing the obituaries and end with a df_train containing samples and their labeling, a df_test containing 100 manually checked labels.
And three lib files containing the t_matrix ( # toponoym/not toponym x # number of all pace names aka. rules) the z-matirces collected as a list one z_matrix being #words in sentence x # all place names aka. rules) both for the test samples (test_rule_matches_z_list) and the train samples (train_rule_matches_z_list).

In [27]:
os.listdir(data_path)

['df_test.csv',
 'df_train.csv',
 'l',
 'mapping_rules_labels_t.lib',
 'test_rule_matches_z_list.lib',
 'train_rule_matches_z_list.lib']