We will try to solve a named entity recognition (NER) problem, that is, multi-class token classification, for documents in German.

First, install all the necessary packages and import them.

In [10]:
# !pip install transformers datasets seqeval iterative-stratification

Collecting transformers
  Downloading transformers-4.9.1-py3-none-any.whl (2.6 MB)
Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting iterative-stratification
  Downloading iterative_stratification-0.1.6-py3-none-any.whl (8.7 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)

In [11]:
import pandas as pd
import numpy as np
import re
from copy import deepcopy


import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)
from datasets import load_metric, Dataset

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

Do some preprocessing. This is a very important stage, because the quality of everything downstream depends on how well the data is prepared.

In [2]:
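# Tag each token using its character interval and the sorted label spans.
# Note: the label pointer advances after the first matching token, so only the
# first token inside a multi-token span receives the tag; the rest get 'O'.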
def tag_tokens(intervals, labels):
    res = []
    l_idx = 0
    for left, right in intervals:
        if l_idx >= len(labels):
            res.append('O')
        else:
            l = labels[l_idx]
            if left >= l[0] and right <= l[1]:
                # print((left, right), l, l_idx)
                res.append(l[2])
                l_idx += 1
            elif right > l[1]:
                # skip label if failed to find
                l_idx += 1
                res.append('O')
            else:
                res.append('O')
    return res

# Add B-/I- prefixes (BIO encoding): B- starts a run of identical tags, I- continues it
def add_IB(tags):
    res, prev = [], tags[0]
    res.append(prev if prev == 'O' else 'B-'+prev)
    for t in tags[1:]:
        if t == 'O':
            res.append('O')
            prev = t
            continue
        if prev == t:
            res.append('I-' + t)
        else:
            res.append('B-' + t)
        prev = t
    return res
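
As a quick illustration of the B-/I- prefixing that `add_IB` performs, here is a minimal sketch on a toy tag sequence. `to_bio` is a hypothetical, compact restatement of the same logic (it is not part of the notebook), and the tag sequence is made up.

```python
def to_bio(tags):
    """Hypothetical compact equivalent of add_IB: B- opens a run, I- continues it."""
    out, prev = [], "O"
    for t in tags:
        if t == "O":
            out.append("O")
        else:
            # Same tag as the previous token -> inside the run, otherwise a new run
            out.append(("I-" if t == prev else "B-") + t)
        prev = t
    return out


toy_tags = ["O", "NAME", "SURNAME", "SURNAME", "O", "CITY"]
bio = to_bio(toy_tags)
print(bio)
```

Two consecutive tokens with the same tag are treated as one entity, so two different entities of the same type that happen to be adjacent would be merged under this scheme.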

In [4]:
df = pd.read_json('JSON for 2nd task.json')
df['labels'] = df['labels'].apply(lambda x: sorted(x))
df['token_ranges'] = df['text'].apply(lambda text: [(m.start() + 1, m.end() - 1) for m in re.finditer(r'[\w.]+', text)])
df['tokens'] = df['text'].apply(lambda text: [m.group(0) for m in re.finditer(r'[\w.]+', text)])
df['tags'] = df.apply(lambda x: tag_tokens(x['token_ranges'], x['labels']), axis=1)

df.head()

Unnamed: 0,text,labels,token_ranges,tokens,tags
0,Öffentliche Bekanntmachung AUREG\n\n\n\nAmtsge...,"[[103, 113, PUBDATE], [349, 359, STATUS], [360...","[(1, 10), (13, 25), (28, 31), (37, 46), (49, 5...","[Öffentliche, Bekanntmachung, AUREG, Amtsgeric...","[O, O, O, O, O, O, O, O, O, O, O, O, PUBDATE, ..."
1,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[104, 114, PUBDATE], [343, 379, STATUS], [380...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 5...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, PUBDATE, O, ..."
2,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[108, 118, PUBDATE], [340, 347, POSITION], [3...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 6...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, PUBDATE, O, ..."
3,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[110, 120, PUBDATE], [342, 349, POSITION], [3...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 6...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, PUBDATE, O, ..."
4,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[111, 121, PUBDATE], [316, 323, POSITION], [3...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 6...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, PUBDATE, O, ..."


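Statistics on how successful our parsing was: for each tag, the number of documents in which it was parsed versus the number of documents in which it appears in the source labels.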

In [6]:
df_copy = deepcopy(df)

all_tags = df_copy['tags'].explode().unique()
df_copy[all_tags] = df_copy['tags'].apply(lambda x: [1 if t in set(x) else 0 for t in all_tags]).apply(pd.Series)
src = df_copy['labels'].apply(lambda x: list(set([i[2] for i in x]))).explode().value_counts().rename('source')
parse_stat = df_copy[all_tags].sum().rename('parsed').to_frame().merge(src, left_index=True, right_index=True)
parse_stat['miss'] = parse_stat['source'] - parse_stat['parsed']
parse_stat

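| Tag | parsed | source | miss |
|---|---|---|---|
| PUBDATE | 214 | 214 | 0 |
| STATUS | 142 | 150 | 8 |
| POSITION | 198 | 210 | 12 |
| SURNAME | 203 | 210 | 7 |
| NAME | 203 | 210 | 7 |
| CITY | 196 | 203 | 7 |
| BIRTHDAY | 192 | 198 | 6 |
| COUNTRY | 5 | 5 | 0 |
| TITLE | 21 | 22 | 1 |
| NOTE | 21 | 21 | 0 |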
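
One preprocessing strategy that might reduce the misses is to tag every token that lies fully inside a labelled span, instead of walking a single label pointer. The sketch below is only an illustration and has not been run on this dataset. `tag_by_overlap` is a hypothetical helper, the example text is made up, and the code assumes that label offsets are 0-based and end-exclusive.

```python
import re


def tag_by_overlap(text, labels, pattern=r"[\w.]+"):
    """Hypothetical helper: BIO-tag every token that lies fully inside a labelled span.

    Assumes `labels` holds (start, end, tag) triples with 0-based, end-exclusive
    character offsets; adjust the comparison if the real offsets differ.
    """
    tokens, tags, prev_span = [], [], None
    for m in re.finditer(pattern, text):
        # Index of the first span that fully contains this token, if any
        span = next(
            (i for i, (start, end, _) in enumerate(labels)
             if start <= m.start() and m.end() <= end),
            None,
        )
        if span is None:
            tags.append("O")
        else:
            # B- opens a span; I- continues the span of the previous token
            prefix = "I-" if span == prev_span else "B-"
            tags.append(prefix + labels[span][2])
        tokens.append(m.group(0))
        prev_span = span
    return tokens, tags


# Toy example (made-up text); offsets are computed rather than hard-coded
text = "Dr. Anna Maria Schmidt, Berlin"
labels = []
for phrase, tag in [("Dr.", "TITLE"), ("Anna Maria", "NAME"),
                    ("Schmidt", "SURNAME"), ("Berlin", "CITY")]:
    start = text.index(phrase)
    labels.append((start, start + len(phrase), tag))

tokens, tags = tag_by_overlap(text, labels)
print(list(zip(tokens, tags)))
```

Because the span index is tracked per token, this variant emits the B-/I- prefixes directly and keeps adjacent spans of the same type separate.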


## Make a stratified hold-out set for validation

Because we have so little data, we will use only one validation set. It is not a true hold-out set: the model is evaluated on it after every epoch during training.

In [7]:
# implement one-hot encoding of tags to make correct validation split
df['tags'] = df['tags'].apply(add_IB)

all_tags = df['tags'].explode().unique()
tag2idx = {t:i for i,t in enumerate(all_tags)}
idx2tag = {i:t for i,t in enumerate(all_tags)}

df['tags_idx'] = df['tags'].apply(lambda x: [tag2idx[tag] for tag in x])
df[all_tags] = df['tags'].apply(lambda x: [1 if t in set(x) else 0 for t in all_tags]).apply(pd.Series)
df.head()

Unnamed: 0,text,labels,token_ranges,tokens,tags,tags_idx,O,B-PUBDATE,B-STATUS,B-POSITION,B-SURNAME,B-NAME,B-CITY,B-BIRTHDAY,B-COUNTRY,B-TITLE,B-NOTE,I-POSITION,I-STATUS,I-CITY,I-SURNAME
0,Öffentliche Bekanntmachung AUREG\n\n\n\nAmtsge...,"[[103, 113, PUBDATE], [349, 359, STATUS], [360...","[(1, 10), (13, 25), (28, 31), (37, 46), (49, 5...","[Öffentliche, Bekanntmachung, AUREG, Amtsgeric...","[O, O, O, O, O, O, O, O, O, O, O, O, B-PUBDATE...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",1,1,1,1,1,1,0,0,0,0,0,0,0,0,0
1,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[104, 114, PUBDATE], [343, 379, STATUS], [380...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 5...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, B-PUBDATE, O...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
2,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[108, 118, PUBDATE], [340, 347, POSITION], [3...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 6...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, B-PUBDATE, O...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",1,1,1,1,1,1,1,1,1,0,0,0,0,0,0
3,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[110, 120, PUBDATE], [342, 349, POSITION], [3...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 6...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, B-PUBDATE, O...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
4,Öffentliche Bekanntmachung RegisSTAR\n\n\n\nAm...,"[[111, 121, PUBDATE], [316, 323, POSITION], [3...","[(1, 10), (13, 25), (28, 35), (41, 50), (53, 6...","[Öffentliche, Bekanntmachung, RegisSTAR, Amtsg...","[O, O, O, O, O, O, O, O, O, O, O, B-PUBDATE, O...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",1,1,1,1,1,1,1,1,0,0,0,0,0,0,0


In [24]:
X = df.drop(columns=all_tags)
y = df[all_tags]

msss = MultilabelStratifiedShuffleSplit(n_splits=2, test_size=0.3, random_state=71)

# split() yields positional indices; only the first split is used
for train_index, test_index in msss.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    break

In [25]:
y_train.sum().rename('train').to_frame().merge(y_test.sum().rename('val'), left_index=True, right_index=True)

| Tag | train | val |
|---|---|---|
| O | 149 | 65 |
| B-PUBDATE | 149 | 65 |
| B-STATUS | 99 | 43 |
| B-POSITION | 138 | 60 |
| B-SURNAME | 141 | 62 |
| B-NAME | 141 | 62 |
| B-CITY | 136 | 60 |
| B-BIRTHDAY | 134 | 58 |
| B-COUNTRY | 3 | 2 |
| B-TITLE | 15 | 6 |
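
To check how close the split is to the requested `test_size=0.3` for each tag, we can compute the validation share from the counts in the table above. This is a small sketch; `val_share` is a hypothetical helper and only the rows shown in the table are included.

```python
# Counts copied from the train/val table above: tag -> (train, val)
counts = {
    "O": (149, 65),
    "B-PUBDATE": (149, 65),
    "B-STATUS": (99, 43),
    "B-POSITION": (138, 60),
    "B-SURNAME": (141, 62),
    "B-NAME": (141, 62),
    "B-CITY": (136, 60),
    "B-BIRTHDAY": (134, 58),
    "B-COUNTRY": (3, 2),
    "B-TITLE": (15, 6),
}


def val_share(train, val):
    """Fraction of the documents containing a tag that ended up in validation."""
    return val / (train + val)


shares = {tag: round(val_share(*c), 3) for tag, c in counts.items()}
for tag, share in shares.items():
    print(f"{tag:12s} {share:.3f}")
```

The frequent tags land within about one percentage point of 0.3, while rare tags such as B-COUNTRY (2 of 5 documents) cannot get close to the target simply because there are too few documents to distribute.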


## Tokenize for BERT

Transform the data into a format that BERT can read. We will use `bert-base-german-cased`, one of the most popular German BERT checkpoints on the Hugging Face Hub.

In [14]:
model_checkpoint = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [15]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"].values.tolist(),
                                 truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["tags_idx"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id of None. Label them -100 so that
            # the loss function ignores them.
            if word_idx is None:
                label_ids.append(-100)
            # Every sub-token inherits the label of the word it belongs to
            # (the alternative of labelling only the first sub-token is not used here).
            else:
                label_ids.append(label[word_idx])
        
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
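
The alignment step can be illustrated without a tokenizer. The sketch below mirrors the inner loop of `tokenize_and_align_labels` on a hand-written `word_ids` list and also shows the alternative of labelling only the first sub-token of each word. `align_labels` is a hypothetical helper and the ids are made up.

```python
def align_labels(word_ids, word_labels, first_subtoken_only=False):
    """Hypothetical helper: map word-level label ids onto sub-token positions."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens ([CLS], [SEP], padding) are ignored by the loss
            aligned.append(-100)
        elif first_subtoken_only and word_idx == previous:
            # Later sub-tokens of the same word are ignored in this variant
            aligned.append(-100)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


# Three words; the second word is split into three sub-tokens
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 5, 6]

all_subtokens = align_labels(word_ids, word_labels)
first_only = align_labels(word_ids, word_labels, first_subtoken_only=True)
print(all_subtokens)
print(first_only)
```

With the all-sub-tokens variant used in this notebook, a word tagged `B-X` that splits into k sub-tokens contributes k `B-X` labels, and an entity-level metric will count those as k separate entities.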

In [26]:
train_tokenized = tokenize_and_align_labels(X_train)
val_tokenized = tokenize_and_align_labels(X_test)

## Modeling

In [17]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [27]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(all_tags))

batch_size = 8

args = TrainingArguments(
    "test-ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01
)

data_collator = DataCollatorForTokenClassification(tokenizer)

metric = load_metric("seqeval")

An example of how the metrics are computed:

In [None]:
labels = [all_tags[i] for i in df["tags_idx"][0]]
metric.compute(predictions=[labels], references=[labels])

{'NAME': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'POSITION': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'PUBDATE': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'STATUS': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'SURNAME': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'overall_accuracy': 1.0,
 'overall_f1': 1.0,
 'overall_precision': 1.0,
 'overall_recall': 1.0}
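
seqeval scores whole entities rather than individual tokens. The following is a simplified, pure-Python sketch of that idea, not seqeval's actual implementation. `extract_entities` and `entity_prf` are hypothetical names, and the tag sequences are made up.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == etype and start is not None
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, name
    return entities


def entity_prf(references, predictions):
    """Micro-averaged entity-level precision, recall and F1 over several sequences."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(references, predictions):
        gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
        tp += len(gold & pred)  # an entity counts only if type and both boundaries match
        n_pred += len(pred)
        n_gold += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-NAME", "I-NAME", "O", "B-CITY"]]
pred = [["B-NAME", "O", "O", "B-CITY"]]
precision, recall, f1 = entity_prf(gold, pred)
token_accuracy = sum(g == p for g, p in zip(gold[0], pred[0])) / len(gold[0])
print(precision, recall, f1, token_accuracy)
```

In this toy case three of four tokens are correct, yet only one of two entities matches exactly, which is why entity-level F1 can be much lower than token accuracy.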

In [28]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [all_tags[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [all_tags[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [29]:
train_dataset = Dataset.from_dict(train_tokenized)
val_dataset = Dataset.from_dict(val_tokenized)

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [30]:
trainer.train()

***** Running training *****
  Num examples = 149
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 57


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.340052 | 0.927632 | 0.538168 | 0.681159 | 0.911891 |
| 2 | No log | 0.183882 | 0.831565 | 0.79771 | 0.814286 | 0.950023 |
| 3 | No log | 0.153447 | 0.876337 | 0.833969 | 0.854628 | 0.958709 |


***** Running Evaluation *****
  Num examples = 65
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=57, training_loss=0.5005863256621779, metrics={'train_runtime': 30.8744, 'train_samples_per_second': 14.478, 'train_steps_per_second': 1.846, 'total_flos': 62291893449240.0, 'train_loss': 0.5005863256621779, 'epoch': 3.0})
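
The 57 optimization steps reported above follow directly from the dataset size, batch size and number of epochs. A short worked calculation, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept:

```python
import math

num_examples = 149   # training examples, from the log
batch_size = 8       # per_device_train_batch_size
num_epochs = 3       # num_train_epochs

# The last, partially filled batch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With so few steps, the `No log` entries in the Training Loss column are probably explained by the logging interval being longer than the whole run (the default `logging_steps` is, to our knowledge, 500).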

## Evaluate results

In [31]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 65
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'epoch': 3.0,
 'eval_accuracy': 0.9587093862815884,
 'eval_f1': 0.8546284224250327,
 'eval_loss': 0.15344683825969696,
 'eval_precision': 0.8763368983957219,
 'eval_recall': 0.833969465648855,
 'eval_runtime': 1.6701,
 'eval_samples_per_second': 38.919,
 'eval_steps_per_second': 5.389}
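
As a sanity check, the reported `eval_f1` is the harmonic mean of `eval_precision` and `eval_recall`. A short worked calculation with the values copied from the output above:

```python
# Values copied from the trainer.evaluate() output above
precision = 0.8763368983957219
recall = 0.833969465648855

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
```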

In [34]:
predictions, labels, _ = trainer.predict(val_dataset)
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [all_tags[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [all_tags[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
pd.DataFrame.from_dict(results)

| | BIRTHDAY | CITY | COUNTRY | NAME | NOTE | POSITION | PUBDATE | STATUS | SURNAME | TITLE | overall_precision | overall_recall | overall_f1 | overall_accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| precision | 0.983122 | 0.738532 | 0.0 | 0.811765 | 1.0 | 0.828125 | 0.990566 | 1.0 | 0.681452 | 0.0 | 0.876337 | 0.833969 | 0.854628 | 0.958709 |
| recall | 1.0 | 0.79703 | 0.0 | 0.570248 | 0.458333 | 0.773723 | 1.0 | 0.175 | 0.816425 | 0.0 | 0.876337 | 0.833969 | 0.854628 | 0.958709 |
| f1 | 0.991489 | 0.766667 | 0.0 | 0.669903 | 0.628571 | 0.8 | 0.995261 | 0.297872 | 0.742857 | 0.0 | 0.876337 | 0.833969 | 0.854628 | 0.958709 |
| number | 466.0 | 202.0 | 2.0 | 121.0 | 24.0 | 137.0 | 315.0 | 80.0 | 207.0 | 18.0 | 0.876337 | 0.833969 | 0.854628 | 0.958709 |


For some reason, the model failed to recognize the **TITLE** tag at all: its precision and recall are both 0. The TITLE values in the train and validation sets, listed below, do not differ much, so the split is unlikely to be the cause. The problem may instead be related to the short length of these tokens or to their non-trivial punctuation. Another rare tag, **NOTE**, seems to be doing okay (F1 of about 0.63), and the rarest tag, **COUNTRY**, has too few examples to draw any conclusions.
We also got low recall on the **STATUS** label (0.175).

In [138]:
def get_tags(tag, df):
    # For each document, collect the text of every span labelled `tag` (df is not modified)
    tag_texts = df.apply(lambda x: [x['text'][i[0]:i[1]] for i in x['labels'] if i[2] == tag], axis=1)
    # Keep only the documents that contain the tag at least once
    return tag_texts[tag_texts.apply(lambda x: len(x) > 0)]

In [149]:
get_tags('TITLE', X_train)

12                               [Dr., Dr. med.]
22                                         [Dr.]
33                                         [Dr.]
46                                         [Dr.]
97                                       [Prof.]
107                                        [Dr.]
116                                   [Dr., Dr.]
118         [Dipl.-Ing., Dipl.-Ing., Dipl.-Kfm.]
170                                   [Dr., Dr.]
173                                        [Dr.]
180                                        [Dr.]
185                                    [Dr.jur.]
197                                        [Dr.]
200    [Prof. Dr., Prof. Dr., Dipl.-Psychologin]
206                                        [Dr.]
dtype: object

In [150]:
get_tags('TITLE', X_test)

6                     [Dr.]
41     [Dr., Dr., Dr., Dr.]
56                    [Dr.]
101                   [Dr.]
163                   [Dr.]
169                   [Dr.]
174                   [Dr.]
dtype: object
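
A possible first step in investigating **TITLE** is to check how the preprocessing regex splits these strings. The sketch below applies the same `[\w.]+` pattern to a few of the TITLE values listed above; `TOKEN_PATTERN` and `split_titles` are names introduced here for illustration.

```python
import re

TOKEN_PATTERN = r"[\w.]+"  # same pattern as in the preprocessing cell

titles = ["Dr.", "Dr. med.", "Dr.jur.", "Prof. Dr.", "Dipl.-Ing.", "Dipl.-Psychologin"]

# Map each title to the tokens the regex produces for it
split_titles = {t: re.findall(TOKEN_PATTERN, t) for t in titles}
for title, tokens in split_titles.items():
    print(f"{title!r:22} -> {tokens}")
```

Several titles are split into more than one token (for example at the space in `Dr. med.` or at the hyphen in `Dipl.-Ing.`). Since `tag_tokens` appears to tag only the first token inside a span, the remaining parts of such titles would be labelled `O`, which may be worth checking against the actual tags.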

Overall quality on the more frequent tags seems to be OK. Things to do in the future:

1. Try different preprocessing strategies (to fully cover all tags).
2. Investigate the problem with **TITLE** further.
3. Of course, gather more data, especially for the tags that got low scores.
4. Try different models from the transformers hub (or maybe even a different class of models, though it would probably perform worse than BERT).



