## Aspect Term Extraction (ATE): Training and Fine-Tuning of Large Language Models on German Hospital Reviews Using BIO Tagging
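
The tags that appear in the outputs of this notebook (`B-ASPECT`, `I-ASPECT`, `O`) follow the usual begin/inside/outside scheme. As a minimal sketch of how such word-level tags can be derived from annotated aspect spans, assume that aspects are given as `(start, end)` token index pairs with an exclusive end. The helper name `spans_to_bio` and this span format are illustrative and not taken from the project code.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]], label: str = "ASPECT") -> List[str]:
    """Convert token-index spans (start inclusive, end exclusive) into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span {(start, end)} for {n_tokens} tokens")
        tags[start] = f"B-{label}"          # first token of the aspect term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the same term
    return tags


# Sentence fragment from the evaluation output; "Unikliniken" is the aspect term.
tokens = ["Insbesondere", "bei", "Unikliniken", ",", "mit", "anderen", "Krankheitsbildern"]
bio_tags = spans_to_bio(len(tokens), [(2, 3)])
print(list(zip(tokens, bio_tags)))
```

A multi-word aspect such as a span `(1, 3)` would yield `B-ASPECT` followed by `I-ASPECT`.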

In [1]:
import torch
import os

import spacy
import ast  # To safely evaluate strings as Python objects

from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import Dataset
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import evaluate

from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

# We need the sys package to load modules from another directory:
import sys
sys.path.append('../')
from functions.ate_model_train import *



In [2]:
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPU device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

Is CUDA available: True
CUDA version: 12.6
GPU device name: NVIDIA A30


In [3]:
# Load the dataset into a DataFrame
data = pd.read_csv("./data/hospitalABSA/patient_review_labels_absa.csv")
data_ano = pd.read_csv("./data/hospitalABSA/patient_review_labels_absa_ano.csv")
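
The import cell brings in `ast` "to safely evaluate strings as Python objects", which suggests that list-valued columns are stored as strings in the CSV files. A minimal sketch of that parsing step follows. The column names `tokens` and `labels` are hypothetical, since the real column names are not shown in this notebook.

```python
import ast
from typing import List


def parse_list_cell(cell: str) -> List[str]:
    """Parse a CSV cell such as "['O', 'B-ASPECT']" into a list of strings."""
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]


# Hypothetical row, as it might look right after pd.read_csv.
row = {"tokens": "['Nun', 'sind', '3', 'Jahre']", "labels": "['O', 'O', 'O', 'O']"}
parsed = {column: parse_list_cell(cell) for column, cell in row.items()}

# Tokens and tags must stay aligned one-to-one.
assert len(parsed["tokens"]) == len(parsed["labels"])
print(parsed)
```

With a DataFrame, the same function could be applied per column, for example `data["labels"].apply(parse_list_cell)`, assuming such a column exists.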

In [4]:
models = ["google-bert/bert-base-german-cased", "dbmdz/bert-base-german-cased", "dbmdz/bert-base-german-uncased",
          "FacebookAI/xlm-roberta-base", "TUM/GottBERT_base_best", "TUM/GottBERT_filtered_base_best", "TUM/GottBERT_base_last",
          "distilbert/distilbert-base-german-cased", "GerMedBERT/medbert-512", "deepset/gbert-base"]


### 1. Standard ATE
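
`ate_model` comes from `functions.ate_model_train`, whose source is not shown here. One step that token-classification fine-tuning generally needs is aligning word-level tags with subword tokens, and the `Map:` progress bars in the output suggest a dataset mapping step of this kind. The sketch below shows one common convention. It assumes a `word_ids` list like the one a fast Hugging Face tokenizer returns from `BatchEncoding.word_ids()`; the function name, the label mapping, and the choice to label only the first subword of each word are assumptions.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(
    word_ids: List[Optional[int]],
    word_tags: List[str],
    label2id: Dict[str, int],
) -> List[int]:
    """Give each word's tag to its first subword; mask everything else."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)                    # special tokens
        elif word_id != previous:
            aligned.append(label2id[word_tags[word_id]])    # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)                    # continuation subword
        previous = word_id
    return aligned


label2id = {"O": 0, "B-ASPECT": 1, "I-ASPECT": 2}  # assumed mapping
# Hypothetical subword split: [CLS] bei Uni ##kliniken [SEP]
word_ids = [None, 0, 1, 1, None]
aligned = align_labels(word_ids, ["O", "B-ASPECT"], label2id)
print(aligned)
```

Masking continuation subwords keeps the loss and the metrics at the word level, which matches the word-level `Tokens` / `True Labels` / `Pred Labels` lines printed in the output.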

In [5]:
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=10)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 3621.40 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4797.18 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
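
The dictionary printed above assigns a small weight to the frequent `O` class and much larger weights to the rarer `B-ASPECT` and `I-ASPECT` classes. Values of this shape are what "balanced" weighting produces, where each class receives `n_samples / (n_classes * class_count)`, the formula behind scikit-learn's `compute_class_weight(class_weight="balanced", ...)`. Whether `ate_model` computes its weights exactly this way is an assumption; the sketch below uses a toy tag list rather than the real data.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(tags: List[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(tags)
    n_samples = len(tags)
    n_classes = len(counts)
    return {tag: n_samples / (n_classes * count) for tag, count in counts.items()}


# Toy data: 'O' dominates, 'I-ASPECT' is rare.
toy_tags = ["O"] * 8 + ["B-ASPECT"] * 3 + ["I-ASPECT"] * 1
weights = balanced_class_weights(toy_tags)
print(weights)
```

Weights like these are typically passed to a weighted cross-entropy loss, for example `torch.nn.CrossEntropyLoss(weight=...)`, ordered by label id.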
Training results for google-bert/bert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.174199,0.773756,0.730769,0.751648
2,0.228000,0.170873,0.816143,0.777778,0.796499
3,0.103200,0.275012,0.79386,0.773504,0.78355
4,0.034100,0.27071,0.793103,0.786325,0.7897
5,0.018700,0.392381,0.815315,0.773504,0.79386
6,0.018700,0.472495,0.766393,0.799145,0.782427
7,0.004300,0.460692,0.795745,0.799145,0.797441
8,0.001400,0.482626,0.788793,0.782051,0.785408
9,0.000900,0.488393,0.801802,0.760684,0.780702
10,0.000400,0.491477,0.8,0.769231,0.784314


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4446.81 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.75      0.79       268

   micro avg       0.83      0.75      0.79       268
   macro avg       0.83      0.75      0.79       268
weighted avg       0.83      0.75      0.79       268

Precision Score: 0.8347107438016529
Recall Score: 0.753731343283582
F1 Score: 0.792156862745098
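
The scores above are computed with `seqeval`, which evaluates whole aspect spans rather than individual tokens: a predicted span counts as correct only if its boundaries match a gold span exactly. The pure-Python sketch below illustrates that idea for a single entity type. It is a simplified stand-in, not seqeval's implementation, and the helper names are illustrative.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive; a stray I- tag opens a span."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))                 # span ends at an O or a new B-
            start = None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start = i
    return spans


def span_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged span-level precision, recall and F1 over all sentences."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["O", "O", "B-ASPECT", "I-ASPECT", "O"], ["O", "B-ASPECT", "O"]]
pred = [["O", "O", "B-ASPECT", "I-ASPECT", "O"], ["B-ASPECT", "B-ASPECT", "O"]]
precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)
```

The values reported above are consistent with 202 correctly predicted spans out of 242 predicted and 268 gold spans: 202/242 ≈ 0.8347 (precision), 202/268 ≈ 0.7537 (recall), and 2·202/(242+268) ≈ 0.7922 (F1).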
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred 

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5948.64 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4872.11 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.181845,0.748936,0.77193,0.760259
2,0.241500,0.197786,0.756198,0.802632,0.778723
3,0.116300,0.265555,0.759494,0.789474,0.774194
4,0.046900,0.30831,0.715909,0.828947,0.768293
5,0.029000,0.387184,0.741667,0.780702,0.760684
6,0.029000,0.396562,0.789954,0.758772,0.774049
7,0.010500,0.440614,0.797297,0.776316,0.786667
8,0.005800,0.424229,0.758475,0.785088,0.771552
9,0.004800,0.453511,0.75641,0.776316,0.766234
10,0.001300,0.46339,0.748954,0.785088,0.766595


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4403.29 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.74      0.80       261

   micro avg       0.85      0.74      0.80       261
   macro avg       0.85      0.74      0.80       261
weighted avg       0.85      0.74      0.80       261

Precision Score: 0.8546255506607929
Recall Score: 0.7432950191570882
F1 Score: 0.7950819672131147
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5627.64 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4764.70 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-uncased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.174619,0.747788,0.747788,0.747788
2,0.233700,0.213093,0.800926,0.765487,0.782805
3,0.115400,0.239687,0.8,0.761062,0.780045
4,0.048200,0.278641,0.723849,0.765487,0.744086
5,0.027300,0.321366,0.748918,0.765487,0.757112
6,0.027300,0.365935,0.770563,0.787611,0.778993
7,0.009100,0.427991,0.755459,0.765487,0.76044
8,0.003400,0.453464,0.782407,0.747788,0.764706
9,0.002400,0.46573,0.779817,0.752212,0.765766
10,0.001600,0.450496,0.765217,0.778761,0.77193


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4507.91 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.70      0.77       267

   micro avg       0.86      0.70      0.77       267
   macro avg       0.86      0.70      0.77       267
weighted avg       0.86      0.70      0.77       267

Precision Score: 0.8611111111111112
Recall Score: 0.6966292134831461
F1 Score: 0.7701863354037267
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5685.16 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4931.14 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for FacebookAI/xlm-roberta-base with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.175791,0.78626,0.792308,0.789272
2,0.269100,0.204412,0.751825,0.792308,0.771536
3,0.175600,0.175286,0.770909,0.815385,0.792523
4,0.110000,0.202577,0.767606,0.838462,0.801471
5,0.081100,0.283526,0.77972,0.857692,0.81685
6,0.081100,0.289111,0.769759,0.861538,0.813067
7,0.047500,0.330556,0.75,0.865385,0.803571
8,0.030400,0.331324,0.777385,0.846154,0.810313
9,0.016300,0.355234,0.775862,0.865385,0.818182
10,0.011500,0.352818,0.762712,0.865385,0.810811


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4535.39 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.77      0.79      0.78       294

   micro avg       0.77      0.79      0.78       294
   macro avg       0.77      0.79      0.78       294
weighted avg       0.77      0.79      0.78       294

Precision Score: 0.7715231788079471
Recall Score: 0.7925170068027211
F1 Score: 0.7818791946308725
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6309.43 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4876.65 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_best with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.242694,0.742105,0.758065,0.75
2,0.733700,0.181759,0.787709,0.758065,0.772603
3,0.188100,0.25571,0.79661,0.758065,0.77686
4,0.087900,0.283692,0.741627,0.833333,0.78481
5,0.054000,0.380063,0.768844,0.822581,0.794805
6,0.054000,0.423698,0.745098,0.817204,0.779487
7,0.023200,0.423066,0.75,0.822581,0.784615
8,0.007200,0.524198,0.755,0.811828,0.782383
9,0.004300,0.562123,0.75,0.790323,0.769634
10,0.002500,0.574963,0.758974,0.795699,0.776903


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4734.71 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.79      0.79      0.79       229

   micro avg       0.79      0.79      0.79       229
   macro avg       0.79      0.79      0.79       229
weighted avg       0.79      0.79      0.79       229

Precision Score: 0.7947598253275109
Recall Score: 0.7947598253275109
F1 Score: 0.7947598253275109
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6446.67 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4948.02 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_filtered_base_best with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.276601,0.797101,0.591398,0.679012
2,0.665300,0.207075,0.821429,0.741935,0.779661
3,0.174200,0.218857,0.820652,0.811828,0.816216
4,0.087900,0.241436,0.762791,0.88172,0.817955
5,0.048300,0.314045,0.781421,0.768817,0.775068
6,0.048300,0.424234,0.76,0.817204,0.787565
7,0.020400,0.504661,0.815476,0.736559,0.774011
8,0.009500,0.525661,0.773684,0.790323,0.781915
9,0.004700,0.486074,0.774359,0.811828,0.792651
10,0.001900,0.488192,0.77551,0.817204,0.795812


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4620.33 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.73      0.77      0.75       229

   micro avg       0.73      0.77      0.75       229
   macro avg       0.73      0.77      0.75       229
weighted avg       0.73      0.77      0.75       229

Precision Score: 0.731404958677686
Recall Score: 0.7729257641921398
F1 Score: 0.7515923566878981
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6354.59 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4906.70 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_last with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.329046,0.81295,0.607527,0.695385
2,0.737100,0.187419,0.796512,0.736559,0.765363
3,0.190000,0.259836,0.823899,0.704301,0.75942
4,0.095200,0.211095,0.747706,0.876344,0.806931
5,0.061700,0.370675,0.75,0.83871,0.791878
6,0.061700,0.412819,0.756219,0.817204,0.78553
7,0.021700,0.444019,0.771574,0.817204,0.793734
8,0.011100,0.497182,0.764706,0.83871,0.8
9,0.004100,0.535993,0.765,0.822581,0.792746
10,0.001800,0.522426,0.759804,0.833333,0.794872


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4632.08 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.75      0.80      0.77       229

   micro avg       0.75      0.80      0.77       229
   macro avg       0.75      0.80      0.77       229
weighted avg       0.75      0.80      0.77       229

Precision Score: 0.75
Recall Score: 0.7991266375545851
F1 Score: 0.7737843551797039
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6415.77 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5200.66 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for distilbert/distilbert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.172492,0.766234,0.776316,0.771242
2,0.258400,0.181641,0.768182,0.741228,0.754464
3,0.129800,0.225262,0.799087,0.767544,0.782998
4,0.061900,0.254332,0.763485,0.807018,0.784648
5,0.041300,0.3187,0.747036,0.828947,0.785863
6,0.041300,0.338674,0.782609,0.789474,0.786026
7,0.018500,0.358271,0.771186,0.798246,0.784483
8,0.008700,0.397171,0.738956,0.807018,0.771488
9,0.006300,0.3993,0.759494,0.789474,0.774194
10,0.003000,0.39751,0.774468,0.798246,0.786177


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4891.93 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.75      0.78       261

   micro avg       0.81      0.75      0.78       261
   macro avg       0.81      0.75      0.78       261
weighted avg       0.81      0.75      0.78       261

Precision Score: 0.8091286307053942
Recall Score: 0.7471264367816092
F1 Score: 0.7768924302788845
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5776.50 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4632.82 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for GerMedBERT/medbert-512 with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.189749,0.756906,0.665049,0.70801
2,2.039400,0.178356,0.748768,0.737864,0.743276
3,0.190800,0.214708,0.790576,0.73301,0.760705
4,0.083500,0.253717,0.746341,0.742718,0.744526
5,0.045600,0.436221,0.693694,0.747573,0.719626
6,0.045600,0.373632,0.728643,0.703883,0.716049
7,0.023700,0.479751,0.72381,0.737864,0.730769
8,0.011100,0.480134,0.732673,0.718447,0.72549
9,0.006400,0.4932,0.753846,0.713592,0.733167
10,0.002900,0.50371,0.747525,0.73301,0.740196


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4295.20 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.79      0.65      0.71       234

   micro avg       0.79      0.65      0.71       234
   macro avg       0.79      0.65      0.71       234
weighted avg       0.79      0.65      0.71       234

Precision Score: 0.7905759162303665
Recall Score: 0.6452991452991453
F1 Score: 0.7105882352941176
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5844.07 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4724.58 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for deepset/gbert-base with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.252091,0.753363,0.736842,0.745011
2,0.611200,0.206167,0.777778,0.828947,0.802548
3,0.175900,0.289843,0.792035,0.785088,0.788546
4,0.074300,0.253312,0.788136,0.815789,0.801724
5,0.050900,0.346724,0.807018,0.807018,0.807018
6,0.050900,0.440677,0.821101,0.785088,0.802691
7,0.021500,0.445545,0.795745,0.820175,0.807775
8,0.009200,0.500903,0.804348,0.811404,0.80786
9,0.004100,0.520252,0.807018,0.807018,0.807018
10,0.002300,0.526721,0.804348,0.811404,0.80786


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4359.72 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.74      0.80       261

   micro avg       0.85      0.74      0.80       261
   macro avg       0.85      0.74      0.80       261
weighted avg       0.85      0.74      0.80       261

Precision Score: 0.8546255506607929
Recall Score: 0.7432950191570882
F1 Score: 0.7950819672131147
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Model training with 5 epochs, using the F1 score to select which checkpoint is loaded as the best model
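
Selecting the best model by F1 means that, after training, the checkpoint from the epoch with the highest validation F1 is kept instead of the last one. With the Hugging Face `Trainer` this is usually configured through `TrainingArguments(load_best_model_at_end=True, metric_for_best_model="f1")`; whether `ate_model` is set up exactly like that is an assumption. The selection rule itself can be sketched in plain Python, here using the validation F1 values logged earlier for `google-bert/bert-base-german-cased` in the 10-epoch run. The function name `best_epoch` is illustrative.

```python
from typing import Dict

# Validation F1 per epoch, copied from the 10-epoch log of
# google-bert/bert-base-german-cased.
val_f1_by_epoch: Dict[int, float] = {
    1: 0.751648, 2: 0.796499, 3: 0.78355, 4: 0.7897, 5: 0.79386,
    6: 0.782427, 7: 0.797441, 8: 0.785408, 9: 0.780702, 10: 0.784314,
}


def best_epoch(metric_by_epoch: Dict[int, float], greater_is_better: bool = True) -> int:
    """Return the epoch whose metric is best (highest F1, or lowest loss)."""
    pick = max if greater_is_better else min
    return pick(metric_by_epoch, key=metric_by_epoch.get)


best = best_epoch(val_f1_by_epoch)
print(best, val_f1_by_epoch[best])

# Restricting the same log to its first five epochs changes the winner.
first_five = {epoch: f1 for epoch, f1 in val_f1_by_epoch.items() if epoch <= 5}
best_of_five = best_epoch(first_five)
print(best_of_five, first_five[best_of_five])
```

In this log the validation loss rises steadily after epoch 2 while F1 stays roughly flat, so choosing by F1 and choosing by loss would not necessarily pick the same checkpoint.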

In [5]:
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=5)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 2979.62 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3753.38 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for google-bert/bert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.16558,0.769912,0.74359,0.756522
2,0.231200,0.170855,0.781116,0.777778,0.779443
3,0.107000,0.242393,0.780172,0.773504,0.776824
4,0.032100,0.280931,0.775424,0.782051,0.778723
5,0.018700,0.319572,0.756198,0.782051,0.768908


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3327.34 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.82      0.78      0.80       268

   micro avg       0.82      0.78      0.80       268
   macro avg       0.82      0.78      0.80       268
weighted avg       0.82      0.78      0.80       268

Precision Score: 0.8221343873517787
Recall Score: 0.7761194029850746
F1 Score: 0.7984644913627639
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4569.19 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3782.67 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for dbmdz/bert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.16381,0.762931,0.776316,0.769565
2,0.248800,0.159659,0.776423,0.837719,0.805907
3,0.124800,0.241545,0.74502,0.820175,0.780793
4,0.055600,0.270153,0.771186,0.798246,0.784483
5,0.031600,0.308753,0.760331,0.807018,0.782979


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3568.22 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.82      0.75      0.78       261

   micro avg       0.82      0.75      0.78       261
   macro avg       0.82      0.75      0.78       261
weighted avg       0.82      0.75      0.78       261

Precision Score: 0.819327731092437
Recall Score: 0.7471264367816092
F1 Score: 0.781563126252505
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred 
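
Each run prints a class-weight dictionary such as `{0: 0.3725..., 1: 3.7051..., 2: 21.9474...}` for the labels `O`, `B-ASPECT` and `I-ASPECT`. These values are numerically consistent with inverse-frequency ("balanced") weighting, w_c = N / (K * n_c), where N is the number of labelled tokens, K the number of classes and n_c the count of class c. How `ate_model` actually derives the weights is not visible in this log, so the sketch below is an assumption. The token counts in `assumed_counts` are back-solved from the printed weights, not read from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# HYPOTHETICAL per-label token counts, chosen so that the formula reproduces
# the weights printed in the log; they are not taken from the data itself.
assumed_counts = {"O": 39240, "B-ASPECT": 3945, "I-ASPECT": 666}
labels = [label for label, n in assumed_counts.items() for _ in range(n)]

weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label:9s} {weight:.6f}")
```

With weights of this kind, an error on the rare `I-ASPECT` tag is penalised far more heavily than an error on `O`, which is a common way to counter the strong label imbalance in BIO tagging.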

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 3994.50 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3540.47 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for dbmdz/bert-base-german-uncased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.170243,0.774194,0.743363,0.758465
2,0.230900,0.186659,0.838235,0.756637,0.795349
3,0.113600,0.246048,0.82243,0.778761,0.8
4,0.045400,0.264298,0.776786,0.769912,0.773333
5,0.033500,0.277863,0.8,0.778761,0.789238


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3086.41 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.69      0.78       267

   micro avg       0.90      0.69      0.78       267
   macro avg       0.90      0.69      0.78       267
weighted avg       0.90      0.69      0.78       267

Precision Score: 0.9024390243902439
Recall Score: 0.6928838951310862
F1 Score: 0.7838983050847458
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4252.34 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3769.20 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for FacebookAI/xlm-roberta-base with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.186593,0.790698,0.784615,0.787645
2,0.275800,0.179816,0.783784,0.780769,0.782274
3,0.169600,0.203222,0.804,0.773077,0.788235
4,0.102200,0.204244,0.785965,0.861538,0.822018
5,0.078000,0.22393,0.796992,0.815385,0.806084


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3573.08 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.76      0.80      0.78       294

   micro avg       0.76      0.80      0.78       294
   macro avg       0.76      0.80      0.78       294
weighted avg       0.76      0.80      0.78       294

Precision Score: 0.762214983713355
Recall Score: 0.7959183673469388
F1 Score: 0.778702163061564
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred 

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4353.78 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3401.49 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for TUM/GottBERT_base_best with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.240714,0.714286,0.806452,0.757576
2,0.746000,0.198866,0.762162,0.758065,0.760108
3,0.192300,0.215613,0.783784,0.77957,0.781671
4,0.087500,0.244708,0.757009,0.870968,0.81
5,0.053500,0.293393,0.753623,0.83871,0.793893


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3600.17 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.77      0.77      0.77       229

   micro avg       0.77      0.77      0.77       229
   macro avg       0.77      0.77      0.77       229
weighted avg       0.77      0.77      0.77       229

Precision Score: 0.7685589519650655
Recall Score: 0.7685589519650655
F1 Score: 0.7685589519650655
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4415.38 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3145.95 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for TUM/GottBERT_filtered_base_best with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.212647,0.764368,0.715054,0.738889
2,0.664000,0.188561,0.8,0.795699,0.797844
3,0.171700,0.190545,0.806122,0.849462,0.827225
4,0.075400,0.218402,0.781095,0.844086,0.81137
5,0.045000,0.263866,0.802083,0.827957,0.814815


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3442.96 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.76      0.78      0.77       229

   micro avg       0.76      0.78      0.77       229
   macro avg       0.76      0.78      0.77       229
weighted avg       0.76      0.78      0.77       229

Precision Score: 0.7584745762711864
Recall Score: 0.7816593886462883
F1 Score: 0.7698924731182795
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4482.84 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3696.55 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for TUM/GottBERT_base_last with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.284252,0.809211,0.66129,0.727811
2,0.757000,0.193124,0.77095,0.741935,0.756164
3,0.190100,0.205109,0.776042,0.801075,0.78836
4,0.088900,0.196,0.790576,0.811828,0.801061
5,0.058200,0.258856,0.769608,0.844086,0.805128


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3259.97 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.80      0.78      0.79       229

   micro avg       0.80      0.78      0.79       229
   macro avg       0.80      0.78      0.79       229
weighted avg       0.80      0.78      0.79       229

Precision Score: 0.8026905829596412
Recall Score: 0.7816593886462883
F1 Score: 0.7920353982300885
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4796.58 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3805.57 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for distilbert/distilbert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.172081,0.770213,0.79386,0.781857
2,0.264400,0.181659,0.767857,0.754386,0.761062
3,0.136100,0.20704,0.795455,0.767544,0.78125
4,0.071000,0.232414,0.792952,0.789474,0.791209
5,0.054700,0.238745,0.797357,0.79386,0.795604


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3649.31 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.75      0.79       261

   micro avg       0.83      0.75      0.79       261
   macro avg       0.83      0.75      0.79       261
weighted avg       0.83      0.75      0.79       261

Precision Score: 0.8305084745762712
Recall Score: 0.7509578544061303
F1 Score: 0.7887323943661971
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4380.83 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3487.40 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for GerMedBERT/medbert-512 with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.185691,0.727749,0.674757,0.700252
2,2.058000,0.192995,0.727273,0.776699,0.751174
3,0.194400,0.217908,0.767677,0.737864,0.752475
4,0.082800,0.253167,0.756477,0.708738,0.73183
5,0.051900,0.308834,0.766497,0.73301,0.74938


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3343.38 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.80      0.72      0.76       234

   micro avg       0.80      0.72      0.76       234
   macro avg       0.80      0.72      0.76       234
weighted avg       0.80      0.72      0.76       234

Precision Score: 0.8
Recall Score: 0.717948717948718
F1 Score: 0.7567567567567569
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4443.42 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3615.56 examples/s]


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for deepset/gbert-base with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.257658,0.719828,0.732456,0.726087
2,0.619300,0.22563,0.764228,0.824561,0.793249
3,0.174400,0.246719,0.792952,0.789474,0.791209
4,0.075100,0.251072,0.788793,0.802632,0.795652
5,0.055900,0.319447,0.805195,0.815789,0.810458


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3471.77 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.72      0.77       261

   micro avg       0.83      0.72      0.77       261
   macro avg       0.83      0.72      0.77       261
weighted avg       0.83      0.72      0.77       261

Precision Score: 0.8289473684210527
Recall Score: 0.7241379310344828
F1 Score: 0.7730061349693252
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre
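
The test-set scores printed for the 5-epoch runs can be collected for a side-by-side comparison. The sketch below copies precision, recall and F1 from the log for the nine runs whose model name appears in this part of the output, checks that each reported F1 is the harmonic mean of its precision and recall, and ranks the models. `reported` and `harmonic_f1` are illustrative names. Each model was trained with a single seed pair (42, 42), so small differences in the ranking should be read with caution.

```python
# (precision, recall, F1) on the test set, copied from the 5-epoch output above.
reported = {
    "dbmdz/bert-base-german-cased": (0.819327731092437, 0.7471264367816092, 0.781563126252505),
    "dbmdz/bert-base-german-uncased": (0.9024390243902439, 0.6928838951310862, 0.7838983050847458),
    "FacebookAI/xlm-roberta-base": (0.762214983713355, 0.7959183673469388, 0.778702163061564),
    "TUM/GottBERT_base_best": (0.7685589519650655, 0.7685589519650655, 0.7685589519650655),
    "TUM/GottBERT_filtered_base_best": (0.7584745762711864, 0.7816593886462883, 0.7698924731182795),
    "TUM/GottBERT_base_last": (0.8026905829596412, 0.7816593886462883, 0.7920353982300885),
    "distilbert/distilbert-base-german-cased": (0.8305084745762712, 0.7509578544061303, 0.7887323943661971),
    "GerMedBERT/medbert-512": (0.8, 0.717948717948718, 0.7567567567567569),
    "deepset/gbert-base": (0.8289473684210527, 0.7241379310344828, 0.7730061349693252),
}


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rank by reported test F1, highest first.
ranking = sorted(reported, key=lambda name: reported[name][2], reverse=True)
for name in ranking:
    precision, recall, f1 = reported[name]
    # Sanity check: the logged F1 agrees with the logged precision and recall.
    assert abs(harmonic_f1(precision, recall) - f1) < 1e-9
    print(f"{f1:.4f}  P={precision:.4f}  R={recall:.4f}  {name}")
```

Sorting on F1 alone hides the precision/recall trade-off, which is why the sketch prints both: `dbmdz/bert-base-german-uncased`, for example, has the highest precision of these runs but the lowest recall.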

In [6]:
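# Repeat the model comparison with 7 training epochs.
# rn1 and rn2 are the two random seeds reported in the output below.
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=7)
    print()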

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5774.07 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4642.36 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for google-bert/bert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.181461,0.781659,0.764957,0.773218
2,0.233300,0.188964,0.766129,0.811966,0.788382
3,0.105400,0.256114,0.7875,0.807692,0.797468
4,0.032300,0.287511,0.772532,0.769231,0.770878
5,0.018100,0.389646,0.782051,0.782051,0.782051
6,0.018100,0.41296,0.744856,0.773504,0.75891
7,0.005700,0.42783,0.760331,0.786325,0.773109


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4380.56 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.80      0.75      0.77       268

   micro avg       0.80      0.75      0.77       268
   macro avg       0.80      0.75      0.77       268
weighted avg       0.80      0.75      0.77       268

Precision Score: 0.7976190476190477
Recall Score: 0.75
F1 Score: 0.7730769230769232
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'
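
In the per-epoch table of this 7-epoch run, the validation loss is lowest after the first epoch and rises afterwards while the training loss keeps falling, whereas validation F1 peaks in the middle of the run. The sketch below shows one way to pick a checkpoint epoch from such a table. The table text is copied from the output above; `best_epoch` is a hypothetical helper and not part of the notebook's `ate_model` function.

```python
import csv
import io

# Per-epoch validation results for google-bert/bert-base-german-cased (7 epochs).
log_table = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.181461,0.781659,0.764957,0.773218
2,0.233300,0.188964,0.766129,0.811966,0.788382
3,0.105400,0.256114,0.7875,0.807692,0.797468
4,0.032300,0.287511,0.772532,0.769231,0.770878
5,0.018100,0.389646,0.782051,0.782051,0.782051
6,0.018100,0.41296,0.744856,0.773504,0.75891
7,0.005700,0.42783,0.760331,0.786325,0.773109
"""


def best_epoch(table_text, column, highest=True):
    """Return the epoch whose value in `column` is highest (or lowest)."""
    rows = list(csv.DictReader(io.StringIO(table_text)))
    pick = max if highest else min
    best = pick(rows, key=lambda row: float(row[column]))
    return int(best["Epoch"])


epoch_by_f1 = best_epoch(log_table, "F1")
epoch_by_loss = best_epoch(log_table, "Validation Loss", highest=False)
print("best epoch by validation F1:  ", epoch_by_f1)
print("best epoch by validation loss:", epoch_by_loss)
```

The two criteria disagree here, so the choice of selection metric matters. With the Hugging Face `Trainer`, a comparable selection is usually configured through `TrainingArguments(load_best_model_at_end=True, metric_for_best_model=...)`; whether and how `ate_model` sets these options is not shown in this output.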

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5874.66 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4715.95 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.173228,0.738956,0.807018,0.771488
2,0.246900,0.173769,0.784483,0.798246,0.791304
3,0.124500,0.201311,0.795833,0.837719,0.816239
4,0.053400,0.260093,0.745174,0.846491,0.792608
5,0.032400,0.311122,0.764228,0.824561,0.793249
6,0.032400,0.334307,0.77459,0.828947,0.800847
7,0.016300,0.345661,0.784232,0.828947,0.80597


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4499.76 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.77      0.80       261

   micro avg       0.83      0.77      0.80       261
   macro avg       0.83      0.77      0.80       261
weighted avg       0.83      0.77      0.80       261

Precision Score: 0.8264462809917356
Recall Score: 0.7662835249042146
F1 Score: 0.7952286282306162
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5604.30 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4624.47 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-uncased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.177574,0.755656,0.738938,0.747204
2,0.230200,0.1874,0.816038,0.765487,0.789954
3,0.109600,0.243596,0.814286,0.756637,0.784404
4,0.043100,0.250718,0.789954,0.765487,0.777528
5,0.024700,0.318618,0.770925,0.774336,0.772627
6,0.024700,0.364562,0.779817,0.752212,0.765766
7,0.009300,0.383791,0.785388,0.761062,0.773034


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4466.50 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.74      0.80       267

   micro avg       0.87      0.74      0.80       267
   macro avg       0.87      0.74      0.80       267
weighted avg       0.87      0.74      0.80       267

Precision Score: 0.8716814159292036
Recall Score: 0.7378277153558053
F1 Score: 0.799188640973631
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5704.30 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4951.61 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for FacebookAI/xlm-roberta-base with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.200126,0.790123,0.738462,0.763419
2,0.285700,0.216824,0.746324,0.780769,0.763158
3,0.189100,0.206973,0.776515,0.788462,0.782443
4,0.119200,0.197361,0.785441,0.788462,0.786948
5,0.081400,0.235169,0.751701,0.85,0.797834
6,0.081400,0.238651,0.792115,0.85,0.820037
7,0.051900,0.252218,0.78853,0.846154,0.816327


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4448.06 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.75      0.78       294

   micro avg       0.81      0.75      0.78       294
   macro avg       0.81      0.75      0.78       294
weighted avg       0.81      0.75      0.78       294

Precision Score: 0.8125
Recall Score: 0.7517006802721088
F1 Score: 0.7809187279151942
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6382.33 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4945.36 examples/s]


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_best with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.266336,0.761905,0.774194,0.768
2,0.724600,0.184548,0.789773,0.747312,0.767956
3,0.184900,0.196833,0.770053,0.774194,0.772118
4,0.082800,0.215874,0.784038,0.897849,0.837093
5,0.049900,0.31554,0.775229,0.908602,0.836634
6,0.049900,0.312912,0.779412,0.854839,0.815385
7,0.020300,0.340441,0.795918,0.83871,0.816754


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4601.19 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.79      0.79      0.79       229

   micro avg       0.79      0.79      0.79       229
   macro avg       0.79      0.79      0.79       229
weighted avg       0.79      0.79      0.79       229

Precision Score: 0.793859649122807
Recall Score: 0.7903930131004366
F1 Score: 0.7921225382932167
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6487.29 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4916.15 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_filtered_base_best with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.245944,0.769231,0.698925,0.732394
2,0.658800,0.184877,0.841808,0.801075,0.820937
3,0.176300,0.209376,0.760563,0.870968,0.81203
4,0.080700,0.190812,0.75576,0.88172,0.813896
5,0.044500,0.300964,0.758621,0.827957,0.791774
6,0.044500,0.362863,0.777202,0.806452,0.791557
7,0.023400,0.377163,0.773684,0.790323,0.781915


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4740.32 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.73      0.79       229

   micro avg       0.86      0.73      0.79       229
   macro avg       0.86      0.73      0.79       229
weighted avg       0.86      0.73      0.79       229

Precision Score: 0.8615384615384616
Recall Score: 0.7336244541484717
F1 Score: 0.7924528301886793
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6412.06 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4915.30 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_last with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.229832,0.760234,0.698925,0.728291
2,0.746900,0.179568,0.765957,0.774194,0.770053
3,0.192000,0.219577,0.776119,0.83871,0.806202
4,0.094900,0.244077,0.775,0.833333,0.803109
5,0.054800,0.366791,0.756098,0.833333,0.792839
6,0.054800,0.374622,0.766497,0.811828,0.788512
7,0.019000,0.401523,0.763819,0.817204,0.78961


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4660.13 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.79      0.81       229

   micro avg       0.83      0.79      0.81       229
   macro avg       0.83      0.79      0.81       229
weighted avg       0.83      0.79      0.81       229

Precision Score: 0.8256880733944955
Recall Score: 0.7860262008733624
F1 Score: 0.8053691275167785
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6512.07 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5138.21 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for distilbert/distilbert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.177223,0.755365,0.77193,0.763557
2,0.257800,0.181633,0.779736,0.776316,0.778022
3,0.133200,0.220919,0.816901,0.763158,0.789116
4,0.065100,0.240712,0.816514,0.780702,0.798206
5,0.045200,0.264957,0.802752,0.767544,0.784753
6,0.045200,0.283065,0.80543,0.780702,0.792873
7,0.022400,0.292062,0.786667,0.776316,0.781457


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4791.24 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.71      0.76       261

   micro avg       0.81      0.71      0.76       261
   macro avg       0.81      0.71      0.76       261
weighted avg       0.81      0.71      0.76       261

Precision Score: 0.8149779735682819
Recall Score: 0.7088122605363985
F1 Score: 0.7581967213114754
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5729.62 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4574.93 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for GerMedBERT/medbert-512 with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.189177,0.755435,0.674757,0.712821
2,2.047700,0.191705,0.754717,0.776699,0.76555
3,0.195100,0.220584,0.773399,0.762136,0.767726
4,0.082300,0.248508,0.754902,0.747573,0.75122
5,0.044900,0.361298,0.769608,0.762136,0.765854
6,0.044900,0.39702,0.781421,0.694175,0.735219
7,0.021600,0.423888,0.774869,0.718447,0.745592


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4219.29 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.79      0.70      0.74       234

   micro avg       0.79      0.70      0.74       234
   macro avg       0.79      0.70      0.74       234
weighted avg       0.79      0.70      0.74       234

Precision Score: 0.7912621359223301
Recall Score: 0.6965811965811965
F1 Score: 0.7409090909090909
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5761.48 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4772.43 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for deepset/gbert-base with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.252348,0.721311,0.77193,0.745763
2,0.622400,0.235447,0.7713,0.754386,0.762749
3,0.173400,0.28424,0.783186,0.776316,0.779736
4,0.072400,0.307109,0.785088,0.785088,0.785088
5,0.057100,0.39086,0.791111,0.780702,0.785872
6,0.057100,0.407471,0.787879,0.798246,0.793028
7,0.023100,0.448715,0.79646,0.789474,0.792952


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4458.91 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.84      0.73      0.78       261

   micro avg       0.84      0.73      0.78       261
   macro avg       0.84      0.73      0.78       261
weighted avg       0.84      0.73      0.78       261

Precision Score: 0.8377192982456141
Recall Score: 0.7318007662835249
F1 Score: 0.7811860940695297
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre
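
The three summary scores printed after each classification report are linked by the harmonic-mean definition F1 = 2 * P * R / (P + R). As a quick sanity check, the snippet below recomputes F1 from the precision and recall logged for deepset/gbert-base with 7 epochs. The function name `f1_from_pr` is illustrative and not part of the notebook's code.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the deepset/gbert-base run with 7 epochs.
precision = 0.8377192982456141
recall = 0.7318007662835249

f1 = f1_from_pr(precision, recall)
print(f1)
```

The result agrees with the logged F1 score of 0.7811860940695297 up to floating-point rounding.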

In [7]:
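# Repeat the standard ATE run for every checkpoint, now with 8 training epochs.
# rn1 and rn2 are the two random seeds echoed in the log output.
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=8)
    print()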

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5718.76 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4592.44 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for google-bert/bert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.170733,0.760684,0.760684,0.760684
2,0.228800,0.175184,0.809735,0.782051,0.795652
3,0.105000,0.257495,0.749035,0.82906,0.787018
4,0.032600,0.294443,0.771429,0.807692,0.789144
5,0.019100,0.402445,0.8,0.752137,0.77533
6,0.019100,0.421581,0.784483,0.777778,0.781116
7,0.005400,0.459147,0.797297,0.75641,0.776316
8,0.001100,0.459163,0.798206,0.760684,0.778993


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4340.83 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.74      0.77       268

   micro avg       0.81      0.74      0.77       268
   macro avg       0.81      0.74      0.77       268
weighted avg       0.81      0.74      0.77       268

Precision Score: 0.8089430894308943
Recall Score: 0.7425373134328358
F1 Score: 0.77431906614786
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred 
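
In the per-epoch table for google-bert/bert-base-german-cased with 8 epochs, the validation loss rises after the first epoch while the training loss keeps falling, so the last epoch is not necessarily the best one. The following is a minimal sketch for reading the best epoch off such a table using only the standard library. The rows are copied from the log above; the helper name `best_epoch` is hypothetical and not part of `ate_model`.

```python
import csv
import io

# Per-epoch validation metrics copied from the training log above.
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.170733,0.760684,0.760684,0.760684
2,0.228800,0.175184,0.809735,0.782051,0.795652
3,0.105000,0.257495,0.749035,0.82906,0.787018
4,0.032600,0.294443,0.771429,0.807692,0.789144
5,0.019100,0.402445,0.8,0.752137,0.77533
6,0.019100,0.421581,0.784483,0.777778,0.781116
7,0.005400,0.459147,0.797297,0.75641,0.776316
8,0.001100,0.459163,0.798206,0.760684,0.778993
"""


def best_epoch(log_text, metric="F1", higher_is_better=True):
    """Return (epoch, value) for the row with the best value of `metric`."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda row: float(row[metric]))
    return int(best["Epoch"]), float(best[metric])


best_f1 = best_epoch(LOG)
lowest_loss = best_epoch(LOG, metric="Validation Loss", higher_is_better=False)
print("best F1:", best_f1)
print("lowest validation loss:", lowest_loss)
```

For this run the two criteria disagree by one epoch, which is worth keeping in mind when comparing the 7-epoch and 8-epoch results across models.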

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5645.81 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4705.90 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.172412,0.744856,0.79386,0.768577
2,0.241300,0.184421,0.767932,0.798246,0.782796
3,0.119200,0.255241,0.799127,0.802632,0.800875
4,0.046100,0.286976,0.736,0.807018,0.769874
5,0.025100,0.369435,0.719231,0.820175,0.766393
6,0.025100,0.395934,0.785714,0.723684,0.753425
7,0.011000,0.388833,0.77533,0.77193,0.773626
8,0.008100,0.381444,0.756303,0.789474,0.772532


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4444.60 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.74      0.79       261

   micro avg       0.85      0.74      0.79       261
   macro avg       0.85      0.74      0.79       261
weighted avg       0.85      0.74      0.79       261

Precision Score: 0.8458149779735683
Recall Score: 0.735632183908046
F1 Score: 0.7868852459016393
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5467.17 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4506.89 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-uncased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.170334,0.770642,0.743363,0.756757
2,0.227700,0.18906,0.792627,0.761062,0.776524
3,0.115100,0.262753,0.830189,0.778761,0.803653
4,0.047300,0.270339,0.789238,0.778761,0.783964
5,0.023300,0.3297,0.776744,0.738938,0.75737
6,0.023300,0.371579,0.784753,0.774336,0.77951
7,0.008600,0.376573,0.772926,0.783186,0.778022
8,0.003600,0.373185,0.777778,0.80531,0.791304


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4416.56 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.89      0.70      0.78       267

   micro avg       0.89      0.70      0.78       267
   macro avg       0.89      0.70      0.78       267
weighted avg       0.89      0.70      0.78       267

Precision Score: 0.8867924528301887
Recall Score: 0.704119850187266
F1 Score: 0.7849686847599165
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5741.87 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4931.03 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for FacebookAI/xlm-roberta-base with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.195258,0.760456,0.769231,0.764818
2,0.262500,0.174126,0.786477,0.85,0.817006
3,0.171000,0.196249,0.8,0.815385,0.807619
4,0.107500,0.234222,0.780576,0.834615,0.806691
5,0.071900,0.27417,0.78777,0.842308,0.814126
6,0.071900,0.324814,0.794466,0.773077,0.783626
7,0.039400,0.341685,0.791667,0.803846,0.79771
8,0.022300,0.33877,0.781022,0.823077,0.801498


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4487.91 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.75      0.77      0.76       294

   micro avg       0.75      0.77      0.76       294
   macro avg       0.75      0.77      0.76       294
weighted avg       0.75      0.77      0.76       294

Precision Score: 0.7475083056478405
Recall Score: 0.7653061224489796
F1 Score: 0.7563025210084033
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5823.48 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4935.74 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_best with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.254853,0.807453,0.698925,0.74928
2,0.715300,0.197798,0.797688,0.741935,0.768802
3,0.184700,0.218726,0.777778,0.790323,0.784
4,0.086200,0.25485,0.753363,0.903226,0.821516
5,0.056400,0.373603,0.75122,0.827957,0.787724
6,0.056400,0.379446,0.754717,0.860215,0.80402
7,0.020500,0.422208,0.770732,0.849462,0.808184
8,0.006500,0.426926,0.781095,0.844086,0.81137


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4666.90 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.76      0.78      0.77       229

   micro avg       0.76      0.78      0.77       229
   macro avg       0.76      0.78      0.77       229
weighted avg       0.76      0.78      0.77       229

Precision Score: 0.7639484978540773
Recall Score: 0.777292576419214
F1 Score: 0.7705627705627706
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6381.78 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4876.59 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_filtered_base_best with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.252846,0.802817,0.612903,0.695122
2,0.661900,0.187913,0.815476,0.736559,0.774011
3,0.174200,0.212021,0.752427,0.833333,0.790816
4,0.089500,0.203339,0.745283,0.849462,0.79397
5,0.043600,0.360286,0.763819,0.817204,0.78961
6,0.043600,0.368892,0.753769,0.806452,0.779221
7,0.017900,0.432983,0.762626,0.811828,0.786458
8,0.009700,0.475443,0.776042,0.801075,0.78836


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4710.26 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.75      0.78      0.76       229

   micro avg       0.75      0.78      0.76       229
   macro avg       0.75      0.78      0.76       229
weighted avg       0.75      0.78      0.76       229

Precision Score: 0.7489539748953975
Recall Score: 0.7816593886462883
F1 Score: 0.7649572649572649
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6447.74 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4848.24 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_last with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.274411,0.801471,0.586022,0.677019
2,0.735400,0.174878,0.742424,0.790323,0.765625
3,0.192600,0.231019,0.808511,0.817204,0.812834
4,0.087700,0.241931,0.761905,0.860215,0.808081
5,0.056200,0.368433,0.768473,0.83871,0.802057
6,0.056200,0.397673,0.778947,0.795699,0.787234
7,0.020500,0.423944,0.764423,0.854839,0.807107
8,0.011100,0.434345,0.777228,0.844086,0.809278


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4665.88 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.77      0.79       229

   micro avg       0.81      0.77      0.79       229
   macro avg       0.81      0.77      0.79       229
weighted avg       0.81      0.77      0.79       229

Precision Score: 0.8082191780821918
Recall Score: 0.7729257641921398
F1 Score: 0.7901785714285714
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6514.51 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5246.58 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for distilbert/distilbert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.181537,0.752137,0.77193,0.761905
2,0.258900,0.188958,0.769231,0.745614,0.757238
3,0.132200,0.225596,0.789954,0.758772,0.774049
4,0.064500,0.243682,0.773504,0.79386,0.78355
5,0.045400,0.283247,0.787879,0.798246,0.793028
6,0.045400,0.298365,0.78355,0.79386,0.788671
7,0.022200,0.320788,0.79476,0.798246,0.796499
8,0.013100,0.325516,0.792208,0.802632,0.797386


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4865.62 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.75      0.78       261

   micro avg       0.81      0.75      0.78       261
   macro avg       0.81      0.75      0.78       261
weighted avg       0.81      0.75      0.78       261

Precision Score: 0.8106995884773662
Recall Score: 0.7547892720306514
F1 Score: 0.7817460317460319
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5728.29 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4571.72 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for GerMedBERT/medbert-512 with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.191483,0.751381,0.660194,0.702842
2,2.044100,0.187153,0.736364,0.786408,0.760563
3,0.194000,0.228407,0.771028,0.800971,0.785714
4,0.083100,0.257542,0.732673,0.718447,0.72549
5,0.046400,0.416212,0.726027,0.771845,0.748235
6,0.046400,0.379,0.773196,0.728155,0.75
7,0.024200,0.439008,0.757426,0.742718,0.75
8,0.012000,0.457236,0.741463,0.737864,0.739659




There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4277.38 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.79      0.74      0.76       234

   micro avg       0.79      0.74      0.76       234
   macro avg       0.79      0.74      0.76       234
weighted avg       0.79      0.74      0.76       234

Precision Score: 0.7853881278538812
Recall Score: 0.7350427350427351
F1 Score: 0.7593818984547461
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5721.92 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4602.72 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for deepset/gbert-base with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.24054,0.738739,0.719298,0.728889
2,0.614500,0.228364,0.748,0.820175,0.782427
3,0.170300,0.240929,0.796537,0.807018,0.801743
4,0.071400,0.246625,0.798283,0.815789,0.806941
5,0.043100,0.346854,0.786008,0.837719,0.81104
6,0.043100,0.375973,0.804348,0.811404,0.80786
7,0.018400,0.431386,0.817778,0.807018,0.812362
8,0.009300,0.437655,0.802575,0.820175,0.81128


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4398.26 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.72      0.78       261

   micro avg       0.85      0.72      0.78       261
   macro avg       0.85      0.72      0.78       261
weighted avg       0.85      0.72      0.78       261

Precision Score: 0.8506787330316742
Recall Score: 0.7203065134099617
F1 Score: 0.7800829875518672
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre
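
The class-weight dictionaries printed in the runs above (values 21.947, 0.3725 and 3.705 for `I-ASPECT`, `O` and `B-ASPECT`) match what the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), would give for 13080 `O`, 1315 `B-ASPECT` and 222 `I-ASPECT` tokens. Those counts are inferred from the printed weights and are an assumption, as is the idea that `ate_model` uses this heuristic; its implementation is not shown in this notebook. A minimal sketch in plain Python, using the hypothetical helper `balanced_class_weights`:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count(label))}."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Token counts inferred from the printed weights (assumption, not read from the CSV files).
token_counts = {"O": 13080, "B-ASPECT": 1315, "I-ASPECT": 222}

# Expand the counts into a flat label list, as a tagger's training labels would look.
labels = [label for label, n in token_counts.items() for _ in range(n)]

weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label:9s} {weight:.6f}")
```

The rare `I-ASPECT` tag receives a weight roughly 59 times larger than `O`, which is why errors on multi-token aspects dominate a weighted loss.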

In [6]:
# Repeat the model sweep with 6 training epochs; both random seeds (rn1, rn2) stay at 42.
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=6)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4358.65 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3509.38 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for google-bert/bert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.176338,0.758333,0.777778,0.767932
2,0.231300,0.175735,0.807692,0.807692,0.807692
3,0.108800,0.222077,0.787149,0.837607,0.811594
4,0.034500,0.249603,0.794118,0.807692,0.800847
5,0.021800,0.335527,0.800885,0.773504,0.786957
6,0.021800,0.343859,0.792035,0.764957,0.778261


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3058.21 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.82      0.76      0.79       268

   micro avg       0.82      0.76      0.79       268
   macro avg       0.82      0.76      0.79       268
weighted avg       0.82      0.76      0.79       268

Precision Score: 0.8192771084337349
Recall Score: 0.7611940298507462
F1 Score: 0.7891682785299807
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre
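
The label array printed in this run is `['O' 'B-ASPECT' 'I-ASPECT']`, while the earlier sweep printed `['I-ASPECT' 'O' 'B-ASPECT']`. The weights follow the labels in both cases, so each run is internally consistent, but the integer ids assigned to the tags differ between runs. How `ate_model` derives this order is not shown here. If a stable mapping is wanted (for example, to reuse a saved checkpoint in another session), one option is to fix the order explicitly. A minimal sketch, in which `LABELS` and `remap_weights` are hypothetical names chosen for illustration:

```python
# Fixed tag order chosen for illustration; any order works as long as it never changes.
LABELS = ["O", "B-ASPECT", "I-ASPECT"]

label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}


def remap_weights(weights_by_label):
    """Re-key a {label: weight} dict by integer id using the fixed mapping."""
    return {label2id[label]: weight for label, weight in weights_by_label.items()}


# Weights as printed in the logs, keyed by tag name instead of run-dependent ids.
weights_by_label = {
    "I-ASPECT": 21.94744744744745,
    "O": 0.3725025484199796,
    "B-ASPECT": 3.705196451204056,
}

weights_by_id = remap_weights(weights_by_label)
print(weights_by_id)
```

Keying the weights by tag name first and converting to ids last keeps the weight attached to the right class regardless of the order in which the labels were discovered.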

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4437.65 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3698.26 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for dbmdz/bert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.167944,0.766234,0.776316,0.771242
2,0.241600,0.174428,0.757322,0.79386,0.775161
3,0.119400,0.237688,0.766667,0.807018,0.786325
4,0.049000,0.263297,0.730924,0.798246,0.763103
5,0.030600,0.357499,0.742857,0.798246,0.769556
6,0.030600,0.365126,0.742739,0.785088,0.763326


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3145.47 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.75      0.79       261

   micro avg       0.83      0.75      0.79       261
   macro avg       0.83      0.75      0.79       261
weighted avg       0.83      0.75      0.79       261

Precision Score: 0.8340425531914893
Recall Score: 0.7509578544061303
F1 Score: 0.7903225806451614
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre
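
In the per-epoch table for `dbmdz/bert-base-german-cased` above, validation loss is lowest in epoch 1 and rises afterwards, while validation F1 peaks in epoch 3. The sketch below reads that table with the standard `csv` module and reports both epochs. It only illustrates how to compare the two criteria; which checkpoint `ate_model` actually evaluates on the test set is not shown in this notebook, and the variable names are hypothetical.

```python
import csv
import io

# Per-epoch validation log copied from the dbmdz/bert-base-german-cased run (6 epochs).
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.167944,0.766234,0.776316,0.771242
2,0.241600,0.174428,0.757322,0.79386,0.775161
3,0.119400,0.237688,0.766667,0.807018,0.786325
4,0.049000,0.263297,0.730924,0.798246,0.763103
5,0.030600,0.357499,0.742857,0.798246,0.769556
6,0.030600,0.365126,0.742739,0.785088,0.763326
"""

rows = list(csv.DictReader(io.StringIO(LOG)))

# Two common selection criteria; they disagree for this run.
best_f1_row = max(rows, key=lambda row: float(row["F1"]))
lowest_loss_row = min(rows, key=lambda row: float(row["Validation Loss"]))

print("best F1 at epoch", best_f1_row["Epoch"], "F1 =", best_f1_row["F1"])
print("lowest validation loss at epoch", lowest_loss_row["Epoch"],
      "loss =", lowest_loss_row["Validation Loss"])
```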

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 3892.63 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3356.93 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for dbmdz/bert-base-german-uncased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.182595,0.756637,0.756637,0.756637
2,0.233800,0.197348,0.778281,0.761062,0.769575
3,0.117300,0.23348,0.784753,0.774336,0.77951
4,0.043600,0.256252,0.72541,0.783186,0.753191
5,0.029900,0.337623,0.761261,0.747788,0.754464
6,0.029900,0.35067,0.747826,0.761062,0.754386


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3423.81 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.72      0.79       267

   micro avg       0.88      0.72      0.79       267
   macro avg       0.88      0.72      0.79       267
weighted avg       0.88      0.72      0.79       267

Precision Score: 0.8807339449541285
Recall Score: 0.7191011235955056
F1 Score: 0.7917525773195877
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4144.66 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3237.09 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for FacebookAI/xlm-roberta-base with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.16706,0.759124,0.8,0.779026
2,0.268900,0.185781,0.772388,0.796154,0.784091
3,0.177000,0.177962,0.771127,0.842308,0.805147
4,0.101700,0.206347,0.738983,0.838462,0.785586
5,0.079700,0.233393,0.779783,0.830769,0.804469
6,0.079700,0.252703,0.779359,0.842308,0.809612


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3381.70 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.78      0.74      0.76       294

   micro avg       0.78      0.74      0.76       294
   macro avg       0.78      0.74      0.76       294
weighted avg       0.78      0.74      0.76       294

Precision Score: 0.7833935018050542
Recall Score: 0.7380952380952381
F1 Score: 0.7600700525394045
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4471.20 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3650.46 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for TUM/GottBERT_base_best with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.214263,0.717073,0.790323,0.751918
2,0.740700,0.210448,0.755556,0.731183,0.743169
3,0.179200,0.222889,0.809249,0.752688,0.779944
4,0.077100,0.226247,0.742081,0.88172,0.805897
5,0.046500,0.330229,0.747619,0.844086,0.792929
6,0.046500,0.35783,0.73301,0.811828,0.770408


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3564.27 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.77      0.79       229

   micro avg       0.81      0.77      0.79       229
   macro avg       0.81      0.77      0.79       229
weighted avg       0.81      0.77      0.79       229

Precision Score: 0.8082191780821918
Recall Score: 0.7729257641921398
F1 Score: 0.7901785714285714
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4517.80 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3656.13 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for TUM/GottBERT_filtered_base_best with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.294401,0.809524,0.639785,0.714715
2,0.666200,0.209947,0.789189,0.784946,0.787062
3,0.177900,0.216284,0.791878,0.83871,0.814621
4,0.081600,0.2317,0.753425,0.887097,0.814815
5,0.041400,0.355718,0.737327,0.860215,0.794045
6,0.041400,0.351964,0.751196,0.844086,0.794937


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3587.79 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.75      0.77      0.76       229

   micro avg       0.75      0.77      0.76       229
   macro avg       0.75      0.77      0.76       229
weighted avg       0.75      0.77      0.76       229

Precision Score: 0.75
Recall Score: 0.7729257641921398
F1 Score: 0.7612903225806452
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4482.37 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3570.49 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for TUM/GottBERT_base_last with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.259857,0.766082,0.704301,0.733894
2,0.737500,0.206259,0.751351,0.747312,0.749326
3,0.181500,0.257704,0.807229,0.72043,0.761364
4,0.083000,0.222824,0.772727,0.822581,0.796875
5,0.052600,0.32544,0.773399,0.844086,0.807198
6,0.052600,0.347636,0.777228,0.844086,0.809278


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3511.12 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.80      0.76      0.78       229

   micro avg       0.80      0.76      0.78       229
   macro avg       0.80      0.76      0.78       229
weighted avg       0.80      0.76      0.78       229

Precision Score: 0.7981651376146789
Recall Score: 0.759825327510917
F1 Score: 0.7785234899328859
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4796.99 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3762.34 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for distilbert/distilbert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.178517,0.772321,0.758772,0.765487
2,0.256700,0.178241,0.753304,0.75,0.751648
3,0.131700,0.219524,0.775785,0.758772,0.767184
4,0.065200,0.24803,0.780172,0.79386,0.786957
5,0.047000,0.272421,0.791111,0.780702,0.785872
6,0.047000,0.283957,0.787879,0.798246,0.793028


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3628.23 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.77      0.80       261

   micro avg       0.83      0.77      0.80       261
   macro avg       0.83      0.77      0.80       261
weighted avg       0.83      0.77      0.80       261

Precision Score: 0.8271604938271605
Recall Score: 0.7701149425287356
F1 Score: 0.7976190476190476
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4149.00 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3513.57 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for GerMedBERT/medbert-512 with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.18772,0.73822,0.684466,0.710327
2,2.052400,0.186301,0.747664,0.776699,0.761905
3,0.193000,0.214645,0.794872,0.752427,0.773067
4,0.081700,0.247427,0.743961,0.747573,0.745763
5,0.047700,0.36521,0.753623,0.757282,0.755448
6,0.047700,0.368026,0.769608,0.762136,0.765854


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3179.63 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.67      0.73       234

   micro avg       0.81      0.67      0.73       234
   macro avg       0.81      0.67      0.73       234
weighted avg       0.81      0.67      0.73       234

Precision Score: 0.8051282051282052
Recall Score: 0.6709401709401709
F1 Score: 0.7319347319347319
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 4465.67 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 3541.69 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: 0.3725025484199796, 1: 3.705196451204056, 2: 21.94744744744745}
Training results for deepset/gbert-base with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.243179,0.735426,0.719298,0.727273
2,0.621300,0.223387,0.762115,0.758772,0.76044
3,0.171800,0.256678,0.795455,0.767544,0.78125
4,0.074500,0.243001,0.796537,0.807018,0.801743
5,0.044000,0.363324,0.794643,0.780702,0.787611
6,0.044000,0.388545,0.809091,0.780702,0.794643


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 3409.03 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.80      0.74      0.77       261

   micro avg       0.80      0.74      0.77       261
   macro avg       0.80      0.74      0.77       261
weighted avg       0.80      0.74      0.77       261

Precision Score: 0.8
Recall Score: 0.735632183908046
F1 Score: 0.7664670658682634
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 
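
The report above appears to be entity-level (the imports include `seqeval`): a predicted aspect only counts if its whole span matches a gold span. The sketch below is a simplified pure-Python stand-in for that idea, not seqeval's implementation. `extract_spans` and `span_prf` are hypothetical helper names, and orphan `I-` tags are ignored here, which seqeval may handle differently. The counts 192 / 240 / 261 are inferred from the printed precision, recall, and support; they are not read from the run itself.

```python
def extract_spans(tags):
    """Return the set of (start, end) spans in one BIO-tagged sentence.

    Simplification: an I- tag without a preceding B- tag is ignored.
    """
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def span_prf(true_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Sentence 1 mirrors the first logged example: no gold aspect, one predicted
# aspect at position 6 ("Operation"), i.e. a false positive.
# Sentence 2 is a made-up toy sentence with one correctly predicted span.
true_seqs = [["O"] * 13, ["O", "B-ASPECT", "I-ASPECT", "O"]]
pred_seqs = [["O"] * 6 + ["B-ASPECT"] + ["O"] * 6, ["O", "B-ASPECT", "I-ASPECT", "O"]]
precision, recall, f1 = span_prf(true_seqs, pred_seqs)
print(precision, recall, f1)

# Inferred counts for the deepset/gbert-base test report above:
# 192 correct spans, 240 predicted spans, 261 gold spans.
tp, n_pred, n_true = 192, 240, 261
print(tp / n_pred, tp / n_true, 2 * tp / (n_pred + n_true))
```

The second `print` reproduces the logged precision, recall, and F1 up to floating-point rounding, which suggests the three numbers are mutually consistent.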

In [8]:
# Fine-tune each checkpoint on standard ATE with epochs=12 and fixed random seeds (rn1, rn2).
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=12)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5762.51 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4603.77 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for google-bert/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.1829,0.774336,0.747863,0.76087
2,0.227500,0.184631,0.818182,0.769231,0.792952
3,0.100900,0.264567,0.779528,0.846154,0.811475
4,0.033000,0.325516,0.754167,0.773504,0.763713
5,0.018000,0.420455,0.79386,0.773504,0.78355
6,0.018000,0.445054,0.799107,0.764957,0.781659
7,0.006300,0.478515,0.790393,0.773504,0.781857
8,0.001800,0.464537,0.787611,0.760684,0.773913
9,0.001700,0.48557,0.801802,0.760684,0.780702
10,0.000500,0.521621,0.776824,0.773504,0.775161


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4197.35 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.82      0.78      0.80       268

   micro avg       0.82      0.78      0.80       268
   macro avg       0.82      0.78      0.80       268
weighted avg       0.82      0.78      0.80       268

Precision Score: 0.8221343873517787
Recall Score: 0.7761194029850746
F1 Score: 0.7984644913627639
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre
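
In the google-bert/bert-base-german-cased table above, validation loss is lowest at epoch 1, F1 peaks at epoch 3, and training loss keeps shrinking afterwards, a pattern that is at least consistent with overfitting in the later epochs. The sketch below parses that logged table and picks an epoch by a chosen metric. `parse_training_log` and `best_epoch` are hypothetical helpers for reading the logs, not part of `ate_model`; which checkpoint the notebook actually evaluates on the test set is not visible here.

```python
import csv
import io

# Table copied from the google-bert/bert-base-german-cased run above.
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.1829,0.774336,0.747863,0.76087
2,0.227500,0.184631,0.818182,0.769231,0.792952
3,0.100900,0.264567,0.779528,0.846154,0.811475
4,0.033000,0.325516,0.754167,0.773504,0.763713
5,0.018000,0.420455,0.79386,0.773504,0.78355
6,0.018000,0.445054,0.799107,0.764957,0.781659
7,0.006300,0.478515,0.790393,0.773504,0.781857
8,0.001800,0.464537,0.787611,0.760684,0.773913
9,0.001700,0.48557,0.801802,0.760684,0.780702
10,0.000500,0.521621,0.776824,0.773504,0.775161
"""


def parse_training_log(text):
    """Turn the comma-separated epoch table into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {"Epoch": int(row["Epoch"])}
        for key in ("Training Loss", "Validation Loss", "Precision", "Recall", "F1"):
            # "No log" means the trainer had not logged a training loss yet.
            parsed[key] = None if row[key] == "No log" else float(row[key])
        rows.append(parsed)
    return rows


def best_epoch(rows, metric="F1", higher_is_better=True):
    """Return the epoch number with the best value of `metric`."""
    candidates = [r for r in rows if r[metric] is not None]
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda r: r[metric])["Epoch"]


rows = parse_training_log(LOG)
best_f1_epoch = best_epoch(rows, "F1")
best_loss_epoch = best_epoch(rows, "Validation Loss", higher_is_better=False)
print(best_f1_epoch, best_loss_epoch)
```

Selecting by F1 and selecting by validation loss give different epochs for this run, so the choice of selection metric matters when comparing models.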

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5649.50 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4748.84 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.17579,0.782805,0.758772,0.770601
2,0.243900,0.185718,0.779661,0.807018,0.793103
3,0.123700,0.235898,0.757812,0.850877,0.801653
4,0.049500,0.238581,0.750943,0.872807,0.807302
5,0.031900,0.318879,0.729008,0.837719,0.779592
6,0.031900,0.339503,0.804444,0.79386,0.799117
7,0.012700,0.394532,0.77686,0.824561,0.8
8,0.004900,0.404207,0.794118,0.828947,0.811159
9,0.005900,0.404046,0.768595,0.815789,0.791489
10,0.001300,0.424459,0.778243,0.815789,0.796574


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4438.88 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.76      0.79       261

   micro avg       0.83      0.76      0.79       261
   macro avg       0.83      0.76      0.79       261
weighted avg       0.83      0.76      0.79       261

Precision Score: 0.8319327731092437
Recall Score: 0.7586206896551724
F1 Score: 0.7935871743486974
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5346.81 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4490.45 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for dbmdz/bert-base-german-uncased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.16619,0.764192,0.774336,0.769231
2,0.230700,0.203475,0.79638,0.778761,0.787472
3,0.115400,0.254972,0.776256,0.752212,0.764045
4,0.041700,0.32732,0.770925,0.774336,0.772627
5,0.026500,0.361629,0.747826,0.761062,0.754386
6,0.026500,0.42909,0.789954,0.765487,0.777528
7,0.009200,0.473046,0.733333,0.778761,0.755365
8,0.002400,0.474429,0.744589,0.761062,0.752735
9,0.002000,0.485442,0.772727,0.752212,0.762332
10,0.002000,0.496603,0.74477,0.787611,0.765591


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4352.71 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.74      0.79       267

   micro avg       0.85      0.74      0.79       267
   macro avg       0.85      0.74      0.79       267
weighted avg       0.85      0.74      0.79       267

Precision Score: 0.8454935622317596
Recall Score: 0.7378277153558053
F1 Score: 0.7879999999999999
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5691.39 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4650.21 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for FacebookAI/xlm-roberta-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.18188,0.748276,0.834615,0.789091
2,0.259400,0.228498,0.801653,0.746154,0.772908
3,0.176600,0.188888,0.786765,0.823077,0.804511
4,0.099200,0.212885,0.771626,0.857692,0.812386
5,0.077300,0.248682,0.798587,0.869231,0.832413
6,0.077300,0.285703,0.803704,0.834615,0.818868
7,0.041800,0.303892,0.803571,0.865385,0.833333
8,0.029600,0.356714,0.821012,0.811538,0.816248
9,0.019700,0.352624,0.801394,0.884615,0.840951
10,0.007900,0.358789,0.811594,0.861538,0.835821


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4398.04 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.77      0.77      0.77       294

   micro avg       0.77      0.77      0.77       294
   macro avg       0.77      0.77      0.77       294
weighted avg       0.77      0.77      0.77       294

Precision Score: 0.7713310580204779
Recall Score: 0.7687074829931972
F1 Score: 0.7700170357751277
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6210.98 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4831.16 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.332554,0.783217,0.602151,0.680851
2,0.724700,0.197943,0.748691,0.768817,0.758621
3,0.194500,0.208874,0.809524,0.731183,0.768362
4,0.091400,0.225158,0.753425,0.887097,0.814815
5,0.056100,0.369419,0.739535,0.854839,0.793017
6,0.056100,0.378145,0.762376,0.827957,0.793814
7,0.023800,0.362827,0.761421,0.806452,0.78329
8,0.010500,0.486202,0.752475,0.817204,0.783505
9,0.002100,0.573122,0.761421,0.806452,0.78329
10,0.001400,0.619228,0.772973,0.768817,0.770889


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4612.21 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.80      0.78      0.79       229

   micro avg       0.80      0.78      0.79       229
   macro avg       0.80      0.78      0.79       229
weighted avg       0.80      0.78      0.79       229

Precision Score: 0.7982062780269058
Recall Score: 0.777292576419214
F1 Score: 0.7876106194690267
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6395.77 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4800.17 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_filtered_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.270717,0.826087,0.612903,0.703704
2,0.669800,0.189723,0.834254,0.811828,0.822888
3,0.180500,0.222736,0.807692,0.790323,0.798913
4,0.092800,0.236305,0.725225,0.865591,0.789216
5,0.052800,0.332976,0.746341,0.822581,0.782609
6,0.052800,0.426769,0.782828,0.833333,0.807292
7,0.016200,0.47381,0.752475,0.817204,0.783505
8,0.011400,0.509127,0.792553,0.801075,0.796791
9,0.006900,0.592428,0.789744,0.827957,0.808399
10,0.002400,0.611152,0.793651,0.806452,0.8


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4414.46 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.71      0.78       229

   micro avg       0.87      0.71      0.78       229
   macro avg       0.87      0.71      0.78       229
weighted avg       0.87      0.71      0.78       229

Precision Score: 0.8716577540106952
Recall Score: 0.7117903930131004
F1 Score: 0.7836538461538461
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6153.53 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4753.42 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for TUM/GottBERT_base_last with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.250728,0.733333,0.768817,0.750656
2,0.739900,0.200275,0.763158,0.77957,0.771277
3,0.184100,0.231351,0.829412,0.758065,0.792135
4,0.086300,0.21053,0.731818,0.865591,0.793103
5,0.052800,0.376041,0.7343,0.817204,0.773537
6,0.052800,0.503252,0.741463,0.817204,0.777494
7,0.017900,0.455758,0.741463,0.817204,0.777494
8,0.010100,0.561049,0.732673,0.795699,0.762887
9,0.003200,0.582488,0.748792,0.833333,0.788804
10,0.002100,0.627763,0.770408,0.811828,0.790576


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4543.82 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.74      0.81      0.78       229

   micro avg       0.74      0.81      0.78       229
   macro avg       0.74      0.81      0.78       229
weighted avg       0.74      0.81      0.78       229

Precision Score: 0.744
Recall Score: 0.8122270742358079
F1 Score: 0.7766179540709811
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6156.85 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5031.83 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for distilbert/distilbert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.168544,0.753191,0.776316,0.764579
2,0.260500,0.182833,0.787037,0.745614,0.765766
3,0.127900,0.215014,0.767932,0.798246,0.782796
4,0.059300,0.23789,0.816038,0.758772,0.786364
5,0.039200,0.28545,0.742063,0.820175,0.779167
6,0.039200,0.32281,0.772926,0.776316,0.774617
7,0.017400,0.357822,0.744939,0.807018,0.774737
8,0.006900,0.363007,0.787611,0.780702,0.784141
9,0.004600,0.367792,0.759036,0.828947,0.792453
10,0.002900,0.377506,0.766949,0.79386,0.780172


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4822.72 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.79      0.80       261

   micro avg       0.81      0.79      0.80       261
   macro avg       0.81      0.79      0.80       261
weighted avg       0.81      0.79      0.80       261

Precision Score: 0.8054474708171206
Recall Score: 0.7931034482758621
F1 Score: 0.7992277992277992
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5700.16 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4592.14 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.198532,0.757225,0.635922,0.691293
2,2.036500,0.17539,0.776119,0.757282,0.766585
3,0.193900,0.231344,0.765258,0.791262,0.778043
4,0.080100,0.246968,0.731343,0.713592,0.722359
5,0.049100,0.444121,0.731959,0.68932,0.71
6,0.049100,0.368511,0.731959,0.68932,0.71
7,0.024800,0.401871,0.725664,0.796117,0.759259
8,0.010700,0.468312,0.73913,0.742718,0.74092
9,0.005200,0.482595,0.745,0.723301,0.73399
10,0.003400,0.521337,0.766839,0.718447,0.741855


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4229.13 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.76      0.74      0.75       234

   micro avg       0.76      0.74      0.75       234
   macro avg       0.76      0.74      0.75       234
weighted avg       0.76      0.74      0.75       234

Precision Score: 0.7577092511013216
Recall Score: 0.7350427350427351
F1 Score: 0.7462039045553145
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pre

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5754.61 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4603.32 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-ASPECT' 'O' 'B-ASPECT']
{0: 21.94744744744745, 1: 0.3725025484199796, 2: 3.705196451204056}
Training results for deepset/gbert-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.243553,0.742489,0.758772,0.750542
2,0.612200,0.226384,0.753138,0.789474,0.770878
3,0.178900,0.259738,0.780172,0.79386,0.786957
4,0.071500,0.27991,0.797414,0.811404,0.804348
5,0.049100,0.356831,0.784232,0.828947,0.80597
6,0.049100,0.445217,0.807339,0.77193,0.789238
7,0.017000,0.47791,0.781893,0.833333,0.806794
8,0.008900,0.492423,0.809955,0.785088,0.797327
9,0.007100,0.541927,0.806867,0.824561,0.815618
10,0.005000,0.550221,0.773279,0.837719,0.804211


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4383.53 examples/s]


Unique predicted label IDs: {0, 1, 2}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.75      0.80       261

   micro avg       0.86      0.75      0.80       261
   macro avg       0.86      0.75      0.80       261
weighted avg       0.86      0.75      0.80       261

Precision Score: 0.8558951965065502
Recall Score: 0.7509578544061303
F1 Score: 0.8
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O',
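
The dictionary printed before each run looks like a set of "balanced" class weights, i.e. `n_samples / (n_classes * count)` per label. The three printed values are reproduced by token counts of 13080 (`O`), 1315 (`B-ASPECT`), and 222 (`I-ASPECT`); these counts are inferred from the weights, not taken from the dataset. The weights are keyed by position in the printed label array, and that order differs between the 6-epoch run (`['O' 'B-ASPECT' 'I-ASPECT']`) and the 12-epoch runs (`['I-ASPECT' 'O' 'B-ASPECT']`). A minimal sketch, assuming this formula and using a hypothetical helper `balanced_class_weights`, that also fixes the label order by sorting:

```python
from collections import Counter


def balanced_class_weights(label_counts):
    """Return (label2id, weights) with weights = n_samples / (n_classes * count).

    Labels are sorted so the label-to-id mapping is the same on every run.
    """
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    labels = sorted(label_counts)
    label2id = {label: idx for idx, label in enumerate(labels)}
    weights = {
        label2id[label]: n_samples / (n_classes * label_counts[label])
        for label in labels
    }
    return label2id, weights


# Token counts inferred from the printed weights (an assumption, see above).
counts = Counter({"O": 13080, "B-ASPECT": 1315, "I-ASPECT": 222})
label2id, weights = balanced_class_weights(counts)
print(label2id)
print(weights)
```

With a fixed mapping, the weight vector passed to the loss lines up with the same label IDs across runs, which makes logs from different cells easier to compare.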

### 2. category-aware ATE
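
In the category-aware setting, the generic `ASPECT` tag is replaced by one `B-`/`I-` pair per aspect category, so the model predicts the span and its category in one pass. The six category names below are taken from the classification reports in this section; the helper names `build_label_list` and `tag_tokens` and the example sentence with its annotation are made up for illustration and are not the notebook's own code.

```python
CATEGORIES = [
    "Arzt",
    "Krankenhaus",
    "Personal",
    "Pflegepersonal",
    "anderer Service",
    "mediz. Service",
]


def build_label_list(categories):
    """Return 'O' plus one B-/I- pair per category."""
    labels = ["O"]
    for category in categories:
        labels += [f"B-{category}", f"I-{category}"]
    return labels


def tag_tokens(n_tokens, spans):
    """Create BIO tags from (start, end, category) spans; end is exclusive."""
    tags = ["O"] * n_tokens
    for start, end, category in spans:
        tags[start] = f"B-{category}"
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"
    return tags


labels = build_label_list(CATEGORIES)
print(len(labels))

# Hypothetical sentence and annotation, not taken from the dataset.
tokens = ["Die", "behandelnden", "Ärzte", "waren", "freundlich", "."]
tags = tag_tokens(len(tokens), [(1, 3, "Arzt")])
print(list(zip(tokens, tags)))
```

Six categories give 13 labels, which matches the 13 entries in the label array and the class-weight dictionary printed for the runs below.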

In [11]:
# Fine-tune each checkpoint on category-aware ATE with epochs=5 and fixed random seeds (rn1, rn2).
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=5)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5833.23 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3904.81 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training google-bert/bert-base-german-cased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.328237,0.637584,0.519126,0.572289
2,0.371000,0.357443,0.669014,0.519126,0.584615
3,0.371000,0.468612,0.538462,0.650273,0.589109
4,0.110700,0.480588,0.649425,0.617486,0.633053
5,0.110700,0.508611,0.623596,0.606557,0.614958


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 3490.81 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.55      0.41      0.47        29
    Krankenhaus       0.68      0.74      0.71        43
       Personal       0.71      0.71      0.71         7
 Pflegepersonal       1.00      0.90      0.95        10
anderer Service       0.45      0.28      0.34        18
 mediz. Service       0.71      0.65      0.68        37

      micro avg       0.67      0.60      0.64       144
      macro avg       0.68      0.62      0.64       144
   weighted avg       0.66      0.60      0.62       144

Precision Score: 0.6692307692307692
Recall Score: 0.6041666666666666
F1 Score: 0.635036496350365
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochma

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5665.46 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3112.17 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training dbmdz/bert-base-german-cased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.335056,0.59375,0.555556,0.574018
2,0.405000,0.351706,0.659574,0.54386,0.596154
3,0.405000,0.416945,0.5625,0.631579,0.595041
4,0.134700,0.44357,0.59322,0.614035,0.603448
5,0.134700,0.471154,0.555556,0.614035,0.583333


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 3845.58 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.57      0.55      0.56        31
    Krankenhaus       0.74      0.69      0.72        42
       Personal       0.44      0.50      0.47         8
 Pflegepersonal       1.00      0.90      0.95        10
anderer Service       0.29      0.28      0.29        18
 mediz. Service       0.70      0.55      0.62        38

      micro avg       0.63      0.58      0.60       147
      macro avg       0.62      0.58      0.60       147
   weighted avg       0.64      0.58      0.61       147

Precision Score: 0.6343283582089553
Recall Score: 0.5782312925170068
F1 Score: 0.6049822064056939
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5601.87 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3921.03 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training dbmdz/bert-base-german-uncased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.326593,0.654135,0.520958,0.58
2,0.372900,0.347663,0.643939,0.508982,0.568562
3,0.372900,0.469372,0.4375,0.586826,0.501279
4,0.130200,0.462696,0.589404,0.532934,0.559748
5,0.130200,0.480155,0.545977,0.568862,0.557185


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 3784.24 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.54      0.74      0.62        34
    Krankenhaus       0.75      0.57      0.65        42
       Personal       0.67      0.25      0.36         8
 Pflegepersonal       1.00      0.88      0.93         8
anderer Service       1.00      0.06      0.12        16
 mediz. Service       0.72      0.54      0.62        39

      micro avg       0.68      0.54      0.60       147
      macro avg       0.78      0.51      0.55       147
   weighted avg       0.73      0.54      0.58       147

Precision Score: 0.6779661016949152
Recall Score: 0.54421768707483
F1 Score: 0.6037735849056604
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochmal

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5774.19 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3593.32 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training FacebookAI/xlm-roberta-base for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.428879,0.46087,0.258537,0.33125
2,0.488600,0.343836,0.655405,0.473171,0.549575
3,0.488600,0.38717,0.589862,0.62439,0.606635
4,0.226100,0.431481,0.607843,0.604878,0.606357
5,0.226100,0.454515,0.606635,0.62439,0.615385


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4174.53 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.62      0.69      0.66        36
    Krankenhaus       0.73      0.73      0.73        45
       Personal       0.75      0.55      0.63        11
 Pflegepersonal       0.85      0.92      0.88        12
anderer Service       0.53      0.41      0.46        22
 mediz. Service       0.80      0.66      0.73        50

      micro avg       0.71      0.66      0.69       176
      macro avg       0.71      0.66      0.68       176
   weighted avg       0.71      0.66      0.69       176

Precision Score: 0.7134146341463414
Recall Score: 0.6647727272727273
F1 Score: 0.6882352941176471
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm

Device set to use cuda:0


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6020.67 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3779.13 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training TUM/GottBERT_base_best for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.424025,0.49375,0.519737,0.50641
2,0.740200,0.319288,0.602837,0.559211,0.580205
3,0.740200,0.4215,0.578947,0.651316,0.613003
4,0.211400,0.535966,0.593333,0.585526,0.589404
5,0.211400,0.613212,0.622517,0.618421,0.620462


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4262.14 examples/s]




Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {np.int64(0), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13), np.int64(206)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.62      0.64      0.63        25
    Krankenhaus       0.74      0.61      0.67        41
       Personal       0.67      0.57      0.62         7
 Pflegepersonal       0.58      0.88      0.70         8
anderer Service       0.23      0.27      0.25        11
 mediz. Service       0.71      0.67      0.69        30

      micro avg       0.63      0.61      0.62       122
      macro avg       0.59      0.61      0.59       122
   weighted avg       0.65      0.61      0.63       122

Precision Score: 0.6302521008403361
Recall Score: 0.6147540983606558
F1 Score: 0.6224066390041495
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Statio
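
In this run the set of predicted label IDs contains 206, which lies outside the expected IDs 0 to 14 printed on the next line. The log does not say where this value comes from, so it is worth checking against the label mapping before trusting the report. A minimal sketch of that check, using only the two sets copied from the log lines above:

```python
# Values copied from the "Unique predicted label IDs" and
# "Expected label IDs" lines of the TUM/GottBERT_base_best run.
predicted_ids = {0, 2, 3, 4, 5, 6, 7, 9, 11, 13, 206}
expected_ids = set(range(15))  # {0, 1, ..., 14}

# IDs the model produced that have no entry in the expected set.
unexpected = sorted(predicted_ids - expected_ids)
# IDs that are expected but never appear in the predictions.
never_predicted = sorted(expected_ids - predicted_ids)

print("Outside expected range:", unexpected)
print("Expected but never predicted:", never_predicted)
```

Running the same set difference on the other runs gives an empty list for the out-of-range IDs, which makes this run easy to single out.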

Device set to use cuda:0


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6059.64 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3994.31 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training TUM/GottBERT_filtered_base_best for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.392108,0.585106,0.361842,0.447154
2,0.594000,0.394694,0.56962,0.592105,0.580645
3,0.594000,0.534113,0.598485,0.519737,0.556338
4,0.168300,0.573898,0.618705,0.565789,0.591065
5,0.168300,0.653187,0.601266,0.625,0.612903




There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4256.08 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(10), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.65      0.52      0.58        25
    Krankenhaus       0.70      0.73      0.71        41
       Personal       0.50      0.71      0.59         7
 Pflegepersonal       1.00      0.88      0.93         8
anderer Service       0.38      0.27      0.32        11
 mediz. Service       0.57      0.40      0.47        30

      micro avg       0.64      0.57      0.61       122
      macro avg       0.63      0.59      0.60       122
   weighted avg       0.64      0.57      0.60       122

Precision Score: 0.6422018348623854
Recall Score: 0.5737704918032787
F1 Score: 0.606060606060606
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', '

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6155.12 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 4056.99 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training TUM/GottBERT_base_last for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.415237,0.517986,0.473684,0.494845
2,0.730100,0.390912,0.631944,0.598684,0.614865
3,0.730100,0.4594,0.6,0.552632,0.575342
4,0.169100,0.669077,0.634483,0.605263,0.619529
5,0.169100,0.769484,0.628571,0.578947,0.60274




There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4354.14 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {np.int64(0), np.int64(2), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.65      0.60      0.63        25
    Krankenhaus       0.76      0.63      0.69        41
       Personal       0.67      0.57      0.62         7
 Pflegepersonal       0.88      0.88      0.88         8
anderer Service       0.17      0.27      0.21        11
 mediz. Service       0.70      0.53      0.60        30

      micro avg       0.63      0.58      0.61       122
      macro avg       0.64      0.58      0.60       122
   weighted avg       0.67      0.58      0.62       122

Precision Score: 0.6339285714285714
Recall Score: 0.5819672131147541
F1 Score: 0.6068376068376068
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', '

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6735.97 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 4313.46 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training distilbert/distilbert-base-german-cased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.382121,0.544776,0.426901,0.478689
2,0.478400,0.327016,0.573333,0.502924,0.535826
3,0.478400,0.36044,0.495,0.578947,0.533693
4,0.178900,0.386078,0.52459,0.561404,0.542373
5,0.178900,0.389224,0.513966,0.538012,0.525714


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 3965.05 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.55      0.52      0.53        31
    Krankenhaus       0.72      0.67      0.69        42
       Personal       0.67      0.50      0.57         8
 Pflegepersonal       0.82      0.90      0.86        10
anderer Service       0.29      0.11      0.16        18
 mediz. Service       0.65      0.58      0.61        38

      micro avg       0.64      0.55      0.59       147
      macro avg       0.61      0.55      0.57       147
   weighted avg       0.62      0.55      0.58       147

Precision Score: 0.6428571428571429
Recall Score: 0.5510204081632653
F1 Score: 0.5934065934065934
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm

Device set to use cuda:0


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5948.59 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3898.50 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training GerMedBERT/medbert-512 for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.564667,0.338983,0.372671,0.35503
2,2.451800,0.410399,0.564885,0.459627,0.506849
3,2.451800,0.458173,0.502703,0.57764,0.537572
4,0.205300,0.514803,0.662162,0.608696,0.634304
5,0.205300,0.548273,0.616352,0.608696,0.6125


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4009.85 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.52      0.58      0.55        24
    Krankenhaus       0.86      0.80      0.83        45
       Personal       0.67      0.75      0.71         8
 Pflegepersonal       0.91      0.91      0.91        11
anderer Service       0.33      0.38      0.35        16
 mediz. Service       0.56      0.32      0.41        31

      micro avg       0.66      0.61      0.63       135
      macro avg       0.64      0.62      0.63       135
   weighted avg       0.66      0.61      0.62       135

Precision Score: 0.656
Recall Score: 0.6074074074074074
F1 Score: 0.6307692307692309
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm
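
The weight dictionary printed before each training run is numerically consistent with inverse-frequency ("balanced") class weights, weight_c = N / (K * n_c), where N is the number of labelled tokens, K the number of classes and n_c the token count of class c. For example, 78.5615... equals 10213 / (13 * 10) and 261.8717... equals 10213 / (13 * 3). The totals 10213, 10 and 3 are inferred from the printed values; how `ate_model_train` actually computes the weights is not shown here. The helper below is a hypothetical sketch of the formula, not the notebook's implementation.

```python
from collections import Counter


def balanced_class_weights(label_ids):
    """Hypothetical helper: weight_c = N / (K * n_c) for each class c."""
    counts = Counter(label_ids)
    n_samples = len(label_ids)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}


# Toy data: label 4 ("O") dominates, label 5 is rare.
toy_labels = [4] * 90 + [7] * 8 + [5] * 2
weights = balanced_class_weights(toy_labels)
print(weights)

# Inferred check against the log: 13 classes, 10213 labelled tokens,
# 10 tokens of class 0.
weight_class_0 = 10213 / (13 * 10)
print(weight_class_0)
```

With this scheme the dominant "O" class gets a weight far below 1 and rare I- tags get very large weights, which matches the spread seen in the printed dictionary.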

In [12]:
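# Train and evaluate every checkpoint in `models` with ate_cat_model
# (imported from functions.ate_model_train); rn1 and rn2 are the two
# random seeds echoed in the training log, epochs is the epoch count.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=10)
    print()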

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5617.60 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3823.09 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training google-bert/bert-base-german-cased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.314145,0.626506,0.568306,0.595989
2,0.357200,0.363651,0.70068,0.562842,0.624242
3,0.357200,0.4642,0.578199,0.666667,0.619289
4,0.108000,0.512544,0.618785,0.612022,0.615385
5,0.108000,0.580289,0.592593,0.612022,0.602151
6,0.031400,0.583369,0.639535,0.601093,0.619718
7,0.008100,0.617354,0.596774,0.606557,0.601626
8,0.008100,0.654113,0.647059,0.601093,0.623229
9,0.002000,0.647009,0.619318,0.595628,0.607242
10,0.002000,0.656684,0.626437,0.595628,0.610644
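
In the table above the training loss falls towards zero while the validation loss rises after the first epoch, and validation F1 stops improving early. A small sketch for reading off the best epochs from such a table; the log text is copied from the run above, and nothing here reflects how the training function itself selects a checkpoint.

```python
import csv
import io

# Per-epoch log copied from the table above.
log = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.314145,0.626506,0.568306,0.595989
2,0.357200,0.363651,0.70068,0.562842,0.624242
3,0.357200,0.4642,0.578199,0.666667,0.619289
4,0.108000,0.512544,0.618785,0.612022,0.615385
5,0.108000,0.580289,0.592593,0.612022,0.602151
6,0.031400,0.583369,0.639535,0.601093,0.619718
7,0.008100,0.617354,0.596774,0.606557,0.601626
8,0.008100,0.654113,0.647059,0.601093,0.623229
9,0.002000,0.647009,0.619318,0.595628,0.607242
10,0.002000,0.656684,0.626437,0.595628,0.610644
"""

rows = list(csv.DictReader(io.StringIO(log)))

# Epoch with the highest validation F1 and epoch with the lowest validation loss.
best_f1 = max(rows, key=lambda r: float(r["F1"]))
lowest_loss = min(rows, key=lambda r: float(r["Validation Loss"]))

print("Best F1:", best_f1["Epoch"], best_f1["F1"])
print("Lowest validation loss:", lowest_loss["Epoch"], lowest_loss["Validation Loss"])
```

For this run the two criteria pick different epochs (2 by F1, 1 by validation loss), so the choice of selection metric matters when comparing the 5-epoch and 10-epoch settings.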


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 3981.51 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.57      0.59      0.58        29
    Krankenhaus       0.78      0.67      0.72        43
       Personal       0.75      0.43      0.55         7
 Pflegepersonal       1.00      0.90      0.95        10
anderer Service       0.42      0.28      0.33        18
 mediz. Service       0.88      0.57      0.69        37

      micro avg       0.72      0.58      0.65       144
      macro avg       0.73      0.57      0.64       144
   weighted avg       0.73      0.58      0.64       144

Precision Score: 0.7241379310344828
Recall Score: 0.5833333333333334
F1 Score: 0.6461538461538462
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6058.24 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 4032.61 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training dbmdz/bert-base-german-cased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.340263,0.564417,0.538012,0.550898
2,0.386000,0.315173,0.612676,0.508772,0.555911
3,0.386000,0.459173,0.60221,0.637427,0.619318
4,0.125900,0.486373,0.586592,0.614035,0.6
5,0.125900,0.550674,0.569892,0.619883,0.593838
6,0.050200,0.63032,0.59887,0.619883,0.609195
7,0.017900,0.672255,0.601156,0.608187,0.604651
8,0.017900,0.692469,0.559585,0.631579,0.593407
9,0.009200,0.699506,0.562162,0.608187,0.58427
10,0.009200,0.696316,0.564516,0.614035,0.588235


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4141.78 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.68      0.55      0.61        31
    Krankenhaus       0.74      0.48      0.58        42
       Personal       0.71      0.62      0.67         8
 Pflegepersonal       0.47      0.90      0.62        10
anderer Service       0.22      0.28      0.24        18
 mediz. Service       0.61      0.61      0.61        38

      micro avg       0.57      0.54      0.55       147
      macro avg       0.57      0.57      0.55       147
   weighted avg       0.61      0.54      0.56       147

Precision Score: 0.5683453237410072
Recall Score: 0.5374149659863946
F1 Score: 0.5524475524475525
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5886.09 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3900.27 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training dbmdz/bert-base-german-uncased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.319907,0.611511,0.508982,0.555556
2,0.366300,0.376197,0.576642,0.473054,0.519737
3,0.366300,0.481974,0.486772,0.550898,0.516854
4,0.124400,0.527617,0.546512,0.562874,0.554572
5,0.124400,0.593346,0.583893,0.520958,0.550633
6,0.042100,0.644776,0.546512,0.562874,0.554572
7,0.014100,0.704451,0.527174,0.580838,0.552707
8,0.014100,0.721638,0.558282,0.54491,0.551515
9,0.006100,0.754043,0.521978,0.568862,0.544413
10,0.006100,0.753562,0.552941,0.562874,0.557864


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4063.10 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.57      0.59      0.58        34
    Krankenhaus       0.81      0.71      0.76        42
       Personal       0.71      0.62      0.67         8
 Pflegepersonal       0.58      0.88      0.70         8
anderer Service       0.21      0.25      0.23        16
 mediz. Service       0.76      0.56      0.65        39

      micro avg       0.63      0.60      0.62       147
      macro avg       0.61      0.60      0.60       147
   weighted avg       0.66      0.60      0.62       147

Precision Score: 0.6330935251798561
Recall Score: 0.5986394557823129
F1 Score: 0.6153846153846154
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'ge

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 4656.14 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3259.01 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training FacebookAI/xlm-roberta-base for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.40462,0.547619,0.336585,0.416918
2,0.508600,0.324651,0.728682,0.458537,0.562874
3,0.508600,0.330159,0.683616,0.590244,0.633508
4,0.238600,0.431061,0.702703,0.507317,0.589235
5,0.238600,0.483945,0.56872,0.585366,0.576923
6,0.149500,0.507345,0.520833,0.609756,0.561798
7,0.080800,0.566223,0.570776,0.609756,0.589623
8,0.080800,0.605235,0.576037,0.609756,0.592417
9,0.047400,0.630347,0.575221,0.634146,0.603248
10,0.047400,0.647925,0.558442,0.629268,0.591743


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4293.96 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.67      0.72      0.69        36
    Krankenhaus       0.82      0.62      0.71        45
       Personal       0.40      0.18      0.25        11
 Pflegepersonal       0.61      0.92      0.73        12
anderer Service       0.38      0.45      0.42        22
 mediz. Service       0.83      0.50      0.62        50

      micro avg       0.67      0.58      0.62       176
      macro avg       0.62      0.57      0.57       176
   weighted avg       0.70      0.58      0.62       176

Precision Score: 0.6710526315789473
Recall Score: 0.5795454545454546
F1 Score: 0.6219512195121951
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm
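
In the training log above, validation loss is lowest at epoch 2 while validation F1 peaks at epoch 3, so the two criteria would pick different checkpoints. The sketch below parses the logged table (copied verbatim) and reports both epochs. Which criterion `ate_model` actually uses to select a checkpoint is not shown in this notebook; the variable names here are illustrative only.

```python
import csv
import io

# Epoch table for FacebookAI/xlm-roberta-base, copied from the log above.
log = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.40462,0.547619,0.336585,0.416918
2,0.508600,0.324651,0.728682,0.458537,0.562874
3,0.508600,0.330159,0.683616,0.590244,0.633508
4,0.238600,0.431061,0.702703,0.507317,0.589235
5,0.238600,0.483945,0.56872,0.585366,0.576923
6,0.149500,0.507345,0.520833,0.609756,0.561798
7,0.080800,0.566223,0.570776,0.609756,0.589623
8,0.080800,0.605235,0.576037,0.609756,0.592417
9,0.047400,0.630347,0.575221,0.634146,0.603248
10,0.047400,0.647925,0.558442,0.629268,0.591743
"""

rows = list(csv.DictReader(io.StringIO(log)))

# Epoch with the highest validation F1 and epoch with the lowest validation loss.
best_f1_row = max(rows, key=lambda r: float(r["F1"]))
lowest_loss_row = min(rows, key=lambda r: float(r["Validation Loss"]))

print("best F1 at epoch", best_f1_row["Epoch"], "->", best_f1_row["F1"])
print("lowest validation loss at epoch", lowest_loss_row["Epoch"], "->", lowest_loss_row["Validation Loss"])
```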

Device set to use cuda:0


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5843.55 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3883.46 examples/s]


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training TUM/GottBERT_base_best for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.429805,0.448087,0.539474,0.489552
2,0.743900,0.432371,0.468293,0.631579,0.537815
3,0.743900,0.420159,0.542857,0.5,0.520548
4,0.225700,0.47808,0.56051,0.578947,0.569579
5,0.225700,0.662139,0.571429,0.5,0.533333
6,0.113900,1.028739,0.574713,0.657895,0.613497
7,0.041700,1.091422,0.576687,0.618421,0.596825
8,0.041700,1.079843,0.60479,0.664474,0.633229
9,0.020300,1.113201,0.607595,0.631579,0.619355
10,0.020300,1.145341,0.598802,0.657895,0.626959




There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4303.26 examples/s]




Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13), np.int64(149)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.60      0.72      0.65        25
    Krankenhaus       0.71      0.59      0.64        41
       Personal       0.62      0.71      0.67         7
 Pflegepersonal       0.78      0.88      0.82         8
anderer Service       0.21      0.27      0.24        11
 mediz. Service       0.54      0.67      0.60        30

      micro avg       0.58      0.63      0.61       122
      macro avg       0.58      0.64      0.60       122
   weighted avg       0.60      0.63      0.61       122

Precision Score: 0.5833333333333334
Recall Score: 0.6311475409836066
F1 Score: 0.6062992125984252
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'g
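
The dictionary printed before each run maps label IDs to class weights. The values are consistent with inverse-frequency ("balanced") weighting, w_c = N / (K * n_c): for instance, with K = 13 labels and N = 10213 labelled tokens, a label seen 3 times gets 10213 / 39 ≈ 261.87, which matches the largest printed weight. That `ate_model` computes its weights this way is an inference from the numbers, not something the notebook shows; `balanced_class_weights` and the toy label list below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: N / (K * n_label)} for a flat list of token labels.

    N is the total number of tokens, K the number of distinct labels,
    and n_label the number of tokens carrying that label.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy example (not the notebook's data): 'O' dominates, aspect labels are rare.
toy_labels = ["O"] * 8 + ["B-Arzt"] + ["I-Arzt"]
weights = balanced_class_weights(toy_labels)
for label, weight in weights.items():
    print(f"{label:8s} {weight:.4f}")

# Arithmetic check against the largest weight printed in the log (assumed N and count).
assumed_weight = 10213 / (13 * 3)
print(assumed_weight)
```

Frequent labels such as `O` receive a weight below 1 and rare labels a weight well above 1, which is the pattern visible in the printed dictionary.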

Device set to use cuda:0


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6424.81 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 4116.66 examples/s]


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training TUM/GottBERT_filtered_base_best for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.399965,0.504202,0.394737,0.442804
2,0.582200,0.46778,0.497297,0.605263,0.545994
3,0.582200,0.535172,0.598639,0.578947,0.588629
4,0.181200,0.651366,0.579882,0.644737,0.610592
5,0.181200,0.797268,0.613139,0.552632,0.581315
6,0.076800,0.944462,0.603659,0.651316,0.626582
7,0.031500,1.024025,0.593023,0.671053,0.62963
8,0.031500,1.04935,0.582353,0.651316,0.614907
9,0.009100,1.123064,0.582857,0.671053,0.623853
10,0.009100,1.098065,0.60119,0.664474,0.63125




There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4189.47 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.76      0.64      0.70        25
    Krankenhaus       0.85      0.80      0.83        41
       Personal       0.50      0.71      0.59         7
 Pflegepersonal       0.88      0.88      0.88         8
anderer Service       0.38      0.27      0.32        11
 mediz. Service       0.56      0.60      0.58        30

      micro avg       0.69      0.67      0.68       122
      macro avg       0.65      0.65      0.65       122
   weighted avg       0.70      0.67      0.68       122

Precision Score: 0.6949152542372882
Recall Score: 0.6721311475409836
F1 Score: 0.6833333333333333
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Station

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6045.70 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3889.97 examples/s]


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training TUM/GottBERT_base_last for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.436968,0.555556,0.526316,0.540541
2,0.726900,0.447421,0.591549,0.552632,0.571429
3,0.726900,0.523774,0.575949,0.598684,0.587097
4,0.184000,0.846363,0.603774,0.631579,0.617363
5,0.184000,0.964704,0.633094,0.578947,0.604811
6,0.075400,1.235551,0.554913,0.631579,0.590769
7,0.018200,1.251128,0.585526,0.585526,0.585526
8,0.018200,1.329724,0.603896,0.611842,0.607843
9,0.007400,1.31453,0.603896,0.611842,0.607843
10,0.007400,1.305345,0.621622,0.605263,0.613333




There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 3923.68 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.62      0.64      0.63        25
    Krankenhaus       0.78      0.68      0.73        41
       Personal       0.60      0.43      0.50         7
 Pflegepersonal       0.78      0.88      0.82         8
anderer Service       0.25      0.27      0.26        11
 mediz. Service       0.67      0.60      0.63        30

      micro avg       0.65      0.61      0.63       122
      macro avg       0.61      0.58      0.60       122
   weighted avg       0.66      0.61      0.63       122

Precision Score: 0.6521739130434783
Recall Score: 0.6147540983606558
F1 Score: 0.6329113924050632
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'ge

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 6500.59 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 4412.61 examples/s]


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training distilbert/distilbert-base-german-cased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.376311,0.482558,0.48538,0.483965
2,0.448200,0.308758,0.661538,0.502924,0.571429
3,0.448200,0.327499,0.60355,0.596491,0.6
4,0.164300,0.386752,0.602484,0.567251,0.584337
5,0.164300,0.417542,0.551913,0.590643,0.570621
6,0.074800,0.445919,0.583333,0.573099,0.578171
7,0.039900,0.475213,0.575581,0.578947,0.577259
8,0.039900,0.496147,0.57764,0.54386,0.560241
9,0.023400,0.508502,0.556818,0.573099,0.564841
10,0.023400,0.514116,0.55814,0.561404,0.559767


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4002.68 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.61      0.61      0.61        31
    Krankenhaus       0.71      0.57      0.63        42
       Personal       0.60      0.38      0.46         8
 Pflegepersonal       0.60      0.90      0.72        10
anderer Service       0.20      0.11      0.14        18
 mediz. Service       0.69      0.66      0.68        38

      micro avg       0.63      0.56      0.59       147
      macro avg       0.57      0.54      0.54       147
   weighted avg       0.61      0.56      0.58       147

Precision Score: 0.6259541984732825
Recall Score: 0.5578231292517006
F1 Score: 0.5899280575539567
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochm

Device set to use cuda:0


Mapping the data


Map: 100%|██████████| 575/575 [00:00<00:00, 5837.31 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 3507.23 examples/s]


['I-anderer Service' 'B-anderer Service' 'B-mediz. Service' 'I-Personal'
 'O' 'I-Pflegepersonal' 'B-Krankenhaus' 'B-Arzt' 'I-mediz. Service'
 'I-Arzt' 'B-Pflegepersonal' 'B-Personal' 'I-Krankenhaus']
{0: np.float64(78.56153846153846), 1: np.float64(5.651909241837299), 2: np.float64(2.846432552954292), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(261.87179487179486), 6: np.float64(3.9280769230769232), 7: np.float64(5.493813878429263), 8: np.float64(23.806526806526808), 9: np.float64(18.705128205128204), 10: np.float64(9.465245597775718), 11: np.float64(13.545092838196286), 12: np.float64(29.096866096866098)}
Training GerMedBERT/medbert-512 for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.560103,0.356757,0.409938,0.381503
2,2.415900,0.378509,0.578571,0.503106,0.538206
3,2.415900,0.459142,0.518919,0.596273,0.554913
4,0.206600,0.542074,0.6,0.596273,0.598131
5,0.206600,0.651896,0.593548,0.571429,0.582278
6,0.073700,0.802876,0.576923,0.559006,0.567823
7,0.024100,0.756369,0.61745,0.571429,0.593548
8,0.024100,0.822646,0.573171,0.583851,0.578462
9,0.008800,0.823921,0.567251,0.602484,0.584337
10,0.008800,0.837292,0.592593,0.596273,0.594427




There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


Evaluating on test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4051.27 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.52      0.62      0.57        24
    Krankenhaus       0.73      0.73      0.73        45
       Personal       0.55      0.75      0.63         8
 Pflegepersonal       0.67      0.91      0.77        11
anderer Service       0.30      0.38      0.33        16
 mediz. Service       0.65      0.42      0.51        31

      micro avg       0.59      0.61      0.60       135
      macro avg       0.57      0.64      0.59       135
   weighted avg       0.61      0.61      0.60       135

Precision Score: 0.5928571428571429
Recall Score: 0.6148148148148148
F1 Score: 0.6036363636363636
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', '

#### 2.1 category-aware ATE with k-fold cross-validation

In [10]:
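# Category-aware ATE: train and evaluate every candidate model with
# 2-fold cross-validation (k=2), 5 epochs per fold, and fixed random seeds.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model_kfold(data, model, rn1=42, rn2=42, k=2, epochs=5)
    print()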

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 5802.73 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5473.53 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5902.21 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.45539,0.638298,0.16129,0.257511
2,No log,0.37147,0.673729,0.427419,0.523026
3,No log,0.342624,0.634375,0.545699,0.586705
4,No log,0.342606,0.605405,0.602151,0.603774
5,No log,0.345547,0.66055,0.580645,0.618026


Evaluating fold 1


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.55      0.68      0.61        53
    Krankenhaus       0.65      0.60      0.62       105
       Personal       1.00      0.04      0.08        25
 Pflegepersonal       0.80      0.65      0.71        31
anderer Service       0.50      0.10      0.16        61
 mediz. Service       0.57      0.67      0.62       127

      micro avg       0.60      0.52      0.56       402
      macro avg       0.68      0.46      0.47       402
   weighted avg       0.62      0.52      0.52       402

Fold 1 Results - Precision: 0.6045845272206304, Recall: 0.5248756218905473, F1: 0.5619174434087884
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6132.15 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 5603.59 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5202.88 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.177941,0.737327,0.73903,0.738178
2,No log,0.166044,0.731463,0.842956,0.783262
3,No log,0.158262,0.737805,0.838337,0.784865
4,No log,0.14772,0.766529,0.856813,0.80916
5,No log,0.147531,0.779221,0.831409,0.804469


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.63      0.85      0.73        47
    Krankenhaus       0.72      0.90      0.80        97
       Personal       0.64      0.41      0.50        22
 Pflegepersonal       0.82      0.83      0.82        48
anderer Service       0.77      0.66      0.71        50
 mediz. Service       0.76      0.83      0.79       150

      micro avg       0.74      0.80      0.77       414
      macro avg       0.72      0.75      0.73       414
   weighted avg       0.74      0.80      0.77       414

Fold 2 Results - Precision: 0.7367256637168141, Recall: 0.8043478260869565, F1: 0.7690531177829099

=== Final Cross-Validation Results ===
Average Precision: 0.6706550954687223
Average 
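
The cross-validation summary aggregates the per-fold scores. The sketch below reproduces the averaged precision from the two fold results printed above, assuming an unweighted mean over folds (this assumption matches the printed average precision). The helper `mean_over_folds` is hypothetical and not part of `ate_cat_model_kfold`.

```python
from statistics import fmean

# Per-fold results copied from the "Fold 1 Results" and "Fold 2 Results" lines above.
fold_results = [
    {"precision": 0.6045845272206304, "recall": 0.5248756218905473, "f1": 0.5619174434087884},
    {"precision": 0.7367256637168141, "recall": 0.8043478260869565, "f1": 0.7690531177829099},
]


def mean_over_folds(results, metric):
    """Unweighted mean of one metric across all folds."""
    return fmean([fold[metric] for fold in results])


avg_precision = mean_over_folds(fold_results, "precision")
print(f"Average Precision: {avg_precision}")
```

The same helper can be called with `"recall"` or `"f1"` to aggregate the other two metrics.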

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 6016.23 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5513.98 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 6008.70 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.55637,0.0,0.0,0.0
2,No log,0.414181,0.635922,0.361878,0.461268
3,No log,0.377731,0.614865,0.502762,0.553191
4,No log,0.357026,0.577143,0.558011,0.567416
5,No log,0.357232,0.593023,0.563536,0.577904


  _warn_prf(average, modifier, msg_start, len(result))


Evaluating fold 1


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.46      0.57      0.51        54
    Krankenhaus       0.57      0.47      0.52       100
       Personal       1.00      0.13      0.23        23
 Pflegepersonal       0.59      0.65      0.62        31
anderer Service       0.57      0.07      0.12        61
 mediz. Service       0.54      0.63      0.58       118

      micro avg       0.54      0.46      0.50       387
      macro avg       0.62      0.42      0.43       387
   weighted avg       0.57      0.46      0.46       387

Fold 1 Results - Precision: 0.5391566265060241, Recall: 0.4625322997416021, F1: 0.49791376912378305
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6157.71 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 5655.00 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5325.27 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.215663,0.666667,0.695652,0.680851
2,No log,0.191817,0.721698,0.73913,0.73031
3,No log,0.20712,0.66129,0.792271,0.720879
4,No log,0.19986,0.681913,0.792271,0.732961
5,No log,0.200088,0.687234,0.780193,0.730769


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.64      0.83      0.72        47
    Krankenhaus       0.67      0.86      0.75        96
       Personal       0.90      0.43      0.58        21
 Pflegepersonal       0.72      0.81      0.76        47
anderer Service       0.55      0.59      0.57        49
 mediz. Service       0.71      0.87      0.78       133

      micro avg       0.68      0.80      0.73       393
      macro avg       0.70      0.73      0.69       393
   weighted avg       0.68      0.80      0.73       393

Fold 2 Results - Precision: 0.6767241379310345, Recall: 0.7989821882951654, F1: 0.7327887981330221

=== Final Cross-Validation Results ===
Average Precision: 0.6079403822185293
Average 

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 4430.91 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5349.80 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5732.01 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.467376,0.333333,0.038997,0.069825
2,No log,0.37871,0.594142,0.395543,0.474916
3,No log,0.348459,0.614198,0.554318,0.582723
4,No log,0.34369,0.612245,0.584958,0.598291
5,No log,0.346167,0.61976,0.576602,0.597403


Evaluating fold 1


  _warn_prf(average, modifier, msg_start, len(result))


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(7), np.int64(9), np.int64(11)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.45      0.75      0.56        61
    Krankenhaus       0.59      0.61      0.60        99
       Personal       0.00      0.00      0.00        23
 Pflegepersonal       0.70      0.50      0.58        28
anderer Service       0.44      0.07      0.11        61
 mediz. Service       0.50      0.54      0.52       114

      micro avg       0.52      0.48      0.50       386
      macro avg       0.45      0.41      0.40       386
   weighted avg       0.49      0.48      0.46       386

Fold 1 Results - Precision: 0.5210084033613446, Recall: 0.48186528497409326, F1: 0.5006729475100942
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 5962.48 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 5400.09 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5094.85 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.230841,0.646778,0.639151,0.642942
2,No log,0.212578,0.668192,0.688679,0.678281
3,No log,0.211445,0.682819,0.731132,0.70615
4,No log,0.212596,0.702765,0.71934,0.710956
5,No log,0.208785,0.68913,0.747642,0.717195


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.60      0.88      0.71        57
    Krankenhaus       0.77      0.83      0.80        99
       Personal       0.60      0.29      0.39        21
 Pflegepersonal       0.78      0.63      0.70        46
anderer Service       0.65      0.48      0.55        50
 mediz. Service       0.69      0.82      0.75       131

      micro avg       0.69      0.74      0.71       404
      macro avg       0.68      0.65      0.65       404
   weighted avg       0.70      0.74      0.71       404

Fold 2 Results - Precision: 0.6930232558139535, Recall: 0.7376237623762376, F1: 0.7146282973621103

=== Final Cross-Validation Results ===
Average Precision: 0.6070158295876491
Average 

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 6007.83 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5641.55 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 6047.05 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.578754,0.0,0.0,0.0
2,No log,0.493067,0.0,0.0,0.0
3,No log,0.456439,0.311111,0.033019,0.059701
4,No log,0.450865,0.354978,0.193396,0.250382
5,No log,0.442501,0.402214,0.257075,0.313669




Evaluating fold 1




Unique predicted label IDs: {np.int64(0), np.int64(5), np.int64(7), np.int64(9), np.int64(11)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.45      0.14      0.21        64
    Krankenhaus       0.29      0.33      0.31       112
       Personal       0.00      0.00      0.00        30
 Pflegepersonal       0.71      0.45      0.55        38
anderer Service       0.00      0.00      0.00        78
 mediz. Service       0.39      0.30      0.34       145

      micro avg       0.37      0.23      0.28       467
      macro avg       0.31      0.20      0.24       467
   weighted avg       0.31      0.23      0.25       467

Fold 1 Results - Precision: 0.3741258741258741, Recall: 0.2291220556745182, F1: 0.2841965471447543
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6397.06 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 5837.79 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5551.57 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.359174,0.546196,0.414433,0.471278
2,No log,0.321821,0.518519,0.57732,0.546341
3,No log,0.304909,0.57037,0.635052,0.600976
4,No log,0.299644,0.586916,0.647423,0.615686
5,No log,0.293543,0.609615,0.653608,0.630846


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.54      0.84      0.66        58
    Krankenhaus       0.53      0.78      0.63       108
       Personal       1.00      0.04      0.07        26
 Pflegepersonal       0.67      0.70      0.68        63
anderer Service       0.63      0.18      0.28        66
 mediz. Service       0.62      0.77      0.68       150

      micro avg       0.58      0.65      0.61       471
      macro avg       0.66      0.55      0.50       471
   weighted avg       0.62      0.65      0.58       471

Fold 2 Results - Precision: 0.5842911877394636, Recall: 0.6475583864118896, F1: 0.6143001007049346

=== Final Cross-Validation Results ===
Average Precision: 0.4792085309326689
Average Recall: 0.438

Device set to use cuda:0


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 6553.37 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5842.55 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 6292.45 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.681662,0.245902,0.047771,0.08
2,No log,0.585697,0.393617,0.353503,0.372483
3,No log,0.51812,0.54023,0.449045,0.490435
4,No log,0.467893,0.576208,0.493631,0.531732
5,No log,0.450444,0.570922,0.512739,0.540268


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating fold 1




Unique predicted label IDs: {np.int64(0), np.int64(97), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.52      0.67      0.59        48
    Krankenhaus       0.61      0.58      0.59        93
       Personal       0.50      0.05      0.09        20
 Pflegepersonal       0.54      0.70      0.61        27
anderer Service       0.00      0.00      0.00        53
 mediz. Service       0.50      0.56      0.53       107

      micro avg       0.54      0.48      0.51       348
      macro avg       0.45      0.43      0.40       348
   weighted avg       0.46      0.48      0.46       348

Fold 1 Results - Precision: 0.5424836601307189, Recall: 0.47701149425287354, F1: 0.5076452599388379
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6621.63 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 6176.57 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5905.30 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.336446,0.562914,0.73487,0.6375
2,No log,0.241901,0.581114,0.691643,0.631579
3,No log,0.220495,0.683616,0.697406,0.690442
4,No log,0.22743,0.667575,0.706052,0.686275
5,No log,0.219041,0.656489,0.743516,0.697297


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.56      0.83      0.67        35
    Krankenhaus       0.61      0.84      0.71        87
       Personal       0.56      0.26      0.36        19
 Pflegepersonal       0.72      0.81      0.76        36
anderer Service       0.67      0.45      0.54        40
 mediz. Service       0.62      0.73      0.67       117

      micro avg       0.62      0.72      0.67       334
      macro avg       0.62      0.65      0.62       334
   weighted avg       0.63      0.72      0.66       334

Fold 2 Results - Precision: 0.6223958333333334, Recall: 0.7155688622754491, F1: 0.6657381615598886

=== Final Cross-Validation Results ===
Average Precision: 0.5824397467320261
Average 

Device set to use cuda:0


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 6474.50 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5706.88 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 6111.03 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.684097,0.043478,0.006369,0.011111
2,No log,0.539212,0.469697,0.197452,0.278027
3,No log,0.456455,0.498246,0.452229,0.474124
4,No log,0.413883,0.538012,0.585987,0.560976
5,No log,0.400581,0.591065,0.547771,0.568595


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating fold 1




Unique predicted label IDs: {np.int64(0), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(10), np.int64(11), np.int64(23281)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.49      0.73      0.58        48
    Krankenhaus       0.60      0.55      0.57        93
       Personal       0.00      0.00      0.00        20
 Pflegepersonal       0.65      0.63      0.64        27
anderer Service       0.00      0.00      0.00        53
 mediz. Service       0.48      0.51      0.50       107

      micro avg       0.53      0.45      0.49       348
      macro avg       0.37      0.40      0.38       348
   weighted avg       0.43      0.45      0.44       348

Fold 1 Results - Precision: 0.5302013422818792, Recall: 0.4540229885057471, F1: 0.4891640866873065
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6605.12 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 6045.81 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5875.24 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.280354,0.55157,0.708934,0.620429
2,No log,0.243453,0.74502,0.538905,0.625418
3,No log,0.209263,0.657068,0.723343,0.688615
4,No log,0.209812,0.682796,0.731988,0.706537
5,No log,0.206193,0.665829,0.763689,0.711409


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.64      0.83      0.73        35
    Krankenhaus       0.62      0.79      0.70        87
       Personal       0.57      0.42      0.48        19
 Pflegepersonal       0.72      0.81      0.76        36
anderer Service       0.47      0.53      0.49        40
 mediz. Service       0.66      0.75      0.70       117

      micro avg       0.63      0.73      0.68       334
      macro avg       0.62      0.69      0.64       334
   weighted avg       0.63      0.73      0.67       334

Fold 2 Results - Precision: 0.6288659793814433, Recall: 0.7305389221556886, F1: 0.6759002770083102

=== Final Cross-Validation Results ===
Average Precision: 0.5795336608316612
Average 

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0

Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 6547.25 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5869.67 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 6185.33 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.700529,0.114286,0.025478,0.041667
2,No log,0.602503,0.425926,0.219745,0.289916
3,No log,0.530164,0.483254,0.321656,0.386233
4,No log,0.496689,0.548936,0.410828,0.469945
5,No log,0.465845,0.525253,0.496815,0.510638


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating fold 1




Unique predicted label IDs: {np.int64(0), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.45      0.69      0.54        48
    Krankenhaus       0.57      0.51      0.53        93
       Personal       0.33      0.05      0.09        20
 Pflegepersonal       0.53      0.59      0.56        27
anderer Service       0.00      0.00      0.00        53
 mediz. Service       0.45      0.50      0.47       107

      micro avg       0.49      0.43      0.46       348
      macro avg       0.39      0.39      0.37       348
   weighted avg       0.41      0.43      0.41       348

Fold 1 Results - Precision: 0.487012987012987, Recall: 0.43103448275862066, F1: 0.45731707317073167
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6501.59 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 6042.21 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5884.26 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.372455,0.522167,0.610951,0.563081
2,No log,0.266914,0.588542,0.651297,0.618331
3,No log,0.239906,0.598485,0.682997,0.637954
4,No log,0.256053,0.635171,0.697406,0.664835
5,No log,0.254847,0.594724,0.714697,0.649215


There were missing keys in the checkpoint model loaded: ['lm_head.decoder.weight', 'lm_head.decoder.bias'].


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.58      0.83      0.68        35
    Krankenhaus       0.56      0.80      0.66        87
       Personal       0.27      0.16      0.20        19
 Pflegepersonal       0.61      0.75      0.67        36
anderer Service       0.50      0.15      0.23        40
 mediz. Service       0.65      0.68      0.66       117

      micro avg       0.59      0.64      0.61       334
      macro avg       0.53      0.56      0.52       334
   weighted avg       0.58      0.64      0.59       334

Fold 2 Results - Precision: 0.5895316804407713, Recall: 0.6407185628742516, F1: 0.6140602582496414

=== Final Cross-Validation Results ===
Average Precision: 0.53827233372

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 6548.16 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5943.47 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 6497.26 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.755667,0.0,0.0,0.0
2,No log,0.512207,0.714286,0.027624,0.053191
3,No log,0.448723,0.546875,0.19337,0.285714
4,No log,0.424061,0.657343,0.259669,0.372277
5,No log,0.413659,0.653631,0.323204,0.432532




Evaluating fold 1




Unique predicted label IDs: {np.int64(0), np.int64(5), np.int64(7), np.int64(9), np.int64(11)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.62      0.44      0.52        54
    Krankenhaus       0.45      0.26      0.33       100
       Personal       0.00      0.00      0.00        23
 Pflegepersonal       0.79      0.35      0.49        31
anderer Service       0.00      0.00      0.00        61
 mediz. Service       0.45      0.43      0.44       118

      micro avg       0.50      0.29      0.37       387
      macro avg       0.38      0.25      0.30       387
   weighted avg       0.40      0.29      0.33       387

Fold 1 Results - Precision: 0.5, Recall: 0.28940568475452194, F1: 0.36661211129296234
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 6582.43 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 6107.18 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5756.84 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.349676,0.567474,0.396135,0.466572
2,No log,0.315241,0.620283,0.635266,0.627685
3,No log,0.306487,0.58952,0.652174,0.619266
4,No log,0.291857,0.626424,0.664251,0.644783
5,No log,0.285538,0.650124,0.63285,0.641371


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(3), np.int64(5), np.int64(6), np.int64(7), np.int64(9), np.int64(11), np.int64(13)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.49      0.83      0.62        47
    Krankenhaus       0.57      0.75      0.65        96
       Personal       0.67      0.19      0.30        21
 Pflegepersonal       0.63      0.51      0.56        47
anderer Service       0.75      0.12      0.21        49
 mediz. Service       0.62      0.80      0.70       133

      micro avg       0.59      0.64      0.61       393
      macro avg       0.62      0.53      0.51       393
   weighted avg       0.61      0.64      0.58       393

Fold 2 Results - Precision: 0.586046511627907, Recall: 0.6412213740458015, F1: 0.6123936816524909

=== Final Cross-Validation Results ===
Average Precision: 0.5430232558139535
Average Recall: 0.4653

Device set to use cuda:0


Starting fold 1/2


Map: 100%|██████████| 359/359 [00:00<00:00, 5952.92 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5451.36 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5795.55 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 1


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,8.840662,0.026854,0.246334,0.048429
2,No log,6.030432,0.031238,0.246334,0.055446
3,No log,4.913649,0.026863,0.246334,0.048443
4,No log,4.36038,0.027255,0.246334,0.04908
5,No log,4.173833,0.060092,0.228739,0.09518


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


Evaluating fold 1




Unique predicted label IDs: {np.int64(0), np.int64(7)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.00      0.00      0.00        44
    Krankenhaus       0.07      0.93      0.12       102
       Personal       0.00      0.00      0.00        25
 Pflegepersonal       0.00      0.00      0.00        32
anderer Service       0.00      0.00      0.00        70
 mediz. Service       0.00      0.00      0.00       100

      micro avg       0.07      0.25      0.10       373
      macro avg       0.01      0.16      0.02       373
   weighted avg       0.02      0.25      0.03       373

Fold 1 Results - Precision: 0.0659264399722415, Recall: 0.2546916890080429, F1: 0.10474090407938257
Starting fold 2/2


Map: 100%|██████████| 360/360 [00:00<00:00, 5967.59 examples/s]
Map: 100%|██████████| 179/179 [00:00<00:00, 5378.35 examples/s]
Map: 100%|██████████| 180/180 [00:00<00:00, 5283.79 examples/s]


['I-Pflegepersonal' 'B-Arzt' 'B-anderer Service' 'I-Personal' 'O'
 'B-mediz. Service' 'I-anderer Service' 'I-Arzt' 'I-mediz. Service'
 'B-Krankenhaus' 'I-Krankenhaus' 'B-Pflegepersonal' 'B-Personal']
{0: np.float64(261.87179487179486), 1: np.float64(5.493813878429263), 2: np.float64(5.651909241837299), 3: np.float64(71.41958041958041), 4: np.float64(0.08550450420280634), 5: np.float64(2.846432552954292), 6: np.float64(78.56153846153846), 7: np.float64(18.705128205128204), 8: np.float64(23.806526806526808), 9: np.float64(3.9280769230769232), 10: np.float64(29.096866096866098), 11: np.float64(9.465245597775718), 12: np.float64(13.545092838196286)}
Training fold 2


Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,2.922192,0.210667,0.204663,0.207622
2,No log,1.876802,0.30719,0.365285,0.333728
3,No log,1.305647,0.439353,0.42228,0.430647
4,No log,1.047339,0.451835,0.510363,0.479319
5,No log,0.964234,0.441913,0.502591,0.470303


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


Evaluating fold 2


Unique predicted label IDs: {np.int64(0), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(11)}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.49      0.69      0.57        42
    Krankenhaus       0.41      0.74      0.53        96
       Personal       0.00      0.00      0.00        22
 Pflegepersonal       0.54      0.77      0.63        48
anderer Service       0.00      0.00      0.00        55
 mediz. Service       0.45      0.65      0.53       106

      micro avg       0.46      0.56      0.50       369
      macro avg       0.32      0.48      0.38       369
   weighted avg       0.36      0.56      0.44       369

Fold 2 Results - Precision: 0.4557522123893805, Recall: 0.5582655826558266, F1: 0.5018270401948843

=== Final Cross-Validation Results ===
Average Precision: 0.260839326180811
Average Recall: 0.4064786358319348
A



Training complete. Model directory for fold 1 deleted to free memory.
Training complete. Model directory for fold 2 deleted to free memory.
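
The weight dictionaries printed before training (one weight per label ID) are consistent with "balanced" inverse-frequency weighting, `w_c = N / (K * n_c)`, where `N` is the number of labelled tokens, `K` the number of classes and `n_c` the number of tokens with label `c`. Whether `ate_model` computes them exactly this way is not shown in these outputs, so the sketch below is an assumption. The token counts are hypothetical as well: the notebook does not print them, and they were back-calculated so that the formula reproduces the three-label weights printed for the `GerMedBERT/medbert-512` run (`{0: 0.3705..., 1: 3.7868..., 2: 27.0185...}`).

```python
from collections import Counter


def balanced_class_weights(label_ids):
    """Return {label_id: N / (K * n_c)} for the labels present in label_ids."""
    counts = Counter(label_ids)
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {c: n_total / (n_classes * n) for c, n in sorted(counts.items())}


# Hypothetical token counts (NOT printed by the notebook), chosen so that the
# formula matches the printed weights. IDs are assumed to follow the printed
# label order: 0 = O, 1 = B-ASPECT, 2 = I-ASPECT.
token_counts = {0: 9188, 1: 899, 2: 126}
label_ids = [c for c, n in token_counts.items() for _ in range(n)]

weights = balanced_class_weights(label_ids)
print(weights)
```

If weights like these feed a weighted cross-entropy loss, an error on the rare label 2 costs roughly 73 times as much as an error on label 0, which would be the intended counterweight to the dominance of `O` tokens.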



In [6]:
ate_model(data, "GerMedBERT/medbert-512", rn1=42, rn2=42, epochs=7)

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0


mapping of the data



Map: 100%|██████████| 575/575 [00:00<00:00, 5966.16 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 4142.07 examples/s]
  trainer = Trainer(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


['O' 'B-ASPECT' 'I-ASPECT']
{0: np.float64(0.3705195182121608), 1: np.float64(3.78680014831294), 2: np.float64(27.01851851851852)}
Training results for GerMedBERT/medbert-512 with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.30263,0.657534,0.596273,0.625407
2,2.032800,0.241518,0.691781,0.627329,0.65798
3,2.032800,0.276279,0.671875,0.534161,0.595156
4,0.148300,0.434866,0.620112,0.689441,0.652941
5,0.148300,0.502979,0.658385,0.658385,0.658385
6,0.046400,0.574677,0.654545,0.670807,0.662577
7,0.009500,0.630149,0.634286,0.689441,0.660714




There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


mapping the test data



Map: 100%|██████████| 72/72 [00:00<00:00, 4259.98 examples/s]


Unique predicted label IDs: {np.int64(0), np.int64(1), np.int64(2)}
Expected label IDs: {0, 1, 2}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.69      0.65      0.67       135

   micro avg       0.69      0.65      0.67       135
   macro avg       0.69      0.65      0.67       135
weighted avg       0.69      0.65      0.67       135

Precision Score: 0.6875
Recall Score: 0.6518518518518519
F1 Score: 0.6692015209125476
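
Assuming these scores come from the `seqeval` functions imported at the top of the notebook, they are entity-level: a predicted aspect counts as correct only if its whole span matches a gold span. The three values can be reproduced from three counts. The number of gold entities (135) is the support in the report; the number of predicted entities (128) and of exact matches (88) are not printed and are inferred here from the printed precision and recall.

```python
def prf(true_positives, n_predicted, n_gold):
    """Entity-level precision, recall and F1 from raw counts."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 135 = support from the report; 128 and 88 are inferred, not printed.
precision, recall, f1 = prf(true_positives=88, n_predicted=128, n_gold=135)
print(precision, recall, f1)
```

With a single entity type, the micro, macro and weighted averages coincide, which is why all rows of the report show the same numbers.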
Tokens     : ['Ich', 'habe', 'jeden', 'Tag', 'die', 'Stationsärztin', 'gesehen', ',', 'hier', 'nochmal', 'vielen', 'Dank', 'für', 'die', 'Geduld', 'und', 'die', 'fachliche', 'Betreuung', 'und', 'die', 'Empathie', 'die', 'mir', 'entgegengebracht', 'wurde', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'