## Aspect Term Extraction (ATE): Training and Fine-Tuning Large Language Models on German Hospital Reviews Using OB Tagging
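
The runs in this notebook use only two tags, `O` and `B-ASPECT` (the label set printed in the logs is `['B-ASPECT' 'O']`), and they report span-level precision, recall, and F1. As a minimal sketch, assuming every `B-ASPECT` token counts as its own one-token aspect span (which is how a B/O-only sequence decomposes when no `I-` tag exists), span-level scoring reduces to counting positions. The helper name `ob_span_scores` is hypothetical. The first sentence pair is taken from the test output logged further down; the second prediction is invented for illustration.

```python
def ob_span_scores(true_tags, pred_tags):
    """Span-level precision, recall and F1 for B/O-only tag sequences.

    Assumes each 'B-ASPECT' token is a separate one-token span, so a
    predicted span is correct exactly when the gold tag at the same
    position is also 'B-ASPECT'.
    """
    tp = fp = fn = 0
    for gold_seq, pred_seq in zip(true_tags, pred_tags):
        if len(gold_seq) != len(pred_seq):
            raise ValueError("gold and predicted sequences must align")
        for gold, pred in zip(gold_seq, pred_seq):
            if pred == "B-ASPECT" and gold == "B-ASPECT":
                tp += 1
            elif pred == "B-ASPECT":
                fp += 1
            elif gold == "B-ASPECT":
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Sentence 1: gold and predicted tags as printed in the logs
# ('Operation' is predicted as an aspect although the gold tag is 'O').
gold_1 = ["O"] * 13
pred_1 = ["O"] * 6 + ["B-ASPECT"] + ["O"] * 6

# Sentence 2: gold tags from the logs; the prediction is hypothetical.
gold_2 = ["O", "O", "B-ASPECT"] + ["O"] * 12
pred_2 = list(gold_2)

precision, recall, f1 = ob_span_scores([gold_1, gold_2], [pred_1, pred_2])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case there is one correct span, one spurious span, and no missed span, so precision is lower than recall.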


In [1]:
# Standard library
import ast  # To safely evaluate strings as Python objects
import os
import sys

# Third-party
import evaluate
import numpy as np
import pandas as pd
import spacy
import torch
from datasets import Dataset
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

# Make the parent directory importable so that local modules can be loaded:
sys.path.append('../')
from functions.ate_model_train_OB import *



In [2]:
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPU device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

Is CUDA available: True
CUDA version: 12.6
GPU device name: NVIDIA A30


In [3]:
# Load the dataset into a DataFrame
data = pd.read_csv("./data/hospitalABSA/patient_review_labels_absa.csv")
data_ano = pd.read_csv("./data/hospitalABSA/patient_review_labels_absa_ano.csv")
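
Annotation columns in CSV files of this kind are often stored as stringified Python lists, which is presumably why `ast` is imported above. The following is a minimal sketch of decoding such cells with `ast.literal_eval`. The column names `tokens` and `labels` and the sample row are hypothetical, since the actual schema of `patient_review_labels_absa.csv` is not shown here.

```python
import ast
import csv
import io

# Build a tiny in-memory CSV with hypothetical columns; the real file may differ.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tokens", "labels"])
writer.writerow([
    str(["Die", "Pflege", "war", "gut", "."]),
    str(["O", "B-ASPECT", "O", "O", "O"]),
])
buffer.seek(0)


def parse_list_cell(cell):
    """Turn a stringified list such as "['O', 'B-ASPECT']" into a real list."""
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


rows = [
    {column: parse_list_cell(cell) for column, cell in row.items()}
    for row in csv.DictReader(buffer)
]
print(rows[0]["tokens"])
print(rows[0]["labels"])
```

With pandas, the same helper could be applied to a column via `Series.apply`, assuming the column really holds list literals.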

In [4]:
models = ["google-bert/bert-base-german-cased", "dbmdz/bert-base-german-cased", "dbmdz/bert-base-german-uncased",
          "FacebookAI/xlm-roberta-base", "TUM/GottBERT_base_best", "TUM/GottBERT_filtered_base_best", "TUM/GottBERT_base_last",
          "distilbert/distilbert-base-german-cased", "GerMedBERT/medbert-512", "deepset/gbert-base"]

### 1. Train standard ATE models for 5, 6, 7, 8, 10, and 12 epochs
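
Each run in this section prints a per-epoch table of training loss, validation loss, precision, recall, and F1. One way to read these tables is to pick the epoch with the highest validation F1 and compare it with the epoch of lowest validation loss. The sketch below does this for the 5-epoch table logged for `google-bert/bert-base-german-cased`. The helper name `best_epoch` is hypothetical and is not part of `ate_model`.

```python
import csv
import io

# Per-epoch table copied from the 5-epoch run of google-bert/bert-base-german-cased.
EPOCH_TABLE = """\
Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.142769,0.819328,0.747126,0.781563
2,0.189000,0.166361,0.833333,0.785441,0.808679
3,0.083200,0.244248,0.795276,0.773946,0.784466
4,0.025800,0.267005,0.818898,0.796935,0.807767
5,0.013500,0.337622,0.823045,0.766284,0.793651
"""


def best_epoch(table_text, column, highest=True):
    """Return (epoch, value) for the best entry in a numeric column."""
    rows = list(csv.DictReader(io.StringIO(table_text)))
    pick = max if highest else min
    row = pick(rows, key=lambda r: float(r[column]))
    return int(row["Epoch"]), float(row[column])


best_f1 = best_epoch(EPOCH_TABLE, "F1")
lowest_loss = best_epoch(EPOCH_TABLE, "Validation Loss", highest=False)
print("best F1:", best_f1)
print("lowest validation loss:", lowest_loss)
```

In this run the validation loss is lowest after epoch 1 while F1 peaks at epoch 2, so the two selection criteria need not agree.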

In [5]:
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=5, save=True)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5543.37 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4695.36 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for google-bert/bert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.142769,0.819328,0.747126,0.781563
2,0.189000,0.166361,0.833333,0.785441,0.808679
3,0.083200,0.244248,0.795276,0.773946,0.784466
4,0.025800,0.267005,0.818898,0.796935,0.807767
5,0.013500,0.337622,0.823045,0.766284,0.793651



Best Model saved at: ./saved_models/ate_google-bert_bert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_google-bert_bert-base-german-cased_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4313.52 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.92      0.78      0.84       323

   micro avg       0.92      0.78      0.84       323
   macro avg       0.92      0.78      0.84       323
weighted avg       0.92      0.78      0.84       323

Precision Score: 0.9227941176470589
Recall Score: 0.7770897832817337
F1 Score: 0.8436974789915966
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5988.42 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4755.98 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.141073,0.832618,0.76378,0.796715
2,0.203200,0.158556,0.849785,0.779528,0.813142
3,0.095900,0.204931,0.849593,0.822835,0.836
4,0.037700,0.25093,0.833333,0.807087,0.82
5,0.020800,0.290485,0.836735,0.807087,0.821643



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-cased_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4402.70 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.74      0.81       315

   micro avg       0.91      0.74      0.81       315
   macro avg       0.91      0.74      0.81       315
weighted avg       0.91      0.74      0.81       315

Precision Score: 0.9098039215686274
Recall Score: 0.7365079365079366
F1 Score: 0.8140350877192983
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5413.13 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4339.88 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-uncased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.142309,0.832618,0.769841,0.8
2,0.192200,0.165703,0.855204,0.75,0.799154
3,0.091500,0.243324,0.843049,0.746032,0.791579
4,0.037300,0.225724,0.839506,0.809524,0.824242
5,0.025700,0.257734,0.852941,0.805556,0.828571



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-uncased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-uncased_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4205.52 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.76      0.82       316

   micro avg       0.88      0.76      0.82       316
   macro avg       0.88      0.76      0.82       316
weighted avg       0.88      0.76      0.82       316

Precision Score: 0.8827838827838828
Recall Score: 0.7626582278481012
F1 Score: 0.8183361629881153
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5695.11 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4952.30 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for FacebookAI/xlm-roberta-base with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.191942,0.822642,0.754325,0.787004
2,0.224400,0.180229,0.842697,0.778547,0.809353
3,0.153900,0.174498,0.865169,0.799308,0.830935
4,0.095900,0.182869,0.84083,0.84083,0.84083
5,0.067900,0.199222,0.83737,0.83737,0.83737



Best Model saved at: ./saved_models/ate_FacebookAI_xlm-roberta-base_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_FacebookAI_xlm-roberta-base_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4555.72 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.84      0.77      0.80       346

   micro avg       0.84      0.77      0.80       346
   macro avg       0.84      0.77      0.80       346
weighted avg       0.84      0.77      0.80       346

Precision Score: 0.8369905956112853
Recall Score: 0.7716763005780347
F1 Score: 0.8030075187969926
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6388.83 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4978.78 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_best with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.158621,0.821782,0.783019,0.801932
2,0.183600,0.152034,0.835749,0.816038,0.825776
3,0.108300,0.189798,0.835749,0.816038,0.825776
4,0.054900,0.199698,0.824074,0.839623,0.831776
5,0.037200,0.216283,0.824645,0.820755,0.822695



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_best_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_best_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4695.27 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.82      0.84       279

   micro avg       0.86      0.82      0.84       279
   macro avg       0.86      0.82      0.84       279
weighted avg       0.86      0.82      0.84       279

Precision Score: 0.8641509433962264
Recall Score: 0.8207885304659498
F1 Score: 0.8419117647058825
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6355.24 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4908.18 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_filtered_base_best with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.120732,0.836634,0.79717,0.816425
2,0.174700,0.149519,0.801843,0.820755,0.811189
3,0.100100,0.196196,0.809302,0.820755,0.814988
4,0.041500,0.214591,0.813636,0.84434,0.828704
5,0.029300,0.223136,0.823256,0.834906,0.82904



Best Model saved at: ./saved_models/ate_TUM_GottBERT_filtered_base_best_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_filtered_base_best_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4499.10 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.78      0.82       279

   micro avg       0.87      0.78      0.82       279
   macro avg       0.87      0.78      0.82       279
weighted avg       0.87      0.78      0.82       279

Precision Score: 0.8656126482213439
Recall Score: 0.7849462365591398
F1 Score: 0.8233082706766918
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6280.13 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4784.56 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_last with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.152716,0.835897,0.768868,0.800983
2,0.177900,0.131368,0.845,0.79717,0.820388
3,0.110300,0.153732,0.833333,0.825472,0.829384
4,0.056700,0.214831,0.828431,0.79717,0.8125
5,0.037300,0.21799,0.846154,0.830189,0.838095



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_last_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_last_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4513.53 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.81      0.84       279

   micro avg       0.87      0.81      0.84       279
   macro avg       0.87      0.81      0.84       279
weighted avg       0.87      0.81      0.84       279

Precision Score: 0.8664122137404581
Recall Score: 0.8136200716845878
F1 Score: 0.8391866913123844
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6291.09 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5064.49 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for distilbert/distilbert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.140078,0.818548,0.799213,0.808765
2,0.208000,0.163722,0.843049,0.740157,0.78826
3,0.104600,0.198602,0.873832,0.73622,0.799145
4,0.054800,0.207215,0.814815,0.779528,0.796781
5,0.039600,0.216795,0.829167,0.783465,0.805668



Best Model saved at: ./saved_models/ate_distilbert_distilbert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_distilbert_distilbert-base-german-cased_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4786.63 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.77      0.79       315

   micro avg       0.81      0.77      0.79       315
   macro avg       0.81      0.77      0.79       315
weighted avg       0.81      0.77      0.79       315

Precision Score: 0.8114478114478114
Recall Score: 0.765079365079365
F1 Score: 0.7875816993464052
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5678.35 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4488.83 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for GerMedBERT/medbert-512 with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.113728,0.828054,0.792208,0.809735
2,0.190200,0.145504,0.789683,0.861472,0.824017
3,0.087200,0.198338,0.834081,0.805195,0.819383
4,0.035500,0.21509,0.818565,0.839827,0.82906
5,0.018600,0.263873,0.84,0.818182,0.828947



Best Model saved at: ./saved_models/ate_GerMedBERT_medbert-512_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_GerMedBERT_medbert-512_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4200.60 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.78      0.83       288

   micro avg       0.90      0.78      0.83       288
   macro avg       0.90      0.78      0.83       288
weighted avg       0.90      0.78      0.83       288

Precision Score: 0.8995983935742972
Recall Score: 0.7777777777777778
F1 Score: 0.8342644320297952
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inferen

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5799.49 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4641.50 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for deepset/gbert-base with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.129026,0.808,0.795276,0.801587
2,0.190300,0.153711,0.833333,0.826772,0.83004
3,0.095500,0.203675,0.836735,0.807087,0.821643
4,0.038200,0.20933,0.830769,0.850394,0.840467
5,0.023900,0.23232,0.825095,0.854331,0.839458



Best Model saved at: ./saved_models/ate_deepset_gbert-base_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_deepset_gbert-base_42_42_5
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4324.51 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.79      0.84       315

   micro avg       0.91      0.79      0.84       315
   macro avg       0.91      0.79      0.84       315
weighted avg       0.91      0.79      0.84       315

Precision Score: 0.9084249084249084
Recall Score: 0.7873015873015873
F1 Score: 0.8435374149659863
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
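
Every run above logs the label set `['B-ASPECT' 'O']` followed by one weight per label ID, about 4.755 for label 0 and 0.559 for label 1. How `ate_model` computes these weights is not shown here. The values are, however, consistent with the widely used "balanced" heuristic, weight_c = N / (K * n_c), where N is the number of labelled tokens, K the number of classes, and n_c the count of class c. Under that assumption, roughly 10.5% of the training tokens would carry `B-ASPECT`. A minimal sketch with hypothetical counts and a hypothetical helper name `balanced_class_weights`:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), so rare classes get larger weights."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in sorted(counts.items())}


# Hypothetical token labels: 2 aspect tokens among 20 tokens in total.
toy_labels = ["B-ASPECT"] * 2 + ["O"] * 18
weights = balanced_class_weights(toy_labels)
print(weights)
```

Weights of this kind are typically passed to a weighted cross-entropy loss so that errors on the rare `B-ASPECT` class cost more than errors on `O`.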

In [5]:
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=6, save=True)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 3446.21 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4640.89 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for google-bert/bert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.143136,0.821577,0.758621,0.788845
2,0.188600,0.152802,0.817829,0.808429,0.813102
3,0.080800,0.243848,0.820513,0.735632,0.775758
4,0.023300,0.261013,0.806324,0.781609,0.793774
5,0.013100,0.32709,0.807229,0.770115,0.788235
6,0.013100,0.343247,0.806202,0.796935,0.801541



Best Model saved at: ./saved_models/ate_google-bert_bert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_google-bert_bert-base-german-cased_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4498.15 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.80      0.85       323

   micro avg       0.90      0.80      0.85       323
   macro avg       0.90      0.80      0.85       323
weighted avg       0.90      0.80      0.85       323

Precision Score: 0.9020979020979021
Recall Score: 0.7987616099071208
F1 Score: 0.8472906403940887
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6118.30 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4703.96 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.163448,0.822034,0.76378,0.791837
2,0.198700,0.14319,0.85124,0.811024,0.830645
3,0.099700,0.197719,0.849802,0.846457,0.848126
4,0.041100,0.225006,0.837945,0.834646,0.836292
5,0.023200,0.275418,0.831373,0.834646,0.833006
6,0.023200,0.290995,0.838583,0.838583,0.838583



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-cased_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4410.64 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.79      0.85       315

   micro avg       0.91      0.79      0.85       315
   macro avg       0.91      0.79      0.85       315
weighted avg       0.91      0.79      0.85       315

Precision Score: 0.9124087591240876
Recall Score: 0.7936507936507936
F1 Score: 0.8488964346349744
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5681.95 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4751.13 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-uncased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.148295,0.828194,0.746032,0.784969
2,0.191100,0.168569,0.834783,0.761905,0.79668
3,0.091400,0.231687,0.849138,0.781746,0.81405
4,0.031300,0.257203,0.83913,0.765873,0.80083
5,0.020400,0.28901,0.850427,0.789683,0.81893
6,0.020400,0.307134,0.830579,0.797619,0.813765



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-uncased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-uncased_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4484.81 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.89      0.71      0.79       316

   micro avg       0.89      0.71      0.79       316
   macro avg       0.89      0.71      0.79       316
weighted avg       0.89      0.71      0.79       316

Precision Score: 0.8884462151394422
Recall Score: 0.7056962025316456
F1 Score: 0.7865961199294532
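The three scores above are linked: F1 is the harmonic mean of precision and recall. The sketch below reproduces them from entity counts. The counts are not printed in the log; they are inferred from the reported scores and the support of 316, so treat them as an assumption.

```python
# Counts inferred from the reported scores and support (an assumption, not log output).
true_positives = 223   # predicted aspect terms that match a gold aspect term
predicted_total = 251  # all predicted aspect terms
gold_total = 316       # "support" in the classification report

precision = true_positives / predicted_total
recall = true_positives / gold_total
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.4f} {recall:.4f} {f1:.4f}")
```

Read this way, the gap between precision and recall means the model misses about 93 of the 316 gold aspect terms while producing about 28 spurious ones.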
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5849.37 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5124.35 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for FacebookAI/xlm-roberta-base with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.174645,0.827715,0.764706,0.794964
2,0.220300,0.15919,0.79661,0.813149,0.804795
3,0.156000,0.193529,0.843636,0.802768,0.822695
4,0.093800,0.199891,0.830325,0.795848,0.812721
5,0.072100,0.216175,0.826389,0.823529,0.824957
6,0.072100,0.252078,0.823322,0.806228,0.814685



Best Model saved at: ./saved_models/ate_FacebookAI_xlm-roberta-base_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_FacebookAI_xlm-roberta-base_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4647.53 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.84      0.81      0.82       346

   micro avg       0.84      0.81      0.82       346
   macro avg       0.84      0.81      0.82       346
weighted avg       0.84      0.81      0.82       346

Precision Score: 0.8403614457831325
Recall Score: 0.8063583815028902
F1 Score: 0.8230088495575221
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
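The token and label printouts above can be turned into aspect-term lists for error inspection. This is a minimal sketch with a hypothetical helper `extract_aspects`; it assumes that every `B-ASPECT` token counts as its own aspect term, since no `I-` tag appears in this log.

```python
def extract_aspects(tokens, labels, aspect_tag="B-ASPECT"):
    """Return the tokens tagged as aspect terms (hypothetical helper)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == aspect_tag]


# First example sentence from the log above.
tokens = ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation',
          'vergangen', 'und', 'es', 'mir', 'gut', '.']
true_labels = ['O'] * 13
pred_labels = ['O'] * 6 + ['B-ASPECT'] + ['O'] * 6

gold_aspects = extract_aspects(tokens, true_labels)
pred_aspects = extract_aspects(tokens, pred_labels)

# Predicted aspect terms with no gold counterpart in this sentence.
false_positives = [a for a in pred_aspects if a not in gold_aspects]
print(gold_aspects, pred_aspects, false_positives)
```

In this sentence every model shown so far tags "Operation" as an aspect although the gold annotation does not, which makes it a recurring false positive rather than a one-off error.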

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6512.67 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5138.15 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_best with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.147684,0.852041,0.787736,0.818627
2,0.182400,0.140712,0.827103,0.834906,0.830986
3,0.108500,0.183245,0.849246,0.79717,0.822384
4,0.054400,0.207949,0.822115,0.806604,0.814286
5,0.035500,0.270277,0.818605,0.830189,0.824356
6,0.035500,0.314616,0.811594,0.792453,0.801909



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_best_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_best_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4832.09 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.75      0.81       279

   micro avg       0.88      0.75      0.81       279
   macro avg       0.88      0.75      0.81       279
weighted avg       0.88      0.75      0.81       279

Precision Score: 0.8818565400843882
Recall Score: 0.7491039426523297
F1 Score: 0.810077519379845
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6548.45 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5106.93 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_filtered_base_best with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.150857,0.805687,0.801887,0.803783
2,0.181600,0.131269,0.839623,0.839623,0.839623
3,0.101600,0.229412,0.854839,0.75,0.798995
4,0.052000,0.183269,0.781893,0.896226,0.835165
5,0.035000,0.211739,0.808036,0.853774,0.830275
6,0.035000,0.221422,0.818182,0.806604,0.812352



Best Model saved at: ./saved_models/ate_TUM_GottBERT_filtered_base_best_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_filtered_base_best_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4686.32 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.80      0.84       279

   micro avg       0.88      0.80      0.84       279
   macro avg       0.88      0.80      0.84       279
weighted avg       0.88      0.80      0.84       279

Precision Score: 0.8784313725490196
Recall Score: 0.8028673835125448
F1 Score: 0.8389513108614233
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6531.65 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5067.10 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_last with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.163724,0.833333,0.754717,0.792079
2,0.183700,0.135094,0.834862,0.858491,0.846512
3,0.111700,0.156065,0.822727,0.853774,0.837963
4,0.060200,0.217409,0.837321,0.825472,0.831354
5,0.038400,0.278484,0.833333,0.801887,0.817308
6,0.038400,0.281253,0.825472,0.825472,0.825472



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_last_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_last_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4586.40 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.89      0.77      0.82       279

   micro avg       0.89      0.77      0.82       279
   macro avg       0.89      0.77      0.82       279
weighted avg       0.89      0.77      0.82       279

Precision Score: 0.8916666666666667
Recall Score: 0.7670250896057348
F1 Score: 0.8246628131021194
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6517.57 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5066.92 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for distilbert/distilbert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.14471,0.850877,0.76378,0.804979
2,0.202000,0.153486,0.837607,0.771654,0.803279
3,0.102500,0.205724,0.866359,0.740157,0.798301
4,0.048000,0.218037,0.829787,0.767717,0.797546
5,0.035300,0.252324,0.843478,0.76378,0.801653
6,0.035300,0.256023,0.839662,0.783465,0.810591



Best Model saved at: ./saved_models/ate_distilbert_distilbert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_distilbert_distilbert-base-german-cased_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4864.01 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.77      0.83       315

   micro avg       0.90      0.77      0.83       315
   macro avg       0.90      0.77      0.83       315
weighted avg       0.90      0.77      0.83       315

Precision Score: 0.9
Recall Score: 0.7714285714285715
F1 Score: 0.8307692307692307
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5732.79 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4576.56 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for GerMedBERT/medbert-512 with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.115652,0.816667,0.848485,0.832272
2,0.187700,0.127233,0.815574,0.861472,0.837895
3,0.087300,0.212654,0.834821,0.809524,0.821978
4,0.037900,0.214262,0.831169,0.831169,0.831169
5,0.017900,0.284796,0.829787,0.844156,0.83691
6,0.017900,0.283669,0.827731,0.852814,0.840085



Best Model saved at: ./saved_models/ate_GerMedBERT_medbert-512_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_GerMedBERT_medbert-512_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4226.88 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.93      0.77      0.84       288

   micro avg       0.93      0.77      0.84       288
   macro avg       0.93      0.77      0.84       288
weighted avg       0.93      0.77      0.84       288

Precision Score: 0.9324894514767933
Recall Score: 0.7673611111111112
F1 Score: 0.8419047619047619
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5868.43 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4782.45 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for deepset/gbert-base with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.141743,0.832618,0.76378,0.796715
2,0.195200,0.146101,0.8327,0.862205,0.847195
3,0.094100,0.255068,0.884444,0.783465,0.830898
4,0.032100,0.248563,0.843373,0.826772,0.83499
5,0.020800,0.272559,0.830769,0.850394,0.840467
6,0.020800,0.302068,0.835341,0.818898,0.827038



Best Model saved at: ./saved_models/ate_deepset_gbert-base_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_deepset_gbert-base_42_42_6
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4461.56 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.81      0.83       315

   micro avg       0.85      0.81      0.83       315
   macro avg       0.85      0.81      0.83       315
weighted avg       0.85      0.81      0.83       315

Precision Score: 0.85
Recall Score: 0.8095238095238095
F1 Score: 0.8292682926829269
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O',
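
To compare the 6-epoch runs at a glance, the test-set F1 scores can be collected and sorted. This sketch includes only runs whose model name appears next to the score in this part of the log; the dictionary name is illustrative.

```python
# Test-set F1 scores for the 6-epoch runs, copied from the logs above.
test_f1_6_epochs = {
    "dbmdz/bert-base-german-uncased": 0.7865961199294532,
    "FacebookAI/xlm-roberta-base": 0.8230088495575221,
    "TUM/GottBERT_base_best": 0.810077519379845,
    "TUM/GottBERT_filtered_base_best": 0.8389513108614233,
    "TUM/GottBERT_base_last": 0.8246628131021194,
    "distilbert/distilbert-base-german-cased": 0.8307692307692307,
    "GerMedBERT/medbert-512": 0.8419047619047619,
    "deepset/gbert-base": 0.8292682926829269,
}

# Sort from highest to lowest F1.
ranking = sorted(test_f1_6_epochs.items(), key=lambda kv: kv[1], reverse=True)
for name, f1 in ranking:
    print(f"{f1:.4f}  {name}")
```

The support column differs between models (279 to 346), so the scores are not computed over identical label sets, presumably because of tokenizer differences. Small gaps between models should therefore be read with care.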

In [6]:
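# Fine-tune every candidate checkpoint for 7 epochs.
# rn1 and rn2 are reported in the output as the random seeds;
# save=True writes the best model and its tokenizer to disk.
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=7, save=True)
    print()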

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5782.35 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4778.19 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for google-bert/bert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.152189,0.838298,0.754789,0.794355
2,0.191800,0.140404,0.84898,0.796935,0.822134
3,0.081200,0.261907,0.795181,0.758621,0.776471
4,0.022200,0.306329,0.82,0.785441,0.802348
5,0.015600,0.333421,0.79845,0.789272,0.793834
6,0.015600,0.370111,0.820408,0.770115,0.794466
7,0.003500,0.369743,0.81746,0.789272,0.803119



Best Model saved at: ./saved_models/ate_google-bert_bert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_google-bert_bert-base-german-cased_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4316.09 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.92      0.76      0.84       323

   micro avg       0.92      0.76      0.84       323
   macro avg       0.92      0.76      0.84       323
weighted avg       0.92      0.76      0.84       323

Precision Score: 0.9216417910447762
Recall Score: 0.7647058823529411
F1 Score: 0.8358714043993233
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5851.94 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4766.74 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.160473,0.855204,0.744094,0.795789
2,0.201500,0.139999,0.848361,0.814961,0.831325
3,0.098100,0.203983,0.813953,0.826772,0.820312
4,0.041200,0.203095,0.819923,0.84252,0.831068
5,0.021600,0.29944,0.840637,0.830709,0.835644
6,0.021600,0.317098,0.853556,0.80315,0.827586
7,0.006500,0.33581,0.838057,0.814961,0.826347



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-cased_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4462.35 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.79      0.83       315

   micro avg       0.88      0.79      0.83       315
   macro avg       0.88      0.79      0.83       315
weighted avg       0.88      0.79      0.83       315

Precision Score: 0.8771929824561403
Recall Score: 0.7936507936507936
F1 Score: 0.8333333333333334
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5653.10 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4643.38 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-uncased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.149725,0.826087,0.753968,0.788382
2,0.191600,0.182563,0.884058,0.72619,0.797386
3,0.090300,0.232095,0.86036,0.757937,0.805907
4,0.033800,0.240383,0.857143,0.809524,0.832653
5,0.019100,0.286075,0.830645,0.81746,0.824
6,0.019100,0.297768,0.823529,0.833333,0.828402
7,0.007000,0.302323,0.836653,0.833333,0.83499



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-uncased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-uncased_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4407.19 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.76      0.83       316

   micro avg       0.91      0.76      0.83       316
   macro avg       0.91      0.76      0.83       316
weighted avg       0.91      0.76      0.83       316

Precision Score: 0.9060150375939849
Recall Score: 0.7626582278481012
F1 Score: 0.8281786941580755
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5853.56 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4965.71 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for FacebookAI/xlm-roberta-base with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.14445,0.80137,0.809689,0.805508
2,0.223600,0.162242,0.810345,0.813149,0.811744
3,0.151500,0.179898,0.848708,0.795848,0.821429
4,0.098500,0.186958,0.816993,0.865052,0.840336
5,0.067100,0.221011,0.798722,0.865052,0.830565
6,0.067100,0.25027,0.821918,0.83045,0.826162
7,0.032800,0.266046,0.799353,0.854671,0.826087



Best Model saved at: ./saved_models/ate_FacebookAI_xlm-roberta-base_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_FacebookAI_xlm-roberta-base_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4559.37 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.89      0.79      0.83       346

   micro avg       0.89      0.79      0.83       346
   macro avg       0.89      0.79      0.83       346
weighted avg       0.89      0.79      0.83       346

Precision Score: 0.8888888888888888
Recall Score: 0.7861271676300579
F1 Score: 0.8343558282208589
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6496.79 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5048.38 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_best with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.153794,0.836735,0.773585,0.803922
2,0.182800,0.116576,0.808889,0.858491,0.832952
3,0.109600,0.185088,0.808411,0.816038,0.812207
4,0.051500,0.200544,0.826484,0.853774,0.839907
5,0.031600,0.275151,0.816901,0.820755,0.818824
6,0.031600,0.289001,0.8125,0.858491,0.834862
7,0.014700,0.304613,0.820755,0.820755,0.820755



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_best_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_best_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4675.98 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.80      0.83       279

   micro avg       0.87      0.80      0.83       279
   macro avg       0.87      0.80      0.83       279
weighted avg       0.87      0.80      0.83       279

Precision Score: 0.8682170542635659
Recall Score: 0.8028673835125448
F1 Score: 0.8342644320297952
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6496.51 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5054.52 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_filtered_base_best with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.137183,0.822967,0.811321,0.817102
2,0.186300,0.172187,0.828431,0.79717,0.8125
3,0.099200,0.207794,0.792627,0.811321,0.801865
4,0.051800,0.199818,0.807175,0.849057,0.827586
5,0.030500,0.270295,0.838235,0.806604,0.822115
6,0.030500,0.313414,0.839378,0.764151,0.8
7,0.017900,0.302727,0.818182,0.806604,0.812352



Best Model saved at: ./saved_models/ate_TUM_GottBERT_filtered_base_best_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_filtered_base_best_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4656.48 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.82      0.79      0.80       279

   micro avg       0.82      0.79      0.80       279
   macro avg       0.82      0.79      0.80       279
weighted avg       0.82      0.79      0.80       279

Precision Score: 0.8208955223880597
Recall Score: 0.7885304659498208
F1 Score: 0.8043875685557587
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
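
The two lines printed before each run, `['B-ASPECT' 'O']` and `{0: 4.755042290175667, 1: 0.5587538226299694}`, look like the class labels and per-class loss weights. The values are consistent with the common "balanced" heuristic `n_samples / (n_classes * class_count)`, which is also the formula behind scikit-learn's `compute_class_weight(class_weight="balanced")`. The sketch below is an assumption about how such weights can be derived, not the code inside `ate_model`. The helper name and the token counts are hypothetical; the counts were chosen only because they reproduce the printed weights.

```python
from collections import Counter


def balanced_class_weights(label_ids):
    """Return {class_id: n_samples / (n_classes * class_count)}."""
    counts = Counter(label_ids)
    n_samples = len(label_ids)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


# Hypothetical token counts (not taken from the log): 0 = B-ASPECT, 1 = O.
label_ids = [0] * 1537 + [1] * 13080
weights = balanced_class_weights(label_ids)
print(weights)
```

If the weights were indeed computed this way, they imply that roughly 10.5% of the labeled training tokens are B-ASPECT, so a weighted loss penalizes a missed aspect token about 8.5 times as much as a mislabeled O token.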

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6588.24 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5057.84 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_last with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.180353,0.843434,0.787736,0.814634
2,0.179900,0.123179,0.81106,0.830189,0.820513
3,0.110300,0.165871,0.830275,0.853774,0.84186
4,0.054100,0.150279,0.812766,0.900943,0.854586
5,0.043400,0.203105,0.816239,0.900943,0.856502
6,0.043400,0.233076,0.815789,0.877358,0.845455
7,0.018300,0.268538,0.832558,0.84434,0.838407



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_last_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_last_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4671.48 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.84      0.85       279

   micro avg       0.86      0.84      0.85       279
   macro avg       0.86      0.84      0.85       279
weighted avg       0.86      0.84      0.85       279

Precision Score: 0.8571428571428571
Recall Score: 0.8387096774193549
F1 Score: 0.8478260869565217
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6531.99 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5173.92 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for distilbert/distilbert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.139097,0.823529,0.771654,0.796748
2,0.205700,0.155902,0.827586,0.755906,0.790123
3,0.105200,0.212884,0.876777,0.728346,0.795699
4,0.049400,0.211797,0.855263,0.767717,0.809129
5,0.035000,0.248456,0.86758,0.748031,0.803383
6,0.035000,0.253412,0.839827,0.76378,0.8
7,0.017100,0.270706,0.85022,0.759843,0.802495



Best Model saved at: ./saved_models/ate_distilbert_distilbert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_distilbert_distilbert-base-german-cased_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4872.10 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.76      0.81       315

   micro avg       0.87      0.76      0.81       315
   macro avg       0.87      0.76      0.81       315
weighted avg       0.87      0.76      0.81       315

Precision Score: 0.8695652173913043
Recall Score: 0.7619047619047619
F1 Score: 0.8121827411167514
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5722.43 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4643.53 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for GerMedBERT/medbert-512 with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.104758,0.827586,0.831169,0.829374
2,0.190200,0.145571,0.79771,0.904762,0.84787
3,0.088600,0.209467,0.845794,0.78355,0.813483
4,0.037600,0.224285,0.831111,0.809524,0.820175
5,0.018800,0.275971,0.824034,0.831169,0.827586
6,0.018800,0.309367,0.819742,0.82684,0.823276
7,0.005400,0.320667,0.820513,0.831169,0.825806



Best Model saved at: ./saved_models/ate_GerMedBERT_medbert-512_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_GerMedBERT_medbert-512_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4257.92 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.81      0.82       288

   micro avg       0.83      0.81      0.82       288
   macro avg       0.83      0.81      0.82       288
weighted avg       0.83      0.81      0.82       288

Precision Score: 0.8268551236749117
Recall Score: 0.8125
F1 Score: 0.8196147110332749
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5853.21 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4829.39 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for deepset/gbert-base with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.123697,0.821429,0.814961,0.818182
2,0.190100,0.145844,0.831373,0.834646,0.833006
3,0.089500,0.23338,0.860169,0.799213,0.828571
4,0.033400,0.235083,0.843373,0.826772,0.83499
5,0.020000,0.27627,0.813688,0.84252,0.827853
6,0.020000,0.298147,0.819923,0.84252,0.831068
7,0.007200,0.300986,0.831373,0.834646,0.833006



Best Model saved at: ./saved_models/ate_deepset_gbert-base_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_deepset_gbert-base_42_42_7
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4474.49 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.77      0.84       315

   micro avg       0.91      0.77      0.84       315
   macro avg       0.91      0.77      0.84       315
weighted avg       0.91      0.77      0.84       315

Precision Score: 0.9070631970260223
Recall Score: 0.7746031746031746
F1 Score: 0.8356164383561643
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
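
The three scores printed after each classification report are tied together: F1 is the harmonic mean of precision and recall. A minimal check, using the precision and recall reported above for deepset/gbert-base after 7 epochs (the function name is only illustrative):

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.9070631970260223  # "Precision Score" from the log
recall = 0.7746031746031746     # "Recall Score" from the log
f1 = f1_from_precision_recall(precision, recall)
print(f1)
```

The result agrees with the logged F1 Score of 0.8356. High precision combined with lower recall, as in this run, means the model rarely tags a token that is not an aspect but misses some annotated aspects.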

In [5]:
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=8, save=True)
    print()

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5618.77 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4816.16 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for google-bert/bert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.149254,0.8159,0.747126,0.78
2,0.192400,0.186539,0.838983,0.758621,0.796781
3,0.081300,0.255996,0.844156,0.747126,0.792683
4,0.022700,0.280217,0.832636,0.762452,0.796
5,0.015200,0.375729,0.819328,0.747126,0.781563
6,0.015200,0.40264,0.823293,0.785441,0.803922
7,0.002700,0.406842,0.820717,0.789272,0.804688
8,0.000700,0.429536,0.82996,0.785441,0.807087



Best Model saved at: ./saved_models/ate_google-bert_bert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_google-bert_bert-base-german-cased_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4382.49 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.77      0.83       323

   micro avg       0.91      0.77      0.83       323
   macro avg       0.91      0.77      0.83       323
weighted avg       0.91      0.77      0.83       323

Precision Score: 0.9084249084249084
Recall Score: 0.7678018575851393
F1 Score: 0.8322147651006712
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
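
The per-epoch table for google-bert/bert-base-german-cased (8 epochs) shows validation loss rising from the first epoch onward while validation F1 keeps creeping up, so the "best" epoch depends on the selection metric. The log does not state which criterion `ate_model` uses to pick the saved "Best Model"; the sketch below only illustrates the two obvious candidates on the values copied from that table.

```python
import csv
import io

# Per-epoch validation metrics copied from the table above.
log = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.149254,0.8159,0.747126,0.78
2,0.192400,0.186539,0.838983,0.758621,0.796781
3,0.081300,0.255996,0.844156,0.747126,0.792683
4,0.022700,0.280217,0.832636,0.762452,0.796
5,0.015200,0.375729,0.819328,0.747126,0.781563
6,0.015200,0.40264,0.823293,0.785441,0.803922
7,0.002700,0.406842,0.820717,0.789272,0.804688
8,0.000700,0.429536,0.82996,0.785441,0.807087
"""

rows = list(csv.DictReader(io.StringIO(log)))

# Candidate 1: highest validation F1. Candidate 2: lowest validation loss.
best_f1_epoch = int(max(rows, key=lambda r: float(r["F1"]))["Epoch"])
lowest_loss_epoch = int(min(rows, key=lambda r: float(r["Validation Loss"]))["Epoch"])

print("best F1 epoch:", best_f1_epoch)
print("lowest validation loss epoch:", lowest_loss_epoch)
```

Here the two criteria disagree (epoch 8 by F1, epoch 1 by loss). A training loss near zero together with a growing validation loss is the usual sign of overfitting, which is worth keeping in mind when comparing runs with more epochs.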

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6021.20 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4875.19 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.141909,0.789272,0.811024,0.8
2,0.200700,0.156389,0.833333,0.846457,0.839844
3,0.099100,0.210994,0.801471,0.858268,0.828897
4,0.037600,0.228102,0.803571,0.885827,0.842697
5,0.020700,0.309773,0.82197,0.854331,0.837838
6,0.020700,0.366385,0.843478,0.76378,0.801653
7,0.006300,0.361147,0.840637,0.830709,0.835644
8,0.005000,0.360834,0.835938,0.84252,0.839216



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-cased_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4469.62 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.85      0.85       315

   micro avg       0.85      0.85      0.85       315
   macro avg       0.85      0.85      0.85       315
weighted avg       0.85      0.85      0.85       315

Precision Score: 0.8503184713375797
Recall Score: 0.8476190476190476
F1 Score: 0.848966613672496
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5666.05 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4772.43 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-uncased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.137137,0.835498,0.765873,0.799172
2,0.186700,0.166352,0.843478,0.769841,0.804979
3,0.090500,0.215347,0.859574,0.801587,0.829569
4,0.033800,0.24386,0.844262,0.81746,0.830645
5,0.018900,0.292529,0.842105,0.825397,0.833667
6,0.018900,0.304921,0.820611,0.853175,0.836576
7,0.004000,0.323235,0.841897,0.845238,0.843564
8,0.001900,0.328161,0.827586,0.857143,0.842105



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-uncased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-uncased_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4513.38 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.89      0.77      0.83       316

   micro avg       0.89      0.77      0.83       316
   macro avg       0.89      0.77      0.83       316
weighted avg       0.89      0.77      0.83       316

Precision Score: 0.8933823529411765
Recall Score: 0.7689873417721519
F1 Score: 0.826530612244898
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5888.87 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5063.16 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for FacebookAI/xlm-roberta-base with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.15183,0.815498,0.764706,0.789286
2,0.212300,0.148696,0.823529,0.775087,0.798574
3,0.142600,0.155815,0.864769,0.84083,0.852632
4,0.080400,0.189003,0.774854,0.916955,0.839937
5,0.056500,0.228572,0.808176,0.889273,0.846787
6,0.056500,0.298263,0.836237,0.83045,0.833333
7,0.031500,0.286758,0.786787,0.906574,0.842444
8,0.024300,0.267071,0.819355,0.878893,0.84808



Best Model saved at: ./saved_models/ate_FacebookAI_xlm-roberta-base_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_FacebookAI_xlm-roberta-base_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4642.49 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.75      0.81       346

   micro avg       0.88      0.75      0.81       346
   macro avg       0.88      0.75      0.81       346
weighted avg       0.88      0.75      0.81       346

Precision Score: 0.8847457627118644
Recall Score: 0.7543352601156069
F1 Score: 0.814352574102964
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6540.12 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5099.73 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_best with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.190343,0.85119,0.674528,0.752632
2,0.182800,0.144139,0.842105,0.830189,0.836105
3,0.109300,0.178813,0.801724,0.877358,0.837838
4,0.050600,0.215727,0.816038,0.816038,0.816038
5,0.034100,0.269656,0.79386,0.853774,0.822727
6,0.034100,0.299629,0.808219,0.834906,0.821346
7,0.012200,0.312673,0.810811,0.849057,0.829493
8,0.007500,0.330588,0.814286,0.806604,0.810427



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_best_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_best_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4775.73 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.81      0.79      0.80       279

   micro avg       0.81      0.79      0.80       279
   macro avg       0.81      0.79      0.80       279
weighted avg       0.81      0.79      0.80       279

Precision Score: 0.8118081180811808
Recall Score: 0.7885304659498208
F1 Score: 0.8
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6582.85 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5060.99 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_filtered_base_best with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.129274,0.810427,0.806604,0.808511
2,0.179200,0.147977,0.813084,0.820755,0.816901
3,0.102200,0.177405,0.823256,0.834906,0.82904
4,0.051500,0.188275,0.803653,0.830189,0.816705
5,0.031200,0.24568,0.807175,0.849057,0.827586
6,0.031200,0.287995,0.801724,0.877358,0.837838
7,0.009500,0.28511,0.809735,0.863208,0.835616
8,0.006500,0.292516,0.80531,0.858491,0.83105



Best Model saved at: ./saved_models/ate_TUM_GottBERT_filtered_base_best_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_filtered_base_best_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4671.18 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.81      0.83       279

   micro avg       0.86      0.81      0.83       279
   macro avg       0.86      0.81      0.83       279
weighted avg       0.86      0.81      0.83       279

Precision Score: 0.8620689655172413
Recall Score: 0.8064516129032258
F1 Score: 0.8333333333333334
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6551.89 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5060.02 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_last with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.17993,0.817734,0.783019,0.8
2,0.181400,0.152914,0.823529,0.792453,0.807692
3,0.112200,0.199777,0.820388,0.79717,0.808612
4,0.060100,0.178344,0.820628,0.863208,0.841379
5,0.042600,0.294801,0.793722,0.834906,0.813793
6,0.042600,0.333246,0.827586,0.792453,0.809639
7,0.019100,0.393153,0.819095,0.768868,0.793187
8,0.007700,0.378625,0.805825,0.783019,0.794258



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_last_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_last_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4614.10 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.81      0.84       279

   micro avg       0.87      0.81      0.84       279
   macro avg       0.87      0.81      0.84       279
weighted avg       0.87      0.81      0.84       279

Precision Score: 0.8725868725868726
Recall Score: 0.8100358422939068
F1 Score: 0.8401486988847584
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6432.66 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5251.07 examples/s]


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for distilbert/distilbert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.144885,0.829167,0.783465,0.805668
2,0.208200,0.169265,0.835498,0.759843,0.795876
3,0.102900,0.199689,0.83913,0.759843,0.797521
4,0.048100,0.234594,0.834783,0.755906,0.793388
5,0.031300,0.266809,0.827731,0.775591,0.800813
6,0.031300,0.284576,0.825911,0.80315,0.814371
7,0.014700,0.304624,0.827869,0.795276,0.811245
8,0.005400,0.314198,0.834711,0.795276,0.814516



Best Model saved at: ./saved_models/ate_distilbert_distilbert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_distilbert_distilbert-base-german-cased_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4852.48 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.75      0.80       315

   micro avg       0.86      0.75      0.80       315
   macro avg       0.86      0.75      0.80       315
weighted avg       0.86      0.75      0.80       315

Precision Score: 0.864963503649635
Recall Score: 0.7523809523809524
F1 Score: 0.8047538200339559
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label
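
Note on the class weights printed before every run: the pair `{0: 4.755..., 1: 0.558...}` is consistent with the usual "balanced" heuristic, weight = n_samples / (n_classes * class_count), because the reciprocals of the two weights add up to 2 (the number of classes). The sketch below reproduces the heuristic on toy data. It is an assumption that `ate_model` computes its weights this way and that ID 0 maps to `B-ASPECT`; the helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(label_ids):
    """Return {class_id: n_samples / (n_classes * count)} for every class seen."""
    counts = Counter(label_ids)
    n_samples = len(label_ids)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Hypothetical toy data: 0 = B-ASPECT (rare), 1 = O (frequent).
toy_labels = [0] * 2 + [1] * 18
weights = balanced_class_weights(toy_labels)
print(weights)

# Sanity check on the weights from the log: under the balanced heuristic,
# the reciprocals of the weights sum to the number of classes.
logged = {0: 4.755042290175667, 1: 0.5587538226299694}
reciprocal_sum = sum(1.0 / w for w in logged.values())
print(round(reciprocal_sum, 6))
```

Under this reading, roughly 10.5 % of the labelled training tokens are aspect tokens (1 / (2 * 4.755)), which is why the rare class receives the larger weight.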

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5755.69 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4579.88 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for GerMedBERT/medbert-512 with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.113751,0.835749,0.748918,0.789954
2,0.189800,0.150909,0.816,0.883117,0.848233
3,0.088500,0.20706,0.818565,0.839827,0.82906
4,0.037600,0.233955,0.807531,0.835498,0.821277
5,0.019100,0.288862,0.807377,0.852814,0.829474
6,0.019100,0.313762,0.807531,0.835498,0.821277
7,0.003900,0.32168,0.830435,0.82684,0.828633
8,0.004500,0.320143,0.8107,0.852814,0.831224



Best Model saved at: ./saved_models/ate_GerMedBERT_medbert-512_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_GerMedBERT_medbert-512_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4216.25 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.78      0.82       288

   micro avg       0.87      0.78      0.82       288
   macro avg       0.87      0.78      0.82       288
weighted avg       0.87      0.78      0.82       288

Precision Score: 0.8653846153846154
Recall Score: 0.78125
F1 Score: 0.8211678832116789
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', '

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5817.70 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4689.90 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for deepset/gbert-base with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.14296,0.83913,0.759843,0.797521
2,0.190500,0.138249,0.861925,0.811024,0.8357
3,0.089300,0.231621,0.845188,0.795276,0.819473
4,0.031600,0.226515,0.84127,0.834646,0.837945
5,0.021000,0.283137,0.833977,0.850394,0.842105
6,0.021000,0.272008,0.833333,0.866142,0.849421
7,0.007200,0.30423,0.844,0.830709,0.837302
8,0.004300,0.285747,0.833977,0.850394,0.842105



Best Model saved at: ./saved_models/ate_deepset_gbert-base_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_deepset_gbert-base_42_42_8
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4399.89 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.80      0.84       315

   micro avg       0.87      0.80      0.84       315
   macro avg       0.87      0.80      0.84       315
weighted avg       0.87      0.80      0.84       315

Precision Score: 0.8724137931034482
Recall Score: 0.8031746031746032
F1 Score: 0.8363636363636363
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
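
The per-epoch tables above make it easy to check which epoch a "best model" criterion would pick. The following is a minimal sketch that parses the `deepset/gbert-base` 8-epoch table copied from the log and reports the epoch with the highest validation F1 and the epoch with the lowest validation loss. Which of the two criteria `ate_model` actually uses is not visible in the output, so both are shown; the function name `parse_log` is hypothetical.

```python
import csv
import io

# Validation table for deepset/gbert-base (8 epochs), copied from the log above.
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.14296,0.83913,0.759843,0.797521
2,0.190500,0.138249,0.861925,0.811024,0.8357
3,0.089300,0.231621,0.845188,0.795276,0.819473
4,0.031600,0.226515,0.84127,0.834646,0.837945
5,0.021000,0.283137,0.833977,0.850394,0.842105
6,0.021000,0.272008,0.833333,0.866142,0.849421
7,0.007200,0.30423,0.844,0.830709,0.837302
8,0.004300,0.285747,0.833977,0.850394,0.842105
"""


def parse_log(text):
    """Turn the comma-separated epoch table into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "epoch": int(row["Epoch"]),
            "val_loss": float(row["Validation Loss"]),
            "f1": float(row["F1"]),
        })
    return rows


rows = parse_log(LOG)
best_by_f1 = max(rows, key=lambda r: r["f1"])
best_by_loss = min(rows, key=lambda r: r["val_loss"])

print("best epoch by F1:  ", best_by_f1["epoch"], best_by_f1["f1"])
print("best epoch by loss:", best_by_loss["epoch"], best_by_loss["val_loss"])
```

For this run the two criteria disagree: validation loss is lowest early on and then rises while the training loss keeps falling, whereas validation F1 peaks several epochs later. The same pattern appears in most of the tables in this section, so the choice of selection metric matters for which checkpoint is saved.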

In [6]:
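# Fine-tune every checkpoint in `models` for 10 epochs with fixed random seeds (rn1 = rn2 = 42).
# save=True writes the best model and its tokenizer to ./saved_models and ./saved_tokenizers.
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=10, save=True)
    print()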

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5910.34 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4784.23 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for google-bert/bert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.155934,0.817021,0.735632,0.774194
2,0.189800,0.155716,0.822134,0.796935,0.809339
3,0.081400,0.238581,0.798507,0.819923,0.809074
4,0.023800,0.315927,0.807229,0.770115,0.788235
5,0.014300,0.370965,0.824219,0.808429,0.816248
6,0.014300,0.41991,0.828571,0.777778,0.802372
7,0.006900,0.426093,0.832653,0.781609,0.806324
8,0.001900,0.435215,0.833992,0.808429,0.821012
9,0.001500,0.448985,0.828685,0.796935,0.8125
10,0.000300,0.456094,0.828685,0.796935,0.8125



Best Model saved at: ./saved_models/ate_google-bert_bert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_google-bert_bert-base-german-cased_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4493.24 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.78      0.84       323

   micro avg       0.91      0.78      0.84       323
   macro avg       0.91      0.78      0.84       323
weighted avg       0.91      0.78      0.84       323

Precision Score: 0.9100719424460432
Recall Score: 0.7832817337461301
F1 Score: 0.8419301164725459
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5895.41 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4782.72 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.147201,0.813008,0.787402,0.8
2,0.200200,0.163032,0.835391,0.799213,0.816901
3,0.099000,0.240881,0.835443,0.779528,0.806517
4,0.034700,0.288684,0.789655,0.901575,0.841912
5,0.017900,0.355788,0.80315,0.80315,0.80315
6,0.017900,0.358243,0.832669,0.822835,0.827723
7,0.007400,0.346879,0.832061,0.858268,0.844961
8,0.005500,0.396509,0.838583,0.838583,0.838583
9,0.002500,0.419361,0.826772,0.826772,0.826772
10,0.001400,0.43027,0.826772,0.826772,0.826772



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-cased_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4444.78 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.82      0.85       315

   micro avg       0.88      0.82      0.85       315
   macro avg       0.88      0.82      0.85       315
weighted avg       0.88      0.82      0.85       315

Precision Score: 0.8839590443686007
Recall Score: 0.8222222222222222
F1 Score: 0.8519736842105263
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5635.12 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4758.92 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-uncased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.145257,0.830435,0.757937,0.792531
2,0.191600,0.186798,0.859031,0.77381,0.814196
3,0.092600,0.225076,0.838298,0.781746,0.809035
4,0.032600,0.21824,0.819923,0.849206,0.834308
5,0.020600,0.332044,0.821862,0.805556,0.813627
6,0.020600,0.37145,0.829365,0.829365,0.829365
7,0.005800,0.399477,0.836735,0.813492,0.82495
8,0.001500,0.402862,0.835294,0.845238,0.840237
9,0.002400,0.432827,0.842742,0.829365,0.836
10,0.001300,0.418345,0.833992,0.837302,0.835644



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-uncased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-uncased_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4540.83 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.88      0.78      0.83       316

   micro avg       0.88      0.78      0.83       316
   macro avg       0.88      0.78      0.83       316
weighted avg       0.88      0.78      0.83       316

Precision Score: 0.875886524822695
Recall Score: 0.7816455696202531
F1 Score: 0.8260869565217391
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5853.70 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5033.62 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for FacebookAI/xlm-roberta-base with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.142815,0.826241,0.806228,0.816112
2,0.216400,0.168245,0.804054,0.823529,0.813675
3,0.145400,0.183943,0.827119,0.844291,0.835616
4,0.093000,0.205133,0.83045,0.83045,0.83045
5,0.066800,0.302071,0.798701,0.851211,0.824121
6,0.066800,0.299186,0.803987,0.83737,0.820339
7,0.033400,0.30739,0.780488,0.885813,0.829822
8,0.026600,0.332037,0.794953,0.871972,0.831683
9,0.016000,0.341836,0.795527,0.861592,0.827243
10,0.009900,0.346254,0.796178,0.865052,0.829187



Best Model saved at: ./saved_models/ate_FacebookAI_xlm-roberta-base_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_FacebookAI_xlm-roberta-base_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4607.34 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.82      0.75      0.79       346

   micro avg       0.82      0.75      0.79       346
   macro avg       0.82      0.75      0.79       346
weighted avg       0.82      0.75      0.79       346

Precision Score: 0.8227848101265823
Recall Score: 0.7514450867052023
F1 Score: 0.7854984894259818
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6464.58 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5036.80 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_best with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.147691,0.827586,0.792453,0.809639
2,0.181200,0.148894,0.842857,0.834906,0.838863
3,0.108500,0.17776,0.817352,0.84434,0.830626
4,0.058200,0.175909,0.804255,0.891509,0.845638
5,0.043300,0.2971,0.814286,0.806604,0.810427
6,0.043300,0.282738,0.812785,0.839623,0.825986
7,0.017200,0.401498,0.836842,0.75,0.791045
8,0.006700,0.376106,0.813397,0.801887,0.807601
9,0.007800,0.392271,0.825,0.778302,0.800971
10,0.004000,0.403141,0.820896,0.778302,0.799031



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_best_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_best_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4712.91 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.85      0.83      0.84       279

   micro avg       0.85      0.83      0.84       279
   macro avg       0.85      0.83      0.84       279
weighted avg       0.85      0.83      0.84       279

Precision Score: 0.8498168498168498
Recall Score: 0.8315412186379928
F1 Score: 0.8405797101449275
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6523.00 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5084.19 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_filtered_base_best with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.14879,0.81407,0.764151,0.788321
2,0.186500,0.210977,0.798122,0.801887,0.8
3,0.099000,0.149225,0.83871,0.858491,0.848485
4,0.049400,0.177294,0.857868,0.79717,0.826406
5,0.034600,0.250404,0.818182,0.849057,0.833333
6,0.034600,0.240946,0.822511,0.896226,0.857788
7,0.014100,0.299077,0.84264,0.783019,0.811736
8,0.007300,0.31506,0.828571,0.820755,0.824645
9,0.002300,0.295691,0.834821,0.882075,0.857798
10,0.002000,0.341165,0.840796,0.79717,0.818402



Best Model saved at: ./saved_models/ate_TUM_GottBERT_filtered_base_best_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_filtered_base_best_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4693.83 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.78      0.82       279

   micro avg       0.86      0.78      0.82       279
   macro avg       0.86      0.78      0.82       279
weighted avg       0.86      0.78      0.82       279

Precision Score: 0.857707509881423
Recall Score: 0.7777777777777778
F1 Score: 0.8157894736842106
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6457.71 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5062.19 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_last with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.237963,0.838889,0.712264,0.770408
2,0.179200,0.157235,0.804545,0.834906,0.819444
3,0.113600,0.18128,0.801802,0.839623,0.820276
4,0.059100,0.171909,0.826087,0.806604,0.816229
5,0.045800,0.26868,0.808219,0.834906,0.821346
6,0.045800,0.338653,0.829146,0.778302,0.80292
7,0.012500,0.378633,0.834197,0.759434,0.795062
8,0.008000,0.375693,0.81068,0.787736,0.799043
9,0.006900,0.380848,0.823529,0.792453,0.807692
10,0.005800,0.387219,0.820896,0.778302,0.799031



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_last_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_last_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4754.87 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.81      0.83       279

   micro avg       0.86      0.81      0.83       279
   macro avg       0.86      0.81      0.83       279
weighted avg       0.86      0.81      0.83       279

Precision Score: 0.8593155893536122
Recall Score: 0.8100358422939068
F1 Score: 0.8339483394833949
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6513.12 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5260.33 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for distilbert/distilbert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.144258,0.827731,0.775591,0.800813
2,0.205500,0.162938,0.821739,0.744094,0.780992
3,0.101200,0.226657,0.859813,0.724409,0.786325
4,0.045200,0.229085,0.802281,0.830709,0.816248
5,0.034400,0.262624,0.816,0.80315,0.809524
6,0.034400,0.305888,0.817797,0.759843,0.787755
7,0.012800,0.360086,0.790698,0.80315,0.796875
8,0.004100,0.359875,0.795367,0.811024,0.803119
9,0.004100,0.366117,0.792308,0.811024,0.801556
10,0.001400,0.370863,0.792308,0.811024,0.801556



Best Model saved at: ./saved_models/ate_distilbert_distilbert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_distilbert_distilbert-base-german-cased_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4814.15 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.81      0.83       315

   micro avg       0.86      0.81      0.83       315
   macro avg       0.86      0.81      0.83       315
weighted avg       0.86      0.81      0.83       315

Precision Score: 0.8639455782312925
Recall Score: 0.8063492063492064
F1 Score: 0.8341543513957307
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5757.05 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4618.57 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for GerMedBERT/medbert-512 with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.109198,0.817021,0.831169,0.824034
2,0.189900,0.141898,0.794466,0.87013,0.830579
3,0.088000,0.222649,0.843902,0.748918,0.793578
4,0.036800,0.183271,0.823045,0.865801,0.843882
5,0.023100,0.265912,0.814815,0.857143,0.835443
6,0.023100,0.288966,0.834783,0.831169,0.832972
7,0.004200,0.316017,0.812245,0.861472,0.836134
8,0.004100,0.330555,0.823045,0.865801,0.843882
9,0.001000,0.341773,0.829787,0.844156,0.83691
10,0.000800,0.3428,0.824268,0.852814,0.838298



Best Model saved at: ./saved_models/ate_GerMedBERT_medbert-512_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_GerMedBERT_medbert-512_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4222.95 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.75      0.81       288

   micro avg       0.87      0.75      0.81       288
   macro avg       0.87      0.75      0.81       288
weighted avg       0.87      0.75      0.81       288

Precision Score: 0.8739837398373984
Recall Score: 0.7465277777777778
F1 Score: 0.8052434456928839
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5814.19 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4803.87 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for deepset/gbert-base with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.144171,0.815261,0.799213,0.807157
2,0.190400,0.174801,0.811765,0.814961,0.81336
3,0.089000,0.237684,0.861472,0.783465,0.820619
4,0.033400,0.263219,0.860759,0.80315,0.830957
5,0.018300,0.326294,0.845833,0.799213,0.821862
6,0.018300,0.352128,0.823308,0.862205,0.842308
7,0.007900,0.395983,0.862069,0.787402,0.823045
8,0.001200,0.388542,0.83004,0.826772,0.828402
9,0.000700,0.393355,0.836735,0.807087,0.821643
10,0.001500,0.392739,0.841463,0.814961,0.828



Best Model saved at: ./saved_models/ate_deepset_gbert-base_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_deepset_gbert-base_42_42_10
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4431.11 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.79      0.84       315

   micro avg       0.90      0.79      0.84       315
   macro avg       0.90      0.79      0.84       315
weighted avg       0.90      0.79      0.84       315

Precision Score: 0.8953068592057761
Recall Score: 0.7873015873015873
F1 Score: 0.8378378378378378
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
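
The per-epoch table logged for deepset/gbert-base (10 epochs) can be scanned programmatically. The following is a minimal sketch and not part of `ate_model`: it copies the ten logged rows into a string and reports the epoch with the highest validation F1 and the epoch with the lowest validation loss. Which criterion `ate_model` actually uses to pick the "best model" is not visible in this log, so both are shown as candidates; the names `log_text`, `rows`, `best_f1` and `best_loss` are illustrative.

```python
import csv
import io

# Rows copied from the training log of deepset/gbert-base (10 epochs, seeds 42/42).
log_text = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.144171,0.815261,0.799213,0.807157
2,0.190400,0.174801,0.811765,0.814961,0.81336
3,0.089000,0.237684,0.861472,0.783465,0.820619
4,0.033400,0.263219,0.860759,0.80315,0.830957
5,0.018300,0.326294,0.845833,0.799213,0.821862
6,0.018300,0.352128,0.823308,0.862205,0.842308
7,0.007900,0.395983,0.862069,0.787402,0.823045
8,0.001200,0.388542,0.83004,0.826772,0.828402
9,0.000700,0.393355,0.836735,0.807087,0.821643
10,0.001500,0.392739,0.841463,0.814961,0.828
"""

rows = list(csv.DictReader(io.StringIO(log_text)))

# Epoch with the highest validation F1 (max keeps the first row on ties).
best_f1 = max(rows, key=lambda r: float(r["F1"]))
# Epoch with the lowest validation loss.
best_loss = min(rows, key=lambda r: float(r["Validation Loss"]))

print("best F1:  epoch", best_f1["Epoch"], best_f1["F1"])
print("min loss: epoch", best_loss["Epoch"], best_loss["Validation Loss"])
```

For this run the two criteria disagree (epoch 6 by F1, epoch 1 by validation loss) while the training loss keeps falling, a pattern that usually points to overfitting after the first few epochs.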

In [None]:
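# Fine-tune every checkpoint in `models` for 12 epochs with fixed seeds (42, 42);
# save=True stores the best model and its tokenizer.
for model in models:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=12, save=True)
    print()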

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5786.81 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4766.04 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for google-bert/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.152613,0.826271,0.747126,0.784708
2,0.192700,0.157459,0.836653,0.804598,0.820312
3,0.084000,0.256477,0.814516,0.773946,0.793713
4,0.025200,0.246607,0.83682,0.766284,0.8
5,0.013700,0.389142,0.823045,0.766284,0.793651
6,0.013700,0.425228,0.824786,0.739464,0.779798
7,0.005900,0.390524,0.819277,0.781609,0.8
8,0.001300,0.439598,0.814516,0.773946,0.793713
9,0.001200,0.459527,0.829787,0.747126,0.78629
10,0.000800,0.450793,0.82449,0.773946,0.798419



Best Model saved at: ./saved_models/ate_google-bert_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_google-bert_bert-base-german-cased_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4355.23 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.91      0.77      0.83       323

   micro avg       0.91      0.77      0.83       323
   macro avg       0.91      0.77      0.83       323
weighted avg       0.91      0.77      0.83       323

Precision Score: 0.9054545454545454
Recall Score: 0.7708978328173375
F1 Score: 0.8327759197324415
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5732.15 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4760.74 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.149743,0.794466,0.791339,0.792899
2,0.206900,0.168503,0.862661,0.791339,0.825462
3,0.098600,0.253586,0.82397,0.866142,0.84453
4,0.041100,0.278187,0.8,0.834646,0.816956
5,0.023200,0.390024,0.767123,0.88189,0.820513
6,0.023200,0.402914,0.857778,0.759843,0.805846
7,0.006000,0.444511,0.827004,0.771654,0.798371
8,0.006000,0.432876,0.807843,0.811024,0.80943
9,0.002500,0.457051,0.831933,0.779528,0.804878
10,0.000600,0.485857,0.829167,0.783465,0.805668



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-cased_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4481.05 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.87      0.82      0.84       315

   micro avg       0.87      0.82      0.84       315
   macro avg       0.87      0.82      0.84       315
weighted avg       0.87      0.82      0.84       315

Precision Score: 0.8716216216216216
Recall Score: 0.819047619047619
F1 Score: 0.8445171849427168
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5512.05 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4592.88 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for dbmdz/bert-base-german-uncased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.140258,0.844828,0.777778,0.809917
2,0.187200,0.158992,0.850622,0.813492,0.831643
3,0.089400,0.227017,0.83682,0.793651,0.814664
4,0.032800,0.224848,0.820717,0.81746,0.819085
5,0.021900,0.312917,0.818898,0.825397,0.822134
6,0.021900,0.373301,0.829365,0.829365,0.829365
7,0.004900,0.41968,0.814229,0.81746,0.815842
8,0.002200,0.421972,0.816733,0.813492,0.815109
9,0.000900,0.444466,0.818182,0.821429,0.819802
10,0.001500,0.432362,0.825911,0.809524,0.817635



Best Model saved at: ./saved_models/ate_dbmdz_bert-base-german-uncased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_dbmdz_bert-base-german-uncased_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4213.31 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.93      0.75      0.83       316

   micro avg       0.93      0.75      0.83       316
   macro avg       0.93      0.75      0.83       316
weighted avg       0.93      0.75      0.83       316

Precision Score: 0.9260700389105059
Recall Score: 0.7531645569620253
F1 Score: 0.830715532286213
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Label

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5753.53 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4965.83 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for FacebookAI/xlm-roberta-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.160928,0.792388,0.792388,0.792388
2,0.217400,0.148956,0.856618,0.806228,0.83066
3,0.155000,0.175109,0.830986,0.816609,0.823735
4,0.105000,0.213291,0.815436,0.84083,0.827939
5,0.079500,0.254736,0.796825,0.868512,0.831126
6,0.079500,0.274454,0.817881,0.854671,0.835871
7,0.042300,0.316326,0.774775,0.892734,0.829582
8,0.027300,0.323452,0.792332,0.858131,0.82392
9,0.020700,0.3496,0.817276,0.851211,0.833898
10,0.007900,0.345608,0.81877,0.875433,0.846154



Best Model saved at: ./saved_models/ate_FacebookAI_xlm-roberta-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_FacebookAI_xlm-roberta-base_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4561.41 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.75      0.80       346

   micro avg       0.86      0.75      0.80       346
   macro avg       0.86      0.75      0.80       346
weighted avg       0.86      0.75      0.80       346

Precision Score: 0.8557377049180328
Recall Score: 0.7543352601156069
F1 Score: 0.8018433179723502
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6523.07 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4988.99 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.18932,0.836066,0.721698,0.774684
2,0.183900,0.150063,0.803738,0.811321,0.807512
3,0.113500,0.182954,0.801688,0.896226,0.846325
4,0.056600,0.230458,0.789216,0.759434,0.774038
5,0.042700,0.297693,0.787037,0.801887,0.794393
6,0.042700,0.358697,0.835979,0.745283,0.78803
7,0.014900,0.381509,0.804878,0.778302,0.791367
8,0.008600,0.45022,0.823834,0.75,0.785185
9,0.004700,0.437428,0.824742,0.754717,0.788177
10,0.005000,0.438994,0.815166,0.811321,0.813239



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_best_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4659.93 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.84      0.82      0.83       279

   micro avg       0.84      0.82      0.83       279
   macro avg       0.84      0.82      0.83       279
weighted avg       0.84      0.82      0.83       279

Precision Score: 0.8394160583941606
Recall Score: 0.8243727598566308
F1 Score: 0.8318264014466547
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6559.66 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4969.44 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_filtered_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.131867,0.796296,0.811321,0.803738
2,0.176600,0.144252,0.783186,0.834906,0.808219
3,0.102400,0.213909,0.804651,0.816038,0.810304
4,0.048300,0.212191,0.8,0.867925,0.832579
5,0.035500,0.285013,0.800885,0.853774,0.826484
6,0.035500,0.309983,0.81106,0.830189,0.820513
7,0.010300,0.320235,0.798122,0.801887,0.8
8,0.010700,0.405206,0.825397,0.735849,0.778055
9,0.002400,0.34441,0.811321,0.811321,0.811321
10,0.002600,0.380798,0.795238,0.787736,0.791469



Best Model saved at: ./saved_models/ate_TUM_GottBERT_filtered_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_filtered_base_best_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4724.41 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.83      0.83      0.83       279

   micro avg       0.83      0.83      0.83       279
   macro avg       0.83      0.83      0.83       279
weighted avg       0.83      0.83      0.83       279

Precision Score: 0.8279569892473119
Recall Score: 0.8279569892473119
F1 Score: 0.8279569892473119
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6532.18 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5068.01 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for TUM/GottBERT_base_last with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.170278,0.834225,0.735849,0.781955
2,0.179300,0.151724,0.813953,0.825472,0.819672
3,0.114000,0.19204,0.832512,0.79717,0.814458
4,0.056900,0.213357,0.8,0.773585,0.786571
5,0.038200,0.26479,0.795556,0.84434,0.819222
6,0.038200,0.32948,0.815534,0.792453,0.803828
7,0.013600,0.377215,0.81592,0.773585,0.794189
8,0.013400,0.43781,0.795,0.75,0.771845
9,0.006300,0.443204,0.823834,0.75,0.785185
10,0.002300,0.435113,0.812183,0.754717,0.782396



Best Model saved at: ./saved_models/ate_TUM_GottBERT_base_last_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_TUM_GottBERT_base_last_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4656.38 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.74      0.81       279

   micro avg       0.90      0.74      0.81       279
   macro avg       0.90      0.74      0.81       279
weighted avg       0.90      0.74      0.81       279

Precision Score: 0.8956521739130435
Recall Score: 0.7383512544802867
F1 Score: 0.8094302554027505
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6546.21 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5177.58 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for distilbert/distilbert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.138312,0.808765,0.799213,0.80396
2,0.209300,0.160013,0.846847,0.740157,0.789916
3,0.098600,0.190818,0.821577,0.779528,0.8
4,0.045500,0.225302,0.839662,0.783465,0.810591
5,0.030200,0.259969,0.804688,0.811024,0.807843
6,0.030200,0.294629,0.816733,0.807087,0.811881
7,0.012000,0.322201,0.8125,0.818898,0.815686
8,0.005300,0.356954,0.82716,0.791339,0.808853
9,0.002200,0.362175,0.799242,0.830709,0.814672
10,0.001500,0.380429,0.801556,0.811024,0.806262



Best Model saved at: ./saved_models/ate_distilbert_distilbert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_distilbert_distilbert-base-german-cased_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4892.44 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.86      0.80      0.83       315

   micro avg       0.86      0.80      0.83       315
   macro avg       0.86      0.80      0.83       315
weighted avg       0.86      0.80      0.83       315

Precision Score: 0.863013698630137
Recall Score: 0.8
F1 Score: 0.8303130148270181
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', '

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5730.50 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4597.02 examples/s]
  trainer = Trainer(


['B-ASPECT' 'O']
{0: 4.755042290175667, 1: 0.5587538226299694}
Training results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.110661,0.837209,0.779221,0.807175
2,0.189600,0.157961,0.829787,0.844156,0.83691
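
The three scores printed after each classification report are related: F1 is the harmonic mean of precision and recall. As a quick sanity check (a sketch, not code from `functions.ate_model_train_OB`), the snippet below recomputes F1 from the precision and recall logged for distilbert/distilbert-base-german-cased (12 epochs). `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the distilbert/distilbert-base-german-cased test report.
precision = 0.863013698630137
recall = 0.8
logged_f1 = 0.8303130148270181

f1 = f1_from(precision, recall)
print(f1)
print(abs(f1 - logged_f1) < 1e-9)  # True if the logged F1 is the harmonic mean
```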


In [5]:
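# Re-run the last two checkpoints, whose 12-epoch results are incomplete or missing
# in the output of the full loop above.
for model in ["GerMedBERT/medbert-512", "deepset/gbert-base"]:
    print(f'training and results for {model}:')
    ate_model(data, model, rn1=42, rn2=42, epochs=12, save=True)
    print()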

training and results for GerMedBERT/medbert-512:


BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 5612.39 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4681.61 examples/s]
  trainer = Trainer(


['O' 'B-ASPECT']
{0: 0.5587538226299694, 1: 4.755042290175667}
Training results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.115053,0.815668,0.766234,0.790179
2,0.189400,0.146322,0.807692,0.818182,0.812903
3,0.084300,0.178711,0.826271,0.844156,0.835118
4,0.037700,0.215008,0.800797,0.87013,0.834025
5,0.021800,0.272521,0.825532,0.839827,0.832618
6,0.021800,0.323538,0.837719,0.82684,0.832244
7,0.004800,0.344278,0.8,0.848485,0.823529
8,0.002200,0.358251,0.82684,0.82684,0.82684
9,0.001500,0.40078,0.825328,0.818182,0.821739
10,0.000700,0.406954,0.813559,0.831169,0.82227



Best Model saved at: ./saved_models/ate_GerMedBERT_medbert-512_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_GerMedBERT_medbert-512_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4342.72 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.89      0.73      0.80       288

   micro avg       0.89      0.73      0.80       288
   macro avg       0.89      0.73      0.80       288
weighted avg       0.89      0.73      0.80       288

Precision Score: 0.8860759493670886
Recall Score: 0.7291666666666666
F1 Score: 0.8
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inferen

mapping of the data



Map: 100%|██████████| 808/808 [00:00<00:00, 6026.86 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4929.37 examples/s]
  trainer = Trainer(


['O' 'B-ASPECT']
{0: 0.5587538226299694, 1: 4.755042290175667}
Training results for deepset/gbert-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.13888,0.851528,0.767717,0.807453
2,0.187600,0.168863,0.837945,0.834646,0.836292
3,0.093600,0.234688,0.857724,0.830709,0.844
4,0.034600,0.216671,0.828358,0.874016,0.850575
5,0.020800,0.307312,0.8327,0.862205,0.847195
6,0.020800,0.341299,0.856574,0.846457,0.851485
7,0.008900,0.335284,0.856574,0.846457,0.851485
8,0.005600,0.331235,0.844358,0.854331,0.849315
9,0.002200,0.386206,0.827068,0.866142,0.846154
10,0.002300,0.381521,0.849802,0.846457,0.848126



Best Model saved at: ./saved_models/ate_deepset_gbert-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_deepset_gbert-base_42_42_12
mapping the test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4484.99 examples/s]


Unique predicted label IDs: {0, 1}
Expected label IDs: {0, 1}
Classification Report:
              precision    recall  f1-score   support

      ASPECT       0.90      0.78      0.83       315

   micro avg       0.90      0.78      0.83       315
   macro avg       0.90      0.78      0.83       315
weighted avg       0.90      0.78      0.83       315

Precision Score: 0.9007352941176471
Recall Score: 0.7777777777777778
F1 Score: 0.8347529812606475
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O']
Tokens     : ['Insbesondere', 'bei', 'Unikliniken', ',', 'mit', 'anderen', 'Krankheitsbildern', 'haben', 'sie', 'leider', 'ab', 'und', 'zu', 'Probleme', '.']
True Labels: ['O', 'O', 'B-ASPECT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labe
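
The dictionary printed before each training run, e.g. `{0: 0.5587538226299694, 1: 4.755042290175667}` for `['O' 'B-ASPECT']`, is numerically consistent with the common "balanced" class-weight heuristic, weight_c = N / (K * n_c), which is also the formula behind scikit-learn's `compute_class_weight(class_weight="balanced")`. Whether `ate_model` computes its weights this way is an assumption, since its source is not shown here. A minimal pure-Python sketch of the heuristic on toy labels follows; `balanced_class_weights` and the toy data are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: N / (K * n_c)} for N labels, K classes and per-class counts n_c."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}

# Toy token labels (not the real data): 'O' dominates, as in the review corpus.
toy_labels = ["O"] * 17 + ["B-ASPECT"] * 2
weights = balanced_class_weights(toy_labels)
print(weights)

# Consistency check on the values logged in this notebook: under the balanced
# formula, 1 / (K * weight_c) is the relative frequency of class c, so these
# reciprocals should sum to 1.
logged = {0: 0.5587538226299694, 1: 4.755042290175667}
implied_freqs = [1 / (len(logged) * w) for w in logged.values()]
print(implied_freqs, sum(implied_freqs))
```

Under this reading, the rare `B-ASPECT` tag makes up roughly 10.5% of the labelled tokens and is up-weighted by a factor of about 8.5 relative to `O`.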

### 2. Train category-aware ATE Models for 5, 6, 7, 8, 10, 12 epochs

In [6]:
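# Same loop with the category-aware variant: each aspect is tagged with its
# category (e.g. 'B-Arzt', 'B-Krankenhaus') instead of a single 'B-ASPECT' tag.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=5, save=True)
    print()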

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5855.83 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4847.46 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training google-bert/bert-base-german-cased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.224869,0.709677,0.590038,0.644351
2,0.330800,0.190629,0.747788,0.64751,0.694045
3,0.138700,0.210872,0.764957,0.685824,0.723232
4,0.053500,0.233231,0.771084,0.735632,0.752941
5,0.029300,0.252053,0.738095,0.712644,0.725146



Best Model saved at: ./saved_models/ate_cat_google-bert_bert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_google-bert_bert-base-german-cased_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4337.09 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.86      0.92      0.89        52
    Krankenhaus       0.90      0.71      0.79       119
       Personal       0.75      0.64      0.69        14
 Pflegepersonal       0.94      0.94      0.94        18
anderer Service       0.68      0.39      0.50        33
 mediz. Service       0.82      0.76      0.79        87

      micro avg       0.85      0.73      0.79       323
      macro avg       0.83      0.73      0.77       323
   weighted avg       0.85      0.73      0.78       323

Precision Score: 0.8525179856115108
Recall Score: 0.7337461300309598
F1 Score: 0.788685524126456
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'
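
The `macro avg` and `weighted avg` rows of the category-aware report can be reproduced from the per-class rows: the macro average is the unweighted mean over classes, while the weighted average uses each class's support as its weight. The sketch below uses the rounded two-decimal values printed for google-bert/bert-base-german-cased (5 epochs), so it matches the report only up to rounding; `report` and `averages` are illustrative names, not part of the notebook's code.

```python
# (precision, recall, support) per category, copied from the printed report.
report = {
    "Arzt": (0.86, 0.92, 52),
    "Krankenhaus": (0.90, 0.71, 119),
    "Personal": (0.75, 0.64, 14),
    "Pflegepersonal": (0.94, 0.94, 18),
    "anderer Service": (0.68, 0.39, 33),
    "mediz. Service": (0.82, 0.76, 87),
}

def averages(rows, index):
    """Return (macro, support-weighted) average of column `index` (0=precision, 1=recall)."""
    values = [row[index] for row in rows.values()]
    supports = [row[2] for row in rows.values()]
    macro = sum(values) / len(values)
    weighted = sum(v * s for v, s in zip(values, supports)) / sum(supports)
    return macro, weighted

total_support = sum(row[2] for row in report.values())
macro_p, weighted_p = averages(report, 0)
macro_r, weighted_r = averages(report, 1)

print("total support:", total_support)
print(f"precision: macro={macro_p:.3f} weighted={weighted_p:.3f}")
print(f"recall:    macro={macro_r:.3f} weighted={weighted_r:.3f}")
```

Because macro averaging gives small classes such as `Personal` and `Pflegepersonal` the same weight as `Krankenhaus`, it is the more sensitive of the two to weak minority categories.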

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5939.80 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4806.16 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-cased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.221302,0.686916,0.57874,0.628205
2,0.353600,0.202281,0.724576,0.673228,0.697959
3,0.156100,0.229355,0.771552,0.704724,0.736626
4,0.071100,0.262558,0.748988,0.728346,0.738523
5,0.038000,0.280303,0.75502,0.740157,0.747515



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-cased_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4417.47 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.90      0.95      0.92        55
    Krankenhaus       0.86      0.64      0.74       117
       Personal       0.67      0.46      0.55        13
 Pflegepersonal       1.00      0.94      0.97        18
anderer Service       0.69      0.57      0.62        35
 mediz. Service       0.79      0.75      0.77        77

      micro avg       0.84      0.72      0.78       315
      macro avg       0.82      0.72      0.76       315
   weighted avg       0.83      0.72      0.77       315

Precision Score: 0.8351648351648352
Recall Score: 0.7238095238095238
F1 Score: 0.7755102040816326
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
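
Note on the class-weight dictionary printed before each training run: the values are consistent with inverse-frequency ("balanced") weighting, weight_c = n_samples / (n_classes * count_c). The helper that actually produces them is not shown in this output, so treat the sketch below as an assumption about the scheme rather than the notebook's implementation. `balanced_class_weights` is a hypothetical name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c) for every class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy example: a frequent 'O' tag and a rare aspect tag.
toy_weights = balanced_class_weights(["O"] * 8 + ["B-Arzt"] * 2)
print(toy_weights)

# Inverting the formula on the logged weights gives the implied class shares.
logged = [0.1596439493228484, 4.959959280624364, 14.014381591562799,
          6.8239962651727355, 10.546176046176047, 5.983217355710193,
          18.317042606516292]
implied_shares = [1.0 / (len(logged) * w) for w in logged]
print([round(s, 4) for s in implied_shares], round(sum(implied_shares), 4))
```

If the assumption holds, the implied shares sum to 1 and class 0 (`O`) covers close to 90% of the labelled tokens, which is why its weight is so much smaller than the others.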

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5664.95 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4729.22 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-uncased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.224606,0.723301,0.59127,0.650655
2,0.326400,0.234411,0.763636,0.666667,0.711864
3,0.150000,0.241171,0.794118,0.75,0.771429
4,0.064700,0.269833,0.773663,0.746032,0.759596
5,0.046400,0.273208,0.783673,0.761905,0.772636



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-uncased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-uncased_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4383.44 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.90      0.90      0.90        63
    Krankenhaus       0.92      0.60      0.72       112
       Personal       0.75      0.86      0.80        14
 Pflegepersonal       1.00      0.95      0.97        19
anderer Service       0.63      0.35      0.45        34
 mediz. Service       0.85      0.77      0.81        74

      micro avg       0.87      0.71      0.78       316
      macro avg       0.84      0.74      0.78       316
   weighted avg       0.87      0.71      0.77       316

Precision Score: 0.87109375
Recall Score: 0.7056962025316456
F1 Score: 0.7797202797202798
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 
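
Note on the per-epoch table of this run: the epoch with the highest validation F1 and the epoch with the lowest validation loss need not coincide. The sketch below parses the table for `dbmdz/bert-base-german-uncased` (5 epochs) as copied from the log above and reports both. Which criterion the training helper uses to pick the "best model" is not visible in this output, so both are shown; `parse_epoch_log` is a hypothetical helper.

```python
import csv
import io

# Per-epoch table copied from the log of this run.
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.224606,0.723301,0.59127,0.650655
2,0.326400,0.234411,0.763636,0.666667,0.711864
3,0.150000,0.241171,0.794118,0.75,0.771429
4,0.064700,0.269833,0.773663,0.746032,0.759596
5,0.046400,0.273208,0.783673,0.761905,0.772636
"""


def parse_epoch_log(text):
    """Turn the comma-separated epoch table into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        train_loss = row["Training Loss"]
        rows.append({
            "epoch": int(row["Epoch"]),
            # "No log" means no training loss had been logged yet.
            "train_loss": None if train_loss == "No log" else float(train_loss),
            "val_loss": float(row["Validation Loss"]),
            "f1": float(row["F1"]),
        })
    return rows


rows = parse_epoch_log(LOG)
best_by_f1 = max(rows, key=lambda r: r["f1"])
best_by_loss = min(rows, key=lambda r: r["val_loss"])
print("best F1:", best_by_f1["epoch"], best_by_f1["f1"])
print("lowest validation loss:", best_by_loss["epoch"], best_by_loss["val_loss"])
```

Here validation loss rises after the first epoch while F1 keeps improving, so the two criteria select different checkpoints.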

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5837.36 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5026.16 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training FacebookAI/xlm-roberta-base for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.283657,0.586777,0.491349,0.53484
2,0.421400,0.221515,0.691275,0.712803,0.701874
3,0.227300,0.211442,0.716312,0.698962,0.707531
4,0.136900,0.205753,0.724252,0.754325,0.738983
5,0.102100,0.216718,0.728223,0.723183,0.725694



Best Model saved at: ./saved_models/ate_cat_FacebookAI_xlm-roberta-base_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_FacebookAI_xlm-roberta-base_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4289.47 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.82      0.93      0.87        59
    Krankenhaus       0.79      0.79      0.79       120
       Personal       0.67      0.67      0.67        15
 Pflegepersonal       1.00      0.88      0.93        24
anderer Service       0.48      0.33      0.39        45
 mediz. Service       0.70      0.67      0.69        83

      micro avg       0.75      0.73      0.74       346
      macro avg       0.74      0.71      0.72       346
   weighted avg       0.74      0.73      0.73       346

Precision Score: 0.7522388059701492
Recall Score: 0.7283236994219653
F1 Score: 0.7400881057268722
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6517.35 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5059.29 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_best for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.233069,0.711957,0.617925,0.661616
2,0.318700,0.185063,0.743842,0.712264,0.727711
3,0.173200,0.213243,0.737864,0.716981,0.727273
4,0.084500,0.283493,0.734694,0.679245,0.705882
5,0.057500,0.274932,0.730392,0.70283,0.716346



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_best_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_best_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4767.00 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.88      0.98      0.93        52
    Krankenhaus       0.85      0.65      0.74       104
       Personal       0.70      0.58      0.64        12
 Pflegepersonal       0.93      0.93      0.93        14
anderer Service       0.83      0.33      0.48        30
 mediz. Service       0.79      0.81      0.80        67

      micro avg       0.84      0.73      0.78       279
      macro avg       0.83      0.71      0.75       279
   weighted avg       0.84      0.73      0.77       279

Precision Score: 0.8388429752066116
Recall Score: 0.7275985663082437
F1 Score: 0.7792706333973128
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
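
Note on the two label-ID lines in the evaluation output: every run reports the same predicted set, and one expected ID never appears among the predictions. A minimal sketch of that comparison with plain Python sets follows. What label ID 1 stands for is not shown in this output, so no interpretation is attached to it here.

```python
# Sets copied from the "Unique predicted label IDs" / "Expected label IDs" lines.
predicted_ids = {0, 2, 3, 4, 5, 6, 7}
expected_ids = set(range(8))

# IDs the model never emitted on the test set, and IDs outside the label space.
never_predicted = expected_ids - predicted_ids
unexpected = predicted_ids - expected_ids

print("never predicted:", sorted(never_predicted))
print("unexpected:", sorted(unexpected))
```

An ID that is never predicted contributes no row of its own to the classification report, so it is worth checking whether that ID is a real class or a placeholder before reading the averages.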

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6443.90 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5059.96 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_filtered_base_best for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.20974,0.761628,0.617925,0.682292
2,0.321700,0.177827,0.767327,0.731132,0.748792
3,0.160200,0.178644,0.763547,0.731132,0.746988
4,0.082200,0.214671,0.771845,0.75,0.760766
5,0.052700,0.21025,0.771845,0.75,0.760766



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_filtered_base_best_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_filtered_base_best_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4737.44 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.91      0.96      0.93        52
    Krankenhaus       0.91      0.66      0.77       104
       Personal       0.75      0.75      0.75        12
 Pflegepersonal       0.93      0.93      0.93        14
anderer Service       0.59      0.33      0.43        30
 mediz. Service       0.78      0.73      0.75        67

      micro avg       0.84      0.72      0.78       279
      macro avg       0.81      0.73      0.76       279
   weighted avg       0.84      0.72      0.77       279

Precision Score: 0.8438818565400844
Recall Score: 0.7168458781362007
F1 Score: 0.7751937984496124
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6457.95 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4988.46 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_last for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.235655,0.734463,0.613208,0.66838
2,0.313000,0.175196,0.757426,0.721698,0.73913
3,0.167600,0.201712,0.764423,0.75,0.757143
4,0.083800,0.251479,0.8,0.716981,0.756219
5,0.054600,0.256751,0.785,0.740566,0.762136



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_last_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_last_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4653.34 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.96      0.98      0.97        52
    Krankenhaus       0.90      0.76      0.82       104
       Personal       0.69      0.75      0.72        12
 Pflegepersonal       1.00      1.00      1.00        14
anderer Service       0.63      0.57      0.60        30
 mediz. Service       0.67      0.70      0.69        67

      micro avg       0.82      0.78      0.80       279
      macro avg       0.81      0.79      0.80       279
   weighted avg       0.82      0.78      0.80       279

Precision Score: 0.8188679245283019
Recall Score: 0.7777777777777778
F1 Score: 0.7977941176470589
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
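
Note on the average rows of the classification report: the macro average is the unweighted mean over the six aspect classes, while the weighted average uses each class's support as its weight. The sketch below recomputes both F1 averages from the per-class rows of the `TUM/GottBERT_base_last` report above. Because the inputs are the rounded two-decimal values from the report, the results only match to about two decimals.

```python
# (class, f1, support) copied from the report of TUM/GottBERT_base_last (5 epochs).
report = [
    ("Arzt", 0.97, 52),
    ("Krankenhaus", 0.82, 104),
    ("Personal", 0.72, 12),
    ("Pflegepersonal", 1.00, 14),
    ("anderer Service", 0.60, 30),
    ("mediz. Service", 0.69, 67),
]

total_support = sum(support for _, _, support in report)

# Macro: every class counts equally, regardless of how often it occurs.
macro_f1 = sum(f1 for _, f1, _ in report) / len(report)

# Weighted: frequent classes such as Krankenhaus dominate the average.
weighted_f1 = sum(f1 * support for _, f1, support in report) / total_support

print(f"support={total_support} macro={macro_f1:.2f} weighted={weighted_f1:.2f}")
```

The macro average is the more sensitive of the two to weak rare classes such as `anderer Service`, since each class carries one sixth of the weight regardless of its support.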

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6479.63 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5237.95 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training distilbert/distilbert-base-german-cased for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.241931,0.632558,0.535433,0.579957
2,0.397300,0.222672,0.715556,0.633858,0.672234
3,0.172500,0.243207,0.748792,0.610236,0.672451
4,0.096700,0.243671,0.75,0.685039,0.716049
5,0.069700,0.255058,0.73617,0.681102,0.707566



Best Model saved at: ./saved_models/ate_cat_distilbert_distilbert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_distilbert_distilbert-base-german-cased_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4687.40 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.89      0.93      0.91        55
    Krankenhaus       0.94      0.56      0.71       117
       Personal       0.60      0.46      0.52        13
 Pflegepersonal       0.94      0.89      0.91        18
anderer Service       0.75      0.34      0.47        35
 mediz. Service       0.68      0.68      0.68        77

      micro avg       0.82      0.64      0.72       315
      macro avg       0.80      0.64      0.70       315
   weighted avg       0.83      0.64      0.71       315

Precision Score: 0.8218623481781376
Recall Score: 0.6444444444444445
F1 Score: 0.7224199288256228
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5663.47 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4554.52 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training GerMedBERT/medbert-512 for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.202199,0.685279,0.584416,0.630841
2,0.320200,0.171366,0.70354,0.688312,0.695842
3,0.146400,0.2165,0.752427,0.670996,0.709382
4,0.063300,0.233945,0.728889,0.709957,0.719298
5,0.040300,0.249321,0.732143,0.709957,0.720879



Best Model saved at: ./saved_models/ate_cat_GerMedBERT_medbert-512_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_GerMedBERT_medbert-512_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4172.09 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.98      0.96      0.97        54
    Krankenhaus       0.89      0.76      0.82       105
       Personal       0.93      0.88      0.90        16
 Pflegepersonal       0.93      0.93      0.93        15
anderer Service       0.56      0.40      0.47        35
 mediz. Service       0.79      0.60      0.68        63

      micro avg       0.86      0.74      0.79       288
      macro avg       0.85      0.76      0.80       288
   weighted avg       0.85      0.74      0.79       288

Precision Score: 0.8617886178861789
Recall Score: 0.7361111111111112
F1 Score: 0.7940074906367042
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5843.76 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4748.14 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training deepset/gbert-base for 5 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.230418,0.686636,0.586614,0.632696
2,0.337500,0.192138,0.764151,0.637795,0.695279
3,0.147500,0.219408,0.752137,0.692913,0.721311
4,0.059800,0.244842,0.76824,0.704724,0.735113
5,0.037300,0.257871,0.759184,0.732283,0.745491



Best Model saved at: ./saved_models/ate_cat_deepset_gbert-base_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_deepset_gbert-base_42_42_5
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4399.57 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.90      0.95      0.92        55
    Krankenhaus       0.90      0.62      0.73       117
       Personal       0.73      0.62      0.67        13
 Pflegepersonal       1.00      0.94      0.97        18
anderer Service       0.73      0.46      0.56        35
 mediz. Service       0.88      0.73      0.79        77

      micro avg       0.88      0.70      0.78       315
      macro avg       0.85      0.72      0.77       315
   weighted avg       0.87      0.70      0.77       315

Precision Score: 0.876984126984127
Recall Score: 0.7015873015873015
F1 Score: 0.7795414462081128
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'
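
Note on comparing the 5-epoch runs: the test-set "F1 Score" lines above can be collected and ranked as in the sketch below. The values are copied from the log. The model name for the first entry is inferred from the loop order, since its header line is not part of this excerpt. Each score comes from a single seed pair (42, 42) on 102 test reviews, and the report support differs between models (279 to 346), so small differences should not be over-interpreted.

```python
# Test-set F1 after 5 epochs, copied from the "F1 Score" lines of each run.
f1_5_epochs = {
    "google-bert/bert-base-german-cased": 0.788685524126456,  # name inferred from loop order
    "dbmdz/bert-base-german-cased": 0.7755102040816326,
    "dbmdz/bert-base-german-uncased": 0.7797202797202798,
    "FacebookAI/xlm-roberta-base": 0.7400881057268722,
    "TUM/GottBERT_base_best": 0.7792706333973128,
    "TUM/GottBERT_filtered_base_best": 0.7751937984496124,
    "TUM/GottBERT_base_last": 0.7977941176470589,
    "distilbert/distilbert-base-german-cased": 0.7224199288256228,
    "GerMedBERT/medbert-512": 0.7940074906367042,
    "deepset/gbert-base": 0.7795414462081128,
}

# Sort from highest to lowest F1.
ranking = sorted(f1_5_epochs.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank:2d}. {name:<42s} {score:.4f}")
```
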

In [7]:
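# Repeat the fine-tuning run for every checkpoint in `models`, now with 6 epochs
# (same seeds as the 5-epoch runs; save=True writes the model and tokenizer to disk).
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=6, save=True)
    print()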

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5777.21 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4646.03 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training google-bert/bert-base-german-cased for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.223776,0.703704,0.582375,0.637317
2,0.325700,0.207315,0.698276,0.62069,0.657201
3,0.134500,0.220423,0.767347,0.720307,0.743083
4,0.052200,0.257904,0.775424,0.701149,0.736419
5,0.024000,0.298462,0.760684,0.681992,0.719192
6,0.024000,0.31066,0.757322,0.693487,0.724



Best Model saved at: ./saved_models/ate_cat_google-bert_bert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_google-bert_bert-base-german-cased_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4388.47 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.86      0.96      0.91        52
    Krankenhaus       0.92      0.61      0.73       119
       Personal       0.67      0.43      0.52        14
 Pflegepersonal       1.00      0.94      0.97        18
anderer Service       0.68      0.45      0.55        33
 mediz. Service       0.77      0.80      0.79        87

      micro avg       0.84      0.71      0.77       323
      macro avg       0.82      0.70      0.74       323
   weighted avg       0.84      0.71      0.76       323

Precision Score: 0.8363636363636363
Recall Score: 0.7120743034055728
F1 Score: 0.7692307692307693
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5779.66 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4844.19 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-cased for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.22105,0.704545,0.610236,0.654008
2,0.342700,0.199783,0.750988,0.748031,0.749507
3,0.163400,0.233014,0.735849,0.767717,0.751445
4,0.074800,0.23267,0.79661,0.740157,0.767347
5,0.044900,0.269969,0.740876,0.799213,0.768939
6,0.044900,0.277904,0.765152,0.795276,0.779923



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-cased_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4463.65 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.87      0.95      0.90        55
    Krankenhaus       0.91      0.74      0.82       117
       Personal       0.64      0.54      0.58        13
 Pflegepersonal       0.94      0.94      0.94        18
anderer Service       0.69      0.57      0.62        35
 mediz. Service       0.79      0.79      0.79        77

      micro avg       0.84      0.77      0.80       315
      macro avg       0.81      0.75      0.78       315
   weighted avg       0.84      0.77      0.80       315

Precision Score: 0.8408304498269896
Recall Score: 0.7714285714285715
F1 Score: 0.804635761589404
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5536.11 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4686.74 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-uncased for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.228741,0.679825,0.615079,0.645833
2,0.325300,0.228101,0.738938,0.662698,0.698745
3,0.150000,0.219872,0.772358,0.753968,0.763052
4,0.061300,0.259166,0.812766,0.757937,0.784394
5,0.040100,0.262348,0.784,0.777778,0.780876
6,0.040100,0.305934,0.774059,0.734127,0.753564



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-uncased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-uncased_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4439.43 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.97      0.97      0.97        63
    Krankenhaus       0.98      0.47      0.64       112
       Personal       0.75      0.86      0.80        14
 Pflegepersonal       1.00      0.95      0.97        19
anderer Service       0.64      0.41      0.50        34
 mediz. Service       0.88      0.76      0.81        74

      micro avg       0.90      0.68      0.77       316
      macro avg       0.87      0.74      0.78       316
   weighted avg       0.91      0.68      0.76       316

Precision Score: 0.9029535864978903
Recall Score: 0.6772151898734177
F1 Score: 0.7739602169981916
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5752.94 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4837.39 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training FacebookAI/xlm-roberta-base for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.278038,0.637131,0.522491,0.574144
2,0.419000,0.222709,0.735632,0.66436,0.698182
3,0.226500,0.218506,0.748175,0.709343,0.728242
4,0.133300,0.244461,0.740876,0.702422,0.721137
5,0.096300,0.256283,0.720779,0.768166,0.743719
6,0.096300,0.26253,0.734483,0.737024,0.735751



Best Model saved at: ./saved_models/ate_cat_FacebookAI_xlm-roberta-base_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_FacebookAI_xlm-roberta-base_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4606.25 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.82      0.93      0.87        59
    Krankenhaus       0.79      0.86      0.82       120
       Personal       0.71      0.67      0.69        15
 Pflegepersonal       1.00      0.96      0.98        24
anderer Service       0.67      0.49      0.56        45
 mediz. Service       0.78      0.75      0.77        83

      micro avg       0.79      0.79      0.79       346
      macro avg       0.80      0.78      0.78       346
   weighted avg       0.79      0.79      0.79       346

Precision Score: 0.792507204610951
Recall Score: 0.7947976878612717
F1 Score: 0.7936507936507936
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'
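
The three summary scores printed after each classification report are linked by the harmonic-mean formula for F1. As a minimal sketch, the FacebookAI/xlm-roberta-base values above can be reproduced from entity counts. The support of 346 is taken from the report; the counts of 275 correctly predicted and 347 predicted entities are not printed in the log and are inferred here from the scores, so treat them as assumptions.

```python
def micro_prf(n_correct, n_predicted, n_gold):
    """Micro-averaged precision, recall and F1 from entity counts."""
    precision = n_correct / n_predicted   # share of predicted entities that are right
    recall = n_correct / n_gold           # share of gold entities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 346 = support from the report; 275 and 347 are inferred (assumed) counts.
precision, recall, f1 = micro_prf(n_correct=275, n_predicted=347, n_gold=346)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.7925 0.7948 0.7937
```

Because precision and recall share the same numerator, a model that predicts fewer entities than the gold standard contains shows higher precision than recall, which is the pattern in most of the runs in this section.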

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6479.67 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5055.37 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_best for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.241105,0.706522,0.613208,0.656566
2,0.323200,0.20959,0.752747,0.646226,0.695431
3,0.166600,0.232567,0.736842,0.726415,0.731591
4,0.084100,0.276728,0.722488,0.712264,0.71734
5,0.052500,0.299214,0.696429,0.735849,0.715596
6,0.052500,0.322898,0.688372,0.698113,0.693208



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_best_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_best_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4691.20 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.94      0.98      0.96        52
    Krankenhaus       0.92      0.64      0.76       104
       Personal       0.64      0.75      0.69        12
 Pflegepersonal       0.93      0.93      0.93        14
anderer Service       0.67      0.47      0.55        30
 mediz. Service       0.68      0.78      0.72        67

      micro avg       0.81      0.74      0.77       279
      macro avg       0.80      0.76      0.77       279
   weighted avg       0.83      0.74      0.77       279

Precision Score: 0.8142292490118577
Recall Score: 0.7383512544802867
F1 Score: 0.7744360902255639
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
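
The dictionary printed before each run maps label IDs to loss weights. How `ate_cat_model` derives them is not shown in this notebook. The sketch below assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn uses for `class_weight="balanced"`. Under that heuristic the reciprocals of the weights sum to the number of classes, and the logged weights are consistent with it (their reciprocals sum to about 7). The function name and the toy label list are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for a flat list of labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical toy data: 'O' dominates, aspect labels are rare.
toy_labels = ["O"] * 8 + ["B-Arzt"] + ["B-Krankenhaus"]
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)

# Weights copied from the log output above.
logged_weights = {
    0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799,
    3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193,
    6: 18.317042606516292,
}
# For balanced weights, sum(1 / w) equals the number of classes.
reciprocal_sum = sum(1 / w for w in logged_weights.values())
print(round(reciprocal_sum, 3))
```

The frequent `O` label gets a weight well below 1 and the rare aspect labels get weights well above 1, so errors on rare classes contribute more to the loss.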

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6422.38 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5021.39 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_filtered_base_best for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.213365,0.694301,0.632075,0.661728
2,0.319900,0.155376,0.787879,0.735849,0.760976
3,0.170500,0.183111,0.770833,0.698113,0.732673
4,0.085200,0.158645,0.758294,0.754717,0.756501
5,0.058300,0.250049,0.751295,0.683962,0.716049
6,0.058300,0.229074,0.734694,0.679245,0.705882



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_filtered_base_best_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_filtered_base_best_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4695.99 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.93      0.98      0.95        52
    Krankenhaus       0.92      0.66      0.77       104
       Personal       0.73      0.67      0.70        12
 Pflegepersonal       0.87      0.93      0.90        14
anderer Service       0.64      0.30      0.41        30
 mediz. Service       0.82      0.81      0.81        67

      micro avg       0.86      0.73      0.79       279
      macro avg       0.82      0.72      0.76       279
   weighted avg       0.86      0.73      0.78       279

Precision Score: 0.864406779661017
Recall Score: 0.7311827956989247
F1 Score: 0.7922330097087378
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6431.25 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5095.62 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_last for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.223606,0.698225,0.556604,0.619423
2,0.337400,0.188333,0.761421,0.707547,0.733496
3,0.175700,0.241469,0.746341,0.721698,0.733813
4,0.090200,0.248286,0.752525,0.70283,0.726829
5,0.058500,0.310955,0.716981,0.716981,0.716981
6,0.058500,0.338784,0.7109,0.707547,0.70922



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_last_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_last_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4660.08 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.94      0.98      0.96        52
    Krankenhaus       0.94      0.62      0.75       104
       Personal       0.67      0.67      0.67        12
 Pflegepersonal       0.93      0.93      0.93        14
anderer Service       0.53      0.33      0.41        30
 mediz. Service       0.70      0.84      0.76        67

      micro avg       0.82      0.73      0.77       279
      macro avg       0.78      0.73      0.75       279
   weighted avg       0.83      0.73      0.76       279

Precision Score: 0.8185483870967742
Recall Score: 0.7275985663082437
F1 Score: 0.7703984819734346
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6500.53 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5163.57 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training distilbert/distilbert-base-german-cased for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.232993,0.648402,0.559055,0.600423
2,0.388600,0.218969,0.688525,0.661417,0.674699
3,0.173900,0.232267,0.75,0.649606,0.696203
4,0.091800,0.240863,0.757202,0.724409,0.740443
5,0.062200,0.266235,0.759494,0.708661,0.733198
6,0.062200,0.280191,0.741667,0.700787,0.720648



Best Model saved at: ./saved_models/ate_cat_distilbert_distilbert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_distilbert_distilbert-base-german-cased_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4782.13 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.91      0.93      0.92        55
    Krankenhaus       0.92      0.61      0.73       117
       Personal       0.62      0.62      0.62        13
 Pflegepersonal       1.00      0.89      0.94        18
anderer Service       0.78      0.40      0.53        35
 mediz. Service       0.65      0.71      0.68        77

      micro avg       0.81      0.68      0.74       315
      macro avg       0.81      0.69      0.74       315
   weighted avg       0.83      0.68      0.74       315

Precision Score: 0.8113207547169812
Recall Score: 0.6825396825396826
F1 Score: 0.7413793103448275
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5726.06 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4561.58 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training GerMedBERT/medbert-512 for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.190527,0.707692,0.597403,0.647887
2,0.320100,0.167318,0.733333,0.714286,0.723684
3,0.148900,0.226242,0.763285,0.683983,0.721461
4,0.063200,0.248719,0.740566,0.679654,0.708804
5,0.036700,0.277726,0.725664,0.709957,0.717724
6,0.036700,0.286306,0.734513,0.718615,0.726477



Best Model saved at: ./saved_models/ate_cat_GerMedBERT_medbert-512_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_GerMedBERT_medbert-512_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4245.37 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       1.00      0.98      0.99        54
    Krankenhaus       0.86      0.79      0.82       105
       Personal       1.00      0.62      0.77        16
 Pflegepersonal       0.79      1.00      0.88        15
anderer Service       0.58      0.40      0.47        35
 mediz. Service       0.85      0.70      0.77        63

      micro avg       0.86      0.76      0.81       288
      macro avg       0.85      0.75      0.78       288
   weighted avg       0.85      0.76      0.80       288

Precision Score: 0.8588235294117647
Recall Score: 0.7604166666666666
F1 Score: 0.8066298342541436
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5915.46 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4655.17 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training deepset/gbert-base for 6 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.238325,0.688372,0.582677,0.63113
2,0.330400,0.185145,0.75,0.661417,0.702929
3,0.142500,0.218466,0.793249,0.740157,0.765784
4,0.057400,0.230707,0.787234,0.728346,0.756646
5,0.034700,0.236863,0.765385,0.783465,0.774319
6,0.034700,0.24163,0.786561,0.783465,0.78501



Best Model saved at: ./saved_models/ate_cat_deepset_gbert-base_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_deepset_gbert-base_42_42_6
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4378.90 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.88      0.96      0.92        55
    Krankenhaus       0.92      0.82      0.87       117
       Personal       0.64      0.69      0.67        13
 Pflegepersonal       1.00      0.94      0.97        18
anderer Service       0.59      0.46      0.52        35
 mediz. Service       0.84      0.75      0.79        77

      micro avg       0.86      0.79      0.82       315
      macro avg       0.81      0.77      0.79       315
   weighted avg       0.85      0.79      0.82       315

Precision Score: 0.8556701030927835
Recall Score: 0.7904761904761904
F1 Score: 0.8217821782178217
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
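
The `macro avg` and `weighted avg` rows of a classification report can be recomputed from the per-class rows. A minimal sketch using the F1 values and supports of the deepset/gbert-base report above. The inputs are already rounded to two decimals, so the result is only expected to agree to two decimals.

```python
# (f1-score, support) per aspect category, copied from the report above.
report = {
    "Arzt": (0.92, 55),
    "Krankenhaus": (0.87, 117),
    "Personal": (0.67, 13),
    "Pflegepersonal": (0.97, 18),
    "anderer Service": (0.52, 35),
    "mediz. Service": (0.79, 77),
}

total_support = sum(support for _, support in report.values())
# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro {macro_f1:.2f}, weighted {weighted_f1:.2f}, support {total_support}")
```

The macro average is pulled down by small, hard classes such as `anderer Service`, while the weighted average is dominated by `Krankenhaus` and `mediz. Service`.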

In [8]:
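# Fine-tune every checkpoint in `models` for 7 epochs.
# rn1 and rn2 are the two random seeds reported in the training log;
# save=True writes the best model and its tokenizer to disk.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=7, save=True)
    print()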

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5743.69 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4616.46 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training google-bert/bert-base-german-cased for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.223721,0.690909,0.582375,0.632017
2,0.332800,0.191899,0.756198,0.701149,0.727634
3,0.136400,0.240071,0.738956,0.704981,0.721569
4,0.048900,0.290693,0.748899,0.651341,0.696721
5,0.025500,0.332394,0.763485,0.704981,0.733068
6,0.025500,0.348486,0.76824,0.685824,0.724696
7,0.005900,0.359796,0.760504,0.693487,0.725451



Best Model saved at: ./saved_models/ate_cat_google-bert_bert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_google-bert_bert-base-german-cased_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4347.18 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.80      0.92      0.86        52
    Krankenhaus       0.88      0.64      0.74       119
       Personal       0.67      0.57      0.62        14
 Pflegepersonal       0.89      0.94      0.92        18
anderer Service       0.68      0.45      0.55        33
 mediz. Service       0.89      0.78      0.83        87

      micro avg       0.84      0.72      0.78       323
      macro avg       0.80      0.72      0.75       323
   weighted avg       0.84      0.72      0.77       323

Precision Score: 0.8436363636363636
Recall Score: 0.718266253869969
F1 Score: 0.7759197324414716
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5807.17 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4660.70 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-cased for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.248914,0.695853,0.594488,0.641189
2,0.355300,0.169604,0.743191,0.751969,0.747554
3,0.162900,0.199503,0.783465,0.783465,0.783465
4,0.068700,0.218836,0.786885,0.755906,0.771084
5,0.038200,0.245226,0.782101,0.791339,0.786693
6,0.038200,0.255656,0.755556,0.80315,0.778626
7,0.016000,0.268988,0.777778,0.799213,0.78835



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-cased_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4252.55 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.93      0.95      0.94        55
    Krankenhaus       0.89      0.74      0.80       117
       Personal       0.83      0.77      0.80        13
 Pflegepersonal       1.00      1.00      1.00        18
anderer Service       0.68      0.49      0.57        35
 mediz. Service       0.79      0.81      0.80        77

      micro avg       0.86      0.78      0.82       315
      macro avg       0.85      0.79      0.82       315
   weighted avg       0.85      0.78      0.81       315

Precision Score: 0.8566433566433567
Recall Score: 0.7777777777777778
F1 Score: 0.8153078202995009
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5499.55 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4609.02 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}


  trainer = Trainer(


Training dbmdz/bert-base-german-uncased for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.232232,0.707317,0.575397,0.634573
2,0.334500,0.211842,0.764444,0.68254,0.721174
3,0.148200,0.231706,0.77533,0.698413,0.734864
4,0.061100,0.243154,0.78125,0.793651,0.787402
5,0.032000,0.282041,0.776423,0.757937,0.767068
6,0.032000,0.297592,0.797521,0.765873,0.781377
7,0.010500,0.321191,0.78,0.77381,0.776892



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-uncased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-uncased_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4409.69 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.91      0.92      0.91        63
    Krankenhaus       0.92      0.62      0.74       112
       Personal       0.69      0.79      0.73        14
 Pflegepersonal       1.00      0.95      0.97        19
anderer Service       0.47      0.44      0.45        34
 mediz. Service       0.83      0.77      0.80        74

      micro avg       0.83      0.72      0.77       316
      macro avg       0.80      0.75      0.77       316
   weighted avg       0.84      0.72      0.77       316

Precision Score: 0.8321167883211679
Recall Score: 0.7215189873417721
F1 Score: 0.7728813559322033
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5757.83 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4867.46 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training FacebookAI/xlm-roberta-base for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.268955,0.609375,0.539792,0.572477
2,0.447000,0.212149,0.737255,0.650519,0.691176
3,0.235300,0.18512,0.757042,0.743945,0.750436
4,0.136100,0.226553,0.776952,0.723183,0.749104
5,0.096600,0.22377,0.760943,0.782007,0.771331
6,0.096600,0.267362,0.779783,0.747405,0.763251
7,0.066600,0.268247,0.767123,0.775087,0.771084



Best Model saved at: ./saved_models/ate_cat_FacebookAI_xlm-roberta-base_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_FacebookAI_xlm-roberta-base_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4611.31 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.82      0.93      0.87        59
    Krankenhaus       0.81      0.83      0.82       120
       Personal       0.77      0.67      0.71        15
 Pflegepersonal       0.88      0.96      0.92        24
anderer Service       0.58      0.33      0.42        45
 mediz. Service       0.69      0.77      0.73        83

      micro avg       0.77      0.77      0.77       346
      macro avg       0.76      0.75      0.75       346
   weighted avg       0.76      0.77      0.76       346

Precision Score: 0.7672413793103449
Recall Score: 0.7716763005780347
F1 Score: 0.7694524495677234
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6407.32 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5016.81 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_best for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.285238,0.728916,0.570755,0.640212
2,0.335000,0.195575,0.737374,0.688679,0.712195
3,0.172900,0.276536,0.717073,0.693396,0.705036
4,0.085700,0.300896,0.727273,0.641509,0.681704
5,0.055000,0.33724,0.696262,0.70283,0.699531
6,0.055000,0.394077,0.723077,0.665094,0.692875
7,0.023800,0.394472,0.731707,0.707547,0.719424



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_best_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_best_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4666.08 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.93      0.98      0.95        52
    Krankenhaus       0.87      0.80      0.83       104
       Personal       0.75      0.75      0.75        12
 Pflegepersonal       1.00      1.00      1.00        14
anderer Service       0.64      0.47      0.54        30
 mediz. Service       0.66      0.73      0.70        67

      micro avg       0.81      0.79      0.80       279
      macro avg       0.81      0.79      0.80       279
   weighted avg       0.81      0.79      0.80       279

Precision Score: 0.8088235294117647
Recall Score: 0.7885304659498208
F1 Score: 0.7985480943738655
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6439.23 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4975.22 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_filtered_base_best for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.19828,0.695652,0.603774,0.646465
2,0.328500,0.166925,0.748768,0.716981,0.73253
3,0.162800,0.175944,0.760331,0.867925,0.810573
4,0.079300,0.226092,0.751174,0.754717,0.752941
5,0.050200,0.273639,0.758621,0.726415,0.742169
6,0.050200,0.255691,0.767773,0.764151,0.765957
7,0.021500,0.27,0.766667,0.759434,0.763033



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_filtered_base_best_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_filtered_base_best_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4720.08 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.96      0.98      0.97        52
    Krankenhaus       0.82      0.67      0.74       104
       Personal       0.64      0.75      0.69        12
 Pflegepersonal       0.87      0.93      0.90        14
anderer Service       0.59      0.33      0.43        30
 mediz. Service       0.59      0.84      0.69        67

      micro avg       0.75      0.75      0.75       279
      macro avg       0.75      0.75      0.74       279
   weighted avg       0.76      0.75      0.74       279

Precision Score: 0.7491039426523297
Recall Score: 0.7491039426523297
F1 Score: 0.7491039426523297
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
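
The dictionary printed before each run maps the seven label IDs to loss weights: `O` is weighted far below 1 and the rare aspect labels well above it. The weighting code lives in `functions.ate_model_train_OB` and is not shown in this notebook, so the following is only a sketch, assuming the common "balanced" heuristic `n_samples / (n_classes * count)`. One observation supports that reading: under this formula the reciprocals of the weights sum to the number of classes, and the printed values sum to roughly 7. The helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(label_ids):
    """Return {label_id: n_samples / (n_classes * count)} for a flat list of label IDs."""
    counts = Counter(label_ids)
    n_samples = len(label_ids)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}


# Toy data (hypothetical): 8 tokens tagged 0 ('O') and 2 tokens tagged 1.
toy_weights = balanced_class_weights([0] * 8 + [1] * 2)
print(toy_weights)  # {0: 0.625, 1: 2.5}

# Weights exactly as printed in the training log above.
printed = {
    0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799,
    3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193,
    6: 18.317042606516292,
}

# Under the balanced formula, sum(1 / w) equals the number of classes.
reciprocal_sum = sum(1.0 / w for w in printed.values())
print(round(reciprocal_sum, 3))
```

If the helper uses a different scheme, for example smoothed or capped weights, the reciprocal check would generally not come out at the class count.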

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6519.86 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4902.78 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_last for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.240597,0.777778,0.561321,0.652055
2,0.349400,0.186103,0.747368,0.669811,0.706468
3,0.180700,0.198872,0.717703,0.707547,0.712589
4,0.095200,0.267394,0.717822,0.683962,0.700483
5,0.056800,0.324792,0.71028,0.716981,0.713615
6,0.056800,0.369378,0.712871,0.679245,0.695652
7,0.025900,0.381289,0.719807,0.70283,0.711217



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_last_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_last_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4836.68 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.94      0.98      0.96        52
    Krankenhaus       0.85      0.75      0.80       104
       Personal       0.71      0.83      0.77        12
 Pflegepersonal       0.93      1.00      0.97        14
anderer Service       0.71      0.57      0.63        30
 mediz. Service       0.70      0.82      0.75        67

      micro avg       0.81      0.81      0.81       279
      macro avg       0.81      0.83      0.81       279
   weighted avg       0.81      0.81      0.81       279

Precision Score: 0.8093525179856115
Recall Score: 0.8064516129032258
F1 Score: 0.8078994614003591
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
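
The three summary scores of this run can be reproduced from micro-averaged span counts. The sketch below assumes 225 correctly predicted aspect spans, 278 predicted spans and 279 gold spans. These counts are inferred from the printed precision, recall and support; the notebook itself does not print them. The function name `micro_prf` is hypothetical.

```python
def micro_prf(true_positives, n_predicted, n_gold):
    """Micro-averaged precision, recall and F1 from entity-level counts."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts inferred from the TUM/GottBERT_base_last 7-epoch test report (assumption).
precision, recall, f1 = micro_prf(225, 278, 279)
print(precision, recall, f1)
```

When the number of predicted spans equals the number of gold spans, precision, recall and F1 coincide, which is why some runs in this log print three identical scores.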

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6608.13 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5321.12 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training distilbert/distilbert-base-german-cased for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.235504,0.701923,0.574803,0.632035
2,0.394800,0.20747,0.70339,0.653543,0.677551
3,0.179500,0.229822,0.771028,0.649606,0.705128
4,0.091800,0.23703,0.758065,0.740157,0.749004
5,0.062200,0.250153,0.743083,0.740157,0.741617
6,0.062200,0.26161,0.762846,0.759843,0.761341
7,0.034200,0.276165,0.762097,0.744094,0.752988



Best Model saved at: ./saved_models/ate_cat_distilbert_distilbert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_distilbert_distilbert-base-german-cased_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4989.43 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.87      0.95      0.90        55
    Krankenhaus       0.94      0.54      0.68       117
       Personal       0.75      0.46      0.57        13
 Pflegepersonal       0.94      0.94      0.94        18
anderer Service       0.71      0.43      0.54        35
 mediz. Service       0.63      0.75      0.69        77

      micro avg       0.79      0.67      0.73       315
      macro avg       0.81      0.68      0.72       315
   weighted avg       0.82      0.67      0.72       315

Precision Score: 0.793233082706767
Recall Score: 0.6698412698412698
F1 Score: 0.7263339070567986
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5745.20 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4645.72 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training GerMedBERT/medbert-512 for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.208703,0.653266,0.562771,0.604651
2,0.319400,0.188274,0.709402,0.718615,0.713978
3,0.146400,0.233929,0.75,0.688312,0.717833
4,0.071400,0.247591,0.764423,0.688312,0.724374
5,0.035900,0.262979,0.738397,0.757576,0.747863
6,0.035900,0.293432,0.746667,0.727273,0.736842
7,0.011900,0.299482,0.744589,0.744589,0.744589



Best Model saved at: ./saved_models/ate_cat_GerMedBERT_medbert-512_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_GerMedBERT_medbert-512_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4267.56 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.98      0.96      0.97        54
    Krankenhaus       0.81      0.78      0.80       105
       Personal       0.93      0.88      0.90        16
 Pflegepersonal       0.93      0.93      0.93        15
anderer Service       0.65      0.43      0.52        35
 mediz. Service       0.85      0.63      0.73        63

      micro avg       0.85      0.75      0.80       288
      macro avg       0.86      0.77      0.81       288
   weighted avg       0.85      0.75      0.79       288

Precision Score: 0.8543307086614174
Recall Score: 0.7534722222222222
F1 Score: 0.8007380073800738
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5879.70 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4859.86 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training deepset/gbert-base for 7 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.23892,0.709845,0.53937,0.612975
2,0.327400,0.20022,0.76652,0.685039,0.723493
3,0.144800,0.23233,0.765217,0.692913,0.727273
4,0.057800,0.245363,0.75,0.732283,0.741036
5,0.034800,0.260847,0.736059,0.779528,0.75717
6,0.034800,0.271232,0.746154,0.76378,0.754864
7,0.011000,0.28112,0.753846,0.771654,0.762646



Best Model saved at: ./saved_models/ate_cat_deepset_gbert-base_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_deepset_gbert-base_42_42_7
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4417.75 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.88      0.93      0.90        55
    Krankenhaus       0.90      0.72      0.80       117
       Personal       0.75      0.69      0.72        13
 Pflegepersonal       0.85      0.94      0.89        18
anderer Service       0.81      0.49      0.61        35
 mediz. Service       0.76      0.78      0.77        77

      micro avg       0.84      0.76      0.80       315
      macro avg       0.83      0.76      0.78       315
   weighted avg       0.84      0.76      0.79       315

Precision Score: 0.8409893992932862
Recall Score: 0.7555555555555555
F1 Score: 0.7959866220735786
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
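
Each run ends with a "Best Model saved" message, but the log does not show which metric defines "best". The sketch below parses the deepset/gbert-base epoch table from this 7-epoch run and shows that the choice matters: the highest validation F1 and the lowest validation loss point to different epochs. Which criterion the training helper actually uses is not visible here, so this is an illustration only.

```python
import csv
import io

# Epoch table copied from the deepset/gbert-base 7-epoch run in this log.
LOG = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.23892,0.709845,0.53937,0.612975
2,0.327400,0.20022,0.76652,0.685039,0.723493
3,0.144800,0.23233,0.765217,0.692913,0.727273
4,0.057800,0.245363,0.75,0.732283,0.741036
5,0.034800,0.260847,0.736059,0.779528,0.75717
6,0.034800,0.271232,0.746154,0.76378,0.754864
7,0.011000,0.28112,0.753846,0.771654,0.762646
"""

rows = list(csv.DictReader(io.StringIO(LOG)))

# Two plausible definitions of the "best" checkpoint.
best_by_f1 = max(rows, key=lambda r: float(r["F1"]))
best_by_loss = min(rows, key=lambda r: float(r["Validation Loss"]))

print("best F1:  epoch", best_by_f1["Epoch"], "F1 =", best_by_f1["F1"])
print("min loss: epoch", best_by_loss["Epoch"], "loss =", best_by_loss["Validation Loss"])
```

In this table the validation loss is lowest at epoch 2 and rises afterwards while F1 keeps improving, so selecting by loss would keep a much earlier checkpoint than selecting by F1.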

In [9]:
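# Fine-tune every checkpoint in `models` for 8 epochs. rn1 and rn2 are the two
# random seeds reported in the training log; with save=True the log shows the
# best model and its tokenizer being written to ./saved_models and ./saved_tokenizers.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=8, save=True)
    print()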

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5857.90 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4763.63 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training google-bert/bert-base-german-cased for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.225879,0.746479,0.609195,0.670886
2,0.329900,0.215424,0.736842,0.643678,0.687117
3,0.131400,0.227239,0.741176,0.724138,0.732558
4,0.049800,0.249514,0.766129,0.727969,0.746562
5,0.026600,0.32689,0.751092,0.659004,0.702041
6,0.026600,0.331667,0.763052,0.727969,0.745098
7,0.006100,0.359499,0.741803,0.693487,0.716832
8,0.003100,0.359407,0.747967,0.704981,0.725838



Best Model saved at: ./saved_models/ate_cat_google-bert_bert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_google-bert_bert-base-german-cased_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4141.80 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.86      0.96      0.91        52
    Krankenhaus       0.91      0.67      0.77       119
       Personal       0.71      0.71      0.71        14
 Pflegepersonal       1.00      0.94      0.97        18
anderer Service       0.63      0.52      0.57        33
 mediz. Service       0.87      0.76      0.81        87

      micro avg       0.86      0.74      0.80       323
      macro avg       0.83      0.76      0.79       323
   weighted avg       0.86      0.74      0.79       323

Precision Score: 0.8571428571428571
Recall Score: 0.7430340557275542
F1 Score: 0.7960199004975125
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5814.70 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4774.58 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-cased for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.212539,0.662338,0.602362,0.630928
2,0.342500,0.190894,0.784141,0.700787,0.740125
3,0.149700,0.206935,0.749049,0.775591,0.762089
4,0.068900,0.233781,0.781481,0.830709,0.805344
5,0.034400,0.291955,0.76,0.748031,0.753968
6,0.034400,0.290576,0.791667,0.822835,0.80695
7,0.010400,0.323501,0.784,0.771654,0.777778
8,0.006300,0.326282,0.785992,0.795276,0.790607



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-cased_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4450.79 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.90      0.95      0.92        55
    Krankenhaus       0.88      0.60      0.71       117
       Personal       0.75      0.69      0.72        13
 Pflegepersonal       1.00      0.94      0.97        18
anderer Service       0.74      0.49      0.59        35
 mediz. Service       0.71      0.84      0.77        77

      micro avg       0.82      0.73      0.77       315
      macro avg       0.83      0.75      0.78       315
   weighted avg       0.83      0.73      0.76       315

Precision Score: 0.8185053380782918
Recall Score: 0.7301587301587301
F1 Score: 0.7718120805369129
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5508.41 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4601.02 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training dbmdz/bert-base-german-uncased for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.213718,0.761194,0.607143,0.675497
2,0.322300,0.20224,0.759657,0.702381,0.729897
3,0.146700,0.216141,0.808696,0.738095,0.771784
4,0.059800,0.231258,0.766129,0.753968,0.76
5,0.037400,0.259934,0.790514,0.793651,0.792079
6,0.037400,0.294663,0.776892,0.77381,0.775348
7,0.012700,0.317299,0.790984,0.765873,0.778226
8,0.006200,0.309103,0.773946,0.801587,0.787524



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-uncased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-uncased_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4319.44 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.95      0.89      0.92        63
    Krankenhaus       0.88      0.67      0.76       112
       Personal       0.62      0.57      0.59        14
 Pflegepersonal       1.00      0.84      0.91        19
anderer Service       0.58      0.32      0.42        34
 mediz. Service       0.79      0.77      0.78        74

      micro avg       0.84      0.71      0.77       316
      macro avg       0.80      0.68      0.73       316
   weighted avg       0.84      0.71      0.76       316

Precision Score: 0.8446969696969697
Recall Score: 0.7056962025316456
F1 Score: 0.7689655172413793
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5762.26 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4836.01 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training FacebookAI/xlm-roberta-base for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.256555,0.695473,0.584775,0.635338
2,0.404200,0.222601,0.759336,0.633218,0.690566
3,0.215300,0.241274,0.708609,0.740484,0.724196
4,0.134300,0.220489,0.749141,0.754325,0.751724
5,0.094800,0.244191,0.737374,0.757785,0.74744
6,0.094800,0.279263,0.779026,0.719723,0.748201
7,0.053800,0.294028,0.757895,0.747405,0.752613
8,0.035200,0.289657,0.750853,0.761246,0.756014



Best Model saved at: ./saved_models/ate_cat_FacebookAI_xlm-roberta-base_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_FacebookAI_xlm-roberta-base_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4650.56 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.90      0.92      0.91        59
    Krankenhaus       0.87      0.78      0.82       120
       Personal       0.85      0.73      0.79        15
 Pflegepersonal       0.86      1.00      0.92        24
anderer Service       0.55      0.36      0.43        45
 mediz. Service       0.70      0.67      0.69        83

      micro avg       0.80      0.74      0.77       346
      macro avg       0.79      0.74      0.76       346
   weighted avg       0.79      0.74      0.76       346

Precision Score: 0.8018867924528302
Recall Score: 0.7369942196531792
F1 Score: 0.7680722891566266
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6533.15 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5029.62 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_best for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.257883,0.772727,0.561321,0.650273
2,0.316900,0.197205,0.759358,0.669811,0.711779
3,0.167400,0.219021,0.734884,0.745283,0.740047
4,0.084200,0.238142,0.761468,0.783019,0.772093
5,0.052900,0.287468,0.700422,0.783019,0.739421
6,0.052900,0.325055,0.778846,0.764151,0.771429
7,0.019400,0.34575,0.754545,0.783019,0.768519
8,0.011200,0.355956,0.743243,0.778302,0.760369



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_best_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_best_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4804.31 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.96      0.94      0.95        52
    Krankenhaus       0.89      0.69      0.78       104
       Personal       0.80      0.67      0.73        12
 Pflegepersonal       0.93      1.00      0.97        14
anderer Service       0.86      0.40      0.55        30
 mediz. Service       0.69      0.75      0.72        67

      micro avg       0.84      0.73      0.79       279
      macro avg       0.86      0.74      0.78       279
   weighted avg       0.85      0.73      0.78       279

Precision Score: 0.8436213991769548
Recall Score: 0.7347670250896058
F1 Score: 0.7854406130268201
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6584.76 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5135.53 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_filtered_base_best for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.208711,0.757062,0.632075,0.688946
2,0.315700,0.14986,0.767442,0.778302,0.772834
3,0.163700,0.145058,0.795455,0.825472,0.810185
4,0.086000,0.174111,0.775229,0.79717,0.786047
5,0.054500,0.214192,0.77451,0.745283,0.759615
6,0.054500,0.227282,0.780488,0.754717,0.767386
7,0.024200,0.255428,0.792746,0.721698,0.755556
8,0.013000,0.240525,0.793103,0.759434,0.775904



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_filtered_base_best_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_filtered_base_best_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4754.28 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.85      0.98      0.91        52
    Krankenhaus       0.88      0.70      0.78       104
       Personal       0.64      0.75      0.69        12
 Pflegepersonal       0.87      0.93      0.90        14
anderer Service       0.62      0.33      0.43        30
 mediz. Service       0.71      0.84      0.77        67

      micro avg       0.79      0.76      0.78       279
      macro avg       0.76      0.76      0.75       279
   weighted avg       0.79      0.76      0.77       279

Precision Score: 0.7940074906367042
Recall Score: 0.7598566308243727
F1 Score: 0.7765567765567766
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6615.52 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5117.35 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training TUM/GottBERT_base_last for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.245245,0.736842,0.528302,0.615385
2,0.315100,0.190415,0.748744,0.70283,0.725061
3,0.165100,0.203098,0.75576,0.773585,0.764569
4,0.086600,0.243735,0.7343,0.716981,0.725537
5,0.056900,0.286926,0.684874,0.768868,0.724444
6,0.056900,0.332139,0.742574,0.707547,0.724638
7,0.025400,0.338537,0.725225,0.759434,0.741935
8,0.014000,0.350346,0.702222,0.745283,0.723112



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_last_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_last_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4821.75 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.93      0.98      0.95        52
    Krankenhaus       0.91      0.64      0.75       104
       Personal       0.69      0.75      0.72        12
 Pflegepersonal       0.93      1.00      0.97        14
anderer Service       0.65      0.37      0.47        30
 mediz. Service       0.68      0.84      0.75        67

      micro avg       0.81      0.75      0.78       279
      macro avg       0.80      0.76      0.77       279
   weighted avg       0.82      0.75      0.77       279

Precision Score: 0.8125
Recall Score: 0.7455197132616488
F1 Score: 0.777570093457944
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6696.92 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5353.66 examples/s]


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training distilbert/distilbert-base-german-cased for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.238099,0.64977,0.555118,0.598726
2,0.393000,0.221648,0.748837,0.633858,0.686567
3,0.168900,0.220918,0.783019,0.653543,0.712446
4,0.090800,0.25011,0.767544,0.688976,0.726141
5,0.059300,0.253673,0.757812,0.76378,0.760784
6,0.059300,0.260656,0.773109,0.724409,0.747967
7,0.031000,0.274106,0.756198,0.720472,0.737903
8,0.017900,0.277571,0.757202,0.724409,0.740443



Best Model saved at: ./saved_models/ate_cat_distilbert_distilbert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_distilbert_distilbert-base-german-cased_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4937.66 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.87      0.95      0.90        55
    Krankenhaus       0.95      0.59      0.73       117
       Personal       0.80      0.31      0.44        13
 Pflegepersonal       0.81      0.94      0.87        18
anderer Service       0.59      0.37      0.46        35
 mediz. Service       0.60      0.70      0.65        77

      micro avg       0.77      0.66      0.71       315
      macro avg       0.77      0.64      0.67       315
   weighted avg       0.79      0.66      0.70       315

Precision Score: 0.7712177121771218
Recall Score: 0.6634920634920635
F1 Score: 0.7133105802047781
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
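
The per-epoch tables can be read programmatically to see which epoch scored best on the validation set. The sketch below parses the distilbert 8-epoch table shown above with the standard `csv` module. Which criterion `ate_cat_model` actually uses to pick the "best model" is not visible in this output, so both validation F1 and validation loss are reported here as an assumption-free comparison.

```python
import csv
import io

# Per-epoch log copied from the distilbert 8-epoch run above.
log = """
Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.238099,0.64977,0.555118,0.598726
2,0.393000,0.221648,0.748837,0.633858,0.686567
3,0.168900,0.220918,0.783019,0.653543,0.712446
4,0.090800,0.25011,0.767544,0.688976,0.726141
5,0.059300,0.253673,0.757812,0.76378,0.760784
6,0.059300,0.260656,0.773109,0.724409,0.747967
7,0.031000,0.274106,0.756198,0.720472,0.737903
8,0.017900,0.277571,0.757202,0.724409,0.740443
""".strip()

rows = list(csv.DictReader(io.StringIO(log)))

# Epoch with the highest validation F1 and epoch with the lowest validation loss.
best_by_f1 = max(rows, key=lambda r: float(r["F1"]))
best_by_loss = min(rows, key=lambda r: float(r["Validation Loss"]))

print("Best F1:  epoch", best_by_f1["Epoch"], best_by_f1["F1"])
print("Min loss: epoch", best_by_loss["Epoch"], best_by_loss["Validation Loss"])
```

For this run the two criteria disagree: validation F1 peaks at epoch 5, while validation loss is lowest at epoch 3.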

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5779.32 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4727.37 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training GerMedBERT/medbert-512 for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.191477,0.678049,0.601732,0.637615
2,0.326500,0.196,0.725322,0.731602,0.728448
3,0.151600,0.259559,0.765625,0.636364,0.695035
4,0.067900,0.248632,0.746606,0.714286,0.730088
5,0.037300,0.30266,0.709163,0.770563,0.738589
6,0.037300,0.316533,0.757709,0.744589,0.751092
7,0.010300,0.331507,0.74569,0.748918,0.7473
8,0.005700,0.341789,0.746725,0.74026,0.743478



Best Model saved at: ./saved_models/ate_cat_GerMedBERT_medbert-512_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_GerMedBERT_medbert-512_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4305.62 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       1.00      0.96      0.98        54
    Krankenhaus       0.85      0.79      0.82       105
       Personal       0.88      0.94      0.91        16
 Pflegepersonal       1.00      0.93      0.97        15
anderer Service       0.61      0.49      0.54        35
 mediz. Service       0.86      0.59      0.70        63

      micro avg       0.87      0.76      0.81       288
      macro avg       0.87      0.78      0.82       288
   weighted avg       0.86      0.76      0.80       288

Precision Score: 0.8650793650793651
Recall Score: 0.7569444444444444
F1 Score: 0.8074074074074075
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
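
The `macro avg` and `weighted avg` rows of the classification report can be reproduced from the per-category rows: the macro average is the plain mean over categories, and the weighted average weights each category by its support. A small sketch using the F1 scores and supports copied from the GerMedBERT report above (since the inputs are already rounded to two decimals, the results only agree up to rounding):

```python
# (f1-score, support) per aspect category, copied from the report above.
report = {
    "Arzt": (0.98, 54),
    "Krankenhaus": (0.82, 105),
    "Personal": (0.91, 16),
    "Pflegepersonal": (0.97, 15),
    "anderer Service": (0.54, 35),
    "mediz. Service": (0.70, 63),
}

total_support = sum(support for _, support in report.values())

# Macro average: every category counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every category counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"support={total_support} macro={macro_f1:.2f} weighted={weighted_f1:.2f}")
```

The `micro avg` row cannot be recovered this way, because it is computed from the pooled counts of correct, predicted, and true entities rather than from the per-category scores.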

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5881.77 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4709.40 examples/s]
  trainer = Trainer(


['O' 'B-mediz. Service' 'B-Pflegepersonal' 'B-Arzt' 'B-anderer Service'
 'B-Krankenhaus' 'B-Personal']
{0: 0.1596439493228484, 1: 4.959959280624364, 2: 14.014381591562799, 3: 6.8239962651727355, 4: 10.546176046176047, 5: 5.983217355710193, 6: 18.317042606516292}
Training deepset/gbert-base for 8 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.227274,0.712329,0.614173,0.659619
2,0.316000,0.181655,0.752066,0.716535,0.733871
3,0.146400,0.209302,0.776062,0.791339,0.783626
4,0.057600,0.272403,0.762097,0.744094,0.752988
5,0.036100,0.265152,0.736059,0.779528,0.75717
6,0.036100,0.25891,0.785992,0.795276,0.790607
7,0.013400,0.280936,0.754579,0.811024,0.781784
8,0.006000,0.283672,0.7603,0.799213,0.779271



Best Model saved at: ./saved_models/ate_cat_deepset_gbert-base_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_deepset_gbert-base_42_42_8
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4442.06 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.93      0.93      0.93        55
    Krankenhaus       0.90      0.65      0.76       117
       Personal       0.70      0.54      0.61        13
 Pflegepersonal       0.90      1.00      0.95        18
anderer Service       0.71      0.57      0.63        35
 mediz. Service       0.85      0.73      0.78        77

      micro avg       0.87      0.72      0.79       315
      macro avg       0.83      0.74      0.78       315
   weighted avg       0.87      0.72      0.78       315

Precision Score: 0.8669201520912547
Recall Score: 0.7238095238095238
F1 Score: 0.7889273356401384
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
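
The weight dictionaries printed before each training run give `O` a weight far below 1 and the rare `B-` tags weights well above 1. The printed values are consistent with inverse-frequency ("balanced") weighting, `n_samples / (n_classes * count)`: the reciprocals `1 / (7 * w)` of the seven weights sum to roughly 1. How `ate_cat_model` computes them internally is not shown here, so the sketch below is an assumption; the helper name and the toy label list are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of the label)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy tag sequence, dominated by 'O' like the real data.
toy_labels = ["O"] * 16 + ["B-Arzt"] * 3 + ["B-Krankenhaus"] * 1
weights = balanced_class_weights(toy_labels)

for label, weight in weights.items():
    print(f"{label:15s} {weight:.4f}")
```

This is the same formula that scikit-learn's `compute_class_weight` applies when called with `class_weight="balanced"`. The frequent `O` tag is down-weighted and the rarest tag receives the largest weight, mirroring the pattern in the printed dictionaries.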

In [5]:
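# Fine-tune every checkpoint in `models` for 10 epochs with fixed random seeds
# (rn1, rn2) and save the best model and its tokenizer.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=10, save=True)
    print()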

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5233.86 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4842.75 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training google-bert/bert-base-german-cased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.223472,0.72549,0.56705,0.636559
2,0.317200,0.192512,0.716981,0.727969,0.722433
3,0.130000,0.222244,0.736059,0.758621,0.74717
4,0.052100,0.268146,0.724528,0.735632,0.730038
5,0.025300,0.390602,0.724138,0.643678,0.681542
6,0.025300,0.407243,0.746835,0.678161,0.710843
7,0.005500,0.43187,0.712551,0.67433,0.692913
8,0.002400,0.432829,0.708171,0.697318,0.702703
9,0.001400,0.446251,0.702381,0.678161,0.690058
10,0.000300,0.447266,0.705882,0.689655,0.697674



Best Model saved at: ./saved_models/ate_cat_google-bert_bert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_google-bert_bert-base-german-cased_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4502.22 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.78      0.94      0.85        52
    Krankenhaus       0.86      0.55      0.67       119
       Personal       0.67      0.43      0.52        14
 Pflegepersonal       0.94      0.94      0.94        18
anderer Service       0.57      0.52      0.54        33
 mediz. Service       0.77      0.78      0.78        87

      micro avg       0.78      0.69      0.73       323
      macro avg       0.76      0.69      0.72       323
   weighted avg       0.79      0.69      0.72       323

Precision Score: 0.7816901408450704
Recall Score: 0.6873065015479877
F1 Score: 0.7314662273476112
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
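
In the 10-epoch run above, the training loss keeps falling towards zero while the validation loss rises from epoch 3 onwards, which is the usual sign of overfitting. The notebook trains for a fixed number of epochs; the patience rule below is not part of it and is only an illustrative sketch of how such a curve could be cut short. The function name and the `patience` value are hypothetical.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Validation losses copied from the google-bert 10-epoch table above.
val_losses = [0.223472, 0.192512, 0.222244, 0.268146, 0.390602,
              0.407243, 0.43187, 0.432829, 0.446251, 0.447266]

stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=2)
print(f"stop after epoch {stop_epoch}, best validation loss at epoch {best_epoch}")
```

With a patience of 2, this rule would stop after epoch 4 and keep the epoch-2 weights. Validation F1 peaks one epoch later, at epoch 3, so the choice of monitored metric matters here as well.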

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6042.23 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4854.69 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training dbmdz/bert-base-german-cased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.230677,0.669643,0.590551,0.627615
2,0.349900,0.191492,0.765432,0.732283,0.748491
3,0.158700,0.213762,0.767442,0.779528,0.773438
4,0.069700,0.258253,0.737255,0.740157,0.738703
5,0.036300,0.312374,0.765957,0.708661,0.736196
6,0.036300,0.333826,0.768,0.755906,0.761905
7,0.012100,0.350229,0.773438,0.779528,0.776471
8,0.006500,0.363329,0.734848,0.76378,0.749035
9,0.003800,0.374735,0.724638,0.787402,0.754717
10,0.003200,0.366855,0.742424,0.771654,0.756757



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-cased_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4425.65 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.95      0.95      0.95        55
    Krankenhaus       0.86      0.62      0.72       117
       Personal       0.85      0.85      0.85        13
 Pflegepersonal       0.86      1.00      0.92        18
anderer Service       0.79      0.43      0.56        35
 mediz. Service       0.81      0.75      0.78        77

      micro avg       0.86      0.72      0.78       315
      macro avg       0.85      0.77      0.80       315
   weighted avg       0.85      0.72      0.77       315

Precision Score: 0.8566037735849057
Recall Score: 0.7206349206349206
F1 Score: 0.7827586206896552
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5668.77 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4784.02 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training dbmdz/bert-base-german-uncased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.2219,0.708333,0.607143,0.653846
2,0.327700,0.219372,0.775701,0.65873,0.712446
3,0.148800,0.227801,0.805195,0.738095,0.770186
4,0.066800,0.267928,0.749035,0.769841,0.759295
5,0.040600,0.300276,0.752,0.746032,0.749004
6,0.040600,0.309222,0.768293,0.75,0.759036
7,0.012200,0.353222,0.741313,0.761905,0.751468
8,0.006900,0.352309,0.747967,0.730159,0.738956
9,0.003900,0.36098,0.747036,0.75,0.748515
10,0.003100,0.364593,0.749035,0.769841,0.759295



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-uncased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-uncased_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4498.20 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.89      0.94      0.91        63
    Krankenhaus       0.96      0.45      0.61       112
       Personal       0.71      0.86      0.77        14
 Pflegepersonal       1.00      0.95      0.97        19
anderer Service       0.86      0.35      0.50        34
 mediz. Service       0.86      0.77      0.81        74

      micro avg       0.89      0.66      0.76       316
      macro avg       0.88      0.72      0.76       316
   weighted avg       0.90      0.66      0.74       316

Precision Score: 0.8927038626609443
Recall Score: 0.6582278481012658
F1 Score: 0.7577413479052822
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5788.52 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5123.54 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training FacebookAI/xlm-roberta-base for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.29948,0.6875,0.456747,0.548857
2,0.461100,0.227516,0.745387,0.698962,0.721429
3,0.241600,0.226161,0.714286,0.709343,0.711806
4,0.143800,0.236383,0.723684,0.761246,0.74199
5,0.103700,0.304264,0.726644,0.726644,0.726644
6,0.103700,0.285069,0.761404,0.750865,0.756098
7,0.054900,0.334264,0.711111,0.775087,0.741722
8,0.034100,0.303019,0.738411,0.771626,0.754653
9,0.025500,0.323546,0.754153,0.785467,0.769492
10,0.012200,0.327147,0.752508,0.778547,0.765306



Best Model saved at: ./saved_models/ate_cat_FacebookAI_xlm-roberta-base_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_FacebookAI_xlm-roberta-base_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4656.33 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.86      0.93      0.89        59
    Krankenhaus       0.84      0.80      0.82       120
       Personal       1.00      0.40      0.57        15
 Pflegepersonal       0.80      1.00      0.89        24
anderer Service       0.67      0.49      0.56        45
 mediz. Service       0.73      0.73      0.73        83

      micro avg       0.80      0.76      0.78       346
      macro avg       0.82      0.73      0.75       346
   weighted avg       0.80      0.76      0.77       346

Precision Score: 0.8
Recall Score: 0.7630057803468208
F1 Score: 0.7810650887573964
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', 'O', 'O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6521.92 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5123.98 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training TUM/GottBERT_base_best for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.268073,0.712575,0.561321,0.627968
2,0.323100,0.187483,0.78125,0.707547,0.742574
3,0.171600,0.223834,0.718062,0.768868,0.742597
4,0.083900,0.254508,0.70614,0.759434,0.731818
5,0.054600,0.360901,0.694444,0.707547,0.700935
6,0.054600,0.36679,0.732057,0.721698,0.726841
7,0.022300,0.407067,0.714286,0.660377,0.686275
8,0.013700,0.422206,0.721393,0.683962,0.702179
9,0.006400,0.414087,0.721698,0.721698,0.721698
10,0.003400,0.421214,0.714953,0.721698,0.71831



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_best_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_best_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4809.82 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.91      0.98      0.94        52
    Krankenhaus       0.85      0.69      0.76       104
       Personal       0.64      0.75      0.69        12
 Pflegepersonal       1.00      0.93      0.96        14
anderer Service       0.65      0.43      0.52        30
 mediz. Service       0.68      0.81      0.73        67

      micro avg       0.79      0.76      0.78       279
      macro avg       0.79      0.77      0.77       279
   weighted avg       0.80      0.76      0.77       279

Precision Score: 0.7910447761194029
Recall Score: 0.7598566308243727
F1 Score: 0.7751371115173674
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6583.47 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5089.93 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training TUM/GottBERT_filtered_base_best for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.211675,0.747126,0.613208,0.673575
2,0.332600,0.168921,0.778378,0.679245,0.725441
3,0.166400,0.15801,0.777273,0.806604,0.791667
4,0.078800,0.18692,0.8,0.716981,0.756219
5,0.056400,0.247739,0.729858,0.726415,0.728132
6,0.056400,0.250029,0.781553,0.759434,0.770335
7,0.025000,0.243337,0.732143,0.773585,0.752294
8,0.011300,0.276199,0.76555,0.754717,0.760095
9,0.005600,0.291894,0.75,0.75,0.75
10,0.004100,0.290057,0.759434,0.759434,0.759434



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_filtered_base_best_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_filtered_base_best_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4707.88 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.94      0.98      0.96        52
    Krankenhaus       0.90      0.66      0.76       104
       Personal       0.64      0.75      0.69        12
 Pflegepersonal       1.00      0.93      0.96        14
anderer Service       0.77      0.33      0.47        30
 mediz. Service       0.69      0.81      0.74        67

      micro avg       0.83      0.74      0.78       279
      macro avg       0.82      0.74      0.76       279
   weighted avg       0.84      0.74      0.77       279

Precision Score: 0.8273092369477911
Recall Score: 0.7383512544802867
F1 Score: 0.7803030303030303
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6494.20 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5042.31 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training TUM/GottBERT_base_last for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.260157,0.721893,0.575472,0.64042
2,0.323000,0.16912,0.777778,0.693396,0.733167
3,0.178600,0.241927,0.687204,0.683962,0.685579
4,0.089400,0.23926,0.767327,0.731132,0.748792
5,0.058300,0.326706,0.706897,0.773585,0.738739
6,0.058300,0.31607,0.79397,0.745283,0.768856
7,0.024600,0.327118,0.75814,0.768868,0.763466
8,0.014100,0.383991,0.781095,0.740566,0.760291
9,0.009200,0.349479,0.762557,0.787736,0.774942
10,0.003800,0.352938,0.789474,0.778302,0.783848



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_last_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_last_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4660.19 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.94      0.98      0.96        52
    Krankenhaus       0.92      0.67      0.78       104
       Personal       0.73      0.67      0.70        12
 Pflegepersonal       1.00      1.00      1.00        14
anderer Service       0.68      0.57      0.62        30
 mediz. Service       0.66      0.75      0.70        67

      micro avg       0.82      0.75      0.79       279
      macro avg       0.82      0.77      0.79       279
   weighted avg       0.83      0.75      0.78       279

Precision Score: 0.8203125
Recall Score: 0.7526881720430108
F1 Score: 0.7850467289719626
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O', 'O', '

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6498.52 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5274.93 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training distilbert/distilbert-base-german-cased for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.244423,0.65625,0.57874,0.615063
2,0.392400,0.225741,0.721739,0.653543,0.68595
3,0.172200,0.256577,0.776744,0.65748,0.712154
4,0.087700,0.265688,0.735294,0.688976,0.711382
5,0.055900,0.295494,0.734127,0.728346,0.731225
6,0.055900,0.312017,0.77686,0.740157,0.758065
7,0.024400,0.318798,0.750973,0.759843,0.755382
8,0.014700,0.341944,0.759657,0.69685,0.726899
9,0.011500,0.340218,0.761317,0.728346,0.744467
10,0.005500,0.345061,0.765432,0.732283,0.748491



Best Model saved at: ./saved_models/ate_cat_distilbert_distilbert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_distilbert_distilbert-base-german-cased_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4946.80 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.90      0.95      0.92        55
    Krankenhaus       0.93      0.59      0.72       117
       Personal       0.67      0.46      0.55        13
 Pflegepersonal       0.89      0.94      0.92        18
anderer Service       0.68      0.37      0.48        35
 mediz. Service       0.60      0.69      0.64        77

      micro avg       0.79      0.67      0.72       315
      macro avg       0.78      0.67      0.71       315
   weighted avg       0.80      0.67      0.71       315

Precision Score: 0.7865168539325843
Recall Score: 0.6666666666666666
F1 Score: 0.7216494845360824
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5795.66 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4612.19 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training GerMedBERT/medbert-512 for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.195355,0.694581,0.61039,0.64977
2,0.318600,0.175384,0.733624,0.727273,0.730435
3,0.150200,0.266162,0.787234,0.640693,0.706444
4,0.072100,0.25451,0.734513,0.718615,0.726477
5,0.039200,0.280344,0.699588,0.735931,0.7173
6,0.039200,0.301701,0.775701,0.718615,0.746067
7,0.010100,0.332243,0.729258,0.722944,0.726087
8,0.007100,0.349217,0.714894,0.727273,0.72103
9,0.001700,0.357856,0.748837,0.69697,0.721973
10,0.001000,0.355737,0.733333,0.714286,0.723684



Best Model saved at: ./saved_models/ate_cat_GerMedBERT_medbert-512_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_GerMedBERT_medbert-512_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4296.15 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.95      0.98      0.96        54
    Krankenhaus       0.96      0.70      0.81       105
       Personal       0.79      0.69      0.73        16
 Pflegepersonal       0.93      0.87      0.90        15
anderer Service       0.60      0.43      0.50        35
 mediz. Service       0.86      0.60      0.71        63

      micro avg       0.89      0.70      0.79       288
      macro avg       0.85      0.71      0.77       288
   weighted avg       0.88      0.70      0.78       288

Precision Score: 0.8864628820960698
Recall Score: 0.7048611111111112
F1 Score: 0.7852998065764023
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5914.12 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4736.25 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training deepset/gbert-base for 10 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.212766,0.789216,0.633858,0.703057
2,0.324100,0.201206,0.802817,0.673228,0.732334
3,0.147400,0.207261,0.7875,0.744094,0.765182
4,0.062800,0.262745,0.814815,0.692913,0.748936
5,0.038500,0.276008,0.7393,0.748031,0.74364
6,0.038500,0.273169,0.795745,0.73622,0.764826
7,0.015600,0.308614,0.750958,0.771654,0.761165
8,0.007800,0.317918,0.772549,0.775591,0.774067
9,0.004800,0.330287,0.765625,0.771654,0.768627
10,0.001700,0.328902,0.767442,0.779528,0.773438



Best Model saved at: ./saved_models/ate_cat_deepset_gbert-base_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_deepset_gbert-base_42_42_10
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4431.89 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.86      0.93      0.89        55
    Krankenhaus       0.92      0.72      0.81       117
       Personal       0.69      0.69      0.69        13
 Pflegepersonal       0.85      0.94      0.89        18
anderer Service       0.78      0.51      0.62        35
 mediz. Service       0.76      0.74      0.75        77

      micro avg       0.84      0.75      0.79       315
      macro avg       0.81      0.76      0.78       315
   weighted avg       0.84      0.75      0.79       315

Precision Score: 0.8398576512455516
Recall Score: 0.7492063492063492
F1 Score: 0.7919463087248322
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
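
The class-weight dictionary printed before every run (about 0.16 for the frequent `O` tag, between roughly 5 and 18 for the aspect tags) looks like inverse-frequency ("balanced") weighting, where each weight is `n_samples / (n_classes * count)`. The source of `ate_cat_model` is not shown here, so this is an assumption. The sketch below checks it against the logged values and shows one way such weights could be computed. The helper `balanced_class_weights` and the toy label list are hypothetical and not part of the notebook.

```python
from collections import Counter

# Class weights copied from the log output above (label id -> weight).
logged_weights = {
    0: 5.983217355710193,
    1: 10.546176046176047,
    2: 14.014381591562799,
    3: 18.317042606516292,
    4: 6.8239962651727355,
    5: 4.959959280624364,
    6: 0.1596439493228484,
}

# Under balanced weighting, w_c = n_samples / (n_classes * count_c),
# so the reciprocals 1 / w_c must add up to n_classes.
reciprocal_sum = sum(1.0 / w for w in logged_weights.values())
print(f"sum of reciprocals: {reciprocal_sum:.4f} (classes: {len(logged_weights)})")


def balanced_class_weights(label_ids):
    """Hypothetical helper: inverse-frequency weights for a flat list of label ids."""
    counts = Counter(label_ids)
    n_samples = len(label_ids)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy data (assumption): id 6 plays the role of the frequent 'O' tag.
toy_label_ids = [6] * 90 + [0] * 6 + [4] * 4
toy_weights = balanced_class_weights(toy_label_ids)
print(toy_weights)
```

The reciprocals of the logged weights sum to approximately 7, the number of labels, which is what balanced weighting would produce. This does not prove that the training function computes its weights this way.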

In [6]:
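# Fine-tune every checkpoint in `models` with ate_cat_model for 12 epochs,
# using fixed random seeds (rn1, rn2) and saving the best model and tokenizer.
for model in models:
    print(f'training and results for {model}:')
    ate_cat_model(data, model, rn1=42, rn2=42, epochs=12, save=True)
    print()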

training and results for google-bert/bert-base-german-cased:


Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5831.33 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4566.35 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training google-bert/bert-base-german-cased for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.21118,0.723502,0.601533,0.656904
2,0.332900,0.194919,0.74359,0.666667,0.70303
3,0.132700,0.260497,0.737903,0.701149,0.719057
4,0.052000,0.265018,0.763158,0.666667,0.711656
5,0.027700,0.351346,0.710744,0.659004,0.683897
6,0.027700,0.384536,0.727273,0.643678,0.682927
7,0.006600,0.403812,0.721116,0.693487,0.707031
8,0.002500,0.423429,0.694779,0.662835,0.678431
9,0.000900,0.410853,0.722222,0.697318,0.709552
10,0.000900,0.44168,0.702041,0.659004,0.679842



Best Model saved at: ./saved_models/ate_cat_google-bert_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_google-bert_bert-base-german-cased_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4378.46 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.82      0.98      0.89        52
    Krankenhaus       0.84      0.59      0.69       119
       Personal       0.73      0.57      0.64        14
 Pflegepersonal       0.94      0.94      0.94        18
anderer Service       0.70      0.42      0.53        33
 mediz. Service       0.81      0.79      0.80        87

      micro avg       0.82      0.71      0.76       323
      macro avg       0.81      0.72      0.75       323
   weighted avg       0.82      0.71      0.75       323

Precision Score: 0.8207885304659498
Recall Score: 0.7089783281733746
F1 Score: 0.7607973421926909
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5818.50 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4800.66 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training dbmdz/bert-base-german-cased for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.248845,0.741784,0.622047,0.67666
2,0.341100,0.234615,0.768182,0.665354,0.71308
3,0.156100,0.197918,0.814346,0.759843,0.786151
4,0.071200,0.256327,0.829787,0.767717,0.797546
5,0.035100,0.298312,0.792829,0.783465,0.788119
6,0.035100,0.33176,0.809129,0.767717,0.787879
7,0.009900,0.331911,0.783133,0.767717,0.775348
8,0.004700,0.409032,0.789474,0.708661,0.746888
9,0.003100,0.335819,0.778626,0.80315,0.790698
10,0.001400,0.343177,0.762963,0.811024,0.78626



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-cased_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4442.66 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.98      0.93      0.95        55
    Krankenhaus       0.89      0.54      0.67       117
       Personal       0.67      0.92      0.77        13
 Pflegepersonal       1.00      0.89      0.94        18
anderer Service       0.75      0.43      0.55        35
 mediz. Service       0.69      0.74      0.71        77

      micro avg       0.82      0.68      0.74       315
      macro avg       0.83      0.74      0.77       315
   weighted avg       0.84      0.68      0.74       315

Precision Score: 0.823076923076923
Recall Score: 0.6793650793650794
F1 Score: 0.7443478260869565
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5580.67 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4587.46 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training dbmdz/bert-base-german-uncased for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.241119,0.704225,0.595238,0.645161
2,0.325700,0.243878,0.77512,0.642857,0.70282
3,0.149900,0.276914,0.786364,0.686508,0.733051
4,0.061700,0.274966,0.758893,0.761905,0.760396
5,0.032200,0.338818,0.70438,0.765873,0.73384
6,0.032200,0.332746,0.808511,0.753968,0.780287
7,0.012500,0.382021,0.728682,0.746032,0.737255
8,0.006800,0.443337,0.719844,0.734127,0.726916
9,0.002800,0.447263,0.736,0.730159,0.733068
10,0.002700,0.442026,0.726923,0.75,0.738281



Best Model saved at: ./saved_models/ate_cat_dbmdz_bert-base-german-uncased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_dbmdz_bert-base-german-uncased_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4422.04 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.87      0.94      0.90        63
    Krankenhaus       0.93      0.46      0.61       112
       Personal       0.75      0.64      0.69        14
 Pflegepersonal       1.00      1.00      1.00        19
anderer Service       0.56      0.41      0.47        34
 mediz. Service       0.83      0.78      0.81        74

      micro avg       0.84      0.66      0.74       316
      macro avg       0.82      0.71      0.75       316
   weighted avg       0.85      0.66      0.73       316

Precision Score: 0.8433734939759037
Recall Score: 0.6645569620253164
F1 Score: 0.7433628318584071
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5844.21 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5002.12 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training FacebookAI/xlm-roberta-base for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.262373,0.737327,0.553633,0.632411
2,0.385500,0.197599,0.739464,0.66782,0.701818
3,0.216600,0.196081,0.765886,0.792388,0.778912
4,0.132200,0.21559,0.765125,0.743945,0.754386
5,0.108600,0.278568,0.697987,0.719723,0.708688
6,0.108600,0.261659,0.759398,0.698962,0.727928
7,0.057300,0.324861,0.711475,0.750865,0.73064
8,0.034200,0.3587,0.722408,0.747405,0.734694
9,0.018800,0.368425,0.748201,0.719723,0.733686
10,0.013300,0.345504,0.725753,0.750865,0.738095



Best Model saved at: ./saved_models/ate_cat_FacebookAI_xlm-roberta-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_FacebookAI_xlm-roberta-base_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4610.76 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.79      0.95      0.86        59
    Krankenhaus       0.75      0.65      0.70       120
       Personal       0.71      0.67      0.69        15
 Pflegepersonal       0.77      0.96      0.85        24
anderer Service       0.54      0.29      0.38        45
 mediz. Service       0.65      0.76      0.70        83

      micro avg       0.71      0.70      0.71       346
      macro avg       0.70      0.71      0.70       346
   weighted avg       0.71      0.70      0.69       346

Precision Score: 0.7147058823529412
Recall Score: 0.7023121387283237
F1 Score: 0.7084548104956269
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6535.80 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5108.41 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training TUM/GottBERT_base_best for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.231569,0.708333,0.561321,0.626316
2,0.325200,0.199338,0.75,0.707547,0.728155
3,0.170700,0.241882,0.752381,0.745283,0.748815
4,0.088600,0.216694,0.756881,0.778302,0.767442
5,0.056300,0.284586,0.765766,0.801887,0.78341
6,0.056300,0.342332,0.764103,0.70283,0.732187
7,0.022800,0.396987,0.760204,0.70283,0.730392
8,0.014000,0.392228,0.772277,0.735849,0.753623
9,0.009800,0.39968,0.75576,0.773585,0.764569
10,0.009300,0.364672,0.743119,0.764151,0.753488



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_best_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4788.23 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.89      0.98      0.94        52
    Krankenhaus       0.86      0.70      0.77       104
       Personal       0.75      0.75      0.75        12
 Pflegepersonal       1.00      1.00      1.00        14
anderer Service       0.69      0.60      0.64        30
 mediz. Service       0.68      0.84      0.75        67

      micro avg       0.80      0.79      0.80       279
      macro avg       0.81      0.81      0.81       279
   weighted avg       0.81      0.79      0.79       279

Precision Score: 0.8007246376811594
Recall Score: 0.7921146953405018
F1 Score: 0.7963963963963964
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6583.39 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5072.07 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training TUM/GottBERT_filtered_base_best for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.203168,0.741379,0.608491,0.668394
2,0.328000,0.161616,0.776744,0.787736,0.782201
3,0.164200,0.184541,0.757709,0.811321,0.783599
4,0.088800,0.198257,0.709957,0.773585,0.740406
5,0.048200,0.276285,0.766667,0.759434,0.763033
6,0.048200,0.220852,0.81,0.764151,0.786408
7,0.021600,0.312418,0.736585,0.712264,0.724221
8,0.009300,0.289894,0.765766,0.801887,0.78341
9,0.004100,0.325752,0.786408,0.764151,0.77512
10,0.001500,0.338122,0.783654,0.768868,0.77619



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_filtered_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_filtered_base_best_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4734.08 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.96      0.96      0.96        52
    Krankenhaus       0.95      0.69      0.80       104
       Personal       0.64      0.75      0.69        12
 Pflegepersonal       1.00      0.93      0.96        14
anderer Service       0.69      0.37      0.48        30
 mediz. Service       0.76      0.75      0.75        67

      micro avg       0.86      0.73      0.79       279
      macro avg       0.83      0.74      0.77       279
   weighted avg       0.87      0.73      0.79       279

Precision Score: 0.8649789029535865
Recall Score: 0.7347670250896058
F1 Score: 0.7945736434108528
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6658.67 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5042.43 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training TUM/GottBERT_base_last for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.232755,0.748387,0.54717,0.632153
2,0.329900,0.228068,0.757062,0.632075,0.688946
3,0.161700,0.222531,0.743119,0.764151,0.753488
4,0.080800,0.223668,0.743719,0.698113,0.720195
5,0.057700,0.27718,0.696721,0.801887,0.745614
6,0.057700,0.313491,0.733333,0.778302,0.755149
7,0.021700,0.348781,0.718182,0.745283,0.731481
8,0.008800,0.354017,0.722467,0.773585,0.747153
9,0.008100,0.38001,0.723502,0.740566,0.731935
10,0.004700,0.386166,0.7343,0.716981,0.725537



Best Model saved at: ./saved_models/ate_cat_TUM_GottBERT_base_last_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_TUM_GottBERT_base_last_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4692.18 examples/s]


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.98      0.98      0.98        52
    Krankenhaus       0.90      0.59      0.71       104
       Personal       0.77      0.83      0.80        12
 Pflegepersonal       1.00      1.00      1.00        14
anderer Service       0.58      0.37      0.45        30
 mediz. Service       0.61      0.78      0.68        67

      micro avg       0.79      0.71      0.75       279
      macro avg       0.81      0.76      0.77       279
   weighted avg       0.81      0.71      0.74       279

Precision Score: 0.7928286852589641
Recall Score: 0.7132616487455197
F1 Score: 0.750943396226415
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O'

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 6598.25 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 5155.40 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training distilbert/distilbert-base-german-cased for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.225392,0.696682,0.57874,0.632258
2,0.394000,0.222005,0.707424,0.637795,0.670807
3,0.171900,0.22577,0.795349,0.673228,0.729211
4,0.084700,0.254331,0.765217,0.692913,0.727273
5,0.049600,0.262513,0.769231,0.748031,0.758483
6,0.049600,0.293082,0.77381,0.767717,0.770751
7,0.023600,0.308278,0.751938,0.76378,0.757812
8,0.010700,0.326156,0.776,0.76378,0.769841
9,0.005500,0.342221,0.787234,0.728346,0.756646
10,0.004600,0.347006,0.784553,0.759843,0.772



Best Model saved at: ./saved_models/ate_cat_distilbert_distilbert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_distilbert_distilbert-base-german-cased_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4805.93 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.86      0.89      0.88        55
    Krankenhaus       0.95      0.60      0.73       117
       Personal       0.67      0.31      0.42        13
 Pflegepersonal       0.78      1.00      0.88        18
anderer Service       0.70      0.40      0.51        35
 mediz. Service       0.68      0.71      0.70        77

      micro avg       0.80      0.67      0.73       315
      macro avg       0.77      0.65      0.69       315
   weighted avg       0.82      0.67      0.72       315

Precision Score: 0.8045977011494253
Recall Score: 0.6666666666666666
F1 Score: 0.7291666666666666
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5818.24 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4692.03 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training GerMedBERT/medbert-512 for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.199582,0.695876,0.584416,0.635294
2,0.322200,0.168744,0.718615,0.718615,0.718615
3,0.151900,0.272795,0.776536,0.601732,0.678049
4,0.066400,0.233735,0.703704,0.74026,0.721519
5,0.036100,0.284757,0.721239,0.705628,0.713348
6,0.036100,0.309507,0.737557,0.705628,0.721239
7,0.014600,0.402408,0.748815,0.683983,0.714932
8,0.004000,0.352384,0.733624,0.727273,0.730435
9,0.001600,0.418202,0.729858,0.666667,0.696833
10,0.001700,0.417191,0.7277,0.670996,0.698198



Best Model saved at: ./saved_models/ate_cat_GerMedBERT_medbert-512_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_GerMedBERT_medbert-512_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4267.82 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.96      0.96      0.96        54
    Krankenhaus       0.85      0.75      0.80       105
       Personal       0.69      0.69      0.69        16
 Pflegepersonal       0.88      0.93      0.90        15
anderer Service       0.59      0.49      0.53        35
 mediz. Service       0.75      0.63      0.69        63

      micro avg       0.82      0.74      0.78       288
      macro avg       0.79      0.74      0.76       288
   weighted avg       0.81      0.74      0.77       288

Precision Score: 0.8160919540229885
Recall Score: 0.7395833333333334
F1 Score: 0.7759562841530055
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForTokenClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Mapping the data


Map: 100%|██████████| 808/808 [00:00<00:00, 5843.86 examples/s]
Map: 100%|██████████| 101/101 [00:00<00:00, 4770.01 examples/s]
  trainer = Trainer(


['B-Krankenhaus' 'B-anderer Service' 'B-Pflegepersonal' 'B-Personal'
 'B-Arzt' 'B-mediz. Service' 'O']
{0: 5.983217355710193, 1: 10.546176046176047, 2: 14.014381591562799, 3: 18.317042606516292, 4: 6.8239962651727355, 5: 4.959959280624364, 6: 0.1596439493228484}
Training deepset/gbert-base for 12 epochs with random seeds 42, 42



Epoch,Training Loss,Validation Loss,Precision,Recall,F1
1,No log,0.227519,0.702326,0.594488,0.643923
2,0.329600,0.194476,0.741525,0.688976,0.714286
3,0.150800,0.210085,0.790514,0.787402,0.788955
4,0.056400,0.24685,0.782609,0.779528,0.781065
5,0.032300,0.30767,0.719858,0.799213,0.757463
6,0.032300,0.264187,0.785185,0.834646,0.80916
7,0.014000,0.26468,0.784906,0.818898,0.801541
8,0.006200,0.302801,0.754647,0.799213,0.776291
9,0.003700,0.320325,0.759857,0.834646,0.795497
10,0.001900,0.30108,0.790875,0.818898,0.804642



Best Model saved at: ./saved_models/ate_cat_deepset_gbert-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/ate_cat_deepset_gbert-base_42_42_12
Evaluating on test data



Map: 100%|██████████| 102/102 [00:00<00:00, 4420.30 examples/s]


Unique predicted label IDs: {0, 2, 3, 4, 5, 6, 7}
Expected label IDs: {0, 1, 2, 3, 4, 5, 6, 7}
Classification Report:
                 precision    recall  f1-score   support

           Arzt       0.87      0.95      0.90        55
    Krankenhaus       0.89      0.79      0.84       117
       Personal       0.75      0.69      0.72        13
 Pflegepersonal       0.89      0.94      0.92        18
anderer Service       0.70      0.46      0.55        35
 mediz. Service       0.82      0.79      0.81        77

      micro avg       0.85      0.79      0.82       315
      macro avg       0.82      0.77      0.79       315
   weighted avg       0.84      0.79      0.81       315

Precision Score: 0.8464163822525598
Recall Score: 0.7873015873015873
F1 Score: 0.8157894736842106
Tokens     : ['Nun', 'sind', '3', 'Jahre', 'seit', 'der', 'Operation', 'vergangen', 'und', 'es', 'mir', 'gut', '.']
True Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Pred Labels: ['O
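
The per-model summary scores of the 12-epoch runs are scattered across the output above. As a reading aid, the sketch below collects the logged test precision and recall for each model, recomputes F1 as their harmonic mean, and ranks the models by it. The numbers are copied from the log. The helper `f1_from_pr` and the dictionary layout are illustrative assumptions, not part of the notebook.

```python
# Test precision and recall per model, copied from the 12-epoch logs above.
test_scores = {
    "google-bert/bert-base-german-cased": (0.8207885304659498, 0.7089783281733746),
    "dbmdz/bert-base-german-cased": (0.823076923076923, 0.6793650793650794),
    "dbmdz/bert-base-german-uncased": (0.8433734939759037, 0.6645569620253164),
    "FacebookAI/xlm-roberta-base": (0.7147058823529412, 0.7023121387283237),
    "TUM/GottBERT_base_best": (0.8007246376811594, 0.7921146953405018),
    "TUM/GottBERT_filtered_base_best": (0.8649789029535865, 0.7347670250896058),
    "TUM/GottBERT_base_last": (0.7928286852589641, 0.7132616487455197),
    "distilbert/distilbert-base-german-cased": (0.8045977011494253, 0.6666666666666666),
    "GerMedBERT/medbert-512": (0.8160919540229885, 0.7395833333333334),
    "deepset/gbert-base": (0.8464163822525598, 0.7873015873015873),
}


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rank models by recomputed F1, best first.
ranked = sorted(
    ((name, f1_from_pr(p, r)) for name, (p, r) in test_scores.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, f1 in ranked:
    print(f"{name:45s} F1 = {f1:.4f}")
```

The recomputed values should match the `F1 Score` lines in the log. Note that the support differs between models (from 279 to 346 entities in the reports above), so the ranking is a rough comparison, not an exact one.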