## ABSA Training and Fine-Tuning of Large Language Models on German Hospital Reviews

In [1]:
import pandas as pd
import numpy as np
import torch

# need the sys package to load modules from another directory:
import sys
sys.path.append('../')
from functions.absa_model_train import *



In [2]:
print("Is CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPU device name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")

Is CUDA available: True
CUDA version: 12.6
GPU device name: NVIDIA A30


In [3]:
# Load the two labelled review datasets into DataFrames
data_ano = pd.read_csv("./data/hospitalABSA/patient_review_labels_absa_ano.csv")
data = pd.read_csv("./data/hospitalABSA/patient_review_labels_absa.csv")

In [4]:
models = ["google-bert/bert-base-german-cased", "dbmdz/bert-base-german-cased", "dbmdz/bert-base-german-uncased",
          "FacebookAI/xlm-roberta-base", "TUM/GottBERT_base_best", "TUM/GottBERT_filtered_base_best", "TUM/GottBERT_base_last",
          "distilbert/distilbert-base-german-cased", "GerMedBERT/medbert-512", "deepset/gbert-base"]

### Train ABSA models with a new training/validation/test split

Each model is trained for 5, 6, 7, 8, 10, 12, and 20 epochs.
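The epoch budgets could also be covered by one nested loop rather than one cell per budget. The sketch below is hypothetical: `run_sweep` and its `train_fn` parameter are not part of this notebook, and `train_fn` stands in for `absa_model`, whose keyword arguments are assumed from the calls shown here.

```python
from typing import Callable, Iterable

EPOCH_BUDGETS = [5, 6, 7, 8, 10, 12, 20]  # epoch budgets named in the text


def run_sweep(train_fn: Callable[..., object],
              data: object,
              models: Iterable[str],
              epoch_budgets: Iterable[int] = EPOCH_BUDGETS,
              seed: int = 42) -> list:
    """Call train_fn once per (model, epochs) pair; return the pairs that ran."""
    completed = []
    model_names = list(models)  # materialize so the inner loop can repeat
    for epochs in epoch_budgets:
        for model_name in model_names:
            print(f"training and results for {model_name} ({epochs} epochs):")
            # Keyword names mirror the absa_model calls in this notebook (assumed).
            train_fn(data, model_name, rn1=seed, rn2=seed, epochs=epochs, save=True)
            completed.append((model_name, epochs))
    return completed
```

With the notebook's own objects this would be called as `run_sweep(absa_model, data, models)`.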

In [5]:
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=5, save=True)
    print()

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])
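
The printed class weights are consistent with inverse-frequency ("balanced") weighting, `n_samples / (n_classes * class_count)`, applied to the training label counts. Whether `absa_model` computes them exactly this way is an assumption; the hypothetical helper `balanced_class_weights` below only reproduces the numbers.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Training label counts as printed above.
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}
weights = balanced_class_weights(train_counts)
print({label: round(w, 4) for label, w in weights.items()})
# {'negativ': 1.0957, 'neutral': 1.3467, 'positiv': 0.7436}
```

Rare classes get weights above 1 and the majority class a weight below 1, so errors on `neutral` reviews cost the most in a weighted loss.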


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2443.78 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3916.08 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3849.42 examples/s]


Training results for google-bert/bert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8383,0.911232,0.775194,0.786194,0.775194,0.776692,"{0: 51, 1: 25, 2: 53}"
2,0.4768,0.932275,0.790698,0.799154,0.790698,0.788306,"{0: 33, 1: 36, 2: 60}"
3,0.4215,1.166174,0.782946,0.78157,0.782946,0.78201,"{0: 40, 1: 27, 2: 62}"
4,0.2375,1.274274,0.782946,0.784735,0.782946,0.783112,"{0: 39, 1: 30, 2: 60}"
5,0.132,1.326507,0.767442,0.777174,0.767442,0.769552,"{0: 38, 1: 34, 2: 57}"
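
In this run the training loss falls every epoch while the validation loss rises after the first, so the last epoch is not necessarily the best checkpoint. As a sketch, the log above (transcribed here as CSV) can be scanned for the epoch with the highest validation F1 and the one with the lowest validation loss. Which criterion `absa_model` uses to choose the saved "best" model is not shown in this notebook, and the helper `best_epoch` is hypothetical.

```python
import csv
import io

LOG = '''Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8383,0.911232,0.775194,0.786194,0.775194,0.776692,"{0: 51, 1: 25, 2: 53}"
2,0.4768,0.932275,0.790698,0.799154,0.790698,0.788306,"{0: 33, 1: 36, 2: 60}"
3,0.4215,1.166174,0.782946,0.78157,0.782946,0.78201,"{0: 40, 1: 27, 2: 62}"
4,0.2375,1.274274,0.782946,0.784735,0.782946,0.783112,"{0: 39, 1: 30, 2: 60}"
5,0.132,1.326507,0.767442,0.777174,0.767442,0.769552,"{0: 38, 1: 34, 2: 57}"
'''


def best_epoch(log_text: str, column: str, mode: str) -> int:
    """Return the epoch whose value in `column` is largest ('max') or smallest ('min')."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    # csv handles the quoted Class Distribution field, which contains commas.
    rows = list(csv.DictReader(io.StringIO(log_text)))
    pick = max if mode == "max" else min
    return int(pick(rows, key=lambda row: float(row[column]))["Epoch"])


print(best_epoch(LOG, "F1", "max"))               # 2
print(best_epoch(LOG, "Validation Loss", "min"))  # 1
```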



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_5
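
The two paths printed above suggest a naming scheme: the prefix `absa_`, the model id with `/` replaced by `_`, then both random seeds and the epoch count. The helper `saved_model_dir` below is hypothetical and inferred only from these printed paths; it may help locate a run's directory later.

```python
def saved_model_dir(model_id: str, rn1: int, rn2: int, epochs: int,
                    root: str = "./saved_models") -> str:
    """Rebuild a run directory from the pattern seen in the printed paths (assumed)."""
    run_name = f"absa_{model_id.replace('/', '_')}_{rn1}_{rn2}_{epochs}"
    return f"{root}/{run_name}"


print(saved_model_dir("google-bert/bert-base-german-cased", 42, 42, 5))
# ./saved_models/absa_google-bert_bert-base-german-cased_42_42_5
```

Passing `root="./saved_tokenizers"` gives the matching tokenizer directory.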
Evaluation results for google-bert/bert-base-german-cased with 5 epochs and random seeds: 42, 42



{'eval_loss': 0.9676937460899353, 'eval_accuracy': 0.7647058823529411, 'eval_precision': 0.7889484344670435, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.7680431024084274, 'eval_class_distribution': {0: 41, 1: 44, 2: 68}, 'eval_runtime': 2.3765, 'eval_samples_per_second': 64.38, 'eval_steps_per_second': 32.4, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.75      0.83      0.79        36
     Neutral       0.60      0.88      0.72        33
     Positiv       0.92      0.71      0.81        84

    accuracy                           0.78       153
   macro avg       0.76      0.81      0.77       153
weighted avg       0.81      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 40, 1: 48, 2: 65}
Negativ Precision Score: 0.75
Negativ Recall Score: 0.8333333333333334
Negativ F1 Score: 0.7894736842105263

Neutral Precision Score: 0.6041666666666666
Neutral Recall Score: 0.878787878787878
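
The per-class scores follow from three counts per class: true positives, number of predictions, and number of true labels (support). In the sketch below the true-positive counts (30 and 29) are not printed by the notebook; they are back-derived from recall times support, so treat them as an assumption. `prf` is a hypothetical helper.

```python
def prf(tp: int, n_pred: int, n_true: int) -> tuple:
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    # 2*TP / (predicted + true) equals the harmonic mean of precision and recall.
    f1 = 2 * tp / (n_pred + n_true) if (n_pred + n_true) else 0.0
    return precision, recall, f1


# google-bert/bert-base-german-cased, 5 epochs:
# predicted counts {0: 40, 1: 48}, true counts {0: 36, 1: 33}
negativ = prf(tp=30, n_pred=40, n_true=36)
neutral = prf(tp=29, n_pred=48, n_true=33)
print([round(x, 4) for x in negativ])  # [0.75, 0.8333, 0.7895]
print([round(x, 4) for x in neutral])  # [0.6042, 0.8788, 0.716]
```

The low Neutral precision reflects over-prediction: 48 reviews were labelled Neutral although only 33 are.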

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4066.37 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3971.85 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3869.28 examples/s]


Training results for dbmdz/bert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8579,0.607639,0.844961,0.858825,0.844961,0.843355,"{0: 54, 1: 20, 2: 55}"
2,0.5551,0.743898,0.837209,0.83879,0.837209,0.836884,"{0: 38, 1: 27, 2: 64}"
3,0.492,0.903522,0.837209,0.835397,0.837209,0.833369,"{0: 44, 1: 21, 2: 64}"
4,0.2605,0.880561,0.837209,0.838488,0.837209,0.837269,"{0: 45, 1: 25, 2: 59}"
5,0.2015,0.872499,0.837209,0.838359,0.837209,0.837702,"{0: 42, 1: 28, 2: 59}"
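
In the epoch tables the Accuracy and Recall columns are identical. That is expected if Recall is the support-weighted average over classes (an assumption about how `absa_model` computes its metrics): weighting each class's recall by its support sums the true positives and divides by the total, which is accuracy. A small check with made-up labels, not taken from the dataset:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class support as weights."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(support[c] * (hits[c] / support[c]) for c in support) / len(y_true)


# Illustrative labels only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 1]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```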



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_5
Evaluation results for dbmdz/bert-base-german-cased with 5 epochs and random seeds: 42, 42



{'eval_loss': 0.997826099395752, 'eval_accuracy': 0.7647058823529411, 'eval_precision': 0.7945879187120263, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.7671448865777926, 'eval_class_distribution': {0: 53, 1: 33, 2: 67}, 'eval_runtime': 2.3467, 'eval_samples_per_second': 65.198, 'eval_steps_per_second': 32.812, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.65      0.97      0.78        36
     Neutral       0.79      0.67      0.72        33
     Positiv       0.92      0.77      0.84        84

    accuracy                           0.80       153
   macro avg       0.78      0.80      0.78       153
weighted avg       0.82      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 54, 1: 28, 2: 71}
Negativ Precision Score: 0.6481481481481481
Negativ Recall Score: 0.9722222222222222
Negativ F1 Score: 0.7777777777777778

Neutral Precision Score: 0.7857142857142857
Neutral Recall Score: 0

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3861.12 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3804.53 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3731.07 examples/s]


Training results for dbmdz/bert-base-german-uncased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8492,0.683184,0.844961,0.848891,0.844961,0.844586,"{0: 48, 1: 23, 2: 58}"
2,0.5615,0.58389,0.829457,0.840033,0.829457,0.832542,"{0: 40, 1: 33, 2: 56}"
3,0.4485,0.777486,0.852713,0.855932,0.852713,0.853683,"{0: 40, 1: 30, 2: 59}"
4,0.2257,0.67998,0.860465,0.863138,0.860465,0.861469,"{0: 42, 1: 29, 2: 58}"
5,0.2182,0.7898,0.844961,0.854681,0.844961,0.847538,"{0: 40, 1: 33, 2: 56}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_5
Evaluation results for dbmdz/bert-base-german-uncased with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.0990517139434814, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8299243206054351, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8203795299412042, 'eval_class_distribution': {0: 38, 1: 40, 2: 75}, 'eval_runtime': 2.3525, 'eval_samples_per_second': 65.036, 'eval_steps_per_second': 32.731, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.89      0.86        36
     Neutral       0.67      0.85      0.75        33
     Positiv       0.95      0.82      0.88        84

    accuracy                           0.84       153
   macro avg       0.82      0.85      0.83       153
weighted avg       0.86      0.84      0.85       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 42, 2: 73}
Negativ Precision Score: 0.8421052631578947
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8648648648648649

Neutral Precision Score: 0.6666666666666666
Neutral Recall Score: 

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4859.54 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4176.21 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4453.29 examples/s]


Training results for FacebookAI/xlm-roberta-base with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8995,0.633655,0.79845,0.846253,0.79845,0.801034,"{0: 63, 1: 18, 2: 48}"
2,0.7669,0.958312,0.790698,0.790303,0.790698,0.787535,"{0: 40, 1: 22, 2: 67}"
3,0.7834,1.03833,0.829457,0.839276,0.829457,0.830989,"{0: 50, 1: 25, 2: 54}"
4,0.48,1.15733,0.806202,0.810956,0.806202,0.808082,"{0: 43, 1: 29, 2: 57}"
5,0.4995,1.159907,0.806202,0.810956,0.806202,0.808082,"{0: 43, 1: 29, 2: 57}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_5
Evaluation results for FacebookAI/xlm-roberta-base with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.1958260536193848, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.8035030726757454, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.7955541330407017, 'eval_class_distribution': {0: 47, 1: 29, 2: 77}, 'eval_runtime': 2.297, 'eval_samples_per_second': 66.608, 'eval_steps_per_second': 33.522, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.69      0.94      0.80        36
     Neutral       0.66      0.58      0.61        33
     Positiv       0.89      0.80      0.84        84

    accuracy                           0.78       153
   macro avg       0.75      0.77      0.75       153
weighted avg       0.80      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 49, 1: 29, 2: 75}
Negativ Precision Score: 0.6938775510204082
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.8

Neutral Precision Score: 0.6551724137931034
Neutral Recall Score: 0.57575757575757

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 5001.08 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4406.00 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4614.43 examples/s]


Training results for TUM/GottBERT_base_best with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8872,0.654201,0.860465,0.867849,0.860465,0.860479,"{0: 50, 1: 23, 2: 56}"
2,0.5388,0.510353,0.868217,0.867489,0.868217,0.867765,"{0: 42, 1: 26, 2: 61}"
3,0.4743,0.825377,0.844961,0.852485,0.844961,0.842293,"{0: 49, 1: 19, 2: 61}"
4,0.2875,0.759579,0.844961,0.847627,0.844961,0.842292,"{0: 49, 1: 21, 2: 59}"
5,0.1987,0.784528,0.837209,0.84186,0.837209,0.838667,"{0: 45, 1: 28, 2: 56}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_5
Evaluation results for TUM/GottBERT_base_best with 5 epochs and random seeds: 42, 42



{'eval_loss': 0.8579771518707275, 'eval_accuracy': 0.8235294117647058, 'eval_precision': 0.8285151667504609, 'eval_recall': 0.8235294117647058, 'eval_f1': 0.8247620851883427, 'eval_class_distribution': {0: 39, 1: 36, 2: 78}, 'eval_runtime': 2.2943, 'eval_samples_per_second': 66.688, 'eval_steps_per_second': 33.562, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.78      0.89      0.83        36
     Neutral       0.68      0.70      0.69        33
     Positiv       0.88      0.82      0.85        84

    accuracy                           0.81       153
   macro avg       0.78      0.80      0.79       153
weighted avg       0.82      0.81      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 41, 1: 34, 2: 78}
Negativ Precision Score: 0.7804878048780488
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8311688311688312

Neutral Precision Score: 0.6764705882352942
Neutral Recall Score: 

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4998.43 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4484.44 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4494.90 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8376,0.72531,0.844961,0.856285,0.844961,0.846382,"{0: 51, 1: 24, 2: 54}"
2,0.5302,0.590942,0.883721,0.884948,0.883721,0.883406,"{0: 45, 1: 24, 2: 60}"
3,0.4733,0.643328,0.883721,0.887434,0.883721,0.883974,"{0: 47, 1: 24, 2: 58}"
4,0.3074,0.692748,0.883721,0.885469,0.883721,0.883973,"{0: 45, 1: 25, 2: 59}"
5,0.2824,0.721027,0.891473,0.894203,0.891473,0.891869,"{0: 46, 1: 25, 2: 58}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_5
Evaluation results for TUM/GottBERT_filtered_base_best with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.2548445463180542, 'eval_accuracy': 0.803921568627451, 'eval_precision': 0.8144595274007039, 'eval_recall': 0.803921568627451, 'eval_f1': 0.8063563119168004, 'eval_class_distribution': {0: 39, 1: 39, 2: 75}, 'eval_runtime': 2.2714, 'eval_samples_per_second': 67.361, 'eval_steps_per_second': 33.9, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.78      0.89      0.83        36
     Neutral       0.64      0.76      0.69        33
     Positiv       0.90      0.79      0.84        84

    accuracy                           0.80       153
   macro avg       0.78      0.81      0.79       153
weighted avg       0.82      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 41, 1: 39, 2: 73}
Negativ Precision Score: 0.7804878048780488
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8311688311688312

Neutral Precision Score: 0.6410256410256411
Neutral Recall Score: 0.75

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9112,1.300491,0.728682,0.812171,0.728682,0.701999,"{0: 75, 1: 6, 2: 48}"
2,0.577,0.600182,0.868217,0.868397,0.868217,0.867761,"{0: 39, 1: 28, 2: 62}"
3,0.4619,0.64395,0.860465,0.858948,0.860465,0.858245,"{0: 42, 1: 23, 2: 64}"
4,0.3103,0.655676,0.860465,0.865701,0.860465,0.861973,"{0: 41, 1: 31, 2: 57}"
5,0.2481,0.619495,0.875969,0.891052,0.875969,0.878868,"{0: 46, 1: 32, 2: 51}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_5
Evaluation results for TUM/GottBERT_base_last with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.1922639608383179, 'eval_accuracy': 0.803921568627451, 'eval_precision': 0.828454172366621, 'eval_recall': 0.803921568627451, 'eval_f1': 0.80917534477906, 'eval_class_distribution': {0: 40, 1: 43, 2: 70}, 'eval_runtime': 2.2581, 'eval_samples_per_second': 67.755, 'eval_steps_per_second': 34.099, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.73      0.89      0.80        36
     Neutral       0.65      0.79      0.71        33
     Positiv       0.93      0.76      0.84        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.82      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 44, 1: 40, 2: 69}
Negativ Precision Score: 0.7272727272727273
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8

Neutral Precision Score: 0.65
Neutral Recall Score: 0.7878787878787878
Neutral F1 Scor

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4984.51 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4678.23 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4617.95 examples/s]


Training results for distilbert/distilbert-base-german-cased with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8227,0.785686,0.790698,0.824226,0.790698,0.78805,"{0: 62, 1: 18, 2: 49}"
2,0.5838,0.760451,0.813953,0.832567,0.813953,0.815965,"{0: 54, 1: 24, 2: 51}"
3,0.5247,0.789023,0.844961,0.865685,0.844961,0.846614,"{0: 55, 1: 23, 2: 51}"
4,0.2884,0.823489,0.852713,0.870588,0.852713,0.854412,"{0: 53, 1: 26, 2: 50}"
5,0.295,0.829586,0.852713,0.870588,0.852713,0.854412,"{0: 53, 1: 26, 2: 50}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_5
Evaluation results for distilbert/distilbert-base-german-cased with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.183458685874939, 'eval_accuracy': 0.7647058823529411, 'eval_precision': 0.7875974685029482, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.7681253712604027, 'eval_class_distribution': {0: 47, 1: 37, 2: 69}, 'eval_runtime': 1.2528, 'eval_samples_per_second': 122.127, 'eval_steps_per_second': 61.463, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.66      0.86      0.75        36
     Neutral       0.64      0.64      0.64        33
     Positiv       0.90      0.79      0.84        84

    accuracy                           0.77       153
   macro avg       0.73      0.76      0.74       153
weighted avg       0.79      0.77      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 47, 1: 33, 2: 73}
Negativ Precision Score: 0.6595744680851063
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7469879518072289

Neutral Precision Score: 0.6363636363636364
Neutral Recall Score: 

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9215,0.730344,0.782946,0.801155,0.782946,0.783977,"{0: 55, 1: 22, 2: 52}"
2,0.5741,0.756621,0.806202,0.815845,0.806202,0.807423,"{0: 50, 1: 26, 2: 53}"
3,0.5157,1.075526,0.790698,0.798453,0.790698,0.782654,"{0: 43, 1: 16, 2: 70}"
4,0.2696,0.816916,0.821705,0.829546,0.821705,0.823608,"{0: 47, 1: 28, 2: 54}"
5,0.2591,0.81506,0.79845,0.806847,0.79845,0.801172,"{0: 43, 1: 31, 2: 55}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_5
Evaluation results for GerMedBERT/medbert-512 with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.0450433492660522, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.8112036512158607, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.7983848696225423, 'eval_class_distribution': {0: 44, 1: 38, 2: 71}, 'eval_runtime': 2.3258, 'eval_samples_per_second': 65.784, 'eval_steps_per_second': 33.107, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.67      0.89      0.76        36
     Neutral       0.71      0.67      0.69        33
     Positiv       0.88      0.77      0.82        84

    accuracy                           0.78       153
   macro avg       0.75      0.78      0.76       153
weighted avg       0.79      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 48, 1: 31, 2: 74}
Negativ Precision Score: 0.6666666666666666
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7619047619047619

Neutral Precision Score: 0.7096774193548387
Neutral Recall Score: 

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8154,0.737219,0.860465,0.860171,0.860465,0.859571,"{0: 44, 1: 24, 2: 61}"
2,0.4918,0.724495,0.837209,0.844581,0.837209,0.83806,"{0: 49, 1: 23, 2: 57}"
3,0.4252,0.811122,0.860465,0.860171,0.860465,0.859571,"{0: 44, 1: 24, 2: 61}"
4,0.2195,0.880146,0.852713,0.854815,0.852713,0.853498,"{0: 44, 1: 27, 2: 58}"
5,0.1645,0.874246,0.860465,0.861382,0.860465,0.860856,"{0: 43, 1: 27, 2: 59}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_5
Evaluation results for deepset/gbert-base with 5 epochs and random seeds: 42, 42



{'eval_loss': 1.165304183959961, 'eval_accuracy': 0.8235294117647058, 'eval_precision': 0.8342706313294549, 'eval_recall': 0.8235294117647058, 'eval_f1': 0.826791010385189, 'eval_class_distribution': {0: 35, 1: 40, 2: 78}, 'eval_runtime': 2.332, 'eval_samples_per_second': 65.61, 'eval_steps_per_second': 33.019, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.73      0.92      0.81        36
     Neutral       0.62      0.76      0.68        33
     Positiv       0.90      0.73      0.80        84

    accuracy                           0.78       153
   macro avg       0.75      0.80      0.77       153
weighted avg       0.80      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 45, 1: 40, 2: 68}
Negativ Precision Score: 0.7333333333333333
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.8148148148148148

Neutral Precision Score: 0.625
Neutral Recall Score: 0.757575757575757

In [5]:
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=5, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Map: 100%|██████████| 1111/1111 [00:00<00:00, 2243.54 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2313.42 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2365.43 examples/s]


Training results for aari1995/German_Sentiment with 5 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8823,0.721462,0.875969,0.888727,0.875969,0.874627,"{0: 51, 1: 19, 2: 59}"
2,0.5643,0.48372,0.868217,0.870841,0.868217,0.868716,"{0: 39, 1: 30, 2: 60}"
3,0.5064,0.586305,0.875969,0.876837,0.875969,0.876334,"{0: 43, 1: 27, 2: 59}"
4,0.2086,0.593073,0.883721,0.886732,0.883721,0.88365,"{0: 47, 1: 24, 2: 58}"
5,0.1784,0.725956,0.860465,0.863998,0.860465,0.861749,"{0: 44, 1: 28, 2: 57}"



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_5

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_5
Evaluation results for aari1995/German_Sentiment with 5 epochs and random seeds: 42, 42



{'eval_loss': 0.8373995423316956, 'eval_accuracy': 0.8496732026143791, 'eval_precision': 0.8689605028614317, 'eval_recall': 0.8496732026143791, 'eval_f1': 0.8543234685178419, 'eval_class_distribution': {0: 33, 1: 44, 2: 76}, 'eval_runtime': 5.6465, 'eval_samples_per_second': 27.096, 'eval_steps_per_second': 13.637, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.91      0.81      0.85        36
     Neutral       0.59      0.88      0.71        33
     Positiv       0.96      0.82      0.88        84

    accuracy                           0.83       153
   macro avg       0.82      0.84      0.81       153
weighted avg       0.87      0.83      0.84       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 32, 1: 49, 2: 72}
Negativ Precision Score: 0.90625
Negativ Recall Score: 0.8055555555555556
Negativ F1 Score: 0.8529411764705882

Neutral Precision Score: 0.5918367346938775
Neutral Recall Score: 0.878787878

In [6]:
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=6, save=True)
    print()

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3920.40 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3781.82 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3706.35 examples/s]


Training results for google-bert/bert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8507,0.872872,0.767442,0.794676,0.767442,0.77009,"{0: 57, 1: 24, 2: 48}"
2,0.5033,0.71987,0.821705,0.833824,0.821705,0.823109,"{0: 51, 1: 26, 2: 52}"
3,0.4559,0.897653,0.813953,0.822949,0.813953,0.815203,"{0: 50, 1: 24, 2: 55}"
4,0.259,0.848326,0.821705,0.839475,0.821705,0.824194,"{0: 52, 1: 27, 2: 50}"
5,0.1955,1.000256,0.813953,0.816432,0.813953,0.813751,"{0: 47, 1: 24, 2: 58}"
6,0.1167,1.077784,0.79845,0.799644,0.79845,0.798984,"{0: 43, 1: 27, 2: 59}"



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_6
Evaluation results for google-bert/bert-base-german-cased with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.1823519468307495, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.83270987250226, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8119711042311663, 'eval_class_distribution': {0: 48, 1: 37, 2: 68}, 'eval_runtime': 2.3438, 'eval_samples_per_second': 65.279, 'eval_steps_per_second': 32.853, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.70      0.89      0.78        36
     Neutral       0.68      0.76      0.71        33
     Positiv       0.93      0.77      0.84        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.82      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 46, 1: 37, 2: 70}
Negativ Precision Score: 0.6956521739130435
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7804878048780488

Neutral Precision Score: 0.6756756756756757
Neutral Recall Score: 0.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4015.36 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3867.96 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3839.56 examples/s]


Training results for dbmdz/bert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8984,0.973359,0.751938,0.803143,0.751938,0.753058,"{0: 65, 1: 17, 2: 47}"
2,0.5993,0.923055,0.782946,0.795455,0.782946,0.786003,"{0: 47, 1: 30, 2: 52}"
3,0.5108,1.02908,0.79845,0.808,0.79845,0.799121,"{0: 51, 1: 23, 2: 55}"
4,0.3009,1.030535,0.79845,0.808141,0.79845,0.799109,"{0: 51, 1: 25, 2: 53}"
5,0.2569,0.873975,0.837209,0.837543,0.837209,0.836848,"{0: 39, 1: 28, 2: 62}"
6,0.1749,0.92876,0.821705,0.82121,0.821705,0.821392,"{0: 41, 1: 27, 2: 61}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_6
Evaluation results for dbmdz/bert-base-german-cased with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.2861956357955933, 'eval_accuracy': 0.7516339869281046, 'eval_precision': 0.771012172985136, 'eval_recall': 0.7516339869281046, 'eval_f1': 0.7528386518005895, 'eval_class_distribution': {0: 49, 1: 35, 2: 69}, 'eval_runtime': 2.3445, 'eval_samples_per_second': 65.26, 'eval_steps_per_second': 32.843, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.67      0.86      0.76        36
     Neutral       0.69      0.76      0.72        33
     Positiv       0.89      0.75      0.81        84

    accuracy                           0.78       153
   macro avg       0.75      0.79      0.76       153
weighted avg       0.80      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 46, 1: 36, 2: 71}
Negativ Precision Score: 0.6739130434782609
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7560975609756098

Neutral Precision Score: 0.6944444444444444
Neutral Recall Score: 0.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3838.76 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3689.17 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3598.22 examples/s]


Training results for dbmdz/bert-base-german-uncased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8432,0.867722,0.775194,0.818401,0.775194,0.776189,"{0: 63, 1: 19, 2: 47}"
2,0.5072,0.73067,0.806202,0.807288,0.806202,0.80548,"{0: 44, 1: 23, 2: 62}"
3,0.527,0.769214,0.852713,0.854343,0.852713,0.852602,"{0: 46, 1: 25, 2: 58}"
4,0.2774,0.78422,0.852713,0.854735,0.852713,0.852079,"{0: 46, 1: 23, 2: 60}"
5,0.2561,0.759085,0.829457,0.828607,0.829457,0.828685,"{0: 42, 1: 25, 2: 62}"
6,0.1582,0.728507,0.860465,0.861078,0.860465,0.860522,"{0: 40, 1: 28, 2: 61}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_6
Evaluation results for dbmdz/bert-base-german-uncased with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.3692032098770142, 'eval_accuracy': 0.7581699346405228, 'eval_precision': 0.7765781922525108, 'eval_recall': 0.7581699346405228, 'eval_f1': 0.762696878732334, 'eval_class_distribution': {0: 41, 1: 40, 2: 72}, 'eval_runtime': 2.3331, 'eval_samples_per_second': 65.579, 'eval_steps_per_second': 33.004, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.80      0.89      0.84        36
     Neutral       0.60      0.76      0.67        33
     Positiv       0.93      0.79      0.85        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.79       153
weighted avg       0.83      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 40, 1: 42, 2: 71}
Negativ Precision Score: 0.8
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8421052631578947

Neutral Precision Score: 0.5952380952380952
Neutral Recall Score: 0.75757575757575

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4727.56 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3935.48 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4200.65 examples/s]


Training results for FacebookAI/xlm-roberta-base with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8739,0.552215,0.844961,0.84877,0.844961,0.843999,"{0: 49, 1: 23, 2: 57}"
2,0.7372,0.797765,0.821705,0.820183,0.821705,0.820159,"{0: 42, 1: 24, 2: 63}"
3,0.7377,0.773387,0.852713,0.856728,0.852713,0.851034,"{0: 48, 1: 21, 2: 60}"
4,0.4794,0.86789,0.844961,0.850083,0.844961,0.844866,"{0: 49, 1: 24, 2: 56}"
5,0.4557,0.9142,0.829457,0.832768,0.829457,0.82978,"{0: 47, 1: 25, 2: 57}"
6,0.3646,0.90034,0.837209,0.841932,0.837209,0.837653,"{0: 48, 1: 25, 2: 56}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_6
Evaluation results for FacebookAI/xlm-roberta-base with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.1698037385940552, 'eval_accuracy': 0.8235294117647058, 'eval_precision': 0.824281805745554, 'eval_recall': 0.8235294117647058, 'eval_f1': 0.8216931313021247, 'eval_class_distribution': {0: 43, 1: 30, 2: 80}, 'eval_runtime': 2.2529, 'eval_samples_per_second': 67.912, 'eval_steps_per_second': 34.178, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.76      0.94      0.84        36
     Neutral       0.69      0.61      0.65        33
     Positiv       0.91      0.86      0.88        84

    accuracy                           0.82       153
   macro avg       0.79      0.80      0.79       153
weighted avg       0.83      0.82      0.82       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 45, 1: 29, 2: 79}
Negativ Precision Score: 0.7555555555555555
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.8395061728395061

Neutral Precision Score: 0.6896551724137931
Neutral Recall Score: 0

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4416.49 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4554.04 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4497.17 examples/s]


Training results for TUM/GottBERT_base_best with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8897,0.784115,0.813953,0.851153,0.813953,0.808713,"{0: 60, 1: 14, 2: 55}"
2,0.534,0.685484,0.860465,0.867948,0.860465,0.857423,"{0: 32, 1: 29, 2: 68}"
3,0.5193,0.664688,0.875969,0.878821,0.875969,0.875406,"{0: 47, 1: 23, 2: 59}"
4,0.3053,0.602223,0.883721,0.891543,0.883721,0.884078,"{0: 50, 1: 24, 2: 55}"
5,0.2669,0.559235,0.899225,0.900215,0.899225,0.899631,"{0: 42, 1: 28, 2: 59}"
6,0.2085,0.548349,0.875969,0.878873,0.875969,0.876931,"{0: 44, 1: 28, 2: 57}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_6
Evaluation results for TUM/GottBERT_base_best with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.1487834453582764, 'eval_accuracy': 0.8431372549019608, 'eval_precision': 0.8580392156862745, 'eval_recall': 0.8431372549019608, 'eval_f1': 0.8467036625971144, 'eval_class_distribution': {0: 36, 1: 42, 2: 75}, 'eval_runtime': 2.2771, 'eval_samples_per_second': 67.191, 'eval_steps_per_second': 33.815, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.89      0.86        36
     Neutral       0.63      0.79      0.70        33
     Positiv       0.92      0.81      0.86        84

    accuracy                           0.82       153
   macro avg       0.80      0.83      0.81       153
weighted avg       0.84      0.82      0.83       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 41, 2: 74}
Negativ Precision Score: 0.8421052631578947
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8648648648648649

Neutral Precision Score: 0.6341463414634146
Neutral Recall Score: 

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4911.70 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4468.33 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4498.97 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.857,0.896841,0.775194,0.802215,0.775194,0.776722,"{0: 58, 1: 22, 2: 49}"
2,0.5292,0.846917,0.860465,0.861577,0.860465,0.858477,"{0: 38, 1: 24, 2: 67}"
3,0.4871,0.587638,0.875969,0.88245,0.875969,0.875628,"{0: 49, 1: 22, 2: 58}"
4,0.2992,0.639291,0.883721,0.884332,0.883721,0.883407,"{0: 45, 1: 25, 2: 59}"
5,0.3038,0.608101,0.899225,0.898668,0.899225,0.898855,"{0: 42, 1: 26, 2: 61}"
6,0.1992,0.632043,0.883721,0.883986,0.883721,0.883606,"{0: 44, 1: 26, 2: 59}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_6
Evaluation results for TUM/GottBERT_filtered_base_best with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.2891265153884888, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.8136641700611742, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8110396752207357, 'eval_class_distribution': {0: 40, 1: 34, 2: 79}, 'eval_runtime': 2.2571, 'eval_samples_per_second': 67.785, 'eval_steps_per_second': 34.114, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.78      0.89      0.83        36
     Neutral       0.67      0.67      0.67        33
     Positiv       0.87      0.82      0.85        84

    accuracy                           0.80       153
   macro avg       0.77      0.79      0.78       153
weighted avg       0.81      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 41, 1: 33, 2: 79}
Negativ Precision Score: 0.7804878048780488
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8311688311688312

Neutral Precision Score: 0.6666666666666666
Neutral Recall Score: 

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8874,0.931562,0.806202,0.830926,0.806202,0.800897,"{0: 57, 1: 15, 2: 57}"
2,0.5876,0.83382,0.852713,0.853088,0.852713,0.852658,"{0: 40, 1: 28, 2: 61}"
3,0.4702,0.826412,0.852713,0.870975,0.852713,0.847282,"{0: 54, 1: 16, 2: 59}"
4,0.2987,0.767191,0.868217,0.871894,0.868217,0.868089,"{0: 47, 1: 23, 2: 59}"
5,0.2497,0.768104,0.844961,0.845417,0.844961,0.842991,"{0: 46, 1: 22, 2: 61}"
6,0.2133,0.758869,0.860465,0.863426,0.860465,0.859381,"{0: 47, 1: 22, 2: 60}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_6
Evaluation results for TUM/GottBERT_base_last with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.1324610710144043, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8301974214343272, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.81928726927745, 'eval_class_distribution': {0: 40, 1: 40, 2: 73}, 'eval_runtime': 2.2852, 'eval_samples_per_second': 66.954, 'eval_steps_per_second': 33.696, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.76      0.89      0.82        36
     Neutral       0.62      0.76      0.68        33
     Positiv       0.92      0.77      0.84        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.82      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 42, 1: 40, 2: 71}
Negativ Precision Score: 0.7619047619047619
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8205128205128205

Neutral Precision Score: 0.625
Neutral Recall Score: 0.7575757575757

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4984.24 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4781.12 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4636.20 examples/s]


Training results for distilbert/distilbert-base-german-cased with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.7988,0.735539,0.790698,0.824226,0.790698,0.78805,"{0: 62, 1: 18, 2: 49}"
2,0.5613,0.695231,0.837209,0.859105,0.837209,0.836639,"{0: 56, 1: 19, 2: 54}"
3,0.505,0.677091,0.860465,0.876804,0.860465,0.86201,"{0: 52, 1: 27, 2: 50}"
4,0.2799,0.68771,0.860465,0.876804,0.860465,0.86201,"{0: 52, 1: 27, 2: 50}"
5,0.251,0.737269,0.875969,0.892047,0.875969,0.877154,"{0: 53, 1: 25, 2: 51}"
6,0.1775,0.810352,0.837209,0.858393,0.837209,0.839439,"{0: 52, 1: 29, 2: 48}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_6
Evaluation results for distilbert/distilbert-base-german-cased with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.2296180725097656, 'eval_accuracy': 0.7581699346405228, 'eval_precision': 0.7745098039215687, 'eval_recall': 0.7581699346405228, 'eval_f1': 0.7599655246714071, 'eval_class_distribution': {0: 48, 1: 33, 2: 72}, 'eval_runtime': 1.3004, 'eval_samples_per_second': 117.655, 'eval_steps_per_second': 59.212, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.73      0.89      0.80        36
     Neutral       0.69      0.67      0.68        33
     Positiv       0.88      0.81      0.84        84

    accuracy                           0.80       153
   macro avg       0.77      0.79      0.77       153
weighted avg       0.80      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 44, 1: 32, 2: 77}
Negativ Precision Score: 0.7272727272727273
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8

Neutral Precision Score: 0.6875
Neutral Recall Score: 0.6666666666666666
Neutral

Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3944.11 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3738.52 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3753.87 examples/s]


Training results for GerMedBERT/medbert-512 with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8824,0.710477,0.767442,0.781672,0.767442,0.762594,"{0: 51, 1: 16, 2: 62}"
2,0.5445,1.052178,0.75969,0.758426,0.75969,0.755334,"{0: 36, 1: 24, 2: 69}"
3,0.499,0.989582,0.806202,0.812616,0.806202,0.803087,"{0: 39, 1: 20, 2: 70}"
4,0.2148,0.941466,0.813953,0.813576,0.813953,0.811673,"{0: 46, 1: 22, 2: 61}"
5,0.2331,1.158886,0.806202,0.805212,0.806202,0.805507,"{0: 41, 1: 26, 2: 62}"
6,0.1026,1.152936,0.813953,0.815823,0.813953,0.814685,"{0: 43, 1: 28, 2: 58}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_6
Evaluation results for GerMedBERT/medbert-512 with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.4708349704742432, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8020066889632107, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7854564483745569, 'eval_class_distribution': {0: 45, 1: 39, 2: 69}, 'eval_runtime': 2.332, 'eval_samples_per_second': 65.61, 'eval_steps_per_second': 33.019, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.66      0.86      0.75        36
     Neutral       0.66      0.70      0.68        33
     Positiv       0.89      0.75      0.81        84

    accuracy                           0.76       153
   macro avg       0.73      0.77      0.75       153
weighted avg       0.78      0.76      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 47, 1: 35, 2: 71}
Negativ Precision Score: 0.6595744680851063
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7469879518072289

Neutral Precision Score: 0.6571428571428571
Neutral Recall Score: 0.

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8235,0.771304,0.837209,0.844654,0.837209,0.837041,"{0: 50, 1: 22, 2: 57}"
2,0.5481,0.605326,0.860465,0.862715,0.860465,0.860653,"{0: 46, 1: 25, 2: 58}"
3,0.441,0.742509,0.868217,0.870869,0.868217,0.867459,"{0: 48, 1: 24, 2: 57}"
4,0.1913,0.61062,0.906977,0.90967,0.906977,0.907683,"{0: 45, 1: 27, 2: 57}"
5,0.1594,0.971854,0.844961,0.849143,0.844961,0.845038,"{0: 48, 1: 24, 2: 57}"
6,0.0549,0.828755,0.868217,0.873176,0.868217,0.869614,"{0: 46, 1: 27, 2: 56}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_6
Evaluation results for deepset/gbert-base with 6 epochs and random seeds: 42, 42



{'eval_loss': 1.2386518716812134, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8307317934864563, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8205443519934269, 'eval_class_distribution': {0: 37, 1: 41, 2: 75}, 'eval_runtime': 2.3469, 'eval_samples_per_second': 65.192, 'eval_steps_per_second': 32.809, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.89      0.92      0.90        36
     Neutral       0.63      0.79      0.70        33
     Positiv       0.89      0.80      0.84        84

    accuracy                           0.82       153
   macro avg       0.81      0.83      0.82       153
weighted avg       0.84      0.82      0.83       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 41, 2: 75}
Negativ Precision Score: 0.8918918918918919
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.9041095890410958

Neutral Precision Score: 0.6341463414634146
Neutral Recall Score: 
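
The per-class scores printed above can be checked by hand from the label counts. The sketch below does this for the Negativ class of deepset/gbert-base, which has 36 true labels and 37 predictions. The helper name `precision_recall_f1` is made up for illustration, and the count of 33 true positives is an assumption inferred from the printed recall (33 / 36); the log does not print it directly.

```python
def precision_recall_f1(true_positives, predicted_count, true_count):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = true_positives / predicted_count  # share of predictions that are correct
    recall = true_positives / true_count          # share of true labels that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# deepset/gbert-base, class Negativ (label 0):
# true label distribution {0: 36}, predicted label distribution {0: 37}.
# ASSUMPTION: 33 true positives, inferred from the printed recall, not logged.
precision, recall, f1 = precision_recall_f1(
    true_positives=33, predicted_count=37, true_count=36
)

print(f"Negativ Precision Score: {precision}")
print(f"Negativ Recall Score: {recall}")
print(f"Negativ F1 Score: {f1}")
```

In the training tables, the Recall column equals the Accuracy column in every epoch. That is what a support-weighted average of per-class recall produces, so the aggregate metrics are probably weighted averages, though the averaging mode is not shown in this notebook.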

In [7]:
# Fine-tune one additional checkpoint that is not part of the `models` list, again for 6 epochs
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=6, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Map: 100%|██████████| 1111/1111 [00:00<00:00, 3948.55 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3783.46 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3734.41 examples/s]


Training results for aari1995/German_Sentiment with 6 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9119,0.623253,0.883721,0.908965,0.883721,0.882207,"{0: 55, 1: 17, 2: 57}"
2,0.5104,0.740861,0.837209,0.848215,0.837209,0.839176,"{0: 36, 1: 34, 2: 59}"
3,0.4436,0.715509,0.868217,0.870328,0.868217,0.865894,"{0: 47, 1: 21, 2: 61}"
4,0.244,0.627744,0.906977,0.906579,0.906977,0.906446,"{0: 43, 1: 25, 2: 61}"
5,0.1854,0.68801,0.906977,0.908279,0.906977,0.906126,"{0: 45, 1: 23, 2: 61}"
6,0.0935,0.642033,0.922481,0.923643,0.922481,0.922088,"{0: 45, 1: 24, 2: 60}"



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_6

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_6
Evaluation results for aari1995/German_Sentiment with 6 epochs and random seeds: 42, 42



{'eval_loss': 0.9785120487213135, 'eval_accuracy': 0.8562091503267973, 'eval_precision': 0.8614384165095456, 'eval_recall': 0.8562091503267973, 'eval_f1': 0.8579112116996729, 'eval_class_distribution': {0: 38, 1: 36, 2: 79}, 'eval_runtime': 7.1423, 'eval_samples_per_second': 21.422, 'eval_steps_per_second': 10.781, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.83      0.94      0.88        36
     Neutral       0.64      0.70      0.67        33
     Positiv       0.93      0.85      0.89        84

    accuracy                           0.84       153
   macro avg       0.80      0.83      0.81       153
weighted avg       0.85      0.84      0.84       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 41, 1: 36, 2: 76}
Negativ Precision Score: 0.8292682926829268
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.8831168831168831

Neutral Precision Score: 0.6388888888888888
Neutral Recall Score: 
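
The class weights printed at the top of this output, tensor([1.0957, 1.3467, 0.7436]), match the common "balanced" heuristic `n_samples / (n_classes * class_count)` applied to the training label counts. The sketch below reproduces them in plain Python. Whether `absa_model` computes the weights this way is an assumption, because its source is not shown here. The variable names are illustrative only.

```python
# Training label counts as printed in the log above.
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}

n_samples = sum(train_counts.values())  # 1111 training examples
n_classes = len(train_counts)           # 3 sentiment classes

# ASSUMPTION: "balanced" weighting, i.e. rarer classes get larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.4f}")
```

With this scheme the rarest class (neutral) gets the largest weight and the most frequent class (positiv) gets a weight below 1, so mistakes on neutral examples contribute more to a weighted loss.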

In [5]:
# Repeat the sweep over all candidate models, now with 7 epochs (same random seeds as before)
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=7, save=True)
    print()

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2452.69 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3918.80 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4015.09 examples/s]


Training results for google-bert/bert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8271,0.93374,0.75969,0.815455,0.75969,0.762571,"{0: 65, 1: 17, 2: 47}"
2,0.5347,0.612021,0.79845,0.814447,0.79845,0.800366,"{0: 53, 1: 22, 2: 54}"
3,0.4613,1.035336,0.790698,0.798305,0.790698,0.790147,"{0: 50, 1: 21, 2: 58}"
4,0.2182,0.919725,0.806202,0.808743,0.806202,0.807006,"{0: 45, 1: 26, 2: 58}"
5,0.2031,1.046697,0.829457,0.831635,0.829457,0.828375,"{0: 36, 1: 29, 2: 64}"
6,0.139,1.068112,0.821705,0.823492,0.821705,0.822333,"{0: 41, 1: 29, 2: 59}"
7,0.0862,1.140166,0.829457,0.832398,0.829457,0.830607,"{0: 42, 1: 29, 2: 58}"



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_7
Evaluation results for google-bert/bert-base-german-cased with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.4145169258117676, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8023184540047554, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7876029162556677, 'eval_class_distribution': {0: 41, 1: 41, 2: 71}, 'eval_runtime': 2.3322, 'eval_samples_per_second': 65.604, 'eval_steps_per_second': 33.017, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.71      0.89      0.79        36
     Neutral       0.62      0.70      0.66        33
     Positiv       0.89      0.75      0.81        84

    accuracy                           0.77       153
   macro avg       0.74      0.78      0.75       153
weighted avg       0.79      0.77      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 45, 1: 37, 2: 71}
Negativ Precision Score: 0.7111111111111111
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7901234567901234

Neutral Precision Score: 0.6216216216216216
Neutral Recall Score: 

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3951.23 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3528.74 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3955.40 examples/s]


Training results for dbmdz/bert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9322,1.065827,0.75969,0.818431,0.75969,0.755759,"{0: 66, 1: 13, 2: 50}"
2,0.5933,1.049472,0.744186,0.741648,0.744186,0.739688,"{0: 39, 1: 22, 2: 68}"
3,0.5219,1.06833,0.821705,0.829688,0.821705,0.821732,"{0: 50, 1: 22, 2: 57}"
4,0.3322,0.881616,0.806202,0.823985,0.806202,0.809993,"{0: 51, 1: 27, 2: 51}"
5,0.3412,1.016977,0.821705,0.827939,0.821705,0.823577,"{0: 42, 1: 31, 2: 56}"
6,0.218,1.035653,0.844961,0.876655,0.844961,0.850565,"{0: 39, 1: 40, 2: 50}"
7,0.1725,1.03513,0.837209,0.8623,0.837209,0.842079,"{0: 40, 1: 38, 2: 51}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_7
Evaluation results for dbmdz/bert-base-german-cased with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.4055098295211792, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8106868478385506, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7901270254211431, 'eval_class_distribution': {0: 38, 1: 45, 2: 70}, 'eval_runtime': 2.3704, 'eval_samples_per_second': 64.545, 'eval_steps_per_second': 32.483, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.89      0.81        36
     Neutral       0.58      0.79      0.67        33
     Positiv       0.94      0.73      0.82        84

    accuracy                           0.78       153
   macro avg       0.75      0.80      0.77       153
weighted avg       0.81      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 43, 1: 45, 2: 65}
Negativ Precision Score: 0.7441860465116279
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.810126582278481

Neutral Precision Score: 0.5777777777777777
Neutral Recall Score: 0

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3930.50 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3710.07 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3726.52 examples/s]


Training results for dbmdz/bert-base-german-uncased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.84,0.790999,0.821705,0.843522,0.821705,0.820305,"{0: 57, 1: 19, 2: 53}"
2,0.5828,1.017033,0.744186,0.764707,0.744186,0.747105,"{0: 37, 1: 39, 2: 53}"
3,0.497,0.689613,0.852713,0.851761,0.852713,0.851771,"{0: 40, 1: 26, 2: 63}"
4,0.3013,0.757878,0.829457,0.839343,0.829457,0.832214,"{0: 39, 1: 33, 2: 57}"
5,0.2912,0.818664,0.829457,0.827855,0.829457,0.827817,"{0: 39, 1: 26, 2: 64}"
6,0.1781,0.936172,0.829457,0.839243,0.829457,0.832737,"{0: 41, 1: 32, 2: 56}"
7,0.1158,0.887976,0.860465,0.865547,0.860465,0.862148,"{0: 44, 1: 29, 2: 56}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_7
Evaluation results for dbmdz/bert-base-german-uncased with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.603320837020874, 'eval_accuracy': 0.7516339869281046, 'eval_precision': 0.785293336955741, 'eval_recall': 0.7516339869281046, 'eval_f1': 0.7564902099869795, 'eval_class_distribution': {0: 46, 1: 42, 2: 65}, 'eval_runtime': 2.3525, 'eval_samples_per_second': 65.037, 'eval_steps_per_second': 32.731, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.65      0.89      0.75        36
     Neutral       0.64      0.82      0.72        33
     Positiv       0.95      0.70      0.81        84

    accuracy                           0.77       153
   macro avg       0.75      0.80      0.76       153
weighted avg       0.81      0.77      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 49, 1: 42, 2: 62}
Negativ Precision Score: 0.6530612244897959
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7529411764705882

Neutral Precision Score: 0.6428571428571429
Neutral Recall Score: 0.

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4383.31 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4039.85 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4249.69 examples/s]


Training results for FacebookAI/xlm-roberta-base with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9216,0.708911,0.806202,0.812835,0.806202,0.805108,"{0: 50, 1: 21, 2: 58}"
2,0.7061,0.852544,0.837209,0.837209,0.837209,0.837209,"{0: 42, 1: 27, 2: 60}"
3,0.7415,1.164746,0.813953,0.819257,0.813953,0.814667,"{0: 48, 1: 24, 2: 57}"
4,0.5173,0.945621,0.852713,0.860207,0.852713,0.854514,"{0: 45, 1: 30, 2: 54}"
5,0.4757,0.966865,0.837209,0.837803,0.837209,0.837277,"{0: 44, 1: 26, 2: 59}"
6,0.3776,0.872053,0.860465,0.873088,0.860465,0.862747,"{0: 39, 1: 35, 2: 55}"
7,0.3379,0.859931,0.868217,0.878529,0.868217,0.870122,"{0: 40, 1: 34, 2: 55}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_7
Evaluation results for FacebookAI/xlm-roberta-base with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.2322006225585938, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8249450602391779, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8192717086834734, 'eval_class_distribution': {0: 39, 1: 37, 2: 77}, 'eval_runtime': 2.2759, 'eval_samples_per_second': 67.225, 'eval_steps_per_second': 33.832, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.83      0.94      0.88        36
     Neutral       0.67      0.79      0.72        33
     Positiv       0.93      0.81      0.87        84

    accuracy                           0.84       153
   macro avg       0.81      0.85      0.82       153
weighted avg       0.85      0.84      0.84       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 41, 1: 39, 2: 73}
Negativ Precision Score: 0.8292682926829268
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.8831168831168831

Neutral Precision Score: 0.6666666666666666
Neutral Recall Score: 

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 5008.11 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4495.20 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4507.15 examples/s]


Training results for TUM/GottBERT_base_best with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8641,0.691972,0.79845,0.851249,0.79845,0.788623,"{0: 62, 1: 11, 2: 56}"
2,0.5724,0.598921,0.860465,0.864491,0.860465,0.861585,"{0: 46, 1: 26, 2: 57}"
3,0.4763,0.626405,0.860465,0.862384,0.860465,0.856953,"{0: 44, 1: 20, 2: 65}"
4,0.3464,0.661666,0.868217,0.885219,0.868217,0.869064,"{0: 54, 1: 23, 2: 52}"
5,0.2603,0.741127,0.852713,0.850767,0.852713,0.849997,"{0: 41, 1: 23, 2: 65}"
6,0.2234,0.864755,0.860465,0.886164,0.860465,0.864099,"{0: 34, 1: 39, 2: 56}"
7,0.1659,0.770394,0.868217,0.878702,0.868217,0.87117,"{0: 39, 1: 33, 2: 57}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_7
Evaluation results for TUM/GottBERT_base_best with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.2199233770370483, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.8222825540472599, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8135454433345665, 'eval_class_distribution': {0: 39, 1: 39, 2: 75}, 'eval_runtime': 2.2536, 'eval_samples_per_second': 67.892, 'eval_steps_per_second': 34.168, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.89      0.81        36
     Neutral       0.65      0.79      0.71        33
     Positiv       0.93      0.77      0.84        84

    accuracy                           0.80       153
   macro avg       0.77      0.82      0.79       153
weighted avg       0.83      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 43, 1: 40, 2: 70}
Negativ Precision Score: 0.7441860465116279
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.810126582278481

Neutral Precision Score: 0.65
Neutral Recall Score: 0.7878787878787

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4910.00 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4463.53 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4469.70 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8259,0.977278,0.767442,0.809019,0.767442,0.7698,"{0: 62, 1: 21, 2: 46}"
2,0.5613,0.682484,0.868217,0.868217,0.868217,0.868217,"{0: 42, 1: 27, 2: 60}"
3,0.4879,0.651089,0.891473,0.894169,0.891473,0.891106,"{0: 46, 1: 23, 2: 60}"
4,0.2865,0.699712,0.883721,0.890875,0.883721,0.884222,"{0: 49, 1: 23, 2: 57}"
5,0.2924,0.735913,0.891473,0.890789,0.891473,0.891039,"{0: 42, 1: 26, 2: 61}"
6,0.1965,0.970835,0.852713,0.864224,0.852713,0.855692,"{0: 42, 1: 33, 2: 54}"
7,0.1969,0.85101,0.852713,0.860465,0.852713,0.854655,"{0: 45, 1: 30, 2: 54}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_7
Evaluation results for TUM/GottBERT_filtered_base_best with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.1252208948135376, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.7895209365797601, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7856395013763637, 'eval_class_distribution': {0: 40, 1: 35, 2: 78}, 'eval_runtime': 2.251, 'eval_samples_per_second': 67.971, 'eval_steps_per_second': 34.208, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.76      0.89      0.82        36
     Neutral       0.66      0.70      0.68        33
     Positiv       0.87      0.79      0.82        84

    accuracy                           0.79       153
   macro avg       0.76      0.79      0.77       153
weighted avg       0.80      0.79      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 42, 1: 35, 2: 76}
Negativ Precision Score: 0.7619047619047619
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8205128205128205

Neutral Precision Score: 0.6571428571428571
Neutral Recall Score: 0

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.843,0.706404,0.813953,0.837502,0.813953,0.812199,"{0: 56, 1: 17, 2: 56}"
2,0.5584,0.677789,0.852713,0.850767,0.852713,0.849997,"{0: 41, 1: 23, 2: 65}"
3,0.4896,0.764616,0.852713,0.862729,0.852713,0.850912,"{0: 52, 1: 20, 2: 57}"
4,0.2893,0.862519,0.844961,0.852522,0.844961,0.844393,"{0: 51, 1: 23, 2: 55}"
5,0.2633,0.980617,0.829457,0.826867,0.829457,0.827835,"{0: 42, 1: 25, 2: 62}"
6,0.1973,0.925912,0.829457,0.83541,0.829457,0.831433,"{0: 40, 1: 31, 2: 58}"
7,0.1585,0.996599,0.829457,0.832803,0.829457,0.830665,"{0: 44, 1: 28, 2: 57}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_7
Evaluation results for TUM/GottBERT_base_last with 7 epochs and random seeds: 42, 42



{'eval_loss': 0.9821602702140808, 'eval_accuracy': 0.8366013071895425, 'eval_precision': 0.8400129282482223, 'eval_recall': 0.8366013071895425, 'eval_f1': 0.8369542978296697, 'eval_class_distribution': {0: 40, 1: 35, 2: 78}, 'eval_runtime': 2.2087, 'eval_samples_per_second': 69.272, 'eval_steps_per_second': 34.862, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.76      0.94      0.84        36
     Neutral       0.70      0.64      0.67        33
     Positiv       0.88      0.82      0.85        84

    accuracy                           0.81       153
   macro avg       0.78      0.80      0.79       153
weighted avg       0.81      0.81      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 45, 1: 30, 2: 78}
Negativ Precision Score: 0.7555555555555555
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.8395061728395061

Neutral Precision Score: 0.7
Neutral Recall Score: 0.6363636363636

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4915.00 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4677.42 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4562.04 examples/s]


Training results for distilbert/distilbert-base-german-cased with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8224,0.630164,0.79845,0.813384,0.79845,0.799927,"{0: 53, 1: 24, 2: 52}"
2,0.5452,0.680577,0.821705,0.848066,0.821705,0.824731,"{0: 56, 1: 24, 2: 49}"
3,0.515,0.618437,0.829457,0.84125,0.829457,0.831894,"{0: 48, 1: 29, 2: 52}"
4,0.3069,0.636362,0.852713,0.873897,0.852713,0.854722,"{0: 54, 1: 26, 2: 49}"
5,0.284,0.665883,0.852713,0.873897,0.852713,0.854722,"{0: 54, 1: 26, 2: 49}"
6,0.198,0.76194,0.829457,0.853099,0.829457,0.831978,"{0: 51, 1: 31, 2: 47}"
7,0.1472,0.753789,0.860465,0.87964,0.860465,0.862112,"{0: 53, 1: 27, 2: 49}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_7
Evaluation results for distilbert/distilbert-base-german-cased with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.47126305103302, 'eval_accuracy': 0.738562091503268, 'eval_precision': 0.7675507551163472, 'eval_recall': 0.738562091503268, 'eval_f1': 0.743681635088792, 'eval_class_distribution': {0: 43, 1: 43, 2: 67}, 'eval_runtime': 1.2597, 'eval_samples_per_second': 121.454, 'eval_steps_per_second': 61.124, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.66      0.92      0.77        36
     Neutral       0.62      0.70      0.66        33
     Positiv       0.89      0.70      0.79        84

    accuracy                           0.75       153
   macro avg       0.73      0.77      0.74       153
weighted avg       0.78      0.75      0.75       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 50, 1: 37, 2: 66}
Negativ Precision Score: 0.66
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.7674418604651163

Neutral Precision Score: 0.6216216216216216
Neutral Recall Score: 0.696969696969697


BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8724,0.705863,0.790698,0.816109,0.790698,0.789007,"{0: 57, 1: 17, 2: 55}"
2,0.5456,0.876974,0.782946,0.790073,0.782946,0.785352,"{0: 42, 1: 31, 2: 56}"
3,0.4447,1.099862,0.79845,0.808525,0.79845,0.793985,"{0: 49, 1: 17, 2: 63}"
4,0.2279,0.901028,0.821705,0.833144,0.821705,0.822166,"{0: 52, 1: 22, 2: 55}"
5,0.2156,1.028583,0.821705,0.82252,0.821705,0.821889,"{0: 44, 1: 26, 2: 59}"
6,0.1471,0.99627,0.829457,0.831414,0.829457,0.830229,"{0: 43, 1: 28, 2: 58}"
7,0.1143,1.05112,0.837209,0.838043,0.837209,0.837561,"{0: 43, 1: 27, 2: 59}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_7
Evaluation results for GerMedBERT/medbert-512 with 7 epochs and random seeds: 42, 42



{'eval_loss': 1.371387243270874, 'eval_accuracy': 0.7908496732026143, 'eval_precision': 0.8091062902702453, 'eval_recall': 0.7908496732026143, 'eval_f1': 0.7925936093439869, 'eval_class_distribution': {0: 47, 1: 36, 2: 70}, 'eval_runtime': 2.341, 'eval_samples_per_second': 65.356, 'eval_steps_per_second': 32.891, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.69      0.94      0.80        36
     Neutral       0.67      0.73      0.70        33
     Positiv       0.94      0.76      0.84        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.82      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 49, 1: 36, 2: 68}
Negativ Precision Score: 0.6938775510204082
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.8

Neutral Precision Score: 0.6666666666666666
Neutral Recall Score: 0.727272727272727

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.843,0.858787,0.844961,0.859774,0.844961,0.839021,"{0: 52, 1: 16, 2: 61}"
2,0.5213,0.723457,0.860465,0.861265,0.860465,0.860261,"{0: 39, 1: 29, 2: 61}"
3,0.4875,1.070793,0.829457,0.827501,0.829457,0.8282,"{0: 43, 1: 25, 2: 61}"
4,0.2263,1.028618,0.821705,0.82739,0.821705,0.823698,"{0: 45, 1: 28, 2: 56}"
5,0.2074,1.090236,0.837209,0.838175,0.837209,0.837626,"{0: 43, 1: 27, 2: 59}"
6,0.1284,1.158467,0.829457,0.8332,0.829457,0.831015,"{0: 42, 1: 29, 2: 58}"
7,0.0754,1.181532,0.829457,0.831015,0.829457,0.830156,"{0: 42, 1: 28, 2: 59}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_7
Evaluation results for deepset/gbert-base with 7 epochs and random seeds: 42, 42



{'eval_loss': 0.7281306982040405, 'eval_accuracy': 0.8366013071895425, 'eval_precision': 0.8394884269497273, 'eval_recall': 0.8366013071895425, 'eval_f1': 0.8374766867466673, 'eval_class_distribution': {0: 38, 1: 35, 2: 80}, 'eval_runtime': 2.3718, 'eval_samples_per_second': 64.508, 'eval_steps_per_second': 32.465, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.89      0.86        36
     Neutral       0.63      0.73      0.68        33
     Positiv       0.91      0.83      0.87        84

    accuracy                           0.82       153
   macro avg       0.79      0.82      0.80       153
weighted avg       0.83      0.82      0.83       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 38, 2: 77}
Negativ Precision Score: 0.8421052631578947
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8648648648648649

Neutral Precision Score: 0.631578947368421
Neutral Recall Score: 0

In [6]:
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=7, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])
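
The logged class weights match the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, applied to the training label counts printed above. The sketch below reproduces the three values. Whether `absa_model` computes them this way internally is an assumption, and `balanced_class_weights` is a hypothetical helper that is not part of the notebook's code.

```python
def balanced_class_weights(label_counts, order):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Hypothetical helper for illustration; rarer classes get larger weights.
    """
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    return [n_samples / (n_classes * label_counts[label]) for label in order]


# Training label counts copied from the log output above.
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}

weights = balanced_class_weights(train_counts, ["negativ", "neutral", "positiv"])
print([round(w, 4) for w in weights])  # [1.0957, 1.3467, 0.7436]
```

The neutral class is the smallest (275 of 1111 training examples), so it receives the largest weight in the loss.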


Map: 100%|██████████| 1111/1111 [00:00<00:00, 3996.77 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3787.01 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3732.04 examples/s]


Training results for aari1995/German_Sentiment with 7 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9038,0.656261,0.883721,0.884589,0.883721,0.884086,"{0: 43, 1: 27, 2: 59}"
2,0.5465,0.574743,0.868217,0.885456,0.868217,0.870272,"{0: 53, 1: 22, 2: 54}"
3,0.5327,0.616116,0.883721,0.884141,0.883721,0.883155,"{0: 44, 1: 24, 2: 61}"
4,0.3257,0.765709,0.875969,0.885395,0.875969,0.873925,"{0: 49, 1: 19, 2: 61}"
5,0.2288,0.692132,0.899225,0.901921,0.899225,0.898858,"{0: 46, 1: 23, 2: 60}"
6,0.1859,0.842588,0.868217,0.872111,0.868217,0.869267,"{0: 45, 1: 28, 2: 56}"
7,0.0937,0.867013,0.883721,0.885831,0.883721,0.883819,"{0: 46, 1: 25, 2: 58}"
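
The per-epoch table above can be scanned programmatically to see which epoch scores best on a given validation metric. The following is a minimal sketch using only the standard library. `best_epoch` is a hypothetical helper, and the criterion that `absa_model` actually uses to pick the saved "best model" is not shown in this output, so both variants below are illustrative assumptions.

```python
import csv
import io

# Per-epoch validation log copied from the table above
# (aari1995/German_Sentiment, 7 epochs, seeds 42/42).
LOG = '''
Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9038,0.656261,0.883721,0.884589,0.883721,0.884086,"{0: 43, 1: 27, 2: 59}"
2,0.5465,0.574743,0.868217,0.885456,0.868217,0.870272,"{0: 53, 1: 22, 2: 54}"
3,0.5327,0.616116,0.883721,0.884141,0.883721,0.883155,"{0: 44, 1: 24, 2: 61}"
4,0.3257,0.765709,0.875969,0.885395,0.875969,0.873925,"{0: 49, 1: 19, 2: 61}"
5,0.2288,0.692132,0.899225,0.901921,0.899225,0.898858,"{0: 46, 1: 23, 2: 60}"
6,0.1859,0.842588,0.868217,0.872111,0.868217,0.869267,"{0: 45, 1: 28, 2: 56}"
7,0.0937,0.867013,0.883721,0.885831,0.883721,0.883819,"{0: 46, 1: 25, 2: 58}"
'''.strip()


def best_epoch(log_text, metric, higher_is_better=True):
    """Return (epoch, value) of the row that is best on `metric`."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda row: float(row[metric]))
    return int(best["Epoch"]), float(best[metric])


best_f1 = best_epoch(LOG, "F1")
best_loss = best_epoch(LOG, "Validation Loss", higher_is_better=False)
print(best_f1)    # (5, 0.898858)
print(best_loss)  # (2, 0.574743)
```

The two criteria disagree here: validation F1 peaks at epoch 5, while validation loss is lowest at epoch 2 and rises afterwards as training loss keeps falling.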



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_7

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_7
Evaluation results for aari1995/German_Sentiment with 7 epochs and random seeds: 42, 42



{'eval_loss': 0.8971391916275024, 'eval_accuracy': 0.8627450980392157, 'eval_precision': 0.8630566438582937, 'eval_recall': 0.8627450980392157, 'eval_f1': 0.8624587035926511, 'eval_class_distribution': {0: 39, 1: 32, 2: 82}, 'eval_runtime': 7.1771, 'eval_samples_per_second': 21.318, 'eval_steps_per_second': 10.729, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.89      0.92      0.90        36
     Neutral       0.78      0.76      0.77        33
     Positiv       0.92      0.92      0.92        84

    accuracy                           0.88       153
   macro avg       0.86      0.86      0.86       153
weighted avg       0.88      0.88      0.88       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 32, 2: 84}
Negativ Precision Score: 0.8918918918918919
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.9041095890410958

Neutral Precision Score: 0.78125
Neutral Recall Score: 0.757575757

In [5]:
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=8, save=True)
    print()

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2250.09 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3376.82 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3457.50 examples/s]


Training results for google-bert/bert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.845,0.864035,0.782946,0.793056,0.782946,0.782883,"{0: 52, 1: 22, 2: 55}"
2,0.5437,0.970465,0.75969,0.783373,0.75969,0.76595,"{0: 40, 1: 37, 2: 52}"
3,0.4596,1.072917,0.813953,0.81384,0.813953,0.813309,"{0: 39, 1: 27, 2: 63}"
4,0.248,0.970963,0.782946,0.798324,0.782946,0.786356,"{0: 49, 1: 29, 2: 51}"
5,0.2136,1.317872,0.790698,0.789348,0.790698,0.787569,"{0: 41, 1: 22, 2: 66}"
6,0.1377,1.305402,0.79845,0.801431,0.79845,0.79919,"{0: 39, 1: 30, 2: 60}"
7,0.0924,1.433116,0.813953,0.814084,0.813953,0.813789,"{0: 40, 1: 28, 2: 61}"
8,0.0753,1.440312,0.79845,0.80444,0.79845,0.799373,"{0: 37, 1: 32, 2: 60}"



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_8
Evaluation results for google-bert/bert-base-german-cased with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.7391194105148315, 'eval_accuracy': 0.7647058823529411, 'eval_precision': 0.7755499098864352, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.767150056283583, 'eval_class_distribution': {0: 43, 1: 35, 2: 75}, 'eval_runtime': 2.4178, 'eval_samples_per_second': 63.281, 'eval_steps_per_second': 31.847, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.66      0.81      0.72        36
     Neutral       0.60      0.64      0.62        33
     Positiv       0.85      0.75      0.80        84

    accuracy                           0.74       153
   macro avg       0.70      0.73      0.71       153
weighted avg       0.75      0.74      0.74       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 44, 1: 35, 2: 74}
Negativ Precision Score: 0.6590909090909091
Negativ Recall Score: 0.8055555555555556
Negativ F1 Score: 0.725

Neutral Precision Score: 0.6
Neutral Recall Score: 0.6363636363636364
Neutral F1

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3831.18 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3790.11 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3855.92 examples/s]


Training results for dbmdz/bert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8826,0.738482,0.821705,0.84344,0.821705,0.822154,"{0: 56, 1: 20, 2: 53}"
2,0.576,0.610201,0.844961,0.867486,0.844961,0.848288,"{0: 53, 1: 27, 2: 49}"
3,0.5193,1.100045,0.821705,0.835822,0.821705,0.8163,"{0: 50, 1: 16, 2: 63}"
4,0.2655,0.770221,0.852713,0.85357,0.852713,0.852912,"{0: 44, 1: 26, 2: 59}"
5,0.2601,0.874548,0.852713,0.854126,0.852713,0.852424,"{0: 38, 1: 29, 2: 62}"
6,0.1877,0.893036,0.868217,0.874928,0.868217,0.869298,"{0: 37, 1: 32, 2: 60}"
7,0.1168,0.950542,0.852713,0.85203,0.852713,0.851652,"{0: 43, 1: 24, 2: 62}"
8,0.0754,0.967914,0.860465,0.859173,0.860465,0.858997,"{0: 42, 1: 24, 2: 63}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_8
Evaluation results for dbmdz/bert-base-german-cased with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.441780686378479, 'eval_accuracy': 0.7712418300653595, 'eval_precision': 0.8038680951530018, 'eval_recall': 0.7712418300653595, 'eval_f1': 0.7793616211916865, 'eval_class_distribution': {0: 34, 1: 48, 2: 71}, 'eval_runtime': 2.3769, 'eval_samples_per_second': 64.371, 'eval_steps_per_second': 32.396, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.83      0.83      0.83        36
     Neutral       0.57      0.82      0.68        33
     Positiv       0.90      0.75      0.82        84

    accuracy                           0.78       153
   macro avg       0.77      0.80      0.78       153
weighted avg       0.81      0.78      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 36, 1: 47, 2: 70}
Negativ Precision Score: 0.8333333333333334
Negativ Recall Score: 0.8333333333333334
Negativ F1 Score: 0.8333333333333334

Neutral Precision Score: 0.574468085106383
Neutral Recall Score: 0.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3842.33 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3674.29 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3632.75 examples/s]


Training results for dbmdz/bert-base-german-uncased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8477,0.719345,0.844961,0.858727,0.844961,0.844923,"{0: 53, 1: 21, 2: 55}"
2,0.5418,0.730675,0.860465,0.867571,0.860465,0.861635,"{0: 48, 1: 27, 2: 54}"
3,0.516,0.845765,0.837209,0.837209,0.837209,0.837209,"{0: 42, 1: 27, 2: 60}"
4,0.2661,0.893135,0.829457,0.841489,0.829457,0.829089,"{0: 53, 1: 22, 2: 54}"
5,0.2031,0.900852,0.860465,0.86107,0.860465,0.8607,"{0: 43, 1: 27, 2: 59}"
6,0.1608,0.933761,0.852713,0.857242,0.852713,0.854129,"{0: 44, 1: 29, 2: 56}"
7,0.0939,0.891322,0.860465,0.864385,0.860465,0.861815,"{0: 41, 1: 30, 2: 58}"
8,0.0503,0.976547,0.860465,0.862474,0.860465,0.861189,"{0: 41, 1: 29, 2: 59}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_8
Evaluation results for dbmdz/bert-base-german-uncased with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.917163372039795, 'eval_accuracy': 0.7254901960784313, 'eval_precision': 0.7579185520361991, 'eval_recall': 0.7254901960784313, 'eval_f1': 0.7299436093275224, 'eval_class_distribution': {0: 48, 1: 40, 2: 65}, 'eval_runtime': 2.3786, 'eval_samples_per_second': 64.323, 'eval_steps_per_second': 32.372, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.63      0.86      0.73        36
     Neutral       0.61      0.85      0.71        33
     Positiv       0.93      0.64      0.76        84

    accuracy                           0.74       153
   macro avg       0.72      0.78      0.73       153
weighted avg       0.79      0.74      0.74       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 49, 1: 46, 2: 58}
Negativ Precision Score: 0.6326530612244898
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7294117647058823

Neutral Precision Score: 0.6086956521739131
Neutral Recall Score: 0

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4560.26 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3915.29 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4279.02 examples/s]


Training results for FacebookAI/xlm-roberta-base with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.96,0.733855,0.790698,0.796138,0.790698,0.792435,"{0: 46, 1: 27, 2: 56}"
2,0.8546,1.031593,0.813953,0.814601,0.813953,0.811145,"{0: 37, 1: 24, 2: 68}"
3,0.7837,1.017649,0.813953,0.818571,0.813953,0.812347,"{0: 49, 1: 21, 2: 59}"
4,0.5643,1.389266,0.767442,0.777905,0.767442,0.769491,"{0: 48, 1: 29, 2: 52}"
5,0.5145,1.104014,0.813953,0.815302,0.813953,0.813136,"{0: 46, 1: 23, 2: 60}"
6,0.4392,1.173627,0.829457,0.833317,0.829457,0.830879,"{0: 43, 1: 29, 2: 57}"
7,0.3681,1.039371,0.844961,0.84798,0.844961,0.845873,"{0: 45, 1: 27, 2: 57}"
8,0.3354,1.040194,0.844961,0.850306,0.844961,0.846928,"{0: 42, 1: 30, 2: 57}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_8
Evaluation results for FacebookAI/xlm-roberta-base with 8 epochs and random seeds: 42, 42



{'eval_loss': 0.9596351981163025, 'eval_accuracy': 0.8496732026143791, 'eval_precision': 0.8603635786298326, 'eval_recall': 0.8496732026143791, 'eval_f1': 0.8525172231054584, 'eval_class_distribution': {0: 38, 1: 39, 2: 76}, 'eval_runtime': 2.2805, 'eval_samples_per_second': 67.089, 'eval_steps_per_second': 33.764, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.89      0.81        36
     Neutral       0.68      0.70      0.69        33
     Positiv       0.91      0.82      0.86        84

    accuracy                           0.81       153
   macro avg       0.78      0.80      0.79       153
weighted avg       0.82      0.81      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 43, 1: 34, 2: 76}
Negativ Precision Score: 0.7441860465116279
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.810126582278481

Neutral Precision Score: 0.6764705882352942
Neutral Recall Score: 0

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 5034.22 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4652.40 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4675.58 examples/s]


Training results for TUM/GottBERT_base_best with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8794,0.760487,0.829457,0.860506,0.829457,0.827117,"{0: 58, 1: 16, 2: 55}"
2,0.5066,0.809799,0.829457,0.8468,0.829457,0.834453,"{0: 39, 1: 35, 2: 55}"
3,0.5362,0.629004,0.883721,0.885124,0.883721,0.881197,"{0: 45, 1: 21, 2: 63}"
4,0.318,0.536448,0.860465,0.865477,0.860465,0.85883,"{0: 49, 1: 21, 2: 59}"
5,0.2715,0.895684,0.844961,0.847129,0.844961,0.839765,"{0: 39, 1: 20, 2: 70}"
6,0.2319,0.761999,0.860465,0.875031,0.860465,0.863805,"{0: 38, 1: 35, 2: 56}"
7,0.193,0.719269,0.868217,0.86927,0.868217,0.868189,"{0: 39, 1: 28, 2: 62}"
8,0.1707,0.866967,0.837209,0.840784,0.837209,0.837464,"{0: 37, 1: 29, 2: 63}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_8
Evaluation results for TUM/GottBERT_base_best with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.0912635326385498, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.8099001643890573, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.7985068034526942, 'eval_class_distribution': {0: 46, 1: 34, 2: 73}, 'eval_runtime': 2.3445, 'eval_samples_per_second': 65.258, 'eval_steps_per_second': 32.842, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.72      0.92      0.80        36
     Neutral       0.71      0.76      0.74        33
     Positiv       0.90      0.77      0.83        84

    accuracy                           0.80       153
   macro avg       0.78      0.82      0.79       153
weighted avg       0.82      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 46, 1: 35, 2: 72}
Negativ Precision Score: 0.717391304347826
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.8048780487804879

Neutral Precision Score: 0.7142857142857143
Neutral Recall Score: 0

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4963.88 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4379.71 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4447.58 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.7952,0.688211,0.852713,0.861607,0.852713,0.853744,"{0: 50, 1: 24, 2: 55}"
2,0.5181,0.684829,0.883721,0.884455,0.883721,0.883463,"{0: 39, 1: 29, 2: 61}"
3,0.4586,0.584417,0.906977,0.907802,0.906977,0.907138,"{0: 44, 1: 26, 2: 59}"
4,0.2969,0.722327,0.875969,0.878879,0.875969,0.876812,"{0: 45, 1: 27, 2: 57}"
5,0.3046,0.705719,0.883721,0.883721,0.883721,0.883721,"{0: 42, 1: 27, 2: 60}"
6,0.2432,0.817558,0.875969,0.880171,0.875969,0.877319,"{0: 42, 1: 30, 2: 57}"
7,0.2147,0.722536,0.875969,0.876307,0.875969,0.87574,"{0: 44, 1: 25, 2: 60}"
8,0.1656,0.813668,0.875969,0.879215,0.875969,0.877052,"{0: 43, 1: 29, 2: 57}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_8
Evaluation results for TUM/GottBERT_filtered_base_best with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.1530835628509521, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8235219586726042, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8184065069878218, 'eval_class_distribution': {0: 41, 1: 35, 2: 77}, 'eval_runtime': 2.3039, 'eval_samples_per_second': 66.409, 'eval_steps_per_second': 33.421, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.89      0.86        36
     Neutral       0.69      0.76      0.72        33
     Positiv       0.91      0.86      0.88        84

    accuracy                           0.84       153
   macro avg       0.82      0.83      0.82       153
weighted avg       0.85      0.84      0.84       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 36, 2: 79}
Negativ Precision Score: 0.8421052631578947
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8648648648648649

Neutral Precision Score: 0.6944444444444444
Neutral Recall Score: 

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9086,0.874375,0.79845,0.824793,0.79845,0.787943,"{0: 59, 1: 13, 2: 57}"
2,0.5697,0.81001,0.837209,0.846748,0.837209,0.839653,"{0: 38, 1: 33, 2: 58}"
3,0.5191,1.082653,0.813953,0.833942,0.813953,0.811238,"{0: 55, 1: 17, 2: 57}"
4,0.3103,0.748966,0.829457,0.828248,0.829457,0.828562,"{0: 43, 1: 25, 2: 61}"
5,0.2983,0.761917,0.852713,0.851687,0.852713,0.852115,"{0: 42, 1: 26, 2: 61}"
6,0.1917,0.913061,0.837209,0.84654,0.837209,0.840317,"{0: 43, 1: 31, 2: 55}"
7,0.1904,0.864859,0.868217,0.873853,0.868217,0.870167,"{0: 44, 1: 29, 2: 56}"
8,0.0911,0.95574,0.860465,0.867596,0.860465,0.863049,"{0: 43, 1: 30, 2: 56}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_8
Evaluation results for TUM/GottBERT_base_last with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.6601247787475586, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8048815770439631, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7886849318108682, 'eval_class_distribution': {0: 40, 1: 42, 2: 71}, 'eval_runtime': 2.3122, 'eval_samples_per_second': 66.171, 'eval_steps_per_second': 33.302, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.71      0.89      0.79        36
     Neutral       0.63      0.79      0.70        33
     Positiv       0.94      0.75      0.83        84

    accuracy                           0.79       153
   macro avg       0.76      0.81      0.78       153
weighted avg       0.82      0.79      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 45, 1: 41, 2: 67}
Negativ Precision Score: 0.7111111111111111
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7901234567901234

Neutral Precision Score: 0.6341463414634146
Neutral Recall Score: 
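
The three Negativ scores printed above fit together as follows. With 36 true and 45 predicted Negativ reviews, the printed precision and recall correspond to 32 correct predictions (32/45 and 32/36), and F1 is the harmonic mean of the two. The count of 32 is inferred from the printed values and is not stated in the log; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_positives, predicted, actual):
    """Per-class precision, recall and F1 from raw counts."""
    precision = true_positives / predicted   # correct / all predicted as this class
    recall = true_positives / actual         # correct / all truly in this class
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Negativ class of TUM/GottBERT_base_last (8 epochs): 36 true, 45 predicted.
# true_positives=32 is inferred from the printed scores (assumption).
p, r, f1 = precision_recall_f1(true_positives=32, predicted=45, actual=36)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

High recall with lower precision, as here, means the model over-predicts Negativ: it labels 45 reviews as Negativ although only 36 are.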

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 4809.05 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 4400.37 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 4250.31 examples/s]


Training results for distilbert/distilbert-base-german-cased with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8141,0.742718,0.790698,0.832345,0.790698,0.790678,"{0: 63, 1: 19, 2: 47}"
2,0.568,0.763324,0.813953,0.835625,0.813953,0.816436,"{0: 55, 1: 23, 2: 51}"
3,0.5055,0.599447,0.875969,0.887109,0.875969,0.877011,"{0: 51, 1: 25, 2: 53}"
4,0.2837,0.762575,0.860465,0.878638,0.860465,0.862311,"{0: 53, 1: 26, 2: 50}"
5,0.2962,0.776296,0.844961,0.853919,0.844961,0.846007,"{0: 50, 1: 25, 2: 54}"
6,0.1886,0.807323,0.852713,0.862033,0.852713,0.853939,"{0: 47, 1: 30, 2: 52}"
7,0.1321,0.850481,0.860465,0.876784,0.860465,0.861919,"{0: 53, 1: 25, 2: 51}"
8,0.0919,0.851439,0.860465,0.870176,0.860465,0.861555,"{0: 49, 1: 28, 2: 52}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_8
Evaluation results for distilbert/distilbert-base-german-cased with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.0098521709442139, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8173789684729426, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.816770793581566, 'eval_class_distribution': {0: 39, 1: 32, 2: 82}, 'eval_runtime': 1.302, 'eval_samples_per_second': 117.514, 'eval_steps_per_second': 59.141, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.81      0.81      0.81        36
     Neutral       0.68      0.58      0.62        33
     Positiv       0.83      0.88      0.86        84

    accuracy                           0.80       153
   macro avg       0.77      0.75      0.76       153
weighted avg       0.79      0.80      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 36, 1: 28, 2: 89}
Negativ Precision Score: 0.8055555555555556
Negativ Recall Score: 0.8055555555555556
Negativ F1 Score: 0.8055555555555556

Neutral Precision Score: 0.6785714285714286
Neutral Recall Score: 0

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9033,0.842962,0.775194,0.780869,0.775194,0.774612,"{0: 48, 1: 21, 2: 60}"
2,0.5579,1.020309,0.75969,0.768518,0.75969,0.757392,"{0: 32, 1: 35, 2: 62}"
3,0.5143,1.116372,0.806202,0.805258,0.806202,0.805649,"{0: 42, 1: 26, 2: 61}"
4,0.2535,1.255738,0.767442,0.778941,0.767442,0.771011,"{0: 41, 1: 33, 2: 55}"
5,0.2594,1.263882,0.790698,0.79231,0.790698,0.791307,"{0: 43, 1: 28, 2: 58}"
6,0.1881,1.130223,0.821705,0.824743,0.821705,0.822403,"{0: 44, 1: 29, 2: 56}"
7,0.1556,1.288304,0.813953,0.816409,0.813953,0.814687,"{0: 43, 1: 29, 2: 57}"
8,0.0836,1.259637,0.806202,0.807234,0.806202,0.80664,"{0: 42, 1: 28, 2: 59}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_8
Evaluation results for GerMedBERT/medbert-512 with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.1535640954971313, 'eval_accuracy': 0.7777777777777778, 'eval_precision': 0.7928470656741788, 'eval_recall': 0.7777777777777778, 'eval_f1': 0.7796503130084663, 'eval_class_distribution': {0: 43, 1: 39, 2: 71}, 'eval_runtime': 2.3952, 'eval_samples_per_second': 63.879, 'eval_steps_per_second': 32.148, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.81      0.77        36
     Neutral       0.70      0.85      0.77        33
     Positiv       0.88      0.77      0.82        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.79       153
weighted avg       0.81      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 39, 1: 40, 2: 74}
Negativ Precision Score: 0.7435897435897436
Negativ Recall Score: 0.8055555555555556
Negativ F1 Score: 0.7733333333333333

Neutral Precision Score: 0.7
Neutral Recall Score: 0.8484848484848

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8528,0.885829,0.782946,0.822379,0.782946,0.774095,"{0: 60, 1: 12, 2: 57}"
2,0.5486,0.501229,0.860465,0.859302,0.860465,0.859042,"{0: 40, 1: 25, 2: 64}"
3,0.4376,0.862168,0.844961,0.846309,0.844961,0.844059,"{0: 46, 1: 23, 2: 60}"
4,0.2653,0.824371,0.837209,0.839398,0.837209,0.836834,"{0: 47, 1: 24, 2: 58}"
5,0.2004,1.082035,0.821705,0.826926,0.821705,0.820037,"{0: 50, 1: 21, 2: 58}"
6,0.1083,0.82694,0.860465,0.864175,0.860465,0.860949,"{0: 47, 1: 25, 2: 57}"
7,0.0885,0.996552,0.844961,0.855608,0.844961,0.845859,"{0: 51, 1: 25, 2: 53}"
8,0.0258,0.939416,0.860465,0.86603,0.860465,0.860322,"{0: 49, 1: 23, 2: 57}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_8
Evaluation results for deepset/gbert-base with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.394826889038086, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8034387738835371, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7871793565911214, 'eval_class_distribution': {0: 41, 1: 42, 2: 70}, 'eval_runtime': 2.3751, 'eval_samples_per_second': 64.419, 'eval_steps_per_second': 32.42, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.78      0.78      0.78        36
     Neutral       0.55      0.85      0.67        33
     Positiv       0.91      0.71      0.80        84

    accuracy                           0.76       153
   macro avg       0.75      0.78      0.75       153
weighted avg       0.80      0.76      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 36, 1: 51, 2: 66}
Negativ Precision Score: 0.7777777777777778
Negativ Recall Score: 0.7777777777777778
Negativ F1 Score: 0.7777777777777778

Neutral Precision Score: 0.5490196078431373
Neutral Recall Score: 0.

In [7]:
# Same data and seeds, 8 epochs, for a checkpoint that is not in `models`.
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=8, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])
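
The printed class weights are consistent with the common inverse-frequency ("balanced") scheme, weight = N / (K * n_c), applied to the training label counts. Whether `absa_model` computes them exactly this way is an assumption; the sketch below only checks that the formula reproduces the printed tensor. The function name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count_per_class)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}

# Training label counts copied from the log above.
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}
weights = balanced_class_weights(train_counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
# negativ: 1.0957
# neutral: 1.3467
# positiv: 0.7436
```

The rarest class (neutral) gets the largest weight, so its errors contribute more to the loss than errors on the dominant positive class.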


Map: 100%|██████████| 1111/1111 [00:00<00:00, 3928.72 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3724.01 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3636.41 examples/s]


Training results for aari1995/German_Sentiment with 8 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8976,0.819189,0.837209,0.868564,0.837209,0.823483,"{0: 60, 1: 12, 2: 57}"
2,0.5459,0.549485,0.891473,0.889679,0.891473,0.889848,"{0: 43, 1: 24, 2: 62}"
3,0.4719,0.827644,0.868217,0.868833,0.868217,0.86638,"{0: 45, 1: 22, 2: 62}"
4,0.2458,0.690967,0.891473,0.893065,0.891473,0.891744,"{0: 45, 1: 26, 2: 58}"
5,0.2218,0.732186,0.899225,0.899166,0.899225,0.898415,"{0: 44, 1: 24, 2: 61}"
6,0.0898,0.896576,0.875969,0.876773,0.875969,0.876278,"{0: 41, 1: 28, 2: 60}"
7,0.0453,0.849441,0.899225,0.901475,0.899225,0.89794,"{0: 46, 1: 22, 2: 61}"
8,0.025,0.877132,0.891473,0.895371,0.891473,0.890744,"{0: 47, 1: 22, 2: 60}"
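
The log does not state which criterion `absa_model` uses to pick the "best" model. As an illustration only, the sketch below parses the per-epoch table above with the standard `csv` module and shows that two plausible criteria, highest validation F1 and lowest validation loss, select different epochs for this run. The helper name `best_epoch` is hypothetical.

```python
import csv
import io

# Per-epoch log copied from above (Class Distribution column omitted).
LOG = """Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.8976,0.819189,0.837209,0.868564,0.837209,0.823483
2,0.5459,0.549485,0.891473,0.889679,0.891473,0.889848
3,0.4719,0.827644,0.868217,0.868833,0.868217,0.86638
4,0.2458,0.690967,0.891473,0.893065,0.891473,0.891744
5,0.2218,0.732186,0.899225,0.899166,0.899225,0.898415
6,0.0898,0.896576,0.875969,0.876773,0.875969,0.876278
7,0.0453,0.849441,0.899225,0.901475,0.899225,0.89794
8,0.025,0.877132,0.891473,0.895371,0.891473,0.890744
"""

def best_epoch(log_text, column, maximize=True):
    """Return the epoch number with the best value in the given column."""
    rows = list(csv.DictReader(io.StringIO(log_text)))
    pick = max if maximize else min
    best = pick(rows, key=lambda row: float(row[column]))
    return int(best["Epoch"])

by_f1 = best_epoch(LOG, "F1")                                  # highest F1
by_loss = best_epoch(LOG, "Validation Loss", maximize=False)   # lowest loss
print(by_f1, by_loss)  # 5 2
```

Training loss keeps falling to 0.025 while validation loss bottoms out at epoch 2, which is why the choice of selection metric matters for these runs.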



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_8

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_8
Evaluation results for aari1995/German_Sentiment with 8 epochs and random seeds: 42, 42



{'eval_loss': 1.1299880743026733, 'eval_accuracy': 0.8496732026143791, 'eval_precision': 0.8480319973744039, 'eval_recall': 0.8496732026143791, 'eval_f1': 0.8485100417318726, 'eval_class_distribution': {0: 35, 1: 31, 2: 87}, 'eval_runtime': 7.1668, 'eval_samples_per_second': 21.348, 'eval_steps_per_second': 10.744, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.86      0.86      0.86        36
     Neutral       0.77      0.73      0.75        33
     Positiv       0.87      0.89      0.88        84

    accuracy                           0.85       153
   macro avg       0.84      0.83      0.83       153
weighted avg       0.85      0.85      0.85       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 36, 1: 31, 2: 86}
Negativ Precision Score: 0.8611111111111112
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.8611111111111112

Neutral Precision Score: 0.7741935483870968
Neutral Recall Score: 

In [5]:
# Repeat the training loop over all candidate models, now with 10 epochs.
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=10, save=True)
    print()

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 1684.18 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2415.88 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2500.53 examples/s]


Training results for google-bert/bert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8416,0.846796,0.813953,0.859764,0.813953,0.816602,"{0: 61, 1: 25, 2: 43}"
2,0.5523,0.954318,0.790698,0.787573,0.790698,0.787945,"{0: 39, 1: 25, 2: 65}"
3,0.4147,1.190725,0.806202,0.807185,0.806202,0.805427,"{0: 44, 1: 23, 2: 62}"
4,0.2313,1.299271,0.790698,0.797399,0.790698,0.790219,"{0: 49, 1: 21, 2: 59}"
5,0.2064,1.373593,0.782946,0.797082,0.782946,0.781495,"{0: 51, 1: 18, 2: 60}"
6,0.1277,1.275406,0.813953,0.816278,0.813953,0.814025,"{0: 46, 1: 24, 2: 59}"
7,0.1305,1.544113,0.790698,0.794043,0.790698,0.791815,"{0: 45, 1: 27, 2: 57}"
8,0.0416,1.653646,0.79845,0.80139,0.79845,0.799613,"{0: 42, 1: 29, 2: 58}"
9,0.0072,1.651467,0.79845,0.800454,0.79845,0.799201,"{0: 44, 1: 27, 2: 58}"
10,0.0026,1.624203,0.806202,0.807071,0.806202,0.806419,"{0: 44, 1: 26, 2: 59}"



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_10
Evaluation results for google-bert/bert-base-german-cased with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.2085140943527222, 'eval_accuracy': 0.7058823529411765, 'eval_precision': 0.759738431763533, 'eval_recall': 0.7058823529411765, 'eval_f1': 0.7088385714373423, 'eval_class_distribution': {0: 60, 1: 34, 2: 59}, 'eval_runtime': 2.2178, 'eval_samples_per_second': 68.987, 'eval_steps_per_second': 34.719, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.52      0.94      0.67        36
     Neutral       0.61      0.58      0.59        33
     Positiv       0.93      0.63      0.75        84

    accuracy                           0.69       153
   macro avg       0.69      0.72      0.67       153
weighted avg       0.77      0.69      0.70       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 65, 1: 31, 2: 57}
Negativ Precision Score: 0.5230769230769231
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.6732673267326733

Neutral Precision Score: 0.6129032258064516
Neutral Recall Score: 

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2626.27 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2438.74 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2548.88 examples/s]


Training results for dbmdz/bert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,1.0042,0.983909,0.75969,0.819145,0.75969,0.754033,"{0: 68, 1: 13, 2: 48}"
2,0.6962,0.589672,0.829457,0.833971,0.829457,0.830337,"{0: 47, 1: 26, 2: 56}"
3,0.5564,0.829079,0.844961,0.846518,0.844961,0.845242,"{0: 45, 1: 26, 2: 58}"
4,0.3328,0.973666,0.821705,0.836921,0.821705,0.823796,"{0: 50, 1: 29, 2: 50}"
5,0.3451,1.020653,0.813953,0.817547,0.813953,0.815168,"{0: 41, 1: 30, 2: 58}"
6,0.2691,1.061146,0.806202,0.81469,0.806202,0.808853,"{0: 41, 1: 32, 2: 56}"
7,0.2252,1.044361,0.821705,0.833691,0.821705,0.822586,"{0: 52, 1: 24, 2: 53}"
8,0.1809,1.10138,0.829457,0.831961,0.829457,0.829835,"{0: 46, 1: 26, 2: 57}"
9,0.0712,1.135446,0.821705,0.829589,0.821705,0.823768,"{0: 42, 1: 32, 2: 55}"
10,0.0268,1.174703,0.837209,0.841889,0.837209,0.838279,"{0: 42, 1: 31, 2: 56}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_10
Evaluation results for dbmdz/bert-base-german-cased with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.1199039220809937, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.8183178534571723, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8108010725657785, 'eval_class_distribution': {0: 45, 1: 32, 2: 76}, 'eval_runtime': 2.159, 'eval_samples_per_second': 70.865, 'eval_steps_per_second': 35.664, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.70      0.89      0.78        36
     Neutral       0.70      0.64      0.67        33
     Positiv       0.87      0.80      0.83        84

    accuracy                           0.78       153
   macro avg       0.76      0.77      0.76       153
weighted avg       0.79      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 46, 1: 30, 2: 77}
Negativ Precision Score: 0.6956521739130435
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7804878048780488

Neutral Precision Score: 0.7
Neutral Recall Score: 0.6363636363636

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2756.97 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2543.44 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2606.61 examples/s]


Training results for dbmdz/bert-base-german-uncased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8373,0.705564,0.821705,0.856513,0.821705,0.821713,"{0: 59, 1: 17, 2: 53}"
2,0.5676,0.799524,0.844961,0.845273,0.844961,0.844134,"{0: 38, 1: 29, 2: 62}"
3,0.5076,0.798777,0.852713,0.86068,0.852713,0.852662,"{0: 50, 1: 22, 2: 57}"
4,0.2562,0.702567,0.844961,0.848547,0.844961,0.84367,"{0: 35, 1: 30, 2: 64}"
5,0.2336,0.747054,0.860465,0.873697,0.860465,0.860836,"{0: 33, 1: 34, 2: 62}"
6,0.1793,1.042679,0.829457,0.826929,0.829457,0.826945,"{0: 39, 1: 25, 2: 65}"
7,0.1392,1.085772,0.844961,0.844147,0.844961,0.844346,"{0: 41, 1: 26, 2: 62}"
8,0.058,1.128232,0.844961,0.845772,0.844961,0.844393,"{0: 38, 1: 28, 2: 63}"
9,0.0086,1.12134,0.844961,0.855865,0.844961,0.848049,"{0: 38, 1: 33, 2: 58}"
10,0.0092,1.065396,0.852713,0.860905,0.852713,0.854956,"{0: 38, 1: 32, 2: 59}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_10
Evaluation results for dbmdz/bert-base-german-uncased with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.445467472076416, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8027412533640907, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7887461903000194, 'eval_class_distribution': {0: 34, 1: 44, 2: 75}, 'eval_runtime': 2.1867, 'eval_samples_per_second': 69.969, 'eval_steps_per_second': 35.213, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.75      0.75      0.75        36
     Neutral       0.63      0.82      0.71        33
     Positiv       0.89      0.79      0.84        84

    accuracy                           0.78       153
   macro avg       0.76      0.78      0.77       153
weighted avg       0.80      0.78      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 36, 1: 43, 2: 74}
Negativ Precision Score: 0.75
Negativ Recall Score: 0.75
Negativ F1 Score: 0.75

Neutral Precision Score: 0.627906976744186
Neutral Recall Score: 0.8181818181818182
Neutral F1 Score: 0.7105

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3589.44 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3140.72 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3210.10 examples/s]


Training results for FacebookAI/xlm-roberta-base with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8962,0.815468,0.775194,0.776077,0.775194,0.771644,"{0: 44, 1: 20, 2: 65}"
2,0.7835,0.871984,0.79845,0.801309,0.79845,0.797729,"{0: 46, 1: 22, 2: 61}"
3,0.8224,0.958059,0.844961,0.856875,0.844961,0.843917,"{0: 52, 1: 20, 2: 57}"
4,0.5557,0.956812,0.806202,0.809262,0.806202,0.805945,"{0: 47, 1: 23, 2: 59}"
5,0.5002,0.898484,0.852713,0.85743,0.852713,0.852139,"{0: 48, 1: 22, 2: 59}"
6,0.4182,1.085667,0.806202,0.830046,0.806202,0.808554,"{0: 55, 1: 26, 2: 48}"
7,0.3223,0.943467,0.837209,0.865401,0.837209,0.837937,"{0: 58, 1: 20, 2: 51}"
8,0.2242,0.960268,0.837209,0.848129,0.837209,0.839279,"{0: 47, 1: 30, 2: 52}"
9,0.162,0.906566,0.860465,0.873568,0.860465,0.863639,"{0: 43, 1: 33, 2: 53}"
10,0.1456,0.954546,0.860465,0.866079,0.860465,0.861853,"{0: 46, 1: 28, 2: 55}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_10
Evaluation results for FacebookAI/xlm-roberta-base with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.3683611154556274, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.829985400827169, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8190537333886903, 'eval_class_distribution': {0: 39, 1: 41, 2: 73}, 'eval_runtime': 2.0388, 'eval_samples_per_second': 75.042, 'eval_steps_per_second': 37.766, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.86      0.85        36
     Neutral       0.64      0.85      0.73        33
     Positiv       0.92      0.79      0.85        84

    accuracy                           0.82       153
   macro avg       0.80      0.83      0.81       153
weighted avg       0.84      0.82      0.82       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 44, 2: 72}
Negativ Precision Score: 0.8378378378378378
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.8493150684931506

Neutral Precision Score: 0.6363636363636364
Neutral Recall Score: 

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3683.06 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3397.63 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3249.62 examples/s]


Training results for TUM/GottBERT_base_best with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9303,0.734769,0.844961,0.8689,0.844961,0.846236,"{0: 55, 1: 19, 2: 55}"
2,0.5732,0.848223,0.821705,0.825883,0.821705,0.822769,"{0: 40, 1: 31, 2: 58}"
3,0.51,0.838025,0.852713,0.864537,0.852713,0.84817,"{0: 50, 1: 17, 2: 62}"
4,0.3226,0.781667,0.868217,0.86963,0.868217,0.865522,"{0: 47, 1: 21, 2: 61}"
5,0.2993,0.854226,0.852713,0.850493,0.852713,0.851262,"{0: 42, 1: 25, 2: 62}"
6,0.2656,0.862524,0.837209,0.839834,0.837209,0.837637,"{0: 46, 1: 25, 2: 58}"
7,0.2478,0.916696,0.860465,0.86794,0.860465,0.858061,"{0: 51, 1: 20, 2: 58}"
8,0.2071,0.89107,0.844961,0.84279,0.844961,0.842645,"{0: 44, 1: 23, 2: 62}"
9,0.1343,0.964049,0.837209,0.844167,0.837209,0.839057,"{0: 47, 1: 27, 2: 55}"
10,0.0884,1.04149,0.852713,0.857303,0.852713,0.852989,"{0: 48, 1: 24, 2: 57}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_10
Evaluation results for TUM/GottBERT_base_best with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.0864769220352173, 'eval_accuracy': 0.8431372549019608, 'eval_precision': 0.8412104235633648, 'eval_recall': 0.8431372549019608, 'eval_f1': 0.8415303086696702, 'eval_class_distribution': {0: 35, 1: 30, 2: 88}, 'eval_runtime': 2.0531, 'eval_samples_per_second': 74.52, 'eval_steps_per_second': 37.504, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.82      0.92      0.87        36
     Neutral       0.69      0.55      0.61        33
     Positiv       0.85      0.88      0.87        84

    accuracy                           0.82       153
   macro avg       0.79      0.78      0.78       153
weighted avg       0.81      0.82      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 40, 1: 26, 2: 87}
Negativ Precision Score: 0.825
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.868421052631579

Neutral Precision Score: 0.6923076923076923
Neutral Recall Score: 0.545454545454

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3703.67 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3131.51 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3202.70 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8318,0.852817,0.790698,0.816833,0.790698,0.79092,"{0: 58, 1: 19, 2: 52}"
2,0.5642,0.576296,0.852713,0.852431,0.852713,0.852186,"{0: 44, 1: 25, 2: 60}"
3,0.4727,0.635316,0.883721,0.885124,0.883721,0.881197,"{0: 45, 1: 21, 2: 63}"
4,0.3236,0.862517,0.860465,0.867571,0.860465,0.861635,"{0: 48, 1: 27, 2: 54}"
5,0.3033,0.696167,0.891473,0.891362,0.891473,0.891316,"{0: 43, 1: 26, 2: 60}"
6,0.2174,0.738092,0.852713,0.861823,0.852713,0.854064,"{0: 49, 1: 27, 2: 53}"
7,0.1907,0.821242,0.868217,0.869791,0.868217,0.868493,"{0: 45, 1: 26, 2: 58}"
8,0.1599,0.885034,0.860465,0.86655,0.860465,0.861425,"{0: 48, 1: 26, 2: 55}"
9,0.0771,0.988218,0.844961,0.855226,0.844961,0.846333,"{0: 49, 1: 28, 2: 52}"
10,0.0513,1.020409,0.852713,0.86082,0.852713,0.853933,"{0: 49, 1: 26, 2: 54}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_10
Evaluation results for TUM/GottBERT_filtered_base_best with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.1834323406219482, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.7986928104575163, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.7972178173794121, 'eval_class_distribution': {0: 40, 1: 33, 2: 80}, 'eval_runtime': 2.0503, 'eval_samples_per_second': 74.622, 'eval_steps_per_second': 37.555, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.82      0.89      0.85        36
     Neutral       0.69      0.67      0.68        33
     Positiv       0.87      0.85      0.86        84

    accuracy                           0.82       153
   macro avg       0.79      0.80      0.80       153
weighted avg       0.82      0.82      0.82       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 39, 1: 32, 2: 82}
Negativ Precision Score: 0.8205128205128205
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8533333333333334

Neutral Precision Score: 0.6875
Neutral Recall Score: 0.666666666

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9077,0.562201,0.852713,0.859883,0.852713,0.851635,"{0: 50, 1: 21, 2: 58}"
2,0.6042,0.47913,0.891473,0.893137,0.891473,0.892083,"{0: 43, 1: 28, 2: 58}"
3,0.5224,0.871606,0.844961,0.862877,0.844961,0.837209,"{0: 54, 1: 15, 2: 60}"
4,0.3073,0.777478,0.860465,0.870731,0.860465,0.86178,"{0: 36, 1: 34, 2: 59}"
5,0.295,0.653303,0.875969,0.876068,0.875969,0.873864,"{0: 45, 1: 22, 2: 62}"
6,0.254,0.850213,0.837209,0.845886,0.837209,0.839983,"{0: 39, 1: 32, 2: 58}"
7,0.2277,1.114671,0.813953,0.822197,0.813953,0.813928,"{0: 51, 1: 23, 2: 55}"
8,0.1801,1.013435,0.852713,0.85363,0.852713,0.853104,"{0: 43, 1: 27, 2: 59}"
9,0.1218,0.99379,0.829457,0.838436,0.829457,0.832436,"{0: 40, 1: 32, 2: 57}"
10,0.0918,1.124825,0.837209,0.837472,0.837209,0.834978,"{0: 47, 1: 22, 2: 60}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_10
Evaluation results for TUM/GottBERT_base_last with 10 epochs and random seeds: 42, 42



{'eval_loss': 0.8848168849945068, 'eval_accuracy': 0.8235294117647058, 'eval_precision': 0.8382084340585549, 'eval_recall': 0.8235294117647058, 'eval_f1': 0.8265638870962673, 'eval_class_distribution': {0: 40, 1: 40, 2: 73}, 'eval_runtime': 2.0565, 'eval_samples_per_second': 74.397, 'eval_steps_per_second': 37.441, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.75      0.92      0.82        36
     Neutral       0.64      0.76      0.69        33
     Positiv       0.91      0.76      0.83        84

    accuracy                           0.80       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.82      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 44, 1: 39, 2: 70}
Negativ Precision Score: 0.75
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.825

Neutral Precision Score: 0.6410256410256411
Neutral Recall Score: 0.7575757575757576
Neutral

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3732.17 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3125.74 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3279.08 examples/s]


Training results for distilbert/distilbert-base-german-cased with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.83,0.653218,0.837209,0.859818,0.837209,0.840041,"{0: 54, 1: 26, 2: 49}"
2,0.5813,0.63037,0.821705,0.831934,0.821705,0.823529,"{0: 50, 1: 24, 2: 55}"
3,0.4881,0.643581,0.868217,0.876949,0.868217,0.868851,"{0: 50, 1: 23, 2: 56}"
4,0.2811,0.631425,0.868217,0.883099,0.868217,0.869499,"{0: 51, 1: 28, 2: 50}"
5,0.293,0.6657,0.868217,0.868038,0.868217,0.86788,"{0: 40, 1: 28, 2: 61}"
6,0.2111,0.747172,0.860465,0.880559,0.860465,0.864461,"{0: 45, 1: 34, 2: 50}"
7,0.1792,0.668056,0.883721,0.897066,0.883721,0.884708,"{0: 52, 1: 25, 2: 52}"
8,0.1333,0.761774,0.868217,0.881897,0.868217,0.869636,"{0: 51, 1: 27, 2: 51}"
9,0.0583,0.859355,0.860465,0.875606,0.860465,0.862527,"{0: 48, 1: 31, 2: 50}"
10,0.0473,0.864833,0.860465,0.871148,0.860465,0.862282,"{0: 46, 1: 31, 2: 52}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_10
Evaluation results for distilbert/distilbert-base-german-cased with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.3278748989105225, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8022368189812484, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7874353012445324, 'eval_class_distribution': {0: 44, 1: 38, 2: 71}, 'eval_runtime': 1.2031, 'eval_samples_per_second': 127.166, 'eval_steps_per_second': 63.999, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.67      0.89      0.76        36
     Neutral       0.61      0.61      0.61        33
     Positiv       0.89      0.76      0.82        84

    accuracy                           0.76       153
   macro avg       0.72      0.75      0.73       153
weighted avg       0.78      0.76      0.76       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 48, 1: 33, 2: 72}
Negativ Precision Score: 0.6666666666666666
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7619047619047619

Neutral Precision Score: 0.6060606060606061
Neutral Recall Score

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8909,0.805563,0.767442,0.785038,0.767442,0.768114,"{0: 55, 1: 21, 2: 53}"
2,0.5736,1.07094,0.751938,0.756547,0.751938,0.741502,"{0: 28, 1: 27, 2: 74}"
3,0.4635,0.93076,0.821705,0.832005,0.821705,0.821458,"{0: 52, 1: 22, 2: 55}"
4,0.2796,0.974474,0.813953,0.82569,0.813953,0.816567,"{0: 41, 1: 34, 2: 54}"
5,0.2472,1.368894,0.806202,0.804616,0.806202,0.804967,"{0: 40, 1: 26, 2: 63}"
6,0.1752,1.255434,0.767442,0.768666,0.767442,0.764427,"{0: 34, 1: 31, 2: 64}"
7,0.1778,1.243594,0.79845,0.804972,0.79845,0.799909,"{0: 48, 1: 25, 2: 56}"
8,0.12,1.325993,0.790698,0.793545,0.790698,0.791289,"{0: 46, 1: 26, 2: 57}"
9,0.0497,1.394601,0.782946,0.783978,0.782946,0.783384,"{0: 42, 1: 28, 2: 59}"
10,0.0308,1.347941,0.790698,0.792757,0.790698,0.79153,"{0: 43, 1: 28, 2: 58}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_10
Evaluation results for GerMedBERT/medbert-512 with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.1626139879226685, 'eval_accuracy': 0.803921568627451, 'eval_precision': 0.8156893249177959, 'eval_recall': 0.803921568627451, 'eval_f1': 0.8048216264985952, 'eval_class_distribution': {0: 47, 1: 31, 2: 75}, 'eval_runtime': 2.1774, 'eval_samples_per_second': 70.268, 'eval_steps_per_second': 35.364, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.63      0.89      0.74        36
     Neutral       0.71      0.67      0.69        33
     Positiv       0.90      0.76      0.83        84

    accuracy                           0.77       153
   macro avg       0.75      0.77      0.75       153
weighted avg       0.80      0.77      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 51, 1: 31, 2: 71}
Negativ Precision Score: 0.6274509803921569
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.735632183908046

Neutral Precision Score: 0.7096774193548387
Neutral Recall Score: 0.

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8174,0.975716,0.775194,0.813477,0.775194,0.762682,"{0: 60, 1: 11, 2: 58}"
2,0.5014,0.59875,0.860465,0.860142,0.860465,0.859453,"{0: 42, 1: 24, 2: 63}"
3,0.48,0.819698,0.860465,0.862697,0.860465,0.859388,"{0: 44, 1: 22, 2: 63}"
4,0.2412,0.929708,0.829457,0.831257,0.829457,0.827164,"{0: 47, 1: 21, 2: 61}"
5,0.1872,0.89289,0.875969,0.875319,0.875969,0.87456,"{0: 41, 1: 24, 2: 64}"
6,0.1054,1.10706,0.829457,0.834227,0.829457,0.830458,"{0: 47, 1: 26, 2: 56}"
7,0.0707,1.244699,0.844961,0.847006,0.844961,0.844875,"{0: 46, 1: 24, 2: 59}"
8,0.031,1.229894,0.852713,0.854815,0.852713,0.853498,"{0: 44, 1: 27, 2: 58}"
9,0.0019,1.274283,0.852713,0.855024,0.852713,0.853656,"{0: 43, 1: 28, 2: 58}"
10,0.0002,1.246567,0.860465,0.861382,0.860465,0.860856,"{0: 43, 1: 27, 2: 59}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_10
Evaluation results for deepset/gbert-base with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.456703782081604, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.8075866899396311, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.8005776859085901, 'eval_class_distribution': {0: 37, 1: 39, 2: 77}, 'eval_runtime': 2.2015, 'eval_samples_per_second': 69.498, 'eval_steps_per_second': 34.976, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.89      0.86        36
     Neutral       0.62      0.73      0.67        33
     Positiv       0.88      0.80      0.84        84

    accuracy                           0.80       153
   macro avg       0.78      0.80      0.79       153
weighted avg       0.81      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 39, 2: 76}
Negativ Precision Score: 0.8421052631578947
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8648648648648649

Neutral Precision Score: 0.6153846153846154
Neutral Recall Score: 

In [6]:
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=10, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])
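
The printed class weights are consistent with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, applied to the training label counts. Whether `absa_model` computes them exactly this way is not shown in this notebook, so the sketch below is an assumption that happens to reproduce the logged values.

```python
# Training label counts from the log, in the order (negativ, neutral, positiv).
counts = {"negativ": 338, "neutral": 275, "positiv": 498}

n_samples = sum(counts.values())   # 1111 training examples
n_classes = len(counts)            # 3 sentiment classes

# Balanced weighting: rarer classes receive proportionally larger weights.
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

rounded = [round(w, 4) for w in weights.values()]
print(rounded)
```

In a PyTorch training loop, a weight list like this would typically be wrapped in a tensor and passed to the loss function so that errors on the rarer Neutral class cost more than errors on Positiv.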


Map: 100%|██████████| 1111/1111 [00:00<00:00, 2754.01 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2458.23 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2512.12 examples/s]


Training results for aari1995/German_Sentiment with 10 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8395,0.791142,0.860465,0.865477,0.860465,0.85883,"{0: 49, 1: 21, 2: 59}"
2,0.5341,0.854222,0.837209,0.844961,0.837209,0.838044,"{0: 36, 1: 33, 2: 60}"
3,0.5082,0.742958,0.868217,0.879837,0.868217,0.866869,"{0: 50, 1: 19, 2: 60}"
4,0.2749,0.765735,0.852713,0.857564,0.852713,0.854547,"{0: 41, 1: 30, 2: 58}"
5,0.1764,0.815526,0.875969,0.879838,0.875969,0.877259,"{0: 40, 1: 30, 2: 59}"
6,0.1184,0.767304,0.899225,0.900804,0.899225,0.899732,"{0: 44, 1: 27, 2: 58}"
7,0.0595,0.658802,0.930233,0.931396,0.930233,0.929347,"{0: 44, 1: 23, 2: 62}"
8,0.0288,0.727754,0.906977,0.90835,0.906977,0.907127,"{0: 45, 1: 26, 2: 58}"
9,0.0032,0.718983,0.922481,0.922194,0.922481,0.921549,"{0: 43, 1: 24, 2: 62}"
10,0.0011,0.717965,0.922481,0.922194,0.922481,0.921549,"{0: 43, 1: 24, 2: 62}"



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_10

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_10
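
The save directories in these logs appear to follow one naming pattern: the model name with `/` replaced by `_`, followed by the two random seeds and the requested number of epochs. The helper below, `run_dir`, is a hypothetical reconstruction of that pattern for locating a saved run; it is not taken from `absa_model_train`.

```python
def run_dir(model_name, rn1, rn2, epochs, root="./saved_models"):
    """Build the directory name that matches the pattern seen in the logs.

    Hypothetical helper: the real naming logic lives in absa_model and is
    not shown in this notebook.
    """
    safe_name = model_name.replace("/", "_")  # "aari1995/German_Sentiment" -> "aari1995_German_Sentiment"
    return f"{root}/absa_{safe_name}_{rn1}_{rn2}_{epochs}"


model_path = run_dir("aari1995/German_Sentiment", 42, 42, 10)
tokenizer_path = run_dir("aari1995/German_Sentiment", 42, 42, 10, root="./saved_tokenizers")
print(model_path)
print(tokenizer_path)
```

Note that the last number is the requested epoch count, not the epoch at which training actually ended, so a run stopped early at epoch 5 would still be stored under `_12`.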
Evaluation results for aari1995/German_Sentiment with 10 epochs and random seeds: 42, 42



{'eval_loss': 1.3482502698898315, 'eval_accuracy': 0.8562091503267973, 'eval_precision': 0.854113139508677, 'eval_recall': 0.8562091503267973, 'eval_f1': 0.8546529723000311, 'eval_class_distribution': {0: 36, 1: 30, 2: 87}, 'eval_runtime': 5.6324, 'eval_samples_per_second': 27.164, 'eval_steps_per_second': 13.671, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.89      0.89      0.89        36
     Neutral       0.79      0.67      0.72        33
     Positiv       0.87      0.92      0.89        84

    accuracy                           0.86       153
   macro avg       0.85      0.82      0.83       153
weighted avg       0.85      0.86      0.85       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 36, 1: 28, 2: 89}
Negativ Precision Score: 0.8888888888888888
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8888888888888888

Neutral Precision Score: 0.7857142857142857
Neutral Recall Score: 

In [6]:
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=12, save=True)
    print()

# Early stopping with patience = 3: training continues as long as there is any improvement within the last 3 epochs

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'positiv': 498, 'neutral': 275, 'negativ': 338}
Validation Sentiment label count:  {'positiv': 60, 'neutral': 27, 'negativ': 42}
Test Sentiment label count:  {'positiv': 84, 'neutral': 33, 'negativ': 36}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2252.84 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2463.61 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2543.49 examples/s]


Training results for google-bert/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8268,0.85633,0.790698,0.837679,0.790698,0.792652,"{0: 63, 1: 22, 2: 44}"
2,0.5357,0.61449,0.837209,0.85314,0.837209,0.840776,"{0: 41, 1: 35, 2: 53}"
3,0.4286,0.933588,0.821705,0.828073,0.821705,0.822346,"{0: 49, 1: 24, 2: 56}"
4,0.2246,0.911752,0.829457,0.845803,0.829457,0.831843,"{0: 52, 1: 26, 2: 51}"
5,0.1933,1.065856,0.806202,0.806386,0.806202,0.80549,"{0: 42, 1: 24, 2: 63}"
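
The table above ends at epoch 5 although 12 epochs were requested, which is the behaviour expected from early stopping with a patience of 3. The sketch below illustrates the rule in plain Python. Which metric `absa_model` actually monitors is not shown in this notebook; validation accuracy is used here purely as an assumption, and `early_stopping_epoch` is a hypothetical helper name.

```python
def early_stopping_epoch(scores, patience=3):
    """Return the 1-based epoch at which training stops, or None if it never triggers.

    Training stops once `patience` consecutive epochs fail to beat the best
    score seen so far (higher is better).
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation accuracy per epoch, copied from the table above
# (google-bert/bert-base-german-cased, 12 epochs requested).
val_accuracy = [0.790698, 0.837209, 0.821705, 0.829457, 0.806202]
stop_epoch = early_stopping_epoch(val_accuracy, patience=3)
print(stop_epoch)
```

Under this assumption the best score is reached at epoch 2, epochs 3 to 5 bring no improvement, and training ends after epoch 5, matching the logged table.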



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_12
Evaluation results for google-bert/bert-base-german-cased with 12 epochs and random seeds: 42, 42



{'eval_loss': 0.8268358111381531, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.8137133728761873, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.801392227088405, 'eval_class_distribution': {0: 42, 1: 38, 2: 73}, 'eval_runtime': 2.1976, 'eval_samples_per_second': 69.622, 'eval_steps_per_second': 35.038, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.80      0.89      0.84        36
     Neutral       0.59      0.73      0.65        33
     Positiv       0.93      0.80      0.86        84

    accuracy                           0.80       153
   macro avg       0.77      0.80      0.78       153
weighted avg       0.83      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 40, 1: 41, 2: 72}
Negativ Precision Score: 0.8
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8421052631578947

Neutral Precision Score: 0.5853658536585366
Neutral Recall Score: 0.72727272727272

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2722.64 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2439.40 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2505.80 examples/s]


Training results for dbmdz/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8746,0.962469,0.775194,0.819495,0.775194,0.766703,"{0: 62, 1: 12, 2: 55}"
2,0.5506,0.89786,0.813953,0.813756,0.813953,0.813489,"{0: 44, 1: 25, 2: 60}"
3,0.5108,1.098207,0.813953,0.855367,0.813953,0.803492,"{0: 62, 1: 12, 2: 55}"
4,0.3163,0.791952,0.852713,0.859386,0.852713,0.853166,"{0: 49, 1: 23, 2: 57}"
5,0.2824,1.103715,0.829457,0.831512,0.829457,0.827072,"{0: 36, 1: 25, 2: 68}"
6,0.1887,1.175396,0.844961,0.864026,0.844961,0.848212,"{0: 53, 1: 24, 2: 52}"
7,0.1177,0.982043,0.860465,0.861065,0.860465,0.858615,"{0: 45, 1: 22, 2: 62}"
8,0.0746,1.205643,0.844961,0.849633,0.844961,0.841367,"{0: 47, 1: 19, 2: 63}"
9,0.0176,0.952554,0.875969,0.876169,0.875969,0.875786,"{0: 40, 1: 27, 2: 62}"
10,0.0073,1.071363,0.860465,0.864774,0.860465,0.860194,"{0: 48, 1: 23, 2: 58}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_12
Evaluation results for dbmdz/bert-base-german-cased with 12 epochs and random seeds: 42, 42



{'eval_loss': 2.024049758911133, 'eval_accuracy': 0.7450980392156863, 'eval_precision': 0.7647788417692659, 'eval_recall': 0.7450980392156863, 'eval_f1': 0.7512031904807744, 'eval_class_distribution': {0: 35, 1: 43, 2: 75}, 'eval_runtime': 2.1935, 'eval_samples_per_second': 69.753, 'eval_steps_per_second': 35.104, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.86      0.83      0.85        36
     Neutral       0.59      0.73      0.65        33
     Positiv       0.90      0.82      0.86        84

    accuracy                           0.80       153
   macro avg       0.78      0.79      0.78       153
weighted avg       0.82      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 35, 1: 41, 2: 77}
Negativ Precision Score: 0.8571428571428571
Negativ Recall Score: 0.8333333333333334
Negativ F1 Score: 0.8450704225352113

Neutral Precision Score: 0.5853658536585366
Neutral Recall Score: 

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2489.03 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2411.80 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2483.09 examples/s]


Training results for dbmdz/bert-base-german-uncased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8763,0.650298,0.806202,0.833099,0.806202,0.799525,"{0: 56, 1: 14, 2: 59}"
2,0.5215,0.842318,0.813953,0.816596,0.813953,0.812202,"{0: 35, 1: 28, 2: 66}"
3,0.4772,0.832856,0.813953,0.818723,0.813953,0.80888,"{0: 49, 1: 18, 2: 62}"
4,0.2723,0.921269,0.821705,0.828904,0.821705,0.817697,"{0: 48, 1: 18, 2: 63}"
5,0.2459,0.775434,0.860465,0.861469,0.860465,0.860882,"{0: 42, 1: 28, 2: 59}"
6,0.1832,0.934674,0.837209,0.837863,0.837209,0.83747,"{0: 43, 1: 27, 2: 59}"
7,0.1488,1.072342,0.852713,0.856441,0.852713,0.853687,"{0: 46, 1: 26, 2: 57}"
8,0.1006,1.106352,0.844961,0.853116,0.844961,0.847472,"{0: 40, 1: 32, 2: 57}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_12
Evaluation results for dbmdz/bert-base-german-uncased with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.5307726860046387, 'eval_accuracy': 0.7189542483660131, 'eval_precision': 0.7562915368205994, 'eval_recall': 0.7189542483660131, 'eval_f1': 0.720553382023248, 'eval_class_distribution': {0: 49, 1: 43, 2: 61}, 'eval_runtime': 2.1984, 'eval_samples_per_second': 69.597, 'eval_steps_per_second': 35.026, 'epoch': 8.0}
              precision    recall  f1-score   support

     Negativ       0.64      0.89      0.74        36
     Neutral       0.63      0.88      0.73        33
     Positiv       0.95      0.64      0.77        84

    accuracy                           0.75       153
   macro avg       0.74      0.80      0.75       153
weighted avg       0.81      0.75      0.75       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 50, 1: 46, 2: 57}
Negativ Precision Score: 0.64
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7441860465116279

Neutral Precision Score: 0.6304347826086957
Neutral Recall Score: 0.8787878787878

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3705.85 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3135.57 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3281.12 examples/s]


Training results for FacebookAI/xlm-roberta-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,1.0215,1.070785,0.348837,0.573643,0.348837,0.207087,"{0: 126, 1: 0, 2: 3}"
2,0.9116,0.779326,0.790698,0.792021,0.790698,0.791022,"{0: 40, 1: 29, 2: 60}"
3,0.8395,1.113744,0.790698,0.789147,0.790698,0.789426,"{0: 41, 1: 25, 2: 63}"
4,0.719,1.100871,0.79845,0.796559,0.79845,0.796857,"{0: 44, 1: 24, 2: 61}"
5,0.5738,1.387385,0.782946,0.781519,0.782946,0.779219,"{0: 37, 1: 24, 2: 68}"
6,0.5836,1.182464,0.813953,0.823349,0.813953,0.815646,"{0: 38, 1: 34, 2: 57}"
7,0.4777,1.085219,0.821705,0.824883,0.821705,0.822435,"{0: 46, 1: 26, 2: 57}"
8,0.4109,1.135072,0.813953,0.815073,0.813953,0.812961,"{0: 37, 1: 30, 2: 62}"
9,0.3221,1.162743,0.821705,0.831509,0.821705,0.823597,"{0: 38, 1: 34, 2: 57}"
10,0.2883,1.222357,0.813953,0.81332,0.813953,0.812813,"{0: 39, 1: 26, 2: 64}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_12
Evaluation results for FacebookAI/xlm-roberta-base with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.316672682762146, 'eval_accuracy': 0.8300653594771242, 'eval_precision': 0.8382506345354644, 'eval_recall': 0.8300653594771242, 'eval_f1': 0.8318959162612414, 'eval_class_distribution': {0: 40, 1: 37, 2: 76}, 'eval_runtime': 2.0453, 'eval_samples_per_second': 74.805, 'eval_steps_per_second': 37.647, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.78      0.89      0.83        36
     Neutral       0.60      0.79      0.68        33
     Positiv       0.91      0.75      0.82        84

    accuracy                           0.79       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.82      0.79      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 41, 1: 43, 2: 69}
Negativ Precision Score: 0.7804878048780488
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8311688311688312

Neutral Precision Score: 0.6046511627906976
Neutral Recall Score: 

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3760.30 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3248.80 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3306.69 examples/s]


Training results for TUM/GottBERT_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8917,0.934055,0.790698,0.816342,0.790698,0.788631,"{0: 58, 1: 17, 2: 54}"
2,0.5623,0.736761,0.852713,0.863787,0.852713,0.85216,"{0: 52, 1: 21, 2: 56}"
3,0.54,0.940871,0.844961,0.843686,0.844961,0.842311,"{0: 44, 1: 22, 2: 63}"
4,0.3007,0.966743,0.813953,0.818203,0.813953,0.810478,"{0: 51, 1: 20, 2: 58}"
5,0.3294,0.968592,0.821705,0.825249,0.821705,0.818711,"{0: 49, 1: 20, 2: 60}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_12
Evaluation results for TUM/GottBERT_base_best with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.0387647151947021, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.8292497040305368, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8139436220021332, 'eval_class_distribution': {0: 43, 1: 39, 2: 71}, 'eval_runtime': 2.0508, 'eval_samples_per_second': 74.606, 'eval_steps_per_second': 37.547, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.72      0.92      0.80        36
     Neutral       0.68      0.70      0.69        33
     Positiv       0.92      0.80      0.85        84

    accuracy                           0.80       153
   macro avg       0.77      0.80      0.78       153
weighted avg       0.82      0.80      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 46, 1: 34, 2: 73}
Negativ Precision Score: 0.717391304347826
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.8048780487804879

Neutral Precision Score: 0.6764705882352942
Neutral Recall Score: 0

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3759.05 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3265.84 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3341.54 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8297,0.934303,0.790698,0.828243,0.790698,0.784573,"{0: 62, 1: 14, 2: 53}"
2,0.5711,0.61225,0.883721,0.882644,0.883721,0.882964,"{0: 41, 1: 26, 2: 62}"
3,0.5042,0.571497,0.891473,0.89385,0.891473,0.890252,"{0: 46, 1: 22, 2: 61}"
4,0.3288,0.693923,0.852713,0.862629,0.852713,0.852381,"{0: 51, 1: 21, 2: 57}"
5,0.2999,0.649871,0.883721,0.887434,0.883721,0.883974,"{0: 47, 1: 24, 2: 58}"
6,0.2119,0.80463,0.852713,0.857087,0.852713,0.852938,"{0: 48, 1: 25, 2: 56}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_12
Evaluation results for TUM/GottBERT_filtered_base_best with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.1658645868301392, 'eval_accuracy': 0.7908496732026143, 'eval_precision': 0.7957653649695172, 'eval_recall': 0.7908496732026143, 'eval_f1': 0.7912728312010672, 'eval_class_distribution': {0: 42, 1: 34, 2: 77}, 'eval_runtime': 2.0405, 'eval_samples_per_second': 74.983, 'eval_steps_per_second': 37.736, 'epoch': 6.0}
              precision    recall  f1-score   support

     Negativ       0.78      0.86      0.82        36
     Neutral       0.67      0.73      0.70        33
     Positiv       0.87      0.80      0.83        84

    accuracy                           0.80       153
   macro avg       0.77      0.80      0.78       153
weighted avg       0.80      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 40, 1: 36, 2: 77}
Negativ Precision Score: 0.775
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.8157894736842105

Neutral Precision Score: 0.6666666666666666
Neutral Recall Score: 0.72727272727

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8966,0.734998,0.821705,0.848089,0.821705,0.821513,"{0: 57, 1: 18, 2: 54}"
2,0.5423,0.653417,0.860465,0.860992,0.860465,0.860633,"{0: 41, 1: 28, 2: 60}"
3,0.4977,0.782798,0.844961,0.856712,0.844961,0.840051,"{0: 52, 1: 17, 2: 60}"
4,0.3259,0.778349,0.852713,0.864224,0.852713,0.855158,"{0: 36, 1: 33, 2: 60}"
5,0.2789,1.134474,0.806202,0.805714,0.806202,0.803696,"{0: 42, 1: 22, 2: 65}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_12
Evaluation results for TUM/GottBERT_base_last with 12 epochs and random seeds: 42, 42



{'eval_loss': 0.9622507691383362, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.8231703065294397, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8142358313188288, 'eval_class_distribution': {0: 37, 1: 40, 2: 76}, 'eval_runtime': 2.0565, 'eval_samples_per_second': 74.399, 'eval_steps_per_second': 37.443, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.72      0.86      0.78        36
     Neutral       0.59      0.67      0.63        33
     Positiv       0.92      0.80      0.85        84

    accuracy                           0.78       153
   macro avg       0.74      0.78      0.76       153
weighted avg       0.80      0.78      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 43, 1: 37, 2: 73}
Negativ Precision Score: 0.7209302325581395
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7848101265822784

Neutral Precision Score: 0.5945945945945946
Neutral Recall Score: 

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3870.31 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3261.26 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3353.95 examples/s]


Training results for distilbert/distilbert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8099,0.597033,0.837209,0.864035,0.837209,0.838656,"{0: 57, 1: 24, 2: 48}"
2,0.5747,0.615361,0.829457,0.83517,0.829457,0.829957,"{0: 48, 1: 23, 2: 58}"
3,0.4941,0.615135,0.837209,0.84127,0.837209,0.836054,"{0: 45, 1: 21, 2: 63}"
4,0.2861,0.839471,0.821705,0.847213,0.821705,0.822364,"{0: 55, 1: 28, 2: 46}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_12
Evaluation results for distilbert/distilbert-base-german-cased with 12 epochs and random seeds: 42, 42



{'eval_loss': 0.8538591861724854, 'eval_accuracy': 0.7581699346405228, 'eval_precision': 0.7850672111891964, 'eval_recall': 0.7581699346405228, 'eval_f1': 0.7597698725270837, 'eval_class_distribution': {0: 52, 1: 34, 2: 67}, 'eval_runtime': 1.2053, 'eval_samples_per_second': 126.943, 'eval_steps_per_second': 63.886, 'epoch': 4.0}
              precision    recall  f1-score   support

     Negativ       0.62      0.97      0.76        36
     Neutral       0.68      0.64      0.66        33
     Positiv       0.91      0.71      0.80        84

    accuracy                           0.76       153
   macro avg       0.74      0.77      0.74       153
weighted avg       0.79      0.76      0.76       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 56, 1: 31, 2: 66}
Negativ Precision Score: 0.625
Negativ Recall Score: 0.9722222222222222
Negativ F1 Score: 0.7608695652173914

Neutral Precision Score: 0.6774193548387096
Neutral Recall Score: 0.6363636363

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8788,0.873218,0.775194,0.777534,0.775194,0.772396,"{0: 46, 1: 20, 2: 63}"
2,0.5553,0.984977,0.790698,0.794575,0.790698,0.787843,"{0: 33, 1: 31, 2: 65}"
3,0.4741,1.191893,0.775194,0.775666,0.775194,0.771275,"{0: 43, 1: 20, 2: 66}"
4,0.2654,1.13102,0.782946,0.78926,0.782946,0.783659,"{0: 49, 1: 25, 2: 55}"
5,0.2432,1.283702,0.790698,0.789147,0.790698,0.789426,"{0: 41, 1: 25, 2: 63}"
6,0.1608,1.215056,0.806202,0.80706,0.806202,0.806357,"{0: 41, 1: 29, 2: 59}"
7,0.1584,1.372254,0.79845,0.798354,0.79845,0.7972,"{0: 39, 1: 25, 2: 65}"
8,0.0849,1.407316,0.806202,0.808251,0.806202,0.806666,"{0: 45, 1: 25, 2: 59}"
9,0.0599,1.696144,0.767442,0.785833,0.767442,0.770143,"{0: 36, 1: 38, 2: 55}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_12
Evaluation results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.453885793685913, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.8192333226385131, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8168216161720311, 'eval_class_distribution': {0: 41, 1: 33, 2: 79}, 'eval_runtime': 2.1781, 'eval_samples_per_second': 70.245, 'eval_steps_per_second': 35.352, 'epoch': 9.0}
              precision    recall  f1-score   support

     Negativ       0.73      0.89      0.80        36
     Neutral       0.68      0.79      0.73        33
     Positiv       0.89      0.75      0.81        84

    accuracy                           0.79       153
   macro avg       0.77      0.81      0.78       153
weighted avg       0.81      0.79      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 44, 1: 38, 2: 71}
Negativ Precision Score: 0.7272727272727273
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8

Neutral Precision Score: 0.6842105263157895
Neutral Recall Score: 0.78787878787878

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8537,0.772932,0.844961,0.850111,0.844961,0.842525,"{0: 49, 1: 20, 2: 60}"
2,0.5359,0.673803,0.868217,0.868217,0.868217,0.868217,"{0: 42, 1: 27, 2: 60}"
3,0.485,0.716973,0.868217,0.865643,0.868217,0.864827,"{0: 43, 1: 22, 2: 64}"
4,0.1968,0.758297,0.860465,0.863375,0.860465,0.861308,"{0: 45, 1: 27, 2: 57}"
5,0.1769,0.988339,0.852713,0.859291,0.852713,0.855029,"{0: 43, 1: 30, 2: 56}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_12
Evaluation results for deepset/gbert-base with 12 epochs and random seeds: 42, 42



{'eval_loss': 0.8562945127487183, 'eval_accuracy': 0.8169934640522876, 'eval_precision': 0.822554298881415, 'eval_recall': 0.8169934640522876, 'eval_f1': 0.8179301015919039, 'eval_class_distribution': {0: 41, 1: 35, 2: 77}, 'eval_runtime': 2.2069, 'eval_samples_per_second': 69.329, 'eval_steps_per_second': 34.891, 'epoch': 5.0}
              precision    recall  f1-score   support

     Negativ       0.76      0.89      0.82        36
     Neutral       0.66      0.76      0.70        33
     Positiv       0.92      0.80      0.85        84

    accuracy                           0.81       153
   macro avg       0.78      0.81      0.79       153
weighted avg       0.83      0.81      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 42, 1: 38, 2: 73}
Negativ Precision Score: 0.7619047619047619
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8205128205128205

Neutral Precision Score: 0.6578947368421053
Neutral Recall Score: 0

In [5]:
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=12, save=True)
    print()

# v2: early stopping patience = 3; training continues whenever there is any improvement
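The note above mentions early stopping with a patience of 3. How `absa_model` implements it is not shown in this notebook, so the following is only a minimal sketch of the patience rule. `stopping_epoch` and its `reference` argument are hypothetical names: `"best"` is the conventional rule (compare with the best score so far), while `"previous"` is one possible reading of "continues when there's any improvement" (compare with the preceding epoch only). That validation accuracy is the monitored score is also an assumption.

```python
def stopping_epoch(scores, patience=3, reference="best"):
    """Return the 1-based epoch at which training would stop, or None.

    scores    -- one "higher is better" validation score per epoch
    reference -- "best": compare with the best score seen so far
                 "previous": compare with the preceding epoch only
    """
    best = previous = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(scores, start=1):
        target = best if reference == "best" else previous
        if score > target:
            stale = 0
        else:
            stale += 1
        best = max(best, score)
        previous = score
        if stale >= patience:
            return epoch
    return None


# Validation accuracies logged for TUM/GottBERT_filtered_base_best
# in the earlier 12-epoch run (six rows in that table).
accuracies = [0.790698, 0.883721, 0.891473, 0.852713, 0.883721, 0.852713]
print(stopping_epoch(accuracies, reference="best"))      # 6
print(stopping_epoch(accuracies, reference="previous"))  # None
```

Under the `"best"` rule the run stops after epoch 6, which matches the length of that earlier table. Under the `"previous"` rule the same scores would not trigger a stop, because the counter resets whenever an epoch beats the one before it.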

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])
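The printed class weights are consistent with the common "balanced" heuristic, weight_c = N / (K * n_c), applied to the training label counts above (N = 1111 samples, K = 3 classes). The sketch below reproduces the three values under that assumption. `balanced_class_weights` is a hypothetical helper and not necessarily what `absa_model` does internally.

```python
def balanced_class_weights(counts):
    """Return N / (K * n_c) for each class, keeping the given order."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Training label counts as printed in the log above.
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}
weights = balanced_class_weights(train_counts)
print({label: round(w, 4) for label, w in weights.items()})
# {'negativ': 1.0957, 'neutral': 1.3467, 'positiv': 0.7436}
```

The rarest class (neutral) receives the largest weight, so its errors count more in a weighted loss.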


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 1676.02 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2435.17 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2527.67 examples/s]


Training results for google-bert/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8557,0.716623,0.821705,0.838272,0.821705,0.825277,"{0: 50, 1: 28, 2: 51}"
2,0.5595,0.844616,0.767442,0.824419,0.767442,0.769514,"{0: 64, 1: 25, 2: 40}"
3,0.4544,0.983193,0.813953,0.813852,0.813953,0.813603,"{0: 43, 1: 25, 2: 61}"
4,0.2348,0.940081,0.821705,0.831297,0.821705,0.823851,"{0: 45, 1: 31, 2: 53}"
5,0.218,1.073637,0.813953,0.820127,0.813953,0.815194,"{0: 48, 1: 25, 2: 56}"
6,0.1592,1.252349,0.806202,0.80699,0.806202,0.806249,"{0: 40, 1: 29, 2: 60}"
7,0.0914,1.338179,0.813953,0.818806,0.813953,0.815318,"{0: 41, 1: 31, 2: 57}"
8,0.0659,1.354868,0.829457,0.834137,0.829457,0.830966,"{0: 44, 1: 29, 2: 56}"
9,0.0389,1.389237,0.837209,0.844247,0.837209,0.838833,"{0: 46, 1: 29, 2: 54}"
10,0.0181,1.463746,0.821705,0.839788,0.821705,0.825401,"{0: 47, 1: 32, 2: 50}"



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_12
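The save locations in this log follow a visible pattern: the model id with `/` replaced by `_`, followed by the two seeds and the configured number of epochs. The sketch below rebuilds the paths from those parts. `run_tag` is a hypothetical helper inferred from the printed paths, not a function from `absa_model_train`.

```python
def run_tag(model_id, rn1, rn2, epochs):
    """Build the directory suffix seen in the log, e.g. absa_<org>_<name>_42_42_12."""
    return f"absa_{model_id.replace('/', '_')}_{rn1}_{rn2}_{epochs}"


tag = run_tag("google-bert/bert-base-german-cased", rn1=42, rn2=42, epochs=12)
model_dir = f"./saved_models/{tag}"
tokenizer_dir = f"./saved_tokenizers/{tag}"
print(model_dir)
print(tokenizer_dir)
```

Note that the suffix reflects the configured epoch count, not the epoch at which training actually stopped, and that two runs with the same seeds and epoch setting point at the same directories.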
Evaluation results for google-bert/bert-base-german-cased with 12 epochs and random seeds: 42, 42



{'eval_loss': 2.006366491317749, 'eval_accuracy': 0.7189542483660131, 'eval_precision': 0.7663681726493292, 'eval_recall': 0.7189542483660131, 'eval_f1': 0.7218794277617807, 'eval_class_distribution': {0: 52, 1: 42, 2: 59}, 'eval_runtime': 2.1847, 'eval_samples_per_second': 70.031, 'eval_steps_per_second': 35.244, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.63      0.92      0.75        36
     Neutral       0.64      0.70      0.67        33
     Positiv       0.94      0.73      0.82        84

    accuracy                           0.76       153
   macro avg       0.74      0.78      0.75       153
weighted avg       0.80      0.76      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 52, 1: 36, 2: 65}
Negativ Precision Score: 0.6346153846153846
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.75

Neutral Precision Score: 0.6388888888888888
Neutral Recall Score: 0.696969696969

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2652.88 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2456.79 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2525.91 examples/s]


Training results for dbmdz/bert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8516,0.803725,0.806202,0.858408,0.806202,0.797253,"{0: 65, 1: 12, 2: 52}"
2,0.6129,0.909515,0.806202,0.807359,0.806202,0.803416,"{0: 35, 1: 27, 2: 67}"
3,0.4935,0.917656,0.829457,0.848022,0.829457,0.825901,"{0: 55, 1: 17, 2: 57}"
4,0.3179,0.685165,0.860465,0.859938,0.860465,0.860107,"{0: 43, 1: 26, 2: 60}"
5,0.268,0.89827,0.860465,0.859042,0.860465,0.859543,"{0: 41, 1: 26, 2: 62}"
6,0.1437,1.026497,0.860465,0.859573,0.860465,0.85779,"{0: 46, 1: 22, 2: 61}"
7,0.1034,1.122055,0.852713,0.857593,0.852713,0.851013,"{0: 49, 1: 21, 2: 59}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_12
Evaluation results for dbmdz/bert-base-german-cased with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.3309756517410278, 'eval_accuracy': 0.7908496732026143, 'eval_precision': 0.8278971845148316, 'eval_recall': 0.7908496732026143, 'eval_f1': 0.7959253213868895, 'eval_class_distribution': {0: 40, 1: 48, 2: 65}, 'eval_runtime': 2.2157, 'eval_samples_per_second': 69.051, 'eval_steps_per_second': 34.751, 'epoch': 7.0}
              precision    recall  f1-score   support

     Negativ       0.79      0.83      0.81        36
     Neutral       0.57      0.82      0.68        33
     Positiv       0.94      0.76      0.84        84

    accuracy                           0.79       153
   macro avg       0.77      0.80      0.78       153
weighted avg       0.83      0.79      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 47, 2: 68}
Negativ Precision Score: 0.7894736842105263
Negativ Recall Score: 0.8333333333333334
Negativ F1 Score: 0.8108108108108109

Neutral Precision Score: 0.574468085106383
Neutral Recall Score: 0

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2639.90 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2392.83 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2449.22 examples/s]


Training results for dbmdz/bert-base-german-uncased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8697,0.810844,0.829457,0.832852,0.829457,0.82882,"{0: 48, 1: 23, 2: 58}"
2,0.5698,0.859367,0.829457,0.839456,0.829457,0.832531,"{0: 42, 1: 32, 2: 55}"
3,0.4857,0.811584,0.860465,0.863299,0.860465,0.860981,"{0: 46, 1: 26, 2: 57}"
4,0.3298,1.029312,0.806202,0.839132,0.806202,0.813219,"{0: 47, 1: 35, 2: 47}"
5,0.2507,0.891514,0.844961,0.844139,0.844961,0.844246,"{0: 43, 1: 25, 2: 61}"
6,0.1888,0.90481,0.829457,0.839428,0.829457,0.828969,"{0: 52, 1: 25, 2: 52}"
7,0.1236,0.909692,0.868217,0.871109,0.868217,0.868204,"{0: 47, 1: 26, 2: 56}"
8,0.0956,1.056518,0.860465,0.863837,0.860465,0.861626,"{0: 43, 1: 29, 2: 57}"
9,0.0357,1.221233,0.829457,0.833507,0.829457,0.830897,"{0: 41, 1: 30, 2: 58}"
10,0.0088,1.280209,0.837209,0.848035,0.837209,0.83976,"{0: 48, 1: 28, 2: 53}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_12
Evaluation results for dbmdz/bert-base-german-uncased with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.7655352354049683, 'eval_accuracy': 0.7320261437908496, 'eval_precision': 0.7672418967587035, 'eval_recall': 0.7320261437908496, 'eval_f1': 0.7363408913013123, 'eval_class_distribution': {0: 49, 1: 40, 2: 64}, 'eval_runtime': 2.1633, 'eval_samples_per_second': 70.726, 'eval_steps_per_second': 35.594, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.62      0.86      0.72        36
     Neutral       0.62      0.79      0.69        33
     Positiv       0.93      0.68      0.79        84

    accuracy                           0.75       153
   macro avg       0.72      0.78      0.73       153
weighted avg       0.79      0.75      0.75       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 50, 1: 42, 2: 61}
Negativ Precision Score: 0.62
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7209302325581395

Neutral Precision Score: 0.6190476190476191
Neutral Recall Score: 0.78787878787

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3648.27 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3110.50 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3215.43 examples/s]


Training results for FacebookAI/xlm-roberta-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9357,0.620074,0.782946,0.82118,0.782946,0.775459,"{0: 62, 1: 13, 2: 54}"
2,0.8574,0.606947,0.79845,0.838525,0.79845,0.798129,"{0: 61, 1: 16, 2: 52}"
3,0.8085,1.137659,0.821705,0.827962,0.821705,0.819866,"{0: 49, 1: 20, 2: 60}"
4,0.5411,1.131197,0.813953,0.819161,0.813953,0.813787,"{0: 49, 1: 23, 2: 57}"
5,0.4741,1.055588,0.821705,0.823796,0.821705,0.82187,"{0: 46, 1: 25, 2: 58}"
6,0.4112,1.040325,0.829457,0.83208,0.829457,0.829275,"{0: 47, 1: 24, 2: 58}"
7,0.3388,1.381436,0.79845,0.81135,0.79845,0.797104,"{0: 54, 1: 20, 2: 55}"
8,0.2502,1.415251,0.821705,0.83047,0.821705,0.819454,"{0: 32, 1: 33, 2: 64}"
9,0.1421,1.174609,0.844961,0.849488,0.844961,0.845376,"{0: 37, 1: 31, 2: 61}"
10,0.137,1.135929,0.860465,0.860183,0.860465,0.859938,"{0: 44, 1: 25, 2: 60}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_12
Evaluation results for FacebookAI/xlm-roberta-base with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.3177909851074219, 'eval_accuracy': 0.7777777777777778, 'eval_precision': 0.7908873805932629, 'eval_recall': 0.7777777777777778, 'eval_f1': 0.7824013415924629, 'eval_class_distribution': {0: 35, 1: 40, 2: 78}, 'eval_runtime': 2.0228, 'eval_samples_per_second': 75.638, 'eval_steps_per_second': 38.066, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.83      0.81      0.82        36
     Neutral       0.62      0.76      0.68        33
     Positiv       0.90      0.83      0.86        84

    accuracy                           0.81       153
   macro avg       0.78      0.80      0.79       153
weighted avg       0.82      0.81      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 35, 1: 40, 2: 78}
Negativ Precision Score: 0.8285714285714286
Negativ Recall Score: 0.8055555555555556
Negativ F1 Score: 0.8169014084507042

Neutral Precision Score: 0.625
Neutral Recall Score: 0.7575757575
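The per-class scores above can be traced back to three counts: correct predictions for the class, how often the class was predicted, and how often it truly occurs. The sketch below recomputes the "Negativ" scores of this run. The value of 29 correct predictions is not printed in the log; it is inferred here from recall times support (0.8056 * 36), and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(true_positives, predicted, support):
    """Per-class precision, recall and F1 from raw counts."""
    precision = true_positives / predicted  # share of predictions that are correct
    recall = true_positives / support       # share of true items that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 'Negativ': predicted 35 times, 36 true instances, 29 correct (inferred).
p, r, f1 = precision_recall_f1(true_positives=29, predicted=35, support=36)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 0.8286 0.8056 0.8169
```

The "macro avg" row of the report is the unweighted mean of such per-class values, while "weighted avg" weights each class by its support.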

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3712.20 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3221.39 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3272.64 examples/s]


Training results for TUM/GottBERT_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8917,0.934055,0.790698,0.816342,0.790698,0.788631,"{0: 58, 1: 17, 2: 54}"
2,0.5623,0.736761,0.852713,0.863787,0.852713,0.85216,"{0: 52, 1: 21, 2: 56}"
3,0.54,0.940871,0.844961,0.843686,0.844961,0.842311,"{0: 44, 1: 22, 2: 63}"
4,0.3007,0.966743,0.813953,0.818203,0.813953,0.810478,"{0: 51, 1: 20, 2: 58}"
5,0.3294,0.968592,0.821705,0.825249,0.821705,0.818711,"{0: 49, 1: 20, 2: 60}"
6,0.2422,0.910391,0.844961,0.847005,0.844961,0.845773,"{0: 43, 1: 28, 2: 58}"
7,0.2276,1.197924,0.821705,0.825419,0.821705,0.819232,"{0: 50, 1: 21, 2: 58}"
8,0.2029,0.899163,0.837209,0.846127,0.837209,0.838972,"{0: 36, 1: 32, 2: 61}"
9,0.1157,1.383694,0.79845,0.836058,0.79845,0.807359,"{0: 36, 1: 41, 2: 52}"
10,0.0855,1.106846,0.837209,0.836862,0.837209,0.836471,"{0: 45, 1: 25, 2: 59}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_12
Evaluation results for TUM/GottBERT_base_best with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.7277506589889526, 'eval_accuracy': 0.7516339869281046, 'eval_precision': 0.7895460121771642, 'eval_recall': 0.7516339869281046, 'eval_f1': 0.7597231758065712, 'eval_class_distribution': {0: 41, 1: 46, 2: 66}, 'eval_runtime': 2.0451, 'eval_samples_per_second': 74.813, 'eval_steps_per_second': 37.651, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.86      0.79        36
     Neutral       0.58      0.76      0.66        33
     Positiv       0.91      0.74      0.82        84

    accuracy                           0.77       153
   macro avg       0.74      0.79      0.76       153
weighted avg       0.80      0.77      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 42, 1: 43, 2: 68}
Negativ Precision Score: 0.7380952380952381
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7948717948717948

Neutral Precision Score: 0.5813953488372093
Neutral Recall Score:

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3695.60 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3103.33 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3228.74 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.84,0.893525,0.782946,0.802395,0.782946,0.777458,"{0: 57, 1: 16, 2: 56}"
2,0.5351,0.62244,0.868217,0.868217,0.868217,0.868217,"{0: 42, 1: 27, 2: 60}"
3,0.5056,0.608511,0.875969,0.877397,0.875969,0.873456,"{0: 44, 1: 21, 2: 64}"
4,0.2996,0.695269,0.883721,0.886115,0.883721,0.884297,"{0: 45, 1: 27, 2: 57}"
5,0.2957,0.800931,0.860465,0.860653,0.860465,0.858773,"{0: 41, 1: 23, 2: 65}"
6,0.2312,0.618526,0.891473,0.891163,0.891473,0.891247,"{0: 41, 1: 27, 2: 61}"
7,0.2385,0.890434,0.860465,0.860447,0.860465,0.859088,"{0: 43, 1: 23, 2: 63}"
8,0.1758,1.060983,0.844961,0.848106,0.844961,0.846014,"{0: 43, 1: 29, 2: 57}"
9,0.1133,0.918313,0.860465,0.870566,0.860465,0.862668,"{0: 42, 1: 33, 2: 54}"
10,0.0785,1.132346,0.844961,0.846487,0.844961,0.845227,"{0: 45, 1: 26, 2: 58}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_12
Evaluation results for TUM/GottBERT_filtered_base_best with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.2134432792663574, 'eval_accuracy': 0.803921568627451, 'eval_precision': 0.8072460563172638, 'eval_recall': 0.803921568627451, 'eval_f1': 0.8050392362148628, 'eval_class_distribution': {0: 38, 1: 35, 2: 80}, 'eval_runtime': 2.1113, 'eval_samples_per_second': 72.466, 'eval_steps_per_second': 36.47, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.89      0.86        36
     Neutral       0.64      0.70      0.67        33
     Positiv       0.87      0.82      0.85        84

    accuracy                           0.81       153
   macro avg       0.78      0.80      0.79       153
weighted avg       0.82      0.81      0.81       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 38, 1: 36, 2: 79}
Negativ Precision Score: 0.8421052631578947
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.8648648648648649

Neutral Precision Score: 0.6388888888888888
Neutral Recall Score: 0.

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9018,0.916637,0.75969,0.827061,0.75969,0.750952,"{0: 70, 1: 11, 2: 48}"
2,0.5943,0.551656,0.829457,0.831253,0.829457,0.829784,"{0: 45, 1: 25, 2: 59}"
3,0.497,1.007532,0.821705,0.826873,0.821705,0.816853,"{0: 45, 1: 18, 2: 66}"
4,0.2953,0.720128,0.844961,0.844767,0.844961,0.844002,"{0: 45, 1: 24, 2: 60}"
5,0.2909,0.735351,0.844961,0.844038,0.844961,0.844293,"{0: 41, 1: 26, 2: 62}"
6,0.2587,0.776919,0.860465,0.871946,0.860465,0.862761,"{0: 37, 1: 34, 2: 58}"
7,0.2173,0.884767,0.852713,0.855829,0.852713,0.851981,"{0: 48, 1: 23, 2: 58}"
8,0.2103,0.851964,0.844961,0.843051,0.844961,0.843799,"{0: 41, 1: 26, 2: 62}"
9,0.1479,1.111537,0.868217,0.875111,0.868217,0.869887,"{0: 41, 1: 32, 2: 56}"
10,0.1057,1.21074,0.844961,0.850041,0.844961,0.846444,"{0: 46, 1: 27, 2: 56}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_12
Evaluation results for TUM/GottBERT_base_last with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.660540223121643, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.843831291218884, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8161615374626672, 'eval_class_distribution': {0: 39, 1: 47, 2: 67}, 'eval_runtime': 2.0418, 'eval_samples_per_second': 74.934, 'eval_steps_per_second': 37.712, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.86      0.79        36
     Neutral       0.61      0.85      0.71        33
     Positiv       0.95      0.74      0.83        84

    accuracy                           0.79       153
   macro avg       0.77      0.82      0.78       153
weighted avg       0.83      0.79      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 42, 1: 46, 2: 65}
Negativ Precision Score: 0.7380952380952381
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7948717948717948

Neutral Precision Score: 0.6086956521739131
Neutral Recall Score: 0

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3690.19 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3206.16 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3232.25 examples/s]


Training results for distilbert/distilbert-base-german-cased with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8241,0.651714,0.813953,0.832567,0.813953,0.815965,"{0: 54, 1: 24, 2: 51}"
2,0.5978,0.778695,0.790698,0.791447,0.790698,0.789014,"{0: 43, 1: 22, 2: 64}"
3,0.4802,0.731894,0.829457,0.839281,0.829457,0.829628,"{0: 50, 1: 21, 2: 58}"
4,0.3069,0.756107,0.852713,0.871737,0.852713,0.85486,"{0: 52, 1: 28, 2: 49}"
5,0.2838,0.68021,0.860465,0.865227,0.860465,0.860827,"{0: 47, 1: 23, 2: 59}"
6,0.2301,0.747941,0.821705,0.825459,0.821705,0.822867,"{0: 42, 1: 30, 2: 57}"
7,0.1872,1.083423,0.829457,0.854538,0.829457,0.829405,"{0: 57, 1: 19, 2: 53}"
8,0.1073,0.831728,0.860465,0.864317,0.860465,0.861328,"{0: 46, 1: 27, 2: 56}"
9,0.0617,0.920683,0.829457,0.844619,0.829457,0.832717,"{0: 47, 1: 31, 2: 51}"
10,0.0353,0.975001,0.837209,0.842061,0.837209,0.838338,"{0: 45, 1: 29, 2: 55}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_12
Evaluation results for distilbert/distilbert-base-german-cased with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.8051081895828247, 'eval_accuracy': 0.7320261437908496, 'eval_precision': 0.7463019938845886, 'eval_recall': 0.7320261437908496, 'eval_f1': 0.7351234336622123, 'eval_class_distribution': {0: 44, 1: 36, 2: 73}, 'eval_runtime': 1.218, 'eval_samples_per_second': 125.612, 'eval_steps_per_second': 63.216, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.60      0.83      0.70        36
     Neutral       0.60      0.64      0.62        33
     Positiv       0.85      0.69      0.76        84

    accuracy                           0.71       153
   macro avg       0.68      0.72      0.69       153
weighted avg       0.74      0.71      0.72       153
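
The `macro avg` and `weighted avg` rows of the report above can be recomputed from the three per-class rows. The sketch below does this with the rounded values printed for distilbert/distilbert-base-german-cased. Because the inputs are already rounded to two decimals, the result is only expected to agree with the report to two decimals. The helper names `macro_average` and `weighted_average` are illustrative and are not part of `functions.absa_model_train`.

```python
# Per-class rows copied from the classification report above:
# (precision, recall, f1-score, support)
report = {
    "Negativ": (0.60, 0.83, 0.70, 36),
    "Neutral": (0.60, 0.64, 0.62, 33),
    "Positiv": (0.85, 0.69, 0.76, 84),
}


def macro_average(rows, idx):
    """Unweighted mean of one metric column over all classes."""
    values = [row[idx] for row in rows.values()]
    return sum(values) / len(values)


def weighted_average(rows, idx):
    """Mean of one metric column, weighted by each class's support."""
    total_support = sum(row[3] for row in rows.values())
    return sum(row[idx] * row[3] for row in rows.values()) / total_support


# Columns 0, 1, 2 are precision, recall, f1-score.
macro = [round(macro_average(report, i), 2) for i in range(3)]
weighted = [round(weighted_average(report, i), 2) for i in range(3)]

print("macro avg   ", macro)
print("weighted avg", weighted)
```

The macro average treats the three sentiment classes equally, while the weighted average is dominated by the 84 `Positiv` test samples.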

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 50, 1: 35, 2: 68}
Negativ Precision Score: 0.6
Negativ Recall Score: 0.8333333333333334
Negativ F1 Score: 0.6976744186046512

Neutral Precision Score: 0.6
Neutral Recall Score: 0.6363636363636364
Neutral F1

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9165,0.71764,0.775194,0.783074,0.775194,0.775916,"{0: 50, 1: 23, 2: 56}"
2,0.5669,1.159936,0.775194,0.782902,0.775194,0.772898,"{0: 32, 1: 33, 2: 64}"
3,0.4785,1.078723,0.79845,0.812383,0.79845,0.790531,"{0: 44, 1: 15, 2: 70}"
4,0.3154,1.460036,0.728682,0.762239,0.728682,0.736847,"{0: 39, 1: 40, 2: 50}"
5,0.2704,0.893261,0.813953,0.824152,0.813953,0.815751,"{0: 50, 1: 25, 2: 54}"
6,0.2088,1.010399,0.790698,0.810039,0.790698,0.793488,"{0: 54, 1: 23, 2: 52}"
7,0.2073,1.24056,0.806202,0.811032,0.806202,0.807804,"{0: 44, 1: 29, 2: 56}"
8,0.1535,1.444989,0.806202,0.830229,0.806202,0.808636,"{0: 34, 1: 40, 2: 55}"
9,0.0296,1.500004,0.790698,0.794636,0.790698,0.792181,"{0: 43, 1: 29, 2: 57}"
10,0.0276,1.462218,0.806202,0.80699,0.806202,0.806249,"{0: 40, 1: 29, 2: 60}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_12
Evaluation results for GerMedBERT/medbert-512 with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.2331597805023193, 'eval_accuracy': 0.7712418300653595, 'eval_precision': 0.8056298862070228, 'eval_recall': 0.7712418300653595, 'eval_f1': 0.7741878584052047, 'eval_class_distribution': {0: 53, 1: 35, 2: 65}, 'eval_runtime': 2.1632, 'eval_samples_per_second': 70.728, 'eval_steps_per_second': 35.595, 'epoch': 12.0}
              precision    recall  f1-score   support

     Negativ       0.63      0.94      0.76        36
     Neutral       0.75      0.82      0.78        33
     Positiv       0.95      0.71      0.82        84

    accuracy                           0.79       153
   macro avg       0.78      0.83      0.78       153
weighted avg       0.83      0.79      0.79       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 54, 1: 36, 2: 63}
Negativ Precision Score: 0.6296296296296297
Negativ Recall Score: 0.9444444444444444
Negativ F1 Score: 0.7555555555555555
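
The three `Negativ` scores above follow directly from prediction counts. The log prints 54 predicted and 36 true `Negativ` labels. The number of correct `Negativ` predictions is not printed, so the value 34 used below is an inference from the recall (34 / 36 = 0.9444...). The function name `precision_recall_f1` is illustrative only.

```python
def precision_recall_f1(true_positives, n_predicted, n_true):
    """Per-class precision, recall and F1 from raw counts."""
    precision = true_positives / n_predicted
    recall = true_positives / n_true
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 54 predicted and 36 true Negativ labels are taken from the log above;
# 34 true positives is inferred from the printed recall, not printed itself.
precision, recall, f1 = precision_recall_f1(true_positives=34, n_predicted=54, n_true=36)

print(f"Negativ precision: {precision:.4f}")
print(f"Negativ recall:    {recall:.4f}")
print(f"Negativ F1:        {f1:.4f}")
```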

Neutral Precision Score: 0.75
Neutral Recall Score: 0.81818181818

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8796,1.028829,0.790698,0.82004,0.790698,0.789985,"{0: 58, 1: 17, 2: 54}"
2,0.5559,0.606031,0.875969,0.876419,0.875969,0.875953,"{0: 44, 1: 26, 2: 59}"
3,0.5289,0.8065,0.852713,0.86148,0.852713,0.849033,"{0: 48, 1: 18, 2: 63}"
4,0.2734,0.74876,0.829457,0.832803,0.829457,0.830665,"{0: 44, 1: 28, 2: 57}"
5,0.2334,0.997604,0.837209,0.835122,0.837209,0.835319,"{0: 39, 1: 26, 2: 64}"
6,0.1729,1.296495,0.79845,0.819678,0.79845,0.804576,"{0: 42, 1: 35, 2: 52}"
7,0.0824,1.102772,0.821705,0.833598,0.821705,0.824179,"{0: 48, 1: 29, 2: 52}"
8,0.0598,1.157991,0.852713,0.855674,0.852713,0.851821,"{0: 48, 1: 23, 2: 58}"
9,0.0099,1.218146,0.852713,0.856487,0.852713,0.852531,"{0: 48, 1: 24, 2: 57}"
10,0.0001,1.301049,0.852713,0.854994,0.852713,0.852463,"{0: 47, 1: 25, 2: 57}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_12
Evaluation results for deepset/gbert-base with 12 epochs and random seeds: 42, 42



{'eval_loss': 0.8084350824356079, 'eval_accuracy': 0.7843137254901961, 'eval_precision': 0.8025398405053582, 'eval_recall': 0.7843137254901961, 'eval_f1': 0.7867893626165341, 'eval_class_distribution': {0: 49, 1: 31, 2: 73}, 'eval_runtime': 2.1895, 'eval_samples_per_second': 69.878, 'eval_steps_per_second': 35.167, 'epoch': 11.0}
              precision    recall  f1-score   support

     Negativ       0.58      0.89      0.70        36
     Neutral       0.69      0.76      0.72        33
     Positiv       0.90      0.67      0.77        84

    accuracy                           0.74       153
   macro avg       0.73      0.77      0.73       153
weighted avg       0.78      0.74      0.74       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 55, 1: 36, 2: 62}
Negativ Precision Score: 0.5818181818181818
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7032967032967034

Neutral Precision Score: 0.6944444444444444
Neutral Recall Score:

In [5]:
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=12, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])
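
The printed class weights are consistent with the common "balanced" heuristic `n_samples / (n_classes * class_count)` applied to the training label counts. Whether `absa_model` computes them exactly this way is an assumption. The sketch below only shows that the formula reproduces the printed tensor, and `balanced_class_weights` is a hypothetical helper name.

```python
# Training label counts copied from the output above.
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights(train_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
```

Rarer classes receive larger weights, so misclassifying a `neutral` review costs more in the loss than misclassifying a `positiv` one.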


Map: 100%|██████████| 1111/1111 [00:00<00:00, 2243.73 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2391.09 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2456.98 examples/s]


Training results for aari1995/German_Sentiment with 12 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8497,0.632475,0.875969,0.875969,0.875969,0.875969,"{0: 42, 1: 27, 2: 60}"
2,0.5716,0.470746,0.868217,0.868853,0.868217,0.8683,"{0: 44, 1: 26, 2: 59}"
3,0.4932,0.62057,0.875969,0.885332,0.875969,0.875285,"{0: 51, 1: 21, 2: 57}"
4,0.2626,0.663084,0.891473,0.892821,0.891473,0.890486,"{0: 46, 1: 23, 2: 60}"
5,0.2514,1.012527,0.860465,0.861878,0.860465,0.861088,"{0: 42, 1: 28, 2: 59}"
6,0.1733,0.967994,0.860465,0.872171,0.860465,0.864105,"{0: 39, 1: 33, 2: 57}"
7,0.1356,0.689719,0.899225,0.900623,0.899225,0.899837,"{0: 42, 1: 28, 2: 59}"
8,0.0731,0.741803,0.899225,0.90336,0.899225,0.900654,"{0: 41, 1: 30, 2: 58}"
9,0.0147,0.996164,0.891473,0.891473,0.891473,0.891473,"{0: 42, 1: 27, 2: 60}"
10,0.0063,1.108757,0.883721,0.884437,0.883721,0.883982,"{0: 41, 1: 28, 2: 60}"
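
The epoch table above can be parsed to see which epoch looks best under different criteria. The sketch below uses only the standard library and the rows printed for aari1995/German_Sentiment. Which metric `absa_model` actually uses to select the saved "best" model is not shown in this output, so both candidates (highest F1 and lowest validation loss) are reported as possibilities rather than as the notebook's selection rule.

```python
import csv
import io

# Rows copied verbatim from the training log above.
log_text = '''
Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8497,0.632475,0.875969,0.875969,0.875969,0.875969,"{0: 42, 1: 27, 2: 60}"
2,0.5716,0.470746,0.868217,0.868853,0.868217,0.8683,"{0: 44, 1: 26, 2: 59}"
3,0.4932,0.62057,0.875969,0.885332,0.875969,0.875285,"{0: 51, 1: 21, 2: 57}"
4,0.2626,0.663084,0.891473,0.892821,0.891473,0.890486,"{0: 46, 1: 23, 2: 60}"
5,0.2514,1.012527,0.860465,0.861878,0.860465,0.861088,"{0: 42, 1: 28, 2: 59}"
6,0.1733,0.967994,0.860465,0.872171,0.860465,0.864105,"{0: 39, 1: 33, 2: 57}"
7,0.1356,0.689719,0.899225,0.900623,0.899225,0.899837,"{0: 42, 1: 28, 2: 59}"
8,0.0731,0.741803,0.899225,0.90336,0.899225,0.900654,"{0: 41, 1: 30, 2: 58}"
9,0.0147,0.996164,0.891473,0.891473,0.891473,0.891473,"{0: 42, 1: 27, 2: 60}"
10,0.0063,1.108757,0.883721,0.884437,0.883721,0.883982,"{0: 41, 1: 28, 2: 60}"
'''.strip()

# csv handles the quoted "Class Distribution" column, which contains commas.
rows = list(csv.DictReader(io.StringIO(log_text)))

best_by_f1 = max(rows, key=lambda row: float(row["F1"]))
best_by_loss = min(rows, key=lambda row: float(row["Validation Loss"]))

print("Highest validation F1:  epoch", best_by_f1["Epoch"], best_by_f1["F1"])
print("Lowest validation loss: epoch", best_by_loss["Epoch"], best_by_loss["Validation Loss"])
```

The two criteria pick different epochs here, which is typical once training loss keeps falling while validation loss rises.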



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_12

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_12
Evaluation results for aari1995/German_Sentiment with 12 epochs and random seeds: 42, 42



{'eval_loss': 1.1062209606170654, 'eval_accuracy': 0.8562091503267973, 'eval_precision': 0.8640330576294835, 'eval_recall': 0.8562091503267973, 'eval_f1': 0.8591271587212916, 'eval_class_distribution': {0: 37, 1: 37, 2: 79}, 'eval_runtime': 5.6586, 'eval_samples_per_second': 27.038, 'eval_steps_per_second': 13.608, 'epoch': 10.0}
              precision    recall  f1-score   support

     Negativ       0.89      0.92      0.90        36
     Neutral       0.65      0.73      0.69        33
     Positiv       0.92      0.87      0.90        84

    accuracy                           0.85       153
   macro avg       0.82      0.84      0.83       153
weighted avg       0.86      0.85      0.85       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 37, 2: 79}
Negativ Precision Score: 0.8918918918918919
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.9041095890410958

Neutral Precision Score: 0.6486486486486487
Neutral Recall Score:

In [5]:
for model in models:
    print(f'training and results for {model}:')
    absa_model(data, model, rn1=42, rn2=42, epochs=20, save=True)
    print()

# Early stopping patience = 3; training continues whenever there is any improvement.
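
The comment above describes early stopping with a patience of 3. A minimal sketch of that rule in plain Python follows, assuming a higher-is-better validation score and that any strict improvement resets the counter. The monitored metric inside `absa_model` is not shown here, the scores in the example are made up for illustration, and `run_with_early_stopping` is a hypothetical name. In Hugging Face Transformers the same behaviour is available through `EarlyStoppingCallback(early_stopping_patience=3)`; whether `absa_model` uses that callback is an assumption.

```python
def run_with_early_stopping(scores, patience=3):
    """Return (stop_epoch, best_epoch) for per-epoch validation scores.

    Training stops once `patience` consecutive epochs bring no strict
    improvement over the best score seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            # Any improvement resets the patience counter.
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # Ran through all epochs without exhausting the patience budget.
    return len(scores), best_epoch


# Hypothetical validation F1 values, not taken from any run in this notebook.
hypothetical_f1 = [0.70, 0.75, 0.74, 0.76, 0.75, 0.76, 0.755, 0.80]
stop_epoch, best_epoch = run_with_early_stopping(hypothetical_f1, patience=3)
print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

With these values the run stops before the final score of 0.80 is ever reached, which illustrates why a requested epoch count (such as 20) is an upper bound rather than the number of epochs actually trained.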

training and results for google-bert/bert-base-german-cased:
Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 1772.89 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2565.37 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2636.81 examples/s]


Training results for google-bert/bert-base-german-cased with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8749,0.715817,0.782946,0.799801,0.782946,0.776017,"{0: 57, 1: 16, 2: 56}"
2,0.5492,0.956716,0.790698,0.790348,0.790698,0.79018,"{0: 42, 1: 25, 2: 62}"
3,0.4417,0.941463,0.79845,0.798715,0.79845,0.796516,"{0: 43, 1: 22, 2: 64}"
4,0.2834,0.959174,0.813953,0.84288,0.813953,0.812816,"{0: 58, 1: 17, 2: 54}"
5,0.247,1.418789,0.775194,0.781,0.775194,0.772407,"{0: 49, 1: 19, 2: 61}"
6,0.152,1.385458,0.782946,0.792307,0.782946,0.779951,"{0: 50, 1: 18, 2: 61}"
7,0.1276,1.448598,0.806202,0.821365,0.806202,0.807301,"{0: 52, 1: 20, 2: 57}"
8,0.0539,1.470951,0.806202,0.82222,0.806202,0.809309,"{0: 52, 1: 23, 2: 54}"
9,0.0215,1.950188,0.751938,0.761197,0.751938,0.754659,"{0: 44, 1: 31, 2: 54}"
10,0.0531,1.655297,0.79845,0.805118,0.79845,0.800807,"{0: 46, 1: 27, 2: 56}"



Best Model saved at: ./saved_models/absa_google-bert_bert-base-german-cased_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_google-bert_bert-base-german-cased_42_42_20
Evaluation results for google-bert/bert-base-german-cased with 20 epochs and random seeds: 42, 42



{'eval_loss': 1.4151344299316406, 'eval_accuracy': 0.7320261437908496, 'eval_precision': 0.7789244352711845, 'eval_recall': 0.7320261437908496, 'eval_f1': 0.7374777868136502, 'eval_class_distribution': {0: 57, 1: 33, 2: 63}, 'eval_runtime': 2.1683, 'eval_samples_per_second': 70.562, 'eval_steps_per_second': 35.511, 'epoch': 15.0}
              precision    recall  f1-score   support

     Negativ       0.57      0.92      0.70        36
     Neutral       0.64      0.64      0.64        33
     Positiv       0.92      0.68      0.78        84

    accuracy                           0.73       153
   macro avg       0.71      0.74      0.71       153
weighted avg       0.78      0.73      0.73       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 58, 1: 33, 2: 62}
Negativ Precision Score: 0.5689655172413793
Negativ Recall Score: 0.9166666666666666
Negativ F1 Score: 0.7021276595744681

Neutral Precision Score: 0.6363636363636364
Neutral Recall Score:

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2822.61 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2561.66 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2658.79 examples/s]


Training results for dbmdz/bert-base-german-cased with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.894,0.976164,0.75969,0.829637,0.75969,0.759899,"{0: 69, 1: 15, 2: 45}"
2,0.6079,0.543642,0.844961,0.858722,0.844961,0.847329,"{0: 49, 1: 29, 2: 51}"
3,0.5379,1.132337,0.829457,0.856758,0.829457,0.827854,"{0: 57, 1: 17, 2: 55}"
4,0.3576,0.871422,0.837209,0.837559,0.837209,0.832973,"{0: 43, 1: 20, 2: 66}"
5,0.327,1.04117,0.821705,0.829314,0.821705,0.820907,"{0: 50, 1: 21, 2: 58}"
6,0.2535,0.795151,0.868217,0.878402,0.868217,0.871175,"{0: 47, 1: 28, 2: 54}"
7,0.1937,1.332279,0.813953,0.84288,0.813953,0.812816,"{0: 58, 1: 17, 2: 54}"
8,0.1542,1.244324,0.837209,0.849976,0.837209,0.83809,"{0: 52, 1: 22, 2: 55}"
9,0.0691,1.106597,0.875969,0.878879,0.875969,0.876812,"{0: 45, 1: 27, 2: 57}"
10,0.0279,1.35772,0.821705,0.824888,0.821705,0.818919,"{0: 47, 1: 20, 2: 62}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-cased_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-cased_42_42_20
Evaluation results for dbmdz/bert-base-german-cased with 20 epochs and random seeds: 42, 42



{'eval_loss': 1.936912178993225, 'eval_accuracy': 0.7647058823529411, 'eval_precision': 0.8043127619598207, 'eval_recall': 0.7647058823529411, 'eval_f1': 0.7716849671045181, 'eval_class_distribution': {0: 37, 1: 50, 2: 66}, 'eval_runtime': 2.1878, 'eval_samples_per_second': 69.933, 'eval_steps_per_second': 35.195, 'epoch': 15.0}
              precision    recall  f1-score   support

     Negativ       0.77      0.83      0.80        36
     Neutral       0.57      0.88      0.69        33
     Positiv       0.94      0.70      0.80        84

    accuracy                           0.77       153
   macro avg       0.76      0.80      0.76       153
weighted avg       0.82      0.77      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 39, 1: 51, 2: 63}
Negativ Precision Score: 0.7692307692307693
Negativ Recall Score: 0.8333333333333334
Negativ F1 Score: 0.8

Neutral Precision Score: 0.5686274509803921
Neutral Recall Score: 0.8787878787878

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 2792.10 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2542.11 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2617.04 examples/s]


Training results for dbmdz/bert-base-german-uncased with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8083,0.861836,0.790698,0.819727,0.790698,0.784087,"{0: 61, 1: 15, 2: 53}"
2,0.5244,0.531516,0.852713,0.854524,0.852713,0.851045,"{0: 37, 1: 25, 2: 67}"
3,0.4684,0.760759,0.844961,0.873754,0.844961,0.844515,"{0: 56, 1: 17, 2: 56}"
4,0.2582,0.734496,0.852713,0.859031,0.852713,0.851789,"{0: 50, 1: 22, 2: 57}"
5,0.2321,1.220612,0.813953,0.838147,0.813953,0.810177,"{0: 27, 1: 34, 2: 68}"
6,0.1692,0.887318,0.829457,0.828936,0.829457,0.828354,"{0: 39, 1: 26, 2: 64}"
7,0.1213,1.059194,0.852713,0.8585,0.852713,0.853526,"{0: 48, 1: 26, 2: 55}"
8,0.0846,0.91713,0.868217,0.871894,0.868217,0.868089,"{0: 47, 1: 23, 2: 59}"
9,0.017,1.136773,0.860465,0.862556,0.860465,0.861009,"{0: 45, 1: 26, 2: 58}"
10,0.045,1.279904,0.852713,0.854931,0.852713,0.851534,"{0: 46, 1: 22, 2: 61}"



Best Model saved at: ./saved_models/absa_dbmdz_bert-base-german-uncased_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_dbmdz_bert-base-german-uncased_42_42_20
Evaluation results for dbmdz/bert-base-german-uncased with 20 epochs and random seeds: 42, 42



{'eval_loss': 2.244220733642578, 'eval_accuracy': 0.7254901960784313, 'eval_precision': 0.7438864291236209, 'eval_recall': 0.7254901960784313, 'eval_f1': 0.7273543734727045, 'eval_class_distribution': {0: 50, 1: 31, 2: 72}, 'eval_runtime': 2.1729, 'eval_samples_per_second': 70.412, 'eval_steps_per_second': 35.436, 'epoch': 11.0}
              precision    recall  f1-score   support

     Negativ       0.64      0.89      0.74        36
     Neutral       0.66      0.64      0.65        33
     Positiv       0.90      0.76      0.83        84

    accuracy                           0.76       153
   macro avg       0.73      0.76      0.74       153
weighted avg       0.79      0.76      0.77       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 50, 1: 32, 2: 71}
Negativ Precision Score: 0.64
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7441860465116279

Neutral Precision Score: 0.65625
Neutral Recall Score: 0.6363636363636364
Neutra

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3844.90 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3377.48 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3464.42 examples/s]


Training results for FacebookAI/xlm-roberta-base with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9247,0.521425,0.837209,0.848854,0.837209,0.838761,"{0: 51, 1: 23, 2: 55}"
2,0.7738,0.904179,0.782946,0.800519,0.782946,0.776927,"{0: 53, 1: 15, 2: 61}"
3,0.8305,1.142046,0.806202,0.834677,0.806202,0.801752,"{0: 60, 1: 16, 2: 53}"
4,0.5298,1.083167,0.813953,0.823026,0.813953,0.812844,"{0: 52, 1: 21, 2: 56}"
5,0.4665,1.230084,0.782946,0.78201,0.782946,0.782416,"{0: 41, 1: 27, 2: 61}"
6,0.4967,1.257094,0.806202,0.820342,0.806202,0.809644,"{0: 49, 1: 28, 2: 52}"
7,0.4268,1.277421,0.79845,0.829018,0.79845,0.795726,"{0: 59, 1: 16, 2: 54}"
8,0.3177,0.934092,0.837209,0.840279,0.837209,0.838276,"{0: 44, 1: 28, 2: 57}"
9,0.2464,1.176097,0.821705,0.842225,0.821705,0.825926,"{0: 42, 1: 36, 2: 51}"
10,0.1953,1.267008,0.821705,0.824883,0.821705,0.822435,"{0: 46, 1: 26, 2: 57}"



Best Model saved at: ./saved_models/absa_FacebookAI_xlm-roberta-base_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_FacebookAI_xlm-roberta-base_42_42_20
Evaluation results for FacebookAI/xlm-roberta-base with 20 epochs and random seeds: 42, 42



{'eval_loss': 0.6734592318534851, 'eval_accuracy': 0.7973856209150327, 'eval_precision': 0.794736441789639, 'eval_recall': 0.7973856209150327, 'eval_f1': 0.7938625979349961, 'eval_class_distribution': {0: 41, 1: 27, 2: 85}, 'eval_runtime': 2.0051, 'eval_samples_per_second': 76.305, 'eval_steps_per_second': 38.402, 'epoch': 11.0}
              precision    recall  f1-score   support

     Negativ       0.72      0.86      0.78        36
     Neutral       0.65      0.52      0.58        33
     Positiv       0.86      0.86      0.86        84

    accuracy                           0.78       153
   macro avg       0.74      0.74      0.74       153
weighted avg       0.78      0.78      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 43, 1: 26, 2: 84}
Negativ Precision Score: 0.7209302325581395
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7848101265822784

Neutral Precision Score: 0.6538461538461539
Neutral Recall Score: 

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3932.38 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3413.51 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3490.29 examples/s]


Training results for TUM/GottBERT_base_best with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8784,0.916075,0.79845,0.830776,0.79845,0.792119,"{0: 60, 1: 14, 2: 55}"
2,0.5832,0.81042,0.821705,0.833879,0.821705,0.823552,"{0: 36, 1: 35, 2: 58}"
3,0.5218,0.874525,0.821705,0.819373,0.821705,0.818624,"{0: 44, 1: 22, 2: 63}"
4,0.3452,0.641003,0.860465,0.876881,0.860465,0.863263,"{0: 51, 1: 27, 2: 51}"
5,0.3329,0.922862,0.852713,0.854126,0.852713,0.852424,"{0: 38, 1: 29, 2: 62}"
6,0.2664,0.632766,0.868217,0.869791,0.868217,0.868493,"{0: 45, 1: 26, 2: 58}"
7,0.2744,0.925601,0.821705,0.839627,0.821705,0.817479,"{0: 52, 1: 16, 2: 61}"
8,0.2521,0.941017,0.829457,0.830715,0.829457,0.827903,"{0: 46, 1: 22, 2: 61}"
9,0.1966,1.157889,0.806202,0.809258,0.806202,0.806662,"{0: 38, 1: 30, 2: 61}"
10,0.1745,1.037245,0.829457,0.830995,0.829457,0.825802,"{0: 48, 1: 20, 2: 61}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_best_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_best_42_42_20
Evaluation results for TUM/GottBERT_base_best with 20 epochs and random seeds: 42, 42



{'eval_loss': 2.010373830795288, 'eval_accuracy': 0.7450980392156863, 'eval_precision': 0.7773233302645067, 'eval_recall': 0.7450980392156863, 'eval_f1': 0.748923840480934, 'eval_class_distribution': {0: 52, 1: 35, 2: 66}, 'eval_runtime': 2.023, 'eval_samples_per_second': 75.629, 'eval_steps_per_second': 38.062, 'epoch': 20.0}
              precision    recall  f1-score   support

     Negativ       0.62      0.89      0.73        36
     Neutral       0.68      0.70      0.69        33
     Positiv       0.94      0.75      0.83        84

    accuracy                           0.77       153
   macro avg       0.74      0.78      0.75       153
weighted avg       0.81      0.77      0.78       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 52, 1: 34, 2: 67}
Negativ Precision Score: 0.6153846153846154
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7272727272727273

Neutral Precision Score: 0.6764705882352942
Neutral Recall Score: 0.

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3906.12 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3383.46 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3448.13 examples/s]


Training results for TUM/GottBERT_filtered_base_best with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8274,1.032586,0.767442,0.804786,0.767442,0.770646,"{0: 60, 1: 23, 2: 46}"
2,0.5439,0.608068,0.883721,0.883361,0.883721,0.883164,"{0: 42, 1: 25, 2: 62}"
3,0.5151,0.621424,0.899225,0.899893,0.899225,0.898154,"{0: 44, 1: 23, 2: 62}"
4,0.3027,0.663368,0.860465,0.878508,0.860465,0.860882,"{0: 52, 1: 19, 2: 58}"
5,0.3211,0.704264,0.891473,0.892073,0.891473,0.890345,"{0: 43, 1: 23, 2: 63}"
6,0.2445,0.678609,0.860465,0.867571,0.860465,0.861635,"{0: 48, 1: 27, 2: 54}"
7,0.2398,0.858771,0.860465,0.86697,0.860465,0.860743,"{0: 49, 1: 23, 2: 57}"
8,0.2304,0.781542,0.875969,0.881229,0.875969,0.874742,"{0: 48, 1: 21, 2: 60}"
9,0.1557,0.616346,0.891473,0.892183,0.891473,0.890891,"{0: 45, 1: 24, 2: 60}"
10,0.1021,0.749528,0.883721,0.89161,0.883721,0.884926,"{0: 49, 1: 25, 2: 55}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_filtered_base_best_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_filtered_base_best_42_42_20
Evaluation results for TUM/GottBERT_filtered_base_best with 20 epochs and random seeds: 42, 42



{'eval_loss': 1.5799249410629272, 'eval_accuracy': 0.8104575163398693, 'eval_precision': 0.8175262664426751, 'eval_recall': 0.8104575163398693, 'eval_f1': 0.8127208440933932, 'eval_class_distribution': {0: 38, 1: 37, 2: 78}, 'eval_runtime': 2.0265, 'eval_samples_per_second': 75.501, 'eval_steps_per_second': 37.997, 'epoch': 16.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.86      0.85        36
     Neutral       0.68      0.76      0.71        33
     Positiv       0.91      0.86      0.88        84

    accuracy                           0.84       153
   macro avg       0.81      0.83      0.82       153
weighted avg       0.84      0.84      0.84       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 37, 2: 79}
Negativ Precision Score: 0.8378378378378378
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.8493150684931506

Neutral Precision Score: 0.6756756756756757
Neutral Recall Score:

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for TUM/GottBERT_base_last with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8898,1.086711,0.736434,0.847252,0.736434,0.708801,"{0: 75, 1: 5, 2: 49}"
2,0.5955,0.647895,0.860465,0.861988,0.860465,0.860612,"{0: 39, 1: 29, 2: 61}"
3,0.5028,1.264444,0.790698,0.805407,0.790698,0.780736,"{0: 46, 1: 14, 2: 69}"
4,0.3565,0.721889,0.868217,0.868034,0.868217,0.868056,"{0: 41, 1: 27, 2: 61}"
5,0.3139,0.834875,0.844961,0.848438,0.844961,0.846226,"{0: 44, 1: 28, 2: 57}"
6,0.2585,0.83836,0.829457,0.835613,0.829457,0.824206,"{0: 51, 1: 18, 2: 60}"
7,0.2563,1.244473,0.806202,0.840166,0.806202,0.801918,"{0: 56, 1: 14, 2: 59}"
8,0.2648,1.032038,0.821705,0.823632,0.821705,0.819476,"{0: 47, 1: 21, 2: 61}"
9,0.2171,1.075173,0.829457,0.83947,0.829457,0.832177,"{0: 48, 1: 27, 2: 54}"
10,0.2207,1.265244,0.813953,0.8265,0.813953,0.808761,"{0: 54, 1: 17, 2: 58}"



Best Model saved at: ./saved_models/absa_TUM_GottBERT_base_last_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_TUM_GottBERT_base_last_42_42_20
Evaluation results for TUM/GottBERT_base_last with 20 epochs and random seeds: 42, 42



{'eval_loss': 1.43964684009552, 'eval_accuracy': 0.7320261437908496, 'eval_precision': 0.7592760180995475, 'eval_recall': 0.7320261437908496, 'eval_f1': 0.7337787004109357, 'eval_class_distribution': {0: 48, 1: 40, 2: 65}, 'eval_runtime': 2.0039, 'eval_samples_per_second': 76.352, 'eval_steps_per_second': 38.425, 'epoch': 20.0}
              precision    recall  f1-score   support

     Negativ       0.65      0.89      0.75        36
     Neutral       0.63      0.73      0.68        33
     Positiv       0.89      0.70      0.79        84

    accuracy                           0.75       153
   macro avg       0.73      0.77      0.74       153
weighted avg       0.78      0.75      0.75       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 49, 1: 38, 2: 66}
Negativ Precision Score: 0.6530612244897959
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.7529411764705882

Neutral Precision Score: 0.631578947368421
Neutral Recall Score: 0.

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 1111/1111 [00:00<00:00, 3972.28 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 3333.88 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 3396.29 examples/s]


Training results for distilbert/distilbert-base-german-cased with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8211,0.633758,0.782946,0.803739,0.782946,0.78363,"{0: 55, 1: 19, 2: 55}"
2,0.6002,0.53891,0.844961,0.850371,0.844961,0.845734,"{0: 48, 1: 25, 2: 56}"
3,0.4864,0.56832,0.868217,0.868555,0.868217,0.867988,"{0: 44, 1: 25, 2: 60}"
4,0.2893,0.918813,0.813953,0.852167,0.813953,0.815434,"{0: 61, 1: 20, 2: 48}"
5,0.2909,0.66344,0.860465,0.863299,0.860465,0.860981,"{0: 46, 1: 26, 2: 57}"
6,0.2125,0.743646,0.844961,0.852988,0.844961,0.846478,"{0: 41, 1: 33, 2: 55}"
7,0.1578,1.264177,0.775194,0.798626,0.775194,0.774866,"{0: 56, 1: 18, 2: 55}"
8,0.1219,1.035909,0.813953,0.820626,0.813953,0.815359,"{0: 42, 1: 32, 2: 55}"
9,0.046,1.070395,0.837209,0.841354,0.837209,0.838441,"{0: 44, 1: 29, 2: 56}"
10,0.032,1.081289,0.829457,0.829378,0.829457,0.829124,"{0: 41, 1: 29, 2: 59}"



Best Model saved at: ./saved_models/absa_distilbert_distilbert-base-german-cased_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_distilbert_distilbert-base-german-cased_42_42_20
Evaluation results for distilbert/distilbert-base-german-cased with 20 epochs and random seeds: 42, 42



{'eval_loss': 0.9798324704170227, 'eval_accuracy': 0.803921568627451, 'eval_precision': 0.806424028646251, 'eval_recall': 0.803921568627451, 'eval_f1': 0.8021585186483933, 'eval_class_distribution': {0: 44, 1: 28, 2: 81}, 'eval_runtime': 1.2, 'eval_samples_per_second': 127.5, 'eval_steps_per_second': 64.167, 'epoch': 15.0}
              precision    recall  f1-score   support

     Negativ       0.74      0.89      0.81        36
     Neutral       0.68      0.58      0.62        33
     Positiv       0.88      0.86      0.87        84

    accuracy                           0.80       153
   macro avg       0.77      0.77      0.77       153
weighted avg       0.80      0.80      0.80       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 43, 1: 28, 2: 82}
Negativ Precision Score: 0.7441860465116279
Negativ Recall Score: 0.8888888888888888
Negativ F1 Score: 0.810126582278481

Neutral Precision Score: 0.6785714285714286
Neutral Recall Score: 0.57575

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for GerMedBERT/medbert-512 with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.9016,0.974941,0.790698,0.793225,0.790698,0.784208,"{0: 42, 1: 18, 2: 69}"
2,0.5552,1.485358,0.713178,0.730826,0.713178,0.693508,"{0: 21, 1: 31, 2: 77}"
3,0.5363,1.307692,0.744186,0.739684,0.744186,0.741075,"{0: 41, 1: 24, 2: 64}"
4,0.2988,1.120012,0.751938,0.758253,0.751938,0.750327,"{0: 52, 1: 22, 2: 55}"
5,0.3203,1.341211,0.744186,0.748196,0.744186,0.745918,"{0: 42, 1: 29, 2: 58}"
6,0.2393,1.265736,0.744186,0.763956,0.744186,0.749939,"{0: 49, 1: 29, 2: 51}"
7,0.2267,1.593361,0.775194,0.798573,0.775194,0.778667,"{0: 55, 1: 24, 2: 50}"
8,0.1618,1.636195,0.782946,0.802468,0.782946,0.787379,"{0: 37, 1: 37, 2: 55}"
9,0.0766,1.610823,0.790698,0.801865,0.790698,0.794811,"{0: 43, 1: 31, 2: 55}"
10,0.0377,1.459463,0.806202,0.818622,0.806202,0.810589,"{0: 42, 1: 32, 2: 55}"



Best Model saved at: ./saved_models/absa_GerMedBERT_medbert-512_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_GerMedBERT_medbert-512_42_42_20
Evaluation results for GerMedBERT/medbert-512 with 20 epochs and random seeds: 42, 42



{'eval_loss': 2.0073153972625732, 'eval_accuracy': 0.7777777777777778, 'eval_precision': 0.8028309362960871, 'eval_recall': 0.7777777777777778, 'eval_f1': 0.7806260046083314, 'eval_class_distribution': {0: 49, 1: 36, 2: 68}, 'eval_runtime': 2.1607, 'eval_samples_per_second': 70.812, 'eval_steps_per_second': 35.637, 'epoch': 17.0}
              precision    recall  f1-score   support

     Negativ       0.62      0.86      0.72        36
     Neutral       0.58      0.64      0.61        33
     Positiv       0.88      0.70      0.78        84

    accuracy                           0.73       153
   macro avg       0.69      0.73      0.70       153
weighted avg       0.76      0.73      0.73       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 50, 1: 36, 2: 67}
Negativ Precision Score: 0.62
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.7209302325581395

Neutral Precision Score: 0.5833333333333334
Neutral Recall Score: 0.63636363636

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training results for deepset/gbert-base with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8695,0.827326,0.837209,0.841133,0.837209,0.836361,"{0: 48, 1: 22, 2: 59}"
2,0.5379,0.642612,0.844961,0.845834,0.844961,0.845313,"{0: 42, 1: 28, 2: 59}"
3,0.581,0.986815,0.821705,0.828826,0.821705,0.822037,"{0: 49, 1: 22, 2: 58}"
4,0.3016,0.90734,0.829457,0.832306,0.829457,0.828453,"{0: 47, 1: 22, 2: 60}"
5,0.2528,0.941541,0.852713,0.852366,0.852713,0.852446,"{0: 43, 1: 26, 2: 60}"
6,0.1681,1.176246,0.829457,0.836794,0.829457,0.830554,"{0: 49, 1: 24, 2: 56}"
7,0.1386,1.192837,0.837209,0.84236,0.837209,0.837803,"{0: 48, 1: 24, 2: 57}"
8,0.0813,1.214781,0.821705,0.825479,0.821705,0.823132,"{0: 44, 1: 28, 2: 57}"
9,0.0505,1.408071,0.813953,0.824255,0.813953,0.816017,"{0: 49, 1: 27, 2: 53}"
10,0.0271,1.474205,0.821705,0.834032,0.821705,0.824419,"{0: 48, 1: 29, 2: 52}"



Best Model saved at: ./saved_models/absa_deepset_gbert-base_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_deepset_gbert-base_42_42_20
Evaluation results for deepset/gbert-base with 20 epochs and random seeds: 42, 42



{'eval_loss': 1.168603777885437, 'eval_accuracy': 0.8300653594771242, 'eval_precision': 0.8292870075958311, 'eval_recall': 0.8300653594771242, 'eval_f1': 0.8296025399973829, 'eval_class_distribution': {0: 37, 1: 32, 2: 84}, 'eval_runtime': 2.1808, 'eval_samples_per_second': 70.157, 'eval_steps_per_second': 35.308, 'epoch': 17.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.86      0.85        36
     Neutral       0.68      0.70      0.69        33
     Positiv       0.88      0.86      0.87        84

    accuracy                           0.82       153
   macro avg       0.80      0.81      0.80       153
weighted avg       0.83      0.82      0.82       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 34, 2: 82}
Negativ Precision Score: 0.8378378378378378
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.8493150684931506

Neutral Precision Score: 0.6764705882352942
Neutral Recall Score: 

In [8]:
absa_model(data, "aari1995/German_Sentiment", rn1=42, rn2=42, epochs=20, save=True)

Training Sentiment label count:  {'negativ': 338, 'neutral': 275, 'positiv': 498}
Validation Sentiment label count:  {'negativ': 42, 'neutral': 27, 'positiv': 60}
Test Sentiment label count:  {'negativ': 36, 'neutral': 33, 'positiv': 84}
Class weights for (negative, neutral, positive): tensor([1.0957, 1.3467, 0.7436])


Map: 100%|██████████| 1111/1111 [00:00<00:00, 2582.86 examples/s]
Map: 100%|██████████| 129/129 [00:00<00:00, 2306.09 examples/s]
Map: 100%|██████████| 153/153 [00:00<00:00, 2370.20 examples/s]


Training results for aari1995/German_Sentiment with 20 epochs and random seeds: 42, 42



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,0.8642,0.6025,0.883721,0.886762,0.883721,0.882243,"{0: 48, 1: 22, 2: 59}"
2,0.5475,0.718769,0.875969,0.88268,0.875969,0.87705,"{0: 37, 1: 32, 2: 60}"
3,0.5522,0.570527,0.891473,0.897188,0.891473,0.890652,"{0: 47, 1: 21, 2: 61}"
4,0.3046,0.612081,0.868217,0.872502,0.868217,0.869262,"{0: 46, 1: 27, 2: 56}"
5,0.2334,0.659868,0.875969,0.874395,0.875969,0.873891,"{0: 44, 1: 23, 2: 62}"
6,0.1562,0.614664,0.891473,0.900194,0.891473,0.89335,"{0: 48, 1: 27, 2: 54}"
7,0.0801,0.769258,0.906977,0.911498,0.906977,0.906686,"{0: 48, 1: 23, 2: 58}"
8,0.1055,0.897929,0.868217,0.876876,0.868217,0.870811,"{0: 41, 1: 32, 2: 56}"
9,0.0168,0.818693,0.914729,0.929083,0.914729,0.915143,"{0: 52, 1: 21, 2: 56}"
10,0.0019,0.762654,0.922481,0.924614,0.922481,0.921983,"{0: 45, 1: 23, 2: 61}"



Best Model saved at: ./saved_models/absa_aari1995_German_Sentiment_42_42_20

Tokenizer for best Model saved at: ./saved_tokenizers/absa_aari1995_German_Sentiment_42_42_20
Evaluation results for aari1995/German_Sentiment with 20 epochs and random seeds: 42, 42



{'eval_loss': 1.3664731979370117, 'eval_accuracy': 0.8431372549019608, 'eval_precision': 0.8596078431372548, 'eval_recall': 0.8431372549019608, 'eval_f1': 0.8474435812060673, 'eval_class_distribution': {0: 36, 1: 42, 2: 75}, 'eval_runtime': 5.6335, 'eval_samples_per_second': 27.159, 'eval_steps_per_second': 13.668, 'epoch': 15.0}
              precision    recall  f1-score   support

     Negativ       0.84      0.86      0.85        36
     Neutral       0.68      0.82      0.74        33
     Positiv       0.95      0.86      0.90        84

    accuracy                           0.85       153
   macro avg       0.82      0.85      0.83       153
weighted avg       0.86      0.85      0.85       153

True label distribution: {0: 36, 1: 33, 2: 84}
Predicted label distribution: {0: 37, 1: 40, 2: 76}
Negativ Precision Score: 0.8378378378378378
Negativ Recall Score: 0.8611111111111112
Negativ F1 Score: 0.8493150684931506

Neutral Precision Score: 0.675
Neutral Recall Score: 0.8181818181
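
The `Class weights for (negative, neutral, positive)` line in the logs above is numerically consistent with the common "balanced" weighting, `n_samples / (n_classes * count_c)`. The code inside `absa_model` is not shown here, so the following is only a minimal sketch under that assumption; `balanced_class_weights` is a hypothetical helper name, not a function from the project.

```python
def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for each class label in `counts`."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Training label counts as printed in the log above
train_counts = {"negativ": 338, "neutral": 275, "positiv": 498}

weights = balanced_class_weights(train_counts)
# Rounded to four decimals, these match tensor([1.0957, 1.3467, 0.7436]) from the log
print({label: round(w, 4) for label, w in weights.items()})
```

Weights of this kind are typically passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on the rarer `neutral` class cost more than errors on `positiv`. Whether `absa_model` applies them in exactly this way is an assumption.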

## Cross-Validation to check stability

In [5]:
avg_metrics, std_metrics = absa_model_kfold(data, "dbmdz/bert-base-german-cased", rn1=42, rn2=42, epochs=2, n_splits=3, save=False)


Training Fold 1/3

Training Fold 1 Sentiment label count:  {'negativ': 274, 'neutral': 252, 'positiv': 424}
Validation Fold 1 Sentiment label count:  {'negativ': 73, 'neutral': 46, 'positiv': 93}
Test Fold 1 Sentiment label count:  {'negativ': 69, 'neutral': 37, 'positiv': 125}
Class weights for (negative, neutral, positive): tensor([1.1557, 1.2566, 0.7469])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 3455.01 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 3549.15 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 3911.52 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.363605,0.533019,0.715617,0.533019,0.439898,"{0: 1, 1: 96, 2: 115}"
2,1.042400,1.160035,0.688679,0.691718,0.688679,0.688074,"{0: 84, 1: 39, 2: 89}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.51      0.64      0.57        69
     Neutral       0.50      0.43      0.46        37
     Positiv       0.85      0.77      0.81       125

    accuracy                           0.68       231
   macro avg       0.62      0.61      0.61       231
weighted avg       0.69      0.68      0.68       231


Fold 1 Performance Metrics:
GPU: NVIDIA A30
Avg epoch: 47.13s
Total: 94.26s
Peak memory: 2618.0MB
Avg batch: 0.0961s

Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 4107.30 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 3775.02 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 4209.27 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.594389,0.852321,0.852008,0.852321,0.851061,"{0: 59, 1: 58, 2: 120}"
2,0.807500,0.78224,0.827004,0.831525,0.827004,0.828223,"{0: 72, 1: 60, 2: 105}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.62      0.60      0.61        62
     Neutral       0.61      0.67      0.64        61
     Positiv       0.83      0.80      0.81       109

    accuracy                           0.71       232
   macro avg       0.69      0.69      0.69       232
weighted avg       0.71      0.71      0.71       232


Fold 2 Performance Metrics:
GPU: NVIDIA A30
Avg epoch: 45.46s
Total: 90.92s
Peak memory: 2614.0MB
Avg batch: 0.0957s

Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 4076.14 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 3599.80 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4023.50 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.172984,0.74477,0.766577,0.74477,0.734304,"{0: 58, 1: 49, 2: 132}"
2,0.782400,1.219372,0.740586,0.750621,0.740586,0.737006,"{0: 57, 1: 61, 2: 121}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.84      0.83      0.84        84
     Neutral       0.76      0.70      0.73        53
     Positiv       0.91      0.95      0.93       105

    accuracy                           0.86       242
   macro avg       0.84      0.83      0.83       242
weighted avg       0.85      0.86      0.85       242


Fold 3 Performance Metrics:
GPU: NVIDIA A30
Avg epoch: 44.64s
Total: 89.29s
Peak memory: 2614.0MB
Avg batch: 0.0953s

Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.7589 ± 0.0666
eval_precision: 0.7615 ± 0.0635
eval_recall: 0.7589 ± 0.0666
eval_f1: 0.7595 ± 0.0654
eval_loss: 0.9903 ± 0.1931

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.66      0.70      0.68       215
     Neutral       0.64      0.62      0.63       151
     Positiv       0.86      0.83      0.85       339

    accuracy   
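
Throughout the logs, the Recall value equals the Accuracy value (for example `eval_accuracy: 0.7589` and `eval_recall: 0.7589` in the summary above). This is what one would expect if recall is averaged with support weights: each class recall `TP_c / n_c` is weighted by `n_c / N`, so the sum reduces to `sum(TP_c) / N`, which is the accuracy. The sketch below checks this identity on toy labels that are not taken from the dataset; that `absa_model_kfold` uses weighted averaging is an assumption inferred from the numbers.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


# Toy labels for illustration only (0 = negativ, 1 = neutral, 2 = positiv)
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2, 2, 0]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)  # the two values coincide
```

Because of this identity, the Recall column adds no information beyond Accuracy. The macro-averaged scores in the classification reports are the more informative view of the minority `Neutral` class.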

In [6]:
# Collect the cross-validation results for every model in `models`
all_model_metrics = {}

for model in models:
    print(f'training and results for {model}:')
    avg_metrics, std_metrics = absa_model_kfold(data, model, rn1=42, rn2=42, epochs=5, n_splits=3, save=False)

    # Store the fold averages and standard deviations together under the model name
    all_model_metrics[model] = {
        'avg_metrics': avg_metrics,
        'std_metrics': std_metrics
    }

    print()

# Access the stored results per model, for example:
# all_model_metrics['dbmdz/bert-base-german-cased']['avg_metrics']
# all_model_metrics['dbmdz/bert-base-german-cased']['std_metrics']
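
Once the loop has finished, the collected results can be turned into one line per model to make the comparison easier to read. The sketch below assumes that `avg_metrics` and `std_metrics` are plain dicts keyed by metric names such as `eval_accuracy` and `eval_f1`, as suggested by the printed summaries; `summarize` is a hypothetical helper, and the example entry reuses the google-bert summary values printed in this cell's output.

```python
def summarize(all_model_metrics, keys=("eval_accuracy", "eval_f1")):
    """Return one formatted row per model, sorted by mean eval_f1 (descending)."""
    ranked = sorted(
        all_model_metrics.items(),
        key=lambda item: item[1]["avg_metrics"]["eval_f1"],
        reverse=True,
    )
    rows = []
    for name, m in ranked:
        cells = [
            f"{k}: {m['avg_metrics'][k]:.4f} ± {m['std_metrics'][k]:.4f}"
            for k in keys
        ]
        rows.append(f"{name} | " + " | ".join(cells))
    return rows


# Example entry with the google-bert summary values from the log;
# the dict layout of avg_metrics / std_metrics is an assumption.
example = {
    "google-bert/bert-base-german-cased": {
        "avg_metrics": {"eval_accuracy": 0.8121, "eval_f1": 0.8152},
        "std_metrics": {"eval_accuracy": 0.0342, "eval_f1": 0.0335},
    }
}

for row in summarize(example):
    print(row)
```

In the notebook, `summarize(all_model_metrics)` would be called directly on the dictionary built in the loop above.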

training and results for google-bert/bert-base-german-cased:

Training Fold 1/3

Training Fold 1 Sentiment label count:  {'negativ': 274, 'neutral': 252, 'positiv': 424}
Validation Fold 1 Sentiment label count:  {'negativ': 73, 'neutral': 46, 'positiv': 93}
Test Fold 1 Sentiment label count:  {'negativ': 69, 'neutral': 37, 'positiv': 125}
Class weights for (negative, neutral, positive): tensor([1.1557, 1.2566, 0.7469])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 3822.69 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 3575.76 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 3864.92 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.777691,0.79717,0.811701,0.79717,0.801479,"{0: 66, 1: 58, 2: 88}"
2,0.820800,1.135846,0.792453,0.791074,0.792453,0.791568,"{0: 76, 1: 44, 2: 92}"
3,0.466000,1.001936,0.801887,0.802199,0.801887,0.799729,"{0: 84, 1: 41, 2: 87}"
4,0.346500,1.191324,0.778302,0.778923,0.778302,0.778403,"{0: 76, 1: 46, 2: 90}"
5,0.195900,1.269362,0.787736,0.795677,0.787736,0.789836,"{0: 78, 1: 51, 2: 83}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.77      0.80      0.79        69
     Neutral       0.55      0.78      0.64        37
     Positiv       0.95      0.82      0.88       125

    accuracy                           0.81       231
   macro avg       0.76      0.80      0.77       231
weighted avg       0.83      0.81      0.81       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 4108.83 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 3715.49 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 4224.11 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.850315,0.797468,0.811282,0.797468,0.799783,"{0: 71, 1: 70, 2: 96}"
2,0.788400,0.901961,0.805907,0.821133,0.805907,0.804966,"{0: 88, 1: 44, 2: 105}"
3,0.560800,0.98452,0.78903,0.811445,0.78903,0.793331,"{0: 55, 1: 79, 2: 103}"
4,0.301400,0.938552,0.818565,0.828272,0.818565,0.814261,"{0: 79, 1: 39, 2: 119}"
5,0.213700,0.835949,0.831224,0.833137,0.831224,0.831849,"{0: 70, 1: 58, 2: 109}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.66      0.76      0.71        62
     Neutral       0.68      0.62      0.65        61
     Positiv       0.83      0.80      0.81       109

    accuracy                           0.74       232
   macro avg       0.72      0.73      0.72       232
weighted avg       0.74      0.74      0.74       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 3963.89 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 3484.59 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4003.90 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.106438,0.753138,0.770163,0.753138,0.745277,"{0: 74, 1: 50, 2: 115}"
2,0.793700,1.313978,0.723849,0.730398,0.723849,0.717767,"{0: 74, 1: 56, 2: 109}"
3,0.482400,1.307499,0.719665,0.720409,0.719665,0.716513,"{0: 62, 1: 66, 2: 111}"
4,0.326200,1.553217,0.698745,0.697124,0.698745,0.695356,"{0: 52, 1: 77, 2: 110}"
5,0.160200,1.711333,0.702929,0.70121,0.702929,0.698634,"{0: 52, 1: 74, 2: 113}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.80      0.88      0.84        84
     Neutral       0.76      0.70      0.73        53
     Positiv       0.97      0.92      0.95       105

    accuracy                           0.86       242
   macro avg       0.84      0.83      0.84       242
weighted avg       0.86      0.86      0.86       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8121 ± 0.0342
eval_precision: 0.8254 ± 0.0312
eval_recall: 0.8121 ± 0.0342
eval_f1: 0.8152 ± 0.0335
eval_loss: 0.9095 ± 0.2645

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.75      0.82      0.78       215
     Neutral       0.66      0.69      0.67       151
     Positiv       0.92      0.84      0.88       339

    accuracy                           0.80       705
   macro avg       0.77      0.78      0.78       705
weighted avg       0.81  

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 3871.53 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 3342.87 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 3711.86 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.946276,0.787736,0.802563,0.787736,0.791936,"{0: 75, 1: 56, 2: 81}"
2,0.810200,1.025055,0.787736,0.792792,0.787736,0.789454,"{0: 67, 1: 50, 2: 95}"
3,0.550200,0.864694,0.839623,0.843901,0.839623,0.839212,"{0: 84, 1: 44, 2: 84}"
4,0.408700,1.077379,0.811321,0.816118,0.811321,0.81299,"{0: 75, 1: 50, 2: 87}"
5,0.257500,1.212649,0.787736,0.80361,0.787736,0.792549,"{0: 73, 1: 57, 2: 82}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.73      0.96      0.83        69
     Neutral       0.72      0.49      0.58        37
     Positiv       0.95      0.88      0.91       125

    accuracy                           0.84       231
   macro avg       0.80      0.77      0.77       231
weighted avg       0.85      0.84      0.83       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])
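The printed class weights match the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). The sketch below reproduces the Fold 2 values from the training label counts printed above. That `absa_model` computes the weights exactly this way is an assumption, and `balanced_class_weights` is a hypothetical helper, not a function from the notebook.

```python
from collections import Counter

def balanced_class_weights(labels, classes):
    """Return one weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(classes)
    return [n_samples / (n_classes * counts[c]) for c in classes]

# Class order assumed to follow the log line: (negative, neutral, positive)
classes = ["negativ", "neutral", "positiv"]

# Training Fold 2 label counts taken from the log: 288 / 216 / 420
labels = ["negativ"] * 288 + ["neutral"] * 216 + ["positiv"] * 420

weights = balanced_class_weights(labels, classes)
print([round(w, 4) for w in weights])  # [1.0694, 1.4259, 0.7333]
```

Rarer classes (here: neutral) get a weight above 1, so their errors count more in a weighted loss.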


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 3927.19 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 3807.61 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 4107.43 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.794086,0.835443,0.837959,0.835443,0.836036,"{0: 64, 1: 64, 2: 109}"
2,0.806300,0.658604,0.852321,0.857694,0.852321,0.852907,"{0: 77, 1: 55, 2: 105}"
3,0.597600,0.855432,0.839662,0.839898,0.839662,0.839704,"{0: 64, 1: 59, 2: 114}"
4,0.355400,0.944974,0.839662,0.846381,0.839662,0.84141,"{0: 73, 1: 61, 2: 103}"
5,0.244700,0.922602,0.831224,0.834008,0.831224,0.832233,"{0: 68, 1: 61, 2: 108}"
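The "Class Distribution" column appears to count the predicted labels per class on the validation split: the counts sum to the validation size (237 for this fold) and change from epoch to epoch. A minimal sketch of how such a count could be produced from model logits follows. The helper name, the example logits, and the label mapping 0 = negativ, 1 = neutral, 2 = positiv are assumptions for illustration.

```python
from collections import Counter

def predicted_class_distribution(logits):
    """Count how often each class index is the argmax of a row of logits."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return dict(sorted(Counter(preds).items()))

# Hypothetical logits for four examples and three classes
example_logits = [
    [2.1, 0.3, -1.0],   # argmax -> 0
    [0.2, 1.5, 0.1],    # argmax -> 1
    [-0.5, 0.0, 3.2],   # argmax -> 2
    [1.9, 1.1, 0.4],    # argmax -> 0
]

print(predicted_class_distribution(example_logits))  # {0: 2, 1: 1, 2: 1}
```

Comparing these counts with the true label counts of the split gives a quick check for a model that collapses onto one class.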



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.56      0.82      0.67        62
     Neutral       0.68      0.56      0.61        61
     Positiv       0.93      0.78      0.85       109

    accuracy                           0.73       232
   macro avg       0.72      0.72      0.71       232
weighted avg       0.77      0.73      0.74       232
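The "macro avg" and "weighted avg" rows can be recomputed from the per-class rows of the report: the macro average is the unweighted mean over classes, and the weighted average uses each class's support as its weight. The sketch below does this for the Fold 2 report above. Because it starts from the rounded per-class values, results for other reports may differ by 0.01 from the printed rows.

```python
# Per-class rows of the Fold 2 report: (precision, recall, f1, support)
rows = {
    "Negativ": (0.56, 0.82, 0.67, 62),
    "Neutral": (0.68, 0.56, 0.61, 61),
    "Positiv": (0.93, 0.78, 0.85, 109),
}

def macro_avg(rows, idx):
    """Unweighted mean of one metric column over all classes."""
    return sum(r[idx] for r in rows.values()) / len(rows)

def weighted_avg(rows, idx):
    """Support-weighted mean of one metric column over all classes."""
    total = sum(r[3] for r in rows.values())
    return sum(r[idx] * r[3] for r in rows.values()) / total

macro = [round(macro_avg(rows, i), 2) for i in range(3)]
weighted = [round(weighted_avg(rows, i), 2) for i in range(3)]

print(macro)     # [0.72, 0.72, 0.71]
print(weighted)  # [0.77, 0.73, 0.74]
```

The support-weighted recall (0.73) equals the accuracy row, which is why the Accuracy and Recall columns of the epoch tables are identical.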


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 4169.11 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 3630.20 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4099.47 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.476761,0.711297,0.748451,0.711297,0.695021,"{0: 67, 1: 39, 2: 133}"
2,0.769700,1.509264,0.719665,0.737548,0.719665,0.72083,"{0: 86, 1: 69, 2: 84}"
3,0.529900,1.452126,0.719665,0.740773,0.719665,0.718523,"{0: 89, 1: 58, 2: 92}"
4,0.379900,1.504852,0.736402,0.738817,0.736402,0.737038,"{0: 68, 1: 78, 2: 93}"
5,0.219200,1.482222,0.740586,0.740199,0.740586,0.740338,"{0: 61, 1: 79, 2: 99}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.85      0.73      0.78        84
     Neutral       0.63      0.79      0.70        53
     Positiv       0.93      0.91      0.92       105

    accuracy                           0.82       242
   macro avg       0.80      0.81      0.80       242
weighted avg       0.84      0.82      0.83       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8037 ± 0.0478
eval_precision: 0.8158 ± 0.0386
eval_recall: 0.8037 ± 0.0478
eval_f1: 0.8045 ± 0.0463
eval_loss: 0.9353 ± 0.1292
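A "mean ± standard deviation" summary like the one above can be built by collecting one metrics dictionary per fold and aggregating per key. The sketch below is an illustration under assumptions: the per-fold values are made up, `summarize_folds` is a hypothetical helper, and the population standard deviation is used because the log does not say whether the notebook uses the population or the sample form.

```python
from statistics import mean, pstdev

def summarize_folds(fold_metrics):
    """Return {metric: (mean, population std)} over a list of per-fold dicts."""
    summary = {}
    for key in fold_metrics[0]:
        values = [m[key] for m in fold_metrics]
        summary[key] = (mean(values), pstdev(values))
    return summary

# Hypothetical per-fold results (not taken from the log)
fold_metrics = [
    {"eval_accuracy": 0.80, "eval_loss": 0.90},
    {"eval_accuracy": 0.85, "eval_loss": 1.00},
    {"eval_accuracy": 0.75, "eval_loss": 1.10},
]

summary = summarize_folds(fold_metrics)
for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.4f} ± {std:.4f}")
```

With only three folds the spread estimate is coarse, so differences between models smaller than the reported ± values should be read with caution.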

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.70      0.83      0.76       215
     Neutral       0.66      0.62      0.64       151
     Positiv       0.94      0.86      0.90       339

    accuracy                           0.80       705
   macro avg       0.77      0.77      0.77       705
weighted avg       0.81  

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 3992.89 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 3662.85 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 3885.80 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.9189,0.792453,0.813084,0.792453,0.797748,"{0: 65, 1: 62, 2: 85}"
2,0.821600,0.954257,0.820755,0.839581,0.820755,0.825116,"{0: 63, 1: 61, 2: 88}"
3,0.514100,1.089519,0.79717,0.827101,0.79717,0.804321,"{0: 78, 1: 60, 2: 74}"
4,0.327200,1.251209,0.79717,0.810078,0.79717,0.800931,"{0: 75, 1: 55, 2: 82}"
5,0.155900,1.286212,0.801887,0.818466,0.801887,0.806366,"{0: 73, 1: 58, 2: 81}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.85      0.75      0.80        69
     Neutral       0.57      0.76      0.65        37
     Positiv       0.93      0.90      0.91       125

    accuracy                           0.83       231
   macro avg       0.78      0.80      0.79       231
weighted avg       0.85      0.83      0.84       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 3765.80 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 3736.92 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 3573.92 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.65301,0.839662,0.846572,0.839662,0.841577,"{0: 71, 1: 63, 2: 103}"
2,0.774800,0.762807,0.814346,0.832874,0.814346,0.815937,"{0: 88, 1: 47, 2: 102}"
3,0.563300,0.765792,0.843882,0.851201,0.843882,0.845991,"{0: 71, 1: 63, 2: 103}"
4,0.318300,0.807222,0.839662,0.838087,0.839662,0.838703,"{0: 67, 1: 55, 2: 115}"
5,0.262100,0.768236,0.848101,0.847758,0.848101,0.847902,"{0: 67, 1: 57, 2: 113}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.60      0.74      0.66        62
     Neutral       0.66      0.64      0.65        61
     Positiv       0.90      0.79      0.84       109

    accuracy                           0.74       232
   macro avg       0.72      0.72      0.72       232
weighted avg       0.75      0.74      0.74       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 4096.64 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 3598.67 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4049.62 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.302296,0.715481,0.750745,0.715481,0.716587,"{0: 94, 1: 54, 2: 91}"
2,0.805900,1.340515,0.736402,0.736682,0.736402,0.735414,"{0: 54, 1: 84, 2: 101}"
3,0.526600,1.355965,0.753138,0.752845,0.753138,0.751503,"{0: 60, 1: 72, 2: 107}"
4,0.323800,1.619038,0.728033,0.729507,0.728033,0.726025,"{0: 50, 1: 83, 2: 106}"
5,0.162500,1.685025,0.707113,0.705488,0.707113,0.704865,"{0: 55, 1: 77, 2: 107}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.85      0.79      0.81        84
     Neutral       0.65      0.74      0.69        53
     Positiv       0.94      0.93      0.94       105

    accuracy                           0.84       242
   macro avg       0.81      0.82      0.81       242
weighted avg       0.84      0.84      0.84       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8165 ± 0.0473
eval_precision: 0.8226 ± 0.0445
eval_recall: 0.8165 ± 0.0473
eval_f1: 0.8182 ± 0.0465
eval_loss: 1.0356 ± 0.3556

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.76      0.76      0.76       215
     Neutral       0.63      0.70      0.66       151
     Positiv       0.92      0.87      0.90       339

    accuracy                           0.80       705
   macro avg       0.77      0.78      0.77       705
weighted avg       0.81  

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 4977.94 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 4243.40 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 4481.92 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.940266,0.617925,0.748964,0.617925,0.621562,"{0: 30, 1: 110, 2: 72}"
2,0.896700,1.401684,0.721698,0.752958,0.721698,0.728561,"{0: 57, 1: 67, 2: 88}"
3,0.738000,1.121706,0.792453,0.796388,0.792453,0.790311,"{0: 88, 1: 39, 2: 85}"
4,0.645300,1.498908,0.768868,0.765149,0.768868,0.766632,"{0: 76, 1: 42, 2: 94}"
5,0.554700,1.335152,0.787736,0.797626,0.787736,0.791346,"{0: 75, 1: 52, 2: 85}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.79      0.88      0.84        69
     Neutral       0.68      0.68      0.68        37
     Positiv       0.94      0.88      0.91       125

    accuracy                           0.85       231
   macro avg       0.80      0.81      0.81       231
weighted avg       0.85      0.85      0.85       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 4963.43 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 4155.10 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 4630.51 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.571491,0.848101,0.857421,0.848101,0.846767,"{0: 81, 1: 44, 2: 112}"
2,0.816300,0.868014,0.843882,0.847045,0.843882,0.838602,"{0: 60, 1: 44, 2: 133}"
3,0.721500,0.905235,0.848101,0.855942,0.848101,0.847516,"{0: 52, 1: 68, 2: 117}"
4,0.588900,0.925685,0.843882,0.849904,0.843882,0.837763,"{0: 61, 1: 41, 2: 135}"
5,0.427300,0.794344,0.85654,0.85714,0.85654,0.856329,"{0: 61, 1: 61, 2: 115}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.64      0.76      0.69        62
     Neutral       0.71      0.75      0.73        61
     Positiv       0.91      0.78      0.84       109

    accuracy                           0.77       232
   macro avg       0.75      0.76      0.75       232
weighted avg       0.79      0.77      0.77       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 3084.61 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 2507.65 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 2429.80 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.947346,0.719665,0.750838,0.719665,0.71031,"{0: 89, 1: 44, 2: 106}"
2,0.846700,1.55315,0.715481,0.711645,0.715481,0.711237,"{0: 63, 1: 68, 2: 108}"
3,0.765200,1.583897,0.740586,0.746308,0.740586,0.733883,"{0: 76, 1: 56, 2: 107}"
4,0.647000,1.743541,0.719665,0.719627,0.719665,0.716389,"{0: 50, 1: 79, 2: 110}"
5,0.437800,1.814499,0.723849,0.724987,0.723849,0.718255,"{0: 70, 1: 60, 2: 109}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.82      0.77      0.80        84
     Neutral       0.68      0.75      0.71        53
     Positiv       0.93      0.92      0.93       105

    accuracy                           0.83       242
   macro avg       0.81      0.82      0.81       242
weighted avg       0.84      0.83      0.84       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8125 ± 0.0383
eval_precision: 0.8196 ± 0.0325
eval_recall: 0.8125 ± 0.0383
eval_f1: 0.8145 ± 0.0368
eval_loss: 1.1257 ± 0.2811

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.75      0.80      0.78       215
     Neutral       0.69      0.74      0.71       151
     Positiv       0.93      0.86      0.89       339

    accuracy                           0.82       705
   macro avg       0.79      0.80      0.79       705
weighted avg       0.82  

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 5091.72 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 4075.39 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 3391.30 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.884847,0.801887,0.831373,0.801887,0.808671,"{0: 72, 1: 64, 2: 76}"
2,0.858100,1.030374,0.79717,0.799645,0.79717,0.797507,"{0: 67, 1: 46, 2: 99}"
3,0.545000,0.911507,0.806604,0.812547,0.806604,0.803437,"{0: 91, 1: 37, 2: 84}"
4,0.391100,0.872582,0.839623,0.84575,0.839623,0.841894,"{0: 68, 1: 51, 2: 93}"
5,0.268000,0.91257,0.839623,0.847022,0.839623,0.842087,"{0: 72, 1: 53, 2: 87}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.83      0.86      0.84        69
     Neutral       0.66      0.73      0.69        37
     Positiv       0.93      0.89      0.91       125

    accuracy                           0.85       231
   macro avg       0.81      0.82      0.82       231
weighted avg       0.86      0.85      0.85       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 5078.97 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 4244.85 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 4662.48 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.662607,0.85654,0.857799,0.85654,0.856129,"{0: 59, 1: 59, 2: 119}"
2,0.776000,0.656891,0.852321,0.860848,0.852321,0.854143,"{0: 69, 1: 67, 2: 101}"
3,0.508900,0.679287,0.869198,0.871952,0.869198,0.869466,"{0: 59, 1: 63, 2: 115}"
4,0.319400,0.659168,0.881857,0.882724,0.881857,0.88221,"{0: 65, 1: 60, 2: 112}"
5,0.292500,0.722814,0.864979,0.868579,0.864979,0.866299,"{0: 65, 1: 63, 2: 109}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.68      0.84      0.75        62
     Neutral       0.69      0.66      0.67        61
     Positiv       0.93      0.83      0.88       109

    accuracy                           0.79       232
   macro avg       0.77      0.78      0.77       232
weighted avg       0.80      0.79      0.79       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 5145.96 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 4311.23 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4775.25 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.149005,0.74477,0.771746,0.74477,0.730385,"{0: 79, 1: 42, 2: 118}"
2,0.781000,1.443302,0.748954,0.753595,0.748954,0.75064,"{0: 67, 1: 81, 2: 91}"
3,0.523300,1.396153,0.74477,0.749106,0.74477,0.741075,"{0: 69, 1: 61, 2: 109}"
4,0.351500,1.444126,0.732218,0.730262,0.732218,0.727973,"{0: 59, 1: 68, 2: 112}"
5,0.264600,1.481215,0.736402,0.73436,0.736402,0.732547,"{0: 59, 1: 69, 2: 111}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.88      0.85      0.86        84
     Neutral       0.66      0.83      0.73        53
     Positiv       0.98      0.88      0.92       105

    accuracy                           0.86       242
   macro avg       0.84      0.85      0.84       242
weighted avg       0.87      0.86      0.86       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8323 ± 0.0339
eval_precision: 0.8432 ± 0.0301
eval_recall: 0.8323 ± 0.0339
eval_f1: 0.8349 ± 0.0342
eval_loss: 0.9277 ± 0.2036

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.80      0.85      0.82       215
     Neutral       0.67      0.74      0.70       151
     Positiv       0.95      0.87      0.90       339

    accuracy                           0.83       705
   macro avg       0.80      0.82      0.81       705
weighted avg       0.84  

Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 5106.77 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 4182.03 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 4803.33 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.069056,0.740566,0.822841,0.740566,0.754755,"{0: 49, 1: 86, 2: 77}"
2,0.789100,1.029357,0.816038,0.821063,0.816038,0.817919,"{0: 69, 1: 51, 2: 92}"
3,0.523600,0.882868,0.84434,0.853584,0.84434,0.846439,"{0: 77, 1: 53, 2: 82}"
4,0.419600,0.87583,0.84434,0.84889,0.84434,0.845346,"{0: 78, 1: 49, 2: 85}"
5,0.273600,0.990556,0.830189,0.845305,0.830189,0.83474,"{0: 72, 1: 57, 2: 83}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.81      0.90      0.85        69
     Neutral       0.74      0.70      0.72        37
     Positiv       0.94      0.90      0.92       125

    accuracy                           0.87       231
   macro avg       0.83      0.83      0.83       231
weighted avg       0.87      0.87      0.87       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 5109.91 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 4322.95 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 4801.56 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.579585,0.85654,0.864544,0.85654,0.857732,"{0: 58, 1: 70, 2: 109}"
2,0.731000,0.457439,0.902954,0.903802,0.902954,0.902892,"{0: 70, 1: 54, 2: 113}"
3,0.520800,0.850861,0.831224,0.845891,0.831224,0.834282,"{0: 58, 1: 74, 2: 105}"
4,0.305600,0.609961,0.864979,0.863709,0.864979,0.862949,"{0: 68, 1: 50, 2: 119}"
5,0.275900,0.780908,0.85654,0.864164,0.85654,0.858755,"{0: 65, 1: 67, 2: 105}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.67      0.89      0.76        62
     Neutral       0.72      0.62      0.67        61
     Positiv       0.93      0.83      0.87       109

    accuracy                           0.79       232
   macro avg       0.77      0.78      0.77       232
weighted avg       0.80      0.79      0.79       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_filtered_base_best and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 5164.43 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 4242.41 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4783.87 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.141562,0.753138,0.773085,0.753138,0.745097,"{0: 88, 1: 50, 2: 101}"
2,0.775600,1.14392,0.76569,0.767682,0.76569,0.76596,"{0: 56, 1: 85, 2: 98}"
3,0.557100,1.289446,0.757322,0.760941,0.757322,0.754018,"{0: 75, 1: 63, 2: 101}"
4,0.377800,1.43226,0.757322,0.760165,0.757322,0.755672,"{0: 57, 1: 70, 2: 112}"
5,0.250700,1.377816,0.757322,0.757063,0.757322,0.754975,"{0: 62, 1: 69, 2: 108}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.88      0.77      0.82        84
     Neutral       0.65      0.83      0.73        53
     Positiv       0.95      0.90      0.93       105

    accuracy                           0.84       242
   macro avg       0.83      0.84      0.83       242
weighted avg       0.86      0.84      0.85       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8395 ± 0.0329
eval_precision: 0.8508 ± 0.0258
eval_recall: 0.8395 ± 0.0329
eval_f1: 0.8408 ± 0.0326
eval_loss: 0.8185 ± 0.1097

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.78      0.85      0.81       215
     Neutral       0.69      0.72      0.70       151
     Positiv       0.94      0.88      0.91       339

    accuracy                           0.83       705
   macro avg       0.80      0.81      0.81       705
weighted avg       0.84  

Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.92579,0.783019,0.835194,0.783019,0.794593,"{0: 64, 1: 74, 2: 74}"
2,0.831300,0.85417,0.816038,0.817813,0.816038,0.81672,"{0: 70, 1: 48, 2: 94}"
3,0.541300,0.882144,0.820755,0.827651,0.820755,0.821966,"{0: 82, 1: 47, 2: 83}"
4,0.416300,0.923926,0.830189,0.835487,0.830189,0.832232,"{0: 72, 1: 51, 2: 89}"
5,0.289000,0.87912,0.849057,0.855818,0.849057,0.850817,"{0: 78, 1: 50, 2: 84}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.77      0.87      0.82        69
     Neutral       0.71      0.65      0.68        37
     Positiv       0.94      0.90      0.92       125

    accuracy                           0.85       231
   macro avg       0.81      0.80      0.80       231
weighted avg       0.85      0.85      0.85       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.610276,0.85654,0.856024,0.85654,0.853725,"{0: 63, 1: 49, 2: 125}"
2,0.741700,0.651074,0.860759,0.862805,0.860759,0.861388,"{0: 69, 1: 60, 2: 108}"
3,0.549400,0.798577,0.860759,0.86813,0.860759,0.86266,"{0: 62, 1: 68, 2: 107}"
4,0.343500,0.651554,0.873418,0.873542,0.873418,0.873402,"{0: 68, 1: 57, 2: 112}"
5,0.269700,0.720586,0.869198,0.870674,0.869198,0.869194,"{0: 60, 1: 61, 2: 116}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.66      0.85      0.75        62
     Neutral       0.73      0.62      0.67        61
     Positiv       0.91      0.83      0.87       109

    accuracy                           0.78       232
   macro avg       0.77      0.77      0.76       232
weighted avg       0.80      0.78      0.79       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of the model checkpoint at TUM/GottBERT_base_last were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at TUM/GottBERT_base_last and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.029835,0.76569,0.799328,0.76569,0.760032,"{0: 89, 1: 47, 2: 103}"
2,0.781200,1.316112,0.753138,0.754213,0.753138,0.753593,"{0: 64, 1: 80, 2: 95}"
3,0.533700,1.322697,0.769874,0.776365,0.769874,0.766997,"{0: 72, 1: 61, 2: 106}"
4,0.340600,1.332245,0.769874,0.768572,0.769874,0.768977,"{0: 63, 1: 76, 2: 100}"
5,0.228700,1.472779,0.753138,0.751839,0.753138,0.751599,"{0: 56, 1: 79, 2: 104}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.88      0.86      0.87        84
     Neutral       0.74      0.81      0.77        53
     Positiv       0.96      0.93      0.95       105

    accuracy                           0.88       242
   macro avg       0.86      0.87      0.86       242
weighted avg       0.88      0.88      0.88       242
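
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows of the report: the macro average is the plain mean over classes, the weighted average uses each class's support as its weight. A small sketch using the Fold 3 values above; because the inputs are already rounded to two decimals, the results only reproduce the report after rounding. The helper names are hypothetical:

```python
# Per-class rows of the Fold 3 classification report (rounded values).
report = {
    "Negativ": {"precision": 0.88, "recall": 0.86, "f1": 0.87, "support": 84},
    "Neutral": {"precision": 0.74, "recall": 0.81, "f1": 0.77, "support": 53},
    "Positiv": {"precision": 0.96, "recall": 0.93, "f1": 0.95, "support": 105},
}

def macro_avg(report, metric):
    """Unweighted mean of a metric over all classes."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_avg(report, metric):
    """Mean of a metric over all classes, weighted by class support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total

for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(report, metric), 2),
          round(weighted_avg(report, metric), 2))
```

The gap between the two averages reflects class imbalance: the weighted average is pulled toward the well-classified majority class `Positiv`, while the macro average gives the weaker `Neutral` class equal influence.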


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8390 ± 0.0447
eval_precision: 0.8452 ± 0.0397
eval_recall: 0.8390 ± 0.0447
eval_f1: 0.8400 ± 0.0446
eval_loss: 0.9557 ± 0.2712

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.77      0.86      0.81       215
     Neutral       0.73      0.70      0.71       151
     Positiv       0.94      0.89      0.91       339

    accuracy                           0.84       705
   macro avg       0.81      0.81      0.81       705
weighted avg       0.84  

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 950/950 [00:00<00:00, 4491.87 examples/s]
Map: 100%|██████████| 212/212 [00:00<00:00, 3647.95 examples/s]
Map: 100%|██████████| 231/231 [00:00<00:00, 4241.88 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.822533,0.764151,0.770107,0.764151,0.765735,"{0: 66, 1: 53, 2: 93}"
2,0.799800,0.95475,0.787736,0.789837,0.787736,0.787589,"{0: 66, 1: 46, 2: 100}"
3,0.496400,0.885898,0.806604,0.809744,0.806604,0.807134,"{0: 79, 1: 47, 2: 86}"
4,0.384600,1.127799,0.79717,0.804814,0.79717,0.799864,"{0: 68, 1: 53, 2: 91}"
5,0.275300,1.118231,0.79717,0.804735,0.79717,0.799902,"{0: 69, 1: 53, 2: 90}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.68      0.91      0.78        69
     Neutral       0.63      0.59      0.61        37
     Positiv       0.93      0.78      0.85       125

    accuracy                           0.79       231
   macro avg       0.75      0.76      0.75       231
weighted avg       0.81      0.79      0.79       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 4994.03 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 4095.14 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 5055.64 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.618776,0.78481,0.783236,0.78481,0.783864,"{0: 67, 1: 55, 2: 115}"
2,0.723000,0.670976,0.822785,0.828079,0.822785,0.821225,"{0: 82, 1: 52, 2: 103}"
3,0.516000,0.738165,0.831224,0.8401,0.831224,0.833611,"{0: 62, 1: 69, 2: 106}"
4,0.299100,0.734327,0.843882,0.843934,0.843882,0.843821,"{0: 68, 1: 58, 2: 111}"
5,0.236300,0.78217,0.831224,0.838543,0.831224,0.83322,"{0: 66, 1: 67, 2: 104}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.64      0.79      0.71        62
     Neutral       0.74      0.64      0.68        61
     Positiv       0.86      0.82      0.84       109

    accuracy                           0.76       232
   macro avg       0.75      0.75      0.74       232
weighted avg       0.77      0.76      0.76       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 4342.91 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 3699.79 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 4897.85 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.149027,0.715481,0.751736,0.715481,0.702072,"{0: 66, 1: 41, 2: 132}"
2,0.754900,1.281742,0.690377,0.690031,0.690377,0.687325,"{0: 66, 1: 66, 2: 107}"
3,0.552700,1.447227,0.723849,0.728289,0.723849,0.719065,"{0: 70, 1: 59, 2: 110}"
4,0.384000,1.671694,0.707113,0.704336,0.707113,0.702295,"{0: 52, 1: 74, 2: 113}"
5,0.253700,1.762395,0.694561,0.696552,0.694561,0.688007,"{0: 55, 1: 63, 2: 121}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.81      0.83      0.82        84
     Neutral       0.75      0.72      0.73        53
     Positiv       0.92      0.92      0.92       105

    accuracy                           0.85       242
   macro avg       0.83      0.82      0.83       242
weighted avg       0.85      0.85      0.85       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8080 ± 0.0275
eval_precision: 0.8187 ± 0.0228
eval_recall: 0.8080 ± 0.0275
eval_f1: 0.8100 ± 0.0266
eval_loss: 1.0116 ± 0.2606

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.72      0.85      0.78       215
     Neutral       0.71      0.66      0.68       151
     Positiv       0.91      0.83      0.87       339

    accuracy                           0.80       705
   macro avg       0.78      0.78      0.78       705
weighted avg       0.81  

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.113644,0.721698,0.73994,0.721698,0.724025,"{0: 56, 1: 61, 2: 95}"
2,0.856200,1.137417,0.759434,0.764115,0.759434,0.760309,"{0: 65, 1: 52, 2: 95}"
3,0.562200,1.039343,0.792453,0.798361,0.792453,0.794492,"{0: 77, 1: 49, 2: 86}"
4,0.367900,1.144373,0.783019,0.799265,0.783019,0.788143,"{0: 70, 1: 58, 2: 84}"
5,0.284500,1.215371,0.792453,0.814491,0.792453,0.798388,"{0: 71, 1: 61, 2: 80}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.69      0.77      0.73        69
     Neutral       0.50      0.68      0.57        37
     Positiv       0.96      0.80      0.87       125

    accuracy                           0.77       231
   macro avg       0.72      0.75      0.72       231
weighted avg       0.81      0.77      0.78       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 924/924 [00:00<00:00, 3556.94 examples/s]
Map: 100%|██████████| 237/237 [00:00<00:00, 2672.30 examples/s]
Map: 100%|██████████| 232/232 [00:00<00:00, 3366.60 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.785198,0.780591,0.796225,0.780591,0.782415,"{0: 85, 1: 56, 2: 96}"
2,0.824900,0.919566,0.772152,0.774692,0.772152,0.769116,"{0: 80, 1: 46, 2: 111}"
3,0.580100,1.182793,0.78481,0.797937,0.78481,0.787519,"{0: 66, 1: 72, 2: 99}"
4,0.303400,1.134372,0.797468,0.807314,0.797468,0.797869,"{0: 83, 1: 53, 2: 101}"
5,0.271400,1.161903,0.797468,0.799514,0.797468,0.798087,"{0: 70, 1: 59, 2: 108}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.55      0.66      0.60        62
     Neutral       0.60      0.57      0.59        61
     Positiv       0.84      0.76      0.80       109

    accuracy                           0.69       232
   macro avg       0.66      0.67      0.66       232
weighted avg       0.70      0.69      0.69       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GerMedBERT/medbert-512 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 912/912 [00:00<00:00, 4055.59 examples/s]
Map: 100%|██████████| 239/239 [00:00<00:00, 3184.51 examples/s]
Map: 100%|██████████| 242/242 [00:00<00:00, 3871.00 examples/s]


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.368158,0.711297,0.751262,0.711297,0.704801,"{0: 93, 1: 44, 2: 102}"
2,0.817800,1.376052,0.728033,0.730225,0.728033,0.727853,"{0: 70, 1: 72, 2: 97}"
3,0.550900,1.699784,0.698745,0.706768,0.698745,0.689244,"{0: 76, 1: 51, 2: 112}"
4,0.375200,1.701387,0.686192,0.682413,0.686192,0.683444,"{0: 57, 1: 77, 2: 105}"
5,0.183600,1.815511,0.677824,0.673042,0.677824,0.672434,"{0: 59, 1: 68, 2: 112}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.85      0.79      0.81        84
     Neutral       0.63      0.77      0.69        53
     Positiv       0.94      0.89      0.91       105

    accuracy                           0.83       242
   macro avg       0.81      0.82      0.81       242
weighted avg       0.84      0.83      0.83       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.7549 ± 0.0537
eval_precision: 0.7709 ± 0.0509
eval_recall: 0.7549 ± 0.0537
eval_f1: 0.7601 ± 0.0529
eval_loss: 1.2710 ± 0.3454

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.70      0.74      0.72       215
     Neutral       0.58      0.67      0.62       151
     Positiv       0.91      0.81      0.86       339

    accuracy                           0.76       705
   macro avg       0.73      0.74      0.73       705
weighted avg       0.78  

Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.894189,0.787736,0.826328,0.787736,0.797341,"{0: 64, 1: 69, 2: 79}"
2,0.750400,0.805046,0.830189,0.842594,0.830189,0.834185,"{0: 73, 1: 55, 2: 84}"
3,0.509800,0.867564,0.839623,0.850348,0.839623,0.842031,"{0: 78, 1: 53, 2: 81}"
4,0.345600,1.002077,0.825472,0.833519,0.825472,0.828122,"{0: 75, 1: 52, 2: 85}"
5,0.223100,1.047652,0.825472,0.838496,0.825472,0.829214,"{0: 75, 1: 55, 2: 82}"



Classification Report for Fold 1:
              precision    recall  f1-score   support

     Negativ       0.83      0.86      0.84        69
     Neutral       0.70      0.76      0.73        37
     Positiv       0.94      0.90      0.92       125

    accuracy                           0.87       231
   macro avg       0.82      0.84      0.83       231
weighted avg       0.87      0.87      0.87       231


Training Fold 2/3

Training Fold 2 Sentiment label count:  {'negativ': 288, 'neutral': 216, 'positiv': 420}
Validation Fold 2 Sentiment label count:  {'negativ': 66, 'neutral': 58, 'positiv': 113}
Test Fold 2 Sentiment label count:  {'negativ': 62, 'neutral': 61, 'positiv': 109}
Class weights for (negative, neutral, positive): tensor([1.0694, 1.4259, 0.7333])


Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,0.660488,0.85654,0.860974,0.85654,0.856169,"{0: 55, 1: 64, 2: 118}"
2,0.756400,0.568352,0.848101,0.851781,0.848101,0.849282,"{0: 63, 1: 64, 2: 110}"
3,0.509300,0.67266,0.869198,0.874764,0.869198,0.871051,"{0: 67, 1: 64, 2: 106}"
4,0.277800,0.681783,0.860759,0.86387,0.860759,0.861918,"{0: 68, 1: 61, 2: 108}"
5,0.224900,0.724932,0.873418,0.876072,0.873418,0.874344,"{0: 68, 1: 61, 2: 108}"



Classification Report for Fold 2:
              precision    recall  f1-score   support

     Negativ       0.72      0.81      0.76        62
     Neutral       0.70      0.74      0.72        61
     Positiv       0.93      0.84      0.88       109

    accuracy                           0.81       232
   macro avg       0.79      0.80      0.79       232
weighted avg       0.82      0.81      0.81       232


Training Fold 3/3

Training Fold 3 Sentiment label count:  {'negativ': 270, 'neutral': 202, 'positiv': 440}
Validation Fold 3 Sentiment label count:  {'negativ': 62, 'neutral': 80, 'positiv': 97}
Test Fold 3 Sentiment label count:  {'negativ': 84, 'neutral': 53, 'positiv': 105}
Class weights for (negative, neutral, positive): tensor([1.1259, 1.5050, 0.6909])


Device set to use cuda:0
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/gbert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Class Distribution
1,No log,1.090109,0.774059,0.775709,0.774059,0.769561,"{0: 69, 1: 62, 2: 108}"
2,0.745600,1.29289,0.778243,0.791338,0.778243,0.780771,"{0: 56, 1: 97, 2: 86}"
3,0.484800,1.434442,0.76569,0.77332,0.76569,0.761216,"{0: 71, 1: 58, 2: 110}"
4,0.312300,1.409014,0.782427,0.793514,0.782427,0.784368,"{0: 52, 1: 95, 2: 92}"
5,0.186000,1.463964,0.782427,0.781518,0.782427,0.781596,"{0: 59, 1: 78, 2: 102}"



Classification Report for Fold 3:
              precision    recall  f1-score   support

     Negativ       0.85      0.76      0.81        84
     Neutral       0.63      0.83      0.72        53
     Positiv       0.98      0.90      0.94       105

    accuracy                           0.84       242
   macro avg       0.82      0.83      0.82       242
weighted avg       0.86      0.84      0.84       242


Cross-Validation Summary

Average Metrics Across Folds:
eval_accuracy: 0.8398 ± 0.0160
eval_precision: 0.8474 ± 0.0130
eval_recall: 0.8398 ± 0.0160
eval_f1: 0.8414 ± 0.0159
eval_loss: 0.9507 ± 0.2044

Overall Classification Report:
              precision    recall  f1-score   support

     Negativ       0.80      0.80      0.80       215
     Neutral       0.67      0.77      0.72       151
     Positiv       0.95      0.88      0.92       339

    accuracy                           0.84       705
   macro avg       0.81      0.82      0.81       705
weighted avg       0.85  