# BERT

The second model used in this thesis is BERT. It was chosen because it differs from decoder-only LLMs in how it is pre-trained: BERT is trained bidirectionally, so each token is encoded with both its past (left) and future (right) context, whereas decoder-only LLMs see only past context. BERT also offers a middle ground in size: at around 110 million parameters, it is significantly larger than the BiLSTM but significantly smaller than the Llama model, which has 3 billion parameters.
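
To make the difference in context concrete, the sketch below builds the attention pattern for a four-token sequence under both regimes. It is an illustration only: `attention_pattern` is a hypothetical helper, not part of the thesis code or of any library, and real models apply such patterns inside their attention layers.

```python
import numpy as np

def attention_pattern(seq_len: int, bidirectional: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) 0/1 matrix.

    Entry [i, j] == 1 means token i may attend to token j.
    """
    full = np.ones((seq_len, seq_len), dtype=int)
    # A bidirectional encoder (BERT-style) sees every position.
    # A causal decoder keeps only the lower triangle, i.e. positions j <= i.
    return full if bidirectional else np.tril(full)

bert_like = attention_pattern(4, bidirectional=True)
causal = attention_pattern(4, bidirectional=False)
print(bert_like)
print(causal)
```

In the causal pattern the first token attends only to itself, while in the bidirectional pattern every token attends to the whole sequence.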

## Imports

In [None]:
!pip install optuna
!pip uninstall transformers -y
!pip install transformers
!pip install python-dotenv
!pip install datasets



In [None]:
import numpy as np
import pandas as pd
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
from peft import(
    LoraConfig,
    prepare_model_for_kbit_training,
    get_peft_model,
    PeftModel
)
from dotenv import load_dotenv
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score
from huggingface_hub import login
#Insert hugging face api key below
#login('')

load_dotenv()

## Model Building

In this section, the best hyperparameters for the model are selected first, and the model is then fully fine-tuned on the dataset.

### Model and Tokenizer

In [None]:
id2label = {0: 'Depression', 1: 'Normal', 2: 'Anxiety', 3: 'Suicidal'}
label2id = {'Depression': 0, 'Normal': 1,'Anxiety': 2, 'Suicidal':3 }
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=4,
    id2label=id2label,
    label2id=label2id,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Hyperparameter tuning

#### Data Pre-processing

The pre-processing steps are the same for all LLMs and are simple. They consist of converting the dataframes into datasets that Hugging Face can handle, and then tokenizing those datasets with the pre-trained model's tokenizer while applying truncation.
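
The effect of truncation and padding can be sketched without downloading a model. The toy example below is a stand-in under loud assumptions: it splits on whitespace instead of using BERT's WordPiece vocabulary, the two sentences are made up for illustration, and `toy_tokenize` and `pad_batch` are hypothetical helpers rather than Hugging Face functions. It only shows the shape of the result: sequences are cut to a maximum length and then padded to the longest sequence in the batch.

```python
def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' that truncates to at most max_length tokens."""
    return text.split()[:max_length]

def pad_batch(token_lists, pad_token="[PAD]"):
    """Pad every sequence to the longest one in the batch.

    Also builds the matching attention masks (1 = real token, 0 = padding).
    """
    longest = max(len(tokens) for tokens in token_lists)
    padded, masks = [], []
    for tokens in token_lists:
        n_pad = longest - len(tokens)
        padded.append(tokens + [pad_token] * n_pad)
        masks.append([1] * len(tokens) + [0] * n_pad)
    return padded, masks

# Made-up example sentences, not taken from the thesis dataset.
texts = ["mi sento bene oggi", "non riesco a dormire da molti giorni"]
batch = [toy_tokenize(t, max_length=5) for t in texts]
padded, masks = pad_batch(batch)
print(padded)
print(masks)
```

The second sentence is truncated from seven tokens to five, and the first is padded from four tokens to five so that both rows have the same length.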

In [None]:
id2label = {0: 'Depression', 1: 'Normal', 2: 'Anxiety', 3: 'Suicidal'}
label2id = {'Depression': 0, 'Normal': 1,'Anxiety': 2, 'Suicidal':3 }
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')


In [None]:
def preprocess_function(examples):
    return tokenizer(examples["italian text"], truncation=True)


In [None]:
X_train = pd.read_csv('/content/drive/MyDrive/Italian thesis/Training dataset/train_sample.csv')
X_val = pd.read_csv('/content/drive/MyDrive/Italian thesis/Training dataset/val_sample.csv')

# Report the size of each split
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
#print(f"Test set size: {len(X_test)}")
def tokenize(example):
    return tokenizer(example["italian text"], padding=True, truncation=True, max_length=tokenizer.model_max_length)
train = Dataset.from_pandas(X_train)
#test = Dataset.from_pandas(X_test)
validation = Dataset.from_pandas(X_val)
tokenized_train = train.map(tokenize, batched=True)
#tokenized_test=test.map(tokenize, batched=True)
tokenized_val=validation.map(tokenize, batched=True)
id2label = {0: 'Depression', 1: 'Normal', 2: 'Anxiety', 3: 'Suicidal'}
label2id = {'Depression': 0, 'Normal': 1,'Anxiety': 2, 'Suicidal':3}


Training set size: 3200
Validation set size: 600



In [None]:
tokenized_train = tokenized_train.remove_columns(["text", "italian text"])
tokenized_val = tokenized_val.remove_columns(["text", "italian text"])
#tokenized_test = tokenized_test.remove_columns(["text", "italian text"])  # If using test data


After pre-processing the dataset, the hyperparameter tuning process begins. First, the `model_init` function is defined so that every trial starts from a freshly initialized model. Next, the hyperparameter space to be searched is defined. Then the metrics used to compare the trials are defined: precision, recall, accuracy, and F1 score, with precision, recall, and F1 macro-averaged over the four classes. Finally, the search is run with the aim of maximizing the macro F1 score.
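
As a sanity check on how the macro-averaged metrics are obtained, the sketch below recomputes them by hand from a confusion matrix. The matrix is the one printed in the log further below for the first evaluation of trial 0; the helper name `macro_metrics` is hypothetical. Rows are true labels and columns are predictions, following scikit-learn's convention.

```python
import numpy as np

def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall and F1 from a confusion matrix.

    Assumes every class is predicted at least once (no division by zero).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                    # correct predictions per class
    precision = tp / cm.sum(axis=0)     # column sums = predicted counts
    recall = tp / cm.sum(axis=1)        # row sums = true counts
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "f1_macro": f1.mean(),
        "precision_macro": precision.mean(),
        "recall_macro": recall.mean(),
    }

# Order: Depression, Normal, Anxiety, Suicidal
cm_trial0_epoch1 = [
    [20, 18, 27, 85],
    [1, 138, 6, 5],
    [5, 34, 63, 48],
    [9, 26, 17, 98],
]
metrics = macro_metrics(cm_trial0_epoch1)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because each class has the same support (150 validation examples), macro recall and accuracy coincide, which is why those two columns are identical throughout the logs.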

In [None]:
def model_init():
    # Return a fresh model for every trial. BERT's tokenizer already defines a
    # [PAD] token and has no EOS token, so no pad-token override is needed here.
    return AutoModelForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=4,
        id2label=id2label,
        label2id=label2id,
    )

In [None]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "gradient_accumulation_steps": trial.suggest_categorical("gradient_accumulation_steps", [1, 2, 4, 8]),  # Accumulation steps
        "lr_scheduler_type": trial.suggest_categorical("lr_scheduler_type", ["linear", "cosine", "constant", "cosine_with_restarts"]),  # Scheduler
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.1),  # Warmup ratio
        "weight_decay": trial.suggest_float("weight_decay", 0.01, 0.1),
    }
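
Gradient accumulation changes how many optimizer updates happen per epoch, which matters when reading the logs. The arithmetic below is a sketch under stated assumptions: a single device, the per-device training batch size of 16 used for the search in this notebook, the 3,200 training examples reported above, and floor division of batches by accumulation steps (the exact rounding may differ between `transformers` versions). `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(n_examples, batch_size, accumulation_steps):
    """Optimizer updates per epoch on a single device (assumed floor division)."""
    batches = math.ceil(n_examples / batch_size)
    return max(batches // accumulation_steps, 1)

for steps in (1, 2, 4, 8):
    n = updates_per_epoch(3200, 16, steps)
    print(
        f"accumulation={steps}: effective batch={16 * steps}, "
        f"updates/epoch={n}, updates in 4 epochs={4 * n}"
    )
```

With 8 accumulation steps, a 4-epoch run performs only 100 updates. Assuming the Trainer's default of logging the training loss every 500 steps, this would explain the "No log" entries in the training-loss column for such runs.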


In [None]:
def compute_objective(metrics):
    print(metrics)
    return metrics["eval_f1_macro"]

In [None]:
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Define your class labels
id2label = {0: 'Depression', 1: 'Normal', 2: 'Anxiety', 3: 'Suicidal'}
label2id = {v: k for k, v in id2label.items()}

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Compute accuracy
    accuracy = accuracy_score(labels, predictions)

    # Compute F1 score (macro-average)
    f1 = f1_score(labels, predictions, average='macro')

    # Compute precision and recall (macro-average)
    precision = precision_score(labels, predictions, average='macro')
    recall = recall_score(labels, predictions, average='macro')

    # Compute confusion matrix
    cm = confusion_matrix(labels, predictions)

    # Convert confusion matrix from class IDs to labels
    cm_labels = np.array([id2label[i] for i in range(len(id2label))])
    cm_with_labels = pd.DataFrame(cm, index=cm_labels, columns=cm_labels)

    # Generate the classification report with labels
    class_report = classification_report(labels, predictions, target_names=[id2label[i] for i in range(len(id2label))])

    # Print confusion matrix and classification report
    print("Confusion Matrix:")
    print(cm_with_labels)
    print("\nClassification Report:")
    print(class_report)

    return {
        'accuracy': accuracy,
        'f1_macro': f1,
        'precision_macro': precision,
        'recall_macro': recall,
    }


In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="BERT-hyperparameter-search",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    metric_for_best_model="f1_macro",
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=None,
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
    data_collator=data_collator)



In [None]:
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    n_trials=30,
    compute_objective=compute_objective,
)

[I 2025-02-26 20:31:19,364] A new study created in memory with name: no-name-4155925a-b8c6-43a0-8a7f-cd6b90c5c2fb


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.109219 | 0.531667 | 0.489294 | 0.545773 | 0.531667 |
| 2 | No log | 0.965656 | 0.581667 | 0.565978 | 0.57071 | 0.581667 |
| 3 | No log | 0.918424 | 0.596667 | 0.578806 | 0.585889 | 0.596667 |
| 4 | No log | 0.904798 | 0.605 | 0.595616 | 0.597525 | 0.605 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          20      18       27        85
Normal               1     138        6         5
Anxiety              5      34       63        48
Suicidal             9      26       17        98

Classification Report:
              precision    recall  f1-score   support

  Depression       0.57      0.13      0.22       150
      Normal       0.64      0.92      0.75       150
     Anxiety       0.56      0.42      0.48       150
    Suicidal       0.42      0.65      0.51       150

    accuracy                           0.53       600
   macro avg       0.55      0.53      0.49       600
weighted avg       0.55      0.53      0.49       600

{'eval_loss': 1.109218716621399, 'eval_accuracy': 0.5316666666666666, 'eval_f1_macro': 0.4892935125172063, 'eval_precision_macro': 0.5457734553748503, 'eval_recall_macro': 0.5316666666666667}

[I 2025-02-26 20:35:49,809] Trial 0 finished with value: 0.5956160683110948 and parameters: {'learning_rate': 2.74673265941076e-05, 'gradient_accumulation_steps': 8, 'lr_scheduler_type': 'cosine_with_restarts', 'warmup_ratio': 0.00789916420694845, 'weight_decay': 0.09266055131490777}. Best is trial 0 with value: 0.5956160683110948.
Run summary:
eval/accuracy,0.605
eval/f1_macro,0.59562
eval/loss,0.9048
eval/precision_macro,0.59752
eval/recall_macro,0.605
eval/runtime,3.2206
eval/samples_per_second,186.301
eval/steps_per_second,5.9
total_flos,3084177395340288.0
train/epoch,4.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.078076 | 0.546667 | 0.517698 | 0.545394 | 0.546667 |
| 2 | No log | 0.911662 | 0.611667 | 0.585408 | 0.619875 | 0.611667 |
| 3 | No log | 0.803368 | 0.668333 | 0.650578 | 0.68762 | 0.668333 |
| 4 | No log | 0.718812 | 0.708333 | 0.7016 | 0.709999 | 0.708333 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          42      20       45        43
Normal               0     140        8         2
Anxiety              7      39       94        10
Suicidal            21      33       44        52

Classification Report:
              precision    recall  f1-score   support

  Depression       0.60      0.28      0.38       150
      Normal       0.60      0.93      0.73       150
     Anxiety       0.49      0.63      0.55       150
    Suicidal       0.49      0.35      0.40       150

    accuracy                           0.55       600
   macro avg       0.55      0.55      0.52       600
weighted avg       0.55      0.55      0.52       600

{'eval_loss': 1.0780760049819946, 'eval_accuracy': 0.5466666666666666, 'eval_f1_macro': 0.5176978459515325, 'eval_precision_macro': 0.5453940452829806, 'eval_recall_macro': 0.5466666666666666}

[I 2025-02-26 20:40:11,018] Trial 1 finished with value: 0.7016000827042129 and parameters: {'learning_rate': 1.5447773984638102e-05, 'gradient_accumulation_steps': 4, 'lr_scheduler_type': 'constant', 'warmup_ratio': 0.00627989399480865, 'weight_decay': 0.02288839677418366}. Best is trial 1 with value: 0.7016000827042129.
Run summary:
eval/accuracy,0.70833
eval/f1_macro,0.7016
eval/loss,0.71881
eval/precision_macro,0.71
eval/recall_macro,0.70833
eval/runtime,3.2122
eval/samples_per_second,186.788
eval/steps_per_second,5.915
total_flos,3084177395340288.0
train/epoch,4.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.081101 | 0.558333 | 0.536679 | 0.548632 | 0.558333 |
| 2 | No log | 0.902718 | 0.611667 | 0.588932 | 0.603898 | 0.611667 |
| 3 | No log | 0.818586 | 0.643333 | 0.622243 | 0.652265 | 0.643333 |
| 4 | No log | 0.785215 | 0.653333 | 0.637885 | 0.660292 | 0.653333 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          48      20       37        45
Normal               0     140        6         4
Anxiety             12      36       84        18
Suicidal            26      31       30        63

Classification Report:
              precision    recall  f1-score   support

  Depression       0.56      0.32      0.41       150
      Normal       0.62      0.93      0.74       150
     Anxiety       0.54      0.56      0.55       150
    Suicidal       0.48      0.42      0.45       150

    accuracy                           0.56       600
   macro avg       0.55      0.56      0.54       600
weighted avg       0.55      0.56      0.54       600

{'eval_loss': 1.081100583076477, 'eval_accuracy': 0.5583333333333333, 'eval_f1_macro': 0.5366791254167581, 'eval_precision_macro': 0.5486317136846476, 'eval_recall_macro': 0.5583333333333333}

[I 2025-02-26 20:44:32,142] Trial 2 finished with value: 0.6378849371832328 and parameters: {'learning_rate': 4.10297246856984e-05, 'gradient_accumulation_steps': 8, 'lr_scheduler_type': 'linear', 'warmup_ratio': 0.05934413398671354, 'weight_decay': 0.011032110342431258}. Best is trial 1 with value: 0.7016000827042129.
Run summary:
eval/accuracy,0.65333
eval/f1_macro,0.63788
eval/loss,0.78522
eval/precision_macro,0.66029
eval/recall_macro,0.65333
eval/runtime,3.2197
eval/samples_per_second,186.352
eval/steps_per_second,5.901
total_flos,3084177395340288.0
train/epoch,4.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.020656 | 0.576667 | 0.546258 | 0.605162 | 0.576667 |
| 2 | No log | 0.904825 | 0.62 | 0.60095 | 0.612393 | 0.62 |
| 3 | No log | 0.816104 | 0.64 | 0.619591 | 0.637624 | 0.64 |
| 4 | No log | 0.805891 | 0.655 | 0.640387 | 0.655035 | 0.655 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          29      14       18        89
Normal               1     137        6         6
Anxiety              7      29       77        37
Suicidal             7      21       19       103

Classification Report:
              precision    recall  f1-score   support

  Depression       0.66      0.19      0.30       150
      Normal       0.68      0.91      0.78       150
     Anxiety       0.64      0.51      0.57       150
    Suicidal       0.44      0.69      0.54       150

    accuracy                           0.58       600
   macro avg       0.61      0.58      0.55       600
weighted avg       0.61      0.58      0.55       600

{'eval_loss': 1.0206555128097534, 'eval_accuracy': 0.5766666666666667, 'eval_f1_macro': 0.5462577895567586, 'eval_precision_macro': 0.6051618719747491, 'eval_recall_macro': 0.5766666666666667}

[I 2025-02-26 20:48:56,310] Trial 3 finished with value: 0.6403866623662385 and parameters: {'learning_rate': 1.2428488395490032e-05, 'gradient_accumulation_steps': 2, 'lr_scheduler_type': 'cosine_with_restarts', 'warmup_ratio': 0.00858078544090094, 'weight_decay': 0.0858217213049825}. Best is trial 1 with value: 0.7016000827042129.
Run summary:
eval/accuracy,0.655
eval/f1_macro,0.64039
eval/loss,0.80589
eval/precision_macro,0.65504
eval/recall_macro,0.655
eval/runtime,3.2501
eval/samples_per_second,184.608
eval/steps_per_second,5.846
total_flos,3084177395340288.0
train/epoch,4.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 0.979709 | 0.576667 | 0.564491 | 0.583297 | 0.576667 |
| 2 | No log | 0.741316 | 0.685 | 0.669604 | 0.677532 | 0.685 |
| 3 | 0.957100 | 0.679991 | 0.711667 | 0.703099 | 0.709704 | 0.711667 |
| 4 | 0.957100 | 0.66573 | 0.711667 | 0.703233 | 0.71263 | 0.711667 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          47      11       14        78
Normal               8     129        8         5
Anxiety             15      27       74        34
Suicidal            24      18       12        96

Classification Report:
              precision    recall  f1-score   support

  Depression       0.50      0.31      0.39       150
      Normal       0.70      0.86      0.77       150
     Anxiety       0.69      0.49      0.57       150
    Suicidal       0.45      0.64      0.53       150

    accuracy                           0.58       600
   macro avg       0.58      0.58      0.56       600
weighted avg       0.58      0.58      0.56       600

{'eval_loss': 0.9797086715698242, 'eval_accuracy': 0.5766666666666667, 'eval_f1_macro': 0.5644910465145279, 'eval_precision_macro': 0.5832966769586487, 'eval_recall_macro': 0.5766666666666667}

[I 2025-02-26 20:53:23,232] Trial 4 finished with value: 0.7032325301079986 and parameters: {'learning_rate': 1.405744185469908e-05, 'gradient_accumulation_steps': 1, 'lr_scheduler_type': 'cosine', 'warmup_ratio': 0.07812731454043954, 'weight_decay': 0.06132096297713702}. Best is trial 4 with value: 0.7032325301079986.
Run summary:
eval/accuracy,0.71167
eval/f1_macro,0.70323
eval/loss,0.66573
eval/precision_macro,0.71263
eval/recall_macro,0.71167
eval/runtime,3.2249
eval/samples_per_second,186.05
eval/steps_per_second,5.892
total_flos,3084177395340288.0
train/epoch,4.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.072516 | 0.553333 | 0.532285 | 0.539348 | 0.553333 |
| 2 | No log | 0.937264 | 0.591667 | 0.574382 | 0.584882 | 0.591667 |
| 3 | No log | 0.893102 | 0.615 | 0.596859 | 0.614054 | 0.615 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          40      14       36        60
Normal               1     136        9         4
Anxiety             16      31       82        21
Suicidal            22      23       31        74

Classification Report:
              precision    recall  f1-score   support

  Depression       0.51      0.27      0.35       150
      Normal       0.67      0.91      0.77       150
     Anxiety       0.52      0.55      0.53       150
    Suicidal       0.47      0.49      0.48       150

    accuracy                           0.55       600
   macro avg       0.54      0.55      0.53       600
weighted avg       0.54      0.55      0.53       600

{'eval_loss': 1.0725163221359253, 'eval_accuracy': 0.5533333333333333, 'eval_f1_macro': 0.5322846234622184, 'eval_precision_macro': 0.5393479818485789, 'eval_recall_macro': 0.5533333333333333}

[I 2025-02-26 20:56:34,468] Trial 5 pruned. 
Run summary:
eval/accuracy,0.615
eval/f1_macro,0.59686
eval/loss,0.8931
eval/precision_macro,0.61405
eval/recall_macro,0.615
eval/runtime,3.168
eval/samples_per_second,189.396
eval/steps_per_second,5.998
train/epoch,3.0
train/global_step,75.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.159673 | 0.551667 | 0.544386 | 0.597412 | 0.551667 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          51       6        1        92
Normal               6     125       10         9
Anxiety             81       5       46        18
Suicidal            25      12        4       109

Classification Report:
              precision    recall  f1-score   support

  Depression       0.31      0.34      0.33       150
      Normal       0.84      0.83      0.84       150
     Anxiety       0.75      0.31      0.44       150
    Suicidal       0.48      0.73      0.58       150

    accuracy                           0.55       600
   macro avg       0.60      0.55      0.54       600
weighted avg       0.60      0.55      0.54       600

{'eval_loss': 1.1596726179122925, 'eval_accuracy': 0.5516666666666666, 'eval_f1_macro': 0.5443858257028484, 'eval_precision_macro': 0.5974116415679377, 'eval_recall_macro': 0.5516666666666666}


[I 2025-02-26 20:57:40,594] Trial 6 pruned. 
Run summary:
eval/accuracy,0.55167
eval/f1_macro,0.54439
eval/loss,1.15967
eval/precision_macro,0.59741
eval/recall_macro,0.55167
eval/runtime,3.1797
eval/samples_per_second,188.697
eval/steps_per_second,5.975
train/epoch,1.0
train/global_step,200.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 1.060296 | 0.558333 | 0.530052 | 0.56435 | 0.558333 |
| 2 | No log | 0.908699 | 0.611667 | 0.596991 | 0.611894 | 0.611667 |
| 3 | No log | 0.840424 | 0.635 | 0.625807 | 0.64588 | 0.635 |
| 4 | No log | 0.818561 | 0.64 | 0.631146 | 0.646383 | 0.64 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          55      20       49        26
Normal               0     139        9         2
Anxiety             10      35      100         5
Suicidal            25      30       54        41

Classification Report:
              precision    recall  f1-score   support

  Depression       0.61      0.37      0.46       150
      Normal       0.62      0.93      0.74       150
     Anxiety       0.47      0.67      0.55       150
    Suicidal       0.55      0.27      0.37       150

    accuracy                           0.56       600
   macro avg       0.56      0.56      0.53       600
weighted avg       0.56      0.56      0.53       600

{'eval_loss': 1.0602960586547852, 'eval_accuracy': 0.5583333333333333, 'eval_f1_macro': 0.530051614442864, 'eval_precision_macro': 0.5643497481646067, 'eval_recall_macro': 0.5583333333333333}

[I 2025-02-26 21:02:00,940] Trial 7 finished with value: 0.6311460196692297 and parameters: {'learning_rate': 4.1954462259599006e-05, 'gradient_accumulation_steps': 8, 'lr_scheduler_type': 'cosine', 'warmup_ratio': 0.023042363062248872, 'weight_decay': 0.0329232204081366}. Best is trial 4 with value: 0.7032325301079986.
Run summary:
eval/accuracy,0.64
eval/f1_macro,0.63115
eval/loss,0.81856
eval/precision_macro,0.64638
eval/recall_macro,0.64
eval/runtime,3.231
eval/samples_per_second,185.701
eval/steps_per_second,5.881
total_flos,6168354790680576.0
train/epoch,4.0


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|---|---|---|---|---|---|---|
| 1 | No log | 0.912425 | 0.63 | 0.619209 | 0.63698 | 0.63 |
| 2 | No log | 0.673336 | 0.731667 | 0.731154 | 0.734012 | 0.731667 |
| 3 | 0.847400 | 0.655604 | 0.75 | 0.746381 | 0.745033 | 0.75 |
| 4 | 0.847400 | 0.650407 | 0.75 | 0.747718 | 0.754736 | 0.75 |


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          48       5       25        72
Normal               5     121       16         8
Anxiety             11      10      110        19
Suicidal            16      13       22        99

Classification Report:
              precision    recall  f1-score   support

  Depression       0.60      0.32      0.42       150
      Normal       0.81      0.81      0.81       150
     Anxiety       0.64      0.73      0.68       150
    Suicidal       0.50      0.66      0.57       150

    accuracy                           0.63       600
   macro avg       0.64      0.63      0.62       600
weighted avg       0.64      0.63      0.62       600

{'eval_loss': 0.9124245047569275, 'eval_accuracy': 0.63, 'eval_f1_macro': 0.619208980291945, 'eval_precision_macro': 0.6369796718004422, 'eval_recall_macro': 0.63}
Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          91       8        6  

[I 2025-02-26 21:06:28,110] Trial 8 finished with value: 0.7477179660543716 and parameters: {'learning_rate': 2.8623443228780884e-05, 'gradient_accumulation_steps': 1, 'lr_scheduler_type': 'linear', 'warmup_ratio': 0.08949554639047672, 'weight_decay': 0.04922322533381231}. Best is trial 8 with value: 0.7477179660543716.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Metric,History
eval/accuracy,▁▇██
eval/f1_macro,▁▇██
eval/loss,█▂▁▁
eval/precision_macro,▁▇▇█
eval/recall_macro,▁▇██
eval/runtime,▃▁▁█
eval/samples_per_second,▆██▁
eval/steps_per_second,▆██▁
train/epoch,▁▃▅▆██
train/global_step,▁▃▅▆██

Metric,Value
eval/accuracy,0.75
eval/f1_macro,0.74772
eval/loss,0.65041
eval/precision_macro,0.75474
eval/recall_macro,0.75
eval/runtime,3.2091
eval/samples_per_second,186.97
eval/steps_per_second,5.921
total_flos,3084177395340288.0
train/epoch,4.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.964444,0.573333,0.568639,0.574965,0.573333


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          85      11       15        39
Normal               5     125       13         7
Anxiety             38      22       66        24
Suicidal            56      11       15        68

Classification Report:
              precision    recall  f1-score   support

  Depression       0.46      0.57      0.51       150
      Normal       0.74      0.83      0.78       150
     Anxiety       0.61      0.44      0.51       150
    Suicidal       0.49      0.45      0.47       150

    accuracy                           0.57       600
   macro avg       0.57      0.57      0.57       600
weighted avg       0.57      0.57      0.57       600

{'eval_loss': 0.964444100856781, 'eval_accuracy': 0.5733333333333334, 'eval_f1_macro': 0.5686389568410011, 'eval_precision_macro': 0.5749649256244251, 'eval_recall_macro': 0.5733333333333333}


[I 2025-02-26 21:07:33,334] Trial 9 pruned. 

Metric,Value
eval/accuracy,0.57333
eval/f1_macro,0.56864
eval/loss,0.96444
eval/precision_macro,0.57496
eval/recall_macro,0.57333
eval/runtime,3.1633
eval/samples_per_second,189.673
eval/steps_per_second,6.006
train/epoch,1.0
train/global_step,100.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.01697,0.56,0.52002,0.55886,0.56


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          20      15       31        84
Normal               1     135       10         4
Anxiety              9      29       90        22
Suicidal             6      21       32        91

Classification Report:
              precision    recall  f1-score   support

  Depression       0.56      0.13      0.22       150
      Normal       0.68      0.90      0.77       150
     Anxiety       0.55      0.60      0.58       150
    Suicidal       0.45      0.61      0.52       150

    accuracy                           0.56       600
   macro avg       0.56      0.56      0.52       600
weighted avg       0.56      0.56      0.52       600

{'eval_loss': 1.0169695615768433, 'eval_accuracy': 0.56, 'eval_f1_macro': 0.5200201813981058, 'eval_precision_macro': 0.5588597783068299, 'eval_recall_macro': 0.56}


[I 2025-02-26 21:08:39,208] Trial 10 pruned. 

Metric,Value
eval/accuracy,0.56
eval/f1_macro,0.52002
eval/loss,1.01697
eval/precision_macro,0.55886
eval/recall_macro,0.56
eval/runtime,3.1672
eval/samples_per_second,189.44
eval/steps_per_second,5.999
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.017422,0.55,0.511446,0.564266,0.55


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          18      13       10       109
Normal               7     130        8         5
Anxiety              9      26       68        47
Suicidal             6      19       11       114

Classification Report:
              precision    recall  f1-score   support

  Depression       0.45      0.12      0.19       150
      Normal       0.69      0.87      0.77       150
     Anxiety       0.70      0.45      0.55       150
    Suicidal       0.41      0.76      0.54       150

    accuracy                           0.55       600
   macro avg       0.56      0.55      0.51       600
weighted avg       0.56      0.55      0.51       600

{'eval_loss': 1.0174223184585571, 'eval_accuracy': 0.55, 'eval_f1_macro': 0.5114455822814956, 'eval_precision_macro': 0.5642664360206584, 'eval_recall_macro': 0.55}


[I 2025-02-26 21:09:45,715] Trial 11 pruned. 

Metric,Value
eval/accuracy,0.55
eval/f1_macro,0.51145
eval/loss,1.01742
eval/precision_macro,0.56427
eval/recall_macro,0.55
eval/runtime,3.1853
eval/samples_per_second,188.366
eval/steps_per_second,5.965
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.995641,0.58,0.573512,0.586003,0.58


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          48       6       25        71
Normal              11     117       13         9
Anxiety             12      16       89        33
Suicidal            21      12       23        94

Classification Report:
              precision    recall  f1-score   support

  Depression       0.52      0.32      0.40       150
      Normal       0.77      0.78      0.78       150
     Anxiety       0.59      0.59      0.59       150
    Suicidal       0.45      0.63      0.53       150

    accuracy                           0.58       600
   macro avg       0.59      0.58      0.57       600
weighted avg       0.59      0.58      0.57       600

{'eval_loss': 0.9956413507461548, 'eval_accuracy': 0.58, 'eval_f1_macro': 0.5735117075852059, 'eval_precision_macro': 0.5860032952618612, 'eval_recall_macro': 0.5800000000000001}


[I 2025-02-26 21:10:51,647] Trial 12 pruned. 

Metric,Value
eval/accuracy,0.58
eval/f1_macro,0.57351
eval/loss,0.99564
eval/precision_macro,0.586
eval/recall_macro,0.58
eval/runtime,3.1747
eval/samples_per_second,188.994
eval/steps_per_second,5.985
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.949339,0.59,0.58121,0.621277,0.59


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          43       6        6        95
Normal              11     123        8         8
Anxiety             25      11       73        41
Suicidal            15      14        6       115

Classification Report:
              precision    recall  f1-score   support

  Depression       0.46      0.29      0.35       150
      Normal       0.80      0.82      0.81       150
     Anxiety       0.78      0.49      0.60       150
    Suicidal       0.44      0.77      0.56       150

    accuracy                           0.59       600
   macro avg       0.62      0.59      0.58       600
weighted avg       0.62      0.59      0.58       600

{'eval_loss': 0.9493392109870911, 'eval_accuracy': 0.59, 'eval_f1_macro': 0.5812099440601952, 'eval_precision_macro': 0.6212774469466302, 'eval_recall_macro': 0.59}


[I 2025-02-26 21:11:57,637] Trial 13 pruned. 

Metric,Value
eval/accuracy,0.59
eval/f1_macro,0.58121
eval/loss,0.94934
eval/precision_macro,0.62128
eval/recall_macro,0.59
eval/runtime,3.1765
eval/samples_per_second,188.888
eval/steps_per_second,5.981
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.10865,0.546667,0.529391,0.529101,0.546667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          46      13       36        55
Normal               2     133       12         3
Anxiety             21      34       82        13
Suicidal            31      23       29        67

Classification Report:
              precision    recall  f1-score   support

  Depression       0.46      0.31      0.37       150
      Normal       0.66      0.89      0.75       150
     Anxiety       0.52      0.55      0.53       150
    Suicidal       0.49      0.45      0.47       150

    accuracy                           0.55       600
   macro avg       0.53      0.55      0.53       600
weighted avg       0.53      0.55      0.53       600

{'eval_loss': 1.108649730682373, 'eval_accuracy': 0.5466666666666666, 'eval_f1_macro': 0.5293907977086523, 'eval_precision_macro': 0.5291007326525416, 'eval_recall_macro': 0.5466666666666666}


[I 2025-02-26 21:13:02,176] Trial 14 pruned. 

Metric,Value
eval/accuracy,0.54667
eval/f1_macro,0.52939
eval/loss,1.10865
eval/precision_macro,0.5291
eval/recall_macro,0.54667
eval/runtime,3.1696
eval/samples_per_second,189.301
eval/steps_per_second,5.995
train/epoch,1.0
train/global_step,50.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.962912,0.586667,0.55628,0.603164,0.586667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          25      10       19        96
Normal               7     126       13         4
Anxiety              6      19       90        35
Suicidal             5      17       17       111

Classification Report:
              precision    recall  f1-score   support

  Depression       0.58      0.17      0.26       150
      Normal       0.73      0.84      0.78       150
     Anxiety       0.65      0.60      0.62       150
    Suicidal       0.45      0.74      0.56       150

    accuracy                           0.59       600
   macro avg       0.60      0.59      0.56       600
weighted avg       0.60      0.59      0.56       600

{'eval_loss': 0.9629117846488953, 'eval_accuracy': 0.5866666666666667, 'eval_f1_macro': 0.5562798710033505, 'eval_precision_macro': 0.6031637537389261, 'eval_recall_macro': 0.5866666666666667}


[I 2025-02-26 21:14:08,079] Trial 15 pruned. 

Metric,Value
eval/accuracy,0.58667
eval/f1_macro,0.55628
eval/loss,0.96291
eval/precision_macro,0.60316
eval/recall_macro,0.58667
eval/runtime,3.1897
eval/samples_per_second,188.103
eval/steps_per_second,5.957
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.965274,0.608333,0.614478,0.631361,0.608333
2,No log,0.657226,0.721667,0.71952,0.718356,0.721667
3,0.862800,0.634945,0.733333,0.728884,0.730231,0.733333
4,0.862800,0.619089,0.746667,0.743809,0.757937,0.746667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          94       4        9        43
Normal              16     108       18         8
Anxiety             35      10       88        17
Suicidal            57       7       11        75

Classification Report:
              precision    recall  f1-score   support

  Depression       0.47      0.63      0.53       150
      Normal       0.84      0.72      0.77       150
     Anxiety       0.70      0.59      0.64       150
    Suicidal       0.52      0.50      0.51       150

    accuracy                           0.61       600
   macro avg       0.63      0.61      0.61       600
weighted avg       0.63      0.61      0.61       600

{'eval_loss': 0.9652737379074097, 'eval_accuracy': 0.6083333333333333, 'eval_f1_macro': 0.6144777523474407, 'eval_precision_macro': 0.6313610149668175, 'eval_recall_macro': 0.6083333333333334}
Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression  

[I 2025-02-26 21:18:35,139] Trial 16 finished with value: 0.7438089495977674 and parameters: {'learning_rate': 2.374892819523537e-05, 'gradient_accumulation_steps': 1, 'lr_scheduler_type': 'cosine', 'warmup_ratio': 0.06471251558687481, 'weight_decay': 0.0704020547549088}. Best is trial 8 with value: 0.7477179660543716.


Metric,History
eval/accuracy,▁▇▇█
eval/f1_macro,▁▇▇█
eval/loss,█▂▁▁
eval/precision_macro,▁▆▆█
eval/recall_macro,▁▇▇█
eval/runtime,▁▁▁█
eval/samples_per_second,███▁
eval/steps_per_second,███▁
train/epoch,▁▃▅▆██
train/global_step,▁▃▅▆██

Metric,Value
eval/accuracy,0.74667
eval/f1_macro,0.74381
eval/loss,0.61909
eval/precision_macro,0.75794
eval/recall_macro,0.74667
eval/runtime,3.246
eval/samples_per_second,184.841
eval/steps_per_second,5.853
total_flos,8482984308185088.0
train/epoch,4.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.862801,0.64,0.62945,0.640376,0.64
2,No log,0.664367,0.731667,0.725475,0.747916,0.731667
3,0.830200,0.691613,0.74,0.735398,0.739898,0.74
4,0.830200,0.903307,0.736667,0.715335,0.744581,0.736667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          57       8       14        71
Normal               1     138        8         3
Anxiety             10      25       98        17
Suicidal            27      18       14        91

Classification Report:
              precision    recall  f1-score   support

  Depression       0.60      0.38      0.47       150
      Normal       0.73      0.92      0.81       150
     Anxiety       0.73      0.65      0.69       150
    Suicidal       0.50      0.61      0.55       150

    accuracy                           0.64       600
   macro avg       0.64      0.64      0.63       600
weighted avg       0.64      0.64      0.63       600

{'eval_loss': 0.8628011345863342, 'eval_accuracy': 0.64, 'eval_f1_macro': 0.6294497576597844, 'eval_precision_macro': 0.6403755034352049, 'eval_recall_macro': 0.64}
Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          65       9        3 

[I 2025-02-26 21:22:59,286] Trial 17 pruned. 


Metric,History
eval/accuracy,▁▇██
eval/f1_macro,▁▇█▇
eval/loss,▇▁▂█
eval/precision_macro,▁█▇█
eval/recall_macro,▁▇██
eval/runtime,▂▁▂█
eval/samples_per_second,▇█▆▁
eval/steps_per_second,▇█▇▁
train/epoch,▁▃▅▆█
train/global_step,▁▃▅▆█

Metric,Value
eval/accuracy,0.73667
eval/f1_macro,0.71533
eval/loss,0.90331
eval/precision_macro,0.74458
eval/recall_macro,0.73667
eval/runtime,3.2277
eval/samples_per_second,185.889
eval/steps_per_second,5.886
train/epoch,4.0
train/global_step,800.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.112098,0.506667,0.454763,0.583353,0.506667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          16      21       18        95
Normal               0     140        1         9
Anxiety              2      41       50        57
Suicidal             3      35       14        98

Classification Report:
              precision    recall  f1-score   support

  Depression       0.76      0.11      0.19       150
      Normal       0.59      0.93      0.72       150
     Anxiety       0.60      0.33      0.43       150
    Suicidal       0.38      0.65      0.48       150

    accuracy                           0.51       600
   macro avg       0.58      0.51      0.45       600
weighted avg       0.58      0.51      0.45       600

{'eval_loss': 1.1120977401733398, 'eval_accuracy': 0.5066666666666667, 'eval_f1_macro': 0.4547627170196213, 'eval_precision_macro': 0.583352519603854, 'eval_recall_macro': 0.5066666666666666}


[I 2025-02-26 21:24:03,959] Trial 18 pruned. 

Metric,Value
eval/accuracy,0.50667
eval/f1_macro,0.45476
eval/loss,1.1121
eval/precision_macro,0.58335
eval/recall_macro,0.50667
eval/runtime,3.1773
eval/samples_per_second,188.84
eval/steps_per_second,5.98
train/epoch,1.0
train/global_step,50.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.02857,0.563333,0.553952,0.555098,0.563333


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          76      13       14        47
Normal               1     132        6        11
Anxiety             32      26       67        25
Suicidal            40      20       27        63

Classification Report:
              precision    recall  f1-score   support

  Depression       0.51      0.51      0.51       150
      Normal       0.69      0.88      0.77       150
     Anxiety       0.59      0.45      0.51       150
    Suicidal       0.43      0.42      0.43       150

    accuracy                           0.56       600
   macro avg       0.56      0.56      0.55       600
weighted avg       0.56      0.56      0.55       600

{'eval_loss': 1.028570294380188, 'eval_accuracy': 0.5633333333333334, 'eval_f1_macro': 0.553951546412977, 'eval_precision_macro': 0.5550981845236083, 'eval_recall_macro': 0.5633333333333334}


[I 2025-02-26 21:25:08,902] Trial 19 pruned. 

Metric,Value
eval/accuracy,0.56333
eval/f1_macro,0.55395
eval/loss,1.02857
eval/precision_macro,0.5551
eval/recall_macro,0.56333
eval/runtime,3.1731
eval/samples_per_second,189.092
eval/steps_per_second,5.988
train/epoch,1.0
train/global_step,100.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.987747,0.581667,0.582934,0.598019,0.581667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          65       7       13        65
Normal               9     117        7        17
Anxiety             26      17       78        29
Suicidal            35      15       11        89

Classification Report:
              precision    recall  f1-score   support

  Depression       0.48      0.43      0.46       150
      Normal       0.75      0.78      0.76       150
     Anxiety       0.72      0.52      0.60       150
    Suicidal       0.45      0.59      0.51       150

    accuracy                           0.58       600
   macro avg       0.60      0.58      0.58       600
weighted avg       0.60      0.58      0.58       600

{'eval_loss': 0.9877467751502991, 'eval_accuracy': 0.5816666666666667, 'eval_f1_macro': 0.5829335660295413, 'eval_precision_macro': 0.5980194529391777, 'eval_recall_macro': 0.5816666666666667}


[I 2025-02-26 21:26:14,955] Trial 20 pruned. 

Metric,Value
eval/accuracy,0.58167
eval/f1_macro,0.58293
eval/loss,0.98775
eval/precision_macro,0.59802
eval/recall_macro,0.58167
eval/runtime,3.1905
eval/samples_per_second,188.056
eval/steps_per_second,5.955
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.99702,0.603333,0.60335,0.604228,0.603333


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          78       6       25        41
Normal              19     109       15         7
Anxiety             19      18       98        15
Suicidal            38      12       23        77

Classification Report:
              precision    recall  f1-score   support

  Depression       0.51      0.52      0.51       150
      Normal       0.75      0.73      0.74       150
     Anxiety       0.61      0.65      0.63       150
    Suicidal       0.55      0.51      0.53       150

    accuracy                           0.60       600
   macro avg       0.60      0.60      0.60       600
weighted avg       0.60      0.60      0.60       600

{'eval_loss': 0.9970203638076782, 'eval_accuracy': 0.6033333333333334, 'eval_f1_macro': 0.6033501271821932, 'eval_precision_macro': 0.6042283241496136, 'eval_recall_macro': 0.6033333333333333}


[I 2025-02-26 21:27:21,200] Trial 21 pruned. 

Metric,Value
eval/accuracy,0.60333
eval/f1_macro,0.60335
eval/loss,0.99702
eval/precision_macro,0.60423
eval/recall_macro,0.60333
eval/runtime,3.1857
eval/samples_per_second,188.344
eval/steps_per_second,5.964
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.891441,0.638333,0.622003,0.678124,0.638333
2,No log,0.665975,0.728333,0.727392,0.729754,0.728333
3,0.844100,0.646208,0.748333,0.741726,0.74151,0.748333
4,0.844100,0.625595,0.75,0.745844,0.75149,0.75


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          40       7       14        89
Normal               2     127       10        11
Anxiety              5      14      101        30
Suicidal             8      14       13       115

Classification Report:
              precision    recall  f1-score   support

  Depression       0.73      0.27      0.39       150
      Normal       0.78      0.85      0.81       150
     Anxiety       0.73      0.67      0.70       150
    Suicidal       0.47      0.77      0.58       150

    accuracy                           0.64       600
   macro avg       0.68      0.64      0.62       600
weighted avg       0.68      0.64      0.62       600

{'eval_loss': 0.891440749168396, 'eval_accuracy': 0.6383333333333333, 'eval_f1_macro': 0.6220034591107839, 'eval_precision_macro': 0.6781237894074333, 'eval_recall_macro': 0.6383333333333333}
Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression   

[I 2025-02-26 21:31:48,481] Trial 22 finished with value: 0.7458436466424814 and parameters: {'learning_rate': 2.7931735279589602e-05, 'gradient_accumulation_steps': 1, 'lr_scheduler_type': 'cosine', 'warmup_ratio': 0.07583472709096399, 'weight_decay': 0.06095771056452989}. Best is trial 8 with value: 0.7477179660543716.


Metric,History
eval/accuracy,▁▇██
eval/f1_macro,▁▇██
eval/loss,█▂▂▁
eval/precision_macro,▁▆▇█
eval/recall_macro,▁▇██
eval/runtime,▁▄▅█
eval/samples_per_second,█▅▄▁
eval/steps_per_second,█▅▄▁
train/epoch,▁▃▅▆██
train/global_step,▁▃▅▆██

Metric,Value
eval/accuracy,0.75
eval/f1_macro,0.74584
eval/loss,0.6256
eval/precision_macro,0.75149
eval/recall_macro,0.75
eval/runtime,3.2113
eval/samples_per_second,186.84
eval/steps_per_second,5.917
total_flos,7326097112575488.0
train/epoch,4.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.013479,0.568333,0.567147,0.60469,0.568333


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          46       5        7        92
Normal              11     112       12        15
Anxiety             35       6       71        38
Suicidal            22       8        8       112

Classification Report:
              precision    recall  f1-score   support

  Depression       0.40      0.31      0.35       150
      Normal       0.85      0.75      0.80       150
     Anxiety       0.72      0.47      0.57       150
    Suicidal       0.44      0.75      0.55       150

    accuracy                           0.57       600
   macro avg       0.60      0.57      0.57       600
weighted avg       0.60      0.57      0.57       600

{'eval_loss': 1.0134793519973755, 'eval_accuracy': 0.5683333333333334, 'eval_f1_macro': 0.5671467672314303, 'eval_precision_macro': 0.6046895163197277, 'eval_recall_macro': 0.5683333333333334}


[I 2025-02-26 21:32:54,655] Trial 23 pruned. 

Metric,Value
eval/accuracy,0.56833
eval/f1_macro,0.56715
eval/loss,1.01348
eval/precision_macro,0.60469
eval/recall_macro,0.56833
eval/runtime,3.1871
eval/samples_per_second,188.26
eval/steps_per_second,5.962
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.95667,0.611667,0.614085,0.630374,0.611667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          60       5       14        71
Normal               4     109       20        17
Anxiety             26       5       99        20
Suicidal            29       5       17        99

Classification Report:
              precision    recall  f1-score   support

  Depression       0.50      0.40      0.45       150
      Normal       0.88      0.73      0.80       150
     Anxiety       0.66      0.66      0.66       150
    Suicidal       0.48      0.66      0.55       150

    accuracy                           0.61       600
   macro avg       0.63      0.61      0.61       600
weighted avg       0.63      0.61      0.61       600

{'eval_loss': 0.9566704034805298, 'eval_accuracy': 0.6116666666666667, 'eval_f1_macro': 0.6140847352426984, 'eval_precision_macro': 0.6303737020755006, 'eval_recall_macro': 0.6116666666666667}


[I 2025-02-26 21:34:00,786] Trial 24 pruned. 

Metric,Value
eval/accuracy,0.61167
eval/f1_macro,0.61408
eval/loss,0.95667
eval/precision_macro,0.63037
eval/recall_macro,0.61167
eval/runtime,3.1886
eval/samples_per_second,188.17
eval/steps_per_second,5.959
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.86433,0.651667,0.645936,0.645649,0.651667
2,No log,0.657345,0.745,0.746631,0.751935,0.745
3,0.830100,0.585473,0.768333,0.765434,0.764945,0.768333
4,0.830100,0.61102,0.775,0.775874,0.782948,0.775


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          76       6       24        44
Normal               5     126       17         2
Anxiety             16      12      116         6
Suicidal            43      12       22        73

Classification Report:
              precision    recall  f1-score   support

  Depression       0.54      0.51      0.52       150
      Normal       0.81      0.84      0.82       150
     Anxiety       0.65      0.77      0.71       150
    Suicidal       0.58      0.49      0.53       150

    accuracy                           0.65       600
   macro avg       0.65      0.65      0.65       600
weighted avg       0.65      0.65      0.65       600

{'eval_loss': 0.8643304705619812, 'eval_accuracy': 0.6516666666666666, 'eval_f1_macro': 0.6459359017401397, 'eval_precision_macro': 0.6456485358217202, 'eval_recall_macro': 0.6516666666666667}

[I 2025-02-26 21:38:28,307] Trial 25 finished with value: 0.775874368438653 and parameters: {'learning_rate': 3.543919690225377e-05, 'gradient_accumulation_steps': 1, 'lr_scheduler_type': 'linear', 'warmup_ratio': 0.09785956688181825, 'weight_decay': 0.028924075405665158}. Best is trial 25 with value: 0.775874368438653.


0,1
eval/accuracy,▁▆██
eval/f1_macro,▁▆▇█
eval/loss,█▃▁▂
eval/precision_macro,▁▆▇█
eval/recall_macro,▁▆██
eval/runtime,▁▁▂█
eval/samples_per_second,██▇▁
eval/steps_per_second,██▇▁
train/epoch,▁▃▅▆██
train/global_step,▁▃▅▆██

0,1
eval/accuracy,0.775
eval/f1_macro,0.77587
eval/loss,0.61102
eval/precision_macro,0.78295
eval/recall_macro,0.775
eval/runtime,3.2416
eval/samples_per_second,185.093
eval/steps_per_second,5.861
total_flos,4626693656153088.0
train/epoch,4.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,0.971016,0.593333,0.586184,0.617526,0.593333


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          43       6        6        95
Normal               9     120        9        12
Anxiety             28      11       80        31
Suicidal            15      12       10       113

Classification Report:
              precision    recall  f1-score   support

  Depression       0.45      0.29      0.35       150
      Normal       0.81      0.80      0.80       150
     Anxiety       0.76      0.53      0.63       150
    Suicidal       0.45      0.75      0.56       150

    accuracy                           0.59       600
   macro avg       0.62      0.59      0.59       600
weighted avg       0.62      0.59      0.59       600

{'eval_loss': 0.9710158109664917, 'eval_accuracy': 0.5933333333333334, 'eval_f1_macro': 0.5861844990708984, 'eval_precision_macro': 0.6175261678890399, 'eval_recall_macro': 0.5933333333333334}


[I 2025-02-26 21:39:34,531] Trial 26 pruned. 


0,1
eval/accuracy,0.59333
eval/f1_macro,0.58618
eval/loss,0.97102
eval/precision_macro,0.61753
eval/recall_macro,0.59333
eval/runtime,3.1817
eval/samples_per_second,188.58
eval/steps_per_second,5.972
train/epoch,1.0
train/global_step,200.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.007137,0.556667,0.520713,0.543236,0.556667


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          23      16       26        85
Normal               3     135        8         4
Anxiety             10      38       86        16
Suicidal            12      25       23        90

Classification Report:
              precision    recall  f1-score   support

  Depression       0.48      0.15      0.23       150
      Normal       0.63      0.90      0.74       150
     Anxiety       0.60      0.57      0.59       150
    Suicidal       0.46      0.60      0.52       150

    accuracy                           0.56       600
   macro avg       0.54      0.56      0.52       600
weighted avg       0.54      0.56      0.52       600

{'eval_loss': 1.007137417793274, 'eval_accuracy': 0.5566666666666666, 'eval_f1_macro': 0.5207128303099515, 'eval_precision_macro': 0.5432362127747642, 'eval_recall_macro': 0.5566666666666666}


[I 2025-02-26 21:40:39,189] Trial 27 pruned. 


0,1
eval/accuracy,0.55667
eval/f1_macro,0.52071
eval/loss,1.00714
eval/precision_macro,0.54324
eval/recall_macro,0.55667
eval/runtime,3.1708
eval/samples_per_second,189.225
eval/steps_per_second,5.992
train/epoch,1.0
train/global_step,50.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.006171,0.56,0.533441,0.567776,0.56


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          30       9       49        62
Normal               4     124       16         6
Anxiety              7      22      108        13
Suicidal            10      18       48        74

Classification Report:
              precision    recall  f1-score   support

  Depression       0.59      0.20      0.30       150
      Normal       0.72      0.83      0.77       150
     Anxiety       0.49      0.72      0.58       150
    Suicidal       0.48      0.49      0.49       150

    accuracy                           0.56       600
   macro avg       0.57      0.56      0.53       600
weighted avg       0.57      0.56      0.53       600

{'eval_loss': 1.0061705112457275, 'eval_accuracy': 0.56, 'eval_f1_macro': 0.533441366124663, 'eval_precision_macro': 0.5677763593855334, 'eval_recall_macro': 0.5599999999999999}


[I 2025-02-26 21:41:44,634] Trial 28 pruned. 


0,1
eval/accuracy,0.56
eval/f1_macro,0.53344
eval/loss,1.00617
eval/precision_macro,0.56778
eval/recall_macro,0.56
eval/runtime,3.1891
eval/samples_per_second,188.143
eval/steps_per_second,5.958
train/epoch,1.0
train/global_step,100.0


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,No log,1.090234,0.555,0.525553,0.561591,0.555


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression          55      18       53        24
Normal               1     137       11         1
Anxiety             11      31      103         5
Suicidal            26      30       56        38

Classification Report:
              precision    recall  f1-score   support

  Depression       0.59      0.37      0.45       150
      Normal       0.63      0.91      0.75       150
     Anxiety       0.46      0.69      0.55       150
    Suicidal       0.56      0.25      0.35       150

    accuracy                           0.56       600
   macro avg       0.56      0.56      0.53       600
weighted avg       0.56      0.56      0.53       600

{'eval_loss': 1.0902342796325684, 'eval_accuracy': 0.555, 'eval_f1_macro': 0.5255528626217767, 'eval_precision_macro': 0.5615910115512847, 'eval_recall_macro': 0.555}


[I 2025-02-26 21:42:49,304] Trial 29 pruned. 


In [None]:
print("Best Trial:")
print("Params:", best_trial.hyperparameters)

Best Trial:
Params: {'learning_rate': 3.543919690225377e-05, 'gradient_accumulation_steps': 1, 'lr_scheduler_type': 'linear', 'warmup_ratio': 0.09785956688181825, 'weight_decay': 0.028924075405665158}
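
To make the selected `lr_scheduler_type` and `warmup_ratio` more concrete, the following minimal sketch reproduces the shape of a linear schedule with warmup: the learning rate rises linearly from zero during the warmup steps and then decays linearly back to zero. The total of 1000 optimizer steps is an assumed value chosen only for illustration, and `linear_warmup_lr` is a hypothetical helper, not a function of the `transformers` library.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from base_lr to 0.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Best-trial values reported above; total_steps is an assumption.
base_lr = 3.543919690225377e-05
warmup_ratio = 0.09785956688181825
total_steps = 1000

warmup_steps = math.ceil(total_steps * warmup_ratio)
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_lr(step, total_steps, base_lr, warmup_ratio))
```

The peak learning rate is reached exactly at the end of the warmup phase, so a larger `warmup_ratio` delays the point at which the model is trained with the full step size.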


In [None]:
import os

# Change the current working directory
os.chdir("/content/drive/MyDrive/Italian thesis/")

# Verify the current working directory
print("Current Working Directory:", os.getcwd())

Current Working Directory: /content/drive/MyDrive/Italian thesis


In [None]:
import json
json_file_path = "BERT_hyperparameters.json"

# Write hyperparameters to a JSON file
with open(json_file_path, 'w') as json_file:
    json.dump(best_trial.hyperparameters, json_file, indent=4)

print(f"Hyperparameters saved to {json_file_path}")

Hyperparameters saved to BERT_hyperparameters.json
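
The saved file can later be read back so that the tuned values do not have to be copied by hand. The sketch below is one possible way to do this, not the approach used in the notebook: `load_hyperparameters` and `build_training_kwargs` are hypothetical helpers, and a temporary directory stands in for the Drive folder so the example runs anywhere.

```python
import json
import os
import tempfile


def load_hyperparameters(path):
    """Read a hyperparameter dictionary previously written with json.dump."""
    with open(path, "r", encoding="utf-8") as json_file:
        return json.load(json_file)


def build_training_kwargs(tuned, **fixed):
    """Merge fixed settings with tuned values; tuned values take precedence."""
    kwargs = dict(fixed)
    kwargs.update(tuned)
    return kwargs


# Best-trial values as printed above.
best_params = {
    "learning_rate": 3.543919690225377e-05,
    "gradient_accumulation_steps": 1,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.09785956688181825,
    "weight_decay": 0.028924075405665158,
}

# Round trip through a temporary file (assumption: stands in for Google Drive).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "BERT_hyperparameters.json")
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(best_params, json_file, indent=4)
    loaded = load_hyperparameters(path)

training_kwargs = build_training_kwargs(
    loaded,
    output_dir="bert-finetuning-italian",
    num_train_epochs=10,
)
print(training_kwargs)
# The dictionary could then be unpacked as TrainingArguments(**training_kwargs).
```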


### Finetuning

After hyperparameter tuning, the next step is to fine-tune the full model on the mental health dataset using the best-performing hyperparameters.

#### Data Pre-processing

The data is prepared with the same steps used in the hyperparameter tuning process.

In [None]:
X_train = pd.read_csv('/content/drive/MyDrive/Italian thesis/Training dataset/train.csv')
#X_train = X_train.drop(columns=["text"]).rename(columns={"italian text": "text"})
X_train = X_train.dropna()
X_val = pd.read_csv('/content/drive/MyDrive/Italian thesis/Training dataset/val.csv')

#X_val = X_val.drop(columns=["text"]).rename(columns={"italian text": "text"})
X_test = pd.read_csv('/content/drive/MyDrive/Italian thesis/Training dataset/test.csv')
#X_test = X_test.drop(columns=["text"]).rename(columns={"italian text": "text"})
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

Training set size: 11760
Validation set size: 1470
Test set size: 1470


In [None]:
X_train.head()

Unnamed: 0,text,label,italian text
0,Im at the point of the semester where Im so ti...,3,Sono arrivato al punto del semestre in cui son...
1,I could do it I want to do it I have what I ne...,3,Potrei farlo Voglio farlo Ho ciò che mi serve ...
2,I cannot imagine anyone wanting to be in a rel...,3,Non riesco a immaginare qualcuno che voglia av...
3,How I am supposed to live if I cannot accept t...,3,Come dovrei vivere se non riesco ad accettare ...
4,maybe i know but how could you know,1,forse lo so ma come potresti saperlo?


In [None]:
train = Dataset.from_pandas(X_train)
test = Dataset.from_pandas(X_test)
validation = Dataset.from_pandas(X_val)

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["italian text"], truncation=True)


In [None]:
tokenized_train = train.map(preprocess_function, batched=True)
tokenized_test = test.map(preprocess_function, batched=True)
tokenized_val = validation.map(preprocess_function, batched=True)



In [None]:
tokenized_train = tokenized_train.remove_columns(["text", "italian text"])
tokenized_val = tokenized_val.remove_columns(["text", "italian text"])
tokenized_test = tokenized_test.remove_columns(["text", "italian text"])  # If using test data


In [None]:
print(tokenized_test.column_names)


['label', 'input_ids', 'token_type_ids', 'attention_mask']


In [None]:
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=2,
    early_stopping_threshold=0.01
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
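
Because the tokenizer is called with `truncation=True` but without padding, the sequences in a batch still have different lengths; the data collator pads them when the batch is assembled. The following is a simplified illustration of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of 0 is assumed.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad token-id lists to the longest length found in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; real ones come from the tokenizer.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum length keeps batches of short texts small, which matters for tweet-length inputs.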

#### Model Training

The training stage has three steps. First, the metrics used to measure model performance are defined. Next, the trainer and its arguments are set up using the values obtained from the hyperparameter tuning process. Finally, the model is trained.

In [None]:
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Define your class labels
id2label = {0: 'Depression', 1: 'Normal', 2: 'Anxiety', 3: 'Suicidal'}
label2id = {v: k for k, v in id2label.items()}

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Compute accuracy
    accuracy = accuracy_score(labels, predictions)

    # Compute F1 score (macro-average)
    f1 = f1_score(labels, predictions, average='macro')

    # Compute precision and recall (macro-average)
    precision = precision_score(labels, predictions, average='macro')
    recall = recall_score(labels, predictions, average='macro')

    # Compute confusion matrix
    cm = confusion_matrix(labels, predictions)

    # Convert confusion matrix from class IDs to labels
    cm_labels = np.array([id2label[i] for i in range(len(id2label))])
    cm_with_labels = pd.DataFrame(cm, index=cm_labels, columns=cm_labels)

    # Generate the classification report with labels
    class_report = classification_report(labels, predictions, target_names=[id2label[i] for i in range(len(id2label))])

    # Print confusion matrix and classification report
    print("Confusion Matrix:")
    print(cm_with_labels)
    print("\nClassification Report:")
    print(class_report)

    return {
        'accuracy': accuracy,
        'f1_macro': f1,
        'precision_macro': precision,
        'recall_macro': recall,
    }


In [None]:
# Note: learning_rate, lr_scheduler_type and warmup_ratio below differ from the
# best-trial values printed above; weight_decay and gradient_accumulation_steps match.
training_args = TrainingArguments(
    output_dir="bert-finetuning-italian",
    learning_rate=2.1612703354421325e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.028924075405665158,
    gradient_accumulation_steps=1,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.051758482154894515,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping]
)



In [None]:
results = trainer.evaluate(eval_dataset=tokenized_test)
print('Results before finetuning:')
print('-'*15)

print(results)



Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression         110     278        0        12
Normal             118     152        0       130
Anxiety             60     194        0        16
Suicidal            96     292        0        12

Classification Report:
              precision    recall  f1-score   support

  Depression       0.29      0.28      0.28       400
      Normal       0.17      0.38      0.23       400
     Anxiety       0.00      0.00      0.00       270
    Suicidal       0.07      0.03      0.04       400

    accuracy                           0.19      1470
   macro avg       0.13      0.17      0.14      1470
weighted avg       0.14      0.19      0.15      1470


Results before finetuning:
---------------
{'eval_loss': 1.4045946598052979, 'eval_model_preparation_time': 0.0031, 'eval_accuracy': 0.18639455782312925, 'eval_f1_macro': 0.13843013689238293, 'eval_precision_macro': 0.13074635831406797, 'eval_recall_macro': 0.17125, 'eval_runtime': 6.2067, 'eval_samples_per_second': 236.841, 'eval_steps_per_second': 14.823}


In [None]:
trainer.train()


Epoch,Training Loss,Validation Loss,Model Preparation Time,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,1.0928,0.72223,0.0031,0.687755,0.693287,0.720715,0.697778
2,0.6827,0.613302,0.0031,0.738095,0.745604,0.753249,0.74794
3,0.473,0.606387,0.0031,0.754422,0.763796,0.770798,0.760231
4,0.4259,0.710029,0.0031,0.746939,0.757815,0.763959,0.753356
5,0.3187,0.748144,0.0031,0.752381,0.758624,0.763078,0.758657


Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression         148      11       25       216
Normal               5     334        8        53
Anxiety             20       9      219        22
Suicidal            57      25        8       310

Classification Report:
              precision    recall  f1-score   support

  Depression       0.64      0.37      0.47       400
      Normal       0.88      0.83      0.86       400
     Anxiety       0.84      0.81      0.83       270
    Suicidal       0.52      0.78      0.62       400

    accuracy                           0.69      1470
   macro avg       0.72      0.70      0.69      1470
weighted avg       0.71      0.69      0.68      1470

Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression         286      12       29        73
Normal              22     358        7        13
Anxiety             25       9      232         4
Suicidal           161      24        6       209

Clas

TrainOutput(global_step=3675, training_loss=0.5508762043025218, metrics={'train_runtime': 783.2003, 'train_samples_per_second': 150.153, 'train_steps_per_second': 9.385, 'total_flos': 9762697238890752.0, 'train_loss': 0.5508762043025218, 'epoch': 5.0})
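
Training ended after epoch 5 although `num_train_epochs=10`, which is consistent with the early stopping settings (`early_stopping_patience=2`, `early_stopping_threshold=0.01`) applied to the monitored macro F1. The sketch below is a simplified re-implementation of that rule for illustration, not the code of `EarlyStoppingCallback`; `early_stopping_epoch` is a hypothetical helper, and the scores are the validation F1 Macro values from the table above.

```python
def early_stopping_epoch(scores, patience=2, threshold=0.01):
    """Return the epoch at which training stops, or None if it never does."""
    best = None
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score - best > threshold:
            # Improvement larger than the threshold: reset the counter.
            bad_epochs = 0
        else:
            bad_epochs += 1
        if best is None or score > best:
            best = score
        if bad_epochs >= patience:
            return epoch
    return None


# Validation F1 Macro per epoch, copied from the training log above.
f1_macro = [0.693287, 0.745604, 0.763796, 0.757815, 0.758624]
print(early_stopping_epoch(f1_macro))
```

Epochs 4 and 5 both fail to beat the epoch-3 score by more than the threshold, so the patience of two evaluations is exhausted at epoch 5, and `load_best_model_at_end=True` restores the epoch-3 checkpoint.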

### Results

The results of the BERT model show patterns similar to those of the AB-BiLSTM model, but with better overall performance: an accuracy of 77% and a macro F1-score of 78% on the test set. The model has the most trouble with the 'Depression' class, which has the lowest class-specific F1-score at 63%, and it shows the same misclassification problem between this class and the 'Suicidal' class (132 of the 400 'Depression' test samples are predicted as 'Suicidal'). The 'Normal' and 'Anxiety' classes are easier to distinguish from the other two, and the model performs well on both, with F1-scores of 91% and 84% respectively.

In [None]:
results = trainer.evaluate(eval_dataset=tokenized_test)
print('Results after finetuning:')
print('-'*15)
print(results)

Confusion Matrix:
            Depression  Normal  Anxiety  Suicidal
Depression         239       8       21       132
Normal              19     359        8        14
Anxiety             40       6      218         6
Suicidal            60      12        5       323

Classification Report:
              precision    recall  f1-score   support

  Depression       0.67      0.60      0.63       400
      Normal       0.93      0.90      0.91       400
     Anxiety       0.87      0.81      0.84       270
    Suicidal       0.68      0.81      0.74       400

    accuracy                           0.77      1470
   macro avg       0.79      0.78      0.78      1470
weighted avg       0.78      0.77      0.77      1470

Results after finetuning:
---------------
{'eval_loss': 0.5598059892654419, 'eval_model_preparation_time': 0.0031, 'eval_accuracy': 0.7748299319727892, 'eval_f1_macro': 0.7796978245295705, 'eval_precision_macro': 0.7862861657275065, 'eval_recall_macro': 0.7774768518518519,
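
The reported scores can be traced back to the confusion matrix. As a worked check, the sketch below recomputes per-class precision, recall and F1, together with accuracy and macro F1, from the test-set matrix printed above, using plain Python. It assumes the usual convention that rows are true classes and columns are predicted classes; `per_class_scores` is a hypothetical helper.

```python
labels = ["Depression", "Normal", "Anxiety", "Suicidal"]

# Test-set confusion matrix after fine-tuning (rows: true, columns: predicted).
cm = [
    [239, 8, 21, 132],
    [19, 359, 8, 14],
    [40, 6, 218, 6],
    [60, 12, 5, 323],
]


def per_class_scores(cm):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])  # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


scores = per_class_scores(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)

for name, (p, r, f1) in zip(labels, scores):
    print(f"{name:<10} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

Macro averaging gives each class the same weight regardless of its support, which is why the macro F1 can differ slightly from the accuracy when the 'Anxiety' class has fewer samples than the others.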

In [None]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/msab97/bert-finetuning-italian/commit/45e2e9d654301bb7fe4fbdb9566bf8760dbc9a8c', commit_message='End of training', commit_description='', oid='45e2e9d654301bb7fe4fbdb9566bf8760dbc9a8c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/msab97/bert-finetuning-italian', endpoint='https://huggingface.co', repo_type='model', repo_id='msab97/bert-finetuning-italian'), pr_revision=None, pr_num=None)

## Predictions

To build the indicator, the model is used to classify a separate set of tweets. These tweets are sampled (3,000 per day) from a dataset of 15 million tweets covering the first five months of 2020. The same pre-processing steps are applied to this new dataset; the tweets are then used as input to the model, and the final predictions and the confidence values are stored.
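
The sampling step itself is not shown in this notebook. As a rough sketch of how a fixed number of tweets per day could be drawn, the example below groups rows by date and samples from each group. `sample_per_day` is a hypothetical helper, the toy rows are invented, and only the column names `tweet_date` and `testo` are taken from the prediction dataset.

```python
import random
from collections import defaultdict


def sample_per_day(rows, n_per_day, seed=0):
    """Draw up to n_per_day rows for each distinct date."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["tweet_date"]].append(row)
    sampled = []
    for day in sorted(by_day):
        day_rows = by_day[day]
        k = min(n_per_day, len(day_rows))  # days with fewer rows are kept whole
        sampled.extend(rng.sample(day_rows, k))
    return sampled


# Toy data: 5 tweets on one day, 2 on the next.
rows = [{"tweet_date": "2020-01-01", "testo": f"tweet {i}"} for i in range(5)]
rows += [{"tweet_date": "2020-01-02", "testo": f"tweet {i}"} for i in range(2)]

subset = sample_per_day(rows, n_per_day=3)
print(len(subset))
```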

In [None]:


import pandas as pd
import torch
from datasets import Dataset
from huggingface_hub import login
from peft import PeftModel
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)

#login('')


In [3]:
df = pd.read_csv('/content/drive/MyDrive/Italian thesis/Training dataset/italian_with_predictions_and_logits.csv')
df.head()

Unnamed: 0,testo,tweet_date,llama_logits,llama_prediction
0,l istruzione e la ricchezza posson essere sorg...,2020-01-31,"[-2.353515625, 4.1796875, -2.923828125, -2.949...",1
1,ce l abbiamo fatta tutti in pratica parco nord...,2020-01-31,"[0.2763671875, 3.181640625, -5.38671875, 0.232...",1
2,qualcuno poi mi spieghi tutto il credito che s...,2020-01-31,"[-0.0018262863159179688, 2.484375, -7.28125, -...",1
3,aforismi f duva mov stelle e chi lo rappresent...,2020-01-31,"[0.78662109375, 3.923828125, -7.43359375, 0.48...",1
4,di solito quando un prodotto ha bisogno di tan...,2020-01-31,"[-0.41552734375, 5.20703125, -7.1796875, -1.48...",1


In [4]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("msab97/bert-finetuning-italian")
model = AutoModelForSequenceClassification.from_pretrained("msab97/bert-finetuning-italian")


In [5]:
def preprocess_function(examples):
    return tokenizer(examples["testo"], truncation=True)

In [6]:
data = Dataset.from_pandas(df)
tokenized_data = data.map(preprocess_function, batched=True)
tokenized_data = tokenized_data.remove_columns(["testo", "tweet_date"])


Map:   0%|          | 0/456000 [00:00<?, ? examples/s]

In [7]:
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=2,
    early_stopping_threshold=0.01
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [8]:
import numpy as np
import torch
from datasets import Dataset
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from torch.utils.data import DataLoader
from tqdm import tqdm  # Progress bar
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

id2label = {0: 'Depression', 1: 'Normal', 2: 'Anxiety', 3: 'Suicidal'}
label2id = {v: k for k, v in id2label.items()}

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_score(labels, predictions)

    f1 = f1_score(labels, predictions, average='macro')

    cm = confusion_matrix(labels, predictions)

    cm_labels = np.array([id2label[i] for i in range(len(id2label))])
    cm_with_labels = pd.DataFrame(cm, index=cm_labels, columns=cm_labels)

    class_report = classification_report(labels, predictions, target_names=[id2label[i] for i in range(len(id2label))])

    print("Confusion Matrix:")
    print(cm_with_labels)
    print("\nClassification Report:")
    print(class_report)

    return {
        'accuracy': accuracy,
        'f1_macro': f1,
    }

In [9]:
training_args = TrainingArguments(
    output_dir="bert-finetuning-italian",
    per_device_eval_batch_size=128,
    logging_dir='./logs',
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

In [None]:
predictions = trainer.predict(tokenized_data)

logits = torch.tensor(predictions.predictions)

df['bert_logits'] = logits.tolist()

predicted_classes = torch.argmax(logits, dim=-1).numpy()
df['bert_prediction'] = predicted_classes


In [11]:
df.to_csv("/content/drive/MyDrive/Italian thesis/Training dataset/italian_with_predictions_and_logits.csv", index=False)

In [12]:
df.head()

Unnamed: 0,testo,tweet_date,llama_logits,llama_prediction,bert_logits,bert_prediction
0,l istruzione e la ricchezza posson essere sorg...,2020-01-31,"[-2.353515625, 4.1796875, -2.923828125, -2.949...",1,"[-0.4601691961288452, 3.44775390625, -1.441370...",1
1,ce l abbiamo fatta tutti in pratica parco nord...,2020-01-31,"[0.2763671875, 3.181640625, -5.38671875, 0.232...",1,"[-1.0215137004852295, 4.273667335510254, -1.83...",1
2,qualcuno poi mi spieghi tutto il credito che s...,2020-01-31,"[-0.0018262863159179688, 2.484375, -7.28125, -...",1,"[1.2525972127914429, 0.6674477458000183, -1.16...",0
3,aforismi f duva mov stelle e chi lo rappresent...,2020-01-31,"[0.78662109375, 3.923828125, -7.43359375, 0.48...",1,"[-0.9413514137268066, 4.013585567474365, -0.89...",1
4,di solito quando un prodotto ha bisogno di tan...,2020-01-31,"[-0.41552734375, 5.20703125, -7.1796875, -1.48...",1,"[-0.9259320497512817, 3.9972760677337646, -1.7...",1
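
The stored logits are unnormalised scores. One common way to turn them into confidence values is a softmax over the four classes; whether the thesis uses exactly this transformation later on is an assumption. The sketch below also parses the logits from a string, because a list column written with `to_csv` is read back as text. The logit values are illustrative and not taken from the dataset.

```python
import ast
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "Depression", 1: "Normal", 2: "Anxiety", 3: "Suicidal"}

# Illustrative value only: how a logits cell looks after reading the CSV back.
stored = "[-0.5, 3.4, -1.4, -0.9]"
logits = ast.literal_eval(stored)

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted], round(probs[predicted], 3))
```

The class with the largest probability is the same as the class with the largest logit, so the stored `bert_prediction` column stays consistent with the softmax output.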
