## Fine-Tuning of Open Source Model for Improved Sentiment Analysis
After an OpenAI GPT model was used via the API to classify the sentiment of 5000 forum posts, these classified posts are used here to fine-tune the open-source model "distilbert-base-multilingual-cased-sentiments-student" by lxyuan.

Organization of the data:
- Of the approximately 5000 GPT-classified forum posts, about 4000 are used as training data and about 1000 as validation data.
- The 100 human-classified forum posts serve for the final evaluation.

Description of the model: The model is a distilled version of a zero-shot classification model, trained on a multilingual sentiment dataset. Although the dataset is annotated, the annotations were ignored during distillation in order to simulate zero-shot learning.
- Teacher model: MoritzLaurer/mDeBERTa-v3-base-mnli-xnli
- Teacher hypothesis template: "The sentiment of this text is {}."
- Student model: distilbert-base-multilingual-cased

Source: https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student
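
The role of the teacher hypothesis template can be illustrated with a minimal sketch. In NLI-based zero-shot classification, one hypothesis is formed per candidate label, and the teacher scores how strongly the text entails each hypothesis. The label names below are an assumption made for illustration; this notebook only states that the model has three labels.

```python
# Template quoted in the model description above.
HYPOTHESIS_TEMPLATE = "The sentiment of this text is {}."

# Assumption: illustrative label names, not taken from this notebook.
CANDIDATE_LABELS = ["positive", "neutral", "negative"]


def build_hypotheses(template, labels):
    """Fill the template once per candidate label."""
    return [template.format(label) for label in labels]


hypotheses = build_hypotheses(HYPOTHESIS_TEMPLATE, CANDIDATE_LABELS)
for hypothesis in hypotheses:
    print(hypothesis)
```

The teacher would score each (post, hypothesis) pair, and the student is then trained to reproduce the resulting label distribution directly from the post.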

For more details on using the Hugging Face transformers library for fine-tuning, see:
- https://huggingface.co/docs/transformers/v4.15.0/training
- https://colab.research.google.com/github/huggingface/notebooks/blob/master/transformers_doc/training.ipynb#scrollTo=5JPNwrsB21ct

In [None]:
# Transformers installation
! pip install -q transformers[torch] datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

In [None]:
# installing the triton library, which PyTorch's torch.compile relies on (used via torch_compile=True during training)
!pip install triton
import torch._dynamo

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer
import copy
import yaml
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
import os
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
with open("/content/drive/MyDrive/github_projects/fine_tuning_ai_for_sentiments/config/config.yaml", "r") as f:
  config = yaml.safe_load(f)

# import "time" to allow the use of current date for output filenames
import time
TodaysDate = time.strftime("%Y_%m_%d_")

### loading the training data and preparing the data frames for fine-tuning

In [None]:
# loading the clean and refined training data
df_GPT_classified_data = pd.read_csv(config["project_path"]+config["data_processed_dir"]+"df_forum_posts_5000_train_classification_by_GPT_model_completed_refined_valid.csv")
display(df_GPT_classified_data)
# verifying that the sentiment column is of type integer and all other column types are correct
df_GPT_classified_data.info()

Unnamed: 0,ID,text,datetime,company,sentiment_gpt-4-turbo-preview
0,330,">>Wer genau hinschaut, erkennt die Sinnlosigk...",2019-06-19 18:04:05,1_und_1_Drillisch,2
1,429,Der Markt dürfte für Drillisch enger durch di...,2018-10-28 20:27:34,1_und_1_Drillisch,2
2,607,27.10.15 13:12 aktiencheck.de Maintal (www.a...,2015-11-02 10:04:41,1_und_1_Drillisch,2
3,695,ich habs auf aktiecheck gefunden gruss Tageshoch,2015-01-26 17:35:36,1_und_1_Drillisch,1
4,875,07.07.14 16:07 Bankhaus Lampe Düsseldorf (...,2014-07-08 09:32:19,1_und_1_Drillisch,0
...,...,...,...,...,...
4994,669305,Sie bekommen doch aber jeden Monat ihre Monat...,2020-09-18 18:29:51,Grenke,1
4995,669384,dpa-AFX: *GRENKE-CEO: VICEROY HAT VIEL POR...,2020-09-18 15:17:02,Grenke,2
4996,669405,"gepostet? Verkauf 33,x - Leihgebühr je nach D...",2020-09-18 14:33:10,Grenke,1
4997,669556,Nachbörse sieht auch mau aus...,2020-09-17 17:37:51,Grenke,2


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   ID                             4999 non-null   int64 
 1   text                           4999 non-null   object
 2   datetime                       4999 non-null   object
 3   company                        4999 non-null   object
 4   sentiment_gpt-4-turbo-preview  4999 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 195.4+ KB


In [None]:
# preparing the data to suit the format for fine-tuning
df_GPT_classified_data_processed = (
    df_GPT_classified_data
    .drop(["ID", "datetime", "company"], axis=1)                                # dropping unused columns
    .rename(columns={"sentiment_"+config["train_data_eval_model1"] : "label"})  # renaming the sentiment column to "label"
    .set_index("label")                                                         # setting the label column as index
)
df_GPT_classified_data_processed

label,text
2,">>Wer genau hinschaut, erkennt die Sinnlosigk..."
2,Der Markt dürfte für Drillisch enger durch di...
2,27.10.15 13:12 aktiencheck.de Maintal (www.a...
1,ich habs auf aktiecheck gefunden gruss Tageshoch
0,07.07.14 16:07 Bankhaus Lampe Düsseldorf (...
...,...
1,Sie bekommen doch aber jeden Monat ihre Monat...
2,dpa-AFX: *GRENKE-CEO: VICEROY HAT VIEL POR...
1,"gepostet? Verkauf 33,x - Leihgebühr je nach D..."
2,Nachbörse sieht auch mau aus...


In [None]:
# separate the approx 5000 forum posts into 80% train data and 20% for validation
# use a seed to make the selection reproducible
dataset = Dataset.from_pandas(df_GPT_classified_data_processed)
dataset = dataset.train_test_split(test_size=0.20, shuffle = True, seed = 3)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3999
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
})
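
The row counts above (3999 and 1000) follow from applying a 20% test share to the 4999 loaded posts. A small sketch of the arithmetic, assuming the test share is rounded up and the remainder goes to training, which matches the counts shown:

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(4999, 0.20)
print(n_train, n_test)
```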

In [None]:
# verification of properly accessing label and text for an exemplary forum post
display(dataset["train"][7]["label"])
display("----------")
display(dataset["train"][7]["text"])

1

'----------'

'    Das EPS liegt derzeit bei 1,80 statt 2,30. Vielleicht sind 2017 wieder 2,00 möglich. 32 wäre dann ein KGV von 16. 37,5 wäre dann knapp 19. Wäre kurzfristig möglich angesichts der solideren Bilanz als Wettbewerber. Allerdings heißt Nullzinsen nicht, dass das KGV deswegen in den Himmel wächst. Ich würde mich da eher am WACC orientieren, den man meist irgendwo im Anhang des Geschäftsberichtes findet, v.a. wenn es um den Werthaltigkeitstest geht. Da sind eher 6-10% eine übliche Hausnummer bei normalen Bluechips. Von daher ist das KGV vom DAX mit ca. 12 nicht wirklich verwunderlich und auch kein Schnäppchen, sondern  fair.  Mit der Divi kann man jedenfalls auch bei Kursen um 25 ganz gut auskommen. Wenn es doch noch einen richtigen Hype um die Aktie gibt, werde ich halt die Gewinne mitnehmen und einfach die nächste Straßenbahn nehmen. '

### tokenizing the forum posts

In [None]:
# making a copy of the "dataset" to retain the original object for potential analysis
dataset_to_tokenize = copy.deepcopy(dataset)

# loading the tokenizer which was used for the "distilbert-base-multilingual-cased-sentiments-student"
tokenizer = AutoTokenizer.from_pretrained("lxyuan/distilbert-base-multilingual-cased-sentiments-student")

# defining a tokenize function to specify the tokenization parameters
def tokenize_function(training_dataset):
    tokens = tokenizer(training_dataset["text"], padding="max_length", truncation=True)
    return tokens

# tokenizing the forum posts of the dataset
tokenized_dataset = dataset_to_tokenize.map(
    tokenize_function,
    batched=True
)

tokenized_dataset

Map:   0%|          | 0/3999 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3999
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
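
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the token ids are made up, special tokens are ignored, and the padding id of 0 and the tiny maximum length are assumptions chosen for readability (the real limit for this checkpoint is presumably 512 tokens).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for fixed-length padding with truncation."""
    kept = token_ids[:max_length]      # truncation: drop everything past the limit
    n_pad = max_length - len(kept)     # padding: fill up to the limit
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding that the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_post = pad_or_truncate([11, 12, 13], max_length=6)
long_post = pad_or_truncate([21, 22, 23, 24, 25, 26, 27, 28], max_length=6)
print(short_post)
print(long_post)
```

Every post ends up with the same length, which is why the tokenized dataset gains the `input_ids` and `attention_mask` columns shown above.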

### establishing the metrics

In [None]:
def compute_metrics(p):
    """
    computes various metrics including accuracy, macro average for precision, recall and F1 score
    as well as the precision, recall and F1 score for each individual class.

    parameters:
    - p (object) : an object which contains "predictions" and "label_ids",
                   and is provided by the Hugging Face's Trainer during the evaluation phase.

    returns:
    - dict: a dictionary containing the accuracy, the macro average for precision, recall and F1 score
            and precision, recall and F1 score for each individual class.
    """

    # getting the predicted label by taking the argmax of the predictions
    predictions = np.argmax(p.predictions, axis=1)

    # extracting the labels
    labels = p.label_ids

    # computing the accuracy
    accuracy = accuracy_score(labels, predictions)

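    # computing macro average precision, recall and F1 score
    # (zero_division=0 matches the per-class calls below and avoids warnings for classes that are never predicted)
    macro_average_precision = precision_score(labels, predictions, average="macro", zero_division=0)
    macro_average_recall = recall_score(labels, predictions, average="macro", zero_division=0)
    macro_average_f1 = f1_score(labels, predictions, average="macro", zero_division=0)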

    # computing per-class precision, recall and F1 score
    precision = precision_score(labels, predictions, average=None, zero_division=0)
    recall = recall_score(labels, predictions, average=None, zero_division=0)
    f1 = f1_score(labels, predictions, average=None, zero_division=0)

    # initiating a dictionary to store the accuracy and macro average metrics
    metrics = {
        "accuracy": accuracy,
        "macro_average_precision": macro_average_precision,
        "macro_average_recall": macro_average_recall,
        "macro_average_f1": macro_average_f1
    }

    # adding per-class precision, recall, and f1 metrics to the dictionary
    for i, (prec, rec, f1_val) in enumerate(zip(precision, recall, f1)):
        metrics[f'precision_class_{i}'] = prec
        metrics[f'recall_class_{i}'] = rec
        metrics[f'f1_class_{i}'] = f1_val

    #optional print statement
    #print(metrics)

    return metrics
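
What the per-class and macro-averaged values mean can be checked by hand on a tiny example. The sketch below recomputes them in plain Python for six made-up labels and predictions; it only illustrates the definitions and is not a replacement for the scikit-learn calls used above.

```python
def per_class_prf(labels, predictions, num_classes):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    results = []
    for c in range(num_classes):
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = sum(1 for y, p in pairs if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Made-up example data with three classes
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

per_class = per_class_prf(labels, predictions, num_classes=3)
accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
# Macro average: unweighted mean over classes
macro_f1 = sum(f1 for _, _, f1 in per_class) / len(per_class)
print(accuracy, macro_f1, per_class)
```

Because the macro average weights every class equally, a weak minority class lowers it more than it lowers accuracy.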

### fine-tuning of the model with different hyperparameter sets

In [None]:
# set_01 - set_04: batch size of 8 and varying the learning rate (1e-6 to 1e-3)
# set_05 - set_06: batch size of 8 with gradient accumulation of 2 and 4
# set_07 - set_08: batch size of 16 with learning rates 1e-5 and 1e-4
# set_09 - set_10: omitted
# set_11 - set_14: batch size of 32 with varying learning rates
# set_15 - set_16: top-performing hyperparameters with 50 epochs to test stability and overfitting tendencies

# save step is set to save approximately three checkpoints per model

hyperparameter_sets  = [
    # {"name": "set_01", "learning_rate": 1e-6, "gradient_accumulation": 1, "batch_size": 8, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 1600, "eval_strategy": "steps",  "eval_steps": 200, "logging_strategy": "steps", "logging_steps": 100},

    # {"name": "set_02", "learning_rate": 1e-5, "gradient_accumulation": 1, "batch_size": 8, "total_epochs": 10, "note" : "",
    # "save_strategy": "steps", "save_steps": 1600, "eval_strategy": "steps", "eval_steps": 100, "logging_strategy": "steps", "logging_steps": 50},

    # {"name": "set_03", "learning_rate": 1e-4, "gradient_accumulation": 1, "batch_size": 8, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 1600, "eval_strategy": "steps", "eval_steps": 100, "logging_strategy": "steps", "logging_steps": 50},

    # {"name": "set_04", "learning_rate": 1e-3, "gradient_accumulation": 1, "batch_size": 8, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 1600, "eval_strategy": "steps", "eval_steps": 50, "logging_strategy": "steps", "logging_steps": 25},

    # {"name": "set_05", "learning_rate": 1e-5, "gradient_accumulation": 2, "batch_size": 8, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 1600, "eval_strategy": "steps", "eval_steps": 100, "logging_strategy": "steps", "logging_steps": 50},

    # {"name": "set_06", "learning_rate": 1e-5, "gradient_accumulation": 4, "batch_size": 8, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 1600, "eval_strategy": "steps", "eval_steps": 200, "logging_strategy": "steps", "logging_steps": 100},

    # {"name": "set_07", "learning_rate": 1e-5, "gradient_accumulation": 1, "batch_size": 16, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 800, "eval_strategy": "steps", "eval_steps": 100, "logging_strategy": "steps", "logging_steps": 50},

    # {"name": "set_08", "learning_rate": 1e-4, "gradient_accumulation": 1, "batch_size": 16, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 800, "eval_strategy": "steps", "eval_steps": 100, "logging_strategy": "steps", "logging_steps": 50},

    # {"name": "set_11", "learning_rate": 1e-6, "gradient_accumulation": 1, "batch_size": 32, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 400, "eval_strategy": "steps", "eval_steps": 200, "logging_strategy": "steps", "logging_steps": 100},

    # {"name": "set_12", "learning_rate": 1e-5, "gradient_accumulation": 1, "batch_size": 32, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 400, "eval_strategy": "steps", "eval_steps": 120, "logging_strategy": "steps", "logging_steps": 60},

    # {"name": "set_13", "learning_rate": 1e-4, "gradient_accumulation": 1, "batch_size": 32, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 400, "eval_strategy": "steps", "eval_steps": 120, "logging_strategy": "steps", "logging_steps": 60},

    # {"name": "set_14", "learning_rate": 5e-6, "gradient_accumulation": 1, "batch_size": 32, "total_epochs": 10, "note": "",
    # "save_strategy": "steps", "save_steps": 400, "eval_strategy": "steps", "eval_steps": 120, "logging_strategy": "steps", "logging_steps": 60},

    # {"name": "set_15", "learning_rate": 1e-6, "gradient_accumulation": 1, "batch_size": 32, "total_epochs": 50, "note": "",
    # "save_strategy": "steps", "save_steps": 2000, "eval_strategy": "steps", "eval_steps": 500, "logging_strategy": "steps", "logging_steps": 250},

    # {"name": "set_16", "learning_rate": 5e-6, "gradient_accumulation": 1, "batch_size": 32, "total_epochs": 50, "note": "",
    # "save_strategy": "steps", "save_steps": 2000, "eval_strategy": "steps", "eval_steps": 500, "logging_strategy": "steps", "logging_steps": 250}
    ]

# iterate through each set of hyperparameters
for hyperparameters in hyperparameter_sets:
  # reinstantiating the model
  model = AutoModelForSequenceClassification.from_pretrained(
      "lxyuan/distilbert-base-multilingual-cased-sentiments-student",
      num_labels=3
  )

  # loading the hyperparameters from the dictionary
  name = hyperparameters["name"]
  learning_rate = hyperparameters["learning_rate"]
  gradient_accumulation = hyperparameters["gradient_accumulation"]
  batch_size = hyperparameters["batch_size"]
  total_epochs = hyperparameters["total_epochs"]
  note = hyperparameters["note"]
  save_strategy = hyperparameters["save_strategy"]
  save_steps = hyperparameters["save_steps"]
  eval_strategy = hyperparameters["eval_strategy"]
  eval_steps = hyperparameters["eval_steps"]
  logging_strategy = hyperparameters["logging_strategy"]
  logging_steps = hyperparameters["logging_steps"]

  # setting up the training_args
  training_args = TrainingArguments(
      output_dir = config["project_path"] + config["models_dir"] +f"{name}_lr{learning_rate}_gracc{gradient_accumulation}_bs{batch_size}_te{total_epochs}_note{note}",
      save_strategy = save_strategy,
      save_steps = save_steps,

      logging_dir = config["project_path"] + config["data_tensorboard_logs"] + f"{name}_lr{learning_rate}_gracc{gradient_accumulation}_bs{batch_size}_te{total_epochs}_note{note}",
      logging_steps = logging_steps,

      eval_steps = eval_steps,
      eval_strategy = eval_strategy,
      per_device_train_batch_size = batch_size,
      logging_strategy = logging_strategy,
      torch_compile = True,
      gradient_accumulation_steps = gradient_accumulation,
      learning_rate = learning_rate,
      num_train_epochs = total_epochs,
      report_to = "tensorboard"
      )

  # setting up the trainer
  trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics
    )

  # training and evaluation of the model       # Question: OK, so presumably both are then sent to TensorBoard
  print(f"The following data belong to: {name}_lr{learning_rate}_gracc{gradient_accumulation}_bs{batch_size}_te{total_epochs}_note{note}")
  trainer.train()
  trainer.evaluate()
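
The comment in the hyperparameter cell says that `save_steps` is chosen to give approximately three checkpoints per model. A rough way to check this is to estimate the number of optimizer steps per run. The sketch below assumes a single device, 3999 training rows, and one optimizer step per `gradient_accumulation` batches (rounded down); the Trainer's exact rounding may differ between versions, so the numbers are estimates only.

```python
import math

N_TRAIN = 3999  # training rows reported by the train/test split


def optimizer_steps(n_train, batch_size, grad_accum, epochs):
    """Estimate the total number of optimizer steps for one run."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    # Assumption: one optimizer step per `grad_accum` batches, rounded down
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


def checkpoints_saved(total_steps, save_steps):
    """Number of multiples of save_steps reached during training."""
    return total_steps // save_steps


# (batch_size, grad_accum, epochs, save_steps) copied from selected sets
examples = {
    "set_01": (8, 1, 10, 1600),
    "set_05": (8, 2, 10, 1600),
    "set_06": (8, 4, 10, 1600),
    "set_07": (16, 1, 10, 800),
    "set_11": (32, 1, 10, 400),
    "set_15": (32, 1, 50, 2000),
}

estimates = {}
for name, (bs, ga, ep, save) in examples.items():
    total = optimizer_steps(N_TRAIN, bs, ga, ep)
    estimates[name] = (total, checkpoints_saved(total, save))
    print(name, "steps:", total, "checkpoints:", estimates[name][1])
```

Under this estimate, the sets without gradient accumulation reach three checkpoints, while set_05 and set_06 would reach fewer, because gradient accumulation reduces the number of optimizer steps while `save_steps` stays at 1600.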

### Tensorboard evaluations

In [None]:
# loading the tensorboard to view the metrics and training logs
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [None]:
# setting the directory for the logs and launching the tensorboard
tensorboard_logs_dir = os.path.join(config["project_path"], config["data_tensorboard_logs"])
%tensorboard --logdir "$tensorboard_logs_dir"