# Assignment 3 Natural Language Processing - Part 2 Multi Label Classification


Binary classification predicts whether an instance belongs to a single class, yielding a 0 or 1 outcome per instance. Multi-label classification instead allows each instance to be tagged with several labels from a set; here, the five personality traits of the OCEAN model are predicted simultaneously. The deep learning model must therefore output a probability for each class rather than a single binary outcome, and a threshold is applied to these probabilities to decide whether each trait is present. For the transformer model, this means the final layer changes from a single neuron with a sigmoid activation (binary classification) to one neuron per class, each with its own sigmoid activation so that the probabilities are predicted independently. The loss function changes accordingly to binary cross-entropy applied independently to each class label. The model thus learns all five trait predictions jointly from one shared encoder, so relationships between traits can be picked up in the shared representation even though the loss treats each label separately.
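The output layer and loss described above can be illustrated with a minimal, dependency-free sketch. The logits are hypothetical values chosen for illustration, not outputs of the trained model, and the helper names `predict_traits` and `multilabel_bce` are assumptions introduced here.

```python
import math

TRAITS = ["cEXT", "cNEU", "cAGR", "cCON", "cOPN"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_traits(logits, threshold=0.5):
    """Turn one logit per trait into independent probabilities and 0/1 decisions."""
    probs = [sigmoid(z) for z in logits]
    return probs, [int(p >= threshold) for p in probs]

def multilabel_bce(logits, targets):
    """Binary cross-entropy computed per label and averaged over the labels."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Hypothetical logits for a single text (not taken from the trained model)
logits = [-1.2, 0.8, 2.0, -0.3, 0.1]
targets = [0.0, 1.0, 1.0, 0.0, 1.0]

probs, decisions = predict_traits(logits)
loss = multilabel_bce(logits, targets)
print(dict(zip(TRAITS, decisions)), round(loss, 4))
```

A useful reference point when reading validation losses: a model that outputs a logit of 0 (probability 0.5) for every trait has a loss of ln 2 ≈ 0.693, regardless of the targets.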

## Adaptation to new task

Moving to the multi-label classification phase, we continue to use both BERT and RoBERTa as baseline models, now splitting each text into chunks so that the full text length is covered. (Using Longformer to process the full input sequence in one pass would be preferable, but resource limitations make it impractical as a baseline.) We adapt our dataset loading to multi-label conventions, converting the target variables to floating-point numbers, because the binary cross-entropy loss expects float targets to compare against the predicted probability for each of the five personality trait classes.
For initial parameter selection, we take the parameters from an example of multi-label classification of long texts with transformers (Longformer Multilabel text Classification · Jesus Leal, 2021): a learning rate of 2e-5, an effective batch size of 128, and 5 training epochs. We thus diverge from the default parameters used for binary classification with BERT and RoBERTa and instead adopt parameters suggested by a similar multi-label task. This establishes a new baseline tailored to multi-label prediction while keeping a link to our initial approach through the consistent use of these two transformer models.
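The chunking idea can be sketched with plain Python lists. This is a simplified illustration, not the notebook's implementation: the function name `chunk_token_ids` is hypothetical, the token IDs are stand-ins, and the special-token IDs are placeholders that would normally be read from the tokenizer (`cls_token_id`, `sep_token_id`, `pad_token_id`).

```python
def chunk_token_ids(token_ids, cls_id, sep_id, pad_id, chunk_size=510):
    """Split a token-ID list into fixed-length chunks with special tokens and padding."""
    max_len = chunk_size + 2  # room for the start and end tokens
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        body = token_ids[start:start + chunk_size]
        ids = [cls_id] + body + [sep_id]
        mask = [1] * len(ids)
        padding = max_len - len(ids)
        chunks.append({
            "input_ids": ids + [pad_id] * padding,
            "attention_mask": mask + [0] * padding,
        })
    return chunks

# Stand-in token IDs for a text of 820 tokens; special-token IDs are placeholders
fake_ids = list(range(1000, 1820))
chunks = chunk_token_ids(fake_ids, cls_id=0, sep_id=2, pad_id=1)
print(len(chunks), [sum(c["attention_mask"]) for c in chunks])
```

A text of 820 tokens yields two chunks of length 512: the first is completely filled, the second holds the remaining 310 tokens plus the two special tokens and is padded. Each chunk inherits the labels of the full text.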

To evaluate model performance, we adopt metrics suited to multi-label contexts: the micro-averaged F1 score and the ROC-AUC score. The micro-averaged F1 score gives insight into overall performance by pooling the true positives, false positives, and false negatives across all labels, while the ROC-AUC score assesses the trade-off between the true positive rate and the false positive rate across the traits. Accuracy is also reported, but it is harder to interpret in a multi-label scenario because it requires a perfect match across all five labels. Hence, we do not attach much importance to the accuracy score. By employing both BERT and RoBERTa for these initial multi-label experiments, we establish a comprehensive baseline against which to measure the improvements of any further model optimizations or architectural changes.
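The difference between micro-averaged F1 and exact-match accuracy can be made concrete with a small sketch in plain Python. The labels below are hypothetical and only serve to illustrate the two definitions; the function names are introduced here for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over every (sample, label) pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted labels all match the true labels."""
    hits = sum(int(list(t) == list(p)) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical labels for four texts (columns: cEXT, cNEU, cAGR, cCON, cOPN)
y_true = [[0, 1, 1, 0, 1], [0, 0, 1, 0, 0], [0, 1, 0, 1, 1], [1, 0, 1, 1, 0]]
y_pred = [[0, 1, 1, 0, 1], [0, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 0, 0, 1, 0]]

print(micro_f1(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

Only the first text is predicted perfectly, so the exact-match accuracy is 0.25 although the micro-averaged F1 is 0.8. Under the simplifying assumption of five balanced, independently guessed labels, a chance-level model would match all five with probability 0.5^5 ≈ 0.03, which is why low accuracy values are expected in this setting.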


In [None]:
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from sklearn.datasets import make_multilabel_classification
import numpy as np


In [None]:
!pip install nvidia-ml-py3
!pip install scikit-multilearn
!pip install evaluate
!pip install datasets
! pip install -U accelerate
! pip install -U transformers


In [None]:
from pynvml import *


def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

In [None]:
print_gpu_utilization()

GPU memory occupied: 233 MB.


In [None]:

answers = pd.read_csv("drive/MyDrive/Levy/english_texts.csv")

In [None]:

answers

Unnamed: 0.1,Unnamed: 0,TEXT,TEXT_NL,cEXT,cNEU,cAGR,cCON,cOPN
0,0,"Well, right now I just woke up from a mid-day ...","Nou, op dit moment werd ik net wakker na een m...",0,1,1,0,1
1,1,"Well, here we go with the stream of consciousn...","Nou, hier gaan we met de stroom van bewustzijn...",0,0,1,0,0
2,2,An open keyboard and buttons to push. The thin...,Een open toetsenbord en knoppen om in te drukk...,0,1,0,1,1
3,3,I can't believe it! It's really happening! M...,Ik kan het niet geloven! Het gebeurt echt! Mij...,1,0,1,1,0
4,4,"Well, here I go with the good old stream of co...","Welnu, hier ga ik weer met de goede oude stroo...",1,0,1,0,1
...,...,...,...,...,...,...,...,...
2958,2958,I am motivated on a day to day basis by the ne...,Ik word dagelijks gemotiveerd door de noodzaak...,1,0,0,1,1
2959,2959,My son is the biggest part of my life and with...,Mijn zoon is het grootste deel van mijn leven ...,1,1,0,0,0
2960,2960,My kids and grandkids are what keeps me motiva...,Mijn kinderen en kleinkinderen houden me elke ...,1,0,1,1,0
2961,2961,My biggest drive is to earn money so I can ret...,Mijn grootste drijfveer is om geld te verdiene...,0,0,0,0,0


In [None]:
from skmultilearn.model_selection import iterative_train_test_split

x = answers['TEXT']
y = answers[['cEXT',	'cNEU',	'cAGR',	'cCON',	'cOPN']]


# Convert the labels DataFrame to a numpy array
y_array = y.to_numpy()

# Iterative stratification to split the dataset
x_array = x.to_numpy().reshape(-1, 1)
x_train_it, y_train_it, x_test_it, y_test_it = iterative_train_test_split(x_array, y_array, test_size = 0.2)

# Display the shapes of the train and test sets after iterative stratification
print(x_train_it.shape, x_test_it.shape, y_train_it.shape, y_test_it.shape)


(2370, 1) (593, 1) (2370, 5) (593, 5)


In [None]:
# Creating a list of dictionaries for the test set
test_data = [{'label': label, 'text': text} for label, text in zip(y_test_it, x_test_it)]
train_data = [{'label': label, 'text': text} for label, text in zip(y_train_it, x_train_it)]

# Creating the final data structure
multilabel_df = {
    'train': train_data,
    'test': test_data
}

In [None]:
multilabel_df['train'][0]

{'label': array([0, 1, 1, 0, 1]),
 'text': array(['Well, right now I just woke up from a mid-day nap. It\'s sort of weird, but ever since I moved to Texas, I have had problems concentrating on things. I remember starting my homework in  10th grade as soon as the clock struck 4 and not stopping until it was done. Of course it was easier, but I still did it. But when I moved here, the homework got a little more challenging and there was a lot more busy work, and so I decided not to spend hours doing it, and just getting by. But the thing was that I always paid attention in class and just plain out knew the stuff, and now that I look back, if I had really worked hard and stayed on track the last two years without getting  lazy, I would have been a genius, but hey, that\'s all good. It\'s too late to correct the past, but I don\'t really know how to stay focused n the future. The one thing I know is that when  people say that b/c they live on campus they can\'t concentrate, it\'s b. s. For

In [None]:
def label_distribution(data):
    # Initialize counts for each label
    label_counts = {i: {'1': 0, '0': 0} for i in range(5)}

    # Iterate over each entry and count the label occurrences
    for entry in data:
        labels = entry['label']
        for i, label in enumerate(labels):
            label_str = str(int(label))  # Convert label to string (either '1' or '0')
            label_counts[i][label_str] += 1

    return label_counts

# Calculating label distributions for train and test sets
train_label_distribution = label_distribution(multilabel_df['train'])
test_label_distribution = label_distribution(multilabel_df['test'])

train_label_distribution, test_label_distribution

({0: {'1': 1193, '0': 1177},
  1: {'1': 1186, '0': 1184},
  2: {'1': 1230, '0': 1140},
  3: {'1': 1214, '0': 1156},
  4: {'1': 1145, '0': 1225}},
 {0: {'1': 298, '0': 295},
  1: {'1': 296, '0': 297},
  2: {'1': 307, '0': 286},
  3: {'1': 304, '0': 289},
  4: {'1': 286, '0': 307}})

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)


In [None]:
import numpy as np
import torch

def tokenize_and_chunk(texts, labels, tokenizer, chunk_size=510):
    input_id_chunks = []
    mask_chunks = []
    chunked_labels = []

    # Take the special-token IDs from the tokenizer itself; the BERT IDs
    # 101/102 and a padding ID of 0 do not apply to roberta-base
    cls_id = tokenizer.cls_token_id
    sep_id = tokenizer.sep_token_id
    pad_id = tokenizer.pad_token_id
    max_length = chunk_size + 2  # chunk plus the two special tokens

    for text, label in zip(texts, labels):
        # Each text is a 1-element array (from the reshape(-1, 1) before the split);
        # unwrap it so that str() does not add brackets, quotes, and escapes
        if isinstance(text, (list, tuple, np.ndarray)):
            text = text[0]
        text = str(text)

        # Tokenize without special tokens
        tokens = tokenizer.encode_plus(text, add_special_tokens=False,
                                       return_tensors='pt', truncation=False)

        # Split into chunks of `chunk_size`
        input_ids = tokens['input_ids'][0]
        attention_mask = tokens['attention_mask'][0]
        num_chunks = (len(input_ids) // chunk_size) + int(len(input_ids) % chunk_size != 0)

        for i in range(num_chunks):
            # Define the start and end of the chunk
            start = i * chunk_size
            end = start + chunk_size

            # Extract chunks for input IDs and attention mask
            input_ids_chunk = input_ids[start:end]
            attention_mask_chunk = attention_mask[start:end]

            # Add the start-of-sequence and end-of-sequence tokens
            input_ids_chunk = torch.tensor([cls_id] + input_ids_chunk.tolist() + [sep_id])
            attention_mask_chunk = torch.tensor([1] + attention_mask_chunk.tolist() + [1])

            # Pad the sequences with the tokenizer's padding ID (mask value 0)
            padding_length = max_length - len(input_ids_chunk)
            input_ids_chunk = torch.cat([input_ids_chunk, torch.full((padding_length,), pad_id, dtype=torch.long)])
            attention_mask_chunk = torch.cat([attention_mask_chunk, torch.zeros(padding_length, dtype=torch.long)])

            # Store the chunks; every chunk inherits the labels of its source text
            input_id_chunks.append(input_ids_chunk)
            mask_chunks.append(attention_mask_chunk)
            chunked_labels.append(torch.tensor(label, dtype=torch.float))  # Convert label list to tensor of floats

    # Convert lists to tensors
    input_ids_tensor = torch.stack(input_id_chunks)
    attention_mask_tensor = torch.stack(mask_chunks)
    labels_tensor = torch.stack(chunked_labels)  # Stack label tensors

    return input_ids_tensor, attention_mask_tensor, labels_tensor


# Training data
train_texts = [item['text'] for item in multilabel_df['train']]
train_labels = [item['label'] for item in multilabel_df['train']]
input_ids_tensor_train, attention_mask_tensor_train, labels_tensor_train = tokenize_and_chunk(train_texts, train_labels, tokenizer)

# Preparing the training dictionary
input_dict_train = {
    'input_ids': input_ids_tensor_train.long(),
    'attention_mask': attention_mask_tensor_train.int(),
    'labels': labels_tensor_train
}

# Testing data
test_texts = [item['text'] for item in multilabel_df['test']]
test_labels = [item['label'] for item in multilabel_df['test']]
input_ids_tensor_test, attention_mask_tensor_test, labels_tensor_test = tokenize_and_chunk(test_texts, test_labels, tokenizer)

# Preparing the testing dictionary
input_dict_test = {
    'input_ids': input_ids_tensor_test.long(),
    'attention_mask': attention_mask_tensor_test.int(),
    'labels': labels_tensor_test
}


Token indices sequence length is longer than the specified maximum sequence length for this model (820 > 512). Running this sequence through the model will result in indexing errors


In [None]:
import pandas as pd
from datasets import Dataset

# Convert a 2-D tensor to a list of lists (`dtype` is unused and kept only for the call sites)
def tensor_to_list(tensor, dtype=int):
    return tensor.tolist()

# Convert tensors to lists
input_ids_list_train = tensor_to_list(input_ids_tensor_train, dtype=int)
attention_mask_list_train = tensor_to_list(attention_mask_tensor_train, dtype=int)
labels_list_train = tensor_to_list(labels_tensor_train, dtype=float)

input_ids_list_test = tensor_to_list(input_ids_tensor_test, dtype=int)
attention_mask_list_test = tensor_to_list(attention_mask_tensor_test, dtype=int)
labels_list_test = tensor_to_list(labels_tensor_test, dtype=float)

# Create DataFrame
df_train = pd.DataFrame({
    'input_ids': input_ids_list_train,
    'attention_mask': attention_mask_list_train,
    'labels': labels_list_train
})

df_test = pd.DataFrame({
    'input_ids': input_ids_list_test,
    'attention_mask': attention_mask_list_test,
    'labels': labels_list_test
})

# Create Hugging Face Dataset
train_dataset = Dataset.from_pandas(df_train)
test_dataset = Dataset.from_pandas(df_test)


In [None]:
train_dataset['input_ids'][0]

[101, 48759, 8346, 6, 235, 122, 38, 95, 13356, 62, 31, 10, 1084, 12, 1208, 16159, 4, 85, 46495, 29,
 2345, 9, 7735, 6, 53, 655, 187, 38, 1410, 7, 1184, 6, 38, 33, 56, 1272, 28619, 15, 383, 4,
 38, 2145, 1158, 127, 19122, 11, 1437, 158, 212, 4978, 25, 1010, 25, 5, 6700, 2322, 204, 8, 45, 8197,
 454, 24, 21, 626, 4, 1525, 768, 24, 21, 3013, 6, 53, 38, 202, 222, 24, 4, 125, 77, 38,
 1410, 259, 6, 5, 19122, 300, 10, 410, 55, 4087, 8, 89, 21, 10, 319, 55, 3610, 173, 6, 8,
 98, 38, 1276, 45, 7, 1930, 722, 608, 24, 6, 8, 95, 562, 30, 4, 125, 5, 631, 21, 14,
 38, 460, 1199, 1503, 11, 1380, 8, 95, 10798, 66, 1467, 5, 2682, 6, 8, 122, 14, 38, 356, 124,
 6, 114, 38, 56, 269, 1006, 543, 8, 4711, 15, 1349, 5, 94, 80, 107, 396, 562, 1437, 22414, 6,
 38, 74, 33, 57, 10, 16333, 6, 53, 17232, 6, 14, 46495, 29, 70, 205, 4, 85, 46495, 29

In [None]:
train_dataset['labels'][0]

[0.0, 1.0, 1.0, 0.0, 1.0]

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
labels = ['cEXT',	'cNEU',	'cAGR',	'cCON',	'cOPN']
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
[id2label[idx] for idx, value in enumerate(train_dataset['labels'][0]) if value == 1.0]


['cNEU', 'cAGR', 'cOPN']

In [None]:
output_dir = '/content/drive/MyDrive/Levy/part2/roberta'


In [None]:
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

In [None]:
print_gpu_utilization()

GPU memory occupied: 233 MB.


In [None]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import numpy as np
import torch

def multi_label_metrics(predictions, labels, threshold=0.5):
    # Apply a sigmoid to the logits to get one independent probability per label
    probs = torch.sigmoid(torch.Tensor(predictions)).numpy()
    # Binarize the probabilities with the threshold
    y_pred = (probs >= threshold).astype(float)
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # ROC AUC is a ranking metric, so it is computed on the probabilities
    # rather than on the thresholded predictions
    roc_auc = roc_auc_score(y_true, probs, average='micro')
    # Subset accuracy: a sample counts as correct only if all labels match
    accuracy = accuracy_score(y_true, y_pred)

    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result

In [None]:
# training_args = TrainingArguments(
#     output_dir=output_dir,
#     learning_rate=1e-5,
#     per_device_train_batch_size=8,
#     per_device_eval_batch_size=16,
#     num_train_epochs=5,
#     weight_decay=0.01,
#     warmup_steps = 50,
#     logging_steps = 8,
#     evaluation_strategy="epoch",
#     save_strategy="epoch",
#     disable_tqdm = False,
#     load_best_model_at_end=True,
#     push_to_hub=False,
#     gradient_accumulation_steps=8,
#     gradient_checkpointing=True,
#     fp16=True,
# )
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=test_dataset,
#     tokenizer=tokenizer,
#     data_collator=data_collator,
#     compute_metrics=compute_metrics,
# )

# trainer.train()

In [None]:
print_gpu_utilization()

GPU memory occupied: 233 MB.


In [None]:
# trainer.evaluate()

In [None]:
output_dir_hyper = '/content/drive/MyDrive/Levy/part2/hyperparam'


In [None]:
%pip install wandb


In [None]:
def wandb_hp_space(trial):
    return {
        "method": "bayes",
        "metric": {"name": "objective", "goal": "minimize"},
        "parameters": {
            "learning_rate": {"distribution": "uniform", "min": 1e-6, "max": 1e-5},
            "per_device_train_batch_size": {"values": [4, 8]}
        },
    }

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", problem_type="multi_label_classification", num_labels=len(labels), id2label=id2label, label2id=label2id
)

In [None]:
training_args_hyper = TrainingArguments(
    output_dir=output_dir_hyper,
    per_device_eval_batch_size=8,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps = 4,
    disable_tqdm=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    fp16=True,
)

In [None]:
trainer = Trainer(
    model=None,
    model_init=model_init,
    args=training_args_hyper,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

In [None]:
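best_trial = trainer.hyperparameter_search(
    direction="minimize",
    backend="wandb",
    hp_space=wandb_hp_space,
    n_trials=5,
    # Optimize the validation loss explicitly; because compute_metrics is set,
    # the default objective would be the sum of F1, ROC AUC, and accuracy
    compute_objective=lambda metrics: metrics["eval_loss"],
)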

Create sweep with ID: q890s6cq
Sweep URL: https://wandb.ai/jads/uncategorized/sweeps/q890s6cq

wandb: Agent Starting Run: a40w2k48 with config:
wandb:   learning_rate: 7.848496824596478e-06
wandb:   per_device_train_batch_size: 4

Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,0.6853,0.688764,0.644979,0.530069,0.032657
2,0.6776,0.678711,0.6023,0.565629,0.05737
3,0.6434,0.673693,0.613625,0.575438,0.052074
4,0.6542,0.675097,0.606279,0.574049,0.054722
5,0.6593,0.676472,0.615899,0.573173,0.052074




Run summary:
eval/accuracy,0.05207
eval/f1,0.6159
eval/loss,0.67647
eval/roc_auc,0.57317
eval/runtime,5.667
eval/samples_per_second,199.929
eval/steps_per_second,25.057
train/epoch,5.0
train/global_step,730.0
train/learning_rate,0.0

wandb: Agent Starting Run: kj9butog with config:
wandb:   learning_rate: 5.756176473962249e-06
wandb:   per_device_train_batch_size: 4

Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,0.6874,0.691382,0.648712,0.521496,0.026478
2,0.6826,0.680277,0.595199,0.564589,0.061783
3,0.6525,0.676107,0.612898,0.567451,0.056487
4,0.6572,0.675173,0.613329,0.574902,0.054722
5,0.6645,0.676792,0.618427,0.56944,0.055605




Run summary:
eval/accuracy,0.0556
eval/f1,0.61843
eval/loss,0.67679
eval/roc_auc,0.56944
eval/runtime,5.2388
eval/samples_per_second,216.271
eval/steps_per_second,27.106
train/epoch,5.0
train/global_step,730.0
train/learning_rate,0.0

wandb: Agent Starting Run: xrctba7x with config:
wandb:   learning_rate: 1.856202950728685e-06
wandb:   per_device_train_batch_size: 8

Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,0.6923,0.692894,0.623956,0.511642,0.073257
2,0.6912,0.692525,0.662754,0.51825,0.042365
3,0.6938,0.692228,0.664,0.520564,0.039718
4,0.6926,0.691993,0.660253,0.521486,0.039718
5,0.6942,0.691893,0.663571,0.521867,0.042365




Run summary:
eval/accuracy,0.04237
eval/f1,0.66357
eval/loss,0.69189
eval/roc_auc,0.52187
eval/runtime,5.266
eval/samples_per_second,215.154
eval/steps_per_second,26.965
train/epoch,5.0
train/global_step,365.0
train/learning_rate,0.0

wandb: Agent Starting Run: 3l6wpr4q with config:
wandb:   learning_rate: 6.8736693522129396e-06
wandb:   per_device_train_batch_size: 4

Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,0.6869,0.690621,0.645563,0.523651,0.030009
2,0.6784,0.679433,0.607457,0.565819,0.061783
3,0.6488,0.675872,0.614776,0.568842,0.05737
4,0.6546,0.67498,0.609302,0.573401,0.050309
5,0.6607,0.676992,0.61919,0.571964,0.052957




Run summary:
eval/accuracy,0.05296
eval/f1,0.61919
eval/loss,0.67699
eval/roc_auc,0.57196
eval/runtime,5.2078
eval/samples_per_second,217.559
eval/steps_per_second,27.267
train/epoch,5.0
train/global_step,730.0
train/learning_rate,0.0

wandb: Agent Starting Run: 3lvy2p95 with config:
wandb:   learning_rate: 7.065700973929783e-06
wandb:   per_device_train_batch_size: 4

Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,0.6856,0.690029,0.645676,0.525289,0.030891
2,0.6782,0.678454,0.600972,0.5644,0.056487
3,0.6455,0.675282,0.617819,0.572377,0.05737
4,0.6526,0.675115,0.610407,0.576095,0.050309
5,0.6574,0.676869,0.620591,0.575193,0.052074




Run summary:
eval/accuracy,0.05207
eval/f1,0.62059
eval/loss,0.67687
eval/roc_auc,0.57519
eval/runtime,5.2439
eval/samples_per_second,216.059
eval/steps_per_second,27.079
train/epoch,5.0
train/global_step,730.0
train/learning_rate,0.0


# Modeling

Analyzing the results from the four models, we observe distinct performance patterns and implications of the adjustments made to BERT and RoBERTa for multi-label classification.

BERT baseline vs BERT with adjusted parameters
- BERT Baseline: Exhibits an increase in validation loss over epochs, suggesting overfitting. The performance metrics like micro-averaged F1 score and ROC AUC score show some fluctuation but do not indicate significant improvement.
- BERT Adjusted: With the learning rate reduced to 1e-6, the effective batch size increased to 64, and the warm-up extended to 50 steps, the adjustments aimed at more stable and gradual learning. The smaller learning rate can mitigate overfitting by making less aggressive updates to the model weights, and the larger batch size can help regularize the model further. These changes resulted in slightly better stability in the validation loss and a modest improvement in the micro-averaged F1 and ROC AUC scores, although the gains were not drastic.

RoBERTa baseline vs RoBERTa with adjusted parameters
- RoBERTa Baseline: Interestingly, the RoBERTa baseline showed a decreasing trend in validation loss, a positive sign compared to the BERT baseline. However, due to logging issues, only results from three epochs are available. Despite this, the model shows promising performance with respect to the micro averaged F1 and ROC AUC scores.
- RoBERTa Adjusted: For the adjusted RoBERTa model, the learning rate was also decreased, as with the adjusted BERT, but to a lesser extent (1e-5), with the effective batch size likewise set to 64 and the number of warm-up steps increased. These modifications improved the performance metrics, making this version the most promising of the four models so far. The less drastic decrease in learning rate compared to the adjusted BERT, together with the other changes, appears to have helped the model learn from the data while keeping overfitting under control.

With the adjusted RoBERTa model being the most promising so far, we next examine whether a Longformer model, which can process the full input sequence, is beneficial. Although this model is computationally expensive and slow to train, it can use the entire length of the input and may be able to turn that into an advantage.


## Longformer
To explore the potential benefit of processing the full input text for multi-label personality trait prediction, the Longformer model was employed as the next step. Using the parameters of the best-performing RoBERTa model, the Longformer was trained for nearly two hours. The results were underwhelming: a validation loss of 0.679, a micro-averaged F1 score of 0.58, and a ROC AUC of 0.56. Compared to the adjusted RoBERTa model, which took only 8 minutes per run, the much longer training time did not translate into a corresponding improvement in performance. These findings suggest that, for multi-label personality trait prediction, a more complex model such as Longformer that can read the full input does not necessarily yield clear advantages. The large increase in training time without a proportional gain in predictive performance raises questions about the cost-effectiveness and practicality of such resource-intensive models for this task, especially in a business context.

## Hyperparameter search

Continuing the exploration of optimal hyperparameters for multi-label classification, a focused hyperparameter search was conducted, specifically targeting the learning rate. The hyperparameter search in part 1 for the binary classification task had highlighted the learning rate as a crucial factor, so this search aimed to pinpoint the learning rate that achieves the lowest validation loss in a multi-label setting. Using the Bayesian search method again for its efficiency in navigating complex parameter spaces, five different learning rate configurations were tested.
This search pointed towards the higher learning rates within the searched range of 1e-6 to 1e-5. This may reflect a characteristic of the dataset or of the multi-label nature of the task, where the model benefits from more substantial weight updates during training, possibly because capturing the nuances of multiple labels simultaneously is complex.
The identified learning rate (7e-6) was then applied to the RoBERTa model. The resulting model performed relatively well, but it was not significantly better than the adjusted baseline models or the Longformer model, which suggests that other hyperparameters remain to be explored.
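
The structure of such a learning rate search can be sketched as follows. Note that this is a simplified stand-in: it uses plain random search with log-uniform sampling rather than the Bayesian method used in the report, and the objective `toy_validation_loss` is a hypothetical placeholder, not a real training run. Only the search range (1e-6 to 1e-5) and the number of trials (five) are taken from the text.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def search_learning_rate(evaluate, low=1e-6, high=1e-5, n_trials=5, seed=0):
    """Random search: sample learning rates and keep the one with the lowest loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = sample_log_uniform(rng, low, high)
        trials.append((evaluate(lr), lr))
    best_loss, best_lr = min(trials)
    return best_lr, best_loss, trials


def toy_validation_loss(lr):
    # Placeholder objective, NOT real results: in practice this would
    # fine-tune the model with `lr` and return the validation loss.
    return (math.log10(lr) + 5.3) ** 2


best_lr, best_loss, trials = search_learning_rate(toy_validation_loss)
for loss, lr in sorted(trials):
    print(f"lr={lr:.2e}  loss={loss:.4f}")
print(f"best learning rate: {best_lr:.2e}")
```

Sampling on a logarithmic scale gives each order of magnitude equal weight, which is the usual choice for learning rates. A Bayesian method would replace the independent sampling step with a model that proposes the next candidate based on the losses observed so far.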

## Comparison against part 1
The deep learning methodologies employed in the two parts of this study offer insightful comparisons in terms of model performance and the effectiveness of different approaches.
In the binary classification task, the Longformer model emerged as the standout performer. Its ability to handle longer text sequences played a crucial role, particularly evident in the significant decrease in validation loss (unique among all the models) and a balanced performance in precision and recall. Among the five personality traits analyzed, 'Openness' was predicted with the highest accuracy of almost 0.65, indicating that certain traits might be better suited to prediction with this model.
The multi-label classification task showed the adjusted RoBERTa model slightly ahead, with a micro-averaged F1 score of 0.62 and a micro-averaged ROC AUC of 0.57, but it also revealed an interesting pattern: all models, despite their different parameter configurations, yielded broadly similar results. This contrasts with the binary classification part, where certain models and parameter settings demonstrated clear advantages. The relatively uniform performance across models in the multi-label task suggests an inherent complexity in predicting multiple personality traits simultaneously. It indicates that the task itself might be fundamentally challenging, with limits on how much model tuning and architectural changes can enhance performance.
This scenario implies that the nuances and interrelationships inherent in multi-label prediction may require more than just sophisticated model architectures or optimized parameters. It could be that the subtleties of human personality traits and their expression in text are difficult to capture fully with a transformer-based approach. Consequently, this may call for other strategies, possibly integrating domain-specific knowledge or exploring advanced techniques like transfer learning. The observations from the multi-label task underscore the need to consider not just the technical aspects of model building but also the complex nature of the data and the task itself when striving for improvements in machine learning applications.
These findings highlight the nuances of model selection and optimization in different classification tasks. While Longformer's extended sequence handling capability gives it an edge in binary classification, especially for specific personality traits, the multi-label task seems to benefit more from the refined parameter tuning of models like RoBERTa. This underscores the importance of tailoring the model and its parameters to the specific characteristics and requirements of the prediction task at hand.
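
The observation that all multi-label models reach similar quality can be set against a simple reference point. The sketch below assumes that the reported validation loss is the mean binary cross-entropy over all instance-label pairs (the loss described in the introduction), and computes that loss for an uninformed model that outputs a probability of 0.5 for every trait. The labels are toy values, not data from the experiments.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over all instance-label pairs."""
    total = 0.0
    count = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


# Toy labels: 2 instances, 5 labels (one per OCEAN trait).
y_true = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
]

# An uninformed model predicts 0.5 for every label.
uninformed = [[0.5] * 5 for _ in y_true]
chance_loss = binary_cross_entropy(y_true, uninformed)

# A fairly confident and correct model, for contrast.
confident = [[0.9 if t == 1 else 0.1 for t in row] for row in y_true]
confident_loss = binary_cross_entropy(y_true, confident)

print("uninformed loss:", round(chance_loss, 4))
print("confident loss:", round(confident_loss, 4))
```

The uninformed loss equals ln 2, about 0.693, regardless of the labels. Under the stated assumption, the RoBERTa and Longformer validation losses of roughly 0.68 to 0.69 in the result tables sit only slightly below this reference, which is consistent with the conclusion that the task is hard for all tested models.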

## Comparison between DL and ML approaches

The comparison between the machine learning (ML) and deep learning (DL) approaches in the multi-label context reveals some interesting insights into the strengths and limitations of each methodology.
The shift to multi-label classification saw a noticeable degradation in performance with the ML models, with micro-averaged F1 scores dropping to around 0.52. This decline suggests that ML models struggle to simultaneously predict multiple traits, possibly due to the increased complexity and interdependence of the labels.
Conversely, the DL approaches using transformer models exhibited a different pattern. In the multi-label classification task, the DL models maintained a relatively consistent level of performance, with the adjusted RoBERTa model achieving a micro-averaged F1 score of 0.62, marginally higher than the highest consistent F1 score for the binary classification part. This consistency indicates that DL models, despite not showing significant improvements in multi-label classification compared to binary, are better equipped to handle the complexities of predicting multiple traits simultaneously.
The contrast between the two methodologies suggests that while ML models may be effective for individual trait prediction, their performance decreases in the multi-label setting. DL models, with architectures designed to capture complex patterns and dependencies in data, are more adept at handling the intricacies of multi-label classification. This highlights the potential of transformers in complex prediction tasks where understanding nuanced relationships between multiple variables is crucial.


Link to folder with screenshots: https://drive.google.com/drive/folders/1mpyirSEd7irkOInVyXhZeNWs6-OAVZ9z?usp=sharing

References

Longformer Multilabel text Classification · Jesus Leal. (2021, 21 april). https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/


BERT baseline

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.580000      | 0.727587        | 0.540678 | 0.564698 | 0.069647 |
| 1     | 0.550100      | 0.737402        | 0.557868 | 0.557937 | 0.061049 |
| 2     | 0.539400      | 0.766280        | 0.547074 | 0.554327 | 0.065348 |
| 3     | 0.487500      | 0.776530        | 0.573559 | 0.556996 | 0.065348 |
| 4     | 0.475900      | 0.789953        | 0.557054 | 0.559545 | 0.076526 |

BERT adjusted

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.522500      | 0.731816        | 0.560959 | 0.565442 | 0.068788 |
| 1     | 0.546300      | 0.735134        | 0.561864 | 0.563479 | 0.066208 |
| 2     | 0.557400      | 0.736811        | 0.566519 | 0.561772 | 0.062769 |
| 3     | 0.556900      | 0.739377        | 0.565180 | 0.562336 | 0.067928 |
| 4     | 0.532100      | 0.740090        | 0.563699 | 0.561845 | 0.067068 |

RoBERTa baseline

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.693500      | 0.691598        | 0.622071 | 0.508490 | 0.027289 |
| 2     | 0.656600      | 0.676237        | 0.626175 | 0.571581 | 0.060739 |
| 4     | 0.641300      | 0.681267        | 0.626275 | 0.572846 | 0.056338 |

RoBERTa adjusted

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.692200      | 0.692370        | 0.618713 | 0.516594 | 0.062832 |
| 1     | 0.676000      | 0.677975        | 0.610326 | 0.564549 | 0.059292 |
| 2     | 0.666800      | 0.679611        | 0.638248 | 0.562050 | 0.048673 |
| 3     | 0.650200      | 0.678402        | 0.611678 | 0.567969 | 0.059292 |
| 4     | 0.632900      | 0.677630        | 0.626727 | 0.572730 | 0.058407 |

Longformer

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.692900      | 0.692545        | 0.565723 | 0.507863 | 0.033727 |
| 1     | 0.691800      | 0.692078        | 0.463757 | 0.511483 | 0.055649 |
| 2     | 0.691000      | 0.689413        | 0.610433 | 0.525257 | 0.045531 |
| 3     | 0.678200      | 0.681626        | 0.593272 | 0.550862 | 0.055649 |
| 4     | 0.674600      | 0.679399        | 0.580291 | 0.561656 | 0.057336 |

Hyperparameter search with RoBERTa

| Epoch | Training Loss | Validation Loss |    F1    |  Roc Auc | Accuracy |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:--------:|
| 0     | 0.694200      | 0.691885        | 0.657082 | 0.511317 | 0.026619 |
| 1     | 0.674900      | 0.679207        | 0.610469 | 0.567987 | 0.047915 |
| 2     | 0.663000      | 0.677373        | 0.628518 | 0.571595 | 0.056788 |
| 4     | 0.639100      | 0.678278        | 0.620080 | 0.578823 | 0.058563 |


