# Emotional Analysis using Hugging Face Ecosystem
##Set Environment
In this notebook, we have to install following additional libraries (compared to previous notebooks) from Huggingface to enhance our workflow: transformers, datasets, evaluate, and accelearte. In addition, we are also installing wandb.

* The transformers library provides Trainer class that we will use to manage Training process.
* The datasets library simplifies the process of accessing and manipulating a wide array of datasets.
* The evaluate library offers a suite of standardized metrics and methods for robust and consistent model evaluation.
* We will not use accelerate library directly. However , we need to install it as transformer librray usses it in the background.
* Finally wandb library provide tools for efficient experiment tracking.

# Setting up the Environment



In [1]:
import sys
# If in Colab, then import the drive module from google.colab
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  # Mount the Google Drive to access files stored there
  drive.mount('/content/drive')

  # !pip install torchtext -qq
  # # Install the torchinfo library quietly
  !pip install torchinfo -qq
  # # !pip install torchtext --upgrade -qq
  !pip install torchmetrics -qq
  # !pip install torchinfo -qq
  !pip install fast_ml -qq
  !pip install joblib -qq
  # !pip install sklearn -qq
  # !pip install pandas -qq
  # !pip install numpy -qq
  !pip install scikit-multilearn -qq
  !pip install transformers evaluate wandb accelerate -U -qq
  !pip install pytorch-ignite -qq -U
  !pip install optuna -qq

  basepath = '/content/drive/MyDrive/Colab_Notebooks/BUAN_6342_Applied_Natural_Language_Processing'
  sys.path.append('/content/drive/MyDrive/Colab_Notebooks/BUAN_6342_Applied_Natural_Language_Processing/0_Custom_files')
else:
  basepath = '/Users/harikrishnadev/Library/CloudStorage/GoogleDrive-harikrish0607@gmail.com/My Drive/Colab_Notebooks/BUAN_6342_Applied_Natural_Language_Processing/'
  sys.path.append('/Users/harikrishnadev/Library/CloudStorage/GoogleDrive-harikrish0607@gmail.com/My Drive/Colab_Notebooks/BUAN_6342_Applied_Natural_Language_Processing/0_Custom_files')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## *Load Libraries*

In [2]:
# standard data science librraies for data handling and v isualization
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# New libraries introduced in this notebook
import evaluate
import torch
from datasets import load_dataset, DatasetDict, ClassLabel, Dataset
from datasets import load_metric
from transformers import Pipeline
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import AutoConfig
from transformers import pipeline
from pprint import pprint

import wandb

import os

In [3]:
# Set the base folder path using the Path class for better path handling
base_folder = Path(basepath)

# Define the data folder path by appending the relative path to the base folder
# This is where the data files will be stored
data_folder = base_folder / '0_Data_Folder'

# Define the model folder path for saving trained models
# This path points to a specific folder designated for NLP models related to the IMDb dataset
model_folder = data_folder

custom_functions = base_folder / '0_Custom_files'

# **Logging into Kaggle**
    


In [4]:
if 'google.colab' in str(get_ipython()):
    !chmod 600 /content/drive/MyDrive/Colab_Notebooks/BUAN_6382_Applied_DeepLearning/Data/.kaggle/kaggle.json
    !ls -la /content/drive/MyDrive/Colab_Notebooks/BUAN_6382_Applied_DeepLearning/Data/.kaggle
else:
    !chmod 600 '/Users/harikrishnadev/Library/CloudStorage/GoogleDrive-harikrish0607@gmail.com/My Drive/Colab_Notebooks/BUAN_6382_Applied_DeepLearning/Data/.kaggle/kaggle.json'
    ! ls -la '/Users/harikrishnadev/Library/CloudStorage/GoogleDrive-harikrish0607@gmail.com/My Drive/Colab_Notebooks/BUAN_6382_Applied_DeepLearning/Data/.kaggle'

total 1
-rw------- 1 root root 70 Nov 27 02:27 kaggle.json


In [5]:
if 'google.colab' in str(get_ipython()):
    os.environ['KAGGLE_CONFIG_DIR']='/content/drive/MyDrive/Colab_Notebooks/BUAN_6382_Applied_DeepLearning/Data/.kaggle'
else:
    os.environ['KAGGLE_CONFIG_DIR']='/Users/harikrishnadev/Library/CloudStorage/GoogleDrive-harikrish0607@gmail.com/My Drive/Colab_Notebooks/BUAN_6382_Applied_DeepLearning/Data/.kaggle'

# **Logging into Wandb**

In [6]:
if 'google.colab' in str(get_ipython()):
    from google.colab import userdata
    wandb.login(key=userdata.get('wandb'))
else:
    !wandb login

[34m[1mwandb[0m: Currently logged in as: [33mharikrish0607[0m ([33mharikrishnad[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# **Loading Dataset**

In [7]:
! kaggle competitions download -c emotion-detection-spring2014

emotion-detection-spring2014.zip: Skipping, found more recently modified local copy (use --force to force download)


In [8]:
! unzip emotion-detection-spring2014.zip

Archive:  emotion-detection-spring2014.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [9]:
import pandas as pd
train_dataset = pd.read_csv('train.csv', usecols=lambda column: column != 'ID')

In [10]:
train_dataset.head()

Unnamed: 0,Tweet,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,“Worry is a down payment on a problem you may ...,0,1,0,0,0,0,1,0,0,0,1
1,Whatever you decide to do make sure it makes y...,0,0,0,0,1,1,1,0,0,0,0
2,@Max_Kellerman it also helps that the majorit...,1,0,1,0,1,0,1,0,0,0,0
3,Accept the challenges so that you can literall...,0,0,0,0,1,0,1,0,0,0,0
4,My roommate: it's okay that we can't spell bec...,1,0,1,0,0,0,0,0,0,0,0


In [11]:
train_dataset.columns

Index(['Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
       'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
      dtype='object')

In [12]:
label_columns = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']

In [13]:
len(label_columns)

11

In [14]:
train_dataset[label_columns] = train_dataset[label_columns].astype(bool)

In [15]:
trainset = Dataset.from_pandas(train_dataset)

In [16]:
trainset.features

{'Tweet': Value(dtype='string', id=None),
 'anger': Value(dtype='bool', id=None),
 'anticipation': Value(dtype='bool', id=None),
 'disgust': Value(dtype='bool', id=None),
 'fear': Value(dtype='bool', id=None),
 'joy': Value(dtype='bool', id=None),
 'love': Value(dtype='bool', id=None),
 'optimism': Value(dtype='bool', id=None),
 'pessimism': Value(dtype='bool', id=None),
 'sadness': Value(dtype='bool', id=None),
 'surprise': Value(dtype='bool', id=None),
 'trust': Value(dtype='bool', id=None)}

# **Accessing and Manuplating Splits**

In [17]:
trainset = trainset.train_test_split(test_size=0.3)

In [18]:
trainset

DatasetDict({
    train: Dataset({
        features: ['Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 5406
    })
    test: Dataset({
        features: ['Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
        num_rows: 2318
    })
})

# Custom `MultiLabelClassifier` class built to run multiple models without repeating mutiple lines of code

The **MultiLabelClassifier** is a class designed for training and evaluating multi-label text classification models using the Hugging Face Transformers library. It supports fine-tuning pre-trained models for multi-label classification tasks and provides methods for prediction and hyperparameter optimization.

* `model_name` (str): The pre-trained model name from Hugging Face Transformers.
* `labels` (list of str): The list of labels for classification.
* `batch_size` (int): Batch size for training (default is 8).
* `learning_rate` (float): Learning rate for training (default is 2e-5).
* `num_epochs` (int): Number of epochs for training (default is 5).
* `metric_name` (str): The name of the evaluation metric (default is "f1").
* `threshold` (float): Threshold for binary classification (default is 0.5).



```python
# Initialize the classifier
classifier = MultiLabelClassifier(
    model_name="distilbert-base-uncased",
    labels=["positive", "negative"],
    batch_size=8,
    learning_rate=2e-5,
    num_epochs=10,
    metric_name="f1",
    threshold=0.5
)

# Train the classifier
classifier.train(train_dataset, valid_dataset)

# Optimize threshold
best_threshold = classifier.optimize_threshold(valid_dataset)

# Make predictions
predictions = classifier.predict(["This is a positive sentence", "This is a negative sentence"], threshold = best_threshold)

```


Here's a detailed explanation of the different components of the class:

1. **__init__** method:
   - Initializes the classifier by taking in various parameters such as the pre-trained model name, the list of labels, batch size, learning rate, number of epochs, evaluation metric, and the classification threshold.
   - It sets up the device (either 'cuda' if a GPU is available or 'cpu'), creates the tokenizer and the pre-trained model for multi-label classification.
   - The model is loaded onto the specified device.

2. **preprocess_data** method:
   - This method takes in a dictionary of examples and preprocesses the data for the model.
   - It tokenizes the input text and encodes it using the tokenizer.
   - It then creates a label matrix where each row corresponds to the binary labels for a given input text.
   - The preprocessed data, including the input IDs and the label matrix, is returned.

3. **multi_label_metrics** method:
   - This method computes the multi-label classification metrics, including F1 score (micro-averaged), ROC-AUC score, and accuracy.
   - It takes in the model predictions and the ground truth labels, and applies a threshold to convert the probabilities to binary predictions.
   - The computed metrics are returned as a dictionary.

4. **compute_metrics** method:
   - This method is used as the `compute_metrics` function for the Trainer in the Transformers library.
   - It calls the `multi_label_metrics` method to compute the evaluation metrics for the model.

5. **train** method:
   - This method is responsible for training the model.
   - It sets up the `TrainingArguments` object, which specifies the training configuration, such as the learning rate, batch size, number of epochs, and various logging and checkpointing options.
   - It preprocesses the training and validation datasets using the `preprocess_data` method and sets the data format to PyTorch tensors.
   - It creates a `Trainer` object and calls the `train` method to train the model.
   - After training, it evaluates the model on the validation dataset and logs the results to Weights & Biases.

6. **predict** method:
   - This method generates predictions for a list of input texts.
   - It preprocesses the input texts using the `preprocess_data` method and makes predictions using the model.
   - It applies the classification threshold to convert the probabilities to binary predictions and returns the predicted labels and the binary predictions.

7. **objective** method:
   - This method is used for hyperparameter optimization using Optuna.
   - It takes in a trial object and the validation dataset, and computes the negative F1 score as the objective function.
   - It applies the threshold (which is a hyperparameter to be optimized) to the model predictions and computes the multi-label metrics.
   - The negative F1 score is returned as the objective value.

8. **optimize_threshold** method:
   - This method uses Optuna to optimize the classification threshold.
   - It creates an Optuna study, optimizes the objective function (the `objective` method), and sets the best threshold value found during the optimization process.
   - The best threshold value is returned.


**Notes:**
* The train_dataset and valid_dataset should be compatible with the Hugging Face Dataset class.
* The labels should match the labels present in the datasets.
* Model fine-tuning and prediction methods require GPU if available for faster computation.

In [19]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import torch
from transformers import EvalPrediction
import optuna
from datetime import date

class MultiLabelClassifier:
    def __init__(self, model_name, labels, batch_size=8, learning_rate=2e-5, num_epochs=5, metric_name="f1", threshold=0.5):
        """
        Initializes the MultiLabelClassifier.

        Args:
        - model_name (str): The pre-trained model name.
        - labels (list of str): The list of labels for classification.
        - batch_size (int): Batch size for training.
        - learning_rate (float): Learning rate for training.
        - num_epochs (int): Number of epochs for training.
        - metric_name (str): The name of the evaluation metric.
        - threshold (float): Threshold for binary classification.

        Returns:
        - None
        """
        self.model_name = model_name
        self.labels = labels
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.metric_name = metric_name
        self.threshold = threshold
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, problem_type="multi_label_classification", num_labels=len(labels), id2label={str(i): label for i, label in enumerate(labels)}, label2id={label: i for i, label in enumerate(labels)})
        self.id2label = {str(i): label for i, label in enumerate(labels)}
        self.label2id = {label: i for i, label in enumerate(labels)}
        self.model.to(self.device)

    def preprocess_data(self, examples):
        """
        Preprocesses the input data.

        Args:
        - examples (dict): Dictionary containing input data.

        Returns:
        - dict: Preprocessed input data.
        """
        text = examples["Tweet"]
        encoding = self.tokenizer(text, padding="max_length", truncation=True, max_length=128)
        labels_batch = {k: examples[k] for k in examples.keys() if k in self.labels}
        labels_matrix = np.zeros((len(text), len(self.labels)))
        for idx, label in enumerate(self.labels):
            labels_matrix[:, idx] = labels_batch[label]
        encoding["labels"] = labels_matrix.tolist()
        return encoding

    def multi_label_metrics(self, predictions, labels, threshold=None):
        """
        Computes multi-label classification metrics.

        Args:
        - predictions (torch.Tensor): Model predictions.
        - labels (np.ndarray): Ground truth labels.
        - threshold (float): Threshold for binary classification.

        Returns:
        - dict: Dictionary containing computed metrics.
        """
        if threshold is None:
            threshold = self.threshold
        sigmoid = torch.nn.Sigmoid()
        probs = sigmoid(torch.Tensor(predictions))
        y_pred = np.zeros(probs.shape)
        y_pred[np.where(probs >= threshold)] = 1
        y_true = labels
        f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
        roc_auc = roc_auc_score(y_true, y_pred, average='micro')
        accuracy = accuracy_score(y_true, y_pred)
        metrics = {'f1': f1_micro_average, 'roc_auc': roc_auc, 'accuracy': accuracy}
        return metrics

    def compute_metrics(self, p: EvalPrediction):
        """
        Computes evaluation metrics.

        Args:
        - p (EvalPrediction): Evaluation predictions.

        Returns:
        - dict: Dictionary containing computed metrics.
        """
        preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
        result = self.multi_label_metrics(predictions=preds, labels=p.label_ids)
        return result

    def train(self, train_dataset, valid_dataset):
        """
        Trains the model.

        Args:
        - train_dataset (Dataset): Training dataset.
        - valid_dataset (Dataset): Validation dataset.

        Returns:
        - None
        """
        args = TrainingArguments(
            f"{self.model_name}-finetuned",
            # evaluation_strategy="epoch",
            # save_strategy="epoch",
            learning_rate=self.learning_rate,
            per_device_train_batch_size=self.batch_size,
            per_device_eval_batch_size=self.batch_size,
            num_train_epochs=self.num_epochs,
            weight_decay=0.01,
            load_best_model_at_end=True,
            metric_for_best_model="f1",  # Use F1 score as the metric to determine the best model
            optim='adamw_torch',  # Optimizer
            # output_dir=str(model_folder),  # Directory to save model checkpoints
            evaluation_strategy='steps',  # Evaluate model at specified step intervals
            eval_steps=50,  # Perform evaluation every 50 training steps
            save_strategy="steps",  # Save model checkpoint at specified step intervals
            save_steps=1000,  # Save model checkpoint every 1000 training steps
            save_total_limit=2,  # Retain only the best and the most recent model checkpoints
            greater_is_better=True,  # A model is 'better' if its F1 score is higher
            logging_strategy='steps',  # Log metrics and results to Weights & Biases platform
            logging_steps=50,  # Log metrics and results every 50 steps
            report_to='wandb',  # Log metrics and results to Weights & Biases platform
            run_name=f"emotion_tweet_{self.model_name}_{date.today().strftime('%Y-%m-%d')}",  # Experiment name for Weights & Biases
            fp16=True  # Use mixed precision training (FP16)
            )

        train_dataset = train_dataset.map(self.preprocess_data, batched=True, remove_columns=train_dataset.column_names)
        valid_dataset = valid_dataset.map(self.preprocess_data, batched=True, remove_columns=valid_dataset.column_names)

        train_dataset.set_format("torch")
        valid_dataset.set_format("torch")

        trainer = Trainer(
            self.model,
            args,
            train_dataset=train_dataset,
            eval_dataset=valid_dataset,
            tokenizer=self.tokenizer,
            compute_metrics=self.compute_metrics,
        )

        trainer.train()
        eval_results = trainer.evaluate()
        print(f"Evaluation results: {eval_results}")

        # Log evaluation results to Weights & Biases platform
        wandb.log({"eval_accuracy": eval_results["eval_accuracy"], "eval_loss": eval_results["eval_loss"], "eval_f1": eval_results["eval_f1"]})

    def predict(self, texts, threshold=0.5):
        """
        Generates predictions for a list of texts.

        Args:
        - texts (list of str): List of input texts.
        - threshold (float): Threshold for binary classification.

        Returns:
        - dict: Dictionary containing predicted labels for each input text.
        """
        if threshold is None:
            threshold = self.threshold

        # Preprocess input texts
        encoding = self.tokenizer(texts, padding="max_length", truncation=True, max_length=128, return_tensors="pt").to(self.device)

        # Make predictions
        with torch.no_grad():
            output = self.model(**encoding)

        # Convert logits to probabilities
        sigmoid = torch.nn.Sigmoid()
        probs = sigmoid(output.logits)

        # Apply threshold for binary classification
        threshold_tensor = torch.tensor([threshold], device=self.device)
        binary_preds = (probs >= threshold_tensor).int()

        # Convert binary predictions to label names
        label_preds = []
        for pred in binary_preds:
            label_pred = [self.id2label[str(i)] for i, val in enumerate(pred) if val == 1]
            label_preds.append(label_pred)

        return label_preds, binary_preds.cpu().numpy()

    def objective(self, trial, valid_dataset):
        """
        Objective function for hyperparameter optimization.

        Args:
        - trial (Trial): Optuna trial object.
        - valid_dataset (Dataset): Validation dataset.

        Returns:
        - float: Computed metric value.
        """
        threshold = trial.suggest_float("threshold", 0.1, 0.9)
        valid_dataset = valid_dataset.map(self.preprocess_data, batched=True)
        valid_dataset.set_format("torch")

        # Get the correct labels from the dataset
        labels = np.array([valid_dataset[column] for column in self.labels]).T

        # Make predictions
        with torch.no_grad():
            logits = self.model(valid_dataset["input_ids"].to(torch.device("cuda")))['logits']
            predictions = torch.sigmoid(logits).cpu().numpy()

            # Apply threshold for binary classification
            binary_preds = (predictions >= threshold).astype(int)

            # Compute metrics
            f1_micro_average = f1_score(y_true=labels, y_pred=binary_preds, average='micro')
            roc_auc = roc_auc_score(labels, predictions, average='micro')
            accuracy = accuracy_score(labels, binary_preds)

            result = {'f1': f1_micro_average, 'roc_auc': roc_auc, 'accuracy': accuracy}
            return -result["f1"]

    def optimize_threshold(self, valid_dataset):
        """
        Optimizes the threshold for binary classification.

        Args:
        - valid_dataset (Dataset): Validation dataset.

        Returns:
        - float: Best threshold value.
        """
        study = optuna.create_study(direction="maximize")
        study.optimize(lambda trial: self.objective(trial, valid_dataset), n_trials=10)
        self.threshold = study.best_params["threshold"]
        return study.best_params["threshold"]

In [None]:
os.environ["WANDB_PROJECT"] = "nlp_course_spring_2024-emotion-analysis-hf-trainer-hw7"  # name your W&B project
os.environ["WANDB_LOG_MODEL"] = "checkpoint"  # log the model during training

# Distill BERT
## Training the model

In [20]:
classifier = MultiLabelClassifier(
    model_name="distilbert-base-uncased",
    labels=label_columns,
    batch_size=8,
    learning_rate=2e-5,
    num_epochs=10,
    metric_name="f1",
    threshold=0.5
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
classifier.train(trainset['train'], trainset['test'])

Map:   0%|          | 0/5406 [00:00<?, ? examples/s]

Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
50,0.5631,0.484989,0.0,0.5,0.029336
100,0.4758,0.464279,0.000367,0.500092,0.029336
150,0.4546,0.419203,0.414102,0.630171,0.137187
200,0.4212,0.394945,0.508575,0.67431,0.186368
250,0.4011,0.377476,0.551052,0.697342,0.201898
300,0.3852,0.361643,0.586875,0.718048,0.22692
350,0.3816,0.362198,0.553422,0.698358,0.205781
400,0.368,0.347996,0.598222,0.723423,0.227351
450,0.361,0.344683,0.596924,0.723412,0.221311
500,0.3507,0.342098,0.650432,0.766916,0.238999


[34m[1mwandb[0m: Adding directory to artifact (./distilbert-base-uncased-finetuned/checkpoint-1000)... Done. 4.0s
[34m[1mwandb[0m: Adding directory to artifact (./distilbert-base-uncased-finetuned/checkpoint-2000)... Done. 2.5s
[34m[1mwandb[0m: Adding directory to artifact (./distilbert-base-uncased-finetuned/checkpoint-3000)... Done. 2.3s
[34m[1mwandb[0m: Adding directory to artifact (./distilbert-base-uncased-finetuned/checkpoint-4000)... Done. 4.2s
[34m[1mwandb[0m: Adding directory to artifact (./distilbert-base-uncased-finetuned/checkpoint-5000)... Done. 4.7s
[34m[1mwandb[0m: Adding directory to artifact (./distilbert-base-uncased-finetuned/checkpoint-6000)... Done. 3.9s
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Evaluation results: {'eval_loss': 0.33779624104499817, 'eval_f1': 0.6783027965284475, 'eval_roc_auc': 0.7878137573121489, 'eval_accuracy': 0.24762726488352027, 'eval_runtime': 3.3362, 'eval_samples_per_second': 694.793, 'eval_steps_per_second': 86.924, 'epoch': 10.0}


## Finding the optimal threshold

In [22]:
best_threshold = classifier.optimize_threshold(trainset['test'])
print(f"Best threshold: {best_threshold}")

[I 2024-04-12 05:19:00,725] A new study created in memory with name: no-name-61968806-ecc3-4277-94f9-60cb3b9b94f1


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
[I 2024-04-12 05:19:06,381] Trial 0 finished with value: -0.466891716437289 and parameters: {'threshold': 0.45294865686166663}. Best is trial 0 with value: -0.466891716437289.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:12,092] Trial 1 finished with value: -0.5560423122128432 and parameters: {'threshold': 0.2758763655915911}. Best is trial 0 with value: -0.466891716437289.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:17,733] Trial 2 finished with value: -0.32662864004803366 and parameters: {'threshold': 0.6839045990901998}. Best is trial 2 with value: -0.32662864004803366.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:23,313] Trial 3 finished with value: -0.40532585844428876 and parameters: {'threshold': 0.5662134640342488}. Best is trial 2 with value: -0.32662864004803366.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:28,762] Trial 4 finished with value: -0.375613984397573 and parameters: {'threshold': 0.6191699415741193}. Best is trial 2 with value: -0.32662864004803366.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:34,288] Trial 5 finished with value: -0.4988864142538975 and parameters: {'threshold': 0.3977350389605607}. Best is trial 2 with value: -0.32662864004803366.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:39,858] Trial 6 finished with value: -0.5646776131357629 and parameters: {'threshold': 0.2407408668322395}. Best is trial 2 with value: -0.32662864004803366.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:45,333] Trial 7 finished with value: -0.3209281301792979 and parameters: {'threshold': 0.688262136093825}. Best is trial 7 with value: -0.3209281301792979.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:51,230] Trial 8 finished with value: -0.42703071672354953 and parameters: {'threshold': 0.5268342436170674}. Best is trial 7 with value: -0.3209281301792979.


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[I 2024-04-12 05:19:56,754] Trial 9 finished with value: -0.563993831919815 and parameters: {'threshold': 0.2177464389640547}. Best is trial 7 with value: -0.3209281301792979.


Best threshold: 0.688262136093825


In [23]:
best_threshold

0.688262136093825

## Prediction on Submission file

In [24]:
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,ID,Tweet,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,2018-01559,@Adnan__786__ @AsYouNotWish Dont worry Indian ...,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
1,2018-03739,"Academy of Sciences, eschews the normally sobe...",NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
2,2018-00385,I blew that opportunity -__- #mad,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
3,2018-03001,This time in 2 weeks I will be 30... 😥,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
4,2018-01988,#Deppression is real. Partners w/ #depressed p...,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE


In [25]:
testset = Dataset.from_dict({
    'Tweet': test['Tweet']})

In [26]:
testset

Dataset({
    features: ['Tweet'],
    num_rows: 3259
})

In [28]:
outputs, outputs_array = classifier.predict(testset['Tweet'], threshold = best_threshold)

In [29]:
outputs[:10]

[['anger', 'disgust', 'fear'],
 ['anger', 'disgust'],
 ['anger', 'disgust'],
 ['anticipation', 'joy'],
 ['fear', 'pessimism', 'sadness'],
 ['disgust', 'fear'],
 ['anticipation', 'optimism'],
 ['joy', 'love', 'optimism'],
 ['joy', 'love', 'optimism'],
 ['sadness']]

In [30]:
outputs_array[:10]

array([[1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=int32)

In [31]:
test[label_columns] = outputs_array

In [33]:
submission = pd.read_csv('sample_submission.csv')
submission.head()

Unnamed: 0,ID,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,2018-01559,0,0,0,0,0,0,0,0,0,0,0
1,2018-03739,0,0,0,0,0,0,0,0,0,0,0
2,2018-00385,0,0,0,0,0,0,0,0,0,0,0
3,2018-03001,0,0,0,0,0,0,0,0,0,0,0
4,2018-01988,0,0,0,0,0,0,0,0,0,0,0


In [34]:
submission[label_columns] = test[label_columns]

In [35]:
submission.to_csv(model_folder/f'{classifier.model_name}_{date.today()}.csv', index = False)

## Submission

In [36]:
from kaggle import api
comp = 'emotion-detection-spring2014'
api.competition_submit(model_folder/f'{classifier.model_name}_{date.today()}.csv', f'{classifier.model_name}_{date.today()}', comp)



100%|██████████| 105k/105k [00:01<00:00, 56.7kB/s]


Successfully submitted to Emotion Detection Spring2024

# albert-base-v2
## Training

In [37]:
classifier = MultiLabelClassifier(
    model_name="albert-base-v2",
    labels=label_columns,
    batch_size=8,
    learning_rate=2e-5,
    num_epochs=10,
    metric_name="f1",
    threshold=0.5
)

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [38]:
classifier.train(trainset['train'], trainset['test'])

Map:   0%|          | 0/5406 [00:00<?, ? examples/s]

Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
50,0.5192,0.468602,0.0,0.499626,0.028904
100,0.4608,0.440097,0.305734,0.584001,0.026747
150,0.4294,0.415494,0.440337,0.641909,0.150561
200,0.4132,0.404375,0.451115,0.647391,0.124676
250,0.3933,0.397587,0.510172,0.678666,0.157032
300,0.3971,0.385476,0.559601,0.708624,0.194133
350,0.3903,0.402105,0.495119,0.669923,0.167386
400,0.3877,0.375352,0.539995,0.693265,0.181191
450,0.3822,0.380891,0.538733,0.693046,0.181622
500,0.3795,0.365225,0.591153,0.724952,0.216997


[34m[1mwandb[0m: Adding directory to artifact (./albert-base-v2-finetuned/checkpoint-1000)... Done. 0.3s
[34m[1mwandb[0m: Adding directory to artifact (./albert-base-v2-finetuned/checkpoint-2000)... Done. 0.3s
[34m[1mwandb[0m: Adding directory to artifact (./albert-base-v2-finetuned/checkpoint-3000)... Done. 0.3s
[34m[1mwandb[0m: Adding directory to artifact (./albert-base-v2-finetuned/checkpoint-4000)... Done. 0.3s
[34m[1mwandb[0m: Adding directory to artifact (./albert-base-v2-finetuned/checkpoint-5000)... Done. 0.3s
[34m[1mwandb[0m: Adding directory to artifact (./albert-base-v2-finetuned/checkpoint-6000)... Done. 0.3s
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Evaluation results: {'eval_loss': 0.32286542654037476, 'eval_f1': 0.6818581907090464, 'eval_roc_auc': 0.7878102409314189, 'eval_accuracy': 0.2631578947368421, 'eval_runtime': 8.7174, 'eval_samples_per_second': 265.905, 'eval_steps_per_second': 33.267, 'epoch': 10.0}


In [43]:
best_threshold = classifier.optimize_threshold(trainset['test'])
print(f"Best threshold: {best_threshold}")

[I 2024-04-12 05:59:15,571] A new study created in memory with name: no-name-73bd7433-0dc2-4f8b-b639-60e71e958ebc


Map:   0%|          | 0/2318 [00:00<?, ? examples/s]

[W 2024-04-12 05:59:18,408] Trial 0 failed with parameters: {'threshold': 0.15303892926422763} because of the following error: OutOfMemoryError('CUDA out of memory. Tried to allocate 3.40 GiB. GPU 0 has a total capacity of 14.75 GiB of which 3.21 GiB is free. Process 562628 has 11.54 GiB memory in use. Of the allocated memory 8.54 GiB is allocated by PyTorch, and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)').
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-19-ae1ebde83474>", line 239, in <lambda>
    study.optimize(lambda trial: self.objective(trial, valid_dataset), n_trials=10)
  File "<ipython-input

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.40 GiB. GPU 0 has a total capacity of 14.75 GiB of which 3.21 GiB is free. Process 562628 has 11.54 GiB memory in use. Of the allocated memory 8.54 GiB is allocated by PyTorch, and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [45]:
outputs, outputs_array = classifier.predict(testset['Tweet'])

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.20 GiB. GPU 0 has a total capacity of 14.75 GiB of which 837.06 MiB is free. Process 562628 has 13.93 GiB memory in use. Of the allocated memory 10.81 GiB is allocated by PyTorch, and 2.99 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [44]:
submission[label_columns] = outputs_array

In [42]:
submission.to_csv(model_folder/f'{classifier.model_name}_{date.today()}.csv', index = False)

In [None]:
from kaggle import api
comp = 'emotion-detection-spring2014'
api.competition_submit(model_folder/f'{classifier.model_name}_{date.today()}.csv', f'{classifier.model_name}_{date.today()}', comp)

In [47]:
wandb.finish()

VBox(children=(Label(value='5728.733 MB of 5728.733 MB uploaded (19.032 MB deduped)\r'), FloatProgress(value=1…

0,1
eval/accuracy,▄▆███████▇▇█▇▇▇▇▇▇▇▆▁▅▇▇▇███▇▇▇▇█▇▇▇▇▇▇▇
eval/f1,▃▅▇▇▇███████████████▁▅▆▇▇█▇█████████████
eval/loss,▇▄▂▁▁▁▁▂▂▂▃▃▄▄▅▅▆▆▆▆█▄▃▂▂▁▁▁▁▁▂▂▂▃▃▃▄▄▄▂
eval/roc_auc,▂▅▇▇▇█▇█████████████▁▅▆▆▆▇▇▇████████████
eval/runtime,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇███████████████████
eval/samples_per_second,████████████████████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval/steps_per_second,████████████████████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval_accuracy,▁█
eval_f1,▁█
eval_loss,█▁

0,1
eval/accuracy,0.26316
eval/f1,0.68186
eval/loss,0.32287
eval/roc_auc,0.78781
eval/runtime,8.7174
eval/samples_per_second,265.905
eval/steps_per_second,33.267
eval_accuracy,0.26316
eval_f1,0.68186
eval_loss,0.32287
