# DistilBERT Classifier with Bayesian Optimisation

### Cleaning Dataset
To ensure consistency across the different models, our group will be using the same method of preprocessing and data cleaning methodology.  
Data cleaning was done with the following:
- Expanded contractions
- Removed Hashtags
- Removed airline mentions and kept their names
- Removed links
- Converted Emojis to texts


Importing Libaries and Downloading Modules

In [1]:
%pip install contractions
%pip install emoji
%pip install nltk

import pandas as pd
import contractions  # Import contractions package for contraction expansion
import re
import emoji
import nltk
from nltk.stem import WordNetLemmatizer  # Word lemmatization

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()


Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp312-cp312-macosx_10_9_universal2.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Downloading pyahocorasick-2.1.0-cp312-cp312-macosx_10_9_universal2.whl (63 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24
Note: you may need to restart the kernel to use updated packages.
Collec

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tanhonjung/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Load the Dataset

In [3]:
import pandas as pd
file_path = "Tweets.csv"
df = pd.read_csv(file_path)
df


FileNotFoundError: [Errno 2] No such file or directory: 'Tweets.csv'

In [None]:
# Define the list of airlines (case-insensitive)
airline_list = ["VirginAmerica", "united", "SouthwestAir", "JetBlue", "USAirways", "AmericanAir"]

def clean_text(text):

    # Convert emojis to text descriptions
    text = emoji.demojize(text, delimiters=(" ", " "))  # 😊 → " smiley face "

    # Expand contractions (e.g., "can't" → "cannot")
    text = contractions.fix(text)

    # Remove hashtags but keep words (e.g., "#happy" → "happy")
    text = re.sub(r'#(\w+)', r'\1', text)

    # Extract airline name if mentioned
    airline_pattern = r'@(' + '|'.join(airline_list) + r')\b'
    match = re.search(airline_pattern, text, re.IGNORECASE)
    airline = match.group(1) if match else "Unknown"

    # Remove airline mentions
    text = re.sub(airline_pattern, '', text, flags=re.IGNORECASE).strip()

    # Detect and remove hyperlinks
    link_pattern = r'http\S+'
    has_link = 1 if re.search(link_pattern, text) else 0
    text = re.sub(link_pattern, '', text).strip()

    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', text).lower()

    # Remove underscores
    text = text.replace('_', ' ')
    text = text.replace('-', ' ')

    # Apply lemmatization
    words = text.split()
    words = [lemmatizer.lemmatize(w) for w in words]
    text = " ".join(words)

    return airline, has_link, text

# Apply function and create new columns
df[['airline', 'has_link', 'clean_text']] = df['text'].apply(lambda x: pd.Series(clean_text(x)))


# Display first few rows of the emoji-converted text
df[380:391]


Unnamed: 0,airline_sentiment,sentiment_confidence,text,airline,has_link,clean_text
380,positive,1.0,@VirginAmerica gave a credit for my Late Fligh...,VirginAmerica,0,gave a credit for my late flight flight yester...
381,neutral,1.0,@VirginAmerica I need a receipt for a flight c...,VirginAmerica,0,i need a receipt for a flight change can you s...
382,negative,1.0,"@VirginAmerica, I submitted a status match req...",VirginAmerica,0,i submitted a status match request a while bac...
383,positive,1.0,@VirginAmerica had me at their safety video . ...,VirginAmerica,1,had me at their safety video loved my first cr...
384,positive,0.6871,@VirginAmerica that doesn't look to fat to me!...,VirginAmerica,0,that doe not look to fat to me it look yummy
385,neutral,1.0,@VirginAmerica CEO says #Southwest &amp; #jetb...,VirginAmerica,1,ceo say southwest amp jetblue have strayed fro...
386,neutral,0.6811,@VirginAmerica a brilliant brisk am in Boston ...,VirginAmerica,1,a brilliant brisk am in boston in cue for vx363
387,neutral,1.0,@VirginAmerica Atlantic ploughs a lone furrow ...,VirginAmerica,1,atlantic plough a lone furrow in the middleeas...
388,neutral,1.0,@VirginAmerica Atlantic ploughs a lone furrow ...,VirginAmerica,1,atlantic plough a lone furrow in the middleeas...
389,neutral,0.7026,@VirginAmerica Atlantic ploughs a lone furrow ...,VirginAmerica,1,atlantic plough a lone furrow in the middleeas...


## Sentiment Classification using DistilBert Classifier


For this Deep Learning method, we wil be using transfer learning using a pre-trained model, specifically DistilBert with fine-tuning. DistilBert is a faster, cheaper and lighter transformer model based on the Bert architecture (Sanh et al., 2019), making it less resource intensive while retaining most of its predictive capabilities.


This approach draws inspiration from (Akpatsa et al., 2022) whereby researchers used both Bert and DistilBert models to perform Online News Sentiment Classification. The experiments confirmed the superiority of the transformer-based (BERT, and DistilBERT) models over other Machine Learning on a downstream NLP task, which is what we will try to reproduce in our study.



---



Before being passed into the code, sentiment classifications need to be encoded into numerical values such as {0: negative, 1: neutral, 2: positive}, a format which the algorithm understands.

We then use DistilBertTokenizer to split raw text data into smaller units called tokens, making it suitable for processing by the DistilBert model.

We have also used a train-test split of 80:20 for model validation.

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer

import torch
import numpy as np

# Encode labels (sentiment categories)
label_encoder = LabelEncoder()
df['encoded_sentiment'] = label_encoder.fit_transform(df['airline_sentiment'])

# Split into features (X) and target (y)
X = df['clean_text']
y = df['encoded_sentiment']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Load DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize and pad the text data
X_train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')
X_test_encodings = tokenizer(X_test.tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')

# Convert the labels to tensors
y_train_tensor = torch.tensor(y_train.values)
y_test_tensor = torch.tensor(y_test.values)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

The study also show that a class imbalance in the dataset can affect model performance (Akpatsa et al., 2022). Keeping this in mind, we will use RandomOverSampler to address imbalanced datasets by randomly duplicating examples from the minority class to balance the class distribution.

In [None]:
from imblearn.over_sampling import RandomOverSampler

input_ids_np = X_train_encodings['input_ids'].numpy()
attention_mask_np = X_train_encodings['attention_mask'].numpy()
labels_np = y_train_tensor.numpy()

# Combine input_ids and attention_mask
X_combined = np.concatenate([input_ids_np, attention_mask_np], axis=1)

# Apply RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_combined, labels_np)

# Split resampled data
seq_len = X_train_encodings['input_ids'].shape[1]
input_ids_resampled = X_resampled[:, :seq_len]
attention_mask_resampled = X_resampled[:, seq_len:]

# Convert back to tensors
input_ids_tensor = torch.tensor(input_ids_resampled, dtype=torch.long)
attention_mask_tensor = torch.tensor(attention_mask_resampled, dtype=torch.long)
y_resampled_tensor = torch.tensor(y_resampled, dtype=torch.long)

Define key model parameters and check oversampled training set size using ```len(train_dataset)```

Improving on the study, we used AdamW optimizer instead of Adam, which has been shown to yield better results in NLP and leads to better generalization and stability (Zhou et al., 2024).

In [None]:

from transformers import DistilBertForSequenceClassification
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score


# Define the DistilBERT model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Create TensorDataset for training
train_dataset = TensorDataset(input_ids_tensor, attention_mask_tensor, y_resampled_tensor)
test_dataset = TensorDataset(X_test_encodings['input_ids'], X_test_encodings['attention_mask'], y_test_tensor)

# Create DataLoader for training and validation
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print("Oversampled training set size:", len(train_dataset))

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Oversampled training set size: 22014


Without Hyperparameter Tuning

In [None]:
# Training loop
def train_model(model, train_loader, optimizer, device, epochs=3):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}, Loss: {avg_loss}")

# Train the model
train_model(model, train_loader, optimizer, device)

# Save model
model.save_pretrained('/content/drive/MyDrive/base_model')

Epoch 1, Loss: 0.40966614731858203
Epoch 2, Loss: 0.1546566090989874
Epoch 3, Loss: 0.07754596218331054


Evaluation Function

In [None]:
from sklearn.metrics import classification_report, f1_score

def evaluate_model(model, test_loader, device):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    # Accuracy
    f1 = f1_score(all_labels, all_preds, average = 'weighted')

    # Classification report
    report = classification_report(all_labels, all_preds, digits=4)

    return f1, report

# Usage
f1, report = evaluate_model(model, test_loader, device)

print(f"f1_score: {f1 * 100:.2f}%")
print("Classification Report:\n")
print(report)

f1_score: 81.99%
Classification Report:

              precision    recall  f1-score   support

           0     0.8694    0.9120    0.8902      1840
           1     0.6782    0.6483    0.6629       634
           2     0.8138    0.7026    0.7541       454

    accuracy                         0.8224      2928
   macro avg     0.7871    0.7543    0.7691      2928
weighted avg     0.8194    0.8224    0.8199      2928



One limitation of the study is that it uses a Stochastic Gradient Descent with Restart (SGDR) policy learning rate, fine-tuning only the learning rate parameter.

To improve on this, we used Bayesian Optimization as our hyperparameter tuning method, in order to find the best values for other parameters including batch size and number of epochs.



In [None]:
!pip install optuna
import optuna
from sklearn.metrics import f1_score

# Define the objective function for optimization
def objective(trial):
    # Suggest hyperparameters for optimization
    lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
    batch_size = trial.suggest_int('batch_size', 8, 32)  # Batch size
    epochs = trial.suggest_int('epochs', 3, 5)  # Number of epochs

    # Load DistilBERT model
    model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

    # Set up optimizer
    optimizer = AdamW(model.parameters(), lr=lr)

    # Move model to device (GPU if available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Training loop
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            loss.backward()
            optimizer.step()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}, Loss: {avg_loss}")

    # Evaluate the model using F1-score
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=-1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    # Calculate F1-score
    f1 = f1_score(all_labels, all_preds, average='weighted')
    print(f"F1-Score: {f1 * 100:.2f}%")

    return f1  # Return the F1-score for optimization


study = optuna.create_study(direction="maximize")  # maximize F1-score
study.optimize(objective, n_trials=10)  # adjust n_trials for more/less iterations

# Get the best hyperparameters and evaluation score
print(f"Best hyperparameters: {study.best_params}")
print(f"Best F1-score: {study.best_value * 100:.2f}%")

model.save_pretrained('/content/drive/MyDrive/best_model')

Collecting optuna
  Downloading optuna-4.2.1-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.15.2-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.2.1-py3-none-any.whl (383 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m383.6/383.6 kB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading alembic-1.15.2-py3-none-any.whl (231 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m231.9/231.9 kB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, alembic, optuna
Successfully installed alembic-1.15.2 colorlog-6.9.0 optuna-4.2.1


[I 2025-04-06 08:31:04,173] A new study created in memory with name: no-name-79f68cc8-f836-4ca9-b08b-42ec627ef78f
  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 1.002929538878244
Epoch 2, Loss: 1.0993121760182603
Epoch 3, Loss: 1.0993966068639311
Epoch 4, Loss: 1.0989521135424458


[I 2025-04-06 08:33:31,123] Trial 0 finished with value: 0.041629100380348553 and parameters: {'lr': 0.0002808669421546765, 'batch_size': 20, 'epochs': 4}. Best is trial 0 with value: 0.041629100380348553.


F1-Score: 4.16%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.4111246243904366
Epoch 2, Loss: 0.13718596368257854
Epoch 3, Loss: 0.07292249710363374
Epoch 4, Loss: 0.05776971450245054


[I 2025-04-06 08:35:57,287] Trial 1 finished with value: 0.8110460208626278 and parameters: {'lr': 4.1442627530596416e-05, 'batch_size': 30, 'epochs': 4}. Best is trial 1 with value: 0.8110460208626278.


F1-Score: 81.10%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.4218285255144935
Epoch 2, Loss: 0.14666254213858168
Epoch 3, Loss: 0.07322205554331493
Epoch 4, Loss: 0.05265067413766215
Epoch 5, Loss: 0.04178116296715902


[I 2025-04-06 08:38:59,231] Trial 2 finished with value: 0.8258881695443688 and parameters: {'lr': 3.164628635340893e-05, 'batch_size': 21, 'epochs': 5}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 82.59%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.4173861502989217
Epoch 2, Loss: 0.14743418280450832
Epoch 3, Loss: 0.07809074967371998


[I 2025-04-06 08:40:49,823] Trial 3 finished with value: 0.8212429729061241 and parameters: {'lr': 3.312585424354438e-05, 'batch_size': 11, 'epochs': 3}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 82.12%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.45310692792503854
Epoch 2, Loss: 0.19868316867850386
Epoch 3, Loss: 0.12086189056927377
Epoch 4, Loss: 0.09733052257713076
Epoch 5, Loss: 0.07751194379349283


[I 2025-04-06 08:43:52,454] Trial 4 finished with value: 0.7829285045563934 and parameters: {'lr': 0.00012046725376362701, 'batch_size': 8, 'epochs': 5}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 78.29%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 1.1042040129473736
Epoch 2, Loss: 1.0992637444720712
Epoch 3, Loss: 1.100295849577632


[I 2025-04-06 08:45:42,781] Trial 5 finished with value: 0.48501852055598343 and parameters: {'lr': 0.0008323253093785686, 'batch_size': 19, 'epochs': 3}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 48.50%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 1.1028445979017159
Epoch 2, Loss: 1.0991527578511904
Epoch 3, Loss: 1.0997168955414793


[I 2025-04-06 08:47:33,076] Trial 6 finished with value: 0.041629100380348553 and parameters: {'lr': 0.0007961808660018567, 'batch_size': 8, 'epochs': 3}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 4.16%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.4878950303929403
Epoch 2, Loss: 0.22447203208630104
Epoch 3, Loss: 0.11928508801014856
Epoch 4, Loss: 0.06945853611521374


[I 2025-04-06 08:49:59,466] Trial 7 finished with value: 0.8143633029548144 and parameters: {'lr': 1.022774604734039e-05, 'batch_size': 22, 'epochs': 4}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 81.44%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 1.0850071417557638
Epoch 2, Loss: 1.0820023433933423
Epoch 3, Loss: 1.0991575656415418
Epoch 4, Loss: 1.099039007064908
Epoch 5, Loss: 1.09100415912825


[I 2025-04-06 08:53:01,717] Trial 8 finished with value: 0.04205714526078489 and parameters: {'lr': 0.00027515517332248653, 'batch_size': 26, 'epochs': 5}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 4.21%


  lr = trial.suggest_loguniform('lr', 1e-5, 1e-3)  # Learning rate (log scale)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 1.1013730433444644
Epoch 2, Loss: 1.0991664035895536
Epoch 3, Loss: 1.0997793682266113


[I 2025-04-06 08:54:52,210] Trial 9 finished with value: 0.041629100380348553 and parameters: {'lr': 0.0005448943619264644, 'batch_size': 28, 'epochs': 3}. Best is trial 2 with value: 0.8258881695443688.


F1-Score: 4.16%
Best hyperparameters: {'lr': 3.164628635340893e-05, 'batch_size': 21, 'epochs': 5}
Best F1-score: 82.59%


Best Model

In [None]:
# Retrieve the best hyperparameters
best_params = study.best_params

# Load the best model
best_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)
best_optimizer = AdamW(best_model.parameters(), lr=best_params['lr'])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
best_model.to(device)

# Create DataLoader for training with best batch size
best_train_loader = DataLoader(train_dataset, batch_size=best_params['batch_size'], shuffle=True)

# Retrain the model using best hyperparameters
train_model(best_model, best_train_loader, best_optimizer, device)

# Evaluate using evaluation function defined above
f1, report = evaluate_model(best_model, test_loader, device)

print(f"Final Evaluation using Best Hyperparameters:")
print(f"F1-Score: {f1 * 100:.2f}%")
print("Classification Report:\n")
print(report)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.4253795556368944
Epoch 2, Loss: 0.15389986484408294
Epoch 3, Loss: 0.0712280424302435
Final Evaluation using Best Hyperparameters:
F1-Score: 82.33%
Classification Report:

              precision    recall  f1-score   support

           0     0.9009    0.8897    0.8953      1840
           1     0.6976    0.6293    0.6617       634
           2     0.6976    0.8282    0.7573       454

    accuracy                         0.8238      2928
   macro avg     0.7654    0.7824    0.7714      2928
weighted avg     0.8254    0.8238    0.8233      2928



## References


---

Akpatsa, S.K., Lei, H., Li, X., Setornyo Obeng, V.K., Martey, E.M. et al. (2022). Online News Sentiment Classification Using DistilBERT. Journal of Quantum Computing, 4(1), 1–11. https://doi.org/10.32604/jqc.2022.026658

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019, October 2). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv.org. https://arxiv.org/abs/1910.01108v4

Zhou, P., Xie, X., Lin, Z., & Yan, S. (2024). Towards understanding convergence and generalization of AdamW. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1-8. https://ink.library.smu.edu.sg/sis_research/8986
