# Fine-Tuning DistilBERT
The goal is to fine-tune DistilBERT to predict sentiment on the Sentiment140 Twitter dataset.

## About Dataset
#### Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

#### Content
It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: the id of the tweet (2087)

date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

flag: The query (lyx). If there is no query, then this value is NO_QUERY.

user: the user that tweeted (robotickilldozr)

text: the text of the tweet (Lyx is cool)

#### Acknowledgements
The official link regarding the dataset with resources about how it was generated is here
The official paper detailing the approach is here

#### Citation
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.



In [1]:
# Importing library
from datasets import Dataset, Features, ClassLabel, Value
import pandas as pd
import numpy as np
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding



## 1. Loading and Inspecting Data

In [2]:
# Load and preprocess data
path = "c:\\Users\\Alex Chung\\Documents\\the_Lab\\Portfolio\\ml_engineering\\data\\sentiment140\\"
file = "training.1600000.processed.noemoticon.csv"
df = pd.read_csv(path + file, 
                 encoding="ISO-8859-1", names=["target", "id", "date", "flag", "user", "text"])

In [3]:
df.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   id      1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


## 2. Preprocessing

In [5]:
# Taking only 10,000 samples
df = df[["target", "text"]].sample(10000, random_state=42)
df["target"] = df["target"].map({0: 0, 4: 1})
df = df.reset_index(drop=True)

# Define dataset features
features = Features({
    "target": ClassLabel(names=["negative", "positive"]),
    "text": Value(dtype="string")
})
dataset = Dataset.from_pandas(df, features=features)

# Check original distribution
print("Original Label Distribution:")
print(df["target"].value_counts(normalize=True))

Original Label Distribution:
target
0    0.5004
1    0.4996
Name: proportion, dtype: float64


### Tokenize Data

In [6]:
# Tokenize, preserving labels
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=False)
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Rename target to labels
tokenized_dataset = tokenized_dataset.rename_column("target", "labels")

# Verify dataset columns
print("Tokenized dataset columns:", tokenized_dataset.column_names)




Tokenized dataset columns: ['labels', 'input_ids', 'attention_mask']
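
Because tokenization runs with `padding=False`, sequences keep their natural lengths and padding is deferred to `DataCollatorWithPadding` at batch time. The sketch below illustrates the idea of padding each batch to its longest member in plain Python. It is a simplified illustration, not the library's implementation; the helper name `pad_batch` and the example token ids are hypothetical, and the padding id of 0 is assumed to match the `[PAD]` token of `distilbert-base-uncased`.

```python
# Simplified illustration of dynamic padding; not the transformers implementation.
def pad_batch(examples, pad_token_id=0):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention mask 0 so the model ignores them.
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
    return batch


# Hypothetical token ids, for illustration only.
examples = [
    {"input_ids": [101, 2074, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2074, 2513, 2013, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(examples)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short tweets from being padded out to the longest tweet in the whole dataset.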


### Splitting into train and test set

In [7]:
# Stratified split
train_test = tokenized_dataset.train_test_split(test_size=0.2, seed=42, stratify_by_column="labels")
train_dataset = train_test["train"]
test_dataset = train_test["test"]

# Verify split and columns
print(f"\nTrain size: {len(train_dataset)}, Test size: {len(test_dataset)}")
print("Train dataset columns:", train_dataset.column_names)
print("Test dataset columns:", test_dataset.column_names)
train_dist = pd.Series(train_dataset["labels"]).value_counts(normalize=True)
test_dist = pd.Series(test_dataset["labels"]).value_counts(normalize=True)
print("Train Label Distribution:")
print(train_dist)
print("Test Label Distribution:")
print(test_dist)

# Inspect tokenized dataset
print("\nFirst Train Example:")
print(train_dataset[0])


Train size: 8000, Test size: 2000
Train dataset columns: ['labels', 'input_ids', 'attention_mask']
Test dataset columns: ['labels', 'input_ids', 'attention_mask']
Train Label Distribution:
0    0.500375
1    0.499625
Name: proportion, dtype: float64
Test Label Distribution:
0    0.5005
1    0.4995
Name: proportion, dtype: float64

First Train Example:
{'labels': 1, 'input_ids': [101, 2074, 2513, 2013, 7873, 2777, 1012, 2986, 2396, 2265, 1010, 13366, 16294, 4221, 2135, 1037, 3459, 2846, 1997, 2147, 1010, 4169, 1999, 3327, 2001, 6581, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
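
To show why the train and test label proportions come out nearly identical, here is a minimal sketch of a stratified split using only the standard library. It is an illustration of the idea, not how `train_test_split` is implemented in `datasets`; the function name `stratified_split` is hypothetical, and the class counts (5,004 negative, 4,996 positive) are derived from the label distribution printed earlier.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Same class counts as the 10,000-tweet sample: 5,004 negative, 4,996 positive.
labels = [0] * 5004 + [1] * 4996
train_idx, test_idx = stratified_split(labels)

test_negative_share = sum(labels[i] == 0 for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_negative_share)
```

Splitting each class separately is what keeps the negative share in the test set close to the share in the full sample.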


## 3. Training the Model

In [10]:
# Instantiate model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Set up Trainer
training_args = TrainingArguments(
    output_dir="C:/Users/Alex Chung/Documents/ml_engineering_clean/results",
    logging_dir="C:/Users/Alex Chung/Documents/ml_engineering_clean/logs",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",  # Changed from eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=100,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    compute_metrics=lambda eval_pred: {
        "accuracy": (np.argmax(eval_pred.predictions, axis=1) == eval_pred.label_ids).mean()
    }
)

In [14]:
# Train the model
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,0.4269,0.444462,0.793
2,0.2867,0.489853,0.8075
3,0.129,0.739199,0.8025


TrainOutput(global_step=1500, training_loss=0.30671591504414875, metrics={'train_runtime': 4676.5072, 'train_samples_per_second': 5.132, 'train_steps_per_second': 0.321, 'total_flos': 267277814425728.0, 'train_loss': 0.30671591504414875, 'epoch': 3.0})
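
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which is set explicitly in the arguments above.

```python
import math

train_size = 8000   # training examples after the 80/20 split
batch_size = 16     # per_device_train_batch_size
num_epochs = 3      # num_train_epochs

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Throughput implied by the reported runtime.
train_runtime = 4676.5072  # seconds, from TrainOutput
steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(train_size * num_epochs / train_runtime, 3)
print(steps_per_second, samples_per_second)
```

Under those assumptions the result agrees with the reported `global_step=1500`, `train_steps_per_second` of 0.321, and `train_samples_per_second` of 5.132.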

## 4. Evaluation

In [15]:
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)
trainer.save_model("C:/Users/Alex Chung/Documents/ml_engineering_clean/final_model")

Evaluation Results: {'eval_loss': 0.4444619417190552, 'eval_accuracy': 0.793, 'eval_runtime': 114.8718, 'eval_samples_per_second': 17.411, 'eval_steps_per_second': 1.088, 'epoch': 3.0}
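
Note that the reported `eval_loss` and `eval_accuracy` match epoch 1 of the training log rather than epoch 3. That is consistent with `load_best_model_at_end=True`, which, to my understanding of the Trainer defaults, selects the checkpoint with the lowest validation loss when `metric_for_best_model` is not set. The sketch below compares the two selection criteria on the logged values; it only reads the numbers from the table and does not call the library.

```python
# Per-epoch metrics copied from the training log of the first run.
history = [
    {"epoch": 1, "eval_loss": 0.444462, "eval_accuracy": 0.793},
    {"epoch": 2, "eval_loss": 0.489853, "eval_accuracy": 0.8075},
    {"epoch": 3, "eval_loss": 0.739199, "eval_accuracy": 0.8025},
]

# Selecting by lowest validation loss versus highest accuracy.
best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_accuracy = max(history, key=lambda row: row["eval_accuracy"])

print("Best by loss:", best_by_loss)
print("Best by accuracy:", best_by_accuracy)
```

The two criteria pick different epochs here, which is why choosing the selection metric explicitly matters when the target is accuracy.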


### Observations
The model is overfitting: training loss falls with each epoch (from 0.4269 to 0.129), while validation loss rises (from 0.4445 to 0.7392) and validation accuracy stays flat at around 0.80.

There are a few things to try to improve the model:
- Add regularization
- Use early stopping if validation doesn't improve for 1 epoch
- Lower the learning rate
- Increase dropout
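
As a rough illustration of how patience-based early stopping behaves, the sketch below replays the per-epoch validation metrics from the first run. It is a simplified model of the idea, not the source of `EarlyStoppingCallback`; the function name `epochs_run` is hypothetical.

```python
def epochs_run(metric_per_epoch, patience=1, greater_is_better=True):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = value > best
        else:
            improved = value < best

        if improved:
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(metric_per_epoch)


# Validation metrics from the first training run.
first_run_accuracy = [0.793, 0.8075, 0.8025]
first_run_loss = [0.444462, 0.489853, 0.739199]

print(epochs_run(first_run_accuracy, patience=1, greater_is_better=True))
print(epochs_run(first_run_loss, patience=1, greater_is_better=False))
```

With a patience of 1, monitoring validation loss would have stopped the first run an epoch earlier than monitoring accuracy, so the monitored metric changes how much training is saved.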

In [8]:
from transformers import EarlyStoppingCallback

In [11]:
training_args = TrainingArguments(
    output_dir="C:/Users/Alex Chung/Documents/ml_engineering_clean/results",
    logging_dir="C:/Users/Alex Chung/Documents/ml_engineering_clean/logs",
    num_train_epochs=3,  # Kept at 3 but added early stopping if doesn't improve in 1 epoch
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",  # Optimize for accuracy
    greater_is_better=True,
    logging_steps=100,
    weight_decay=0.01,  # Add regularization
    learning_rate=2e-5,  # Lower learning rate
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    compute_metrics=lambda eval_pred: {
        "accuracy": (np.argmax(eval_pred.predictions, axis=1) == eval_pred.label_ids).mean()
    },
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)] # early stopping
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.4276,0.423437,0.8055
2,0.3269,0.466076,0.799


TrainOutput(global_step=1000, training_loss=0.4022690372467041, metrics={'train_runtime': 2908.4469, 'train_samples_per_second': 8.252, 'train_steps_per_second': 0.516, 'total_flos': 178127255130240.0, 'train_loss': 0.4022690372467041, 'epoch': 2.0})

### Observations

- Final eval accuracy: 0.8055 (meets >80% requirement).
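- Training loss fell from 0.4276 to 0.3269 over two epochs, while validation loss rose from 0.4234 to 0.4661 and accuracy slipped from 0.8055 to 0.799.
- Steps taken to address overfitting: early stopping with a patience of 1 epoch, weight decay of 0.01, and a lower learning rate of 2e-5. Early stopping ended training after epoch 2.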
- Challenges faced (e.g., numpy errors, cl.exe, labels, evaluation_strategy) and resolutions.
