# Fine-Tuning DistilBERT
The goal is to fine-tune DistilBERT to predict tweet sentiment on the Sentiment140 Twitter dataset.

## About Dataset
#### Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

#### Content
It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive); the training file loaded in this notebook contains only 0 and 4

ids: the id of the tweet (2087)

date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

flag: the query (lyx). If there is no query, this value is NO_QUERY.

user: the user that tweeted (robotickilldozr)

text: the text of the tweet (Lyx is cool)

#### Acknowledgements
The official link regarding the dataset with resources about how it was generated is here
The official paper detailing the approach is here

#### Citation
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.



In [34]:
from datasets import load_dataset, Dataset, Features, ClassLabel, Value
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
import pandas as pd
import os
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', None)
os.getcwd()

'C:\\Users\\Alex Chung\\Documents\\the_Lab\\Portfolio\\ml_engineering\\notebooks'

## 1. Loading and Inspecting Data

In [42]:
path = "c:\\Users\\Alex Chung\\Documents\\the_Lab\\Portfolio\\ml_engineering\\data\\sentiment140\\"
df = pd.read_csv(path+"training.1600000.processed.noemoticon.csv", encoding="ISO-8859-1", names=["target", "id", "date", "flag", "user", "text"])
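The file has no header row, which is why the column names are passed through `names=`. As a minimal sketch of what a single raw record looks like, the snippet below parses one line with the standard `csv` module. The sample line is reconstructed from the first row of the data and assumes every field is double-quoted, so treat it as illustrative rather than a byte-exact copy of the file; `parse_record` is a hypothetical helper.

```python
import csv
import io

COLUMNS = ["target", "id", "date", "flag", "user", "text"]

# Hypothetical raw line, reconstructed from the first row of the dataset.
raw_line = (
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY",'
    '"_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, '
    "that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D\"\n"
)


def parse_record(line):
    """Parse one header-less CSV line into a dict keyed by COLUMNS."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
    record = dict(zip(COLUMNS, row))
    # The two numeric columns arrive as strings and are converted explicitly.
    record["target"] = int(record["target"])
    record["id"] = int(record["id"])
    return record


record = parse_record(raw_line)
print(record["target"], record["user"])
```

Commas inside the tweet text are safe here because the field is quoted, which is the same reason `read_csv` can split the six columns reliably.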

In [3]:
df.head()

|   | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   id      1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [37]:
df.target.value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

In [38]:
df.text.to_list()[:2]

["@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D",
 "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"]

## 2. Preprocessing

In [43]:
# Load subset of Twitter data
df = df[["target", "text"]].sample(10000, random_state=42)  # Subset for speed
df["target"] = df["target"].map({0: 0, 4: 1})  # Map labels
df = df.reset_index(drop=True)  # Reset index to avoid __index_level_0__
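The dataset description lists a neutral class (2), while the mapping above only covers 0 and 4. With pandas, `Series.map` given a dict turns any unmapped value into `NaN` without raising, so a stray label would pass silently. A minimal pure-Python sketch of a stricter remapping follows; `remap_labels` is a hypothetical helper, not part of the notebook.

```python
LABEL_MAP = {0: 0, 4: 1}  # original polarity -> binary class id


def remap_labels(raw_labels):
    """Map Sentiment140 polarities to {0, 1}, failing loudly on anything else."""
    unexpected = sorted(set(raw_labels) - set(LABEL_MAP))
    if unexpected:
        raise ValueError(f"unmapped label values: {unexpected}")
    return [LABEL_MAP[label] for label in raw_labels]


print(remap_labels([0, 4, 4, 0]))
```

In the notebook itself, a comparable check would be to confirm that `df["target"].isna().sum()` is zero after the mapping.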

In [44]:
df.head()

|   | target | text |
|---|---|---|
| 0 | 0 | @chrishasboobs AHHH I HOPE YOUR OK!!! |
| 1 | 0 | @misstoriblack cool , i have no tweet apps for my razr 2 |
| 2 | 0 | @TiannaChaos i know just family drama. its lame.hey next time u hang out with kim n u guys like have a sleepover or whatever, ill call u |
| 3 | 0 | School email won't open and I have geography stuff on there to revise! \*Stupid School\* :'( |
| 4 | 0 | upper airways problem |


In [45]:
# Define dataset features with ClassLabel for target
features = Features({
    "target": ClassLabel(names=["negative", "positive"]),  # Define 0=negative, 1=positive
    "text": Value("string")
})
dataset = Dataset.from_pandas(df, features=features)

In [46]:
# Tokenize
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)


### Checking the tokenized dataset

In [47]:
# 1. Dataset overview
print("Dataset Info:")
print(tokenized_dataset)
print("Columns:", tokenized_dataset.column_names)

Dataset Info:
Dataset({
    features: ['target', 'text', 'input_ids', 'attention_mask'],
    num_rows: 10000
})
Columns: ['target', 'text', 'input_ids', 'attention_mask']


In [48]:
# 2. Single example
print("First Example:")
print(tokenized_dataset[0])

First Example:
{'target': 0, 'text': '@chrishasboobs AHHH I HOPE YOUR OK!!! ', 'input_ids': [101, 1030, 3782, 14949, 5092, 16429, 2015, 6289, 23644, 1045, 3246, 2115, 7929, 999, 999, 999, 102, 0, 0, 0, ... (output truncated)

In [49]:
# 3. Multiple examples as table
print("First 5 Examples:")
df_tokenized = tokenized_dataset.select(range(5)).to_pandas()
df_tokenized[['text', 'target', 'input_ids', 'attention_mask']]

First 5 Examples:


|   | text | target | input_ids | attention_mask |
|---|---|---|---|---|
| 0 | @chrishasboobs AHHH I HOPE YOUR OK!!! | 0 | [101, 1030, 3782, 14949, 5092, 16429, 2015, 6289, 23644, 1045, 3246, 2115, 7929, 999, 999, 999, 102, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...] |
| 1 | @misstoriblack cool , i have no tweet apps for my razr 2 | 0 | [101, 1030, 3335, 29469, 28522, 3600, 4658, 1010, 1045, 2031, 2053, 1056, 28394, 2102, 18726, 2005, 2026, 10958, 2480, 2099, 1016, 102, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...] |
| 2 | @TiannaChaos i know just family drama. its lame.hey next time u hang out with kim n u guys like have a sleepover or whatever, ill call u | 0 | [101, 1030, 23401, 18357, 3270, 2891, 1045, 2113, 2074, 2155, 3689, 1012, 2049, 20342, 1012, 4931, 2279, 2051, 1057, 6865, 2041, 2007, 5035, 1050, 1057, 4364, 2066, 2031, 1037, 3637, 7840, 2030, 3649, 1010, 5665, 2655, 1057, 102, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...] |
| 3 | School email won't open and I have geography stuff on there to revise! \*Stupid School\* :'( | 0 | [101, 2082, 10373, 2180, 1005, 1056, 2330, 1998, 1045, 2031, 10505, 4933, 2006, 2045, 2000, 7065, 5562, 999, 1008, 5236, 2082, 1008, 1024, 1005, 1006, 102, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...] |
| 4 | upper airways problem | 0 | [101, 3356, 13095, 3291, 102, 0, 0, 0, ...] | [1, 1, 1, 1, 1, 0, 0, 0, ...] |


In [50]:
# 4. Decode tokens
print("Decoded Example:")
sample = tokenized_dataset[0]
decoded_text = tokenizer.decode(sample['input_ids'], skip_special_tokens=True)
print(f"Original: {sample['text']}")
print(f"Decoded: {decoded_text}")

Decoded Example:
Original: @chrishasboobs AHHH I HOPE YOUR OK!!! 
Decoded: @ chrishasboobs ahhh i hope your ok!!!


In [51]:
# 5. Verify lengths
lengths = [len(sample['input_ids']) for sample in tokenized_dataset]
print(f"\nAll lengths 512? {all(length == 512 for length in lengths)}")


All lengths 512? True


In [52]:
# 6. Check labels
unique_labels = set(tokenized_dataset['target'])
print(f"Labels: {unique_labels}")

Labels: {0, 1}


In [53]:
# 7. Inspect attention mask
print("Attention Mask Example:")
token_count = sum(sample['attention_mask'])
print(f"Non-padding tokens: {token_count}")
print(f"First 10 input_ids: {sample['input_ids'][:10]}")
print(f"First 10 attention_mask: {sample['attention_mask'][:10]}")

Attention Mask Example:
Non-padding tokens: 17
First 10 input_ids: [101, 1030, 3782, 14949, 5092, 16429, 2015, 6289, 23644, 1045]
First 10 attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
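The checks above show that every sequence is padded to 512 positions while this example holds only 17 real tokens. As a rough way to quantify that overhead, the sketch below computes the share of padding positions from a list of attention masks; `padding_fraction` is a hypothetical helper and the mask is rebuilt by hand from the counts printed above.

```python
def padding_fraction(attention_masks):
    """Return the share of positions that are padding (mask value 0)."""
    total = sum(len(mask) for mask in attention_masks)
    real = sum(sum(mask) for mask in attention_masks)
    return 1 - real / total


# Mask for the first example: 17 real tokens, padded out to length 512.
example_mask = [1] * 17 + [0] * 495
print(f"Padding share: {padding_fraction([example_mask]):.1%}")  # Padding share: 96.7%
```

If the share is similarly high across the dataset, passing a smaller `max_length` to the tokenizer, or padding per batch with `DataCollatorWithPadding`, would likely cut training cost; whether tweets ever exceed the chosen length should be checked first.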


### Splitting the tokenized dataset into stratified train and test sets

In [54]:
# Stratified train/test split
train_test = tokenized_dataset.train_test_split(test_size=0.2, seed=42, stratify_by_column="target")
train_dataset = train_test["train"]
test_dataset = train_test["test"]

# Verify sizes
print(f"Train size: {len(train_dataset)}, Test size: {len(test_dataset)}")

Train size: 8000, Test size: 2000


In [56]:
# Verify split balance
print(f"\nTrain size: {len(train_dataset)}, Test size: {len(test_dataset)}")
train_dist = pd.Series(train_dataset["target"]).value_counts(normalize=True)
test_dist = pd.Series(test_dataset["target"]).value_counts(normalize=True)
print("Train Label Distribution:")
print(train_dist)
print("Test Label Distribution:")
print(test_dist)


Train size: 8000, Test size: 2000
Train Label Distribution:
0    0.500375
1    0.499625
Name: proportion, dtype: float64
Test Label Distribution:
0    0.5005
1    0.4995
Name: proportion, dtype: float64
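Both splits keep a class balance close to 50/50, which is what stratification is meant to guarantee. To make the idea concrete, here is a minimal pure-Python sketch that splits indices per class and then recombines them. It illustrates the concept only and is not the algorithm used inside `datasets`; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because each class is sampled separately, the test set cannot end up dominated by one label, which a plain random split only guarantees on average.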


In [55]:
# Verify sequence lengths
train_lengths = [len(sample['input_ids']) for sample in train_dataset]
test_lengths = [len(sample['input_ids']) for sample in test_dataset]
print(f"\nTrain lengths 512? {all(length == 512 for length in train_lengths)}")
print(f"Test lengths 512? {all(length == 512 for length in test_lengths)}")


Train lengths 512? True
Test lengths 512? True


In [12]:
Dataset

datasets.arrow_dataset.Dataset

In [8]:
# Split the data into train and test sets
train, test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['target'])

print(f"New stratified dataframe shapes: train is {train.shape}, test is {test.shape}")

New stratified dataframe shapes: train is (1280000, 6), test is (320000, 6)


In [9]:
print("Train target counts:")
train.target.value_counts()

Train target counts:


target
4    640000
0    640000
Name: count, dtype: int64

In [10]:
print("Test target counts:")
test.target.value_counts()

Test target counts:


target
0    160000
4    160000
Name: count, dtype: int64

In [12]:
from transformers import pipeline

In [13]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


NameError: name 'init_empty_weights' is not defined

In [None]:
classifier("I've been waiting for a HuggingFace course my whole life.")

In [11]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(df.text.to_list()[:2])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


NameError: name 'init_empty_weights' is not defined
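The `NameError` is raised inside `transformers`, not by the notebook code. `init_empty_weights` is a helper provided by the `accelerate` package, so one plausible cause (an assumption, not something the output above confirms) is that `accelerate` is missing or out of step with the installed `transformers` version. A minimal sketch for recording what is installed, using only the standard library; `installed_versions` is a hypothetical helper:

```python
from importlib import metadata


def installed_versions(packages=("transformers", "accelerate", "torch")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If `accelerate` shows as not installed, installing it or upgrading `transformers` and then restarting the kernel would be a reasonable first thing to try.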

## 3. Model Training
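The notebook imports `DistilBertForSequenceClassification`, `Trainer`, and `TrainingArguments` but does not yet show a training cell. The following is a minimal sketch of how they could be wired together, not the author's configuration: the hyperparameters, the `output_dir` name, and the helpers `argmax_accuracy`, `compute_metrics`, and `build_trainer` are all assumptions. The model's forward pass expects the label under the name `labels`, so the `target` column is renamed first.

```python
def argmax_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(predictions)


def compute_metrics(eval_pred):
    """Metric callback for Trainer; eval_pred unpacks to (logits, labels)."""
    logits, labels = eval_pred
    return {"accuracy": argmax_accuracy(logits, labels)}


def build_trainer(train_dataset, test_dataset, output_dir="distilbert-sentiment140"):
    """Assemble a Trainer for binary sentiment classification (illustrative settings)."""
    # Imported lazily so the metric helpers above work without transformers.
    from transformers import (
        DistilBertForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    )
    args = TrainingArguments(
        output_dir=output_dir,            # hypothetical checkpoint directory
        num_train_epochs=2,               # assumed, not tuned
        per_device_train_batch_size=16,   # assumed, not tuned
        per_device_eval_batch_size=32,
        learning_rate=2e-5,
        weight_decay=0.01,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset.rename_column("target", "labels"),
        eval_dataset=test_dataset.rename_column("target", "labels"),
        compute_metrics=compute_metrics,
    )
```

With the splits created earlier, usage would look like `trainer = build_trainer(train_dataset, test_dataset)` followed by `trainer.train()`.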

## 4. Evaluation
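No evaluation results are recorded in this notebook yet. As a sketch of what could be reported for a balanced binary task like this one, the snippet below derives accuracy, precision, recall, and F1 from predicted and true labels in plain Python. `binary_report` is a hypothetical helper and the label lists are toy values, not model output.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_report(y_true, y_pred))
```

Given a fitted `Trainer` instance (assumed here to be named `trainer`), the predicted labels could be taken as the argmax over the logits returned by `trainer.predict(...)`.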

## 1. Loading Data