This is the setup for fine-tuning the BERT uncased model on the Sentiment140 dataset. I could have gone with a classic ML model (TF-IDF with Logistic Regression or Naive Bayes) for quick training, but I wanted a bit of a challenge.
Make sure the dependencies are installed:
#pip install transformers datasets torch pandas scikit-learn

In [None]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

# Sentiment140 is sorted by label (negatives first), so the first 100k rows are all negative.
# Load the whole file, then sample 100k rows for faster training on Colab.
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                  encoding="latin-1",
                  names=["target", "id", "date", "flag", "user", "text"]
).sample(100000, random_state=42)

# map the labels. Target: 0 is negative, 4 is positive. remap to 0/1 binary 
df["label"]= df["target"].replace({0:0, 4:1})

# Converting to HuggingFace dataset
dataset = Dataset.from_pandas(df[["text", "label"]])

# Split between testing and training sets
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_ds = dataset["train"]
test_ds = dataset["test"]

Now, I'm going to tokenize the training and testing datasets. This means breaking the raw text down into smaller, fundamental units (tokens) that the model can work with. For text like the Sentiment140 tweets, it means splitting words into subwords and mapping each one to a numerical ID. Below is the full manual boilerplate, which I had to look up; the HuggingFace version further down is only about 8 lines, lol.

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

#Load the dataset
df = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding="latin-1", header=None)
df = df[[0, 5]]  # sentiment label = col 0, tweet = col 5
df.columns = ["label", "text"]

#Convert labels: 0 = negative, 4 = positive → map to binary
df['label'] = df['label'].map({0: 0, 4: 1})

#Sample smaller dataset if needed
df = df.sample(100000, random_state=42)

#Train-test split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'], df['label'], test_size=0.1, random_state=42
)

#Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

#Tokenize
train_encodings = tokenizer(
    list(train_texts),
    truncation=True,
    padding=True,
    max_length=64
)

val_encodings = tokenizer(
    list(val_texts),
    truncation=True,
    padding=True,
    max_length=64
)

#Wrap into torch Dataset
import torch

class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels.iloc[idx])
        return item

train_dataset = TweetDataset(train_encodings, train_labels)
val_dataset = TweetDataset(val_encodings, val_labels)

print(train_dataset[0])  # preview one sample
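To make the "words into subwords into numbers" idea concrete, here is a toy sketch of greedy longest-match subword splitting, in the spirit of the WordPiece scheme BERT uses. TOY_VOCAB, wordpiece and encode are made up for illustration; the real bert-base-uncased vocabulary is far larger, uses different IDs, and the real tokenizer also handles punctuation and other details this skips.

```python
# Toy greedy longest-match-first subword tokenizer (WordPiece-style).
# TOY_VOCAB is a hypothetical vocabulary, NOT the real BERT vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "love": 5, "this": 6, "phone": 7,
    "un": 8, "##believ": 9, "##able": 10,
}

def wordpiece(word, vocab):
    """Split one lowercase word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap with [CLS] / [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("I love this unbelievable phone", TOY_VOCAB)
print(tokens)
print(ids)
```

The word "unbelievable" isn't in the toy vocabulary, so it gets split into "un", "##believ" and "##able", which is how a subword tokenizer copes with words it has never seen whole.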

More Notes:
The function below takes the text column from the dataset, pads every tweet to the model's max sequence length (512 for BERT, since no max_length is passed), and truncates tweets longer than that limit. Each row becomes a dictionary with input_ids (the token IDs) and attention_mask (1 for real tokens, 0 for padding).

Then the function is applied to the entire dataset. batched=True runs it on chunks of rows instead of one row at a time, to save time.

Then the dataset is turned into PyTorch tensors (multidimensional arrays) with only the columns the model needs: input_ids (the tokenized tweet), attention_mask (which positions are real and which are padding) and label (the sentiment class, 0 or 1, as remapped earlier). This is the PyTorch format.
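A minimal sketch of what padding, truncation and the attention mask amount to, using plain Python lists. pad_batch is a hypothetical helper and the token IDs are placeholders; the real tokenizer does this internally (and keeps the final [SEP] token when truncating, which this sketch does not).

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Truncate or pad each ID list to max_length and build the matching mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                 # truncation: cut anything past the limit
        n_pad = max_length - len(ids)          # how many padding slots are needed
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder IDs: one short "tweet" and one that is too long for max_length=6.
batch = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 12, 102]], max_length=6)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row comes out the same length, which is what lets the batch be stacked into a single tensor.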

For the Sentiment140 dataset, the overall recipe is: clean up the tweets (get rid of the weird characters, hashtags, mentions), tokenize each tweet to get the input_ids and attention masks, then pair the tokenized tweets with their labels (0=negative, 1=positive). Simple! Note that the code in this notebook skips the cleanup step and tokenizes the raw text.
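The cleanup step could look something like this sketch. clean_tweet is a hypothetical helper, and the choices in it (drop URLs and mentions, keep the hashtag word but drop the "#") are my assumptions, not something the dataset requires.

```python
import re

def clean_tweet(text):
    """Strip URLs and @mentions, unwrap hashtags, and tidy whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop @mentions
    text = re.sub(r"#(\w+)", r"\1", text)     # keep the hashtag word, drop the '#'
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()

cleaned = clean_tweet("@bob I love this!! #awesome http://t.co/xyz")
print(cleaned)
```

With a pandas DataFrame this would be applied as df["text"].apply(clean_tweet) before tokenizing.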

In [None]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

train_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_ds.set_format("torch", columns=["input_ids", "attention_mask", "label"])

Now, I'm going to load the model and pray for the best. Pretty self-explanatory.
The import is the model class set up for the classification task at hand.
from_pretrained loads the pretrained weights from bert-base-uncased (uncased means the text is lowercased before tokenization, so "English" and "english" are treated the same; definition from the HuggingFace site).
num_labels=2 tells the model that this is a binary classification task. I could set it to 3 if I wanted a neutral class, but the training data would then need neutral examples too.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Now, train the model on the 100k rows of data. This should take about 2 hours at most on the free tier of Google Colab, which gives you a Tesla T4 GPU with about 16GB of VRAM. You get more with paid tiers, of course.

Note: An epoch is a complete pass through the training dataset. 2 epochs means the model will pass through, or "see", every training tweet twice (90k tweets here, since 10% of the 100k rows are held out for testing). Each pass helps the model "learn" a bit more. In my case, 2 passes is enough; any more risks overfitting (memorizing the training data instead of generalizing from it). On the free tier of Colab, 2 or 3 passes is plenty.
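The arithmetic behind "2 epochs" can be sketched like this, using the numbers from this notebook (100k rows, 10% test split, batch size 16, logging every 500 steps) and assuming a single GPU with no gradient accumulation.

```python
import math

total_rows = 100_000
test_rows = total_rows // 10          # 10% held out by train_test_split
train_rows = total_rows - test_rows   # rows the model actually trains on
batch_size = 16
epochs = 2
logging_steps = 500

steps_per_epoch = math.ceil(train_rows / batch_size)  # one weight update per batch
total_steps = steps_per_epoch * epochs
log_entries = total_steps // logging_steps            # how many loss readings get logged

print(train_rows, steps_per_epoch, total_steps, log_entries)
```

So each epoch is 5,625 weight updates, and the whole run is 11,250.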

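Note: eval_strategy="epoch" (named evaluation_strategy in older transformers versions) tells HuggingFace when to evaluate on the test set. "epoch" means evaluate once at the end of each epoch (pass); "steps" means evaluate every eval_steps training steps. The 500 in my config is logging_steps, which controls how often the training loss is logged, not how often evaluation runs. Evaluating regularly helps track whether the model is improving or overfitting.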

Note: per_device_train_batch_size is "how many tweets does the model read before it updates the weights?" In this case it's 16 tweets. If I set it higher, I need more VRAM; if I set it too small, training takes longer. There's no formula tying batch size to GB of VRAM (16 on a 16GB card is a coincidence); the usual approach is to pick the largest batch size that doesn't run out of memory.

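Note: batch size x sequence length x model size roughly determines how much VRAM I will need. BERT base (110M parameters) with sequence length 128 and batch size 16 fits on a T4. You could try a batch size of 64, but you may run out of memory (OOM). 128 is a practical sequence length for tweets on this GPU; 512 is BERT's actual maximum and is possible, but good luck. Keep in mind the tokenize function above pads to 512 because it doesn't pass max_length; adding max_length=128 there would match these numbers.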
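A rough back-of-envelope for the fixed part of that memory. It assumes fp32 training with an Adam-style optimizer that keeps two extra values per parameter, and no mixed precision. It deliberately leaves out activation memory, which is the part that grows with batch size and sequence length, so treat it as a lower bound, not a real measurement.

```python
params = 110_000_000      # approximate parameter count of BERT base
bytes_per_value = 4       # fp32

weights_gb = params * bytes_per_value / 1e9
# Assumption: weights + gradients + two optimizer moment estimates per parameter.
training_state_gb = 4 * weights_gb

# Token positions per batch, which activation memory scales with (at least linearly).
tokens_at_128 = 16 * 128
tokens_at_512 = 16 * 512

print(f"weights: {weights_gb:.2f} GB, weights+grads+optimizer: {training_state_gb:.2f} GB")
print(f"tokens per batch: {tokens_at_128} at length 128 vs {tokens_at_512} at length 512")
```

Padding to 512 instead of 128 means four times as many token positions per batch, which is why sequence length matters so much for whether a batch fits.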

Note: if you get the OOM (out of memory) message, you can try clearing the GPU memory cache without restarting the entire runtime.
import torch
torch.cuda.empty_cache()

Note: weight decay is a regularization trick to prevent overfitting. 

Note: can use Tensorboard to track training logs

Note: the Trainer block wraps all of this into a HuggingFace Trainer object. It handles the training loop, eval, checkpoint saving, logging and GPU handling, and saves me from writing more lines of code. Thank gosh.

So, what happens? Each training step loads a batch of tweets, runs them through BERT, computes the loss (prediction vs label), backpropagates, updates the weights and then logs progress; an epoch is just that repeated until every batch has been seen. Backpropagation is an algorithm that uses the chain rule to compute the gradient of the loss function with respect to the network's weights and biases. It's also known as the backward pass, and it lets the network adjust its parameters to reduce errors by spreading the error from the output layer backward to the input layer. Put it in reverse, Terry, to learn from errors and give better predictions. The Trainer is one of the coolest things: a wrapper around PyTorch so I don't have to write the dreaded manual loop.
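For a feel of what the dreaded manual loop does, here is a toy version in plain Python: a one-weight logistic regression instead of BERT, with made-up data and a batch size of 1. The steps are the same ones listed above (forward pass, loss, gradient via the chain rule, weight update); everything else about it is a simplification.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up data: (feature, label). Negative features are class 0, positive are class 1.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0   # the model's only parameters
lr = 0.5          # learning rate
losses = []

for epoch in range(20):                    # one epoch = one pass over all the data
    epoch_loss = 0.0
    for x, y in data:                      # one step per example (batch size 1)
        p = sigmoid(w * x + b)             # forward pass: predicted probability
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy
        dz = p - y                         # backward pass: d(loss)/d(w*x + b)
        grad_w, grad_b = dz * x, dz        # chain rule down to each parameter
        w -= lr * grad_w                   # weight update
        b -= lr * grad_b
        epoch_loss += loss
    losses.append(epoch_loss / len(data))  # "logging": average loss per epoch

print(f"first epoch loss {losses[0]:.3f}, last epoch loss {losses[-1]:.3f}, w = {w:.2f}")
```

The average loss drops from the first epoch to the last as w grows, which is the whole point: each pass nudges the parameters toward fewer errors.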

Note: Weight decay. A type of regularization (L2) with the goal of preventing overfitting. Overfitting is when the model memorizes training data instead of learning the general pattern(s) from it. Weight decay adds a penalty term to the loss function, so the model prefers simpler solutions with smaller weights. Every update, it shrinks the weights towards zero. Loss = Original Loss + lambda * sum of squared weights. lambda in this equation is the regularization strength (a hyperparameter). It's a fine-tuning thing. In my code, the 0.01 represents a small pull towards zero with each update. It's a typical value. It makes large weights costly, so the model goes with the generalizations that aren't so costly.
new weight = current weight - learning rate * (gradient of loss + lambda * weight), with the constant factor from differentiating the square folded into lambda
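The update rule above, as a tiny sketch. sgd_step is a hypothetical helper using plain gradient descent; the Trainer's default optimizer is AdamW, which applies the decay a bit differently, so this only illustrates the "pull towards zero". The learning rate is exaggerated (0.1 instead of 2e-5) so the shrinkage is visible.

```python
def sgd_step(weight, grad, lr, weight_decay):
    """One plain gradient-descent update with an L2-style weight decay term."""
    return weight - lr * (grad + weight_decay * weight)

w = 1.0
for _ in range(100):
    # Gradient of the loss set to 0 to isolate the effect of the decay term.
    w = sgd_step(w, grad=0.0, lr=0.1, weight_decay=0.01)

print(f"weight after 100 decay-only steps: {w:.4f}")
```

With no gradient pushing back, each step multiplies the weight by (1 - lr * lambda) = 0.999, so after 100 steps it has drifted from 1.0 to roughly 0.905. During real training the loss gradient pushes back, and only weights that earn their keep stay large.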

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
     output_dir="./results", 
     eval_strategy="epoch", 
     learning_rate=2e-5,
     per_device_train_batch_size=16,
     per_device_eval_batch_size=16,
     num_train_epochs=2,
     weight_decay=0.01,
     logging_dir="./logs",
     logging_steps=500,
     save_strategy="epoch"
)

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}

trainer = Trainer(
     model=model, 
     args=training_args, 
     train_dataset=train_ds, 
     eval_dataset=test_ds, 
     tokenizer=tokenizer, 
     compute_metrics=compute_metrics
)

trainer.train()

Now, save the model and the tokenizer.

In [None]:
# Raw string so the backslashes in the Windows path aren't read as escape sequences
save_dir = r"C:\Users\khowi\Desktop\SentimentAn/bert-sentiment"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)