### Fine-Tuning GPT-2 for Sentiment Analysis

This guide demonstrates how to fine-tune a pre-trained GPT-2 model for sentiment classification using the Hugging Face `transformers` library. The process involves:

1. **Loading and Tokenizing Data**:
    - We utilize the `mteb/tweet_sentiment_extraction` dataset, which contains tweets labeled for sentiment analysis.
    - The dataset is tokenized with GPT-2's tokenizer. Because GPT-2 ships without a padding token, the `eos_token` is reused as the `pad_token` so that every tweet can be padded to a common length.
2. **Model Setup**:
    - GPT-2 is adapted for sequence classification by adding a classification head with three sentiment labels (positive, neutral, negative).
    - The model is configured to use Apple Silicon's Metal Performance Shaders (MPS) for GPU acceleration if available, enhancing training performance on M1/M2 Macs.
3. **Training Configuration**:
    - We define training arguments including the number of epochs, batch size, learning rate, and strategies for logging, evaluation, and checkpoint saving.
    - A custom `compute_metrics` function is implemented to evaluate the model's performance using accuracy, F1 score (weighted for class imbalance), and AUC-ROC (One-vs-Rest) metrics.
4. **Model Training**:
    - The `Trainer` class is used to manage the training loop, applying the defined configuration and metrics.
    - A subset of the dataset (the first 1,000 training examples and the first 200 test examples) is used for quick iteration, so that training and evaluation finish in minutes.
5. **Evaluation**:
    - After training, the model’s performance is assessed on a test set. The metrics computed provide insights into the model’s accuracy, its balance between precision and recall (F1 score), and its ability to distinguish between different sentiment classes (AUC-ROC).

In [15]:
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from datasets import load_dataset
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from transformers import (
    GPT2ForSequenceClassification,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)


In [2]:
# Load the dataset, then instantiate the tokenizer and the model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Pad or truncate every tweet to exactly 128 tokens
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# GPT-2 with a freshly initialized 3-way classification head
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
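
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what the tokenizer does to a single list of token IDs. The helper `pad_or_truncate` is hypothetical (it is not part of `transformers`), and the sketch assumes right-side padding, which matches the tokenized output shown later in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper that mimics padding="max_length" plus truncation=True
    with right-side padding; it is not a transformers function.
    """
    kept = token_ids[:max_length]             # truncation
    n_pad = max_length - len(kept)            # how many pad tokens to append
    input_ids = kept + [pad_id] * n_pad       # right-side padding
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Token IDs of the first training tweet, as printed later in this notebook.
first_tweet = [314, 63, 67, 423, 7082, 11, 611, 314, 547, 1016]
EOS_ID = 50256  # GPT-2's <|endoftext|> ID, reused here as the pad ID

input_ids, attention_mask = pad_or_truncate(first_tweet, max_length=128, pad_id=EOS_ID)
print(len(input_ids), sum(attention_mask))
```

The attention mask is what lets the model ignore the 118 padding positions even though they share an ID with the genuine end-of-sequence token.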


In [3]:
# Set the device to "mps" (Metal Performance Shaders) if available, which allows
# PyTorch to leverage the GPU on an Apple Silicon Mac (e.g., M1, M2 chip).
# If MPS is not available, default to using the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=3, bias=False)
)

In [4]:
small_train_dataset = tokenized_datasets["train"].select(range(1000))  # Select a smaller subset for training
small_test_dataset = tokenized_datasets["test"].select(range(200))  # Select an even smaller subset for testing

In [5]:
# Inspect the first few examples in the dataset
for i in range(5):
    print(dataset["train"][i])


{'id': 'cb774db0d1', 'text': ' I`d have responded, if I were going', 'label': 1, 'label_text': 'neutral'}
{'id': '549e992a42', 'text': ' Sooo SAD I will miss you here in San Diego!!!', 'label': 0, 'label_text': 'negative'}
{'id': '088c60f138', 'text': 'my boss is bullying me...', 'label': 0, 'label_text': 'negative'}
{'id': '9642c003ef', 'text': ' what interview! leave me alone', 'label': 0, 'label_text': 'negative'}
{'id': '358bd9e861', 'text': ' Sons of ****, why couldn`t they put them on the releases we already bought', 'label': 0, 'label_text': 'negative'}


In [6]:
# Inspect the tokenization of a few examples
for i in range(5):
    print(tokenized_datasets["train"][i])


{'id': 'cb774db0d1', 'text': ' I`d have responded, if I were going', 'label': 1, 'label_text': 'neutral', 'input_ids': [314, 63, 67, 423, 7082, 11, 611, 314, 547, 1016, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'att

In [7]:
# Check if padding and truncation are applied correctly
print(tokenizer.decode(tokenized_datasets["train"][0]["input_ids"]))


 I`d have responded, if I were going<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|

In [8]:
# GPT-2 has no padding token by default, and GPT2ForSequenceClassification needs one
# whenever the batch size is greater than 1. Reuse the eos_token (end-of-sequence
# token) as the padding token. This was already set on the tokenizer before
# tokenization in cell [2]; it is repeated here so the cell stands on its own.
tokenizer.pad_token = tokenizer.eos_token

# Tell the model which token ID represents padding in the input sequences
model.config.pad_token_id = tokenizer.pad_token_id
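
Why the model needs `pad_token_id`: according to the `transformers` documentation, `GPT2ForSequenceClassification` classifies from the hidden state of the last non-padding token in each row, so it must know which ID counts as padding. The sketch below illustrates that lookup in plain Python. The function `last_non_pad_index` is a hypothetical stand-in for the library's internal tensor logic, not its actual code.

```python
def last_non_pad_index(input_ids, pad_id):
    """Index of the rightmost token that is not padding.

    Hypothetical illustration only; returns -1 for an all-padding row,
    which is a convention chosen for this sketch.
    """
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_id:
            return i
    return -1

EOS_ID = 50256  # used as both end-of-sequence and padding ID in this notebook

# First training tweet: 10 real tokens followed by 118 padding tokens.
row = [314, 63, 67, 423, 7082, 11, 611, 314, 547, 1016] + [EOS_ID] * 118
idx = last_non_pad_index(row, EOS_ID)
print(idx, row[idx])
```

One consequence of reusing the EOS token for padding is that a genuine trailing `<|endoftext|>` cannot be told apart from padding by ID alone.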


In [9]:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to probabilities
    probs = F.softmax(torch.tensor(logits), dim=-1).numpy()
    # Convert logits to predicted class indices
    predictions = np.argmax(probs, axis=-1)

    # Calculate accuracy
    accuracy = accuracy_score(labels, predictions)

    # Calculate F1 score (weighted to handle class imbalance)
    f1 = f1_score(labels, predictions, average='weighted')

    # Calculate AUC (One-vs-Rest for multi-class classification)
    auc = roc_auc_score(labels, probs, multi_class='ovr', average='weighted')

    return {
        'accuracy': accuracy,
        'f1': f1,
        'auc': auc
    }
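
As a small illustration of the first two steps in `compute_metrics`, the sketch below applies a hand-written softmax to one made-up logit vector using only the standard library. The logit values are invented for the example. It also shows that taking the argmax of the probabilities gives the same class as taking the argmax of the raw logits; the probabilities are only strictly needed for the AUC computation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

logits = [2.0, 0.5, -1.0]   # made-up scores for the three sentiment classes
probs = softmax(logits)

print(probs)
print(argmax(probs), argmax(logits))  # softmax preserves the ordering
```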


In [10]:
training_args = TrainingArguments(
    output_dir='../output',          # Output directory for model checkpoints and logs
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=4,   # Batch size for training
    per_device_eval_batch_size=4,    # Batch size for evaluation
    warmup_steps=500,                # Number of steps to perform learning rate warmup
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=10,                # Log every 10 steps
    evaluation_strategy="epoch",     # Evaluate after each epoch (newer transformers releases name this eval_strategy)
    save_strategy="epoch",           # Save checkpoints after each epoch
    learning_rate=5e-5               # Learning rate
)
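
To see how `warmup_steps=500` interacts with the rest of this configuration, the sketch below computes the learning rate by hand. It assumes the `Trainer`'s default linear schedule (linear warmup followed by linear decay to zero), a single device, and the 1,000-example training subset selected earlier. The function `linear_warmup_lr` is a hypothetical re-implementation for illustration, not a `transformers` API.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=750):
    """Learning rate at a given optimizer step: linear warmup, then linear decay.

    Hypothetical re-implementation of a linear-with-warmup schedule.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

steps_per_epoch = 1000 // 4        # 1,000 training examples, batch size 4
total_steps = steps_per_epoch * 3  # 3 epochs

for step in (10, 100, 500, 510, 750):
    print(step, linear_warmup_lr(step, total_steps=total_steps))
```

Under these assumptions the warmup covers two thirds of training, so the full learning rate of `5e-5` is reached only at the end of epoch 2. Steps 10 and 510 should line up with the learning rates logged at epochs 0.04 and 2.04 in the training output further down.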



In [11]:
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_test_dataset,
   compute_metrics=compute_metrics,
)

trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)


{'loss': 3.3404, 'grad_norm': 97.77446746826172, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.04}
{'loss': 3.492, 'grad_norm': 231.92886352539062, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.08}
{'loss': 3.049, 'grad_norm': 231.93826293945312, 'learning_rate': 3e-06, 'epoch': 0.12}
{'loss': 2.792, 'grad_norm': 162.41400146484375, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.16}
{'loss': 2.5212, 'grad_norm': 103.83672332763672, 'learning_rate': 5e-06, 'epoch': 0.2}
{'loss': 2.1856, 'grad_norm': 80.27428436279297, 'learning_rate': 6e-06, 'epoch': 0.24}
{'loss': 2.098, 'grad_norm': 156.78529357910156, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.28}
{'loss': 1.4273, 'grad_norm': 41.621849060058594, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.32}
{'loss': 1.326, 'grad_norm': 46.01503372192383, 'learning_rate': 9e-06, 'epoch': 0.36}
{'loss': 1.1411, 'grad_norm': 31.84248924255371, 'learning_rate': 1e-05, 'epoch': 0.4}
{'loss': 1.4985, 'grad_norm': 90.465744

{'eval_loss': 1.2688579559326172, 'eval_accuracy': 0.405, 'eval_f1': 0.32578785514713887, 'eval_auc': 0.7050442672888715, 'eval_runtime': 4.3264, 'eval_samples_per_second': 46.228, 'eval_steps_per_second': 11.557, 'epoch': 1.0}
{'loss': 1.0369, 'grad_norm': 41.403263092041016, 'learning_rate': 2.6000000000000002e-05, 'epoch': 1.04}
{'loss': 1.0068, 'grad_norm': 25.508134841918945, 'learning_rate': 2.7000000000000002e-05, 'epoch': 1.08}
{'loss': 0.8381, 'grad_norm': 13.556118965148926, 'learning_rate': 2.8000000000000003e-05, 'epoch': 1.12}
{'loss': 0.8538, 'grad_norm': 26.07984161376953, 'learning_rate': 2.9e-05, 'epoch': 1.16}
{'loss': 0.9212, 'grad_norm': 18.82402801513672, 'learning_rate': 3e-05, 'epoch': 1.2}
{'loss': 1.065, 'grad_norm': 92.01287078857422, 'learning_rate': 3.1e-05, 'epoch': 1.24}
{'loss': 0.9027, 'grad_norm': 23.44768714904785, 'learning_rate': 3.2000000000000005e-05, 'epoch': 1.28}
{'loss': 0.7682, 'grad_norm': 4.753272533416748, 'learning_rate': 3.3e-05, 'epoch':

{'eval_loss': 0.6198831796646118, 'eval_accuracy': 0.755, 'eval_f1': 0.755, 'eval_auc': 0.8857596214919955, 'eval_runtime': 13.3552, 'eval_samples_per_second': 14.975, 'eval_steps_per_second': 3.744, 'epoch': 2.0}
{'loss': 0.573, 'grad_norm': 49.29901123046875, 'learning_rate': 4.8e-05, 'epoch': 2.04}
{'loss': 0.5306, 'grad_norm': 50.19501876831055, 'learning_rate': 4.600000000000001e-05, 'epoch': 2.08}
{'loss': 0.4739, 'grad_norm': 5.950389385223389, 'learning_rate': 4.4000000000000006e-05, 'epoch': 2.12}
{'loss': 0.6368, 'grad_norm': 13.032154083251953, 'learning_rate': 4.2e-05, 'epoch': 2.16}
{'loss': 0.7921, 'grad_norm': 86.68267822265625, 'learning_rate': 4e-05, 'epoch': 2.2}
{'loss': 0.506, 'grad_norm': 18.818553924560547, 'learning_rate': 3.8e-05, 'epoch': 2.24}
{'loss': 0.5022, 'grad_norm': 24.89988899230957, 'learning_rate': 3.6e-05, 'epoch': 2.28}
{'loss': 0.6122, 'grad_norm': 10.987289428710938, 'learning_rate': 3.4000000000000007e-05, 'epoch': 2.32}
{'loss': 0.6347, 'grad_n

{'eval_loss': 0.680131196975708, 'eval_accuracy': 0.755, 'eval_f1': 0.7564746934065115, 'eval_auc': 0.8851832523170653, 'eval_runtime': 6.0848, 'eval_samples_per_second': 32.869, 'eval_steps_per_second': 8.217, 'epoch': 3.0}
{'train_runtime': 406.9496, 'train_samples_per_second': 7.372, 'train_steps_per_second': 1.843, 'train_loss': 1.000835683186849, 'epoch': 3.0}


{'eval_loss': 0.680131196975708, 'eval_accuracy': 0.755, 'eval_f1': 0.7564746934065115, 'eval_auc': 0.8851832523170653, 'eval_runtime': 6.9074, 'eval_samples_per_second': 28.954, 'eval_steps_per_second': 7.239, 'epoch': 3.0}



### Interpreting the Results:

- **`eval_loss`:** The cross-entropy loss on the evaluation set; lower is better. Here it drops from 1.27 after epoch 1 to 0.62 after epoch 2, then rises slightly to 0.68 after epoch 3.

- **`eval_accuracy`:** The fraction of evaluation examples whose predicted label matches the true label. The final value of 0.755 means that 151 of the 200 test tweets were classified correctly.

- **`eval_f1`:** The F1 score is the harmonic mean of precision (the proportion of predictions for a class that are correct) and recall (the proportion of that class's examples that were found). Because this is a three-class task, `compute_metrics` computes F1 per class and averages the results weighted by class frequency, which makes the score more informative than accuracy alone when classes are imbalanced. The final value is 0.756.

- **`eval_auc`:** The AUC-ROC score (Area Under the Receiver Operating Characteristic Curve) measures how well the predicted probabilities rank examples. With the One-vs-Rest (OvR) strategy, each class in turn is treated as the positive class against the other two, and the per-class scores are averaged. A score of 0.5 corresponds to random ranking and 1.0 to perfect separation. The final value is 0.885.

- **`epoch`:** Shows the number of training passes completed. For example, `epoch = 3.0` indicates that the model has seen the 1,000-example training subset three times.

Together, these metrics describe how often the model is right (accuracy), how well it balances precision and recall across classes (F1), and how well its probabilities separate the classes (AUC-ROC).
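
To make the weighted F1 score concrete, here is a minimal standard-library sketch that computes it by hand on six made-up labels. It is intended to mirror what `f1_score(..., average='weighted')` computes, but the function `weighted_f1` and the toy data are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)   # number of true examples per class
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)   # equals 2PR / (P + R)
        score += (n / total) * f1
    return score

# Made-up labels: one class-0 example is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

print(weighted_f1(y_true, y_pred))
```

In this toy case classes 0 and 1 each get an F1 of 0.8 and class 2 gets 1.0, and the weights 3/6, 2/6, and 1/6 reflect how often each class occurs.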

### Next Steps:

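- Consider training for more epochs while monitoring `eval_loss`, which already rose slightly between epochs 2 and 3.
- Experiment with different hyperparameters, such as the learning rate, the batch size, or `warmup_steps`.
- Train on more of the dataset than the 1,000-example subset, or refine the dataset.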

### Model Comparison Using AUC-ROC and F1-Score

In this analysis, we compare the performance of three different machine learning models—Logistic Regression, Support Vector Machine (SVM), and Random Forest—on a sentiment classification task. Each model is trained on embeddings generated using Word2Vec, a method that converts text into vector representations by capturing the semantic relationships between words.

### Steps Involved:
1. **Word2Vec Embeddings**: 
   - We trained a Word2Vec model on the 1,000-tweet training subset to generate 100-dimensional word embeddings.
   - Each tweet was split on whitespace into individual words, and the average of the word embeddings was used to create a fixed-size vector for each tweet.

2. **Model Training**:
   - Three classifiers were trained using the Word2Vec embeddings: Logistic Regression, SVM (with probability estimates), and Random Forest.
   - Each model's performance was evaluated using two key metrics: AUC-ROC (One-vs-Rest) and F1-Score.

3. **Evaluation Metrics**:
   - **AUC-ROC (OvR)**: This metric measures the model’s ability to distinguish between different classes. The One-vs-Rest (OvR) strategy was used to handle the multi-class nature of the task.
   - **F1-Score**: This score is calculated as the harmonic mean of precision and recall, offering a balanced measure of model performance, especially useful for imbalanced datasets.
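
The One-vs-Rest AUC can also be computed by hand, which helps clarify what the number means. The sketch below uses the ranking interpretation of AUC (the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counting half) and weights each class by its frequency, which is intended to mirror `roc_auc_score(..., multi_class='ovr', average='weighted')`. The function names and the four-example toy data are illustrative assumptions.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def weighted_ovr_auc(y_true, probs):
    """One-vs-Rest AUC, averaged with weights equal to class frequency."""
    total = len(y_true)
    result = 0.0
    for cls in sorted(set(y_true)):
        is_pos = [y == cls for y in y_true]
        cls_scores = [row[cls] for row in probs]   # probability of this class
        result += (sum(is_pos) / total) * binary_auc(cls_scores, is_pos)
    return result

# Made-up predicted probabilities for four examples and three classes.
y_true = [0, 0, 1, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]

print(weighted_ovr_auc(y_true, probs))
```

The pairwise loop is quadratic in the number of examples, which is fine for a 200-example test subset but is not how a library would compute it at scale.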


### Interpretation:
- **Logistic Regression**: Provides a baseline for comparison. Although it is simple, it often performs well when the classes are close to linearly separable in the embedding space.
- **SVM**: With its default RBF kernel, `SVC` can model non-linear decision boundaries, so it may perform better when the classes are not linearly separable.
- **Random Forest**: As an ensemble of decision trees, it can capture complex patterns and is less prone to overfitting than a single decision tree.

The comparison helps determine if the complexity and computational cost of more sophisticated models like SVM or Random Forest are justified compared to simpler models like Logistic Regression. By evaluating both AUC-ROC and F1-Score, we get a comprehensive understanding of how well each model distinguishes between classes and balances precision and recall. 

These insights guide the selection of the most appropriate model for deployment, considering the trade-offs between performance, interpretability, and computational efficiency.

In [20]:
# Train Word2Vec model on the dataset
sentences = [sentence.split() for sentence in small_train_dataset["text"]]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

def word2vec_embeddings(texts):
    """Average the Word2Vec vectors of each text's in-vocabulary words."""
    embeddings = []
    for text in texts:
        words = text.split()
        word_vecs = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
        if word_vecs:
            embeddings.append(np.mean(word_vecs, axis=0))
        else:
            # No known words (e.g., a test tweet with only unseen words): use a zero vector
            embeddings.append(np.zeros(word2vec_model.vector_size))
    return np.array(embeddings)

# Create Word2Vec embeddings for the training and test data
train_embeddings = word2vec_embeddings(small_train_dataset["text"])
test_embeddings = word2vec_embeddings(small_test_dataset["text"])



# Logistic Regression
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(train_embeddings, small_train_dataset["label"])
lr_probs = lr.predict_proba(test_embeddings)
lr_auc = roc_auc_score(small_test_dataset["label"], lr_probs, multi_class='ovr', average='weighted')

# SVM with probability estimates
svm = make_pipeline(StandardScaler(), SVC(probability=True))
svm.fit(train_embeddings, small_train_dataset["label"])
svm_probs = svm.predict_proba(test_embeddings)
svm_auc = roc_auc_score(small_test_dataset["label"], svm_probs, multi_class='ovr', average='weighted')

# Random Forest
rf = RandomForestClassifier(n_estimators=200)
rf.fit(train_embeddings, small_train_dataset["label"])
rf_probs = rf.predict_proba(test_embeddings)
rf_auc = roc_auc_score(small_test_dataset["label"], rf_probs, multi_class='ovr', average='weighted')


In [21]:
# Calculate F1-Score for each model
lr_f1 = f1_score(small_test_dataset["label"], lr.predict(test_embeddings), average='weighted')
svm_f1 = f1_score(small_test_dataset["label"], svm.predict(test_embeddings), average='weighted')
rf_f1 = f1_score(small_test_dataset["label"], rf.predict(test_embeddings), average='weighted')

# Print AUC-ROC OvR and F1-Score for each model
print(f'Logistic Regression AUC-ROC OvR: {lr_auc}, F1-Score: {lr_f1}')
print(f'SVM AUC-ROC OvR: {svm_auc}, F1-Score: {svm_f1}')
print(f'Random Forest AUC-ROC OvR: {rf_auc}, F1-Score: {rf_f1}')


Logistic Regression AUC-ROC OvR: 0.6599149069933241, F1-Score: 0.4895631457145218
SVM AUC-ROC OvR: 0.6450256918789293, F1-Score: 0.43086562860867944
Random Forest AUC-ROC OvR: 0.5974000421284594, F1-Score: 0.4127470563834201


Word2Vec's 100-dimensional word embeddings and the embeddings generated by a transformer model like GPT-2 are fundamentally different in terms of their architecture, contextual awareness, and the type of information they capture. Here’s a comparison to help understand the differences:

### 1. **Contextual Awareness**:
   - **Word2Vec**: Generates static embeddings. This means that each word has a single fixed vector representation, regardless of the context in which it appears. For example, the word "bank" will have the same embedding whether it's used in the context of a riverbank or a financial institution.
   - **Transformers (GPT-2)**: Produce contextual embeddings. The same word can have different embeddings depending on the surrounding words, allowing the model to capture the word's meaning based on the context. For example, "bank" would have different embeddings in "river bank" versus "investment bank."

### 2. **Dimensionality**:
   - **Word2Vec**: Typically uses a lower dimensionality (100 in this notebook; 300 is also common). Lower-dimensional embeddings are cheaper to compute and store, but each word gets only one small vector to encode all of its meanings.
   - **Transformers (GPT-2)**: Typically use higher-dimensional embeddings (768 dimensions for the base GPT-2 model used here). These representations are more expressive, partly because of the larger vector size and partly because they are computed by 12 stacked transformer blocks rather than looked up from a fixed table.

### 3. **Training Objectives**:
   - **Word2Vec**: Is trained using objectives like Skip-Gram or Continuous Bag of Words (CBOW), focusing on predicting words based on their neighbors or vice versa. It captures co-occurrence statistics of words.
   - **Transformers (GPT-2)**: Are trained with next-token prediction (causal language modeling). Self-attention lets each token's representation draw on all preceding tokens in the sequence (GPT-2 does not attend to later tokens), resulting in richer, contextually informed representations.
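
The difference between the two objectives is easiest to see in the training examples each one produces. The sketch below builds Skip-Gram style (center, context) pairs and causal language modeling (prefix, next token) pairs from one tweet. Both helpers are simplified illustrations: real Word2Vec training adds techniques such as negative sampling, and GPT-2 operates on byte-pair tokens rather than whitespace-separated words. Note also that, as far as I know, gensim's `Word2Vec` defaults to CBOW unless `sg=1` is passed, so the notebook's model likely uses CBOW rather than Skip-Gram.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def causal_lm_pairs(tokens):
    """(prefix, next_token) pairs: predict each token from everything before it."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

tokens = "my boss is bullying me".split()

for pair in skipgram_pairs(tokens, window=1):
    print("skip-gram:", pair)
for prefix, target in causal_lm_pairs(tokens):
    print("causal LM:", prefix, "->", target)
```

The Skip-Gram pairs only ever relate a word to its immediate neighbors, while the causal pairs condition on the full left-hand context, which is what makes the resulting representations context-dependent.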

### 4. **Use Cases**:
   - **Word2Vec**: Often used in simpler NLP tasks where context-specific meaning is less critical, or where computational resources are limited.
   - **Transformers (GPT-2)**: Preferred in complex NLP tasks like language generation, contextual understanding, and tasks requiring deep semantic comprehension.

### Conclusion:
While Word2Vec's 100-dimensional embeddings provide a quick and efficient way to represent words in vector space, they are not directly comparable to the representations produced by transformer models like GPT-2, which capture richer, context-dependent nuances. The results in this notebook are consistent with that: fine-tuned GPT-2 reached a weighted F1 of 0.756 and an AUC-ROC of 0.885 on the 200-tweet test subset, while the best Word2Vec-based classifier (Logistic Regression) reached 0.490 and 0.660. The comparison is not like-for-like, since GPT-2 starts from pre-trained weights whereas the Word2Vec model was trained from scratch on only 1,000 tweets. When deep contextual understanding matters, transformers are usually the stronger choice; Word2Vec can still be valuable where simplicity and speed are priorities.