In [None]:
!pip install transformers

In [29]:
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Deep learning is awesome!", return_tensors="pt")
outputs = model(**inputs, output_attentions=True, output_hidden_states=True)  # also return per-layer attention weights and hidden states

In [30]:
# Access the last hidden state from the outputs
last_hidden_state = outputs.last_hidden_state
print("Shape of the last hidden state:", last_hidden_state.shape)

# You can also access other outputs if available, for example:
attentions = outputs.attentions
hidden_states = outputs.hidden_states
print(attentions)

Shape of the last hidden state: torch.Size([1, 7, 768])
(tensor([[[[9.2010e-02, 7.0300e-02, 1.7699e-01, 6.6514e-02, 7.4351e-02,
           1.1171e-01, 4.0813e-01],
          [2.0479e-01, 4.6417e-02, 1.5348e-01, 7.7433e-02, 1.5218e-01,
           2.3405e-01, 1.3165e-01],
          [2.7185e-01, 6.2297e-02, 1.1724e-01, 7.6816e-02, 1.4196e-01,
           2.2617e-01, 1.0367e-01],
          [3.2929e-01, 9.4178e-02, 1.7418e-01, 3.7162e-02, 1.3396e-01,
           1.5358e-01, 7.7647e-02],
          [2.3691e-01, 7.4713e-02, 8.1912e-02, 9.7331e-02, 1.0050e-01,
           1.8106e-01, 2.2758e-01],
          [2.5490e-01, 1.1865e-01, 1.0832e-01, 7.6234e-02, 2.1038e-01,
           6.6618e-02, 1.6490e-01],
          [2.4854e-01, 8.5162e-02, 7.4471e-02, 9.1349e-02, 3.8234e-02,
           1.1857e-01, 3.4367e-01]],

         [[9.8199e-01, 2.7207e-03, 1.7563e-03, 3.6101e-03, 1.7300e-03,
           1.8096e-03, 6.3879e-03],
          [1.3182e-02, 6.1160e-02, 2.6447e-01, 2.2376e-02, 3.6175e-01,
           6.8
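
Each entry of `outputs.attentions` holds one layer's attention weights, and every row is a softmax distribution over the 7 tokens of the input, so it should sum to 1. A quick sanity check on the first printed row, with the values copied from the output above (the variable names are illustrative):

```python
import math

# First row of the first head, copied from the printed tensor above.
first_row = [9.2010e-02, 7.0300e-02, 1.7699e-01, 6.6514e-02,
             7.4351e-02, 1.1171e-01, 4.0813e-01]

row_sum = sum(first_row)

# Attention weights come out of a softmax, so each row should sum to ~1
# (up to the rounding of the printed values).
print(f"row sum = {row_sum:.4f}")
print("sums to 1:", math.isclose(row_sum, 1.0, abs_tol=1e-3))
```

Printing `attentions[0].shape` instead of the full tuple is usually a more readable way to inspect these tensors.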

# Task
Explain how to fine-tune a DistilBERT model for a text classification task and provide a code example.

## Prepare a dataset

### Subtask:
Load or create a small dataset for a text classification task. This dataset should contain text examples and their corresponding labels.


**Reasoning**:
Creating lists of texts and labels for a small text classification dataset.



In [31]:
texts = [
    "This is a great movie!",
    "I hated that book, it was terrible.",
    "The weather is nice today.",
    "Politics are complicated.",
    "This product is amazing, I love it!",
    "What a terrible experience.",
    "The news is depressing.",
    "Sports events are exciting.",
    "I feel so happy right now.",
    "This is the worst thing ever."
]

labels = [
    "positive",
    "negative",
    "neutral",
    "neutral",
    "positive",
    "negative",
    "neutral",
    "neutral",
    "positive",
    "negative"
]

print("Number of texts:", len(texts))
print("Number of labels:", len(labels))

Number of texts: 10
Number of labels: 10
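
Before splitting such a small dataset, it helps to check how many examples each class has. A minimal sketch using the same label list as above (`label_counts` is an illustrative name):

```python
from collections import Counter

# Same labels as in the cell above.
labels = [
    "positive", "negative", "neutral", "neutral", "positive",
    "negative", "neutral", "neutral", "positive", "negative",
]

# Count how many examples belong to each class.
label_counts = Counter(labels)
for name, count in sorted(label_counts.items()):
    print(f"{name}: {count}")
```

With three or four examples per class, any held-out split will leave some classes barely represented, which is worth keeping in mind when reading the evaluation metrics later.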


## Load the pre-trained model and tokenizer

### Subtask:
Load the DistilBERT model and tokenizer, similar to what you've done before.


**Reasoning**:
Import the necessary classes and instantiate the tokenizer and model for sequence classification.



In [32]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(set(labels)))

print("Tokenizer and model loaded successfully.")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizer and model loaded successfully.


## Prepare the data for training

### Subtask:
Tokenize the text data and format it into a suitable input format for the model (e.g., PyTorch tensors). Split the data into training and validation sets.


**Reasoning**:
Create a dictionary for label mapping, tokenize the text data, convert labels to tensors, and split the data into training and validation sets using scikit-learn.




In [33]:
from sklearn.model_selection import train_test_split

# 1. Create a dictionary mapping labels to integers
#    (sorted so the mapping is the same on every run; set order is not guaranteed)
unique_labels = sorted(set(labels))
label_map = {label: i for i, label in enumerate(unique_labels)}
numerical_labels = [label_map[label] for label in labels]

# 2. Tokenize the texts
tokenized_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# 3. Convert numerical labels to a PyTorch tensor
labels_tensor = torch.tensor(numerical_labels)

# 4. Split data into training and validation sets
input_ids_train, input_ids_val, attention_mask_train, attention_mask_val, labels_train, labels_val = train_test_split(
    tokenized_inputs["input_ids"],
    tokenized_inputs["attention_mask"],
    labels_tensor,
    test_size=0.2,
    random_state=42  # for reproducibility
)

print("Training input IDs shape:", input_ids_train.shape)
print("Validation input IDs shape:", input_ids_val.shape)
print("Training attention mask shape:", attention_mask_train.shape)
print("Validation attention mask shape:", attention_mask_val.shape)
print("Training labels shape:", labels_train.shape)
print("Validation labels shape:", labels_val.shape)

Training input IDs shape: torch.Size([8, 11])
Validation input IDs shape: torch.Size([2, 11])
Training attention mask shape: torch.Size([8, 11])
Validation attention mask shape: torch.Size([2, 11])
Training labels shape: torch.Size([8])
Validation labels shape: torch.Size([2])
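
The label-to-integer mapping should be reproducible across sessions, because a saved model's output indices only make sense with the mapping used during training. A minimal sketch, assuming sorted order for the mapping and an inverse map for decoding predictions (`label2id` and `id2label` are names chosen for this example):

```python
label_names = ["positive", "negative", "neutral", "neutral", "positive"]

# Sorting makes the mapping independent of set iteration order.
label2id = {name: i for i, name in enumerate(sorted(set(label_names)))}

# Inverse mapping, used to turn predicted class indices back into label names.
id2label = {i: name for name, i in label2id.items()}

encoded = [label2id[name] for name in label_names]
decoded = [id2label[i] for i in encoded]

print(label2id)
print(encoded)
print(decoded == label_names)
```

Keeping `id2label` next to the model makes it straightforward to report predictions as label names rather than integers.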


## Define training parameters

### Subtask:
Set up training parameters such as the number of epochs, learning rate, batch size, etc.


**Reasoning**:
Define the training parameters as instructed in the subtask.



In [37]:
# 1. Define the number of training epochs
num_epochs = 50

# 2. Define the learning rate for the optimizer
learning_rate = 5e-5

# 3. Define the batch size for training and evaluation
batch_size = 16

# 4. Define other relevant parameters (optional)
# For this simple example, we'll keep it basic.
# You could add parameters like weight_decay, warmup_steps, etc.

print(f"Number of epochs: {num_epochs}")
print(f"Learning rate: {learning_rate}")
print(f"Batch size: {batch_size}")

Number of epochs: 50
Learning rate: 5e-05
Batch size: 16
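
With 8 training examples and a batch size of 16, each epoch consists of a single batch, so the optimizer takes one step per epoch. The arithmetic, as a quick check (assuming the last partial batch is kept, which is the `DataLoader` default):

```python
import math

num_train_examples = 8   # size of the training split
batch_size = 16
num_epochs = 50

# A partial final batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
```

Because every step sees the full training set, the per-epoch "average" loss here is just the loss of that one batch.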


## Define the classification head

### Subtask:
Add a classification layer on top of the pre-trained model. This layer will take the model's output (e.g., the hidden state of the CLS token) and predict the class label.
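
No separate layer has to be written here: the warning printed when the model was loaded lists `pre_classifier` and `classifier` weights as newly initialized, which is the head that `AutoModelForSequenceClassification` adds. To illustrate what such a head computes, here is a minimal PyTorch sketch modelled on those layer names. The class name, dropout rate, and dummy input are assumptions for this example, and the library's own implementation may differ in detail.

```python
import torch
import torch.nn as nn


class SketchClassificationHead(nn.Module):
    """Illustrative head: first-token vector -> linear -> ReLU -> dropout -> linear."""

    def __init__(self, hidden_size=768, num_labels=3, dropout=0.2):
        super().__init__()
        self.pre_classifier = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        # Use the hidden state of the first ([CLS]) token as the sentence vector.
        cls_vector = last_hidden_state[:, 0]            # (batch, hidden_size)
        x = torch.relu(self.pre_classifier(cls_vector))
        x = self.dropout(x)
        return self.classifier(x)                       # (batch, num_labels)


head = SketchClassificationHead()
head.eval()  # disable dropout for a deterministic forward pass

# Random stand-in for DistilBERT's last hidden state: (batch, seq_len, hidden_size).
dummy_hidden = torch.randn(2, 7, 768)
logits = head(dummy_hidden)
print(logits.shape)
```

The output has one logit per class for each input sequence, which is what the training loop's loss and the evaluation's `argmax` operate on.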


## Define the training loop

### Subtask:
Implement a training loop that iterates over the training data, calculates the loss, performs backpropagation, and updates the model's weights.


**Reasoning**:
Implement the training loop as described in the instructions, including creating the DataLoader, optimizer, and iterating through epochs and batches to perform the forward pass, calculate loss, and perform backpropagation.



In [38]:
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# Create a TensorDataset
train_dataset = TensorDataset(input_ids_train, attention_mask_train, labels_train)

# Create a DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Instantiate an optimizer
optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

# Set the model to training mode
model.train()

# Check for GPU availability and move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch in train_dataloader:
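        # Move batch to device; the targets are named batch_labels so the
        # `labels` list defined earlier is not overwritten
        input_ids, attention_mask, batch_labels = (t.to(device) for t in batch)

        # Forward pass (the model returns the loss when labels are passed)
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)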
        loss = outputs.loss

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Print average loss for the epoch
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Average training loss: {avg_train_loss}")

print("Training complete.")

Epoch 1/50, Average training loss: 0.6761602163314819
Epoch 2/50, Average training loss: 0.573887050151825
Epoch 3/50, Average training loss: 0.46184229850769043
Epoch 4/50, Average training loss: 0.35568684339523315
Epoch 5/50, Average training loss: 0.30437955260276794
Epoch 6/50, Average training loss: 0.2529052793979645
Epoch 7/50, Average training loss: 0.19536809623241425
Epoch 8/50, Average training loss: 0.1546444594860077
Epoch 9/50, Average training loss: 0.13495655357837677
Epoch 10/50, Average training loss: 0.12131339311599731
Epoch 11/50, Average training loss: 0.10749314725399017
Epoch 12/50, Average training loss: 0.081902414560318
Epoch 13/50, Average training loss: 0.06726101785898209
Epoch 14/50, Average training loss: 0.06400955468416214
Epoch 15/50, Average training loss: 0.051929574459791183
Epoch 16/50, Average training loss: 0.044834237545728683
Epoch 17/50, Average training loss: 0.04621502012014389
Epoch 18/50, Average training loss: 0.04123428836464882
Epoch 

## Evaluate the fine-tuned model

### Subtask:
Evaluate the performance of the fine-tuned model on the validation set using appropriate metrics (e.g., accuracy, precision, recall).


**Reasoning**:
Evaluate the fine-tuned model on the validation set using accuracy, precision, recall, and F1-score.



In [39]:
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np

# 1. Create a TensorDataset for the validation data
val_dataset = TensorDataset(input_ids_val, attention_mask_val, labels_val)

# 2. Create a DataLoader for the validation dataset
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False) # Do not shuffle validation data

# 3. Set the model to evaluation mode
model.eval()

# 4. Initialize lists to store predicted labels and true labels
predicted_labels = []
true_labels = []

# 5. Iterate through the validation DataLoader
for batch in val_dataloader:
    # Move batch to device; the targets are named batch_labels so the
    # `labels` list defined earlier is not overwritten
    input_ids, attention_mask, batch_labels = (t.to(device) for t in batch)

    # 6. Inside the loop, use a torch.no_grad() context
    with torch.no_grad():
        # 7. Perform a forward pass through the model
        outputs = model(input_ids, attention_mask=attention_mask)

    # 8. Get the predicted logits from the model outputs
    logits = outputs.logits

    # 9. Determine the predicted class for each example
    predictions = torch.argmax(logits, dim=-1)

    # 10. Append the predicted labels and true labels
    predicted_labels.extend(predictions.cpu().numpy())
    true_labels.extend(batch_labels.cpu().numpy())

# 11. The lists now hold one entry per validation example, so no concatenation is needed

# Convert lists to numpy arrays for metric calculations
predicted_labels = np.array(predicted_labels)
true_labels = np.array(true_labels)

# 12. Calculate the accuracy of the model
accuracy = accuracy_score(true_labels, predicted_labels)

# 13. Calculate precision, recall, and F1-score
# We use zero_division=0 to avoid warnings/errors for classes with no true/predicted samples
precision, recall, f1_score, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='weighted', zero_division=0)

# 14. Print the calculated evaluation metrics
print("\nEvaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1-score (weighted): {f1_score:.4f}")


Evaluation Metrics:
Accuracy: 0.5000
Precision (weighted): 0.2500
Recall (weighted): 0.5000
F1-score (weighted): 0.3333
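
With only two validation examples, these numbers are consistent with one correct and one incorrect prediction. The sketch below works through the weighted metrics by hand for a hypothetical outcome in which both examples were predicted as the same class. The actual validation labels are not shown in the output, so the label values here are an assumption.

```python
# Hypothetical validation outcome: two examples, both predicted as class 0.
true_labels = [0, 1]
predicted_labels = [0, 0]


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1; undefined ratios count as 0."""
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)   # times the class was predicted
        support = sum(t == cls for t in y_true)     # times the class truly occurs
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support if support else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
precision, recall, f1 = weighted_scores(true_labels, predicted_labels)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1-score (weighted): {f1:.4f}")
```

Under this assumption the class that was never predicted contributes zero precision, recall, and F1, which is how an accuracy of 0.5 can come with a weighted precision of only 0.25.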


## Summary:

### Data Analysis Key Findings

*   A small dataset of 10 text examples with corresponding "positive", "negative", and "neutral" labels was created.
*   The DistilBERT tokenizer (`distilbert-base-uncased`) and a sequence classification model (`AutoModelForSequenceClassification`) were successfully loaded.
*   The text data was tokenized, numerical labels were created based on unique labels, and both were converted into PyTorch tensors.
*   The data was split into training (8 samples) and validation (2 samples) sets.
*   Training parameters were defined: 50 epochs, a learning rate of `5e-5`, and a batch size of 16.
*   The `AutoModelForSequenceClassification` model already includes the classification head, so no separate definition was required.
*   A training loop was implemented using a `DataLoader` and the `AdamW` optimizer. Average training loss fell from about 0.68 in epoch 1 to about 0.04 by epoch 18, the last epoch visible in the truncated log.
*   On the 2-sample validation set, the model reached an accuracy of 0.5000, a weighted precision of 0.2500, a weighted recall of 0.5000, and a weighted F1-score of 0.3333.

### Insights or Next Steps

*   Training loss dropped close to zero while validation accuracy stayed at 0.50, which suggests the model fit the 8 training examples without generalizing. With only 2 validation examples, each prediction moves accuracy by 50 percentage points, so these metrics are too coarse to support firm conclusions.
*   To improve performance, the next steps should involve using a significantly larger and more diverse dataset for training and validation. Additionally, consider hyperparameter tuning or using different pre-trained models.
