<a href="https://colab.research.google.com/github/karan2261/DistilBERT-Text-Classification/blob/main/DistilBERT_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Loading the trained model and evaluating the performance of the model on the test dataset

In [None]:
!pip install transformers



In [None]:
# Importing necessary libraries
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [None]:
# Loading the datasets
train_df = pd.read_csv("Train.csv")
val_df = pd.read_csv("val.csv")
test_df = pd.read_csv("Test.csv")

In [None]:
# Map training labels to numeric values
train_df["label"] = train_df["label"].map({"positive": 1, "negative": 0})

# Correcting test labels: the positive class is encoded as 4, so remap it to 1
test_df["label"] = test_df["label"].replace(4, 1)
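
Only `train_df` and `test_df` are remapped above, so it may be worth confirming that all three splits end up with the same label set before training. The helper below is a minimal sketch of such a check; `check_binary_labels` is a hypothetical name and the sample values are made up for illustration.

```python
def check_binary_labels(labels, name="split"):
    """Return the sorted distinct labels, raising if any value is not 0 or 1.

    A NaN (for example from an unmatched .map() key) also fails the check,
    because NaN never compares equal to 0 or 1.
    """
    distinct = set(labels)
    unexpected = {v for v in distinct if v not in (0, 1)}
    if unexpected:
        raise ValueError(f"{name}: unexpected label values {sorted(map(str, unexpected))}")
    return sorted(distinct)


# Illustrative values only, not taken from the CSV files.
print(check_binary_labels([0, 1, 1, 0], name="train"))  # [0, 1]
```

In the notebook this could be called as `check_binary_labels(val_df["label"], name="val")` right after the remapping cell, since a pandas Series can be iterated like a list.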

In [None]:
# DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")


In [None]:
# Tokenizing and encoding the datasets
def encode_data(data, max_length=128):
    return tokenizer(
        list(data["text"].values),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="tf"
    )

train_encodings = encode_data(train_df)
val_encodings = encode_data(val_df)
test_encodings = encode_data(test_df)

In [None]:
# Converting the labels to tensor format
train_labels = tf.convert_to_tensor(train_df["label"].values)
val_labels = tf.convert_to_tensor(val_df["label"].values)
test_labels = tf.convert_to_tensor(test_df["label"].values)

In [None]:
# Loading the model
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [None]:
# Compiling the model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])

In [None]:
# Training the model
model.fit(
    x=dict(train_encodings),
    y=train_labels,
    validation_data=(dict(val_encodings), val_labels),
    epochs=5,
    batch_size=64
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tf_keras.src.callbacks.History at 0x7c9d1e952fb0>

In [None]:
# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    x=dict(test_encodings),
    y=test_labels,
    batch_size=64
)

print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

Test Loss: 0.8312, Test Accuracy: 0.6944


## Using the val.csv dataset to fine-tune the model

In [None]:
# Fine-tuning the model on validation data using a lower learning rate
fine_tune_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
model.compile(optimizer=fine_tune_optimizer, loss=loss_fn, metrics=["accuracy"])

In [None]:
# Fine-tuning the model with validation dataset
fine_tune_history = model.fit(
    x=dict(val_encodings),
    y=val_labels,
    epochs=3,
    batch_size=64,
    verbose=2
)

Epoch 1/3
157/157 - 110s - loss: 0.6470 - accuracy: 0.7019 - 110s/epoch - 699ms/step
Epoch 2/3
157/157 - 91s - loss: 0.5526 - accuracy: 0.7360 - 91s/epoch - 581ms/step
Epoch 3/3
157/157 - 91s - loss: 0.5187 - accuracy: 0.7529 - 91s/epoch - 580ms/step


## Evaluate the fine-tuned model on test dataset

In [None]:
# Evaluate the fine-tuned model on the test dataset
test_loss, test_accuracy = model.evaluate(
    x=dict(test_encodings),
    y=test_labels,
    batch_size=64
)

print(f"Fine-Tuned Test Loss: {test_loss:.4f}, Fine-Tuned Test Accuracy: {test_accuracy:.4f}")

Fine-Tuned Test Loss: 0.5106, Fine-Tuned Test Accuracy: 0.7660


## Additional approach for Fine-tuning

In [None]:
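# Mark the last n top-level layers of the model as trainable.
# Keras layers are trainable by default and nothing above froze them, so this is likely a no-op here.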
n = 4
for layer in model.layers[-n:]:
    layer.trainable = True
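
Setting `trainable = True` only changes something for layers that were frozen beforehand, and nothing earlier in this notebook freezes a layer. The sketch below shows the usual freeze-then-unfreeze pattern on stand-in objects so that it runs without TensorFlow; `set_trainable_tail` and the layer names are hypothetical, though any object with a `trainable` attribute (such as a Keras layer) should behave the same way.

```python
from types import SimpleNamespace


def set_trainable_tail(layers, n):
    """Freeze every layer, then leave only the last n layers trainable."""
    layers = list(layers)
    tail_start = max(len(layers) - n, 0)
    for index, layer in enumerate(layers):
        layer.trainable = index >= tail_start
    # Report which layers will receive gradient updates.
    return [layer.name for layer in layers if layer.trainable]


# Stand-ins for Keras layers; the names are illustrative only.
layers = [SimpleNamespace(name=f"block_{i}", trainable=True) for i in range(6)]
print(set_trainable_tail(layers, n=2))  # ['block_4', 'block_5']
```

With a real Keras model, the model has to be recompiled after `trainable` is changed for the new setting to take effect, which is what the next cell does.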

In [None]:
# Recompile the model with a lower learning rate for fine-tuning
fine_tune_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
model.compile(optimizer=fine_tune_optimizer, loss=loss_fn, metrics=["accuracy"])

In [None]:
# Fine-tune the model
fine_tune_history = model.fit(
    x=dict(val_encodings),
    y=val_labels,
    epochs=3,
    batch_size=64,
    verbose=2
)

Epoch 1/3
157/157 - 91s - loss: 0.4474 - accuracy: 0.7934 - 91s/epoch - 580ms/step
Epoch 2/3
157/157 - 91s - loss: 0.4347 - accuracy: 0.8059 - 91s/epoch - 580ms/step
Epoch 3/3
157/157 - 91s - loss: 0.4227 - accuracy: 0.8098 - 91s/epoch - 580ms/step


In [None]:
# Evaluate the fine-tuned model on test data
test_loss, test_accuracy = model.evaluate(
    x=dict(test_encodings),
    y=test_labels,
    batch_size=64
)

print(f"Fine-Tuned Test Loss: {test_loss:.4f}, Fine-Tuned Test Accuracy: {test_accuracy:.4f}")

Fine-Tuned Test Loss: 0.4439, Fine-Tuned Test Accuracy: 0.8112
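
Accuracy alone does not show how the errors split between the two classes. As a minimal sketch, the function below turns two-class logits into predictions and computes precision, recall and F1 for the positive class in plain Python; `binary_report` is a hypothetical name and the logits are made-up values. The `precision_recall_fscore_support` function imported at the top of the notebook should give the same three numbers when called with `average="binary"`.

```python
def binary_report(logits, y_true):
    """Compute accuracy, precision, recall and F1 for class 1 from two-class logits."""
    # The predicted class is the index of the larger logit in each row.
    y_pred = [0 if row[0] >= row[1] else 1 for row in logits]
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    correct = sum(1 for p, t in pairs if p == t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up logits for four examples; columns are (negative, positive).
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]]
labels = [0, 1, 0, 1]
print(binary_report(logits, labels))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

In this notebook the logits would typically come from `model.predict(dict(test_encodings)).logits` (an assumption about the installed transformers version) and the labels from `test_df["label"]`.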


## Discussion

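* After the initial five epochs on Train.csv, the model reached 69.4% accuracy on the test set. Three further epochs on val.csv at a lower learning rate (1e-6) raised test accuracy to 76.6%. A second three-epoch pass on val.csv, run after marking the last four layers as trainable, raised it to 81.1%. Since no layers had been frozen beforehand, this last gain likely comes from the additional epochs rather than from unfreezing.

* Overall, the two fine-tuning passes raised test accuracy by about 11.7 percentage points (69.4% to 81.1%) and lowered test loss from 0.8312 to 0.4439.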
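
To make the size of each step explicit, the short sketch below recomputes the gains in percentage points from the test accuracies printed by the three evaluation cells above. The stage names and the helper `gains_in_points` are labels chosen here for illustration.

```python
# Test accuracies as printed by the three evaluation cells above.
stages = [
    ("initial training", 0.6944),
    ("fine-tuning on val.csv", 0.7660),
    ("second fine-tuning pass", 0.8112),
]


def gains_in_points(stages):
    """Return (stage, gain over the previous stage in percentage points) pairs."""
    return [
        (name, round((acc - prev_acc) * 100, 2))
        for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
    ]


print(gains_in_points(stages))
# [('fine-tuning on val.csv', 7.16), ('second fine-tuning pass', 4.52)]
```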




**For Further Improvements:**

* Trying different learning rates, batch sizes and numbers of epochs could give better results.

* Experimenting with which layers are frozen and unfrozen during fine-tuning may improve the model's ability to adapt.

* Using additional data relevant to the target domain could improve the model’s generalization, especially if the validation or test set distribution differs from the training data.
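
The first suggestion above could be organized as a small grid search. The sketch below only enumerates the configurations and does no training; the search space is hypothetical, with the first value in each list matching the setting used for the initial training run, and `expand_grid` is a name introduced here.

```python
from itertools import product

# Hypothetical search space; the first value in each list is the one used above.
search_space = {
    "learning_rate": [5e-5, 3e-5, 1e-5],
    "batch_size": [64, 32],
    "epochs": [5, 3],
}


def expand_grid(space):
    """Return one dict per combination of hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = expand_grid(search_space)
print(len(configs))  # 12
print(configs[0])    # {'learning_rate': 5e-05, 'batch_size': 64, 'epochs': 5}
```

Each configuration would need a freshly loaded model so that runs do not build on one another, and comparing them on a split other than Test.csv keeps the final test accuracy an honest estimate.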

**Conclusion:** Fine-tuning the model on val.csv improved test accuracy from 69.4% to 81.1%, and further adjustments to the hyperparameters and training data could improve it more.