You are working with a Hugging Face model for sentiment classification.

# **Task:**
Write code to:
1. Load a "distilbert-base-uncased" model for sequence classification.
2. Predict the sentiment of the sentence: "I love working with generative AI models."

***Solution Overview:***

The base model "distilbert-base-uncased" is a pre-trained language model designed for general-purpose NLP tasks; it has not been fine-tuned for downstream tasks such as sentiment classification. When it is loaded for sequence classification, its classification head is randomly initialized, so it cannot produce meaningful sentiment predictions without further training.

To perform sentiment analysis, we have two practical options:

1. **Use a pre-finetuned version:** The model "distilbert-base-uncased-finetuned-sst-2-english" is already fine-tuned on the SST-2 dataset for sentiment classification and can be used directly for this task.

2. **Fine-tune the base model:** Alternatively, we can fine-tune "distilbert-base-uncased" on a labeled sentiment dataset (e.g., SST-2) and then use the resulting model for sentiment prediction.

For this task, we demonstrate both approaches to compare using a pre-finetuned model with fine-tuning the base model ourselves.

# **First Approach: Using Pre-Finetuned Model**

In [1]:
from transformers import pipeline

# 1. Load a DistilBERT model for sequence classification.
# If no model is given, the 'sentiment-analysis' pipeline falls back to a default checkpoint
# (distilbert-base-uncased-finetuned-sst-2-english); here the model name is passed explicitly.
classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# 2. Predict sentiment of the sentence: "I love working with generative AI models."
sentence = "I love working with generative AI models."
result = classifier(sentence)

print(result)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9978815913200378}]
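The `score` in the output above is a probability, not a raw model output. As a rough sketch of what happens for a two-class model like this one: the classifier produces one logit per class, a softmax turns the logits into probabilities, and the highest-probability class index is translated to a label name. The logits below are hypothetical values chosen for illustration, and `softmax` and `to_prediction` are helper names introduced here, not part of the `transformers` API.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Label mapping used by the SST-2 checkpoints in this notebook.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def to_prediction(logits):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

example_logits = [-3.0, 3.2]  # hypothetical logits, not taken from the real model
prediction = to_prediction(example_logits)
print(prediction)
```

A large gap between the two logits is what yields a score close to 1.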


# **Second Approach: Fine-tune distilbert-base-uncased for Sentiment Classification**

In [2]:
import os
os.environ["WANDB_DISABLED"] = "true"

## 1. Install Required Libraries

In [3]:
!pip install transformers datasets



In [4]:
!pip install -U datasets fsspec

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


## 2. Load Dataset

In [5]:
from datasets import load_dataset

# Load SST-2 sentiment dataset
dataset = load_dataset("glue", "sst2")


Generated splits: train (67,349 examples), validation (872 examples), test (1,821 examples).

## 3. Visualize the Dataset


In [7]:
import pandas as pd
# Convert first 5 samples to a DataFrame
df = pd.DataFrame(dataset["train"][:5])
df["label"] = df["label"].map({0: "Negative", 1: "Positive"})

df

|   | sentence | label | idx |
|---|---|---|---|
| 0 | hide new secretions from the parental units | Negative | 0 |
| 1 | contains no wit , only labored gags | Negative | 1 |
| 2 | that loves its characters and communicates som... | Positive | 2 |
| 3 | remains utterly satisfied to remain the same t... | Negative | 3 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | Negative | 4 |


## 4. Tokenize the Data

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True, padding="max_length")

tokenized_datasets = dataset.map(tokenize_function, batched=True)
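With `padding="max_length"`, every sentence is padded to the tokenizer's maximum length (512 tokens for `distilbert-base-uncased`), even though SST-2 sentences are short. The arithmetic sketch below compares that with padding each batch only to its longest member, which is what `DataCollatorWithPadding` in `transformers` does when tokenizing without `padding`. The token counts are hypothetical and `padded_tokens` is a helper name introduced here for illustration.

```python
MAX_LENGTH = 512  # maximum sequence length of distilbert-base-uncased

def padded_tokens(lengths, pad_to):
    """Count the padding tokens needed to bring every sequence up to pad_to."""
    return sum(pad_to - n for n in lengths)

batch_lengths = [9, 12, 27, 15]  # hypothetical token counts for one batch

fixed = padded_tokens(batch_lengths, MAX_LENGTH)            # padding="max_length"
dynamic = padded_tokens(batch_lengths, max(batch_lengths))  # pad to longest in batch

print(f"fixed padding:   {fixed} pad tokens")
print(f"dynamic padding: {dynamic} pad tokens")
```

Fixed-length padding keeps the notebook simple because every example has the same shape; dynamic padding is a possible optimization if training time matters.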



## 5. Load Model for Sequence Classification

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 6. Training Setup

In [10]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # evaluate once per epoch (named evaluation_strategy in older transformers releases)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
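As configured, the `Trainer` reports only the loss at evaluation time. One possible way to also report accuracy is to pass a `compute_metrics` function to `Trainer(..., compute_metrics=compute_metrics)`. The sketch below assumes NumPy is available and that the evaluation output unpacks into `(logits, labels)`; the arrays used to exercise it are made-up values, not results from this run.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels) from an evaluation pass."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}

# Hypothetical logits and labels, only to show the function's behavior.
fake_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.2, 0.4]])
fake_labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

With this in place, the per-epoch evaluation table would gain an accuracy column next to the validation loss.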


## 7. Start the Training

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.1833 | 0.305052 |


TrainOutput(global_step=4210, training_loss=0.2285320472264233, metrics={'train_runtime': 3283.2059, 'train_samples_per_second': 20.513, 'train_steps_per_second': 1.282, 'total_flos': 8921546832082944.0, 'train_loss': 0.2285320472264233, 'epoch': 1.0})
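The figures in `TrainOutput` can be cross-checked against the training setup. The sketch below assumes a single device and no gradient accumulation, so that one optimization step processes one batch of 16 examples; the constants are copied from the dataset size and the output above.

```python
import math

train_examples = 67349    # size of the SST-2 train split
batch_size = 16           # per_device_train_batch_size
epochs = 1                # num_train_epochs
train_runtime = 3283.2059 # seconds, from TrainOutput

# The last batch is partial, so the step count is rounded up.
steps = math.ceil(train_examples / batch_size) * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = steps / train_runtime

print(steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

The computed step count matches `global_step=4210`, and the two throughput values agree with `train_samples_per_second` and `train_steps_per_second` up to rounding.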

## 8. Save model temporarily


In [None]:
model.save_pretrained("./Distelbert_base_uncased_finetuned")
tokenizer.save_pretrained("./Distelbert_base_uncased_finetuned")


('./Distelbert_base_uncased_finetuned/tokenizer_config.json',
 './Distelbert_base_uncased_finetuned/special_tokens_map.json',
 './Distelbert_base_uncased_finetuned/vocab.txt',
 './Distelbert_base_uncased_finetuned/added_tokens.json',
 './Distelbert_base_uncased_finetuned/tokenizer.json')

## 9. Map the Labels from LABEL_0 & LABEL_1 to NEGATIVE & POSITIVE

In [None]:
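# Replace the default label names with readable ones (SST-2: 0 = negative, 1 = positive)
model.config.id2label = {0: "NEGATIVE", 1: "POSITIVE"}
model.config.label2id = {"NEGATIVE": 0, "POSITIVE": 1}
# Save again so the updated mapping is written to config.json
model.save_pretrained("./MyDrive/DistilBERT_finetuned_sst2")
tokenizer.save_pretrained("./MyDrive/DistilBERT_finetuned_sst2")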


('./Distelbert_base_uncased_finetuned1/tokenizer_config.json',
 './Distelbert_base_uncased_finetuned1/special_tokens_map.json',
 './Distelbert_base_uncased_finetuned1/vocab.txt',
 './Distelbert_base_uncased_finetuned1/added_tokens.json',
 './Distelbert_base_uncased_finetuned1/tokenizer.json')
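A model created with `num_labels=2` and no explicit mapping uses the placeholder names `LABEL_0` and `LABEL_1`, which is why the mapping is set by hand in this step. The plain-Python sketch below reproduces that default naming and derives `label2id` from `id2label` so the two dictionaries cannot drift apart; `default_id2label` is a hypothetical helper written for illustration and does not touch the model.

```python
def default_id2label(num_labels):
    """Placeholder names used when no label mapping is configured."""
    return {i: f"LABEL_{i}" for i in range(num_labels)}

# Readable mapping for SST-2, as set on model.config in this step.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
# Derive the inverse mapping instead of typing it twice.
label2id = {label: idx for idx, label in id2label.items()}

print(default_id2label(2))
print(id2label)
print(label2id)
```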

## 10. Mount Google Drive

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 11. Save the Fine-tuned Model to Google Drive for Future Use

In [None]:
model.save_pretrained("/content/drive/MyDrive/DistilBERT_finetuned_sst2")
tokenizer.save_pretrained("/content/drive/MyDrive/DistilBERT_finetuned_sst2")

('/content/drive/MyDrive/DistilBERT_finetuned_sst2/tokenizer_config.json',
 '/content/drive/MyDrive/DistilBERT_finetuned_sst2/special_tokens_map.json',
 '/content/drive/MyDrive/DistilBERT_finetuned_sst2/vocab.txt',
 '/content/drive/MyDrive/DistilBERT_finetuned_sst2/added_tokens.json',
 '/content/drive/MyDrive/DistilBERT_finetuned_sst2/tokenizer.json')

## 12. Perform Sentiment Analysis on the Given Sentence

In [19]:
from transformers import pipeline

# Load the model and tokenizer from the local path
classifier = pipeline("sentiment-analysis", model="/content/drive/MyDrive/DistilBERT_finetuned_sst2", tokenizer="/content/drive/MyDrive/DistilBERT_finetuned_sst2")

result = classifier("I love working with generative AI models.")
result1 = classifier("I hate working with generative AI models.")
print(result, result1)

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9954856038093567}] [{'label': 'NEGATIVE', 'score': 0.9887709021568298}]


## 13. Run Sentiment Analysis with the Fine-tuned Model Uploaded to the Hugging Face Hub

Link to the Model: https://huggingface.co/Jalal465/DistilBERT_base_uncased_finetuned_sst2_Jalal
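The upload itself is not shown in this notebook. One common way to do it, sketched here as an assumption rather than as the steps actually used, is to call `push_to_hub` on both the model and the tokenizer after authenticating with a write-enabled token (for example via `huggingface_hub`'s `login()`). `upload_to_hub` is a hypothetical helper name, and the repository id in the usage comment is a placeholder.

```python
def upload_to_hub(model, tokenizer, repo_id):
    """Push model weights/config and tokenizer files to the same Hub repository."""
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)

# Usage (requires prior authentication; placeholder repository id):
# upload_to_hub(model, tokenizer, "your-username/your-model-name")
```

Pushing the tokenizer to the same repository is what lets `pipeline(...)` load both from a single model name.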


In [1]:
from transformers import pipeline

# Load your uploaded model from Hugging Face Hub
classifier = pipeline("sentiment-analysis", model="Jalal465/DistilBERT_base_uncased_finetuned_sst2_Jalal")

# Test it
print(classifier("I love working with generative AI models."))


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9954856038093567}]
