
# Install required packages

In [3]:
%pip install transformers datasets torch scikit-learn openpyxl




# Read the data

In [25]:
#/content/IMDB Dataset_sample.xlsx
import pandas as pd
data = pd.read_excel('/content/IMDB Dataset_sample.xlsx')
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Train test split

In [26]:
from sklearn.model_selection import train_test_split
from datasets import Dataset

# Convert sentiment to numerical labels
data['sentiment'] = data['sentiment'].map({"positive": 1, "negative": 0})

# Split the data into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
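
One thing worth checking in the label step: `Series.map` with a dict returns `NaN` for any value that is not a key, so a stray spelling such as `"Positive"` would silently become a missing label. A minimal pure-Python sketch of a stricter encoder follows. `encode_sentiment` is a hypothetical helper, and normalising case and whitespace is an assumption about what the data may need.

```python
LABEL_MAP = {"positive": 1, "negative": 0}

def encode_sentiment(value):
    """Map a sentiment string to 0/1, failing loudly on anything unexpected."""
    key = str(value).strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unexpected sentiment label: {value!r}")
    return LABEL_MAP[key]

# Illustrative inputs, not rows from the dataset
labels = [encode_sentiment(s) for s in ["positive", "negative", " Positive "]]
print(labels)
```

With pandas, the same function could be applied as `data['sentiment'].apply(encode_sentiment)`.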

In [27]:
train_dataset

Dataset({
    features: ['review', 'sentiment', '__index_level_0__'],
    num_rows: 891
})

In [28]:
train_dataset[:5]

{'review': ['This film was pretty good. I am not too big a fan of baseball, but this is a movie that was made to help understand the meaning of love, determination, heart, etc.<br /><br />Danny Glover, Joseph Gordon-Levitt, Brenda Fricker, Christopher Lloyd, Tony Danza, and Milton Davis Jr. are brought in with a variety of talented actors and understanding of the sport. The plot was believable, and I love the message. William Dear and the guys put together a great movie.<br /><br />Most sports films revolve around true stories or events, and they often do not work well. But this film hits a 10 on the perfectness scale, even though there were a few minor mistakes here and there.<br /><br />10/10',
  'How do you take a cast of experienced, well-known actors, and put together such a stupid movie? Nimrod Antel has the answer: Armored. Six co-workers at an armored car business decide to steal a large shipment of cash themselves. But, just as they get to first base with their plans, everythi

# Tokenize the Text

Use the BERT tokenizer to preprocess the text.



The `batched=True` argument to the `map` method of a Hugging Face `Dataset` makes the mapping function (here, `tokenize_function`) receive batches of examples rather than one example at a time. Each column in a batch arrives as a list, which lets the tokenizer process many reviews in a single call.
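
To make the calling convention concrete, here is a minimal pure-Python sketch. It is an illustration only, not the `datasets` implementation: `map_batched` and `count_words` are hypothetical helpers, and the batch size of 2 is chosen for readability (the library's default batch size is 1000).

```python
def map_batched(columns, fn, batch_size=2):
    """Call fn on slices of a column-oriented table and merge the new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

def count_words(batch):
    # batch["review"] is a list of strings, one per example in the batch
    return {"n_words": [len(text.split()) for text in batch["review"]]}

table = {"review": ["great film", "not good at all", "fine"], "sentiment": [1, 0, 1]}
result = map_batched(table, count_words, batch_size=2)
print(result["n_words"])
```

The mapping function returns a dict of lists as well, with one entry per example in the batch.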

1. `padding="max_length"`

**Purpose:** Ensures that all tokenized sequences have the same length.

**How It Works:**
If a sequence is shorter than `max_length`, it is padded with a special padding token (`[PAD]` for BERT) until it reaches `max_length`.

If a sequence is longer than `max_length`, padding has no effect. With `truncation=True`, as in the code below, the sequence is cut down to `max_length` instead.

**Why It’s Important:**
Models like BERT require inputs of uniform length for batch processing.
This ensures that all sequences in a batch can be processed simultaneously.
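
The padding and truncation rules can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's code: `pad_or_truncate` is a hypothetical helper, the token ids other than `101` (`[CLS]`), `102` (`[SEP]`) and `0` (`[PAD]`) are made up, and the real tokenizer keeps the closing `[SEP]` when it truncates, which this sketch does not.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return fixed-length input_ids plus the matching attention_mask."""
    ids = token_ids[:max_length]             # truncation: drop anything past max_length
    n_pad = max_length - len(ids)            # padding: fill the remainder
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([101, 7, 8, 102], max_length=6)
long = pad_or_truncate([101, 7, 8, 9, 102], max_length=3)
print(short)
print(long)
```

The `attention_mask` is what lets the model ignore the padded positions.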

In [29]:
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["review"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
# Tokenize datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove unnecessary columns
train_dataset = train_dataset.remove_columns(["review", "__index_level_0__"])
test_dataset = test_dataset.remove_columns(["review", "__index_level_0__"])

# Rename the sentiment column to labels
train_dataset = train_dataset.rename_column("sentiment", "labels")
test_dataset = test_dataset.rename_column("sentiment", "labels")

# Set the format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])




# Initialize the Model

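2. `BertForSequenceClassification`

Load the `BertForSequenceClassification` model, which places a classification head on top of the pretrained BERT encoder. The head's weights are newly initialized and are learned during fine-tuning.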

In [30]:
from transformers import BertForSequenceClassification

# Load the model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2  # Binary classification
)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Train the model

In [33]:
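import os
from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it in the notebook.
# If HF_TOKEN is unset, login() falls back to an interactive prompt.
login(token=os.environ.get("HF_TOKEN"))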



In [34]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=1000
)
training_args



TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
eval_use_gather_object

In [38]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer


<transformers.trainer.Trainer at 0x7fbfd75d83a0>


In [39]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.2379,0.313792
2,0.1121,0.422293
3,0.0058,0.516657
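
In this table the training loss keeps falling while the validation loss rises after the first epoch, a pattern usually read as overfitting. The sketch below picks the epoch with the lowest validation loss. The `history` values are copied from the table; the variable names are illustrative.

```python
# (epoch, training loss, validation loss) as logged above
history = [
    (1, 0.2379, 0.313792),
    (2, 0.1121, 0.422293),
    (3, 0.0058, 0.516657),
]

# Select the row whose validation loss is smallest
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(best_epoch, best_val_loss)
```

`TrainingArguments` has options such as `load_best_model_at_end` and `metric_for_best_model` that may automate this kind of selection, provided checkpoints are saved at the same cadence as evaluation.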


TrainOutput(global_step=336, training_loss=0.1854325577705389, metrics={'train_runtime': 4329.6204, 'train_samples_per_second': 0.617, 'train_steps_per_second': 0.078, 'total_flos': 175823962744320.0, 'train_loss': 0.1854325577705389, 'epoch': 3.0})
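
The reported `global_step=336` can be reproduced from the dataset size and batch size. The sketch assumes a single device and no gradient accumulation (the printed arguments show `_n_gpu=0`), and that the last partial batch is kept, since `dataloader_drop_last=False`. Variable names are illustrative.

```python
import math

num_train_rows = 891   # from the train_dataset summary
batch_size = 8         # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
logging_steps = 10
save_steps = 1000

steps_per_epoch = math.ceil(num_train_rows / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
print(total_steps // logging_steps)  # number of logging events
print(total_steps // save_steps)     # multiples of save_steps reached
```

Because the run has fewer steps than `save_steps`, the step counter never reaches a multiple of `save_steps`; a smaller value would be needed to get intermediate checkpoints.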

In [40]:
results = trainer.evaluate()
results

{'eval_loss': 0.5166565775871277,
 'eval_runtime': 101.8056,
 'eval_samples_per_second': 2.19,
 'eval_steps_per_second': 0.275,
 'epoch': 3.0}
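
`trainer.evaluate()` reports only the loss here because no `compute_metrics` function was passed to `Trainer`. A minimal sketch of one that adds accuracy follows. It is written in plain Python so it runs without extra dependencies (NumPy's `argmax` would be the more usual choice), and it assumes the evaluation output unpacks into `(logits, labels)`. The toy logits and labels are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Accuracy from per-class logits and integer labels."""
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should add an `eval_accuracy` entry to the dictionary returned by `evaluate()`.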

# Save Model

In [41]:
model.save_pretrained("./sentiment_model")
tokenizer.save_pretrained("./sentiment_model")


('./sentiment_model/tokenizer_config.json',
 './sentiment_model/special_tokens_map.json',
 './sentiment_model/vocab.txt',
 './sentiment_model/added_tokens.json')

# Load the saved model

In [42]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the model and tokenizer
model = BertForSequenceClassification.from_pretrained("./sentiment_model")
tokenizer = BertTokenizer.from_pretrained("./sentiment_model")


# Predict on unseen data

In [43]:
def predict_sentiment(review, model, tokenizer):
    # Tokenize the input review
    inputs = tokenizer(
        review,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=256
    )

    # Move tensors to the same device as the model (CPU or GPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Perform prediction
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

    # Map predictions to sentiment labels
    sentiment = "positive" if predictions.item() == 1 else "negative"
    return sentiment


In [44]:
review = "I really loved this movie. The story was good!"
sentiment = predict_sentiment(review, model, tokenizer)
print(f"Review: {review}")
print(f"Predicted Sentiment: {sentiment}")


Review: I really loved this movie. The story was good!
Predicted Sentiment: positive
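
`predict_sentiment` returns only the label with the largest logit. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The sketch below shows the computation in plain Python; the logit values are hypothetical, and inside the notebook `torch.softmax(logits, dim=-1)` would do the same job.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 2.3]                  # hypothetical [negative, positive] scores
probs = softmax(example_logits)
label = "positive" if probs[1] > probs[0] else "negative"
print(label, round(probs[1], 3))
```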


In [45]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()




Epoch,Training Loss,Validation Loss
1,No log,0.491865
2,No log,0.407662
