<a href="https://colab.research.google.com/github/jahnvisikligar/Sentiment-analysis/blob/main/sentiment_analysis_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Activate GPU and Install Dependencies

In [3]:
# Activate GPU for faster training by clicking on 'Runtime' > 'Change runtime type' and then selecting GPU as the Hardware accelerator
# Then check if GPU is available
import torch
torch.cuda.is_available()

True

In [4]:
# Install required libraries
!pip install datasets transformers huggingface_hub
!apt-get install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)

# 2. Preprocess data

In [5]:
# Load data
from datasets import load_dataset
imdb = load_dataset("imdb")

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
# Create a smaller training dataset for faster training times
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(10000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(1000))
print(small_train_dataset[0])
print(small_test_dataset[0])

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...', 'label': 1}
{'text': "<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, 
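
Shuffling with a fixed seed and then taking the first rows amounts to a reproducible random sample. The sketch below mimics that idea in plain Python; `seeded_subset` and the toy rows are hypothetical stand-ins, and the exact permutation will differ from the one `datasets` produces for the same seed.

```python
import random

def seeded_subset(examples, n, seed=42):
    """Return n examples drawn without replacement, reproducibly."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return [examples[i] for i in indices[:n]]

# Hypothetical toy rows standing in for IMDb reviews.
toy = [{"text": f"review {i}", "label": i % 2} for i in range(100)]

subset_a = seeded_subset(toy, 10)
subset_b = seeded_subset(toy, 10)
print(subset_a == subset_b)  # True: the sample is identical on every run
```

Because the sample is random, the label balance of a subset is worth checking before training.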

In [7]:
# Set DistilBERT tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [8]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
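
With `truncation=True` and no explicit `max_length`, the tokenizer cuts each review down to the model's maximum input length, which is 512 positions for this checkpoint. A minimal sketch of the effect, assuming the usual BERT-style layout of `[CLS] ... [SEP]`; `truncate_and_wrap` is a hypothetical helper and the content ids are made up.

```python
CLS_ID, SEP_ID = 101, 102  # assumed [CLS] and [SEP] ids of the BERT uncased vocabulary

def truncate_and_wrap(content_ids, max_length=512):
    """Keep at most max_length ids in total, including [CLS] and [SEP]."""
    room = max_length - 2  # two positions are reserved for the special tokens
    return [CLS_ID] + content_ids[:room] + [SEP_ID]

long_review = list(range(1000, 1700))  # 700 made-up content token ids
short_review = [2023, 3185]            # made-up ids for a two-token text

print(len(truncate_and_wrap(long_review)))   # 512
print(len(truncate_and_wrap(short_review)))  # 4
```

Anything past the limit is dropped, so the end of a very long review never reaches the model.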

In [9]:
# Use a data collator to convert the samples to PyTorch tensors and pad each batch to a common length
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
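
The idea behind dynamic padding can be sketched without any library: each batch is padded only up to its own longest sequence, and an attention mask marks which positions are real tokens. `pad_batch` below is a hypothetical illustration, not the collator's actual code, and the token ids are made up.

```python
PAD_ID = 0  # assumed padding id, matching "pad_token_id": 0 in the DistilBERT config

def pad_batch(batch):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a fixed global length keeps batches of short reviews small, which tends to save computation.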

# 3. Training the model

In [10]:
# Define DistilBERT as our base model:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [11]:
# Define the evaluation metrics
import numpy as np
from datasets import load_metric  # newer `datasets` releases move metrics to the separate `evaluate` library

# Load each metric once instead of on every evaluation call
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
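
To make the two numbers concrete, here is a dependency-free sketch that computes accuracy and binary F1 by hand from raw logits. It assumes class 1 is the positive class; `accuracy_and_f1` and the example logits are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Made-up logits for four reviews; the third one is misclassified.
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.8]]
labels = [0, 1, 1, 1]
result = accuracy_and_f1(logits, labels)
print(result)  # accuracy is 0.75, F1 is approximately 0.8
```

F1 penalizes the missed positive review more visibly than accuracy does, which is why the two values can diverge on unbalanced samples.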

In [12]:
# Log in to your Hugging Face account 
# Get your API token here https://huggingface.co/settings/token
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [30]:
from transformers import TrainingArguments, Trainer

repo_name = "finetuning-sentiment-model-10000-samples"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,   # effective train batch size: 32 * 2 = 64
    num_train_epochs=4,
    weight_decay=0.01,
    save_strategy="epoch",           # write a checkpoint at the end of every epoch
    push_to_hub=True,                # requires the Hub login from the previous cell
    # Optional settings, left at their defaults in this run:
    # lr_scheduler_type="linear",    # learning rate schedule (linear is the default)
    # fp16=True,                     # mixed precision training on a supported GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    # optimizers=(optimizer, scheduler),  # pass instantiated objects to override the default AdamW setup
)

# Training is started in the next cell

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/content/finetuning-sentiment-model-10000-samples is already a clone of https://huggingface.co/JS21/finetuning-sentiment-model-10000-samples. Make sure you pull the latest changes with `repo.git_pull()`.
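
No learning rate schedule is set explicitly, so the `Trainer` should fall back to its default: a linear decay from `learning_rate` to zero with no warmup. A rough sketch of that schedule, assuming the 624 optimization steps this configuration produces; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=624, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Start, midpoint, and end of training.
for step in (0, 312, 624):
    print(step, linear_lr(step))
```

The rate starts at 2e-5, is roughly halved at the midpoint, and reaches zero at the final step.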


In [31]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10000
  Num Epochs = 4
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 624
  Number of trainable parameters = 66955010


| Step | Training Loss |
|------|---------------|
| 500  | 0.0232        |


Saving model checkpoint to finetuning-sentiment-model-10000-samples/checkpoint-156
Configuration saved in finetuning-sentiment-model-10000-samples/checkpoint-156/config.json
Model weights saved in finetuning-sentiment-model-10000-samples/checkpoint-156/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-10000-samples/checkpoint-156/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-10000-samples/checkpoint-156/special_tokens_map.json
tokenizer config file saved in finetuning-sentiment-model-10000-samples/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-10000-samples/special_tokens_map.json
Saving model checkpoint to finetuning-sentiment-model-10000-samples/checkpoint-312
Configuration saved in finetuning-sentiment-model-10000-samples/checkpoint-312/config.json
Model weights saved in finetuning-sentiment-model-10000-samples/checkpoint-312/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-

TrainOutput(global_step=624, training_loss=0.02049432370143059, metrics={'train_runtime': 1999.0148, 'train_samples_per_second': 20.01, 'train_steps_per_second': 0.312, 'total_flos': 5294233450747776.0, 'train_loss': 0.02049432370143059, 'epoch': 4.0})
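
The step counts and throughput in the log can be reproduced with a little arithmetic. This sketch assumes a single GPU and that the last, smaller batch of each epoch is kept; the variable names are illustrative.

```python
import math

num_examples = 10_000
per_device_batch = 32
grad_accum = 2
epochs = 4
train_runtime = 1999.0148  # seconds, taken from TrainOutput

batches_per_epoch = math.ceil(num_examples / per_device_batch)  # forward/backward passes
updates_per_epoch = batches_per_epoch // grad_accum             # optimizer steps
total_steps = updates_per_epoch * epochs
effective_batch = per_device_batch * grad_accum

samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(batches_per_epoch, updates_per_epoch, total_steps, effective_batch)  # 313 156 624 64
print(round(samples_per_second, 2), round(steps_per_second, 3))            # 20.01 0.312
```

The 156 updates per epoch match the checkpoint names (`checkpoint-156`, `checkpoint-312`). With only 624 steps in total and, as far as I know, a default logging interval of 500 steps, the loss table ends up with a single row.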

In [32]:
# Compute the evaluation metrics
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 32


{'eval_loss': 0.5301056504249573,
 'eval_accuracy': 0.908,
 'eval_f1': 0.9076305220883534,
 'eval_runtime': 16.7421,
 'eval_samples_per_second': 59.73,
 'eval_steps_per_second': 1.911,
 'epoch': 4.0}

# 4. Analyzing new data with the model

In [33]:
# Upload the model to the Hub
trainer.push_to_hub()

Saving model checkpoint to finetuning-sentiment-model-10000-samples
Configuration saved in finetuning-sentiment-model-10000-samples/config.json
Model weights saved in finetuning-sentiment-model-10000-samples/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-10000-samples/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-10000-samples/special_tokens_map.json


Upload file runs/Mar13_23-28-55_28e95fc6d0b3/events.out.tfevents.1678750176.28e95fc6d0b3.934.9: 100%|#########…

Upload file runs/Mar13_23-28-55_28e95fc6d0b3/events.out.tfevents.1678752407.28e95fc6d0b3.934.11: 100%|########…

remote: Scanning LFS files of refs/heads/main for validity...        
remote: LFS file scan complete.        
To https://huggingface.co/JS21/finetuning-sentiment-model-10000-samples
   b6512db..90edbc3  main -> main

To https://huggingface.co/JS21/finetuning-sentiment-model-10000-samples
   90edbc3..c290750  main -> main



'https://huggingface.co/JS21/finetuning-sentiment-model-10000-samples/commit/90edbc346bae470461feea1b7a067c7b0ce66d7c'

In [34]:
# Run inference with your new model using a pipeline
from transformers import pipeline

sentiment_model = pipeline(model="JS21/finetuning-sentiment-model-10000-samples")

sentiment_model(["I love this move", "This movie sucks!"])

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--JS21--finetuning-sentiment-model-10000-samples/snapshots/c2907507127ad56d00e1f362a7e5a9dd319eb74e/config.json
Model config DistilBertConfig {
  "_name_or_path": "JS21/finetuning-sentiment-model-10000-samples",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--JS21--finetuning-sentiment-model-10000-sam

Downloading pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--JS21--finetuning-sentiment-model-10000-samples/snapshots/c2907507127ad56d00e1f362a7e5a9dd319eb74e/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at JS21/finetuning-sentiment-model-10000-samples.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--JS21--finetuning-sentiment-model-10000-samples/snapshots/c2907507127ad56d00e1f362a7e5a9dd319eb74e/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--JS21--finetuning-sentiment-model-10000-samples/snapshots/c2907507127ad56d00e1f362a7e5a9dd319eb74e/tokenizer.json

[{'label': 'LABEL_1', 'score': 0.9902129769325256},
 {'label': 'LABEL_0', 'score': 0.9912029504776001}]
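
The generic names `LABEL_0` and `LABEL_1` appear because the model config carries no class names. A small post-processing sketch, assuming the IMDb convention that 0 is negative and 1 is positive; `LABEL_NAMES` and `readable` are hypothetical helpers.

```python
# Assumed mapping, based on the IMDb convention: 0 = negative, 1 = positive.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def readable(predictions):
    """Replace generic pipeline labels with names and round the scores."""
    return [
        {"label": LABEL_NAMES[p["label"]], "score": round(p["score"], 4)}
        for p in predictions
    ]

# The raw pipeline output shown above.
raw = [
    {"label": "LABEL_1", "score": 0.9902129769325256},
    {"label": "LABEL_0", "score": 0.9912029504776001},
]
print(readable(raw))
# [{'label': 'positive', 'score': 0.9902}, {'label': 'negative', 'score': 0.9912}]
```

An alternative that avoids post-processing would be to pass `id2label` and `label2id` dictionaries to `from_pretrained` before training, so the names are stored in the model config.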