# Sentiment Analysis using DistilBERT
## Why DistilBERT?
I chose DistilBERT for this task mainly because it is much less resource-intensive than BERT.
(My first attempt used BERT: 2 epochs took 2 hrs 20 min :> )
It is also much faster than GPT-2 and its other counterparts.
[Image: meenaJoke.png]

In [None]:
%%capture
!pip install transformers
!pip install transformers[torch]
!pip install -U accelerate
!pip install -U transformers

The above cell resolves the error "ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U . . ."

It also ensures that the required libraries are installed.
After installing, make sure to restart the runtime (or the IDE, if you are on a local machine).

The `%%capture` magic suppresses the output of the cell.
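
After restarting the runtime, it can be worth confirming that the installed `accelerate` really meets the `>=0.20.1` requirement quoted in the error. Below is a minimal sketch using only the standard library. The helper names `parse_version`, `meets_minimum` and `check_package` are made up for illustration, and the simple parser assumes plain dotted numeric versions (for pre-release tags, the `packaging` library is the more robust choice).

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.20.1' into (0, 20, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_package(name, minimum):
    """Report whether `name` is installed at `minimum` or newer."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"{name}: {installed} ({status}, need >= {minimum})"


# 0.20.1 is the minimum quoted in the ImportError above.
print(check_package("accelerate", "0.20.1"))
```

If the report says "too old" or "not installed", re-run the install cell and restart the runtime again.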

In [None]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import TrainingArguments, Trainer
import pandas as pd
import torch

model = "distilbert-base-uncased"

Here, `sklearn.metrics` will be used to compute the evaluation metrics reported by the trainer; a function `compute_metrics` is defined below for that purpose.
We'll use pandas to preprocess the dataset.
We are going to use **DistilBERT** as our base model.

## The Dataset
We are going to use the IMDB movie review dataset. It comes as a CSV file with two columns, `review` and `sentiment`.

The dataset is available [here](https://storage.googleapis.com/kaggle-data-sets/134715/320111/bundle/archive.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20230809%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230809T134152Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=3194f39deac56352ec92bcffeeaf56f1c4fb42493cb83992146a43c659a99946891e4a93e0451b7f3ad011ae61d83a090c667ab8c73a9b14b69f0eabfbd5bc510d6cd0fcad573a3ff47609afaac977ec77eb814947867c49ed9c2f0f4897a32359b072e3e6db3c719c8b0a808934d36dd93c3817aa1bd982a9e75bfb6008f506ad4a8c4b84e7a02cc68dd6bc55310eaf85906e2a9a00a50aa007943d77c01b80771f248ddad6eb72ff461a5aa87263e992166f83c0084545b9d1e674fad7a50ec7d30d880db62dee74ee96485e56325ca54e8ec11a842738de81542140935377dea87f7608d55e6ea1e92add4e7e62f23d61b518790a9ff6854d43ba5fd87eb0).

In [None]:
dataset = pd.read_csv("./data.csv") #load the dataset

Convert the sentiments to numerical form. We do this by defining a function `convert_label` and applying it to the column with the `apply` method.


In [None]:
def convert_label(inp):
    # "negative" -> 0, anything else -> 1
    return 0 if inp == "negative" else 1

dataset["sentiment"] = dataset["sentiment"].apply(convert_label)
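
One thing to be aware of: `convert_label` maps anything that is not exactly `"negative"` to 1. A typo, a missing value, or running the cell a second time (when the column already holds integers) therefore silently yields a label of 1. A stricter variant is sketched below. It assumes the only two labels in the CSV are `"negative"` and `"positive"`; `LABEL_TO_ID` and `convert_label_strict` are illustrative names, not part of the original notebook.

```python
# Assumed label set; adjust if the CSV uses different strings.
LABEL_TO_ID = {"negative": 0, "positive": 1}


def convert_label_strict(label):
    """Map a sentiment string to 0/1, raising on anything unexpected."""
    try:
        return LABEL_TO_ID[label]
    except (KeyError, TypeError):
        raise ValueError(f"unexpected sentiment label: {label!r}") from None


print([convert_label_strict(s) for s in ["positive", "negative"]])

# With pandas this would be applied as:
#   dataset["sentiment"] = dataset["sentiment"].apply(convert_label_strict)
```

Failing loudly here is a deliberate choice: a bad label is much cheaper to catch before training than after.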

As we only want to produce a model here, we are only going to split the data into training and validation sets.
Validation lets us check that the model doesn't overfit, i.e. that it doesn't just memorize the training data but also produces fairly accurate results on unseen data.


In [None]:
train_set = dataset[0:40000]
valid_set = dataset[40000:45000] # 40,000 training rows and 5,000 validation rows (an 8:1 split); any rows beyond 45,000 are not used.
print(train_set.head())

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1
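
The slices above take rows in file order. If the CSV happened to be sorted (by sentiment, say), a sequential split would give unrepresentative sets, so shuffling first is a common precaution. Here is a minimal sketch that assumes nothing about the file beyond its row count. `split_indices` and the seed value are illustrative choices, not part of the original notebook.

```python
import random


def split_indices(n_rows, n_train, n_valid, seed=42):
    """Return two disjoint lists of shuffled row positions."""
    if n_train + n_valid > n_rows:
        raise ValueError("requested more rows than the dataset contains")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded for reproducibility
    return positions[:n_train], positions[n_train:n_train + n_valid]


n_rows = 45000  # stand-in for len(dataset)

# Same sizes as the notebook: 40,000 training and 5,000 validation rows.
train_idx, valid_idx = split_indices(n_rows, 40000, 5000)
print(len(train_idx), len(valid_idx))

# With pandas the rows would then be selected by position:
#   train_set = dataset.iloc[train_idx]
#   valid_set = dataset.iloc[valid_idx]
```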


We will now write a class whose items are dictionaries with the keys `input_ids`, `attention_mask` and `labels`, which keeps our data feeding organized and compatible with torch dataloaders. The class also reports the number of reviews through `__len__`. Take note of the way we get the attention mask alongside the token IDs.

In [None]:
class makeData(torch.utils.data.Dataset):

    def __init__(self, reviews, sentiments, tokenizer):
        self.reviews = reviews
        self.sentiments = sentiments
        self.tokenizer = tokenizer
        self.max_len = tokenizer.model_max_length

    def __len__(self):
        return len(self.reviews) # number of reviews in the dataset

    def __getitem__(self, index):
        review = str(self.reviews[index])
        sentiments = self.sentiments[index]

        encoded_review = self.tokenizer.encode_plus(
            review,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            return_attention_mask=True,
            return_tensors="pt",
            padding="max_length",
            truncation=True
        ) # encode_plus also returns the attention mask, which marks real tokens with 1 and padding with 0 so that the model ignores the padding.

        return {
            'input_ids': encoded_review['input_ids'][0],
            'attention_mask': encoded_review['attention_mask'][0],
            'labels': torch.tensor(sentiments, dtype=torch.long)
        }
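
To make the roles of `padding="max_length"`, `truncation=True` and the attention mask concrete, here is a toy re-implementation in plain Python. It is not how the tokenizer works internally, only a sketch of the shapes it produces. The IDs 101, 102 and 0 for `[CLS]`, `[SEP]` and `[PAD]` match what appears in the batch printed later in this notebook, while `pad_and_mask` itself is a hypothetical helper.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_len):
    """Add [CLS]/[SEP], truncate or pad to max_len, and build the mask."""
    body = token_ids[: max_len - 2]           # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 = real token
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad             # 0 = padding, ignored by attention
    return input_ids, attention_mask


# A short review gets padded, a long one gets truncated.
short_ids, short_mask = pad_and_mask([2023, 3185], max_len=6)
long_ids, long_mask = pad_and_mask([2023, 3185, 2001, 2307, 999], max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item ends up with the same length, which is what lets the dataloader stack them into a single tensor per batch.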

We are going to use DistilBertTokenizerFast, for the sole reason that it contains "fast" in its name, so it must be good :) (More seriously, the fast tokenizers are backed by the Rust-based `tokenizers` library.)



In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model)
model = DistilBertForSequenceClassification.from_pretrained(model)
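# Running this cell twice raises an error, because `model` is rebound from the checkpoint name (a string) to the model object; it doesn't otherwise affect the runtime. Re-run the imports cell to reset it.
# The warning below says that the classification head (`pre_classifier` and `classifier`) is newly initialized, because the pretrained checkpoint does not contain those weights. This is expected: the head is exactly what we are about to train.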

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We will now create two instances of the makeData class, one for the training data and the other for the validation data, which lets us detect overfitting during training. Defining the class makes data handling much easier and more organized.

In [None]:
train_set_dataset = makeData(
    reviews=train_set.review.tolist(),
    sentiments=train_set.sentiment.tolist(),
    tokenizer=tokenizer
)

valid_set_dataset = makeData(
    reviews=valid_set.review.tolist(),
    sentiments=valid_set.sentiment.tolist(),
    tokenizer=tokenizer
)

Now we will create our dataloader objects, which yield batches as dictionaries of tensors. The `Trainer` used later builds its own dataloaders from the datasets, so these are only used here to inspect a batch.

In [None]:
train_set_dataloader = torch.utils.data.DataLoader(
    train_set_dataset,
    batch_size=16,
    num_workers=4,
    shuffle = True
)

valid_set_dataloader = torch.utils.data.DataLoader(
    valid_set_dataset,
    batch_size=16,
    num_workers=4,
    shuffle = True
)



Check the shape, and notice what a batch looks like. After preprocessing, the data pipeline is much like what we have done so far in this course with native PyTorch training loops. But here we are going to use the `Trainer` provided by the transformers library. You will soon notice the advantages of using it over the traditional way.

In [None]:
train_data = next(iter(train_set_dataloader))
valid_data = next(iter(valid_set_dataloader))
print(train_data["input_ids"].size(), valid_data["input_ids"].size())
print(train_data["input_ids"], valid_data["input_ids"])

torch.Size([16, 512]) torch.Size([16, 512])
tensor([[  101, 25176, 16136,  ...,     0,     0,     0],
        [  101,  2023,  3185,  ...,     0,     0,     0],
        [  101,  2023,  2143,  ...,     0,     0,     0],
        ...,
        [  101,  4931,  7779,  ...,     0,     0,     0],
        [  101,  1999,  2026,  ...,     0,     0,     0],
        [  101,  1045,  2245,  ...,  2052,  6011,   102]]) tensor([[ 101, 7929, 1010,  ..., 1996, 9577,  102],
        [ 101, 2092, 1010,  ..., 2001, 2004,  102],
        [ 101, 2034, 1010,  ...,    0,    0,    0],
        ...,
        [ 101, 2172, 2062,  ...,    0,    0,    0],
        [ 101, 1000, 3098,  ...,    0,    0,    0],
        [ 101, 1045, 2074,  ...,    0,    0,    0]])


We do have to keep in mind that DistilBERT is a model with about 66 million parameters, and training it from scratch is impractical without a powerful setup. However, we can fine-tune the pretrained model to make it familiar with our custom dataset and get better results (this is also known as "transfer learning"). It can be done by freezing most of the network and retraining (adjusting the weights of) a small part of it, here the classifier only.

In [None]:
for param in model.distilbert.parameters():
    param.requires_grad = False # try setting it to True and running the model; compare the training time and the accuracy of each run. There are tradeoffs here too.
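
With the `distilbert` body frozen, only the classification head is left to train. Here is a back-of-the-envelope count, assuming the head consists of the two linear layers named in the earlier warning (`pre_classifier` as a 768-to-768 layer and `classifier` as a 768-to-2 layer) and taking the rough figure of 66 million parameters from the text above. `linear_params` is an illustrative helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


HIDDEN = 768      # DistilBERT hidden size
NUM_LABELS = 2    # negative / positive

head = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)
total = 66_000_000  # approximate figure quoted in the text

print(f"trainable head parameters: {head:,}")
print(f"share of the whole model:  {head / total:.2%}")

# On the real model the same number can be read off with:
#   sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Under these assumptions, fewer than one percent of the weights receive gradient updates, which is where the savings in time and memory come from.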

You might run into memory constraints if the above isn't implemented, and you will then have to reduce the batch size to train the model at all.

We can reduce the batch size to stay within the memory constraints, but it won't be efficient for such a large dataset; furthermore, the accuracy gain would not be worth the time.

Change the per-device batch sizes to 16 and 8 if you want to train all the parameters; the training time per epoch rises from 11 minutes to 35 minutes.

The base DistilBERT has 6 transformer layers with a hidden size of 768 (as compared to 12 layers in BERT-base and 24 in BERT-large). Each layer is built from sub-blocks such as self-attention and a feed-forward network. You can read up on them and try fine-tuning by freezing or unfreezing the parameters of any of these layers, and observe the effect on the model's accuracy and training time.
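
Freezing does not have to be all-or-nothing. One middle ground is to unfreeze only the last transformer block. The sketch below selects parameters by name. It assumes the names yielded by `model.distilbert.named_parameters()` follow the `transformer.layer.<index>.` pattern, which is worth confirming by printing them first. `should_train` and the sample names are for illustration only.

```python
def should_train(param_name, trainable_prefixes=("transformer.layer.5.",)):
    """True if the parameter belongs to a block we want to fine-tune."""
    return param_name.startswith(tuple(trainable_prefixes))


# A few example names in the style printed by named_parameters().
sample_names = [
    "embeddings.word_embeddings.weight",
    "transformer.layer.0.attention.q_lin.weight",
    "transformer.layer.5.attention.q_lin.weight",
    "transformer.layer.5.ffn.lin2.bias",
]
for name in sample_names:
    print(f"{name:45s} trainable={should_train(name)}")

# Applied to the real model (layer 5 is the last of the 6 blocks):
#   for name, param in model.distilbert.named_parameters():
#       param.requires_grad = should_train(name)
```

Selecting by prefix keeps the experiment to a one-line change: pass a different tuple of prefixes to unfreeze more or fewer blocks.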

We will now write the `compute_metrics` function, which is used to measure the model's accuracy.

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
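
For readers who want to see what `precision_recall_fscore_support` and `accuracy_score` compute in the binary case, the same four numbers can be derived by hand from the confusion counts. This is a plain-Python sketch for intuition, not a replacement for the scikit-learn functions. `binary_metrics` and the toy labels are made up.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 with class 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


labels = [1, 0, 1, 1, 0, 1]
preds = [1, 0, 0, 1, 1, 1]
print(binary_metrics(labels, preds))
```

F1 is the harmonic mean of precision and recall, so it only gets high when both of them are high.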

Now we will use `TrainingArguments` from transformers to configure the trainer (a config file could also be used). After training, we save the model and the tokenizer so that we can proceed with our Gradio application.

In [None]:
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=32,
    warmup_steps=512,
    weight_decay=0.01,
    save_strategy="epoch",
    evaluation_strategy="steps", # called eval_strategy in newer transformers releases
    optim="adamw_torch",
    logging_steps=187 # 1875 total steps // 187 = 10 logging points; with no eval_steps set, validation also runs at each of them.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set_dataset,
    eval_dataset=valid_set_dataset,
    compute_metrics=compute_metrics
)

torch.cuda.empty_cache()
trainer.train()
model.save_pretrained("./savedModel")
tokenizer.save_pretrained("./savedModel")

print("Training done proceed to gradio :)")

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 187 | 0.6805 | 0.650958 | 0.7842 | 0.778121 | 0.813414 | 0.745763 |
| 374 | 0.6031 | 0.522183 | 0.8092 | 0.796067 | 0.869687 | 0.733938 |
| 561 | 0.4779 | 0.404104 | 0.8404 | 0.842666 | 0.842998 | 0.842333 |
| 748 | 0.4086 | 0.366127 | 0.8516 | 0.853069 | 0.857143 | 0.849034 |
| 935 | 0.3846 | 0.350538 | 0.853 | 0.854196 | 0.859824 | 0.84864 |
| 1122 | 0.3738 | 0.342915 | 0.853 | 0.853731 | 0.862138 | 0.845487 |
| 1309 | 0.3644 | 0.338313 | 0.8562 | 0.854424 | 0.878435 | 0.831691 |
| 1496 | 0.3613 | 0.336676 | 0.8558 | 0.858655 | 0.854134 | 0.863224 |
| 1683 | 0.3558 | 0.333662 | 0.859 | 0.859197 | 0.87085 | 0.847852 |
| 1870 | 0.3631 | 0.334583 | 0.856 | 0.858713 | 0.855021 | 0.862436 |


Training done proceed to gradio :)
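
The step numbers in the results follow from the batch size and `logging_steps`. The arithmetic is sketched below, assuming a single device (so the effective batch size equals `per_device_train_batch_size`) and no gradient accumulation.

```python
import math

n_train = 40_000          # rows in train_set
batch_size = 64           # per_device_train_batch_size
epochs = 3                # num_train_epochs
logging_steps = 187

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
n_evaluations = total_steps // logging_steps
eval_steps = [logging_steps * i for i in range(1, n_evaluations + 1)]

print(steps_per_epoch, total_steps, n_evaluations)
print(eval_steps)
```

The last entry of `eval_steps` lines up with the final reported step; the few remaining steps finish the third epoch without another evaluation.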


Lastly, it's kind of a pain to download the .bin model file directly from Colab, so instead we copy it to Drive and download it from there. We use the shutil library for this . . . : )

In [19]:
#if you have mounted the drive, proceed
import os
os.listdir("./drive/MyDrive")

['Colab Notebooks',
 'pizza_vs_not',
 'reviews.csv',
 'state_dict.pt',
 'pytorch_model.bin']

In [18]:
import shutil
shutil.copy("./savedModel/pytorch_model.bin", "./drive/MyDrive")

'./drive/MyDrive/pytorch_model.bin'
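
Note that `pytorch_model.bin` alone is not enough to reload the model with `from_pretrained`: the `config.json` and tokenizer files written to `./savedModel` are needed as well, and newer `transformers` releases may write `model.safetensors` instead of the `.bin` file. Copying the whole directory sidesteps both issues. Below is a sketch, where `export_directory` and the destination folder name are assumptions rather than part of the original notebook.

```python
import shutil
from pathlib import Path


def export_directory(src, dst):
    """Copy every file under src into dst, creating dst if needed."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    # dirs_exist_ok (Python 3.8+) allows re-running without an error.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.name for p in dst.iterdir())


# In the notebook this would be called as, for example:
#   export_directory("./savedModel", "./drive/MyDrive/savedModel")
```

The function returns the names found in the destination, which gives a quick way to confirm that everything arrived.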