# Train IMDb Classifier
This model is a fine-tuned version of distilbert-base-uncased on the imdb dataset. The notebook below walks through training this IMDb sentiment classifier with DistilBERT.

The original lab exercise focused on fine-tuning a FLAN-T5 model to generate less toxic content, using Meta AI's hate speech reward model.


# Explanation of the original lab exercises

The original lab exercise used Meta AI's hate speech reward model to compute rewards for content based on how hurtful it is. The reward model's logits for the "hate" and "not hate" classes were used to classify the input text. The exercise then evaluated toxicity and fine-tuned the model.


# Modifications and Improvements in this lab exercise

Because the original exercise used reinforcement learning driven by a hate / not-hate classifier, we tried a different model on a similar use case. In this lab exercise, we fine-tuned the DistilBERT model on the IMDb dataset to classify whether a review is positive or negative. We used a BERT-family model because it offers these advantages:

Transfer Learning: BERT is pre-trained on a massive amount of data, allowing it to capture general language patterns. Fine-tuning on a smaller dataset for a specific task leverages this pre-trained knowledge.

Contextual Embeddings: BERT provides contextual embeddings, considering the surrounding words in a sentence. This helps capture the meaning of words in different contexts.


# Model description
DistilBERT is a transformer-based model that offers a more compact and faster alternative to BERT. It was pretrained on the same corpus in a self-supervised manner, using the BERT base model as a teacher. During pretraining, the model was trained on three key objectives:

Distillation Loss: The model was trained to produce probabilities consistent with those of the BERT base model, essentially distilling the knowledge from its teacher.

Masked Language Modeling (MLM): 15% of the words in a sentence are randomly masked, and the model predicts the masked words. Unlike traditional RNNs, which read words one after another, or autoregressive models like GPT, which hide future tokens, this lets the model learn a bidirectional representation of the sentence.

Cosine Embedding Loss: The model was also trained to generate hidden states that closely match those of the BERT base model, reinforcing the learning of similar inner representations of the English language.

In summary, DistilBERT's training process involves distillation from a BERT base model, masked language modeling for bidirectional sentence representation, and cosine embedding loss to ensure similarity in hidden states. This results in a model that learns the nuances of the English language akin to its teacher model but with the advantage of faster inference and downstream task execution.
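The three objectives can be sketched on toy arrays with NumPy. This is an illustration under stated assumptions, not DistilBERT's actual training code: the temperature of 2.0, the function names, and the simplified masking rule (BERT-style masking also sometimes keeps or randomly replaces a selected token) are all choices made here for clarity.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.

    The temperature value is an assumption made for this sketch.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(kl.mean())

def cosine_embedding_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between hidden states, averaged over positions."""
    dot = np.sum(student_hidden * teacher_hidden, axis=-1)
    norms = (np.linalg.norm(student_hidden, axis=-1)
             * np.linalg.norm(teacher_hidden, axis=-1))
    return float(np.mean(1.0 - dot / norms))

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified MLM masking: each token becomes [MASK] with prob. mask_prob."""
    rng = np.random.default_rng(seed)
    chosen = rng.random(len(tokens)) < mask_prob
    masked = ["[MASK]" if c else t for t, c in zip(tokens, chosen)]
    return masked, chosen.tolist()

# Toy inputs: one position, three vocabulary entries, three hidden dimensions.
teacher_logits = np.array([[2.0, 0.5, -1.0]])
student_logits = np.array([[1.5, 0.7, -0.8]])
teacher_hidden = np.array([[1.0, 2.0, 3.0]])
student_hidden = np.array([[0.9, 2.1, 2.7]])

print("distillation loss:", distillation_loss(student_logits, teacher_logits))
print("cosine loss:", cosine_embedding_loss(student_hidden, teacher_hidden))
print("masked tokens:", mask_tokens("the movie was surprisingly good".split())[0])
```

Both the distillation loss and the cosine loss fall to zero when the student reproduces the teacher exactly, which is the sense in which the student is pulled toward the teacher's outputs and inner representations.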

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.c

## Load IMDb dataset

The load_dataset function loads a dataset from the Hugging Face datasets library. Hugging Face provides a wide range of datasets for NLP tasks, and this function lets you download and use them in your Python code. When you pass the name of the dataset you want, it returns a dictionary-like object that contains the dataset.

The load_metric function loads an evaluation metric, which is used to evaluate the performance of models on NLP tasks. Note that load_metric has been deprecated in newer releases of datasets in favor of the separate evaluate library, so this import may fail with a recent version.

Both functions are part of the datasets library, which aims to provide a unified interface for working with datasets and metrics in natural language processing. Before running this code, you need to have the datasets library installed, which you can do with pip install datasets.

In [5]:
from datasets import load_dataset, load_metric

The load_dataset function is used to load the "imdb" dataset. The "imdb" dataset refers to the Internet Movie Database dataset, which contains movie reviews labeled as positive or negative sentiment.

The function returns a dictionary-like object (ds in this case) that contains the dataset's splits (for IMDb: train, test, and unsupervised; there is no separate validation split), as well as metadata such as column names and data types.

In [6]:
ds = load_dataset("imdb")

In [7]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [8]:
ds['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}
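The `ClassLabel` output shows that label 0 means `neg` and label 1 means `pos`. A minimal sketch of turning that into lookup tables follows; the variable names are illustrative. Such mappings can optionally be passed as `id2label` and `label2id` when loading the model so that predictions are reported with readable names, although the notebook itself does not do this.

```python
# Label names copied from the ClassLabel output above.
label_names = ["neg", "pos"]

# Integer id -> name, and the reverse lookup.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)
```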

## Load Pretrained DistilBERT

Import the AutoModelForSequenceClassification and AutoTokenizer classes from the transformers library. These classes provide a convenient way to load pre-trained models and tokenizers without having to specify the exact model architecture.

Here, the pre-trained DistilBERT model with an uncased vocabulary is used. The uncased models are trained on lowercase text, which is suitable for many English NLP tasks.

The AutoTokenizer.from_pretrained method is used to load the tokenizer associated with the same pre-trained DistilBERT model. Tokenizers are used for converting text data into a format suitable for input to the model.

After running this code, we will have a pre-trained DistilBERT model set up for sequence classification and a corresponding tokenizer for processing input text. The classification head on top of the pre-trained model is newly initialized, so the model must be fine-tuned before its predictions are meaningful.

In [9]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Preprocess Data

The tokenize function takes a dictionary-like object as input, here representing a batch of dataset examples. It extracts the text data from the 'text' field and applies the tokenizer to it, truncating any review that is longer than the model's maximum input length.

The map method applies the tokenize function across the dataset. The batched=True argument means the function receives a batch of examples at a time, which improves efficiency when dealing with large datasets.

The result is a new dataset (tokenized_ds) that keeps the original 'text' and 'label' fields and adds the tokenizer's outputs (such as input_ids and attention_mask) as new columns suitable for input to a transformer model.


In [10]:
def tokenize(examples):
    outputs = tokenizer(examples['text'], truncation=True)
    return outputs

tokenized_ds = ds.map(tokenize, batched=True)
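To make the `batched=True` contract concrete, here is a small self-contained sketch. It uses a toy whitespace tokenizer as a stand-in for the real DistilBERT tokenizer, and the function name, `max_length` value, and sample reviews are all hypothetical. The point is only the shape of the data: the function receives a dictionary of lists and returns a dictionary of lists, which become new columns alongside the existing ones.

```python
def toy_tokenize(examples, max_length=5):
    """Toy stand-in for the real tokenizer: lowercase, split, then truncate."""
    token_lists = [text.lower().split()[:max_length] for text in examples["text"]]
    return {
        "tokens": token_lists,
        "attention_mask": [[1] * len(tokens) for tokens in token_lists],
    }

# With batched=True, the function sees column name -> list of values.
batch = {
    "text": ["A great movie", "Far too long and not very funny at all"],
    "label": [1, 0],
}

new_columns = toy_tokenize(batch)

# The returned columns are added next to the original ones.
merged = {**batch, **new_columns}
for name, values in merged.items():
    print(name, values)
```

The second review is cut to five tokens, which mirrors what `truncation=True` does at the model's maximum length.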

## Prepare Trainer

In [11]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

In [12]:
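import numpy as np

# Load the accuracy metric once, rather than on every evaluation call.
metric = load_metric("accuracy")

def compute_metrics(eval_preds):
    # eval_preds unpacks into the model's logits and the reference labels.
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)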
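What `compute_metrics` calculates can be checked by hand on a few made-up logits. The numbers below are invented for illustration and are not outputs of the trained model.

```python
import numpy as np

# One row per review, one column per class (index 0 = neg, index 1 = pos).
logits = np.array([[2.1, -0.3],
                   [-1.0, 0.4],
                   [0.2, 0.1],
                   [-0.5, 1.5]])
labels = np.array([0, 1, 1, 1])

# Predicted class = index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy = fraction of predictions that match the reference labels.
accuracy = float((predictions == labels).mean())

print(predictions.tolist())  # [0, 1, 0, 1]
print(accuracy)              # 0.75
```

The third review is misclassified because its "neg" logit is slightly larger, so three of four predictions are correct.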

In [13]:
pip install accelerate==0.20.3



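Here we specify the parameters for training: one epoch, a per-device batch size of 1 for both training and evaluation, an evaluation pass at the end of each epoch, and pushing the model to the Hub from the distilbert-imdb output directory.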

In [14]:
training_args = TrainingArguments(num_train_epochs=1,
                                  output_dir="distilbert-imdb",
                                  push_to_hub=True,
                                  per_device_train_batch_size=1,
                                  per_device_eval_batch_size=1,
                                  evaluation_strategy="epoch")

In [15]:
data_collator = DataCollatorWithPadding(tokenizer)
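The collator pads each batch to the length of its longest sequence when the batch is assembled, which is why the tokenize step did not need to pad. The sketch below shows the idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the padding id of 0 is an assumption.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        # Padding positions get the pad id and an attention mask of 0,
        # so the model ignores them.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad)
    return batch

# Two tokenized examples of different lengths (ids are illustrative).
features = [
    {"input_ids": [101, 2307, 3185, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 2919, 102], "attention_mask": [1, 1, 1]},
]

batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to one global length avoids spending computation on padding tokens when a batch contains only short reviews.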

In [16]:
trainer = Trainer(model=model, tokenizer=tokenizer,
                  data_collator=data_collator,
                  args=training_args,
                  train_dataset=tokenized_ds["train"],
                  eval_dataset=tokenized_ds["test"],
                  compute_metrics=compute_metrics)

## Train Model and Push to Hub

In [None]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


In [None]:
trainer.push_to_hub()

# Challenges Faced during the process

The first challenge was choosing a dataset that is not too small, since a small dataset could lead to overfitting. Another problem was getting the pre-installed libraries and their versions right so that the arguments passed to the model were accepted. The training hyperparameters, such as the number of epochs, batch size, and learning rate, had to be chosen carefully to achieve good performance on the IMDb dataset. One main challenge was that, depending on the available computational resources, training and fine-tuning DistilBERT was slower than it would be for smaller models. Thus, to check that the model was training properly, the batch size had to be changed to reduce the computation time. Regardless, the model took 15 hours to train.
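The effect of batch size on the amount of work per epoch can be estimated with simple arithmetic. The sketch below counts optimizer steps for the 25,000 training reviews at a few batch sizes. The helper name and the candidate batch sizes are illustrative, and the step count is only a rough proxy for wall-clock time, which also depends on hardware and sequence length.

```python
import math

def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))

train_rows = 25000  # size of the IMDb train split shown earlier

for batch_size in (1, 8, 16, 32):
    print(batch_size, steps_per_epoch(train_rows, batch_size))
```

With a batch size of 1, as in the training arguments used here, one epoch takes 25,000 steps, whereas a batch size of 16 would need 1,563.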

# Results

This model can be used to classify whether a review is positive or negative. This exercise gave us a model that is effective at understanding context and nuance in language, which is crucial for accurately analyzing movie reviews where sentiment can be subtle or complex. It is robust to the noise present in real-world data, i.e. online movie reviews, and can handle the varied sentence structures, slang, and other idiosyncrasies common in user-generated content.

In summary, we found an efficient, robust, and effective solution for sentiment analysis, especially when resources or processing power are limited.