### IMPORT
*   `import collections` the collections module provides useful data structures such as Counter, defaultdict, and OrderedDict.
*   `import datasets` the Hugging Face Datasets library provides a standardized interface for loading and manipulating datasets, which makes it easier to work with different datasets in a consistent manner.
*   `import torch.nn as nn` the PyTorch module that provides pre-built neural network layers and modules. It is used here to define the architecture of the sentiment analysis model.
*   `import torch.optim as optim` the PyTorch module that provides optimization algorithms for training neural networks.
*   `import torchtext` provides utilities and datasets for natural language processing tasks, including text preprocessing, vocabulary building, word embeddings, and data loading.
*   `import tqdm` adds progress bars to loops, giving visual feedback on the progress of the training and evaluation loops.
*   `import transformers` the Hugging Face Transformers library provides pre-trained models such as BERT, GPT, and RoBERTa.



In [1]:
!pip install datasets transformers torch torchvision torchaudio matplotlib numpy tqdm


In [2]:
import collections
import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm
import transformers

### SETTING SEED
*   `seed = 1234` the seed is the starting point for a random number generator. Initializing a generator with the same seed makes it produce the same sequence of random numbers every time the code is run.
*   `np.random.seed(seed)` sets the seed for NumPy's random number generator, so that random operations performed with NumPy (such as shuffling data) use the same sequence of random numbers on every run.
*   `torch.manual_seed(seed)` sets the seed for PyTorch's default random number generator, so that PyTorch generates the same sequence of random numbers every time the code is run.
*   `torch.cuda.manual_seed(seed)` explicitly sets the seed for the random number generator of the current GPU, so that random operations on the GPU also start from the same seed value.
*   `torch.backends.cudnn.deterministic = True` asks cuDNN to use deterministic algorithms, so that the same input produces the same output when the code is rerun on the same hardware and software. Results can still differ across GPUs, drivers, or library versions.

> CUDA AND CUDNN:
*   *CUDA* - Compute Unified Device Architecture, NVIDIA's parallel computing platform. It provides tools and libraries for writing code that runs on NVIDIA GPUs and takes advantage of their massively parallel processing capabilities.
*   *cuDNN* - the CUDA Deep Neural Network library, NVIDIA's GPU-accelerated library of deep learning primitives. It provides highly efficient implementations of common operations such as convolution, pooling, normalization, and activation functions.
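
As a small check of the seeding behaviour described above, the sketch below reseeds PyTorch and draws a random tensor twice. It runs on the CPU only, and the tensor size is an arbitrary choice for illustration.

```python
import torch

seed = 1234

# First draw after seeding.
torch.manual_seed(seed)
first = torch.rand(3)

# Reseeding restarts the generator, so the second draw repeats the first.
torch.manual_seed(seed)
second = torch.rand(3)

# Without reseeding, the generator simply continues its sequence.
third = torch.rand(3)

print(torch.equal(first, second))
print(torch.equal(second, third))
```

This is why the seed is set once at the top of the notebook: every later random operation then follows the same sequence on each run.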









In [3]:
seed = 1234
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

### SPLITTING DATASET INTO TRAINING AND TESTING

In [4]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

### SETTING TRANSFORMER
*   `transformer_name = "bert-base-uncased"` sets the name of the pre-trained transformer model that will be used. *bert-base-uncased* is a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model available through the Hugging Face Transformers library. The *base* part of the name indicates the smaller of the two original BERT sizes (the other is *large*). The *uncased* part means that the model was trained on lowercased text, so it does not distinguish between upper and lower case in the input.
*   `tokenizer = transformers.AutoTokenizer.from_pretrained(transformer_name)` the tokenizer converts input text into the format the transformer model expects. The *AutoTokenizer* class creates a tokenizer that matches the pre-trained model named by *transformer_name*, so here the text is tokenized in the way the "bert-base-uncased" model expects.



In [5]:
transformer_name = "bert-base-uncased"

tokenizer = transformers.AutoTokenizer.from_pretrained(transformer_name)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

### TOKENIZER
*   `tokenizer.tokenize("hello world!")` the *tokenize()* method breaks the input text into a list of tokens, the basic units that the model processes.
*   `tokenizer.encode("hello world!")` the *encode()* method returns a list of integers, one per token, where each integer is the token's ID in the tokenizer's vocabulary. It also adds the special tokens that mark the start and end of the sequence.
*   `tokenizer.convert_ids_to_tokens(tokenizer.encode("hello world"))` *encode()* converts the text "hello world" into a sequence of token IDs, and *convert_ids_to_tokens()* converts those IDs back into the corresponding tokens, which makes the added special tokens visible.
*   `tokenizer("hello world!")` calling the tokenizer directly tokenizes and encodes the text in one step. It returns a dictionary with three keys: *'input_ids'*, *'token_type_ids'* and *'attention_mask'*. *'input_ids'* holds the sequence of token IDs, like the output of *encode()*. *'token_type_ids'* marks which segment each token belongs to and is all 0s for a single sentence. *'attention_mask'* holds 1 for tokens the model should attend to and 0 for tokens it should ignore, such as padding.
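
To make the relationship between these calls concrete, here is a toy stand-in written in plain Python. It is not the real BERT tokenizer: `toy_vocab`, `toy_tokenize`, `toy_encode` and `toy_call` are hypothetical names, the splitting rule is deliberately naive, and the IDs are copied from the output shown in this notebook.

```python
# Toy vocabulary: the IDs are the ones printed in this notebook's output.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088, "!": 999}
id_to_token = {i: t for t, i in toy_vocab.items()}


def toy_tokenize(text):
    # Naive rule (an assumption): lowercase, split off "!" and split on spaces.
    return text.lower().replace("!", " !").split()


def toy_encode(text):
    # Like encode(): look up each token and add the special tokens.
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [toy_vocab[t] for t in tokens]


def toy_call(text):
    # Like tokenizer(text): bundle the IDs with the masks the model expects.
    ids = toy_encode(text)
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),
        "attention_mask": [1] * len(ids),
    }


print(toy_tokenize("hello world!"))
print(toy_encode("hello world!"))
print([id_to_token[i] for i in toy_encode("hello world")])
print(toy_call("hello world!"))
```

The real tokenizer uses a vocabulary of tens of thousands of subword pieces, but the flow from text to tokens to IDs to a dictionary of model inputs is the same.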



In [6]:
tokenizer.tokenize("hello world!")
tokenizer.encode("hello world!")
tokenizer.convert_ids_to_tokens(tokenizer.encode("hello world"))
tokenizer("hello world!")

{'input_ids': [101, 7592, 2088, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

### DEFINING TOKENIZE FUNCTION
*   `def tokenize_and_numericalize_example(example, tokenizer)` *example* is a dictionary-like object with a "text" key that holds the text to be processed. *tokenizer* is the tokenizer instance that converts text into numerical representations.
*   `ids = tokenizer(example["text"], truncation = True)["input_ids"]` calls the tokenizer on the input text. *truncation=True* specifies that text longer than the maximum sequence length supported by the model is truncated to fit within the limit. The call returns a dictionary with several outputs, from which only "input_ids" is kept: the numerical representation of the text, where each token is mapped to its ID in the tokenizer's vocabulary.
*   `return {"ids": ids}` returns a dictionary with a single key, "ids", whose value is the list of token IDs obtained from the tokenizer.


In [7]:
def tokenize_and_numericalize_example(example, tokenizer):
    ids = tokenizer(example["text"], truncation=True)["input_ids"]
    return {"ids": ids}

### TRAINING AND TESTING DATA
*   `train_data = train_data.map(..)` the *map* method of a Hugging Face dataset applies a transformation function to every example in the dataset.
*   `tokenize_and_numericalize_example` is the function applied to each example in *train_data*. The dictionary it returns is added to the example, so every example gains an "ids" column.
*   `fn_kwargs = {"tokenizer": tokenizer}` the *fn_kwargs* parameter passes additional keyword arguments to the transformation function. Here the tokenizer object is passed under the name "tokenizer", which gives *tokenize_and_numericalize_example* access to the tokenizer when it processes each example.
*   `test_data = test_data.map(tokenize_and_numericalize_example, fn_kwargs = {"tokenizer": tokenizer})` does the same for the test data.
*   `train_data[0]` accesses the first example of the transformed *train_data* using the index [0].
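
The effect of `map` with `fn_kwargs` can be sketched in plain Python. `toy_map` and `fake_tokenizer` below are hypothetical stand-ins, not part of the `datasets` library. The sketch only shows that the function is called once per example with the extra keyword arguments, and that the returned dictionary is merged into the example as new columns.

```python
def fake_tokenizer(text, truncation=True):
    # Hypothetical tokenizer: each word's length stands in for its token ID.
    return {"input_ids": [len(word) for word in text.split()]}


def tokenize_and_numericalize_example(example, tokenizer):
    ids = tokenizer(example["text"], truncation=True)["input_ids"]
    return {"ids": ids}


def toy_map(examples, function, fn_kwargs=None):
    fn_kwargs = fn_kwargs or {}
    # Merge each returned dict into a copy of the original example.
    return [{**ex, **function(ex, **fn_kwargs)} for ex in examples]


toy_data = [
    {"text": "a great film", "label": 1},
    {"text": "dull", "label": 0},
]
mapped = toy_map(
    toy_data,
    tokenize_and_numericalize_example,
    fn_kwargs={"tokenizer": fake_tokenizer},
)
print(mapped[0])
```

The original "text" and "label" entries are kept and "ids" is added next to them, which matches the structure of `train_data[0]` after mapping.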


In [8]:
train_data = train_data.map(
    tokenize_and_numericalize_example, fn_kwargs={"tokenizer": tokenizer}
)
test_data = test_data.map(
    tokenize_and_numericalize_example, fn_kwargs={"tokenizer": tokenizer}
)
train_data[0]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

### TOKENIZER VOCAB AND PADDING
*   `tokenizer.vocab["!"]` retrieves the token ID associated with the exclamation mark ("!") from the tokenizer's vocabulary.
*   `tokenizer.pad_token` retrieves the padding token used by the tokenizer. Padding tokens are special tokens added to sequences to ensure that all sequences have the same length.
*   `tokenizer.pad_token_id` retrieves the token ID associated with the padding token in the tokenizer's vocabulary.
*   `tokenizer.vocab[tokenizer.pad_token]` retrieves the same token ID by looking up the padding token itself in the tokenizer's vocabulary.
*   `pad_index = tokenizer.pad_token_id` assigns the token ID of the padding token to a variable named *pad_index*, which makes it easier to reference throughout the code.


In [9]:
tokenizer.vocab["!"]
tokenizer.pad_token
tokenizer.pad_token_id
tokenizer.vocab[tokenizer.pad_token]
pad_index = tokenizer.pad_token_id

### SPLITTING INTO TRAIN, TEST AND VALID DATA
*   `test_size = 0.25` 25% of *train_data* will be allocated for validation, while the remaining 75% will be used for training.
*   `train_valid_data = train_data.train_test_split(test_size = test_size)` splits *train_data* into training and validation sets using the *train_test_split* method. The *test_size* parameter specifies the proportion of data to be used for validation.
*   `train_data = train_valid_data["train"]` assigns the training subset (stored under the "train" key) back to *train_data*.
*   `valid_data = train_valid_data["test"]` assigns the held-out subset (stored under the "test" key) to *valid_data*, where it serves as the validation set.
*   `train_data = train_data.with_format(type = "torch", columns = ["ids", "label"])` applies a format to *train_data* using the *with_format* method. *type="torch"* indicates that the data should be returned as PyTorch tensors. *columns=["ids", "label"]* specifies the columns to include in the formatted data: "ids" (the token IDs) and "label" (the sentiment labels).
*   `valid_data = valid_data.with_format(type = "torch", columns = ["ids", "label"])` same as above but for the validation data.
*   `test_data = test_data.with_format(type = "torch", columns = ["ids", "label"])` same as above but for the test data.
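
For reference, the split sizes implied by `test_size = 0.25` can be worked out directly. The figure of 25,000 training examples comes from the download log earlier in this notebook, and the variable names are illustrative.

```python
n_train_original = 25_000  # size of the IMDb train split, from the log above
test_size = 0.25

n_valid = int(n_train_original * test_size)
n_train = n_train_original - n_valid

print(n_train, n_valid)
```

The test split of 25,000 reviews is left untouched, so the validation set is carved out of the training data only.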



In [10]:
test_size = 0.25

train_valid_data = train_data.train_test_split(test_size=test_size)
train_data = train_valid_data["train"]
valid_data = train_valid_data["test"]

train_data = train_data.with_format(type="torch", columns=["ids", "label"])
valid_data = valid_data.with_format(type="torch", columns=["ids", "label"])
test_data = test_data.with_format(type="torch", columns=["ids", "label"])

### COLLATE FUNCTION

> a collate function is a function that PyTorch's DataLoader calls to assemble a batch of data for training or inference. It works like a batch maker: it takes individual examples and combines them into a single batch. It is needed here because reviews have different lengths, while a batch must be a single tensor in which every sequence has the same length, so the function pads the shorter sequences.

*   `def get_collate_fn(pad_index)` the parameter is the value used to pad input sequences so that they all have the same length.
*   `def collate_fn(batch)` the actual function, returned by *get_collate_fn*. It is responsible for turning a list of examples into a batch before it is passed to the model.
*   `batch_ids = [i["ids"] for i in batch]` extracts the ids from each example in the batch. The ids are the input sequences for the sentiment analysis task.
*   `batch_ids = nn.utils.rnn.pad_sequence(..)` uses the *pad_sequence* function from PyTorch's *nn.utils.rnn* module to pad the input sequences to the length of the longest sequence in the batch.
*   `batch_ids, padding_value = pad_index, batch_first = True` the *padding_value* parameter specifies the value used for padding, here *pad_index*. *batch_first = True* makes the batch dimension the first dimension, so the result has shape [batch size, seq len].
*   `batch_label = [i["label"] for i in batch]` extracts the label from each example in the batch.
*   `batch_label = torch.stack(batch_label)` converts the list of labels into a single PyTorch tensor using *torch.stack*.
*   `batch = {"ids": batch_ids, "label": batch_label}` creates a dictionary that contains the padded input sequences *batch_ids* and the corresponding labels *batch_label*.
*   `return batch` *collate_fn* returns the assembled batch as a dictionary.
*   `return collate_fn` *get_collate_fn* returns the inner *collate_fn*, which remembers *pad_index*.
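
The padding step described above can be tried on a tiny hand-made batch. This is a minimal sketch: the token IDs are illustrative and `pad_index = 0` is an assumed padding ID for the example.

```python
import torch
import torch.nn as nn

pad_index = 0  # assumed padding ID for this sketch

batch = [
    {"ids": torch.tensor([101, 7592, 2088, 102]), "label": torch.tensor(1)},
    {"ids": torch.tensor([101, 7592, 102]), "label": torch.tensor(0)},
]

# Pad every sequence to the length of the longest one in the batch.
batch_ids = nn.utils.rnn.pad_sequence(
    [example["ids"] for example in batch],
    padding_value=pad_index,
    batch_first=True,
)
# Stack the scalar labels into one tensor of shape [batch size].
batch_label = torch.stack([example["label"] for example in batch])

print(batch_ids.shape)
print(batch_ids)
print(batch_label)
```

Only the shorter sequence receives padding, and the amount of padding depends on the longest sequence in that particular batch rather than on a global maximum.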





In [11]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_ids = [i["ids"] for i in batch]
        batch_ids = nn.utils.rnn.pad_sequence(
            batch_ids, padding_value=pad_index, batch_first=True
        )
        batch_label = [i["label"] for i in batch]
        batch_label = torch.stack(batch_label)
        batch = {"ids": batch_ids, "label": batch_label}
        return batch

    return collate_fn

### DATA LOADER FUNCTION

> this function is used to efficiently load and iterate over a dataset during training or inference

*   `def get_data_loader(dataset, batch_size, pad_index, shuffle = False)` *dataset* contains the data for sentiment analysis. *batch_size* is the number of samples in each batch. *pad_index* is used to pad input sequences to the same length. *shuffle* determines whether the samples are shuffled before being grouped into batches.
*   `collate_fn = get_collate_fn(pad_index)` creates the custom collate function that is passed to the DataLoader. It pads all input sequences in a batch to the same length using *pad_index*.
*   `data_loader = torch.utils.data.DataLoader(dataset = dataset, batch_size = batch_size, collate_fn = collate_fn, shuffle = shuffle,)` creates an instance of the *DataLoader* class. *dataset* contains the data for sentiment analysis. *batch_size* is the number of samples in each batch. *collate_fn* is the custom collate function created on the previous line, which assembles each batch. *shuffle* determines whether the samples are shuffled before being grouped into batches.
*   `return data_loader` returns the DataLoader object, which can be iterated over to get the dataset in batches.


In [12]:
def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

### LOADING DATA
*   `batch_size = 8` the number of samples passed through the model at once during training, validation and testing.
*   `train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle = True)` *train_data* is the training dataset. *batch_size* is set to 8 as above. *pad_index* is the ID used to pad input sequences to a common length within each batch. *shuffle = True* means the training data is reshuffled at the start of every epoch.
*   `valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)` same as above, but with the default *shuffle = False*.
*   `test_data_loader = get_data_loader(test_data, batch_size, pad_index)` same as above, but with the default *shuffle = False*.
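
As a rough guide to how long each pass takes, the number of batches per loader follows from the split sizes and `batch_size = 8`. The sizes below assume the 25,000 training reviews were split 75/25 as configured earlier, and the dictionary names are illustrative.

```python
import math

batch_size = 8
# Split sizes: 25,000 train examples split 75/25, plus 25,000 test examples.
split_sizes = {"train": 18_750, "valid": 6_250, "test": 25_000}

# A DataLoader yields ceil(n / batch_size) batches unless drop_last=True is set.
n_batches = {name: math.ceil(n / batch_size) for name, n in split_sizes.items()}
print(n_batches)
```

These counts are the totals that the tqdm progress bars show while iterating over each loader.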


In [13]:
batch_size = 8

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)

### TRANSFORMER CLASS
*   `class Transformer(nn.Module)` defines a PyTorch module named *Transformer* that inherits from *nn.Module*, the base class for all neural network modules in PyTorch.
*   `def __init__(self, transformer, output_dim, freeze)` *transformer* is the pre-trained transformer model being used. *output_dim* is the number of output classes for sentiment analysis. *freeze* is a boolean flag that determines whether the pre-trained transformer's parameters should be frozen.
*   `super().__init__()` calls the constructor of the parent class *nn.Module* to initialize the base class properties.
*   `self.transformer = transformer` stores the pre-trained transformer model in the *Transformer* module.
*   `hidden_dim = transformer.config.hidden_size` retrieves the hidden dimension size of the pre-trained transformer model.
*   `self.fc = nn.Linear(hidden_dim, output_dim)` creates a fully connected layer that maps the transformer's output to the desired number of output classes.
*   `if freeze: for param in self.transformer.parameters(): param.requires_grad = False` if *freeze* is True, this block sets the *requires_grad* attribute of every transformer parameter to False. This freezes the pre-trained weights and prevents them from being updated during training.
*   `def forward(self, ids)` the forward method of the *Transformer* module, which defines the forward pass of the model. The *ids* argument is a batch of input sequences of token IDs.
*   `output = self.transformer(ids, output_attentions = True)` passes the input token IDs through the pre-trained transformer and retrieves its output, which includes the last hidden state and the attention weights.
*   `hidden = output.last_hidden_state` extracts the last hidden state from the transformer's output, which holds the contextual representation of every token in the input sequence.
*   `attention = output.attentions[-1]` extracts the attention weights of the last layer from the transformer's output. They can be useful for interpretability, although this forward pass does not return them.
*   `cls_hidden = hidden[:, 0, :]` extracts the hidden state of the special classification token ([CLS]), which is the first token of every sequence.
*   `prediction = self.fc(torch.tanh(cls_hidden))` applies a tanh activation to the [CLS] hidden state and passes it through the fully connected layer to produce one score per sentiment class.
*   `return prediction` returns the sentiment prediction as the output of the forward pass.
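
The shapes involved in the classification head can be checked without loading BERT. In this sketch a random tensor stands in for `output.last_hidden_state`; the batch size and sequence length are arbitrary, while 768 and 2 match the hidden size and number of classes used in this notebook.

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_dim, output_dim = 4, 10, 768, 2

# Random stand-in for output.last_hidden_state: [batch size, seq len, hidden dim]
hidden = torch.randn(batch_size, seq_len, hidden_dim)

fc = nn.Linear(hidden_dim, output_dim)

# Position 0 of every sequence holds the [CLS] token's hidden state.
cls_hidden = hidden[:, 0, :]
prediction = fc(torch.tanh(cls_hidden))

print(cls_hidden.shape)   # [batch size, hidden dim]
print(prediction.shape)   # [batch size, output dim]
```

Indexing position 0 collapses the sequence dimension, so the linear layer sees exactly one vector per review regardless of how long the review was.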





In [14]:
class Transformer(nn.Module):
    def __init__(self, transformer, output_dim, freeze):
        super().__init__()
        self.transformer = transformer
        hidden_dim = transformer.config.hidden_size
        self.fc = nn.Linear(hidden_dim, output_dim)
        if freeze:
            for param in self.transformer.parameters():
                param.requires_grad = False

    def forward(self, ids):
        # ids = [batch size, seq len]
        output = self.transformer(ids, output_attentions=True)
        hidden = output.last_hidden_state
        # hidden = [batch size, seq len, hidden dim]
        attention = output.attentions[-1]
        # attention = [batch size, n heads, seq len, seq len]
        cls_hidden = hidden[:, 0, :]
        prediction = self.fc(torch.tanh(cls_hidden))
        # prediction = [batch size, output dim]
        return prediction

### TRANSFORMER CONFIGURATION
*   `transformer = transformers.AutoModel.from_pretrained(transformer_name)` loads the pre-trained transformer model through the Hugging Face Transformers library. *transformers.AutoModel.from_pretrained()* automatically selects and loads the appropriate model class based on the *transformer_name* parameter.
*   `transformer.config.hidden_size` retrieves the *hidden_size* attribute from the configuration of the loaded model. *hidden_size* is the dimensionality of the hidden representations produced by the transformer, so it also determines the input size of the classification layer. A larger hidden size can capture more nuanced information about the input text, at the cost of more parameters and computation.



In [15]:
transformer = transformers.AutoModel.from_pretrained(transformer_name)
transformer.config.hidden_size

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

768

### OUTPUT DIMENSIONS
*   `output_dim = len(train_data["label"].unique())` determines the number of unique sentiment labels in the training data. *train_data["label"]* is the column of the training dataset that contains the sentiment labels. *unique()* returns the distinct label values and *len()* counts them.
*   `freeze = False` freezing would mean that the weights of the pre-trained transformer are not updated during training. Setting *freeze* to False leaves them trainable, so the whole model is fine-tuned and can adapt to the sentiment analysis task.
*   `model = Transformer(transformer, output_dim, freeze)` creates an instance of the *Transformer* class. *transformer* is the pre-trained model loaded before. *output_dim* is the number of unique sentiment labels determined above. *freeze* is the boolean flag that determines whether the pre-trained transformer's parameters are frozen or fine-tuned during training.


In [16]:
output_dim = len(train_data["label"].unique())
freeze = False

model = Transformer(transformer, output_dim, freeze)

### PARAMETER COUNT
*   `def count_parameters(model)` defines a function that takes a PyTorch model as its argument.
*   `return sum(p.numel() for p in model.parameters() if p.requires_grad)` calculates the total number of trainable parameters in the model. *model.parameters()* returns an iterator over all parameters, i.e. the weights and biases of the model. *p.numel()* returns the number of elements (scalar values) in each parameter tensor p. *if p.requires_grad* filters out parameters that are not trainable. *sum()* adds up the element counts of all trainable parameter tensors.
*   `print(f"The model has {count_parameters(model):,} trainable parameters")` prints the total number of trainable parameters in the model. *{count_parameters(model):,}* uses f-string formatting to insert the result of *count_parameters(model)* into the output string. The *,* format specifier adds thousands separators, which makes large numbers more readable.
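
The count can be cross-checked by hand for the classification head, whose size follows from `hidden_size = 768` and two output classes (IMDb reviews are labelled negative or positive). The sketch below is only arithmetic on numbers reported in this notebook, including the total of 109,483,778 trainable parameters. The variable names are illustrative.

```python
hidden_dim = 768   # transformer.config.hidden_size, as reported in this notebook
output_dim = 2     # assumed: IMDb has two sentiment labels

# nn.Linear(hidden_dim, output_dim) holds a weight matrix and a bias vector.
head_params = hidden_dim * output_dim + output_dim

total_params = 109_483_778  # total reported for this model
encoder_params = total_params - head_params

print(f"head: {head_params:,}")
print(f"encoder: {encoder_params:,}")
```

With `freeze = True` only the head's parameters would remain trainable, so the count would drop from over a hundred million to a few thousand.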



In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 109,483,778 trainable parameters


### OPTIMIZER
*   `lr = 1e-5` sets the learning rate to 1e-5, i.e. 0.00001. The learning rate scales the size of the step the optimizer takes at each parameter update.
*   `optimizer = optim.Adamax(model.parameters(), lr=lr)` creates an instance of the Adamax optimizer (a variant of Adam based on the infinity norm) from PyTorch's *optim* module. *model.parameters()* is an iterator over the model's parameters. *lr = lr* sets the optimizer's learning rate to the value of the *lr* variable.
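
For intuition, here is a single Adamax step on one scalar parameter, written in plain Python. It follows the update rule from the Adam paper, where Adamax is introduced as the infinity-norm variant. The values of `beta1`, `beta2`, `eps`, the parameter and the gradient are assumptions for illustration, and PyTorch's implementation may differ in detail.

```python
lr, beta1, beta2, eps = 1e-5, 0.9, 0.999, 1e-8

theta = 0.5      # a single parameter (illustrative value)
grad = 0.25      # its gradient at this step (illustrative value)
exp_avg = 0.0    # running average of gradients
exp_inf = 0.0    # running infinity norm of gradients
step = 1

# Update the two running statistics with the new gradient.
exp_avg = beta1 * exp_avg + (1 - beta1) * grad
exp_inf = max(beta2 * exp_inf, abs(grad) + eps)
bias_correction = 1 - beta1 ** step

# Move the parameter against the gradient, scaled by the running norm.
new_theta = theta - (lr / bias_correction) * exp_avg / exp_inf
print(theta - new_theta)
```

On the first step the parameter moves by almost exactly `lr`, whatever the gradient's magnitude, which is why the learning rate directly controls the step size.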


In [18]:
lr = 1e-5

optimizer = optim.Adamax(model.parameters(), lr=lr) #changed

### LOSS FUNCTIONS
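*   `criterion = nn.CrossEntropyLoss()` *nn.CrossEntropyLoss()* is a PyTorch module that implements the cross-entropy loss, a measure of the difference between two probability distributions: here, the class probabilities predicted by the model and the true label. It expects raw, unnormalized scores (logits) and applies softmax internally.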
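
The quantity computed for a single example can be reproduced with the standard library. The logits and label below are illustrative values, not outputs of the model in this notebook.

```python
import math

logits = [2.0, 0.5]   # raw model outputs for one review (illustrative values)
label = 0             # index of the true class

# Softmax turns logits into probabilities.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]

# Cross-entropy is the negative log-probability of the true class.
loss = -math.log(probs[label])
print(probs, loss)
```

The loss approaches 0 as the probability assigned to the true class approaches 1, and grows without bound as that probability approaches 0. By default *nn.CrossEntropyLoss* averages this value over the examples in a batch.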


In [19]:
criterion = nn.CrossEntropyLoss()

### DEVICE
*   `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")` determines the device on which the PyTorch model will run. *torch.cuda.is_available()* checks whether a CUDA-enabled GPU is available on the system. If one is available the device is set to "cuda" (the GPU), otherwise the code runs on the CPU.
*   `model = model.to(device)` moves the model to the device stored in the *device* variable. The model's parameters then live on that device and its computations are performed there, which is much faster when the device is a GPU.
*   `criterion = criterion.to(device)` moves the loss function to the same device as the model, which keeps everything involved in the computation on one device.



In [20]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

model = model.to(device)
criterion = criterion.to(device)

### TRAINING FUNCTION
*   `def train(data_loader, model, criterion, optimizer, device)` *data_loader* is an instance of PyTorch's DataLoader class, responsible for loading and batching the training data. *model* is the PyTorch model that will be trained to perform sentiment analysis. *criterion* is the loss function used to calculate the loss during training. *optimizer* is the optimizer used to update the model's parameters. *device* specifies the device on which training is performed.
*   `model.train()` sets the model to training mode.
*   `epoch_losses = [] and epoch_accs = []` initialize empty lists that store the loss and accuracy of each batch in the current epoch.
*   `for batch in tqdm.tqdm(data_loader, desc = "training...")` iterates over the batches in *data_loader*. *tqdm* is a library that provides a progress bar for tracking training progress. The *desc* argument sets the description of the progress bar to *training...*
*   `ids = batch["ids"].to(device) and label = batch["label"].to(device)` extract the input IDs and labels from the current batch and move them to the specified device.
*   `prediction = model(ids)` passes the input IDs through the model to obtain the predicted output.
*   `loss = criterion(prediction, label)` calculates the loss between the predicted output and the true labels using the specified criterion.
*   `accuracy = get_accuracy(prediction, label)` calculates the accuracy of the model's predictions using the custom *get_accuracy* function.
*   `optimizer.zero_grad()` resets the gradients of the model's parameters to zero. PyTorch accumulates gradients across backward passes, so they must be cleared before each new batch.
*   `loss.backward()` computes the gradients of the loss with respect to the model's parameters using backpropagation.
*   `optimizer.step()` updates the model's parameters using the optimizer and the computed gradients.
*   `epoch_losses.append(loss.item()) and epoch_accs.append(accuracy.item())` append the current loss and accuracy values to the *epoch_losses* and *epoch_accs* lists, respectively.
*   `return np.mean(epoch_losses), np.mean(epoch_accs)` returns the average loss and accuracy for the current epoch.
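
The zero_grad, backward, step cycle described above can be watched on a toy problem. This sketch swaps in stand-ins so that it runs in a moment: random features instead of token IDs, a single linear layer instead of BERT, and plain SGD instead of Adamax. All sizes and the learning rate are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy stand-ins: random features instead of token IDs, a linear layer instead of BERT.
features = torch.randn(16, 4)
labels = torch.randint(0, 2, (16,))
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
losses = []
for _ in range(20):
    prediction = model(features)
    loss = criterion(prediction, labels)
    optimizer.zero_grad()   # clear gradients left over from the previous step
    loss.backward()         # compute fresh gradients
    optimizer.step()        # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss on this fixed batch goes down as the steps repeat, which is the same mechanism the full training function applies batch by batch.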





In [21]:
def train(data_loader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(data_loader, desc="training..."):
        ids = batch["ids"].to(device)
        label = batch["label"].to(device)
        prediction = model(ids)
        loss = criterion(prediction, label)
        accuracy = get_accuracy(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

### EVALUATE FUNCTION
*   `def evaluate(data_loader, model, criterion, device)` *data_loader* is a PyTorch DataLoader object that provides batches of data for evaluation. *model* is the trained sentiment analysis model. *criterion* is the loss function used to calculate the prediction loss. *device* is the CPU or GPU selected above, on which the evaluation runs.
*   `model.eval()` sets the model to evaluation mode, so that layers which behave differently during training, such as dropout, switch to their evaluation behaviour.
*   `epoch_losses = [] and epoch_accs = []` empty lists that store the loss and accuracy of each batch in the evaluation dataset.
*   `with torch.no_grad()` temporarily disables gradient tracking. During evaluation no parameters are updated, so gradients are not needed, and skipping them saves memory and computation.
*   `for batch in tqdm.tqdm(data_loader, desc = "evaluating...")` this loop iterates through the evaluation dataset one batch at a time. *tqdm* provides a progress bar for visualization.
*   `ids = batch["ids"].to(device) and label = batch["label"].to(device)` extract the input IDs and corresponding labels from the current batch and move them to the specified device.
*   `prediction = model(ids)` passes the input IDs through the model to get one score per sentiment class.
*   `loss = criterion(prediction, label)` calculates the loss for this batch by comparing the model's predictions to the true labels.
*   `accuracy = get_accuracy(prediction, label)` computes the accuracy of the predictions for this batch.
*   `epoch_losses.append(loss.item()) and epoch_accs.append(accuracy.item())` store the loss and accuracy values for this batch as regular Python numbers in their respective lists.
*   `return np.mean(epoch_losses), np.mean(epoch_accs)` after processing all batches, the function returns the average loss and accuracy across the entire evaluation dataset.
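
The two switches used during evaluation can be observed on a small module. This is a sketch with an arbitrary linear layer followed by dropout, not the model from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

layer.train()
train_out = layer(x)                          # dropout draws a random mask on each call

layer.eval()
eval_out_a, eval_out_b = layer(x), layer(x)   # dropout is switched off

with torch.no_grad():
    no_grad_out = layer(x)                    # no computation graph is recorded

print(torch.equal(eval_out_a, eval_out_b))
print(eval_out_a.requires_grad, no_grad_out.requires_grad)
```

`model.eval()` changes how layers behave, while `torch.no_grad()` changes what is recorded for backpropagation. The two are independent, which is why the evaluation function uses both.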




In [22]:
def evaluate(data_loader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(data_loader, desc="evaluating..."):
            ids = batch["ids"].to(device)
            label = batch["label"].to(device)
            prediction = model(ids)
            loss = criterion(prediction, label)
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return np.mean(epoch_losses), np.mean(epoch_accs)

### ACCURACY FUNCTION
*   `def get_accuracy(prediction, label)` *prediction* is the model's output: one score (logit) per sentiment class for each text. *label* holds the true sentiment labels for the corresponding texts, encoded as integers.
*   `batch_size, _ = prediction.shape` extracts *batch_size* from the shape of the *prediction* tensor. *_* discards the second element of the shape.
*   `predicted_classes = prediction.argmax(dim = -1)` *argmax(dim = -1)* finds the index of the highest score along the last dimension of the *prediction* tensor. These indices are the predicted sentiment classes.
*   `correct_predictions = predicted_classes.eq(label).sum()` compares the model's *predicted_classes* to the true labels. *.eq(label)* checks for element-wise equality between predicted classes and true labels and creates a tensor of True/False values. *.sum()* counts the True values, which gives the number of correct predictions in the batch.
*   `accuracy = correct_predictions / batch_size` calculates the accuracy by dividing *correct_predictions* by the total number of samples in the batch, giving the fraction of correct predictions.
*   `return accuracy` returns the calculated accuracy value.



In [23]:
def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy
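
As a quick check of the steps above, here is `get_accuracy` applied to a hand-made batch. The scores and labels are made-up values chosen so the result is easy to verify by eye.

```python
import torch


def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy


# Made-up scores for 4 examples and 2 classes (shape [4, 2]).
prediction = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.2], [1.0, 0.0]])
# Made-up true labels; the third example is misclassified on purpose.
label = torch.tensor([0, 1, 0, 0])

predicted_classes = prediction.argmax(dim=-1)
accuracy = get_accuracy(prediction, label)

print(predicted_classes.tolist())  # [0, 1, 1, 0]
print(accuracy.item())             # 0.75
```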

### EPOCH TRAINING
*   `n_epochs = 3` sets the number of epochs, i.e. the model makes 3 full passes over the training dataset.
*   `best_valid_loss = float("inf")` initializes *best_valid_loss*, which tracks the lowest validation loss seen so far. Starting at infinity ensures that the model from the first epoch is always considered the best so far.
*   `metrics = collections.defaultdict(list)` creates a dictionary in which every missing key starts as an empty list. It stores the training and validation metrics for each epoch.
*   `for epoch in range(n_epochs)` iterates for the specified number of epochs, training and evaluating the model in each iteration.
*   `train_loss, train_acc = train(train_data_loader, model, criterion, optimizer, device)` calls the *train* function to train the model for one epoch. It iterates through *train_data_loader*, passes each batch through *model* to get predictions, computes the loss using *criterion*, and updates the model's parameters using *optimizer* to minimize that loss. It returns the training loss and accuracy for the epoch as *train_loss* and *train_acc*. *device* is either the CPU or the GPU.
*   `valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)` calls the *evaluate* function from above to assess the model's performance on the validation set.
*   `metrics["train_losses"].append(train_loss)` (and likewise for *train_accs*, *valid_losses*, *valid_accs*) appends this epoch's losses and accuracies for both training and validation to the *metrics* dictionary.
*   `if valid_loss < best_valid_loss` checks whether the current validation loss is better (lower) than the best validation loss seen so far.
*   `best_valid_loss = valid_loss` if the current validation loss is better, updates *best_valid_loss* with it.
*   `torch.save(model.state_dict(), "transformer.pt")` saves the model's state dictionary to a file named *transformer.pt*, but only when the current validation loss is the best seen so far.
*   `print(f"epoch: {epoch}")`, `print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")` and `print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")` print the progress of the model for each epoch: the current epoch number, the training loss and accuracy, and the validation loss and accuracy.



In [24]:
n_epochs = 3
best_valid_loss = float("inf")

metrics = collections.defaultdict(list)

for epoch in range(n_epochs):
    train_loss, train_acc = train(
        train_data_loader, model, criterion, optimizer, device
    )
    valid_loss, valid_acc = evaluate(valid_data_loader, model, criterion, device)
    metrics["train_losses"].append(train_loss)
    metrics["train_accs"].append(train_acc)
    metrics["valid_losses"].append(valid_loss)
    metrics["valid_accs"].append(valid_acc)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "transformer.pt")
    print(f"epoch: {epoch}")
    print(f"train_loss: {train_loss:.3f}, train_acc: {train_acc:.3f}")
    print(f"valid_loss: {valid_loss:.3f}, valid_acc: {valid_acc:.3f}")

training...:   0%|          | 0/2344 [00:00<?, ?it/s]
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
training...: 100%|██████████| 2344/2344 [27:07<00:00,  1.44it/s]
evaluating...: 100%|██████████| 782/782 [03:01<00:00,  4.31it/s]


epoch: 0
train_loss: 0.282, train_acc: 0.879
valid_loss: 0.239, valid_acc: 0.906


training...: 100%|██████████| 2344/2344 [27:11<00:00,  1.44it/s]
evaluating...: 100%|██████████| 782/782 [03:01<00:00,  4.31it/s]


epoch: 1
train_loss: 0.187, train_acc: 0.930
valid_loss: 0.197, valid_acc: 0.927


training...: 100%|██████████| 2344/2344 [27:16<00:00,  1.43it/s]
evaluating...: 100%|██████████| 782/782 [03:00<00:00,  4.32it/s]

epoch: 2
train_loss: 0.152, train_acc: 0.944
valid_loss: 0.203, valid_acc: 0.926
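
In the log above, the validation loss is lowest at epoch 1 (0.197) and rises slightly at epoch 2 (0.203) while the training loss keeps falling. The sketch below replays the checkpoint rule on those reported losses to show which epoch ends up in *transformer.pt*. The name `saved_epoch` is hypothetical and stands in for the `torch.save` call.

```python
# Validation losses reported in the run above, indexed by epoch.
valid_losses = [0.239, 0.197, 0.203]

best_valid_loss = float("inf")
saved_epoch = None  # hypothetical: epoch whose weights are in the checkpoint

for epoch, valid_loss in enumerate(valid_losses):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epoch = epoch  # stands in for torch.save(model.state_dict(), ...)

print(saved_epoch, best_valid_loss)  # 1 0.197
```

So the file holds the epoch 1 weights, not the weights from the final epoch.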





### MODEL STATE LOAD
*   `model.load_state_dict(torch.load("transformer.pt"))` *torch.load("transformer.pt")* reads the saved state dictionary from the file *transformer.pt*. It contains the learned weights and biases. *model.load_state_dict(...)* copies them into the model. Because the training loop saved only when the validation loss improved, this restores the weights from the epoch with the lowest validation loss.
*   `test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)` evaluates the loaded model on the test dataset, using the *evaluate* function defined above with the same arguments as before except for the data loader.
*   `print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")` prints the test loss and accuracy obtained from the evaluation. The *f* prefix marks a Python f-string, which allows expressions to be embedded inside a string literal. *:.3f* formats each value with three decimal places.





In [25]:
model.load_state_dict(torch.load("transformer.pt"))

test_loss, test_acc = evaluate(test_data_loader, model, criterion, device)
print(f"test_loss: {test_loss:.3f}, test_acc: {test_acc:.3f}")

evaluating...: 100%|██████████| 3125/3125 [11:45<00:00,  4.43it/s]

test_loss: 0.180, test_acc: 0.932
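
A checkpoint saved on a GPU may later need to be loaded on a machine without one. `torch.load` accepts a `map_location` argument for that case. The sketch below shows the save and load round trip on a hypothetical `nn.Linear` stand-in written to a temporary file. It does not touch *transformer.pt*.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical small model standing in for the trained one.
toy_model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pt")
    torch.save(toy_model.state_dict(), path)

    # A fresh model with the same architecture but different random weights.
    restored = nn.Linear(4, 2)
    # map_location remaps the stored tensors to the given device on load.
    state_dict = torch.load(path, map_location="cpu")
    restored.load_state_dict(state_dict)

print(torch.equal(toy_model.weight, restored.weight))  # True
print(torch.equal(toy_model.bias, restored.bias))      # True
```

The state dictionary only stores tensors, so the model object must be built with the same architecture before `load_state_dict` is called.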





### SENTIMENT PREDICT FUNCTION
*   `def predict_sentiment(text, model, tokenizer, device)` *text* is the input text for sentiment prediction. *model* is the trained model. *tokenizer* is the tokenizer object that converts the input text into numerical token IDs. *device* is the CPU or GPU on which the prediction is performed.
*   `ids = tokenizer(text)["input_ids"]` the *tokenizer(text)* call returns a dictionary-like object containing several outputs, and *["input_ids"]* retrieves the list of token IDs corresponding to the input text.
*   `tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)` converts the token IDs into a PyTorch tensor. *torch.LongTensor(ids)* creates a tensor of type *long* from the token IDs. *.unsqueeze(dim=0)* adds a batch dimension, changing the shape from *[sequence_length]* to *[1, sequence_length]*, because the model expects batched input. *.to(device)* moves the tensor to the specified device for computation.
*   `prediction = model(tensor).squeeze(dim=0)` feeds the input tensor to the sentiment analysis model to obtain the predicted sentiment scores. *model(tensor)* performs a forward pass of the input tensor through the model, generating the output tensor. *.squeeze(dim=0)* removes the batch dimension added earlier, changing the output shape from *[1, num_classes]* to *[num_classes]*.
*   `probability = torch.softmax(prediction, dim=-1)` applies the softmax function along the last dimension of the prediction tensor, converting the scores into probabilities that sum to 1.
*   `predicted_class = prediction.argmax(dim=-1).item()` determines the predicted sentiment class by finding the index of the highest score. *prediction.argmax(dim=-1)* returns the index of the maximum value along the last dimension of the prediction tensor. *.item()* converts the tensor value to a plain Python scalar.
*   `predicted_probability = probability[predicted_class].item()` retrieves the probability of the predicted sentiment class. *probability[predicted_class]* selects the probability at the predicted class index. *.item()* converts the tensor value to a plain Python scalar.
*   `return predicted_class, predicted_probability` returns the predicted sentiment class and its probability as the output of the *predict_sentiment* function.




In [26]:
def predict_sentiment(text, model, tokenizer, device):
    ids = tokenizer(text)["input_ids"]
    tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)
    prediction = model(tensor).squeeze(dim=0)
    probability = torch.softmax(prediction, dim=-1)
    predicted_class = prediction.argmax(dim=-1).item()
    predicted_probability = probability[predicted_class].item()
    return predicted_class, predicted_probability
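
The shape changes and the softmax step can be traced without the real model. In this sketch the token IDs and the model output are made-up values, used only to show the tensor shapes and that softmax does not change which class has the highest score.

```python
import torch

# Hypothetical token IDs for one short text.
ids = [101, 2023, 2143, 102]
tensor = torch.LongTensor(ids).unsqueeze(dim=0)
print(tensor.shape)  # torch.Size([1, 4])

# Hypothetical model output for that single text: shape [1, num_classes].
output = torch.tensor([[-1.2, 2.3]])
prediction = output.squeeze(dim=0)
print(prediction.shape)  # torch.Size([2])

probability = torch.softmax(prediction, dim=-1)
predicted_class = prediction.argmax(dim=-1).item()
predicted_probability = probability[predicted_class].item()

# Softmax is monotonic, so the top score and the top probability
# are at the same index.
print(predicted_class == probability.argmax(dim=-1).item())  # True
print(predicted_class)  # 1
```

This is why the function can take `argmax` of the raw scores and only use softmax to report a probability.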

### TEXT SENTIMENT PREDICTION

In [27]:
text = "This film is terrible!"

predict_sentiment(text, model, tokenizer, device)

(0, 0.966673731803894)

In [28]:
text = "This film is great!"

predict_sentiment(text, model, tokenizer, device)

(1, 0.9537339806556702)

In [29]:
text = "This film is not terrible, it's great!"

predict_sentiment(text, model, tokenizer, device)

(1, 0.9763466119766235)

In [30]:
text = "This film is not great, it's terrible!"

predict_sentiment(text, model, tokenizer, device)

(0, 0.9813977479934692)
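
The returned tuples are easier to read with class names attached. The mapping below (0 is negative, 1 is positive) is an assumption inferred from the four outputs above, and `describe_prediction` is a hypothetical helper, not part of the notebook.

```python
# Assumed mapping, inferred from the outputs above.
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe_prediction(predicted_class, predicted_probability):
    """Format a (class, probability) pair as a readable string."""
    name = LABEL_NAMES[predicted_class]
    return f"{name} ({predicted_probability:.1%})"


# Values copied from the first two outputs above.
print(describe_prediction(0, 0.966673731803894))   # negative (96.7%)
print(describe_prediction(1, 0.9537339806556702))  # positive (95.4%)
```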