# Building a Sentiment Predictor for IMDB Reviews

In this project, we fine-tune an LSTM-based model for sentiment analysis on the IMDB movie review dataset. The dataset consists of 50,000 movie reviews, each labeled as either positive or negative, which makes it a natural binary classification task. We hold out a separate test set to evaluate the model after training, and split the remaining data 80/20 into training and validation sets.

A key aspect of our approach is the use of pre-trained GloVe embeddings. GloVe embeddings are dense word vectors trained on large corpora, and they capture semantic relationships between words. By starting from these embeddings, the model begins with useful information about word meanings, which can reduce the amount of training data needed and improve generalization.

During fine-tuning, we adapt the model to the specific task of sentiment classification. The LSTM processes each review as a sequence of tokens, and we take its output at the final time step, which summarizes the contextual information from the entire review. This vector is passed through a fully connected layer and a sigmoid activation function to produce a probability, which represents the model's confidence that the review has a positive sentiment. We then apply a threshold (0.5 by default) to classify the review as positive or negative.
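
As a minimal sketch of that last step, the snippet below maps a raw score to a probability with a sigmoid and applies the 0.5 threshold. The function names and the example scores are illustrative assumptions and are not part of the project code.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(logit: float, threshold: float = 0.5):
    """Return the predicted label and the positive-class probability."""
    prob = sigmoid(logit)
    label = "positive" if prob >= threshold else "negative"
    return label, prob

# Hypothetical scores, as if produced by the fully connected layer
for score in (2.0, 0.0, -1.5):
    label, prob = classify(score)
    print(f"score={score:+.1f} -> p(positive)={prob:.3f} -> {label}")
```

Raising the threshold above 0.5 would make the classifier more conservative about predicting a positive review.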

The final product is a fine-tuned LSTM model capable of analyzing the sentiment of new, unseen movie reviews, classifying them as either positive or negative. This model could be further deployed for tasks such as customer feedback analysis or opinion mining in different domains, where understanding sentiment is critical.

### 1. **Importing Libraries**
- **`numpy as np`**: Used for numerical operations like creating arrays and performing mathematical functions.
- **`torch`**: Core PyTorch library for building and training deep learning models.
- **`torch.nn as nn`**: Provides layers, loss functions, and utilities for neural networks.
- **`torch.optim as optim`**: Contains optimization algorithms like `SGD` and `Adam` for model weight updates.
- **`torch.utils.data.DataLoader` and `Dataset`**: Manage batching, shuffling, and parallel data loading.
- **`datasets.load_dataset`**: From Hugging Face, used to load datasets (e.g., IMDB, MNIST).
- **`os`**: Interacts with the operating system (file and directory management).
- **`pickle`**: Serializes and deserializes objects, useful for saving/loading models or data.
- **`sklearn.metrics.accuracy_score`**: Evaluates model accuracy based on predictions.
- **`tqdm`**: Displays progress bars for loops, useful for tracking training or data processing tasks.
- **`torch.utils.tensorboard.SummaryWriter`**: Logs data to TensorBoard, helping to visualize metrics like loss and accuracy during training.

### 2. **Setting the Device**
- The code checks whether a **GPU** (CUDA) is available. If so, it sets the device to **GPU**; otherwise, it defaults to **CPU**.
- **`cuda`**: Refers to NVIDIA GPUs, which can accelerate model training.
- **`cpu`**: The fallback device when no GPU is available.

### 3. **Printing the Device in Use**
- Verifies and prints which device (GPU or CPU) is being used, ensuring the model runs on the correct hardware.

### Summary
- The code is setting up for deep learning model execution using PyTorch.
- It loads essential libraries for building, training, and evaluating the model.
- It ensures that computations run on the best available hardware, be it a GPU or CPU.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
import os
import pickle
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from torch.utils.tensorboard import SummaryWriter

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

### 1. **Function Definition**
- **`initialize_tokenizer`**: This function creates or loads a tokenizer, which maps words to indices and vice versa. Tokenizers are essential in NLP for converting text data into numerical format that can be fed into machine learning models.

### 2. **Parameters**
- **`tokenizer_type='word'`**: Defines the type of tokenizer (e.g., word-level). Though this parameter is not used in the current function, it allows future extension (for tokenizing by character, subword, etc.).
- **`text_data=None`**: A list of texts from which the tokenizer is built. It is only needed when no saved tokenizer file exists; if it is `None` in that case, the function cannot build a vocabulary and raises an error.
- **`min_freq=1`**: Filters out words that appear less frequently than the specified minimum frequency (`min_freq`). This is useful to remove rare words that may not provide much value in modeling.
- **`max_vocab_size=None`**: Limits the tokenizer to a maximum vocabulary size. If this is set, only the `max_vocab_size` most frequent words will be kept.
- **`tokenizer_file='tokenizer.pkl'`**: Specifies the filename where the tokenizer will be saved or loaded from. If this file exists, the function loads the tokenizer from this file.

### 3. **Loading Existing Tokenizer**
- **`os.path.exists(tokenizer_file)`**: Checks if a tokenizer file already exists. If it does, it loads the tokenizer from the file using `pickle`.
  - **`with open(tokenizer_file, 'rb') as f:`**: Opens the tokenizer file in binary read mode and loads it.
  - **`tokenizer['word2idx']`**: A dictionary that maps words to their indices (e.g., `{'cat': 1}`).
  - **`vocab_size`**: The size of the vocabulary (i.e., the number of unique words in the tokenizer).

### 4. **Building a New Tokenizer**
- If the tokenizer file does not exist, the function builds a new tokenizer from the `text_data`:
  - **`tokenizer = {'word2idx': {'<UNK>': 0}, 'idx2word': {0: '<UNK>'}}`**: Initializes the tokenizer with a special token `'<UNK>'` for unknown words. It creates two dictionaries:
    - `word2idx`: Maps words to indices.
    - `idx2word`: Maps indices back to words.
  - **`word_counts = {}`**: Initializes a dictionary to store word frequencies.

### 5. **Counting Word Frequencies**
- The function iterates over `text_data`, splits each text into words, and counts the frequency of each word. This helps identify the most common words to be added to the tokenizer:
  - **`word_counts[word] = word_counts.get(word, 0) + 1`**: Increments the word count for each word in the text.

### 6. **Filtering Words by Frequency**
- **`if count >= min_freq`**: Filters out words that occur less than the `min_freq`.
  
### 7. **Sorting and Limiting Vocabulary**
- **`sorted(words, key=lambda x: x[1], reverse=True)`**: Sorts the words by their frequency in descending order so that the most frequent words are added to the tokenizer first.
- **`if max_vocab_size is not None`**: If `max_vocab_size` is specified, the vocabulary is limited to that many words, accounting for the `'<UNK>'` token.

### 8. **Adding Words to the Tokenizer**
- The loop **`for word, _ in words:`** adds each word and its corresponding index to the `word2idx` and `idx2word` dictionaries:
  - **`tokenizer['word2idx'][word] = idx`**: Assigns an index to each word.
  - **`tokenizer['idx2word'][idx] = word`**: Adds the reverse mapping of index to word.

### 9. **Saving the Tokenizer**
- **`with open(tokenizer_file, 'wb') as f:`**: Saves the tokenizer using `pickle` to a binary file for future use. This avoids rebuilding the tokenizer every time the program runs.

### 10. **Returning the Tokenizer and Vocabulary Size**
- The function returns the `tokenizer` dictionary and the size of the vocabulary.

### Summary
- This function efficiently builds or loads a tokenizer that converts words into indices. It allows setting a minimum word frequency and a maximum vocabulary size to filter out rare words and limit the vocabulary size. The tokenizer can be saved and loaded from a file to avoid unnecessary recomputation.
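
The frequency filter and the vocabulary cap are easiest to see on a toy corpus. The sketch below reproduces only that part of the logic (no file caching), using `collections.Counter` instead of a hand-written counting loop; `build_vocab` and the three example sentences are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(texts, min_freq=1, max_vocab_size=None):
    """Build a word-to-index mapping with index 0 reserved for <UNK>."""
    counts = Counter(word for text in texts for word in text.strip().split())
    # most_common() returns words sorted by descending frequency
    kept = [word for word, count in counts.most_common() if count >= min_freq]
    if max_vocab_size is not None:
        kept = kept[:max_vocab_size - 1]  # leave one slot for <UNK>
    word2idx = {"<UNK>": 0}
    for idx, word in enumerate(kept, start=1):
        word2idx[word] = idx
    return word2idx

corpus = ["good movie", "good plot", "bad movie good"]  # toy data

print(build_vocab(corpus, min_freq=2))                    # {'<UNK>': 0, 'good': 1, 'movie': 2}
print(build_vocab(corpus, min_freq=1, max_vocab_size=2))  # {'<UNK>': 0, 'good': 1}
```

Here "good" appears three times and "movie" twice, so `min_freq=2` drops "plot" and "bad", while `max_vocab_size=2` keeps only the single most frequent word next to `<UNK>`.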

In [None]:
# Function to initialize tokenizer with a max vocabulary size
def initialize_tokenizer(tokenizer_type='word', text_data=None, min_freq=1, max_vocab_size=None, tokenizer_file='tokenizer.pkl'):
    if os.path.exists(tokenizer_file):
        print("Loading tokenizer from file...")
        with open(tokenizer_file, 'rb') as f:
            tokenizer = pickle.load(f)
        vocab_size = len(tokenizer['word2idx'])
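    else:
        # A new vocabulary can only be built from input text, so fail early if none is given
        if text_data is None:
            raise ValueError("text_data is required when no tokenizer file exists")
        print("Building tokenizer...")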
        tokenizer = {'word2idx': {'<UNK>': 0}, 'idx2word': {0: '<UNK>'}}
        idx = 1
        word_counts = {}

        # Count the frequency of each word
        for text in text_data:
            words = text.strip().split()
            for word in words:
                word_counts[word] = word_counts.get(word, 0) + 1
        
        # Filter words by minimum frequency
        words = [(word, count) for word, count in word_counts.items() if count >= min_freq]

        # Sort words by frequency in descending order and limit to max_vocab_size
        words = sorted(words, key=lambda x: x[1], reverse=True)
        if max_vocab_size is not None:
            words = words[:max_vocab_size - 1]  # Subtract 1 to account for <UNK>
        
        # Add words to the tokenizer
        for word, _ in words:
            tokenizer['word2idx'][word] = idx
            tokenizer['idx2word'][idx] = word
            idx += 1

        vocab_size = idx

        # Save the tokenizer to a file
        with open(tokenizer_file, 'wb') as f:
            pickle.dump(tokenizer, f)
    
    return tokenizer, vocab_size

### 1. **Function Definition**
- **`load_glove_embeddings`**: This function loads GloVe word embeddings and maps them to the vocabulary created earlier using `word2idx`. GloVe embeddings are pre-trained word vectors that capture semantic relationships between words.

### 2. **Parameters**
- **`glove_path`**: Path to the GloVe embeddings file (e.g., a `.txt` file containing word vectors).
- **`word2idx`**: A dictionary mapping words to their corresponding indices in the tokenizer.
- **`embedding_dim`**: The dimensionality of the word vectors (e.g., 50, 100, 300), which defines the size of the embedding space.
- **`embeddings_file='embeddings.pt'`**: Specifies the filename where the embeddings will be saved or loaded from, using PyTorch's format (`.pt`).

### 3. **Loading Existing Embeddings**
- **`if os.path.exists(embeddings_file):`**: Checks if the embeddings file already exists. If so, it loads the pre-saved embeddings from the file to save time.
  - **`torch.load(embeddings_file)`**: Loads the saved embeddings in a PyTorch tensor format from the specified file. This avoids having to reprocess the GloVe embeddings every time.

### 4. **Generating New Embeddings**
- If the embeddings file does not exist, the function generates new embeddings:
  - **`embeddings = np.random.normal(scale=0.6, size=(vocab_size, embedding_dim))`**: Initializes an embedding matrix where each word in the vocabulary is represented by a random vector. The random vectors follow a normal distribution with a standard deviation of 0.6.
  - **`with open(glove_path, 'r', encoding='utf-8') as f:`**: Opens the GloVe file for reading. This file contains words and their corresponding pre-trained embeddings.

### 5. **Processing Each Line of GloVe Embeddings**
- **`for line in f:`**: Iterates over each line in the GloVe file. Each line represents a word followed by its embedding vector.
  - **`values = line.split()`**: Splits the line into individual components. The first component is the word, and the rest are the values of its embedding vector.
  - **`word = values[0]`**: The word itself.
  - **`vector = np.asarray(values[1:], dtype='float32')`**: Converts the rest of the line (the embedding vector values) into a NumPy array.
  - **`if word in word2idx:`**: Checks if the word from the GloVe file is in the `word2idx` dictionary (i.e., if it's part of the vocabulary). If it is, the corresponding vector is assigned to the appropriate index in the embedding matrix:
    - **`embeddings[word2idx[word]] = vector`**: Assigns the GloVe embedding vector to the correct position in the embeddings matrix.

### 6. **Converting to PyTorch Tensor**
- **`torch.tensor(embeddings, dtype=torch.float)`**: Converts the NumPy embedding matrix into a PyTorch tensor, which is required for deep learning models in PyTorch.
- **`torch.save(embeddings, embeddings_file)`**: Saves the embeddings to a file for future use, avoiding reprocessing.

### 7. **Returning Embeddings**
- The function returns the final embeddings matrix in the form of a PyTorch tensor.

### Summary
- This function either loads a previously saved embedding matrix from `embeddings_file` or, if that file does not exist, builds the matrix from the GloVe file.
- If generating new embeddings, it initializes a random embedding matrix, then replaces the random vectors with the pre-trained GloVe embeddings for any words in the vocabulary (`word2idx`).
- The embeddings are saved in a PyTorch tensor format, making them easy to load and use in subsequent training tasks.
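
The line-by-line parsing can be tried without downloading GloVe. The sketch below is a simplified stand-in: it uses plain Python lists instead of NumPy, reads from an in-memory string with made-up numbers, and additionally counts how many vocabulary words were found, which the original function does not do. The name `load_vectors` is an assumption.

```python
import io
import random

def load_vectors(lines, word2idx, embedding_dim, seed=0):
    """Fill an embedding table from GloVe-style lines: 'word v1 v2 ... vN'."""
    rng = random.Random(seed)
    # Start from random vectors so words missing from the file still get a row
    table = [[rng.gauss(0.0, 0.6) for _ in range(embedding_dim)]
             for _ in range(len(word2idx))]
    found = 0
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        word, vector = values[0], [float(v) for v in values[1:]]
        if word in word2idx and len(vector) == embedding_dim:
            table[word2idx[word]] = vector
            found += 1
    return table, found

# Tiny in-memory stand-in for a GloVe text file (made-up numbers)
fake_glove = io.StringIO("good 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\nrare 0.7 0.8 0.9\n")
word2idx = {"<UNK>": 0, "good": 1, "movie": 2, "plot": 3}

table, found = load_vectors(fake_glove, word2idx, embedding_dim=3)
print(f"{found} of {len(word2idx)} vocabulary words found")
print(table[word2idx["good"]])  # [0.1, 0.2, 0.3]
```

Rows for `<UNK>` and "plot" keep their random initialization because neither appears in the file, which mirrors what happens to out-of-GloVe words in the real function.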

In [None]:
# Load GloVe embeddings (from previous script)
def load_glove_embeddings(glove_path, word2idx, embedding_dim, embeddings_file='embeddings.pt'):
    vocab_size = len(word2idx)
    if os.path.exists(embeddings_file):
        print("Loading embeddings from file...")
        embeddings = torch.load(embeddings_file)
    else:
        print("Generating embeddings...")
        embeddings = np.random.normal(scale=0.6, size=(vocab_size, embedding_dim))
        with open(glove_path, 'r', encoding='utf-8') as f:
            for line in f:
                values = line.split()
                word = values[0]
                vector = np.asarray(values[1:], dtype='float32')
                if word in word2idx:
                    embeddings[word2idx[word]] = vector
        embeddings = torch.tensor(embeddings, dtype=torch.float)
        torch.save(embeddings, embeddings_file)
    return embeddings

### 1. **Class Definition**
- **`SentimentDataset`**: This class is a custom PyTorch dataset tailored for sentiment analysis tasks. It inherits from `torch.utils.data.Dataset`, allowing PyTorch to manage it like any other dataset (for batching, shuffling, etc.).

### 2. **`__init__` Method**
- **`texts`**: A list of text data (e.g., movie reviews) that will be used for sentiment analysis.
- **`labels`**: A list of labels corresponding to the texts, indicating sentiment (e.g., positive or negative).
- **`tokenizer`**: A dictionary containing mappings from words to indices (`word2idx`) created in earlier steps. It is used to convert the text data into a sequence of indices.
- **`sequence_length`**: The maximum number of tokens per text. Texts longer than this will be truncated, and shorter ones will be padded.
- **`self.word2idx`**: Extracts the `word2idx` dictionary from the tokenizer and stores it for easier access.

### 3. **`__len__` Method**
- **`return len(self.texts)`**: Returns the number of texts in the dataset. This helps PyTorch understand how many samples are in the dataset, which is necessary for batching and iteration.

### 4. **`__getitem__` Method**
This method retrieves a specific data sample based on the provided index (`idx`). It's essential for PyTorch's data loading process.

- **`text = self.texts[idx]`**: Retrieves the text at the given index `idx`.
- **`label = self.labels[idx]`**: Retrieves the corresponding label for that text (e.g., 0 for negative, 1 for positive sentiment).

### 5. **Tokenizing the Text**
- **`tokens = [self.word2idx.get(word, 0) for word in text.strip().split()]`**: This line splits the text into individual words and converts each word into its corresponding index using the `word2idx` dictionary. If a word is not found, the default index `0` is used, which is the `<UNK>` (unknown) token in this tokenizer.
  
### 6. **Handling Sequence Length**
- **`tokens = tokens[:self.sequence_length]`**: Truncates the list of tokens to at most `sequence_length` entries, keeping the beginning of the text.
- **Padding**: 
  - **`if len(tokens) < self.sequence_length:`**: If the number of tokens is less than the `sequence_length`, the sequence is padded with index `0` at the beginning. Because index `0` is also `<UNK>`, the model cannot tell padding apart from unknown words. Padding on the left keeps the real tokens at the end of the sequence, which matters because the model reads only the last time step.
  - **`tokens = [0] * (self.sequence_length - len(tokens)) + tokens`**: Adds the necessary number of padding tokens (zeros) to the front of the sequence.

### 7. **Returning Tensors**
- **`torch.tensor(tokens, dtype=torch.long)`**: Converts the tokenized and padded sequence into a PyTorch tensor of type `long`, which is typically used for categorical data such as word indices.
- **`torch.tensor(label, dtype=torch.float)`**: Converts the label into a PyTorch tensor of type `float`, which is useful for binary classification tasks (such as sentiment analysis).

### Summary
- **`SentimentDataset`** is a custom PyTorch dataset class for sentiment analysis. It processes a list of texts by tokenizing and padding them, then returns them as tensors. The labels are also returned as tensors, which are useful for training a model in PyTorch.
- The class ensures that all sequences are of uniform length, either by truncating or padding them, which is essential for batch processing in deep learning models.
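
A short worked example shows what truncation and left-padding produce. This is a standalone sketch of the same steps outside the `Dataset` class; the helper name `encode` and the toy vocabulary are assumptions.

```python
def encode(text, word2idx, sequence_length, pad_idx=0):
    """Convert text to a fixed-length list of indices (truncate, then left-pad)."""
    tokens = [word2idx.get(word, 0) for word in text.strip().split()]
    tokens = tokens[:sequence_length]
    padding = [pad_idx] * (sequence_length - len(tokens))
    return padding + tokens

word2idx = {"<UNK>": 0, "good": 1, "movie": 2}  # toy vocabulary

print(encode("good movie", word2idx, 4))                   # [0, 0, 1, 2]
print(encode("a good movie with good plot", word2idx, 4))  # [0, 1, 2, 0]
```

In the second call the leading `0` is an unknown word ("a"), while in the first call the leading zeros are padding; the two are indistinguishable once encoded.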

In [None]:
# Define the dataset class for sentiment analysis
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, sequence_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.sequence_length = sequence_length
        self.word2idx = tokenizer['word2idx']

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenize the text
        tokens = [self.word2idx.get(word, 0) for word in text.strip().split()]
        tokens = tokens[:self.sequence_length]
        if len(tokens) < self.sequence_length:
            tokens = [0] * (self.sequence_length - len(tokens)) + tokens

        return torch.tensor(tokens, dtype=torch.long), torch.tensor(label, dtype=torch.float)

### 1. **Class Definition**
- **`LSTMSentimentModel`**: This class defines an LSTM-based neural network model for sentiment analysis. It inherits from `nn.Module`, the base class for all neural networks in PyTorch.

### 2. **`__init__` Method**
This method initializes the layers and components of the model.

- **`input_size`**: The size of the input vocabulary (i.e., the number of unique tokens in the tokenizer). This is the number of words that can be represented in the embedding layer.
- **`embedding_dim`**: The size of each word embedding vector (e.g., 100 or 300 dimensions).
- **`hidden_size`**: The number of units in the hidden layers of the LSTM. This controls the capacity of the model to learn from sequential data.
- **`output_size`**: The number of output units. Typically, for binary sentiment analysis, this is `1` (representing the probability of positive/negative sentiment).
- **`num_layers=2`**: The number of stacked LSTM layers. Stacking multiple LSTM layers allows the model to learn more complex patterns.
- **`dropout=0.3`**: Dropout rate, used to prevent overfitting by randomly dropping some neurons during training.
- **`embedding_weights=None`**: If pre-trained word embeddings (e.g., GloVe) are provided, they are used to initialize the embedding layer.
- **`bidirectional=False`**: If `True`, the LSTM will be bidirectional, meaning it will process the input sequence in both forward and backward directions.

#### Embedding Layer
- **`self.embedding = nn.Embedding(input_size, embedding_dim)`**: This layer maps the input words (represented by their indices) to dense vectors of size `embedding_dim`.
- **`if embedding_weights is not None:`**: If pre-trained embeddings are provided, they are loaded into the embedding layer.
  - **`self.embedding.weight.data.copy_(embedding_weights)`**: Copies the pre-trained embedding weights (e.g., GloVe) into the embedding layer.

#### LSTM Layer
- **`self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=bidirectional)`**:
  - **`embedding_dim`**: The input to the LSTM is the embedding of each word.
  - **`hidden_size`**: The LSTM's hidden state size, controlling the model's memory capacity.
  - **`num_layers=num_layers`**: Specifies the number of LSTM layers stacked.
  - **`batch_first=True`**: Ensures that the batch dimension is the first dimension in the input (`batch_size, sequence_length, embedding_dim`).
  - **`bidirectional=bidirectional`**: If `True`, the LSTM will process the sequence in both forward and backward directions.

#### Dropout Layer
- **`self.dropout = nn.Dropout(dropout)`**: Adds dropout regularization to prevent overfitting by randomly dropping some units during training.

#### Fully Connected Layer
- **`self.fc = nn.Linear(hidden_size, output_size)`**: A fully connected layer that maps the final hidden state of the LSTM to the output (the predicted sentiment). If the LSTM is bidirectional, `hidden_size` is doubled.

### 3. **`forward` Method**
This method defines the forward pass of the model, where input data passes through the layers.

- **Embedding Layer**: 
  - **`x = self.embedding(x)`**: Converts the input word indices into dense vectors (embeddings).
  
- **LSTM Layer**:
  - **`x, _ = self.lstm(x)`**: Processes the embeddings through the LSTM. The LSTM outputs a sequence of hidden states (one for each time step in the input sequence).
  
- **Dropout**:
  - **`x = self.dropout(x)`**: Applies dropout to the LSTM outputs to regularize the model.
  
- **Fully Connected Layer**:
  - **`x = self.fc(x[:, -1, :])`**: Only the last time step of the LSTM's output is passed to the fully connected layer. This represents the final hidden state, which is used for sentiment prediction.

- **Sigmoid Activation**:
  - **`torch.sigmoid(x)`**: The output of the fully connected layer is passed through a sigmoid activation function to map the output to a probability between 0 and 1, useful for binary classification.
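  - **`.squeeze()`**: Removes size-1 dimensions so that the output has shape `(batch_size,)`, matching the labels. Called without an argument, it also drops the batch dimension when a batch contains a single review; `.squeeze(-1)` removes only the trailing output dimension.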

### Summary
- The model consists of an embedding layer (which can use pre-trained embeddings), followed by a stacked LSTM for sequence processing, and a dropout layer for regularization. A fully connected layer generates the final prediction, and a sigmoid activation is applied to produce the output as a probability of positive/negative sentiment.
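- If bidirectional LSTM is enabled, it processes the sequence in both forward and backward directions, which can improve the model's ability to use context. Note that at the last time step the backward direction has seen only the final token, so `x[:, -1, :]` carries full-sequence context only in its forward half.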
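
To make the tensor shapes in the forward pass concrete, here is a small sketch with arbitrary toy sizes. It rebuilds the layers inline rather than using `LSTMSentimentModel`, omits dropout, and assumes a unidirectional LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 4, 7, 20  # arbitrary toy sizes
embedding_dim, hidden_size = 8, 16

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=2, batch_first=True)
fc = nn.Linear(hidden_size, 1)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(tokens)          # (batch, seq_len, embedding_dim)
outputs, (h_n, _) = lstm(embedded)    # outputs: (batch, seq_len, hidden_size)
last_step = outputs[:, -1, :]         # (batch, hidden_size)
probs = torch.sigmoid(fc(last_step)).squeeze(-1)  # (batch,)

print(tuple(embedded.shape), tuple(outputs.shape), tuple(probs.shape))

# For a unidirectional LSTM, the last output step is the top layer's final hidden state
print(torch.allclose(last_step, h_n[-1]))

# With a batch of one, squeeze() drops the batch dimension; squeeze(-1) keeps it
single = torch.sigmoid(fc(last_step[:1]))
print(tuple(single.squeeze().shape), tuple(single.squeeze(-1).shape))
```

The last line illustrates why the argument to `squeeze` matters: a scalar output no longer lines up with a label tensor of shape `(1,)`.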

In [None]:
# Define the LSTM model for sentiment analysis
class LSTMSentimentModel(nn.Module):
    def __init__(self, input_size, embedding_dim, hidden_size, output_size, num_layers=2, dropout=0.3, embedding_weights=None, bidirectional=False):
        super(LSTMSentimentModel, self).__init__()
        self.embedding = nn.Embedding(input_size, embedding_dim)
        if embedding_weights is not None:
            self.embedding.weight.data.copy_(embedding_weights)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout)
        if bidirectional:
            hidden_size *= 2
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        x = self.dropout(x)
        x = self.fc(x[:, -1, :])  # Use only the last output of the LSTM
        return torch.sigmoid(x).squeeze(-1)  # Drop only the output dim so a batch of one keeps shape (1,)

### 1. **Function Definition**
- **`save_checkpoint`**: This function saves the model's state (a checkpoint) to a file during training, typically after every epoch. If the current model is the best so far (based on some evaluation metric), it also updates the "best" model checkpoint.

### 2. **Parameters**
- **`state`**: A dictionary containing the current state of the model and optimizer, typically including model weights, optimizer states, epoch number, and any other relevant information you wish to save.
- **`is_best`**: A boolean flag indicating whether the current model is the best model so far (based on validation performance, for example). If `True`, the function saves an additional checkpoint labeled as the "best" model.
- **`checkpoint_dir="checkpoints"`**: The directory where all model checkpoints will be stored. If this directory does not exist, the function creates it.
- **`best_model_name="sentiment_model_best.pt"`**: The filename used to save the best model checkpoint.

### 3. **Creating the Checkpoint Directory**
- **`if not os.path.exists(checkpoint_dir):`**: Checks if the directory for storing checkpoints exists. If not:
  - **`os.makedirs(checkpoint_dir)`**: Creates the directory to store checkpoints.

### 4. **Saving the Epoch Checkpoint**
- **`epoch_checkpoint = os.path.join(checkpoint_dir, f'sentiment_model_epoch_{state["epoch"]}.pt')`**: Creates a filename for saving the checkpoint. The filename includes the current epoch number, allowing the model state to be saved after each epoch.
- **`torch.save(state, epoch_checkpoint)`**: Saves the `state` dictionary to the specified file. This is the primary function for saving model checkpoints in PyTorch.

### 5. **Updating the Best Model Checkpoint**
- **`if is_best:`**: If the current model is the best so far (i.e., if `is_best` is `True`), the function saves it as the "best" model.
  - **`best_checkpoint = os.path.join(checkpoint_dir, best_model_name)`**: Creates the path for the best model checkpoint.
  - **`torch.save(state, best_checkpoint)`**: Saves the current state as the best model checkpoint.
  - This ensures that the best model is updated and can be retrieved later even if subsequent models (from later epochs) perform worse.

### 6. **Printing Checkpoint Status**
- **`print(f"Checkpoint saved: {epoch_checkpoint}")`**: Notifies that a checkpoint for the current epoch has been saved.
- **`print(f"Best checkpoint updated: {best_checkpoint}")`**: Notifies that the best model checkpoint has been updated.

### Summary
- This function saves the model's state after each epoch to track the model's progress over time.
- If the model performs better than any previous versions (based on some external evaluation), it also updates a separate file to store the "best" model.
- The function ensures the checkpoints are organized and easy to retrieve for later use or to resume training.
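
The `is_best` flag has to be computed by the caller. One common way, sketched below under the assumption that lower validation loss means a better model, is to track the lowest loss seen so far; the helper name and the loss values are made up for illustration.

```python
def mark_best_epochs(val_losses):
    """Return (epoch, is_best) pairs, where is_best means a new lowest loss."""
    best_loss = float("inf")
    results = []
    for epoch, loss in enumerate(val_losses, start=1):
        is_best = loss < best_loss
        if is_best:
            best_loss = loss
        results.append((epoch, is_best))
    return results

# Made-up validation losses for five epochs
losses = [0.69, 0.52, 0.55, 0.48, 0.50]
for epoch, is_best in mark_best_epochs(losses):
    print(f"epoch {epoch}: is_best={is_best}")
```

In a training loop, the `is_best` value for each epoch would be passed to `save_checkpoint` together with the state dictionary, so that only epochs 1, 2, and 4 in this example overwrite the best-model file.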

In [None]:
# Save a model checkpoint
def save_checkpoint(state, is_best, checkpoint_dir="checkpoints", best_model_name="sentiment_model_best.pt"):
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)

    epoch_checkpoint = os.path.join(checkpoint_dir, f'sentiment_model_epoch_{state["epoch"]}.pt')
    torch.save(state, epoch_checkpoint)
    print(f"Checkpoint saved: {epoch_checkpoint}")
    
    if is_best:
        best_checkpoint = os.path.join(checkpoint_dir, best_model_name)
        torch.save(state, best_checkpoint)
        print(f"Best checkpoint updated: {best_checkpoint}")

### 1. **Function Definition**
- **`get_model`**: This function builds an `LSTMSentimentModel` from the given hyperparameters. Depending on the `pretrained` flag, the embedding layer is initialized either from pre-trained embedding weights or randomly; all other layers are always randomly initialized.

### 2. **Parameters**
- **`vocab_size`**: The size of the vocabulary (i.e., the number of unique tokens). This is used for the embedding layer, where each word is represented by a vector.
- **`embedding_dim`**: The dimensionality of the word embeddings (e.g., 100 or 300). This defines the size of the embedding space.
- **`hidden_size`**: The number of hidden units in the LSTM layers. A higher hidden size allows the model to capture more complex patterns but also increases computation.
- **`num_layers`**: The number of stacked LSTM layers. Stacking multiple layers allows the model to learn more hierarchical features from the data.
- **`dropout`**: The dropout rate applied to the LSTM outputs before the fully connected layer, which helps prevent overfitting by randomly dropping units during training. It is not passed to `nn.LSTM`, so no dropout is applied between the stacked LSTM layers.
- **`pretrained=False`**: A flag indicating whether to initialize the embedding layer from pre-trained weights. If `True` and `pretrained_weights` is provided, those weights are copied into the embedding layer; otherwise, the embeddings are randomly initialized.
- **`pretrained_weights=None`**: If `pretrained` is `True`, this parameter holds the pre-trained word embeddings (e.g., GloVe) that will be used to initialize the embedding layer.

### 3. **Condition for Pre-trained Model**
- **`if pretrained and pretrained_weights is not None:`**: This checks whether the user wants to use pre-trained embeddings and whether the weights have been provided.
  - **`print("Loading pre-trained model...")`**: Notifies that a pre-trained model is being loaded.
  - **`LSTMSentimentModel(..., embedding_weights=pretrained_weights)`**: Initializes the `LSTMSentimentModel` with the pre-trained embedding weights. The embeddings will not be randomly initialized but instead will be copied from `pretrained_weights`.

### 4. **Random Initialization**
- **`else:`**: If `pretrained` is `False` or no pre-trained weights are provided, the model will be randomly initialized.
  - **`print("Randomly initializing model...")`**: Notifies that the model is being initialized with random weights.
  - **`LSTMSentimentModel(..., embedding_weights=None)`**: Initializes the model with random embeddings, as no pre-trained weights are provided.

### 5. **Return the Model**
- The function returns the constructed model, which can either be initialized with pre-trained weights or with random weights, depending on the parameters.

### Summary
- The `get_model` function provides flexibility in model initialization. It can either initialize the embedding layer from pre-trained weights (if `pretrained=True` and `pretrained_weights` are provided) or randomly initialize the whole model. This allows you to leverage pre-trained embeddings like GloVe or FastText if available, or train from scratch if desired.
- The use of hyperparameters such as `vocab_size`, `embedding_dim`, `hidden_size`, `num_layers`, and `dropout` allows you to easily configure and adjust the model architecture based on your requirements.
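
The effect of passing pre-trained weights can be checked on a tiny embedding layer. The sketch below uses a made-up 5×3 matrix in place of GloVe. The first part mirrors the `copy_` approach used in `LSTMSentimentModel`; the second part shows `nn.Embedding.from_pretrained` with `freeze=True`, an alternative that this project does not use, for cases where the embeddings should stay fixed.

```python
import torch
import torch.nn as nn

# Stand-in for a GloVe matrix: 5 words, 3 dimensions (made-up values)
pretrained = torch.arange(15, dtype=torch.float).reshape(5, 3)

# Create the layer, then copy the pre-trained weights into it
embedding = nn.Embedding(5, 3)
embedding.weight.data.copy_(pretrained)
print(embedding(torch.tensor([2])))    # row 2 of the pre-trained matrix
print(embedding.weight.requires_grad)  # True: the embeddings are updated during training

# Alternative: build the layer directly from the matrix and keep it fixed
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
print(frozen.weight.requires_grad)     # False
```

With the copied weights left trainable, the embeddings are adjusted along with the rest of the model, which is what makes the setup a form of fine-tuning.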

In [None]:
# Load or initialize model based on a hyperparameter
def get_model(vocab_size, embedding_dim, hidden_size, num_layers, dropout, pretrained=False, pretrained_weights=None):
    if pretrained and pretrained_weights is not None:
        print("Loading pre-trained model...")
        model = LSTMSentimentModel(vocab_size, embedding_dim, hidden_size, 1, num_layers=num_layers, dropout=dropout, embedding_weights=pretrained_weights)
    else:
        print("Randomly initializing model...")
        model = LSTMSentimentModel(vocab_size, embedding_dim, hidden_size, 1, num_layers=num_layers, dropout=dropout)
    return model

### 1. **Function Definition**
- **`evaluate_model`**: This function evaluates the performance of a model on a validation dataset. It calculates the loss and accuracy by passing the validation data through the model without updating the weights.

### 2. **Parameters**
- **`model`**: The LSTM model (or any neural network model) that is being evaluated.
- **`data_loader`**: A PyTorch `DataLoader` containing the validation data. The data loader batches the data for evaluation.
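- **`criterion`**: The loss function used to calculate how well the model's predictions match the actual labels. Because this model outputs a single sigmoid probability per review, `nn.BCELoss` is the matching choice; `nn.CrossEntropyLoss` expects unnormalized class scores and does not fit this output.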

### 3. **Model in Evaluation Mode**
- **`model.eval()`**: This sets the model to evaluation mode, which disables dropout and makes batch normalization layers (if any) use their running statistics instead of per-batch statistics.
- **`with torch.no_grad():`**: This context manager ensures that no gradients are computed during evaluation, saving memory and computation time since the model is not being updated.
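
The effect of these two calls can be seen on a toy module. This is an illustrative sketch, not the notebook's model: the `nn.Sequential` stack and input below are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

# Training mode: dropout zeroes random activations, and outputs track gradients.
layer.train()
train_out = layer(x)

# Evaluation mode: dropout is a no-op, so repeated passes give identical outputs.
layer.eval()
with torch.no_grad():
    eval_out_a = layer(x)
    eval_out_b = layer(x)

print(torch.equal(eval_out_a, eval_out_b))  # True
print(train_out.requires_grad)              # True
print(eval_out_a.requires_grad)             # False
```

Note that the two calls are independent: `model.eval()` changes layer behavior, while `torch.no_grad()` only turns off gradient tracking.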

### 4. **Looping Through the Validation Data**
- **`for texts, labels in data_loader:`**: Iterates through batches of data (texts and corresponding labels) from the data loader.
  - **`texts, labels = texts.to(device), labels.to(device)`**: Moves the texts and labels to the correct device (GPU or CPU) for processing.

### 5. **Model Predictions**
- **`outputs = model(texts)`**: Passes the texts (inputs) through the model to obtain the outputs (predictions). These outputs are usually probabilities between 0 and 1, especially if using a sigmoid activation for binary classification.

### 6. **Loss Calculation**
- **`loss = criterion(outputs, labels)`**: Calculates the loss for the current batch using the loss function (`criterion`). The loss measures how far off the model's predictions are from the true labels.
- **`total_loss += loss.item()`**: Adds the loss for the current batch to the total loss.

### 7. **Converting Outputs to Predictions**
- **`preds = (outputs >= 0.5).float()`**: Converts the model's outputs to binary predictions (either 0 or 1). If the output is greater than or equal to 0.5, it is considered a positive prediction (1); otherwise, it is negative (0).
  - **`predictions.extend(preds.cpu().numpy())`**: Adds the predicted values for the current batch to the list of all predictions. The `.cpu().numpy()` ensures that the predictions are moved to CPU and converted to NumPy arrays for compatibility with evaluation functions.
  - **`true_labels.extend(labels.cpu().numpy())`**: Adds the true labels for the current batch to the list of all true labels.
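
As a small worked example of the thresholding step, the sketch below uses made-up probabilities and labels. It assumes the model emits one probability per review in a `(batch, 1)` tensor, and it computes accuracy directly in PyTorch rather than with `accuracy_score`, purely to stay self-contained.

```python
import torch

# Hypothetical sigmoid outputs and float labels for a batch of four reviews.
outputs = torch.tensor([[0.10], [0.50], [0.93], [0.49]])
labels = torch.tensor([[0.0], [1.0], [1.0], [1.0]])

# Probabilities at or above 0.5 become 1.0 (positive); the rest become 0.0.
preds = (outputs >= 0.5).float()

# Fraction of predictions that match the labels.
accuracy = (preds == labels).float().mean().item()

print(preds.flatten().tolist())  # [0.0, 1.0, 1.0, 0.0]
print(accuracy)                  # 0.75
```

The last review is misclassified because 0.49 falls just under the threshold, which illustrates how sensitive borderline cases are to the chosen cutoff.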

### 8. **Accuracy Calculation**
- **`accuracy = accuracy_score(true_labels, predictions)`**: Uses `sklearn.metrics.accuracy_score` to compute the accuracy by comparing the model's predictions with the true labels.
- **`print(f"Accuracy: {accuracy:.4f}")`**: Prints the accuracy of the model on the validation dataset.

### 9. **Average Validation Loss**
- **`avg_val_loss = total_loss / len(data_loader)`**: Computes the average validation loss by dividing the total loss by the number of batches.
- **`print(f"Validation Loss: {avg_val_loss:.4f}")`**: Prints the average validation loss.
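
One subtlety of dividing by the number of batches: with a mean-reduced loss such as `nn.BCELoss`, a smaller final batch carries the same weight as a full one. The sketch below, using made-up numbers, contrasts the per-batch average with a sample-weighted average; it is an optional refinement, not what the notebook's function does.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # reduction='mean' by default

# Five hypothetical predictions split into batches of size 4 and 1.
outputs = torch.tensor([0.9, 0.8, 0.7, 0.6, 0.1])
labels = torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0])
batches = [(outputs[:4], labels[:4]), (outputs[4:], labels[4:])]

batch_mean_sum, weighted_sum, n_samples = 0.0, 0.0, 0
for out, lab in batches:
    loss = criterion(out, lab).item()
    batch_mean_sum += loss                # what total_loss accumulates
    weighted_sum += loss * lab.size(0)    # weight each batch by its size
    n_samples += lab.size(0)

per_batch_avg = batch_mean_sum / len(batches)  # total_loss / len(data_loader)
per_sample_avg = weighted_sum / n_samples      # true mean over all samples
full_loss = criterion(outputs, labels).item()

print(abs(per_sample_avg - full_loss) < 1e-6)  # True
print(per_batch_avg > per_sample_avg)          # True
```

With large, evenly sized batches the two averages are nearly identical, so the simpler per-batch average is usually adequate for monitoring.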

### 10. **Return Values**
- **`return avg_val_loss, accuracy`**: Returns both the average validation loss and the accuracy, which can be used for further analysis or logging.

### Summary
- **`evaluate_model`** computes both the validation loss and accuracy of the model without updating the weights (i.e., in evaluation mode).
- It processes the validation data in batches, calculates the loss and predictions, and evaluates the model's performance by comparing its predictions to the true labels.
- The function outputs the average validation loss and accuracy, which are essential metrics to monitor when assessing a model's performance during training.

In [None]:
# Evaluation function: computes the average loss and accuracy over a data loader
def evaluate_model(model, data_loader, criterion):
    model.eval()
    predictions = []
    true_labels = []
    total_loss = 0
    with torch.no_grad():
        for texts, labels in data_loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            preds = (outputs >= 0.5).float()
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    accuracy = accuracy_score(true_labels, predictions)
    print(f"Accuracy: {accuracy:.4f}")
    avg_val_loss = total_loss / len(data_loader)
    print(f"Validation Loss: {avg_val_loss:.4f}")
    return avg_val_loss, accuracy

### 1. **Function Definition**
- **`train_model`**: This function trains the model over multiple epochs, saves checkpoints after each epoch, and logs training and validation metrics such as loss and accuracy. 

### 2. **Parameters**
- **`model`**: The LSTM model (or any neural network model) being trained.
- **`train_loader`**: A `DataLoader` containing the training data, which provides batches of texts and labels for training.
- **`val_loader`**: A `DataLoader` containing the validation data for evaluating the model after each epoch.
- **`criterion`**: The loss function used to calculate the training loss (e.g., `nn.BCELoss` for binary classification).
- **`optimizer`**: The optimizer (e.g., `Adam`, `SGD`) used to update the model's parameters during training.
- **`num_epochs=5`**: The number of epochs (full passes through the training data) for which the model will be trained.
- **`writer=None`**: A `SummaryWriter` object for logging to TensorBoard. This logs metrics like loss and accuracy so they can be visualized in TensorBoard.

### 3. **Tracking Best Validation Loss**
- **`best_val_loss = float('inf')`**: Initializes the best validation loss as infinity. This is used to keep track of the best model based on validation loss. If the current validation loss is lower than `best_val_loss`, the model is saved as the best model.

### 4. **Epoch Loop**
- **`for epoch in range(num_epochs):`**: Loops over the specified number of epochs. Each epoch represents one full pass over the training data.
  - **`model.train()`**: Puts the model in training mode, which enables dropout and makes batch normalization layers (if present) use per-batch statistics.

### 5. **Training Loop**
- **`for texts, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):`**: Loops over batches of training data provided by `train_loader`. `tqdm` is used to show a progress bar for each epoch.
  - **`texts, labels = texts.to(device), labels.to(device)`**: Moves the input data (texts and labels) to the appropriate device (GPU or CPU).
  - **`optimizer.zero_grad()`**: Resets the gradients for the optimizer before backpropagation.
  - **`outputs = model(texts)`**: Passes the batch of texts through the model to get the predictions (outputs).
  - **`loss = criterion(outputs, labels)`**: Computes the loss for the current batch by comparing the outputs and labels using the specified loss function.
  - **`loss.backward()`**: Performs backpropagation, computing the gradients for each parameter based on the loss.
  - **`nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)`**: Guards against exploding gradients by rescaling all gradients together whenever their combined norm exceeds 1, so the total gradient norm never exceeds `max_norm`.
  - **`optimizer.step()`**: Updates the model parameters using the gradients calculated during backpropagation.
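
The clipping step can be checked in isolation. The following sketch uses a toy linear model and deliberately large inputs (both made up) to produce an oversized gradient, then confirms that the combined gradient norm is scaled down to about 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)

# Large inputs produce a large loss and therefore large gradients.
inputs = torch.randn(8, 10) * 100
loss = model(inputs).pow(2).mean()
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping.
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

# Recompute the combined L2 norm over all parameter gradients after clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(norm_before.item() > 1)          # True
print(norm_after.item() <= 1 + 1e-4)   # True
```

Because all gradients are scaled by the same factor, clipping changes the step size but not the update direction.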

### 6. **Tracking and Logging Training Loss**
- **`total_loss += loss.item()`**: Accumulates the training loss for the current epoch.
- **`avg_train_loss = total_loss / len(train_loader)`**: Computes the average training loss for the current epoch.
- **`writer.add_scalar('Loss/Train', avg_train_loss, epoch + 1)`**: Logs the average training loss to TensorBoard for the current epoch.

### 7. **Evaluating on Validation Set**
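- **`avg_val_loss, accuracy_score = evaluate_model(model, val_loader, criterion)`**: After each epoch, the model is evaluated on the validation set using the `evaluate_model` function, which returns the average validation loss and accuracy. Here `accuracy_score` is a local variable holding the accuracy value; it shadows the sklearn function of the same name only inside `train_model`, so `evaluate_model` is unaffected.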
- **`writer.add_scalar('Loss/Validation', avg_val_loss, epoch + 1)`**: Logs the validation loss to TensorBoard for the current epoch.
- **`writer.add_scalar('Accuracy/Validation', accuracy_score, epoch + 1)`**: Logs the validation accuracy to TensorBoard for the current epoch.

### 8. **Saving Checkpoints**
- **`is_best = avg_val_loss < best_val_loss`**: Checks if the current epoch's validation loss is the best so far (i.e., lower than `best_val_loss`).
- **`if is_best:`**: If the current model has the best validation loss, update the best validation loss and save the model checkpoint as the best model.
  - **`best_val_loss = avg_val_loss`**: Updates the `best_val_loss` to the current epoch's validation loss.

- **`checkpoint_state`**: A dictionary that stores the current state of the model, including:
  - `epoch`: The current epoch number.
  - `model_state_dict`: The state of the model (weights).
  - `optimizer_state_dict`: The state of the optimizer.
  - `loss`: The average training loss for the current epoch.
  - `val_loss`: The average validation loss for the current epoch.

- **`save_checkpoint(checkpoint_state, is_best)`**: Saves a checkpoint for the current epoch, and if it's the best model so far, it also updates the "best model" checkpoint.
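
To show why both state dictionaries are stored, here is a sketch of a save-and-restore round trip for a dictionary with the same keys. It uses an in-memory buffer in place of a checkpoint file, a toy model, and made-up loss values; the on-disk layout used by `save_checkpoint` is not assumed.

```python
import io
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)

checkpoint_state = {
    'epoch': 3,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.41,      # made-up training loss
    'val_loss': 0.47,  # made-up validation loss
}

# An in-memory buffer stands in for a checkpoint file on disk.
buffer = io.BytesIO()
torch.save(checkpoint_state, buffer)
buffer.seek(0)

# Restore into freshly constructed objects, as one would when resuming.
restored = torch.load(buffer)
new_model = nn.Linear(4, 1)
new_optimizer = optim.Adam(new_model.parameters(), lr=0.001)
new_model.load_state_dict(restored['model_state_dict'])
new_optimizer.load_state_dict(restored['optimizer_state_dict'])
start_epoch = restored['epoch'] + 1

print(start_epoch)  # 4
```

Restoring the optimizer state matters for Adam in particular, since it carries per-parameter moment estimates that would otherwise restart from zero.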

### Summary
- **`train_model`** trains the model for a specified number of epochs while tracking and logging training and validation metrics.
- During each epoch, the model parameters are updated based on the training data, and the model is evaluated on the validation data.
- Checkpoints are saved at the end of each epoch, and the model with the best validation loss is flagged as the "best model" checkpoint. The use of TensorBoard logging allows for easy visualization of training progress.

In [None]:
# Training function with checkpoint saving
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=5, writer=None):
    best_val_loss = float('inf')  # Track best validation loss

    for epoch in range(num_epochs):
        total_loss = 0
        model.train()
        for texts, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
            texts, labels = texts.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(texts)
            loss = criterion(outputs, labels)
            loss.backward()

            # Clip the gradients to prevent exploding gradients
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

            optimizer.step()

            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}, Train Loss: {avg_train_loss:.4f}")
        
        # Log the train loss to TensorBoard (skipped when no writer is passed)
        if writer is not None:
            writer.add_scalar('Loss/Train', avg_train_loss, epoch + 1)

        # Evaluate on validation set
        avg_val_loss, accuracy_score = evaluate_model(model, val_loader, criterion)
        if writer is not None:
            writer.add_scalar('Loss/Validation', avg_val_loss, epoch + 1)
            writer.add_scalar('Accuracy/Validation', accuracy_score, epoch + 1)

        # Save a checkpoint for every epoch
        is_best = avg_val_loss < best_val_loss
        if is_best:
            best_val_loss = avg_val_loss
        
        checkpoint_state = {
            'epoch': epoch + 1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': avg_train_loss,
            'val_loss': avg_val_loss
        }
        
        save_checkpoint(checkpoint_state, is_best)

### 1. **Main Function Definition**
- **`main(pretrained=False)`**: This is the main function that sets up the model, loads the data, and trains the model for sentiment analysis. The `pretrained` flag determines whether to use pre-trained word embeddings (e.g., GloVe) or initialize the embeddings randomly.

### 2. **Hyperparameters**
- **`embedding_dim = 50`**: The dimensionality of the word embeddings (each word will be represented as a vector of 50 dimensions).
- **`hidden_size = 256`**: The number of units in the LSTM's hidden layers.
- **`num_layers = 2`**: The number of LSTM layers to stack.
- **`dropout = 0.05`**: The dropout rate for regularization.
- **`sequence_length = 512`**: The maximum length of each input sequence (texts longer than this will be truncated, and shorter ones padded).
- **`batch_size = 1024`**: The number of samples processed together in each batch during training.
- **`num_epochs = 20`**: The number of epochs (full passes through the training data).
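
To get a feel for what these settings imply, the sketch below counts the parameters of a two-layer LSTM with the same `embedding_dim` and `hidden_size`. It assumes a standard unidirectional `nn.LSTM`; whether `LSTMSentimentModel` is configured exactly this way is an assumption, and the embedding and output layers are not included in the count.

```python
import torch.nn as nn

embedding_dim, hidden_size, num_layers, dropout = 50, 256, 2, 0.05
lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers,
               dropout=dropout, batch_first=True)

def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and two bias vectors.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# Layer 1 reads the embeddings; layer 2 reads layer 1's hidden states.
expected = (lstm_layer_params(embedding_dim, hidden_size)
            + lstm_layer_params(hidden_size, hidden_size))
actual = sum(p.numel() for p in lstm.parameters())

print(expected, actual)  # 841728 841728
```

The recurrent weights dominate the total, so `hidden_size` affects model size far more than `embedding_dim` does.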

### 3. **Initialize TensorBoard Writer**
- **`writer = SummaryWriter()`**: Initializes the TensorBoard writer to log training and validation metrics, allowing visualization of the training process.

### 4. **Load IMDB Dataset**
- **`load_dataset('imdb')`**: Loads the IMDB dataset using the Hugging Face `datasets` library, which contains a large collection of movie reviews labeled as positive or negative.
- **`train_dataset` and `test_dataset`**: The dataset ships with predefined `train` and `test` splits of 25,000 labeled reviews each, which are assigned to `train_dataset` and `test_dataset`.

### 5. **Train/Validation Split**
- **`train_val_split = train_dataset.train_test_split(test_size=0.2)`**: Splits the training dataset into 80% training and 20% validation sets. The validation set is used to evaluate the model after each epoch.
- **`train_texts`, `train_labels`, `val_texts`, `val_labels`**: These variables store the texts and corresponding labels for the training and validation sets.
  
### 6. **Initialize Tokenizer**
- **`initialize_tokenizer`**: Initializes a tokenizer to convert the input texts into sequences of word indices. It ensures the vocabulary is capped at 32,000 words, with a minimum word frequency of 5.
- **`vocab_size`**: The size of the tokenizer's vocabulary, which will be used for the embedding layer.

### 7. **Loading GloVe Embeddings (if `pretrained=True`)**
- **`if pretrained:`**: If pre-trained embeddings are required, it loads the GloVe embeddings from a file and maps them to the words in the tokenizer using the function `load_glove_embeddings`.
- **`glove_path = './glove.6B.50d.txt'`**: Path to the GloVe embeddings file with 50-dimensional word vectors.

### 8. **Create Dataset and DataLoaders**
- **`SentimentDataset`**: Creates instances of the `SentimentDataset` class for the training, validation, and test datasets. This class converts the texts into sequences of word indices and pads/truncates them to the specified `sequence_length`.
- **`DataLoader`**: PyTorch's `DataLoader` is used to load batches of data efficiently:
  - **`train_loader`**: Loads batches of training data with shuffling enabled.
  - **`val_loader`**: Loads validation data without shuffling.
  - **`test_loader`**: Loads test data without shuffling for evaluation after training.
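
The batching behavior can be tried on toy data. In this sketch a `TensorDataset` of random token indices stands in for `SentimentDataset`, and the sizes are made up to keep the output small.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

sequence_length, batch_size, n_samples = 8, 4, 10

# Random token indices and binary labels stand in for tokenized reviews.
token_ids = torch.randint(0, 100, (n_samples, sequence_length))
labels = torch.randint(0, 2, (n_samples,)).float()
toy_data = TensorDataset(token_ids, labels)

# Shuffle only the training loader; evaluation order does not need to vary.
train_loader = DataLoader(toy_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(toy_data, batch_size=batch_size, shuffle=False)

batch_shapes = [tuple(texts.shape) for texts, _ in val_loader]
print(len(val_loader))  # 3
print(batch_shapes)     # [(4, 8), (4, 8), (2, 8)]
```

The last batch is smaller whenever the dataset size is not a multiple of `batch_size`, which is worth remembering when averaging per-batch metrics.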

### 9. **Initialize or Load Model**
- **`get_model`**: Initializes the LSTM model for sentiment analysis. If `pretrained=True`, the embedding layer will be initialized with pre-trained embeddings; otherwise, random embeddings will be used.

### 10. **Loss Function and Optimizer**
- **`criterion = nn.BCELoss()`**: The binary cross-entropy loss function is used, as this is a binary classification task (positive vs. negative sentiment).
- **`optimizer = optim.Adam(model.parameters(), lr=0.001)`**: The Adam optimizer is used for updating the model's parameters with a learning rate of 0.001.
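
For reference, binary cross-entropy can be written out by hand and compared with `nn.BCELoss`. The probabilities and targets below are made up; note that the targets must be floats, matching the model's probability outputs.

```python
import torch
import torch.nn as nn

# Hypothetical predicted probabilities and float targets for three reviews.
probs = torch.tensor([0.9, 0.2, 0.6])
targets = torch.tensor([1.0, 0.0, 1.0])

criterion = nn.BCELoss()
loss = criterion(probs, targets)

# Manual form: -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -(targets * probs.log() + (1 - targets) * (1 - probs).log()).mean()

print(torch.allclose(loss, manual))  # True
```

Confident correct predictions (0.9 for a positive review) contribute little to the loss, while hesitant ones (0.6) contribute more.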

### 11. **Train the Model**
- **`train_model`**: The function trains the model over the specified number of epochs (`num_epochs`). It logs training and validation losses to TensorBoard and saves model checkpoints after each epoch.

### 12. **Close TensorBoard Writer**
- **`writer.close()`**: Closes the TensorBoard writer after training is complete to ensure all data is properly written.

### 13. **Evaluate on the Full Test Set**
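- **`evaluate_model`**: After training, the model is evaluated on the entire test dataset to measure its final performance (test loss and accuracy). Because `evaluate_model` is reused here, the loss is still printed under the label "Validation Loss". The weights evaluated are those from the final epoch, which are not necessarily those of the best checkpoint.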

### 14. **Main Execution**
- **`if __name__ == '__main__':`**: This ensures that `main` runs only when the script is executed directly, not when it is imported as a module. With `pretrained=False`, the model is initialized with random embeddings; with `pretrained=True`, GloVe embeddings are used.

### Summary
- This script trains an LSTM model for sentiment analysis using the IMDB dataset.
- The model can optionally use pre-trained GloVe embeddings for word representations.
- Training and validation metrics (loss, accuracy) are logged to TensorBoard for visualization.
- After training, the model is evaluated on a separate test set to assess its performance.


In [None]:
def main(pretrained=False):
    # Hyperparameters
    embedding_dim = 50
    hidden_size = 256
    num_layers = 2
    dropout = 0.05
    sequence_length = 512
    batch_size = 1024
    num_epochs = 20

    # Initialize TensorBoard writer
    writer = SummaryWriter()

    # Load the IMDB dataset (predefined train and test splits)
    dataset = load_dataset('imdb')
    
    train_dataset = dataset['train']
    test_dataset = dataset['test']
    
    # Split train set into 80% train, 20% validation
    train_val_split = train_dataset.train_test_split(test_size=0.2)
    train_texts = train_val_split['train']['text']
    train_labels = train_val_split['train']['label']
    val_texts = train_val_split['test']['text']
    val_labels = train_val_split['test']['label']
    
    test_texts = test_dataset['text']
    test_labels = test_dataset['label']

    # Initialize tokenizer
    tokenizer, vocab_size = initialize_tokenizer(
        tokenizer_type='word', 
        text_data=train_texts, 
        min_freq=5, 
        max_vocab_size=32000,  # Cap the vocabulary at 32,000 words
        tokenizer_file='tokenizer.pkl'
    )

    # Load GloVe embeddings if pretrained=True
    if pretrained:
        glove_path = './glove.6B.50d.txt'
        embeddings = load_glove_embeddings(glove_path, tokenizer['word2idx'], embedding_dim)
    else:
        embeddings = None

    # Create dataset and dataloaders
    train_data = SentimentDataset(train_texts, train_labels, tokenizer, sequence_length)
    val_data = SentimentDataset(val_texts, val_labels, tokenizer, sequence_length)
    test_data = SentimentDataset(test_texts, test_labels, tokenizer, sequence_length)

    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

    # Initialize or load model
    model = get_model(vocab_size, embedding_dim, hidden_size, num_layers, dropout, pretrained, embeddings).to(device)

    # Loss function and optimizer
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train model
    train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, writer)

    # Close the TensorBoard writer
    writer.close()

    # Evaluate the model on the full test set
    print("\nEvaluating on full test set...")
    evaluate_model(model, test_loader, criterion)

if __name__ == '__main__':
    # Set to True to load pretrained embeddings, False for random initialization
    main(pretrained=False)