# Building a Deep Neural Net for Sentiment Analysis on IMDb Reviews

## 1. Data collection and preprocessing
- Collect a dataset of IMDb reviews
- Preprocess the text data (tokenization, lowercasing, removing special characters, etc.)
- Split the dataset into training, validation, and test sets

## 2. **Model selection and architecture**
- Research different types of deep learning models (**RNN**, LSTM, GRU, CNN, Transformer)
- Decide on a model architecture
- Experiment with pre-trained models (BERT, GPT, RoBERTa) for fine-tuning

## 3. Model training and hyperparameter tuning
- Set up a training loop
- Use backpropagation to update the model's weights based on the loss function
- Experiment with different hyperparameters (learning rate, batch size, dropout rate, etc.) and optimization algorithms (Adam, RMSprop, etc.)
- Monitor performance on the validation set during training

## 4. Model evaluation and refinement
- Evaluate the model on the test set using relevant metrics (accuracy, F1 score, precision, recall, etc.)
- Identify areas for improvement and iterate on the model architecture, training process, or preprocessing techniques

## 5. "Extra for experts" ideas
- Handle class imbalance (oversampling, undersampling, or SMOTE)
- Experiment with different word embeddings (Word2Vec, GloVe, FastText) or contextual embeddings (ELMo, BERT)
- Explore advanced model architectures (multi-head attention, capsule networks, memory-augmented networks)
- Investigate transfer learning or multi-task learning
- Conduct error analysis to understand and address specific issues
- Develop a user interface or API for your sentiment analysis model

In [1]:
import tokenizers
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F

class RNNModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        emb_dim: int = 300,
        hidden_size: int = 400,
        n_rnn_layers: int = 5,
    ):
        super().__init__()
        self.emb = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=emb_dim,
        )
        self.rnn = nn.RNN(
            input_size=emb_dim,
            hidden_size=hidden_size,
            num_layers=n_rnn_layers,
        )

        self.seq = nn.Sequential(
            nn.Linear(hidden_size, 500),
            nn.ReLU(),
            nn.Linear(500, 500),
            nn.ReLU(),
            nn.Linear(500, 500),
            nn.ReLU(),
            nn.Linear(500, 500),
            nn.ReLU(),
            nn.Linear(500, 2),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor, lengths: torch.Tensor):
        # x shape: (B, L)
        # convert token indices to embedding values
        x = self.emb(x)
        # x shape: (B, L, Emb dim)

        x = x.transpose(0, 1)
        # x shape: (L, B, Emb dim)

        # pack the sequence
        x = pack_padded_sequence(x, lengths, enforce_sorted=False)

        # run the rnn, only taking the final rnn hidden state from the last layer
        # TODO: understand the difference between the two outputs more
        _, x = self.rnn(x)
        # x shape: (B, n_rnn_layers, Hidden size?)
        
        # take only the last layer
        x = x[-1, :, :]
        # x shape: (B, Hidden size?)
        
        x = self.fc(x)
        # x shape: (B, 2)

        return F.softmax(F.relu(x), dim=1)

In [2]:
from torch.utils.data.dataloader import Dataset

# load in tokenized data
data_dict = torch.load("data/imdb_data.pt")
data = data_dict["reviews"]
labels = data_dict["labels"]
lengths = data_dict["lengths"]

# split into train and test by 80:20
train_data = data[:int(len(data) * 0.8)]
train_labels = labels[:int(len(data) * 0.8)]
train_lengths = lengths[:int(len(data) * 0.8)]

test_data = data[int(len(data) * 0.8):]
test_labels = labels[int(len(data) * 0.8):]
test_lengths = lengths[int(len(data) * 0.8):]


# load in tokenizer
tokenizer = tokenizers.Tokenizer.from_file("models/tokenizer.json")
vocab_size = tokenizer.get_vocab_size()

# Training loop

In [3]:
import torch.optim as optim
from tqdm import tqdm

batch_size = 256 + 128

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# create the model
model = RNNModel(vocab_size=vocab_size)
model = model.to(device)

train_data = train_data.to(device)
train_labels = train_labels.to(device)

optimizer = optim.SGD(model.parameters(), lr=0.01)

eval_interval = 5

# train the model
for epoch in range(500):
    if epoch % eval_interval == 0:
        # calculate overall loss (need batching for memory reasons)
        loss = 0
        total = 0
        correct = 0

        with torch.no_grad():
            for i in tqdm(range(0, len(train_data), 2*batch_size)):
                training_data_batch = train_data[i:i+batch_size]
                training_labels_batch = train_labels[i:i+batch_size]
                training_lengths_batch = train_lengths[i:i+batch_size]

                output = model(training_data_batch, lengths=training_lengths_batch)
                ## Calculate correct predictions
                _, y_predicted = torch.max(output, 1)
                _, y = torch.max(training_labels_batch, 1)
                total += training_labels_batch.size(0)
                correct += (y_predicted == y).sum().item()
                loss += nn.functional.cross_entropy(output, training_labels_batch)
        
        print(f"Epoch: {epoch}, Loss: {loss.item()}, Accuracy: {correct/total*100: .2f}")
    
    for i in tqdm(range(0, len(train_data), batch_size)):
        training_data_batch = train_data[i:i+batch_size]
        training_labels_batch = train_labels[i:i+batch_size]
        training_lengths_batch = train_lengths[i:i+batch_size]

        output = model(training_data_batch, lengths=training_lengths_batch)

        # forward pass
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(output, training_labels_batch)
        loss.backward()
        optimizer.step()



100%|██████████| 53/53 [00:11<00:00,  4.57it/s]


Epoch: 0, Loss: 36.74216842651367, Accuracy:  49.79


100%|██████████| 105/105 [01:07<00:00,  1.56it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 53/53 [00:11<00:00,  4.73it/s]


Epoch: 5, Loss: 36.633888244628906, Accuracy:  52.96


100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 105/105 [01:07<00:00,  1.57it/s]
100%|██████████| 53/53 [00:11<00:00,  4.73it/s]


Epoch: 10, Loss: 36.465213775634766, Accuracy:  55.14


 16%|█▌        | 17/105 [00:11<00:59,  1.49it/s]


KeyboardInterrupt: 