# Building a Deep Neural Net for Sentiment Analysis on IMDb Reviews

## 1. Data collection and preprocessing
- Collect a dataset of IMDb reviews
- Preprocess the text data (tokenization, lowercasing, removing special characters, etc.)
- Split the dataset into training, validation, and test sets

## 2. Model selection and architecture
- Research different types of deep learning models (RNN, LSTM, GRU, CNN, Transformer)
- Decide on a model architecture
- Experiment with pre-trained models (BERT, GPT, RoBERTa) for fine-tuning

## 3. Model training and hyperparameter tuning
- Set up a training loop
- Use backpropagation to update the model's weights based on the loss function
- Experiment with different hyperparameters (learning rate, batch size, dropout rate, etc.) and optimization algorithms (Adam, RMSprop, etc.)
- Monitor performance on the validation set during training

## 4. Model evaluation and refinement
- Evaluate the model on the test set using relevant metrics (accuracy, F1 score, precision, recall, etc.)
- Identify areas for improvement and iterate on the model architecture, training process, or preprocessing techniques

## 5. "Extra for experts" ideas
- Handle class imbalance (oversampling, undersampling, or SMOTE)
- Experiment with different word embeddings (Word2Vec, GloVe, FastText) or contextual embeddings (ELMo, BERT)
- Explore advanced model architectures (multi-head attention, capsule networks, memory-augmented networks)
- Investigate transfer learning or multi-task learning
- Conduct error analysis to understand and address specific issues
- Develop a user interface or API for your sentiment analysis model


# Load in data (collected from [kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews))

In [1]:
import pandas as pd

In [107]:
# Load in training data

data = pd.read_csv("../data/imdb_data.csv")

all_text_file = "data/imdb_text.txt"


n_data = len(data)

train_test_split = 0.9


# Splitting train/test
training_data = data[:int(len(data) * 0.9)]
testing_data = data[int(len(data) * 0.9):]

# can do this all in RAM because it's a pretty small dataset
with open(all_text_file, "w") as f:
    f.write("\n".join(training_data.iloc[:, 0]))

In [42]:
print(all_text[:2000])

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

# Pre-processing

Following is just me learning about pre-processing for natural language processing

In [9]:
# tokenisation

v = data.iloc[0, 0]

### NLTK

In [10]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/jerome/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [28]:
nltk_tokens = nltk.word_tokenize(v)
print(nltk_tokens[:10])
print(len(nltk_tokens))

['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']
380


### SpaCy

In [26]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(v)

spacy_tokens = [token.text for token in doc]

print(spacy_tokens[:10])
print(len(spacy_tokens))

['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']
359


In [31]:
print(set(spacy_tokens).difference(nltk_tokens))
print(set(nltk_tokens).difference(spacy_tokens))

{'/>The', 'away.<br', 'word.<br', 'me.<br', '/>I', '/><br', '/>It'}
{'away.', 'br', '>', 'me.', 'word.', '/', '<'}


### tokenizers (huggingface)

In [45]:
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

# training my own tokenizer based on the imdb data
# we would want to exclude test data if we were going to use this

# Using the BPE tokenizer as an example
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.decoder = decoders.BPEDecoder()

# Train the tokenizer on a sample text
tokenizer.train([all_text_file])

encoding = tokenizer.encode(v)
tokens = encoding.tokens

print(tokens[:10])
print(len(tokens))




['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']
390


### SentencePiece

In [47]:
import sentencepiece as spm

# Train the SentencePiece model on a sample text
spm.SentencePieceTrainer.train(
    input=all_text_file, 
    model_prefix="spm", 
    vocab_size=2000,
)

sp = spm.SentencePieceProcessor()
sp.load("spm.model")

tokens = sp.encode_as_pieces(v)

sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: data/imdb_text.txt
  input_format: 
  model_prefix: spm
  model_type: UNIGRAM
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clip

In [48]:
print(tokens[:10])

['▁One', '▁of', '▁the', '▁other', '▁review', 'ers', '▁has', '▁mention', 'ed', '▁that']


### gensim

In [52]:
from gensim.utils import simple_preprocess

tokens = simple_preprocess(v)

print(tokens[:10])
print(len(tokens))

['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']
307


### Manual

In [89]:
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
stop_words = set(nltk.corpus.stopwords.words())

[nltk_data] Downloading package stopwords to /home/jerome/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jerome/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/jerome/nltk_data...


In [98]:
import re
from typing import List
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


def nltk_to_wordnet_pos(nltk_pos):
    if nltk_pos.startswith('J'):
        return wordnet.ADJ
    elif nltk_pos.startswith('V'):
        return wordnet.VERB
    elif nltk_pos.startswith('N'):
        return wordnet.NOUN
    elif nltk_pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

    
# input text
input_text = data.iloc[0, 0]

text = input_text


def _preprocess_text(text: str) -> List[str]:
    # lower-casing
    text = text.lower()

    # remove html
    text = re.sub("\<.*\/?\>", "", text)

    # remove special characters
    text = re.sub("[^\w\s]", "", text)

    # whitespace tokenizing
    tokens = re.split("\s", text)

    # add POS tags
    pos_tagged_tokens = nltk.pos_tag(tokens)

    # stem the words using wordnet lemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [
        lemmatizer.lemmatize(t, pos=nltk_to_wordnet_pos(pos)) 
        for t, pos in pos_tagged_tokens
    ]

    # remove stopwords
    tokens = list(filter(lambda x: x not in stop_words, tokens))
    
    return tokens

In [104]:
manual_tokens = _preprocess_text(v)

print(manual_tokens[:10])
print(len(manual_tokens))

['reviewers', 'mentioned', 'watching', '1', 'oz', 'episode', 'youll', 'hooked', 'happened', 'appeal']
73


### I think that's enough pre-processing for now

I think I now know enough basics about pre-processing. I need to look into models to then customise the pre-processing appropriately

## 2. Model selection and architecture
- Research different types of deep learning models (RNN, LSTM, GRU, CNN, Transformer)
- Decide on a model architecture
- Experiment with pre-trained models (BERT, GPT, RoBERTa) for fine-tuning

## RNN

In [115]:
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        emb_dim: int = 300,
        hidden_size: int = 300,
        n_rnn_layers: int = 5,
    ):
        super().__init__()
        self.emb = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=emb_dim,
        )
        self.rnn = nn.RNN(
            input_size=emb_dim,
            hidden_size=hidden_size,
            num_layers=n_rnn_layers,
        )

        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, x: torch.Tensor):
        # x shape: (B, L)
        
        # convert token indices to embedding values
        x = self.emb(x)
        # x shape: (B, L, Emb dim)
        
        # run the rnn, only taking the final rnn hidden state from the last layer
        # TODO: understand the difference between the two outputs more
        _, x = self.rnn(x)
        # x shape: (B, n_rnn_layers, Hidden size?)
        
        # take only the
        x = x[:, -1, :]
        # x shape: (B, Hidden size?)
        
        x = self.fc(x)
        # x shape: (B, 2)

## LSTM

## GRU

## CNN

## Transformer

## BERT

## GPT

## RoBERTA