## RNN Vignette

#### Introduction

In this vignette, we will build a Recurrent Neural Network (RNN) using an LSTM (Long Short-Term Memory) architecture to classify movie reviews as positive or negative.

RNNs are a deep learning architecture designed to work with sequential data, such as text, speech, or time-series information. Unlike feedforward networks, which treat each input independently, RNNs carry a hidden state from one time step to the next, allowing them to model context, order, and dependencies.

Sentiment analysis is a classic application of natural language processing (NLP). In this example, we use an IMDB movie review dataset and train a model that learns to interpret word patterns and predict the sentiment of a review.

#### What is an RNN?

A Recurrent Neural Network is a neural network that processes input one element at a time while maintaining a hidden “memory” state:

$$
h_t = \tanh \left( W_x x_t + W_h h_{t-1} + b \right)
$$


Where:

- $x_t$ is the input at time step $t$
- $h_t$ is the hidden state at time $t$
- $h_{t-1}$ is the previous hidden state
- $W_x, W_h$ are learned weight matrices
- $b$ is a bias vector
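To make the update rule concrete, here is a minimal sketch of the recurrence in plain Python. The weights are small illustrative values rather than learned parameters, and the initial hidden state is assumed to be zero; the function names are ours, not part of any library.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t

def run_rnn(inputs, W_x, W_h, b):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = [0.0] * len(b)  # assumed initial state h_0 = 0
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h

# Illustrative (not learned) weights: 2-dimensional input and hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
final_state = run_rnn(sequence, W_x, W_h, b)
```

Because each step depends on the previous hidden state, feeding the same inputs in a different order generally produces a different final state, which is what lets the network be sensitive to word order.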


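Simple RNNs struggle to learn long-range dependencies: as gradients are propagated back through many time steps, they tend to vanish or explode, so information from early in a sequence is effectively lost. LSTMs (Long Short-Term Memory networks) were developed to address this problem.
They use three gates (input, forget, and output) that control which information to remember and which to discard.

An LSTM cell maintains a cell state $C_t$ that flows through the sequence with only minimal, gated modification, enabling long-term memory.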
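For reference, one common formulation of the LSTM update is sketched below. The symbols $W_\ast$, $U_\ast$, and $b_\ast$ (input weights, recurrent weights, and biases for each gate) are notation we introduce here for illustration; $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{C}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
$$

The forget gate $f_t$ scales the previous cell state, the input gate $i_t$ scales the new candidate $\tilde{C}_t$, and the output gate $o_t$ decides how much of the cell state is exposed as the hidden state $h_t$.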

#### Overview of Dataset and Preprocessing

Before an RNN can learn from text, the text must be converted into numbers. Neural networks cannot process raw characters or words, so we must transform the movie reviews into a standardized numeric format.

The dataset contains IMDB movie reviews with two columns: the review text and its sentiment (positive or negative). We first convert the sentiment labels into binary values, where positive is mapped to 1 and negative is mapped to 0.

Next, we perform tokenization. Tokenization splits each review into words and maps every distinct word to an integer index, typically assigned by frequency rank. For example, the word “the” might become 1, “movie” might become 2, and “awful” might become 450. This step converts each variable-length review into a sequence of integers that the RNN can process.
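The idea can be sketched in a few lines of standard-library Python. This is only an illustration under simple assumptions (lowercasing, splitting on letters and apostrophes, ranking by frequency with alphabetical tie-breaking); the helper names are hypothetical and a real tokenizer may behave differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1.

    Index 0 is left unused so it can later serve as the padding value.
    """
    counts = Counter(word for text in texts for word in tokenize(text))
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(text, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [word_index[w] for w in tokenize(text) if w in word_index]

reviews = ["The movie was great", "The movie was awful", "The plot was thin"]
word_index = build_word_index(reviews)
encoded = [encode(r, word_index) for r in reviews]
# encoded -> [[1, 3, 2, 5], [1, 3, 2, 4], [1, 6, 2, 7]]
```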

We then set a maximum sequence length. Because movie reviews vary in length, all reviews must be padded or truncated to the same length so that the model can read them as a uniform tensor. Reviews longer than the maximum number of words are cut, and shorter reviews are padded with zeros. This ensures the input shape is consistent for the RNN.
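A minimal sketch of padding and truncation follows. It assumes that tokens are kept from the start of the review and that zeros are appended at the end; libraries differ on whether they pad or cut at the front or the back by default, so treat this choice as an assumption rather than a description of any particular tool.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Return a list of exactly max_len integers.

    Assumption: the first max_len tokens are kept and padding is appended
    at the end of shorter sequences.
    """
    clipped = sequence[:max_len]
    return clipped + [pad_value] * (max_len - len(clipped))

batch = [[1, 3, 2, 5], [7, 4], [9, 8, 6, 5, 2, 1]]
padded = [pad_or_truncate(seq, max_len=4) for seq in batch]
# padded -> [[1, 3, 2, 5], [7, 4, 0, 0], [9, 8, 6, 5]]
```

After this step every review has the same length, so a batch of reviews can be stacked into a single rectangular array.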

Finally, we restrict the vocabulary to the 10,000 most frequent words. Extremely rare words appear too seldom for the model to learn useful representations, so excluding them reduces noise and keeps the embedding table small.
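If word indices are frequency ranks (1 for the most frequent word), limiting the vocabulary amounts to filtering on the index value. The sketch below shows two common options, dropping rare words or replacing them with a single out-of-vocabulary index; the function name and the choice of `oov_index` are illustrative assumptions.

```python
def limit_vocabulary(sequence, num_words, oov_index=None):
    """Keep only indices 1..num_words, assuming indices are frequency ranks.

    If oov_index is None, rarer words are dropped; otherwise they are
    replaced by that single out-of-vocabulary index.
    """
    limited = []
    for idx in sequence:
        if idx <= num_words:
            limited.append(idx)
        elif oov_index is not None:
            limited.append(oov_index)
    return limited

review = [1, 14, 20567, 2, 9999, 10000, 10001]
dropped = limit_vocabulary(review, num_words=10_000)
replaced = limit_vocabulary(review, num_words=10_000, oov_index=10_001)
# dropped  -> [1, 14, 2, 9999, 10000]
# replaced -> [1, 14, 10001, 2, 9999, 10000, 10001]
```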

#### Word Embeddings

Before the integer sequences are fed into the LSTM, they pass through an embedding layer. An embedding layer learns a dense vector representation for each word. This creates a numerical “meaning” for each word that captures relationships between words in a continuous space.

For example, the embedding for “good” will end up closer to the embedding for “great” than to “terrible”. These relationships are learned automatically during training. We choose an embedding dimension of 64, meaning each word is represented as a 64-element vector.

The output of the embedding layer is a matrix for each review: one row per word, and each row containing its learned 64-dimensional representation. This becomes the input to the LSTM layer.
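Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below uses random placeholder values in place of learned weights (in a real model these rows are trainable parameters) and assumes index 0 is the padding index; the helper names are hypothetical.

```python
import random

def make_embedding_table(vocab_size, embed_dim, seed=0):
    """Create a randomly initialized table with one row per word index.

    The values are placeholders only; training would adjust them.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
            for _ in range(vocab_size)]

def embed(sequence, table):
    """Look up the vector for every index in the sequence."""
    return [table[idx] for idx in sequence]

table = make_embedding_table(vocab_size=10_000, embed_dim=64)
review = [1, 3, 2, 5, 0, 0]      # a padded review of length 6
matrix = embed(review, table)    # 6 rows, one 64-element vector per position
```

Every occurrence of the same index maps to the same row, so the two padding positions at the end of `review` share one vector.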

#### Building the LSTM Sentiment Classifier

Our model uses a bidirectional LSTM. A bidirectional LSTM reads each review twice: once from left to right, and once from right to left. This allows the model to understand both previous context and future context within the review.

The architecture consists of three main layers:

- Embedding layer: Converts integer word indices into learned 64-dimensional vectors.
- Bidirectional LSTM layer: Processes the sequence forward and backward and outputs a 128-dimensional representation.
- Dense output layer: A single neuron with a sigmoid activation that outputs a probability between 0 and 1, representing how likely the review is positive.
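To get a feel for the size of this architecture, we can count its trainable parameters by hand. The sketch below assumes a standard LSTM formulation and assumes the 128-dimensional output comes from concatenating a forward and a backward LSTM with 64 units each; neither detail is stated explicitly above, so treat the totals as illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four weight blocks (input, forget, output gates and the candidate
    cell state), each with input weights, recurrent weights, and a bias."""
    return 4 * (units * input_dim + units * units + units)

def dense_params(input_dim, outputs):
    """Weights plus one bias per output neuron."""
    return input_dim * outputs + outputs

VOCAB_SIZE = 10_000   # from the preprocessing step
EMBED_DIM = 64        # from the embedding section
UNITS = 64            # assumption: 64 units per direction, concatenated to 128

embedding = embedding_params(VOCAB_SIZE, EMBED_DIM)   # 640,000
bi_lstm = 2 * lstm_params(EMBED_DIM, UNITS)           # 66,048 (forward + backward)
dense = dense_params(2 * UNITS, 1)                    # 129
total = embedding + bi_lstm + dense                   # 706,177
```

Under these assumptions the embedding table accounts for roughly 90% of the parameters, which is one reason the vocabulary size is capped.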

This structure allows the model to capture semantic meaning, word order, and context in the review.

#### Training the Model

We train the model using the Adam optimizer and binary cross-entropy loss, which is standard for binary classification tasks. The training process involves feeding batches of reviews through the model, computing the loss, and updating the weights.
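Binary cross-entropy penalizes confident wrong predictions much more heavily than confident correct ones. The following sketch computes the mean loss over a small batch; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the probabilities are made-up values, not model outputs.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])
# confident_right is far smaller than confident_wrong
```

The optimizer's job during training is to adjust the weights so that this quantity decreases on the training batches.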

We train for several epochs, where one epoch means the model has seen every training example once. We also use a validation split, meaning part of the training data is held out during training to check if the model is overfitting.
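The bookkeeping behind a validation split and batched epochs can be sketched as follows. This assumes the data has already been shuffled and holds out the last fraction of examples; the 20% fraction and the batch size are illustrative values, not settings taken from this vignette.

```python
def train_validation_split(examples, validation_fraction=0.2):
    """Hold out the last fraction of the examples for validation.

    Assumption: the data is already shuffled, so the tail is a fair sample.
    """
    n_val = int(len(examples) * validation_fraction)
    split = len(examples) - n_val
    return examples[:split], examples[split:]

def iter_batches(examples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

data = list(range(10))   # stand-in for 10 encoded reviews
train, val = train_validation_split(data, validation_fraction=0.2)
batch_sizes = [len(b) for b in iter_batches(train, batch_size=3)]
# train -> [0, ..., 7], val -> [8, 9], batch_sizes -> [3, 3, 2]
```

One pass over all batches of `train` is one epoch; the held-out `val` examples are only used to measure loss and accuracy, never to update the weights.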

The history of training shows how accuracy and loss change over time. Ideally, training loss decreases and validation loss also decreases or stabilizes, meaning the model is learning meaningful patterns rather than memorizing the data.

#### Visualizing Word Embeddings

To understand what the model learned, we extract the embedding weights from the embedding layer and reduce their dimensionality using PCA (Principal Component Analysis). Because the embeddings are 64-dimensional, PCA helps us project them into two dimensions for visualization.
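To show what PCA does to the embedding matrix, here is a small from-scratch sketch that finds the first two principal components by power iteration on the covariance matrix. In practice one would use a numerical library's PCA routine; this version, its helper names, and the tiny 4-dimensional "embeddings" are illustrative assumptions chosen so the example runs with the standard library alone.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, iterations=200, seed=0):
    """Approximate the dominant eigenvector by power iteration."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return v, eigenvalue

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        v, lam = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]

# Toy 4-dimensional "embeddings": three similar rows and three opposite rows.
toy_embeddings = [
    [2.0, 1.9, 0.1, 0.0], [1.8, 2.1, -0.1, 0.1], [2.2, 2.0, 0.0, -0.1],
    [-2.0, -1.9, 0.1, 0.0], [-1.8, -2.1, 0.0, 0.1], [-2.2, -2.0, -0.1, -0.1],
]
points = pca_2d(toy_embeddings)   # six (x, y) pairs ready for a scatterplot
```

In this toy data the two groups of rows land on opposite sides of the first axis, which is the kind of separation we look for between positive and negative words in the real embedding plot.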

The resulting scatterplot shows how the model organizes words in space. Positive words tend to cluster near each other, negative words form their own cluster, and neutral or common words appear near the center. This visualization provides insight into how the model captures semantic structure in the dataset.

#### Evaluation

After training, we evaluate the model on the test set, which contains data the model has never seen before. The evaluation returns metrics such as accuracy and loss. A strong model will achieve high accuracy on the test set and maintain low loss.

This evaluation step checks whether the model generalizes, that is, whether it can correctly classify movie reviews it was not trained on.
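Accuracy is simply the fraction of test reviews whose predicted class matches the true label. The sketch below computes it from predicted probabilities using a 0.5 threshold; the labels and probabilities are hypothetical values for illustration, not results from the trained model.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(1 for y, p in zip(y_true, y_prob)
                  if int(p > threshold) == y)
    return correct / len(y_true)

test_labels = [1, 0, 1, 1, 0]
test_probs = [0.92, 0.15, 0.40, 0.77, 0.61]   # hypothetical model outputs
test_accuracy = accuracy(test_labels, test_probs)
# three of five predictions match the labels, so test_accuracy is 0.6
```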

#### Predicting Sentiment on New Examples

We test the model on new input sentences that were not part of the dataset. These sentences are tokenized and padded in the same way as the training data. The model outputs a probability value. If the value is greater than 0.5, the review is classified as positive; otherwise, it is negative.

For example:

- “This is the best movie I have ever seen!” should receive a positive prediction.
- “The acting in this movie was horrible.” should receive a negative prediction.

This demonstrates that the trained model can interpret sentiment from text it has never encountered before.
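The prediction step can be sketched end to end as follows. Since no trained network is available in this snippet, `toy_model` and `toy_index` are hypothetical stand-ins that we define only so the code runs; with a real model, `predict_proba` would be whatever function returns the sigmoid output for one padded sequence.

```python
import re

def preprocess(text, word_index, max_len):
    """Tokenize, encode, and pad one sentence like the training data."""
    words = re.findall(r"[a-z']+", text.lower())
    ids = [word_index[w] for w in words if w in word_index][:max_len]
    return ids + [0] * (max_len - len(ids))

def predict_sentiment(text, word_index, max_len, predict_proba, threshold=0.5):
    """Return (label, probability) for one sentence."""
    prob = predict_proba(preprocess(text, word_index, max_len))
    label = "positive" if prob > threshold else "negative"
    return label, prob

# Hypothetical stand-ins so the sketch runs without a trained network.
toy_index = {"best": 1, "horrible": 2, "movie": 3}

def toy_model(ids):
    if 1 in ids:
        return 0.9
    if 2 in ids:
        return 0.1
    return 0.5

label_a, prob_a = predict_sentiment(
    "This is the best movie I have ever seen!", toy_index, 8, toy_model)
label_b, prob_b = predict_sentiment(
    "The acting in this movie was horrible.", toy_index, 8, toy_model)
# label_a -> "positive", label_b -> "negative"
```

Note that a probability of exactly 0.5 falls into the negative class under the rule stated above, because only values strictly greater than 0.5 count as positive.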

#### Limitations of the Model

While LSTMs are powerful, they have limitations. Word-level models cannot handle misspellings or new words outside the vocabulary. Very long documents may also lose information, even with LSTMs. Additionally, this model does not include an attention mechanism, so the whole review must be compressed into a single fixed-size vector rather than letting the classifier focus on the most informative words.

More recent Transformer-based models, such as BERT or GPT-style architectures, generally outperform LSTMs on modern sentiment analysis tasks. GRUs, a simpler gated recurrent variant, often perform comparably to LSTMs with fewer parameters. However, LSTMs remain an excellent teaching tool for understanding sequence modeling.