<a href="https://www.kaggle.com/code/mrafraim/dl-day-29-sentiment-prediction?scriptVersionId=291811609" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 29: RNN Mini Project - Sentiment Prediction

Welcome to Day 29!

Today you’ll learn:

1. How to preprocess text for an RNN (tokenization, encoding, padding)
2. How to define a simple RNN in PyTorch
3. How to train an RNN on a small dataset
4. How to evaluate predictions
5. How hidden states capture sequence patterns
6. End-to-end workflow from raw text → prediction

By the end of this notebook, you will have built your first end-to-end text classifier.

If you found this notebook helpful, your **<b style="color:orange;">UPVOTE</b>** would be greatly appreciated! It helps others discover the work and supports continuous improvement.

---


# Import Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
import numpy as np

# Tiny Sentiment Dataset

In [2]:
data = [
    ("I love this movie", 1),   # positive
    ("This film is great", 1),  # positive
    ("I hate this movie", 0),   # negative
    ("This film is awful", 0)   # negative
]

sentences, labels = zip(*data)  # zip(*data) unzips the list of tuples
labels = torch.tensor(labels)
sentences


('I love this movie',
 'This film is great',
 'I hate this movie',
 'This film is awful')

# Tokenization 

In [3]:
# Word-level tokenization
tokenized_sentences = [s.lower().split() for s in sentences]

print("Tokenized Sentences: ")
tokenized_sentences

Tokenized Sentences: 


[['i', 'love', 'this', 'movie'],
 ['this', 'film', 'is', 'great'],
 ['i', 'hate', 'this', 'movie'],
 ['this', 'film', 'is', 'awful']]
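
`str.split()` splits on whitespace only, so punctuation stays attached to words: `"movie!"` and `"movie"` would become two different tokens. The sentences above contain no punctuation, so this does not matter here. A minimal sketch of a regex-based alternative, assuming you want lowercase word tokens with punctuation dropped (the helper name `tokenize` is hypothetical and not part of this notebook):

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print("I love this movie!".lower().split())  # ['i', 'love', 'this', 'movie!']
print(tokenize("I love this movie!"))        # ['i', 'love', 'this', 'movie']
```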

# Vocabulary Creation

In [4]:
# Build vocabulary
vocab = set()

for sentence in tokenized_sentences:
    for word in sentence:
        vocab.add(word)

print(vocab)

vocab_size = len(vocab)
print("Vocabulary size: ", vocab_size)

{'is', 'film', 'movie', 'hate', 'love', 'awful', 'this', 'great', 'i'}
Vocabulary size:  9


In [5]:
# Map words to unique integer IDs

word_to_idx = {word: idx+1 for idx, word in enumerate(vocab)} # start from 1, 0 reserved for padding
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

print("Word to index mapping:")
word_to_idx

Word to index mapping:


{'is': 1,
 'film': 2,
 'movie': 3,
 'hate': 4,
 'love': 5,
 'awful': 6,
 'this': 7,
 'great': 8,
 'i': 9}
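
A Python `set` has no guaranteed order, and string hashing is randomized per process by default, so the IDs above can change from run to run. The model also has no ID for words it never saw. A minimal sketch that addresses both, assuming sorted word order and an extra unknown-word token; `build_vocab`, `encode`, `<pad>`, and `<unk>` are hypothetical names, and the resulting IDs differ from the mapping printed above:

```python
def build_vocab(tokenized_sentences):
    """Map each word to a stable integer ID (0 = padding, 1 = unknown word)."""
    words = sorted({word for sentence in tokenized_sentences for word in sentence})
    word_to_idx = {"<pad>": 0, "<unk>": 1}
    for idx, word in enumerate(words, start=2):
        word_to_idx[word] = idx
    return word_to_idx

def encode(tokens, word_to_idx):
    """Convert tokens to IDs, falling back to the <unk> ID for unseen words."""
    unk = word_to_idx["<unk>"]
    return [word_to_idx.get(token, unk) for token in tokens]

tokenized = [["i", "love", "this", "movie"], ["this", "film", "is", "awful"]]
word_to_idx = build_vocab(tokenized)
print(word_to_idx["awful"], word_to_idx["this"])            # 2 8
print(encode(["i", "love", "this", "show"], word_to_idx))   # [4, 6, 8, 1]
```

With this layout the embedding table would need `len(word_to_idx)` rows, because the padding and unknown tokens are already counted in the dictionary.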

# Encoding

In [6]:
# Convert each token in the sentences to its corresponding integer index
encoded_sentences = [
    torch.tensor([word_to_idx[word] for word in sentence])
    for sentence in tokenized_sentences
]

print("Encoded sentences:")
encoded_sentences

Encoded sentences:


[tensor([9, 5, 7, 3]),
 tensor([7, 2, 1, 8]),
 tensor([9, 4, 7, 3]),
 tensor([7, 2, 1, 6])]

# Pad Sequences

In [7]:
# Pad sequences
padded_sequences = pad_sequence(encoded_sentences, 
                                batch_first=True, 
                                padding_value=0)

padded_sequences

tensor([[9, 5, 7, 3],
        [7, 2, 1, 8],
        [9, 4, 7, 3],
        [7, 2, 1, 6]])

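Explanation:

- `pad_sequence` stacks the sequences into one tensor of shape `(batch, seq_len)`, filling shorter sequences with `0` up to the longest length
- Every sentence here already has four tokens, so no padding is actually added
- The tensor is now ready for batch processing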
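
Because the four sentences above all have the same length, the effect of padding is not visible in the output. A small sketch with hypothetical sequences of different lengths shows what `pad_sequence` does in the general case:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical encoded sentences of lengths 3, 2, and 1
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
lengths = torch.tensor([len(s) for s in sequences])

padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded)
# tensor([[1, 2, 3],
#         [4, 5, 0],
#         [6, 0, 0]])
print(lengths)  # tensor([3, 2, 1])
```

Keeping the original `lengths` around is useful later, since the padded tensor alone no longer says where each sentence ends.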

# Define Simple RNN Model

In [8]:
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size+1,
                                      embedding_dim=embed_dim,
                                      padding_idx=0)
        self.rnn = nn.RNN(input_size=embed_dim,
                          hidden_size=hidden_dim,
                          batch_first=True)
        self.fc = nn.Linear(hidden_dim, 
                            output_dim)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        embedded = self.embedding(x)
        out, h_n = self.rnn(embedded)
        out = self.fc(out[:, -1, :])  # use last hidden state
        return self.sigmoid(out)


 `__init__(...)`

* `vocab_size`: number of unique tokens in your vocabulary (not counting the padding index)
* `embed_dim`: size of each word vector
* `hidden_dim`: size of the RNN's hidden state vector (its memory capacity)
* `output_dim`: number of output neurons (1 for binary sentiment)


```python
self.embedding = nn.Embedding(
    num_embeddings=vocab_size + 1,
    embedding_dim=embed_dim,
    padding_idx=0
)
```

* Converts word indices → dense vectors
* Shape:

  ```
  (batch, seq_len) → (batch, seq_len, embed_dim)
  ```

**Why `+1`**

* Index `0` is reserved for padding
* Real tokens start at `1` and run up to `vocab_size`
* The embedding table therefore needs `vocab_size + 1` rows

**Why `padding_idx=0`**

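* The embedding row for index `0` is initialized to zeros, so padding tokens map to zero vectors
* That row receives no gradient, so it stays zero during training

This keeps padding from being learned as if it were a real word. The RNN still steps through padded positions, though, so padding can still influence the hidden state.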
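
The two `padding_idx` effects can be checked directly. The sketch below uses a standalone embedding layer with example sizes (10 rows, 8 dimensions) and a dummy loss, purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)

batch = torch.tensor([[9, 5, 0, 0]])  # two real tokens followed by two pads
vectors = embedding(batch)
print(vectors.shape)                     # torch.Size([1, 4, 8])
print(vectors[0, 2].abs().sum().item())  # 0.0 -> the padding vector is all zeros

# Backpropagate a dummy loss and inspect the gradient of individual rows
vectors.sum().backward()
print(embedding.weight.grad[0].abs().sum().item())  # 0.0 -> row 0 gets no gradient
print(embedding.weight.grad[9].abs().sum().item())  # 8.0 -> a real token's row does
```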


```python
self.rnn = nn.RNN(
    input_size=embed_dim,
    hidden_size=hidden_dim,
    batch_first=True
)
```

* Processes sequences step by step
* Maintains hidden state across time

**Input shape**

```
(batch, seq_len, embed_dim)
```

**Output**

```
out → (batch, seq_len, hidden_dim)
h_n → (1, batch, hidden_dim)
```
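
To make "step by step" concrete, the sketch below reproduces the output of `nn.RNN` by hand. It assumes the default `tanh` nonlinearity and a zero initial hidden state, and it uses random input with example sizes rather than the notebook's data. At each step the new hidden state is `tanh(W_ih x_t + b_ih + W_hh h_prev + b_hh)`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 5, 8)  # (batch, seq_len, embed_dim), example values

with torch.no_grad():
    out, h_n = rnn(x)

    # Manual unroll: carry the hidden state h from one time step to the next
    h = torch.zeros(4, 16)
    steps = []
    for t in range(x.size(1)):
        h = torch.tanh(
            x[:, t, :] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
        steps.append(h)
    manual_out = torch.stack(steps, dim=1)  # (batch, seq_len, hidden_dim)

print(out.shape, h_n.shape)  # torch.Size([4, 5, 16]) torch.Size([1, 4, 16])
print(torch.allclose(out, manual_out, atol=1e-5))  # True
```

The hidden state at step `t` depends on the input at step `t` and on the state from step `t - 1`, which is how earlier words influence later ones.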


```python
self.fc = nn.Linear(hidden_dim, output_dim)
```

This maps:

```
Final hidden state → class score
```

You are compressing temporal information into a single scalar.


```python
self.sigmoid = nn.Sigmoid()
```

This forces output into:

```
(0, 1)
```

Meaning:

* Probability of positive sentiment


`forward(self, x)`

This defines the computation PyTorch runs on every call to `model(x)`.


Step 1: Embedding

```python
embedded = self.embedding(x)
```

Input:

```
x → (batch, seq_len)
```

Output:

```
embedded → (batch, seq_len, embed_dim)
```

Integer IDs → dense vectors (random at first, learned during training).


Step 2: RNN

```python
out, h_n = self.rnn(embedded)
```

* `out`: hidden state at every time step
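* `h_n`: hidden state at the final time step

The model does not use `h_n`. It reads the last time step from `out` instead, which holds the same values for a single-layer, unidirectional RNN: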


Step 3: Last time step extraction

```python
out[:, -1, :]
```

This means:

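> Use the hidden state of the **last token** as the sentence representation.

This works here because no sentence is padded. In a padded batch, position `-1` can be a padding step rather than the last real token.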
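
A minimal sketch of both points, using a hypothetical padded batch and standalone layers with example sizes. It first checks that `out[:, -1, :]` matches `h_n[-1]`, then uses `pack_padded_sequence` so the RNN stops at each sentence's true length. This is one common way to handle padding, not what the notebook's model does:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embedding = nn.Embedding(10, 8, padding_idx=0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# Hypothetical padded batch: the second sentence has only 2 real tokens
batch = torch.tensor([[9, 5, 7, 3],
                      [7, 2, 0, 0]])
lengths = torch.tensor([4, 2])  # real tokens per sentence (must be on the CPU)

with torch.no_grad():
    embedded = embedding(batch)

    # Unpacked: for a single-layer, unidirectional RNN these are the same values
    out, h_n = rnn(embedded)
    same_as_h_n = torch.allclose(out[:, -1, :], h_n[-1])

    # Packed: the RNN processes only the real tokens of each sentence
    packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                                  enforce_sorted=False)
    _, h_packed = rnn(packed)

    # Sentence 1 now ends at its 2nd token instead of at the padding
    matches_last_real = torch.allclose(h_packed[-1][1], out[1, 1], atol=1e-6)
    matches_padded_end = torch.allclose(h_packed[-1][1], out[1, -1], atol=1e-6)

print(same_as_h_n)         # True
print(matches_last_real)   # True
print(matches_padded_end)  # False
```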

Step 4: Classification

```python
out = self.fc(out[:, -1, :])
```

Shape:

```
(batch, hidden_dim) → (batch, 1)
```

Step 5: Sigmoid

```python
return self.sigmoid(out)
```

Final output:

```
(batch, 1)
```

Probability of positive sentiment.

# Initialize Model, Loss, Optimizer

In [9]:
embed_dim = 8
hidden_dim = 16
output_dim = 1

model = SentimentRNN(vocab_size, embed_dim, hidden_dim, output_dim)

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)


# Training Loop

In [10]:
num_epochs = 200                                 # total iterations over the dataset

for epoch in range(num_epochs):
    optimizer.zero_grad()                        # reset gradients
    outputs = model(padded_sequences).squeeze()  # forward pass to get predictions
    loss = criterion(outputs, labels.float())    # compute binary cross-entropy loss
    loss.backward()                              # compute gradients via backpropagation
    optimizer.step()                             # update model parameters
    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")  # print loss every 50 epochs



Epoch 0, Loss: 0.6691
Epoch 50, Loss: 0.0033
Epoch 100, Loss: 0.0015
Epoch 150, Loss: 0.0010


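- `outputs` are probabilities of the positive class: values near 0 mean negative, values near 1 mean positive
- The loss decreases over epochs, meaning the model is fitting the training data; with only four sentences this shows memorization rather than generalization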
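
To put the loss values in context, the sketch below computes binary cross-entropy by hand and compares it with `nn.BCELoss`. The probabilities are made-up example values, not outputs of the trained model. A model that predicts 0.5 for everything scores ln(2) ≈ 0.6931, which is close to the loss printed at epoch 0:

```python
import math
import torch
import torch.nn as nn

labels = torch.tensor([1.0, 1.0, 0.0, 0.0])

def manual_bce(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    per_sample = targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)
    return -per_sample.mean()

# A model that outputs 0.5 for every sentence scores ln(2)
uncertain = torch.full((4,), 0.5)
print(f"{manual_bce(uncertain, labels).item():.4f}")  # 0.6931
print(f"{math.log(2):.4f}")                           # 0.6931

# Confident, correct predictions push the loss toward 0
confident = torch.tensor([0.9, 0.9, 0.1, 0.1])
print(f"{manual_bce(confident, labels).item():.4f}")    # 0.1054
print(f"{nn.BCELoss()(confident, labels).item():.4f}")  # 0.1054
```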

# Predictions

In [11]:
with torch.no_grad():                                # disable gradient calculation for inference/evaluation
    predictions = model(padded_sequences).squeeze()  # forward pass to get predicted probabilities
    predicted_labels = (predictions >= 0.5).long()   # convert probabilities to 0/1 class labels

predictions, predicted_labels                        # display predicted probabilities and class labels


(tensor([9.9963e-01, 9.9898e-01, 4.1784e-04, 9.7878e-04]),
 tensor([1, 1, 0, 0]))

# Interpret Predictions

In [12]:
# iterate over sentences and their predicted labels
for sentence, pred in zip(sentences, predicted_labels):  
    print(f"{sentence} → {pred.item()}")  # print each sentence with its predicted label (0 or 1)


I love this movie → 1
This film is great → 1
I hate this movie → 0
This film is awful → 0
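
All four predictions match their labels. A minimal sketch of turning that into an accuracy number, with the label and prediction values copied from the cells above. Since these are the same sentences the model was trained on, this is training accuracy and says nothing about unseen text:

```python
import torch

# Values copied from the cells above
labels = torch.tensor([1, 1, 0, 0])
predicted_labels = torch.tensor([1, 1, 0, 0])

# Fraction of sentences whose predicted label equals the true label
accuracy = (predicted_labels == labels).float().mean().item()
print(f"Training accuracy: {accuracy:.2f}")  # Training accuracy: 1.00
```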


# End-to-End Flowchart

```

┌─────────────────────────────┐
│ Raw Sentences (Text)        │
│ ──────────────────────────  │
│ "I love this movie"         │
│ "This film is great"        │
│ "I hate this movie"         │
│ "This film is awful"        │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Tokenization                │
│ ──────────────────────────  │
│ ["i","love","this","movie"] │
│ ["this","film","is","great"]│
│ ["i","hate","this","movie"] │
│ ["this","film","is","awful"]│
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Vocabulary Encoding         │
│ ──────────────────────────  │
│ i → 9, love → 5, this → 7   │
│ movie → 3, film → 2         │
│ is → 1, great → 8           │
│ hate → 4, awful → 6         │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Integer Sequences           │
│ ──────────────────────────  │
│ [9,5,7,3]                   │
│ [7,2,1,8]                   │
│ [9,4,7,3]                   │
│ [7,2,1,6]                   │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Padding (pad_sequence)      │
│ padding_value = 0           │
│ ──────────────────────────  │
│ [9,5,7,3]                   │
│ [7,2,1,8]                   │
│ [9,4,7,3]                   │
│ [7,2,1,6]                   │
│ Shape: (batch, seq_len)     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Embedding Layer             │
│ nn.Embedding                │
│ ──────────────────────────  │
│ Input: (4, 4)               │
│ Output: (4, 4, embed_dim)   │
│ embed_dim = 8               │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ RNN Layer                   │
│ nn.RNN                      │
│ ──────────────────────────  │
│ Input: (4, 4, 8)            │
│ Output: (4, 4, 16)          │
│ Hidden state h_n: (1,4,16)  │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Last Time Step Selection    │
│ ──────────────────────────  │
│ out[:, -1, :]               │
│ Shape: (4, 16)              │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Fully Connected Layer       │
│ nn.Linear(16 → 1)           │
│ ──────────────────────────  │
│ Output: (4, 1)              │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Sigmoid Activation          │
│ ──────────────────────────  │
│ Output Probabilities        │
│ (0 → negative, 1 → positive)│
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Thresholding (>= 0.5)       │
│ ──────────────────────────  │
│ Final Sentiment Labels      │
│ 0 = Negative, 1 = Positive  │
└─────────────────────────────┘

```

# Key Takeaways from Day 29

- End-to-end RNN workflow:
    - Tokenization → Encoding → Padding → Embedding → RNN → Prediction
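- The final hidden state summarizes the whole sequence in one vector
- A plain RNN captures short-range patterns in text but struggles with long sequences
- LSTM and GRU layers add gating, which makes longer-range dependencies easier to learn
- Four sentences are enough to illustrate the workflow, but not to train a model that generalizes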
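
As a pointer toward the LSTM/GRU takeaway, here is a minimal sketch of the same architecture with `nn.GRU` as the recurrent layer. `SentimentGRU` is a hypothetical class, not part of this notebook, and it reads `h_n[-1]` directly instead of slicing `out`:

```python
import torch
import torch.nn as nn

class SentimentGRU(nn.Module):
    """Hypothetical variant of SentimentRNN that uses nn.GRU."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size + 1, embed_dim, padding_idx=0)
        self.gru = nn.GRU(input_size=embed_dim, hidden_size=hidden_dim,
                          batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)            # (batch, seq_len, embed_dim)
        _, h_n = self.gru(embedded)             # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.fc(h_n[-1]))  # (batch, output_dim)

model = SentimentGRU(vocab_size=9, embed_dim=8, hidden_dim=16, output_dim=1)
batch = torch.tensor([[9, 5, 7, 3], [7, 2, 1, 8]])
probs = model(batch)
print(probs.shape)  # torch.Size([2, 1])
```

`nn.LSTM` can be swapped in the same way, with one difference: it returns `(out, (h_n, c_n))`, so the hidden state has to be unpacked from a tuple.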

---

<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
