# Part 1 – RNN


## Sentiment Analysis with RNN on Amazon Fine Food Reviews

## Dataset
- **Amazon Fine Food Reviews**  
- [Download from Kaggle](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews)  
- Dataset contains **500,000+ reviews** with ratings (1–5 stars).



## Objective
- Build a **Recurrent Neural Network (RNN)** model to predict review sentiment:
  - **Multi-class** → Ratings from 1 to 5 stars  




## Steps

### 1. Data Preprocessing
- Load the CSV file → focus on **`Text`** + **`Score`** columns.
- Clean text: lowercase, remove punctuation, and optionally remove stopwords.
- Tokenize reviews (convert words to numbers).
- Pad sequences to a fixed length.
- Convert labels:
  - Multi-class (1–5 → one-hot encoded).
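
To make the tokenize-and-pad steps concrete, here is a minimal pure-Python sketch of what they do. It is an illustration only, not the Keras implementation: `build_word_index`, `texts_to_ids`, and `pad_post` are hypothetical helper names, and the toy reviews are made up. It follows the conventions of the Keras utilities used in this notebook: word indices start at 1 (0 is reserved for padding), more frequent words get smaller indices, and sequences longer than `maxlen` lose tokens from the front by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace words with their indices; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(sequence, maxlen):
    """Pad with zeros at the end; keep the last `maxlen` tokens if too long."""
    trimmed = sequence[-maxlen:]
    return trimmed + [0] * (maxlen - len(trimmed))

reviews = ["great taffy great price", "taffy arrived stale"]  # toy data
word_index = build_word_index(reviews)
padded = [pad_post(seq, maxlen=5) for seq in texts_to_ids(reviews, word_index)]
print(word_index)
print(padded)
```

Every row ends up with the same length, which is what lets the reviews be stacked into a single 2-D array for the model.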


In [11]:
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential # type: ignore
from tensorflow.keras.layers import Dense, LSTM, Dropout, SimpleRNN, Embedding  # type: ignore
from tensorflow.keras.preprocessing.sequence import pad_sequences # type: ignore
from tensorflow.keras.preprocessing.text import Tokenizer # type: ignore
from sklearn.utils import shuffle 

In [4]:
df = pd.read_csv('Reviews.csv')
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [5]:
df=df[['Score','Text']]
df.head()

Unnamed: 0,Score,Text
0,5,I have bought several of the Vitality canned d...
1,1,Product arrived labeled as Jumbo Salted Peanut...
2,4,This is a confection that has been around a fe...
3,2,If you are looking for the secret ingredient i...
4,5,Great taffy at a great price. There was a wid...


In [20]:
df_shuffled = shuffle(df,random_state=42)
df_shuffled = df_shuffled[0:40000]
df_shuffled.Score.value_counts()

Score
5    25628
4     5656
1     3623
3     3016
2     2077
Name: count, dtype: int64

In [21]:
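# Build the stopword set once; calling stopwords.words() for every word is very slow.
# Requires a one-time nltk.download('stopwords').
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase, keep only letters and whitespace, then drop English stopwords.
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    words = [word for word in text.split() if word not in STOP_WORDS]
    return ' '.join(words)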

In [22]:
df_shuffled.Text = df_shuffled.Text.apply(preprocess_text)

In [23]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_shuffled.Text)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(df_shuffled.Text)
padded_sequences = pad_sequences(sequences, maxlen=100, padding='post')
padded_sequences

array([[  23,  300,  204, ...,    0,    0,    0],
       [  89,   85,   76, ...,    0,    0,    0],
       [  19,   98,  540, ...,    0,    0,    0],
       ...,
       [  27,   96,  117, ...,    0,    0,    0],
       [2872,  449,  495, ...,    0,    0,    0],
       [  45, 2152,   85, ...,    0,    0,    0]])

In [24]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
labels = ohe.fit_transform(df_shuffled[['Score']]).toarray()
labels

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])

In [25]:
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)


### 2. Build RNN Model
- Use **Embedding Layer** to convert tokens into vectors.
- Add **SimpleRNN layer** (e.g., 32 or 64 units).
- Add **Dense output layer**:
  - Activation = `softmax` (for multi-class classification).
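
For intuition about what the `SimpleRNN` layer computes, the sketch below runs the standard recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a short sequence in plain Python. `simple_rnn_forward` and the tiny weights are hypothetical and chosen for readability; Keras stores its weights in a different layout and learns them during training.

```python
import math

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) step by step; return the last h."""
    units = len(b)
    h = [0.0] * units  # initial hidden state
    for x in inputs:
        h_new = []
        for j in range(units):
            total = b[j]
            total += sum(W_x[j][i] * x[i] for i in range(len(x)))
            total += sum(W_h[j][k] * h[k] for k in range(units))
            h_new.append(math.tanh(total))
        h = h_new
    return h

# Illustrative weights: 2 hidden units, 2 input features.
W_x = [[0.5, -0.5], [1.0, 0.0]]
W_h = [[0.0, 0.1], [0.2, 0.0]]
b = [0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0]]  # two time steps

final_state = simple_rnn_forward(sequence, W_x, W_h, b)
print(final_state)
```

Only the final hidden state is passed on, which matches how `SimpleRNN(64)` feeds a single 64-dimensional vector per review into the `Dense` layer.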


In [26]:
rnn_model = Sequential(
    [
        Embedding(input_dim=len(word_index)+1, output_dim=128, input_length=padded_sequences.shape[1]),
        SimpleRNN(64),
        Dense(5, activation='softmax')
    ]
)




### 3. Training
- Compile with:
  - Optimizer → `adam`
  - Loss → `categorical_crossentropy` (multi-class)
  - Metric → `accuracy`
- Train for **5–10 epochs**.
- Use a **validation split** (e.g., 20%).
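
Because the ratings are heavily skewed toward 5 stars, it helps to know what accuracy and loss a model would reach without reading the text at all. The sketch below uses the class counts from the `value_counts()` output earlier in this notebook to compute three baselines. The variable names are illustrative.

```python
import math

# Class counts of the 40,000-review sample, copied from the value_counts() output.
counts = {5: 25628, 4: 5656, 1: 3623, 3: 3016, 2: 2077}
total = sum(counts.values())
priors = {score: n / total for score, n in counts.items()}

# Accuracy of always predicting the most common rating.
majority_accuracy = max(priors.values())

# Expected cross-entropy when the model always outputs the class frequencies.
prior_loss = -sum(p * math.log(p) for p in priors.values())

# Cross-entropy of a uniform guess over the 5 classes.
uniform_loss = math.log(len(counts))

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"class-prior loss:        {prior_loss:.4f}")
print(f"uniform-guess loss:      {uniform_loss:.4f}")
```

If validation accuracy and loss stay close to the first two baselines, the model has probably learned little beyond the class frequencies. The counts describe the whole sample, so the validation split may differ slightly.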


In [27]:

rnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()
rnn_model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.2)

Epoch 1/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 28s 65ms/step - accuracy: 0.6411 - loss: 1.1309 - val_accuracy: 0.6406 - val_loss: 1.1238
Epoch 2/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 26s 66ms/step - accuracy: 0.6666 - loss: 1.0693 - val_accuracy: 0.6383 - val_loss: 1.1312
Epoch 3/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 35s 89ms/step - accuracy: 0.6773 - loss: 1.0347 - val_accuracy: 0.6309 - val_loss: 1.1547
Epoch 4/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 37s 92ms/step - accuracy: 0.6814 - loss: 1.0198 - val_accuracy: 0.6189 - val_loss: 1.1994
Epoch 5/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 37s 91ms/step - accuracy: 0.6850 - loss: 1.0033 - val_accuracy: 0.6231 - val_loss: 1.2049
Epoch 6/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 28s 70ms/step - accuracy: 0.6861 - loss: 0.9975 - val_accuracy: 0.6241 - val_loss: 1.1944
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x29ec93e7fd0>

## The Model Is Overfitting, So I Will Add a Dropout Layer

In [28]:
rnn_model = Sequential(
    [
        Embedding(input_dim=len(word_index)+1, output_dim=128, input_length=padded_sequences.shape[1]),
        SimpleRNN(64),
        Dropout(0.7),
        Dense(5, activation='softmax')
    ]
)

In [30]:

rnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()
rnn_model.fit(X_train, y_train, epochs=8, batch_size=64, validation_split=0.2)

Epoch 1/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 28s 66ms/step - accuracy: 0.6476 - loss: 1.1286 - val_accuracy: 0.6391 - val_loss: 1.1269
Epoch 2/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 26s 64ms/step - accuracy: 0.6592 - loss: 1.0889 - val_accuracy: 0.6364 - val_loss: 1.1314
Epoch 3/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 26s 65ms/step - accuracy: 0.6691 - loss: 1.0577 - val_accuracy: 0.6367 - val_loss: 1.1374
Epoch 4/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 26s 65ms/step - accuracy: 0.6734 - loss: 1.0399 - val_accuracy: 0.6383 - val_loss: 1.1368
Epoch 5/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 26s 65ms/step - accuracy: 0.6770 - loss: 1.0229 - val_accuracy: 0.6353 - val_loss: 1.1649
Epoch 6/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 26s 64ms/step - accuracy: 0.6773 - loss: 1.0170 - val_accuracy: 0.6348 - val_loss: 1.1434
Epoch 7/8
...

<keras.src.callbacks.history.History at 0x29f23435e10>


### 4. Evaluation
- Evaluate model on test set.
- Show:
  - **Accuracy score**
  - **Confusion matrix**
- Print a few **example predictions**:
  - Input review
  - True label
  - Predicted sentiment


In [31]:
from sklearn.metrics import accuracy_score,  confusion_matrix
preds = rnn_model.predict(X_test)
preds = np.argmax(preds, axis=1)
true_labels = np.argmax(y_test, axis=1)
rnn_accuracy = accuracy_score(true_labels, preds)
rnn_cm = confusion_matrix(true_labels, preds)
print(f'RNN Test Accuracy: {rnn_accuracy}')
print(f'RNN Confusion Matrix:{rnn_cm}')

[1m250/250[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step
RNN Test Accuracy: 0.6335
RNN Confusion Matrix:[[  10    0    5   11  737]
 [   1    3    5   12  375]
 [   0    4   11   13  602]
 [   6    4   10   30 1039]
 [  16    9   29   54 5014]]


In [36]:
print(f'Input: {X_test[0]} => Predicted Sentiment: {np.argmax(rnn_model.predict(np.array([X_test[0]])), axis=1)[0]}, True Sentiment: {true_labels[0]}')
print(f'Input: {X_test[1]} => Predicted Sentiment: {np.argmax(rnn_model.predict(np.array([X_test[1]])), axis=1)[0]}, True Sentiment: {true_labels[1]}')
print(f'Input: {X_test[2]} => Predicted Sentiment: {np.argmax(rnn_model.predict(np.array([X_test[2]])), axis=1)[0]}, True Sentiment: {true_labels[2]}')


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step
Input: [  64 1167  373  289  334  162   59  805  160  727  367 1916  490   91
    7  230  533   97  401    7  230   33  869   80    2   60    5    7
   16    2  182   32    6    7 1473  131   40    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0] => Predicted Sentiment: 4, True Sentiment: 3
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
Input: [   8   73 1665   72  244   36   76   30  106  694    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    


### 5. User Input Prediction
- After training, let the user type **any review sentence**.  
- Preprocess it the same way (tokenize + pad).  
- Pass it to the trained RNN model.  
- Print the predicted sentiment (Rating 1–5).  

## Hints
- Start with a **small subset** of the dataset (e.g., 20k reviews) to save training time.
- Watch for **overfitting** (training accuracy much higher than validation).
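
One way to act on the overfitting hint is to watch validation loss and stop once it no longer improves. The sketch below is plain Python with a hypothetical helper, `early_stopping_epoch`, applied to the validation losses logged for the first six epochs of the first RNN run in this notebook. It only illustrates the logic; in Keras, the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights`) offers similar behavior during `fit`.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` epochs in a row fail to beat the best loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss for epochs 1-6 of the first RNN run, copied from the training log.
val_losses = [1.1238, 1.1312, 1.1547, 1.1994, 1.2049, 1.1944]

best, stopped = early_stopping_epoch(val_losses, patience=2)
print(f"best epoch: {best}, training would stop after epoch {stopped}")
```

On these values the validation loss is lowest after the first epoch, so later epochs mostly improve the fit to the training data.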

In [37]:
def User_input(text):
    processed_text = preprocess_text(text)
    sequence = tokenizer.texts_to_sequences([processed_text])
    padded_sequence = pad_sequences(sequence, maxlen=100, padding='post')
    prediction = rnn_model.predict(padded_sequence)
    predicted_label = np.argmax(prediction, axis=1)[0]
    print(f'Input: {text} => Predicted Sentiment: {predicted_label}')

In [38]:
User_input("This product is great and works well!")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
Input: This product is great and works well! => Predicted Sentiment: 4
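
Note that `np.argmax` returns a class index from 0 to 4, not a star rating. Assuming the one-hot columns follow the ascending order of the scores (scikit-learn's `OneHotEncoder` sorts categories by default, and `ohe.categories_` can confirm it), index 4 corresponds to 5 stars. Here is a minimal sketch of the mapping, with a hypothetical helper `index_to_rating` and made-up probabilities:

```python
RATINGS = (1, 2, 3, 4, 5)  # assumed column order of the one-hot labels

def index_to_rating(probabilities, ratings=RATINGS):
    """Return the star rating whose class has the highest probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return ratings[best_index]

example_probs = [0.05, 0.05, 0.10, 0.20, 0.60]  # illustrative values, not model output
print(index_to_rating(example_probs))
```

The same mapping applies to the "True Sentiment" values printed in the example predictions, since they also come from `np.argmax` over the one-hot labels.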


# Part 2 – LSTM
1. **Build LSTM Model**
   - Keep preprocessing exactly the same.
   - Replace `SimpleRNN` with `LSTM` (e.g., 64–128 units).
   - Dense output (sigmoid or softmax).
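
Swapping `SimpleRNN` for `LSTM` also changes the model size, because an LSTM has four weight sets (three gates plus the cell candidate) where a simple RNN has one. The sketch below uses the standard parameter-count formulas for these layers with bias enabled. The helper names are hypothetical, and the embedding layer is left out because its size depends on the vocabulary.

```python
def simple_rnn_params(input_dim, units):
    """One weight set: input kernel, recurrent kernel, and bias."""
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    """Four weight sets (input, forget, cell, output) of the same shape."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights plus one bias per output unit."""
    return units * (input_dim + 1)

embedding_dim = 128  # output_dim of the Embedding layer in this notebook

rnn_total = simple_rnn_params(embedding_dim, 64) + dense_params(64, 5)
lstm_total = lstm_params(embedding_dim, 128) + dense_params(128, 5)

print(f"SimpleRNN(64) + Dense(5): {rnn_total:,} parameters")
print(f"LSTM(128) + Dense(5):     {lstm_total:,} parameters")
```

The larger recurrent layer is one reason each LSTM epoch takes noticeably longer than a SimpleRNN epoch on the same data.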


In [39]:
lstm_model = Sequential(
    [
        Embedding(input_dim=len(word_index)+1, output_dim=128, input_length=padded_sequences.shape[1]),
        LSTM(128),
        Dropout(0.7),
        Dense(5, activation='softmax')
    ]
)




2. **Training**
   - Same setup (adam, crossentropy, accuracy).
   - Train for 5–10 epochs.


In [40]:

lstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
lstm_model.summary()
lstm_model.fit(X_train, y_train, epochs=8, batch_size=64, validation_split=0.2)

Epoch 1/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 93s 226ms/step - accuracy: 0.6403 - loss: 1.1586 - val_accuracy: 0.6392 - val_loss: 1.1261
Epoch 2/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 98s 244ms/step - accuracy: 0.6425 - loss: 1.1337 - val_accuracy: 0.6391 - val_loss: 1.1220
Epoch 3/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 97s 243ms/step - accuracy: 0.6504 - loss: 1.1083 - val_accuracy: 0.6403 - val_loss: 1.1199
Epoch 4/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 98s 245ms/step - accuracy: 0.6595 - loss: 1.0851 - val_accuracy: 0.6375 - val_loss: 1.1294
Epoch 5/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 102s 255ms/step - accuracy: 0.6626 - loss: 1.0708 - val_accuracy: 0.6348 - val_loss: 1.1611
Epoch 6/8
400/400 ━━━━━━━━━━━━━━━━━━━━ 98s 244ms/step - accuracy: 0.6685 - loss: 1.0232 - val_accuracy: 0.6555 - val_loss: 1.0125
Epoch 7/8
...

<keras.src.callbacks.history.History at 0x29eae12af10>


3. **Evaluation**
   - Compare accuracy with RNN model.
   - Show confusion matrix.
   - Example predictions.


In [43]:
preds = lstm_model.predict(X_test)
preds = np.argmax(preds, axis=1)
true_labels = np.argmax(y_test, axis=1)
lstm_accuracy = accuracy_score(true_labels, preds)
lstm_cm = confusion_matrix(true_labels, preds)
print(f'LSTM Test Accuracy: {lstm_accuracy}')
print(f'LSTM Confusion Matrix:{lstm_cm}')

250/250 ━━━━━━━━━━━━━━━━━━━━ 5s 19ms/step
LSTM Test Accuracy: 0.667375
LSTM Confusion Matrix:[[ 345    4    7  257  150]
 [ 103    7    6  176  104]
 [  83    2   12  250  283]
 [  30    2    9  223  825]
 [  41    2    5  322 4752]]
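
Overall accuracy hides how each rating is handled, so it helps to read the two confusion matrices row by row. The sketch below copies the RNN and LSTM matrices printed in this notebook (rows are true classes, columns are predictions, following the scikit-learn convention) and computes per-class recall and its unweighted mean. The helper names `per_class_recall` and `accuracy` are hypothetical.

```python
rnn_cm = [
    [10, 0, 5, 11, 737],
    [1, 3, 5, 12, 375],
    [0, 4, 11, 13, 602],
    [6, 4, 10, 30, 1039],
    [16, 9, 29, 54, 5014],
]
lstm_cm = [
    [345, 4, 7, 257, 150],
    [103, 7, 6, 176, 104],
    [83, 2, 12, 250, 283],
    [30, 2, 9, 223, 825],
    [41, 2, 5, 322, 4752],
]

def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

for name, cm in (("RNN", rnn_cm), ("LSTM", lstm_cm)):
    recalls = per_class_recall(cm)
    macro = sum(recalls) / len(recalls)
    formatted = ", ".join(f"{r:.3f}" for r in recalls)
    print(f"{name}: accuracy={accuracy(cm):.4f}, macro recall={macro:.3f}")
    print(f"  recall for 1-5 stars: {formatted}")
```

On these numbers the LSTM recovers far more 1-star reviews than the RNN, while both models still send most 2- and 3-star reviews to other classes.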


In [45]:
print(f'Input: {X_test[10]} => Predicted Sentiment: {np.argmax(lstm_model.predict(np.array([X_test[10]])), axis=1)[0]}, True Sentiment: {true_labels[10]}')
print(f'Input: {X_test[20]} => Predicted Sentiment: {np.argmax(lstm_model.predict(np.array([X_test[20]])), axis=1)[0]}, True Sentiment: {true_labels[20]}')
print(f'Input: {X_test[30]} => Predicted Sentiment: {np.argmax(lstm_model.predict(np.array([X_test[30]])), axis=1)[0]}, True Sentiment: {true_labels[30]}')


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
Input: [  433     6     8     6   523    11   389     9  5246   124    30   165
 15458   359    28  1278   671  1881    11   107     5   134  2596  1337
   389     9   722   221   621  3364  2991  1383     6    58   689   466
  8223    54  4086     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] => Predicted Sentiment: 4, True Sentiment: 4
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
Input: [   27   291   117   181    23    61   795   291   242  3444 38055 11434
   291   756   293   516   451    63   175     0     0     0     0     0
     0     0     0     0     0     0     


4. **User Input Prediction**
   - Take a sentence from user.
   - Preprocess (tokenize + pad).
   - Predict sentiment using the LSTM model.


In [None]:
def User_input(text):
    processed_text = preprocess_text(text)
    sequence = tokenizer.texts_to_sequences([processed_text])
    padded_sequence = pad_sequences(sequence, maxlen=100, padding='post')
    prediction = lstm_model.predict(padded_sequence)
    predicted_label = np.argmax(prediction, axis=1)[0]
    print(f'Input: {text} => Predicted Sentiment: {predicted_label}')

In [50]:
User_input("I  Hate this product. It is terrible and does not work at all!")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
Input: I  Hate this product. It is terrible and does not work at all! => Predicted Sentiment: 3


# RNN vs LSTM Performance Comparison

## Accuracy Scores

| Model | Training Accuracy | Validation Accuracy | Test Accuracy |
|-------|-------------------|---------------------|---------------|
| RNN   | 0.6789            | 0.6334              | 0.6335        |
| LSTM  | 0.7271            | 0.6652              | 0.667375      |
