# 📘 Sentiment Analysis on IMDB Dataset using Keras (Beginner Friendly)

In this notebook, we will:
1. Understand the **problem statement**.
2. Explore the **IMDB dataset structure**.
3. Apply a **traditional Bag-of-Words/TF-IDF approach** for text processing.
4. Build and train a **Neural Network model with Keras**.
5. Evaluate the performance of our model.

---
### Problem Statement:
We want to classify **movie reviews** as **positive** or **negative**. This is a classic **binary text classification** problem.

In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


## Step 1: Load the IMDB Dataset
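Keras provides the IMDB dataset pre-tokenized: each review is stored as a list of integer word indices rather than as text. To get a better **understanding of the dataset structure**, we'll load this integer-encoded version and decode it back into **readable text** using the word index that ships with Keras.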

In [None]:
# Load raw IMDB dataset (Keras provides integer-encoded version)
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)
word_index = keras.datasets.imdb.get_word_index()

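# Reverse the word index (word -> rank) so we can look up words by rank
reverse_word_index = {value: key for (key, value) in word_index.items()}

# Function to decode reviews back to text.
# load_data() offsets every index by 3: 0, 1 and 2 are reserved for
# padding, start-of-sequence and unknown words, which are shown as '?'.
def decode_review(text_ids):
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in text_ids])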

# Show one decoded review
print("Example review (decoded):\n")
print(decode_review(X_train[0]))
print("\nLabel:", y_train[0])

Here we:
- Loaded the IMDB dataset.
- Decoded integer tokens back to **text reviews**.
- Checked the structure of data: `X_train` contains the reviews as lists of integer word indices, `y_train` contains sentiment labels (0 = negative, 1 = positive).
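
To make the index offset in `decode_review` concrete, here is a minimal sketch on a toy vocabulary. The words, ranks, and the names `toy_word_index` and `toy_decode` are made up for illustration; only the offset of 3 and the reserved indices mirror the Keras defaults.

```python
# Toy vocabulary in the same format as get_word_index(): word -> frequency rank.
# The words and ranks are invented for illustration.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def toy_decode(token_ids):
    # load_data() shifts every rank by 3 by default, reserving
    # 0 for padding, 1 for the start marker and 2 for unknown words.
    return " ".join(toy_reverse_index.get(i - 3, "?") for i in token_ids)

# 1 = start marker, 2 = out-of-vocabulary word; the others are rank + 3.
encoded = [1, 4, 5, 6, 2, 7]
decoded = toy_decode(encoded)
print(decoded)  # ? the movie was ? great
```

The leading `?` is the start marker, which is why decoded IMDB reviews also begin with a `?`.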

In [None]:
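# Convert reviews to text for Bag-of-Words processing.
# We keep only a subset (5000 train / 2000 test reviews) so that the dense
# TF-IDF matrices built later stay small and training stays fast.
X_train_text = [decode_review(x) for x in X_train[:5000]]
X_test_text = [decode_review(x) for x in X_test[:2000]]
y_train_sample = y_train[:5000]
y_test_sample = y_test[:2000]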

# Show a small sample
print("First training review:", X_train_text[0][:500])
print("Label:", y_train_sample[0])

## Step 2: Convert Text to Features (Traditional Bag-of-Words / TF-IDF)
Since machine learning models need **numeric input**, we will transform text into vectors.
- **CountVectorizer** → converts text into word count vectors.
- **TfidfVectorizer** → starts from the same counts, then down-weights words that appear in many reviews and up-weights rarer, more distinctive ones.

We will use **TF-IDF** here.
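
The difference between raw counts and TF-IDF weights can be sketched by hand on three made-up mini reviews. This is an illustrative sketch, not the scikit-learn implementation: it uses plain whitespace splitting instead of scikit-learn's tokenizer, and assumes the default settings (smoothed IDF, L2-normalised rows). The names `docs`, `idf`, and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Three invented mini reviews (not taken from IMDB).
docs = ["good movie", "bad movie", "good good film"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: in how many reviews each word appears.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)                    # bag-of-words counts
    raw = [counts[w] * idf[w] for w in vocab]   # weight each count by its IDF
    norm = math.sqrt(sum(v * v for v in raw))   # L2 length of the row
    return [v / norm for v in raw]

count_vectors = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
tfidf_vectors = [tfidf_vector(doc) for doc in tokenized]

print("vocab: ", vocab)
print("counts:", count_vectors)
print("tf-idf:", [[round(v, 3) for v in row] for row in tfidf_vectors])
```

Words that occur in fewer reviews ("bad", "film") receive a larger IDF than words that occur in several ("good", "movie"), so they contribute more to a review's vector.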

In [None]:
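# Keep only the 5000 most frequent terms as features
tfidf = TfidfVectorizer(max_features=5000)
# Fit on the training text only, then reuse the same vocabulary for the test text
X_train_tfidf = tfidf.fit_transform(X_train_text).toarray()
X_test_tfidf = tfidf.transform(X_test_text).toarray()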

print("TF-IDF shape:", X_train_tfidf.shape)

Now we have numeric vectors for each review.
- Shape `(5000, 5000)` → 5000 reviews (rows), each represented by 5000 features (columns).
- Each feature corresponds to one of the 5000 most frequent terms in the training reviews, as selected by `max_features=5000`.

In [None]:
# Example: first review as numeric vector
print("First review vector (first 20 features):")
print(X_train_tfidf[0][:20])

## Step 3: Build Neural Network Model with Keras
We will use a simple **Feedforward Neural Network**:
- Input: 5000 features (from TF-IDF).
- Hidden layer: 128 units with ReLU activation, followed by Dropout (rate 0.5) to reduce overfitting.
- Output: Single neuron with Sigmoid activation (for binary classification).
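
What such a network computes for a single review can be sketched in plain Python. This is a toy forward pass with made-up weights and only 3 input features and 2 hidden units, not the Keras model itself. The helper `dense` and all numbers are hypothetical. Dropout is left out because it is only active during training.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    # weights[j] holds the incoming weights of unit j.
    return [activation(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]

# Invented numbers: 3 input features, 2 hidden units, 1 output unit.
x = [0.5, 0.0, 0.2]
hidden_w = [[1.0, -1.0, 0.5], [-2.0, 0.3, 0.0]]
hidden_b = [0.0, 0.1]
output_w = [[0.8, -0.4]]
output_b = [-0.3]

hidden = dense(x, hidden_w, hidden_b, relu)               # ReLU clips negatives to 0
prob = dense(hidden, output_w, output_b, sigmoid)[0]      # value between 0 and 1
label = int(prob > 0.5)                                   # 1 = positive, 0 = negative

print("hidden activations:", hidden)
print("predicted probability:", round(prob, 3), "-> label", label)
```

The sigmoid output can be read as the predicted probability that the review is positive, which is why a 0.5 threshold turns it into a class label.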

In [None]:
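model = keras.Sequential([
    keras.Input(shape=(X_train_tfidf.shape[1],)),  # one input per TF-IDF feature
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),  # randomly drops half of the hidden activations during training
    layers.Dense(1, activation='sigmoid')  # probability that the review is positive
])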

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
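
As a sanity check on the parameter counts that `model.summary()` prints, the numbers can be worked out by hand. A minimal sketch, assuming exactly 5000 input features. `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

hidden_params = dense_params(5000, 128)  # 5000 * 128 weights + 128 biases
output_params = dense_params(128, 1)     # 128 weights + 1 bias
total_params = hidden_params + output_params  # Dropout adds no parameters

print(hidden_params, output_params, total_params)  # 640128 129 640257
```

Almost all of the parameters sit in the first layer, because every one of the 128 hidden units is connected to all 5000 TF-IDF features.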

## Step 4: Train the Model

In [None]:
# Hold out the last 20% of the training sample for validation,
# so the test set stays unseen until the final evaluation.
history = model.fit(
    X_train_tfidf, y_train_sample,
    epochs=5,
    batch_size=32,
    validation_split=0.2
)
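
The `loss` and `val_loss` values printed during training are binary cross-entropy averages. A minimal sketch of that formula on made-up predictions follows. The function below is a hand-written illustration with a hypothetical clipping constant `eps`, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the true labels
always_unsure = [0.5, 0.5, 0.5, 0.5]     # no information at all
confident_wrong = [0.1, 0.9, 0.2, 0.8]   # probabilities close to the wrong labels

loss_right = binary_crossentropy(labels, confident_right)
loss_unsure = binary_crossentropy(labels, always_unsure)
loss_wrong = binary_crossentropy(labels, confident_wrong)

print(round(loss_right, 3), round(loss_unsure, 3), round(loss_wrong, 3))
```

A model that always outputs 0.5 scores ln 2 ≈ 0.693, so a training loss well below that value suggests the network is learning something from the features.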

## Step 5: Evaluate the Model

In [None]:
loss, acc = model.evaluate(X_test_tfidf, y_test_sample)
print(f"Test Accuracy: {acc*100:.2f}%")

# Detailed classification report
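# predict() returns probabilities with shape (n, 1); threshold at 0.5 and flatten to 1-D labels
y_pred = (model.predict(X_test_tfidf) > 0.5).astype("int32").ravel()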
print(classification_report(y_test_sample, y_pred))
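
The precision, recall, and F1 columns in the classification report can be reproduced by hand for the positive class. A minimal sketch on made-up labels and probabilities. `binary_report` is a hypothetical helper, not part of scikit-learn.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    y_pred = [int(p > threshold) for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives found
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives flagged as positive
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Invented example: four positive and two negative reviews.
true_labels = [1, 1, 1, 1, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.3, 0.7, 0.2]
report = binary_report(true_labels, probabilities)
print({name: round(value, 3) for name, value in report.items()})
```

Here two of the three predicted positives are correct (precision 2/3), but only two of the four real positives are found (recall 1/2), which accuracy alone would not reveal.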

## 🎯 Summary
- We **loaded and explored** the IMDB dataset.
- We decoded the integer-encoded reviews back into **readable text** and inspected their **labels**.
- We applied **TF-IDF vectorization** (traditional NLP method).
- We built a **Neural Network with Keras**.
- We trained and evaluated our model.

This case study shows how to move from **raw text → numeric features → deep learning model**.