# Building a Text Classification Pipeline – Word Embedding Exploration


## Dataset: IMDb Movie Reviews (Sentiment Analysis)

## 1. Data Acquisition & Exploration

In [1]:
import pandas as pd
import numpy as np

# Download IMDb dataset (from keras.datasets for simplicity)
from keras.datasets import imdb

# Load dataset (keep top 10k words)
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

print("Train size:", len(X_train))
print("Test size:", len(X_test))
print("Classes:", set(y_train))

Train size: 25000
Test size: 25000
Classes: {np.int64(0), np.int64(1)}


- **Objective:** Classify movie reviews as positive or negative.
- **Stakeholder:** Film studios analyzing audience sentiment.

## 2. Pre‑processing Pipeline

In [2]:
from keras.preprocessing.sequence import pad_sequences
from keras.datasets import imdb

# Convert integers back to words
word_index = imdb.get_word_index()
reverse_word_index = {v: k for k, v in word_index.items()}

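def decode_review(encoded_review):
    # Keras shifts word ranks by 3: indices 0, 1 and 2 are reserved for
    # padding, start-of-sequence and out-of-vocabulary, so they decode to "?".
    return " ".join(reverse_word_index.get(i - 3, "?") for i in encoded_review)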

print(decode_review(X_train[0]))

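# Pad or truncate each review to 200 tokens (Keras defaults: zeros are added
# and tokens are dropped at the start of the sequence).
# Note: this overwrites X_train, so later cells that call decode_review on it
# see the padded arrays, where padding zeros decode to "?".
X_train = pad_sequences(X_train, maxlen=200)
X_test = pad_sequences(X_test, maxlen=200)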

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list 
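
The `?` tokens in the decoded review come from the index offset used by `imdb.load_data`. The following is a minimal sketch of that convention on a toy vocabulary, assuming the Keras defaults (0 for padding, 1 for start-of-sequence, 2 for out-of-vocabulary, word ranks shifted by 3). The names `toy_word_index`, `decode_with_specials` and `pad_pre` are hypothetical and are not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}

# Assumed Keras defaults: a word of rank r is stored as r + 3,
# because 0, 1 and 2 are reserved for the special tokens below.
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def decode_with_specials(encoded_review):
    """Decode a list of integer ids, naming the reserved ids explicitly."""
    tokens = []
    for i in encoded_review:
        if i in SPECIAL_TOKENS:
            tokens.append(SPECIAL_TOKENS[i])
        else:
            tokens.append(reverse_toy_index.get(i - 3, "?"))
    return " ".join(tokens)


def pad_pre(sequence, maxlen, value=0):
    """Keep the last maxlen ids and left-pad with value (maxlen must be >= 1).

    This mirrors the default behaviour of pad_sequences for a single sequence.
    """
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


encoded = [1, 4, 5, 6, 7, 2]
print(decode_with_specials(encoded))
print(decode_with_specials(pad_pre(encoded, 8)))
```

Naming the reserved ids makes it easier to tell padding, the start marker, and words outside the top 10,000 apart, all of which `decode_review` renders as `?`.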

## 3. Feature Engineering
### Sparse Features: Bag‑of‑Words / TF‑IDF

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example with small sample (for demonstration)
sample_texts = [decode_review(x) for x in X_train[:500]]

bow = CountVectorizer(max_features=5000)
X_bow = bow.fit_transform(sample_texts)

tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(sample_texts)
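
For intuition about what `TfidfVectorizer` computes, here is a minimal pure-Python sketch of TF-IDF with smoothed IDF and L2 row normalization, which is scikit-learn's default weighting. The function name `tfidf_rows` and the whitespace tokenizer are illustrative assumptions; scikit-learn's own tokenizer differs (for example, it drops single-character tokens), so values on real reviews will not match exactly.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["good film", "bad film"])
print(rows[0])
```

In this toy corpus, "film" appears in both documents and therefore receives a lower weight than "good" or "bad", which is the effect that separates TF-IDF from raw counts.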

### Dense Features: Word2Vec Embeddings

In [6]:
!pip install gensim



In [5]:
import numpy as np
from gensim.models import Word2Vec

sentences = [decode_review(x).split() for x in X_train[:500]]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

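def avg_word2vec(tokens):
    # Average the vectors of in-vocabulary tokens; fall back to a zero vector
    # when none of the tokens were kept by the model (min_count=2).
    vectors = [w2v_model.wv[w] for w in tokens if w in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.wv.vector_size)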

X_w2v = np.array([avg_word2vec(s) for s in sentences])



## 4. Modelling & Evaluation

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Example: TF-IDF + Naive Bayes
# Hold out 20% of the 500 sampled reviews (100 reviews) for validation
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_tfidf, y_train[:500], test_size=0.2, random_state=42
)

nb = MultinomialNB()
nb.fit(X_train_split, y_train_split)
print("Naive Bayes:\n", classification_report(y_val_split, nb.predict(X_val_split)))

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_split, y_train_split)
print("Logistic Regression:\n", classification_report(y_val_split, lr.predict(X_val_split)))

Naive Bayes:
               precision    recall  f1-score   support

           0       0.69      0.92      0.79        50
           1       0.88      0.58      0.70        50

    accuracy                           0.75       100
   macro avg       0.78      0.75      0.74       100
weighted avg       0.78      0.75      0.74       100

Logistic Regression:
               precision    recall  f1-score   support

           0       0.73      0.74      0.73        50
           1       0.73      0.72      0.73        50

    accuracy                           0.73       100
   macro avg       0.73      0.73      0.73       100
weighted avg       0.73      0.73      0.73       100




We trained two classifiers on TF-IDF features extracted from the first 500 IMDb training reviews, holding out 20% (100 reviews) for validation.

### Multinomial Naive Bayes
- Class 0 (Negative): Precision = 0.69, Recall = 0.92, F1-Score = 0.79  
- Class 1 (Positive): Precision = 0.88, Recall = 0.58, F1-Score = 0.70  
- Overall Accuracy: 0.75  
- Macro F1-Score: 0.74  

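Insight: Naive Bayes recovers most negative reviews (recall 0.92) but misses many positive ones (recall 0.58), so it leans toward predicting the negative class. With only 400 training reviews, this may reflect word-frequency effects in the small sample rather than a stable property of the model.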
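
The per-class figures for Naive Bayes can be traced back to a confusion matrix. The counts below are not printed by the notebook; they are inferred from the rounded report (50 reviews per class, recalls of 0.92 and 0.58) and should be read as a reconstruction. The helper name `per_class_metrics` is illustrative.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts (assumption): 46 of 50 negatives and 29 of 50 positives
# were classified correctly.
neg_correct, neg_wrong = 46, 4
pos_correct, pos_wrong = 29, 21

# For the negative class, false positives are positive reviews predicted negative.
negative = per_class_metrics(tp=neg_correct, fp=pos_wrong, fn=neg_wrong)
# For the positive class, false positives are negative reviews predicted positive.
positive = per_class_metrics(tp=pos_correct, fp=neg_wrong, fn=pos_wrong)

accuracy = (neg_correct + pos_correct) / 100
macro_f1 = (negative[2] + positive[2]) / 2

print("negative:", [round(m, 2) for m in negative])
print("positive:", [round(m, 2) for m in positive])
print("accuracy:", accuracy, "macro F1:", round(macro_f1, 2))
```

Under these counts, 21 positive reviews are labelled negative while only 4 negative reviews are labelled positive, which is the asymmetry behind the low positive recall.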

### Logistic Regression
- Class 0 (Negative): Precision = 0.73, Recall = 0.74, F1-Score = 0.73  
- Class 1 (Positive): Precision = 0.73, Recall = 0.72, F1-Score = 0.73  
- Overall Accuracy: 0.73  
- Macro F1-Score: 0.73  

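Insight: Logistic Regression performs evenly across both classes, with precision and recall near 0.73 for each. Its accuracy is 0.02 lower than Naive Bayes, which amounts to two reviews on a 100-review validation split, so these results do not show that either model generalizes better.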


### Summary Table

| Model               | Features | Accuracy | Precision | Recall | F1-Score |
|---------------------|----------|----------|-----------|--------|----------|
| Naive Bayes         | TF-IDF   | 0.75     | 0.78      | 0.75   | 0.74     |
| Logistic Regression | TF-IDF   | 0.73     | 0.73      | 0.73   | 0.73     |

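Note: Precision, recall, and F1-score in this table are macro averages. The results come from a 100-review validation split drawn from the first 500 training reviews; the 25,000-review test set is not used. Performance may improve with more data, tuning, or richer features such as embeddings.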
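
To give a rough sense of how much a 100-review validation split can resolve, the sketch below computes a normal-approximation 95% interval for each accuracy. This is a back-of-the-envelope check under the assumption that validation reviews are independent; it treats the two models separately even though they were scored on the same reviews. The function name `accuracy_interval` is illustrative.

```python
import math


def accuracy_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_samples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - z * standard_error, accuracy + z * standard_error


for name, acc in [("Naive Bayes", 0.75), ("Logistic Regression", 0.73)]:
    low, high = accuracy_interval(acc, 100)
    print(f"{name}: {acc:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

Each interval spans roughly 0.17, far wider than the 0.02 gap between the two models.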

## 5. Results

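| Model               | Features | Accuracy | Precision | Recall | F1-Score |
|---------------------|----------|----------|-----------|--------|----------|
| Naive Bayes         | TF-IDF   | 0.84     | 0.83      | 0.84   | 0.83     |
| Logistic Regression | TF-IDF   | 0.88     | 0.87      | 0.88   | 0.87     |
| Logistic Regression | Word2Vec | 0.82     | 0.81      | 0.82   | 0.81     |
| Linear SVM          | TF-IDF   | 0.89     | 0.88      | 0.89   | 0.88     |

Note: These figures do not come from the cells shown above, which report accuracies of 0.75 and 0.73 on a 100-review validation split. No code in this notebook trains the Word2Vec or Linear SVM models.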

## 6. Analysis & Discussion

- **Naive Bayes**
  - Accuracy: 0.75
  - Strong recall for negative reviews (0.92)
  - Weaker recall for positive reviews (0.58)
  - Quick baseline model, but skewed toward predicting the negative class on this split

- **Logistic Regression**
  - Accuracy: 0.73
  - Balanced precision/recall across both classes (~0.73 each)
  - More stable predictions, less biased than Naive Bayes
  - Accuracy is 0.02 lower, a gap of two validation reviews, so generalization cannot be ranked from this split

- **Feature Representation**
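  - TF-IDF reached 0.73 to 0.75 accuracy with only 400 training reviews
  - The averaged Word2Vec features built in Section 3 (`X_w2v`) were not used for modelling; embeddings (Word2Vec/GloVe) could be explored for richer context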
  - N-grams (like bigrams) may capture phrases such as “not good” for improved sentiment detection

- **Trade-offs**
  - Naive Bayes: faster, simpler, but less balanced
  - Logistic Regression: slower, more robust, interpretable coefficients
  - TF-IDF: strong baseline; embeddings may help with larger datasets
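
As an illustration of the n-gram idea mentioned above, here is a minimal pure-Python sketch that extracts unigrams and bigrams from a token list. The function name `word_ngrams` is hypothetical; in scikit-learn the comparable effect is obtained by passing `ngram_range=(1, 2)` to `CountVectorizer` or `TfidfVectorizer`.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("the film was not good".split()))
```

With bigrams included, "not good" becomes a feature of its own, so a linear model can assign it a weight independent of the weights for "not" and "good".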

## 7. Conclusion

- This project demonstrated how classical NLP models can be applied to sentiment analysis using TF-IDF features.
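- Naive Bayes and Logistic Regression offered different strengths: Naive Bayes is the simpler, faster baseline, while Logistic Regression gave more balanced precision and recall across the two classes.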