## Sentiment Analysis Classifier with Bag of Words and Word Embeddings

### 1. Open Data File

In [5]:
import pandas as pd

# Path to the review data (adjust for your machine)
csv_file = "C:\\Users\\me\\Downloads\\chunk_0000.csv"

df = pd.read_csv(csv_file)

# Display a few rows
display(df.head())

Unnamed: 0,label,title,content
0,1,Toasts great but difficult to remove English m...,I love the way this toaster evenly toasts brea...
1,1,round the outside.. round the outside...,"This is a very important record, for a number ..."
2,0,What is this book's appeal?,I feel a little silly to be about the only one...
3,1,A Must Have for Literacy Instruction,Excellent resource for those teachers wanting ...
4,1,SADE: Lovers Rock,What can I say...this woman is absolutely incr...


### 2. Preprocessing the Data

In [6]:
# Merge title and content into one text field
df['text'] = df['title'].fillna('') + " " + df['content'].fillna('')

# Features and labels
X = df['text']
y = df['label']

### Part A: Sentiment Classifier using Bag of Words (BoW)

#### 3A. BoW Feature Extraction and Model Training

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Bag of Words vectorizer
vectorizer = CountVectorizer()

# Transform text into BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train a Logistic Regression classifier
bow_model = LogisticRegression(max_iter=1000)
bow_model.fit(X_train_bow, y_train)

# Predict and evaluate
y_pred_bow = bow_model.predict(X_test_bow)

print("BoW Classifier Results")
print("Accuracy:", accuracy_score(y_test, y_pred_bow))
print(classification_report(y_test, y_pred_bow))

BoW Classifier Results
Accuracy: 0.88745
              precision    recall  f1-score   support

           0       0.89      0.88      0.89      9947
           1       0.89      0.89      0.89     10053

    accuracy                           0.89     20000
   macro avg       0.89      0.89      0.89     20000
weighted avg       0.89      0.89      0.89     20000



### Part B: Sentiment Classifier using Word Embeddings (Gensim)
#### 3B. Word Embedding Feature Extraction and Model Training

In [8]:
import nltk
import numpy as np
import string
from nltk.tokenize import word_tokenize
from gensim.downloader import load

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\me\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load pre-trained word embeddings

In [9]:
import nltk
import numpy as np
import string
from nltk.tokenize import word_tokenize
from gensim.downloader import load

nltk.download('punkt')

# Load pre-trained word embeddings (these are GloVe vectors, despite the word2vec_model variable name)
word2vec_model = load('glove-wiki-gigaword-100')  # 100-dimensional GloVe vectors

# Helper function: average word embeddings for a text
def get_avg_word2vec(text, model, k=100):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    vectors = []
    for token in tokens:
        if token in model:
            vectors.append(model[token])
    if len(vectors) == 0:
        return np.zeros(k)
    else:
        return np.mean(vectors, axis=0)

# Create embeddings for all texts
X_vectors = np.vstack([get_avg_word2vec(text, word2vec_model) for text in X])

# Train/test split for embeddings
X_train_vec, X_test_vec, y_train_vec, y_test_vec = train_test_split(X_vectors, y, test_size=0.2, random_state=42)

# Train a Logistic Regression classifier
embedding_model = LogisticRegression(max_iter=1000)
embedding_model.fit(X_train_vec, y_train_vec)

# Predict and evaluate
y_pred_vec = embedding_model.predict(X_test_vec)

print("Embedding Classifier Results")
print("Accuracy:", accuracy_score(y_test_vec, y_pred_vec))
print(classification_report(y_test_vec, y_pred_vec))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\me\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Embedding Classifier Results
Accuracy: 0.801
              precision    recall  f1-score   support

           0       0.80      0.80      0.80      9947
           1       0.80      0.80      0.80     10053

    accuracy                           0.80     20000
   macro avg       0.80      0.80      0.80     20000
weighted avg       0.80      0.80      0.80     20000



### Final Comparison

In [10]:
print("\nSummary of Results:")
print(f"BoW Accuracy: {accuracy_score(y_test, y_pred_bow):.4f}")
print(f"Embedding Accuracy: {accuracy_score(y_test_vec, y_pred_vec):.4f}")


Summary of Results:
BoW Accuracy: 0.8874
Embedding Accuracy: 0.8010


## Technical reasons why BoW still beats averaged embeddings, even with 100,000 samples:

### 1. **Averaging word embeddings dilutes negation**
- When you average vectors, "good" and "not good" look almost the same: the "not" vector is just one more term in the mean.
- Sentiment relies heavily on small words like "not", "never", and "but". In an n-token review, each of them contributes only 1/n of the averaged vector.
- Unigram Bag of Words also ignores word order, but it gives "not" its own dimension and its own learned weight, so the signal is not blended away.
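The dilution of a single negation token can be sketched with made-up vectors. This is a minimal illustration, not the notebook's pipeline: the embeddings are random stand-ins (hypothetical, not GloVe values), and `average_vector`, `bow_vector`, and `cosine` are helper names introduced here. It compares how much one extra "not" changes an averaged vector versus a count vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary with random 100-d stand-in embeddings (NOT real GloVe values).
vocab = [f"w{i}" for i in range(20)] + ["not"]
toy_embeddings = {word: rng.normal(size=100) for word in vocab}
index = {word: i for i, word in enumerate(vocab)}


def average_vector(tokens):
    """Mean of the token embeddings, as in the notebook's averaging helper."""
    return np.mean([toy_embeddings[t] for t in tokens], axis=0)


def bow_vector(tokens):
    """Raw count vector over the toy vocabulary."""
    counts = np.zeros(len(vocab))
    for t in tokens:
        counts[index[t]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two 1-d arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


review = [f"w{i}" for i in range(20)]          # a 20-token review
negated = review[:10] + ["not"] + review[10:]  # the same review with one "not" inserted

# Averaging: one token out of 21 barely moves the mean vector.
avg_similarity = cosine(average_vector(review), average_vector(negated))

# BoW: only the "not" column changes, and a linear model can weight it on its own.
bow_difference = bow_vector(negated) - bow_vector(review)
changed_columns = [vocab[i] for i in np.flatnonzero(bow_difference)]

print(f"cosine(avg, avg with 'not') = {avg_similarity:.3f}")
print("BoW columns that changed:", changed_columns)
```

With real embeddings the size of the effect will differ, but the 1/n scaling of each token's contribution holds for any plain average.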

### 2. **Embeddings are trained for semantic similarity, not for sentiment polarity**
- GloVe and Word2Vec embeddings are trained to capture **meaning similarity**, not **sentiment**.
- Example: "good" and "bad" might end up closer than you want because both are adjectives describing quality.
- BoW directly catches "bad" and "good" as separate dimensions.
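How close "good" and "bad" actually sit in a given embedding space is easy to measure rather than guess. The sketch below assumes only that the embedding object supports `vectors[word]` lookups; `cosine_between` is a helper name introduced here, and the toy vectors merely exercise the function (they say nothing about GloVe).

```python
import numpy as np


def cosine_between(vectors, word_a, word_b):
    """Cosine similarity between two words' vectors in a dict-like lookup."""
    a = np.asarray(vectors[word_a], dtype=float)
    b = np.asarray(vectors[word_b], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-d vectors, chosen only to exercise the function (NOT GloVe values).
toy_vectors = {
    "good": [1.0, 0.0, 0.0],
    "bad": [0.0, 1.0, 0.0],
    "great": [2.0, 0.0, 0.0],
}

parallel = cosine_between(toy_vectors, "good", "great")  # same direction
orthogonal = cosine_between(toy_vectors, "good", "bad")  # no shared direction
print(parallel, orthogonal)
```

In the notebook, `cosine_between(word2vec_model, "good", "bad")` should work with the loaded vectors, and Gensim's `KeyedVectors` also exposes a `similarity(w1, w2)` method for the same purpose.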

### 3. **Pre-trained embeddings are not customized to Amazon review domain**
- GloVe was trained on Wikipedia and news, not on Amazon-style informal, product-centric reviews.
- Words like "warranty", "battery", "durable", "refund", "defective" may not be well-represented.
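One way to check the domain-mismatch concern is to measure how many review tokens the embedding vocabulary covers. The function below is a sketch under assumptions: `vocabulary_coverage` is a name introduced here, it uses a simple whitespace tokenizer instead of the notebook's `word_tokenize`, and the sample reviews and vocabulary are invented. Coverage only shows whether a word has a vector, not whether that vector is a good fit for reviews.

```python
from collections import Counter


def vocabulary_coverage(texts, known_words):
    """Return (coverage, most common out-of-vocabulary tokens).

    `known_words` is anything that supports `token in known_words`.
    """
    seen = 0
    missing = Counter()
    for text in texts:
        for token in text.lower().split():
            token = token.strip(".,!?;:\"'()")  # drop surrounding punctuation
            if not token:
                continue
            seen += 1
            if token not in known_words:
                missing[token] += 1
    coverage = 1.0 if seen == 0 else 1 - sum(missing.values()) / seen
    return coverage, missing.most_common(5)


# Hypothetical mini-corpus and vocabulary, for illustration only.
sample_reviews = ["Battery died, wanted a refund!", "Great toaster, very durable."]
sample_vocab = {"battery", "died", "wanted", "a", "great", "very", "toaster"}

coverage, top_missing = vocabulary_coverage(sample_reviews, sample_vocab)
print(f"coverage = {coverage:.2f}")
print("most common missing tokens:", top_missing)
```

In the notebook this could be called as `vocabulary_coverage(df['text'], word2vec_model)`, since the averaging helper already relies on `token in model`.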

### 4. **Simple Logistic Regression on top of averaged embeddings is too weak**
- More powerful models like **CNNs** or **LSTMs** over the sequences of embeddings could do better.
- Logistic Regression assumes a linear boundary, and embeddings alone may not separate classes linearly enough.

## How to fix this or improve the embedding model:

Here are a few good strategies:

| Strategy | Why It Helps |
|:---------|:-------------|
| TF-IDF Weighted Averaging | Gives more importance to rare, meaningful words and less to common words. |
| Fine-tuning embeddings | Updates embeddings during model training to capture review-specific language. |
| Train a small neural network | Instead of Logistic Regression, use a feedforward network or LSTM/GRU. |
| Use Doc2Vec (Paragraph Vector) | Learns a direct fixed-size vector for the whole document without averaging words. |
| Train embeddings on your Amazon data | Get domain-specific representations of words. |
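The first strategy in the table can be sketched as follows. This is a minimal illustration, not a tuned implementation: `smoothed_idf` and `idf_weighted_average` are names introduced here, the IDF formula mirrors the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default, and the documents and 2-d embeddings are invented. Each token occurrence is weighted by its IDF, so a token's total weight is its term frequency times its IDF.

```python
import math
from collections import Counter

import numpy as np


def smoothed_idf(tokenized_docs):
    """IDF per token: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def idf_weighted_average(tokens, embeddings, idf, dim):
    """IDF-weighted mean of the embeddings of tokens present in both lookups."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for token in tokens:
        if token in embeddings and token in idf:
            total += idf[token] * np.asarray(embeddings[token], dtype=float)
            weight_sum += idf[token]
    # Fall back to the zero vector when no token could be looked up.
    return total / weight_sum if weight_sum > 0 else total


# Hypothetical tokenized reviews and 2-d embeddings, for illustration only.
docs = [["the", "battery", "is", "great"], ["the", "toaster", "is", "broken"]]
toy_embeddings = {
    "the": [1.0, 0.0], "is": [1.0, 0.0],          # common words share axis 0
    "battery": [0.0, 1.0], "great": [0.0, 1.0],   # content words share axis 1
    "toaster": [0.0, 1.0], "broken": [0.0, 1.0],
}

idf = smoothed_idf(docs)
plain = np.mean([toy_embeddings[t] for t in docs[0]], axis=0)
weighted = idf_weighted_average(docs[0], toy_embeddings, idf, dim=2)
print("plain mean:  ", plain)
print("IDF-weighted:", weighted)
```

In this toy setup the weighted vector leans toward the axis of the rarer content words, which is the intended effect. Whether it improves accuracy on the review data has to be measured.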

### Summary:

> "Pre-trained embeddings are not magic. They help when you can use a model that understands sequences and context. For many traditional machine learning models like Logistic Regression, a simple Bag of Words representation can outperform averaged embeddings because it preserves discriminative keywords better."

### We can try:
- TF-IDF weighted averaging of the word vectors, or
- Training a **Doc2Vec** model with Gensim.

Either could narrow the gap to the BoW model, though neither is guaranteed to close it.

### Additional Analysis: Word Importance in BoW Model (Optional)

In [11]:
import numpy as np

# Get feature importance
coefficients = bow_model.coef_[0]
words = vectorizer.get_feature_names_out()
word_importance = pd.DataFrame({'word': words, 'coefficient': coefficients})

# Top 10 Positive Words
print("Top Positive Words:")
display(word_importance.sort_values(by='coefficient', ascending=False).head(10))

# Top 10 Negative Words
print("Top Negative Words:")
display(word_importance.sort_values(by='coefficient', ascending=True).head(10))

Top Positive Words:


Unnamed: 0,word,coefficient
70111,pleasantly,2.068651
9270,awesome,1.923092
36311,flawless,1.91156
89903,susi,1.893056
35040,fave,1.774913
67828,paulina,1.771908
42831,haters,1.757546
33575,exellent,1.746687
48745,invaluable,1.739013
16199,captures,1.725794


Top Negative Words:


Unnamed: 0,word,coefficient
27424,disappointment,-3.143104
27421,disappointing,-2.678993
70748,poorly,-2.524183
102630,worst,-2.431183
13415,boring,-2.275152
27387,disapointing,-2.213675
53743,lesley,-2.210398
27983,dissapointment,-2.196628
102651,worthless,-2.180397
89096,sucks,-2.17729


### Benefits of Word Embeddings for Downstream Applications

Applications where word embeddings outperform BoW:

| Application | Why Word Embeddings Help |
|:------------|:--------------------------|
| **Semantic Search / Similarity** | Embeddings allow you to retrieve similar documents, not just keyword matches. |
| **Clustering Reviews / Topics** | You can group related reviews even if they use different words (synonyms). |
| **Text Recommendation Systems** | Embeddings allow recommending similar products/reviews based on meaning, not keywords. |

Given a review such as "The battery life is excellent", use word embeddings to **retrieve the top 5 most semantically similar reviews**, even if they don't use the word "battery".

**With BoW:** only exact words match.

**With Embeddings:** similar meaning matches (e.g., "long lasting", "holds charge", "durable").
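The BoW half of this contrast can be checked directly. The sketch below uses a hypothetical helper, `bow_cosine`, with plain whitespace tokenization and assumes non-empty inputs. It shows that two texts with no shared tokens get a similarity of exactly zero, however related their meaning.

```python
import numpy as np


def bow_cosine(text_a, text_b):
    """Cosine similarity between raw word-count vectors of two non-empty texts."""
    tokens_a, tokens_b = text_a.lower().split(), text_b.lower().split()
    vocab = sorted(set(tokens_a) | set(tokens_b))
    a = np.array([tokens_a.count(w) for w in vocab], dtype=float)
    b = np.array([tokens_b.count(w) for w in vocab], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Related meaning, no shared tokens: BoW sees no similarity at all.
no_overlap = bow_cosine("the battery life is excellent", "holds charge for days")

# Two shared tokens: similarity comes only from the exact matches.
some_overlap = bow_cosine("the battery life is excellent", "excellent battery")

print(no_overlap, some_overlap)
```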

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assume you already have:
# - X_vectors: embedding vectors for each review
# - df['text']: the original texts

# Choose a query review
idx = 100  # Pick any index
query_vector = X_vectors[idx].reshape(1, -1)
query_text = df['text'].iloc[idx]

# Compute similarity between the query and every review
similarities = cosine_similarity(query_vector, X_vectors)[0]

# Get top 5 similar reviews, excluding the query by index
# (duplicate reviews can tie with the query, so slicing off the last element is unreliable)
ranked = similarities.argsort()[::-1]
top_indices = ranked[ranked != idx][:5]
similar_texts = df['text'].iloc[top_indices]

print("Query Review:")
print(query_text)
print("\nMost Similar Reviews (using Word Embeddings):")
for i, text in enumerate(similar_texts):
    print(f"{i+1}. {text}")

### Example Result (illustrative)

> Query: "The battery lasts a long time and charges quickly."  
> Similar Reviews:
> 1. "Holds charge for days. Very impressed."
> 2. "Long battery life and fast charging feature."
> 3. "Impressive durability and energy efficiency."
> 4. "The battery power is very reliable."
> 5. "Stays charged all weekend."

**Notice:** Different words, same meaning.  
Bag of Words only rewards shared tokens, so it would miss matches such as 1 and 5, which have no words in common with the query.


### Why this is perfect for students:
- Shows a **clear weakness** of BoW.
- Shows **real-world value** of embeddings.
- Shows why embeddings are **essential** for more advanced tasks beyond simple classification.
