# Sentiment Analysis and Semantic Similarity with Bag of Words and Word Embeddings


This notebook covers two key parts:

1. Sentiment Analysis using Bag of Words and Word Embeddings.
2. Semantic Similarity Search using Word Embeddings.

We will demonstrate the strengths and weaknesses of both representations in real downstream applications.


## 1. File Upload and Preprocessing

In [None]:

import pandas as pd
from google.colab import files
from IPython.display import display

# Upload CSV file dynamically
uploaded = files.upload()

# List uploaded files
for filename in uploaded.keys():
    print(f'User uploaded file "{filename}"')

# Load the selected file
csv_file = list(uploaded.keys())[0]
df = pd.read_csv(csv_file)

# Display a few rows
display(df.head())

# Merge title and content (assumes the CSV has 'title', 'content', and 'label' columns)
df['text'] = df['title'].fillna('') + " " + df['content'].fillna('')

# Prepare features and labels
X = df['text']
y = df['label']


# Part 1: Sentiment Analysis

## 2A. Sentiment Analysis using Bag of Words

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bag of Words vectorization
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train Logistic Regression
bow_model = LogisticRegression(max_iter=1000)
bow_model.fit(X_train_bow, y_train)

# Predictions
y_pred_bow = bow_model.predict(X_test_bow)

print("BoW Classifier Results")
print("Accuracy:", accuracy_score(y_test, y_pred_bow))
print(classification_report(y_test, y_pred_bow))


## 2B. Sentiment Analysis using Word Embeddings

In [None]:

import nltk
import numpy as np
import string
from nltk.tokenize import word_tokenize
from gensim.downloader import load

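# Tokenizer data for word_tokenize; newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('punkt_tab')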

# Load pre-trained 100-dimensional GloVe embeddings via gensim
# (despite the variable name, these are GloVe vectors, not word2vec)
word2vec_model = load('glove-wiki-gigaword-100')

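# Helper function: represent a text as the average of its word embeddings
def get_avg_word2vec(text, model, k=100):
    """Return the mean embedding of the in-vocabulary tokens of `text` (zeros if none); k must match the embedding size."""
    tokens = word_tokenize(text.lower())
    # Drop tokens made up entirely of punctuation characters (e.g. "...", "``")
    tokens = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    # Keep only tokens that have a pre-trained vector
    vectors = [model[token] for token in tokens if token in model]
    if len(vectors) == 0:
        # No known tokens: fall back to a zero vector of dimension k
        return np.zeros(k)
    return np.mean(vectors, axis=0)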

# Build vectors
X_vectors = np.vstack([get_avg_word2vec(text, word2vec_model) for text in X])

# Split again with the same test_size and random_state so the rows match the BoW split
X_train_vec, X_test_vec, y_train_vec, y_test_vec = train_test_split(X_vectors, y, test_size=0.2, random_state=42)

# Train Logistic Regression
embedding_model = LogisticRegression(max_iter=1000)
embedding_model.fit(X_train_vec, y_train_vec)

# Predictions
y_pred_vec = embedding_model.predict(X_test_vec)

print("Embedding Classifier Results")
print("Accuracy:", accuracy_score(y_test_vec, y_pred_vec))
print(classification_report(y_test_vec, y_pred_vec))


## 3. Final Comparison

In [None]:

print("\nSummary of Results:")
print(f"BoW Accuracy: {accuracy_score(y_test, y_pred_bow):.4f}")
print(f"Embedding Accuracy: {accuracy_score(y_test_vec, y_pred_vec):.4f}")


# Part 2: Semantic Similarity Search with Word Embeddings

In [None]:

from sklearn.metrics.pairwise import cosine_similarity

# Choose a query review (a fixed index here; any valid row position works)
idx = 100
query_vector = X_vectors[idx].reshape(1, -1)
query_text = df['text'].iloc[idx]

# Compute cosine similarities
similarities = cosine_similarity(query_vector, X_vectors)[0]

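# Get top 5 most similar reviews, explicitly excluding the query itself
# (the query is not guaranteed to rank first, e.g. with duplicate or all-zero vectors)
ranked_indices = similarities.argsort()[::-1]
top_indices = [i for i in ranked_indices if i != idx][:5]
similar_texts = df['text'].iloc[top_indices]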

print("Query Review:")
print(query_text)
print("\nMost Similar Reviews (using Word Embeddings):")
for i, text in enumerate(similar_texts):
    print(f"{i+1}. {text}")
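
A note on the ranking step: code that simply drops the top-ranked row relies on the query always ranking first, which can fail when the dataset contains an exact duplicate of the query or when the query vector is all zeros. The sketch below, in plain Python on toy 2-dimensional vectors, filters the query out by index instead. The names `top_k_similar` and `toy_vectors` are hypothetical and only serve this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_similar(query_idx, vectors, k=5):
    """Return (index, score) pairs for the k rows most similar to vectors[query_idx]."""
    query = vectors[query_idx]
    # Skip the query by index rather than by assuming it is ranked first
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(vectors) if i != query_idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy data (hypothetical): row 1 duplicates the query, row 4 has no known tokens
toy_vectors = [
    [1.0, 0.0],  # 0: query
    [1.0, 0.0],  # 1: exact duplicate of the query
    [0.9, 0.1],  # 2: close to the query
    [0.0, 1.0],  # 3: orthogonal to the query
    [0.0, 0.0],  # 4: all-zero vector
]

results = top_k_similar(0, toy_vectors, k=2)
for index, score in results:
    print(f"row {index}: similarity {score:.3f}")
```

The duplicate row is kept as a legitimate neighbour, while the query itself never appears in the results.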



# Conclusion

Bag of Words models work well for simple classification tasks where the signal lies mostly in the presence of specific keywords.
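
The keyword dependence can be seen in a small sketch. It uses plain Python with `collections.Counter` as a toy stand-in for `CountVectorizer` (whitespace tokenization only, which is an assumption of this example and not how the classifier above tokenizes). Two phrases are compared by the cosine similarity of their count vectors.

```python
import math
from collections import Counter

def bow_vector(text):
    """Count lowercase whitespace-separated tokens (a toy stand-in for CountVectorizer)."""
    return Counter(text.lower().split())

def bow_cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# One shared keyword ("great") versus no shared keywords at all
shared_keyword = bow_cosine(bow_vector("great film"), bow_vector("great movie"))
no_shared_keyword = bow_cosine(bow_vector("great film"), bow_vector("excellent movie"))

print(f"great film vs great movie:     {shared_keyword:.2f}")
print(f"great film vs excellent movie: {no_shared_keyword:.2f}")
```

Under Bag of Words, "great film" and "excellent movie" share no tokens, so their similarity is exactly zero even though they mean nearly the same thing.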

Word embeddings matter most in downstream tasks that depend on semantic similarity, such as:

- Semantic search
- Text clustering
- Recommendation systems
- Transfer learning

Embeddings capture meaning beyond surface-level word matching.
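
As a rough illustration, the sketch below averages word vectors per phrase, as the notebook does, and compares phrases by cosine similarity. The 3-dimensional vectors in `toy_embeddings` are hand-made and hypothetical; they are not GloVe values and only mimic the idea that related words lie close together.

```python
import math

# Hypothetical hand-made "embeddings" for illustration only (not real GloVe vectors)
toy_embeddings = {
    "great":     [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.3, 0.1],
    "film":      [0.0, 0.1, 0.9],
    "movie":     [0.1, 0.0, 0.8],
    "late":      [-0.1, 0.8, 0.0],
    "refund":    [0.0, 0.9, 0.1],
}

def average_vector(text, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if no token is known."""
    vectors = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two dense vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = average_vector("great film", toy_embeddings)
paraphrase = cosine(query, average_vector("excellent movie", toy_embeddings))
unrelated = cosine(query, average_vector("late refund", toy_embeddings))

print(f"great film vs excellent movie: {paraphrase:.2f}")
print(f"great film vs late refund:     {unrelated:.2f}")
```

With these toy vectors, the paraphrase scores far higher than the unrelated phrase despite sharing no words with the query. How well this holds on real data depends on the pre-trained embeddings and on how much averaging blurs word order and negation.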
