
# Sentiment Analysis of Real-time Flipkart Product Reviews  
**Product:** YONEX MAVIS 350 Nylon Shuttle (Flipkart)  
**Dataset Size:** 8,518 reviews  

This notebook follows the official problem statement end-to-end:
- Correct dataset selection
- Text preprocessing
- TF-IDF feature extraction
- Sentiment classification
- F1-score evaluation
- Model & vectorizer saving



## Dataset Note (Important)
This project **uses only** the YONEX MAVIS 350 dataset as specified.

Correct file path:
```
reviews_badminton/data.csv
```
Other datasets (tawa, tea) are **not used** for submission.


In [19]:

# Imports
import pandas as pd
import numpy as np
import re, string, pickle, nltk

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [20]:
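# Load dataset (reviews_badminton/data.csv; the path below is where this run read it from)
df = pd.read_csv("/content/data.csv")
# Keep only the review text and rating, under shorter column names
df = df[['Review text', 'Ratings']].rename(
    columns={'Review text': 'review', 'Ratings': 'rating'}
)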
df.shape

(8518, 2)

In [21]:
# Target Variable Creation
# rating >= 4 -> Positive (1)
# rating <= 2 -> Negative (0)
# rating == 3 -> dropped

df = df[df['rating'] != 3].copy()  # copy so the new column is not written to a view
df['sentiment'] = (df['rating'] >= 4).astype(int)

df['sentiment'].value_counts()


sentiment  count
1           6826
0           1077
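
The counts above show a strong class imbalance, which matters when reading the scores later in the notebook. A minimal sketch of the arithmetic, using only the two counts printed above; the all-positive baseline is a hypothetical reference point, not something the notebook trains:

```python
# Class counts copied from the value_counts() output above
n_pos, n_neg = 6826, 1077
n_total = n_pos + n_neg

pos_share = n_pos / n_total
print(f"rows after dropping 3-star reviews: {n_total}")
print(f"positive share: {pos_share:.3f}")

# Hypothetical baseline that labels every review positive:
# precision equals the positive share and recall is 1.0.
baseline_precision = pos_share
baseline_recall = 1.0
baseline_f1 = (2 * baseline_precision * baseline_recall
               / (baseline_precision + baseline_recall))
print(f"all-positive baseline F1 (positive class): {baseline_f1:.3f}")
```

Roughly 86% of the labelled reviews are positive, so a positive-class F1 should be judged against about 0.93 rather than against zero.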


In [22]:

# Text Cleaning & Normalization
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return " ".join(tokens)

df['clean_review'] = df['review'].astype(str).apply(clean_text)


In [23]:
# Pain-point extraction from negative reviews
# (CountVectorizer is already imported in the first cell)

neg_reviews = df[df['sentiment'] == 0]['clean_review']

cv = CountVectorizer(max_features=20)
X_neg = cv.fit_transform(neg_reviews)

pain_points = pd.DataFrame({
    "keyword": cv.get_feature_names_out(),
    "frequency": X_neg.sum(axis=0).A1
}).sort_values(by="frequency", ascending=False)

pain_points

    keyword      frequency
18  shuttle            300
15  quality            254
13  product            214
0   bad                178
16  qualityread        122
7   good               117
19  worst              102
12  poor                89
8   goodread            88
14  productread         79
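
Keywords such as `qualityread`, `goodread`, and `productread` look like a scraping artifact. They are consistent with each review ending in a literal `READ MORE` suffix fused to the last word, where `more` is then dropped as a stop word. That is an inference from the table, not something the dataset documents. A minimal sketch of stripping the suffix before `clean_text` runs, assuming that explanation is correct (`strip_read_more` is a hypothetical helper, not part of the notebook):

```python
import re

# Assumed artifact: a trailing "READ MORE" glued onto the review text.
_READ_MORE = re.compile(r"\s*READ MORE\s*$")

def strip_read_more(text: str) -> str:
    """Remove a trailing 'READ MORE' marker, if present."""
    return _READ_MORE.sub("", text)

print(strip_read_more("Bad qualityREAD MORE"))
print(strip_read_more("Good shuttle"))
```

It could be applied as `df['review'].astype(str).apply(strip_read_more)` before cleaning. The keyword frequencies and model scores shown in this notebook were produced without it.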



## Text Embedding Techniques
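- **BoW (CountVectorizer)** – baseline representation; used above for keyword counts on negative reviews
- **TF-IDF** – used for the final model (unigrams and bigrams, up to 5,000 features)
- Word2Vec / BERT – optional extensions, not implemented in this notebook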


In [24]:

# Train-Test Split
X = df['clean_review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [25]:

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)
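
With the settings above, `TfidfVectorizer` keeps its default weighting options (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). Under those defaults, the weight of term $t$ in document $d$ is computed as follows, where $n$ is the number of training documents, $\mathrm{df}(t)$ is the number of them containing $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \big(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\big)^2}}
$$

As an illustration only: the 80% split of the 7,903 labelled reviews gives $n = 6322$ training reviews, so a hypothetical term appearing in 100 of them would get $\mathrm{idf} = \ln(6323/101) + 1 \approx 5.14$. Because the vectorizer is fitted on `X_train` only, document frequencies from the test set do not influence these weights.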


In [26]:

# Model Training
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)


In [27]:

# Evaluation (F1 score; f1_score defaults to the positive class, label 1)
y_pred = model.predict(X_test_vec)

print("F1 Score:", f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


F1 Score: 0.9542715349166963
              precision    recall  f1-score   support

           0       0.84      0.49      0.62       215
           1       0.93      0.99      0.95      1366

    accuracy                           0.92      1581
   macro avg       0.88      0.74      0.79      1581
weighted avg       0.91      0.92      0.91      1581
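
The headline `F1 Score` printed above is the F1 of the positive class only, which is why it sits well above the macro average. The sketch below recomputes the report's two averages from its per-class rows. The inputs are the rounded figures from the report, so the results agree with it only to about two decimals:

```python
# Per-class F1 and support, copied (rounded) from the classification report
f1_neg, support_neg = 0.62, 215
f1_pos, support_pos = 0.95, 1366
total = support_neg + support_pos

# Macro average: unweighted mean, so the minority class counts as much
# as the majority class.
macro_f1 = (f1_neg + f1_pos) / 2

# Weighted average: mean weighted by support, dominated by the positives.
weighted_f1 = (f1_neg * support_neg + f1_pos * support_pos) / total

# Recall 0.49 on 215 negatives means roughly half were caught.
approx_neg_found = round(0.49 * support_neg)

print(f"macro F1    ~ {macro_f1:.3f}")
print(f"weighted F1 ~ {weighted_f1:.3f}")
print(f"negatives correctly flagged ~ {approx_neg_found} of {support_neg}")
```

If the goal is to surface unhappy customers, the macro F1 and the negative-class recall of 0.49 are the figures to watch. One common option in scikit-learn is `LogisticRegression(class_weight='balanced')`, which this notebook does not evaluate.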



In [28]:

# Save Model & Vectorizer
with open("sentiment_model.pkl","wb") as f:
    pickle.dump(model,f)

with open("tfidf_vectorizer.pkl","wb") as f:
    pickle.dump(tfidf,f)

print("Saved sentiment_model.pkl and tfidf_vectorizer.pkl")


Saved sentiment_model.pkl and tfidf_vectorizer.pkl



## Deployment (High-Level)
- A Streamlit / Flask app loads `sentiment_model.pkl` and `tfidf_vectorizer.pkl`
- The app is deployed on AWS EC2 (Ubuntu)
- It is exposed on port 8501 (Streamlit's default)

(Implementation provided in `app.py`)
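
`app.py` is not shown here, so the following is only a sketch of the inference path such an app would need, not the author's implementation. The function names and the `preprocess` parameter are hypothetical; the file names are the ones saved above. The key point is that new text must pass through the same cleaning and the same fitted vectorizer as the training data.

```python
import pickle

def load_artifacts(model_path="sentiment_model.pkl",
                   vectorizer_path="tfidf_vectorizer.pkl"):
    """Load the pickled classifier and TF-IDF vectorizer from disk."""
    # Unpickling requires scikit-learn to be installed, and pickle files
    # should only be loaded from a trusted source.
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer

def predict_sentiment(text, model, vectorizer, preprocess=lambda s: s):
    """Return 'Positive' or 'Negative' for one raw review string."""
    cleaned = preprocess(text)               # e.g. the notebook's clean_text
    features = vectorizer.transform([cleaned])
    label = model.predict(features)[0]       # 1 = positive, 0 = negative
    return "Positive" if label == 1 else "Negative"
```

In a Streamlit app, `load_artifacts()` would typically be called once at startup, and `predict_sentiment(user_text, model, vectorizer, preprocess=clean_text)` on each submission.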
