# Sentiment Analysis

## Project Workflow
    STEP 1: Load data
    STEP 2: Create sentiment labels
    STEP 3: Clean text (remove noise + stopwords)
    STEP 4: Analyze cleaned text (insights)
    STEP 5: Train-test split
    STEP 6: Multiple models
    STEP 7: Pick best (F1-score)
    STEP 8: Deploy to AWS


#### Import libraries

In [None]:
import pandas as pd

## 1. Load Dataset

In [None]:
df = pd.read_csv("data.csv")
df.head()

Unnamed: 0,Reviewer Name,Review Title,Place of Review,Up Votes,Down Votes,Month,Review text,Ratings
0,Kamal Suresh,Nice product,"Certified Buyer, Chirakkal",889.0,64.0,Feb 2021,"Nice product, good quality, but price is now r...",4
1,Flipkart Customer,Don't waste your money,"Certified Buyer, Hyderabad",109.0,6.0,Feb 2021,They didn't supplied Yonex Mavis 350. Outside ...,1
2,A. S. Raja Srinivasan,Did not meet expectations,"Certified Buyer, Dharmapuri",42.0,3.0,Apr 2021,Worst product. Damaged shuttlecocks packed in ...,1
3,Suresh Narayanasamy,Fair,"Certified Buyer, Chennai",25.0,1.0,,"Quite O. K. , but nowadays the quality of the...",3
4,ASHIK P A,Over priced,,147.0,24.0,Apr 2016,Over pricedJust â?¹620 ..from retailer.I didn'...,1


In [None]:
df.shape

(8518, 8)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8518 entries, 0 to 8517
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Reviewer Name    8508 non-null   object 
 1   Review Title     8508 non-null   object 
 2   Place of Review  8468 non-null   object 
 3   Up Votes         8508 non-null   float64
 4   Down Votes       8508 non-null   float64
 5   Month            8053 non-null   object 
 6   Review text      8510 non-null   object 
 7   Ratings          8518 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 532.5+ KB


In [None]:
df.columns

Index(['Reviewer Name', 'Review Title', 'Place of Review', 'Up Votes',
       'Down Votes', 'Month', 'Review text', 'Ratings'],
      dtype='object')

In [None]:
df.isnull().sum()

Unnamed: 0,0
Reviewer Name,10
Review Title,10
Place of Review,50
Up Votes,10
Down Votes,10
Month,465
Review text,8
Ratings,0


In [None]:
df["Ratings"].value_counts().sort_index()

Unnamed: 0_level_0,count
Ratings,Unnamed: 1_level_1
1,769
2,308
3,615
4,1746
5,5080


## Keeping only the required columns

In [None]:
df = df[["Review text", "Ratings"]]

In [None]:
df.head()

Unnamed: 0,Review text,Ratings
0,"Nice product, good quality, but price is now r...",4
1,They didn't supplied Yonex Mavis 350. Outside ...,1
2,Worst product. Damaged shuttlecocks packed in ...,1
3,"Quite O. K. , but nowadays the quality of the...",3
4,Over pricedJust â?¹620 ..from retailer.I didn'...,1


In [None]:
df.isna().sum()

Unnamed: 0,0
Review text,8
Ratings,0


In [None]:
# explicit copy, so that adding columns later does not raise SettingWithCopyWarning
df = df.dropna().copy()

In [None]:
df.isna().sum()

Unnamed: 0,0
Review text,0
Ratings,0


## 2. Creating the Sentiment Column

In [None]:
df['Ratings'].value_counts()

Unnamed: 0_level_0,count
Ratings,Unnamed: 1_level_1
5,5080
4,1746
1,769
3,615
2,308


In [None]:
df["sentiment"] = df["Ratings"].apply(
    lambda x: "Positive" if x >= 4 else "Negative"
)

In [None]:
df["sentiment"].value_counts()

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
Positive,6823
Negative,1687


In [None]:
df.shape

(8510, 3)

In [None]:
df.head()

Unnamed: 0,Review text,Ratings,sentiment
0,"Nice product, good quality, but price is now r...",4,Positive
1,They didn't supplied Yonex Mavis 350. Outside ...,1,Negative
2,Worst product. Damaged shuttlecocks packed in ...,1,Negative
3,"Quite O. K. , but nowadays the quality of the...",3,Negative
4,Over pricedJust â?¹620 ..from retailer.I didn'...,1,Negative


## 3. Clean text (remove noise + stopwords)

In [None]:
import re
import nltk

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


stop_words = set(stopwords.words("english"))

custom_noise = {
    "read", "more", "productread", "goodread",
    "niceread", "qualityread"
}

stop_words = stop_words.union(custom_noise)

lemmatizer = WordNetLemmatizer()

def clean_text(doc):
    if not isinstance(doc, str):
        return ""

    # remove special characters & digits
    doc = re.sub(r"[^a-zA-Z ]", " ", doc)

    # lowercase
    doc = doc.lower()

    # tokenize
    tokens = doc.split()

    # stopword + short word removal + lemmatization
    cleaned_tokens = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token not in stop_words and len(token) > 2
    ]

    return " ".join(cleaned_tokens)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
# Apply cleaning
df["clean_review"] = df["Review text"].apply(clean_text)

In [None]:
df[["Review text", "clean_review"]].head(5)

Unnamed: 0,Review text,clean_review
0,"Nice product, good quality, but price is now r...",nice product good quality price rising bad sig...
1,They didn't supplied Yonex Mavis 350. Outside ...,supplied yonex mavis outside cover yonex insid...
2,Worst product. Damaged shuttlecocks packed in ...,worst product damaged shuttlecock packed new b...
3,"Quite O. K. , but nowadays the quality of the...",quite nowadays quality cork like year back usi...
4,Over pricedJust â?¹620 ..from retailer.I didn'...,pricedjust retailer understand wat advantage b...


## 4. Insight Analysis
- Goal: identify the product features that contribute to customer satisfaction or dissatisfaction.

In [None]:
positive_reviews = df[df["sentiment"] == "Positive"]["clean_review"]
negative_reviews = df[df["sentiment"] == "Negative"]["clean_review"]

In [None]:
from collections import Counter

pos_words = Counter(" ".join(positive_reviews).split()).most_common(20)
pos_words

[('good', 1904),
 ('product', 962),
 ('nice', 763),
 ('shuttle', 585),
 ('quality', 487),
 ('best', 416),
 ('original', 274),
 ('delivery', 255),
 ('one', 230),
 ('price', 223),
 ('superread', 197),
 ('genuine', 165),
 ('excellent', 149),
 ('flipkart', 142),
 ('super', 139),
 ('thanks', 137),
 ('time', 133),
 ('great', 125),
 ('yonex', 118),
 ('awesome', 114)]

In [None]:
neg_words = Counter(" ".join(negative_reviews).split()).most_common(20)
neg_words

[('shuttle', 426),
 ('quality', 361),
 ('product', 325),
 ('good', 269),
 ('bad', 217),
 ('one', 124),
 ('worst', 108),
 ('poor', 104),
 ('day', 98),
 ('box', 79),
 ('buy', 78),
 ('cork', 78),
 ('time', 76),
 ('mavis', 73),
 ('badread', 71),
 ('original', 63),
 ('damaged', 59),
 ('got', 59),
 ('flipkart', 58),
 ('last', 57)]
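
Tokens such as `superread` and `badread` in the counts above suggest that the scraped review text ends with a "READ MORE" marker glued to the last word, which the `custom_noise` set covers only partly. A minimal sketch, assuming that is the cause, strips the marker before cleaning. `strip_read_more` is a hypothetical helper, not part of the original notebook:

```python
import re

# Assumption: reviews end with a literal "READ MORE" marker left over from the
# page's expand link, sometimes attached directly to the last word.
READ_MORE_RE = re.compile(r"\s*read\s*more\s*$", flags=re.IGNORECASE)

def strip_read_more(text):
    """Remove a trailing 'READ MORE' marker; return non-strings unchanged."""
    if not isinstance(text, str):
        return text
    return READ_MORE_RE.sub("", text)

print(strip_read_more("SuperREAD MORE"))           # Super
print(strip_read_more("Good oneREAD MORE"))        # Good one
print(strip_read_more("Read more reviews first"))  # unchanged: marker is not at the end
```

If the assumption holds, it could be applied before `clean_text`, for example with `df["Review text"].apply(strip_read_more)`. A review that genuinely ends with the words "read more" would also be trimmed, so the result is worth spot-checking.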

### Bigram Analysis

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Positive bigrams
pos_vec = CountVectorizer(ngram_range=(2,2), max_features=10)
pos_vec.fit(positive_reviews)
positive_bigrams = pos_vec.get_feature_names_out()

# Negative bigrams
neg_vec = CountVectorizer(ngram_range=(2,2), max_features=10)
neg_vec.fit(negative_reviews)
negative_bigrams = neg_vec.get_feature_names_out()

print(positive_bigrams)
print()
print(negative_bigrams)

['best shuttle' 'fast delivery' 'genuine product' 'good one'
 'good oneread' 'good product' 'good quality' 'nice product'
 'original product' 'product good']

['bad product' 'bad quality' 'good product' 'one shuttle' 'poor quality'
 'quality good' 'quality product' 'quality shuttle' 'shuttle good'
 'worst product']
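
`get_feature_names_out()` returns the selected bigrams in alphabetical order and without their counts, so the lists above do not show which pairs are most frequent. A small sketch using `Counter`, in the same spirit as the unigram counts earlier, ranks bigrams by frequency. `top_bigrams` is a hypothetical helper and the sample reviews are made up:

```python
from collections import Counter

def top_bigrams(docs, n=10):
    """Count adjacent word pairs within each document and return the n most common."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # zip pairs each token with its successor, so bigrams never span two documents
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(n)

# Toy cleaned reviews, for illustration only
sample = ["good product good quality", "good product", "bad quality"]
print(top_bigrams(sample, n=1))  # [('good product', 2)]
```

On the notebook's data this would be called as `top_bigrams(positive_reviews)` and `top_bigrams(negative_reviews)`.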


## 5. Train-Test Split

In [None]:
X = df["clean_review"]
y = df["sentiment"]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [None]:
print("Train distribution:")
print(y_train.value_counts(normalize=True))

print("\nTest distribution:")
print(y_test.value_counts(normalize=True))

Train distribution:
sentiment
Positive    0.801704
Negative    0.198296
Name: proportion, dtype: float64

Test distribution:
sentiment
Positive    0.801998
Negative    0.198002
Name: proportion, dtype: float64


##### Observations:
- The classes are imbalanced (about 80% Positive, 20% Negative) in both splits.
- First, train and evaluate a baseline model.
- Then, handle the imbalance.
- Finally, cross-check the metrics again.

## 6. Training Multiple Models

### Logistic Regression

In [None]:
# Base Model
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

baseline_model_logistic_regression = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=5000,
        ngram_range=(1,2)
    )),
    ("clf", LogisticRegression(max_iter=1000))
])


In [None]:
baseline_model_logistic_regression.fit(X_train, y_train)

In [None]:
from sklearn.metrics import classification_report, f1_score, accuracy_score

y_pred_base = baseline_model_logistic_regression.predict(X_test)

print("Baseline Classification with Logistic Regression Report:")
print(classification_report(y_test, y_pred_base))

print("Baseline F1-score for Logistic Regression Model (weighted):",
      f1_score(y_test, y_pred_base, average="weighted"))

print("Accuracy for Base Logistic Regression Model : ",accuracy_score(y_test,y_pred_base))


Baseline Classification with Logistic Regression Report:
              precision    recall  f1-score   support

    Negative       0.83      0.44      0.57       337
    Positive       0.88      0.98      0.92      1365

    accuracy                           0.87      1702
   macro avg       0.85      0.71      0.75      1702
weighted avg       0.87      0.87      0.85      1702

Baseline F1-score for Logistic Regression Model (weighted): 0.8542001881678832
Accuracy for Base Logistic Regression Model :  0.8707403055229143


### Handling imbalance

In [None]:
balanced_logistic_regression_model = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=5000,
        ngram_range=(1,2)
    )),
    ("clf", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])
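
With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so mistakes on the rarer Negative class cost more during training. The sketch below reproduces that formula in plain Python. The label counts are assumptions derived from the training split size (6808 rows) and the class proportions reported in this notebook, and `balanced_class_weights` is an illustrative helper:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Assumed training labels: 1350 Negative and 5458 Positive (6808 in total)
y_train_example = ["Negative"] * 1350 + ["Positive"] * 5458
weights = balanced_class_weights(y_train_example)
print({cls: round(w, 3) for cls, w in weights.items()})
# {'Negative': 2.521, 'Positive': 0.624}
```

Under these counts, each Negative review carries roughly four times the weight of a Positive one.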

In [None]:
balanced_logistic_regression_model.fit(X_train, y_train)


In [None]:
y_pred_bal = balanced_logistic_regression_model.predict(X_test)

print("Balanced Model Classification Report:")
print(classification_report(y_test, y_pred_bal))

print("Balanced F1-score (weighted):",
      f1_score(y_test, y_pred_bal, average="weighted"))

print("Accuracy for Balanced Logistic Regression Model : ",accuracy_score(y_test, y_pred_bal))

Balanced Model Classification Report:
              precision    recall  f1-score   support

    Negative       0.64      0.66      0.65       337
    Positive       0.92      0.91      0.91      1365

    accuracy                           0.86      1702
   macro avg       0.78      0.78      0.78      1702
weighted avg       0.86      0.86      0.86      1702

Balanced F1-score (weighted): 0.8588945346428434
Accuracy for Balanced Logistic Regression Model :  0.8578143360752056


### TF-IDF Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2)
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
X_train_tfidf.shape, X_test_tfidf.shape

((6808, 5000), (1702, 5000))

## Applying SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_train_smote, y_train_smote = smote.fit_resample(
    X_train_tfidf,
    y_train
)

In [None]:
from collections import Counter

Counter(y_train_smote)

Counter({'Negative': 5458, 'Positive': 5458})

# Model 1: Logistic Regression + SMOTE

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score

lr_smote = LogisticRegression(max_iter=1000)
lr_smote.fit(X_train_smote, y_train_smote)

y_pred_lr_smote = lr_smote.predict(X_test_tfidf)

print("Logistic Regression + SMOTE")
print(classification_report(y_test, y_pred_lr_smote))
print("Weighted F1:",
      f1_score(y_test, y_pred_lr_smote, average="weighted"))

Logistic Regression + SMOTE
              precision    recall  f1-score   support

    Negative       0.37      0.76      0.50       337
    Positive       0.92      0.68      0.79      1365

    accuracy                           0.70      1702
   macro avg       0.65      0.72      0.64      1702
weighted avg       0.81      0.70      0.73      1702

Weighted F1: 0.7290225907286909


# Model 2: SVM (LinearSVC)

In [None]:
from sklearn.svm import LinearSVC

svm_model = LinearSVC()
svm_model.fit(X_train_tfidf, y_train)

y_pred_svm = svm_model.predict(X_test_tfidf)

print("SVM (LinearSVC)")
print(classification_report(y_test, y_pred_svm))
print("Weighted F1:",
      f1_score(y_test, y_pred_svm, average="weighted"))

SVM (LinearSVC)
              precision    recall  f1-score   support

    Negative       0.75      0.51      0.61       337
    Positive       0.89      0.96      0.92      1365

    accuracy                           0.87      1702
   macro avg       0.82      0.74      0.76      1702
weighted avg       0.86      0.87      0.86      1702

Weighted F1: 0.8593149579624707


# Model 3: Naive Bayes (MultinomialNB)

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

y_pred_nb = nb_model.predict(X_test_tfidf)

print("Naive Bayes")
print(classification_report(y_test, y_pred_nb))
print("Weighted F1:",
      f1_score(y_test, y_pred_nb, average="weighted"))

Naive Bayes
              precision    recall  f1-score   support

    Negative       0.88      0.34      0.49       337
    Positive       0.86      0.99      0.92      1365

    accuracy                           0.86      1702
   macro avg       0.87      0.66      0.70      1702
weighted avg       0.86      0.86      0.83      1702

Weighted F1: 0.8334038889633382


# Model 4: Decision Tree + SMOTE

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(
    max_depth=20,
    random_state=42
)

dt_model.fit(X_train_smote, y_train_smote)

y_pred_dt = dt_model.predict(X_test_tfidf)

print("Decision Tree + SMOTE")
print(classification_report(y_test, y_pred_dt))
print("Weighted F1:",
      f1_score(y_test, y_pred_dt, average="weighted"))

Decision Tree + SMOTE
              precision    recall  f1-score   support

    Negative       0.69      0.48      0.56       337
    Positive       0.88      0.95      0.91      1365

    accuracy                           0.85      1702
   macro avg       0.78      0.71      0.74      1702
weighted avg       0.84      0.85      0.84      1702

Weighted F1: 0.8431659842693485


In [None]:
import pandas as pd

comparison = pd.DataFrame({
    "Model": [
        "Logistic Regression (baseline)",
        "Logistic Regression (balanced)",
        "Logistic Regression + SMOTE",
        "SVM (Linear)",
        "Naive Bayes",
        "Decision Tree + SMOTE"
    ],
    "Weighted F1": [
        0.8542,
        0.8589,
        f1_score(y_test, y_pred_lr_smote, average="weighted"),
        f1_score(y_test, y_pred_svm, average="weighted"),
        f1_score(y_test, y_pred_nb, average="weighted"),
        f1_score(y_test, y_pred_dt, average="weighted")
    ]
})

comparison

Unnamed: 0,Model,Weighted F1
0,Logistic Regression (baseline),0.8542
1,Logistic Regression (balanced),0.8589
2,Logistic Regression + SMOTE,0.729023
3,SVM (Linear),0.859315
4,Naive Bayes,0.833404
5,Decision Tree + SMOTE,0.843166


### Why SMOTE performed poorly here

On high-dimensional, sparse TF-IDF features, SMOTE often:

- creates unrealistic synthetic samples
- hurts generalization

This matches the results above: Logistic Regression + SMOTE reached a weighted F1 of 0.729, compared with 0.854 for the baseline and 0.859 for the class-weighted model. SMOTE did not improve performance for text-based TF-IDF features and was therefore not selected.
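
SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two made-up sparse vectors (it is not the imblearn implementation). The synthetic vector has non-zero weights for the terms of both reviews, a combination that neither real review contains:

```python
def interpolate(x, neighbor, lam):
    """SMOTE-style synthetic point: x + lam * (neighbor - x), with 0 <= lam <= 1."""
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

def nonzero_count(vec):
    """Number of non-zero entries, i.e. how many vocabulary terms are 'present'."""
    return sum(1 for v in vec if v != 0)

# Made-up TF-IDF-like vectors over a 6-term vocabulary (illustrative values only)
review_a = [0.8, 0.6, 0.0, 0.0, 0.0, 0.0]
review_b = [0.0, 0.0, 0.0, 0.7, 0.7, 0.0]

# SMOTE draws lam at random between 0 and 1; it is fixed here for reproducibility
synthetic = interpolate(review_a, review_b, lam=0.5)

print(nonzero_count(review_a), nonzero_count(review_b), nonzero_count(synthetic))
# 2 2 4
```

With a 5000-term vocabulary and short reviews, most neighbouring pairs share few terms, so such blended vectors may look unlike any genuine review.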

# Metrics Comparison

In [None]:
from sklearn.metrics import classification_report, accuracy_score, f1_score
import pandas as pd

results = []

# helper function
def evaluate_model(name, y_true, y_pred):
    report = classification_report(y_true, y_pred, output_dict=True)
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Weighted F1": report["weighted avg"]["f1-score"],
        "Macro F1": report["macro avg"]["f1-score"],
        "Negative Recall": report["Negative"]["recall"],
        "Classification Report": classification_report(y_true, y_pred)
    }

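# models trained without resampling (pipelines predict on raw X_test;
# SVM and Naive Bayes predictions were computed earlier from X_test_tfidf)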
results.append(evaluate_model(
    "Logistic Regression (Baseline)",
    y_test,
    baseline_model_logistic_regression.predict(X_test)
))

results.append(evaluate_model(
    "Logistic Regression (Balanced)",
    y_test,
    balanced_logistic_regression_model.predict(X_test)
))

results.append(evaluate_model(
    "SVM (Linear)",
    y_test,
    y_pred_svm
))

results.append(evaluate_model(
    "Naive Bayes",
    y_test,
    y_pred_nb
))

# SMOTE-based models (predictions computed earlier from X_test_tfidf)
results.append(evaluate_model(
    "Logistic Regression + SMOTE",
    y_test,
    y_pred_lr_smote
))

results.append(evaluate_model(
    "Decision Tree + SMOTE",
    y_test,
    y_pred_dt
))

# final comparison dataframe
comparison_df = pd.DataFrame(results)

# sort by Weighted F1
comparison_df = comparison_df.sort_values(
    by="Weighted F1",
    ascending=False
).reset_index(drop=True)

comparison_df.round(3)

Unnamed: 0,Model,Accuracy,Weighted F1,Macro F1,Negative Recall,Classification Report
0,SVM (Linear),0.869,0.859,0.765,0.513,precision recall f1-score ...
1,Logistic Regression (Balanced),0.858,0.859,0.78,0.662,precision recall f1-score ...
2,Logistic Regression (Baseline),0.871,0.854,0.748,0.436,precision recall f1-score ...
3,Decision Tree + SMOTE,0.854,0.843,0.738,0.478,precision recall f1-score ...
4,Naive Bayes,0.86,0.833,0.703,0.338,precision recall f1-score ...
5,Logistic Regression + SMOTE,0.7,0.729,0.643,0.763,precision recall f1-score ...


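# Conclusion
- Final model: Logistic Regression with class_weight="balanced". It matches the linear SVM on weighted F1 (0.859), has the highest macro F1 (0.780), and recalls clearly more Negative reviews (0.662 vs. 0.513).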

In [None]:
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)
import pandas as pd

#  Predict using balanced logistic regression model
y_pred_balanced = balanced_logistic_regression_model.predict(X_test)

# Classification report (text + dict)
report_text = classification_report(y_test, y_pred_balanced)
report_dict = classification_report(
    y_test, y_pred_balanced, output_dict=True
)

# Extract key metrics
accuracy = accuracy_score(y_test, y_pred_balanced)
weighted_f1 = report_dict["weighted avg"]["f1-score"]
macro_f1 = report_dict["macro avg"]["f1-score"]
negative_recall = report_dict["Negative"]["recall"]

# Confusion matrix
cm = confusion_matrix(
    y_test,
    y_pred_balanced,
    labels=["Negative", "Positive"]
)

cm_df = pd.DataFrame(
    cm,
    index=["Actual Negative", "Actual Positive"],
    columns=["Predicted Negative", "Predicted Positive"]
)

# Metrics summary table
metrics_df = pd.DataFrame({
    "Metric": [
        "Accuracy",
        "Weighted F1",
        "Macro F1",
        "Negative Recall"
    ],
    "Value": [
        accuracy,
        weighted_f1,
        macro_f1,
        negative_recall
    ]
})

# Display results
print("=== Balanced Logistic Regression Classification Report ===")
print(report_text)

print("\n=== Metrics Summary ===")
display(metrics_df.round(3))

print("\n=== Confusion Matrix ===")
display(cm_df)

=== Balanced Logistic Regression Classification Report ===
              precision    recall  f1-score   support

    Negative       0.64      0.66      0.65       337
    Positive       0.92      0.91      0.91      1365

    accuracy                           0.86      1702
   macro avg       0.78      0.78      0.78      1702
weighted avg       0.86      0.86      0.86      1702


=== Metrics Summary ===


Unnamed: 0,Metric,Value
0,Accuracy,0.858
1,Weighted F1,0.859
2,Macro F1,0.78
3,Negative Recall,0.662



=== Confusion Matrix ===


Unnamed: 0,Predicted Negative,Predicted Positive
Actual Negative,223,114
Actual Positive,128,1237
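
The summary metrics can be recomputed by hand from the confusion matrix, which shows how macro F1 and weighted F1 differ: macro F1 averages the two per-class F1 scores equally, while weighted F1 weights them by class support. A short worked example using the four counts above (variable names are illustrative):

```python
# Counts from the confusion matrix (rows = actual, columns = predicted)
neg_as_neg, neg_as_pos = 223, 114
pos_as_neg, pos_as_pos = 128, 1237

def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

support_neg = neg_as_neg + neg_as_pos   # 337 actual Negative reviews
support_pos = pos_as_neg + pos_as_pos   # 1365 actual Positive reviews
total = support_neg + support_pos       # 1702 test reviews

# Per-class F1, treating each class in turn as the "positive" label
f1_neg = f1(tp=neg_as_neg, fp=pos_as_neg, fn=neg_as_pos)
f1_pos = f1(tp=pos_as_pos, fp=neg_as_pos, fn=pos_as_neg)

accuracy = (neg_as_neg + pos_as_pos) / total
negative_recall = neg_as_neg / support_neg
macro_f1 = (f1_neg + f1_pos) / 2
weighted_f1 = (f1_neg * support_neg + f1_pos * support_pos) / total

print(round(accuracy, 3), round(weighted_f1, 3), round(macro_f1, 3), round(negative_recall, 3))
# 0.858 0.859 0.78 0.662
```

Because about 80% of the support is Positive, weighted F1 stays high even when Negative reviews are poorly recalled, which is why macro F1 and Negative recall are reported alongside it.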


## 7. Picking the Best Model

In [None]:
import pickle

with open("sentiment_pipeline.pkl", "wb") as f:
    pickle.dump(balanced_logistic_regression_model, f)

In [None]:
# Checking

with open("sentiment_pipeline.pkl", "rb") as f:
    model = pickle.load(f)

model.predict(["The product quality is very bad"])

array(['Negative'], dtype=object)

## 8. Deployment
- Deployed on AWS