# Naive Bayes Project
## Sentiment Analysis – Google Play Reviews
MSalaverri

We are aiming to classify Google Play Store app reviews as positive or negative.

*Type of Problem:* **Classification** using Naive Bayes models.

In [77]:
# IMPORT LIBRARIES

# Data manipulation & visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Text processing
from sklearn.feature_extraction.text import CountVectorizer

# Models
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Train/test split & metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix, classification_report

# Saving models
import joblib

import warnings
warnings.filterwarnings('ignore')


In [78]:
# LOADING ORIGINAL DATA
data = "https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv"
df = pd.read_csv(data)
df.head()

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0


## Exploratory Data Analysis

In [79]:
#Dataset dimension
rows, column = df.shape
print("DATASET Dimension:")
print(f'{rows} rows and {column} columns')

DATASET Dimension:
891 rows and 3 columns


In [80]:
# Get information about range index, number of columns and labels, data types, and the number of cells in each column (non-null values)
print("DATASET General Info:")
df.info()

DATASET General Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


In [81]:
#List Missing values
all_missing_values = df.isnull().sum()
print("Missing Values per Column:")
all_missing_values.reset_index()

Missing Values per Column:


Unnamed: 0,index,0
0,package_name,0
1,review,0
2,polarity,0


In [82]:
num_of_zeros = (df == 0).sum()
print("Count of Zeros per Column:")
num_of_zeros.sort_values(ascending=False).reset_index()

Count of Zeros per Column:


Unnamed: 0,index,0
0,polarity,584
1,package_name,0
2,review,0


### Initial Summary
The dataset contains **891 rows and 3 columns**, each row representing a Google Play Store app review.

All columns are fully populated, with no missing values reported. The dataset is small enough to handle easily but still provides enough examples for training a sentiment classifier.

The dataset includes:

- package_name: the app’s package identifier (categorical, not useful for sentiment).

- review: the review text (free text, main predictor).

- polarity: the target variable (0 = negative, 1 = positive).

Comment: Only the review column is relevant for prediction. Text must be cleaned and transformed into numeric features before modeling.

## Data Prep

### Only Missing Values

In [83]:
only_missing_values_columns = all_missing_values[all_missing_values > 0].sort_values(ascending=False)

print("Only Missing Values Columns:")

if only_missing_values_columns.empty:
    print("No missing values present")
else:
    print(only_missing_values_columns.reset_index())

Only Missing Values Columns:
No missing values present


### Duplicates

In [84]:
duplicates = df.duplicated()
sum_duplicates = duplicates.sum()

print("\nDuplicate Rows:")
if sum_duplicates == 0:
    print("No duplicates present")
else:
    print(f"{sum_duplicates} duplicate rows present")


Duplicate Rows:
No duplicates present


### Data Set Stats

In [85]:
print("\nBASIC DESCRIPTIVE STATISTICS:")
df.describe()


BASIC DESCRIPTIVE STATISTICS:


Unnamed: 0,polarity
count,891.0
mean,0.344557
std,0.47549
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


The checks above found no nulls or duplicate rows, and since the only numeric column is the binary target, there are no outliers to handle. We need to focus on the review column.
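The mean polarity of 0.344557 reported above also shows that the classes are imbalanced. As a quick sanity check, the sketch below uses only the counts already printed (891 rows, 584 zeros in polarity) to work out the majority-class baseline by hand. The variable names are illustrative; any model trained later should beat this baseline.

```python
# Counts taken from the outputs above: 891 reviews, 584 with polarity == 0.
total_reviews = 891
negative_reviews = 584
positive_reviews = total_reviews - negative_reviews

# Share of positive reviews; this should match the mean of the polarity column.
positive_share = positive_reviews / total_reviews

# Accuracy of a trivial classifier that always predicts the majority class (0).
majority_baseline = negative_reviews / total_reviews

print(f"Positive reviews: {positive_reviews} ({positive_share:.4f})")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

An accuracy near 0.66 would therefore mean a model has learned nothing beyond the class frequencies.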

In [86]:
df = df.drop(columns=["package_name"])

We drop package_name because the app identifier is not used as a predictor of sentiment.

## Preprocessing

In [87]:
df["review"] = df["review"].str.strip().str.lower()
X = df["review"]
y = df["polarity"]

Reviews are cleaned by stripping leading and trailing whitespace and converting the text to lowercase.

### Train & Test Split

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Train size: {X_train.shape[0]}")
print(f"Test size: {X_test.shape[0]}")

Train size: 712
Test size: 179


#### Vectorization

In [89]:
vec_model = CountVectorizer(stop_words="english")
X_train_vec = vec_model.fit_transform(X_train).toarray()
X_test_vec = vec_model.transform(X_test).toarray()

print(f"Unique word count: {len(vec_model.vocabulary_)}")

Unique word count: 3272


Vectorization gives us numeric predictors for the models: each review is transformed into a vector of word counts over the training vocabulary, with English stop words removed.
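To make the word-count idea concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It is a simplification, not how `CountVectorizer` is implemented: the tiny stop-word set, the two sample reviews, and the helper names `tokenize` and `bag_of_words` are all made up for illustration.

```python
from collections import Counter

# Hypothetical, tiny stop-word set; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"the", "is", "it", "and", "this", "a"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

def bag_of_words(docs):
    """Return (sorted vocabulary, one count vector per document)."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Made-up sample reviews, not taken from the dataset.
sample_reviews = ["this app is great great", "the app crashes and it is slow"]
vocabulary, vectors = bag_of_words(sample_reviews)
print(vocabulary)
print(vectors)
```

Each column corresponds to one vocabulary word, so a repeated word such as "great" shows up as a count of 2 in that review's row.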

## Modeling

In [90]:
# Naive Bayes Models
models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB()
}

results = {}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    # AUC is computed from hard class labels here, not predicted probabilities
    ras = roc_auc_score(y_test, y_pred)
    results[name] = ras
    print(f"\n--- {name} Results:\n")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))



--- GaussianNB Results:

Accuracy: 0.8156424581005587
Confusion Matrix:
 [[104  13]
 [ 20  42]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.89      0.86       117
           1       0.76      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.81      0.82      0.81       179


--- MultinomialNB Results:

Accuracy: 0.8547486033519553
Confusion Matrix:
 [[112   5]
 [ 21  41]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.96      0.90       117
           1       0.89      0.66      0.76        62

    accuracy                           0.85       179
   macro avg       0.87      0.81      0.83       179
weighted avg       0.86      0.85      0.85       179


--- BernoulliNB Results:

Accuracy: 0.7821229050279329
Confusion Matrix:
 [[113   4]
 [ 35  27]]
Cla

In [91]:
# Best NB Model
best_nb_model = max(results, key=results.get)
print(f"Best Naive Bayes: \n{best_nb_model} (AUC: {results[best_nb_model]:.4f})")

Best Naive Bayes: 
MultinomialNB (AUC: 0.8093)


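Each Naive Bayes variant is fitted and evaluated on the same test set. MultinomialNB is selected as the best of the three because it has the highest AUC (0.8093) and accuracy (0.85); it is also a natural fit for word-count features.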
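Note that the AUC above is computed from hard class predictions rather than probabilities. With a single operating point, that area reduces to the average of the true-positive and true-negative rates (balanced accuracy). The short sketch below uses only the MultinomialNB confusion matrix printed above to reproduce the reported figures; the variable names are illustrative.

```python
# MultinomialNB confusion matrix from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 112, 5
fn, tp = 21, 41

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # share of predicted positives that are correct
recall_pos = tp / (tp + fn)      # true-positive rate (sensitivity)
specificity = tn / (tn + fp)     # true-negative rate
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# With hard 0/1 predictions the ROC curve has one interior point,
# so its area equals the mean of sensitivity and specificity.
auc_from_labels = (recall_pos + specificity) / 2

print(f"Accuracy:        {accuracy:.4f}")
print(f"Precision (1):   {precision_pos:.2f}")
print(f"Recall (1):      {recall_pos:.2f}")
print(f"F1 (1):          {f1_pos:.2f}")
print(f"AUC from labels: {auc_from_labels:.4f}")
```

An AUC based on `predict_proba` scores would generally differ from this value, since it summarizes ranking quality across all thresholds.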

In [92]:
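# Fit the selected model so that the object saved later is already trained
nb = MultinomialNB().fit(X_train_vec, y_train)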

### Optimization with RF

In [93]:
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_vec, y_train)
y_pred_rf = rf.predict(X_test_vec)

print("\n---Random Forest Results:\n")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))



---Random Forest Results:

Accuracy: 0.8212290502793296
Confusion Matrix:
 [[105  12]
 [ 20  42]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.90      0.87       117
           1       0.78      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.81      0.79      0.80       179
weighted avg       0.82      0.82      0.82       179



## **Final Conclusions**
The Multinomial Naive Bayes model achieved strong performance for sentiment classification of Google Play reviews. Accuracy reached around 85%, and the AUC (computed from hard class predictions) was about 0.81. The model identifies negative reviews very well (recall 0.96) but misses roughly a third of the positive ones (recall 0.66).

The most important predictor was the review text itself, transformed into word counts. This makes sense for sentiment analysis:

- The presence of positive words strongly indicates a positive sentiment.

- The presence of negative words reflects dissatisfaction or criticism.

- Word frequency patterns contribute naturally, as repeated expressions of praise or complaint strengthen classification.
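The points above can be sketched with a from-scratch multinomial Naive Bayes. This is a minimal illustration, assuming Laplace smoothing with `alpha=1.0` and whitespace tokenization; the toy reviews and the helper names `train_multinomial_nb` and `predict` are made up and are not part of the notebook or of scikit-learn.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    classes = sorted(set(labels))
    log_prior, log_likelihood = {}, {}
    for cls in classes:
        class_docs = [doc for doc, label in zip(docs, labels) if label == cls]
        log_prior[cls] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[cls] = {
            word: math.log((counts[word] + alpha) / total) for word in vocabulary
        }
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for cls in log_prior:
        scores[cls] = log_prior[cls] + sum(
            log_likelihood[cls][word]
            for word in text.split()
            if word in log_likelihood[cls]
        )
    return max(scores, key=scores.get)

# Made-up toy reviews (1 = positive, 0 = negative), not from the dataset.
toy_docs = ["love this app", "great app love it", "app crashes often", "hate the crashes"]
toy_labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)

print(predict("love love it", log_prior, log_likelihood))
print(predict("crashes crashes", log_prior, log_likelihood))
```

Because each occurrence of a word adds its log likelihood to the score once more, repeated praise or complaints push the prediction further toward the matching class.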

Compared to the other Naive Bayes variants (GaussianNB at 0.82 and BernoulliNB at 0.78 accuracy) and Random Forest (0.82), MultinomialNB provided the highest accuracy (0.85) while remaining simple to interpret. It is efficient, lightweight, and well‑suited for text classification tasks based on word counts.

**Conclusion:** Multinomial Naive Bayes offers the strongest performance among the models tested. It is efficient and a reasonable baseline for sentiment analysis of app reviews, with potential for use in real‑world recommendation or feedback systems, although its recall on positive reviews (0.66) leaves room for improvement.

In [94]:
# SAVE MODEL AND VECTORIZER
joblib.dump(nb, 'sentiment_model.pkl')
joblib.dump(vec_model, 'vectorizer.pkl')
print("Sentiment model and vectorizer saved.")


Sentiment model and vectorizer saved.
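As a rough sketch of how the saved artifacts might be used later, the example below runs a save/load round trip on made-up toy data. It assumes the saved model has been fitted before dumping, and it writes to a temporary directory instead of the notebook's working directory; the toy reviews and variable names are illustrative only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the real training set.
toy_reviews = ["love this app", "great app works well", "app crashes often", "terrible slow buggy"]
toy_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(toy_reviews), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Reload both artifacts, as a separate script or service would.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

new_reviews = ["  Love it works great ", "Crashes and slow"]
# Apply the same cleaning used at training time before vectorizing.
cleaned = [text.strip().lower() for text in new_reviews]
predictions = loaded_model.predict(loaded_vectorizer.transform(cleaned))
print(predictions)
```

The vectorizer must be the same fitted object used for training, since the model's feature columns are tied to its vocabulary.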
