# 💕 **SENTIMENT ANALYSIS** 💕

**NAIVE BAYES**
- **GaussianNB**
- **MultinomialNB**
- **BernoulliNB**

<br>

## **INDEX**

- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: STUDY OF VARIABLES AND PROCESSING**
- **STEP 3: MODEL SELECTION**

## **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

### **1.1. PROBLEM DEFINITION**


The goal of this project is to develop a **Sentiment Analysis Classifier** for Google Play Store reviews using **Naive Bayes Models**. This classifier will determine whether a review has a **positive** (`1`) or **negative** (`0`) polarity based on its textual content.

<br>

**WHAT IS SENTIMENT ANALYSIS?**

Sentiment analysis is a process in **Natural Language Processing (NLP)** used to identify and classify the sentiment of text data into categories such as **positive**, **negative**, or **neutral**. It is widely used in business and research to understand user feedback, gauge customer satisfaction, and monitor public opinion.

<br>

**NAIVE BAYES MODELS**

Naive Bayes is a family of **Probabilistic Classification Algorithms** based on **Bayes' Theorem**. It assumes that the predictors are conditionally independent given the class, which keeps training and prediction cheap and makes it a common choice for text classification tasks like sentiment analysis.
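
In symbols (the notation $c$ for a class and $x_1, \dots, x_n$ for the features is introduced here for illustration and does not come from the original notebook), the independence assumption lets the posterior factor into per-feature terms:

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The predicted class is the one that maximizes the right-hand side. The model types listed below differ only in how they model $P(x_i \mid c)$.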

**Types of Naive Bayes Models**
1. **GaussianNB**: Assumes continuous features follow a normal (Gaussian) distribution within each class.
2. **MultinomialNB**: Suitable for discrete features, like word counts.
3. **BernoulliNB**: Designed for binary/boolean features.
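
As a rough illustration of how the three variants treat the same input, the sketch below fits each one on a tiny, made-up word-count matrix. The data and variable names (`X_counts`, `models`, `predictions`) are assumptions for this example only and are not part of the project dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy word-count matrix: 4 documents x 3 vocabulary terms (made-up data)
X_counts = np.array([[2, 0, 1],
                     [3, 0, 0],
                     [0, 2, 1],
                     [0, 3, 0]])
y = np.array([1, 1, 0, 0])

models = {
    "GaussianNB": GaussianNB(),        # treats counts as continuous values
    "MultinomialNB": MultinomialNB(),  # models the counts directly
    "BernoulliNB": BernoulliNB(),      # binarizes counts to present/absent
}

# Fit each model and classify one new document that only contains term 0
new_doc = np.array([[1, 0, 0]])
predictions = {}
for name, model in models.items():
    model.fit(X_counts, y)
    predictions[name] = int(model.predict(new_doc)[0])
    print(name, predictions[name])
```

On real text the three models can disagree, which is why the project compares them on the same train/test split.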

In this project, these models will be applied to classify **Google Play Store reviews**, with a focus on identifying the most appropriate Naive Bayes implementation for the problem.

<br>

**VARIABLES**
- **`review` (Predictor)**: The text of the user's comment, which will be processed into numerical features.
- **`polarity` (Target)**: The sentiment of the comment, either **0** (negative) or **1** (positive).

<br>

**Key Steps**
1. **Text Preprocessing**: Cleaning the text (stripping leading and trailing whitespace, converting to lowercase) and converting it into a numerical format through vectorization with **CountVectorizer**.
2. **Model Selection**: Comparing and evaluating **GaussianNB**, **MultinomialNB**, and **BernoulliNB** to identify the best-performing model.
3. **Model Optimization**: Enhancing the chosen model with additional algorithms, such as **Random Forest**.
4. **Model Deployment**: Saving the trained model for future use.
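
The steps above can be sketched end to end as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not the notebook's actual workflow: the reviews are made up, `MultinomialNB` is used only as a placeholder model, and `pickle` is one of several ways to persist a fitted estimator.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews standing in for the real dataset
reviews = [
    "great app, love it",
    "terrible update, keeps crashing",
    "love the new design",
    "crashing all the time, terrible",
]
labels = [1, 0, 1, 0]

# lowercase=True is CountVectorizer's default; it is spelled out for clarity
pipeline = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
pipeline.fit(reviews, labels)

# Serialize the fitted pipeline to bytes and load it back;
# pickle.dump(pipeline, file_handle) would write it to disk instead
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

prediction = restored.predict(["love this app"])
print(prediction)
```

Bundling the vectorizer and the classifier in one object means the saved model can score raw text directly, without re-fitting the vocabulary.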

<br>

**Characteristics of the Problem**
- The dataset is **imbalanced**, containing textual data with **dichotomous labels**.
- The primary predictor, `review`, needs **NLP preprocessing** before modeling.
- The solution requires not only classification but also model **optimization** for better performance.
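
A quick way to quantify the imbalance is to look at the share of each class. The sketch below uses a made-up label series in place of the real `polarity` column. For reference, the classification reports later in this notebook list 126 negative and 53 positive reviews in the test split.

```python
import pandas as pd

# Made-up labels standing in for the `polarity` column
polarity = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="polarity")

# Share of each class; a large gap between the shares signals imbalance
class_share = polarity.value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `df['polarity'].value_counts(normalize=True)`.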



<br>

**1.2. LIBRARY IMPORTING**

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer



**1.3. LOADING THE DATASET**


In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
df.head()

| | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |
| 3 | com.facebook.katana | the new features suck for those of us who don... | 0 |
| 4 | com.facebook.katana | forced reload on uploading pic on replying co... | 0 |


<br>

## **STEP 2: STUDY OF VARIABLES AND PROCESSING**

- 2.1. Focus on relevant variables
- 2.2. Preprocess the text
- 2.3. Split the data into Training and Testing sets
- 2.4. Vectorize Text Data

In this step, we prepare the dataset for modeling.

<br>

**2.1. FOCUS ON RELEVANT VARIABLES**

Remove the **`package_name`** column since it doesn't contribute to classifying the sentiment.

In [5]:
# Dropping the irrelevant column
df = df.drop(columns=['package_name'])

# Display the updated structure of the dataset
df.head()

| | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |
| 3 | the new features suck for those of us who don... | 0 |
| 4 | forced reload on uploading pic on replying co... | 0 |


<br>

**2.2. PREPROCESS THE TEXT**

Clean and normalize the text in the **`review`** column by stripping leading and trailing whitespace and converting all text to lowercase.

In [6]:
# Cleaning and normalizing text
df['review'] = df['review'].str.strip().str.lower()

# Display a few cleaned reviews
df['review'].head()

0    privacy at least put some option appear offlin...
1    messenger issues ever since the last update, i...
2    profile any time my wife or anybody has more t...
3    the new features suck for those of us who don'...
4    forced reload on uploading pic on replying com...
Name: review, dtype: object
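
To make the effect of `str.strip().str.lower()` concrete, here is a small sketch on made-up strings. Note that `strip` only trims the two ends of each string. The extra step that collapses inner whitespace is an assumption added for illustration and is not part of the original notebook.

```python
import pandas as pd

# Made-up strings, not taken from the dataset
raw = pd.Series(["  Great APP!  ", "Too   many ADS\n"])

# Same chain as above: trim both ends, then lowercase
cleaned = raw.str.strip().str.lower()

# Optional extra step (an assumption, not in the original notebook):
# collapse runs of inner whitespace into a single space
collapsed = cleaned.str.replace(r"\s+", " ", regex=True)

print(cleaned.tolist())
print(collapsed.tolist())
```

Because `CountVectorizer` tokenizes on word boundaries, inner whitespace runs do not change the resulting counts, so the extra step is cosmetic here.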

<br>

**2.3. SPLIT THE DATA INTO TRAINING AND TESTING SETS**

Divide the dataset into **TRAINING** and **TESTING** subsets.

In [9]:
# Define predictors (X) and target (y)
X = df['review']
y = df['polarity']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
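
Since the labels are imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in both subsets. The sketch below is a hedged alternative on made-up data, not the split used in this notebook. `stratify` is a real parameter of `train_test_split`, while the toy arrays and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 15 negative and 5 positive labels
texts = np.array([f"review {i}" for i in range(20)])
labels = np.array([0] * 15 + [1] * 5)

# stratify=labels keeps the 75/25 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives:", int(y_te.sum()), "of", len(y_te))
```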


<br>

**2.4. VECTORIZE TEXT DATA**

Convert the cleaned text into a numerical format using **CountVectorizer**, which builds a matrix of word counts and, with `stop_words="english"`, ignores common English stop words.

In [10]:
# Initialize the CountVectorizer
vec_model = CountVectorizer(stop_words="english")

# Fit and transform the training data
X_train = vec_model.fit_transform(X_train).toarray()

# Transform the testing data using the same vectorizer
X_test = vec_model.transform(X_test).toarray()


<br>

## **STEP 3: MODEL SELECTION**

In [15]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the BernoulliNB model
bernoulli_nb = BernoulliNB()

# Train the model
bernoulli_nb.fit(X_train, y_train)

# Make predictions
y_pred_bernoulli = bernoulli_nb.predict(X_test)

# Evaluate the model
print("BernoulliNB Accuracy:", accuracy_score(y_test, y_pred_bernoulli))
print("Classification Report for BernoulliNB:\n", classification_report(y_test, y_pred_bernoulli))

BernoulliNB Accuracy: 0.770949720670391
Classification Report for BernoulliNB:
               precision    recall  f1-score   support

           0       0.79      0.93      0.85       126
           1       0.70      0.40      0.51        53

    accuracy                           0.77       179
   macro avg       0.74      0.66      0.68       179
weighted avg       0.76      0.77      0.75       179



In [16]:
from sklearn.naive_bayes import GaussianNB

# Initialize the GaussianNB model
gaussian_nb = GaussianNB()

# Train the model (requires dense array for GaussianNB)
gaussian_nb.fit(X_train, y_train)

# Make predictions
y_pred_gaussian = gaussian_nb.predict(X_test)

# Evaluate the model
print("GaussianNB Accuracy:", accuracy_score(y_test, y_pred_gaussian))
print("Classification Report for GaussianNB:\n", classification_report(y_test, y_pred_gaussian))


GaussianNB Accuracy: 0.8044692737430168
Classification Report for GaussianNB:
               precision    recall  f1-score   support

           0       0.85      0.88      0.86       126
           1       0.69      0.62      0.65        53

    accuracy                           0.80       179
   macro avg       0.77      0.75      0.76       179
weighted avg       0.80      0.80      0.80       179



In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the MultinomialNB model
multinomial_nb = MultinomialNB()

# Train the model
multinomial_nb.fit(X_train, y_train)

# Make predictions
y_pred_multinomial = multinomial_nb.predict(X_test)

# Evaluate the model
print("MultinomialNB Accuracy:", accuracy_score(y_test, y_pred_multinomial))
print("Classification Report for MultinomialNB:\n", classification_report(y_test, y_pred_multinomial))


MultinomialNB Accuracy: 0.8156424581005587
Classification Report for MultinomialNB:
               precision    recall  f1-score   support

           0       0.84      0.90      0.87       126
           1       0.73      0.60      0.66        53

    accuracy                           0.82       179
   macro avg       0.79      0.75      0.77       179
weighted avg       0.81      0.82      0.81       179
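
In the reports above, recall for the positive class (`1`) is the lowest per-class figure for every model, ranging from 0.40 (BernoulliNB) to 0.62 (GaussianNB). A confusion matrix makes that kind of weakness easier to inspect. The sketch below uses made-up labels and predictions, not the notebook's results.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up ground truth and predictions for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
positive_recall = recall_score(y_true, y_hat, pos_label=1)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("recall for class 1:", positive_recall)
```

With the notebook's variables, the equivalent call would be `confusion_matrix(y_test, y_pred_multinomial)`.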

