<br>

<br>

<br>

# 💕 **SENTIMENT ANALYSIS** 💕

**NAIVE BAYES**
- **GaussianNB**
- **MultinomialNB**
- **BernoulliNB**

<br>

## **INDEX**

- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: STUDY OF VARIABLES AND PROCESSING**
- **STEP 3: MODEL SELECTION**
- **STEP 4: OPTIMIZE THE MODEL WITH RANDOM FOREST**
- **STEP 5: SAVE THE MODEL**
- **STEP 6: EXPLORE ALTERNATIVE MODELS**
- **STEP 7: CONCLUSION**

<br>

## **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

- 1.1. Problem Definition
- 1.2. Library Importing
- 1.3. Loading Dataset

<br>

**1.1. PROBLEM DEFINITION**


The goal of this project is to develop a **Sentiment Analysis Classifier** for Google Play Store reviews using **Naive Bayes Models**. This classifier will determine whether a review has a **positive** (`1`) or **negative** (`0`) polarity based on its textual content.

<br>

**WHAT IS SENTIMENT ANALYSIS?**

Sentiment analysis is a process in **Natural Language Processing (NLP)** used to identify and classify the sentiment of text data into categories such as **positive**, **negative**, or **neutral**. It is widely used in business and research to understand user feedback, gauge customer satisfaction, and monitor public opinion.

<br>

**NAIVE BAYES MODELS**

Naive Bayes is a family of **Probabilistic Classification Algorithms** based on **Bayes' Theorem**. It assumes that the features are conditionally independent given the class, which keeps training and prediction fast and makes it a common baseline for text classification tasks like sentiment analysis.
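For reference, the independence assumption leads to the following decision rule. The notation is standard rather than taken from this notebook: $y$ is the class (here, the polarity) and $x_1, \dots, x_n$ are the features (for example, word counts).

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The three variants listed next differ only in how they model the per-feature term $P(x_i \mid y)$.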

**Types of Naive Bayes Models**
1. **GaussianNB**: Assumes features follow a normal distribution.
2. **MultinomialNB**: Suitable for discrete features, like word counts.
3. **BernoulliNB**: Designed for binary/boolean features.

In this project, these models will be applied to classify **Google Play Store reviews**, with a focus on identifying the most appropriate Naive Bayes implementation for the problem.

<br>

**VARIABLES**
- **`review` (Predictor)**: The text of the user’s comment, which will be processed into numerical features.
- **`polarity` (Target)**: The sentiment of the comment, either **0** (negative) or **1** (positive).

<br>

**KEY STEPS**
1. **Text Preprocessing**: Cleaning and converting text into a numerical format using techniques like stripping leading and trailing whitespace, converting to lowercase, and vectorization with **CountVectorizer**.
2. **Model Selection**: Comparing and evaluating **GaussianNB**, **MultinomialNB**, and **BernoulliNB** to identify the best-performing model.
3. **Model Optimization**: Enhancing the chosen model with additional algorithms, such as **Random Forest**.
4. **Model Deployment**: Saving the trained model for future use.

<br>

**CHARACTERISTICS OF THE PROBLEM**
- The dataset contains textual data with **dichotomous labels** and is **imbalanced**: negative reviews outnumber positive ones.
- The primary predictor, `review`, needs **NLP preprocessing** before modeling.
- The solution requires not only classification but also model **optimization** for better performance.
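A quick way to quantify the imbalance is to look at the share of each class. The sketch below uses a small hypothetical label series in place of `df["polarity"]`; the helper name `class_balance` is an illustration, not part of the notebook.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.Series:
    """Return the share of each class, ordered by class label."""
    return labels.value_counts(normalize=True).sort_index()


# Hypothetical labels standing in for df["polarity"] (0 = negative, 1 = positive)
toy_polarity = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

balance = class_balance(toy_polarity)
print(balance)
```

On the real data, the same call on `df["polarity"]` would show how strongly the negative class dominates, which is useful context when reading accuracy figures later on.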

<br>

<br>

**1.2. LIBRARY IMPORTING**

In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle
import warnings
warnings.filterwarnings("ignore")

<br>

**1.3. LOADING THE DATASET**


In [37]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv")
df.head()

|   | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |
| 3 | com.facebook.katana | the new features suck for those of us who don... | 0 |
| 4 | com.facebook.katana | forced reload on uploading pic on replying co... | 0 |


<br>

## **STEP 2: STUDY OF VARIABLES AND PROCESSING**

- 2.1. Focus on relevant variables
- 2.2. Preprocess the text
- 2.3. Split the data into Training and Testing sets
- 2.4. Vectorize Text Data

In this step, we are preparing the dataset for modeling.

<br>

<br>

**2.1. FOCUS ON RELEVANT VARIABLES**

Remove the **`package_name`** column since it doesn't contribute to classifying the sentiment.

In [38]:
# Dropping the irrelevant column
df = df.drop(columns=['package_name'])

# Display the updated structure of the dataset
df.head()

|   | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offli... | 0 |
| 1 | messenger issues ever since the last update, ... | 0 |
| 2 | profile any time my wife or anybody has more ... | 0 |
| 3 | the new features suck for those of us who don... | 0 |
| 4 | forced reload on uploading pic on replying co... | 0 |


<br>

**2.2. PREPROCESS THE TEXT**

Clean and normalize the text in the **`review`** column by stripping leading and trailing whitespace and converting all text to lowercase.

In [39]:
# Cleaning and normalizing text
df['review'] = df['review'].str.strip().str.lower()

# Display a few cleaned reviews
df['review'].head()

0    privacy at least put some option appear offlin...
1    messenger issues ever since the last update, i...
2    profile any time my wife or anybody has more t...
3    the new features suck for those of us who don'...
4    forced reload on uploading pic on replying com...
Name: review, dtype: object
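To make the effect of this step concrete, here is a small sketch on hypothetical strings (not rows from the dataset). It also shows one optional extra, collapsing repeated inner whitespace, which is an assumption on my part and not something the notebook does.

```python
import pandas as pd

# Hypothetical raw reviews with padding, mixed case and repeated inner spaces
raw = pd.Series(["  Great APP!  ", "Crashes   after UPDATE\n"])

# Same normalization as the notebook: trim both ends, then lowercase
basic = raw.str.strip().str.lower()

# Optional extra step (assumption, not in the notebook): collapse inner whitespace runs
collapsed = basic.str.replace(r"\s+", " ", regex=True)

print(basic.tolist())      # ['great app!', 'crashes   after update']
print(collapsed.tolist())  # ['great app!', 'crashes after update']
```

Note that `str.strip()` only touches the ends of each string. `CountVectorizer` lowercases and tokenizes on its own by default, so this cleaning mainly makes the raw column easier to inspect.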

<br>

**2.3. SPLIT THE DATA INTO TRAINING AND TESTING SETS**

Divide the dataset into **TRAINING** and **TESTING** subsets.

In [40]:
# Define predictors (X) and target (y)
X = df['review']
y = df['polarity']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
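Because the classes are imbalanced, a stratified split may be worth considering so that both subsets keep the same class ratio. The sketch below uses hypothetical data (`X_toy`, `y_toy`); applying `stratify=y` to the real split would change which rows land in the test set, and therefore the scores reported later in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 70 negative (0) and 30 positive (1) reviews
X_toy = pd.Series([f"review {i}" for i in range(100)])
y_toy = pd.Series([0] * 70 + [1] * 30)

# stratify=y_toy keeps the 70/30 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positive reviews in each subset
print(y_tr.mean(), y_te.mean())
```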


<br>

**2.4. VECTORIZE TEXT DATA**

Convert the cleaned text into a numerical format using **CountVectorizer**, which creates a matrix of word counts, ignoring common stop words.

In [41]:
# Initialize the CountVectorizer
vec_model = CountVectorizer(stop_words="english")

# Fit and transform the training data
X_train = vec_model.fit_transform(X_train).toarray()

# Transform the testing data using the same vectorizer
X_test = vec_model.transform(X_test).toarray()


<br>

<br>

## **STEP 3: MODEL SELECTION**

- 3.1. Train and evaluate models using training and testing datasets.
- 3.2. Compare the models' performance to determine the most suitable one for this problem.

<br>

<br>

**3.1. TRAIN AND EVALUATE MODELS USING TRAINING AND TESTING DATASETS**

- Train the models using the training dataset.
- Evaluate their performance by testing them on the testing dataset using appropriate metrics (e.g., `accuracy`, `precision`, `recall`, or `F1-score`).
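As a reminder of what these metrics measure, the sketch below computes precision, recall and F1 for the positive class by hand on hypothetical labels and compares the results with scikit-learn's functions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives correctly found
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives flagged as positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives that were missed

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

With imbalanced classes, recall and F1 for the minority class are often more informative than accuracy alone.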


- **Train BernoulliNB**

In [42]:
# Initialize the BernoulliNB model
bernoulli_nb = BernoulliNB()

# Train the model
bernoulli_nb.fit(X_train, y_train)

# Make predictions
y_pred_bernoulli = bernoulli_nb.predict(X_test)

# Evaluate the model
print("BernoulliNB Accuracy:", accuracy_score(y_test, y_pred_bernoulli))
print("Classification Report for BernoulliNB:\n", classification_report(y_test, y_pred_bernoulli))

BernoulliNB Accuracy: 0.770949720670391
Classification Report for BernoulliNB:
               precision    recall  f1-score   support

           0       0.79      0.93      0.85       126
           1       0.70      0.40      0.51        53

    accuracy                           0.77       179
   macro avg       0.74      0.66      0.68       179
weighted avg       0.76      0.77      0.75       179



<br>

- **Train GaussianNB**

In [43]:
# Initialize the GaussianNB model
gaussian_nb = GaussianNB()

# Train the model (requires dense array for GaussianNB)
gaussian_nb.fit(X_train, y_train)

# Make predictions
y_pred_gaussian = gaussian_nb.predict(X_test)

# Evaluate the model
print("GaussianNB Accuracy:", accuracy_score(y_test, y_pred_gaussian))
print("Classification Report for GaussianNB:\n", classification_report(y_test, y_pred_gaussian))


GaussianNB Accuracy: 0.8044692737430168
Classification Report for GaussianNB:
               precision    recall  f1-score   support

           0       0.85      0.88      0.86       126
           1       0.69      0.62      0.65        53

    accuracy                           0.80       179
   macro avg       0.77      0.75      0.76       179
weighted avg       0.80      0.80      0.80       179



<br>

- **Train MultinomialNB**

In [44]:
# Initialize the MultinomialNB model
multinomial_nb = MultinomialNB()

# Train the model
multinomial_nb.fit(X_train, y_train)

# Make predictions
y_pred_multinomial = multinomial_nb.predict(X_test)

# Evaluate the model
print("MultinomialNB Accuracy:", accuracy_score(y_test, y_pred_multinomial))
print("Classification Report for MultinomialNB:\n", classification_report(y_test, y_pred_multinomial))


MultinomialNB Accuracy: 0.8156424581005587
Classification Report for MultinomialNB:
               precision    recall  f1-score   support

           0       0.84      0.90      0.87       126
           1       0.73      0.60      0.66        53

    accuracy                           0.82       179
   macro avg       0.79      0.75      0.77       179
weighted avg       0.81      0.82      0.81       179



<br>

**3.2. COMPARE THE MODELS' PERFORMANCE TO DETERMINE THE MOST SUITABLE ONE FOR THIS PROBLEM**


**MODEL PERFORMANCE COMPARISON** 

Based on the accuracy scores of the three Naive Bayes models:

- **BernoulliNB Accuracy**: 0.7709
- **GaussianNB Accuracy**: 0.8045
- **MultinomialNB Accuracy**: 0.8156

The **`MultinomialNB`** model performs best for this problem, achieving the highest accuracy of **81.56%** (146 of 179 test reviews). This is consistent with its design, which targets discrete features such as word counts.

The GaussianNB model comes close at 80.45%, although it treats each feature as a continuous, normally distributed variable, which is a poor fit for the discrete word counts used here. The BernoulliNB model, which only records whether a word is present, has the lowest accuracy (77.09%) and a recall of just 0.40 for positive reviews.

Hence, **`MultinomialNB` is the best choice for our sentiment analysis classifier**.
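The test split holds only 179 reviews, so the gaps between models correspond to a handful of reviews. One way to get a steadier comparison is cross-validation. The sketch below is an illustration on a hypothetical toy corpus, not a rerun of the notebook; `GaussianNB` is left out because it would need a dense conversion step inside the pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 6 positive (1) and 6 negative (0) reviews
texts = [
    "love this app", "great update works perfectly", "very useful and fast",
    "excellent features love it", "works great highly recommend", "amazing app easy to use",
    "app crashes constantly", "terrible update broke everything", "too many ads very slow",
    "worst app waste of time", "keeps freezing after update", "bad design and buggy",
]
labels = [1] * 6 + [0] * 6

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = {}
for name, clf in [("MultinomialNB", MultinomialNB()), ("BernoulliNB", BernoulliNB())]:
    # The vectorizer sits inside the pipeline so each fold learns its own vocabulary
    pipe = make_pipeline(CountVectorizer(stop_words="english"), clf)
    scores[name] = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro").mean()

print(scores)
```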

<br>

<br>

## **STEP 4: OPTIMIZE THE MODEL WITH RANDOM FOREST**

In Step 4, we aim to improve the performance of the MultinomialNB model (our best-performing Naive Bayes implementation) by combining it with a Random Forest Classifier. This involves:

- Using the predictions from the MultinomialNB model as features for the Random Forest.
- Training the Random Forest model to further refine the classification.
- Comparing the performance of the optimized model with the original Naive Bayes model to evaluate improvements.

This step explores whether an ensemble method like Random Forest can boost the accuracy by capturing patterns that MultinomialNB might miss.
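One caveat with this kind of stacking: if the second model is trained on probabilities that the first model produced for its own training rows, those probabilities tend to be over-confident. A common remedy is to feed the second model out-of-fold probabilities instead. The sketch below shows that variant on synthetic count data; all names and data in it are hypothetical, and it is an alternative to consider rather than the method used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)

# Hypothetical count data: 200 documents, 30 "words", binary labels
y_toy = rng.integers(0, 2, size=200)
rates = 1.0 + y_toy[:, None] * np.linspace(0, 1, 30)  # positive docs use later words more
X_toy = rng.poisson(lam=rates, size=(200, 30))

X_tr, X_te = X_toy[:150], X_toy[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

nb = MultinomialNB()

# Out-of-fold probabilities: each training row is scored by a model that never saw it
oof_proba = cross_val_predict(nb, X_tr, y_tr, cv=5, method="predict_proba")

# Refit on the full training set to score genuinely unseen data
nb.fit(X_tr, y_tr)
test_proba = nb.predict_proba(X_te)

meta = RandomForestClassifier(n_estimators=100, random_state=42)
meta.fit(oof_proba, y_tr)
stacked_pred = meta.predict(test_proba)

print("Stacked accuracy on toy data:", (stacked_pred == y_te).mean())
```

scikit-learn's `StackingClassifier` wraps the same idea in a single estimator.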

**Generate Predictions from MultinomialNB**

In [50]:
# Use the best MultinomialNB model to generate predictions for train and test data
multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, y_train)

# Get probabilities (predictions for Random Forest)
train_predictions_nb = multinomial_nb.predict_proba(X_train)
test_predictions_nb = multinomial_nb.predict_proba(X_test)


**Train the Random Forest**

In [46]:
# Initialize the Random Forest Classifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Train Random Forest using MultinomialNB probabilities as features
random_forest.fit(train_predictions_nb, y_train)

# Make predictions with the optimized model
y_pred_rf = random_forest.predict(test_predictions_nb)

# Evaluate the Random Forest model (metrics were already imported in section 1.2)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report for Random Forest:\n", classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.8268156424581006
Classification Report for Random Forest:
               precision    recall  f1-score   support

           0       0.87      0.88      0.88       126
           1       0.71      0.70      0.70        53

    accuracy                           0.83       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.83      0.83      0.83       179



**Compare Performance**

In [47]:
# Comparing MultinomialNB and Optimized Random Forest
print("MultinomialNB Accuracy:", accuracy_score(y_test, multinomial_nb.predict(X_test)))
print("Random Forest Accuracy (Optimized):", accuracy_score(y_test, y_pred_rf))


MultinomialNB Accuracy: 0.8156424581005587
Random Forest Accuracy (Optimized): 0.8268156424581006


**Conclusion:**

Training a Random Forest Classifier on the class probabilities produced by **`MultinomialNB`** raised test accuracy from **81.56%** to **82.68%**, and recall for positive reviews from 0.60 to 0.70. The gain amounts to two additional correctly classified reviews out of 179, so it suggests, rather than proves, that combining the models helps on this dataset.

<br>

## **STEP 5: SAVE THE MODEL**

In [51]:
# Save the Optimized Random Forest Model
with open('random_forest_model.pkl', 'wb') as model_file:
    pickle.dump(random_forest, model_file)

print("Optimized Random Forest model saved as 'random_forest_model.pkl'")

Optimized Random Forest model saved as 'random_forest_model.pkl'
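Note that the Random Forest alone cannot score raw text: it expects the probabilities produced by the MultinomialNB model, which in turn expects counts from the fitted vectorizer. A minimal sketch of saving and reloading all three parts together is shown below. It uses a hypothetical toy corpus, and the file name `sentiment_bundle.pkl` and helper `predict_polarity` are illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the real training data
texts = ["love this app", "great update love it",
         "app crashes constantly", "terrible update crashes"]
labels = [1, 1, 0, 0]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(texts)
nb = MultinomialNB().fit(counts, labels)
rf = RandomForestClassifier(n_estimators=10, random_state=42).fit(nb.predict_proba(counts), labels)

# Save every piece needed to go from raw text to a prediction
bundle = {"vectorizer": vec, "naive_bayes": nb, "random_forest": rf}
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)


def predict_polarity(reviews, parts):
    """Run raw review text through vectorizer -> Naive Bayes -> Random Forest."""
    review_counts = parts["vectorizer"].transform(reviews)
    proba = parts["naive_bayes"].predict_proba(review_counts)
    return parts["random_forest"].predict(proba)


prediction = predict_polarity(["love the update"], loaded)
print(prediction)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.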


<br>

## **STEP 6: EXPLORE ALTERNATIVE MODELS**

In [None]:
# Initialize the SVM model
svm_model = SVC(kernel='linear', random_state=42)

# Train the model
svm_model.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_test)

# Evaluate the model
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report for SVM:\n", classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.8324022346368715
Classification Report for SVM:
               precision    recall  f1-score   support

           0       0.93      0.83      0.87       126
           1       0.67      0.85      0.75        53

    accuracy                           0.83       179
   macro avg       0.80      0.84      0.81       179
weighted avg       0.85      0.83      0.84       179



### **Comparison of SVM and Random Forest Results**

#### **Support Vector Machines (SVM) Results:**
- **Accuracy**: 0.8324
- **Precision**:
  - Class 0: 0.93
  - Class 1: 0.67
- **Recall**:
  - Class 0: 0.83
  - Class 1: 0.85
- **F1-Score**:
  - Class 0: 0.87
  - Class 1: 0.75
- **Macro Average F1-Score**: 0.81
- **Weighted Average F1-Score**: 0.84

#### **Random Forest Results:**
- **Accuracy**: 0.8268
- **Precision**:
  - Class 0: 0.87
  - Class 1: 0.71
- **Recall**:
  - Class 0: 0.88
  - Class 1: 0.70
- **F1-Score**:
  - Class 0: 0.88
  - Class 1: 0.70
- **Macro Average F1-Score**: 0.79
- **Weighted Average F1-Score**: 0.83

<br>

### **Comparison:**
1. **Accuracy**:
   - SVM slightly outperformed Random Forest with **83.24%** vs. **82.68%**.

2. **Precision**:
   - SVM achieved a higher precision for **Class 0** (0.93 vs. 0.87), meaning fewer positive reviews were mislabeled as negative.
   - Random Forest had better precision for **Class 1** (0.71 vs. 0.67), meaning fewer negative reviews were mislabeled as positive.

3. **Recall**:
   - SVM had a better recall for **Class 1** (0.85 vs. 0.70), capturing more of the positive reviews.
   - Random Forest had a slightly higher recall for **Class 0** (0.88 vs. 0.83), capturing more of the negative reviews.

4. **F1-Score**:
   - SVM had a higher F1-Score for **Class 1** (0.75 vs. 0.70), showing a better balance between precision and recall.
   - For **Class 0**, Random Forest and SVM were very close (0.88 vs. 0.87).

5. **Weighted and Macro Averages**:
   - SVM showed better overall scores, with a **weighted average F1-Score** of **0.84** vs. **0.83** for Random Forest, and a **macro average F1-Score** of **0.81** vs. **0.79**.

<br>

### **Observation:**
While both models performed well, **SVM** slightly outperformed Random Forest in terms of accuracy, recall for Class 1, and overall F1-Scores. The SVM model is particularly effective for text classification problems, leveraging its strength in high-dimensional spaces such as those generated by word count matrices.


<br>

<br>

## **STEP 7: CONCLUSION**

#### **Overview**
This project aimed to build a sentiment analysis classifier for Google Play Store reviews using **Naive Bayes Models** and explore alternative methods to optimize performance. Through the steps, we developed, trained, and evaluated multiple models, including **MultinomialNB**, **Random Forest**, and **Support Vector Machines (SVM)**.

<br>

---

<br>

#### **Model Comparisons**
The performance of the models is summarized below:

- **MultinomialNB**:
  - Accuracy: **0.8156**
  - Strength: Efficient and interpretable for text classification tasks using word count data.
  - Limitation: Lower recall for positive reviews (Class 1): 0.60, compared with 0.70 for Random Forest and 0.85 for SVM.

- **Random Forest**:
  - Accuracy: **0.8268**
  - Strength: Ensemble learning improved classification balance across both classes.
  - Limitation: Computationally more expensive, with lower recall for positive reviews than SVM (0.70 vs. 0.85).

- **Support Vector Machines (SVM)**:
  - Accuracy: **0.8324**
  - Precision:
    - Class 0: **0.93**
    - Class 1: **0.67**
  - Recall:
    - Class 0: **0.83**
    - Class 1: **0.85**
  - Strength: Best accuracy and recall for positive reviews (Class 1), showcasing its ability to handle high-dimensional data.
  - Limitation: Lowest precision for positive reviews (Class 1) of the three models (0.67, vs. 0.71 for Random Forest and 0.73 for MultinomialNB).

<br>

---

<br>

#### **Key Findings**
1. **Naive Bayes Models**:
   - The **MultinomialNB** model performed well, leveraging its simplicity and suitability for word count data.
   - While effective, it was slightly outperformed by the alternative models, most notably SVM.

2. **Alternative Models**:
   - The **Random Forest** model provided a balanced performance but was computationally more expensive.
   - The **SVM** model achieved the highest accuracy and overall performance, particularly excelling in identifying positive reviews.

<br>

---

<br>

#### **Final Recommendation**
For this specific sentiment analysis task:
- **Support Vector Machines (SVM)** is the best-performing model, offering the highest accuracy (**83.24%**) and strong recall for positive reviews. 
- However, **MultinomialNB** remains a solid choice for quick, efficient text classification tasks, especially when computational resources are limited.

<br>

---

<br>

#### **Future Work**
1. Experiment with hyperparameter tuning for all models to further improve performance.
2. Use more advanced text vectorization techniques, such as **TF-IDF** or **word embeddings** (e.g., Word2Vec, GloVe).
3. Explore deep learning models (e.g., LSTM, Transformers) for sentiment analysis on larger datasets.
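As a starting point for the first two items, the sketch below combines TF-IDF features with a small grid search over the Naive Bayes smoothing parameter. It runs on a hypothetical toy corpus, and the grid values are arbitrary assumptions chosen for illustration rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 positive (1) and 6 negative (0) reviews
texts = [
    "love this app", "great update works perfectly", "very useful and fast",
    "excellent features love it", "works great highly recommend", "amazing app easy to use",
    "app crashes constantly", "terrible update broke everything", "too many ads very slow",
    "worst app waste of time", "keeps freezing after update", "bad design and buggy",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])

# Illustrative grid: unigrams vs. unigrams + bigrams, and three smoothing strengths
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Keeping the vectorizer inside the pipeline means each cross-validation fold fits its own vocabulary, so the validation folds do not leak into feature extraction.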

The project demonstrates how different machine learning models can be applied to sentiment analysis, highlighting the strengths and trade-offs of each approach.

<br>

<br>

<br>
