## Sentiment analysis
Naive Bayes models are well suited to sentiment analysis, topic classification, and recommendation tasks, because these problems fit the model's assumptions reasonably well: a text can be represented as word counts, and treating those words as conditionally independent given the class tends to work in practice.

In this project you will use a dataset of Google Play Store reviews to build a classifier that predicts the polarity of each review.

In [16]:
# Step 0. Import libraries, custom modules and logging
# Basics ---------------------------------------------------------------
#import logging
#import joblib
# Data -----------------------------------------------------------------
import pandas as pd
import numpy as np
# Graphics -------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning -----------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.naive_bayes import (
                        GaussianNB,
                        MultinomialNB,
                        BernoulliNB
)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

In [2]:
# Step 1. Create data frame
# 1.1 Read from source and get basic info
df_raw = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv')
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


In [3]:
df_raw.sample(10)

| | package_name | review | polarity |
|---|---|---|---|
| 456 | com.whatsapp | great app but... i heard that apple added som... | 1 |
| 843 | com.hamropatro | superb! all in one! | 1 |
| 187 | com.imangi.templerun2 | good game but... i have always liked temple r... | 0 |
| 351 | com.viber.voip | so far nice and useful but... please, introdu... | 1 |
| 318 | com.viber.voip | good job but.... please reduce its size. runn... | 0 |
| 492 | com.Slack | excellent this app is great, keep up the good... | 1 |
| 217 | com.supercell.clashofclans | it was epic....then a certain update happened... | 0 |
| 671 | com.hamrokeyboard | white theme please ŕ¤žŕ¤˛ŕ¤žŕ¤ ŕ¤żŕľ app ŕ¤... | 0 |
| 310 | com.tencent.mm | want to know why did the other people i can s... | 0 |
| 783 | org.mozilla.firefox | all you need, easy and gives you control open... | 0 |


In [None]:
# 1.2 Process Data 
df_interim = (
    df_raw
    .copy()
    .drop(["package_name"], axis=1)
    )
df_interim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    891 non-null    object
 1   polarity  891 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.1+ KB


In [11]:
df_interim["review"] = df_interim["review"].str.strip().str.lower()
df_interim.sample(10)

| | review | polarity |
|---|---|---|
| 57 | why is there so much space? there is so much w... | 0 |
| 668 | very good application ever dharai ramro app ho... | 1 |
| 697 | its really lovely apps . i am allready using a... | 1 |
| 21 | keeps updating updates all the time, slows my ... | 0 |
| 451 | its really good but.. not everyone in your add... | 1 |
| 698 | almost every time i scroll on a webpage, the c... | 1 |
| 742 | takes longer time plz make d newspaper downloa... | 0 |
| 581 | very useful i haven't used it as much as i cou... | 1 |
| 866 | great game, too much ads angry birds is fun an... | 1 |
| 739 | error and only error i think this is the worst... | 0 |


In [12]:
df = df_interim.copy()

In [None]:
# Step 2. Split the dataset

df_train, df_test = train_test_split(df, 
                                     random_state=2024, 
                                     test_size=0.2)
df_train = df_train.reset_index(drop=True).sort_values(by='polarity')
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 19 to 0
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    712 non-null    object
 1   polarity  712 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 16.7+ KB
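
The split above can be sanity-checked by comparing the share of positive reviews in each part. The following is a minimal sketch on a made-up toy frame (`toy_df` and every name prefixed with `toy_` are hypothetical, not part of the project data). It also passes `stratify`, which the notebook does not use; treat that as an optional variation that keeps the class proportions similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 reviews, 14 negative (0) and 6 positive (1).
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "polarity": [0] * 14 + [1] * 6,
})

# Same test_size and random_state as the notebook; stratify is an added assumption.
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=2024,
    stratify=toy_df["polarity"],
)

# Share of positive reviews in each part (mean of a 0/1 column).
train_share = toy_train["polarity"].mean()
test_share = toy_test["polarity"].mean()
print(len(toy_train), len(toy_test), train_share, test_share)
```

If the two shares differ a lot on the real data, accuracy on the test set may be harder to interpret.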


In [14]:
X_train = df_train.reset_index(drop=True).drop('polarity', axis=1)
y_train = df_train['polarity'].reset_index(drop=True)
X_test = df_test.reset_index(drop=True).drop('polarity', axis=1)
y_test = df_test['polarity'].reset_index(drop=True)

In [17]:
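# 2.1 Transform the text into a word count matrix.
# CountVectorizer expects an iterable of strings, so pass the "review" column
# rather than the whole DataFrame (iterating a DataFrame yields its column names).
vec_model = CountVectorizer(stop_words="english")
X_train = vec_model.fit_transform(X_train["review"]).toarray()
X_test = vec_model.transform(X_test["review"]).toarray()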

In [None]:
# Gaussian Naive Bayes Model
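
A Gaussian Naive Bayes fit on word counts could look like the sketch below. It runs on made-up toy data (all `toy_` names are hypothetical) so that it is self-contained; it is not the notebook's actual model or result. `GaussianNB` models each feature as a continuous Gaussian variable and needs a dense array, which is why `.toarray()` is called first. The notebook also imports `MultinomialNB` and `BernoulliNB`, which are commonly tried on count data as alternatives.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 1 = positive review, 0 = negative review.
toy_train_reviews = [
    "great app love it",
    "excellent game works great",
    "terrible app crashes constantly",
    "worst update error every time",
]
toy_train_labels = [1, 1, 0, 0]
toy_test_reviews = ["love this great game", "terrible error crashes"]
toy_test_labels = [1, 0]

# Fit the vocabulary on the training text only, then reuse it for the test text.
toy_vec = CountVectorizer(stop_words="english")
toy_X_train = toy_vec.fit_transform(toy_train_reviews).toarray()
toy_X_test = toy_vec.transform(toy_test_reviews).toarray()

# Train the model and score it on the held-out toy reviews.
toy_model = GaussianNB()
toy_model.fit(toy_X_train, toy_train_labels)
toy_pred = toy_model.predict(toy_X_test)
toy_accuracy = accuracy_score(toy_test_labels, toy_pred)
print(toy_pred, toy_accuracy)
```

On the project data the same pattern would use `X_train`, `y_train`, `X_test`, and `y_test` from the cells above.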