# Explore here

Use this notebook for exploration.

For example:

1. You could import the CSV generated by Python into your notebook and explore it.
2. You could connect to your database using `pandas.read_sql` from this notebook and explore it.
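
The second option can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database; the `reviews` table and its two rows are hypothetical stand-ins for a real database, and only `pandas.read_sql` comes from the text above.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real one
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE reviews (review TEXT, polarity INTEGER)')
conn.executemany(
    'INSERT INTO reviews VALUES (?, ?)',
    [('great app', 1), ('keeps crashing', 0)],
)
conn.commit()

# Run a query and load the result set into a DataFrame
df_sql = pd.read_sql('SELECT review, polarity FROM reviews', conn)
conn.close()

print(df_sql.shape)
```

For a real database, only the connection object and the query would change.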

In [64]:
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
# pickle ships with the Python standard library, so it needs no pip install

In [65]:
# Imports used throughout the notebook

import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


In [6]:
# Load the Play Store reviews dataset from CSV
df = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews_dataset.csv')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   package_name  891 non-null    object
 1   review        891 non-null    object
 2   polarity      891 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 21.0+ KB


In [8]:
df.sample(5)

Unnamed: 0,package_name,review,polarity
751,com.shirantech.kantipur,its just ok in news paper there alot of time ...,1
749,com.shirantech.kantipur,{•=•} new ui looks good but still laggy and u...,0
611,com.evernote,useful app very useful app in my opinion. not...,1
27,com.facebook.katana,doesn't work 90% of the time. doesn't update ...,0
647,com.uc.browser.en,bullshit update... after updating its totally...,0


In [9]:
df = df.drop(columns=['package_name'])

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   review    891 non-null    object
 1   polarity  891 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [11]:
df['review']

0       privacy at least put some option appear offli...
1       messenger issues ever since the last update, ...
2       profile any time my wife or anybody has more ...
3       the new features suck for those of us who don...
4       forced reload on uploading pic on replying co...
                             ...                        
886     loved it i loooooooooooooovvved it because it...
887     all time legendary game the birthday party le...
888     ads are way to heavy listen to the bad review...
889     fun works perfectly well. ads aren't as annoy...
890     they're everywhere i see angry birds everywhe...
Name: review, Length: 891, dtype: object

In [12]:
df['review'] = df['review'].str.lower()
df['review'] = df['review'].str.strip()

In [13]:
df['review']

0      privacy at least put some option appear offlin...
1      messenger issues ever since the last update, i...
2      profile any time my wife or anybody has more t...
3      the new features suck for those of us who don'...
4      forced reload on uploading pic on replying com...
                             ...                        
886    loved it i loooooooooooooovvved it because it ...
887    all time legendary game the birthday party lev...
888    ads are way to heavy listen to the bad reviews...
889    fun works perfectly well. ads aren't as annoyi...
890    they're everywhere i see angry birds everywher...
Name: review, Length: 891, dtype: object

In [14]:
X = df['review']
y = df['polarity']

In [17]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(X)
vectorizer.get_feature_names_out()

array(['000', '04', '0x', ..., 'žŕľ', 'ˇŕ', 'ˇŕľ'], dtype=object)

In [18]:
X = X.toarray()

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=13)

In [37]:
print(X_train)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [38]:
print(X.shape)
print(X_test.shape)
print(X_train.shape)

(891, 3721)
(357, 3721)
(534, 3721)


In [39]:
clf = GaussianNB()
clf.fit(X_train, y_train)

In [40]:
y_train.value_counts()

0    337
1    197
Name: polarity, dtype: int64

In [41]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

In [42]:
target_names = ['bad', 'good']
print(classification_report(y_train, y_train_pred, target_names=target_names))

              precision    recall  f1-score   support

         bad       1.00      0.99      0.99       337
        good       0.98      1.00      0.99       197

    accuracy                           0.99       534
   macro avg       0.99      0.99      0.99       534
weighted avg       0.99      0.99      0.99       534



In [43]:
print(classification_report(y_test, y_test_pred, target_names=target_names))


              precision    recall  f1-score   support

         bad       0.80      0.85      0.82       247
        good       0.61      0.54      0.57       110

    accuracy                           0.75       357
   macro avg       0.71      0.69      0.70       357
weighted avg       0.74      0.75      0.75       357



The results are not very good: the model overfits (0.99 accuracy on train versus 0.75 on test).

In [45]:
clf_multinomial = MultinomialNB()
clf_multinomial.fit(X_train, y_train)


In [46]:
y_train_pred = clf_multinomial.predict(X_train)
y_test_pred = clf_multinomial.predict(X_test)

In [47]:
print(classification_report(y_train, y_train_pred, target_names=target_names))

              precision    recall  f1-score   support

         bad       0.97      0.99      0.98       337
        good       0.97      0.94      0.96       197

    accuracy                           0.97       534
   macro avg       0.97      0.96      0.97       534
weighted avg       0.97      0.97      0.97       534



In [48]:
print(classification_report(y_test, y_test_pred, target_names=target_names))

              precision    recall  f1-score   support

         bad       0.87      0.86      0.87       247
        good       0.70      0.71      0.70       110

    accuracy                           0.82       357
   macro avg       0.78      0.79      0.78       357
weighted avg       0.82      0.82      0.82       357
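
One rough way to compare how much each model overfits is the gap between train and test accuracy. The sketch below only does that arithmetic on the accuracy values printed in the reports above; the `accuracy_gap` helper is a hypothetical name introduced for illustration.

```python
# Accuracy values copied from the classification reports in this notebook
reported = {
    'GaussianNB': {'train': 0.99, 'test': 0.75},
    'MultinomialNB': {'train': 0.97, 'test': 0.82},
}


def accuracy_gap(train_acc, test_acc):
    """Return train accuracy minus test accuracy, rounded to two decimals."""
    return round(train_acc - test_acc, 2)


gaps = {name: accuracy_gap(s['train'], s['test']) for name, s in reported.items()}
print(gaps)  # {'GaussianNB': 0.24, 'MultinomialNB': 0.15}
```

A smaller gap together with higher test accuracy suggests MultinomialNB generalizes better on this split, although both models still score noticeably lower on the test set than on the training set.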



In [50]:
Z = vectorizer.transform(['I like this app'])

In [53]:
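print(np.array(Z))  # 0-d object array wrapping the sparse matrix; use Z.toarray() for dense values
print(Z.shape)
print(Z)  # sparse format: (row, column) followed by the count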

  (0, 222)	1
  (0, 1856)	1
(1, 3721)
  (0, 222)	1
  (0, 1856)	1


In [55]:
# GaussianNB model

print(clf.predict(Z.toarray()))

[1]


In [56]:
# MultinomialNB model

print(clf_multinomial.predict(Z.toarray()))

[0]


In [61]:
pd.set_option('display.max_colwidth', None)
df.tail(10)

Unnamed: 0,review,polarity
881,"game ruined because of ads. i felt like re-downloading after haven't playing it in awhile, but i was immediately frustrated. a 20 second, awful mobile game ad popped up after the second level, and after looking at the store, i realized that you have to pay to remove ads. yep, uninstalling.",0
882,ads way to many ads can't even enjoy the game with the amount of ads i have to watch. every other level there's an ad. that's just ridiculous.,0
883,"great game, but too many ads almost not worth playing.",1
884,"fun but hard angry birds is really fun video game that people from at least 6 year olds and people that are a lot, a lot, a lot of years older than that. just one thing: it's really, really hard. but a challenge can teach children how to control their tempers. even though it can ba a little frustrating.",1
885,too many ads far more adverts than any other game i've played. i know it's free and they need the ads to make a profit but there needs to be a balance.,1
886,loved it i loooooooooooooovvved it because it is incredible awesome and it's in go power and make a new clash of clans the same thing butt better,1
887,all time legendary game the birthday party levels and short fuse levels are fantastic.especially when the pigs crash onto different chemicals is just great.suggestion to all those players who cringe about too much ads is close ur wi-fi connection and then play the game.then the ads won't trouble you.,1
888,"ads are way to heavy listen to the bad reviews. there are ads after every round, whether you pass it or fail it. sometimes there are ads before the next round starts to. you spend more time on ads than game play. i develop web apps, and honestly many people rely on ads to make a living. i can appreciate that all to well. however, these developers have went far beyond that. frankly, they are disrespectful nitwits.",0
889,"fun works perfectly well. ads aren't as annoying as you think, especially for a free game.",1
890,they're everywhere i see angry birds everywhere because i can't stop playing this game. get out my head devs! 4 đ because nothing's perfect,1


In [63]:
Z2 = vectorizer.transform(['I do not like this app'])

# GaussianNB model
print(f"Gauss: {clf.predict(Z2.toarray())}")


# MultinomialNB model
print(f"Multi: {clf_multinomial.predict(Z2.toarray())}")

Gauss: [1]
Multi: [0]
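
Each model returns the same prediction for 'I do not like this app' as it did for 'I like this app'. A likely explanation is that `stop_words='english'` removes "not" before counting, so both sentences reach the models as the same vector. The sketch below checks this with the vectorizer's analyzer, assuming scikit-learn's built-in English stop-word list; no fitted vocabulary is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same configuration as the notebook's vectorizer; build_analyzer() applies
# lowercasing, tokenization, and stop-word removal without requiring fit()
analyzer = CountVectorizer(stop_words='english').build_analyzer()

tokens_positive = analyzer('I like this app')
tokens_negative = analyzer('I do not like this app')

print(tokens_positive)  # ['like', 'app']
print(tokens_negative)  # ['like', 'app']
```

If negation matters for the task, one option to explore is passing a custom stop-word list that keeps "not", or using `ngram_range=(1, 2)` without stop-word removal; neither was tried in this notebook.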


Export the model

In [67]:
with open('../models/multinomial.pkl', 'wb') as f:
    pickle.dump(clf_multinomial, f)

In [68]:
with open('../models/multinomial.pkl', 'rb') as f:
    load_model = pickle.load(f)

In [69]:
Z2 = vectorizer.transform(['I do not like this app'])

# Loaded MultinomialNB model
print(f"Multi: {load_model.predict(Z2.toarray())}")

Multi: [0]
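
The pickled file holds only the classifier, so anyone loading it also needs the fitted `vectorizer` to turn text into features. One possible alternative, sketched below as an assumption and not what this notebook does, is to bundle both steps in a scikit-learn pipeline and pickle that single object. The toy `texts` and `labels` are hypothetical, and the round trip happens in memory to avoid assuming a file path.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the reviews dataset
texts = [
    'great app love it',
    'terrible app crashes constantly',
    'love this game',
    'awful update full of ads',
]
labels = [1, 0, 1, 0]

# The pipeline vectorizes raw text and then classifies it
pipeline = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
pipeline.fit(texts, labels)

# Serialize and restore in memory; the restored object accepts raw strings
restored = pickle.loads(pickle.dumps(pipeline))
print(restored.predict(['I do not like this app']))
```

With this layout, the loaded object can be called on raw strings directly, without a separate `vectorizer.transform` step.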
