# Explore here

It's recommended to use this notebook for exploration purposes.

For example: 

1. You could import the CSV generated by Python into your notebook and explore it.
2. You could connect to your database using `pandas.read_sql` from this notebook and explore it.

In [None]:
'''# Example reading the SQL database from here

from utils import db_connect
import pandas as pd
engine = db_connect()

dataframe = pd.read_sql("Select * from books;", engine)
print(dataframe.describe())'''

## Step 1: Load the dataset

In [99]:
# Example importing the CSV here
import pandas as pd 

dataframe = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv')
dataframe

Unnamed: 0,package_name,review,polarity
0,com.facebook.katana,privacy at least put some option appear offli...,0
1,com.facebook.katana,"messenger issues ever since the last update, ...",0
2,com.facebook.katana,profile any time my wife or anybody has more ...,0
3,com.facebook.katana,the new features suck for those of us who don...,0
4,com.facebook.katana,forced reload on uploading pic on replying co...,0
...,...,...,...
886,com.rovio.angrybirds,loved it i loooooooooooooovvved it because it...,1
887,com.rovio.angrybirds,all time legendary game the birthday party le...,1
888,com.rovio.angrybirds,ads are way to heavy listen to the bad review...,0
889,com.rovio.angrybirds,fun works perfectly well. ads aren't as annoy...,1


## Step 2: Examine the variables and their contents

In [100]:
# Drop the "package_name" column
dataframe.drop("package_name", axis=1, inplace=True)


In [101]:
dataframe

Unnamed: 0,review,polarity
0,privacy at least put some option appear offli...,0
1,"messenger issues ever since the last update, ...",0
2,profile any time my wife or anybody has more ...,0
3,the new features suck for those of us who don...,0
4,forced reload on uploading pic on replying co...,0
...,...,...
886,loved it i loooooooooooooovvved it because it...,1
887,all time legendary game the birthday party le...,1
888,ads are way to heavy listen to the bad review...,0
889,fun works perfectly well. ads aren't as annoy...,1


In [102]:
# Strip surrounding whitespace and convert the text to lowercase:
dataframe["review"] = dataframe["review"].str.strip().str.lower()
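
As a small illustration of what this normalization does, the sketch below applies the same `.str.strip().str.lower()` chain to a made-up Series. The sample strings are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample reviews, invented for illustration only
sample = pd.Series(["  Great APP  ", "Keeps Crashing\n"])

# Same chain as in the cell above: trim surrounding whitespace, then lowercase
cleaned = sample.str.strip().str.lower()
print(cleaned.tolist())
```

Only leading and trailing whitespace is removed; spaces inside the text are left untouched.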

In [103]:
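# Note: `info` is referenced here without parentheses, so the output below is the
# repr of the bound method (which includes the DataFrame itself).
# Call dataframe.info() to get the column/dtype summary instead.
dataframe.info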


<bound method DataFrame.info of                                                 review  polarity
0    privacy at least put some option appear offlin...         0
1    messenger issues ever since the last update, i...         0
2    profile any time my wife or anybody has more t...         0
3    the new features suck for those of us who don'...         0
4    forced reload on uploading pic on replying com...         0
..                                                 ...       ...
886  loved it i loooooooooooooovvved it because it ...         1
887  all time legendary game the birthday party lev...         1
888  ads are way to heavy listen to the bad reviews...         0
889  fun works perfectly well. ads aren't as annoyi...         1
890  they're everywhere i see angry birds everywher...         1

[891 rows x 2 columns]>

In [104]:
# Split the dataset into train and test sets
import pandas as pd
from sklearn.model_selection import train_test_split

X = dataframe["review"]
y = dataframe["polarity"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.head()

331    just did the latest update on viber and yet ag...
733    keeps crashing it only works well in extreme d...
382    the fail boat has arrived the 6.0 version is t...
704    superfast, just as i remember it ! opera mini ...
813    installed and immediately deleted this crap i ...
Name: review, dtype: object

In [105]:
X_test.head()


709    love/hate has bug and security issues. i tried...
439    whatsapp i use this app now that blackberry me...
840                             usefully verry  nice app
720    fonts why in the heck is this thing analysing ...
39     app doesn't work after latest upgrade the face...
Name: review, dtype: object
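
To make the effect of `test_size = 0.2` concrete, here is a minimal sketch that splits a stand-in list with the same number of rows as the dataset (891). The list `rows` is a hypothetical placeholder; only its length matters.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 891 reviews: only the number of rows matters here
rows = list(range(891))

# Same settings as the notebook: 20% held out for testing, fixed seed
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))
```

The test share is rounded up, so 891 rows should give 712 training rows and 179 test rows.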

In [106]:
from sklearn.feature_extraction.text import CountVectorizer

vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## Step 3: Build a naive Bayes model

In [107]:
# Step 1: Initialize and train the MultinomialNB model
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)


In [108]:
# Step 2: Model prediction

y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [109]:
# Step 3: View the result
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.8156424581005587
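
Accuracy is the fraction of test predictions that match the true labels. The sketch below uses made-up label arrays (hypothetical) to show that `accuracy_score` agrees with a manual count. By the same arithmetic, the score above is consistent with 146 correct predictions out of 179 test reviews, assuming the test set has 179 rows as the prediction array suggests.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, invented for illustration only
y_true_toy = np.array([0, 0, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0])

# Manual accuracy: share of positions where prediction equals the true label
manual = (y_true_toy == y_pred_toy).mean()
library = accuracy_score(y_true_toy, y_pred_toy)
print(manual, library)

# Assumed reading of the notebook's score: 146 correct out of 179 test rows
print(146 / 179)
```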

In [110]:
'''# Step 1: Initialize and train the GaussianNB model
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)'''


'# Step 1: Initialize and train the GaussianNB model\nfrom sklearn.naive_bayes import GaussianNB\n\nmodel = GaussianNB()\nmodel.fit(X_train, y_train)'

In [111]:
'''# Step 2: Model prediction

y_pred = model.predict(X_test)
y_pred'''

'# Step 2: Model prediction\n\ny_pred = model.predict(X_test)\ny_pred'

In [112]:
'''# Step 3: View the result
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)'''

'# Step 3: View the result\nfrom sklearn.metrics import accuracy_score\n\naccuracy_score(y_test, y_pred)'

In [113]:
'''# Step 1: Initialize and train the BernoulliNB model
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X_train, y_train)'''


'# Step 1: Initialize and train the BernoulliNB model\nfrom sklearn.naive_bayes import BernoulliNB\n\nmodel = BernoulliNB()\nmodel.fit(X_train, y_train)'

In [114]:
'''# Step 2: Model prediction

y_pred = model.predict(X_test)
y_pred'''

'# Step 2: Model prediction\n\ny_pred = model.predict(X_test)\ny_pred'

In [115]:
'''# Step 3: View the result
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)'''

'# Step 3: View the result\nfrom sklearn.metrics import accuracy_score\n\naccuracy_score(y_test, y_pred)'

Conclusions:
- We keep MultinomialNB, which is the model that gives the highest accuracy score, about 81.6%. We save it so that it can be reused later.
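
Rather than commenting cells in and out, the three naive Bayes variants could be compared in a single loop. The sketch below is only an illustration under stated assumptions: `X_toy` and `y_toy` are a hypothetical count matrix and labels, and the models are scored on the data they were trained on. In the notebook you would fit on `X_train` and score on `X_test` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical word-count matrix (rows = reviews, columns = words) and labels
X_toy = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0], [1, 0, 0], [0, 1, 2]])
y_toy = np.array([1, 1, 0, 0, 1, 0])

scores = {}
for model_class in (MultinomialNB, GaussianNB, BernoulliNB):
    # Fit each variant with its default hyperparameters
    clf = model_class().fit(X_toy, y_toy)
    # Scored on the training data only because this is a toy example
    scores[model_class.__name__] = accuracy_score(y_toy, clf.predict(X_toy))

print(scores)
```

Collecting the scores in one dictionary keeps every result visible side by side, which makes the final choice easier to justify.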


In [116]:
# Step 4: Save the model
from pickle import dump

# Use a context manager so the file is closed once the model is written
with open("naive_bayes_default_MultinomialNB.sav", "wb") as model_file:
    dump(model, model_file)
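
A saved model can later be restored with `pickle.load`. The following is a minimal round-trip sketch on a toy model; the data, the temporary path and the file name `toy_model.sav` are hypothetical. To score new raw text, the fitted `CountVectorizer` would have to be saved and restored as well, since the model only understands the count matrix it was trained on.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, invented for illustration only
X_toy = np.array([[2, 0], [0, 3], [1, 0], [0, 1]])
y_toy = np.array([1, 0, 1, 0])
toy_model = MultinomialNB().fit(X_toy, y_toy)

# Write the model to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.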

## Step 4: Optimize the previous model with grid search

In [117]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB


# Manually define the hyperparameters we want to tune
param_grid = {
    'alpha': [0.1, 1.0, 2.0],
    'fit_prior': [True, False]
}

# Initialize the grid search
grid = GridSearchCV(model, param_grid, scoring = "accuracy", cv = 5)
grid

In [118]:
grid.fit(X_train, y_train)

print(f"Best hyperparameters: {grid.best_params_}")

Best hyperparameters: {'alpha': 2.0, 'fit_prior': False}
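
Besides `best_params_`, a fitted `GridSearchCV` exposes the cross-validated score and an already retrained model. The sketch below uses a random toy count matrix (hypothetical data, so the selected parameters carry no meaning) with the same parameter grid as above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical random count data, invented for illustration only
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(40, 6))
y_toy = np.array([0, 1] * 20)

param_grid = {"alpha": [0.1, 1.0, 2.0], "fit_prior": [True, False]}
toy_grid = GridSearchCV(MultinomialNB(), param_grid, scoring="accuracy", cv=5)
toy_grid.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already retrained
# on all the data passed to fit, using the best parameter combination
best_model = toy_grid.best_estimator_
print(toy_grid.best_params_, toy_grid.best_score_)

# 3 alpha values x 2 fit_prior values = 6 candidate combinations
n_candidates = len(toy_grid.cv_results_["params"])
print(n_candidates)
```

Note that `best_score_` is the mean cross-validation accuracy on the training folds, so it is not directly comparable with the accuracy measured on the held-out test set.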


In [119]:
# Retrain the model with the best hyperparameters found by the grid search
model = MultinomialNB(alpha = 2.0, fit_prior = False)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy_score(y_test, y_pred)

0.8212290502793296

In [120]:
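# Save the grid-searched model
from pickle import dump

# Use a context manager so the file is closed once the model is written
with open("naive_bayes_default_MultinomialNB_gridsearch.sav", "wb") as model_file:
    dump(model, model_file)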

## Step 5: Explore other alternatives

We try a GradientBoostingClassifier to see whether it improves on the previous model.

In [121]:
import pandas as pd

# Load the dataset from the given URL
url = "https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv"
df = pd.read_csv(url)

# Separate the features (X) from the labels (y)
X = df["review"]
y = df["polarity"]


In [122]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [123]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the word-count vectorizer
vec_model = CountVectorizer(stop_words="english")

# Transform the text into word-count matrices for training and testing
X_train_vec = vec_model.fit_transform(X_train).toarray()
X_test_vec = vec_model.transform(X_test).toarray()


In [124]:
from sklearn.ensemble import GradientBoostingClassifier

# Create an instance of the Gradient Boosting Classifier
boosting_model = GradientBoostingClassifier()

# Train the model on the training set
boosting_model.fit(X_train_vec, y_train)


In [125]:
# Make predictions on the test set
y_pred = boosting_model.predict(X_test_vec)


In [126]:
from sklearn.metrics import accuracy_score

# Compute the model's accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Gradient Boosting Classifier model:", accuracy)



Accuracy of the Gradient Boosting Classifier model: 0.7597765363128491


Conclusions:
- The model does not improve, so we try XGBoost next.

In [127]:
import xgboost as xgb

# Create an instance of the XGBoost classifier
xgb_model = xgb.XGBClassifier()

# Train the model on the training set
xgb_model.fit(X_train_vec, y_train)


In [128]:
# Make predictions on the test set
y_pred = xgb_model.predict(X_test_vec)


In [129]:
from sklearn.metrics import accuracy_score

# Compute the model's accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the XGBoost model:", accuracy)


Accuracy of the XGBoost model: 0.8100558659217877


Conclusions:
- XGBoost performs better than GradientBoostingClassifier (81.01% vs. 75.98% accuracy), but not better than the model tuned with grid search. We keep the grid-searched MultinomialNB model, with an accuracy of 82.12%.
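
One possible way to make the retained model easier to reuse is to bundle the vectorizer and the classifier in a single scikit-learn pipeline, so that raw text can be passed straight to `predict`. This is a sketch under stated assumptions, not the notebook's method: the texts and labels below are hypothetical, and only the hyperparameters (`alpha=2.0`, `fit_prior=False`) come from the grid search above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical reviews and polarity labels, invented for illustration only
texts = [
    "great app love it",
    "terrible update keeps crashing",
    "love the new design",
    "crashing all the time terrible",
]
labels = [1, 0, 1, 0]

# Vectorizer and classifier chained together; hyperparameters taken from the grid search
pipe = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(alpha=2.0, fit_prior=False),
)
pipe.fit(texts, labels)

# Raw text goes in; the pipeline vectorizes it before predicting
pred = pipe.predict(["love this great app"])
print(pred)
```

Saving the whole pipeline with `pickle` would then store the vocabulary together with the model in one file.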