## Table of Contents
### Exploratory Data Analysis
* [0. Importing Libraries](#c0)
* [1. Data Loading](#c1)
* [2. Exploration and data cleaning](#c2)
    * [2.1 Understanding the features](#s21)
    * [2.2 Processing the data](#s22)
* [3. Train and Test Split](#c3)
* [4. Converting X_train into a Word Count Matrix](#c4)
### Machine Learning 
* [7.1 Bayes Model - Default Params](#s71)
* [7.2 Model Optimization](#s72)

## Exploratory Data Analysis (EDA) 

### 0. Importing Libraries <a class="anchor" id="c0"></a>

In [190]:
import pandas as pd 
import numpy as np
import datetime

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score  # used later to evaluate the models

import warnings
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings("ignore", category=FutureWarning)
pd.set_option('display.max_columns', None)

### 1. Data Loading <a class="anchor" id="c1"></a>

In [116]:
df = pd.read_csv("../data/raw/playstore_reviews.csv")
df.head(3)

| | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |


----------------------------------------------------------------

### 2. Exploration and data cleaning <a class="anchor" id="c2"></a>

#### 2.1 Understanding the features <a id="s21"></a>
- ```package_name```: Name of the mobile application (categorical)
- ```review```: Text of the user's comment about the mobile application (free text)
- ```polarity```: Class variable: 0 for a negative comment, 1 for a positive one (numeric)
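Before modeling, it can help to check how balanced the class variable is and whether any column has missing values. The following is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its rows are made up for illustration; they are not taken from the real dataset):

```python
import pandas as pd

# Hypothetical toy rows that mimic the three columns described above;
# the notebook itself loads the real data from playstore_reviews.csv.
toy_df = pd.DataFrame(
    {
        "package_name": ["com.example.app", "com.example.app", "com.example.other"],
        "review": ["great app", "crashes on start", "love the new ui"],
        "polarity": [1, 0, 1],
    }
)

# Share of each class: a strongly skewed target would make accuracy misleading.
class_share = toy_df["polarity"].value_counts(normalize=True)

# Missing values per column, to decide whether any cleaning is needed.
missing_per_column = toy_df.isna().sum()

print(class_share)
print(missing_per_column)
```

The same two calls can be run on `df` to get the corresponding figures for the actual reviews.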

In [121]:
print('Our dataframe contains {} rows and it has {} features.'.format(len(df), df.shape[1]))

Our dataframe contains 891 rows and it has 3 features.


----------------------------------------------------------------

#### 2.2 Processing the data <a id="s22"></a>

In [125]:
df.drop(columns='package_name', inplace=True)
df['review'] = df['review'].str.lower().str.strip()
df.head(3)

| | review | polarity |
|---|---|---|
| 0 | privacy at least put some option appear offlin... | 0 |
| 1 | messenger issues ever since the last update, i... | 0 |
| 2 | profile any time my wife or anybody has more t... | 0 |


----------------------------------------------------------------

### 3. Train and Test Split <a class="anchor" id="c3"></a>

In [136]:
X = df['review']
y = df['polarity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

X_train.head()

329    can't open after update the app crashes and do...
749    {•=•} new ui looks good but still laggy and un...
203    new update sucks no more dead bases. farming i...
421    messages are not sending or receiving right aw...
97     unsatisfactory version older version was more ...
Name: review, dtype: object
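As a quick sanity check on the split sizes, the sketch below repeats the same `train_test_split` call on placeholder arrays with 891 entries, the row count reported earlier. The placeholder arrays and the `X_tr`/`X_te` names are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # row count reported earlier in the notebook

# Stand-ins for the review texts and labels (hypothetical values).
placeholder_X = np.arange(n_rows)
placeholder_y = np.arange(n_rows) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    placeholder_X, placeholder_y, test_size=0.2, random_state=123
)

# With a float test_size, the test set size is rounded up.
print(len(X_tr), len(X_te))
```

With `test_size=0.2`, the test set receives 179 rows and the training set the remaining 712.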

--------------------------------------------------------

### 4. Converting X_train into a Word Count Matrix <a class="anchor" id="c4"></a>

In [139]:
vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()

X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

--------------------------------------------------------

## Machine Learning <a class="anchor" id="c7"></a>
#### 7.1 Bayes Model - Default Params <a id="s71"></a> 

In [193]:
# ********** Why did we choose this model as the best one and not the other 2? ******

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test,y_pred)

print(f'Our model with default params has an accuracy of {round(accuracy*100,4)}%')

Our model with default params has an accuracy of 83.7989%


#### 7.2 Model Optimization <a id="s72"></a> 

In [199]:
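start = datetime.datetime.now()

# Search grid for MultinomialNB. Note: force_alpha only matters when
# alpha < 1e-10, so it cannot change the results for the alpha values below.
model_params = {
    "alpha": [0.1, 0.5, 1.0, 2.0],
    "force_alpha": [True, False],
    "fit_prior": [True, False],
}

# GridSearchCV clones and fits the estimator itself, so no prior fit is needed.
model = MultinomialNB()
grid = GridSearchCV(model, model_params, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
print(f'The best hyperparameters are: {grid.best_params_}')
end = datetime.datetime.now()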

20:07
The best hyperparameters are: {'alpha': 2.0, 'fit_prior': False, 'force_alpha': True}
20:07
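
To give some intuition for what `alpha` controls, the sketch below compares the smoothed word probabilities that `MultinomialNB` learns with a manual additive-smoothing calculation. The tiny count matrix and the `toy_` names are hypothetical; each class has exactly one row so that the per-class word counts are simply the rows themselves:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: one document per class, three vocabulary words.
toy_counts = np.array([[2, 1, 0], [0, 1, 3]])
toy_labels = np.array([0, 1])
alpha = 2.0

nb = MultinomialNB(alpha=alpha)
nb.fit(toy_counts, toy_labels)
sklearn_probs = np.exp(nb.feature_log_prob_)

# Additive smoothing by hand: (word count + alpha) / (class total + alpha * n_words).
n_words = toy_counts.shape[1]
class_totals = toy_counts.sum(axis=1, keepdims=True)
manual_probs = (toy_counts + alpha) / (class_totals + alpha * n_words)

print(sklearn_probs)
print(manual_probs)
```

A larger `alpha` pulls every word probability toward a uniform value, so words never seen in a class still get a non-zero probability.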


In [223]:
model_grid = MultinomialNB(alpha=2.0, fit_prior=False, force_alpha=True)
model_grid.fit(X_train, y_train)
y_pred = model_grid.predict(X_test)
model_accuracy = round(accuracy_score(y_test, y_pred),4)
print(f'The model accuracy with the hyperparameters is: {round(model_accuracy*100,2)}%, an increase of {round((model_accuracy-accuracy)*100,2)} percentage points vs the default model')

The model accuracy with the hyperparameters is: 84.92%, an increase of 1.12 percentage points vs the default model
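
As an alternative to typing the best hyperparameters by hand, `GridSearchCV` already refits a model with them on the full training data (its default `refit=True`) and exposes it as `best_estimator_`. The sketch below illustrates this on a hypothetical random count matrix; `toy_X`, `toy_y` and the reduced grid are assumptions for illustration, not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical data: 40 documents, 6 vocabulary words, balanced labels.
toy_y = np.arange(40) % 2
toy_X = rng.integers(0, 5, size=(40, 6))
toy_X[:, 0] += 3 * toy_y  # make the first word more frequent in class 1

toy_grid = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.1, 0.5, 1.0, 2.0], "fit_prior": [True, False]},
    scoring="accuracy",
    cv=5,
)
toy_grid.fit(toy_X, toy_y)

# Model already refitted on all of toy_X with the best parameters.
best_model = toy_grid.best_estimator_

# Refitting by hand with the same parameters gives the same predictions.
manual_model = MultinomialNB(**toy_grid.best_params_).fit(toy_X, toy_y)
same_predictions = np.array_equal(
    best_model.predict(toy_X), manual_model.predict(toy_X)
)
print(toy_grid.best_params_, same_predictions)
```

Using `best_estimator_` avoids copy errors when the grid or the data changes and the best parameters shift.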
