### Machine Learning

- Maria Zazpe Quintana

- Alba Rodríguez Berenguel

Building on the EDA, we now apply the final transformations to the dataset before modelling: we create the preprocessors for the models and split the data. The steps are as follows:

1. Loading the data.
2. Variable transformation.
3. Missing values.
4. Building the preprocessor.
5. Variable selection.
6. Splitting and exporting the data.

In [1]:
# Libraries.
import pandas as pd
import numpy as np
import regex as re
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pickle
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

## 1. Loading the data.

In [2]:
# Load data we prepared in EDA analysis.
reviews_df = pd.read_csv('../data/processed/data_reviews.csv')

In [3]:
# The first thing we do is remove the variable state, since we only wanted it for geospatial visualizations.
# We are not going to consider it for the model.
reviews_df = reviews_df.drop(columns=['state'])

reviews_df.head()

Unnamed: 0,delivery,outdoor_seating,credit_cards,bike_parking,price_range,take_out,wifi,alcohol,caters,wheelchair_accessible,...,attire,reservations,table_service,good_for_groups,tv,noise_level,useful,funny,cool,stars
0,False,False,False,True,1.0,True,u'free',u'none',True,,...,,,,,,,3,1,2,1
1,True,True,True,True,1.0,True,u'no',u'none',True,,...,u'casual',False,,True,True,u'quiet',12,4,5,1
2,True,True,True,True,2.0,True,u'no',u'full_bar',True,True,...,u'casual',False,True,False,False,u'average',1,0,1,1
3,False,False,False,True,1.0,True,u'no','none',False,,...,'casual',False,,True,False,u'average',1,0,0,1
4,True,True,True,True,2.0,True,u'no',u'full_bar',False,True,...,u'casual',False,True,True,True,u'average',5,1,2,1


## 2. Variable transformation.

In the EDA we saw that most of our categorical variables take the values True, False and None. We will encode True as 1 and False as 0 so that they cause no problems downstream; None will become NA.

Some of the categorical variables also contain these values in duplicate, because some entries are of type *string* and others of type *bool*. In that case we first merge them and then encode them, so the result is the same.

Let us go through each case.

In [4]:
#We check the types
reviews_df.dtypes

delivery                 object
outdoor_seating          object
credit_cards             object
bike_parking             object
price_range              object
take_out                 object
wifi                     object
alcohol                  object
caters                   object
wheelchair_accessible    object
good_for_kids            object
attire                   object
reservations             object
table_service            object
good_for_groups          object
tv                       object
noise_level              object
useful                    int64
funny                     int64
cool                      int64
stars                     int64
dtype: object

In [5]:
# We see all the unique values for categorical variables
for i in reviews_df.select_dtypes(include=['object']):
    print(i, reviews_df[i].unique())

delivery ['False' 'True' 'None' nan]
outdoor_seating ['False' 'True' 'None' nan]
credit_cards ['False' 'True' nan 'None' True False]
bike_parking ['True' 'False' nan 'None' True False]
price_range ['1.0' '2.0' '3.0' '4.0' nan '1' '2' '3' '4' 'None' 2.0 4.0 1.0 3.0]
take_out ['True' 'None' nan 'False']
wifi ["u'free'" "u'no'" "'free'" "'no'" nan "u'paid'" "'paid'" 'None']
alcohol ["u'none'" "u'full_bar'" "'none'" "'full_bar'" "u'beer_and_wine'" nan
 "'beer_and_wine'" 'None']
caters ['True' 'False' nan 'None']
wheelchair_accessible [nan True False 'True' 'False' 'None']
good_for_kids [nan 'True' 'False' 'None' True False]
attire [nan "u'casual'" "'casual'" "u'dressy'" "'dressy'" "u'formal'" 'None'
 "'formal'"]
reservations [nan 'False' 'True' 'None']
table_service [nan 'True' 'False' 'None' False True]
good_for_groups [nan 'True' 'False' 'None' True False]
tv [nan 'True' 'False' 'None' True False]
noise_level [nan "u'quiet'" "u'average'" "u'loud'" "'quiet'" "'average'"
 "u'very_loud'" "'

First we deal with the variables that have duplicated categories. To fix them, we cast every value in the column to string, which merges the two spellings of True and False. We then replace *True* with 1 and *False* with 0, and turn *None* and *nan* into missing values.

In [6]:
# We create a list with the variables that we want to transform in this way.
lista1 = ['credit_cards', 'bike_parking', 'caters', 'wheelchair_accessible', 'good_for_kids', 'table_service', 
          'good_for_groups', 'tv']

# First cast each variable to string, then map 'True' to 1, 'False' to 0, and 'nan'/'None' to NaN.
for i in lista1:
    reviews_df[i] = reviews_df[i].astype(str)
    reviews_df[i] = reviews_df[i].replace({'True': 1, 'False': 0, 'nan': np.nan, 'None': np.nan})
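
To see why the cast to string matters, here is a minimal sketch on a hypothetical toy column (`toy` is not part of the dataset) that mixes real booleans with their string spellings. It uses `Series.map` instead of the `replace` call above, which is an assumption made only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mixing bool and str values, like credit_cards.
toy = pd.Series([True, 'True', False, 'False', 'None', np.nan], dtype=object)

# Casting to str collapses True and 'True' into the same label.
as_text = toy.astype(str)

# Map the known labels to 1, 0 or NaN; anything not in the mapping also becomes NaN.
mapping = {'True': 1.0, 'False': 0.0, 'None': np.nan, 'nan': np.nan}
encoded = as_text.map(mapping)

print(encoded.tolist())
```

Without the cast, `True` and `'True'` would be treated as two different values, which is exactly the duplication seen in the `unique()` output.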

Second, we transform the variables that also take the values True and False but without duplicates. The process is the same, except that we skip the cast to string.

In [7]:
# We create a list with the variables that we want to transform in this other way.
lista2 = ['delivery', 'outdoor_seating', 'take_out', 'reservations']

# Map 'True' to 1, 'False' to 0, and 'nan'/'None' to NaN.
for i in lista2:
    reviews_df[i] = reviews_df[i].replace({'True': 1, 'False': 0, 'nan': np.nan, 'None': np.nan})

Next come the variables *wifi, alcohol, attire and noise_level*, which also have duplicated categories, this time because some values are written inconsistently, for example *"'casual'"* and *"u'casual'"* in attire.

The same pattern appears in every variable we add to the list, so we need to strip the *u* prefix and the single quotes so that the categories merge. We will also turn None into null values.

In [8]:
# Finally, we create a list with the variables that we want to transform in this other way.
lista_u = ['wifi', 'alcohol', 'attire', 'noise_level']

# Strip the u' prefix and the single quotes, then convert 'None' and 'none' to NaN.
for i in lista_u:
    reviews_df[i] = reviews_df[i].str.replace("u'", "'")
    reviews_df[i] = reviews_df[i].str.replace("'", "")
    reviews_df[i] = reviews_df[i].replace("None", np.nan)
    reviews_df[i] = reviews_df[i].replace("none", np.nan)
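
The same clean-up can be sketched on a hypothetical toy column. This version is an alternative under stated assumptions, not the cell above: it uses a single regular expression to drop the optional `u` and the surrounding quotes, and it only treats the literal string `'None'` as missing (the extra `'none'` step is left out).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with the same quoting patterns as wifi or attire.
toy = pd.Series(["u'free'", "'free'", "u'no'", 'None', np.nan], dtype=object)

# Drop an optional leading u plus the surrounding single quotes in one pass.
cleaned = toy.str.replace(r"^u?'(.*)'$", r"\1", regex=True)

# Treat the literal string 'None' as missing.
cleaned = cleaned.mask(cleaned == 'None')

print(cleaned.tolist())
```

Anchoring the pattern with `^` and `$` means only a prefix and the enclosing quotes are removed, so a `u` inside a category name is never touched.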

Finally, the variable *price_range* holds numbers in different formats: some are *string*, others *float*. We convert them all to float.

In [9]:
# Convert this variable to a single numeric (float) type.
reviews_df['price_range'] = reviews_df['price_range'].replace({'1.0': 1, '2.0': 2, '3.0': 3, '4.0': 4, 1.0: 1, 2.0: 2, 
                                                               3.0: 3, 4.0: 4, '1': 1, '2': 2, '3': 3,
                                                               '4': 4, 'None':np.nan})
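
As an alternative to listing every spelling by hand, a conversion like this can be sketched with `pd.to_numeric`. The toy column below is hypothetical, and this is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with the same mix of formats as price_range.
toy = pd.Series(['1.0', '2', 3.0, 'None', np.nan, 4], dtype=object)

# errors='coerce' turns anything that cannot be parsed (here 'None') into NaN.
as_float = pd.to_numeric(toy, errors='coerce')

print(as_float.tolist())
print(as_float.dtype)
```

The trade-off is that `errors='coerce'` silently turns any unexpected value into NaN, whereas the explicit dictionary leaves unknown spellings visible.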

In [10]:
# Look again at the unique values of each variable.
for i in reviews_df.select_dtypes(include=['object', 'float']):
    print(i, reviews_df[i].unique())

delivery [ 0.  1. nan]
outdoor_seating [ 0.  1. nan]
credit_cards [ 0.  1. nan]
bike_parking [ 1.  0. nan]
price_range [ 1.  2.  3.  4. nan]
take_out [ 1. nan  0.]
wifi ['free' 'no' nan 'paid']
alcohol [nan 'full_bar' 'beer_and_wine']
caters [ 1.  0. nan]
wheelchair_accessible [nan  1.  0.]
good_for_kids [nan  1.  0.]
attire [nan 'casual' 'dressy' 'formal']
reservations [nan  0.  1.]
table_service [nan  1.  0.]
good_for_groups [nan  1.  0.]
tv [nan  1.  0.]
noise_level [nan 'quiet' 'average' 'loud' 'very_loud']


## 3. Missing values

In this section we fill in some of the missing values. The categorical variables we encoded as 0 and 1 already had missing values, in addition to the ones we created ourselves by replacing *None* with *NaN*.

For these variables we assume that a missing value means the restaurant does not offer that feature, since that is how most review websites work.

In [11]:
# List of the categorical variables encoded as 0 and 1.
# Note: price_range (values 1 to 4) is also in this list, so its missing values become 0 as well.
lista_na = ['delivery', 'outdoor_seating', 'credit_cards', 'bike_parking', 'price_range', 'take_out', 'caters', 
            'wheelchair_accessible', 'good_for_kids', 'reservations', 'table_service', 'good_for_groups', 'tv']

# Fill all the NaN values with 0.
reviews_df[lista_na] = reviews_df[lista_na].fillna(0)
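
A quick way to confirm that only the intended columns were filled is to count the remaining missing values. The sketch below uses a hypothetical two-column frame (`toy`, `flag_cols`) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one 0/1 flag and one multi-category column.
toy = pd.DataFrame({
    'delivery': [1.0, np.nan, 0.0],
    'wifi': ['free', np.nan, 'no'],
})

# Fill only the flag column; wifi keeps its missing value for the pipeline.
flag_cols = ['delivery']
toy[flag_cols] = toy[flag_cols].fillna(0)

# Count the missing values left in each column.
print(toy.isna().sum())
```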

## 4. Building the preprocessor - Pipelines.

In this section we build the preprocessor that we will apply to the models.

In [12]:
# We define the categorical and numerical variables.
cat_var = ['delivery', 'outdoor_seating', 'credit_cards', 'bike_parking', 'price_range', 'take_out', 'wifi', 
           'alcohol', 'caters', 'wheelchair_accessible', 'good_for_kids', 'attire', 'reservations', 'table_service', 
           'good_for_groups','tv', 'noise_level']

num_var = ['useful', 'funny', 'cool']

For the numerical variables, the preprocessor scales them with StandardScaler.

In [13]:
# We make the pipeline to transform the numerical variables.
num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

The categorical variables go through OneHotEncoder. In addition, for those whose NAs we did not fill in, that is, those not encoded as 0 and 1 but holding named categories, such as wifi or alcohol, an extra category (Unknown) is created for the missing values.

In [14]:
# We make the pipeline to transform the categorical variables.
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [15]:
# We make the preprocessor, where we define the transformers and the variables that we want to transform.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, cat_var),
        ('num', num_transformer, num_var)
    ]
)
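
To illustrate what such a preprocessor produces, here is a minimal sketch fitted on a hypothetical four-row frame. The names `toy`, `cat_pipe`, `num_pipe` and `toy_preprocessor` are made up for the example; the real preprocessor is only fitted later, on the training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column with a gap, one numeric column.
toy = pd.DataFrame({
    'wifi': pd.Series(['free', np.nan, 'no', 'free'], dtype=object),
    'useful': [0, 3, 1, 12],
})

cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
num_pipe = Pipeline(steps=[('scaler', StandardScaler())])

toy_preprocessor = ColumnTransformer(transformers=[
    ('cat', cat_pipe, ['wifi']),
    ('num', num_pipe, ['useful'])])

out = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density.
if hasattr(out, 'toarray'):
    out = out.toarray()

# Categories learned for wifi, including the imputed 'Unknown' level.
onehot = toy_preprocessor.named_transformers_['cat'].named_steps['onehot']
wifi_levels = list(onehot.categories_[0])
print(wifi_levels)
print(out.shape)
```

The missing wifi value ends up in its own `Unknown` column, and the numeric column contributes one scaled column.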

In [16]:
# We save the preprocessor.
with open('../models/preprocessor.pickle', 'wb') as f:
    pickle.dump(preprocessor, f)
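
Loading the object back is the mirror image of saving it. The sketch below round-trips a hypothetical stand-in pipeline through a temporary directory; the path and the `toy_preprocessor` name are assumptions for illustration only.

```python
import os
import pickle
import tempfile

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real preprocessor.
toy_preprocessor = Pipeline(steps=[('scaler', StandardScaler())])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessor.pickle')  # hypothetical path

    with open(path, 'wb') as f:
        pickle.dump(toy_preprocessor, f)

    # Later, for example in a modelling notebook, load it back.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(type(restored).__name__)
```

Note that the preprocessor saved in this notebook has not been fitted yet, so whoever loads it still has to fit it on the training data.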

## 5. Variable selection.

We will not apply any variable selection method to our dataset. We have 20 variables in total, not counting the target, which we do not consider enough to justify selecting some and dropping others.
In addition, when we reviewed the correlations, the variables were not strongly related, except for two that we decided to keep anyway, so there is little reason to drop variables unless they carry very similar information.
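
For reference, a correlation check of this kind could be sketched as follows. The data is synthetic and the 0.8 threshold is an assumed cut-off; neither comes from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'cool' is built to track 'useful' closely.
rng = np.random.default_rng(0)
useful = rng.poisson(2.0, size=500)
toy = pd.DataFrame({
    'useful': useful,
    'funny': rng.poisson(1.0, size=500),
    'cool': useful + rng.integers(0, 2, size=500),
})

corr = toy.corr().abs()

# Keep the upper triangle so each pair appears once, then filter by a threshold.
threshold = 0.8  # assumed cut-off, not taken from the original analysis
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```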

## 6. Splitting and exporting the data.

We decided to split the data into train, validation and test sets in order to speed up the modelling process.

We will therefore train the models with train and validation and, once the model is chosen, make the predictions on the test set. Both the validation and the test splits use a 20% size: test is 20% of the full dataset, and validation is 20% of the remaining training data.

In [17]:
# We split the dataset in train and test. We put stratify because the data is unbalanced and select the size of the test
# of 20%.
X_train, X_test, y_train, y_test = train_test_split(reviews_df.drop('stars',axis=1), 
                                                   reviews_df['stars'], 
                                                   stratify=reviews_df['stars'], 
                                                   test_size=0.2, random_state=12345)


In [18]:
X_train

Unnamed: 0,delivery,outdoor_seating,credit_cards,bike_parking,price_range,take_out,wifi,alcohol,caters,wheelchair_accessible,good_for_kids,attire,reservations,table_service,good_for_groups,tv,noise_level,useful,funny,cool
557874,1.0,1.0,1.0,1.0,1.0,1.0,no,,1.0,0.0,1.0,casual,0.0,0.0,1.0,1.0,average,3,1,1
572446,0.0,1.0,1.0,1.0,2.0,0.0,no,full_bar,0.0,1.0,0.0,casual,0.0,0.0,1.0,1.0,average,3,1,2
579337,1.0,1.0,1.0,1.0,3.0,1.0,free,full_bar,1.0,1.0,0.0,dressy,1.0,1.0,1.0,1.0,average,0,0,0
401606,1.0,0.0,1.0,1.0,1.0,1.0,free,,0.0,0.0,1.0,casual,0.0,0.0,1.0,1.0,,0,1,0
436523,1.0,1.0,1.0,1.0,3.0,1.0,free,full_bar,0.0,0.0,0.0,dressy,1.0,0.0,1.0,1.0,quiet,14,4,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
788816,1.0,1.0,1.0,1.0,2.0,1.0,no,full_bar,1.0,1.0,1.0,casual,1.0,1.0,1.0,1.0,average,2,0,1
731289,0.0,0.0,1.0,0.0,2.0,1.0,free,,0.0,0.0,1.0,casual,1.0,1.0,1.0,1.0,average,9,4,4
481209,0.0,0.0,1.0,1.0,2.0,1.0,no,,0.0,0.0,1.0,casual,0.0,1.0,1.0,1.0,average,0,0,0
84102,1.0,1.0,1.0,1.0,2.0,1.0,free,full_bar,1.0,1.0,1.0,casual,1.0,1.0,1.0,1.0,average,0,0,0


In [19]:
pd.concat([X_train, y_train], axis=1).describe()

Unnamed: 0,delivery,outdoor_seating,credit_cards,bike_parking,price_range,take_out,caters,wheelchair_accessible,good_for_kids,reservations,table_service,good_for_groups,tv,useful,funny,cool,stars
count,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0,807663.0
mean,0.601165,0.562709,0.95057,0.778559,1.835792,0.862553,0.489185,0.45493,0.722388,0.394188,0.456387,0.829617,0.669821,2.048786,0.714255,1.283959,0.720631
std,0.489659,0.496052,0.216765,0.415217,0.595647,0.344319,0.499883,0.497965,0.447821,0.488676,0.498095,0.375969,0.470278,4.431386,2.73995,3.803669,0.448689
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,399.0,370.0,399.0,1.0


In [20]:
pd.concat([X_test, y_test], axis=1).describe()

Unnamed: 0,delivery,outdoor_seating,credit_cards,bike_parking,price_range,take_out,caters,wheelchair_accessible,good_for_kids,reservations,table_service,good_for_groups,tv,useful,funny,cool,stars
count,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0,201916.0
mean,0.602741,0.562917,0.950702,0.780889,1.836605,0.863413,0.490343,0.457512,0.722498,0.395145,0.457631,0.830078,0.669075,2.051531,0.709899,1.283474,0.720631
std,0.489332,0.496027,0.216489,0.413645,0.595966,0.343411,0.499908,0.498193,0.447767,0.488883,0.498203,0.375565,0.470547,4.497317,2.495978,3.779965,0.44869
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,392.0,357.0,359.0,1.0


The summary statistics of train and test are very similar, so the split looks sound.

In [21]:
#We split the train data in train and validation.
X_train_val, X_val, y_train_val, y_val = train_test_split(X_train, y_train,
                                                   stratify= y_train, 
                                                   test_size=0.2, random_state=12345)
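
Because the second split is taken from the training portion, the effective shares of the full dataset are roughly 64% train, 16% validation and 20% test. A minimal sketch on hypothetical synthetic data (1000 rows, 72% positive class, names made up for the example) shows both the sizes and the preserved class balance:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 72% positive class.
X = pd.DataFrame({'feature': np.arange(1000)})
y = pd.Series([1] * 720 + [0] * 280, name='stars')

# First split: 20% of all rows go to test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=12345)

# Second split: 20% of the remaining 80% go to validation.
X_fit, X_va, y_fit, y_va = train_test_split(
    X_tr, y_tr, stratify=y_tr, test_size=0.2, random_state=12345)

# Share of the full data and positive-class rate for each part.
for name, part in [('train', y_fit), ('validation', y_va), ('test', y_te)]:
    print(name, len(part) / len(y), round(part.mean(), 3))
```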

In [22]:
pd.concat([X_train_val, y_train_val], axis=1).describe()

Unnamed: 0,delivery,outdoor_seating,credit_cards,bike_parking,price_range,take_out,caters,wheelchair_accessible,good_for_kids,reservations,table_service,good_for_groups,tv,useful,funny,cool,stars
count,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0,646130.0
mean,0.600695,0.562789,0.950655,0.778774,1.836276,0.862326,0.489039,0.455065,0.72215,0.394207,0.456702,0.829429,0.669344,2.051764,0.713539,1.286863,0.720631
std,0.489756,0.496042,0.216587,0.415073,0.595683,0.344557,0.49988,0.497977,0.447939,0.48868,0.498122,0.376134,0.47045,4.477028,2.659072,3.856057,0.44869
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,360.0,370.0,360.0,1.0


In [23]:
pd.concat([X_val, y_val], axis=1).describe()

Unnamed: 0,delivery,outdoor_seating,credit_cards,bike_parking,price_range,take_out,caters,wheelchair_accessible,good_for_kids,reservations,table_service,good_for_groups,tv,useful,funny,cool,stars
count,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0,161533.0
mean,0.603047,0.562387,0.950227,0.777699,1.833854,0.863458,0.48977,0.45439,0.723338,0.394111,0.455127,0.830369,0.671733,2.036878,0.717117,1.272341,0.720633
std,0.489268,0.496094,0.217477,0.415794,0.595502,0.343364,0.499897,0.497917,0.447349,0.488661,0.497984,0.375309,0.469584,4.243909,3.042038,3.586459,0.44869
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,399.0,369.0,399.0,1.0


The train and validation data are also suitable for the models: their summary statistics are similar to each other and closely match those of train and test.

In [24]:
# Save the train, test and validation datasets.
X_train.to_csv("../data/processed/X_train.csv")
y_train.to_csv("../data/processed/y_train.csv")

X_test.to_csv("../data/processed/X_test.csv")
y_test.to_csv("../data/processed/y_test.csv")

X_train_val.to_csv("../data/processed/X_train_val.csv")
y_train_val.to_csv("../data/processed/y_train_val.csv")

X_val.to_csv("../data/processed/X_val.csv")
y_val.to_csv("../data/processed/y_val.csv")
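
Since `to_csv` writes the row index as the first column by default, reading these files back needs `index_col=0` to avoid an extra unnamed column. A minimal sketch of the round trip, using hypothetical toy data and temporary file names:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy split with a shuffled index, as train_test_split produces.
X_toy = pd.DataFrame({'useful': [3, 0, 14]}, index=[557874, 572446, 436523])
y_toy = pd.Series([1, 0, 1], index=X_toy.index, name='stars')

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, 'X_toy.csv')  # hypothetical file names
    y_path = os.path.join(tmp_dir, 'y_toy.csv')
    X_toy.to_csv(x_path)
    y_toy.to_csv(y_path)

    # index_col=0 restores the saved index instead of adding an extra column.
    X_back = pd.read_csv(x_path, index_col=0)
    # squeeze turns the one-column frame back into a Series.
    y_back = pd.read_csv(y_path, index_col=0).squeeze('columns')

print(X_back.index.equals(y_back.index))
```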