# Intro to ML

Today's practical is a dry run of how to start an analytics / Machine Learning project from scratch.

As discussed in previous classes, the first step, even before analysing the dataset, is to decide on the strategy to follow, that is, how to turn a business problem into a data problem.

Our client is a major Mallorcan hotel chain (surprise, surprise). They are not quite sure what they want; they just know that everyone is using AI these days and they do not want to be left behind. After talking with them, we have agreed to put together a proposal to help them understand what value they could get out of data analysis / modelling. For now, we know that they are worried about the reputation of their hotels and that they want to know what drives a guest to leave a good review.

For this purpose, we have a dataset of hotel reviews in Europe. The dataset is in the `data` folder, together with a description of the available fields.

## Team work

#### 1. Review the `data_format.txt` file to see which variables are available. Based on those variables, pose at least 3 questions that help you understand the dataset, for example:

> How does a hotel's average score change over time?

1. We have a time variable, `review_date`, available. Think about how we could transform it so that it adds value to the model. Bear in mind that seasonality is a very important factor in tourism.

1. Of all the variables we have, 3 are text (2 free text and 1 tags). Think about how you are going to handle them. (Note: quite often we will not know in advance how to transform a given data type. Using Google is allowed.)

1. The numeric variables include Lat/Long. How could we use them? What kind of new features could be derived from them that would give us information?

1. If we want to know what drives a positive/negative review, how could we frame the problem as supervised ML? Do we prefer regression or classification?
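One common way to expose seasonality from a date column is to encode the month as a point on a circle. The sketch below assumes dates in `%m/%d/%Y` format; the helper name `add_month_cycle` and the output columns `month_sin` / `month_cos` are made up for illustration.

```python
import numpy as np
import pandas as pd

def add_month_cycle(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Return a copy of df with the month encoded as a point on a circle."""
    out = df.copy()
    # Assumed date format: month/day/year
    month = pd.to_datetime(out[date_col], format="%m/%d/%Y").dt.month
    # January maps to angle 0, July to pi, and December sits next to January
    angle = 2 * np.pi * (month - 1) / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out

# Made-up sample dates
sample = pd.DataFrame({"review_date": ["1/15/2017", "7/31/2017", "12/24/2016"]})
encoded = add_month_cycle(sample, "review_date")
print(encoded)
```

With this encoding December and January end up close together, which a plain month number (12 versus 1) would not capture.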



## Deliverable

**Now that we have a rough initial idea of how to handle the dataset...**

NOTE: the goal is to show that we know how to use sklearn pipelines. Let's not go overboard trying to deliver a complete piece of work.

1. Load the basic libraries for data analysis
1. Do a non-exhaustive cleaning of the data
1. Split the data into train and test. WATCH OUT: we do not want to leak information! Think carefully about how to do the split...
1. Propose a first approach using a linear model, and draw the architecture of the transformation + modelling `pipeline`, like the one shown in the W5 practical class
1. Train the model and review which variables/words turned out to be most important.
1. Give the client one key message about the kind of things guests care about most: `If you don't want a bad review, don't serve cold food`

## Answer

This notebook applies the conclusions and lessons from the
[feature engineering](10_hr_features_engineering.ipynb) notebook,
which was submitted together with this one.

Some of the comments therefore refer to what has already been reported in
that other *Jupyter* notebook.

### 1. Load the basic libraries for data analysis

#### Minimal libraries for loading and basic analysis of the data

In [1]:
import numpy as np
import pandas as pd

### 2. Do a non-exhaustive cleaning of the data

#### Loading the data

In [2]:
data_file_list = [
    'data/hotel_reviews_dataset_part_1.csv',
    'data/hotel_reviews_dataset_part_2.csv', 
    'data/hotel_reviews_dataset_part_3.csv',
    'data/hotel_reviews_dataset_part_4.csv']

df = pd.DataFrame()

for file in data_file_list:
    print(f"Loading {file}")
    df = pd.concat([df, pd.read_csv(file)], ignore_index = True)

Loading data/hotel_reviews_dataset_part_1.csv
Loading data/hotel_reviews_dataset_part_2.csv
Loading data/hotel_reviews_dataset_part_3.csv
Loading data/hotel_reviews_dataset_part_4.csv
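Calling `pd.concat` inside the loop copies the accumulated frame on every iteration. A possible alternative reads every part first and concatenates once. The sketch uses small in-memory CSVs standing in for the real files; the `parts` list and its contents are made up for illustration.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs stand in for the real part files (illustrative data only)
parts = [
    io.StringIO("Hotel_Name,Reviewer_Score\nHotel A,7.5\nHotel B,9.0\n"),
    io.StringIO("Hotel_Name,Reviewer_Score\nHotel C,8.1\n"),
]

# Read each part, then concatenate once with a fresh 0..n-1 index
frames = [pd.read_csv(part) for part in parts]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```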


In [3]:
df.head(3)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
2,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,7/31/2017,7.7,Hotel Arena,Australia,Rooms are nice but for elderly a bit difficul...,42,1403,Location was good and staff were ok It is cut...,21,9,7.1,"[' Leisure trip ', ' Family with young childre...",3 days,52.360576,4.915968


#### Basic statistical analysis

In [4]:
df.describe()

Unnamed: 0,Additional_Number_of_Scoring,Average_Score,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,lat,lng
count,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,512470.0,512470.0
mean,498.081836,8.397487,18.53945,2743.743944,17.776458,7.166001,8.395077,49.442439,2.823803
std,500.538467,0.548048,29.690831,2317.464868,21.804185,11.040228,1.637856,3.466325,4.579425
min,1.0,5.2,0.0,43.0,0.0,1.0,2.5,41.328376,-0.369758
25%,169.0,8.1,2.0,1161.0,5.0,1.0,7.5,48.214662,-0.143372
50%,341.0,8.4,9.0,2134.0,11.0,3.0,8.8,51.499981,0.010607
75%,660.0,8.8,23.0,3613.0,22.0,8.0,9.6,51.516288,4.834443
max,2682.0,9.8,408.0,16670.0,395.0,355.0,10.0,52.400181,16.429233


It is striking that latitude and longitude span such a narrow range.
But, as we already know from the work in the previous notebook,
this is because the hotels are located in only 6 different cities.

#### Handling *nan* values

As a reminder of what we saw in the feature engineering notebook,
we check (for the second time) whether any column contains *nan* values ...

In [5]:
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   Hotel_Address                               515738 non-null  object 
 1   Additional_Number_of_Scoring                515738 non-null  int64  
 2   Review_Date                                 515738 non-null  object 
 3   Average_Score                               515738 non-null  float64
 4   Hotel_Name                                  515738 non-null  object 
 5   Reviewer_Nationality                        515738 non-null  object 
 6   Negative_Review                             515738 non-null  object 
 7   Review_Total_Negative_Word_Counts           515738 non-null  int64  
 8   Total_Number_of_Reviews                     515738 non-null  int64  
 9   Positive_Review                             515738 non-null  object 
 

Regarding **the null values** in the geographic coordinates
(fields `lat` and `lng`) of some records, we could choose between:
- completing our *data set* from external sources, filling in the nulls by inferring approximate coordinates from the country and city or from the full address
- dropping the records without coordinates

> As already seen in the [feature engineering](10_hr_features_engineering.ipynb#Valors_nuls) notebook, fewer than 0.7% of the records are affected.
>
> Since so few records are affected, **we choose to drop the records with nulls in the `lat` and `lng` features**: the effort needed to complete them is most likely not worth it and, besides, part of the information that latitude would provide is already captured by the new `hotel_country` feature seen in the feature engineering notebook.
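
A minimal sketch of the removal described above, on a made-up frame (the values and the names `toy` and `clean` are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Made-up rows: one of the four has no coordinates
toy = pd.DataFrame({
    "Hotel_Name": ["A", "B", "C", "D"],
    "lat": [52.36, np.nan, 51.50, 48.21],
    "lng": [4.92, np.nan, -0.13, 16.37],
})

# Share of rows with at least one missing coordinate
missing_share = toy[["lat", "lng"]].isna().any(axis=1).mean()

# Drop only the rows where lat or lng is missing
clean = toy.dropna(subset=["lat", "lng"]).reset_index(drop=True)
print(f"{missing_share:.1%} of rows dropped, {len(clean)} rows kept")
```

On the real data, the counts reported by `df.describe()` (512470 non-null coordinates out of 515738 rows) correspond to about 0.63% of the records.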

#### Study of the predictor features and creation of new ones

As noted at the start of the answer,
to keep this *notebook* compact,
the study of the features was done in
the [10_hr_features_engineering.ipynb](10_hr_features_engineering.ipynb) notebook,
which is provided together with this one.

Only the following summary table is included here, as a reminder and for reference:

##### Table 1: summary of the feature encoding

| Column        | Feature               | Explanation |
| --- | --- | --- |
| Hotel_Address | hotel_id              | We create a unique hash value for each establishment. We will use it to split the data set consistently |
| Same          | h_country1..5      | We extract the country name from `Hotel_Address` and apply *OHE*. We expect it to be moderately explanatory |
| Average_Score | average_score         | The scaled value of the column. We expect it to be highly explanatory |
| Hotel_Name    | Contributes to hotel_id  | Although the hotel name could influence a small percentage of guests, we do not think it is worth using, except to generate the `hotel_id` hash |
| Total_Number_of_Reviews | hotel_tnreviews  | Holds the scaled value of the column. We think a large number of reviews may indicate that the establishment needs to improve |
| Additional_Number_of_Scoring | Not used | We think it will be much less explanatory than `hotel_tnreviews`, so we ignore it. |
| lat      | Not used | Part of its information is captured by the country in the address (`h_country1..5`) |
| lng      | Not used | Same |
| Review_Date | rd_2q..4q    | These *dummy* variables encode the quarter of the year in which the review was recorded. |
| Reviewer_Nationality | rvr_country0..9 | We generate these 10 *dummy* variables by applying *OHE* to the 10 most frequent countries, hoping that they carry some information about how each nationality thinks and rates. |
| Negative_Review |  |  |
| Review_Total_Negative_Word_Counts | rv_negwcount | A longer negative review is probably inversely related to the score |
| Positive_Review |  |  |
| Review_Total_Positive_Word_Counts | rv_poswcount | A longer positive review may (or may not) be directly related to a higher score |
| Reviewer_Score | y | The vector of values to predict |
| Total_Number_of_Reviews_Reviewer_Has_Given | rvr_tnreviews | We think that experienced reviewers will write longer reviews when listing faults or praise, and that the score will often be proportional to the length. |
| Tags | See Table 2 |  |
| days_since_review | Not used | We think this feature will provide little information, or simply noise. |

Where:
- Column: name of the original column in the supplied *data set*
- Feature: name of the feature in the *data set* processed by the *pipeline*
- Explanation: clarifying comment

##### Table 2: encoding of the `Tags` feature

| Meaning            | *Dummy* variable     | Equals 1 only if this tag is present |
| --- | --- | --- |
| guest type         | `gt_fam_old_child`   | `Family with older children` |
| same               | `gt_fam_young_child` | `Family with young children` |
| same               | `gt_group`           | `Group` |
| same               | `gt_solo`            | `Solo traveler` |
| same               | `gt_couple`          | `Couple` |
| trip type          | `tt_business`        | `Business trip` |
| same               | `tt_leisure`         | `Leisure trip` |
| length of stay     | `stayed_1`           | `Stayed 1 night` |
| same               | `stayed_2`           | `Stayed 2 nights` |
| same               | `stayed_3`           | `Stayed 3 nights` |
| same               | `stayed_4`           | `Stayed 4 nights` |
#### Preparing our *pipeline* architecture

We will follow these steps:

##### 1. Define the required classes

We reuse the `SelectColumns` and `DropColumns` classes from the `W5/ 5 - Intro to ML.ipynb` notebook
and add the new ones we need
to handle each of the categorical variable transformations.

In [6]:
from sklearn.base import TransformerMixin

### aux functions

class SelectColumns(TransformerMixin):
    def __init__(self, columns: list) -> None:
        if not isinstance(columns, list):
            raise ValueError('Specify the columns into a list')
        self.columns = columns
    def fit(self, X, y=None): # we do not need to specify the target in the transformer. We leave it as optional arg for consistency
        return self
    def transform(self, X):
        return X[self.columns]
    
class DropColumns(TransformerMixin):
    def __init__(self, columns: list) -> None:
        if not isinstance(columns, list):
            raise ValueError('Specify the columns into a list')
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(self.columns, axis=1)
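
To illustrate how these two transformers chain inside a `Pipeline`, here is a sketch on a made-up frame. The classes are repeated in compact form so the example runs on its own, and the column names are illustrative. As an assumption that differs from the cell above, the sketch also inherits from `BaseEstimator`, which scikit-learn recommends for custom estimators because it provides `get_params` and `set_params`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class SelectColumns(BaseEstimator, TransformerMixin):
    """Keep only the listed columns."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns]

class DropColumns(BaseEstimator, TransformerMixin):
    """Remove the listed columns."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(self.columns, axis=1)

# Made-up frame with three columns
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Select a and b first, then drop b: only a should survive
pipe = Pipeline([
    ("select", SelectColumns(["a", "b"])),
    ("drop", DropColumns(["b"])),
])
result = pipe.fit_transform(toy)
print(result)
```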

We also add a new class that lets us
**extract the numeric data**
from columns holding numbers followed by units,
such as `days_since_review`,
in case we eventually decide to use it
(for now we do not consider it useful).

In [7]:
class Extract2Num(TransformerMixin):
    def __init__(self, columns: list) -> None:
        if not isinstance(columns, list):
            raise ValueError('Specify the columns into a list')
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # build the result on X's index so the final join aligns row by row
        _X = pd.DataFrame(index = X.index)
        for column in self.columns:
            # expand=False returns a Series; to_numeric turns the digit strings into numbers
            _X[column] = pd.to_numeric(
                X[column].str.extract(r'(\d+)', expand = False))
        return X.drop(self.columns, axis = 1).join(_X)

The next one **generates a hash value** from a tuple
specified by `columns`:

In [8]:
import hashlib
import pandas as pd

# Python's built-in hash() of a str is salted per process, so it is not
# reproducible between sessions; a hashlib digest is used instead.

class HashThis(TransformerMixin):
    def __init__(self, columns: list, new_col: str) -> None:
        if not isinstance(columns, list):
            raise ValueError('Specify the columns into a list')
        if not isinstance(new_col, str):
            raise ValueError('Specify the new column as str')
        self.columns = columns
        self.new_col = new_col
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        hashes = []
        for t in X[self.columns].itertuples(index = False, name = None):
            digest = hashlib.md5(str(t).encode('utf-8')).digest()
            # keep the first 8 bytes as a signed 64-bit integer
            hashes.append(int.from_bytes(digest[:8], 'big', signed = True))
        # index = X.index keeps the join aligned when X has a non-default index
        return X.join(
            pd.DataFrame(hashes, columns = [self.new_col], index = X.index))

The next one
**encodes the country name as dummy variables**.
We pass it the column with the address to scan,
the list of countries and a prefix for the new column names, to avoid collisions
when doing the *join* with the original *data frame*:

In [9]:
import pandas as pd

class DummyCountryEnds(TransformerMixin):
    def _zero_1(self, length: int, one_pos: int) -> list:
        """
        Return zero'ed vector of length, except at one_pos
        """
        v = [0] * length
        v[one_pos] = 1
        return v
    
    def __init__(self, column: str, 
                 country_list: list,
                 prefix: str
                ) -> None:
        if not isinstance(column, str):
            raise ValueError('Specify the column as str')
        if not isinstance(country_list, list):
            raise ValueError('Specify the country_list as a list')
        if not isinstance(prefix, str):
            raise ValueError('Specify the columns name prefix as str')
        self.column = column
        self.country_list = country_list
        self.prefix = prefix
        self.list_len = len(country_list)
        # build the list of column names
        self.columns = list(prefix + str(i) for i in range(self.list_len))
        # build the dictionary of dummies
        self.country_dict = {
            country_list[i]: self._zero_1(self.list_len, i) 
            for i in range(0, self.list_len)}
        self.country_dict['None'] = [0] * self.list_len
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        country_dummy = []
        for address in X[self.column]:
            country_name = 'None'
            i = 0
            while (country_name == 'None') and (i < self.list_len):
                if address.endswith(self.country_list[i]):
                    country_name = self.country_list[i]
                i += 1
            country_dummy.append(self.country_dict[country_name])
        # index = X.index keeps the join aligned when X has a non-default index
        return X.join(
            pd.DataFrame(country_dummy, columns = self.columns, index = X.index))

The next one also
**encodes the country name as dummy variables**,
but it does so by comparing the whole value.
We pass it the column to check,
the list of countries and a prefix for the new column names, to avoid collisions
when doing the *join* with the original *data frame*:

In [10]:
import pandas as pd

class DummyCountryIs(TransformerMixin):
    def _zero_1(self, length: int, one_pos: int) -> list:
        """
        Return zero'ed vector of length, except at one_pos
        """
        v = [0] * length
        v[one_pos] = 1
        return v
    
    def __init__(self, column: str, 
                 country_list: list,
                 prefix: str
                ) -> None:
        if not isinstance(column, str):
            raise ValueError('Specify the column as str')
        if not isinstance(country_list, list):
            raise ValueError('Specify the country_list as a list')
        if not isinstance(prefix, str):
            raise ValueError('Specify the columns name prefix as str')
        self.column = column
        self.country_list = country_list
        self.prefix = prefix
        self.list_len = len(country_list)
        # build the list of column names
        self.columns = list(prefix + str(i) for i in range(self.list_len))
        # build the dictionary of dummies
        self.country_dict = {
            country_list[i]: self._zero_1(self.list_len, i) 
            for i in range(0, self.list_len)}
        self.country_dict['None'] = [0] * self.list_len
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        country_dummy = []
        for country in X[self.column]:
            country_name = 'None'
            i = 0
            while (country_name == 'None') and (i < self.list_len):
                if country.strip() == self.country_list[i]:
                    country_name = self.country_list[i]
                i += 1
            country_dummy.append(self.country_dict[country_name])
        # index = X.index keeps the join aligned when X has a non-default index
        return X.join(
            pd.DataFrame(country_dummy, columns = self.columns, index = X.index))

With the next one we create the **quarter *dummy* variables**
from the date on which the review was recorded.

In [11]:
import pandas as pd
from datetime import datetime

class DummyDateQuarter(TransformerMixin):
    def _zero_1(self, length: int, one_pos: int) -> list:
        """
        Return zero'ed vector of length, except at one_pos
        """
        v = [0] * length
        v[one_pos] = 1
        return v
    
    def __init__(self, column: str, 
                 dformat: str, 
                 prefix: str
                ) -> None:
        if not isinstance(column, str):
            raise ValueError('Specify the column as str')
        if not isinstance(dformat, str):
            raise ValueError('Specify the date format as datetime format string')
        if not isinstance(prefix, str):
            raise ValueError('Specify the columns name prefix as str')
        self.column = column
        self.dformat = dformat
        self.prefix = prefix
        # build the list of column names
        self.columns = list(prefix + str(i) for i in range(4))
        # build the dictionary of dummies
        self.month_dict = {
            i + 1: self._zero_1(4, i // 3) 
            for i in range(12)}
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        quarter_dummy = []
        for date in X[self.column]:
            quarter_dummy.append(
                self.month_dict[
                    datetime.strptime(date, self.dformat).month
                    ]
                )
        # index = X.index keeps the join aligned when X has a non-default index
        return X.join(
            pd.DataFrame(quarter_dummy, columns = self.columns, index = X.index))
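
pandas can derive the quarter directly from a datetime column. As a possible alternative to the month dictionary above, this sketch builds the same four 0/1 columns with `Series.dt.quarter`. The dates are made up and the helper name `quarter_dummies` is hypothetical.

```python
import pandas as pd

def quarter_dummies(dates: pd.Series, dformat: str, prefix: str) -> pd.DataFrame:
    """One 0/1 column per quarter, named prefix0..prefix3, aligned to dates.index."""
    quarter = pd.to_datetime(dates, format=dformat).dt.quarter  # values 1..4
    return pd.DataFrame(
        {f"{prefix}{q - 1}": (quarter == q).astype(int) for q in range(1, 5)},
        index=dates.index,
    )

# Made-up review dates in month/day/year format
dates = pd.Series(["8/3/2017", "7/31/2017", "12/24/2016", "2/1/2016"])
dummies = quarter_dummies(dates, "%m/%d/%Y", "rd_q")
print(dummies)
```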

And finally we **process the tags**, turning them into *dummy* variables.

In [12]:
import pandas as pd
from ast import literal_eval

class DummyTags(TransformerMixin):    
    def __init__(self, column: str, 
                 tags_list: list,
                 prefix: str
                ) -> None:
        if not isinstance(column, str):
            raise ValueError('Specify the column as str')
        if not isinstance(tags_list, list):
            raise ValueError('Specify the tags_list as a list')
        if not isinstance(prefix, str):
            raise ValueError('Specify the columns name prefix as str')
        self.column = column
        self.tags_list = tags_list
        self.prefix = prefix
        self.list_len = len(tags_list)
        # build the list of column names
        self.columns = list(prefix + str(i) for i in range(self.list_len))
        self.zeros = [0] * self.list_len
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        tags_dummy = []
        for tags in X[self.column]:
            tags = literal_eval(tags)
            dummies = [0] * self.list_len
            for i in range(self.list_len):
                if self.tags_list[i] in tags:
                    dummies[i] = 1
            tags_dummy.append(dummies)
        # index = X.index keeps the join aligned when X has a non-default index
        return X.join(
            pd.DataFrame(tags_dummy, columns = self.columns, index = X.index))

##### 2. Split the columns into paths

We split the columns into four classes:
- the numeric ones, which need their *nan* values cleaned and then need scaling
- the categorical ones, which require a transformation (for example, encoding as *dummy* variables)
- the ones to drop at the end of the transformation process, before feeding the data to the model
- the value we want to predict

In [13]:
from sklearn.pipeline import Pipeline

In [14]:
num_cols_list = [
    'Average_Score',
    'Total_Number_of_Reviews',
    'Review_Total_Negative_Word_Counts',
    'Review_Total_Positive_Word_Counts',
    'Total_Number_of_Reviews_Reviewer_Has_Given',
    'Reviewer_Score']

cat_cols_list = [
    'Hotel_Address', 
    'Hotel_Name',
    'Review_Date',
    'Reviewer_Nationality',
    'Tags']

# we also drop one hotel-country column
# and one quarter column to avoid collinearity
drop_cat_cols_list = [
    'Hotel_Address',
    'Hotel_Name',
    'Review_Date',
    'Reviewer_Nationality',
    'Tags',
    'hc0', # first hotel-country column (OHE)
    'rd_q0' # first quarter column
    ]

Numeric columns path:

In [15]:
from sklearn.preprocessing import MinMaxScaler

sel_num_col_step = ('sel_num_col', 
                    SelectColumns(num_cols_list)
                   )

scaler_step = ('scaler', 
               MinMaxScaler()
              )

num_pipe_steps = [sel_num_col_step, scaler_step]

num_pipe = Pipeline(num_pipe_steps)
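
`MinMaxScaler` maps each column to the [0, 1] range with `(x - min) / (max - min)`. As a quick check against the statistics shown by `df.describe()`, where `Reviewer_Score` ranges from 2.5 to 10.0, a score of 2.9 should become 0.4 / 7.5 ≈ 0.0533, which is the `y` value of the first row in the transformed output. The sketch below reproduces that on a three-value column (the array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Minimum, one sample value and maximum of Reviewer_Score
scores = np.array([[2.5], [2.9], [10.0]])

scaled = MinMaxScaler().fit_transform(scores)

# The same result by hand: (x - min) / (max - min)
by_hand = (scores - scores.min()) / (scores.max() - scores.min())
print(scaled.ravel(), by_hand.ravel())
```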

Categorical columns path:

In [16]:
# Columns used to generate the unique hotel identifier
hotel_id_cols = ['Hotel_Address', 'Hotel_Name']

# Extracting the country from the hotel address
hotel_address_col = 'Hotel_Address'
hotel_country_list = ['Austria',
                      'France', 
                      'Italy',
                      'Netherlands', 
                      'Spain', 
                      'United Kingdom']

# Extracting the quarter
rv_date_col = 'Review_Date'
rv_date_format = '%m/%d/%Y'

# List of the 10 most frequent nationalities
rvr_country_col = 'Reviewer_Nationality'
rvr_country_list = ['Australia',
                    'Canada', 
                    'Germany', 
                    'Ireland',
                    'Netherlands', 
                    'Saudi Arabia', 
                    'Switzerland', 
                    'United Arab Emirates',
                    'United Kingdom',
                    'United States of America']

# Extracting from tags
rv_tags_col = 'Tags'
rv_tags_list = [' Family with older children ',
                ' Family with young children ',
                ' Group ',
                ' Solo traveler ',
                ' Couple ',
                ' Business trip ',
                ' Leisure trip ',
                ' Stayed 1 night ',
                ' Stayed 2 nights ',
                ' Stayed 3 nights ',
                ' Stayed 4 nights ']

In [17]:
# we use a custom one-hot encoder to guarantee
# that the dummy fields always refer to the same thing

sel_cat_col_step = ('sel_cat_col', 
                    SelectColumns(cat_cols_list)
                   )

hotel_id_step = ('hotel_id', 
                 HashThis(columns = hotel_id_cols, 
                          new_col = 'hotel_id')
                )

hotel_country_step = ('hotel_country', 
                      DummyCountryEnds(column = hotel_address_col, 
                                       country_list = hotel_country_list,
                                       prefix = 'hc')
                     )

rv_date_step = ('rd_q',
                DummyDateQuarter(column = rv_date_col, 
                                 dformat = rv_date_format,
                                 prefix = 'rd_q')
               )

rvr_country_step = ('rvr_country', 
                    DummyCountryIs(column = rvr_country_col, 
                                   country_list = rvr_country_list,
                                   prefix = 'rc')
                   )

rv_tags_step = ('rv_tags', 
                    DummyTags(column = rv_tags_col, 
                              tags_list = rv_tags_list,
                              prefix = 'tag')
                   )

drop_cat_col_step = ('drop_cat_col', DropColumns(drop_cat_cols_list))

cat_pipe_steps = [sel_cat_col_step,
                  hotel_id_step,
                  hotel_country_step,
                  rv_date_step,
                  rvr_country_step, 
                  rv_tags_step,
                  drop_cat_col_step]

cat_pipe = Pipeline(cat_pipe_steps)

Then we join them back together to obtain the tuples to process

In [18]:
from sklearn.pipeline import FeatureUnion

transformer_list = [('cat_pipe', cat_pipe),
                    ('num_pipe', num_pipe)]

### Join the two paths

data_prep_pipe = FeatureUnion(transformer_list=transformer_list)

data_prep_step = ('data_prep', data_prep_pipe)
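
scikit-learn also ships `ColumnTransformer`, which routes column subsets to different transformers without needing a selector step. The sketch below is an alternative to the `FeatureUnion` above, not the method used in this notebook. The toy frame and step names are made up, and `OneHotEncoder` stands in for the custom dummy classes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "Average_Score": [7.7, 8.4, 9.1],
    "Reviewer_Nationality": ["Ireland", "Australia", "Ireland"],
})

prep = ColumnTransformer(
    transformers=[
        # numeric path: scale to [0, 1]
        ("num", MinMaxScaler(), ["Average_Score"]),
        # categorical path: one column per category seen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Reviewer_Nationality"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = prep.fit_transform(toy)
print(out)
```

Unlike the hand-written dummy classes, `OneHotEncoder` learns its categories during `fit`, so the column order depends on the training data rather than on a fixed list.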

In [19]:
out_cols = ['hotel_id',
#            'h_country0', # collinearity
            'h_country1', 'h_country2', 
            'h_country3', 'h_country4', 'h_country5',
#            'rd_1q', # collinearity
            'rd_2q', 'rd_3q', 'rd_4q',
            'rvr_country0', 'rvr_country1', 
            'rvr_country2', 'rvr_country3', 
            'rvr_country4', 'rvr_country5', 
            'rvr_country6', 'rvr_country7', 
            'rvr_country8', 'rvr_country9', 
            'gt_fam_old_child', 'gt_fam_young_child',
            'gt_group', 'gt_solo', 'gt_couple',      
            'tt_business', 'tt_leisure',
            'stayed_1', 'stayed_2', 'stayed_3', 'stayed_4',
            'average_score', 
            'hotel_tnreviews', 
            'rv_negwcount', 
            'rv_poswcount', 
            'rvr_tnreviews',
            'y']

#### Testing the pipeline

Let's see how it turned out:

In [20]:
pipe_test = Pipeline([data_prep_step])

X_trans = pd.DataFrame(
    pipe_test.fit_transform(df), columns = out_cols)

In [21]:
X_trans.describe()

Unnamed: 0,hotel_id,h_country1,h_country2,h_country3,h_country4,h_country5,rd_2q,rd_3q,rd_4q,rvr_country0,...,stayed_1,stayed_2,stayed_3,stayed_4,average_score,hotel_tnreviews,rv_negwcount,rv_poswcount,rvr_tnreviews,y
count,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,...,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0,515738.0
mean,-2.788511e+17,0.116199,0.072143,0.110936,0.116627,0.508594,0.256465,0.276144,0.231552,0.042048,...,0.375472,0.2597,0.185794,0.092716,0.695106,0.162431,0.04544,0.045004,0.017418,0.78601
std,5.368031e+18,0.320463,0.258725,0.314053,0.320976,0.499927,0.436682,0.447089,0.421824,0.2007,...,0.484245,0.43847,0.388941,0.290034,0.119141,0.13938,0.072772,0.0552,0.031187,0.218381
min,-9.209622e+18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-5.183919e+18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.630435,0.06724,0.004902,0.012658,0.0,0.666667
50%,-2.958857e+17,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.695652,0.125759,0.022059,0.027848,0.00565,0.84
75%,4.250164e+18,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,1.0,0.0,0.0,0.782609,0.214711,0.056373,0.055696,0.019774,0.946667
max,9.214524e+18,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [22]:
X_trans.iloc[:, :9]

Unnamed: 0,hotel_id,h_country1,h_country2,h_country3,h_country4,h_country5,rd_2q,rd_3q,rd_4q
0,-8.795192e+18,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,-8.795192e+18,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,-8.795192e+18,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,-8.795192e+18,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,-8.795192e+18,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...
515733,4.981431e+18,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
515734,4.981431e+18,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
515735,4.981431e+18,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
515736,4.981431e+18,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [23]:
X_trans.iloc[:, 9:19]

Unnamed: 0,rvr_country0,rvr_country1,rvr_country2,rvr_country3,rvr_country4,rvr_country5,rvr_country6,rvr_country7,rvr_country8,rvr_country9
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
515733,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
515734,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
515735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
515736,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
X_trans.iloc[:, 19:30]

Unnamed: 0,gt_fam_old_child,gt_fam_young_child,gt_group,gt_solo,gt_couple,tt_business,tt_leisure,stayed_1,stayed_2,stayed_3,stayed_4
0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
515733,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
515734,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
515735,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
515736,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [25]:
X_trans.iloc[:, 30:]

Unnamed: 0,average_score,hotel_tnreviews,rv_negwcount,rv_poswcount,rvr_tnreviews,y
0,0.543478,0.081795,0.973039,0.027848,0.016949,0.053333
1,0.543478,0.081795,0.000000,0.265823,0.016949,0.666667
2,0.543478,0.081795,0.102941,0.053165,0.022599,0.613333
3,0.543478,0.081795,0.514706,0.065823,0.000000,0.173333
4,0.543478,0.081795,0.343137,0.020253,0.005650,0.560000
...,...,...,...,...,...,...
515733,0.630435,0.167198,0.034314,0.005063,0.019774,0.600000
515734,0.630435,0.167198,0.026961,0.027848,0.031073,0.440000
515735,0.630435,0.167198,0.046569,0.000000,0.005650,0.000000
515736,0.630435,0.167198,0.000000,0.063291,0.005650,0.840000


Everything looks correct. :-)

### 3. Split the data into train and test.
CAREFUL: we do not want to leak information! Think carefully about how the split should be done...

In [16]:
from sklearn.model_selection import train_test_split

## Make the split reproducible so it is easier to debug

np.random.seed(42)

## Train | test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)
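One possible reading of the leakage warning above: a purely random split puts reviews of the same hotel on both sides, so hotel-level columns such as `Average_Score` carry information about test rows into training. The sketch below shows a group-aware split with scikit-learn's `GroupShuffleSplit`. It assumes that grouping by `Hotel_Name` is the relevant precaution, and it uses a small made-up dataframe (`toy`, `good_review`) rather than the real dataset.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the reviews dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Hotel_Name": ["A", "A", "A", "B", "B", "C", "C", "D", "D", "E"],
    "Review_Total_Positive_Word_Counts": [10, 0, 5, 7, 3, 12, 1, 4, 9, 6],
    "good_review": [True, False, True, True, False, True, False, False, True, True],
})

X_toy = toy.drop(columns="good_review")
y_toy = toy["good_review"]

# One split; test_size refers to the share of hotels (groups), not of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=toy["Hotel_Name"]))

X_tr, X_te = X_toy.iloc[train_idx], X_toy.iloc[test_idx]
y_tr, y_te = y_toy.iloc[train_idx], y_toy.iloc[test_idx]

# No hotel should appear on both sides of the split
shared_hotels = set(X_tr["Hotel_Name"]) & set(X_te["Hotel_Name"])
print("Hotels in both train and test:", shared_hotels)
```

Whether this is stricter than the exercise needs depends on the question being asked: to predict reviews for hotels never seen before, a split by hotel is the safer choice.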

In [17]:
# The columns of the transformed output come out in a different order, so we rebuild the list in the appropriate order

trans_columns = [element for element in X.columns if element not in cat_cols_list] + cat_cols_list

In [None]:
X_train_trans = pd.DataFrame(
    pipe.fit_transform(X_train),
    columns = trans_columns)

In [23]:
X_train.head(3)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Tags,days_since_review,lat,lng
264651,97 Great Russell Street Bloomsbury Camden Lond...,406,8/8/2016,8.2,Radisson Blu Edwardian Kenilworth,Argentina,Very small room,4,2011,No Positive,0,15,"[' Leisure trip ', ' Couple ', ' Standard Doub...",360 day,51.517972,-0.12805
150426,33 37 Hogarth Road Kensington and Chelsea Lond...,989,8/31/2016,8.4,Park Grand London Kensington,United Kingdom,The view from the window was pigeon nests and...,11,4660,The bedroom was clean and modern with good fa...,10,1,"[' Leisure trip ', ' Solo traveler ', ' Superi...",337 day,51.493847,-0.191758
308128,Damrak 1 5 Amsterdam City Center 1012 LG Amste...,973,5/31/2016,8.0,Park Plaza Victoria Amsterdam,United Kingdom,Nothing,2,4820,Location perfect and customer service was exc...,9,2,"[' Leisure trip ', ' Couple ', ' Double Room '...",429 day,52.377278,4.897818


In [None]:
X_train_trans.head(3)

In [138]:
y_train.head(3)

264651     True
150426    False
308128     True
Name: Reviewer_Score, dtype: bool

In [None]:
## Train the whole pipeline from the raw data
pipe.fit(X_train, y_train)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

## Pipeline steps are tuples: ("name", transformer())
imputation_step = ('imputer', SimpleImputer(strategy='mean'))
scaling_step = ('scaler', StandardScaler())

## List of numeric preprocessing steps
scalar_steps = [imputation_step, scaling_step]

## Finally, call the pipeline constructor
scalar_pipe = Pipeline(scalar_steps)

## Mean imputation and scaling only work on numeric columns
num_cols = X_train.select_dtypes(include='number').columns

## Fit on train only, then reuse the fitted statistics on test
X_train_transformed = scalar_pipe.fit_transform(X_train[num_cols])
X_test_transformed = scalar_pipe.transform(X_test[num_cols])

print('X_train: \n')
print('Mean before pipeline: \n', X_train[num_cols].mean())
print('Mean after pipeline (imputer + scaler): \n', X_train_transformed.mean(axis=0))

print('\n X_test: \n')
print('Mean before pipeline: \n', X_test[num_cols].mean())
print('Mean after pipeline (imputer + scaler): \n', X_test_transformed.mean(axis=0))

### 4. Propose a first approach using a linear model ...
... and sketch the architecture of the resulting transformation + modelling `pipeline`, similar to the one shown in the W5 practical class

In [51]:
from sklearn.ensemble import RandomForestRegressor

### Now we can apply our classifier

classifier_step = ('model', RandomForestRegressor())

pipe_steps = [data_prep_step, classifier_step]

pipe = Pipeline(pipe_steps)
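The exercise statement asks for a linear model, while the cell above plugs in a random forest. As a hedged illustration of what a linear variant could look like, the sketch below chains a preparation step and a `LogisticRegression` in one `Pipeline`. The preparation step here (`toy_prep`) is a stand-in, since `data_prep_step` is not shown in this section; the toy data and the `trip_type` column are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): two numeric columns and one categorical column
toy = pd.DataFrame({
    "Review_Total_Negative_Word_Counts": [4, 11, 2, 40, 0, 25, 3, 30],
    "Review_Total_Positive_Word_Counts": [0, 10, 9, 2, 15, 1, 12, 0],
    "trip_type": ["leisure", "leisure", "leisure", "business",
                  "leisure", "business", "leisure", "business"],
})
good = pd.Series([True, False, True, False, True, False, True, False])

num_cols = ["Review_Total_Negative_Word_Counts", "Review_Total_Positive_Word_Counts"]
cat_cols = ["trip_type"]

# Numeric branch: impute missing values, then standardise
numeric_prep = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Hypothetical stand-in for the notebook's data preparation step
toy_prep = ColumnTransformer([
    ("num", numeric_prep, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Architecture: raw data -> data_prep (num | cat) -> linear model
linear_pipe = Pipeline([
    ("data_prep", toy_prep),
    ("model", LogisticRegression(max_iter=1000)),
])

linear_pipe.fit(toy, good)
pred = linear_pipe.predict(toy)
print(list(linear_pipe.named_steps))
```

Because the target is boolean (good review or not), a classifier such as `LogisticRegression` returns class labels directly, and its coefficients are easy to read once the inputs are scaled.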

We continue preparing the data for our classification ...

In [35]:
## Separate the dependent variable

X = df.drop(['Reviewer_Score'], axis = 1)

## Classification target: a review counts as good if its score is at least minGoodScore

minGoodScore = 8.4

y = df['Reviewer_Score'] >= minGoodScore

## Let's see what we have ...
y.value_counts()

True     293974
False    221764
Name: Reviewer_Score, dtype: int64

### 5. Train the model and review which variables/words turned out to be most important.

### 6. Give the client one key message ...
... about what kind of things guests care about most: `If you don't want a bad review, don't serve cold food`