# Categorical variables

In this class we will see how to use pandas and scikit-learn to transform categorical variables into something machine learning models can work with.

We will use a small, hand-made dataset to learn how to use scikit-learn and pandas.

Afterwards, you will apply what you learned to the dataset from the previous class (ecommerce).

In [1]:
#from google.colab import drive # Used to mount our Google Drive
#drive.mount('/content/drive') # Mount our Google Drive

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = {'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color': ['Red', 'Yellow','Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Blue']}

df = pd.DataFrame(data)

In [4]:
df

Unnamed: 0,Temperature,Color
0,Hot,Red
1,Cold,Yellow
2,Very Hot,Blue
3,Warm,Blue
4,Hot,Red
5,Warm,Yellow
6,Warm,Red
7,Hot,Yellow
8,Hot,Yellow
9,Cold,Blue


## One hot encoding

In this simple case, the Temperature variable can be considered ordinal because temperature ranges from Cold to Very Hot.

The Color variable, on the other hand, has no inherent order, so we cannot treat it as ordinal.

We will apply one hot encoding to the Color variable.

This can be done with pandas or with scikit-learn's OneHotEncoder.

Let's start with pandas.

Pandas provides the get_dummies() function (pandas 2.0 and later show True/False instead of 0/1 unless dtype=int is passed):

In [5]:
pd.get_dummies(df.Color)

Unnamed: 0,Blue,Red,Yellow
0,0,1,0
1,0,0,1
2,1,0,0
3,1,0,0
4,0,1,0
5,0,0,1
6,0,1,0
7,0,0,1
8,0,0,1
9,1,0,0


How do we add these columns to our dataset?

We can concatenate this DataFrame of dummy variables horizontally to the original one. Next class we will look at the concat and merge methods, among others, in more detail.

In [6]:
dummies = pd.get_dummies(df.Color) # Get the dummy columns
df_encoded = pd.concat([df, dummies], axis=1) # Concatenate horizontally with axis=1

df_encoded

Unnamed: 0,Temperature,Color,Blue,Red,Yellow
0,Hot,Red,0,1,0
1,Cold,Yellow,0,0,1
2,Very Hot,Blue,1,0,0
3,Warm,Blue,1,0,0
4,Hot,Red,0,1,0
5,Warm,Yellow,0,0,1
6,Warm,Red,0,1,0
7,Hot,Yellow,0,0,1
8,Hot,Yellow,0,0,1
9,Cold,Blue,1,0,0


Now we can drop the original column:

In [7]:
df_encoded = df_encoded.drop('Color', axis=1)

In [8]:
df_encoded

Unnamed: 0,Temperature,Blue,Red,Yellow
0,Hot,0,1,0
1,Cold,0,0,1
2,Very Hot,1,0,0
3,Warm,1,0,0
4,Hot,0,1,0
5,Warm,0,0,1
6,Warm,0,1,0
7,Hot,0,0,1
8,Hot,0,0,1
9,Cold,1,0,0
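As a side note, the three steps above (get the dummies, concatenate, drop) can also be done in a single call. The following is a minimal sketch on a shortened copy of the toy dataset; it assumes a pandas version where `get_dummies` accepts the `columns` and `dtype` arguments.

```python
import pandas as pd

# Shortened copy of the toy dataset, for illustration only
df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'],
    'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red'],
})

# columns= encodes only the listed columns, drops the originals and
# prefixes the new columns with the name of the source column.
# dtype=int forces 0/1 output (recent pandas versions default to booleans).
df_encoded = pd.get_dummies(df, columns=['Color'], dtype=int)

print(df_encoded.columns.tolist())
```

The non-encoded columns come first and the dummy columns are appended at the end, so the result has the same layout as the table above, with a `Color_` prefix on the new columns.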


How do we do the same with scikit-learn?

The preprocessing module provides OneHotEncoder:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder



In [9]:
from sklearn.preprocessing import OneHotEncoder

In [10]:
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)  # from scikit-learn 1.2 on, this parameter is named sparse_output

Look up in the documentation:
- What does `handle_unknown='ignore'` mean?
- What is "sparse"?
- What happens if we set sparse=True?

We call fit:

In [11]:
enc.fit(df.Color.values.reshape(-1,1))

OneHotEncoder(handle_unknown='ignore', sparse=False)

- What happens if we remove the .reshape(-1, 1)?

- What other way can you think of to fix the error without using reshape?

In [15]:
encoded_color = enc.transform(df.Color.values.reshape(-1, 1))
encoded_color

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])

Now, how do we add this to our DataFrame?

The get_feature_names method gives us the names of the newly created features (from scikit-learn 1.0 on it is called get_feature_names_out, and get_feature_names was removed in 1.2):

In [16]:
enc.get_feature_names(['Color'])



array(['Color_Blue', 'Color_Red', 'Color_Yellow'], dtype=object)

In [17]:
encoded_color_columns = enc.get_feature_names(['Color'])

In [18]:
encoded_color_df = pd.DataFrame(data=encoded_color, columns= encoded_color_columns)
encoded_color_df

Unnamed: 0,Color_Blue,Color_Red,Color_Yellow
0,0.0,1.0,0.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,0.0,1.0,0.0
5,0.0,0.0,1.0
6,0.0,1.0,0.0
7,0.0,0.0,1.0
8,0.0,0.0,1.0
9,1.0,0.0,0.0


Now, as before, we can concatenate and drop the original column:

In [19]:
pd.concat([df, encoded_color_df], axis=1).drop('Color', axis=1)


Unnamed: 0,Temperature,Color_Blue,Color_Red,Color_Yellow
0,Hot,0.0,1.0,0.0
1,Cold,0.0,0.0,1.0
2,Very Hot,1.0,0.0,0.0
3,Warm,1.0,0.0,0.0
4,Hot,0.0,1.0,0.0
5,Warm,0.0,0.0,1.0
6,Warm,0.0,1.0,0.0
7,Hot,0.0,0.0,1.0
8,Hot,0.0,0.0,1.0
9,Cold,1.0,0.0,0.0
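The parameter and method names used above (`sparse`, `get_feature_names`) depend on the scikit-learn version. As a hedged alternative, the sketch below avoids both: it keeps the default sparse output and densifies it with `.toarray()`, and it builds the column names from the fitted `categories_` attribute. The shortened dataset and the `Color_` prefix are choices made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Shortened copy of the Color column, for illustration only
df = pd.DataFrame({'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red']})

enc = OneHotEncoder(handle_unknown='ignore')  # default output is a sparse matrix
encoded = enc.fit_transform(df[['Color']]).toarray()  # densify explicitly

# categories_ holds one array of sorted categories per input column
columns = [f'Color_{cat}' for cat in enc.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=df.index)

print(encoded_df)
```

Passing `index=df.index` keeps the rows aligned when the result is later concatenated with the original DataFrame.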


Often, instead of creating all the columns, the `drop='first'` parameter is used.

This creates every column except the first one (in our case Color_Blue would not be created): if none of the remaining columns is 1, the row must be Blue. It saves a column without losing information.

For binary variables, we can create a single column with:

`drop='if_binary'`
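To make the effect of both options concrete, here is a small sketch. The `colors` and `answers` arrays are made up for this example, and the encoder is left with its default `handle_unknown` setting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array(['Red', 'Yellow', 'Blue', 'Blue']).reshape(-1, 1)

# drop='first' removes the column of the first (alphabetically sorted) category
enc_first = OneHotEncoder(drop='first')
encoded_first = enc_first.fit_transform(colors).toarray()
print(enc_first.categories_[0])  # all categories seen during fit
print(encoded_first)             # columns: Red, Yellow; Blue rows are all zeros

# drop='if_binary' only drops a column when the variable has exactly two categories
answers = np.array(['yes', 'no', 'yes']).reshape(-1, 1)
enc_binary = OneHotEncoder(drop='if_binary')
encoded_binary = enc_binary.fit_transform(answers).toarray()
print(encoded_binary)            # a single column: 1 for 'yes', 0 for 'no'
```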

## Label encoder

It is used in much the same way as scikit-learn's OneHotEncoder.

In [20]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Temperature_label_encoded'] = le.fit_transform(df.Temperature)
df

Unnamed: 0,Temperature,Color,Temperature_label_encoded
0,Hot,Red,1
1,Cold,Yellow,0
2,Very Hot,Blue,2
3,Warm,Blue,3
4,Hot,Red,1
5,Warm,Yellow,3
6,Warm,Red,3
7,Hot,Yellow,1
8,Hot,Yellow,1
9,Cold,Blue,0


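LabelEncoder is not suitable for ordinal data: scikit-learn assigns the integers following the sorted (alphabetical) order of the labels, ignoring the fact that we want Cold to be lower than Hot. In the output above, Very Hot gets 2 and Warm gets 3. Its documentation also states that it is meant for encoding target values (y), not input features.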
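A quick way to see which integer each label received is the `classes_` attribute of the fitted encoder. This is a minimal sketch using the four temperature labels directly rather than the DataFrame column.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Hot', 'Cold', 'Very Hot', 'Warm'])

# classes_ is sorted; the position of each label is its code
print(le.classes_)
print(codes)
```

The position of each label in `classes_` is the integer assigned to it, which is why the codes follow alphabetical order rather than temperature.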

When we want to choose the numeric value for each category ourselves, we can use the pandas .replace() method.

It takes a dictionary whose keys are the values we want to transform and whose values are the desired results.

Let's look at an example:

In [21]:
df.Temperature.unique()

array(['Hot', 'Cold', 'Very Hot', 'Warm'], dtype=object)

In [22]:
mapping_dict = {
    'Cold': 1,
    'Warm': 2,
    'Hot': 3,
    'Very Hot': 4
}

temperature_ordinal = df.Temperature.replace(mapping_dict)
temperature_ordinal

0    3
1    1
2    4
3    2
4    3
5    2
6    2
7    3
8    3
9    1
Name: Temperature, dtype: int64

In [23]:
df['Temperature_ordinal'] = temperature_ordinal
df

Unnamed: 0,Temperature,Color,Temperature_label_encoded,Temperature_ordinal
0,Hot,Red,1,3
1,Cold,Yellow,0,1
2,Very Hot,Blue,2,4
3,Warm,Blue,3,2
4,Hot,Red,1,3
5,Warm,Yellow,3,2
6,Warm,Red,3,2
7,Hot,Yellow,1,3
8,Hot,Yellow,1,3
9,Cold,Blue,0,1
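scikit-learn also offers a way to set the order explicitly: OrdinalEncoder accepts a `categories` argument with the desired order for each column. The sketch below is an alternative to the `.replace()` approach, not the method used in this class; note that the codes start at 0 instead of 1 and are returned as floats.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shortened copy of the Temperature column, for illustration only
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})

# One list of categories per encoded column, from lowest to highest
order = ['Cold', 'Warm', 'Hot', 'Very Hot']
oe = OrdinalEncoder(categories=[order])

# fit_transform returns a 2D array; ravel() flattens it into a single column
df['Temperature_ordinal'] = oe.fit_transform(df[['Temperature']]).ravel()
print(df)
```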


# Discretization

Let's see how to do it with sklearn. For this case we will use another dataset with a continuous variable.

We create the dataset:

In [25]:
variable_continua = np.arange(200)
df_cont = pd.DataFrame({'X': variable_continua})

In [26]:
df_cont.head()

Unnamed: 0,X
0,0
1,1
2,2
3,3
4,4


We apply KBinsDiscretizer.

We need to pass it the number of bins, encode and strategy.

Look up what these parameters mean:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

In [27]:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy = 'uniform')

So far we have mostly called fit and transform separately in scikit-learn. The fit_transform method lets us do both in a single line:

In [28]:
df_cont['discretized'] = est.fit_transform(df_cont[['X']]) # Here, instead of reshape(-1, 1), we select with double brackets [[]]

In [29]:
df_cont

Unnamed: 0,X,discretized
0,0,0.0
1,1,0.0
2,2,0.0
3,3,0.0
4,4,0.0
...,...,...
195,195,4.0
196,196,4.0
197,197,4.0
198,198,4.0


In [30]:
df_cont.discretized.value_counts()

0.0    40
1.0    40
2.0    40
3.0    40
4.0    40
Name: discretized, dtype: int64
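Since this class covers both libraries, it is worth noting that pandas can produce equal-width bins too. The sketch below uses `pd.cut` as a pandas-only counterpart to the uniform strategy; the column name `discretized_cut` is made up for this example.

```python
import numpy as np
import pandas as pd

df_cont = pd.DataFrame({'X': np.arange(200)})

# bins=5 splits the range of X into 5 equal-width intervals;
# labels=False returns the integer bin index instead of the interval itself
df_cont['discretized_cut'] = pd.cut(df_cont['X'], bins=5, labels=False)

counts = df_cont['discretized_cut'].value_counts().sort_index()
print(counts)
```

On this dataset each of the five bins receives 40 values, matching the counts above.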

# Exercise

We will load the dataset from the previous class (this time without nulls) and transform the categorical variables.

Use your own judgment to decide when ordinal encoding, one hot encoding, etc. is the better choice.

Remember that the dataset columns are:


- id: User id
- administrative: Number of times the user visited the "administrative" section
- administrative_duration: Time the user spent in the administrative section
- informational: Number of times the user visited the "informational" section
- informational_duration: Time the user spent in the informational section
- productrelated: Number of times the user visited the "products related" section
- productrelated_duration: Time the user spent in the "products related" section
- bouncerates: Percentage of visitors who enter the page and leave immediately without interacting with it. This metric only counts when it is the first page visited on the website.
- exitrates: Of all visits to the website's pages, the percentage of users who left the site from this page, that is, the users whose last visit to the site was this page.
- pagevalues: Average value of the website; it indicates the contribution the site made to a visitor reaching the final purchase page or section.
- specialday: Whether it is a special date or not (1 or 0)
- operatingsystems: Operating system
- browser: Browser name
- region: User's geographic region
- traffictype: Type of web traffic
- visitortype: New visitor or one returning to the site
- Weekend: 1 if it is a weekend and 0 otherwise
- revenue: 1 if the user made a purchase and 0 otherwise

In [32]:
df = pd.read_csv('online-shoppers-intention-sin-nulos.csv')

In [33]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,revenue
0,0,1.0,0.0,0.0,0.0,0.0,5.0,81.083333,0.04,0.157839,0.0,0.0,Dec,-0.14476,2.0,-0.892079,2.0,New_Visitor,0.0,0.0
1,1,2.0,0.0,0.0,0.0,0.0,3.0,189.0,0.003073,0.50708,0.0,0.0,Mar,-0.14476,2.0,2.012326,1.0,Returning_Visitor,0.0,0.0
2,2,3.0,0.0,0.0,1.0,132.0,8.0,445.0,0.0,-0.590534,0.0,0.0,Mar,-0.14476,2.0,0.352666,14.0,Returning_Visitor,1.0,0.0
3,3,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.2,3.301007,0.0,0.0,Mar,-0.14476,8.0,-0.477164,1.0,Returning_Visitor,0.0,0.0
4,4,5.0,0.0,0.0,0.0,0.0,4.0,14.0,0.1,2.253284,0.0,0.0,Mar,0.965712,2.0,-0.892079,1.0,Returning_Visitor,0.0,0.0


Transform the variables:

- Month
- VisitorType
- Weekend

using the methods we learned.

Discretize:
- ExitRates
- BounceRates


Investigate:

- How can I find out the range of values covered by each of the bins in KBinsDiscretizer? (look for the discretizer's attributes in the documentation)
- What happens if I use encode='onehot' or 'onehot-dense' instead of encode='ordinal'?
- What is the difference between strategy='uniform' and strategy='quantile'?

In [34]:
df["Month"].unique()

array(['Dec', 'Mar', 'Oct', 'May', 'Nov', 'Aug', 'Jul', 'Sep', 'Feb',
       'June'], dtype=object)

In [35]:
df["VisitorType"].unique()

array(['New_Visitor', 'Returning_Visitor', 'Other'], dtype=object)

In [40]:
month_mapping_dict = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6, "June": 6, "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec":12}  # the data spells June as 'June', so we map that spelling too

In [42]:
df["Month"] = df["Month"].replace(month_mapping_dict)
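One thing to keep in mind is that `.replace()` silently leaves untouched any value that is not a key of the dictionary. A possible way to catch such cases is `.map()`, which turns unmapped labels into NaN. The sketch below uses a made-up `months` series and a partial `mapping` where 'June' is missing on purpose.

```python
import pandas as pd

# Hypothetical sample of month labels, for illustration only
months = pd.Series(['Dec', 'Mar', 'June', 'May'])
mapping = {'Mar': 3, 'May': 5, 'Jun': 6, 'Dec': 12}  # 'June' deliberately missing

mapped = months.map(mapping)               # labels without a key become NaN
unmapped = months[mapped.isna()].unique()  # original labels that were not mapped

print(unmapped)
```

Checking `unmapped` before overwriting the column makes it easy to spot spellings that the dictionary does not cover.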

In [45]:
one_hot_visitor = pd.DataFrame(enc.fit_transform(df[["VisitorType"]]), columns= ["New_visitor", "Other_visitor", "Returning_visitor"])

In [50]:
df = pd.concat([df, one_hot_visitor], axis=1).drop(["Unnamed: 0", "VisitorType"], axis=1)


In [55]:
df["ExitRates"] = est.fit_transform(df[["ExitRates"]])

In [57]:
df["BounceRates"] = est.fit_transform(df[["BounceRates"]])

In [63]:
est.bin_edges_  # edges from the most recent fit (BounceRates)

array([array([0.  , 0.04, 0.08, 0.12, 0.16, 0.2 ])], dtype=object)
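Because the same discretizer was fitted twice, `bin_edges_` only keeps the edges of the last column. One option, sketched below on a small made-up `rates` DataFrame with `n_bins=2`, is to fit both columns in a single call, so that `bin_edges_` holds one array of edges per column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical values, for illustration only
rates = pd.DataFrame({
    'ExitRates':   [0.0, 0.05, 0.10, 0.15, 0.20],
    'BounceRates': [0.0, 0.01, 0.02, 0.03, 0.04],
})

est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
binned = est.fit_transform(rates)  # both columns discretized in one call

# bin_edges_ has one array of edges per input column, in column order
for name, edges in zip(rates.columns, est.bin_edges_):
    print(name, edges)
```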

In [58]:
df

Unnamed: 0,id,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,...,Month,OperatingSystems,Browser,Region,TrafficType,Weekend,revenue,New_visitor,Other_visitor,Returning_visitor
0,1.0,0.0,0.000000,0.0,0.0,5.0,81.083333,1.0,1.0,0.000,...,12,-0.144760,2.0,-0.892079,2.0,0.0,0.0,1.0,0.0,0.0
1,2.0,0.0,0.000000,0.0,0.0,3.0,189.000000,0.0,1.0,0.000,...,3,-0.144760,2.0,2.012326,1.0,0.0,0.0,0.0,0.0,1.0
2,3.0,0.0,0.000000,1.0,132.0,8.0,445.000000,0.0,0.0,0.000,...,3,-0.144760,2.0,0.352666,14.0,1.0,0.0,0.0,0.0,1.0
3,4.0,0.0,0.000000,0.0,0.0,3.0,0.000000,4.0,4.0,0.000,...,3,-0.144760,8.0,-0.477164,1.0,0.0,0.0,0.0,0.0,1.0
4,5.0,0.0,0.000000,0.0,0.0,4.0,14.000000,2.0,3.0,0.000,...,3,0.965712,2.0,-0.892079,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8626,8627.0,1.0,1005.608333,0.0,0.0,25.0,732.344872,0.0,0.0,0.000,...,11,-0.144760,2.0,-0.892079,2.0,0.0,0.0,0.0,0.0,1.0
8627,8628.0,0.0,0.000000,0.0,0.0,14.0,340.000000,0.0,0.0,23.388,...,5,-0.144760,2.0,-0.062249,1.0,1.0,1.0,0.0,0.0,1.0
8628,8629.0,0.0,0.000000,0.0,0.0,3.0,189.000000,0.0,1.0,0.000,...,5,-0.144760,2.0,-0.062249,4.0,0.0,0.0,0.0,0.0,1.0
8629,8630.0,0.0,0.000000,0.0,0.0,13.0,305.000000,0.0,0.0,0.000,...,3,-0.144760,1.0,-0.892079,2.0,0.0,0.0,1.0,0.0,0.0
