# Data Transformation with Scikit-Learn

We will walk through some of Scikit-Learn's data transformation features using examples on the Titanic dataset, just as we did with Pandas.

First we import the libraries and load the dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

In [3]:
# Load the data
df = pd.read_csv('DS_Clase_05_titanic.csv')
print(df.shape)
df.head(5)

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Imputing missing values with Scikit-Learn

Keep in mind that most Scikit-Learn estimators do not accept missing values, so one of the first things we need to do is impute them. The `sklearn.impute` module, whose [documentation](https://scikit-learn.org/stable/modules/impute.html#impute) we recommend reading, provides some useful classes for this task.

The simplest imputer is `SimpleImputer`, which fills in missing values in the columns we choose. Study the following example and explore the parameters of this object.

### Before anything else, we need to install `Scikit-Learn`

To do so, first activate your environment (`conda activate <name>`) and then run from the console:

`conda install scikit-learn`

In [4]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')

In [7]:
edades = df.Age.values
edades

array([22.  , 38.  , 26.  , 35.  , 35.  ,   nan, 54.  ,  2.  , 27.  ,
       14.  ,  4.  , 58.  , 20.  , 39.  , 14.  , 55.  ,  2.  ,   nan,
       31.  ,   nan, 35.  , 34.  , 15.  , 28.  ,  8.  , 38.  ,   nan,
       19.  ,   nan,   nan, 40.  ,   nan,   nan, 66.  , 28.  , 42.  ,
         nan, 21.  , 18.  , 14.  , 40.  , 27.  ,   nan,  3.  , 19.  ,
         nan,   nan,   nan,   nan, 18.  ,  7.  , 21.  , 49.  , 29.  ,
       65.  ,   nan, 21.  , 28.5 ,  5.  , 11.  , 22.  , 38.  , 45.  ,
        4.  ,   nan,   nan, 29.  , 19.  , 17.  , 26.  , 32.  , 16.  ,
       21.  , 26.  , 32.  , 25.  ,   nan,   nan,  0.83, 30.  , 22.  ,
       29.  ,   nan, 28.  , 17.  , 33.  , 16.  ,   nan, 23.  , 24.  ,
       29.  , 20.  , 46.  , 26.  , 59.  ,   nan, 71.  , 23.  , 34.  ,
       34.  , 28.  ,   nan, 21.  , 33.  , 37.  , 28.  , 21.  ,   nan,
       38.  ,   nan, 47.  , 14.5 , 22.  , 20.  , 17.  , 21.  , 70.5 ,
       29.  , 24.  ,  2.  , 21.  ,   nan, 32.5 , 32.5 , 54.  , 12.  ,
         nan, 24.  ,

In [8]:
imp.fit(edades.reshape(-1,1))
print(imp.statistics_)

[29.69911765]


In [9]:
edades_imputed = imp.transform(edades.reshape(-1,1))
print(edades_imputed[:10])

[[22.        ]
 [38.        ]
 [26.        ]
 [35.        ]
 [35.        ]
 [29.69911765]
 [54.        ]
 [ 2.        ]
 [27.        ]
 [14.        ]]


And, if we want to add them to the DataFrame:

In [10]:
df['Age_imputed'] = edades_imputed
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,29.699118
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,54.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,2.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,27.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,14.0
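As a starting point for exploring the parameters of `SimpleImputer`, here is a minimal sketch that compares several values of `strategy`. The ages are made up for illustration and are not taken from the Titanic file; `statistics_` is the same attribute printed above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy ages (made up for illustration), with two missing values.
ages = np.array([22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0]).reshape(-1, 1)

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(ages)
    # statistics_ holds the value used to fill each column.
    fill_values[strategy] = float(imputer.statistics_[0])

# strategy="constant" uses fill_value instead of a statistic learned from the data.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
ages_constant = constant_imputer.fit_transform(ages)

print(fill_values)
print(ages_constant.ravel())
```

On skewed columns the median is often a more robust choice than the mean, although which strategy fits best depends on the data.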


### Discretization and binning with Scikit-Learn

The main difference between Scikit-Learn and Pandas is that Scikit-Learn chooses the bin edges according to a strategy we pass in. The class we will use is called `KBinsDiscretizer`.

In [13]:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy = 'uniform')

We extract the values we want to fit.

In [14]:
edades = df.Age_imputed.values
print(edades.reshape(-1,1).shape)

(891, 1)


And we fit the estimator:

In [15]:
est.fit(edades.reshape(-1,1))

KBinsDiscretizer(encode='ordinal', n_bins=5, strategy='uniform')

We inspect the edges of each bin:

In [16]:
est.bin_edges_

array([array([ 0.42 , 16.336, 32.252, 48.168, 64.084, 80.   ])],
      dtype=object)
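The edges above are consistent with equal-width bins between the minimum and maximum age. A minimal sketch of that idea, assuming made-up ages that only share the same extremes (0.42 and 80.0) as the imputed column:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy ages (made up for illustration); only the min (0.42) and max (80.0)
# are chosen to match the extremes seen in the bin edges above.
ages = np.array([0.42, 5.0, 22.0, 29.7, 38.0, 54.0, 66.0, 80.0]).reshape(-1, 1)

est = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
est.fit(ages)

# With strategy="uniform", the edges are equally spaced between min and max.
manual_edges = np.linspace(ages.min(), ages.max(), num=5 + 1)

print(est.bin_edges_[0])
print(manual_edges)
```

Both printed arrays should agree, which shows that with `strategy='uniform'` the edges depend only on the range of the data and not on how the values are distributed inside it.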

In [17]:
bines_asignados = est.transform(edades.reshape(-1,1))
print(bines_asignados)

[[1.]
 [2.]
 [1.]
 [2.]
 [2.]
 [1.]
 [3.]
 [0.]
 [1.]
 [0.]
 [0.]
 [3.]
 [1.]
 [2.]
 [0.]
 [3.]
 [0.]
 [1.]
 [1.]
 [1.]
 [2.]
 [2.]
 [0.]
 [1.]
 [0.]
 [2.]
 [1.]
 [1.]
 [1.]
 [1.]
 [2.]
 [1.]
 [1.]
 [4.]
 [1.]
 [2.]
 [1.]
 [1.]
 [1.]
 [0.]
 [2.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [3.]
 [1.]
 [4.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [2.]
 [2.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [2.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [2.]
 [1.]
 [3.]
 [1.]
 [4.]
 [1.]
 [2.]
 [2.]
 [1.]
 [1.]
 [1.]
 [2.]
 [2.]
 [1.]
 [1.]
 [1.]
 [2.]
 [1.]
 [2.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [4.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [2.]
 [2.]
 [3.]
 [0.]
 [1.]
 [1.]
 [1.]
 [2.]
 [2.]
 [1.]
 [2.]
 [1.]
 [1.]
 [1.]
 [1.]
 [2.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [2.]
 [2.]
 [3.]
 [1.]
 [3.]
 [2.]
 [1.]
 [3.]
 [0.]
 [1.]
 [1.]
 [1.]
 [2.]
 [2.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.

And we add them to the DataFrame:

In [19]:
df['rangos_etarios_scikit'] = bines_asignados
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0


This can be done in a single line with `.fit_transform`:

In [20]:
df['rangos_etarios_scikit'] = est.fit_transform(edades.reshape(-1,1))
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,2.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,2.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,29.699118,1.0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,54.0,3.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,2.0,0.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,27.0,1.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,14.0,0.0


What strategies does `KBinsDiscretizer` support? What ways does it offer to *encode* the output?
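One possible way to explore these questions is to try each option on a small array. The sketch below uses made-up, skewed values (not the Titanic data) and the `strategy` and `encode` options listed in the scikit-learn documentation; recent versions may emit warnings about changing defaults, which can be ignored here.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy data (made up for illustration): many small values, few large ones.
values = np.array([1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 21, 40], dtype=float).reshape(-1, 1)

# The strategy decides where the bin edges go.
edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    est.fit(values)
    edges[strategy] = est.bin_edges_[0]
    print(strategy, edges[strategy])

# The encoding decides the shape and type of the output.
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(values)
dense = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform").fit_transform(values)
sparse = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform").fit_transform(values)

print(ordinal.shape, dense.shape, sparse.shape)
```

`'ordinal'` returns one column holding the bin index, while the two one-hot encodings return one column per bin; `'onehot'` stores the result as a SciPy sparse matrix and `'onehot-dense'` as a regular NumPy array.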

### `OneHotEncoder`

The workhorse for encoding categorical variables is `OneHotEncoder`, which creates one binary column per category.

In [22]:
from sklearn.preprocessing import OneHotEncoder
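# sparse=False returns a dense NumPy array; in scikit-learn >= 1.2 this parameter is named sparse_output
onehot_encoder = OneHotEncoder(sparse=False)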

In [26]:
generos = df.Sex.values.reshape(-1,1)
print(np.unique(generos))

['female' 'male']


In [27]:
onehot_encoder.fit(generos)

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

In [28]:
onehot_encoder.categories_

[array(['female', 'male'], dtype=object)]

In [29]:
generos_encoded = onehot_encoder.transform(generos)
print(generos_encoded)

[[0. 1.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [0. 1.]
 [0. 1.]]


In [30]:
onehot_encoder.inverse_transform(generos_encoded[500].reshape(1,-1))

array([['male']], dtype=object)

In [31]:
df['female_encoded'] = generos_encoded[:,0]
df['male_encoded'] = generos_encoded[:,1]
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit,female_encoded,male_encoded
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0,0.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0,1.0,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0,1.0,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,2.0,1.0,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,2.0,0.0,1.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,29.699118,1.0,0.0,1.0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,54.0,3.0,0.0,1.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,2.0,0.0,0.0,1.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,27.0,1.0,1.0,0.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,14.0,0.0,1.0,0.0
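The column names above were typed by hand, which relies on remembering the category order. A minimal sketch of an alternative, assuming a made-up toy frame in place of the Titanic data, derives the names from `categories_`. It calls `.toarray()` on the default sparse output so that it does not depend on the name of the `sparse` parameter in the installed version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made up for illustration) standing in for the Titanic data.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

encoder = OneHotEncoder()  # default output is a SciPy sparse matrix
encoded = encoder.fit_transform(toy[["Sex"]]).toarray()  # densify for pandas

# categories_[0] lists the categories in the same order as the output columns.
column_names = [f"{category}_encoded" for category in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=column_names, index=toy.index)

toy = pd.concat([toy, encoded_df], axis=1)
print(toy)
```

Because the names come from the fitted encoder, the same lines keep working when the column has more than two categories.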


### Exercise

Take the 'DS_Clase_10_Heart.csv' dataset and perform the same data transformation you did with Pandas, this time with Scikit-Learn. Transform the `sex` column with a `LabelEncoder` and the `thal` column with a `OneHotEncoder`.

In [33]:
data = pd.read_csv('DS_Clase_10_Heart.csv')
data.head()

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present
0,0z64un,1,normal,128,2,0,0,2,308,0.0,male,45,170,0,0
1,ryoo3j,2,normal,110,3,0,0,0,214,1.6,female,54,158,0,0
2,yt1s1x,1,normal,125,4,3,0,2,304,0.0,male,77,162,1,1
3,l2xjde,1,reversible_defect,152,4,0,0,0,223,0.0,male,40,181,0,1
4,oyt4ek,3,reversible_defect,178,1,0,0,2,270,4.2,male,59,145,0,0


In [34]:
from sklearn import preprocessing

In [50]:
sexo = data.sex.values
np.unique(sexo)

array(['female', 'male'], dtype=object)

In [52]:
le = preprocessing.LabelEncoder()

In [53]:
le.fit(sexo)

LabelEncoder()

In [54]:
print(le.classes_)

['female' 'male']


In [55]:
sexo_encoded = le.transform(sexo)

In [64]:
sexo_encoded = sexo_encoded.reshape(-1,1)

In [66]:
data['sex_encoded'] = sexo_encoded
data

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present,sex_encoded
0,0z64un,1,normal,128,2,0,0,2,308,0.0,male,45,170,0,0,1
1,ryoo3j,2,normal,110,3,0,0,0,214,1.6,female,54,158,0,0,0
2,yt1s1x,1,normal,125,4,3,0,2,304,0.0,male,77,162,1,1,1
3,l2xjde,1,reversible_defect,152,4,0,0,0,223,0.0,male,40,181,0,1,1
4,oyt4ek,3,reversible_defect,178,1,0,0,2,270,4.2,male,59,145,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
175,5qfar3,2,reversible_defect,125,4,2,1,0,254,0.2,male,67,163,0,1,1
176,2s2b1f,2,normal,180,4,0,0,1,327,3.4,female,55,117,1,1,0
177,nsd00i,2,reversible_defect,125,3,0,0,0,309,1.8,male,64,131,1,1,1
178,0xw93k,1,normal,124,3,2,1,0,255,0.0,male,48,175,0,0,1


In [69]:
le.inverse_transform(sexo_encoded[1])

array(['female'], dtype=object)
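The scikit-learn documentation describes `LabelEncoder` as intended for target values (`y`), and offers `OrdinalEncoder` for input features. The exercise asks for `LabelEncoder`, so the following is only a side-by-side sketch on a made-up array standing in for `data.sex`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy column (made up for illustration) standing in for data.sex.
sex = np.array(["male", "female", "male", "male", "female"])

# LabelEncoder works on 1-D arrays and returns integer codes.
label_codes = LabelEncoder().fit_transform(sex)

# OrdinalEncoder works on 2-D arrays of shape (n_samples, n_features)
# and returns float codes, one column per input feature.
ordinal_encoder = OrdinalEncoder()
ordinal_codes = ordinal_encoder.fit_transform(sex.reshape(-1, 1))

print(label_codes)
print(ordinal_codes.ravel())
print(ordinal_encoder.inverse_transform(ordinal_codes[:2]))
```

For a single two-category column both produce the same codes; `OrdinalEncoder` simply keeps the 2-D shape that other Scikit-Learn transformers expect, so no extra `reshape` is needed after encoding.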

In [79]:
thal = data.thal.values
np.unique(thal)

array(['fixed_defect', 'normal', 'reversible_defect'], dtype=object)

In [70]:
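# sparse=False returns a dense NumPy array; in scikit-learn >= 1.2 this parameter is named sparse_output
ohe = OneHotEncoder(sparse=False)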

In [88]:
thal_encoded = ohe.fit_transform(thal.reshape(-1,1))
thal_encoded

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1

In [89]:
data['fixed_encoded'] = thal_encoded[:,0]
data['normal_encoded'] = thal_encoded[:,1]
data['reverse_encoded'] = thal_encoded[:,2]
data.head(10)

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present,sex_encoded,fixed_encoded,normal_encoded,reverse_encoded
0,0z64un,1,normal,128,2,0,0,2,308,0.0,male,45,170,0,0,1,0.0,1.0,0.0
1,ryoo3j,2,normal,110,3,0,0,0,214,1.6,female,54,158,0,0,0,0.0,1.0,0.0
2,yt1s1x,1,normal,125,4,3,0,2,304,0.0,male,77,162,1,1,1,0.0,1.0,0.0
3,l2xjde,1,reversible_defect,152,4,0,0,0,223,0.0,male,40,181,0,1,1,0.0,0.0,1.0
4,oyt4ek,3,reversible_defect,178,1,0,0,2,270,4.2,male,59,145,0,0,1,0.0,0.0,1.0
5,ldukkw,1,normal,130,3,0,0,0,180,0.0,male,42,150,0,0,1,0.0,1.0,0.0
6,2gbyh9,2,reversible_defect,150,4,2,0,2,258,2.6,female,60,157,0,1,0,0.0,0.0,1.0
7,daa9kp,2,fixed_defect,150,4,1,0,2,276,0.6,male,57,112,1,1,1,1.0,0.0,0.0
8,3nwy2n,3,reversible_defect,170,4,0,0,2,326,3.4,male,59,140,1,1,1,0.0,0.0,1.0
9,1r508r,2,normal,120,3,0,0,0,219,1.6,female,50,158,0,0,0,0.0,1.0,0.0
