# Data Preprocessing for Logistic Regression

Using the same dataset you used yesterday, the goals of today's exercises are:

- Standardize the numeric variables in your dataset.
- Encode the categorical variables. Remember to take into account whether or not your variables have an order.
- Check whether your data is balanced. If it is not, use some of the tools covered in the lesson to balance it.
- Save the dataframe with the changes you applied so you can use it in the next lesson.
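
As a rough illustration of the balance check, here is a minimal sketch on toy data. The column values are made up, and the oversampling is done by hand with pandas as a stand-in for imblearn's RandomOverSampler, which is the tool this notebook imports.

```python
import pandas as pd

# Toy data standing in for the real dataset; "Exited" plays the role of the target
toy = pd.DataFrame({
    "Age": [42, 41, 39, 43, 50, 35, 28, 61],
    "Exited": [1, 0, 0, 0, 0, 0, 1, 0],
})

# Share of each class: a split far from 50/50 suggests the data is imbalanced
shares = toy["Exited"].value_counts(normalize=True)
print(shares)

# Manual random oversampling: resample the minority class with replacement
# until it has as many rows as the majority class
counts = toy["Exited"].value_counts()
minority = toy[toy["Exited"] == counts.idxmin()]
extra = minority.sample(n=counts.max() - counts.min(), replace=True, random_state=42)
balanced = pd.concat([toy, extra], ignore_index=True)

print(balanced["Exited"].value_counts())
```

Random oversampling only duplicates existing rows, so it is usually applied to the training split alone.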

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("data/Churn_modelling_eda.csv", index_col = 0)
df_encoding = df.copy()

In [3]:
df.head()

| RowNumber | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


We standardize our data with the StandardScaler class from Sklearn.
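
To make the transformation concrete, the sketch below compares StandardScaler with the manual z-score formula on a single toy column. It reuses the first five CreditScore values shown above, so the resulting numbers will not match the ones computed on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column shaped as (n_samples, 1), which is the layout StandardScaler expects
x = np.array([[619.0], [608.0], [502.0], [699.0], [850.0]])

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler learns the mean and std in fit and applies them in transform
z_sklearn = StandardScaler().fit_transform(x)

# Both approaches should agree; the result has mean 0 and std 1
print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```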

In [4]:
scaler = StandardScaler()

In [5]:
numericas = df_encoding.select_dtypes(include = np.number)
numericas.head()

| RowNumber | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [6]:
# Leave out the target (Exited) and the 0/1 flags, which we do not standardize
numericas.drop(["Exited", "HasCrCard","IsActiveMember"], axis = 1, inplace = True)

In [7]:
# fit the scaler to the data
scaler.fit(numericas)

In [8]:
# transform the data
x_escaladas = scaler.transform(numericas)

In [9]:
# Convert the array to a DataFrame
numericas_estandar = pd.DataFrame(x_escaladas, columns = numericas.columns)
numericas_estandar.head(2)

|   | CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary |
|---|---|---|---|---|---|---|
| 0 | -0.326221 | 0.293517 | -1.04176 | -1.225848 | -0.911583 | 0.021886 |
| 1 | -0.440036 | 0.198164 | -1.387538 | 0.11735 | -0.911583 | 0.216534 |
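
One detail to watch when turning a scaled array back into a DataFrame: the new frame gets a default index starting at 0, while the source data is indexed by RowNumber starting at 1. Since pd.concat aligns rows by index label, the two can end up shifted relative to each other. The sketch below shows the effect on toy data; all names and values in it are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame indexed from 1, like the RowNumber index of the dataset
original = pd.DataFrame({"Age": [42, 41, 39]}, index=pd.Index([1, 2, 3], name="RowNumber"))

# Stand-in for a scaler's output: a bare array with no index information
scaled_values = np.array([[0.5], [-0.5], [-1.5]])

# Default RangeIndex (0, 1, 2): concat aligns on labels, so the rows end up shifted
misaligned = pd.DataFrame(scaled_values, columns=["Age_scaled"])
shifted = pd.concat([original, misaligned], axis=1)

# Reusing the original index keeps every scaled value on its own row
aligned = pd.DataFrame(scaled_values, columns=["Age_scaled"], index=original.index)
combined = pd.concat([original, aligned], axis=1)

print(shifted)   # four rows, with NaN where the labels do not overlap
print(combined)  # three rows, one per original row
```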


Our data is now on the same scale! Next we encode the categorical variables so we can use them with Sklearn.

In [10]:
columnas_categoria = ["Gender", "Geography"]

for i in columnas_categoria:
    df_encoding[i] = df_encoding[i].astype("category")

- Order of the variables:
    - Gender: binary, so we treat it as ordered and map it to 0/1
    - Geography: has no order, so we create one dummy column per country
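
The difference between the two encodings can be sketched on toy data. The Size and City columns below are hypothetical and are not part of the dataset; they only illustrate a variable with a natural order next to one without.

```python
import pandas as pd

# Hypothetical toy data: "Size" has a natural order, "City" does not
toy = pd.DataFrame({
    "Size": ["S", "L", "M", "S"],
    "City": ["Paris", "Madrid", "Berlin", "Paris"],
})

# Ordered variable: map each level to an integer that respects the order
size_order = {"S": 0, "M": 1, "L": 2}
toy["Size_encoded"] = toy["Size"].map(size_order)

# A category missing from the map would silently become NaN, so check for it
assert toy["Size_encoded"].notna().all()

# Unordered variable: one indicator column per category
city_dummies = pd.get_dummies(toy["City"], dtype=int)

encoded = pd.concat([toy[["Size_encoded"]], city_dummies], axis=1)
print(encoded)
```

Mapping an unordered variable to integers would suggest a ranking that does not exist, which is why the dummy columns are used for that case.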

In [11]:
mapa_gender = {"Male": 1, "Female": 0}

In [12]:
df_encoding["gender_map"] = df_encoding["Gender"].map(mapa_gender)

In [13]:
dummies = pd.get_dummies(df_encoding["Geography"], dtype = int)
dummies.head(2)

| RowNumber | France | Germany | Spain |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |


In [14]:
# Realign the scaled columns with the RowNumber index and swap them in for the raw ones
numericas_estandar.index = df_encoding.index
df_encoded = pd.concat([df_encoding.drop(columns = numericas_estandar.columns), dummies, numericas_estandar], axis = 1)

In [15]:
df_encoded.drop(["Gender", "Geography"], axis = 1, inplace = True)

In [16]:
df_encoded.head()

| RowNumber | HasCrCard | IsActiveMember | Exited | gender_map | France | Germany | Spain | CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | -0.326221 | 0.293517 | -1.04176 | -1.225848 | -0.911583 | 0.021886 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | -0.440036 | 0.198164 | -1.387538 | 0.11735 | -0.911583 | 0.216534 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | -1.536794 | 0.293517 | 1.032908 | 1.333053 | 2.527057 | 0.240687 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.501521 | 0.007457 | -1.387538 | -1.225848 | 0.807737 | -0.108918 |
| 5 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2.063884 | 0.388871 | -1.04176 | 0.785728 | -0.911583 | -0.365276 |
