# Data Preprocessing: Logistic Regression

Using the same dataset you used yesterday, the goals of today's exercises are:

- Standardize the numerical variables in your dataset.
- Encode the categorical variables. Remember to take into account whether or not your variables are ordered.
- Check whether your data is balanced. If it is not, use one of the tools covered in the lesson to balance it.
- Save the dataframe with the changes you have applied so you can use it in the next lesson.

In [163]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [164]:
df = pd.read_csv("data/Churn_modelling_eda.csv", index_col = 0)
df_encoding = df.copy()

In [165]:
df.head()

| RowNumber | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


We standardize our data with scikit-learn's StandardScaler.

In [166]:
scaler = StandardScaler()

In [167]:
numericas = df_encoding.select_dtypes(include = np.number)
numericas.drop('Exited',axis = 1 , inplace = True)

In [168]:
# fit the scaler to the data
scaler.fit(numericas)

In [169]:
# transform the data
x_escaladas = scaler.transform(numericas)

In [170]:
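# Convert the array to a DataFrame. No index is passed, so the result gets a
# default 0-based index rather than the RowNumber index of df_encoding.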
numericas_estandar = pd.DataFrame(x_escaladas, columns = numericas.columns)
numericas_estandar.head(2)

|  | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.326221 | 0.293517 | -1.04176 | -1.225848 | -0.911583 | 0.646092 | 0.970243 | 0.021886 |
| 1 | -0.440036 | 0.198164 | -1.387538 | 0.11735 | -0.911583 | -1.547768 | 0.970243 | 0.216534 |
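
As a reference for what the scaler computes, here is a minimal sketch of standardization done by hand on a made-up frame (the `toy` data and its values are assumptions, not taken from the churn dataset). It uses the population standard deviation (`ddof=0`), which is the convention StandardScaler follows, and it keeps the original index.

```python
import numpy as np
import pandas as pd

# Toy frame with a 1-based index, mimicking RowNumber (values are made up).
toy = pd.DataFrame(
    {"CreditScore": [600.0, 650.0, 700.0], "Age": [30.0, 40.0, 50.0]},
    index=pd.Index([1, 2, 3], name="RowNumber"),
)

# z = (x - mean) / std, using the population standard deviation (ddof=0).
toy_std = (toy - toy.mean()) / toy.std(ddof=0)

# Each column now has mean 0 and standard deviation 1; the index is unchanged.
print(toy_std.round(3))
```

Because the arithmetic is done on the DataFrame itself, the result keeps the `RowNumber` index, which matters when the scaled columns are later joined back to other columns.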


Our data is now on the same scale! Next we encode the categorical variables so we can use them with scikit-learn.

In [171]:
df_encoding.isnull().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

- Order of the variables:

  - Gender: encoded as if ordered (it is binary, so a 0/1 map is enough)
  - Geography: unordered
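
The distinction above can be sketched on a small made-up frame. The `Education` column and its category order are hypothetical and only there to illustrate a genuinely ordered variable; `Gender` and `Geography` mirror the columns of the churn dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female"],
        "Geography": ["France", "Spain", "Germany"],
        # Hypothetical ordered variable, not part of the churn dataset.
        "Education": ["primary", "university", "secondary"],
    }
)

# Binary variable: a 0/1 map is enough.
toy["gender_map"] = toy["Gender"].map({"Female": 0, "Male": 1})

# Ordered variable: the integers must follow the category order.
education_order = {"primary": 0, "secondary": 1, "university": 2}
toy["education_map"] = toy["Education"].map(education_order)

# Unordered variable: one 0/1 column per category.
geo_dummies = pd.get_dummies(toy["Geography"], dtype=int)

encoded = pd.concat([toy[["gender_map", "education_map"]], geo_dummies], axis=1)
print(encoded)
```

Both pieces share the same default index here, so the column-wise `pd.concat` lines the rows up without introducing missing values.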

In [172]:
mapa_gender = {"Male": 1, "Female": 0}

In [173]:
df_encoding["gender_map"] = df_encoding["Gender"].map(mapa_gender)

In [174]:
df.columns

Index(['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',
       'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
       'Exited'],
      dtype='object')

In [175]:
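# Drop the raw columns that are replaced by encoded or standardized versions.
# NumOfProducts, HasCrCard and IsActiveMember stay here and are also present in
# numericas_estandar, so they appear twice after the concat below.
df_encoding.drop(['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'EstimatedSalary'], axis = 1, inplace = True)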

In [176]:
dummies = pd.get_dummies(df["Geography"], dtype = int)
dummies.head(2)

| RowNumber | France | Germany | Spain |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |


In [177]:
df_encoded = pd.concat([df_encoding, dummies, numericas_estandar], axis = 1)

In [185]:
np.isnan(df_encoded).any(axis=1)

1        False
2        False
3        False
4        False
5        False
         ...  
9997     False
9998     False
9999     False
10000     True
0         True
Length: 10001, dtype: bool

In [186]:
df_encoded.isnull().sum()

NumOfProducts      1
HasCrCard          1
IsActiveMember     1
Exited             1
gender_map         1
France             1
Germany            1
Spain              1
CreditScore        1
Age                1
Tenure             1
Balance            1
NumOfProducts      1
HasCrCard          1
IsActiveMember     1
EstimatedSalary    1
dtype: int64
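
One likely explanation for the extra row (10001 instead of 10000) and the single NaN per column is index misalignment: `pd.concat(axis=1)` aligns rows by index label, and a DataFrame built from a bare array gets a 0-based index while the other frames are indexed from 1. The sketch below reproduces the effect on made-up data (the `left`, `scaled_default` and `scaled_indexed` names are illustrative) and shows one possible fix, which is to reuse the original index.

```python
import pandas as pd

# Left frame indexed 1..3 (like RowNumber); values are made up.
left = pd.DataFrame({"Exited": [1, 0, 1]}, index=[1, 2, 3])

# A frame built without an explicit index gets a default 0-based index.
scaled_default = pd.DataFrame({"Age": [0.3, 0.2, -0.5]})
misaligned = pd.concat([left, scaled_default], axis=1)
print(misaligned.shape)           # one row more than either input
print(misaligned.isnull().sum())  # one NaN in each column

# Reusing the original index keeps the rows aligned.
scaled_indexed = pd.DataFrame({"Age": [0.3, 0.2, -0.5]}, index=left.index)
aligned = pd.concat([left, scaled_indexed], axis=1)
print(aligned.isnull().sum().sum())
```

In this notebook the equivalent change would be to pass `index = numericas.index` when building `numericas_estandar`, assuming the scaled rows are in the same order as `numericas`.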

- Data imbalance.
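
Before choosing a balancing method, it helps to quantify the imbalance. A minimal sketch on a made-up target column (the values and the `imbalance_ratio` name are assumptions for illustration):

```python
import pandas as pd

# Made-up target column: 8 customers stayed (0), 2 left (1).
exited = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], name="Exited")

counts = exited.value_counts()
shares = exited.value_counts(normalize=True)
print(pd.DataFrame({"count": counts, "share": shares}))

# Ratio of majority to minority class; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)
```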

In [None]:

#plt.figure(figsize=(8,5))

#fig1 = sns.countplot(data = df_encoded, x = "Exited",  color = "mediumaquamarine",  edgecolor='black')
#fig1.set(xticklabels=["No", "Yes"]) 
#plt.show()

In [None]:
#num_minoritarios = df_encoded['Exited'].value_counts()[1]
#num_minoritarios

In [None]:
#minoritarios = df_encoded[df_encoded['Exited'] == 1]
#minoritarios.head(2)

In [None]:
#mayoritarios = df_encoded[df_encoded['Exited'] == 0].sample(num_minoritarios, random_state = 42)
#mayoritarios.head(2)

In [None]:
#balanceado = pd.concat([minoritarios,mayoritarios],axis = 0)
#balanceado.head(2)

In [None]:
#balanceado['Exited'].value_counts()

In [None]:
#num_mayoritarios = df_encoded['Exited'].value_counts()[0]
#num_mayoritarios

In [None]:
#mayoritarios2 = df_encoded[df_encoded['Exited']== 0]
#mayoritarios2.head(2)

In [None]:
#minoritarios2 =df_encoded[df_encoded['Exited']==1].sample(num_mayoritarios, replace=True)
#minoritarios2.head(2)

In [None]:
#balanceado2 = pd.concat([mayoritarios2,minoritarios2], axis = 0)
#balanceado2.head(10)

In [None]:
#balanceado2['Exited'].value_counts()

In [None]:
#X = df_encoded.drop('Exited', axis = 1)
#y = df_encoded['Exited']

In [None]:
#down = RandomUnderSampler()

In [None]:
#X_down, y_down = down.fit_resample(X,y)

In [180]:
y = df_encoded['Exited']
X = df_encoded.drop('Exited', axis=1)




In [183]:
y.isnull().sum()

1

In [None]:
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

In [181]:
# instantiate the resampler
os_us = SMOTETomek()

# fit the resampler and resample the training data
X_train_res, y_train_res = os_us.fit_resample(X_train, y_train)

ValueError: Input y contains NaN.
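
The error comes from the missing value in `y` reported earlier, so the NaN rows need to be fixed before resampling. The sketch below illustrates the overall pattern on made-up data: check for NaN, split, then resample only the training part. To keep it dependency-free it swaps in plain pandas for both steps (`DataFrame.sample` instead of `train_test_split`, and random oversampling instead of SMOTETomek); all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Made-up data: one feature and an imbalanced binary target.
data = pd.DataFrame(
    {"Age": rng.normal(size=100), "Exited": [1] * 20 + [0] * 80},
    index=pd.RangeIndex(1, 101, name="RowNumber"),
)

# Resamplers reject missing values, so check before going further.
assert not data.isnull().values.any(), "fix NaN rows before resampling"

# 70/30 split; only the training part is resampled.
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)

# Random oversampling (swapped in for SMOTETomek): repeat minority rows
# until both classes have the same count in the training set.
counts = train["Exited"].value_counts()
minority = train[train["Exited"] == counts.idxmin()]
n_extra = int(counts.max() - counts.min())
extra = minority.sample(n_extra, replace=True, random_state=42)
train_res = pd.concat([train, extra])

print(train_res["Exited"].value_counts())
print(test["Exited"].value_counts())
```

Leaving the test set untouched keeps its class proportions close to the original data, so evaluation reflects the real imbalance.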