# Data preprocessing

In this document we walk through several examples of data preprocessing.


In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, FunctionTransformer, MinMaxScaler

We will use the Iris example dataset, which ships with scikit-learn's datasets.
We show the dataset description as well as a pandas view of the original data.


In [2]:
data = load_iris()
X = data.data  
Y = data.target  

In [3]:
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [4]:
df = pd.DataFrame(data= data.data, columns= data['feature_names'] )
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


## Handling duplicates
pandas provides dedicated functionality for finding duplicated rows. The check can run over all columns or only over a subset, which also makes it useful for verifying that identifiers are unique.


In [5]:
duplicados = df.duplicated(keep=False)
df[duplicados]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
101,5.8,2.7,5.1,1.9
142,5.8,2.7,5.1,1.9
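
As a complement to the check above, the following minimal sketch shows the subset variant and how duplicates could be dropped. The table `toy` and its columns `id` and `value` are hypothetical and are not part of the Iris data.

```python
import pandas as pd

# Hypothetical toy table: "id" is meant to be a unique identifier.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": [10.0, 20.0, 20.0, 30.0],
})

# Rows whose "id" appears more than once (keep=False marks every copy).
repeated_ids = toy[toy.duplicated(subset=["id"], keep=False)]

# Keep only the first occurrence of each fully duplicated row.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(repeated_ids)
print(deduplicated)
```

Whether duplicated rows should be dropped depends on the data: in Iris, two flowers with identical measurements may well be legitimate observations.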


## Train/test split
scikit-learn can randomly split the dataset into a training set and a test set. Because no `random_state` is set below, the split changes on every run.


In [6]:
X_train, X_test = train_test_split(X)

We can see what the tables look like once they have been split.

In [7]:
pd.DataFrame(
    data= X_train, 
    columns= data['feature_names'] 
    )

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,4.4,2.9,1.4,0.2
1,6.3,3.3,4.7,1.6
2,4.9,3.1,1.5,0.1
3,4.9,3.1,1.5,0.2
4,5.2,3.5,1.5,0.2
...,...,...,...,...
107,6.1,3.0,4.6,1.4
108,6.4,3.1,5.5,1.8
109,7.1,3.0,5.9,2.1
110,5.6,3.0,4.5,1.5


In [8]:
pd.DataFrame(
    data= X_test, 
    columns= data['feature_names'] 
    )

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,4.8,3.0,1.4,0.3
1,5.0,3.6,1.4,0.2
2,5.5,2.6,4.4,1.2
3,7.4,2.8,6.1,1.9
4,5.5,4.2,1.4,0.2
5,6.6,2.9,4.6,1.3
6,6.7,3.1,4.4,1.4
7,6.3,3.4,5.6,2.4
8,5.4,3.9,1.7,0.4
9,7.2,3.0,5.8,1.6
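
The split above only divides `X`. A possible variant, sketched here under the assumption that the labels are needed later, splits features and labels together, fixes the seed so the result is reproducible, and stratifies by class. The names `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_all, y_all = iris.data, iris.target

# Split features and labels in one call so the rows stay aligned.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all,
    y_all,
    test_size=0.25,    # same proportion as the default
    random_state=0,    # fixed seed: the same split on every run
    stratify=y_all,    # keep class proportions similar in both sets
)

print(X_tr.shape, X_te.shape)
```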


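## Standardization / scaling
There are several ways to scale the data.
The first is standardization with `StandardScaler`, which subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and variance 1. It does not change the shape of the distribution, so the result follows a standard normal only if the data were normally distributed to begin with.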


In [9]:
scaler_estandar = StandardScaler().fit(X_train)

In [13]:
df2 = pd.DataFrame(
    data= scaler_estandar.transform(X_train), 
    columns= data['feature_names'] 
    )
df2

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,-1.774034,-0.350605,-1.327474,-1.283991
1,0.589863,0.567939,0.558131,0.566466
2,-1.151956,0.108667,-1.270335,-1.416166
3,-1.151956,0.108667,-1.270335,-1.283991
4,-0.778709,1.027211,-1.270335,-1.283991
...,...,...,...,...
107,0.341032,-0.120969,0.500991,0.302115
108,0.714279,0.108667,1.015247,0.830817
109,1.585189,-0.120969,1.243806,1.227344
110,-0.281046,-0.120969,0.443852,0.434291


In [14]:
df2.std()

sepal length (cm)    1.004494
sepal width (cm)     1.004494
petal length (cm)    1.004494
petal width (cm)     1.004494
dtype: float64

In [15]:
df2.mean()

sepal length (cm)   -1.753558e-15
sepal width (cm)     1.224219e-15
petal length (cm)   -7.057846e-16
petal width (cm)    -7.236275e-17
dtype: float64
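
The standard deviations printed by `df2.std()` are 1.004494 rather than exactly 1. This is consistent with a difference in conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `std()` uses the sample estimate (`ddof=1`) by default. With 112 rows the ratio is sqrt(112/111) ≈ 1.004494. The sketch below reproduces the effect on synthetic data; the array `sample` is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic data with 112 rows, the size of the default training split of Iris.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=(112, 3))

scaled = pd.DataFrame(StandardScaler().fit_transform(sample))

std_sample = scaled.std()            # pandas default: ddof=1
std_population = scaled.std(ddof=0)  # same convention as StandardScaler

print(std_sample.round(6).tolist())
print(std_population.round(6).tolist())
```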

It is important to keep the fitted scaler: it must be fit once, on the training set, and then reused to transform both sets.

In [16]:
pd.DataFrame(
    data= scaler_estandar.transform(X_test), 
    columns= data['feature_names'] 
    )

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,-1.276371,-0.120969,-1.327474,-1.151815
1,-1.02754,1.256847,-1.327474,-1.283991
2,-0.405462,-1.039513,0.386712,0.037764
3,1.958436,-0.580241,1.358085,0.962993
4,-0.405462,2.634663,-1.327474,-1.283991
5,0.96311,-0.350605,0.500991,0.16994
6,1.087526,0.108667,0.386712,0.302115
7,0.589863,0.797575,1.072387,1.62387
8,-0.529877,1.945755,-1.156056,-1.01964
9,1.709604,-0.120969,1.186666,0.566466
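
To see why the fitted object has to be kept, it helps to look at what it stores. In this minimal sketch (the arrays `train` and `test` are made-up values), the scaler learns `mean_` and `scale_` from the training rows and applies those same numbers to any later data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

scaler = StandardScaler().fit(train)

# transform() applies the statistics learned from the training rows.
manual = (test - scaler.mean_) / scaler.scale_
automatic = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(automatic)
```

Fitting a second scaler on the test rows would use different means and scales, so the two sets would no longer be on a comparable footing.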


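Min-max scaling is also available. It maps each column to the range 0 to 1, where 0 is the minimum and 1 the maximum seen when the scaler was fit. Note that the scaler below is fit on the full `X` rather than on `X_train`, so the training columns do not necessarily reach exactly 0 or 1; to avoid leaking information from the test set it would normally be fit on `X_train` only, as was done with `scaler_estandar`.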

In [17]:
scaler_minmax = MinMaxScaler().fit(X)

In [18]:
df3 = pd.DataFrame(
    data= scaler_minmax.transform(X_train), 
    columns= data['feature_names'] )
df3

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.027778,0.375000,0.067797,0.041667
1,0.555556,0.541667,0.627119,0.625000
2,0.166667,0.458333,0.084746,0.000000
3,0.166667,0.458333,0.084746,0.041667
4,0.250000,0.625000,0.084746,0.041667
...,...,...,...,...
107,0.500000,0.416667,0.610169,0.541667
108,0.583333,0.458333,0.762712,0.708333
109,0.777778,0.416667,0.830508,0.833333
110,0.361111,0.416667,0.593220,0.583333


In [19]:
df3.min()

sepal length (cm)    0.027778
sepal width (cm)     0.000000
petal length (cm)    0.033898
petal width (cm)     0.000000
dtype: float64

In [20]:
df3.max()

sepal length (cm)    0.944444
sepal width (cm)     1.000000
petal length (cm)    1.000000
petal width (cm)     1.000000
dtype: float64

**Exercise:** apply this transformation to the test set and compute the minimum and the maximum.

In [21]:
df4 = pd.DataFrame(
    data= scaler_minmax.transform(X_test), 
    columns= data['feature_names'] )
df4

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.138889,0.416667,0.067797,0.083333
1,0.194444,0.666667,0.067797,0.041667
2,0.333333,0.25,0.576271,0.458333
3,0.861111,0.333333,0.864407,0.75
4,0.333333,0.916667,0.067797,0.041667
5,0.638889,0.375,0.610169,0.5
6,0.666667,0.458333,0.576271,0.541667
7,0.555556,0.583333,0.779661,0.958333
8,0.305556,0.791667,0.118644,0.125
9,0.805556,0.416667,0.813559,0.625


In [22]:
df4.min()

sepal length (cm)    0.000
sepal width (cm)     0.125
petal length (cm)    0.000
petal width (cm)     0.000
dtype: float64

In [23]:
df4.max()

sepal length (cm)    1.000000
sepal width (cm)     0.916667
petal length (cm)    0.915254
petal width (cm)     0.958333
dtype: float64
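
In the cells above `scaler_minmax` was fit on the full `X`, so every transformed value stays within 0 and 1. If the scaler is fit on the training rows only, test values outside the training range map outside that interval. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data: the test set exceeds the training range on both sides.
train = np.array([[2.0], [4.0], [6.0]])
test = np.array([[1.0], [5.0], [8.0]])

mm = MinMaxScaler().fit(train)  # learns min=2 and max=6 from train only

train_scaled = mm.transform(train)
test_scaled = mm.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

This is expected behaviour rather than an error. `MinMaxScaler` also accepts `clip=True` to force transformed values back into the target range, at the cost of flattening those out-of-range observations.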

## Dimensionality reduction
Here is an example of principal component analysis (PCA), even though the dataset has only a few variables. For illustration, the PCA below is fit on the unscaled `X_test`.



In [25]:
pd.DataFrame(
    data= scaler_minmax.transform(X_test), 
    columns= data['feature_names'] )

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.138889,0.416667,0.067797,0.083333
1,0.194444,0.666667,0.067797,0.041667
2,0.333333,0.25,0.576271,0.458333
3,0.861111,0.333333,0.864407,0.75
4,0.333333,0.916667,0.067797,0.041667
5,0.638889,0.375,0.610169,0.5
6,0.666667,0.458333,0.576271,0.541667
7,0.555556,0.583333,0.779661,0.958333
8,0.305556,0.791667,0.118644,0.125
9,0.805556,0.416667,0.813559,0.625


In [26]:
pca = PCA(n_components=2) 
X_pca = pca.fit_transform(X_test)

In [27]:
print("Variance explained by each principal component:")
print(pca.explained_variance_ratio_)

Variance explained by each principal component:
[0.92323815 0.05558832]


We find that about 97.9% of the variability is explained by the first two components.

In [28]:
np.cumsum(pca.explained_variance_ratio_)

array([0.92323815, 0.97882647])
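
The cell above fits PCA directly on `X_test`. A more typical arrangement, sketched here as one possible alternative rather than as the notebook's method, standardizes first, fits on the training rows, and reuses the fitted objects on the test rows. Passing a float to `n_components` asks PCA to keep as many components as needed to reach that share of variance. The names `reducer`, `X_tr` and `X_te` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all = load_iris().data
X_tr, X_te = train_test_split(X_all, random_state=0)

# Scale, then keep enough components to explain at least 95% of the variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))

X_tr_pca = reducer.fit_transform(X_tr)  # fit on training rows only
X_te_pca = reducer.transform(X_te)      # reuse the fitted scaler and PCA

pca_step = reducer.named_steps["pca"]
print(pca_step.n_components_)
print(pca_step.explained_variance_ratio_.cumsum())
```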

## Feature creation
With scikit-learn we can generate all pairwise products of the features as well as their squares. `PolynomialFeatures(2)` also adds a constant column of ones, shown as `1` in the output.


In [29]:
poly = PolynomialFeatures(2)

pd.DataFrame(
    data= poly.fit_transform(X_train), 
    columns= poly.get_feature_names_out(data['feature_names']))


Unnamed: 0,1,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),sepal length (cm)^2,sepal length (cm) sepal width (cm),sepal length (cm) petal length (cm),sepal length (cm) petal width (cm),sepal width (cm)^2,sepal width (cm) petal length (cm),sepal width (cm) petal width (cm),petal length (cm)^2,petal length (cm) petal width (cm),petal width (cm)^2
0,1.0,4.4,2.9,1.4,0.2,19.36,12.76,6.16,0.88,8.41,4.06,0.58,1.96,0.28,0.04
1,1.0,6.3,3.3,4.7,1.6,39.69,20.79,29.61,10.08,10.89,15.51,5.28,22.09,7.52,2.56
2,1.0,4.9,3.1,1.5,0.1,24.01,15.19,7.35,0.49,9.61,4.65,0.31,2.25,0.15,0.01
3,1.0,4.9,3.1,1.5,0.2,24.01,15.19,7.35,0.98,9.61,4.65,0.62,2.25,0.30,0.04
4,1.0,5.2,3.5,1.5,0.2,27.04,18.20,7.80,1.04,12.25,5.25,0.70,2.25,0.30,0.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,1.0,6.1,3.0,4.6,1.4,37.21,18.30,28.06,8.54,9.00,13.80,4.20,21.16,6.44,1.96
108,1.0,6.4,3.1,5.5,1.8,40.96,19.84,35.20,11.52,9.61,17.05,5.58,30.25,9.90,3.24
109,1.0,7.1,3.0,5.9,2.1,50.41,21.30,41.89,14.91,9.00,17.70,6.30,34.81,12.39,4.41
110,1.0,5.6,3.0,4.5,1.5,31.36,16.80,25.20,8.40,9.00,13.50,4.50,20.25,6.75,2.25


We can also define custom transformations, for example applying a logarithm to the variables. The example uses `np.log1p`, which computes log(1 + x) and therefore stays defined at x = 0.

In [30]:
transformer = FunctionTransformer(np.log1p, validate=True)

In [31]:
pd.DataFrame(
    data= transformer.transform(X_train), 
    columns= data['feature_names'] )


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,1.686399,1.360977,0.875469,0.182322
1,1.987874,1.458615,1.740466,0.955511
2,1.774952,1.410987,0.916291,0.095310
3,1.774952,1.410987,0.916291,0.182322
4,1.824549,1.504077,0.916291,0.182322
...,...,...,...,...
107,1.960095,1.386294,1.722767,0.875469
108,2.001480,1.410987,1.871802,1.029619
109,2.091864,1.386294,1.931521,1.131402
110,1.887070,1.386294,1.704748,0.916291
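
If the transformed values later need to be mapped back to the original scale, `FunctionTransformer` can also carry the inverse function. A minimal sketch, assuming `np.expm1` as the inverse of `np.log1p` and using made-up input values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(
    func=np.log1p,          # log(1 + x)
    inverse_func=np.expm1,  # exp(x) - 1, which undoes log1p
    validate=True,
    check_inverse=True,     # warn at fit time if the two functions do not match
)

values = np.array([[0.0, 1.0], [9.0, 99.0]])  # made-up values
logged = log_transformer.fit_transform(values)
restored = log_transformer.inverse_transform(logged)

print(logged)
print(restored)
```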


## Integrating multiple sources
For this we use the sample from the 2020 Population and Housing Census (Censo de Población y Vivienda 2020), with its person and dwelling tables.


In [32]:
personas = pd.read_csv("https://www.inegi.org.mx/contenidos/programas/ccpv/2020/microdatos/Censo2020_CPV_CB_Personas_ejemplo_csv.zip",dtype=str)
personas

Unnamed: 0,ENT,MUN,LOC,AGEB,MZA,SEG,ID_VIV,ID_PERSONA,TIPO_REG,CLASE_VIV,...,ESCOACUM,ENT_PAIS_RES_5A,MUN_RES_5A,CAUSA_MIG_V,SITUA_CONYUGAL,CONACT,HIJOS_NAC_VIVOS,HIJOS_FALLECIDOS,TAMLOC,TAMLOC14
0,21,108,0001,0017,001,N,211080000005,21108000000500001,0,02,...,15,021,108,,8,10,,,5,13
1,21,108,0001,0017,001,N,211080000005,21108000000500003,0,02,...,5,023,009,0301,5,80,11,0,5,13
2,21,108,0001,0017,001,N,211080000005,21108000000500002,0,02,...,3,023,009,0301,5,80,,,5,13
3,14,086,0001,0017,003,N,140860001428,14086000142800003,0,03,...,,014,086,,,,,,5,13
4,14,086,0001,0017,003,N,140860001428,14086000142800001,0,03,...,9,014,086,,1,10,1,0,5,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1257501,14,065,0017,0049,800,N,140650001652,14065000165200004,0,01,...,6,014,065,,8,50,0,,1,01
1257502,14,009,0020,0053,800,N,140090000378,14009000037800001,0,01,...,1,014,009,,2,10,,,1,01
1257503,21,210,0037,0034,002,N,212100000299,21210000029900003,0,01,...,1,021,210,,,,,,1,01
1257504,21,210,0037,0034,002,N,212100000299,21210000029900002,0,01,...,6,021,210,,7,60,1,0,1,01


In [33]:
viviendas = pd.read_csv("https://www.inegi.org.mx/contenidos/programas/ccpv/2020/microdatos/Censo2020_CPV_CB_Viviendas_ejemplo_csv.zip",dtype=str)
viviendas



Unnamed: 0,ENT,MUN,LOC,AGEB,MZA,SEG,ID_VIV,TIPO_REG,CLASE_VIV,PISOS,...,INTERNET,SERV_TV_PAGA,SERV_PEL_PAGA,CON_VJUEGOS,NUMPERS,TIPOHOG,JEFE_SEXO,JEFE_EDAD,TAMLOC,TAMLOC14
0,14,001,0024,0307,055,N,140010000001,0,01,2,...,7,1,4,6,6,2,3,39,1,03
1,14,001,0001,530A,008,N,140010000002,0,01,2,...,7,1,4,6,4,1,1,84,5,13
2,14,001,0025,1200,001,N,140010000003,0,01,2,...,7,2,4,6,2,1,3,59,3,09
3,14,001,0001,0664,015,N,140010000004,0,02,3,...,7,1,4,6,2,1,1,24,5,12
4,14,001,0081,0242,001,N,140010000005,0,01,3,...,7,2,4,6,2,1,1,62,1,01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
352363,21,217,0001,063A,011,N,212170000544,0,01,2,...,8,1,4,6,4,1,1,40,3,08
352364,21,217,0002,188A,006,N,212170000545,0,02,2,...,7,2,4,6,4,2,1,87,2,06
352365,21,217,0044,0020,033,N,212170000546,0,02,2,...,8,2,4,6,5,2,3,58,1,04
352366,21,217,0001,0141,001,N,212170000547,0,01,3,...,8,2,4,6,5,1,1,30,4,10


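We join them on `ID_VIV`. `validate="one_to_many"` makes pandas raise an error if a dwelling ID is repeated in `viviendas`, and the inner join keeps only dwellings that have at least one person. It is very important to check the row counts afterwards.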

In [34]:
unido = pd.merge(
    viviendas,
    personas,
    on = "ID_VIV",
    how="inner",
    validate="one_to_many"
)
unido

Unnamed: 0,ENT_x,MUN_x,LOC_x,AGEB_x,MZA_x,SEG_x,ID_VIV,TIPO_REG_x,CLASE_VIV_x,PISOS,...,ESCOACUM,ENT_PAIS_RES_5A,MUN_RES_5A,CAUSA_MIG_V,SITUA_CONYUGAL,CONACT,HIJOS_NAC_VIVOS,HIJOS_FALLECIDOS,TAMLOC_y,TAMLOC14_y
0,14,001,0024,0307,055,N,140010000001,0,01,2,...,9,014,001,,6,16,,,1,03
1,14,001,0024,0307,055,N,140010000001,0,01,2,...,9,014,001,,8,60,0,,1,03
2,14,001,0024,0307,055,N,140010000001,0,01,2,...,,014,001,,,,,,1,03
3,14,001,0024,0307,055,N,140010000001,0,01,2,...,9,014,001,,7,60,2,0,1,03
4,14,001,0024,0307,055,N,140010000001,0,01,2,...,9,014,001,,6,60,2,0,1,03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1257501,21,217,0005,0130,006,N,212170000548,0,01,2,...,6,021,217,,7,10,,,1,02
1257502,21,217,0005,0130,006,N,212170000548,0,01,2,...,4,021,217,,7,10,11,0,1,02
1257503,21,217,0005,0130,006,N,212170000548,0,01,2,...,2,021,217,,,,,,1,02
1257504,21,217,0005,0130,006,N,212170000548,0,01,2,...,0,021,217,,,,,,1,02
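
One way to check the counts, sketched here on small made-up tables rather than on the census files, is an outer join with `indicator=True`. The extra `_merge` column records whether each row was found in the left table, the right table, or both. The tables `dwellings` and `people` and their values are hypothetical; only the column names `ID_VIV`, `PISOS` and `ID_PERSONA` are taken from the data above.

```python
import pandas as pd

# Hypothetical miniature versions of the dwelling and person tables.
dwellings = pd.DataFrame({"ID_VIV": ["A", "B", "C"], "PISOS": ["2", "3", "2"]})
people = pd.DataFrame({
    "ID_VIV": ["A", "A", "B", "D"],
    "ID_PERSONA": ["A1", "A2", "B1", "D1"],
})

# The outer join keeps unmatched rows from both sides; _merge says where each came from.
check = pd.merge(
    dwellings,
    people,
    on="ID_VIV",
    how="outer",
    validate="one_to_many",  # raises if ID_VIV repeats in dwellings
    indicator=True,
)
print(check["_merge"].value_counts())

dwellings_without_people = check.loc[check["_merge"] == "left_only", "ID_VIV"].tolist()
people_without_dwelling = check.loc[check["_merge"] == "right_only", "ID_VIV"].tolist()
print(dwellings_without_people, people_without_dwelling)
```

If both lists are empty, an inner join loses no rows and its length equals the number of rows in the person table.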
