# Data Science and MLOps Bootcamp

<img src="https://i.ibb.co/5RM26Cw/LOGO-COLOR2.png" width="500px">

---

# Exercise 🏡

## 📍 Objective
<br>Prepare the data from the annual household survey (Encuesta anual de hogares) carried out across the whole City of Buenos Aires, Argentina, in 2019.
<br>In practice, you will condition the dataset so that it is ready for exploring correlations or training a Machine Learning model.

The dataset comes from the Buenos Aires Government open data portal: [Encuesta anual de hogares 2019](https://data.buenosaires.gob.ar/dataset/encuesta-anual-hogares)

---

## 📍 Task 1

**_1) Load the data_**
- Load the dataset:
```python
  data = pd.read_csv("encuesta-anual-hogares-2019.csv", sep=',') 
```
**_2) Initial inspection_**
- Drop the columns `id` and `hijos_nacidos_vivos`

**_3) Discretization_**
- Discretize the following columns by equal frequency and equal range.
    <br>`ingresos_familiares` with q=8
    <br>`ingreso_per_capita_familiar` with q=10

- In some cases bin edges repeat when discretizing; one way to remove the duplicates is the `duplicates='drop'` argument.
    <br><br>For the columns `ingreso_total_lab` and `ingreso_total_no_lab`, consider:
    ```python
    data['ingreso_total_lab'] = pd.qcut(data['ingreso_total_lab'], q=10, duplicates='drop')
    data['ingreso_total_no_lab'] = pd.qcut(data['ingreso_total_no_lab'], q=4, duplicates='drop')
    ```
<br>
- For the `edad` column, discretize by equal width with `bins=5`.
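
As a rough illustration of the two strategies in this step, the sketch below applies `pd.cut` (equal width) and `pd.qcut` (equal frequency) to a small made-up income series. The values and variable names are assumptions for illustration, not survey data. It also shows why `duplicates='drop'` can return fewer bins than the `q` requested.

```python
import pandas as pd

# Hypothetical income values: many zeros, a few large amounts.
incomes = pd.Series([0, 0, 0, 0, 0, 0, 1000, 2000, 5000, 100000])

# Equal width: 5 intervals of the same length, regardless of how many rows land in each.
by_width = pd.cut(incomes, bins=5)

# Equal frequency: the 0%, 25% and 50% quantile edges are all 0 here,
# so a plain qcut refuses to build the bins.
try:
    pd.qcut(incomes, q=4)
    repeated_edges_error = False
except ValueError:
    repeated_edges_error = True

# duplicates='drop' merges the repeated edges, leaving fewer bins than q.
by_frequency = pd.qcut(incomes, q=4, duplicates='drop')

print(by_width.value_counts(sort=False))
print(by_frequency.value_counts(sort=False))
```

This is the same effect seen when a heavily zero-inflated column is asked for `q=4` and ends up with only two categories.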

**_4) Data preparation_**
- Change the data type of the following columns to `str`: `comuna` and `nhogar`.
  <br>_Why do we do this?_ In these columns the number is only a label (it identifies a comuna, for example); there is no numeric relationship between the values.

- _What values do you expect in the `años_escolaridad` column?_ Integers, but that is not always the case: each organization or company fills in a survey in its own way.
    <br>Evaluate `data['años_escolaridad'].unique()` to see the unique values in the column. Note that they are all stored as `object/string`.

- Replace `Ningun año de escolaridad aprobado` with '0'.
<br>Indeed '0' and not 0, because this column holds `object/string` data.
    ```python
    data['años_escolaridad'] = data['años_escolaridad'].replace('Ningun año de escolaridad aprobado', '0')
    ```
<br>
- Now convert the previous column, `años_escolaridad`, to integers.
    <br>Intuitively, we could do:
    ```python
    data['años_escolaridad'] = data['años_escolaridad'].astype('int32')
    ```
    BUT that raises an error, right?
    
    When you _cast_ (convert from one data type to another), you will often run into conflicts if the column contains _NaN_. In this case, to convert the values of `años_escolaridad` from _string_ to _int_, we need an intermediate step: converting to _float_ first.
    ```python
    data['años_escolaridad'] = data['años_escolaridad'].astype(float).astype("Int32")
    ```
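
To make the cast above concrete, here is a minimal sketch on a made-up column (the data and variable names are assumptions). It contrasts NumPy's `int32`, which cannot hold a missing value, with pandas' nullable `Int32` dtype reached through a `float` step.

```python
import numpy as np
import pandas as pd

# Hypothetical column of numeric strings with one missing value.
schooling = pd.Series(['12', '0', np.nan, '7'], dtype=object)

# NumPy's int32 has no way to represent NaN, so the direct cast fails.
try:
    schooling.astype('int32')
    direct_cast_failed = False
except (ValueError, TypeError):
    direct_cast_failed = True

# Strings -> float (NaN is allowed) -> nullable Int32 (NaN becomes <NA>).
as_integers = schooling.astype(float).astype('Int32')
print(as_integers)
```

Note the capital `I`: `'Int32'` is the pandas extension dtype, while `'int32'` is the plain NumPy one.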
<br>
- Discretize the `años_escolaridad` column by equal frequency and equal range with `q=5`

- You do not always have to fill the `NaN` values in every column, because the number of `NaN` values may not be significant for the analysis. So you can drop them across the whole dataframe or only for certain columns.
    ```python
    # Drop the rows that contain NaN in any of these columns
    data = data.dropna(subset=['situacion_conyugal', 'sector_educativo', 'lugar_nacimiento', 'afiliacion_salud'])
    ```
<br>
- After dropping rows, you can reset the index so that it stays sequential:
    ```python
    data = data.reset_index(drop=True)
    ```
<br>
- Fill the missing values in the `años_escolaridad` column. First add the category `desconocido`, then fill the missing values with `desconocido`.

- Fill the missing values in the `nivel_max_educativo` column with `value='desconocido'`
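
The order of the two steps for `años_escolaridad` matters because a categorical column only accepts values that are already among its categories. The sketch below shows this on a made-up series (the data and variable names are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value, discretized into 2 bins.
values = pd.Series([1, 5, np.nan, 9, 12, 3])
binned = pd.qcut(values, q=2)

# Filling with a label that is not yet a category is rejected.
try:
    binned.fillna('desconocido')
    fill_rejected = False
except (TypeError, ValueError):
    fill_rejected = True

# Register the new category first, then fill the gaps with it.
binned = binned.cat.add_categories('desconocido').fillna('desconocido')
print(binned)
```

A plain `object` column such as `nivel_max_educativo` has no fixed category list, so `fillna` alone is enough there.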

## 📍 Expected result 1

If you run `funpymodeling.status(data)`, you should get the following:

In [4]:
respuesta_1 = pd.read_csv("data/tarea_respuesta1.csv", sep=',') 

In [5]:
respuesta_1

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,nhogar,0,0.0,0,0.0,7,object
1,miembro,0,0.0,0,0.0,19,int64
2,comuna,0,0.0,0,0.0,15,object
3,dominio,0,0.0,0,0.0,2,object
4,edad,0,0.0,0,0.0,5,category
5,sexo,0,0.0,0,0.0,2,object
6,parentesco_jefe,0,0.0,0,0.0,9,object
7,situacion_conyugal,0,0.0,0,0.0,7,object
8,num_miembro_padre,0,0.0,0,0.0,9,object
9,num_miembro_madre,0,0.0,0,0.0,11,object


## 📍 Task 2

**_5) One hot encoding_**

- Run `data_ohe = pd.get_dummies(data)`
- Save `data_ohe` to a pickle file, as we saw in class, named `categories_ohe.pickle`.
- Load the dataset `new_data = pd.read_csv("new_data.csv", sep=',')`
- Reindex `new_data` with the columns you saved in the pickle file, and fill the `NaN` values with `0`.
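
One possible shape for these steps is sketched below on toy dataframes. It rests on several assumptions: only the column index is pickled, an in-memory buffer stands in for `categories_ohe.pickle`, the new data is one-hot encoded before reindexing, and all names and values are made up.

```python
import io
import pickle

import pandas as pd

# Hypothetical training data; 'comuna' is a string, so get_dummies encodes it.
train = pd.DataFrame({'comuna': ['1', '2', '3'], 'miembro': [1, 2, 1]})
train_ohe = pd.get_dummies(train)

# Persist the encoded column names (a BytesIO buffer replaces the file here).
buffer = io.BytesIO()
pickle.dump(train_ohe.columns, buffer)
buffer.seek(0)
saved_columns = pickle.load(buffer)

# Hypothetical new data with an unseen category ('9') and missing ones ('1', '3').
new = pd.DataFrame({'comuna': ['2', '9'], 'miembro': [4, 1]})

# Align to the saved layout: unseen columns are dropped, missing ones become NaN -> 0.
new_ohe = pd.get_dummies(new).reindex(columns=saved_columns).fillna(0)
print(new_ohe)
```

Aligning to the saved columns keeps the feature layout identical between the data the encoding was built on and any data that arrives later.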

## 📍 Expected result 2

In [6]:
respuesta_2 = pd.read_csv("data/tarea_respuesta2.csv", sep=',') 

In [7]:
respuesta_2

Unnamed: 0,miembro,ingresos_totales,nhogar_1,nhogar_2,nhogar_3,nhogar_4,nhogar_5,nhogar_6,nhogar_7,comuna_1,...,cantidad_hijos_nac_vivos_15,cantidad_hijos_nac_vivos_2,cantidad_hijos_nac_vivos_3,cantidad_hijos_nac_vivos_4,cantidad_hijos_nac_vivos_5,cantidad_hijos_nac_vivos_6,cantidad_hijos_nac_vivos_7,cantidad_hijos_nac_vivos_8,cantidad_hijos_nac_vivos_9,cantidad_hijos_nac_vivos_No corresponde
0,1,4000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,1,22000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,1,25000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,2,30000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,1,20000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


## 📍 Task 3

Upload your notebook and datasets to a personal public repository and share it on Discord.
<br>Consider using git lfs for the datasets with a csv extension.

---

In [8]:
import numpy as np
import pandas as pd

import funpymodeling

In [9]:
# In this case we want to display all the columns
pd.set_option('display.max_columns', None)

# 1) Load the data

In [10]:
df_data = pd.read_csv("data/encuesta-anual-hogares-2019.csv", sep=',')

In [11]:
print(df_data.shape)
print(df_data.columns)

(14319, 31)
Index(['id', 'nhogar', 'miembro', 'comuna', 'dominio', 'edad', 'sexo',
       'parentesco_jefe', 'situacion_conyugal', 'num_miembro_padre',
       'num_miembro_madre', 'estado_ocupacional', 'cat_ocupacional',
       'calidad_ingresos_lab', 'ingreso_total_lab', 'calidad_ingresos_no_lab',
       'ingreso_total_no_lab', 'calidad_ingresos_totales', 'ingresos_totales',
       'calidad_ingresos_familiares', 'ingresos_familiares',
       'ingreso_per_capita_familiar', 'estado_educativo', 'sector_educativo',
       'nivel_actual', 'nivel_max_educativo', 'años_escolaridad',
       'lugar_nacimiento', 'afiliacion_salud', 'hijos_nacidos_vivos',
       'cantidad_hijos_nac_vivos'],
      dtype='object')


In [12]:
df_data

Unnamed: 0,id,nhogar,miembro,comuna,dominio,edad,sexo,parentesco_jefe,situacion_conyugal,num_miembro_padre,num_miembro_madre,estado_ocupacional,cat_ocupacional,calidad_ingresos_lab,ingreso_total_lab,calidad_ingresos_no_lab,ingreso_total_no_lab,calidad_ingresos_totales,ingresos_totales,calidad_ingresos_familiares,ingresos_familiares,ingreso_per_capita_familiar,estado_educativo,sector_educativo,nivel_actual,nivel_max_educativo,años_escolaridad,lugar_nacimiento,afiliacion_salud,hijos_nacidos_vivos,cantidad_hijos_nac_vivos
0,1,1,1,5,Resto de la Ciudad,18,Mujer,Jefe,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,6000,Tuvo ingresos y declara monto,6000,Tuvo ingresos y declara monto,18000,9000,Asiste,Estatal/publico,Universitario,Otras escuelas especiales,12,PBA excepto GBA,Solo obra social,No,No corresponde
1,1,1,2,5,Resto de la Ciudad,18,Mujer,Otro no familiar,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,12000,Tuvo ingresos y declara monto,12000,Tuvo ingresos y declara monto,18000,9000,Asiste,Estatal/publico,Universitario,Otras escuelas especiales,12,Otra provincia,Solo plan de medicina prepaga por contratación...,No,No corresponde
2,2,1,1,2,Resto de la Ciudad,18,Varon,Jefe,Soltero/a,Padre no vive en el hogar,2,Inactivo,No corresponde,No tuvo ingresos,0,No tuvo ingresos,0,No tuvo ingresos,0,Tuvo ingresos pero no declara monto,100000,33333,Asiste,Privado religioso,Universitario,Otras escuelas especiales,12,CABA,Solo plan de medicina prepaga por contratación...,,No corresponde
3,2,1,2,2,Resto de la Ciudad,50,Mujer,Padre/Madre/Suegro/a,Viudo/a,No corresponde,No corresponde,Ocupado,Asalariado,Tuvo ingresos y declara monto,70000,Tuvo ingresos pero no declara monto,30000,Tuvo ingresos pero no declara monto,100000,Tuvo ingresos pero no declara monto,100000,33333,No asiste pero asistió,No corresponde,No corresponde,Secundario/medio comun,17,CABA,Solo prepaga o mutual via OS,Si,2
4,2,1,3,2,Resto de la Ciudad,17,Varon,Otro familiar,Soltero/a,Padre no vive en el hogar,2,Inactivo,No corresponde,No tuvo ingresos,0,No tuvo ingresos,0,No tuvo ingresos,0,Tuvo ingresos pero no declara monto,100000,33333,Asiste,Privado religioso,Secundario/medio comun,EGB (1° a 9° año),10,CABA,Solo plan de medicina prepaga por contratación...,,No corresponde
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14314,5794,1,1,10,Resto de la Ciudad,99,Varon,Jefe,Casado/a,No corresponde,No corresponde,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos pero no declara monto,24000,Tuvo ingresos pero no declara monto,24000,Tuvo ingresos pero no declara monto,57000,14250,No asiste pero asistió,No corresponde,No corresponde,Sala de 5,5,Pais no limitrofe,Solo obra social,,No corresponde
14315,5794,1,2,10,Resto de la Ciudad,78,Mujer,Otro familiar,Soltero/a,No corresponde,No corresponde,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,11000,Tuvo ingresos y declara monto,11000,Tuvo ingresos pero no declara monto,57000,14250,No asiste pero asistió,No corresponde,No corresponde,EGB (1° a 9° año),9,Partido GBA,Solo obra social,No,No corresponde
14316,5794,1,3,10,Resto de la Ciudad,60,Mujer,Hijo/a - Hijastro/a,Separado/a de unión o matrimonio,No corresponde,No corresponde,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,11000,Tuvo ingresos y declara monto,11000,Tuvo ingresos pero no declara monto,57000,14250,No asiste pero asistió,No corresponde,No corresponde,Primario especial,12,CABA,Solo obra social,Si,2
14317,5794,1,4,10,Resto de la Ciudad,92,Mujer,Conyugue o pareja,Casado/a,No corresponde,No corresponde,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,11000,Tuvo ingresos y declara monto,11000,Tuvo ingresos pero no declara monto,57000,14250,No asiste pero asistió,No corresponde,No corresponde,Primario comun,7,CABA,Solo obra social,Si,1


# 2) Initial inspection

In [13]:
df_data.head(5)

Unnamed: 0,id,nhogar,miembro,comuna,dominio,edad,sexo,parentesco_jefe,situacion_conyugal,num_miembro_padre,num_miembro_madre,estado_ocupacional,cat_ocupacional,calidad_ingresos_lab,ingreso_total_lab,calidad_ingresos_no_lab,ingreso_total_no_lab,calidad_ingresos_totales,ingresos_totales,calidad_ingresos_familiares,ingresos_familiares,ingreso_per_capita_familiar,estado_educativo,sector_educativo,nivel_actual,nivel_max_educativo,años_escolaridad,lugar_nacimiento,afiliacion_salud,hijos_nacidos_vivos,cantidad_hijos_nac_vivos
0,1,1,1,5,Resto de la Ciudad,18,Mujer,Jefe,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,6000,Tuvo ingresos y declara monto,6000,Tuvo ingresos y declara monto,18000,9000,Asiste,Estatal/publico,Universitario,Otras escuelas especiales,12,PBA excepto GBA,Solo obra social,No,No corresponde
1,1,1,2,5,Resto de la Ciudad,18,Mujer,Otro no familiar,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,12000,Tuvo ingresos y declara monto,12000,Tuvo ingresos y declara monto,18000,9000,Asiste,Estatal/publico,Universitario,Otras escuelas especiales,12,Otra provincia,Solo plan de medicina prepaga por contratación...,No,No corresponde
2,2,1,1,2,Resto de la Ciudad,18,Varon,Jefe,Soltero/a,Padre no vive en el hogar,2,Inactivo,No corresponde,No tuvo ingresos,0,No tuvo ingresos,0,No tuvo ingresos,0,Tuvo ingresos pero no declara monto,100000,33333,Asiste,Privado religioso,Universitario,Otras escuelas especiales,12,CABA,Solo plan de medicina prepaga por contratación...,,No corresponde
3,2,1,2,2,Resto de la Ciudad,50,Mujer,Padre/Madre/Suegro/a,Viudo/a,No corresponde,No corresponde,Ocupado,Asalariado,Tuvo ingresos y declara monto,70000,Tuvo ingresos pero no declara monto,30000,Tuvo ingresos pero no declara monto,100000,Tuvo ingresos pero no declara monto,100000,33333,No asiste pero asistió,No corresponde,No corresponde,Secundario/medio comun,17,CABA,Solo prepaga o mutual via OS,Si,2
4,2,1,3,2,Resto de la Ciudad,17,Varon,Otro familiar,Soltero/a,Padre no vive en el hogar,2,Inactivo,No corresponde,No tuvo ingresos,0,No tuvo ingresos,0,No tuvo ingresos,0,Tuvo ingresos pero no declara monto,100000,33333,Asiste,Privado religioso,Secundario/medio comun,EGB (1° a 9° año),10,CABA,Solo plan de medicina prepaga por contratación...,,No corresponde


In [14]:
# Drop the columns 'id' and 'hijos_nacidos_vivos'
labels = ['id', 'hijos_nacidos_vivos']
df_data = df_data.drop(labels=labels, axis=1)

In [15]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,nhogar,0,0.0,0,0.0,7,int64
1,miembro,0,0.0,0,0.0,19,int64
2,comuna,0,0.0,0,0.0,15,int64
3,dominio,0,0.0,0,0.0,2,object
4,edad,0,0.0,128,0.008939,101,int64
5,sexo,0,0.0,0,0.0,2,object
6,parentesco_jefe,0,0.0,0,0.0,9,object
7,situacion_conyugal,1,7e-05,0,0.0,7,object
8,num_miembro_padre,0,0.0,0,0.0,9,object
9,num_miembro_madre,0,0.0,0,0.0,11,object


# 3) Discretization

### 3.1) By equal frequency and equal range

In [16]:
ingresos_familiares_cat = pd.qcut(df_data['ingresos_familiares'], q=8)

In [17]:
ingresos_familiares_cat

0          (-0.001, 20800.0]
1          (-0.001, 20800.0]
2        (90000.0, 124000.0]
3        (90000.0, 124000.0]
4        (90000.0, 124000.0]
                ...         
14314     (54000.0, 70000.0]
14315     (54000.0, 70000.0]
14316     (54000.0, 70000.0]
14317     (54000.0, 70000.0]
14318     (70000.0, 90000.0]
Name: ingresos_familiares, Length: 14319, dtype: category
Categories (8, interval[float64, right]): [(-0.001, 20800.0] < (20800.0, 30000.0] < (30000.0, 42000.0] < (42000.0, 54000.0] < (54000.0, 70000.0] < (70000.0, 90000.0] < (90000.0, 124000.0] < (124000.0, 1000000.0]]

In [18]:
funpymodeling.status(ingresos_familiares_cat)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,ingresos_familiares,0,0.0,0,0.0,8,category


In [19]:
funpymodeling.freq_tbl(ingresos_familiares_cat).sort_values('ingresos_familiares')

Unnamed: 0,ingresos_familiares,frequency,percentage,cumulative_perc
3,"(-0.001, 20800.0]",1796,0.125428,0.513444
2,"(20800.0, 30000.0]",1810,0.126405,0.388016
1,"(30000.0, 42000.0]",1841,0.12857,0.26161
6,"(42000.0, 54000.0]",1721,0.12019,0.882883
0,"(54000.0, 70000.0]",1905,0.13304,0.13304
5,"(70000.0, 90000.0]",1781,0.12438,0.762693
7,"(90000.0, 124000.0]",1677,0.117117,1.0
4,"(124000.0, 1000000.0]",1788,0.124869,0.638313


In [20]:
df_data['ingresos_familiares'] = ingresos_familiares_cat

In [21]:
ingreso_per_capita_familiar_cat = pd.qcut(df_data['ingreso_per_capita_familiar'], q=10)

In [22]:
ingreso_per_capita_familiar_cat

0           (8700.0, 12000.0]
1           (8700.0, 12000.0]
2          (30000.0, 38300.0]
3          (30000.0, 38300.0]
4          (30000.0, 38300.0]
                 ...         
14314      (12000.0, 15016.0]
14315      (12000.0, 15016.0]
14316      (12000.0, 15016.0]
14317      (12000.0, 15016.0]
14318    (52340.0, 1000000.0]
Name: ingreso_per_capita_familiar, Length: 14319, dtype: category
Categories (10, interval[float64, right]): [(-0.001, 5400.0] < (5400.0, 8700.0] < (8700.0, 12000.0] < (12000.0, 15016.0] ... (24000.0, 30000.0] < (30000.0, 38300.0] < (38300.0, 52340.0] < (52340.0, 1000000.0]]

In [23]:
funpymodeling.status(ingreso_per_capita_familiar_cat)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,ingreso_per_capita_familiar,0,0.0,0,0.0,10,category


In [24]:
funpymodeling.freq_tbl(ingreso_per_capita_familiar_cat).sort_values('ingreso_per_capita_familiar')

Unnamed: 0,ingreso_per_capita_familiar,frequency,percentage,cumulative_perc
3,"(-0.001, 5400.0]",1434,0.100147,0.421957
6,"(5400.0, 8700.0]",1431,0.099937,0.721908
1,"(8700.0, 12000.0]",1554,0.108527,0.219848
8,"(12000.0, 15016.0]",1309,0.091417,0.913192
4,"(15016.0, 19900.0]",1432,0.100007,0.521964
2,"(19900.0, 24000.0]",1460,0.101962,0.32181
0,"(24000.0, 30000.0]",1594,0.111321,0.111321
9,"(30000.0, 38300.0]",1243,0.086808,1.0
7,"(38300.0, 52340.0]",1430,0.099867,0.821775
5,"(52340.0, 1000000.0]",1432,0.100007,0.621971


In [25]:
df_data['ingreso_per_capita_familiar'] = ingreso_per_capita_familiar_cat

In [26]:
df_data['ingreso_total_lab']

0            0
1            0
2            0
3        70000
4            0
         ...  
14314        0
14315        0
14316        0
14317        0
14318        0
Name: ingreso_total_lab, Length: 14319, dtype: int64

In [27]:
ingreso_total_lab_cat = pd.qcut(df_data['ingreso_total_lab'], q=10, duplicates='drop')

In [28]:
ingreso_total_lab_cat

0            (-0.001, 2500.0]
1            (-0.001, 2500.0]
2            (-0.001, 2500.0]
3        (56000.0, 1000000.0]
4            (-0.001, 2500.0]
                 ...         
14314        (-0.001, 2500.0]
14315        (-0.001, 2500.0]
14316        (-0.001, 2500.0]
14317        (-0.001, 2500.0]
14318        (-0.001, 2500.0]
Name: ingreso_total_lab, Length: 14319, dtype: category
Categories (6, interval[float64, right]): [(-0.001, 2500.0] < (2500.0, 15000.0] < (15000.0, 25000.0] < (25000.0, 37000.0] < (37000.0, 56000.0] < (56000.0, 1000000.0]]

In [29]:
funpymodeling.status(ingreso_total_lab_cat)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,ingreso_total_lab,0,0.0,0,0.0,6,category


In [30]:
funpymodeling.freq_tbl(ingreso_total_lab_cat).sort_values('ingreso_total_lab')

Unnamed: 0,ingreso_total_lab,frequency,percentage,cumulative_perc
0,"(-0.001, 2500.0]",7168,0.500594,0.500594
1,"(2500.0, 15000.0]",1499,0.104686,0.60528
4,"(15000.0, 25000.0]",1397,0.097563,0.902437
2,"(25000.0, 37000.0]",1431,0.099937,0.705217
5,"(37000.0, 56000.0]",1397,0.097563,1.0
3,"(56000.0, 1000000.0]",1427,0.099658,0.804875


In [31]:
df_data['ingreso_total_lab'] = ingreso_total_lab_cat

In [32]:
df_data['ingreso_total_no_lab']

0         6000
1        12000
2            0
3        30000
4            0
         ...  
14314    24000
14315    11000
14316    11000
14317    11000
14318    82000
Name: ingreso_total_no_lab, Length: 14319, dtype: int64

In [33]:
ingreso_total_no_lab_cat = pd.qcut(df_data['ingreso_total_no_lab'], q=4, duplicates='drop')

In [34]:
ingreso_total_no_lab_cat

0        (4000.0, 500000.0]
1        (4000.0, 500000.0]
2          (-0.001, 4000.0]
3        (4000.0, 500000.0]
4          (-0.001, 4000.0]
                ...        
14314    (4000.0, 500000.0]
14315    (4000.0, 500000.0]
14316    (4000.0, 500000.0]
14317    (4000.0, 500000.0]
14318    (4000.0, 500000.0]
Name: ingreso_total_no_lab, Length: 14319, dtype: category
Categories (2, interval[float64, right]): [(-0.001, 4000.0] < (4000.0, 500000.0]]

In [35]:
funpymodeling.status(ingreso_total_no_lab_cat)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,ingreso_total_no_lab,0,0.0,0,0.0,2,category


In [36]:
funpymodeling.freq_tbl(ingreso_total_no_lab_cat).sort_values('ingreso_total_no_lab')

Unnamed: 0,ingreso_total_no_lab,frequency,percentage,cumulative_perc
0,"(-0.001, 4000.0]",10741,0.750122,0.750122
1,"(4000.0, 500000.0]",3578,0.249878,1.0


In [37]:
df_data['ingreso_total_no_lab'] = ingreso_total_no_lab_cat

### 3.2) By equal width

In [38]:
df_data['edad']

0         18
1         18
2         18
3         50
4         17
        ... 
14314     99
14315     78
14316     60
14317     92
14318    100
Name: edad, Length: 14319, dtype: int64

In [39]:
edad_cat = pd.cut(df_data['edad'], bins=5)

In [40]:
edad_cat

0         (-0.1, 20.0]
1         (-0.1, 20.0]
2         (-0.1, 20.0]
3         (40.0, 60.0]
4         (-0.1, 20.0]
             ...      
14314    (80.0, 100.0]
14315     (60.0, 80.0]
14316     (40.0, 60.0]
14317    (80.0, 100.0]
14318    (80.0, 100.0]
Name: edad, Length: 14319, dtype: category
Categories (5, interval[float64, right]): [(-0.1, 20.0] < (20.0, 40.0] < (40.0, 60.0] < (60.0, 80.0] < (80.0, 100.0]]

In [41]:
funpymodeling.status(edad_cat)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,edad,0,0.0,0,0.0,5,category


In [42]:
funpymodeling.freq_tbl(edad_cat).sort_values('edad')

Unnamed: 0,edad,frequency,percentage,cumulative_perc
1,"(-0.1, 20.0]",3669,0.256233,0.549829
0,"(20.0, 40.0]",4204,0.293596,0.293596
2,"(40.0, 60.0]",3452,0.241078,0.790907
3,"(60.0, 80.0]",2448,0.170962,0.961869
4,"(80.0, 100.0]",546,0.038131,1.0


In [43]:
df_data['edad'] = edad_cat

# 4) Data preparation

In [44]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,nhogar,0,0.0,0,0.0,7,int64
1,miembro,0,0.0,0,0.0,19,int64
2,comuna,0,0.0,0,0.0,15,int64
3,dominio,0,0.0,0,0.0,2,object
4,edad,0,0.0,0,0.0,5,category
5,sexo,0,0.0,0,0.0,2,object
6,parentesco_jefe,0,0.0,0,0.0,9,object
7,situacion_conyugal,1,7e-05,0,0.0,7,object
8,num_miembro_padre,0,0.0,0,0.0,9,object
9,num_miembro_madre,0,0.0,0,0.0,11,object


In [45]:
# Change the data type of the columns 'comuna' and 'nhogar'
df_data['comuna'] = df_data['comuna'].astype(str)
df_data['nhogar'] = df_data['nhogar'].astype(str)

In [46]:
# Look at the unique values of the column 'años_escolaridad'
df_data['años_escolaridad'].unique()

array(['12', '17', '10', '8', 'Ningun año de escolaridad aprobado', '11',
       '9', '13', '7', '16', '14', '15', '5', '6', '2', '19', '4', '1',
       '3', '18', nan], dtype=object)

In [47]:
# Replace the value 'Ningun año de escolaridad aprobado' in the column 'años_escolaridad' with the string '0'
df_data['años_escolaridad'] = df_data['años_escolaridad'].replace('Ningun año de escolaridad aprobado', '0')

In [48]:
df_data['años_escolaridad'].unique()

array(['12', '17', '10', '8', '0', '11', '9', '13', '7', '16', '14', '15',
       '5', '6', '2', '19', '4', '1', '3', '18', nan], dtype=object)

In [49]:
# Change the data type of the column 'años_escolaridad' (NumPy int32 cannot hold NaN, so this fails)
df_data['años_escolaridad'] = df_data['años_escolaridad'].astype('int32')

ValueError: cannot convert float NaN to integer

In [None]:
df_data['años_escolaridad'].unique()

In [None]:
# Change the data type of the column 'años_escolaridad' to float
df_data['años_escolaridad'] = df_data['años_escolaridad'].astype(float)

In [50]:
df_data['años_escolaridad'].unique()

array(['12', '17', '10', '8', '0', '11', '9', '13', '7', '16', '14', '15',
       '5', '6', '2', '19', '4', '1', '3', '18', nan], dtype=object)

In [51]:
funpymodeling.status(df_data['años_escolaridad'])

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,años_escolaridad,62,0.00433,0,0.0,20,object


In [52]:
# Change the data type of the column 'años_escolaridad' to the nullable Int32
df_data['años_escolaridad'] = df_data['años_escolaridad'].astype('Int32')

In [53]:
df_data['años_escolaridad'].unique()

<IntegerArray>
[12, 17, 10, 8, 0, 11, 9, 13, 7, 16, 14, 15, 5, 6, 2, 19, 4, 1, 3, 18, <NA>]
Length: 21, dtype: Int32

In [54]:
funpymodeling.status(df_data['años_escolaridad'])

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,años_escolaridad,62,0.00433,1226,0.085621,20,Int32


In [55]:
# Discretize the column 'años_escolaridad' with q=5
df_data['años_escolaridad'], años_escolaridad_bins = pd.qcut(df_data['años_escolaridad'], q=5, retbins=True)

In [56]:
df_data['años_escolaridad'].unique()

[(11.0, 12.0], (16.0, 19.0], (7.0, 11.0], (-0.001, 7.0], (12.0, 16.0], NaN]
Categories (5, interval[float64, right]): [(-0.001, 7.0] < (7.0, 11.0] < (11.0, 12.0] < (12.0, 16.0] < (16.0, 19.0]]

In [57]:
funpymodeling.status(df_data['años_escolaridad'])

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,años_escolaridad,62,0.00433,0,0.0,5,category


In [58]:
# Add the category 'desconocido' to the categories of the column 'años_escolaridad'
df_data['años_escolaridad'] = df_data['años_escolaridad'].cat.add_categories("desconocido")

In [59]:
df_data['años_escolaridad']

0         (11.0, 12.0]
1         (11.0, 12.0]
2         (11.0, 12.0]
3         (16.0, 19.0]
4          (7.0, 11.0]
             ...      
14314    (-0.001, 7.0]
14315      (7.0, 11.0]
14316     (11.0, 12.0]
14317    (-0.001, 7.0]
14318     (12.0, 16.0]
Name: años_escolaridad, Length: 14319, dtype: category
Categories (6, object): [(-0.001, 7.0] < (7.0, 11.0] < (11.0, 12.0] < (12.0, 16.0] < (16.0, 19.0] < 'desconocido']

In [60]:
# Fill the NaN values with 'desconocido'
df_data['años_escolaridad'] = df_data['años_escolaridad'].fillna(value="desconocido")

In [61]:
df_data['años_escolaridad']

0         (11.0, 12.0]
1         (11.0, 12.0]
2         (11.0, 12.0]
3         (16.0, 19.0]
4          (7.0, 11.0]
             ...      
14314    (-0.001, 7.0]
14315      (7.0, 11.0]
14316     (11.0, 12.0]
14317    (-0.001, 7.0]
14318     (12.0, 16.0]
Name: años_escolaridad, Length: 14319, dtype: category
Categories (6, object): [(-0.001, 7.0] < (7.0, 11.0] < (11.0, 12.0] < (12.0, 16.0] < (16.0, 19.0] < 'desconocido']

In [62]:
# Drop the rows with NaN in the columns: 'situacion_conyugal', 'sector_educativo', 'lugar_nacimiento', 'afiliacion_salud'
columns = ['situacion_conyugal', 'sector_educativo', 'lugar_nacimiento', 'afiliacion_salud']
df_data = df_data.dropna(subset=columns)

In [63]:
# Reset the index so that it stays sequential
df_data = df_data.reset_index(drop=True)

In [64]:
funpymodeling.freq_tbl(df_data['nivel_max_educativo'])

Unnamed: 0,nivel_max_educativo,frequency,percentage,cumulative_perc
0,Secundario/medio comun,3670,0.256464,0.276835
1,Otras escuelas especiales,2570,0.179595,0.470695
2,EGB (1° a 9° año),2302,0.160867,0.644339
3,Primario especial,2192,0.15318,0.809685
4,Sala de 5,1539,0.107547,0.925775
5,Primario comun,942,0.065828,0.996832
6,No corresponde,42,0.002935,1.0


In [65]:
df_data['nivel_max_educativo'].value_counts(dropna=False)

Secundario/medio comun       3670
Otras escuelas especiales    2570
EGB (1° a 9° año)            2302
Primario especial            2192
Sala de 5                    1539
NaN                          1053
Primario comun                942
No corresponde                 42
Name: nivel_max_educativo, dtype: int64

In [66]:
# Fill the NaN values of the column 'nivel_max_educativo' with 'desconocido'
df_data = df_data.fillna({'nivel_max_educativo': 'desconocido'})

In [67]:
funpymodeling.freq_tbl(df_data['nivel_max_educativo'])

Unnamed: 0,nivel_max_educativo,frequency,percentage,cumulative_perc
0,Secundario/medio comun,3670,0.256464,0.256464
1,Otras escuelas especiales,2570,0.179595,0.436059
2,EGB (1° a 9° año),2302,0.160867,0.596925
3,Primario especial,2192,0.15318,0.750105
4,Sala de 5,1539,0.107547,0.857652
5,desconocido,1053,0.073585,0.931237
6,Primario comun,942,0.065828,0.997065
7,No corresponde,42,0.002935,1.0


In [68]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,nhogar,0,0.0,0,0.0,7,object
1,miembro,0,0.0,0,0.0,19,int64
2,comuna,0,0.0,0,0.0,15,object
3,dominio,0,0.0,0,0.0,2,object
4,edad,0,0.0,0,0.0,5,category
5,sexo,0,0.0,0,0.0,2,object
6,parentesco_jefe,0,0.0,0,0.0,9,object
7,situacion_conyugal,0,0.0,0,0.0,7,object
8,num_miembro_padre,0,0.0,0,0.0,9,object
9,num_miembro_madre,0,0.0,0,0.0,11,object


In [69]:
respuesta_1

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,nhogar,0,0.0,0,0.0,7,object
1,miembro,0,0.0,0,0.0,19,int64
2,comuna,0,0.0,0,0.0,15,object
3,dominio,0,0.0,0,0.0,2,object
4,edad,0,0.0,0,0.0,5,category
5,sexo,0,0.0,0,0.0,2,object
6,parentesco_jefe,0,0.0,0,0.0,9,object
7,situacion_conyugal,0,0.0,0,0.0,7,object
8,num_miembro_padre,0,0.0,0,0.0,9,object
9,num_miembro_madre,0,0.0,0,0.0,11,object


In [70]:
respuesta_1.compare(funpymodeling.status(df_data))

# 5) One hot encoding

In [71]:
df_data_ohe = pd.get_dummies(df_data) 

In [72]:
df_data_ohe

Unnamed: 0,miembro,ingresos_totales,nhogar_1,nhogar_2,nhogar_3,nhogar_4,nhogar_5,nhogar_6,nhogar_7,comuna_1,comuna_10,comuna_11,comuna_12,comuna_13,comuna_14,comuna_15,comuna_2,comuna_3,comuna_4,comuna_5,comuna_6,comuna_7,comuna_8,comuna_9,dominio_Resto de la Ciudad,dominio_Villas de emergencia,"edad_(-0.1, 20.0]","edad_(20.0, 40.0]","edad_(40.0, 60.0]","edad_(60.0, 80.0]","edad_(80.0, 100.0]",sexo_Mujer,sexo_Varon,parentesco_jefe_Conyugue o pareja,parentesco_jefe_Hijo/a - Hijastro/a,parentesco_jefe_Jefe,parentesco_jefe_Nieto/a,parentesco_jefe_Otro familiar,parentesco_jefe_Otro no familiar,parentesco_jefe_Padre/Madre/Suegro/a,parentesco_jefe_Servicio domestico y sus familiares,parentesco_jefe_Yerno/nuera,situacion_conyugal_Casado/a,situacion_conyugal_Divorciado/a,situacion_conyugal_No corresponde,situacion_conyugal_Separado/a de unión o matrimonio,situacion_conyugal_Soltero/a,situacion_conyugal_Unido/a,situacion_conyugal_Viudo/a,num_miembro_padre_1,num_miembro_padre_2,num_miembro_padre_3,num_miembro_padre_4,num_miembro_padre_5,num_miembro_padre_6,num_miembro_padre_7,num_miembro_padre_No corresponde,num_miembro_padre_Padre no vive en el hogar,num_miembro_madre_1,num_miembro_madre_15,num_miembro_madre_2,num_miembro_madre_3,num_miembro_madre_4,num_miembro_madre_5,num_miembro_madre_6,num_miembro_madre_7,num_miembro_madre_9,num_miembro_madre_Madre no vive en el hogar,num_miembro_madre_No corresponde,estado_ocupacional_Desocupado,estado_ocupacional_Inactivo,estado_ocupacional_Ocupado,cat_ocupacional_Asalariado,cat_ocupacional_No corresponde,cat_ocupacional_Patron/empleador,cat_ocupacional_Trabajador familiar,cat_ocupacional_Trabajador por cuenta propia,calidad_ingresos_lab_No corresponde,calidad_ingresos_lab_No tuvo ingresos,calidad_ingresos_lab_Tuvo ingresos pero no declara monto,calidad_ingresos_lab_Tuvo ingresos y declara monto,"ingreso_total_lab_(-0.001, 2500.0]","ingreso_total_lab_(2500.0, 15000.0]","ingreso_total_lab_(15000.0, 25000.0]","ingreso_total_lab_(25000.0, 
37000.0]","ingreso_total_lab_(37000.0, 56000.0]","ingreso_total_lab_(56000.0, 1000000.0]",calidad_ingresos_no_lab_No corresponde,calidad_ingresos_no_lab_No tuvo ingresos,calidad_ingresos_no_lab_Tuvo ingresos pero no declara monto,calidad_ingresos_no_lab_Tuvo ingresos y declara monto,"ingreso_total_no_lab_(-0.001, 4000.0]","ingreso_total_no_lab_(4000.0, 500000.0]",calidad_ingresos_totales_No corresponde,calidad_ingresos_totales_No tuvo ingresos,calidad_ingresos_totales_Tuvo ingresos pero no declara monto,calidad_ingresos_totales_Tuvo ingresos y declara monto,calidad_ingresos_familiares_No tuvo ingresos,calidad_ingresos_familiares_Tuvo ingresos pero no declara monto,calidad_ingresos_familiares_Tuvo ingresos y declara monto,"ingresos_familiares_(-0.001, 20800.0]","ingresos_familiares_(20800.0, 30000.0]","ingresos_familiares_(30000.0, 42000.0]","ingresos_familiares_(42000.0, 54000.0]","ingresos_familiares_(54000.0, 70000.0]","ingresos_familiares_(70000.0, 90000.0]","ingresos_familiares_(90000.0, 124000.0]","ingresos_familiares_(124000.0, 1000000.0]","ingreso_per_capita_familiar_(-0.001, 5400.0]","ingreso_per_capita_familiar_(5400.0, 8700.0]","ingreso_per_capita_familiar_(8700.0, 12000.0]","ingreso_per_capita_familiar_(12000.0, 15016.0]","ingreso_per_capita_familiar_(15016.0, 19900.0]","ingreso_per_capita_familiar_(19900.0, 24000.0]","ingreso_per_capita_familiar_(24000.0, 30000.0]","ingreso_per_capita_familiar_(30000.0, 38300.0]","ingreso_per_capita_familiar_(38300.0, 52340.0]","ingreso_per_capita_familiar_(52340.0, 1000000.0]",estado_educativo_Asiste,estado_educativo_No asiste pero asistió,estado_educativo_Nunca asistio,sector_educativo_Estatal/publico,sector_educativo_No corresponde,sector_educativo_Privado no religioso,sector_educativo_Privado religioso,nivel_actual_Jardin maternal,nivel_actual_No corresponde,nivel_actual_Otras escuelas especiales,nivel_actual_Postgrado,nivel_actual_Primario adultos,nivel_actual_Primario comun,nivel_actual_Primario 
especial,nivel_actual_Sala de 3,nivel_actual_Sala de 4,nivel_actual_Sala de 5,nivel_actual_Secundario/medio adultos,nivel_actual_Secundario/medio comun,nivel_actual_Terciario/superior no universitario,nivel_actual_Universitario,nivel_max_educativo_EGB (1° a 9° año),nivel_max_educativo_No corresponde,nivel_max_educativo_Otras escuelas especiales,nivel_max_educativo_Primario comun,nivel_max_educativo_Primario especial,nivel_max_educativo_Sala de 5,nivel_max_educativo_Secundario/medio comun,nivel_max_educativo_desconocido,"años_escolaridad_(-0.001, 7.0]","años_escolaridad_(7.0, 11.0]","años_escolaridad_(11.0, 12.0]","años_escolaridad_(12.0, 16.0]","años_escolaridad_(16.0, 19.0]",años_escolaridad_desconocido,lugar_nacimiento_CABA,lugar_nacimiento_Otra provincia,lugar_nacimiento_PBA excepto GBA,lugar_nacimiento_PBA sin especificar,lugar_nacimiento_Pais limitrofe,lugar_nacimiento_Pais no limitrofe,lugar_nacimiento_Partido GBA,afiliacion_salud_Otros,afiliacion_salud_Solo obra social,afiliacion_salud_Solo plan de medicina prepaga por contratación voluntaria,afiliacion_salud_Solo prepaga o mutual via OS,afiliacion_salud_Solo sistema publico,cantidad_hijos_nac_vivos_1,cantidad_hijos_nac_vivos_10,cantidad_hijos_nac_vivos_11,cantidad_hijos_nac_vivos_12,cantidad_hijos_nac_vivos_15,cantidad_hijos_nac_vivos_2,cantidad_hijos_nac_vivos_3,cantidad_hijos_nac_vivos_4,cantidad_hijos_nac_vivos_5,cantidad_hijos_nac_vivos_6,cantidad_hijos_nac_vivos_7,cantidad_hijos_nac_vivos_8,cantidad_hijos_nac_vivos_9,cantidad_hijos_nac_vivos_No corresponde
0,1,6000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,2,12000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,2,100000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14305,1,24000,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
14306,2,11000,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
14307,3,11000,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
14308,4,11000,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [73]:
df_data_ohe.columns

Index(['miembro', 'ingresos_totales', 'nhogar_1', 'nhogar_2', 'nhogar_3',
       'nhogar_4', 'nhogar_5', 'nhogar_6', 'nhogar_7', 'comuna_1',
       ...
       'cantidad_hijos_nac_vivos_15', 'cantidad_hijos_nac_vivos_2',
       'cantidad_hijos_nac_vivos_3', 'cantidad_hijos_nac_vivos_4',
       'cantidad_hijos_nac_vivos_5', 'cantidad_hijos_nac_vivos_6',
       'cantidad_hijos_nac_vivos_7', 'cantidad_hijos_nac_vivos_8',
       'cantidad_hijos_nac_vivos_9',
       'cantidad_hijos_nac_vivos_No corresponde'],
      dtype='object', length=179)

In [74]:
import pickle

# Save the one-hot column layout of the training data so new data can be aligned to it later
with open('categories_ohe.pkl', 'wb') as f:
    pickle.dump(df_data_ohe.columns, f, protocol=pickle.HIGHEST_PROTOCOL)
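
Pickling the `Index` object works, but a pickle of a pandas object may be sensitive to the pandas version installed where it is loaded. As a minimal alternative sketch, assuming every column name is a plain string, the names can be stored as a JSON list instead. The toy DataFrame and the file name `categories_ohe.json` are hypothetical and only stand in for `df_data_ohe` and its pickle file.

```python
import json
import os
import tempfile

import pandas as pd

# Toy stand-in for df_data_ohe (hypothetical columns)
df_toy = pd.DataFrame({"miembro": [1, 2], "sexo_Mujer": [1, 0], "sexo_Varon": [0, 1]})

# Hypothetical file name, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), "categories_ohe.json")

# Store the column names as a plain list of strings
with open(path, "w", encoding="utf-8") as f:
    json.dump(list(df_toy.columns), f, ensure_ascii=False)

# Read them back; the result is a regular Python list
with open(path, encoding="utf-8") as f:
    columns_loaded = json.load(f)

print(columns_loaded)
```

A list loaded this way can be passed to `reindex(columns=...)` in the same way as the unpickled `Index`.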

In [79]:
df_new_data = pd.read_csv("data/new_data.csv", sep=',')

In [80]:
df_new_data

Unnamed: 0,id,nhogar,miembro,comuna,dominio,edad,sexo,parentesco_jefe,situacion_conyugal,num_miembro_padre,num_miembro_madre,estado_ocupacional,cat_ocupacional,calidad_ingresos_lab,ingreso_total_lab,calidad_ingresos_no_lab,ingreso_total_no_lab,calidad_ingresos_totales,ingresos_totales,calidad_ingresos_familiares,ingresos_familiares,ingreso_per_capita_familiar,estado_educativo,sector_educativo,nivel_actual,nivel_max_educativo,años_escolaridad,lugar_nacimiento,afiliacion_salud,hijos_nacidos_vivos,cantidad_hijos_nac_vivos
0,21,1,1,3,Resto de la Ciudad,20,Varon,Jefe,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Desocupado,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,4000,Tuvo ingresos y declara monto,4000,Tuvo ingresos y declara monto,4000,4000,Asiste,Privado no religioso,Universitario,Otras escuelas especiales,12,Otra provincia,Solo plan de medicina prepaga por contratación...,,No corresponde
1,22,1,1,12,Resto de la Ciudad,20,Varon,Jefe,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,22000,Tuvo ingresos y declara monto,22000,Tuvo ingresos y declara monto,22000,22000,Asiste,Estatal/publico,Universitario,Otras escuelas especiales,12,PBA excepto GBA,Solo obra social,,No corresponde
2,23,1,1,14,Resto de la Ciudad,20,Varon,Jefe,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Ocupado,Asalariado,Tuvo ingresos y declara monto,10000,Tuvo ingresos y declara monto,15000,Tuvo ingresos y declara monto,25000,Tuvo ingresos pero no declara monto,55000,27500,Asiste,Privado no religioso,Universitario,Otras escuelas especiales,13,CABA,Otros,,No corresponde
3,23,1,2,14,Resto de la Ciudad,21,Mujer,Otro familiar,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Ocupado,Trabajador por cuenta propia,Tuvo ingresos pero no declara monto,30000,No tuvo ingresos,0,Tuvo ingresos pero no declara monto,30000,Tuvo ingresos pero no declara monto,55000,27500,Asiste,Privado religioso,Universitario,Otras escuelas especiales,16,CABA,Solo plan de medicina prepaga por contratación...,No,No corresponde
4,24,1,1,7,Resto de la Ciudad,20,Varon,Jefe,Soltero/a,Padre no vive en el hogar,Madre no vive en el hogar,Inactivo,No corresponde,No tuvo ingresos,0,Tuvo ingresos y declara monto,20000,Tuvo ingresos y declara monto,20000,Tuvo ingresos y declara monto,20000,20000,Asiste,Privado no religioso,Universitario,Otras escuelas especiales,14,Otra provincia,Solo plan de medicina prepaga por contratación...,,No corresponde


In [81]:
# Load the column layout saved from the training data
with open('categories_ohe.pkl', 'rb') as f:
    ohe_tr = pickle.load(f)

In [82]:
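# Align the new data to the training columns; dummy columns absent from the new data come back as NaN
pd.get_dummies(df_new_data).reindex(columns=ohe_tr)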

Unnamed: 0,miembro,ingresos_totales,nhogar_1,nhogar_2,nhogar_3,nhogar_4,nhogar_5,nhogar_6,nhogar_7,comuna_1,comuna_10,comuna_11,comuna_12,comuna_13,comuna_14,comuna_15,comuna_2,comuna_3,comuna_4,comuna_5,comuna_6,comuna_7,comuna_8,comuna_9,dominio_Resto de la Ciudad,dominio_Villas de emergencia,"edad_(-0.1, 20.0]","edad_(20.0, 40.0]","edad_(40.0, 60.0]","edad_(60.0, 80.0]","edad_(80.0, 100.0]",sexo_Mujer,sexo_Varon,parentesco_jefe_Conyugue o pareja,parentesco_jefe_Hijo/a - Hijastro/a,parentesco_jefe_Jefe,parentesco_jefe_Nieto/a,parentesco_jefe_Otro familiar,parentesco_jefe_Otro no familiar,parentesco_jefe_Padre/Madre/Suegro/a,parentesco_jefe_Servicio domestico y sus familiares,parentesco_jefe_Yerno/nuera,situacion_conyugal_Casado/a,situacion_conyugal_Divorciado/a,situacion_conyugal_No corresponde,situacion_conyugal_Separado/a de unión o matrimonio,situacion_conyugal_Soltero/a,situacion_conyugal_Unido/a,situacion_conyugal_Viudo/a,num_miembro_padre_1,num_miembro_padre_2,num_miembro_padre_3,num_miembro_padre_4,num_miembro_padre_5,num_miembro_padre_6,num_miembro_padre_7,num_miembro_padre_No corresponde,num_miembro_padre_Padre no vive en el hogar,num_miembro_madre_1,num_miembro_madre_15,num_miembro_madre_2,num_miembro_madre_3,num_miembro_madre_4,num_miembro_madre_5,num_miembro_madre_6,num_miembro_madre_7,num_miembro_madre_9,num_miembro_madre_Madre no vive en el hogar,num_miembro_madre_No corresponde,estado_ocupacional_Desocupado,estado_ocupacional_Inactivo,estado_ocupacional_Ocupado,cat_ocupacional_Asalariado,cat_ocupacional_No corresponde,cat_ocupacional_Patron/empleador,cat_ocupacional_Trabajador familiar,cat_ocupacional_Trabajador por cuenta propia,calidad_ingresos_lab_No corresponde,calidad_ingresos_lab_No tuvo ingresos,calidad_ingresos_lab_Tuvo ingresos pero no declara monto,calidad_ingresos_lab_Tuvo ingresos y declara monto,"ingreso_total_lab_(-0.001, 2500.0]","ingreso_total_lab_(2500.0, 15000.0]","ingreso_total_lab_(15000.0, 25000.0]","ingreso_total_lab_(25000.0, 
37000.0]","ingreso_total_lab_(37000.0, 56000.0]","ingreso_total_lab_(56000.0, 1000000.0]",calidad_ingresos_no_lab_No corresponde,calidad_ingresos_no_lab_No tuvo ingresos,calidad_ingresos_no_lab_Tuvo ingresos pero no declara monto,calidad_ingresos_no_lab_Tuvo ingresos y declara monto,"ingreso_total_no_lab_(-0.001, 4000.0]","ingreso_total_no_lab_(4000.0, 500000.0]",calidad_ingresos_totales_No corresponde,calidad_ingresos_totales_No tuvo ingresos,calidad_ingresos_totales_Tuvo ingresos pero no declara monto,calidad_ingresos_totales_Tuvo ingresos y declara monto,calidad_ingresos_familiares_No tuvo ingresos,calidad_ingresos_familiares_Tuvo ingresos pero no declara monto,calidad_ingresos_familiares_Tuvo ingresos y declara monto,"ingresos_familiares_(-0.001, 20800.0]","ingresos_familiares_(20800.0, 30000.0]","ingresos_familiares_(30000.0, 42000.0]","ingresos_familiares_(42000.0, 54000.0]","ingresos_familiares_(54000.0, 70000.0]","ingresos_familiares_(70000.0, 90000.0]","ingresos_familiares_(90000.0, 124000.0]","ingresos_familiares_(124000.0, 1000000.0]","ingreso_per_capita_familiar_(-0.001, 5400.0]","ingreso_per_capita_familiar_(5400.0, 8700.0]","ingreso_per_capita_familiar_(8700.0, 12000.0]","ingreso_per_capita_familiar_(12000.0, 15016.0]","ingreso_per_capita_familiar_(15016.0, 19900.0]","ingreso_per_capita_familiar_(19900.0, 24000.0]","ingreso_per_capita_familiar_(24000.0, 30000.0]","ingreso_per_capita_familiar_(30000.0, 38300.0]","ingreso_per_capita_familiar_(38300.0, 52340.0]","ingreso_per_capita_familiar_(52340.0, 1000000.0]",estado_educativo_Asiste,estado_educativo_No asiste pero asistió,estado_educativo_Nunca asistio,sector_educativo_Estatal/publico,sector_educativo_No corresponde,sector_educativo_Privado no religioso,sector_educativo_Privado religioso,nivel_actual_Jardin maternal,nivel_actual_No corresponde,nivel_actual_Otras escuelas especiales,nivel_actual_Postgrado,nivel_actual_Primario adultos,nivel_actual_Primario comun,nivel_actual_Primario 
especial,nivel_actual_Sala de 3,nivel_actual_Sala de 4,nivel_actual_Sala de 5,nivel_actual_Secundario/medio adultos,nivel_actual_Secundario/medio comun,nivel_actual_Terciario/superior no universitario,nivel_actual_Universitario,nivel_max_educativo_EGB (1° a 9° año),nivel_max_educativo_No corresponde,nivel_max_educativo_Otras escuelas especiales,nivel_max_educativo_Primario comun,nivel_max_educativo_Primario especial,nivel_max_educativo_Sala de 5,nivel_max_educativo_Secundario/medio comun,nivel_max_educativo_desconocido,"años_escolaridad_(-0.001, 7.0]","años_escolaridad_(7.0, 11.0]","años_escolaridad_(11.0, 12.0]","años_escolaridad_(12.0, 16.0]","años_escolaridad_(16.0, 19.0]",años_escolaridad_desconocido,lugar_nacimiento_CABA,lugar_nacimiento_Otra provincia,lugar_nacimiento_PBA excepto GBA,lugar_nacimiento_PBA sin especificar,lugar_nacimiento_Pais limitrofe,lugar_nacimiento_Pais no limitrofe,lugar_nacimiento_Partido GBA,afiliacion_salud_Otros,afiliacion_salud_Solo obra social,afiliacion_salud_Solo plan de medicina prepaga por contratación voluntaria,afiliacion_salud_Solo prepaga o mutual via OS,afiliacion_salud_Solo sistema publico,cantidad_hijos_nac_vivos_1,cantidad_hijos_nac_vivos_10,cantidad_hijos_nac_vivos_11,cantidad_hijos_nac_vivos_12,cantidad_hijos_nac_vivos_15,cantidad_hijos_nac_vivos_2,cantidad_hijos_nac_vivos_3,cantidad_hijos_nac_vivos_4,cantidad_hijos_nac_vivos_5,cantidad_hijos_nac_vivos_6,cantidad_hijos_nac_vivos_7,cantidad_hijos_nac_vivos_8,cantidad_hijos_nac_vivos_9,cantidad_hijos_nac_vivos_No corresponde
0,1,4000,,,,,,,,,,,,,,,,,,,,,,,1,,,,,,,0,1,,,1,,0,,,,,,,,,1,,,,,,,,,,,1,,,,,,,,,,1,,1,0,0,0,1,,,0,,1,0,0,,,,,,,,0,,1,,,,,0,1,,0,1,,,,,,,,,,,,,,,,,,,1,,,0,,1,0,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,0,1,0,,,,,0,0,1,,,,,,,,,,,,,,,,1
1,1,22000,,,,,,,,,,,,,,,,,,,,,,,1,,,,,,,0,1,,,1,,0,,,,,,,,,1,,,,,,,,,,,1,,,,,,,,,,1,,0,1,0,0,1,,,0,,1,0,0,,,,,,,,0,,1,,,,,0,1,,0,1,,,,,,,,,,,,,,,,,,,1,,,1,,0,0,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,0,0,1,,,,,0,1,0,,,,,,,,,,,,,,,,1
2,1,25000,,,,,,,,,,,,,,,,,,,,,,,1,,,,,,,0,1,,,1,,0,,,,,,,,,1,,,,,,,,,,,1,,,,,,,,,,1,,0,0,1,1,0,,,0,,0,0,1,,,,,,,,0,,1,,,,,0,1,,1,0,,,,,,,,,,,,,,,,,,,1,,,0,,1,0,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,1,0,0,,,,,1,0,0,,,,,,,,,,,,,,,,1
3,2,30000,,,,,,,,,,,,,,,,,,,,,,,1,,,,,,,1,0,,,0,,1,,,,,,,,,1,,,,,,,,,,,1,,,,,,,,,,1,,0,0,1,0,0,,,1,,0,1,0,,,,,,,,1,,0,,,,,1,0,,1,0,,,,,,,,,,,,,,,,,,,1,,,0,,0,1,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,1,0,0,,,,,0,0,1,,,,,,,,,,,,,,,,1
4,1,20000,,,,,,,,,,,,,,,,,,,,,,,1,,,,,,,0,1,,,1,,0,,,,,,,,,1,,,,,,,,,,,1,,,,,,,,,,1,,0,1,0,0,1,,,0,,1,0,0,,,,,,,,0,,1,,,,,0,1,,0,1,,,,,,,,,,,,,,,,,,,1,,,0,,1,0,,,,,,,,,,,,,,1,,,1,,,,,,,,,,,,0,1,0,,,,,0,0,1,,,,,,,,,,,,,,,,1
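
The empty cells in the output above are `NaN`: `reindex` creates every column that `get_dummies` did not produce for the new data and leaves it without a value. The sketch below reproduces this on toy data and shows the `fill_value` argument of `reindex` as one way to get zeros directly. The column list and the toy DataFrame are hypothetical.

```python
import pandas as pd

# Columns seen during training (toy stand-in for ohe_tr)
train_columns = ["sexo_Mujer", "sexo_Varon", "dominio_Villas de emergencia"]

# New data in which only one category shows up
new_toy = pd.DataFrame({"sexo": ["Varon", "Varon"]})
dummies = pd.get_dummies(new_toy, dtype=int)  # only 'sexo_Varon' is created

# Columns missing from the new data are added as NaN
with_nan = dummies.reindex(columns=train_columns)

# Same alignment, but the missing columns are filled with 0
with_zero = dummies.reindex(columns=train_columns, fill_value=0)

print(with_nan)
print(with_zero)
```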


In [83]:
# Same alignment as before, with the missing dummy columns filled with 0
df_data_ohe_2 = pd.get_dummies(df_new_data).reindex(columns=ohe_tr).fillna(0)

In [84]:
df_data_ohe_2

Unnamed: 0,miembro,ingresos_totales,nhogar_1,nhogar_2,nhogar_3,nhogar_4,nhogar_5,nhogar_6,nhogar_7,comuna_1,comuna_10,comuna_11,comuna_12,comuna_13,comuna_14,comuna_15,comuna_2,comuna_3,comuna_4,comuna_5,comuna_6,comuna_7,comuna_8,comuna_9,dominio_Resto de la Ciudad,dominio_Villas de emergencia,"edad_(-0.1, 20.0]","edad_(20.0, 40.0]","edad_(40.0, 60.0]","edad_(60.0, 80.0]","edad_(80.0, 100.0]",sexo_Mujer,sexo_Varon,parentesco_jefe_Conyugue o pareja,parentesco_jefe_Hijo/a - Hijastro/a,parentesco_jefe_Jefe,parentesco_jefe_Nieto/a,parentesco_jefe_Otro familiar,parentesco_jefe_Otro no familiar,parentesco_jefe_Padre/Madre/Suegro/a,parentesco_jefe_Servicio domestico y sus familiares,parentesco_jefe_Yerno/nuera,situacion_conyugal_Casado/a,situacion_conyugal_Divorciado/a,situacion_conyugal_No corresponde,situacion_conyugal_Separado/a de unión o matrimonio,situacion_conyugal_Soltero/a,situacion_conyugal_Unido/a,situacion_conyugal_Viudo/a,num_miembro_padre_1,num_miembro_padre_2,num_miembro_padre_3,num_miembro_padre_4,num_miembro_padre_5,num_miembro_padre_6,num_miembro_padre_7,num_miembro_padre_No corresponde,num_miembro_padre_Padre no vive en el hogar,num_miembro_madre_1,num_miembro_madre_15,num_miembro_madre_2,num_miembro_madre_3,num_miembro_madre_4,num_miembro_madre_5,num_miembro_madre_6,num_miembro_madre_7,num_miembro_madre_9,num_miembro_madre_Madre no vive en el hogar,num_miembro_madre_No corresponde,estado_ocupacional_Desocupado,estado_ocupacional_Inactivo,estado_ocupacional_Ocupado,cat_ocupacional_Asalariado,cat_ocupacional_No corresponde,cat_ocupacional_Patron/empleador,cat_ocupacional_Trabajador familiar,cat_ocupacional_Trabajador por cuenta propia,calidad_ingresos_lab_No corresponde,calidad_ingresos_lab_No tuvo ingresos,calidad_ingresos_lab_Tuvo ingresos pero no declara monto,calidad_ingresos_lab_Tuvo ingresos y declara monto,"ingreso_total_lab_(-0.001, 2500.0]","ingreso_total_lab_(2500.0, 15000.0]","ingreso_total_lab_(15000.0, 25000.0]","ingreso_total_lab_(25000.0, 
37000.0]","ingreso_total_lab_(37000.0, 56000.0]","ingreso_total_lab_(56000.0, 1000000.0]",calidad_ingresos_no_lab_No corresponde,calidad_ingresos_no_lab_No tuvo ingresos,calidad_ingresos_no_lab_Tuvo ingresos pero no declara monto,calidad_ingresos_no_lab_Tuvo ingresos y declara monto,"ingreso_total_no_lab_(-0.001, 4000.0]","ingreso_total_no_lab_(4000.0, 500000.0]",calidad_ingresos_totales_No corresponde,calidad_ingresos_totales_No tuvo ingresos,calidad_ingresos_totales_Tuvo ingresos pero no declara monto,calidad_ingresos_totales_Tuvo ingresos y declara monto,calidad_ingresos_familiares_No tuvo ingresos,calidad_ingresos_familiares_Tuvo ingresos pero no declara monto,calidad_ingresos_familiares_Tuvo ingresos y declara monto,"ingresos_familiares_(-0.001, 20800.0]","ingresos_familiares_(20800.0, 30000.0]","ingresos_familiares_(30000.0, 42000.0]","ingresos_familiares_(42000.0, 54000.0]","ingresos_familiares_(54000.0, 70000.0]","ingresos_familiares_(70000.0, 90000.0]","ingresos_familiares_(90000.0, 124000.0]","ingresos_familiares_(124000.0, 1000000.0]","ingreso_per_capita_familiar_(-0.001, 5400.0]","ingreso_per_capita_familiar_(5400.0, 8700.0]","ingreso_per_capita_familiar_(8700.0, 12000.0]","ingreso_per_capita_familiar_(12000.0, 15016.0]","ingreso_per_capita_familiar_(15016.0, 19900.0]","ingreso_per_capita_familiar_(19900.0, 24000.0]","ingreso_per_capita_familiar_(24000.0, 30000.0]","ingreso_per_capita_familiar_(30000.0, 38300.0]","ingreso_per_capita_familiar_(38300.0, 52340.0]","ingreso_per_capita_familiar_(52340.0, 1000000.0]",estado_educativo_Asiste,estado_educativo_No asiste pero asistió,estado_educativo_Nunca asistio,sector_educativo_Estatal/publico,sector_educativo_No corresponde,sector_educativo_Privado no religioso,sector_educativo_Privado religioso,nivel_actual_Jardin maternal,nivel_actual_No corresponde,nivel_actual_Otras escuelas especiales,nivel_actual_Postgrado,nivel_actual_Primario adultos,nivel_actual_Primario comun,nivel_actual_Primario 
especial,nivel_actual_Sala de 3,nivel_actual_Sala de 4,nivel_actual_Sala de 5,nivel_actual_Secundario/medio adultos,nivel_actual_Secundario/medio comun,nivel_actual_Terciario/superior no universitario,nivel_actual_Universitario,nivel_max_educativo_EGB (1° a 9° año),nivel_max_educativo_No corresponde,nivel_max_educativo_Otras escuelas especiales,nivel_max_educativo_Primario comun,nivel_max_educativo_Primario especial,nivel_max_educativo_Sala de 5,nivel_max_educativo_Secundario/medio comun,nivel_max_educativo_desconocido,"años_escolaridad_(-0.001, 7.0]","años_escolaridad_(7.0, 11.0]","años_escolaridad_(11.0, 12.0]","años_escolaridad_(12.0, 16.0]","años_escolaridad_(16.0, 19.0]",años_escolaridad_desconocido,lugar_nacimiento_CABA,lugar_nacimiento_Otra provincia,lugar_nacimiento_PBA excepto GBA,lugar_nacimiento_PBA sin especificar,lugar_nacimiento_Pais limitrofe,lugar_nacimiento_Pais no limitrofe,lugar_nacimiento_Partido GBA,afiliacion_salud_Otros,afiliacion_salud_Solo obra social,afiliacion_salud_Solo plan de medicina prepaga por contratación voluntaria,afiliacion_salud_Solo prepaga o mutual via OS,afiliacion_salud_Solo sistema publico,cantidad_hijos_nac_vivos_1,cantidad_hijos_nac_vivos_10,cantidad_hijos_nac_vivos_11,cantidad_hijos_nac_vivos_12,cantidad_hijos_nac_vivos_15,cantidad_hijos_nac_vivos_2,cantidad_hijos_nac_vivos_3,cantidad_hijos_nac_vivos_4,cantidad_hijos_nac_vivos_5,cantidad_hijos_nac_vivos_6,cantidad_hijos_nac_vivos_7,cantidad_hijos_nac_vivos_8,cantidad_hijos_nac_vivos_9,cantidad_hijos_nac_vivos_No corresponde
0,1,4000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.0,0.0,1,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,1,0,0,0,1,0.0,0.0,0,0.0,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,0.0,0.0,0.0,0.0,0,1,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0.0,0.0,0.0,0.0,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,1,22000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.0,0.0,1,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0,1,0,0,1,0.0,0.0,0,0.0,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,0.0,0.0,0.0,0.0,0,1,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0.0,0.0,0.0,0.0,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,1,25000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.0,0.0,1,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0,0,1,1,0,0.0,0.0,0,0.0,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,0.0,0.0,0.0,0.0,0,1,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0.0,0.0,0.0,0.0,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,2,30000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0.0,0.0,0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0,0,1,0,0,0.0,0.0,1,0.0,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0,0.0,0.0,0.0,0.0,1,0,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0.0,0.0,0.0,0.0,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,1,20000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.0,0.0,1,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0,1,0,0,1,0.0,0.0,0,0.0,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,0.0,0.0,0.0,0.0,0,1,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,0,0.0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0.0,0.0,0.0,0.0,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
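
In the output above, every row has `nhogar_1` equal to 0.0 even though `nhogar` is 1 in the new data, and the `comuna_*`, `edad_*` and income-bin columns are all zero as well. This suggests that the new data reached `get_dummies` without the type conversion and discretization applied to the training data, so those dummy columns were never created. If that is not the intended result, a sketch along the following lines applies the same steps to the new data, reusing the bin edges computed on the training data. All data and variable names here are hypothetical.

```python
import pandas as pd

# Toy training data, prepared with the same kind of steps as the exercise
train = pd.DataFrame({"comuna": [1, 3, 12], "edad": [5, 35, 70]})
train["comuna"] = train["comuna"].astype(str)
edad_cut, edad_bins = pd.cut(train["edad"], bins=5, retbins=True)  # keep the bin edges
train["edad"] = edad_cut
train_columns = pd.get_dummies(train, dtype=int).columns

# New data goes through the same steps, with the training bin edges
new = pd.DataFrame({"comuna": [3], "edad": [20]})
new["comuna"] = new["comuna"].astype(str)
new["edad"] = pd.cut(new["edad"], bins=edad_bins)
new_ohe = pd.get_dummies(new, dtype=int).reindex(columns=train_columns, fill_value=0)

print(new_ohe.loc[0, "comuna_3"])
print(new_ohe.filter(like="edad_"))
```

Reusing the training edges matters because `pd.cut` or `pd.qcut` applied to the new data alone would compute different intervals, and the resulting column names would not match the training layout.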


In [86]:
df_data_ohe_2.compare(respuesta_2)
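
`DataFrame.compare` returns only the cells that differ, so an empty result means the two DataFrames match. It requires both DataFrames to have identical row and column labels and raises a `ValueError` otherwise. A small sketch on toy data, where `expected` is a hypothetical stand-in for `respuesta_2`:

```python
import pandas as pd

# Hypothetical expected answer and two candidate results
expected = pd.DataFrame({"sexo_Mujer": [0, 1], "sexo_Varon": [1, 0]})
result_ok = expected.copy()
result_bad = expected.copy()
result_bad.loc[1, "sexo_Varon"] = 1  # introduce a single mismatch

diff_ok = result_ok.compare(expected)    # identical frames give an empty DataFrame
diff_bad = result_bad.compare(expected)  # one row, with 'self' and 'other' values

print(diff_ok.empty)
print(diff_bad)
```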