# National Household Survey 2019

## 📍 Objective
Prepare the data from the annual household survey (Encuesta anual de hogares) carried out across the whole City of Buenos Aires, Argentina, in 2019.

In practice, you will clean and condition the dataset so that it is ready for correlation analysis or for training a Machine Learning model.

The dataset comes from the Buenos Aires City Government open data portal: [Encuesta anual de hogares 2019](https://data.buenosaires.gob.ar/dataset/encuesta-anual-hogares)

## 0. Import libraries, utilities, and constants

In [22]:
import pandas as pd
import os
import sys

# Make modules in the parent directory importable
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from helpers import Explorator

DATA_FILE = '../data/encuesta-anual-hogares-2019.csv'


## 1. Loading and initial review of the data

In [8]:
data = pd.read_csv(DATA_FILE, sep=',')

In [9]:
data.head()

| | id | nhogar | miembro | comuna | dominio | edad | sexo | parentesco_jefe | situacion_conyugal | num_miembro_padre | ... | ingreso_per_capita_familiar | estado_educativo | sector_educativo | nivel_actual | nivel_max_educativo | años_escolaridad | lugar_nacimiento | afiliacion_salud | hijos_nacidos_vivos | cantidad_hijos_nac_vivos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 5 | Resto de la Ciudad | 18 | Mujer | Jefe | Soltero/a | Padre no vive en el hogar | ... | 9000 | Asiste | Estatal/publico | Universitario | Otras escuelas especiales | 12 | PBA excepto GBA | Solo obra social | No | No corresponde |
| 1 | 1 | 1 | 2 | 5 | Resto de la Ciudad | 18 | Mujer | Otro no familiar | Soltero/a | Padre no vive en el hogar | ... | 9000 | Asiste | Estatal/publico | Universitario | Otras escuelas especiales | 12 | Otra provincia | Solo plan de medicina prepaga por contratación... | No | No corresponde |
| 2 | 2 | 1 | 1 | 2 | Resto de la Ciudad | 18 | Varon | Jefe | Soltero/a | Padre no vive en el hogar | ... | 33333 | Asiste | Privado religioso | Universitario | Otras escuelas especiales | 12 | CABA | Solo plan de medicina prepaga por contratación... | | No corresponde |
| 3 | 2 | 1 | 2 | 2 | Resto de la Ciudad | 50 | Mujer | Padre/Madre/Suegro/a | Viudo/a | No corresponde | ... | 33333 | No asiste pero asistió | No corresponde | No corresponde | Secundario/medio comun | 17 | CABA | Solo prepaga o mutual via OS | Si | 2 |
| 4 | 2 | 1 | 3 | 2 | Resto de la Ciudad | 17 | Varon | Otro familiar | Soltero/a | Padre no vive en el hogar | ... | 33333 | Asiste | Privado religioso | Secundario/medio comun | EGB (1° a 9° año) | 10 | CABA | Solo plan de medicina prepaga por contratación... | | No corresponde |


In [23]:
expl_data = Explorator(data)
expl_data.totals()

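| | variable | qty_nan | perc_nan | qty_zeros | perc_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | nhogar | 0 | 0.0 | 0 | 0.0 | 7 | int64 |
| 1 | miembro | 0 | 0.0 | 0 | 0.0 | 19 | int64 |
| 2 | comuna | 0 | 0.0 | 0 | 0.0 | 15 | int64 |
| 3 | dominio | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 4 | edad | 0 | 0.0 | 128 | 0.008939 | 101 | int64 |
| 5 | sexo | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 6 | parentesco_jefe | 0 | 0.0 | 0 | 0.0 | 9 | object |
| 7 | situacion_conyugal | 1 | 7e-05 | 0 | 0.0 | 7 | object |
| 8 | num_miembro_padre | 0 | 0.0 | 0 | 0.0 | 9 | object |
| 9 | num_miembro_madre | 0 | 0.0 | 0 | 0.0 | 11 | object |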
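
The `Explorator` helper is not shown in this notebook, so its implementation is unknown. As a rough sketch only, a per-column summary with the same column names as the output above could be built with plain pandas. The function name `column_totals` and the toy DataFrame are assumptions made for illustration, and zeros are counted only for numeric columns, which may differ from what the helper does.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def column_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a totals() summary: NaNs, zeros, distinct values, dtype."""
    n_rows = len(df)
    rows = []
    for name in df.columns:
        col = df[name]
        qty_nan = int(col.isna().sum())
        # Assumption: zeros are only meaningful for numeric columns.
        qty_zeros = int((col == 0).sum()) if is_numeric_dtype(col) else 0
        rows.append({
            "variable": name,
            "qty_nan": qty_nan,
            "perc_nan": qty_nan / n_rows if n_rows else 0.0,   # fraction, not percent
            "qty_zeros": qty_zeros,
            "perc_zeros": qty_zeros / n_rows if n_rows else 0.0,
            "unique": col.nunique(),                            # NaN is not counted
            "type": str(col.dtype),
        })
    return pd.DataFrame(rows)


# Toy data, not taken from the survey.
toy = pd.DataFrame({
    "edad": [0, 18, 50, 17],
    "sexo": ["Mujer", "Mujer", None, "Varon"],
})
summary = column_totals(toy)
print(summary)
```

Note that `perc_nan` and `perc_zeros` are fractions of the row count here, which matches the scale of the values in the output above (for example 128 zeros in `edad` giving 0.008939).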


### 1.1 Drop the id and hijos_nacidos_vivos columns

In [10]:
data = data.drop(['id','hijos_nacidos_vivos'], axis=1)

In [11]:
expl_1 = Explorator(data)
expl_1.totals()

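| | variable | qty_nan | perc_nan | qty_zeros | perc_zeros | unique | type |
|---|---|---|---|---|---|---|---|
| 0 | nhogar | 0 | 0.0 | 0 | 0.0 | 7 | int64 |
| 1 | miembro | 0 | 0.0 | 0 | 0.0 | 19 | int64 |
| 2 | comuna | 0 | 0.0 | 0 | 0.0 | 15 | int64 |
| 3 | dominio | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 4 | edad | 0 | 0.0 | 128 | 0.008939 | 101 | int64 |
| 5 | sexo | 0 | 0.0 | 0 | 0.0 | 2 | object |
| 6 | parentesco_jefe | 0 | 0.0 | 0 | 0.0 | 9 | object |
| 7 | situacion_conyugal | 1 | 7e-05 | 0 | 0.0 | 7 | object |
| 8 | num_miembro_padre | 0 | 0.0 | 0 | 0.0 | 9 | object |
| 9 | num_miembro_madre | 0 | 0.0 | 0 | 0.0 | 11 | object |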


## 2. Discretization

In [43]:
data_work = data.copy()

### 2.1 Discretize 'ingresos_familiares'

In [44]:
data_work['ingresos_familiares_cat'], saved_bins = pd.qcut(data_work['ingresos_familiares'], q=8, retbins=True)
saved_bins

array([      0.,   20800.,   30000.,   42000.,   54000.,   70000.,
         90000.,  124000., 1000000.])
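
Keeping the bin edges via `retbins=True` is useful when the same intervals must later be applied to other data, for example a validation split. The sketch below shows one way to do that with `pd.cut`; the toy values and the names `train`, `new`, and `edges` are illustrative assumptions, not part of the survey data.

```python
import pandas as pd

# Toy income values, not taken from the survey.
train = pd.Series([0, 10_000, 20_000, 30_000, 40_000, 50_000, 60_000, 70_000])

# Fit quantile-based bins (quartiles here) and keep the edges.
train_cat, edges = pd.qcut(train, q=4, retbins=True)

# Apply the same edges to new values.
# include_lowest=True keeps the minimum edge inside the first bin.
new = pd.Series([0, 25_000, 70_000, 90_000])
new_cat = pd.cut(new, bins=edges, include_lowest=True)

print(edges)
print(new_cat)
```

Values that fall outside the saved edges (90,000 in this toy example) become `NaN`, so you may want to widen the outer edges or handle those rows explicitly before reusing the bins.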

In [45]:
expl_2 = Explorator(data_work['ingresos_familiares_cat'])
expl_2.frequency()

| | ingresos_familiares_cat | frequency | percentage | cumulative_perc |
|---|---|---|---|---|
| 0 | (54000.0, 70000.0] | 1905 | 0.13304 | 0.13304 |
| 1 | (30000.0, 42000.0] | 1841 | 0.12857 | 0.26161 |
| 2 | (20800.0, 30000.0] | 1810 | 0.126405 | 0.388016 |
| 3 | (-0.001, 20800.0] | 1796 | 0.125428 | 0.513444 |
| 4 | (124000.0, 1000000.0] | 1788 | 0.124869 | 0.638313 |
| 5 | (70000.0, 90000.0] | 1781 | 0.12438 | 0.762693 |
| 6 | (42000.0, 54000.0] | 1721 | 0.12019 | 0.882883 |
| 7 | (90000.0, 124000.0] | 1677 | 0.117117 | 1.0 |
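
For reference, a frequency table with these columns can be approximated with `value_counts`. This is only a sketch of the idea, since the notebook does not show how `Explorator.frequency()` is implemented; `frequency_table` and the toy Series are hypothetical.

```python
import pandas as pd


def frequency_table(values: pd.Series) -> pd.DataFrame:
    """Hypothetical equivalent of a frequency() report for a single column."""
    counts = values.value_counts()  # sorted by count, descending; NaN excluded
    table = counts.rename_axis(values.name).reset_index(name="frequency")
    table["percentage"] = table["frequency"] / table["frequency"].sum()
    table["cumulative_perc"] = table["percentage"].cumsum()
    return table


# Toy data, not taken from the survey.
toy = pd.Series(["a", "b", "a", "c", "a", "b"], name="category")
table = frequency_table(toy)
print(table)
```

Because the rows are ordered by frequency rather than by interval, `cumulative_perc` accumulates from the most populated bin to the least populated one, as in the output above.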


### 2.2 Discretize 'ingreso_per_capita_familiar'

In [46]:
data_work['ingreso_per_capita_familiar_cat'], saved_bins_2 = pd.qcut(data_work['ingreso_per_capita_familiar'], q=10, retbins=True)
saved_bins_2

array([      0.,    5400.,    8700.,   12000.,   15016.,   19900.,
         24000.,   30000.,   38300.,   52340., 1000000.])

In [47]:
expl_3 = Explorator(data_work['ingreso_per_capita_familiar_cat'])
expl_3.frequency()

| | ingreso_per_capita_familiar_cat | frequency | percentage | cumulative_perc |
|---|---|---|---|---|
| 0 | (24000.0, 30000.0] | 1594 | 0.111321 | 0.111321 |
| 1 | (8700.0, 12000.0] | 1554 | 0.108527 | 0.219848 |
| 2 | (19900.0, 24000.0] | 1460 | 0.101962 | 0.32181 |
| 3 | (-0.001, 5400.0] | 1434 | 0.100147 | 0.421957 |
| 4 | (15016.0, 19900.0] | 1432 | 0.100007 | 0.521964 |
| 5 | (52340.0, 1000000.0] | 1432 | 0.100007 | 0.621971 |
| 6 | (5400.0, 8700.0] | 1431 | 0.099937 | 0.721908 |
| 7 | (38300.0, 52340.0] | 1430 | 0.099867 | 0.821775 |
| 8 | (12000.0, 15016.0] | 1309 | 0.091417 | 0.913192 |
| 9 | (30000.0, 38300.0] | 1243 | 0.086808 | 1.0 |


### 2.3 Discretize 'ingreso_total_lab' and 'ingreso_total_no_lab'

In [48]:
data_work['ingreso_total_lab_cat'] = pd.qcut(data_work['ingreso_total_lab'], q=10, duplicates='drop')
data_work['ingreso_total_no_lab_cat'] = pd.qcut(data_work['ingreso_total_no_lab'], q=4, duplicates='drop')
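
The `duplicates='drop'` argument matters when many rows share the same value (typically zero income), because several quantile edges then coincide and `pd.qcut` would otherwise raise a `ValueError`. The toy example below, with made-up values, illustrates the behaviour; the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy non-labour income where most entries are zero (not survey data).
income = pd.Series([0, 0, 0, 0, 0, 0, 0, 1_000, 5_000, 20_000])

# Without duplicates='drop', repeated quantile edges make qcut fail.
try:
    pd.qcut(income, q=4)
    raised = False
except ValueError:
    raised = True

# With duplicates='drop', the repeated edges are merged into fewer bins.
binned = pd.qcut(income, q=4, duplicates="drop")
n_bins = binned.cat.categories.size

print(raised, n_bins)
print(binned.value_counts())
```

The trade-off is that you get fewer bins than requested, and the first bin can hold a large share of the rows, so it is worth checking the resulting frequencies.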

In [49]:
expl_4 = Explorator(data_work[['ingreso_per_capita_familiar_cat', 'ingreso_total_no_lab_cat']])
expl_4.frequency()

  ingreso_per_capita_familiar_cat  frequency  percentage  cumulative_perc
0              (24000.0, 30000.0]       1594    0.111321         0.111321
1               (8700.0, 12000.0]       1554    0.108527         0.219848
2              (19900.0, 24000.0]       1460    0.101962         0.321810
3                (-0.001, 5400.0]       1434    0.100147         0.421957
4              (15016.0, 19900.0]       1432    0.100007         0.521964
5            (52340.0, 1000000.0]       1432    0.100007         0.621971
6                (5400.0, 8700.0]       1431    0.099937         0.721908
7              (38300.0, 52340.0]       1430    0.099867         0.821775
8              (12000.0, 15016.0]       1309    0.091417         0.913192
9              (30000.0, 38300.0]       1243    0.086808         1.000000

----------------------------------------------------------------

  ingreso_total_no_lab_cat  frequency  percentage  cumulative_perc
0         (-0.001, 4000.0]      10741    0.750122   

### 2.4 Discretize 'edad'