# Exploring Mobility Expenditure in Spain

This notebook explores questions that could indicate a change in mobility in recent years, using the datasets of the Spanish household expenditure survey.

## Table of Contents
<ol>
<li><a href="#business">Business Understanding</a></li>
<li><a href="#dataunderstanding">Data Understanding</a></li>
<li><a href="#datapreparation">Data Preparation</a>
    <ol>
        <li><a href="#datawrangling">Data Wrangling</a>
            <ul>
                <li><a href="#gath">Gathering</a></li>
                <li><a href="#asse">Assessing</a></li>
                <li><a href="#clea">Cleaning</a></li>
            </ul>
        </li>
        <li><a href="#eda">Exploratory Data Analysis</a></li>
    </ol>
</li>
<li><a href="#model"> Data Modeling </a></li>   
<li><a href="#result">Result Evaluation</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#Refere">References</a></li>    
</ol>

<a id='business'></a>
## 1. Business Understanding

### Goals
For this case study, the primary goal is to assess the economic impact of new mobility trends on the transportation expenditure of Spanish households.  
This impact will be assessed by answering the following questions:
1. Have there been significant changes in vehicle ownership in the last five years?
    - Expenses on driving licenses: are we driving less?
    - Car purchases
2. Has there been a steep rise in the purchase of personal mobility vehicles (bikes, kick scooters)?
3. Are traditional mobility patterns changing?
4. How has the use of public transportation evolved?

<a id='dataunderstanding'></a>
## 2. Data Understanding

### Data collection

Answering these questions requires knowing how much of household spending goes to transportation. Within the transportation expenses, it would also be interesting to know which means of transportation are used.  
Household budget surveys are datasets collected by many countries. They focus on household expenditures and provide a picture of the living conditions in a given country (citation: Eurostat HBS). These surveys could be useful to answer the questions of interest.

From the wide variety of countries providing these datasets, the ones from Spain were selected for analysis. They could later be compared with data already analyzed and provided by the USA, as a means to see evident differences between the two countries.

Data from the Spanish household expenditure survey was downloaded from the National Statistics Institute. Survey results are divided into three files. The first one includes the expenditures of all families, the second describes the household, and the third provides details about the members of the household. This study will make use of the first two.
These datasets are included in this repository.

<a id='datapreparation'></a>
## 3. Data Preparation

<a id='datawrangling'></a>
### Data wrangling

#### Required dependencies

In [10]:
pip install -r requirements.txt


In [16]:
# Imports
# data handling
import requests as rd
import time
import json
import zipfile

# data preparation and analysis 
import pandas as pd
import numpy as np


# data visualization and interactivity
import seaborn as sb
from matplotlib import pyplot as plt
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
import IPython.display
from IPython.display import display, clear_output


<a id='gath'></a>
### Gathering

As explained above, each year is described by three datasets (spendings, household information and household member information).  
Creating one dataframe per file and year would leave us with 15 dataframes, which is too many to handle individually. Another option would be to build three dataframes, each covering all five years, but then it would be difficult to get insights for specific households in specific years: the household number is the only thing linking the three files together, and the numbers are repeated over the years. For now, the chosen option is to create three dictionaries (spendings, household information and household member information), each holding five dataframes, one per year.
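
To illustrate why the household number alone is ambiguous once several years are stacked together, here is a minimal sketch. The values are made up for illustration; only the column names (`ANOENC`, `NUMERO`, `CCAA`, `CODIGO`, `GASTO`) come from the survey files. It suggests that, if the years were ever combined, joining on the pair (`ANOENC`, `NUMERO`) rather than on `NUMERO` alone would keep households from different years apart.

```python
import pandas as pd

# Made-up values for illustration only; column names follow the survey files.
households = pd.DataFrame({
    "ANOENC": [2021, 2021, 2022, 2022],
    "NUMERO": [1, 2, 1, 2],
    "CCAA": [12, 5, 1, 13],
})
expenses = pd.DataFrame({
    "ANOENC": [2021, 2021, 2022],
    "NUMERO": [1, 2, 1],
    "CODIGO": ["07111", "07321", "07111"],
    "GASTO": [100.0, 20.0, 300.0],
})

# NUMERO alone repeats across years, so it cannot identify a household
numero_is_unique = households["NUMERO"].is_unique
# The (year, household number) pair is unique in this toy table
pair_is_unique = not households.duplicated(["ANOENC", "NUMERO"]).any()

# Joining on both columns attaches each expense row to the household of the
# same year; validate= raises if the right-hand keys are not unique
merged = expenses.merge(
    households, on=["ANOENC", "NUMERO"], how="left", validate="many_to_one"
)
print(merged)
```

The per-year dictionaries used in this notebook avoid the problem in a different way, since each dataframe only ever contains one year.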

In [65]:
# constants
path2data = 'data'
years = [2018, 2019, 2020, 2021, 2022]
df_dict_keys = list(map(str, years))

# data import
# dict: expenses dataframes, keyed by year (as a string)
dict_df_EPF_all_expenses = {}
for year in years:
    csv_name = f"EPFgastos_{year}.csv"
    # each zip archive holds one tab-separated CSV named like the archive minus ".zip"
    with zipfile.ZipFile(f"{path2data}/{csv_name}.zip") as archive:
        with archive.open(csv_name) as csv_file:
            dict_df_EPF_all_expenses[str(year)] = pd.read_csv(csv_file, sep='\t', low_memory=False)

In [68]:
# households dataframes, keyed by year (as a string)
dict_df_EPF_all_households = {
    str(year): pd.read_csv(f"{path2data}/EPFhogar_{year}.csv", sep='\t', low_memory=False)
    for year in years
}


In [70]:
# household member information, keyed by year (as a string)
dict_df_EPF_all_memberHouseholds = {
    str(year): pd.read_csv(f"{path2data}/EPFmhogar_{year}.csv", sep='\t', low_memory=False)
    for year in years
}

The datasets are inspected to make sure the dictionaries were created correctly.

In [76]:
dict_df_EPF_all_expenses['2021']

Unnamed: 0,ANOENC,NUMERO,CODIGO,GASTO,PORCENDES,PORCENIMP,CANTIDAD,GASTOMON,GASTNOM1,GASTNOM2,GASTNOM3,GASTNOM4,GASTNOM5,FACTOR
0,2021,1,01113,31051.14,0.0,0.0,18326.47,31051.14,,,,,,585.77697
1,2021,1,01136,117292.44,0.0,0.0,12217.64,117292.44,,,,,,585.77697
2,2021,1,01163,68996.09,0.0,0.0,51924.99,68996.09,,,,,,585.77697
3,2021,1,01167,86934.65,0.0,0.0,137448.49,86934.65,,,,,,585.77697
4,2021,1,0117A,139711.81,0.0,0.0,103849.97,139711.81,,,,,,585.77697
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1488794,2021,19394,11111,2760364.87,0.0,0.0,,2760364.87,,,,,,2150.09954
1488795,2021,19394,11112,11668201.56,0.0,0.0,,11668201.56,,,,,,2150.09954
1488796,2021,19394,12111,444713.47,0.0,0.0,,444713.47,,,,,,2150.09954
1488797,2021,19394,12521,537524.89,0.0,0.0,,537524.89,,,,,,2150.09954


In [73]:
dict_df_EPF_all_households['2020']

Unnamed: 0,ANOENC,NUMERO,CCAA,NUTS1,CAPROV,TAMAMU,DENSIDAD,CLAVE,CLATEO,FACTOR,...,FUENPRIN,FUENPRINRED,IMPEXAC,INTERIN,NUMPERI,COMIMH,COMISD,COMIHU,COMIINV,COMITOT
0,2020,1,12,1,6,3,3,1,1,1593.985745,...,2,2,2726,6,2,56,0,0,0,56
1,2020,2,5,7,1,1,1,1,1,259.167031,...,1,1,4998,7,2,60,0,0,0,60
2,2020,3,5,7,1,1,1,1,1,529.688667,...,3,3,1245,3,2,56,0,0,0,56
3,2020,4,10,5,6,3,2,2,2,1034.909802,...,3,3,3106,7,2,54,0,0,0,54
4,2020,5,16,2,6,2,1,2,2,371.914654,...,4,3,1250,3,1,112,0,0,0,112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19165,2020,19166,7,4,6,5,3,1,1,580.026322,...,3,3,2185,5,1,56,0,0,0,56
19166,2020,19167,9,5,6,5,3,2,2,2578.309998,...,2,2,2200,5,1,112,0,0,0,112
19167,2020,19168,9,5,6,5,3,2,2,1846.706676,...,1,1,2215,5,2,81,0,0,0,81
19168,2020,19169,10,5,6,3,2,2,2,1571.709414,...,3,3,1365,3,1,28,0,0,0,28


In [72]:
dict_df_EPF_all_memberHouseholds['2019']

Unnamed: 0,ANOENC,NUMERO,NORDEN,CATEGMH,SUSPRIN,RELASP,EDAD,SEXO,PAISNACIM,NACIONA,...,SITURED,OCU,JORNADA,PERCEP,IMPEXACP,INTERINP,NINODEP,HIJODEP,ADULTO,FACTOR
0,2019,1,1,1,1,1,56,6,1,1,...,1,1,1,1,-9.0,03,6,6,1,1294.618356
1,2019,1,2,1,6,2,64,1,1,1,...,2,2,,1,-9.0,02,6,6,1,1294.618356
2,2019,1,3,1,6,3,27,1,1,1,...,2,2,,6,,,6,6,1,1294.618356
3,2019,2,1,1,1,1,69,1,1,1,...,2,2,,1,-9.0,05,6,6,1,1569.018845
4,2019,3,1,1,1,1,66,6,1,1,...,2,2,,1,558.0,02,6,6,1,332.468410
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54032,2019,20816,4,1,6,3,10,1,1,1,...,,,,6,,,1,1,6,371.083222
54033,2019,20816,5,1,6,3,3,6,1,1,...,,,,6,,,1,1,6,371.083222
54034,2019,20817,1,1,1,1,72,1,1,1,...,2,2,,1,-9.0,02,6,6,1,534.816259
54035,2019,20817,2,1,6,2,67,6,1,1,...,2,2,,6,,,6,6,1,534.816259
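
Besides looking at the tables, the same check could be done programmatically. The following is a minimal sketch, not part of the original analysis: `check_yearly_frames` is a hypothetical helper that verifies that every expected year is present, that the `ANOENC` column matches the dictionary key, and that all years share the same column layout. The toy data is made up.

```python
import pandas as pd

def check_yearly_frames(frames, expected_years, year_col="ANOENC"):
    """Return a list of problems found in a {year: DataFrame} dictionary."""
    problems = []
    missing = [y for y in expected_years if y not in frames]
    if missing:
        problems.append(f"missing years: {missing}")
    reference_cols = None
    for year, df in frames.items():
        # every row should carry the survey year used as the dictionary key
        if not (df[year_col].astype(str) == year).all():
            problems.append(f"{year}: unexpected values in {year_col}")
        # all years should share the same column layout as the first one
        cols = list(df.columns)
        if reference_cols is None:
            reference_cols = cols
        elif cols != reference_cols:
            problems.append(f"{year}: columns differ from first year")
    return problems

# Toy dictionary (made-up values): the 2022 frame contains a stray 2021 row
toy = {
    "2021": pd.DataFrame({"ANOENC": [2021, 2021], "NUMERO": [1, 2]}),
    "2022": pd.DataFrame({"ANOENC": [2022, 2021], "NUMERO": [1, 2]}),
}
print(check_yearly_frames(toy, ["2021", "2022"]))
```

With the dictionaries built above, a call such as `check_yearly_frames(dict_df_EPF_all_expenses, df_dict_keys)` would be expected to return an empty list if everything was loaded consistently.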



<a id='asse'></a>
### Assessing


<p style='text-align: justify;'>This part of the analysis focuses on identifying the aspects of the datasets that need improvement. These improvements will make the subsequent modelling possible.</p>

In [84]:
# visual inspection 
dict_df_EPF_all_expenses['2022'].head()

Unnamed: 0,ANOENC,NUMERO,CODIGO,GASTO,PORCENDES,PORCENIMP,CANTIDAD,GASTOMON,GASTNOM1,GASTNOM2,GASTNOM3,GASTNOM4,GASTNOM5,FACTOR
0,2022,1,1111,4904.58,100.0,0.0,1563.96,4904.58,,,,,,299.937215
1,2022,1,1112,2202.06,100.0,0.0,,2202.06,,,,,,299.937215
2,2022,1,1113,191447.43,32.34,0.0,56302.55,191447.43,,,,,,299.937215
3,2022,1,1114,54403.9,100.0,0.0,9383.76,54403.9,,,,,,299.937215
4,2022,1,1115,8617.42,100.0,0.0,,8617.42,,,,,,299.937215


In [94]:
dict_df_EPF_all_expenses['2022'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1437060 entries, 0 to 1437059
Data columns (total 14 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   ANOENC     1437060 non-null  int64  
 1   NUMERO     1437060 non-null  int64  
 2   CODIGO     1437060 non-null  object 
 3   GASTO      1437060 non-null  float64
 4   PORCENDES  1437060 non-null  float64
 5   PORCENIMP  1437060 non-null  float64
 6   CANTIDAD   545314 non-null   float64
 7   GASTOMON   1437058 non-null  float64
 8   GASTNOM1   2537 non-null     float64
 9   GASTNOM2   934 non-null      float64
 10  GASTNOM3   1868 non-null     float64
 11  GASTNOM4   20665 non-null    float64
 12  GASTNOM5   0 non-null        float64
 13  FACTOR     1437060 non-null  float64
dtypes: float64(11), int64(2), object(1)
memory usage: 153.5+ MB


The columns with the most missing values are the ones named `GASTNOM#`. According to the description file, these columns are reserved for expenses that were paid by non-monetary means, such as labor or work tickets.
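
The `info()` output above covers 2022 only. A minimal sketch of how the share of missing values could be compared across all years is shown below; `missing_share_by_year` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def missing_share_by_year(frames):
    """Percentage of missing values, one row per column and one column per year."""
    shares = {year: df.isna().mean() * 100 for year, df in frames.items()}
    return pd.DataFrame(shares).round(2)

# Toy dictionary with made-up values, mimicking a sparsely filled GASTNOM column
toy = {
    "2021": pd.DataFrame({
        "GASTO": [1.0, 2.0, 3.0, 4.0],
        "GASTNOM1": [np.nan, np.nan, np.nan, 5.0],
    }),
    "2022": pd.DataFrame({
        "GASTO": [1.0, 2.0],
        "GASTNOM1": [np.nan, np.nan],
    }),
}
summary = missing_share_by_year(toy)
print(summary)
```

Applied to `dict_df_EPF_all_expenses`, this would give one table in which the sparsely filled columns can be compared year by year.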

In [107]:
for year in df_dict_keys:
    n_rows, n_cols = dict_df_EPF_all_expenses[year].shape
    print(f"{year} expenses table is composed of {n_rows:,} rows and {n_cols} columns.")

2018 expenses table is composed of 1,924,175 rows and 14 columns.
2019 expenses table is composed of 1,889,678 rows and 14 columns.
2020 expenses table is composed of 1,568,629 rows and 14 columns.
2021 expenses table is composed of 1,488,799 rows and 14 columns.
2022 expenses table is composed of 1,437,060 rows and 14 columns.


All tables have the same number of columns (14), but the number of rows varies from year to year.

In [110]:
for year in df_dict_keys:
    print(f"{year} describes the spending habits of {len(dict_df_EPF_all_expenses[year]['NUMERO'].unique()):,} families.")

2018 describes the spending habits of 21,395 families.
2019 describes the spending habits of 20,817 families.
2020 describes the spending habits of 19,169 families.
2021 describes the spending habits of 19,394 families.
2022 describes the spending habits of 20,585 families.


The year describing the most families is 2018. This has little impact on the results translated to the population because each family has a

In [85]:
dict_df_EPF_all_expenses['2021'].describe()

Unnamed: 0,ANOENC,NUMERO,GASTO,PORCENDES,PORCENIMP,CANTIDAD,GASTOMON,GASTNOM1,GASTNOM2,GASTNOM3,GASTNOM4,GASTNOM5,FACTOR
count,1488799.0,1488799.0,1488799.0,1488799.0,1488799.0,590133.0,1488797.0,2447.0,1530.0,1895.0,19749.0,0.0,1488799.0
mean,2021.0,9679.386,370580.9,37.6253,14.40463,258994.6,285586.5,614732.2,185281.3,956202.7,6225154.0,,983.4518
std,0.0,5587.63,1638910.0,47.56322,34.50117,1270695.0,1288340.0,3027980.0,767557.2,3138617.0,6322289.0,,736.8025
min,2021.0,1.0,0.01,0.0,0.0,10.01,0.0,0.0,0.78,92.03,84530.72,,74.11386
25%,2021.0,4839.0,5591.385,0.0,0.0,8155.76,4469.96,71902.04,11651.94,8113.86,2542857.0,,457.2252
50%,2021.0,9692.0,51609.08,0.0,0.0,27116.2,47052.94,172347.0,48621.74,101022.0,4230137.0,,782.3255
75%,2021.0,14482.0,215188.6,100.0,0.0,92687.74,197338.8,395628.7,143224.9,750210.3,7803856.0,,1331.782
max,2021.0,19394.0,279792800.0,100.0,100.0,111814800.0,279792800.0,102036700.0,16373940.0,57075370.0,244465600.0,,9423.128


In [89]:
dict_df_EPF_all_expenses['2020'].describe(include = 'object')

Unnamed: 0,CODIGO
count,1568629
unique,361
top,4511
freq,18850
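
The previews above show `CODIGO` values with a leading zero for 2021 (for example `01113`) and without one for 2022 (for example `1113`). If the codes are meant to be comparable across years, they may need a common format. The sketch below is only an illustration under the assumption that every code has five characters, which should be verified against the description file; `normalize_codes` is a hypothetical helper and the toy series are made up.

```python
import pandas as pd

CODE_LENGTH = 5  # assumption: every expenditure code has five characters

def normalize_codes(codes, length=CODE_LENGTH):
    """Return the codes as strings, left-padded with zeros to a fixed length."""
    return codes.astype(str).str.strip().str.zfill(length)

# Toy series (made-up): the same codes written with and without the leading zero
toy_2021 = pd.Series(["01113", "0117A", "11111"])
toy_2022 = pd.Series(["1113", "117A", "11111"])

print(normalize_codes(toy_2022).tolist())
```

Keeping the column as strings matters here, because some codes contain letters and a numeric conversion would drop leading zeros.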


In [99]:
sum(dict_df_EPF_all_expenses['2022'].duplicated())

0
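
The check above counts fully identical rows for 2022 only. A minimal sketch of how it could be extended to every year, and to repeated key combinations, follows. `duplicate_report` is a hypothetical helper, the toy values are made up, and treating (`NUMERO`, `CODIGO`) as a key is an assumption: a repeated pair is not necessarily an error, only something worth inspecting.

```python
import pandas as pd

def duplicate_report(frames, key_cols):
    """Count fully duplicated rows and repeated key combinations per year."""
    rows = []
    for year, df in frames.items():
        rows.append({
            "year": year,
            "full_row_duplicates": int(df.duplicated().sum()),
            "key_duplicates": int(df.duplicated(subset=key_cols).sum()),
        })
    return pd.DataFrame(rows).set_index("year")

# Toy dictionary with made-up values
toy = {
    "2021": pd.DataFrame({
        "NUMERO": [1, 1, 2],
        "CODIGO": ["01113", "01113", "01113"],
        "GASTO": [10.0, 12.0, 5.0],
    }),
    "2022": pd.DataFrame({
        "NUMERO": [1, 1],
        "CODIGO": ["01113", "01113"],
        "GASTO": [7.0, 7.0],
    }),
}
report = duplicate_report(toy, ["NUMERO", "CODIGO"])
print(report)
```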

In [90]:
# visual inspection 
dict_df_EPF_all_households['2019'].head()

Unnamed: 0,ANOENC,NUMERO,CCAA,NUTS1,CAPROV,TAMAMU,DENSIDAD,CLAVE,CLATEO,FACTOR,...,FUENPRIN,FUENPRINRED,IMPEXAC,INTERIN,NUMPERI,COMIMH,COMISD,COMIHU,COMIINV,COMITOT
0,2019,1,1,6,1,1,1,2,2,1294.618356,...,1,1,1716,4,2,49,0,0,0,49
1,2019,2,13,3,6,1,1,1,2,1569.018845,...,3,3,2184,5,1,8,0,0,4,12
2,2019,3,15,2,6,4,2,1,2,332.46841,...,3,3,558,2,1,28,0,0,0,28
3,2019,4,16,2,6,4,1,1,1,311.734979,...,2,2,2254,5,2,56,0,0,10,66
4,2019,5,3,1,6,4,3,2,2,524.758178,...,3,3,4719,7,3,92,0,0,0,92


In [103]:
dict_df_EPF_all_households['2019'].NUTS1.value_counts()

NUTS1
2    4480
5    4059
6    3502
4    3345
1    3000
3    1517
7     914
Name: count, dtype: int64

In [92]:
dict_df_EPF_all_households['2020'].head()

Unnamed: 0,ANOENC,NUMERO,CCAA,NUTS1,CAPROV,TAMAMU,DENSIDAD,CLAVE,CLATEO,FACTOR,...,FUENPRIN,FUENPRINRED,IMPEXAC,INTERIN,NUMPERI,COMIMH,COMISD,COMIHU,COMIINV,COMITOT
0,2020,1,12,1,6,3,3,1,1,1593.985745,...,2,2,2726,6,2,56,0,0,0,56
1,2020,2,5,7,1,1,1,1,1,259.167031,...,1,1,4998,7,2,60,0,0,0,60
2,2020,3,5,7,1,1,1,1,1,529.688667,...,3,3,1245,3,2,56,0,0,0,56
3,2020,4,10,5,6,3,2,2,2,1034.909802,...,3,3,3106,7,2,54,0,0,0,54
4,2020,5,16,2,6,2,1,2,2,371.914654,...,4,3,1250,3,1,112,0,0,0,112


In [91]:
dict_df_EPF_all_households['2018'].describe()

Unnamed: 0,ANOENC,NUMERO,CCAA,NUTS1,CAPROV,TAMAMU,DENSIDAD,CLAVE,CLATEO,FACTOR,...,RENTAS,OTROIN,IMPEXAC,INTERIN,NUMPERI,COMIMH,COMISD,COMIHU,COMIINV,COMITOT
count,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,...,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0,21395.0
mean,2018.0,10698.0,9.094835,3.749942,4.296565,2.715074,1.813788,1.422622,1.522038,870.655626,...,5.559477,5.788035,2155.686282,4.458565,1.589297,60.500444,0.118766,0.13606,4.220799,64.978593
std,0.0,6176.348841,5.015944,1.842902,2.369757,1.610837,0.847493,0.493988,0.499526,615.040256,...,1.585391,1.232785,1432.750748,1.937125,1.17804,31.919562,1.650964,2.216172,10.935581,33.32194
min,2018.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,64.061002,...,-9.0,-9.0,0.0,1.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0
25%,2018.0,5349.5,5.0,2.0,1.0,1.0,1.0,1.0,1.0,420.004649,...,6.0,6.0,1209.0,3.0,1.0,34.0,0.0,0.0,0.0,41.0
50%,2018.0,10698.0,9.0,4.0,6.0,3.0,2.0,1.0,2.0,689.051204,...,6.0,6.0,1800.0,4.0,2.0,56.0,0.0,0.0,0.0,59.0
75%,2018.0,16046.5,13.0,5.0,6.0,4.0,3.0,2.0,2.0,1182.371731,...,6.0,6.0,2736.0,6.0,2.0,82.0,0.0,0.0,3.0,84.0
max,2018.0,21395.0,19.0,7.0,6.0,5.0,3.0,2.0,2.0,6121.820126,...,6.0,6.0,17500.0,10.0,7.0,308.0,56.0,125.0,196.0,308.0
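
In the summary above, several columns have a minimum of -9 (for example `NUMPERI` and `COMITOT`), which suggests a placeholder code rather than a real measurement. This is an assumption that should be confirmed in the description file. Under that assumption, the sketch below counts the placeholder per column and replaces it with `NaN` in selected columns; both helpers are hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd

SENTINEL = -9  # assumption: placeholder code, to be confirmed in the description file

def sentinel_counts(df, sentinel=SENTINEL):
    """Count how often the sentinel appears in each numeric column."""
    numeric = df.select_dtypes(include="number")
    counts = (numeric == sentinel).sum()
    return counts[counts > 0]

def sentinel_to_nan(df, columns, sentinel=SENTINEL):
    """Return a copy of df where the sentinel is replaced by NaN in the given columns."""
    out = df.copy()
    out[columns] = out[columns].replace(sentinel, np.nan)
    return out

# Toy table with made-up values
toy = pd.DataFrame({
    "NUMERO": [1, 2, 3],
    "NUMPERI": [2, -9, 1],
    "COMITOT": [56, -9, -9],
})
counts = sentinel_counts(toy)
cleaned = sentinel_to_nan(toy, ["NUMPERI", "COMITOT"])
print(counts)
print(cleaned)
```

Replacing the placeholder only in explicitly listed columns is a deliberate choice in this sketch, since -9 could be a legitimate value elsewhere.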


In [80]:
dict_df_EPF_all_households['2022'].describe(include = 'object')

Unnamed: 0,PAISSP,UNIONSP,JORNADASP,INTERINPSP,OCUPA,OCUPARED,ACTESTB,ACTESTBRED,SITPROF,SECTOR,...,FUENACV8,CALEFV8,FUENCAV8,REGTENV9,AGUACV9,FUENACV9,CALEFV9,FUENCAV9,FUENPRIN,FUENPRINRED
count,20585.0,20585,20585,20585,20585,20585,20585,20585,20585,20585,...,20585.0,20585.0,20585.0,20585.0,20585.0,20585.0,20585.0,20585.0,20585,20585
unique,4.0,4,3,9,12,7,22,5,6,4,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9,5
top,,1,1,3,2,2,C,3,1,6,...,,,,,,,,,2,2
freq,18938.0,11391,11941,6309,3538,5963,3435,13277,16761,12325,...,20585.0,20585.0,20585.0,20585.0,20585.0,20585.0,20585.0,20585.0,10378,10378


In [93]:
# visual inspection 
dict_df_EPF_all_memberHouseholds['2019'].head()

Unnamed: 0,ANOENC,NUMERO,NORDEN,CATEGMH,SUSPRIN,RELASP,EDAD,SEXO,PAISNACIM,NACIONA,...,SITURED,OCU,JORNADA,PERCEP,IMPEXACP,INTERINP,NINODEP,HIJODEP,ADULTO,FACTOR
0,2019,1,1,1,1,1,56,6,1,1,...,1,1,1.0,1,-9.0,3.0,6,6,1,1294.618356
1,2019,1,2,1,6,2,64,1,1,1,...,2,2,,1,-9.0,2.0,6,6,1,1294.618356
2,2019,1,3,1,6,3,27,1,1,1,...,2,2,,6,,,6,6,1,1294.618356
3,2019,2,1,1,1,1,69,1,1,1,...,2,2,,1,-9.0,5.0,6,6,1,1569.018845
4,2019,3,1,1,1,1,66,6,1,1,...,2,2,,1,558.0,2.0,6,6,1,332.46841


In [97]:
dict_df_EPF_all_memberHouseholds['2020'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49423 entries, 0 to 49422
Data columns (total 33 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ANOENC       49423 non-null  int64  
 1   NUMERO       49423 non-null  int64  
 2   NORDEN       49423 non-null  int64  
 3   CATEGMH      49423 non-null  int64  
 4   SUSPRIN      49423 non-null  int64  
 5   RELASP       49423 non-null  int64  
 6   EDAD         49423 non-null  int64  
 7   SEXO         49423 non-null  int64  
 8   PAISNACIM    49423 non-null  int64  
 9   NACIONA      49423 non-null  int64  
 10  PAISNACION   49423 non-null  object 
 11  SITURES      49423 non-null  int64  
 12  ECIVILLEGAL  49423 non-null  int64  
 13  NORDENCO     49423 non-null  int64  
 14  UNION        49423 non-null  object 
 15  CONVIVENCIA  49423 non-null  int64  
 16  NORDENPA     49423 non-null  int64  
 17  PAISPADRE    49423 non-null  int64  
 18  NORDENMA     49423 non-null  int64  
 19  PAIS

In [81]:
dict_df_EPF_all_memberHouseholds['2022'].describe()

Unnamed: 0,ANOENC,NUMERO,NORDEN,CATEGMH,SUSPRIN,RELASP,EDAD,SEXO,PAISNACIM,NACIONA,...,NORDENPA,PAISPADRE,NORDENMA,PAISMADRE,PERCEP,IMPEXACP,NINODEP,HIJODEP,ADULTO,FACTOR
count,52148.0,52148.0,52148.0,52148.0,52148.0,52148.0,52148.0,52148.0,52148.0,52148.0,...,52148.0,52148.0,52148.0,52148.0,52148.0,35453.0,52148.0,52148.0,52148.0,52148.0
mean,2022.0,10279.885633,2.073291,1.000978,4.026291,2.079389,43.753509,3.565487,1.23044,1.121117,...,73.816292,1.285553,68.204936,1.295735,2.315583,488.447296,4.852209,4.867262,2.146832,901.627944
std,0.0,5939.262859,1.146288,0.041763,2.443999,1.14053,22.4819,2.499166,0.766754,0.408808,...,42.742317,0.855987,45.218442,0.865681,3.020692,828.710403,2.104131,2.094381,2.103515,671.861765
min,2022.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1.0,-9.0,1.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,48.270897
25%,2022.0,5132.0,1.0,1.0,1.0,1.0,24.0,1.0,1.0,1.0,...,2.0,1.0,2.0,1.0,1.0,-9.0,6.0,6.0,1.0,424.920887
50%,2022.0,10273.5,2.0,1.0,6.0,2.0,47.0,6.0,1.0,1.0,...,99.0,1.0,99.0,1.0,1.0,-9.0,6.0,6.0,1.0,709.770429
75%,2022.0,15420.25,3.0,1.0,6.0,3.0,61.0,6.0,1.0,1.0,...,99.0,1.0,99.0,1.0,6.0,926.0,6.0,6.0,1.0,1190.157681
max,2022.0,20585.0,13.0,4.0,6.0,6.0,85.0,6.0,4.0,3.0,...,99.0,4.0,99.0,4.0,6.0,17000.0,6.0,6.0,6.0,9983.726258


In [82]:
dict_df_EPF_all_memberHouseholds['2022'].describe(include = 'object')

Unnamed: 0,PAISNACION,UNION,ESTUDIOS,ESTUDRED,SITUACT,SITURED,OCU,JORNADA,INTERINP
count,52148.0,52148.0,52148,52148,52148,52148,52148,52148.0,52148.0
unique,4.0,4.0,10,6,10,4,4,3.0,9.0
top,,,3,4,1,1,1,,
freq,47414.0,25098.0,12566,15386,21317,26348,22425,29723.0,16695.0


<a id='clea'></a>
### Cleaning

<a id='eda'></a>
### Exploratory Data Analysis

<a id='model'></a>
## 4. Data Modeling

<a id='result'></a>
## 5. Result Evaluation

<a id='conclusions'></a>
## 6. Conclusions

<a id='Refere'></a>
## 7. References