## Machine Learning Model Building Pipeline: Feature Selection


===================================================================================================

## Predicting Client Arrears

The aim of the project is to build a machine learning model that predicts client arrears from explanatory variables describing each client's profile and bureau records.

### Why is this important? 

Predicting client arrears helps to identify trustworthy clients and to determine whether a client is likely to have an acceptable rate of missed payments.

### What is the objective of the machine learning model?

We aim to minimise the difference between the maximum arrear of a client and the arrear estimated by our model. We will evaluate model performance using the mean squared error (MSE) and the root mean squared error (RMSE), which is the square root of the MSE.
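
As a minimal sketch of how the two metrics relate, the snippet below computes both on a handful of made-up values. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the project's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical values, for illustration only (not from the dataset)
y_true = np.array([0.0, 1.0, 2.0, 4.0])
y_pred = np.array([0.0, 2.0, 2.0, 2.0])

# MSE: mean of the squared differences between truth and prediction
mse = mean_squared_error(y_true, y_pred)

# RMSE: square root of the MSE, expressed in the same units as the target
rmse = np.sqrt(mse)

print('mse: {}'.format(mse))
print('rmse: {}'.format(rmse))
```

Because the RMSE is in the units of the target, it is usually the easier of the two to interpret.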

### How do I get the dataset?

AWS Account:
S3 Bucket:
S3 Path:
Contact: 
Dataset name:
Dataset date:
Dataset time range:

**Note the following:**
-  You need access to the AWS console and read permission on the specified bucket path.
-  If you save the file to the same directory as this Jupyter notebook, you can run the code as it is written here.

====================================================================================================

## Clients tuca and directSale dataset: Feature Selection

In the following cells, we will select a group of variables, the most predictive ones, to build our machine learning model. 

### Why do we select variables?

- For production: Fewer variables mean smaller client input requirements (e.g. customers filling out a form on a website or mobile app), and hence less code for error handling. This reduces the chances of introducing bugs.

- For model performance: Fewer variables mean simpler, more interpretable models that often generalize better.


**We will select variables using the Lasso regression: Lasso has the property of setting the coefficient of non-informative variables to zero. This way we can identify those variables and remove them from our final model.**
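
To illustrate this property in isolation, here is a small sketch on synthetic data (not the project's dataset). The data, the coefficients and the `alpha` value are assumptions chosen so that the effect is easy to see: only the first two columns drive the target, and the remaining three are pure noise.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, only the first two influence the target
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# A fairly strong L1 penalty (alpha chosen for illustration only)
lasso = Lasso(alpha=0.5, random_state=0).fit(X, y)

# Informative features keep non-zero (but shrunken) coefficients,
# while the noise features are set exactly to zero
print(lasso.coef_)
informative = np.flatnonzero(lasso.coef_ != 0)
print('features kept:', informative.tolist())
```

Note that the surviving coefficients are shrunk towards zero as well, so Lasso is used here to pick variables rather than to estimate their effects.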


### Setting the seed

It is important to note that we are engineering variables and pre-processing data with the aim of deploying the model. Therefore, from now on, for each step that includes some element of randomness, it is extremely important that we **set the seed**. This way, we can obtain reproducibility between our research and our development code.

**Always set the seeds**.
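
A minimal sketch of what setting the seed buys us, using `train_test_split` on a toy array as a stand-in for any randomised step (the array and the split size are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, for illustration only
data = np.arange(20)

# Two independent calls with the same random_state...
a_train, a_test = train_test_split(data, test_size=0.25, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.25, random_state=0)

# ...produce exactly the same split
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print('identical splits:', same_split)
```

Without a fixed `random_state`, each call may return a different split, and results obtained in research could not be reproduced exactly in production code.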

Let's go ahead and load the dataset.

In [1]:
target_var = 'maxmora'
identifier = ['ide_tramite', 'ide_cui', 'id_credito', 'id_solicitud', 'id_tramite','cl_unq_act_act_agencia']


In [2]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [3]:
# load the train and test set with the engineered variables

# we built and saved these datasets in the previous lecture.
# If you haven't done so, go ahead and check the previous notebook
# to find out how to create these datasets

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

Unnamed: 0,id_solicitud,key_solicitud,id_credito,maxmora,qty_meses_desde_desembolso,dfi_solicitud_mora,dfi_solicitud_productos,cl_unq_act_act_messolicitud,cl_unq_act_act_trimestresolicitud,active_cch3_months,ide_tramite,id_tramite,ide_cui,cl_unq_act_act_fechasolicitud,cl_unq_act_act_fechasolicitud_date,cl_unq_act_act_monto,cl_unq_act_act_plazo,cl_unq_act_act_agencia,cl_unq_act_act_ptodestino,cl_unq_act_act_flagaprobado,cl_unq_act_act_longlat,cl_unq_act_act_longitud,cl_unq_act_act_latitud,cl_unq_act_act_depnacimiento,cl_unq_act_act_estadocivil,cl_unq_act_act_estadocivilmodificado,cl_unq_act_act_genero,cl_unq_act_act_profesion,cl_unq_act_act_profesionmodificada,cl_unq_act_act_flagpuedeescribir,cl_unq_act_act_flagpuedeleer,cl_unq_act_act_flaghablaespa_ol,cl_unq_act_act_flagpuedefirmar,cl_unq_act_act_flaghablaotroidioma,cl_unq_act_act_nivelacademico,cl_unq_act_act_tiempovivirresidencia,cl_unq_act_act_tipovivienda,cl_unq_act_act_personasdependientes,cl_unq_act_act_tipolocalidad,cl_unq_act_act_topografia,cl_unq_act_act_flagaccesovehicular,cl_unq_act_act_tipoaccesovehicular,cl_unq_act_act_tipoaccesopeatonal,cl_unq_act_act_flagaccesomensajeros,cl_unq_act_act_flagpidenimpuesto,cl_unq_act_act_vivtipoconstruccion,cl_unq_act_act_cantidadniveles,cl_unq_act_act_cantidaddormitorios,cl_unq_act_act_cantidadba_os,cl_unq_act_act_flagtienecocina,cl_unq_act_act_flagtienesala,cl_unq_act_act_flagtienejardin,cl_unq_act_act_flagtienegarage,cl_unq_act_act_flagtienecomedor,cl_unq_act_act_vehiculo,cl_unq_act_act_fuenteingresos,cl_unq_act_act_tiponegocio,cl_unq_act_act_depnegocio,cl_unq_act_act_flagvendealcredito,cl_unq_act_act_negociomontoventasefectivo,cl_unq_act_act_negociototalingresos,cl_unq_act_act_totalbienes,cl_unq_act_act_totalpasivos,cl_unq_act_act_totalgastosfam,cl_unq_act_act_totalingresosfam,cl_unq_act_act_estresventas,cl_unq_act_act_estrescostoventas,cl_unq_act_act_estresgrossprofit,cl_unq_act_act_flagtieneelectricidad,cl_unq_act_act_flagtieneagua,cl_unq_act_act_flagtienetelfijo,c
l_unq_act_act_flagtienecelular,cl_unq_act_act_flagtienetvcable,cl_unq_act_act_flagtienerefrigerador,cl_unq_act_act_flagtienelavadora,cl_unq_act_act_flagtienesecadora,cl_unq_act_act_flagtienehorno,cl_unq_act_act_flagtienemicroondas,cl_unq_act_act_flagtienestereo,cl_unq_act_act_fnacimiento_date,cl_unq_act_act_finicionegocio_date,cl_unq_act_act_fnacimiento_date_numberlong,cl_unq_act_act_finicionegocio_date_numberlong,id_mora,fecha_de_cierre_mora,fecha_consulta_date_mora,cl_cnt_12m_act_comcantidadmora1,cl_cnt_24m_act_comcantidadmora1,cl_cnt_12m_act_comcantidadmora2,cl_cnt_24m_act_comcantidadmora2,cl_des_12m_act_commaxdesvmora,cl_des_24m_act_commaxdesvmora,cl_max_12m_act_commaxmora,cl_max_24m_act_commaxmora,cl_max_act_act_commaxmora,cl_cnt_12m_act_ptocantidadmora1,cl_cnt_24m_act_ptocantidadmora1,cl_cnt_12m_act_ptocantidadmora2,cl_cnt_24m_act_ptocantidadmora2,cl_max_12m_act_ptomaxdesvmora,cl_max_24m_act_ptomaxdesvmora,cl_max_12m_act_ptomaxmora,cl_max_24m_act_ptomaxmora,cl_max_act_act_ptomaxmora,cl_cnt_12m_act_tccantidadmora1,cl_cnt_24m_act_tccantidadmora1,cl_cnt_12m_act_tccantidadmora2,cl_cnt_24m_act_tccantidadmora2,cl_des_12m_act_tcmaxdesvmora,cl_des_24m_act_tcmaxdesvmora,cl_max_12m_act_tcmaxmora,cl_max_24m_act_tcmaxmora,cl_unq_act_act_tcmoraact,cl_cnt_12m_act_servcantidadmora1,cl_cnt_12m_act_servcantidadmora2,cl_des_12m_act_servmaxdesvmora,cl_max_12m_act_servmaxmora,cl_max_act_act_servmaxmora,id_productos,fecha_de_cierre_productos,fecha_consulta_date_tu,cl_min_his_act_ptoexptotal,cl_min_his_act_ptoexpvig,cl_cnt_his_act_comcantidadtotal,cl_sum_his_act_commontototal,cl_cnt_act_act_comcantidadvig,cl_sum_act_act_commontovig,cl_sum_act_act_comsaldoenmora,cl_sum_act_act_comsaldovig,cl_cnt_his_act_ptocantidadtotal,cl_sum_his_act_ptomontototal,cl_cnt_act_act_ptocantidadvig,cl_sum_act_act_ptomontovigente,cl_sum_act_act_ptosaldomora,cl_sum_act_act_ptosaldovig,cl_cnt_act_act_servcantidadvig,cl_sum_act_act_servsaldomora,cl_sum_act_act_servsaldovig,cl_cnt_his_act_tccantidadtotal,cl_
sum_his_act_tclimitetotal,cl_cnt_act_act_tccantidadvig,cl_sum_act_act_tclimitevig,cl_sum_act_act_tcsaldomora,cl_sum_act_act_tcsaldovig,cl_sum_act_act_comporcentajepagadomontosvig,cl_sum_act_act_comporcentajesaldoenmora,cl_sum_act_act_tcporcentajesaldomora,cl_sum_act_act_tcporcentajeutilizacion,cl_sum_act_act_ptoporcentajepagadomontosvig,cl_sum_act_act_ptoporcentajesaldoenmora,cl_sum_act_act_servporcentajesaldomora,dfi_solicitud_mora_na,dfi_solicitud_productos_na,cl_unq_act_act_plazo_na,cl_unq_act_act_personasdependientes_na,cl_unq_act_act_cantidadniveles_na,cl_unq_act_act_cantidaddormitorios_na,cl_unq_act_act_cantidadba_os_na,cl_unq_act_act_negociomontoventasefectivo_na,cl_unq_act_act_negociototalingresos_na,cl_unq_act_act_totalbienes_na,cl_unq_act_act_totalpasivos_na,cl_unq_act_act_totalgastosfam_na,cl_unq_act_act_totalingresosfam_na,cl_unq_act_act_estresventas_na,cl_unq_act_act_estrescostoventas_na,cl_unq_act_act_estresgrossprofit_na,cl_unq_act_act_fnacimiento_date_numberlong_na,cl_unq_act_act_finicionegocio_date_numberlong_na,id_mora_na,cl_cnt_12m_act_comcantidadmora1_na,cl_cnt_24m_act_comcantidadmora1_na,cl_cnt_12m_act_comcantidadmora2_na,cl_cnt_24m_act_comcantidadmora2_na,cl_des_12m_act_commaxdesvmora_na,cl_des_24m_act_commaxdesvmora_na,cl_max_12m_act_commaxmora_na,cl_max_24m_act_commaxmora_na,cl_max_act_act_commaxmora_na,cl_cnt_12m_act_ptocantidadmora1_na,cl_cnt_24m_act_ptocantidadmora1_na,cl_cnt_12m_act_ptocantidadmora2_na,cl_cnt_24m_act_ptocantidadmora2_na,cl_max_12m_act_ptomaxdesvmora_na,cl_max_24m_act_ptomaxdesvmora_na,cl_max_12m_act_ptomaxmora_na,cl_max_24m_act_ptomaxmora_na,cl_max_act_act_ptomaxmora_na,cl_cnt_12m_act_tccantidadmora1_na,cl_cnt_24m_act_tccantidadmora1_na,cl_cnt_12m_act_tccantidadmora2_na,cl_cnt_24m_act_tccantidadmora2_na,cl_des_12m_act_tcmaxdesvmora_na,cl_des_24m_act_tcmaxdesvmora_na,cl_max_12m_act_tcmaxmora_na,cl_max_24m_act_tcmaxmora_na,cl_unq_act_act_tcmoraact_na,cl_cnt_12m_act_servcantidadmora1_na,cl_cnt_12m_act_servcantidadmora2_na,cl
_des_12m_act_servmaxdesvmora_na,cl_max_12m_act_servmaxmora_na,cl_max_act_act_servmaxmora_na,id_productos_na,cl_min_his_act_ptoexptotal_na,cl_min_his_act_ptoexpvig_na,cl_cnt_his_act_comcantidadtotal_na,cl_sum_his_act_commontototal_na,cl_cnt_act_act_comcantidadvig_na,cl_sum_act_act_commontovig_na,cl_sum_act_act_comsaldoenmora_na,cl_sum_act_act_comsaldovig_na,cl_cnt_his_act_ptocantidadtotal_na,cl_sum_his_act_ptomontototal_na,cl_cnt_act_act_ptocantidadvig_na,cl_sum_act_act_ptomontovigente_na,cl_sum_act_act_ptosaldomora_na,cl_sum_act_act_ptosaldovig_na,cl_cnt_act_act_servcantidadvig_na,cl_sum_act_act_servsaldomora_na,cl_sum_act_act_servsaldovig_na,cl_cnt_his_act_tccantidadtotal_na,cl_sum_his_act_tclimitetotal_na,cl_cnt_act_act_tccantidadvig_na,cl_sum_act_act_tclimitevig_na,cl_sum_act_act_tcsaldomora_na,cl_sum_act_act_tcsaldovig_na,cl_sum_act_act_comporcentajepagadomontosvig_na,cl_sum_act_act_comporcentajesaldoenmora_na,cl_sum_act_act_tcporcentajesaldomora_na,cl_sum_act_act_tcporcentajeutilizacion_na,cl_sum_act_act_ptoporcentajepagadomontosvig_na,cl_sum_act_act_ptoporcentajesaldoenmora_na,cl_sum_act_act_servporcentajesaldomora_na,cl_unq_act_act_totalgastosfam_ratio,cl_unq_act_act_negociototalingresos_ratio,cl_unq_act_act_totalbienes_ratio
0,0.0,0.0,0.0,0,0.0,0.328947,0.324561,1.0,1.0,0.186047,0.895358,0.0,0.280157,0.127119,0.150943,0.703962,0.37931,,0.5,1.0,0.0,0.008594,0.366208,0.909091,0.25,0.0,0.0,0.0,1.0,0.0,0.0,0.5,0.5,0.5,0.2,0.0,0.25,8.6e-05,0.0,0.666667,0.0,0.333333,0.25,0.5,0.5,0.142857,0.0,0.0,9e-05,0.5,0.0,0.5,0.0,0.5,0.5,0.333333,0.5,0.772727,0.5,0.015538,0.44546,0.53529,0.0,0.623491,0.00534,0.022357,0.009571,0.032762,0.5,0.5,0.5,0.5,0.5,0.5,0.0,0.0,0.5,0.0,0.5,0.882353,0.110112,0.958236,1.0,0.041812,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041812,0.6,1.0,0.042985,0.011364,0.0,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.266418,1.0,0.0,0.0,0.224439,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.011443,0.00326,0.002361
1,0.0,0.0,0.0,0,0.333333,0.328947,0.324561,0.666667,1.0,0.674419,0.706568,0.0,0.330403,0.29661,0.339623,0.739614,1.0,,0.166667,1.0,0.0,0.011662,0.33951,0.772727,0.5,1.0,0.333333,0.8,1.0,0.0,0.0,0.5,0.5,0.5,0.0,0.0,0.25,8.6e-05,0.4,0.666667,0.0,0.333333,0.5,0.0,0.5,0.285714,0.0,0.021429,9e-05,0.5,0.5,0.5,0.5,0.0,0.5,0.666667,1.0,1.0,1.0,0.011094,0.41018,0.577611,0.0,0.631238,0.004005,0.015418,0.015,0.023589,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.0,0.5,0.0,0.0,0.666667,0.0,0.958236,1.0,0.041812,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041812,0.6,1.0,0.042985,0.011364,0.0,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.266418,1.0,0.0,0.0,0.224439,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.011443,0.002048,0.003448
2,0.0,0.0,0.0,0,1.0,0.328947,0.328947,0.0,0.0,0.953488,0.019872,0.0,0.012853,0.957627,0.962264,0.34902,0.37931,,0.333333,1.0,0.0,0.01295,0.334636,0.181818,0.25,0.0,0.0,0.0,1.0,0.5,0.5,0.5,0.0,0.5,0.0,0.333333,0.25,0.000129,0.4,0.666667,0.0,0.0,0.25,0.0,0.5,0.714286,0.0,0.014286,9e-05,0.5,0.0,0.5,0.0,0.0,0.5,0.333333,0.5,0.181818,0.5,0.033316,0.501974,0.495499,0.0,0.655165,0.004005,0.011054,0.011494,0.010918,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.647059,0.145397,0.958236,1.0,0.013414,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013414,1.0,0.0,0.077589,0.011364,0.0,0.0,0.0,0.0,0.0,0.0,0.031746,0.00713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.266418,1.0,0.0,0.0,0.224439,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.049587,0.017193,0.004487
3,0.0,0.0,0.0,0,0.666667,0.328947,0.280702,0.333333,1.0,0.488372,0.364585,0.0,0.32579,0.610169,0.509434,0.480019,0.37931,,0.5,1.0,0.0,0.002919,0.331779,0.545455,0.25,0.75,0.666667,0.0,0.5,0.0,0.0,0.5,0.5,0.5,0.2,0.0,0.25,0.000171,0.4,0.666667,0.0,0.666667,0.5,0.0,0.5,0.285714,0.0,0.021429,9e-05,0.5,0.5,0.0,0.0,0.5,0.5,0.333333,0.5,0.590909,0.5,0.00776,0.41018,0.655924,0.0,0.60463,0.006676,0.015196,0.013423,0.016574,0.5,0.5,0.5,0.5,0.5,0.5,0.0,0.0,0.5,0.0,0.5,0.745098,0.353984,0.958236,1.0,0.041812,0.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.341727,0.8,0.0,0.042985,0.011364,0.034483,0.036662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.266418,1.0,0.0,0.0,0.224439,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.019072,0.004702,0.019001
4,0.0,0.0,0.0,1,1.0,0.324561,0.324561,0.0,0.0,0.976744,0.035851,0.0,0.396632,0.949153,0.981132,0.669922,0.37931,,0.166667,1.0,0.0,0.011977,0.374688,0.863636,0.75,1.0,0.0,0.0,1.0,0.0,0.0,0.5,0.5,0.5,0.0,0.333333,0.25,8.6e-05,0.8,0.666667,0.0,0.666667,0.25,0.0,0.5,0.285714,0.15,0.007143,9e-05,0.5,0.5,0.5,0.0,0.5,0.5,0.333333,0.5,0.772727,0.5,0.026649,0.477479,0.596152,0.0,0.638021,0.004005,0.020426,0.022,0.018238,0.5,0.0,0.5,0.5,0.5,0.5,0.0,0.0,0.5,0.5,0.5,0.686275,0.163066,0.958236,1.0,0.41605,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.41605,1.0,0.0,0.108948,0.011364,0.068966,0.033417,0.25,0.025758,0.019705,0.00956,0.111111,0.063921,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.084743,0.737942,0.0,0.0,0.224439,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.015257,0.004962,0.005299


In [4]:
# capture the target (remember that the target is log transformed)
y_train = X_train[target_var]
y_test = X_test[target_var]

# drop unnecessary variables from our training and testing sets
X_train.drop([target_var], axis=1, inplace=True)
X_test.drop([target_var], axis=1, inplace=True)
X_train.drop(identifier, axis=1, inplace=True)
X_test.drop(identifier, axis=1, inplace=True)

### Feature Selection

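Let's go ahead and select a subset of the most predictive features. With scikit-learn's default settings (`selection='cyclic'`) the Lasso fit is deterministic, and `random_state` only takes effect when `selection='random'`. We set the seed anyway, so that the result stays reproducible if that option is ever changed.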

In [5]:
# We will do the model fitting and feature selection
# altogether in a few lines of code

# first, we specify the Lasso regression model and
# select a suitable alpha (the strength of the L1 penalty).
# The bigger the alpha, the fewer features will be selected.

# Then we use the SelectFromModel object from sklearn, which
# automatically selects the features whose coefficients are (effectively) non-zero

# remember to set the seed, the random state in this function
sel_ = SelectFromModel(Lasso(alpha=0.005, random_state=0))

# train Lasso model and select features
sel_.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, random_state=0))
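
To see how `alpha` controls the number of selected features, the following sketch repeats the same selection for several penalties on synthetic data. The data, the true coefficients and the grid of `alpha` values are assumptions for illustration; a suitable value for the real dataset has to be chosen separately, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features with effects of decreasing size,
# the last five have no effect at all
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_coefs = np.array([5.0, 3.0, 1.0, 0.5, 0.1, 0, 0, 0, 0, 0])
y = X @ true_coefs + rng.normal(scale=0.1, size=300)

# Count the selected features for a weak, a medium and a strong penalty
n_selected = {}
for alpha in [0.01, 0.3, 2.0]:
    sel = SelectFromModel(Lasso(alpha=alpha, random_state=0)).fit(X, y)
    n_selected[alpha] = int(sel.get_support().sum())

print(n_selected)
```

On data like this, a stronger penalty keeps only the features with the largest effects.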

In [6]:
# let's visualise those features that were selected.
# (selected features marked with True)

sel_.get_support()

array([False,  True, False, False,  True, False, False,  True,  True,
       False, False,  True, False, False, False, False,  True,  True,
        True,  True, False, False,  True, False, False, False, False,
       False,  True, False, False, False, False,  True,  True, False,
        True, False, False, False, False, False, False, False, False,
        True,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False,  True,  True, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [7]:
# let's print the number of total and selected features

# this is how we can make a list of the selected features
selected_feats = X_train.columns[(sel_.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feats)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 229
selected features: 22
features with coefficients shrunk to zero: 206


In [8]:
# print the selected features
selected_feats

Index(['qty_meses_desde_desembolso', 'cl_unq_act_act_messolicitud',
       'cl_unq_act_act_fechasolicitud', 'cl_unq_act_act_fechasolicitud_date',
       'cl_unq_act_act_ptodestino', 'cl_unq_act_act_depnacimiento',
       'cl_unq_act_act_estadocivil', 'cl_unq_act_act_estadocivilmodificado',
       'cl_unq_act_act_genero', 'cl_unq_act_act_flagpuedeescribir',
       'cl_unq_act_act_tiempovivirresidencia',
       'cl_unq_act_act_flagaccesovehicular',
       'cl_unq_act_act_tipoaccesovehicular',
       'cl_unq_act_act_flagaccesomensajeros', 'cl_unq_act_act_flagtienegarage',
       'cl_unq_act_act_flagtienecomedor', 'cl_unq_act_act_depnegocio',
       'cl_unq_act_act_flagtieneagua', 'cl_unq_act_act_flagtienerefrigerador',
       'cl_unq_act_act_flagtienelavadora', 'cl_unq_act_act_flagtienestereo',
       'cl_unq_act_act_fnacimiento_date_numberlong_na'],
      dtype='object')

### Identify the selected variables

In [9]:
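# this is an alternative way of identifying the selected features,
# based on the non-zero regularised coefficients.
# Note: it can return more features than get_support(), which drops
# coefficients whose absolute value is below a small threshold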

selected_feats = X_train.columns[(sel_.estimator_.coef_ != 0).ravel().tolist()]

selected_feats

Index(['qty_meses_desde_desembolso', 'cl_unq_act_act_messolicitud',
       'cl_unq_act_act_fechasolicitud', 'cl_unq_act_act_fechasolicitud_date',
       'cl_unq_act_act_ptodestino', 'cl_unq_act_act_depnacimiento',
       'cl_unq_act_act_estadocivil', 'cl_unq_act_act_estadocivilmodificado',
       'cl_unq_act_act_genero', 'cl_unq_act_act_flagpuedeescribir',
       'cl_unq_act_act_flagpuedeleer', 'cl_unq_act_act_tiempovivirresidencia',
       'cl_unq_act_act_flagaccesovehicular',
       'cl_unq_act_act_tipoaccesovehicular',
       'cl_unq_act_act_flagaccesomensajeros', 'cl_unq_act_act_flagtienegarage',
       'cl_unq_act_act_flagtienecomedor', 'cl_unq_act_act_depnegocio',
       'cl_unq_act_act_flagtieneagua', 'cl_unq_act_act_flagtienerefrigerador',
       'cl_unq_act_act_flagtienelavadora', 'cl_unq_act_act_flagtienestereo',
       'cl_unq_act_act_fnacimiento_date_numberlong_na'],
      dtype='object')
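
Note that this list has 23 entries, whereas `sel_.get_support()` returned 22; the extra one is `cl_unq_act_act_flagpuedeleer`. This matches the counts printed earlier (229 total minus 206 zero coefficients leaves 23 non-zero ones). A plausible explanation, assuming scikit-learn's defaults: when no `threshold` is passed and the estimator is a Lasso, `SelectFromModel` keeps only features whose absolute coefficient is at least `1e-5`, so a tiny but non-zero coefficient is excluded. The sketch below shows the two rules side by side, using synthetic data to read the threshold and a hypothetical coefficient vector; neither comes from the project's dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data, only used to obtain a fitted selector
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

sel = SelectFromModel(Lasso(alpha=0.005, random_state=0)).fit(X, y)

# Threshold actually used by the selector (default for Lasso)
threshold = sel.threshold_
print('threshold:', threshold)

# Hypothetical coefficients: the third one is non-zero but tiny
coef = np.array([0.8, 0.0, 3e-6, -0.2])

by_threshold = np.abs(coef) >= threshold   # rule used by get_support()
by_nonzero = coef != 0                     # rule used in the cell above

print('selected by threshold:', by_threshold.tolist())
print('selected by non-zero :', by_nonzero.tolist())
```

Whichever rule is used, it is worth using the same one throughout, so that the saved feature list matches the one reported.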

In [10]:
pd.Series(selected_feats).to_csv('recommended_features.csv', index=False)
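
The file written above can later be read back into a plain list of feature names. The sketch below shows the round trip with two hypothetical feature names and a temporary directory; it assumes a pandas version in which `Series.to_csv` writes a header row by default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical feature names, for illustration only
feats = pd.Index(['feat_a', 'feat_b'])

# Save in the same way as in the cell above, but to a temporary location
path = os.path.join(tempfile.mkdtemp(), 'recommended_features.csv')
pd.Series(feats).to_csv(path, index=False)

# Read the single column back as a list of names
loaded = pd.read_csv(path).iloc[:, 0].tolist()
print(loaded)
```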

In [11]:
cat_vars_selected = [var for var in selected_feats if X_train[var].dtype == 'O']

In [12]:
all_cat_features = pd.read_csv('all_categorical_features.csv')

In [13]:
selected_feats

Index(['qty_meses_desde_desembolso', 'cl_unq_act_act_messolicitud',
       'cl_unq_act_act_fechasolicitud', 'cl_unq_act_act_fechasolicitud_date',
       'cl_unq_act_act_ptodestino', 'cl_unq_act_act_depnacimiento',
       'cl_unq_act_act_estadocivil', 'cl_unq_act_act_estadocivilmodificado',
       'cl_unq_act_act_genero', 'cl_unq_act_act_flagpuedeescribir',
       'cl_unq_act_act_flagpuedeleer', 'cl_unq_act_act_tiempovivirresidencia',
       'cl_unq_act_act_flagaccesovehicular',
       'cl_unq_act_act_tipoaccesovehicular',
       'cl_unq_act_act_flagaccesomensajeros', 'cl_unq_act_act_flagtienegarage',
       'cl_unq_act_act_flagtienecomedor', 'cl_unq_act_act_depnegocio',
       'cl_unq_act_act_flagtieneagua', 'cl_unq_act_act_flagtienerefrigerador',
       'cl_unq_act_act_flagtienelavadora', 'cl_unq_act_act_flagtienestereo',
       'cl_unq_act_act_fnacimiento_date_numberlong_na'],
      dtype='object')

In [14]:
# np.isin replaces np.in1d, which is deprecated in recent NumPy releases
mask = all_cat_features.index[np.isin(all_cat_features['0'], selected_feats)]

In [15]:
all_cat_features.loc[mask].to_csv('recommended_categorical_features.csv', index=False)

That is all for this notebook. In the next one, we will go ahead and build the final model using the selected features.