# Worksheet \# 6


---


by Josué Obregón <br>
DS6011 - Feature Engineering <br>
UVG Masters - Escuela de Negocios<br>


## Objectives

The goal of this worksheet is to introduce students to different feature selection techniques.

It also aims to give students practice applying these techniques with the libraries available in Python.

## Importing libraries and loading the data

The libraries we import to get started are pandas and numpy for data handling, and matplotlib, seaborn and plotly for generating visualizations.

First we upgrade scikit-learn so that we can use the most recent version (0.24.2 when this notebook was run).

In [105]:
!pip install --upgrade scikit-learn

Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.24.2)


In [106]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Loading the data

In [107]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [108]:
credit_g =  fetch_openml('credit-g') # https://www.openml.org/t/31

In [109]:
credit_g.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [110]:
print(credit_g['DESCR'])

**Author**: Dr. Hans Hofmann  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

**German Credit dataset**  
This dataset classifies people described by a set of attributes as good or bad credit risks.

This dataset comes with a cost matrix: 
``` 
Good  Bad (predicted)  
Good   0    1   (actual)  
Bad    5    0  
```

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  

### Attribute description  

1. Status of existing checking account, in Deutsche Mark.  
2. Duration in months  
3. Credit history (credits taken, paid back duly, delays, critical accounts)  
4. Purpose of the credit (car, television,...)  
5. Credit amount  
6. Status of savings account/bonds, in Deutsche Mark.  
7. Present employment, in number of years.  
8. Installment rate in percentage of disposable income  
9. Perso

In [111]:
df_credit = credit_g['data']
lbl_enc = LabelEncoder()
df_credit['class']= lbl_enc.fit_transform(credit_g['target'])
df_credit.head()

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,<0,6.0,critical/other existing credit,radio/tv,1169.0,no known savings,>=7,4.0,male single,none,4.0,real estate,67.0,none,own,2.0,skilled,1.0,yes,yes,1
1,0<=X<200,48.0,existing paid,radio/tv,5951.0,<100,1<=X<4,2.0,female div/dep/mar,none,2.0,real estate,22.0,none,own,1.0,skilled,1.0,none,yes,0
2,no checking,12.0,critical/other existing credit,education,2096.0,<100,4<=X<7,2.0,male single,none,3.0,real estate,49.0,none,own,1.0,unskilled resident,2.0,none,yes,1
3,<0,42.0,existing paid,furniture/equipment,7882.0,<100,4<=X<7,2.0,male single,guarantor,4.0,life insurance,45.0,none,for free,1.0,skilled,2.0,none,yes,1
4,<0,24.0,delayed previously,new car,4870.0,<100,1<=X<4,3.0,male single,none,4.0,no known property,53.0,none,for free,2.0,skilled,2.0,none,yes,0


We generate the training set and the test set.

In [112]:
X_train, X_test, y_train, y_test = train_test_split(
    df_credit.drop(['class'], axis=1),
    df_credit['class'],
    train_size=0.80,
    random_state=6011,
    shuffle=True,
)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(800, 20)
(800,)
(200, 20)
(200,)


#### Encoding categorical variables

In [113]:
!pip install category_encoders



In [114]:
from category_encoders import MEstimateEncoder

In [115]:
cat_cols = ['checking_status','credit_history','purpose','savings_status','employment','personal_status','other_parties','residence_since','property_magnitude','other_payment_plans','housing','job','own_telephone','foreign_worker']

In [116]:
mest_enc = MEstimateEncoder( cols=cat_cols)

In [117]:
X_train_cod1 = mest_enc.fit_transform(X_train, y_train)


is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead



In [118]:
X_train_cod1

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker
621,0.874023,18.0,0.824947,0.607955,1530.0,0.622500,0.683036,3.0,0.723900,0.688523,0.672262,0.672126,32.0,0.568638,0.724373,2.0,0.692749,1.0,0.674349,0.680114
199,0.611944,18.0,0.671875,0.679789,4297.0,0.622500,0.737377,4.0,0.597384,0.688523,0.709500,0.550305,40.0,0.712270,0.724373,1.0,0.629808,1.0,0.707104,0.680114
360,0.611944,18.0,0.663586,0.556090,1239.0,0.803272,0.683036,4.0,0.723900,0.688523,0.681835,0.550305,61.0,0.712270,0.586596,1.0,0.692749,1.0,0.674349,0.680114
65,0.874023,27.0,0.663586,0.634375,5190.0,0.803272,0.737377,4.0,0.723900,0.688523,0.681835,0.672126,48.0,0.712270,0.724373,4.0,0.692749,2.0,0.707104,0.680114
981,0.874023,48.0,0.663586,0.638117,4844.0,0.622500,0.593750,3.0,0.723900,0.688523,0.672262,0.679136,33.0,0.568638,0.596391,1.0,0.629808,1.0,0.707104,0.680114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,0.611944,16.0,0.824947,0.607955,1175.0,0.622500,0.593750,2.0,0.723900,0.688523,0.709500,0.679136,68.0,0.712270,0.586596,3.0,0.593750,1.0,0.707104,0.680114
478,0.611944,12.0,0.663586,0.638117,1037.0,0.681882,0.784052,3.0,0.723900,0.688523,0.681835,0.785278,39.0,0.712270,0.724373,1.0,0.724124,1.0,0.674349,0.680114
559,0.611944,18.0,0.824947,0.679789,1928.0,0.622500,0.558067,2.0,0.723900,0.688523,0.672262,0.785278,31.0,0.712270,0.724373,2.0,0.724124,1.0,0.674349,0.680114
305,0.874023,6.0,0.663586,0.679789,1543.0,0.876453,0.683036,4.0,0.597384,0.688523,0.672262,0.785278,33.0,0.712270,0.724373,1.0,0.692749,1.0,0.674349,0.680114


In [119]:
mest_enc.mapping

{'checking_status': checking_status
  1    0.874023
  2    0.611944
  3    0.477192
  4    0.722656
 -1    0.687500
 -2    0.687500
 dtype: float64, 'credit_history': credit_history
  1    0.824947
  2    0.671875
  3    0.663586
  4    0.417187
  5    0.391071
 -1    0.687500
 -2    0.687500
 dtype: float64, 'employment': employment
  1    0.683036
  2    0.737377
  3    0.593750
  4    0.784052
  5    0.558067
 -1    0.687500
 -2    0.687500
 dtype: float64, 'foreign_worker': foreign_worker
  1    0.680114
  2    0.865234
 -1    0.687500
 -2    0.687500
 dtype: float64, 'housing': housing
  1    0.724373
  2    0.586596
  3    0.596391
 -1    0.687500
 -2    0.687500
 dtype: float64, 'job': job
  1    0.692749
  2    0.629808
  3    0.724124
  4    0.593750
 -1    0.687500
 -2    0.687500
 dtype: float64, 'other_parties': other_parties
  1    0.688523
  2    0.535985
  3    0.802365
 -1    0.687500
 -2    0.687500
 dtype: float64, 'other_payment_plans': other_payment_plans
  1    0.5

In [120]:
X_test_cod1 = mest_enc.transform(X_test)
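As a rough illustration of what `MEstimateEncoder` computes, the sketch below reproduces an m-estimate target encoding by hand on a toy column. It assumes the formula `(positives_in_category + prior * m) / (count_in_category + m)` with `m = 1`, which we believe is the library default; the helper name `m_estimate_encode` and the toy data are hypothetical. Under that assumption, unknown or missing categories fall back to the prior (the overall positive rate), which is consistent with the `-1` and `-2` entries of 0.6875 in the mapping above.

```python
import pandas as pd


def m_estimate_encode(categories: pd.Series, target: pd.Series, m: float = 1.0) -> pd.Series:
    """Hypothetical helper: m-estimate target encoding for a single column."""
    prior = target.mean()  # overall positive rate, used to smooth small categories
    stats = target.groupby(categories).agg(["sum", "count"])
    mapping = (stats["sum"] + prior * m) / (stats["count"] + m)
    return categories.map(mapping)


# Toy data (not taken from the credit-g dataset).
toy = pd.DataFrame({
    "housing": ["own", "own", "rent", "rent", "rent", "own"],
    "class": [1, 1, 0, 1, 0, 1],
})
encoded = m_estimate_encode(toy["housing"], toy["class"])
print(encoded.round(4).tolist())
```

Fitting the encoder on the training set only, as done above, keeps target information from the test set out of the encoded values.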

In [121]:
from sklearn.preprocessing import PolynomialFeatures

In [122]:
poly_gen = PolynomialFeatures(2)

In [123]:
poly_gen.fit_transform(X_train_cod1).shape

(800, 231)
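The 231 columns follow from counting all monomials of degree at most 2 in 20 variables. A small check, written here only as an illustration:

```python
from math import comb

n_features, degree = 20, 2

# All monomials of total degree <= 2 in 20 variables, including the constant term.
total = comb(n_features + degree, degree)

# Breakdown: bias + linear terms + squared terms + pairwise interactions.
breakdown = 1 + n_features + n_features + comb(n_features, 2)

print(total, breakdown)
```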

In [124]:
X_train_cod = pd.DataFrame(poly_gen.fit_transform(X_train_cod1), columns=poly_gen.get_feature_names() )
X_train_cod

Unnamed: 0,1,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x0^2,x0 x1,x0 x2,x0 x3,x0 x4,x0 x5,x0 x6,x0 x7,x0 x8,x0 x9,x0 x10,x0 x11,x0 x12,x0 x13,x0 x14,x0 x15,x0 x16,x0 x17,x0 x18,...,x11 x16,x11 x17,x11 x18,x11 x19,x12^2,x12 x13,x12 x14,x12 x15,x12 x16,x12 x17,x12 x18,x12 x19,x13^2,x13 x14,x13 x15,x13 x16,x13 x17,x13 x18,x13 x19,x14^2,x14 x15,x14 x16,x14 x17,x14 x18,x14 x19,x15^2,x15 x16,x15 x17,x15 x18,x15 x19,x16^2,x16 x17,x16 x18,x16 x19,x17^2,x17 x18,x17 x19,x18^2,x18 x19,x19^2
0,1.0,0.874023,18.0,0.824947,0.607955,1530.0,0.622500,0.683036,3.0,0.723900,0.688523,0.672262,0.672126,32.0,0.568638,0.724373,2.0,0.692749,1.0,0.674349,0.680114,0.763917,15.732422,0.721023,0.531367,1337.255859,0.544080,0.596989,2.622070,0.632705,0.601785,0.587573,0.587454,27.968750,0.497003,0.633119,1.748047,0.605479,0.874023,0.589397,...,0.465614,0.672126,0.453247,0.457122,1024.0,18.196429,23.179931,64.0,22.167969,32.0,21.579167,21.763636,0.323350,0.411906,1.137277,0.393924,0.568638,0.383461,0.386739,0.524716,1.448746,0.501809,0.724373,0.488480,0.492656,4.0,1.385498,2.0,1.348698,1.360227,0.479901,0.692749,0.467155,0.471148,1.0,0.674349,0.680114,0.454747,0.458634,0.462555
1,1.0,0.611944,18.0,0.671875,0.679789,4297.0,0.622500,0.737377,4.0,0.597384,0.688523,0.709500,0.550305,40.0,0.712270,0.724373,1.0,0.629808,1.0,0.707104,0.680114,0.374476,11.015000,0.411150,0.415993,2629.525278,0.380935,0.451234,2.447778,0.365566,0.421338,0.434175,0.336756,24.477778,0.435869,0.443276,0.611944,0.385407,0.611944,0.432708,...,0.346586,0.550305,0.389123,0.374270,1600.0,28.490783,28.974913,40.0,25.192308,40.0,28.284161,27.204545,0.507328,0.515949,0.712270,0.448593,0.712270,0.503649,0.484424,0.524716,0.724373,0.456216,0.724373,0.512207,0.492656,1.0,0.629808,1.0,0.707104,0.680114,0.396658,0.629808,0.445340,0.428341,1.0,0.707104,0.680114,0.499996,0.480911,0.462555
2,1.0,0.611944,18.0,0.663586,0.556090,1239.0,0.803272,0.683036,4.0,0.723900,0.688523,0.681835,0.550305,61.0,0.712270,0.586596,1.0,0.692749,1.0,0.674349,0.680114,0.374476,11.015000,0.406078,0.340296,758.199167,0.491558,0.417980,2.447778,0.442986,0.421338,0.417245,0.336756,37.328611,0.435869,0.358964,0.611944,0.423924,0.611944,0.412664,...,0.381223,0.550305,0.371098,0.374270,3721.0,43.448445,35.782380,61.0,42.257690,61.0,41.135286,41.486932,0.507328,0.417815,0.712270,0.493424,0.712270,0.480318,0.484424,0.344095,0.586596,0.406364,0.586596,0.395571,0.398952,1.0,0.692749,1.0,0.674349,0.680114,0.479901,0.692749,0.467155,0.471148,1.0,0.674349,0.680114,0.454747,0.458634,0.462555
3,1.0,0.874023,27.0,0.663586,0.634375,5190.0,0.803272,0.737377,4.0,0.723900,0.688523,0.681835,0.672126,48.0,0.712270,0.724373,4.0,0.692749,2.0,0.707104,0.680114,0.763917,23.598633,0.579989,0.554459,4536.181641,0.702078,0.644485,3.496094,0.632705,0.601785,0.595940,0.587454,41.953125,0.622540,0.633119,3.496094,0.605479,1.748047,0.618026,...,0.465614,1.344251,0.475263,0.457122,2304.0,34.188940,34.769896,192.0,33.251953,96.0,33.940994,32.645455,0.507328,0.515949,2.849078,0.493424,1.424539,0.503649,0.484424,0.524716,2.897491,0.501809,1.448746,0.512207,0.492656,16.0,2.770996,8.0,2.828416,2.720455,0.479901,1.385498,0.489846,0.471148,4.0,1.414208,1.360227,0.499996,0.480911,0.462555
4,1.0,0.874023,48.0,0.663586,0.638117,4844.0,0.622500,0.593750,3.0,0.723900,0.688523,0.672262,0.679136,33.0,0.568638,0.596391,1.0,0.629808,1.0,0.707104,0.680114,0.763917,41.953125,0.579989,0.557729,4233.769531,0.544080,0.518951,2.622070,0.632705,0.601785,0.587573,0.593581,28.842773,0.497003,0.521260,0.874023,0.550467,0.874023,0.618026,...,0.427725,0.679136,0.480220,0.461889,1089.0,18.765067,19.680898,33.0,20.783654,33.0,23.334433,22.443750,0.323350,0.339131,0.568638,0.358133,0.568638,0.402087,0.386739,0.355682,0.596391,0.375612,0.596391,0.421710,0.405614,1.0,0.629808,1.0,0.707104,0.680114,0.396658,0.629808,0.445340,0.428341,1.0,0.707104,0.680114,0.499996,0.480911,0.462555
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,1.0,0.611944,16.0,0.824947,0.607955,1175.0,0.622500,0.593750,2.0,0.723900,0.688523,0.709500,0.679136,68.0,0.712270,0.586596,3.0,0.593750,1.0,0.707104,0.680114,0.374476,9.791111,0.504822,0.372034,719.034722,0.380935,0.363342,1.223889,0.442986,0.421338,0.434175,0.415593,41.612222,0.435869,0.358964,1.835833,0.363342,0.611944,0.432708,...,0.403237,0.679136,0.480220,0.461889,4624.0,48.434332,39.888554,204.0,40.375000,68.0,48.083075,46.247727,0.507328,0.417815,2.136809,0.422910,0.712270,0.503649,0.484424,0.344095,1.759789,0.348292,0.586596,0.414785,0.398952,9.0,1.781250,3.0,2.121312,2.040341,0.352539,0.593750,0.419843,0.403817,1.0,0.707104,0.680114,0.499996,0.480911,0.462555
796,1.0,0.611944,12.0,0.663586,0.638117,1037.0,0.681882,0.784052,3.0,0.723900,0.688523,0.681835,0.785278,39.0,0.712270,0.724373,1.0,0.724124,1.0,0.674349,0.680114,0.374476,7.343333,0.406078,0.390492,634.586389,0.417274,0.479796,1.835833,0.442986,0.421338,0.417245,0.480546,23.865833,0.435869,0.443276,0.611944,0.443124,0.611944,0.412664,...,0.568639,0.785278,0.529551,0.534078,1521.0,27.778514,28.250541,39.0,28.240844,39.0,26.299609,26.524432,0.507328,0.515949,0.712270,0.515772,0.712270,0.480318,0.484424,0.524716,0.724373,0.524536,0.724373,0.488480,0.492656,1.0,0.724124,1.0,0.674349,0.680114,0.524356,0.724124,0.488312,0.492487,1.0,0.674349,0.680114,0.454747,0.458634,0.462555
797,1.0,0.611944,18.0,0.824947,0.679789,1928.0,0.622500,0.558067,2.0,0.723900,0.688523,0.672262,0.785278,31.0,0.712270,0.724373,2.0,0.724124,1.0,0.674349,0.680114,0.374476,11.015000,0.504822,0.415993,1179.828889,0.380935,0.341506,1.223889,0.442986,0.421338,0.411387,0.480546,18.970278,0.435869,0.443276,1.223889,0.443124,0.611944,0.412664,...,0.568639,0.785278,0.529551,0.534078,961.0,22.080357,22.455558,62.0,22.447850,31.0,20.904818,21.083523,0.507328,0.515949,1.424539,0.515772,0.712270,0.480318,0.484424,0.524716,1.448746,0.524536,0.724373,0.488480,0.492656,4.0,1.448248,2.0,1.348698,1.360227,0.524356,0.724124,0.488312,0.492487,1.0,0.674349,0.680114,0.454747,0.458634,0.462555
798,1.0,0.874023,6.0,0.663586,0.679789,1543.0,0.876453,0.683036,4.0,0.597384,0.688523,0.672262,0.785278,33.0,0.712270,0.724373,1.0,0.692749,1.0,0.674349,0.680114,0.763917,5.244141,0.579989,0.594151,1348.618164,0.766041,0.596989,3.496094,0.522127,0.601785,0.587573,0.686351,28.842773,0.622540,0.633119,0.874023,0.605479,0.874023,0.589397,...,0.544000,0.785278,0.529551,0.534078,1089.0,23.504896,23.904304,33.0,22.860718,33.0,22.253516,22.443750,0.507328,0.515949,0.712270,0.493424,0.712270,0.480318,0.484424,0.524716,0.724373,0.501809,0.724373,0.488480,0.492656,1.0,0.692749,1.0,0.674349,0.680114,0.479901,0.692749,0.467155,0.471148,1.0,0.674349,0.680114,0.454747,0.458634,0.462555


In [125]:
X_test_cod = pd.DataFrame(poly_gen.transform(X_test_cod1), columns=poly_gen.get_feature_names() )

#### Baseline model

In [126]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [127]:
base_model = LogisticRegression(penalty='none', max_iter=1000, random_state=6011)
base_model.fit(X_train_cod,y_train)

LogisticRegression(max_iter=1000, penalty='none', random_state=6011)

In [128]:
print(f'Coefficients greater than zero: {(base_model.coef_>0).sum()}')

Coefficients greater than zero: 209


In [129]:
print(classification_report(y_test, base_model.predict(X_test_cod)))

              precision    recall  f1-score   support

           0       0.43      0.12      0.19        50
           1       0.76      0.95      0.85       150

    accuracy                           0.74       200
   macro avg       0.60      0.53      0.52       200
weighted avg       0.68      0.74      0.68       200
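The dataset description includes a cost matrix in which accepting a bad customer costs 5 and rejecting a good one costs 1, so accuracy alone can hide expensive errors. The sketch below shows one way to turn a confusion matrix into a total cost. It assumes that `LabelEncoder` mapped `bad` to 0 and `good` to 1 (alphabetical order, consistent with the class supports of 50 and 150 in the report); the helper `total_cost` and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Cost matrix in encoded-label order (0 = bad, 1 = good); rows = actual, columns = predicted.
cost_matrix = np.array([
    [0, 5],  # actual bad: predicting "good" costs 5
    [1, 0],  # actual good: predicting "bad" costs 1
])


def total_cost(y_true, y_pred):
    """Hypothetical helper: sum of misclassification costs over all samples."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return int((cm * cost_matrix).sum())


# Toy labels: one bad customer accepted (cost 5) and one good customer rejected (cost 1).
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [1, 0, 1, 0, 1]
print(total_cost(y_true_toy, y_pred_toy))
```

In the notebook, the same helper could be called as `total_cost(y_test, base_model.predict(X_test_cod))` to compare models on cost rather than accuracy.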



# Intrinsic Methods

## Ridge Regression

In [130]:
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LogisticRegressionCV

In [131]:
ridge_classifier = LogisticRegressionCV(cv=2, max_iter=100, penalty='l2', solver='liblinear')
ridge_classifier.fit(X_train_cod,y_train)

LogisticRegressionCV(cv=2, solver='liblinear')

In [132]:
ridge_classifier.classes_

array([0, 1])

In [133]:
ridge_classifier.coef_

array([[ 6.24738680e-06,  1.82255624e-05, -8.10023310e-05,
         1.25880905e-05,  8.73636093e-06, -3.24914711e-03,
         7.67350187e-06,  7.19661235e-06,  1.56774654e-05,
         5.18459184e-06,  5.97899079e-06,  4.81220577e-06,
         7.75974124e-06,  2.12220880e-04,  6.39457464e-06,
         6.41755925e-06,  1.73387589e-05,  5.10746920e-06,
         1.98647911e-06,  4.74233127e-06,  5.73044504e-06,
         2.20360776e-05,  2.06705707e-04,  1.79008374e-05,
         1.56554251e-05,  1.02507256e-03,  1.46615501e-05,
         1.46170469e-05,  6.34312433e-05,  1.30740886e-05,
         1.35752299e-05,  1.28955325e-05,  1.49493927e-05,
         6.31525060e-04,  1.42981161e-05,  1.40676762e-05,
         2.75865836e-05,  1.32527603e-05,  1.64244843e-05,
         1.27702223e-05,  1.33589659e-05, -2.31703848e-03,
        -2.16849095e-05,  1.01084751e-05,  2.49066452e-05,
         3.13674361e-05, -5.68375564e-06, -2.97485034e-04,
        -3.39125631e-05, -2.44357037e-05, -3.52780981e-0

In [134]:
print(f'Coefficients greater than zero: {(ridge_classifier.coef_>0).sum()}')

Coefficients greater than zero: 204


In [135]:
ridge_classifier.C_

array([166.81005372])
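`C_` is the inverse of the regularization strength, and `LogisticRegressionCV` picks it from a grid. With the default `Cs=10`, that grid is, to our understanding, ten log-spaced values between 1e-4 and 1e4; the sketch below rebuilds it under that assumption and shows that the selected value is one of the grid points.

```python
import numpy as np

# Assumed default grid for Cs=10: ten values, log-spaced between 1e-4 and 1e4.
c_grid = np.logspace(-4, 4, 10)
print(c_grid)

# The regularization strength (lambda) is the inverse of C.
selected_c = c_grid[7]
print(selected_c, 1 / selected_c)
```

Because the grid is coarse, passing an explicit array through the `Cs` parameter is an option when a finer search is wanted.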

In [136]:
print(f'Best value for lambda: {1/ridge_classifier.C_}')

Best value for lambda: [0.00599484]


In [137]:
print('================ Ridge Classifier Results =================\n')
print(classification_report(y_test, ridge_classifier.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.62      0.50      0.56        50
           1       0.84      0.90      0.87       150

    accuracy                           0.80       200
   macro avg       0.73      0.70      0.71       200
weighted avg       0.79      0.80      0.79       200



## LASSO Regression

In [138]:
lasso_classifier = LogisticRegressionCV(cv=2, penalty='l1', max_iter=100, solver='liblinear')
lasso_classifier.fit(X_train_cod,y_train)
# 21 sec


Liblinear failed to converge, increase the number of iterations.





LogisticRegressionCV(cv=2, penalty='l1', solver='liblinear')

In [139]:
lasso_classifier.classes_

array([0, 1])

In [140]:
lasso_classifier.coef_

array([[ 0.00000000e+00,  0.00000000e+00, -4.38145577e-01,
         0.00000000e+00,  0.00000000e+00, -1.69448200e-03,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00, -4.12924209e-01,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.84887621e-01,  0.00000000e+00,
         0.00000000e+00, -5.50974753e-04,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         8.90173375e-02,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00, -1.53264321e-03,
        -3.59957202e-01,  3.27072959e-02,  2.38066964e-05,
         1.98203936e-01,  1.04899354e-01, -1.48001668e-02,
         0.00000000e+00,  0.00000000e+00,  1.13704281e-0

In [141]:
print(f'Coefficients greater than zero: {(lasso_classifier.coef_>0).sum()}')

Coefficients greater than zero: 29
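The count above only includes positive coefficients. For feature selection, the features kept by the L1 penalty are those with a nonzero coefficient, whether positive or negative. A minimal sketch of the difference, using a hypothetical coefficient vector rather than the fitted model:

```python
import numpy as np

# Hypothetical coefficient vector with positive, negative and zero entries.
coef_toy = np.array([[0.0, -0.44, 0.0, 0.18, -0.0017, 0.0, 0.09]])

n_positive = int((coef_toy > 0).sum())       # what the cell above counts
n_nonzero = int(np.count_nonzero(coef_toy))  # features that survive the L1 penalty

print(n_positive, n_nonzero)
```

In the notebook, `np.count_nonzero(lasso_classifier.coef_)` would give the number of features LASSO retained.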


In [142]:
lasso_classifier.C_

array([0.35938137])

In [143]:
print(f'Best value for lambda: {1/lasso_classifier.C_}')

Best value for lambda: [2.7825594]


In [144]:
print('================ LASSO Logistic Regressor Results =================\n')
print(classification_report(y_test, lasso_classifier.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.57      0.62      0.60        50
           1       0.87      0.85      0.86       150

    accuracy                           0.79       200
   macro avg       0.72      0.73      0.73       200
weighted avg       0.80      0.79      0.79       200



## Elastic Net (combined L1 and L2 regularizers)

In [145]:
elastic_classifier = LogisticRegressionCV(cv=2, penalty='elasticnet', solver='saga', max_iter=100, l1_ratios=[0.5])
elastic_classifier.fit(X_train_cod,y_train)


The max_iter was reached which means the coef_ did not converge



LogisticRegressionCV(cv=2, l1_ratios=[0.5], penalty='elasticnet', solver='saga')

In [146]:
elastic_classifier.classes_

array([0, 1])

In [147]:
elastic_classifier.coef_

array([[1.10823430e-12, 9.26588398e-13, 1.76086089e-11, 8.40265170e-13,
        7.94864849e-13, 2.98318409e-09, 8.03644440e-13, 7.91646623e-13,
        3.35127699e-12, 7.71488306e-13, 7.65243788e-13, 7.56841842e-13,
        7.92179757e-13, 5.15355645e-11, 7.72390425e-13, 7.77860297e-13,
        1.70448125e-12, 7.61839136e-13, 1.30187844e-12, 7.56456145e-13,
        7.63391381e-13, 7.79113510e-13, 1.65264675e-11, 6.89498046e-13,
        6.59379217e-13, 2.77473485e-09, 6.66282182e-13, 6.57394363e-13,
        2.85859040e-12, 6.41859423e-13, 6.35705637e-13, 6.31005534e-13,
        6.54061818e-13, 4.26140995e-11, 6.44047095e-13, 6.46506519e-13,
        1.38003895e-12, 6.35658260e-13, 1.07863056e-12, 6.31339430e-13,
        6.35085276e-13, 2.76872049e-10, 1.33598795e-11, 1.29263143e-11,
        3.87147100e-08, 1.33421186e-11, 1.27002570e-11, 5.18456412e-11,
        1.21682316e-11, 1.19654892e-11, 1.17930475e-11, 1.25008690e-11,
        7.77376978e-10, 1.20286929e-11, 1.24118467e-11, 2.689489

In [148]:
print(f'Coefficients greater than zero: {(elastic_classifier.coef_>0).sum()}')

Coefficients greater than zero: 231


In [149]:
elastic_classifier.C_

array([10000.])

In [150]:
print(f'Best value for lambda: {1/elastic_classifier.C_}')

Best value for lambda: [0.0001]


In [151]:
print('================ Elastic Net Results =================\n')
print(classification_report(y_test, elastic_classifier.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00        50
           1       0.75      1.00      0.86       150

    accuracy                           0.75       200
   macro avg       0.38      0.50      0.43       200
weighted avg       0.56      0.75      0.64       200




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
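The elastic net model above never predicts class 0, and the `saga` solver reported that it did not converge. One likely contributor, stated here as an assumption rather than something verified on this dataset, is that the features have very different scales (for example `credit_amount` and its products versus the encoded categories), which `saga` is sensitive to. The sketch below shows the usual remedy on synthetic stand-in data: standardize inside a pipeline and allow more iterations. The data, the variable names and the `max_iter` value are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is rescaled to mimic a large-valued feature.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)
X_toy[:, 0] *= 1000.0

# Standardizing first puts all features on a comparable scale for the saga solver.
scaled_enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=5000),
)
scaled_enet.fit(X_toy, y_toy)

pred_toy = scaled_enet.predict(X_toy)
print(np.unique(pred_toy), scaled_enet.score(X_toy, y_toy))
```

The same pipeline idea can wrap `LogisticRegressionCV`, so that scaling is refit inside each cross-validation fold.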





## Decision Trees

In [152]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [153]:
dt = DecisionTreeClassifier()
dt.fit(X_train_cod, y_train)

DecisionTreeClassifier()

In [154]:
dt.feature_importances_

array([0.        , 0.        , 0.        , 0.        , 0.00387879,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.01570909, 0.00549495,
       0.0080881 , 0.13687648, 0.00561755, 0.        , 0.00551196,
       0.        , 0.00857522, 0.        , 0.01602425, 0.        ,
       0.        , 0.        , 0.00484848, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.0097467 , 0.        ,
       0.        , 0.        , 0.01801024, 0.        , 0.00620606,
       0.01206988, 0.03660315, 0.00649533, 0.00930909, 0.01207273,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.01994834, 0.08174029, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.05169166,
       0.        , 0.00555372, 0.        , 0.00487378, 0.00498

In [155]:
print(f'Importances greater than zero: {(dt.feature_importances_>0).sum()}')

Importances greater than zero: 75


In [156]:
print('================ Decision Tree Results =================\n')
print(classification_report(y_test, dt.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.35      0.44      0.39        50
           1       0.80      0.73      0.76       150

    accuracy                           0.66       200
   macro avg       0.58      0.59      0.58       200
weighted avg       0.69      0.66      0.67       200



## Tree Ensembles

In [157]:
rf = RandomForestClassifier()
rf.fit(X_train_cod, y_train)

RandomForestClassifier()

In [158]:
rf.feature_importances_

array([0.00000000e+00, 1.69159669e-02, 4.08921657e-03, 1.82228301e-03,
       3.24643213e-03, 6.25795369e-03, 1.51404862e-03, 2.10989770e-03,
       6.57532858e-04, 1.68366017e-03, 1.07176558e-03, 2.19085101e-03,
       5.23130140e-04, 4.64956761e-03, 8.90447890e-04, 4.61180740e-04,
       8.43569633e-04, 1.33315213e-03, 1.33980925e-04, 2.03283974e-04,
       0.00000000e+00, 2.47961938e-03, 3.76662352e-03, 1.35959745e-02,
       1.61426708e-02, 7.25331753e-03, 1.62370689e-02, 1.58439160e-02,
       2.71435190e-03, 1.09515937e-02, 5.36880818e-03, 7.84370353e-03,
       9.01901438e-03, 1.15112900e-02, 7.57567554e-03, 7.54299103e-03,
       4.85900028e-03, 8.55767297e-03, 2.96330388e-03, 8.08026803e-03,
       7.86837954e-03, 3.55613947e-03, 5.20020528e-03, 5.38862686e-03,
       6.28034317e-03, 8.04654287e-03, 6.82333074e-03, 9.43293783e-03,
       5.34229315e-03, 3.98211107e-03, 5.96862906e-03, 7.60514459e-03,
       8.82061626e-03, 5.74446062e-03, 6.32317512e-03, 6.16969050e-03,
      

In [159]:
print(f'Importances greater than zero: {(rf.feature_importances_>0).sum()}')

Importances greater than zero: 229


In [160]:
print('================ Random Forest Results =================\n')
print(classification_report(y_test, rf.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.55      0.58      0.56        50
           1       0.86      0.84      0.85       150

    accuracy                           0.78       200
   macro avg       0.70      0.71      0.71       200
weighted avg       0.78      0.78      0.78       200



In [161]:
gbt = GradientBoostingClassifier()
gbt.fit(X_train_cod, y_train)

GradientBoostingClassifier()

In [162]:
gbt.feature_importances_

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 4.77346683e-04, 2.67723012e-05,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 3.52720421e-05, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 5.42973149e-03, 5.38215262e-03,
       3.78338500e-03, 1.30600279e-02, 1.08647813e-01, 1.34727833e-02,
       2.19039539e-03, 2.75180936e-02, 2.31503472e-03, 6.08520610e-03,
       4.53976113e-02, 3.21476816e-02, 7.07367261e-03, 0.00000000e+00,
       9.17097800e-04, 1.89002535e-03, 0.00000000e+00, 9.82575543e-04,
       3.19193646e-05, 1.00720390e-04, 3.12191783e-03, 1.71907784e-03,
       5.72995253e-03, 2.28304313e-02, 5.25896195e-03, 3.23274457e-02,
       0.00000000e+00, 1.37689264e-06, 2.88757018e-03, 1.25065845e-02,
       1.96510576e-02, 0.00000000e+00, 1.03978583e-03, 1.95253712e-03,
      

In [163]:
print(f'Importances greater than zero: {(gbt.feature_importances_>0).sum()}')

Importances greater than zero: 156


In [164]:
print('================ Gradient Boosting Trees Results =================\n')
print(classification_report(y_test, gbt.predict(X_test_cod)))


              precision    recall  f1-score   support

           0       0.56      0.60      0.58        50
           1       0.86      0.84      0.85       150

    accuracy                           0.78       200
   macro avg       0.71      0.72      0.71       200
weighted avg       0.79      0.78      0.78       200
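Tree-based importances can also drive an explicit selection step. The sketch below uses scikit-learn's `SelectFromModel` with a median threshold, so roughly half of the features are kept. It runs on synthetic data; the variable names and the choice of `threshold='median'` are illustrative assumptions, not part of the worksheet.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with 20 features, 5 of them informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Keep the features whose importance is at least the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold='median',
)
selector.fit(X_toy, y_toy)
X_selected = selector.transform(X_toy)

print(X_toy.shape, '->', X_selected.shape)
```

In the notebook, the same selector could be fit on `X_train_cod` and then applied to `X_test_cod` with `transform`.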



# Wrapper Methods

In [165]:
from sklearn.feature_selection import SequentialFeatureSelector, RFECV
from sklearn.neighbors import KNeighborsClassifier

## Stepwise selection

### Forward

In [166]:
rf = RandomForestClassifier(n_estimators=25)

In [167]:
n_features = 10

In [168]:
forward_stepwise = SequentialFeatureSelector(rf,n_features_to_select=n_features, direction='forward', cv=2)

In [169]:
forward_stepwise.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features

SequentialFeatureSelector(cv=2,
                          estimator=RandomForestClassifier(n_estimators=25),
                          n_features_to_select=10)

In [170]:
forward_stepwise.get_support()

array([ True,  True,  True, False, False,  True,  True, False,  True,
        True, False, False, False, False, False, False,  True, False,
        True,  True])
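The boolean mask is easier to read once it is mapped back to column names. The sketch below copies the mask printed above and the column order of `X_train_cod1` shown earlier; in the notebook itself, `X_train_cod1.columns[forward_stepwise.get_support()]` should give the same result.

```python
import numpy as np

# Column order of X_train_cod1, as shown in the encoded table earlier.
feature_names = np.array([
    'checking_status', 'duration', 'credit_history', 'purpose', 'credit_amount',
    'savings_status', 'employment', 'installment_commitment', 'personal_status',
    'other_parties', 'residence_since', 'property_magnitude', 'age',
    'other_payment_plans', 'housing', 'existing_credits', 'job',
    'num_dependents', 'own_telephone', 'foreign_worker',
])

# Mask copied from the get_support() output above.
forward_mask = np.array([
    True, True, True, False, False, True, True, False, True,
    True, False, False, False, False, False, False, True, False,
    True, True,
])

selected = feature_names[forward_mask].tolist()
print(selected)
```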

In [171]:
X_train_frw = forward_stepwise.transform(X_train_cod1)
X_test_frw = forward_stepwise.transform(X_test_cod1)

In [172]:
rf_frw = RandomForestClassifier(n_estimators=25)
rf_frw.fit(X_train_frw, y_train)

RandomForestClassifier(n_estimators=25)

In [173]:
print('================ Forward Stepwise Random Forest Results =================\n')
print(classification_report(y_test, rf_frw.predict(X_test_frw)))


              precision    recall  f1-score   support

           0       0.51      0.52      0.51        50
           1       0.84      0.83      0.84       150

    accuracy                           0.76       200
   macro avg       0.67      0.68      0.68       200
weighted avg       0.76      0.76      0.76       200



### Backward

In [174]:
rf = RandomForestClassifier(n_estimators=25)

In [175]:
n_features = 10

In [176]:
backward_stepwise = SequentialFeatureSelector(rf,n_features_to_select=n_features, direction='backward', cv=2)

In [177]:
backward_stepwise.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features

SequentialFeatureSelector(cv=2, direction='backward',
                          estimator=RandomForestClassifier(n_estimators=25),
                          n_features_to_select=10)

In [178]:
backward_stepwise.get_support()

array([ True,  True,  True, False,  True,  True,  True, False,  True,
        True, False, False, False,  True,  True, False, False, False,
       False, False])

In [179]:
X_train_bkw = backward_stepwise.transform(X_train_cod1)
X_test_bkw = backward_stepwise.transform(X_test_cod1)

In [180]:
rf_bkw = RandomForestClassifier(n_estimators=25)
rf_bkw.fit(X_train_bkw, y_train)

RandomForestClassifier(n_estimators=25)

In [181]:
print('================ Backward Stepwise Random Forest Results =================\n')
print(classification_report(y_test, rf_bkw.predict(X_test_bkw)))


              precision    recall  f1-score   support

           0       0.56      0.58      0.57        50
           1       0.86      0.85      0.85       150

    accuracy                           0.78       200
   macro avg       0.71      0.71      0.71       200
weighted avg       0.78      0.78      0.78       200
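
The forward and backward masks above are not identical, which is expected: both searches are greedy, and they explore the feature space from opposite ends. The following is a minimal sketch, on synthetic data rather than the notebook's `X_train_cod1`, of how the two selections could be compared by name. The data set, the `f0`..`f7` column names, and the smaller `n_estimators` and `n_features_to_select` values are assumptions made only to keep the example fast and self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in for the notebook's data (assumption): 8 columns, 3 informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical names

selected = {}
for direction in ("forward", "backward"):
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    sfs = SequentialFeatureSelector(rf, n_features_to_select=4,
                                    direction=direction, cv=2)
    sfs.fit(X, y)
    # get_support() returns a boolean mask; use it to index the column names.
    selected[direction] = set(feature_names[sfs.get_support()])

print("forward :", sorted(selected["forward"]))
print("backward:", sorted(selected["backward"]))
print("shared  :", sorted(selected["forward"] & selected["backward"]))
```

With a DataFrame, the same indexing should work with `X_train.columns[selector.get_support()]`, assuming the column order matches the array that was passed to `fit`.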



## Recursive Feature Elimination

In [182]:
rf = RandomForestClassifier(n_estimators=25)

In [183]:
n_features = 10

In [184]:
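# min_features_to_select is only a lower bound; RFECV chooses the final number of features by cross-validation
rfe = RFECV(rf, min_features_to_select=n_features, cv=2)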

In [185]:
rfe.fit(X_train_cod1, y_train) #X_train_cod1 -> 20 features

RFECV(cv=2, estimator=RandomForestClassifier(n_estimators=25),
      min_features_to_select=10)

In [186]:
rfe.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True, False])

In [187]:
X_train_rfe = rfe.transform(X_train_cod1)
X_test_rfe = rfe.transform(X_test_cod1)

In [188]:
rf_rfe = RandomForestClassifier(n_estimators=25)
rf_rfe.fit(X_train_rfe, y_train)

RandomForestClassifier(n_estimators=25)

In [189]:
print('================ Recursive Feature Elimination Random Forest Results =================\n')
print(classification_report(y_test, rf_rfe.predict(X_test_rfe)))


              precision    recall  f1-score   support

           0       0.50      0.44      0.47        50
           1       0.82      0.85      0.84       150

    accuracy                           0.75       200
   macro avg       0.66      0.65      0.65       200
weighted avg       0.74      0.75      0.74       200
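
Note that the mask returned by `get_support()` above keeps 18 of the 20 features, not 10. `RFECV` treats `min_features_to_select` as a lower bound and picks the final count by cross-validated score. The sketch below, on synthetic data (an assumption, not the notebook's data set), shows one way to inspect what was chosen through the fitted `n_features_` and `ranking_` attributes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the notebook's data (assumption): 8 columns, 3 informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

rfecv = RFECV(RandomForestClassifier(n_estimators=10, random_state=0),
              min_features_to_select=4, cv=2)
rfecv.fit(X, y)

# n_features_ is the number of features chosen by cross-validation (>= 4 here).
print("features kept:", rfecv.n_features_)
# ranking_ assigns 1 to every kept feature; larger values were eliminated earlier.
print("ranking      :", rfecv.ranking_)
```

If exactly 10 features are required, plain `RFE` with `n_features_to_select=10` would be the closer match to the stepwise experiments above.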



# Filters

In [190]:
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

## Filter using Chi2

In [191]:
chi2_best = SelectKBest(chi2, k=5)
chi2_best.fit_transform(X_train_cod1, y_train)

array([[8.74023438e-01, 1.80000000e+01, 1.53000000e+03, 3.00000000e+00,
        3.20000000e+01],
       [6.11944444e-01, 1.80000000e+01, 4.29700000e+03, 4.00000000e+00,
        4.00000000e+01],
       [6.11944444e-01, 1.80000000e+01, 1.23900000e+03, 4.00000000e+00,
        6.10000000e+01],
       ...,
       [6.11944444e-01, 1.80000000e+01, 1.92800000e+03, 2.00000000e+00,
        3.10000000e+01],
       [8.74023438e-01, 6.00000000e+00, 1.54300000e+03, 4.00000000e+00,
        3.30000000e+01],
       [8.74023438e-01, 1.20000000e+01, 1.18500000e+03, 3.00000000e+00,
        2.70000000e+01]])

In [192]:
chi2_best.fit_transform(X_train_cod1, y_train).shape

(800, 5)

In [193]:
chi2_best.get_support()

array([ True,  True, False, False,  True, False, False,  True, False,
       False, False, False,  True, False, False, False, False, False,
       False, False])

In [194]:
X_train_chi2 = chi2_best.transform(X_train_cod1)
X_test_chi2 = chi2_best.transform(X_test_cod1)

In [195]:
rf_chi2 = RandomForestClassifier(n_estimators=25)
rf_chi2.fit(X_train_chi2, y_train)

RandomForestClassifier(n_estimators=25)

In [196]:
print('================ Chi2 Filter Random Forest Results =================\n')
print(classification_report(y_test, rf_chi2.predict(X_test_chi2)))


              precision    recall  f1-score   support

           0       0.39      0.40      0.40        50
           1       0.80      0.79      0.80       150

    accuracy                           0.69       200
   macro avg       0.60      0.60      0.60       200
weighted avg       0.70      0.69      0.70       200
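
The boolean mask only says which columns survived the filter. To see why, the fitted selector also exposes per-feature `scores_` and `pvalues_`. The sketch below ranks features by their chi-squared score on synthetic data (an assumption). Because scikit-learn's `chi2` rejects negative inputs, the synthetic columns are shifted to be non-negative first; that shift exists only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the notebook's data (assumption).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = X - X.min(axis=0)  # chi2 requires non-negative feature values

selector = SelectKBest(chi2, k=3).fit(X, y)
mask = selector.get_support()

# Print features from highest to lowest chi-squared score.
order = np.argsort(selector.scores_)[::-1]
for idx in order:
    print(f"feature {idx}: score={selector.scores_[idx]:.2f} "
          f"p={selector.pvalues_[idx]:.3g} selected={mask[idx]}")
```

The same `scores_` and `pvalues_` attributes are available when `f_classif` is used as the score function.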



## Filter using ANOVA F-value

In [197]:
fclass_best = SelectKBest(f_classif, k=5)
fclass_best.fit_transform(X_train_cod1, y_train)

array([[ 0.87402344, 18.        ,  0.82494703,  0.60795455,  0.6225    ],
       [ 0.61194444, 18.        ,  0.671875  ,  0.67978896,  0.6225    ],
       [ 0.61194444, 18.        ,  0.66358568,  0.55608974,  0.80327181],
       ...,
       [ 0.61194444, 18.        ,  0.82494703,  0.67978896,  0.6225    ],
       [ 0.87402344,  6.        ,  0.66358568,  0.67978896,  0.87645349],
       [ 0.87402344, 12.        ,  0.82494703,  0.63811728,  0.6225    ]])

In [198]:
fclass_best.fit_transform(X_train_cod1, y_train).shape

(800, 5)

In [199]:
fclass_best.get_support()

array([ True,  True,  True,  True, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False])

In [200]:
X_train_fclass = fclass_best.transform(X_train_cod1)
X_test_fclass = fclass_best.transform(X_test_cod1)

In [201]:
rf_fclass = RandomForestClassifier(n_estimators=25)
rf_fclass.fit(X_train_fclass, y_train)

RandomForestClassifier(n_estimators=25)

In [202]:
print('================ F-value Filter Random Forest Results =================\n')
print(classification_report(y_test, rf_fclass.predict(X_test_fclass)))


              precision    recall  f1-score   support

           0       0.50      0.54      0.52        50
           1       0.84      0.82      0.83       150

    accuracy                           0.75       200
   macro avg       0.67      0.68      0.68       200
weighted avg       0.76      0.75      0.75       200



## Filter using mutual information

In [203]:
mutual_best = SelectKBest(mutual_info_classif, k=5)
mutual_best.fit_transform(X_train_cod1, y_train)

array([[8.74023438e-01, 1.80000000e+01, 1.53000000e+03, 6.22500000e-01,
        6.83035714e-01],
       [6.11944444e-01, 1.80000000e+01, 4.29700000e+03, 6.22500000e-01,
        7.37376847e-01],
       [6.11944444e-01, 1.80000000e+01, 1.23900000e+03, 8.03271812e-01,
        6.83035714e-01],
       ...,
       [6.11944444e-01, 1.80000000e+01, 1.92800000e+03, 6.22500000e-01,
        5.58067376e-01],
       [8.74023438e-01, 6.00000000e+00, 1.54300000e+03, 8.76453488e-01,
        6.83035714e-01],
       [8.74023438e-01, 1.20000000e+01, 1.18500000e+03, 6.22500000e-01,
        6.83035714e-01]])

In [204]:
mutual_best.fit_transform(X_train_cod1, y_train).shape

(800, 5)

In [205]:
mutual_best.get_support()

array([ True, False,  True,  True,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True])

In [206]:
X_train_mutual = mutual_best.transform(X_train_cod1)
X_test_mutual = mutual_best.transform(X_test_cod1)

In [207]:
rf_mutual = RandomForestClassifier(n_estimators=25)
rf_mutual.fit(X_train_mutual, y_train)

RandomForestClassifier(n_estimators=25)

In [208]:
print('================ Mutual Information Filter Random Forest Results =================\n')
print(classification_report(y_test, rf_mutual.predict(X_test_mutual)))


              precision    recall  f1-score   support

           0       0.41      0.56      0.47        50
           1       0.83      0.73      0.78       150

    accuracy                           0.69       200
   macro avg       0.62      0.64      0.62       200
weighted avg       0.73      0.69      0.70       200
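
Unlike `chi2` and `f_classif`, `mutual_info_classif` adds a small amount of random noise to continuous features, so the selected columns can change between runs. A minimal sketch of one way to make the selection repeatable is to bind `random_state` with `functools.partial` before handing the function to `SelectKBest`. The synthetic data and the seed value are assumptions chosen for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the notebook's data (assumption).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Fix the seed used by the mutual information estimator.
seeded_mi = partial(mutual_info_classif, random_state=0)

mask_a = SelectKBest(seeded_mi, k=3).fit(X, y).get_support()
mask_b = SelectKBest(seeded_mi, k=3).fit(X, y).get_support()

# With a fixed seed, both fits select the same columns.
print("same selection:", bool((mask_a == mask_b).all()))
```

Setting `random_state` on the `RandomForestClassifier` instances as well would make the classification reports in this worksheet reproducible from run to run.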

