## Feature engineering

The goal of this notebook is to generate new variables from the ones already analyzed. It also includes a brief exploration to determine whether these variables are likely to be important for the model.

### Package imports

In [1]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

### Configuration

In [2]:
pd.set_option("display.max_columns", None)

### Data loading

In [3]:
df = pd.read_csv("../data/intermediate/dataset_newfeatures_processed.csv")
df.shape

(3000, 28)

In [4]:
df.head()

Unnamed: 0,invoiceId,businessId,payerId,montoFactura,relationDays,relationRecurrence,issuerInvoicesAmount,issuerCancelledInvoices,diasActividadPagador,Clients12Months,mora,facturaMorosa,montoFacturaWsz,issuerInvoicesAmountWsz,Clients12MonthsWsz,payerNroFactMorosas,payerNroFacturas,payerAmountFacturas,payerDiasMora,payerRatioMorosidad,payerAvgAmountFacturas,payerAvgDiasMora,payerDesvAmount,businessNroFactMorosas,businessNroFacturas,businessDiasMora,businessRatioMororsidad,businessAvgDiasMora
2186,11654,5,5015,4713103,428.0,10.166667,187036960,0.021942,2632.0,4,18.0,1,4713103.0,187036960.0,4.0,18,45,181609529.0,102.0,0.4,4035767,2.266667,677336.0,0,0,0.0,0.0,0.0
2819,13430,5,5015,7879645,455.0,9.891304,231623853,0.020079,2659.0,4,-6.0,0,7879645.0,231623853.0,4.0,22,52,210861531.0,149.0,0.423077,4055029,2.865385,3824616.0,1,1,18.0,1.0,18.0
2713,15957,5,5015,1190417,493.0,9.264151,272695026,0.016629,2697.0,4,1.0,1,1190417.0,272695026.0,4.0,23,66,261505599.0,419.0,0.348485,3962206,6.348485,-2771789.0,1,2,12.0,0.5,6.0
1141,18135,5,5015,189924,542.0,9.016667,321318559,0.014322,2746.0,3,0.0,0,189924.0,321318559.0,3.0,27,72,278112178.0,507.0,0.375,3862669,7.041667,-3672745.0,2,3,13.0,0.666667,4.333333
2301,19438,5,5015,3755846,562.0,8.920635,346234215,0.013314,2766.0,3,-5.0,0,3755846.0,346234215.0,3.0,29,77,297456818.0,476.0,0.376623,3863075,6.181818,-107229.0,2,4,13.0,0.5,3.25


### Processing

#### Removing invalid records
We already mentioned this in notebook 1. Since we lack the domain knowledge needed to decide how to treat these cases, we chose to remove them.

In [5]:
# Count how many such cases there are
df.loc[(df["relationDays"] == 0) & (df["facturaMorosa"] == 1)].shape

(38, 28)

In [6]:
# Remove them
lista_facturas_invalidas = df.loc[
    (df["relationDays"] == 0) & (df["facturaMorosa"] == 1)
]["invoiceId"].tolist()

df = (
    df.loc[~df["invoiceId"].isin(lista_facturas_invalidas)]
    .reset_index(drop=True)
    .copy()
)
df.shape

(2962, 28)
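
The two-step filter above (collect the IDs, then `isin`) can also be written as a single boolean mask. The following is a minimal sketch on a made-up four-row frame: the values are hypothetical, only the column names come from the notebook, and the two approaches are equivalent only under the assumption that `invoiceId` is unique per row.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the notebook.
toy = pd.DataFrame(
    {
        "invoiceId": [1, 2, 3, 4],
        "relationDays": [0, 0, 12, 30],
        "facturaMorosa": [1, 0, 1, 0],
    }
)

# Rows treated as invalid: relationDays equal to 0 and flagged as delinquent.
invalid_mask = (toy["relationDays"] == 0) & (toy["facturaMorosa"] == 1)

# Keep everything else and rebuild a clean 0..n-1 index.
toy_clean = toy.loc[~invalid_mask].reset_index(drop=True)
print(int(invalid_mask.sum()), toy_clean.shape)
```

Keeping the mask in a variable also makes the count and the removal use exactly the same condition.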

In [7]:
# Keep a copy to save at the end
df_final = df.copy()

#### Column selection
Here we have a few decisions to make:
- The first is whether to use the variables that still contain outliers or the versions to which we applied outlier treatment.
- In the previous step we created some auxiliary variables. Some of them may turn out to be useful, so we evaluate them.

In [8]:
# Option 1: keep the variables whose outliers were treated
opcion_1_variables = [
    "businessAvgDiasMora",
    "businessNroFacturas",
    "businessRatioMororsidad",
    "Clients12MonthsWsz",
    "diasActividadPagador",
    "facturaMorosa",
    "invoiceId",
    "issuerCancelledInvoices",
    "issuerInvoicesAmountWsz",
    "montoFacturaWsz",
    "payerAmountFacturas",
    "payerAvgAmountFacturas",
    "payerAvgDiasMora",
    "payerDesvAmount",
    "payerNroFacturas",
    "payerRatioMorosidad",
    "relationDays",
    "relationRecurrence",
]

In [9]:
# Option 2: keep the original variables despite their outliers
opcion_2_variables = [
    "businessAvgDiasMora",
    "businessNroFactMorosas",
    "businessNroFacturas",
    "businessRatioMororsidad",
    "Clients12Months",
    "diasActividadPagador",
    "facturaMorosa",
    "invoiceId",
    "issuerCancelledInvoices",
    "issuerInvoicesAmount",
    "montoFactura",
    "payerAmountFacturas",
    "payerAvgAmountFacturas",
    "payerAvgDiasMora",
    "payerDesvAmount",
    "payerNroFacturas",
    "payerRatioMorosidad",
    "relationDays",
    "relationRecurrence",
]

In [10]:
df = df[opcion_1_variables].copy()

#### Split

For model validation we hold out roughly 10% of the data (the cell below sets aside 250 of the 2,962 rows, about 8.4%). The held-out rows are the most recently issued invoices: the model never sees them during training, so evaluating on them avoids results biased by overfitting.

In [11]:
# Drop the last 250 rows (the most recent invoices) so they stay unseen
df = df.sort_values(["invoiceId"]).copy()
df = df.iloc[:-250, :].copy()
df.drop(columns=["invoiceId"], inplace=True)
df.shape

(2712, 17)
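
The cell above discards the held-out rows from `df`. As a sketch of the same idea that returns both parts explicitly, the helper below sorts by an ordering column and splits off the last rows. The function name `temporal_holdout` and the toy values are hypothetical, and it relies on the same assumption the notebook makes: that `invoiceId` increases with issue date.

```python
import pandas as pd


def temporal_holdout(frame, order_col, n_holdout):
    """Split off the last n_holdout rows after sorting by order_col.

    n_holdout must be a positive integer smaller than len(frame).
    """
    ordered = frame.sort_values(order_col).reset_index(drop=True)
    development = ordered.iloc[:-n_holdout].copy()
    holdout = ordered.iloc[-n_holdout:].copy()
    return development, holdout


# Toy data (hypothetical values); column names follow the notebook.
toy = pd.DataFrame(
    {"invoiceId": [5, 1, 4, 2, 3], "facturaMorosa": [0, 1, 0, 1, 0]}
)
dev, holdout = temporal_holdout(toy, "invoiceId", n_holdout=2)
print(dev["invoiceId"].tolist(), holdout["invoiceId"].tolist())
```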

#### Feature selection

We use several methods to help us select the best variables.

##### Preprocessing

In [12]:
# Make a copy so the original dataset is not modified
df_feature_selection = df.copy()

In [13]:
# Separate the features from the target
x = df_feature_selection.drop(["facturaMorosa"], axis=1)
y = df_feature_selection["facturaMorosa"]

In [14]:
# Standardize the features
x_scaled = StandardScaler().fit_transform(x)
x_scaled = pd.DataFrame(x_scaled, columns=x.columns)
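
The cell above fits `StandardScaler` on every row, so the rows that later end up in the test split also contribute to the estimated mean and variance. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that ordering on synthetic data; the names `x_toy`, `y_toy` and the `stratify` argument are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target (assumption).
rng = np.random.default_rng(42)
x_toy = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(200, 3)), columns=["f1", "f2", "f3"]
)
y_toy = pd.Series(rng.integers(0, 2, size=200))

# Split first, keeping the class balance similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(x_tr)
x_tr_scaled = pd.DataFrame(
    scaler.transform(x_tr), columns=x_tr.columns, index=x_tr.index
)
x_te_scaled = pd.DataFrame(
    scaler.transform(x_te), columns=x_te.columns, index=x_te.index
)
print(x_tr_scaled.shape, x_te_scaled.shape)
```

Passing `index=` when rebuilding the frames keeps the scaled features aligned with the labels of `y_tr` and `y_te`.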

In [15]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.1, random_state=42
)

print("Shape X train:", x_train.shape)
print("Shape X test:", x_test.shape)
print("Shape y train:", y_train.shape)
print("Shape y test:", y_test.shape)

print("Positive class count in y train:", y_train.sum())
print("Positive class count in y test:", y_test.sum())

Shape X train: (2440, 16)
Shape X test: (272, 16)
Shape y train: (2440,)
Shape y test: (272,)
Positive class count in y train: 987
Positive class count in y test: 106


In [16]:
# Create a list to collect the results of the different algorithms used
score_df_list = []

##### Method 1: LassoCV

In [17]:
# Train the model
lasso_model = LassoCV(cv=5)
lasso_model.fit(x_train, y_train)

In [18]:
# Build the scores
lasso_df = pd.DataFrame(lasso_model.coef_, index=x_train.columns).rename(
    columns={0: "score_lasso"}
)

# Store the results
score_df_list.append(lasso_df)

lasso_df.head()

Unnamed: 0,score_lasso
businessAvgDiasMora,0.006442
businessNroFacturas,-0.036785
businessRatioMororsidad,0.050572
Clients12MonthsWsz,0.018137
diasActividadPagador,-0.087887
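
Because the features are standardized, Lasso coefficients can be compared by absolute value, and a coefficient of exactly zero means the L1 penalty dropped that feature. The sketch below shows one way to list the surviving features by magnitude. It runs on synthetic data from `make_classification`, not on the invoice dataset, and the names `X_toy`, `y_toy` and `kept` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the invoice data (assumption, not the real dataset).
X_raw, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(X_raw.shape[1])]
X_toy = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=cols)

# Same estimator as in the notebook, with a fixed seed for repeatability.
lasso = LassoCV(cv=5, random_state=0).fit(X_toy, y_toy)
coefs = pd.Series(lasso.coef_, index=cols, name="score_lasso")

# Features with a non-zero coefficient survive the L1 penalty;
# order them from largest to smallest absolute coefficient.
kept = coefs[coefs != 0]
kept = kept.reindex(kept.abs().sort_values(ascending=False).index)
print(kept)
```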


##### Method 2: RFE

In [19]:
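# Train the model
# Note: len(x_train.shape) is 2 (the number of array dimensions), so RFE keeps
# 2 features; with step=15 the other 14 are dropped in one pass and share rank 2.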
rfe_model = RFE(
    estimator=DecisionTreeClassifier(), n_features_to_select=len(x_train.shape), step=15
)
rfe_model.fit(x_train, y_train)

In [20]:
# Build the scores
rfe_df = pd.DataFrame(rfe_model.ranking_, index=x_train.columns).rename(
    columns={0: "rfe_score"}
)

# Store the results
score_df_list.append(rfe_df)

rfe_df.head()

Unnamed: 0,rfe_score
businessAvgDiasMora,2
businessNroFacturas,2
businessRatioMororsidad,2
Clients12MonthsWsz,2
diasActividadPagador,2
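
With the settings used above, RFE produces only two distinct ranks, so it cannot order the features it eliminates. If a full ordering is wanted, one option is to remove a single feature per iteration until one remains. The sketch below shows this on synthetic data; the values `n_features_to_select=1`, `step=1` and the fixed `random_state` are assumptions for illustration, not the notebook's configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the invoice data (assumption, not the real dataset).
X_raw, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(X_raw.shape[1])])

# Eliminate one feature per iteration until a single feature is left,
# so every feature receives a distinct rank (1 = eliminated last).
rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=42),
    n_features_to_select=1,
    step=1,
)
rfe.fit(X_toy, y_toy)

ranking = pd.Series(rfe.ranking_, index=X_toy.columns, name="rfe_rank")
print(ranking.sort_values())
```

A smaller `step` costs more model fits, which is usually acceptable with a feature count as small as the one in this notebook.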


##### Method 3: Feature importance

In [21]:
# Train the model
fi_model = DecisionTreeClassifier(random_state=42)
fi_model.fit(x_train, y_train)

In [22]:
# Build the scores
fi_df = pd.DataFrame(fi_model.feature_importances_, index=x_train.columns).rename(
    columns={0: "feature_importance"}
)

# Store the results
score_df_list.append(fi_df)

fi_df.head()

Unnamed: 0,feature_importance
businessAvgDiasMora,0.035076
businessNroFacturas,0.016763
businessRatioMororsidad,0.010807
Clients12MonthsWsz,0.049969
diasActividadPagador,0.097802


##### Summary table of the methods used

In [23]:
feature_selection_scores = pd.concat(score_df_list, axis=1).dropna()
feature_selection_scores.sort_values("feature_importance", ascending=False)

Unnamed: 0,score_lasso,rfe_score,feature_importance
payerAvgDiasMora,0.041313,1,0.169175
relationDays,-0.041688,1,0.103012
diasActividadPagador,-0.087887,2,0.097802
payerRatioMorosidad,0.086448,2,0.087841
issuerCancelledInvoices,0.011413,2,0.082218
relationRecurrence,0.017703,2,0.073024
issuerInvoicesAmountWsz,-0.012461,2,0.0675
payerDesvAmount,0.025664,2,0.06028
montoFacturaWsz,0.0,2,0.055307
Clients12MonthsWsz,0.018137,2,0.049969


In [24]:
feature_selection_scores.shape

(16, 3)
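
The three columns are on different scales: Lasso coefficients are signed, RFE ranks are better when lower, and tree importances are better when higher. One possible way to read them together, which is not something the notebook does, is to convert each column to a rank and average the ranks. The sketch below applies that idea to four rows copied from the table above.

```python
import pandas as pd

# Four rows copied from the summary table above.
scores = pd.DataFrame(
    {
        "score_lasso": [0.041313, -0.041688, -0.087887, 0.0],
        "rfe_score": [1, 1, 2, 2],
        "feature_importance": [0.169175, 0.103012, 0.097802, 0.055307],
    },
    index=[
        "payerAvgDiasMora",
        "relationDays",
        "diasActividadPagador",
        "montoFacturaWsz",
    ],
)

# Rank 1 is best in every column; ties receive the average of their ranks.
ranks = pd.DataFrame(
    {
        "lasso": scores["score_lasso"].abs().rank(ascending=False),
        "rfe": scores["rfe_score"].rank(ascending=True),
        "tree": scores["feature_importance"].rank(ascending=False),
    }
)
ranks["mean_rank"] = ranks.mean(axis=1)
print(ranks.sort_values("mean_rank"))
```

Averaging ranks weights the three methods equally, which is itself a modeling choice and should be treated as such.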

**Notes:**

Based on the earlier analysis and on the results of the feature selection methods used, we keep the following variables:

- `invoiceId`
- `relationDays`
- `relationRecurrence`
- `issuerCancelledInvoices`
- `diasActividadPagador`
- `facturaMorosa`
- `montoFacturaWsz`
- `issuerInvoicesAmountWsz`
- `Clients12MonthsWsz`
- `payerAmountFacturas`
- `payerRatioMorosidad`
- `payerAvgAmountFacturas`
- `payerAvgDiasMora`
- `payerDesvAmount`
- `businessRatioMororsidad`
- `businessAvgDiasMora`


### Saving the data

In [25]:
# Select the columns to use
columnas_seleccionadas = [
    "invoiceId",
    "businessId",
    "payerId",
    "relationDays",
    "relationRecurrence",
    "issuerCancelledInvoices",
    "diasActividadPagador",
    "facturaMorosa",
    "montoFacturaWsz",
    "issuerInvoicesAmountWsz",
    "Clients12MonthsWsz",
    "payerAmountFacturas",
    "payerRatioMorosidad",
    "payerAvgAmountFacturas",
    "payerAvgDiasMora",
    "payerDesvAmount",
    "businessRatioMororsidad",
    "businessAvgDiasMora",
]

df_final = df_final[columnas_seleccionadas].copy()
df_final.shape

(2962, 18)

In [26]:
df_final.to_csv("../data/processed/dataset_training_v0.csv", index_label=False)
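
Note that `index_label=False` only omits the header field for the index; the index values themselves are still written as the first column of every row. When such a file is read back, pandas sees one more field per data row than in the header and uses that first column as the index. The sketch below shows the difference from `index=False` using an in-memory buffer and a made-up two-row frame. Whether downstream notebooks rely on the saved index is not stated here, so this is an illustration rather than a recommended change.

```python
import io

import pandas as pd

# Toy frame (hypothetical); the non-default index mimics a filtered dataset.
toy = pd.DataFrame({"invoiceId": [11654, 13430]}, index=[2186, 2819])

# index_label=False: index values are written, but without a header field.
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index, index_label=False)
text_with_index = buf_with_index.getvalue()

# index=False: the index is not written at all.
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)
text_without_index = buf_without_index.getvalue()

# Reading the first variant back restores the old index implicitly.
roundtrip = pd.read_csv(io.StringIO(text_with_index))

print(text_with_index)
print(text_without_index)
print(roundtrip.index.tolist())
```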