## Feature Selection
The goal of this notebook is to generate new features or discard existing ones based on their importance. It also produces the dataset used for training.


#### Package imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [2]:
np.random.seed(79)

#### Data loading

In [3]:
# Load the features dataset
df = pd.read_csv("../data/intermediate/dataset_cursada_v0.csv")
df.shape

(174420, 33)

In [4]:
df.head()

Unnamed: 0,particion,periodo,cuatrimestre,legajo,course_name,nota_final_materia,split,al_promedio_general,al_tasa_aprobacion,al_promedio_parcial,...,avances_parcial,avances_recuperatorio,avances_integrador,promedio_tareas_tp,promedio_tareas_otros,promedio_tiempo_sub_tp,avances_tareas_tp,avances_tareas_otros,avances_tareas_extra,aprobo
0,0,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,1,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,2,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,3,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,4,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


#### Configuration

In [5]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

#### Processing

In [6]:
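# Flag rows that repeat an earlier row on every column except `particion`
df_columns = df.columns.tolist()
df_columns.remove("particion")

# Duplicates are not all removed: we draw a uniform random number per row and
# keep only the flagged duplicates whose draw exceeds 0.85 (roughly 15% of them).
# Note that rows not flagged as duplicates are also filtered out by this mask.
df["duplicado"] = df.duplicated(subset=df_columns)
df = df.loc[(df["duplicado"]) & (np.random.rand(len(df)) > 0.85)].copy()
df.drop(columns=["duplicado"], inplace=True)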

**Important:**

By default the dataset has little variability, so we remove partitions whose data are identical to other partitions. To avoid discarding that data entirely, we sample a fraction of the duplicated rows and keep them.
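To make the duplicate handling concrete, here is a minimal sketch on a toy frame (the column values and the `toy` name are made up for illustration). It shows what `DataFrame.duplicated` flags and contrasts the mask used in the notebook with one possible variant that also retains first occurrences; the variant is an assumption about a possible intent, not the notebook's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(79)

# Toy data: rows are identical except for `particion` (hypothetical values)
toy = pd.DataFrame(
    {
        "particion": range(8),
        "legajo": ["a"] * 4 + ["b"] * 4,
        "promedio": [7.0] * 4 + [5.0, 5.0, 6.0, 6.0],
    }
)

cols = [c for c in toy.columns if c != "particion"]

# keep="first" (the default): the first occurrence is NOT flagged
is_dup = toy.duplicated(subset=cols)
draw = rng.random(len(toy))

# Mask as in the notebook: only flagged duplicates with a high draw survive
only_sampled_dups = toy.loc[is_dup & (draw > 0.85)]

# Assumed variant: keep every first occurrence plus a sample of duplicates
firsts_plus_sampled = toy.loc[~is_dup | (draw > 0.85)]

print(is_dup.tolist())
print(len(only_sampled_dups), len(firsts_plus_sampled))
```

With the first mask, a row that was never duplicated cannot survive, which is worth keeping in mind when reading the row counts further down.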

In [7]:
df.head()

Unnamed: 0,particion,periodo,cuatrimestre,legajo,course_name,nota_final_materia,split,al_promedio_general,al_tasa_aprobacion,al_promedio_parcial,al_promedio_score_tp,al_promedio_tiempo_sub_tp,al_tasa_recuperadas,al_n_materias_periodo,cn_promedio_general,cn_tasa_aprobacion,cn_promedio_parcial,cn_promedio_score_tp,cn_promedio_tiempo_sub_tp,cn_tasa_recuperadas,promedio_parcial,promedio_recuperatorio,promedio_integrador,avances_parcial,avances_recuperatorio,avances_integrador,promedio_tareas_tp,promedio_tareas_otros,promedio_tiempo_sub_tp,avances_tareas_tp,avances_tareas_otros,avances_tareas_extra,aprobo
24,24,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,74.450813,21.634146,0.166667,6,8.0,1.0,8.5,79.438542,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,81.766667,0.0,33.0,0.375,0.0,0.0,1
25,25,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,74.450813,21.634146,0.166667,6,8.0,1.0,8.5,79.438542,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,81.766667,0.0,33.0,0.375,0.0,0.0,1
28,28,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,74.450813,21.634146,0.166667,6,8.0,1.0,8.5,79.438542,26.0,0.0,8.0,0.0,0.0,0.5,0.0,0.0,81.06,0.0,33.6,0.625,0.0,0.0,1
29,29,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,74.450813,21.634146,0.166667,6,8.0,1.0,8.5,79.438542,26.0,0.0,8.0,0.0,0.0,0.5,0.0,0.0,81.06,0.0,33.6,0.625,0.0,0.0,1
36,36,1-2022,1.0,002566SBS,assimilated exuding groupware,8.0,train,6.833333,1.0,6.333333,74.450813,21.634146,0.166667,6,8.0,1.0,8.5,79.438542,26.0,0.0,8.0,0.0,0.0,0.5,0.0,0.0,81.06,0.0,33.6,0.625,0.0,0.0,1


In [8]:
# Keep a copy to save at the end
df_train = df.copy()

In [9]:
# Keep only the training split
df = df.loc[df["split"] == "train"].drop(columns=["split"]).copy()
df.shape

(18733, 32)

In [10]:
# Drop the columns we will not use
df.drop(
    columns=["periodo", "legajo", "course_name", "nota_final_materia"], inplace=True
)

#### Feature selection

We try several approaches to estimating feature importance and use their results to decide which features to keep.

##### Preprocessing

In [11]:
# Work on a copy so the original dataset is not modified
df_feature_selection = df.copy()

In [12]:
# Separate the features from the target
x = df_feature_selection.drop(["aprobo"], axis=1)
y = df_feature_selection["aprobo"]

In [13]:
# Standardize the features
x_scaled = StandardScaler().fit_transform(x)
x_scaled = pd.DataFrame(x_scaled, columns=x.columns)

In [14]:
# Split the data
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.2, random_state=42
)

print("Shape X train:", x_train.shape)
print("Shape X test:", x_test.shape)
print("Shape y train:", y_train.shape)
print("Shape y test:", y_test.shape)

print("Positive class count, y train:", y_train.sum())
print("Positive class count, y test:", y_test.sum())

Shape X train: (14986, 27)
Shape X test: (3747, 27)
Shape y train: (14986,)
Shape y test: (3747,)
Positive class count, y train: 14494
Positive class count, y test: 3646
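The counts above show a heavily imbalanced target: about 96.7% positives in train (14494 of 14986) and about 97.3% in test (3646 of 3747). As a hedged aside, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below uses made-up toy arrays (`x_toy`, `y_toy`), not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 95 positives, 5 negatives (hypothetical data)
y_toy = np.array([1] * 95 + [0] * 5)
x_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 95/5 ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positives in train:", y_tr.sum(), "of", len(y_tr))
print("Positives in test:", y_te.sum(), "of", len(y_te))
```

Whether stratification matters here depends on how the downstream scores are used; with so few negatives, it mainly guards against a split that leaves one side with almost none.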


In [15]:
# List that collects the score tables produced by each method
score_df_list = []

##### Correlation

In [16]:
# Use the unstandardized variables to compute the correlation
# (the target is stored here under the column name "suscriptor")
df_feature_selection = x.copy()
df_feature_selection["suscriptor"] = y

In [17]:
# Compute the scores
correlation_df = df_feature_selection.corr()

In [18]:
# Keep only the column that corresponds to the target
correlation_df = (
    correlation_df.nlargest(len(correlation_df.columns), "suscriptor")["suscriptor"]
    .to_frame()
    .rename(columns={"suscriptor": "corr_score"})
)
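The correlation step computes a full correlation matrix and then keeps the target column. A shorter route to the same numbers, sketched here on synthetic data (the `features` and `target` names are illustrative), is `DataFrame.corrwith`, which correlates each feature with the target directly. Note that sorting by the signed value, as `nlargest` does, places strongly negative correlations at the bottom of the table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features; only f0 drives the binary target (hypothetical data)
features = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
target = (features["f0"] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Route 1: full correlation matrix, then pick the target column
with_target = features.copy()
with_target["target"] = target
via_corr = with_target.corr()["target"].drop("target")

# Route 2: correlate each feature with the target directly
via_corrwith = features.corrwith(target)

corr_scores = via_corrwith.sort_values(ascending=False).to_frame("corr_score")
print(corr_scores)
```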

In [19]:
# Store the result
score_df_list.append(correlation_df)

##### Lasso

In [20]:
# Fit the model
lasso_model = LassoCV(cv=5)
lasso_model.fit(x_train, y_train)

In [21]:
# Compute the scores
lasso_df = pd.DataFrame(lasso_model.coef_, index=x_train.columns).rename(
    columns={0: "score_lasso"}
)
lasso_df.head()

Unnamed: 0,score_lasso
particion,-0.014394
cuatrimestre,0.000544
al_promedio_general,0.004503
al_tasa_aprobacion,0.084702
al_promedio_parcial,-0.00607

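The Lasso scores are the fitted coefficients of an L1-penalized linear regression on the 0/1 target, so a coefficient at or near zero means the penalty has effectively dropped that feature. The sketch below illustrates that behavior on synthetic data with a fixed `alpha` (an assumption chosen for illustration; the notebook lets `LassoCV` choose it by cross-validation).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Two standardized-like features; only the first one influences y
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)

# alpha controls the strength of the L1 penalty (value chosen for the sketch)
model = Lasso(alpha=0.1).fit(X, y)

# The irrelevant feature is shrunk to exactly zero; the relevant one is
# shrunk slightly toward zero but stays large
print(model.coef_)
```

Because the coefficients depend on feature scale, comparing them across features is only meaningful when the inputs are standardized, as they are in this notebook.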

In [22]:
# Store the results
score_df_list.append(lasso_df)

##### Recursive feature elimination (RFE)

In [23]:
# Fit the model
# Note: len(x_train.shape) evaluates to 2, so RFE keeps two features (rank 1)
rfe_model = RFE(
    estimator=LogisticRegression(), n_features_to_select=len(x_train.shape), step=1
)
rfe_model.fit(x_train, y_train)

In [24]:
# Compute the scores
rfe_df = pd.DataFrame(rfe_model.ranking_, index=x_train.columns).rename(
    columns={0: "rfe_score"}
)
rfe_df.head()

Unnamed: 0,rfe_score
particion,9
cuatrimestre,22
al_promedio_general,1
al_tasa_aprobacion,2
al_promedio_parcial,7
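Unlike the other scores, the RFE ranking is better when lower: rank 1 marks the features that were kept, and larger ranks mark features eliminated earlier. A small sketch on synthetic data (generated with `make_classification`, purely illustrative) shows the shape of `ranking_` when two features are selected with `step=1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem with 6 features (hypothetical data)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000), n_features_to_select=2, step=1
).fit(X, y)

ranking = rfe.ranking_  # 1 = selected; 2, 3, ... = eliminated, last to first
selected = np.flatnonzero(rfe.support_)  # positions of the kept features

print("ranking:", ranking)
print("selected feature positions:", selected)
```

With `step=1`, one feature is removed per iteration, so every eliminated feature receives a distinct rank.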


In [25]:
# Store the results
score_df_list.append(rfe_df)

##### Select K Best

In [26]:
# Fit the selector
kbest_model = SelectKBest(score_func=f_classif, k=len(x_train.columns))
kbest_model.fit(x_train, y_train)

In [27]:
kbest_df = pd.DataFrame(kbest_model.scores_, index=x_train.columns).rename(
    columns={0: "f_classif_score"}
)
kbest_df.head()

Unnamed: 0,f_classif_score
particion,2.498532
cuatrimestre,1.805451
al_promedio_general,2152.454011
al_tasa_aprobacion,7183.507654
al_promedio_parcial,525.782142
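Since `k` is set to the total number of columns, `SelectKBest` does not discard anything here; it only serves to compute the ANOVA F-statistic per feature. As a sketch on synthetic data, the same scores can be obtained by calling `f_classif` directly, which also returns p-values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification problem (hypothetical data)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# k="all" keeps every feature, so the selector only computes scores
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)

# Direct call: returns the F-statistics and their p-values
f_scores, p_values = f_classif(X, y)

print(np.round(selector.scores_, 3))
print(np.round(p_values, 4))
```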


In [28]:
# Store the results
score_df_list.append(kbest_df)

##### Feature importance

In [29]:
# Fit the model
fi_model = DecisionTreeClassifier(random_state=42)
fi_model.fit(x_train, y_train)

In [30]:
fi_df = pd.DataFrame(fi_model.feature_importances_, index=x_train.columns).rename(
    columns={0: "feature_importance"}
)
fi_df.head()

Unnamed: 0,feature_importance
particion,0.005028
cuatrimestre,0.0
al_promedio_general,0.064632
al_tasa_aprobacion,0.268327
al_promedio_parcial,0.027873
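The tree-based importances are normalized impurity reductions: they are non-negative and sum to 1 across features, so each value can be read as a share of the total. A feature with importance 0.0, such as `cuatrimestre` above, was never used in a split. The sketch below checks these properties on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem (hypothetical data)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
importances = tree.feature_importances_

print(np.round(importances, 4))
print("sum:", importances.sum())
```

A single unpruned tree can give unstable importances; this is one reason the notebook compares them against several other scores.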


In [31]:
# Store the results
score_df_list.append(fi_df)

#### Summary table of feature selection scores

In [32]:
feature_selection_scores = pd.concat(score_df_list, axis=1).dropna()
feature_selection_scores["corr_score"] = feature_selection_scores["corr_score"].round(4)
feature_selection_scores.sort_values("feature_importance", ascending=False)

Unnamed: 0,corr_score,score_lasso,rfe_score,f_classif_score,feature_importance
al_tasa_aprobacion,0.5597,0.084702,2.0,7183.507654,0.268327
cn_tasa_aprobacion,0.4284,0.056294,1.0,3401.547519,0.257708
cn_promedio_tiempo_sub_tp,0.0206,0.00259,13.0,4.827655,0.1012
al_promedio_general,0.3456,0.004503,1.0,2152.454011,0.064632
al_promedio_score_tp,0.1231,-0.001493,25.0,266.899626,0.046938
al_n_materias_periodo,0.0193,0.00872,24.0,5.584685,0.0432
cn_promedio_parcial,0.0029,0.001036,15.0,0.06215,0.03626
al_tasa_recuperadas,-0.1426,-0.000275,23.0,317.458183,0.034622
al_promedio_parcial,0.1775,-0.00607,7.0,525.782142,0.027873
cn_promedio_general,0.2076,-0.008085,21.0,682.815031,0.025872
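The five scores live on different scales and directions (for `rfe_score`, lower is better), which makes the table hard to read at a glance. One possible way to compare them, sketched here as an assumption rather than the procedure used in this notebook, is to convert each column to a rank and average the ranks. The three rows below are copied from the table above.

```python
import pandas as pd

# Three rows copied from the summary table
scores = pd.DataFrame(
    {
        "corr_score": [0.5597, 0.4284, 0.0206],
        "score_lasso": [0.084702, 0.056294, 0.00259],
        "rfe_score": [2.0, 1.0, 13.0],
        "f_classif_score": [7183.507654, 3401.547519, 4.827655],
        "feature_importance": [0.268327, 0.257708, 0.1012],
    },
    index=["al_tasa_aprobacion", "cn_tasa_aprobacion", "cn_promedio_tiempo_sub_tp"],
)

# Rank 1 = most relevant under each method
ranks = pd.DataFrame(
    {
        "corr": scores["corr_score"].abs().rank(ascending=False),
        "lasso": scores["score_lasso"].abs().rank(ascending=False),
        "rfe": scores["rfe_score"].rank(ascending=True),  # lower is better
        "f_classif": scores["f_classif_score"].rank(ascending=False),
        "importance": scores["feature_importance"].rank(ascending=False),
    }
)

mean_rank = ranks.mean(axis=1).sort_values()
print(mean_rank)
```

Averaging ranks weights every method equally, which may or may not be appropriate; the final choice in this notebook also relies on domain knowledge.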


**Important:**

After trying several feature selection methods, and also drawing on domain knowledge, we selected a set of variables that should work well for the stated objective. The aim is also to reduce model complexity in order to avoid overfitting as far as possible.

In [33]:
# Select the columns to drop
features_a_excluir = [
    "periodo",
    "legajo",
    "course_name",
    "nota_final_materia",
    "avances_tareas_tp",
    "avances_integrador",
    "avances_recuperatorio",
    "cuatrimestre",
    "promedio_tareas_otros",
    "avances_tareas_extra",
    "avances_parcial",
    "avances_tareas_otros",
]

df_train.drop(columns=features_a_excluir, inplace=True)

In [34]:
df_train.columns

Index(['particion', 'split', 'al_promedio_general', 'al_tasa_aprobacion',
       'al_promedio_parcial', 'al_promedio_score_tp',
       'al_promedio_tiempo_sub_tp', 'al_tasa_recuperadas',
       'al_n_materias_periodo', 'cn_promedio_general', 'cn_tasa_aprobacion',
       'cn_promedio_parcial', 'cn_promedio_score_tp',
       'cn_promedio_tiempo_sub_tp', 'cn_tasa_recuperadas', 'promedio_parcial',
       'promedio_recuperatorio', 'promedio_integrador', 'promedio_tareas_tp',
       'promedio_tiempo_sub_tp', 'aprobo'],
      dtype='object')

#### Saving the data

In [35]:
df_train.to_csv("../data/processed/dataset_training_v0.csv", index_label=False)