# Feature Selection

The purpose of this notebook is to reduce the number of features in our model. We use scikit-learn's [recursive feature elimination](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) (RFE) to determine which subset of our top 24 features (listed in `features.py`) should be carried forward into model development.

#### Environment setup

This notebook should run against our general-purpose `eda` environment.

In [None]:
import pandas as pd
from optbinning import BinningProcess
from optbinning.scorecard import plot_auc_roc, plot_cap, plot_ks
from sklearn.cluster import KMeans
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pd.options.display.max_columns = None



## 1. Get the data

Let's take a look at our raw data.

In [None]:
data_path = "/home/modelling/users-workspace/nsofinij/lab/mlzc/e2eML/data/transform-data.parquet"

raw_data = pd.read_parquet(data_path)#.drop(columns = ["customerid", "Unnamed: 0"])
raw_data.head()

In [None]:
raw_data.max()

In [None]:
def get_iv_from_binning_obj(path):
    iv_table = BinningProcess.load(path).summary()
    iv_table["iv"] = iv_table["iv"].astype("float").round(3)
    return iv_table[["iv", "name", "n_bins"]]

In [None]:
path_bin_obj = "/home/modelling/users-workspace/nsofinij/lab/mlzc/e2eML/data/binning-transformer.pkl"
iv_table = get_iv_from_binning_obj(path_bin_obj)

In [None]:
def filter_iv_table(iv_table, iv_cutoff=0.02, min_n_bins=2):
    # Filter based on IV and min_number of bins
    return iv_table.query(f"n_bins >= {min_n_bins} and iv >= {iv_cutoff}").name.values

In [None]:
modelling_features = filter_iv_table(iv_table, iv_cutoff=0.02, min_n_bins=2)

In [None]:
len(modelling_features)
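
For context on the `iv_cutoff=0.02` threshold: information value (IV) summarises how well a binned feature separates events from non-events. The sketch below computes it from per-bin counts using the usual weight-of-evidence definition. The counts and the helper name `information_value` are made up for illustration; this is not how `BinningProcess` is implemented internally.

```python
import numpy as np

def information_value(non_events, events):
    """IV of one binned feature from per-bin non-event and event counts."""
    non_events = np.asarray(non_events, dtype=float)
    events = np.asarray(events, dtype=float)
    # Share of all non-events / events that fall in each bin
    p_non_event = non_events / non_events.sum()
    p_event = events / events.sum()
    # Weight of evidence per bin (assumes no bin has a zero count)
    woe = np.log(p_non_event / p_event)
    return float(np.sum((p_non_event - p_event) * woe))

# Hypothetical counts for a three-bin feature
iv = information_value(non_events=[400, 350, 250], events=[20, 30, 50])
print(round(iv, 3))
```

A feature whose bins have identical event and non-event shares has an IV of zero, which is why very low-IV features are filtered out before clustering.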

In [None]:
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt


In [None]:

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 15)

# Transpose so that each row is a feature: we cluster features, not observations
X = raw_data[modelling_features].T

print(X.shape)

for k in K:
    # Build and fit the model
    kmeanModel = KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance from each row to its nearest centroid
    distortion = np.min(
        cdist(X, kmeanModel.cluster_centers_, "euclidean"), axis=1
    ).mean()

    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

In [None]:
frame = pd.DataFrame({'Cluster':K, 'SSE':inertias})
plt.figure(figsize=(12,6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
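
Elbow plots are read by eye, so it can help to look at the numbers behind the curve as well. The sketch below is a minimal illustration on synthetic data (`make_blobs` stands in for the transposed feature matrix, and all names are hypothetical): it prints how much of the remaining inertia each extra cluster removes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the transposed feature matrix: 60 points, 3 true groups
X_demo, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

ks = list(range(1, 8))
inertias_demo = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    for k in ks
]

# Fraction of the remaining inertia removed by each extra cluster
relative_drop = -np.diff(inertias_demo) / np.array(inertias_demo[:-1])
for k, drop in zip(ks[1:], relative_drop):
    print(f"k={k}: inertia falls by {drop:.0%}")
```

Fixing `random_state` and `n_init` keeps the curve reproducible between runs, which makes the choice of cluster count easier to defend.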

In [None]:
X = raw_data[modelling_features].T
kmeans = KMeans(n_clusters = 8, init='k-means++')
kmeans.fit(X)


In [None]:
cluster_class = pd.DataFrame(
    {
        "feature": modelling_features, 
        "cluster": kmeans.predict(X)
        }
    ).sort_values(by="cluster")
cluster_class

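Explore the clusters. For each feature we compute its squared correlation with its own cluster centroid (`r_square_own`) and with the best-correlated other centroid (`r_square_nc`), then take the ratio (1 - `r_square_own`) / (1 - `r_square_nc`). A low ratio means the feature is well represented by its own cluster and distinct from the others.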

In [None]:
r_square_ratio = []
r_square_own = []
r_square_nc = []
for i, l in enumerate(kmeans.labels_):
    centroid = kmeans.cluster_centers_[l]
    # Squared correlation between the feature and its own cluster centroid
    rsq_own = np.corrcoef(X.values[i], centroid)[0, 1] ** 2
    # Highest squared correlation with any other cluster's centroid
    rsq_nc = np.max(
        [
            np.corrcoef(X.values[i], kmeans.cluster_centers_[j])[0, 1] ** 2
            for j in set(kmeans.labels_)
            if j != l
        ]
    )
    r_square_own.append(rsq_own)
    r_square_nc.append(rsq_nc)
    r_square_ratio.append((1 - rsq_own) / (1 - rsq_nc))
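
The same ratio can be wrapped in a small helper and checked on data where the answer is known. This is a minimal sketch under stated assumptions: the helper name `one_minus_rsq_ratio` and the synthetic "features" (noisy copies of two latent signals) are hypothetical and only illustrate the calculation.

```python
import numpy as np
from sklearn.cluster import KMeans

def one_minus_rsq_ratio(rows, labels, centers):
    """(1 - R2 with own centroid) / (1 - R2 with the best other centroid)."""
    ratios = []
    for row, own in zip(rows, labels):
        rsq = np.array([np.corrcoef(row, c)[0, 1] ** 2 for c in centers])
        rsq_own = rsq[own]
        rsq_nearest = np.delete(rsq, own).max()
        ratios.append((1 - rsq_own) / (1 - rsq_nearest))
    return np.array(ratios)

rng = np.random.default_rng(0)
# Two latent signals; each "feature" (row) is a noisy copy of one of them
signals = rng.normal(size=(2, 200))
rows = np.vstack([signals[i % 2] + 0.1 * rng.normal(size=200) for i in range(6)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(rows)
ratios = one_minus_rsq_ratio(rows, km.labels_, km.cluster_centers_)
print(ratios.round(3))
```

Within each cluster, the feature with the lowest ratio is a natural candidate to represent that cluster.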

In [None]:
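# cluster_class was sorted by cluster, but the R-squared lists follow the original
# feature order, so restore that order before assigning them positionally
t3 = cluster_class.sort_index().assign(
    rsq_ratio=r_square_ratio,
    r_square_own=r_square_own,
    r_square_nc=r_square_nc
    ).sort_values(by=["cluster", "rsq_ratio"]).round(2)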

In [None]:
t3.head()

In [None]:
iv_table.head()

In [None]:
rsq_iv_table = pd.merge(
    t3,
    iv_table.rename(columns={"name":"feature"}),
    on="feature"
)
rsq_iv_table

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

In [None]:
linkage_data = linkage(X, method='ward', metric='euclidean')

In [None]:
dendrogram(linkage_data)
plt.show()
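
A dendrogram becomes a concrete grouping once the tree is cut at a chosen height or cluster count. The sketch below shows one way to do that with SciPy's `fcluster`, using the same Ward linkage on synthetic data; the data and the choice of three clusters are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic stand-in: three tight groups of five points each
groups = [rng.normal(loc=c, scale=0.2, size=(5, 4)) for c in (0.0, 5.0, 10.0)]
X_demo = np.vstack(groups)

Z = linkage(X_demo, method="ward", metric="euclidean")
# Cut the tree so that at most three flat clusters remain (labels start at 1)
flat_labels = fcluster(Z, t=3, criterion="maxclust")
print(flat_labels)
```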

In [None]:
agc_clustering = AgglomerativeClustering(n_clusters=5).fit(X)

In [None]:
agc_clustering.labels_

In [None]:
# `pred` was undefined; use the k-means labels fitted above
t = pd.DataFrame(
    {
        "feature": modelling_features,
        "km_cluster": kmeans.labels_,
        "ag_clusters": agc_clustering.labels_,
    }
).sort_values(by="km_cluster")

In [None]:
t["ag_clusters2"] = t["ag_clusters"].replace({3:0, 4:1, 0:2, 2:4})
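
Cluster IDs are arbitrary, so a hand-written relabelling like the one above only holds for one particular run. As a hedged alternative, the sketch below compares two clusterings in a label-invariant way with a cross-tabulation and scikit-learn's adjusted Rand index. The label lists are hypothetical, not taken from this notebook's data.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels from two clusterings of the same eight features
km_labels = [0, 0, 1, 1, 2, 2, 2, 1]
ag_labels = [2, 2, 0, 0, 1, 1, 1, 1]

# Rows: k-means clusters, columns: agglomerative clusters
print(pd.crosstab(pd.Series(km_labels, name="km_cluster"),
                  pd.Series(ag_labels, name="ag_cluster")))

# 1.0 means identical partitions regardless of how the clusters are numbered
ari = adjusted_rand_score(km_labels, ag_labels)
print(round(ari, 3))
```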

## 2. Logistic Regression

In [None]:
logistic_regression = LogisticRegression(
    C=3, max_iter=1000, random_state=42
)

binning_process = util.setup_binning(
    model_data,
    features=features.all_features,
    params=features.binning_params
    )

## 3. Get a working scorecard

In [None]:
scorecard = util.estimator(binning_process, method=logistic_regression)

target = "B1_DEFLT_IN_12MO_PERF_WNDW_IND"
X = model_data.drop(target, axis=1)
y = model_data[target].astype('int8')

X.fillna(0, inplace=True)

scorecard.fit(X, y)

Our initial scorecard table is as follows:

In [None]:
t = scorecard.table(style="detailed").round(3)
t.head()

The IV values for our features are as follows:

In [None]:
t.groupby('Variable')['IV'].sum().sort_values()

## 4. Selecting the best features using RFE

In [None]:
binning_process.fit(X, y)

X_transform = binning_process.transform(X)

binning_logreg_estimator = Pipeline(
    steps=[("binning_process", binning_process), ("regressor", LogisticRegression())]
)
binning_logreg_estimator.fit(X, y)

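We use recursive feature elimination (RFE): the estimator is fitted repeatedly, and at each step the feature with the smallest coefficient magnitude is dropped until `n_features_to_select` features remain (four here).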

In [None]:
rfe = RFE(
    estimator=LogisticRegression(), n_features_to_select=4
)
rfe.fit(X_transform, y)
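
To make the elimination order visible, a fitted `RFE` exposes `support_` and `ranking_`. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, standing in for `X_transform` and `y`).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 4 are informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

rfe_demo = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    step=1,  # drop one feature per iteration
)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)   # boolean mask of the retained features
print(rfe_demo.ranking_)   # 1 = retained; larger = eliminated earlier
```

Because the ranking depends on coefficient magnitudes, the inputs should be on comparable scales, which the binning transform is expected to provide in this notebook.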

In [None]:
feature_pipeline = Pipeline(
    steps=[("rfe", rfe), ("regressor", logistic_regression)]
)
feature_pipeline.fit(X_transform, y)

Let's see what features were selected.

In [None]:
feature_pipeline[:-1].get_feature_names_out()

## 5. Create a model with RFE selected features

Below, we repeat the process for creating a logistic regression model, this time using only the features selected in the previous section.

In [None]:
rfe_selected_features = feature_pipeline[:-1].get_feature_names_out()
rfe_subset_data = model_data[list(rfe_selected_features) + [target]]

log_reg = LogisticRegression(
    C=3, max_iter=1000, random_state=42
)

rfe_binning_params = {
    col: values for col, values in features.binning_params.items() if col in rfe_selected_features
}

binning_process_rfe = util.setup_binning(
    rfe_subset_data,
    features = list(rfe_selected_features),
    params=rfe_binning_params
    )

scorecard_rfe = util.estimator(binning_process_rfe,
                      method=log_reg)

target = "B1_DEFLT_IN_12MO_PERF_WNDW_IND"
X_rfe = rfe_subset_data.drop(target, axis=1)
y_rfe = rfe_subset_data[target].astype('int8')

X_rfe.fillna(0, inplace=True)

scorecard_rfe.fit(X_rfe, y_rfe)

In [None]:
t=scorecard_rfe.table(style="detailed").round(3)
t.head()

Let's see the features and information values.

In [None]:
t.groupby('Variable')['IV'].sum().sort_values()

## 6. Evaluation

In [None]:
# Score with the same feature subset the RFE scorecard was fitted on
y_pred = scorecard_rfe.predict_proba(X_rfe)[:, 1]
plot_auc_roc(y_rfe, y_pred)

In [None]:
plot_cap(y_rfe, y_pred)

In [None]:
plot_ks(y_rfe, y_pred)
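
The plots above are drawn on the data the scorecard was fitted on, so they are likely to look optimistic. As a hedged sketch of a complementary check, the code below computes AUC, Gini and KS numerically on a held-out split. It uses synthetic data and a plain `LogisticRegression` as stand-ins for the scorecard, which are assumptions made only for illustration.

```python
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, scores)
# KS: largest gap between the score distributions of the two classes
ks = ks_2samp(scores[y_test == 1], scores[y_test == 0]).statistic
print(f"AUC={auc:.3f}, Gini={2 * auc - 1:.3f}, KS={ks:.3f}")
```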