# **Feature Selection**
Here we explore methodologies for identifying which features are useful and give the model higher predictive power. Given a dataset, a model trained on it can depend on the original features directly or on derived features. How do we tell which features are the most useful? Multiple approaches exist, ranging from simple univariate analysis to complex multivariate analysis. In univariate analysis we look at how a single feature contributes to the model. Although useful, this has pitfalls, as some features only work well together. In multivariate analysis we can tell which features perform well and, more importantly, which perform well together. The techniques are differentiated by how information is extracted. When the data contains labels, as is the case here, we use supervised techniques; unsupervised techniques can be used for unlabelled data.
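To make the univariate pitfall concrete, the following is a minimal sketch on synthetic data (not the dataset used in this notebook). The label is the XOR of two binary features, so each feature looks useless on its own while the pair is fully predictive. The names `X_demo`, `y_demo` and the choice of a decision tree are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two independent binary features; the synthetic label is their XOR
X_demo = rng.integers(0, 2, size=(2000, 2))
y_demo = X_demo[:, 0] ^ X_demo[:, 1]

# Univariate view: mutual information of each feature with the label, one at a time
mi = mutual_info_classif(X_demo, y_demo, discrete_features=True, random_state=0)

# Multivariate view: compare a model on one feature with a model on both
tree = DecisionTreeClassifier(random_state=0)
acc_single = cross_val_score(tree, X_demo[:, [0]], y_demo, cv=5).mean()
acc_both = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print('mutual information per feature:', mi)
print('accuracy with one feature:', acc_single)
print('accuracy with both features:', acc_both)
```

A univariate filter would discard both features here, whereas a method that evaluates features jointly would keep them.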

As noted in the [source page](https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/), these techniques can be classified as follows:
- **Filter methods:** Based on properties of the features, highlighted via univariate analysis.
- **Wrapper methods:** With a specific learning algorithm, these methods perform a greedy search for the best features by fitting models on candidate subsets of features and assessing each subset by training and evaluating a classifier on it.
- **Embedded methods:** These aim to combine the strengths of both filter and wrapper methods while maintaining a reasonable computational cost.
- **Hybrid methods:** These select features via a global transformation that reduces the data to a desired number of dimensions. The new features can bear little or no resemblance to the initial features.
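The first three categories can be sketched with scikit-learn selectors. This is an illustration on synthetic data, not the selection used later in this notebook. The specific choices (an ANOVA F-test as the filter, recursive feature elimination as the wrapper, random-forest importances as the embedded method) and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, of which 5 are informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Filter: score each feature on its own with a univariate test, keep the top 5
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# Wrapper: repeatedly fit a model and greedily drop the weakest feature
wrapper_sel = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=5
).fit(X_demo, y_demo)

# Embedded: selection falls out of training, here via tree-based importances
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold='median'
).fit(X_demo, y_demo)

for name, sel in [('filter', filter_sel), ('wrapper', wrapper_sel), ('embedded', embedded_sel)]:
    print(name, sel.get_support(indices=True))
```

The three selectors need not agree; comparing the selected indices is a quick way to see which features are robustly chosen.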



In [None]:
import pandas as pd
import numpy as np
import saspy
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## 6 Hybrid methods
As mentioned, these are methods that transform the data into a completely different vector space; the result bears little or no resemblance to the original data yet carries much of the same information. They are commonly referred to as dimensionality reduction methods, and are viewed as __feature engineering__.
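As a small illustration of "different vector space, same information", the sketch below applies PCA to synthetic data. With all components kept, the transform is invertible; with fewer components, some information is lost. The array names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 4 columns, the last one strongly correlated with the first
X_demo = rng.normal(size=(200, 4))
X_demo[:, 3] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Keep every component: the new features are linear mixes of the old ones
pca_full = PCA(n_components=4).fit(X_demo)
Z_full = pca_full.transform(X_demo)
X_back = pca_full.inverse_transform(Z_full)
print('lossless with all components:', np.allclose(X_demo, X_back))

# Keep fewer components: a compressed representation with some reconstruction error
pca_small = PCA(n_components=2).fit(X_demo)
X_approx = pca_small.inverse_transform(pca_small.transform(X_demo))
print('mean squared reconstruction error:', np.mean((X_demo - X_approx) ** 2))

# Each row shows how one new feature mixes the four original columns
print(pca_full.components_.round(2))
```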

In [None]:
%run ../src/data_utils.py

In [None]:
sess = saspy.SASsession(
        cfgfile=mk.saspy_file_path,
        cfgname=mk.saspy_cfgname
    )

In [None]:
dataset2 = mk.dataset2

In [None]:
sess.saslib(dataset2['lib_name'], path=dataset2['path'])
lgd_data = sess.sd2df(dataset2['table_name'], libref=dataset2['lib_name'], method="CSV")

In [None]:
lgd_data.head(2)

In [None]:
lgd_data.shape

In [None]:
lgd_data.columns = [col.lower() for col in lgd_data.columns]
lgd_data.head(2)

In [None]:
import category_encoders as ce

In [None]:
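# Column names were lower-cased above, so the target column is 'lgd_bad_ind'
categorical_columns = lgd_data.select_dtypes(include=['object', 'category']).columns.values
encoder = ce.TargetEncoder()
lgd_data_cat = encoder.fit_transform(lgd_data[categorical_columns], lgd_data['lgd_bad_ind'])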

In [None]:
processed_data = lgd_data.copy()
processed_data[categorical_columns] = lgd_data_cat

In [None]:
# with open('../data/lgd_data.pkl', 'wb') as f:
#     pickle.dump(lgd_data, f)

In [None]:
import pickle
with open('../data/lgd_data.pkl', 'rb') as f:
    lgd_data = pickle.load(f)

In [None]:
pd.options.display.max_columns = None
lgd_data.columns = [column.lower() for column in lgd_data.columns]
lgd_data.head()

Split the dataset into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    lgd_data.drop('lgd_bad_ind', axis=1), lgd_data.lgd_bad_ind, test_size=0.2, random_state=42, stratify=lgd_data.lgd_bad_ind)

In [None]:
# X_train.columns = [column.lower() for column in X_train.columns]
# X_test.columns = [column.lower() for column in X_train.columns]

In [None]:
# Identify categorical and numerical columns in the training set
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.values
numerical_features = X_train.select_dtypes(include=np.number).columns.values

### 6.1 Principal Component Analysis

Create the feature matrix

In [None]:
# Feature matrix and class label
cols_to_drop = ['naics_industry_cd']
X, y = X_train.drop(cols_to_drop, axis = 1), y_train

Transformation pipeline
1. Impute missing values

In [None]:
# Import custom classes
%run ../src/data_utils.py
%run ../src/imputer.py
%run ../src/transforms.py

In [None]:
# Instantiate the classes
transfxn = TransformationPipeline()
imputer = DataFrameImputer()

In [None]:
# Fit transform the training set
X_imputed = imputer.fit_transform(X)

2. Pre-processing

In [None]:
# Transform and scale data
X_scaled, _, feat_nm = transfxn.preprocessing(X_imputed, X_imputed)

In [None]:
print('Data size after pre-processing:', X_scaled.shape)

PCA plot

In [None]:
pcs_data = transfxn.pca_plot_labeled(X_scaled, y, palette = ['b', 'r'])

In [None]:
import matplotlib.pyplot as plt

# Cumulative share of variance explained as components are added
plt.plot(np.cumsum(pcs_data[0].explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
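The cumulative curve is typically used to pick the number of components that retains a target share of the variance. The sketch below shows one way to do that on synthetic data; the 95% threshold, the synthetic dataset, and the variable names are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix
X_demo, _ = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit a full PCA and read the threshold off the cumulative curve
pca = PCA().fit(X_demo)
cum_var = np.cumsum(pca.explained_variance_ratio_)
k_95 = int(np.searchsorted(cum_var, 0.95) + 1)  # smallest k reaching 95%

# scikit-learn shortcut: a float in (0, 1) is read as a variance target
pca_95 = PCA(n_components=0.95).fit(X_demo)

print('components needed for 95% variance:', k_95)
print('components chosen by PCA(n_components=0.95):', pca_95.n_components_)
```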

K-Means clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
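# pcs_data[0] is the fitted PCA object (see explained_variance_ratio_ above),
# so project the data onto the principal components before clustering.
# Assumption: X_scaled is the matrix the PCA was fitted on.
X_pca = pcs_data[0].transform(X_scaled)
labels = KMeans(n_clusters=6, random_state=0).fit_predict(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels,
            s=50, cmap='viridis')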

### 6.2 Singular Value Decomposition
This is also a form of feature engineering. SVD is commonly used when data is sparse; it projects the data from a high-dimensional space onto a handful of dimensions. Since we are applying one-hot encoding, the data will contain many zeros, which makes SVD appropriate here.
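The sparsity argument can be sketched as follows: one-hot encoding two synthetic categorical columns gives a mostly-zero matrix, which `TruncatedSVD` reduces without densifying it first. The column names `industry` and `region` and the data are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical categorical columns with many levels
df_demo = pd.DataFrame({
    'industry': rng.choice([f'ind_{i}' for i in range(30)], size=1000),
    'region': rng.choice([f'reg_{i}' for i in range(20)], size=1000),
})

# OneHotEncoder returns a SciPy sparse matrix by default
X_sparse = OneHotEncoder(handle_unknown='ignore').fit_transform(df_demo)
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])

# TruncatedSVD works directly on sparse input
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print('one-hot shape:', X_sparse.shape, 'density:', round(density, 3))
print('reduced shape:', X_reduced.shape)
print('variance retained:', svd.explained_variance_ratio_.sum())
```

Unlike `PCA`, `TruncatedSVD` does not centre the data, which is what lets it keep the input sparse.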

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
import category_encoders as ce

In [None]:
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.values
numerical_features = X_train.select_dtypes(include=np.number).columns.values

In [None]:
#working with numerical data
X = lgd_data.drop('lgd_bad_ind', axis=1)
Y = lgd_data.lgd_bad_ind
numerical_columns = X.select_dtypes(include=np.number).columns.values
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.values

In [None]:
encoder = ce.TargetEncoder()
X_cat = encoder.fit_transform(X[categorical_columns], Y)
X[categorical_columns] = X_cat

In [None]:

# define the pipeline
steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X.fillna(0), Y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

### 6.3 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) seeks to best separate (or discriminate) the samples in the training dataset by their class value. Because it uses the class labels, it is a supervised technique.
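One practical consequence of LDA being class-driven is that it can produce at most `n_classes - 1` components (and no more than the number of features). The sketch below illustrates this on synthetic data; the datasets and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes: LDA can return at most one discriminant axis
X2, y2 = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_classes=2, random_state=0
)
Z2 = LinearDiscriminantAnalysis(n_components=1).fit_transform(X2, y2)

# Four classes: up to three discriminant axes are available
X4, y4 = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_classes=4,
    n_clusters_per_class=1, random_state=0
)
Z4 = LinearDiscriminantAnalysis(n_components=3).fit_transform(X4, y4)

print('binary target:', Z2.shape)
print('four-class target:', Z4.shape)

# Asking for more than n_classes - 1 components is rejected by recent scikit-learn versions
try:
    LinearDiscriminantAnalysis(n_components=5).fit(X2, y2)
except ValueError as err:
    print('rejected:', err)
```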

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

In [None]:

# define the pipeline
# LDA yields at most n_classes - 1 components; assuming lgd_bad_ind is binary, that is 1
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', GaussianNB())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X.fillna(0), Y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

In [None]:
y = 'LGD_bad_ind'