<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 6.5
## Feature Selection

### Data

**Predict the onset of diabetes based on diagnostic measures.**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

[Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database/download)

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

#### 1. Load Data

In [3]:
# Read Data
diabetes_csv = 'diabetes.csv'
diabetes = pd.read_csv(diabetes_csv)
diabetes.head()

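| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |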


#### 2. Perform EDA

Explore the data, check for null values, and impute if necessary.

In [4]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [7]:
diabetes.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
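
`isnull()` only counts NaN entries. The `head()` output above already shows zeros in `SkinThickness` and `Insulin`, and a zero is not a plausible measurement for those columns. The following is a minimal sketch, assuming that a zero in the listed columns stands for "not measured" (an assumption, not something the notebook states). It uses a small stand-in frame built from the five rows shown above rather than the full `diabetes` DataFrame, and median imputation is only one possible choice.

```python
import pandas as pd

# Stand-in for `diabetes`: the five rows printed by diabetes.head() above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder for a missing value.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros per column; isnull() would report 0 for all of them.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# One possible imputation: turn zeros into NaN, then fill with the column median.
masked = sample[zero_as_missing].mask(sample[zero_as_missing] == 0)
imputed = sample.copy()
imputed[zero_as_missing] = masked.fillna(masked.median())
print(imputed)
```

On the real data, the same lines would apply with `diabetes` in place of `sample`.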

#### 3. Set Target

- Set `Outcome` as the target.
- Set the remaining columns as the features.

In [122]:
y = diabetes['Outcome']               # target
X = diabetes.drop('Outcome', axis=1)  # features: every column except the target
features = X.columns

In [88]:
# MinMaxScaler rescales each feature to the range [0, 1] by default.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Keep the result in its own variable; X itself stays unscaled,
# and the cells below continue to use X.
X_scaled = scaler.fit_transform(X)
X_scaled

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])
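
For reference, min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`. A minimal sketch that checks this formula against `MinMaxScaler`, using only the `Glucose` and `BMI` values from the five rows shown by `diabetes.head()` (so the scaled numbers differ from the full-data output above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Glucose and BMI for the five rows shown by diabetes.head().
X_demo = np.array([
    [148, 33.6],
    [85, 26.6],
    [183, 23.3],
    [89, 28.1],
    [137, 43.1],
])

# Manual min-max scaling, column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
```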

#### 4. Select Features

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

##### 4.1 Univariate Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:

- `SelectKBest` removes all but the `k` highest-scoring features
- Use `sklearn.feature_selection.chi2` as the score function
    > Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.


Further reading:
[Univariate feature selection](https://scikit-learn.org/stable/modules/feature_selection.html)

- Create an instance of SelectKBest
    - Use sklearn.feature_selection.chi2 as score_func
    - Use k of your choice
- Fit X, y 
- Find top 4 features
- Transform features to a DataFrame

In [132]:
# Create an instance of SelectKBest

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Note: despite its name, score_func holds the SelectKBest selector;
# the scoring function itself is chi2.
score_func = SelectKBest(chi2, k=4)

# The chi-square test measures dependence between stochastic variables,
# so it "weeds out" the features most likely to be independent of the class
# and therefore irrelevant for classification.

In [133]:
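# Fit the selector. fit() returns the fitted SelectKBest object,
# not transformed data, so X_new is the selector here.
X_new = score_func.fit(X, y)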

In [134]:
# Print Score 
X_new.scores_


array([ 111.51969064, 1411.88704064,   17.60537322,   53.10803984,
       2175.56527292,  127.66934333,    5.39268155,  181.30368904])

In [146]:
feature_scores = pd.DataFrame(X_new.scores_, index=features)
feature_scores[0].sort_values(ascending=False)

Insulin                     2175.565273
Glucose                     1411.887041
Age                          181.303689
BMI                          127.669343
Pregnancies                  111.519691
SkinThickness                 53.108040
BloodPressure                 17.605373
DiabetesPedigreeFunction       5.392682
Name: 0, dtype: float64

In [126]:
# Fit the selector and keep only the 4 highest-scoring feature columns
x_new=score_func.fit_transform(X, y)

In [137]:
# Transform features to a dataframe

x_new = pd.DataFrame(x_new)
x_new

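| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 148.0 | 0.0 | 33.6 | 50.0 |
| 1 | 85.0 | 0.0 | 26.6 | 31.0 |
| 2 | 183.0 | 0.0 | 23.3 | 32.0 |
| 3 | 89.0 | 94.0 | 28.1 | 21.0 |
| 4 | 137.0 | 168.0 | 43.1 | 33.0 |
| ... | ... | ... | ... | ... |
| 763 | 101.0 | 180.0 | 32.9 | 63.0 |
| 764 | 122.0 | 0.0 | 36.8 | 27.0 |
| 765 | 121.0 | 112.0 | 26.2 | 30.0 |
| 766 | 126.0 | 0.0 | 30.1 | 47.0 |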
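
The transformed array carries no column names, which is why the DataFrame above is labelled 0 to 3. A minimal sketch of one way to recover the names, using the selector's `get_support()` mask. The data here is synthetic and the column names (`informative_a`, `noise_a`, and so on) are hypothetical stand-ins for the diabetes features:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative count data (chi2 requires non-negative features).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = pd.DataFrame({
    "informative_a": rng.poisson(5, size=200) + 10 * y_demo,  # depends on the class
    "noise_a": rng.poisson(5, size=200),                      # independent of the class
    "informative_b": rng.poisson(3, size=200) + 5 * y_demo,   # depends on the class
    "noise_b": rng.poisson(3, size=200),                      # independent of the class
})

selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected_names = X_demo.columns[selector.get_support()]

# Rebuild a DataFrame that keeps the original names and index.
X_selected = pd.DataFrame(
    selector.transform(X_demo),
    columns=selected_names,
    index=X_demo.index,
)
print(list(selected_names))
print(X_selected.head())
```

Applied to this lab, the equivalent would be `features[score_func.get_support()]` passed as `columns=` when building `x_new`.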


##### 4.2 Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained through either a `coef_` attribute or a `feature_importances_` attribute. Then, the least important features are pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features is reached.

Further reading:
[Recursive feature elimination](https://scikit-learn.org/stable/modules/feature_selection.html)
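
The elimination loop described above can be sketched by hand. This is an illustration of the idea, not scikit-learn's implementation: it assumes a binary target, uses the absolute value of each logistic regression coefficient as the importance, and drops one feature per round. The data comes from `make_classification` as a stand-in for the diabetes features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

n_features_to_select = 4
remaining = list(range(X_demo.shape[1]))  # indices of features still in play

while len(remaining) > n_features_to_select:
    # 1. Train the estimator on the current feature set.
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    # 2. Importance of each remaining feature: |coefficient| (binary case).
    importances = np.abs(model.coef_).ravel()
    # 3. Prune the least important feature and repeat.
    weakest = int(np.argmin(importances))
    remaining.pop(weakest)

print(remaining)  # indices of the features that survived
```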

- Use RFE to select features
    - Use `LogisticRegression` as the estimator
    - Set `n_features_to_select` to a number of your choice
- Fit RFE on X, y
- Find the selected features

In [156]:
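# Recursive feature elimination with logistic regression as the estimator

from sklearn.feature_selection import RFE

estimator = LogisticRegression()
# Drop one feature per iteration (step=1) until 4 remain
selector = RFE(estimator, n_features_to_select=4, step=1)
selector = selector.fit(X, y)

print(selector.support_)  # True marks a selected feature
print(selector.ranking_)  # 1 = selected; larger values were eliminated earlier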

[ True  True False False False  True  True False]
[1 1 3 4 5 1 1 2]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
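
The warning above suggests two remedies: scale the data or raise `max_iter`. A minimal sketch that applies both before running RFE, on synthetic data whose columns are multiplied by arbitrary factors to mimic features measured in different units (the factors and the `max_iter=1000` value are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
# Put the columns on very different scales, as with raw clinical measurements.
X_demo = X_demo * np.array([1, 100, 50, 20, 500, 10, 0.1, 30])

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)

# Give the solver more iterations than the default.
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=4, step=1).fit(X_scaled, y_demo)

print(selector.support_)
print(selector.ranking_)
```

Scaling matters for more than convergence here. Coefficient magnitudes depend on each feature's units, so a ranking computed on unscaled data may partly reflect scale rather than relevance, and the selected set can change once the features are standardized.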


In [157]:
# Print the RFE ranking per feature (1 = selected)
feature_scores = pd.DataFrame(selector.ranking_, index=features)
feature_scores[0].sort_values()

Pregnancies                 1
Glucose                     1
BMI                         1
DiabetesPedigreeFunction    1
Age                         2
BloodPressure               3
SkinThickness               4
Insulin                     5
Name: 0, dtype: int32

In [158]:
# Keep only the selected feature columns
x_new = selector.transform(X)

# Convert the selected features to a DataFrame
x_new = pd.DataFrame(x_new)
x_new

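| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 33.6 | 0.627 |
| 1 | 1.0 | 85.0 | 26.6 | 0.351 |
| 2 | 8.0 | 183.0 | 23.3 | 0.672 |
| 3 | 1.0 | 89.0 | 28.1 | 0.167 |
| 4 | 0.0 | 137.0 | 43.1 | 2.288 |
| ... | ... | ... | ... | ... |
| 763 | 10.0 | 101.0 | 32.9 | 0.171 |
| 764 | 2.0 | 122.0 | 36.8 | 0.340 |
| 765 | 5.0 | 121.0 | 26.2 | 0.245 |
| 766 | 1.0 | 126.0 | 30.1 | 0.349 |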




---

> > > > > > > > > © 2019 Institute of Data
