### Feature Engineering

Any operation you perform on the features of a dataset is called Feature Engineering.

* Feature Elimination    - dropping features.
* Feature Addition       - adding new features.
* Feature Transformation - mapping the given feature values onto another scale, e.g. Log Transformation, Sqrt Transformation.
* Feature Selection      - deciding which of the many features are important and using only those features for model building.
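
The first three operations in the list above can be sketched on a small toy DataFrame. The column names and values here are made up for illustration and are not part of the breast cancer dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are invented for this sketch.
toy_df = pd.DataFrame({
    "area": [100.0, 400.0, 900.0],
    "perimeter": [40.0, 80.0, 120.0],
    "record_id": [1, 2, 3],
})

# Feature elimination: drop a column that carries no predictive signal.
toy_df = toy_df.drop(columns=["record_id"])

# Feature addition: derive a new column from existing ones.
toy_df["area_per_perimeter"] = toy_df["area"] / toy_df["perimeter"]

# Feature transformation: put skewed values on another scale.
toy_df["log_area"] = np.log1p(toy_df["area"])   # log(1 + x), defined at x = 0
toy_df["sqrt_area"] = np.sqrt(toy_df["area"])

print(toy_df)
```

Feature selection, the fourth operation, is what the rest of this notebook covers.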

#### Feature Selection Techniques

* sklearn - SelectFromModel
* sklearn - RFE (i.e., Recursive Feature Elimination)

## 1. Import Necessary Libraries

In [1]:
import pandas as pd

## 2. Import Dataset

In [3]:
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()

In [6]:
cancer_data_df = pd.DataFrame(data = cancer_data.data,columns=cancer_data.feature_names)
cancer_data_df['target'] = cancer_data.target
cancer_data_df

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


## 3. Data Understanding

In [7]:
cancer_data_df.isna().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

In [8]:
cancer_data_df.dtypes

mean radius                float64
mean texture               float64
mean perimeter             float64
mean area                  float64
mean smoothness            float64
mean compactness           float64
mean concavity             float64
mean concave points        float64
mean symmetry              float64
mean fractal dimension     float64
radius error               float64
texture error              float64
perimeter error            float64
area error                 float64
smoothness error           float64
compactness error          float64
concavity error            float64
concave points error       float64
symmetry error             float64
fractal dimension error    float64
worst radius               float64
worst texture              float64
worst perimeter            float64
worst area                 float64
worst smoothness           float64
worst compactness          float64
worst concavity            float64
worst concave points       float64
worst symmetry      

In [9]:
cancer_data_df.shape

(569, 31)

## 4. Model Building

In [12]:
X = cancer_data_df.iloc[:,:-1]
y = cancer_data_df[['target']]

In [13]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=12,stratify=y)

In [14]:
X_train.shape,y_train.shape

((455, 30), (455, 1))

In [15]:
X_test.shape,y_test.shape

((114, 30), (114, 1))
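
The split above passes `stratify=y`, which asks `train_test_split` to keep the class proportions roughly equal in the training and test sets. The following is a minimal sketch that checks this on the same dataset; the `*_demo` and `*_share` variable names are assumptions introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_demo, y_demo = data.data, data.target

# Same split settings as in the notebook, applied to plain arrays.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12, stratify=y_demo
)

# The target is 0/1, so the mean is the share of class 1.
overall_share = y_demo.mean()
train_share = y_tr.mean()
test_share = y_te.mean()

print("Share of class 1 overall:", round(overall_share, 4))
print("Share of class 1 in train:", round(train_share, 4))
print("Share of class 1 in test :", round(test_share, 4))
```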

## 5. FEATURE SELECTION TECHNIQUES

### 5.1 SelectFromModel Technique

In [33]:
from sklearn.feature_selection import RFE,SelectFromModel
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [42]:
select_from_model = SelectFromModel(estimator = RandomForestClassifier(n_estimators=100,random_state=12),max_features=None)
select_from_model.fit(X_train,y_train)

SelectFromModel(estimator=RandomForestClassifier(random_state=12))

In [43]:
select_from_model.get_support()

array([False, False, False,  True, False, False, False,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

In [44]:
X_train.columns[select_from_model.get_support()]

Index(['mean area', 'mean concave points', 'area error', 'worst radius',
       'worst perimeter', 'worst area', 'worst concave points'],
      dtype='object')
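
Why these seven features? When no `threshold` is given and the estimator is not L1-penalized, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below reproduces that rule by hand. It fits on the full dataset rather than on `X_train`, so the selected set may differ from the output above; the `*_demo` names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = load_breast_cancer(return_X_y=True)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=12)
)
selector.fit(X_demo, y_demo)

# Importances of the fitted forest and the threshold the selector derived.
importances = selector.estimator_.feature_importances_
manual_mask = importances >= selector.threshold_

print("Threshold used   :", selector.threshold_)
print("Mean importance  :", importances.mean())
print("Features retained:", manual_mask.sum())
```

To keep a fixed number of features instead, `max_features` can be set to an integer, or `threshold` can be given explicitly (for example `"median"`).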

In [45]:
X_train_transformed = select_from_model.transform(X_train)
X_test_transformed  = select_from_model.transform(X_test)

In [46]:
X_train_transformed.shape

(455, 7)

In [47]:
X_test_transformed.shape

(114, 7)

In [51]:
def run_RandomForestClassifier(X_train,X_test,y_train,y_test):
    """Fit a RandomForestClassifier on the training data and print test-set metrics."""
    # No random_state is set, so the scores change slightly from run to run.
    rf_classifier = RandomForestClassifier()
    rf_classifier.fit(X_train,y_train)
    y_pred = rf_classifier.predict(X_test)
    print('Accuracy score  : ',round(accuracy_score(y_test,y_pred),4))
    print('Precision score : ',round(precision_score(y_test,y_pred),4))
    print('Recall score    : ',round(recall_score(y_test,y_pred),4))
    print('Confusion Matrix:\n',confusion_matrix(y_test,y_pred))

In [52]:
%%time
run_RandomForestClassifier(X_train,X_test,y_train,y_test)

Accuracy score  :  0.9474
Precision score :  0.9583
Recall score    :  0.9583
Confusion Matrix:
 [[39  3]
 [ 3 69]]
Wall time: 370 ms


In [53]:
%%time
run_RandomForestClassifier(X_train_transformed,X_test_transformed,y_train,y_test)

Accuracy score  :  0.9298
Precision score :  0.9324
Recall score    :  0.9583
Confusion Matrix:
 [[37  5]
 [ 3 69]]
Wall time: 412 ms
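
The two runs above differ by only two test samples, and the forest inside `run_RandomForestClassifier` is not seeded, so a single split gives a noisy comparison. One possible way to get a steadier estimate is cross-validation with the selector placed inside a `Pipeline`, so that selection is refit on each training fold. This is a sketch under those assumptions, not the notebook's method; the variable names are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X_demo, y_demo = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)

# Baseline: a seeded forest on all 30 features.
all_features_model = RandomForestClassifier(n_estimators=100, random_state=12)

# Selection and classification chained; the selector only sees training folds.
selected_features_model = Pipeline([
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=12))),
    ("classify", RandomForestClassifier(n_estimators=100, random_state=12)),
])

scores_all = cross_val_score(
    all_features_model, X_demo, y_demo, cv=cv, scoring="accuracy")
scores_selected = cross_val_score(
    selected_features_model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("All features     : mean accuracy", round(scores_all.mean(), 4))
print("Selected features: mean accuracy", round(scores_selected.mean(), 4))
```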


### 5.2 RFE Technique

In [54]:
from sklearn.feature_selection import RFE

In [57]:
rfe = RFE(estimator = RandomForestClassifier(random_state=12), n_features_to_select=12)
rfe.fit(X_train,y_train)

RFE(estimator=RandomForestClassifier(random_state=12), n_features_to_select=12)

In [58]:
rfe.get_support()

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True,  True,  True,  True, False, False,  True,
        True, False, False])

In [61]:
X_train.columns[rfe.get_support()]

Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'area error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst concavity',
       'worst concave points'],
      dtype='object')

In [60]:
len(X_train.columns[rfe.get_support()])

12
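
RFE works by fitting the estimator, dropping the least important feature (with the default `step=1`), and repeating until `n_features_to_select` remain. The fitted object records the elimination order in `ranking_`: selected features have rank 1, and higher ranks were eliminated earlier. The sketch below uses a smaller forest (`n_estimators=50`) and the full dataset to keep it quick, so its selection may differ from the output above; `rfe_demo` and `ranking` are names assumed for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = load_breast_cancer()

rfe_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=12),
    n_features_to_select=12,
)
rfe_demo.fit(data.data, data.target)

# Rank 1 = kept; the largest rank is the feature that was dropped first.
ranking = pd.Series(rfe_demo.ranking_, index=data.feature_names).sort_values()
print(ranking)
```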

In [63]:
X_train_transformed_rfe = rfe.transform(X_train)
X_test_transformed_rfe  = rfe.transform(X_test)

In [65]:
X_train.shape,X_test.shape

((455, 30), (114, 30))

In [67]:
X_train_transformed_rfe.shape,X_test_transformed_rfe.shape

((455, 12), (114, 12))

In [68]:
%%time
run_RandomForestClassifier(X_train,X_test,y_train,y_test)  # all 30 features

Accuracy score  :  0.9561
Precision score :  0.9718
Recall score    :  0.9583
Confusion Matrix:
 [[40  2]
 [ 3 69]]
Wall time: 427 ms


In [69]:
%%time
run_RandomForestClassifier(X_train_transformed_rfe,X_test_transformed_rfe,y_train,y_test) 

Accuracy score  :  0.9386
Precision score :  0.9452
Recall score    :  0.9583
Confusion Matrix:
 [[38  4]
 [ 3 69]]
Wall time: 361 ms


### Select the important features with Gradient Boosting, then build a RandomForestClassifier

In [82]:
rfe = RFE(estimator = GradientBoostingClassifier(random_state=12), n_features_to_select=12)
rfe.fit(X_train,y_train)

RFE(estimator=GradientBoostingClassifier(random_state=12),
    n_features_to_select=12)

In [83]:
rfe.get_support()

array([False,  True, False,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False,  True, False,
       False,  True, False,  True, False,  True, False, False,  True,
        True,  True, False])

In [72]:
X_train.columns[rfe.get_support()]

Index(['mean texture', 'mean area', 'mean concavity', 'mean concave points',
       'area error', 'concavity error', 'fractal dimension error',
       'worst texture', 'worst area', 'worst concavity',
       'worst concave points', 'worst symmetry'],
      dtype='object')

In [73]:
len(X_train.columns[rfe.get_support()])

12

In [74]:
X_train_transformed_rfe = rfe.transform(X_train)
X_test_transformed_rfe  = rfe.transform(X_test)

In [75]:
X_train.shape,X_test.shape

((455, 30), (114, 30))

In [76]:
X_train_transformed_rfe.shape,X_test_transformed_rfe.shape

((455, 12), (114, 12))

In [78]:
%%time
run_RandomForestClassifier(X_train,X_test,y_train,y_test)  # all 30 features

Accuracy score  :  0.9649
Precision score :  0.9722
Recall score    :  0.9722
Confusion Matrix:
 [[40  2]
 [ 2 70]]
Wall time: 421 ms


In [79]:
%%time
run_RandomForestClassifier(X_train_transformed_rfe,X_test_transformed_rfe,y_train,y_test) 

Accuracy score  :  0.9737
Precision score :  0.9726
Recall score    :  0.9861
Confusion Matrix:
 [[40  2]
 [ 1 71]]
Wall time: 318 ms


### How do we choose the number of features that gives the best accuracy?

In [84]:
for i in range(1,31):
    rfe = RFE(estimator = GradientBoostingClassifier(random_state=12), n_features_to_select=i)
    rfe.fit(X_train,y_train)
    X_train_transformed_rfe = rfe.transform(X_train)
    X_test_transformed_rfe  = rfe.transform(X_test)
    print('Selected Features : ',i)
    run_RandomForestClassifier(X_train_transformed_rfe,X_test_transformed_rfe,y_train,y_test)
    print('--------------------------------------------------------------------------------')

Selected Features :  1
Accuracy score  :  0.8158
Precision score :  0.8696
Recall score    :  0.8333
Confusion Matrix:
 [[33  9]
 [12 60]]
--------------------------------------------------------------------------------
Selected Features :  2
Accuracy score  :  0.9211
Precision score :  0.9315
Recall score    :  0.9444
Confusion Matrix:
 [[37  5]
 [ 4 68]]
--------------------------------------------------------------------------------
Selected Features :  3
Accuracy score  :  0.9211
Precision score :  0.9315
Recall score    :  0.9444
Confusion Matrix:
 [[37  5]
 [ 4 68]]
--------------------------------------------------------------------------------
Selected Features :  4
Accuracy score  :  0.9561
Precision score :  0.9589
Recall score    :  0.9722
Confusion Matrix:
 [[39  3]
 [ 2 70]]
--------------------------------------------------------------------------------
Selected Features :  5
Accuracy score  :  0.9386
Precision score :  0.9577
Recall score    :  0.9444
Confusion Matrix:
 

In [85]:
rfe = RFE(estimator = GradientBoostingClassifier(random_state=12), n_features_to_select=8)
rfe.fit(X_train,y_train)

RFE(estimator=GradientBoostingClassifier(random_state=12),
    n_features_to_select=8)

In [86]:
rfe.get_support()

array([False,  True, False, False, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False,  True, False,  True, False, False,  True,
        True, False, False])

In [88]:
X_train.columns[rfe.get_support()]

Index(['mean texture', 'mean concavity', 'mean concave points', 'area error',
       'worst texture', 'worst area', 'worst concavity',
       'worst concave points'],
      dtype='object')

In [89]:
X_train_transformed_rfe = rfe.transform(X_train)
X_test_transformed_rfe  = rfe.transform(X_test)

In [90]:
X_train_transformed_rfe.shape,X_test_transformed_rfe.shape

((455, 8), (114, 8))

In [91]:
%%time
run_RandomForestClassifier(X_train_transformed_rfe,X_test_transformed_rfe,y_train,y_test)

Accuracy score  :  0.9474
Precision score :  0.9714
Recall score    :  0.9444
Confusion Matrix:
 [[40  2]
 [ 4 68]]
Wall time: 385 ms
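
The loop above picks the feature count by looking at test-set scores, which can make the final test estimate optimistic. A possible alternative is `RFECV`, which chooses the number of features by cross-validation on the training data only. This sketch swaps in a small `RandomForestClassifier` and `step=2` instead of Gradient Boosting with `step=1`, purely to keep the run short; those settings and the variable names are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.20, random_state=12,
    stratify=data.target,
)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=12),
    step=2,                      # drop two features per iteration
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=12),
    scoring="accuracy",
    min_features_to_select=1,
)
rfecv.fit(X_tr, y_tr)            # the test set is never used for selection

print("Features chosen by cross-validation:", rfecv.n_features_)
print(list(data.feature_names[rfecv.support_]))
print("Held-out accuracy:", round(rfecv.score(X_te, y_te), 4))
```

The test set is then used once, at the end, to report performance for the chosen feature count.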

