# Feature Selection II - Selecting for Model Accuracy

In [1]:
import pandas as pd
ansur_female = pd.read_csv('datasets/ANSUR_II_FEMALE.csv')
ansur_male = pd.read_csv('datasets/ANSUR_II_MALE.csv')

ansur = pd.concat([ansur_female,ansur_male])
ansur.shape

(6068, 99)

In [2]:
ansur_filtered = ansur[['Gender','chestdepth','handlength',
                        'neckcircumference','shoulderlength',
                       'earlength']]
ansur_filtered.head()

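|   | Gender | chestdepth | handlength | neckcircumference | shoulderlength | earlength |
|---|--------|------------|------------|-------------------|----------------|-----------|
| 0 | Female | 245 | 184 | 335 | 148 | 65 |
| 1 | Female | 206 | 189 | 302 | 142 | 60 |
| 2 | Female | 223 | 195 | 325 | 164 | 65 |
| 3 | Female | 285 | 186 | 357 | 157 | 62 |
| 4 | Female | 290 | 187 | 340 | 156 | 65 |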


In [3]:
# Features
X = ansur_filtered.drop('Gender',axis=1)

# Target
y = ansur_filtered['Gender']

In [4]:
# Pre-processing the data

# Split into train-test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Standardize the data, mean=0,variance=1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

In [5]:
# Creating a logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create Logistic regression model and fit using standardized data
lr = LogisticRegression()
lr.fit(X_train_std, y_train)

# Calculate accuracy of the model
# The test set is standardized first, using the scaler fitted on the training set
X_test_std = scaler.transform(X_test)

y_pred = lr.predict(X_test_std)
print(accuracy_score(y_test,y_pred))

0.9917627677100495


In [6]:
# Inspecting the feature coefficients
print(lr.coef_)

[[-3.13015988  0.09670542  7.6930968   1.18884854  0.72173223]]


**Note:** When the model makes a prediction, these coefficients are multiplied by the (standardized) feature values, so features with coefficients close to zero contribute little to the end result.
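To make the note concrete, here is a minimal sketch on synthetic data. The data set, the seed, and the `*_demo` names are assumptions for illustration, not the ANSUR data. It checks that the score logistic regression computes before applying the sigmoid is the standardized feature values multiplied by `coef_`, plus `intercept_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ANSUR features (assumption: 5 numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo_std = StandardScaler().fit_transform(X_demo)

lr_demo = LogisticRegression().fit(X_demo_std, y_demo)

# Linear score computed by hand: sum of (coefficient * feature value) + intercept
manual_score = X_demo_std @ lr_demo.coef_[0] + lr_demo.intercept_[0]

# Contribution of each feature to the score of the first sample
contributions = X_demo_std[0] * lr_demo.coef_[0]

print(np.allclose(manual_score, lr_demo.decision_function(X_demo_std)))
print(contributions)
```

Because every feature has been scaled to unit variance, the absolute coefficient values can be compared directly: a coefficient near zero yields a contribution near zero for typical feature values.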

In [7]:
print(dict(zip(X.columns, abs(lr.coef_[0]))))

{'chestdepth': 3.130159883283854, 'handlength': 0.09670541530941415, 'neckcircumference': 7.693096795904505, 'shoulderlength': 1.1888485408523073, 'earlength': 0.7217322335012618}


In [8]:
# Dropping the feature that contributes least to the model
X = X.drop('handlength', axis=1)

# Recalculating the accuracy after dropping the feature
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lr = LogisticRegression()
lr.fit(scaler.fit_transform(X_train), y_train)

print(accuracy_score(y_test, lr.predict(scaler.transform(X_test))))

0.9862712795167491


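Accuracy stays essentially the same (0.992 versus 0.986, a gap small enough that it may come from the new random train-test split), while the model is now simpler. To repeat this step recursively we have RFE (Recursive Feature Elimination).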

## Recursive Feature Elimination
- A feature selection algorithm that can be wrapped around any model that produces feature coefficients or feature importance values.
- We pass it the model we want to use and the number of features we want to select.
- While fitting to our data, it repeats a process in which it first fits the internal model and then drops the feature with the weakest coefficient.
- It keeps doing this until the desired number of features is reached.
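The loop described above can be sketched by hand and compared with scikit-learn's `RFE`. This is an illustrative sketch, not the library's implementation: the synthetic data, the `true_weights` vector, and the helper `manual_rfe` are all assumptions made up for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 5 unit-variance features with known strengths
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
true_weights = np.array([3.0, 0.05, -2.0, 0.8, 0.4])
prob = 1 / (1 + np.exp(-(X_demo @ true_weights)))
y_demo = (rng.random(2000) < prob).astype(int)

def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: refit, drop the weakest coefficient, repeat."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LogisticRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del remaining[weakest]  # remove the feature with the smallest |coefficient|
    return remaining

selected = manual_rfe(X_demo, y_demo, n_features_to_select=2)
rfe_demo = RFE(LogisticRegression(), n_features_to_select=2).fit(X_demo, y_demo)

print(selected)
print(np.flatnonzero(rfe_demo.support_).tolist())
```

Refitting after every drop matters because the remaining coefficients can shift once a correlated feature is removed, which is why RFE does not simply rank the coefficients of a single fit.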

In [9]:
# Features
X = ansur_filtered.drop('Gender',axis=1)

# Target
y = ansur_filtered['Gender']

In [10]:
# Pre-processing the data

# Split into train-test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Standardize the data, mean=0,variance=1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

In [11]:
# RFE
from sklearn.feature_selection import RFE

# Instantiate RFE
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2, verbose=1)

# Fit the model
rfe.fit(X_train_std, y_train)

Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.


In [12]:
# Inspecting the RFE results
X.columns[rfe.support_]

Index(['chestdepth', 'neckcircumference'], dtype='object')

In [13]:
# Ranking per feature: 1 = kept, higher values were dropped earlier
dict(zip(X.columns, rfe.ranking_))

{'chestdepth': 1,
 'handlength': 4,
 'neckcircumference': 1,
 'shoulderlength': 2,
 'earlength': 3}

In [14]:
# Check accuracy with the two remaining features
X_test_std = scaler.transform(X_test)

accuracy_score(y_test, rfe.predict(X_test_std))

0.9802306425041186

## Tree-based feature selection

![image-3](image-3.png)

Random forest is one such model: it performs feature selection by design to avoid overfitting.
- It passes different random subsets of features to a number of decision trees.
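A small sketch of this behaviour on synthetic data follows (the data set and the `*_demo` names are assumptions). In scikit-learn's `RandomForestClassifier` the random feature subset is drawn at each split and its size is controlled by `max_features`; the sketch inspects the individual trees in `estimators_` to show that they rely on different features, and that the forest-level importances are the average over trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 10 features, 4 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, n_redundant=0,
                                     random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, max_features='sqrt',
                                 random_state=0).fit(X_demo, y_demo)

# Importances of each individual tree: shape (n_trees, n_features)
tree_importances = np.array([tree.feature_importances_
                             for tree in rf_demo.estimators_])

# Number of features with non-zero importance in each tree
features_used = (tree_importances > 0).sum(axis=1)

print(features_used[:5])
print(np.allclose(tree_importances.mean(axis=0), rf_demo.feature_importances_))
```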

In [15]:
import pandas as pd
ansur_female = pd.read_csv('datasets/ANSUR_II_FEMALE.csv')
ansur_male = pd.read_csv('datasets/ANSUR_II_MALE.csv')

ansur = pd.concat([ansur_female,ansur_male])
ansur.shape

(6068, 99)

In [16]:
ansur.drop(['Branch','Component','BMI_class','Height_class'],
          axis=1,
          inplace=True)

In [17]:
# Features
X = ansur.drop('Gender',axis=1)

# Target
y = ansur['Gender']

In [18]:
# Split into train-test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [19]:
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()

rf.fit(X_train, y_train)

print(accuracy_score(y_test, rf.predict(X_test)))

0.9923119165293794


Reaching such high accuracy on the test set suggests that the model managed to escape the curse of dimensionality and didn't overfit on the many features in the training set.

In [20]:
# Feature importance values
rf.feature_importances_

array([0.00142133, 0.00109268, 0.00038479, 0.00080267, 0.00355326,
       0.00621627, 0.00095298, 0.1094743 , 0.00413159, 0.00717979,
       0.01355828, 0.01857631, 0.0007869 , 0.00510404, 0.00112701,
       0.01260415, 0.00289644, 0.00072711, 0.00191812, 0.00155842,
       0.00164152, 0.01178926, 0.0007571 , 0.00166167, 0.00579531,
       0.01892502, 0.00047618, 0.00635288, 0.00065596, 0.00089918,
       0.00089665, 0.00194052, 0.00060721, 0.00349696, 0.00141798,
       0.00260607, 0.0058748 , 0.0218189 , 0.00487626, 0.00067282,
       0.00050672, 0.02580583, 0.06156871, 0.00084673, 0.00078337,
       0.00100948, 0.0004952 , 0.04692033, 0.00071765, 0.01632792,
       0.02435656, 0.0006808 , 0.00076808, 0.00267134, 0.01166292,
       0.00052644, 0.00033687, 0.00133139, 0.00787438, 0.00543047,
       0.00135421, 0.16912925, 0.12530706, 0.00510626, 0.00042008,
       0.00242982, 0.00104707, 0.03809052, 0.00105749, 0.00100116,
       0.00367124, 0.04459685, 0.00416229, 0.01334964, 0.00136

An advantage of these feature importance values over coefficients is that they are comparable between features by default, since they always sum up to one.
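Since the importances sum to one, each value can be read as a share of the total. A minimal sketch, assuming synthetic data and hypothetical column names (`feature_0` and so on) in place of the ANSUR measurements:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y_demo = make_classification(n_samples=500, n_features=8,
                                    n_informative=3, n_redundant=0,
                                    random_state=1)
# Hypothetical column names, standing in for the ANSUR measurements
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(8)])

rf_demo = RandomForestClassifier(random_state=1).fit(X_demo, y_demo)

# Label each importance with its column name and rank from high to low
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
ranked = importances.sort_values(ascending=False)

print(ranked.head(3))
print(ranked.cumsum().head(3))  # share of total importance covered by the top features
print(importances.sum())
```

Pairing the values with column names also makes the long unlabeled array easier to read than the raw `feature_importances_` output.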

In [21]:
# Feature importance as a feature selector
mask = rf.feature_importances_ > 0.1
print(mask)

X_reduced = X.loc[:,mask]
print(X_reduced.columns)

[False False False False False False False  True False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False  True  True False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False]
Index(['biacromialbreadth', 'neckcircumference', 'neckcircumferencebase'], dtype='object')
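The same thresholding can also be expressed with scikit-learn's `SelectFromModel`, shown here as an alternative to the manual mask rather than as the approach used in this notebook. The sketch assumes synthetic data and reuses an already fitted forest through `prefit=True`. Note that `SelectFromModel` keeps features whose importance is greater than or equal to the threshold, which differs from `>` only in the case of an exact tie.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): 10 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
rf_demo = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Manual mask, as in the cell above
manual_mask = rf_demo.feature_importances_ > 0.1

# Same selection through SelectFromModel, reusing the fitted forest
selector = SelectFromModel(rf_demo, threshold=0.1, prefit=True)
X_reduced_demo = selector.transform(X_demo)

print(manual_mask)
print(X_reduced_demo.shape)
```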


In [22]:
# RFE with random forests
from sklearn.feature_selection import RFE

# Instantiate RFE
rfe = RFE(estimator=RandomForestClassifier(),
          n_features_to_select=6,
          step=10,  # drop the 10 least important features at each step
          verbose=1)

# Fit the model
rfe.fit(X_train, y_train)

# Remaining features
print(X.columns[rfe.support_])

Fitting estimator with 94 features.
Fitting estimator with 84 features.
Fitting estimator with 74 features.
Fitting estimator with 64 features.
Fitting estimator with 54 features.
Fitting estimator with 44 features.
Fitting estimator with 34 features.
Fitting estimator with 24 features.
Fitting estimator with 14 features.
Index(['biacromialbreadth', 'handcircumference', 'hipbreadthsitting',
       'neckcircumference', 'neckcircumferencebase', 'shouldercircumference'],
      dtype='object')
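The verbose log shows how `step` shapes the elimination schedule. The helper below, `rfe_schedule`, is a hypothetical pure-Python function written for this note (it is not part of scikit-learn) and assumes an integer `step`. It reproduces the feature counts at which the estimator is refitted, including the shortened last step that avoids dropping below the requested number of features.

```python
def rfe_schedule(n_features, n_features_to_select, step):
    """Return the feature counts at which RFE refits, and the final count."""
    counts = []
    remaining = n_features
    while remaining > n_features_to_select:
        counts.append(remaining)
        # Never drop below the requested number of features
        remaining -= min(step, remaining - n_features_to_select)
    return counts, remaining

print(rfe_schedule(94, 6, 10))
# ([94, 84, 74, 64, 54, 44, 34, 24, 14], 6)
print(rfe_schedule(5, 2, 1))
# ([5, 4, 3], 2)
```

With `step=10` the forest is refitted 9 times during elimination instead of 88 times with `step=1`, which is the main reason to raise `step` for slower models such as random forests.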


In [23]:
# accuracy calculation
accuracy_score(y_test, rfe.predict(X_test))

0.9906644700713894

## Regularized linear regression
![image-4](image-4.png)


In [27]:
ansur_male.drop(['Branch','Component','BMI_class','Height_class','Gender'],
          axis=1,
          inplace=True)

In [28]:
ansur_male.shape

(4082, 94)

In [29]:
X = ansur_male.drop('BMI',axis=1)
y = ansur_male['BMI']