# Feature selection II, selecting for model accuracy
> In this second chapter on feature selection, you'll learn how to let models help you find the most important features in a dataset for predicting a particular target feature. In the final lesson of this chapter, you'll combine the advice of multiple different models to decide which features are worth keeping.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 3 exercises of the course "Dimensionality Reduction in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Selecting features for model performance

### Building a diabetes classifier


<div class=""><p>You'll be using the Pima Indians diabetes dataset to predict whether a person has diabetes using logistic regression. There are 8 features and one target in this dataset. The data has been split into a training and test set and pre-loaded for you as <code>X_train</code>, <code>y_train</code>, <code>X_test</code>, and <code>y_test</code>.</p>
<p>A <code>StandardScaler()</code> instance has been predefined as <code>scaler</code> and a <code>LogisticRegression()</code> one as <code>lr</code>.</p></div>

In [None]:
diabetes_df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/7-dimensionality-reduction-in-python/datasets/diabetes_df.csv')

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
X, y = diabetes_df.iloc[:, :-1], diabetes_df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
lr = LogisticRegression()

Instructions
<ul>
<li>Fit the scaler on the training features and transform these features in one go.</li>
<li>Fit the logistic regression model on the scaled training data.</li>
<li>Scale the test features.</li>
<li>Predict diabetes presence on the scaled test set.</li>
</ul>

In [None]:
# Fit the scaler on the training features and transform these in one go
X_train_std = scaler.fit_transform(X_train)

# Fit the logistic regression model on the scaled training data
lr.fit(X_train_std, y_train)

# Scale the test features
X_test_std = scaler.transform(X_test)

# Predict diabetes presence on the scaled test set
y_pred = lr.predict(X_test_std)

# Prints accuracy metrics and feature coefficients
print("{0:.1%} accuracy on test set.".format(accuracy_score(y_test, y_pred))) 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

79.6% accuracy on test set.
{'pregnant': 0.05, 'glucose': 1.23, 'diastolic': 0.03, 'triceps': 0.24, 'insulin': 0.19, 'bmi': 0.38, 'family': 0.35, 'age': 0.34}


**We get almost 80% accuracy on the test set. Take a look at the differences in model coefficients for the different features.**
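The scale-then-fit steps above can also be bundled into a single scikit-learn `Pipeline`, which makes it harder to accidentally fit the scaler on test data. The sketch below is not part of the course exercise: it uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, `pipe`, and `manual_*` are made up for illustration. It checks that the pipeline gives the same predictions as the manual approach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 8 numeric features)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.25, random_state=0)

# Pipeline: the scaler is fitted on the training data only, inside fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(Xtr, ytr)
pipe_pred = pipe.predict(Xte)

# Manual version, mirroring the exercise code
manual_scaler = StandardScaler()
manual_lr = LogisticRegression()
manual_lr.fit(manual_scaler.fit_transform(Xtr), ytr)
manual_pred = manual_lr.predict(manual_scaler.transform(Xte))

same = bool(np.array_equal(pipe_pred, manual_pred))
print("Identical predictions:", same)

# Absolute coefficients are comparable because every feature was standardized
coefs = np.abs(pipe.named_steps["logisticregression"].coef_[0]).round(2)
print(coefs)
```

Because both versions run the same computations, the predictions should match exactly. The pipeline is simply a more compact way to express them.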

### Manual Recursive Feature Elimination


<div class=""><p>Now that we've created a diabetes classifier, let's see if we can reduce the number of features without hurting the model accuracy too much.</p>
<p>On the second line of code the features are selected from the original dataframe. Adjust this selection.</p>
<p>A <code>StandardScaler()</code> instance has been predefined as <code>scaler</code> and a <code>LogisticRegression()</code> one as <code>lr</code>.</p>
<p>All necessary functions and packages have been pre-loaded too.</p></div>

Instructions 1/3
<li>First, run the given code, then remove the feature with the lowest model coefficient from <code>X</code>.</li>

In [None]:
# Remove the feature with the lowest model coefficient
X = diabetes_df[['pregnant', 'glucose', 'triceps', 'insulin', 'bmi', 'family', 'age']]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print("{0:.1%} accuracy on test set.".format(acc)) 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

80.6% accuracy on test set.
{'pregnant': 0.05, 'glucose': 1.24, 'triceps': 0.24, 'insulin': 0.2, 'bmi': 0.39, 'family': 0.34, 'age': 0.35}


Instructions 2/3
<li>Run the code and remove 2 more features with the lowest model coefficients.</li>

In [None]:
# Remove the 2 features with the lowest model coefficients
X = diabetes_df[['glucose', 'triceps', 'bmi', 'family', 'age']]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print("{0:.1%} accuracy on test set.".format(acc)) 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

79.6% accuracy on test set.
{'glucose': 1.13, 'triceps': 0.25, 'bmi': 0.34, 'family': 0.34, 'age': 0.37}


Instructions 3/3
<li>Run the code and only keep the feature with the highest coefficient.</li>

In [None]:
# Only keep the feature with the highest coefficient
X = diabetes_df[['glucose']]

# Performs a 25-75% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scales features and fits the logistic regression model to the data
lr.fit(scaler.fit_transform(X_train), y_train)

# Calculates the accuracy on the test set and prints coefficients
acc = accuracy_score(y_test, lr.predict(scaler.transform(X_test)))
print("{0:.1%} accuracy on test set.".format(acc)) 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

75.5% accuracy on test set.
{'glucose': 1.28}


**Removing all but one feature only reduced the accuracy by a few percent.**
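The manual procedure above (fit, drop the feature with the smallest absolute coefficient, refit) can be written as a loop. This is a minimal sketch on synthetic data, not the course's code: the feature names `f0` to `f5` and all variable names are hypothetical, and for brevity the data is scaled once without a train/test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with made-up feature names
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
names = [f"f{i}" for i in range(6)]
X_demo = StandardScaler().fit_transform(X_demo)

remaining = list(range(6))   # column indices still in the model
dropped = []                 # feature names in the order they were removed

while len(remaining) > 2:
    model = LogisticRegression().fit(X_demo[:, remaining], y_demo)
    # Position (within `remaining`) of the smallest absolute coefficient
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    dropped.append(names[remaining.pop(weakest)])

kept = [names[i] for i in remaining]
print("Dropped, in order:", dropped)
print("Kept:", kept)
```

Refitting after every removal matters because the coefficients of the remaining features can shift once a feature is gone.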

### Automatic Recursive Feature Elimination


<div class=""><p>Now let's automate this recursive process. Wrap a Recursive Feature Eliminator (RFE) around our logistic regression estimator and pass it the desired number of features.</p>
<p>All the necessary functions and packages have been pre-loaded and the features have been scaled for you.</p></div>

In [None]:
from sklearn.feature_selection import RFE

In [None]:
X, y = diabetes_df.iloc[:, :-1], diabetes_df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

#lr = LogisticRegression()
X_train_std = scaler.fit_transform(X_train)
#lr.fit(X_train_std, y_train)
X_test_std = scaler.transform(X_test)

Instructions
<ul>
<li>Create the RFE with a <code>LogisticRegression()</code> estimator and 3 features to select.</li>
<li>Print the features and their ranking.</li>
<li>Print the features that are not eliminated.</li>
</ul>

In [None]:
# Create the RFE with a LogisticRegression estimator and 3 features to select
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, verbose=1)

# Fits the eliminator to the data
rfe.fit(X_train_std, y_train)

# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))

# Print the features that are not eliminated
print(X.columns[rfe.support_])
# Calculates the test set accuracy
acc = accuracy_score(y_test, rfe.predict(X_test_std))
print("{0:.1%} accuracy on test set.".format(acc)) 

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
{'pregnant': 5, 'glucose': 1, 'diastolic': 6, 'triceps': 3, 'insulin': 4, 'bmi': 1, 'family': 2, 'age': 1}
Index(['glucose', 'bmi', 'age'], dtype='object')
80.6% accuracy on test set.


**When we eliminate all but the 3 most relevant features we get an 80.6% accuracy on the test set.**
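To make the `ranking_` and `support_` attributes used above more concrete, here is a small sketch on synthetic data (the names `X_demo`, `y_demo`, and `rfe_demo` are illustrative, not from the course). With the default `step=1`, features that survive get rank 1 and each eliminated feature gets the next higher rank, so the feature dropped first has the highest rank.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 features, keep 3
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

rfe_demo = RFE(estimator=LogisticRegression(), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.ranking_)   # 1 = kept; 2, 3, 4 = dropped last, ..., dropped first
print(rfe_demo.support_)   # boolean mask, True where ranking_ == 1

# transform() keeps only the selected columns
X_kept = rfe_demo.transform(X_demo)
print(X_kept.shape)

# The final estimator was refitted on the selected features only
print(rfe_demo.estimator_.coef_.shape)
```

This also explains why `rfe.predict()` can be called on the full scaled test set: the eliminator applies the mask before passing the data to its internal estimator.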

## Tree-based feature selection


### Building a random forest model


<div class=""><p>You'll again work on the Pima Indians dataset to predict whether an individual has diabetes. This time using a random forest classifier. You'll fit the model on the training data after performing the train-test split and consult the feature importance values.</p>
<p>The feature and target datasets have been pre-loaded for you as <code>X</code> and <code>y</code>. Same goes for the necessary packages and functions.</p></div>

In [None]:
from sklearn.ensemble import RandomForestClassifier

Instructions
<ul>
<li>Set a 25% test size to perform a 75%-25% train-test split.</li>
<li>Fit the random forest classifier to the training data.</li>
<li>Calculate the accuracy on the test set.</li>
<li>Print the feature importances per feature.</li>
</ul>

In [None]:
# Perform a 75% training and 25% test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the random forest model to the training data
rf = RandomForestClassifier(random_state=0, n_estimators = 10)
rf.fit(X_train, y_train)

# Calculate the accuracy
acc = accuracy_score(y_test, rf.predict(X_test))

# Print the importances per feature
print(dict(zip(X.columns, rf.feature_importances_.round(2))))

# Print accuracy
print("{0:.1%} accuracy on test set.".format(acc))

{'pregnant': 0.09, 'glucose': 0.21, 'diastolic': 0.08, 'triceps': 0.11, 'insulin': 0.13, 'bmi': 0.09, 'family': 0.12, 'age': 0.16}
77.6% accuracy on test set.


**The random forest model gets 78% accuracy on the test set and 'glucose' is the most important feature (0.21).**

### Random forest for feature selection

<div class=""><p>Now let's use the fitted random forest model to select the most important features from our input dataset <code>X</code>.</p>
<p>The trained model from the previous exercise has been pre-loaded for you as <code>rf</code>.</p></div>

Instructions 1/2
<li>Create a mask for features with an importance higher than 0.15.</li>

In [None]:
# Create a mask for features importances above the threshold
mask = rf.feature_importances_ > 0.15

# Prints out the mask
print(mask)

[False  True False False False False False  True]


Instructions 2/2
<li>Sub-select the most important features by applying the mask to <code>X</code>.</li>

In [None]:
# Apply the mask to the feature dataset X
reduced_X = X.loc[:, mask]

# prints out the selected column names
print(reduced_X.columns)

Index(['glucose', 'age'], dtype='object')


**Only the features 'glucose' and 'age' were considered sufficiently important.**
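scikit-learn also offers `SelectFromModel`, which applies the same kind of importance threshold without building the mask by hand. The sketch below is an alternative to the exercise code, not what the course uses, and it runs on synthetic data with illustrative variable names. One small difference to keep in mind: `SelectFromModel` keeps features whose importance is greater than or equal to the threshold, while the manual mask uses a strict `>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
rf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1, so the average over 8 features is 0.125
importance_sum = rf_demo.feature_importances_.sum()
print(importance_sum)

# Manual mask, as in the exercise
manual_mask = rf_demo.feature_importances_ > 0.15

# Same threshold via SelectFromModel, reusing the already fitted forest
selector = SelectFromModel(rf_demo, threshold=0.15, prefit=True)
selector_mask = selector.get_support()

print(manual_mask)
print(selector_mask)
```

Since the importances sum to 1, a threshold of 0.15 on 8 features only keeps features that score somewhat above the average importance of 0.125.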

### Recursive Feature Elimination with random forests


<div class=""><p>You'll wrap a Recursive Feature Eliminator around a random forest model to remove features step by step. This method is more conservative than selecting features by applying a single importance threshold, since dropping one feature can influence the relative importances of the others.</p>
<p>You'll need these pre-loaded datasets: <code>X</code>, <code>X_train</code>, <code>y_train</code>.</p>
<p>Functions and classes that have been pre-loaded for you are: <code>RandomForestClassifier()</code>, <code>RFE()</code>, <code>train_test_split()</code>.</p></div>

Instructions 1/4
<li>Create a recursive feature eliminator that will select the 2 most important features using a random forest model.</li>

In [None]:
# Wrap the feature eliminator around the random forest model
rfe = RFE(estimator=RandomForestClassifier(n_estimators = 10), n_features_to_select=2, verbose=1)

Instructions 2/4
<li>Fit the recursive feature eliminator to the training data.</li>

In [None]:
# Fit the model to the training data
rfe.fit(X_train, y_train)

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.


RFE(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
  n_features_to_select=2, step=1, verbose=1)

Instructions 3/4
<li>Create a mask using the fitted eliminator, then apply it to the feature dataset <code>X</code>.</li>

In [None]:
# Create a mask using an attribute of rfe
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)

Index(['glucose', 'insulin'], dtype='object')


Instructions 4/4
<li>Change the settings of <code>RFE()</code> to eliminate 2 features at each <code>step</code>.</li>

In [None]:
# Set the feature eliminator to remove 2 features on each step
rfe = RFE(estimator=RandomForestClassifier(n_estimators = 10), n_features_to_select=2, step=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)

Fitting estimator with 8 features.
Fitting estimator with 6 features.
Fitting estimator with 4 features.
Index(['glucose', 'insulin'], dtype='object')


**Compared to the quick and dirty single-threshold method from the previous exercise, one of the selected features is different.**

## Regularized linear regression

### Creating a LASSO regressor
<div class=""><p>You'll be working on the numeric ANSUR body measurements dataset to predict a person's Body Mass Index (BMI) using the pre-imported <code>Lasso()</code> regressor. BMI is a metric derived from body height and weight but those two features have been removed from the dataset to give the model a challenge.</p>
<p>You'll standardize the data first using the <code>StandardScaler()</code> that has been instantiated for you as <code>scaler</code> to make sure all coefficients face a comparable regularizing force trying to bring them down.</p>
<p>All necessary functions and classes plus the input datasets <code>X</code> and <code>y</code> have been pre-loaded.</p></div>
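For reference, the "regularizing force" mentioned above comes from the L1 penalty in the objective that scikit-learn's `Lasso()` minimizes. The notation below follows the scikit-learn documentation rather than the course text: $n$ is the number of samples, $w$ the coefficient vector, and $\alpha$ the regularization strength (the `alpha` parameter).

```latex
\min_{w} \; \frac{1}{2n} \, \lVert y - Xw \rVert_2^2 \; + \; \alpha \, \lVert w \rVert_1
```

The penalty charges every coefficient the same price per unit of absolute value. A feature measured in large units needs only a small coefficient and would therefore be penalized less than an equally useful feature in small units, which is why the features are standardized first.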

In [None]:
ansur_bmi = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/7-dimensionality-reduction-in-python/datasets/ansur_bmi.csv')
X, y = ansur_bmi.iloc[:, :-1], ansur_bmi.iloc[:, -1]
from sklearn.linear_model import Lasso

Instructions
<ul>
<li>Set the test size to 30% to get a 70-30% train test split.</li>
<li>Fit the scaler on the training features and transform these in one go.</li>
<li>Create the Lasso model.</li>
<li>Fit it to the scaled training data.</li>
</ul>

In [None]:
# Set the test size to 30% to get a 70-30% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training features and transform these in one go
X_train_std = scaler.fit_transform(X_train)

# Create the Lasso model
la = Lasso()

# Fit it to the standardized training data
la.fit(X_train_std, y_train)



Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

**You've fitted the Lasso model to the standardized training data. Now let's look at the results!**

### Lasso model results
<div class=""><p>Now that you've trained the Lasso model, you'll score its predictive capacity (R<sup>2</sup>) on the test set and count how many features are ignored because their coefficient is reduced to zero.</p>
<p>The <code>X_test</code> and <code>y_test</code> datasets have been pre-loaded for you.</p>
<p>The <code>Lasso()</code> model and <code>StandardScaler()</code> have been instantiated as <code>la</code> and <code>scaler</code> respectively and both were fitted to the training data.</p></div>

Instructions
<ul>
<li>Transform the test set with the pre-fitted scaler.</li>
<li>Calculate the R<sup>2</sup> value on the scaled test data.</li>
<li>Create a list that has True values when coefficients equal 0.</li>
<li>Calculate the total number of features with a coefficient of 0.</li>
</ul>

In [None]:
# Transform the test set with the pre-fitted scaler
X_test_std = scaler.transform(X_test)

# Calculate the coefficient of determination (R squared) on X_test_std
r_squared = la.score(X_test_std, y_test)
print("The model can predict {0:.1%} of the variance in the test set.".format(r_squared))

# Create a list that has True values when coefficients equal 0
zero_coef = la.coef_ == 0

# Calculate how many features have a zero coefficient
n_ignored = sum(zero_coef)
print("The model has ignored {} out of {} features.".format(n_ignored, len(la.coef_)))

The model can predict 84.7% of the variance in the test set.
The model has ignored 82 out of 91 features.


  


**We can predict almost 85% of the variance in the BMI value using just 9 out of 91 of the features. The R^2 could be higher though.**

### Adjusting the regularization strength


<div class=""><p>Your current Lasso model has an R<sup>2</sup> score of 84.7%. When a model applies overly powerful regularization it can suffer from high bias, hurting its predictive power.</p>
<p>Let's improve the balance between predictive power and model simplicity by tweaking the <code>alpha</code> parameter.</p></div>

Instructions
<li>Find the <strong>highest</strong> value for <code>alpha</code> that gives an R<sup>2</sup> value above 98% from the options: <code>1</code>, <code>0.5</code>, <code>0.1</code>, and <code>0.01</code>.</li>

In [None]:
# Find the highest alpha value with R-squared above 98%
la = Lasso(alpha=0.1, random_state=0)

# Fits the model and calculates performance stats
la.fit(X_train_std, y_train)
r_squared = la.score(X_test_std, y_test)
n_ignored_features = sum(la.coef_ == 0)

# Print performance stats
print("The model can predict {0:.1%} of the variance in the test set.".format(r_squared))
print("{} out of {} features were ignored.".format(n_ignored_features, len(la.coef_)))

The model can predict 98.3% of the variance in the test set.
64 out of 91 features were ignored.


**With this more appropriate regularization strength we can predict 98% of the variance in the BMI value while ignoring about 2/3 of the features.**
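Instead of trying the `alpha` options one at a time, the trade-off can be inspected with a short loop. The following is a rough sketch on synthetic regression data from `make_regression`, so the numbers it prints say nothing about the ANSUR dataset, and the variable names are illustrative. A stronger penalty typically sets more coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=10.0,
                                 random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

alphas = [0.01, 0.1, 1.0, 10.0]
n_zero = []  # number of ignored features per alpha

for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_zero.append(int(np.sum(model.coef_ == 0)))
    print("alpha={}: {} of {} coefficients are zero".format(
        alpha, n_zero[-1], len(model.coef_)))
```

In practice the R squared on held-out data would be tracked alongside the zero counts, as in the exercise, to pick the strongest regularization that still predicts well.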

## Combining feature selectors


### Creating a LassoCV regressor

<div class=""><p>You'll be predicting biceps circumference on a subsample of the male ANSUR dataset using the <code>LassoCV()</code> regressor that automatically tunes the regularization strength (alpha value) using Cross-Validation.</p>
<p>The standardized training and test data has been pre-loaded for you as <code>X_train</code>, <code>X_test</code>, <code>y_train</code>, and <code>y_test</code>.</p></div>

In [None]:
biceps_df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/7-dimensionality-reduction-in-python/datasets/biceps_df.csv')

In [None]:
X, y = biceps_df.iloc[:, :-1], biceps_df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



Instructions

<ul>
<li>Create and fit the LassoCV model on the training set.</li>
<li>Calculate R<sup>2</sup> on the test set.</li>
<li>Create a mask for coefficients not equal to zero.</li>
</ul>

In [None]:
from sklearn.linear_model import LassoCV

# Create and fit the LassoCV model on the training set
lcv = LassoCV(cv = 3)
lcv.fit(X_train, y_train)
print('Optimal alpha = {0:.3f}'.format(lcv.alpha_))

# Calculate R squared on the test set
r_squared = lcv.score(X_test, y_test)
print('The model explains {0:.1%} of the test set variance'.format(r_squared))

# Create a mask for coefficients not equal to zero
lcv_mask = lcv.coef_ != 0
print('{} features out of {} selected'.format(sum(lcv_mask), len(lcv_mask)))

Optimal alpha = 0.089
The model explains 88.2% of the test set variance
26 features out of 32 selected


**We got a decent R squared and removed 6 features. We'll save the lcv_mask for later on.**

### Ensemble models for extra votes


<div class=""><p>The <code>LassoCV()</code> model selected 26 out of 32 features. Not bad, but not a spectacular dimensionality reduction either. Let's use two more models to select the 10 features they consider most important using the Recursive Feature Eliminator (RFE).</p>
<p>The standardized training and test data has been pre-loaded for you as <code>X_train</code>, <code>X_test</code>, <code>y_train</code>, and <code>y_test</code>.</p></div>

Instructions 1/4
<li>Select 10 features with RFE on a <code>GradientBoostingRegressor</code> and drop 3 features on each step.</li>

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor

# Select 10 features with RFE on a GradientBoostingRegressor, drop 3 features on each step
rfe_gb = RFE(estimator=GradientBoostingRegressor(), 
             n_features_to_select=10, step=3, verbose=1)
rfe_gb.fit(X_train, y_train)

Fitting estimator with 32 features.
Fitting estimator with 29 features.
Fitting estimator with 26 features.
Fitting estimator with 23 features.
Fitting estimator with 20 features.
Fitting estimator with 17 features.
Fitting estimator with 14 features.
Fitting estimator with 11 features.


RFE(estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_sampl...=None, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False),
  n_features_to_select=10, step=3, verbose=1)

Instructions 2/4
<li>Calculate the R<sup>2</sup> on the test set.</li>

In [None]:
# Calculate the R squared on the test set
r_squared = rfe_gb.score(X_test, y_test)
print('The model can explain {0:.1%} of the variance in the test set'.format(r_squared))

The model can explain 85.6% of the variance in the test set


Instructions 3/4
<li>Assign the support array of the fitted model to <code>gb_mask</code>.</li>

In [None]:
# Assign the support array to gb_mask
gb_mask = rfe_gb.support_

Instructions 4/4
<li>Modify the first step to select 10 features with RFE on a <strong><code>RandomForestRegressor()</code></strong> and drop 3 features on each step.</li>

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Select 10 features with RFE on a RandomForestRegressor, drop 3 features on each step
rfe_rf = RFE(estimator=RandomForestRegressor(n_estimators = 10), 
             n_features_to_select=10, step=3, verbose=1)
rfe_rf.fit(X_train, y_train)

# Calculate the R squared on the test set
r_squared = rfe_rf.score(X_test, y_test)
print('The model can explain {0:.1%} of the variance in the test set'.format(r_squared))

# Assign the support array to rf_mask
rf_mask = rfe_rf.support_

Fitting estimator with 32 features.
Fitting estimator with 29 features.
Fitting estimator with 26 features.
Fitting estimator with 23 features.
Fitting estimator with 20 features.
Fitting estimator with 17 features.
Fitting estimator with 14 features.
Fitting estimator with 11 features.
The model can explain 83.1% of the variance in the test set


**Including the Lasso linear model from the previous exercise, we now have the votes from 3 models on which features are important.**

### Combining 3 feature selectors


<div class=""><p>We'll combine the votes of the 3 models you built in the previous exercises into a meta mask that decides which features are important. We'll then use this mask to reduce dimensionality and see how a simple linear regressor performs on the reduced dataset.</p>
<p>The per model votes have been pre-loaded as <code>lcv_mask</code>, <code>rf_mask</code>, and <code>gb_mask</code> and the feature and target datasets as <code>X</code> and <code>y</code>.</p></div>

Instructions 1/4
<li>Sum the votes of the three models using <code>np.sum()</code>.</li>

In [None]:
# Sum the votes of the three models
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)
print(votes)

[1 0 3 3 0 1 0 3 1 1 1 3 1 1 2 2 0 1 1 2 0 1 3 1 0 3 2 1 2 1 2 3]


Instructions 2/4
<li>Create a mask for features selected by all 3 models.</li>

In [None]:
# Create a mask for features selected by all 3 models
meta_mask = votes >= 3
print(meta_mask)

[False False  True  True False False False  True False False False  True
 False False False False False False False False False False  True False
 False  True False False False False False  True]
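The voting logic is easier to follow on a tiny example. The sketch below uses three made-up boolean masks over four features (the names `mask_a`, `mask_b`, and `mask_c` are hypothetical) and shows how summing them counts votes, and how the threshold controls how strict the selection is.

```python
import numpy as np

# Three hypothetical selector masks over the same 4 features
mask_a = np.array([True, True, False, True])
mask_b = np.array([True, False, False, True])
mask_c = np.array([True, True, False, False])

# Booleans count as 0/1, so the column-wise sum is the number of votes
votes = np.sum([mask_a, mask_b, mask_c], axis=0)
print(votes)        # [3 2 0 2]

# Unanimous: all three selectors agree
unanimous = votes == 3
print(unanimous)    # [ True False False False]

# Majority: at least two of three selectors agree
majority = votes >= 2
print(majority)     # [ True  True False  True]
```

Requiring all 3 votes, as the exercise does, gives the smallest feature set. Lowering the threshold to 2 is a less strict option that keeps more features.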


Instructions 3/4
<li>Apply the dimensionality reduction on X and print which features were selected.</li>

In [None]:
# Apply the dimensionality reduction on X
X_reduced = X.loc[:, meta_mask]
print(X_reduced.columns)

Index(['bideltoidbreadth', 'buttockcircumference', 'chestcircumference',
       'forearmcircumferenceflexed', 'shouldercircumference',
       'thighcircumference', 'BMI'],
      dtype='object')


Instructions 4/4
<li>Plug the reduced dataset into the code for simple linear regression that has been written for you.</li>

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

In [None]:
# Plug the reduced dataset into a linear regression pipeline
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3, random_state=0)
lm.fit(scaler.fit_transform(X_train), y_train)
r_squared = lm.score(scaler.transform(X_test), y_test)



In [None]:
print('The model can explain {0:.1%} of the variance in the test set using {1:} features.'.format(r_squared, len(lm.coef_)))

The model can explain 86.7% of the variance in the test set using 7 features.


**Using the votes from 3 models you were able to select just 7 features that allowed a simple linear model to explain 86.7% of the test set variance!**