# Data Science and Visualization (RUC F2023)

## Lecture 8: Clustering II

# Automatic Feature Selection

* Univariate Statistics (univariate feature selection)
* Model-based Selection
* Iterative Selection

## 0. Setup and data loading

We use a dataset on the fuel economy of cars.

In [8]:
import pandas as pd

# mpg: miles per gallon
mpg = pd.read_csv('C:/Data/mpg.csv')
mpg.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


In [9]:
mpg.shape

(398, 9)

Let's keep only the numeric columns and drop the rows that contain missing values:

In [10]:
mpg = mpg.select_dtypes('number').dropna()

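We split the columns into the independent variables (the features, `X`) and the dependent variable (the target, `y`):

In [11]:
X = mpg.drop('mpg', axis=1)  # every numeric column except the target
y = mpg['mpg']               # the target we want to predict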

In [12]:
X.shape

(392, 6)

There are 6 columns in X. We want to determine which of them should be used for modelling. Here, modelling means predicting **mpg** values from the other columns, so this is a regression problem.

## 1. Univariate Statistics (aka univariate feature selection)

* Consider each feature f individually.
* Is there a statistically significant relationship between f and the target?
* Select the features that are related to the target with the highest confidence.
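
To make the idea of scoring each feature on its own concrete, here is a minimal pure-Python sketch of the F-statistic that `f_regression` is based on: the Pearson correlation r between one feature and the target, turned into F = r² / (1 − r²) · (n − 2). The toy values and the names `target`, `strong` and `weak` are made up for illustration and are not taken from the mpg dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def f_statistic(x, y):
    """Univariate F-statistic for a single feature x against target y."""
    r = pearson_r(x, y)
    return r * r / (1 - r * r) * (len(x) - 2)

# Hypothetical toy data: one feature tracks the target closely, one barely at all
target = [30, 25, 22, 18, 15, 12]
strong = [2000, 2400, 2800, 3300, 3700, 4200]
weak = [15, 12, 16, 13, 15, 14]

scores = {'strong': f_statistic(strong, target),
          'weak': f_statistic(weak, target)}

# Rank the features by score, best first
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Each feature is scored without looking at the others, so this kind of test cannot tell whether two highly ranked features carry the same information.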

### SelectKBest()

In [13]:
from sklearn.feature_selection import SelectKBest, f_regression

for n in range(1, 6):
    # Select top n features based on f_regression
    selector = SelectKBest(f_regression, k=n)
    selector.fit(X, y)
    print('Top', n, 'features:', X.columns[selector.get_support()])

Top 1 features: Index(['weight'], dtype='object')
Top 2 features: Index(['displacement', 'weight'], dtype='object')
Top 3 features: Index(['displacement', 'horsepower', 'weight'], dtype='object')
Top 4 features: Index(['cylinders', 'displacement', 'horsepower', 'weight'], dtype='object')
Top 5 features: Index(['cylinders', 'displacement', 'horsepower', 'weight', 'model_year'], dtype='object')


### SelectPercentile()

In [49]:
from sklearn.feature_selection import SelectPercentile, f_regression

for n in range(10, 110, 10):
    selector = SelectPercentile(f_regression, percentile=n)
    selector.fit(X, y)
    print('Top', n, '% features:', X.columns[selector.get_support()])

Top 10 % features: Index(['weight'], dtype='object')
Top 20 % features: Index(['weight'], dtype='object')
Top 30 % features: Index(['displacement', 'weight'], dtype='object')
Top 40 % features: Index(['displacement', 'weight'], dtype='object')
Top 50 % features: Index(['displacement', 'horsepower', 'weight'], dtype='object')
Top 60 % features: Index(['displacement', 'horsepower', 'weight'], dtype='object')
Top 70 % features: Index(['cylinders', 'displacement', 'horsepower', 'weight'], dtype='object')
Top 80 % features: Index(['cylinders', 'displacement', 'horsepower', 'weight'], dtype='object')
Top 90 % features: Index(['cylinders', 'displacement', 'horsepower', 'weight', 'model_year'], dtype='object')
Top 100 % features: Index(['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
       'model_year'],
      dtype='object')


## 2. Model-based Selection

* Use a supervised learning model to judge the importance of each feature.
    * The model used for selection can differ from the one used for the final task.
* Select only the most important features.
* All features are considered at once.

In [18]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Select the most important features according to a linear regression model using SelectFromModel
sfm_selector = SelectFromModel(estimator=LinearRegression())

sfm_selector.fit(X, y)
X.columns[sfm_selector.get_support()]

Index(['cylinders', 'model_year'], dtype='object')
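
One thing worth keeping in mind when reading this result: for a linear model, `SelectFromModel` ranks features by the absolute size of the fitted coefficients, and with default settings it keeps those at or above the mean importance (as far as the scikit-learn documentation describes it). Coefficient size depends on the units of each feature, so a feature measured in large units can look unimportant. The sketch below uses NumPy only and synthetic data; the names `x_big`, `x_small`, `abs_coefs` and `keep_above_mean` are hypothetical, and the code is a simplified illustration, not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical features on very different scales
x_big = rng.normal(3000.0, 800.0, n)   # large units
x_small = rng.normal(5.0, 1.5, n)      # small units
X_toy = np.column_stack([x_big, x_small])

# Toy target: x_big drives most of the variation in y_toy
y_toy = -0.006 * x_big + 0.5 * x_small + rng.normal(0.0, 1.0, n)

def abs_coefs(X, y):
    """Absolute least-squares coefficients (intercept fitted, then dropped)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.abs(beta[1:])

def keep_above_mean(importance):
    """Keep the features whose importance is at least the mean importance."""
    return importance >= importance.mean()

# Importance from the raw features
raw = abs_coefs(X_toy, y_toy)

# Importance after standardizing each feature to zero mean and unit variance
X_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled = abs_coefs(X_scaled, y_toy)

print('raw:   ', raw, keep_above_mean(raw))
print('scaled:', scaled, keep_above_mean(scaled))
```

On the raw features the small-unit column is kept, while after standardizing the choice flips to the column that actually explains most of the target. Standardizing the features before a coefficient-based selection is therefore a common precaution.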

## 3. Iterative Selection: Recursive Feature Elimination (RFE)

* Builds a series of models, changing the feature set one step at a time.
* Starts with all features to build a model, discards the least important features, and builds a new model with the remaining features.
* Repeats until a pre-specified number of features remain
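
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's `RFE` implementation: it uses ordinary least squares as the model, the absolute coefficient as the importance, drops one feature per round, and runs on synthetic, already comparable features. The names `fit_coefs` and `rfe_sketch` are hypothetical.

```python
import numpy as np

def fit_coefs(X, y):
    """Least-squares coefficients for X (intercept fitted, then dropped)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

def rfe_sketch(X, y, n_keep):
    """Return the column indices that survive recursive elimination."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Build a model on the features that are still in play
        coefs = fit_coefs(X[:, remaining], y)
        # Discard the feature with the smallest absolute coefficient
        weakest = int(np.argmin(np.abs(coefs)))
        remaining.pop(weakest)
    return remaining

# Synthetic data: four features on the same scale with decreasing influence
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + 0.3 * X_toy[:, 2]
         + rng.normal(0.0, 0.5, 300))

for k in range(3, 0, -1):
    print('Top', k, 'features:', rfe_sketch(X_toy, y_toy, k))
```

Because the model is refitted after every removal, the importance of the surviving features is re-estimated each round, which is what distinguishes this approach from a single model-based selection.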

In [26]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

for n in range(5, 0, -1):
    # Keep the n most important features according to linear regression
    rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=n, step=1)
    rfe_selector.fit(X, y)
    print('Top', n, 'features:', X.columns[rfe_selector.get_support()])

Top 5 features: Index(['cylinders', 'displacement', 'weight', 'acceleration', 'model_year'], dtype='object')
Top 4 features: Index(['cylinders', 'displacement', 'acceleration', 'model_year'], dtype='object')
Top 3 features: Index(['cylinders', 'acceleration', 'model_year'], dtype='object')
Top 2 features: Index(['cylinders', 'model_year'], dtype='object')
Top 1 features: Index(['cylinders'], dtype='object')


### References

* https://lucashomil.github.io/datascience/blog-2.html
* https://towardsdatascience.com/5-feature-selection-method-from-scikit-learn-you-should-know-ed4d116e4172