## Learning Objectives

Today we will learn about two types of feature selection, variance thresholding and univariate feature selection, and put both into practice on a real dataset.

## Feature selection

So you know that you want to do supervised learning. You have already gathered relevant summary statistics on the data and visualized it. You have transformed your qualitative features, imputed or dropped NaNs, and standardized the quantitative ones. Now you need to decide which features to include in the final prediction task. This part of machine learning is called feature selection. Again, we are going to cover a couple of practical methods for doing feature selection while explaining the intuition behind them. Let's start off by repeating the preprocessing that we did before:

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/billionaires.csv')

del df['was founder']
del df['inherited']
del df['from emerging']

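# treat -1 and 0 as missing-value placeholders and mark them as NaN
# (assigning the result back avoids relying on inplace edits of a column)
df['age'] = df['age'].replace(-1, np.nan)
df['founded'] = df['founded'].replace(0, np.nan)
df['gdp'] = df['gdp'].replace(0, np.nan)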

In [None]:
# first we delete some features that we won't use in prediction
data = df.copy()
del data['company.name']
del data['name']
del data['country code']
del data['citizenship']
del data['rank']
del data['relationship']
del data['sector']

In [15]:
data.head()

|   | age | category | company.type | founded | gdp | gender | industry | region | was political | wealth.type | worth in billions | year |
|---|-----|----------|--------------|---------|-----|--------|----------|--------|---------------|-------------|-------------------|------|
| 0 | NaN | Financial | new | 1968.0 | 158000000000.0 | male | Money Management | Middle East/North Africa | False | self-made finance | 1.0 | 1996 |
| 1 | 34.0 | Financial | new | 1946.0 | 8100000000000.0 | female | Money Management | North America | False | inherited | 2.5 | 1996 |
| 2 | 59.0 | Non-Traded Sectors | new | 1948.0 | 854000000000.0 | male | Retail, Restaurant | Latin America | False | inherited | 1.2 | 1996 |
| 3 | 61.0 | New Sectors | new | 1881.0 | 2500000000000.0 | male | Technology-Medical | Europe | False | inherited | 1.0 | 1996 |
| 4 | NaN | Financial | new | 1816.0 | 160000000000.0 | male | Money Management | East Asia | False | inherited | 2.2 | 1996 |


In [50]:
# next we will make the qualitative columns into dummies
# (that is, every column that is not float64)
non_float_columns = data.select_dtypes(exclude=['float64']).columns
dummy_data = pd.get_dummies(data, dummy_na=True, columns=non_float_columns, drop_first=True)

len(dummy_data.columns)

71

In [51]:
from sklearn.preprocessing import StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

y = dummy_data['worth in billions']
del dummy_data['worth in billions']

qualitative_features = dummy_data.select_dtypes(exclude=['float64'])
quantitative_features = dummy_data.select_dtypes(include=['float64'])

# first we impute nan values with the column median
imp = SimpleImputer(strategy='median')
quant_X = imp.fit_transform(quantitative_features)

# and now we scale the data that is quantitative
std_scaler = StandardScaler()
quant_X = std_scaler.fit_transform(quant_X)

quant_X.shape

(2614, 3)
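
The impute-then-scale steps above can also be chained into a single object. The following is a minimal sketch on made-up numbers, not the billionaires data; `toy_quant` and `prep` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two quantitative columns (values are made up).
toy_quant = np.array([
    [34.0, 1946.0],
    [np.nan, 1968.0],
    [59.0, np.nan],
    [61.0, 1881.0],
])

# Median imputation followed by standardization, fitted in one call.
prep = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
toy_scaled = prep.fit_transform(toy_quant)

print(np.isnan(toy_scaled).any())   # no missing values should remain
print(toy_scaled.mean(axis=0))      # each column is centered near 0
print(toy_scaled.std(axis=0))       # each column has unit standard deviation
```

Keeping both steps in one pipeline means the same medians and scaling factors can later be reapplied to new data with `prep.transform`.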

## Two strategies

We will be going over two strategies here, both simple ways to reduce the total number of features in your model. (In the next few classes we will go over some non-obvious reasons why you might want to do this; for now you can think of it as dropping useless data.) The two strategies are:

1. Variance thresholding
2. Univariate feature selection

#### Variance thresholding

This is a simple strategy that removes low-variance features from the model. It rests on a large implicit assumption: that features of low variance will not be as predictive. I'll show you how to use it below:

In [52]:
X = np.concatenate([quant_X, qualitative_features], axis=1)
X.shape

(2614, 70)

In [53]:
from sklearn.feature_selection import VarianceThreshold

# we start off with 70 columns (the 71 dummy_data columns minus the target)
vt = VarianceThreshold(threshold=0.1)
X_vt = vt.fit_transform(X)

# we end with 17
X_vt.shape

(2614, 17)

In [54]:
# it can often be good to threshold at 0.0 just to remove constant columns,
# which cannot help any model
vt = VarianceThreshold(threshold=0.0)
X = vt.fit_transform(X)

X.shape

(2614, 67)
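
To get a feel for what these thresholds mean for dummy columns, recall that a 0/1 column in which a share `p` of the rows equals 1 has variance `p * (1 - p)`. The sketch below uses made-up columns (the names `toy_X` and `ones_per_column` are hypothetical) to show which of them survive each threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical 0/1 dummy columns in which 50%, 10% and 0% of 1000 rows equal 1.
n_rows = 1000
ones_per_column = [500, 100, 0]
toy_X = np.zeros((n_rows, len(ones_per_column)))
for col, n_ones in enumerate(ones_per_column):
    toy_X[:n_ones, col] = 1.0

# The variance of a 0/1 column is p * (1 - p): roughly 0.25, 0.09 and 0.0 here.
print(toy_X.var(axis=0))

# Only columns with variance strictly above the threshold are kept.
kept_at_01 = VarianceThreshold(threshold=0.1).fit(toy_X).get_support()
kept_at_00 = VarianceThreshold(threshold=0.0).fit(toy_X).get_support()
print(kept_at_01)  # the 10% column and the constant column are dropped
print(kept_at_00)  # only the constant column is dropped
```

So a threshold of 0.1 discards any dummy that is "on" (or "off") in only about a tenth of the rows or fewer, which is consistent with so many of our rare-category dummies disappearing above.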

#### Univariate feature selection

The next best thing we can do is to select features based on a scoring function. The scoring function will do its best to tell us how relevant each feature is, but realize that this has shortcomings. First, a univariate score looks at one feature at a time, and without seeing how the features work together it is hard to tell which will be the best ones. Scoring every subset of features instead quickly becomes an extreme combinatorial problem. Second, each model makes different assumptions, and thus may require a different scoring function.
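
The first shortcoming is easy to reproduce on synthetic data. In this sketch (all names and numbers are made up for illustration), the target depends only on the product of two binary features, so each feature scored alone looks close to useless even though together they determine the target almost exactly.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
n = 2000

# Two hypothetical binary features, each +1 or -1 with equal probability.
a = rng.choice([-1.0, 1.0], size=n)
b = rng.choice([-1.0, 1.0], size=n)

# The target depends only on their product (an XOR-style interaction) plus noise.
target = a * b + 0.1 * rng.randn(n)

# Score a and b on their own, next to an engineered product feature.
toy_X = np.column_stack([a, b, a * b])
mi = mutual_info_regression(toy_X, target, discrete_features=True, random_state=0)

for name, score in zip(['a', 'b', 'a*b'], mi):
    print('{}: {:.3f}'.format(name, score))
```

The scores for `a` and `b` should come out near zero while the product feature scores much higher, so a univariate selector keeping only the top features would favor the product and could easily discard the two raw features that generate it.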

That being said let's go ahead and pick out the 'best' 15 features from our model:

In [56]:
from sklearn.feature_selection import SelectKBest
# we select mutual information regression because
# we are initially interested in worth in billions, a quantitative target
from sklearn.feature_selection import mutual_info_regression

skb = SelectKBest(mutual_info_regression, k=15)

# notice that this also needs to know the y variable
X = skb.fit_transform(X, y)

X.shape

(2614, 15)

In [64]:
# we can use the get_support function in order to see the columns that we took:
dummy_data.columns[vt.get_support()][skb.get_support()]

Index([u'founded', u'gdp', u'company.type_ new', u'company.type_new',
       u'company.type_subsidiary', u'industry_Consumer',
       u'industry_Diversified financial', u'industry_Energy',
       u'industry_Non-consumer industrial', u'industry_Real Estate',
       u'industry_Technology-Computer', u'industry_nan',
       u'region_North America', u'wealth.type_privatized and resources',
       u'wealth.type_self-made finance'],
      dtype='object')
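
The chained indexing works because each `get_support()` mask is relative to the columns that went into that selector, so the masks must be applied in the same order as the selectors. Here is a minimal sketch on made-up data; the column names are hypothetical, and `f_regression` is swapped in as the scoring function only because it is deterministic on such a tiny example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

# Hypothetical columns: one constant, one tracking the target, one unrelated.
names = np.array(['constant', 'signal', 'unrelated'])
toy_y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toy_X = np.column_stack([
    np.ones(6),
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
])

# Step 1: drop constant columns. The mask has one entry per original column.
toy_vt = VarianceThreshold(threshold=0.0).fit(toy_X)
X_step1 = toy_vt.transform(toy_X)

# Step 2: keep the single best-scoring column. The mask has one entry per surviving column.
toy_skb = SelectKBest(f_regression, k=1).fit(X_step1, toy_y)

# Apply the masks in the same order as the selectors to recover the names.
selected = names[toy_vt.get_support()][toy_skb.get_support()]
print(selected)
```

This only gives the right names if `names` is in the same order as the columns of the array that was passed to the first selector.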

## 1% finished

This is certainly not the end. There are many more strategies, and if you want a more comprehensive take on them, check out [this video](https://www.youtube.com/watch?v=wjKvyk8xStg&list=PLgJhDSE2ZLxb33q-x5592LCiVRsHDxVf3&index=6). (One of my favorites is recursive feature elimination.)

But even more so, we are just about to begin our journey into supervised learning. So get ready!

## Comprehension Questions

1. Why don’t we use all the features? Why do we only select some?
2. How can we tell what a good feature is?
3. Would you want to standardize before or after variance thresholding?
4. Why can’t we do multivariate feature selection?
5. How do you know how many features to select with univariate feature selection?

