# Using Correlations

### Introduction

In this section, we'll practice using the eli5 library to select our most important features.  We'll work with the Boston housing dataset.  Let's get started.

### Becoming more picky

We'll begin by loading the Boston housing dataset.

In [13]:
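# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
from sklearn.datasets import load_boston
import pandas as pd

data = load_boston()

# Feature matrix and target (MEDV, the median home value)
X = data['data']
y = data['target']

# Wrap the features in a DataFrame so the columns carry their names
X = pd.DataFrame(X, columns=data['feature_names'])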

In [14]:
X[:2]

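|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |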


Let's take a look at what some of these abbreviations signify.

    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's (this is our target, `y`, so it does not appear in `X`)


Then we can split our data.

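First, we hold out forty percent of the data as a test set.  Then we divide that held-out set in half, allocating fifty percent of it to our validation set.  This leaves roughly sixty percent of the data for training, twenty percent for validation, and twenty percent for testing.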

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

In [47]:
X.shape

(506, 13)

In [48]:
X_train.shape, X_validate.shape, X_test.shape

((303, 13), (101, 13), (102, 13))
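The shapes above follow from simple arithmetic on the two `test_size` fractions. The sketch below reproduces them in plain Python, assuming that a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the other part; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second (test) part is rounded up and the
    first part receives whatever rows remain.
    """
    n_second = math.ceil(n_samples * test_size)
    return n_samples - n_second, n_second


# First split: 40% of the 506 rows are held out.
n_train, n_holdout = split_sizes(506, 0.4)

# Second split: the held-out rows are halved into validation and test.
n_validate, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_validate, n_test)  # 303 101 102
```

The three counts sum to 506, so every observation lands in exactly one of the three sets.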

Our dataset has thirteen features and 506 observations, which we have split into 303 training, 101 validation, and 102 test observations.  Ok, now let's fit a model and narrow down our features.

### Training the model

Start by fitting a linear regression model on the training data and scoring it on the validation data.

In [41]:
from sklearn.linear_model import LinearRegression

In [42]:
model = LinearRegression()

model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [49]:
model.score(X_validate, y_validate)

0.7171294823932444
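For a regression model, `score` returns the coefficient of determination, R². As a minimal sketch of what that number measures, the function below computes R² by hand on made-up values; `r_squared` and the toy lists are hypothetical and are not part of the lesson's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

score = r_squared(y_true, y_pred)               # close to 1: good fit
perfect = r_squared(y_true, y_true)             # exactly 1.0
mean_only = r_squared(y_true, [25.0] * 4)       # exactly 0.0

print(round(score, 3), perfect, mean_only)
```

A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than always predicting the mean. Our validation score of about 0.717 sits between the two.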

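Now let's begin to narrow down our dataset to the most important features.  Use eli5's `PermutationImportance` class to find them.  Permutation importance shuffles the values of one feature at a time and measures how much the model's score drops; the larger the drop, the more the model relies on that feature.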

In [50]:
from eli5.sklearn import PermutationImportance
import eli5
import numpy as np

perm = PermutationImportance(model).fit(X_validate, y_validate)

exp_df = eli5.explain_weights_df(perm, feature_names = list(X_train.columns))
exp_df

|    | feature | weight   | std      |
|----|---------|----------|----------|
| 0  | LSTAT   | 0.444968 | 0.082064 |
| 1  | DIS     | 0.237460 | 0.039644 |
| 2  | RAD     | 0.191142 | 0.023422 |
| 3  | RM      | 0.143388 | 0.021292 |
| 4  | TAX     | 0.070999 | 0.023582 |
| 5  | PTRATIO | 0.069753 | 0.014583 |
| 6  | NOX     | 0.068004 | 0.025333 |
| 7  | CRIM    | 0.031118 | 0.010048 |
| 8  | B       | 0.029326 | 0.014082 |
| 9  | CHAS    | 0.010035 | 0.015782 |
| 10 | ZN      | 0.002843 | 0.007735 |
| 11 | AGE     | 0.001398 | 0.002491 |
| 12 | INDUS   | 0.000502 | 0.002366 |
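To make the weights above less of a black box, here is a rough sketch of the idea behind permutation importance, written in plain Python rather than with eli5. It assumes a toy "fitted model" (`predict`) that uses only the first of two features; `predict`, `r_squared`, `permutation_importance`, and the generated data are all hypothetical, and eli5's actual implementation may differ in its details.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def predict(row):
    # Hypothetical fitted model: relies on feature 0 and ignores feature 1.
    return 3.0 * row[0]


def permutation_importance(rows, y, feature, n_iter=5, seed=0):
    """Average drop in R^2 after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = r_squared(y, [predict(r) for r in rows])
    drops = []
    for _ in range(n_iter):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        shuffled = [r[:feature] + [v] + r[feature + 1:]
                    for r, v in zip(rows, column)]
        drops.append(baseline - r_squared(y, [predict(r) for r in shuffled]))
    return sum(drops) / n_iter


# Toy data: the target depends on feature 0 only, plus a little noise.
rng = random.Random(42)
rows = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(50)]
y = [3.0 * r[0] + rng.gauss(0, 1) for r in rows]

weights = {f: permutation_importance(rows, y, f) for f in (0, 1)}
print(weights)
```

Shuffling feature 0 breaks the link between the predictions and the target, so its weight is large. Shuffling feature 1 leaves the predictions untouched, so its weight is exactly zero, much like the near-zero weights of `AGE` and `INDUS` in the table.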


Next, select the features that have a weight above `.005`.

In [52]:
top_df = exp_df[exp_df['weight'] > .005]

top_df.shape

(10, 3)
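As a quick check on the row count, the same threshold can be applied by hand to the weights from the table. This is only a plain-Python sketch of the filter, with the table's values copied into a list of tuples; the variable names are illustrative.

```python
# (feature, weight) pairs copied from the permutation importance table.
weights = [
    ("LSTAT", 0.444968), ("DIS", 0.237460), ("RAD", 0.191142),
    ("RM", 0.143388), ("TAX", 0.070999), ("PTRATIO", 0.069753),
    ("NOX", 0.068004), ("CRIM", 0.031118), ("B", 0.029326),
    ("CHAS", 0.010035), ("ZN", 0.002843), ("AGE", 0.001398),
    ("INDUS", 0.000502),
]

# Keep only the features whose weight exceeds the threshold.
threshold = 0.005
kept = [name for name, weight in weights if weight > threshold]
dropped = [name for name, weight in weights if weight <= threshold]

print(len(kept), dropped)  # 10 ['ZN', 'AGE', 'INDUS']
```

With these weights, the threshold removes `ZN`, `AGE`, and `INDUS`, which leaves the ten rows reported by `top_df.shape`.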

Finally, plot the features and their corresponding weights with pandas.

In [54]:
top_df.plot(x='feature', y='weight')

<img src="./feat-imp-plt.png" width="40%">

Ok, here the feature importances decline fairly evenly, with no sharp drop-off that would suggest a natural cutoff.  So let's include all ten of these features.  Select the feature names from our `top_df`.

In [29]:
top_cols = top_df.feature.values

top_cols

array(['LSTAT', 'DIS', 'RAD', 'RM', 'PTRATIO', 'NOX', 'TAX', 'ZN', 'CRIM',
       'CHAS'], dtype=object)

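Note that this list differs slightly from the table of weights (for example, it includes `ZN` rather than `B`).  Permutation importance relies on random shuffling, so the ranking of features near the threshold can change from run to run.

Finally, let's retrain the model and see how well it does with fewer features.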

In [59]:
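# Keep only the selected columns in each split
pruned_X_train = X_train[top_cols]
pruned_X_test = X_test[top_cols]

In [60]:
pruned_model = LinearRegression()
# Fit and score on the pruned columns only
pruned_model.fit(pruned_X_train, y_train).score(pruned_X_test, y_test)

0.7336244309845095

Note that the score recorded above came from a run that fit on the full `X_train` and scored on the full `X_test`, so it reflects all thirteen features on the test set rather than the pruned model.  After rerunning the cell with the pruned columns, compare the result with `model.score(X_test, y_test)`.  If the two scores are close, then dropping three features cost us little.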

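### Summary

In this lesson, we practiced feature selection with the eli5 library.  We used permutation importance to rank our features, kept those with a weight above `.005`, and reduced the number of features from 13 to 10 before retraining the model to check whether the score holds up.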