# Original Code

**The following code computes permutation feature importance for a default Decision Tree model trained to predict MEDV in the housing dataset from the midterm (missing values are filled with 0 for simplicity):**

In [1]:
# Load Libraries
%matplotlib inline
import numpy as np
import pandas as pd
import io
from google.colab import files
from sklearn.tree import DecisionTreeRegressor
from sklearn.inspection import permutation_importance


# Read file
uploaded = files.upload()
dat = pd.read_csv(io.BytesIO(uploaded['dataset_midterm.csv']), sep = ",").fillna(0)


# Define the model
model = DecisionTreeRegressor(random_state = 0)


# Train the model on the training split
train = dat[dat['dataset'] == 'train']
model.fit(train.drop(['MEDV', 'dataset'], axis = 1), train.MEDV.values)


# Compute permutation importance on the validation split
val = dat[dat['dataset'] == 'val']
importance = permutation_importance(model,
                  val.drop(['MEDV', 'dataset'], axis = 1),
                  val.MEDV.values,
                  random_state = 1)

# Normalize the absolute mean importances by the largest one and sort in descending order
abs_imp = np.abs(importance.importances_mean)
importance = pd.DataFrame({'variable' : dat.drop(['MEDV', 'dataset'], axis = 1).columns.values,
                           'imp' : abs_imp / np.max(abs_imp)}).sort_values(by = 'imp', ascending = False)


# Show importance scores
importance

Saving dataset_midterm.csv to dataset_midterm.csv


| | variable | imp |
|---|---|---|
| 5 | RM | 1.0 |
| 0 | CRIM | 0.527148 |
| 7 | DIS | 0.290793 |
| 12 | LSTAT | 0.235971 |
| 4 | NOX | 0.156554 |
| 9 | TAX | 0.078964 |
| 11 | B | 0.069776 |
| 10 | PTRATIO | 0.058257 |
| 2 | INDUS | 0.025153 |
| 6 | AGE | 0.001571 |
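
To make the scores above easier to interpret, here is a minimal sketch of what permutation importance measures: the drop in the model's score when a single column is shuffled, averaged over a few repeats. The helper `manual_permutation_importance` and the synthetic data are assumptions made for illustration; this is not scikit-learn's implementation and it does not use the midterm dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)              # R^2 on the untouched data
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores.append(model.score(X_perm, y))
        drops[j] = baseline - np.mean(scores)
    return drops

# Synthetic data (invented for illustration): y depends strongly on column 0,
# weakly on column 1, and not at all on column 2
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1]

# Fit on the first 300 rows, evaluate importance on the held-out 100 rows
toy_model = DecisionTreeRegressor(random_state=0).fit(X[:300], y[:300])
toy_importance = manual_permutation_importance(toy_model, X[300:], y[300:])

# Same normalization as in the cell above: absolute value divided by the maximum
toy_normalized = np.abs(toy_importance) / np.max(np.abs(toy_importance))
print(toy_normalized)
```

Because of the normalization, the most important variable always gets a score of exactly 1.0, and the other scores are read relative to it.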


# Exercise

**Answer the following questions:**

- **Based on the results obtained, which are the two most important variables?**

The two most important variables are *RM* and *CRIM*.

- **Modify the code to keep only the top 80% most important variables and drop the rest after reading the dataset_midterm.csv file.**

In [2]:
n_variables = 0.8*dat.shape[1]
n_variables

12.0

Keeping 80% of the original variables means keeping the 12 most important ones (note that `dat.shape[1]` counts all 15 columns, including *MEDV* and *dataset*). Let's select them from the *importance* variable created before.

In [5]:
selected_vars = importance.variable.values[0:12]
selected_vars

array(['RM', 'CRIM', 'DIS', 'LSTAT', 'NOX', 'TAX', 'B', 'PTRATIO',
       'INDUS', 'AGE', 'RAD', 'ZN'], dtype=object)
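
The count of 12 comes from `dat.shape[1]`, which includes the target and the `dataset` column. If the 80% is instead meant to apply to the feature columns only, the count can be derived from the importance table itself; with 13 features this would give 10 rather than 12. The sketch below illustrates that alternative on a toy table. The helper `top_fraction` is hypothetical, the toy values are invented, and rounding down is an assumption, since the exercise does not say how to round.

```python
import math
import pandas as pd

def top_fraction(importance, fraction):
    """Return the names of the top `fraction` of variables, ranked by 'imp'."""
    # Round down so that we never keep more than the requested share
    n_keep = math.floor(fraction * len(importance))
    ranked = importance.sort_values(by='imp', ascending=False)
    return ranked['variable'].values[:n_keep]

# Toy importance table (values invented for illustration)
toy = pd.DataFrame({'variable': ['a', 'b', 'c', 'd', 'e'],
                    'imp': [0.1, 1.0, 0.4, 0.0, 0.7]})

kept = top_fraction(toy, 0.8)   # 80% of 5 variables -> 4 variables
print(kept)
```

Either convention is defensible; what matters is stating which columns the percentage refers to.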

And now let's read the data again and keep only these columns (plus the target).

In [10]:
dat = pd.read_csv(io.BytesIO(uploaded['dataset_midterm.csv']), sep = ",").fillna(0)
dat = dat[np.concatenate((selected_vars, ['MEDV']))]
dat

| | RM | CRIM | DIS | LSTAT | NOX | TAX | B | PTRATIO | INDUS | AGE | RAD | ZN | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.575 | 0.00632 | 4.0900 | 4.98 | 0.000 | 296 | 396.90 | 15.3 | 2.31 | 65.2 | 1 | 18.0 | 24.0 |
| 1 | 6.421 | 0.02731 | 4.9671 | 9.14 | 0.469 | 242 | 396.90 | 17.8 | 7.07 | 78.9 | 2 | 0.0 | 21.6 |
| 2 | 7.185 | 0.02729 | 4.9671 | 4.03 | 0.469 | 242 | 392.83 | 17.8 | 7.07 | 61.1 | 2 | 0.0 | 34.7 |
| 3 | 6.998 | 0.03237 | 6.0622 | 2.94 | 0.458 | 222 | 394.63 | 18.7 | 2.18 | 45.8 | 3 | 0.0 | 33.4 |
| 4 | 7.147 | 0.06905 | 6.0622 | 0.00 | 0.000 | 222 | 396.90 | 18.7 | 2.18 | 54.2 | 3 | 0.0 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 6.593 | 0.06263 | 2.4786 | 0.00 | 0.000 | 273 | 391.99 | 21.0 | 11.93 | 69.1 | 1 | 0.0 | 22.4 |
| 502 | 6.120 | 0.04527 | 2.2875 | 9.08 | 0.000 | 273 | 396.90 | 21.0 | 11.93 | 76.7 | 1 | 0.0 | 20.6 |
| 503 | 6.976 | 0.06076 | 2.1675 | 5.64 | 0.573 | 273 | 396.90 | 21.0 | 11.93 | 91.0 | 1 | 0.0 | 23.9 |
| 504 | 6.794 | 0.10959 | 2.3889 | 6.48 | 0.000 | 273 | 393.45 | 21.0 | 11.93 | 89.3 | 1 | 0.0 | 22.0 |
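
Selecting only the chosen variables plus `MEDV` also removes the `dataset` column, so the reduced frame can no longer be split into train and validation rows. If the model is to be retrained on the reduced data, one option is to keep that column as well. The sketch below shows this on a toy frame; the values and the two-variable selection are invented for illustration and stand in for the real file and `selected_vars`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dataset_midterm.csv (values invented for illustration)
toy = pd.DataFrame({'RM': [6.5, 6.4, 7.1, 6.9],
                    'ZN': [18.0, 0.0, 0.0, 0.0],
                    'CHAS': [0, 0, 1, 0],
                    'MEDV': [24.0, 21.6, 34.7, 33.4],
                    'dataset': ['train', 'train', 'val', 'val']})
selected = np.array(['RM', 'ZN'], dtype=object)

# Keep the selected features plus the target and the split indicator
reduced = toy[list(selected) + ['MEDV', 'dataset']]

# The train/validation split still works on the reduced frame
train = reduced[reduced['dataset'] == 'train']
val = reduced[reduced['dataset'] == 'val']
X_train, y_train = train.drop(['MEDV', 'dataset'], axis=1), train['MEDV'].values
print(reduced.columns.tolist())
```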
