### Self-Study Colab Activity 8.4: The “Best” Model.

This module was all about regression and using Python's scikit-learn library to build regression models.  Below, a dataset related to real estate prices in California is given.  Throughout the assignments, you have built and evaluated different models; it is just as important to spend some time interpreting the resulting "best" model.


Your goal is to build a regression model to predict the price of a house in California.  After doing so, you are to *interpret* the model.  There are many strategies for doing so, including some built-in methods from scikit-learn.  One example is `permutation_importance`.  Permutation feature importance inspects a fitted model by randomly shuffling the values of one feature at a time and measuring how much the model's score degrades; the larger the drop, the more the model relies on that feature.

Take a look at the user guide for `permutation_importance` [here](https://scikit-learn.org/stable/modules/permutation_importance.html).  Use the `sklearn.inspection` module's implementation of `permutation_importance` to investigate the importance of different features to your regression models.  Share these results on the discussion board.
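As a rough illustration of the idea, the sketch below runs `permutation_importance` on a small synthetic dataset and labels the result with column names. The data, the column names (`signal`, `weak`, `noise`), and the choice of `n_repeats=10` are assumptions made for this example only; they are not part of the housing activity.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): the target depends strongly on "signal",
# weakly on "weak", and not at all on "noise".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "weak": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_demo = 5.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Shuffle each column of the held-out data several times and record the
# average drop in the model's default score (R^2 for regressors).
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# The result holds unlabeled arrays, so pair them with the column names.
importances = pd.Series(result.importances_mean, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Computing the importances on held-out data, as done here, indicates which features matter for generalization rather than for fitting the training rows.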

In [11]:
import pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

In [3]:
import numpy as np

In [4]:
cali = pd.read_csv('module 8/colab_activity8_4_starter/data/housing.csv')

In [5]:
cali.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [6]:
cali.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [15]:
cali.isna().mean()

longitude             0.000000
latitude              0.000000
housing_median_age    0.000000
total_rooms           0.000000
total_bedrooms        0.010029
population            0.000000
households            0.000000
median_income         0.000000
median_house_value    0.000000
ocean_proximity       0.000000
dtype: float64

In [16]:
cali = cali.dropna()

In [17]:
# ocean_proximity may be better treated as ordinal; one-hot encoding is used for now
cali['ocean_proximity'].unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)
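If `ocean_proximity` were treated as ordinal instead, one option is scikit-learn's `OrdinalEncoder` with an explicit category order. The sketch below is only illustrative: the ordering in `proximity_order` is an assumption (a guess at "farthest from the ocean" to "closest"), and the small `sample` frame is made up for the demonstration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# ASSUMPTION: this ordering is a guess for illustration; the dataset itself
# does not define a ranking of these categories.
proximity_order = ["INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND"]
encoder = OrdinalEncoder(categories=[proximity_order])

# Hypothetical sample rows, not taken from housing.csv
sample = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "<1H OCEAN"]})

# Each category is mapped to its position in proximity_order
encoded = encoder.fit_transform(sample)
print(encoded.ravel())
```

An ordinal encoding imposes equal spacing between adjacent categories, which a linear model will take literally, so one-hot encoding remains the safer default when the ordering is uncertain.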

In [18]:
X = cali.drop('median_house_value', axis=1)
y = cali['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [19]:
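# Establish a baseline: predict the mean of the target for every row
baseline_train = np.ones(shape=y_train.shape) * y_train.mean()
# Note: the test baseline uses the test-set mean; y_train.mean() would be the stricter choice
baseline_test = np.ones(shape=y_test.shape) * y_test.mean()
# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the values are unchanged
mse_baseline_train = mean_squared_error(y_train, baseline_train)
mse_baseline_test = mean_squared_error(y_test, baseline_test)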
print(mse_baseline_train)
print(mse_baseline_test)

13321500811.264175
13330932437.58671
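A mean-prediction baseline can also be built with scikit-learn's `DummyRegressor`, which learns the mean from the training targets only and applies it to the test set. The sketch below uses randomly generated stand-in targets (`y_train_demo`, `y_test_demo`) rather than the housing data, so the printed value is not comparable to the numbers above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in targets, used here in place of y_train / y_test
rng = np.random.default_rng(42)
y_train_demo = rng.normal(loc=200_000, scale=100_000, size=700)
y_test_demo = rng.normal(loc=200_000, scale=100_000, size=300)

# DummyRegressor ignores the features, so placeholder zeros are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# strategy="mean" always predicts the mean of the training targets
dummy = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_test_mse = mean_squared_error(y_test_demo, dummy.predict(X_test_demo))
print(baseline_test_mse)
```

Because the constant comes from the training split, no information from the test targets leaks into the baseline.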


In [20]:
best_mse = np.inf
best_pipe = None

train_mses = []
test_mses = []
# Fit polynomial regressions of degree 1 through 10 and track train/test MSE
for i in range(1, 11):
    # Expand numeric columns into polynomial terms; one-hot encode ocean_proximity
    transformer = make_column_transformer(
        (PolynomialFeatures(degree=i), make_column_selector(dtype_include=np.number)),
        (OneHotEncoder(), make_column_selector(dtype_include=np.object_))
    )
    pipe = Pipeline([
        ('transformer', transformer),
        ('linreg', LinearRegression())
    ])
    pipe.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, pipe.predict(X_train))
    test_mse = mean_squared_error(y_test, pipe.predict(X_test))
    train_mses.append(train_mse)
    test_mses.append(test_mse)
    # Keep the pipeline with the lowest test MSE
    # (selecting on the test set makes best_mse an optimistic estimate)
    if test_mse < best_mse:
        best_pipe = pipe
        best_mse = test_mse

print(train_mses)
print(test_mses)
print(best_pipe)
print(best_mse)

[4755931815.553963, 3977883869.6861057, 3505907323.7110233, 3430493004.383332, 5335950229.87282, 8490562306.88913, 11312653494.51514, 12611507071.139803, 12938676107.42846, 13091628432.15559]
[4614164009.958705, 3962746834.2814183, 3747055202.530923, 4270590346.241129, 304886976058.31323, 6120104945066.18, 16754953448077.01, 38912960404435.17, 372486098977584.8, 463982006986407.3]
Pipeline(steps=[('transformer',
                 ColumnTransformer(transformers=[('polynomialfeatures',
                                                  PolynomialFeatures(degree=3),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f6745a0fb00>),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f671e314bc0>)])),
                ('linre
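The loop above picks the degree by comparing test-set MSE directly. One alternative, sketched here under stated assumptions, is to choose the degree by cross-validation on the training data with `GridSearchCV`, leaving the test set for a final check. The synthetic data, the step names (`scale`, `poly`, `linreg`), and the added `StandardScaler` are all choices made for this example; scaling before the polynomial expansion may help with the numerical instability that high degrees can cause, but that has not been verified on the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data with a quadratic relationship in the first feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(300, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("linreg", LinearRegression()),
])

# Try each degree with 5-fold cross-validation, scored by (negated) MSE
search = GridSearchCV(
    pipe,
    param_grid={"poly__degree": [1, 2, 3, 4]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE of the best degree
```

After fitting, `search.best_estimator_` is the pipeline refit on all of the data passed to `fit`, and it can be handed to `permutation_importance` like any other fitted estimator.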

In [21]:
best_pipe

In [22]:
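# The degree-3 model had the lowest test MSE
# Defaults: scoring is the pipeline's own score (R^2) and n_repeats=5
# Note: this uses the full dataset (X, y), which includes the training rows
permutation_importance(best_pipe, X, y)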

{'importances_mean': array([ 2.07916945e+00,  2.48599031e+00,  2.78255712e-01,  1.01666429e+02,
         6.96135645e+01,  7.64166953e+00,  1.94033687e+03,  1.19343481e+00,
        -1.50618635e-06]),
 'importances_std': array([3.45445318e-02, 3.45069716e-02, 2.10189819e-01, 8.82739994e+00,
        5.78361242e+00, 2.10501092e+00, 6.37273952e+01, 8.74091081e-02,
        1.32288596e-07]),
 'importances': array([[ 2.12879670e+00,  2.08543006e+00,  2.04484818e+00,
          2.10043345e+00,  2.03633889e+00],
        [ 2.52350652e+00,  2.48554934e+00,  2.43991068e+00,
          2.52500427e+00,  2.45598075e+00],
        [ 1.24990842e-01,  2.19890058e-01,  1.22239972e-01,
          2.36159246e-01,  6.87998443e-01],
        [ 9.14298805e+01,  1.14356590e+02,  9.95285683e+01,
          9.37895015e+01,  1.09227606e+02],
        [ 6.13231156e+01,  6.87174139e+01,  7.53874390e+01,
          6.59090464e+01,  7.67308078e+01],
        [ 1.01641023e+01,  4.43074590e+00,  9.68217725e+00,
          6.51565