### Self-Study Colab Activity 8.4: The “Best” Model.

This module was all about regression and using Python's scikit-learn library to build regression models. Below, a dataset related to real estate prices in California is given. During many of the assignments, you have built and evaluated different models; it is also important to spend some time interpreting the resulting "best" model.


Your goal is to build a regression model to predict the price of a house in California. After doing so, you are to *interpret* the model. There are many strategies for doing so, including some built-in methods from scikit-learn. One example is `permutation_importance`. Permutation feature importance inspects a fitted model by shuffling the values of one feature at a time and measuring how much the model's score degrades; the larger the drop, the more the model relies on that feature.

Take a look at the user guide for `permutation_importance` [here](https://scikit-learn.org/stable/modules/permutation_importance.html). Use the `sklearn.inspection` module's implementation of `permutation_importance` to investigate the importance of different features to your regression models. Share these results on the discussion board.
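
To make the idea concrete, here is a minimal hand-rolled sketch of permutation importance. It is an illustration, not scikit-learn's implementation: the synthetic data, the helper name `manual_permutation_importance`, and the choice of $R^2$ as the score are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (hypothetical): y depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = 5.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

# Fit on the first 400 rows, inspect on the remaining 100.
toy_model = LinearRegression().fit(X_toy[:400], y_toy[:400])
X_val, y_val = X_toy[400:], y_toy[400:]


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column is shuffled, one column at a time."""
    shuffle_rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling breaks the link between this column and the target.
            X_shuffled[:, col] = shuffle_rng.permutation(X_shuffled[:, col])
            drops[col] += baseline - r2_score(y, model.predict(X_shuffled))
    return drops / n_repeats


manual_importances = manual_permutation_importance(toy_model, X_val, y_val)
print(manual_importances)
```

The column the target depends on most should show the largest drop, and the unused column a drop close to zero.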

In [None]:
import pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

In [None]:
import numpy as np

In [None]:
cali = pd.read_csv('module 8/colab_activity8_4_starter/data/housing.csv')

In [None]:
cali.head()

In [None]:
cali.info()

In [None]:
cali.isna().mean()

In [None]:
cali = cali.dropna()

In [None]:
# These categories arguably have a natural order (ordinal), but one-hot encoding is used for now
cali['ocean_proximity'].unique()
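
If an ordinal encoding were preferred, one possible sketch uses `OrdinalEncoder` with an explicit category order. The category labels and their ordering below are assumptions for illustration only; substitute the values that `unique()` actually returns and an ordering you can justify.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels and ordering (farthest from the ocean first);
# replace with the real categories from the dataset.
assumed_order = ['INLAND', '<1H OCEAN', 'NEAR BAY', 'NEAR OCEAN', 'ISLAND']

demo = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND', 'ISLAND', '<1H OCEAN']})
encoder = OrdinalEncoder(categories=[assumed_order])

# Each label is mapped to its position in assumed_order.
codes = encoder.fit_transform(demo[['ocean_proximity']])
print(codes.ravel())
```

Unlike one-hot encoding, this imposes a single numeric scale on the categories, so a linear model will treat the gaps between adjacent levels as equal.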

In [None]:
X = cali.drop('median_house_value', axis=1)
y = cali['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Establish a baseline: predict the training mean for every row.
# The test baseline also uses the training mean so no test information leaks in.
baseline_train = np.full(y_train.shape, y_train.mean())
baseline_test = np.full(y_test.shape, y_train.mean())
mse_baseline_train = mean_squared_error(y_train, baseline_train)
mse_baseline_test = mean_squared_error(y_test, baseline_test)
print(mse_baseline_train)
print(mse_baseline_test)
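
A mean baseline can also be expressed with scikit-learn's `DummyRegressor`, which learns the mean from the training target and predicts it everywhere. The sketch below runs on synthetic targets (an assumption, so that it is self-contained); with the real data, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the `*_demo` arrays.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic targets (hypothetical values, for illustration only).
rng = np.random.default_rng(42)
y_train_demo = rng.normal(loc=200_000, scale=50_000, size=700)
y_test_demo = rng.normal(loc=200_000, scale=50_000, size=300)

# DummyRegressor ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((700, 1))
X_test_demo = np.zeros((300, 1))

# strategy='mean' stores the training mean and predicts it for every row.
dummy = DummyRegressor(strategy='mean').fit(X_train_demo, y_train_demo)
mse_dummy_train = mean_squared_error(y_train_demo, dummy.predict(X_train_demo))
mse_dummy_test = mean_squared_error(y_test_demo, dummy.predict(X_test_demo))
print(mse_dummy_train, mse_dummy_test)
```

Because the estimator is fit on training data only, the test score reflects what a model with no access to test targets could achieve.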

In [None]:
best_mse = np.inf
best_pipe = None

train_mses = []
test_mses = []
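for degree in range(1, 11):
    # Polynomial terms for numeric columns, one-hot encoding for categorical ones.
    # Higher degrees expand the feature count quickly, so this loop can be slow.
    transformer = make_column_transformer(
        (PolynomialFeatures(degree=degree), make_column_selector(dtype_include=np.number)),
        # handle_unknown='ignore' avoids an error if a rare category is absent from the training split
        (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=np.object_))
    )
    pipe = Pipeline([
        ('transformer', transformer),
        ('linreg', LinearRegression())
    ])
    pipe.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, pipe.predict(X_train))
    test_mse = mean_squared_error(y_test, pipe.predict(X_test))
    train_mses.append(train_mse)
    test_mses.append(test_mse)
    # Keep the pipeline with the lowest test MSE seen so far.
    if test_mse < best_mse:
        best_pipe = pipe
        best_mse = test_mse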

print(train_mses)
print(test_mses)
print(best_pipe)
print(best_mse)

In [None]:
best_pipe

In [None]:
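# The degree-3 model proved the best; measure importance on the held-out test set
result = permutation_importance(best_pipe, X_test, y_test, n_repeats=10, random_state=42)
pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)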
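
The object returned by `permutation_importance` exposes `importances_mean`, `importances_std`, and the raw `importances` array, which are easier to read once labeled with column names. The following is a small sketch on synthetic data; the column names `strong`, `weak`, and `noise` and the toy model are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): the target depends strongly on 'strong',
# weakly on 'weak', and not at all on 'noise'.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=['strong', 'weak', 'noise'])
y_demo = 4.0 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(scale=0.1, size=400)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

demo_model = LinearRegression().fit(Xd_train, yd_train)
demo_result = permutation_importance(
    demo_model, Xd_test, yd_test, n_repeats=10, random_state=0
)

# One row per input column, sorted from most to least important.
summary = pd.DataFrame(
    {
        'importance_mean': demo_result.importances_mean,
        'importance_std': demo_result.importances_std,
    },
    index=X_demo.columns,
).sort_values('importance_mean', ascending=False)
print(summary)
```

When the estimator is a pipeline, the importances refer to the original input columns rather than the transformed polynomial or one-hot features, so the same indexing by column name should apply.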