### Try-it 8.1: The "Best" Model

This module was all about regression and using Python's scikit-learn library to build regression models.  Below, a dataset related to real estate prices in California is given. While in many of the assignments you have built and evaluated different models, it is also important to spend some time interpreting the resulting "best" model.  


Your goal is to build a regression model to predict the price of a house in California.  After doing so, you are to *interpret* the model.  There are many strategies for doing so, including some built-in methods from scikit-learn.  One example is `permutation_importance`.  Permutation feature importance inspects a fitted model by shuffling one feature at a time and measuring how much the model's score drops.  

Take a look at the user guide for `permutation_importance` [here](https://scikit-learn.org/stable/modules/permutation_importance.html).  Use the `sklearn.inspection` module's implementation of `permutation_importance` to investigate the importance of different features to your regression models.  Share these results on the discussion board.
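To make the idea concrete before using the library function, here is a minimal hand-rolled sketch of permutation importance. The data is synthetic (an assumption for illustration, not the housing data): column 0 has a strong effect, column 1 a weak one, and column 2 is pure noise. The names `X_demo`, `y_demo`, and `manual_importance` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic stand-in data: y depends strongly on column 0, weakly on column 1, not at all on column 2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Hold out the last 125 rows for scoring.
X_fit, X_hold = X_demo[:375], X_demo[375:]
y_fit, y_hold = y_demo[:375], y_demo[375:]

demo_model = Ridge(alpha=1e-2).fit(X_fit, y_fit)
base_score = r2_score(y_hold, demo_model.predict(X_hold))

manual_importance = []
for col in range(X_hold.shape[1]):
    drops = []
    for _ in range(10):  # repeat the shuffle to average out randomness
        X_perm = X_hold.copy()
        X_perm[:, col] = rng.permutation(X_perm[:, col])  # break the link between this column and y
        drops.append(base_score - r2_score(y_hold, demo_model.predict(X_perm)))
    manual_importance.append(np.mean(drops))

manual_importance = np.array(manual_importance)
print(manual_importance.round(3))
```

The importance of a feature is the average drop in R² after shuffling it, so the noise column should land near zero. `sklearn.inspection.permutation_importance` follows the same idea and also reports the spread across repeats.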

In [65]:
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

In [2]:
import numpy as np
import matplotlib.pyplot as plt

In [3]:
cali = pd.read_csv('data/housing.csv')

In [4]:
cali.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [5]:
cali.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


By 'sales price', I assume the assignment means `median_house_value`.

In [84]:
X = cali.drop('median_house_value', axis=1)
y = cali['median_house_value']

# split the data into data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

# check the data set shapes
print(X_train.shape) # (14448, 9)
print(X_test.shape)  # (6192, 9)
print(y_train.shape) # (14448,)
print(y_test.shape)  # (6192,)
print(type(X_train), type(y_train)) # DataFrame, Series

(14448, 9)
(6192, 9)
(14448,)
(6192,)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>


Get a baseline

In [26]:
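# predict the mean everywhere as a naive baseline
baseline_train = np.ones(shape=y_train.shape) * y_train.mean()
# note: this uses the test-set mean, which would not be known in practice; y_train.mean() is the stricter choice
baseline_test = np.ones(shape=y_test.shape) * y_test.mean()

# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the values are unchanged
mse_baseline_train = mean_squared_error(y_train, baseline_train) # 13560521547.416183
mse_baseline_test = mean_squared_error(y_test, baseline_test)   #  12743447477.687767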

print(f'mse_baseline_train {mse_baseline_train} mse_baseline_test {mse_baseline_test}')

mse_baseline_train 13560521547.416183 mse_baseline_test 12743447477.687767
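The baseline cell above scores the test set against its own mean. A stricter variant learns the mean from the training targets only and predicts it everywhere. The sketch below shows this with scikit-learn's `DummyRegressor` on a tiny made-up target vector (the numbers are hypothetical, not from the housing data); in the notebook, `y_train` and `y_test` would take the place of `y_train_demo` and `y_test_demo`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in targets; in the notebook these would be y_train and y_test.
y_train_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_test_demo = np.array([150.0, 450.0])

# DummyRegressor ignores the features, so a placeholder column of zeros is enough.
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)

# The prediction is the training mean (250.0) for every test row.
test_pred = dummy.predict(np.zeros((len(y_test_demo), 1)))
mse_test = mean_squared_error(y_test_demo, test_pred)
print(test_pred, mse_test)
```

Because the test mean never enters the prediction, this baseline is directly comparable to any model that is fit on the training split alone.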


In [27]:
# what are the top 4 numeric correlations w/ 'median_house_value'?
# (n=5 because the target itself ranks first with a correlation of 1.0)
four_highest_correlations = cali.corr(numeric_only=True)[['median_house_value']].nlargest(columns='median_house_value', n=5)
# median_income	        0.688075
# total_rooms	        0.134153
# housing_median_age	0.105623
# households	        0.065843

To continue, I need to encode 'ocean_proximity', since the regression models only accept numeric inputs.

In [86]:
# cali.ocean_proximity.unique()
# array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
#       dtype=object)

categorical_features = ['ocean_proximity']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer(verbose_feature_names_out=True, transformers=[
    ("cat", categorical_transformer, categorical_features),
])

# ct.get_feature_names_out() returns the encoded column names
# (e.g. 'cat__ocean_proximity_INLAND'), so the columns no longer show up as 0, 1, ...

# fit the encoder on the training set only, then reuse it on the test set;
# index= keeps the original row labels so the join below lines rows up correctly
df_x_train = pd.DataFrame(ct.fit_transform(X_train).toarray(),
                          index=X_train.index, columns=ct.get_feature_names_out())
df_x_test = pd.DataFrame(ct.transform(X_test).toarray(),
                         index=X_test.index, columns=ct.get_feature_names_out())

# need to drop 'ocean_proximity' column
X_train = X_train.drop(columns=['ocean_proximity'])
X_test = X_test.drop(columns=['ocean_proximity'])

# need to append new columns
X_train = X_train.join(df_x_train)
X_test = X_test.join(df_x_test)



|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 14697 | -117.09 | 32.79 | 36.0 | 1936.0 | 345.0 | 861.0 | 343.0 | 3.8333 | NEAR OCEAN |
| 9338 | -122.61 | 37.99 | 40.0 | 7737.0 | 1488.0 | 3108.0 | 1349.0 | 4.4375 | NEAR OCEAN |
| 19511 | -121.02 | 37.62 | 30.0 | 1721.0 | 399.0 | 1878.0 | 382.0 | 2.5363 | INLAND |
| 14232 | -117.04 | 32.68 | 11.0 | 1875.0 | 357.0 | 1014.0 | 386.0 | 4.3750 | NEAR OCEAN |
| 13398 | -117.53 | 34.10 | 5.0 | 2185.0 | 488.0 | 1379.0 | 458.0 | 3.7917 | INLAND |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13970 | -116.85 | 34.26 | 19.0 | 5395.0 | 1220.0 | 981.0 | 366.0 | 2.6094 | INLAND |
| 9181 | -118.55 | 34.37 | 21.0 | 7010.0 | 1063.0 | 3331.0 | 1038.0 | 6.7760 | <1H OCEAN |
| 18911 | -122.24 | 38.14 | 15.0 | 8479.0 | 1759.0 | 5008.0 | 1646.0 | 3.7240 | NEAR BAY |
| 15956 | -122.45 | 37.71 | 41.0 | 1578.0 | 351.0 | 1159.0 | 299.0 | 3.9167 | NEAR OCEAN |


Check for multicollinearity in the data using the variance inflation factor (VIF) to decide which columns to drop.

In [9]:
# from office hours 8/3/23
def sk_vif(ind_variables, data):
    vif_dict = {}
    for ind_var in ind_variables:
        # split the dataset: one variable against all others
        not_ind_var = [i for i in ind_variables if i != ind_var]
        X, y = data[not_ind_var], data[ind_var]
        # fit model and compute R^2
        r_squared = LinearRegression().fit(X, y).score(X, y)
        # compute VIF
        vif = 1/(1-r_squared)
        vif_dict[ind_var] = vif
    return pd.DataFrame({'VIF':vif_dict})

print('VIF for original dataset')
sk_vif(X.columns, X).sort_values(by='VIF', ascending=False)

VIF for original dataset


ValueError: could not convert string to float: 'NEAR BAY'
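The error above comes from passing the string column `ocean_proximity` to `LinearRegression`. One way around it, sketched below, is to restrict the VIF calculation to numeric columns and drop rows with missing values first (in this dataset `total_bedrooms` has some). The helper name `numeric_vif` and the toy DataFrame are assumptions for illustration, not part of the office-hours code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def numeric_vif(data):
    """VIF per numeric column: 1 / (1 - R^2) from regressing it on the other numeric columns."""
    # Keep numeric columns only and drop rows with NaN, since LinearRegression rejects both.
    numeric = data.select_dtypes(include="number").dropna()
    vif = {}
    for col in numeric.columns:
        others = numeric.drop(columns=col)
        r_squared = LinearRegression().fit(others, numeric[col]).score(others, numeric[col])
        vif[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vif, name="VIF").sort_values(ascending=False)

# Toy data (hypothetical): 'bedrooms' is nearly a copy of 'rooms', 'income' is independent.
rng = np.random.default_rng(0)
rooms = rng.normal(size=200)
demo = pd.DataFrame({
    "rooms": rooms,
    "bedrooms": rooms + rng.normal(scale=0.1, size=200),
    "income": rng.normal(size=200),
    "proximity": ["INLAND", "NEAR BAY"] * 100,  # string column, skipped by select_dtypes
})
demo.loc[0, "bedrooms"] = np.nan  # a missing value, removed by dropna

vif_demo = numeric_vif(demo)
print(vif_demo)
```

In the notebook this would be called as `numeric_vif(X)`. The two nearly collinear columns get large VIFs while the independent one stays close to 1, which is the pattern to look for when deciding what to drop.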

In [23]:
# create a simple model using 'median_income'

X = X_train[['median_income']]
linear_regression = LinearRegression().fit(X, y_train)
train_mse = mean_squared_error(y_train, linear_regression.predict(X))
test_mse = mean_squared_error(y_test, linear_regression.predict(X_test[['median_income']]))

print(f'simple model using "median_income"\ntrain_mse {train_mse} test_mse {test_mse}')

simple model using "median_income"
train_mse 7187740209.262024 test_mse 6600041018.770579


In [15]:
# Permutation feature importance
# src: https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance

model = Ridge(alpha=1e-2).fit(X_train, y_train)
model.score(X_test, y_test) # ValueError: could not convert string to float: 'NEAR OCEAN'

ValueError: could not convert string to float: 'NEAR OCEAN'
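`Ridge` fails here because `X_train` still contains strings, and it would also reject the missing values in `total_bedrooms`. A possible fix, sketched below under stated assumptions, is to put the imputation and one-hot encoding inside a `Pipeline` so that `permutation_importance` can shuffle the raw columns. The toy DataFrame only borrows three column names from the housing data; its values and the relationship to the target are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values): income drives the target, proximity adds a smaller shift,
# and total_bedrooms is unrelated to the target.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "median_income": rng.uniform(1, 10, size=n),
    "total_bedrooms": rng.uniform(100, 1000, size=n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], size=n),
})
toy_y = (40_000 * toy["median_income"]
         + 50_000 * (toy["ocean_proximity"] != "INLAND")
         + rng.normal(scale=5_000, size=n))
toy.loc[::25, "total_bedrooms"] = np.nan  # mimic the missing values in the real data

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["median_income", "total_bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])
pipe = Pipeline([("preprocess", preprocess), ("ridge", Ridge(alpha=1e-2))])

# Simple positional split: first 300 rows to fit, last 100 to score.
toy_train, toy_test = toy.iloc[:300], toy.iloc[300:]
pipe.fit(toy_train, toy_y.iloc[:300])

# Shuffle each raw column of the held-out data 10 times and record the mean drop in R^2.
result = permutation_importance(pipe, toy_test, toy_y.iloc[300:], n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=toy_test.columns).sort_values(ascending=False)
print(importances)
```

Because the encoder lives inside the pipeline, the importance is reported once per original column (`ocean_proximity`) rather than once per one-hot column, which tends to be easier to interpret and share.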

In [None]:
# get values for ocean_proximity

# cali.ocean_proximity.unique()
# array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
#       dtype=object)

In [5]:
# convert 'ocean_proximity' to numeric values using OrdinalEncoder
# ISLAND      -> Isl: 0
# NEAR OCEAN  -> Near: 1
# <1H OCEAN   -> Close: 2
# INLAND      -> Far: 3
# NEAR BAY    -> not mapped yet



array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)
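The ordinal mapping sketched in the comments could be implemented by passing an explicit category order to `OrdinalEncoder`. The comments do not place `NEAR BAY`, so its position below (right after `NEAR OCEAN`) is purely an assumption, and it shifts the codes for `<1H OCEAN` and `INLAND` up by one compared with the comments.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering from closest to farthest; the slot for 'NEAR BAY' is a guess.
proximity_order = ["ISLAND", "NEAR OCEAN", "NEAR BAY", "<1H OCEAN", "INLAND"]
encoder = OrdinalEncoder(categories=[proximity_order])

# One row per category, in the order returned by cali.ocean_proximity.unique().
sample = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR OCEAN", "ISLAND"]
})

# Each category is replaced by its position in proximity_order.
codes = encoder.fit_transform(sample[["ocean_proximity"]]).ravel()
print(dict(zip(sample["ocean_proximity"], codes)))
```

An ordinal encoding tells a linear model that the categories are evenly spaced along one axis, which is a strong assumption; comparing it against the one-hot version on the test set would show whether it holds up.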