### Try-it 8.1: The "Best" Model

This module was all about regression and using Python's scikit-learn library to build regression models.  Below, a dataset related to real estate prices in California is given. While in many of the assignments you have built and evaluated different models, it is also important to spend some time interpreting the resulting "best" model.  


Your goal is to build a regression model to predict the price of a house in California.  After doing so, you are to *interpret* the model.  There are many strategies for doing so, including some built-in methods from scikit-learn.  One example is `permutation_importance`.  Permutation feature importance is a strategy for inspecting a model and the importance of its features: it measures how much the model's score degrades when the values of a single feature are randomly shuffled.  

Take a look at the user guide for `permutation_importance` [here](https://scikit-learn.org/stable/modules/permutation_importance.html).  Use the `sklearn.inspection` module's implementation of `permutation_importance` to investigate the importance of different features to your regression models.  Share these results on the discussion board.
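
To make the idea concrete, here is a minimal from-scratch sketch of what permutation importance measures: shuffle one column at a time and record how far the model's R^2 drops. The synthetic data and the helper `manual_permutation_importance` are assumptions made for illustration; this is not how scikit-learn implements it internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends strongly on
# column 0, weakly on column 1, and not at all on column 2.
X_demo = rng.normal(size=(500, 3))
y_demo = 5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X_demo, y_demo)


def manual_permutation_importance(model, X, y, rng, n_repeats=10):
    """Return the mean drop in R^2 after shuffling each column n_repeats times."""
    baseline = model.score(X, y)  # R^2 on the unshuffled data
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between column j and the target
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops[j] += baseline - model.score(X_shuffled, y)
    return drops / n_repeats


importances = manual_permutation_importance(model, X_demo, y_demo, rng)
print(importances)
```

A larger drop means the model leans more heavily on that column; a drop near zero means shuffling it barely matters.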

In [10]:
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, PolynomialFeatures

In [2]:
import numpy as np
import matplotlib.pyplot as plt

In [3]:
cali = pd.read_csv('data/housing.csv')

In [4]:
cali.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [5]:
cali.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


First, let's see how much data is actually missing in each column (NaN values etc.)

In [6]:
# isnull() is an alias of isna(), so both lines report the same counts
print(f'null values per column: {cali.isnull().sum()}')
print(f'NaN values per column: {cali.isna().sum()}')

null values per column: longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
NaN values per column: longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64


Since 'total_bedrooms' is an important feature to use as a predictor, we need to drop the 207 rows where this value is null/NaN

In [8]:
# cali.shape # before: (20640, 10)
cali = cali.dropna(subset=['total_bedrooms'], axis='rows')
# cali.shape # 20640 - 207 = 20433
# print(f'null values per column: {cali.isnull().sum()}')
print(f'NaN values per column: {cali.isna().sum()}')

NaN values per column: longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
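
Dropping the rows is the simplest fix. As an alternative, here is a minimal sketch of median imputation with `SimpleImputer` (imported at the top but otherwise unused in this notebook). The `toy` frame and its values are made up for illustration; they are not meant to reproduce rows from `housing.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Replace missing values with the median of the observed values
imputer = SimpleImputer(strategy="median")
toy["total_bedrooms"] = np.asarray(
    imputer.fit_transform(toy[["total_bedrooms"]])
).ravel()

print(toy)
```

This keeps all rows. In a real workflow the imputer would normally be fit on the training split only, so that no information from the test rows leaks into the fill value.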


'ocean_proximity' seems like an interesting feature (and is the only non-numeric column). Let's see how many occurrences there are of each category to ensure that every one of the 20433 rows has a value

In [102]:
# cali.groupby('ocean_proximity').count()
cali['ocean_proximity'].value_counts()
# <1H OCEAN     9034
# INLAND        6496
# NEAR OCEAN    2628
# NEAR BAY      2270
# ISLAND           5
# total: 20433

<1H OCEAN     9034
INLAND        6496
NEAR OCEAN    2628
NEAR BAY      2270
ISLAND           5
Name: ocean_proximity, dtype: int64

Since 'ocean_proximity' is non-numerical, we need to encode it. NOTE - I tried using ColumnTransformer but was not able to recover the output column names from the transformer, so `pd.get_dummies` is used instead

In [17]:
cali = pd.get_dummies(cali, columns=['ocean_proximity'], drop_first=False)
cali

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20428,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,0,1,0,0,0
20429,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,0,1,0,0,0
20430,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7,92300.0,0,1,0,0,0
20431,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,0,1,0,0,0


Now, we will split the data into the 'train' & 'test' sets & look at the permutation importance

In [34]:
X = cali.drop('median_house_value', axis=1)
y = cali['median_house_value']

# split the data into data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# check the data set shapes
# print(X_train.shape) # (16346, 13)
# print(X_test.shape)  # (4087, 13)
# print(y_train.shape) # (16346,)
# print(y_test.shape)  # (4087,)
# print(type(X_train), type(y_train)) # DataFrame, Series

lr_model = LinearRegression().fit(X_train, y_train)
perm_imp = permutation_importance(lr_model, X_train, y_train, n_repeats=11, random_state=42)
perm_imp_mean = perm_imp.importances_mean

# Pair each importance with its feature name. Use X.columns rather than
# cali.columns: cali still contains the target 'median_house_value', which
# would shift the labels of every column after it.
importances = dict(zip(X.columns, perm_imp_mean))

sorted_importances = sorted(importances.items(), key=lambda x: x[1])
sorted_importances

[('ocean_proximity_ISLAND', 0.0007177842127713753),
 ('ocean_proximity_NEAR OCEAN', 0.009128676454520878),
 ('ocean_proximity_NEAR BAY', 0.015213007535046013),
 ('housing_median_age', 0.027645782359361628),
 ('ocean_proximity_<1H OCEAN', 0.03217038917187936),
 ('total_rooms', 0.0347240057100165),
 ('households', 0.05888625637525149),
 ('ocean_proximity_INLAND', 0.1529564855257065),
 ('total_bedrooms', 0.2564620069901641),
 ('population', 0.26088931752300365),
 ('longitude', 0.4168659445867039),
 ('latitude', 0.43253452076095034),
 ('median_income', 0.8397780113374822)]

From the above, we can see that the five most important features (by mean permutation importance) are:

median_income 0.8397780113374822
latitude 0.43253452076095034
longitude 0.4168659445867039
population 0.26088931752300365
total_bedrooms 0.2564620069901641

I would have predicted that 'ocean_proximity' plays an important role in the prediction, but that is a personal bias since I love the ocean :)
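
The importances above were computed on the training split. A minimal sketch of the same call on held-out data, with the spread across repeats reported next to the mean, might look like this. The `demo_X` / `demo_y` data are synthetic and the column names `strong`, `weak`, and `noise` are hypothetical; only `permutation_importance` and its `importances_mean` / `importances_std` attributes come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features (assumption): one strong, one weak, one irrelevant.
demo_X = pd.DataFrame({
    "strong": rng.normal(size=400),
    "weak": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
demo_y = (5.0 * demo_X["strong"] + 1.0 * demo_X["weak"]
          + rng.normal(scale=0.1, size=400))

Xtr, Xte, ytr, yte = train_test_split(demo_X, demo_y, test_size=0.25, random_state=0)
demo_model = LinearRegression().fit(Xtr, ytr)

# Score drop is measured on the test split, not the data the model was fit on
result = permutation_importance(demo_model, Xte, yte, n_repeats=10, random_state=0)

# Label the results with the feature names of the matrix that was permuted
summary = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std},
    index=Xte.columns,
).sort_values("mean", ascending=False)
print(summary)
```

Indexing by the columns of the matrix passed to `permutation_importance` keeps names and values aligned, and the `std` column gives a rough sense of how stable each ranking is.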

Create a new 'X' using these 5 features, then repeat the split, the linear regression fit & the MSE calculation

In [37]:
X = cali[['median_income', 'latitude', 'longitude', 'population', 'total_bedrooms']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)


lr_model_train = LinearRegression().fit(X_train, y_train)
y_predictions = lr_model_train.predict(X_test)
mse = mean_squared_error(y_test, y_predictions)
mse # 4797942984.504951

4797942984.504951
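
An MSE of this size is hard to read because it is in squared units of the target. Taking the square root gives the RMSE, which is in the same units as `median_house_value` (assumed here to be US dollars). The value below is copied from the output above; `reported_mse` is just a name chosen for this sketch.

```python
import math

reported_mse = 4797942984.504951  # test-set MSE reported above
rmse = math.sqrt(reported_mse)    # back in the units of the target

print(f"RMSE: {rmse:,.0f}")
```

So the typical prediction error of this model is on the order of tens of thousands in the target's units.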

In [38]:
# src: office hours 8/3/23
def sk_vif(exogs, data):
    vif_dict = {}

    for exog in exogs:
        not_exog = [i for i in exogs if i !=exog]
        # split the dataset, one independent variable against all others
        X, y = data[not_exog], data[exog]

        # fit the model and obtain R^2
        r_squared = LinearRegression().fit(X,y).score(X,y)

        # compute the VIF
        vif = 1/(1-r_squared)
        vif_dict[exog] = vif

    return pd.DataFrame({"VIF": vif_dict})
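
For reference, the function above follows the usual definition of the variance inflation factor. For feature $j$, regress it on all the other features and take the resulting $R_j^2$:

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

As a quick sanity check on the scale: $R_j^2 = 0.8$ gives $\mathrm{VIF}_j = 5$, and $R_j^2 = 0.9$ gives $\mathrm{VIF}_j = 10$. A VIF of 1 means the feature is not linearly explained by the others at all.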

In [39]:
print("Original dataset")
sk_vif(X.columns, X).sort_values(by='VIF', ascending = False)

Original dataset


Unnamed: 0,VIF
latitude,7.379298
longitude,7.307134
population,4.41195
total_bedrooms,4.381215
median_income,1.065269


Here, we see that latitude and longitude have high VIF values. We will drop the first of the two (latitude) to see if we get a better MSE

In [40]:
X = cali[['median_income', 'longitude', 'population', 'total_bedrooms']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)


lr_model_train = LinearRegression().fit(X_train, y_train)
y_predictions = lr_model_train.predict(X_test)
mse = mean_squared_error(y_test, y_predictions)
mse 
# 1: 4797942984.504951
# 2: 6236986863.593184

6236986863.593184

In [42]:
print("2nd dataset w/o latitude")
sk_vif(X.columns, X).sort_values(by='VIF', ascending = False)

2nd dataset w/o latitude


Unnamed: 0,VIF
population,4.388592
total_bedrooms,4.36562
longitude,1.011961
median_income,1.000968


The result is worse than the initial one. We will drop the longitude column instead and compare

In [43]:
X = cali[['median_income', 'latitude', 'population', 'total_bedrooms']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)


lr_model_train = LinearRegression().fit(X_train, y_train)
y_predictions = lr_model_train.predict(X_test)
mse = mean_squared_error(y_test, y_predictions)
mse 
# 1: 4797942984.504951
# 2: 6236986863.593184
# 3. 6129436094.761154

6129436094.761154

The result here is slightly better than the 2nd approach but still worse than the original

In [44]:
print("3rd dataset w/o longitude")
sk_vif(X.columns, X).sort_values(by='VIF', ascending = False)

3rd dataset w/o longitude


Unnamed: 0,VIF
population,4.405921
total_bedrooms,4.373894
latitude,1.021955
median_income,1.006828
