## Module 6: Model Selection & Regularization Techniques for Regression

### Step 0

Load the required libraries and bring in the data. Because CodeGrade cannot access the internet, we cannot download the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) through scikit-learn. Instead, the script below builds scikit-learn's local cache file from the archive `cal_housing.tgz`, so that `fetch_california_housing` returns the data in the same form as scikit-learn provides it.

In [1]:
# CodeGrade step0

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
import os
import tarfile
import joblib  # used below to write the cache file that scikit-learn reads
from sklearn.datasets._base import _pkl_filepath, get_data_home
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report


archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
os.makedirs(data_home, exist_ok=True)  # create the cache directory if it is missing
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6)  # write the cache file that fetch_california_housing looks for

# Load dataset
california = fetch_california_housing(as_frame=True)
data = california.data
data['MedianHouseValue'] = california.target

# Define predictors and response variable
X = data[['MedInc', 'AveRooms', 'AveOccup']]  # Select predictors
y = data['MedianHouseValue']  # Response variable

Print basic information about the data using `.info()` and `.describe()`.

In [2]:
# Display dataset structure
print(data.info())
print(data.describe())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MedInc            20640 non-null  float64
 1   HouseAge          20640 non-null  float64
 2   AveRooms          20640 non-null  float64
 3   AveBedrms         20640 non-null  float64
 4   Population        20640 non-null  float64
 5   AveOccup          20640 non-null  float64
 6   Latitude          20640 non-null  float64
 7   Longitude         20640 non-null  float64
 8   MedianHouseValue  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.4

### Step 1

Add a constant to `X`, calling it `X_const`.

Now let `full_model` be the OLS model of `y` and `X_const`.

Rounding to the nearest whole number, return the full model's AIC and BIC, separated by a comma.

In [4]:
# CodeGrade step1
X_const = sm.add_constant(X)
full_model = sm.OLS(y, X_const).fit()

np.round(full_model.aic, 0), np.round(full_model.bic, 0)

(50962.0, 50994.0)
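To make the two criteria less of a black box, here is a minimal NumPy-only sketch of how AIC and BIC can be computed by hand for a Gaussian OLS fit. It assumes the convention in which the parameter count `k` includes the intercept and the slopes but not the error variance, which is the convention that reproduces the gap between the AIC and BIC values above. The helper name `gaussian_ols_ic` and the synthetic data are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def gaussian_ols_ic(X, y):
    """Return (aic, bic) for an OLS fit with intercept.

    X is an (n, p) array of predictors WITHOUT a constant column.
    Hypothetical helper for illustration only.
    """
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])      # add the constant column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ssr = float(resid @ resid)                     # sum of squared residuals
    k = design.shape[1]                            # intercept + slopes
    # Gaussian log-likelihood evaluated at the maximum-likelihood variance ssr / n
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    aic = -2 * llf + 2 * k
    bic = -2 * llf + np.log(n) * k
    return aic, bic

# Synthetic data (an assumption for the demo, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(size=200)

aic_demo, bic_demo = gaussian_ols_ic(X_demo, y_demo)
print(aic_demo, bic_demo)
```

Under this convention the two criteria differ only in the penalty, so `bic - aic` equals `k * (log(n) - 2)`. With `n = 20640` and `k = 4` that is roughly 32, which agrees with the difference between the two rounded values reported above.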

### Step 2


Let `subset1` be the OLS model fit with `MedInc` and `AveRooms` and `subset2` be the OLS model fit with `MedInc` and `AveOccup`.

Again rounding to the nearest whole number, give the AIC and BIC for the first subset, followed by the same two information criteria for the second subset. These four values should be separated by commas.

In [6]:
# CodeGrade step2
X_const_subset1 = sm.add_constant(X[['MedInc', 'AveRooms']])
X_const_subset2 = sm.add_constant(X[['MedInc', 'AveOccup']])

model_subset1 = sm.OLS(y, X_const_subset1).fit()
model_subset2 = sm.OLS(y, X_const_subset2).fit()

np.round(model_subset1.aic, 0), np.round(model_subset1.bic, 0), np.round(model_subset2.aic, 0), np.round(model_subset2.bic, 0)

(51016.0, 51040.0, 51199.0, 51222.0)
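Lower AIC and BIC values indicate a better trade-off between fit and complexity, so the three candidate models can be ranked directly. The short sketch below does this with the rounded values copied from the Step 1 and Step 2 outputs; the dictionary labels are illustrative names, not objects defined in the notebook.

```python
# (AIC, BIC) pairs copied from the Step 1 and Step 2 outputs
criteria = {
    "full (MedInc, AveRooms, AveOccup)": (50962.0, 50994.0),
    "subset1 (MedInc, AveRooms)": (51016.0, 51040.0),
    "subset2 (MedInc, AveOccup)": (51199.0, 51222.0),
}

# The preferred model under each criterion is the one with the smallest value
best_by_aic = min(criteria, key=lambda name: criteria[name][0])
best_by_bic = min(criteria, key=lambda name: criteria[name][1])

print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
```

With these numbers both criteria prefer the full three-predictor model, and among the two-predictor subsets, `MedInc` with `AveRooms` scores better than `MedInc` with `AveOccup`.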

Run the code below without changes to fit the ridge model.

In [23]:
# CodeGrade step0

seed = 42

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

Ridge()


### Step 3

Define `ridge_pred` as the ridge model's predictions for `X_test` and `ridge_mse` as the mean squared error between `y_test` and `ridge_pred`.

Return `ridge_mse`, rounded to four decimal places.

In [24]:
# CodeGrade step3
ridge_pred = ridge.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)

np.round(ridge_mse, 4)

0.7007
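For reference, the mean squared error is simply the average of the squared differences between observed and predicted values. The sketch below computes it by hand with NumPy on a tiny made-up example; the function name `mse` and the demo arrays are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up values: the squared errors are 0.25, 0.0 and 1.0
y_true_demo = [3.0, 1.0, 2.0]
y_pred_demo = [2.5, 1.0, 3.0]

mse_demo = mse(y_true_demo, y_pred_demo)
print(round(mse_demo, 4))
```

Because the target here is the median house value in the dataset's own units, the MSE is expressed in those units squared; taking the square root gives an error on the original scale.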

### Step 4

Return `ridge.coef_`.

In [25]:
# CodeGrade step4

ridge.coef_

array([ 0.43687436, -0.04042693, -0.00382598])
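To see where coefficients like these come from, the following is a minimal sketch of the closed-form ridge solution, assuming the usual objective of squared error plus `alpha` times the squared L2 norm of the coefficients, with an unpenalized intercept handled by centering. The function `ridge_closed_form` and the synthetic data are hypothetical and only illustrate the shrinkage effect; this is not a description of the solver scikit-learn actually selects.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (Xc'Xc + alpha * I) w = Xc'yc on centered data; intercept is unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data (an assumption for the demo, not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 0.5 + X_demo @ np.array([1.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)

# The coefficient norm shrinks toward zero as the penalty grows
alphas = (0.0, 1.0, 100.0, 10000.0)
norms = [np.linalg.norm(ridge_closed_form(X_demo, y_demo, a)[0]) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha={a:>8}: ||coef|| = {nrm:.4f}")
```

Ridge shrinks coefficients smoothly but does not set them exactly to zero. The penalty also depends on the scale of each predictor, so coefficients fit on raw and on standardized predictors are not directly comparable.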

Run the code below without changes to fit the lasso model.

In [None]:
# CodeGrade step0

# Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

Lasso(alpha=0.1)


### Step 5


Define `lasso_pred` as the lasso model's predictions for `X_test` and `lasso_mse` as the mean squared error between `y_test` and `lasso_pred`.

Return `lasso_mse`, rounded to four decimal places.

In [15]:
# CodeGrade step5
lasso_pred = lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)

np.round(lasso_mse, 4)

0.7047
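Unlike ridge, the lasso penalty can drive coefficients exactly to zero. The sketch below illustrates why, using a bare-bones coordinate descent with soft-thresholding. It assumes the objective `(1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1` with an unpenalized intercept; the function names, the fixed iteration count, and the synthetic data are all illustrative assumptions, and a production solver would add a convergence check.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Hypothetical coordinate-descent lasso on centered data (fixed number of sweeps)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n             # per-feature scale ||x_j||^2 / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in
            r_j = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    intercept = y_mean - x_mean @ w
    return w, intercept

# Synthetic data: only the first feature matters (an assumption for the demo)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=500)

w_demo, b_demo = lasso_cd(X_demo, y_demo, alpha=0.5)
print(w_demo)
```

In this demo the two irrelevant features receive coefficients of exactly zero, while the relevant one is shrunk below its true value of 3. That is the variable-selection behavior that distinguishes lasso from ridge.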

Print the test MSE and the coefficients for both models.

In [16]:
# Print all model results
print(f"Ridge Test MSE: {ridge_mse:.4f}")
print(f"Lasso Test MSE: {lasso_mse:.4f}")


print("Ridge Coefficients:", ridge.coef_)
print("Lasso Coefficients:", lasso.coef_)

Ridge Test MSE: 0.6879
Lasso Test MSE: 0.7047
Ridge Coefficients: [ 0.42481977 -0.03254789 -0.06667814]
Lasso Coefficients: [ 0.39731277 -0.01225582 -0.00290792]
