## Module 6: Model Selection & Regularization Techniques for Regression

### Step 0

Load the required libraries and the data. Because CodeGrade cannot access the internet, the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) cannot be downloaded through scikit-learn directly. Instead, the script below builds scikit-learn's cached copy of the dataset from a local archive, so that `fetch_california_housing` returns the same data it would after a download.

In [None]:
# CodeGrade step0

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
import os
import tarfile
import joblib  # used to write the cached dataset file
from sklearn.datasets._base import _pkl_filepath, get_data_home
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report


archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
if not os.path.exists(data_home):
    os.makedirs(data_home)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6)  # write the cache file that fetch_california_housing reads

# Load dataset
california = fetch_california_housing(as_frame=True)
data = california.data
data['MedianHouseValue'] = california.target

# Define predictors and response variable
X = data[['MedInc', 'AveRooms', 'AveOccup']]  # Select predictors
y = data['MedianHouseValue']  # Response variable

Print basic information about the data using `.info()` and `.describe()`.
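As a minimal sketch of the two calls, the example below uses a small made-up DataFrame (`toy` and its columns are hypothetical and not part of the housing data). `.info()` prints column dtypes and non-null counts, while `.describe()` returns summary statistics as a new DataFrame.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two calls
toy = pd.DataFrame({"income": [1.0, 2.0, 3.0], "rooms": [4.0, 5.0, 6.0]})

# .info() prints dtypes and non-null counts (it returns None)
toy.info()

# .describe() returns count, mean, std, min, quartiles, and max per numeric column
summary = toy.describe()
print(summary)
```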

In [None]:
# Display dataset structure


### Step 1

Add a constant to `X`, calling it `X_const`.

Now let `full_model` be the fitted OLS model of `y` regressed on `X_const`.

Rounding to the nearest whole number, return the full model's AIC and BIC, separated by a comma.

In [None]:
# CodeGrade step1


### Step 2


Let `subset1` be the OLS model fit with `MedInc` and `AveRooms` and `subset2` be the OLS model fit with `MedInc` and `AveOccup`.

Again rounding to the nearest whole number, give the AIC and BIC for the first subset, followed by the same two information criteria for the second subset. Separate the four values with commas.
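For reference when comparing the models, the standard definitions of the two criteria are given below. The symbols are notation introduced here: $\hat{\ell}$ is the maximized log-likelihood, $k$ is the number of estimated coefficients (including the intercept, if one is fit), and $n$ is the number of observations.

```latex
\mathrm{AIC} = 2k - 2\hat{\ell}, \qquad \mathrm{BIC} = k \ln n - 2\hat{\ell}
```

Lower values indicate a better trade-off between fit and complexity. When two models use the same number of coefficients and the same observations, as two equally sized subsets do, the penalty terms are identical and both criteria rank the models by log-likelihood alone.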

In [None]:
# CodeGrade step2


Run the code below without changes to fit the ridge model.

In [None]:
# CodeGrade step0

seed = 42

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

### Step 3

Define `ridge_pred` as the ridge model's predictions for `X_test` and `ridge_mse` as the mean squared error between `y_test` and `ridge_pred`.

Return `ridge_mse`, rounded to four decimal places.
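The predict-then-score pattern can be sketched on synthetic data as follows. All `demo`-prefixed names and the generated data are hypothetical. The sketch also checks that `mean_squared_error` agrees with averaging the squared residuals by hand.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical; stands in for the housing predictors)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)

# Predict on the held-out rows, then average the squared residuals
demo_pred = demo_ridge.predict(X_te)
demo_mse = mean_squared_error(y_te, demo_pred)

# The library result matches the manual computation
manual_mse = np.mean((y_te - demo_pred) ** 2)
print(round(demo_mse, 4), round(manual_mse, 4))
```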

In [None]:
# CodeGrade step3


### Step 4

Return `ridge.coef_`.
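In scikit-learn, `coef_` holds one coefficient per predictor column, in column order, and excludes the intercept (stored separately in `intercept_`). A rough sketch on synthetic data of how the ridge penalty shrinks these coefficients as `alpha` grows (all names and data below are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (hypothetical; not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Unpenalized least squares as the baseline
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# Larger alpha means a stronger L2 penalty and smaller coefficients
alphas = (1.0, 100.0, 10000.0)
ridge_norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    ridge_norms.append(np.linalg.norm(coef))
    print(alpha, np.round(coef, 3))
```

Because the penalty acts on the raw coefficients, the amount of shrinkage each predictor receives depends on its scale.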

In [None]:
# CodeGrade step4



Run the code below without changes to fit the lasso model.

In [None]:
# CodeGrade step0

# Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

### Step 5


Define `lasso_pred` as the lasso model's predictions for `X_test` and `lasso_mse` as the mean squared error between `y_test` and `lasso_pred`.

Return `lasso_mse`, rounded to four decimal places.
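Unlike ridge, the lasso's L1 penalty can set coefficients exactly to zero. A minimal sketch on synthetic data, where only the first of three hypothetical predictors affects the response (the `demo` names, the data, and the choice `alpha=0.5` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: only the first column drives the response
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_lasso = Lasso(alpha=0.5).fit(X_tr, y_tr)

# Score on the held-out rows, as with ridge
demo_lasso_pred = demo_lasso.predict(X_te)
demo_lasso_mse = mean_squared_error(y_te, demo_lasso_pred)

# The irrelevant predictors are dropped; the relevant one is shrunk toward zero
print(np.round(demo_lasso.coef_, 3), round(demo_lasso_mse, 4))
```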

In [None]:
# CodeGrade step5


Print all of the results.

In [None]:
# Print all model results
