## Module 4: General Linear Regression: Multiple Linear Regression and Other Regression Models

### Step 0

Load the required libraries and bring in the data. Because CodeGrade cannot access the internet, we cannot download the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) directly through scikit-learn. Instead, the script below reads a local copy of the archive and caches it in the location and format that scikit-learn expects, so that `fetch_california_housing` finds it without a download.

In [None]:
# CodeGrade step0

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
import os
import tarfile
import joblib  # used to cache the dataset where scikit-learn looks for it
from sklearn.datasets._base import _pkl_filepath, get_data_home
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
if not os.path.exists(data_home):
    os.makedirs(data_home)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6)  # write the cache file read by fetch_california_housing

# Load the dataset
california = fetch_california_housing(as_frame=True)
data = california.data
data['MedianHouseValue'] = california.target

Look at the data using `.info()` and `.describe()`.

In [None]:
# Display basic information
print(data.info())
print(data.describe())

### Step 1

Let the `X` variable be `MedInc`, `AveRooms`, and `HouseAge` and `y` be `MedianHouseValue`.

Then add the constant for the intercept.

Next, create the baseline model, called `baseline_model`, using `smf.ols` with the above variables, and fit the model.

To verify the model, return the $r^2$ value rounded to four decimal places.

In [None]:
# CodeGrade step1


Now print the model summary.

In [None]:
# Print model summary


### Step 2


Add a quadratic term to the data called `MedInc_squared`, defined as the square of `MedInc`.

Now fit the model using `smf.ols` with the quadratic term, calling this model `nonlinear_model`. Make sure to include the variables `MedInc`, `AveRooms`, and `HouseAge` as well.

To verify the model, return the $r^2$ value rounded to six decimal places.
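A minimal sketch of adding a squared term, using synthetic data with hypothetical column names `x`, `x_squared`, and `y`. The squared column is created on the DataFrame and then named in the formula alongside the original variable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a curved relationship (hypothetical, for illustration only)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=300)})
df["y"] = 3.0 + 1.5 * df["x"] - 0.2 * df["x"] ** 2 + rng.normal(scale=1.0, size=300)

# Quadratic term stored as its own column
df["x_squared"] = df["x"] ** 2

linear_fit = smf.ols("y ~ x", data=df).fit()
quadratic_fit = smf.ols("y ~ x + x_squared", data=df).fit()

# R-squared of each fit, rounded to six decimal places
print(round(linear_fit.rsquared, 6), round(quadratic_fit.rsquared, 6))
```

Because the linear model is nested in the quadratic one, the quadratic fit's $r^2$ cannot be lower. An alternative that avoids the extra column is to write `I(x**2)` directly in the formula.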



In [None]:
# CodeGrade step2


Now print the model summary.

In [None]:
# Print the summary


### Step 3

Again include the quadratic term `MedInc_squared`, and now also add an interaction term between `MedInc` and `AveRooms`.

Now fit the model using `smf.ols` with the quadratic term and the interaction term, calling this model `interaction_model`. Make sure to include the variables `MedInc`, `AveRooms`, and `HouseAge` as well.

To verify the model, return the $r^2$ value rounded to six decimal places.
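An interaction term can be supplied either as a product column or with the `:` operator in the formula. The sketch below, on synthetic data with hypothetical names `x1`, `x2`, `x1_x2`, and `y`, fits both versions; they should describe the same model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data whose response depends on the product x1 * x2 (hypothetical)
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(300, 2)), columns=["x1", "x2"])
df["y"] = (1.0 + 0.5 * df["x1"] + 0.3 * df["x2"]
           + 0.8 * df["x1"] * df["x2"]
           + rng.normal(scale=0.5, size=300))

# Option 1: build the product column, then name it in the formula
df["x1_x2"] = df["x1"] * df["x2"]
column_fit = smf.ols("y ~ x1 + x2 + x1_x2", data=df).fit()

# Option 2: let the formula build the interaction with ':'
formula_fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()

print(round(column_fit.rsquared, 6), round(formula_fit.rsquared, 6))
```

Which form to use is a matter of convention; if an autograder checks for a specific column or coefficient name, the explicit column is the safer choice.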

In [None]:
# CodeGrade step3


Now print the model summary.

In [None]:
# Print the summary


### Step 4

Again include the same quadratic term as in the previous two steps.

Create an indicator variable as follows:
1.   Find the median of `MedInc` and call it `median_income_threshold`.
2.   Add a new variable to the data set called `HighIncome` that is 1 where `MedInc` is strictly greater than the median and 0 otherwise.

Now fit the model using `smf.ols` with the quadratic term and the indicator variable, calling this model `indicator_model`. Make sure to include the variables `MedInc`, `AveRooms`, and `HouseAge` as well.

To verify the model, return the $r^2$ value rounded to six decimal places.
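The thresholding logic can be sketched on a tiny hand-made column. The names `income`, `threshold`, and `high` are hypothetical. The comparison yields booleans, and `.astype(int)` turns them into 0/1.

```python
import pandas as pd

# Five toy values; their median is 3.0
df = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0, 5.0]})

threshold = df["income"].median()

# Strictly greater than the median -> 1, otherwise 0
df["high"] = (df["income"] > threshold).astype(int)

print(df["high"].tolist())  # → [0, 0, 0, 1, 1]
```

Note that the row equal to the median receives a 0, because the comparison is strict.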

In [None]:
# CodeGrade step4


Now print the model summary.

In [None]:
# Print the summary


### Step 5

Again include the same quadratic term, and now also add a log-transformed version of `AveRooms` called `log_AveRooms`.

Now fit the model using `smf.ols` with the quadratic term and the log term, calling this model `log_model`. Make sure to include the variables `MedInc`, `AveRooms`, and `HouseAge` as well.

To verify the model, return the $r^2$ value rounded to six decimal places.
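A minimal sketch of a log-transformed predictor, on synthetic data with hypothetical names `x`, `log_x`, and `y`. It assumes the natural logarithm (`np.log`) is intended and that the variable is strictly positive, since the log of zero or a negative number is not finite.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where y depends on log(x); x is strictly positive (hypothetical)
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(1, 20, size=300)})
df["y"] = 2.0 + 4.0 * np.log(df["x"]) + rng.normal(scale=0.5, size=300)

# Natural-log transform stored as its own column
df["log_x"] = np.log(df["x"])

# Keep the original variable in the model alongside its log
fit = smf.ols("y ~ x + log_x", data=df).fit()
print(fit.params)
```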

In [None]:
# CodeGrade step5


Now print the model summary.

In [None]:
# Print the summary


### Step 6

Return the shape of `log_model`'s residuals.
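For orientation, a fitted statsmodels OLS result exposes its residuals through the `resid` attribute, with one entry per observation used in the fit. The sketch below uses 50 rows of synthetic data with hypothetical columns `x` and `y`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 50 synthetic observations (hypothetical data)
rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(scale=0.3, size=50)

fit = smf.ols("y ~ x", data=df).fit()

# One residual per row of the data
print(fit.resid.shape)  # → (50,)
```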

In [None]:
# CodeGrade step6


Now, for the log model, plot the residuals vs. the fitted values and the Q-Q plot.

In [None]:
# Residuals vs. Fitted Plot
