## Module 5: General Linear Regression and Statistical Inference

### Step 0

Load the required libraries and bring in the data. Because CodeGrade cannot access the internet, the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) cannot be downloaded through scikit-learn. Instead, the script below builds scikit-learn's local cache file from the `cal_housing.tgz` archive so that `fetch_california_housing` can load the data offline.

In [None]:
# CodeGrade step0

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
import os
import tarfile
import joblib # Import joblib directly
from sklearn.datasets._base import _pkl_filepath, get_data_home
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
if not os.path.exists(data_home):
    os.makedirs(data_home)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6) # Now using the directly imported joblib


# Load the dataset
california = fetch_california_housing(as_frame=True)
data = california.data
data['MedianHouseValue'] = california.target

Print the basic information of the data using `.info()` and `.describe()`.
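
As a minimal sketch of what these two methods do, the example below runs them on a tiny made-up DataFrame (the name `demo` and its values are illustrative assumptions, not part of the housing data). Note that `.info()` prints its report and returns `None`, while `.describe()` returns a DataFrame.

```python
import pandas as pd

# Tiny made-up table, for illustration only
demo = pd.DataFrame({"rooms": [3.0, 5.5, 4.2, 6.1], "price": [1.2, 2.8, 2.0, 3.5]})

demo.info()                # prints dtypes, non-null counts, and memory usage
summary = demo.describe()  # count, mean, std, min, quartiles, and max per numeric column
print(summary)
```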

In [None]:
# Display basic information


### Step 1

Let `X` contain the predictors `MedInc`, `AveRooms`, and `AveOccup`, plus a constant column for the intercept. Let `y` be `MedianHouseValue`.

Now fit the regression model, calling it `mlr_model`.

Finally, return the $r^2$ value of the model, rounded to four decimal places.

In [None]:
# CodeGrade step1


Print the model summary.

In [None]:
# Print the model summary


### Step 2

Let `p_values` be the model's p-values.

Return the four p-values using `.iloc[]`, from the first value to the fourth, in order and separated by commas. Make sure to round each to 5 decimal places.

In [None]:
# CodeGrade step2


### Step 3

Identify the significant predictors (p-value strictly less than $\alpha=0.05$), calling the result `significant_predictors`.

Return the shape of `significant_predictors`.
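
The filtering step can be illustrated with a boolean mask on a pandas Series. The p-values below are made up purely to show how the strict inequality behaves at the boundary, and the names `demo_p` and `demo_significant` are hypothetical.

```python
import pandas as pd

# Hypothetical p-values, made up for illustration only
demo_p = pd.Series({"const": 0.0001, "x1": 0.03, "x2": 0.05, "x3": 0.41})

alpha = 0.05
# Strict inequality: a p-value exactly equal to alpha is not selected
demo_significant = demo_p[demo_p < alpha]

print(demo_significant.shape)  # (2,)
```

The shape of a Series is a one-element tuple, so the result counts how many terms passed the threshold.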

In [None]:
# CodeGrade step3


### Step 4

Find the confidence intervals of the model's coefficients (at a 95% level of confidence) and call this `conf_intervals`.

Using `.iloc[row, column]`, return the following four interval bounds, each rounded to 2 decimal places and separated by commas, in this order:

> first row and first column, first row and second column, second row and first column, second row and second column





In [None]:
# CodeGrade step4


To see the intervals displayed as a readable table, return `conf_intervals`.

In [None]:
# Pretty CIs


### Step 5

Add a quadratic term to the model: add a new column to the data, `MedInc_squared`, which is the square of `MedInc`, and call the new model `quad_model`.

Return the $r^2$ of the quadratic model, rounded to four decimal places.

In [None]:
# CodeGrade step5


Now print the model summary.

In [None]:
# Print the model summary


### Step 6

Find the adjusted $r^2$ for both models and call them `adjusted_r2_base` and `adjusted_r2_quad`, respectively.

Return these two adjusted $r^2$ values, rounded to four decimal places and separated by a comma.

In [None]:
# CodeGrade step6


Print both adjusted $r^2$ values.