<a href="https://colab.research.google.com/github/mutukuk/dsc-multiple-regression-intro/blob/master/linear_reg_checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression Checkpoint

This checkpoint is designed to test your understanding of linear regression.

Specifically, this will cover:

* Creating simple and multiple linear regression models with StatsModels
* Interpreting linear regression model metrics
* Interpreting linear regression model parameters

## Your Task: Build Linear Regression Models to Predict Home Prices

### Data Understanding

You will be using the Ames Housing dataset, modeling the `SalePrice` using these numeric features:

* `GrLivArea`: Above grade living area (square feet)
* `GarageArea`: Size of garage (square feet)
* `LotArea`: Lot size (square feet)
* `LotFrontage`: Length of street connected to property (feet)

In [1]:
# Run this cell without changes

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("ames.csv", index_col=0)
df = df[["SalePrice", "GrLivArea", "GarageArea", "LotArea", "LotFrontage"]].copy()
df.dropna(inplace=True)
df.head()

Unnamed: 0_level_0,SalePrice,GrLivArea,GarageArea,LotArea,LotFrontage
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,208500,1710,548,8450,65.0
2,181500,1262,460,9600,80.0
3,223500,1786,608,11250,68.0
4,140000,1717,642,9550,60.0
5,250000,2198,836,14260,84.0


### Modeling

You will apply an inferential modeling process using StatsModels. This means that you are trying to create the best model in terms of variance in `SalePrice` that is explained (i.e. R-Squared).

You will build **two models — one simple linear regression model and one multiple linear regresssion model** — then you will interpret the model summaries.

There are two relevant components of interpreting the model summaries: model **metrics** such as r-squared and p-values, which tell you how well your model is fit to the data, and model **parameters** (intercept and coefficients), which tell you how the model is using the feature(s) to predict the target.

### Requirements

## 1. Build a Simple Linear Regression Using StatsModels

Below, we use the `.corr()` method to find which features are most correlated with `SalePrice`:

In [2]:
# Run this cell without changes
df.corr()["SalePrice"]

Unnamed: 0,SalePrice
SalePrice,1.0
GrLivArea,0.703557
GarageArea,0.631761
LotArea,0.311416
LotFrontage,0.351799


The `GrLivArea` feature has the highest correlation with `SalePrice`, so we will use it to build a simple linear regression model.

Use the OLS model ([documentation here](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html)) with:

- `SalePrice` as the endogenous (dependent) variable
- `GrLivArea` as the exogenous (independent) variable

Don't forget to use `sm.add_constant` to ensure that there is an intercept term.

Fill in the appropriate values in the cell below.

In [3]:
# CodeGrade step1

import statsmodels.api as sm

# Replace None with appropriate code
exog = sm.add_constant(df["GrLivArea"])
endog = df["SalePrice"]
simple_model = sm.OLS(endog, exog)

simple_model_results = simple_model.fit()
print(simple_model_results.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.495
Model:                            OLS   Adj. R-squared:                  0.495
Method:                 Least Squares   F-statistic:                     1175.
Date:                Wed, 26 Feb 2025   Prob (F-statistic):          4.39e-180
Time:                        16:17:15   Log-Likelihood:                -14902.
No. Observations:                1201   AIC:                         2.981e+04
Df Residuals:                    1199   BIC:                         2.982e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.347e+04   5171.332      2.605      0.0

In [4]:
# simple_model should be an OLS model
assert type(simple_model) == sm.OLS

# simple_model should have 1 feature (other than the constant)
assert simple_model.df_model == 1

## 2. Interpret Simple Linear Regression Model Metrics

We want to know:

1. How much of the variance is explained by this model? This is also known as the R-Squared. Fill in `r_squared` with this value — a floating point number between 0 and 1.
2. Is the model statistically significant at $\alpha = 0.05$? This is determined by comparing the probability of the f-statistic to the alpha. Fill in `model_is_significant` with this value — either `True` or `False`.

You can either just look at the print-out above and fill in the values, or you can use attributes of `simple_model_results` ([documentation here](https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.html)). If you are getting stuck, it's usually easier to type the answer in rather than writing code to do it.

In [6]:
# CodeGrade step2
# Replace None with appropriate code
r_squared = simple_model_results.rsquared
model_is_significant = simple_model_results.f_pvalue < 0.05

In [7]:
import numpy as np

# r_squared should be a floating point value between 0 and 1
assert 0 <= r_squared and r_squared <= 1
assert type(r_squared) == float or type(r_squared) == np.float64

# model_is_significant should be True or False
assert model_is_significant == True or model_is_significant == False

## 3. Interpret Simple Linear Regression Parameters

Now, we want to know what relationship the model has found between the feature and the target. Because this is a simple linear regression, it follows the format of $y = mx + b$ where $y$ is the `SalePrice`, $m$ is the slope of `GrLivArea`, $x$ is `GrLivArea`, and $b$ is the y-intercept (the value of $y$ when $x$ is 0).

In the cell below, fill in appropriate values for `m` and `b`. Again, you can use the print-out above or use attributes of `simple_model_results`.

In [9]:
# CodeGrade step3
# Replace None with appropriate code

# Slope (coefficient of GrLivArea)
m = simple_model_results.params["GrLivArea"]

# Intercept (coefficient of const)
b = simple_model_results.params["const"]

print(f"""
Our simple linear regression model found a y-intercept
of ${round(b, 2):,}, then for every increase of 1 square foot
above-ground living area, the price increases by ${round(m, 2)}
""")


Our simple linear regression model found a y-intercept
of $13,470.44, then for every increase of 1 square foot
above-ground living area, the price increases by $110.71 



In [10]:
from numbers import Number

# m should be a number
assert isinstance(m, Number)

# b should be a number
assert isinstance(b, Number)

## 4. Build a Multiple Regression Model Using StatsModels

Now, build an OLS model that contains all of the columns present in `df`.

Specifically, your model should have `SalePrice` as the target, and these columns as features:

* `GrLivArea`
* `GarageArea`
* `LotArea`
* `LotFrontage`

Remember to also account for the intercept as you did above!

In [15]:
# CodeGrade step4
# Replace None with appropriate code

exog = df[["GrLivArea", "GarageArea", "LotArea", "LotFrontage"]]
endog = df["SalePrice"]
multiple_model = sm.OLS(endog, exog)

multiple_model_results = multiple_model.fit()
print(multiple_model_results.summary())

                                 OLS Regression Results                                
Dep. Variable:              SalePrice   R-squared (uncentered):                   0.932
Model:                            OLS   Adj. R-squared (uncentered):              0.932
Method:                 Least Squares   F-statistic:                              4111.
Date:                Wed, 26 Feb 2025   Prob (F-statistic):                        0.00
Time:                        16:51:24   Log-Likelihood:                         -14742.
No. Observations:                1201   AIC:                                  2.949e+04
Df Residuals:                    1197   BIC:                                  2.951e+04
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

In [16]:
# multiple_model should be an OLS model
assert type(multiple_model) == sm.OLS

# multiple_model should have 4 features (other than the constant)
assert multiple_model.df_model == 4

## 5. Interpret Multiple Regression Model Results

Now we want to know: **is our multiple linear regression model a better fit than our simple linear regression model? We'll measure this in terms of percentage of variance explained (r-squared)**, where a higher r-squared indicates a better fit.

Replace `second_model_is_better` with either `True` if this model is better, or `False` if the previous model was better (or the two models are exactly the same).

In [18]:
# CodeGrade step5.1
# Replace None with appropriate code
second_model_is_better = True

In [19]:
# second_model_is_better should be True or False
assert second_model_is_better == True or second_model_is_better == False

One of the feature coefficients is not statistically significant. Which one is it?

Replace `not_significant` with the string name of the feature, which should be one of these four:

* `GrLivArea`
* `GarageArea`
* `LotArea`
* `LotFrontage`

In [20]:
# CodeGrade step5.2
# Replace None with appropriate string name for the feature
not_significant = "LotFrontage"

In [21]:
# not_significant should be a string
assert type(not_significant) == str

# It should be one of the features in df
assert not_significant in df.columns