---
title: "Housing Regression"
navtitle: "Housing Regression (NB)"
subtitle: "Introduction to Regression"
description: "Overview of conducting regression analyses for ML."
format:
  html:
    page-layout: full
    title-block-banner: true
---

# Housing Regression

This notebook uses the California housing dataset from scikit-learn, which has 20,640 rows and 9 columns (8 features plus the target).

Notebook goals:

- Complete regression analysis with train-test split and two models for comparison
- Data investigation of summary statistics and visualizations
- Metric evaluation and performance visualization

##### Imports

In [1]:
#| include: false
# Global plot settings (hidden from learners)

import plotly.io as pio
from mypyutils import load_json

custom_template = load_json("../tools/json/plotly_template.json")

pio.renderers.default = "png"  # static images, faster rendering
pio.templates["custom_clean"] = custom_template
pio.templates.default = "custom_clean"

Loading json from: ../tools/json/plotly_template.json


In [2]:
# import os
# import numpy as np
import pandas as pd
import seaborn as sns  # used for the pair plot during evaluation
import matplotlib.pyplot as plt  # used for the prediction scatter plots
from skimpy import skim

In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error,
    explained_variance_score
)

## Data

### Get the Data

- Import the data from sklearn
- Transfer data into pandas DataFrame
- Basic data overview

Note: The data is returned as a `Bunch` object, which behaves like a dictionary. We'll convert it to a pandas DataFrame.

In [4]:
# Load data
data = fetch_california_housing()

In [5]:
type(data)

sklearn.utils._bunch.Bunch

In [6]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [7]:
# Convert data to pandas dataframe
df = pd.DataFrame(data=data['data'], columns=data['feature_names'])

In [8]:
df.head()

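|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |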


In [9]:
df.shape

(20640, 8)

#### Add target to df

In [10]:
df["target"] = data['target']

In [11]:
df.head()

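|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |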


In [12]:
# NOTE: 1 additional column
df.shape

(20640, 9)

#### df Overview

In [13]:
# Built-in pandas function
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [14]:
df.describe()

|       | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|-------|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


In [15]:
skim(df)

## Train-Test Split

Splitting data before EDA can be helpful to avoid data leakage or incorrect assumptions about what the data shows.

EDA and training will use only the train data. Test data will be used for evaluation only.
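To make the leakage concern concrete, here is a minimal sketch in plain Python with made-up numbers (not the housing data). It assumes a hypothetical preprocessing step, standardization, which this notebook does not itself perform: the mean and standard deviation are learned from the training values only and then reused, unchanged, on the test values.

```python
from statistics import mean, pstdev

# Hypothetical toy values standing in for one feature column
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 11.0]

# Learn the scaling statistics from the training data only
train_mean = mean(train_values)
train_std = pstdev(train_values)


def standardize(values, center, scale):
    """Apply previously learned statistics to any list of values."""
    return [(v - center) / scale for v in values]


# Both splits are transformed with the train statistics;
# the test values never influence center or scale
train_scaled = standardize(train_values, train_mean, train_std)
test_scaled = standardize(test_values, train_mean, train_std)

print(train_mean, train_std)
print(test_scaled)
```

Had the statistics been computed on train and test together, information about the test rows would have shaped the inputs the model is trained on, which is the kind of leakage an early split helps prevent.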

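Note: `train_test_split` is commonly called with X (features) and y (target) already separated, as in the example below. Here, we split the full DataFrame into training and test sets first, so that EDA can use the training rows only, and separate X and y just before model training.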

```X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)```

In [16]:
train_df, test_df = train_test_split(df, test_size=0.33, random_state=42)

## EDA

In [17]:
print(f'Shape of original data: {df.shape}')
print(f'Shape of training data: {train_df.shape}')
print(f'Shape of test data: {test_df.shape}')

Shape of original data: (20640, 9)
Shape of training data: (13828, 9)
Shape of test data: (6812, 9)


In [18]:
print(f'Percent of data in training: {len(train_df)/len(df):.0%}')
print(f'Percent of data in test: {len(test_df)/len(df):.0%}')

Percent of data in training: 67%
Percent of data in test: 33%
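The row counts above can be reproduced by hand. The sketch below assumes that the test count is the requested fraction rounded up and that the training set receives the remaining rows; that rounding rule is my understanding of how `train_test_split` behaves rather than something this notebook states, though it does match the shapes printed above.

```python
import math

n_samples = 20640  # rows in the full DataFrame
test_size = 0.33   # fraction passed to train_test_split

# Assumption: test rows are rounded up, train rows are the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(f"{n_train / n_samples:.0%} train, {n_test / n_samples:.0%} test")
```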


### Focus on Train for EDA

It is best not to peek at the test set during EDA.

In [19]:
skim(train_df)

In [20]:
# Built-in pandas function, returns df (training rows only)
df_desc = train_df.describe()
df_desc


## Train Models Notebook

In [21]:
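# Separate features (X) from the target (y) for both splits
X_train, y_train = train_df.drop(columns="target"), train_df["target"]
X_test, y_test = test_df.drop(columns="target"), test_df["target"]
# Instantiate and fit the linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)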

In [None]:
# Instantiate and fit the random forest model (fixed seed for reproducible results)
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

In [None]:
# Predict on train with both models
# NOTE: test metrics are more insightful
lr_train_preds = lr.predict(X_train)
rf_train_preds = rf.predict(X_train)

In [None]:
# Calculate mean squared error for both models
lr_mse = mean_squared_error(y_train, lr_train_preds)
rf_mse = mean_squared_error(y_train, rf_train_preds)

In [None]:
# Print training MSE for both models
print(f"The training MSE for the linear regression model is: {lr_mse:.2f}")
print(f"The training MSE for the random forest regression model is: {rf_mse:.2f}")
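For reference, the quantity that `mean_squared_error` reports is the average of the squared differences between true and predicted values. The sketch below recomputes it in plain Python on made-up numbers; the helper name `mse` and the toy values are illustrative and not part of the notebook.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy values
y_true_toy = [3.0, 2.0, 4.0, 1.0]
y_pred_toy = [2.5, 2.0, 5.0, 1.5]

toy_mse = mse(y_true_toy, y_pred_toy)
print(toy_mse)         # 0.375
print(toy_mse ** 0.5)  # root of the MSE, in the same units as the target
```

Because the errors are squared, MSE is expressed in squared target units and penalizes large misses more heavily than small ones.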

In [None]:
# Plot both predictions
plt.figure(figsize=(10,10))
plt.scatter(y_train, lr_train_preds, c='crimson', label='Linear Regression')
plt.scatter(y_train, rf_train_preds, c='gold', label='RF Regression')

plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.title('Training Error', fontsize=15)

plt.legend()
plt.tight_layout()
plt.show()

## Evaluate Models Notebook

In [None]:
# linear regression predict
lr_preds = lr.predict(X_test)
lr_preds

In [None]:
# random forest regression predict
rf_preds = rf.predict(X_test)
rf_preds

In [None]:
# Calculate explained variance for both models
lr_evs = explained_variance_score(y_test, lr_preds)
rf_evs = explained_variance_score(y_test, rf_preds)

In [None]:
# Display explained variance scores
print(f'The explained variance score for the linear regression model is: {lr_evs:.2f}')
print(f'The explained variance score for the random forest regression model is: {rf_evs:.2f}')
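Explained variance is usually defined as one minus the variance of the residuals divided by the variance of the true values. The plain-Python sketch below follows that definition on made-up numbers (the helper name `explained_variance` is illustrative) and shows one consequence: predictions that are off by a constant amount still score perfectly, because a constant offset adds no variance to the residuals.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true), using population variance."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)


# Hypothetical toy values: every prediction is too high by exactly 0.5
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_biased = [1.5, 2.5, 3.5, 4.5]

evs_biased = explained_variance(y_true_toy, y_biased)
mse_biased = sum((t - p) ** 2 for t, p in zip(y_true_toy, y_biased)) / len(y_true_toy)

print(evs_biased)  # 1.0, the constant bias is invisible to this metric
print(mse_biased)  # 0.25, MSE still registers the bias
```

This is one reason to read explained variance alongside an error metric such as MSE rather than on its own.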

In [None]:
# Calculate mean squared error (MSE)
lr_mse = mean_squared_error(y_test, lr_preds)
rf_mse = mean_squared_error(y_test, rf_preds)

In [None]:
# Display test MSE
print(f"The test MSE for the linear regression model is: {lr_mse:.2f}")
print(f"The test MSE for the random forest regression model is: {rf_mse:.2f}")

In [None]:
# create y_df with real and predicted values
y_df=pd.DataFrame({'y_true': y_test, 'lr_preds': lr_preds, 'rf_preds': rf_preds})

In [None]:
# Check df
y_df.head()

In [None]:
# Get correlation across real, lr, and rf values
y_df.corr()
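By default, `DataFrame.corr()` reports Pearson correlation. As a small illustration of what that number does and does not say, the sketch below computes Pearson's r in plain Python on made-up values; the helper name `pearson` is illustrative. The toy predictions are exactly twice the true values, so the correlation is essentially perfect even though every prediction is wrong.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical toy values: predictions are double the true values
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_doubled = [2.0, 4.0, 6.0, 8.0]

r = pearson(y_true_toy, y_doubled)
print(round(r, 6))
```

A high correlation between `y_true` and a model's predictions therefore indicates a strong linear relationship, not necessarily small errors, so it complements the MSE rather than replacing it.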

In [None]:
# Seaborn pair plot on y data
sns.pairplot(y_df)

In [None]:
# Plot results
plt.figure(figsize=(10,10))
plt.scatter(y_test, lr_preds, c='crimson', label='Linear Regression')
plt.scatter(y_test, rf_preds, c='gold', label='RF Regression')

plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.title('Test Error', fontsize=15)

plt.legend()
plt.tight_layout()
plt.show()