<a target="_blank" href="https://colab.research.google.com/github/lm2612/Tutorials/blob/main/1_supervised_learning_regression/1-LinearRegression_HousePrice.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# House price prediction

In this exercise, we are going to apply a regression modelling technique to house price prediction using a subset of the [California house price dataset](https://www.kaggle.com/camnugent/california-housing-prices). Our dataset contains 200 observations for housing blocks in California obtained from the 1990 census. The dataset contains columns:

1. `longitude`: A measure of how far west a house is; values are negative in California, so a more negative value is farther west

2. `latitude`: A measure of how far north a house is; a higher value is farther north

3. `housing_median_age`: Median age of a house within a block; a lower number is a newer building

4. `total_rooms`: Total number of rooms within a block

5. `total_bedrooms`: Total number of bedrooms within a block

6. `population`: Total number of people residing within a block

7. `households`: Total number of households, a group of people residing within a home unit, for a block

8. `median_income`: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

9. `median_house_value`: Median house value for households within a block (measured in US Dollars)

10. `ocean_proximity`: Location of the house relative to the ocean

In this example, we are going to create a regression model to predict `median_house_value` using only `median_income`.

Load the 200-row subset from the file `housing_short.csv`.

In [2]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    filepath = "https://raw.githubusercontent.com/lm2612/Tutorials/refs/heads/main/1_supervised_learning_regression/housing_short.csv"
    print(f"Notebook running in google colab. Using raw github filepath = {filepath}")

else:
    filepath = "./housing_short.csv"
    print(f"Notebook running locally. Using local filepath = {filepath}")


Notebook running locally. Using local filepath = ./housing_short.csv


In [4]:
df = pd.read_csv(filepath)
df.head()

| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |


Our goal is to predict `median_house_value`. This will be our dependent variable, $y$. Pick another variable that you think will be a useful predictor of house value, which we will use as our independent variable, $x$. First, we should check whether these variables appear correlated by plotting them.

Does your choice of variable seem suitable for linear regression? 
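
Alongside the scatter plot, a Pearson correlation coefficient gives a quick numerical check of how linear the relationship looks. The sketch below is only illustrative: it uses synthetic data rather than the housing file, and the names `income` and `house_value` and the numbers used to generate them are assumptions.

```python
import numpy as np

# Synthetic stand-in for the housing data (an assumption, not the real file):
# income in tens of thousands of USD, house value in USD.
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=200)
house_value = 40_000 * income + 50_000 + rng.normal(0, 30_000, size=200)

# Pearson correlation coefficient between the predictor and the target.
r = np.corrcoef(income, house_value)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A value of `r` close to 1 or -1 suggests a straight line may be a reasonable first model; a value near 0 suggests it probably is not, although a plot is still needed to spot curvature or outliers.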

## Linear regression
Split the dataset into a suitable training, validation and test set. 

In [None]:
training = # Use df.iloc[ ... , :] where "..." is your choice of indices 
validation = 
testing = 
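
One possible split, sketched on a hypothetical 200-row frame so that it runs on its own. The 60/20/20 proportions and the names `df_demo`, `training_demo`, `validation_demo` and `testing_demo` are assumptions, not the tutorial's required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical 200-row frame standing in for `df`.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "median_income": rng.uniform(1.0, 10.0, size=200),
    "median_house_value": rng.uniform(50_000, 500_000, size=200),
})

# Assumed 60/20/20 split by row position; other proportions are equally valid.
training_demo = df_demo.iloc[:120, :]
validation_demo = df_demo.iloc[120:160, :]
testing_demo = df_demo.iloc[160:, :]

print(len(training_demo), len(validation_demo), len(testing_demo))
```

A positional split keeps the original row order, so if the rows are ordered in some way the three subsets may not look alike. Shuffling first, for example with `df.sample(frac=1, random_state=0)`, is a common alternative.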

Create a linear regression model to predict median house value from median income using the training set.

### Scikit-learn solution
You can use `sklearn.linear_model.LinearRegression()`. See https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html.

In [None]:
# Select X to be your choice of predictor
X = training[" "]
y = training["median_house_value"]

X = X.to_numpy().reshape(-1, 1)      # Reshape the inputs to the (N, 1) array that scikit-learn expects
y = y.to_numpy().reshape(-1, 1)


# Create your linear regression model and print the coefficients and intercept.



What does your model predict is the average increase in median house value associated with a \$10,000 increase in median income?
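
Because `median_income` is recorded in tens of thousands of US dollars, a one-unit step in the predictor corresponds to 10,000 USD, so the fitted slope can be read directly as the predicted change in house value. Below is a minimal sketch on synthetic data; the generating slope of 40,000 and the names `X_demo` and `y_demo` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (assumed values, not the housing file).
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=120)
value = 40_000 * income + 50_000 + rng.normal(0, 20_000, size=120)

# Same (N, 1) shapes as in the exercise cell above.
X_demo = income.reshape(-1, 1)
y_demo = value.reshape(-1, 1)

model = LinearRegression().fit(X_demo, y_demo)

# With a 2-D target, coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = model.coef_[0, 0]
intercept = model.intercept_[0]
print(f"slope = {slope:.0f} USD per unit of income, intercept = {intercept:.0f} USD")
```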

Overlay the linear regression model on top of the training data. What does this suggest about the appropriateness of the model?

## Check validation dataset
Predict the `median_house_value` on the validation dataset. How would you check the performance?
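
Two common summary scores are the coefficient of determination ($R^2$) and the mean absolute error, both available in `sklearn.metrics`. The sketch below uses synthetic data and an assumed 120/40 split, so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data with an assumed linear relationship plus noise.
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=160)
value = 40_000 * income + 50_000 + rng.normal(0, 20_000, size=160)
X_all = income.reshape(-1, 1)

# Assumed split: first 120 rows for training, last 40 for validation.
X_train, y_train = X_all[:120], value[:120]
X_val, y_val = X_all[120:], value[120:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)  # predictions on data the model has not seen

r2 = r2_score(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
print(f"validation R^2 = {r2:.3f}, MAE = {mae:.0f} USD")
```

Comparing these scores with the same scores on the training set is a simple way to see whether the model generalises.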


## Improving the model

Now fit a linear regression model to the training set of the form:

\begin{equation}
P_i = a + b_1 I_i + b_2 I_i^2
\end{equation}

where $P_i$ is the median house price in block $i$ and $I_i$ is the median household income of the same block.
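
The quadratic model is still linear in its coefficients, so it can be fitted with ordinary linear regression on a design matrix whose columns are $I_i$ and $I_i^2$. A minimal sketch on synthetic data follows; the generating coefficients and the names `X_quad` and `quad_model` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from an assumed quadratic: a=50000, b1=60000, b2=-2000.
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=120)
price = 50_000 + 60_000 * income - 2_000 * income**2 + rng.normal(0, 10_000, size=120)

# Design matrix with columns [I, I^2]; the intercept a is fitted by the model itself.
X_quad = np.column_stack([income, income**2])
quad_model = LinearRegression().fit(X_quad, price)

a = quad_model.intercept_
b1, b2 = quad_model.coef_
print(f"a = {a:.0f}, b1 = {b1:.0f}, b2 = {b2:.0f}")
```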

Again plot the model fit for this model versus data. How does the fit compare with the linear model?

Now fit a model using up to 5th order polynomial terms.
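
For higher orders, building the columns by hand becomes tedious. One option, sketched here on synthetic data, is scikit-learn's `PolynomialFeatures` chained with `LinearRegression` in a pipeline. The name `poly5_model` and the generated data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic training data (assumed values, not the housing file).
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=120)
price = 50_000 + 60_000 * income - 2_000 * income**2 + rng.normal(0, 10_000, size=120)
X_demo = income.reshape(-1, 1)

# include_bias=False drops the constant column because LinearRegression
# already fits its own intercept.
poly5_model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    LinearRegression(),
)
poly5_model.fit(X_demo, price)

# One coefficient per power of income, from I^1 up to I^5.
n_terms = poly5_model.named_steps["linearregression"].coef_.shape[0]
predictions = poly5_model.predict(X_demo)
print(n_terms)
```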

Overlay all three models on top of the training data.

Now we are going to compare the fit of all models on the validation set. To do so, first use each of your fitted models to predict house prices in this set.

Calculate the errors in prediction for each of the three models. Draw histograms of these for each of the models.
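
A sketch of the error calculation on synthetic data. To keep it short it fits the three polynomials with `np.polyfit` rather than scikit-learn (a swapped-in technique, not the tutorial's method), and it bins the errors with `np.histogram` instead of drawing them. The split sizes and all names are assumptions.

```python
import numpy as np

# Synthetic data with an assumed quadratic relationship.
rng = np.random.default_rng(0)
income = rng.uniform(1.0, 10.0, size=160)
price = 50_000 + 60_000 * income - 2_000 * income**2 + rng.normal(0, 10_000, size=160)
x_train, y_train = income[:120], price[:120]
x_val, y_val = income[120:], price[120:]

# Prediction error (predicted minus observed) on the validation set per model.
errors = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = np.polyval(coeffs, x_val) - y_val

# Shared bin edges so the three histograms are directly comparable.
all_errors = np.concatenate(list(errors.values()))
bin_edges = np.histogram_bin_edges(all_errors, bins=10)
counts = {d: np.histogram(e, bins=bin_edges)[0] for d, e in errors.items()}
print(counts)
```

To draw the histograms, the same `bin_edges` can be passed to `plt.hist(errors[degree], bins=bin_edges, alpha=0.5)` for each model.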

Calculate the root mean squared error for each of the models:

\begin{equation}
\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N \text{error}_i^2}
\end{equation}
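
The formula translates directly into NumPy. The helper name `rmse` and the two-point example are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors**2)))

# Worked example: the errors are 3 and -4, so RMSE = sqrt((9 + 16) / 2) = sqrt(12.5).
example = rmse([100.0, 200.0], [103.0, 196.0])
print(example)
```

The RMSE has the same units as the target, so for this dataset it can be read as a typical prediction error in US dollars.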

Which model has the best performance and why?

Calculate the RMSE of prediction using the quadratic model on the testing set.

Why is the performance so much worse on the testing set?

In [None]:
# Compare how the three subsets are distributed
plt.scatter(training["median_income"], training["median_house_value"], label="training")
plt.scatter(validation["median_income"], validation["median_house_value"], label="validation")
plt.scatter(testing["median_income"], testing["median_house_value"], label="testing")
plt.xlabel("median_income (tens of thousands of USD)")
plt.ylabel("median_house_value (USD)")
plt.legend()
plt.show()

## Multivariate regression
What if you chose a different variable? How would you implement the above for multivariate regression?

We will choose the continuous variables only.

In [None]:
multivariate_columnnames = [ 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income']
X = training[multivariate_columnnames].values
# No need to edit y. Check the sizes
X.shape, y.shape

When using multiple regression inputs, we need to scale the variables first. Otherwise the coefficients of variables with large magnitudes (e.g., `population`) and those with small magnitudes (e.g., `housing_median_age`) are on very different scales, which makes them hard to compare and, in regularised models, weights the variables unequally.

In [None]:
from sklearn.preprocessing import StandardScaler

# Use StandardScaler to fit and transform your X data so the variables are normalised to zero mean, unit variance
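
A minimal sketch of the scaling step on two hypothetical columns with very different ranges. The scaler is fitted on the training data only and then reused on the validation data, so that no information from the validation set leaks into the fit. The array names and value ranges are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical columns on very different scales.
rng = np.random.default_rng(0)
X_train_demo = np.column_stack([
    rng.uniform(100, 5000, size=120),  # population-like values
    rng.uniform(1, 52, size=120),      # age-like values
])
X_val_demo = np.column_stack([
    rng.uniform(100, 5000, size=40),
    rng.uniform(1, 52, size=40),
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and std from training data
X_val_scaled = scaler.transform(X_val_demo)          # reuses the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The validation columns will not have exactly zero mean and unit variance after the transform; that is expected, because they are scaled with the training statistics.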


In [None]:
# Fit your linear regression model in the same way but using the multivariate X



What is the most important predictor?
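
One heuristic is to rank the predictors by the absolute size of their coefficients once the inputs have been standardised. The sketch below uses three hypothetical features and a synthetic target; the feature names and generating weights are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Three hypothetical features on different scales.
rng = np.random.default_rng(0)
names = ["feature_a", "feature_b", "feature_c"]
X_raw = np.column_stack([
    rng.uniform(0, 1000, size=200),
    rng.uniform(0, 10, size=200),
    rng.uniform(0, 1, size=200),
])
# The target depends strongly on feature_b, weakly on feature_a, not on feature_c.
y_demo = 5.0 * X_raw[:, 0] + 20_000.0 * X_raw[:, 1] + rng.normal(0, 5_000, size=200)

X_scaled = StandardScaler().fit_transform(X_raw)
model = LinearRegression().fit(X_scaled, y_demo)

# Sort features by absolute standardised coefficient, largest first.
order = np.argsort(-np.abs(model.coef_))
ranked = [(names[i], float(model.coef_[i])) for i in order]
print(ranked)
```

This ranking is only a rough guide. When predictors are strongly correlated with each other, as `total_rooms`, `total_bedrooms` and `households` may be, individual coefficients can be unstable and hard to interpret.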


How do the errors compare to the univariate prediction?

## Bonus question
Train a model that uses the L2 regularisation method (Ridge regression). Check the effect this has on the values of the coefficients and on the accuracy on the validation set.
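
A possible starting point is scikit-learn's `Ridge`, shown here on two strongly correlated synthetic predictors. The value `alpha=10.0` is an arbitrary assumption; in practice it would be tuned on the validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Two strongly correlated synthetic predictors built from a shared signal.
rng = np.random.default_rng(0)
base = rng.normal(size=120)
X_raw = np.column_stack([
    base + 0.05 * rng.normal(size=120),
    base + 0.05 * rng.normal(size=120),
])
y_demo = 3.0 * base + rng.normal(0, 1.0, size=120)

X_scaled = StandardScaler().fit_transform(X_raw)

ols = LinearRegression().fit(X_scaled, y_demo)
ridge = Ridge(alpha=10.0).fit(X_scaled, y_demo)  # alpha chosen arbitrarily

# The L2 penalty shrinks the coefficient vector towards zero.
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))
print(f"OLS norm = {ols_norm:.3f}, Ridge norm = {ridge_norm:.3f}")
```

Larger values of `alpha` shrink the coefficients more strongly, which tends to stabilise them when predictors are correlated, at the cost of some bias.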