# Open Team Exercise: Predicting House Prices

![](graphics/house-for-sale-sign.jpg)

In this exercise, we are going to build another predictive model using machine learning. Our goal is to predict real estate prices, given various attributes of the building. The main difference from our previous example is that the target variable we are interested in, the sale price, is now a continuous value rather than one of a discrete set of classes. Time to recall the concepts of **classification** and **regression**:

## Classification vs Regression

We speak of **classification** if the model outputs a _categorical_ variable, i.e. assigns labels to data points that divide them into groups. The machine learning algorithm often performs this task by creating and optimizing a **decision boundary** in the feature space that separates classes. (The previous chapter introduced an example of a predictive classification model.)

We speak of **regression** if the target variable is a _continuous_ value. This is the task of [📓fitting](../stats/stats-fitting-short.ipynb) a function to the data points so that it enables prediction.

![](https://upload.wikimedia.org/wikipedia/commons/1/13/Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)
**classification**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)_

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/500px-Linear_regression.svg.png)
**regression**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Linear_regression.svg)_
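
The difference between the two tasks can be sketched in a few lines of plain Python. This is only a toy illustration: the areas, prices and size labels are made-up values, and the helper names `predict_price` and `classify` are hypothetical. The regression part fits a straight line by least squares and returns a number, while the classification part places a decision boundary halfway between the two class means and returns a label.

```python
# Toy data (made-up values): living area in square metres.
areas = [50.0, 80.0, 100.0, 120.0, 150.0]

# Regression target: a continuous price in thousands.
prices = [150.0, 240.0, 300.0, 360.0, 450.0]

# Classification target: a discrete label per data point.
labels = ["small", "small", "large", "large", "large"]

# --- Regression: fit price = slope * area + intercept by least squares ---
mean_area = sum(areas) / len(areas)
mean_price = sum(prices) / len(prices)
slope = sum((a - mean_area) * (p - mean_price) for a, p in zip(areas, prices)) / sum(
    (a - mean_area) ** 2 for a in areas
)
intercept = mean_price - slope * mean_area


def predict_price(area):
    """Return a continuous value for any area."""
    return slope * area + intercept


# --- Classification: a decision boundary halfway between the class means ---
small_areas = [a for a, label in zip(areas, labels) if label == "small"]
large_areas = [a for a, label in zip(areas, labels) if label == "large"]
boundary = (
    sum(small_areas) / len(small_areas) + sum(large_areas) / len(large_areas)
) / 2


def classify(area):
    """Return one of two labels, depending on the side of the boundary."""
    return "large" if area > boundary else "small"


print(predict_price(110.0))  # a number
print(classify(110.0))       # a label
```

The regressor can return any value on a continuous scale, whereas the classifier can only ever answer with one of the labels it has seen.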

## Preamble

In [1]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style(dark=True)

## Loading the Data

For this exercise we are going to use a data set of house prices together with a large number of attributes. The data set was provided by [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) for one of its machine learning challenges, in which teams compete for first place on the global leaderboard: the best prediction wins.

In [2]:
data_dir = "../.assets/data/house/"

In [3]:
!ls {data_dir}

data_description.txt  prices.csv


The documentation of the data set contains explanations of the numerous attributes:

In [4]:
!cat {data_dir}/data_description.txt

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

A quick look into the data file reveals a typical CSV file - we are going to parse it into a DataFrame.

In [5]:
!head {data_dir}/prices.csv

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Pave,NA,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PCo

In [6]:
import pandas

In [7]:
data = pandas.read_csv(f"{data_dir}/prices.csv")

Defining a schema for this large DataFrame beforehand is a daunting task, so we let `pandas` infer the column types and cast later as needed. We know, however, that the prices should be floating point numbers:

In [8]:
data["SalePrice"] = data["SalePrice"].astype("double")
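
A quick way to check which types `pandas` actually assigned is to inspect `dtypes`. The sketch below uses a tiny in-memory stand-in for `prices.csv` (three made-up rows and only four of the columns), so it runs without the data file. It also shows that the string `NA` is treated as a missing value by default, which is worth keeping in mind for columns such as `Alley`.

```python
import io

import pandas

# Tiny stand-in for prices.csv (illustrative rows, not the real file).
csv_text = """Id,Alley,LotArea,SalePrice
1,NA,8450,208500
2,Grvl,9600,181500
3,Pave,11250,223500
"""

sample = pandas.read_csv(io.StringIO(csv_text))

# Column types as inferred by read_csv.
print(sample.dtypes)
inferred_price_type = str(sample["SalePrice"].dtype)

# "NA" is on pandas' default list of missing-value markers, so it becomes NaN.
alley_missing = sample["Alley"].isna().tolist()
print(alley_missing)

# Cast the target column to floating point, as in the cell above.
sample["SalePrice"] = sample["SalePrice"].astype("float64")
print(sample["SalePrice"].dtype)
```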

This DataFrame has a large number of columns - let's select some to take a brief look:

In [9]:
data[["OverallQual", "OverallCond", "YearBuilt", "SalePrice"]].head()

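|   | OverallQual | OverallCond | YearBuilt | SalePrice |
|---|-------------|-------------|-----------|-----------|
| 0 | 7 | 5 | 2003 | 208500.0 |
| 1 | 6 | 8 | 1976 | 181500.0 |
| 2 | 7 | 5 | 2001 | 223500.0 |
| 3 | 7 | 5 | 1915 | 140000.0 |
| 4 | 8 | 5 | 2000 | 250000.0 |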


## Task

Your task now is to build a predictive model for house prices, using `prices.csv` as training data.

- Build your pipeline using the building blocks provided by `sklearn` (Estimator, Transformer, Pipeline...).
- `sklearn` provides several algorithms for regression. Try them out.
- Don't overcomplicate things at first - start by building a **minimal viable model** that uses a few strong features, and evaluate it - then add more features to improve performance.
- The performance of your predictive model is going to be evaluated in the section below. Take a look at the evaluation code and the error metrics used. Make sure to use the following naming conventions so the code below gets the right inputs:
    - `pipeline`: `Pipeline` object representing the entire ML pipeline that produces your model 
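
As a starting point, a minimal viable model could look roughly like the sketch below. It is not a reference solution. To stay runnable without the data file it uses a synthetic stand-in DataFrame named `toy`, whose price formula and noise level are invented for illustration. The choice of `OverallQual` and `GrLivArea` as features, and of `StandardScaler` plus `LinearRegression` as pipeline steps, is an assumption as well; in the exercise you would fit on `data` and pick your own features and estimator.

```python
import numpy
import pandas
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (the price formula is made up).
rng = numpy.random.default_rng(seed=0)
n_rows = 200
toy = pandas.DataFrame(
    {
        "OverallQual": rng.integers(1, 11, size=n_rows),
        "GrLivArea": rng.uniform(500, 3000, size=n_rows),
    }
)
toy["SalePrice"] = (
    20_000 * toy["OverallQual"]
    + 80 * toy["GrLivArea"]
    + rng.normal(0, 10_000, size=n_rows)
)

# A few strong numeric features to start with (an assumption, not a given).
features = ["OverallQual", "GrLivArea"]

# Transformer followed by estimator, bundled under the required name.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("regress", LinearRegression()),
    ]
)

pipeline.fit(toy[features], toy["SalePrice"])
toy["prediction"] = pipeline.predict(toy[features])
print(toy[["SalePrice", "prediction"]].head())
```

Swapping the last step for another regressor only changes one line, which makes it easy to compare algorithms once the minimal version works.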


## Workspace

Write your ML pipeline code here...

---------

---------

## Evaluation

Here we evaluate the performance of the regression model. A better model produces smaller errors in the predicted price. The two error metrics we use are **Root Mean Squared Error (RMSE)** and **Mean Absolute Error (MAE)** between the predicted value and the observed sale price. In order to get robust scores with less random fluctuation, we apply **cross-validation**.
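
To make the two metrics concrete, here is a small sketch of how they could be computed from cross-validated predictions. It is not the notebook's own evaluation code. The data is synthetic, and the 5-fold split and the plain `LinearRegression` model are arbitrary choices for illustration. `cross_val_predict` returns, for every row, a prediction made by a model that did not see that row during training.

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data: one feature and a noisy linear target (made-up numbers).
rng = numpy.random.default_rng(seed=1)
X = rng.uniform(500, 3000, size=(100, 1))
y = 100 * X[:, 0] + rng.normal(0, 5_000, size=100)

# Each row is predicted by a model trained on the other four folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=cv)

errors = y - y_pred

# RMSE: square the errors, average, take the root (penalizes large errors).
rmse = numpy.sqrt(numpy.mean(errors**2))

# MAE: average of the absolute errors (less sensitive to outliers).
mae = numpy.mean(numpy.abs(errors))

print(f"RMSE: {rmse:.0f}  MAE: {mae:.0f}")
```

MAE can never exceed RMSE, so a large gap between the two hints at a few predictions that are far off.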

In [10]:
import pandas
import datetime

In [11]:
ready = False   # set this to True once you are ready to evaluate your model

### Cross-Validation Result

In [12]:
# TODO:

## Diagnostics

In order to get a better understanding of the error made by the model, plot the distribution of prices, predicted prices, and errors. This can provide useful feedback for model improvement.

In [13]:
import seaborn
seaborn.set_style("whitegrid")

In [14]:
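if ready:
    # `predicted` is expected to hold the observed and the predicted prices
    predicted_pd = predicted[["SalePrice", "prediction"]]
    # histplot replaces distplot, which is deprecated since seaborn 0.11
    seaborn.histplot(predicted_pd["SalePrice"], kde=True, stat="density")
    seaborn.histplot(predicted_pd["prediction"], kde=True, stat="density")
    seaborn.histplot(predicted_pd["SalePrice"] - predicted_pd["prediction"], kde=True, stat="density")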

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_