# Open Team Exercise: Predicting House Prices - Amsterdam Edition


In this exercise, we are going to build another predictive model using machine learning. Our goal is to predict real estate prices, given various attributes of the building. The main difference from our previous example is that the target variable we are interested in, the sale price, is now a continuous range of values rather than a discrete set of classes. Time to recall the concepts of **classification** and **regression**:

## Classification vs Regression

We speak of **classification** if the model outputs a _categorical_ variable, i.e. assigns labels to data points that divide them into groups. The machine learning algorithm often performs this task by creating and optimizing a **decision boundary** in the feature space that separates classes. (The previous chapter introduced an example of a predictive classification model.)

We speak of **regression** if the target variable is a _continuous_ value. This is the task of [📓fitting](../stats/stats-fitting-short.ipynb) a function to the data points so that it enables prediction.

![](https://upload.wikimedia.org/wikipedia/commons/1/13/Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)
**classification**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)_

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/500px-Linear_regression.svg.png)
**regression**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Linear_regression.svg)_
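
To make the distinction concrete, here is a minimal sketch in plain Python. The data points, the `boundary` value, and the helper names `classify_size` and `fit_line` are made up for illustration and are not part of the exercise data: the classifier returns one of two labels, while the fitted line returns a continuous value.

```python
# Toy data, made up for illustration: living area in m² and price in EUR.
areas = [50.0, 60.0, 80.0, 100.0, 120.0]
prices = [300_000.0, 360_000.0, 480_000.0, 600_000.0, 720_000.0]


def classify_size(area, boundary=75.0):
    """Classification: assign one of a discrete set of labels.

    The decision boundary is a single (hypothetical) threshold on the area.
    """
    return "large" if area >= boundary else "small"


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)

label = classify_size(90.0)                  # categorical output
predicted_price = slope * 90.0 + intercept   # continuous output
```

For the same input of 90 m², the classification model answers with a label and the regression model answers with a number on a continuous scale.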

## Preamble

In [1]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style(dark=True)

## Loading the Data

For this exercise we are going to use a dataset of Amsterdam house prices and some attributes (address, zip code, area, number of rooms, and coordinates). The dataset is available on [Kaggle](https://www.kaggle.com/datasets/thomasnibb/amsterdam-house-price-prediction?resource=download).

In [16]:
data_dir = "../.assets/data/house_amsterdam/"

In [17]:
!ls {data_dir}

HousingPrices-Amsterdam-August-2021.csv


A quick look into the data file reveals a typical CSV file - we are going to parse it into a DataFrame.

In [18]:
!head {data_dir}/HousingPrices-Amsterdam-August-2021.csv

"","Address","Zip","Price","Area","Room","Lon","Lat"
"1","Blasiusstraat 8 2, Amsterdam","1091 CR",685000,64,3,4.907736,52.356157
"2","Kromme Leimuidenstraat 13 H, Amsterdam","1059 EL",475000,60,3,4.850476,52.348586
"3","Zaaiersweg 11 A, Amsterdam","1097 SM",850000,109,4,4.944774,52.343782
"4","Tenerifestraat 40, Amsterdam","1060 TH",580000,128,6,4.789928,52.343712
"5","Winterjanpad 21, Amsterdam","1036 KN",720000,138,5,4.902503,52.410538
"6","De Wittenkade 134 I, Amsterdam","1051 AM",450000,53,2,4.875024,52.382228
"7","Pruimenstraat 18 B, Amsterdam","1033 KM",450000,87,3,4.896536,52.410585
"8","Da Costakade 32 II, Amsterdam","1053 WL",590000,80,2,4.871555,52.371041
"9","Postjeskade 41 2, Amsterdam","1058 DG",399000,49,3,4.854671,52.363471


In [19]:
import pandas

In [21]:
data = pandas.read_csv(f"{data_dir}/HousingPrices-Amsterdam-August-2021.csv")

pandas infers the column types while parsing, so we do not define a schema beforehand and cast later as needed. We know however that the prices should be floating point numbers:

In [8]:
data["Price"] = data["Price"].astype("double")

Let's select a few columns to take a brief look:

In [9]:
data[["Area", "Room", "Price"]].head()

Unnamed: 0,Area,Room,Price
0,64,3,685000.0
1,60,3,475000.0
2,109,4,850000.0
3,128,6,580000.0
4,138,5,720000.0


## Task

Your task now is to build a predictive model for house prices, using `HousingPrices-Amsterdam-August-2021.csv` as training data.

- Build your pipeline using the building blocks provided by `sklearn` (Estimator, Transformer, Pipeline...).
- `sklearn` provides several algorithms for regression. Try out a few of them and compare the results.
- Don't overcomplicate things at first - start by building a **minimal viable model** that uses a few strong features, and evaluate it - then add more features to improve performance.
- The performance of your predictive model is going to be evaluated in the section below. Take a look at the evaluation code and the error metrics used. Make sure to use the following naming conventions so the code below gets the right inputs:
    - `pipeline`: `Pipeline` object representing the entire ML pipeline that produces your model 
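
As a reminder of how the building blocks fit together, here is a minimal sketch rather than a solution. It runs on synthetic stand-in data, not on the Amsterdam dataset, and the names `X_demo`, `y_demo`, and `sketch_pipeline` are made up for illustration. The choice of `StandardScaler` and `LinearRegression` is only one possible combination.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Amsterdam dataset): two numeric features
# and a target that depends linearly on them, plus some noise.
rng = np.random.default_rng(seed=0)
X_demo = rng.uniform(low=[30.0, 1.0], high=[200.0, 8.0], size=(100, 2))
y_demo = (
    5000.0 * X_demo[:, 0]
    + 10000.0 * X_demo[:, 1]
    + rng.normal(0.0, 1000.0, size=100)
)

# A Pipeline chains any number of transformers and ends with an estimator.
sketch_pipeline = Pipeline([
    ("scale", StandardScaler()),      # transformer: standardize the features
    ("regress", LinearRegression()),  # estimator: fit a linear model
])

# fit() trains every step in order; predict() applies the transformers
# and then asks the final estimator for predictions.
sketch_pipeline.fit(X_demo, y_demo)
demo_predictions = sketch_pipeline.predict(X_demo[:3])
```

Swapping the final step for another regressor leaves the rest of the pipeline unchanged, which makes it easy to compare algorithms.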


## Workspace

Write your ML pipeline code here...

---------

In [None]:
# your code here

---------

## Evaluation

Here we evaluate the performance of the regression model. A better model produces smaller errors in the predicted price. The two error metrics we use are **Root Mean Squared Error (RMSE)** and **Mean Absolute Error (MAE)** between the predicted value and the observed sale price. In order to get robust scores with less random fluctuation, we apply **cross-validation**.
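
The following plain-Python sketch illustrates how the two metrics and cross-validation work together. It is not the evaluation code of this notebook: the prices are made up, the helper names `rmse`, `mae`, and `k_fold_indices` are hypothetical, and the "model" is a simple baseline that always predicts the mean training price.

```python
import math


def rmse(observed, predicted):
    """Root of the mean squared difference; large errors weigh more."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean of the absolute differences."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# Made-up prices in EUR, for illustration only.
toy_prices = [400_000.0, 450_000.0, 500_000.0, 550_000.0, 600_000.0, 650_000.0]

fold_scores = []
for train, test in k_fold_indices(len(toy_prices), n_folds=3):
    # Baseline "model": always predict the mean of the training prices.
    baseline = sum(toy_prices[i] for i in train) / len(train)
    observed = [toy_prices[i] for i in test]
    predicted = [baseline] * len(test)
    fold_scores.append((rmse(observed, predicted), mae(observed, predicted)))

# Averaging over the folds reduces the dependence on a single split.
mean_rmse = sum(score[0] for score in fold_scores) / len(fold_scores)
mean_mae = sum(score[1] for score in fold_scores) / len(fold_scores)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, and a large gap between the two hints at a few very large errors.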

In [10]:
import pandas
import datetime

In [11]:
ready = False   # set this to True once you are ready to evaluate your model

### Cross-Validation Result

In [24]:
# your code here

## Diagnostics

In order to get a better understanding of the error made by the model, plot the distribution of prices, predicted prices, and errors. This can provide useful feedback for model improvement.

In [22]:
import seaborn
seaborn.set_style("whitegrid")

In [23]:
# your code here

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_