# Open Team Exercise: Predicting House Prices

![](graphics/house-for-sale-sign.jpg)

In this exercise, we are going to build another predictive model using machine learning. Our goal is to predict real estate prices, given various attributes of the building. The main difference from our previous example is that the target variable we are interested in, the sale price, is now a continuous value rather than one of a discrete set of classes. Time to recall the concepts of **classification** and **regression**:

## Classification vs Regression

We speak of **classification** if the model outputs a _categorical_ variable, i.e. assigns labels to data points that divide them into groups. The machine learning algorithm often performs this task by creating and optimizing a **decision boundary** in the feature space that separates classes. (The previous chapter introduced an example of a predictive classification model.)

We speak of **regression** if the target variable is a _continuous_ value. This is the task of [📓fitting](../stats/stats-fitting-short.ipynb) a function to the data points so that the fitted function can predict the target value for new inputs.

![](https://upload.wikimedia.org/wikipedia/commons/1/13/Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)
**classification**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Main-qimg-48d5bd214e53d440fa32fc9e5300c894.png)_

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/500px-Linear_regression.svg.png)
**regression**
_Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Linear_regression.svg)_
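
To make the idea of fitting a function concrete, here is a minimal sketch of a one-feature least-squares line fit in plain Python. The data points and the helper name `fit_line` are made up for illustration; they are not taken from the house price data set, and a real model for this exercise would use `pyspark.ml` instead.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common factor 1/n cancels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: living area vs. sale price
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150_000.0, 240_000.0, 300_000.0, 360_000.0]

slope, intercept = fit_line(areas, prices)

# The fitted function yields a continuous prediction for an unseen input
predicted_price = slope * 90.0 + intercept
print(slope, intercept, predicted_price)
```

The output of the fitted function is a number on a continuous scale, in contrast to a classifier, which would return one label out of a fixed set.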

## Loading the Data

For this exercise we are going to use a data set of house prices and (a vast number of) attributes. The data set was provided by [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) for one of their machine learning challenges, in which teams compete for first place on the global leaderboard - the best prediction wins.

In [None]:
import findspark
findspark.init()
import pyspark

In [None]:
data_dir = "../.assets/data/house/"

In [None]:
!ls {data_dir}

The documentation of the data set explains the numerous attributes:

In [None]:
!cat {data_dir}/data_description.txt

A quick look into the data file reveals a typical CSV file - we are going to parse it into a DataFrame.

In [None]:
!head {data_dir}/prices.csv



After creating a `SparkSession`, we read the contents of the .csv file into a DataFrame. 

In [None]:
spark = pyspark.sql.SparkSession \
    .builder \
    .appName("HousePricePredictor") \
    .getOrCreate()


In [None]:
data = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(f"{data_dir}/prices.csv") 


Defining a schema for this large DataFrame beforehand is a daunting task, so we leave the types at the default (string) and cast later as needed. We do know, however, that the prices should be floating point numbers:

In [None]:
data = data.withColumn("SalePrice", data["SalePrice"].cast("DOUBLE"))

This DataFrame has a large number of columns - let's select some to take a brief look:

In [None]:
data[["OverallQual", "OverallCond", "YearBuilt", "SalePrice"]].show()

## Task

Your task now is to build a predictive model for house prices, using `prices.csv` as training data.

- Build your pipeline using the building blocks provided by `pyspark.ml` (Estimator, Transformer, Pipeline...). Go back to our [📓previous classification pipeline](../spark/spark-ml-pipeline.ipynb) for inspiration.
- `pyspark.ml` provides [**a few algorithms for regression**](https://spark.apache.org/docs/latest/ml-classification-regression.html#regression) - use both reasoning and experimentation to select a viable one.
- Don't overcomplicate things at first - start by building a **minimal viable model** that uses a few strong features, and evaluate it - then add more features to improve performance.
- The performance of your predictive model is going to be evaluated in the section below. Take a look at the evaluation code and the error metrics used. Make sure to use the following naming conventions so the code below gets the right inputs:
    - `pipeline`: `pyspark.ml.Pipeline` object representing the entire ML pipeline that produces your model 
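
One possible way to look for "a few strong features" is to rank numeric candidates by the absolute value of their Pearson correlation with the sale price. The sketch below does this in plain Python on a handful of made-up rows; the column names come from the data set, but the values and the helper `pearson` are hypothetical. Keep in mind that the real columns are loaded as strings and need to be cast first, and that correlation only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up sample values for illustration only
sample = {
    "OverallQual": [5, 6, 7, 8, 9],
    "YearBuilt": [1960, 2005, 1975, 1990, 2010],
    "OverallCond": [5, 7, 4, 6, 5],
}
sale_price = [120_000, 160_000, 200_000, 250_000, 320_000]

# Rank candidate features by the strength of their linear relationship
ranking = sorted(
    sample,
    key=lambda name: abs(pearson(sample[name], sale_price)),
    reverse=True,
)
for name in ranking:
    print(name, round(pearson(sample[name], sale_price), 3))
```

A ranking like this is only a starting point: experimentation with the actual model remains the deciding factor.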


## Workspace

Write your ML pipeline code here...

In [None]:
from pyspark.ml import Pipeline

---------

---------

## Evaluation

Here we evaluate the performance of the regression model. A better model produces smaller errors in the predicted price. The two error metrics we use are the **Root Mean Squared Error (RMSE)** and the **Mean Absolute Error (MAE)** between the predicted value and the observed sale price. In order to get robust scores with less random fluctuation, we apply **cross-validation**.
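
To illustrate what the two metrics measure, here is a small plain-Python sketch. The function names and the toy numbers are assumptions made for this example; the actual evaluation relies on Spark's `RegressionEvaluator`.

```python
import math


def root_mean_squared_error(observed, predicted):
    """Square root of the mean squared difference; penalizes large misses more."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_error(observed, predicted):
    """Mean of the absolute differences; every miss counts proportionally."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical prices: three small misses and one large one
observed_prices = [200_000.0, 150_000.0, 300_000.0, 250_000.0]
predicted_prices = [210_000.0, 140_000.0, 260_000.0, 250_000.0]

toy_rmse = root_mean_squared_error(observed_prices, predicted_prices)
toy_mae = mean_absolute_error(observed_prices, predicted_prices)
print(toy_rmse, toy_mae)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than the MAE, so the RMSE is never smaller than the MAE on the same data.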

In [None]:
import pandas
import datetime
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [None]:
ready = False   # set this to True once you are ready to evaluate your model

### Result

In [None]:
if ready:
    rmse = CrossValidator(estimator=pipeline,
                        evaluator=RegressionEvaluator(metricName="rmse", labelCol="label", predictionCol="prediction"),
                        estimatorParamMaps=ParamGridBuilder().build(),
                        numFolds=4) \
                        .fit(data.withColumnRenamed("SalePrice", "label")) \
                        .avgMetrics[0]

    mae = CrossValidator(estimator=pipeline,
                        evaluator=RegressionEvaluator(metricName="mae", labelCol="label", predictionCol="prediction"),
                        estimatorParamMaps=ParamGridBuilder().build(),
                        numFolds=4) \
                        .fit(data.withColumnRenamed("SalePrice", "label")) \
                        .avgMetrics[0]
    
    team_name = "Team 1"  # change this to the name of your team
    print("\t".join(["time", "team", "RMSE", "MAE"]))
    print("\t".join([datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), team_name, "{0:.4f}".format(rmse), "{0:.4f}".format(mae)]))
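
The `CrossValidator` calls above use an empty parameter grid, so `ParamGridBuilder().build()` yields a single empty parameter map and `avgMetrics[0]` is the metric averaged over the 4 folds. The following plain-Python sketch shows the principle of k-fold splitting. It is a simplification: the helper `k_fold_indices` is hypothetical and assigns rows to folds deterministically, whereas Spark assigns them randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_rows))
    for fold in range(num_folds):
        # Every num_folds-th row, starting at offset `fold`, is held out
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


folds = list(k_fold_indices(n_rows=8, num_folds=4))
for train, validation in folds:
    # A model would be fitted on `train` and scored on `validation`;
    # the final score is the average of the per-fold scores.
    print("train:", train, "validation:", validation)
```

Each row is used for validation exactly once and for training in all other folds, which is why the averaged score fluctuates less than a score from a single train/test split.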

## Diagnostics

In order to get a better understanding of the error made by the model, plot the distribution of prices, predicted prices, and errors. This can provide useful feedback for model improvement.

In [None]:
if ready:
    # Fit on the full data set and predict on the same rows (in-sample predictions)
    predicted = pipeline.fit(data.withColumnRenamed("SalePrice", "label")).transform(data)
    predicted[["SalePrice", "prediction"]].show()

In [None]:
import seaborn
seaborn.set_style("whitegrid")

In [None]:
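if ready:
    predicted_pd = predicted[["SalePrice", "prediction"]].toPandas()
    # Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is its successor
    seaborn.distplot(predicted_pd["SalePrice"])    # observed prices
    seaborn.distplot(predicted_pd["prediction"])   # predicted prices
    seaborn.distplot(predicted_pd["SalePrice"] - predicted_pd["prediction"])   # errors (residuals)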
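
Alongside the plots, a few summary numbers on the errors can help with interpretation. This is a minimal sketch on made-up values using only the standard library; the variable names are hypothetical and the numbers are not results from the data set.

```python
import statistics

# Hypothetical observed and predicted prices
observed_prices = [200_000.0, 150_000.0, 300_000.0, 250_000.0]
predicted_prices = [210_000.0, 140_000.0, 260_000.0, 250_000.0]

# Residual = observed minus predicted; positive means the model underestimated
residuals = [o - p for o, p in zip(observed_prices, predicted_prices)]

mean_residual = statistics.mean(residuals)      # systematic bias, if any
median_residual = statistics.median(residuals)  # robust against single outliers
largest_miss = max(residuals, key=abs)          # worst individual prediction

print(mean_residual, median_residual, largest_miss)
```

A mean residual far from zero hints at a systematic over- or underestimation, while a large gap between the mean and the median points to a few extreme misses.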

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_