# House price estimation

Let's study how linear regression, _i.e.,_ least squares fitting, can be used to estimate house prices in the Hervanta region. The data was downloaded from http://asuntojen.hintatiedot.fi/haku/, copied to Excel, cleaned and saved as a `csv` file. It looks as follows.

num_rooms|type|sqm|eur/sqm|year|elevator|condition
---|---|---|---|---|---|---
1|kt|31,5|2413|1974|on|hyvä
1|rt|37|3405|2018|ei|hyvä
1|kt|30|3683|1990|on|tyyd.
1|kt|35|2343|1981|on|tyyd.
1|kt|32|2656|1977|on|hyvä

The values are in Finnish: `kt` is an apartment building and `rt` a row house, `on` / `ei` mean that the building has / does not have an elevator, and `hyvä` / `tyyd.` stand for good / satisfactory condition. The `eur/sqm` column is the selling price per square metre.

First, let's import the required libraries. We use NumPy for array handling; everything else comes from `scikit-learn`.

In [2]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

Load the data line by line. The file is semicolon-separated. The attributes (apartment size, year, etc.) are added to `X`, and the target (the actual selling price per sqm) to `y`.

In [3]:
file = "prices.csv"

X = []
y = []

with open(file, "r") as f:
    for line in f:
        
        # Skip first line
        if line.startswith("num_rooms"):
            continue
        
        parts = line.strip().split(";")
        
        rooms = int(parts[0])
        kind  = parts[1]
        
        # Numbers use Finnish locale with decimals separated by comma.
        # Just use replace(), although the proper way would be with
        # locale module.
        
        sqm   = float(parts[2].replace(",", "."))
        price = float(parts[3])
        year  = int(parts[4])
        elev  = parts[5]
        cond  = parts[6]
        
        X.append([rooms, kind, sqm, year, elev, cond])
        y.append(price)
        
X = np.array(X)
y = np.array(y)
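
As an alternative to splitting each line by hand, the same parsing could be sketched with the standard `csv` module. This is only an illustration: the sample text reuses rows from the table above, the column names other than `num_rooms` are assumed to match that table's header, and `parse_fi_float` and `load_rows` are hypothetical helper names.

```python
import csv
import io

# Hypothetical sample in the same semicolon-separated layout as prices.csv
# (rows copied from the table above; the real file is not reproduced here).
SAMPLE = """num_rooms;type;sqm;eur/sqm;year;elevator;condition
1;kt;31,5;2413;1974;on;hyvä
1;rt;37;3405;2018;ei;hyvä
1;kt;30;3683;1990;on;tyyd.
"""


def parse_fi_float(text):
    """Convert a Finnish-style decimal such as '31,5' to a float."""
    return float(text.replace(",", "."))


def load_rows(handle):
    """Return (features, targets) from a semicolon-separated file object."""
    features, targets = [], []
    # DictReader consumes the header line, so no manual skipping is needed.
    reader = csv.DictReader(handle, delimiter=";")
    for row in reader:
        features.append([
            int(row["num_rooms"]),
            row["type"],
            parse_fi_float(row["sqm"]),
            int(row["year"]),
            row["elevator"],
            row["condition"],
        ])
        targets.append(float(row["eur/sqm"]))
    return features, targets


features, targets = load_rows(io.StringIO(SAMPLE))
print(features[0])
print(targets)
```

With a real file, `load_rows` would be called on the object returned by `open(file, newline="")` instead of the `io.StringIO` wrapper.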

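Some columns are categorical, so we need to encode them as dummy indicator variables. For example, `condition = {good, satisfactory, bad}` is encoded into three binary (numerical) variables: `is_good`, `is_satisfactory` and `is_bad`. The binarized columns are appended to the end of `X`, after which the original text columns are deleted in reverse order so that the remaining column indices stay valid.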

In [4]:
binarized_cols = [1, 4, 5]

for col in binarized_cols:
    lb = LabelBinarizer()
    z = lb.fit_transform(X[:, col])
    X = np.append(X, z, axis = 1)
    
for col in binarized_cols[::-1]: 
    X = np.delete(X, col, axis = 1)

X = X.astype(float)
y = y.astype(float)
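
To see what `LabelBinarizer` produces, here is a small sketch on toy columns (not the real data; `huono`, Finnish for bad, is an assumed third condition label). The output columns follow the sorted order in `classes_`, and a column with exactly two distinct labels is encoded as a single 0/1 column rather than two, so the number of appended columns is worth checking against `classes_`.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy columns (not the real data): three condition labels, two elevator labels.
condition = np.array(["hyvä", "tyyd.", "huono", "hyvä"])
elevator = np.array(["on", "ei", "on", "on"])

lb_cond = LabelBinarizer()
cond_dummies = lb_cond.fit_transform(condition)
print(lb_cond.classes_)    # order of the output columns (sorted labels)
print(cond_dummies)        # one row per sample, one column per class

lb_elev = LabelBinarizer()
elev_dummies = lb_elev.fit_transform(elevator)
print(lb_elev.classes_)
print(elev_dummies.shape)  # two classes give a single 0/1 column
```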

Next, split the data into training and test sets (80 % / 20 %) so that we can evaluate the regression model on apartments it has not seen during fitting.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Fit the regression model on the training set and predict the unit prices of the test set.

In [6]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
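
Since linear regression is least squares fitting, the same coefficients can be obtained by solving the least squares problem directly. The sketch below checks this on synthetic data (made up for illustration, not the house price data): a column of ones is appended for the intercept, which `LinearRegression` estimates by default, and the system is solved with `np.linalg.lstsq`. The names `X_demo`, `y_demo` and `model_demo` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the house price data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(scale=0.1, size=50)

# scikit-learn fit (an intercept is estimated by default).
model_demo = LinearRegression().fit(X_demo, y_demo)

# Same fit by hand: append a column of ones for the intercept and solve
# min ||A w - y||^2 with NumPy's least squares solver.
A = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(np.allclose(w[:-1], model_demo.coef_))
print(np.isclose(w[-1], model_demo.intercept_))
```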

Let's see how we did with the first five test apartments:

num_rooms|type|sqm|eur/sqm|year|elevator|condition|prediction|error
---|---|---|---|---|---|---|---|---
3|rt|75,5|1762|1978|ei|hyvä|1654.3|107.6
3|kt|63|2190|2004|on|hyvä|2402.0|212.0
3|rt|77|2948|2017|ei|hyvä|2972.7|24.7
2|kt|58|1483|1974|on|hyvä|1571.0|88.0
2|kt|58|1379|1974|on|hyvä|1571.0|192.0

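Compute the prediction error over the whole test partition: the mean absolute error, and the mean signed difference, which shows whether the model over- or underestimates on average.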

In [7]:
error = mean_absolute_error(y_test, y_pred)
print("Mean absolute error: {:.1f} eur/sqm".format(error))

mean_difference = np.mean(y_pred - y_test)
print("Average prediction error: {:.1f} eur/sqm".format(mean_difference))

Mean absolute error: 274.5 eur/sqm
Average prediction error: -19.1 eur/sqm
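
The difference between the two measures can be seen on the five test apartments listed in the table above. The sketch below uses the rounded values as displayed there, so the last digit may differ slightly from the table's error column, and it covers only those five rows rather than the whole test set.

```python
# Actual and predicted unit prices (eur/sqm) of the five test apartments
# listed in the table above, as displayed (i.e. already rounded).
actual = [1762, 2190, 2948, 1483, 1379]
predicted = [1654.3, 2402.0, 2972.7, 1571.0, 1571.0]

diffs = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(d) for d in diffs) / len(diffs)   # sign is ignored
bias = sum(diffs) / len(diffs)                  # over- and underestimates cancel

print("MAE over five rows:  {:.1f} eur/sqm".format(mae))
print("Mean signed error:   {:.1f} eur/sqm".format(bias))
```

The signed mean is smaller in magnitude than the mean absolute error because the single underestimate partly cancels the overestimates.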


Our model coefficients are the following:

num_rooms|sqm|year|is_apt_building|is_house|is_row_house|no_elevator|has_elevator|is_good|is_bad
---|---|---|---|---|---|---|---|---|---
-154.06459911|-7.54989771|34.09409129|-374.30893573|570.79730825|-196.48837252|-55.32978503|52.2093088|38.80866212|-91.01797094

It seems the accuracy is somewhat reasonable given the simplicity of the model. Moreover, the coefficients make sense: the unit price tends to decrease for larger apartments, an elevator increases the unit price by about 100 €/sqm (the gap between the `has_elevator` and `no_elevator` coefficients), etc.
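
As a rough reading aid, the effects mentioned above can be computed from the coefficient table. Because the model also fits an intercept, the indicator coefficients within one group are best read relative to each other rather than individually. The intercept itself is not listed, so the sketch only computes differences, not full predictions.

```python
# Coefficients copied from the table above (eur/sqm per unit of each feature).
coef = {
    "sqm": -7.54989771,
    "year": 34.09409129,
    "no_elevator": -55.32978503,
    "has_elevator": 52.2093088,
}

# The effect of having an elevator is the gap between the two
# indicator coefficients.
elevator_effect = coef["has_elevator"] - coef["no_elevator"]

# Effect of 10 extra square metres and of a building that is 10 years newer,
# with all other attributes held fixed.
extra_10_sqm = 10 * coef["sqm"]
newer_10_years = 10 * coef["year"]

print("Elevator:        {:+.1f} eur/sqm".format(elevator_effect))
print("10 sqm larger:   {:+.1f} eur/sqm".format(extra_10_sqm))
print("10 years newer:  {:+.1f} eur/sqm".format(newer_10_years))
```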