# House price estimation

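Let's study how we can use linear regression, _i.e.,_ least squares fitting, to estimate house prices in the Hervanta region. The data was downloaded from http://asuntojen.hintatiedot.fi/haku/, copied to Excel, cleaned and saved as `csv`. The categorical values are in Finnish: `kt`, `rt` and `ok` are apartment block, row house and detached house; `on` and `ei` mean the building has or lacks an elevator; `hyvä` and `tyyd.` mean good and satisfactory condition. The data looks as follows.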

num_rooms|type|sqm|eur/sqm|year|elevator|condition
---|---|---|---|---|---|---
1|kt|31,5|2413|1974|on|hyvä
1|rt|37|3405|2018|ei|hyvä
1|kt|30|3683|1990|on|tyyd.
1|kt|35|2343|1981|on|tyyd.
1|kt|32|2656|1977|on|hyvä

First, let's import the required libraries. We use `numpy` and `pandas` for data handling and `scikit-learn` for the model and its evaluation.

In [98]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

Load the data line by line. The attributes (apartment size, year, etc.) are added to `X`, and the target (actual selling price per sqm) to `y`.

In [99]:
file = "prices.csv"

X = []
y = []

with open(file, "r") as f:
    for line in f:
        
        # The header line holds the column names: store them and move on
        if line.startswith("num_rooms"):
            columns = line.split(";")
            continue
        
        parts = line.strip().split(";")
        
        rooms = int(parts[0])
        kind  = parts[1]
        
        # Numbers use Finnish locale with decimals separated by comma.
        # Just use replace(), although the proper way would be with
        # locale module.
        
        sqm   = float(parts[2].replace(",", "."))
        price = float(parts[3])
        year  = int(parts[4])
        elev  = parts[5]
        cond  = parts[6]
        
        X.append([rooms, kind, sqm, year, elev, cond])
        y.append(price)
        
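# Rows mix numbers and text, so NumPy stores every value as a string here
X = np.array(X)
y = np.array(y)
# Drop the name of the target column (eur/sqm) so that columns matches X
columns = columns[:3] + columns[4:]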
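
As an aside, the same kind of file could probably be loaded in a single call with `pandas.read_csv`, whose `sep` and `decimal` parameters handle the semicolon delimiter and the decimal comma. The sketch below is an alternative to the manual loop, not the approach used in this notebook. It reads a small inline sample (three rows copied from the table above, assumed to be semicolon-separated as in the loop) instead of `prices.csv`, and the names `sample`, `df_alt`, `X_alt` and `y_alt` are illustrative only.

```python
import io
import pandas as pd

# Inline stand-in for prices.csv
# (assumption: semicolon-separated fields, decimal comma)
sample = io.StringIO(
    "num_rooms;type;sqm;eur/sqm;year;elevator;condition\n"
    "1;kt;31,5;2413;1974;on;hyvä\n"
    "1;rt;37;3405;2018;ei;hyvä\n"
    "1;kt;30;3683;1990;on;tyyd.\n"
)

# decimal="," makes pandas parse "31,5" as the float 31.5
df_alt = pd.read_csv(sample, sep=";", decimal=",")

# Separate the target (price per sqm) from the attributes
y_alt = df_alt["eur/sqm"].to_numpy()
X_alt = df_alt.drop(columns=["eur/sqm"])
print(X_alt.dtypes)
```

With this route the numeric columns keep numeric types, so no later conversion from strings is needed.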

In [100]:
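# Columns 1, 4 and 5 (type, elevator, condition) are categorical
categorical_cols = [c for i,c in enumerate(columns) if i in [1,4,5]]
df = pd.DataFrame(X, columns = columns)
# Replace each categorical column by one 0/1 indicator column per category
df1 = pd.get_dummies(df, columns = categorical_cols)
X = np.array(df1)
columns = df1.columns
columns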

Index(['num_rooms', 'sqm', 'year', 'type_kt', 'type_ok', 'type_rt',
       'elevator_ei', 'elevator_on', 'condition\n_', 'condition\n_hyvä',
       'condition\n_tyyd.'],
      dtype='object')

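Some columns are categorical, so they must be encoded as dummy indicator variables, which is what `pd.get_dummies` does above. For example, `condition = {good, satisfactory, bad}` is encoded into three binary (numerical) variables: `is_good`, `is_satisfactory` and `is_bad`. The `\n` in the `condition` column names is left over from the header line, which was split without stripping its trailing newline.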
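
The encoding described above can be sketched on a toy column. The values below are the English examples from the text, not rows from the data set, and `toy` and `dummies` are illustrative names; `dtype=int` is passed only so that the indicators print as 0/1.

```python
import pandas as pd

# Toy categorical column (illustrative values, not from prices.csv)
toy = pd.DataFrame({"condition": ["good", "satisfactory", "bad", "good"]})

# One indicator column per category; names are "<column>_<category>"
dummies = pd.get_dummies(toy, columns=["condition"], dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, namely the one matching its original category.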

Next, split the data into training and test sets to evaluate the performance of our regression model. Here 20% of the samples are held out for testing.

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
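
To illustrate what the split does, here is a minimal sketch on ten made-up samples (the `*_demo`, `X_tr` and `X_te` names are hypothetical). It shows the 80/20 partition sizes and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state give identical partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```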

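Fit the regression model on the training set and predict prices for the test set. We use `Lasso`, _i.e.,_ least squares with an L1 penalty on the coefficients.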

In [112]:
model = Lasso(alpha = 100) # Alpha is the Lambda of lecture slides 
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
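
For reference, the objective that scikit-learn's `Lasso` minimizes, as stated in its documentation, is shown below. Here $n$ is the number of training samples and $w$ the coefficient vector. This symbol naming is ours and may differ from the lecture slides.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The first term is the usual least squares error, and the second term penalizes large coefficients. The larger $\alpha$ is, the more coefficients are driven exactly to zero.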

Let's see how we did with the first five test apartments:

num_rooms|type|sqm|eur/sqm|year|elevator|condition|prediction (eur/sqm)|abs. error (eur/sqm)
---|---|---|---|---|---|---|---|---
3|rt|75,5|1762|1978|ei|hyvä|1654.3|107.6
3|kt|63|2190|2004|on|hyvä|2402.0|212.0
3|rt|77|2948|2017|ei|hyvä|2972.7|24.7
2|kt|58|1483|1974|on|hyvä|1571.0|88.0
2|kt|58|1379|1974|on|hyvä|1571.0|192.0

Compute the mean prediction error over the whole test partition.

In [106]:
error = mean_absolute_error(y_test, y_pred)
print("Mean absolute error: {:.1f} eur/sqm".format(error))

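# Signed mean of (prediction - target): a negative value means the model
# underestimates prices on average
mean_difference = np.mean(y_pred - y_test)
print("Average prediction error: {:.1f} eur/sqm".format(mean_difference))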

Mean absolute error: 274.3 eur/sqm
Average prediction error: -19.1 eur/sqm
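
The two measures can be checked by hand on a small case. The sketch below uses only the five test apartments from the table above, so its numbers differ from the full test set results. The names `y_true5` and `y_pred5` are illustrative.

```python
import numpy as np

# Actual and predicted eur/sqm for the five apartments in the table above
y_true5 = np.array([1762.0, 2190.0, 2948.0, 1483.0, 1379.0])
y_pred5 = np.array([1654.3, 2402.0, 2972.7, 1571.0, 1571.0])

# Mean absolute error: average size of the error, ignoring its sign
mae5 = np.mean(np.abs(y_pred5 - y_true5))

# Mean signed error: positive and negative errors cancel out,
# so this measures systematic over- or underestimation
bias5 = np.mean(y_pred5 - y_true5)

print("MAE: {:.2f}, mean signed error: {:.2f}".format(mae5, bias5))
```

Because signed errors cancel, the mean signed error is never larger in magnitude than the mean absolute error, which matches the full test set results above.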


Our model coefficients are the following:

In [113]:
coefs = pd.DataFrame(model.coef_.reshape(1,-1), columns = columns)
coefs

num_rooms|sqm|year|type_kt|type_ok|type_rt|elevator_ei|elevator_on|condition _|condition _hyvä|condition _tyyd.
---|---|---|---|---|---|---|---|---|---|---
-0.0|-8.710432|36.463217|-0.0|0.0|0.0|0.0|-0.0|0.0|0.0|-0.0
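
Only `sqm` and `year` have nonzero coefficients above, which is consistent with how an L1 penalty behaves: a large enough `alpha` sets coefficients exactly to zero. The sketch below illustrates this on synthetic data, not on the house data. The data generation and the names `X_syn`, `y_syn`, `weak` and `strong` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: five features, only the first one affects the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = 3.0 * X_syn[:, 0] + 0.1 * rng.normal(size=200)

# A weak penalty keeps the informative coefficient close to 3,
# while a very strong penalty removes every coefficient
weak = Lasso(alpha=0.01).fit(X_syn, y_syn)
strong = Lasso(alpha=1e6).fit(X_syn, y_syn)

print("alpha=0.01:", np.round(weak.coef_, 2))
print("alpha=1e6: ", strong.coef_)
```

Trying a few values of `alpha` between these extremes on the house data would show which attributes enter the model first as the penalty is relaxed.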
