# Setup

## Downloading Relevant Data

In [1]:
%%bash
OUTDIR="./data/real_estate/"
OUTFILE="real_estate_valuation.xlsx"

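# create the (nested) output directory if it does not exist yet
if [ ! -d "$OUTDIR" ]; then
    mkdir -p "$OUTDIR"
fi

# download the dataset only if it is not already present
if [ ! -f "$OUTDIR/$OUTFILE" ]; then
    cd "$OUTDIR"
    curl -o "$OUTFILE" "https://archive.ics.uci.edu/ml/machine-learning-databases/00477/Real%20estate%20valuation%20data%20set.xlsx"
fi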

## Libraries

One of the best features of Scikit-Learn is its pipelines, which integrate preprocessing seamlessly with model fitting. Because each transformation is fitted on the training set only, no information from the test set leaks into training.

To allow the sklearn Pipeline to work with pandas, some custom transformers have been written under ./lib/custom_transforms.
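
The leakage point can be checked on a toy example. The sketch below uses synthetic data and only stock scikit-learn components (it does not use the custom transformers from this notebook); the names `demo_pipeline`, `scaler_mean` and `train_mean` are illustrative assumptions. It shows that the scaler inside the pipeline learns its statistics from the training rows alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 100 rows, 2 features, linear target plus noise
X = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])

# fit() learns the scaling statistics and the regression from the training rows only
demo_pipeline.fit(X_train, y_train)

# The mean stored by the scaler is the training mean, not the mean of all rows
scaler_mean = demo_pipeline.named_steps["scale"].mean_
train_mean = X_train.mean(axis=0)

# predict() re-uses the stored training statistics to scale the test rows
test_predictions = demo_pipeline.predict(X_test)
```

Calling `predict` (or `transform`) on the test set never refits the scaler, which is what keeps test information out of training.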

In [2]:
from pathlib import Path
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

import pandas as pd
import numpy as np
import statsmodels.api as sm


In [3]:
from lib.custom_transforms import DtypeMapper, DropColumn, TransformByDtype, PdDummyEncoder

In [4]:
seed = 83282168
np.random.seed(seed)

# Overview of Linear Regression

Linear Regression is a fairly simple machine learning model which attempts to find the linear function of the input features $x_1, \dots, x_p$ that best predicts the target:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$


## Solving Linear Regression Problem

 1. **Analytical Solution**
 
 One way to solve the linear regression problem would be to analytically derive the coefficient matrix using linear algebra. An overview of how this can be done can be found here.

 2. **Solving an Optimisation Problem**

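 The second way is to frame Linear Regression as an optimisation problem that can be solved with gradient descent methods, which minimise the squared-error cost function, where $\hat{y}_i$ is the prediction for observation $i$ and $n$ is the number of observations:

 $$J(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2$$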
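
Both approaches can be compared on a small synthetic problem. The sketch below is a minimal illustration, not the method used later in this notebook: the data, the learning rate and the iteration count are assumptions chosen so that gradient descent converges, and the cost is taken to be the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 2 + 3*x1 - 1*x2 + noise
n = 200
features = rng.normal(size=(n, 2))
true_beta = np.array([2.0, 3.0, -1.0])
X = np.column_stack([np.ones(n), features])  # leading column of ones for the intercept
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# 1. Analytical solution: solve the normal equation (X^T X) beta = X^T y
beta_analytical = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Gradient descent on J(beta) = mean((X beta - y)^2)
beta_gd = np.zeros(3)
learning_rate = 0.1  # assumed step size, small enough for this well-scaled problem
for _ in range(2000):
    gradient = 2 * X.T @ (X @ beta_gd - y) / n
    beta_gd -= learning_rate * gradient
```

On a well-conditioned problem like this one the two estimates agree closely; the analytical route is exact in one step, while gradient descent scales better when the number of features is very large.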
 

# Preparing the Dataset

## Overview of Dataset

The Real Estate Valuation dataset consists of variables that can be used to predict the unit price of a property in Taiwan.

In [5]:
datadir = Path("./data/real_estate")

In [6]:
df = pd.read_excel(datadir / "real_estate_valuation.xlsx")

In [7]:
df.columns = ['id', 'transaction_date', 'house_age', 'nearest_subway_m', 'n_conv_store', 'latitude', 'longitude', 'unit_price']

In [8]:
df.head(10)

| | id | transaction_date | house_age | nearest_subway_m | n_conv_store | latitude | longitude | unit_price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012.916667 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
| 1 | 2 | 2012.916667 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 | 42.2 |
| 2 | 3 | 2013.583333 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 47.3 |
| 3 | 4 | 2013.5 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 | 54.8 |
| 4 | 5 | 2012.833333 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 | 43.1 |
| 5 | 6 | 2012.666667 | 7.1 | 2175.03 | 3 | 24.96305 | 121.51254 | 32.1 |
| 6 | 7 | 2012.666667 | 34.5 | 623.4731 | 7 | 24.97933 | 121.53642 | 40.3 |
| 7 | 8 | 2013.416667 | 20.3 | 287.6025 | 6 | 24.98042 | 121.54228 | 46.7 |
| 8 | 9 | 2013.5 | 31.7 | 5512.038 | 1 | 24.95095 | 121.48458 | 18.8 |
| 9 | 10 | 2013.416667 | 17.9 | 1783.18 | 3 | 24.96731 | 121.51486 | 22.1 |


## Splitting the Data

One key piece of information that should be well represented in both the training and test sets is the transaction date. Thus, the dataset is split 80-20 (`test_size = 0.2`), stratified on the year and month of the transaction.

In [9]:
from sklearn.model_selection import train_test_split, StratifiedKFold

In [10]:
df['month'] = round((df.transaction_date - 2012) * 12).astype(int)

In [11]:
df['year'] = 2012 + (df.month > 12)
df['month'] = df.month - (df.month > 12) * 12
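
The date arithmetic above can be traced on single values. The helper below is a hypothetical plain-Python mirror of the two cells (the name `decode_transaction_date` is not part of the notebook), applied to a few `transaction_date` values from the preview table.

```python
def decode_transaction_date(transaction_date):
    """Split a fractional-year date into (year, month), mirroring the notebook's arithmetic."""
    # months elapsed since the start of 2012, e.g. 2013.5 -> 18
    months_since_2012 = round((transaction_date - 2012) * 12)
    # anything beyond 12 months belongs to 2013
    rolled_over = months_since_2012 > 12
    year = 2012 + int(rolled_over)
    month = months_since_2012 - 12 * int(rolled_over)
    return year, month


examples = [2012.916667, 2013.583333, 2013.5, 2012.666667]
decoded = [decode_transaction_date(d) for d in examples]
# decoded == [(2012, 11), (2013, 7), (2013, 6), (2012, 8)]
```

Under this arithmetic a value of exactly 2013.0 decodes to (2012, 12), since 12 elapsed months do not exceed the rollover threshold.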

In [12]:
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, ~df.columns.isin(['unit_price'])], 
    df['unit_price'], 
    test_size = 0.2, 
    stratify = df[['year','month']], 
    random_state=71631632)

# Feature Generation

Some new features that may be useful for the model will be generated.

## Incorporating Domain Knowledge

1. **Time-value of Property**

    In finance, investment properties are often priced using the Discounted Cash Flow model. Under this view, an older property has fewer remaining years of cash flows to discount, so there is often an inverse relationship between the unit price of a property and the age of the house.

2. **Amenities, Utility and Diminishing Marginal Returns**

    Factors such as the distance to the nearest subway and the number of nearby convenience stores are amenities that provide additional utility to the home owner. Thus, the utility that they provide would follow the law of diminishing marginal returns.

    Shorter **distances to the nearest subway** would be more valuable to home owners, and this value diminishes quickly the further the subway is from the home (i.e. an inverse relationship).

    For the **convenience stores**, having a few stores nearby provides a large amount of utility, but each additional store adds little value past some threshold.

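3. **Longer Periods instead of Months**

    Longer periods might be more meaningful than months when it comes to pricing changes, as prices tend to vary over long periods of time rather than from one month to the next. The feature generated below groups the months into half-years (`half_year`).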
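
The diminishing-returns argument in item 2 can be made concrete with a few numbers. The sketch below uses made-up distances and store counts (not rows from the dataset) and the same two shapes, an inverse and a logarithm, that the feature code in this section applies.

```python
import math

# Inverse transform: the value of proximity falls off quickly with distance
distances_m = [50, 100, 500, 1000, 5000]
inverse_distance = [1 / d for d in distances_m]

# Moving from 50 m to 100 m costs far more "utility" than moving from 1000 m to 5000 m
drop_near = 1 / 50 - 1 / 100
drop_far = 1 / 1000 - 1 / 5000

# Log transform: each extra store adds less than the previous one
store_counts = [1, 2, 5, 10]
log_utility = [math.log(n) for n in store_counts]

gain_1_to_2 = math.log(2) - math.log(1)
gain_9_to_10 = math.log(10) - math.log(9)
```

In both cases the change in the transformed feature is largest where the raw value is small, which is the behaviour the domain argument asks for.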

In [13]:
def MSE(Y_hat, Y_actual):
    return np.mean(np.square(Y_hat - Y_actual))

def MAE(Y_hat, Y_actual):
    return np.mean(np.abs(Y_hat - Y_actual))
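
A small worked example shows how the two metrics differ. The arrays below are made up for illustration, and the lowercase `mse` and `mae` simply restate the two functions above so that the block runs on its own.

```python
import numpy as np


def mse(y_hat, y_actual):
    return np.mean(np.square(y_hat - y_actual))


def mae(y_hat, y_actual):
    return np.mean(np.abs(y_hat - y_actual))


y_actual = np.array([10.0, 20.0, 30.0, 40.0])
y_even = np.array([12.0, 18.0, 32.0, 38.0])     # every prediction is off by 2
y_outlier = np.array([10.0, 20.0, 30.0, 48.0])  # one prediction is off by 8

even_scores = (mse(y_even, y_actual), mae(y_even, y_actual))           # (4.0, 2.0)
outlier_scores = (mse(y_outlier, y_actual), mae(y_outlier, y_actual))  # (16.0, 2.0)
```

Both sets of predictions have the same MAE, but the single large miss quadruples the MSE, so a large gap between the two metrics hints at a few badly predicted observations.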

In [14]:
def generate_features(X, copy=False):
    if copy:
        X = X.copy()

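    # the +1 avoids division by zero for a house_age of 0
    X['inv_house_age'] = 1 / (X['house_age'] + 1)
    X['inv_nearest_subway'] = 1 / (X['nearest_subway_m'])
    # log(0) = -inf is mapped to 0, so 0 stores and 1 store share the same value
    X['n_conv_utility'] = np.log(X['n_conv_store']).replace({-np.inf: 0})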

    X['half_year'] = np.select(
        condlist = [ X.month <= 6, X.month <= 12],
        choicelist = [1,2],
        default = -1
    )

    return X
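
One detail of the log feature is worth seeing in isolation: replacing `-inf` with 0 gives zero stores the same value as one store, because `log(1)` is also 0. The sketch below reproduces that behaviour with plain NumPy on made-up counts and compares it with `np.log1p`, an alternative that is not used in this notebook and is shown only as an assumption-labelled option.

```python
import numpy as np

store_counts = np.array([0, 1, 2, 5, 10])

# Same idea as the notebook's feature: take the log, then map log(0) = -inf to 0
with np.errstate(divide="ignore"):  # silence the divide-by-zero warning from log(0)
    log_counts = np.log(store_counts)
notebook_style = np.where(np.isneginf(log_counts), 0.0, log_counts)

# Alternative (not used in the notebook): log(1 + n) keeps 0 and 1 stores distinct
log1p_style = np.log1p(store_counts)
```

With `notebook_style`, the first two entries are both 0; with `log1p_style`, the values are strictly increasing in the store count.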

In [15]:
pipeline2 = Pipeline([ 
    ('new_features', FunctionTransformer(generate_features, kw_args={'copy' : True})),
    ('set_dtypes', DtypeMapper({'category': ['half_year']})),
    ('drop_col', DropColumn(
        ['id', 'transaction_date', 'house_age', 'nearest_subway_m','n_conv_store', 'month'])),
    ('standardise', TransformByDtype(
        transformer = StandardScaler(), 
        include_dtypes = ['number'],
        combine_strategy = 'reassign')),
    ('dummy_encoding', PdDummyEncoder(dummy_na=False, drop_first=True))
])

In [16]:
train_cp = pipeline2.fit_transform(x_train)

In [17]:
model = LinearRegression()
model.fit(train_cp, y_train)

LinearRegression()

In [18]:
# transforming the test set and obtaining predictions
test_cp = pipeline2.transform(x_test)
prediction = model.predict(test_cp)

print('MSE: {:.3f}'.format(MSE(prediction, y_test)))
print('MAE: {:.3f}'.format(MAE(prediction, y_test)))

MSE: 162.183
MAE: 7.402
