# Intro
In the notebook avm_singapore we built automated valuation models (AVMs). In this notebook we investigate whether we can get better results by first clustering the properties and then building a separate AVM for each of the largest clusters. We also build an AVM on all the data to fall back on when a property does not fit into any of the clusters.
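The fallback idea can be sketched in a few lines. This is only an illustration under assumptions: `pick_model`, `cluster_models` and `overall_model` are made-up names, and plain strings stand in for the fitted models.

```python
def pick_model(cluster, cluster_models, overall_model):
    """Return the cluster-specific model if one exists, else the overall model."""
    return cluster_models.get(cluster, overall_model)


# Toy stand-ins for fitted models (hypothetical; the real ones would be fitted AVMs).
overall_model = 'overall'
cluster_models = {0: 'cluster_0', 1: 'cluster_1'}  # assume cluster 2 got no model

# A property in cluster 2, or one without any cluster (None), uses the fallback.
chosen = [pick_model(c, cluster_models, overall_model) for c in (0, 1, 2, None)]
print(chosen)
```

Keeping the overall model as the default of a dictionary lookup means a missing cluster model never has to be handled as a special case.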

# Analysis
Let's first explore the data.

In [2]:
import pandas as pd
import statsmodels.api as sm
import pickle
import copy
import random

random.seed(1)
df = pd.read_csv('datasets/ura_data_withproject.csv', index_col=0)
print(df.head(), '\n')
df.info()  # prints its summary itself and returns None, so it is not wrapped in print()
print()

   log_price_psf            Project Name  Area (Sqft)  Type_Condominium  \
0       7.519692     STIRLING RESIDENCES          657                 0   
1       7.290293          WHISTLER GRAND          614                 0   
2       7.670895          MARGARET VILLE          463                 0   
3       7.525101     STIRLING RESIDENCES         1346                 0   
4       7.515889  AVENUE SOUTH RESIDENCE         1109                 0   

   Relative_tenure  SaleType_Resale  SaleType_Sub Sale  Floor_number  \
0         0.914111                0                  0          23.0   
1         0.915692                0                  0          18.0   
2         0.914111                0                  0          38.0   
3         0.914111                0                  0          23.0   
4         0.915692                0                  0           8.0   

   Market Segment_OCR  Market Segment_RCR  ...  Period_2017Q4  Period_2018Q1  \
0                   0               

In [3]:
y = df['log_price_psf']
res = []
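# Quarter labels 2017Q1 .. 2020Q4; the last three are dropped so the list ends at 2020Q1
per = ['%sQ%s' % (yr, q) for yr in range(2017, 2021) for q in range(1, 5)]
per = per[:-3]
print(per)
min_obs = 1000  # minimum number of observations a cluster needs to get its own model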

['2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1', '2018Q2', '2018Q3', '2018Q4', '2019Q1', '2019Q2', '2019Q3', '2019Q4', '2020Q1']
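The modelling loop further down walks over these labels in overlapping five-quarter windows. As a rough sketch of which windows that produces (the helper name `rolling_windows` is an assumption made for this illustration, not something the notebook defines):

```python
def rolling_windows(periods, size=5):
    """Return every run of `size` consecutive periods, shifted one period at a time."""
    return [periods[i:i + size] for i in range(len(periods) - size + 1)]


# Same quarter labels as above: 2017Q1 up to and including 2020Q1.
per = ['%sQ%s' % (yr, q) for yr in range(2017, 2021) for q in range(1, 5)][:-3]

windows = rolling_windows(per)
for win in windows:
    print(win[0], 'to', win[-1])
```

With 13 quarters and a window of 5 there are 13 - 5 + 1 = 9 windows, so each quarter in the middle of the range is used by several models.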


First we'll build a model for the entire dataset. Afterwards we will cluster the data and build a model for each cluster that has at least min_obs observations. Both steps are repeated for every rolling window of five quarters. The models will be saved in the temp_files folder (make sure this folder exists before running the code).
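Because the models are written to temp_files, that folder has to exist before the next cell runs. A minimal sketch of creating it up front, assuming the notebook is run from the directory that should contain temp_files (the helper name `ensure_dir` is made up for this illustration):

```python
import os


def ensure_dir(path):
    """Create `path` (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


out_dir = ensure_dir('temp_files')
print('Models will be written to', out_dir)
```

Using `exist_ok=True` makes the call safe to repeat, so the cell can be re-run without checking for the folder first.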

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

for start in range(0,len(per)-5+1):
    win = per[start:start+5]
    print('Working on', win[0], 'to', win[-1])
    # Keep only the rows that fall in one of the window's five quarters
    in_win = df[['Period_%s' % p for p in win]].eq(1).any(axis=1)
    fil = df[in_win].copy()
    
    # Fit overall model
    print('...fitting overall')
    y = fil['log_price_psf']
    cols = [c for c in df.columns if not c.startswith('Period_') and c not in ['log_price_psf', 'Project Name']]
    X = fil[cols]
    X = sm.add_constant(X)
    
    mod = sm.OLS(y, X)
    res = mod.fit()
    res.save('temp_files/rolling_%s.pkl' % win[-1])
    
    # Cluster the data
    print('...clustering')
    # Features to summarise per project; 'Project Name' is the grouping key and not numeric, so it is left out
    cols = [c for c in df.columns if not c.startswith(('Period_', 'District', 'Type', 'SaleType', 'Market Segment')) and c != 'Project Name']
    mean = fil.groupby('Project Name')[cols].mean()
    mean.columns = ['%s_mean' % c for c in mean.columns]
    std = fil.groupby('Project Name')[cols].std()
    std.columns = ['%s_std' % c for c in std.columns]
    std.fillna(0, inplace=True)
    clus = pd.concat([mean, std], axis=1)
    
    ss = StandardScaler().fit_transform(clus)
    ss = pd.DataFrame(ss, index=clus.index, columns=clus.columns)
    
    pca = PCA()
    pca.fit_transform(ss)
    red = PCA(n_components=5).fit_transform(ss)
    
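    # random.seed() does not seed scikit-learn, so fix the KMeans seed for repeatable clusters
    km = KMeans(n_clusters=3, random_state=1)
    lab = km.fit_predict(red)
    ss['Cluster'] = lab
    with open('temp_files/K3_cluster_%s.pkl' % win[-1], 'wb') as f:
        pickle.dump(ss['Cluster'], f)
    # Give every transaction the cluster label of its project
    fil['Cluster'] = fil['Project Name'].map(ss['Cluster'])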
    
    # Fit one model for each cluster
    clus_gr = fil.groupby('Cluster')
    for c, d in clus_gr:
        c = int(c)
        if len(d) < min_obs:
            print('...skipping cluster', c)
        else:
            print('...fitting cluster', c)
            y = d['log_price_psf']
            # Period dummies of the current window; its first quarter is the baseline
            cols = [col for col in df.columns if not col.startswith('Period_') and col not in ['log_price_psf', 'Project Name']] + ['Period_%s' % p for p in win[1:]]
            X = d[cols]
            X = sm.add_constant(X)
            
            mod = sm.OLS(y,X)
            res = mod.fit()
            res.save('temp_files/rolling_%s_K3_%s.pkl' % (win[-1], c))
print('\nFinished: Models saved in temp_files folder')

Working on 2017Q1 to 2018Q1
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...skipping cluster 2
Working on 2017Q2 to 2018Q2
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...skipping cluster 2
Working on 2017Q3 to 2018Q3
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...skipping cluster 2
Working on 2017Q4 to 2018Q4
...fitting overall
...clustering
...skipping cluster 0
...fitting cluster 1
...fitting cluster 2
Working on 2018Q1 to 2019Q1
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...fitting cluster 2
Working on 2018Q2 to 2019Q2
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...skipping cluster 2
Working on 2018Q3 to 2019Q3
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...skipping cluster 2
Working on 2018Q4 to 2019Q4
...fitting overall
...clustering
...fitting cluster 0
...fitting cluster 1
...skipping cluster 2
W

# Conclusion
[Work in progress]

TODO: Calculate the RMSE of the overall model and of the cluster-specific models, and compare the results.
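As a starting point for that comparison, the RMSE itself can be sketched as below. The function name `rmse` and all numbers are made up for illustration; they are not results from the models above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError('actual and predicted must be non-empty and equally long')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up log prices and predictions, only to show how the comparison would look.
actual = [7.50, 7.30, 7.70]
pred_overall = [7.40, 7.45, 7.60]
pred_cluster = [7.48, 7.33, 7.66]

print('overall model RMSE:', rmse(actual, pred_overall))
print('cluster model RMSE:', rmse(actual, pred_cluster))
```

For a fair comparison both RMSE values would have to be computed on the same properties, ideally on sales that were not used for fitting.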