AZURE MACHINE LEARNING

Predicting the Value of Hong Kong Properties

A Step-By-Step Tutorial Using Azure Machine Learning

*Overview*
Learn how a company like Zillow might predict the value of your property in Hong Kong. In this tutorial you will build a model that predicts the sale price of a property from historical features of the home and of the sales transaction.

*About the Data*
The private housing dataset for the Hong Kong Island areas of Central, Sheung Wan and Sai Wan is a comma-separated values (CSV) text file with 29 features and 1139 observations. Each observation represents the sale of a home, and each feature is an attribute describing the home or the circumstances of the sale. A data dictionary has been provided with the full list of feature descriptions.

*Objective of this tutorial*
As a partner in a property investment company, your objective is to profit from buying properties and eventually selling them. To do this, you need a solid price prediction model based on historical property transactions. Comparing the model's predicted prices against prevailing asking prices helps you identify properties whose future sale is likely to bring in a good profit.


In [3]:
#Import python libraries

import numpy as np
import pandas as pd
import datetime
from sklearn.ensemble import GradientBoostingRegressor  #import gradient boosting from python sklearn package

Next, we import the dataset named "HKProp_Dataset.csv" using the pandas package.

In [4]:
!curl -L https://www.dropbox.com/s/4loif6ojs1s9fb6/HKProp_Dataset.csv?dl=0 -o HKProp_Dataset.csv
dataframe = pd.read_csv('HKProp_Dataset.csv')
dataframe.head()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  161k  100  161k    0     0   150k      0  0:00:01  0:00:01 --:--:-- 5031k


Unnamed: 0,Reg_Date,Reg_Year,Prop_Name_ENG,ADDRESS_ENG,Prop_Type,Estate_Size,Tower,Floor,Flat,Bed_Room,...,Kindergarten,Primary_Schools,Secondary_Schools,Parks,Library,Bus_Route,Mall,Wet Market,Latitude,Longitude
0,26/10/2016,2016,18 SHELLEY STREET,18 SHELLEY STREET,Single,1,,3,18,,...,1,2,1,6,0,52,0,0,22.281442,114.152991
1,16/11/2017,2017,Lilian Court,6-8 SHELLEY STREET,Single,1,,19,A,2.0,...,1,2,0,4,0,29,0,0,22.281865,114.153329
2,11/10/2016,2016,Lilian Court,6-8 SHELLEY STREET,Single,1,,6,B,2.0,...,1,2,0,4,0,29,0,0,22.281865,114.153329
3,18/10/2017,2017,Felicity Building,9-13 SHELLEY STREET,Single,1,,20,E,,...,1,1,1,9,0,31,0,0,22.281688,114.152795
4,18/10/2017,2017,9-13 SHELLEY STREET,9-13 SHELLEY STREET,Single,1,,2,A,,...,1,0,1,7,0,45,0,0,22.281688,114.152795


# SalePrice_10k is the target variable (the value the model will predict)

In [5]:
y = np.array(dataframe['SalePrice_10k'])

Step 1 Exploratory Data Analysis

Step 2 Data Cleaning

Clean the dataset: many values contain symbols such as '$' and ',' that prevent pandas from treating them as numbers.

Most machine learning algorithms cannot handle missing values, and those that can treat them inconsistently. To address this, we must make sure our dataset contains no missing values, whether they appear as "null", "NaN" (Not a Number) or "NA".

Replacement of missing values is the most versatile and preferred method because it allows us to keep our data. It also minimizes collateral damage to other columns caused by one cell's bad value. Numerical values can easily be replaced with a statistic such as the mean, median, or mode, while categorical values are commonly replaced with the mode or with a separate category for unknowns.
For simplicity, all categorical missing values were cleaned with the mode and all numeric features were cleaned using the median.
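
The replacement strategy described above (median for numeric columns, mode for categorical columns) can be sketched as follows. This is only an illustration: the helper name `impute_simple` and the toy values are assumptions, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Floor": [3.0, np.nan, 19.0, 6.0],
    "Prop_Type": ["Single", None, "Estate", "Single"],
})

def impute_simple(df):
    """Fill numeric columns with the median and all other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; take the most frequent entry.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = impute_simple(toy)
print(imputed)
```

A per-column loop like this is also a convenient place to swap in a custom rule for an individual feature.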

To further improve a model's performance, custom cleaning functions should be tried and implemented on each individual feature rather than a blanket transformation of all columns.


In [6]:
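# Replace cells that are exactly '--' with '0' and cells that are exactly '$' with ''.
# This matches whole cell values only; '$' and ',' inside numbers are stripped in later cells.
dataframe = dataframe.replace('--', '0').replace('$', '')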
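
As a side note on how `DataFrame.replace` behaves, the sketch below contrasts whole-cell matching (the default) with substring matching via `regex=True`. The toy values are made up; only the column name comes from the dataset.

```python
import pandas as pd

# Made-up values that mimic the raw price strings.
toy = pd.DataFrame({"SaleableAreaPrice": ["$12,345", "--", "$"]})

# Default behaviour: a cell is replaced only if its entire value equals the key.
whole_cell = toy.replace("--", "0").replace("$", "")

# With regex=True the pattern is matched inside each string; '$' must be escaped.
substring = toy.replace("--", "0").replace(r"\$", "", regex=True)

print(whole_cell["SaleableAreaPrice"].tolist())
print(substring["SaleableAreaPrice"].tolist())
```

In the first result the '$' in front of the number survives, which is why the per-column cleaning later in this step still has to strip it.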

In [7]:
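# Dates are written day first (e.g. 26/10/2016), so tell pandas not to assume month first.
Reg_Date = pd.to_datetime(np.array(dataframe['Reg_Date']), dayfirst=True)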

The Reg_Date variable is a date, so we convert it to an integer: the number of days between each registration date and today (negative for dates in the past).

In [8]:
Days_to_reg_date = [int(t.days) for t in (Reg_Date - datetime.datetime.today())]
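
The dates in the preview above are written day first (for example 26/10/2016), so the parsing step is sensitive to the date format. The sketch below shows one way to parse such strings explicitly and turn them into a day count. The fixed reference date is an assumption made here so that the feature is reproducible; the notebook itself measures against today's date.

```python
import pandas as pd

raw = ["26/10/2016", "11/10/2016"]  # day/month/year strings, as in the preview

# An explicit format avoids reading 11/10/2016 as November 10th.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

# Assumed fixed reference date (hypothetical choice, not from the notebook).
reference = pd.Timestamp("2018-01-01")
days_since_reg = [int(delta.days) for delta in (reference - parsed)]

print(days_since_reg)
```

With a fixed reference the same input always yields the same feature values, whereas a "days to today" feature shifts every time the notebook is rerun.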

The Bed_Room variable contains the value 'Studio', which should be represented as a new categorical variable.

In [9]:
Bedroom = dataframe['Bed_Room'].replace('Studio', '0').fillna(0)

So we fill in this value with 0 and create a new dummy variable, Is_studio, to indicate the value 'Studio'.

In [19]:
# Flag the rows whose original Bed_Room value is 'Studio'.
Is_studio = [1 if e == 'Studio' else 0 for e in np.array(dataframe['Bed_Room'])]
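
The intent described above, a numeric bedroom count plus a separate 0/1 flag for studios, can also be written with vectorized pandas operations. This is a sketch on made-up values, not the notebook's own code.

```python
import pandas as pd

bed_room = pd.Series(["Studio", "2", None, "3"])  # made-up values

# 1 where the original value is 'Studio', 0 everywhere else (including missing).
is_studio = (bed_room == "Studio").astype(int)

# Studios and missing values both become 0 bedrooms; everything is numeric afterwards.
bedrooms = pd.to_numeric(bed_room.replace("Studio", "0"), errors="coerce").fillna(0)

print(is_studio.tolist(), bedrooms.tolist())
```

Keeping the flag separate lets the model distinguish a studio from a record whose bedroom count was simply missing.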

Data cleaning for SaleableArea, SaleableAreaPrice, GrossArea and GrossAreaPrice: strip the '$' and ',' symbols and impute missing data with 0.

In [20]:
# Strip thousands separators and currency symbols, and map missing values ('nan') to 0.
SaleableArea = [int(str(t).replace(',', '').replace('nan', '0')) for t in dataframe['SaleableArea']]
SaleableAreaPrice = [int(str(t).replace(',', '').replace('$', '').replace('nan', '0')) for t in dataframe['SaleableAreaPrice']]
GrossArea = [int(str(t).replace(',', '').replace('nan', '0')) for t in dataframe['Gross Area']]
GrossAreaPrice = [int(str(t).replace(',', '').replace('$', '').replace('nan', '0')) for t in dataframe['Gross Area_Price']]
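
The four list comprehensions above share the same cleaning steps, so they could be factored into one helper. The function below is a hypothetical sketch (`to_int` is not part of the notebook) that also tolerates blanks, '--' placeholders and decimal strings.

```python
import pandas as pd

def to_int(value, default=0):
    """Convert values such as '$12,345' to int; return `default` for missing entries."""
    if pd.isna(value):
        return default
    text = str(value).replace("$", "").replace(",", "").strip()
    if text in ("", "--"):
        return default
    # Go through float so that strings like '1234.0' are accepted too.
    return int(float(text))

print([to_int(v) for v in ["$12,345", float("nan"), "--", "1,234.0", 567]])
```

With such a helper each column reduces to a call like `[to_int(t) for t in dataframe['SaleableArea']]`.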

Concatenate all the cleaned variables (columns) together as our model input X.

In [21]:
X = np.concatenate([
    np.array(Days_to_reg_date).reshape(-1, 1),
    np.array(Bedroom).reshape(-1, 1),
    np.array(Is_studio).reshape(-1, 1),
    np.array(SaleableArea).reshape(-1, 1),
    np.array(GrossAreaPrice).reshape(-1, 1),
    np.array(GrossArea).reshape(-1, 1),
    np.array(SaleableAreaPrice).reshape(-1, 1),
    np.array(pd.get_dummies(dataframe[['Flat', 'Prop_Type', 'Tower', 'Roof']])),
    np.array(dataframe[['Floor', 'Build_Ages', 'Rehab_Year', 'Kindergarten', 'Primary_Schools',
                        'Secondary_Schools', 'Parks', 'Library', 'Bus_Route', 'Mall',
                        'Wet Market', 'Latitude', 'Longitude']].fillna(0)),
], axis=1)
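
To make the shape of X easier to picture, here is a small sketch of the same pattern, one-hot encoding a categorical column with `pd.get_dummies` and concatenating it with a numeric column. The toy frame and the `feature_names` list are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Floor": [3, 19, 6],
    "Prop_Type": ["Single", "Estate", "Single"],
})

# One indicator column per category level, named '<column>_<level>'.
dummies = pd.get_dummies(toy[["Prop_Type"]])

# Stack the numeric column and the indicator columns side by side.
X_toy = np.concatenate(
    [toy[["Floor"]].to_numpy(), dummies.to_numpy()], axis=1
).astype(float)

# Keeping the names alongside the array helps when inspecting the model later.
feature_names = ["Floor"] + list(dummies.columns)
print(feature_names, X_toy.shape)
```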

Model training and cross-validation
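
The evaluation scheme used in this step, repeated random 80/20 splits with the scores averaged, corresponds to what scikit-learn calls `ShuffleSplit`. The sketch below shows that equivalent formulation on synthetic data; the data, the five splits and the 50 trees are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data standing in for X and y.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Five independent random splits, each holding out 20% of the rows for testing.
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(learning_rate=0.1, n_estimators=50, max_depth=3)

# For a regressor the default score is R^2 on each held-out split.
scores = cross_val_score(model, X_demo, y_demo, cv=splitter)
mean_score = scores.mean()
print(mean_score)
```

Setting `random_state` makes the splits repeatable, which the manual `np.random.choice` approach does not do unless a seed is set first.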

In [23]:
list_scores = []
for k in range(100):
    # In each iteration we randomly select 80% of the data as training data
    # and use the remaining 20% as testing data.
    # We repeat this 100 times and average the scores.
    print(k)
    # Randomly select training data points from the dataset, without replacement.
    random_idx = np.random.choice(range(len(y)), size=int(0.8*len(y)), replace=False)
    X_train = X[random_idx, :]
    y_train = y[random_idx]
    # The rest of the data is used as testing data.
    z = list(set(range(len(y))) - set(random_idx))
    X_test = X[z, :]
    y_test = y[z]
    # Define a gradient boosting model using sklearn and set its hyperparameters.
    # You can fine-tune these parameters to achieve better results.
    # learning_rate=0.1, n_estimators=3000, max_depth=3 worked well in our experiments.
    # verbose=1 prints the training progress as trees are added,
    # so you can see the training loss go down.
    gbt = GradientBoostingRegressor(verbose=1, learning_rate=0.1, n_estimators=3000, max_depth=3)
    gbt.fit(X_train, y_train)  # train the model

    list_scores.append(gbt.score(X_test, y_test))  # R^2 on the held-out 20%
    print(np.average(list_scores))  # print the running average score over the iterations so far
    # The average score that we got after 100 iterations is 0.9414.

0
      Iter       Train Loss   Remaining Time 
         1     1061625.3029           39.23s
         2      874968.7181           32.44s
         3      723854.4108           29.62s
         4      600672.0991           27.77s
         5      499170.0150           26.87s
         6      416663.4402           26.51s
         7      349494.6841           26.07s
         8      292635.1346           25.70s
         9      246006.7812           25.41s
        10      207445.8539           25.28s
        20       50646.6882           23.98s
        30       21857.1367           23.53s
        40       13283.2514           23.35s
        50        8326.3625           23.04s
        60        6010.8520           22.71s
        70        4891.5453           22.61s
        80        4247.6103           22.53s
        90        3759.7003           22.50s
       100        3399.0986           22.41s
       200        1598.9541           21.46s
       300         995.9342           20.52s
       

       700         234.8983           16.81s
       800         162.5506           17.08s
       900         111.0419           17.00s
      1000          78.0056           16.02s
      2000           6.1438            7.82s
      3000           1.2823            0.00s
0.948972677519
6
      Iter       Train Loss   Remaining Time 
         1     1149523.3076           29.91s
         2      945367.1071           28.51s
         3      779594.7357           26.47s
         4      644696.1214           25.37s
         5      533297.2066           24.78s
         6      441988.2035           24.45s
         7      367980.8191           24.13s
         8      306983.6621           24.00s
         9      257397.3828           23.89s
        10      217001.4051           23.60s
        20       52817.0277           22.42s
        30       22768.5267           22.37s
        40       12927.5610           22.07s
        50        8347.0519           21.99s
        60        6580.2939          

       200         638.4316           21.08s
       300         358.5604           20.09s
       400         237.4191           19.76s
       500         160.3543           19.22s
       600         108.2616           18.45s
       700          77.0464           17.70s
       800          51.7163           17.02s
       900          39.0039           16.29s
      1000          30.6982           15.42s
      2000           3.9623            7.79s
      3000           0.8032            0.00s
0.956748546528
12
      Iter       Train Loss   Remaining Time 
         1     1435678.5426           31.02s
         2     1178277.9614           26.04s
         3      969379.3971           24.57s
         4      799721.1636           24.70s
         5      660624.9188           25.05s
         6      546464.2075           25.00s
         7      452350.4680           24.53s
         8      375939.6091           24.19s
         9      313741.3138           23.93s
        10      263015.5825         

        60        4080.5828           22.36s
        70        3089.6375           22.43s
        80        2465.9071           22.24s
        90        2092.3164           22.16s
       100        1792.8012           21.97s
       200         752.0071           22.01s
       300         431.3080           21.17s
       400         276.8374           20.64s
       500         202.1249           19.42s
       600         154.0198           18.53s
       700         119.7196           17.88s
       800          89.0257           16.98s
       900          66.8338           16.09s
      1000          47.1189           15.31s
      2000           6.3352            7.56s
      3000           1.5112            0.00s
0.951873830656
18
      Iter       Train Loss   Remaining Time 
         1      859812.3339           32.96s
         2      708672.9637           28.44s
         3      585936.2012           26.38s
         4      486087.1867           25.42s
         5      404356.3792         

        30       22085.6080           23.35s
        40       13262.1036           22.92s
        50        9143.7485           22.58s
        60        6946.7354           22.36s
        70        5821.0509           22.49s
        80        5054.1217           22.34s
        90        4635.6278           22.10s
       100        4046.5073           21.94s
       200        2529.3124           20.76s
       300        1553.3386           20.43s
       400         917.8028           19.71s
       500         674.8177           19.14s
       600         578.9786           18.35s
       700         468.0983           17.70s
       800         370.7569           16.83s
       900         249.0084           16.11s
      1000         222.6326           16.76s
      2000          16.7574            8.19s
      3000           2.4389            0.00s
0.951958493323
24
      Iter       Train Loss   Remaining Time 
         1     1204771.7013           29.74s
         2      989518.6835         

        30       19406.0142           23.47s
        40       11021.4268           23.26s
        50        6466.4746           23.10s
        60        4091.6141           23.18s
        70        3089.9975           22.87s
        80        2635.9384           22.57s
        90        2181.8779           22.34s
       100        1866.6397           22.19s
       200         922.0175           21.14s
       300         623.7484           20.41s
       400         383.8815           19.63s
       500         220.0434           18.84s
       600         152.8921           18.43s
       700          98.6531           17.61s
       800          67.8996           16.78s
       900          44.5204           15.97s
      1000          32.7732           15.21s
      2000           3.7318            7.71s
      3000           0.6071            0.00s
0.949633417398
30
      Iter       Train Loss   Remaining Time 
         1     1283145.7212           30.76s
         2     1054013.8410         

        30       20064.2135           24.76s
        40       12020.4953           24.57s
        50        8137.4678           24.30s
        60        6467.1519           24.21s
        70        5505.6274           23.84s
        80        5050.3715           23.66s
        90        4653.7715           23.48s
       100        4248.2064           23.26s
       200        2788.3350           21.66s
       300        1555.4876           20.91s
       400         783.2698           20.35s
       500         593.8137           19.57s
       600         424.9137           19.10s
       700         319.9647           18.31s
       800         220.8373           17.42s
       900         191.3964           16.49s
      1000         152.3384           15.71s
      2000           9.2906            8.26s
      3000           1.6143            0.00s
0.946027978115
36
      Iter       Train Loss   Remaining Time 
         1     1275509.4491           33.19s
         2     1046795.4281         

        30       21612.7506           22.58s
        40       13294.8157           22.54s
        50        8799.8107           22.48s
        60        6709.2970           22.61s
        70        5692.2026           22.62s
        80        4818.2066           24.41s
        90        4452.9224           24.37s
       100        4040.2497           24.12s
       200        2004.4217           23.27s
       300        1087.6131           21.80s
       400         708.9587           20.54s
       500         568.2885           19.72s
       600         442.5355           18.90s
       700         356.5736           18.00s
       800         307.7953           17.15s
       900         240.0868           16.38s
      1000         162.5847           15.64s
      2000          15.7013            7.82s
      3000           2.0230            0.00s
0.947837419381
42
      Iter       Train Loss   Remaining Time 
         1     1446742.1846           33.80s
         2     1185426.2719         

        30       17839.1424           22.93s
        40        9870.5136           22.64s
        50        5857.7005           22.53s
        60        4212.3897           22.31s
        70        3372.3626           22.29s
        80        2695.0051           22.36s
        90        2303.2190           22.38s
       100        2019.6955           22.26s
       200        1079.7087           21.37s
       300         606.9130           20.32s
       400         405.1507           19.35s
       500         177.1141           18.54s
       600         108.3361           17.77s
       700          79.4476           16.96s
       800          60.3146           16.24s
       900          47.8343           15.62s
      1000          35.5406           15.12s
      2000           3.7193            7.66s
      3000           0.6695            0.00s
0.945470636145
48
      Iter       Train Loss   Remaining Time 
         1     1270191.0730           51.50s
         2     1041838.0009         

        30       18110.3694           24.86s
        40        9707.9498           24.52s
        50        5623.8625           24.57s
        60        3701.9125           24.56s
        70        2798.3872           24.34s
        80        2387.3399           24.04s
        90        1970.5164           24.02s
       100        1748.9307           23.72s
       200         767.9908           22.44s
       300         517.2632           21.53s
       400         346.1819           20.77s
       500         220.1043           19.93s
       600         157.8998           19.17s
       700          96.7600           18.36s
       800          60.5483           17.58s
       900          46.3165           16.89s
      1000          35.4295           16.11s
      2000           3.8018            8.45s
      3000           0.6285            0.00s
0.946412519008
54
      Iter       Train Loss   Remaining Time 
         1      472141.0452           30.49s
         2      395153.1130         

        20       36847.8172           30.06s
        30       13915.0841           27.84s
        40        7452.7343           29.12s
        50        4411.6573           27.96s
        60        3248.5090           27.24s
        70        2458.3552           26.50s
        80        1986.0387           26.01s
        90        1666.9571           25.62s
       100        1422.2436           25.21s
       200         557.4354           23.44s
       300         308.7443           22.10s
       400         190.7282           20.78s
       500         128.3252           20.14s
       600          89.1295           19.41s
       700          63.2993           18.76s
       800          50.0519           17.89s
       900          38.8974           17.09s
      1000          30.2137           16.28s
      2000           3.8311            8.25s
      3000           0.7834            0.00s
0.947190124043
60
      Iter       Train Loss   Remaining Time 
         1      730915.6611         

        30       23004.3436           23.64s
        40       14045.4884           23.22s
        50        9056.3873           22.99s
        60        6428.2671           23.06s
        70        5144.4523           23.14s
        80        4476.0909           22.99s
        90        4051.4982           22.70s
       100        3701.6633           22.57s
       200        2436.0448           30.56s
       300        1518.2972           26.41s
       400        1084.9769           24.78s
       500         953.0027           23.24s
       600         542.6348           21.78s
       700         374.2523           20.72s
       800         214.1985           19.98s
       900         171.6316           19.14s
      1000          90.1041           18.23s
      2000           9.7614            9.12s
      3000           1.7823            0.00s
0.943135590982
66
      Iter       Train Loss   Remaining Time 
         1     1246970.4106           34.75s
         2     1025326.8885         

        30       18306.3757           24.28s
        40        9968.6531           25.55s
        50        6220.3367           25.08s
        60        4520.5376           24.65s
        70        3833.6942           25.50s
        80        3264.8974           25.50s
        90        2978.7521           25.08s
       100        2714.1708           24.82s
       200        1746.4110           23.30s
       300        1447.9101           21.85s
       400         811.0917           20.64s
       500         452.7094           20.22s
       600         329.2475           19.65s
       700         204.5013           18.62s
       800         141.7604           17.66s
       900         122.2381           16.77s
      1000         106.8125           15.95s
      2000          10.0404            7.91s
      3000           3.0218            0.00s
0.94278027834
72
      Iter       Train Loss   Remaining Time 
         1     1336766.7185           33.97s
         2     1097534.3329          

        30       22127.4701           23.17s
        40       12997.1376           23.20s
        50        9086.6619           22.99s
        60        6457.4861           22.97s
        70        5210.1699           22.56s
        80        4563.1067           22.32s
        90        4183.9890           22.25s
       100        3852.8208           22.32s
       200        2687.0936           20.87s
       300        1523.4669           20.25s
       400         928.6622           19.71s
       500         675.6883           18.79s
       600         378.3484           18.00s
       700         237.4116           17.11s
       800         173.0209           16.30s
       900         139.8549           15.73s
      1000         109.6240           15.98s
      2000          13.1694            8.06s
      3000           2.9375            0.00s
0.940992867577
78
      Iter       Train Loss   Remaining Time 
         1     1244897.7070           38.93s
         2     1024144.2101         

        30       20765.3269           25.41s
        40       12534.7214           24.85s
        50        8366.3642           24.26s
        60        6663.7903           23.77s
        70        5846.0064           23.35s
        80        5193.0639           22.99s
        90        4716.8840           23.05s
       100        4337.4465           23.16s
       200        2655.2707           21.03s
       300        1243.4276           20.84s
       400         737.6711           20.95s
       500         385.7205           20.04s
       600         272.4696           19.39s
       700         198.9713           18.85s
       800         156.3033           18.22s
       900         104.2841           17.33s
      1000          71.4639           16.39s
      2000           6.0645            8.06s
      3000           1.0103            0.00s
0.938398940588
84
      Iter       Train Loss   Remaining Time 
         1     1150914.8378           36.38s
         2      944356.6341         

        30       18742.3688           21.37s
        40       10822.5094           21.14s
        50        6601.3717           20.83s
        60        4575.4223           20.78s
        70        3626.4217           20.59s
        80        3030.7229           20.56s
        90        2599.0152           20.79s
       100        2276.8172           20.72s
       200         867.4141           20.30s
       300         531.5714           20.31s
       400         278.3443           19.95s
       500         181.6257           18.99s
       600         115.9404           18.15s
       700          85.8035           17.55s
       800          62.5025           16.71s
       900          46.9925           15.87s
      1000          35.9403           15.03s
      2000           4.2949            7.53s
      3000           0.6847            0.00s
0.937247475323
90
      Iter       Train Loss   Remaining Time 
         1     1280495.2415           32.78s
         2     1050256.5358         

        30       19798.2484           24.25s
        40       11421.5150           23.92s
        50        7094.5293           23.61s
        60        5373.2574           23.51s
        70        4195.4655           23.25s
        80        3420.6171           22.90s
        90        2744.6129           22.66s
       100        2520.1464           22.46s
       200        1330.3543           21.34s
       300         713.2312           20.57s
       400         483.4955           19.77s
       500         304.6124           18.97s
       600         179.5964           18.20s
       700         133.1957           17.50s
       800          97.4007           16.77s
       900          77.6563           17.98s
      1000          60.6719           17.35s
      2000           5.6853            8.31s
      3000           1.0745            0.00s
0.934691473286
96
      Iter       Train Loss   Remaining Time 
         1     1190577.3260           31.81s
         2      978631.1597         