The objective is to solve the exercises in https://docs.google.com/presentation/d/1T1nXunoeTV05A6itZhvf4NbhbQUoteOQoAxaa2aPlsw/edit#slide=id.g27e9614b9c_103_0.

I am using the House Price Prediction dataset from https://www.kaggle.com/datasets/kirbysasuke/house-price-prediction-simplified-for-regression. Download the dataset from this source and store it in the data folder.

In [1]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

<h1>Data transformation and preparation</h1>

In [2]:
df_input = pd.read_csv('./data/Real_Estate.csv')

In [3]:
df_input

,Transaction date,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area
0,2012-09-02 16:42:30.519336,13.3,4082.01500,8,25.007059,121.561694,6.488673
1,2012-09-04 22:52:29.919544,35.5,274.01440,2,25.012148,121.546990,24.970725
2,2012-09-05 01:10:52.349449,1.1,1978.67100,10,25.003850,121.528336,26.694267
3,2012-09-05 13:26:01.189083,22.2,1055.06700,5,24.962887,121.482178,38.091638
4,2012-09-06 08:29:47.910523,8.5,967.40000,6,25.011037,121.479946,21.654710
...,...,...,...,...,...,...,...
409,2013-07-25 15:30:36.565239,18.3,170.12890,6,24.981186,121.486798,29.096310
410,2013-07-26 17:16:34.019780,11.9,323.69120,2,24.950070,121.483918,33.871347
411,2013-07-28 21:47:23.339050,0.0,451.64190,8,24.963901,121.543387,25.255105
412,2013-07-29 13:33:29.405317,35.9,292.99780,5,24.997863,121.558286,25.285620


In [4]:
df_input.describe()

,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area
count,414.0,414.0,414.0,414.0,414.0,414.0
mean,18.405072,1064.468233,4.2657,24.973605,121.520268,29.102149
std,11.75767,1196.749385,2.880498,0.024178,0.026989,15.750935
min,0.0,23.38284,0.0,24.932075,121.473888,0.0
25%,9.9,289.3248,2.0,24.952422,121.496866,18.422493
50%,16.45,506.1144,5.0,24.974353,121.520912,30.39407
75%,30.375,1454.279,6.75,24.994947,121.544676,40.615184
max,42.7,6306.153,10.0,25.014578,121.565321,65.571716


In [5]:
df_input.dtypes

Transaction date                        object
House age                              float64
Distance to the nearest MRT station    float64
Number of convenience stores             int64
Latitude                               float64
Longitude                              float64
House price of unit area               float64
dtype: object

Converting the transaction date to a Unix timestamp (whole seconds since 1970-01-01)

In [6]:
df_input['Transaction date'] = pd.to_datetime(df_input['Transaction date']).astype('int64') // 10**9
df_input

,Transaction date,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area
0,1346604150,13.3,4082.01500,8,25.007059,121.561694,6.488673
1,1346799149,35.5,274.01440,2,25.012148,121.546990,24.970725
2,1346807452,1.1,1978.67100,10,25.003850,121.528336,26.694267
3,1346851561,22.2,1055.06700,5,24.962887,121.482178,38.091638
4,1346920187,8.5,967.40000,6,25.011037,121.479946,21.654710
...,...,...,...,...,...,...,...
409,1374766236,18.3,170.12890,6,24.981186,121.486798,29.096310
410,1374858994,11.9,323.69120,2,24.950070,121.483918,33.871347
411,1375048043,0.0,451.64190,8,24.963901,121.543387,25.255105
412,1375104809,35.9,292.99780,5,24.997863,121.558286,25.285620
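
The integer cast above relies on the parsed datetimes being stored at nanosecond resolution, which is what the division by 10**9 assumes. As a hedged alternative, the sketch below computes the same epoch seconds by subtracting the epoch and floor-dividing by a one-second Timedelta, so it does not depend on the stored resolution. The two input strings are copied from the first rows of the table; the variable names are mine and not part of the notebook.

```python
import pandas as pd

# Two timestamps copied from the first two rows of the raw table.
dates = pd.Series(['2012-09-02 16:42:30.519336',
                   '2012-09-04 22:52:29.919544'])

parsed = pd.to_datetime(dates)

# Subtract the Unix epoch to get timedeltas, then floor-divide by one second.
# The result is an integer Series of whole seconds since 1970-01-01.
epoch = pd.Timestamp('1970-01-01')
seconds = (parsed - epoch) // pd.Timedelta(seconds=1)

print(seconds.tolist())
```

The printed values should agree with the first two entries of the Transaction date column in the converted table.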


Standardise the input features to zero mean and unit standard deviation

In [7]:
df_x = df_input.drop('House price of unit area', axis = 1)
df_x = (df_x - df_x.mean(axis = 0)) / df_x.std(axis = 0)
df_x.describe()

,Transaction date,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude
count,414.0,414.0,414.0,414.0,414.0,414.0
mean,-1.139614e-14,-1.802101e-16,5.1488600000000005e-17,-9.439577000000001e-17,7.311167e-14,-6.866863e-14
std,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.680271,-1.565367,-0.8699277,-1.48089,-1.717652,-1.718473
25%,-0.8633026,-0.7233638,-0.6477074,-0.7865654,-0.8761327,-0.8670862
50%,-0.01133826,-0.1662806,-0.4665587,0.254921,0.03094246,0.02385681
75%,0.8047446,1.018053,0.3257246,0.8624547,0.882696,0.9043639
max,1.74024,2.066305,4.379935,1.990732,1.694608,1.669315
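
One detail worth checking is which standard deviation is used: pandas `std` defaults to the sample estimate (ddof=1), while NumPy `std` defaults to ddof=0. The sketch below reproduces the standardisation on a small toy frame (the columns `a` and `b` are hypothetical stand-ins for the real features) and confirms that the NumPy version matches once ddof=1 is passed explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns, standing in for df_x.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, 20.0, 40.0, 80.0]})

# Same formula as in the notebook cell; pandas std uses ddof=1 by default.
z = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Equivalent NumPy computation; ddof=1 has to be requested explicitly.
values = toy.values
z_np = (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)

print(np.allclose(z.values, z_np))
print(z.mean(axis=0).round(12).tolist())
print(z.std(axis=0).tolist())
```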


Train/test split

In [8]:
X = df_x.values
y = df_input['House price of unit area'].values

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.33, 
                                                    random_state=111)
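
As a quick sanity check on the split, the sketch below runs the same `train_test_split` call on placeholder arrays with the same number of rows (414) and features (6) as the real data. The arrays `X_demo` and `y_demo` are made up for illustration. It prints the resulting sizes and shows that repeating the call with the same `random_state` gives the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the real X and y (414 rows, 6 features).
X_demo = np.zeros((414, 6))
y_demo = np.arange(414)  # row ids, so the split can be compared between runs

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33,
                                          random_state=111)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same test rows.
_, _, _, y_te_again = train_test_split(X_demo, y_demo,
                                       test_size=0.33,
                                       random_state=111)
print(np.array_equal(y_te, y_te_again))
```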

Creating data wrappers for PyTorch

In [9]:
train_set = TensorDataset(torch.from_numpy(X_train).unsqueeze(1), torch.from_numpy(y_train))
test_set = TensorDataset(torch.from_numpy(X_test).unsqueeze(1), torch.from_numpy(y_test))

In [10]:
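# Print three random training samples.
# The names x_sample / y_sample avoid overwriting the target array y defined earlier.
for _ in range(3):
    sample_idx = torch.randint(len(train_set), size=(1,)).item()
    x_sample, y_sample = train_set[sample_idx]
    print('x = ' + str(x_sample))
    print('y = ' + str(y_sample))
    print()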

x = tensor([[ 1.0500, -0.0770,  0.4693,  0.2549, -1.5840,  1.5753]],
       dtype=torch.float64)
y = tensor(21.2229, dtype=torch.float64)

x = tensor([[-0.1889, -0.3832, -0.6137, -1.4809, -0.5069,  0.8499]],
       dtype=torch.float64)
y = tensor(13.8489, dtype=torch.float64)

x = tensor([[-1.1049,  0.5864, -0.4065,  0.2549,  0.5092,  0.4867]],
       dtype=torch.float64)
y = tensor(20.1690, dtype=torch.float64)
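
`DataLoader` is imported at the top but not used yet, so here is a minimal sketch of how a dataset built this way could be batched. It uses synthetic tensors rather than the real data, and the batch size of 4 is an arbitrary assumption. It also makes two properties of the wrappers visible: `unsqueeze(1)` gives each sample the shape (1, 6), and tensors created from NumPy float64 arrays stay float64. Since `torch.nn.Linear` parameters default to float32, a cast such as `.float()` would probably be needed before feeding these batches to a model.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for the training data: 10 rows, 6 features, float64 like X_train.
X_demo = torch.randn(10, 6, dtype=torch.float64)
y_demo = torch.randn(10, dtype=torch.float64)

# Same construction as train_set: the features get an extra dimension of size 1.
demo_set = TensorDataset(X_demo.unsqueeze(1), y_demo)

# batch_size=4 is an arbitrary choice for this illustration.
demo_loader = DataLoader(demo_set, batch_size=4, shuffle=True)

batches = list(demo_loader)
for xb, yb in batches:
    print(tuple(xb.shape), tuple(yb.shape), xb.dtype)
```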



<h1>Defining a linear regression NN in PyTorch</h1>