<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/pytorch/blob/main/custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Estimating house prices
So far we have dealt with classification problems, i.e. identifying the category to which a given object belongs. This notebook is the first to deal with a regression problem, where the output is one or more continuous values. In particular, we will build a model that estimates the price of a house from its zipcode, number of rooms, size, and images.

## What you will learn
 1. Learning to use the pandas package
 1. Building a custom dataset for PyTorch
 1. Handling categorical data using one-hot encoding
 1. Building a model that takes multimodal data as input

In [1]:
import torch
import torch.nn as nn
import torchvision as vision
from torch.utils.data import Dataset,DataLoader
import pandas as pd
import numpy as np

The data is in a GitHub repository. Our first task is to "clean" it by removing all houses belonging to zipcodes that occur fewer than 20 times in the dataset. The threshold of 20 is arbitrary, but it seems a reasonable choice.
The non-image features of the houses are in a whitespace-delimited text file without headers, "HousesInfo.txt". Each house has 4 images: bathroom, bedroom, frontal, and kitchen. To simplify matters we use only the frontal image. The prefix of each image file is the index of the house as it occurs in "HousesInfo.txt", **starting from 1**.
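
The off-by-one between row index and image prefix is easy to get wrong. Here is a minimal sketch of the mapping, assuming the "_frontal.jpg" suffix used later in this notebook. The helper name `frontal_image_path` is hypothetical and only serves the illustration:

```python
import os

def frontal_image_path(img_dir, row_index):
    """Return the path of the frontal image for a 0-based row index.

    Image prefixes start at 1, while pandas row indices start at 0.
    """
    return os.path.join(img_dir, str(row_index + 1) + "_frontal.jpg")

# Row 0 of the info file corresponds to the image "1_frontal.jpg"
print(frontal_image_path("Houses-dataset/Houses Dataset", 0))
```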

In [2]:
!git clone https://github.com/emanhamed/Houses-dataset



Read the text file into a pandas data frame. The file has no header row and its columns are separated by whitespace, so we supply the column names ourselves.

In [3]:
# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df=pd.read_csv("Houses-dataset/Houses Dataset/HousesInfo.txt",header=None,sep=r"\s+",
               names=["bedrooms","bathrooms","size","zipcode","price"])

Remove all entries with zipcodes occurring fewer than 20 times.

In [4]:
def cleanData(df):
    # compute the number of entries per zipcode
    counts=df['zipcode'].value_counts()
    # zipcodes occurring fewer than 20 times
    rare_zipcodes=counts[counts<20].index
    # keep the remaining rows; return a copy so the original frame is untouched
    return df[~df['zipcode'].isin(rare_zipcodes)].copy()

In [5]:
dataset=cleanData(df)

In [6]:
max_price=dataset['price'].max()
max_size=dataset['size'].max()
price_col=dataset['price']/max_price
size_col=dataset['size']/max_size
one_hot_zip=pd.get_dummies(dataset['zipcode'])
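
To see what `pd.get_dummies` produces, here is a small sketch on a made-up column (the zipcodes below are toy values chosen for illustration, not the cleaned dataset). Each distinct value becomes its own column and every row has exactly one non-zero entry. Passing `dtype=int` is a choice made here so the result prints as 0/1 regardless of the pandas version:

```python
import pandas as pd

# toy column with three distinct zipcodes
toy = pd.DataFrame({"zipcode": [91901, 92276, 91901, 94531]})

# one column per distinct zipcode, 1 where the row has that zipcode
one_hot = pd.get_dummies(toy["zipcode"], dtype=int)
print(one_hot)

# every row has exactly one active column
print(one_hot.sum(axis=1).tolist())
```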

In [7]:
one_hot_zip

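| | 91901 | 92276 | 92677 | 92880 | 93446 | 93510 | 94501 | 94531 |
|---|---|---|---|---|---|---|---|---|
| 30 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 32 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 39 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 80 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 81 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 530 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 531 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 532 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 533 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |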


In [8]:
dataset=dataset.drop(['price','zipcode','size'],axis=1)
dataset=pd.concat([dataset,size_col,one_hot_zip,price_col],axis=1)

In [9]:
dataset

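| | bedrooms | bathrooms | size | 91901 | 92276 | 92677 | 92880 | 93446 | 93510 | 94501 | 94531 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 5 | 3.0 | 0.264262 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.134688 |
| 32 | 3 | 2.0 | 0.188968 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.062308 |
| 39 | 3 | 3.0 | 0.225042 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.077672 |
| 80 | 4 | 2.5 | 0.258389 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.102253 |
| 81 | 2 | 2.0 | 0.193477 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.090440 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 530 | 5 | 2.0 | 0.216653 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.068266 |
| 531 | 4 | 3.5 | 1.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.078525 |
| 532 | 3 | 2.0 | 0.211200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.069478 |
| 533 | 4 | 3.0 | 0.242450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.071526 |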


Later we will split the dataset into training and testing sets, so it is important to shuffle the rows first. Otherwise the split could separate the zipcodes, which appear in contiguous blocks in the table above.

In [10]:
#randomize the dataframe
ran_dataset=dataset.sample(len(dataset))

In [11]:
ran_dataset

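| | bedrooms | bathrooms | size | 91901 | 92276 | 92677 | 92880 | 93446 | 93510 | 94501 | 94531 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 398 | 2 | 2.0 | 0.100671 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011864 |
| 102 | 2 | 2.0 | 0.105390 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.041977 |
| 136 | 3 | 3.5 | 0.293624 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.238989 |
| 125 | 6 | 6.5 | 0.734060 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.657221 |
| 348 | 2 | 2.0 | 0.140940 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011523 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 109 | 5 | 4.5 | 0.511116 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.255889 |
| 280 | 4 | 5.0 | 0.346791 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.088750 |
| 199 | 3 | 2.5 | 0.173238 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.110789 |
| 128 | 4 | 3.0 | 0.310927 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.230283 |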


Split the data into training and testing sets and save each to a .csv file. The "index" column keeps the original row number of each house, which is how we pick the correct image for each entry.

In [12]:
# the dataset has 384 entries. Choose 310 for training and 74 for testing
train_dataset=ran_dataset[0:310]
test_dataset=ran_dataset[310:len(dataset)]
train_dataset.to_csv("train.csv",index_label="index")
test_dataset.to_csv("test.csv",index_label="index")
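
To illustrate why `index_label="index"` matters, the sketch below writes a toy data frame with shuffled row labels to an in-memory buffer and reads it back. The frame and its values are made up. After the round trip the original row labels are available as the first column, which is the column a dataset class can use to locate the image:

```python
import io
import pandas as pd

# toy frame whose row labels are out of order, as after DataFrame.sample
toy = pd.DataFrame({"size": [0.2, 0.5, 0.1]}, index=[30, 5, 12])

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index_label="index")
buffer.seek(0)

restored = pd.read_csv(buffer)
# the original row labels survive as the first column, named "index"
print(restored.columns.tolist())
print(restored["index"].tolist())
```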

## Custom PyTorch dataset

To build a custom dataset we need to design a class that implements two methods:
1. \_\_len\_\_(self) returns the total number of items in the dataset
1. \_\_getitem\_\_(self,index) returns the item at "index"

Implementing these two methods with integer indices starting at 0 also makes the object [**iterable**](https://docs.python.org/3/glossary.html#term-iterable), which is why we can call iter() on the dataset below.
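
As a small illustration of this protocol outside PyTorch, the toy class below (a hypothetical example, not part of the notebook's pipeline) defines only the two methods and can still be looped over:

```python
class Squares:
    """Toy container exposing only __len__ and __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        # raising IndexError is what tells a for-loop where to stop
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index

squares = Squares(4)
print(len(squares))    # 4
print(list(squares))   # [0, 1, 4, 9]
```

Python falls back to calling \_\_getitem\_\_ with 0, 1, 2, ... until an IndexError is raised, so no \_\_iter\_\_ method is required.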

**Note**: It is good practice to scale the inputs to small numbers, for example by normalizing the data or by dividing the values by their mean or max. If this is not done, in many situations the model might not converge.
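
A minimal sketch of the max-scaling applied to the price and size columns earlier, together with the inverse step needed to turn a model output back into a price. The numbers are made up and the helper names `scale_by_max` and `unscale` are hypothetical:

```python
def scale_by_max(values):
    """Divide every value by the largest one; returns (scaled, largest)."""
    largest = max(values)
    return [v / largest for v in values], largest

def unscale(scaled_value, largest):
    """Map a scaled value (e.g. a model prediction) back to original units."""
    return scaled_value * largest

prices = [250000.0, 500000.0, 1000000.0]   # toy prices
scaled, max_price = scale_by_max(prices)
print(scaled)                    # [0.25, 0.5, 1.0]
print(unscale(0.5, max_price))   # 500000.0
```

The maximum has to be kept around: a prediction from a model trained on scaled prices is only meaningful once it is multiplied by the same value.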

In [13]:
from torchvision.io import read_image
import os
class CustomDataset(Dataset):
  def __init__(self,csvFile,imgDir):
    self.imgDir=imgDir
    self.data=pd.read_csv(csvFile)
    # every image is resized to 48x48 so the model sees a fixed input size
    self.resize=vision.transforms.Resize((48,48))
  def __len__(self):
    return len(self.data)

  def __getitem__(self,idx):
    # column 0 holds the original row index saved by to_csv
    img_idx=self.data.iloc[idx,0]
    # the images were labelled starting at 1. Pandas starts at 0
    path=os.path.join(self.imgDir,str(img_idx+1)+"_frontal.jpg")
    img=self.resize(read_image(path))
    # the middle columns are the tabular features, the last column is the price
    features=self.data.iloc[idx,1:-1].to_numpy(dtype=np.float32)
    price=np.float32(self.data.iloc[idx,-1])
    return (features,img.float()),price

In [14]:
train_dataset=CustomDataset("train.csv","Houses-dataset/Houses Dataset/")
test_dataset=CustomDataset("test.csv","Houses-dataset/Houses Dataset/")

In [15]:
itr=iter(train_dataset)
result=next(itr)

In [16]:
img=result[0][1].float()

In [17]:
img.dtype

torch.float32

In [18]:
train_loader=DataLoader(train_dataset,batch_size=16,shuffle=True)
test_loader=DataLoader(test_dataset,batch_size=1,shuffle=False)

In [19]:
class Net(nn.Module):
  def __init__(self):
    super(Net,self).__init__()
    self.relu=nn.ReLU() 
    self.fc1=nn.Linear(in_features=11,out_features=32)
    self.fc2=nn.Linear(in_features=32,out_features=16)
    self.fc3=nn.Linear(in_features=16,out_features=1)
    # for images
    self.flatten=nn.Flatten()
    self.img_fc1=nn.Linear(in_features=6912,out_features=64)
    self.img_fc2=nn.Linear(in_features=64,out_features=32)
    self.img_fc3=nn.Linear(in_features=32,out_features=16)
    self.img_fc4=nn.Linear(in_features=16,out_features=1)
    self.combine=nn.Linear(in_features=2,out_features=1)
  def forward(self,z):
    x,y=z
    x=self.fc1(x)
    x=self.relu(x)
    x=self.fc2(x)
    x=self.relu(x)
    x=self.fc3(x)
    # image
    y=self.flatten(y)
    y=self.img_fc1(y)
    y=self.relu(y)
    y=self.img_fc2(y)
    y=self.relu(y)
    y=self.img_fc3(y)
    y=self.relu(y)
    y=self.img_fc4(y)
    z=torch.concat((x,y),dim=1)
    z=self.combine(z)
    return z
#    return x
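
The two input sizes used in `Net` follow from the data prepared above. A quick check of the arithmetic, assuming the frontal images are 3-channel colour images (the 8 zipcode columns are the ones that survived the cleaning step):

```python
# tabular branch: bedrooms, bathrooms, size + one column per kept zipcode
numeric_features = 3
zipcode_columns = 8
tabular_in = numeric_features + zipcode_columns

# image branch: channels x height x width after Resize((48, 48)) and Flatten
channels, height, width = 3, 48, 48
image_in = channels * height * width

print(tabular_in, image_in)  # 11 6912
```

If the zipcode threshold or the image size is changed, `in_features` of `fc1` and `img_fc1` must be updated accordingly.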

In [20]:
from torch.optim import SGD,Adam
from torch.nn import MSELoss,L1Loss
from tqdm import tqdm
model=Net()
optimizer=Adam(model.parameters())
#optimizer=SGD(model.parameters(),lr=0.001)
# one could use a mean squared error loss
# but since our testing will be based on mean absolute error
# we will use the corresponding loss
#loss_fn=MSELoss()
loss_fn=L1Loss()
epochs=50

for epoch in range(epochs):
  for inputs,price in tqdm(train_loader):
    output=model(inputs)
    loss=loss_fn(output.squeeze(),price)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  # this is the loss of the last batch only, so the printed values are noisy
  print(loss)
  



tensor(0.5399, grad_fn=<L1LossBackward0>)
tensor(1.1881, grad_fn=<L1LossBackward0>)
tensor(0.1316, grad_fn=<L1LossBackward0>)
tensor(1.2843, grad_fn=<L1LossBackward0>)
tensor(0.6136, grad_fn=<L1LossBackward0>)
tensor(0.7733, grad_fn=<L1LossBackward0>)
tensor(0.1350, grad_fn=<L1LossBackward0>)
tensor(0.1270, grad_fn=<L1LossBackward0>)
tensor(0.1687, grad_fn=<L1LossBackward0>)
tensor(0.5852, grad_fn=<L1LossBackward0>)
tensor(0.0420, grad_fn=<L1LossBackward0>)
tensor(0.0401, grad_fn=<L1LossBackward0>)
tensor(0.0773, grad_fn=<L1LossBackward0>)
tensor(0.1369, grad_fn=<L1LossBackward0>)
tensor(0.0710, grad_fn=<L1LossBackward0>)
tensor(0.2919, grad_fn=<L1LossBackward0>)
tensor(0.0701, grad_fn=<L1LossBackward0>)
tensor(0.3753, grad_fn=<L1LossBackward0>)
tensor(0.0741, grad_fn=<L1LossBackward0>)
tensor(0.0567, grad_fn=<L1LossBackward0>)
tensor(0.0471, grad_fn=<L1LossBackward0>)
tensor(0.0857, grad_fn=<L1LossBackward0>)
tensor(0.0440, grad_fn=<L1LossBackward0>)
tensor(0.0536, grad_fn=<L1LossBackward0>)
tensor(0.0361, grad_fn=<L1LossBackward0>)
tensor(0.0460, grad_fn=<L1LossBackward0>)
tensor(0.0411, grad_fn=<L1LossBackward0>)
tensor(0.0792, grad_fn=<L1LossBackward0>)
tensor(0.1258, grad_fn=<L1LossBackward0>)
tensor(0.0788, grad_fn=<L1LossBackward0>)
tensor(0.0290, grad_fn=<L1LossBackward0>)
tensor(0.0117, grad_fn=<L1LossBackward0>)
tensor(0.0208, grad_fn=<L1LossBackward0>)
tensor(0.0314, grad_fn=<L1LossBackward0>)
tensor(0.1452, grad_fn=<L1LossBackward0>)
tensor(0.1005, grad_fn=<L1LossBackward0>)
tensor(0.0233, grad_fn=<L1LossBackward0>)
tensor(0.0200, grad_fn=<L1LossBackward0>)
tensor(0.0293, grad_fn=<L1LossBackward0>)
tensor(0.0396, grad_fn=<L1LossBackward0>)
tensor(0.1449, grad_fn=<L1LossBackward0>)
tensor(0.0614, grad_fn=<L1LossBackward0>)
tensor(0.0145, grad_fn=<L1LossBackward0>)
tensor(0.0375, grad_fn=<L1LossBackward0>)
tensor(0.0354, grad_fn=<L1LossBackward0>)
tensor(0.0153, grad_fn=<L1LossBackward0>)
tensor(0.0219, grad_fn=<L1LossBackward0>)
tensor(0.0392, grad_fn=<L1LossBackward0>)
tensor(0.0239, grad_fn=<L1LossBackward0>)
tensor(0.0070, grad_fn=<L1LossBackward0>)





## Testing 

In [22]:
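total=0.0
count=0

# gradients are not needed for evaluation
with torch.no_grad():
  for inputs,price in test_loader:
    count+=1
    output=model(inputs)
    # relative absolute error of this prediction (batch_size is 1)
    rel_err=torch.abs(output.squeeze()-price.squeeze())/price.squeeze()
    total+=rel_err
# mean absolute percentage error over the test set
print(100*total.item()/count)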

36.66290334753088
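
The number printed above is a mean absolute percentage error. A plain Python sketch of the same metric on made-up values (the function name is hypothetical) shows why it can be computed directly on the scaled prices: the metric is a ratio, so multiplying predictions and targets by the same constant leaves it unchanged.

```python
import math

def mean_absolute_percentage_error(predictions, targets):
    """Average of |prediction - target| / target, expressed in percent."""
    ratios = [math.fabs(p - t) / t for p, t in zip(predictions, targets)]
    return 100 * sum(ratios) / len(ratios)

# toy values: relative errors of 10% and 50%
mape_scaled = mean_absolute_percentage_error([1.1, 0.5], [1.0, 1.0])

# same predictions and targets multiplied by an arbitrary price scale
scale = 500000.0
mape_dollars = mean_absolute_percentage_error(
    [1.1 * scale, 0.5 * scale], [1.0 * scale, 1.0 * scale]
)

print(mape_scaled, mape_dollars)
```

Both values come out at about 30, up to floating point rounding.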
