<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/pytorch/blob/main/custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Estimating house prices
So far we have dealt with classification problems, i.e. identifying the category to which a given object belongs. This notebook is the first to deal with a regression problem, where the output is one or more continuous values. In particular, we will build a model that estimates the price of a house from its zipcode, number of rooms, size, images, etc.

## What you will learn
 1. Using the pandas package
 1. Building a custom dataset for PyTorch
 1. Handling categorical input data using one-hot encoding
 1. Building a model that takes multimodal data (i.e. numbers and images) as input

In [2]:
import torch
import torch.nn as nn
import torchvision as vision
from torch.utils.data import Dataset,DataLoader
import pandas as pd
import numpy as np

The data is in a GitHub repository. Our first task is to "clean" it by removing all houses belonging to zipcodes that occur fewer than 20 times in the dataset. The threshold of 20 is arbitrary but seems to be a reasonable choice.
The non-image features of the houses are in a whitespace-delimited text file without headers, "HousesInfo.txt". Each house has 4 images: bathroom, bedroom, frontal and kitchen. To simplify matters we use only the frontal image. The prefix of each image file name is the index of the house as it occurs in "HousesInfo.txt", **starting from 1**.

**NOTE about the data** The resulting dataset is very small: 384 entries. Also, because of the way we divide it into training/testing datasets, it is possible that the testing dataset contains entries not seen in the training dataset. Finally, the model we are using is very simple and we include only one of the four images for each house.
All of the above means that different runs will give wildly different accuracies.

In [3]:
!git clone https://github.com/emanhamed/Houses-dataset

Cloning into 'Houses-dataset'...
remote: Enumerating objects: 2166, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 2166 (delta 0), reused 0 (delta 0), pack-reused 2165
Receiving objects: 100% (2166/2166), 176.26 MiB | 22.43 MiB/s, done.
Resolving deltas: 100% (20/20), done.


Read the text file into a pandas data frame. The parameters are self-explanatory.

In [4]:
# sep=r"\s+" splits on runs of whitespace (delim_whitespace=True is deprecated in recent pandas)
df=pd.read_csv("Houses-dataset/Houses Dataset/HousesInfo.txt",header=None,sep=r"\s+",
               names=["bedrooms","bathrooms","size","zipcode","price"])

Remove all entries with zipcodes occurring fewer than 20 times.

In [5]:
def cleanData(df):
    # compute the number of entries per zipcode
    counts=df['zipcode'].value_counts()
    # discard all zipcodes occurring fewer than 20 times
    # note: rows are dropped in place, so the caller's data frame is modified
    for zipcode,count in counts.items():
        if count<20:
            idx=df[df['zipcode']==zipcode].index
            df.drop(idx,inplace=True)
    return df
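
As an alternative to dropping rows in place, the same filtering can be sketched without modifying the original data frame. The function name `filter_rare_zipcodes`, the `min_count` parameter and the toy data below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def filter_rare_zipcodes(df, min_count=20):
    """Return a copy of df keeping only zipcodes that occur at least min_count times."""
    # For every row, look up how many times its zipcode occurs in the frame.
    counts_per_row = df["zipcode"].map(df["zipcode"].value_counts())
    return df[counts_per_row >= min_count].copy()

# Toy data (made up): zipcode 1 occurs 3 times, 2 occurs twice, 3 occurs once.
toy = pd.DataFrame({"zipcode": [1, 1, 1, 2, 2, 3],
                    "price": [10, 20, 30, 40, 50, 60]})
kept = filter_rare_zipcodes(toy, min_count=2)
print(sorted(kept["zipcode"].unique().tolist()))  # [1, 2]
print(len(toy))                                   # 6, the input is untouched
```

Returning a copy avoids surprising side effects when the same data frame is reused in another cell.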

In [6]:
dataset=cleanData(df)

# Dealing with categorical data and large values
Large numbers often cause problems with convergence. In this case the sizes and prices of the houses are large numbers, so we divide all sizes and prices by the largest respective values.
The zipcode is categorical data (for example, we can't say zip1>zip2). A commonly used, and easy, solution is one-hot encoding.

In [7]:
max_price=dataset['price'].max()
max_size=dataset['size'].max()
price_col=dataset['price']/max_price
size_col=dataset['size']/max_size
one_hot_zip=pd.get_dummies(dataset['zipcode'])
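
To make the two transformations concrete, here is a small sketch on a made-up three-row data frame. The toy values are assumptions for illustration only. Note that from pandas 2.0 onwards `get_dummies` returns boolean columns by default; passing `dtype=int` gives the 0/1 columns displayed in this notebook.

```python
import pandas as pd

# Toy data (made up): two distinct zipcodes and three house sizes.
toy = pd.DataFrame({"zipcode": [91901, 94531, 91901],
                    "size": [1000, 4000, 2000]})

# One column per distinct zipcode, with a single 1 in each row.
one_hot = pd.get_dummies(toy["zipcode"], dtype=int)
# Divide by the largest value so that every size lies in (0, 1].
scaled_size = toy["size"] / toy["size"].max()

print(one_hot.values.tolist())  # [[1, 0], [0, 1], [1, 0]]
print(scaled_size.tolist())     # [0.25, 1.0, 0.5]
```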

In [8]:
one_hot_zip

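| index | 91901 | 92276 | 92677 | 92880 | 93446 | 93510 | 94501 | 94531 |
|---|---|---|---|---|---|---|---|---|
| 30 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 32 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 39 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 80 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 81 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 530 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 531 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 532 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 533 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |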


Remove the "old" price, zipcode and size columns, then add the modified ones.

In [9]:
dataset=dataset.drop(['price','zipcode','size'],axis=1)
dataset=pd.concat([dataset,size_col,one_hot_zip,price_col],axis=1)

In [10]:
dataset

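| index | bedrooms | bathrooms | size | 91901 | 92276 | 92677 | 92880 | 93446 | 93510 | 94501 | 94531 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 5 | 3.0 | 0.264262 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.134688 |
| 32 | 3 | 2.0 | 0.188968 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.062308 |
| 39 | 3 | 3.0 | 0.225042 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.077672 |
| 80 | 4 | 2.5 | 0.258389 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.102253 |
| 81 | 2 | 2.0 | 0.193477 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.090440 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 530 | 5 | 2.0 | 0.216653 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.068266 |
| 531 | 4 | 3.5 | 1.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.078525 |
| 532 | 3 | 2.0 | 0.211200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.069478 |
| 533 | 4 | 3.0 | 0.242450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.071526 |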


Later we will split the dataset into training and testing sets, so it is important to shuffle the rows before we do that.

In [11]:
#randomize the dataframe
ran_dataset=dataset.sample(len(dataset))
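
Because the shuffle is random, every run produces a different split, which contributes to the run-to-run variation noted at the start of this notebook. A minimal sketch, on a made-up data frame, of making the shuffle repeatable by fixing `random_state` (the seed value 0 is an arbitrary assumption):

```python
import pandas as pd

# Toy data (made up) standing in for the cleaned dataset.
toy = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0]})

# frac=1 returns all rows in random order; the seed makes the order repeatable.
shuffled_a = toy.sample(frac=1, random_state=0)
shuffled_b = toy.sample(frac=1, random_state=0)

print(shuffled_a.index.tolist())      # a permutation of 0..4
print(shuffled_a.equals(shuffled_b))  # True
```

The original index labels travel with the rows, which is what later lets each row be matched to its image.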

In [12]:
ran_dataset

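| index | bedrooms | bathrooms | size | 91901 | 92276 | 92677 | 92880 | 93446 | 93510 | 94501 | 94531 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234 | 8 | 6.0 | 0.403628 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.099010 |
| 299 | 5 | 4.0 | 0.373846 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.091670 |
| 457 | 3 | 2.0 | 0.146812 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.023882 |
| 99 | 4 | 3.5 | 0.359270 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.115910 |
| 144 | 3 | 2.5 | 0.224518 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.128030 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | 2 | 2.0 | 0.130872 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.017668 |
| 471 | 2 | 2.0 | 0.130872 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.029874 |
| 522 | 4 | 3.0 | 0.185088 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.059730 |
| 368 | 2 | 2.0 | 0.130872 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.013639 |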


Split the data into training and testing sets and save them to .csv files. The "index" column is there to make sure we pick the correct image for each entry.

In [13]:
# the dataset has 384 entries. Choose 310 for training and 74 for testing
train_dataset=ran_dataset[0:310]
test_dataset=ran_dataset[310:len(dataset)]
train_dataset.to_csv("train.csv",index_label="index")
test_dataset.to_csv("test.csv",index_label="index")
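
To see what `index_label="index"` does, the following sketch writes a made-up two-row data frame to an in-memory buffer (instead of a file) and reads it back. The original row labels come back as the first column, named "index".

```python
import io
import pandas as pd

# Toy data (made up); the row labels 30 and 80 play the role of house indices.
toy = pd.DataFrame({"bedrooms": [3, 4], "price": [0.2, 0.5]}, index=[30, 80])

buffer = io.StringIO()
toy.to_csv(buffer, index_label="index")  # the row labels become a column named "index"
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())   # ['index', 'bedrooms', 'price']
print(restored["index"].tolist())  # [30, 80]
```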

## Custom PyTorch dataset

To build a custom dataset we need to design a class that implements two methods:
1. \_\_len\_\_(self) should return the total number of items in the dataset
1. \_\_getitem\_\_(self,index) returns the item at "index"

These are also the methods of Python's sequence protocol; a \_\_getitem\_\_ that accepts integer indices starting at 0 is enough to make an object [**iterable**](https://docs.python.org/3/glossary.html#term-iterable).
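
The two-method protocol can be illustrated without PyTorch at all. The class `Squares` below is a made-up example, not part of the notebook; it only shows that defining these two methods is enough for `len`, indexing and iteration to work.

```python
class Squares:
    """Made-up example: item i is the square of i."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        # Raising IndexError past the end is what tells iteration when to stop.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index

squares = Squares(4)
print(len(squares))   # 4
print(squares[2])     # 4
print(list(squares))  # [0, 1, 4, 9]
```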


In [14]:
from torchvision.io import read_image
import os
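class CustomDataset(Dataset):
  def __init__(self,csvFile,imgDir):
    self.imgDir=imgDir
    self.data=pd.read_csv(csvFile)
    # build the transform once instead of on every call to __getitem__
    self.resize=vision.transforms.Resize((48,48))
  def __len__(self):
    return len(self.data)

  def __getitem__(self,idx):
    # column 0 holds the index of the house in the original file
    img_idx=self.data.iloc[idx,0]
    # the images were labelled starting at 1. Pandas starts at 0
    path=os.path.join(self.imgDir,str(img_idx+1)+"_frontal.jpg")
    img=self.resize(read_image(path))
    # all columns between the index and the price are numerical features
    features=self.data.iloc[idx,1:-1].to_numpy(dtype=np.float32)
    # the last column is the (scaled) price
    price=np.float32(self.data.iloc[idx,-1])
    return (features,img.float()),price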
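
The off-by-one between the pandas index and the image file names is easy to get wrong, so it can help to isolate it in a small helper. This is only a sketch; the function name `frontal_image_path` is hypothetical.

```python
import os

def frontal_image_path(img_dir, row_index):
    """Map a 0-based pandas row index to the 1-based frontal image file name."""
    return os.path.join(img_dir, f"{row_index + 1}_frontal.jpg")

path = frontal_image_path("Houses-dataset/Houses Dataset/", 30)
print(os.path.basename(path))  # 31_frontal.jpg
```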

In [15]:
train_dataset=CustomDataset("train.csv","Houses-dataset/Houses Dataset/")
test_dataset=CustomDataset("test.csv","Houses-dataset/Houses Dataset/")

## Checking the dataset
It is good practice to "inspect" the datasets before using them. Since the datasets are iterable, we can retrieve single entries and check their values, types, etc.

In [16]:
itr=iter(train_dataset)
result=next(itr)

In [17]:
print(len(result))
print(type(result[0]))
print(type(result[0][0]))
print(result[0][1].size())
print(result[0][1].dtype)

2
<class 'tuple'>
<class 'numpy.ndarray'>
torch.Size([3, 48, 48])
torch.float32


In [18]:
train_loader=DataLoader(train_dataset,batch_size=16,shuffle=True)
test_loader=DataLoader(test_dataset,batch_size=1,shuffle=False)
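
The same kind of inspection is useful for a batch. The sketch below uses a hypothetical `ToyDataset` that returns constant values with the same item structure as `CustomDataset`, so it runs without the image files. It shows that the default collation stacks the NumPy features, the image tensors and the scalar prices into batched tensors.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in: constant values, same item structure as CustomDataset."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        features = np.zeros(11, dtype=np.float32)  # 11 numerical features
        image = torch.zeros(3, 48, 48)             # one RGB image of 48x48 pixels
        price = np.float32(0.5)
        return (features, image), price

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
(features, images), prices = next(iter(loader))
print(features.shape)  # torch.Size([4, 11])
print(images.shape)    # torch.Size([4, 3, 48, 48])
print(prices.shape)    # torch.Size([4])
```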

In [33]:
class Net(nn.Module):
  def __init__(self):
    super(Net,self).__init__()
    self.relu=nn.ReLU()
    # numerical features: bedrooms, bathrooms, size + 8 one-hot zipcode columns = 11
    self.fc1=nn.Linear(in_features=11,out_features=32)
    self.fc2=nn.Linear(in_features=32,out_features=16)
    self.fc3=nn.Linear(in_features=16,out_features=1)
    # for images: a flattened 3x48x48 image has 3*48*48=6912 values
    self.flatten=nn.Flatten()
    self.img_fc1=nn.Linear(in_features=6912,out_features=64)
    self.img_fc2=nn.Linear(in_features=64,out_features=32)
    self.img_fc3=nn.Linear(in_features=32,out_features=16)
    self.img_fc4=nn.Linear(in_features=16,out_features=1)
    # combine the single outputs of the two branches into the final estimate
    self.combine=nn.Linear(in_features=2,out_features=1)

  # version 1 (active): takes one (features,image) pair. Used for training
  def forward(self,z):
    x,y=z
  # version 2: used for SummaryWriter.add_graph since it insists on unpacking tuples.
  # To use it, replace the two lines above with
  #def forward(self,x,y):
    x=self.fc1(x)
    x=self.relu(x)
    x=self.fc2(x)
    x=self.relu(x)
    x=self.fc3(x)
    # image
    y=self.flatten(y)
    y=self.img_fc1(y)
    y=self.relu(y)
    y=self.img_fc2(y)
    y=self.relu(y)
    y=self.img_fc3(y)
    y=self.relu(y)
    y=self.img_fc4(y)
    z=torch.concat((x,y),dim=1)
    z=self.combine(z)
    return z
#    return x
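
The layer sizes used in the model follow from the tensor shapes. A short sketch with dummy tensors (a batch size of 4 is an arbitrary assumption) checks the two numbers that are easiest to get wrong: the 6912 inputs of the first image layer and the 2 inputs of the combining layer.

```python
import torch
import torch.nn as nn

batch_size = 4  # arbitrary value for the illustration
images = torch.zeros(batch_size, 3, 48, 48)

# Flatten keeps the batch dimension and merges the rest: 3*48*48 = 6912.
flat = nn.Flatten()(images)
print(flat.shape)   # torch.Size([4, 6912])

# Each branch ends in a single value per house; concatenating along dim=1
# gives the two inputs expected by the final linear layer.
tabular_out = torch.zeros(batch_size, 1)
image_out = torch.zeros(batch_size, 1)
fused = torch.cat((tabular_out, image_out), dim=1)
print(fused.shape)  # torch.Size([4, 2])
```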

In [39]:
model=Net()
from torch.utils.tensorboard import SummaryWriter
itr=iter(train_loader)
input,price=next(itr)
writer=SummaryWriter("logs")
writer.add_graph(model,input)
writer.close()
#a=model(input)

In [40]:
# To display tensorboard inside the notebook
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [None]:
%tensorboard --logdir logs

In [None]:
from torch.optim import SGD,Adam
from torch.nn import MSELoss,L1Loss
from tqdm import tqdm
model=Net()
optimizer=Adam(model.parameters())
#optimizer=SGD(model.parameters(),lr=0.001)
# one could use a mean squared error loss
# but since our testing will be based on mean absolute error
# we will use the corresponding loss
#loss_fn=MSELoss()
loss_fn=L1Loss()
epochs=50

for epoch in range(epochs):
  for input,price in tqdm(train_loader):
    output=model(input)
    loss=loss_fn(output.squeeze(),price)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(loss)
  



100%|██████████| 20/20 [00:03<00:00,  5.21it/s]
tensor(2.4675, grad_fn=<L1LossBackward0>)
tensor(0.7501, grad_fn=<L1LossBackward0>)
tensor(0.7179, grad_fn=<L1LossBackward0>)
tensor(0.3201, grad_fn=<L1LossBackward0>)
tensor(0.6138, grad_fn=<L1LossBackward0>)
tensor(1.1685, grad_fn=<L1LossBackward0>)
tensor(0.1933, grad_fn=<L1LossBackward0>)
tensor(0.8942, grad_fn=<L1LossBackward0>)
tensor(0.8201, grad_fn=<L1LossBackward0>)
tensor(0.1722, grad_fn=<L1LossBackward0>)
tensor(0.3954, grad_fn=<L1LossBackward0>)
tensor(0.0882, grad_fn=<L1LossBackward0>)
tensor(0.1289, grad_fn=<L1LossBackward0>)
tensor(0.0421, grad_fn=<L1LossBackward0>)
tensor(0.1960, grad_fn=<L1LossBackward0>)
tensor(0.1626, grad_fn=<L1LossBackward0>)
tensor(0.0842, grad_fn=<L1LossBackward0>)
tensor(0.0565, grad_fn=<L1LossBackward0>)
tensor(0.0907, grad_fn=<L1LossBackward0>)
tensor(0.0789, grad_fn=<L1LossBackward0>)
tensor(0.0609, grad_fn=<L1LossBackward0>)
tensor(0.1879, grad_fn=<L1LossBackward0>)
tensor(0.0239, grad_fn=<L1LossBackward0>)
tensor(0.0592, grad_fn=<L1LossBackward0>)
tensor(0.1398, grad_fn=<L1LossBackward0>)
tensor(0.0483, grad_fn=<L1LossBackward0>)
tensor(0.0271, grad_fn=<L1LossBackward0>)
tensor(0.0363, grad_fn=<L1LossBackward0>)
tensor(0.0704, grad_fn=<L1LossBackward0>)
tensor(0.1389, grad_fn=<L1LossBackward0>)
tensor(0.1156, grad_fn=<L1LossBackward0>)
tensor(0.0339, grad_fn=<L1LossBackward0>)
tensor(0.0269, grad_fn=<L1LossBackward0>)
tensor(0.0262, grad_fn=<L1LossBackward0>)
tensor(0.0284, grad_fn=<L1LossBackward0>)
tensor(0.0165, grad_fn=<L1LossBackward0>)
tensor(0.0237, grad_fn=<L1LossBackward0>)
tensor(0.0168, grad_fn=<L1LossBackward0>)
tensor(0.0218, grad_fn=<L1LossBackward0>)
tensor(0.0235, grad_fn=<L1LossBackward0>)
tensor(0.0323, grad_fn=<L1LossBackward0>)
tensor(0.0205, grad_fn=<L1LossBackward0>)
tensor(0.0214, grad_fn=<L1LossBackward0>)
tensor(0.0298, grad_fn=<L1LossBackward0>)
tensor(0.0485, grad_fn=<L1LossBackward0>)
tensor(0.0329, grad_fn=<L1LossBackward0>)
tensor(0.0359, grad_fn=<L1LossBackward0>)
tensor(0.0217, grad_fn=<L1LossBackward0>)
tensor(0.0147, grad_fn=<L1LossBackward0>)
tensor(0.0092, grad_fn=<L1LossBackward0>)
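
The training cell uses `L1Loss` (mean absolute error) rather than `MSELoss` (mean squared error). A small sketch on made-up values shows how the two differ and why the model output is squeezed before it is compared with the prices.

```python
import torch
from torch.nn import L1Loss, MSELoss

# Made-up values: the model returns shape (batch, 1), the prices have shape (batch,).
output = torch.tensor([[0.5], [0.0]])
price = torch.tensor([0.0, 0.0])

# squeeze() turns (2, 1) into (2,), so each estimate is paired with its own price.
l1 = L1Loss()(output.squeeze(), price)
mse = MSELoss()(output.squeeze(), price)
print(l1.item())   # 0.25  = (|0.5| + |0.0|) / 2
print(mse.item())  # 0.125 = (0.5**2 + 0.0**2) / 2
```

Without the `squeeze()`, the shapes (2, 1) and (2,) would be broadcast against each other to (2, 2), and the loss would average over every estimate/price combination instead of over matching pairs.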





## Testing 

In [None]:
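total=0.0
count=0

model.eval()
# no gradients are needed for testing
with torch.no_grad():
  for input,price in test_loader:
    count+=1
    output=model(input)
    # relative absolute error for this entry (the batch size is 1)
    rel_err=torch.abs(output.squeeze()-price.squeeze())/price.squeeze()
    total+=rel_err.item()
# Average percentage difference
print(100*total/count)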

63.849454312711146
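
The number above is a mean absolute percentage error. The sketch below, on made-up prices, computes the same quantity and shows that dividing both the estimates and the true prices by a common factor (the role played by `max_price` in this notebook) does not change it. The function name `mean_abs_percentage_error` is hypothetical.

```python
import torch

def mean_abs_percentage_error(pred, target):
    """Average of |pred - target| / target, expressed in percent."""
    return (100 * torch.abs(pred - target) / target).mean().item()

# Made-up prices: each estimate is off by 10% of the true value.
target = torch.tensor([100.0, 200.0])
pred = torch.tensor([110.0, 180.0])
scale = 200.0  # plays the role of max_price

raw = mean_abs_percentage_error(pred, target)
scaled = mean_abs_percentage_error(pred / scale, target / scale)
print(raw, scaled)  # both approximately 10
```

This is why the percentage can be computed directly on the scaled prices without converting them back to dollars first.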
