<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/pytorch/blob/main/custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Estimating house prices
So far we have dealt with classification problems, i.e. identifying the category to which a given object belongs. This notebook is the first to deal with a regression problem, where the output is one or more continuous values. In particular, we will build a model to estimate a house price based on the zipcode, number of rooms, size, images...

## What you will learn
 1. Learning to use the pandas package
 1. Building a custom dataset for PyTorch
 1. Handling categorical data using one-hot encoding
 1. Building a model that takes multimodal data as input

In [None]:
import torch
import torch.nn as nn
import torchvision as vision
from torch.utils.data import Dataset,DataLoader
import pandas as pd
import numpy as np

The data is in a GitHub repository. Our first task is to "clean" it by removing all houses belonging to zipcodes that occur fewer than 20 times in the dataset. The number 20 is arbitrary, but it seems a good choice.
The non-image features of the houses are in a whitespace-delimited text file, "HousesInfo.txt", without headers. Each house has 4 images: bathroom, bedroom, frontal, kitchen. To simplify matters we choose to use only the frontal image. The prefix of each image file is the index of the house as it occurs in "HousesInfo.txt", **starting from 1**.

In [None]:
!git clone https://github.com/emanhamed/Houses-dataset

Read the text file into a pandas data frame. The file has no header row, so the column names are supplied explicitly, and the columns are separated by whitespace.

In [None]:
# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df=pd.read_csv("Houses-dataset/Houses Dataset/HousesInfo.txt",header=None,sep=r"\s+",
               names=["bedrooms","bathrooms","size","zipcode","price"])

Remove all entries with zipcodes occurring fewer than 20 times.

In [None]:
def cleanData(df):
    # compute the number of entries per zipcode
    counts=df['zipcode'].value_counts()
    # keep only the zipcodes occurring at least 20 times
    frequent=counts[counts>=20].index
    # return a filtered copy; the original row labels are preserved
    return df[df['zipcode'].isin(frequent)].copy()

In [None]:
dataset=cleanData(df)
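
As a quick sanity check, the same kind of filter can be tried on a toy frame. The sketch below is not part of the original notebook: the zipcodes, the threshold of 2, and the helper name `drop_rare_zipcodes` are made up for illustration.

```python
import pandas as pd

def drop_rare_zipcodes(df, min_count):
    # hypothetical helper: keep rows whose zipcode appears at least min_count times
    counts = df["zipcode"].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df["zipcode"].isin(frequent)]

# made-up data: zipcode 22222 occurs only once
toy = pd.DataFrame({
    "zipcode": [11111, 11111, 22222, 33333, 33333, 33333],
    "price":   [100, 120, 300, 200, 210, 190],
})
kept = drop_rare_zipcodes(toy, min_count=2)
print(sorted(kept["zipcode"].unique().tolist()))
print(kept.index.tolist())
```

Note that the surviving rows keep their original row labels, which matters here because the image file names are derived from them.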

Later we will split the dataset into training and testing sets, so it is important to shuffle the rows before we do that.

In [None]:
#randomize the dataframe
ran_dataset=dataset.sample(len(dataset))

Split into train and test sets and save each to a .csv file. The "index" column is there to make sure we pick the correct image for each entry.

In [None]:
# the dataset has 384 entries. Choose 310 for training and 74 for testing
train_dataset=ran_dataset[0:310]
test_dataset=ran_dataset[310:len(dataset)]
train_dataset.to_csv("train.csv",index_label="index")
test_dataset.to_csv("test.csv",index_label="index")
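
To see why the "index" column is enough to recover the image, the round trip can be sketched in memory. The frame below is made up, and an `io.StringIO` buffer stands in for the file on disk; only the `to_csv(..., index_label="index")` call mirrors the cell above.

```python
import io
import pandas as pd

# made-up rows whose labels are out of order, as they would be after shuffling
toy = pd.DataFrame({"bedrooms": [3, 4, 2], "price": [100, 200, 150]},
                   index=[7, 2, 5])

buffer = io.StringIO()
toy.to_csv(buffer, index_label="index")  # same call as above, written to memory
buffer.seek(0)

restored = pd.read_csv(buffer)
# the original row labels come back as an ordinary first column
print(restored.columns.tolist())
print(restored["index"].tolist())

# images are numbered from 1, so row label 7 maps to "8_frontal.jpg"
first_image = str(restored.iloc[0, 0] + 1) + "_frontal.jpg"
print(first_image)
```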

## Custom PyTorch dataset

To build a custom dataset we need to design a class that implements two methods
1. \_\_len\_\_(self) should return the total number of items in the dataset
1. \_\_getitem\_\_(self,index) returns the item at "index"

The above are the methods of Python's sequence protocol, which is also enough to make an object [**iterable**](https://docs.python.org/3/glossary.html#term-iterable)

**Note**: It is good practice to scale the inputs to small numbers, for example by normalizing the data or dividing the values by the mean or max. If this is not done, in many situations the model might not converge.
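
The two methods and the scaling note can be illustrated without PyTorch. The following is a minimal sketch, assuming a made-up list of prices; the class name `ScaledPrices` is hypothetical and it deliberately does not subclass `torch.utils.data.Dataset`, so it only shows the protocol, not the full dataset built later.

```python
class ScaledPrices:
    """Hypothetical toy dataset: prices divided by their maximum."""

    def __init__(self, prices):
        self.max_price = max(prices)
        self.items = [p / self.max_price for p in prices]

    def __len__(self):
        # total number of items
        return len(self.items)

    def __getitem__(self, index):
        # list indexing raises IndexError past the end, which ends a for loop
        return self.items[index]

prices = ScaledPrices([250000, 500000, 1000000])
print(len(prices))
print(prices[1])
print(list(prices))  # iteration works through __getitem__ alone
```

Keeping `max_price` on the object is what later allows a normalized value to be converted back to an actual price.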

In [None]:
from torchvision.io import read_image
import os
class CustomDataset(Dataset):
  def __init__(self,csvFile,imgDir):
    self.imgDir=imgDir
    dataset=pd.read_csv(csvFile)
    # convert the zipcode column into one-hot encoding
    dummy=pd.get_dummies(dataset['zipcode'])
    # the price and size column will be normalized
    price=dataset['price']
    size=dataset['size']
    self.max_size=size.max()
    self.max_price=price.max()
    size=size/self.max_size
    price=price/self.max_price
    # remove the "old" columns of size, price and zipcode
    # to prepare for the addition of the modified versions
    df=dataset.drop(['size','price','zipcode'],axis=1)
    self.data=pd.concat([df,size,dummy,price],axis=1)
    self.resize=vision.transforms.Resize((48,48))
  def __len__(self):
    return len(self.data)

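  def __getitem__(self,idx):
    # column 0 holds the row index of the house in the original file
    img_idx=self.data.iloc[idx,0]
    # the images were labelled starting at 1. Pandas starts at 0
    path=os.path.join(self.imgDir,str(img_idx+1)+"_frontal.jpg")
    img=read_image(path)
    img=self.resize(img)
    # the features are all columns between the index and the price (last column)
    features=self.data.iloc[idx,1:-1].to_numpy(dtype=np.float32)
    price=np.float32(self.data.iloc[idx,-1])
    return features,img,price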

In [None]:
train_dataset=CustomDataset("train.csv","Houses-dataset/Houses Dataset/")
test_dataset=CustomDataset("test.csv","Houses-dataset/Houses Dataset/")
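
One thing to watch: because each split is one-hot encoded on its own, the two datasets only have the same number of columns if every zipcode appears in both splits. The sketch below uses made-up zipcodes to show the mismatch and one possible way around it, assuming the full list of zipcodes is known up front; the helper `one_hot` and the variable `all_zipcodes` are hypothetical and not part of the notebook's class.

```python
import pandas as pd

# made-up splits: zipcode 22222 happens to be absent from the test split
train_zip = pd.Series([11111, 22222, 33333, 11111])
test_zip = pd.Series([11111, 33333])

# encoding each split separately gives different widths
print(pd.get_dummies(train_zip).shape[1], pd.get_dummies(test_zip).shape[1])

# assumption: the complete set of zipcodes is taken from the training split
all_zipcodes = sorted(train_zip.unique())

def one_hot(zipcodes):
    # a categorical with fixed categories yields one column per known zipcode
    fixed = pd.Series(pd.Categorical(zipcodes, categories=all_zipcodes))
    return pd.get_dummies(fixed)

print(one_hot(train_zip).shape[1], one_hot(test_zip).shape[1])
print(list(one_hot(test_zip).columns))
```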

In [None]:
train_loader=DataLoader(train_dataset,batch_size=16,shuffle=True)
test_loader=DataLoader(test_dataset,batch_size=1,shuffle=False)

In [None]:
class Net(nn.Module):
  def __init__(self):
    super(Net,self).__init__()
    self.relu=nn.ReLU() 
    self.fc1=nn.Linear(in_features=11,out_features=32)
    self.fc2=nn.Linear(in_features=32,out_features=16)
    self.fc3=nn.Linear(in_features=16,out_features=1)
  def forward(self,x):
    x=self.fc1(x)
    x=self.relu(x)
    x=self.fc2(x)
    x=self.relu(x)
    x=self.fc3(x)
    return x
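
The network above uses only the tabular features. As a rough illustration of how the image could be fed in as well, here is a hypothetical two-branch sketch. It is not the notebook's model: the class name `MultiModalNet`, the layer sizes, and the pixel scaling are all assumptions chosen to fit 11 features and 3x48x48 images.

```python
import torch
import torch.nn as nn

class MultiModalNet(nn.Module):
    # hypothetical sketch: one branch per modality, joined before the output
    def __init__(self, num_features=11):
        super().__init__()
        # image branch: 3x48x48 -> 8x24x24 -> 16x12x12 -> 16 values
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 12 * 12, 16), nn.ReLU(),
        )
        # tabular branch: num_features -> 16 values
        self.mlp = nn.Sequential(nn.Linear(num_features, 16), nn.ReLU())
        # regression head on the concatenated branches
        self.head = nn.Linear(16 + 16, 1)

    def forward(self, x, img):
        # read_image yields uint8 pixels, so scale them to floats in [0, 1]
        img = img.float() / 255.0
        combined = torch.cat([self.mlp(x), self.cnn(img)], dim=1)
        return self.head(combined)

# random stand-ins for one batch of 4 houses
x = torch.rand(4, 11)
img = torch.randint(0, 256, (4, 3, 48, 48), dtype=torch.uint8)
out = MultiModalNet()(x, img)
print(out.shape)
```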

In [None]:
from torch.optim import SGD,Adam
from torch.nn import MSELoss,L1Loss

model=Net()
optimizer=Adam(model.parameters())
# one could use a mean squared error loss
# but since our testing will be based on mean absolute error
# we will use the corresponding loss
#loss_fn=MSELoss()
loss_fn=L1Loss()
epochs=50
for epoch in range(epochs):
  for input,img,price in train_loader:
    output=model(input)
    loss=loss_fn(output.squeeze(),price)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  # report the loss of the last batch of this epoch
  print(f"epoch {epoch+1}: loss={loss.item():.4f}")
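
To make the difference between the two losses mentioned in the comments concrete, here is a small worked example on made-up tensors: `L1Loss` averages absolute errors, while `MSELoss` averages squared errors.

```python
import torch
from torch.nn import L1Loss, MSELoss

# made-up predictions and targets
pred = torch.tensor([0.5, 1.0])
target = torch.tensor([0.0, 2.0])

l1 = L1Loss()(pred, target)    # mean of |0.5| and |-1.0|
mse = MSELoss()(pred, target)  # mean of 0.25 and 1.0
print(l1.item(), mse.item())
```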
  



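Recall that each dataset was "normalized" by dividing its values by its own maximum, which is **different** for the train and test data. The model's outputs are therefore in units of the training maximum, while the test labels are in units of the test maximum, so both must be converted back to actual prices before they are compared.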
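
A small arithmetic sketch shows why this matters. The numbers are hypothetical, chosen only to make the calculation easy to follow.

```python
# hypothetical maxima of the two splits
max_train = 1_000_000   # largest price in the training split
max_test = 800_000      # largest price in the test split

pred_norm = 0.5         # model output, in units of max_train
target_norm = 0.5       # test label, in units of max_test

# convert both back to actual prices
pred_price = pred_norm * max_train
target_price = target_norm * max_test

print(abs(pred_norm - target_norm))    # looks perfect in normalized units
print(abs(pred_price - target_price))  # the actual error in price units
```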

In [None]:
itr=iter(test_loader)
input,img,price=next(itr)
output=model(input)
print(type(price))
print(type(output))

In [None]:
total=0.0
count=0
max_test=test_dataset.max_price
max_train=train_dataset.max_price
# no gradients are needed for evaluation
with torch.no_grad():
  for input,img,price in test_loader:
    count+=1
    output=model(input)
    # convert both values back to actual prices before comparing them
    error=np.abs(output.item()*max_train-price.item()*max_test)
    print(error)
    total+=error

In [None]:
# mean absolute error over the test set
total/count