# House Price Prediction using PyTorch


## 🎯 Project: House Price Prediction

This notebook predicts house prices (`SalePrice`, the target) using:

- Linear Regression
- Selected features, including:
  - LotArea
  - YearBuilt
  - 1stFlrSF, 2ndFlrSF

> 📁 Data from: `housePrice.csv`

In [121]:
import pandas as pd
import numpy as np

In [122]:
df = pd.read_csv('housePrice.csv')

In [123]:
df.head()

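| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |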


In [124]:
df.shape

(1460, 81)

In [125]:
selected_features = [
    "SalePrice", "MSSubClass", "MSZoning", "LotFrontage", "LotArea",
    "Street", "YearBuilt", "LotShape", "1stFlrSF", "2ndFlrSF"
]

df = df[selected_features].dropna()
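
`dropna()` discards every row that has a missing value in any of the selected columns, so it can be useful to check which columns are responsible before dropping. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing frame; values are made up for illustration.
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row with at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"rows dropped: {rows_dropped}")
```

Running the same `isna().sum()` check on the real frame would show which of the selected columns cause rows to be removed.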

In [126]:
df.head()

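| | SalePrice | MSSubClass | MSZoning | LotFrontage | LotArea | Street | YearBuilt | LotShape | 1stFlrSF | 2ndFlrSF |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208500 | 60 | RL | 65.0 | 8450 | Pave | 2003 | Reg | 856 | 854 |
| 1 | 181500 | 20 | RL | 80.0 | 9600 | Pave | 1976 | Reg | 1262 | 0 |
| 2 | 223500 | 60 | RL | 68.0 | 11250 | Pave | 2001 | IR1 | 920 | 866 |
| 3 | 140000 | 70 | RL | 60.0 | 9550 | Pave | 1915 | IR1 | 961 | 756 |
| 4 | 250000 | 60 | RL | 84.0 | 14260 | Pave | 2000 | IR1 | 1145 | 1053 |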


In [127]:
df.shape

(1201, 10)

In [128]:
for i in df.columns:
    print("Column name {} and unique values are {}".format(i,len(df[i].unique())))

Column name SalePrice and unique values are 597
Column name MSSubClass and unique values are 15
Column name MSZoning and unique values are 5
Column name LotFrontage and unique values are 110
Column name LotArea and unique values are 869
Column name Street and unique values are 2
Column name YearBuilt and unique values are 112
Column name LotShape and unique values are 4
Column name 1stFlrSF and unique values are 678
Column name 2ndFlrSF and unique values are 368


- Based on the unique-value counts above, the following columns are treated as **categorical features**:

  1. `MSZoning`  
  2. `Street`  
  3. `LotShape`  
  4. `MSSubClass`
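
One way to make this judgment more systematic is to combine the unique-value count with the column dtype. The sketch below uses a toy frame and a hypothetical threshold `MAX_CATEGORIES`; both are assumptions for illustration, and the threshold would need tuning for the real data:

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "FV"],
    "MSSubClass": [60, 20, 60, 70],
    "LotArea": [8450, 9600, 11250, 9550],
})

MAX_CATEGORIES = 3  # hypothetical cut-off for "few enough values to be categorical"

# Summarise dtype and cardinality side by side.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)

# Non-numeric columns, or numeric columns with few distinct values, are candidates.
candidates = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
    or toy[col].nunique() <= MAX_CATEGORIES
]
print(candidates)
```

A numeric code such as `MSSubClass` is picked up only through the cardinality rule, which is why a dtype check alone would miss it.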

In [129]:
df.head()

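| | SalePrice | MSSubClass | MSZoning | LotFrontage | LotArea | Street | YearBuilt | LotShape | 1stFlrSF | 2ndFlrSF |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208500 | 60 | RL | 65.0 | 8450 | Pave | 2003 | Reg | 856 | 854 |
| 1 | 181500 | 20 | RL | 80.0 | 9600 | Pave | 1976 | Reg | 1262 | 0 |
| 2 | 223500 | 60 | RL | 68.0 | 11250 | Pave | 2001 | IR1 | 920 | 866 |
| 3 | 140000 | 70 | RL | 60.0 | 9550 | Pave | 1915 | IR1 | 961 | 756 |
| 4 | 250000 | 60 | RL | 84.0 | 14260 | Pave | 2000 | IR1 | 1145 | 1053 |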


In [130]:
import datetime
datetime.datetime.now().year

2025

In [131]:
df['TotaYear'] = datetime.datetime.now().year - df['YearBuilt'] # derived feature: house age in years, relative to the current year
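
Because `datetime.datetime.now().year` changes over time, the derived age will differ depending on when the notebook is run. A possible alternative, sketched here with a hypothetical constant `REFERENCE_YEAR` (an assumption, not part of the original notebook), is to pin the reference year:

```python
import pandas as pd

# Assumption: pin the reference year so the feature is reproducible across runs.
REFERENCE_YEAR = 2025

# Toy frame with a handful of construction years.
toy = pd.DataFrame({"YearBuilt": [2003, 1976, 2001, 1915, 2000]})

# House age in years relative to the pinned reference year.
toy["TotaYear"] = REFERENCE_YEAR - toy["YearBuilt"]
print(toy["TotaYear"].tolist())  # [22, 49, 24, 110, 25]
```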

In [132]:
df.head()

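| | SalePrice | MSSubClass | MSZoning | LotFrontage | LotArea | Street | YearBuilt | LotShape | 1stFlrSF | 2ndFlrSF | TotaYear |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208500 | 60 | RL | 65.0 | 8450 | Pave | 2003 | Reg | 856 | 854 | 22 |
| 1 | 181500 | 20 | RL | 80.0 | 9600 | Pave | 1976 | Reg | 1262 | 0 | 49 |
| 2 | 223500 | 60 | RL | 68.0 | 11250 | Pave | 2001 | IR1 | 920 | 866 | 24 |
| 3 | 140000 | 70 | RL | 60.0 | 9550 | Pave | 1915 | IR1 | 961 | 756 | 110 |
| 4 | 250000 | 60 | RL | 84.0 | 14260 | Pave | 2000 | IR1 | 1145 | 1053 | 25 |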


In [133]:
df.drop('YearBuilt',axis=1,inplace = True)

In [134]:
df.columns

Index(['SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',
       'Street', 'LotShape', '1stFlrSF', '2ndFlrSF', 'TotaYear'],
      dtype='object')

In [135]:
cat_features = ['MSZoning','MSSubClass', 'LotShape', 'Street']
target_feature = 'SalePrice'

In [136]:
df['MSSubClass'].unique()

array([ 60,  20,  70,  50, 190,  45,  90, 120,  30,  80, 160,  75, 180,
        40,  85])

In [137]:
from sklearn.preprocessing import LabelEncoder

# Dictionary to store encoders (if you want to inverse transform later)
lbl_encoders = {}

# Encode each categorical feature
for col in cat_features:
    lbl_encoders[col] = LabelEncoder()
    df[col] = lbl_encoders[col].fit_transform(df[col])
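
`LabelEncoder` assigns integer codes in sorted order of the distinct values, and the fitted encoder keeps that mapping in `classes_`. The sketch below shows the mapping and the round trip through `inverse_transform` on a small toy array (the zoning labels are illustrative values, not read from the CSV):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy zoning labels for illustration.
zoning = np.array(["RL", "RM", "RL", "FV", "RH", "C (all)"])

encoder = LabelEncoder()
codes = encoder.fit_transform(zoning)

# classes_ holds the sorted distinct labels; a label's code is its position here.
print(list(encoder.classes_))  # ['C (all)', 'FV', 'RH', 'RL', 'RM']
print(codes.tolist())          # [3, 4, 3, 1, 2, 0]

# inverse_transform maps integer codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(restored.tolist())
```

Keeping the fitted encoders in a dictionary, as done above, is what makes this reverse lookup possible later.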


In [138]:
lbl_encoders

{'MSZoning': LabelEncoder(),
 'MSSubClass': LabelEncoder(),
 'LotShape': LabelEncoder(),
 'Street': LabelEncoder()}

In [139]:
df.head()

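| | SalePrice | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | 1stFlrSF | 2ndFlrSF | TotaYear |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208500 | 5 | 3 | 65.0 | 8450 | 1 | 3 | 856 | 854 | 22 |
| 1 | 181500 | 0 | 3 | 80.0 | 9600 | 1 | 3 | 1262 | 0 | 49 |
| 2 | 223500 | 5 | 3 | 68.0 | 11250 | 1 | 0 | 920 | 866 | 24 |
| 3 | 140000 | 6 | 3 | 60.0 | 9550 | 1 | 0 | 961 | 756 | 110 |
| 4 | 250000 | 5 | 3 | 84.0 | 14260 | 1 | 0 | 1145 | 1053 | 25 |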


In [140]:
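## Stack the encoded categorical columns into a single NumPy array.
## Column order follows the categorical list defined earlier: MSZoning, MSSubClass, LotShape, Street.
cat_features=np.stack([df['MSZoning'],df['MSSubClass'],df['LotShape'],df['Street']],1)
cat_features

array([[3, 5, 3, 1],
       [3, 0, 3, 1],
       [3, 5, 0, 1],
       ...,
       [3, 6, 3, 1],
       [3, 0, 3, 1],
       [3, 0, 3, 1]])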

In [143]:
import torch

In [144]:
## Collect the continuous feature names: every column except the categoricals and the target
excluded = ['MSZoning', 'MSSubClass', 'LotShape', 'Street', 'SalePrice']
cont_features = [col for col in df.columns if col not in excluded]
cont_features

['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF', 'TotaYear']

In [145]:
## Stacking continuous variables into a tensor

# Stack continuous features into a NumPy array
cont_features_np = np.stack([df[i].values for i in cont_features], axis=1)

# Convert to PyTorch tensor
cont_values = torch.tensor(cont_features_np, dtype=torch.float)
cont_values

tensor([[   65.,  8450.,   856.,   854.,    22.],
        [   80.,  9600.,  1262.,     0.,    49.],
        [   68., 11250.,   920.,   866.,    24.],
        ...,
        [   66.,  9042.,  1188.,  1152.,    84.],
        [   68.,  9717.,  1078.,     0.,    75.],
        [   75.,  9937.,  1256.,     0.,    60.]])

In [146]:
### Dependent feature (target)
y=torch.tensor(df['SalePrice'].values,dtype=torch.float).reshape(-1,1)
y

tensor([[208500.],
        [181500.],
        [223500.],
        ...,
        [266500.],
        [142125.],
        [147500.]])

In [147]:
cat_features.shape , cont_values.shape, y.shape

((1201, 4), torch.Size([1201, 5]), torch.Size([1201, 1]))
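
The first shape is a plain tuple, which indicates that `cat_features` is still a NumPy array rather than a tensor. `nn.Embedding` expects integer index tensors, so the array would need converting before it is fed to an embedding layer. A minimal sketch on a toy array (the name `cat_array` and its values are assumptions for illustration):

```python
import numpy as np
import torch

# Toy array of integer category codes, shaped like the stacked categoricals.
cat_array = np.array([
    [1, 0, 2, 1],
    [0, 3, 1, 0],
    [2, 1, 0, 1],
])

# Embedding lookups need integer indices, so use int64 rather than float.
cat_tensor = torch.tensor(cat_array, dtype=torch.int64)
print(cat_tensor.dtype, cat_tensor.shape)
```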

In [148]:
df.shape

(1201, 10)

## Embedding Layer -> Only for Categorical Features

### Embedding size for categorical features

In [150]:
len(df['Street'].unique())

2

In [153]:
cat_dims = [len(df[col].unique()) for col in ['MSZoning','MSSubClass', 'LotShape', 'Street']]
cat_dims

[5, 15, 4, 2]

In [161]:
## Embedding output size is derived from the number of categories n: min(50, (n + 1) // 2)
embedding_dim = [(x,min(50,(x+1)//2)) for x in cat_dims]
embedding_dim

[(5, 3), (15, 8), (4, 2), (2, 1)]
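
These `(num_categories, embedding_size)` pairs could be turned into embedding layers as sketched below. This is an illustrative sketch, not necessarily how the rest of the notebook builds its model; `embed_layers` and `cat_batch` are hypothetical names, and the batch values are made up. Note that column `i` of the index tensor must correspond to pair `i`, otherwise a code can fall outside the embedding table:

```python
import torch
import torch.nn as nn

# (num_categories, embedding_size) pairs as computed above.
embedding_dim = [(5, 3), (15, 8), (4, 2), (2, 1)]

# One embedding table per categorical column.
embed_layers = nn.ModuleList([nn.Embedding(n, d) for n, d in embedding_dim])

# Toy batch of category codes; column order: MSZoning, MSSubClass, LotShape, Street.
cat_batch = torch.tensor(
    [[3, 5, 3, 1],
     [3, 0, 3, 1],
     [4, 14, 0, 0]],
    dtype=torch.int64,
)

# Look up each column in its own table, then concatenate along the feature axis.
embedded = torch.cat(
    [layer(cat_batch[:, i]) for i, layer in enumerate(embed_layers)],
    dim=1,
)
print(embedded.shape)  # torch.Size([3, 14]) since 3 + 8 + 2 + 1 = 14
```

The concatenated output has one row per sample and 14 columns, which could then be joined with the continuous features before any further layers.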