# House Prices

## Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

## Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

## Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
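
As a minimal sketch of this metric (the function name `rmsle` and the toy prices are illustrative assumptions, not part of the competition code), a 10% overestimate produces the same error on a cheap house as on an expensive one:

```python
import numpy as np

def rmsle(predicted, observed):
    """RMSE between log(predicted) and log(observed)."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((np.log(predicted) - np.log(observed)) ** 2)))

# Both predictions are 10% too high, at very different price levels.
cheap = rmsle([110_000], [100_000])
expensive = rmsle([550_000], [500_000])
print(cheap, expensive)  # both equal log(1.1)
```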

## Data Source
Anna Montoya, DataCanary. (2016). House Prices - Advanced Regression Techniques. Kaggle.  
https://kaggle.com/competitions/house-prices-advanced-regression-techniques

## Data Exploration

In [59]:
import pandas as pd
import numpy as np
import pathlib
from IPython.display import display

In [60]:
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIR = NOTEBOOK_PATH/"data"  # defining data directory

In [61]:
df = pd.read_csv(DATA_DIR/"train.csv",sep=",")

In [62]:
df.head()

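|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |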


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [64]:
df["SalePrice"]=np.log(df.SalePrice)
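
Because the target is replaced by its logarithm here, a model trained on it predicts log prices, and any prediction reported in dollars would need the inverse transform. A small sketch of the round trip, using the first three sale prices from the preview above (the variable names are illustrative):

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])  # SalePrice of the first three rows
log_prices = np.log(prices)     # scale the model is trained on
recovered = np.exp(log_prices)  # back to dollars, e.g. for a submission file

print(log_prices.round(3))
print(recovered.round(0))
```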

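# Handling Categorical Features

There are 43 columns with object dtype. Also, as the data description says, the MSSubClass column is a categorical feature, although it is stored as integers. For encoding, each object column is converted to an ordered pandas categorical and then replaced by its integer category codes, which is an ordinal encoding.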

In [65]:
from sklearn.preprocessing import OrdinalEncoder

In [66]:
from pandas.api.types import is_object_dtype

# Convert every object (string) column to an ordered pandas categorical.
for name, column in df.items():
    if is_object_dtype(column):
        df[name] = column.astype("category").cat.as_ordered()

In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Id             1460 non-null   int64   
 1   MSSubClass     1460 non-null   int64   
 2   MSZoning       1460 non-null   category
 3   LotFrontage    1201 non-null   float64 
 4   LotArea        1460 non-null   int64   
 5   Street         1460 non-null   category
 6   Alley          91 non-null     category
 7   LotShape       1460 non-null   category
 8   LandContour    1460 non-null   category
 9   Utilities      1460 non-null   category
 10  LotConfig      1460 non-null   category
 11  LandSlope      1460 non-null   category
 12  Neighborhood   1460 non-null   category
 13  Condition1     1460 non-null   category
 14  Condition2     1460 non-null   category
 15  BldgType       1460 non-null   category
 16  HouseStyle     1460 non-null   category
 17  OverallQual    1460 non-null   in

In [68]:
df["MSZoning"]

0       RL
1       RL
2       RL
3       RL
4       RL
        ..
1455    RL
1456    RL
1457    RL
1458    RL
1459    RL
Name: MSZoning, Length: 1460, dtype: category
Categories (5, object): [C (all) < FV < RH < RL < RM]

In [69]:
df["Alley"]

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
1455    NaN
1456    NaN
1457    NaN
1458    NaN
1459    NaN
Name: Alley, Length: 1460, dtype: category
Categories (2, object): [Grvl < Pave]

In [70]:
from pandas.api.types import is_numeric_dtype

def numericalize(df):
    "Replace each categorical column of df with its integer category codes."
    for name, col in df.items():
        if not is_numeric_dtype(col):
            # Codes start at 0 and missing values are -1, so adding 1 maps NaN to 0.
            df[name] = col.cat.codes + 1

In [71]:
numericalize(df)
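
To see what this does to a column with missing values such as Alley, here is a small sketch on a toy Series (the four values are made up for illustration). pandas assigns the code -1 to missing entries, so after adding 1 they become 0 and the real categories start at 1:

```python
import pandas as pd

# Toy stand-in for the Alley column: two categories and two missing values.
alley = pd.Series(["Grvl", None, "Pave", None], dtype="object")
cat = alley.astype("category").cat.as_ordered()

codes = cat.cat.codes + 1  # missing values have code -1, so they become 0
print(list(cat.cat.categories))  # ['Grvl', 'Pave']
print(codes.tolist())            # [1, 0, 2, 0]
```

This is why Alley shows 1460 non-null values after encoding even though only 91 rows have an alley.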

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   int8   
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   int8   
 6   Alley          1460 non-null   int8   
 7   LotShape       1460 non-null   int8   
 8   LandContour    1460 non-null   int8   
 9   Utilities      1460 non-null   int8   
 10  LotConfig      1460 non-null   int8   
 11  LandSlope      1460 non-null   int8   
 12  Neighborhood   1460 non-null   int8   
 13  Condition1     1460 non-null   int8   
 14  Condition2     1460 non-null   int8   
 15  BldgType       1460 non-null   int8   
 16  HouseStyle     1460 non-null   int8   
 17  OverallQual    1460 non-null   int64  
 18  OverallC

# Missing Values

In [73]:
for n,c in df.items():
    if is_numeric_dtype(c):
        if df[n].isnull().sum():
            print(n)

LotFrontage
MasVnrArea
GarageYrBlt


In [74]:
def fix_missing(df):
    "Flag missing numeric values in a <name>_na column, then fill them with 0."
    for name, col in df.items():
        if is_numeric_dtype(col):
            if pd.isnull(col).sum():
                df[name + "_na"] = pd.isnull(col)
                df[name] = col.fillna(0)

In [75]:
fix_missing(df)
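
Filling with 0 puts the missing rows far from the real values (a GarageYrBlt of 0, for example). A common alternative is to fill with the column median while keeping the same indicator column. The sketch below is a hypothetical variant, not what this notebook uses; the name `fix_missing_median` and the toy frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fix_missing_median(df):
    """Hypothetical variant of fix_missing: flag missing values, then fill with the median."""
    for name in list(df.columns):  # copy the column list, since new columns are added in the loop
        col = df[name]
        if is_numeric_dtype(col) and col.isnull().any():
            df[name + "_na"] = col.isnull()
            df[name] = col.fillna(col.median())
    return df

toy = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0],
                    "LotArea": [8450, 9600, 11250]})
toy = fix_missing_median(toy)
print(toy)
```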

In [76]:
from sklearn.model_selection import  train_test_split

# Modelling

In [77]:
x_train =df.copy()
y_train = df["SalePrice"].values
x_train.drop("SalePrice",axis=1,inplace=True)

In [92]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train,
                                                  test_size = 0.25)
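
The split above is random, so the scores reported below change each time the cell is rerun. One way to make them repeatable is to pass `random_state` to `train_test_split`. In this sketch the seed value 42 and the toy arrays are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# The same seed gives the same partition on every call.
a_train, a_val, ya_train, ya_val = train_test_split(X, y, test_size=0.25, random_state=42)
b_train, b_val, yb_train, yb_val = train_test_split(X, y, test_size=0.25, random_state=42)

print(np.array_equal(a_val, b_val))  # True
```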

In [79]:
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display
from sklearn import metrics

In [80]:
import math
def rmse(x,y):
    return math.sqrt(((x-y)**2).mean())

In [81]:
def print_score(m):
    print(f"RMSE of train set {rmse(m.predict(x_train),y_train)}")
    print(f"RMSE of valid set {rmse(m.predict(x_val),y_val)}")
    print(f"R^2 of train set  {m.score(x_train,y_train)}")
    print(f"R^2 of valid set {m.score(x_val,y_val)}")

In [82]:
m = RandomForestRegressor(n_estimators=100,n_jobs=-1)
%time m.fit(x_train,y_train)

Wall time: 566 ms


RandomForestRegressor(n_jobs=-1)

In [83]:
print_score(m)

RMSE of train set 0.05273949688550509
RMSE of valid set 0.15716832040923864
R^2 of train set  0.9828556780270187
R^2 of valid set 0.8365083995701134


In [84]:
m= RandomForestRegressor(n_estimators=100,min_samples_leaf=5,max_features=0.5,n_jobs=-1  )
%time m.fit(x_train,y_train)

Wall time: 314 ms


RandomForestRegressor(max_features=0.5, min_samples_leaf=5, n_jobs=-1)

In [85]:
print_score(m)

RMSE of train set 0.09303117754485077
RMSE of valid set 0.14615041647234794
R^2 of train set  0.9466535670313625
R^2 of valid set 0.8586273054863559


In [86]:
m = RandomForestRegressor(n_estimators=100,max_depth=10,n_jobs=-1)
%time m.fit(x_train,y_train)

Wall time: 469 ms


RandomForestRegressor(max_depth=10, n_jobs=-1)

In [87]:
print_score(m)

RMSE of train set 0.060361651682101614
RMSE of valid set 0.1575404652759677
R^2 of train set  0.9775420257729469
R^2 of valid set 0.8357332485532442


In [94]:
main_model= RandomForestRegressor(n_estimators=100,max_depth=200,n_jobs=-1)
main_model.fit(x_train,y_train)

RandomForestRegressor(max_depth=200, n_jobs=-1)

In [95]:
print_score(main_model)

RMSE of train set 0.05427423647505069
RMSE of valid set 0.1453106647978413
R^2 of train set  0.9818608249119675
R^2 of valid set 0.8691611213900969
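
The validation scores above each come from a single random split of 1460 rows, so small differences between settings may be noise. A possible way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` as a stand-in, and the hyperparameters and fold count are assumptions rather than tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook this would be the full feature
# matrix and log-price target before any train/validation split.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, min_samples_leaf=5,
                              max_features=0.5, random_state=0)

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print(f"RMSE per fold: {rmse_per_fold.round(2)}")
print(f"Mean RMSE: {rmse_per_fold.mean():.2f}")
```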
