# **Predicting The Sale Price of Houses Using Machine Learning**

1 . `Import libraries`

1 . 1 `Getting the data ready`

2 . `Data Cleaning`

3 . `Save the clean data by using df.to_csv() function`


4 . `Then Make a copy of the original data `

4 . 1 `Import Preprocessed Data`

4 . 2 `Splitting Data into Training and Validation Sets`

5 . `Hyperparameter Tuning with RandomizedSearchCV (choose the right estimator, fit the model)`

6 . `Train a model with the best hyperparameters`

7 . `Make Predictions`

8 . `Evaluate the model`

9 . `Save and Load the model`


### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt


### Getting the data ready

In [2]:
df = pd.read_csv("train.csv")
df.head(2)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500


# Data Cleaning

## Convert strings/objects into categories

One way to turn all of our data into numbers is to convert the string columns into pandas categories.

**How** `.cat.codes` **Works**

1 . **Convert to Categorical** : First, you need to convert the column to a categorical type using .astype('category').

2 . **Get Codes** : Use `.cat.codes` to get the integer codes for each category.
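The two steps above can be sketched on a toy column. The `colors` series is made up for illustration and is not part of the housing data.

```python
import pandas as pd

# Hypothetical toy column, not taken from train.csv
colors = pd.Series(["red", "green", "blue", "green"])

# Step 1: convert the strings to a categorical dtype
colors_cat = colors.astype("category")

# Step 2: read the integer code assigned to each category
# (categories are sorted alphabetically: blue=0, green=1, red=2)
codes = colors_cat.cat.codes

print(list(colors_cat.cat.categories))  # ['blue', 'green', 'red']
print(codes.tolist())                   # [2, 1, 0, 1]
```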

In [3]:
# Find columns which contain strings
# The key is the column names and the value is what you see in each column

for key , value in df.items():
    if pd.api.types.is_string_dtype(value):
        print(key)


MSZoning
Street
LotShape
LandContour
Utilities
LotConfig
LandSlope
Neighborhood
Condition1
Condition2
BldgType
HouseStyle
RoofStyle
RoofMatl
Exterior1st
Exterior2nd
ExterQual
ExterCond
Foundation
Heating
HeatingQC
CentralAir
KitchenQual
Functional
PavedDrive
SaleType
SaleCondition


In [4]:
# This will turn all the string columns into category values

for key, value in df.items():
    if pd.api.types.is_string_dtype(value):
        # Keep the column name df[key] but change its values to an ordered category dtype
        df[key] = value.astype('category').cat.as_ordered()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Id             1460 non-null   int64   
 1   MSSubClass     1460 non-null   int64   
 2   MSZoning       1460 non-null   category
 3   LotFrontage    1201 non-null   float64 
 4   LotArea        1460 non-null   int64   
 5   Street         1460 non-null   category
 6   Alley          91 non-null     object  
 7   LotShape       1460 non-null   category
 8   LandContour    1460 non-null   category
 9   Utilities      1460 non-null   category
 10  LotConfig      1460 non-null   category
 11  LandSlope      1460 non-null   category
 12  Neighborhood   1460 non-null   category
 13  Condition1     1460 non-null   category
 14  Condition2     1460 non-null   category
 15  BldgType       1460 non-null   category
 16  HouseStyle     1460 non-null   category
 17  OverallQual    1460 non-null   in



***Category values assign a numeric code to each category. Under the hood, pandas treats categories as numbers.***

In [6]:
df.Neighborhood.cat.categories

Index(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
       'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel',
       'NAmes', 'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'OldTown', 'SWISU',
       'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker'],
      dtype='object')

In [7]:
df.Neighborhood.cat.codes

0        5
1       24
2        5
3        6
4       15
        ..
1455     8
1456    14
1457     6
1458    12
1459     7
Length: 1460, dtype: int8

## Non-Numeric Columns

Turn category values into numbers and fill missing values at the same time

In [8]:
# Check for columns which aren't numeric

for key, value in df.items():
    if not pd.api.types.is_numeric_dtype(value):
        # Add a binary column to indicate whether the sample has a missing value
        df[key + "_is_missing"] = pd.isnull(value)
        # Turn categories into numbers; missing values have code -1, so adding 1 maps them to 0
        df[key] = pd.Categorical(value).codes + 1
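Adding 1 works as a fill because pandas gives missing entries the code -1, so the shift maps them to 0 and every real category to 1 or higher. A small sketch on a made-up column (the `alley` values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries
alley = pd.Series(["Pave", np.nan, "Grvl", np.nan])

raw_codes = pd.Categorical(alley).codes   # missing values get code -1
shifted = raw_codes + 1                   # missing -> 0, Grvl -> 1, Pave -> 2

print(raw_codes.tolist())  # [1, -1, 0, -1]
print(shifted.tolist())    # [2, 0, 1, 0]
```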

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 124 entries, Id to SaleCondition_is_missing
dtypes: bool(43), float64(3), int64(35), int8(43)
memory usage: 556.2 KB


## Numeric columns

In [10]:
# Listing Numeric columns which have null values
for key , value in df.items():
    if pd.api.types.is_numeric_dtype(value):
        if pd.isna(value).sum():
            print(key)

LotFrontage
MasVnrArea
GarageYrBlt


In [11]:
# Fill missing numeric values with the column median
for key, value in df.items():
    if pd.api.types.is_numeric_dtype(value):
        if pd.isna(value).sum():
            # Add a binary column indicating which values were missing
            df[key + '_is_missing'] = pd.isna(value)
            # Fill missing numeric values with the median
            df[key] = value.fillna(value.median())
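The pattern in this cell, recording where a value was missing and then filling with the column median, can be tried on a tiny made-up frame. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in a numeric column
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 100.0]})

# Keep a record of which rows were missing before filling
toy["LotFrontage_is_missing"] = toy["LotFrontage"].isna()

# Fill the gap with the median of the observed values (80.0 here)
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```

Keeping the indicator column lets the model distinguish a genuine 80.0 from a filled-in one.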
                   

In [12]:
# Check if there is any null numeric value
for key , value in df.items():
    if pd.api.types.is_numeric_dtype(value):
        if pd.isna(value).sum():
            print(key)
        

In [13]:
df.LotFrontage_is_missing.value_counts()

LotFrontage_is_missing
False    1201
True      259
Name: count, dtype: int64

In [14]:
df.isna().sum()

Id                          0
MSSubClass                  0
MSZoning                    0
LotFrontage                 0
LotArea                     0
                           ..
SaleType_is_missing         0
SaleCondition_is_missing    0
LotFrontage_is_missing      0
MasVnrArea_is_missing       0
GarageYrBlt_is_missing      0
Length: 127, dtype: int64

## Save the clean data

In [15]:
df.to_csv("clean_data.csv" , index = False)

## Import preprocessed data

In [16]:
df_copy = pd.read_csv("clean_data.csv")

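## Splitting Data into Training and Validation Sets

Here we split the data into training and validation datasets by sale year: houses sold in 2008 or later form the training set, and houses sold before 2008 form the validation set. The validation dataset is saved as a CSV file.

We then fit the model on the training dataset, check it on the validation dataset, and finally predict on the test dataset.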

In [17]:
df_copy.YrSold.value_counts()

YrSold
2009    338
2007    329
2006    314
2008    304
2010    175
Name: count, dtype: int64

In [18]:
df_copy_training = df_copy[df_copy.YrSold >= 2008]
df_copy_validation = df_copy[df_copy.YrSold < 2008]
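A year-based split like this one should place every row in exactly one of the two sets. The sketch below checks that on a made-up frame; the `sales` data is hypothetical.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "YrSold": [2006, 2007, 2008, 2009, 2010],
    "SalePrice": [150000, 160000, 170000, 180000, 190000],
})

# Same rule as above: 2008 and later for training, earlier years for validation
train = sales[sales.YrSold >= 2008]
valid = sales[sales.YrSold < 2008]

# Every row lands in exactly one set
assert len(train) + len(valid) == len(sales)
assert set(train.index).isdisjoint(valid.index)
print(len(train), len(valid))  # 3 2
```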

In [19]:
df_copy_training.YrSold.value_counts()

YrSold
2009    338
2008    304
2010    175
Name: count, dtype: int64

In [20]:
df_copy_validation.YrSold.value_counts()

YrSold
2007    329
2006    314
Name: count, dtype: int64

## Saving the validation dataset

In [21]:
# Separate the target from the validation features before saving
df_copy_validation_SalePrice = df_copy_validation.SalePrice
df_copy_validation = df_copy_validation.drop("SalePrice", axis=1)
df_copy_validation.to_csv("validation.csv" , index = False)

In [22]:
#split training dataset into x and y train
x_train , y_train = df_copy_training.drop("SalePrice" , axis =1) , df_copy_training.SalePrice

In [23]:
x_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing,LotFrontage_is_missing,MasVnrArea_is_missing,GarageYrBlt_is_missing
0,1,60,4,65.0,8450,2,0,4,4,1,...,False,False,True,True,True,False,False,False,False,False
2,3,60,4,68.0,11250,2,0,1,4,1,...,False,False,True,True,True,False,False,False,False,False
4,5,60,4,84.0,14260,2,0,1,4,1,...,False,False,True,True,True,False,False,False,False,False
5,6,50,4,85.0,14115,2,0,1,4,1,...,False,False,True,False,False,False,False,False,False,False
7,8,60,4,69.0,10382,2,0,1,4,1,...,False,False,True,True,False,False,False,True,False,False


In [24]:
x_train.dtypes.value_counts()

int64      77
bool       46
float64     3
Name: count, dtype: int64

# Choose the right estimator for our problem

In [25]:
# Building a model to train our data to find patterns

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_jobs= -1 ,
                                 random_state = 42)
rf_model.fit(x_train , y_train)

# Hyperparameter Tuning with RandomizedSearchCV

## RandomForestRegressor


In [26]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV



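gradient_grid = {'n_estimators': np.arange(50, 2000, 50),
                 # NOTE: the third argument of np.arange is the step, so this yields only [3]
                 'max_depth': np.arange(3, 8 , 9), 
                 "min_samples_leaf" : np.arange(1, 90, 2),
                 'min_samples_split'  : [5 , 10 , 15 , 20] ,
                 # None and 1.0 both mean "use all features"
                 "max_features" : [None, "sqrt", "log2", 1, 1.0] ,
                 # NOTE: a step of 11 exceeds the range, so this yields only [0.0]
                 "min_weight_fraction_leaf" : np.arange(0.0, 0.5, 11) ,
                 "oob_score" : [False, True] ,
                 "warm_start": [False, True]
                 }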


forest_model = RandomForestRegressor(random_state=42 , n_jobs= -1)

rSearchCV_forest_model = RandomizedSearchCV(estimator = forest_model , 
                                        param_distributions = gradient_grid ,
                                        n_iter = 150 ,
                                        cv = 5 ,
                                        verbose = 1)
rSearchCV_forest_model.fit(x_train,y_train)

Fitting 5 folds for each of 150 candidates, totalling 750 fits
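One detail of the grid above is worth checking: the third argument of `np.arange` is the step size, not the number of values. With a step larger than the range, only the start value is produced, so this grid offers the search a single choice for `max_depth` and for `min_weight_fraction_leaf`. A quick check follows, with `np.linspace` shown as one possible alternative if several values were intended (that intent is an assumption):

```python
import numpy as np

# Step 9 starting at 3 overshoots 8 immediately, so only 3 is produced
print(np.arange(3, 8, 9))        # [3]

# Same effect with floats: step 11 overshoots 0.5
print(np.arange(0.0, 0.5, 11))   # [0.]

# Possible alternatives, assuming a range of values was intended
depths = np.arange(3, 8)               # 3, 4, 5, 6, 7
fractions = np.linspace(0.0, 0.5, 11)  # 11 evenly spaced values from 0.0 to 0.5
print(depths, fractions)
```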


In [27]:
rSearchCV_forest_model.best_params_

{'warm_start': False,
 'oob_score': False,
 'n_estimators': 1200,
 'min_weight_fraction_leaf': 0.0,
 'min_samples_split': 20,
 'min_samples_leaf': 7,
 'max_features': None,
 'max_depth': 3}

## Train a model with the best hyperparameters


In [28]:
np.random.seed(42)


def best_parameters():
    # Get the best parameters from the search
    params = rSearchCV_forest_model.best_params_

    # Instantiate the model, unpacking the dictionary into keyword arguments with **
    ideal_model = RandomForestRegressor(**params)

    # Fit the model
    ideal_model.fit(x_train, y_train)

    return ideal_model

In [29]:
final_model = best_parameters()
final_model
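The `**` used when building the model passes each dictionary entry as a keyword argument. A minimal sketch with a stand-in function (`describe_model` is hypothetical and only echoes its arguments):

```python
def describe_model(n_estimators=100, max_depth=None):
    """Hypothetical stand-in for a model constructor."""
    return f"n_estimators={n_estimators}, max_depth={max_depth}"

params = {"n_estimators": 1200, "max_depth": 3}

# These two calls are equivalent
unpacked = describe_model(**params)
explicit = describe_model(n_estimators=1200, max_depth=3)

print(unpacked)  # n_estimators=1200, max_depth=3
```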

## Make predictions on the validation dataset

In [30]:
# Importing validation dataset
df_copy_validation = pd.read_csv("validation.csv")
df_copy_validation.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing,LotFrontage_is_missing,MasVnrArea_is_missing,GarageYrBlt_is_missing
0,2,20,4,80.0,9600,2,0,4,4,1,...,False,False,True,True,True,False,False,False,False,False
1,4,70,4,60.0,9550,2,0,1,4,1,...,False,False,True,True,True,False,False,False,False,False
2,7,20,4,75.0,10084,2,0,4,4,1,...,False,False,True,True,True,False,False,False,False,False
3,12,60,4,85.0,11924,2,0,1,4,1,...,False,False,True,True,True,False,False,False,False,False
4,14,20,4,91.0,10652,2,0,1,4,1,...,False,False,True,True,True,False,False,False,False,False


In [31]:
#matching columns
train_features = x_train.columns
validation_data = df_copy_validation[train_features]
validation_data.head(1)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing,LotFrontage_is_missing,MasVnrArea_is_missing,GarageYrBlt_is_missing
0,2,20,4,80.0,9600,2,0,4,4,1,...,False,False,True,True,True,False,False,False,False,False


In [32]:
#make prediction on validation dataset

y_val_preds = final_model.predict(validation_data)
y_val_preds

array([157132.05081799, 225194.11675451, 286854.39947884, 361466.60966954,
       243016.43400217, 194084.90245636, 139536.97209185, 313754.28547311,
       193168.8838038 , 140243.47812972, 142452.22152667, 293529.84085987,
       308746.79462031, 157252.3729144 , 136392.88609563, 130917.13240857,
       131879.60146032, 278771.73900271, 128508.31980135, 161231.27578142,
       144595.95879646, 336327.01162522, 145598.01577686, 203059.3352998 ,
       348022.56301974, 148280.48854762, 109293.55324855, 219631.17629195,
       303727.3756043 , 215696.31266869, 222550.98113898, 266954.91733033,
       126987.37224108, 155095.95504025, 137095.12714626, 241045.43497392,
       124294.13868324, 124057.51199292, 146241.07771516, 171005.97563543,
       168508.58005164, 242462.89386559, 126999.50661976, 202826.61282577,
       111687.16388742, 132825.62456073, 148887.56895037, 270158.39292588,
       169872.47751803, 151614.04496842, 163311.95482596, 140700.92332712,
       164487.63458364, 1

# Evaluating the Model with RMSE 

The competition is scored on RMSE (root mean squared error).

`Interpretation of RMSE`

**RMSE is the square root of the average squared prediction error (residual). It summarises how far the predictions typically fall from the actual values, in the same units as the target, and it weights large errors more heavily than small ones.**

`Lower RMSE is better` : **A lower RMSE indicates that the model's predictions are closer to the actual values.**

**For example, an RMSE of 5 means the predictions typically miss the actual values by roughly 5 units.**
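A small worked example makes the definition concrete. The numbers below are made up; the calculation squares each error, averages the squares, and takes the square root.

```python
import math

# Hypothetical actual and predicted prices
actual = [200_000, 150_000]
predicted = [203_000, 146_000]

errors = [p - a for p, a in zip(predicted, actual)]   # [3000, -4000]
mse = sum(e ** 2 for e in errors) / len(errors)       # (9e6 + 16e6) / 2 = 12.5e6
rmse_value = math.sqrt(mse)

mae = sum(abs(e) for e in errors) / len(errors)       # 3500.0

print(round(rmse_value, 2))  # 3535.53
print(mae)                   # 3500.0
```

The RMSE comes out slightly above the mean absolute error because the larger of the two errors counts for more once it is squared.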

In [33]:
from sklearn.metrics import mean_squared_error

def rmse(y_test, y_preds):
    """
    Calculates the root mean squared error.
    """
    return np.sqrt(mean_squared_error(y_test, y_preds))

def metrics():
    # Score the final model on the validation set
    y_val_preds = final_model.predict(validation_data)
    scores = {"Validation RMSE": rmse(df_copy_validation_SalePrice, y_val_preds)}
    return scores

In [34]:
metrics()

{'Validation RMSE': 41941.5204206541}

# Saving and Loading the Model


In [35]:
# saving an existing model to file
from joblib import dump , load

dump(final_model , filename="house_pricePrediction_model.joblib")

['house_pricePrediction_model.joblib']

In [36]:
# Import a saved joblib model
house_pricePrediction_model = load(filename="house_pricePrediction_model.joblib")

In [37]:
house_pricePrediction_model.predict(validation_data)

array([157132.05081799, 225194.11675451, 286854.39947884, 361466.60966954,
       243016.43400217, 194084.90245636, 139536.97209185, 313754.28547311,
       193168.8838038 , 140243.47812972, 142452.22152667, 293529.84085987,
       308746.79462031, 157252.3729144 , 136392.88609563, 130917.13240857,
       131879.60146032, 278771.73900271, 128508.31980135, 161231.27578142,
       144595.95879646, 336327.01162522, 145598.01577686, 203059.3352998 ,
       348022.56301974, 148280.48854762, 109293.55324855, 219631.17629195,
       303727.3756043 , 215696.31266869, 222550.98113898, 266954.91733033,
       126987.37224108, 155095.95504025, 137095.12714626, 241045.43497392,
       124294.13868324, 124057.51199292, 146241.07771516, 171005.97563543,
       168508.58005164, 242462.89386559, 126999.50661976, 202826.61282577,
       111687.16388742, 132825.62456073, 148887.56895037, 270158.39292588,
       169872.47751803, 151614.04496842, 163311.95482596, 140700.92332712,
       164487.63458364, 1

# Import test data

In [38]:
df_test = pd.read_csv("test.csv")
df_test.head(2)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal


# Format test data to match the training data

In [39]:
    
    
def preprocess_data(df_test):
    # Fill missing numeric values with the column median
    for key, value in df_test.items():
        if pd.api.types.is_numeric_dtype(value):
            if pd.isna(value).sum():
                # Add a binary column indicating which values were missing
                df_test[key + '_is_missing'] = pd.isna(value)
                # Fill missing numeric values with the median
                df_test[key] = value.fillna(value.median())

    # Turn all the string columns into category values
    for key, value in df_test.items():
        if pd.api.types.is_string_dtype(value):
            # Keep the column name but change its values to an ordered category dtype
            df_test[key] = value.astype('category').cat.as_ordered()

    # Encode every remaining non-numeric column
    for key, value in df_test.items():
        if not pd.api.types.is_numeric_dtype(value):
            # Add a binary column to indicate whether the sample has a missing value
            df_test[key + "_is_missing"] = pd.isnull(value)
            # Turn categories into numbers; adding 1 maps missing values (code -1) to 0
            df_test[key] = pd.Categorical(value).codes + 1

    return df_test

    
                   
    

In [40]:
preprocess_data(df_test).head()



Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageType_is_missing,GarageFinish_is_missing,GarageQual_is_missing,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing
0,1461,20,3,80.0,11622,2,0,4,4,1,...,False,False,False,False,False,True,False,True,False,False
1,1462,20,4,81.0,14267,2,0,1,4,1,...,False,False,False,False,False,True,True,False,False,False
2,1463,60,4,74.0,13830,2,0,1,4,1,...,False,False,False,False,False,True,False,True,False,False
3,1464,60,4,78.0,9978,2,0,1,4,1,...,False,False,False,False,False,True,True,True,False,False
4,1465,120,4,43.0,5005,2,0,1,2,1,...,False,False,False,False,False,True,True,True,False,False
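One caveat with encoding the test set on its own: `pd.Categorical(...).codes` numbers whatever categories it happens to see, so the same string can receive different codes in train and test if a category is absent from one of them. A possible safeguard, shown here on made-up data, is to reuse the training categories when encoding the test column. The series and values below are hypothetical.

```python
import pandas as pd

# Hypothetical train and test columns; "RH" does not occur in the test data
train_zone = pd.Series(["RL", "RH", "RM"])
test_zone = pd.Series(["RM", "RL"])

# Encoding each set independently gives different codes for the same label
train_codes = pd.Categorical(train_zone).codes + 1   # RH -> 1, RL -> 2, RM -> 3
independent = pd.Categorical(test_zone).codes + 1    # RL -> 1, RM -> 2

# Reusing the training categories keeps the mapping consistent
train_categories = pd.Categorical(train_zone).categories
aligned = pd.Categorical(test_zone, categories=train_categories).codes + 1

print(independent.tolist())  # [2, 1]
print(aligned.tolist())      # [3, 2]
```

With the shared categories, a label that never appeared in training is encoded as 0, the same value used for missing entries.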


# Make predictions on the test data

In [41]:
df_test.shape, x_train.shape , df_copy.shape

((1459, 134), (817, 126), (1460, 127))

In [42]:
missing_cols = set(x_train.columns) - set(df_test.columns)
extra_cols = set(df_test.columns) - set(x_train.columns)

print("Missing columns in test set:", missing_cols)
print("Extra columns in test set:", extra_cols)


Missing columns in test set: set()
Extra columns in test set: {'BsmtFullBath_is_missing', 'GarageCars_is_missing', 'BsmtFinSF1_is_missing', 'GarageArea_is_missing', 'TotalBsmtSF_is_missing', 'BsmtFinSF2_is_missing', 'BsmtHalfBath_is_missing', 'BsmtUnfSF_is_missing'}


In [43]:
df_test = df_test.drop(columns=extra_cols)


In [44]:
#matching columns
train_features = x_train.columns
df_test = df_test[train_features]
df_test.head(1)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing,LotFrontage_is_missing,MasVnrArea_is_missing,GarageYrBlt_is_missing
0,1461,20,3,80.0,11622,2,0,4,4,1,...,False,False,True,False,True,False,False,False,False,False


In [45]:
#predicting house prices of test data
y_test_preds = house_pricePrediction_model.predict(df_test)
y_test_preds

array([124216.1762822 , 139639.38297344, 158666.73474311, ...,
       148377.4859831 , 125573.5102595 , 237182.28356572])

In [46]:
submission = pd.DataFrame()
submission['Id'] = df_test['Id']
submission["SalePrice"] = y_test_preds
submission

Unnamed: 0,Id,SalePrice
0,1461,124216.176282
1,1462,139639.382973
2,1463,158666.734743
3,1464,170576.897902
4,1465,217355.742573
...,...,...
1454,2915,109143.238112
1455,2916,109299.543147
1456,2917,148377.485983
1457,2918,125573.510259


In [47]:
submission.to_csv("submission.csv" , index=False)

# Hyperparameter Tuning with GridSearchCV

In [48]:
from sklearn.model_selection import GridSearchCV

# GridSearchCV is a brute-force search: it tries every combination of hyperparameters in the grid and keeps the best one

grid = {'n_estimators': [1450],
        'max_depth': np.arange(3, 8), 
        "min_samples_leaf" : np.arange(1, 90, 10),
        'min_samples_split'  : [15 , 20] ,
        "max_features" : [None, "sqrt", "log2",1, 1.0,] ,
        "min_weight_fraction_leaf" :[0.0],
        "oob_score" : [False] ,
        "warm_start": [True]
                 }


forest_model = RandomForestRegressor(random_state=42 , n_jobs= -1)

gs_model = GridSearchCV(estimator = forest_model , 
                                    param_grid = grid ,
                                    cv = 5 ,
                                    verbose = 1)
gs_model.fit(x_train,y_train)

Fitting 5 folds for each of 450 candidates, totalling 2250 fits
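The figure of 450 candidates follows from multiplying the number of options for each hyperparameter, since a grid search tries every combination. A quick check in plain Python, using `range` in place of `np.arange` (which yields the same values here):

```python
import math

# Number of options per hyperparameter in the grid above
option_counts = {
    "n_estimators": 1,
    "max_depth": len(range(3, 8)),              # 3..7 -> 5 values
    "min_samples_leaf": len(range(1, 90, 10)),  # 1, 11, ..., 81 -> 9 values
    "min_samples_split": 2,
    "max_features": 5,
    "min_weight_fraction_leaf": 1,
    "oob_score": 1,
    "warm_start": 1,
}

candidates = math.prod(option_counts.values())
total_fits = candidates * 5  # cv = 5 folds

print(candidates, total_fits)  # 450 2250
```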


In [49]:
gs_model.best_params_

{'max_depth': 7,
 'max_features': None,
 'min_samples_leaf': 1,
 'min_samples_split': 15,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 1450,
 'oob_score': False,
 'warm_start': True}

In [54]:
np.random.seed(42)


def best_grid():
    # Get the best parameters from the search
    params = gs_model.best_params_

    # Instantiate the model, unpacking the dictionary into keyword arguments with **
    ideal_model = RandomForestRegressor(**params)

    # Fit the model
    ideal_model.fit(x_train, y_train)

    return ideal_model

In [55]:
final_gs_model = best_grid()
final_gs_model

In [57]:
gs_preds = final_gs_model.predict(df_test)
gs_preds

array([123000.45032823, 150598.59140349, 178368.18623697, ...,
       163698.68975092, 125812.09043934, 234899.22947418])

In [58]:
submission_1 = pd.DataFrame()
submission_1['Id'] = df_test['Id']
submission_1['SalePrice'] = gs_preds
submission_1

Unnamed: 0,Id,SalePrice
0,1461,123000.450328
1,1462,150598.591403
2,1463,178368.186237
3,1464,184112.027013
4,1465,206077.030602
...,...,...
1454,2915,89393.063080
1455,2916,90961.663607
1456,2917,163698.689751
1457,2918,125812.090439


In [59]:
submission_1.to_csv("submission_gs.csv" , index=False)