## Finding the Prices for Houses:
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It is a strong alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

### Problem Definition:

How well can we predict the sale price of a house from its characteristics?

### Data:

Two files are provided: train.csv, which contains both the features and the target value (SalePrice), and test.csv, which contains the features only.

### Evaluation:

The model will be evaluated by root mean squared error (RMSE).
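
As a quick illustration of the metric, the sketch below computes RMSE by hand with NumPy. The three prices are made-up values, not rows from the dataset.

```python
import numpy as np

# Made-up prices for illustration only (not from the Ames data)
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])

# RMSE: square the errors, average them, then take the square root
rmse_value = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse_value, 2))
```

Because the errors are squared before averaging, the single 30,000 miss contributes far more to the result than the two 10,000 misses.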

### Features:

There are 79 feature columns, each describing a different characteristic of a house.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("train.csv")
data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [3]:
data.isna().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### Fill the missing numeric data:

In [6]:
for label, content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

LotFrontage
MasVnrArea
GarageYrBlt


In [7]:
for label, content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            data[label+"_is_missing"] = pd.isnull(content)
            data[label] = content.fillna(content.median())

In [8]:
for label, content in data.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

The empty output shows that no numeric column contains missing values any more.
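
The fill step above can be seen in isolation on a tiny frame. This is a minimal sketch with invented values; only the column name is borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap; the values are invented for illustration
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})

# Record which rows were missing before overwriting them
toy["LotFrontage_is_missing"] = toy["LotFrontage"].isna()
# median() skips NaN by default, so the fill value here is 70.0
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```

Keeping the indicator column lets the model distinguish a genuinely measured 70.0 from a filled-in one.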

### Fill the other missing data:

There are several ways to handle missing values in string (object) columns. Here I choose to convert those columns to categories and use the category codes, which also takes care of the missing values.

In [11]:
for label, content in data.items():
    if not pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)  # non-numeric column that contains missing values

Alley
MasVnrType
BsmtQual
BsmtCond
BsmtExposure
BsmtFinType1
BsmtFinType2
Electrical
FireplaceQu
GarageType
GarageFinish
GarageQual
GarageCond
PoolQC
Fence
MiscFeature


In [13]:
for label, content in data.items():
    if not pd.api.types.is_numeric_dtype(content):
        data[label+"_is_missing"] = pd.isnull(content)
        # Categorical codes use -1 for missing values, so adding 1 maps them to 0
        data[label] = pd.Categorical(content).codes + 1
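
To make the encoding concrete, here is a minimal sketch on an invented column showing what the category codes look like before and after the shift by one.

```python
import pandas as pd

# Invented values; None stands in for a missing entry
alley = pd.Series(["Grvl", None, "Pave", "Grvl"])

raw_codes = pd.Categorical(alley).codes   # missing values are coded as -1
shifted = raw_codes + 1                   # after the shift, missing becomes 0

print(list(raw_codes))
print(list(shifted))
```

This matches the processed table, where missing Alley entries appear as 0.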

In [14]:
data.isna().sum()

Id                          0
MSSubClass                  0
MSZoning                    0
LotFrontage                 0
LotArea                     0
                           ..
PoolQC_is_missing           0
Fence_is_missing            0
MiscFeature_is_missing      0
SaleType_is_missing         0
SaleCondition_is_missing    0
Length: 127, dtype: int64

### Apply the same preprocessing to the test dataset:

Since the same steps are needed again, it is better to wrap them in a function.

In [40]:
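def filling_pre(df):
    """Fill numeric gaps with the median and encode non-numeric columns as integer codes."""
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isna(content).sum():
                # Flag the rows that were missing, then fill them with the column median
                df[label+"_is_missing"] = pd.isnull(content)
                df[label] = content.fillna(content.median())
        else:
            # Flag missing entries, then encode categories as integers (missing becomes 0)
            df[label+"_is_missing"] = pd.isnull(content)
            df[label] = pd.Categorical(content).codes + 1
    return df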

In [16]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageType_is_missing,GarageFinish_is_missing,GarageQual_is_missing,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing
0,1,60,4,65.0,8450,2,0,4,4,1,...,False,False,False,False,False,True,True,True,False,False
1,2,20,4,80.0,9600,2,0,4,4,1,...,False,False,False,False,False,True,True,True,False,False
2,3,60,4,68.0,11250,2,0,1,4,1,...,False,False,False,False,False,True,True,True,False,False
3,4,70,4,60.0,9550,2,0,1,4,1,...,False,False,False,False,False,True,True,True,False,False
4,5,60,4,84.0,14260,2,0,1,4,1,...,False,False,False,False,False,True,True,True,False,False


In [17]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]

In [62]:
from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.2)
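
`train_test_split` also accepts a `random_state` argument, which pins the shuffle for that one call instead of relying on NumPy's global seed. A minimal sketch on invented arrays (`X_demo` and `y_demo` are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state gives the same split on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(split_a[1].shape)  # validation features: 2 rows of 2 columns
```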

In [63]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)

RandomForestRegressor()

In [64]:
model.score(X_val, y_val)

0.8886808132720552
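
For a scikit-learn regressor, `score` returns the coefficient of determination (R^2), not an error in price units. The hand computation below, on invented numbers, shows what that value measures.

```python
import numpy as np

# Invented targets and predictions for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(round(r2, 3))
```

A value near 1 means the predictions explain most of the variance in the target; it says nothing directly about the RMSE used for evaluation.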

In [78]:
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors, 1.0 is the equivalent there)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf_model = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                              param_distributions=grid, cv=5, verbose=2, n_iter=100, n_jobs=-1)

rf_model.fit(X_train,y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   verbose=2)

In [79]:
rf_model.best_params_

{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 80,
 'bootstrap': False}

In [80]:
rf_model.score(X_val, y_val)

0.8798101770331456
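
By default the search ranks candidates with the estimator's own score (R^2 here). Since the stated evaluation metric is RMSE, one option is to pass `scoring="neg_root_mean_squared_error"`. The sketch below uses synthetic data and a deliberately tiny grid so that it runs quickly; the data, the grid values and the `_demo` names are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (not the Ames data)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5, None]}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=small_grid,
    n_iter=3,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```

The score is negated because scikit-learn always maximizes the scoring function during a search.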

In [65]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def rmse(y_test,y_preds):
    return np.sqrt(mean_squared_error(y_test,y_preds))

# Evaluate a fitted model on the training and validation splits
# (the "Test RMSE" entry is computed on the validation split, X_val / y_val)

def show_scores(model):
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_val)
    scores = {"Train mae": mean_absolute_error(y_train,train_preds),"Valid mae": mean_absolute_error(y_val,test_preds),
              "Training set RMSE": rmse(y_train,train_preds), "Test RMSE": rmse(y_val,test_preds),
              "Valid R^2": r2_score(y_val,test_preds)}
    return scores

In [66]:
show_scores(model)

{'Train mae': 6685.424083904109,
 'Valid mae': 17900.77934931507,
 'Training set RMSE': 11482.189234672784,
 'Test RMSE': 29220.788410344314,
 'Valid R^2': 0.8886808132720552}

In [67]:
model.fit(X,y)

RandomForestRegressor()

In [48]:
test = pd.read_csv("test.csv")
test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [49]:
for label, content in test.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            test[label+"_is_missing"] = pd.isnull(content)
            test[label] = content.fillna(content.median())

In [52]:
for label, content in test.items():
    if not pd.api.types.is_numeric_dtype(content):
        test[label+"_is_missing"] = pd.isnull(content)
        test[label] = pd.Categorical(content).codes +1
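
One caveat with encoding the test set on its own: the integer assigned to a category depends on which categories that particular frame contains, so the same label can receive different codes in train and test. A possible remedy, sketched here on invented values, is to reuse the training categories when encoding the test column.

```python
import pandas as pd

# Invented zoning labels standing in for a train and a test column
train_zone = pd.Series(["RL", "RM", "FV"])
test_zone = pd.Series(["RM", "RL", "RM"])

# Encoding each frame on its own: codes depend on the categories present in that frame
independent = pd.Categorical(test_zone).codes + 1

# Reusing the training categories keeps the mapping identical across both frames
train_cat = pd.Categorical(train_zone)
shared = pd.Categorical(test_zone, categories=train_cat.categories).codes + 1

print(list(train_cat.codes + 1))
print(list(independent))
print(list(shared))
```

With the shared categories, "RM" is 3 in both frames, whereas the independent encoding gives it 2 in the test column. A label that never occurred in training would be treated as missing and end up as 0.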



In [68]:
test

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,GarageType_is_missing,GarageFinish_is_missing,GarageQual_is_missing,GarageCond_is_missing,PavedDrive_is_missing,PoolQC_is_missing,Fence_is_missing,MiscFeature_is_missing,SaleType_is_missing,SaleCondition_is_missing
0,1461,20,3,80.0,11622,2,0,4,4,1,...,False,False,False,False,False,True,False,True,False,False
1,1462,20,4,81.0,14267,2,0,1,4,1,...,False,False,False,False,False,True,True,False,False,False
2,1463,60,4,74.0,13830,2,0,1,4,1,...,False,False,False,False,False,True,False,True,False,False
3,1464,60,4,78.0,9978,2,0,1,4,1,...,False,False,False,False,False,True,True,True,False,False
4,1465,120,4,43.0,5005,2,0,1,2,1,...,False,False,False,False,False,True,True,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,5,21.0,1936,2,0,4,4,1,...,True,True,True,True,False,True,True,True,False,False
1455,2916,160,5,21.0,1894,2,0,4,4,1,...,False,False,False,False,False,True,True,True,False,False
1456,2917,20,4,160.0,20000,2,0,4,4,1,...,False,False,False,False,False,True,True,True,False,False
1457,2918,85,4,62.0,10441,2,0,4,4,1,...,True,True,True,True,False,True,False,False,False,False


In [56]:
set(X)

{'1stFlrSF',
 '2ndFlrSF',
 '3SsnPorch',
 'Alley',
 'Alley_is_missing',
 'BedroomAbvGr',
 'BldgType',
 'BldgType_is_missing',
 'BsmtCond',
 'BsmtCond_is_missing',
 'BsmtExposure',
 'BsmtExposure_is_missing',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtFinType1',
 'BsmtFinType1_is_missing',
 'BsmtFinType2',
 'BsmtFinType2_is_missing',
 'BsmtFullBath',
 'BsmtHalfBath',
 'BsmtQual',
 'BsmtQual_is_missing',
 'BsmtUnfSF',
 'CentralAir',
 'CentralAir_is_missing',
 'Condition1',
 'Condition1_is_missing',
 'Condition2',
 'Condition2_is_missing',
 'Electrical',
 'Electrical_is_missing',
 'EnclosedPorch',
 'ExterCond',
 'ExterCond_is_missing',
 'ExterQual',
 'ExterQual_is_missing',
 'Exterior1st',
 'Exterior1st_is_missing',
 'Exterior2nd',
 'Exterior2nd_is_missing',
 'Fence',
 'Fence_is_missing',
 'FireplaceQu',
 'FireplaceQu_is_missing',
 'Fireplaces',
 'Foundation',
 'Foundation_is_missing',
 'FullBath',
 'Functional',
 'Functional_is_missing',
 'GarageArea',
 'GarageCars',
 'GarageCond',
 'GarageCond_

In [57]:
set(test)

{'1stFlrSF',
 '2ndFlrSF',
 '3SsnPorch',
 'Alley',
 'Alley_is_missing',
 'BedroomAbvGr',
 'BldgType',
 'BldgType_is_missing',
 'BsmtCond',
 'BsmtCond_is_missing',
 'BsmtExposure',
 'BsmtExposure_is_missing',
 'BsmtFinSF1',
 'BsmtFinSF1_is_missing',
 'BsmtFinSF2',
 'BsmtFinSF2_is_missing',
 'BsmtFinType1',
 'BsmtFinType1_is_missing',
 'BsmtFinType2',
 'BsmtFinType2_is_missing',
 'BsmtFullBath',
 'BsmtFullBath_is_missing',
 'BsmtHalfBath',
 'BsmtHalfBath_is_missing',
 'BsmtQual',
 'BsmtQual_is_missing',
 'BsmtUnfSF',
 'BsmtUnfSF_is_missing',
 'CentralAir',
 'CentralAir_is_missing',
 'Condition1',
 'Condition1_is_missing',
 'Condition2',
 'Condition2_is_missing',
 'Electrical',
 'Electrical_is_missing',
 'EnclosedPorch',
 'ExterCond',
 'ExterCond_is_missing',
 'ExterQual',
 'ExterQual_is_missing',
 'Exterior1st',
 'Exterior1st_is_missing',
 'Exterior2nd',
 'Exterior2nd_is_missing',
 'Fence',
 'Fence_is_missing',
 'FireplaceQu',
 'FireplaceQu_is_missing',
 'Fireplaces',
 'Foundation',
 'Fou

In [58]:
set(test) - set(X)

{'BsmtFinSF1_is_missing',
 'BsmtFinSF2_is_missing',
 'BsmtFullBath_is_missing',
 'BsmtHalfBath_is_missing',
 'BsmtUnfSF_is_missing',
 'GarageArea_is_missing',
 'GarageCars_is_missing',
 'TotalBsmtSF_is_missing'}

In [59]:
test["BsmtFinSF1_is_missing"].value_counts()

False    1458
True        1
Name: BsmtFinSF1_is_missing, dtype: int64

In [60]:
X["BsmtFinSF1_is_missing"] = False
X["BsmtFinSF2_is_missing"] = False
X["BsmtFullBath_is_missing"] = False
X["BsmtHalfBath_is_missing"] = False
X["BsmtUnfSF_is_missing"] = False
X["GarageArea_is_missing"] = False
X["GarageCars_is_missing"] = False
X["TotalBsmtSF_is_missing"] = False
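
The same columns can be added without listing them by hand, by looping over the set difference computed earlier. A minimal sketch on invented frames (`X_demo` and `test_demo` stand in for `X` and `test`):

```python
import pandas as pd

# Invented frames standing in for X and test
X_demo = pd.DataFrame({"GarageArea": [548, 460]})
test_demo = pd.DataFrame({"GarageArea": [730.0, 480.0],
                          "GarageArea_is_missing": [False, True]})

# Add every indicator column that exists only in the test frame
for col in sorted(set(test_demo.columns) - set(X_demo.columns)):
    X_demo[col] = False

print(list(X_demo.columns))
```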

In [61]:
X

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,SaleType_is_missing,SaleCondition_is_missing,BsmtFinSF1_is_missing,BsmtFinSF2_is_missing,BsmtFullBath_is_missing,BsmtHalfBath_is_missing,BsmtUnfSF_is_missing,GarageArea_is_missing,GarageCars_is_missing,TotalBsmtSF_is_missing
0,1,60,4,65.0,8450,2,0,4,4,1,...,False,False,False,False,False,False,False,False,False,False
1,2,20,4,80.0,9600,2,0,4,4,1,...,False,False,False,False,False,False,False,False,False,False
2,3,60,4,68.0,11250,2,0,1,4,1,...,False,False,False,False,False,False,False,False,False,False
3,4,70,4,60.0,9550,2,0,1,4,1,...,False,False,False,False,False,False,False,False,False,False
4,5,60,4,84.0,14260,2,0,1,4,1,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,4,62.0,7917,2,0,4,4,1,...,False,False,False,False,False,False,False,False,False,False
1456,1457,20,4,85.0,13175,2,0,4,4,1,...,False,False,False,False,False,False,False,False,False,False
1457,1458,70,4,66.0,9042,2,0,4,4,1,...,False,False,False,False,False,False,False,False,False,False
1458,1459,20,4,68.0,9717,2,0,4,4,1,...,False,False,False,False,False,False,False,False,False,False


In [70]:
test_predictions = model.predict(test)

Feature names must be in the same order as they were in fit.
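
The message above indicates that the columns of `test` are not in the same order as the columns the model was fitted on. One way to line them up, assuming both frames contain the same set of columns, is to select the test columns in the training order. A sketch on invented frames:

```python
import pandas as pd

# Invented frames with the same columns in a different order
X_demo = pd.DataFrame({"Id": [1], "LotArea": [8450], "LotArea_is_missing": [False]})
test_demo = pd.DataFrame({"LotArea_is_missing": [False], "Id": [1461], "LotArea": [11622]})

# Select the test columns in the order used for training
test_aligned = test_demo[X_demo.columns]

print(list(test_aligned.columns))
```

With the notebook's own frames, the equivalent call would presumably be `model.predict(test[X.columns])`.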



In [74]:
submission = pd.DataFrame()
submission["Id"] = test["Id"]
submission["SalePrice"] = test_predictions

In [75]:
submission

Unnamed: 0,Id,SalePrice
0,1461,127210.10
1,1462,155429.00
2,1463,179116.71
3,1464,185508.37
4,1465,197184.32
...,...,...
1454,2915,86302.00
1455,2916,87349.71
1456,2917,149443.37
1457,2918,109817.00


In [76]:
submission.to_csv("Prediction_to_kaggle.csv", index=False)