# House Prices: Advanced Regression Techniques
## Model Building and Prediction

Steps performed in this file:
- Train Test Split
- XGBoost
    - Hyperparameter Optimization
    - RandomizedSearchCV
    - Prediction
    - Score
- Random Forest
    - Hyperparameter Optimization
    - RandomizedSearchCV and GridSearchCV
    - Prediction
    - Score
- Result

In [1]:
import numpy as np
import pandas as pd

In [2]:
dataset=pd.read_csv('../Input/X_train.csv')
dataset.head()

Unnamed: 0,MSSubClass,MSZoning,Neighborhood,OverallQual,YearRemodAdd,RoofStyle,BsmtQual,BsmtExposure,HeatingQC,CentralAir,...,GrLivArea,BsmtFullBath,KitchenQual,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,PavedDrive,SaleCondition
0,0.235294,0.75,0.636364,0.666667,0.098361,0.0,0.75,0.25,1.0,1.0,...,0.577712,0.333333,0.666667,0.0,0.2,0.8,0.666667,0.5,1.0,0.75
1,0.0,0.75,0.5,0.555556,0.52459,0.0,0.75,1.0,1.0,1.0,...,0.470245,0.0,0.333333,0.333333,0.6,0.8,0.666667,0.5,1.0,0.75
2,0.235294,0.75,0.636364,0.666667,0.114754,0.0,0.75,0.5,1.0,1.0,...,0.593095,0.333333,0.666667,0.333333,0.6,0.8,0.666667,0.5,1.0,0.75
3,0.294118,0.75,0.727273,0.666667,0.606557,0.0,0.5,0.25,0.75,1.0,...,0.579157,0.333333,0.666667,0.333333,0.8,0.4,0.333333,0.75,1.0,0.0
4,0.235294,0.75,1.0,0.777778,0.147541,0.0,0.75,0.75,1.0,1.0,...,0.666523,0.333333,0.666667,0.333333,0.6,0.8,0.666667,0.75,1.0,0.75


In [3]:
data=pd.read_csv('../Input/y_train.csv')
data.head()

Unnamed: 0,SalePrice
0,12.247694
1,12.109011
2,12.317167
3,11.849398
4,12.429216


### Train Test Split
I treat this as a real-world problem rather than a Kaggle competition, so I split the training data into train and test sets instead of using the separately provided test dataset.

In [4]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(dataset,data,test_size=0.2,random_state=42)

### XGBoost

In [5]:
import xgboost
from sklearn.model_selection import RandomizedSearchCV

In [6]:
regressor=xgboost.XGBRegressor()

In [7]:
# Hyper-Parameter Optimization

n_estimators = [100, 500, 700, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster=['gbtree','gblinear']
learning_rate=[0.05, 0.1, 0.15, 0.20, 0.25]
min_child_weight=[1, 2, 3, 4]
base_score=[0.25, 0.5, 0.75, 1]
colsample_bytree=[0.1, 0.2, 0.3, 0.5, 0.7]
gamma=[0.0, 0.1, 0.2, 0.3]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate,
    'min_child_weight':min_child_weight,
    'booster':booster,
    'base_score':base_score,
    'colsample_bytree':colsample_bytree,
    'gamma':gamma
    }

- Now I will use the most common technique of cross validation, K-Fold CV. 
- When we approach a machine learning problem, we make sure to split our data into a training and a testing set. 
- In K-Fold CV, we further split our training set into K number of subsets, called folds. 
- We then iteratively fit the model K times, each time training it on K-1 of the folds and evaluating on the remaining fold (called the validation data).
- For example, consider fitting a model with K = 5. 
    - In the first iteration, we train on the first four folds and evaluate on the fifth. 
    - In the second, we train on the first, second, third, and fifth folds and evaluate on the fourth. 
    - We repeat this procedure 3 more times, each time evaluating on a different fold. 
- At the very end of training, we average the performance on each of the folds to come up with final validation metrics for the model.
- Using Scikit-Learn’s RandomizedSearchCV method, we can define a grid of hyperparameter ranges, and randomly sample from the grid, performing K-Fold CV with each combination of values.
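The K-Fold procedure described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the helper names `k_fold_indices` and `cross_validate_mean_model`, the toy targets, and the predict-the-mean "model" are all assumptions made for the example. The folds are contiguous and unshuffled.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more sample
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate_mean_model(y, k=5):
    """Return the per-fold mean absolute error of a predict-the-mean baseline."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k):
        held_out_set = set(held_out)
        # "Fit" on the other K-1 folds: the baseline just memorises their mean
        train_y = [v for i, v in enumerate(y) if i not in held_out_set]
        prediction = mean(train_y)
        # Evaluate on the held-out fold (the validation data)
        fold_errors.append(mean(abs(y[i] - prediction) for i in held_out))
    return fold_errors

# Toy targets, made up for illustration (not the notebook's data)
y_toy = [12.2, 12.1, 12.3, 11.8, 12.4, 12.0, 11.9, 12.5, 12.2, 12.1]
errors = cross_validate_mean_model(y_toy, k=5)
print("per-fold MAE:", errors)
print("final validation metric (average over folds):", mean(errors))
```

Each sample is used for validation exactly once, and the final metric is the average of the K per-fold scores.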

In [8]:
random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=50, scoring = 'neg_mean_absolute_error',n_jobs = 4,
            verbose = 5, return_train_score = True, random_state=42)

In [9]:
random_cv.fit(X_train,y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:    4.3s
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed:    9.5s
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:   40.8s
[Parallel(n_jobs=4)]: Done 250 out of 250 | elapsed:  1.2min finished


RandomizedSearchCV(cv=5,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_estimators=100, n...
                   param_distributions={'base_score': [0.25, 0.5, 0.75, 1],
                                        'booster': ['gbtree', 'gblinear'],
                                       

In [10]:
random_cv.best_estimator_

XGBRegressor(base_score=1, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0.0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=2,
             min_child_weight=4, missing=nan, monotone_constraints='()',
             n_estimators=700, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [11]:
random_cv.best_params_

{'n_estimators': 700,
 'min_child_weight': 4,
 'max_depth': 2,
 'learning_rate': 0.05,
 'gamma': 0.0,
 'colsample_bytree': 0.3,
 'booster': 'gbtree',
 'base_score': 1}

In [12]:
model_xg=xgboost.XGBRegressor(base_score=1, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0.0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=2,
             min_child_weight=4, missing=None, monotone_constraints='()',
             n_estimators=700, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [13]:
model_xg.fit(X_train,y_train)

XGBRegressor(base_score=1, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0.0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=2,
             min_child_weight=4, missing=None, monotone_constraints='()',
             n_estimators=700, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [14]:
y_pred_xg=model_xg.predict(X_test)

In [15]:
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

In [16]:
score_xg=r2_score(y_test,y_pred_xg)
score_xg

0.9025047566027128

In [17]:
cv_score_xg=cross_val_score(model_xg, X_train, y_train, cv=10)
cv_score_xg

array([0.90452266, 0.88666519, 0.88977768, 0.84977814, 0.83882195,
       0.90906991, 0.90818446, 0.90372937, 0.85478382, 0.92174786])

In [18]:
cv_score_xg.mean()

0.8867081047797329

### Random Forest

In [19]:
from sklearn.ensemble import RandomForestRegressor

In [20]:
forest=RandomForestRegressor()

In [21]:
forest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [22]:
# Hyper-Parameter Optimization

# n_estimators = number of trees in the forest
# max_features = max number of features considered for splitting a node
# max_depth = max number of levels in each decision tree
# min_samples_split = min number of data points placed in a node before the node is split
# min_samples_leaf = min number of data points allowed in a leaf node
# bootstrap = method for sampling data points (with or without replacement)

param_random={
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

# Altogether, there are 2 * 11 * 2 * 3 * 3 * 10 = 3960 settings!
# However, the benefit of a random search is that we do not try every combination;
# instead we sample combinations at random to cover a wide range of values.
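The size of this search space can be checked by multiplying the number of candidate values per hyperparameter; `max_depth` lists 11 values (ten depths plus `None`). The sketch below copies `param_random` from the cell above. The helper `sample_settings` is a hypothetical stand-in for the sampling step and draws with replacement, which may differ from how scikit-learn samples internally.

```python
import math
import random

param_random = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

# Total number of distinct settings = product of the list lengths
n_settings = math.prod(len(values) for values in param_random.values())
print(n_settings)  # 3960

def sample_settings(grid, n_iter, seed=42):
    """Draw n_iter random settings, choosing one value per hyperparameter."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n_iter)]

sampled = sample_settings(param_random, n_iter=100)
print(f"{len(sampled)} sampled settings cover at most "
      f"{len(sampled) / n_settings:.1%} of the grid")
```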

In [23]:
# Random search of parameters: 4-fold cross validation across 100 different combinations
# n_iter: controls the number of different combinations to try
# cv: the number of folds to use for cross validation
# more cv folds reduce the chance of overfitting, but raising either value increases the run time

random_search = RandomizedSearchCV(estimator=forest,
            param_distributions=param_random, cv=4, n_iter=100,
            scoring = 'neg_mean_absolute_error',n_jobs = -1,
            verbose = 5, random_state=42)
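The "totalling N fits" lines in the search logs follow from multiplying the number of candidate settings by the number of folds. A small sketch of that arithmetic, using the settings from this notebook (`total_fits` is a helper name introduced here for illustration):

```python
def total_fits(n_candidates, n_folds):
    """Number of model fits a cross-validated search performs."""
    return n_candidates * n_folds

xgb_random_fits = total_fits(50, 5)   # XGBoost random search: n_iter=50, cv=5
rf_random_fits = total_fits(100, 4)   # Random Forest random search: n_iter=100, cv=4
rf_grid_fits = total_fits(144, 4)     # Random Forest grid search: 144 combinations, cv=4

print(xgb_random_fits, rf_random_fits, rf_grid_fits)  # 250 400 576
```

With the default `refit=True`, the search additionally refits the best setting once on the whole training set; that extra fit is not counted in the log line.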

In [24]:
random_search.fit(X_train,y_train)

Fitting 4 folds for each of 100 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   16.6s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:  9.3min finished


RandomizedSearchCV(cv=4, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42, scoring='neg_mean_absolute_error',
                   verbose=5)

In [25]:
random_search.best_estimator_

RandomForestRegressor(bootstrap=False, max_depth=20, max_features='sqrt',
                      min_samples_split=5, n_estimators=1800)

In [26]:
random_search.best_params_

{'n_estimators': 1800,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 20,
 'bootstrap': False}

In [27]:
# Random search allowed us to narrow down the range for each hyperparameter. 
# Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try.
# With GridSearchCV, instead of sampling randomly from a distribution, all combinations we define are evaluated.

from sklearn.model_selection import GridSearchCV
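To make the contrast with random search concrete, here is a minimal sketch of what evaluating every combination means. The grid, the helper `expand_grid`, and the scoring function `toy_score` are hypothetical; in a real search the score would come from cross-validation rather than a formula.

```python
from itertools import product

def expand_grid(grid):
    """Enumerate every combination of the listed hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[name] for name in names))]

# Hypothetical narrowed grid, for illustration only
toy_grid = {'max_depth': [10, 20], 'min_samples_leaf': [1, 2, 4]}
candidates = expand_grid(toy_grid)
print(len(candidates))  # 2 * 3 = 6 candidates, all of them evaluated

def toy_score(params):
    """Made-up stand-in for a cross-validated score (higher is better)."""
    return -abs(params['max_depth'] - 20) - abs(params['min_samples_leaf'] - 2)

best = max(candidates, key=toy_score)
print(best)  # {'max_depth': 20, 'min_samples_leaf': 2}
```

Because the cost grows with the product of the list lengths, an exhaustive search is only practical once the ranges have been narrowed.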

In [56]:
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [40, 45, 50, 55],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4],  # integer values must be >= 2; 1 is invalid for RandomForestRegressor
    'n_estimators': [1500, 1700, 1900, 2000]
}
# This will try out 1 * 4 * 3 * 3 * 4 = 144 combinations of settings.

grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=4, n_jobs=-1, verbose=2)

grid_search.fit(X_train,y_train)

Fitting 4 folds for each of 144 candidates, totalling 576 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed: 19.3min finished


{'bootstrap': True,
 'max_depth': 55,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'n_estimators': 2000}

In [57]:
grid_search.best_params_

{'bootstrap': True,
 'max_depth': 55,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'n_estimators': 2000}

In [58]:
grid_search.best_estimator_

RandomForestRegressor(max_depth=55, min_samples_leaf=3, n_estimators=2000)

In [28]:
model_rf = RandomForestRegressor(bootstrap=False, max_depth=20, max_features='sqrt',
                      min_samples_split=5, n_estimators=1800, random_state = 42)

In [29]:
model_rf.fit(X_train,y_train)


RandomForestRegressor(bootstrap=False, max_depth=20, max_features='sqrt',
                      min_samples_split=5, n_estimators=1800, random_state=42)

In [30]:
y_pred_rf=model_rf.predict(X_test)

In [31]:
score_rf=r2_score(y_test,y_pred_rf)
score_rf

0.8918445701400959

In [32]:
cv_score_rf=cross_val_score(model_rf,X_train,y_train, cv=10)
cv_score_rf


array([0.89526269, 0.85671338, 0.88706718, 0.84198439, 0.86526673,
       0.86515374, 0.88367444, 0.88414371, 0.86478079, 0.92012152])

In [33]:
cv_score_rf.mean()

0.8764168574214416

## Result

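- The XGBoost model reaches an R² score of 0.9025 on the held-out test set (mean 10-fold CV R² of 0.8867).
- The Random Forest model reaches an R² score of 0.8918 on the held-out test set (mean 10-fold CV R² of 0.8764).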
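
The scores above come from `r2_score`, which measures the share of variance in the target that the predictions explain; it is a regression metric, not a classification accuracy. A minimal sketch of the computation on made-up values (the function `r2` and the toy numbers are for illustration only):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [12.0, 12.5, 11.5, 12.0]   # toy targets, not the notebook's data

perfect = r2(y_true, y_true)        # exact predictions
baseline = r2(y_true, [12.0] * 4)   # always predicting the mean of y_true
poor = r2(y_true, [13.0] * 4)       # predictions worse than the mean

print(perfect, baseline, poor)  # 1.0 0.0 -8.0
```

A score of 1.0 is a perfect fit, 0.0 matches always predicting the mean, and negative values are possible, so the figure should not be read as a percentage of correct predictions.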