<a href="https://www.kaggle.com/code/mdistiakahmedkhan/xgboost-hypermeters-tuning-on-house-prices-dataset?scriptVersionId=105854449" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Predicting house prices with XGBoost, an ensemble technique

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm. It has proved to be a highly effective ML algorithm and is used extensively in machine learning competitions and hackathons.  
XGBoost has high predictive power and is often reported to be around 10 times faster than older gradient boosting implementations.

**Why is XGBoost heavily used in machine learning competitions?**

- Regularization: XGBoost includes built-in regularization, which helps to reduce overfitting.
- Parallel processing: XGBoost implements parallel processing and is faster than GBM. It also supports implementation on Hadoop.
- High flexibility: XGBoost allows users to define custom optimization objectives and evaluation criteria, adding a whole new dimension to the model.
- Handling missing values: XGBoost has a built-in routine to handle missing values.
- Tree pruning: XGBoost makes splits up to the max_depth specified, then prunes the tree backwards and removes splits beyond which there is no positive gain.
- Built-in cross-validation: XGBoost allows a user to run cross-validation at each iteration of the boosting process, so it is easy to get the optimum number of boosting iterations in a single run.

**Let's get started!**

## Check out the data 
We've been able to get some housing price data as a CSV file. Let's get our environment ready with the libraries we'll need and then import the data!



#### Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns 
from scipy import stats
from scipy.stats import norm, skew

import os
print(os.listdir("../input/house-prices-advanced-regression-techniques"))

In [None]:
train=pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test=pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

# EDA

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.columns 

In [None]:
test.columns

What is the data telling us? Before modelling, we need to analyse the data and understand what each column means.

Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

MSSubClass: The building class

MSZoning: The general zoning classification

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access

Alley: Type of alley access

LotShape: General shape of property

LandContour: Flatness of the property

Utilities: Type of utilities available

LotConfig: Lot configuration

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling

OverallQual: Overall material and finish quality

OverallCond: Overall condition rating

YearBuilt: Original construction date

YearRemodAdd: Remodel date

RoofStyle: Type of roof

RoofMatl: Roof material

Exterior1st: Exterior covering on house

Exterior2nd: Exterior covering on house (if more than one material)

MasVnrType: Masonry veneer type

MasVnrArea: Masonry veneer area in square feet

ExterQual: Exterior material quality

ExterCond: Present condition of the material on the exterior

Foundation: Type of foundation

BsmtQual: Height of the basement

BsmtCond: General condition of the basement

BsmtExposure: Walkout or garden level basement walls

BsmtFinType1: Quality of basement finished area

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Quality of second finished area (if present)

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

HeatingQC: Heating quality and condition

CentralAir: Central air conditioning

Electrical: Electrical system

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

BedroomAbvGr: Number of bedrooms above basement level

KitchenAbvGr: Number of kitchens

KitchenQual: Kitchen quality

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality rating

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

GarageType: Garage location

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

GarageCond: Garage condition

PavedDrive: Paved driveway

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Fence: Fence quality

MiscFeature: Miscellaneous feature not covered in other categories

MiscVal: Dollar value of miscellaneous feature

MoSold: Month Sold

YrSold: Year Sold

SaleType: Type of sale

SaleCondition: Condition of sale

## Data Visualization 

Let's start by visualizing the distribution of the target variable, SalePrice.

In [None]:
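# Fit a normal distribution to SalePrice
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Plot the distribution with the fitted normal curve on top
# (sns.distplot is deprecated, so histplot is used instead)
sns.histplot(train['SalePrice'], stat='density', kde=True)
x = np.linspace(train['SalePrice'].min(), train['SalePrice'].max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), color='black',
         label=r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma))
plt.legend(loc='best')
plt.ylabel('Density')
plt.title('SalePrice distribution')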
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show() 

As you can see, SalePrice is right-skewed. Ideally the target would be transformed so that it is closer to a normal distribution.
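One common way to reduce right skew in a price target is a log transform. The sketch below uses synthetic log-normal prices (an assumption, standing in for `train['SalePrice']`) and `np.log1p`; predictions made on the log scale would be mapped back to dollars with `np.expm1`. This notebook itself trains on the untransformed target.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" standing in for train['SalePrice'] (assumption)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# log1p computes log(1 + x); expm1 is its inverse
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)

skew_before = skew(prices)
skew_after = skew(log_prices)
print("skew before: {:.2f}, after: {:.2f}".format(skew_before, skew_after))
```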

# Feature selection

In [None]:
plt.figure(figsize=(30,8))
# numeric_only=True skips text columns (required in pandas >= 2.0)
sns.heatmap(train.corr(numeric_only=True),cmap='coolwarm',annot = True)
plt.show()

The plot above shows which numerical parameters are most correlated with each other and with SalePrice. We can pick the most correlated ones as features for our machine learning model.

In [None]:
corr = train.corr(numeric_only=True)  # numeric columns only

In [None]:
corr[corr['SalePrice']>0.2].index

Next we keep only the features whose correlation with SalePrice is greater than 0.2, and then draw the heat map again for these features.

In [None]:
train=train[['LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd',
       'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'SalePrice']]
test_id=test['Id']
test=test[['LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd',
       'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF']]

We dropped the columns whose correlation with SalePrice is 0.2 or less.
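The column list above was typed out by hand. As a sketch, the same kind of selection can be derived from the correlation matrix directly. The helper name `select_by_correlation` and the tiny synthetic frame are assumptions made for illustration; with the real data you would pass `train` instead.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the training data (assumption)
rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 300, n)
demo = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),
    "Street": ["Pave"] * n,  # non-numeric column, excluded before corr()
    "SalePrice": 100 * area + rng.normal(0, 5000, n),
})

def select_by_correlation(df, target="SalePrice", threshold=0.2):
    """Return numeric feature names whose correlation with target exceeds threshold."""
    corr_with_target = df.select_dtypes(include="number").corr()[target]
    selected = corr_with_target[corr_with_target > threshold].index
    return [c for c in selected if c != target]

features = select_by_correlation(demo)
print(features)
```

The returned list could then be used as `train[features + ['SalePrice']]` and `test[features]`, which keeps the two frames in sync if the threshold changes.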

In [None]:
plt.figure(figsize=(30,8))
sns.heatmap(train.corr(),cmap='coolwarm',annot = True)
plt.show()

In [None]:
train.info()

In [None]:
sns.lmplot(x='1stFlrSF',y='SalePrice',data=train) # 1stFlrSF appears strongly correlated with SalePrice.

In [None]:
plt.scatter(x= 'GrLivArea', y='SalePrice', data = train)

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(x='GarageCars',y='SalePrice',data=train)
plt.show()

In [None]:
sns.lmplot(x='OverallQual',y='SalePrice',data=train)

In [None]:
sns.lmplot(x='GarageArea',y='SalePrice',data=train)

In [None]:
plt.figure(figsize=(16,8))
sns.barplot(x='FullBath',y = 'SalePrice',data=train)
plt.show()

# Feature engineering
1. Handle missing data
2. Deal with categorical features
3. Select features most correlated with SalePrice

In [None]:
#missing data
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

**Dealing with missing values**

Imputing the missing values in the columns of the training dataset.

In [None]:
train_c=train.copy()
test_c=test.copy()


In [None]:
# Use SimpleImputer to replace missing values with the mean value of each column.
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
for col in (train_c.columns):
    train_c[col] = my_imputer.fit_transform(train_c[col].values.reshape(-1,1))

In [None]:
#missing data
train_c.isnull().sum()

We are going to do the same thing for the test data.

In [None]:
#missing data
total_test = test_c.isnull().sum().sort_values(ascending=False)
percent_test = (test_c.isnull().sum()/test_c.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_test, percent_test], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

In [None]:
# Imputation
my_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
for col in (test_c.columns):
    test_c[col] = my_imputer.fit_transform(test_c[col].values.reshape(-1,1))

In [None]:
#missing data
test_c.isnull().sum()

Now there are no missing values in either the test or the training data.
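The cells above fit a separate imputer on the test data, so the test set is filled with its own column means. A common alternative, sketched here on tiny made-up frames (an assumption, not the real data), is to learn the means from the training features once and reuse them for the test features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frames standing in for the train/test feature columns (assumption)
train_demo = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                           "GarageArea": [400.0, 500.0, 600.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 100.0],
                          "GarageArea": [np.nan, 300.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn the column means from the training data only...
train_filled = pd.DataFrame(imputer.fit_transform(train_demo), columns=train_demo.columns)
# ...and reuse those same means for the test data
test_filled = pd.DataFrame(imputer.transform(test_demo), columns=test_demo.columns)

print(test_filled)
```

With this approach both sets are filled with identical values for a given column, which keeps the test data consistent with what the model saw during training.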

In [None]:
train_c.head()

In [None]:
train_c.shape

In [None]:
test_c.head()

In [None]:
test_c.shape

## Dealing with categorical values

In [None]:
numerical_cols = [cname for cname in test.columns if test[cname].dtype in ['int64', 'float64']]
# Get list of categorical variables
object_cols = [cname for cname in test.columns if test[cname].dtype in ['object']]
print("Numerical columns:")
print(numerical_cols)
print("\n Object columns:")
print(object_cols)

**No object columns remain after the feature selection step.**

In [None]:
train=train_c.copy()

In [None]:
test=test_c.copy()

The features for the model were already picked above using the correlation matrix: we kept the ones most correlated with the sale price. We can now split the data.

**Train test split**

In [None]:
#Importing packages
from sklearn.model_selection import train_test_split

X = train.drop(columns="SalePrice")
y = train["SalePrice"]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

In [None]:
X_train.shape, X_test.shape

# XGBoost Parameter Tuning

**Parameters**

nthread:
This is used for parallel processing; enter the number of cores to use.
If you wish to run on all cores, do not set this value. The algorithm will detect it automatically.

eta:
Analogous to learning rate in GBM.
Makes the model more robust by shrinking the weights on each step.

min_child_weight:
Defines the minimum sum of weights of all observations required in a child.
Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.

max_depth:
It is used to define the maximum depth.
Higher depth will allow the model to learn relations very specific to a particular sample.

max_leaf_nodes:
The maximum number of terminal nodes or leaves in a tree.
Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
If this is defined, GBM will ignore max_depth.

gamma :
A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.

subsample:
Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
Lower values make the algorithm more conservative and prevent overfitting but values that are too small might lead to under-fitting.

colsample_bytree:
It is similar to max_features in GBM.
Denotes the fraction of columns to be randomly sampled for each tree.

**Explanation of relevant parameters for this kernel.**

- booster: Selects the type of model to run at each iteration
  - gbtree: tree-based models
  - gblinear: linear models
- nthread: Defaults to the maximum number of threads available if not set
- objective: Defines the loss function to be minimized

Parameters for controlling speed

- subsample: Denotes the fraction of observations to be randomly sampled for each tree
- colsample_bytree: Subsample ratio of columns when constructing each tree
- n_estimators: Number of trees to fit

Important parameters which control overfitting

- learning_rate: Makes the model more robust by shrinking the weights on each step
- max_depth: The maximum depth of a tree
- min_child_weight: Defines the minimum sum of weights of all observations required in a child

**General Approach**

1. Choose a relatively high learning rate. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called `cv` which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.

2. Tune tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.

3. Tune regularization parameters (lambda, alpha) for xgboost, which can help reduce model complexity and enhance performance.

4. Lower the learning rate and decide the optimal parameters.

In [None]:
#Importing Packages

from xgboost import XGBRegressor , plot_importance
from xgboost import XGBRFRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

## Tuning the hyper-parameters
**GridSearchCV params:**

- estimator: estimator object
- param_grid: dict or list of dictionaries
- scoring: A single string or a callable to evaluate the predictions on the test set. If None, the estimator’s score method is used. See https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
- n_jobs: Number of jobs to run in parallel. None means 1; -1 means using all processors.
- cv: Cross-validation strategy. None uses the default 5-fold cross-validation (3-fold in scikit-learn versions before 0.22). An integer specifies the number of folds in a (Stratified)KFold.

In [None]:
#XGBoost hyper-parameter tuning
def hyperParameterTuning(X_train, y_train,params):
    param_tuning = params
    xgb_model = XGBRegressor()

    gsearch = GridSearchCV(estimator = xgb_model,
                           param_grid = param_tuning,                  
                           scoring = 'neg_mean_absolute_error', #MAE
                           #scoring = 'neg_mean_squared_error',  #MSE
                           cv = 5,
                           n_jobs = -1,
                           verbose = 1)

    gsearch.fit(X_train,y_train)
    print(gsearch.best_params_, gsearch.best_score_)

    # Return the fitted search so the caller can inspect cv_results_ or best_estimator_
    return gsearch

**Step 1: Fix learning rate and number of estimators for tuning tree-based parameters**
1. max_depth = 5 : This should be between 3-10. I’ve started with 5 but you can choose a different number as well. 4-6 can be good starting points.
2. min_child_weight = 1 : A small starting value (the XGBoost default), which allows leaf nodes with small groups of observations. It is tuned in Step 2.
3. gamma = 0 : A small value like 0.1-0.2 can also be chosen for starting. This will be tuned later anyway.
4. subsample, colsample_bytree = 0.8 : This is a commonly used start value. Typical values range between 0.5-0.9.
5. scale_pos_weight = 1 : The default. This parameter matters only for imbalanced classification problems, so it is left unchanged for this regression task.


In [None]:
#Run only in the first run of the kernel.
params={
    'learning_rate': [0.1],
     'n_estimators': [200,400,600,800,1000],
     'max_depth':[5],
     'min_child_weight': [1],
     'gamma':[0],
     'subsample':[0.8],
     'colsample_bytree':[0.8],
     'objective': ['reg:squarederror'],
     'nthread':[4],
     'scale_pos_weight' :[1],
     'seed':[27]
 }

hyperParameterTuning(X_train, y_train,params)

Here we got 200 for n_estimators, which was the lowest value in our grid. The optimum n_estimators could therefore be less than 200, so we should also check lower values.

In [None]:
param_test2 = {
 'n_estimators': range (55,200,5)
}
gsearch2 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=200, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'reg:squarederror', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch2.fit(X_train,y_train)
gsearch2.best_params_, gsearch2.best_score_ 

Here, we get the optimum value of 60 for n_estimators.

**Step 2: Tune max_depth and min_child_weight**

We started with 5 for max_depth and 1 for min_child_weight. Let's go one step deeper and look for the optimum values. We’ll search both parameters over the values 3 to 9 in steps of one.

In [None]:
param_test3 = {
    'max_depth':range(3,10,1),
    'min_child_weight': range(3,10,1)
}
gsearch3 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'reg:squarederror', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test3, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch3.fit(X_train,y_train)
gsearch3.best_params_, gsearch3.best_score_

Here, we get the optimum values as 6 for max_depth and 9 for min_child_weight. 

**Step 3: Tune gamma**

In [None]:
param_test4 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch4 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'reg:squarederror', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch4.fit(X_train,y_train)
gsearch4.best_params_, gsearch4.best_score_ 

Here, we get the optimum value of 0 for gamma.

**Step 4: Tune subsample and colsample_bytree**

In [None]:
param_test5 = {
 'subsample':[i/10.0 for i in range(4,10)],
 'colsample_bytree':[i/10.0 for i in range(4,10)]
}
gsearch5 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'reg:squarederror', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test5, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch5.fit(X_train,y_train)
gsearch5.best_params_, gsearch5.best_score_

Here, we get the optimum values as 0.8 for colsample_bytree and 0.8 for subsample. Now we refine the search at 0.05 intervals; the grid in the next cell covers 0.75 and 0.8.
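A fine grid centred on the best coarse value can be generated rather than typed by hand. The helper `fine_grid` below is hypothetical (it is not part of XGBoost or scikit-learn) and is only a sketch of the idea. Note that `range(75, 85, 5)` in the next cell yields only 75 and 80, so 0.85 is not searched there.

```python
def fine_grid(center, step, n_steps=1, lower=0.0, upper=1.0):
    """Values at `step` intervals within n_steps of center, kept inside (lower, upper]."""
    # round() removes floating-point noise such as 0.8500000000000001
    values = [round(center + k * step, 10) for k in range(-n_steps, n_steps + 1)]
    return [v for v in values if lower < v <= upper]

subsample_grid = fine_grid(0.8, 0.05)
colsample_grid = fine_grid(0.8, 0.05)
print(subsample_grid, colsample_grid)
```

The two lists could be passed as the `subsample` and `colsample_bytree` entries of a `param_grid` to search on both sides of the coarse optimum.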

In [None]:
param_test6 = {
 'subsample':[i/100.0 for i in range(75,85,5)],
 'colsample_bytree':[i/100.0 for i in range(75,85,5)]
}
gsearch6 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'reg:squarederror', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch6.fit(X_train,y_train)
gsearch6.best_params_, gsearch6.best_score_

Here, we get the optimum values as 0.75 for colsample_bytree and 0.8 for subsample.

**Step 5: Tuning Regularization Parameters**

In [None]:
param_test7 = {
 'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.75,
 objective= 'reg:squarederror', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test7, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch7.fit(X_train,y_train)
gsearch7.best_params_, gsearch7.best_score_ 

Here, we get the optimum value of 0.05 for reg_alpha.

**Step 6: Tuning Seed**

In [None]:
param_test8 = {
  'seed':range(25,35,1)
}
gsearch8 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.75,
 objective= 'reg:squarederror', reg_alpha=0.05,nthread=1, scale_pos_weight=1,seed=27), 
 param_grid = param_test8, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch8.fit(X_train,y_train)
gsearch8.best_params_, gsearch8.best_score_ 

Here, we get the optimum value of 27 for seed.

**Step 7: Reducing learning rate and increasing n_estimators**

In [None]:
param_test9 = {
  'learning_rate':[0.01,0.001],
  'n_estimators':[1000,2000,500,5000]
}
gsearch9 = GridSearchCV(estimator = XGBRegressor( learning_rate=0.1, n_estimators=60, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.75,
 objective= 'reg:squarederror', reg_alpha=0.05,nthread=1, scale_pos_weight=1,seed=27), 
 param_grid = param_test9, scoring='neg_mean_absolute_error',n_jobs=4, cv=5)
gsearch9.fit(X_train,y_train)
gsearch9.best_params_, gsearch9.best_score_ 

Here, we get the optimum values as 0.01 for learning_rate and 500 for n_estimators.

# Train and fit model

In [None]:
# learning_rate=0.01 and n_estimators=500 are the optimum values found in Step 7
xgb_model = XGBRegressor( learning_rate=0.01, n_estimators=500, max_depth=6,
 min_child_weight=9, gamma=0, subsample=0.8, colsample_bytree=0.75,
 objective= 'reg:squarederror', reg_alpha=0.05,nthread=1, scale_pos_weight=1,seed=27)

In [None]:
xgb_model.fit(X_train,y_train)

In [None]:
predictions=xgb_model.predict(X_test)

# Evaluate

In [None]:
import numpy as np
from sklearn import metrics

In [None]:
# Print model report
print("\nModel Report")
print('MAE:', metrics.mean_absolute_error(y_test,predictions))
print('MSE:', metrics.mean_squared_error(y_test,predictions))
# Plot feature importance
plot_importance(xgb_model)
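MSE is expressed in squared dollars, which is hard to read. Its square root (RMSE) is in dollars again. The House Prices competition scores submissions on the RMSE between the logarithms of predicted and observed prices, so that metric is worth tracking locally as well. The sketch below uses made-up numbers (an assumption) and two small helper functions written for illustration; with the real model you would pass `y_test` and `predictions`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_of_logs(y_true, y_pred):
    """RMSE between log prices; assumes all prices are strictly positive."""
    return rmse(np.log(y_true), np.log(y_pred))

# Made-up prices for illustration only (assumption)
y_true_demo = [100000.0, 200000.0, 300000.0]
y_pred_demo = [110000.0, 190000.0, 300000.0]

rmse_value = rmse(y_true_demo, y_pred_demo)
log_rmse_value = rmse_of_logs(y_true_demo, y_pred_demo)
print('RMSE:', rmse_value)
print('RMSE of logs:', log_rmse_value)
```

Working on the log scale means that an error of 10% counts the same for a cheap house as for an expensive one.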


# Create submission file

In [None]:
predictions_test =xgb_model.predict(test)

In [None]:
submission = pd.DataFrame({'Id':test_id,'SalePrice':predictions_test})
submission.to_csv("submission.csv",index=False)

In [None]:
submission.head()