# What is XGBoost

XGBoost is the leading model for working with standard tabular data most likely in Pandas DataFrames, XGBoost models dominate many Kaggle competitions and top players tuning hyperparameter for higher accuracy.

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm (scikit-learn has another version of this algorithm, but XGBoost has some technical advantages.) 
What is Gradient Boosted Decision Trees? 

![](img/xgboost.png)

We go through cycles that repeatedly builds new models and combines them into an ensemble model. We start the cycle by calculating the errors for each observation in the dataset. We then build a new model to predict those. We add predictions from this error-predicting model to the "ensemble of models."

To make a prediction, we add the predictions from all previous models. We can use these predictions to calculate new errors, build the next model, and add it to the ensemble.

There's one piece outside that cycle. We need some base prediction to start the cycle. In practice, the initial predictions can be pretty naive. Even if it's predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.

This process may sound complicated, but the code to use it is straightforward. 


# Model Tuning in XGBoost

XGBoost has a few parameters that can dramatically affect your model's accuracy and training speed. 

1.	n_estimators: specifies how many times to go through the modeling cycle described above
(possible range: 100-1000)
2.	early_stopping_rounds: stop iterating when the validation score stops improving
3.	learning_rate: weight adding to each prediction from each model, help to reduce the model's propensity to overfit.
4.	n_jobs: set tp number of cores in computer for runtime reduction for large dataset
In the underfitting vs overfitting graph, n_estimators moves you further to the right. Too low a value causes underfitting, which is inaccurate predictions on both training data and new data. Too large a value causes overfitting, which is accurate predictions on training data, but inaccurate predictions on new data (which is what we care about). You can experiment with your dataset to find the ideal. Typical values range from 100-1000, though this depends a lot on the learning rate discussed below.


# Installation of xgboost

since xgboost is not maintained actively in pip environment, the preferred way to install xgboost is using anaconda installation:

https://anaconda.org/conda-forge/xgboost

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer

In [2]:
data = pd.read_csv('data/train.csv')
data.dropna(axis=0,subset=['SalePrice'], inplace = True)

In [3]:
y = data.SalePrice
X = data.drop(['SalePrice'],axis=1).select_dtypes(exclude=['object'])

In [4]:
train_X, test_X, train_y, test_y = train_test_split(X.as_matrix(),y.as_matrix(),test_size = 0.3)

  """Entry point for launching an IPython kernel.


In [5]:
my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.fit_transform(test_X)

In [6]:
from xgboost import XGBRegressor

In [7]:
xgb_model = XGBRegressor()

In [8]:
xgb_model.fit(train_X, train_y, verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [9]:
predictions = xgb_model.predict(test_X)

In [10]:
print(predictions[1])

195479.48


In [11]:
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error : 15028.021261415524


In [12]:
xgb_model_2 = XGBRegressor(n_estimators=1000)

xgb_model_2.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=True)


[0]	validation_0-rmse:167624
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:151612
[2]	validation_0-rmse:137175
[3]	validation_0-rmse:124214
[4]	validation_0-rmse:112826
[5]	validation_0-rmse:102644
[6]	validation_0-rmse:93437.7
[7]	validation_0-rmse:85157.3
[8]	validation_0-rmse:77680.4
[9]	validation_0-rmse:70976.5
[10]	validation_0-rmse:65321.9
[11]	validation_0-rmse:60057.7
[12]	validation_0-rmse:55383.8
[13]	validation_0-rmse:51155.4
[14]	validation_0-rmse:47516.6
[15]	validation_0-rmse:44415.5
[16]	validation_0-rmse:41906.6
[17]	validation_0-rmse:39774.8
[18]	validation_0-rmse:37929
[19]	validation_0-rmse:36125.6
[20]	validation_0-rmse:34631.9
[21]	validation_0-rmse:33579.3
[22]	validation_0-rmse:32460.4
[23]	validation_0-rmse:31500.8
[24]	validation_0-rmse:30623.7
[25]	validation_0-rmse:29795.5
[26]	validation_0-rmse:29232
[27]	validation_0-rmse:28895.3
[28]	validation_0-rmse:28517.7
[29]	validation_0-rmse:28378.8
[30]	validation_0-rmse:281

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [13]:
predictions = xgb_model_2.predict(test_X)

In [14]:
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error : 15423.398473173516


In [15]:
xgb_model_3 = XGBRegressor(n_estimators=1000, learning_rate=0.05)
xgb_model_3.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=True)


[0]	validation_0-rmse:176564
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:168030
[2]	validation_0-rmse:160086
[3]	validation_0-rmse:152396
[4]	validation_0-rmse:145117
[5]	validation_0-rmse:138247
[6]	validation_0-rmse:131690
[7]	validation_0-rmse:125551
[8]	validation_0-rmse:119710
[9]	validation_0-rmse:114249
[10]	validation_0-rmse:108945
[11]	validation_0-rmse:104016
[12]	validation_0-rmse:99316.5
[13]	validation_0-rmse:94881.4
[14]	validation_0-rmse:90544.4
[15]	validation_0-rmse:86486.2
[16]	validation_0-rmse:82646.3
[17]	validation_0-rmse:79053.8
[18]	validation_0-rmse:75640.9
[19]	validation_0-rmse:72444.1
[20]	validation_0-rmse:69495
[21]	validation_0-rmse:66645.9
[22]	validation_0-rmse:63912.4
[23]	validation_0-rmse:61388.1
[24]	validation_0-rmse:58943
[25]	validation_0-rmse:56634.6
[26]	validation_0-rmse:54488
[27]	validation_0-rmse:52548.7
[28]	validation_0-rmse:50582.4
[29]	validation_0-rmse:48787.6
[30]	validation_0-rmse:47200.9
[31

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [16]:
predictions = xgb_model_3.predict(test_X)

In [17]:
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

Mean Absolute Error : 15344.130029965754
