## About XGBoost
XGBoost is an advanced implementation of the gradient boosting algorithm. 

XGBoost includes regularisation to avoid overfitting and uses parallel computing to improve performance.
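
For reference, the regularised objective can be written as below, following the XGBoost paper. The symbols are notation assumed for this note rather than names used in the notebook: $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree and $w$ is its vector of leaf weights. $\gamma$ and $\lambda$ correspond to the gamma and lambda parameters, which this notebook leaves at their defaults.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```

The second term penalises trees with many leaves or large leaf weights, which is how the regularisation discourages overfitting.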

Here I am using it on its own, without any exploratory data analysis (EDA), because it has a built-in routine to handle missing values. We can supply a value that marks missing entries and pass it as a parameter; XGBoost then tries both branches at each node where it encounters missing values and learns which path to take for missing values in the future.
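
The idea of learning a path for missing values can be sketched without the library. The example below is a simplified illustration, not XGBoost's actual implementation: it assumes a squared-error loss with a starting prediction of 0 (so each gradient is `-y` and each hessian is 1), evaluates a single fixed threshold, and uses made-up data. The function names are hypothetical.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from gradient/hessian sums (the 1/2 factor and gamma are omitted)."""
    def score(g, h):
        return g * g / (h + lam)
    return (score(g_left, h_left) + score(g_right, h_right)
            - score(g_left + g_right, h_left + h_right))


def best_default_direction(rows, threshold, lam=1.0):
    """rows: list of (feature_value_or_None, gradient, hessian).

    Sends the missing rows left, then right, and returns the direction
    with the larger gain together with that gain.
    """
    gl = hl = gr = hr = gm = hm = 0.0
    for x, g, h in rows:
        if x is None:          # missing value: accumulate separately
            gm += g
            hm += h
        elif x < threshold:    # present value going left
            gl += g
            hl += h
        else:                  # present value going right
            gr += g
            hr += h
    gain_left = split_gain(gl + gm, hl + hm, gr, hr, lam)
    gain_right = split_gain(gl, hl, gr + gm, hr + hm, lam)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


# Made-up data: (feature value, target). Gradient is -target, hessian is 1.
data = [(1, 1), (2, 1), (8, 10), (9, 10), (None, 10), (None, 9)]
rows = [(x, -float(y), 1.0) for x, y in data]
direction, gain = best_default_direction(rows, threshold=5)
```

Here the rows with missing values have targets close to those on the right-hand side of the split, so sending them right gives the larger gain and becomes the default direction for that node.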

XGBoost will make splits on nodes up to the max_depth parameter specified, then it will prune the tree backwards and remove any splits that yield no positive gain.
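
The gain of a candidate split can be made concrete with the formula from the XGBoost paper. The notation is assumed for this note: $G_L, H_L$ and $G_R, H_R$ are the sums of the first and second derivatives of the loss over the rows sent to the left and right child, $\lambda$ is the L2 penalty on leaf weights and $\gamma$ is the minimum loss reduction required to keep a split.

```latex
\mathrm{Gain} = \frac{1}{2} \left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```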

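XGBoost has a built-in cross-validation routine that evaluates the model at each iteration of the boosting process, which makes it easy to find a suitable number of boosting rounds.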

As a comparison, please see my other notebook with EDA and a random forest.

## Beginning of routine
We start by importing the various libraries we are going to use.
We just need four in this example. 

In [None]:
import numpy as np # mathematical library including linear algebra
import pandas as pd #data processing and CSV file input / output
from sklearn import model_selection, preprocessing # sklearn is the machine learning library
import xgboost as xgb # this is the extreme gradient boosting library

## Read in data
Now we read in the training and test data. We also read in the macroeconomic variables. We are using the pandas "read_csv" function for this.

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
macro = pd.read_csv('../input/macro.csv')
id_test = test.id

## Set our response variables and perform any data modifications
We set y_train to be the price_doc variable, which is the value we are required to predict.
We then drop id, timestamp and price_doc from the training set so that the remaining columns can be used as features for the prediction.

To be consistent we also drop id and timestamp from our test data set.

Normally we would do both of these together by combining our train and test sets for data wrangling, but in this instance doing so affects performance severely.

The modification to the training price_doc reflects the movement in house prices
between the period covered by the training set and that covered by the test set: we scale the training prices to keep the two consistent.

In [None]:
y_train = train["price_doc"] * .969 + 10
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)
x_test = test.drop(["id", "timestamp"], axis=1)

## Fitting the model
We run through each column in the training set and, where the column holds non-numerical (object) values, encode it. We do this using the preprocessing.LabelEncoder() function from the sklearn library.
This function takes a list of values and transforms non-numerical labels to numerical values. We require our labels to have numerical values for use in most algorithms and, in particular, the XGBoost algorithm.

We then repeat the process for the test set - again we would normally do this on a combined test / train for consistency.
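
The consistency issue mentioned above can be seen in a small dependency-free sketch. It mimics the label encoder by assigning integer codes to the distinct labels in sorted order; the helper name and the example column values are hypothetical and only serve as illustration.

```python
def fit_label_mapping(values):
    """Assign integer codes to the distinct labels, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Hypothetical categorical column, made up for illustration.
train_col = ["Investment", "OwnerOccupier", "Investment"]
test_col = ["OwnerOccupier", "OwnerOccupier"]

# Separate mappings per data set: the same label can receive different codes.
train_only = fit_label_mapping(train_col)
test_only = fit_label_mapping(test_col)

# One mapping fitted on the union of both columns: codes agree across the sets.
shared = fit_label_mapping(train_col + test_col)
train_codes = [shared[v] for v in train_col]
test_codes = [shared[v] for v in test_col]
```

With separate mappings, "OwnerOccupier" is coded 1 in the training column but 0 in the test column, so the model would see a different value at prediction time than it saw during training. The shared mapping avoids this.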

In [None]:
for c in x_train.columns:
    if x_train[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder() # create an instance of the label encoder
        lbl.fit(list(x_train[c].values)) # fit it to the values in this training set column
        x_train[c] = lbl.transform(list(x_train[c].values)) # replace the values with their encoded labels

for c in x_test.columns:
    if x_test[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder() # create an instance of the label encoder
        lbl.fit(list(x_test[c].values)) # fit it to the values in this test set column
        x_test[c] = lbl.transform(list(x_test[c].values)) # replace the values with their encoded labels

# Set the parameters for xgboost as follows:

## Booster parameters
These parameters are used to optimise the algorithm in terms of both accuracy and performance.

**eta: 0.05** - the default value for this parameter is 0.3. It is analogous to the learning rate (alpha) in gradient descent: it makes the model more robust by shrinking the weights on each step. Typical final values range from 0.01-0.2.

**max_depth: 5** - the default here is 6. It sets the maximum depth of a tree and is used to control over-fitting, as a higher depth allows the model to learn relations very specific to a particular sample. We tune it using cross-validation. Typical values range from 3-10.

**subsample: 0.7** - the default here is 1. It denotes the fraction of observations to be randomly sampled for each tree. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small may lead to under-fitting. Valid values lie in (0, 1]; typical values range from 0.5-1.

**colsample_bytree: 0.7** - the default here is 1. It denotes the fraction of columns to be randomly sampled for each tree. Typical values range from 0.5-1.
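
To see why a small eta usually needs more boosting rounds, here is a toy sketch of shrinkage. It is not how XGBoost builds trees: it assumes a single target value and an "ideal" tree that predicts the current residual exactly, so that only the effect of eta remains. The function name is hypothetical.

```python
def boosted_prediction(target, eta, rounds):
    """Repeatedly add eta times the current residual, starting from a prediction of 0."""
    prediction = 0.0
    for _ in range(rounds):
        residual = target - prediction  # what an ideal new tree would predict
        prediction += eta * residual    # shrink the tree's contribution by eta
    return prediction


slow = boosted_prediction(100.0, eta=0.05, rounds=10)
fast = boosted_prediction(100.0, eta=0.3, rounds=10)
slow_long = boosted_prediction(100.0, eta=0.05, rounds=200)
```

After 10 rounds the prediction with eta 0.05 has covered only about 40% of the distance to the target, against about 97% with eta 0.3. With 200 rounds the smaller eta gets there too, in smaller and more cautious steps.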

## Learning Task Parameters
These parameters are used to define the optimisation metric to be calculated at each step.

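**'objective': 'reg:linear'** sets the learning task to regression with a squared-error loss.

**'eval_metric': 'rmse'** sets our evaluation metric to root mean squared error.
The evaluation metric used to score submissions in this competition is the root mean squared logarithmic error (RMSLE); however, this option was not available to us within xgboost at the time of writing, so RMSE is the closest match.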
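
The difference between the two metrics can be illustrated with a short standard-library sketch. The prices below are made up, and the helper functions are written here for illustration rather than taken from xgboost.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices: every prediction is too high by the same absolute amount.
actual = [1_000_000.0, 5_000_000.0, 20_000_000.0]
predicted = [1_500_000.0, 5_500_000.0, 20_500_000.0]

plain_error = rmse(actual, predicted)
log_error = rmsle(actual, predicted)

# RMSLE is the same as RMSE computed on log1p-transformed values.
log_error_via_rmse = rmse([math.log1p(a) for a in actual],
                          [math.log1p(p) for p in predicted])
```

RMSE treats the three errors identically, whereas RMSLE penalises the miss on the cheapest property most because it measures relative rather than absolute error. The last lines show that RMSLE equals RMSE on log1p-transformed prices, which is why RMSE is a reasonable stand-in here.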

## General parameters
**booster** - left at default by not setting it, which means we are using a tree-based model. It can also be set to use linear models.

**silent: 1** - this defaults to 0 and is a binary switch. When set to 0 running messages will be printed which may help to understand the model. It can be set to 1 to suppress running messages.

In [None]:
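xgb_params = {
    'eta': 0.05, # learning rate
    'max_depth': 5, # maximum depth of each tree
    'subsample': 0.7, # fraction of rows sampled for each tree
    'colsample_bytree': 0.7, # fraction of columns sampled for each tree
    'objective': 'reg:linear', # squared-error regression (named 'reg:squarederror' in newer xgboost releases)
    'eval_metric': 'rmse', # root mean squared error
    'silent': 1 # suppress running messages
}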

## Import the train and test sets to XGBoost and create a cross-validation set
Format the train and test sets we modified above for use in xgboost.
(DMatrix is the data format required by the xgboost library.)

In [None]:
dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test)

Run the cross-validation and define the early stopping criterion.
The num_boost_round parameter sets the maximum number of iterations of the algorithm.
Here it is set to just 200 to speed up the run, but in practice
we should set it to something like 1000.
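
The early stopping rule can be sketched in plain Python. This is a simplified illustration of the logic under stated assumptions, not the library's code: the validation scores are made up, lower is taken to be better, and the function name is hypothetical.

```python
def rounds_kept(scores, patience):
    """Return how many rounds are kept when stopping after `patience`
    consecutive rounds without improvement.

    scores: validation error per boosting round (lower is better).
    """
    best_round = 0
    for i, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = i  # new best round
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds: stop
    return best_round + 1  # rounds up to and including the best one


# Made-up validation errors per round.
scores = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.8, 6.4]
kept_short = rounds_kept(scores, patience=3)
kept_long = rounds_kept(scores, patience=5)
```

With a patience of 3 the loop stops before it reaches the late improvement in the final round and keeps 4 rounds; with a patience of 5 it sees that improvement and keeps all 8. A larger patience costs more computation but is less likely to stop on a temporary plateau.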

In [None]:
cv_output = xgb.cv(xgb_params, dtrain, num_boost_round=200, early_stopping_rounds=20,
    verbose_eval=50, show_stdv=False)

## Train the algorithm

In [None]:
num_boost_rounds = len(cv_output) # number of rounds kept by cross-validation after early stopping
model = xgb.train(dict(xgb_params), dtrain, num_boost_round=num_boost_rounds)

## Now we can make a prediction of house prices in our test cases

In [None]:
y_predict = model.predict(dtest)


## Store our predictions for submission
We need to submit our results in a prescribed format: two columns containing the id and the price.
First we format the output, then we write the formatted data to a CSV file for submission.

In [None]:
output = pd.DataFrame({'id': id_test, 'price_doc': y_predict})

output.to_csv('xgbSub.csv', index=False)