# Chapter 2: XGBoost (DataCamp)

Course notes from the [DataCamp](https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost) XGBoost course.<br>
Installing xgboost for Anaconda on Windows can require some extra [setup](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=en).

## Comparison of Linear and DT Base Learners for XGBoost Regressor

Linear Base Learner:
- Sum of linear terms 
- Boosted model is weighted sum of linear models (thus is itself linear) 
- Rarely used 

Tree Base Learner:
- Decision tree 
- Boosted model is weighted sum of decision trees (nonlinear) 
- Almost exclusively used in XGBoost
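
The contrast between the two base learners can be checked with a small numeric sketch. It uses plain Python rather than XGBoost, and the learners, weights, and stump values are made-up illustrations, not anything produced by the library. A weighted sum of linear models collapses into one linear model, while a weighted sum of even the simplest trees (one-split stumps) is piecewise constant and therefore nonlinear.

```python
# Hypothetical 1-D linear base learners, each stored as (slope, intercept).
linear_learners = [(2.0, 1.0), (-0.5, 3.0), (1.5, -2.0)]
learning_rate = 0.3  # same weight for every learner, for simplicity


def linear_ensemble(x):
    """Weighted sum of the individual linear predictions."""
    return sum(learning_rate * (a * x + b) for a, b in linear_learners)


# The same ensemble collapsed into a single slope and intercept.
slope = sum(learning_rate * a for a, _ in linear_learners)
intercept = sum(learning_rate * b for _, b in linear_learners)


def collapsed_linear(x):
    return slope * x + intercept


def make_stump(threshold, left_value, right_value):
    """A one-split decision tree: constant on each side of the threshold."""
    def stump(x):
        return left_value if x < threshold else right_value
    return stump


# Hypothetical stumps, also made up for illustration.
stumps = [make_stump(0.0, -1.0, 1.0), make_stump(2.0, 0.0, 3.0)]


def tree_ensemble(x):
    """Weighted sum of stumps: a step function, not a straight line."""
    return sum(0.5 * s(x) for s in stumps)


for x in (-1.0, 1.0, 3.0):
    print(x, linear_ensemble(x), collapsed_linear(x), tree_ensemble(x))
```

The two linear columns agree at every input. The tree column does not lie on a straight line: its value at x = 1 differs from the midpoint of its values at x = -1 and x = 3.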

### XGBoost Regressor using Decision Trees as Base Learners
Build an XGBoost model to predict house prices in Ames, Iowa. 

In this exercise, the goal is to use trees as base learners. XGBoost uses trees as base learners by default, so specifying booster="gbtree" is optional; the code below passes it explicitly only for clarity.

In [14]:
# may be required as xgboost import throws errors 
# import os
# mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-7.2.0-posix-seh-rt_v5-rev1\\mingw64\\bin'
# os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

In [15]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [16]:
# Load data Ames, Iowa dataset from DataCamp's AWS url
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

In [17]:
# Create df for the features and the target: X, y
X, y = housing_data.iloc[:,:-1], housing_data.iloc[:,-1]

In [18]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

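# Instantiate the XGBRegressor: xg_reg
# ('reg:squarederror' replaces the deprecated 'reg:linear' alias; random_state replaces seed)
xg_reg = xgb.XGBRegressor(n_estimators=10, objective='reg:squarederror', booster='gbtree', random_state=123)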

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)
                      
# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test,preds))

print("DT Boosted Linear Regressor RMSE: %f" % (rmse))

DT Boosted Linear Regressor RMSE: 78847.401758


### XGBoost Regressor using Linear Regression as Base Learner

This model, although not as commonly used in XGBoost, allows one to create a regularized linear regression using XGBoost's powerful learning API. Here it is built with XGBoost's native, non-scikit-learn functions, such as xgb.train().

To do this, create a parameter dictionary that describes the kind of booster to use (similar to the dictionary created in Chapter 1 for xgb.cv()). The key-value pair that selects the booster type (base model) is "booster":"gblinear".

Once the model is trained, call its .predict() method on a DMatrix to generate predictions.

In [23]:
# Convert the training and testing sets (from the split above) into DMatrixes: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
# ("reg:squarederror" replaces the deprecated "reg:linear" alias)
params = {"booster": "gblinear", "objective": "reg:squarederror"}

# Train the model: xg_reg
xg_reg = xgb.train(params = params, dtrain = DM_train, num_boost_round = 10)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("Linear Boosted Linear Regression RMSE: %f" % (rmse))

Linear Boosted Linear Regression RMSE: 40719.741641


### Evaluating Model Quality using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)

Compare the RMSE and MAE of a cross-validated XGBoost model on the housing data. 
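
Both metrics summarize prediction error, but they weight individual errors differently: RMSE squares each error before averaging, so a single large miss dominates, while MAE averages the absolute errors directly. A minimal sketch in plain Python, with made-up prices that are not taken from the Ames data:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square, average, then take the root."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Hypothetical house prices: three predictions miss by 10,000, one by 40,000.
y_true = [200_000, 150_000, 320_000, 90_000]
y_pred = [210_000, 140_000, 310_000, 130_000]

print("MAE:  %.2f" % mae(y_true, y_pred))
print("RMSE: %.2f" % rmse(y_true, y_pred))
```

RMSE is never smaller than MAE on the same errors, and the gap widens as the errors become more uneven, which is one reason the two cross-validated numbers for the same model differ.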

In [24]:
# RMSE
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# ("reg:squarederror" replaces the deprecated "reg:linear" alias)
params = {"objective": "reg:squarederror", "max_depth": 4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print the test metric from the final boosting round
print("\nFinal-round CV RMSE: %f" % cv_results["test-rmse-mean"].iloc[-1])

   test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0   142980.433594    1193.791602    141767.531250      429.454591
1   104891.394532    1223.158855    102832.544922      322.469930
2    79478.937500    1601.344539     75872.615235      266.475960
3    62411.920899    2220.150028     57245.652343      273.625086
4    51348.279297    2963.377719     44401.298828      316.423666

Final-round CV RMSE: 51348.279297


In [25]:
# MAE
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# ("reg:squarederror" replaces the deprecated "reg:linear" alias)
params = {"objective": "reg:squarederror", "max_depth": 4}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics='mae', as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Extract and print the test metric from the final boosting round
print("\nFinal-round CV MAE: %f" % cv_results["test-mae-mean"].iloc[-1])

   test-mae-mean  test-mae-std  train-mae-mean  train-mae-std
0  127634.000000   2404.009898   127343.482421     668.308109
1   90122.501953   2107.912810    89770.056641     456.965267
2   64278.558594   1887.567576    63580.791016     263.404950
3   46819.168945   1459.818607    45633.155274     151.883420
4   35670.646484   1140.607452    33587.090820      86.999396

Final-round CV MAE: 35670.646484
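
Each row in the tables above summarizes one boosting round across the 4 folds: a mean and a standard deviation of the per-fold metric. The sketch below shows that kind of aggregation on made-up per-fold RMSE values. The numbers are hypothetical, and the use of the population standard deviation is an assumption about how the spread is computed, not something confirmed by these notes.

```python
import statistics

# Hypothetical test RMSE from each of 4 folds at a single boosting round.
fold_rmse = [51_000.0, 49_000.0, 53_000.0, 52_000.0]

# Aggregate across folds, in the spirit of the "-mean" and "-std" columns.
rmse_mean = statistics.mean(fold_rmse)
rmse_std = statistics.pstdev(fold_rmse)  # population std (assumed convention)

print("mean: %.2f, std: %.2f" % (rmse_mean, rmse_std))
```

A small standard deviation relative to the mean suggests the score is stable across folds. A growing gap between the train and test means over rounds, as in both tables, is a sign the model is starting to fit the training folds more closely than the held-out ones.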
