# Motivation

I have been "Kaggling" (that can be a verb, right?) since before **GBM** was cool. An overtrained set of **GBM** models got me a 2nd place finish in [Dunnhumby's Shopper Challenge](https://www.kaggle.com/c/dunnhumbychallenge/leaderboard). Since then I have spent my time on projects that didn't require boosted algorithms. Now that I have some free time, I want to explore **xgboost** (the GBM killer) and its new challenger **lightGBM**. This notebook compares the speed and accuracy of each model and discusses any observations I have along the way.

# Boring Setup

In [None]:
# Libraries
library(pROC, quietly=TRUE)
library(microbenchmark, quietly=TRUE)

# Set seed so the train/test split is reproducible
set.seed(42)

# Read in the data and split it into train/test subsets
credit.card.data = read.csv("../input/creditcard.csv")

train.test.split <- sample(2
	, nrow(credit.card.data)
	, replace = TRUE
	, prob = c(0.7, 0.3))
train = credit.card.data[train.test.split == 1,]
test = credit.card.data[train.test.split == 2,]
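
Because `sample()` assigns each row independently, the 70/30 split above is only approximate. A minimal sketch on a toy index vector (the row count `n.rows` is an arbitrary stand-in, not the size of the credit card data) shows one way to check the realized proportions:

```r
# Toy stand-in for nrow(credit.card.data); the value is arbitrary
n.rows <- 10000

set.seed(42)
# Same call pattern as the notebook: each row lands in group 1 (train)
# with probability 0.7 and in group 2 (test) with probability 0.3
toy.split <- sample(2
	, n.rows
	, replace = TRUE
	, prob = c(0.7, 0.3))

# Realized share of rows in each group; close to, but not exactly, 0.7 / 0.3
split.share <- prop.table(table(toy.split))
print(split.share)
```

If exact proportions matter, sampling row indices without replacement would be one alternative, but that is not what this notebook does.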

# Feature Creation

This section is empty. Converting the time values to hour or day would probably improve the accuracy, but that is not the purpose of this kernel.
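
As a rough sketch of that idea, and assuming `Time` holds the seconds elapsed since the first transaction in the data set, hour and day features could be derived with integer division. The helper `add.time.features` and the column names `Hour` and `Day` are hypothetical, and the hour is relative to the first transaction rather than to wall-clock midnight.

```r
# Hypothetical helper: derive hour-of-day and day-index columns from a
# Time column measured in seconds (an assumption about the data)
add.time.features <- function(df) {
	df$Hour <- (df$Time %/% 3600) %% 24  # 3600 seconds per hour, wrapped to 0..23
	df$Day <- df$Time %/% 86400          # 86400 seconds per day, counted from 0
	df
}

# Toy data frame, not the real credit card data
toy <- data.frame(Time = c(0, 3599, 3600, 86399, 86400, 172792))
toy <- add.time.features(toy)
print(toy)
```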

# Modeling

I have attempted to select a common set of parameters for each model, but that is not entirely possible. For example, **xgboost** limits tree size with *max_depth*, while **lightGBM** uses *num_leaves*. The following are some of the assumptions and choices made during this modeling process.

* The data will be placed into each library's preferred data format before calling the models.
* Models will not be trained with cross-validation.
* If possible, different numbers of cores will be used during the speed analysis. (future mod)
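
To make the *max_depth* versus *num_leaves* mismatch concrete: a binary tree of depth d can have at most 2^d leaves, so a depth limit of 3 allows up to 8 leaves, and the `num_leaves = 7` used for **lightGBM** in this notebook sits just under that cap. The helper `max.leaves` below is only an illustration and is not part of either library.

```r
# Illustrative helper: upper bound on the leaf count of a binary tree
# with the given maximum depth
max.leaves <- function(depth) 2^depth

depths <- 1:5
leaf.caps <- sapply(depths, max.leaves)
print(data.frame(max_depth = depths, max_leaves = leaf.caps))
```

The two limits are still not equivalent. **lightGBM** grows trees leaf-wise, so a 7-leaf tree may be deeper and less balanced than a depth-3 tree.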

## GBM

Training the GBM is slow enough that I am not going to bother microbenchmarking it.

In [None]:
library(gbm, quietly=TRUE)

# Get the time to train the GBM model
system.time(
	gbm.model <- gbm(Class ~ .
		, distribution = "bernoulli"
		, data = rbind(train, test)
		, n.trees = 500
		, interaction.depth = 3
		, n.minobsinnode = 100
		, shrinkage = 0.01
		, bag.fraction = 0.5
		, train.fraction = nrow(train) / (nrow(train) + nrow(test))
		)
)
# Determine best iteration based on test data
best.iter = gbm.perf(gbm.model, method = "test")
gbm.influence = relative.influence(gbm.model, n.trees = best.iter, sort. = TRUE)

# Plot and calculate AUC on test data
gbm.test = predict(gbm.model, newdata = test, n.trees = best.iter)
auc.gbm = roc(test$Class, gbm.test, plot = TRUE, col = "red")
print(auc.gbm)
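
By default, `predict` on a `gbm` object returns values on the link (log-odds) scale unless `type = "response"` is requested. That does not affect the ROC curve above, because AUC depends only on the ranking of the scores. A minimal base-R sketch, using a hypothetical helper `auc.rank` and made-up toy scores, illustrates the point:

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) formulation
auc.rank <- function(labels, scores) {
	r <- rank(scores)            # average ranks, so ties are handled
	n.pos <- sum(labels == 1)
	n.neg <- sum(labels == 0)
	(sum(r[labels == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Toy labels and log-odds scores, invented for illustration only
labels <- c(0, 0, 1, 0, 1, 1)
log.odds <- c(-2.0, -0.5, -0.1, 0.3, 1.2, 2.5)
probs <- plogis(log.odds)  # monotone transform to probabilities

auc.link <- auc.rank(labels, log.odds)
auc.prob <- auc.rank(labels, probs)
print(c(link = auc.link, response = auc.prob))
```

Both values are identical because `plogis` preserves the ordering of the scores.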

## xgboost

In [None]:
library(xgboost, quietly=TRUE)
xgb.data.train <- xgb.DMatrix(as.matrix(train[, colnames(train) != "Class"]), label = train$Class)
xgb.data.test <- xgb.DMatrix(as.matrix(test[, colnames(test) != "Class"]), label = test$Class)

# Get the time to train the xgboost model
# IGNORE WARNINGS: early.stop.round and print.every.n are deprecated argument names,
# but the suggested replacements were not working as intended at the time of writing.
xgb.bench = microbenchmark(
	xgb.model <- xgb.train(data = xgb.data.train
		, params = list(objective = "binary:logistic"
			, eta = 0.1
			, max.depth = 3
			, min_child_weight = 100
			, subsample = 1
			, colsample_bytree = 1
			, nthread = 3
			, eval_metric = "auc"
			)
		, watchlist = list(test = xgb.data.test)
		, nrounds = 500
		, early.stop.round = 40
		, print.every.n = 20
		)
	, times = 5L
)
print(xgb.bench)
print(xgb.model$bestScore)

# Make predictions on test set for ROC curve
xgb.test = predict(xgb.model, newdata = as.matrix(test[, colnames(test) != "Class"]), ntreelimit = xgb.model$bestInd)
auc.xgb = roc(test$Class, xgb.test, plot = TRUE, col = "blue")
print(auc.xgb)
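
The `early.stop.round = 40` setting above stops training once the watched metric (test AUC here) has failed to improve for 40 consecutive rounds. The base-R sketch below illustrates that rule under simplified assumptions. It is not xgboost's implementation, the helper `early.stop` is hypothetical, and the toy AUC values are invented.

```r
# Hypothetical helper: scan per-round scores (higher is better) and stop
# once `patience` rounds have passed without a new best score
early.stop <- function(scores, patience) {
	best.score <- -Inf
	best.round <- 0
	for (i in seq_along(scores)) {
		if (scores[i] > best.score) {
			best.score <- scores[i]
			best.round <- i
		} else if (i - best.round >= patience) {
			break
		}
	}
	list(best.round = best.round, stopped.at = i)
}

# Invented per-round validation AUC values, for illustration only
toy.auc <- c(0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.93)
res <- early.stop(toy.auc, patience = 3)
print(res)
```

With a patience of 3, the best round is 3 and the scan stops at round 6. The tie at round 5 does not count as an improvement.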

## lightGBM

In [None]:
library(lightgbm, quietly=TRUE)
lgb.train = lgb.Dataset(as.matrix(train[, colnames(train) != "Class"]), label = train$Class)
lgb.test = lgb.Dataset(as.matrix(test[, colnames(test) != "Class"]), label = test$Class)

params.lgb = list(
	objective = "binary"
	, metric = "auc"
	, min_data_in_leaf = 1
	, min_sum_hessian_in_leaf = 100 # counterpart of xgboost's min_child_weight
	, feature_fraction = 1
	, bagging_fraction = 1
	, bagging_freq = 0
	)

# Get the time to train the lightGBM model
lgb.bench = microbenchmark(
	lgb.model <- lgb.train(
		params = params.lgb
		, data = lgb.train
		, valids = list(test = lgb.test)
		, learning_rate = 0.1
		, num_leaves = 7
		, num_threads = 2
		, nrounds = 500
		, early_stopping_rounds = 40
		, eval_freq = 20
		)
	, times = 5L
)
print(lgb.bench)
print(max(unlist(lgb.model$record_evals[["test"]][["auc"]][["eval"]])))

lgb.test = predict(lgb.model, data = as.matrix(test[, colnames(test) != "Class"]), num_iteration = lgb.model$best_iter)
auc.lgb = roc(test$Class, lgb.test, plot = TRUE, col = "green")
print(auc.lgb)

# Results

## Speed

The following shows the approximate **GBM** training time (see the `system.time` output above for the actual value) and the microbenchmark results for the **xgboost** and **lightGBM** models.

In [None]:
print("GBM = ~263s")
print(xgb.bench)
print(lgb.bench)

## Accuracy

The following are the *AUC* results for the test set.

### GBM

In [None]:
print(auc.gbm)

### xgboost

In [None]:
print(auc.xgb)

### lightGBM

In [None]:
print(auc.lgb)

# Additional Observations

## GBM

Advantages:

* None

Disadvantages:

* No early stopping
* Slower training
* Less accurate

## xgboost

Advantages:

* Proven success (on Kaggle)

Disadvantages:

* Slower than lightGBM

## lightGBM

Advantages:

* Fast training
* Low memory usage
* Better accuracy
* Parallel learning supported
* Handles large-scale data
* Corporate supported (Microsoft)

Disadvantages:

* No feature influence?
* Newer, so less community documentation

## Post Script

In this example, lightGBM completed in 15% of the time it took xgboost. That seems too extreme to me. Does anyone have feedback on how I may be parameterizing xgboost unfairly in this comparison?