# Example 3: Running an XGBoost Model in the mlr Package


### Loading the relevant packages and setting up the working directory

In [1]:
library(mlr)
library(readr)
library(dplyr)
library(parallel)
library(ROCR)
library(parallelMap)
setwd("D:/Project_2017/Training_0331")

Loading required package: ParamHelpers

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess


Attaching package: 'ROCR'

The following object is masked from 'package:mlr':

    performance



### Step 1: Reading in the data
#### The data is the Titanic survival data from Kaggle

In [2]:
df <- readr::read_csv("./data_for_testing.csv")

In [3]:
head(df)

y,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,1,40.0,0,0,27.7208,0
1,1,0,29.9,1,0,146.5208,0
0,2,1,66.0,0,0,10.5,2
0,1,1,42.0,1,0,52.0,2
1,2,0,5.0,1,2,27.75,2
1,3,1,29.9,1,1,15.2458,0


### Step 2: Defining the Learning Task
#### Intro to learning tasks
- Learning tasks encapsulate the data set and further relevant information about a machine learning problem.
- The following tasks can be instantiated and all inherit from the virtual superclass Task:
    - *RegrTask*: for regression problems
    - *ClassifTask*: for binary and multi-class classification problems
    - *SurvTask*: for survival analysis
    - *ClusterTask*: for cluster analysis
    - *MultilabelTask*: for multilabel classification problems
    - *CostSensTask*: for general cost-sensitive classification
- To create a task, just call `make<TaskType>`, e.g. `makeClassifTask`
- All tasks require an identifier (argument `id`) and a data.frame (argument `data`)

In [4]:
df2 <- as.data.frame(df)
dataset <- makeClassifTask(id="xgb_mlr_eg", data=df2, target="y", positive=1)
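The call above passes a numeric target and a numeric `positive`. A slightly more defensive variant, sketched here on a made-up toy data frame (`toy` and `toy_task` are hypothetical names, and the values are invented), converts the target to a factor first and passes `positive` as a string matching one of the factor levels. Whether the implicit conversion in the cell above works may depend on the installed mlr version.

```r
library(mlr)

# Toy stand-in for the Titanic data (values are made up for illustration)
toy <- data.frame(
  y    = c(0, 1, 0, 1, 1, 0),
  Age  = c(40, 29.9, 66, 5, 29.9, 42),
  Fare = c(27.7, 146.5, 10.5, 27.75, 15.2, 52)
)

# Make the target an explicit factor so the class levels are unambiguous
toy$y <- factor(toy$y, levels = c(0, 1))

# Pass the positive class as a string that matches one of the factor levels
toy_task <- makeClassifTask(id = "toy_eg", data = toy, target = "y", positive = "1")

# The task description records which level is treated as positive
toy_task$task.desc$positive
getTaskSize(toy_task)
```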

#### Accessing a learning task
- `getTaskDescription()` returns basic information about the task
- Frequently required elements can also be accessed directly
    - `getTaskId()`
    - `getTaskType()`
    - `getTaskSize()`
    - etc.

In [5]:
getTaskDescription(dataset)

$id
[1] "xgb_mlr_eg"

$type
[1] "classif"

$target
[1] "y"

$size
[1] 1309

$n.feat
numerics  factors  ordered 
       7        0        0 

$has.missings
[1] FALSE

$has.weights
[1] FALSE

$has.blocking
[1] FALSE

$class.levels
[1] "0" "1"

$positive
[1] "1"

$negative
[1] "0"

attr(,"class")
[1] "TaskDescClassif"    "TaskDescSupervised" "TaskDesc"          

#### Modifying a learning task
- `subsetTask()` to select observations and/or features
- `removeConstantFeatures()` to remove features that are constant
- `dropFeatures()` to remove selected features
- `normalizeFeatures()` to standardize numerical features

In [6]:
norm_task <- normalizeFeatures(dataset, method='range')
summary(getTaskData(norm_task))

 y           Pclass            Sex             Age             SibSp        
 0:826   Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
 1:483   1st Qu.:0.5000   1st Qu.:0.000   1st Qu.:0.2735   1st Qu.:0.00000  
         Median :1.0000   Median :1.000   Median :0.3724   Median :0.00000  
         Mean   :0.6474   Mean   :0.644   Mean   :0.3722   Mean   :0.06236  
         3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:0.4363   3rd Qu.:0.12500  
         Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.00000  
     Parch              Fare            Embarked     
 Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.01541   1st Qu.:0.5000  
 Median :0.00000   Median :0.02821   Median :1.0000  
 Mean   :0.04278   Mean   :0.06499   Mean   :0.7468  
 3rd Qu.:0.00000   3rd Qu.:0.06104   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
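To make the effect of `method='range'` concrete, here is a minimal base-R sketch of min-max scaling to the interval [0, 1], which is what the summary above suggests is happening. The helper name `range_scale` is hypothetical, and leaving constant columns unchanged is an assumption of this sketch rather than a statement about mlr's internals.

```r
# Min-max scaling of a numeric vector to [0, 1]
range_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Assumption of this sketch: a constant column is returned unchanged
  if (rng[1] == rng[2]) return(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Pclass values from the first six rows shown in Step 1
pclass <- c(1, 1, 2, 1, 2, 3)
range_scale(pclass)
# [1] 0.0 0.0 0.5 0.0 0.5 1.0
```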

### Step 3: Constructing a learner
#### Define a learner
- A learner in `mlr` is generated by calling `makeLearner`. In the constructor you specify which learning method to use, which hyperparameter values to set, and which kind of output you need.
- In our classification case, we set up the learner as follows:
    - The first argument specifies which algorithm to use; the naming convention is `classif.<R_method_name>` for classification methods.
    - The second argument specifies the output for later prediction, i.e. whether to return a factor of predicted class labels or probabilities
    - Hyperparameters can be specified either via the `...` argument or as a list via `par.vals`

In [7]:
lrn <- makeLearner("classif.xgboost", predict.type="prob")
lrn$par.vals = list(
  nrounds = 100,
  verbose = F,
  objective = "binary:logistic"
)

In [8]:
lrn

Learner classif.xgboost from package xgboost
Type: classif
Name: eXtreme Gradient Boosting; Short name: xgboost
Class: classif.xgboost
Properties: twoclass,multiclass,numerics,factors,prob,weights,missings,featimp
Predict-Type: prob
Hyperparameters: nrounds=100,verbose=FALSE,objective=binary:logistic
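As an alternative to assigning `lrn$par.vals` after construction, the hyperparameters can be passed to `makeLearner` through its `par.vals` argument, as mentioned in the list above. The sketch below assumes the `xgboost` package is installed; `lrn_alt` is a hypothetical name, and `verbose = 0` is used instead of `F` on the assumption that the learner declares `verbose` as an integer parameter.

```r
library(mlr)

# Same settings as the cell above, passed through the constructor
lrn_alt <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    nrounds = 100,
    verbose = 0,
    objective = "binary:logistic"
  )
)

# Inspect the hyperparameters that are now set on the learner
getHyperPars(lrn_alt)
```

Going through the constructor lets mlr check the values against the learner's parameter set, which a direct assignment to the slot presumably skips.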


#### Accessing a learner
- The learner object is a list; the following elements hold information about the hyperparameters and the type of prediction

In [9]:
lrn$par.vals

- The slot `$par.set` is an object of class ParamSet. It contains potential default values and the range of allowed values.

In [10]:
lrn$par.set

                          Type len             Def               Constr Req
booster               discrete   -          gbtree gbtree,gblinear,dart   -
silent                 integer   -               0          -Inf to Inf   -
eta                    numeric   -             0.3               0 to 1   -
gamma                  numeric   -               0             0 to Inf   -
max_depth              integer   -               6             1 to Inf   -
min_child_weight       numeric   -               1             0 to Inf   -
subsample              numeric   -               1               0 to 1   -
colsample_bytree       numeric   -               1               0 to 1   -
colsample_bylevel      numeric   -               1               0 to 1   -
num_parallel_tree      integer   -               1             1 to Inf   -
lambda                 numeric   -               0             0 to Inf   -
lambda_bias            numeric   -               0             0 to Inf   -
alpha       

- The function `getHyperPars` returns the current hyperparameter settings of a learner, and `getParamSet` returns a description of all possible settings.

In [11]:
getHyperPars(lrn)

In [12]:
getParamSet(lrn)

                          Type len             Def               Constr Req
booster               discrete   -          gbtree gbtree,gblinear,dart   -
silent                 integer   -               0          -Inf to Inf   -
eta                    numeric   -             0.3               0 to 1   -
gamma                  numeric   -               0             0 to Inf   -
max_depth              integer   -               6             1 to Inf   -
min_child_weight       numeric   -               1             0 to Inf   -
subsample              numeric   -               1               0 to 1   -
colsample_bytree       numeric   -               1               0 to 1   -
colsample_bylevel      numeric   -               1               0 to 1   -
num_parallel_tree      integer   -               1             1 to Inf   -
lambda                 numeric   -               0             0 to Inf   -
lambda_bias            numeric   -               0             0 to Inf   -
alpha       

- The function `getParamSet` can also give a quick overview of the available hyperparameters and defaults of a learning method without explicitly constructing it.

In [13]:
getParamSet('classif.xgboost')

                          Type len             Def               Constr Req
booster               discrete   -          gbtree gbtree,gblinear,dart   -
silent                 integer   -               0          -Inf to Inf   -
eta                    numeric   -             0.3               0 to 1   -
gamma                  numeric   -               0             0 to Inf   -
max_depth              integer   -               6             1 to Inf   -
min_child_weight       numeric   -               1             0 to Inf   -
subsample              numeric   -               1               0 to 1   -
colsample_bytree       numeric   -               1               0 to 1   -
colsample_bylevel      numeric   -               1               0 to 1   -
num_parallel_tree      integer   -               1             1 to Inf   -
lambda                 numeric   -               0             0 to Inf   -
lambda_bias            numeric   -               0             0 to Inf   -
alpha       

#### Listing learners
- The function `listLearners()` lists the learners available in mlr; its arguments restrict the list by task type and required properties

In [14]:
lrns <- listLearners("classif", properties = "prob") #list classifiers that can output probabilities
head(lrns[c("class", "package")])

"The following learners could not be constructed, probably because their packages are not installed:
classif.ada,classif.bartMachine,classif.bdk,classif.blackboost,classif.boosting,classif.bst,classif.C50,classif.clusterSVM,classif.dbnDNN,classif.dcSVM,classif.earth,classif.evtree,classif.extraTrees,classif.fnn,classif.gamboost,classif.gaterSVM,classif.geoDA,classif.glmboost,classif.h2o.deeplearning,classif.h2o.gbm,classif.h2o.glm,classif.h2o.randomForest,classif.hdrda,classif.kknn,classif.LiblineaRL1L2SVC,classif.LiblineaRL1LogReg,classif.LiblineaRL2L1SVC,classif.LiblineaRL2LogReg,classif.LiblineaRL2SVC,classif.LiblineaRMultiClassSVC,classif.linDA,classif.lqa,classif.mda,classif.mlp,classif.neuralnet,classif.nnTrain,classif.nodeHarvest,classif.pamr,classif.penalized.fusedlasso,classif.penalized.lasso,classif.penalized.ridge,classif.plr,classif.quaDA,classif.randomForestSRC,classif.ranger,classif.rda,classif.rFerns,classif.rknn,classif.rotationForest,classif.RRF,classif.rrlda,classif.s

class,package
classif.binomial,stats
classif.cforest,party
classif.ctree,party
classif.cvglmnet,glmnet
classif.featureless,mlr
classif.gausspr,kernlab


### Step 4: Tuning
#### Specifying the search space
- Create a ParamSet object to define the search space

In [15]:
ps = makeParamSet(
  makeNumericParam("eta", lower=0.15, upper=0.4),
  makeIntegerParam("max_depth", lower=3, upper=8),
  makeNumericParam("alpha", lower=0, upper=1),
  makeNumericParam("lambda", lower=0.25, upper=3),
  makeIntegerParam("min_child_weight", lower=2, upper=6),
  makeNumericParam("colsample_bytree", lower=.3, upper=.7),
  makeNumericParam("subsample", lower=.65, upper=.95)
)
ps

                    Type len Def       Constr Req Tunable Trafo
eta              numeric   -   -  0.15 to 0.4   -    TRUE     -
max_depth        integer   -   -       3 to 8   -    TRUE     -
alpha            numeric   -   -       0 to 1   -    TRUE     -
lambda           numeric   -   -    0.25 to 3   -    TRUE     -
min_child_weight integer   -   -       2 to 6   -    TRUE     -
colsample_bytree numeric   -   -   0.3 to 0.7   -    TRUE     -
subsample        numeric   -   - 0.65 to 0.95   -    TRUE     -

#### Specifying the optimization algorithm
- A grid search is one of the standard ways to choose an appropriate set of parameters from a given search space, but it is usually very slow because the number of combinations grows multiplicatively with each added parameter.
- A random search samples a fixed number of settings (`maxit`) from the search space, which is often faster in practice.

In [17]:
ctrl <- makeTuneControlRandom(maxit=30)
#ctrl <- makeTuneControlGrid(resolution=15L)
ctrl

Tune control: TuneControlRandom
Same resampling instance: TRUE
Imputation value: <worst>
Start: NULL
Budget: 30
Tune threshold: FALSE
Further arguments: maxit=30
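A back-of-the-envelope calculation shows why the commented-out grid search is so much more expensive than the random search for this search space. The sketch assumes each numeric parameter gets `resolution` grid points and each integer parameter gets at most as many points as there are integers in its range; that is an assumption about how the grid is built, so treat the result as a rough estimate.

```r
resolution <- 15

# Approximate number of distinct grid values per parameter in `ps`
grid_levels <- c(
  eta              = resolution,
  max_depth        = min(resolution, 8 - 3 + 1),  # integers 3..8
  alpha            = resolution,
  lambda           = resolution,
  min_child_weight = min(resolution, 6 - 2 + 1),  # integers 2..6
  colsample_bytree = resolution,
  subsample        = resolution
)

grid_points   <- prod(grid_levels)  # settings visited by the grid search
random_points <- 30                 # maxit of the random search
cv_folds      <- 3                  # each setting is fitted once per fold

c(grid_fits = grid_points * cv_folds, random_fits = random_points * cv_folds)
```

Under these assumptions the grid would need tens of millions of model fits, against 90 for the random search.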

#### Performing the tuning
- Define a resampling strategy and decide on the performance measure
- In our case, I will use 3-fold cross-validation to assess the quality of a specific parameter setting.
- The default measure for selecting the best parameters is the error rate (`mmce`), but we can also pass another measure or a list of measures to `tuneParams`. The __first__ measure is optimized during tuning; the others are simply evaluated.

In [18]:
cv <- makeResampleDesc("CV", iters=3) 
measures_ls <- list(auc, acc)

### Step 5: Parallelization
- mlr uses the parallelMap package for parallelization. parallelMap supports all major parallelization backends: local multicore execution using the parallel package, socket and MPI clusters using snow, makeshift SSH clusters using BatchJobs, and high performance computing clusters also using BatchJobs.
- All we need to do is select a backend by calling one of the `parallelStart*` functions, and call `parallelStop` at the end of the script
- mlr also offers different parallelization levels for fine-grained control over the parallelization

In [19]:
parallelGetRegisteredLevels()

mlr: mlr.benchmark, mlr.resample, mlr.selectFeatures, mlr.tuneParams

- __`mlr.resample`__: Each resampling iteration (a train / test step) is a parallel job.
- __`mlr.benchmark`__: Each experiment "run this learner on this data set" is a parallel job.
- __`mlr.tuneParams`__: Each evaluation in hyperparameter space "resample with these parameter settings" is a parallel job. How many of these can be run independently in parallel, depends on the tuning algorithm. For grid search or random search this is no problem, but for other tuners it depends on how many points are produced in each iteration of the optimization. If a tuner works in a purely sequential fashion, we cannot work magic and the hyperparameter evaluation will also run sequentially. But note that you can still parallelize the underlying resampling.
- __`mlr.selectFeatures`__: Each evaluation in feature space "resample with this feature subset" is a parallel job. The same comments as for "mlr.tuneParams" apply here.

In [20]:
random_seed <- 123
set.seed(random_seed, "L'Ecuyer")

num_cores <- 4

# Start a socket cluster and parallelize over the tuning evaluations
parallelStartSocket(num_cores, level="mlr.tuneParams")

res <- tuneParams(lrn, dataset, resampling=cv, par.set=ps, control=ctrl, 
                  show.info=TRUE, measures=measures_ls)
# Shut the workers down again
parallelStop()


Starting parallelization in mode=socket with cpus=4.
[Tune] Started tuning learner classif.xgboost for parameter set:
                    Type len Def       Constr Req Tunable Trafo
eta              numeric   -   -  0.15 to 0.4   -    TRUE     -
max_depth        integer   -   -       3 to 8   -    TRUE     -
alpha            numeric   -   -       0 to 1   -    TRUE     -
lambda           numeric   -   -    0.25 to 3   -    TRUE     -
min_child_weight integer   -   -       2 to 6   -    TRUE     -
colsample_bytree numeric   -   -   0.3 to 0.7   -    TRUE     -
subsample        numeric   -   - 0.65 to 0.95   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Imputation value: -0
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 4; elements = 30.
[Tune] Result: eta=0.29; max_depth=8; alpha=0.983; lambda=2.99; min_child_weight=2; colsample_bytree=0.391; subsample=0.659 : auc.test.mean=0.907,acc.test.mean=0.869
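If the tuning call fails or is interrupted, `parallelStop()` is never reached and the workers stay registered. One way to guard against this, sketched below with a hypothetical helper `run_in_parallel`, is to register the stop with `on.exit()`. The example uses a trivial `parallelMap` call instead of `tuneParams` so that it runs on its own; the two-worker setting is arbitrary.

```r
library(parallelMap)

# Hypothetical helper: start workers, run `work`, and always stop the workers
run_in_parallel <- function(cpus, work) {
  parallelStartSocket(cpus)
  on.exit(parallelStop(), add = TRUE)  # executed even if work() throws an error
  work()
}

# Trivial stand-in for a tuning call
squares <- run_in_parallel(2, function() {
  parallelMap(function(x) x^2, 1:4, simplify = TRUE)
})
squares
```

In the notebook, the body passed as `work` would contain the `tuneParams` call, and `parallelStartSocket` would also receive `level="mlr.tuneParams"`.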


### Step 6: Accessing the tuning results
- Access the best hyperparameter setting with `$x` and its estimated performance with `$y`.

In [21]:
res$x
res$y

- We can then create a new learner with the optimal hyperparameter settings; it is __retrained__ on the dataset in Step 7

In [23]:
lrn_opt <- setHyperPars(lrn, par.vals=res$x)
lrn_opt

Learner classif.xgboost from package xgboost
Type: classif
Name: eXtreme Gradient Boosting; Short name: xgboost
Class: classif.xgboost
Properties: twoclass,multiclass,numerics,factors,prob,weights,missings,featimp
Predict-Type: prob
Hyperparameters: nrounds=100,verbose=FALSE,objective=binary:logistic,eta=0.29,max_depth=8,alpha=0.983,lambda=2.99,min_child_weight=2,colsample_bytree=0.391,subsample=0.659


- We can inspect all points evaluated during the search and export them as a data frame

In [24]:
generateHyperParsEffectData(res, partial.dep = TRUE)
perf_path <- as.data.frame(res$opt.path)

HyperParsEffectData:
Hyperparameters: eta,max_depth,alpha,lambda,min_child_weight,colsample_bytree,subsample
Measures: auc.test.mean,acc.test.mean
Optimizer: TuneControlRandom
Nested CV Used: FALSE
[1] "Partial dependence requested"
Snapshot of data:
        eta max_depth      alpha   lambda min_child_weight colsample_bytree
1 0.3905567         6 0.12441539 2.482658                3        0.6776513
2 0.3133254         7 0.04480111 2.080678                5        0.5210854
3 0.2825981         8 0.76159505 1.010454                4        0.3594741
4 0.1594541         7 0.51196954 2.987880                3        0.5246644
5 0.3990977         5 0.68971964 2.874786                6        0.3727439
6 0.3720453         7 0.20920194 2.053653                5        0.3369490
  subsample auc.test.mean acc.test.mean iteration exec.time
1 0.7973827     0.8960173     0.8510381         1      0.98
2 0.6993547     0.9023835     0.8655484         2      1.03
3 0.8383479     0.9021821     0.86479
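Once the optimization path is a data frame, ordinary data-frame operations apply. As a small sketch, the block below rebuilds the first three rows of the snapshot above (a subset of columns only; `perf_toy` is a hypothetical name) and ranks them by the tuned measure. With the real `perf_path`, the same `order()` call would rank all 30 evaluated settings.

```r
# First three rows of the snapshot above, reduced to a few columns
perf_toy <- data.frame(
  eta           = c(0.3905567, 0.3133254, 0.2825981),
  max_depth     = c(6, 7, 8),
  auc.test.mean = c(0.8960173, 0.9023835, 0.9021821)
)

# Rank the evaluated settings from best to worst AUC
ranked <- perf_toy[order(perf_toy$auc.test.mean, decreasing = TRUE), ]
ranked
```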

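### Step 7: Predicting
- In Step 6 we obtained the learner with the best hyperparameter setting; now we __retrain__ the optimal model on the dataset and use it to __predict__. For illustration the prediction below is made on the training data itself, so the resulting scores are optimistic; in practice you would predict on new data.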

In [25]:
opt_model <- train(lrn_opt, dataset)
opt_pred <- predict(opt_model, dataset)
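Predicting on the data used for training overstates performance. A minimal sketch of a holdout evaluation follows, using the `subset` arguments of `train` and `predict`. To keep it runnable on its own, it uses mlr's built-in `sonar.task` and an rpart learner as stand-ins for `dataset` and `lrn_opt`; the 80/20 split is an arbitrary choice.

```r
library(mlr)
set.seed(123)

task     <- sonar.task  # built-in binary task, stand-in for `dataset`
lrn_demo <- makeLearner("classif.rpart", predict.type = "prob")  # stand-in for `lrn_opt`

# Random 80/20 split of the row indices
n         <- getTaskSize(task)
train_idx <- sample(n, size = round(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# Fit on the training rows only, then predict the held-out rows
model     <- train(lrn_demo, task, subset = train_idx)
pred_test <- predict(model, task = task, subset = test_idx)

# mlr:: prefix avoids the clash with ROCR::performance noted at the top
mlr::performance(pred_test, measures = list(auc, acc))
```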


- We can use the slot `$data` to access the predictions, or convert the object to a data frame directly. Both give the same columns: `id`, `truth`, one `prob.<class>` column per class, and `response`

In [26]:
prediction1 <- opt_pred$data
prediction2 <- as.data.frame(opt_pred)

In [27]:
head(prediction1)

id,truth,prob.0,prob.1,response
1,0,0.8864314,0.1135686,0
2,1,0.0379023,0.9620977,1
3,0,0.94110041,0.05889959,0
4,0,0.76211266,0.23788734,0
5,1,0.02898335,0.97101665,1
6,1,0.80719936,0.19280064,0


In [28]:
head(prediction2)

id,truth,prob.0,prob.1,response
1,0,0.8864314,0.1135686,0
2,1,0.0379023,0.9620977,1
3,0,0.94110041,0.05889959,0
4,0,0.76211266,0.23788734,0
5,1,0.02898335,0.97101665,1
6,1,0.80719936,0.19280064,0


- To access the true labels and the probabilities directly, use the functions `getPredictionTruth()` and `getPredictionProbabilities()`

In [29]:
true_label <- getPredictionTruth(opt_pred)
head(true_label)

In [30]:
positive_class <- opt_pred$task.desc$positive
pred_prob <- getPredictionProbabilities(opt_pred, cl=positive_class)
head(pred_prob)

### Step 8: Calculating performance
#### Based on the `true_label` and `pred_prob` vectors, we can calculate any metric we want with the `ROCR` package.
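As a sketch of what such a calculation can look like, the block below computes the AUC and the ROC curve points with `ROCR`. So that it runs on its own, it uses four made-up observations in place of the `true_label` and `pred_prob` vectors from Step 7; with the real vectors the calls are the same.

```r
library(ROCR)

# Toy stand-ins for `true_label` and `pred_prob` (values are made up)
true_label <- factor(c(0, 0, 1, 1), levels = c(0, 1))
pred_prob  <- c(0.1, 0.4, 0.35, 0.8)

# ROCR works on a prediction object built from scores and labels
rocr_pred <- ROCR::prediction(pred_prob, true_label)

# Area under the ROC curve
auc_value <- ROCR::performance(rocr_pred, "auc")@y.values[[1]]
auc_value
# [1] 0.75

# ROC curve: true positive rate (y) against false positive rate (x)
roc_curve <- ROCR::performance(rocr_pred, "tpr", "fpr")
head(data.frame(fpr = roc_curve@x.values[[1]], tpr = roc_curve@y.values[[1]]))
```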

---

### Advanced tuning: creating a custom measure
#### Using precision at given recall as measure


In [32]:
make_custom_pr_measure <- function(recall_perc=5, name_str="pr5"){
  
  # Precision at the point on the precision-recall curve whose recall is
  # closest to recall_perc percent
  find_prec_at_recall <- function(pred, recall_perc=5){
    
    positive_class <- pred$task.desc$positive
    prob <- getPredictionProbabilities(pred, cl=positive_class)
    truth <- getPredictionTruth(pred)
    
    rocr_pred <- ROCR::prediction(prob, truth) 
    
    # y.values hold precision ('ppv'), x.values hold recall ('sens')
    ppvRec <- ROCR::performance(rocr_pred, 'ppv', 'sens')
    
    idx <- which.min(abs(ppvRec@x.values[[1]] - recall_perc*0.01))
    ppvRec@y.values[[1]][idx]
  }
  
  name <- paste("Precision at ", as.character(recall_perc),"%"," recall", sep='')
  
  # Wrap the helper as an mlr measure; larger values are better
  custom_measure <- makeMeasure(
    id = name_str, 
    name = name,
    properties = c("classif", "req.prob", "req.truth"),
    minimize = FALSE, best = 1, worst = 0,
    extra.args = list("threshold" = recall_perc),
    fun = function(task, model, pred, feats, extra.args) {
      find_prec_at_recall(pred, extra.args$threshold)
    }
  )
  custom_measure
}

pr20 <- make_custom_pr_measure(20, "pr20")
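To see what "precision at a given recall" computes, here is a base-R sketch of the same idea on six made-up predictions. The helper name `prec_at_recall` is hypothetical. Unlike `ROCR`, this sketch does not group tied scores and has no starting point at recall 0, so it is an illustration of the logic rather than a drop-in replacement.

```r
# Precision at the ranked position whose recall is closest to `recall_target`
prec_at_recall <- function(prob, truth, positive, recall_target) {
  ord       <- order(prob, decreasing = TRUE)  # highest scores first
  is_pos    <- truth[ord] == positive
  tp        <- cumsum(is_pos)                  # true positives among the top k
  recall    <- tp / sum(is_pos)
  precision <- tp / seq_along(is_pos)
  precision[which.min(abs(recall - recall_target))]
}

# Made-up scores and labels
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
truth <- c(1, 1, 0, 1, 0, 0)

prec_at_recall(prob, truth, positive = 1, recall_target = 1.0)
# [1] 0.75
prec_at_recall(prob, truth, positive = 1, recall_target = 0.3)
# [1] 1
```

At low recall targets only the highest-scoring observations count, which is why a value of 1 is easy to reach when the top-ranked cases are all positives.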


In [33]:
parallelStartSocket(num_cores, level="mlr.tuneParams")

# The first measure in the list (pr20) is the one optimized during tuning;
# auc and acc are only evaluated
res2 <- tuneParams(lrn, dataset, resampling=cv, par.set=ps, control=ctrl, 
                  show.info=TRUE, measures=list(pr20, auc, acc))
parallelStop()


"Parallelization was not stopped, doing it now."
Stopped parallelization. All cleaned up.
Starting parallelization in mode=socket with cpus=4.
[Tune] Started tuning learner classif.xgboost for parameter set:
                    Type len Def       Constr Req Tunable Trafo
eta              numeric   -   -  0.15 to 0.4   -    TRUE     -
max_depth        integer   -   -       3 to 8   -    TRUE     -
alpha            numeric   -   -       0 to 1   -    TRUE     -
lambda           numeric   -   -    0.25 to 3   -    TRUE     -
min_child_weight integer   -   -       2 to 6   -    TRUE     -
colsample_bytree numeric   -   -   0.3 to 0.7   -    TRUE     -
subsample        numeric   -   - 0.65 to 0.95   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Imputation value: -0
Imputation value: -0
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 4; elements = 30.
[Tune] Result: eta=0.241; max_depth=4; alpha=0.978; lambd

In [34]:
res2

Tune result:
Op. pars: eta=0.241; max_depth=4; alpha=0.978; lambda=2.77; min_child_weight=4; colsample_bytree=0.546; subsample=0.755
pr20.test.mean=   1,auc.test.mean=0.903,acc.test.mean=0.868