<center><strong style="font-size:40px">Binary Classifier</strong></center>

In this notebook, we show how to train binary classification learners with the `mlr3` package. We will mainly train a penalised logistic regression (from the `glmnet` package) and a gradient boosted tree model (using `xgboost`). At a later stage, we could try `catboost`, which avoids having to convert the categorical features.

# Data
The data has been precomputed using the `get_training_data()` method of the `Spadl` class and stored in RDS format. Because of its relatively large size, it is not shipped with the package but saved on Google Drive instead. Here, we will use the 'opta' version.

In [1]:
library(data.table)
library("mlr3") # mlr3 base package
library("mlr3misc") # contains some helper functions
library("mlr3pipelines") # create ML pipelines
library("mlr3tuning") # tuning ML algorithms
library("mlr3learners") # additional ML algorithms
library("mlr3viz") # autoplot for benchmarks
library("paradox") # hyperparameter space
library("smotefamily") # SMOTE implementation used by the smote PipeOp

data_path = "/home/tarak/Gdrive_RA/events_data/opta"
training_data = readRDS(file.path(data_path, "training_dt.RDS"))

In [2]:
# convert every column to numeric
training_data = training_data[, lapply(.SD, as.numeric)]
names(training_data)

In [3]:
head(training_data, 3)

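| type_id_a0 | type_id_a1 | type_id_a2 | type_pass_a0 | type_cross_a0 | type_throw_in_a0 | type_freekick_crossed_a0 | type_freekick_short_a0 | type_corner_crossed_a0 | type_corner_short_a0 | ⋯ | time_seconds_overall_a1 | period_id_a2 | time_seconds_a2 | time_seconds_overall_a2 | time_delta_1 | time_delta_2 | scores | concedes | goal_from_shot | event_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | ⋯ | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | -2700 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1936000000.0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 1 | 0 | 0 | -2700 | 1 | 2 | 0 | 0 | 0 | 123500000.0 |
| 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 2 | 1 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 1319000000.0 |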


The target columns must be converted to factors to meet the `mlr3` requirements for classification tasks. We will also remove the redundant columns (the `type_id_*` columns) and create two binary classification tasks: one for the `scores` target and another for the `concedes` target.

In [4]:

# columns to exclude from both tasks
to_exclude_anyway <- c("type_id_a0", "type_id_a1", "type_id_a2", "goal_from_shot")

# score task
to_exclude_score <- c(to_exclude_anyway, "concedes")
dt_score <- training_data[, !..to_exclude_score]
dt_score[["scores"]] <- factor(ifelse(dt_score[["scores"]], "goal", "no_goal"), 
                               c("goal", "no_goal"))
task_score <- TaskClassif$new(id = "scores", backend = dt_score, target = "scores")
task_score$col_roles$name <- "event_id"
task_score$col_roles$feature <- setdiff(task_score$col_roles$feature, "event_id")

# concede task
to_exclude_concede <- c(to_exclude_anyway, "scores")
dt_concede <- training_data[, !..to_exclude_concede]
dt_concede[["concedes"]] <- factor(ifelse(dt_concede[["concedes"]], "goal", "no_goal"), 
                                   c("goal", "no_goal"))
task_concede <- TaskClassif$new(id = "concedes", backend = dt_concede, target = "concedes")
task_concede$col_roles$name <- "event_id"
task_concede$col_roles$feature <- setdiff(task_concede$col_roles$feature, "event_id")
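
The two task definitions above differ only in the target column and the columns they drop, so the shared steps could be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: `make_goal_task()` and the toy table are hypothetical, it assumes the target is coded as 0/1, and it only removes `event_id` from the features rather than assigning it the `name` role.

```r
library(data.table)
library(mlr3)

# Hypothetical helper: build a binary "goal" / "no_goal" task from a 0/1 target.
make_goal_task <- function(dt, target, drop_cols, id_col = "event_id") {
  # keep every column that is not explicitly dropped (returns a new data.table)
  dt <- dt[, setdiff(names(dt), drop_cols), with = FALSE]
  # recode the numeric 0/1 target as a factor, as mlr3 requires for classification
  set(dt, j = target,
      value = factor(ifelse(dt[[target]] == 1, "goal", "no_goal"),
                     levels = c("goal", "no_goal")))
  task <- TaskClassif$new(id = target, backend = dt, target = target,
                          positive = "goal")
  # the identifier column must not be used as a feature
  task$col_roles$feature <- setdiff(task$col_roles$feature, id_col)
  task
}

# Toy data (made up for illustration only)
toy <- data.table(
  type_id_a0 = c(0, 21, 0, 3),
  x          = c(1.5, 2, 0.3, 4),
  scores     = c(0, 1, 0, 1),
  concedes   = c(0, 0, 1, 0),
  event_id   = c(1, 2, 3, 4)
)
task_toy <- make_goal_task(toy, target = "scores",
                           drop_cols = c("type_id_a0", "concedes"))
task_toy$feature_names
```

Setting `positive = "goal"` explicitly makes the positive class independent of the factor level order, which matters for measures such as AUC or recall.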

# Sampling Strategy

The available data is severely imbalanced, as shown below:

In [5]:
table(task_score$truth())


   goal no_goal 
  18909 1395564 
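
To put the counts above in perspective, the class shares and the majority-to-minority ratio can be computed directly. This is a small base R sketch that hard-codes the two counts printed above; in the notebook itself one would start from `table(task_score$truth())` instead.

```r
# class counts copied from the table above
counts <- c(goal = 18909, no_goal = 1395564)

# share of each class in the full data set
class_share <- counts / sum(counts)

# number of "no_goal" rows per "goal" row
imbalance_ratio <- counts[["no_goal"]] / counts[["goal"]]

round(100 * class_share, 2)  # percentages per class
round(imbalance_ratio, 1)    # majority rows per minority row
```

The `goal` class makes up a little over 1% of the rows, i.e. there are roughly 74 `no_goal` rows for every `goal` row.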

`mlr3pipelines` provides a `classbalancing` and a `smote` pipe operator, both of which can be combined with any learner. Below, we define the `undersampling`, `oversampling` and `SMOTE` `PipeOps`. All three imbalance correction methods have hyperparameters that control the resulting class ratio. We apply the `PipeOps` to the current task with specific hyperparameter values to see how the class balance changes:

In [6]:
# undersample majority class (relative to majority class)
po_under = po("classbalancing", id = "undersample", 
              adjust = "major", 
              reference = "major", 
              shuffle = FALSE, ratio = 1 / 6)
# reduce majority class by factor '1/ratio'
table(po_under$train(list(task_score))$output$truth())


   goal no_goal 
  18909  232594 

In [7]:
# oversample minority class (relative to minority class)
po_over = po("classbalancing", id = "oversample", 
             adjust = "minor",
             reference = "minor", 
             shuffle = FALSE, 
             ratio = 6)
# enrich minority class by factor 'ratio'
table(po_over$train(list(task_score))$output$truth())


   goal no_goal 
 113454 1395564 
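
The two outputs above are consistent with the adjusted class being resampled to `ratio` times the size of the reference class. The base R sketch below reproduces that arithmetic for the two-class case; `expected_counts()` is a hypothetical helper written for this illustration, it only handles the `"major"` and `"minor"` options, and it is a simplification of what `classbalancing` actually supports.

```r
# Hypothetical helper: expected class counts after class balancing (two classes only).
expected_counts <- function(counts, adjust, reference, ratio) {
  stopifnot(length(counts) == 2,
            adjust %in% c("major", "minor"),
            reference %in% c("major", "minor"))
  # index of the majority or minority class
  pick <- function(which) {
    if (which == "major") which.max(counts) else which.min(counts)
  }
  # the adjusted class is resampled to ratio * size of the reference class
  target_size <- round(ratio * counts[[pick(reference)]])
  counts[[pick(adjust)]] <- target_size
  counts
}

counts <- c(goal = 18909, no_goal = 1395564)

under <- expected_counts(counts, adjust = "major", reference = "major", ratio = 1 / 6)
over  <- expected_counts(counts, adjust = "minor", reference = "minor", ratio = 6)

under  # goal = 18909,  no_goal = 232594
over   # goal = 113454, no_goal = 1395564
```

Both settings move the class ratio from roughly 74:1 to roughly 12:1; undersampling does so by discarding majority rows, oversampling by duplicating minority rows.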

In [None]:
# SMOTE enriches the minority class with synthetic data
po_smote = po("smote", dup_size = 2)
# enrich minority class by factor (dup_size + 1)
table(po_smote$train(list(task_score))$output$truth())
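
As noted above, these `PipeOps` can be combined with any learner. The following is a minimal sketch of how that might look, under two assumptions made only to keep it small and quick to run: it uses the built-in `german_credit` task instead of the events data, and a `classif.rpart` decision tree in place of the `glmnet` and `xgboost` learners this notebook targets. The object names ending in `_toy` are hypothetical.

```r
library(mlr3)
library(mlr3pipelines)

# built-in imbalanced binary task (assumption: stands in for task_score)
task_toy <- tsk("german_credit")

# halve the majority class during training
po_under_toy <- po("classbalancing", id = "undersample",
                   adjust = "major", reference = "major",
                   shuffle = FALSE, ratio = 1 / 2)

# chain the sampling step and the learner, then wrap the graph as a learner
graph_toy <- po_under_toy %>>% lrn("classif.rpart", predict_type = "prob")
glrn_toy <- as_learner(graph_toy)

# class balancing is applied when training; prediction uses the data as-is
glrn_toy$train(task_toy)
pred_toy <- glrn_toy$predict(task_toy)
pred_toy$score(msr("classif.auc"))
```

Because the sampling step lives inside the graph, it is refitted on each training split during resampling, so the test splits keep the original class distribution. The score here is computed on the training data and is only meant to show that the pipeline runs.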