In [2]:
source('src/lib.r')

### The Bias-Variance Trade-Off

Previously, we saw different ways to measure the performance of an ML model. But what is the **final aim** of a machine learning model? If you recall from the presentation, it is **prediction on new data**, not performance on the data the model has already seen.

To put this point in a formal framework, let's say we have trained our model $f: X \rightarrow Y$ on the so-called training observations:

$$\textrm{training set:} \left\{ (x_1,y_1), (x_2,y_2), \dots, (x_n,y_n) \right\}$$

We're not really interested in the *training set* accuracy (or in any of the other measures shown previously):

$$\textrm{training accuracy:} \frac{1}{n} \sum_{i = 1}^n \mathbb{I}(f(x_i) = y_i)$$

But given a set of new observations, **not used to train the model**:

$$\textrm{testing set:} \left\{ (\tilde{x}_1,\tilde{y}_1), (\tilde{x}_2,\tilde{y}_2), \dots, (\tilde{x}_k,\tilde{y}_k) \right\}$$

We are interested in the *testing set* accuracy:

$$\textrm{test accuracy:} \frac{1}{k} \sum_{i = 1}^k \mathbb{I}(f(\tilde{x}_i) = \tilde{y}_i)$$
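
Both accuracy formulas are simply the mean of an indicator over a set of observations. As a minimal sketch in base R (the `accuracy` helper and the label vectors are hypothetical, introduced only for illustration and not part of `src/lib.r`):

```r
# Accuracy as the mean of the indicator I(f(x_i) = y_i).
accuracy <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed))
  mean(as.character(predicted) == as.character(observed))
}

# Hypothetical labels, for illustration only.
observed  <- c("class_1", "class_2", "class_2", "class_1")
predicted <- c("class_1", "class_2", "class_1", "class_1")

accuracy(predicted, observed)  # 3 of 4 predictions match: 0.75
```

The same function gives the training or the testing accuracy, depending only on which set of observations is passed in.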

To better explain these concepts, the toy datasets have already been partitioned into *training* and *test* sets (the latter is identified by the *_val* suffix, as in *validation*):

In [11]:
df <- get_partitioned_df()
df$spirals$x_train %>% head
df$spirals$y_train %>% head
df$spirals$x_val %>% head
df$spirals$y_val %>% head
df$spirals$x_train %>% nrow
df$spirals$x_val %>% nrow

x,y
0.6252236,0.5290558
0.5075064,0.2557003
0.2902924,0.244952
0.7239316,0.549744
0.3686198,0.4139273
0.361764,0.3420039


class
class_1
class_2
class_2
class_1
class_2
class_2


x,y
0.7283947,0.5249248
0.6271864,0.6185736
0.3193688,0.346685
0.6427129,0.6762807
0.534729,0.7064484
0.8796171,0.5194445


class
class_1
class_1
class_2
class_1
class_1
class_2
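
The helper `get_partitioned_df()` comes from `src/lib.r`, whose implementation is not shown here. As a rough sketch of how a partition like this might be produced in base R (the toy data, the 70/30 proportion and all variable names are assumptions, not the helper's actual logic):

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical data frame with two features and a class label.
n <- 100
toy <- data.frame(
  x = runif(n),
  y = runif(n),
  class = factor(sample(c("class_1", "class_2"), n, replace = TRUE))
)

# Hold out 30% of the rows for validation (the proportion is an assumption).
val_idx <- sample(seq_len(n), size = round(0.3 * n))

x_train <- toy[-val_idx, c("x", "y")]
y_train <- toy[-val_idx, "class", drop = FALSE]
x_val   <- toy[val_idx, c("x", "y")]
y_val   <- toy[val_idx, "class", drop = FALSE]

nrow(x_train)  # 70
nrow(x_val)    # 30
```

The key property is that no row appears in both sets, so the validation rows stay unseen during training.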


Let's train a simple *KNN* model on the training set (1 neighbor, rectangular kernel and Euclidean distance):

In [62]:
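# Fit a k-nearest-neighbours classifier through caret (kknn backend).
model <- train(
    y = df$spirals$y_train$class,
    x = df$spirals$x_train,
    method = "kknn",
    ks = 1,  # forwarded to the underlying kknn fit: use k = 1 only
    trControl = trainControl(classProbs = TRUE, method = "none"),  # no resampling
    tuneGrid = data.frame(
        kmax = 1,               # maximum number of neighbours
        distance = 2,           # Minkowski distance parameter (2 = Euclidean)
        kernel = "rectangular"  # unweighted vote among neighbours
    )
)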

Let's compute the *training set* accuracy:

In [63]:
predictions <- predict(model, df$spirals$x_train)
confusionMatrix(predictions, df$spirals$y_train$class)$overall["Accuracy"]

And what about the *testing set* accuracy?

In [64]:
predictions <- predict(model, df$spirals$x_val)
confusionMatrix(predictions, df$spirals$y_val$class)$overall["Accuracy"]
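
To see why the two numbers can differ, here is a self-contained sketch of the same comparison on synthetic data, using a hand-written 1-nearest-neighbour rule in base R rather than the `kknn` model above. The data generator, the 20% label noise and all function names are hypothetical, chosen only for illustration:

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical noisy two-class data: the label depends on the first
# feature, and 20% of the labels are flipped at random.
make_data <- function(n) {
  x <- matrix(runif(2 * n), ncol = 2)
  label <- ifelse(x[, 1] > 0.5, "class_1", "class_2")
  flip <- runif(n) < 0.2
  label[flip] <- ifelse(label[flip] == "class_1", "class_2", "class_1")
  list(x = x, y = label)
}

# 1-nearest-neighbour prediction with (squared) Euclidean distance.
predict_1nn <- function(x_train, y_train, x_new) {
  apply(x_new, 1, function(p) {
    d <- colSums((t(x_train) - p)^2)  # distance from p to every training point
    y_train[which.min(d)]
  })
}

train_set <- make_data(200)
test_set  <- make_data(200)

train_acc <- mean(predict_1nn(train_set$x, train_set$y, train_set$x) == train_set$y)
test_acc  <- mean(predict_1nn(train_set$x, train_set$y, test_set$x) == test_set$y)

train_acc  # 1: every training point is its own nearest neighbour
test_acc   # below 1: the memorised label noise does not carry over to new data
```

With a single neighbour, each training point is matched to itself, so the training accuracy is perfect by construction and says nothing about how the model behaves on unseen observations.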

#### Cross Validation
##### Validation Set Approach
##### K Fold Validation
##### Leave one out Cross Validation

#### Ensembling