# Meta Learning

## Seoul AI Meetup, September 16

Martin Kersner, <m.kersner@gmail.com>

In [33]:
import numpy as np


* train/val/test dataset
* bias/variance
* discovering meta knowledge/bagging/boosting/stacking/dynamic bias selection/inductive transfer
* adaboost, gradient boosting, random forest
* netflix, kaggle
* nexar

## Meta Learning

Meta Learning is a way of combining models using Meta algorithms.

Supervised Learning only

## Training, Validation and Test Dataset

todo

## Generalization Error
https://en.wikipedia.org/wiki/Generalization_error

Generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously **unseen data**.


Generalization error is composed of three parts:

* Bias
* Variance
* **Irreducible Error**
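
For squared-error loss, the decomposition is commonly written as below. This is the standard textbook form rather than something taken from the talk; the symbols $f$ (true function), $\hat{f}$ (learned model) and $\sigma^2$ (noise variance) are notation assumed here, with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
+ \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
+ \sigma^2
$$

The first term is the squared bias, the second is the variance, and $\sigma^2$ is the irreducible error.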

## Bias, Variance
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

* Bias
  * Error from erroneous assumptions in the learning algorithm.
  * Can cause **underfitting**.
  

* Variance
  * Error from sensitivity to small fluctuations in the training set.
  * Can cause **overfitting**.
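
A rough way to see both error sources is to refit the same model on many noisy training sets and look at how the predictions behave. The sketch below is only an illustration under assumed settings: the target function `true_fn`, the noise level, and the polynomial degrees 1 and 7 are all hypothetical choices, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth function used only for this illustration.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_repeats=200, n_train=30, noise_sd=0.3):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # Same inputs every time, fresh noise: a "small fluctuation" of the training set.
        y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)  # error of the average model
    variance = np.mean(preds.var(axis=0))                  # spread around the average model
    return bias_sq, variance

bias_lin, var_lin = bias_variance(degree=1)    # too simple: expected to underfit
bias_poly, var_poly = bias_variance(degree=7)  # flexible: lower bias, higher variance
print("degree 1:", bias_lin, var_lin)
print("degree 7:", bias_poly, var_poly)
```

In this setup the straight line should show the larger bias term and the degree-7 polynomial the larger variance term.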

<center>
<img src="https://qph.ec.quoracdn.net/main-qimg-f9c226fe76f482855b6d46b86c76779a" style="height: 50%; width: 50%"/>
</center>

## Bias-Variance Tradeoff

<center>
<img src="http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png" style="height: 50%; width: 50%"/>
</center>

## Ensemble

An ensemble model combines multiple ‘individual’ (diverse) models and delivers more robust predictions than any of the individual models alone.

An ensemble is a supervised learning technique for combining multiple weak learners/models to produce a strong learner. Ensembles work better when the combined models have low correlation.

An example is the random forest algorithm, which combines multiple CART models. It performs better than an individual CART model: to classify a new object, each tree “votes” for a class and the forest chooses the class with the most votes (over all the trees in the forest). In the case of regression, it takes the average of the outputs of the different trees.

The key to creating a powerful ensemble is model diversity. An ensemble of two techniques that are very similar in nature will perform worse than a more diverse model set.

## Voting Ensembles

TODO link (ensembling guide MLWave)

Voting ensembles work best when the model predictions being combined have low correlation.

Majority votes make most sense when the evaluation metric requires hard predictions, for instance with (multiclass-) classification accuracy.

Final class is selected based on (weighted) majority voting.

* Majority Voting Ensemble
* Weighted Voting Ensemble

### Majority Voting Ensemble Example

3 independent binary models (A, B, C) with accuracy 70 %.

* 70 % of time correct prediction.
* 30 % of time wrong prediction.

**At least two predictions (out of three) have to be correct.**


Voting Mechanism:
> * A: 1
> * B: 1
> * C: 0
>
> => 1


#### All three are correct

In [12]:
P3 = 0.7 * 0.7 * 0.7
print(P3)

0.3429999999999999


#### Two are correct

In [11]:
P2 = 3 * (0.7 * 0.7 * 0.3)
print(P2)

0.4409999999999999


#### One is correct

In [13]:
P1 = 3 * (0.3 * 0.3 * 0.7)
print(P1)

0.189


#### None is correct

In [14]:
P0 = 0.3 * 0.3 * 0.3
print(P0)

0.027


In [15]:
P = P0 + P1 + P2 + P3
print(P)

0.9999999999999998


### Voting Ensemble Example Result

The most likely outcome (P2 ~ 44 %) is that exactly two models are correct, so the majority vote corrects the error of the third.

Prediction accuracy of the majority voting ensemble will be **78.4 %** (P3 + P2), which is higher than when using the models individually.

Using **5** independent binary models with accuracy 70 %, the accuracy of majority voting rises to **83.69 %**.
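
The same calculation can be generalized to any odd number of independent models with the binomial distribution. The helper name `majority_vote_accuracy` is made up for this sketch, and it assumes that all models share the same accuracy `p` and that their errors are fully independent.

```python
from math import comb

def majority_vote_accuracy(n_models, p):
    """Probability that more than half of n independent models are correct.

    Assumes an odd n_models (no ties) and the same accuracy p for every model.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

print(majority_vote_accuracy(3, 0.7))  # ≈ 0.784
print(majority_vote_accuracy(5, 0.7))  # ≈ 0.8369
```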

## Correlation

* Pearson Correlation

TODO

For highly correlated models, majority voting ensembles help little or not at all.

In [34]:
GT = np.array([1,1,1,1,1,1,1,1,1,1])

A  = np.array([1,1,1,1,1,1,1,1,0,0]) # 80 % accuracy
B  = np.array([1,1,1,1,1,1,1,1,0,0]) # 80 % accuracy
C  = np.array([1,0,1,1,1,1,1,1,0,0]) # 70 % accuracy

Accuracy with voting ensembles is still 80 %!

In [39]:
sum(A+B+C >= 2)/len(A)

0.80000000000000004

In [40]:
A = np.array([1,1,1,1,1,1,1,1,0,0]) # 80 % accuracy
B = np.array([0,1,1,1,0,1,1,1,0,1]) # 70 % accuracy
C = np.array([1,0,0,0,1,0,1,1,1,1]) # 60 % accuracy

With much less correlated models, accuracy rises to 90 %, even though B and C are individually weaker.

In [41]:
sum(A+B+C >= 2)/len(A)

0.90000000000000002
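
To put a number on "correlated", one option is the Pearson correlation between the prediction vectors. The sketch below applies `np.corrcoef` to the two sets of predictions used above; the helper `pairwise_correlation` and the variable names `similar` and `diverse` are introduced here only for illustration.

```python
import numpy as np

def pairwise_correlation(*predictions):
    """Pearson correlation matrix between the prediction vectors of several models."""
    return np.corrcoef(np.vstack(predictions))

# First set: A and B are identical, C differs in a single prediction.
similar = pairwise_correlation(
    np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]),
    np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]),
    np.array([1, 0, 1, 1, 1, 1, 1, 1, 0, 0]),
)

# Second set: the models make their mistakes on different examples.
diverse = pairwise_correlation(
    np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]),
    np.array([0, 1, 1, 1, 0, 1, 1, 1, 0, 1]),
    np.array([1, 0, 0, 0, 1, 0, 1, 1, 1, 1]),
)

print(np.round(similar, 2))
print(np.round(diverse, 2))
```

The off-diagonal entries of `similar` are close to 1, while those of `diverse` are small or negative, which matches the difference in ensemble accuracy.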

### Weighted Voting Ensemble

Weighted voting gives a better model more weight in the vote.
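
A minimal sketch of weighted voting for binary (0/1) predictions follows. The function `weighted_vote` and the weights in the example are hypothetical; in practice the weights might come from validation accuracy.

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Weighted majority vote over binary (0/1) predictions.

    predictions: array of shape (n_models, n_samples)
    weights:     array of shape (n_models,)
    """
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    score = weights @ predictions  # total weight voting for class 1, per sample
    return (score > weights.sum() / 2).astype(int)

preds = np.array([
    [1, 0, 1],  # model trusted the most in this example
    [0, 0, 1],
    [0, 1, 1],
])

print(weighted_vote(preds, [1, 1, 1]))  # plain majority vote: [0 0 1]
print(weighted_vote(preds, [3, 1, 1]))  # first model outvotes the others: [1 0 1]
```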

## Averaging

Averaging works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).

* The geometric mean is an alternative to the plain (arithmetic) average.
* Averaging reduces overfitting.
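
Both kinds of average can be sketched in a few lines of NumPy. The probabilities below are made-up values for three models and two samples, and the geometric mean as written assumes strictly positive inputs.

```python
import numpy as np

def arithmetic_mean(probs):
    """Plain average of model outputs; probs has shape (n_models, n_samples)."""
    return np.mean(probs, axis=0)

def geometric_mean(probs):
    """Geometric mean via exp(mean(log p)); assumes all values are > 0."""
    return np.exp(np.mean(np.log(probs), axis=0))

probs = np.array([
    [0.9, 0.2],
    [0.8, 0.4],
    [0.7, 0.1],
])

print(arithmetic_mean(probs))
print(geometric_mean(probs))
```

The geometric mean is never larger than the arithmetic mean, so it pulls the combined score down when one model strongly disagrees.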

## Rank Averaging

Rank averaging does well when the evaluation metric is ranking- or threshold-based, like AUC.
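
One possible implementation converts each model's scores to ranks, scales them to [0, 1], and averages the result. The helper names are hypothetical, and ties are broken by position here, which is a simplification (averaged ranks for ties would be another reasonable choice).

```python
import numpy as np

def rank_normalize(scores):
    """Replace scores by their ranks, scaled to [0, 1]. Ties are broken by position."""
    scores = np.asarray(scores)
    ranks = np.argsort(np.argsort(scores))  # rank 0 = lowest score
    return ranks / (len(scores) - 1)

def rank_average(score_lists):
    """Average the normalized ranks of several models."""
    return np.mean([rank_normalize(s) for s in score_lists], axis=0)

model_a = [0.10, 0.40, 0.35, 0.80]     # well spread-out scores
model_b = [0.95, 0.99, 0.96, 0.999]    # same ordering, squeezed near 1

print(rank_average([model_a, model_b]))
```

Because only the ordering matters, the badly scaled `model_b` contributes exactly as much as `model_a`.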


https://www.kaggle.com/cbourguignat/why-calibration-works

## Historical Ranks

## Stacking, Blending

### Stacked Generalization
The basic idea behind stacked generalization is to use a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.

**2 fold stacking**

1. Split the train set into 2 parts: train_a and train_b.
2. Fit a first-stage model on train_a and create predictions for train_b.
3. Fit the same model on train_b and create predictions for train_a.
4. Finally, fit the model on the entire train set and create predictions for the test set.
5. Now train a second-stage stacker model on the probabilities from the first-stage model(s).
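
The 2-fold procedure can be sketched with NumPy alone. To keep it short, this version uses a regression task and plain least-squares models instead of classifiers producing probabilities; the data, the two feature subsets, and the helpers `fit_linear` / `predict_linear` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares linear model with intercept; returns the coefficient vector."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic data (illustrative only): y depends on features 0 and 2.
X = rng.normal(size=(200, 4))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Two hypothetical first-stage models, each seeing a different pair of features.
feature_sets = [[0, 1], [2, 3]]

half = len(X_train) // 2
idx_a, idx_b = np.arange(half), np.arange(half, len(X_train))

oof = np.zeros((len(X_train), len(feature_sets)))        # out-of-fold predictions
test_preds = np.zeros((len(X_test), len(feature_sets)))

for m, cols in enumerate(feature_sets):
    # Fit on train_a and predict train_b, then fit on train_b and predict train_a.
    for fit_idx, pred_idx in [(idx_a, idx_b), (idx_b, idx_a)]:
        coef = fit_linear(X_train[fit_idx][:, cols], y_train[fit_idx])
        oof[pred_idx, m] = predict_linear(coef, X_train[pred_idx][:, cols])
    # Refit on the entire train set to create predictions for the test set.
    coef = fit_linear(X_train[:, cols], y_train)
    test_preds[:, m] = predict_linear(coef, X_test[:, cols])

# Second-stage stacker trained on the out-of-fold first-stage predictions.
stacker = fit_linear(oof, y_train)
final_test_pred = predict_linear(stacker, test_preds)

print("stacked test MSE:", np.mean((final_test_pred - y_test) ** 2))
```

The stacker never sees a prediction that was made on data the first-stage model was trained on, which is the point of the fold swap.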

A stacker model gets more information on the problem space by using the first-stage predictions as features than if it were trained in isolation.

### Blending (= stacked ensembling)

It is very close to stacked generalization, but a bit simpler and with less risk of an information leak.

### Ensemble Meta-Algorithms

* Bagging
* Boosting
* Stacking

## Bagging (Bootstrap Aggregating)

1. Create **random samples** (sampling uniformly and with replacement) of the training data set.
2. Build a model **from each sample**.
3. **Combine** results of these multiple classifiers using **average** (regression) or **majority voting** (classification).


* Bagging helps to reduce the variance error.
* Can be trained in parallel.
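
The three steps can be sketched for a regression problem as follows. The base model (a degree-5 polynomial fit), the synthetic data, and the function name `bagged_predict` are assumptions chosen for brevity; any base learner could take their place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy training data (illustrative only).
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def bagged_predict(x_train, y_train, x_new, n_models=50, degree=5):
    """Bagging with polynomial base models; returns the average and all predictions."""
    preds = np.empty((n_models, len(x_new)))
    for i in range(n_models):
        # 1. Bootstrap sample: draw indices uniformly, with replacement.
        idx = rng.integers(0, len(x_train), size=len(x_train))
        # 2. Build a model from this sample.
        coefs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds[i] = np.polyval(coefs, x_new)
    # 3. Combine by averaging (regression case).
    return preds.mean(axis=0), preds

x_new = np.linspace(0.05, 0.95, 20)
true_new = np.sin(2 * np.pi * x_new)
bagged, all_preds = bagged_predict(x, y, x_new)

print("bagged MSE:          ", np.mean((bagged - true_new) ** 2))
print("mean individual MSE: ", np.mean((all_preds - true_new) ** 2))
```

Each loop iteration is independent of the others, which is why bagging can be trained in parallel.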

### Random Forests

TODO

## Boosting

**Boosting** is a method of turning a set of **weak learners** into one **strong learner**.

* Weak learner
  * Classifier which can label testing examples better than random guessing.
* Strong learner
  * Classifier that is arbitrarily well-correlated with the true label.


1. Assign a weight (the same for each example) to each training example.
2. Train a weak classifier on the whole training dataset.
3. Evaluate the weak classifier and reweight the data accordingly:
   * Misclassified examples **gain** weight.
   * Correctly classified examples **lose** weight.
4. Train another weak classifier that focuses on the examples misclassified by the previous weak learner.
5. Evaluate the weak classifier and modify the weights appropriately (as in step 3).
6. Repeat from step 4 until you achieve the required accuracy or reach the maximum number of weak learners.


When weak classifiers are put together, they are typically weighted in some way.
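
One concrete instance of this scheme is AdaBoost. The sketch below uses decision stumps as weak learners and the usual AdaBoost weight update with labels in {-1, +1}; the function names and the synthetic data are assumptions for illustration, and the code favors readability over speed.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Decision stump: predicts +1 on one side of a threshold on feature j, else -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Find the stump with the lowest weighted error; labels y are in {-1, +1}."""
    best = (np.inf, 0, 0.0, 1)  # (weighted error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, n_rounds=30):
    w = np.full(len(y), 1.0 / len(y))          # equal weight for every example
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, pol = fit_stump(X, y, w)  # weak learner on the weighted data
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this weak learner
        pred = stump_predict(X, j, thr, pol)
        w *= np.exp(-alpha * y * pred)         # misclassified gain, correct lose weight
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

# Synthetic data with a diagonal boundary that no single stump can match.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

ensemble = adaboost(X, y)
boosted_acc = np.mean(adaboost_predict(ensemble, X) == y)
_, j, thr, pol = fit_stump(X, y, np.full(len(y), 1.0 / len(y)))
stump_acc = np.mean(stump_predict(X, j, thr, pol) == y)
print("single stump:", stump_acc, "boosted:", boosted_acc)
```

The `alpha` values are the weights used when the weak classifiers are put together. Both accuracies are measured on the training data, so they say nothing about overfitting.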


* Boosting primarily reduces bias, and also variance.
* Tends to overfit the training data.
* Can be trained only sequentially.

### AdaBoost
### BrownBoost
### XGBoost

TODO

## Stacking (Stacked Generalization)

Training a model to combine the **predictions** of several other models.

Stacking works in two phases.

1. Train several base models on the available data (with data splitting).
2. Train a model to make the final predictions using the predictions of the models trained in the first phase.

### How can we identify the weights of different models for ensemble?

One of the most common challenges with ensemble modeling is finding the optimal weights for the base models. In general, we assume equal weights for all models and take the average of the predictions. But is this the best way to deal with this challenge?

There are various methods for finding the optimal weights for combining the base learners. These methods provide a fair understanding of how to find the right weights. Some of them are listed below:

* Find the collinearity between the base learners and, based on this table, identify the base models to ensemble. Then look at the cross-validation scores (ratio of scores) of the identified base models to find the weights.
* Use an algorithm that returns the optimal weights for the base learners. See the article "Finding Optimal Weights of Ensemble Learner using Neural Network" for one such method.
* Solve the same problem using methods like:
  * Forward selection of learners
  * Selection with replacement
  * Bagging of ensemble methods

You can also look at the winning solutions of Kaggle / data science competitions to understand other methods of dealing with this challenge.
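
As a baseline that is simpler than any of the methods listed above, the weight for blending two models can be found by a grid search on a validation set. This is a swapped-in illustration, not one of the listed techniques; the function `best_blend_weight` and the synthetic predictions are hypothetical.

```python
import numpy as np

def best_blend_weight(pred_a, pred_b, y_val, n_steps=101):
    """Grid-search w in [0, 1] minimizing the MSE of w * pred_a + (1 - w) * pred_b."""
    weights = np.linspace(0.0, 1.0, n_steps)
    errors = [np.mean((w * pred_a + (1 - w) * pred_b - y_val) ** 2) for w in weights]
    best = int(np.argmin(errors))
    return weights[best], errors[best]

# Synthetic validation data: model A is less noisy than model B.
rng = np.random.default_rng(0)
y_val = rng.normal(size=500)
pred_a = y_val + rng.normal(scale=0.5, size=500)
pred_b = y_val + rng.normal(scale=1.0, size=500)

w_best, mse_best = best_blend_weight(pred_a, pred_b, y_val)
print("weight for model A:", w_best)
print("blended MSE:", mse_best)
print("equal-weight MSE:", np.mean(((pred_a + pred_b) / 2 - y_val) ** 2))
```

The weights should be chosen on held-out data; tuning them on the training set would favor whichever base model overfits the most.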

### What are the benefits of ensemble model?

There are two major benefits of Ensemble models:

* Better prediction
* More stable model


The aggregate opinion of multiple models is less noisy than that of a single model. In finance this is called “diversification”: a mixed portfolio of many stocks will be much less variable than just one of the stocks alone. This is also why an ensemble of models will do better than an individual model. One caution with ensemble models is overfitting, although bagging largely takes care of it.