# Overview

What is transfer learning?
- why do it?
- relationship to fine-tuning

Relationship to pre-training
- autoencoders/auto-associators
- large language models



## Transfer Learning in a sentence 

Transfer Learning is an approach to learning new problems that attempts to leverage learning done on _similar_ problems.

| ![balance bike](imgs/balance.jpg) | Then ... | ![road bike](imgs/roadbike.jpg) | ... then | ![motor bike](imgs/motorbike.jpg) |
|---|---|---|---|---|


Original images from Wikimedia Commons: [balance bike](https://upload.wikimedia.org/wikipedia/commons/5/51/Wooden_Balance_Bike_for_Kids_oak_frame_with_flame_red_tires.jpg)
[road bike](https://commons.wikimedia.org/wiki/File:Colnago_Extreme_C.jpg)
[motor bike](https://commons.wikimedia.org/wiki/File:Five_children_on_a_motorcycle.jpg)


### Let's unpack that a bit

Let's assume we have some input space $\mathcal{X}$
- could be the space of all 256x256x3 images
- or could be a 'bag of words' encoding for sentences with some fixed vocabulary
- or some other common set of metrics/features that define a set of objects/events

and an output space $\mathcal{Y}$
- e.g. $\mathcal{Y}_1$ = \[horse, truck, person, ...\], or
- $\mathcal{Y}_2$ = \[cat, dog\]

One view of transfer learning says that if we split a model $M_1$ into two parts,  
$M_{1,intermediate}$ and $M_{1,final}$,  
with an intermediate 'feature detection output space' $\mathcal{F}$,  

where $M_{1,intermediate}: \mathcal{X} \rightarrow \mathcal{F}$   
and $M_{1,final}: \mathcal{F} \rightarrow \mathcal{Y}_1$

i.e. for an input $x$, $M_1(x) = M_{1,final}(M_{1,intermediate}(x))$


![Simple CNN architecture](./imgs/typical_cnn.jpg)

![Simple CNN architecture showing intermediate and final models](./imgs/typical_cnn_annotated.jpg)

then, provided there are some common elements between the tasks:

- given a model $M_1 = M_{1,final} \circ M_{1,intermediate}$ trained to predict $M_1: \mathcal{X} \rightarrow \mathcal{Y}_1$

- we can create a model for task 2, $M_2 = M_{2,final} \circ M_{1,intermediate}$, to predict $M_2: \mathcal{X} \rightarrow \mathcal{Y}_2$

- and this may be **better** in some sense (e.g. needing less labelled data or less training time)

- than trying to learn both $M_{2,final}$ and $M_{2,intermediate}$ from scratch
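
The reuse described above can be sketched in a few lines of Python. This is only an illustration: the "feature extractor" and the two heads are hand-written stand-ins (all names and thresholds are hypothetical), whereas in practice each part would be a block of trained network layers.

```python
def compose(final, intermediate):
    """Return the model x -> final(intermediate(x))."""
    return lambda x: final(intermediate(x))


# Hypothetical stand-in for a trained M_1,intermediate: maps a list of
# pixel values to two features in F (mean brightness, contrast).
def m1_intermediate(pixels):
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return (mean, contrast)


# Two task-specific heads that read from the same feature space F.
def m1_final(features):  # task 1 labels: bright / dark
    mean, _ = features
    return "bright" if mean > 0.5 else "dark"


def m2_final(features):  # task 2 labels: textured / flat
    _, contrast = features
    return "textured" if contrast > 0.3 else "flat"


m1 = compose(m1_final, m1_intermediate)
m2 = compose(m2_final, m1_intermediate)  # reuses the task-1 feature extractor

print(m1([0.9, 0.8, 1.0]), m2([0.9, 0.8, 1.0]), m2([0.0, 1.0]))
```

Only `m2_final` is new for task 2; the work that went into `m1_intermediate` is shared by both models.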

### Aside: fine-tuning

The fine-tuning approach says that it is better to learn in this sequence:

1. Train $M_{1,final}$ and $M_{1,intermediate}$ simultaneously for task 1

2. Make a copy $M_{2,intermediate} = M_{1,intermediate}$

3. Freeze $M_{2,intermediate}$ and adapt $M_{2,final}$ to get good accuracy

4. Then adapt both $M_{2,intermediate}$ and $M_{2,final}$ simultaneously for the last few bits of improvement

Steps 1-3 are transfer learning; step 4 is fine-tuning.
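
The sequence above can be sketched with a deliberately tiny model. Everything here is an assumption made for illustration: each part is a single scalar (`w` for the intermediate part, `v` for the final part), the two "tasks" are toy regression problems, and the learning rates and step counts are arbitrary. A deep-learning framework would express the same idea by marking layers as trainable or frozen.

```python
def predict(p, x):
    # Intermediate part: feature = w * x. Final part: output = v * feature.
    return p["v"] * (p["w"] * x)


def loss(p, data):
    return sum((predict(p, x) - y) ** 2 for x, y in data) / len(data)


def train(p, data, trainable, lr, steps):
    """Gradient descent on mean squared error, updating only `trainable` parameters."""
    p = dict(p)
    for _ in range(steps):
        grad = {"w": 0.0, "v": 0.0}
        for x, y in data:
            err = predict(p, x) - y
            grad["w"] += 2 * err * p["v"] * x / len(data)
            grad["v"] += 2 * err * p["w"] * x / len(data)
        for name in trainable:  # frozen parameters are simply skipped
            p[name] -= lr * grad[name]
    return p


xs = [-1.0, -0.5, 0.5, 1.0]
task1 = [(x, 2.0 * x) for x in xs]   # toy task 1
task2 = [(x, -3.0 * x) for x in xs]  # toy task 2

# 1. Train both parts simultaneously on task 1.
m1 = train({"w": 0.5, "v": 0.5}, task1, {"w", "v"}, lr=0.1, steps=200)
# 2. Copy the intermediate part and start a fresh final part for task 2.
m2 = {"w": m1["w"], "v": 0.0}
# 3. Freeze the intermediate part and adapt only the final part.
m2_frozen = train(m2, task2, {"v"}, lr=0.1, steps=5)
# 4. Fine-tune: adapt both parts, here with a smaller learning rate.
m2_tuned = train(m2_frozen, task2, {"w", "v"}, lr=0.01, steps=300)

print(loss(m2_frozen, task2), loss(m2_tuned, task2))
```

Using a smaller learning rate in step 4 is a common choice so that fine-tuning does not undo what the copied part already learned. It is shown here as one option, not a rule.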

### Sounds great, so when is it appropriate?

| Scenario | Transfer learning? | Fine-tuning? | Why? |
| ---|---|---|---|
| New data set is small and task 2 is similar to task 1 | Yes | No | Risk of over-fitting |
| New data set is large and task 2 is similar to task 1 | Yes | Yes | Enough data to adapt all layers with less risk of over-fitting |
| New data set is large and task 2 is different to task 1 | No | No | Train from scratch |

### Examples: lots of deep CNNs are available which were trained on ImageNet or CIFAR-100

### ImageNet [homepage](https://image-net.org)
Collection of >14m images organised into >21k "synonym sets" (synsets) according to the WordNet hierarchy.
- each labelled by several volunteers with different tags
- more than 1m have bounding box information for objects
- Good description from [papers with code](https://paperswithcode.com/dataset/imagenet)

### COCO: Common Objects in Context [homepage](https://cocodataset.org/#home)
- 330k images (more than 200k labelled)
- but 1.5 million object instances from 80 object categories
- contains more precisely segmented objects
- ![example](https://cocodataset.org/images/detection-splash.png)

## Issues?

### Mapping input spaces (domains) between problems
or: how do I use $M_{1,intermediate}$ for problem 2 if $\mathcal{X}_1 \neq \mathcal{X}_2$?

Images:
- resize, rescale
- simple if $|\mathcal{X}_1| < |\mathcal{X}_2|$, i.e. task 1 used smaller images, or one channel (b/w) where task 2 has 3 (rgb): downscale or convert to grey scale
- what if the opposite is true?  
  - upscaling is fairly well understood (e.g. SD to HD for TVs)
  - pseudo-colouring vs mapping grey scale to rgb, if task 1 has 3 channels and task 2 has 1
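
A minimal sketch of these image conversions, assuming images are plain nested lists of pixel values. The function names are illustrative, channel replication is just one way of mapping grey scale to rgb, and a real pipeline would normally use an image library instead.

```python
def grey_to_rgb(image):
    """Replicate a one-channel image into three identical channels."""
    return [[(p, p, p) for p in row] for row in image]


def rgb_to_grey(image):
    """Collapse three channels into one by unweighted averaging."""
    return [[(r + g + b) / 3 for r, g, b in row] for row in image]


def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize; works for both up- and down-scaling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[i * old_h // new_h][j * old_w // new_w] for j in range(new_w)]
        for i in range(new_h)
    ]


grey = [[0.0, 1.0],
        [0.5, 0.25]]

rgb = grey_to_rgb(grey)             # task 2 has 1 channel, task 1 expects 3
bigger = resize_nearest(grey, 4, 4)  # task 2 images are smaller than task 1's
print(rgb[0][1], len(bigger), len(bigger[0]))
```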

Language:
- can be problematic if restricting vocabularies to a manageable size means that words from task 2 are not present in task 1
  - e.g. Med-BERT vs BERT [Med-BERT arXiv paper, Rasmy et al 2020](https://arxiv.org/abs/2005.12833)
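
One rough way to gauge the vocabulary problem is to measure how many task 2 tokens are missing from the task 1 vocabulary. The sketch below assumes a simple word-level vocabulary and whitespace tokenisation; the vocabulary and the sentence are made up for illustration.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that do not appear in the vocabulary."""
    if not tokens:
        return 0.0
    missing = [t for t in tokens if t not in vocabulary]
    return len(missing) / len(tokens)


# Hypothetical vocabulary kept from task 1, and a sentence from task 2.
task1_vocab = {"the", "patient", "was", "given", "a", "drug"}
task2_tokens = "the patient was given metformin".split()

print(oov_rate(task2_tokens, task1_vocab))
```

A high rate suggests that the task 1 model never saw much of the language that matters for task 2.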

### What if I don't have a nicely labelled dataset for task 1?

<span style="color:red; font-size:24pt;">That's where pre-training comes in ...</span>

## Pre-training as a form of transfer learning

### Put crudely: define a simple task to turn an unsupervised problem into a supervised one.

### Example 1: _de-noising autoencoders_<sup>1</sup>

This takes the original idea of using autoencoders as a pre-training step:
>One key ingredient to this success appears to be the use of an unsupervised training criterion to perform a layer-by-layer initialization: each layer is at first trained to produce a higher level (hidden) representation of the observed patterns, based on the representation it receives as input from the layer below, by optimizing a local unsupervised criterion. ... \[then\] a global fine-tuning of the model’s parameters is then performed using another training criterion appropriate for the task at hand.

Here the 'unsupervised criterion' is that the input to a layer can be reconstructed from its outputs.   
De-noising autoencoders add the idea that partially corrupted versions of an input should produce similar outputs.
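
The de-noising criterion can be sketched as follows: corrupt the input, reconstruct from the corrupted copy, but measure the error against the clean input. The independent per-component masking and the squared-error loss are simplifying assumptions made here for brevity, and `identity` is a hypothetical stand-in for a trained autoencoder.

```python
import random


def corrupt(x, drop_prob, rng):
    """Masking noise: set each component to 0 independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else value for value in x]


def denoising_loss(autoencoder, x, drop_prob, rng):
    """Reconstruct from the corrupted input, but score against the clean input."""
    x_corrupted = corrupt(x, drop_prob, rng)
    x_reconstructed = autoencoder(x_corrupted)
    return sum((clean - rec) ** 2 for clean, rec in zip(x, x_reconstructed))


# Hypothetical stand-in for an autoencoder: it just copies its input,
# so any corruption shows up directly in the loss.
identity = lambda v: list(v)

rng = random.Random(0)
x = [1.0, 0.5, 0.25, 1.0]
print(denoising_loss(identity, x, 0.0, rng))  # no corruption
print(denoising_loss(identity, x, 1.0, rng))  # every component masked
```

A model that only copies its input is penalised as soon as the input is corrupted, which pushes a trained autoencoder towards features that are robust to the noise.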

1: _Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM._ [pdf link](https://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf)

### Example 2: Large language models
Most of these (e.g. GPT, BERT) rely heavily on pre-training.

For example, BERT uses a combination of two simple pre-training objectives:
- a 'masked language model'
  - randomly blank out some words from input sentences
  - a simple output layer has to predict the missing words
  - allows for training of a Bidirectional Transformer (more in next few weeks)
- 'next sentence prediction'
  - pass in a sentence and either:
    - the next one (label = _IsNext_)
    - a random sentence (label = _NotNext_)
  - a simple output layer has to predict the label
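
Both objectives turn unlabelled text into labelled training examples, which can be sketched as below. This is a simplified illustration rather than BERT's actual data pipeline: it works on whitespace-separated words, always substitutes a `[MASK]` token, and assumes at least three distinct sentences. The function names are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Return (masked tokens, labels); a label is None where nothing was masked."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)  # the model must predict the original word
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def next_sentence_pairs(sentences, rng):
    """Pair each sentence with its true successor or with a random other sentence."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Exclude the sentence itself and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, 0.15, rng)
sentences = ["I went out.", "It was raining.", "I got wet.", "Then I went home."]
pairs = next_sentence_pairs(sentences, rng)
print(masked, labels)
print(pairs)
```

No human labelling is needed in either case: the labels come from the text itself.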

> [See Fig 1 in *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*, Devlin et al 2019](https://arxiv.org/pdf/1810.04805.pdf)

# Other related tasks: Curriculum learning
An approach to Reinforcement Learning that uses progressively more complex tasks
- either in simulation:
  - bend
  - walk
  - run
  - jump
  - parkour!
- or simulation $\rightarrow$ reality

Could use for supervised ML $\rightarrow$ Generative Adversarial Networks 

## Summary
1. Transfer Learning is really powerful, especially when you don't have much data

2. There are links to _curriculum learning_, where you train a model through a sequence of successively more complex tasks (widely used in Reinforcement Learning)  
   There's a reading list article about this.
   
3. **It is not without its problems**  
   Especially if the early problems "bake-in" unwanted bias.  
   For example, in the image below 'synsets' are different occupations.

   ![example of bias in imagenet](https://image-net.org/static_files/figures/demographcs_distribution.png)
   [report on image-net bias](https://image-net.org/update-sep-17-2019.php)