# Fine-Tuning Pre-trained Machine Learning Models

There are many cases where taking a pre-trained model off the shelf can be a great way to tackle a problem. 
You've seen that there are tons of models available online for tackling all kinds of problems.

However, most models available online are dataset-specific or problem-specific.

Generic examples:
- Most text models are trained on a generic text dataset, but are not specialised to any particular domain
- Many of the top image classification models are trained on [ImageNet](https://www.image-net.org/) -- which has 1000 possible classification classes -- but your problem might only have 3 different classes that you want to distinguish between. That means that the off-the-shelf model is going to have the wrong output shape.

Specific examples:
- There may be a generic image classifier online, but it's unlikely there is a model for classifying different types of bicycles in New York City
- There may be a generic sentiment classifier online, but it's unlikely that there is one for detecting specific variations of positive sentiments in recently evolved colloquial language (slang, abbreviations, acronyms) used by a particular segment of TikTok users in comments.
- There may be a generic text summarisation model online, but it's unlikely that there is one for summarising a specific type of medical literature on the conditions relating to human memory

The more obscure or specific your problem, the less likely there's a model trained for the exact task available off the shelf.

It sounds like we might have to start from scratch. But is there a middle ground?

# TODO diagram of spectrum between "train from scratch" and "use model off the shelf"

Many of the things that these off-the-shelf models must have learnt are still relevant to the models that we want to produce.

For example:
- A generic image classification model trained on ImageNet will have had to learn how to recognise different types of shapes and textures.
- A generic language model will have learnt what the general structure of sentences looks like

This knowledge, stored as the parameters of the models, can be useful for our specific task, and should be transferred to our new model.

This is where _model fine-tuning (or transfer learning)_ comes in.

> Model fine-tuning is the process of taking a model pre-trained on a related problem as a starting point, and training it further to work for your specific case.

# TODO diagram of spectrum

> The starting model, which will be fine-tuned, is commonly called _the backbone_.

E.g. "I trained an image classifier with a ResNet-50 backbone"

Typically, the backbone is a large model that a big company spent a lot of time and compute training (check out some of Microsoft's models [here](https://huggingface.co/microsoft)).
It may not be possible for you to train that model from scratch in your lifetime. For example, GPT-2 (yes, 2, not 3) would have taken around 13 years to train on a single GPU. Even so, it is incredible that you can take it [straight off the shelf](https://huggingface.co/gpt2).

Model fine-tuning usually consists of two steps:
1. Replacing the model head
1. Training the parameters of the head (and perhaps the rest of the model too)

## Fine-tuning step 1: Replacing the model head

> We call the part of the model that makes the final predictions from the outputs of the last hidden layer the _model head_.

## TODO diagram highlighting model head

> The first step in transfer learning is to replace the head of the starting model.

Many people make the mistake of keeping the head, and building another layer on top of it, instead of throwing it away and replacing it.

Picture this scenario:
- You want to train a classifier to classify 3 different types of castles, starting with a classifier trained on ImageNet.
- If you do not throw away the head, then the head will produce nearly the same output, dominated by the class "castle" (a class which appears in the ImageNet dataset), for every image in your specific castle dataset
- This means that the input to your new additional layer will be almost the same for every example: a vector of low values everywhere except for the node corresponding to the "castle" logit
- All of the information about the high-level features will be compressed through this bottleneck, and the features that would help you to distinguish between different types of castles will be lost
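
The bottleneck in this scenario can be illustrated with a few made-up numbers. This is only a sketch: the logits below are invented for illustration (they do not come from a real ImageNet model), the index of the "castle" class is arbitrary, and it assumes the extra layer is stacked on the softmax output of the original head.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits from a hypothetical original head for two *different* castle images.
# Index 2 plays the role of the "castle" class.
logits_image_a = [1.0, 0.5, 12.0, 0.2]
logits_image_b = [0.3, 1.2, 11.0, 0.8]

probs_a = softmax(logits_image_a)
probs_b = softmax(logits_image_b)

# The largest element-wise difference between what the stacked layer would see
max_difference = max(abs(a - b) for a, b in zip(probs_a, probs_b))
print(probs_a)
print(probs_b)
print(max_difference)
```

Both images map to almost the same vector, so a layer stacked on top has very little left to tell them apart.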

# TODO diagram of mistake of not replacing the model head

Because of this, the original model head should be thrown away, and replaced with a new head which we will train in the next step.
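
A minimal PyTorch sketch of replacing the head might look like the following. The `Classifier` class, its layer sizes and the `backbone`/`head` attribute names are assumptions made for illustration, and the model is randomly initialised rather than actually pre-trained.

```python
import torch
from torch import nn


class Classifier(nn.Module):
    """Toy stand-in for a pre-trained classifier, split into a backbone and a head."""

    def __init__(self, n_features=64, n_classes=1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 32, n_features),
            nn.ReLU(),
        )
        self.head = nn.Linear(n_features, n_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


model = Classifier()  # pretend this was pre-trained on 1000 classes
images = torch.randn(8, 3, 32, 32)  # a random batch standing in for real images
original_shape = model(images).shape

# Replace the head (rather than stacking on top of it): the new layer reads
# the backbone's features directly and outputs 3 classes.
model.head = nn.Linear(model.head.in_features, 3)
new_shape = model(images).shape

print(original_shape, new_shape)
```

In real libraries the head attribute has a model-specific name. For instance, torchvision's ResNet models expose the final layer as `fc`.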

> Adding a new layer **on top** of the existing model head, **instead of replacing it**, is a very common mistake.

# TODO diagram of replacing model head


The only case where you do not need to replace the model head is when you are training the model further on the same problem **and the same targets that the original model was trained on**. 
For example, an image classification model may need further training on a more recent set of images that contains new styles of vehicles and clothing. 
This "further training" is typically useful because, as trends change, the distribution of the data that appears in your dataset may change too. 

## Fine-tuning step 2: Final training

Because the new model head is initialised with random parameters, you need to train them.

There are two fundamental approaches to the final training phase of fine-tuning:
1. _Weight freezing_: Keep the parameters of the original model fixed
1. _Discriminated learning rates_: Update the parameters of both the new head and the original model, but use a lower learning rate on the original model so that the original knowledge is not lost

### Fine-tuning by weight freezing

Weight freezing is where you totally prevent the parameters of the original model from updating.

Advantages of weight freezing:
- The backward pass and parameter updates during training are a lot faster, because only the parameters of the head have to be updated
    - The original model can contain over 99% of the parameters
- All of the original knowledge is retained

Disadvantages of weight freezing:
- No opportunity for most of the parameters to specialise for the new task
- A single layer head may not have the capacity to learn the function to usefully combine the input activations into the desired output
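
As a rough sketch of weight freezing in PyTorch, you can set `requires_grad = False` on the backbone's parameters and hand only the head's parameters to the optimiser. The tiny `backbone` and `head` below are randomly initialised stand-ins (not a real pre-trained model), and the data is random.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins for a pre-trained backbone and a freshly initialised head
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
head = nn.Linear(64, 3)
model = nn.Sequential(backbone, head)

# Freeze the backbone: no gradients are computed for these parameters
for param in backbone.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} of {total}")

# Only the head's parameters are given to the optimiser
optimiser = torch.optim.SGD(head.parameters(), lr=0.1)

backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

# One training step on a random batch
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(images), labels)
optimiser.zero_grad()
loss.backward()
optimiser.step()

backbone_unchanged = all(
    torch.equal(before, after) for before, after in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(backbone_unchanged, head_changed)
```

Even in this toy model the head holds well under 1% of the parameters, which is why the frozen setup is cheap to train.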

### Fine-tuning by using discriminated learning rates

Using discriminated learning rates (often called discriminative learning rates) is the approach to fine-tuning where you update the parameters of both the original model and the new head, but use a lower learning rate for the original model's parameters. That way they don't change too drastically, and most of the original knowledge is retained.

Advantages of discriminated learning rates:
- All parameters have the opportunity to specialise for the new task
    - Earlier layers can adapt to produce representations from which even a single-layer head can make accurate predictions

Disadvantages of discriminated learning rates:
- The backward pass and parameter updates during training are a lot slower, because every parameter requires its gradient to be computed and requires updating on every optimisation step
    - The original model can contain over 99% of the parameters
- During training, especially early on when the head is random and the predictions are bad, the large gradients can cause the model to catastrophically forget some fundamental knowledge
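
One way to sketch discriminated learning rates in PyTorch is with optimiser parameter groups, where each group carries its own `lr`. The stand-in `backbone` and `head`, the random data and the specific learning rates below are assumptions chosen for illustration, not recommended values.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins for a pre-trained backbone and a freshly initialised head
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
head = nn.Linear(64, 3)
model = nn.Sequential(backbone, head)

# One parameter group per part of the model, each with its own learning rate
optimiser = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-4},  # small steps: preserve existing knowledge
        {"params": head.parameters(), "lr": 1e-2},      # larger steps: the head starts from random
    ],
    lr=1e-2,  # default for any group that does not set its own lr
)
group_lrs = [group["lr"] for group in optimiser.param_groups]

backbone_before = [p.detach().clone() for p in backbone.parameters()]

# One training step on a random batch: every parameter receives a gradient
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(model(images), labels)
optimiser.zero_grad()
loss.backward()
optimiser.step()

backbone_changed = any(
    not torch.equal(before, after) for before, after in zip(backbone_before, backbone.parameters())
)
print(group_lrs, backbone_changed)
```

Unlike weight freezing, the backbone's parameters do move here, just by much smaller steps than the head's.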


## Things to be careful with when fine-tuning

- There is often little point in using a starting model that was trained on a very different dataset. For example, a starting model trained on ImageNet is likely to be a less useful starting point for a model trained to classify x-rays, because many of the representations learnt by the starting model may not apply.

