## Introduction

In a recent notebook I tried to answer the question "[Which image models are best?](https://www.kaggle.com/code/jhoward/which-image-models-are-best)" This showed which models in Ross Wightman's [PyTorch Image Models](https://timm.fast.ai/) (*timm*) were the fastest and most accurate for training from scratch with Imagenet.

However, this is not what most of us use models for: most of us fine-tune pretrained models. What we really want to know, then, is which models are the fastest and most accurate for fine-tuning. To my knowledge, no such analysis previously existed.

Therefore I teamed up with [Thomas Capelle](https://tcapelle.github.io/about/) of [Weights and Biases](https://wandb.ai/) to answer this question. In this notebook, I present our results.

## The analysis

There are two key dimensions on which datasets can vary when it comes to how well they fine-tune a model:

1. How similar they are to the pre-trained model's dataset
2. How large they are.

Therefore, we decided to test on two datasets that were very different on both of these axes. We tested pre-trained models that were trained on Imagenet, and tested fine-tuning on two different datasets:

1. The [Oxford IIT-Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/), which is very similar to Imagenet. Imagenet contains many pictures of animals, and each picture is a photo in which the animal is the main subject. IIT-Pet contains nearly 15,000 images that are also of this type.
2. The [Kaggle Planet](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data) sample contains 1,000 satellite images of Earth. There are no images of this kind in Imagenet.

So these two datasets are of very different sizes, and very different in terms of their similarity to Imagenet. Furthermore, they have different types of labels: Planet is a multi-label problem, whereas IIT-Pet is a single-label problem.

To test the fine-tuning accuracy of different models, Thomas put together [this script](https://github.com/tcapelle/fastai_timm/blob/main/fine_tune.py). The basic script contains the standard 4 lines of code needed for fastai image recognition models, plus some code to handle various configuration options, such as learning rate and batch size. It was particularly easy to handle in fastai since fastai supports all timm models directly.

Then, to allow us to easily try different configuration options, Thomas created Weights and Biases (*wandb*) YAML files such as [this one](https://github.com/tcapelle/fastai_timm/blob/main/sweep_planets_lr.yaml). This takes advantage of the convenient [wandb "sweeps"](https://wandb.ai/site/sweeps) feature, which tries a range of different values for each configuration option and tracks the results.

wandb makes it really easy for a group of people to run these kinds of analyses on whatever GPUs they have access to. When you create a sweep using the command-line wandb client, it gives you a command to run to have a computer run experiments for the project. You run that same command on each computer where you want to run experiments. The wandb client automatically ensures that each computer runs different parts of the sweep, and has each one report back its results to the wandb server. You can look at the progress in the wandb web GUI at any time during or after the run. I've got three GPUs in my PC at home, so I ran three copies of the client, with each using a different GPU. Thomas also ran the client on a [Paperspace Gradient](https://gradient.run/notebooks) server.
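As a rough illustration of that workflow, here is a minimal sketch that defines a sweep from Python rather than from a YAML file. The parameter names and values, the metric name, the project name, and the `train` function are all placeholders invented for this example; they are not the settings from our sweep files. `wandb.sweep` registers the sweep with the server and `wandb.agent` pulls settings from it.

```python
# Hypothetical sweep definition: every name and value here is a placeholder.
sweep_config = {
    "method": "grid",  # try every combination listed under "parameters"
    "metric": {"name": "error_rate", "goal": "minimize"},
    "parameters": {
        "model_name": {"values": ["convnext_tiny", "vit_small_patch16_224"]},
        "learning_rate": {"values": [0.001, 0.01]},
    },
}


def train():
    """Placeholder training function; a real one would fine-tune a model."""
    import wandb  # imported lazily so the config can be inspected without wandb

    with wandb.init() as run:
        cfg = run.config  # the settings the server handed to this agent
        # ... train with cfg.model_name and cfg.learning_rate, then report:
        run.log({"error_rate": 0.0})  # dummy value, for illustration only


def launch(project="my-sweep-demo", count=None):
    """Register the sweep and start one agent on this machine."""
    import wandb

    sweep_id = wandb.sweep(sweep=sweep_config, project=project)
    # Start an agent like this on every machine or GPU that should help;
    # each agent asks the server for the next untried combination.
    wandb.agent(sweep_id, function=train, project=project, count=count)
```

Nothing runs until `launch()` is called, and that call needs a wandb account and login. With a grid sweep like this one, the server hands out each of the four combinations once, however many agents are running.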

I liked this approach because I could start and stop the clients any time I wanted, and wandb would automatically handle keeping all the results in sync. When I restarted a client, it would automatically grab from the server whichever sweep settings needed to be run next. Furthermore, the integration in fastai is really exceptional, thanks particularly to [Boris Dayma](https://github.com/borisdayma), who worked tirelessly to ensure that wandb automatically tracks every aspect of all fastai data processing, model architectures, and optimisation.

## Hyperparameters

We decided to try out all the timm models which had reasonable performance on timm. We ended up with a list of 86 models and variants to try.

Our first step was to find a good set of hyper-parameters for each model variant and for each dataset. Our experience at fast.ai has been that there's generally not much difference between models and datasets in terms of what hyperparameter settings work well -- and that experience was repeated in this project. For every model and every dataset we tried a variety of settings, and found that the following were the best (or near enough) for all combinations:

- Learning rate (AdamW): 0.08
- Resize method: [Squish](https://docs.fast.ai/vision.augment.html#Resize)
- Pooling type: [Concat](https://docs.fast.ai/layers.html#AdaptiveConcatPool2d)

For other parameters, we used defaults that we've previously found at fast.ai to be reliable across a range of models and datasets (see the fastai docs for details).
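To make the settings above concrete, here is a minimal sketch of how they map onto fastai arguments. It is not the script we used: the function name, the 224px image size, the five epochs, and the file-name pattern for labels are assumptions for illustration, and the learning rate is left as an argument for the caller to supply.

```python
def fine_tune_pets(arch, lr, epochs=5, size=224):
    """Fine-tune a timm architecture (given by name) on fastai's copy of Pets."""
    # Imported here so that defining the function does not require fastai.
    from fastai.vision.all import (
        ImageDataLoaders, Resize, URLs, error_rate,
        get_image_files, untar_data, vision_learner,
    )

    path = untar_data(URLs.PETS) / "images"
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path),
        pat=r"(.+)_\d+.jpg$",                     # breed name is the file-name prefix
        item_tfms=Resize(size, method="squish"),  # the "Squish" resize method
    )
    # A string architecture name tells fastai to look the model up in timm;
    # concat_pool=True selects the concatenated average+max pooling head.
    learn = vision_learner(dls, arch, metrics=error_rate, concat_pool=True)
    learn.fine_tune(epochs, lr)
    return learn
```

Calling `fine_tune_pets('convnext_tiny', lr)` with the learning rate from the list above downloads the dataset and trains, so it needs fastai, timm, and ideally a GPU. Trying a different model only means changing the `arch` string.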

## Analysis

Let's take a look at the data. I've put a CSV of the results into a gist:

```python
from fastai.vision.all import *
import plotly.express as px

url = 'https://gist.githubusercontent.com/jph00/44a39eede49567f0b2cc8c5acbc0d762/raw/sweep.csv'
```

For each model variant and dataset, for each hyperparameter setting, we did three runs. For the final sweep, we just used the hyperparameter settings listed above.

For each model variant and dataset, I take the minimum error rate, fit time, and GPU memory use across its runs. I use the minimum because there might be some reason that a particular run didn't do so well (e.g. maybe there was some resource contention), and I'm mainly interested in knowing what the best case results for a model can be.

I create a "score" which, somewhat arbitrarily, combines the accuracy and speed into a single number. I tried a few options until I came up with something that closely matched my own opinions about the tradeoffs between the two. (Feel free of course to fork this notebook and adjust how that's calculated.)
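Here is a small standalone sketch of those two steps, using the formula from this notebook's code, `error_rate*(fit_time+80)`. The three runs are made-up numbers, purely for illustration.

```python
# Hypothetical results from three runs of one model on one dataset.
runs = [
    {"error_rate": 0.061, "fit_time": 95.2},
    {"error_rate": 0.058, "fit_time": 101.7},
    {"error_rate": 0.060, "fit_time": 93.4},
]

# Take the minimum of each metric independently across the runs (the best case).
best = {key: min(run[key] for run in runs) for key in ("error_rate", "fit_time")}


def score(error_rate, fit_time, offset=80):
    """Lower is better: error rate scaled by fit time plus a constant offset."""
    return error_rate * (fit_time + offset)


print(best)
print(round(score(**best), 3))
```

Because each minimum is taken separately, the best error rate and the best fit time can come from different runs, as they do here. The constant offset stops very fast models from winning on speed alone: halving the fit time less than halves the score.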

```python
df = pd.read_csv(url)
```

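```python
# Model family: the leading letters (plus an optional "v2") before the first digit or underscore
df['family'] = df.model_name.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')
# Treat swinv2 as part of the swin family
df.loc[df.family=='swinv2', 'family'] = 'swin'
# Best-case (minimum) value of each metric per dataset and model
pt_all = df.pivot_table(values=['error_rate','fit_time','GPU_mem'], index=['dataset', 'family','model_name'],
                    aggfunc='min').reset_index()
# Lower is better: error rate weighted by fit time plus a constant
pt_all['score'] = pt_all.error_rate*(pt_all.fit_time+80)
```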
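To see what the family pattern does, here is a standalone sketch that applies the same regular expression with Python's `re` module to a few of the model names from the results. The `family` helper is just for illustration; it isn't part of the analysis code.

```python
import re

# Same pattern as in the cell above, written as a raw string.
FAMILY_RE = re.compile(r"^([a-z]+?(?:v2)?)(?:\d|_|$)")


def family(model_name):
    """Return the leading alphabetic family name, folding swinv2 into swin."""
    match = FAMILY_RE.match(model_name)
    name = match.group(1) if match else None
    return "swin" if name == "swinv2" else name


for name in ["convnext_tiny", "resnet50d", "resnetv2_50x1_bit_distilled",
             "swinv2_cr_small_ns_224", "vit_small_patch16_224"]:
    print(f"{name:32s} -> {family(name)}")
```

The lazy `[a-z]+?` stops at the first digit or underscore, and the optional `(?:v2)?` keeps `resnetv2` as its own family instead of cutting it off at the `2`.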

### IIT Pet

Here's the top 15 models on the IIT Pet dataset, ordered by score:

```python
pt = pt_all[pt_all.dataset=='pets'].sort_values('score').reset_index(drop=True)
pt.head(15)
```

| | dataset | family | model_name | GPU_mem | error_rate | fit_time | score |
|---|---|---|---|---|---|---|---|
| 0 | pets | convnext | convnext_tiny | 2.660156 | 0.052097 | 90.910305 | 8.903989 |
| 1 | pets | vit | vit_small_patch16_224 | 2.121094 | 0.056157 | 84.598138 | 9.243336 |
| 2 | pets | swin | swin_s3_tiny_224 | 3.126953 | 0.049391 | 113.855675 | 9.574743 |
| 3 | pets | vit | vit_base_patch16_224 | 4.833984 | 0.041949 | 150.126558 | 9.653477 |
| 4 | pets | swin | swin_tiny_patch4_window7_224 | 2.816406 | 0.051421 | 108.683676 | 9.702278 |
| 5 | pets | vit | vit_base_patch16_224_miil | 4.853516 | 0.043302 | 147.776299 | 9.863115 |
| 6 | pets | resnetv2 | resnetv2_50x1_bit_distilled | 4.107422 | 0.050744 | 116.213847 | 9.956722 |
| 7 | pets | convnext | convnext_base_in22ft1k | 5.890625 | 0.040595 | 175.319081 | 10.364783 |
| 8 | pets | convnext | convnext_tiny_hnf | 2.990234 | 0.05751 | 106.618983 | 10.732483 |
| 9 | pets | convnext | convnext_tiny_in22k | 2.660156 | 0.064276 | 95.352934 | 11.270992 |


As you can see, the [convnext](https://arxiv.org/abs/2201.03545), [swin](https://arxiv.org/abs/2103.14030), and [vit](https://arxiv.org/abs/2010.11929) families are fairly dominant. The excellent showing of `convnext_tiny` matches my view that we should think of this as our default baseline for image recognition today. It's fast, accurate, and not too much of a memory hog. (And according to Ross Wightman, it could be even faster if NVIDIA and PyTorch make some changes to better optimise the operations it relies on!)

`vit_small` is also a good option -- it's faster and leaner on memory than `convnext_tiny`, although there is some cost in accuracy too.

Interestingly, resnets are still a great option -- especially the [`resnet50d`](https://arxiv.org/abs/1812.01187) variant, which is great for memory use, or `resnet26` which is both the leanest and fastest in our top 15.

Here's a quick visual representation of the seven model families which look best in the above analysis (the "fit lines" are just there to help visually show where the different families are -- they don't necessarily actually follow a linear fit):

```python
w,h = 900,700
faves = ['vit','convnext','resnet','levit', 'regnetx', 'swin', 'swinv2']
pt2 = pt[pt.family.isin(faves)]
px.scatter(pt2, width=w, height=h, x='fit_time', y='error_rate', color='family', hover_name='model_name', trendline="ols")
```

This chart shows that there's a big drop-off in accuracy towards the far left. It seems like there's a big compromise if we want the fastest possible model.

I particularly like using fast and small models, since I want to be able to iterate rapidly to try lots of ideas (see [this notebook](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster) for more on this). Here are the top models (based on accuracy) that are smaller and faster than the median model:

```python
pt.query("(GPU_mem<2.7) & (fit_time<110)").sort_values("error_rate").head(15).reset_index(drop=True)
```

| | dataset | family | model_name | GPU_mem | error_rate | fit_time | score |
|---|---|---|---|---|---|---|---|
| 0 | pets | convnext | convnext_tiny | 2.660156 | 0.052097 | 90.910305 | 8.903989 |
| 1 | pets | vit | vit_small_patch16_224 | 2.121094 | 0.056157 | 84.598138 | 9.243336 |
| 2 | pets | resnet | resnet50d | 2.058594 | 0.060893 | 105.216626 | 11.278418 |
| 3 | pets | levit | levit_384 | 1.699219 | 0.062923 | 100.538907 | 11.360031 |
| 4 | pets | convnext | convnext_tiny_in22k | 2.660156 | 0.064276 | 95.352934 | 11.270992 |
| 5 | pets | resnetblur | resnetblur50 | 2.208984 | 0.064276 | 108.031933 | 12.085948 |
| 6 | pets | regnetx | regnetx_016 | 1.568359 | 0.069012 | 101.614066 | 12.53358 |
| 7 | pets | resnet | resnet26 | 1.310547 | 0.075101 | 72.05132 | 11.419281 |
| 8 | pets | regnety | regnety_006 | 1.03125 | 0.075101 | 105.061571 | 13.898401 |
| 9 | pets | resnet | resnet26d | 1.505859 | 0.075778 | 78.610718 | 12.019214 |
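The cut-offs in the query above (2.7 for memory and 110 for fit time) are hard-coded. As a sketch of the same idea with the thresholds derived from the data, the snippet below computes medians with the standard library and keeps the models strictly below both. It uses only five rows copied from the Pets tables above, so its medians are not the ones for the full set of models.

```python
from statistics import median

# (model_name, GPU_mem, error_rate, fit_time): a few rows from the Pets tables above.
rows = [
    ("convnext_tiny",         2.660156, 0.052097,  90.910305),
    ("vit_small_patch16_224", 2.121094, 0.056157,  84.598138),
    ("swin_s3_tiny_224",      3.126953, 0.049391, 113.855675),
    ("vit_base_patch16_224",  4.833984, 0.041949, 150.126558),
    ("resnet26",              1.310547, 0.075101,  72.05132),
]

mem_cut = median(r[1] for r in rows)
time_cut = median(r[3] for r in rows)

# Keep models strictly below both medians, most accurate first.
small_fast = sorted(
    (r for r in rows if r[1] < mem_cut and r[3] < time_cut),
    key=lambda r: r[2],
)
print([r[0] for r in small_fast])
```

With these five rows, `convnext_tiny` sits exactly on both medians, so the strict comparison drops it. Whether to use `<` or `<=` is a judgment call when adapting this.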


### Planet

Here's the top-15 for Planet:

```python
pt = pt_all[pt_all.dataset=='planet'].sort_values('score').reset_index(drop=True)
pt.head(15)
```

| | dataset | family | model_name | GPU_mem | error_rate | fit_time | score |
|---|---|---|---|---|---|---|---|
| 0 | planet | vit | vit_base_patch16_224 | 5.220703 | 0.036765 | 25.313958 | 3.87184 |
| 1 | planet | vit | vit_small_patch16_224 | 2.3125 | 0.039706 | 17.98006 | 3.890384 |
| 2 | planet | vit | vit_tiny_patch16_224 | 1.152344 | 0.042059 | 16.402883 | 4.054598 |
| 3 | planet | vit | vit_base_patch32_224_sam | 2.755859 | 0.042059 | 17.448341 | 4.098557 |
| 4 | planet | vit | vit_base_patch32_224 | 2.755859 | 0.042353 | 16.840506 | 4.101478 |
| 5 | planet | vit | vit_base_patch16_224_sam | 5.220703 | 0.039118 | 25.019967 | 4.108133 |
| 6 | planet | swin | swinv2_cr_small_ns_224 | 5.400391 | 0.034118 | 41.134334 | 4.13281 |
| 7 | planet | convnext | convnext_tiny_in22k | 2.740234 | 0.040882 | 22.186247 | 4.177614 |
| 8 | planet | vit | vit_small_patch32_224 | 0.787109 | 0.043823 | 15.530336 | 4.186472 |
| 9 | planet | swin | swin_small_patch4_window7_224 | 4.517578 | 0.036471 | 36.911189 | 4.26382 |


Interestingly, the results look quite different: there's a clear victory for *vit* across the board in terms of the combination of accuracy and speed. (The `swinv2_cr_small_ns_224` model has the best accuracy, but it's also the slowest of these top 15 architectures.)

Because this dataset is so different to Imagenet, what we're testing here is more about how quickly and data-efficiently a model can learn new features that it hasn't seen before. We can see that the transformer-based *vit* is able to do that better than any other model.

The downside of vit and swin models, like most transformer-based models, is that they can only handle one input image size. Of course, we can always squish or crop or pad our input images to the required size, but this can have a significant impact on performance. For instance, while recently looking at the [Kaggle Paddy Disease](https://www.kaggle.com/competitions/paddy-disease-classification) competition, I found the ability of convnext models to handle dynamically sized inputs very convenient.

Here's a chart of the seven top families, this time for the Planet dataset:

```python
pt2 = pt[pt.family.isin(faves)]
px.scatter(pt2, width=w, height=h, x='fit_time', y='error_rate', color='family', hover_name='model_name', trendline="ols")
```

One striking feature is that for this dataset, there's little correlation between model size and performance. Regnetx and vit are the only families that show much of a relationship here. This suggests that if you have data that's very different to your pretrained model's data, you might want to focus on smaller models. This makes intuitive sense, since these models have more new features to learn, and if they're too big they're either going to overfit, or fail to utilise their capacity effectively.
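To put a rough number on the relationship for one family, here is a sketch that fits a least-squares line of error rate against fit time (the chart's x-axis) for the vit rows in the Planet top-scoring table above. It covers only the vit models that appear in that table, not every vit variant in the sweep, and the `ols` helper is a hand-rolled illustration rather than the fit plotly performs.

```python
# (fit_time, error_rate) for the vit rows in the Planet top-scoring table above.
vit = [
    (25.313958, 0.036765), (17.98006, 0.039706), (16.402883, 0.042059),
    (17.448341, 0.042059), (16.840506, 0.042353), (25.019967, 0.039118),
    (15.530336, 0.043823),
]


def ols(points):
    """Least-squares slope, intercept and Pearson r for y against x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / (sxx * syy) ** 0.5


slope, intercept, r = ols(vit)
print(f"slope={slope:.5f} per second of fit time, r={r:.2f}")
```

A negative slope means that the slower vit models in this subset tend to have lower error. Running the same function over another family's rows would give a comparable figure for that family.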

Here's the most accurate small and fast models on the Planet dataset:

```python
pt.query("(GPU_mem<2.7) & (fit_time<25)").sort_values("error_rate").head(15).reset_index(drop=True)
```

| | dataset | family | model_name | GPU_mem | error_rate | fit_time | score |
|---|---|---|---|---|---|---|---|
| 0 | planet | vit | vit_small_patch16_224 | 2.3125 | 0.039706 | 17.98006 | 3.890384 |
| 1 | planet | vit | vit_tiny_patch16_224 | 1.152344 | 0.042059 | 16.402883 | 4.054598 |
| 2 | planet | convnext | convnext_tiny | 2.662109 | 0.042941 | 20.545165 | 4.317525 |
| 3 | planet | vit | vit_small_patch32_224 | 0.787109 | 0.043823 | 15.530336 | 4.186472 |
| 4 | planet | vit | vit_tiny_r_s16_p8_224 | 0.984375 | 0.044706 | 16.201229 | 4.300754 |
| 5 | planet | resnet | resnet26 | 1.310547 | 0.045882 | 16.307683 | 4.418817 |
| 6 | planet | resnet | resnet26d | 1.505859 | 0.046177 | 17.209666 | 4.488807 |
| 7 | planet | resnet | resnet18 | 0.671875 | 0.046765 | 14.0151 | 4.396585 |
| 8 | planet | mobilevit | mobilevit_xxs | 1.154297 | 0.047647 | 21.987369 | 4.859392 |
| 9 | planet | resnetblur | resnetblur50 | 2.208984 | 0.047941 | 21.238755 | 4.853502 |


`vit_small_patch16_224`, which had the second-best score of all models, stands out here as by far the most accurate, and amongst the fastest. `convnext_tiny` is still the most accurate option amongst architectures that don't have a fixed resolution. Resnets 18 and 26 have very low memory use, are fast, and are still quite accurate.

## Conclusions

It really seems like it's time for a changing of the guard when it comes to computer vision models. There are, as of the time of writing (June 2022), three very clear winners when it comes to fine-tuning pretrained models:

- [convnext](https://arxiv.org/abs/2201.03545)
- [vit](https://arxiv.org/abs/2010.11929)
- [swin](https://arxiv.org/abs/2103.14030) (and [v2](https://arxiv.org/abs/2111.09883)).

Thankfully, it's easy to try lots of different models, especially if you use fastai and timm, because it's literally as easy as changing the model name in one place in your code. Your existing hyperparameters are most likely going to continue to work fine regardless of what model you try. And it's particularly easy if you use [wandb](https://wandb.ai/), since you can start and stop experiments at any time and they'll all be automatically tracked and managed for you.