*The data, concept, and initial implementation of this notebook were done in Colab by Ross Wightman, the creator of timm. I (Jeremy Howard) did some refactoring, curating, and expanding of the analysis, and added prose.*

## timm

[PyTorch Image Models](https://timm.fast.ai/) (timm) is a wonderful library by Ross Wightman which provides state-of-the-art pre-trained computer vision models. It's like Hugging Face Transformers, but for computer vision instead of NLP (and it's not restricted to transformer-based models)!

Ross has been kind enough to help me understand how to best take advantage of this library by identifying the top models. I'm going to share here some of what I've learned from him, plus some additional ideas.

## The data

Ross regularly benchmarks new models as they are added to timm, and puts the results in a CSV in the project's GitHub repo. To analyse the data, we'll first clone the repo:

In [15]:
! git clone --depth 1 https://github.com/huggingface/pytorch-image-models.git
%cd pytorch-image-models/results

Cloning into 'pytorch-image-models'...
remote: Enumerating objects: 566, done.
remote: Counting objects: 100% (566/566), done.
remote: Compressing objects: 100% (406/406), done.
remote: Total 566 (delta 222), reused 356 (delta 154), pack-reused 0
Receiving objects: 100% (566/566), 2.48 MiB | 15.51 MiB/s, done.
Resolving deltas: 100% (222/222), done.
/kaggle/working/pytorch-image-models/results/pytorch-image-models/results


Using Pandas, we can read the two CSV files we need and merge them together. We'll start with the accuracy results:

In [16]:
import pandas as pd
df_results = pd.read_csv('results-imagenet.csv')

In [17]:
df_results

Unnamed: 0,model,top1,top1_err,top5,top5_err,param_count,img_size,crop_pct,interpolation
0,eva02_large_patch14_448.mim_m38m_ft_in22k_in1k,90.052,9.948,99.048,0.952,305.08,448,1.000,bicubic
1,eva02_large_patch14_448.mim_in22k_ft_in22k_in1k,89.966,10.034,99.012,0.988,305.08,448,1.000,bicubic
2,eva_giant_patch14_560.m30m_ft_in22k_in1k,89.786,10.214,98.992,1.008,1014.45,560,1.000,bicubic
3,eva02_large_patch14_448.mim_in22k_ft_in1k,89.624,10.376,98.950,1.050,305.08,448,1.000,bicubic
4,eva02_large_patch14_448.mim_m38m_ft_in1k,89.570,10.430,98.922,1.078,305.08,448,1.000,bicubic
...,...,...,...,...,...,...,...,...,...
997,dla46_c.in1k,64.872,35.128,86.298,13.702,1.30,224,0.875,bilinear
998,lcnet_050.ra2_in1k,63.120,36.880,84.386,15.614,1.88,224,0.875,bicubic
999,tf_mobilenetv3_small_minimal_100.in1k,62.890,37.110,84.240,15.760,2.04,224,0.875,bilinear
1000,tinynet_e.in1k,59.866,40.134,81.764,18.236,2.04,106,0.875,bicubic


In [18]:
df_results['model_org'] = df_results['model'] 
df_results['model'] = df_results['model'].str.split('.').str[0]

In [19]:
df_results

Unnamed: 0,model,top1,top1_err,top5,top5_err,param_count,img_size,crop_pct,interpolation,model_org
0,eva02_large_patch14_448,90.052,9.948,99.048,0.952,305.08,448,1.000,bicubic,eva02_large_patch14_448.mim_m38m_ft_in22k_in1k
1,eva02_large_patch14_448,89.966,10.034,99.012,0.988,305.08,448,1.000,bicubic,eva02_large_patch14_448.mim_in22k_ft_in22k_in1k
2,eva_giant_patch14_560,89.786,10.214,98.992,1.008,1014.45,560,1.000,bicubic,eva_giant_patch14_560.m30m_ft_in22k_in1k
3,eva02_large_patch14_448,89.624,10.376,98.950,1.050,305.08,448,1.000,bicubic,eva02_large_patch14_448.mim_in22k_ft_in1k
4,eva02_large_patch14_448,89.570,10.430,98.922,1.078,305.08,448,1.000,bicubic,eva02_large_patch14_448.mim_m38m_ft_in1k
...,...,...,...,...,...,...,...,...,...,...
997,dla46_c,64.872,35.128,86.298,13.702,1.30,224,0.875,bilinear,dla46_c.in1k
998,lcnet_050,63.120,36.880,84.386,15.614,1.88,224,0.875,bicubic,lcnet_050.ra2_in1k
999,tf_mobilenetv3_small_minimal_100,62.890,37.110,84.240,15.760,2.04,224,0.875,bilinear,tf_mobilenetv3_small_minimal_100.in1k
1000,tinynet_e,59.866,40.134,81.764,18.236,2.04,106,0.875,bicubic,tinynet_e.in1k
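
To make the split above concrete, here's a minimal sketch using a few names copied from the table. The results file names each entry as architecture plus a pretrained tag, separated by a `.`, whereas the benchmark files (as we'll see shortly) use the bare architecture name. The `names` series is just an illustrative sample, not part of the analysis.

```python
import pandas as pd

# A few entries copied from the results table above (illustrative sample only)
names = pd.Series([
    'eva02_large_patch14_448.mim_m38m_ft_in22k_in1k',
    'eva02_large_patch14_448.mim_in22k_ft_in22k_in1k',
    'dla46_c.in1k',
    'tinynet_e.in1k',
])

# Keep only the part before the first '.', i.e. the architecture name
arch = names.str.split('.').str[0]
print(arch.tolist())

# Several pretrained weights can share one architecture name,
# so the stripped name is no longer unique
print(arch.duplicated().any())
```

Because the stripped name isn't unique, a later merge on `model` can produce several rows for one architecture, one per set of pretrained weights. That's why the original name is kept in `model_org`.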


We'll also add a "family" column that will allow us to group architectures into categories with similar characteristics.

Ross has told me which models he's found the most usable in practice, so I'll limit the charts to just look at these. (I also include VGG, not because it's good, but as a comparison to show how far things have come in the last few years.)

In [20]:
df_infer = pd.read_csv('benchmark-infer-amp-nhwc-pt111-cu113-rtx3090.csv')
df_infer

Unnamed: 0,model,infer_samples_per_sec,infer_step_time,infer_batch_size,infer_img_size,param_count
0,tinynet_e,68298.73,14.982,1024,106,2.04
1,mobilenetv3_small_050,48773.32,20.985,1024,224,1.59
2,lcnet_035,47045.94,21.755,1024,224,1.64
3,lcnet_050,41541.83,24.639,1024,224,1.88
4,mobilenetv3_small_075,37803.23,27.076,1024,224,2.04
...,...,...,...,...,...,...
737,nfnet_f7,20.63,9307.125,192,608,499.50
738,resnetv2_152x4_bitm,18.31,3496.111,64,480,936.53
739,swin_v2_cr_giant_384,13.76,2325.078,32,384,2598.76
740,cait_m48_448,13.51,9473.095,128,448,356.46


In [21]:
def get_data(part, col):
    # Join the benchmark for `part` ('infer' or 'train') with the accuracy results, by model name
    df = pd.read_csv(f'benchmark-{part}-amp-nhwc-pt111-cu113-rtx3090.csv').merge(df_results, on='model')
    df['secs'] = 1. / df[col]  # seconds per sample
    # Family: the leading letters of the name (optionally ending in 'v2'), up to the first digit or '_'
    df['family'] = df.model.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')
    df = df[~df.model.str.endswith('gn')]
    in22 = df.model.str.contains('in22')
    df.loc[in22, 'family'] = df.loc[in22, 'family'] + '_in22'
    resnet_d = df.model.str.contains('resnet.*d')
    df.loc[resnet_d, 'family'] = df.loc[resnet_d, 'family'] + 'd'
    return df[df.family.str.contains('^re[sg]netd?|beit|convnext|levit|efficient|vit|vgg|swin')]
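
The family regex is a little dense, so here's a rough standalone sketch of the same rules using Python's `re` module instead of Pandas. The helper name `family_of` is my own invention for illustration, and the sample names are a mix of rows from the tables here and a few other timm architecture names.

```python
import re

# Same pattern as in get_data: lazily take leading letters (optionally
# followed by 'v2'), stopping at the first digit, underscore, or end of name
FAMILY_RE = re.compile(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')

def family_of(model):
    "Hypothetical helper mirroring the family rules in `get_data`."
    m = FAMILY_RE.match(model)
    if m is None:
        return None
    family = m.group(1)
    if 'in22' in model:
        family += '_in22'
    if re.search(r'resnet.*d', model):
        family += 'd'
    return family

for name in ['levit_128s', 'resnetrs420', 'resnet50d', 'efficientnetv2_rw_s', 'vgg16']:
    print(name, '->', family_of(name))
```

Note that the `resnet.*d` rule fires for any name with a `d` somewhere after `resnet`, which is a fairly loose way of spotting the "d" variants.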

In [22]:
df = get_data('infer', 'infer_samples_per_sec')

In [23]:
df

Unnamed: 0,model,infer_samples_per_sec,infer_step_time,infer_batch_size,infer_img_size,param_count_x,top1,top1_err,top5,top5_err,param_count_y,img_size,crop_pct,interpolation,model_org,secs,family
12,levit_128s,21485.80,47.648,1024,224,7.78,76.526,23.474,92.872,7.128,7.78,224,0.900,bicubic,levit_128s.fb_dist_in1k,0.000047,levit
13,regnetx_002,17821.98,57.446,1024,224,2.68,68.746,31.254,88.536,11.464,2.68,224,0.875,bicubic,regnetx_002.pycls_in1k,0.000056,regnetx
15,regnety_002,16673.08,61.405,1024,224,3.16,70.278,29.722,89.528,10.472,3.16,224,0.875,bicubic,regnety_002.pycls_in1k,0.000060,regnety
17,levit_128,14657.83,69.849,1024,224,9.21,78.490,21.510,94.012,5.988,9.21,224,0.900,bicubic,levit_128.fb_dist_in1k,0.000068,levit
18,regnetx_004,14440.03,70.903,1024,224,5.16,72.398,27.602,90.828,9.172,5.16,224,0.875,bicubic,regnetx_004.pycls_in1k,0.000069,regnetx
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
625,resnetrs420,134.22,3814.641,512,416,191.89,85.006,14.994,97.124,2.876,191.89,416,1.000,bicubic,resnetrs420.tf_in1k,0.007450,resnetrs
626,swin_large_patch4_window12_384,125.78,1017.629,128,384,196.74,87.132,12.868,98.234,1.766,196.74,384,1.000,bicubic,swin_large_patch4_window12_384.ms_in22k_ft_in1k,0.007950,swin
635,vit_large_patch16_384,94.39,2712.048,256,384,304.72,87.084,12.916,98.302,1.698,304.72,384,1.000,bicubic,vit_large_patch16_384.augreg_in21k_ft_in1k,0.010594,vit
637,beit_large_patch16_384,82.25,3112.330,256,384,305.00,88.402,11.598,98.608,1.392,305.00,384,1.000,bicubic,beit_large_patch16_384.in22k_ft_in22k_in1k,0.012158,beit
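
You might wonder where the `param_count_x` and `param_count_y` columns come from. Both CSV files have a `param_count` column, so when we merge on `model`, Pandas keeps both and adds suffixes. Here's a small sketch with made-up frames (the values are copied from the tables above, except the `only_in_bench` row, which is hypothetical):

```python
import pandas as pd

# Tiny stand-ins for the benchmark and results frames
bench = pd.DataFrame({
    'model': ['levit_128s', 'tinynet_e', 'only_in_bench'],  # last row is hypothetical
    'infer_samples_per_sec': [21485.80, 68298.73, 100.0],
    'param_count': [7.78, 2.04, 1.0],
})
results = pd.DataFrame({
    'model': ['levit_128s', 'tinynet_e'],
    'top1': [76.526, 59.866],
    'param_count': [7.78, 2.04],
})

# The default merge is an inner join, so models missing from either side are dropped
merged = bench.merge(results, on='model')

# Throughput (samples per second) becomes seconds per sample
merged['secs'] = 1. / merged['infer_samples_per_sec']

print(merged.columns.tolist())
print(len(merged))
```

The `_x` column comes from the left-hand (benchmark) frame and the `_y` column from the right-hand (results) frame.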


## Inference results

Here are the results for inference performance (see the last section for training performance). In this chart:

- the x axis shows how many seconds it takes to process one image (**note**: it's a log scale)
- the y axis is the top-1 accuracy on ImageNet
- the size of each bubble reflects the size of the images used in testing (the marker size is set from the square of the image side length)
- the color shows what "family" the architecture is from.

Hover your mouse over a marker to see details about the model. Double-click in the legend to display just one family. Single-click in the legend to show or hide a family.

**Note**: on my screen, Kaggle cuts off the family selector and some plotly functionality -- to see the whole thing, collapse the table of contents on the right by clicking the little arrow to the right of "*Contents*".

In [24]:
import plotly.express as px
w,h = 1000,800

def show_all(df, title, size):
    return px.scatter(df, width=w, height=h, size=df[size]**2, title=title,
        x='secs',  y='top1', log_x=True, color='family', hover_name='model_org', hover_data=[size])

In [25]:
show_all(df, 'Inference', 'infer_img_size')

That number of families can be a bit overwhelming, so I'll just pick a subset which represents a single key model from each of the families that are looking best in our plot. I've also separated convnext models into those which have been pretrained on the larger 22,000-category ImageNet sample (`convnext_in22`) vs those that haven't (`convnext`). (Note that many of the best performing models were trained on the larger sample -- see the papers for details before coming to conclusions about the effectiveness of these architectures more generally.)

In [26]:
subs = 'levit|resnetd?|regnetx|vgg|convnext.*|efficientnetv2|beit|swin'

In this chart, I'll add lines through the points of each family, to help see how they compare -- but note that we can see that a linear fit isn't actually ideal here! It's just there to help visually see the groups.

In [27]:
def show_subs(df, title, size):
    df_subs = df[df.family.str.fullmatch(subs)]
    return px.scatter(df_subs, width=w, height=h, size=df_subs[size]**2, title=title,
        trendline="ols", trendline_options={'log_x':True},
        x='secs',  y='top1', log_x=True, color='family', hover_name='model_org', hover_data=[size])
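
The use of `fullmatch` rather than `contains` matters here. As a quick sketch with plain `re` (the `families` list is just an illustrative sample of names the family regex can produce), a substring search would pull in families like `resnetrs` that merely start with `resnet`:

```python
import re

subs = 'levit|resnetd?|regnetx|vgg|convnext.*|efficientnetv2|beit|swin'

# Illustrative sample of family names
families = ['resnet', 'resnetd', 'resnetrs', 'regnety', 'convnext_in22', 'efficientnet']

# fullmatch: the whole family name has to match one of the alternatives
kept_full = [f for f in families if re.fullmatch(subs, f)]

# search: a match anywhere in the name is enough
kept_search = [f for f in families if re.search(subs, f)]

print(kept_full)    # ['resnet', 'resnetd', 'convnext_in22']
print(kept_search)  # ['resnet', 'resnetd', 'resnetrs', 'convnext_in22']
```

The trailing `.*` on `convnext` is what lets `convnext_in22` through under `fullmatch`.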

In [28]:
show_subs(df, 'Inference', 'infer_img_size')
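
For the curious, the trend lines are ordinary least squares fits of accuracy against the log of `secs`, which is why they look straight on the log-scaled x axis. Here's a rough sketch of the same idea with NumPy, using the three levit rows shown in the training table later in this notebook. I'm assuming a base-10 log, and this is only meant to show the idea, not to reproduce Plotly's numbers.

```python
import numpy as np

# Three levit rows from the training table: train_samples_per_sec and top1
samples_per_sec = np.array([6303.14, 4434.56, 3823.94])
top1 = np.array([76.526, 78.490, 79.838])

# Fit a straight line to top1 as a function of log10(secs).
# Base 10 is an assumption; another base would only rescale the slope.
log_secs = np.log10(1. / samples_per_sec)
slope, intercept = np.polyfit(log_secs, top1, 1)
fitted = slope * log_secs + intercept

print(f'top1 ~ {slope:.2f} * log10(secs) + {intercept:.2f}')
print(np.abs(fitted - top1).max())
```

A positive slope here just says that, within this family, the slower models are the more accurate ones.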

From this, we can see that the *levit* family models are extremely fast for image recognition, and clearly the most accurate amongst the faster models. That's not surprising, since these models are a hybrid of the best ideas from CNNs and transformers, so they get the benefit of each. In fact, we see a similar thing even in the middle category of speeds -- the best is ConvNeXt, which is a pure CNN, but which takes advantage of ideas from the transformers literature.

For the slowest models, *beit* is the most accurate -- although we need to be a bit careful about interpreting this, since it's trained on a larger dataset (ImageNet-21k, which is also used for *vit* models).

I'll add one other plot here, which is of speed vs parameter count. Often, parameter count is used in papers as a proxy for speed. However, as we see, there is a wide variation in speeds at each level of parameter count, so it's really not a useful proxy.

(Parameter count may be useful for identifying how much memory a model needs, but even for that it's not always a great proxy.)

In [29]:
px.scatter(df, width=w, height=h,
    x='param_count_x',  y='secs', log_x=True, log_y=True, color='infer_img_size',
    hover_name='model_org', hover_data=['infer_samples_per_sec', 'family']
)
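
We can see the same thing in miniature with just the five fastest rows from the inference table above. If parameter count were a reliable proxy for speed, sorting by parameter count would also sort by time per image. This is only a sketch on a handful of rows, so it illustrates the point rather than proving it; the scatter plot over all the models is the real evidence.

```python
import pandas as pd

# Five rows copied from the merged inference table above
rows = pd.DataFrame({
    'model': ['levit_128s', 'regnetx_002', 'regnety_002', 'levit_128', 'regnetx_004'],
    'infer_samples_per_sec': [21485.80, 17821.98, 16673.08, 14657.83, 14440.03],
    'param_count': [7.78, 2.68, 3.16, 9.21, 5.16],
})
rows['secs'] = 1. / rows['infer_samples_per_sec']

# Order by parameter count and check whether time per image rises with it
by_params = rows.sort_values('param_count')
print(by_params[['model', 'param_count', 'secs']])
print(by_params['secs'].is_monotonic_increasing)  # False
```

Here `levit_128s` has nearly three times the parameters of `regnetx_002`, yet it's the faster of the two. On the full `df`, a similar check could be done per bin of `param_count_x`, for instance using `pd.qcut`.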

## Training results

We'll now replicate the above analysis for training performance. First we grab the data:

In [33]:
tdf = get_data('train', 'train_samples_per_sec')
tdf

Unnamed: 0,model,train_samples_per_sec,train_step_time,train_batch_size,train_img_size,param_count_x,top1,top1_err,top5,top5_err,param_count_y,img_size,crop_pct,interpolation,model_org,secs,family
9,levit_128s,6303.14,80.293,512,224,7.78,76.526,23.474,92.872,7.128,7.78,224,0.9,bicubic,levit_128s.fb_dist_in1k,0.000159,levit
13,levit_128,4434.56,114.332,512,224,9.21,78.490,21.510,94.012,5.988,9.21,224,0.9,bicubic,levit_128.fb_dist_in1k,0.000226,levit
14,vit_small_patch32_224,4334.06,117.284,512,224,22.88,75.994,24.006,93.270,6.730,22.88,224,0.9,bicubic,vit_small_patch32_224.augreg_in21k_ft_in1k,0.000231,vit
17,vit_tiny_r_s16_p8_224,3857.49,131.880,512,224,6.34,71.798,28.202,90.824,9.176,6.34,224,0.9,bicubic,vit_tiny_r_s16_p8_224.augreg_in21k_ft_in1k,0.000259,vit
18,levit_192,3823.94,132.765,512,224,10.95,79.838,20.162,94.778,5.222,10.95,224,0.9,bicubic,levit_192.fb_dist_in1k,0.000262,levit
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
620,swin_large_patch4_window12_384,41.02,388.265,16,384,196.74,87.132,12.868,98.234,1.766,196.74,384,1.0,bicubic,swin_large_patch4_window12_384.ms_in22k_ft_in1k,0.024378,swin
628,resnetrs420,32.20,487.619,16,416,191.89,85.006,14.994,97.124,2.876,191.89,416,1.0,bicubic,resnetrs420.tf_in1k,0.031056,resnetrs
631,vit_large_patch16_384,28.85,414.372,12,384,304.72,87.084,12.916,98.302,1.698,304.72,384,1.0,bicubic,vit_large_patch16_384.augreg_in21k_ft_in1k,0.034662,vit
632,beit_large_patch16_384,25.11,475.839,12,384,305.00,88.402,11.598,98.608,1.392,305.00,384,1.0,bicubic,beit_large_patch16_384.in22k_ft_in22k_in1k,0.039825,beit


Now we can repeat the same *family* plot we did above:

In [31]:
show_all(tdf, 'Training', 'train_img_size')

...and we'll also look at our chosen subset of models:

In [32]:
show_subs(tdf, 'Training', 'train_img_size')

Finally, we should remember that speed depends on hardware. If you're using something other than a modern NVIDIA GPU, your results may be different. In particular, I suspect that transformer-based models might have worse performance in general on CPUs (although I need to study this more to be sure).
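
If you want a rough idea of how things look on your own hardware, here's a minimal timing sketch. It is not how the benchmark CSVs above were produced; `samples_per_sec` and `dummy_step` are hypothetical names of my own, and `dummy_step` is just a stand-in for one forward pass (or one training step) on a batch.

```python
import time

def samples_per_sec(step, batch_size, warmup=3, iters=10):
    "Hypothetical helper: time `iters` calls of `step`, after `warmup` untimed calls."
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

def dummy_step():
    # Stand-in workload; replace with a function that runs your model on one batch
    sum(i * i for i in range(10_000))

rate = samples_per_sec(dummy_step, batch_size=32)
print(f'{rate:.0f} samples/sec, {1. / rate:.2e} secs per sample')
```

On a GPU the work is launched asynchronously, so `step` would also need to wait for it to finish (for instance by calling `torch.cuda.synchronize()` in PyTorch) before the timer is read.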