[pytorch-image-models](https://github.com/huggingface/pytorch-image-models) <br>
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

## The data

In [28]:
! git submodule add https://github.com/huggingface/pytorch-image-models submodules/timm
# ! git submodule update --init --recursive
# or clone 
# ! git clone https://github.com/huggingface/pytorch-image-models

Cloning into '/home/gnart/gblabs/ml-course/notebooks/fastai/submodules/timm'...
remote: Enumerating objects: 19552, done.
remote: Counting objects: 100% (640/640), done.
remote: Compressing objects: 100% (279/279), done.
remote: Total 19552 (delta 455), reused 361 (delta 361), pack-reused 18912 (from 3)
Receiving objects: 100% (19552/19552), 27.22 MiB | 9.98 MiB/s, done.
Resolving deltas: 100% (14392/14392), done.


### ImageNet Validation - [`results-imagenet.csv`](results-imagenet.csv)

The standard 50,000 image ImageNet-1k validation set. Model selection during training utilizes this validation set, so it is not a true test set.

In [29]:
from pathlib import Path

root_path = Path.cwd().parent.parent

In [30]:
import pandas as pd

df_results = pd.read_csv(root_path / "submodules/timm/results/results-imagenet.csv")

In [31]:
df_results

Unnamed: 0,model,img_size,top1,top1_err,top5,top5_err,param_count,crop_pct,interpolation
0,eva02_large_patch14_448.mim_m38m_ft_in22k_in1k,448,90.054,9.946,99.056,0.944,305.08,1.000,bicubic
1,eva02_large_patch14_448.mim_in22k_ft_in22k_in1k,448,89.966,10.034,99.016,0.984,305.08,1.000,bicubic
2,eva_giant_patch14_560.m30m_ft_in22k_in1k,560,89.796,10.204,98.990,1.010,1014.45,1.000,bicubic
3,eva02_large_patch14_448.mim_in22k_ft_in1k,448,89.632,10.368,98.954,1.046,305.08,1.000,bicubic
4,eva_giant_patch14_336.m30m_ft_in22k_in1k,336,89.570,10.430,98.954,1.046,1013.01,1.000,bicubic
...,...,...,...,...,...,...,...,...,...
1440,tinynet_e.in1k,106,59.874,40.126,81.770,18.230,2.04,0.875,bicubic
1441,mobilenetv3_small_050.lamb_in1k,224,57.924,42.076,80.142,19.858,1.59,0.875,bicubic
1442,test_efficientnet.r160_in1k,160,46.424,53.576,70.958,29.042,0.36,0.875,bicubic
1443,test_byobnet.r160_in1k,160,45.400,54.600,70.610,29.390,0.46,0.875,bicubic


In [32]:
df_results['model_org'] = df_results['model'] 
df_results['model'] = df_results['model'].str.split('.').str[0]
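
timm model names follow an `architecture.pretrained_tag` pattern, while the benchmark tables merged in later list only the bare architecture name (for example `levit_128s`). Here is a minimal sketch of what the cell above does to a single name, written in plain Python rather than pandas; the helper name `split_model_name` is made up for this illustration.

```python
def split_model_name(full_name):
    """Split 'arch.tag' into (arch, tag); tag is '' when there is no dot."""
    arch, _, tag = full_name.partition('.')
    return arch, tag

arch, tag = split_model_name('levit_128s.fb_dist_in1k')
print(arch)  # levit_128s
print(tag)   # fb_dist_in1k
```

Keeping the full name in `model_org` means the hover labels in the charts can still show which pretrained weights each point refers to.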

We'll also add a "family" column that will allow us to group architectures into categories with similar characteristics:

Ross has told me which models he's found the most usable in practice, so I'll limit the charts to just look at these. (I also include VGG, not because it's good, but as a comparison to show how far things have come in the last few years.)

In [33]:
def get_data(part, col):
    # Benchmark CSVs from timm's results/ folder (RTX 3090, AMP, NHWC)
    if part == 'infer':
        fname = f'benchmark-{part}-amp-nhwc-pt113-cu117-rtx3090.csv'
    elif part == 'train':
        fname = f'benchmark-{part}-amp-nhwc-pt112-cu113-rtx3090.csv'
    else:
        raise ValueError(f"part must be 'infer' or 'train', got {part!r}")
    df = pd.read_csv(root_path / 'submodules/timm/results' / fname).merge(df_results, on='model')
    df['secs'] = 1. / df[col]  # seconds per sample
    # Raw string so that \d reaches the regex engine unchanged
    df['family'] = df.model.str.extract(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')
    df = df[~df.model.str.endswith('gn')]
    df.loc[df.model.str.contains('in22'),'family'] = df.loc[df.model.str.contains('in22'),'family'] + '_in22'
    df.loc[df.model.str.contains('resnet.*d'),'family'] = df.loc[df.model.str.contains('resnet.*d'),'family'] + 'd'
    return df[df.family.str.contains('^re[sg]netd?|beit|convnext|levit|efficient|vit|vgg|swin')]

In [43]:
df = get_data('infer', 'infer_samples_per_sec')
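
To make the family labelling in `get_data` easier to follow, here is a rough single-name version using only the standard-library `re` module. The helper `family_of` and the sample names are chosen for illustration; it is a sketch of the same three steps (regex, `_in22` suffix, `d` suffix), not the code the notebook actually runs.

```python
import re

FAMILY_RE = re.compile(r'^([a-z]+?(?:v2)?)(?:\d|_|$)')

def family_of(model):
    """Mimic the family labelling in get_data for one model name."""
    m = FAMILY_RE.search(model)
    if m is None:
        return None                      # no family could be extracted
    family = m.group(1)
    if 'in22' in model:                  # pretrained on the larger ImageNet sample
        family += '_in22'
    if re.search(r'resnet.*d', model):   # resnet "d" variants get their own family
        family += 'd'
    return family

for name in ['levit_128s', 'resnet50d', 'efficientnetv2_rw_s',
             'vgg16', 'mobilenetv3_small_050']:
    print(name, '->', family_of(name))
```

The leading letters are matched lazily and stop at the first digit, underscore, or end of string, with an optional `v2` kept as part of the family. That is why `efficientnetv2_rw_s` becomes `efficientnetv2`, while `mobilenetv3_small_050` comes out as the slightly odd `mobilenetv`.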

## Inference results

Here are the results for inference performance (see the last section for training performance). In this chart:

- the x axis shows how many seconds it takes to process one image (**note**: it's a log scale)
- the y axis is the top-1 accuracy on ImageNet
- the size of each bubble reflects the size of the images used in testing (the code plots the squared image side length)
- the color shows what "family" the architecture is from.

Hover your mouse over a marker to see details about the model. Double-click in the legend to display just one family. Single-click in the legend to show or hide a family.

**Note**: on my screen, Kaggle cuts off the family selector and some plotly functionality -- to see the whole thing, collapse the table of contents on the right by clicking the little arrow to the right of "*Contents*".
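
The `secs` value on the x axis is just the reciprocal of the benchmark throughput. As a quick sanity check, the sketch below recomputes it for `levit_128s` using the numbers from the LeViT table later in this notebook. Treating `infer_step_time` as milliseconds per batch is my assumption, based on the two routes giving almost the same answer.

```python
# Values copied from the levit_128s row of the inference benchmark table.
samples_per_sec = 22675.73
step_time = 45.148      # assumed to be milliseconds per batch
batch_size = 1024

secs = 1.0 / samples_per_sec                     # what get_data stores in 'secs'
secs_from_step = step_time / 1000 / batch_size   # same quantity via step time

print(f'{secs:.3g}')            # 4.41e-05
print(f'{secs_from_step:.3g}')  # 4.41e-05
```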

In [35]:
import plotly.express as px
w,h = 1000,800

def show_all(df, title, size):
    return px.scatter(df, width=w, height=h, size=df[size]**2, title=title,
        x='secs',  y='top1', log_x=True, color='family', hover_name='model_org', hover_data=[size])

In [36]:
show_all(df, 'Inference', 'infer_img_size')

That number of families can be a bit overwhelming, so I'll just pick a subset which represents a single key model from each of the families that are looking best in our plot. I've also separated convnext models into those which have been pretrained on the larger 22,000-category ImageNet sample (`convnext_in22`) vs those that haven't (`convnext`). (Note that many of the best performing models were trained on the larger sample -- see the papers for details before coming to conclusions about the effectiveness of these architectures more generally.)

In [37]:
subs = 'levit|resnetd?|regnetx|vgg|convnext.*|efficientnetv2|beit|swin'

In this chart, I'll add lines through the points of each family, to help see how they compare -- but note that a linear fit isn't actually ideal here! It's just there to help visually distinguish the groups.

In [38]:
def show_subs(df, title, size):
    df_subs = df[df.family.str.fullmatch(subs)]
    return px.scatter(df_subs, width=w, height=h, size=df_subs[size]**2, title=title,
        trendline="ols", trendline_options={'log_x':True},
        x='secs',  y='top1', log_x=True, color='family', hover_name='model_org', hover_data=[size])
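
As far as I understand it, `trendline="ols"` with `log_x` fits an ordinary least-squares line of `top1` against the base-10 log of `secs`; treat the exact transform as an assumption. Here is a small standard-library sketch of that kind of fit, using the five LeViT rows from the inference table in this notebook (the helper `fit_line` is written just for this example):

```python
import math

# (infer_samples_per_sec, top1) for the five LeViT models in the inference table
levit = [(22675.73, 76.522), (15337.67, 78.486), (13524.14, 79.85),
         (9858.1, 81.506), (5880.4, 82.598)]

xs = [math.log10(1.0 / sps) for sps, _ in levit]   # log10(secs)
ys = [top1 for _, top1 in levit]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(xs, ys)
print(f'top1 ~ {intercept:.1f} + {slope:.1f} * log10(secs)')
```

Under this straight-line summary, the slope reads as the number of top-1 points gained per tenfold increase in time per image within the family.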

In [39]:
show_subs(df, 'Inference', 'infer_img_size')

From this, we can see that the *levit* family models are extremely fast for image recognition, and clearly the most accurate amongst the faster models. That's not surprising, since these models are a hybrid of the best ideas from CNNs and transformers, so they get the benefit of each. In fact, we see a similar thing even in the middle category of speeds -- the best is ConvNeXt, which is a pure CNN, but which takes advantage of ideas from the transformers literature.

For the slowest models, *beit* is the most accurate -- although we need to be a bit careful about interpreting this, since it's trained on a larger dataset (ImageNet-21k, which is also used for *vit* models).

I'll add one other plot here, which is of speed vs parameter count. Often, parameter count is used in papers as a proxy for speed. However, as we see, there is a wide variation in speeds at each level of parameter count, so it's really not a useful proxy.

(Parameter count may be useful for identifying how much memory a model needs, but even for that it's not always a great proxy.)

In [25]:
px.scatter(df, width=w, height=h,
    x='param_count_x',  y='secs', log_x=True, log_y=True, color='infer_img_size',
    hover_name='model_org', hover_data=['infer_samples_per_sec', 'family']
)

## Training results

We'll now replicate the above analysis for training performance. First we grab the data:

In [40]:
tdf = get_data('train', 'train_samples_per_sec')

Now we can repeat the same *family* plot we did above:

In [41]:
show_all(tdf, 'Training', 'train_img_size')

...and we'll also look at our chosen subset of models:

In [42]:
show_subs(tdf, 'Training', 'train_img_size')

Finally, we should remember that speed depends on hardware. If you're using something other than a modern NVIDIA GPU, your results may be different. In particular, I suspect that transformer-based models might have worse performance in general on CPUs (although I need to study this more to be sure).

## LeViT

In [54]:
idf_levit = df[df['family'] == 'levit']
tdf_levit = tdf[tdf['family'] == 'levit']

display(idf_levit)
display(tdf_levit)


Unnamed: 0,model,infer_samples_per_sec,infer_step_time,infer_batch_size,infer_img_size,infer_gmacs,infer_macts,param_count_x,img_size,top1,top1_err,top5,top5_err,param_count_y,crop_pct,interpolation,model_org,secs,family
13,levit_128s,22675.73,45.148,1024,224,0.31,1.88,7.78,224,76.522,23.478,92.874,7.126,7.78,0.9,bicubic,levit_128s.fb_dist_in1k,4.4e-05,levit
17,levit_128,15337.67,66.754,1024,224,0.41,2.71,9.21,224,78.486,21.514,93.998,6.002,9.21,0.9,bicubic,levit_128.fb_dist_in1k,6.5e-05,levit
24,levit_192,13524.14,75.706,1024,224,0.66,3.2,10.95,224,79.85,20.15,94.804,5.196,10.95,0.9,bicubic,levit_192.fb_dist_in1k,7.4e-05,levit
55,levit_256,9858.1,103.863,1024,224,1.13,4.23,18.89,224,81.506,18.494,95.466,4.534,18.89,0.9,bicubic,levit_256.fb_dist_in1k,0.000101,levit
120,levit_384,5880.4,174.126,1024,224,2.36,6.26,39.13,224,82.598,17.402,96.016,3.984,39.13,0.9,bicubic,levit_384.fb_dist_in1k,0.00017,levit


Unnamed: 0,model,train_samples_per_sec,train_step_time,train_batch_size,train_img_size,param_count_x,img_size,top1,top1_err,top5,top5_err,param_count_y,crop_pct,interpolation,model_org,secs,family
10,levit_128s,6539.16,77.346,512,224,7.78,224,76.522,23.478,92.874,7.126,7.78,0.9,bicubic,levit_128s.fb_dist_in1k,0.000153,levit
15,levit_128,4558.28,111.213,512,224,9.21,224,78.486,21.514,93.998,6.002,9.21,0.9,bicubic,levit_128.fb_dist_in1k,0.000219,levit
23,levit_192,3727.29,136.213,512,224,10.95,224,79.85,20.15,94.804,5.196,10.95,0.9,bicubic,levit_192.fb_dist_in1k,0.000268,levit
42,levit_256,2956.43,172.043,512,224,18.89,224,81.506,18.494,95.466,4.534,18.89,0.9,bicubic,levit_256.fb_dist_in1k,0.000338,levit
110,levit_384,1801.0,283.153,512,224,39.13,224,82.598,17.402,96.016,3.984,39.13,0.9,bicubic,levit_384.fb_dist_in1k,0.000555,levit


In [None]:
from timm import create_model
model_name = 'levit_128s.fb_dist_in1k'
model = create_model(
    model_name,
    pretrained=False,
    num_classes=1000
)
model.eval()




