
Releases: huggingface/pytorch-image-models

Release v1.0.9


Aug 21, 2024

  • Updated SBB ViT models trained on ImageNet-12k and fine-tuned on ImageNet-1k, challenging quite a number of much larger, slower models
| model | top1 | top5 | param_count | img_size |
|---|---|---|---|---|
| vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 87.438 | 98.256 | 64.11 | 384 |
| vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 86.608 | 97.934 | 64.11 | 256 |
| vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 86.594 | 98.02 | 60.4 | 384 |
| vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 85.734 | 97.61 | 60.4 | 256 |
  • MobileNet-V1 1.25, EfficientNet-B1, & ResNet50-D weights w/ MNV4 baseline challenge recipe
| model | top1 | top5 | param_count | img_size |
|---|---|---|---|---|
| resnet50d.ra4_e3600_r224_in1k | 81.838 | 95.922 | 25.58 | 288 |
| efficientnet_b1.ra4_e3600_r240_in1k | 81.440 | 95.700 | 7.79 | 288 |
| resnet50d.ra4_e3600_r224_in1k | 80.952 | 95.384 | 25.58 | 224 |
| efficientnet_b1.ra4_e3600_r240_in1k | 80.406 | 95.152 | 7.79 | 240 |
| mobilenetv1_125.ra4_e3600_r224_in1k | 77.600 | 93.804 | 6.27 | 256 |
| mobilenetv1_125.ra4_e3600_r224_in1k | 76.924 | 93.234 | 6.27 | 224 |
  • Add SAM2 (HieraDet) backbone arch & weight loading support

  • Add Hiera Small weights trained w/ abswin pos embed on in12k & fine-tuned on 1k

| model | top1 | top5 | param_count |
|---|---|---|---|
| hiera_small_abswin_256.sbb2_e200_in12k_ft_in1k | 84.912 | 97.260 | 35.01 |
| hiera_small_abswin_256.sbb2_pd_e200_in12k_ft_in1k | 84.560 | 97.106 | 35.01 |
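
All of the checkpoints in the tables above load by name through the usual timm factory; a minimal sketch:

```python
import timm
import torch

# Load one of the SBB ViT checkpoints listed above and run a forward pass.
model = timm.create_model('vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k', pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 384, 384))  # 384x384, per the table's img_size
print(logits.shape)  # torch.Size([1, 1000])
```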

Aug 8, 2024

Release v1.0.8


July 28, 2024

  • Add mobilenet_edgetpu_v2_m weights w/ ra4 mnv4-small based recipe. 80.1% top-1 @ 224 and 80.7% @ 256.
  • Release 1.0.8

July 26, 2024

  • More MobileNet-v4 weights, ImageNet-12k pretrain w/ fine-tunes, and anti-aliased ConvLarge models
| model | top1 | top1_err | top5 | top5_err | param_count | img_size |
|---|---|---|---|---|---|---|
| mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k | 84.99 | 15.01 | 97.294 | 2.706 | 32.59 | 544 |
| mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k | 84.772 | 15.228 | 97.344 | 2.656 | 32.59 | 480 |
| mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k | 84.64 | 15.36 | 97.114 | 2.886 | 32.59 | 448 |
| mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k | 84.314 | 15.686 | 97.102 | 2.898 | 32.59 | 384 |
| mobilenetv4_conv_aa_large.e600_r384_in1k | 83.824 | 16.176 | 96.734 | 3.266 | 32.59 | 480 |
| mobilenetv4_conv_aa_large.e600_r384_in1k | 83.244 | 16.756 | 96.392 | 3.608 | 32.59 | 384 |
| mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k | 82.99 | 17.01 | 96.67 | 3.33 | 11.07 | 320 |
| mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k | 82.364 | 17.636 | 96.256 | 3.744 | 11.07 | 256 |
| model | top1 | top1_err | top5 | top5_err | param_count | img_size |
|---|---|---|---|---|---|---|
| efficientnet_b0.ra4_e3600_r224_in1k | 79.364 | 20.636 | 94.754 | 5.246 | 5.29 | 256 |
| efficientnet_b0.ra4_e3600_r224_in1k | 78.584 | 21.416 | 94.338 | 5.662 | 5.29 | 224 |
| mobilenetv1_100h.ra4_e3600_r224_in1k | 76.596 | 23.404 | 93.272 | 6.728 | 5.28 | 256 |
| mobilenetv1_100.ra4_e3600_r224_in1k | 76.094 | 23.906 | 93.004 | 6.996 | 4.23 | 256 |
| mobilenetv1_100h.ra4_e3600_r224_in1k | 75.662 | 24.338 | 92.504 | 7.496 | 5.28 | 224 |
| mobilenetv1_100.ra4_e3600_r224_in1k | 75.382 | 24.618 | 92.312 | 7.688 | 4.23 | 224 |
  • Prototype of set_input_size() added to vit and swin v1/v2 models to allow changing image size, patch size, and window size after model creation (see the sketch after this list).
  • Improved support in swin for different size handling; in addition to set_input_size, always_partition and strict_img_size args have been added to __init__ to allow more flexible input size constraints.
  • Fix out-of-order indices info for intermediate 'Getter' feature wrapper; check out-of-range indices for the same.
  • Add several tiny < 0.5M param models for testing that are actually trained on ImageNet-1k
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct |
|---|---|---|---|---|---|---|---|
| test_efficientnet.r160_in1k | 47.156 | 52.844 | 71.726 | 28.274 | 0.36 | 192 | 1.0 |
| test_byobnet.r160_in1k | 46.698 | 53.302 | 71.674 | 28.326 | 0.46 | 192 | 1.0 |
| test_efficientnet.r160_in1k | 46.426 | 53.574 | 70.928 | 29.072 | 0.36 | 160 | 0.875 |
| test_byobnet.r160_in1k | 45.378 | 54.622 | 70.572 | 29.428 | 0.46 | 160 | 0.875 |
| test_vit.r160_in1k | 42.0 | 58.0 | 68.664 | 31.336 | 0.37 | 192 | 1.0 |
| test_vit.r160_in1k | 40.822 | 59.178 | 67.212 | 32.788 | 0.37 | 160 | 0.875 |
  • Fix vit reg token init, thanks Promisery
  • Other misc fixes
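
A minimal sketch of the set_input_size() prototype mentioned above (ViT shown; Swin additionally takes window-related args). Since this is a prototype, treat the exact keyword arguments as an assumption:

```python
import timm
import torch

# Create at the pretrain size, then retarget the model to a new resolution.
model = timm.create_model('vit_base_patch16_224', pretrained=False)
model.set_input_size(img_size=384)  # assumed kwarg; re-interpolates pos embeds for the new grid

out = model(torch.randn(1, 3, 384, 384))
print(out.shape)  # torch.Size([1, 1000])
```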

June 24, 2024

  • 3 more MobileNetV4 hybrid weights with a different MQA weight init scheme
| model | top1 | top1_err | top5 | top5_err | param_count | img_size |
|---|---|---|---|---|---|---|
| mobilenetv4_hybrid_large.ix_e600_r384_in1k | 84.356 | 15.644 | 96.892 | 3.108 | 37.76 | 448 |
| mobilenetv4_hybrid_large.ix_e600_r384_in1k | 83.990 | 16.010 | 96.702 | 3.298 | 37.76 | 384 |
| mobilenetv4_hybrid_medium.ix_e550_r384_in1k | 83.394 | 16.606 | 96.760 | 3.240 | 11.07 | 448 |
| mobilenetv4_hybrid_medium.ix_e550_r384_in1k | 82.968 | 17.032 | 96.474 | 3.526 | 11.07 | 384 |
| mobilenetv4_hybrid_medium.ix_e550_r256_in1k | 82.492 | 17.508 | 96.278 | 3.722 | 11.07 | 320 |
| mobilenetv4_hybrid_medium.ix_e550_r256_in1k | 81.446 | 18.554 | 95.704 | 4.296 | 11.07 | 256 |
  • Florence-2 weight loading in DaViT model

Release v1.0.7


June 12, 2024

  • MobileNetV4 models and initial set of timm trained weights added:
| model | top1 | top1_err | top5 | top5_err | param_count | img_size |
|---|---|---|---|---|---|---|
| mobilenetv4_hybrid_large.e600_r384_in1k | 84.266 | 15.734 | 96.936 | 3.064 | 37.76 | 448 |
| mobilenetv4_hybrid_large.e600_r384_in1k | 83.800 | 16.200 | 96.770 | 3.230 | 37.76 | 384 |
| mobilenetv4_conv_large.e600_r384_in1k | 83.392 | 16.608 | 96.622 | 3.378 | 32.59 | 448 |
| mobilenetv4_conv_large.e600_r384_in1k | 82.952 | 17.048 | 96.266 | 3.734 | 32.59 | 384 |
| mobilenetv4_conv_large.e500_r256_in1k | 82.674 | 17.326 | 96.31 | 3.69 | 32.59 | 320 |
| mobilenetv4_conv_large.e500_r256_in1k | 81.862 | 18.138 | 95.69 | 4.31 | 32.59 | 256 |
| mobilenetv4_hybrid_medium.e500_r224_in1k | 81.276 | 18.724 | 95.742 | 4.258 | 11.07 | 256 |
| mobilenetv4_conv_medium.e500_r256_in1k | 80.858 | 19.142 | 95.768 | 4.232 | 9.72 | 320 |
| mobilenetv4_hybrid_medium.e500_r224_in1k | 80.442 | 19.558 | 95.38 | 4.62 | 11.07 | 224 |
| mobilenetv4_conv_blur_medium.e500_r224_in1k | 80.142 | 19.858 | 95.298 | 4.702 | 9.72 | 256 |
| mobilenetv4_conv_medium.e500_r256_in1k | 79.928 | 20.072 | 95.184 | 4.816 | 9.72 | 256 |
| mobilenetv4_conv_medium.e500_r224_in1k | 79.808 | 20.192 | 95.186 | 4.814 | 9.72 | 256 |
| mobilenetv4_conv_blur_medium.e500_r224_in1k | 79.438 | 20.562 | 94.932 | 5.068 | 9.72 | 224 |
| mobilenetv4_conv_medium.e500_r224_in1k | 79.094 | 20.906 | 94.77 | 5.23 | 9.72 | 224 |
| mobilenetv4_conv_small.e2400_r224_in1k | 74.616 | 25.384 | 92.072 | 7.928 | 3.77 | 256 |
| mobilenetv4_conv_small.e1200_r224_in1k | 74.292 | 25.708 | 92.116 | 7.884 | 3.77 | 256 |
| mobilenetv4_conv_small.e2400_r224_in1k | 73.756 | 26.244 | 91.422 | 8.578 | 3.77 | 224 |
| mobilenetv4_conv_small.e1200_r224_in1k | 73.454 | 26.546 | 91.34 | 8.66 | 3.77 | 224 |
  • Apple MobileCLIP (https://arxiv.org/pdf/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support).
  • ViTamin (https://arxiv.org/abs/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support).
  • OpenAI CLIP Modified ResNet image tower modelling & weight support (via ByobNet). Refactor AttentionPool2d.
  • Refactoring & improvements, especially related to reset_classifier and num_features vs head_hidden_size for forward_features() vs pre_logits output (see the sketch after this list)
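
A short sketch of the feature/head split named in the last bullet, under the assumption that num_features describes forward_features() output channels and head_hidden_size the pre-logits width:

```python
import timm
import torch

model = timm.create_model('mobilenetv4_conv_small.e2400_r224_in1k', pretrained=False)
x = torch.randn(1, 3, 224, 224)

feats = model.forward_features(x)  # unpooled trunk output, channels == model.num_features
pre_logits = model.forward_head(feats, pre_logits=True)  # pooled pre-logits, width == model.head_hidden_size

model.reset_classifier(num_classes=0)  # drop the classifier, keep the feature trunk
```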

Release v1.0.3


May 14, 2024

  • Support loading PaliGemma jax weights into SigLIP ViT models with average pooling.
  • Add Hiera models from Meta (https://github.com/facebookresearch/hiera).
  • Add normalize= flag for transforms; return non-normalized torch.Tensor with original dtype (for chug)
  • Version 1.0.3 release

May 11, 2024

  • 'Searching for Better ViT Baselines (For the GPU Poor)' weights and vit variants released, exploring model shapes between Tiny and Base.
| model | top1 | top5 | param_count | img_size |
|---|---|---|---|---|
| vit_mediumd_patch16_reg4_gap_256.sbb_in12k_ft_in1k | 86.202 | 97.874 | 64.11 | 256 |
| vit_betwixt_patch16_reg4_gap_256.sbb_in12k_ft_in1k | 85.418 | 97.48 | 60.4 | 256 |
| vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k | 84.322 | 96.812 | 63.95 | 256 |
| vit_betwixt_patch16_rope_reg4_gap_256.sbb_in1k | 83.906 | 96.684 | 60.23 | 256 |
| vit_base_patch16_rope_reg1_gap_256.sbb_in1k | 83.866 | 96.67 | 86.43 | 256 |
| vit_medium_patch16_rope_reg1_gap_256.sbb_in1k | 83.81 | 96.824 | 38.74 | 256 |
| vit_betwixt_patch16_reg4_gap_256.sbb_in1k | 83.706 | 96.616 | 60.4 | 256 |
| vit_betwixt_patch16_reg1_gap_256.sbb_in1k | 83.628 | 96.544 | 60.4 | 256 |
| vit_medium_patch16_reg4_gap_256.sbb_in1k | 83.47 | 96.622 | 38.88 | 256 |
| vit_medium_patch16_reg1_gap_256.sbb_in1k | 83.462 | 96.548 | 38.88 | 256 |
| vit_little_patch16_reg4_gap_256.sbb_in1k | 82.514 | 96.262 | 22.52 | 256 |
| vit_wee_patch16_reg1_gap_256.sbb_in1k | 80.256 | 95.360 | 13.42 | 256 |
| vit_pwee_patch16_reg1_gap_256.sbb_in1k | 80.072 | 95.136 | 15.25 | 256 |
| vit_mediumd_patch16_reg4_gap_256.sbb_in12k | N/A | N/A | 64.11 | 256 |
| vit_betwixt_patch16_reg4_gap_256.sbb_in12k | N/A | N/A | 60.4 | 256 |
  • AttentionExtract helper added to extract attention maps from timm models. See example in #1232 (comment); a minimal usage sketch also follows this list.
  • forward_intermediates() API refined and added to more models including some ConvNets that have other extraction methods.
  • 1017 of 1047 model architectures support features_only=True feature extraction. The remaining 34 architectures can be supported, prioritized by request.
  • Removed torch.jit.script-annotated functions, including old JIT activations. They conflict with dynamo, and dynamo does a much better job when used.
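
A hedged usage sketch for the AttentionExtract helper; the constructor arguments and return type shown here are assumptions based on the linked example, not a confirmed API:

```python
import timm
import torch
from timm.utils import AttentionExtract

model = timm.create_model('vit_base_patch16_224', pretrained=False).eval()
extractor = AttentionExtract(model, method='fx')  # assumed: 'fx' (graph-traced) extraction mode

with torch.no_grad():
    maps = extractor(torch.randn(1, 3, 224, 224))  # assumed: dict of layer name -> attention tensor
for name, attn in maps.items():
    print(name, attn.shape)
```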

April 11, 2024

  • Prepping for a long overdue 1.0 release, things have been stable for a while now.
  • Significant feature that's been missing for a while: features_only=True support for ViT models with flat hidden states or non-std module layouts (so far covering 'vit_*', 'twins_*', 'deit*', 'beit*', 'mvitv2*', 'eva*', 'samvit_*', 'flexivit*')
  • The above feature support is achieved through a new forward_intermediates() API that can be used with a feature wrapping module or directly:
```python
import timm
import torch

model = timm.create_model('vit_base_patch16_224')
input = torch.randn(2, 3, 224, 224)  # batch of 2, matching the output shapes below

final_feat, intermediates = model.forward_intermediates(input)
output = model.forward_head(final_feat)  # pooling + classifier head

print(final_feat.shape)
# torch.Size([2, 197, 768])

for f in intermediates:
    print(f.shape)
# torch.Size([2, 768, 14, 14])  -- printed 12 times, once per block

print(output.shape)
# torch.Size([2, 1000])
```

And via the features_only=True wrapper:

```python
model = timm.create_model(
    'eva02_base_patch16_clip_224',
    pretrained=True,
    img_size=512,
    features_only=True,
    out_indices=(-3, -2,),
)
output = model(torch.randn(2, 3, 512, 512))

for o in output:
    print(o.shape)
# torch.Size([2, 768, 32, 32])
# torch.Size([2, 768, 32, 32])
```
  • TinyCLIP vision tower weights added, thanks Thien Tran

Release v0.9.16


Feb 19, 2024

  • Next-ViT models added. Adapted from https://github.com/bytedance/Next-ViT
  • HGNet and PP-HGNetV2 models added. Adapted from https://github.com/PaddlePaddle/PaddleClas by SeeFun
  • Removed setup.py, moved to pyproject.toml based build supported by PDM
  • Add updated model EMA impl using torch._foreach_* ops for less overhead (see the sketch after this list)
  • Support device args in train script for non-GPU devices
  • Other misc fixes and small additions
  • Min supported Python version increased to 3.8
  • Release 0.9.16
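
For the EMA change above, a sketch of the underlying foreach technique: one fused multi-tensor op per update step instead of a Python loop over tensors. This illustrates the approach only; timm's actual implementation differs in detail:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, decay=0.9999):
    # ema = decay * ema + (1 - decay) * model, applied across all tensors at once
    torch._foreach_mul_(ema_params, decay)
    torch._foreach_add_(ema_params, model_params, alpha=1.0 - decay)

# usage: ema_update(list(ema_model.parameters()), list(model.parameters()))
```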

Jan 8, 2024

Datasets & transform refactoring

  • HuggingFace streaming (iterable) dataset support (--dataset hfids:org/dataset)
  • Webdataset wrapper tweaks for improved split info fetching; can auto-fetch splits from supported HF hub webdatasets
  • Tested HF datasets and webdataset wrapper streaming from HF hub with recent timm ImageNet uploads to https://huggingface.co/timm
  • Make input & target column/field keys consistent across datasets and pass via args
  • Full monochrome support when using e.g. --input-size 1 224 224 or --in-chans 1; sets PIL image conversion appropriately in dataset
  • Improved several alternate crop & resize transforms (ResizeKeepRatio, RandomCropOrPad, etc) for use in PixParse document AI project
  • Add SimCLR style color jitter prob along with grayscale and gaussian blur options to augmentations and args
  • Allow train without validation set (--val-split '') in train script
  • Add --bce-sum (sum over class dim) and --bce-pos-weight (positive weighting) args for training, as they're common BCE loss tweaks I was often hard-coding (see the sketch after this list)
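
What the two BCE args above map to in plain PyTorch terms (a sketch of the loss tweaks, not the train script internals):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 1000)
target = torch.zeros(8, 1000)
target[torch.arange(8), torch.randint(0, 1000, (8,))] = 1.0  # one-hot targets

# --bce-pos-weight: upweight the positive-class term of the loss
loss_pw = F.binary_cross_entropy_with_logits(logits, target, pos_weight=torch.tensor(10.0))

# --bce-sum: sum over the class dim (then mean over batch) instead of a global mean
loss_sum = F.binary_cross_entropy_with_logits(logits, target, reduction='none').sum(dim=-1).mean()
```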

Release v0.9.12


Nov 23, 2023

  • Added EfficientViT-Large models, thanks SeeFun
  • Fix Python 3.7 compat, will be dropping support for it soon
  • Other misc fixes
  • Release 0.9.12

Release v0.9.11


Nov 20, 2023

Release v0.9.10


Nov 4, 2023

  • Patch fix over 0.9.9 correcting the FrozenBatchNorm2d import path for old torchvision (~2 years old)

Nov 3, 2023

  • DFN (Data Filtering Networks) and MetaCLIP ViT weights added
  • DINOv2 'register' ViT model weights added
  • Add quickgelu ViT variants for OpenAI, DFN, MetaCLIP weights that use it (less efficient)
  • Improved typing added to ResNet, MobileNet-v3 thanks to Aryan
  • ImageNet-12k fine-tuned (from LAION-2B CLIP) convnext_xxlarge
  • 0.9.9 release

Release v0.9.9


Nov 3, 2023

  • DFN (Data Filtering Networks) and MetaCLIP ViT weights added
  • DINOv2 'register' ViT model weights added
  • Add quickgelu ViT variants for OpenAI, DFN, MetaCLIP weights that use it (less efficient)
  • Improved typing added to ResNet, MobileNet-v3 thanks to Aryan
  • ImageNet-12k fine-tuned (from LAION-2B CLIP) convnext_xxlarge
  • 0.9.9 release

Release v0.9.8


Oct 20, 2023

  • SigLIP image tower weights supported in vision_transformer.py.
    • Great potential for fine-tune and downstream feature use (see the sketch after this list).
  • Experimental 'register' support in vit models as per Vision Transformers Need Registers
  • Updated RepViT with new weight release. Thanks wangao
  • Add patch resizing support (on pretrained weight load) to Swin models
  • 0.9.8 release
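
A minimal sketch of the downstream feature use mentioned above, with num_classes=0 dropping the classifier head; the model name is one of the SigLIP tower variants, and the exact embedding width is illustrative:

```python
import timm
import torch

# Use a SigLIP image tower as a pooled feature backbone.
model = timm.create_model('vit_base_patch16_siglip_224', pretrained=True, num_classes=0)
model.eval()

with torch.no_grad():
    emb = model(torch.randn(1, 3, 224, 224))
print(emb.shape)  # e.g. torch.Size([1, 768]) for a Base-width tower
```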