Releases: innat/VideoSwin

v2.0

23 Mar 19:18
0953651

Summary

Keras 3 implementation of Video Swin Transformer. The official PyTorch weights have been converted to a Keras 3 compatible format. This implementation can run the model on multiple backends, i.e., TensorFlow, PyTorch, and JAX.
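As a minimal sketch of how a backend is typically chosen in Keras 3: the library reads the KERAS_BACKEND environment variable when it is first imported, so the variable has to be set beforehand. The helper name select_keras_backend and the SUPPORTED_BACKENDS tuple below are hypothetical and only for illustration; they are not part of this repository.

```python
import os

# Backend identifiers that Keras 3 accepts for the three backends named above.
SUPPORTED_BACKENDS = ("tensorflow", "torch", "jax")


def select_keras_backend(name: str) -> str:
    """Set KERAS_BACKEND (hypothetical helper, not part of VideoSwin).

    Must be called before `import keras`; Keras reads the variable once,
    at import time.
    """
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    return name


select_keras_backend("jax")
# import keras  # import Keras only after the variable has been set
```

Changing the variable after Keras has been imported has no effect in the same process, so this call belongs at the very top of a script or notebook.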

Full Changelog: v1.1...v2.0

v1.1

13 Oct 10:24

Weights in TensorFlow SavedModel format. Details.

v1.0

12 Oct 20:24

Checkpoints of VideoSwin in Keras

Checkpoints of VideoSwin, the Video Swin Transformer model in Keras. The pretrained weights are ported from the official PyTorch model. The following is a list of all available models in .h5 format.

Checkpoint Naming Style

To cover the model variations briefly, checkpoint names follow this general format:

dataset = 'K400'  # 'K400', 'K600', 'SSV2'
pretrained_dataset = 'IN1K'  # 'IN1K', 'IN22K', 'K400'
size = 'B'  # 'T', 'S', 'B'
patch_size = (2, 4, 4)
window_size = (8, 7, 7)  # (8, 7, 7), (16, 7, 7)
num_frames = 32
input_size = 224

>> checkpoint_name = (
   f'TFVideoSwin{size}_'
   f'{dataset}_'
   f'{pretrained_dataset}_'
   # Join the tuple digits so that (2, 4, 4) becomes '244'.
   f'P{"".join(map(str, patch_size))}_'
   f'W{"".join(map(str, window_size))}_'
   f'{num_frames}x{input_size}.h5'
)
>> checkpoint_name
TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5

Here, size is the model variant: tiny (T), small (S), or base (B). The pretrained_dataset refers to the pretrained weights used to initialize the Video Swin model before training. For example, 2D Swin image models pretrained on IN22K (ImageNet 22K) are used to initialize the 3D Video Swin model. The dataset refers to the benchmark dataset, i.e., Kinetics or Something-Something-V2. The patch_size and window_size are internal parameters of the model architecture. The num_frames and input_size for Video Swin are 32 and 224, respectively. In the Keras implementation, the checkpoints are available in both SavedModel and h5 formats. Check the v1.1 release page for the SavedModel checkpoints.

Model Name
TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5
TFVideoSwinS_K400_IN1K_P244_W877_32x224.h5
TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5
TFVideoSwinB_K600_IN22K_P244_W877_32x224.h5
TFVideoSwinB_K400_IN22K_P244_W877_32x224.h5
TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5

Here, IN1K and IN22K refer to ImageNet 1K and ImageNet 22K. P244 refers to a patch_size of [2,4,4] and W877 refers to a window_size of [8,7,7]. All these models output logits, which makes it easy to add a custom head on top for downstream tasks. Check the notebook.
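To make the naming convention concrete, here is a rough sketch of a parser that recovers the fields from a checkpoint file name. The names CheckpointInfo and parse_checkpoint_name are hypothetical and not part of the repository. The sketch also assumes that the last two dimensions of the patch and window sizes are single digits, which holds for every name listed above (for example, W1677 is read as (16, 7, 7)).

```python
import re
from typing import NamedTuple, Tuple


class CheckpointInfo(NamedTuple):
    size: str
    dataset: str
    pretrained_dataset: str
    patch_size: Tuple[int, int, int]
    window_size: Tuple[int, int, int]
    num_frames: int
    input_size: int


# Pattern inferred from the listed file names (an assumption, not an official spec).
_PATTERN = re.compile(
    r"^TFVideoSwin(?P<size>[A-Z])_"
    r"(?P<dataset>[A-Za-z0-9]+)_"
    r"(?P<pretrained>[A-Za-z0-9]+)_"
    r"P(?P<patch>\d{3,})_"
    r"W(?P<window>\d{3,})_"
    r"(?P<frames>\d+)x(?P<res>\d+)\.h5$"
)


def _split_dims(digits: str) -> Tuple[int, int, int]:
    # Assumes the last two dimensions are single digits: '1677' -> (16, 7, 7).
    return int(digits[:-2]), int(digits[-2]), int(digits[-1])


def parse_checkpoint_name(name: str) -> CheckpointInfo:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name!r}")
    return CheckpointInfo(
        size=match["size"],
        dataset=match["dataset"],
        pretrained_dataset=match["pretrained"],
        patch_size=_split_dims(match["patch"]),
        window_size=_split_dims(match["window"]),
        num_frames=int(match["frames"]),
        input_size=int(match["res"]),
    )


info = parse_checkpoint_name("TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5")
print(info.window_size)  # (16, 7, 7)
```

A parser like this can be handy for choosing the matching model configuration before loading a downloaded weight file.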