GCNet: Global-local Temporal Modeling Enhanced 2DCNN with Category-supervised Contrastive Learning for Action Recognition
GCNet is a lightweight, PyTorch-implemented video action recognition framework that enhances 2D CNNs with two novel modules: Global-local Temporal Modeling (GTM) for unified global-local temporal feature extraction, and Category-supervised Contrastive Learning (CCL) for discriminative feature learning. Together they achieve state-of-the-art accuracy without expensive 3D convolutions.
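For intuition, CCL is a label-guided contrastive objective: clips from the same category are pulled together in embedding space while clips from other categories are pushed apart. Below is a minimal SupCon-style sketch; the function name, default temperature, and use of L2-normalized embeddings are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def category_supervised_contrastive_loss(features, labels, temperature=0.1):
    """SupCon-style sketch: embeddings with the same class label attract,
    all others in the batch repel. Illustrative only, not GCNet's exact CCL.
    features: (B, D) clip embeddings; labels: (B,) integer class indices."""
    features = F.normalize(features, dim=1)              # unit-length embeddings
    sim = features @ features.t() / temperature          # (B, B) scaled similarities
    self_mask = torch.eye(labels.size(0), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                               # anchors with >=1 positive
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()
```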
The environment is fully compatible with TDN and standard video understanding pipelines:
- Python ≥ 3.6
- PyTorch ≥ 1.4
- torchvision
- tensorboardX
- tqdm
- scikit-learn
- ffmpeg
- decord
- torchmetrics (for the CCL loss)
- wandb (optional, for visualization)
Install all dependencies:

```bash
pip install -r requirements.txt
```

Large weight files are excluded from GitHub due to size limits. Please download and place them as follows:
```
GCNet/
├─ resnet50-19c8e357.pth           (torchvision official)
└─ kd_pretrained_models/
   ├─ clip-vit-base-patch32/
   │  └─ pytorch_model.bin
   └─ vit_base_patch16_224/
      ├─ pytorch_model.bin
      └─ model.safetensors
```

GCNet is evaluated on three standard benchmarks:
- UCF-101: 101 action classes, 13,320 videos
- HMDB-51: 51 action classes, 6,849 videos
- Something-Something-V1: 174 fine-grained interactive actions, 108,499 videos
Data preparation follows the TDN protocol:
- Extract frames or use raw video files
- Generate annotation files, one line per video, in the format `path num_frames label` (see the sketch after this list)
- Update dataset paths in `ops/dataset_configs.py`
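As a concrete example, a small script along these lines can produce the annotation file from extracted frame folders. The directory layout, frame file extension, and output filename here are illustrative assumptions:

```python
import os

# Assumed layout: frames_root/<class_name>/<video_id>/img_00001.jpg ...
frames_root = '/data/ucf101/frames'   # hypothetical path
classes = sorted(os.listdir(frames_root))
class_to_idx = {name: i for i, name in enumerate(classes)}

with open('train_list.txt', 'w') as f:
    for cls in classes:
        cls_dir = os.path.join(frames_root, cls)
        for video in sorted(os.listdir(cls_dir)):
            video_dir = os.path.join(cls_dir, video)
            num_frames = sum(1 for name in os.listdir(video_dir)
                             if name.endswith('.jpg'))
            # One annotation line per video: path num_frames label
            f.write(f'{video_dir} {num_frames} {class_to_idx[cls]}\n')
```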
Results on UCF-101:

| Model | Frames | Top-1 |
|---|---|---|
| GCNet-ResNet50 | 8 | 88.6% |
| GCNet-ResNet50 | 16 | 89.5% |
Results on HMDB-51:

| Model | Frames | Top-1 |
|---|---|---|
| GCNet-ResNet50 | 8 | 59.9% |
| GCNet-ResNet50 | 16 | 59.2% |
Results on Something-Something-V1:

| Model | Frames | Top-1 |
|---|---|---|
| GCNet-ResNet50 | 8 | 52.7% |
| GCNet-ResNet50 | 16 | 54.2% |
GCNet is built on the TDN codebase but significantly improves performance via Global-local Temporal Modeling (GTM) and Category-supervised Contrastive Learning (CCL). Under the same backbone (ResNet-50), pretraining (ImageNet), and training settings, we compare GCNet with TDN below:
On HMDB-51:

| Method | Frames | Top-1 Accuracy | Improvement |
|---|---|---|---|
| TDN | 8 | 57.0% | — |
| GCNet (Ours) | 8 | 59.9% | +2.9% |
| TDN | 16 | 57.4% | — |
| GCNet (Ours) | 16 | 59.2% | +1.8% |
On UCF-101:

| Method | Frames | Top-1 Accuracy | Improvement |
|---|---|---|---|
| TDN | 8 | 87.0% | — |
| GCNet (Ours) | 8 | 88.6% | +1.6% |
| TDN | 16 | 88.3% | — |
| GCNet (Ours) | 16 | 89.5% | +1.2% |
On Something-Something-V1:

| Method | Frames | Top-1 Accuracy | Improvement |
|---|---|---|---|
| TDN | 8 | 52.3% | — |
| GCNet (Ours) | 8 | 52.7% | +0.4% |
| TDN | 16 | 53.9% | — |
| GCNet (Ours) | 16 | 54.2% | +0.3% |
Center-crop testing (1 clip, 1 crop):

```bash
CUDA_VISIBLE_DEVICES=0 python test_models_center_crop.py ucf101 \
    --archs resnet50 --weights YOUR_WEIGHTS.pth \
    --test_segments 8 --test_crops 1 --batch_size 16
```

Three-crop, multi-clip testing (run once for each `--clip_index` from 0 to 9):

```bash
CUDA_VISIBLE_DEVICES=0 python test_models_three_crops.py ucf101 \
    --archs resnet50 --weights YOUR_WEIGHTS.pth \
    --test_segments 8 --test_crops 3 --clip_index 0
```

Aggregate the per-clip results:
```bash
python pkl_to_results.py --num_clips 10 --test_crops 3
```

Training on UCF-101:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
main.py ucf101 RGB \
--arch resnet50 --num_segments 8 \
--lr 0.0015 --lr_steps 25 35 --epochs 50 \
--batch-size 4 --wd 5e-4 \
--dropout 0.8 \
--no_partialbn \
--contrastive_loss True \
--category_supervised True
```

Training on HMDB-51:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
main.py hmdb51 RGB \
--arch resnet50 --num_segments 8 \
--lr 0.0015 --lr_steps 25 35 --epochs 50 \
--batch-size 4 --wd 5e-4 \
--dropout 0.8 \
--no_partialbn \
--contrastive_loss True \
--category_supervised True
```

Training on Something-Something-V1:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
main.py something RGB \
--arch resnet50 --num_segments 8 \
--lr 0.0025 --lr_steps 30 45 55 --epochs 60 \
--batch-size 8 --wd 5e-4 \
--dropout 0.8 \
--no_partialbn \
--contrastive_loss True \
--category_supervised True
```

Key arguments:

- `--contrastive_loss True`: enable the CCL module (see the sketch after this list)
- `--contrastive_temperature 0.1`: temperature for the contrastive loss
- `--category_supervised True`: use label-guided contrastive learning
- `--num_segments 8`: sparse 8-frame sampling (best speed–accuracy tradeoff)
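Conceptually, these flags switch on a joint objective of the form L = L_CE + λ·L_CCL. Below is a sketch of how the two terms could be combined in one training step; the model interface returning `(logits, embeddings)` and the weight `lam` are assumptions:

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()
lam = 1.0  # hypothetical weight balancing the contrastive term

def training_step(model, contrastive_loss, clips, labels):
    """Cross-entropy on the class logits plus a weighted category-supervised
    contrastive term on the clip embeddings. The (logits, embeddings) model
    interface is assumed; contrastive_loss could be, e.g., the SupCon-style
    sketch shown earlier."""
    logits, embeddings = model(clips)
    return ce(logits, labels) + lam * contrastive_loss(embeddings, labels)
```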
Ablation on Top-1 accuracy (ResNet-50, 8 frames):

| GTM | CCL | HMDB-51 | UCF-101 |
|---|---|---|---|
| ✗ | ✗ | 57.0% | 87.0% |
| ✓ | ✗ | 59.3% | 88.0% |
| ✗ | ✓ | 58.0% | 87.6% |
| ✓ | ✓ | 59.9% | 88.6% |
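The table shows GTM contributing the larger share of the gain, with CCL adding a further improvement. For intuition only, a global-local temporal block typically pairs a local temporal convolution (neighboring-frame motion) with attention over the full frame axis; the sketch below is one plausible shape, not the paper's exact GTM design (it also assumes PyTorch ≥ 1.9 for `batch_first` attention):

```python
import torch
import torch.nn as nn

class GlobalLocalTemporal(nn.Module):
    """Illustrative global-local temporal block, not GCNet's exact GTM.
    A depthwise 1D conv models local motion between neighboring frames;
    multi-head self-attention over the T axis models global context."""
    def __init__(self, channels, num_heads=4, kernel_size=3):
        super().__init__()
        self.local = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=channels)
        self.glob = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (N, T, C) per-frame features pooled over space
        local = self.local(x.transpose(1, 2)).transpose(1, 2)   # local branch
        global_, _ = self.glob(x, x, x)                         # global branch
        return x + local + global_                              # residual fusion
```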
Our code is built on:

```bibtex
@inproceedings{wang2021tdn,
  title={TDN: Temporal difference networks for efficient action recognition},
  author={Wang, Limin and Tong, Zhan and Ji, Bin and Wu, Gangshan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={1895--1904},
  year={2021}
}

@article{wang2018temporal,
  title={Temporal segment networks for action recognition in videos},
  author={Wang, Limin and Xiong, Yuanjun and Wang, Zhe and Qiao, Yu and Lin, Dahua and Tang, Xiaoou and Van Gool, Luc},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={41},
  number={11},
  pages={2740--2755},
  year={2018},
  publisher={IEEE}
}
```
