Multimodal Distillation for Egocentric Action Recognition

This repository contains the implementation of the paper Multimodal Distillation for Egocentric Action Recognition, published at ICCV 2023.

[Teaser figure]

Reproducing the virtual environment

The main dependencies you need to install to reproduce the virtual environment are PyTorch and the following packages:

pip install accelerate tqdm h5py yacs timm einops natsort
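
To quickly sanity-check the environment afterwards, you can verify that everything imports (the package names match the pip line above):

import torch, accelerate, h5py, timm, einops, natsort, tqdm, yacs
print(torch.__version__, "CUDA available:", torch.cuda.is_available())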

Downloading the pre-trained Swin-T model

Create a directory ./data/pretrained-backbones/ and download Swin-T from here:

wget https://github.com/SwinTransformer/storage/releases/download/v1.0.4/swin_tiny_patch244_window877_kinetics400_1k.pth -P ./data/pretrained-backbones/
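
To confirm the download, you can load the checkpoint and inspect its keys (a quick sanity check only, not part of the pipeline):

import torch

ckpt = torch.load(
    "./data/pretrained-backbones/swin_tiny_patch244_window877_kinetics400_1k.pth",
    map_location="cpu",
)
print(list(ckpt.keys()))  # the weights typically sit under a 'state_dict' or 'model' key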

Preparing the Epic-Kitchens and the Something-Something/Else datasets

We store all data (video frames, optical flow frames, audio, etc.) in an efficient HDF5 file, where each video is a dataset within the HDF5 file, and the n-th element of that dataset contains the bytes of the n-th frame of the video. You can download the Something-Something and Something-Else datasets from this link, and the Epic-Kitchens dataset from this link. This includes all the modalities we use for each dataset.
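
Concretely, reading a single frame from such a file could look as follows (the file name and video key below are illustrative, not the repository's exact naming):

import io
import h5py
from PIL import Image

with h5py.File("./data/something-something/videos.h5", "r") as f:
    video = f["12345"]             # one HDF5 dataset per video
    frame_bytes = bytes(video[7])  # encoded bytes of the 8th frame
    frame = Image.open(io.BytesIO(frame_bytes))
    print(frame.size)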

Please download the datasets and place them inside ./data/, i.e., as ./data/something-something/ and ./data/EPIC-KITCHENS/. Otherwise, feel free to store the data wherever you see fit; just do not forget to modify the config.yaml files with the appropriate location. In this README.md, we assume that all data is placed inside ./data/, and all experiments inside ./experiments/.
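
If you do relocate the data, you can print an experiment's config to find the path keys that need updating (the snippet only inspects the config; the exact key names are whatever the repository's config.yaml files define):

from yacs.config import CfgNode

with open("experiments/epic-kitchens-swint-distill-flow-audio/config.yaml") as f:
    cfg = CfgNode.load_cfg(f)
print(cfg)  # locate the dataset path entries and point them at your data location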

Model ZOO

Dataset             | Model type        | Model architecture | Training modalities                           | Download link
--------------------|-------------------|--------------------|-----------------------------------------------|--------------
Something-Something | Distilled student | Swin-T             | RGB frames + Optical Flow + Object Detections | Download
Something-Else      | Distilled student | Swin-T             | RGB frames + Optical Flow + Object Detections | Download
Epic-Kitchens       | Distilled student | Swin-T             | RGB frames + Optical Flow + Audio             | Download
Something-Something | Unimodal          | Swin-T             | RGB frames                                    | Download
Something-Something | Unimodal          | Swin-T             | Optical Flow                                  | Download
Something-Something | Unimodal          | STLT               | Object Detections                             | Download
Something-Else      | Unimodal          | Swin-T             | RGB frames                                    | Download
Something-Else      | Unimodal          | Swin-T             | Optical Flow                                  | Download
Something-Else      | Unimodal          | STLT               | Object Detections                             | Download
Epic-Kitchens       | Unimodal          | Swin-T             | RGB frames                                    | Download
Epic-Kitchens       | Unimodal          | Swin-T             | Optical Flow                                  | Download
Epic-Kitchens       | Unimodal          | Swin-T             | Audio                                         | Download

Inference on Epic-Kitchens

  1. Download our Epic-Kitchens distilled model from the Model ZOO, and place it in ./experiments/.
  2. Run inference as:
python src/inference.py --experiment_path "experiments/epic-kitchens-swint-distill-flow-audio" --opts DATASET_TYPE "video"

Inference on Something-Something & Something-Else

  1. Download our Something-Else distilled model or the Something-Something distilled model from the Model ZOO, and place it in ./experiments/.
  2. Run inference as:
python src/inference.py --experiment_path "experiments/something-swint-distill-layout-flow" --opts DATASET_TYPE "video"

for Something-Something, and

python src/inference.py --experiment_path "experiments/something-else-swint-distill-layout-flow" --opts DATASET_TYPE "video"

for Something-Else.

Distilling from Multimodal Teachers

To reproduce the experiments (i.e., with identical hyperparameters, where only the random seed varies):

python src/patient_distill.py --config "experiments/something-else-swint-distill-layout-flow/config.yaml" --opts EXPERIMENT_PATH "experiments/reproducing-the-something-else-experiment"

Note that this assumes access to the datasets for all modalities (video, optical flow, audio, object detections), as well as the individual (unimodal) models that constitute the multimodal ensemble teacher.
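
For intuition, the core of distilling from a multimodal ensemble teacher is matching the student's predictions to the averaged per-modality teacher predictions. The sketch below shows this general idea only; the temperature and loss reduction are illustrative placeholders, and src/patient_distill.py implements the actual training procedure:

import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_per_modality,
                               temperature=2.0):
    # Average the teachers' softened predictions into one soft target.
    teacher_probs = torch.stack(
        [F.softmax(logits / temperature, dim=-1)
         for logits in teacher_logits_per_modality]
    ).mean(dim=0)
    # KL divergence pushes the student towards the ensemble's soft labels.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2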

TODOs

  • Release Something-Something pretrained teachers for each modality.
  • Test the codebase.
  • Structure the Model ZOO part of the codebase.

Citation

If you find our code useful for your own research, please use the following BibTeX entry:

@inproceedings{radevski2023multimodal,
  title={Multimodal Distillation for Egocentric Action Recognition},
  author={Radevski, Gorjan and Grujicic, Dusan and Blaschko, Matthew and Moens, Marie-Francine and Tuytelaars, Tinne},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={5213--5224},
  year={2023}
}
