Official code repo for Ensemble Modeling for Multimodal Visual Action Recognition [ICIAP-W 2023]
```bash
conda create -n mm python=3.11.4
conda activate mm
conda install pytorch=2.0.1 torchvision=0.15.2 torchaudio=2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c anaconda numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge tqdm
pip install opencv-python
pip install fvcore
pip install timm
pip install mmcv==1.3.11
pip install einops
pip install scikit-learn
pip install focal-loss-torch
pip install pandas
pip install seaborn
```
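After installation, a quick sanity check (not part of this repo) can confirm that the pinned versions resolved correctly and that PyTorch sees the GPU:

```python
# Optional environment check; not part of this repository.
import torch
import torchvision
import mmcv
import cv2
import timm  # imported only to confirm it is installed

print("torch:", torch.__version__)              # expected 2.0.1
print("torchvision:", torchvision.__version__)  # expected 0.15.2
print("mmcv:", mmcv.__version__)                # expected 1.3.11
print("opencv:", cv2.__version__)
print("CUDA available:", torch.cuda.is_available())
```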
Download the following components of the MECCANO dataset from the official website:
- RGB frames
- Depth frames
- Action annotations
Update `config.py` [`data_dir`] to reflect the dataset location.
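For reference, a minimal sketch of the corresponding entry in `config.py`; only `data_dir` is named in this README, and the subdirectory layout in the comments is an assumption about how the downloaded MECCANO components might be organized:

```python
# config.py (sketch) -- example value only; point data_dir at your download.
data_dir = "/path/to/MECCANO"
# Assumed layout of the downloaded components, e.g.:
#   <data_dir>/RGB_frames/
#   <data_dir>/Depth_frames/
#   <data_dir>/action_annotations/
```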
- Train individual modalities (RGB and Depth).
Update `config.py` [`train_run_id`, `train_modality`, `train_weights_dir`, `train_ss_wt_file`] to reflect the relevant details.
Run: `python -u train.py`
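A sketch of what those training fields might look like before launching `train.py`; the field names come from this README, but every value below is a placeholder:

```python
# config.py (sketch) -- placeholder values; adapt paths to your setup.
train_run_id = "rgb_run_01"                          # tag for this training run
train_modality = "RGB"                               # or "Depth" for the depth pathway
train_weights_dir = "./checkpoints/rgb_run_01"       # where checkpoints are saved
train_ss_wt_file = "./pretrained/swin3db_ssv2.pth"   # Something-Something v2 pre-trained Swin3D-B weights
```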
- Test individual modalities (RGB and Depth).
Update `config.py` [`test_wt_file`, `test_modality`] to reflect the relevant details.
Run: `python -u test.py`
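Analogously, a sketch of the per-modality test fields (placeholder values):

```python
# config.py (sketch) -- placeholder values.
test_wt_file = "./checkpoints/rgb_run_01/best.pth"   # trained checkpoint to evaluate
test_modality = "RGB"                                # or "Depth"
```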
- Obtain class probabilities averaged from the RGB and Depth pathways (${\color{red}Competition~Result}$).
Update `config.py` [`test_wt_file_1`, `test_wt_file_2`] to reflect the relevant details.
Run: `python -u test_mm.py`
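For intuition about what this fusion step computes, here is a minimal sketch of late fusion by averaging per-pathway class probabilities. It is an illustration under stated assumptions, not the repo's actual `test_mm.py`: the checkpoint loading is omitted and the class count is a placeholder.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def fuse_predictions(rgb_logits: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    """Average class probabilities from the RGB and Depth pathways.

    Both inputs are (batch, num_classes) logits produced by the two
    independently trained Swin3D-B encoders (loaded from test_wt_file_1
    and test_wt_file_2); the return value is the predicted action class.
    """
    rgb_probs = F.softmax(rgb_logits, dim=1)
    depth_probs = F.softmax(depth_logits, dim=1)
    avg_probs = (rgb_probs + depth_probs) / 2.0
    return avg_probs.argmax(dim=1)

# Toy usage with random logits; 61 classes assumed for MECCANO actions.
rgb_logits = torch.randn(4, 61)
depth_logits = torch.randn(4, 61)
print(fuse_predictions(rgb_logits, depth_logits))
```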
We use the Swin3D-B backbone, which is pre-trained on the Something-Something v2 dataset.
Swin3D-B with Something-Something v2 pre-training: Google Drive
The RGB frames and Depth maps are passed through two independently trained Swin3D-B encoders. The class probabilities obtained from each pathway are then averaged to yield the final action classes.
Ours (RGB) with Something-Something v2 pre-training: Google Drive
Ours (Depth) with Something-Something v2 pre-training: Google Drive
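Before wiring a downloaded checkpoint into `config.py`, it can help to inspect how it is packaged. A small sketch follows; the path is a placeholder and the wrapping key, if any, depends on the checkpoint:

```python
import torch

# Placeholder path to one of the checkpoints linked above.
ckpt_path = "./pretrained/swin3db_ssv2.pth"

ckpt = torch.load(ckpt_path, map_location="cpu")
# Checkpoints are often dicts that wrap the weights (e.g. under "state_dict"
# or "model"); print the top-level keys to see how this one is organized.
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:10])
```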
Thanks to https://github.com/SwinTransformer/Video-Swin-Transformer for the Swin3D-B implementation.
```bibtex
@article{kini2023ensemble,
  title={Ensemble Modeling for Multimodal Visual Action Recognition},
  author={Kini, Jyoti and Fleischer, Sarah and Dave, Ishan and Shah, Mubarak},
  journal={arXiv preprint arXiv:2308.05430},
  year={2023}
}

@article{kini2023egocentric,
  title={Egocentric RGB+Depth Action Recognition in Industry-Like Settings},
  author={Kini, Jyoti and Fleischer, Sarah and Dave, Ishan and Shah, Mubarak},
  journal={arXiv preprint arXiv:2309.13962},
  year={2023}
}
```
If you have any inquiries or require assistance, please reach out to Jyoti Kini (jyoti.kini@ucf.edu).