Purpose: Surgical video is an important data stream for gesture recognition, so producing better visual encoders for these data streams is correspondingly important.
Methods: Leveraging the Bridge-Prompt framework, we fine-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can exploit extensive outside (non-video) image data while also making use of label metadata and weakly supervised contrastive losses.
Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks unseen during encoder training appear at prediction time. Additionally, we measure the benefit of including text descriptions in the feature-extractor training scheme.
Conclusion: Bridge-Prompt and similar pre-trained+fine-tuned video encoder models present significant promise for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical procedures, the ability of these models to generalize without extensive retraining makes them invaluable.
- 🔲 Put it on arXiv
- ✅ 3DResNet
- ✅ Inflated-3D
- ✅ MS-TCN2
- ✅ Bridge-Prompt
- ✅ Gesture Recognition in Robotic Surgery With Multimodal Attention (TCAN)
- Ubuntu 22.04.3 LTS with 8 NVIDIA A40 GPUs
- Ubuntu 22.04.3 LTS with 4 NVIDIA A5000 GPUs
Download the JIGSAWS dataset from here
conda env create -f environment.yml
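After the environment is created, activate it. The environment name comes from the name: field in environment.yml; bridge-prompt below is an assumption, so check that field first.
# Assumed environment name; verify against the "name:" field in environment.yml
conda activate bridge-prompt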
# Experiment variables: output folder title, held-out user for LOUO validation, and JIGSAWS task
title="All_gestures"
valid="B"
task="Suturing"
# Standard training
python ./Bridge-Prompt/preprocess/preprocess.py --vpath JIGSAWS_path --out /path/to/$title/$task-$valid --user_for_val $valid --task $task
# Limited gesture training
python ./Bridge-Prompt/preprocess/preprocess.py --vpath JIGSAWS_path --out /path/to/$title/$task-$valid --user_for_val $valid --task $task --filter_labels True --keep_labels 10
# Fine-tune the Bridge-Prompt (CLIP) video encoder
bash scripts/run_train.sh ./configs/JIGSAWS/JIGSAWS_ft.yaml $task $valid /path/to/$title/$task-$valid
# Extract frame-wise visual features with the fine-tuned encoder
python extract_frame_features.py --config ./configs/JIGSAWS/JIGSAWS_exfm.yaml --pretrain ./exp/clip_ucf/ViT-B/16/JIGSAWS/$task-$valid/last_model.pt --savedir /path/to/$title/$task-$valid/visual_features
Change class_dir in the JIGSAWS class in ./Bridge-Prompt/datasets/datasets.py from bf_mapping.json to bf_index_mapping.json
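Alternatively, the same edit can be scripted; this is a minimal sketch assuming the default repo layout (the sed one-liner is a convenience, not part of the repo):
# Point the JIGSAWS dataset class at bf_index_mapping.json (keeps a .bak backup of the original file)
sed -i.bak 's/bf_mapping\.json/bf_index_mapping.json/' ./Bridge-Prompt/datasets/datasets.py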
mkdir -p /path/to/$title/$task-$valid/JIGSAWS
# Prepare MS-TCN2 inputs from the extracted visual features
python ./MS_TCN2/preprocess.py --subdataset $task \
--vpath JIGSAWS_path \
--output /path/to/$title/$task-$valid/JIGSAWS \
--visual_feat /path/to/$title/$task-$valid/visual_features
cp ./mapping.txt /path/to/$title/$task-$valid/JIGSAWS/mapping.txt
# Train MS-TCN2, then evaluate the epoch-100 checkpoint
bash train.sh JIGSAWS .$task.LOUO.$valid /path/to/$title/$task-$valid/
bash test_epoch.sh JIGSAWS .$task.LOUO.$valid 100 /path/to/$title/$task-$valid/
Please refer to run_batch_cross_valid.sh for batch running across all cross-validation folds; a sketch of the loop structure follows.
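For orientation only, a leave-one-user-out loop might look like the following. The user IDs (B-I) and task names follow the JIGSAWS convention, but the per-fold steps shown are an assumption; run_batch_cross_valid.sh is the authoritative version.
# Hedged sketch of LOUO batch running; defer to run_batch_cross_valid.sh for the real script
for task in Suturing Needle_Passing Knot_Tying; do
  for valid in B C D E F G H I; do
    python ./Bridge-Prompt/preprocess/preprocess.py --vpath JIGSAWS_path --out /path/to/$title/$task-$valid --user_for_val $valid --task $task
    bash scripts/run_train.sh ./configs/JIGSAWS/JIGSAWS_ft.yaml $task $valid /path/to/$title/$task-$valid
  done
done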
Please email Mingxing (mingxing.rao@vanderbilt.edu) to request the experimental checkpoints and visual features
- https://github.com/ttlmh/Bridge-Prompt
- https://github.com/sj-li/MS-TCN2
- https://github.com/piergiaj/pytorch-i3d
- https://github.com/kenshohara/3D-ResNets-PyTorch
MIT License.