Zero-shot Prompt-based Video Encoder for Surgical Gesture Recognition

Purpose: Surgical video is an important data stream for gesture recognition, so producing better visual encoders for these data streams is correspondingly important.
Methods: Leveraging the Bridge-Prompt framework, we fine-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can exploit extensive outside (non-video) image data while also making use of label metadata and weakly supervised contrastive losses.
Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks not seen during encoder training appear at prediction time. Additionally, we measure the benefit of including text descriptions in the feature-extractor training scheme.
Conclusion: Bridge-Prompt and similar pre-trained and fine-tuned video encoder models show significant promise for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical procedures, the ability of these models to generalize without extensive retraining makes them particularly valuable.
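
As a concrete illustration of the prompt-based idea, the minimal sketch below performs zero-shot gesture classification of a single frame with the off-the-shelf OpenAI CLIP API and the ViT-B/16 backbone used in this repo. The gesture descriptions and frame path are placeholders, and this is not the Bridge-Prompt training code itself, which additionally uses generated ordinal prompts and contrastive losses over frame sequences.

import torch
import clip                      # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Placeholder JIGSAWS-style gesture descriptions; Bridge-Prompt generates its
# prompts automatically rather than taking hand-written text like this.
gestures = [
    "reaching for the needle with the right hand",
    "positioning the tip of the needle",
    "pushing the needle through the tissue",
]
text_tokens = clip.tokenize(gestures).to(device)

# Placeholder frame path; any extracted surgical video frame works here.
frame = preprocess(Image.open("frame_0001.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(frame)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between the frame and each gesture description
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(gestures[probs.argmax().item()])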

To Do List:

  • 🔲 Put it on arXiv
  • ✅ 3DResNet
  • ✅ Inflated-3D
  • ✅ MS-TCN2
  • ✅ Bridge-Prompt
  • ✅ Gesture Recognition in Robotic Surgery With Multimodal Attention (TCAN)

Examples and experiments were run on Linux servers with the following specifications:

  • Ubuntu 22.04.3 LTS with 8 NVIDIA A40 GPUs
  • Ubuntu 22.04.3 LTS with 4 NVIDIA A5000 GPUs

Dataset preparation

Download the JIGSAWS dataset from here
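
Before preprocessing, it can help to confirm that the downloaded dataset contains the three JIGSAWS task folders the scripts expect. A minimal check, assuming the standard JIGSAWS layout with one folder per task:

from pathlib import Path

# Replace with the actual JIGSAWS download location (passed as --vpath below).
jigsaws_path = Path("JIGSAWS_path")
for task in ["Suturing", "Knot_Tying", "Needle_Passing"]:
    status = "found" if (jigsaws_path / task).is_dir() else "missing"
    print(f"{task}: {status}")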

Installation

conda env create -f environment.yml

Running example for Bridge-Prompt on JIGSAWS

Preprocess frames for a chosen validation user

title="All_gestures"
valid="B"
task="Suturing"
# Standard training
python ./Bridge-Prompt/preprocess/preprocess.py --vpath JIGSAWS_path --out /path/to/$title/$task-$valid --user_for_val $valid --task $task

# Limited gesture training
python ./Bridge-Prompt/preprocess/preprocess.py --vpath JIGSAWS_path --out /path/to/$title/$task-$valid --user_for_val $valid --task $task --filter_labels True --keep_labels 10

Train with Bridge-Prompt

bash scripts/run_train.sh ./configs/JIGSAWS/JIGSAWS_ft.yaml $task $valid /path/to/$title/$task-$valid 

Visual feature extraction for all frames

python extract_frame_features.py --config ./configs/JIGSAWS/JIGSAWS_exfm.yaml --pretrain ./exp/clip_ucf/ViT-B/16/JIGSAWS/$task-$valid/last_model.pt --savedir /path/to/$title/$task-$valid/visual_features
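
To sanity-check the extracted features, a sketch like the following can be used. It assumes one .npy array of shape (num_frames, feature_dim) per video; the actual filenames and format are whatever extract_frame_features.py writes to --savedir.

import numpy as np
from pathlib import Path

# Directory passed as --savedir above (variables substituted for illustration).
feat_dir = Path("/path/to/All_gestures/Suturing-B/visual_features")
for f in sorted(feat_dir.glob("*.npy")):
    feats = np.load(f)
    print(f.name, feats.shape)   # e.g. (num_frames, 512) for a ViT-B/16 backbone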

Train using gesture indices (instead of text descriptions)

Change class_dir in the JIGSAWS class in ./Bridge-Prompt/datasets/datasets.py from bf_mapping.json to bf_index_mapping.json, as sketched below.
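
The change is a single assignment inside the dataset class; illustratively (the surrounding code in datasets.py is not reproduced here, and only the mapping filename changes):

# In the JIGSAWS class in ./Bridge-Prompt/datasets/datasets.py
class_dir = 'bf_index_mapping.json'   # previously 'bf_mapping.json'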

Running example for MS-TCN++ on JIGSAWS

Preprocessing

mkdir -p /path/to/$title/$task-$valid/JIGSAWS
python ./MS_TCN2/preprocess.py --subdataset $task \
                                --vpath JIGSAWS_path \
                                --output /path/to/$title/$task-$valid/JIGSAWS \
                                --visual_feat /path/to/$title/$task-$valid/visual_features

cp ./mapping.txt /path/to/$title/$task-$valid/JIGSAWS/mapping.txt

Train & test

bash train.sh JIGSAWS .$task.LOUO.$valid /path/to/$title/$task-$valid/

bash test_epoch.sh JIGSAWS .$task.LOUO.$valid 100 /path/to/$title/$task-$valid/

Remarks

Please refer to run_batch_cross_valid.sh for batch running.
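
For orientation, the loop that run_batch_cross_valid.sh automates looks roughly like the sketch below; leave-one-user-out folds over the JIGSAWS users B-I are assumed, only the preprocessing step is shown, and the actual script should be preferred.

import subprocess

title = "All_gestures"
task = "Suturing"
# Assumed JIGSAWS leave-one-user-out validation users
for valid in ["B", "C", "D", "E", "F", "G", "H", "I"]:
    out_dir = f"/path/to/{title}/{task}-{valid}"
    subprocess.run(
        ["python", "./Bridge-Prompt/preprocess/preprocess.py",
         "--vpath", "JIGSAWS_path",
         "--out", out_dir,
         "--user_for_val", valid,
         "--task", task],
        check=True,
    )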

Please email Mingxing (mingxing.rao@vanderbilt.edu) for all experimental checkpoints and visual features

Reference Code

License

MIT License.