Visual Representation Learning with Stochastic Frame Prediction

Huiwon Jang¹ · Dongyoung Kim¹ · Junsu Kim¹
Jinwoo Shin¹ · Pieter Abbeel² · Younggyo Seo¹,³
¹KAIST   ²UC Berkeley   ³Dyson Robot Learning Lab

1. Environment setup

  • We note that torch versions >2.0 may work, but installing the versions below via conda is recommended.
conda create -n rsp python=3.9.12 -y
conda activate rsp
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
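  • Optionally, you can sanity-check that the install picked up CUDA (this simply prints the installed versions and GPU availability):
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"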

2. Dataset

Dataset download

sh data_preprocessing/download.sh
sh data_preprocessing/extract.sh
  • We assume the data root directory is $DATA_ROOT = /data/kinetics400.
  • If you want to use a different root directory, change the root_dl variable in download.sh and extract.sh (see the example below).
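  • For example, pointing the scripts at a different storage location might look like the following (where exactly root_dl is set inside each script may differ, so treat this as a sketch):
# in data_preprocessing/download.sh and data_preprocessing/extract.sh
root_dl=/my/storage/kinetics400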

Dataset pre-processing

  • We resize the videos to 256x256 for efficient loading during training.
python data_preprocessing/make_256scale.py --datadir $DATA_ROOT
  • We additionally provide code to filter out a few broken videos.
python data_preprocessing/make_labels.py --datadir $DATA_ROOT --filedir train2

Kinetics-400

/data/kinetics400
|-- train2
    |-- abseiling
        |-- xx.mp4
        |-- ...
    |-- air_drumming
        |-- xx.mp4
        |-- ...
    |-- ...
|-- labels
    |-- label_full_1.0.pickle
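
A quick way to check that the result matches the layout above (paths assume the default $DATA_ROOT; exact counts depend on how many clips downloaded successfully):
ls /data/kinetics400/train2 | wc -l                  # roughly 400 action-class folders
find /data/kinetics400/train2 -name '*.mp4' | wc -l  # total number of training clips
ls /data/kinetics400/labels                          # should list label_full_1.0.pickle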

3. Pre-training RSP on Kinetics-400

  • Note that [N_NODE] x [BATCH_SIZE_PER_GPU] x [ACCUM_ITER] should equal 1536 to reproduce our results (see the example after the command below).
  • Default: [DATA_PATH]=/data/kinetics400
python -m torch.distributed.launch --nproc_per_node=[N_NODE] main_pretrain_rsp.py \
    --batch_size [BATCH_SIZE_PER_GPU] \
    --accum_iter [ACCUM_ITER] \
    --model rsp_vit_small_patch16 \
    --epochs 400 \
    --warmup_epochs 40 \
    --data_path [DATA_PATH] \
    --log_dir [LOG_DIR] \
    --output_dir [LOG_DIR] \
    --norm_pix_loss \
    --repeated_sampling 2
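
For example, one combination that satisfies the 1536 constraint on a single node with 8 GPUs is 8 x 96 x 2 = 1536 (an illustrative setting, not necessarily the exact configuration used in the paper):
python -m torch.distributed.launch --nproc_per_node=8 main_pretrain_rsp.py \
    --batch_size 96 \
    --accum_iter 2 \
    --model rsp_vit_small_patch16 \
    --epochs 400 \
    --warmup_epochs 40 \
    --data_path /data/kinetics400 \
    --log_dir [LOG_DIR] \
    --output_dir [LOG_DIR] \
    --norm_pix_loss \
    --repeated_sampling 2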

4. Evaluation

We provide the pre-trained checkpoints below:

  • ViT-S/16 400 epochs: [link]
  • ViT-B/16 400 epochs: [link]

4.1. Video Label Propagation

The evaluation code is mainly built upon DINO.

1. DAVIS 2017 video object segmentation

  • Step 1: Dataset preparation

We note that the default root path is [DATA_ROOT]=/data. Additionally, we resize the DAVIS frames from 480x(variable width) to 480x880 so that the resolution divides evenly into patches for evaluation.

sh data_preprocessing/eval/davis_download.sh
python data_preprocessing/eval/davis_preprocessing.py --data_root [DATA_ROOT]
[DATA_ROOT]/DAVIS_480_880
|-- Annotations/480p
    |-- bear
        |-- 00000.png
        |-- ...
    |-- ...
|-- ImageSets/2017/val.txt
|-- JPEGImages/480p
    |-- bear
        |-- 00000.jpg
        |-- ...
    |-- ...
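
A quick consistency check that every validation sequence listed in val.txt has both frames and annotations (assuming [DATA_ROOT]=/data; adjust the paths otherwise):
while read seq; do
  [ -d /data/DAVIS_480_880/JPEGImages/480p/"$seq" ]  || echo "missing frames: $seq"
  [ -d /data/DAVIS_480_880/Annotations/480p/"$seq" ] || echo "missing annotations: $seq"
done < /data/DAVIS_480_880/ImageSets/2017/val.txt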
  • Step 2: Video object segmentation
python eval_video_segmentation_davis.py \
    --finetune [LOG_DIR]/checkpoint-199.pth \
    --output_dir [LOG_DIR]/davis_seg \
    --data_path [DATA_ROOT]/DAVIS_480_880 \
    --topk 7 --size_mask_neighborhood 30 --n_last_frames 30 \
    --model vit_small
  • Step 3: Evaluate the obtained segmentations
git clone https://github.com/davisvideochallenge/davis2017-evaluation
python ./davis2017-evaluation/evaluation_method.py \
    --task semi-supervised \
    --results_path [LOG_DIR]/davis_seg \
    --davis_path [DATA_ROOT]/DAVIS_480_880

2. JHMDB pose tracking

TBD

3. VIP video part segmentation

TBD

4.2. Vision-based Robot Learning

1. CortexBench

We will update the evaluation code at https://github.com/huiwon-jang/RSP/tree/eval_cortexbench.

2. RLBench

TBD

3. Franka Kitchen

TBD
