Visual Representation Learning with Stochastic Frame Prediction

Huiwon Jang¹ · Dongyoung Kim¹ · Junsu Kim¹
Jinwoo Shin¹ · Pieter Abbeel² · Younggyo Seo¹,³
¹KAIST   ²UC Berkeley   ³Dyson Robot Learning Lab

1. Environment setup

  • We note that torch versions >2.0 may work, but installing the versions below via conda is recommended.
conda create -n rsp python=3.9.12 -y
conda activate rsp
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
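  • Optionally, you can sanity-check that the install picked up CUDA (this simply prints the installed versions and GPU availability):
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"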

2. Dataset

Dataset download

sh data_preprocessing/download.sh
sh data_preprocessing/extract.sh
  • We assume the data root directory is $DATA_ROOT = /data/kinetics400.
  • If you want to use a different root directory, change the root_dl variable in download.sh and extract.sh (see the example below).
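  • For example, pointing the scripts at a different storage location might look like the following (where exactly root_dl is set inside each script may differ, so treat this as a sketch):
# in data_preprocessing/download.sh and data_preprocessing/extract.sh
root_dl=/my/storage/kinetics400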

Dataset pre-processing

  • We resize the videos to 256x256 for efficient loading during training.
python data_preprocessing/make_256scale.py --datadir $DATA_ROOT
  • We additionally provide code to filter out a few broken videos.
python data_preprocessing/make_labels.py --datadir $DATA_ROOT --filedir train2

Kinetics-400

/data/kinetics400
|-- train2
    |-- abseiling
        |-- xx.mp4
        |-- ...
    |-- air_drumming
        |-- xx.mp4
        |-- ...
    |-- ...
|-- labels
    |-- label_full_1.0.pickle
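
A quick way to check that the result matches the layout above (paths assume the default $DATA_ROOT; exact counts depend on how many clips downloaded successfully):
ls /data/kinetics400/train2 | wc -l                  # roughly 400 action-class folders
find /data/kinetics400/train2 -name '*.mp4' | wc -l  # total number of training clips
ls /data/kinetics400/labels                          # should list label_full_1.0.pickle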

3. Pre-training RSP on Kinetics-400

  • Note that [N_NODE] x [BATCH_SIZE_PER_GPU] x [ACCUM_ITER] should equal 1536 to reproduce our results (see the example after the command below).
  • Default: [DATA_PATH]=/data/kinetics400
python -m torch.distributed.launch --nproc_per_node=[N_NODE] main_pretrain_rsp.py \
    --batch_size [BATCH_SIZE_PER_GPU] \
    --accum_iter [ACCUM_ITER] \
    --model rsp_vit_small_patch16 \
    --epochs 400 \
    --warmup_epochs 40 \
    --data_path [DATA_PATH] \
    --log_dir [LOG_DIR] \
    --output_dir [LOG_DIR] \
    --norm_pix_loss \
    --repeated_sampling 2
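
For example, one combination that satisfies the 1536 constraint on a single node with 8 GPUs is 8 x 96 x 2 = 1536 (an illustrative setting, not necessarily the exact configuration used in the paper):
python -m torch.distributed.launch --nproc_per_node=8 main_pretrain_rsp.py \
    --batch_size 96 \
    --accum_iter 2 \
    --model rsp_vit_small_patch16 \
    --epochs 400 \
    --warmup_epochs 40 \
    --data_path /data/kinetics400 \
    --log_dir [LOG_DIR] \
    --output_dir [LOG_DIR] \
    --norm_pix_loss \
    --repeated_sampling 2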

4. Evaluation

We provide the pre-trained checkpoints below:

  • ViT-S/16 400 epochs: [link]
  • ViT-B/16 400 epochs: [link]

4.1. Video Label Propagation

The evaluation code is mainly built upon DINO.

1. DAVIS 2017 video object segmentation

  • Step 1: Dataset preparation

We note that the default root path is [DATA_ROOT]=/data. Additionally, we resize the DAVIS frames from 480x(variable width) to 480x880 so that the resolution divides evenly into patches for evaluation.

sh data_preprocessing/eval/davis_download.sh
python data_preprocessing/eval/davis_preprocessing.py --data_root [DATA_ROOT]
[DATA_ROOT]/DAVIS_480_880
|-- Annotations/480p
    |-- bear
        |-- 00000.png
        |-- ...
    |-- ...
|-- ImageSets/2017/val.txt
|-- JPEGImages/480p
    |-- bear
        |-- 00000.jpg
        |-- ...
    |-- ...
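
A quick consistency check that every validation sequence listed in val.txt has both frames and annotations (assuming [DATA_ROOT]=/data; adjust the paths otherwise):
while read seq; do
  [ -d /data/DAVIS_480_880/JPEGImages/480p/"$seq" ]  || echo "missing frames: $seq"
  [ -d /data/DAVIS_480_880/Annotations/480p/"$seq" ] || echo "missing annotations: $seq"
done < /data/DAVIS_480_880/ImageSets/2017/val.txt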
  • Step 2: Video object segmentation
python eval_video_segmentation_davis.py \
    --finetune [LOG_DIR]/checkpoint-199.pth \
    --output_dir [LOG_DIR]/davis_seg \
    --data_path [DATA_ROOT]/DAVIS_480_880 \
    --topk 7 --size_mask_neighborhood 30 --n_last_frames 30 \
    --model vit_small
  • Step 3: Evaluate the obtained segmentations
git clone https://github.com/davisvideochallenge/davis2017-evaluation
python ./davis2017-evaluation/evaluation_method.py \
    --task semi-supervised \
    --results_path [LOG_DIR]/davis_seg \
    --davis_path [DATA_ROOT]/DAVIS_480_880

2. JHMDB pose tracking

TBD

3. VIP video part segmentation

TBD

4.2. Vision-based Robot Learning

1. CortexBench

We will update the evaluation code at https://github.com/huiwon-jang/RSP/tree/eval_cortexbench.

2. RLBench

TBD

3. Franka Kitchen

TBD
