🎬 You Don't Need Strong Assumptions: Temporal Difference Visual Representation Learning

Ninad Daithankar*, Alexi Gladstone*, Yann LeCun, Heng Ji (*Equal Contribution)
University of Illinois Urbana-Champaign · New York University

TDV is a self-supervised video representation learning method built on a single causal assumption: the past causes the future. A frame encoder and a motion encoder are trained jointly so that the current frame's representation plus the encoded motion equals the next frame's representation, with no augmentations, masking, or cropping required. TDV matches the DINO and iBOT baselines on segmentation tasks and surpasses them on optical flow and stereo depth.
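The objective described above can be written down in a few lines. This is only a minimal sketch, not the code in model/cv/tdv/tdv.py: the function name `td_loss`, the choice of mean squared error, and the plain Python lists standing in for embeddings are all assumptions made for illustration.

```python
def td_loss(z_t, motion, z_next):
    """Mean squared residual of z_t + motion = z_next (assumed loss form)."""
    if not (len(z_t) == len(motion) == len(z_next)):
        raise ValueError("all three vectors must have the same length")
    residuals = [(a + m - b) ** 2 for a, m, b in zip(z_t, motion, z_next)]
    return sum(residuals) / len(residuals)


# Hypothetical 3-dimensional embeddings for two consecutive frames.
z_t = [0.5, -1.0, 2.0]
z_next = [1.0, -1.5, 2.0]

# A motion code that exactly explains the change gives zero loss.
perfect_motion = [b - a for a, b in zip(z_t, z_next)]
print(td_loss(z_t, perfect_motion, z_next))  # 0.0

# Predicting "no motion" leaves the full frame difference as residual.
print(td_loss(z_t, [0.0, 0.0, 0.0], z_next))
```

In the real model both `z_t` and `motion` would come from learned encoders; here they are fixed numbers so the arithmetic is easy to follow.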

Setup

Create a conda environment and install dependencies:

# Create the environment
conda create -n tdv python=3.11
conda activate tdv

# PyTorch (CUDA 12.6) — works on both x86_64 and aarch64/GH200
pip install torch==2.10.0+cu126 torchaudio torchvision --index-url https://download.pytorch.org/whl/cu126 

# aiohttp==3.8.1 can't compile from source on Python 3.11; conda provides a compatible binary
conda install -c conda-forge aiohttp libiconv ffmpeg -y  

# --no-deps installs only the packages listed in requirements.txt and skips their dependencies, which avoids version conflicts during installation
pip install -r requirements.txt --no-deps

Set environment variables:

export HF_HOME=/path/to/cache
export HF_TOKEN=your_token_here  # may not be needed
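A quick preflight check can catch a missing variable before a long job starts. The helper below is a hypothetical convenience, not part of this repo. It only checks HF_HOME, since HF_TOKEN may not be needed.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# HF_TOKEN is left out because it may not be needed.
missing = missing_env_vars(["HF_HOME"])
if missing:
    print("Set these before training:", ", ".join(missing))
else:
    print("HF_HOME =", os.environ["HF_HOME"])
```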

Log in to wandb:

wandb login

WandB Tip: in the wandb UI, switch the x-axis from the default Step to trainer/global_step to see correct plots — the default step counter does not account for gradient accumulation.
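To see why the two x-axes diverge, it helps to count steps by hand. The sketch below assumes the usual meaning of gradient accumulation (one optimizer update every k batches). The function name and the numbers are illustrative, not values from this repo.

```python
def optimizer_steps(num_batches, accumulate_grad_batches):
    """Number of optimizer updates after processing `num_batches` batches."""
    if accumulate_grad_batches < 1:
        raise ValueError("accumulate_grad_batches must be >= 1")
    return num_batches // accumulate_grad_batches


# With 4-way accumulation, 1000 batches produce only 250 weight updates,
# so a counter that ignores accumulation runs 4x ahead of the update count.
print(optimizer_steps(1000, 4))  # 250
print(optimizer_steps(1000, 1))  # 1000
```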

For dataset setup see data/cv/README.md.

Running the Code

Training

Running locally

In job_scripts/pretrain_tdv.slurm, set:

Field	Location	What to set
`RUN_NAME`	`export RUN_NAME=`	name for this run (used in logs and wandb)
`--aggr_dataset_dirs`	`ARGS` block	path(s) to your dataset(s)
`--wandb_project`	`ARGS` block	your wandb project name
`--gpus`	`ARGS` block	e.g. `"[0]"` for a single GPU, `"[0, 1]"` for two (default is -1, which uses all GPUs)
`--knn_eval_data_dir`	`ARGS` block	path to your ImageNet folder if downloaded elsewhere

Then run:

bash job_scripts/pretrain_tdv.slurm

For debugging, also add --debug_mode to the ARGS block — it disables wandb, enables anomaly detection, and limits batches. Or add --no_wandb to skip wandb without the other debug constraints.
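The --gpus values listed in the table are strings that look like Python literals. One way such a value could be interpreted is sketched below. `parse_gpus` is a hypothetical helper, and how hparams/args.py actually parses the flag may differ.

```python
import ast


def parse_gpus(value):
    """Turn a --gpus string into -1 (all GPUs) or a list of device indices."""
    parsed = ast.literal_eval(value)  # only evaluates Python literals
    if parsed == -1:
        return -1
    if isinstance(parsed, list) and parsed and all(
        isinstance(i, int) and not isinstance(i, bool) and i >= 0 for i in parsed
    ):
        return parsed
    raise ValueError(f"unsupported --gpus value: {value!r}")


print(parse_gpus("[0]"))     # [0]
print(parse_gpus("[0, 1]"))  # [0, 1]
print(parse_gpus("-1"))      # -1
```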

Running on a cluster (SLURM)

1. Set run-specific fields in job_scripts/pretrain_tdv.slurm — same as above (RUN_NAME, --aggr_dataset_dirs, --wandb_project, --knn_eval_data_dir).

Then, update the #SBATCH header lines at the top of the file:

Field	What to set
`#SBATCH --job-name`	same as `RUN_NAME`
`#SBATCH --output`	log path, e.g. `logs/slurm/YOUR_RUN_NAME`
`#SBATCH --nodes`	number of nodes
`#SBATCH --gpus-per-node`	GPUs per node
`#SBATCH --time`	wall time limit

2. Set cluster-specific settings in your cluster's header file under job_scripts/slurm_headers/ (copy example_gh200.slurm to <cluster_name>.slurm):

Field	What to set
`#SBATCH --account`	your cluster account/allocation
`#SBATCH --partition`	partition name for your cluster

3. Submit:

Make sure the tdv conda environment is activated before submitting — SLURM inherits it from the shell that calls sbatch.

conda activate tdv
bash slurm_executor.sh <cluster_name> job_scripts/pretrain_tdv.slurm

slurm_executor.sh prepends job_scripts/slurm_headers/<cluster_name>.slurm to the job script and submits via sbatch. To add a new cluster, just add a new header file. Every submission is logged to logs/job_scripts/executed_slurm_script_contents.log.
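The prepend step is simple enough to sketch. The Python below only illustrates the idea of joining a cluster header to a job script. It is not a port of slurm_executor.sh, it does not call sbatch, and the file names and directive values in the demo are placeholders.

```python
from pathlib import Path
import tempfile


def build_submission(header_path, job_path):
    """Return the text of the cluster header followed by the job script."""
    header = Path(header_path).read_text()
    job = Path(job_path).read_text()
    if not header.endswith("\n"):
        header += "\n"  # keep the last #SBATCH line separate from the job script
    return header + job


# Demo with throwaway files; the directive values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    header_file = Path(tmp) / "example_cluster.slurm"
    job_file = Path(tmp) / "pretrain_tdv.slurm"
    header_file.write_text(
        "#!/bin/bash\n#SBATCH --account=my_account\n#SBATCH --partition=my_partition"
    )
    job_file.write_text("#SBATCH --job-name=my_run\nexport RUN_NAME=my_run\n")
    combined = build_submission(header_file, job_file)

print(combined)
```

Keeping the cluster-specific directives in their own file means the job script itself never changes when you move to a different cluster.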

SLURM note: you can instead add a SLURM header directly to the existing bash scripts and submit them with sbatch, which is the more standard approach, but slurm_executor.sh makes switching between clusters much easier.

For a full list of training arguments see hparams/args.py.

Evaluation

Linear Probe (SSv2 / ImageNet)

Trains a linear classifier on top of the frozen TDV frame encoder. Frames are pooled via CLS token and optionally concatenated across a short clip. Set CHECKPOINT_PATH, DATASET_DIR, and the dataset-specific variables at the top of job_scripts/eval/linear_probe.slurm — the script covers both SSv2 and ImageNet.
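The pooling step can be illustrated with plain lists. This is a sketch under assumptions: the CLS token is taken to sit at index 0 of each frame's token sequence, and `clip_feature` is a hypothetical name rather than a function from eval/probes/.

```python
def clip_feature(frame_tokens):
    """Concatenate the CLS token of every frame into one probe input.

    frame_tokens: list of frames, each a list of token vectors, with the
    CLS token assumed to sit at index 0.
    """
    feature = []
    for tokens in frame_tokens:
        feature.extend(tokens[0])  # keep CLS, drop the patch tokens
    return feature


# Two hypothetical frames, each with a CLS token and two patch tokens (dim 3).
frame_a = [[1.0, 2.0, 3.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]
frame_b = [[4.0, 5.0, 6.0], [0.3, 0.3, 0.3], [0.4, 0.4, 0.4]]

print(clip_feature([frame_a]))           # [1.0, 2.0, 3.0]
print(clip_feature([frame_a, frame_b]))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

With concatenation, the input dimension of the linear classifier grows with the number of frames in the clip.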

Optical Flow (Sintel)

Uses a Midway decoder (IterativeLatentMotion + DPT head) trained on top of the frozen TDV encoder. The flow eval code lives in eval/flow/ and is adapted from the CroCo v2 stereoflow codebase. See eval/flow/README.md for full setup and dataset download instructions. Set REPO_ROOT, CHECKPOINT, and the three dataset_dirs paths in job_scripts/eval/flow_eval.slurm.

Stereo Depth (SceneFlow)

Uses the same Midway decoder as optical flow, but in stereo mode (stereo subcommand). Set REPO_ROOT, CHECKPOINT, and the SceneFlow dataset path in job_scripts/eval/stereo_depth_eval.slurm.

Semantic Segmentation (ADE20K / Cityscapes)

Uses a UPerNet decode head on top of the frozen TDV encoder, implemented via the mmsegmentation fork in repos/mmsegmentation-tdv. The TDV frame encoder is exposed as a drop-in mmseg backbone (TDVBackbone) that extracts multi-scale intermediate features. Set tdv_repo_path and data_root in the config before running (see the repo's README for installation):

cd repos/mmsegmentation-tdv
python tools/train.py configs/tdv/tdv-base_upernet_160k_ade20k-512x512.py \
  --cfg-options model.backbone.checkpoint_path=/path/to/checkpoint.ckpt \
                model.backbone.tdv_repo_path=/path/to/tdv-clean
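Each `--cfg-options` entry in the command above is a dotted key that addresses a nested field of the config. The sketch below shows the idea with a plain dictionary. It is not mmsegmentation's implementation, and the starting config is hypothetical. Only the two key names come from the command.

```python
def apply_override(config, dotted_key, value):
    """Set config["a"]["b"]["c"] = value for dotted_key "a.b.c", in place."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # create missing levels on the way down
    node[leaf] = value
    return config


# Hypothetical starting config; only the key names come from the command above.
config = {"model": {"backbone": {"type": "TDVBackbone", "checkpoint_path": None}}}
apply_override(config, "model.backbone.checkpoint_path", "/path/to/checkpoint.ckpt")
apply_override(config, "model.backbone.tdv_repo_path", "/path/to/tdv-clean")
print(config["model"]["backbone"])
```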

General Code Flow

The job script passes all hyperparameters to train_model.py, which handles seeding, distributed training setup (DDP), wandb initialization, and callback registration (KNN eval, linear probes, DeepSORT tracking), and then calls trainer.fit().

train_model.py instantiates ModelTrainer from base_model_trainer.py — the PyTorch Lightning module responsible for the training loop, validation, optimizer/scheduler setup, gradient logging, and EMA updates.
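For readers unfamiliar with EMA updates, the standard rule is shown below on a toy weight vector. This is the textbook formula, not the code in base_model_trainer.py. Which weights the trainer averages and what decay it uses are not stated here, so the function name and the decay values are assumptions.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return decay * ema + (1 - decay) * current, element by element."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: the average moves a tenth of the way toward the new weights.
ema = [0.0, 1.0]
current = [1.0, 1.0]
ema = ema_update(ema, current, decay=0.9)
print(ema)
```

A decay close to 1 makes the averaged weights change slowly, which smooths out step-to-step noise in the trained weights.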

Model Architecture: model/cv/tdv/tdv.py

The core model is TDV, defined in model/cv/tdv/tdv.py. It implements the temporal difference objective end-to-end and is self-contained — if you only need the architecture and loss, this is the only file you need.

All hyperparameters live in hparams/args.py. Generally you should not need to touch train_model.py or base_model_trainer.py — new experiments come from adding model variants in model/cv/ and new datasets in data/cv/.

Repo Structure

tdv/
├── train_model.py                          # entry point — parses hparams, sets up DDP, calls trainer
├── base_model_trainer.py                   # PyTorch Lightning module — training loop, optimizer, logging
├── hparams/args.py                         # all hyperparameters
├── model/
│   └── cv/
│       ├── tdv/tdv.py                      # TDV — main model
│       └── dinov2/                         # DINOv2 ViT encoder
├── data/cv/                                # dataloaders (SSv2, Ego4D, FineVideo, Kinetics, ...)
├── eval/
│   ├── flow/                               # optical flow eval (Sintel) — Midway decoder
│   ├── probes/                             # linear probe eval
│   ├── knn/                                # KNN eval
│   └── tracking/                           # DeepSORT tracking eval
├── repos/
│   ├── dino-with-online-knn/               # DINO training with online KNN monitoring
│   ├── ibot-with-online-knn/               # iBOT training with online KNN monitoring
│   └── mmsegmentation-tdv/                 # mmseg fork with TDVBackbone for segmentation eval
├── job_scripts/
│   ├── pretrain_tdv.slurm                  # pretraining launch script
│   ├── slurm_headers/                      # cluster-specific SBATCH headers
│   └── eval/
│       ├── linear_probe.slurm              # linear probe eval (SSv2 / ImageNet)
│       ├── flow_eval.slurm                 # optical flow eval (Sintel / FlyingThings / Chairs)
│       └── stereo_depth_eval.slurm         # stereo depth eval (SceneFlow)
├── requirements.txt
└── slurm_executor.sh

Citation

If you find this repo useful, please consider giving it a star ⭐ and a citation 🙃. If you have any questions, feel free to post them on GitHub Issues or Hugging Face Daily Papers, or email me (ninaddaithankar@gmail.com).

@misc{daithankar2026dontneedstrongassumptions,
      title={You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences},
      author={Ninad Daithankar and Alexi Gladstone and Yann LeCun and Heng Ji},
      year={2026},
      eprint={2606.15956},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.15956},
}
