
Finetuning issue with V-JEPA 2 #42251

@MattLiutt

Description


System Info

OS: Linux (GCP)
Python: 3.12
torch: 2.6.0+cuda12.4
CUDA: 12.4
GPU: A100 40GB
transformers: tested on main and v4.57.1

Who can help?

@SunMarc @Cyrilvallez @ArthurZucker @McPatate Might be related to VideoProcessor?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm attempting to fine-tune a video classification model from the facebook/vjepa2-vitl-fpc16-256-ssv2 checkpoint, following the official notebook with a nearly identical script (gist here), on a GCP A100 40GB instance.

My dataset loader uses torchcodec for video decoding, and batch collation follows the gist above: vids is the video batch tensor with dimensions [batch_size, num_frames, channels, height, width]. The training and preprocessing steps closely follow the official notebook, but feeding this tensor to VJEPA2VideoProcessor raises an error.
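
For reference, a minimal sketch of the collation and processor call (the random clips and sizes below are stand-ins for my torchcodec output, not the exact gist code):

    import torch
    from transformers import AutoVideoProcessor

    processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2")

    # Collation stacks per-sample 4D clips (frames, channels, height, width)
    # into a single 5D batch tensor.
    clips = [torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8) for _ in range(4)]
    vids = torch.stack(clips)  # shape: (4, 16, 3, 256, 256)

    inputs = processor(vids, return_tensors="pt")  # fails as shown below

Calling the processor this way produces: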

Traceback (most recent call last):
  File "/home/vjepa2/vjepa2.py", line 436, in <module>
    main()
  File "/home/vjepa2/vjepa2.py", line 399, in main
    training_history = run_training_loop(config, model, processor, train_loader, val_loader, device)
  File "/home/vjepa2/vjepa2.py", line 231, in run_training_loop
    inputs = processor(vids, return_tensors="pt").to(device)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 209, in __call__
    return self.preprocess(videos, **kwargs)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 391, in preprocess
    videos = self._prepare_input_videos(videos=videos, input_data_format=input_data_format, device=device)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 347, in _prepare_input_videos
    input_data_format = infer_channel_dimension_format(video)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/image_utils.py", line 312, in infer_channel_dimension_format
    raise ValueError(f"Unsupported number of image dimensions: {image.ndim}")

ValueError: Unsupported number of image dimensions: 6

Expected behavior

The processor should accept a batched 5D video tensor as in the notebook, or, if that input shape is unsupported, the documentation and error message should clarify the accepted input shapes for video batches.

Additional Notes

I have verified vids.shape before the call and it's typically (batch, frames, ch, h, w), i.e., five dimensions.

The error indicates that a six-dimensional tensor is being passed, but my code (and the notebook) only ever creates 5D tensors for video batches.

Am I perhaps missing a step regarding batch collation or stacking, or should the processor support this format out of the box?
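
One thing I have considered trying (untested, and I'm not sure it's the intended input format) is handing the processor a list of 4D clips rather than one stacked 5D tensor:

    # Untested idea: unbind the 5D batch into a list of 4D clips before preprocessing,
    # in case the processor expects per-video 4D arrays or a list of them.
    inputs = processor(list(vids.unbind(0)), return_tensors="pt").to(device)

If that is indeed the expected format, a note in the VJEPA2VideoProcessor docs would help.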
