
Finetuning issue with V-JEPA 2 #42251

@MattLiutt

Description


System Info

OS: Linux (GCP)
Python: 3.12
torch: 2.6.0+cuda12.4
CUDA: 12.4
GPU: A100 40GB
transformers: tested on main and v4.57.1

Who can help?

@SunMarc @Cyrilvallez @ArthurZucker @McPatate Might be related to VideoProcessor?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm attempting to fine-tune a video classification model from the facebook/vjepa2-vitl-fpc16-256-ssv2 checkpoint, following the official notebook with a nearly identical script (gist here), on a GCP A100 40GB instance.

My dataset loader uses torchcodec for video decoding, and batch collation follows the gist above: vids is the video batch tensor with dimensions [batch_size, num_frames, channels, height, width]. The training and preprocessing steps closely follow the official notebook, but feeding this tensor to VJEPA2VideoProcessor raises an error.
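
For reference, a minimal sketch of the collation and processor call (the random clips and sizes below are stand-ins for my torchcodec output, not the exact gist code):

    import torch
    from transformers import AutoVideoProcessor

    processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2")

    # Collation stacks per-sample 4D clips (frames, channels, height, width)
    # into a single 5D batch tensor.
    clips = [torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8) for _ in range(4)]
    vids = torch.stack(clips)  # shape: (4, 16, 3, 256, 256)

    inputs = processor(vids, return_tensors="pt")  # fails as shown below

Calling the processor this way produces: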

Traceback (most recent call last):
  File "/home/vjepa2/vjepa2.py", line 436, in <module>
    main()
  File "/home/vjepa2/vjepa2.py", line 399, in main
    training_history = run_training_loop(config, model, processor, train_loader, val_loader, device)
  File "/home/vjepa2/vjepa2.py", line 231, in run_training_loop
    inputs = processor(vids, return_tensors="pt").to(device)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 209, in __call__
    return self.preprocess(videos, **kwargs)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 391, in preprocess
    videos = self._prepare_input_videos(videos=videos, input_data_format=input_data_format, device=device)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 347, in _prepare_input_videos
    input_data_format = infer_channel_dimension_format(video)
  File "/home/vj2_env/lib/python3.12/site-packages/transformers/image_utils.py", line 312, in infer_channel_dimension_format
    raise ValueError(f"Unsupported number of image dimensions: {image.ndim}")

ValueError: Unsupported number of image dimensions: 6

Expected behavior

The processor should accept a batched 5D video tensor as in the notebook, or, if that input shape is unsupported, the documentation and error message should clarify the accepted input shapes for video batches.

Additional Notes

I have verified vids.shape before the call and it's typically (batch, frames, ch, h, w), i.e., five dimensions.

The error indicates that a six-dimensional tensor is being passed, but my code (and the notebook) only ever creates 5D tensors for video batches.

Am I perhaps missing a step regarding batch collation or stacking, or should the processor support this format out of the box?
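
One thing I have considered trying (untested, and I'm not sure it's the intended input format) is handing the processor a list of 4D clips rather than one stacked 5D tensor:

    # Untested idea: unbind the 5D batch into a list of 4D clips before preprocessing,
    # in case the processor expects per-video 4D arrays or a list of them.
    inputs = processor(list(vids.unbind(0)), return_tensors="pt").to(device)

If that is indeed the expected format, a note in the VJEPA2VideoProcessor docs would help.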
