Description
System Info
OS: Linux (GCP)
Python: 3.12
torch: 2.6.0+cuda12.4
CUDA: 12.4
GPU: A100 40GB
transformers: tested on main and v4.57.1
Who can help?
@SunMarc @Cyrilvallez @ArthurZucker @McPatate Might be related to VideoProcessor?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I'm attempting to fine-tune a video classification model based on facebook/vjepa2-vitl-fpc16-256-ssv2, following the official notebook and a nearly identical script (gist here), on a GCP A100 40GB instance.
My dataset loader uses torchcodec for video decoding and batch collation, as in the gist above. vids is the video batch tensor with shape [batch_size, num_frames, channels, height, width]. The error occurs when this tensor is passed to VJEPA2VideoProcessor. The training and preprocessing steps closely follow the official notebook. A minimal sketch of the failing call is below; running it produces the traceback that follows.
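Here load_clip is a hypothetical stand-in for my torchcodec-based decoding (the real loader decodes actual files), but the shapes and the processor call match my script:

```python
import torch
from transformers import AutoVideoProcessor

# Hypothetical stand-in for my torchcodec-based decoding: path is ignored
# here, and each clip comes back as a uint8 tensor of shape
# [num_frames, channels, height, width].
def load_clip(path, num_frames=16):
    return torch.randint(0, 256, (num_frames, 3, 256, 256), dtype=torch.uint8)

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc16-256-ssv2")

# Collate as in the gist: stack clips into a single 5D batch tensor
# [batch_size, num_frames, channels, height, width].
vids = torch.stack([load_clip(p) for p in ["a.mp4", "b.mp4"]])
print(vids.ndim, vids.shape)  # 5 torch.Size([2, 16, 3, 256, 256])

inputs = processor(vids, return_tensors="pt")  # raises the ValueError below
```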
Traceback (most recent call last):
File "/home/vjepa2/vjepa2.py", line 436, in <module>
main()
File "/home/vjepa2/vjepa2.py", line 399, in main
training_history = run_training_loop(config, model, processor, train_loader, val_loader, device)
File "/home/vjepa2/vjepa2.py", line 231, in run_training_loop
inputs = processor(vids, return_tensors="pt").to(device)
File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 209, in __call__
return self.preprocess(videos, **kwargs)
File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 391, in preprocess
videos = self._prepare_input_videos(videos=videos, input_data_format=input_data_format, device=device)
File "/home/vj2_env/lib/python3.12/site-packages/transformers/video_processing_utils.py", line 347, in _prepare_input_videos
input_data_format = infer_channel_dimension_format(video)
File "/home/vj2_env/lib/python3.12/site-packages/transformers/image_utils.py", line 312, in infer_channel_dimension_format
raise ValueError(f"Unsupported number of image dimensions: {image.ndim}")
ValueError: Unsupported number of image dimensions: 6
Expected behavior
The processor should handle the batch as the notebook expects, or, if this batched input shape is unsupported, the documentation/code should clarify the accepted input shapes for video batches.
Additional Notes
I have verified vids.shape before the call and it's typically (batch, frames, ch, h, w), i.e., five dimensions.
The error hints at a tensor with six dimensions being passed, but my code (and the notebook) only creates 5D tensors for video batches.
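For concreteness, this is the check I run immediately before the processor call (the batch size shown is illustrative):

```python
# Immediately before the failing call; this assertion passes on every batch.
assert vids.ndim == 5, vids.shape  # e.g. torch.Size([8, 16, 3, 256, 256])
```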
Am I perhaps missing a step regarding batch collation or stacking, or should the processor support this format out of the box?
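One alternative I could try, in case a stacked 5D tensor is not the expected input, is passing a list of per-video 4D tensors instead. A sketch of that collation, continuing from the repro above (I have not confirmed this is the intended format):

```python
# Hand the processor a list of per-video tensors, each of shape
# [num_frames, channels, height, width], instead of one stacked 5D batch.
video_list = list(vids.unbind(dim=0))  # len == batch_size, each element 4D
inputs = processor(video_list, return_tensors="pt").to(device)
```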