<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VideoMAE/Quick_inference_with_VideoMAE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

First, let's install 🤗 Transformers and decord, which we'll use to decode a video.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


In [2]:
!pip install -q decord

## Load video

Let's load a video from the [Kinetics-400](https://www.deepmind.com/open-source/kinetics) dataset. This dataset contains hundreds of thousands of YouTube video clips, each annotated with one of 400 possible action classes.



In [3]:
!wget https://huggingface.co/datasets/nielsr/video-demo/resolve/main/eating_spaghetti.mp4

--2022-08-04 16:06:14--  https://huggingface.co/datasets/nielsr/video-demo/resolve/main/eating_spaghetti.mp4
Resolving huggingface.co (huggingface.co)... 34.231.117.252, 52.2.34.29, 2600:1f18:147f:e850:d57d:d46a:df34:61ee, ...
Connecting to huggingface.co (huggingface.co)|34.231.117.252|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/21/27/2127ba3909eec39f0c04aa658b6aa97c12af51427ff415d000d565c97e36724b/252f63d13748f08acf56765c295506bfdb8bb73b822e93a33a57d73988814a71?response-content-disposition=attachment%3B%20filename%3D%22eating_spaghetti.mp4%22 [following]
--2022-08-04 16:06:14--  https://cdn-lfs.huggingface.co/repos/21/27/2127ba3909eec39f0c04aa658b6aa97c12af51427ff415d000d565c97e36724b/252f63d13748f08acf56765c295506bfdb8bb73b822e93a33a57d73988814a71?response-content-disposition=attachment%3B%20filename%3D%22eating_spaghetti.mp4%22
Resolving cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)... 13.224.167.90, 13.224.1

In [4]:
from ipywidgets import Video

video_path = "eating_spaghetti.mp4" 
Video.from_file(video_path, width=500)


## Prepare video for model

We can prepare the video for the model using `VideoMAEFeatureExtractor`. We first sample 16 frames (out of the 300 in the clip) and pass them to the feature extractor.

The feature extractor applies some basic preprocessing: it resizes, center crops, and normalizes each frame of the video.
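
To make the three preprocessing steps concrete, here is a minimal NumPy sketch of what happens to a single frame. It is not the feature extractor's actual implementation: the nearest-neighbour resize is a simplification, and the mean, standard deviation, and target size are assumed values (ImageNet statistics and 224 pixels). The real values are read from the checkpoint's preprocessing config.

```python
import numpy as np

# Assumed values (not read from the checkpoint): ImageNet channel
# statistics and a 224-pixel target size.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
SIZE = 224

def preprocess_frame(frame):
    """Resize the shorter side to SIZE, center crop to SIZE x SIZE, normalize."""
    h, w, _ = frame.shape
    scale = SIZE / min(h, w)
    new_h = max(SIZE, round(h * scale))
    new_w = max(SIZE, round(w * scale))
    # Nearest-neighbour resize: a simplification of proper interpolation.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = frame[rows][:, cols]
    # Center crop to SIZE x SIZE.
    top, left = (new_h - SIZE) // 2, (new_w - SIZE) // 2
    cropped = resized[top:top + SIZE, left:left + SIZE]
    # Rescale to [0, 1], normalize per channel, and move channels first.
    normalized = (cropped.astype(np.float32) / 255.0 - MEAN) / STD
    return normalized.transpose(2, 0, 1)

# A random stand-in for one decoded 360x640 RGB frame.
frame = np.random.randint(0, 256, size=(360, 640, 3), dtype=np.uint8)
print(preprocess_frame(frame).shape)  # (3, 224, 224)
```

Stacking 16 such frames gives one clip of shape `(16, 3, 224, 224)`; the feature extractor additionally adds a batch dimension.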

In [5]:
from transformers import VideoMAEFeatureExtractor

feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

In [6]:
from decord import VideoReader, cpu
import numpy as np

# video clip consists of 300 frames (10 seconds at 30 FPS)
vr = VideoReader(video_path, num_threads=1, ctx=cpu(0)) 

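def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
  # number of source frames spanned by the sampled clip
  converted_len = int(clip_len * frame_sample_rate)
  # pick a random end frame; requires seg_len > converted_len
  end_idx = np.random.randint(converted_len, seg_len)
  str_idx = end_idx - converted_len
  # spread clip_len indices evenly over [str_idx, end_idx]
  index = np.linspace(str_idx, end_idx, num=clip_len)
  # keep indices inside the window and convert them to integers
  index = np.clip(index, str_idx, end_idx - 1).astype(np.int64)

  return index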

vr.seek(0)
index = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(vr))
buffer = vr.get_batch(index).asnumpy()
buffer.shape

(16, 360, 640, 3)
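
To see which frames end up in the clip, it helps to fix the randomly chosen window. With `clip_len=16` and `frame_sample_rate=4`, the window covers 64 consecutive frames, and for a 300-frame video its end index is drawn from 64 to 299. The sketch below uses a hypothetical helper, `frame_indices_for_window`, which mirrors `sample_frame_indices` but takes the end index as an argument instead of sampling it.

```python
import numpy as np

def frame_indices_for_window(clip_len, frame_sample_rate, end_idx):
    """Deterministic variant of sample_frame_indices with a fixed window end."""
    converted_len = int(clip_len * frame_sample_rate)  # window length in frames
    str_idx = end_idx - converted_len
    index = np.linspace(str_idx, end_idx, num=clip_len)
    return np.clip(index, str_idx, end_idx - 1).astype(np.int64)

# The earliest possible window: frames 0 through 63.
idx = frame_indices_for_window(clip_len=16, frame_sample_rate=4, end_idx=64)
print(len(idx), idx[0], idx[-1])  # 16 0 63
```

The indices are roughly four frames apart, so the 16 sampled frames cover a little over two seconds of a 30 FPS video.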

In [7]:
# create a list of NumPy arrays
video = [buffer[i] for i in range(buffer.shape[0])]

encoding = feature_extractor(video, return_tensors="pt")
print(encoding.pixel_values.shape)

torch.Size([1, 16, 3, 224, 224])
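
The shape reads as `(batch_size, num_frames, num_channels, height, width)`. As a rough sketch of what this means for the Transformer, we can count the tokens the clip is split into. The model summary printed later in this notebook shows a `Conv3d` patch embedding with kernel size and stride `(2, 16, 16)`, so each token covers 2 frames and a 16x16 pixel patch. The helper name `num_tokens` is just for illustration.

```python
def num_tokens(num_frames, height, width, tubelet_size=2, patch_size=16):
    """Count non-overlapping (tubelet_size x patch_size x patch_size) cubes."""
    return (
        (num_frames // tubelet_size)
        * (height // patch_size)
        * (width // patch_size)
    )

# 16 frames of 224x224 pixels -> 8 * 14 * 14 tokens
print(num_tokens(16, 224, 224))  # 1568
```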


## Load model

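Next, let's load the model and move it to the GPU if one is available. Note that this cell loads the `large` fine-tuned checkpoint, whereas the feature extractor above was loaded from the `base` checkpoint.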

In [8]:
from transformers import VideoMAEForVideoClassification
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
model.to(device)

VideoMAEForVideoClassification(
  (videomae): VideoMAEModel(
    (embeddings): VideoMAEEmbeddings(
      (patch_embeddings): VideoMAEPatchEmbeddings(
        (projection): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
    )
    (encoder): VideoMAEEncoder(
      (layer): ModuleList(
        (0): VideoMAELayer(
          (attention): VideoMAEAttention(
            (attention): VideoMAESelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=False)
              (key): Linear(in_features=1024, out_features=1024, bias=False)
              (value): Linear(in_features=1024, out_features=1024, bias=False)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): VideoMAESelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): VideoMAEIntermediate(
            (dense): Li

## Forward pass

In [9]:
pixel_values = encoding.pixel_values.to(device)

# forward pass
with torch.no_grad():
  outputs = model(pixel_values)
  logits = outputs.logits

In [10]:
predicted_class_idx = logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

Predicted class: eating spaghetti
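
Besides the single best class, it can be useful to look at the top few predictions with their probabilities. Below is a minimal NumPy sketch of softmax followed by top-k selection. The helper `top_k`, the demo logits, and the demo labels are all hypothetical stand-ins, so the snippet runs without the model.

```python
import numpy as np

def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(id2label[int(i)], float(probs[i])) for i in order]

# Hypothetical stand-ins for the model's logits and label mapping.
demo_logits = [2.0, 0.5, 4.0, 1.0]
demo_id2label = {0: "class a", 1: "class b", 2: "class c", 3: "class d"}

for label, prob in top_k(demo_logits, demo_id2label):
    print(f"{label}: {prob:.3f}")
```

In this notebook, the same helper could be called with `logits[0].cpu().numpy()` and `model.config.id2label`.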
