# Quick inference with VideoMAE



## 1. Load video

Let's load a video from the [Kinetics-400 dataset](https://www.deepmind.com/open-source/kinetics). The dataset contains about 300,000 YouTube video clips, each labeled with one of 400 possible classes.

In [4]:
!wget https://huggingface.co/datasets/nielsr/video-demo/resolve/main/eating_spaghetti.mp4


In [2]:
from ipywidgets import Video

video_path = "eating_spaghetti.mp4" 
Video.from_file(video_path, width=500)


## 2. Prepare video for model

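We can prepare the video for the model with `VideoMAEFeatureExtractor`. First, we sample 16 frames from the clip (300 frames in this example) and pass them to the feature extractor.

The feature extractor applies basic preprocessing to every frame of the video: resizing, center cropping, and normalization.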
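To make the cropping and normalization steps concrete, here is a minimal NumPy sketch. It is not the library's implementation: the 224x224 crop size and the ImageNet mean/std values are assumptions, the helper names `center_crop` and `normalize` are hypothetical, and the resize step is skipped by starting from frames whose shorter side is already 224.

```python
import numpy as np

# Assumed values: ImageNet mean/std and a 224x224 crop. Check the
# checkpoint's preprocessor config for the settings actually used.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def center_crop(frame, size):
    """Cut a size x size window out of the middle of an (H, W, C) frame."""
    h, w = frame.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return frame[top:top + size, left:left + size]

def normalize(frame):
    """Scale uint8 pixels to [0, 1], then standardize each channel."""
    scaled = frame.astype(np.float32) / 255.0
    return (scaled - MEAN) / STD

# Hypothetical stand-in for 16 decoded frames, already resized so that
# the shorter side is 224 pixels.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(16, 224, 398, 3), dtype=np.uint8)

processed = np.stack([normalize(center_crop(f, 224)) for f in frames])
# Channels first: (frames, H, W, C) -> (frames, C, H, W), then add a batch axis.
pixel_values = processed.transpose(0, 3, 1, 2)[None]
print(pixel_values.shape)
```

The final layout is (batch, frames, channels, height, width), which is the layout the feature extractor reports for its `pixel_values`.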

In [3]:
from mindnlp.transformers import VideoMAEFeatureExtractor

feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")




In [4]:
from decord import VideoReader, cpu
import numpy as np

# video clip consists of 300 frames (10 seconds at 30 FPS)
vr = VideoReader(video_path, num_threads=1, ctx=cpu(0)) 

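def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Pick a random window of clip_len * frame_sample_rate frames and
    # return clip_len evenly spaced indices inside it.
    # Requires seg_len > clip_len * frame_sample_rate.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    index = np.linspace(start_idx, end_idx, num=clip_len)
    index = np.clip(index, start_idx, end_idx - 1).astype(np.int64)
    return index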

vr.seek(0)
index = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(vr))
buffer = vr.get_batch(index).asnumpy()
buffer.shape

(16, 360, 640, 3)
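Because `sample_frame_indices` picks its window at random, the arithmetic is easier to follow with the window position fixed. The sketch below repeats the same index computation with a caller-supplied `end_idx`; the names `window_indices` and `can_sample` are hypothetical helpers introduced only for this illustration. With `clip_len=16` and `frame_sample_rate=4` the window spans 64 frames, so the video needs more than 64 frames for `np.random.randint(converted_len, seg_len)` to have a valid range.

```python
import numpy as np

def window_indices(clip_len, frame_sample_rate, end_idx):
    """Return clip_len evenly spaced frame indices from the window ending at end_idx."""
    converted_len = int(clip_len * frame_sample_rate)  # window width in frames
    start_idx = end_idx - converted_len
    index = np.linspace(start_idx, end_idx, num=clip_len)
    # Clip so the last index stays inside the video (end_idx itself is excluded).
    return np.clip(index, start_idx, end_idx - 1).astype(np.int64)

def can_sample(clip_len, frame_sample_rate, seg_len):
    """True if the video is long enough for the random window to fit."""
    return seg_len > clip_len * frame_sample_rate

# Hypothetical choice: put the window at the very end of a 300-frame clip.
idx = window_indices(clip_len=16, frame_sample_rate=4, end_idx=300)
print(idx)
print(can_sample(16, 4, 300), can_sample(16, 4, 64))
```

The indices start at frame 236 and end at frame 299, spaced roughly four frames apart.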

In [5]:
# create a list of NumPy arrays
video = [buffer[i] for i in range(buffer.shape[0])]

encoding = feature_extractor(video, return_tensors="ms")
print(encoding.pixel_values.shape)

(1, 16, 3, 224, 224)


## 3. Load model

Next, let's load the model from the hub.

In [6]:
from mindnlp.transformers import VideoMAEForVideoClassification

model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")



## 4. Forward pass

In [13]:
pixel_values = encoding.pixel_values

# forward pass
outputs = model(pixel_values)
logits = outputs.logits

In [14]:
predicted_class_idx = logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

Predicted class: eating spaghetti
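
The `argmax` above keeps only the single best class. To see how confident the model is, the logits can be turned into probabilities with a softmax and the top few classes listed. The sketch below uses made-up logits and a four-entry label map purely for illustration; `top_k` is a hypothetical helper, and with the real model one would first convert the logits of the single example to a NumPy array (presumably via something like `logits[0].asnumpy()`) and pass `model.config.id2label`.

```python
import numpy as np

def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one example."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(id2label[int(i)], float(probs[i])) for i in order]

# Made-up logits and labels, not real model output.
toy_logits = np.array([0.5, 4.0, 1.5, -1.0])
toy_id2label = {
    0: "eating burger",
    1: "eating spaghetti",
    2: "eating chips",
    3: "juggling balls",
}

results = top_k(toy_logits, toy_id2label, k=3)
for label, prob in results:
    print(f"{label}: {prob:.3f}")
```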
