<a href="https://colab.research.google.com/github/ishi23/deep-learning-with-pytorch-ja/blob/main/p1ch4/X_video_cockatoo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Video
====

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/repos/deep-learning-with-pytorch-ja/p1ch4/

/content/drive/MyDrive/repos/deep-learning-with-pytorch-ja/p1ch4


In [3]:
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, threshold=50)

When it comes to the shape of tensors, video data can be seen as equivalent to volumetric data, with `depth` replaced by the `time` dimension. The result is again a 5D tensor with shape `N x C x T x H x W`.
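
As a minimal sketch of that layout, the snippet below allocates a small batch of clips and pulls a single frame out of it. The sizes are hypothetical and chosen only to keep the example light; they do not come from the cockatoo video.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, n_channels, n_frames, height, width = 2, 3, 8, 64, 64

# A batch of video clips laid out as N x C x T x H x W.
clips = torch.zeros(batch_size, n_channels, n_frames, height, width)

print(clips.shape)  # torch.Size([2, 3, 8, 64, 64])
print(clips.dim())  # 5

# Indexing one clip and one time step yields a single C x H x W frame.
first_frame = clips[0, :, 0]
print(first_frame.shape)  # torch.Size([3, 64, 64])
```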

There are several video formats, most of them geared towards compression by exploiting redundancies in space and time. Luckily for us, `imageio` reads video data as well. Suppose we'd like to retain 100 consecutive frames of a 512 x 512 RGB video for classifying an action using a convolutional neural network. We first create a reader instance for the video, which lets us get information about the video and iterate over its frames in time.
Let's see what the metadata for the video looks like:

In [4]:
import imageio
# Load the video data
reader = imageio.get_reader('../data/p1ch4/video-cockatoo/cockatoo.mp4')
meta = reader.get_meta_data()
meta

{'duration': 14.0,
 'ffmpeg_version': '3.4.8-0ubuntu0.2 built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)',
 'fps': 20.0,
 'nframes': 280,
 'plugin': 'ffmpeg',
 'size': (1280, 720),
 'source_size': (1280, 720)}
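
The metadata can be sanity-checked with a little arithmetic. The sketch below uses a plain dict, with values copied from the output above, as a stand-in for the result of `reader.get_meta_data()`. It assumes, as the frame shape printed in the next cell suggests, that `size` is reported as `(width, height)`.

```python
# Plain dict standing in for reader.get_meta_data(); values copied from the output above.
meta = {'duration': 14.0, 'fps': 20.0, 'nframes': 280, 'size': (1280, 720)}

# Assumption: 'size' is (width, height), matching the (720, 1280, 3) frames.
width, height = meta['size']

# The frame count should equal duration times frame rate.
expected_frames = round(meta['duration'] * meta['fps'])
print(expected_frames == meta['nframes'])  # True

# Each decoded frame is then an array of shape (height, width, channels).
frame_shape = (height, width, 3)
print(frame_shape)  # (720, 1280, 3)
```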

In [17]:
# Contents of reader: one H x W x C image array per frame
from pprint import pprint
for i, read in enumerate(reader):
    pprint([read.shape, type(read), read])
    if i == 0:
        break

[(720, 1280, 3),
 <class 'imageio.core.util.Array'>,
 Array([[[116, 119, 104],
        [116, 119, 104],
        [116, 119, 104],
        ...,
        [ 33,  29,  28],
        [ 33,  29,  28],
        [ 33,  29,  28]],

       [[116, 119, 104],
        [116, 119, 104],
        [116, 119, 104],
        ...,
        [ 33,  29,  28],
        [ 33,  29,  28],
        [ 33,  29,  28]],

       [[116, 119, 104],
        [116, 119, 104],
        [116, 119, 104],
        ...,
        [ 33,  29,  28],
        [ 33,  29,  28],
        [ 33,  29,  28]],

       ...,

       [[172, 166, 143],
        [175, 170, 147],
        [180, 175, 151],
        ...,
        [ 48,  47,  52],
        [ 48,  47,  52],
        [ 48,  47,  52]],

       [[172, 166, 143],
        [175, 170, 147],
        [180, 175, 151],
        ...,
        [ 48,  47,  52],
        [ 48,  47,  52],
        [ 48,  47,  52]],

       [[172, 166, 143],
        [175, 170, 147],
        [180, 175, 151],
        ...,
        [ 48,  47,  

We now have all the information to size the tensor that will store the video frames:

In [5]:
# Pre-allocate a container for the video tensor
n_channels = 3
n_frames = meta['nframes']
video = torch.empty(n_channels, n_frames, *meta['size'])
# Tensor layout: C x T x W x H, because meta['size'] is (width, height)
video.shape

torch.Size([3, 280, 1280, 720])
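
It is worth estimating how much memory this tensor needs before filling it. The sketch below is plain arithmetic on the shape shown above, assuming the default `float32` dtype that `torch.empty` uses unless told otherwise.

```python
# Shape of the pre-allocated tensor shown above.
n_channels, n_frames, width, height = 3, 280, 1280, 720

# Assumption: default float32 storage, 4 bytes per element.
bytes_per_float32 = 4

n_elements = n_channels * n_frames * width * height
n_bytes = n_elements * bytes_per_float32

print(n_elements)                 # 774144000
print(round(n_bytes / 2**30, 2))  # 2.88 (GiB)
```

At close to 3 GiB for a 14-second clip, keeping only a subset of frames or storing the data as `uint8` (a quarter of the size) may be preferable on memory-constrained machines.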

Now we just iterate over the reader and write the values of all three channels into the proper `i`-th time slice.
This might take a few seconds to finish!

In [21]:
video[:,0].shape

torch.Size([3, 1280, 720])

In [23]:
# Overwrite the pre-allocated tensor with each frame
for i, frame_arr in enumerate(reader):
    frame = torch.from_numpy(frame_arr).float()
    video[:, i] = torch.transpose(frame, 0, 2)  # HWC -> CWH (swaps axes 0 and 2)

In [26]:
video.shape  # C x T x W x H

torch.Size([3, 280, 1280, 720])
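
The frames come out of the reader as `H x W x C`, so `torch.transpose(frame, 0, 2)` produces `C x W x H`. If the conventional `C x H x W` order is wanted, `permute(2, 0, 1)` is one way to get it. The sketch below compares the two on a small synthetic frame; the array and its sizes are stand-ins, not data from the video.

```python
import numpy as np
import torch

# Synthetic stand-in for one decoded frame: H x W x C, uint8 (tiny sizes for speed).
height, width = 4, 6
frame_arr = np.arange(height * width * 3, dtype=np.uint8).reshape(height, width, 3)
frame = torch.from_numpy(frame_arr).float()

cwh = torch.transpose(frame, 0, 2)  # swaps axes 0 and 2: C x W x H
chw = frame.permute(2, 0, 1)        # moves channels first: C x H x W

print(cwh.shape)  # torch.Size([3, 6, 4])
print(chw.shape)  # torch.Size([3, 4, 6])

# The two results differ only by a swap of the last two axes.
print(torch.equal(cwh.transpose(1, 2), chw))  # True
```

With `permute`, the container would need to be allocated with height before width, for example `torch.empty(n_channels, n_frames, height, width)`.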

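In the above, we iterate over individual frames and write each one into the 4D video tensor after moving the channel dimension to the front. Note that `meta['size']` is `(width, height)` and `torch.transpose(frame, 0, 2)` swaps the first and last axes, so the tensor built here is laid out as `C x T x W x H`; swapping its last two axes gives the conventional `C x T x H x W`. We can then obtain a batch by stacking multiple 4D tensors, or by pre-allocating a 5D tensor with a known batch size and filling it iteratively, clip by clip, assuming clips are trimmed to a fixed number of frames.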
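
Both ways of building a batch can be sketched as follows. The clips are small random tensors with hypothetical sizes, standing in for real clips trimmed to a fixed number of frames.

```python
import torch

# Hypothetical small clips, each C x T x H x W with the same number of frames.
n_channels, n_frames, height, width = 3, 4, 8, 8
clips = [torch.rand(n_channels, n_frames, height, width) for _ in range(5)]

# Option 1: stack the 4D clips along a new leading batch dimension.
batch_stacked = torch.stack(clips)

# Option 2: pre-allocate a 5D tensor and fill it clip by clip.
batch_filled = torch.empty(len(clips), n_channels, n_frames, height, width)
for i, clip in enumerate(clips):
    batch_filled[i] = clip

print(batch_stacked.shape)                       # torch.Size([5, 3, 4, 8, 8])
print(torch.equal(batch_stacked, batch_filled))  # True
```

Pre-allocation avoids holding every clip in a Python list first, which can matter when the clips are large.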

Equating video data to volumetric data is not the only way to represent video for training purposes. It is a valid strategy if we deal with video bursts of fixed length. An alternative strategy is to resort to network architectures capable of processing long sequences and exploiting short- and long-term relationships in time, just like for text or audio.
We'll see this kind of architecture when we take on recurrent networks.

This next approach accounts for time along the batch dimension. Hence, we'll build our dataset as a 4D tensor, stacking frame by frame in the batch:


In [24]:
time_video = torch.empty(n_frames, n_channels, *meta['size'])

for i, frame in enumerate(reader):
    frame = torch.from_numpy(frame).float()
    time_video[i] = torch.transpose(frame, 0, 2)

time_video.shape  # T x C x W x H

torch.Size([280, 3, 1280, 720])

In [28]:
(video == time_video.transpose(1,0)).all()

tensor(True)
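
The check above confirms that the two layouts hold the same data and differ only in the order of the first two axes. The sketch below illustrates what that choice means for a convolutional network, using a small random video with hypothetical sizes. The layers `nn.Conv2d` and `nn.Conv3d` are used here purely for illustration; the notebook itself does not define a model.

```python
import torch
import torch.nn as nn

# Hypothetical small video in T x C x H x W layout.
n_frames, n_channels, height, width = 6, 3, 16, 16
time_video = torch.rand(n_frames, n_channels, height, width)

# Swapping the first two axes gives the C x T x H x W layout.
video = time_video.transpose(0, 1)
print(video.shape)  # torch.Size([3, 6, 16, 16])

# A 2D convolution treats T as the batch, so every frame is processed independently.
conv2d = nn.Conv2d(n_channels, 8, kernel_size=3, padding=1)
per_frame = conv2d(time_video)
print(per_frame.shape)  # torch.Size([6, 8, 16, 16])

# A 3D convolution expects N x C x T x H x W and also mixes neighbouring frames.
conv3d = nn.Conv3d(n_channels, 8, kernel_size=3, padding=1)
across_time = conv3d(video.unsqueeze(0))
print(across_time.shape)  # torch.Size([1, 8, 6, 16, 16])
```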