### Datasets
**HMDB-51**

**UCF-101.** [Official Website](https://www.crcv.ucf.edu/data/UCF101.php). UCF-101 contains 101 human action classes, including haircut, playing guitar, billiards, and fencing.

Categories can be divided into five types:
1. Human-Object Interaction
2. Body-Motion Only
3. Human-Human Interaction
4. Playing Musical Instruments
5. Sports

List of action classes and their numerical index: [Download](https://www.crcv.ucf.edu/THUMOS14/Class%20Index.txt)

All clips in UCF-101 come from only about 2.5k distinct videos, so many clips within a class are near-duplicates. For example, the "brushing hair" clips include 7 that were cut from a single video of one person.
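Because several clips share one source video, it can help to group clips by source before splitting into train and test sets. The sketch below assumes the commonly used UCF-101 file naming `v_<ClassName>_g<group>_c<clip>.avi`, where the group id identifies the source video; the file names listed are hypothetical and `group_clips` is an illustrative helper, not part of any dataset toolkit.

```python
import re
from collections import defaultdict

# Assumed naming convention: v_<ClassName>_g<group>_c<clip>.avi
CLIP_PATTERN = re.compile(r"v_(?P<cls>[A-Za-z]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi")

def group_clips(filenames):
    """Map (class name, group id) -> clip file names cut from the same source video."""
    groups = defaultdict(list)
    for name in filenames:
        match = CLIP_PATTERN.fullmatch(name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        groups[(match["cls"], int(match["group"]))].append(name)
    return dict(groups)

# Hypothetical file names, for illustration only
clips = [
    "v_PlayingGuitar_g01_c01.avi",
    "v_PlayingGuitar_g01_c02.avi",
    "v_PlayingGuitar_g02_c01.avi",
    "v_Fencing_g01_c01.avi",
]
groups = group_clips(clips)
for key, members in sorted(groups.items()):
    print(key, len(members))
```

Keeping every clip of a group on the same side of the split avoids testing on clips whose source video was seen during training.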


**Kinetics.** Kinetics has 400 human action classes with more than 400 examples for each class, each from a unique YouTube video.
+ considerable camera motion/shake, illumination variations, shadows, and background clutter

### Methods

+ ConvNets with an LSTM on top
    - Long-term Recurrent Convolutional Networks for Visual Recognition and Description
    - Beyond Short Snippets: Deep Networks for Video Classification
+ Two-stream networks
    - Convolutional Two-Stream Network Fusion for Video Action Recognition
    - Two-Stream Convolutional Networks for Action Recognition in Videos
+ 3D ConvNets
    - Convolutional Learning of Spatio-temporal Features (2010)
    - 3D Convolutional Neural Networks for Human Action Recognition (2012)
    - Learning Spatiotemporal Features with 3D Convolutional Networks (C3D) (2014)

<img src="images/video_architecture.png" width="450"/>

#### Two-Stream networks
Pass a single RGB frame and a stack of 10 externally computed optical flow frames through two replicas of an ImageNet-pretrained ConvNet and average their predictions.
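The late-fusion step can be sketched as follows. This is a minimal illustration, not the original implementation: `make_stream` is a hypothetical tiny ConvNet standing in for the ImageNet-pretrained network, the inputs are random tensors, and the flow stream is assumed to take 20 channels (10 flow frames with a horizontal and a vertical component each).

```python
import torch
import torch.nn as nn

def make_stream(in_channels, num_classes):
    """Tiny stand-in ConvNet (hypothetical); real streams use a pretrained backbone."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # global average pooling -> B, 16, 1, 1
        nn.Flatten(),             # B, 16
        nn.Linear(16, num_classes),
    )

num_classes = 101                               # e.g. UCF-101
rgb_stream = make_stream(3, num_classes)        # one RGB frame: 3 channels
flow_stream = make_stream(2 * 10, num_classes)  # assumption: 10 flow frames x (x, y)

rgb = torch.randn(4, 3, 64, 64)    # B, C, H, W (random placeholder data)
flow = torch.randn(4, 20, 64, 64)  # B, 2 * 10, H, W (random placeholder data)

with torch.no_grad():
    p_rgb = rgb_stream(rgb).softmax(dim=1)
    p_flow = flow_stream(flow).softmax(dim=1)
    p_fused = (p_rgb + p_flow) / 2  # late fusion: average the class probabilities

print(p_fused.shape)
```

Averaging probabilities keeps the two streams independent, so each can be trained and evaluated on its own before fusion.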

#### 2D ConvNet
+ Considers only spatial information; it does not capture the motion information encoded in multiple contiguous frames, which matters for video analysis problems.
+ Video architectures built on 2D ConvNets: ConvNets with LSTMs on top, and two-stream networks with two different types of stream fusion.
+ We have used 2D CNNs before (recall `torch.nn.Conv2d`).

#### C3D
[Learning spatiotemporal features with 3d convolutional networks](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Tran_Learning_Spatiotemporal_Features_ICCV_2015_paper.pdf).


+ 8 convolutional layers
+ 5 pooling layers
+ 2 fully connected layers
+ inputs: 16-frame clips with 112x112-pixel crops
+ In [The Kinetics Human Action Video Dataset](https://arxiv.org/pdf/1705.06950.pdf), the authors add batch normalization after all convolutional and fully connected layers, and use a temporal stride of 2 instead of 1 in the first pooling layer. This reduces the memory footprint and allows bigger batches, which is important for batch normalization.
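The layer counts above can be traced in a small PyTorch sketch. It is an illustration under stated assumptions rather than the reference implementation: the class name `C3DSketch`, the `base` and `fc_dim` parameters, and the pooling settings are my own reading of the C3D layout (channel widths 64 to 512 when `base=64`, a final classification layer after the two fully connected layers). The demo uses a much narrower network so that it runs quickly on a CPU.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3x3 convolution with padding 1 keeps depth, height and width unchanged
    return [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]

class C3DSketch(nn.Module):
    """8 conv, 5 pooling, 2 fully connected layers, plus a classification layer."""

    def __init__(self, num_classes=101, base=64, fc_dim=4096):
        super().__init__()
        c1, c2, c3, c4 = base, 2 * base, 4 * base, 8 * base
        self.features = nn.Sequential(
            *conv_block(3, c1),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool1: spatial only, keeps 16 frames
            *conv_block(c1, c2),
            nn.MaxPool3d(kernel_size=2),          # pool2: 16x56x56 -> 8x28x28
            *conv_block(c2, c3), *conv_block(c3, c3),
            nn.MaxPool3d(kernel_size=2),          # pool3: -> 4x14x14
            *conv_block(c3, c4), *conv_block(c4, c4),
            nn.MaxPool3d(kernel_size=2),          # pool4: -> 2x7x7
            *conv_block(c4, c4), *conv_block(c4, c4),
            nn.MaxPool3d(kernel_size=2, padding=(0, 1, 1)),  # pool5: -> 1x4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(c4 * 1 * 4 * 4, fc_dim), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(fc_dim, num_classes),  # class scores
        )

    def forward(self, x):  # x: B, 3, 16, 112, 112
        return self.classifier(self.features(x))

# Narrow demo model (assumption: base=8, fc_dim=128 only to keep the example cheap)
model = C3DSketch(num_classes=101, base=8, fc_dim=128).eval()
clip = torch.randn(2, 3, 16, 112, 112)  # B, C, D, H, W
with torch.no_grad():
    logits = model(clip)
print(logits.shape)
```

Changing pool1 to `kernel_size=2` would give the temporal stride of 2 described in the last bullet; the flattened feature size would then need to be recomputed.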


```python
# 2D convolution: check a naive loop implementation against torch.nn.Conv2d
import torch

conv = torch.nn.Conv2d(3, 4, 3)  # C_I (in channels), C_O (out channels), K (kernel size)
x = torch.randn(1, 3, 6, 8)  # B, C_I, H, W
y_torch = conv(x)  # B, C_O, H - K + 1, W - K + 1

def my_conv2d(x, conv):
    weight, bias = conv.weight.data, conv.bias.data  # weight: C_O, C_I, K_H, K_W
    k_h, k_w = weight.shape[2:]
    # No padding, stride 1: each spatial dimension shrinks by K - 1
    y = torch.zeros(x.shape[0], weight.shape[0], x.shape[2] - k_h + 1, x.shape[3] - k_w + 1)
    for b in range(y.shape[0]):  # batch size
        for o in range(y.shape[1]):  # output channel
            for h in range(y.shape[2]):  # output height
                for w in range(y.shape[3]):  # output width
                    y[b, o, h, w] = torch.sum(x[b, :, h:h+k_h, w:w+k_w] * weight[o]) + bias[o]
    return y

y_mine = my_conv2d(x, conv)

# torch.allclose requires every element to match (torch.any would pass on a single match)
print('Results are the same.' if torch.allclose(y_torch, y_mine, atol=1e-5) else 'Results are different.')
```

Results are the same.


```python
# 3D convolution: check a naive loop implementation against torch.nn.Conv3d
import torch

conv = torch.nn.Conv3d(3, 4, 3)  # C_I (in channels), C_O (out channels), K (kernel size)
x = torch.randn(1, 3, 10, 6, 8)  # B, C_I, D (depth), H, W
y_torch = conv(x)  # B, C_O, D - K + 1, H - K + 1, W - K + 1

def my_conv3d(x, conv):
    weight, bias = conv.weight.data, conv.bias.data  # weight: C_O, C_I, K_D, K_H, K_W
    k_d, k_h, k_w = weight.shape[2:]
    # No padding, stride 1: depth, height and width each shrink by K - 1
    y = torch.zeros(x.shape[0], weight.shape[0], x.shape[2] - k_d + 1, x.shape[3] - k_h + 1, x.shape[4] - k_w + 1)
    for b in range(y.shape[0]):  # batch size
        for o in range(y.shape[1]):  # output channel
            for d in range(y.shape[2]):  # output depth
                for h in range(y.shape[3]):  # output height
                    for w in range(y.shape[4]):  # output width
                        y[b, o, d, h, w] = torch.sum(x[b, :, d:d+k_d, h:h+k_h, w:w+k_w] * weight[o]) + bias[o]
    return y

y_mine = my_conv3d(x, conv)

# torch.allclose requires every element to match (torch.any would pass on a single match)
print('Results are the same.' if torch.allclose(y_torch, y_mine, atol=1e-5) else 'Results are different.')
```

Results are the same.
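To make the difference between the two layers concrete, the following sketch applies a `Conv2d` frame by frame and a `Conv3d` to the same random clip, then compares output shapes and parameter counts. The tensor sizes mirror the cells above; `n_params` is a small helper introduced here for illustration.

```python
import torch

conv2d = torch.nn.Conv2d(3, 4, 3)
conv3d = torch.nn.Conv3d(3, 4, 3)

def n_params(module):
    """Total number of learnable parameters (weights and biases)."""
    return sum(p.numel() for p in module.parameters())

video = torch.randn(1, 3, 10, 6, 8)  # B, C, D, H, W

# Conv2d sees each frame independently: treat the 10 frames as a batch
frames = video[0].permute(1, 0, 2, 3)  # D, C, H, W
y2d = conv2d(frames)                   # D, C_O, H - 2, W - 2 (no mixing across frames)

# Conv3d also slides along depth, so each output mixes 3 neighbouring frames
y3d = conv3d(video)                    # B, C_O, D - 2, H - 2, W - 2

print(y2d.shape)  # torch.Size([10, 4, 4, 6])
print(y3d.shape)  # torch.Size([1, 4, 8, 4, 6])
print(n_params(conv2d), n_params(conv3d))  # 112 328
```

The 3D kernel has three times as many weights per filter (3x3x3 instead of 3x3), which is the price for modelling motion across frames.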


### Video as Sequences
Design choices:
+ RGB
+ optical flow
+ RGB + optical flow

**3D Convolutional Neural Networks for Human Action Recognition.** Multiple channels as inputs: 
1. gray
2. gradient x
3. gradient y
4. optical flow x
5. optical flow y
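As a rough illustration of the first three channel types, the sketch below derives gray, gradient-x, and gradient-y maps from a random RGB clip. The clip size, the luma weights, and the forward-difference gradient are assumptions chosen for the example, not necessarily what the paper uses. The optical flow channels are left out because they require a separate flow estimator.

```python
import torch

frames = torch.rand(7, 3, 60, 40)  # T, C, H, W: hypothetical random RGB clip

# Gray channel: weighted sum of R, G, B (ITU-R BT.601 luma weights, an assumption)
weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
gray = (frames * weights).sum(dim=1)  # T, H, W

# Gradient channels: forward differences, with zeros at the far border
grad_x = torch.zeros_like(gray)
grad_y = torch.zeros_like(gray)
grad_x[:, :, :-1] = gray[:, :, 1:] - gray[:, :, :-1]  # difference along width
grad_y[:, :-1, :] = gray[:, 1:, :] - gray[:, :-1, :]  # difference along height

# Stack the channel types so a 3D ConvNet can consume them: C, T, H, W
channels = torch.stack([gray, grad_x, grad_y], dim=0)
print(channels.shape)  # torch.Size([3, 7, 60, 40])
```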