# Learning Spatiotemporal Features with 3D Convolutional Networks
- Spatiotemporal Features: 3D Convolution Networks
- The authors describe video clips as having size c x l x h x w
    ```
    where 
        c: number of channels
        l: length in number of frames
        h: height of frame
        w: width of frame
    ```
- Similarly, the authors denote the 3D convolution and pooling kernel size by d x k x k
    ```
    where
        d: kernel temporal depth
        k: kernel spatial size
    ```

In [75]:
import os
import tensorflow as tf
import PIL.Image as Image
import random
import numpy as np
import cv2
import time
import math

In [76]:
NUM_CLASSES = 101           # The UCF-101 dataset has 101 classes
CROP_SIZE = 112             # Images are cropped to (CROP_SIZE, CROP_SIZE)
CHANNELS = 3                # RGB Channels
NUM_FRAMES_PER_CLIP = 16    # Number of frames per video clip

In [78]:
def conv3d(name, l_input, w, b):
    """Convolution layer"""
    # Every convolution layer uses 'SAME' padding (both spatial and
    # temporal) and stride 1, so its output has the same size as its input.
    return tf.nn.bias_add(
        tf.nn.conv3d(l_input, w, strides=[1, 1, 1, 1, 1], 
                     padding='SAME', name=name), 
        b)

def max_pool(name, l_input, k):
    """Pooling layer"""
    # All pooling layers are max pooling with kernel size 2 x 2 x 2 (except
    # for the first layer) and a stride equal to the kernel size, so each one
    # shrinks the output volume by a factor of 8 compared with its input.
    #
    # The first pooling layer has kernel size 1 x 2 x 2 so that the temporal
    # signal is not merged too early, and to fit the clip length of 16
    # frames (we can temporally pool by a factor of 2 at most 4 times
    # before the temporal signal collapses completely).
    return tf.nn.max_pool3d(l_input, ksize=[1, k, 2, 2, 1], 
                            strides=[1, k, 2, 2, 1], padding='SAME', 
                            name=name)

def inference_c3d(_X, _dropout, batch_size, _weights, _biases):
    
    # Convolution Layer
    conv1 = conv3d('conv1', _X, _weights['wc1'], _biases['bc1'])
    conv1 = tf.nn.relu(conv1, name='relu1')
    pool1 = max_pool('pool1', conv1, k=1)
    
    # Convolution Layer
    conv2 = conv3d('conv2', pool1, _weights['wc2'], _biases['bc2'])
    conv2 = tf.nn.relu(conv2, name='relu2')
    pool2 = max_pool('pool2', conv2, k=2)
    
    # Convolution Layer
    conv3 = conv3d('conv3a', pool2, _weights['wc3a'], _biases['bc3a'])
    conv3 = tf.nn.relu(conv3, name='relu3a')
    conv3 = conv3d('conv3b', conv3, _weights['wc3b'], _biases['bc3b'])
    conv3 = tf.nn.relu(conv3, name='relu3b')
    pool3 = max_pool('pool3', conv3, k=2)
    
    # Convolution Layer
    conv4 = conv3d('conv4a', pool3, _weights['wc4a'], _biases['bc4a'])
    conv4 = tf.nn.relu(conv4, name='relu4a')
    conv4 = conv3d('conv4b', conv4, _weights['wc4b'], _biases['bc4b'])
    conv4 = tf.nn.relu(conv4, name='relu4b')
    pool4 = max_pool('pool4', conv4, k=2)
    
    # Convolution Layer
    conv5 = conv3d('conv5a', pool4, _weights['wc5a'], _biases['bc5a'])
    conv5 = tf.nn.relu(conv5, name='relu5a')
    conv5 = conv3d('conv5b', conv5, _weights['wc5b'], _biases['bc5b'])
    conv5 = tf.nn.relu(conv5, name='relu5b')
    pool5 = max_pool('pool5', conv5, k=2)
    
    # Fully connected layer
    pool5 = tf.transpose(pool5, perm=[0, 1, 4, 2, 3])
    # Flatten the pool5 output to fit the first dense layer's input
    dense1 = tf.reshape(pool5, [batch_size, _weights['wd1'].get_shape().as_list()[0]])
    dense1 = tf.matmul(dense1, _weights['wd1']) + _biases['bd1']
    
    dense1 = tf.nn.relu(dense1, name='fc1') # Relu activation
    dense1 = tf.nn.dropout(dense1, _dropout)
    
    dense2 = tf.nn.relu(
        tf.matmul(dense1, _weights['wd2']) + _biases['bd2'], 
        name='fc2') # Relu activation
    dense2 = tf.nn.dropout(dense2, _dropout)
    
    # Output: class prediction
    out = tf.matmul(dense2, _weights['out']) + _biases['out']
    
    return out
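
To check that the flattened `pool5` output matches the first dimension of `_weights['wd1']`, it helps to trace the clip size through the five pooling layers. The sketch below assumes the 16 x 112 x 112 input defined by the constants above and uses the rule that a `'SAME'`-padded pooling op produces `ceil(size / stride)` outputs. The helper names are hypothetical, and the channel count of the last convolution layer is not given in this section, so only (l, h, w) is traced.

```python
import math


def same_pool_out(size, stride):
    """Output length of a 'SAME'-padded pooling op: ceil(size / stride)."""
    return math.ceil(size / stride)


def trace_pool_shapes(frames=16, height=112, width=112,
                      temporal_k=(1, 2, 2, 2, 2)):
    """Return the (l, h, w) size after each of the five pooling layers."""
    shapes = []
    l, h, w = frames, height, width
    for k in temporal_k:
        # max_pool uses ksize = strides = [1, k, 2, 2, 1]
        l = same_pool_out(l, k)
        h = same_pool_out(h, 2)
        w = same_pool_out(w, 2)
        shapes.append((l, h, w))
    return shapes


for i, shape in enumerate(trace_pool_shapes(), start=1):
    print(f"pool{i}: {shape}")
```

The final size is 1 x 4 x 4, so the first dimension of `wd1` should equal 16 times the number of output channels of `conv5b`.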

The two fully connected layers each have 2,048 outputs. The authors train the networks from scratch using mini-batches of 30 clips, with an initial learning rate of 0.003. The learning rate is divided by 10 after every 4 epochs, and training is stopped after 16 epochs.
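
The step schedule described above can be sketched as a small function. This is only an illustration: the function name `learning_rate` and the zero-based epoch indexing are assumptions, not taken from the paper or from this notebook's training code.

```python
def learning_rate(epoch, base_lr=0.003, drop_every=4, factor=10.0):
    """Step schedule: divide base_lr by `factor` after every `drop_every` epochs.

    `epoch` is assumed to be zero-based, so epochs 0-3 use base_lr,
    epochs 4-7 use base_lr / 10, and so on.
    """
    return base_lr / (factor ** (epoch // drop_every))


NUM_EPOCHS = 16  # training stops after 16 epochs

for epoch in range(NUM_EPOCHS):
    print(epoch, learning_rate(epoch))
```

Over 16 epochs the rate takes four distinct values, each held for four epochs.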

For the purposes of this study, the authors are mainly interested in:
### How to aggregate temporal information through the deep networks
To find a good 3D ConvNet architecture, the authors vary the kernel temporal depth d_i of the convolution layers while keeping all other settings fixed.

The authors experiment with the following two types of architecture.
1. Homogeneous temporal depth
    - All convolution layers have the same kernel temporal depth.
    - The authors experiment with 4 networks with kernel temporal depth d equal to 1, 3, 5, and 7.
    - The authors name these networks **depth-d**, where _d_ is their homogeneous temporal depth.
    - Note that _depth-1_ net is equivalent to applying 2D convolutions on separate frames.
2. Varying temporal depth
    - The kernel temporal depth changes across the layers.
    - The authors experiment with two networks whose temporal depths, from the first to the fifth convolution layer, are:
        - increasing: 3-3-5-5-7
        - decreasing: 7-5-5-3-3
    - Note that all of these networks have the same output size at the last pooling layer.
    - Thus, they have the same number of parameters in the fully connected layers.
    - Their parameter counts differ only in the convolution layers, because of the different kernel temporal depths.
    - These differences are quite minute compared to the millions of parameters in the fully connected layers.
    - The learning capacity of the networks is therefore comparable, and the differences in the number of parameters should not affect the results of the architecture search.
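
The six configurations above, and the reason they all reach the same output size, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the spatial kernel size `SPATIAL_K = 3` is an assumption (these notes do not state k), and the names `TEMPORAL_DEPTHS`, `kernel_sizes`, and `same_conv_out` are hypothetical. A stride-1 convolution with odd kernel size d, zero-padded by (d - 1) / 2 on each side, returns an output as long as its input, so the temporal depth never changes the signal size.

```python
SPATIAL_K = 3  # assumed spatial kernel size k; not stated in the notes above

# Kernel temporal depth of conv layers 1 to 5 for each network.
TEMPORAL_DEPTHS = {
    "depth-1": [1, 1, 1, 1, 1],
    "depth-3": [3, 3, 3, 3, 3],
    "depth-5": [5, 5, 5, 5, 5],
    "depth-7": [7, 7, 7, 7, 7],
    "increasing": [3, 3, 5, 5, 7],
    "decreasing": [7, 5, 5, 3, 3],
}


def kernel_sizes(depths, k=SPATIAL_K):
    """Return the d x k x k kernel size of each convolution layer."""
    return [(d, k, k) for d in depths]


def same_conv_out(length, d):
    """Output length of a stride-1 convolution with odd kernel size d,
    zero-padded by (d - 1) // 2 on each side."""
    pad = (d - 1) // 2
    return length + 2 * pad - d + 1


for name, depths in TEMPORAL_DEPTHS.items():
    # 16 is used as an example input length for every layer; the point is
    # that the output length equals the input length whatever d is.
    lengths = [same_conv_out(16, d) for d in depths]
    print(name, kernel_sizes(depths), lengths)
```

Because only the pooling layers shrink the signal, and pooling is identical across the six networks, the input to the first fully connected layer has the same size in each of them.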