
#Understand and Classify Videos
A video is a sequence of images, so it adds a temporal dimension to the image. Combining the spatial features of the frames with the temporal features of the video can give better results than using single images alone. The extra dimension also means much more data, which increases the cost of storage, training, and inference; the computational demands of processing video are very high. Video also changes the architecture of deep learning models, because the temporal features have to be taken into account.

Video classification is the task of labeling a video with a category. A label can apply to a single frame or to the whole video. Because a video may also show actions or tasks being performed, video classification can label either the objects present in the video or the actions happening in it.

#Video Classification Datasets

##UCF101
UCF101 is used for action recognition.

The videos are collected from YouTube and show realistic actions. The dataset has 13,320 videos across 101 action categories; a related dataset, UCF50, has 50 categories. The videos vary widely in background, scale, pose, occlusion, and illumination. The videos of each action category are grouped into 25 groups, and the videos within a group share similar properties such as background, pose, scale, viewpoint, and illumination.

##YouTube-8M
YouTube-8M is used for video classification.

The dataset contains video URLs with labels and visual features.
* Number of video URLs: 7 million
* Hours of video clips: 450,000
* Number of class labels: 4,716
* Average number of labels per video: 3.4

##Other datasets
* **Sports-1M (Sports - 1 Million)**: Has 1,133,158 videos with 487 classes.
The annotations are done automatically. The dataset can be downloaded
from: http://cs.stanford.edu/people/karpathy/deepvideo/.
* **UCF-11 (University of Central Florida - 11 actions)**: Has 1,600 videos
with 11 actions. The videos have 29.97 fps (frames per second). The dataset
can be downloaded along with UCF101.
* **HMDB-51 (Human Motion DataBase - 51 actions)**: Has 5,100 videos
with 51 actions. The dataset link is: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database.
* **Hollywood2**: Has 1,707 videos with 12 actions. The dataset link is: http://www.di.ens.fr/~laptev/actions/hollywood2.

The following snippet *loads a video and splits it into frames for further processing.*

Note: writing every frame as an image may take a lot of disk space.

In [0]:
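import cv2
video_path = 'your_working_directory/your_video'
video_handle = cv2.VideoCapture(video_path)
frame_no = 0

while True:
  # read() returns a success flag and the decoded frame;
  # the flag is False once no more frames can be read
  success, frame = video_handle.read()
  if not success:
    break
  cv2.imwrite('frame{}.jpg'.format(frame_no), frame)
  frame_no += 1

# Free the video file handle
video_handle.release()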

#Approaches for Classifying Videos
Video classification is needed in several applications. Since a video contains a lot of data, the cost of training and inference must also be taken into account. Video classification approaches are largely inspired by image classification algorithms: standard architectures such as VGG and Inception are used to compute features at the frame level, which are then processed further.

The following approaches can be used for video classification:
* Extract the frames and use the models for classification on a frame basis.
* Extract the image features and use them to train an RNN.
* Train a **3D convolution** network on the whole video. 3D convolution is an extension of 2D convolution.
* Use the optical flow of the video to further improve the accuracy. Optical
flow is the pattern of movement of objects.
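
As a minimal sketch of the first approach in the list above, the per-frame predictions of an image classifier can be averaged to get one label for the video. The function name `classify_video` and the example probabilities are hypothetical; averaging is only one simple way to aggregate frames and is assumed here for illustration.

```python
import numpy as np

def classify_video(frame_probabilities):
    """Average per-frame class probabilities and pick the most likely class.

    frame_probabilities: array of shape (num_frames, num_classes), e.g. the
    softmax outputs of an image classifier applied to every frame.
    """
    frame_probabilities = np.asarray(frame_probabilities, dtype=np.float64)
    video_probabilities = frame_probabilities.mean(axis=0)
    return int(np.argmax(video_probabilities)), video_probabilities

# Hypothetical softmax outputs for 4 frames and 3 classes.
example = np.array([[0.7, 0.2, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3],
                    [0.8, 0.1, 0.1]])
label, probs = classify_video(example)
```

Because each frame is classified independently, this baseline ignores the order of the frames, which is the limitation the later approaches try to address.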

##Fusing parallel CNN for video classification
Frame-wise prediction on a video may not yield good results because downsampling the images loses fine details, while using a high-resolution CNN increases the inference time.

Karpathy et al. proposed fusing two streams that run in parallel for video classification. Their approach addresses two problems with frame-wise predictions:
* Predictions may take a long time because of the larger CNN architecture
* Independent predictions lose the information along the temporal dimension

The architecture can be simplified, with fewer parameters, by using two smaller encoders that run in parallel. The video is passed simultaneously through two CNN encoders. One encoder takes the whole frame at low resolution. The other encoder is of the same size but takes only the central crop of the frame at the original, higher resolution. Each encoder has alternating convolution, normalization, and pooling layers, and the final layers of the two encoders are joined through a fully connected layer, as shown in the following diagram:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/parallel_CNN.PNG?raw=true)

Processing the two downsampled streams in parallel makes the runtime faster. The number of parameters in the CNN is roughly halved while the accuracy stays the same. The two streams are called the **fovea** stream (the high-resolution central crop) and the **context** stream (the low-resolution full frame).
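
To make the two inputs concrete, here is a small sketch of how one frame could be split into a context input (the full frame at half resolution) and a fovea input (the central crop at full resolution). The function name `make_streams` and the use of 2x2 average pooling for downsampling are assumptions made for illustration, not details taken from the original method.

```python
import numpy as np

def make_streams(frame):
    """Split one frame into context and fovea inputs of equal size.

    frame: array of shape (height, width, channels) with even height and width.
    """
    height, width, _ = frame.shape
    # Context stream: the whole frame at half resolution (2x2 average pooling).
    context = frame.reshape(height // 2, 2, width // 2, 2, -1).mean(axis=(1, 3))
    # Fovea stream: the central crop at the original resolution.
    top, left = height // 4, width // 4
    fovea = frame[top:top + height // 2, left:left + width // 2]
    return context, fovea

# A tiny synthetic 8x8 RGB frame stands in for a real video frame.
frame = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
context, fovea = make_streams(frame)
```

Both outputs have half the height and width of the original frame, so each encoder sees a quarter of the pixels.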

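After setting up the helper functions, the two streams can be created as shown below. The code uses the TensorFlow 1.x API and, as a stand-in for video frames, feeds the same MNIST image to both streams: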

In [0]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

In [0]:
mnist_data = input_data.read_data_sets('MNIST_data', one_hot=True)

In [0]:
input_size = 784
no_classes = 10
batch_size = 100
total_batches = 300

In [0]:
def add_variable_summary(tf_variable, summary_name):
  with tf.name_scope(summary_name + '_summary'):
    mean = tf.reduce_mean(tf_variable)
    tf.summary.scalar('Mean', mean)
    with tf.name_scope('standard_deviation'):
      standard_deviation = tf.sqrt(tf.reduce_mean(tf.square(tf_variable - mean)))
        
    tf.summary.scalar('StandardDeviation', standard_deviation)
    tf.summary.scalar('Maximum', tf.reduce_max(tf_variable))
    tf.summary.scalar('Minimum', tf.reduce_min(tf_variable))
    tf.summary.histogram('Histogram', tf_variable)

In [0]:
def convolution_layer(input_layer, 
                      filters, 
                      kernel_size=[3, 3],
                      activation=tf.nn.relu):
  layer = tf.layers.conv2d(inputs=input_layer,
                           filters=filters,
                           kernel_size=kernel_size,
                           activation=activation)
  
  add_variable_summary(layer, 'convolution')
  return layer

In [0]:
def pooling_layer(input_layer, pool_size=[2, 2], strides=2):
  layer = tf.layers.max_pooling2d(inputs=input_layer,
                                  pool_size=pool_size,
                                  strides=strides)
  add_variable_summary(layer, 'pooling')
  return layer

In [0]:
def dense_layer(input_layer, units, activation=tf.nn.relu):
  layer = tf.layers.dense(inputs=input_layer,
                          units=units,
                          activation=activation)
  add_variable_summary(layer, 'dense')
  return layer

In [0]:
def get_model(input_):
  input_reshape = tf.reshape(input_,
                             [-1, 28, 28, 1],
                             name='input_reshape')
  convolution_layer_1 = convolution_layer(input_reshape, 64)
  pooling_layer_1 = pooling_layer(convolution_layer_1)
  convolution_layer_2 = convolution_layer(pooling_layer_1, 128)
  pooling_layer_2 = pooling_layer(convolution_layer_2)
  flattened_pool = tf.reshape(pooling_layer_2, 
                              [-1, 5 * 5 * 128],
                              name='flattened_pool')
  return flattened_pool

Now we can build the two CNNs and combine them:

In [0]:
high_resolution_input = tf.placeholder(tf.float32, shape=[None, input_size])
low_resolution_input = tf.placeholder(tf.float32, shape=[None, input_size])
y_input = tf.placeholder(tf.float32, shape=[None, no_classes])

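# Encode each input with its own CNN
high_resolution_cnn = get_model(high_resolution_input)
low_resolution_cnn = get_model(low_resolution_input)
# Concatenate the two feature vectors and classify with dense layers
dense_layer_1 = tf.concat([high_resolution_cnn, low_resolution_cnn], 1)
dense_layer_bottleneck = dense_layer(dense_layer_1, 1024)
# No activation on the logits: the softmax is applied inside the loss
logits = dense_layer(dense_layer_bottleneck, no_classes, activation=None)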

The rest of the code defines the loss function and the optimization process for training and testing:

In [0]:
with tf.name_scope('loss'):
  softmax_cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_input,
                                                                     logits=logits)
  loss_operation = tf.reduce_mean(softmax_cross_entropy, name='loss')
  tf.summary.scalar('loss', loss_operation)

In [0]:
with tf.name_scope('optimiser'):
  optimiser = tf.train.AdamOptimizer().minimize(loss_operation)

In [0]:
with tf.name_scope('accuracy'):
  with tf.name_scope('correct_prediction'):
    predictions = tf.argmax(logits, 1)
    correct_predictions = tf.equal(predictions, tf.argmax(y_input, 1))
  with tf.name_scope('accuracy'):
    accuracy_operation = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
  tf.summary.scalar('accuracy', accuracy_operation)

In [0]:
session = tf.Session()
session.run(tf.global_variables_initializer())

In [0]:
merged_summary_operation = tf.summary.merge_all()
train_summary_writer = tf.summary.FileWriter('./tmp/ch9/train', session.graph)
test_summary_writer = tf.summary.FileWriter('./tmp/ch9/test')

In [0]:
test_images, test_labels = mnist_data.test.images, mnist_data.test.labels

In [0]:
for batch_no in range(total_batches):
  mnist_batch = mnist_data.train.next_batch(batch_size)
  train_images, train_labels = mnist_batch[0], mnist_batch[1]
  _, merged_summary = session.run([optimiser, merged_summary_operation],
                                  feed_dict={high_resolution_input: train_images,
                                             low_resolution_input: train_images,
                                             y_input: train_labels
                                            }
                                 )
  train_summary_writer.add_summary(merged_summary, batch_no)
  if batch_no % 10 == 0:
    merged_summary, _ = session.run([merged_summary_operation, accuracy_operation], 
                                    feed_dict={high_resolution_input: test_images,
                                               low_resolution_input: test_images,
                                               y_input: test_labels
                                              }
                                   )
    test_summary_writer.add_summary(merged_summary, batch_no)

The ways of combining frames across the temporal dimension are shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/parallel_CNN_temporal.PNG?raw=true)

Instead of going through fixed-size clips, the video can be sampled at different times.

Three ways of connecting the temporal information are presented in the preceding image. Late fusion combines frames that are further apart in time, while early fusion looks at a few adjacent frames together. Slow fusion combines both late and early fusion to give good results. The model was trained on the Sports-1M dataset, which has 487 classes, and achieved an accuracy of 50%. The same model, when applied to UCF101, achieves an accuracy of 60%.
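
The difference between early and late fusion can be illustrated by how the input frames are arranged. This is only a sketch under simplifying assumptions: the function names are hypothetical, the clip is random data, and the real networks fuse information inside their layers rather than only at the input.

```python
import numpy as np

def early_fusion_input(clip):
    """Stack all frames of a clip along the channel axis.

    clip: array of shape (frames, height, width, channels).
    Returns an array of shape (height, width, frames * channels).
    """
    frames, height, width, channels = clip.shape
    return clip.transpose(1, 2, 0, 3).reshape(height, width, frames * channels)

def late_fusion_inputs(clip):
    """Return the first and last frames, to be encoded by two separate CNNs."""
    return clip[0], clip[-1]

clip = np.random.rand(10, 32, 32, 3)   # 10 RGB frames of 32x32 pixels
early = early_fusion_input(clip)       # one input that sees all frames at once
first, last = late_fusion_inputs(clip) # two inputs that are far apart in time
```

With early fusion a single network sees every frame from its first layer on, whereas with late fusion the temporal information is only combined after each frame has been encoded separately.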

##Classifying videos over long periods
The fusing method works well only for short videos.

Ng et al. proposed two methods for classifying longer videos:
* The first approach is to pool the convolutional features temporally. Max-pooling is used as the feature aggregation method.
* The second approach uses an LSTM on top of the convolutional features, which handles the variable length of the video.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/ng_method.PNG?raw=true)
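
The first approach, temporal max-pooling, can be sketched in a few lines. The function name `temporal_max_pool` and the tiny feature vectors are hypothetical; in practice the per-frame features would come from a CNN.

```python
import numpy as np

def temporal_max_pool(frame_features):
    """Max-pool per-frame feature vectors over time.

    frame_features: array of shape (num_frames, feature_size).
    Returns one vector of shape (feature_size,) regardless of num_frames.
    """
    return np.max(np.asarray(frame_features), axis=0)

# Two videos of different lengths with 2-dimensional per-frame features.
short_video = np.array([[0.1, 0.9], [0.4, 0.2]])
long_video = np.array([[0.1, 0.9], [0.4, 0.2], [0.8, 0.3], [0.0, 0.5]])
short_descriptor = temporal_max_pool(short_video)
long_descriptor = temporal_max_pool(long_video)
```

Both videos end up with a descriptor of the same size, which is why pooling over time can handle videos of any length.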

The CNN features can be extracted and fed to a small LSTM network:

In [0]:
input_shape = [500, 500]  # (number of frames, feature size per frame)
no_classes = 2

In [0]:
net = tf.keras.models.Sequential()

net.add(tf.keras.layers.LSTM(2048,
                             return_sequences=False,
                             input_shape=input_shape,
                             dropout=0.5))
net.add(tf.keras.layers.Dense(512, activation='relu'))
net.add(tf.keras.layers.Dropout(0.5))
net.add(tf.keras.layers.Dense(no_classes, activation='softmax'))

Using an LSTM in place of feature pooling can provide better performance. The features can be pooled in various ways, as shown in the following image:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/feature_pooling.PNG?raw=true)

The convolutional features can be aggregated in several different ways, for example by pooling directly over the convolutional features or after the fully connected layers. This method achieved an accuracy of 73.1% on the Sports-1M dataset and 88.6% on UCF101.

The LSTM approach is shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/feature_pooling.PNG?raw=true)

The computational cost of this model is high because several LSTMs are used.

##Streaming two CNNs for action recognition
The motion of objects in a video carries useful information about the actions performed in it. This motion can be quantified by optical flow.

Simonyan and Zisserman proposed a method for action recognition that uses two streams, one from images and one from optical flow.

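Optical flow measures motion by quantifying the relative movement between the observer and the scene. OpenCV has built-in functions for it; for example, the sparse Lucas-Kanade method tracks a set of points `p0` from the previous grayscale frame `old_gray` to the current one `frame_gray`:

`p1, st, err = cv2.calcOpticalFlowPyrLK(old_gray, frame_gray, p0, None, **lk_params)`

Here `p1` holds the new point positions, `st` marks which points were tracked successfully, and `lk_params` is a dictionary of options such as the window size.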

One stream takes an individual frame and predicts actions using a regular CNN.
The other stream takes multiple frames and computes the optical flow. The
optical flow is passed through a CNN for a prediction.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/stream_action_recognition.PNG?raw=true)
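
A rough sketch of the data flow in the two-stream idea follows. It assumes that the optical flow fields have already been computed, that consecutive flow fields are stacked as channels for the temporal CNN, and that the two streams are combined by averaging their class scores. The function names and the averaging rule are illustrative assumptions, not a description of the exact published setup.

```python
import numpy as np

def stack_flow(flows):
    """Stack consecutive optical-flow fields into one multi-channel input.

    flows: array of shape (num_flows, height, width, 2) holding the
    horizontal and vertical displacement for each pair of frames.
    Returns an array of shape (height, width, 2 * num_flows).
    """
    num_flows, height, width, _ = flows.shape
    return flows.transpose(1, 2, 0, 3).reshape(height, width, 2 * num_flows)

def fuse_scores(spatial_scores, temporal_scores):
    """Average the class scores of the two streams."""
    return (np.asarray(spatial_scores) + np.asarray(temporal_scores)) / 2.0

flows = np.zeros((10, 24, 24, 2), dtype=np.float32)
flows[..., 0] = 1.0  # pretend every pixel moves one pixel to the right
temporal_input = stack_flow(flows)

# Hypothetical class scores from the image stream and the flow stream.
fused = fuse_scores([0.6, 0.4], [0.2, 0.8])
```

In this toy example the image stream prefers the first class and the flow stream the second, and the averaged scores favour the second class.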

##Using 3D convolution for temporal learning
A video can be classified with 3D convolution. A 3D convolution takes a volume as input and outputs a volume, whereas a 2D convolution takes a 2D image or a volume as input and outputs a 2D image.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/3d_conv.PNG?raw=true)

The first two images show 2D convolution, whose output is always an image. 3D convolution, meanwhile, outputs a volume. The difference is that the kernel slides in three directions instead of two.
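
To see why the output of a 3D convolution is a volume, here is a naive single-channel implementation in NumPy. It is a didactic sketch only (no padding, no strides, no channels), and the function name `conv3d_valid` is hypothetical; real frameworks use much faster implementations.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution (cross-correlation) without padding."""
    d, h, w = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    # The kernel slides along the frame axis as well as height and width.
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(patch * kernel)
    return out

video = np.ones((8, 16, 16))        # (frames, height, width), grayscale
kernel = np.ones((3, 3, 3)) / 27.0  # averaging kernel
output = conv3d_valid(video, kernel)
```

An input of 8 frames of 16x16 pixels with a 3x3x3 kernel gives an output of 6x14x14, which still has a temporal axis.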

An example of a 3D convolutional neural network is shown below:

In [0]:
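import tensorflow as tf
# (height, width, frames, channels). The original value of (227, 227, 200, 3)
# would give the first dense layer below billions of weights, so a smaller
# clip is used here.
input_shape = (112, 112, 16, 3)
no_classes = 2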

In [0]:
net = tf.keras.models.Sequential()

# First block: two 3D convolutions followed by pooling and dropout
net.add(tf.keras.layers.Conv3D(32,
                               kernel_size=(3,3,3),
                               input_shape=input_shape))
net.add(tf.keras.layers.Activation('relu'))
net.add(tf.keras.layers.Conv3D(32, (3, 3, 3)))
# ReLU, not softmax: softmax belongs only on the output layer
net.add(tf.keras.layers.Activation('relu'))
net.add(tf.keras.layers.MaxPooling3D())
net.add(tf.keras.layers.Dropout(0.25))

# Second block with twice as many filters
net.add(tf.keras.layers.Conv3D(64, (3, 3, 3)))
net.add(tf.keras.layers.Activation('relu'))
net.add(tf.keras.layers.Conv3D(64, (3, 3, 3)))
net.add(tf.keras.layers.Activation('relu'))
net.add(tf.keras.layers.MaxPooling3D())
net.add(tf.keras.layers.Dropout(0.25))

net.add(tf.keras.layers.Flatten())
net.add(tf.keras.layers.Dense(512, activation='sigmoid'))
net.add(tf.keras.layers.Dropout(0.5))
net.add(tf.keras.layers.Dense(no_classes, activation='softmax'))
net.compile(loss=tf.keras.losses.categorical_crossentropy,
            optimizer=tf.keras.optimizers.Adam(), 
            metrics=['accuracy'])
net.summary()

3D convolution needs a lot of computing power. 3D convolution
achieves an accuracy of 90.2% on the Sports1M dataset.

##Using trajectory for classification
Wang et al. used the trajectories of body parts to classify the actions performed. This work combines handcrafted and deep-learned features for the final predictions.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/trajectory.PNG?raw=true)

The handcrafted features are **Fisher vectors**, and the learned features come from a CNN. The following image demonstrates the extraction of the trajectories and feature maps:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/trajectory_flowchart.PNG?raw=true)

Both the trajectories and the feature maps are combined temporally to form the final predictions over the temporal snippet.

##Multi-modal fusion
Yang et al. proposed multi-modal fusion with four models for video classification. The four models are 3D convolution features, 2D optical flow, 3D optical flow, and 2D convolution features. The data flowchart is shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/multi-modal.PNG?raw=true)

A **convlet** is the small convolutional output from a single kernel. The learning of spatial weights in the convolution layer by convlets is shown in the following image:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/multi-modal%20feature.PNG?raw=true)

A spatial weight indicates how discriminative or important a local spatial region is in a convolutional layer. The following image illustrates the fusion of multilayer representations, done at various convolutional and fully connected layers:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/multi-modal%20detail.PNG?raw=true)

A boosting mechanism is used to combine the predictions. **Boosting** is a mechanism that can combine several model predictions into a final prediction.

##Attending regions for classification
An attention mechanism can be used for classification. Attention mechanisms replicate the human behaviour of focusing on particular regions when recognizing something. They give more weight to certain regions than to others, and the weighting is learned from the data during training.

There are two types of attention mechanisms:
* **Soft attention**: Deterministic in character, this can hence be learned by
back-propagation.
* **Hard attention**: Stochastic in character, this needs complicated
mechanisms to learn. It is also expensive because of the requirement of
sampling data.
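
Soft attention can be sketched as a softmax over region scores followed by a weighted sum of the region features. The function name `soft_attention` and the toy inputs are hypothetical; in a real model the scores would be produced by a learned layer.

```python
import numpy as np

def soft_attention(region_features, region_scores):
    """Weight region features by a softmax over their scores.

    region_features: array of shape (num_regions, feature_size).
    region_scores: array of shape (num_regions,), unnormalized relevance.
    Returns the attended feature vector and the attention weights.
    """
    scores = np.asarray(region_scores, dtype=np.float64)
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    attended = weights @ np.asarray(region_features, dtype=np.float64)
    return attended, weights

# Three regions with 2-dimensional features and equal scores.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
attended, weights = soft_attention(features, [0.0, 0.0, 0.0])
```

Every step here is differentiable, which is why soft attention can be trained with back-propagation. Hard attention would instead sample a single region according to `weights`, and that sampling step is not differentiable.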

A visualization of soft attention:

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/soft_attention.PNG?raw=true)

The CNN features are computed and weighted according to the attention. The attention, that is, the weights given to certain areas, can also be used for visualization. LSTMs take the convolutional features as input and predict the regions to attend to in the following frames, as shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch9/attention_lstm.PNG?raw=true)

Each LSTM stack predicts a location and a label. Every stack has three LSTMs. The input to the LSTM stack is a convolutional feature cube and a location. The location probabilities are the attention weights. The use of attention gives an improvement in accuracy as well as a method to visualize the predictions.