# Video Classification with a CNN-RNN Architecture

**Author:** [Sayak Paul](https://twitter.com/RisingSayak)<br>
**Date created:** 2021/05/28<br>
**Last modified:** 2021/06/05<br>
**Description:** Training a video classifier with transfer learning and a recurrent model on the UCF101 dataset.

This example demonstrates video classification, an important use-case with
applications in recommendations, security, and so on.
We will be using the [UCF101 dataset](https://www.crcv.ucf.edu/data/UCF101.php)
to build our video classifier. The dataset consists of videos categorized into different
actions, like cricket shot, punching, biking, etc. This dataset is commonly used to
build action recognizers, which are an application of video classification.

A video consists of an ordered sequence of frames. Each frame contains *spatial*
information, and the sequence of those frames contains *temporal* information. To model
both of these aspects, we use a hybrid architecture that consists of convolutions
(for spatial processing) as well as recurrent layers (for temporal processing).
Specifically, we'll use a Convolutional Neural Network (CNN) and a Recurrent Neural
Network (RNN) consisting of [GRU layers](https://keras.io/api/layers/recurrent_layers/gru/).
This kind of hybrid architecture is popularly known as a **CNN-RNN**.
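
To make the temporal half of this design more concrete, here is a minimal NumPy sketch of a single GRU cell stepping over a sequence of per-frame feature vectors and returning one summary vector per video. The weights are random, the sizes (`num_frames`, `feat_dim`, `hidden_dim`) and the helper names are illustrative assumptions rather than values from this example, and the gate equations follow one common GRU formulation; library implementations may differ in details.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(feat_dim, hidden_dim, rng):
    """Random weights for the update (z), reset (r) and candidate (h) paths."""
    gates = ("z", "r", "h")
    return {
        "W": {g: rng.normal(scale=0.1, size=(feat_dim, hidden_dim)) for g in gates},
        "U": {g: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for g in gates},
        "b": {g: np.zeros(hidden_dim) for g in gates},
    }


def gru_forward(features, params):
    """Run one GRU cell over `features` of shape (num_frames, feat_dim)."""
    W, U, b = params["W"], params["U"], params["b"]
    h = np.zeros(U["z"].shape[0])
    for x in features:  # One step per frame, in temporal order.
        z = sigmoid(x @ W["z"] + h @ U["z"] + b["z"])  # Update gate.
        r = sigmoid(x @ W["r"] + h @ U["r"] + b["r"])  # Reset gate.
        h_new = np.tanh(x @ W["h"] + (r * h) @ U["h"] + b["h"])  # Candidate state.
        h = (1.0 - z) * h + z * h_new
    return h


rng = np.random.default_rng(0)
num_frames, feat_dim, hidden_dim = 20, 32, 8  # Illustrative sizes only.
frame_features = rng.normal(size=(num_frames, feat_dim))  # Stand-in for CNN outputs.
video_summary = gru_forward(frame_features, init_params(feat_dim, hidden_dim, rng))
print(video_summary.shape)
```

In a CNN-RNN, the rows of `frame_features` would come from the convolutional network, and a classification layer would sit on top of `video_summary`.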

This example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs, which can be
installed using the following command:

In [2]:
!pip install -q git+https://github.com/tensorflow/docs

  Building wheel for tensorflow-docs (setup.py) ... done


## Data collection

In order to keep the runtime of this example relatively short, we will be using a
subsampled version of the original UCF101 dataset. You can refer to
[this notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb)
to see how the subsampling was done.

In [3]:
!wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

## Setup

In [4]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [5]:
from tensorflow_docs.vis import embed
from tensorflow import keras
from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

## Define hyperparameters

In [6]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

## Data preparation

In [7]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 594
Total videos for testing: 224


| Unnamed: 0 | video_name | tag |
|---|---|---|
| 506 | v_TennisSwing_g12_c03.avi | TennisSwing |
| 184 | v_PlayingCello_g18_c01.avi | PlayingCello |
| 368 | v_ShavingBeard_g09_c03.avi | ShavingBeard |
| 73 | v_CricketShot_g18_c05.avi | CricketShot |
| 139 | v_PlayingCello_g11_c02.avi | PlayingCello |
| 97 | v_CricketShot_g22_c06.avi | CricketShot |
| 42 | v_CricketShot_g14_c01.avi | CricketShot |
| 532 | v_TennisSwing_g16_c01.avi | TennisSwing |
| 315 | v_Punch_g19_c03.avi | Punch |
| 94 | v_CricketShot_g22_c03.avi | CricketShot |


One of the many challenges of training video classifiers is figuring out a way to feed
the videos to a network. [This blog post](https://blog.coast.ai/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5)
discusses five such methods. Since a video is an ordered sequence of frames, we could
just extract the frames and stack them into a single tensor. But the number of frames may
differ from video to video, which would prevent us from stacking them into batches
(unless we use padding). As an alternative, we can **save video frames at a fixed
interval until a maximum frame count is reached**. In this example we will do
the following:

1. Capture the frames of a video.
2. Extract frames from the videos until a maximum frame count is reached.
3. If a video has fewer frames than the maximum frame count, pad the video with
zeros.
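
The padding step above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: `pad_or_truncate` is a hypothetical helper, the 12-frame clip is synthetic, and the boolean mask is one common way to tell a downstream model which positions hold real frames.

```python
import numpy as np

MAX_SEQ_LENGTH = 20  # Same value as the hyperparameter cell.


def pad_or_truncate(frames, max_len=MAX_SEQ_LENGTH):
    """Return (padded_frames, mask) with a fixed first dimension of `max_len`."""
    frames = frames[:max_len]
    padded = np.zeros((max_len,) + frames.shape[1:], dtype=frames.dtype)
    padded[: len(frames)] = frames
    mask = np.zeros(max_len, dtype=bool)
    mask[: len(frames)] = True  # True marks real frames, False marks padding.
    return padded, mask


short_video = np.ones((12, 8, 8, 3), dtype="float32")  # Hypothetical 12-frame clip.
padded, mask = pad_or_truncate(short_video)
print(padded.shape, int(mask.sum()))
```

Once every video has the same leading dimension, the padded arrays can be stacked into a batch.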

Note that this workflow is identical to [problems involving text sequences](https://developers.google.com/machine-learning/guides/text-classification/).
Videos of the UCF101 dataset are [known](https://www.crcv.ucf.edu/papers/UCF101_CRCV-TR-12-01.pdf)
to not contain extreme variations in objects and actions across frames. Because of this,
it may be okay to only consider a few frames for the learning task. But this approach may
not generalize well to other video classification problems. We will be using
[OpenCV's `VideoCapture` class](https://docs.opencv.org/master/dd/d43/tutorial_py_video_display.html)
to read frames from videos.
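
If only a few frames per video are used, one option is to spread them evenly over the clip. The sketch below shows that idea with `np.linspace`; `sample_frame_indices` is a hypothetical helper and the frame counts are made up, so treat it as an illustration rather than the sampling scheme this example relies on.

```python
import numpy as np


def sample_frame_indices(num_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly across a video."""
    if num_frames <= 0:
        return np.array([], dtype=int)
    # linspace includes both endpoints; very short videos repeat some indices.
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)


indices = sample_frame_indices(num_frames=100, num_samples=6)
print(indices)
```

Because both endpoints are included, the first and last frames of the video are always among the samples.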

In [8]:
# The following two functions are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)
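
To see what the cropping and channel reordering do, here is a small check on a synthetic frame. OpenCV returns frames in BGR channel order, so indexing with `[2, 1, 0]` yields RGB. The 4x6 frame and its pixel values are made up for illustration.

```python
import numpy as np


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


# Synthetic 4x6 "frame" in BGR order: blue=10, green=20, red=30 everywhere.
bgr_frame = np.zeros((4, 6, 3), dtype=np.uint8)
bgr_frame[..., 0], bgr_frame[..., 1], bgr_frame[..., 2] = 10, 20, 30

square = crop_center_square(bgr_frame)
rgb = square[:, :, [2, 1, 0]]  # Same channel reordering as in `load_video()`.
print(square.shape, rgb[0, 0].tolist())
```

The wider dimension is trimmed equally on both sides, so the crop keeps the center of the frame.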


We can use a pre-trained network to extract meaningful features from the extracted
frames. The [`Keras Applications`](https://keras.io/api/applications/) module provides
a number of state-of-the-art models pre-trained on the [ImageNet-1k dataset](http://image-net.org/).
We will be using the [InceptionV3 model](https://arxiv.org/abs/1512.00567) for this purpose.

In [9]:

def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
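
Two steps inside the feature extractor are easy to reproduce by hand. The Inception-style `preprocess_input` rescales pixel values from `[0, 255]` to `[-1, 1]`, and `pooling="avg"` averages the final feature map over its spatial axes, leaving one vector of `NUM_FEATURES` values per frame. The NumPy sketch below mimics both; the helper names are hypothetical and the 5x5 spatial size of the feature map is an assumption made for illustration.

```python
import numpy as np

NUM_FEATURES = 2048  # Same value as the hyperparameter cell.


def scale_to_unit_range(pixels):
    """Map pixel values in [0, 255] to floats in [-1, 1]."""
    return pixels.astype("float32") / 127.5 - 1.0


def global_average_pool(feature_map):
    """Average over the two spatial axes of a (height, width, channels) map."""
    return feature_map.mean(axis=(0, 1))


frame = np.array([[[0, 127.5, 255]]])  # One pixel, three channels.
scaled = scale_to_unit_range(frame)
print(scaled.tolist())

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(5, 5, NUM_FEATURES))  # Spatial size is an assumption.
pooled = global_average_pool(feature_map)
print(pooled.shape)
```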


The labels of the videos are strings. Neural networks do not understand string values,
so they must be converted to some numerical form before they are fed to the model. Here
we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup)
layer to encode the class labels as integers.

In [10]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())

['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']
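
With `num_oov_indices=0`, each label is mapped to its position in the vocabulary printed above. An analogous mapping can be sketched in plain NumPy, which is handy for checking label IDs by hand. The `tags` array below is a made-up sample, and this is not how the layer is implemented.

```python
import numpy as np

tags = np.array(["Punch", "CricketShot", "TennisSwing", "Punch", "PlayingCello"])

# `vocabulary` is sorted and unique; `label_ids` indexes into it.
vocabulary, label_ids = np.unique(tags, return_inverse=True)
print(vocabulary.tolist())
print(label_ids.tolist())
```

Indexing the vocabulary with the IDs (`vocabulary[label_ids]`) recovers the original strings.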


In [11]:
# `crop_center_square()` from the earlier cell is reused here. This version of
# `load_video()` is adapted from the same tutorial and samples frames at a
# fixed stride instead of reading every frame.


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE), num_part=6):
    cap = cv2.VideoCapture(path)
    length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Stride between sampled frames; at least 1 so `range()` never gets a zero step.
    num_diff = max(1, length // num_part)
    frames = []
    try:
        for i in range(0, length, num_diff):
            # Seek to frame `i` before reading it.
            cap.set(cv2.CAP_PROP_POS_FRAMES, i)
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            # OpenCV returns BGR; reorder the channels to RGB.
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)

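def label_to_tensor(st):
    # Maps each label to the index of its first occurrence in the sorted label
    # list. The result is not a dense class ID in 0..num_classes-1; it is
    # remapped to one in a later cell.
    f_list = list(st)
    f_list.sort()
    st_list = []
    for s in st:
        s_index = f_list.index(s)
        st_list.append(s_index)
    return tf.convert_to_tensor(st_list, dtype=tf.float32)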


In [12]:
root_dir = os.path.join(os.getcwd(), "train")
num_samples = len(train_df)
video_paths = train_df["video_name"].values.tolist()
labels = train_df["tag"].values
labels = label_to_tensor(labels[..., None]).numpy()

frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
frame_features = np.zeros(
    shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
)

frames = []
for path in video_paths:
    # Load the sampled frames of each video.
    frames.append(load_video(os.path.join(root_dir, path)))

In [13]:
# Keep the first 6 sampled frames of every video and stack them into one array
# of shape (num_videos, 6, height, width, 3).
my_train_array = np.stack([video[:6] for video in frames])

In [28]:
# `label_to_tensor()` returned the first-occurrence index of each class in the
# sorted label list; map those indices to dense class IDs 0-4.
map_dict = {0: 0, 118: 1, 238: 2, 359: 3, 477: 4}
new_train_label = np.vectorize(map_dict.get)(labels)
depth = 5
out_vec = tf.one_hot(new_train_label, depth)
tr_df = tf.expand_dims(my_train_array, 1)


def shuffle_split_data(X, y):
    # Draw one random number per sample and keep the lowest 80% for training.
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 80)

    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]

    print(len(X_train), len(y_train), len(X_test), len(y_test))
    return X_train, y_train, X_test, y_test


X_Train, y_Train, X_Test, y_Test = shuffle_split_data(my_train_array, new_train_label)



475 475 119 119
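
The split sizes printed above follow directly from the percentile rule: with 594 samples, exactly 475 random scores fall strictly below the 80th percentile. The sketch below reproduces that count and uses a seeded generator so the split is repeatable. `percentile_split_mask` is a hypothetical helper, and the seed is an arbitrary choice rather than something taken from this notebook.

```python
import numpy as np


def percentile_split_mask(num_samples, train_percent, rng):
    """Boolean mask that is True for roughly `train_percent`% of the samples."""
    scores = rng.random(num_samples)
    return scores < np.percentile(scores, train_percent)


rng = np.random.default_rng(42)  # Fixed seed so the split is repeatable.
mask = percentile_split_mask(num_samples=594, train_percent=80, rng=rng)
print(int(mask.sum()), int((~mask).sum()))
```

Note that a random mask like this does not guarantee that every class keeps the same proportion in both parts.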


In [29]:
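IMG_SIZE = 222

feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet",
    include_top=False,
    pooling="avg",
)
# Freeze the last 20 layers. The loop starts at 1 because `layers[-0]` is
# `layers[0]`, the first layer, not the last one.
for i in range(1, 21):
    feature_extractor.layers[-i].trainable = False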

inputs = tf.keras.Input((6,224,224,3)) 
level1 = tf.keras.layers.Conv3D(27, 3, activation='relu')(inputs)
level2 = tf.keras.layers.MaxPool3D((3,3,3))(level1)
level3 = tf.keras.layers.Reshape((222,222,3))(level2)
level4 = tf.keras.layers.Conv2D(100,3,activation='relu')(level3)
level5 = tf.keras.layers.MaxPool2D((3,3))(level4)
level6 = tf.keras.layers.Conv2D(10,3,activation='relu')(level5)
level7 = tf.keras.layers.MaxPool2D((3,3))(level6)
level8 = tf.keras.layers.Flatten()(level7)
# preprocessed = feature_extractor(level3)
# outputs = feature_extractor(preprocessed)

output = tf.keras.layers.Dense(5, activation='softmax')(level8)

final_model = tf.keras.Model(inputs, output, name="feature_extractor")
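
The `Reshape((222, 222, 3))` layer only works because the element counts happen to match. The arithmetic below traces the shapes through the model, assuming the Keras defaults used in that cell: stride-1 convolutions with "valid" padding, and pooling windows whose stride equals the pool size. The helper names are hypothetical.

```python
import math


def conv_valid(size, kernel):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def pool_valid(size, pool):
    """Output length of a pooling window whose stride equals its size."""
    return size // pool


# Input: (frames, height, width, channels) = (6, 224, 224, 3).
dims = [6, 224, 224]
dims = [conv_valid(d, 3) for d in dims]  # After Conv3D(27, 3), with 27 channels.
dims = [pool_valid(d, 3) for d in dims]  # After MaxPool3D((3, 3, 3)).
elements_after_pool = math.prod(dims) * 27
elements_in_reshape = 222 * 222 * 3
print(dims, elements_after_pool, elements_in_reshape)

side = 222
side = pool_valid(conv_valid(side, 3), 3)  # Conv2D(100, 3) + MaxPool2D((3, 3)).
side = pool_valid(conv_valid(side, 3), 3)  # Conv2D(10, 3) + MaxPool2D((3, 3)).
flattened = side * side * 10
print(side, flattened)
```

Matching element counts only show that the reshape is legal. It reinterprets the pooled `(1, 74, 74, 27)` volume as a `222 x 222 x 3` grid, so channel values end up at neighboring positions and the original spatial layout is not preserved.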

In [33]:

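filepath = 'tf1_mnist_cnn.hdf5'
# Save the full model whenever validation accuracy improves.
save_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath,
    monitor='val_accuracy',
    verbose=1,
    save_best_only=True,
    save_weights_only=False,
    mode='auto',
    save_freq='epoch',  # Replaces the deprecated `period=1`.
)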




In [34]:
final_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
final_model.fit(
    x=X_Train,
    y=y_Train,
    validation_data=(X_Test, y_Test),
    verbose=1,
    epochs=150,
    callbacks=[save_checkpoint],
)

Epoch 1/150
Epoch 1: val_accuracy improved from -inf to 0.77311, saving model to tf1_mnist_cnn.hdf5
Epoch 2/150
Epoch 2: val_accuracy improved from 0.77311 to 0.80672, saving model to tf1_mnist_cnn.hdf5
Epoch 3/150
Epoch 3: val_accuracy did not improve from 0.80672
Epoch 4/150
Epoch 4: val_accuracy did not improve from 0.80672
Epoch 5/150
Epoch 5: val_accuracy improved from 0.80672 to 0.83193, saving model to tf1_mnist_cnn.hdf5
Epoch 6/150
Epoch 6: val_accuracy did not improve from 0.83193
Epoch 7/150
Epoch 7: val_accuracy did not improve from 0.83193
Epoch 8/150
Epoch 8: val_accuracy did not improve from 0.83193
Epoch 9/150
Epoch 9: val_accuracy did not improve from 0.83193
Epoch 10/150
Epoch 10: val_accuracy did not improve from 0.83193
Epoch 11/150
Epoch 11: val_accuracy did not improve from 0.83193
Epoch 12/150
Epoch 12: val_accuracy did not improve from 0.83193
Epoch 13/150
Epoch 13: val_accuracy did not improve from 0.83193
Epoch 14/150
Epoch 14: val_accuracy did not improve from

KeyboardInterrupt: ignored
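
The `sparse_categorical_crossentropy` loss used in the compile call takes integer class IDs, such as `new_train_label`, directly, so no one-hot encoding of the targets is needed. The NumPy sketch below shows that the sparse form and the one-hot form give the same value. The probabilities are made up, the function names simply mirror the Keras loss names, and details such as numerical clipping are left out.

```python
import numpy as np


def sparse_categorical_crossentropy(class_ids, probs):
    """Mean negative log-probability of the true class, given integer IDs."""
    return float(-np.log(probs[np.arange(len(class_ids)), class_ids]).mean())


def categorical_crossentropy(one_hot, probs):
    """Same loss, computed from one-hot targets."""
    return float(-(one_hot * np.log(probs)).sum(axis=1).mean())


probs = np.array(
    [[0.7, 0.1, 0.1, 0.05, 0.05], [0.2, 0.5, 0.1, 0.1, 0.1]]
)  # Made-up softmax outputs for two samples and five classes.
class_ids = np.array([0, 1])
one_hot = np.eye(5)[class_ids]

sparse_loss = sparse_categorical_crossentropy(class_ids, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(np.isclose(sparse_loss, dense_loss))
```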