# Video Classification using CNN-RNN

The script defines and trains a video classification model that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) built from GRU layers. The training data is drawn from the UCF101 dataset, whose videos are categorized into actions such as cricket shot, punching, and biking. A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information. A hybrid CNN-RNN architecture models both: the convolutions handle the spatial processing, while the RNN (here, GRU layers) handles the temporal processing.

In [27]:
!pip install -q git+https://github.com/tensorflow/docs

In [28]:
from tensorflow_docs.vis import embed
from tensorflow import keras
#from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

# Data preparation

To keep the runtime of this example relatively short, we use a subsampled version of the original UCF101 dataset. Refer to this [notebook](https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb) to see how the subsampling was done.

In [29]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 40
NUM_FEATURES = 2048

In [31]:
train_df = pd.read_csv("/kaggle/input/ucf101/train.csv")
validate_df = pd.read_csv("/kaggle/input/ucf101/validate.csv")
test_df = pd.read_csv("/kaggle/input/ucf101/test.csv")

print("Total videos for training: {}".format(len(train_df)))
print("Total videos for validate: {}".format(len(validate_df)))
print("Total videos for testing: {}".format(len(test_df)))

train_df.sample(10)

Count the number of instances per class in the training, validation, and test sets.

In [32]:
from collections import Counter

pd.DataFrame.from_dict(
    {'train_tags'      : Counter(train_df["tag"]).keys(), 
     'train_counts'    : Counter(train_df["tag"]).values(),
     'validate_tags'   : Counter(validate_df["tag"]).keys(),
     'validate_counts' : Counter(validate_df["tag"]).values(),
     'test_tags'       : Counter(test_df["tag"]).keys(),
     'test_counts'     : Counter(test_df["tag"]).values()})

Since a video is an ordered sequence of frames, we could just extract the frames and put them in a 3D tensor. But the number of frames differs across videos, which prevents us from stacking them into fixed-size batches (unless we use padding). In this implementation, we keep video frames until a maximum frame count is reached. Specifically:

1. Extract frames from the videos.
2. Center crop and resize each frame.
3. Append the processed frames until the maximum frame count is reached.
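
Step 2 can be illustrated on a toy array. The following is a minimal sketch, assuming a frame stored as a NumPy array of shape (height, width, channels). The helper name `center_square` and the toy frame are hypothetical; the offsets use the same arithmetic as the crop function in this notebook.

```python
import numpy as np


def center_square(frame):
    """Crop the largest centered square from an (H, W, C) array."""
    height, width = frame.shape[:2]
    side = min(height, width)
    left = (width // 2) - (side // 2)
    top = (height // 2) - (side // 2)
    return frame[top : top + side, left : left + side]


# Toy "frame": 4 rows, 6 columns, 3 channels (hypothetical data).
toy_frame = np.arange(4 * 6 * 3).reshape(4, 6, 3)
cropped = center_square(toy_frame)
print(cropped.shape)  # (4, 4, 3)
```

The crop keeps all four rows and drops one column on each side, so the square stays centered before it is resized.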

If a video's frame count is less than the maximum frame count, we pad the video with zeros.
Note that this workflow is identical to problems involving text sequences. Videos in the UCF101 dataset are known not to contain extreme variations in objects and actions across frames. Because of this, it may be okay to consider only a few frames for the learning task, but this approach may not generalize well to other video classification problems. We use OpenCV's VideoCapture class to read frames from videos.
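
The truncate-or-pad idea described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the notebook's pipeline: `MAX_LEN`, `NUM_FEATS`, and `pad_and_mask` are hypothetical stand-ins with small sizes so the result is easy to inspect.

```python
import numpy as np

MAX_LEN = 5    # illustrative stand-in for MAX_SEQ_LENGTH
NUM_FEATS = 3  # illustrative stand-in for NUM_FEATURES


def pad_and_mask(features, max_len=MAX_LEN):
    """Truncate or zero-pad a (T, D) array to (max_len, D) and build its mask."""
    length = min(len(features), max_len)
    padded = np.zeros((max_len, features.shape[1]), dtype="float32")
    mask = np.zeros((max_len,), dtype="bool")
    padded[:length] = features[:length]
    mask[:length] = True  # True = real frame, False = padding
    return padded, mask


short_clip = np.ones((2, NUM_FEATS), dtype="float32")  # fewer frames than MAX_LEN
long_clip = np.ones((8, NUM_FEATS), dtype="float32")   # more frames than MAX_LEN

padded_short, mask_short = pad_and_mask(short_clip)
padded_long, mask_long = pad_and_mask(long_clip)
print(mask_short.tolist())  # [True, True, False, False, False]
print(mask_long.tolist())   # [True, True, True, True, True]
```

Every clip ends up with the same shape, and the mask records which timesteps hold real frames so that a sequence model can ignore the padding.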

In [33]:
# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub

def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # OpenCV reads frames as BGR; reorder to RGB
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)

# Feature extraction using InceptionV3

We can use an InceptionV3 model pre-trained on ImageNet-1k to extract features from each video frame.

In [34]:
def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")

feature_extractor = build_feature_extractor()
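
With `include_top=False` and `pooling="avg"`, the extractor averages its last convolutional feature map over the spatial positions, which yields one vector per frame. The NumPy sketch below illustrates that pooling step only; the random feature map and its 5x5 grid size are assumptions made for illustration, not values read from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last convolutional feature map of one frame:
# (grid_height, grid_width, channels). The 5x5 grid is an assumption;
# only the channel count (2048) matters for the output size.
feature_map = rng.normal(size=(5, 5, 2048)).astype("float32")

# Global average pooling: average over both spatial axes, keep the channels.
frame_vector = feature_map.mean(axis=(0, 1))
print(frame_vector.shape)  # (2048,)
```

This is why `NUM_FEATURES` is set to 2048: each frame is summarized by one value per channel.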

The labels of the videos are strings, which the neural network cannot process directly, so they are mapped to numerical values before being fed to the model for training. The StringLookup layer encodes the class labels as integers.

In [35]:
label_processor = keras.layers.StringLookup(num_oov_indices=0, vocabulary=np.unique(train_df["tag"]))

print(label_processor.get_vocabulary())
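
With no out-of-vocabulary slots (`num_oov_indices=0`) and no mask token, the lookup amounts to each label's position in the vocabulary. The sketch below reproduces that mapping in plain NumPy rather than with the Keras layer; the `tags` array is hypothetical and stands in for `train_df["tag"]`.

```python
import numpy as np

# Hypothetical tags, standing in for train_df["tag"].
tags = np.array(["Punch", "CricketShot", "Punch", "ShavingBeard", "CricketShot"])

# np.unique returns the sorted distinct labels; this plays the role of the vocabulary.
vocabulary = np.unique(tags)
label_to_index = {label: index for index, label in enumerate(vocabulary)}

# Encode every tag as the index of its vocabulary entry.
encoded = np.array([label_to_index[tag] for tag in tags])
print(vocabulary.tolist())  # ['CricketShot', 'Punch', 'ShavingBeard']
print(encoded.tolist())     # [1, 0, 1, 2, 0]
```

Because the vocabulary is sorted, the integer assigned to a class depends only on the set of class names, not on the order of the rows.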

In [36]:
def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    processed_count = 0
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :]
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

        processed_count += 1
        print("{}/{} videos done".format(processed_count, num_samples), end = "\r")
    
    print("\n")

    return (frame_features, frame_masks), labels

In [37]:
# print("Preparing train videos:")
# train_data, train_labels = prepare_all_videos(train_df.loc[[0,1,2,3]], "/kaggle/input/ucf101/train/")
# print("Preparing validate videos")
# validate_data, validate_labels = prepare_all_videos(validate_df.loc[[0,1]], "/kaggle/input/ucf101/validate/")
# print("Preparing test videos:")
# test_data, test_labels = prepare_all_videos(test_df.loc[[0,1]], "/kaggle/input/ucf101/test/")

print("Preparing train videos")
train_data, train_labels = prepare_all_videos(train_df, "/kaggle/input/ucf101/train/")
print("Preparing validate videos")
validate_data, validate_labels = prepare_all_videos(validate_df, "/kaggle/input/ucf101/validate/")
print("Preparing test videos")
test_data, test_labels = prepare_all_videos(test_df, "/kaggle/input/ucf101/test/")

print("Frame features in train set: {}".format(train_data[0].shape))
print("Frame masks in train set: {}".format(train_data[1].shape))

# Define sequence model

In [38]:
# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    x = keras.layers.GRU(32, return_sequences=True)(frame_features_input, mask=mask_input)
    #x = keras.layers.GRU(16, return_sequences=True)(x)
    x = keras.layers.GRU(16)(x)
    x = keras.layers.Dropout(0.25)(x)
    x = keras.layers.Dense(16, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return rnn_model
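
The mask passed to the first GRU tells the layer which timesteps are padding; Keras recurrent layers skip those steps and carry the previous state forward. The sketch below illustrates that behavior with a plain tanh recurrent cell in NumPy. It is not a GRU, and the weights, sizes, and the helper `run_masked_rnn` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FEATS, UNITS = 4, 3  # illustrative sizes, not the notebook's

# Random weights for a plain tanh recurrent cell (a stand-in for the GRU).
w_in = rng.normal(size=(NUM_FEATS, UNITS))
w_rec = rng.normal(size=(UNITS, UNITS))


def run_masked_rnn(sequence, mask):
    """Run the cell over time, keeping the previous state wherever mask is False."""
    state = np.zeros(UNITS)
    for step_input, keep in zip(sequence, mask):
        candidate = np.tanh(step_input @ w_in + state @ w_rec)
        state = candidate if keep else state
    return state


real_frames = rng.normal(size=(3, NUM_FEATS))
padded = np.vstack([real_frames, np.zeros((2, NUM_FEATS))])  # two padded steps
mask = np.array([True, True, True, False, False])

masked_state = run_masked_rnn(padded, mask)
unpadded_state = run_masked_rnn(real_frames, mask[:3])

# The padded, masked sequence ends in the same state as the unpadded one.
print(np.allclose(masked_state, unpadded_state))  # True
```

Without the mask, the zero-padded steps would still update the state, so the final representation would depend on how much padding a video received.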

In [39]:
filepath = "video_classifier.h5"
checkpoint = keras.callbacks.ModelCheckpoint(filepath, 
                                             save_weights_only=True, 
                                             save_best_only=True, 
                                             verbose=1)

seq_model = get_sequence_model()
    
history = seq_model.fit(x = [train_data[0], train_data[1]],
                        y = train_labels,
                        validation_data = ([validate_data[0], validate_data[1]], validate_labels),
                        epochs=EPOCHS,
                        callbacks=[checkpoint])

In [40]:
seq_model.load_weights(filepath)

_, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)

print("Test accuracy: {}%".format(round(accuracy * 100, 2)))

In [41]:
def plot_result(item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title("Train and Validation {} Over Epochs".format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()

plot_result("accuracy")

# Inference

In [61]:
def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("/kaggle/input/ucf101/test/", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = seq_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, fps=10)
    return embed.embed_file("animation.gif")
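
The ranking step inside `sequence_prediction` sorts the class probabilities in descending order before printing them. A minimal standalone sketch of that step follows; the class names and probability values are hypothetical.

```python
import numpy as np

# Hypothetical class names and softmax output for one video.
class_names = ["CricketShot", "Punch", "ShavingBeard"]
probabilities = np.array([0.20, 0.70, 0.10])

# argsort sorts ascending, so reverse it to list the most likely class first.
ranked = np.argsort(probabilities)[::-1]
for index in ranked:
    print(f"  {class_names[index]}: {probabilities[index] * 100:5.2f}%")

top_label = class_names[ranked[0]]
```

The first entry of `ranked` is the predicted class; the remaining entries show how confident the model is in the alternatives.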

In [62]:
test_video = np.random.choice(test_df["video_name"].values.tolist())
print("Test video: {}".format(test_video))
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

In [63]:
test_video = np.random.choice(test_df["video_name"].values.tolist())
print("Test video: {}".format(test_video))
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

In [64]:
test_video = np.random.choice(test_df["video_name"].values.tolist())
print("Test video: {}".format(test_video))
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])