# Group 186 - Assignment 2 - Video Classification using CNN+LSTM 

In this Group 186 assignment, we implement human activity recognition on videos using a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) network, with architectures built in TensorFlow. Activity recognition is treated here as a video classification task.

## Group No 186

## Group Member Names:
1. Sindhu C - 2021FC04993@wilp.bits-pilani.ac.in
2. Lajish VL - 2021fc04980@wilp.bits-pilani.ac.in
3. Sivarajan N - 2021fc04989@wilp.bits-pilani.ac.in


### Video Classification with a CNN+LSTM Architecture

We follow the steps below for this task.

◆ Data collection

◆ Setup

◆ Define hyperparameters

◆ Data preparation

◆ The sequence model

◆ Evaluation and Inference 


## Step i: Import Libraries/Dataset

```
• Import the required libraries.
• Check the GPU available (recommended- use free GPU provided by Google Colab).
```



In [1]:
%%capture
# Install the required libraries. silent mode, no output required
!pip install moviepy imageio==2.4.1 -q gwpy
!pip install git+https://github.com/tensorflow/docs -q gwpy
!pip install tensorflow-gpu -q gwpy

In [2]:
# Import the required libraries.
import os
import cv2
import math
import random
import timeit
import gc
import pandas as pd
import numpy as np
import datetime as dt
import tensorflow as tf
from collections import deque
import matplotlib.pyplot as plt
 
from moviepy.editor import *
%matplotlib inline

from sklearn import datasets
from sklearn.metrics import classification_report, confusion_matrix 
from sklearn.model_selection import train_test_split
 
import tensorflow_docs
from tensorflow_docs.vis import embed 
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model

pygame 2.3.0 (SDL 2.24.2, Python 3.9.16)
Hello from the pygame community. https://www.pygame.org/contribute.html


### Check Tensorflow with GPU

```
The next step provides an introduction to computing on a [GPU](https://cloud.google.com/gpu) in Colab. 

We will connect to a GPU, and then run some basic TensorFlow operations on both the CPU and a GPU
And then observe the speedup provided by using the GPU.

```


In [3]:
#  To Enable GPUs for the notebook:
#  Navigate to Colab Menu->Edit→Notebook Settings
#  Select GPU from the Hardware Accelerator drop-down

# Next, we'll confirm that we can connect to the GPU with tensorflow:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))


def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)
  
# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

# Configure TensorFlow device placement.
# Soft device placement lets TensorFlow automatically choose an existing, supported device
# when the requested one does not exist.
# Note: tf.device() only takes effect as a context manager ("with tf.device(...):"),
# so the bare call below does not pin later operations to the GPU.
tf.device('/device:GPU:0')
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'false'
tf.config.set_soft_device_placement(True)


Found GPU at: /device:GPU:0
Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
8.67597305199979
GPU (s):
0.14368549199980407
GPU speedup over CPU: 60x


Enabling the GPU gave a 60x speedup over the CPU in the run above; across runs we saw speedups of 30x to 60x.

## Step ii. Data Processing

• Download the data from https://www.crcv.ucf.edu/data/UCF50.rar, extract the dataset

• Convert the data into the correct format which could be used for the DL model.

• Plot at least two (we did 10) samples and their captions (use matplotlib/seaborn/any other library - we used matplotlib).

• Load the data into train and test data in the required format.

In [None]:
%%capture
# Download the data if not already downloaded and extract the dataset.
# (%%capture is a cell magic, so it must be the first line of the cell.)

# Download the UCF50 dataset.
# UCF50 is an action recognition dataset of realistic action videos, collected from YouTube, with 50 action categories.
# UCF101 is an extension of UCF50 with 101 action categories.
!wget -nc "https://www.crcv.ucf.edu/data/UCF50.rar"

#Extract the Dataset
!unrar x UCF50.rar


In [None]:
# Check the downloaded file, and extracted folders
!pwd
!ls -ll
!ls UCF50 -ll

In [None]:
# Let's set the Seed values so every run is consistent.
seed_constant = 27
np.random.seed(seed_constant)
random.seed(seed_constant)
tf.random.set_seed(seed_constant)

In [None]:
#  Plot 10 samples and their captions using Matplotlib
# Create a Matplotlib figure and specify the size of the figure.
plt.figure(figsize = (20, 20))

# Get the names of all classes/categories in UCF50.
all_classes_names = os.listdir('UCF50')

# Generate a list of 10 distinct random class indices. The values lie between 0 and 49,
# since the dataset has 50 classes in total.
random_range = random.sample(range(len(all_classes_names)), 10)

# Iterating through all the generated random values.
for counter, random_index in enumerate(random_range, 1):

    # Retrieve a Class Name using the Random Index.
    selected_class_Name = all_classes_names[random_index]

    # Retrieve the list of all the video files present in the randomly selected Class Directory.
    video_files_names_list = os.listdir(f'UCF50/{selected_class_Name}')

    # Randomly select a video file from the list retrieved from the randomly selected Class Directory.
    selected_video_file_name = random.choice(video_files_names_list)

    # We will be using OpenCV's VideoCapture() method to read frames from videos.
    # Initialize a VideoCapture object to read from the video File.
    video_reader = cv2.VideoCapture(f'UCF50/{selected_class_Name}/{selected_video_file_name}')
    
    # Read the first frame of the video file.
    _, bgr_frame = video_reader.read()

    # Release the VideoCapture object. 
    video_reader.release()

    # Convert the frame from BGR into RGB format. 
    rgb_frame = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)

    # Write the class name on the video frame.
    cv2.putText(rgb_frame, selected_class_Name, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)
    
    # Display the frame.
    plt.subplot(5, 4, counter);plt.imshow(rgb_frame);plt.axis('off')

In [None]:
# Specify the height and width to which each video frame will be resized in our dataset.
IMAGE_HEIGHT , IMAGE_WIDTH, IMG_SIZE = 224, 224, 224

# Specify the number of frames of a video that will be fed to the model as one sequence.
SEQUENCE_LENGTH = 5

# Specifying the directory containing the UCF50 dataset. 
DATASET_DIR = "UCF50"

# Specifying the list containing the names of the classes used for training.
# We pick 5 distinct random class names from the dataset; these classes are also used in Step vi for prediction.
# random.sample draws without replacement, so no class is selected twice.
n=5
CLASSES_LIST=random.sample(all_classes_names, k=n)
#CLASSES_LIST = ["VolleyballSpiking","PushUps","PullUps"]

BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

In [None]:
def frames_extraction(video_path):
    '''
    This function will extract the required frames from a video after resizing and normalizing them.
    Args:
        video_path: The path of the video in the disk, whose frames are to be extracted.
    Returns:
        frames_list: A list containing the resized and normalized frames of the video.
    '''

    # Declare a list to store video frames.
    frames_list = []
    
    # Read the Video File using the VideoCapture object.
    video_reader = cv2.VideoCapture(video_path)

    # Get the total number of frames in the video.
    video_frames_count = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))

    # Calculate the interval after which frames will be added to the list.
    skip_frames_window = max(int(video_frames_count/SEQUENCE_LENGTH), 1)

    # Iterate through the Video Frames.
    for frame_counter in range(SEQUENCE_LENGTH):

        # Set the current frame position of the video.
        video_reader.set(cv2.CAP_PROP_POS_FRAMES, frame_counter * skip_frames_window)

        # Reading the frame from the video. 
        success, frame = video_reader.read() 

        # Check if Video frame is not successfully read then break the loop
        if not success:
            break

        # Resize the frame to a fixed size. cv2.resize expects the target size as (width, height).
        resized_frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        
        # Normalize the resized frame by dividing it with 255 so that each pixel value then lies between 0 and 1
        normalized_frame = resized_frame / 255
        
        # Append the normalized frame into the frames list
        frames_list.append(normalized_frame)
    
    # Release the VideoCapture object. 
    video_reader.release()

    # Return the frames list.
    return frames_list
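The sampling rule in `frames_extraction` can be checked without opening a video. The sketch below reproduces only the index arithmetic; `sampled_frame_indices` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
def sampled_frame_indices(video_frames_count, sequence_length=5):
    """Return the frame positions that frames_extraction would seek to."""
    # Same interval rule as frames_extraction: evenly spaced, at least 1 apart.
    skip_frames_window = max(int(video_frames_count / sequence_length), 1)
    return [i * skip_frames_window for i in range(sequence_length)]


# A 100-frame video is sampled every 20 frames.
print(sampled_frame_indices(100))  # [0, 20, 40, 60, 80]

# A 3-frame video yields positions past the end of the clip, so those reads
# fail and fewer than sequence_length frames come back.
print(sampled_frame_indices(3))    # [0, 1, 2, 3, 4]
```

Videos that return fewer than `SEQUENCE_LENGTH` frames are later skipped when the dataset is assembled.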

In [None]:

def create_dataset():
    '''
    This function will extract the data of the selected classes and create the required dataset.
    Returns:
        features:          A list containing the extracted frames of the videos.
        labels:            A list containing the indexes of the classes associated with the videos.
        video_files_paths: A list containing the paths of the videos in the disk.
    '''

    # Declared Empty Lists to store the features, labels and video file path values.
    features = []
    labels = []
    video_files_paths = []
    
    # Iterating through all the classes mentioned in the classes list
    for class_index, class_name in enumerate(CLASSES_LIST):
        
        # Display the name of the class whose data is being extracted.
        print(f'Extracting Data of Class: {class_name}')
        
        # Get the list of video files present in the specific class name directory.
        files_list = os.listdir(os.path.join(DATASET_DIR, class_name))
        
        # Iterate through all the files present in the files list.
        for file_name in files_list:
            
            # Get the complete video path.
            video_file_path = os.path.join(DATASET_DIR, class_name, file_name)

            # Extract the frames of the video file.
            frames = frames_extraction(video_file_path)

            # Keep the video only if exactly SEQUENCE_LENGTH frames were extracted;
            # videos that yield fewer frames are ignored.
            if len(frames) == SEQUENCE_LENGTH:

                # Append the data to their respective lists.
                features.append(frames)
                labels.append(class_index)
                video_files_paths.append(video_file_path)

    # Converting the list to numpy arrays
    features = np.asarray(features)
    labels = np.array(labels)  
    
    # Return the frames, class index, and video file path.
    return features, labels, video_files_paths

In [None]:
# Create the dataset.
features, labels, video_files_paths = create_dataset()
gc.collect()
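Because every frame is kept in memory, it helps to estimate the size of the `features` array before building it. The sketch below is a rough estimate only; the count of 100 videos is a made-up example, and `dataset_megabytes` is a hypothetical helper.

```python
import numpy as np

SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH = 5, 224, 224

# Dividing a uint8 frame by 255 produces float64 values (8 bytes each).
frame = np.zeros((IMAGE_HEIGHT, IMAGE_WIDTH, 3), dtype=np.uint8)
normalized_dtype = (frame / 255).dtype


def dataset_megabytes(num_videos, dtype):
    """Approximate size of the features array in MB (10**6 bytes)."""
    values_per_video = SEQUENCE_LENGTH * IMAGE_HEIGHT * IMAGE_WIDTH * 3
    return num_videos * values_per_video * np.dtype(dtype).itemsize / 1e6


print(normalized_dtype)                          # float64
print(dataset_megabytes(100, normalized_dtype))  # about 602 MB
print(dataset_megabytes(100, np.float32))        # about 301 MB
```

Casting the normalized frames to `float32` would roughly halve the memory footprint, which may matter on a Colab instance with limited RAM.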

In [None]:
# Using Keras's to_categorical method to convert labels into one-hot-encoded vectors
one_hot_encoded_labels = to_categorical(labels)
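For reference, one-hot encoding can be written in a few lines of NumPy. This is a minimal sketch of the idea, not the Keras implementation; `one_hot` is a hypothetical helper.

```python
import numpy as np


def one_hot(labels, num_classes=None):
    """Turn integer class indices into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Infer the width from the largest index that appears.
        num_classes = labels.max() + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]


print(one_hot([0, 2, 1]))
print(one_hot([0, 1], num_classes=5).shape)  # (2, 5)
```

When the width is inferred from the data, a class with no usable videos at the highest index would shrink the label vectors, so passing the number of classes explicitly (for example `len(CLASSES_LIST)`) is the safer choice.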

In [None]:
# Step ii: Load the data into train and test data in the required format.
# Split the Data into Train ( 75% ) and Test Set ( 25% ).
features_train, features_test, labels_train, labels_test = train_test_split(features, one_hot_encoded_labels, test_size = 0.25, shuffle = True, random_state = seed_constant)
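The split above returns features and labels but not the matching video paths. One way to keep all three aligned is to split an index array and use it to select from each collection. The sketch below uses toy data and a hypothetical `split_indices` helper; it does not reproduce the exact shuffle of `train_test_split`.

```python
import numpy as np


def split_indices(num_samples, test_size=0.25, seed=27):
    """Shuffle sample indices and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_samples)
    num_test = int(np.ceil(num_samples * test_size))
    return shuffled[num_test:], shuffled[:num_test]


# Toy stand-ins for features and video_files_paths.
toy_features = np.arange(8).reshape(8, 1)
toy_paths = [f"video_{i}.avi" for i in range(8)]

train_idx, test_idx = split_indices(len(toy_paths))
features_test_toy = toy_features[test_idx]
test_paths_toy = [toy_paths[i] for i in test_idx]
print(len(train_idx), len(test_idx))  # 6 2
```

With the test paths available, predictions in the evaluation step can be tied back to specific held-out videos.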

## Step iii. Model Building

• Use any pre-trained model trained on ImageNet dataset (available publicly on Google) for image feature extraction.

• Create k-layered LSTM model and other relevant layers.

• Add one layer of dropout at the appropriate position and give reasons.

• Choose the appropriate activation function for all the layers.

• Print the model summary.

• Justify the choice of number of layers, activation function and any other hyper parameters used.

In [None]:
# This function will construct the required convlstm model.
def create_convlstm_model():

    # We will use a Sequential model for model construction
    model = Sequential()

    # Define the Model Architecture.
    ########################################################################################################################
    
    model.add(ConvLSTM2D(filters = 4, kernel_size = (3, 3), activation = 'tanh',data_format = "channels_last",
                         recurrent_dropout=0.2, return_sequences=True, input_shape = (SEQUENCE_LENGTH,
                                                                                      IMAGE_HEIGHT, IMAGE_WIDTH, 3)))
    
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding='same', data_format='channels_last'))
    model.add(TimeDistributed(Dropout(0.2)))
    
    model.add(ConvLSTM2D(filters = 8, kernel_size = (3, 3), activation = 'tanh', data_format = "channels_last",
                         recurrent_dropout=0.2, return_sequences=True))
    
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding='same', data_format='channels_last'))
    model.add(TimeDistributed(Dropout(0.2)))
    
    model.add(ConvLSTM2D(filters = 14, kernel_size = (3, 3), activation = 'tanh', data_format = "channels_last",
                         recurrent_dropout=0.2, return_sequences=True))
    
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding='same', data_format='channels_last'))
    model.add(TimeDistributed(Dropout(0.2)))
    
    model.add(ConvLSTM2D(filters = 16, kernel_size = (3, 3), activation = 'tanh', data_format = "channels_last",
                         recurrent_dropout=0.2, return_sequences=True))
    
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding='same', data_format='channels_last'))
    #model.add(TimeDistributed(Dropout(0.2)))
    
    model.add(Flatten()) 
    
    model.add(Dense(len(CLASSES_LIST), activation = "softmax"))
    
    ########################################################################################################################
     
    # Display the model's summary.
    model.summary()
    
    # Return the constructed convlstm model.
    return model

In [None]:
# Construct the required convlstm model.
convlstm_model = create_convlstm_model()

# Display the success message. 
print("Model Created Successfully!")

In [None]:
# Plot the structure of the constructed model.
plot_model(convlstm_model, to_file = 'convlstm_model_structure_plot.png', show_shapes = True, show_layer_names = True)


### Justification for the choice of number of layers, activation function used  and Hyper Parameters

To build the action recognizer in this Group 186 assignment, we use a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) network, which lets the model exploit the spatio-temporal structure of the videos.

#### Number of Layers: 
* To construct the model, we will use Keras **ConvLSTM2D recurrent** layers. The ConvLSTM2D layer also takes in the number of filters and kernel size required for applying the convolutional operations. The output of the layers is flattened in the end and is fed to the Dense layer with softmax activation which outputs the probability of each action category.

* We will also use **MaxPooling3D layers** to reduce the dimensions of the frames and avoid unnecessary computations and Dropout layers to prevent overfitting the model on the data. 
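The effect of the convolution and pooling layers on the frame dimensions can be traced by hand. The sketch below assumes the settings used in the model cell (3x3 kernels with the default `'valid'` padding, and `(1, 2, 2)` pooling with `padding='same'`); `convlstm_stack_shape` is a hypothetical helper.

```python
import math


def convlstm_stack_shape(height, width, frames, filters_per_block, kernel=3, pool=2):
    """Trace (frames, height, width, channels) after each ConvLSTM2D + pooling block."""
    shapes = []
    for filters in filters_per_block:
        # A 'valid' convolution trims kernel - 1 pixels from each spatial axis.
        height, width = height - (kernel - 1), width - (kernel - 1)
        # Pooling with padding='same' rounds the halved size up.
        height, width = math.ceil(height / pool), math.ceil(width / pool)
        # The time axis is untouched because its pool size is 1.
        shapes.append((frames, height, width, filters))
    return shapes


shapes = convlstm_stack_shape(224, 224, 5, [4, 8, 14, 16])
for shape in shapes:
    print(shape)
# (5, 111, 111, 4)
# (5, 55, 55, 8)
# (5, 27, 27, 14)
# (5, 13, 13, 16)

frames, h, w, c = shapes[-1]
print(frames * h * w * c)  # 13520 values enter the Dense layer after Flatten
```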

#### Activation Function - softmax
* The output of the layers is flattened in the end and is fed to the Dense layer with **softmax** activation which outputs the probability of each action category.
* This architecture is a simple one and has a small number of trainable parameters. This is because we are only dealing with a small subset of the dataset which does not require a large-scale model.
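To make "a small number of trainable parameters" concrete, the count can be estimated by hand. The sketch below assumes the standard ConvLSTM2D layout (four gates, each with an input kernel, a recurrent kernel, and a bias) and a flattened size of 5 x 13 x 13 x 16 for 224x224 inputs; the helper names are hypothetical, and the result should be compared against `model.summary()`.

```python
def convlstm2d_params(in_channels, filters, kernel=3):
    """Trainable parameters of one ConvLSTM2D layer with bias."""
    # 4 gates, each convolving the input and the hidden state, plus a bias.
    return 4 * filters * (kernel * kernel * (in_channels + filters) + 1)


def dense_params(inputs, units):
    """Trainable parameters of a fully connected layer with bias."""
    return inputs * units + units


# Channel counts: RGB input followed by the four ConvLSTM2D layers.
channels = [3, 4, 8, 14, 16]
conv_counts = [convlstm2d_params(c_in, c_out)
               for c_in, c_out in zip(channels, channels[1:])]
print(conv_counts)  # [1024, 3488, 11144, 17344]

# Assumed flattened size: 5 frames of 13x13 maps with 16 channels, 5 classes.
flattened = 5 * 13 * 13 * 16
print(sum(conv_counts) + dense_params(flattened, 5))  # 100605
```

Under these assumptions most of the parameters sit in the final Dense layer rather than in the recurrent convolutional layers.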

#### Model Choice
* We use an **LSTM network** because it is designed for sequential data: its internal state carries information from previous inputs when generating each output. A plain RNN would be an alternative, but LSTMs handle longer dependencies better. This is why LSTMs are used for problems such as time series prediction, speech recognition, language translation, and music composition. Here we only explore their role in developing better action recognition models.
* We build the network with the Keras **Sequential** API, which stacks the layers linearly. Another choice would be a Long-term Recurrent Convolutional Network (LRCN), which combines CNN and LSTM layers in a single model.



## Step iv. Model Compilation 

• Compile the model with the appropriate loss function.

• Use an appropriate optimizer - we used Adam

• Justify the choice of learning rate, optimizer, loss function and any other hyper parameter used.

In [None]:
# Create an Instance of Early Stopping Callback
early_stopping_callback = EarlyStopping(monitor = 'val_loss', patience = 10, mode = 'min', restore_best_weights = True)
print(f"Created Early Stopping Callback")
# Cleanup garbage
gc.collect()
# Compile the model and specify loss function, optimizer and metrics values to the model
convlstm_model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ["accuracy"])
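The early stopping rule interacts with the number of epochs: with `patience = 10`, the callback cannot trigger during a run shorter than 11 epochs. The sketch below is a simplified model of a min-mode patience rule (it ignores options such as `min_delta`), run on made-up validation losses; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epochs_run, best_epoch) for a min-mode early-stopping rule."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


losses = [1.6, 1.4, 1.5, 1.45, 1.5]  # made-up validation losses
print(early_stopping_epoch(losses, patience=10))  # (5, 2): never triggers
print(early_stopping_epoch(losses, patience=2))   # (4, 2): stops after epoch 4
```

A smaller patience, or a longer training run, would be needed for the callback to have any effect.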


### Justification for the Model choices and Hyper Parameters:
#### Loss function
*   Categorical cross-entropy is the standard loss function for multi-class classification problems with one-hot encoded labels.
*   In this case, we can see the model performed well, achieving a classification accuracy of about 74% on the training dataset.
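As a worked example of how the softmax output and the categorical cross-entropy loss fit together, the sketch below computes both in NumPy for a single made-up sample. It is an illustration of the formulas, not the Keras implementation.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Per-sample loss: minus the log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob), axis=-1)


logits = np.array([[2.0, 1.0, 0.1]])  # made-up scores for 3 classes
y_true = np.array([[1.0, 0.0, 0.0]])  # the true class is class 0
probs = softmax(logits)

print(probs.round(3))                           # roughly [[0.659 0.242 0.099]]
print(categorical_crossentropy(y_true, probs))  # roughly [0.417]
```

The loss shrinks toward 0 as the probability of the true class approaches 1.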

#### Optimizer
*   Adam often converges quickly and typically needs little hyperparameter tuning, which makes it a common default choice. Passing `optimizer = 'Adam'` uses the Keras default learning rate of 0.001.

#### Metrics 
* A metric is a function that is used to judge the performance of your model. **Accuracy** describes how the model performs across all classes and is most useful when all classes are of equal importance. It is calculated as the ratio of the number of correct predictions to the total number of predictions.
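The accuracy calculation can be sketched directly from one-hot labels and predicted probabilities. The values below are made up for illustration, and `accuracy` is a hypothetical helper.

```python
import numpy as np


def accuracy(y_true_one_hot, y_prob):
    """Fraction of samples whose most probable class matches the true class."""
    true_classes = y_true_one_hot.argmax(axis=1)
    predicted_classes = y_prob.argmax(axis=1)
    return float(np.mean(true_classes == predicted_classes))


y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.5, 0.3, 0.2],   # wrong: true class is 2, predicted 0
                   [0.2, 0.5, 0.3]])
print(accuracy(y_true, y_prob))  # 0.75
```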


## Step v. Model Training

• Train the model for an appropriate number of epochs - the training cell below uses 5 epochs

• Print the train and validation loss for each epoch. Use the appropriate batch size.

• Plot the loss and accuracy history graphs for both train and validation set.

• Print the total time taken for training.

In [None]:
import time

# Start training the model.


start = time.time()
convlstm_model_training_history = convlstm_model.fit(x = features_train, y = labels_train, epochs = 5, batch_size = 10,shuffle = True, validation_split = 0.2, callbacks = [early_stopping_callback])
end = time.time()
total_time = end-start
print ("Total time taken for training (in seconds): ", ("%.2f" % total_time))


In [None]:
# Evaluate the trained model.
model_evaluation_history = convlstm_model.evaluate(features_test, labels_test)

In [None]:
# Get the loss and accuracy from model_evaluation_history.
model_evaluation_loss, model_evaluation_accuracy = model_evaluation_history

# Define the string date format.
# Get the current Date and Time in a DateTime Object.
# Convert the DateTime object to string according to the style mentioned in date_time_format string.
date_time_format = '%Y_%m_%d__%H_%M_%S'
current_date_time_dt = dt.datetime.now()
current_date_time_string = dt.datetime.strftime(current_date_time_dt, date_time_format)

# Define a useful name for our model to make it easy for us while navigating through multiple saved models.
model_file_name = f'convlstm_model___Date_Time_{current_date_time_string}___Loss_{model_evaluation_loss}___Accuracy_{model_evaluation_accuracy}.h5'

# Save your Model.
convlstm_model.save(model_file_name)

In [None]:
def plot_metric(model_training_history, metric_name_1, metric_name_2, plot_name):
    '''
    This function will plot the metrics passed to it in a graph.
    Args:
        model_training_history: A history object containing a record of training and validation 
                                loss values and metrics values at successive epochs
        metric_name_1:          The name of the first metric that needs to be plotted in the graph.
        metric_name_2:          The name of the second metric that needs to be plotted in the graph.
        plot_name:              The title of the graph.
    '''
    
    # Get metric values using metric names as identifiers.
    metric_value_1 = model_training_history.history[metric_name_1]
    metric_value_2 = model_training_history.history[metric_name_2]
    
    # Construct a range object which will be used as x-axis (horizontal plane) of the graph.
    epochs = range(len(metric_value_1))

    # Plot the Graph.
    plt.plot(epochs, metric_value_1, 'blue', label = metric_name_1)
    plt.plot(epochs, metric_value_2, 'red', label = metric_name_2)

    # Add title to the plot.
    plt.title(str(plot_name))

    # Add legend to the plot.
    plt.legend()

In [None]:
# Visualize the training and validation loss metrics.
plot_metric(convlstm_model_training_history, 'loss', 'val_loss', 'Total Loss vs Total Validation Loss')

In [None]:
# Visualize the training and validation accuracy metrics.
plot_metric(convlstm_model_training_history, 'accuracy', 'val_accuracy', 'Total Accuracy vs Total Validation Accuracy') 

## Step vi. Model Evaluation

• Take 5 random data from the test set and perform activity recognition.

• Print confusion matrix and classification report for the test data.

In [None]:
# To perform activity recognition, pick 5 random samples from the test set.
n = 5
random_test_indices = random.sample(range(len(features_test)), n)
print("5 random sample indices from the test set:", random_test_indices)

# Build a lookup of the class names selected earlier.
from tensorflow import keras
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(CLASSES_LIST)
)
label_processor.get_vocabulary()



In [None]:
# Function to extract Features from a video file.
def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()


In [None]:
# Predict the activity from the frame grab

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")
#    probabilities = convlstm_model.predict([frame_features, frame_mask])[0]

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    # Extract SEQUENCE_LENGTH frames from the video at the given path.
    frames = np.array(frames_extraction(path))

    # Predict class probabilities; [0] drops the batch dimension.
    probabilities = convlstm_model.predict(np.expand_dims(frames, axis=0))[0]

    # Print the classes from most to least probable.
    # Label indices follow the order of CLASSES_LIST used in create_dataset.
    for i in np.argsort(probabilities)[::-1]:
        print(f"   {CLASSES_LIST[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization, referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
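def to_gif(images):
    import imageio  # not imported at the top of the notebook
    # Frames from frames_extraction are scaled to [0, 1], so rescale to 0-255 before casting.
    converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, fps=10)
    return embed.embed_file("animation.gif")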


# Run prediction on 5 randomly chosen videos.
# Note: video_files_paths covers the whole dataset, not only the test split.
for vdo in random.sample(video_files_paths, 5):
  print(f"Video path: {vdo}")
  test_frames = sequence_prediction(vdo)
# to_gif(test_frames[:MAX_SEQ_LENGTH])
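Ranking the classes for a single video comes down to sorting one row of the prediction output. The sketch below uses made-up probabilities and three example class names; `rank_classes` is a hypothetical helper. Note that the model returns an array of shape `(1, num_classes)`, so the batch dimension has to be dropped before sorting.

```python
import numpy as np


def rank_classes(probabilities, class_names):
    """Return (class_name, probability) pairs, most probable first."""
    order = np.argsort(probabilities)[::-1]  # indices from highest to lowest
    return [(class_names[i], float(probabilities[i])) for i in order]


class_names = ["VolleyballSpiking", "PushUps", "PullUps"]
batch_output = np.array([[0.2, 0.7, 0.1]])  # shape (1, 3), like a predict() result

for name, probability in rank_classes(batch_output[0], class_names):
    print(f"{name}: {probability * 100:.2f}%")
# PushUps: 70.00%
# VolleyballSpiking: 20.00%
# PullUps: 10.00%
```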


In [None]:
# Now let's print the confusion matrix and classification report for the test set.
predicted = convlstm_model.predict(features_test)
y_pred = predicted.argmax(axis=1)
# labels_test is one-hot encoded, so convert it back to class indices.
y_true = labels_test.argmax(axis=1)
class_indices = list(range(len(CLASSES_LIST)))
print(classification_report(y_true, y_pred, labels=class_indices, target_names=CLASSES_LIST, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=class_indices))
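To show what the confusion matrix contains, here is a minimal NumPy sketch that builds one from class indices, with rows for true classes and columns for predicted classes. The labels are made up, and `confusion_matrix_np` is a hypothetical helper rather than the scikit-learn function.

```python
import numpy as np


def confusion_matrix_np(y_true, y_pred, num_classes):
    """Count samples for each (true class, predicted class) pair."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true_class, predicted_class in zip(y_true, y_pred):
        matrix[true_class, predicted_class] += 1
    return matrix


y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 0])

cm = confusion_matrix_np(y_true, y_pred, num_classes=3)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 1]]

# Per-class recall: correct predictions divided by the samples of that class.
print(cm.diagonal() / cm.sum(axis=1))  # [1.  0.5 1. ]
```

Off-diagonal entries show which classes are confused with each other, which overall accuracy alone does not reveal.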