<a href="https://colab.research.google.com/github/nadgir-praveen/data-science-lab/blob/main/mini_projects/PN_M4_NB_MiniProject_3_Video_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook: Video based Action Classification using LSTM

## Learning Objectives

At the end of the experiment, you will be able to:

* extract frames from a video
* build a CNN model to extract features from the video frames
* train an LSTM/GRU model to perform action classification

## Information

**Background:** The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.

CNN LSTMs were developed for visual time series prediction problems and for generating textual descriptions from sequences of images (e.g. videos). Specifically, they target the problems of:

*   Activity Recognition: Generating a textual description of an activity demonstrated in a sequence of images.
*   Image Description: Generating a textual description of a single image.
*   Video Description: Generating a textual description of a sequence of images.

**Applications:** Applications such as surveillance, video retrieval and human-computer interaction require methods for recognizing human actions in various scenarios. In robotics, tasks such as autonomous navigation or social interaction could also take advantage of the knowledge extracted from live video recordings. Typical scenarios include scenes with cluttered, moving backgrounds, a non-stationary camera, scale variations, individual variations in the appearance and clothing of people, changes in lighting and viewpoint, and so forth. All of these conditions introduce challenging problems that can be addressed using deep learning (computer vision) models.

## Dataset



**Dataset:** This dataset consists of labelled videos of 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4 as illustrated below.

![img](https://cdn.iisc.talentsprint.com/CDS/Images/actions.gif)

All sequences were taken over homogeneous backgrounds with a static camera at a frame rate of 25 fps. The sequences were downsampled to a spatial resolution of 160x120 pixels and have a length of four seconds on average. In summary, there are 25x6x4=600 video files, one for each combination of 25 subjects, 6 actions and 4 scenarios. For this mini-project, we have randomly selected 20% of the data as the test set.

Dataset source: https://www.csc.kth.se/cvap/actions/
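The counts in the description can be checked with a few lines of arithmetic. The sketch below only restates numbers given above (25 subjects, 6 actions, 4 scenarios, 25 fps, roughly four seconds per clip, a 20% test split). The variable names are illustrative, and the per-video frame count is an approximation because clip lengths vary.

```python
# Dataset bookkeeping from the description above (names are illustrative).
subjects, actions, scenarios = 25, 6, 4
total_videos = subjects * actions * scenarios      # one video per combination

test_fraction = 0.20
test_videos = round(total_videos * test_fraction)  # videos held out for testing
train_videos = total_videos - test_videos          # videos left for training

fps = 25
avg_duration_s = 4                                 # "four seconds on average"
approx_frames_per_video = fps * avg_duration_s     # rough frame count per clip

print(total_videos, train_videos, test_videos, approx_frames_per_video)
```

The file counts printed later in the notebook are the ones to trust for the downloaded archive; this calculation only gives the nominal split.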

**Methodology:**

When performing image classification, we:

* input an image to our CNN;
* obtain the predictions from the CNN;
* choose the label with the largest corresponding probability.

Since a video is just a series of image frames, the simplest approach to video classification is to:

* loop over all frames in the video file;
* pass each frame through the CNN;
* classify each frame individually and independently of the others;
* choose the label with the largest corresponding probability;
* label the frame and write the output frame to disk.

Refer to this [Video Classification using Keras](https://medium.com/video-classification-using-keras-and-tensorflow/action-recognition-and-video-classification-using-keras-and-tensorflow-56badcbe5f77) article for a complete explanation and an implementation example of video classification.
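The per-frame procedure described in the methodology can be sketched with NumPy alone. The probabilities below are made-up placeholders standing in for CNN softmax outputs, and the three class names are only a subset chosen for illustration. Averaging the frame probabilities to get one label per video is an assumption of this sketch, not something the methodology above prescribes.

```python
import numpy as np

# Hypothetical per-frame softmax outputs for one video:
# 4 frames (rows) x 3 classes (columns). Values are made up for illustration.
class_names = ["boxing", "running", "walking"]
frame_probs = np.array([
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.05, 0.25, 0.70],
    [0.10, 0.20, 0.70],
])

# Per-frame decision: pick the most probable class for each frame independently.
frame_labels = [class_names[i] for i in frame_probs.argmax(axis=1)]

# One simple way to get a single video-level label: average the frame
# probabilities and take the argmax (majority voting is another option).
video_label = class_names[frame_probs.mean(axis=0).argmax()]

print(frame_labels)
print(video_label)
```

Because each frame is classified in isolation, the per-frame labels can disagree with one another. The CNN-LSTM model built in this project instead looks at the frames as a sequence.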

## Problem Statement

Train a CNN-LSTM based deep neural network to recognize the action being performed in a video.

## Grading = 10 Points

### Install and re-start the runtime

In [None]:
!pip3 install imageio==2.4.1



In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Actions.zip
# -o overwrites existing files without prompting, so the cell can be re-run safely
!unzip -o -qq Actions.zip
print("Dataset downloaded successfully!!")

Dataset downloaded successfully!!


### Import required packages

In [None]:
import keras
from keras import applications
from keras import optimizers
from keras.models import Sequential, Model
from keras.applications.vgg16 import VGG16
# Layers used in this notebook, imported once and by name
from keras.layers import (
    Input, Dense, Dropout, Flatten,
    Conv2D, BatchNormalization, MaxPool2D, MaxPooling2D,
    GlobalAveragePooling2D, GlobalMaxPool2D,
    TimeDistributed, LSTM, GRU,
)
from tensorflow.keras.optimizers import Adam

import os, glob
import cv2
import numpy as np
import math
import pandas as pd
from matplotlib import pyplot as plt

### Load the data and generate frames of video (2 points)

Detecting an action is possible by analyzing a series of images, called frames, captured over time.

Hint: Refer to the data preparation section in [keras_video_classification](https://keras.io/examples/vision/video_classification/)


In [None]:
data_dir = "/content/Actions/train/"
test_data_dir = "/content/Actions/test/"
video_directory = "/content/images/"

# YOUR CODE HERE

In [None]:
# List all video files; each video's parent folder name is its class label
test_video_files = glob.glob('/content/Actions/test/*/*')
print(f'test video files are {len(test_video_files)}')

train_video_files = glob.glob('/content/Actions/train/*/*')
print(f'train video files are {len(train_video_files)}')

test_dataset = pd.DataFrame(data={'video': test_video_files})
test_dataset['label'] = test_dataset['video'].apply(lambda d: os.path.basename(os.path.dirname(d)))

train_dataset = pd.DataFrame(data={'video': train_video_files})
train_dataset['label'] = train_dataset['video'].apply(lambda d: os.path.basename(os.path.dirname(d)))




test video files are 120
train video files are 479


In [None]:
# # The following two methods are taken from this tutorial:
# # https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


# def crop_center_square(frame):
#     y, x = frame.shape[0:2]
#     min_dim = min(y, x)
#     start_x = (x // 2) - (min_dim // 2)
#     start_y = (y // 2) - (min_dim // 2)
#     return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


# def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
#     cap = cv2.VideoCapture(path)
#     frames = []
#     try:
#         while True:
#             ret, frame = cap.read()
#             if not ret:
#                 break
#             frame = crop_center_square(frame)
#             frame = cv2.resize(frame, resize)
#             frame = frame[:, :, [2, 1, 0]]
#             frames.append(frame)

#             if len(frames) == max_frames:
#                 break
#     finally:
#         cap.release()
#     return np.array(frames)

In [None]:
test_video_files[0].split('/')[5].split('.')[0]
!rm -rf /content/images

In [None]:
os.makedirs("/content/images/train")
os.makedirs("/content/images/test")


In [None]:
# YOUR CODE HERE

def extract_frames(video_path, frame_rate=1, max_frames=25, resize=(160, 120)):
    """
    Read up to `max_frames` frames from a video and return the sampled frames.

    Args:
        video_path (str): Path to the input video file.
        frame_rate (int): Sampling step; every `frame_rate`-th frame is kept. Default is 1 (keep every frame).
        max_frames (int): Maximum number of frames to read from the video.
        resize (tuple): Target size passed to cv2.resize as (width, height).

    Returns:
        tuple: (frames, frame_count), where frames is an array of shape
            (num_kept_frames, height, width, channels) in BGR channel order
            and frame_count is the number of frames read.
    """
    video_capture = cv2.VideoCapture(video_path)
    frames = []
    frame_count = 0

    while True:
        ret, frame = video_capture.read()
        if not ret:
            break

        if frame_count % frame_rate == 0:
            frame = cv2.resize(frame, resize)
            frames.append(frame)

        frame_count += 1
        if frame_count == max_frames:
            break

    video_capture.release()
    frames = np.array(frames)  # Convert list of frames to NumPy array
    return frames, frame_count
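
How `frame_rate` and `max_frames` interact in `extract_frames` is easy to misread, so here is a small pure-Python sketch of the same sampling rule. The helper name `kept_frame_indices` is hypothetical and not part of the notebook. It assumes `max_frames >= 1` and mirrors the loop above: at most `max_frames` frames are read, and a frame is kept when its index is a multiple of `frame_rate`.

```python
def kept_frame_indices(total_frames, frame_rate=1, max_frames=25):
    """Indices of the frames that a loop like extract_frames would keep.

    Assumes max_frames >= 1. At most `max_frames` frames are read, and a
    frame is kept when its index is a multiple of `frame_rate`.
    """
    frames_read = min(total_frames, max_frames)
    return [i for i in range(frames_read) if i % frame_rate == 0]


# A clip of about 100 frames (4 s at 25 fps), reading the first 50 frames:
print(len(kept_frame_indices(100, frame_rate=1, max_frames=50)))  # every frame kept
print(kept_frame_indices(100, frame_rate=5, max_frames=50))       # every 5th frame
print(len(kept_frame_indices(30, frame_rate=1, max_frames=50)))   # clip shorter than max_frames
```

With `frame_rate` greater than 1, fewer than `max_frames` frames are returned, because `max_frames` limits the frames read rather than the frames kept. At 25 fps, reading 50 frames covers only the first two seconds of a clip.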



In [None]:
MAX_FRAMES = 50
def preprocess_videos(video_df, path):
    """Load the first MAX_FRAMES frames of every video listed in video_df.

    Videos with fewer than MAX_FRAMES frames are skipped so that all
    sequences have the same length. `path` is currently unused.
    """
    all_frames = []
    all_labels = []
    for index, row in video_df.iterrows():
        video_name = row.video
        tag = row.label
        frames, count = extract_frames(video_name, frame_rate=1, max_frames=MAX_FRAMES)
        if count == MAX_FRAMES:
            all_frames.append(frames)
            all_labels.append(tag)

    # Convert lists to numpy arrays
    all_frames = np.array(all_frames)
    all_labels = np.array(all_labels)

    return all_frames, all_labels


X, y = preprocess_videos(train_dataset,"/content/images/train")
test_frames, test_lables = preprocess_videos(test_dataset,"/content/images/test")



In [None]:
print(f"Frames shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Frames shape: {test_frames.shape}")
print(f"Labels shape: {test_lables.shape}")

Frames shape: (479, 50, 120, 160, 3)
Labels shape: (479,)
Frames shape: (120, 50, 120, 160, 3)
Labels shape: (120,)
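
The shapes printed above translate directly into memory use, which matters for the training step later. The sketch below computes the size of a dense array from its shape and dtype. The helper name `array_gib` is illustrative. The `float32` figure is what the same data would occupy if it were cast to floats, for example for normalization.

```python
import numpy as np

def array_gib(shape, dtype):
    """Size in GiB of a dense array with the given shape and dtype."""
    n_bytes = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    return n_bytes / 2**30

train_shape = (479, 50, 120, 160, 3)   # training frames shape printed above
print(f"uint8:   {array_gib(train_shape, np.uint8):.2f} GiB")
print(f"float32: {array_gib(train_shape, np.float32):.2f} GiB")
```

Keeping the frames as `uint8` until they reach the model is therefore about four times cheaper than storing them as `float32`.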


In [None]:
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(y)
y = to_categorical(le.transform(y))
print(y.shape)

test_lables = to_categorical(le.transform(test_lables))
print(test_lables.shape)


(479, 6)
(120, 6)
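
For intuition, the `LabelEncoder` plus `to_categorical` step can be reproduced with NumPy alone. The labels below are toy values, not the dataset's folder names. The sketch shows that class indices follow the sorted order of the unique labels and how a one-hot row (or a softmax prediction) maps back to a class name.

```python
import numpy as np

# Toy string labels (placeholders, not the actual dataset labels).
labels = np.array(["walking", "boxing", "walking", "running"])

# Sorted unique classes and the integer index of each sample's class.
classes, indices = np.unique(labels, return_inverse=True)

# One-hot encoding: row i of the identity matrix represents class i.
one_hot = np.eye(len(classes), dtype="float32")[indices]

# Decoding: argmax over the class axis, then look up the class name.
decoded = classes[one_hot.argmax(axis=1)]

print(classes)
print(one_hot)
print(decoded)
```

In the notebook itself, the equivalent decoding step would be `le.inverse_transform(y.argmax(axis=1))`.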


#### Visualize the frames and analyze the object in each frame. (1 point)

* Plot the frames of each class per row (6 rows)
* Plot the title as label on each subplot
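
One possible way to approach the plotting task is sketched below. The function name `plot_frames_per_class` and its parameters are hypothetical. It assumes the videos are stored as a 5-D array of BGR frames (as returned by OpenCV) and that the labels are class-name strings rather than one-hot vectors.

```python
import numpy as np
from matplotlib import pyplot as plt

def plot_frames_per_class(videos, labels, frames_per_row=5):
    """Plot evenly spaced frames from one video of each class, one class per row.

    videos: array of shape (num_videos, num_frames, height, width, 3), BGR order.
    labels: 1-D sequence of class names (strings), one per video.
    """
    labels = np.asarray(labels)
    class_names = np.unique(labels)
    fig, axes = plt.subplots(len(class_names), frames_per_row,
                             figsize=(3 * frames_per_row, 2.5 * len(class_names)),
                             squeeze=False)
    for row, name in enumerate(class_names):
        video = videos[np.flatnonzero(labels == name)[0]]   # first video of this class
        picks = np.linspace(0, len(video) - 1, frames_per_row).astype(int)
        for col, idx in enumerate(picks):
            axes[row, col].imshow(video[idx][..., ::-1])    # BGR -> RGB for matplotlib
            axes[row, col].set_title(f"{name} (frame {idx})")
            axes[row, col].axis("off")
    fig.tight_layout()
    return fig, axes
```

Since `y` is one-hot encoded by this point in the notebook, a call could look like `plot_frames_per_class(X, le.inverse_transform(y.argmax(axis=1)))`, which yields six rows, one per action.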

### Create the Neural Network (4 points)

We can build the model in several ways: wrap a well-known pre-trained model in a `TimeDistributed` layer, or build our own ConvNet.

With a custom ConvNet, each frame of the input sequence passes through the same convolutional network. The per-frame features are then combined over time to infer the class of the whole video.

* Use a ConvNet wrapped in `TimeDistributed` to extract features from each frame.
* Feed the `TimeDistributed` output to a GRU or LSTM, which treats it as a time series.
* Apply Dense (fully connected) layers to make the final classification.

##### Build the ConvNet for feature extraction, GRU/LSTM layers for the time series, and Dense layers for classification

#### Setup the parameters and train the model with epochs, batch wise

* Use train data to fit the model and test data for validation
* Configure batch size and epochs
* Plot the loss of train and test data
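
For the loss plot, a minimal sketch is shown below. It assumes the dictionary stored in the `history` attribute of the object returned by Keras `model.fit`, which holds a `"loss"` list and, when validation data is supplied, a `"val_loss"` list. The function name `plot_loss` is illustrative.

```python
from matplotlib import pyplot as plt

def plot_loss(history_dict):
    """Plot training and validation loss per epoch from a Keras History.history dict."""
    epochs_range = range(1, len(history_dict["loss"]) + 1)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(epochs_range, history_dict["loss"], label="train loss")
    if "val_loss" in history_dict:   # present only if validation data was passed to fit()
        ax.plot(epochs_range, history_dict["val_loss"], label="validation loss")
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.legend()
    return fig, ax
```

After training, this would be called as `plot_loss(history.history)`.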

In [None]:
# Note: There will be a high memory requirement for the training steps below.
# You should work on a GPU/TPU based runtime. See 'Change Runtime' in Colab
# Training time for each epoch could be ~30 mins
# To save and re-load your model later, see the reference below:
# https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb

# YOUR CODE HERE

# import tensorflow as tf
# from tensorflow.keras.models import Model
# from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, TimeDistributed, LSTM
# from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split

# Parameters
num_classes = 6  # Number of classes in your dataset (change as needed)
frame_height = 120  # Height of each frame
frame_width = 160  # Width of each frame
channels = 3  # Number of channels (e.g., 3 for RGB)
sequence_length = None  # Set to None to handle variable-length sequences
batch_size = 16  # Batch size for training
epochs = 10  # Number of epochs

# CNN + RNN Model
input_shape = (sequence_length, frame_height, frame_width, channels)
inputs = Input(shape=input_shape)

# TimeDistributed CNN
cnn = TimeDistributed(Conv2D(32, (3, 3), activation='relu'))(inputs)
cnn = TimeDistributed(MaxPooling2D((2, 2)))(cnn)
cnn = TimeDistributed(Conv2D(64, (3, 3), activation='relu'))(cnn)
cnn = TimeDistributed(MaxPooling2D((2, 2)))(cnn)
cnn = TimeDistributed(Conv2D(128, (3, 3), activation='relu'))(cnn)
cnn = TimeDistributed(MaxPooling2D((2, 2)))(cnn)
cnn = TimeDistributed(Flatten())(cnn)

# LSTM
lstm = LSTM(256, return_sequences=False)(cnn)

# Fully Connected
dense = Dense(512, activation='relu')(lstm)
outputs = Dense(num_classes, activation='softmax')(dense)

# Model
model = Model(inputs, outputs)

# Compile
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Summary
model.summary()

# Prepare data for training
# X = np.expand_dims(X, axis=0)  # Add batch dimension


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=epochs, batch_size=batch_size)

# Evaluate the model
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Loss: {loss}")
print(f"Validation Accuracy: {accuracy}")



Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None, 120, 160,   0         
                              3)]                                
                                                                 
 time_distributed (TimeDist  (None, None, 118, 158,    896       
 ributed)                    32)                                 
                                                                 
 time_distributed_1 (TimeDi  (None, None, 59, 79, 32   0         
 stributed)                  )                                   
                                                                 
 time_distributed_2 (TimeDi  (None, None, 57, 77, 64   18496     
 stributed)                  )                                   
                                                                 
 time_distributed_3 (TimeDi  (None, None, 28, 38, 64   0     
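
The output shapes and parameter counts in the summary follow from simple arithmetic, which the sketch below reproduces for the three convolution and pooling stages defined in the code. It assumes 3x3 convolutions without padding (the Keras default `padding='valid'`) and 2x2 max pooling, as in the model above. The helper names are illustrative. Values for layers past the point where the printed summary ends are computed here, not copied from the notebook output.

```python
def conv_out(size, kernel=3):
    """Output size of an unpadded ('valid'), stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of non-overlapping max pooling (floor division)."""
    return size // pool

def conv_params(in_channels, filters, kernel=3):
    """Weights plus biases of a 2-D convolution layer."""
    return (kernel * kernel * in_channels + 1) * filters

h, w, c = 120, 160, 3   # frame height, width, channels
for filters in (32, 64, 128):
    print(f"Conv2D({filters}): params={conv_params(c, filters)}")
    h, w, c = conv_out(h), conv_out(w), filters
    print(f"  after conv: ({h}, {w}, {c})")
    h, w = pool_out(h), pool_out(w)
    print(f"  after pool: ({h}, {w}, {c})")

flattened = h * w * c
print(f"Flatten: {flattened} features per frame")
```

The first stages match the shapes and parameter counts visible in the summary (118x158x32 with 896 parameters, then 57x77x64 with 18496). The flattened vector is large, so most of the model's parameters sit in the LSTM that consumes it.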

### Use pre-trained model for feature extraction (3 points)

To create a deep learning network for video classification:

* Convert videos to sequences of feature vectors using a pretrained convolutional neural network, such as VGG16, to extract features from each frame.

* Train an LSTM network on the sequences to predict the video labels.

* Assemble a network that classifies videos directly by combining layers from both networks.

Hint: [VGG-16 CNN and LSTM](https://riptutorial.com/keras/example/29812/vgg-16-cnn-and-lstm-for-video-classification)
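
As a starting point, the sketch below shows one way to combine a pre-trained VGG16 with an LSTM in a single Keras model. It is an illustration under stated assumptions, not the reference solution: the function name `build_vgg16_lstm`, the LSTM width of 128, the fixed sequence length of 50 and the choice to freeze VGG16 rather than fine-tune it are all choices made for this example.

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Input, LSTM, TimeDistributed
from keras.models import Model

def build_vgg16_lstm(seq_len=50, height=120, width=160, num_classes=6,
                     weights="imagenet"):
    """VGG16 as a frozen per-frame feature extractor, followed by an LSTM classifier."""
    # include_top=False drops the ImageNet classifier; pooling="avg" turns the
    # last convolutional feature map into one 512-value vector per frame.
    base = VGG16(include_top=False, weights=weights,
                 input_shape=(height, width, 3), pooling="avg")
    base.trainable = False                 # keep the pre-trained filters fixed

    video_in = Input(shape=(seq_len, height, width, 3))
    x = TimeDistributed(base)(video_in)    # (batch, seq_len, 512)
    x = LSTM(128)(x)                       # summarize the frame sequence
    video_out = Dense(num_classes, activation="softmax")(x)

    model = Model(video_in, video_out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_vgg16_lstm()` downloads the ImageNet weights on first use. Frames should be passed through `keras.applications.vgg16.preprocess_input` first; that function expects RGB input, whereas frames read with OpenCV are in BGR order. To fine-tune, some of the later VGG16 layers can be set back to trainable once the LSTM head has converged.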

#### Load and fine-tune the pre-trained model

In [None]:
# YOUR CODE HERE

#### Setup the parameters and train the model with epochs, batch wise

* Use train data to fit the model and test data for validation
* Configure batch size and epochs
* Plot the loss of train and test data

In [None]:
# YOUR CODE HERE

### Report Analysis

* Discuss the FPS, number of frames and duration of each video
* Analyze the impact of the LSTM, GRU and TimeDistributed layers
* Discuss model convergence for the pre-trained model versus the custom ConvNet
* *Additional Reading*: Read about and discuss the use of Conv3D in video classification
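
When comparing LSTM and GRU layers, parameter count is one concrete difference to report. The sketch below uses the standard formulas for Keras recurrent layers: an LSTM has four weight blocks (three gates plus the candidate cell state), while a GRU has three (update gate, reset gate and candidate state). The GRU formula assumes the Keras default `reset_after=True`, which uses two bias vectors per block. The input size of 13 x 18 x 128 is the flattened per-frame feature size computed for the custom ConvNet in this notebook (120x160 input, three unpadded 3x3 conv and 2x2 pool stages).

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a Keras LSTM layer (four weight blocks, one bias each)."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    """Trainable parameters of a Keras GRU layer with reset_after=True (three blocks, two biases each)."""
    return 3 * (units * (input_dim + units) + 2 * units)

input_dim = 13 * 18 * 128   # flattened per-frame features of the custom ConvNet
units = 256                 # recurrent units, as in the LSTM(256) layer above

print(f"LSTM(256): {lstm_params(input_dim, units):,}")
print(f"GRU(256):  {gru_params(input_dim, units):,}")
```

For the same number of units, the GRU needs roughly three quarters of the LSTM's parameters, which can shorten training. Whether accuracy changes on this dataset has to be checked experimentally.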