# Activity Recognition in Video

Most of this notebook is adapted from [this Keras example](https://keras.io/examples/vision/video_classification/).

The main changes are the use of DeiT and ORB for feature extraction and a different method for sampling frames from the videos in the dataset.

<mark>This notebook runs the model described in the original notebook using **features extracted from the [pre-trained DeiT classifier](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/deit) provided by HuggingFace**.</mark>

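The available DeiT variants are listed in [this table](https://huggingface.co/facebook/deit-small-patch16-224). The checkpoint loaded in the code below is `facebook/deit-base-distilled-patch16-224` (<mark>DeiT-base distilled</mark>), whose 768-dimensional hidden size matches `NUM_FEATURES`.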

As in the original notebook, we use a subset of the [UCF Activity Recognition dataset](https://www.crcv.ucf.edu/data/UCF101.php). The hyperparameters are left unchanged except for the number of epochs and the sampled sequence length.

<mark>NOTE: We have **decreased** the sampled sequence length because extracting features from many frames per video takes a long time.</mark>

*The model configuration is otherwise kept constant; only the input size changes.*

This is because we want to compare the models built using CNN, ORB and DeiT features.

In [1]:
!pip uninstall tensorflow
!pip install tensorflow=='2.7.0'

Found existing installation: tensorflow 2.12.0
Uninstalling tensorflow-2.12.0:
  Would remove:
    /usr/local/bin/estimator_ckpt_converter
    /usr/local/bin/import_pb_to_tensorboard
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python3.9/dist-packages/tensorflow-2.12.0.dist-info/*
    /usr/local/lib/python3.9/dist-packages/tensorflow/*
Proceed (Y/n)? Y
  Successfully uninstalled tensorflow-2.12.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow==2.7.0
  Downloading tensorflow-2.7.0-cp39-cp39-manylinux2010_x86_64.whl (489.7 MB)
Collecting keras<2.8,>=2.7.0rc0
  Downloading keras-2.7.0-py2.py3-none-any.whl (1.3 MB)

In [2]:
!pip install -q git+https://github.com/tensorflow/docs
!pip install transformers
!pip install gdown

  Preparing metadata (setup.py) ... done
  Building wheel for tensorflow-docs (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully inst

## Data collection

To keep the runtime of this example relatively short, we use a
subsampled version of the original UCF101 dataset. Refer to the notebook **dataset_subset_UCF.ipynb** in the repository to see how the subsampling was done.

The dataset is available at the following link:
https://drive.google.com/file/d/1-1Jgmhg-84WbwZ8v9kvpFTDDpVUqRO1V/view?usp=share_link

In [3]:
!gdown 1-1Jgmhg-84WbwZ8v9kvpFTDDpVUqRO1V
!tar -xf /content/ucf101_top10.tar.gz

Downloading...
From: https://drive.google.com/uc?id=1-1Jgmhg-84WbwZ8v9kvpFTDDpVUqRO1V
To: /content/ucf101_top10.tar.gz
100% 1.04G/1.04G [00:19<00:00, 52.8MB/s]


## Setup

In [4]:
from tensorflow_docs.vis import embed
from tensorflow import keras
from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os
from tqdm import tqdm

from transformers import AutoFeatureExtractor, DeiTForImageClassificationWithTeacher

## Define hyperparameters



1. BATCH_SIZE - Number of samples processed in a single training step
2. EPOCHS - Number of full passes over the training set
3. MAX_SEQ_LENGTH - Number of frames sampled from each video
4. NUM_FEATURES - Size of the feature vector produced by the frame encoder (here, DeiT)



In [5]:
BATCH_SIZE = 64
EPOCHS = 50              

MAX_SEQ_LENGTH = 5
NUM_FEATURES = 768

## Data preparation

In [6]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 1171
Total videos for testing: 459


Unnamed: 0,video_name,tag
232,v_Drumming_g08_c01.avi,Drumming
223,v_CricketShot_g24_c06.avi,CricketShot
248,v_Drumming_g11_c01.avi,Drumming
944,v_ShavingBeard_g09_c02.avi,ShavingBeard
288,v_Drumming_g16_c07.avi,Drumming
1067,v_TennisSwing_g10_c01.avi,TennisSwing
422,v_HorseRiding_g19_c07.avi,HorseRiding
772,v_PlayingGuitar_g19_c01.avi,PlayingGuitar
1046,v_ShavingBeard_g24_c06.avi,ShavingBeard
965,v_ShavingBeard_g12_c05.avi,ShavingBeard


We use OpenCV's `VideoCapture` class to read the frames of each video.

As set by the corresponding hyperparameter, we sample MAX_SEQ_LENGTH frames per video (denoted by $M$). To do this, we first read the total number of frames (say, $F$) from the metadata exposed by the `VideoCapture` object.

We then sample with a step size of $\lfloor F/M \rfloor$ (at least 1) and stop once $M$ frames have been sampled. A video with fewer than $M$ frames yields fewer samples.

NOTE: The earlier implementation stored all frames from all videos and then chose the first MAX_SEQ_LENGTH frames to encode. Sampling with the step size above produces a sequence of the same length, spread across the whole video, and is far more memory efficient.
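
The sampling rule can be sketched without opening a video. This is a minimal illustration rather than the loader itself: `sampled_frame_indices` is a hypothetical helper that lists which frame indices would be kept for a clip with `total_frames` frames.

```python
def sampled_frame_indices(total_frames, max_frames):
    """Return (step, indices) for strided sampling of at most max_frames frames."""
    # Step size is floor(F / M), clamped to 1 so short clips still yield frames.
    step = max(total_frames // max_frames, 1)
    # Keep every step-th frame, then cut off once max_frames have been collected.
    indices = list(range(0, total_frames, step))[:max_frames]
    return step, indices


for total in (100, 23, 3):
    step, indices = sampled_frame_indices(total, 5)
    print(f"F={total}: step={step}, frames={indices}")
```

A clip with fewer than `max_frames` frames simply yields fewer samples.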

In [7]:
def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0):
    cap = cv2.VideoCapture(path)
    F = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))    # total number of frames in the video
    M = max_frames if max_frames > 0 else F       # max sequence length; 0 means keep every frame
    S = max(F // M, 1) if M > 0 else 1            # step size, never below 1 (avoids division by zero)

    frames = []
    frame_count = 0
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            if frame_count % S == 0:              # sample only at steps
                frame = crop_center_square(frame)
                frames.append(frame)

            frame_count += 1                      # update number of frames read

            # stop at the end of the video or once enough frames are sampled
            if frame_count == F or len(frames) == M:
                break
    finally:
        cap.release()
    return np.array(frames)


Input pre-processing is handled by HuggingFace's `AutoFeatureExtractor`, which transforms each input as required by the loaded pre-trained model.

We define a forward hook that captures the activation of the class token from the DeiT encoder (before it is used to generate the class and distillation outputs). The class token is chosen because, together with the distillation token, it interacts with all the patch tokens and is used to make the final class prediction.

For more details about the forward hook and the steps for feature extraction with the DeiT model, see the notebook *DeIT_Feature_Extraction_Example.ipynb* provided in the repository.
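
To make the token selection concrete, here is a small sketch with a dummy array standing in for the encoder output. The shape `(1, 198, 768)` is an assumption for a distilled DeiT-base model on a 224×224 input with 16×16 patches: 196 patch tokens plus one class token and one distillation token, with the class token at index 0.

```python
import numpy as np

# Dummy stand-in for the encoder's last hidden state: (batch, tokens, hidden).
# Assumed layout: index 0 = class token, index 1 = distillation token, rest = patches.
num_patches = (224 // 16) ** 2
hidden_state = np.random.rand(1, num_patches + 2, 768).astype("float32")

# Select the class token for every item in the batch.
class_token = hidden_state[:, 0, :]

print(hidden_state.shape)  # (1, 198, 768)
print(class_token.shape)   # (1, 768)
```

The resulting vector per frame has the same size as `NUM_FEATURES`.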

In [8]:
#load the pre-trained model and init the feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/deit-base-distilled-patch16-224')
deit_model = DeiTForImageClassificationWithTeacher.from_pretrained('facebook/deit-base-distilled-patch16-224')

Downloading (…)rocessor_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]



Downloading (…)lve/main/config.json:   0%|          | 0.00/69.6k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/349M [00:00<?, ?B/s]

In [9]:
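# define the hook callable
activation = {}

def getActivation(name):
    # the hook signature expected by register_forward_hook
    def hook(model, input, output):
        try:
            activation[name] = output.detach()
        except AttributeError:
            # non-tensor outputs (e.g. a model output object) have no detach(); store them as is
            activation[name] = output
    return hook

# attach the hook to get the intermediate activation; keep the handle so it can be removed later
deit_encoder = deit_model.deit.register_forward_hook(getActivation('deit_encoder'))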

In [10]:
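# define the feature extractor
import torch

def deit_feature_extractor(image):
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():                 # inference only, no gradients needed
        deit_model(**inputs)              # the forward hook fills `activation`

    # first element of the hooked output is the last hidden state; token 0 is the class token
    feature = activation['deit_encoder'][0][:, 0, :].detach().numpy()
    return feature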

The labels of the videos are strings. Neural networks do not understand string values,
so they must be converted to some numerical form before they are fed to the model. Here
we will use the [`StringLookup`](https://keras.io/api/layers/preprocessing_layers/categorical/string_lookup)
layer to encode the class labels as integers.

In [11]:
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())

['BoxingPunchingBag', 'CricketShot', 'Drumming', 'HorseRiding', 'PlayingCello', 'PlayingDhol', 'PlayingGuitar', 'Punch', 'ShavingBeard', 'TennisSwing']
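
For intuition, the mapping the layer produces here can be reproduced in plain Python. This is only an analogue of the lookup, not the Keras layer: with `num_oov_indices=0` no index is reserved for unknown labels, so each label maps to its position in the sorted vocabulary. The three labels used are taken from the vocabulary printed above.

```python
import numpy as np

tags = np.array(["Drumming", "CricketShot", "Drumming", "TennisSwing"])

# np.unique returns the sorted unique labels, which serves as the vocabulary.
vocabulary = np.unique(tags).tolist()
label_to_index = {label: index for index, label in enumerate(vocabulary)}

encoded = [label_to_index[tag] for tag in tags]
print(vocabulary)  # ['CricketShot', 'Drumming', 'TennisSwing']
print(encoded)     # [1, 0, 1, 2]
```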


Finally, we can put all the pieces together to create our data processing utility.

In [12]:

def prepare_all_videos(df, root_dir, max_frames=MAX_SEQ_LENGTH):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(tqdm(video_paths)):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path), max_frames)
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            length = batch.shape[0]
            assert length == MAX_SEQ_LENGTH, "sequence length not sufficient!"
            for j in range(length):
                temp_frame_features[i, j, :] = deit_feature_extractor(batch[j, : ])
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels
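
The function above asserts that every video yields exactly `MAX_SEQ_LENGTH` frames, so in this notebook the mask ends up all ones. As a rough sketch of what the mask is for, the example below pads a hypothetical shorter clip with zeros and marks which timesteps are real. The arrays are dummy data, and this padding path is an illustration, not something the function performs.

```python
import numpy as np

MAX_SEQ_LENGTH = 5
NUM_FEATURES = 768

# Hypothetical clip that produced only 3 frame feature vectors.
clip_features = np.ones((3, NUM_FEATURES), dtype="float32")
length = min(clip_features.shape[0], MAX_SEQ_LENGTH)

# Zero-padded feature buffer and a boolean mask marking the real timesteps.
padded = np.zeros((MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")
mask = np.zeros((MAX_SEQ_LENGTH,), dtype="bool")
padded[:length] = clip_features[:length]
mask[:length] = True  # True = real frame, False = padding

print(mask)
```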

In [13]:
train_data, train_labels = prepare_all_videos(train_df, "train", max_frames=MAX_SEQ_LENGTH)
test_data, test_labels = prepare_all_videos(test_df, "test", max_frames=MAX_SEQ_LENGTH)

100%|██████████| 1171/1171 [48:57<00:00,  2.51s/it]
100%|██████████| 459/459 [18:42<00:00,  2.44s/it]


In [14]:
print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

Frame features in train set: (1171, 5, 768)
Frame masks in train set: (1171, 5)


## The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers like `GRU`.

In [15]:
# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(128, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.GRU(64)(x)
    x = keras.layers.Dropout(0.2)(x)
    x = keras.layers.Dense(32, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)
    rnn_model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer="adam",
        metrics=["sparse_categorical_accuracy"],
    )
    
    return rnn_model


# Utility for running experiments.
def run_experiment(seq_model):

    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        epochs=EPOCHS
    )

    test_metrics = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f" {test_metrics}")

    return history, seq_model, test_metrics

In [16]:
model = get_sequence_model()
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 5, 768)]     0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 5)]          0           []                               
                                                                                                  
 gru (GRU)                      (None, 5, 128)       344832      ['input_1[0][0]',                
                                                                  'input_2[0][0]']                
                                                                                                  
 dropout (Dropout)              (None, 5, 128)       0           ['gru[0][0]']                
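
The parameter count reported for the first `GRU` layer can be checked by hand. The formula below assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate; `gru_param_count` is a hypothetical helper written only for this check.

```python
def gru_param_count(input_dim, units):
    """Parameters of a Keras GRU layer with reset_after=True."""
    kernel = input_dim * units      # input-to-hidden weights, per gate
    recurrent = units * units       # hidden-to-hidden weights, per gate
    biases = 2 * units              # input bias and recurrent bias, per gate
    return 3 * (kernel + recurrent + biases)  # update, reset and candidate gates


print(gru_param_count(768, 128))  # 344832, matching the summary above
```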

In [17]:
#fit model
training_hist, sequence_model, test_metrics = run_experiment(model)

Epoch 1/50
...
Epoch 50/50
 [0.16603580117225647, 0.9629629850387573]


In [None]:
#save model
model.save('activity_recog_deit_gru')

In [None]:
!tar -czf activity_recog_deit_gru.tgz activity_recog_deit_gru

In [None]:
plt.plot(training_hist.history['loss'], label = 'training loss')
plt.plot(training_hist.history['sparse_categorical_accuracy'], label = 'training acc')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss / accuracy')
plot_txt = f"Test Accuracy: {round(test_metrics[-1], 3)}"
plt.text(30, 0.1, plot_txt, fontsize=8, backgroundcolor='lime')
plt.title('Training loss and accuracy of the DeiT-GRU approach')
plt.savefig('training_DeIT_GRU_model.pdf')

## Inference

In [None]:

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        length = batch.shape[0]
        assert length == MAX_SEQ_LENGTH, "sequence length not sufficient!"
        for j in range(length):
            frame_features[i, j, :] = deit_feature_extractor(batch[j, :])

        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path), max_frames=MAX_SEQ_LENGTH)
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    for i in range(images.shape[0]):
        images[i] = cv2.cvtColor(images[i], cv2.COLOR_BGR2RGB)  # OpenCV reads BGR; GIFs expect RGB
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, fps=10)
    return embed.embed_file("animation.gif")


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames)
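
The ranking step in `sequence_prediction` relies on `np.argsort`, which sorts in ascending order and therefore has to be reversed. A minimal sketch with made-up probabilities for three of the classes:

```python
import numpy as np

class_vocab = ["CricketShot", "Drumming", "TennisSwing"]
probabilities = np.array([0.10, 0.75, 0.15])  # made-up values for illustration

# argsort is ascending, so reverse it to list the most likely class first.
ranking = np.argsort(probabilities)[::-1]
for i in ranking:
    print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
```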

In [22]:
# good practice: remove the hook once feature extraction is finished
deit_encoder.remove()
