# Action recognition in video using LSTMs:

### Requirements:

First we need to download the UCF101 dataset and extract it. When done, change the `BASE_PATH` variable to point to the dataset folder.

### Downloading Dataset through Kaggle:

In [None]:
!pip install kaggle

In [2]:
!mkdir ~/.kaggle

In [3]:
pwd

'/home/ec2-user/SageMaker'

In [4]:
!cp ./kaggle.json ~/.kaggle

In [5]:
!cd ~/.kaggle/ && ls

kaggle.json


In [None]:
!kaggle datasets list -s ucf101

In [7]:
!kaggle datasets download -d pevogam/ucf101

Downloading ucf101.zip to /home/ec2-user/SageMaker
100%|███████████████████████████████████████| 6.49G/6.49G [00:22<00:00, 263MB/s]
100%|███████████████████████████████████████| 6.49G/6.49G [00:22<00:00, 314MB/s]


In [None]:
!unzip ucf101.zip

In [None]:
ls

### Install packages in the current environment

In [None]:
import sys
!{sys.executable} -m pip install opencv-python 
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install tqdm
!{sys.executable} -m pip install scikit-learn

In [13]:
import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices())

2.7.1
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [None]:
from tensorflow.python.client import device_lib 
print(device_lib.list_local_devices())

In [15]:
!pip3 list | grep tensorflow

tensorflow                         2.7.1
tensorflow-estimator               2.7.0
tensorflow-io-gcs-filesystem       0.24.0
tensorflow-serving-api             2.7.0

In [None]:
# !python -m pip install --upgrade pip

In [16]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import tqdm
from sklearn.preprocessing import LabelBinarizer

In [17]:
BASE_PATH = 'UCF101/UCF-101'
VIDEOS_PATH = os.path.join(BASE_PATH, '**','*.avi')
SEQUENCE_LENGTH = 40

In [18]:
BASE_PATH

'UCF101/UCF-101'

In [19]:
VIDEOS_PATH

'UCF101/UCF-101/**/*.avi'

## Step 1 - Extract features from videos and cache them in files:

To generate feature vectors, we will use an Inception network (InceptionV3) pretrained on the ImageNet dataset for image classification.
We will remove the top of the network (the final fully connected layer) and keep only the feature vector produced by a global average-pooling operation.
Another option would be to keep the output of the layer just before average pooling, that is, the higher-dimensional feature maps. However, in our example, we do not need spatial information: whether the action takes place in the middle of the frame or in a corner, the prediction should be the same. Therefore, we will use the output of the two-dimensional global average-pooling layer. This makes training faster, since the input of the LSTM will be 64 times smaller (64 = 8 × 8 = the spatial size of a feature map for an input image of size 299 × 299).
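
To make the size argument concrete, here is a minimal NumPy sketch. The random array is only a stand-in for a real InceptionV3 feature map (an assumption for illustration, not actual network output). It shows that averaging over the two spatial axes shrinks an 8 × 8 × 2048 map to a 2048-dimensional vector, a 64-fold reduction.

```python
import numpy as np

# Stand-in for one InceptionV3 feature map (height, width, channels).
# Random values are used purely for illustration.
rng = np.random.default_rng(0)
feature_map = rng.random((8, 8, 2048)).astype(np.float32)

# Global average pooling: average over the two spatial axes.
pooled = feature_map.mean(axis=(0, 1))

# How many times smaller the pooled vector is than the full map.
reduction = feature_map.size // pooled.size
print(feature_map.shape, pooled.shape, reduction)
```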

### Sample 'SEQUENCE_LENGTH' frames from each video

In [21]:
def frame_generator():
    video_paths = tf.io.gfile.glob(VIDEOS_PATH)
    np.random.shuffle(video_paths)
    for video_path in video_paths:
        frames = []
        cap = cv2.VideoCapture(video_path)
        num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        sample_every_frame = max(1, num_frames // SEQUENCE_LENGTH)
        current_frame = 0

        max_images = SEQUENCE_LENGTH
        while True:
            success, frame = cap.read()
            if not success:
                break

            if current_frame % sample_every_frame == 0:
                # OpenCV reads frames in BGR order; TensorFlow expects RGB, so we reverse the channels
                frame = frame[:, :, ::-1]
                img = tf.image.resize(frame, (299, 299))
                img = tf.keras.applications.inception_v3.preprocess_input(
                    img)
                max_images -= 1
                yield img, video_path

            if max_images == 0:
                break
            current_frame += 1

        # Release the video handle once sampling is done
        cap.release()

# `from_generator` might throw a warning, expected to disappear in upcoming versions:
# https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#for_example_2
dataset = tf.data.Dataset.from_generator(frame_generator,
             output_types=(tf.float32, tf.string),
             output_shapes=((299, 299, 3), ()))

dataset = dataset.batch(16).prefetch(tf.data.experimental.AUTOTUNE)

In [22]:
dataset

<PrefetchDataset shapes: ((None, 299, 299, 3), (None,)), types: (tf.float32, tf.string)>
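
The sampling rule used by the generator can be checked in isolation. The sketch below is a pure-Python illustration; `sampled_frame_indices` is a hypothetical helper that is not part of the notebook. It reproduces the same stride logic and returns the indices of the frames that would be kept.

```python
def sampled_frame_indices(num_frames, sequence_length=40):
    """Return the frame indices kept by stride-based sampling (illustrative helper)."""
    # Same stride as the generator: at least 1, otherwise num_frames // sequence_length.
    stride = max(1, num_frames // sequence_length)
    # Keep every stride-th frame, but never more than sequence_length frames.
    return list(range(0, num_frames, stride))[:sequence_length]

long_video = sampled_frame_indices(100)
short_video = sampled_frame_indices(25)
print(len(long_video), long_video[:5], long_video[-1])
print(len(short_video))
```

Two things stand out: a video shorter than `SEQUENCE_LENGTH` yields fewer frames than requested (hence the zero padding applied later), and when the frame count is not a multiple of the sequence length, the tail of the video is never sampled.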

### Feature extraction model:

TensorFlow allows us to access a pretrained model with a single line:

In [23]:
inception_v3 = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

x = inception_v3.output

# We add global average pooling to reduce each feature map from
# 8 x 8 x 2048 to a 2048-dimensional vector, as we don't need spatial information

pooling_output = tf.keras.layers.GlobalAveragePooling2D()(x)

feature_extraction_model = tf.keras.Model(inception_v3.input, pooling_output)

### Extract features and store them in .npy files:

Extraction takes about 51 minutes on an AWS SageMaker ml.g5.8xlarge instance.

In [24]:
current_path = None
all_features = []

for img, batch_paths in tqdm.tqdm(dataset):
    batch_features = feature_extraction_model(img)
    batch_features = tf.reshape(batch_features, 
                              (batch_features.shape[0], -1))
    
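    for features, path in zip(batch_features.numpy(), batch_paths.numpy()):
        # A change of path means the previous video is complete: save its features
        if path != current_path and current_path is not None:
            output_path = current_path.decode().replace('.avi', '.npy')
            np.save(output_path, all_features)
            all_features = []
            
        current_path = path
        all_features.append(features)

# Save the features of the last video, which the loop above never flushes
if current_path is not None:
    output_path = current_path.decode().replace('.avi', '.npy')
    np.save(output_path, all_features)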

33295it [50:21, 11.02it/s]
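
The extraction loop relies on frames of the same video arriving consecutively, so that a change of path marks the end of a video. The pure-Python sketch below isolates that grouping pattern. `group_consecutive` and the toy `stream` are hypothetical, and integers stand in for feature vectors.

```python
def group_consecutive(items):
    """Group (key, value) pairs that arrive ordered by key into (key, [values])."""
    current_key = None
    bucket = []
    for key, value in items:
        # A new key closes the previous group.
        if key != current_key and current_key is not None:
            yield current_key, bucket
            bucket = []
        current_key = key
        bucket.append(value)
    # Flush the final group; without this step the last key would be dropped.
    if current_key is not None:
        yield current_key, bucket

stream = [("a.avi", 1), ("a.avi", 2), ("b.avi", 3), ("c.avi", 4), ("c.avi", 5)]
groups = list(group_consecutive(stream))
print(groups)
```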


## Step 2: Train the LSTM on video features

Now that the video features are generated, we can use them to train an LSTM. We define a model and an input pipeline, and launch the training.

### Labels preprocessing

In [26]:
LABELS = ['UnevenBars','ApplyLipstick','TableTennisShot','Fencing','Mixing','SumoWrestling','HulaHoop','PommelHorse','HorseRiding','SkyDiving','BenchPress','GolfSwing','HeadMassage','FrontCrawl','Haircut','HandstandWalking','Skiing','PlayingDaf','PlayingSitar','FrisbeeCatch','CliffDiving','BoxingSpeedBag','Kayaking','Rafting','WritingOnBoard','VolleyballSpiking','Archery','MoppingFloor','JumpRope','Lunges','BasketballDunk','Surfing','SkateBoarding','FloorGymnastics','Billiards','CuttingInKitchen','BlowingCandles','PlayingCello','JugglingBalls','Drumming','ThrowDiscus','BaseballPitch','SoccerPenalty','Hammering','BodyWeightSquats','SoccerJuggling','CricketShot','BandMarching','PlayingPiano','BreastStroke','ApplyEyeMakeup','HighJump','IceDancing','HandstandPushups','RockClimbingIndoor','HammerThrow','WallPushups','RopeClimbing','Basketball','Shotput','Nunchucks','WalkingWithDog','PlayingFlute','PlayingDhol','PullUps','CricketBowling','BabyCrawling','Diving','TaiChi','YoYo','BlowDryHair','PushUps','ShavingBeard','Knitting','HorseRace','TrampolineJumping','Typing','Bowling','CleanAndJerk','MilitaryParade','FieldHockeyPenalty','PlayingViolin','Skijet','PizzaTossing','LongJump','PlayingTabla','PlayingGuitar','BrushingTeeth','PoleVault','Punch','ParallelBars','Biking','BalanceBeam','Swing','JavelinThrow','Rowing','StillRings','SalsaSpin','TennisSwing','JumpingJack','BoxingPunchingBag'] 
encoder = LabelBinarizer()

encoder.fit(LABELS)

LabelBinarizer()
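
As a rough illustration of what the fitted encoder does, the NumPy sketch below one-hot encodes a label against a sorted list of classes and decodes it again. It is a simplified stand-in, not the scikit-learn implementation. The three-class subset and the `one_hot` helper are assumptions made for the example.

```python
import numpy as np

labels = ["Typing", "Archery", "Bowling"]  # small illustrative subset of LABELS
classes = np.unique(labels)                # unique class names in sorted order

def one_hot(label):
    """Return a vector with a 1 at the position of `label` in `classes`."""
    return (classes == label).astype(int)

encoded = one_hot("Bowling")
# Decoding: the index of the largest entry maps back to the class name.
decoded = classes[np.argmax(encoded)]
print(classes, encoded, decoded)
```

With the real encoder, the analogous calls are `encoder.transform([label])` to encode and `encoder.inverse_transform(...)` to decode.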

### Defining the model

We apply dropout in two places. The `dropout` parameter of the LSTM controls the fraction of units dropped for the linear transformation of the inputs. The `recurrent_dropout` parameter controls the fraction of units dropped for the linear transformation of the previous (recurrent) state. Similar to a mask, `recurrent_dropout` randomly ignores part of the previous state activations in order to avoid overfitting.
The very first layer of our model is a Masking layer. Because we pad the feature sequences with all-zero vectors in order to batch them, the LSTM would otherwise needlessly iterate over those added timesteps. With the Masking layer, the LSTM skips every timestep whose features are all zero, so it effectively stops at the actual end of each sequence.
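
The masking rule can be sketched with NumPy: a timestep is masked when every one of its features equals the mask value (0 here). The small shapes below are assumptions chosen for readability. The real sequences have `SEQUENCE_LENGTH` timesteps of 2048 features.

```python
import numpy as np

NUM_TIMESTEPS, FEATURE_DIM = 6, 4  # small illustrative sizes

sequence = np.zeros((NUM_TIMESTEPS, FEATURE_DIM), dtype=np.float32)
sequence[:4] = 1.0  # 4 real timesteps followed by 2 rows of zero padding

# A timestep is kept if at least one feature differs from the mask value.
mask = np.any(sequence != 0.0, axis=-1)
print(mask)
```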

In [27]:
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.),
    tf.keras.layers.LSTM(512, dropout=0.5, recurrent_dropout=0.5),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(len(LABELS), activation='softmax')
])




In [28]:
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy', 'top_k_categorical_accuracy'])
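
Alongside plain accuracy, `top_k_categorical_accuracy` counts a prediction as correct when the true class is among the k highest-scoring classes (k defaults to 5 in Keras). The NumPy sketch below illustrates the idea on toy data. `top_k_accuracy` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=5):
    """Fraction of rows whose true class is among the k highest probabilities."""
    top_k = np.argsort(y_prob, axis=1)[:, -k:]  # indices of the k largest scores
    true_idx = np.argmax(y_true, axis=1)        # index of the 1 in each one-hot row
    hits = [t in row for t, row in zip(true_idx, top_k)]
    return float(np.mean(hits))

# Toy example: 3 samples, 4 classes, true classes 0, 1 and 2.
y_true = np.eye(4)[[0, 1, 2]]
y_prob = np.array([[0.6, 0.2, 0.1, 0.1],
                   [0.5, 0.3, 0.1, 0.1],
                   [0.4, 0.3, 0.1, 0.2]])

top1 = top_k_accuracy(y_true, y_prob, k=1)
top2 = top_k_accuracy(y_true, y_prob, k=2)
print(top1, top2)
```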

In [29]:
# Build the model with the expected input shape so that summary() can report layer sizes
model.build(input_shape=(None, SEQUENCE_LENGTH, 2048))
model.summary()

## Training on data:

Using a generator, we will load the .npy files produced during feature extraction. The code ensures that all the input sequences have the same length, padding them with zeros if necessary:

In [30]:
test_file = os.path.join('UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist', 'testlist01.txt')
train_file = os.path.join('UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist', 'trainlist01.txt')

# Each test row holds only a relative video path
with open(test_file) as f:
    test_list = [row.strip() for row in f if row.strip()]

# Each training row holds a relative video path followed by a class index; keep only the path
with open(train_file) as f:
    train_list = [row.strip().split(' ')[0] for row in f if row.strip()]


def make_generator(file_list):
    def generator():
        np.random.shuffle(file_list)
        for path in file_list:
            full_path = os.path.join(BASE_PATH, path).replace('.avi', '.npy')

            label = os.path.basename(os.path.dirname(path))
            features = np.load(full_path)

            padded_sequence = np.zeros((SEQUENCE_LENGTH, 2048))
            padded_sequence[0:len(features)] = np.array(features)

            transformed_label = encoder.transform([label])
            yield padded_sequence, transformed_label[0]
    return generator
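
The padding step can be tried on its own. The sketch below is a standalone variant with a hypothetical helper, `pad_or_truncate`. It also truncates sequences longer than the target length, which the generator above does not need because extraction never stores more than `SEQUENCE_LENGTH` frames per video.

```python
import numpy as np

def pad_or_truncate(features, sequence_length=40, feature_dim=2048):
    """Return a (sequence_length, feature_dim) array, zero-padded at the end."""
    # Keep at most sequence_length timesteps.
    features = np.asarray(features, dtype=np.float32)[:sequence_length]
    padded = np.zeros((sequence_length, feature_dim), dtype=np.float32)
    padded[:len(features)] = features
    return padded

short_clip = pad_or_truncate(np.ones((25, 2048)))  # fewer frames than needed
long_clip = pad_or_truncate(np.ones((50, 2048)))   # more frames than needed
print(short_clip.shape, long_clip.shape)
print(short_clip[24].sum(), short_clip[25].sum())  # last real row vs first padded row
```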

In [31]:
train_file

'UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist/trainlist01.txt'

In [32]:
test_file

'UCF101TrainTestSplits-RecognitionTask/ucfTrainTestlist/testlist01.txt'

In [33]:
train_dataset = tf.data.Dataset.from_generator(make_generator(train_list),
                 output_types=(tf.float32, tf.int16),
                 output_shapes=((SEQUENCE_LENGTH, 2048), (len(LABELS),)))

train_dataset = train_dataset.batch(16).prefetch(tf.data.experimental.AUTOTUNE)


valid_dataset = tf.data.Dataset.from_generator(make_generator(test_list),
                 output_types=(tf.float32, tf.int16),
                 output_shapes=((SEQUENCE_LENGTH, 2048), (len(LABELS),)))
valid_dataset = valid_dataset.batch(16).prefetch(tf.data.experimental.AUTOTUNE)

In [34]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='/tmp', update_freq=1000)

In [36]:
model.fit(train_dataset, 
          epochs=17, 
          callbacks=[tensorboard_callback], 
          validation_data=valid_dataset)

Epoch 1/17
Epoch 2/17
Epoch 3/17
Epoch 4/17
Epoch 5/17
Epoch 6/17
Epoch 7/17
Epoch 8/17
Epoch 9/17
Epoch 10/17
Epoch 11/17
Epoch 12/17
Epoch 13/17
Epoch 14/17
Epoch 15/17
Epoch 16/17
Epoch 17/17


<keras.callbacks.History at 0x7f74ec066d30>