# 6.1 Video Classification

Module 6 - Video Action Recognition

For the book, references, and training materials, see the project website [http://activefitness.ai/ai-in-sports-with-python](http://activefitness.ai/ai-in-sports-with-python).

Book: [Applied Machine Learning for Health and Fitness](https://www.apress.com/us/book/9781484257715), Chapter 9

In [6]:
import torchvision.io
video_file = 'media/surfing_cutback.mp4'
video, audio, info = torchvision.io.read_video(video_file, pts_unit='sec')
print(video.shape)
print(audio.shape)
print(info)

torch.Size([255, 720, 1280, 3])
torch.Size([2, 407552])
{'video_fps': 29.97002997002997, 'audio_fps': 48000}
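
As a quick sanity check on these shapes, the length of the clip can be estimated from the frame count and frame rate printed above. This is a minimal sketch that uses only the printed values; the helper name `duration_seconds` is hypothetical and not part of torchvision.

```python
# Values copied from the output above
num_frames = 255
video_fps = 29.97002997002997
num_audio_samples = 407552
audio_fps = 48000

def duration_seconds(num_samples, rate):
    """Length of a stream in seconds: samples divided by samples per second."""
    return num_samples / rate

video_seconds = duration_seconds(num_frames, video_fps)
audio_seconds = duration_seconds(num_audio_samples, audio_fps)
print(f"video: {video_seconds:.2f}s, audio: {audio_seconds:.2f}s")
```

Both streams come out at roughly eight and a half seconds, which suggests the video and audio tensors describe the same clip.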


In [7]:
from utils.kinetics import kinetics
categories = kinetics.categories()
classes = kinetics.classes()
sports = kinetics.sport_categories()

count = 0
for key in categories.keys():
    if key in sports:
        print(key)
        for label in categories[key]:
            count+=1
            print("\t{}".format(label))
print(f'Sport activities labels: {count}')      
      

athletics - jumping
	high jump
	hurdling
	long jump
	parkour
	pole vault
	triple jump
athletics - throwing + launching
	archery
	catching or throwing frisbee
	disc golfing
	hammer throw
	javelin throw
	shot put
	throwing axe
	throwing ball
	throwing discus
ball sports
	bowling
	catching or throwing baseball
	catching or throwing softball
	dodgeball
	dribbling basketball
	dunking basketball
	golf chipping
	golf driving
	golf putting
	hitting baseball
	hurling (sport)
	juggling soccer ball
	kicking field goal
	kicking soccer ball
	passing American football (in game)
	passing American football (not in game)
	playing basketball
	playing cricket
	playing kickball
	playing squash or racquetball
	playing tennis
	playing volleyball
	shooting basketball
	shooting goal (soccer)
	shot put
golf
	golf chipping
	golf driving
	golf putting
gymnastics
	bouncing on trampoline
	cartwheeling
	gymnastics tumbling
	somersaulting
	vault
heights
	abseiling
	bungee jumping
	climbing a rope
	climbing ladder
	c

In [13]:
import torch
import torchvision
import torchvision.models as models    

# check if cuda is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [14]:
#model = models.video.r3d_18(pretrained=True) 
#model = models.video.mc3_18(pretrained=True) 
model = models.video.r2plus1d_18(pretrained=True)
model.eval() 

VideoResNet(
  (stem): R2Plus1dStem(
    (0): Conv3d(3, 45, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
    (1): BatchNorm3d(45, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv3d(45, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
    (4): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
  )
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Sequential(
        (0): Conv2Plus1D(
          (0): Conv3d(64, 144, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (1): BatchNorm3d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Conv3d(144, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        )
        (1): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=Tru

In [15]:
pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(pytorch_total_params)

31505325
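
To see where a total like this comes from, the parameters of individual layers can be counted by hand. The sketch below covers only the stem layers shown in the model summary above; the helper names are hypothetical, and the formulas assume convolutions without bias and batch normalization with affine=True, as printed.

```python
from math import prod

def conv3d_params(in_ch, out_ch, kernel_size, bias=False):
    """Weights of a 3D convolution: in * out * kernel volume (+ optional bias)."""
    weights = in_ch * out_ch * prod(kernel_size)
    return weights + (out_ch if bias else 0)

def batchnorm_params(num_features):
    """affine=True gives one learnable scale and one shift per channel."""
    return 2 * num_features

# Layers of R2Plus1dStem as printed in the summary above
stem_params = (
    conv3d_params(3, 45, (1, 7, 7))
    + batchnorm_params(45)
    + conv3d_params(45, 64, (3, 1, 1))
    + batchnorm_params(64)
)
print(stem_params)
```

The running statistics of batch normalization are buffers rather than parameters, so they are not part of the count.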


In [1]:
import torch
import torchvision
import torchvision.models as models  

# Normalization: Kinetics 400
mean = [0.43216, 0.394666, 0.37645]  
std = [0.22803, 0.22145, 0.216989] 

def normalize(video): 
    return video.permute(3, 0, 1, 2).to(torch.float32) / 255

def resize(video, size): 
    return torch.nn.functional.interpolate(video, size=size, scale_factor=None, mode='bilinear', align_corners=False)

def crop(video, output_size): 
    # center crop    
    h, w = video.shape[-2:] 
    th, tw = output_size 
    i = int(round((h - th) / 2.)) 
    j = int(round((w - tw) / 2.)) 
    return video[..., i:(i + th), j:(j + tw)]

def normalize_base(video, mean, std): 
    shape = (-1,) + (1,) * (video.dim() - 1) 
    mean = torch.as_tensor(mean).reshape(shape) 
    std = torch.as_tensor(std).reshape(shape) 
    return (video - mean) / std
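
The center crop arithmetic in crop can be checked without loading a video. This is a small sketch, assuming the 128x171 resize and 112x112 crop sizes used in this notebook; the function name `center_crop_offsets` is hypothetical.

```python
def center_crop_offsets(height, width, out_height, out_width):
    """Top-left corner of a centered crop, using the same rounding as crop()."""
    top = int(round((height - out_height) / 2.0))
    left = int(round((width - out_width) / 2.0))
    return top, left

top, left = center_crop_offsets(128, 171, 112, 112)
rows = (top, top + 112)    # row range kept by the crop
cols = (left, left + 112)  # column range kept by the crop
print(rows, cols)
```

Both ranges stay inside the resized frame, so the slice in crop always returns a full 112x112 window for these sizes.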

In [2]:
import torchvision.io 
video_file = 'media/surfing_cutback.mp4'
video, audio, info = torchvision.io.read_video(video_file, pts_unit='sec') 
print(video.shape, audio.shape, info)

torch.Size([255, 720, 1280, 3]) torch.Size([2, 407552]) {'video_fps': 29.97002997002997, 'audio_fps': 48000}


In [3]:
video = normalize(video) 
video = resize(video,(128, 171)) 
video = crop(video,(112, 112)) 
video = normalize_base(video, mean=mean, std=std)
shape = video.shape
# after normalize() the layout is (channels, frames, height, width)
print(f'frames {shape[1]}, size {shape[2]} {shape[3]}') 

frames 255, size 112 112
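
Tracing the tensor layout through the preprocessing steps makes it clear which dimension is which. The sketch below works on plain shape tuples rather than tensors; `permute_shape` is a hypothetical helper that mimics what permute does to a shape.

```python
def permute_shape(shape, order):
    """Reorder a shape tuple the way Tensor.permute reorders dimensions."""
    return tuple(shape[i] for i in order)

raw = (255, 720, 1280, 3)                      # read_video: (frames, height, width, channels)
chan_first = permute_shape(raw, (3, 0, 1, 2))  # normalize(): (channels, frames, height, width)
resized = chan_first[:2] + (128, 171)          # resize(): only height and width change
cropped = resized[:2] + (112, 112)             # crop(): center 112x112 window
batched = (1,) + cropped                       # unsqueeze(0): add a batch dimension

for name, shape in [("raw", raw), ("chan_first", chan_first),
                    ("resized", resized), ("cropped", cropped),
                    ("batched", batched)]:
    print(name, shape)
```

After preprocessing, index 0 is the channel dimension and index 1 is the number of frames.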


In [4]:
model = models.video.r2plus1d_18(pretrained=True)
model.eval()

# make use of accelerated CUDA if available
if torch.cuda.is_available():
    model.cuda()
    video = video.cuda() 

In [None]:
# score the video (no gradients are needed for inference)
with torch.no_grad():
    score = model(video.unsqueeze(0))
# get prediction with max score
prediction = score.argmax()
print(prediction)

In [None]:
from utils.kinetics import kinetics
classes = kinetics.classes()
print(classes[prediction.item()])
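
The model returns one raw score per class, and argmax picks the largest. To read the scores as probabilities or to list the top candidates, a softmax can be applied first. The sketch below uses three hypothetical scores and labels for illustration only; they are not output from the model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes (not real model output)
labels = ["playing tennis", "surfing water", "golf putting"]
scores = [1.2, 4.5, 0.3]

probs = softmax(scores)
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f"{label}: {p:.3f}")
```

Softmax preserves the ordering of the scores, so the top entry is the same class that argmax would select.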

In [10]:
import torch
import torch.nn as nn
import torchvision
import torchvision.models as models
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets.kinetics import Kinetics400
from torchvision.datasets.samplers import DistributedSampler, UniformClipSampler, RandomClipSampler
import matplotlib.pyplot as plt
from pathlib import Path

# convenience helper: list the names of the entries in a directory
Path.ls = lambda x: [o.name for o in x.iterdir()]
from torchvision.io.video import read_video
from functools import partial
# make read_video use seconds for timestamps by default
read_video = partial(read_video, pts_unit='sec')
torchvision.io.read_video = partial(torchvision.io.read_video, pts_unit='sec')

In [11]:
base_dir = Path('data/kinetics400/')
data_dir = base_dir/'dataset'

In [None]:
!tree {data_dir/'train'}

Conveniently, torchvision.datasets includes a Kinetics400 dataset class that serves as a cookie cutter for our project. Internally, video datasets use a VideoClips object to store video clip data:

In [13]:
data = torchvision.datasets.Kinetics400(
            data_dir/'train',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            extensions=('mp4',),
            num_workers=0
        )

100%|███████████████████████████████████████████████████████████████████████| 10/10 [00:34<00:00,  3.49s/it]


**Note:** Although you can and should take advantage of multiprocessing when loading datasets, especially in a production environment, on some systems you may get an error. Setting num_workers = 0 makes sure the dataset is loaded in a single process.

According to the constructor above, each video clip loaded from our dataset should be a 4D tensor with the shape (frames, height, width, channels); in our case, 32 frames of RGB video. Note that Kinetics doesn't require all clips to have the same height and width:

In [14]:
print((data[0][0]).shape)

torch.Size([32, 480, 272, 3])


### Visualizing dataset

Sometimes it is handy to view the entire dataset catalog as a table that summarizes the number of frames, frame rate, and clips per video. The helper function to_dataframe loads the entire video catalog into a pandas DataFrame and displays the content:

In [28]:
import pandas as pd
from utils.video_classification.helpers import to_dataframe

to_dataframe(data)

Length: 35271


| | filepath | frames | fps | clips |
|---|---|---|---|---|
| 0 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 30.00000 | 269 |
| 1 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 29.97003 | 269 |
| 2 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 29.97003 | 269 |
| 3 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 30.00000 | 269 |
| 4 | `data\kinetics400\dataset\train\playing_tennis\...` | 300 | 30.00000 | 269 |
| ... | ... | ... | ... | ... |
| 145 | `data\kinetics400\dataset\train\surfing_water\Y...` | 250 | 25.00000 | 219 |
| 146 | `data\kinetics400\dataset\train\surfing_water\Z...` | 300 | 29.97003 | 269 |
| 147 | `data\kinetics400\dataset\train\surfing_water\_...` | 178 | 29.97003 | 147 |
| 148 | `data\kinetics400\dataset\train\surfing_water\a...` | 119 | 15.00000 | 88 |
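
The clips column in this table follows from the frames column and the dataset settings (frames_per_clip=32, step_between_clips=1). The sketch below reproduces the numbers with a sliding-window count. The formula is my reading of the table, not a quote from the torchvision source, and `num_clips` is a hypothetical helper.

```python
def num_clips(num_frames, frames_per_clip, step_between_clips):
    """Number of sliding windows of frames_per_clip frames in a video."""
    if num_frames < frames_per_clip:
        return 0
    return (num_frames - frames_per_clip) // step_between_clips + 1

# Frame counts taken from the table above
for frames in (300, 250, 178, 119):
    print(frames, num_clips(frames, 32, 1))
```

With a step of 1, almost every frame starts a new clip, which is why a few hundred short videos can produce tens of thousands of clips.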


Let’s say we want to look up a particular video in the dataset by its row number and get its file path:

In [30]:
VIDEO_NUMBER = 130
video_table = to_dataframe(data)
video_info = video_table['filepath'][VIDEO_NUMBER]

Length: 35271


With the Video helper from IPython.display, we can also show the video embedded in the notebook, but keep in mind that setting embed=True while displaying the video may significantly increase the size of your notebook:

In [29]:
from IPython.display import Video
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Video(video_info, width=400, embed=False)

So instead of embedding the video, it may be sufficient to just visualize the first and last frames:

In [None]:
def show_clip_start_end(f):
    # show the first and the last frame of a clip
    last = len(f)
    plt.imshow(f[0])
    plt.title('frame: 1')
    plt.axis('off')
    plt.show()
    plt.imshow(f[last-1])
    plt.title(f'frame: {last}')
    plt.axis('off')
    plt.show()

show_clip_start_end(data[0][0])

## Video normalization

As with most data, video needs to be normalized before it is passed to the video classification models included in torchvision. This involves scaling image data to the range \[0,1\] and then normalizing with the mean and standard deviation provided with the model:

In [43]:
import utils.video_classification.transforms as T

t = torchvision.transforms.Compose([
        T.ToFloatTensorInZeroOne(),
        T.Resize((128, 171)),
        T.RandomHorizontalFlip(),
        T.Normalize(mean=[0.43216, 0.394666, 0.37645],
                            std=[0.22803, 0.22145, 0.216989]),
        T.RandomCrop((112, 112))
    ])
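
To make the normalization concrete, the same arithmetic can be applied to a single RGB pixel. This is a minimal sketch using the Kinetics mean and standard deviation from the transform above; `normalize_pixel` is a hypothetical helper, not part of the transforms module.

```python
mean = [0.43216, 0.394666, 0.37645]
std = [0.22803, 0.22145, 0.216989]

def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [v / 255 for v in rgb_uint8]
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]

black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

Black pixels map to negative values and white pixels to positive values, so the normalized data is roughly centered around zero for each channel.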

Once we've defined the transform, we can pass it to the Kinetics400 dataset:

In [44]:
train_data = torchvision.datasets.Kinetics400(
            data_dir/'train',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            transform=t,
            extensions=('mp4',),
            num_workers=0
        )

100%|███████████████████████████████████████████████████████████████████████| 10/10 [00:39<00:00,  3.97s/it]


The DataLoader class in PyTorch provides many useful features, including iterable datasets, automatic batching, memory pinning, sampling, and customization of the data loading order.

## Finding learning rate

> *Never let formal education get in the way of your learning.*\
> --Mark Twain

The learning rate is an important hyperparameter for training neural networks: if you make the learning rate too small, the model will likely converge too slowly.

**Mysterious constant:** The so-called Karpathy constant defines the best learning rate for Adam as 3e-4. Andrej Karpathy, the author of the famous tweet, said in a reply to his own tweet that it was a joke. Nevertheless, the constant made it to Urban Dictionary and many data science blogs.

We will not take this for granted, of course, and will use sound theory to find the best learning rate. In practice, a learning rate that is too large may prevent the model from converging. As the illustration shows, when the learning rate is too large for gradient descent, the updates overshoot and the model never reaches its minimum.

![](images/ch9/fig_9-7.png)
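
The effect of the learning rate can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, which is an assumption chosen for illustration and unrelated to the video model, with three different learning rates.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 starting from x0 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x       # derivative of f(x) = x**2
        x = x - lr * grad  # gradient descent update
    return x

small = gradient_descent(lr=0.001)  # too small: barely moves toward 0
good = gradient_descent(lr=0.1)     # reasonable: gets close to the minimum at 0
large = gradient_descent(lr=1.1)    # too large: overshoots and diverges

print(f"small: {small:.4f}, good: {good:.4f}, large: {large:.4f}")
```

Each update multiplies x by (1 - 2 * lr), so any learning rate above 1.0 makes the magnitude grow on every step for this function.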

To deal with this problem, Leslie N. Smith proposed a systematic method for finding good learning rates [https://arxiv.org/pdf/1506.01186.pdf]. As a result, many frameworks, including fastai, now include a learning rate finder. For PyTorch, you can use the torch_lr_finder module: install it with pip install torch-lr-finder and then import it in the code:

In [51]:
from torch_lr_finder import LRFinder

Getting the dataset ready for the learning rate finder:

In [49]:
from utils.video_classification.first_clip_sampler import FirstClipSampler
from torch.utils.data.dataloader import default_collate

def collate_fn(batch):
    # remove audio from the batch
    batch = [(d[0], d[2]) for d in batch]
    return default_collate(batch)

train_sampler = FirstClipSampler(train_data.video_clips, 2)
train_dl = DataLoader(train_data, batch_size=4, sampler=train_sampler, collate_fn=collate_fn, pin_memory=True)
x,y = next(iter(train_dl))
x.shape, y.shape

(torch.Size([4, 3, 32, 112, 112]), torch.Size([4]))

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-2)
# if you are getting memory problems running this,
# try reducing the DataLoader batch_size above (for example from 4 to 2)
lr_finder = LRFinder(model, optimizer, criterion,device=device)
lr_finder.range_test(train_dl, end_lr=10, num_iter=90)
lr_finder.plot()
lr_finder.reset()

In video action recognition, where datasets are large and training runs are long, choosing a good learning rate matters. A good candidate is often found in the middle of the descending part of the loss curve. The module plots the loss curve, and the optimal learning rate from the chart below is somewhere near lr = 1e-2:

![](images/ch9/fig_9-8.png)

The optimal learning rate is found around the middle of the descending loss curve, on this figure around 1e-2.
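
A range test sweeps the learning rate from a very small value to a very large one while recording the loss. The sketch below generates such a sweep for the values used above (start at 1e-6, end at 10, 90 iterations), assuming exponential spacing. It illustrates the idea only and is not the torch_lr_finder implementation; `exp_lr_schedule` is a hypothetical helper.

```python
def exp_lr_schedule(start_lr, end_lr, num_iter):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_iter - 1)) for i in range(num_iter)]

lrs = exp_lr_schedule(start_lr=1e-6, end_lr=10, num_iter=90)
print(f"first: {lrs[0]:.1e}, middle: {lrs[len(lrs) // 2]:.1e}, last: {lrs[-1]:.1e}")
```

Because the spacing is exponential, the loss curve is usually read on a logarithmic learning rate axis.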

## Training the model

Training a model for video action recognition in PyTorch follows the same principles as training an image classifier, but since video classification functionality is relatively new in PyTorch, it's worth including a small example in this chapter.

### Project 9-3. Video Recognition Model Training

To start, let's create two datasets, for training and validation, based on the built-in Kinetics400 object. The idea here is to take advantage of the built-in objects that PyTorch offers. I use the same normalizing video transformation t that was defined in the previous examples. On some systems you can get a significant speed improvement if you set num_workers > 0, but on my system I had to be conservative, so I keep it at zero (which means data loading is not parallelized):

In [None]:
train_data = torchvision.datasets.Kinetics400(
            data_dir/'train',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            transform=t,
            extensions=('mp4',),
            num_workers=0
        )

valid_data = torchvision.datasets.Kinetics400(
            data_dir/'valid',
            frames_per_clip=32,
            step_between_clips=1,
            frame_rate=None,
            transform=t,
            extensions=('mp4',),
            num_workers=0
        )

PyTorch allows using familiar DataLoaders with video data. For video data, PyTorch also includes the VideoClips class, which is used for enumerating clips in a video and for sampling clips while loading. FirstClipSampler in the example below uses the video\_clips property of the dataset to sample a specified number of clips in the video:

In [None]:
train_sampler = FirstClipSampler(train_data.video_clips, 2)
train_dl = DataLoader(train_data, 
                      batch_size=4, 
                      sampler=train_sampler, 
                      collate_fn=collate_fn, 
                      pin_memory=True)
valid_sampler = FirstClipSampler(valid_data.video_clips, 2)
valid_dl = DataLoader(valid_data, 
                      batch_size=4, 
                      sampler=valid_sampler, 
                      collate_fn=collate_fn, pin_memory=True)
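
The source of FirstClipSampler is not shown here, so the following is only a sketch of one plausible behavior: take the first N clips of every video from a flat, video-by-video list of clip indices. The function name and the clip counts are hypothetical.

```python
def first_clip_indices(clips_per_video, num_clips_per_video):
    """Flat dataset indices of the first N clips of each video.

    clips_per_video lists how many clips each video contributes;
    clips are assumed to be numbered consecutively, video by video.
    """
    indices = []
    offset = 0
    for n_clips in clips_per_video:
        take = min(n_clips, num_clips_per_video)
        indices.extend(range(offset, offset + take))
        offset += n_clips
    return indices

# Three hypothetical videos with 269, 219 and 1 clips; take 2 clips per video
print(first_clip_indices([269, 219, 1], 2))
```

Sampling only a few clips per video keeps an epoch short, which helps when consecutive clips of the same video are nearly identical.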

Building the datasets, which involves indexing every video and its clips, can take a really long time, so you may want to save the dataset objects in a cache directory:

In [None]:
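cache_dir = data_dir/'.cache'
# create the cache directory if it does not exist yet
cache_dir.mkdir(parents=True, exist_ok=True)
cache_dir.ls()

# save the dataset objects, then load them back from the cache
torch.save(train_data, cache_dir/'train')
torch.save(valid_data, cache_dir/'valid')
# note: recent PyTorch versions may need weights_only=False to unpickle full objects
train_data = torch.load(cache_dir/'train')
valid_data = torch.load(cache_dir/'valid')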

Next, you initialize the model with hyper-parameters, including the learning rate obtained earlier. Note that since we'll be training the model, we instantiate it without weights (pretrained=False or omitted):

In [None]:
import torch
import torchvision
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = models.video.r2plus1d_18()
# move the model to the selected device (works on CPU-only systems too)
model.to(device)
lr = 1e-2
criterion = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=5e-1, steps_per_epoch=len(train_dl), epochs=10)
metrics_dir = cache_dir/'train-metrics'
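
With a one-cycle schedule, the learning rate first rises to max_lr and then decays to a very small value. The sketch below is a simplified stand-alone version of that shape, assuming cosine annealing and the default factors documented for OneCycleLR (pct_start=0.3, div_factor=25, final_div_factor=1e4). It is not PyTorch's exact implementation, and `one_cycle_lr` is a hypothetical helper.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Simplified one-cycle schedule: cosine warm-up, then cosine decay."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps

    def cos_anneal(start, end, pct):
        # pct = 0 returns start, pct = 1 returns end
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= warmup_steps:
        return cos_anneal(initial_lr, max_lr, step / warmup_steps)
    pct = (step - warmup_steps) / (total_steps - warmup_steps)
    return cos_anneal(max_lr, min_lr, pct)

total = 1000  # hypothetical number of training steps
schedule = [one_cycle_lr(s, total, max_lr=5e-1) for s in range(total + 1)]
print(f"start: {schedule[0]:.3f}, peak: {max(schedule):.3f}, end: {schedule[-1]:.1e}")
```

Note that under this schedule the peak is set by max_lr rather than by the lr value passed to the optimizer, so it is worth checking that max_lr is consistent with the learning rate found earlier.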

CrossEntropyLoss is a standard loss for classification problems, and we use the Adam optimizer (the same one we used for finding the learning rate). Next, we can train the model; in the example below I chose 10 epochs:

In [None]:
import sys
import time
import datetime
from utils.video_classification.train import train_one_epoch, evaluate

start_time = time.time()
 
for epoch in range(10):
    train_one_epoch(model, 
                    criterion, 
                    optim, 
                    lr_scheduler, 
                    train_dl, device, 
                    epoch, print_freq=100)
    evaluate(model, 
             criterion, 
             valid_dl, 
             device)

total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print('Training time {}'.format(total_time_str))

You can also save the model weights once it's trained:

In [None]:
SAVED_MODEL_PATH = './videoresnet_action.pth'
torch.save(model.state_dict(), SAVED_MODEL_PATH)

## Summary

In this chapter we covered practical methods and tools for video action recognition and classification. We discussed data structures for loading, normalizing, and storing videos; datasets for sports action classification, such as Kinetics; and deep learning models. Using readily available pre-trained models, we can classify hundreds of sport actions and train the models to recognize new activities. For a sport data scientist, this chapter provides practical examples of deep learning, movement analysis, and action recognition on any video.

Although video action recognition is becoming more usable today and has made progress in classifying thousands of actions, we are still far from the goal of generalized action recognition. That means, as a sport data scientist, you are still left with a lot of work to apply video recognition in the field. Is this the right time to make video action recognition a part of your toolbox? With the practical examples and notebooks accompanying this chapter, I think that this is the right time for coaches and sport scientists to start using these methods in everyday sport data science.