# Transfer Learning for Video Classification

This notebook takes video classification models from [torchvision](https://pytorch.org/vision/stable/index.html) that were pretrained on [Kinetics-400](https://arxiv.org/abs/1705.06950) and applies transfer learning to them using the HMDB51 dataset.
The notebook performs the following steps:
1. [Import dependencies and setup parameters](#1.-Import-dependencies-and-setup-parameters)
2. [Prepare the dataset](#2.-Prepare-the-dataset)
3. [Predict using the original model](#3.-Predict-using-the-original-model)
4. [Transfer Learning](#4.-Transfer-Learning)
5. [Predict](#5.-Predict)
6. [Export the saved model](#6.-Export-the-saved-model)

## 1. Import dependencies and setup parameters

This notebook assumes that you have already followed the instructions in the [README.md](/notebooks/notebooks/setup.md) to set up a PyTorch environment with all the dependencies required to run the notebook.

In [None]:
import os
import time
import math
import cv2
import numpy as np
import pandas as pd
import torch
import torchvision
import torchvision.models.video
from torchvision import datasets, models, transforms
import torchvision.transforms as T
from tqdm import tqdm
from PIL import Image
from pydoc import locate
import warnings

import intel_extension_for_pytorch as ipex
import matplotlib.pyplot as plt

from tlt.utils.file_utils import download_and_extract_tar_file, download_file
from model_utils import torchvision_model_map, get_retrainable_model

from torchvision.io.video import read_video
from torchvision.models.video import r3d_18, mc3_18, r2plus1d_18

warnings.filterwarnings("ignore")
print('Supported models:')
print('\n'.join(torchvision_model_map.keys()))

In [None]:
# Specify a model from the list above
model_name = 'r3d_18'

# Specify the parent directory for the custom or torchvision dataset
dataset_directory = os.environ["DATASET_DIR"] if "DATASET_DIR" in os.environ else \
    os.path.join(os.environ["HOME"], "dataset")
    
# Specify a directory for output
output_directory = os.environ["OUTPUT_DIR"] if "OUTPUT_DIR" in os.environ else \
    os.path.join(os.environ["HOME"], "output")

print("Dataset directory:", dataset_directory)
print("Output directory:", output_directory)


In [None]:
if model_name not in torchvision_model_map.keys():
    raise ValueError("The specified model_name ({}) is invalid. Please select from: {}".
                     format(model_name, torchvision_model_map.keys()))
    
print("Pretrained Video Classification Model:", model_name)   

## 2. Prepare the dataset

We will use the HMDB51 action recognition dataset. Run the cell below to download the dataset to the specified dataset directory.

In [None]:
! curl https://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/hmdb51_org.rar --output $dataset_directory/hmdb51_org.rar

In [None]:
# Create and specify our video directory
os.makedirs(os.path.join(dataset_directory, 'hmdb51_org'), exist_ok=True)
video_directory = os.path.join(dataset_directory, 'hmdb51_org')
downloaded_directory = os.path.join(dataset_directory, 'hmdb51_org.rar')

Run the cell below to extract the .rar files and organize the HMDB51 data into subfolders.

In [None]:
# Extract .rar files and move them into respective folders
! unrar e $downloaded_directory $video_directory 
! rm $downloaded_directory -r

for files in os.listdir(video_directory):
    foldername = files.split('.')[0]
    os.system("mkdir -p " + os.path.join(video_directory, foldername))
    os.system("unrar e " + os.path.join(video_directory, files) + " " + os.path.join(video_directory, foldername))

! rm $video_directory/*.rar

Optional: Uncomment and run the cell below if you would like to convert the video frames to images in a separate folder rather than overwriting the original video dataset folder.

In [None]:
# Optional

# ! cp -R $video_directory $dataset_directory/hmdb51_jpeg
# video_directory = os.path.join(dataset_directory, 'hmdb51_jpeg')

## Convert the Video Frames to Images

To reduce the computational cost, we extract a fixed number of frames from each video and save them as images using the functions below.

In [None]:
# Get each video from the dataset, returns video ids, labels, and folders
def get_vids(jpeg_path):
    folders = os.listdir(jpeg_path)
    ids = []
    labels = []
    for folder in folders:
        folderpath = os.path.join(jpeg_path, folder)
        files = os.listdir(folderpath)
        filepath= [os.path.join(folderpath, file) for file in files]
        ids.extend(filepath)
        # Use the class folder name (not its full path) as the label
        labels.extend([folder]*len(files))
    return ids, labels, folders

# For each video, return a list of n_frames frames
def get_frames(filename, n_frames= 1):
    frames = []
    v_cap = cv2.VideoCapture(filename)
    v_len = int(v_cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick exactly n_frames evenly spaced frame indices (int avoids int16 overflow on long videos)
    frame_list = np.linspace(0, v_len - 1, n_frames, dtype=int)
    
    for fn in range(v_len):
        success, frame = v_cap.read()
        if success is False:
            continue
        if (fn in frame_list):
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  
            frames.append(frame)
    v_cap.release()
    return frames, v_len

# Convert each frame to a .jpg file
def frames_to_jpg(frames, pic_path):
    for index, frame in enumerate(frames):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  
        image_path = os.path.join(pic_path, "frame"+str(index)+".jpg")
        cv2.imwrite(image_path, frame)
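
The frame selection in `get_frames` relies on evenly spaced indices across the video. The idea can be sketched in plain Python without OpenCV or NumPy; `evenly_spaced_indices` is a hypothetical helper used only for illustration, and it truncates fractional positions the way an integer cast would.

```python
def evenly_spaced_indices(total_frames, n_samples):
    """Pick n_samples frame indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or n_samples <= 0:
        return []
    if n_samples == 1:
        return [0]
    step = (total_frames - 1) / (n_samples - 1)
    # Truncate toward zero, as an integer cast of evenly spaced floats would.
    return [int(i * step) for i in range(n_samples)]

print(evenly_spaced_indices(100, 5))  # [0, 24, 49, 74, 99]
```

When a video has fewer frames than `n_samples`, some indices repeat, so fewer distinct frames are available than requested.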

In [None]:
# Use helper functions from above to get n_frames frames from each video and convert them to Images

# Number of frames you would like to use per video
n_frames = 16
# Video dataset format
extension = '.avi'

print("Converting video frames to jpg...")
print("Note: This may take 15-20 minutes")
for root, dirs, files in tqdm(os.walk(video_directory, topdown=False), bar_format='{l_bar}{bar:50}{r_bar}{bar:-50b}'):
    for name in files:
        if extension not in name:
            continue
        video_path = os.path.join(root, name)
        # Get video frames
        frames, vlen = get_frames(video_path, n_frames= n_frames)
        pic_path = video_path.replace(extension, "")
        os.makedirs(pic_path, exist_ok= True)
        # Convert frames to jpg
        frames_to_jpg(frames, pic_path)
print("Success.")
        
# Remove redundant video.{extension} files from the folder    
for folder in os.listdir(video_directory):
    for file in os.listdir(video_directory + '/' + folder):
        if extension in file:
            os.remove(video_directory + '/' + folder + '/' + file)

In [None]:
# Preprocessing transforms
def get_transform(train):
    transforms = []
    transforms.append(T.Resize((112, 112)))
    if train:
        transforms.append(T.RandomHorizontalFlip())
    transforms.append(T.ToTensor())
    transforms.append(T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]))

    return T.Compose(transforms)
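
`T.ToTensor()` scales pixel values to the range [0, 1], and `T.Normalize` then applies `(value - mean) / std` to each channel. A minimal sketch of that arithmetic for a single RGB pixel, using the mean and standard deviation values from the cell above (the names `CHANNEL_MEAN`, `CHANNEL_STD`, and `normalize_pixel` are assumptions made for this example):

```python
CHANNEL_MEAN = (0.485, 0.456, 0.406)
CHANNEL_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=CHANNEL_MEAN, std=CHANNEL_STD):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(CHANNEL_MEAN))  # (0.0, 0.0, 0.0)
```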

In [None]:
train_batch_size = 30
test_batch_size = 30

# Create the dataset and DataLoader objects

print("loading dataset...")
dataset = datasets.ImageFolder(video_directory, get_transform(True))
print("Loaded.")
print("loading test dataset...")
dataset_test = datasets.ImageFolder(video_directory, get_transform(False))
print("Loaded.")
class_names = dataset.classes

# Use 25% for validation and 75% for training
print("Segmenting datasets...")
indices = torch.randperm(len(dataset)).tolist()
num_training_samples = math.floor(len(dataset)*.75)

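# Hold out the remaining 25% (with the evaluation transforms) so it does not overlap the training set
dataset_test = torch.utils.data.Subset(dataset_test, indices[num_training_samples:])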
dataset = torch.utils.data.Subset(dataset, indices[:num_training_samples])
print("Success.")


# define DataLoaders
train_loader = torch.utils.data.DataLoader(dataset, batch_size=train_batch_size, 
                                           shuffle=True, num_workers=4)
test_loader  = torch.utils.data.DataLoader(dataset_test, batch_size=test_batch_size, 
                                           shuffle=False, num_workers=4)
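
A hold-out split such as the 75/25 split described above should place each sample in exactly one subset. The following plain-Python sketch shows the index bookkeeping only; the helper name `split_indices` and its `seed` parameter are assumptions for illustration and do not appear in the notebook.

```python
import random

def split_indices(n_samples, train_fraction=0.75, seed=0):
    """Shuffle range(n_samples) and cut it into disjoint train/validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    # Everything before the cut trains the model; everything after validates it.
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(100)
print(len(train_idx), len(val_idx))        # 75 25
print(set(train_idx).isdisjoint(val_idx))  # True
```

Slicing the same shuffled list at a single cut point is what guarantees the two subsets never share a sample.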

## 3. Predict using the original model

In [None]:
# Load in the pretrained model
pretrained_model_class = locate('torchvision.models.video.{}'.format(model_name))
model = pretrained_model_class(pretrained=True)
inputs, classes = next(iter(train_loader))
model.eval()
print("Model loaded successfully")

In [None]:
# Rearrange channels and create outputs for predictions
inputs.unsqueeze_(1)
inputs = inputs.permute(0, 2, 1, 3, 4)
print("Preparing model outputs for prediction...")
outputs = model(inputs)
print("Model outputs created. We can now predict")
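
The torchvision video models expect input shaped (batch, channels, frames, height, width). Because each sample here is a single image, the cell above adds a time axis of length 1 and moves it behind the channel axis. The shape bookkeeping can be traced in plain Python; `unsqueeze_shape` and `permute_shape` are hypothetical helpers that only mimic how the tensor operations change a shape tuple.

```python
def unsqueeze_shape(shape, dim):
    """Shape after inserting a new axis of length 1 at position dim."""
    return shape[:dim] + (1,) + shape[dim:]

def permute_shape(shape, order):
    """Shape after reordering the axes according to order."""
    return tuple(shape[i] for i in order)

batch = (30, 3, 112, 112)              # (batch, channels, height, width)
with_time = unsqueeze_shape(batch, 1)  # (30, 1, 3, 112, 112)
clip = permute_shape(with_time, (0, 2, 1, 3, 4))
print(clip)                            # (30, 3, 1, 112, 112)
```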

In [None]:
# Get the Kinetics400 labels for displaying with the predictions
kinetics400_classes = []
labels_file_url = 'https://raw.githubusercontent.com/deepmind/kinetics-i3d/master/data/label_map.txt'
labels_file_path = os.path.join(dataset_directory, os.path.basename(labels_file_url))
# Only download the labels file if it is not already on disk
if not os.path.exists(labels_file_path):
    download_file(labels_file_url, dataset_directory)

with open(labels_file_path) as f:
    kinetics400_labels = f.readlines()
    kinetics400_classes = [l.strip() for l in kinetics400_labels]
print("Success.")

In [None]:
# List of the actual labels for this batch
actual_label_batch = [class_names[int(id)] for id in classes]

# Make predictions
_, predicted_id = torch.max(outputs, 1)
predicted_label_batch = [kinetics400_classes[id] for id in predicted_id]

# Visualize predictions using pandas dataframe object
results_table = []
count = 0
for prediction, actual in zip(predicted_label_batch, actual_label_batch):
    if prediction == actual:
        count += 1
    results_table.append([prediction, actual])

# Display predictions and accuracy
acc = count / len(actual_label_batch)
print("Batch Accuracy: " + str(acc))
print("note: some predictions may differ by single characters")
pd.DataFrame(results_table, columns=["Prediction", "Actual Label"])
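
The prediction step above takes the index of the highest score in each output row and compares it with the true label. A minimal sketch of that top-1 accuracy calculation on plain lists, assuming a hypothetical helper `top1_accuracy` and made-up scores:

```python
def top1_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring index equals the label."""
    if not labels:
        return 0.0
    # Index of the maximum score in each row, like an argmax over the class axis.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
print(top1_accuracy(scores, [1, 0, 1]))  # two of the three rows match
```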

## 4. Transfer Learning


Replace the pretrained head of the network with a new layer based on the number of classes in our dataset. Train the model using the new dataset for the specified number of epochs.

In [None]:
# Number of training epochs
num_epochs = 1

# Specify batch sizes for transfer learning
train_batch_size = 30
test_batch_size = 30

# To reduce training time, the feature extractor layer can remain frozen (do_fine_tuning=False).
# Fine-tuning can be enabled to potentially get better accuracy. Note that enabling fine-tuning
# will increase training time.
do_fine_tuning = False

In [None]:
def main(model, criterion, optimizer, dataset, dataset_test, num_epochs=1):
    since = time.time()
    
    device = torch.device("cpu")
    model = model.to(device)
    best_acc = 0.0

    # Create data loaders for training and validation
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=train_batch_size,
                                          shuffle=True, num_workers=4)
    data_loader_test = torch.utils.data.DataLoader(dataset_test, batch_size=test_batch_size,
                                          shuffle=False, num_workers=4)
    
    for epoch in range(num_epochs):
        print(f'Epoch {epoch + 1}/{num_epochs}')
        print('-' * 10)

        # Training phase
        model.train()
        running_loss = 0.0
        running_corrects = 0
        # Iterate over data.
        
        for inputs, labels in tqdm(data_loader, bar_format='{l_bar}{bar:50}{r_bar}{bar:-50b}'):
            x = len(inputs)
            
            inputs = inputs.to(device)
            labels = labels.to(device)
            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward and backward pass
            with torch.set_grad_enabled(True):
                inputs.unsqueeze_(1)
                inputs = inputs.permute(0, 2, 1, 3, 4)
                inputs = inputs.float()
                outputs = model(inputs)
                _, preds = torch.max(outputs, 1)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

            # Statistics
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

        epoch_loss = running_loss / len(dataset)
        epoch_acc = running_corrects.double() / len(dataset)
        print(f'Training Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
        
        # Evaluation phase
        model.eval()
        running_loss = 0.0
        running_corrects = 0
            
        # Iterate over data.
        for inputs, labels in tqdm(data_loader_test, bar_format='{l_bar}{bar:50}{r_bar}{bar:-50b}'):
            inputs = inputs.to(device)
            labels = labels.to(device)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            with torch.set_grad_enabled(False):
                inputs.unsqueeze_(1)
                inputs = inputs.permute(0, 2, 1, 3, 4)
                outputs = model(inputs)
                _, preds = torch.max(outputs, 1)
                loss = criterion(outputs, labels)

            # Statistics
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)
            
        epoch_loss = running_loss / len(dataset_test)
        epoch_acc = running_corrects.double() / len(dataset_test)

        if epoch_acc > best_acc:
            best_acc = epoch_acc
        
        print(f'Validation Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
        print()
        

    time_elapsed = time.time() - since
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best Validation Accuracy: {best_acc:.4f}')

    return model
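
The training loop multiplies each batch loss by the batch size before dividing by the dataset size. With the default `CrossEntropyLoss` reduction the batch loss is a mean, so this weighting yields a per-sample average even when the last batch is smaller. A short sketch of the arithmetic, with a hypothetical helper `epoch_average` and made-up numbers:

```python
def epoch_average(batch_means, batch_sizes):
    """Combine per-batch mean losses into a per-sample mean over the epoch."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# One full batch of 30 samples and a final partial batch of 10 samples.
print(epoch_average([0.5, 1.0], [30, 10]))  # 0.625
```

An unweighted mean of the two batch losses would give 0.75 instead, overstating the contribution of the smaller batch.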

In [None]:
model = get_retrainable_model(model_name, len(class_names), do_fine_tuning)
criterion = torch.nn.CrossEntropyLoss()

# Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

print('Trainable parameters: {}'.format(sum(p.numel() for p in model.parameters() if p.requires_grad)))

In [None]:
model, optimizer = ipex.optimize(model, optimizer=optimizer)
model = main(model, criterion, optimizer, dataset, dataset_test, num_epochs)

## 5. Predict

Now, let's see how transfer learning has changed the accuracy on the same batch.

In [None]:
model.eval()
outputs = model(inputs)
_, predicted_id = torch.max(outputs, 1)
predicted_label_batch = [class_names[id] for id in predicted_id]
count = 0
results_table = []
for prediction, actual in zip(predicted_label_batch, actual_label_batch):
    if prediction == actual:
        count += 1
    results_table.append([prediction, actual])

acc = count / (len(actual_label_batch))
print("Batch Accuracy: " + str(acc))

pd.DataFrame(results_table, columns=["Prediction", "Actual Label"])

## 6. Export the saved model

In [None]:
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
file_path = "{}/video_classification.pt".format(output_directory)
torch.save(model.state_dict(), file_path)
print("Saved to {}".format(file_path))

## Dataset citations
```
@inproceedings{Kuehne2011HMDB,
  title = {HMDB: A Large Video Database for Human Motion Recognition},
  author = {H. Kuehne and H. Jhuang and E. Garrote and T. Poggio and T. Serre},
  year = {2011}
}
@online{HMDB,
  author = {H. Kuehne and H. Jhuang and E. Garrote and T. Poggio and T. Serre},
  title = {HMDB51},
  year = {2011},
  url = {https://serre-lab.clps.brown.edu/wp-content/uploads/2013/10/hmdb51_org.rar}
}
```