# R(2+1)D Model on Webcam Stream

## Prerequisite for Webcam example
This notebook assumes you have a webcam connected to your machine. To run the model and code on a remote VM while streaming from a webcam on your local machine, use an SSH tunnel:

1. Connect to your VM over SSH, forwarding port 8888:
   `$ ssh -L 8888:localhost:8888 <user-id@url-to-your-vm>`
1. Launch a Jupyter session on the VM on port 8888 (the default).
1. Open `localhost:8888` in a browser on the local machine that has the webcam to access the Jupyter notebook running on the VM.
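
Before opening the browser, it can help to confirm that the forwarded port accepts connections. The following is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper, not part of this notebook or of Jupyter, and it only checks that something is listening on the port, not that it is a Jupyter server.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8888 is the port forwarded by the SSH command above
    print("Tunnel reachable:", is_port_open("localhost", 8888))
```

If this prints `False` on the local machine, the SSH session or the Jupyter server on the VM is probably not running.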

We use the `ipywebrtc` module to show the webcam widget in the notebook. Currently, the widget works on Chrome and Firefox. For more details about the widget, see the [ipywebrtc GitHub repository](https://github.com/maartenbreddels/ipywebrtc).

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
from collections import deque
import io
import os
import sys
from time import sleep, time
from threading import Thread

import decord
import IPython.display
from ipywebrtc import CameraStream, ImageRecorder, VideoStream
from ipywidgets import HBox, HTML, Layout, VBox, Widget
import numpy as np
from PIL import Image
import torch
import torch.cuda as cuda
import torch.nn as nn
from torchvision.transforms import Compose

from vu.data import KINETICS
from vu.models.r2plus1d import R2Plus1D 
from vu.utils import system_info, transforms_video as transforms

system_info()

## Load Pre-trained Model

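Load the 34-layer R(2+1)D model pre-trained on IG65M and fine-tuned on Kinetics400. The model comes in two versions, an 8-frame model and a 32-frame model, named after the input clip length. The 32-frame model is slower than the 8-frame model. The cells below use the 32-frame version (`MODEL_INPUT_SIZE = 32`), set `num_classes=2`, and load a custom checkpoint.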

In [None]:
DATA_ROOT = os.path.join("data", "hmdb51")
VIDEO_DIR = os.path.join(DATA_ROOT, "videos")
# This split is known as "split1"
TRAIN_SPLIT = os.path.join(DATA_ROOT, "hmdb51_vid_train_split_1.txt")
TEST_SPLIT = os.path.join(DATA_ROOT, "hmdb51_vid_val_split_1.txt")

In [None]:
# Input clip length: 8 for the 8-frame model, 32 for the 32-frame model
MODEL_INPUT_SIZE = 32
# Batch size; use 16 for the 8-frame model
BATCH_SIZE = 8

# Model configuration
r2plus1d_custom_cfgs = dict(
    # HMDB51 dataset spec
    num_classes=2,
    video_dir=VIDEO_DIR,
    train_split=TRAIN_SPLIT,
    valid_split=TEST_SPLIT,
    # Pre-trained model spec ("Closer look" and "Large-scale" papers)
    base_model='ig65m',
    sample_length=MODEL_INPUT_SIZE,     
    sample_step=1,        # Frame sampling step
    im_scale=128,         # After scaling, the frames will be cropped to (112 x 112)
    mean=(0.43216, 0.394666, 0.37645),
    std=(0.22803, 0.22145, 0.216989),
    random_shift=True,
    temporal_jitter_step=2,    # Temporal jitter step in frames (only for training set)
    flip_ratio=0.5,
    random_crop=True,
    video_ext='mp4',
)

# Training configuration
train_cfgs = dict(
    mixed_prec=False,
    batch_size=BATCH_SIZE,
    grad_steps=2,
    lr=0.001,         # 0.001 ("Closer look" paper, HMDB51)
    momentum=0.95,
    warmup_pct=0.3,  # First 30% of the steps will be used for warming-up
    lr_decay_factor=0.001,
    weight_decay=0.0001,
    epochs=48,
    model_name='custom',
    model_dir=os.path.join("checkpoints", "ig65m_kinetics"),
)

In [None]:
learn = R2Plus1D(r2plus1d_custom_cfgs)

In [None]:
learn.load(model_dir="checkpoints", model_name="ig65m_kinetics/custom_021")

In [None]:
model = learn.model

### Prepare class names
Since the model configured above has two output classes (`num_classes=2`), we define its class names directly rather than loading the Kinetics400 labels. The order of the list must match the model's output indices:

In [None]:
labels = ["OpeningRackDoor", "NoAction"]

With only two classes, both are reported and no model outputs are filtered out.

### Load model to device

In [None]:
if cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
model.to(device)
model.eval()

## Run Model
Here, we use sliding-window classification for action recognition on the continuous stream. We average the scores of the last 5 windows to smooth the predictions, and we reject classes whose score is below `SCORE_THRESHOLD`.

In [None]:
SCORE_THRESHOLD = 0.04
AVERAGING_SIZE = 5  # Averaging 5 latest clips to make video-level prediction (or smoothing)
NUM_CLASSES = 2
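
The averaging and thresholding described above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's implementation: `smooth_scores` and `accepted` are hypothetical helper names, and the per-window scores are made-up numbers standing in for model outputs.

```python
from collections import deque

SCORE_THRESHOLD = 0.04
AVERAGING_SIZE = 5


def smooth_scores(per_window_scores, num_classes, averaging_size=AVERAGING_SIZE):
    """Yield the mean of the latest `averaging_size` score vectors.

    Nothing is yielded until `averaging_size` windows have been seen.
    """
    cache = deque()
    running_sum = [0.0] * num_classes
    for scores in per_window_scores:
        cache.append(scores)
        running_sum = [s + x for s, x in zip(running_sum, scores)]
        if len(cache) == averaging_size:
            yield [s / averaging_size for s in running_sum]
            # Drop the oldest window so the next mean covers the latest ones
            oldest = cache.popleft()
            running_sum = [s - x for s, x in zip(running_sum, oldest)]


def accepted(avg_scores, labels, threshold=SCORE_THRESHOLD):
    """Return (label, score) pairs at or above `threshold`, highest score first."""
    kept = [(label, s) for label, s in zip(labels, avg_scores) if s >= threshold]
    return sorted(kept, key=lambda kv: -kv[1])


# Made-up scores for six consecutive windows of a two-class model
labels = ["OpeningRackDoor", "NoAction"]
windows = [[0.9, 0.1]] * 4 + [[0.99, 0.01]] * 2
for avg in smooth_scores(windows, num_classes=2):
    print(accepted(avg, labels))
```

Keeping a running sum means each new window costs one addition and one subtraction per class, rather than re-averaging the whole cache.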

In [None]:
NUM_FRAMES = 32
IM_SCALE = 128    # resize then crop
INPUT_SIZE = 112  # input clip size: 3 x NUM_FRAMES x 112 x 112
# Normalization
MEAN = (0.43216, 0.394666, 0.37645)
STD = (0.22803, 0.22145, 0.216989)

In [None]:
transform = Compose([
    transforms.ToTensorVideo(),
    transforms.ResizeVideo(IM_SCALE),
    transforms.CenterCropVideo(INPUT_SIZE),
    transforms.NormalizeVideo(MEAN, STD)
])
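
To make the crop and normalization steps concrete, here is a small sketch of the arithmetic for a single frame and a single pixel. It rests on assumptions: that the resized frame is 128 x 170 (shorter side scaled to `IM_SCALE`), and that `ToTensorVideo` scales pixel values to [0, 1] as the torchvision video transforms do. `center_crop_box` and `normalize_pixel` are hypothetical helpers, not functions from `vu.utils`.

```python
MEAN = (0.43216, 0.394666, 0.37645)
STD = (0.22803, 0.22145, 0.216989)


def center_crop_box(height, width, size):
    """Return (top, left, bottom, right) of a centered size x size crop."""
    if size > height or size > width:
        raise ValueError("crop size exceeds the frame size")
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size


def normalize_pixel(rgb_uint8, mean=MEAN, std=STD):
    """Scale 0-255 channel values to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb_uint8, mean, std))


# Assumed frame size after resizing: 128 x 170
print(center_crop_box(128, 170, 112))
# A mid-gray pixel ends up close to zero in every channel
print(normalize_pixel((128, 128, 128)))
```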

In [None]:
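def predict(frames, transform, device, model):
    """Return softmax class scores for one clip given as a sequence of frames."""
    clip = torch.from_numpy(np.array(frames))
    # Transform frames and prepend the batch dimension
    sample = torch.unsqueeze(transform(clip), 0)
    sample = sample.to(device)
    # Inference only: skip gradient tracking to save memory and time
    with torch.no_grad():
        output = model(sample)
    scores = nn.functional.softmax(output, dim=1).cpu().numpy()[0]

    return scores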
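
The scores returned by `predict` are softmax probabilities, so they are non-negative and sum to 1 across classes. As a rough illustration of that last step, here is softmax in plain Python; the logits are made-up values, not real model outputs.

```python
import math


def softmax(logits):
    """Convert raw model outputs (logits) to probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the two classes
scores = softmax([2.0, -1.0])
print(scores, sum(scores))
```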

### Appendix: Run on a video file
Here, we show how to use the model on a video file. We utilize threading so that the inference does not block the video preview.
* Prerequisite - Download HMDB51 video files from [here](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#Downloads)

In [None]:
def _predict_video_frames(window, scores_cache, scores_sum, is_ready):
    t = time()
    scores = predict(window, transform, device, model)
    dur = time() - t
    # Averaging scores across clips (dense prediction)
    scores_cache.append(scores)
    scores_sum += scores
    if len(scores_cache) == AVERAGING_SIZE:
        scores_avg = scores_sum / AVERAGING_SIZE
        top_id_score_dict = {
            i: scores_avg[i] for i in (-scores_avg).argpartition(1)[:2]
        }
        top = {labels[k]: v for k, v in top_id_score_dict.items()}
        top = sorted(top.items(), key=lambda kv: -kv[1])
        # Show the inference rate and the top classes.
        # d_caption is the module-level display handle created in the cell below.
        d_caption.update(IPython.display.HTML(
            "{:.1f} fps<p style='font-size:20px'>".format(1 / dur) + "<br>".join([
                "{} ({:.3f})".format(k, v) for k, v in top
            ]) + "</p>"
        ))
        scores_sum -= scores_cache.popleft()
    
    # Inference done. Ready to run on the next frames.
    window.popleft()
    is_ready[0] = True

def predict_video_frames(video_filepath, d_video, d_caption):
    """Load video and show frames and inference results on
    d_video and d_caption displays
    """
    video_reader = decord.VideoReader(video_filepath)
    print("Total frames = {}".format(len(video_reader)))
    
    is_ready = [True]
    window = deque()
    scores_cache = deque()
    scores_sum = np.zeros(NUM_CLASSES)
    while True:
        try:
            frame = video_reader.next().asnumpy()
            if len(frame.shape) != 3:
                break
            
            # Start an inference thread when ready
            if is_ready[0]:
                window.append(frame)
                if len(window) == NUM_FRAMES:
                    is_ready[0] = False
                    Thread(
                        target=_predict_video_frames,
                        args=(window, scores_cache, scores_sum, is_ready)
                    ).start()
                    
            # Show video preview
            f = io.BytesIO()
            im = Image.fromarray(frame)
            im.save(f, 'jpeg')

            d_video.update(IPython.display.Image(data=f.getvalue()))
            sleep(0.03)
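        except Exception:
            # Stop at the end of the video or on a read/display error;
            # KeyboardInterrupt propagates to the caller
            break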


In [None]:
#video_filepath = os.path.join(
#    "data",
#    "testvid.mp4"
#)
video_filepath = os.path.join("data", "custom", "testvid1 3173 1-7 Cold 2019-08-19_16_26_46_900.mp4")
#video_filepath = os.path.join("data", "custom", "testvid2 3173 8-15 Cold 2019-08-19_16_43_40_350.mp4")
#video_filepath = os.path.join("data", "custom", "testvid3 3173 8-15 Cold 2019-08-19_16_56_18_925.mp4")
#video_filepath = os.path.join("data", "custom", "testvid3 3173 8-15 Cold 2019-08-19_16_43_40_350.mp4")


d_video = IPython.display.display("", display_id=1)
d_caption = IPython.display.display("Preparing...", display_id=2)

try:
    predict_video_frames(video_filepath, d_video, d_caption)
except KeyboardInterrupt:
    pass
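
The hand-off between the preview loop and the inference thread in `predict_video_frames` can be sketched in isolation. This is a simplified stand-in under stated assumptions: `slow_inference` and `run_preview` are hypothetical names, integers stand in for frames, and `time.sleep` stands in for the model call and the display update.

```python
import threading
import time
from collections import deque


def slow_inference(window, results, is_ready):
    """Stand-in for model inference on one full window of frames."""
    time.sleep(0.05)              # pretend the model takes 50 ms
    results.append(sum(window))   # placeholder for the class scores
    window.popleft()              # slide the window forward by one frame
    is_ready[0] = True            # let the main loop fill the window again


def run_preview(frames, window_size=4):
    """Show every frame; run inference in the background whenever a window is full."""
    is_ready = [True]  # one-element list so the worker thread can flip the flag
    window = deque()
    results = []
    workers = []
    shown = 0
    for frame in frames:
        # Frames that arrive while inference is running are previewed but not queued
        if is_ready[0]:
            window.append(frame)
            if len(window) == window_size:
                is_ready[0] = False
                worker = threading.Thread(
                    target=slow_inference, args=(window, results, is_ready)
                )
                worker.start()
                workers.append(worker)
        shown += 1        # placeholder for updating the video display
        time.sleep(0.01)  # placeholder for the preview frame interval
    for worker in workers:
        worker.join()
    return shown, results


shown, results = run_preview(list(range(20)))
print(shown, len(results))
```

Every frame is previewed, while only some windows are classified. That is the trade-off of this design: the preview stays smooth at the cost of skipping frames during inference.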