# Build a Video Classification System in 5 Lines

This notebook illustrates how to build a video classification system from scratch using [Towhee](https://towhee.io/). A video classification system classifies videos into pre-defined categories. This tutorial uses the human-activity labels of a pretrained model as an example.

Using sample data covering different classes of human activities, we will build a basic video classification system within 5 lines of code and check its performance using Towhee. The tutorial also suggests some optimization options. At the end, we use [Gradio](https://gradio.app/) to create a showcase that you can play with.

## Preparation

### Install packages

Make sure you have installed the required Python packages:

| package |
| -- |
| pymilvus |
| towhee |
| towhee.models |
| pillow |
| ipython |
| gradio |

In [1]:
! python -m pip install -q pymilvus towhee towhee.models pillow ipython gradio

### Prepare data

This tutorial uses a small dataset extracted from the validation data of [Kinetics400](https://www.deepmind.com/open-source/kinetics). You can download the subset from [GitHub](https://github.com/towhee-io/data/releases/download/video-data/reverse_video_search.zip). Only the 200 videos under `train` are used here.

The data is organized as follows:
- **train:** 20 classes, 10 videos per class (200 in total)
- **reverse_video_search.csv:** a CSV file containing an ***id***, ***path***, and ***label*** for each video in the `train` directory

Let's take a quick look:

In [2]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/reverse_video_search.zip -O
! unzip -q -o reverse_video_search.zip

In [3]:
import pandas as pd

df = pd.read_csv('./reverse_video_search.csv')
print(df.head(3))
print(df.label.value_counts())

   id                                          path                 label
0   0  ./train/country_line_dancing/bTbC3w_NIvM.mp4  country_line_dancing
1   1  ./train/country_line_dancing/n2dWtEmNn5c.mp4  country_line_dancing
2   2  ./train/country_line_dancing/zta-Iv-xK7I.mp4  country_line_dancing
country_line_dancing     10
pumping_fist             10
playing_trombone         10
shuffling_cards          10
tap_dancing              10
clay_pottery_making      10
eating_hotdog            10
eating_carrots           10
juggling_soccer_ball     10
juggling_fire            10
javelin_throw            10
dunking_basketball       10
chopping_wood            10
trimming_trees           10
using_segway             10
pushing_cart             10
dancing_gangnam_style    10
riding_mule              10
drop_kicking             10
doing_aerobics           10
Name: label, dtype: int64


To make it easier to fetch videos and measure results in later steps, we define a helper function in advance:
- **ground_truth:** get the ground-truth label for a video by its path

In [4]:
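def ground_truth(path):
    # Look up the label for this video path in the dataframe loaded above.
    label = df.set_index('path').at[path, 'label']
    # Return a one-element list with underscores replaced by spaces.
    return [label.replace('_', ' ')]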
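
The lookup above can also be done without pandas. The following is a minimal sketch, assuming the same `id`, `path`, `label` column layout; the names `build_label_lookup` and `ground_truth_from_lookup` are hypothetical, and the three rows are the sample rows printed earlier. It builds the path-to-label mapping once instead of re-indexing on every call.

```python
import csv
import io

# Three sample rows in the same id/path/label layout as reverse_video_search.csv.
SAMPLE_CSV = """id,path,label
0,./train/country_line_dancing/bTbC3w_NIvM.mp4,country_line_dancing
1,./train/country_line_dancing/n2dWtEmNn5c.mp4,country_line_dancing
2,./train/country_line_dancing/zta-Iv-xK7I.mp4,country_line_dancing
"""


def build_label_lookup(csv_file):
    """Map each video path to its label, with underscores replaced by spaces."""
    reader = csv.DictReader(csv_file)
    return {row['path']: row['label'].replace('_', ' ') for row in reader}


# Hypothetical helper: build the mapping once and reuse it for every lookup.
label_lookup = build_label_lookup(io.StringIO(SAMPLE_CSV))


def ground_truth_from_lookup(path):
    """Return the ground-truth label for a path as a one-element list."""
    return [label_lookup[path]]


print(ground_truth_from_lookup('./train/country_line_dancing/bTbC3w_NIvM.mp4'))
```

To use it on the real data, pass an open file handle for `reverse_video_search.csv` instead of the in-memory sample.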

## Build System

Now we are ready to build a video classification system using the sample data. We will use the [X3D_M](https://arxiv.org/abs/2004.04730) model to predict the most likely action labels for input videos. With proper [Towhee operators](https://towhee.io/operators), you don't need to go through video preprocessing & model details. It is very simple to use the [method-chaining style API](https://towhee.readthedocs.io/en/main/index.html) to wrap operators and then apply them to batch inputs.

### Predict labels

Let's take some 'tap_dancing' videos as an example to see how to predict labels for videos within 5 lines. By default, the system predicts the top 5 labels, sorted by score (likelihood) from high to low. You can control the number of labels returned by changing `topk`. Please note that the first run will take some time to download the model.

In [5]:
import towhee

(
    towhee.glob['path']('./train/tap_dancing/*.mp4')
          .video_decode.ffmpeg['path', 'frames'](sample_type='uniform_temporal_subsample', args={'num_samples': 16})
          .action_classification['frames', ('predicts', 'scores', 'features')].pytorchvideo(
              model_name='x3d_m', skip_preprocess=True, topk=5)
          .select['path', 'predicts', 'scores']()
          .show()
)

Using cache found in /home/mengjia.gu/.cache/torch/hub/facebookresearch_pytorchvideo_main


| path | predicts | scores |
| -- | -- | -- |
| ./train/tap_dancing/PehoEu4WfEI.... | [tap dancing,zumba,breakdancing,dancing gangnam style,...] len=5 | [0.00469,0.0029,0.0026,0.00257,...] len=5 |
| ./train/tap_dancing/X7k8twydJIU.... | [robot dancing,tap dancing,breakdancing,krumping,...] len=5 | [0.00542,0.00279,0.00265,0.00255,...] len=5 |
| ./train/tap_dancing/Krh21z_zyV8.... | [tap dancing,dancing ballet,roller skating,dancing charleston,...] len=5 | [0.00673,0.0025,0.00249,0.00249,...] len=5 |
| ./train/tap_dancing/Uf1PiOF8Poc.... | [tap dancing,dancing ballet,country line dancing,dancing charleston,...] len=5 | [0.0045,0.00362,0.00256,0.0025,...] len=5 |
| ./train/tap_dancing/PGPn8WhG3pM.... | [tap dancing,dancing charleston,country line dancing,jumpstyle dancing,...] len=5 | [0.00578,0.0029,0.0025,0.00249,...] len=5 |
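
The top-k selection described above (labels sorted by score from high to low, cut off at `topk`) can be sketched in plain Python. This is only an illustration of the idea, not the operator's implementation; the function `top_k` and the sample labels and scores are made up for the example.

```python
def top_k(labels, scores, k=5):
    """Return the k highest-scoring labels and their scores, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)[:k]
    return [label for label, _ in ranked], [score for _, score in ranked]


# Illustrative labels and scores only, not real model output.
labels = ['zumba', 'tap dancing', 'krumping', 'breakdancing']
scores = [0.0029, 0.00469, 0.001, 0.0026]

predicts, top_scores = top_k(labels, scores, k=3)
print(predicts)
print(top_scores)
```

Lowering `k` simply truncates the ranked list, which is why the top-1 and top-3 predictions can be derived from a top-5 result by slicing.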


#### Pipeline Explanation

Here are some details for each line of the assembled pipeline:

- `towhee.glob['path']()`: collects the paths of the files matching the given pattern into the `path` column (the evaluation section uses `towhee.read_csv()` instead, which reads tabular data from a CSV file)

- `.video_decode.ffmpeg()`: an embedded Towhee operator that reads a video as frames with the specified sampling method and number of samples. [learn more](https://towhee.io/video-decode/ffmpeg)

- `.action_classification.pytorchvideo()`: an embedded Towhee operator that applies the specified model to video frames, which can be used to predict labels and extract features for a video. [learn more](https://towhee.io/action-classification/pytorchvideo)
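
The `uniform_temporal_subsample` option with `num_samples` set to 16 suggests that frames are picked at evenly spaced positions across the clip. Below is a minimal sketch of one way to compute such positions, using integer arithmetic over the frame range. The function `uniform_sample_indices` is hypothetical, and the operator's actual implementation may differ in rounding details.

```python
def uniform_sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly from the first to the last frame."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError('num_frames and num_samples must be positive')
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    # Integer floor of i * last / (num_samples - 1); avoids floating-point drift.
    return [i * last // (num_samples - 1) for i in range(num_samples)]


print(uniform_sample_indices(31, 16))  # every second frame of a 31-frame clip
print(uniform_sample_indices(10, 4))
print(uniform_sample_indices(4, 8))    # short clip: some indices repeat
```

When the clip has fewer frames than `num_samples`, this sketch repeats indices so that the output length stays fixed, which is what a model expecting a fixed number of frames would need.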

### Evaluation

We have just shown how to classify videos, but how well does the system perform? Towhee provides different metrics to evaluate predicted results against ground truths.

In this section, we'll measure the performance with the average metric value:

- **mHR (recall@K):**
    - Mean Hit Ratio describes how many of the ground truths actually appear among the returned results.
    - Since we predict the top K labels and each video has only 1 ground truth, the mean hit ratio is equivalent to recall@K.
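
The metric described above can be sketched in a few lines of plain Python. This is an illustration of the definition rather than Towhee's own implementation; the functions `hit_ratio` and `mean_hit_ratio` and the sample labels are assumptions made for the example.

```python
def hit_ratio(ground_truth, predicted):
    """Fraction of ground-truth labels that appear among the predicted labels."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for label in ground_truth if label in predicted)
    return hits / len(ground_truth)


def mean_hit_ratio(ground_truths, predictions, k=None):
    """Average hit ratio over all videos, using only the first k predictions."""
    ratios = [hit_ratio(gt, pred[:k]) for gt, pred in zip(ground_truths, predictions)]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Illustrative data only: one ground-truth label and three ranked predictions per video.
ground_truths = [['tap dancing'], ['tap dancing'], ['eating carrots'], ['riding mule']]
predictions = [
    ['tap dancing', 'zumba', 'breakdancing'],
    ['robot dancing', 'tap dancing', 'krumping'],
    ['eating chips', 'shaking head', 'eating carrots'],
    ['riding camel', 'riding elephant', 'petting animal'],
]

for k in (1, 2, 3):
    print(f'top{k}: {mean_hit_ratio(ground_truths, predictions, k)}')
```

With a single ground truth per video, each per-video ratio is either 0 or 1, so the mean is simply the share of videos whose true label appears in the top K.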

In [6]:
import time

start = time.time()
dc = (
    towhee.read_csv('reverse_video_search.csv').unstream()
          .video_decode.ffmpeg['path', 'frames'](sample_type='uniform_temporal_subsample', args={'num_samples': 16})
          .action_classification['frames', ('predicts', 'scores', 'features')].pytorchvideo(
              model_name='x3d_m', skip_preprocess=True, topk=5)
)
end = time.time()
print(f'Total time: {end-start}')

benchmark = (
    dc.runas_op['path', 'ground_truth'](func=ground_truth)
      .runas_op['predicts', 'top1'](func=lambda x: x[:1])
      .runas_op['predicts', 'top3'](func=lambda x: x[:3])
      .with_metrics(['mean_hit_ratio'])
      .evaluate['ground_truth', 'top1'](name='top1')
      .evaluate['ground_truth', 'top3'](name='top3')
      .evaluate['ground_truth', 'predicts'](name='top5')
      .report()
)

Using cache found in /home/mengjia.gu/.cache/torch/hub/facebookresearch_pytorchvideo_main


Total time: 92.92158579826355


| | mean_hit_ratio |
| -- | -- |
| top1 | 0.7 |
| top3 | 0.875 |
| top5 | 0.9 |


## Optimization

You're always encouraged to play around with the tutorial. We present some optimization options here to make improvements in accuracy, latency, and resource usage. With these methods, you can make the classification system better in performance and more feasible in production.

### Change model

There are more video models using different networks. Normally, a more complex or larger model gives better results but costs more. You can always try more models to trade off accuracy, latency, and resource usage. Here we show the performance of video classification using a SOTA model with a [multiscale vision transformer](https://arxiv.org/abs/2104.11227) as the backbone.

In [7]:
benchmark = (
    towhee.read_csv('reverse_video_search.csv').unstream()
          .video_decode.ffmpeg['path', 'frames'](sample_type='uniform_temporal_subsample', args={'num_samples': 32})
          .action_classification['frames', ('predicts', 'scores', 'features')].pytorchvideo(
              model_name='mvit_base_32x3', skip_preprocess=True, topk=5)
          .runas_op['path', 'ground_truth'](func=ground_truth)
          .runas_op['predicts', 'top1'](func=lambda x: x[:1])
          .runas_op['predicts', 'top3'](func=lambda x: x[:3])
          .with_metrics(['mean_hit_ratio'])
          .evaluate['ground_truth', 'top1'](name='top1')
          .evaluate['ground_truth', 'top3'](name='top3')
          .evaluate['ground_truth', 'predicts'](name='top5')
          .report()
)

Using cache found in /home/mengjia.gu/.cache/torch/hub/facebookresearch_pytorchvideo_main


| | mean_hit_ratio |
| -- | -- |
| top1 | 0.745 |
| top3 | 0.9 |
| top5 | 0.92 |


### Parallel Execution

You can enable parallel execution by simply calling `set_parallel` within the pipeline, which tells Towhee to process the data in parallel. The code below enables parallel execution on the example above. In the run shown, classifying the 200 videos with 5 parallel executions finishes within 2 seconds.

In [8]:
start = time.time()
dc = (
    towhee.read_csv('reverse_video_search.csv')
          .set_parallel(5)
          .video_decode.ffmpeg['path', 'frames'](sample_type='uniform_temporal_subsample', args={'num_samples': 16})
          .action_classification['frames', ('predicts', 'scores', 'features')].pytorchvideo(
              model_name='x3d_m', skip_preprocess=True, topk=5)
)
end = time.time()
print(f'Total time: {end-start}')

Using cache found in /home/mengjia.gu/.cache/torch/hub/facebookresearch_pytorchvideo_main


Total time: 1.6542744636535645
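
As a rough analogy for what processing items with 5 parallel workers means, here is a sketch using Python's standard `concurrent.futures` module. It is not how Towhee implements `set_parallel`; `classify_stub` and the file names are hypothetical placeholders standing in for the decode and classify steps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify_stub(path):
    """Stand-in for decoding and classifying one video (hypothetical)."""
    time.sleep(0.01)  # simulate some work per video
    return path, 'placeholder label'


# Hypothetical file names, used only to have something to iterate over.
paths = [f'./train/tap_dancing/video_{i}.mp4' for i in range(10)]

# Up to 5 items are processed at the same time; map() keeps the input order.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(classify_stub, paths))

print(len(results))
print(results[0])
```

In CPython, threads mostly help when the work is I/O-bound or when the heavy computation releases the global interpreter lock, so the speed-up depends on the workload.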


### Exception Safe

With large-scale data, some bad data may cause errors. Typically, users don't want such errors to break the system in production. Therefore, the pipeline should continue to process the rest of the videos and report the broken ones.

Towhee supports an `exception-safe` execution mode that lets the pipeline continue when an exception occurs and represents the failed items with Empty values. The user can choose how to deal with the Empty values at the end of the pipeline. In the query below, the `exception` folder contains 4 files in total, one of which is broken. With `exception-safe`, the pipeline prints the ERROR but does NOT terminate the process. As the results show, `drop_empty` deletes the empty data.

In [9]:
(
    towhee.glob['path']('./exception/*')
          .exception_safe()
          .video_decode.ffmpeg['path', 'frames'](sample_type='uniform_temporal_subsample', args={'num_samples': 16})
          .action_classification['frames', ('labels', 'scores', 'vec')].pytorchvideo(
              model_name='x3d_m', skip_preprocess=True)
          .drop_empty()
          .select['path', 'labels']()
          .show()
)

Using cache found in /home/mengjia.gu/.cache/torch/hub/facebookresearch_pytorchvideo_main
2022-06-08 21:51:35,523 - 139916792783424 - video_decoder.py-video_decoder:121 - ERROR: moov atom not found


| path | labels |
| -- | -- |
| ./exception/kDuAS29BCwk.mp4 | [chopping wood,sword fighting,throwing axe,walking the dog,...] len=5 |
| ./exception/ty4UQlowp0c.mp4 | [eating carrots,eating spaghetti,shaking head,eating chips,...] len=5 |
| ./exception/rJu8mSNHX_8.mp4 | [shaking head,finger snapping,laughing,eating ice cream,...] len=5 |
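
The pattern behind this behavior (catch the error, substitute an empty marker, filter the markers out at the end) can be sketched in plain Python. This is an illustration of the idea, not Towhee's implementation; the `Empty` class, the `exception_safe` wrapper, `decode_stub`, and the file names are all hypothetical.

```python
class Empty:
    """Marker for an item whose processing failed (hypothetical)."""

    def __init__(self, error):
        self.error = error


def exception_safe(func):
    """Wrap func so that a failure yields an Empty marker instead of raising."""
    def wrapper(item):
        try:
            return func(item)
        except Exception as exc:
            print(f'ERROR: {exc}')  # report the problem but keep going
            return Empty(exc)
    return wrapper


def decode_stub(path):
    """Stand-in for a video decoder that fails on a broken file."""
    if path.endswith('broken.mp4'):
        raise ValueError(f'cannot decode {path}')
    return f'frames of {path}'


paths = ['a.mp4', 'broken.mp4', 'b.mp4', 'c.mp4']
outputs = [exception_safe(decode_stub)(p) for p in paths]

# Equivalent in spirit to dropping empty values at the end of the pipeline.
kept = [out for out in outputs if not isinstance(out, Empty)]
print(kept)
```

Keeping the marker in place until the end lets the caller decide whether to drop failed items, log them, or retry them.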


## Release a Showcase

We've learnt how to build a video classification system. Now it's time to add an interface and release a showcase. Towhee provides `towhee.api()` to wrap the data processing pipeline as a function with `.as_function()`, so we can build a quick demo around this `action_classification_function` with [Gradio](https://gradio.app/).

In [10]:
import gradio

topk = 3
with towhee.api() as api:
    action_classification_function = (
         api.video_decode.ffmpeg(
                sample_type='uniform_temporal_subsample', args={'num_samples': 32})
            .action_classification.pytorchvideo(model_name='mvit_base_32x3', skip_preprocess=True, topk=topk)
            .runas_op(func=lambda res: dict(zip(res[0], res[1])))
            .as_function()
    )
    

interface = gradio.Interface(action_classification_function, 
                             inputs=gradio.Video(source='upload'),
                             outputs=[gradio.Label(num_top_classes=topk)]
                            )


interface.launch(inline=True, share=True)

Using cache found in /home/mengjia.gu/.cache/torch/hub/facebookresearch_pytorchvideo_main


Running on local URL:  http://127.0.0.1:7866/
Running on public URL: https://56923.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<gradio.routes.App at 0x7f3f3d9944c0>,
 'http://127.0.0.1:7866/',
 'https://56923.gradio.app')

<img src='action_classification_demo.png' alt='action_classification_demo' width=700px/>
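
The last pipeline step in the demo turns the classifier output into a mapping from label to score, which is the form a Gradio `Label` component can display. Here is a minimal sketch of that conversion on its own, assuming the result is a pair of (labels, scores) lists. The function `to_label_scores` and the sample values are hypothetical, and the `float()` cast is an addition that is not in the demo code.

```python
def to_label_scores(result):
    """Convert (labels, scores) into a {label: score} dictionary."""
    labels, scores = result[0], result[1]
    # Cast to plain floats in case the scores are framework-specific numbers.
    return {label: float(score) for label, score in zip(labels, scores)}


# Illustrative values only, not real model output.
example = (['tap dancing', 'zumba', 'breakdancing'], [0.6, 0.3, 0.1])
print(to_label_scores(example))
```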