# RATING Project

Comic Mischief Detection Task

Files:

1. "train.csv" : 
-- Contains multiclass classification content annotations for each video scene used in the training set.
-- Annotations are made at the scene level and do not correspond to a specific modality.
-- Lists the video URLs as well as the IDs of the scenes used in the training set.
-- Videos are available as URLs collected from the YouTube and IMDb websites.
-- Contains metadata about the videos (codec, resolution, average frame rate).
-- Four content categories related to comic mischief are used (Sarcasm, Slapstick Humor, Gory Humor, Mature Humor).

2. "val.csv" : 
-- Contains multiclass classification content annotations for each video scene used in the validation set.
-- Use this set to tune model hyperparameters before evaluating on the test set.


3. "test.csv" : 
-- Contains multiclass classification content annotations for each video scene used in the test set.
-- Use this set to evaluate your method.

In [None]:
!apt update && apt install ffmpeg libsm6 libxext6  -y

Get:1 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  InRelease
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Release [696 B]
Get:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  Release [564 B]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Release.gpg [836 B]
Get:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  Release.gpg [833 B]
Get:9 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [1546 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  P

In [None]:
!pip install opencv-python

Collecting opencv-python
  Downloading opencv_python-4.5.5.62-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (60.4 MB)
     |████████████████████████████████| 60.4 MB 31.2 MB/s 
Installing collected packages: opencv-python
Successfully installed opencv-python-4.5.5.62
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [None]:
!pip install imageio

Collecting imageio
  Downloading imageio-2.16.0-py3-none-any.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 24.3 MB/s 
Installing collected packages: imageio
Successfully installed imageio-2.16.0
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [None]:
!pip install -q git+https://github.com/tensorflow/docs

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [None]:
from tensorflow_docs.vis import embed
from tensorflow.keras import layers
from tensorflow import keras

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

In [None]:
# global variables
train_dir = "/work/train_data/"
val_dir = "/work/val_data/"
test_dir = "/work/test_data/"

# Hyperparameters
MAX_SEQ_LENGTH = 20
NUM_FEATURES = 1024
IMG_SIZE = 224

EPOCHS = 5

### References
1. https://keras.io/examples/vision/video_transformers/
2. https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
3. https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb

## Data Preparation (not complete)

### METADATA loading

In [None]:
# Create a dataframe which contains multiclass classification content annotations for each video scene used in the training set.
train_df = pd.read_csv('train.csv', index_col=0, dtype={'combination': object})
train_df["path"] = train_dir + train_df["Video ID"]+ ".0" + train_df["Scene_ID"].astype(str) + ".mp4"
train_df.head()

Unnamed: 0,Video ID,Scene_ID,Video URL,Codec,Resolution,Avg Frame rate,Mature Humor - Scene Annotation,Slapstick Humor - Scene Annotation,Gory Humor - Scene Annotation,Sarcasm - Scene Annotation,combination,path
0,tt2872718,0,https://www.imdb.com/videoplayer/vi4179799833,h264,854 x 480,23.976024,0,0,0,0,0,/work/train_data/tt2872718.00.mp4
1,tt2872718,1,https://www.imdb.com/videoplayer/vi4179799833,h264,854 x 480,23.976024,0,0,1,0,10,/work/train_data/tt2872718.01.mp4
2,tt2788710,0,https://www.imdb.com/videoplayer/vi1114222361,h264,854 x 480,23.976024,0,0,0,0,0,/work/train_data/tt2788710.00.mp4
3,tt2788710,1,https://www.imdb.com/videoplayer/vi1114222361,h264,854 x 480,23.976024,1,0,0,0,1000,/work/train_data/tt2788710.01.mp4
4,tt2788710,2,https://www.imdb.com/videoplayer/vi1114222361,h264,854 x 480,23.976024,0,0,0,0,0,/work/train_data/tt2788710.02.mp4


In [None]:
# Create a dataframe which contains multiclass classification content annotations for each video scene used in the validation set.
val_df = pd.read_csv('val.csv', index_col=0, dtype={'combination': object})
val_df["path"] = val_dir + val_df["Video ID"]+ ".0" + val_df["Scene_ID"].astype(str) + ".mp4"
val_df.head()

Unnamed: 0,Video ID,Scene_ID,Video URL,Codec,Resolution,Avg Frame rate,Mature Humor - Scene Annotation,Slapstick Humor - Scene Annotation,Gory Humor - Scene Annotation,Sarcasm - Scene Annotation,combination,path
21,tt1308728,0,https://www.youtube.com/watch?v=QP9qbhTeBII,h264,640 x 360,23.975945,1,0,0,0,1000,/work/val_data/tt1308728.00.mp4
22,tt1308728,1,https://www.youtube.com/watch?v=QP9qbhTeBII,h264,640 x 360,23.975945,1,1,0,0,1100,/work/val_data/tt1308728.01.mp4
23,tt1308728,2,https://www.youtube.com/watch?v=QP9qbhTeBII,h264,640 x 360,23.975945,1,0,0,0,1000,/work/val_data/tt1308728.02.mp4
30,PGuqnE35cCg,0,https://www.youtube.com/watch?v=PGuqnE35cCg,h264,640 x 360,23.976024,1,0,0,0,1000,/work/val_data/PGuqnE35cCg.00.mp4
31,PGuqnE35cCg,1,https://www.youtube.com/watch?v=PGuqnE35cCg,h264,640 x 360,23.976024,1,0,0,0,1000,/work/val_data/PGuqnE35cCg.01.mp4


In [None]:
# Create a dataframe which contains multiclass classification content annotations for each video scene used in the test set.
test_df = pd.read_csv('test.csv', index_col=0, dtype={'combination': object})
test_df["path"] = test_dir + test_df["Video ID"]+ ".0" + test_df["Scene_ID"].astype(str) + ".mp4"
test_df.head()

Unnamed: 0,Video ID,Scene_ID,Video URL,Codec,Resolution,Avg Frame rate,Mature Humor - Scene Annotation,Slapstick Humor - Scene Annotation,Gory Humor - Scene Annotation,Sarcasm - Scene Annotation,combination,path
13,tt1741243,0,https://www.imdb.com/videoplayer/vi2169153049,h264,534 x 360,29.97,1,0,0,0,1000,/work/test_data/tt1741243.00.mp4
14,tt1741243,1,https://www.imdb.com/videoplayer/vi2169153049,h264,534 x 360,29.97,0,1,1,0,110,/work/test_data/tt1741243.01.mp4
15,tt1723121,0,https://www.youtube.com/watch?v=O7NHfAzg7Yg,h264,640 x 360,23.976024,1,0,0,0,1000,/work/test_data/tt1723121.00.mp4
16,tt1723121,1,https://www.youtube.com/watch?v=O7NHfAzg7Yg,h264,640 x 360,23.976024,1,0,0,0,1000,/work/test_data/tt1723121.01.mp4
17,tt1723121,2,https://www.youtube.com/watch?v=O7NHfAzg7Yg,h264,640 x 360,23.976024,1,0,0,0,1000,/work/test_data/tt1723121.02.mp4


### Data processing

How can videos be fed to a neural network for training? <br>

1. Use OpenCV's `VideoCapture` class to read frames from the videos.

In [None]:
# Utilities to open video files using CV2
def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y:start_y+min_dim,start_x:start_x+min_dim]

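def load_video(path, max_frames=0, resize=(224, 224)):
    # Read frames until the video ends or `max_frames` is reached (0 = no limit).
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # OpenCV returns BGR; reorder to RGB.
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    # Keep pixel values in [0, 255]: `densenet.preprocess_input`, applied inside
    # the feature extractor, does its own scaling and normalization.
    return np.array(frames)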
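
The center-crop and channel-reordering steps can be checked without a video file. The following is a minimal sketch on a synthetic frame; the 480 x 854 size mirrors the "854 x 480" resolution listed in the metadata, and the pixel values are made up for illustration.

```python
import numpy as np

def crop_center_square(frame):
    # Same arithmetic as the utility above, repeated so this snippet runs alone.
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y:start_y + min_dim, start_x:start_x + min_dim]

# Synthetic frame in OpenCV layout: (height, width, channels), channels in BGR order.
frame = np.zeros((480, 854, 3), dtype=np.uint8)
frame[..., 0] = 255  # fill only the first channel (blue in BGR)

square = crop_center_square(frame)
rgb = square[:, :, [2, 1, 0]]  # reverse the channel order: BGR -> RGB

print(square.shape)  # the shorter side (480) determines the square size
print(rgb[0, 0])     # blue now sits in the last channel
```

The crop keeps the full height and trims the width symmetrically, so content near the left and right edges of wide frames is discarded before resizing.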

#### Feature Extraction

In [None]:
def build_feature_extractor():
    feature_extractor = keras.applications.DenseNet121(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.densenet.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

#### Label Preprocessing 

Multi-label classification --> multi-class classification: the four binary annotations of a scene (Mature, Slapstick, Gory, Sarcasm, in that order) are concatenated into a single 4-bit class string.



No Mature, No Slapstick, No Gory, No Sarcasm - 0000 <br>
No Mature, No Slapstick, No Gory, Sarcasm - 0001 <br>
No Mature, No Slapstick, Gory, No Sarcasm - 0010 <br>
No Mature, No Slapstick, Gory, Sarcasm - 0011 <br>
No Mature, Slapstick, No Gory, No Sarcasm - 0100 <br>
No Mature, Slapstick, No Gory, Sarcasm - 0101 <br>
No Mature, Slapstick, Gory, No Sarcasm - 0110 <br>
No Mature, Slapstick, Gory, Sarcasm - 0111 <br>
Mature, No Slapstick, No Gory, No Sarcasm - 1000 <br>
Mature, No Slapstick, No Gory, Sarcasm - 1001 <br>
Mature, No Slapstick, Gory, No Sarcasm - 1010 <br>
Mature, No Slapstick, Gory, Sarcasm - 1011 <br>
Mature, Slapstick, No Gory, No Sarcasm - 1100 <br>
Mature, Slapstick, No Gory, Sarcasm - 1101 <br>
Mature, Slapstick, Gory, No Sarcasm - 1110 <br>
Mature, Slapstick, Gory, Sarcasm - 1111
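
The mapping listed above can be sketched in a few lines of plain Python. The helper names `encode_combination` and `decode_combination` are hypothetical and not part of the dataset tooling. The `zfill(4)` call is an assumption motivated by the displayed tables, where the `combination` column appears as values such as `10` or `110` without leading zeros.

```python
# Bit order follows the listing above: Mature, Slapstick, Gory, Sarcasm.
CATEGORIES = ("Mature", "Slapstick", "Gory", "Sarcasm")

def encode_combination(mature, slapstick, gory, sarcasm):
    """Join four 0/1 flags into a 4-character class string such as '0110'."""
    flags = (mature, slapstick, gory, sarcasm)
    return "".join(str(int(bool(flag))) for flag in flags)

def decode_combination(code):
    """Map a class string (or number) back to a {category: bool} dictionary."""
    code = str(code).zfill(4)  # restore leading zeros if they were dropped
    return {name: bit == "1" for name, bit in zip(CATEGORIES, code)}

code = encode_combination(0, 1, 1, 0)
print(code)                      # Slapstick and Gory set
print(decode_combination(code))
print(decode_combination(10))    # treated as '0010', i.e. Gory only
```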

In [None]:
# Label preprocessing with StringLookup.
label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["combination"]), mask_token=None
)
print(label_processor.get_vocabulary())

['0000', '0001', '0010', '0011', '0100', '0101', '0110', '1000', '1001', '1010', '1011', '1100', '1101']
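
The printed vocabulary holds 13 of the 16 possible combinations. Because the layer is built with `num_oov_indices=0`, it has no out-of-vocabulary slot, and to my understanding Keras treats a label outside the vocabulary as an error in that configuration. A quick check before encoding the validation and test labels may therefore be worthwhile. The sketch below uses plain Python; `find_unseen_labels` is a hypothetical helper and the `val_labels` list is invented for illustration.

```python
from itertools import product

def find_unseen_labels(train_labels, other_labels):
    """Return the label strings in other_labels that never occur in train_labels."""
    return sorted(set(other_labels) - set(train_labels))

# Vocabulary as printed above.
train_vocab = ['0000', '0001', '0010', '0011', '0100', '0101', '0110',
               '1000', '1001', '1010', '1011', '1100', '1101']

# All 16 possible 4-bit combinations.
all_combinations = ["".join(bits) for bits in product("01", repeat=4)]
missing_from_train = find_unseen_labels(train_vocab, all_combinations)
print(missing_from_train)

# Hypothetical validation labels, made up for this example only.
val_labels = ['1000', '1110', '0000', '1111']
print(find_unseen_labels(train_vocab, val_labels))
```

If the second list is non-empty for the real data, one option is to build the vocabulary from the union of all splits; another is to reserve an out-of-vocabulary index.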


In [None]:
def prepare_all_videos(df, root_dir):
    # `root_dir` is currently unused: full paths are already stored in df["path"].
    num_samples = len(df)
    video_paths = df["path"].values.tolist()
    labels = df["combination"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_features` are what we will feed to our sequence model.
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(path)
        # Pad shorter videos with all-zero frames.
        if len(frames) < MAX_SEQ_LENGTH:
            diff = MAX_SEQ_LENGTH - len(frames)
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # np.concatenate takes the arrays as one sequence argument.
            frames = np.concatenate((frames, padding))

        frames = frames[None, ...]

        # Initialize placeholder to store the features of the current video.
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                # Skip all-zero (padding) frames so their features stay zero.
                if np.mean(batch[j, :]) > 0.0:
                    temp_frame_features[i, j, :] = feature_extractor.predict(
                        batch[None, j, :]
                    )

                else:
                    temp_frame_features[i, j, :] = 0.0

        frame_features[idx,] = temp_frame_features.squeeze()

    return frame_features, labels
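
The padding step can be tried in isolation. The sketch below uses a hypothetical helper, `pad_or_truncate`, and tiny 4 x 4 frames to keep memory use low; neither is part of the notebook. Note that `np.concatenate` takes the arrays to join as a single sequence, with the axis as a separate argument.

```python
import numpy as np

MAX_SEQ_LENGTH = 20  # same value as the notebook's hyperparameter

def pad_or_truncate(frames, max_len=MAX_SEQ_LENGTH):
    """Return exactly max_len frames: cut long clips, zero-pad short ones."""
    frames = frames[:max_len]
    if len(frames) < max_len:
        pad_shape = (max_len - len(frames),) + frames.shape[1:]
        padding = np.zeros(pad_shape, dtype=frames.dtype)
        frames = np.concatenate((frames, padding), axis=0)
    return frames

# Illustrative clips with small 4 x 4 RGB frames (assumed sizes, not real data).
short_clip = np.ones((12, 4, 4, 3), dtype="float32")
long_clip = np.ones((30, 4, 4, 3), dtype="float32")

padded = pad_or_truncate(short_clip)
cut = pad_or_truncate(long_clip)

print(padded.shape, cut.shape)
print(padded[12:].sum())  # the appended frames are all zeros
```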

In [None]:
train_data, train_labels = prepare_all_videos(train_df, "train_data")
val_data, val_labels = prepare_all_videos(val_df, "val_data")
test_data, test_labels = prepare_all_videos(test_df, "test_data")

print(f"Frame features in train set: {train_data.shape}")
print(f"Train labels: {train_labels}")

KernelInterrupted: Execution interrupted by the Jupyter kernel.

## Build the Transformer-based model (not completed) - BASE MODEL

In [None]:
# Embedding Layer
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )
        self.sequence_length = sequence_length
        self.output_dim = output_dim

    def call(self, inputs):
        # The inputs are of shape: `(batch_size, frames, num_features)`
        length = tf.shape(inputs)[1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_positions = self.position_embeddings(positions)
        return inputs + embedded_positions

    def compute_mask(self, inputs, mask=None):
        mask = tf.reduce_any(tf.cast(inputs, "bool"), axis=-1)
        return mask
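
To make the shapes concrete, here is a NumPy sketch of what the layer computes. It is not the Keras implementation: a random `position_table` stands in for the learned `layers.Embedding` weights, and all sizes are small illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, frames, features = 2, 5, 4  # illustrative sizes only

inputs = rng.normal(size=(batch, frames, features)).astype("float32")
inputs[1, 3:] = 0.0  # pretend the second clip has only 3 real frames

# Stand-in for the learned embedding table: one row per frame position.
position_table = rng.normal(size=(frames, features)).astype("float32")

positions = np.arange(frames)
embedded_positions = position_table[positions]  # shape (frames, features)
outputs = inputs + embedded_positions           # broadcast over the batch axis

# Same rule as compute_mask: a frame counts as real if any feature is non-zero.
mask = np.any(inputs.astype(bool), axis=-1)

print(outputs.shape)
print(mask)
```

The mask is derived from the raw inputs, before the position vectors are added, which is why zero-padded frames can still be identified downstream.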

In [None]:
# Subclassed layer
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim, dropout=0.3
        )
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation=tf.nn.gelu), layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]

        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)
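
The effect of the `mask[:, tf.newaxis, :]` reshaping can be illustrated with a simplified NumPy version of self-attention. This sketch is an assumption-laden reduction, not `layers.MultiHeadAttention`: it uses a single head and omits the learned query/key/value projections and dropout. The function name `masked_self_attention` is hypothetical.

```python
import numpy as np

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention with a key padding mask.

    x:    (batch, frames, dim) float array
    mask: (batch, frames) bool array, True for real frames
    """
    dim = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(dim)  # (batch, frames, frames)
    key_mask = mask[:, np.newaxis, :]                   # (batch, 1, frames)
    scores = np.where(key_mask, scores, -1e9)           # hide padded keys
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))
mask = np.array([[True] * 5, [True, True, True, False, False]])

out, weights = masked_self_attention(x, mask)
print(out.shape)
print(weights[1].round(3))  # the last two columns are zero for the padded clip
```

The added axis lets one per-clip mask broadcast across every query position, so no frame attends to padding.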

## Model Training and Testing
