## Introduction

Scenes are important part of storytelling in movies. Detecting semantic scene changes involve understanding the interactions between actors and their environments. The task for this project is to build a machine learning system that detect semantic changes in movie scenes. First, let us define a few vocabularies. A movie is a sequence of _shots_ and _scenes_ and they are quite different. A __shot__ is series of frames captured by a camera for an uninterrupted period of time. A __scene__ is a plot-based semantic unit that is made up of a series of shots.

The data is a set of 64 `<imbd id>.pkl` files provided by [eluv.io](https://eluv.io). Each file is a movie containing the following information:
* Movie-level: the movie's IMBD identification.
* Shot-level: four features (`place`, `cast`, `action`, and `audio`). These features are two-dimensional tensors extracted according to the encoding methods found in [(Rao et al.)](https://arxiv.org/pdf/2004.02678.pdf). The first dimension is the number of shots in the movie. The second dimension are 2048, 512, 512, 512, respectively.
* Scene-level:
    - Ground truth (`scene_transition_boundary_ground_truth`) which is a boolean vector labeling scene transition boundaries.
    - Preliminary scene transition prediction (`scene_transition_boundary_prediction`) is a prediction template indicating the probability of a shot being a scene boundary.
    - The `shot_end_frame` is used for evaluation purpose.
    
Now that we have the data related details out of the way, let us discuss the structure of the rest of this notebook. Section [1](#section1) and [2](#section2) are the typically setup to load the data. Section [3](#section3) go over the data processing and transformations. Section [4](#section4) builds and train LSTM and WaveNet models. Section [5](#section5') discusses the selected model and performance.

<a id='section1'></a>
## 1. Setup

In [1]:
import tensorflow as tf
from tensorflow import keras
import torch
import numpy as np
import pandas as pd
import glob
import sys
import os
import joblib
import pickle

np.random.seed(42)

# Where to get the data
PATH = os.path.join(os.getcwd(), "data_dir")

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

<a id='section2'></a>
## 2. Get the data

To load the data, we'll use the `fetch_movies()` function below to unpickle the files and load them into a list of Python dictionaries. Notice the length of the movies ranges from 600 shots to 3100 shots. We will use the maximum length later in the data transformation process so each training instance would have the same shape when fed into tensorflow.

In [2]:
def fetch_movies(path=PATH):
    """
    Load .pkl movie files
    
    Argument:
    ---------
    path -- string representing files path
    """
    filenames = glob.glob(os.path.join(PATH, "tt*.pkl"))
    movies = []
    for fn in filenames:
        try:
            with open(fn, 'rb') as fin:
                movies.append(pickle.load(fin))
        except EOFError:
            break
    return movies

In [3]:
movies = fetch_movies()

In [4]:
# Get the train and test data sets
movies = fetch_movies()

In [5]:
# Movie length
movie_lengths = [movie['place'].shape[0] for movie in movies]
print("Max movie length: {}".format(max(movie_lengths)))
print("Min movie length: {}".format(min(movie_lengths)))

Max movie length: 3096
Min movie length: 607


In [6]:
FEATURES_DIM = 2048 + 512 + 512 + 512
MAX_MOVIE_LENGTH = 3100
NUM_EPOCHS = 20

<a id='section3'></a>
## Data processing

Now that we got the data, we will build two custom functions, `split_train_test()` and `transform_movies()` to split the data set into training set and validation set as well as transform them from `torch.Tensor` to numpy arrays. The movies are padded in the transformation process with the `MAX_MOVIE_LENGTH` constant as noted earlier so they all have the same shape. We then call these functions to split and transform the data.

In [7]:
def split_train_test(data, train_size=52):
    """
    Split data into train and test sets
    
    Argument:
    --------
    data -- a list of dictionaries each containing a movie information
    train_size -- integer representing the number of movies used for training
    """
    # For stable output across runs
    np.random.seed(42)
    # Shuffle indices
    shuffled_indices = np.random.permutation(len(data))
    train_indices = shuffled_indices[:train_size]
    test_indices = shuffled_indices[train_size:]
    train_set = [data[i] for i in train_indices]
    test_set = [data[i] for i in test_indices]
    return train_set, test_set

In [8]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def transform_movies(movies, features=['place', 'cast', 'action', 'audio'], pad_len=MAX_MOVIE_LENGTH):
    """
    Unroll the given features by column and separate features from labels.
    Then pad the sequences in each movie to the length of the longest movie.
    
    Arguments:
    ----------
    movies -- a list of dictionaries each containing a movie information
    features -- list of string representing data features
    pad-len -- integer for the maximum length of a movie
    
    Return:
    -------
    X_padded -- a 2D numpy array
    Y_padded -- a 2D numpy array
    """
    X, Y = [], []
    # Unroll the features
    for movie in movies: 
        row = torch.cat([movie[feat] for feat in features], dim=1)
        X.append(row.numpy())
        # Pre-pad the label since its length is N-1
        labels = movie['scene_transition_boundary_ground_truth']
        labels = torch.cat([torch.tensor([False]), labels])
        Y.append(labels.numpy())
    # Pad the sequences
    X_padded = pad_sequences(X, maxlen=pad_len, padding='post', dtype='float32')
    Y_padded = pad_sequences(Y, value=False, maxlen=pad_len, padding='post')
    return X_padded, Y_padded

In [9]:
# Split movies into training and validation sets
movies_train, movies_val = split_train_test(movies)

In [10]:
# Transform training and validation sets
X_train, y_train = transform_movies(movies_train)
X_val, y_val = transform_movies(movies_val)

<a id='section4'></a>
## Train LSTM and WaveNet

Since the training instances are sequences of shots, the logical models would be that of a recurrent neural network (RNN) architecture. However, simpple RNNs do not work well with long sequences. At each time step, the RNN only take the current input and an activation value from the previous time step to make prediction for the current time step. This means a RNN cannot learn from sequences that have long-term dependencies. In particular, if the network is very deep, then the gradient from the output will have a hard time propagating back to affect the earlier layers. That is, as the data traverses the RNN it goes through a series of transformations and after a while, there is very little trace of the first inputs.

To combat this problem, we can use __Long Short-Term Memory (LSTM)__ cell which is better at detecting long-term dependencies in the data. It uses update and forget gates that allow the cell to keep more information from the earlier time step compared to RNN. LSTM training is also faster than RNN.

Another method to deal with long sequences is the __WaveNet__ architecture introduced in a [2016 paper](https://arxiv.org/abs/1609.03499) by researchers at DeepMind. WaveNet stacks a group of 1D convolutional layers while doubling the dilation rate at every layer. Dilation rate means how far apart each neuron's inputs are. The first layer sees two time steps at a time while the next see four time steps, and so on. In essence, the lower layers in the stack learn short-term patterns while the higher layers learn long-term patterns. This network is extremely fast -- even for long sequences.

Before building LSTM and WaveNet, we will build a customized `callback` class which we will pass as argument to the model's `fit()` method. This will allow the training to stop early when accuracy reaches 95%.

In [11]:
class myCallback(tf.keras.callbacks.Callback):
    """To stop training early once accuracy reach 95%"""
    def on_epoch_end(self, epoch, logs={}):
        if(logs.get('accuracy') > 0.95):
            print("\nReached 95% accuracy, so cancelling training!")
            self.model.stop_training = True

### LSTM model

In [14]:
def build_lstm(n_neurons=32, input_shape=[FEATURES_DIM]):
    model = tf.keras.Sequential([
            tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=n_neurons,
                                                               input_shape=input_shape,
                                                               return_sequences=True)),
            tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation='sigmoid'))])
    model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.RMSprop(lr=2e-5),
              metrics=['accuracy'])
    return model

In [15]:
%%time
# Fit LSTM classifier
lstm_clf = build_lstm()
lstm_clf.fit(X_train, y_train, 
             epochs=NUM_EPOCHS, 
             callbacks=[myCallback()], 
             validation_data=(X_val, y_val), verbose=2)

Epoch 1/20
2/2 - 175s - loss: 0.7212 - accuracy: 0.7379 - val_loss: 0.6505 - val_accuracy: 0.9190
Epoch 2/20
2/2 - 130s - loss: 0.6438 - accuracy: 0.9267 - val_loss: 0.6130 - val_accuracy: 0.9565
Epoch 3/20
2/2 - 125s - loss: 0.6057 - accuracy: 0.9644 - val_loss: 0.5903 - val_accuracy: 0.9596

Reached 95% accuracy, so cancelling training!
CPU times: user 7min 35s, sys: 4min 32s, total: 12min 7s
Wall time: 7min 11s


<tensorflow.python.keras.callbacks.History at 0x7f880bdcf790>

### WaveNet model

In [16]:
def build_wave_net(input_shape=[None, FEATURES_DIM], num_blocks=2, num_layers=3, 
                   filters1=20, filters2=10, kern1=2, kern2=1, padding='same'):
    rates = [2**i for i in range(num_layers)]
    wave_model = tf.keras.models.Sequential()
    wave_model.add(tf.keras.layers.InputLayer(input_shape=input_shape))
    for rate in rates * num_blocks:
        wave_model.add(tf.keras.layers.Conv1D(filters=filters1, 
                                              kernel_size=kern1, 
                                              padding='same',
                                              activation='relu', dilation_rate=rate))
    wave_model.add(tf.keras.layers.Conv1D(filters=filters2, kernel_size=kern2))
    wave_model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    wave_model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.RMSprop(lr=2e-5),
                  metrics=['accuracy'])
    return wave_model

In [18]:
%%time
# Fit WaveNet classifier
wave_clf = build_wave_net()
wave_clf.fit(X_train, y_train, 
             epochs=NUM_EPOCHS, 
             callbacks=[myCallback()], 
             validation_data=(X_val, y_val), verbose=2)

Epoch 1/20
2/2 - 49s - loss: 0.6778 - accuracy: 0.9291 - val_loss: 0.6688 - val_accuracy: 0.9503
Epoch 2/20
2/2 - 24s - loss: 0.6654 - accuracy: 0.9617 - val_loss: 0.6590 - val_accuracy: 0.9570

Reached 95% accuracy, so cancelling training!
CPU times: user 1min, sys: 45.5 s, total: 1min 46s
Wall time: 1min 14s


<tensorflow.python.keras.callbacks.History at 0x7f83fda33610>

## Hyperparameter tuning

As seen above, the WaveNet architecture is much faster than LSTM -- by order of magnitude. Thus, we will be focusing on tunining the hyperparameter for WaveNet and use it as the final model. We'll take advantage of the `KerasClassifier` wrapper that allows us to use Scikit-Learn's functions with the WaveNet. Additionally, we will tune the hyperparameters using `RandomizedSearchCV` which randomly sample a smaller subset of the hyperparameter space. This is faster than `GridSearchCV` where search is conducted on a larger hyperparameter space.

In [19]:
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV

keras_clf = KerasClassifier(build_wave_net)

params_distribs = {
    "num_blocks": [2, 3],
    "num_layers": np.arange(4, 11),
    "filters1": np.arange(10, 21),
    "filters2": np.arange(2, 11)
}

rnd_search_cv = RandomizedSearchCV(keras_clf, params_distribs, n_iter=3, cv=3)

In [23]:
%%time
rnd_search_cv.fit(X_train, y_train, epochs=10, 
                  validation_data=(X_val, y_val),
                  callbacks=[myCallback()])
print("Best score: {}".format(rnd_search.best_score_))
print("Parameters:")
for param, value in rnd_search_cv.best_params_.items():
    print("\t{}: {}".format(param, value))

Epoch 1/10
2/2 - 34s - loss: 0.6927 - accuracy: 0.8773
Epoch 2/10
2/2 - 13s - loss: 0.6920 - accuracy: 0.9160
Epoch 3/10
2/2 - 11s - loss: 0.6914 - accuracy: 0.9318
Epoch 4/10
2/2 - 11s - loss: 0.6910 - accuracy: 0.9399
Epoch 5/10
2/2 - 10s - loss: 0.6906 - accuracy: 0.9458
Epoch 6/10
2/2 - 9s - loss: 0.6902 - accuracy: 0.9508

Reached 95% accuracy, so cancelling training!
Epoch 1/10
2/2 - 46s - loss: 0.6894 - accuracy: 0.9688

Reached 95% accuracy, so cancelling training!
Epoch 1/10
2/2 - 39s - loss: 0.6929 - accuracy: 0.8305
Epoch 2/10
2/2 - 14s - loss: 0.6922 - accuracy: 0.9187
Epoch 3/10
2/2 - 13s - loss: 0.6916 - accuracy: 0.9550

Reached 95% accuracy, so cancelling training!
Epoch 1/10
2/2 - 44s - loss: 0.6931 - accuracy: 0.8054
Epoch 2/10
2/2 - 14s - loss: 0.6929 - accuracy: 0.9711

Reached 95% accuracy, so cancelling training!
Epoch 1/10
2/2 - 43s - loss: 0.6932 - accuracy: 0.3062
Epoch 2/10
2/2 - 21s - loss: 0.6928 - accuracy: 0.9691

Reached 95% accuracy, so cancelling traini

Epoch 1/10
2/2 - 53s - loss: 0.6931 - accuracy: 0.9554

Reached 95% accuracy, so cancelling training!


RuntimeError: Cannot clone object <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x7f82c4cb5b50>, as the constructor either does not set or modifies parameter num_layers

In [25]:
# Best model
rnd_search_cv.best_params_

{'num_layers': 10, 'num_blocks': 3, 'filters2': 2, 'filters1': 18}

In [26]:
rnd_search_cv.best_score_

0.9702571233113607

In [30]:
rnd_search_cv.results_

AttributeError: 'RandomizedSearchCV' object has no attribute 'results_'

## Model selection and performance

Now that we have a good sense of the best hyperparameters, let us build a model based on these parameters. First, we will transform the full dataset and fit the best model on this full dataset. Recall that we padded the dataset so that all instances have identical input shape even though the movies are of different length. We will now need to reverse that padding by truncating the predictions so that the predictions for each movie have the same length at the movie itself. Then to write the prediction out to file for evaluation, we will use the `write_predictions()` function.

Note that the required model evaluation metrics by Eluv.io are the __Mean Average Precision(mAP)__ and the __Mean Maximum IoU (mean Miou)__. The evaluation script can be found at the Eluv.io's team [Github page](https://github.com/eluv-io/elv-ml-challenge).

In [None]:
# First, transform the entire dataset 
# X, y = transform_movies(movies)

In [12]:
def unpad_predictions(movies, yhat_probs):
    """
    Truncate the padded predictions to movie's original length
    
    Arguments:
    ----------
    movies -- a list of dictionaries containing movies information
    yhat_probs -- a 2D numpy array representing prediction for the given movies data set
    
    Return:
    -------
    yhat_dict -- a dictionary with each movie imbd_id as key and 
                 prediction probabilities as a 1D numpy array
    """
    imdb_lengths = [(movie['imdb_id'], movie['place'].shape[0]) for movie in movies]
    yhat_dict = dict()
    for (imdb, length), yhat in zip(imdb_lengths, yhat_probs):
        yhat = yhat[1:length]
        yhat_dict[imdb] = yhat
    return yhat_dict

In [13]:
def write_predictions(yhat_unpadded_dict, path=PATH):
    """
    Pickle the predictions
    
    Arguments:
    ----------
    yhat_unpadded_dict -- a dictionary of prediction consistent with the length of the ground-truth label
    path -- a string representing the files path
    """
    for imdb in yhat_unpadded_dict.keys():
        # Load existing pkl movie file
        filename = os.path.join(PATH, imdb + ".pkl")
        try:
            x = pickle.load(open(filename, "rb"))
            x['scene_transition_boundary_prediction'] = yhat_unpadded_dict[imdb].flatten()
            pickle.dump(x, open(filename, "wb"))
        except:
            break