# Training Collaborative Experts on MSR-VTT
This notebook shows how to download and run code that trains a modified Collaborative Experts model (BERT + CLS + NetVLAD) on the MSR-VTT dataset.


## Setup

*   Download Code and Dependencies
*   Import Modules
*   Download Language Model Weights
*   Download Datasets
*   Generate Encodings for Dataset Captions 



### Code Downloading and Dependency Downloading
*   Specify the TensorFlow version
*   Clone the repository from GitHub
*   `cd` into the correct directory
*   Install the requirements




In [1]:
%tensorflow_version 2.x

In [2]:
!git clone https://github.com/googleinterns/via-content-understanding.git

Cloning into 'via-content-understanding'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 7211 (delta 8), reused 14 (delta 5), pack-reused 7189
Receiving objects: 100% (7211/7211), 850.07 KiB | 2.25 MiB/s, done.
Resolving deltas: 100% (4750/4750), done.


In [3]:
%cd via-content-understanding/videoretrieval/

/content/via-content-understanding/videoretrieval


In [4]:
!pip install -r requirements.txt

Collecting pytube3==9.6.4
  Downloading https://files.pythonhosted.org/packages/de/86/198092763646eac7abd2063192ab44ea44ad8fd6d6f3ad8586b38afcd52a/pytube3-9.6.4-py3-none-any.whl
Collecting requests==2.22.0
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
Collecting transformers==3.0.2
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting idna<2.9,>=2.5
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
Collecting tokenizers==0.8.1.rc1
  Downloading

In [5]:
!pip install --upgrade tensorflow_addons

Collecting tensorflow_addons
  Downloading https://files.pythonhosted.org/packages/29/51/8e5bb7649ac136292aefef6ea0172d10cc23a26dcda093c62637585bc05e/tensorflow_addons-0.11.1-cp36-cp36m-manylinux2010_x86_64.whl (1.1MB)

### Importing Modules

In [6]:
import tensorflow as tf
import languagemodels
import train.encoder_datasets
import train.language_model
import experts
import datasets
import datasets.msrvtt.constants
import os
import models.components
import models.encoder
import helper.precomputed_features
from tensorflow_addons.activations import mish  
import tensorflow_addons as tfa
import metrics.loss

### Language Model Downloading

*   Download BERT



In [7]:
bert_model = languagemodels.BERTModel()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.


### Dataset Downloading


*   Download Datasets
*   Download Precomputed Features



In [8]:
datasets.msrvtt_dataset.download_dataset()

Note: The system `curl` is more memory-efficient than the download function in our codebase, so `curl` is used here instead.
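If shelling out to `curl` is not an option, a streamed download is one way to keep memory use bounded from Python. The sketch below uses only the standard library; `stream_download` and its `chunk_size` parameter are hypothetical names for illustration and are not part of this repository's codebase.

```python
import shutil
import urllib.request


def stream_download(url, path, chunk_size=1024 * 1024):
    """Copy the response body to `path` one chunk at a time.

    Hypothetical helper: only `chunk_size` bytes are held in memory at once,
    instead of reading the whole archive before writing it.
    """
    with urllib.request.urlopen(url) as response, open(path, "wb") as output_file:
        shutil.copyfileobj(response, output_file, chunk_size)
```

It would be called with the same `url` and `path` values that the next cell passes to `curl`.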

In [9]:
url = datasets.msrvtt.constants.features_tar_url
path = datasets.msrvtt.constants.features_tar_path
os.system(f"curl {url} > {path}") 

0

In [10]:
helper.precomputed_features.cache_features(
    datasets.msrvtt_dataset,
    datasets.msrvtt.constants.expert_to_features,
    datasets.msrvtt.constants.features_tar_path,)

speech: (32, 300) | (29, 300)
ocr: (49, 300) | (5, 300)
densenet: (2208,) | (1, 2208)
audio: (29, 128) | (1, 128)


### Encoding Generation

* Generate Encodings for MSR-VTT

In [11]:
train.language_model.generate_and_cache_encodings(
    bert_model, datasets.msrvtt_dataset)

## Training


*   Build Train Datasets
*   Initialize Models
*   Compile Encoders
*   Fit Model
*   Test Model


### Datasets Generation

In [12]:
experts_used = [
  experts.i3d,
  experts.r2p1d,
  experts.resnext,
  experts.senet,
  experts.speech_expert,
  experts.ocr_expert,
  experts.audio_expert,
  experts.densenet,
  experts.face_expert]

In [13]:
train_ds, valid_ds, test_ds = (
    train.encoder_datasets.generate_language_model_fine_tuning_datasets(
        bert_model, datasets.msrvtt_dataset, experts_used))

### Model Initialization

In [14]:
class MishLayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return mish(inputs)

In [None]:
mish(tf.Variable([1.0]))
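For reference, Mish is defined as `x * tanh(softplus(x))`, where `softplus(x) = ln(1 + e^x)`. The scalar sketch below reproduces that formula with the standard library so the value returned by the sanity check above can be compared by hand; `mish_scalar` is a hypothetical name, not a TensorFlow Addons function.

```python
import math


def mish_scalar(x):
    """Mish activation for a single float: x * tanh(softplus(x)).

    Illustrative only; math.exp overflows for very large x, which the
    TensorFlow Addons implementation does not rely on.
    """
    softplus = math.log1p(math.exp(x))  # ln(1 + e^x)
    return x * math.tanh(softplus)


print(round(mish_scalar(1.0), 4))
```

For an input of `1.0` the result is approximately `0.8651`.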

In [16]:
text_encoder = models.components.TextEncoder(
    len(experts_used),
    num_netvlad_clusters=28,
    ghost_clusters=1,
    language_model_dimensionality=768,
    encoded_expert_dimensionality=512,
    residual_cls_token=True,
)

In [17]:
video_encoder = models.components.VideoEncoder(
    num_experts=len(experts_used),
    experts_use_netvlad=[False, False, False, False, True, True, True, False, False],
    experts_netvlad_shape=[None, None, None, None, 19, 43, 8, None, None],
    expert_aggregated_size=512,
    encoded_expert_dimensionality=512,
    g_mlp_layers=3,
    h_mlp_layers=0,
    make_activation_layer=MishLayer)

In [18]:
encoder = models.encoder.EncoderForLanguageModelTuning(
    video_encoder,
    text_encoder,
    0.05,
    [1, 5, 10, 50],
    20,
    bert_model.model,
    64)

### Encoder Compilation

In [19]:
def build_optimizer(lr=0.001):
    learning_rate_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=lr,
        decay_steps=1000,
        decay_rate=0.95,
        staircase=True)

    return tf.keras.optimizers.Adam(learning_rate_scheduler)
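With `staircase=True`, `ExponentialDecay` multiplies the learning rate by `decay_rate` once every `decay_steps` optimizer steps, rather than decaying it continuously. The plain-Python sketch below mirrors that documented formula so the schedule can be inspected without TensorFlow; `staircase_exponential_decay` is a hypothetical helper, and the default of `5e-5` is simply the value this notebook passes to `build_optimizer`.

```python
def staircase_exponential_decay(
        step, initial_learning_rate=5e-5, decay_steps=1000, decay_rate=0.95):
    """Learning rate after `step` optimizer steps with staircase decay."""
    # Integer division keeps the rate constant within each block of
    # `decay_steps` steps.
    return initial_learning_rate * decay_rate ** (step // decay_steps)


for step in [0, 999, 1000, 10000]:
    print(step, staircase_exponential_decay(step))
```

The rate stays at `5e-5` for the first 1000 steps, then drops by 5% at each subsequent multiple of 1000.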

In [20]:
encoder.compile(build_optimizer(5e-5), metrics.loss.bidirectional_max_margin_ranking_loss)
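To give some intuition for the loss used here, the sketch below shows one common formulation of a bidirectional max-margin ranking loss over a square similarity matrix. It is an assumption about the general technique, not a copy of `metrics.loss.bidirectional_max_margin_ranking_loss`; the function name, the `margin` parameter, and the averaging over off-diagonal pairs are all illustrative choices.

```python
def toy_bidirectional_max_margin_loss(similarity, margin):
    """Hinge loss over all mismatched pairs, in both retrieval directions.

    similarity[i][j] is the score of video i against caption j; the diagonal
    holds the matching pairs. Assumed formulation, for illustration only.
    """
    n = len(similarity)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        positive = similarity[i][i]
        for j in range(n):
            if i == j:
                continue
            # Video i scored against a non-matching caption j.
            total += max(0.0, margin + similarity[i][j] - positive)
            # Caption i scored against a non-matching video j.
            total += max(0.0, margin + similarity[j][i] - positive)
    return total / (n * (n - 1))


well_separated = [[1.0, 0.0], [0.0, 1.0]]
print(toy_bidirectional_max_margin_loss(well_separated, margin=0.05))
```

The loss is zero once every matching pair outscores every mismatched pair by at least the margin, which is the case for `well_separated`.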

In [21]:
train_ds_prepared = (train_ds
  .shuffle(7000)
  .batch(32, drop_remainder=True)
  .prefetch(tf.data.experimental.AUTOTUNE))
valid_ds_prepared = (valid_ds
  .prefetch(tf.data.experimental.AUTOTUNE)
  .batch(497 * 20, drop_remainder=True)
  .cache())

In [22]:
encoder.language_model.trainable = True
encoder.video_encoder.trainable = True
encoder.text_encoder.trainable = True

### Model Fitting

In [None]:
encoder.fit(
    train_ds_prepared,
    #validation_data=valid_ds_prepared,
    epochs=250,
)

### Tests

In [51]:
captions_per_video = 20
num_videos_upper_bound = 100000 

In [None]:
ranks = []

for caption_index in range(captions_per_video):
    batch = next(iter(test_ds.shard(captions_per_video, caption_index).batch(
        num_videos_upper_bound)))
    video_embeddings, text_embeddings, mixture_weights = encoder.forward_pass(
        batch, training=False)
    
    similarity_matrix = metrics.loss.build_similarity_matrix(
        video_embeddings,
        text_embeddings,
        mixture_weights,
        batch[-1])
    rankings = metrics.rankings.compute_ranks(similarity_matrix)
    ranks += list(rankings.numpy())
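The loop above collects one rank per caption. As a rough illustration of what such a rank means, the sketch below computes 1-indexed ranks from a small similarity matrix in plain Python. It assumes that row `i` holds the scores of caption `i` against every video and that video `i` is the correct match; `toy_compute_ranks` is a hypothetical stand-in and may differ from how `metrics.rankings.compute_ranks` is implemented.

```python
def toy_compute_ranks(similarity):
    """Return the 1-indexed rank of the correct video for each caption.

    Assumes similarity[i][j] scores caption i against video j and that the
    correct video for caption i is video i (illustrative convention).
    """
    ranks = []
    for i, row in enumerate(similarity):
        correct_score = row[i]
        # Rank 1 means no other video scored higher than the correct one.
        ranks.append(1 + sum(score > correct_score for score in row))
    return ranks


toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.1, 0.2, 0.7],
]
print(toy_compute_ranks(toy_similarity))
```

In this toy matrix the second caption ranks its correct video last, because both other videos score higher.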

In [None]:
def recall_at_k(ranks, k):
    # Fraction of captions whose correct video is ranked within the top k.
    return sum(rank <= k for rank in ranks) / len(ranks)

In [None]:
median_rank = sorted(ranks)[len(ranks)//2]

In [None]:
mean_rank = sum(ranks)/len(ranks)

In [None]:
print(f"Median Rank: {median_rank}")

In [None]:
print(f"Mean Rank: {mean_rank}")

In [None]:
for k in [1, 5, 10, 50]:
    recall = recall_at_k(ranks, k)
    print(f"R@{k}: {recall}")
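To make the metrics concrete, here is a small worked example on made-up ranks. The values in `toy_ranks` are invented for illustration and are not results from this model; the computations follow the cells above, including taking the upper-middle element as the median.

```python
def recall_at_k(ranks, k):
    # Fraction of captions whose correct video is ranked within the top k.
    return sum(rank <= k for rank in ranks) / len(ranks)


toy_ranks = [1, 2, 6, 11, 60]  # invented ranks, not model output

toy_median_rank = sorted(toy_ranks)[len(toy_ranks) // 2]
toy_mean_rank = sum(toy_ranks) / len(toy_ranks)
toy_recalls = {k: recall_at_k(toy_ranks, k) for k in [1, 5, 10, 50]}

print(f"Median Rank: {toy_median_rank}")
print(f"Mean Rank: {toy_mean_rank}")
for k, recall in toy_recalls.items():
    print(f"R@{k}: {recall}")
```

A single badly ranked caption (rank 60) pulls the mean rank well above the median, which is why both are usually reported alongside recall.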