# Experiment 4.2: Testing the BMT with unclipped videos from LABC

The aim of this experiment is to find differences between the event proposals generated when the BMT processes the ActivityNet Captions dataset and those generated when it processes the Long ActivityNet-Based Categories (LABC) dataset. This will help us understand whether video duration affects the BMT's capacity to detect events. The results will be used in later experiments to determine whether our solution improves the BMT's performance on longer videos. **Experiment 4.1 must be executed before this experiment.**

## 1. Environment

In the following cell, assign `WD` the path to the BMT-Clipping repository. 
Additionally, assign `ENVS_PATH` the path to your environments directory (e.g. where conda stores all of its environments).

In [1]:
# Working directory (it must be the repository's root directory)
WD = '/home/A01630791/BMT-Clipping'
%cd $WD

# Environments directory (e.g. anaconda3/envs)
ENVS_PATH = '/home/A01630791/anaconda3/envs'

/home/A01630791/BMT-Clipping


**Optional**  
Uncomment and run the following cell if you haven't configured the Python environment in experiment 4.1.

Run directly in terminal if the notebook throws errors.

In [2]:
"""
# Environment
## feature extraction (run directly in terminal if the notebook can't execute)
!conda env create -f ./submodules/video_features/conda_env_i3d.yml
!conda env create -f ./submodules/video_features/conda_env_vggish.yml
## captioning model (run directly in terminal if the notebook can't execute)
!conda env create -f ./conda_env.yml
## spacy language model (use the path to your 'bmt' environment instead)
!$ENVS_PATH/bmt/bin/python -m spacy download en


# (Optional) Install additional libraries in environment (run directly in terminal if the notebook can't execute)
!conda install pytube
!conda install numpy
!conda install matplotlib
"""

**BMT imports**

In [3]:
import os
from pathlib import Path
import sys
sys.path.append(WD)
from sample.single_video_prediction import get_video_duration

**Other imports**

In [4]:
import json, re, collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pytube import YouTube

## 2. Importing the dataset

A copy of LABC is provided in our repository at `data/long_activitynet_categories.csv`, so there is no need to download it separately. The file can also be viewed at https://github.com/oscarmires/BMT-Clipping/blob/master/data/long_activitynet_categories.csv

In [5]:
# Replace the value of the following variable with the path to the dataset in your filesystem.
LABC_PATH = f'{WD}/data/long_activitynet_categories.csv'

# Open the file as a Pandas DF
labc_df = pd.read_csv(LABC_PATH, index_col=0)
labc_df.head()

| id | url | duration | category |
|---|---|---|---|
| 0dyQouKDR2M | https://www.youtube.com/watch?v=0dyQouKDR2M | 1169 | Archery |
| LjGCKmNgMho | https://www.youtube.com/watch?v=LjGCKmNgMho | 1822 | Ballet |
| cQlk39LS42M | https://www.youtube.com/watch?v=cQlk39LS42M | 2791 | Basketball |
| dOHny8s8iBI | https://www.youtube.com/watch?v=dOHny8s8iBI | 1323 | Bathing dog |
| k4jLxVILsFE | https://www.youtube.com/watch?v=k4jLxVILsFE | 3391 | Clean and jerk |


## 3. Preparing and pre-analyzing the data


## 4. Execution
In this part, we execute the BMT model to generate captions for each of the samples in the `labc_df` dataframe. The results are then used to compute the following metrics: 
- Captions Per Minute (CPM): the number of captions the BMT generates after running a single video, divided by the duration of the video in minutes.
- Average number of captions
- Average CPM
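
The metrics above can be illustrated with a short sketch. It assumes that the average CPM is the mean of the per-video CPM values; the function name `captions_per_minute` and the sample numbers are hypothetical and are not results from this experiment.

```python
def captions_per_minute(num_captions, duration_secs):
    """Return the number of captions divided by the video duration in minutes."""
    if duration_secs <= 0:
        raise ValueError('duration_secs must be positive')
    return num_captions / (duration_secs / 60.0)


# Hypothetical (caption count, duration in seconds) pairs, for illustration only
samples = [(10, 300.0), (6, 240.0), (8, 480.0)]

# CPM of each video
cpms = [captions_per_minute(n, d) for n, d in samples]

# Average number of captions per video
avg_captions = sum(n for n, _ in samples) / len(samples)

# Average CPM, taken here as the mean of the per-video CPM values (an assumption)
avg_cpm = sum(cpms) / len(cpms)

print(cpms, avg_captions, avg_cpm)
```

Averaging the per-video values weights every video equally regardless of its length; dividing the total number of captions by the total duration would weight longer videos more heavily.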

### 4.1 Preparation

Assign the path where you would like to create the output file. We name this file `4_2_bmt_captions.json`.

Additionally, create a directory in your filesystem to temporarily store the YouTube videos. Assign `LABC_VIDEOS_PATH` the path to this directory.
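
Both directories need to exist before the output file can be written. As a minimal sketch, they could be created from Python; the helper `prepare_output` is hypothetical, and the temporary base directory below only stands in for your own paths.

```python
import tempfile
from pathlib import Path


def prepare_output(output_dir, videos_dir, captions_name='4_2_bmt_captions.json'):
    """Create both directories if needed and start the captions file as an empty JSON list."""
    output_dir, videos_dir = Path(output_dir), Path(videos_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    videos_dir.mkdir(parents=True, exist_ok=True)
    captions_file = output_dir / captions_name
    captions_file.write_text('[]')
    return captions_file


# Hypothetical locations under a temporary directory, for illustration only
base = Path(tempfile.mkdtemp())
captions_file = prepare_output(base / 'Output_4_2', base / 'LABC_videos')
print(captions_file.read_text())
```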

In [6]:
# Create output file
OUTPUT_PATH = '/home/A01630791/bmt_clipping_experiments/Output_4_2'

with open(f'{OUTPUT_PATH}/4_2_bmt_captions.json', "w") as f:
  f.write("[]")

# Video directory (path to the directory where the videos are temporarily stored)
LABC_VIDEOS_PATH = '/home/A01630791/bmt_clipping_experiments/LABC_videos'

### 4.2 BMT Installation (optional)

Uncomment and run the following cell if you haven't installed the BMT model in experiment 4.1.

Run directly in terminal if the notebook throws errors.

In [7]:
"""
# Download BMT Checkpoints

!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/glove.840B.300d.zip
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_cap_model.pt
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_prop_model.pt
!wget https://storage.googleapis.com/audioset/vggish_model.ckpt

!mkdir .vector_cache
!mv glove.840B.300d.zip ./.vector_cache/
!mv best_cap_model.pt ./sample/
!mv best_prop_model.pt ./sample/
!mv vggish_model.ckpt ./submodules/video_features/models/vggish/checkpoints/
"""

### 4.3 BMT Processing
The following cell generates I3D features, VGGish features, event proposals and event captions for each of the videos in `labc_df`. The execution might last for several hours.

In [None]:
# captioning parameters
MAX_PROP_PER_VIDEO = 10
NMS_TIOU_THRESHOLD = 0.4

# checkpoint paths
PROPOSAL_CKPT = f'{WD}/sample/best_prop_model.pt'
CAPTIONING_CKPT = f'{WD}/sample/best_cap_model.pt'

# execution metadata
exec_md = {
    'not_found_videos': [],
    'current_sample_num': 1,
    'total_samples': labc_df.shape[0],
    'too_many_features': []
}

for video_id, sample in labc_df.iterrows():

    # Step 1: Download the video
    try: 
        
        yt = YouTube(sample['url']) 
        
        filename = f'{video_id}.mp4'
        stream = yt.streams.get_highest_resolution()
        stream.download(output_path=LABC_VIDEOS_PATH, filename=filename)
        print('Sample download complete.')
        
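    except Exception as e:
        # Skip this sample: there is no local file to process
        print(f"Error: could not download video {video_id}: {e}")
        exec_md['not_found_videos'].append(video_id)
        exec_md['current_sample_num'] += 1
        continue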
        
    # Step 2: Extract visual and audio features

    ## Prepare complementary paths
    MY_VIDEO_PATH = f'{LABC_VIDEOS_PATH}/{video_id}.mp4'

    VIDEO_DURATION = get_video_duration(MY_VIDEO_PATH)

    FEATURES_CACHE_PATH = f'{WD}/tmp/'
    FEATURES_PATH_STUB = os.path.join(FEATURES_CACHE_PATH, Path(MY_VIDEO_PATH).stem)
    FEATURE_PATH_VGGISH = f'{FEATURES_PATH_STUB}_vggish.npy'
    FEATURE_PATH_RGB = f'{FEATURES_PATH_STUB}_rgb.npy'
    FEATURE_PATH_FLOW = f'{FEATURES_PATH_STUB}_flow.npy'

    """
    ## Extract I3D features
    print('Extracting I3D features.')
    !cd ./submodules/video_features && $ENVS_PATH/i3d/bin/python main.py \
      --feature_type i3d \
      --on_extraction save_numpy \
      --device_ids 0 \
      --extraction_fps 25 \
      --video_paths $MY_VIDEO_PATH \
      --output_path $FEATURES_CACHE_PATH

    print('Extracting VGGish features.')
    ## Extract VGGish features (audio)
    !cd ./submodules/video_features && $ENVS_PATH/vggish/bin/python main.py \
      --feature_type vggish \
      --on_extraction save_numpy \
      --device_ids 0 \
      --video_paths $MY_VIDEO_PATH \
      --output_path $FEATURES_CACHE_PATH

    """

    # Step 3: Captioning
    print('Generating caption.')
    
    try:
        ## Run single video prediction
        !$ENVS_PATH/bmt/bin/python ./sample/single_video_prediction.py \
          --prop_generator_model_path $PROPOSAL_CKPT \
          --pretrained_cap_model_path $CAPTIONING_CKPT \
          --vggish_features_path $FEATURE_PATH_VGGISH \
          --rgb_features_path $FEATURE_PATH_RGB \
          --flow_features_path $FEATURE_PATH_FLOW \
          --duration_in_secs $VIDEO_DURATION \
          --device_id 0 \
          --max_prop_per_vid $MAX_PROP_PER_VIDEO \
          --nms_tiou_thresh $NMS_TIOU_THRESHOLD \
          --video_id $video_id \
          --output_path $OUTPUT_PATH/4_2_bmt_captions.json
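    except AssertionError:
        # Note: the assertion is raised inside the process started by `!`,
        # so this handler is not expected to catch it; the traceback is only printed.
        print('Exceeded the maximum number of features')
        exec_md['too_many_features'].append(video_id)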

    # Step 4: Delete video to free space
    os.remove(MY_VIDEO_PATH)
    print('Sample deleted')

        
    curr_sample = exec_md['current_sample_num']
    exec_md['current_sample_num'] += 1
    total_samples = exec_md['total_samples']
    print(f'\n ***** Processed {curr_sample}/{total_samples} samples. *****\n')
    

print('Execution completed.') 
print(exec_md)
print('YouTube not found videos:', len(exec_md['not_found_videos']))

Sample download complete.
Video Duration: 1168.102969
Generating caption.
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path: 
 /home/A01630791/BMT-Clipping/sample/best_cap_model.pt
***** S *****:  1216
Traceback (most recent call last):
  File "./sample/single_video_prediction.py", line 308, in <module>
    prop_model, feature_paths, train_dataset.pad_idx, prop_cfg, args.device_id, args.duration_in_secs
  File "./sample/single_video_prediction.py", line 170, in generate_proposals
    pad_feats_up_to=cfg.pad_feats_up_to
  File "./sample/single_video_prediction.py", line 63, in load_features_from_npy
    stack_vggish = pad_segment(stack_vggish, pad_feats_up_to['audio'], pad_idx)
  File "/home/A01630791/BMT-Clipping/sample/../datasets/load_features.py", line 40, in pad_segment
    assert S <= max_feature_len
AssertionError
Sample deleted

 ***** Processed 1/50 samples. *****

**AssertionError**
As the output of the last cell shows, a very long video (here about 19.5 minutes, well over 10 minutes) causes an `AssertionError` to be raised. The assertion `S <= max_feature_len` in `pad_segment` fails because the number of features extracted from the video exceeds a limit established in the training configuration; the log reports S = 1216 just before the failure, which occurs while padding the VGGish (audio) features. In other words, the model can't process videos longer than the maximum duration determined during training. We therefore continue the experiment with videos whose duration ranges from 4 to 10 minutes, still drawn from ActivityNet categories. These videos are not as long as we wanted, but they are still longer than the videos in ActivityNet Captions.
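
The traceback above comes from the `single_video_prediction.py` process rather than from the notebook cell itself. One way to record such failures from the notebook is to check the exit code of that process. The following is a minimal sketch, assuming `subprocess.run` is used in place of the `!` shell syntax; `run_step` and the two demo commands are hypothetical stand-ins for the real prediction command.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and return (succeeded, captured stderr text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr


failed_videos = []

# Hypothetical jobs: one succeeds, one fails with an AssertionError in the child process
demo_jobs = {
    'ok_video': [sys.executable, '-c', 'print("done")'],
    'long_video': [sys.executable, '-c', 'assert False'],
}

for video_id, cmd in demo_jobs.items():
    ok, stderr = run_step(cmd)
    # A non-zero exit code plus the exception name in stderr marks the failed sample
    if not ok and 'AssertionError' in stderr:
        failed_videos.append(video_id)

print(failed_videos)
```

Checking the exit code keeps the loop running while still giving a reliable list of the samples that the model could not process.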

### 4.4 Medium-sized ActivityNet Categories
We gathered a dataset of 10 videos under ActivityNet Categories and fed them to the BMT model. We provide the results here.
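
A duration range of 4 to 10 minutes can be checked with a simple filter. This is a minimal sketch, not the procedure used to gather the 10 videos; `select_by_duration` and the sample rows are hypothetical, and durations are assumed to be in seconds.

```python
def select_by_duration(rows, min_secs=240, max_secs=600):
    """Keep the rows whose 'duration' (in seconds) lies within [min_secs, max_secs]."""
    return [row for row in rows if min_secs <= row['duration'] <= max_secs]


# Hypothetical rows, for illustration only
demo_rows = [
    {'id': 'vid_a', 'duration': 1169, 'category': 'Archery'},
    {'id': 'vid_b', 'duration': 300, 'category': 'Ballet'},
    {'id': 'vid_c', 'duration': 580, 'category': 'Basketball'},
    {'id': 'vid_d', 'duration': 120, 'category': 'Bathing dog'},
]

medium_rows = select_by_duration(demo_rows)
print([row['id'] for row in medium_rows])
```

With a pandas dataframe that has a `duration` column, an equivalent selection would likely be a boolean mask on that column.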