# Experiment 4.3: Comparing clipping techniques

The aim of this experiment is to find the best clipping technique for captioning with the BMT. We analyze two techniques: fixed-time clipping and scene-based clipping. One technique is considered better than the other if its captions are semantically closer to the ground truths, as measured by the METEOR score. In addition, other statistics, such as the average number of captions and captions per minute (CPM), are used to analyze the granularity of event detection.

**The notebook from experiment 4.1 must be executed before this notebook**

## 1. Environment

In the following cell, assign the path to the BMT-Clipping repository to `WD`.
Additionally, assign the path to your environments directory (e.g. where conda stores its environments) to the `ENVS_PATH` variable.

In [None]:
# Working directory (it must be the repository's root directory)
WD = '/home/A01630791/BMT-Clipping'
%cd $WD

# Environments directory (e.g. anaconda3/envs)
ENVS_PATH = '/home/A01630791/anaconda3/envs'

**Optional**  
Uncomment and run the following cell if you haven't configured the Python environment in experiment 4.1

Run directly in terminal if the notebook throws errors.

In [None]:
"""
# Environment
## feature extraction (run directly in terminal if the notebook can't execute)
!conda env create -f ./submodules/video_features/conda_env_i3d.yml
!conda env create -f ./submodules/video_features/conda_env_vggish.yml
## captioning model (run directly in terminal if the notebook can't execute)
!conda env create -f ./conda_env.yml
## spacy language model (use the path to your 'bmt' environment instead)
!$ENVS_PATH/bmt/bin/python -m spacy download en


# (Optional) Install additional libraries in environment (run directly in terminal if the notebook can't execute)
!conda install pytube
!conda install numpy
!conda install matplotlib
"""

**BMT imports**

In [None]:
import os
from pathlib import Path
import sys
sys.path.append(WD)
from sample.single_video_prediction import get_video_duration

**Module imports**

In [None]:
from clipping_modules.filtering import FilteringModule
from clipping_modules.clipping import ClippingModule

**Other imports**

In [None]:
import json, re, collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pytube import YouTube

## 2. Importing the dataset

A copy of LABC is provided in our repository at `data/long_activitynet_categories.csv`, so there is no need to download it separately. The file can also be viewed at https://github.com/oscarmires/BMT-Clipping/blob/master/data/long_activitynet_categories.csv

In [None]:
# Replace the value of the following variable with the path to the dataset in your fs.
LABC_PATH = f'{WD}/data/long_activitynet_categories.csv'
LABC_PATH

# Open the file as a Pandas DF
labc_df = pd.read_csv(LABC_PATH, index_col=0)
labc_df.head()

In [None]:
# Total duration of the first 20 videos, converted from seconds to hours
labc_df.iloc[:20].duration.sum() / 60 / 60

In [None]:
# Optional: sort by duration
labc_df.sort_values(by='duration', inplace=True)

### 2.1 YouTube Downloads
The following cell downloads from YouTube all the videos listed in LABC. Create a directory to store the videos. A path to this directory must be assigned to the variable `LABC_VIDEOS_PATH`.

In [None]:
# Assign your path to the next variable
LABC_VIDEOS_PATH = '/home/A01630791/bmt_clipping_experiments/LABC_videos'

video_count = 0
labc_rows = labc_df.shape[0]

dir_list = os.listdir(LABC_VIDEOS_PATH)

In [None]:
# IDs of the videos that could not be downloaded
not_found_videos = []

for video_id, sample in labc_df.iterrows():
    video_count += 1
    filename = f'{video_id}.mp4'

    if filename in dir_list:
        print(f'"{filename}" already in directory.')
        continue

    try:
        print(f'Downloading {video_count}/{labc_rows}...')

        yt = YouTube(sample['url'])
        stream = yt.streams.get_highest_resolution()
        stream.download(output_path=LABC_VIDEOS_PATH, filename=filename)
    except Exception as e:
        print(f"Error: can't download {video_id}: {e}")
        not_found_videos.append(video_id)
        # Remove a partially downloaded file, if there is one
        partial_file = os.path.join(LABC_VIDEOS_PATH, filename)
        if os.path.exists(partial_file):
            os.remove(partial_file)
    # Optional break to limit number of videos:
    if video_count >= 25:
        break

print(f'Processed {video_count}/{labc_rows} videos.')

In [None]:
"""
LABC_VIDEOS_PATH = '/home/A01630791/bmt_clipping_experiments/Output_4_3/ft_clips'


df = pd.DataFrame(os.listdir(LABC_VIDEOS_PATH))
df['clip_id'] = [(int) (re.findall("(?<=&).*?(?=\.)", row[0])[0]) for index, row in df.iterrows()]

df.sort_values(by='clip_id')
"""

## 3. Execution
In this part, we execute the BMT model to generate captions for each of the samples in the `labc_df` dataframe. The results are then used to obtain the following metrics:
- Captions Per Minute (CPM): the number of captions the BMT generates after running a single video, divided by the duration of the video in minutes.
- Average number of captions
- Average CPM
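
As a minimal sketch of how these metrics are computed, the following example uses made-up (number of captions, duration) pairs. The helper name `captions_per_minute` and the sample values are illustrative assumptions, not part of the repository.

```python
def captions_per_minute(num_captions, duration_secs):
    """Return the captions per minute (CPM) for one video or clip."""
    if duration_secs <= 0:
        raise ValueError("duration_secs must be positive")
    return num_captions / (duration_secs / 60)


# Hypothetical (number of captions, duration in seconds) pairs.
samples = [(6, 120.0), (3, 45.0), (8, 240.0)]

cpms = [captions_per_minute(n, d) for n, d in samples]
avg_captions = sum(n for n, _ in samples) / len(samples)
avg_cpm = sum(cpms) / len(cpms)

print(cpms)          # [3.0, 4.0, 2.0]
print(avg_captions)  # average number of captions per sample
print(avg_cpm)       # 3.0
```

Note that the average CPM is the mean of the per-sample CPM values, which is generally different from dividing the total number of captions by the total duration.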

### 3.1 Preparation

Assign the path to a directory where you would like to create the output file. We name this file `4_3_bmt_captions.json`.

In [None]:
# Create output file
OUTPUT_PATH = '/home/A01630791/bmt_clipping_experiments/Output_4_3'

with open(f'{OUTPUT_PATH}/4_3_bmt_captions.json', "w") as f:
  f.write("[]")

The next file will be used for the output of the scene detection technique. We name the file `4_3_bmt_captions_ps.json`.

In [None]:
# Create output file
OUTPUT_PATH = '/home/A01630791/bmt_clipping_experiments/Output_4_3'

with open(f'{OUTPUT_PATH}/4_3_bmt_captions_ps.json', "w") as f:
  f.write("[]")
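
Both output files start as an empty JSON list so that results can be added one record at a time while the file stays valid JSON. The sketch below shows one way such an append could work. It is an assumption about the general pattern, not the actual code of `single_video_prediction.py`, and the helper `append_record` and the record fields are hypothetical.

```python
import json
import os
import tempfile


def append_record(json_path, record):
    """Append one record to a JSON file that holds a list."""
    with open(json_path) as f:
        records = json.load(f)
    records.append(record)
    with open(json_path, "w") as f:
        json.dump(records, f)


# Demonstration in a temporary directory, so no real output is touched.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_captions.json")
with open(demo_path, "w") as f:
    f.write("[]")

append_record(demo_path, {"video_id": "abc", "clip_id": 0, "captions": []})

with open(demo_path) as f:
    loaded = json.load(f)
print(len(loaded))  # 1
```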

### 3.2 BMT Installation (optional)

Uncomment and run the following cell if you haven't installed the BMT model in experiment 4.1

Run directly in terminal if the notebook throws errors.

In [None]:
"""
# Download BMT Checkpoints

!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/glove.840B.300d.zip
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_cap_model.pt
!wget https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/bmt/best_prop_model.pt
!wget https://storage.googleapis.com/audioset/vggish_model.ckpt

!mkdir .vector_cache
!mv glove.840B.300d.zip ./.vector_cache/
!mv best_cap_model.pt ./sample/
!mv best_prop_model.pt ./sample/
!mv vggish_model.ckpt ./submodules/video_features/models/vggish/checkpoints/
"""

### 3.3 BMT Function
The following cell defines a function that generates I3D features, VGGish features, event proposals and event captions for a single clip. It wraps all these steps in one function, which is called later for each clip.

In [None]:
def run_bmt(input_path, output_path, video_id, clip_id):
    # captioning parameters
    MAX_PROP_PER_VIDEO = 10
    NMS_TIOU_THRESHOLD = 0.4

    # checkpoint paths
    PROPOSAL_CKPT = f'{WD}/sample/best_prop_model.pt'
    CAPTIONING_CKPT = f'{WD}/sample/best_cap_model.pt'

    # Extract visual and audio features

    ## Prepare complementary paths
    MY_VIDEO_PATH = input_path

    VIDEO_DURATION = get_video_duration(MY_VIDEO_PATH)

    FEATURES_CACHE_PATH = f'{WD}/tmp/'
    FEATURES_PATH_STUB = os.path.join(FEATURES_CACHE_PATH, Path(MY_VIDEO_PATH).stem)
    FEATURE_PATH_VGGISH = f'{FEATURES_PATH_STUB}_vggish.npy'
    FEATURE_PATH_RGB = f'{FEATURES_PATH_STUB}_rgb.npy'
    FEATURE_PATH_FLOW = f'{FEATURES_PATH_STUB}_flow.npy'

    ## Extract I3D features
    print('Extracting I3D features.')
    !cd ./submodules/video_features && $ENVS_PATH/i3d/bin/python main.py \
      --feature_type i3d \
      --on_extraction save_numpy \
      --device_ids 0 \
      --extraction_fps 25 \
      --video_paths $MY_VIDEO_PATH \
      --output_path $FEATURES_CACHE_PATH

    print('Extracting VGGish features.')
    ## Extract VGGish features (audio)
    !cd ./submodules/video_features && $ENVS_PATH/vggish/bin/python main.py \
      --feature_type vggish \
      --on_extraction save_numpy \
      --device_ids 0 \
      --video_paths $MY_VIDEO_PATH \
      --output_path $FEATURES_CACHE_PATH

    # Step 3: Captioning
    print('Generating caption...')

    try:
        ## Run single video prediction
        !$ENVS_PATH/bmt/bin/python ./sample/single_video_prediction.py \
          --prop_generator_model_path $PROPOSAL_CKPT \
          --pretrained_cap_model_path $CAPTIONING_CKPT \
          --vggish_features_path $FEATURE_PATH_VGGISH \
          --rgb_features_path $FEATURE_PATH_RGB \
          --flow_features_path $FEATURE_PATH_FLOW \
          --duration_in_secs $VIDEO_DURATION \
          --device_id 0 \
          --max_prop_per_vid $MAX_PROP_PER_VIDEO \
          --nms_tiou_thresh $NMS_TIOU_THRESHOLD \
          --video_id $video_id \
          --clip_id $clip_id \
          --output_path $output_path
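    except AssertionError:
        # Note: a failure inside the `!` shell command does not raise in this
        # process, so this handler only acts as a safeguard.
        print('Exceeded MAX_PROP_PER_VIDEO')
        exec_md.setdefault('too_many_features', []).append(video_id)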

    print('BMT Execution completed.') 
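
Because the prediction script runs as a separate process, an assertion that fails inside it cannot be caught by a `try`/`except` in the notebook. A possible alternative, sketched below under the assumption that checking the exit code is acceptable, is to launch the step with `subprocess.run`. The helper `run_step` is hypothetical, and the demo commands only stand in for the real script.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and return (succeeded, stderr_text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr


# Stand-in for a child process that fails with an AssertionError.
failed_ok, failed_err = run_step(
    [sys.executable, "-c", "assert False, 'too many proposals'"]
)
# Stand-in for a child process that finishes normally.
passed_ok, _ = run_step([sys.executable, "-c", "print('done')"])

print(failed_ok)  # False
print(passed_ok)  # True
```

With this pattern, the video ID could be recorded whenever the first returned value is `False`, and the captured error text could be logged for later inspection.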

### 3.4 Fixed-time clipping
For this first test, we will be using a window of 2 minutes for performing the fixed-time clipping. Assign the variable `FT_CLIPS_DIR` the path to the directory where you wish to store the clips temporarily. 

In [None]:
FT_CLIPS_DIR = f'{OUTPUT_PATH}/ft_clips' # Your path here
%ls $FT_CLIPS_DIR

The next cell instantiates the clipping module.

In [None]:
cm = ClippingModule(
    path_to_video_out=FT_CLIPS_DIR, 
    window=120, 
    technique='fixed'
)
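
The internals of `ClippingModule` are not shown in this notebook. As a rough sketch of what fixed-time clipping with a 120-second window amounts to, the hypothetical helper below splits a duration into consecutive windows. It is an illustration under that assumption, not the module's implementation.

```python
def fixed_time_windows(duration_secs, window_secs=120):
    """Split [0, duration_secs] into consecutive (start, end) windows."""
    if window_secs <= 0:
        raise ValueError("window_secs must be positive")
    duration_secs = float(duration_secs)
    windows = []
    start = 0.0
    while start < duration_secs:
        # The last window is shorter when the duration is not a multiple
        # of the window size.
        end = min(start + window_secs, duration_secs)
        windows.append((start, end))
        start = end
    return windows


print(fixed_time_windows(300))
# [(0.0, 120.0), (120.0, 240.0), (240.0, 300.0)]
```

In this example, a 5-minute video yields two full clips and one 60-second remainder.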

We then obtain clips from all videos:

In [None]:
videos_list = os.listdir(LABC_VIDEOS_PATH)

for video in videos_list[:26]:
    if video[0] != '.': # ignore hidden files
        cm.get_clips(f'{LABC_VIDEOS_PATH}/{video}', video[:11])

We need to create a DataFrame to keep track of all filenames, video IDs and clip IDs.

In [None]:
# Initialize
df = pd.DataFrame(os.listdir(FT_CLIPS_DIR), columns=['filename'])

# Clean
df.drop(df[df['filename'].str.startswith('.')].index, inplace=True)

# Obtain clip id
df['clip_id'] = [int(re.findall(r"(?<=@).*?(?=\.)", name)[0]) for name in df['filename']]

# Obtain video id
df['video_id'] = [re.findall(r".*?(?=@)", name)[0] for name in df['filename']]

df.head()
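
The clip filenames follow the pattern `<video_id>@<clip_id>.mp4`, for example `DAlNWRjXY4A@017.mp4`. A self-contained sketch of the same parsing with a single regular expression is shown below. The helper `parse_clip_filename` is hypothetical and returns `None` for names that do not match, such as hidden files.

```python
import re

# <video_id>@<clip_id>.<extension>
CLIP_NAME = re.compile(r"^(?P<video_id>.+)@(?P<clip_id>\d+)\.\w+$")


def parse_clip_filename(filename):
    """Return (video_id, clip_id) or None if the name does not match."""
    match = CLIP_NAME.match(filename)
    if match is None:
        return None
    return match.group("video_id"), int(match.group("clip_id"))


print(parse_clip_filename("DAlNWRjXY4A@017.mp4"))  # ('DAlNWRjXY4A', 17)
print(parse_clip_filename(".DS_Store"))            # None
```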

In [None]:
# optional: select fewer videos if time limitations
limited_sample_video_ids = df.video_id.unique()[:10]
limited_sample_video_ids

limited_df = df[df['video_id'].isin(limited_sample_video_ids)]
limited_df.head()

The next cell will execute the BMT model to generate captions for each clip.

In [None]:
# execution metadata
exec_md = {
    'current_sample_num': 1,
    'total_samples': limited_df.shape[0],
}


for index, sample in limited_df.iterrows():
    
    run_bmt(
        input_path=f'{FT_CLIPS_DIR}/{sample.filename}', 
        output_path=f'{OUTPUT_PATH}/4_3_bmt_captions.json', 
        video_id=sample.video_id, 
        clip_id=sample.clip_id
    )
    

    curr_sample = exec_md['current_sample_num']
    exec_md['current_sample_num'] += 1
    total_samples = exec_md['total_samples']
    print(f'\n ***** Processed {curr_sample}/{total_samples} samples. *****\n')


In [None]:
# (Optional) Secure the file with a copy to prevent rewriting
!cp $OUTPUT_PATH/4_3_bmt_captions.json $OUTPUT_PATH/4_3_bmt_captions_save.json

**Analyzing fixed-time clipping results**

The next cell creates a dataframe from the BMT output.

In [None]:
path_to_ft_results = f'{OUTPUT_PATH}/4_3_bmt_captions_save.json'

ft_captions_bmt = pd.read_json(path_to_ft_results, orient='records')
# ft_captions_bmt.sort_values(by="clip_id", inplace=True)

In [None]:
# Add the number of captions column
ft_captions_bmt['number_captions'] = [len(sample.captions) for _, sample in ft_captions_bmt.iterrows()]
ft_captions_bmt.head()

Average number of generated captions:

In [None]:
ft_captions_bmt.number_captions.mean()

Average captions per minute

In [None]:
ft_captions_bmt['CPM'] = [sample.number_captions / (sample.duration / 60) for _, sample in ft_captions_bmt.iterrows()]
ft_captions_bmt.CPM.mean()

Plot number of captions 

In [None]:
PLOT_SAVE_PATH = OUTPUT_PATH

x = ft_captions_bmt.duration
y = ft_captions_bmt.number_captions

plt.scatter(x, y)

plt.xlabel("Duration (seconds)")
plt.ylabel("Number of generated captions")


plt.savefig(f'{PLOT_SAVE_PATH}/3_4_a.png', dpi=300)
plt.show()

The plot above appears irregular because most videos have the same duration. The samples on the left-hand side (shorter duration) are the remainders obtained from the end of each video. The other samples appear with a duration between 30 and 35 seconds. This happens because the BMT recalculates the duration based on the specific framerate used by the model. Hence, a video with an original framerate of 30 frames per second (fps) is assigned a larger duration if the framerate used by the BMT is 25 fps.
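
The effect of the framerate can be illustrated with a short calculation. The sketch assumes that the frame count is kept and reinterpreted at the model framerate. This is one plausible reading of the behavior described above, not a statement about how the BMT code computes it, and the helper name is hypothetical.

```python
def recalculated_duration(duration_secs, original_fps, model_fps):
    """Duration obtained when the same frames are read at model_fps."""
    frame_count = duration_secs * original_fps
    return frame_count / model_fps


# A 120-second clip recorded at 30 fps and reinterpreted at 25 fps.
print(recalculated_duration(120, 30, 25))  # 144.0
```

Under this assumption, the recalculated duration grows by the ratio of the two framerates (30 / 25 = 1.2 in the example).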

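A histogram helps to visualize these results better; one is plotted in the comparison at the end of this section. The next cell computes the average number of clips per video.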

In [None]:
ft_captions_bmt.groupby("video_id")["number_captions"].count().mean()
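
The `groupby(...).count()` call above counts rows per video, that is, the number of clips of each video, and then averages those counts. The same computation is sketched below with the standard library and made-up video IDs, assuming one row per clip.

```python
from collections import Counter

# Hypothetical video ID of each clip (one entry per clip).
clip_video_ids = ["vidA", "vidA", "vidA", "vidB", "vidB", "vidC"]

clips_per_video = Counter(clip_video_ids)
mean_clips = sum(clips_per_video.values()) / len(clips_per_video)

print(clips_per_video["vidA"])  # 3
print(mean_clips)               # 2.0
```

To obtain the total number of captions per video instead, the counts would have to be replaced by a sum over the `number_captions` column.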

### 3.5 `pyscene` clipping
For this second test, we use scene-based clipping (`technique='scene'`), which does not require a fixed window. Assign the variable `PS_CLIPS_DIR` the path to the directory where you wish to store the clips temporarily.

In [None]:
LABC_VIDEOS_PATH

In [None]:
PS_CLIPS_DIR = f'{OUTPUT_PATH}/ps_clips' # Your path here
%ls $PS_CLIPS_DIR

The next cell instantiates the clipping module.

In [None]:
cm = ClippingModule(
    path_to_video_out=PS_CLIPS_DIR, 
    technique='scene'
)

We then obtain clips from all videos:

In [None]:
videos_list = os.listdir(LABC_VIDEOS_PATH)

for video in videos_list[:10]:
    if video[0] != '.': # ignore hidden files
        cm.get_clips(input_path=f'{LABC_VIDEOS_PATH}/{video}', name=video[:11])

We need to create a DataFrame to keep track of all filenames, video IDs and clip IDs.

In [47]:
# Initialize
df = pd.DataFrame(os.listdir(PS_CLIPS_DIR), columns=['filename'])

# Clean
df.drop(df[df['filename'].str.startswith('.')].index, inplace=True)

# Obtain clip id
df['clip_id'] = [int(re.findall(r"(?<=@).*?(?=\.)", name)[0]) for name in df['filename']]

# Obtain video id
df['video_id'] = [re.findall(r".*?(?=@)", name)[0] for name in df['filename']]

df.video_id.unique()

array(['DAlNWRjXY4A', '0RqBZlDur5k', '_VWAFXFRTcA', '0dyQouKDR2M',
       '3oWlYHKMyv8', 'LlBedonOnR0', '3X7OqTDi8NQ', 'FJ64K9QdwDU'],
      dtype=object)

In [46]:
# optional: select fewer videos if time limitations
limited_sample_video_ids = df.video_id.unique()[:10]
limited_sample_video_ids

limited_df = df[df['video_id'].isin(limited_sample_video_ids)]
limited_df.head()

|   | filename | clip_id | video_id |
|---|---|---|---|
| 0 | DAlNWRjXY4A@017.mp4 | 17 | DAlNWRjXY4A |
| 1 | DAlNWRjXY4A@018.mp4 | 18 | DAlNWRjXY4A |
| 2 | DAlNWRjXY4A@019.mp4 | 19 | DAlNWRjXY4A |
| 3 | DAlNWRjXY4A@020.mp4 | 20 | DAlNWRjXY4A |
| 4 | DAlNWRjXY4A@021.mp4 | 21 | DAlNWRjXY4A |


The next cell will execute the BMT model to generate captions for each clip.

In [None]:
# execution metadata
exec_md = {
    'current_sample_num': 1,
    'total_samples': limited_df.shape[0],
}


for index, sample in limited_df.iterrows():
    
    run_bmt(
        input_path=f'{PS_CLIPS_DIR}/{sample.filename}', 
        output_path=f'{OUTPUT_PATH}/4_3_bmt_captions_ps.json', 
        video_id=sample.video_id, 
        clip_id=sample.clip_id
    )
    

    curr_sample = exec_md['current_sample_num']
    exec_md['current_sample_num'] += 1
    total_samples = exec_md['total_samples']
    print(f'\n ***** Processed {curr_sample}/{total_samples} samples. *****\n')


In [None]:
# (Optional) Secure the file with a copy to prevent rewriting
!cp $OUTPUT_PATH/4_3_bmt_captions_ps.json $OUTPUT_PATH/4_3_bmt_captions_ps_save.json

**Analyzing pyscene clipping results**

The next cell creates a dataframe from the BMT output.

In [None]:
path_to_ps_results = f'{OUTPUT_PATH}/4_3_bmt_captions_ps_save.json'

ps_captions_bmt = pd.read_json(path_to_ps_results, orient='records')
# ps_captions_bmt.sort_values(by="clip_id", inplace=True)

In [None]:
# Add the number of captions column
ps_captions_bmt['number_captions'] = [len(sample.captions) for _, sample in ps_captions_bmt.iterrows()]
ps_captions_bmt.head()

Average number of generated captions:

In [None]:
ps_captions_bmt.number_captions.mean()

Average captions per minute

In [None]:
ps_captions_bmt['CPM'] = [sample.number_captions / (sample.duration / 60) for _, sample in ps_captions_bmt.iterrows()]
ps_captions_bmt.CPM.mean()

Plot number of captions 

In [None]:
PLOT_SAVE_PATH = OUTPUT_PATH

x = ps_captions_bmt.duration
y = ps_captions_bmt.number_captions

plt.scatter(x, y)

plt.xlabel("Duration (seconds)")
plt.ylabel("Number of generated captions")


# Use a different filename so the fixed-time plot (3_4_a.png) is not overwritten
plt.savefig(f'{PLOT_SAVE_PATH}/3_5_a.png', dpi=300)
plt.show()

The plot above appears irregular because most videos have the same duration. The samples on the left-hand side (shorter duration) are the remainders obtained from the end of each video. The other samples appear with a duration between 30 and 35 seconds. This happens because the BMT recalculates the duration based on the specific framerate used by the model. Hence, a video with an original framerate of 30 frames per second (fps) is assigned a larger duration if the framerate used by the BMT is 25 fps.

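A histogram helps to visualize these results better; one is plotted in the comparison that follows. The next cell computes the average number of clips per video.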

In [None]:
ps_captions_bmt.groupby("video_id")["number_captions"].count().mean()

**Comparisons with histograms.**

In [None]:
anetc_val_path = '/home/A01630791/bmt_clipping_experiments/ActivityNet_Captions/val_1.json'
anetc_val_df = pd.read_json(anetc_val_path, orient='index')

bmt_results_path = '/home/A01630791/bmt_clipping_experiments/Output_4_1/4_1_bmt_captions_save.json'
bmt_results_df = pd.read_json(bmt_results_path, orient='records')

In [None]:
# compute no. captions
anetc_val_df['number_captions'] = [len(sample.sentences) for index, sample in anetc_val_df.iterrows()]

bmt_results_df['number_captions'] = [len(sample.captions) for index, sample in bmt_results_df.iterrows()]
bmt_results_df.head()

In [None]:
# graphs
anetc_val_df.number_captions.plot.hist()

In [None]:
bmt_results_df.number_captions.plot.hist()

In [None]:
dpi_fig = plt.figure(dpi=300)

anetc_val_df.iloc[:].number_captions.plot.hist(color="lightblue")
ax = ft_captions_bmt.number_captions.plot.hist()
ax.legend(['ActivityNet (Annotations)', 'LABC (BMT-generated)'])
ax.set_xlabel('Number of captions')


In [None]:
fig = ax.get_figure()
fig.savefig(f'{PLOT_SAVE_PATH}/3_4_b.png', dpi=300)

The histogram above plots the number of captions against frequency. It shows the similarity between the distribution of the human annotations the model was trained on and the distribution of the number of captions generated by the BMT on a different dataset.
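
The visual comparison can be complemented with a number. One simple option, which is not used in this experiment and is only suggested here as a sketch, is the histogram intersection of the two normalized distributions: 1.0 means identical distributions and 0.0 means no overlap. The caption counts below are made up.

```python
from collections import Counter


def normalized_histogram(values):
    """Map each distinct value to its relative frequency."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def histogram_intersection(a, b):
    """Overlap between the normalized histograms of a and b (0.0 to 1.0)."""
    hist_a = normalized_histogram(a)
    hist_b = normalized_histogram(b)
    keys = set(hist_a) | set(hist_b)
    return sum(min(hist_a.get(k, 0.0), hist_b.get(k, 0.0)) for k in keys)


# Hypothetical numbers of captions per sample.
annotated = [3, 3, 4, 4]
generated = [3, 4, 4, 5]

print(histogram_intersection(annotated, generated))  # 0.75
```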