<a href="https://colab.research.google.com/github/johan-lindell/VSL-egocentric/blob/main/notebooks/extension.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Queries in Egocentric Videos Project Extension 2

Here we implement an extension of our VSL model: we first identify the segments of long videos that contain the information relevant to a query, and then pass these segments to a VLM to obtain textual answers.

## General Setup

Imports and persistent storage in Google Drive.

In [1]:
from google.colab import drive, userdata
import numpy as np
import pandas as pd
import json
import os

In [2]:

drive.mount('/content/drive')

Mounted at /content/drive


Set the relevant directories.

In [3]:
EXTENSION_DIR = '/content/drive/MyDrive/vsl-egocentric/extension'
VIDEO_OUT = EXTENSION_DIR + '/uncut_videos'
UNCUT_VIDEO_DIR = VIDEO_OUT + '/v1/clips'
CUT_VIDEO_OUT = EXTENSION_DIR + '/cut_videos'

## Find 50 correct queries

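Find 50 correctly retrieved NLQ queries by comparing our model's predictions with the validation annotations. A prediction counts as correct when both its start and end times lie within a manually chosen tolerance of the ground-truth times.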

In [None]:
# Load the validation and prediction data
with open(EXTENSION_DIR + '/val.json') as f:
    val_data = json.load(f)

with open(EXTENSION_DIR + '/preds.json') as f:
    pred_data = json.load(f)

# Create a dictionary mapping annotation_uid to its exact times and sentences
val_dict = {}
for clip_uid, clip_data in val_data.items():
    for idx, annotation_uid in enumerate(clip_data["annotation_uids"]):
        val_dict[annotation_uid] = {
            "exact_times": clip_data["exact_times"][idx],
            "sentence": clip_data["sentences"][idx]
        }

# Set a tolerance for matching times (seconds).
# Note: the saved output below was produced with a tolerance of 1.704 seconds.
tolerance = 6

# List to store correctly retrieved NLQ queries
correct_queries = []
unique_entries = set()
i = 0
for result in pred_data["results"]:
    annotation_uid = result["annotation_uid"]
    clip_uid = result["clip_uid"]
    predicted_times = result["predicted_times"]

    if annotation_uid in val_dict:
        exact_times = val_dict[annotation_uid]["exact_times"]
        # Check if any of the predicted times match the exact times within tolerance
        for pred_start, pred_end in predicted_times:
            exact_start, exact_end = exact_times
            if abs(pred_start - exact_start) <= tolerance and abs(pred_end - exact_end) <= tolerance:
                entry = (annotation_uid, clip_uid)
                if entry not in unique_entries:
                    correct_queries.append({
                        "idx": i,
                        "annotation_uid": annotation_uid,
                        "clip_uid": clip_uid,
                        "sentence": val_dict[annotation_uid]["sentence"],
                        "predicted_times": (pred_start, pred_end),
                        "exact_times": (exact_start, exact_end)
                    })
                    unique_entries.add(entry)
                    i += 1
                break
    if len(correct_queries) >= 50:
        break

# Output the first 50 correct queries
correct_queries = correct_queries[:50]

# Convert the results to a dataframe
df = pd.DataFrame(correct_queries)
print(f'{df.shape[0]} correct queries found with a tolerance of {tolerance} seconds.')
df.head()

50 correct queries found with a tolerance of 1.704 seconds.


| | idx | annotation_uid | clip_uid | sentence | predicted_times | exact_times |
|---|---|---|---|---|---|---|
| 0 | 0 | 847f64a8-5335-4f1b-8248-73727dfe52ce | 00d9a297-d967-4d28-8e5a-6b891814ec65 | where did i put the knife? | (146.25, 150.0) | (147.95371, 148.928) |
| 1 | 1 | f9cd0c31-4e28-411e-b498-19db3e544030 | 9a13aee2-0dca-49f8-968f-8f53c5a62963 | what vegetable did i cut? | (75.0, 93.75) | (74.42368, 92.91834) |
| 2 | 2 | 21563a23-ca10-4165-8b6a-74c72d722f0a | 2c1724ce-f438-4d63-a699-8a7f65e3cbd9 | where is phone? | (0.0, 3.75) | (0.283, 3.5) |
| 3 | 3 | e3015a5a-3e3e-47f5-a6b9-b77d3648621e | 679cfee6-7da1-4701-b75a-9e34abb9400a | where was can drink before i drank it? | (15.0, 18.75) | (15.0, 18.086) |
| 4 | 4 | 763cc50c-edf6-4b99-98e2-5030a557784c | 1138ced6-d580-4013-96bb-1e5c3fea62d7 | how many cans were in the fridge? | (315.0, 326.25) | (315.706, 325.888) |
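
The matching rule used in the cell above can be isolated into a small helper. The following is a minimal sketch rather than the notebook's actual code: the function name `within_tolerance` is hypothetical, and the sample values are taken from the first row of the table.

```python
def within_tolerance(predicted, exact, tolerance):
    """Return True if both span boundaries differ by at most `tolerance` seconds."""
    pred_start, pred_end = predicted
    exact_start, exact_end = exact
    return (abs(pred_start - exact_start) <= tolerance
            and abs(pred_end - exact_end) <= tolerance)


# Values from the first row of the table above
predicted = (146.25, 150.0)
exact = (147.95371, 148.928)

print(within_tolerance(predicted, exact, tolerance=1.704))  # True
print(within_tolerance(predicted, exact, tolerance=1.0))    # False
```

The start times of this row differ by about 1.7037 seconds, so it only just passes at a tolerance of 1.704 seconds.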


In [None]:
# Save the dataframe to a CSV file
df.to_csv(f'{EXTENSION_DIR}/correct_nlq_queries_tol{tolerance}s.csv', index=False)
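
Note that the `predicted_times` and `exact_times` tuples are written to the CSV as strings such as `"(146.25, 150.0)"`, so they need to be parsed when the file is read back. A minimal sketch of one way to do this, assuming the string format shown in the tables of this notebook; the helper name `parse_time_pair` is hypothetical:

```python
import ast


def parse_time_pair(text):
    """Convert a string such as "(146.25, 150.0)" back into a (start, end) tuple of floats."""
    start, end = ast.literal_eval(text)
    return float(start), float(end)


start, end = parse_time_pair("(146.25, 150.0)")
print(end - start)  # 3.75
```

With pandas, a helper like this could be passed through the `converters` argument of `pd.read_csv`, for example `converters={'predicted_times': parse_time_pair}`.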

## Download videos


Read the AWS access key and secret for the Ego4D dataset from the Colab secrets.

In [None]:
# Copy the secrets stored in Google Colab into environment variables
os.environ['AWS_ACCESS_KEY_ID'] = userdata.get('aws_access_key')
os.environ['AWS_SECRET_ACCESS_KEY'] = userdata.get('aws_secret_key')
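
Before installing the CLI, it can help to confirm that both variables are actually populated. A minimal sketch, assuming only the two environment variable names used above; the helper `missing_env_vars` is hypothetical and not part of Colab or the AWS tooling:

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("AWS credentials are set.")
```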

Install the AWS CLI.

In [None]:
%%bash
# Download, install, and configure the AWS CLI.
# The %%bash cell magic must be the first line of the cell.
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip -o awscliv2.zip >/dev/null
sudo ./aws/install >/dev/null 2>&1
aws configure set aws_access_key_id "$AWS_ACCESS_KEY_ID" && aws configure set aws_secret_access_key "$AWS_SECRET_ACCESS_KEY"
rm "awscliv2.zip"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 57.8M  100 57.8M    0     0  78.9M      0 --:--:-- --:--:-- --:--:-- 79.0M


Install the Ego4D CLI.

In [None]:
!pip install ego4d



Download the videos for the selected `clip_uid` values.

In [None]:
# Extract unique clip_uids
unique_clip_uids = df["clip_uid"].unique().tolist()

# Download each clip with a separate CLI call; passing the whole list at once did not work for us
for uid in unique_clip_uids:
    command = f"ego4d --version v1 --output_directory={VIDEO_OUT} --datasets clips --video_uids={uid} --yes"
    !{command}

Datasets to download: {'clips'}
Download Path: /content/drive/MyDrive/vsl-egocentric/extension/uncut_videos/v1
Ego4D Metadata: /content/drive/MyDrive/vsl-egocentric/extension/uncut_videos/ego4d.json
Checking requested datasets and versions...
Created download directory for version 'v1' of dataset: 'clips' at: /content/drive/MyDrive/vsl-egocentric/extension/uncut_videos/v1/clips
Only downloading a subset of the video files because the 'video_uids' flag has been set on the command line or in the config file. A total of 1 video files will be downloaded.

Retrieving object metadata from S3...
100% 1/1 [00:00<00:00, 878.57object/s]
Checking if latest file versions are already downloaded...
100% 1/1 [00:01<00:00,  1.03s/file]
No existing videos to filter.
Downloading 1 files..
 88% 65.8M/74.5M [00:02<00:00, 76.2MiB/s]Checking file integrity...
100% 74.5M/74.5M [00:03<00:00, 25.4MiB/s]
Datasets to download: {'clips'}
Download Path: /content/drive/MyDrive/vsl-egocentric/extension/uncut_videos/
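
Since the clips are stored persistently on Google Drive, a re-run only needs to request clips that are not there yet. A minimal sketch, assuming each clip is saved as `<clip_uid>.mp4` in `UNCUT_VIDEO_DIR` (the naming that the cutting step relies on); the helper `clips_to_download` and the sample UIDs are hypothetical:

```python
import os
import tempfile


def clips_to_download(clip_uids, clip_dir):
    """Return the UIDs whose `<uid>.mp4` file is not yet present in `clip_dir`."""
    return [
        uid for uid in clip_uids
        if not os.path.isfile(os.path.join(clip_dir, f"{uid}.mp4"))
    ]


# Demonstration with a temporary directory and made-up UIDs
with tempfile.TemporaryDirectory() as clip_dir:
    open(os.path.join(clip_dir, "clip-a.mp4"), "w").close()
    print(clips_to_download(["clip-a", "clip-b"], clip_dir))  # ['clip-b']
```

In the notebook, the download loop could then iterate over `clips_to_download(unique_clip_uids, UNCUT_VIDEO_DIR)` instead of the full list.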

### Cut videos using ffmpeg

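Videos are cut with ffmpeg and stored in the folder given by `CUT_VIDEO_OUT`. Because the streams are copied without re-encoding (`-c copy`), cuts snap to keyframes, so segment boundaries may deviate slightly from the requested times.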

In [None]:
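# Copy `duration` seconds of video starting at `start_time`, without re-encoding
def run_ffmpeg(input_path, output_path, start_time, duration):
    # Quote the paths so the shell command also works if a directory name contains spaces
    command = f'ffmpeg -i "{input_path}" -ss {start_time} -t {duration} -c copy "{output_path}"'
    !{command}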


os.makedirs(CUT_VIDEO_OUT, exist_ok=True)

# Loop through the correct queries to extract the segments
for query in correct_queries:
    clip_uid = query["clip_uid"]
    annotation_uid = query["annotation_uid"]
    predicted_start, predicted_end = query["predicted_times"]
    idx = query["idx"]
    start_time = predicted_start
    duration = predicted_end - predicted_start

    input_video_path = os.path.join(UNCUT_VIDEO_DIR, f"{clip_uid}.mp4")
    output_segment_path = os.path.join(CUT_VIDEO_OUT, f"{idx}_{clip_uid}_{annotation_uid}.mp4")

    run_ffmpeg(input_video_path, output_segment_path, start_time, duration)


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
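
Outside a notebook, the same cut can be expressed with `subprocess` and an argument list, which avoids shell quoting altogether. This is a minimal sketch under stated assumptions, not the notebook's method: the function names are hypothetical, and the `-hide_banner`, `-loglevel error`, and `-y` flags are additions that quieten the output and overwrite existing files.

```python
import subprocess


def build_ffmpeg_command(input_path, output_path, start_time, duration):
    """Build the ffmpeg argument list for copying one segment without re-encoding."""
    return [
        "ffmpeg",
        "-hide_banner",          # assumption: suppress the version/configuration banner
        "-loglevel", "error",    # assumption: only print errors
        "-y",                    # assumption: overwrite an existing output file
        "-i", str(input_path),
        "-ss", str(start_time),
        "-t", str(duration),
        "-c", "copy",
        str(output_path),
    ]


def cut_segment(input_path, output_path, start_time, duration):
    """Run ffmpeg and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(
        build_ffmpeg_command(input_path, output_path, start_time, duration),
        check=True,
    )


# Hypothetical file names; only the command is printed here, ffmpeg is not run
print(" ".join(build_ffmpeg_command("in.mp4", "out.mp4", 146.25, 3.75)))
```

Using `check=True` makes a failed cut stop the loop immediately instead of silently leaving a missing segment.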

## Implement and train model

Set annotation path and replace NaN values.

In [4]:
annotations_path = EXTENSION_DIR + '/manual_annotations.xlsx'
annotations_df = pd.read_excel(annotations_path)

# Replace missing values (unanswerable clips) with a default answer
annotations_df.fillna("i don't know the video is unclear", inplace=True)

annotations_df.head()

| | idx | annotation_uid | clip_uid | sentence | predicted_times | exact_times | manual_annotation |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 847f64a8-5335-4f1b-8248-73727dfe52ce | 00d9a297-d967-4d28-8e5a-6b891814ec65 | where did i put the knife? | (146.25, 150.0) | (147.95371, 148.928) | i don't know the video is unclear |
| 1 | 1 | f9cd0c31-4e28-411e-b498-19db3e544030 | 9a13aee2-0dca-49f8-968f-8f53c5a62963 | what vegetable did i cut? | (75.0, 93.75) | (74.42368, 92.91834) | The person cut some kind of tapered cabbage. |
| 2 | 2 | 21563a23-ca10-4165-8b6a-74c72d722f0a | 2c1724ce-f438-4d63-a699-8a7f65e3cbd9 | where is phone? | (0.0, 3.75) | (0.283, 3.5) | The phone is on the shelf next to a water bottle. |
| 3 | 3 | e3015a5a-3e3e-47f5-a6b9-b77d3648621e | 679cfee6-7da1-4701-b75a-9e34abb9400a | where was can drink before i drank it? | (15.0, 18.75) | (15.0, 18.086) | Original location of the can drink cannot be s... |
| 4 | 4 | 763cc50c-edf6-4b99-98e2-5030a557784c | 1138ced6-d580-4013-96bb-1e5c3fea62d7 | how many cans were in the fridge? | (315.0, 326.25) | (315.706, 325.888) | There were eleven cans in the fridge. |


Transform the dataframe into a list of question-answer records for the model.

In [5]:
def prepare_video_qa_data(annotations_df, video_segments_dir):
    qa_data = []
    for _, row in annotations_df.iterrows():
        idx = row['idx']
        clip_uid = row['clip_uid']
        annotation_uid = row['annotation_uid']
        sentence = row['sentence']
        manual_annotation = row['manual_annotation']

        video_segment_path = os.path.join(video_segments_dir, f"{idx}_{clip_uid}_{annotation_uid}.mp4")

        qa_data.append({
            'video_path': video_segment_path,
            'question': sentence,
            'ground_truth': manual_annotation
        })

    return qa_data


qa_data = prepare_video_qa_data(annotations_df, CUT_VIDEO_OUT)
qa_data[:3]

[{'video_path': '/content/drive/MyDrive/vsl-egocentric/extension/cut_videos/0_00d9a297-d967-4d28-8e5a-6b891814ec65_847f64a8-5335-4f1b-8248-73727dfe52ce.mp4',
  'question': 'where did i put the knife?',
  'ground_truth': "i don't know the video is unclear"},
 {'video_path': '/content/drive/MyDrive/vsl-egocentric/extension/cut_videos/1_9a13aee2-0dca-49f8-968f-8f53c5a62963_f9cd0c31-4e28-411e-b498-19db3e544030.mp4',
  'question': 'what vegetable did i cut?',
  'ground_truth': 'The person cut some kind of tapered cabbage.'},
 {'video_path': '/content/drive/MyDrive/vsl-egocentric/extension/cut_videos/2_2c1724ce-f438-4d63-a699-8a7f65e3cbd9_21563a23-ca10-4165-8b6a-74c72d722f0a.mp4',
  'question': 'where is phone?',
  'ground_truth': 'The phone is on the shelf next to a water bottle.'}]
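
Each record points to a segment file that must exist before it is handed to the model. A minimal sketch of a sanity check, assuming records shaped like the `qa_data` entries above; the helper `split_by_video_presence` and the demonstration records are hypothetical:

```python
import os
import tempfile


def split_by_video_presence(qa_data):
    """Split QA records into those whose video file exists and those whose file is missing."""
    present, missing = [], []
    for record in qa_data:
        target = present if os.path.isfile(record["video_path"]) else missing
        target.append(record)
    return present, missing


# Demonstration with a temporary file and made-up records
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "0_clip_annotation.mp4")
    open(existing, "w").close()
    records = [
        {"video_path": existing, "question": "where is phone?", "ground_truth": "on the shelf"},
        {"video_path": os.path.join(tmp_dir, "absent.mp4"), "question": "q", "ground_truth": "a"},
    ]
    present, missing = split_by_video_presence(records)
    print(len(present), len(missing))  # 1 1
```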