# Project Week 1: ActivityNet Video Data Preparation and Indexing

In this example we will use the ActivityNet dataset https://github.com/activitynet/ActivityNet. 

 - Select the 10 videos with the most moments.
 - Download these videos onto your computer.
 - Extract the frames for every video.
 - Read the textual descriptions of each video.
 - Index the video data in OpenSearch.

This week, you will index the video data and make it searchable with OpenSearch. Refer to the OpenSearch tutorial laboratory for the basics.

## Select videos
Download the `activity_net.v1-3.min.json` file, which contains the list of videos. The file is in the GitHub repository of ActivityNet.
Parse this file and select the 10 videos with the most moments, i.e. those whose `annotations` list is longest.

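In [2]:
import json
import subprocess

# Load the ActivityNet annotation file.
with open('activity_net.v1-3.min.json', 'r') as json_data:
    data = json.load(json_data)

# One record per video: its id, its number of annotated moments, and its URL.
video_annotations = [
    {"video_id": vid, "num_moments": len(details["annotations"]), "url": details["url"]}
    for vid, details in data["database"].items()
]

# 11 candidates are taken rather than 10: in the log below one of them
# (PJ72Yl0B1rY) is private, so 10 videos are actually downloaded.
top_videos = sorted(video_annotations, key=lambda x: x["num_moments"], reverse=True)[:11]

print(top_videos)
for video in top_videos:
    # "-f best" requests a single pre-merged file; yt-dlp prints a warning about it.
    subprocess.run(["yt-dlp", "-f", "best", video['url']])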



[{'video_id': 'o1WPnnvs00I', 'num_moments': 23, 'url': 'https://www.youtube.com/watch?v=o1WPnnvs00I'}, {'video_id': 'oGwn4NUeoy8', 'num_moments': 23, 'url': 'https://www.youtube.com/watch?v=oGwn4NUeoy8'}, {'video_id': 'VEDRmPt_-Ms', 'num_moments': 20, 'url': 'https://www.youtube.com/watch?v=VEDRmPt_-Ms'}, {'video_id': 'qF3EbR8y8go', 'num_moments': 19, 'url': 'https://www.youtube.com/watch?v=qF3EbR8y8go'}, {'video_id': 'DLJqhYP-C0k', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=DLJqhYP-C0k'}, {'video_id': 't6f_O8a4sSg', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=t6f_O8a4sSg'}, {'video_id': '6gyD-Mte2ZM', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=6gyD-Mte2ZM'}, {'video_id': 'jBvGvVw3R-Q', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=jBvGvVw3R-Q'}, {'video_id': 'PJ72Yl0B1rY', 'num_moments': 17, 'url': 'https://www.youtube.com/watch?v=PJ72Yl0B1rY'}, {'video_id': 'QHn9KyE-zZo', 'num_moments': 17, 'url': 'https://www.youtube.com/wa

         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=o1WPnnvs00I
[youtube] o1WPnnvs00I: Downloading webpage
[youtube] o1WPnnvs00I: Downloading tv client config
[youtube] o1WPnnvs00I: Downloading player 8a8ac953
[youtube] o1WPnnvs00I: Downloading tv player API JSON
[youtube] o1WPnnvs00I: Downloading ios player API JSON
[youtube] o1WPnnvs00I: Downloading m3u8 information
[info] o1WPnnvs00I: Downloading 1 format(s): 18
[download] Destination: Irish March (Fife & Flute) – Wouter Kellerman (Live) [o1WPnnvs00I].mp4
[download] 100% of   10.19MiB in 00:00:10 at 975.23KiB/s 


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=oGwn4NUeoy8
[youtube] oGwn4NUeoy8: Downloading webpage
[youtube] oGwn4NUeoy8: Downloading tv client config
[youtube] oGwn4NUeoy8: Downloading player 8a8ac953
[youtube] oGwn4NUeoy8: Downloading tv player API JSON
[youtube] oGwn4NUeoy8: Downloading ios player API JSON
[youtube] oGwn4NUeoy8: Downloading m3u8 information
[info] oGwn4NUeoy8: Downloading 1 format(s): 18
[download] Destination: Jessica Ryckewaert plays Congas Drums Mezza Voce, Rhythm Concert [oGwn4NUeoy8].mp4
[download] 100% of    5.66MiB in 00:00:03 at 1.45MiB/s   


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=VEDRmPt_-Ms
[youtube] VEDRmPt_-Ms: Downloading webpage
[youtube] VEDRmPt_-Ms: Downloading tv client config
[youtube] VEDRmPt_-Ms: Downloading player 8a8ac953
[youtube] VEDRmPt_-Ms: Downloading tv player API JSON
[youtube] VEDRmPt_-Ms: Downloading ios player API JSON
[youtube] VEDRmPt_-Ms: Downloading m3u8 information
[info] VEDRmPt_-Ms: Downloading 1 format(s): 18
[download] Destination: Raquel.Pinto.Estrela.Moitense.Tumbling.2013 [VEDRmPt_-Ms].mp4
[download] 100% of   11.13MiB in 00:00:06 at 1.73MiB/s     


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=qF3EbR8y8go
[youtube] qF3EbR8y8go: Downloading webpage
[youtube] qF3EbR8y8go: Downloading tv client config
[youtube] qF3EbR8y8go: Downloading player 8a8ac953
[youtube] qF3EbR8y8go: Downloading tv player API JSON
[youtube] qF3EbR8y8go: Downloading ios player API JSON
[youtube] qF3EbR8y8go: Downloading m3u8 information
[info] qF3EbR8y8go: Downloading 1 format(s): 18
[download] Destination: Chinese Brush Painting [qF3EbR8y8go].mp4
[download] 100% of    9.50MiB in 00:00:04 at 1.98MiB/s     


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=DLJqhYP-C0k
[youtube] DLJqhYP-C0k: Downloading webpage
[youtube] DLJqhYP-C0k: Downloading tv client config
[youtube] DLJqhYP-C0k: Downloading player 8a8ac953
[youtube] DLJqhYP-C0k: Downloading tv player API JSON
[youtube] DLJqhYP-C0k: Downloading ios player API JSON
[youtube] DLJqhYP-C0k: Downloading m3u8 information
[info] DLJqhYP-C0k: Downloading 1 format(s): 18
[download] Destination: LOS MEJORES TIROS DE BOLOS  THE  BEST BOWLING SHOTS [DLJqhYP-C0k].mp4
[download] 100% of   10.60MiB in 00:00:09 at 1.17MiB/s     


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=t6f_O8a4sSg
[youtube] t6f_O8a4sSg: Downloading webpage
[youtube] t6f_O8a4sSg: Downloading tv client config
[youtube] t6f_O8a4sSg: Downloading player 8a8ac953
[youtube] t6f_O8a4sSg: Downloading tv player API JSON
[youtube] t6f_O8a4sSg: Downloading ios player API JSON
[youtube] t6f_O8a4sSg: Downloading m3u8 information
[info] t6f_O8a4sSg: Downloading 1 format(s): 18
[download] Destination: SKATEBOARDING MADE SIMPLE VOLUME 5 AVAILABLE NOW! [t6f_O8a4sSg].mp4
[download] 100% of   11.71MiB in 00:00:07 at 1.64MiB/s     


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=6gyD-Mte2ZM
[youtube] 6gyD-Mte2ZM: Downloading webpage
[youtube] 6gyD-Mte2ZM: Downloading tv client config
[youtube] 6gyD-Mte2ZM: Downloading player 8a8ac953
[youtube] 6gyD-Mte2ZM: Downloading tv player API JSON
[youtube] 6gyD-Mte2ZM: Downloading ios player API JSON
[youtube] 6gyD-Mte2ZM: Downloading m3u8 information
[info] 6gyD-Mte2ZM: Downloading 1 format(s): 18
[download] Destination: PBA - Lowest game bowled on tv - 100 [6gyD-Mte2ZM].mp4
[download] 100% of   13.03MiB in 00:00:16 at 827.13KiB/s   


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=jBvGvVw3R-Q
[youtube] jBvGvVw3R-Q: Downloading webpage
[youtube] jBvGvVw3R-Q: Downloading tv client config
[youtube] jBvGvVw3R-Q: Downloading player 8a8ac953
[youtube] jBvGvVw3R-Q: Downloading tv player API JSON
[youtube] jBvGvVw3R-Q: Downloading ios player API JSON
[youtube] jBvGvVw3R-Q: Downloading m3u8 information
[info] jBvGvVw3R-Q: Downloading 1 format(s): 18
[download] Destination: The King of American Weightlifting Visits Cal Strength [jBvGvVw3R-Q].mp4
[download] 100% of   13.51MiB in 00:00:08 at 1.65MiB/s     


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=PJ72Yl0B1rY
[youtube] PJ72Yl0B1rY: Downloading webpage
[youtube] PJ72Yl0B1rY: Downloading tv client config
[youtube] PJ72Yl0B1rY: Downloading player 8a8ac953
[youtube] PJ72Yl0B1rY: Downloading tv player API JSON
[youtube] PJ72Yl0B1rY: Downloading ios player API JSON


ERROR: [youtube] PJ72Yl0B1rY: Private video. Sign in if you've been granted access to this video. Use --cookies-from-browser or --cookies for the authentication. See  https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp  for how to manually pass cookies. Also see  https://github.com/yt-dlp/yt-dlp/wiki/Extractors#exporting-youtube-cookies  for tips on effectively exporting YouTube cookies
         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=QHn9KyE-zZo
[youtube] QHn9KyE-zZo: Downloading webpage
[youtube] QHn9KyE-zZo: Downloading tv client config
[youtube] QHn9KyE-zZo: Downloading player 8a8ac953
[youtube] QHn9KyE-zZo: Downloading tv player API JSON
[youtube] QHn9KyE-zZo: Downloading ios player API JSON
[youtube] QHn9KyE-zZo: Downloading m3u8 information
[info] QHn9KyE-zZo: Downloading 1 format(s): 18
[download] Destination: London Slacklining [QHn9KyE-zZo].mp4
[download] 100% of   16.67MiB in 00:00:23 at 734.22KiB/s 


         To let yt-dlp download and merge the best available formats, simply do not pass any format selection.


[youtube] Extracting URL: https://www.youtube.com/watch?v=9-yueOtwiL8
[youtube] 9-yueOtwiL8: Downloading webpage
[youtube] 9-yueOtwiL8: Downloading tv client config
[youtube] 9-yueOtwiL8: Downloading player 8a8ac953
[youtube] 9-yueOtwiL8: Downloading tv player API JSON
[youtube] 9-yueOtwiL8: Downloading ios player API JSON
[youtube] 9-yueOtwiL8: Downloading m3u8 information
[info] 9-yueOtwiL8: Downloading 1 format(s): 18
[download] Destination: Shot-Put, Discus, Javelin [9-yueOtwiL8].mp4
[download] 100% of   13.35MiB in 00:00:06 at 2.00MiB/s     
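
The log above shows that one of the candidates is private and cannot be downloaded. A minimal sketch of how the download loop could track such failures is given below. It assumes that `yt-dlp` exits with a non-zero return code when a download fails; the function name `download_top_videos` and its `wanted` and `run` parameters are hypothetical and not part of the original notebook.

```python
import subprocess


def download_top_videos(candidates, wanted=10, run=subprocess.run):
    """Try candidates in order until `wanted` downloads have succeeded.

    `candidates` is a list of dicts with "video_id" and "url" keys,
    already sorted by number of moments. `run` is injectable so the
    logic can be exercised without touching the network.
    """
    downloaded, failed = [], []
    for video in candidates:
        if len(downloaded) == wanted:
            break
        # Same command line as in the notebook cell above.
        result = run(["yt-dlp", "-f", "best", video["url"]])
        # Assumption: a non-zero return code means the download failed.
        if result.returncode == 0:
            downloaded.append(video["video_id"])
        else:
            failed.append(video["video_id"])
    return downloaded, failed
```

Called as `download_top_videos(sorted_candidates)`, with a candidate list longer than 10, this would skip unavailable videos and keep going until 10 files are on disk or the candidates run out.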


## Video frame extraction

PyAV is a Python binding for the FFmpeg libraries, the same libraries that power the `ffmpeg` command-line video processing tool. In the example below, you will extract roughly one frame per second from each downloaded video.

In [None]:
import av
import os

# Assumes the downloaded videos have been placed in ./video.
video_folder = "./video"
output_folder = "./frames"

for video in os.listdir(video_folder):
    # File name without extension, used as the per-video output folder
    filename = os.path.splitext(video)[0]
    frames_dir = os.path.join(output_folder, filename)
    os.makedirs(frames_dir, exist_ok=True)
    with av.open(os.path.join(video_folder, video)) as container:
        stream = container.streams.video[0]
        # average_rate is a Fraction (e.g. 30000/1001); rounding it gives the
        # number of frames per second, so we keep about 1 frame per second.
        fps = stream.average_rate
        print(fps)
        interval = max(1, round(fps))
        cpt = 0  # index of the next saved frame
        for i, frame in enumerate(container.decode(stream)):
            if i % interval == 0:
                frame.to_image().save(os.path.join(frames_dir, str(cpt) + ".jpg"), quality=80)
                cpt += 1

30
24000/1001
3476500/115999
25
30000/1001
30000/1001
30000/1001
30000/1001
30000/1001
30000/1001
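
The frame rates printed above are rational numbers such as `30000/1001` (about 29.97 frames per second). The sketch below illustrates how such a rate can be turned into a sampling step of about one second, and why rounding may be preferable to truncation. The helper names `sampling_interval` and `kept_frame_indices`, and the `default` fallback of 30, are hypothetical choices made for illustration.

```python
from fractions import Fraction


def sampling_interval(average_rate, default=30):
    """Number of decoded frames between two saved frames (about 1 second)."""
    if average_rate is None:
        # Hypothetical fallback for streams that report no average rate.
        return default
    # round() on a Fraction returns the nearest int; never go below 1.
    return max(1, round(average_rate))


def kept_frame_indices(total_frames, average_rate):
    """Indices of the frames that a 1-frame-per-second sampler would keep."""
    step = sampling_interval(average_rate)
    return list(range(0, total_frames, step))


ntsc = Fraction(30000, 1001)
print(int(ntsc), sampling_interval(ntsc))  # truncation vs. rounding
print(kept_frame_indices(100, Fraction(25)))
```

With truncation, a 29.97 fps video is sampled every 29 frames, so the saved frames slowly drift ahead of the one-second marks; rounding to 30 keeps the drift smaller.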


## Video metadata

Process the video metadata provided in the `json` file and index the video data in OpenSearch.

In [None]:
from opensearchpy import OpenSearch


host = 'api.novasearch.org'
port = 443

user = 'user01' # Add your user name here.
password = 'erasmus+2025' # Add your user password here. For testing only. Don't store credentials in code. 
index_name = "video"

client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = (user, password),
    use_ssl = True,
    url_prefix = 'opensearch_v2',
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False
)

index_body = {
   "settings":{
      "index":{
         "number_of_replicas":0,
         "number_of_shards":4,
         "refresh_interval":"-1",
         "knn":"true"
      }
   },
   "mappings":{
       "properties": {
            "video_id": {"type": "keyword"},
            "num_moments": {"type": "integer"},
            "url": {"type": "text"}
        }
   }
}

# Create the index only if it does not exist yet.
if client.indices.exists(index=index_name):
    print("Index already exists. Nothing to be done.")
else:
    response = client.indices.create(index=index_name, body=index_body)
    print('\nCreating index:')
    print(response)

AuthorizationException: AuthorizationException(403, '')

In [None]:
video_annotations = [
    {'video_id': 'o1WPnnvs00I', 'num_moments': 23, 'url': 'https://www.youtube.com/watch?v=o1WPnnvs00I'},
    {'video_id': 'oGwn4NUeoy8', 'num_moments': 23, 'url': 'https://www.youtube.com/watch?v=oGwn4NUeoy8'},
    {'video_id': 'VEDRmPt_-Ms', 'num_moments': 20, 'url': 'https://www.youtube.com/watch?v=VEDRmPt_-Ms'},
    {'video_id': 'qF3EbR8y8go', 'num_moments': 19, 'url': 'https://www.youtube.com/watch?v=qF3EbR8y8go'},
    {'video_id': 'DLJqhYP-C0k', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=DLJqhYP-C0k'},
    {'video_id': 't6f_O8a4sSg', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=t6f_O8a4sSg'},
    {'video_id': '6gyD-Mte2ZM', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=6gyD-Mte2ZM'},
    {'video_id': 'jBvGvVw3R-Q', 'num_moments': 18, 'url': 'https://www.youtube.com/watch?v=jBvGvVw3R-Q'},
    {'video_id': 'PJ72Yl0B1rY', 'num_moments': 17, 'url': 'https://www.youtube.com/watch?v=PJ72Yl0B1rY'},
    {'video_id': 'QHn9KyE-zZo', 'num_moments': 17, 'url': 'https://www.youtube.com/watch?v=QHn9KyE-zZo'},
    {'video_id': '9-yueOtwiL8', 'num_moments': 17, 'url': 'https://www.youtube.com/watch?v=9-yueOtwiL8'}
]

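# Index one document per video.
for video in video_annotations:
    client.index(index=index_name, body=video)

# The index settings above set refresh_interval to -1, so refresh explicitly
# to make the newly indexed documents visible to search.
client.indices.refresh(index=index_name)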

In [None]:
response = client.search(
    index=index_name,
    body={
        "query": {
            "match_all": {}
        }
    }
)

print(response["hits"]["hits"])
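
Indexing the documents one request at a time works for 11 videos, but the same documents can also be prepared as bulk actions. The sketch below only builds the action dictionaries, so it runs without a cluster; the function name `build_bulk_actions` is hypothetical. Using `video_id` as the document `_id` is a design choice assumed here so that re-running the cell overwrites documents instead of duplicating them.

```python
def build_bulk_actions(videos, index_name):
    """Yield one bulk action per video, keyed by its video_id."""
    for video in videos:
        yield {
            "_index": index_name,
            "_id": video["video_id"],  # stable id: re-indexing overwrites
            "_source": video,
        }


sample = [
    {"video_id": "o1WPnnvs00I", "num_moments": 23,
     "url": "https://www.youtube.com/watch?v=o1WPnnvs00I"},
]
for action in build_bulk_actions(sample, "video"):
    print(action["_index"], action["_id"])

# With a connected client, the actions could be sent in one request:
#   from opensearchpy import helpers
#   helpers.bulk(client, build_bulk_actions(video_annotations, index_name))
#   client.indices.refresh(index=index_name)
```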

## Video captions

The ActivityNet Captions dataset https://cs.stanford.edu/people/ranjaykrishna/densevid/ provides textual descriptions of each video. Index the video captions in a text field of your OpenSearch index.
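
A minimal sketch of turning caption annotations into indexable documents follows. It assumes that the caption files map keys of the form `v_<video_id>` to an object with parallel `timestamps` and `sentences` lists; check the downloaded files before relying on this. The function name `caption_documents` and the output field names `start`, `end`, and `caption` are hypothetical.

```python
def caption_documents(captions, video_ids):
    """Build one document per captioned moment for the selected videos.

    Assumed input shape (verify against the actual files):
      {"v_<id>": {"timestamps": [[start, end], ...], "sentences": [...]}}
    """
    docs = []
    for vid in video_ids:
        entry = captions.get("v_" + vid)
        if entry is None:
            # No captions available for this video.
            continue
        for (start, end), sentence in zip(entry["timestamps"], entry["sentences"]):
            docs.append({
                "video_id": vid,
                "start": start,
                "end": end,
                "caption": sentence.strip(),
            })
    return docs
```

Each resulting document could then be indexed like the video metadata, provided the index mapping declares `caption` as a `text` field so that it is analyzed for full-text search.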