# Project Week 1: ActivityNet Video Data Preparation and Indexing

In this example we will use the ActivityNet dataset https://github.com/activitynet/ActivityNet. 

 - Select the 10 videos with the most moments.
 - Download these videos onto your computer.
 - Extract the frames for every video.
 - Read the textual descriptions of each video.
 - Index the video data in OpenSearch.

This week, you will index the video data and make it searchable with OpenSearch. Refer to the OpenSearch tutorial laboratory for background.

## Imports

In [2]:
import json
import pprint as pp
from pprint import pprint

#Open Search
import requests

from opensearchpy import OpenSearch
from opensearchpy import helpers

#Embeddings neighborhood
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import pickle


## Select videos and Captions
Download the `activity_net.v1-3.min.json` file, which contains the list of videos. The file is in the ActivityNet GitHub repository.
Parse this file and select the 10 videos with the most moments (annotated segments).
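As a minimal sketch of the selection step, the snippet below ranks videos by the number of annotated moments. The `toy_database` dictionary and the `top_n_by_moments` helper are hypothetical stand-ins introduced for illustration; they only mimic the `annotations` layout of the real file.

```python
import heapq

# Toy stand-in for the "database" object of activity_net.v1-3.min.json:
# each video ID maps to a record holding a list of annotated moments.
toy_database = {
    "vid_a": {"annotations": [{"label": "Surfing", "segment": [0.0, 5.0]}]},
    "vid_b": {"annotations": [
        {"label": "Bowling", "segment": [1.0, 2.0]},
        {"label": "Bowling", "segment": [3.0, 4.0]},
        {"label": "Bowling", "segment": [6.0, 9.0]},
    ]},
    "vid_c": {"annotations": [
        {"label": "Longboarding", "segment": [0.0, 1.0]},
        {"label": "Longboarding", "segment": [2.0, 3.0]},
    ]},
}


def top_n_by_moments(database, n):
    """Return the n (video_id, record) pairs with the most annotations."""
    return heapq.nlargest(
        n, database.items(), key=lambda item: len(item[1]["annotations"])
    )


top_two = top_n_by_moments(toy_database, 2)
print([video_id for video_id, _ in top_two])  # ['vid_b', 'vid_c']
```

`heapq.nlargest` avoids sorting the whole collection when only a few entries are needed; for a file of this size a full `sorted(...)` call is equally reasonable.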

## Compute the captions

In [3]:
def load_captions_data(file_path):
    with open(file_path, 'r') as f:
        data = json.load(f)
    
    processed = {}
    for video_id, captions in data.items():
        processed[video_id] = {
            "segments": captions['segments'] if 'segments' in captions else captions,
        }
    return processed

# Load the data
val_data1 = load_captions_data('captions/val_1.json')
val_data2 = load_captions_data('captions/val_2.json')

# Combine dictionaries (preserving video_id as keys)
all_captions_data = {**val_data1, **val_data2}

pprint(f"Number of captions: {len(all_captions_data)}")
pprint(f"Example Captions: {all_captions_data}")

'Number of captions: 4917'
("Example Captions: {'v_uqiMw7tQ1Cc': {'segments': {'duration': 55.15, "
 "'timestamps': [[0, 4.14], [4.14, 33.36], [33.36, 55.15]], 'sentences': ['Two "
 'men both dressed in athletic gear are standing and talking in an indoor '
 "weight lifting gym filled with other equipment.', ' One man is holding onto "
 'a rope attached to a machine, and the other man instructs him to bend down '
 'on his left knee while still holding onto the rope and he showing the man '
 'how to have proper form.\', " The man then instructs the man holding the '
 'rope to pull the row down a few times and he\'s talking the whole time."]}}, '
 "'v_bXdq2zI1Ms0': {'segments': {'duration': 73.1, 'timestamps': [[6.94, "
 "69.08], [37.28, 43.49], [43.13, 55.55]], 'sentences': ['Three men are "
 "standing on a mat.', ' The man in front begins to do karate on the mat.', ' "
 "He gets down on the ground and flips around.']}}, 'v_FsS_NCZEfaI': "
 "{'segments': {'duration': 212.74, 'timestamps'

## Compute the videos

In [4]:
with open('activity_net.v1-3.min.json', 'r') as json_data:
    data = json.load(json_data)

database = {}

for video_id in data['database']:
    database["v_" + video_id] = data['database'][video_id]

# Build a list of all entries (full records), sorted by number of annotations
sorted_database = sorted(
    database.items(),
    key=lambda x: len(x[1]['annotations']),
    reverse=True
)

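# Top 27 videos by annotation count (full records); not all of them have captions,
# so the caption match further down narrows these to the final set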
top_videos = dict(sorted_database[:27])

pprint(top_videos)

{'v_-ap649M020k': {'annotations': [{'label': 'Longboarding',
                                    'segment': [9.965381472401754,
                                                10.961919619641929]},
                                   {'label': 'Longboarding',
                                    'segment': [15.280251591016023,
                                                32.88575885892579]},
                                   {'label': 'Longboarding',
                                    'segment': [37.86844959512666,
                                                44.84421662580789]},
                                   {'label': 'Longboarding',
                                    'segment': [54.80959809820965,
                                                60.456647599237314]},
                                   {'label': 'Longboarding',
                                    'segment': [72.7472847485328,
                                                74.07600227818637]},
             

In [5]:
# Check how many video IDs in the full database also appear in the captions
matching_ids = set(database.keys()) & set(all_captions_data.keys())
print(f"Number of matching IDs: {len(matching_ids)}")
print(f"IDs in top_videos: {list(top_videos.keys())[:5]}...")
print(f"IDs in all_captions_data: {list(all_captions_data.keys())[:5]}...")

Number of matching IDs: 4917
IDs in top_videos: ['v_o1WPnnvs00I', 'v_oGwn4NUeoy8', 'v_VEDRmPt_-Ms', 'v_qF3EbR8y8go', 'v_DLJqhYP-C0k']...
IDs in all_captions_data: ['v_uqiMw7tQ1Cc', 'v_bXdq2zI1Ms0', 'v_FsS_NCZEfaI', 'v_K6Tm5xHkJ5c', 'v_4Lu8ECLHvK4']...


In [6]:
final_dataset_captions = {}

# Keep only the top videos that also have captions in val_1/val_2
for video_id in top_videos:
    if video_id in all_captions_data:
        final_dataset_captions[video_id] = all_captions_data[video_id]

# Explicitly exclude this video from the final set
final_dataset_captions.pop("v_PJ72Yl0B1rY", None)

pprint(final_dataset_captions)
print(len(final_dataset_captions))

{'v_2ji02dSx1nM': {'segments': {'duration': 162.69,
                                'sentences': ['A surfer is riding on a surf '
                                              'board in the ocean.',
                                              ' He goes through the waves as '
                                              'they crash around him.',
                                              ' He continues riding the waves '
                                              'and talking to the camera in an '
                                              'interview.'],
                                'timestamps': [[0, 9.76],
                                               [18.71, 68.33],
                                               [82.97, 162.69]]}},
 'v_6gyD-Mte2ZM': {'segments': {'duration': 188.25,
                                'sentences': ["There's a man in a brown shirt "
                                              'bowling in a large alley in a '
                             

In [7]:
#OpenSearch

host = 'localhost'
port = 9200

user = 'admin' # Add your user name here.
password = 'grupo09FTC!' # Add your user password here. For testing only. Don't store credentials in code. 
index_name = user

In [8]:
#Just to test if OpenSearch is up and running

# Create the client with SSL/TLS enabled, but certificate verification disabled.
client = OpenSearch(
    hosts=[{'host': host, 'port': port}],
    http_auth=(user, password),
    use_ssl=True,              
    verify_certs=False,        
    ssl_show_warn=False
)

if client.indices.exists(index=index_name):

    resp = client.indices.open(index = index_name)
    print(resp)

    print('\n----------------------------------------------------------------------------------- INDEX SETTINGS')
    settings = client.indices.get_settings(index = index_name)
    pp.pprint(settings)

    print('\n----------------------------------------------------------------------------------- INDEX MAPPINGS')
    mappings = client.indices.get_mapping(index = index_name)
    pp.pprint(mappings)

    print('\n----------------------------------------------------------------------------------- INDEX #DOCs')
    print(client.count(index = index_name))
else:
    print("Index does not exist.")

{'acknowledged': True, 'shards_acknowledged': True}

----------------------------------------------------------------------------------- INDEX SETTINGS
{'admin': {'settings': {'index': {'creation_date': '1744195496478',
                                  'knn': 'true',
                                  'number_of_replicas': '0',
                                  'number_of_shards': '4',
                                  'provided_name': 'admin',
                                  'refresh_interval': '-1',
                                  'replication': {'type': 'DOCUMENT'},
                                  'uuid': 'g79qTcZVTS-_xHY1SCSWuQ',
                                  'version': {'created': '136407927'}}}}}

----------------------------------------------------------------------------------- INDEX MAPPINGS
{'admin': {'mappings': {'dynamic': 'strict',
                        'properties': {'description': {'type': 'text'},
                                       'sentence_embedding': 

In [9]:
client.indices.delete(index=index_name, ignore=[400, 404])

{'acknowledged': True}

## Video captions

The ActivityNet Captions dataset (https://cs.stanford.edu/people/ranjaykrishna/densevid/) provides a textual description of each video. Index the video captions in a text field of your OpenSearch index.
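The notebook imports `helpers` from `opensearchpy` but indexes documents one at a time. As a hedged alternative, the sketch below builds the action dictionaries that `helpers.bulk` accepts. The `build_bulk_actions` function, the `toy_captions` data and the index name `my_index` are hypothetical and only illustrate the shape of the payload; the call to the cluster is left as a comment so the snippet runs without a server.

```python
def build_bulk_actions(captions, index_name):
    """Yield one bulk 'index' action per video (hypothetical helper)."""
    for video_id, data in captions.items():
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": video_id,
            "_source": {
                "title": video_id,
                "description": data["segments"]["sentences"],
            },
        }


# Toy captions in the same layout as the caption files (made-up content).
toy_captions = {
    "v_example": {
        "segments": {
            "duration": 10.0,
            "timestamps": [[0, 4.0], [4.0, 10.0]],
            "sentences": ["A man picks up a ball.", " He rolls it down the lane."],
        }
    }
}

actions = list(build_bulk_actions(toy_captions, "my_index"))
print(actions[0]["_id"], len(actions[0]["_source"]["description"]))

# With a live cluster and an existing client, the actions could be sent with:
#   from opensearchpy import helpers
#   helpers.bulk(client, actions)
```

For ten documents the difference is negligible; a bulk request mainly pays off once the full caption set is indexed.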

In [10]:
#video_id == title
#description == sentences

index_body = {
    "settings": {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 4,
            "refresh_interval": "-1",
            "knn": "true"
        },
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": {
                "type": "keyword"
            },
            "description": {
                "type": "text"
            },
        }
    }
}

if client.indices.exists(index=index_name):
    print("Index already exists.")
else:        
    response = client.indices.create(index=index_name, body=index_body)
    print('\nCreating index...')
    print(response)


Creating index...
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'admin'}


In [11]:
for video_id, data in final_dataset_captions.items():
    print(f"Title: {video_id}")
    print(f"Description: {data['segments']['sentences']}")

for video_id, data in final_dataset_captions.items():
    filtered_caption = {
        "title": video_id,
        "description": data['segments']['sentences']
    }
    
    resp = client.index(index=index_name, id=video_id, body=filtered_caption)
    print(resp['result'])

Title: v_t6f_O8a4sSg
Description: ['An introduction comes onto the screen for a video about skate boarding tricks.', ' Several tricks are shown while someone narrates the tricks.', ' A man is shown on the screen giving details about the tricks and offering pointers and tips.', ' The video ends with the closing captions shown on the screen.']
Title: v_6gyD-Mte2ZM
Description: ["There's a man in a brown shirt bowling in a large alley in a competition with spectators watching him.", ' He begins by picking up his blue bowling ball and then, holds it firmly to shoot it at the pins.', ' He gets a strike after the ball hits the pin.', ' He continues throwing the ball several times and every time he gets a strike.', ' Then when he hits the ball again, he knocks down four pins in the first attempt.', ' Then after he continues bowling, the pins knock down the remaining pins down.', ' He bowls again and knocks down four pins.', ' On the second attempt, he knocks down more pins and finally gets a 

## Search Functionality

## Text-based Search 

In [12]:
client.indices.refresh(index=index_name)

{'_shards': {'total': 4, 'successful': 4, 'failed': 0}}

In [13]:
client.count(index = index_name)

{'count': 10,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0}}

In [14]:
qtxt = "finally gets a spare."

text_query = {
  "size": 5,
  "_source": ['title', 'description'],
  "query": {
    "multi_match": {
      "query": qtxt,
      "fields": ['description'],
    }
  }
}

response = client.search(
    body=text_query,
    index=index_name
)

print("\nSearch results:")
pp.pprint(response)


Search results:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
 'hits': {'hits': [{'_id': 'v_6gyD-Mte2ZM',
                    '_index': 'admin',
                    '_score': 2.848579,
                    '_source': {'description': ["There's a man in a brown "
                                                'shirt bowling in a large '
                                                'alley in a competition with '
                                                'spectators watching him.',
                                                ' He begins by picking up his '
                                                'blue bowling ball and then, '
                                                'holds it firmly to shoot it '
                                                'at the pins.',
                                                ' He gets a strike after the '
                                                'ball hits the pin.',
                                

In [15]:
# Exact-match (term) lookup on the "title" keyword field

term_query = {
    "size": 5,
    "_source": ["title", "description"],
    "query": {
        "term": {
            "title": "v_od9EdcDcByA"
        }
    }
}

response = client.search(
    body=term_query,
    index=index_name
)

print("\nSearch results:")
pp.pprint(response)


Search results:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
 'hits': {'hits': [{'_id': 'v_od9EdcDcByA',
                    '_index': 'admin',
                    '_score': 0.9808291,
                    '_source': {'description': ['A shot of balls are shown as '
                                                'well as clips of people '
                                                'surfing and walking around.',
                                                ' More people are seen playing '
                                                'paintball as others speak to '
                                                'one another as well as surf.',
                                                ' The video continues on with '
                                                'several shots of people '
                                                'playing paintball that '
                                                'transitions into people '
          

In [16]:
bool_query = {
    "size": 5,
    "_source": ["title", "description"],
    "query": {
        "bool": {
            "must": [
                {"match": {"description": "skate"}},  # Must contain "skate"
                {"match": {"description": "tricks"}}  # Must contain "tricks"
            ],
        }
    }
}

response = client.search(
    body=bool_query,
    index=index_name
)

print("\nSearch results:")
pp.pprint(response)


Search results:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
 'hits': {'hits': [{'_id': 'v_t6f_O8a4sSg',
                    '_index': 'admin',
                    '_score': 2.43581,
                    '_source': {'description': ['An introduction comes onto '
                                                'the screen for a video about '
                                                'skate boarding tricks.',
                                                ' Several tricks are shown '
                                                'while someone narrates the '
                                                'tricks.',
                                                ' A man is shown on the screen '
                                                'giving details about the '
                                                'tricks and offering pointers '
                                                'and tips.',
                                               

## Embeddings Neighborhood

In [17]:
client.indices.delete(index=index_name, ignore=[400, 404])

{'acknowledged': True}

In [18]:
index_body = {
    "settings": {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 4,
            "refresh_interval": "-1",
            "knn": True
        }
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": { "type": "keyword" },
            "description": { "type": "text" },
            "sentence_embedding": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "space_type": "innerproduct",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 48
                    }
                }
            }
        }
    }
}


if client.indices.exists(index=index_name):
    print("Index already existed. You may force the new mappings.")
else:        
    response = client.indices.create(index=index_name, body=index_body)
    print('\nCreating index:')
    print(response)


Creating index:
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'admin'}


In [19]:
client.indices.refresh(index=index_name)

{'_shards': {'total': 4, 'successful': 4, 'failed': 0}}

In [20]:
client.count(index = index_name)

{'count': 0,
 '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0}}

In [21]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-v2")

In [22]:
#Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings

In [23]:
all_embeddings = {}

for video_id, data in final_dataset_captions.items():
    # Join all sentences to one paragraph-like string
    full_description = " ".join(data['segments']['sentences'])
    
    # Encode the full description
    embedding = encode(full_description)
    
    all_embeddings[video_id] = {
        "title": video_id,
        "description": data['segments']['sentences'],
        "sentence_embedding": embedding[0].numpy()
    }
        
    resp = client.index(index=index_name, id=video_id, body=all_embeddings[video_id])
    print(resp['result'])

    stored = client.get(index=index_name, id=video_id)
    print("\nIndexed Document:")
    pprint(stored["_source"])
    print("-" * 50)

with open('all_embeddings.pkl', 'wb') as f:
    pickle.dump(all_embeddings, f)

created

Indexed Document:
{'description': ['An introduction comes onto the screen for a video about '
                 'skate boarding tricks.',
                 ' Several tricks are shown while someone narrates the tricks.',
                 ' A man is shown on the screen giving details about the '
                 'tricks and offering pointers and tips.',
                 ' The video ends with the closing captions shown on the '
                 'screen.'],
 'sentence_embedding': [-0.014382576569914818,
                        -0.022280162200331688,
                        0.02169828489422798,
                        0.020328138023614883,
                        0.043225258588790894,
                        -0.0013238673564046621,
                        0.03738510608673096,
                        0.02335180528461933,
                        -0.04852043092250824,
                        0.05206624045968056,
                        -0.01778588257730007,
                        -0.06

To reload the embeddings from the pickle file instead of recomputing them:

In [24]:

with open('all_embeddings.pkl', 'rb') as f:
    all_embeddings = pickle.load(f)
print(all_embeddings)


{'v_t6f_O8a4sSg': {'title': 'v_t6f_O8a4sSg', 'description': ['An introduction comes onto the screen for a video about skate boarding tricks.', ' Several tricks are shown while someone narrates the tricks.', ' A man is shown on the screen giving details about the tricks and offering pointers and tips.', ' The video ends with the closing captions shown on the screen.'], 'sentence_embedding': array([-1.43825766e-02, -2.22801622e-02,  2.16982849e-02,  2.03281380e-02,
        4.32252586e-02, -1.32386736e-03,  3.73851061e-02,  2.33518053e-02,
       -4.85204309e-02,  5.20662405e-02, -1.77858826e-02, -6.54118881e-02,
       -1.01972027e-02,  5.02215698e-03,  5.37653677e-02, -5.93207330e-02,
        9.30975750e-03, -6.42848462e-02,  1.76999085e-02, -2.22318601e-02,
       -2.50885095e-02,  3.66076939e-02,  1.70764383e-02, -3.97567712e-02,
       -2.65968870e-02,  8.71483702e-03, -1.09048728e-02, -2.95894854e-02,
       -3.33008403e-03,  1.38159450e-02, -8.39210302e-03, -9.30577796e-03,
       

### Search query for embedding index

In [25]:
client.indices.refresh(index=index_name)

{'_shards': {'total': 4, 'successful': 4, 'failed': 0}}

In [26]:
# Compute the query embedding
query = "finally gets a spare."
query_emb = encode(query)

query_denc = {
  'size': 5,
  '_source': ['title', 'description'],
   "query": {
        "knn": {
          "sentence_embedding": {
            "vector": query_emb[0].numpy(),
            "k": 2
          }
        }
      }
}

response = client.search(
    body = query_denc,
    index = index_name
)

print('\nSearch results:')
pp.pprint(response)


Search results:
{'_shards': {'failed': 0, 'skipped': 0, 'successful': 4, 'total': 4},
 'hits': {'hits': [{'_id': 'v_6gyD-Mte2ZM',
                    '_index': 'admin',
                    '_score': 1.1583455,
                    '_source': {'description': ["There's a man in a brown "
                                                'shirt bowling in a large '
                                                'alley in a competition with '
                                                'spectators watching him.',
                                                ' He begins by picking up his '
                                                'blue bowling ball and then, '
                                                'holds it firmly to shoot it '
                                                'at the pins.',
                                                ' He gets a strike after the '
                                                'ball hits the pin.',
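The raw search responses above are long. A small helper can reduce each hit to its ID, score and first caption sentence. This is a sketch only: `summarize_hits` and `toy_response` are hypothetical names, and the toy response merely copies the `hits` layout visible in the printed output.

```python
def summarize_hits(response):
    """Return (id, score, first caption sentence) for each hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        sentences = hit["_source"].get("description", [])
        first_sentence = sentences[0].strip() if sentences else ""
        rows.append((hit["_id"], hit["_score"], first_sentence))
    return rows


# Toy response with the same nesting as an OpenSearch search result (made-up values).
toy_response = {
    "hits": {
        "hits": [
            {
                "_id": "v_example",
                "_index": "my_index",
                "_score": 1.5,
                "_source": {
                    "title": "v_example",
                    "description": ["A man bowls.", " He gets a strike."],
                },
            }
        ]
    }
}

for video_id, score, sentence in summarize_hits(toy_response):
    print(f"{video_id}  score={score}  {sentence}")
```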
                               

## Discussion: how the embedding space organizes data and enables specific searches

### 1. Data Organization in Embedding Space
In **semantic search**, each document is transformed into a vector and positioned within a **high-dimensional space**. Documents with similar meanings are placed **close together**, regardless of the actual words used.

In **traditional search**, documents are stored based on **text terms and frequencies**. The index organizes data around **keyword occurrence**, not meaning.

### 2. Proximity-Based Search
In the **embedding space**, search is performed by finding documents that are **nearest neighbors** to the query vector. This means results are selected based on **semantic similarity**, not shared vocabulary.

In **traditional search**, relevance is determined by **keyword overlap and frequency**. The system doesn’t consider whether two different words or phrases mean the same thing.
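To make the nearest-neighbor idea concrete, here is a minimal brute-force sketch in plain Python. It assumes non-zero vectors, L2-normalizes them (as `encode` does) and ranks by inner product, which for unit vectors equals cosine similarity. The three-dimensional `toy_vectors` and the helper names are invented for illustration; the real index uses 768 dimensions and an approximate HNSW search rather than this exhaustive scan.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbors(query, documents, k):
    """Exhaustively score every document and return the k best (id, score) pairs."""
    q = l2_normalize(query)
    scored = [
        (doc_id, inner_product(q, l2_normalize(vec)))
        for doc_id, vec in documents.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings" for three videos.
toy_vectors = {
    "bowling": [0.9, 0.1, 0.0],
    "surfing": [0.0, 0.2, 0.9],
    "skateboarding": [0.1, 0.9, 0.3],
}
toy_query = [0.8, 0.2, 0.1]

for doc_id, score in nearest_neighbors(toy_query, toy_vectors, k=2):
    print(doc_id, round(score, 3))
```

The query vector points mostly along the first axis, so `bowling` ranks first even though no words are compared at any point.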

### 3. Flexibility in Querying
**Semantic search** lets a query return documents whose **meaning aligns** with it, even if no keywords match directly.

**Traditional search** requires the query terms, as produced by the field's analyzer, to appear in the document for it to be considered relevant, which limits its ability to capture **context or nuance**.
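The limitation of term matching can be shown with a deliberately simplified sketch. It counts shared lowercase tokens between a query and a document, which is far cruder than the BM25 scoring and analyzers OpenSearch actually applies. The sentence, the paraphrase and the helper names are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query, document):
    """Count the distinct tokens shared by query and document."""
    return len(tokenize(query) & tokenize(document))


# Made-up caption sentence and two queries with the same intent.
toy_document = "He bowls again and finally gets a spare."
literal_query = "finally gets a spare"
paraphrased_query = "the bowler eventually clears every leftover pin"

print(keyword_overlap(literal_query, toy_document))      # 4
print(keyword_overlap(paraphrased_query, toy_document))  # 0
```

The paraphrase shares no tokens with the document, so a purely lexical match would miss it, whereas an embedding-based search could still place the two close together.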

### 4. Dimensional Context
The **embedding space** captures **contextual relationships** across hundreds of dimensions. This allows it to distinguish between similar words used in **different contexts** (e.g., `"apple"` the fruit vs. `"Apple"` the company).

**Traditional indexing** does not capture this level of nuance: it treats words more **literally and statically**.
