# Introduction to NLP-Powered Video Search

Ever wondered how the short videos in your social media apps pop up when you type a short description? With the power of AI and Natural Language Processing (NLP), we are going to build a search like that using a powerful tool named Qdrant. Sounds interesting? It is.

In an NLP-powered video search system, we just provide a short description and the relevant short videos are shown to us. It's like having a video assistant who has watched every available video: you describe the scene, and it pulls out a video showing it.

To build our AI video assistant, we will use two main components:

1.   **Qdrant**: Powers our performant vector search. It's the magic video collection that the assistant has access to.
2.   **Retriever Model**: Embeds video labels into numerical representations (vectors) that Qdrant can store and search efficiently.

We'll use the kinetics700_2020 dataset, the 2020 edition of the DeepMind Kinetics human action dataset. It covers 700 human action classes, including human-object interactions such as playing instruments and human-human interactions such as shaking hands and hugging. The training split alone lists over 500,000 **videos** with their **labels**, so a wide range of queries can be answered!

Remember, Machine Learning and NLP are fields where practice makes perfect. Don't hesitate to experiment, make mistakes and learn!

We generate embeddings for the labels using the retriever, index them in the vector database, and query it with semantic search to retrieve the top k most relevant labels. Finally, we display the videos that carry those labels.
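Before we start, here is a toy, pure-Python sketch of that pipeline. The three-dimensional vectors and video IDs are made up for illustration (the real retriever produces 384-dimensional vectors), and brute-force ranking stands in for Qdrant's index, so this shows the idea rather than the notebook's implementation.

```python
import math

# Hypothetical toy data: hand-made 3-d "embeddings" for three labels.
label_vectors = {
    "playing guitar": [0.9, 0.1, 0.0],
    "playing drums": [0.5, 0.5, 0.2],
    "walking the dog": [0.0, 0.2, 0.9],
}
# Each label maps to the (made-up) YouTube IDs of the clips it describes.
label_to_videos = {
    "playing guitar": ["vid_a"],
    "playing drums": ["vid_b", "vid_c"],
    "walking the dog": ["vid_d"],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_labels(query_vector, k=2):
    """Rank every label by cosine similarity to the query and keep the best k."""
    scored = [(cosine(query_vector, vec), label) for label, vec in label_vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Pretend this is the embedding of a query such as "someone strumming a guitar".
query = [0.8, 0.2, 0.05]
best = top_k_labels(query, k=2)
videos = [vid for label in best for vid in label_to_videos[label]]
print(best)    # ['playing guitar', 'playing drums']
print(videos)  # ['vid_a', 'vid_b', 'vid_c']
```

The search never looks at the videos themselves: it ranks labels, and the videos come along through the label-to-ID mapping.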

## Install Dependencies
Let's get started by installing the packages the notebook needs:

```python
!pip install -qU pandas==1.5.3 sentence-transformers==2.2.2 tqdm==4.65.0 qdrant-client==1.2.0 wget==3.2
```

## Import libraries

```python
import wget  # Python wget package, used to download the dataset
import tarfile
import torch
import pandas as pd
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from qdrant_client.http import models
from google.colab import drive  # only available inside Google Colab
from tqdm.auto import tqdm
from IPython.display import display, HTML
```

## Download Dataset
First, let's download the kinetics700_2020 dataset from the official [link](https://storage.googleapis.com/deepmind-media/Datasets/kinetics700_2020.tar.gz) by running the code below or by clicking the link directly.

```python
url = "https://storage.googleapis.com/deepmind-media/Datasets/kinetics700_2020.tar.gz"
filename = wget.download(url)  # download the tar.gz archive; returns the local filename
print(filename)
```



## Extract Dataset


```python
tar_file_path = "/content/kinetics700_2020.tar.gz"  # path to the downloaded tar.gz file

# Open the tar.gz archive and extract all of its files
with tarfile.open(tar_file_path, "r:gz") as tar:
    tar.extractall()
```

## Explore the Dataset
After extraction, a folder named kinetics700_2020 will be available. We will use the file train.csv inside this folder.
Note that the dataset does not contain any video files: it only contains YouTube IDs, which we can combine with the base YouTube URL to access each video. This is convenient, since we neither need nor want to store such a large amount of video data.
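As a sketch of how an ID becomes a playable link, the hypothetical helpers below build a regular YouTube watch URL and an embeddable player URL. The `start` and `end` arguments map to the YouTube embed player's parameters of the same name (in seconds); whether you pass per-clip times depends on your data, so treat that part as an assumption.

```python
from urllib.parse import urlencode

WATCH_BASE = "https://www.youtube.com/watch"
EMBED_BASE = "https://www.youtube.com/embed/"


def watch_url(youtube_id):
    """Regular YouTube page for a video ID (hypothetical helper)."""
    return f"{WATCH_BASE}?{urlencode({'v': youtube_id})}"


def embed_url(youtube_id, start=None, end=None):
    """Embeddable player URL, optionally limited to [start, end] seconds (hypothetical helper)."""
    params = {"mute": 1}  # muted, like the player used in display_video later on
    if start is not None:
        params["start"] = int(start)
    if end is not None:
        params["end"] = int(end)
    return f"{EMBED_BASE}{youtube_id}?{urlencode(params)}"


print(watch_url("---0dWlqevI"))
print(embed_url("---0dWlqevI", start=10, end=20))
```

For example, `watch_url("---0dWlqevI")` gives `https://www.youtube.com/watch?v=---0dWlqevI`, the page for the first video in the dataset.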

```python
# Load the dataset into a pandas DataFrame
columns_to_read = ["label", "youtube_id"]  # only read the label and youtube_id columns

df = pd.read_csv(
    "/content/kinetics700_2020/train.csv",
    delimiter=",",  # the CSV file is comma-separated
    usecols=columns_to_read,
)
df.head()
```

|   | label | youtube_id |
|---|---|---|
| 0 | clay pottery making | ---0dWlqevI |
| 1 | news anchoring | ---aQ-tA5_A |
| 2 | using bagging machine | ---j12rm3WI |
| 3 | javelin throw | --07WQ2iBlw |
| 4 | climbing a rope | --0NTAs-fA0 |



There are also some duplicate video IDs in the dataset:

```python
print("total records in the dataset- ", len(df))
print("Unique youtube_id in the dataset- ", len(df["youtube_id"].unique()))
```

```
total records in the dataset-  532906
Unique youtube_id in the dataset-  530510
```


Let's look at the duplicated IDs and their counts:

```python
dupes = df["youtube_id"].value_counts().sort_values(ascending=False)
print("   id", " " * 7, "count\n", dupes.head())
```

```
   id         count
 6aMNaKa40RU    3
-NXoI2nNu6I    3
6o-0khkmWBI    3
BE00SRKILNk    3
DcHdC00v1UU    3
Name: youtube_id, dtype: int64
```


Now let's look at one of the duplicated videos and its labels.

```python
def display_video(youtube_id):
    """Embed a muted, autoplaying YouTube player for the given video ID."""
    # loop=1 only repeats a single video when playlist is set to that same ID
    embed_code = (
        f'<iframe width="180" height="180" '
        f'src="https://www.youtube.com/embed/{youtube_id}?autoplay=1&controls=0'
        f'&modestbranding=1&rel=0&mute=1&end=20&loop=1&playlist={youtube_id}" '
        f'frameborder="0"></iframe>'
    )
    display(HTML(embed_code))


dupe_youtube_id = "BE00SRKILNk"  # example of a duplicated youtube_id
dupe_df = df[df["youtube_id"] == dupe_youtube_id]

# Show the video once, then every label attached to it
display_video(dupe_youtube_id)
for label in dupe_df["label"]:
    print(label)
```



```
repairing puncture
fixing bicycle
assembling bicycle
```


As we can see, none of the labels is wrong, so we don't need to remove the duplicates. Other duplicated IDs also have different but accurate labels.

Let's see some more examples of videos with their labels.

```python
for _, video in df[:5].iterrows():
    display_video(video["youtube_id"])
    print(video["label"])
```

```
clay pottery making
news anchoring
using bagging machine
javelin throw
climbing a rope
```


To build our video search tool, we need just **two** components:
* a **retriever** to embed video labels
* a **vector database** to store video label embeddings and retrieve relevant videos

## Initialize Qdrant client
We will use Qdrant, an open-source vector database (also offered as a managed cloud service) built for fast similarity search at scale. Here we run it in local mode, persisting to Google Drive, so no separate server is needed. It will store the vector representations of our video labels, which we can then retrieve using a query vector.


```python
# Mount Google Drive (Colab only) so the collection persists between sessions.
# Create a folder named QdrantDB in your Drive first.
drive.mount("/content/drive")

# Local mode: Qdrant runs inside this Python process and persists to the given path.
client = QdrantClient(path="/content/drive/MyDrive/QdrantDB")

label_collection = "video-search"  # name of the collection

collections = client.get_collections()
print(collections)

# Only create the collection if it doesn't exist yet.
# get_collections() returns a response object, so compare against the collection names.
existing = {c.name for c in collections.collections}
if label_collection not in existing:
    client.create_collection(
        collection_name=label_collection,
        vectors_config=models.VectorParams(
            size=384,  # dimensionality of the vectors produced by the retriever
            distance=models.Distance.COSINE,  # similarity metric used to compare vectors
        ),
    )
collections = client.get_collections()
print(collections)
```

```
collections=[]
collections=[CollectionDescription(name='video-search')]
```


## Initialize Retriever
The retriever will do two things:

1.	Generate embeddings for all the video labels (context vectors)
2.	Generate embeddings for the query (query vector)

The embeddings generated by the retriever have a useful property: texts with similar meanings map to vectors that lie close together in the vector space. We will use this property to find the label vectors closest to our query vector, measured by cosine similarity.
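For reference, this is the standard cosine similarity between a query vector $\mathbf{q}$ and a label vector $\mathbf{x}$, with $d = 384$ for the model used here. Because that model ends with a `Normalize()` module, every embedding has unit length, so cosine similarity reduces to a plain dot product:

$$
\cos(\mathbf{q}, \mathbf{x}) = \frac{\mathbf{q} \cdot \mathbf{x}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{x} \rVert} = \frac{\sum_{i=1}^{d} q_i x_i}{\sqrt{\sum_{i=1}^{d} q_i^2} \, \sqrt{\sum_{i=1}^{d} x_i^2}}, \qquad \lVert \mathbf{q} \rVert = \lVert \mathbf{x} \rVert = 1 \;\Rightarrow\; \cos(\mathbf{q}, \mathbf{x}) = \mathbf{q} \cdot \mathbf{x}
$$

Scores range from $-1$ to $1$, and a higher score means a closer match.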

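We will use `sentence-transformers/all-MiniLM-L6-v2` as our retriever. It performs well on general-purpose semantic similarity while staying small and fast. It is based on Microsoft's MiniLM, a compact BERT-style model (hence the `BertModel` in the printout below), and it outputs 384-dimensional embeddings, which is why the collection was created with `size=384`.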

```python
# Use the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize the retriever with a SentenceTransformer model
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
retriever.to(device)
```


```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```

## Generate Embeddings -> Store in Qdrant
Now we generate embeddings for our video labels. We do this in batches, which is much faster than encoding labels one at a time, and we upsert each batch to Qdrant with a single API call (also much faster).

For each point, Qdrant needs an ID (an unsigned integer or a UUID, distinct from youtube_id), a vector (the label embedding we just generated), and a payload, which is Qdrant's name for the metadata stored alongside the vector. Here the payload is a dictionary holding the row's label and youtube_id.
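To make the batching arithmetic concrete, here is a small pure-Python sketch of the `(start, end)` ranges the loop below walks through (the helper name `batch_bounds` is hypothetical). Each row's position doubles as its point ID, so the IDs are unique integers from 0 to `len(df) - 1`.

```python
import math


def batch_bounds(n_rows, batch_size):
    """Yield (start, end) slices the same way the upsert loop does."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


n_rows, batch_size = 532_906, 512  # rows in train.csv and the batch size used below
bounds = list(batch_bounds(n_rows, batch_size))

print(len(bounds), math.ceil(n_rows / batch_size))  # 1041 1041
print(bounds[-1])  # (532480, 532906): the last batch holds only 426 rows
```

So the loop makes 1,041 encode-and-upsert round trips; doubling `batch_size` roughly halves that count at the cost of more memory per batch.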

```python
%%time

batch_size = 512  # adjust to your RAM and compute: a larger batch uses more memory

for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i + batch_size, len(df))  # end of the batch
    batch = df.iloc[i:i_end]  # extract the batch
    emb = retriever.encode(batch["label"].tolist()).tolist()  # embeddings for the batch
    meta = batch.to_dict(orient="records")  # payloads: label and youtube_id
    ids = list(range(i, i_end))  # point IDs

    # upsert the whole batch to Qdrant in one call
    client.upsert(
        collection_name=label_collection,
        points=models.Batch(ids=ids, vectors=emb, payloads=meta),
    )

collection_vector_count = client.get_collection(
    collection_name=label_collection
).vectors_count
print(f"Vector count in collection: {collection_vector_count}")
assert collection_vector_count == len(df)
```


```
Vector count in collection: 532906
CPU times: user 24min 19s, sys: 8.64 s, total: 24min 28s
Wall time: 24min 36s
```


## Querying: searching for videos by description
We can now query the collection with a description of the video we want. The function `search_display_video` embeds the search query, finds the 3 label embeddings closest to it by cosine similarity, and displays the corresponding videos.



```python
def search_display_video(query, k=3):
    """Embed the query, retrieve the k closest label vectors, and show their videos."""
    encoded_query = retriever.encode(query).tolist()  # embedding for the query

    # Qdrant ranks the stored label vectors by cosine similarity to the query
    query_result = client.search(
        collection_name=label_collection,
        query_vector=encoded_query,
        limit=k,
    )
    for hit in query_result:
        display_video(hit.payload["youtube_id"])
```
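Because some youtube_ids appear under several labels (as we saw with BE00SRKILNk), the top results can contain the same video more than once. The hypothetical helper below is a minimal sketch that keeps only the highest-scoring hit per video. It assumes the hits are ordered by descending score and expose `.score` and `.payload` like qdrant-client's `ScoredPoint`; the scores in the example are made up.

```python
from types import SimpleNamespace


def unique_video_hits(hits, k=3):
    """Return up to k (youtube_id, label, score) tuples, one per distinct video.

    Assumes `hits` is sorted by descending score and that each hit exposes
    `.score` and `.payload`, as qdrant-client's ScoredPoint does.
    """
    seen = set()
    results = []
    for hit in hits:
        youtube_id = hit.payload["youtube_id"]
        if youtube_id in seen:  # same video already returned under another label
            continue
        seen.add(youtube_id)
        results.append((youtube_id, hit.payload["label"], hit.score))
        if len(results) == k:
            break
    return results


# Made-up hits standing in for client.search(...) results: the first two share a video.
hits = [
    SimpleNamespace(score=0.91, payload={"youtube_id": "BE00SRKILNk", "label": "fixing bicycle"}),
    SimpleNamespace(score=0.88, payload={"youtube_id": "BE00SRKILNk", "label": "assembling bicycle"}),
    SimpleNamespace(score=0.80, payload={"youtube_id": "example_id_2", "label": "repairing puncture"}),
]
print(unique_video_hits(hits))
# [('BE00SRKILNk', 'fixing bicycle', 0.91), ('example_id_2', 'repairing puncture', 0.8)]
```

Since deduplication can drop hits, you could ask `client.search` for a few extra results (for example `limit=2 * k`) and pass them through this helper before calling `display_video`.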

All the work is done! Now we just pass a description of the video we want to `search_display_video`, and it displays the results.

```python
search_display_video("doing archery")
```

```python
search_display_video("playing football")
```

```python
search_display_video("man using computer")
```

```python
search_display_video("a baby laughing")
```

```python
search_display_video("a woman dancing")
```

```python
search_display_video("walking a dog")
```

## Delete the collection

When you're done with the collection, delete it to free up resources.

```python
client.delete_collection(collection_name=label_collection)
```

---