# Video search and retrieval

In this project we look at how to retrieve videos that match a given user query. We return not just the video, but also the timestamp within the video that addresses the query.

The system we will build works purely on text, once we have transcriptions of the videos we want to search. We will use embedding-based search. Its main advantage over traditional keyword search is that it works at the semantic level. As a result, we can get relevant search results even if the words entered in the query do not appear anywhere in the search database (the video transcripts).
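
To make the idea of semantic-level matching concrete, here is a minimal sketch of ranking by cosine similarity, the same distance measure configured for the vector db later in this notebook. The 3-dimensional vectors and the names `toy_index` and `toy_query` are made up for illustration; real sentence embeddings have hundreds of dimensions and come from the encoder, not from hand-written numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (illustration only, not real model output)
toy_index = {
    "chemical engineers make medicine": [0.9, 0.1, 0.0],
    "data scientists train models": [0.1, 0.9, 0.1],
    "once upon a time": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

# Rank the indexed texts from most to least similar to the query vector
ranked = sorted(
    toy_index,
    key=lambda text: cosine_similarity(toy_query, toy_index[text]),
    reverse=True,
)
print(ranked[0])
```

The ranking depends only on the direction of the vectors, not on shared words, which is why a query can match a transcript that uses different wording.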

We demonstrate the video search application using a sample of 4 videos downloaded from YouTube and placed in the data folder. These 4 videos belong to different categories: story, data science, programming and chemical engineering. __Whisper__ is used to get transcripts of these videos along with timestamps. We then use __Sentence Transformer__ to get vectors for the small, time-stamped chunks of transcript. The vectors are stored in a vector database for fast search and retrieval of similar vectors. We use __Qdrant__ for vector db management. When the user gives a query, it is converted to a vector using the same Sentence Transformer and the vector db is queried. We get a ranked list of transcript chunks, which can be processed to present the final search results.


In [1]:
import warnings
warnings.filterwarnings('ignore')

import tqdm
import json
import os
from os import listdir
from os.path import isfile, join

import torch
from sentence_transformers import SentenceTransformer

import whisper
from qdrant_client import models, QdrantClient

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

### Fetching video files to perform search on

We fetch the list of mp4 files in the data folder so that we can index them and make them searchable. Any video that should be searchable in this project can be placed in the data folder.

The output shows that the directory contains 4 mp4 files, and lists their paths.

In [2]:
data_directory = './data'
mp4_files = [join(data_directory, f) for f in listdir(data_directory) 
             if isfile(join(data_directory, f)) and f.endswith('.mp4')]

print(mp4_files)

['./data/kids story.mp4', './data/data science.mp4', './data/chem engg.mp4', './data/programing.mp4']


### Transcribing mp4 to text

We get the transcription of the videos using _whisper_. Note that the _tiny_ model is used here; a larger, more accurate model can be used instead. The _device_ is set to _cpu_, but a GPU can be used for faster transcription, which is especially helpful for longer videos.

We save the transcriptions in a Python list that contains one object per video in our data. Each object has an _id_ to identify it, _text_ containing the complete transcription of the video, and _segments_ containing the utterances and their associated information. We index these segments rather than the full text, so that we can retrieve the exact location in the video where the answer is likely to be, and not just the video itself.

In [3]:
transcriptions = []

speech2text_model = whisper.load_model("tiny", device='cpu')

for mp4_file in tqdm.tqdm(mp4_files):
    result = speech2text_model.transcribe(mp4_file)
    transcriptions.append({'id': mp4_file, 'text': result['text'], 'segments' : result['segments']})

100%|█████████████████████████████████████████████| 4/4 [00:42<00:00, 10.56s/it]


In [4]:
print(len(transcriptions))

4


In [5]:
print(transcriptions[2]['id'])

./data/chem engg.mp4


In [6]:
print(transcriptions[2]['text'])

 What in the world do chemical engineers do? Make life saving medicines, accessible to everyone. Develop and deliver new energy sources safely and responsibly. Eliminate plastic waste from our oceans. When it comes to getting things done, chemical engineers have a pretty full to do list. We do the math, science and engineering to make medicine accessible. Water, drinkable, air, breathable. The deep thinking to make the environment sustainable, systems affordable, safety, reliable. We make discoveries scalable and the inconceivable, feasible. So what in the world do chemical engineers do? We take things that have never been done and get them done. For good. AICHE doing a world of good.


In [7]:
print(transcriptions[2]['segments'][0])

{'id': 0, 'seek': 0, 'start': 0.0, 'end': 4.0, 'text': ' What in the world do chemical engineers do?', 'tokens': [50364, 708, 294, 264, 1002, 360, 7313, 11955, 360, 30, 50564], 'temperature': 0.0, 'avg_logprob': -0.30902165174484253, 'compression_ratio': 1.3445945945945945, 'no_speech_prob': 0.1661727875471115}
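
Each segment carries `start` and `end` fields in seconds, as seen in the output above. A small helper like the following could turn them into a readable timestamp for display. The function names `format_timestamp` and `slim_segment` are hypothetical and not part of Whisper; this is only a sketch of one possible presentation step.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an H:MM:SS string (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

def slim_segment(segment):
    """Keep only the fields needed to show a search result."""
    return {
        "text": segment["text"].strip(),
        "start": segment["start"],
        "end": segment["end"],
    }

# First segment of the chem engg video, trimmed to the fields used here
example_segment = {
    "id": 0,
    "seek": 0,
    "start": 0.0,
    "end": 4.0,
    "text": " What in the world do chemical engineers do?",
}

slim = slim_segment(example_segment)
print(format_timestamp(slim["start"]), "-", format_timestamp(slim["end"]), slim["text"])
```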


### Setting up the vector db

We now set up a vector database using Qdrant. We create a collection named _videos_ in which we will index all the vectors corresponding to the utterances. The vector dimension is set to the sentence transformer's embedding dimension, since those are the vectors we will store in the db.

In [8]:
qdrant = QdrantClient(":memory:")

In [9]:
# Create collection to store data
qdrant.recreate_collection(
    collection_name="videos",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE
    )
)

True

In the cell below, we add each individual utterance from the transcription segments to the vector db. We combine the transcription index and the segment index arithmetically into a single integer id, so that no two (transcription, segment) pairs can share an id.

In [10]:
for transcription_idx, transcription in enumerate(transcriptions, 1):
    for segment_idx, segment in tqdm.tqdm(enumerate(transcription['segments'], 1)):
        payload_doc = {"transcription": transcription, "segment": segment}

        # Arithmetic id: unique as long as a video has fewer than 100,000 segments.
        # String concatenation can collide, e.g. (1, 11) and (11, 1) both give 111.
        point_id = transcription_idx * 100_000 + segment_idx

        qdrant.upload_records(
            collection_name="videos",
            records=[models.Record(id=point_id,
                                   vector=encoder.encode(segment["text"]).tolist(),
                                   payload=payload_doc)]
        )

27it [00:00, 66.44it/s]
32it [00:00, 52.28it/s]
13it [00:00, 55.45it/s]
25it [00:00, 53.87it/s]
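
A point id built from two indices is most useful when it can be decoded back into the pair it came from. The sketch below assumes ids of the form `transcription_idx * MAX_SEGMENTS_PER_VIDEO + segment_idx`; the constant and the function names are hypothetical choices for illustration. It also shows why joining the two indices as strings is not reversible.

```python
MAX_SEGMENTS_PER_VIDEO = 100_000  # assumed upper bound, chosen for illustration

def concat_id(transcription_idx, segment_idx):
    """Id built by joining the decimal strings of both indices."""
    return int(str(transcription_idx) + str(segment_idx))

def composite_id(transcription_idx, segment_idx):
    """Id built arithmetically; unique while segment_idx stays below the bound."""
    if not 0 <= segment_idx < MAX_SEGMENTS_PER_VIDEO:
        raise ValueError("segment_idx is outside the assumed range")
    return transcription_idx * MAX_SEGMENTS_PER_VIDEO + segment_idx

def split_id(point_id):
    """Recover (transcription_idx, segment_idx) from a composite id."""
    return divmod(point_id, MAX_SEGMENTS_PER_VIDEO)

# Two different pairs map to the same concatenated id ...
print(concat_id(1, 11), concat_id(11, 1))
# ... but to different composite ids, which also decode back to the pair
print(composite_id(1, 11), composite_id(11, 1), split_id(composite_id(11, 1)))
```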


We can now query the vector database to retrieve information from any of the indexed dialogues/utterances from the videos. The query is first converted into a vector using the same sentence transformer that was used to index the videos. This is necessary because similarities are only meaningful between vectors produced by the same encoding model.

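For each hit we list the id of the video, the utterance or dialogue, the seek value used to locate where to start the video from, and a score indicating how similar the query and the retrieved utterance are. Note that Whisper's seek is the offset of the decoding window that contains the segment rather than of the segment itself, which is why two different utterances in the output both show a seek of 0. The segment's start field (in seconds) gives a more precise position.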
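
To turn a seek value into a playback position, it has to be converted to seconds. The sketch below assumes that seek counts audio frames of 10 ms each (a 16,000 Hz sample rate with a hop length of 160 samples), which is our understanding of Whisper's reference implementation; the constants and the helper names `seek_to_seconds` and `playback_start` are written out here for illustration and should be checked against the installed Whisper version.

```python
# Assumed Whisper audio constants (verify against your installed version)
WHISPER_SAMPLE_RATE = 16_000  # samples per second
WHISPER_HOP_LENGTH = 160      # samples per frame
FRAMES_PER_SECOND = WHISPER_SAMPLE_RATE // WHISPER_HOP_LENGTH  # 100 frames/s

def seek_to_seconds(seek):
    """Convert a frame offset into seconds."""
    return seek / FRAMES_PER_SECOND

def playback_start(segment):
    """Prefer the segment's own start time; fall back to the window offset."""
    if "start" in segment:
        return segment["start"]
    return seek_to_seconds(segment["seek"])

print(seek_to_seconds(5200))                   # seek value taken from the results
print(playback_start({"seek": 0, "start": 0.0}))
```

Under this assumption, a seek of 5200 corresponds to 52 seconds into the video.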

Here we use the example query "_chem enginers work_". The top result comes from the chem engg video itself and reads "_What in the world do chemical engineers do?_". We deliberately misspelled the query to demonstrate the robustness of embedding-based search. None of the query words appears verbatim in the retrieved dialogue, yet it was still retrieved.

We can process this ranked list further to generate the final results shown to users. For example, we could return just the top 3 unique videos rather than individual timestamps, or filter the results based on their score.
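
One possible form of this post-processing is sketched below: keep the best-scoring utterance per video, drop anything under a score threshold, and return at most 3 videos. The function `top_unique_videos` and the flattened dictionaries in `sample_hits` are hypothetical; the sample values are the video ids and (rounded) scores from the results of the example query, written as plain dictionaries rather than Qdrant hit objects.

```python
def top_unique_videos(hits, max_videos=3, min_score=0.0):
    """Return the best hit per video, highest score first."""
    best_per_video = {}
    for hit in hits:
        if hit["score"] < min_score:
            continue  # too dissimilar to the query
        video_id = hit["video_id"]
        current = best_per_video.get(video_id)
        if current is None or hit["score"] > current["score"]:
            best_per_video[video_id] = hit
    ranked = sorted(best_per_video.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:max_videos]

# Flattened copies of the five search results for "chem enginers work"
sample_hits = [
    {"video_id": "./data/chem engg.mp4", "score": 0.5513},
    {"video_id": "./data/chem engg.mp4", "score": 0.5466},
    {"video_id": "./data/chem engg.mp4", "score": 0.5320},
    {"video_id": "./data/data science.mp4", "score": 0.2414},
    {"video_id": "./data/chem engg.mp4", "score": 0.2383},
]

for result in top_unique_videos(sample_hits):
    print(result["video_id"], result["score"])
```

Raising `min_score` (for example to 0.3) would keep only the chem engg video here; a suitable threshold depends on the encoder and the data, so it is best chosen empirically.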

In [11]:
query = "chem enginers work"

hits = qdrant.search(
    collection_name="videos",
    query_vector=encoder.encode(query).tolist(),
    limit=5
)
for hit in hits:
    print('Video id: ' + str(hit.payload['transcription']['id']) + '\n', 
          'Dialogue: ' + str(hit.payload['segment']['text'].strip()) + '\n', 
          'Seek: ' + str(hit.payload['segment']['seek']) + '\n',
          "score:", hit.score)
    print('-----------')

Video id: ./data/chem engg.mp4
 Dialogue: What in the world do chemical engineers do?
 Seek: 0
 score: 0.5513292127919139
-----------
Video id: ./data/chem engg.mp4
 Dialogue: So what in the world do chemical engineers do?
 Seek: 5200
 score: 0.5466465182951933
-----------
Video id: ./data/chem engg.mp4
 Dialogue: When it comes to getting things done, chemical engineers have a pretty full to do list.
 Seek: 2500
 score: 0.5319760024110961
-----------
Video id: ./data/data science.mp4
 Dialogue: Their job is to use techniques like machine learning, predictive modeling, data mining,
 Seek: 2864
 score: 0.2413862063049848
-----------
Video id: ./data/chem engg.mp4
 Dialogue: Develop and deliver new energy sources safely and responsibly.
 Seek: 0
 score: 0.23827743765568876
-----------
