## Background: 
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in connecting users with relevant information. For Google, providing a seamless and accurate search experience is paramount. This project focuses on improving the search relevance for video subtitles, enhancing the accessibility of video content.

## Objective:
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.

# Importing the required libraries

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import requests
import re
import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Step 1 - Reading the Tables from the Database File

In [2]:
# Read the code below and write your observation in the next cell

conn = sqlite3.connect("eng_subtitles_database.db")
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

<sqlite3.Cursor at 0x1694764c0>

[('zipfiles',)]


### The cell above lists the tables in the database. As mentioned earlier, there is a single table named zipfiles. We also know from README.txt that this table contains three columns: 'num', 'name' and 'content'.

# Step 2 - Reading the columns of Table

In [3]:
cursor.execute("PRAGMA table_info('zipfiles')")
cols = cursor.fetchall()
for col in cols:
    print(col[1])

<sqlite3.Cursor at 0x1694764c0>

num
name
content


### The above code helps in checking the column names in the database table.
Let's now use SELECT * FROM zipfiles to read all the data into a df variable.
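
The two inspection queries used in Steps 1 and 2 can be tried without the real database file. The following is a minimal sketch, assuming an in-memory SQLite database with a stand-in `zipfiles` table; the column types and the helper names `list_tables` and `list_columns` are hypothetical and not taken from the actual file.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the connected database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return [row[0] for row in rows]

def list_columns(conn, table):
    """Return the column names of `table` using PRAGMA table_info."""
    # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    return [row[1] for row in rows]

# Hypothetical stand-in for eng_subtitles_database.db (column types are guesses)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")

print(list_tables(demo_conn))
print(list_columns(demo_conn, "zipfiles"))
```

Wrapping the queries in small functions also avoids the stray `<sqlite3.Cursor ...>` line that appears when `cursor.execute(...)` is echoed by the notebook.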

# Step 3 - Loading the Database Table inside a Pandas DataFrame

In [4]:
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


### It looks like the content column does not contain the subtitle text directly. Instead, as mentioned in README.txt, it might be latin-1 encoded.

# Step 4 - Printing content of 0th Row

In [6]:
b_data = df.iloc[0, 2]

# here 2 represent the index of content column
# 0 represents the row number


### The content starts with the bytes "PK\x03......", which suggests that it is a ZIP archive file.
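
The ZIP guess can also be checked in code rather than by eye. Below is a small sketch, assuming a hypothetical helper `looks_like_zip`; it compares the leading four bytes with the ZIP local file header signature and then asks the standard `zipfile` module to confirm. The sample archive is built in memory purely for illustration.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature at the start of a non-empty ZIP archive

def looks_like_zip(blob):
    """Cheap check on the leading bytes, then a structural check via zipfile."""
    return blob[:4] == ZIP_LOCAL_HEADER and zipfile.is_zipfile(io.BytesIO(blob))

# Build a tiny archive in memory to stand in for one value of the content column.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.srt", "1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n")
sample_blob = buffer.getvalue()

print(looks_like_zip(sample_blob))
print(looks_like_zip(b"plain text"))
```

On the real data the same check could be applied to `b_data`, or mapped over the whole `content` column before attempting to unzip every row.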

# Step 5 - Unzipping the content of the row at index 385 and decoding using latin-1

In [7]:
import zipfile
import io

# Assuming 'content' is the binary data from your database
binary_data = df.iloc[385, 2]

# Decompress the binary data using the zipfile module
with io.BytesIO(binary_data) as f:
    with zipfile.ZipFile(f, 'r') as zip_file:
        # Reading only one file in the ZIP archive
        subtitle_content = zip_file.read(zip_file.namelist()[0])

# Now 'subtitle_content' should contain the extracted subtitle content
print(subtitle_content.decode('latin-1'))  # Assuming the content is latin-1 encoded text

1
00:00:06,000 --> 00:00:12,074
Watch any video online with Open-SUBTITLES
Free Browser extension: osdb.link/ext

2
00:00:15,370 --> 00:00:16,506
You lose everything, my girl.

3
00:00:16,530 --> 00:00:19,360
So you've said - four times.

4
00:00:20,330 --> 00:00:22,120
I definitely had
it on yesterday.

5
00:00:22,465 --> 00:00:25,785
Your gloves, your keys, that
handkerchief I embroidered for you

6
00:00:25,809 --> 00:00:26,168
Everything!

7
00:00:26,192 --> 00:00:27,280
Five times.

8
00:00:31,610 --> 00:00:32,920
Miss Scarlet?
- Yes.

9
00:00:36,390 --> 00:00:37,390
I'm Miss Scarlet.

10
00:00:37,872 --> 00:00:40,880
May I inquire if
you've lost something?

11
00:00:41,350 --> 00:00:42,530
Some jewellery perhaps?

12
00:00:42,870 --> 00:00:45,130
Yes, my mother's wedding ring.

13
00:00:45,220 --> 00:00:45,840
Have you found it?

14
00:00:45,950 --> 00:00:47,656
Does your ring have
an inscription?

15
00:00:48,650 --> 00:00:51,720
From my father to my mother 'For
my beloved, Livi

# Step 6 - Applying the above Function on the Entire Data

In [8]:
import zipfile
import io

count = 0

def decode_method(binary_data):
    global count
    # Decompress the binary data using the zipfile module
    # print(count, end=" ")
    count += 1
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            # Assuming there's only one file in the ZIP archive
            subtitle_content = zip_file.read(zip_file.namelist()[0])

    # Now 'subtitle_content' should contain the extracted subtitle content
    return subtitle_content.decode('latin-1')  # Assuming the content is latin-1 encoded text

In [9]:
df['file_content'] = df['content'].apply(decode_method)

df.head()

Unnamed: 0,num,name,content,file_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."
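
The `ï»¿` prefix in row 4 above is what the three-byte UTF-8 byte-order mark (`EF BB BF`) looks like when it is read as Latin-1, which suggests that at least some of the files are UTF-8. The sketch below shows one way to handle both cases: try `utf-8-sig` first and fall back to Latin-1. `decode_subtitle` is a hypothetical helper, and whether the whole corpus should be decoded this way is an assumption, not something README.txt states.

```python
def decode_subtitle(raw_bytes):
    """Decode subtitle bytes, preferring UTF-8 (BOM stripped) over Latin-1."""
    try:
        # 'utf-8-sig' removes a leading byte-order mark if one is present.
        return raw_bytes.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this fallback never fails.
        return raw_bytes.decode("latin-1")

with_bom = b"\xef\xbb\xbf1\r\nHello"
latin1_only = "caf\xe9".encode("latin-1")  # b'caf\xe9' is not valid UTF-8

print(repr(with_bom.decode("latin-1")))    # the BOM shows up as three stray characters
print(repr(decode_subtitle(with_bom)))
print(repr(decode_subtitle(latin1_only)))
```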


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   num           82498 non-null  int64 
 1   name          82498 non-null  object
 2   content       82498 non-null  object
 3   file_content  82498 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.5+ MB


In [11]:
df.tail()

Unnamed: 0,num,name,content,file_content
82493,9521935,the.prophets.game.(2000).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xb8\xa6\x...,"ï»¿1\r\n00:01:16,284 --> 00:01:19,537\r\nGod,\..."
82494,9521937,west.beirut.(1998).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x13\x97\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\napi.Open..."
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00$\x97\x9aV...,"1\r\n00:00:01,001 --> 00:00:04,630\r\n(Dramati..."
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x97\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis..."
82497,9521941,zombie.island.massacre.(1984).eng.1cd,"b'PK\x03\x04\x14\x00\x00\x00\x08\x00,\x97\x9aV...","1\r\n00:00:01,919 --> 00:00:03,253\r\n(Sharp w..."


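# Step 7 - Slice the DataFrame to get roughly 30% of the data (the first 26,000 of 82,498 rows) and store it in another DataFrame using the iloc method

In [12]:
# .copy() makes sliced_data independent of df, so adding columns later is safe
sliced_data = df.iloc[:26000].copy()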

In [13]:
sliced_data

Unnamed: 0,num,name,content,file_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."
...,...,...,...,...
25995,9284663,last.resort.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x83\xa3\x...,"ï»¿1\r\n00:00:08,041 --> 00:00:12,346\r\n[gent..."
25996,9284664,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x98\xa3\x...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\napi.O..."
25997,9284665,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1a\x85\x...,"ï»¿1\r\n00:00:03,872 --> 00:00:07,341\r\n[bees..."
25998,9284668,aurora.teagarden.mysteries.aurora.teagarden.my...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00x\x85\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver..."


In [14]:
df.iloc[0,3]

'1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch any video online with Open-SUBTITLES\r\nFree Browser extension: osdb.link/ext\r\n\r\n2\r\n00:02:26,198 --> 00:02:29,953\r\nIn the name of God, the most gracious, the most Merciful.\r\n\r\n3\r\n00:02:31,072 --> 00:02:33,370\r\nFrom Muhammad, the Messenger of God\r\n\r\n4\r\n00:02:33,550 --> 00:02:36,047\r\nto Heraclius, the emperor of Byzantium.\r\n\r\n5\r\n00:02:36,407 --> 00:02:39,464\r\ngreetings to him who is the\r\nfollower of righteous guidance.\r\n\r\n6\r\n00:02:39,783 --> 00:02:42,591\r\nI bid you to hear the divine call.\r\n\r\n7\r\n00:02:43,160 --> 00:02:45,817\r\nI am the messenger of God to the people;\r\n\r\n8\r\n00:02:46,337 --> 00:02:48,784\r\naccept Islam for your salvation.\r\n\r\n9\r\n00:02:52,231 --> 00:02:54,709\r\nHe speaks of a new prophet in Arabia.\r\n\r\n10\r\n00:02:55,068 --> 00:02:57,825\r\nWas it like this when John, the Baptist\r\ncame to king Herod\r\n\r\n11\r\n00:02:58,145 --> 00:03:01,272\r\nout of the desert, 

# Data Preprocessing
# Data Cleaning

# Step 1 : Removing the timestamp from file_content column using regex

In [15]:
import re

# Define the regex pattern
pattern = r'\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}\s*'

# Apply the regex pattern to the specified column
sliced_data['cleaned_text'] = sliced_data['file_content'].apply(lambda x: re.sub(pattern, '', x))

# Display the cleaned DataFrame
sliced_data

Unnamed: 0,num,name,content,file_content,cleaned_text
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...
...,...,...,...,...,...
25995,9284663,last.resort.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x83\xa3\x...,"ï»¿1\r\n00:00:08,041 --> 00:00:12,346\r\n[gent...",ï»¿1\r\n[gentle music]\r\n\r\n2\r\nâª\r\n\r\n...
25996,9284664,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x98\xa3\x...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\napi.O...","ï»¿1\r\napi.OpenSubtitles.org is deprecated, p..."
25997,9284665,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1a\x85\x...,"ï»¿1\r\n00:00:03,872 --> 00:00:07,341\r\n[bees...",ï»¿1\r\n[bees buzzing]\r\n\r\n2\r\napi.OpenSub...
25998,9284668,aurora.teagarden.mysteries.aurora.teagarden.my...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00x\x85\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver...",ï»¿1\r\nAdvertise your product or brand here\r...
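
To see exactly what the timestamp pattern removes, it helps to run it on a two-cue snippet. The sketch below reuses the same regular expression on sample text copied from the subtitle printed in Step 5; the variable names are illustrative only.

```python
import re

TIMESTAMP_PATTERN = r'\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}\s*'

sample_srt = (
    "1\r\n00:00:15,370 --> 00:00:16,506\r\nYou lose everything, my girl.\r\n\r\n"
    "2\r\n00:00:16,530 --> 00:00:19,360\r\nSo you've said - four times.\r\n"
)

# The trailing \s* in the pattern also consumes the line break after each timestamp.
without_timestamps = re.sub(TIMESTAMP_PATTERN, '', sample_srt)
print(repr(without_timestamps))
```

The cue numbers (`1`, `2`) are left in place by this pattern, which matches the `cleaned_text` column shown above.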


In [16]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [17]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mac_kushal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mac_kushal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
from bs4 import BeautifulSoup
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tqdm import tqdm, tqdm_notebook
from sentence_transformers import SentenceTransformer, util

In [19]:
sliced_data

Unnamed: 0,num,name,content,file_content,cleaned_text
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...
...,...,...,...,...,...
25995,9284663,last.resort.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x83\xa3\x...,"ï»¿1\r\n00:00:08,041 --> 00:00:12,346\r\n[gent...",ï»¿1\r\n[gentle music]\r\n\r\n2\r\nâª\r\n\r\n...
25996,9284664,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x98\xa3\x...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\napi.O...","ï»¿1\r\napi.OpenSubtitles.org is deprecated, p..."
25997,9284665,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1a\x85\x...,"ï»¿1\r\n00:00:03,872 --> 00:00:07,341\r\n[bees...",ï»¿1\r\n[bees buzzing]\r\n\r\n2\r\napi.OpenSub...
25998,9284668,aurora.teagarden.mysteries.aurora.teagarden.my...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00x\x85\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver...",ï»¿1\r\nAdvertise your product or brand here\r...


In [20]:
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Initialize WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # remove timestamps from subtitle documents
    cleaned_text = re.sub(r'\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+', '', text)
    # Remove line numbers
    cleaned_text = re.sub(r'\d+\s*', '', text)
    # Remove HTML tags
    cleaned_text = BeautifulSoup(cleaned_text, "html.parser").get_text(separator=" ")
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    cleaned_text = re.sub(r'[ï]', '', cleaned_text)
    cleaned_text = re.sub(r'[âª]', '', cleaned_text)
    # Remove extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Lemmatize tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Join tokens back into text
    preprocessed_text = ' '.join(lemmatized_tokens)
    return preprocessed_text.strip()

# Apply preprocessing to the 'cleaned_text' column
sliced_data['processed_content'] = sliced_data['cleaned_text'].apply(preprocess_text)

# Display the preprocessed data
print(sliced_data['processed_content'])


0        1 watch any video online with opensubtitles fr...
1        1 ah there princess dawn and terry with the 2 ...
2        1 iyumis cell 2i 2 iepisode 36 extremely polit...
3        1 watch any video online with opensubtitles fr...
4        ï1 watch any video online with opensubtitles f...
                               ...                        
25995    ï1 gentle music 2 âª 3 crow cawing 4 bird chir...
25996    ï1 apiopensubtitlesorg is deprecated please im...
25997    ï1 bee buzzing 2 apiopensubtitlesorg is deprec...
25998    ï1 advertise your product or brand here contac...
25999    ï1 support u and become vip member to remove a...
Name: processed_content, Length: 26000, dtype: object
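
The output above still carries cue numbers (`1`, `2`, ...), the `ï` left over from the byte-order mark, and tag letters fused to words (`iyumis ... 2i`). That is consistent with the function as written: the results of the number, tag and whitespace steps are stored in `cleaned_text`, while tokenization runs on `text`. The following is a minimal sketch of a variant that threads a single variable through every step. It is standard-library only, uses a regex instead of BeautifulSoup for tags, and skips lemmatization, so it illustrates the ordering of the steps rather than serving as a drop-in replacement; `clean_subtitle` is a hypothetical name.

```python
import re

def clean_subtitle(text):
    """Strip SRT scaffolding and punctuation, returning lowercase plain text."""
    cleaned = text.lower()
    # Drop a byte-order mark, whether decoded properly or read as Latin-1.
    cleaned = cleaned.replace("\ufeff", "").replace("\xef\xbb\xbf", "")
    # Remove timestamps such as 00:00:06,000 --> 00:00:12,074.
    cleaned = re.sub(r"\d+:\d+:\d+,\d+\s*-->\s*\d+:\d+:\d+,\d+", " ", cleaned)
    # Remove cue numbers: lines that contain nothing but digits.
    cleaned = re.sub(r"(?m)^\s*\d+\s*$", " ", cleaned)
    # Remove markup tags such as <i>...</i>.
    cleaned = re.sub(r"<[^>]+>", " ", cleaned)
    # Remove punctuation, then collapse runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", "", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()

sample = (
    "\xef\xbb\xbf1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi's Cells 2</i>\r\n\r\n"
    "2\r\n00:00:57,000 --> 00:00:58,000\r\nEpisode 36\r\n"
)
print(clean_subtitle(sample))
```

One trade-off of the digits-only rule is that a dialogue line consisting solely of a number would also be dropped.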


In [21]:
sliced_data

Unnamed: 0,num,name,content,file_content,cleaned_text,processed_content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...
...,...,...,...,...,...,...
25995,9284663,last.resort.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x83\xa3\x...,"ï»¿1\r\n00:00:08,041 --> 00:00:12,346\r\n[gent...",ï»¿1\r\n[gentle music]\r\n\r\n2\r\nâª\r\n\r\n...,ï1 gentle music 2 âª 3 crow cawing 4 bird chir...
25996,9284664,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x98\xa3\x...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\napi.O...","ï»¿1\r\napi.OpenSubtitles.org is deprecated, p...",ï1 apiopensubtitlesorg is deprecated please im...
25997,9284665,all.crazy.random.().eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1a\x85\x...,"ï»¿1\r\n00:00:03,872 --> 00:00:07,341\r\n[bees...",ï»¿1\r\n[bees buzzing]\r\n\r\n2\r\napi.OpenSub...,ï1 bee buzzing 2 apiopensubtitlesorg is deprec...
25998,9284668,aurora.teagarden.mysteries.aurora.teagarden.my...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00x\x85\x99V...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver...",ï»¿1\r\nAdvertise your product or brand here\r...,ï1 advertise your product or brand here contac...


In [22]:
sliced_data = sliced_data.drop('content', axis=1)


In [23]:
sliced_data.head()

Unnamed: 0,num,name,file_content,cleaned_text,processed_content
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...


# Document Chunking

In [24]:
def chunk_document(text, chunk_size=500, overlap=50):
    chunks = []
    words = word_tokenize(text)
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

#Apply chunking to each subtitle document
chunked_data = sliced_data['processed_content'].apply(chunk_document)
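
With `chunk_size=500` and `overlap=50`, each window starts 450 tokens after the previous one, so consecutive chunks share 50 tokens. The sketch below reproduces the same windowing with small numbers so the boundaries are easy to inspect. It uses `str.split` in place of `word_tokenize` to stay self-contained, and `chunk_words` is a hypothetical name. It also stops once a window reaches the end of the text, which is a choice made for this sketch and differs from the function above.

```python
def chunk_words(words, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

tokens = "a b c d e f g h i j".split()
for chunk in chunk_words(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

Without the early `break`, the loop would emit one more window starting at position 9 that contains only `j`, a token already covered by the previous chunk.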

In [25]:
sliced_data.head()

Unnamed: 0,num,name,file_content,cleaned_text,processed_content
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...


# Saving the Cleaned Subtitle Data in a CSV file

In [26]:
# Specify the file path for the CSV file
output_csv2_file = 'cleaned_chunked_subtitle_data.csv'

# Write the whole DataFrame (all columns except the dropped 'content') to a CSV file
sliced_data.to_csv(output_csv2_file, index=False, header=True)

print(f"Cleaned subtitle data has been saved to {output_csv2_file}.")

Cleaned subtitle data has been saved to cleaned_chunked_subtitle_data.csv.


In [27]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('cleaned_chunked_subtitle_data.csv')

# Print the first few rows of the DataFrame
df.head()

Unnamed: 0,num,name,file_content,cleaned_text,processed_content
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...


In [28]:
df.shape

(26000, 5)

In [29]:
sliced_data.iloc[0,4]

'1 watch any video online with opensubtitles free browser extension osdblinkext 2 in the name of god the most gracious the most merciful 3 from muhammad the messenger of god 4 to heraclius the emperor of byzantium 5 greeting to him who is the follower of righteous guidance 6 i bid you to hear the divine call 7 i am the messenger of god to the people 8 accept islam for your salvation 9 he speaks of a new prophet in arabia 10 wa it like this when john the baptist came to king herod 11 out of the desert cry about salvation 12 to muqawqis patriarch of alexandria 13 kisra emperor of persia 14 muhammad call you with the call of god 15 accept islam for your salvation 16 embrace islam 17 you come out of the desert smelling of camel and goat 18 to tell persia where he should kneel 19 muhammad messenger of god 20 who gave him this authority 21 god sent muhammad a a mercy to mankind 22 the scholar and historian of islam the university of alazhar in cairo the high islamic congress of the shiat in 

In [30]:
!pip install sentence-transformers




# Generating Text Vectors Using BERT based Sentence Transformer

In [31]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

In [32]:
sliced_data['doc_vector_pretrained_bert'] = sliced_data.processed_content.apply(model.encode)

In [33]:
sliced_data.head()

Unnamed: 0,num,name,file_content,cleaned_text,processed_content,doc_vector_pretrained_bert
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...,"[-0.05896317, 0.13861059, -0.04594013, -0.0916..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...,"[-0.07771736, 0.028140318, 0.033873767, -0.112..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...,"[-0.08680382, -0.08750703, 0.06415923, -0.0391..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...,"[-0.053850733, -0.07838601, 0.05710254, -0.022..."
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...,"[-0.03212091, -0.0068686004, 0.052767005, -0.0..."


In [34]:
sliced_data.to_csv('search.csv')

In [35]:
import pandas as pd
df=pd.read_csv('search.csv')
df

Unnamed: 0.1,Unnamed: 0,num,name,file_content,cleaned_text,processed_content,doc_vector_pretrained_bert
0,0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...,[-5.89631684e-02 1.38610587e-01 -4.59401309e-...
1,1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...,[-7.77173564e-02 2.81403176e-02 3.38737667e-...
2,2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...,[-8.68038237e-02 -8.75070319e-02 6.41592294e-...
3,3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...,[-5.38507327e-02 -7.83860087e-02 5.71025386e-...
4,4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...,[-3.21209095e-02 -6.86860038e-03 5.27670048e-...
...,...,...,...,...,...,...,...
25995,25995,9284663,last.resort.(2022).eng.1cd,"ï»¿1\r\n00:00:08,041 --> 00:00:12,346\r\n[gent...",ï»¿1\r\n[gentle music]\r\n\r\n2\r\nâª\r\n\r\n...,ï1 gentle music 2 âª 3 crow cawing 4 bird chir...,[ 6.67666867e-02 -2.95987893e-02 4.14446630e-...
25996,25996,9284664,all.crazy.random.().eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\napi.O...","ï»¿1\r\napi.OpenSubtitles.org is deprecated, p...",ï1 apiopensubtitlesorg is deprecated please im...,[-9.96579453e-02 -9.47523117e-03 -6.68287352e-...
25997,25997,9284665,all.crazy.random.().eng.1cd,"ï»¿1\r\n00:00:03,872 --> 00:00:07,341\r\n[bees...",ï»¿1\r\n[bees buzzing]\r\n\r\n2\r\napi.OpenSub...,ï1 bee buzzing 2 apiopensubtitlesorg is deprec...,[-7.94203356e-02 -7.41111860e-02 6.30527595e-...
25998,25998,9284668,aurora.teagarden.mysteries.aurora.teagarden.my...,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nAdver...",ï»¿1\r\nAdvertise your product or brand here\r...,ï1 advertise your product or brand here contac...,[-9.53414291e-02 7.55122453e-02 -2.22560056e-...
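
After the round trip through CSV, the `doc_vector_pretrained_bert` column no longer holds arrays: each cell is the printed form of an array, a string such as `[-5.89631684e-02 1.38610587e-01 ...]`. The sketch below shows one way to turn such a string back into floats. `parse_vector` is a hypothetical helper, and it assumes the printed array was written out in full rather than abbreviated.

```python
def parse_vector(cell):
    """Convert the printed form of a 1-D array, e.g. '[0.1 -2e-02]', to floats."""
    inner = cell.strip().lstrip("[").rstrip("]")
    # Values are separated by arbitrary whitespace, including line breaks.
    return [float(token) for token in inner.split()]

cell = "[-5.89631684e-02  1.38610587e-01\n -4.59401309e-02]"
print(parse_vector(cell))
```

Storing embeddings in a binary format would avoid this parsing step altogether; the CSV route is shown here only because it is what the notebook uses.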


In [36]:
sliced_data.head()

Unnamed: 0,num,name,file_content,cleaned_text,processed_content,doc_vector_pretrained_bert
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...,"[-0.05896317, 0.13861059, -0.04594013, -0.0916..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",1\r\nAh! There's Princess\r\nDawn and Terry wi...,1 ah there princess dawn and terry with the 2 ...,"[-0.07771736, 0.028140318, 0.033873767, -0.112..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",1\r\n<i>Yumi's Cells 2</i>\r\n\r\n2\r\n<i>Epis...,1 iyumis cell 2i 2 iepisode 36 extremely polit...,"[-0.08680382, -0.08750703, 0.06415923, -0.0391..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",1\r\nWatch any video online with Open-SUBTITLE...,1 watch any video online with opensubtitles fr...,"[-0.053850733, -0.07838601, 0.05710254, -0.022..."
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",ï»¿1\r\nWatch any video online with Open-SUBTI...,ï1 watch any video online with opensubtitles f...,"[-0.03212091, -0.0068686004, 0.052767005, -0.0..."


# Creating Query Embeddings

In [37]:
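from sklearn.metrics.pairwise import cosine_similarity

def search(query, data, embeddings, model, top_n=10):
    # Embed the query with the same model used for the documents
    query_embedding = model.encode([query])[0]
    # Similarity of the query against every document embedding: shape (1, n_docs)
    similarities = cosine_similarity([query_embedding], embeddings)

    # Positions of the top_n highest scores, best match first
    top_indices = np.argsort(similarities[0])[-top_n:][::-1]
    # Positional lookup, so the result does not depend on the index labels
    results = [(data['name'].iloc[i], similarities[0][i]) for i in top_indices]

    return results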
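
The ranking step relies on cosine similarity, which for a query vector q and a document vector d is q·d / (‖q‖ ‖d‖). The sketch below spells out that formula and the top-n selection in plain Python on made-up three-dimensional vectors; `cosine`, `rank_top_n` and the toy data are hypothetical and only meant to make the arithmetic visible.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_top_n(query_vec, doc_vecs, names, n=2):
    """Return the n (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in zip(names, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy data: made-up 3-dimensional "embeddings"
toy_names = ["doc_a", "doc_b", "doc_c"]
toy_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

for name, score in rank_top_n([1.0, 0.2, 0.0], toy_vecs, toy_names):
    print(name, round(score, 3))
```

Because cosine similarity ignores vector length, a document pointing in the same direction as the query scores 1.0 regardless of magnitude.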

In [38]:
embeddings = np.array(sliced_data['doc_vector_pretrained_bert'].tolist())

In [39]:
# Map each row position to its embedding vector
embedding_dict = dict(enumerate(embeddings))

# Print the first embedding as a sanity check
print(f"Embedding 0: {embedding_dict[0]}")

Embedding 0: [-5.89631684e-02  1.38610587e-01 -4.59401309e-02 -9.16577876e-02
  1.63094494e-02  5.10551631e-02  7.45496200e-03  2.11077109e-02
  1.07918076e-01 -2.19542533e-02  1.50389504e-02 -3.40051651e-02
  7.48673603e-02  1.57759320e-02 -1.06348712e-02 -3.69665399e-02
 -1.64274052e-02  4.22191694e-02 -3.94361354e-02 -1.07354373e-01
  2.35912048e-05  2.93981023e-02  6.41805381e-02  2.36042328e-02
 -3.92498933e-02  5.24988747e-04 -2.15270277e-02  2.46926062e-02
  1.48853129e-02 -3.78233381e-02 -1.83449034e-02 -2.56619300e-03
  2.84688342e-02 -3.86758000e-02 -6.48269579e-02  6.15539923e-02
  5.70464730e-02  3.57763395e-02  1.27658173e-02 -2.27961000e-02
  1.16382048e-01  2.04285849e-02 -2.90124733e-02 -5.61350584e-03
 -1.98993599e-03 -3.54942642e-02 -2.91566178e-02  4.08234261e-02
  1.36579238e-02 -1.08691417e-02 -1.52520508e-01  2.55288165e-02
  6.52375519e-02 -6.53583184e-02 -5.17040007e-02 -1.09972671e-01
 -3.25277150e-02  2.24912763e-02  2.00676303e-02 -9.27843451e-02
 -5.77203408

# Calculating Cosine Similarity Score

In [40]:
from sklearn.metrics.pairwise import cosine_similarity

query = input("Enter your search query for English movies and series: ")
search_results = search(query, sliced_data, embeddings, model)
for result in search_results:
    print("Document:", result[0])
    print("Similarity Score:", result[1])
    print()

Document: glitch.s01.e01.episode.1.1.(2022).eng.1cd
Similarity Score: 0.29631895

Document: leng.mian.ju.ji.shou.(1991).eng.1cd
Similarity Score: 0.26968023

Document: my.love.(1994).eng.1cd
Similarity Score: 0.26907033

Document: a.grunts.life.s02.e03.best.of.the.afghan.best.(2022).eng.1cd
Similarity Score: 0.26138896

Document: falco.s01.e04.rencontres.assassines.(2013).eng.1cd
Similarity Score: 0.26127806

Document: big.love.s04.e06.under.one.roof.(2010).eng.1cd
Similarity Score: 0.2592083

Document: would.i.lie.to.you.s15.e04.episode.15.4.(2021).eng.1cd
Similarity Score: 0.2577954

Document: bring.it.on.cheer.or.die.(2022).eng.1cd
Similarity Score: 0.25580403

Document: bling.empire.s03.e01.blast.from.the.past.(2022).eng.1cd
Similarity Score: 0.25550932

Document: falco.s01.e01.le.reveil.(2013).eng.1cd
Similarity Score: 0.2550542



In [41]:
ids = sliced_data.index.astype(str).tolist()  # ChromaDB ids must be strings
documents = sliced_data['processed_content'].tolist()
# Keep only the remaining columns ('num', 'name') as per-document metadata
metadata = sliced_data.drop(
    ['file_content', 'cleaned_text', 'processed_content', 'doc_vector_pretrained_bert'], axis=1
).to_dict(orient='records')

In [42]:
documents[0]

'1 watch any video online with opensubtitles free browser extension osdblinkext 2 in the name of god the most gracious the most merciful 3 from muhammad the messenger of god 4 to heraclius the emperor of byzantium 5 greeting to him who is the follower of righteous guidance 6 i bid you to hear the divine call 7 i am the messenger of god to the people 8 accept islam for your salvation 9 he speaks of a new prophet in arabia 10 wa it like this when john the baptist came to king herod 11 out of the desert cry about salvation 12 to muqawqis patriarch of alexandria 13 kisra emperor of persia 14 muhammad call you with the call of god 15 accept islam for your salvation 16 embrace islam 17 you come out of the desert smelling of camel and goat 18 to tell persia where he should kneel 19 muhammad messenger of god 20 who gave him this authority 21 god sent muhammad a a mercy to mankind 22 the scholar and historian of islam the university of alazhar in cairo the high islamic congress of the shiat in 

In [43]:
embeddings[0]

array([-5.89631684e-02,  1.38610587e-01, -4.59401309e-02, -9.16577876e-02,
        1.63094494e-02,  5.10551631e-02,  7.45496200e-03,  2.11077109e-02,
        1.07918076e-01, -2.19542533e-02,  1.50389504e-02, -3.40051651e-02,
        7.48673603e-02,  1.57759320e-02, -1.06348712e-02, -3.69665399e-02,
       -1.64274052e-02,  4.22191694e-02, -3.94361354e-02, -1.07354373e-01,
        2.35912048e-05,  2.93981023e-02,  6.41805381e-02,  2.36042328e-02,
       -3.92498933e-02,  5.24988747e-04, -2.15270277e-02,  2.46926062e-02,
        1.48853129e-02, -3.78233381e-02, -1.83449034e-02, -2.56619300e-03,
        2.84688342e-02, -3.86758000e-02, -6.48269579e-02,  6.15539923e-02,
        5.70464730e-02,  3.57763395e-02,  1.27658173e-02, -2.27961000e-02,
        1.16382048e-01,  2.04285849e-02, -2.90124733e-02, -5.61350584e-03,
       -1.98993599e-03, -3.54942642e-02, -2.91566178e-02,  4.08234261e-02,
        1.36579238e-02, -1.08691417e-02, -1.52520508e-01,  2.55288165e-02,
        6.52375519e-02, -

# Storing the Generated Vectors in ChromaDB

In [44]:
import chromadb
client = chromadb.PersistentClient(path="Embeddings")

In [47]:
collection = client.create_collection(name="Search_Engine", metadata={"hnsw:space": "cosine"})
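With `"hnsw:space": "cosine"`, the `distances` that `collection.query` returns are cosine distances, that is, 1 minus the cosine similarity, so smaller values mean closer matches. A minimal NumPy sketch of that relationship follows; the helper names `cosine_distance` and `distance_to_similarity` are hypothetical and not part of ChromaDB.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def distance_to_similarity(distance):
    """Convert a cosine distance back into a cosine similarity."""
    return 1.0 - distance

d = cosine_distance([1.0, 0.0], [1.0, 1.0])
print(round(d, 4), round(distance_to_similarity(d), 4))
```

Under this convention a returned distance of about 0.497 corresponds to a cosine similarity of about 0.503, which makes ChromaDB distances comparable with the similarity scores printed by `search`.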

In [48]:
embeddings_as_lists = [embedding.tolist() for embedding in embeddings]

In [49]:
# Add each subtitle document to the collection with its embedding, id and metadata
for i, embedding in enumerate(embeddings_as_lists):
    collection.add(
        documents=documents[i],
        embeddings=embedding,
        ids=ids[i],
        metadatas=metadata[i],
    )
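Adding one record per call works, but `collection.add` also accepts lists, so the same data could be inserted in batches. The sketch below is one possible way to do that, not the approach used in this notebook: `iter_batches` and `add_in_batches` are hypothetical helpers, and the batch size of 1000 is an arbitrary illustrative value that may need lowering depending on the ChromaDB version.

```python
def iter_batches(n_items, batch_size):
    """Yield (start, stop) slice bounds that together cover range(n_items)."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

def add_in_batches(collection, ids, documents, embeddings, metadatas, batch_size=1000):
    """Call collection.add once per batch instead of once per record."""
    for start, stop in iter_batches(len(ids), batch_size):
        collection.add(
            ids=ids[start:stop],
            documents=documents[start:stop],
            embeddings=embeddings[start:stop],
            metadatas=metadatas[start:stop],
        )

# Slice bounds for 5 items in batches of 2
print(list(iter_batches(5, 2)))
```

With the variables from the cells above, a call would look like `add_in_batches(collection, ids, documents, embeddings_as_lists, metadata)`.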

In [50]:
# Query with a cleaned excerpt of subtitle text and fetch the 10 nearest documents
query_text = " through abraham noah moses and through jesus christ 571 why should we be so surprised that god speaks to u now through muhammad 572 who taught you those name 573 they are named in the quran"
results = collection.query(query_texts=[query_text], n_results=10)
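When `query_texts` is used, ChromaDB embeds the text with the collection's own embedding function, which is not necessarily the model that produced `doc_vector_pretrained_bert`. One way to keep query and document vectors in the same space is to encode the query yourself and pass `query_embeddings`. The sketch below assumes `model` is the same encoder used for the stored vectors and exposes `encode` as in `search`; `query_with_model` is a hypothetical helper.

```python
def query_with_model(collection, model, text, n_results=10):
    """Embed `text` with `model` and query the collection by vector."""
    # Same encoding call as in search(): encode a one-element batch, take the first vector
    query_vector = model.encode([text])[0]
    return collection.query(
        # ChromaDB expects plain Python floats, so convert the vector explicitly
        query_embeddings=[[float(x) for x in query_vector]],
        n_results=n_results,
    )

# Usage with the notebook's objects (not executed here):
# results = query_with_model(collection, model, "they are named in the quran")
```

If the collection's embedding function happens to be the same model, both query styles should return comparable neighbours.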

In [51]:
results

{'ids': [['22655',
   '22665',
   '22663',
   '22653',
   '7508',
   '1444',
   '22128',
   '6815',
   '10282',
   '22649']],
 'distances': [[0.49688708782196045,
   0.5383045077323914,
   0.571373701095581,
   0.5761290788650513,
   0.5826025009155273,
   0.5833632349967957,
   0.5917107462882996,
   0.6019304990768433,
   0.6034767627716064,
   0.6035460233688354]],
 'metadatas': [[{'name': 'the.bible.s01.e09.passion.(2013).eng.1cd',
    'num': 9271790},
   {'name': 'the.bible.s01.e09.passion.(2013).eng.1cd', 'num': 9271801},
   {'name': 'the.bible.s01.e07.mission.(2013).eng.1cd', 'num': 9271799},
   {'name': 'the.bible.s01.e07.mission.(2013).eng.1cd', 'num': 9271788},
   {'name': 'cabrito.(2020).eng.1cd', 'num': 9213169},
   {'name': 'american.experience.s28.e04.the.pilgrims.(2015).eng.1cd',
    'num': 9187731},
   {'name': 'ballad.of.a.white.cow.(2020).eng.1cd', 'num': 9270457},
   {'name': 'american.experience.s28.e04.the.pilgrims.(2015).eng.1cd',
    'num': 9210022},
   {'name': 