**Problem Statement - Enhancing Search Engine Relevance for Video Subtitles**

**Background:**
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in connecting users with relevant information. For Google, providing a seamless and accurate search experience is paramount. This project focuses on improving the search relevance for video subtitles, enhancing the accessibility of video content.

**Objective:**
Develop an advanced search engine algorithm that efficiently retrieves videos based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.

**Core Logic:**
To compare a user query against a video subtitle document, the core logic involves three key steps:

1. TF-IDF Vectorization for Subtitle Document:
   - Find the TF-IDF vector for each subtitle document.

2. TF-IDF Vectorization for Query:
   - Compute the TF-IDF vector for the user's query.
   - Treat the query as a new document and reuse the IDF values already computed from the subtitle documents instead of recomputing them.

3. Cosine Similarity Calculation:
   - Compute the cosine similarity between the TF-IDF vector of the document and the TF-IDF vector of the query.
   - This similarity score determines the relevance of the document to the user's query.
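The three steps above can be sketched in plain Python on a toy corpus. This is only an illustration under stated assumptions: the three sentences are made up, the helper names (`fit_idf`, `tfidf_vector`, `cosine`) are hypothetical, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`. The notebook itself relies on scikit-learn's `TfidfVectorizer` rather than hand-written code.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute smoothed IDF values once, from the document collection only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """TF-IDF weights for one text; terms outside the fitted vocabulary are ignored."""
    counts = Counter(term for term in text.split() if term in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up stand-ins for cleaned subtitle documents
docs = [
    "peter parker is spider man",
    "walter white teaches chemistry",
    "ross and rachel were on a break",
]

idf = fit_idf(docs)                                         # step 1: IDF from documents
doc_vectors = [tfidf_vector(doc, idf) for doc in docs]
query_vector = tfidf_vector("peter parker", idf)            # step 2: reuse the same IDF
scores = [cosine(query_vector, dv) for dv in doc_vectors]   # step 3: relevance scores
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Because the query is weighted with the IDF values learned from the documents, query terms that never occur in any subtitle simply drop out of the vector.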


### 1.  Converting Subtitles Data to Numerical Matrix:

#### extracting zipped files and storing inside folders

In [5]:
import os
import zipfile

def unzip_files(zip_folders, destination_folder):
    # Create the common destination folder if it doesn't exist
    os.makedirs(destination_folder, exist_ok=True)

    # Iterate over each zip folder
    for zip_folder in zip_folders:
        # Iterate over each zip file in the current zip folder
        for zip_file in os.listdir(zip_folder):
            if zip_file.endswith(".zip"):
                zip_file_path = os.path.join(zip_folder, zip_file)

                # Extract the contents of the zip file into the common destination folder
                with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
                    zip_ref.extractall(destination_folder)

                print(f"Extracted {zip_file} from {zip_folder} to {destination_folder}")



zip_folders_paths = [r'C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM', r'C:\Users\Sruthi\Desktop\project_3\zipped\MCU', r'C:\Users\Sruthi\Desktop\project_3\zipped\Breaking Bad', r'C:\Users\Sruthi\Desktop\project_3\zipped\F R I E N D S']

common_destination_folder = r'C:\Users\Sruthi\Desktop\project_3\unzipped'

unzip_files(zip_folders_paths, common_destination_folder)

Extracted HIMYM-Season-01_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-02_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-03_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-04_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-05_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-06_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-07_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMYM to C:\Users\Sruthi\Desktop\project_3\unzipped
Extracted HIMYM-Season-08_en.zip from C:\Users\Sruthi\Desktop\project_3\zipped\HIMY

#### reading files

In [6]:
import os

dir_path = r"C:\Users\Sruthi\Desktop\project_3\unzipped"

files = os.listdir(dir_path)

In [7]:
files

['Breaking_bad', 'friends', 'himym', 'mcu']

In [62]:
import os
import pandas as pd

def read_files_from_folders(folders):
    file_data = []  # List to store file paths, names, and contents

    for folder in folders:
        print(f"Reading files from folder: {folder}")

        # Iterate over all files in the current folder
        for root, _, files in os.walk(folder):
            for file in files:
                file_path = os.path.join(root, file)

                # Read the contents of the file
                with open(file_path, 'r', encoding='latin-1') as file_handle:
                    file_contents = file_handle.read()

                # Append file path, name, and content to the list
                file_data.append({'File_Path': file_path, 'File_Name': file, 'File_Content': file_contents})

    return file_data

folders_to_read = [r'C:\Users\Sruthi\Desktop\project_3\unzipped\Breaking_bad', r'C:\Users\Sruthi\Desktop\project_3\unzipped\Friends',r'C:\Users\Sruthi\Desktop\project_3\unzipped\himym',r'C:\Users\Sruthi\Desktop\project_3\unzipped\mcu']


files_data = read_files_from_folders(folders_to_read)

# Create a DataFrame from the list
df = pd.DataFrame(files_data)

# Display the DataFrame
print(df)


Reading files from folder: C:\Users\Sruthi\Desktop\project_3\unzipped\Breaking_bad
Reading files from folder: C:\Users\Sruthi\Desktop\project_3\unzipped\Friends
Reading files from folder: C:\Users\Sruthi\Desktop\project_3\unzipped\himym
Reading files from folder: C:\Users\Sruthi\Desktop\project_3\unzipped\mcu
                                             File_Path  \
0    C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...   
1    C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...   
2    C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...   
3    C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...   
4    C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...   
..                                                 ...   
938  C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...   
939  C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...   
940  C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...   
941  C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...   
942  C:\Users\Sruthi\Desktop\project_3\unzipped\mcu

In [14]:
df.to_csv(r"C:\Users\Sruthi\Desktop\project_3\vedio_titles.csv", index = False)

In [2]:
import pandas as pd

In [47]:
df = pd.read_csv(r"C:\Users\Sruthi\Desktop\project_3\vedio_titles.csv")

#### ..and converting it into DataFrame

In [48]:
df.head()

Unnamed: 0.1,Unnamed: 0,File_Path,File_Name,File_Content
0,0,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DSR.0TV.en.srt,"1\n00:02:15,273 --> 00:02:17,187\nMy name is\n..."
1,1,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DVDRip.ORPHEUS.en.srt,"1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\..."
2,2,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DSR....,"1\n00:00:05,695 --> 00:00:07,790\nPreviously o..."
3,3,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DVDR...,"1\n00:00:20,838 --> 00:00:23,257\nAre you okay..."
4,4,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x03 - ...and the Bag's in the ...,"1\n00:00:43,059 --> 00:00:44,757\nLet's break ..."


In [49]:
df.shape

(943, 4)

In [50]:
df.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [51]:
df.shape

(943, 3)

#### Clean the “file_content” by removing HTML tags

In [52]:
import pandas as pd
from bs4 import BeautifulSoup

def remove_html_tags(html_text):
    # Parse the HTML text
    soup = BeautifulSoup(html_text, 'html.parser')

    # Extract the text content without HTML tags
    text_content = soup.get_text(separator=' ', strip=True)
    
    return text_content
    
df['Text_Content'] = df['File_Content'].apply(remove_html_tags)

    



In [53]:
df['Text_Content']

0      1\n00:02:15,273 --> 00:02:17,187\nMy name is\n...
1      1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\...
2      1\n00:00:05,695 --> 00:00:07,790\nPreviously o...
3      1\n00:00:20,838 --> 00:00:23,257\nAre you okay...
4      1\n00:00:43,059 --> 00:00:44,757\nLet's break ...
                             ...                        
938    ï»¿1\n00:00:26,361 --> 00:00:28,196 This is th...
939    ï»¿1\n00:01:24,209 --> 00:01:25,543\nWait for ...
940    ï»¿1\n00:00:41,834 --> 00:00:43,877\nOh, great...
941    ï»¿1\n00:00:54,888 --> 00:00:56,598 Now, I kno...
942    1\n00:00:39,039 --> 00:00:43,636 ODIN: Long be...
Name: Text_Content, Length: 943, dtype: object
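The `ï»¿` prefix on rows 938 to 941 is what a UTF-8 byte order mark (bytes `EF BB BF`) looks like when it is decoded as latin-1. A minimal sketch of a more forgiving decoder follows. `decode_subtitle_bytes` is a hypothetical helper that is not part of the notebook, and it assumes the files are opened in binary mode (`'rb'`) so the raw bytes can be inspected first.

```python
import codecs

def decode_subtitle_bytes(raw: bytes) -> str:
    """Decode subtitle bytes, removing a byte order mark when one is present."""
    # UTF-16 files start with a BOM; the "utf-16" codec reads and consumes it
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    try:
        # "utf-8-sig" strips a leading UTF-8 BOM and otherwise behaves like UTF-8
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this fallback cannot fail
        return raw.decode("latin-1")

print(repr(decode_subtitle_bytes(b"\xef\xbb\xbf1\n00:00:26,361 --> 00:00:28,196")))
```

Trying UTF-8 first and falling back to latin-1 keeps the original behaviour for files that really are latin-1, while avoiding stray BOM characters at the start of the text.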

In [54]:
df.head()

Unnamed: 0,File_Path,File_Name,File_Content,Text_Content
0,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DSR.0TV.en.srt,"1\n00:02:15,273 --> 00:02:17,187\nMy name is\n...","1\n00:02:15,273 --> 00:02:17,187\nMy name is\n..."
1,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DVDRip.ORPHEUS.en.srt,"1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\...","1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\..."
2,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DSR....,"1\n00:00:05,695 --> 00:00:07,790\nPreviously o...","1\n00:00:05,695 --> 00:00:07,790\nPreviously o..."
3,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DVDR...,"1\n00:00:20,838 --> 00:00:23,257\nAre you okay...","1\n00:00:20,838 --> 00:00:23,257\nAre you okay..."
4,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x03 - ...and the Bag's in the ...,"1\n00:00:43,059 --> 00:00:44,757\nLet's break ...","1\n00:00:43,059 --> 00:00:44,757\nLet's break ..."
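Subtitle files are not full HTML documents; the markup they carry is usually limited to inline tags such as `<font>` and `<i>`. As a lighter alternative to a full parser, the sketch below strips such tags with a regular expression. The pattern, the helper name `strip_inline_tags`, and the sample string are all assumptions for illustration, and a regex like this is not a general HTML parser.

```python
import re

# Matches opening and closing tags such as <i>, </i>, <font color="...">
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>")

def strip_inline_tags(text: str) -> str:
    """Replace each tag with a space, then collapse repeated spaces."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"[ \t]+", " ", without_tags).strip()

# Made-up subtitle line, not taken from the dataset
sample = '<font color="#ffff00">PAT [ON BROADCAST]:</font> We come to you <i>now</i>'
print(strip_inline_tags(sample))
```

The pattern requires a letter right after `<` or `</`, so the `-->` arrow in SRT timestamp lines is left untouched.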


#### Removing special characters and digits, and applying lemmatization


In [55]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [56]:
lemmatizer = WordNetLemmatizer()

In [57]:
def preprocess(text):
    # Removing special characters and digits
    sentence = re.sub("[^a-zA-Z]", " ", text)
    
    # change sentence to lower case
    sentence = sentence.lower()

    # tokenize into words
    tokens = sentence.split()
    
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    preprocessed_text = ' '.join(tokens)

    return preprocessed_text

In [58]:
df['Clean_Text_Lemma'] = df['Text_Content'].apply(preprocess)

df.head()

Unnamed: 0,File_Path,File_Name,File_Content,Text_Content,Clean_Text_Lemma
0,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DSR.0TV.en.srt,"1\n00:02:15,273 --> 00:02:17,187\nMy name is\n...","1\n00:02:15,273 --> 00:02:17,187\nMy name is\n...",my name is walter hartwell white i live at neg...
1,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DVDRip.ORPHEUS.en.srt,"1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\...","1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\...",oh my god christ shit oh god oh my god oh my g...
2,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DSR....,"1\n00:00:05,695 --> 00:00:07,790\nPreviously o...","1\n00:00:05,695 --> 00:00:07,790\nPreviously o...",previously on breaking bad i don t know what s...
3,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DVDR...,"1\n00:00:20,838 --> 00:00:23,257\nAre you okay...","1\n00:00:20,838 --> 00:00:23,257\nAre you okay...",are you okay you are a lifesaver yeah man we c...
4,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x03 - ...and the Bag's in the ...,"1\n00:00:43,059 --> 00:00:44,757\nLet's break ...","1\n00:00:43,059 --> 00:00:44,757\nLet's break ...",let s break it down hydrogen what doe that giv...
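The cleaned text contains fragments such as `don t` and `let s` because the regular expression replaces every non-letter, apostrophes included, with a space. The short sketch below applies the same regex, lowercasing, and split to a made-up SRT block (lemmatization is left out so that it runs without NLTK data).

```python
import re

# Made-up SRT block: cue number, timestamp line, then the dialogue
sample = "2\n00:00:05,695 --> 00:00:07,790\nI don't know what's happening."

# Same substitution as in preprocess(): anything that is not a letter becomes a space
letters_only = re.sub("[^a-zA-Z]", " ", sample).lower()
tokens = letters_only.split()
print(tokens)
```

The cue number and timestamps disappear entirely, which is why no separate SRT parsing step is needed, but contractions are split into two tokens.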


In [59]:
def count_tokens(text):
    # Count whitespace-separated tokens in the already-cleaned text
    return len(text.split())

In [60]:
df['Text_Len_Lemma'] = df['Clean_Text_Lemma'].apply(count_tokens)

df.head()

Unnamed: 0,File_Path,File_Name,File_Content,Text_Content,Clean_Text_Lemma,Text_Len_Lemma
0,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DSR.0TV.en.srt,"1\n00:02:15,273 --> 00:02:17,187\nMy name is\n...","1\n00:02:15,273 --> 00:02:17,187\nMy name is\n...",my name is walter hartwell white i live at neg...,3830
1,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x01 - Pilot.DVDRip.ORPHEUS.en.srt,"1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\...","1\n00:01:09,386 --> 00:01:11,221\nOh, my god.\...",oh my god christ shit oh god oh my god oh my g...,3924
2,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DSR....,"1\n00:00:05,695 --> 00:00:07,790\nPreviously o...","1\n00:00:05,695 --> 00:00:07,790\nPreviously o...",previously on breaking bad i don t know what s...,3039
3,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x02 - Cat's in the Bag....DVDR...,"1\n00:00:20,838 --> 00:00:23,257\nAre you okay...","1\n00:00:20,838 --> 00:00:23,257\nAre you okay...",are you okay you are a lifesaver yeah man we c...,3131
4,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 1x03 - ...and the Bag's in the ...,"1\n00:00:43,059 --> 00:00:44,757\nLet's break ...","1\n00:00:43,059 --> 00:00:44,757\nLet's break ...",let s break it down hydrogen what doe that giv...,3240


In [61]:
df.shape

(943, 6)

#### Dropping rows with less than 500 words

In [62]:
len(df[df.Text_Len_Lemma < 500])

257

In [63]:
df[df.Text_Len_Lemma < 500].head()

Unnamed: 0,File_Path,File_Name,File_Content,Text_Content,Clean_Text_Lemma,Text_Len_Lemma
87,C:\Users\Sruthi\Desktop\project_3\unzipped\Bre...,Breaking Bad - 4x02 - Thirty-Eight Snub.720p.W...,ÿþ1,ÿþ1,,0
560,C:\Users\Sruthi\Desktop\project_3\unzipped\him...,How I Met Your Mother - 3x01 - Wait For It....srt,"1\n00:00:00,461 --> 00:00:03,401\nÈå åÇ, ÈíÔÊ...","1\n00:00:00,461 --> 00:00:03,401\nÈå åÇ, ÈíÔÊ...",mehrdadsh m shahiri m gmail com www free offli...,11
561,C:\Users\Sruthi\Desktop\project_3\unzipped\him...,How I Met Your Mother - 3x02 - We're Not From ...,"1\n00:00:00,500 --> 00:00:03,960\nÎÈ ... ÇÒÏæÇ...","1\n00:00:00,500 --> 00:00:03,960\nÎÈ ... ÇÒÏæÇ...",a am ami amin aminx aminx aminx www free offli...,12
562,C:\Users\Sruthi\Desktop\project_3\unzipped\him...,How I Met Your Mother - 3x03 - Third Wheel.srt,"1\n00:00:00,523 --> 00:00:02,873\nÈå åÇ¡ ãíÏæ...","1\n00:00:00,523 --> 00:00:02,873\nÈå åÇ¡ ãíÏæ...",mehrdadsh m shahiri m gmail com www free offli...,10
563,C:\Users\Sruthi\Desktop\project_3\unzipped\him...,How I Met Your Mother - 3x04 - Little Boys.srt,"1\n00:00:00,960 --> 00:00:02,910\nÔäÈå ÔÈ äíæí...","1\n00:00:00,960 --> 00:00:02,910\nÔäÈå ÔÈ äíæí...",mehrdadsh aminx www free offline com,6


In [64]:
data_500 = df[df.Text_Len_Lemma < 500].index
df.drop(data_500 , inplace=True)

In [65]:
df.shape

(686, 6)

In [66]:
df[df.Text_Len_Lemma < 500]

Unnamed: 0,File_Path,File_Name,File_Content,Text_Content,Clean_Text_Lemma,Text_Len_Lemma


#### Apply TF-IDF on Clean_Text_Lemma

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [68]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

In [69]:
# Learn the vocabulary and IDF values, then build the document-term matrix
tfidf_matrix = vectorizer.fit_transform(df.Clean_Text_Lemma)

In [70]:
len(vectorizer.vocabulary_)

24042

In [71]:
print(tfidf_matrix.toarray()) 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [72]:
feature_names = vectorizer.get_feature_names_out()

In [73]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

In [74]:
tfidf_df

Unnamed: 0,aa,aaaaaaaaaaaaa,aaah,aaahhh,aaand,aaay,aah,aahh,aahing,aakonian,...,zuckerberg,zuckerberging,zumba,zundung,zurich,zverik,zxy,zygomatic,zz,zzzzt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
681,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
682,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
684,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
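Calling `toarray()` stores every zero explicitly. A rough estimate of the dense size, using the 686 documents and 24,042 vocabulary terms reported above and assuming 8-byte float64 values (the `TfidfVectorizer` default dtype):

```python
n_docs, n_terms = 686, 24042      # rows and columns of the TF-IDF matrix
bytes_per_value = 8               # assumption: float64

dense_bytes = n_docs * n_terms * bytes_per_value
dense_mib = dense_bytes / 2**20

print(f"{dense_bytes:,} bytes is about {dense_mib:.0f} MiB")
```

That is manageable for this corpus, but the dense form grows with documents times vocabulary, whereas the sparse matrix returned by `fit_transform` only stores the non-zero weights.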


### 2. Query Processing

In [75]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise

#### Cosine similarity

In [76]:
def text_similarity_finder(search_query, k):
    # Vectorize the search query using the fitted vectorizer
    search_query_vector = vectorizer.transform([search_query])

    # Similarity Calculation
    similarity_scores = pairwise.cosine_similarity(search_query_vector, tfidf_df).flatten()

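    # argsort returns indices in ascending order, so the lowest scores come first
    similar_doc_indices = similarity_scores.argsort()

    # head(k) therefore returns the k least similar rows, as the output below shows
    return tfidf_df.loc[similar_doc_indices].head(k)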

similar_texts = text_similarity_finder(input("Enter your query"), 10)
print(similar_texts)


Enter your querypeter parker
      aa  aaaaaaaaaaaaa  aaah  aaahhh  aaand  aaay       aah  aahh  aahing  \
0    0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   
440  0.0            0.0   0.0     0.0    0.0   0.0  0.010499   0.0     0.0   
441  0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   
442  0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   
443  0.0            0.0   0.0     0.0    0.0   0.0  0.009941   0.0     0.0   
444  0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   
445  0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   
446  0.0            0.0   0.0     0.0    0.0   0.0  0.014557   0.0     0.0   
439  0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   
447  0.0            0.0   0.0     0.0    0.0   0.0  0.000000   0.0     0.0   

     aakonian  ...  zuckerberg  zuckerberging  zumba  zundung  zurich  zverik  \
0         0.0  ...         0.0 
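The rows printed here (0, 440, 441, ...) are not Spider-Man subtitles because an ascending sort of similarity scores lists the least similar documents first. The later cosine-distance cell avoids this, since distance is `1 - similarity` and smaller is better. A plain-Python sketch of ranking by similarity in descending order follows; the scores are made up and `top_k_by_similarity` is a hypothetical helper, with `sorted(range(...))` standing in for `numpy.argsort`.

```python
def top_k_by_similarity(scores, k):
    """Return the positions of the k highest scores, best match first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Made-up cosine similarities between one query and four documents
similarity_scores = [0.0, 0.42, 0.05, 0.31]

# Ascending order, like a plain argsort: least similar documents first
ascending = sorted(range(len(similarity_scores)), key=lambda i: similarity_scores[i])
least_similar = ascending[:2]

most_similar = top_k_by_similarity(similarity_scores, 2)
print(least_similar, most_similar)
```

With NumPy the same effect is usually obtained by reversing the argsort result or by sorting the negated scores.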

In [59]:
pip install --upgrade scikit-learn




In [77]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [78]:
from sklearn.metrics import pairwise_distances

#### cosine distance

In [79]:
# Function to find similar texts
def text_similarity_finder(search_query, k):
    # Vectorize the search query using the fitted vectorizer
    search_query_vector = vectorizer.transform([search_query])

    # Cosine distance to every document (0 = same direction, 1 = no shared terms)
    distance_scores = pairwise_distances(search_query_vector, tfidf_df, metric='cosine').flatten()

    # Ascending argsort puts the smallest distances, i.e. the closest documents, first
    similar_doc_indices = distance_scores.argsort()

    # Ranking and Retrieval: keep the TF-IDF rows of the k closest documents
    return tfidf_df.iloc[similar_doc_indices[:k]]

similar_texts = text_similarity_finder(input("enter your query: "), 10)
similar_texts

enter your query: peter parker


Unnamed: 0,aa,aaaaaaaaaaaaa,aaah,aaahhh,aaand,aaay,aah,aahh,aahing,aakonian,...,zuckerberg,zuckerberging,zumba,zundung,zurich,zverik,zxy,zygomatic,zz,zzzzt
675,0.0,0.0,0.0,0.0,0.0,0.0,0.001947,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
678,0.0,0.0,0.0,0.0,0.0,0.0,0.031857,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
676,0.0,0.0,0.0,0.0,0.0,0.0,0.004321,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
672,0.0,0.0,0.0,0.0,0.0,0.0,0.002289,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
674,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
467,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [80]:
df.iloc[675]

File_Path           C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...
File_Name           Spider-Man.No.Way.Home.2021.1080p.10bit.BrRip....
File_Content        ï»¿1\n00:00:02,604 --> 00:00:04,939\n<font col...
Text_Content        ï»¿1\n00:00:02,604 --> 00:00:04,939 PAT [ON BR...
Clean_Text_Lemma    pat on broadcast we come to you now with revel...
Text_Len_Lemma                                                  14122
Name: 932, dtype: object

In [81]:
index_of_10 = similar_texts.index

In [82]:
index_of_10

Index([675, 678, 673, 670, 676, 672, 674, 669, 677, 467], dtype='int64')

#### appending results into list

In [83]:
output_10 = []
for i in index_of_10:
    similar_text_10 = df.iloc[i]
    output_10.append(similar_text_10)

In [84]:
output_10

[File_Path           C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...
 File_Name           Spider-Man.No.Way.Home.2021.1080p.10bit.BrRip....
 File_Content        ï»¿1\n00:00:02,604 --> 00:00:04,939\n<font col...
 Text_Content        ï»¿1\n00:00:02,604 --> 00:00:04,939 PAT [ON BR...
 Clean_Text_Lemma    pat on broadcast we come to you now with revel...
 Text_Len_Lemma                                                  14122
 Name: 932, dtype: object,
 File_Path           C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...
 File_Name           The.Amazing.Spider-Man.2012.4K.REMASTERED.1080...
 File_Content        ï»¿1\n00:00:37,949 --> 00:00:39,712\n<font col...
 Text_Content        ï»¿1\n00:00:37,949 --> 00:00:39,712 (âªâªâª...
 Clean_Text_Lemma    peter five four three two one ready or not her...
 Text_Len_Lemma                                                   9041
 Name: 935, dtype: object,
 File_Path           C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...
 File_Name           Sp

In [117]:
df_top10 = pd.DataFrame(output_10)

#### top 10 result of search query

In [118]:
df_top10.drop(['File_Content'], axis = 1, inplace = True)

In [119]:
df_top10 

Unnamed: 0,File_Path,File_Name,Text_Content,Clean_Text_Lemma,Text_Len_Lemma
932,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man.No.Way.Home.2021.1080p.10bit.BrRip....,"ï»¿1\n00:00:02,604 --> 00:00:04,939 PAT [ON BR...",pat on broadcast we come to you now with revel...,14122
935,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,The.Amazing.Spider-Man.2012.4K.REMASTERED.1080...,"ï»¿1\n00:00:37,949 --> 00:00:39,712 (âªâªâª...",peter five four three two one ready or not her...,9041
930,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man.3.2007.REPACK.UHD.BluRay.2160p.True...,"ï»¿1\n00:03:17,530 --> 00:03:21,033\nIt's me, ...",it s me peter parker your friendly neighborhoo...,7185
927,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man 2 (2004) [EXTENDED 1080p BluRay x26...,"1\n00:03:02,058 --> 00:03:04,018 She looks at ...",she look at me every day mary jane watson oh b...,7683
933,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider.Man-Far.From.Home.2019.KORSUB.HDRip.x26...,"ï»¿1\n00:00:23,000 --> 00:00:28,000\nSubtitles...",subtitle by explosiveskull www opensubtitles o...,13077
929,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man Into The Spider-Verse (2018).srt,"ï»¿1\n00:00:05,674 --> 00:00:10,674\nSubtitles...",subtitle by explosiveskull colored sdh fixed b...,12425
931,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man.Homecoming.2017.1080p.WEB-DL.DD5.1....,"ï»¿1\n00:00:39,330 --> 00:00:41,958\nThings ar...",thing are never gonna be the same now i mean l...,12332
926,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man (2002) [REMASTERED 1080p BluRay x26...,"1\n00:03:08,105 --> 00:03:11,274 Who am l?\nYo...",who am l you sure you wanna know the story of ...,8367
934,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,The.Amazing.Spider-Man.2.2014.UHD.BluRay.2160p...,"ï»¿1\n00:01:52,822 --> 00:01:54,532 People wil...",people will say i am a monster for what i ve d...,10787
468,C:\Users\Sruthi\Desktop\project_3\unzipped\Fri...,Friends.S08E18.720p.BluRay.x264-CiNEFiLE.srt,"ï»¿1\n00:00:03,158 --> 00:00:07,036\nRoss, Mon...",ross mon is it okay if i bring someone to your...,2809


In [122]:
def text_similarity_finder(search_query, k):
    # Vectorize the search query using the fitted vectorizer
    search_query_vector = vectorizer.transform([search_query])

    # Cosine distance between the query and every document (smaller = more similar)
    distance_scores = pairwise_distances(search_query_vector, tfidf_df, metric='cosine').flatten()

    # Positions of the k closest documents, nearest first
    top_k_positions = distance_scores.argsort()[:k]

    # Rows of tfidf_df are positional (0..n-1), so iloc selects the matching rows of df
    df_topk = df.iloc[top_k_positions].drop(columns=['File_Content'])

    return df_topk

## FINAL OUTPUT

In [123]:
similar_texts = text_similarity_finder(input("Enter your query: "), 10)

similar_texts

Enter your query: peter parker


Unnamed: 0,File_Path,File_Name,Text_Content,Clean_Text_Lemma,Text_Len_Lemma
932,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man.No.Way.Home.2021.1080p.10bit.BrRip....,"ï»¿1\n00:00:02,604 --> 00:00:04,939 PAT [ON BR...",pat on broadcast we come to you now with revel...,14122
935,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,The.Amazing.Spider-Man.2012.4K.REMASTERED.1080...,"ï»¿1\n00:00:37,949 --> 00:00:39,712 (âªâªâª...",peter five four three two one ready or not her...,9041
930,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man.3.2007.REPACK.UHD.BluRay.2160p.True...,"ï»¿1\n00:03:17,530 --> 00:03:21,033\nIt's me, ...",it s me peter parker your friendly neighborhoo...,7185
927,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man 2 (2004) [EXTENDED 1080p BluRay x26...,"1\n00:03:02,058 --> 00:03:04,018 She looks at ...",she look at me every day mary jane watson oh b...,7683
933,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider.Man-Far.From.Home.2019.KORSUB.HDRip.x26...,"ï»¿1\n00:00:23,000 --> 00:00:28,000\nSubtitles...",subtitle by explosiveskull www opensubtitles o...,13077
929,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man Into The Spider-Verse (2018).srt,"ï»¿1\n00:00:05,674 --> 00:00:10,674\nSubtitles...",subtitle by explosiveskull colored sdh fixed b...,12425
931,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man.Homecoming.2017.1080p.WEB-DL.DD5.1....,"ï»¿1\n00:00:39,330 --> 00:00:41,958\nThings ar...",thing are never gonna be the same now i mean l...,12332
926,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,Spider-Man (2002) [REMASTERED 1080p BluRay x26...,"1\n00:03:08,105 --> 00:03:11,274 Who am l?\nYo...",who am l you sure you wanna know the story of ...,8367
934,C:\Users\Sruthi\Desktop\project_3\unzipped\mcu...,The.Amazing.Spider-Man.2.2014.UHD.BluRay.2160p...,"ï»¿1\n00:01:52,822 --> 00:01:54,532 People wil...",people will say i am a monster for what i ve d...,10787
468,C:\Users\Sruthi\Desktop\project_3\unzipped\Fri...,Friends.S08E18.720p.BluRay.x264-CiNEFiLE.srt,"ï»¿1\n00:00:03,158 --> 00:00:07,036\nRoss, Mon...",ross mon is it okay if i bring someone to your...,2809
