# What is semantic search?
Semantic search is a search technique that tries to understand the intent and meaning behind a user's query rather than only matching keywords. It uses natural language processing (NLP) and machine learning to interpret the query and return more relevant results.
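
To make the contrast concrete, here is a minimal sketch that compares literal keyword matching with similarity between vectors. The two documents, the query and all three-dimensional vectors are hypothetical values picked by hand for illustration; a real system would get its vectors from a trained language model.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_embeddings = {
    "A caped vigilante protects Gotham City": [0.9, 0.1, 0.0],
    "A romantic comedy set in Paris": [0.0, 0.2, 0.9],
}
query = "Batman"
query_embedding = [0.8, 0.2, 0.1]  # assumed to lie close to the vigilante text


def keyword_match(query, text):
    """Return True only if the query appears literally in the text."""
    return query.lower() in text.lower()


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Keyword search: neither document contains the literal word "Batman".
keyword_hits = [text for text in toy_embeddings if keyword_match(query, text)]

# Vector search: rank documents by similarity to the query vector.
semantic_ranking = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_embedding, toy_embeddings[text]),
    reverse=True,
)

print(keyword_hits)
print(semantic_ranking[0])
```

The keyword search comes back empty, while the vector comparison still ranks the vigilante description first.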

# **Importing Packages**

In [1]:
import numpy as np
import pandas as pd

# **Installing datasets, evaluate, transformers and faiss**

In [2]:
!pip install faiss-gpu
!pip install datasets evaluate transformers[sentencepiece]

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting protobuf<=3.20.2
  Downloading protobuf-3.20.2-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Installing collected packages: protobuf, evaluate
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.20.3
    Uninstalling protobuf-3.20.3:
      Successfully uninstalled protobuf-3.20.3

In [3]:
df = pd.read_csv(r"C:\Users\ishan\Data Anaytics\IMDB Movies\imdb_top_1000.csv")

df.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

For search purposes, the relevant columns are Series_Title, Genre, Overview and Director.

In [4]:
df= df[['Series_Title','Genre','Overview','Director']]
df.head()

|   | Series_Title | Genre | Overview | Director |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... | Frank Darabont |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... | Francis Ford Coppola |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan |
| 3 | The Godfather: Part II | Crime, Drama | The early life and career of Vito Corleone in ... | Francis Ford Coppola |
| 4 | 12 Angry Men | Crime, Drama | A jury holdout attempts to prevent a miscarria... | Sidney Lumet |


Convert the pandas DataFrame to a Hugging Face Dataset. This makes it easy to apply Hugging Face tokenizers and models directly to the data.

In [5]:
from datasets import Dataset

movie_dataset = Dataset.from_pandas(df)

movie_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director'],
    num_rows: 1000
})

Concatenate all the text fields so that a single embedding vector represents all the relevant data for each movie.

In [6]:
def concatenate_text(data):
    # Join the four fields with newlines into a single "text" field.
    return {
        "text": data["Series_Title"] + "\n" + data["Genre"] + "\n"
        + data["Overview"] + "\n" + data["Director"]
    }

movie_dataset = movie_dataset.map(concatenate_text)

movie_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text'],
    num_rows: 1000
})

Result of concatenation

In [7]:
movie_dataset['text'][0]

'The Shawshank Redemption\nDrama\nTwo imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.\nFrank Darabont'
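
The `+` concatenation assumes every field is a string; a row with a missing value (`None`) in any of the four fields would raise a `TypeError`. Whether this dataset has such rows is not checked here, so the following is only a defensive sketch. The function name `concatenate_text_safe` and the example row are hypothetical.

```python
def concatenate_text_safe(
    row, fields=("Series_Title", "Genre", "Overview", "Director")
):
    """Join the selected fields with newlines, skipping None or empty values."""
    parts = []
    for field in fields:
        value = row[field]
        if value is not None and value != "":
            parts.append(str(value))
    return {"text": "\n".join(parts)}


# Hypothetical row with a missing overview, for illustration only.
example_row = {
    "Series_Title": "The Shawshank Redemption",
    "Genre": "Drama",
    "Overview": None,
    "Director": "Frank Darabont",
}

result = concatenate_text_safe(example_row)
print(result["text"])
```

A function of this shape could be passed to `movie_dataset.map` in place of `concatenate_text`.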

# Importing Model and Tokenizer from HuggingFace

In [8]:
from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


In [9]:
def cls_pooling(model_output):
    # Use the hidden state of the first token as the sentence embedding.
    return model_output.last_hidden_state[:, 0]

def get_embeddings(text_list):
    # Pad and truncate so that all sequences in the batch share one length.
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors="tf")
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
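
To see what the slice `[:, 0]` in `cls_pooling` does, here is a small NumPy sketch on a toy array standing in for `last_hidden_state`. The array values and the attention mask are made up. Mean pooling is included only for contrast; it is not what this notebook uses.

```python
import numpy as np

# Toy stand-in for model_output.last_hidden_state:
# 2 sentences, 4 token positions each, hidden size 3.
last_hidden_state = np.arange(24, dtype=np.float32).reshape(2, 4, 3)

# CLS pooling: keep only the vector at token position 0 of every sentence.
cls_vectors = last_hidden_state[:, 0]

# Mean pooling (for contrast): average over real tokens only,
# using the attention mask to ignore padding positions.
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]], dtype=np.float32)
mask = attention_mask[:, :, None]
mean_vectors = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_vectors.shape)
print(mean_vectors.shape)
```

Both strategies reduce a `(batch, tokens, hidden)` array to one `(batch, hidden)` vector per sentence; they differ only in which token positions contribute.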

DEBUGGING

In [10]:
#embedding = get_embeddings(movie_dataset['text'][0])

#embedding

Apply the function to the dataset

In [11]:
embeddings_dataset = movie_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)

In [12]:
embeddings_dataset

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text', 'embeddings'],
    num_rows: 1000
})
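
The `map` call in cell 11 runs the model once per row. `Dataset.map` also accepts `batched=True`, in which case the function receives lists of values, so several texts could be embedded in one model call. The sketch below shows only the batching idea in plain Python. `fake_get_embeddings` is a hypothetical stand-in for `get_embeddings`, so no model is needed to run it.

```python
def fake_get_embeddings(text_list):
    """Hypothetical stand-in for get_embeddings: one tiny vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in text_list]


def embed_in_batches(texts, batch_size=32):
    """Call the embedding function once per batch instead of once per text."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(fake_get_embeddings(batch))
    return embeddings


texts = ["The Godfather", "12 Angry Men", "The Dark Knight"]
batched_result = embed_in_batches(texts, batch_size=2)
print(len(batched_result))
```

The embeddings come back in the same order as the input texts, whatever the batch size.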

# Using FAISS for efficient similarity search

In [13]:
embeddings_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['Series_Title', 'Genre', 'Overview', 'Director', 'text', 'embeddings'],
    num_rows: 1000
})
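
As a rough mental model of what a flat (exhaustive) index does, the sketch below compares a query against every stored vector and keeps the `k` closest. It assumes squared L2 distance, which to my understanding is the metric used when `add_faiss_index` is called without a `metric_type`. The function name `flat_l2_search` and the toy vectors are hypothetical.

```python
import numpy as np


def flat_l2_search(index_vectors, query_vector, k):
    """Return (distances, indices) of the k rows closest to the query,
    measured by squared L2 distance."""
    diffs = index_vectors - query_vector
    squared_distances = np.sum(diffs * diffs, axis=1)
    nearest = np.argsort(squared_distances)[:k]
    return squared_distances[nearest], nearest


# Toy 2-dimensional "embeddings", for illustration only.
toy_index = np.array(
    [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]], dtype=np.float32
)
toy_query = np.array([0.0, 0.0], dtype=np.float32)

distances, indices = flat_l2_search(toy_index, toy_query, k=3)
print(indices)
print(distances)
```

FAISS performs the same kind of comparison with optimized routines, and offers approximate index types for collections too large to scan exhaustively.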

# TESTING

In [14]:
question = "Batman?"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

In [15]:
scores , samples = embeddings_dataset.get_nearest_examples(
   "embeddings", question_embedding, k=5)

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores",ascending=False,inplace=True)

samples_df

|   | Series_Title | Genre | Overview | Director | text | embeddings | scores |
|---|---|---|---|---|---|---|---|
| 4 | Joker | Crime, Drama, Thriller | In Gotham City, mentally troubled comedian Art... | Todd Phillips | Joker\nCrime, Drama, Thriller\nIn Gotham City,... | [0.20968075096607208, -0.3021984100341797, -0.... | 32.632893 |
| 3 | Batman Begins | Action, Adventure | After training with his mentor, Batman begins ... | Christopher Nolan | Batman Begins\nAction, Adventure\nAfter traini... | [-0.11327170580625534, 0.4175341725349426, -0.... | 28.97954 |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | The Dark Knight\nAction, Crime, Drama\nWhen th... | [0.12353761494159698, 0.12746308743953705, -0.... | 28.628519 |
| 1 | The Dark Knight Rises | Action, Adventure | Eight years after the Joker's reign of anarchy... | Christopher Nolan | The Dark Knight Rises\nAction, Adventure\nEigh... | [0.03760784864425659, 0.441690593957901, -0.22... | 26.83872 |
| 0 | Batman: Mask of the Phantasm | Animation, Action, Crime | Batman is wrongly implicated in a series of mu... | Kevin Altieri | Batman: Mask of the Phantasm\nAnimation, Actio... | [-0.0030359849333763123, -0.05288544297218323,... | 26.005501 |
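
A note on reading the scores, assuming the index reports L2 distances (the default as far as I know): a smaller score then means a closer match. The row labels 0 to 4 appear to record the order in which the index returned the results, with row 0 being the nearest. The sketch below re-sorts the scores from the table in ascending order to put the closest match first.

```python
# Titles and scores copied from the results table above.
results = [
    ("Joker", 32.632893),
    ("Batman Begins", 28.97954),
    ("The Dark Knight", 28.628519),
    ("The Dark Knight Rises", 26.83872),
    ("Batman: Mask of the Phantasm", 26.005501),
]

# If the scores are distances, the smallest value is the best match.
closest_first = sorted(results, key=lambda item: item[1])

for title, score in closest_first:
    print(f"{score:9.4f}  {title}")
```

In the pandas code this would correspond to `ascending=True` in `sort_values`. If the index were instead built with an inner-product metric, larger scores would be better and the descending sort would be the right one.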


# Results

In [16]:
for _, row in samples_df.iterrows():
    print(f"Series Title: {row.Series_Title}")
    print(f"Overview: {row.Overview}")
    print(f"Genre: {row.Genre}")
    print(f"Scores: {row.scores}")
    print("=" * 50)
    print()

Series Title: Joker
Overview: In Gotham City, mentally troubled comedian Arthur Fleck is disregarded and mistreated by society. He then embarks on a downward spiral of revolution and bloody crime. This path brings him face-to-face with his alter-ego: the Joker.
Genre: Crime, Drama, Thriller
Scores: 32.63289260864258

Series Title: Batman Begins
Overview: After training with his mentor, Batman begins his fight to free crime-ridden Gotham City from corruption.
Genre: Action, Adventure
Scores: 28.97953987121582

Series Title: The Dark Knight
Overview: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.
Genre: Action, Crime, Drama
Scores: 28.62851905822754

Series Title: The Dark Knight Rises
Overview: Eight years after the Joker's reign of anarchy, Batman, with the help of the enigmatic Catwoman, is forced from his exile to save Gotham City from the bru