
# How to implement semantic search with FAISS

First, install FAISS (skip this step if it is already installed):

In [1]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0


Next, we will import NumPy, the only library the following cells need besides FAISS (which is imported later):

In [2]:
import numpy as np

After this, we will load a pre-trained word embedding model. This example uses the 50-dimensional GloVe vectors in glove.6B.50d.txt, which must already be present at the path below:

In [4]:
glove_path = '/content/glove.6B.50d.txt'

def load_glove_embeddings(glove_file):
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings(glove_path)
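The loader above relies on the GloVe text layout: one token per line, followed by its space-separated vector components. As a minimal sketch, the same parsing can be tried on an in-memory sample without the full file. The three-dimensional vectors and the name toy_embeddings are made up for illustration and are not real GloVe values.

```python
import io
import numpy as np

# Hypothetical three-dimensional sample in the GloVe text layout:
# one token per line, followed by its space-separated vector components.
sample = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "fox 0.4 0.5 0.6\n"
)

toy_embeddings = {}
for line in sample:
    values = line.strip().split()
    # First field is the token, the rest are the vector components.
    toy_embeddings[values[0]] = np.array(values[1:], dtype="float32")

print(toy_embeddings["fox"].shape)  # (3,)
print(toy_embeddings["fox"].dtype)  # float32
```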

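Next, we will convert the text data into numerical embeddings using the pre-trained model. Each text is represented by the average of the vectors of its words. Note that text.split() is case-sensitive and leaves punctuation attached, so tokens such as 'The' or 'dog.' are silently skipped whenever they are missing from the embedding dictionary: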

In [5]:
text_data = [
    'The quick brown fox jumps over the lazy dog.',
    'The dog chased the lazy fox.',
    'The fox and the dog are both animals.',
    'A cat is a mammal, but not a dog.',
    'Dogs and cats are both pets.'
]

text_embeddings = []

for text in text_data:
    text_embedding = np.mean([glove_embeddings[word] for word in text.split() if word in glove_embeddings], axis=0)
    text_embeddings.append(text_embedding)

text_embeddings = np.array(text_embeddings)
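The averaging step above could be wrapped in a helper that also normalizes tokens before the lookup. The following is a minimal sketch, not the notebook's method: the function name embed_text, the lowercasing, the punctuation stripping, and the zero-vector fallback are all assumptions, and the two-dimensional vocabulary is a toy stand-in for the GloVe dictionary. Using such a helper on the example texts would change which words contribute, so the search results could differ from the output shown in this notebook.

```python
import string
import numpy as np

def embed_text(text, embeddings, dimension):
    """Average the vectors of the tokens in `text` that have an embedding."""
    # Lowercase and strip surrounding punctuation so "The" and "dog."
    # are looked up as "the" and "dog".
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: return zeros instead of the NaN that
        # averaging an empty list would produce.
        return np.zeros(dimension, dtype="float32")
    return np.mean(vectors, axis=0).astype("float32")

# Toy two-dimensional vocabulary, for illustration only.
toy = {
    "the": np.array([1.0, 0.0], dtype="float32"),
    "dog": np.array([0.0, 1.0], dtype="float32"),
}
print(embed_text("The dog.", toy, 2).tolist())  # [0.5, 0.5]
print(embed_text("???", toy, 2).tolist())       # [0.0, 0.0]
```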

Now we will use FAISS to create an index for efficient search:

In [6]:
import faiss

dimension = 50  # The embedding size (e.g., 50 for GloVe)

# Create a Faiss index (e.g., IndexFlatL2 for L2 distance)
index = faiss.IndexFlatL2(dimension)

# Add your text embeddings to the index
index.add(text_embeddings.astype('float32'))  # Make sure embeddings are float32

# Example search: find the nearest neighbor to the first embedding
D, I = index.search(np.expand_dims(text_embeddings[0].astype('float32'), axis=0), k=1)  # k=1 for 1 nearest neighbor
# D contains the squared L2 distances, I contains the indices of the nearest neighbors
print("Nearest neighbor index:", I[0][0])
print("Distance:", D[0][0])

Nearest neighbor index: 0
Distance: 0.0
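IndexFlatL2 performs an exhaustive search: it compares the query against every stored vector and reports squared L2 distances. To make that concrete, here is a rough NumPy equivalent on toy data. The function name brute_force_l2_search and the two-dimensional vectors are illustrative assumptions, and the sketch ignores the optimizations FAISS applies internally.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, k):
    """Return (distances, indices) of the k nearest rows of `vectors` per query."""
    # Squared L2 distance between every query and every stored vector.
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Sort each row by ascending distance and keep the first k columns.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype="float32")
# Search with the first stored vector as the query, as in the cell above.
D_toy, I_toy = brute_force_l2_search(toy_vectors, toy_vectors[:1], k=3)
print(I_toy.tolist())  # [[0, 1, 2]]
print(D_toy.tolist())  # [[0.0, 1.0, 25.0]]
```

As in the FAISS output above, a vector searched against an index that contains it comes back first with distance 0.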


Finally, we will search for semantically similar text:

In [7]:
query_text = 'The fox and the dog are both animals.'
query_embedding = np.mean([glove_embeddings[word] for word in query_text.split() if word in glove_embeddings], axis=0)
query_embedding = np.array(query_embedding).reshape(1, -1)

_, similar_indices = index.search(query_embedding, 5)

similar_texts = [text_data[i] for i in similar_indices[0]]

print('Similar texts to:', query_text)
print(similar_texts)

Similar texts to: The fox and the dog are both animals.
['The fox and the dog are both animals.', 'Dogs and cats are both pets.', 'A cat is a mammal, but not a dog.', 'The quick brown fox jumps over the lazy dog.', 'The dog chased the lazy fox.']
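The search above ranks texts by L2 distance, which is sensitive to vector length. If ranking by cosine similarity is preferred, one common approach is to scale all vectors to unit length and rank by inner product. The sketch below shows the idea in plain NumPy on toy data; the function name cosine_top_k and the vectors are hypothetical, and this variant is not part of the original notebook.

```python
import numpy as np

def cosine_top_k(vectors, query, k):
    """Return the indices of the k rows most similar to `query`, plus all scores."""
    # Scale every vector to unit length; the inner product is then the cosine.
    # Assumes no vector has zero norm (that would divide by zero).
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query
    # Highest cosine similarity first.
    return np.argsort(-scores)[:k], scores

toy_vectors = np.array([[1.0, 0.0], [10.0, 1.0], [0.0, 2.0]], dtype="float32")
toy_query = np.array([2.0, 0.0], dtype="float32")
top, scores = cosine_top_k(toy_vectors, toy_query, k=2)
print(top.tolist())  # [0, 1]
```

Here the second vector is far from the query in L2 terms but points in almost the same direction, so it still ranks second by cosine similarity.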


In summary, this code performs a semantic search for query_text using pre-trained GloVe word embeddings and the FAISS library. The query is first converted into a vector, query_embedding, by averaging the embeddings of its individual words. The index, built from the five example texts, is then searched for the five nearest texts by L2 distance, and the results are printed from most to least similar. Because the query is identical to the third example text, that text is returned first.