## FAISS

This notebook uses a collection of real-world English text data curated for practicing with FAISS (Facebook AI Similarity Search). FAISS is a library for similarity search and clustering of dense vectors, commonly used in applications such as information retrieval, recommendation systems, and image search.

### Setting Up

This section imports the necessary libraries, loads the data, and makes sure the IDs start at 0.

In [2]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
import faiss

# Load the dataset and keep only the columns used below.
df = pd.read_csv('../data/similarity_search.csv')
df = df[['text', 'id']]

if len(df) != 0:
  print(f'Dataframe imported successfully with a shape of {df.shape} 🎉')

# The IDs are expected to be zero-based, so shift one-based IDs down by one.
if df.id.min() == 1:
  df['id'] = df['id'] - 1
elif df.id.min() == 0:
  print('ID starting with zero!')

display(df.head(5))

Dataframe imported successfully with a shape of (280, 2) 🎉
ID starting with zero!


| | text | id |
|---|---|---|
| 0 | The COVID-19 pandemic has had a significant im... | 0 |
| 1 | Artificial intelligence is transforming variou... | 1 |
| 2 | Social media platforms play a crucial role in ... | 2 |
| 3 | Renewable energy sources like solar and wind p... | 3 |
| 4 | Cryptocurrencies such as Bitcoin have gained w... | 4 |


### Vectorize text into embedding vectors

Now that we have our data, I'll use SentenceTransformer to load a language model and turn our texts into embedding vectors.

In [9]:
model = SentenceTransformer(
  'distilbert-base-nli-stsb-mean-tokens', 
  device='cpu',
  cache_folder='../data/cache/'
)

In [13]:
texts = df.text.values.tolist()
texts[:5]

['The COVID-19 pandemic has had a significant impact on global economies.',
 'Artificial intelligence is transforming various industries, including healthcare and finance.',
 'Social media platforms play a crucial role in connecting people around the world.',
 'Renewable energy sources like solar and wind power are essential for a sustainable future.',
 'Cryptocurrencies such as Bitcoin have gained widespread attention and adoption.']

In [16]:
embeddings = model.encode(texts)
embeddings[:3]

array([[ 0.947502  , -1.0846487 , -0.2284832 , ..., -0.06836055,
        -0.14919937,  0.66071904],
       [ 0.28361663, -0.1461976 ,  0.76421636, ..., -0.09583422,
        -0.00354805,  0.03140324],
       [ 0.04713582, -0.09198026,  0.03990472, ..., -0.05552263,
        -1.0880418 , -0.33173525]], dtype=float32)
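
FAISS works on 2-D `float32` arrays, which matches the `dtype=float32` shown in the output above. As a minimal sketch of a sanity check that could run before indexing, the snippet below uses random stand-in data instead of the real embeddings; the helper name `as_faiss_input` is hypothetical and not part of FAISS or this notebook.

```python
import numpy as np

def as_faiss_input(vectors) -> np.ndarray:
  """Return a C-contiguous float32 matrix of shape (n_vectors, dim)."""
  matrix = np.ascontiguousarray(vectors, dtype=np.float32)
  if matrix.ndim != 2:
    raise ValueError(f'expected a 2-D array, got shape {matrix.shape}')
  return matrix

# Stand-in for model.encode(texts): 4 vectors of dimension 8, created as float64.
fake_embeddings = np.random.default_rng(0).normal(size=(4, 8))
checked = as_faiss_input(fake_embeddings)
print(checked.shape, checked.dtype)
```

With the real data, `model.encode(texts)` already returns `float32` by default, so a check like this mostly guards against arrays that were converted or sliced somewhere along the way.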

### Saving the embeddings to FAISS index

Now that we have our embeddings, I'll build a FAISS index from them. The embeddings are L2-normalized first, so that the inner product computed by `IndexFlatIP` is equal to the cosine similarity.

In [37]:
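# Keep `id` both as the index (for lookups by ID) and as a regular column.
df_to_index = df.set_index(["id"], drop=False)
# FAISS expects 64-bit integer IDs.
id_index = df_to_index.id.to_numpy().astype(np.int64)

# Normalize a copy in place so the raw embeddings stay untouched.
normalized_embeddings = embeddings.copy()
faiss.normalize_L2(normalized_embeddings)
# Exact (brute-force) inner-product index over vectors of this dimension.
index_flat = faiss.IndexFlatIP(embeddings.shape[1])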

In [38]:
# Wrap the flat index so each vector is stored under its own ID from the dataframe.
index_content = faiss.IndexIDMap(index_flat)
index_content.add_with_ids(normalized_embeddings, id_index)
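
The reason for normalizing before using an inner-product index can be checked with plain NumPy. The sketch below uses two random stand-in vectors (not the notebook's embeddings) and compares cosine similarity on the raw vectors with the inner product of their unit-length versions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8).astype(np.float32)
b = rng.normal(size=8).astype(np.float32)

# Cosine similarity computed directly from the raw vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Scale each vector to unit L2 norm, which is what faiss.normalize_L2
# does in place for every row of a matrix.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

# The two values agree up to floating-point rounding.
print(np.isclose(cosine, inner, atol=1e-6))
```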

### Defining the query function

To search the index, a query has to go through the same preprocessing as the stored texts: encoding with the SentenceTransformer model, followed by L2 normalization. Let's create a function to handle this.

In [53]:
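def search(query: str, k: int = 5) -> pd.DataFrame:
  # Encode and normalize the query the same way as the indexed texts.
  vector = model.encode([query])
  faiss.normalize_L2(vector)

  # search() returns (similarities, ids), each of shape (1, k) for a single query.
  similarities, ids = index_content.search(vector, k)
  ids = ids[0].tolist()
  similarities = similarities[0].tolist()

  print(f'Query: {query}')

  # Copy so that adding a column does not touch df_to_index.
  results = df_to_index.loc[ids].copy()
  results['similarity'] = similarities

  return results.reset_index(drop=True)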
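
Conceptually, a flat inner-product index compares the query against every stored vector and keeps the best `k`. The sketch below reproduces that idea in plain NumPy on toy data; the function `brute_force_search` and the toy vectors and IDs are illustrative assumptions, not FAISS internals.

```python
import numpy as np

def brute_force_search(unit_vectors, ids, unit_query, k):
  """Exhaustive inner-product search; returns (scores, ids) of the k best rows."""
  scores = unit_vectors @ unit_query   # one inner product per stored vector
  top = np.argsort(-scores)[:k]        # positions of the k highest scores
  return scores[top], ids[top]

# Toy data: three unit-length vectors in 2-D with arbitrary IDs.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
ids = np.array([10, 20, 30], dtype=np.int64)
query = np.array([1.0, 0.0], dtype=np.float32)

scores, found = brute_force_search(vectors, ids, query, k=2)
print(found.tolist(), scores.tolist())
```

The vector identical to the query comes first with a similarity of about 1, followed by the one at a smaller angle to it, which is the same ordering a normalized inner-product search is expected to produce.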

### Querying

Alright. If I did everything correctly, it should return related texts for us :)

In [54]:
search('I want to buy a car', 3)[['id', 'text', 'similarity']]

Query: I want to buy a car


| | id | text | similarity |
|---|---|---|---|
| 0 | 267 | The potential of 3D printing in the automotive... | 0.238573 |
| 1 | 69 | The development of smart cities requires colla... | 0.238199 |
| 2 | 73 | The future of transportation lies in autonomou... | 0.207966 |


In [55]:
search('Artificial Intelligence', 3)[['id', 'text', 'similarity']]

Query: Artificial Intelligence


| | id | text | similarity |
|---|---|---|---|
| 0 | 1 | Artificial intelligence is transforming variou... | 0.605489 |
| 1 | 159 | The impact of automation on the transportation... | 0.584801 |
| 2 | 250 | The impact of automation on the entertainment ... | 0.57947 |


In [59]:
search('Diversity is important', 5)[['id', 'text', 'similarity']]

Query: Diversity is important


| | id | text | similarity |
|---|---|---|---|
| 0 | 23 | The importance of diversity and inclusion in o... | 0.599246 |
| 1 | 88 | The need for accessible and inclusive design i... | 0.534014 |
| 2 | 67 | The exploration of deep-sea ecosystems reveals... | 0.526486 |
| 3 | 64 | The ethical implications of gene editing and C... | 0.479641 |
| 4 | 155 | The potential of genetic engineering in agricu... | 0.475088 |
