<a href="https://colab.research.google.com/github/pedroramos17/Bootstrap/blob/main/Recommendation_Algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Different ANN Algorithms
## 1. HNSWLIB (Proximity Graph)
## 2. Google ScaNN (Vector Compression)
## 3. Spotify Annoy (Trees)
## 4. Faiss (Clustering)

**HNSWLIB**
HNSW (Hierarchical Navigable Small World) is a graph-based algorithm used for efficient approximate nearest neighbor (ANN) search in high-dimensional spaces.

* `hnswlib` is a C++ library with Python bindings that implements HNSW.
* It is known for a strong trade-off between speed and accuracy on ANN tasks.
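
Accuracy for ANN search is commonly reported as recall@k: the fraction of the true k nearest neighbors that the approximate index also returns. The sketch below is a minimal illustration, assuming the exact neighbor ids are available from a brute-force search. The function name `recall_at_k` and the sample ids are hypothetical and do not come from any of the libraries used here.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors found in the approximate top-k."""
    approx_top = set(approx_ids[:k])
    exact_top = set(exact_ids[:k])
    return len(approx_top & exact_top) / k


# Hypothetical ids: the approximate index missed one true neighbor (id 7).
exact = [2, 1, 4, 7, 29]
approx = [2, 1, 4, 5, 29]
print(recall_at_k(approx, exact, k=5))  # 0.8
```

A recall of 1.0 means the approximate index returned exactly the same neighbors as the exhaustive search.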



In [1]:
!pip install sentence_transformers hnswlib

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting hnswlib
  Downloading hnswlib-0.8.0.tar.gz (36 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [5]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

In [6]:
paragraphs = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial Intelligence has been a subject of intrigue for decades.",
    "The history of Python dates back to the late 1980s.",
    "Machine learning and deep learning drive modern AI.",
    "Natural language processing enables computers to understand human language."
]

In [8]:
## generate embeddings
embeddings = model.encode(paragraphs)

In [9]:
embeddings

array([[ 0.04393357,  0.05893441,  0.04817837, ...,  0.05216278,
         0.05610648,  0.10206395],
       [-0.03982176, -0.04618942,  0.05204437, ...,  0.04973065,
        -0.02232705, -0.03882542],
       [-0.07098128,  0.06234046, -0.01155042, ...,  0.08019316,
         0.12593739, -0.00557179],
       [-0.04370016, -0.02541894,  0.07470033, ...,  0.00048776,
        -0.01091537, -0.02558111],
       [ 0.01383319,  0.01758451,  0.09789698, ...,  0.0888078 ,
         0.04175095, -0.05655895]], dtype=float32)

In [10]:
import hnswlib

In [12]:
dim = embeddings.shape[1]
dim

384

In [13]:
index = hnswlib.Index(space='l2', dim=dim)

In [14]:
num_elements = len(paragraphs)
num_elements

5

In [15]:
index.init_index(max_elements = num_elements, ef_construction = 200, M=16)

In [16]:
## Add embeddings into the index
index.add_items(embeddings)

## Querying the index

In [17]:
query_sentence = """Programming languages like Python have revolutionized software development."""

In [18]:
query_embeddings = model.encode([query_sentence])

In [None]:
query_embeddings

In [20]:
# Query the index
k = 2
labels, distances = index.knn_query(query_embeddings, k=k)

In [22]:
print("Query:", query_sentence)
for label, distance in zip(labels[0], distances[0]):
  print(f"Paragraph: {paragraphs[label]}, Distance: {distance}")

Query: Programming languages like Python have revolutionized software development.
Paragraph: The history of Python dates back to the late 1980s., Distance: 0.7363049387931824
Paragraph: Artificial Intelligence has been a subject of intrigue for decades., Distance: 1.315548062324524


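### Results interpretation

Lower score = closer match. With `space='l2'`, hnswlib reports the squared Euclidean distance, so smaller values mean more similar paragraphs.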

## Google ScaNN

ScaNN stands for Scalable Nearest Neighbors.

It is designed for efficient vector similarity search at scale and is suited to very large datasets. It uses techniques such as vector compression and quantization to speed up the search.

Features:

1. Partitioning: Dividing the dataset into smaller, more manageable clusters.

2. Quantization: Approximating high-dimensional vectors with compact codes.

3. Scoring: Computing approximate distances efficiently on the quantized vectors, optionally followed by exact re-ranking of the top candidates.
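
The partitioning and scoring steps can be sketched in plain Python. This is a toy illustration only, not ScaNN's implementation: the centroids are hand-picked rather than learned, the quantization step is omitted, and the helper names `build_partitions` and `search` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_partitions(vectors, centroids):
    """Assign each vector id to the centroid with the highest dot product."""
    partitions = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        partitions[best].append(idx)
    return partitions


def search(query, vectors, centroids, partitions, leaves_to_search, k):
    """Score only the vectors in the partitions closest to the query."""
    ranked_leaves = sorted(
        range(len(centroids)),
        key=lambda c: dot(query, centroids[c]),
        reverse=True,
    )
    candidates = [i for c in ranked_leaves[:leaves_to_search] for i in partitions[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k]


# Toy 2-D data with two hand-picked centroids (assumed, not learned).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
partitions = build_partitions(vectors, centroids)
print(partitions)  # {0: [0, 1], 1: [2, 3]}
print(search([0.8, 0.2], vectors, centroids, partitions, leaves_to_search=1, k=2))  # [0, 1]
```

Searching fewer partitions is faster but can miss neighbors that fall in an unvisited partition, which is the speed versus accuracy trade-off of this family of methods.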

In [None]:
!pip install scann

In [27]:
extended_paragraphs = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial Intelligence has been a subject of intrigue for decades.",
    "The history of Python dates back to the late 1980s.",
    "Machine learning and deep learning drive modern AI.",
    "Natural language processing enables computers to understand human language.",
    "Climate change is impacting ecosystems worldwide.",
    "The exploration of Mars has revealed much about the history of the planet.",
    "Quantum computing holds the potential to revolutionize technology.",
    "Blockchain technology is transforming digital transactions.",
    "Renewable energy sources are becoming increasingly vital.",
    "The study of genetics has advanced our understanding of human health.",
    "Autonomous vehicles could reshape transportation.",
    "Virtual reality offers new possibilities in gaming and education.",
    "The Internet of Things connects everyday devices to the web.",
    "Cybersecurity is crucial in the digital age.",
    "3D printing is changing manufacturing processes.",
    "Augmented reality blends the digital and physical worlds.",
    "Big data analytics helps in understanding complex patterns.",
    "The human brain is an intricate organ still being studied.",
    "Advances in medicine are prolonging life expectancy.",
    "Robotics technology is automating various industries.",
    "Sustainable farming practices are essential for food security.",
    "The psychology of motivation is a complex study.",
    "Urban planning is key to sustainable city development.",
    "The art world continuously evolves with cultural shifts.",
    "Conservation efforts are crucial for endangered species.",
    "Space travel has always captivated human imagination.",
    "Nanotechnology is finding applications in multiple fields.",
    "Philosophy challenges our understanding of existence.",
    "Mathematics is the language of the universe.",
    "Literature reflects the human condition in diverse ways.",
    "The history of cinema offers insight into cultural changes.",
    "Music is a universal language that transcends boundaries.",
    "Photography captures moments and tells stories.",
    "Oceanography is essential for understanding marine ecosystems.",
    "Sports bring people together across the world.",
    "Cooking is both an art and a science.",
    "Fashion reflects both personal style and cultural trends.",
    "Architecture combines functionality with aesthetics.",
    "Linguistics explores the structure and evolution of language.",
    "Anthropology studies human societies and cultural diversity.",
    "The immune system is vital for human health.",
    "Astrophysics seeks to understand the universe's workings.",
    "Psychiatry plays a crucial role in mental health.",
    "Geology helps us understand Earth's history and future.",
    "The digital revolution has transformed how we communicate.",
    "Environmental science is key to addressing ecological issues.",
    "Astronomy has been practiced since ancient times.",
    "Sociology examines the behavior of societies.",
    "Biotechnology is advancing the capabilities of medical treatment."
]

In [28]:
embeddings = np.array(model.encode(extended_paragraphs))

In [29]:
print("Number of embeddings:", embeddings.shape[0])

Number of embeddings: 50


In [30]:
embeddings

array([[ 0.04393357,  0.05893441,  0.04817837, ...,  0.05216278,
         0.05610648,  0.10206395],
       [-0.03982176, -0.04618944,  0.05204437, ...,  0.04973065,
        -0.02232707, -0.03882541],
       [-0.07098126,  0.06234047, -0.01155042, ...,  0.08019315,
         0.12593739, -0.00557179],
       ...,
       [ 0.02637684,  0.02016133,  0.02142278, ...,  0.00266602,
        -0.051114  ,  0.02646729],
       [-0.00185669,  0.02726151, -0.09919582, ...,  0.04235468,
         0.06581975, -0.07125171],
       [ 0.01416314,  0.00176452, -0.04424744, ..., -0.0820567 ,
         0.13389045, -0.03259734]], dtype=float32)

In [31]:
import scann

In [37]:
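# Build a ScaNN searcher: partition the data into leaves, score with
# asymmetric hashing, then re-rank the top 100 candidates exactly.
searcher = (
    scann.scann_ops_pybind.builder(embeddings, 10, "dot_product")
    .tree(num_leaves=10, num_leaves_to_search=5, training_sample_size=50)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)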

In [33]:
query_embeddings = model.encode([query_sentence])

In [None]:
query_embeddings

In [35]:
query_embeddings = query_embeddings.reshape(-1)

In [38]:
neighbors, distances = searcher.search(query_embeddings, final_num_neighbors=5)

In [42]:
print("Query:", query_sentence)
print("Nearest Neighbors:")

# With "dot_product", ScaNN returns similarity scores: higher means closer.
for idx, (neighbor, score) in enumerate(zip(neighbors, distances)):
  paragraph = extended_paragraphs[neighbor]
  print(f"{idx + 1}: Paragraph: '{paragraph}', Score: {score}")

Query: Programming languages like Python have revolutionized software development.
Nearest Neighbors:
1: Paragraph: 'The history of Python dates back to the late 1980s.', Score: 0.6318474411964417
2: Paragraph: 'Artificial Intelligence has been a subject of intrigue for decades.', Score: 0.34222590923309326
3: Paragraph: 'Natural language processing enables computers to understand human language.', Score: 0.3414521813392639
4: Paragraph: 'Quantum computing holds the potential to revolutionize technology.', Score: 0.3374345898628235
5: Paragraph: 'Mathematics is the language of the universe.', Score: 0.33381569385528564


## Spotify Annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings.

It was developed at Spotify, where it is used for music recommendations.

In [43]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.17.3.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.3-cp310-cp310-linux_x86_64.whl size=552450 sha256=0e197253bccbed12eea6c0e1e5a4230bb95cef6ca0fc540196776656f0eb96f2
  Stored in directory: /root/.cache/pip/wheels/64/8a/da/f714bcf46c5efdcfcac0559e63370c21abe961c48e3992465a
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.3


In [44]:
from annoy import AnnoyIndex

## Building the Annoy index

Annoy requires you to specify the dimensionality of the vectors and the distance metric (such as "angular" or "euclidean").

In [45]:
f = embeddings.shape[1]
t = AnnoyIndex(f, 'angular')

In [46]:
for i, vector in enumerate(embeddings):
  t.add_item(i, vector)

In [47]:
t.build(10)  # build a forest of 10 trees
t.save('test.ann')  # persist the index to disk

True

## Querying the index

In [48]:
u = AnnoyIndex(f, 'angular')
u.load('test.ann')

True

In [49]:
n_neighbors = 5
nearest_neighbors_indices, distances = u.get_nns_by_vector(query_embeddings, n_neighbors, include_distances=True)

In [51]:
print("Query:", query_sentence)

for i, (neighbor_idx, distance) in enumerate(zip(nearest_neighbors_indices, distances)):
  paragraph = extended_paragraphs[neighbor_idx]
  print(f"{i + 1}: Paragraph: '{paragraph}', Distance: {distance}")

Query: Programming languages like Python have revolutionized software development.
1: Paragraph: 'The history of Python dates back to the late 1980s.', Distance: 0.8580821752548218
2: Paragraph: 'Artificial Intelligence has been a subject of intrigue for decades.', Distance: 1.1469734907150269
3: Paragraph: 'Natural language processing enables computers to understand human language.', Distance: 1.1476478576660156
4: Paragraph: 'Quantum computing holds the potential to revolutionize technology.', Distance: 1.1511433124542236
5: Paragraph: 'Mathematics is the language of the universe.', Distance: 1.1542826890945435
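
The three libraries report different numbers for the same top match (0.6318 from ScaNN, 0.7363 from hnswlib, 0.8581 from Annoy) because each uses a different metric. The sketch below shows how they relate, assuming the embeddings are unit-length, which the reported values are consistent with. For unit vectors, the squared L2 distance is `2 - 2 * dot`, and Annoy's "angular" metric is `sqrt(2 * (1 - cos))`.

```python
import math

# Dot-product score reported by ScaNN for the top match in this notebook.
dot_score = 0.6318474411964417

# For unit-length vectors: ||u - v||^2 = 2 - 2 * (u . v)
squared_l2 = 2 - 2 * dot_score
# Annoy's "angular" metric is sqrt(2 * (1 - cos(u, v))).
angular = math.sqrt(2 * (1 - dot_score))

print(round(squared_l2, 4))  # 0.7363
print(round(angular, 4))     # 0.8581
```

The rankings therefore agree across the three indexes even though the raw values differ: a higher dot product corresponds to a lower squared L2 and a lower angular distance.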


## FAISS

In [52]:
!pip install faiss_cpu

Collecting faiss_cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Installing collected packages: faiss_cpu
Successfully installed faiss_cpu-1.7.4


In [53]:
import faiss

In [54]:
embeddings = np.array(model.encode(extended_paragraphs)).astype('float32')

In [55]:
d = embeddings.shape[1]

In [56]:
# Flat index: exhaustive (exact) search that returns squared L2 distances
index = faiss.IndexFlatL2(d)
index.add(embeddings)

In [57]:
query_embedding = np.array(model.encode([query_sentence])).astype('float32')

In [58]:
k = 5
D, I = index.search(query_embedding, k)

In [62]:
print("Query:", query_sentence)

for i in range(k):
  print(f"{i + 1}: Paragraph: '{extended_paragraphs[I[0][i]]}', Distance: {D[0][i]}")

Query: Programming languages like Python have revolutionized software development.
1: Paragraph: 'The history of Python dates back to the late 1980s.', Distance: 0.7363049983978271
2: Paragraph: 'Artificial Intelligence has been a subject of intrigue for decades.', Distance: 1.3155479431152344
3: Paragraph: 'Natural language processing enables computers to understand human language.', Distance: 1.3170955181121826
4: Paragraph: 'Quantum computing holds the potential to revolutionize technology.', Distance: 1.325130820274353
5: Paragraph: 'Mathematics is the language of the universe.', Distance: 1.3323683738708496
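
`IndexFlatL2` does no approximation: it compares the query against every stored vector and returns squared L2 distances. A minimal pure-Python sketch of that idea follows. The helper names `squared_l2` and `flat_search` and the toy vectors are hypothetical, and this is not how FAISS is implemented internally.

```python
def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def flat_search(query, vectors, k):
    """Exhaustively compare the query with every vector; return (distances, ids)."""
    scored = sorted((squared_l2(query, vec), idx) for idx, vec in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D vectors (hypothetical data, not the notebook's embeddings).
vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
distances, ids = flat_search([0.0, 0.0], vectors, k=2)
print(ids)        # [0, 2]
print(distances)  # [0.0, 2.0]
```

An exhaustive search like this is a useful ground truth for checking the results of the approximate indexes, at the cost of scanning the whole dataset for every query.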
