<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Similarity Search by Hand**


### Install Required Libraries

Run the cell below to install the required libraries for this lab.


In [None]:
!pip install sentence-transformers==4.1.0 | tail -n 1

### Restart the Kernel

After the cell above finishes running and the required libraries are installed, restart the kernel by clicking the `Restart the kernel` button shown in the image below, or by selecting `Kernel` --> `Restart kernel...` from the menu bar.

![restart-jupyterlab-kernel.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/0oK423hHPyQ0BhXaflSvyg/restart-jupyterlab-kernel.png)


### Import Required Libraries


For this lab, we will be using the following external libraries:


*   [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.
*   [`scipy`](https://scipy.org/) for additional mathematical operations not found in `numpy`.
*   [`torch`](https://pytorch.org/) for vector operations, although this library is more commonly used for deep learning.
*   [`sentence_transformers`](https://sbert.net/) for obtaining vector embeddings from text data using pre-trained models.

Run the cell below to import these libraries, along with the built-in [`math`](https://docs.python.org/3/library/math.html) module from Python’s standard library:


In [None]:
import math

import numpy as np
import scipy
import torch
from sentence_transformers import SentenceTransformer


----


## Obtain Vector Embeddings


To calculate distance and similarity metrics, we first need to generate vector embeddings for some text documents. Let’s begin by defining a few example documents:


In [None]:
# Example documents
documents = [
    'Bugs introduced by the intern had to be squashed by the lead developer.',
    'Bugs found by the quality assurance engineer were difficult to debug.',
    'Bugs are common throughout the warm summer months, according to the entomologist.',
    'Bugs, in particular spiders, are extensively studied by arachnologists.'
]


As shown above, there are four example documents, each consisting of a single sentence. These sentences are intentionally designed to be challenging for semantic similarity search.

All four sentences begin with the word "Bugs," but they refer to different meanings of the word depending on context. The first two sentences relate to software bugs in programming, while the last two refer to physical bugs, such as insects or spiders.

The key to distinguishing between these meanings lies in the context, particularly the type of professional mentioned in each sentence. For example, if we replaced the word "arachnologists" (scientists who study spiders and other arachnids) in the last sentence with "lead developers," the sentence would instead refer to programming bugs.

Let's now define a model that will embed these text documents into numerical vectors:


In [None]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

The code above creates an instance of the `SentenceTransformer` class from the `sentence_transformers` library. This class loads a pre-trained model that can be used to generate vector embeddings.

In this example, we’re using the [paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2) model. It was trained on pairs of paraphrased sentences, with the goal of generating similar embeddings for sentences that express the same meaning. While the model was originally designed for paraphrase identification, it also performs well on general semantic similarity tasks, like the ones we’re exploring in this lab.

Let's now use the model instance to encode the documents into embedding vectors:


In [None]:
# Generate embeddings
embeddings = model.encode(documents)

The function below implements the L2 (Euclidean) distance formula, $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$. It first calculates the sum of the squared differences between corresponding elements, then returns the square root of that sum:


In [None]:
def euclidean_distance_fn(vector1, vector2):
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)
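
As a quick sanity check, the function can be compared against NumPy's built-in norm on a pair of vectors whose distance is known in advance. This is only a sketch: the toy vectors `a` and `b` are hypothetical, and the function is repeated so that the cell runs on its own.

```python
import math

import numpy as np


def euclidean_distance_fn(vector1, vector2):
    # Same definition as in the cell above, repeated so this sketch is self-contained
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)


# Hypothetical toy vectors forming a 3-4-5 right triangle
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manual = euclidean_distance_fn(a, b)
reference = np.linalg.norm(a - b)  # L2 norm of the difference vector

print(manual, reference)
```

Both values should agree, since `np.linalg.norm` of the difference vector is the same quantity computed without an explicit Python loop.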

#### Exercise 1 - Make the manual calculation more efficient


The code used to populate the `l2_dist_manual` array is not very efficient. First, it redundantly calculates the distance between a vector and itself, even though the L2 distance in such cases is always zero. Second, the array is symmetric, meaning the distance between vectors at indices $i$ and $j$ is the same as between $j$ and $i$. Therefore, each distance only needs to be computed once. Rewrite the calculation so that it avoids both kinds of redundant work.


In [None]:
n = embeddings.shape[0]
l2_dist_manual_improved = np.zeros([n, n])
for i in range(n):
    # Only visit the upper triangle (j > i); the diagonal stays at zero
    for j in range(i + 1, n):
        dist = euclidean_distance_fn(embeddings[i], embeddings[j])
        l2_dist_manual_improved[i, j] = dist
        # Mirror the value, since the distance matrix is symmetric
        l2_dist_manual_improved[j, i] = dist

l2_dist_manual_improved
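
The same idea of computing each pair only once is also available in `scipy`. The sketch below is one possible approach, not the lab's reference solution: it uses `pdist` and `squareform` from `scipy.spatial.distance` on a small hypothetical array, `toy_embeddings`, which stands in for `embeddings`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for `embeddings`: 4 vectors with 3 components each
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# pdist computes each pairwise distance once (the upper triangle only);
# squareform expands that condensed result into a symmetric square matrix
l2_dist_toy = squareform(pdist(toy_embeddings, metric='euclidean'))

l2_dist_toy
```

The result has zeros on the diagonal and is symmetric, which are the two properties the exercise exploits by hand.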

The following calculates the L2 norms for all the vectors in the `embeddings` array. The calculation squares each vector component, sums the squares within each row (note the `axis=1` argument to `np.sum`), and takes the square root:


In [None]:
# L2 norms
l2_norms = np.sqrt(np.sum(embeddings**2, axis=1))
l2_norms

Note that the result is a vector of 4 numbers, one per embedding, each giving that vector's L2 norm (its magnitude). To normalize the vectors to a length of one, we divide each vector's components by its norm. To do that efficiently with broadcasting, let's reshape the `l2_norms` vector into a 4x1 array:


In [None]:
# L2 norms reshaped
l2_norms_reshaped = l2_norms.reshape(-1,1)
l2_norms_reshaped

The following code calculates normalized embedding vectors by dividing every component in the vector by the vector's L2 norm:


In [None]:
normalized_embeddings_manual = embeddings/l2_norms_reshaped
normalized_embeddings_manual
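
To see why the reshape matters, the sketch below repeats the same steps on a small hypothetical array, `toy`, where the results are easy to check by eye. It also shows an alternative that is an assumption on our part rather than the lab's method: `np.linalg.norm` with `keepdims=True` returns the column shape directly.

```python
import numpy as np

# Hypothetical toy array: 3 vectors with 2 components each
toy = np.array([
    [3.0, 4.0],
    [0.0, 2.0],
    [1.0, 0.0],
])

norms = np.sqrt(np.sum(toy**2, axis=1))  # shape (3,)
norms_col = norms.reshape(-1, 1)         # shape (3, 1)

# (3, 2) / (3, 1): broadcasting divides every row by that row's own norm
normalized_toy = toy / norms_col

# keepdims=True yields the (3, 1) shape without a separate reshape
normalized_alt = toy / np.linalg.norm(toy, axis=1, keepdims=True)

print(norms.shape, norms_col.shape)
print(normalized_toy)
```

Dividing by the flat `norms` vector of shape `(3,)` would not work here: NumPy would try to align it with the last axis of `toy`, which has length 2.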

##### Exercise 2 - Verify that vectors are normalized


Verify that `normalized_embeddings_manual` are normalized vectors by making sure that the length of each vector is equal to 1.


In [None]:
np.sqrt(np.sum(normalized_embeddings_manual**2, axis=1))
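
When the vectors are stored as 32-bit floats, the computed lengths may differ from 1 by a tiny rounding error, so an exact equality check can fail. A minimal sketch of a tolerance-based check, using hypothetical random data in place of the real embeddings:

```python
import numpy as np

# Hypothetical stand-in for the embeddings: 4 random float32 vectors
rng = np.random.default_rng(0)
toy = rng.normal(size=(4, 8)).astype(np.float32)

# Normalize each row to unit L2 length
normalized = toy / np.linalg.norm(toy, axis=1, keepdims=True)

# Compare the lengths to 1 within a small tolerance instead of exactly
lengths = np.linalg.norm(normalized, axis=1)
is_unit = np.allclose(lengths, 1.0)

print(lengths)
print(is_unit)
```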

##### Exercise 3 - Similarity Search Using a Query


In the above examples, we calculated similarity between 4 documents:

```python
documents = [
    'Bugs introduced by the intern had to be squashed by the lead developer.',
    'Bugs found by the quality assurance engineer were difficult to debug.',
    'Bugs are common throughout the warm summer months, according to the entomologist.',
    'Bugs, in particular spiders, are extensively studied by arachnologists.'
]
```

Now, your task is to find which of these 4 documents is most similar to the query `Who is responsible for a coding project and fixing others' mistakes?` using cosine similarity. You can reuse the `documents` list and the `normalized_embeddings_manual` array in your answer:


In [None]:
# Embed the query; passing a list returns a 2D array with one row
query = "Who is responsible for a coding project and fixing others' mistakes?"
query_embedding = model.encode([query])

# Normalize the query embedding to unit L2 length
normalized_query_embedding = torch.nn.functional.normalize(
    torch.from_numpy(query_embedding)
).numpy()

# Calculate the cosine similarity between the query embedding and the embeddings
cosine_similarity_query = normalized_embeddings_manual @ normalized_query_embedding.T

# Get highest cosine similarity
highest_position = cosine_similarity_query.argmax()

# Find document with highest cosine similarity
documents[highest_position]
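
The solution above returns only the single best match. In a search setting it is often useful to rank every document by its score. The sketch below shows one way to do that with `np.argsort`; the names `docs`, `doc_vecs`, and `query_vec` are hypothetical 2D toy data, not the lab's embeddings.

```python
import numpy as np

# Hypothetical toy data: three "documents" and one "query" in 2D
docs = ['doc A', 'doc B', 'doc C']
doc_vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
query_vec = np.array([[2.0, 1.0]])

# Normalize rows to unit length so that the dot product equals cosine similarity
doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_unit = query_vec / np.linalg.norm(query_vec, axis=1, keepdims=True)

# One cosine similarity score per document, flattened to a 1D array
scores = (doc_unit @ query_unit.T).ravel()

# Indices sorted from most to least similar
ranking = np.argsort(scores)[::-1]

for idx in ranking:
    print(f'{scores[idx]:.3f}  {docs[idx]}')
```

Here `doc C` ranks first because its direction is closest to the query's, even though none of the toy vectors matches the query exactly.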


Copyright © IBM Corporation. All rights reserved.
