# An Experiment with Similarity Search and Embeddings

We query a dataset that contains sentences both semantically similar and dissimilar to the query sentence. Only the known similar sentences should be returned in the query results. We assign an accuracy score based on the actual versus expected results.

We conduct the experiment using the following tools:

Embeddings: Hugging Face 'all-MiniLM-L6-v2'
Similarity metrics: 
    a) Cosine and b) Dot product functions, used with semantic_search in sentence_transformers.util 
    c) Euclidean distance with numpy's linalg.norm
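
For reference, the three similarity measures listed above can be sketched directly in numpy. This is a minimal illustration on toy 2-dimensional vectors, not the code path used by sentence_transformers; the function names `cosine_similarity`, `dot_product`, and `euclidean_distance` are chosen here for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; higher means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    """Unnormalized dot product; sensitive to vector length as well as direction."""
    return float(np.dot(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; lower means more similar."""
    return float(np.linalg.norm(a - b))

# Toy vectors (hypothetical, not real embeddings)
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])
c = np.array([3.0, 0.0])

print(cosine_similarity(a, c))   # a and c point the same way
print(cosine_similarity(a, b))   # a and b are orthogonal
print(dot_product(a, c))
print(euclidean_distance(a, c))
```

Note that cosine and dot product are similarities (larger is closer), while Euclidean distance is a distance (smaller is closer), so the sort order is reversed when ranking.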


## Queries

We query the dataset separately for results most similar to the following two sentences:
1. "The deer froze in the headlights of the car."
2. "Dream to solve world's problems."

## The Dataset
The dataset consists of the following groups of sentences:
1. Known similar sentences.
2. Confounding sentences, that is, sentences that may share common words with the query but are semantically dissimilar.
3. Unrelated sentences.

## Known similar sentences
For query sentence 1 ("The deer froze in the headlights of the car."), a known similar sentence is manually created.
  1. The car skidded to stop for the deer that stood frozen in the headlights of the car.

Other similar sentences are obtained from ChatGPT using the following prompt:
  - Generate variations without changing the meaning of the paragraph: "The car skidded to stop for the deer that stood frozen in the headlights of the car". Provide the following variations: 1) Paraphrase, 2) Elaboration, 3) Simplification, 4) Synonym, 5) Summary in 20 words or less.

ChatGPT responses:
1. Paraphrase: The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
1. Elaboration: As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.
1. Simplification: The car skidded and stopped for the frozen deer in its headlights.
1. Synonym: The car skidded to a standstill for the deer that remained motionless in the car's headlights.
1. Summary: The car skidded and stopped to avoid the motionless deer illuminated by its headlights.

For query sentence 2 ("Dream to solve world's problems."), a known similar sentence is manually created.
  1. Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution 

Other similar sentences are obtained from ChatGPT using the following prompt:
  - Generate variations without changing the meaning of the  paragraph:  "Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution ".  Provide the following variations: 1) Paraphrase, 2) Elaboration, 3) Simplification, 4) Synonym, 5) Summary in 15 words or less.

ChatGPT responses:
1. Paraphrase: After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
1. Elaboration: Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing these challenges, further fueling his dedication to scientific pursuits.
1. Simplification: The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
1. Synonym: Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
1. Summary: The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.

## Known dissimilar but close sentences
A few dissimilar sentences were manually created and added:
  1. The Frozen Deer was a popular store in the mountain town that sold supplies for hikers and mountain bikers.
  1. The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
  1. A record cold winter followed the drought. Many malnourished deer froze to death in the region.
  1. The ice cream shop sold the Frozen Dear, a treat that became an inseparable and endearing part of the town’s high school dating.
  1. The hunter proceeded to cut, clean, and freeze the deer. The hunter was lucky to get the deer, as his food supplies for the winter were running low. 

Other dissimilar sentences were generated by ChatGPT on related topics using the prompt:
- Generate 10 sentences on each of the following topics, every sentence being 20 words or less: 1) car accidents, 2) science inventions, 3) deer, 4) challenges facing the world.

ChatGPT response:

Car Accidents:
1. The driver's negligence resulted in a tragic car crash.
1. Poor road conditions contributed to the accident.
1. The collision caused significant damage to both vehicles.
1. Speeding led to a fatal car accident on the highway.
1. The driver's distraction caused a rear-end collision.
1. The car skidded on the icy road, causing a collision with a tree.
1. Drunk driving resulted in a serious car wreck.
1. A sudden brake failure caused the car to crash into a building.
1. The driver lost control of the vehicle, causing a rollover accident.
1. Reckless overtaking led to a head-on collision.

Science Inventions:
1. The invention of the telephone revolutionized communication.
1. The discovery of penicillin transformed medicine.
1. The invention of the light bulb revolutionized modern society.
1. The invention of the computer paved the way for the digital age.
1. The discovery of DNA's structure laid the foundation for genetic research.
1. The invention of the internet connected the world like never before.
1. The development of vaccines saved millions of lives.
1. The invention of the automobile changed transportation forever.
1. The discovery of X-rays revolutionized medical imaging.
1. The invention of the printing press transformed the spread of knowledge.

Deer:
1. The deer gracefully leaped over the fence.
1. The deer cautiously approached the forest clearing.
1. The majestic deer stood still, observing its surroundings.
1. The deer nibbled on the lush grass by the stream.
1. The deer's antlers glistened in the sunlight.
1. The startled deer quickly darted into the dense forest.
1. The deer bounded effortlessly through the meadow.
1. The doe gently nuzzled her fawn in a tender display of affection.
1. The deer cautiously crossed the road, looking out for oncoming vehicles.
1. The hunter patiently waited for a deer to appear within range.

Challenges Facing the World:
1. Climate change poses a grave threat to our planet.
1. Poverty remains a persistent challenge in many regions.
1. Access to clean water is a pressing global issue.
1. Political instability continues to disrupt nations worldwide.
1. The fight against inequality and social injustice persists.
1. The threat of terrorism looms large in many countries.
1. Cybersecurity breaches pose a significant challenge in the digital age.
1. Food security remains a critical concern for many communities.
1. The global pandemic has exposed vulnerabilities in healthcare systems.
1. Finding sustainable energy solutions is crucial for a greener future.


## Expected results and scoring
There are 6 expected similar sentences for each of the two query sentences, as described above.

We score the top n returned results (here, n = 6). If all of them are expected similar sentences, the accuracy is 100%. If any other results appear among the top 6, the accuracy score is calculated as follows:

 ` (n-d) / n`

where d is the number of dissimilar results among the top n returned results.
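
The scoring rule can be written as a small helper. This is a sketch under the assumption that results and expected sentences are identified by their index in the dataset; the function name `accuracy_at_n` and the example ids are hypothetical.

```python
def accuracy_at_n(returned_ids, expected_ids, n=6):
    """Score (n - d) / n, where d counts unexpected results in the top n."""
    if len(returned_ids) < n:
        raise ValueError("need at least n returned results to score")
    top_n = returned_ids[:n]
    d = sum(1 for doc_id in top_n if doc_id not in expected_ids)
    return (n - d) / n

# Illustrative ids: the expected sentences are ids 0..5,
# and one unexpected id (13) appears in the top 6.
expected = set(range(6))
returned = [3, 0, 4, 5, 1, 13, 2]
print(f"{accuracy_at_n(returned, expected):.0%}")
```

Only the top n positions count, so an expected sentence that is returned in position 7 does not improve the score.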

## Populating the dataset

In [1]:
docs = [
    "The car skidded to stop for the deer that stood frozen in the headlights of the car.",
    "The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.",
    "As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.",
    "The car skidded and stopped for the frozen deer in its headlights.",
    "The car skidded to a standstill for the deer that remained motionless in the car's headlights.",
    "The car skidded and stopped to avoid the motionless deer illuminated by its headlights.",
    "Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.",
    "After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.",
    "Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing these challenges, further fueling his dedication to scientific pursuits.",
    "The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.",
    "Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.",
    "The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.",
    "The Frozen Deer was a popular store in the mountain town that sold supplies for hikers and mountain bikers.",
    "The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.",
    "A record cold winter followed the drought. Many malnourished deer froze to death in the region.",
    "The ice cream shop sold the Frozen Dear, a treat that became an inseparable and endearing part of the town’s high school dating.",
    "The hunter proceeded to cut, clean, and freeze the deer. The hunter was lucky to get the deer, as his food supplies for the winter were running low.",
    "The driver's negligence resulted in a tragic car crash.",
    "Poor road conditions contributed to the accident.",
    "The collision caused significant damage to both vehicles.",
    "Speeding led to a fatal car accident on the highway.",
    "The driver's distraction caused a rear-end collision.",
    "The car skidded on the icy road, causing a collision with a tree.",
    "Drunk driving resulted in a serious car wreck.",
    "A sudden brake failure caused the car to crash into a building.",
    "The driver lost control of the vehicle, causing a rollover accident.",
    "Reckless overtaking led to a head-on collision.",
    "The invention of the telephone revolutionized communication.",
    "The discovery of penicillin transformed medicine.",
    "The invention of the light bulb revolutionized modern society.",
    "The invention of the computer paved the way for the digital age.",
    "The discovery of DNA's structure laid the foundation for genetic research.",
    "The invention of the internet connected the world like never before.",
    "The development of vaccines saved millions of lives.",
    "The invention of the automobile changed transportation forever.",
    "The discovery of X-rays revolutionized medical imaging.",
    "The invention of the printing press transformed the spread of knowledge.",
    "The deer gracefully leaped over the fence.",
    "The deer cautiously approached the forest clearing.",
    "The majestic deer stood still, observing its surroundings.",
    "The deer nibbled on the lush grass by the stream.",
    "The deer's antlers glistened in the sunlight.",
    "The startled deer quickly darted into the dense forest.",
    "The deer bounded effortlessly through the meadow.",
    "The doe gently nuzzled her fawn in a tender display of affection.",
    "The deer cautiously crossed the road, looking out for oncoming vehicles.",
    "The hunter patiently waited for a deer to appear within range.",
    "Climate change poses a grave threat to our planet.",
    "Poverty remains a persistent challenge in many regions.",
    "Access to clean water is a pressing global issue.",
    "Political instability continues to disrupt nations worldwide.",
    "The fight against inequality and social injustice persists.",
    "The threat of terrorism looms large in many countries.",
    "Cybersecurity breaches pose a significant challenge in the digital age.",
    "Food security remains a critical concern for many communities.",
    "The global pandemic has exposed vulnerabilities in healthcare systems.",
    "Finding sustainable energy solutions is crucial for a greener future.",
    ]
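
Because the sentences are added to `docs` in a fixed order, each group can be recovered from its index range. The sketch below records those ranges; it assumes the ordering of the list above, and the group names and the helper `group_of` are illustrative labels that are not part of the original notebook.

```python
# Index ranges assume the ordering of the `docs` list above.
GROUPS = {
    "similar_q1": range(0, 6),           # known similar to query 1
    "similar_q2": range(6, 12),          # known similar to query 2
    "manual_dissimilar": range(12, 17),  # manually written dissimilar sentences
    "topic_dissimilar": range(17, 57),   # ChatGPT sentences on the four topics
}

def group_of(doc_id):
    """Return the name of the group that a docs index belongs to."""
    for name, ids in GROUPS.items():
        if doc_id in ids:
            return name
    raise IndexError(f"doc_id {doc_id} is outside the dataset")

# Expected result sets for the two queries
expected_q1 = set(GROUPS["similar_q1"])
expected_q2 = set(GROUPS["similar_q2"])

print(group_of(13))  # index of the "scientific article" sentence
```

These index sets make it possible to score a result list by `corpus_id` instead of comparing sentences by eye.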

In [41]:
!pip3 install python-dotenv



In [43]:
from dotenv import dotenv_values

config = dotenv_values(".env")
hf_token = config['HUGGING_FACE_TOKEN'] # "get your token in http://hf.co/settings/tokens"

## Generate embeddings for the dataset
Obtain the document embeddings from the embedding model. We use Hugging Face 'all-MiniLM-L6-v2'.

In [4]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

In [5]:
import requests

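def query(texts):
    # Request embeddings for one string or a list of strings;
    # wait_for_model asks the API to wait while the model loads
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options":{"wait_for_model":True}})
    response.raise_for_status()  # fail early on an HTTP error instead of returning an error payload
    return response.json()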

In [6]:
hf_embeddings = query(docs)
print(f'number of docs: {len(hf_embeddings)}, dims: {len(hf_embeddings[0])}')

number of docs: 57, dims: 384


In [7]:
import pandas as pd
import torch

hf_embeddings_df = pd.DataFrame(hf_embeddings)
doc_embeddings = torch.from_numpy(hf_embeddings_df.to_numpy()).to(torch.float)

## Semantic search
We use semantic search functionality from the sentence_transformers library.
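
Conceptually, a cosine-based semantic search scores every corpus vector against the query and keeps the k best. The following numpy sketch shows that idea on toy vectors. It is not the library's implementation, and the function name `top_k_cosine` is hypothetical; the returned dictionaries only mirror the `corpus_id` and `score` keys read from the search results in the cells below.

```python
import numpy as np

def top_k_cosine(query_vec, corpus, k):
    """Return the k corpus rows most similar to query_vec by cosine similarity."""
    # Normalize rows and query to unit length so the dot product equals the cosine
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = corpus_unit @ query_unit
    # Sort by descending score and keep the first k indices
    order = np.argsort(-scores)[:k]
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]

# Toy corpus of four 2-dimensional vectors (not real embeddings)
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.1])

hits = top_k_cosine(query_vec, corpus, k=2)
print([hit["corpus_id"] for hit in hits])
```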

In [8]:
!pip install -U sentence-transformers


### First search

In [9]:
q1_sentence = "The deer froze in the headlights of the car."
q1_output = query(q1_sentence)
q1_embeddings = torch.FloatTensor(q1_output)

In [10]:
from sentence_transformers.util import semantic_search, cos_sim, dot_score

hits1 = semantic_search(q1_embeddings, doc_embeddings, top_k=7, score_function=cos_sim)

In [11]:
print(*[docs[hits1[0][i]['corpus_id']] for i in range(len(hits1[0]))], sep='\n')

The car skidded and stopped for the frozen deer in its headlights.
The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.


The top 5 results are expected semantically similar results. The 6th result is a dissimilar sentence and the 7th is an expected one, so the positions of the 6th and 7th results should be reversed.

Accuracy: 5/6 = 83%.

### Second search

In [40]:
q2_sentence = "Dream to solve world's problems."
q2_output = query(q2_sentence)
q2_embeddings = torch.FloatTensor(q2_output)
hits2 = semantic_search(q2_embeddings, doc_embeddings, top_k=7, score_function=cos_sim)
print(*[docs[hits2[0][i]['corpus_id']] for i in range(len(hits2[0]))], sep='\n')

The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing the

5 out of the top 6 results are the expected semantically similar results.

Accuracy: 5/6 = 83%.

### Similarity using dot product function

In [44]:
# query 1
hits1 = semantic_search(q1_embeddings, doc_embeddings, top_k=7, score_function=dot_score)
print(*[docs[hits1[0][i]['corpus_id']] for i in range(len(hits1[0]))], sep='\n')

The car skidded and stopped for the frozen deer in its headlights.
The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.


5 out of the top 6 results are the expected semantically similar results. Accuracy: 5/6 = 83%.

In [47]:
# query 2
hits2 = semantic_search(q2_embeddings, doc_embeddings, top_k=7, score_function=dot_score)
print(*[docs[hits2[0][i]['corpus_id']] for i in range(len(hits2[0]))], sep='\n')

The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing the

5 out of the top 6 results are the expected semantically similar results. Accuracy: 5/6 = 83%.

### Similarity using Euclidean distance
We use numpy's functions to compute Euclidean distance.

In [33]:
import numpy as np
k = 7  # number of nearest neighbours to return
# create an array of N d-dimensional vectors (the search space)
S = hf_embeddings_df.to_numpy()
N, d = S.shape  # 57 documents, 384 dimensions

In [35]:
# query 1
# create a d-dimensional query vector
x = np.array(q1_output)
# compute the distance from x to every document vector
distances = np.linalg.norm(S - x, axis = 1)
# select the indices of the k vectors closest to x;
# passing a range as kth leaves the k smallest distances in sorted order
neighbours = np.argpartition(distances, range(0, k))[:k]
print(*[docs[i] for i in neighbours], sep='\n')

The car skidded and stopped for the frozen deer in its headlights.
The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.


5 out of the top 6 results are the expected semantically similar results. Accuracy: 5/6 = 83%.

In [36]:
# query 2
# create a d-dimensional query vector
x = np.array(q2_output)
# compute the distance from x to every document vector
distances = np.linalg.norm(S - x, axis = 1)
# select the indices of the k vectors closest to x;
# passing a range as kth leaves the k smallest distances in sorted order
neighbours = np.argpartition(distances, range(0, k))[:k]
print(*[docs[i] for i in neighbours], sep='\n')

The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
The invention of the internet connected the world like never before.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
Finding sustainable energy solutions is crucial for a greener future.
The fight against inequality and social injustice persists.


4 out of the top 6 results are the expected semantically similar results; the 4th and 6th results are dissimilar sentences. Accuracy: 4/6 = 67%.
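
The Euclidean results can be cross-checked against the cosine results. For unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so both measures produce the same ranking. Whether the embeddings returned by the API are unit-normalized is an assumption that should be verified on the actual data (for example, by checking the row norms of `S`). The sketch below demonstrates the identity on random unit vectors, not on the notebook's embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings, scaled to unit length
vectors = rng.normal(size=(5, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
q = rng.normal(size=8)
q /= np.linalg.norm(q)

cos = vectors @ q                          # cosine similarity (unit vectors)
dist = np.linalg.norm(vectors - q, axis=1) # Euclidean distance

# Identity for unit vectors: ||v - q||^2 = 2 - 2 * cos(v, q)
identity_holds = np.allclose(dist ** 2, 2 - 2 * cos)

# Highest cosine first versus lowest distance first
rank_by_cos = np.argsort(-cos)
rank_by_dist = np.argsort(dist)

print(identity_holds, np.array_equal(rank_by_cos, rank_by_dist))
```

If the two rankings differ on real data, it is worth checking that the vectors are normalized and that the query vector is the one intended.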