# An Experiment with Embeddings, Vector Database, and Similarity Metrics

We query a dataset that contains sentences both semantically similar and dissimilar to a query sentence. Only the known similar sentences should be returned in the query results. We assign an accuracy score by comparing the actual results with the expected ones.

We conduct the experiment using the following tools:

Embeddings:
  - Hugging Face's `all-MiniLM-L6-v2`
  - OpenAI's `text-embedding-ada-002`

Embeddings are stored and indexed in the Pinecone vector database.

Similarity functions:
  - Cosine
  - Dot product
  - Euclidean distance
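
To make the three similarity functions concrete, here is a minimal pure-Python sketch of each. It is illustrative only and is not how Pinecone computes scores internally; the helper names and the random toy vectors are assumptions introduced for the example. The sketch also checks a well-known property: when every vector has unit length, ranking by cosine similarity or dot product (descending) and ranking by Euclidean distance (ascending) give the same order.

```python
import math
import random

def dot(a, b):
    # sum of element-wise products
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of a vector
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # dot product scaled by both vector lengths
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    # straight-line distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    # rescale a vector to unit length
    n = norm(a)
    return [x / n for x in a]

# toy data (assumption): one unit-length query and five unit-length candidates
random.seed(0)
query = normalize([random.gauss(0, 1) for _ in range(8)])
candidates = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(5)]
ids = range(len(candidates))

# higher is better for cosine and dot product; lower is better for distance
by_cosine = sorted(ids, key=lambda i: cosine_similarity(query, candidates[i]), reverse=True)
by_dot = sorted(ids, key=lambda i: dot(query, candidates[i]), reverse=True)
by_euclidean = sorted(ids, key=lambda i: euclidean_distance(query, candidates[i]))

print(by_cosine == by_dot == by_euclidean)
```

Whether the two embedding models used here return unit-length vectors is not verified in this notebook; if they do, identical rankings across the three metrics would be expected.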

## Helper functions

In [1]:
import pickle

# save object to file
def save(obj, tofile):
    with open(tofile, 'wb') as fp:
        pickle.dump(obj, fp)

# load object from file
def load(fromfile):
    with open(fromfile, 'rb') as fp:
        return pickle.load(fp)
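
As a quick usage sketch, the helpers can be exercised with a round trip through a temporary file. The helpers are repeated so the snippet runs on its own; the file name and the sample dictionary are placeholders, not part of the experiment's data.

```python
import os
import pickle
import tempfile

def save(obj, tofile):
    # serialize obj to the given path
    with open(tofile, 'wb') as fp:
        pickle.dump(obj, fp)

def load(fromfile):
    # deserialize and return the object stored at the given path
    with open(fromfile, 'rb') as fp:
        return pickle.load(fp)

# placeholder object standing in for sentences or embeddings
original = {'sentences': ['first', 'second'], 'vector': [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.pickle')
    save(original, path)
    restored = load(path)

print(restored == original)
```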

## Load dataset and queries

In [2]:
#datadir = 'data/'
datadir = 'sample_data/'
# load dataset and query sentences
docs = load(datadir+'docs.pickle')
print(f'docs[0]: {docs[0]}')
q1_sentence = load(datadir+'q1.pickle')
print(f'query sentence 1: {q1_sentence}')
q2_sentence = load(datadir+'q2.pickle')
print(f'query sentence 2: {q2_sentence}')

docs[0]: The car skidded to stop for the deer that stood frozen in the headlights of the car.
query sentence 1: The deer froze in the headlights of the car.
query sentence 2: Dream to solve world's problems.


## Embeddings with `all-MiniLM-L6-v2`

### Load saved embeddings

In [6]:
text_embedding_model = 'all-MiniLM-L6-v2'
docs_file = f'{datadir}docs-{text_embedding_model}.pickle'

try:
  doc_embeddings = load(docs_file)
  doc_embeddings_list = doc_embeddings.tolist()
  print('Dataset embeddings loaded.')
  print(f'shape: {doc_embeddings.shape}')

except Exception as e:
  print(f'Exception: {e}. Use Part 1 notebook to generate these embeddings.')


Dataset embeddings loaded.
shape: torch.Size([57, 384])


In [None]:
q1_file = f'{datadir}q1-{text_embedding_model}.pickle'
q2_file = f'{datadir}q2-{text_embedding_model}.pickle'

try:
  q1_embeddings = load(q1_file)
  q2_embeddings = load(q2_file)
  print('Query embeddings loaded.')
  print(f'shape: {q1_embeddings.shape}')

except Exception as e:
  print(f'Exception: {e}. Use Part 1 notebook to generate these embeddings.')


Query embeddings loaded.
shape: torch.Size([384])


## Initialize the vector database

In [10]:
#!pip install "pinecone-client[grpc]"==2.2.1
!pip3 install pinecone-client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pinecone-client
  Downloading pinecone_client-2.2.2-py3-none-any.whl (179 kB)
Collecting loguru>=0.5.0 (from pinecone-client)
  Downloading loguru-0.7.0-py3-none-any.whl (59 kB)
Collecting dnspython>=2.0.0 (from pinecone-client)
  Downloading dnspython-2.3.0-py3-none-any.whl (283 kB)
Installing collected packages: loguru, dnspython, pinecone-client
Successfully installed dnspython-2.3.0 loguru-0.7.0 pinecone-client-2.2.2


In [11]:
!pip3 install python-dotenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0


In [12]:
from dotenv import dotenv_values

# read the Pinecone credentials from a local .env file
config = dotenv_values(".env")
pinecone_api_key = config['PINECONE_API_KEY']
pinecone_env = config['PINECONE_ENVIRONMENT']

In [14]:
import pinecone

pinecone.init(
    api_key=pinecone_api_key,
    environment=pinecone_env
)



## Indexes, similarity searches, and comparisons



In [15]:
# convenience functions to create and populate indexes in pinecone

pinecone_index = None
def create_index(index_to_create, dimension, metric):
  # only create index if it doesn't exist
  if index_to_create not in pinecone.list_indexes():
      pinecone.create_index(
          name=index_to_create,
          dimension=dimension,
          metric=metric
      )
  return

def delete_index(index_name):
  pinecone.delete_index(index_name)

def populate_index(index_name, upsert_docs):
  global pinecone_index
  pinecone_index = pinecone.Index(index_name)
  return pinecone_index.upsert(upsert_docs)

def query_index(q_embeddings):
  # query the current index and print the matching sentences, best match first
  xc = pinecone_index.query(q_embeddings, top_k=7, include_metadata=True)
  print(*[docs[int(match['id'])] for match in xc['matches']], sep='\n')
  return

In [103]:
docs_to_upsert = [(str(i), de.tolist()) for i, de in enumerate(doc_embeddings)]
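
The tuples above pair a string ID with a vector, and `query_index` converts each returned ID back to an integer to look up the sentence in `docs`. The sketch below imitates that round trip with an exhaustive dot-product search over toy data. It is a stand-in for illustration, not Pinecone's indexing algorithm; `top_k_by_dot`, `toy_docs`, and `toy_vectors` are hypothetical names.

```python
def top_k_by_dot(query, id_vector_pairs, k):
    # score every stored vector against the query, highest score first
    scored = [
        (vec_id, sum(q * v for q, v in zip(query, vec)))
        for vec_id, vec in id_vector_pairs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# toy stand-ins (assumptions) for the sentences and their embeddings
toy_docs = ['deer in headlights', 'boy genius', 'clean water']
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

# same (id, vector) shape as docs_to_upsert: the ID is the list position as a string
toy_upsert = [(str(i), vec) for i, vec in enumerate(toy_vectors)]

matches = top_k_by_dot([1.0, 0.0], toy_upsert, k=2)

# map each string ID back to its sentence, as query_index does with docs
for vec_id, score in matches:
    print(toy_docs[int(vec_id)], score)
```

Using the list position as the ID keeps the lookup trivial, but it also means the index and the `docs` list must stay in the same order.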

### Cosine similarity search and results

In [67]:
index_name = 'minilm-cosine'
create_index(index_name, len(doc_embeddings_list[0]), 'cosine')
populate_index(index_name, docs_to_upsert)

{'upserted_count': 57}

In [70]:
query_index(q1_embeddings.tolist())

The car skidded and stopped for the frozen deer in its headlights.
The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.


5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.
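
The accuracy figures in this notebook are worked out by inspection. As a sketch of the same arithmetic, the hypothetical helper below counts how many of the top `k` returned sentences belong to a known set of expected sentences and divides by `k`. The function name and the placeholder labels are assumptions for illustration, not part of the original experiment.

```python
def top_k_accuracy(returned, expected, k=6):
    # fraction of the first k returned items that are in the expected set
    top = returned[:k]
    hits = sum(1 for item in top if item in expected)
    return hits / k

# placeholder labels: s1..s6 are the expected similar sentences, x1 is unrelated
expected = {'s1', 's2', 's3', 's4', 's5', 's6'}
returned = ['s1', 's2', 's3', 's4', 's5', 'x1', 's6']  # seven results, as with top_k=7

accuracy = top_k_accuracy(returned, expected, k=6)
print(f'Accuracy: {accuracy:.0%}')
```

Only the first six of the seven returned results are scored, which matches how the figures in this notebook are stated.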



In [69]:
query_index(q2_embeddings.tolist())

The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing the

5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [72]:
delete_index('minilm-cosine')

### Euclidean Distance similarity search and results

In [104]:
index_name = 'minilm-euclidean'
create_index(index_name, len(doc_embeddings_list[0]), 'euclidean')
populate_index(index_name, docs_to_upsert)

{'upserted_count': 57}

In [105]:
query_index(q1_embeddings.tolist())

The car skidded and stopped for the frozen deer in its headlights.
The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.


5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [106]:
query_index(q2_embeddings.tolist())

The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing the

5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [107]:
delete_index('minilm-euclidean')

### Dot Product similarity search and results

In [94]:
index_name = 'minilm-dotproduct'
create_index(index_name, len(doc_embeddings_list[0]), 'dotproduct')
populate_index(index_name, docs_to_upsert)

{'upserted_count': 57}

In [95]:
query_index(q1_embeddings.tolist())

The car skidded and stopped for the frozen deer in its headlights.
The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The scientific article explains what causes animals to freeze staring into the headlights of speeding vehicles causing many deaths and accidents every year.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.


5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [96]:
query_index(q2_embeddings.tolist())

The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Having successfully obtained his PhD at the remarkably young age of 16, the boy genius engaged in deep contemplation of the complex issues faced by the world. His intuition led him to firmly believe that science held the key to addressing the

5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [97]:
delete_index('minilm-dotproduct')

## Embeddings with `text-embedding-ada-002`

### Initialize openai

In [3]:
!pip install langchain openai tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.202-py3-none-any.whl (1.0 MB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting tiktoken
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)

In [17]:
from dotenv import dotenv_values

config = dotenv_values(".env")
# OpenAI API key; only needed when the embeddings are not already saved to disk
openai_api_key = config.get('OPENAI_API_KEY')

### Load or create dataset embeddings

In [19]:
text_embedding_model = "text-embedding-ada-002"
docs_file = f'{datadir}docs-{text_embedding_model}.pickle'

try:
  doc_embeddings = load(docs_file)

except FileNotFoundError:
  # if not present, create and save
  from langchain.embeddings import OpenAIEmbeddings
  embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
  doc_embeddings = embeddings.embed_documents(docs)
  save(doc_embeddings, docs_file)

print(f'shape: {len(doc_embeddings)}x{len(doc_embeddings[0])}')

shape: 57x1536


### Load or create query sentence embeddings

In [31]:
q1_file = f'{datadir}q1-{text_embedding_model}.pickle'
q2_file = f'{datadir}q2-{text_embedding_model}.pickle'

try:
  q1_embeddings = load(q1_file)
  q2_embeddings = load(q2_file)

except FileNotFoundError:
  # if not present, create and save
  from langchain.embeddings import OpenAIEmbeddings
  embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
  q1_embeddings = embeddings.embed_query(q1_sentence)
  save(q1_embeddings, q1_file)
  q2_embeddings = embeddings.embed_query(q2_sentence)
  save(q2_embeddings, q2_file)

print(f'size: {len(q1_embeddings)}')

size: 1536


In [27]:
docs_to_upsert = [(str(i), de) for i, de in enumerate(doc_embeddings)]

## Store embeddings in a vector database and perform similarity searches

### Cosine similarity search and results

In [None]:
index_name = 'ada-cosine'
create_index(index_name, len(doc_embeddings[0]), 'cosine')
populate_index(index_name, docs_to_upsert)

In [32]:
query_index(q1_embeddings)

The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded and stopped for the frozen deer in its headlights.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.
The deer cautiously crossed the road, looking out for oncoming vehicles.


6 of the top 6 results are expected semantically similar sentences. Accuracy: 6/6 = 100%.



In [33]:
query_index(q2_embeddings)

Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Access to clean water is a pressing global issue.


5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [34]:
delete_index('ada-cosine')

### Euclidean Distance similarity search and results

In [54]:
index_name = 'ada-euclidean'
create_index(index_name, len(doc_embeddings[0]), 'euclidean')
populate_index(index_name, docs_to_upsert)

{'upserted_count': 57}

In [55]:
query_index(q1_embeddings)

The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded and stopped for the frozen deer in its headlights.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.
The deer cautiously crossed the road, looking out for oncoming vehicles.


6 of the top 6 results are expected semantically similar sentences. Accuracy: 6/6 = 100%.



In [56]:
query_index(q2_embeddings)

Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Access to clean water is a pressing global issue.


5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [57]:
delete_index('ada-euclidean')

### Dot Product similarity search and results

In [35]:
index_name = 'ada-dotproduct'
create_index(index_name, len(doc_embeddings[0]), 'dotproduct')
populate_index(index_name, docs_to_upsert)

{'upserted_count': 57}

In [36]:
query_index(q1_embeddings)

The car skidded to stop for the deer that stood frozen in the headlights of the car.
The car skidded and stopped for the frozen deer in its headlights.
The car skidded to a standstill for the deer that remained motionless in the car's headlights.
The vehicle slid and came to a halt in response to the deer's immobility under the car's headlights.
The car skidded and stopped to avoid the motionless deer illuminated by its headlights.
As the car lost traction, it slid across the road and eventually halted abruptly, its brakes screeching, due to the presence of a motionless deer standing in the direct path of its headlights.
The deer cautiously crossed the road, looking out for oncoming vehicles.


6 of the top 6 results are expected semantically similar sentences. Accuracy: 6/6 = 100%.



In [37]:
query_index(q2_embeddings)

Having finished his PhD at 16, the boy genius  contemplated the challenges the world faced, and intuited that science must be the solution.
The 16-year-old genius, having earned a PhD, contemplated global challenges and recognized science as the solution.
The boy genius finished his PhD at 16 and believed science was the solution to the world's challenges.
Having completed his doctorate at 16, the exceptionally gifted young prodigy pondered the difficulties confronting the world, intuitively perceiving science as the remedy.
After completing his PhD at the age of 16, the exceptionally talented young prodigy reflected on the global challenges and recognized science as the answer.
Finding sustainable energy solutions is crucial for a greener future.
Access to clean water is a pressing global issue.


5 of the top 6 results are expected semantically similar sentences. Accuracy: 5/6 = 83%.



In [38]:
delete_index('ada-dotproduct')