# Cosine Similarity - Near Search

- Perform a near search using cosine similarity

### Import packages

In [1]:
import common
import requests

"""curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/embeddings?api-version=2023-05-15 \
  -H "Content-Type: application/json" \
  -H "api-key: YOUR_API_KEY" \
  -d "{\"input\": \"The food was delicious and the waiter...\"}"""
common.ada_full_uri


'https://alemoraoaican.openai.azure.com/openai/deployments/ada-large/embeddings?api-version=2023-05-15'

### Get an embedding

- 02/2024 - new models: text-embedding-3-small and text-embedding-3-large
- By default, the length of the embedding vector will be 1536 for text-embedding-3-small or 3072 for text-embedding-3-large. You can reduce the dimensions of the embedding by passing in the dimensions parameter without the embedding losing its concept-representing properties.
- Parameters:
```text
input string or array Required
Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for text-embedding-ada-002), cannot be an empty string, and any array must contain 2048 elements or fewer.

model string Required
ID of the model to use. You can use the List models API to see all of your available models, or see our Model overview for descriptions of them.

encoding_format string Optional
Defaults to float
The format to return the embeddings in. Can be either float or base64.

dimensions integer Optional
The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models.

user string Optional
A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse.
```
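
As a rough illustration of how these parameters combine, the sketch below builds the JSON body for an embeddings request. `build_embedding_payload` is a hypothetical helper, not part of any SDK. It assumes `dimensions` should be sent only when explicitly requested (older models do not support it) and that the model is selected by the deployment name in the URL, as in the request shown earlier, so no `model` field is added.

```python
from typing import Optional, Sequence, Union


def build_embedding_payload(
    text: Union[str, Sequence[str]],
    dimensions: Optional[int] = None,
    encoding_format: str = "float",
) -> dict:
    """Build a request body for the embeddings endpoint (hypothetical helper)."""
    if isinstance(text, str):
        if not text:
            raise ValueError("input cannot be an empty string")
    else:
        text = list(text)
        # The parameter notes above cap arrays at 2048 elements.
        if len(text) > 2048:
            raise ValueError("input array must contain 2048 elements or fewer")
        if any(item == "" for item in text):
            raise ValueError("input array cannot contain empty strings")
    if encoding_format not in ("float", "base64"):
        raise ValueError("encoding_format must be 'float' or 'base64'")

    payload = {"input": text, "encoding_format": encoding_format}
    # Only text-embedding-3 and later models accept `dimensions`,
    # so it is omitted unless the caller asks for it.
    if dimensions is not None:
        payload["dimensions"] = dimensions
    return payload


print(build_embedding_payload("The food was delicious", dimensions=256))
```

Validating the input before the request is sent is a design choice for this sketch; the service performs its own checks as well.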

In [38]:
headers = {
    "Content-Type": "application/json",
    "api-key": common.ada_key,
}

def get_embedding(text: str, model=2, dimension=1536):
    """Return (text, embedding vector) for the given text.

    model=2 selects the previous-generation model, which does not accept
    `dimensions`; any other value sends the requested dimension.
    """
    json_data = {"input": text}
    if model != 2:
        json_data["dimensions"] = dimension
    response = requests.post(common.ada_full_uri,
                             headers=headers,
                             json=json_data)
    response.raise_for_status()  # fail loudly on HTTP errors
    res = response.json()
    vector = res['data'][0]['embedding']
    print(f"Input: {text} Vector size: {len(vector)}")
    return (text, vector)
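
Because `input` also accepts an array of strings, several texts could be embedded in one request instead of one call per text. The sketch below only covers the response handling, using a hand-written response rather than a live call. `pair_inputs_with_vectors` is a hypothetical helper, and it assumes each item in `data` carries an `index` field next to `embedding` that identifies which input it belongs to.

```python
from typing import List, Sequence, Tuple


def pair_inputs_with_vectors(
    inputs: Sequence[str], response_json: dict
) -> List[Tuple[str, List[float]]]:
    """Match each input string to its embedding using the `index` field."""
    # Sort by index so the result order does not depend on response order.
    items = sorted(response_json["data"], key=lambda item: item["index"])
    if len(items) != len(inputs):
        raise ValueError("response does not contain one embedding per input")
    return [(text, item["embedding"]) for text, item in zip(inputs, items)]


# Hand-written response in the assumed shape (values are made up).
fake_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ]
}
pairs = pair_inputs_with_vectors(["first", "second"], fake_response)
print(pairs)  # [('first', [1.0, 0.0]), ('second', [0.0, 1.0])]
```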

### Calculate the Cosine Similarity

In [39]:
def cosine_similarity(v1, v2):
    # Cosine of the angle between v1 and v2:
    #   dot(v1, v2) / (|v1| * |v2|)
    dot_product = sum(a*b for a, b in zip(v1, v2))
    magnitude_A = sum(a*a for a in v1)**0.5
    magnitude_B = sum(b*b for b in v2)**0.5
    # Equivalent with numpy:
    #   np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return dot_product / (magnitude_A * magnitude_B)
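
To get a feel for the values this formula produces, the sketch below applies it to a few hand-picked vectors. `cosine_similarity_checked` is a hypothetical variant written for this example; it adds a length check and, as an assumption of this sketch, returns 0.0 when either vector has zero magnitude rather than dividing by zero.

```python
import math


def cosine_similarity_checked(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot_product = sum(a * b for a, b in zip(v1, v2))
    magnitude_1 = math.sqrt(sum(a * a for a in v1))
    magnitude_2 = math.sqrt(sum(b * b for b in v2))
    if magnitude_1 == 0 or magnitude_2 == 0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot_product / (magnitude_1 * magnitude_2)


same = cosine_similarity_checked([1.0, 0.0], [1.0, 0.0])        # same direction
orthogonal = cosine_similarity_checked([1.0, 0.0], [0.0, 1.0])  # perpendicular
opposite = cosine_similarity_checked([1.0, 0.0], [-1.0, 0.0])   # opposite direction
# Same direction, different length: the result is (approximately) 1,
# because cosine similarity ignores magnitude.
scaled = cosine_similarity_checked([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(same, orthogonal, opposite, scaled)
```

The result always lies between -1 and 1: values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.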

### Prepare the mock vector database

In [43]:
model=3 # use the new embedding model (3) or the previous model (2)
dimension=256 # with model 3 you can change the dimension; otherwise it will be 1536

In [41]:
content = [
    "The chemical composition of water is H2O.",
    "The speed of light is 300,000 km/s.",
    "Acceleration of gravity on earth is 9.8m/s^2.",
    "The chemical composition of salt or sodium chloride is NaCl.",
]
vector_database = [get_embedding(c,model,dimension) for c in content]

Input: The chemical composition of water is H2O. Vector size: 256
Input: The speed of light is 300,000 km/s. Vector size: 256
Input: Acceleration of gravity on earth is 9.8m/s^2. Vector size: 256
Input: The chemical composition of salt or sodium chloride is NaCl. Vector size: 256


### Embed the question

In [42]:
(p1,e1) = get_embedding("What is the speed of light?",model,dimension)

Input: What is the speed of light? Vector size: 256


### Perform nearest search

In [34]:
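limit = 3        # maximum number of results to keep
relevance = 0.1  # minimum cosine similarity for a match
# Score every entry first; stopping after `limit` entries would only
# examine the first few rows instead of finding the nearest ones.
results_list = []
for (text, entry_embedding) in vector_database:
    cs = cosine_similarity(e1, entry_embedding)
    if cs > relevance:
        results_list.append((text, cs))
# Keep the `limit` most similar entries, best match first.
results_list = sorted(results_list, key=lambda x: x[1], reverse=True)[:limit]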
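
The search step can also be packaged as a reusable function. The sketch below is one possible shape, not the notebook's own code: `nearest` is a hypothetical name, and it uses `heapq.nlargest` from the standard library to keep the best `limit` matches after scoring every entry. The toy two-dimensional vectors stand in for real embeddings.

```python
import heapq
import math


def cosine(v1, v2):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot_product = sum(a * b for a, b in zip(v1, v2))
    norm_1 = math.sqrt(sum(a * a for a in v1))
    norm_2 = math.sqrt(sum(b * b for b in v2))
    return dot_product / (norm_1 * norm_2)


def nearest(query_vector, database, limit=3, relevance=0.1):
    """Return up to `limit` (text, score) pairs above `relevance`, best first."""
    scored = ((text, cosine(query_vector, vector)) for text, vector in database)
    matches = (pair for pair in scored if pair[1] > relevance)
    # nlargest scans every match and returns them sorted from highest score.
    return heapq.nlargest(limit, matches, key=lambda pair: pair[1])


# Toy database of (text, vector) pairs; the vectors are made up.
toy_database = [
    ("east", [1.0, 0.0]),
    ("north", [0.0, 1.0]),
    ("north-east", [1.0, 1.0]),
    ("west", [-1.0, 0.0]),
    ("mostly east", [3.0, 1.0]),
]
top = nearest([1.0, 0.0], toy_database, limit=2)
print([text for text, _ in top])  # ['east', 'mostly east']
```

Note that "mostly east" is the last entry in the toy database yet ranks second, which is why every entry has to be scored before the limit is applied.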

### Print the results

In [35]:
results_list.sort(key=lambda x: x[1], reverse=True)
for entry in results_list:
    print(f"Similarity: {entry[1]}, Content: {entry[0]}")

Similarity: 0.6822626605775532, Content: The speed of light is 300,000 km/s.
Similarity: 0.2587818231311923, Content: Acceleration of gravity on earth is 9.8m/s^2.
Similarity: 0.13597596777794896, Content: The chemical composition of water is H2O.
