# Step 2: Finding Relevant Data

Store your API key in the `OPENAI_KEY` environment variable (for example, in a `.env` file), then run the cells below.

In [1]:
from dotenv import load_dotenv
load_dotenv()  # take environment variables

import os
OPENAI_KEY = os.getenv('OPENAI_KEY')

In [5]:


import openai
from openai import OpenAI

# openai>=1.0 uses a client object instead of the module-level
# openai.api_base / openai.api_key settings
client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=OPENAI_KEY,
)

The code below loads in the embeddings you previously created. Run it as-is.

In [6]:
import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
df

Unnamed: 0,text,embeddings
0,"On 6 February 2023, at 04:17 TRT (01:17 UTC), ...","[-0.007865791209042072, -0.01488738413900137, ..."
1,The Mw 7.8 earthquake is the largest in Turkey...,"[0.00019888307724613696, -0.022314351052045822..."
2,There was widespread damage in an area of abou...,"[-0.003678650129586458, -0.020112549886107445,..."
3,Central southern Turkey and northwestern Syria...,"[-0.005976187530905008, -0.011475914157927036,..."
4,The EAF has produced large or damaging earthqu...,"[0.0002380282385274768, -0.02387528494000435, ..."
...,...,...
96,The International Seismological Centre has a b...,"[-0.004583664704114199, -0.009662682190537453,..."
97,The International Seismological Centre has a b...,"[-0.004807258490473032, -0.01684679090976715, ..."
98,"Erdik, M., Tümsa, M. B. D., Pınar, A., Altunel...","[-0.006729048676788807, -0.04049292206764221, ..."
99,"""Kahramanmaraş Supersite science page"". Group ...","[0.0036264623049646616, -0.007848413661122322,..."
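The loading cell above parses each stored embedding string with `eval`, which executes whatever the CSV contains. For a file you did not create yourself, a more cautious variant is `ast.literal_eval`, which only accepts plain Python literals. The sketch below uses a tiny made-up CSV (the `idx` header and the three-dimensional vectors are illustrative assumptions, not the real `embeddings.csv`):

```python
import ast
import io

import numpy as np
import pandas as pd

# Tiny stand-in for embeddings.csv (hypothetical data, 3 dimensions instead of 1,536)
csv_text = (
    "idx,text,embeddings\n"
    '0,first row,"[0.1, -0.2, 0.3]"\n'
    '1,second row,"[0.0, 0.5, -0.5]"\n'
)

demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ast.literal_eval only accepts Python literals (lists, numbers, strings, ...),
# so it raises an error on anything that is not plain data
demo_df["embeddings"] = (
    demo_df["embeddings"].apply(ast.literal_eval).apply(np.array)
)

print(demo_df["embeddings"].iloc[0])
```

The resulting column has the same shape as the one produced by the `eval` version, so the rest of the notebook should not need to change.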


## TODO 1: Create Embeddings for the User's Question

In the previous exercise, you were given the code to create embeddings for everything in the dataset. Now your task is to create embeddings for just one string: the user's question. Assign the result to the variable `question_embeddings`. This variable should contain a list of 1,536 floating point numbers, and the provided code will print the first 100 once `question_embeddings` has been created correctly.

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
USER_QUESTION = """What were the estimated damages of the 2023 \
Turkey-Syria earthquake?"""

# Generate the embedding response (openai>=1.0 client interface)
response = client.embeddings.create(
    input=USER_QUESTION,
    model=EMBEDDING_MODEL_NAME
)

# Extract the embeddings from the response
question_embeddings = response.data[0].embedding

print(question_embeddings[:100])
```

</details>

In [12]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
USER_QUESTION = """What were the estimated damages of the 2023 \
Turkey-Syria earthquake?"""

# Generate text response
response_text = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": USER_QUESTION}
  ]
)

response_text

ChatCompletion(id='chatcmpl-Ai9a26ErrMGPBhAZhkBV8bCtMWPwM', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I'm sorry, but I couldn't find specific information on a Turkey-Syria earthquake in 2023. If you have any other questions or need information on a different topic, feel free to ask!", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1735086622, model='gpt-3.5-turbo-0125', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=33, total_tokens=76, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))
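The raw `ChatCompletion` repr above is hard to read. Judging from its structure, the reply text sits under `choices[0].message.content`. A minimal sketch of pulling it out follows; `extract_answer` and `fake_response` are hypothetical names, and the stand-in object only mimics the attribute layout so the example runs without an API call:

```python
from types import SimpleNamespace


def extract_answer(response):
    """Return the assistant's reply text from a chat completion response."""
    return response.choices[0].message.content


# Stand-in object with the same attribute layout as the repr above
# (hypothetical; in the notebook you would pass response_text instead)
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Example reply"))]
)

print(extract_answer(fake_response))
```

In the notebook, `extract_answer(response_text)` would return the message in which the model says it has no information on the 2023 earthquake, which is the gap the retrieved context is meant to fill.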

In [26]:
# Generate the embedding response
# https://platform.openai.com/docs/guides/embeddings/embedding-models
response_embeddings = client.embeddings.create(
    model=EMBEDDING_MODEL_NAME,
    input=USER_QUESTION,
)

# Extract the embeddings from the response
question_embeddings = response_embeddings.data[0].embedding

# len(question_embeddings) should be 1536 for text-embedding-ada-002

print(question_embeddings[:100])

[0.0055134412832558155, -0.024900270625948906, 0.0023327376693487167, -0.012057743035256863, -0.02148180454969406, 0.0025389099027961493, -0.03381222486495972, -0.013101905584335327, 0.002271218691021204, -0.015004009008407593, 0.0162410419434309, 0.044320352375507355, -0.010262050665915012, -0.013308077119290829, 0.015070516616106033, -0.005400379188358784, 0.012456785887479782, -0.013135158456861973, 0.008339994587004185, -0.0057362401857972145, -0.0074754017405211926, 0.011472480371594429, 0.012396929785609245, -0.009018367156386375, 0.015123722143471241, 0.032162848860025406, 0.008027411065995693, -0.0023377256002277136, -0.0007943445816636086, -0.012204058468341827, 0.005523417145013809, -0.006550952792167664, -0.023450415581464767, 0.015256736427545547, -0.03240227326750755, -0.007575162220746279, 0.0068834880366921425, -0.0048450445756316185, 0.023344002664089203, -0.007927650585770607, 0.010075830854475498, 0.028624670580029488, 0.01296889130026102, -0.007708176504820585, -0.01

## TODO 2: Find Cosine Distances

Create a new list called `distances`, which represents the cosine distances between `question_embeddings` and each value in the `'embeddings'` column of `df`.

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
# openai.embeddings_utils was removed in openai>=1.0, so compute the
# cosine distances with SciPy instead
from scipy.spatial.distance import cosine

# Create a list containing the distances from question_embeddings
distances = [
    cosine(question_embeddings, embedding)
    for embedding in df["embeddings"]
]

print(distances[:100])
```

</details>

In [29]:
# openai.embeddings_utils (and its distances_from_embeddings helper) was
# removed in openai>=1.0, so the helper is reimplemented here with SciPy.
from typing import List

from scipy import spatial

def distances_from_embeddings(
    query_embedding: List[float],
    embeddings: List[List[float]],
    distance_metric="cosine",
) -> List[float]:
    """Return the distances between a query embedding and a list of embeddings."""
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances



# Create a list containing the distances from question_embeddings
distances = distances_from_embeddings(question_embeddings, df["embeddings"], distance_metric="cosine")

df["distances"] = distances

print(distances[:100])

[np.float64(0.1168092065197357), np.float64(0.13425348660279057), np.float64(0.08788036609447802), np.float64(0.17517923221232434), np.float64(0.15367085650797774), np.float64(0.14492131396243735), np.float64(0.14775697880875105), np.float64(0.1593530015043324), np.float64(0.14752053002356003), np.float64(0.16198672215288168), np.float64(0.19435548085926002), np.float64(0.14276406037143263), np.float64(0.16905621941011284), np.float64(0.18120243879979459), np.float64(0.16958621221992443), np.float64(0.18445703167759608), np.float64(0.12379760165159637), np.float64(0.159785686324041), np.float64(0.16502861780429234), np.float64(0.16986085780715665), np.float64(0.1597849518435457), np.float64(0.1613345275361785), np.float64(0.14730919036105172), np.float64(0.15206817976490428), np.float64(0.17780499399558958), np.float64(0.1898979088177538), np.float64(0.28042572824119905), np.float64(0.2960231621976831), np.float64(0.2960231621976831), np.float64(0.2803639942901184), np.float64(0.280425
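The cosine distance computed above is one minus the cosine similarity of two vectors: 0 when they point in the same direction, 1 when they are orthogonal, 2 when they point in opposite directions. As a rough sketch, the same values can be computed for all rows at once with NumPy; `cosine_distances` and the toy vectors are illustrative names and data, not part of the exercise:

```python
import numpy as np


def cosine_distances(query, matrix):
    """Cosine distance (1 - cosine similarity) between query and each row of matrix."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of every row with the query, divided by the product of norms
    similarities = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


toy_query = [1.0, 0.0]
toy_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # same, orthogonal, opposite
print(cosine_distances(toy_query, toy_rows))
```

In the notebook, something like `cosine_distances(question_embeddings, np.stack(df["embeddings"].tolist()))` should give the same numbers as the loop-based helper, up to floating-point rounding.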

## Sorting by Distance

The code below adds the `distances` list to `df` as a column, then sorts `df` to find the most related rows. A smaller cosine distance means greater similarity, so we sort in ascending order. Run the cell below as-is.

In [33]:
df["distances"] = distances
df.sort_values(by="distances", ascending=True, inplace=True)
df.head(5)

Unnamed: 0,text,embeddings,distances
12,Ground acceleration values recorded in some ar...,"[0.0004522834497038275, -0.0006016142433509231...",0.08788
15,"Despite an epicenter 90 km (56 mi) inland, a t...","[0.002972564660012722, -0.02091204933822155, 0...",0.102367
89,"Utkucu, Murat; Uzunca, Fatih; Durmuş, Hatice; ...","[-0.008880216628313065, -0.028865838423371315,...",0.11647
46,The United Nations Development Programme estim...,"[0.012168490327894688, -0.0028820110019296408,...",0.116809
28,\t\t\t,"[-0.01989811100065708, -0.02768372930586338, -...",0.122092
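If you only want to peek at the closest rows without reordering the whole DataFrame, `DataFrame.nsmallest` is one alternative to an in-place sort. The sketch below uses a made-up four-row frame (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df with precomputed distances
toy_df = pd.DataFrame(
    {"text": ["a", "b", "c", "d"], "distances": [0.30, 0.10, 0.25, 0.15]}
)

# Returns the 2 rows with the smallest distances, in ascending order,
# and leaves toy_df itself in its original order
top_2 = toy_df.nsmallest(2, "distances")
print(top_2)
```

The exercise still needs the full sorted DataFrame for the next step, so this is only a convenience for inspection.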


Great, now we have the dataset sorted from most relevant to least relevant! Let's save this as a CSV so we can load it in the next step and compose a custom prompt.

Run the cell below as-is.

In [34]:
df.to_csv("distances_my_exercise.csv")
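One thing worth checking before relying on this file: the `embeddings` column holds NumPy arrays, and by default NumPy abbreviates the string form of arrays longer than 1,000 elements with `...`, so the saved embeddings may not be recoverable from the CSV. If the next step only needs `text` and `distances`, this does not matter. Otherwise, a possible workaround is to convert each array to a plain list before saving. The sketch below uses a made-up frame (`save_df`) and an in-memory buffer instead of a real file:

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the sorted df, with 1,536-dimensional vectors
save_df = pd.DataFrame(
    {
        "text": ["first row", "second row"],
        "embeddings": [np.linspace(0.0, 1.0, 1536), np.linspace(1.0, 2.0, 1536)],
        "distances": [0.1, 0.2],
    }
)

# .tolist() yields plain Python floats, so each cell is written as a full list literal
save_df["embeddings"] = save_df["embeddings"].apply(lambda arr: arr.tolist())

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
save_df.to_csv(buffer)
buffer.seek(0)

# Reload and parse the list literals back into NumPy arrays
loaded_df = pd.read_csv(buffer, index_col=0)
loaded_df["embeddings"] = (
    loaded_df["embeddings"].apply(ast.literal_eval).apply(np.array)
)

print(len(loaded_df["embeddings"].iloc[0]))
```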