# Step 1: Preparing a Dataset with Embeddings

Run the cell below and enter your API key when prompted.

In [2]:
import openai
import getpass

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = getpass.getpass("Enter your API key: ").strip()

Enter your API key: ········


## Loading the Data

We are using the `requests` library ([documentation here](https://requests.readthedocs.io/en/latest/user/quickstart/)) to get the text of a page from Wikipedia using the `extracts` API feature ([documentation here](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bextracts)). You can ignore the details of the `params` being sent; the important takeaway is that **`response_dict` is a Python dictionary containing the response to our query**.

Run the cell below as-is.

In [3]:
import requests

# Get the Wikipedia page for the 2023 Turkey–Syria earthquake
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023_Turkey–Syria_earthquakes",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
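
If you are curious how the `params` dictionary ends up in the request, the sketch below builds roughly the URL that gets requested from the same parameters. It uses only the standard library and makes no network call; the exact encoding `requests` applies is assumed to match `urlencode` here.

```python
from urllib.parse import urlencode

# Same parameters as in the cell above
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023_Turkey–Syria_earthquakes",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json",
}

# urlencode percent-encodes non-ASCII characters such as the en dash in the title
query_string = urlencode(params)
url = "https://en.wikipedia.org/w/api.php?" + query_string
print(url)
```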

In [4]:
response_dict

{'batchcomplete': True,
 'query': {'normalized': [{'fromencoded': False,
    'from': '2023_Turkey–Syria_earthquakes',
    'to': '2023 Turkey–Syria earthquakes'}],
  'pages': [{'pageid': 72956318,
    'ns': 0,
    'title': '2023 Turkey–Syria earthquakes',

### TODO: Parse `response_dict` to get a list of text data samples

Look at the nested data structure of `response_dict` and find the key-value pair with the key `"extract"`. Its value is a string containing a long block of text. Split this text into a list of strings using the `"\n"` separator and assign the result to the variable `text_data`.

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
```

</details>

In [5]:
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
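
The one-line lookup above raises an error if the page is missing or has no `"extract"` key. A more defensive version might look like the sketch below. The helper name `extract_paragraphs` and the `sample_response` dictionary are hypothetical; the sample only mimics the nesting of the real response so the code runs offline.

```python
def extract_paragraphs(response_dict):
    """Return the page text split on newlines, or [] if no extract is present."""
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        return []
    extract = pages[0].get("extract", "")
    return extract.split("\n") if extract else []

# Stand-in response with the same nesting as the real one; the text is made up
sample_response = {
    "batchcomplete": True,
    "query": {
        "pages": [
            {
                "pageid": 1,
                "ns": 0,
                "title": "Example",
                "extract": "Intro paragraph.\n\n\n== History ==\nSecond paragraph.",
            }
        ]
    },
}

sample_text_data = extract_paragraphs(sample_response)
print(sample_text_data)
```

Note that the split keeps empty strings and section headings; those are removed in the DataFrame cleaning step.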

### Adding the Text Data to a DataFrame

Run the cell below as-is.

In [6]:
import pandas as pd

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = text_data

# Clean up dataframe to remove empty lines and headings
df = df[(
    (df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))
)].reset_index(drop=True)
df.head()

|   | text |
|---|------|
| 0 | On 6 February 2023, at 04:17 TRT (01:17 UTC), ... |
| 1 | The Mw 7.8 earthquake is the largest in Turkey... |
| 2 | There was widespread damage in an area of abou... |
| 3 | The confirmed death toll in Turkey was 53,537;... |
| 4 | Damaged roads, winter storms, and disruption t... |
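
The filter used to clean the DataFrame keeps a row only if its text is non-empty and does not start with `==` (the marker for section headings in the plain-text extract). The sketch below applies the same two conditions in plain Python to a few made-up lines, so you can see what gets dropped; `sample_lines` is illustrative and not taken from the page.

```python
# Made-up lines in the shape produced by splitting the extract on "\n"
sample_lines = [
    "Intro paragraph.",
    "",
    "",
    "== History ==",
    "Second paragraph.",
    "=== Details ===",
    "Third paragraph.",
]

# Keep a line only if it is non-empty and does not start with "=="
kept = [line for line in sample_lines if len(line) > 0 and not line.startswith("==")]
print(kept)
```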


## Creating the Embeddings Index

Here is the text from the first row of our dataset. Run the cell below as-is.

In [7]:
df["text"][0]

'On 6 February 2023, at 04:17 TRT (01:17 UTC), a Mw 7.8 earthquake struck southern and central Turkey and northern and western Syria. The epicenter was 37 km (23 mi) west–northwest of Gaziantep. The earthquake had a maximum Mercalli intensity of XII (Extreme) around the epicenter and in Antakya. It was followed by a Mw\u202f7.7 earthquake at 13:24. This earthquake was centered 95 km (59 mi) north-northeast from the first. There was widespread damage and tens of thousands of fatalities.'

This code creates an embedding for that text sample and prints its first 20 values. Run the cell below as-is.

In [8]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=[df["text"][0]],
    engine=EMBEDDING_MODEL_NAME
)

# Extract and print the first 20 numbers in the embedding
response_list = response["data"]
first_item = response_list[0]
first_item_embedding = first_item["embedding"]
print(first_item_embedding[:20])

[-0.007916178554296494, -0.014893945306539536, -0.013553355820477009, -0.030699491500854492, 0.0021617000456899405, 0.020926596596837044, -0.03895752131938934, -0.016623305156826973, 0.0042764791287481785, -0.02986832521855831, 0.02281682752072811, 0.05217573046684265, -0.010771634057164192, -0.018500130623579025, 0.012011678889393806, -0.0002576444821897894, 0.015349745750427246, -0.015376557596027851, 0.005074129905551672, -0.008700423873960972]


### Creating a List of Embeddings

This code sends every text sample (`df["text"].tolist()`) to the `openai.Embedding.create` function in a single request, then collects the returned vectors into a list called `embeddings`.

Run the cell below as-is.

In [9]:
# Send text data to the model
response = openai.Embedding.create(
    input=df["text"].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

# Extract embeddings
embeddings = [data["embedding"] for data in response["data"]]
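
Sending everything in one request works for a single page, but a larger dataset may need to be split across several requests. The sketch below shows one way to do that. The helper names `batched` and `embed_in_batches`, the default `batch_size`, and the `fake_embed` stand-in are all assumptions made for illustration; `fake_embed` replaces the API call so the code runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call `embed_fn` once per batch and concatenate the results in order."""
    results = []
    for batch in batched(texts, batch_size):
        results.extend(embed_fn(batch))
    return results

# Stand-in for the API call: returns one fake 2-number vector per text
def fake_embed(batch):
    return [[float(len(text)), 0.0] for text in batch]

demo = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(len(demo))
```

In this notebook, `embed_fn` could be a small function that calls `openai.Embedding.create(input=batch, engine=EMBEDDING_MODEL_NAME)` and returns `[data["embedding"] for data in response["data"]]`.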

### Adding Embeddings to DataFrame and Saving as CSV

Run the cell below as-is.

In [10]:
# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.to_csv("embeddings.csv")
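
One thing to keep in mind when the file is loaded again: a CSV cell can only hold text, so each embedding list is written out as a string and has to be parsed back into a list. The sketch below illustrates this round trip with the standard-library `csv` module and a made-up three-number vector; pandas is assumed to behave the same way for list-valued columns.

```python
import ast
import csv
import io

# Write one row with a list-valued cell, stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["example sentence", str([0.25, -0.5, 1.0])])

# Read it back: the embeddings cell is now a string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embeddings"]
print(type(raw).__name__)

# ast.literal_eval safely parses the string back into a list of floats
restored = ast.literal_eval(raw)
print(restored)
```

With a DataFrame loaded from `embeddings.csv`, the analogous step would be applying `ast.literal_eval` to the `embeddings` column.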

In [11]:
df.head(5)

|   | text | embeddings |
|---|------|------------|
| 0 | On 6 February 2023, at 04:17 TRT (01:17 UTC), ... | [-0.007916178554296494, -0.014893945306539536,... |
| 1 | The Mw 7.8 earthquake is the largest in Turkey... | [0.0002615457051433623, -0.022248437628149986,... |
| 2 | There was widespread damage in an area of abou... | [-0.00022320200514514, -0.01703203096985817, 0... |
| 3 | The confirmed death toll in Turkey was 53,537;... | [0.0002244623174192384, -0.02535487338900566, ... |
| 4 | Damaged roads, winter storms, and disruption t... | [-0.018199238926172256, -0.014467408880591393,... |


## Conclusion

You have now created and saved an embeddings index!