# Codealong Notebook

Use this notebook as your "scratch pad" as you go through the course contents. Feel free to copy any example code and tweak it to get a better understanding of how it works!

Use the **+** button or `Insert` menu to add additional code cells as needed.

## Step 1

### Ask ChatGpt before using RAG

In [10]:
!pip install python-dotenv -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [1]:
# Read credentials from .env file
# https://github.com/theskumar/python-dotenv


from dotenv import load_dotenv
load_dotenv()  # take environment variables

import os
OPENAI_KEY = os.getenv('OPENAI_KEY')

In [2]:
from openai import OpenAI
client = OpenAI(
    base_url = "https://openai.vocareum.com/v1",
    api_key = OPENAI_KEY
)

In [3]:
question1 = "When did Russia invade Ukrain ?"

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": question1}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="Russia's invasion of Ukraine began in February 2014 when Russian military forces entered the Ukrainian region of Crimea. This was followed by further military intervention and support for separatist movements in eastern Ukraine. The conflict has continued since then, with ongoing fighting in certain areas of eastern Ukraine.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


In [4]:
question2 = "Who owns twitter ?"

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": question2}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content='Twitter is a publicly traded company, so it is owned by its shareholders. The largest shareholders typically include institutional investors, mutual funds, and individual investors who own shares of the company.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


## Step 2

### Course 3 - 4.10 `Loading and Wrangling data`

In [5]:
!pip install requests -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [6]:
import requests
import json

# Send a GET request to the specified URL

# Some parameters explanation
# https://www.mediawiki.org/wiki/Extension:TextExtracts
url = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "format": "json",
    "titles": "2022",
    "prop": "extracts",
    "exlimit": 1,
    "explaintext": 1,
    "extracts": True
}

response = requests.get(url, params=params)
json_data = response.json()

In [7]:
json_data

 'batchcomplete': '',
 'query': {'pages': {'52412': {'pageid': 52412,
    'ns': 0,
    'title': '2022',

In [8]:
page_id=list(json_data['query']['pages'].keys())[0]

In [9]:
print(json_data["query"]["pages"][page_id]["extract"])

2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  
The year began with another wave in the COVID-19 pandemic, with Omicron spreading rapidly and becoming the dominant variant of the SARS-CoV-2 virus worldwide. Tracking a decrease in cases and deaths, 2022 saw the removal of most COVID-19 restrictions and the reopening of international borders in the vast majority of countries, while the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022. The year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (

In [10]:
import pandas as pd

df = pd.DataFrame()
df["text"] = json_data["query"]["pages"][page_id]["extract"].split("\n")

In [11]:
len(df) #head(20)

276

In [12]:
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year began with another wave in the COVID-...
2,2022 was also dominated by wars and armed conf...
3,
4,
...,...
271,== References ==
272,
273,
274,== External links ==


Data from the API is much cleaner than raw website source code, but it still needs some work to be ideally configured for our purposes.

In this demo, we walked through how to wrangle and clean the data in df:

* Addressing the problem of empty rows by subsetting to include only rows where the length is > 0
* Addressing the problem of headings by subsetting to exclude rows where the text starts with ==
* Addressing the problem of rows without dates using a date parser and somewhat more complex logic

In [13]:
from dateutil.parser import parse

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " - ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except:
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)

In [14]:
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year began with another wave in the COV...
2,– 2022 was also dominated by wars and armed c...
3,– The ongoing Russian invasion of Ukraine esc...
4,January 1 – The Regional Comprehensive Econom...
...,...
188,December 24 – 2022 Fijian general election: Th...
189,December 29 – Brazilian football legend Pelé d...
190,December 31 – Former Pope Benedict XVI dies at...
191,December 7 – The world population was estimate...


### Creating an Embeddings Index with `openai.Embedding`

In [15]:
!pip install openai -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [16]:
import openai #1.58.1

# Set up your API key and model
# openai.api_key = OPENAI_KEY
# model = "text-embedding-ada-002"

In [17]:
# from openai import OpenAI
# client = OpenAI(
#     base_url = "https://openai.vocareum.com/v1",
#     api_key = OPENAI_KEY
# )

In [18]:


# EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
# batch_size = 10
# embeddings = []
# for i in range(0, len(df), batch_size):
#     # Send text data to OpenAI model to get embeddings
#     response = client.embeddings.create( #openai.Embedding.create(
#         input=df.iloc[i:i+batch_size]["text"].tolist(),
#         model=EMBEDDING_MODEL_NAME
#     )
#     # print(df.iloc[i:i+batch_size]["text"].tolist())
#     print("response={}".format(response))
#     # embeddings.extend([data["embedding"] for data in response["data"]])
#     embeddings.extend([datao[0]["embedding"] for datao in response.data])
#     print(embeddings)



#must be changed in batch
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding


df["embeddings"] = df["text"].apply(lambda x: get_embedding(x, model="text-embedding-ada-002"))

df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[5.03144838148728e-05, -0.017939811572432518, ..."
1,– The year began with another wave in the COV...,"[-0.004625678062438965, -0.02004571445286274, ..."
2,– 2022 was also dominated by wars and armed c...,"[-0.009635788388550282, -0.015319113619625568,..."
3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014713579788804054, -0.007582539692521095,..."
4,January 1 – The Regional Comprehensive Econom...,"[-0.0005856040515936911, -0.024172160774469376..."
...,...,...
188,December 24 – 2022 Fijian general election: Th...,"[-0.01166312675923109, -0.00934850424528122, -..."
189,December 29 – Brazilian football legend Pelé d...,"[-0.007571390364319086, 0.0040404098108410835,..."
190,December 31 – Former Pope Benedict XVI dies at...,"[0.02359509840607643, 0.007731214631348848, -0..."
191,December 7 – The world population was estimate...,"[-0.0017243337351828814, -0.015179171226918697..."


In [19]:
df.to_csv("embeddings_my_exercise.csv")

In [20]:
! ls

casestudy.ipynb            embeddings.csv
codealong-original.ipynb   embeddings_my_exercise.csv
codealong.ipynb            [1m[36mexercises[m[m


If you want to stop the tutorial here and come back, you can reload `df` using this code (again adding your API key) rather than generating the embeddings again:

In [21]:
# import numpy as np
# import pandas as pd
# import openai
# import os
# OPENAI_KEY = os.getenv('OPENAI_KEY')
# openai.api_base = "https://openai.vocareum.com/v1"
# openai.api_key = OPENAI_KEY
# df = pd.read_csv("embeddings.csv", index_col=0)
# df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)

## Step 2

### Finding Relevant Data with Cosine Similarity

In [24]:
question1_embeddings = get_embedding(question1)
question1_embeddings

[-0.09932626038789749,
 0.021798856556415558,
 0.0017237075371667743,
 0.02637844905257225,
 0.00024376786313951015,
 -0.021798856556415558,
 -0.002361034043133259,
 -0.009647673927247524,
 -0.03415357694029808,
 0.0015138095477595925,
 0.0015710544539615512,
 -0.007597033865749836,
 0.015560435131192207,
 0.004879809450358152,
 -0.004704258404672146,
 0.04298710078001022,
 -0.03358367457985878,
 0.012069769203662872,
 -0.06794078648090363,
 -0.01353523787111044,
 -0.019254639744758606,
 -0.0010221394477412105,
 0.02224664017558098,
 -0.026134204119443893,
 -0.044900353997945786,
 -0.04400479048490524,
 0.004147074650973082,
 -0.0072611975483596325,
 -0.05487368628382683,
 -0.0408296063542366,
 0.004068204201757908,
 -0.03205714374780655,
 -0.03262704610824585,
 -0.005744843743741512,
 -0.0351916179060936,
 -0.02812887169420719,
 0.005576925352215767,
 0.0021854829974472523,
 -0.004383686929941177,
 0.0030021769925951958,
 -0.023508571088314056,
 0.007897252216935158,
 -0.0012473027454

In [23]:
!pip install scipy -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [22]:
# openai.embeddings_utils.distance_from_embeddings is no more supported so using the removed code from old commits:
# https://github.com/openai/openai-cookbook/blob/59019bd21ee4964dd58f6671b8dc1f3f6f3a92b3/examples/utils/embeddings_utils.py

from scipy import spatial
# from scipy.spatial.distance import cosine
from typing import List, Optional

def distances_from_embeddings(
    query_embedding: List[float],
    embeddings: List[List[float]],
    distance_metric="cosine",
) -> List[List]:
    """Return the distances between a query embedding and a list of embeddings."""
    distance_metrics = {
        "cosine": spatial.distance.cosine,
        "L1": spatial.distance.cityblock,
        "L2": spatial.distance.euclidean,
        "Linf": spatial.distance.chebyshev,
    }
    distances = [
        distance_metrics[distance_metric](query_embedding, embedding)
        for embedding in embeddings
    ]
    return distances

    # distances = df["embeddings"].apply(lambda x: cosine(query_embedding, x))

    # return distances



In [25]:
distances = distances_from_embeddings(question1_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
type(distances)
distances

[np.float64(1.0325053799362507),
 np.float64(1.0542450419357583),
 np.float64(1.0616402750014688),
 np.float64(1.0674954870423636),
 np.float64(1.0595687376491745),
 np.float64(1.0439791337830535),
 np.float64(1.0691915849012876),
 np.float64(1.049120939892588),
 np.float64(1.0583032852124652),
 np.float64(1.0603165253743456),
 np.float64(1.0270232545406133),
 np.float64(1.0488205235142138),
 np.float64(1.0415092455288808),
 np.float64(1.0806690233045981),
 np.float64(1.0746952679897994),
 np.float64(1.0611230008991612),
 np.float64(1.037400272436777),
 np.float64(1.0201380137784444),
 np.float64(1.0620445510010803),
 np.float64(1.053596482472448),
 np.float64(1.0601117876057495),
 np.float64(1.0353022457217547),
 np.float64(1.0409980773158614),
 np.float64(1.039522353160118),
 np.float64(1.043636709469357),
 np.float64(1.053872151144458),
 np.float64(1.0555509489417374),
 np.float64(1.069163049040774),
 np.float64(1.0613715975732814),
 np.float64(1.0583619440549557),
 np.float64(1.043

In [26]:
df["distances"] = distances
df

Unnamed: 0,text,embeddings,distances
0,– 2022 (MMXXII) was a common year starting on...,"[5.03144838148728e-05, -0.017939811572432518, ...",1.032505
1,– The year began with another wave in the COV...,"[-0.004625678062438965, -0.02004571445286274, ...",1.054245
2,– 2022 was also dominated by wars and armed c...,"[-0.009635788388550282, -0.015319113619625568,...",1.061640
3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014713579788804054, -0.007582539692521095,...",1.067495
4,January 1 – The Regional Comprehensive Econom...,"[-0.0005856040515936911, -0.024172160774469376...",1.059569
...,...,...,...
188,December 24 – 2022 Fijian general election: Th...,"[-0.01166312675923109, -0.00934850424528122, -...",1.039461
189,December 29 – Brazilian football legend Pelé d...,"[-0.007571390364319086, 0.0040404098108410835,...",1.037576
190,December 31 – Former Pope Benedict XVI dies at...,"[0.02359509840607643, 0.007731214631348848, -0...",1.053134
191,December 7 – The world population was estimate...,"[-0.0017243337351828814, -0.015179171226918697...",1.067083


In [27]:
df.sort_values(by='distances')

Unnamed: 0,text,embeddings,distances
43,March 5 – Researchers in the Antarctic find En...,"[-0.00688336743041873, -0.007791997864842415, ...",1.015287
17,January 24 – The federal government under Scot...,"[-0.009007374756038189, -0.008609510958194733,...",1.020138
157,October 25 – Rishi Sunak becomes Prime Ministe...,"[0.007556896656751633, -0.02273419313132763, -...",1.021319
10,January 9 – February 6 – The 2021 Africa Cup o...,"[-0.00272810785099864, -0.020591892302036285, ...",1.027023
119,"August 4 – The Prime Minister of Peru, Aníbal ...","[0.003285350976511836, 0.0016602528048679233, ...",1.027578
...,...,...,...
115,July 27 – A 7.0 earthquake strikes the island ...,"[0.0020686390344053507, 0.00393812358379364, 0...",1.078235
178,November 21 – A 5.6 earthquake strikes near Ci...,"[0.004115112125873566, -0.004124832805246115, ...",1.080273
13,January 15 – A large eruption of Hunga Tonga–H...,"[-0.006226368714123964, -0.017272738739848137,...",1.080669
136,September 12 – September 2022 Armenia–Azerbaij...,"[-0.012488479726016521, -0.009171020239591599,...",1.081130


In [28]:
current_shortest = df.iloc[0]["distances"]

for distance in df["distances"].values:
    if distance < current_shortest:
        current_shortest = distance

current_shortest

np.float64(1.0152870748085228)

In [29]:
df.iloc[0]["text"]

' – 2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  '

## Step 3

### Tokenizing with `tiktoken`

### Composing a Custom Text Prompt

## Step 4

### Getting a Custom Q&A Response with `openai.Completion`

## 🎉 Congratulations 🎉

You have now completed the prompt engineering process using unsupervised ML to get a custom answer from an OpenAI model!

![image description](Congratulation.png)