# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [12]:
import pandas as pd
import openai
import requests
from scipy.spatial import distance
from openai.embeddings_utils import distances_from_embeddings
from openai.embeddings_utils import get_embedding

openai.api_key =  "API_KEY"
openai.api_base = "https://openai.vocareum.com/v1" # Remove this if using personal key

In [2]:
# Load data from Wikipedia using the API; this can be skipped if you have already saved text.csv

params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "Synthesizer",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

#response_dict 
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
# Earlier approach (raw page content via prop=revisions), kept below for reference:
#response = requests.get("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=content&titles=Synthesizer&rvslots=*")
#response.json()["query"]["pages"]["10791746"]["revisions"][0]["slots"]["main"]["*"].split("\n")
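
The lookup `response_dict["query"]["pages"][0]["extract"]` relies on `formatversion=2` returning `pages` as a list. The following is a minimal offline sketch of that lookup with a fallback for a page that has no extract. The helper name `extract_lines` and the sample dictionary are illustrative assumptions, not real API output.

```python
def extract_lines(response_dict):
    """Return the page extract split into lines, or [] if no extract is present."""
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        return []
    # A missing page carries no "extract" key, so fall back to an empty string.
    extract = pages[0].get("extract", "")
    return extract.split("\n") if extract else []


# Hand-written stand-in for resp.json(); not real Wikipedia output.
sample_response = {
    "query": {
        "pages": [
            {
                "title": "Synthesizer",
                "extract": "First paragraph.\n\n== History ==\nSecond paragraph.",
            }
        ]
    }
}

sample_lines = extract_lines(sample_response)
print(sample_lines)
```

Splitting on `"\n"` keeps the empty strings and the `== Heading ==` lines, which is why a separate clean-up step is needed later.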

In [3]:
# Load page text into a dataframe; this can be skipped if you have already saved text.csv
df = pd.DataFrame()
df["text"] = text_data
df

| | text |
|---|---|
| 0 | A synthesizer (also synthesiser or synth) is a... |
| 1 | |
| 2 | Synthesizer-like instruments emerged in the Un... |
| 3 | In 1978, Sequential Circuits released the Prop... |
| 4 | Synthesizers were initially viewed as avant-ga... |
| ... | ... |
| 157 | == External links == |
| 158 | |
| 159 | Sound Synthesis Theory wikibook |
| 160 | Principles of Sound Synthesis Archived 20 Janu... |


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [4]:
# Clean up text to remove empty lines and headings; this can be skipped if you have already saved text.csv
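# .copy() makes the filtered frame independent, so adding columns later does not raise SettingWithCopyWarning
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()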
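
To see which rows the filter above keeps, here is a plain-Python sketch of the same two conditions applied to a few sample lines. The helper `keep_line` and the sample list are illustrative assumptions, and the pandas expression in the cell remains the actual clean-up step.

```python
def keep_line(line):
    """Mirror the dataframe filter: drop empty lines and lines starting with '=='."""
    return len(line) > 0 and not line.startswith("==")


# Illustrative lines modelled on the dataframe preview; not the full dataset.
sample_lines = [
    "A synthesizer is an electronic musical instrument.",
    "",
    "== External links ==",
    "Sound Synthesis Theory wikibook",
]

kept = [line for line in sample_lines if keep_line(line)]
print(kept)
```

Link titles that follow a heading such as `== External links ==` pass both conditions, so rows like that remain in the data unless they are removed separately.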

In [5]:
# this can be skipped if you have already saved text.csv

# For Debug
#df

# Save to CSV
df.to_csv('text.csv', index=False)

# Load csv if saved; start here (after loading the required libraries) if you have a text.csv
# df = pd.read_csv('text.csv')  # saved with index=False, so there is no index column to read
# Load Embedding Model / Engine 
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

# Extract and print the first 20 numbers in the embedding
response_list = response["data"]
first_item = response_list[0]
first_item_embedding = first_item["embedding"]
print(first_item_embedding[:20])
len(first_item_embedding)

embeddings = [data["embedding"] for data in response["data"]]

# used to check embedding made, used for debug
# embeddings

[-0.024538422003388405, -0.014351220801472664, -0.01926654577255249, -0.007525795139372349, -0.01990324631333351, 0.02309948019683361, -0.028676973655819893, 0.0024958644062280655, -0.026486724615097046, -0.004390047397464514, 0.02814214490354061, 0.022806597873568535, -0.026894211769104004, 0.001149243675172329, 0.01446582656353712, -0.0007127061835490167, 0.029670225456357002, -0.009512299671769142, 0.007264748215675354, -0.03150392323732376]


In [6]:
# Add embeddings list to dataframe
df["embeddings"] = embeddings

#for debug
#df

# Save embeddings
df.to_csv("embeddings.csv")



In [7]:
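# After reloading embeddings.csv, the embeddings column holds strings; convert it back (needs `import numpy as np`):
# df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
# df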
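
The commented-out conversion above is needed because a list stored in a CSV cell comes back as a string. The sketch below reproduces that round trip with the standard `csv` module and an in-memory buffer instead of pandas and `embeddings.csv`. It parses with `ast.literal_eval`, swapped in here for `eval` because it only accepts Python literals. The two short vectors are made-up stand-ins for real embeddings.

```python
import ast
import csv
import io

# Made-up rows; real embeddings are much longer vectors.
rows = [
    {"text": "first row", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second row", "embeddings": [0.0, 0.5, -0.5]},
]

# Write to an in-memory CSV; each list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back: the embeddings column now contains strings, not lists.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]))

# Parse each string back into a list of floats.
restored = [ast.literal_eval(row["embeddings"]) for row in loaded]
print(restored[0])
```

With a dataframe, the same parsing step would be `df["embeddings"].apply(ast.literal_eval)`.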



## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
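
One way the two prompts for this comparison might be assembled is sketched below, assuming the rows of text have already been sorted from closest to farthest. The function names, the prompt wording, the example question, and the character budget (a rough stand-in for a token limit) are all assumptions made for illustration.

```python
SEPARATOR = "\n\n###\n\n"


def build_basic_prompt(question):
    """Prompt with no added context, for the baseline answer."""
    return f"Question: {question}\nAnswer:"


def build_custom_prompt(question, ranked_texts, max_chars=500):
    """Prepend as many of the closest rows as fit within max_chars."""
    header = "Answer the question using only the context below.\n\nContext:\n\n"
    footer = f"\n\n---\n\n{build_basic_prompt(question)}"
    budget = max_chars - len(header) - len(footer)
    selected = []
    for text in ranked_texts:
        cost = len(text) + len(SEPARATOR)
        if cost > budget:
            break  # rows are ordered by relevance, so stop at the first that does not fit
        selected.append(text)
        budget -= cost
    return header + SEPARATOR.join(selected) + footer


# Hypothetical rows, already ordered from most to least relevant.
example_texts = ["Closest row of text.", "Second closest row.", "x" * 1000]
custom_prompt = build_custom_prompt("What does a filter do?", example_texts)
print(custom_prompt)
```

Both prompts end with the same `Question: ... Answer:` suffix, so any difference between the two answers comes from the added context.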

### Question 1

In [9]:
question = "What components are used to altered by sounds on a Synthesizer"
question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

# For Debug
# question_embeddings



[0.0024525001645088196,
 0.01510996650904417,
 -0.016851866617798805,
 ...]

In [24]:
distances = distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")

# For Debug
# distances

df["distances"] = distances
# Save the distances, then display the dataframe for debugging
df.to_csv("distances.csv")
df
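
For reference, the quantity requested with `distance_metric="cosine"` is the cosine distance, one minus the cosine similarity of two vectors. The sketch below computes it in plain Python for toy two-dimensional vectors and ranks the rows from closest to farthest. The vectors and row names are made up, and the helper is an independent illustration rather than the library's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for the question embedding and the row embeddings.
question_vector = [1.0, 0.0]
row_vectors = {
    "row_a": [1.0, 0.0],  # same direction as the question
    "row_b": [0.0, 1.0],  # orthogonal to the question
    "row_c": [1.0, 1.0],  # halfway between the two
}

distances = {
    name: cosine_distance(question_vector, vector)
    for name, vector in row_vectors.items()
}

# Smaller distance means more similar, so sort in ascending order.
ranked = sorted(distances, key=distances.get)
print(ranked)
```

On the dataframe, the equivalent ordering is `df.sort_values("distances")`, which puts the most relevant rows first.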



### Question 2