# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import pandas as pd
import openai
import requests
from scipy.spatial import distance
from openai.embeddings_utils import distances_from_embeddings
from openai.embeddings_utils import get_embedding 
import tiktoken


openai.api_key =  "API_KEY"
openai.api_base = "https://openai.vocareum.com/v1" # Remove this if using personal key

In [2]:
# Load data from Wikipedia using the API; this can be skipped if you have already saved text.csv

params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "Synthesizer",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

#response_dict 
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
# Older request code (fetches revision wikitext instead of the plain-text extract), kept for reference:
#response = requests.get("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=content&titles=Synthesizer&rvslots=*")
#response.json()["query"]["pages"]["10791746"]["revisions"][0]["slots"]["main"]["*"].split("\n")

In [3]:
# Load page text into a dataframe; this can be skipped if you have already saved text.csv
df = pd.DataFrame()
df["text"] = text_data
# Clean up text to remove empty lines and headings; this can be skipped if you have already saved text.csv
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
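The clean-up rule above (drop empty lines and lines that start with `==`) can be checked without calling the Wikipedia API. The following is a minimal sketch on a made-up extract; `clean_extract_lines` and `sample_extract` are hypothetical names used only for illustration and are not part of the notebook.

```python
def clean_extract_lines(extract):
    """Split a plain-text extract into lines, dropping blank lines
    and wiki-style heading lines such as '== History =='."""
    lines = extract.split("\n")
    return [line for line in lines if len(line) > 0 and not line.startswith("==")]


# Made-up extract that mimics the shape of an "explaintext" response
sample_extract = (
    "Intro paragraph.\n\n"
    "== History ==\n"
    "First history paragraph.\n\n"
    "=== Early years ===\n"
    "Second paragraph."
)

rows = clean_extract_lines(sample_extract)
print(rows)
```

Only the three paragraph lines survive, which mirrors what the boolean mask on `df["text"]` keeps.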

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [4]:
# this can be skipped if you have already saved text.csv

# For Debug
#df

# Save to CSV
df.to_csv('text.csv', index=False)

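# Load csv if saved; start here (after loading the required libraries) if you have a text.csv
# (text.csv was written with index=False, so no index_col is needed when reading it back)
# df = pd.read_csv('text.csv')
# Create embeddings for every row of text with the embedding model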
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [5]:
# Extract and print the first 20 numbers in the embedding
response_list = response["data"]
first_item = response_list[0]
first_item_embedding = first_item["embedding"]
print(first_item_embedding[:20])
len(first_item_embedding)

embeddings = [data["embedding"] for data in response["data"]]

# For debug: inspect the full list of embeddings
# embeddings

[-0.024538422003388405, -0.014351220801472664, -0.01926654577255249, -0.007525795139372349, -0.01990324631333351, 0.02309948019683361, -0.028676973655819893, 0.0024958644062280655, -0.026486724615097046, -0.004390047397464514, 0.02814214490354061, 0.022806597873568535, -0.026894211769104004, 0.001149243675172329, 0.01446582656353712, -0.0007127061835490167, 0.029670225456357002, -0.009512299671769142, 0.007264748215675354, -0.03150392323732376]


In [6]:
# Add embeddings list to dataframe
df["embeddings"] = embeddings

#for debug
#df

# Save embeddings
df.to_csv("embeddings.csv")
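One thing to keep in mind when saving embeddings to CSV is that each list is written as its string form, so it comes back as a string when the file is read again. The sketch below illustrates that round trip with the standard-library `csv` module and an in-memory buffer rather than pandas; the two sample rows are made up, and `ast.literal_eval` is used here as an assumed, safer alternative to `eval` for turning the strings back into lists.

```python
import ast
import csv
import io

# Made-up rows standing in for the "text" and "embeddings" columns
rows = [
    {"text": "first row", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second row", "embeddings": [0.0, 0.5, -0.5]},
]

# Write to an in-memory CSV; each list is stored via str(), e.g. "[0.1, -0.2, 0.3]"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow(row)

# Read it back: the embeddings column now holds strings, not lists
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]))

# Parse each string back into a list of floats
restored = [ast.literal_eval(r["embeddings"]) for r in loaded]
print(restored[0])
```

The same parsing step would apply to a dataframe column after reloading `embeddings.csv`, before computing any distances.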

In [7]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
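To make the ranking step concrete, here is a minimal sketch of cosine distance on tiny hand-made vectors. It is not the library's implementation; `cosine_distance` and the sample vectors are hypothetical, and real embeddings have far more dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "question" vector and three toy "row" vectors
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance puts the most relevant row first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not matter here, only direction, which is why `[2.0, 0.0]` has distance 0 from `[1.0, 0.0]`.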
    



In [8]:
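# If embeddings were reloaded from embeddings.csv, convert the stored strings
# back to arrays first (requires `import numpy as np`):
#df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
#df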

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
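The context-building loop in `create_prompt` is a greedy fill against a token budget: rows are taken in relevance order until the next one would exceed the limit. The sketch below isolates that logic so it can be run without `tiktoken` or the API. `pack_context` and `count_words` are hypothetical helpers, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
def pack_context(rows, base_token_count, max_token_count, count_tokens):
    """Greedily collect rows, in order, until the token budget would be exceeded."""
    context = []
    current_token_count = base_token_count
    for text in rows:
        current_token_count += count_tokens(text)
        if current_token_count > max_token_count:
            # Stop at the first row that does not fit, as create_prompt does
            break
        context.append(text)
    return context


def count_words(text):
    """Stand-in tokenizer: one 'token' per whitespace-separated word."""
    return len(text.split())


# Rows are assumed to be pre-sorted from most to least relevant
rows = ["one two three", "four five", "six seven eight nine", "ten"]
packed = pack_context(rows, base_token_count=2, max_token_count=8, count_tokens=count_words)
print(packed)
```

Note that the loop stops at the first row that overflows, so the short final row is not considered even though it would fit on its own.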

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [9]:
max_token_count = 1000
question = "What components are used to alter sounds on a synthesizer?"
create_prompt(question, df, max_token_count)





'\n    Answer the question based on the context below, and if the question\n    can\'t be answered based on the context, say "I don\'t know"\n\n    Context: \n\n    Synthesizers generate audio through various forms of analog and digital synthesis.\n\n###\n\nSynthesizers are often controlled with electronic or digital keyboards or MIDI controller keyboards, which may be built into the synthesizer unit or attached via connections such as CV/gate, USB, or MIDI. Keyboards may offer expression such as velocity sensitivity and aftertouch, allowing for more control over the sound. Other controllers include ribbon controllers, which track the movement of the finger across a touch-sensitive surface; wind controllers, played similarly to woodwind instruments; motion-sensitive controllers similar to video game motion controllers; electronic drum pads, played similarly to the heads of a drum kit; touchplates, which send signals depending on finger position and force; controllers designed for micro

In [10]:
response = openai.Completion.create(
  model="gpt-3.5-turbo-instruct",
  prompt=create_prompt(question, df, max_token_count),
  max_tokens=7,
  temperature=0
)

print(response["choices"][0]["text"])

 Components such as filters, envelopes,


### Question 2

In [11]:
max_token_count = 1000
question = "What music genres have been influenced by the Synthesizer?"
create_prompt(question, df, max_token_count)

'\n    Answer the question based on the context below, and if the question\n    can\'t be answered based on the context, say "I don\'t know"\n\n    Context: \n\n    In the 1970s, electronic music composers such as Jean Michel Jarre and Isao Tomita released successful synthesizer-led instrumental albums. This influenced the emergence of synth-pop from the late 1970s to the early 1980s. The work of German krautrock bands such as Kraftwerk and Tangerine Dream, British acts such as John Foxx, Gary Numan and David Bowie, African-American acts such as George Clinton and Zapp, and Japanese electronic acts such as Yellow Magic Orchestra and Kitaro were influential in the development of the genre.\n\n###\n\nSynthesizers were initially viewed as avant-garde, valued by the 1960s psychedelic and countercultural scenes but with little perceived commercial potential. Switched-On Bach (1968), a bestselling album of Bach compositions arranged for synthesizer by Wendy Carlos, took synthesizers to the m

In [12]:
response = openai.Completion.create(
  model="gpt-3.5-turbo-instruct",
  prompt=create_prompt(question, df, max_token_count),
  max_tokens=7,
  temperature=0
)

print(response["choices"][0]["text"])

 Electronic, hip hop, disco,
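For a side-by-side demonstration, the same question can be sent once as a bare prompt and once wrapped in the custom context prompt. The following is a rough sketch of that comparison with the model call injected as a plain function, so it runs without an API key. `compare_answers`, `fake_build_prompt`, and `fake_complete` are hypothetical names, and the fake completion only echoes which kind of prompt it received.

```python
def compare_answers(question, build_prompt, complete):
    """Return the answer to the bare question and to the context-augmented prompt."""
    basic_answer = complete(question)
    custom_answer = complete(build_prompt(question))
    return {"basic": basic_answer.strip(), "custom": custom_answer.strip()}


def fake_build_prompt(question):
    """Stand-in for a prompt builder that prepends retrieved context."""
    return "Context: ...\n\nQuestion: " + question + "\nAnswer:"


def fake_complete(prompt):
    """Stand-in for a model call; reports whether context was supplied."""
    if prompt.startswith("Context:"):
        return " with context"
    return " without context"


result = compare_answers(
    "What music genres have been influenced by the Synthesizer?",
    fake_build_prompt,
    fake_complete,
)
print(result)
```

In the notebook, `build_prompt` could be a small wrapper around `create_prompt(question, df, max_token_count)` and `complete` a wrapper that calls `openai.Completion.create` and returns `response["choices"][0]["text"]`.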
