# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

Dataset Choice: NYC Food Scrap Drop-off Sites

I have chosen the NYC Food Scrap Drop-off Sites dataset for this custom chatbot project. This dataset contains information about food scrap drop-off sites in New York City, including locations, hours, and other relevant details. The dataset is composed of text data with at least 20 rows, making it suitable for the task.

Scenario:

The custom chatbot will be designed to provide accurate and up-to-date information on food scrap drop-off sites in New York City to users who are interested in composting and contributing to a more sustainable urban environment. By incorporating this dataset, the chatbot will be able to answer specific questions about the locations, hours of operation, and other details related to food scrap drop-off sites.

This customization would be useful for NYC residents or businesses looking to dispose of their food scraps responsibly, as well as for tourists who want to practice eco-friendly habits during their visit. By providing relevant and accurate information on food scrap drop-off sites, the chatbot can assist users in adopting sustainable practices and contribute to reducing the overall waste generated in New York City.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [61]:
import pandas as pd
import openai


In [154]:
# Load your dataset
df = pd.read_csv("nyc_food_scrap_drop_off_sites.csv")

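# Combine relevant columns into a single "text" column.
# Note: concatenating a string with a missing value (NaN) yields NaN,
# so rows with any empty field get a NaN "text" and are filtered out below.
df['text'] = "Borough: " + df['Borough'] + "," + "Site Name: " + df['SiteName'] + ", " + "Address: " + df['SiteAddr'] + ", " + "Hosted By: " + df['Hosted_By'] + ", " + "Open Months: " + df['Open_Month'] + ", " + "Days and Hours: " + df['Day_Hours'] + ", " + "Notes: " + df['Notes'] + ", " + "Website: " + df['Website']

# Keep only rows with non-empty text (NaN rows fail this comparison)
df = df[(df["text"].str.len() > 0)]

# Keep only rows whose text contains the expected "Borough" label
df = df[df["text"].str.contains("Borough")]

# Drop every column except "text"
df = df[['text']].dropna()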
df

Unnamed: 0,text
4,"Borough: Queens,Site Name: Astoria Pug: Broadw..."
5,"Borough: Brooklyn,Site Name: East New York Far..."
9,"Borough: Manhattan,Site Name: Battery Park Cit..."
12,"Borough: Bronx,Site Name: Drew Gardens, Addres..."
20,"Borough: Staten Island,Site Name: *CLOSED FOR ..."
...,...
468,"Borough: Brooklyn,Site Name: Imani Community G..."
470,"Borough: Manhattan,Site Name: Battery Park Cit..."
475,"Borough: Manhattan,Site Name: 79th St. Greenma..."
477,"Borough: Brooklyn,Site Name: *CLOSED FOR THE S..."
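
The gaps in the index above (4, 5, 9, 12, ...) are consistent with how pandas treats missing values: a string concatenated with `NaN` is `NaN`, so a site with any empty field loses its whole `text` entry. The sketch below shows one possible alternative in plain Python that fills gaps with a placeholder instead. The helper names `is_missing` and `build_text`, the `"N/A"` placeholder, and the example record are assumptions for illustration and are not part of the notebook.

```python
import math

# Display labels paired with the column names used in the notebook.
FIELDS = [
    ("Borough", "Borough"),
    ("Site Name", "SiteName"),
    ("Address", "SiteAddr"),
    ("Hosted By", "Hosted_By"),
    ("Open Months", "Open_Month"),
    ("Days and Hours", "Day_Hours"),
    ("Notes", "Notes"),
    ("Website", "Website"),
]


def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def build_text(record, placeholder="N/A"):
    """Hypothetical helper: join labelled fields, filling gaps with a placeholder."""
    parts = []
    for label, column in FIELDS:
        value = record.get(column)
        shown = placeholder if is_missing(value) else value
        parts.append(f"{label}: {shown}")
    return ", ".join(parts)


# Made-up record with two deliberately missing fields.
example = {
    "Borough": "Brooklyn",
    "SiteName": "Kensington Food Scrap Drop-off",
    "SiteAddr": "McDonald Ave & Albemarle Rd",
    "Hosted_By": "GrowNYC",
    "Open_Month": "Year Round",
    "Day_Hours": float("nan"),
    "Notes": None,
    "Website": "grownyc.org/compost",
}
print(build_text(example))
```

With a dataframe, something like `df.apply(lambda row: build_text(row.to_dict()), axis=1)` would apply the same idea row by row. Whether keeping incomplete sites is desirable depends on how the chatbot should treat them.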


In [155]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df


Unnamed: 0,text,embeddings
4,"Borough: Queens,Site Name: Astoria Pug: Broadw...","[0.011943390592932701, -0.007551177404820919, ..."
5,"Borough: Brooklyn,Site Name: East New York Far...","[0.014079678803682327, -0.028027154505252838, ..."
9,"Borough: Manhattan,Site Name: Battery Park Cit...","[0.019175074994564056, -0.01914825662970543, 0..."
12,"Borough: Bronx,Site Name: Drew Gardens, Addres...","[-0.007001108955591917, -0.015985535457730293,..."
20,"Borough: Staten Island,Site Name: *CLOSED FOR ...","[0.01771564967930317, -0.010109322145581245, 0..."
...,...,...
468,"Borough: Brooklyn,Site Name: Imani Community G...","[0.018176088109612465, -0.022423330694437027, ..."
470,"Borough: Manhattan,Site Name: Battery Park Cit...","[0.01392833050340414, -0.017139511182904243, 0..."
475,"Borough: Manhattan,Site Name: 79th St. Greenma...","[0.019267931580543518, -0.009739760309457779, ..."
477,"Borough: Brooklyn,Site Name: *CLOSED FOR THE S...","[-0.0077240243554115295, -0.04065559804439545,..."
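
The batching loop above steps through the dataframe in slices of `batch_size` and appends results in order, so the final list lines up with the rows. A minimal sketch of that pattern, using a stand-in `fake_embed` function instead of a real API call (both `batched` and `fake_embed` are hypothetical names for illustration):

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Stand-in for an embedding request: one tiny vector per input text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


texts = [f"site number {i}" for i in range(250)]

embeddings = []
for batch in batched(texts, 100):
    # Results are appended in input order, so embeddings[i] belongs to texts[i]
    embeddings.extend(fake_embed(batch))

batch_sizes = [len(batch) for batch in batched(texts, 100)]
print(batch_sizes)       # [100, 100, 50]
print(len(embeddings))   # 250
```

The last slice is simply shorter than the others, so no special handling is needed when the row count is not a multiple of the batch size.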


In [156]:
df.to_csv("embeddings.csv")

In [157]:
! ls

data  embeddings.csv  nyc_food_scrap_drop_off_sites.csv  project.ipynb


In [158]:
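import ast
import numpy as np
df = pd.read_csv("embeddings.csv", index_col=0)
# Embeddings are stored as strings in the CSV; parse them back into arrays.
# ast.literal_eval only accepts Python literals, unlike eval.
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)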
df

Unnamed: 0,text,embeddings
4,"Borough: Queens,Site Name: Astoria Pug: Broadw...","[0.011943390592932701, -0.007551177404820919, ..."
5,"Borough: Brooklyn,Site Name: East New York Far...","[0.014079678803682327, -0.028027154505252838, ..."
9,"Borough: Manhattan,Site Name: Battery Park Cit...","[0.019175074994564056, -0.01914825662970543, 0..."
12,"Borough: Bronx,Site Name: Drew Gardens, Addres...","[-0.007001108955591917, -0.015985535457730293,..."
20,"Borough: Staten Island,Site Name: *CLOSED FOR ...","[0.01771564967930317, -0.010109322145581245, 0..."
...,...,...
468,"Borough: Brooklyn,Site Name: Imani Community G...","[0.018176088109612465, -0.022423330694437027, ..."
470,"Borough: Manhattan,Site Name: Battery Park Cit...","[0.01392833050340414, -0.017139511182904243, 0..."
475,"Borough: Manhattan,Site Name: 79th St. Greenma...","[0.019267931580543518, -0.009739760309457779, ..."
477,"Borough: Brooklyn,Site Name: *CLOSED FOR THE S...","[-0.0077240243554115295, -0.04065559804439545,..."
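
Because a CSV file stores each embedding list as its string form, the column has to be parsed back into numbers on reload. The sketch below illustrates the round trip with only the standard library, using `ast.literal_eval`, which accepts Python literals only and so does not execute arbitrary code read from the file. The rows and vector values are made up for illustration.

```python
import ast
import csv
import io

# Made-up rows standing in for the dataframe contents.
rows = [
    {"text": "Borough: Queens", "embeddings": [0.0119, -0.0075, 0.25]},
    {"text": "Borough: Bronx", "embeddings": [-0.007, -0.0159, 0.5]},
]

# Write to an in-memory CSV; each list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Read it back and parse the string form into a real list of floats.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row["embeddings"] = ast.literal_eval(row["embeddings"])
    restored.append(row)

print(restored[0]["embeddings"])
```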


In [159]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
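
To make the ranking step concrete: cosine distance is conventionally 1 minus the cosine similarity of two vectors, so a smaller distance means the row points in a direction closer to the question. The following is a minimal pure-Python sketch of that idea with made-up two-dimensional vectors. The names `cosine_distance` and `rank_by_relevance` are illustrative and are not the library functions used above.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_relevance(question_vec, rows):
    """Sort rows so the smallest distance (most relevant) comes first."""
    return sorted(
        rows,
        key=lambda row: cosine_distance(question_vec, row["embedding"]),
    )


# Toy data: real embeddings have far more dimensions.
rows = [
    {"text": "A", "embedding": [1.0, 0.0]},
    {"text": "B", "embedding": [0.0, 1.0]},
    {"text": "C", "embedding": [1.0, 1.0]},
]
ranked = rank_by_relevance([1.0, 0.1], rows)
print([row["text"] for row in ranked])  # ['A', 'C', 'B']
```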


In [140]:
q1_prompt = """
Question: "What is the Name for this site: Newkirk Ave & Nostrand Ave"
Answer:
"""
initial_q1_answer = openai.Completion.create(
    model="text-davinci-003",
    prompt=q1_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_q1_answer)

q2_prompt = """
Question: "what is the address for this site: Crown Heights(South)?"
Answer:
"""
initial_q2_answer = openai.Completion.create(
    model="text-davinci-003",
    prompt=q2_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_q2_answer)

#get_rows_sorted_by_relevance("What is the total weight of food scraps collected in Manhattan in 2019?", df)
#get_rows_sorted_by_relevance("Which borough had the highest food scrap collection in 2020?", df)


The intersection of Newkirk Avenue and Nostrand Avenue is known as Newkirk Nostrand Plaza.
The address for Crown Heights (South) is 5700 North Broadway, Chicago, IL 60660.


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [160]:
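# Recompute the embeddings for the current dataframe. This repeats the
# earlier embedding cell; the values loaded from embeddings.csv could be reused.
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"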
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
4,"Borough: Queens,Site Name: Astoria Pug: Broadw...","[0.0118943490087986, -0.007599717006087303, -0..."
5,"Borough: Brooklyn,Site Name: East New York Far...","[0.014079678803682327, -0.028027154505252838, ..."
9,"Borough: Manhattan,Site Name: Battery Park Cit...","[0.019175074994564056, -0.01914825662970543, 0..."
12,"Borough: Bronx,Site Name: Drew Gardens, Addres...","[-0.007001108955591917, -0.015985535457730293,..."
20,"Borough: Staten Island,Site Name: *CLOSED FOR ...","[0.01771564967930317, -0.010109322145581245, 0..."
...,...,...
468,"Borough: Brooklyn,Site Name: Imani Community G...","[0.018320772796869278, -0.022356882691383362, ..."
470,"Borough: Manhattan,Site Name: Battery Park Cit...","[0.01392833236604929, -0.01712629944086075, 0...."
475,"Borough: Manhattan,Site Name: 79th St. Greenma...","[0.019249621778726578, -0.009757110849022865, ..."
477,"Borough: Brooklyn,Site Name: *CLOSED FOR THE S...","[-0.007695777341723442, -0.040675755590200424,..."


In [111]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
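    # Create a tokenizer that matches the embedding model (cl100k_base).
    # text-davinci-003 uses a different encoding (p50k_base), so these
    # counts are an approximation of the completion model's token usage.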
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
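
The context-building loop in `create_prompt` is a greedy fill: it walks the rows from most to least relevant and stops at the first row that would push the running total past the budget. A minimal sketch of that logic, assuming a whitespace word count as a stand-in for real token counting (`count_tokens` and `pack_context` are hypothetical names, and the texts are placeholders):

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def pack_context(ranked_texts, fixed_token_count, max_token_count):
    """Greedily keep texts, in ranked order, until the budget would be exceeded."""
    used = fixed_token_count  # tokens already spent on the template and question
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            break  # stop at the first text that does not fit
        context.append(text)
    return context


ranked_texts = [
    "one two three four",             # 4 tokens
    "five six seven",                 # 3 tokens
    "eight nine ten eleven twelve",   # 5 tokens, would exceed the budget
    "thirteen",                       # 1 token, never reached
]
selected = pack_context(ranked_texts, fixed_token_count=3, max_token_count=12)
print(selected)  # ['one two three four', 'five six seven']
```

Because the loop breaks rather than skips, a short row further down the ranking is not used even if it would still fit. That keeps the context strictly in relevance order at the cost of leaving some budget unused.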
    

In [161]:
COMPLETION_MODEL_NAME = "text-davinci-003"

def custom_query(question, df, max_prompt_tokens=3000, max_answer_tokens=750):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    #print(prompt)
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


In [125]:
print(create_prompt("In which borough can you find the most food scrap drop-off sites that are open on Saturdays?", df, 1000))




Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Borough: Manhattan,Site Name: East 96th Street Food Scrap Drop-off, Address: 96th St & Lexington Ave, Hosted By: GrowNYC, Open Months: Year Round, Days and Hours: Fridays (Start Time: 7:30 AM - End Time:  11:30 AM), Notes: Not accepted: meat, bones, or dairy, Website: grownyc.org/compost

###

Borough: Brooklyn,Site Name: Kensington Food Scrap Drop-off, Address: McDonald Ave & Albemarle Rd, Hosted By: GrowNYC, Open Months: Year Round, Days and Hours: Saturdays (Start Time: 8:30 AM - End Time:  11:30 AM), Notes: Not accepted: meat, bones, or dairy, Website: grownyc.org/compost

###

Borough: Manhattan,Site Name: Asphalt Green Food Scrap Drop-off, Address: East 91st St & York Ave, Hosted By: GrowNYC, Open Months: Year Round, Days and Hours: Sundays (Start Time: 7:30 AM - End Time:  12:30 PM), Notes: Not accepted: meat, bones, or dairy, Website: grown

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [162]:
print(f"Standard query result for Question 1: {initial_q1_answer}")

Standard query result for Question 1: The intersection of Newkirk Avenue and Nostrand Avenue is known as Newkirk Nostrand Plaza.


In [163]:
# Question 1
question1 = "What is the Name for this site: Newkirk Ave & Nostrand Ave"

result_custom_q1 = custom_query(question1, df)

print(f"Custom query result for Question 1: {result_custom_q1}")


Custom query result for Question 1: Little Haiti Food Scrap Drop-off Site


### Question 2

In [164]:
print(f"Standard query result for Question 2: {initial_q2_answer}")


Standard query result for Question 2: The address for Crown Heights (South) is 5700 North Broadway, Chicago, IL 60660.


In [165]:
# Question 2
question2 = "what is the address for this site: Crown Heights(South)?"

result_custom_q2 = custom_query(question2, df)

print(f"Custom query result for Question 2: {result_custom_q2}")


Custom query result for Question 2: Nostrand Ave & Crown St
