# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

## Dataset Choice

I selected the `nyc_food_scrap_drop_off_sites.csv` dataset primarily because it contains valuable information that end users can use to make sustainable choices in their day-to-day lives. The dataset was also interesting and closer to how real-world datasets tend to look, with multiple columns that hold distinct kinds of data.

The dataset has many useful columns. I selected the following columns to work with:

- `borough`: lets the user get information for a specific borough in New York
- `ntaname`: name of the neighbourhood
- `food_scrap_drop_off_site`: intersection information for the drop-off site
- `location`: gives the user the exact location of the site
- `hosted_by`: the institution managing the drop-off site
- `open_months`: whether the site is open only in certain months or year-round
- `operation_day_hours`: the days and hours when the site is open
- `website`: where the user can find more information about the site
- `notes`: the types of food accepted and any specific app requirement

I believe a chatbot with this information could easily give end users relevant answers.
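To make the column selection concrete, here is a minimal sketch of how one record, restricted to these columns, can be serialized into a single `text` string. It uses a plain dictionary and the standard `json` module rather than a DataFrame so that it runs without the dataset. The `row_to_text` helper and every sample value are hypothetical placeholders, not rows from the CSV.

```python
import json

# Columns chosen above, in the same order as the description.
SELECTED_COLUMNS = [
    "borough", "ntaname", "food_scrap_drop_off_site", "location",
    "hosted_by", "open_months", "operation_day_hours", "website", "notes",
]

def row_to_text(row, columns=SELECTED_COLUMNS):
    """Keep only the selected columns and serialize them as one JSON string."""
    return json.dumps({name: row.get(name) for name in columns})

# Placeholder record (made-up values, not a real row from the CSV).
sample_row = {
    "borough": "Example Borough",
    "ntaname": "Example Neighbourhood",
    "food_scrap_drop_off_site": "Example St & Sample Ave",
    "location": "Example St & Sample Ave, Example City",
    "hosted_by": "Example Host",
    "open_months": "Year Round",
    "operation_day_hours": "Fridays (Start Time: 8:30 AM - End Time: 2:30 PM)",
    "website": "example.org",
    "notes": None,
    "unused_column": "dropped because it is not in SELECTED_COLUMNS",
}

text = row_to_text(sample_row)
print(text)
```

Missing values such as `notes` above serialize as `null`, so any downstream formatting step should expect them.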

In [1]:
import getpass
import openai
import pandas as pd
import numpy as np
import tiktoken
from openai.embeddings_utils import distances_from_embeddings

In [15]:
# ===============================
# API Key Configuration (DO NOT MODIFY)
# ===============================

openai.api_base = "https://openai.vocareum.com/v1"

# This function is complete and should not be modified.
def get_openai_api_key():
    key = getpass.getpass("Enter OpenAI API key (input hidden): ").strip()
    while not key:
        print("API key cannot be empty!")
        key = getpass.getpass("Enter OpenAI API key (input hidden): ").strip()

    print(f"API key configured (last 4 chars): ****{key[-4:]}")
    return key

openai.api_key = get_openai_api_key()

Enter OpenAI API key (input hidden): ········
API key configured (last 4 chars): **** KEY


In [3]:
# ===============================
# Dataset & Embedding Functions
# ===============================

def filter_dataset(df, column_names):
    filtered_df = df[column_names].copy()
    return filtered_df

def load_dataset(file_path, column_names):
    df = pd.read_csv(file_path)
    filtered_df = filter_dataset(df, column_names)
    json_list = filtered_df.apply(lambda row: row.to_json(), axis=1)  
    filtered_df['text'] = json_list 
    return filtered_df[['text']]


def generate_embeddings(df, embedding_model_name="text-embedding-ada-002", batch_size=1):
    embeddings = []
    for i in range(0, len(df), batch_size):
        response = openai.Embedding.create(
            input=df.iloc[i:i + batch_size]["text"].tolist(),
            engine=embedding_model_name
        )
        embeddings.extend([data["embedding"] for data in response["data"]])
    df["embeddings"] = embeddings
    return df

def save_embeddings(df, output_file):
    df.to_csv(output_file, index=False) 

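import ast

def load_embeddings(file_path):
    df = pd.read_csv(file_path)
    # Parse each stored list literal safely instead of calling eval on file contents
    df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
    return df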

embedding_model_name="text-embedding-ada-002"
def get_relevant_rows(question, df, top_n):
    # Encode the question
    question_embedding = openai.Embedding.create(
        model=embedding_model_name,
        input=question
    )['data'][0]['embedding']

    df_copy = df.copy()
    # calculate cosine distance between the questions and the embeddings column
    df_copy['distance'] = distances_from_embeddings(question_embedding, df_copy['embeddings'].values, distance_metric="cosine")
    return df_copy.nsmallest(top_n, 'distance')
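
The retrieval step in `get_relevant_rows` ranks rows by cosine distance to the question embedding and keeps the `top_n` closest. The sketch below shows the same idea in pure Python on toy two-dimensional vectors. It illustrates the metric and is not the library's implementation; `cosine_distance`, `rank_by_distance`, and the toy vectors are hypothetical names and values introduced here.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_by_distance(question_vec, row_vecs, top_n):
    """Return the indices of the top_n rows closest to the question vector."""
    distances = [(cosine_distance(question_vec, vec), idx)
                 for idx, vec in enumerate(row_vecs)]
    return [idx for _, idx in sorted(distances)[:top_n]]

# Toy vectors standing in for real embeddings (assumed values).
toy_question = [1.0, 0.0]
toy_rows = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]

closest = rank_by_distance(toy_question, toy_rows, top_n=2)
print(closest)
```

The row pointing in nearly the same direction as the question ranks first, and the row pointing the opposite way ranks last.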

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
# ===============================
# Prompt Creation & Answering
# ===============================

def create_prompt(question, df, max_token_count=1500):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    prompt_template = """
    Answer the question based on the context below. If the question can't be answered based on the context, say "I don't know."

    Context: {}

    ---

    Question: {}

    Answer:
    """
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context = []
    for text in df["text"].values:
        tokens_in_text = len(tokenizer.encode(text))
        if current_token_count + tokens_in_text <= max_token_count:
            context.append(text)
            current_token_count += tokens_in_text
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

def get_openai_answer(prompt, max_answer_tokens=150):
    try:
        response = openai.Completion.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip() 
    except Exception as e:
        print(f"Error: {str(e)}")
        return "An error occurred."
    

def reformat_rows(df):
    import json
    food_scrap_list = []
    for _, row in df.iterrows():
        # Each "text" value is the JSON string produced in load_dataset
        row_obj = json.loads(row['text'])
        formatted_str = (
            f"Food scrap drop off is available at {row_obj['food_scrap_drop_off_site']} "
            f"in {row_obj['ntaname']}, {row_obj['borough']}. "
            f"It is hosted by {row_obj['hosted_by']} and is open {row_obj['open_months']} "
            f"during the hours of {row_obj['operation_day_hours']}. "
            f"You can find more information at the location {row_obj['location']} "
            f"or by visiting their website at {row_obj['website']}"
        )
        food_scrap_list.append(formatted_str)
    # One sentence per drop-off site, in a fresh DataFrame with a single "text" column
    return pd.DataFrame({'text': food_scrap_list})
        
    
def answer_question_with_context(question, df, max_prompt_tokens=1500, max_answer_tokens=150, top_n=10):
    relevant_rows = get_relevant_rows(question, df, top_n=top_n)
    # print(relevant_rows)
    reformatted_rows = reformat_rows(relevant_rows)
    # Construct a combined prompt using the relevant rows and the question.
    prompt = create_prompt(question, reformatted_rows, max_token_count=max_prompt_tokens)
    # Generate and return the answer using the combined prompt.
    # print(prompt)
    return get_openai_answer(prompt, max_answer_tokens=max_answer_tokens)
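
The context assembly in `create_prompt` is a greedy token budget: texts are added in relevance order until the next one would exceed the limit. Here is a simplified sketch of that loop, assuming a whitespace word count as a stand-in for the `tiktoken` tokenizer and ignoring the template's own tokens. `count_tokens`, `select_context`, and the toy texts are hypothetical.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())

def select_context(texts, question, max_token_count):
    """Add texts in order until the next one would exceed the token budget."""
    used = count_tokens(question)
    selected = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_token_count:
            break  # stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected

# Toy context rows, assumed to be already sorted by relevance.
toy_texts = [
    "site one is open on fridays",
    "site two is open all year",
    "site three is closed in winter",
]

chosen = select_context(toy_texts, "which sites are open", max_token_count=16)
print(chosen)
```

Because the loop stops at the first text that does not fit, a long early row can crowd out shorter rows that follow it; raising the budget or lowering `top_n` changes how much context reaches the model.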



In [13]:
def main():
    dataset_path = './data/nyc_food_scrap_drop_off_sites.csv'
    col_names = ["borough","ntaname","food_scrap_drop_off_site","location","hosted_by","open_months","operation_day_hours","website","notes"]
    loaded_dataset = load_dataset(dataset_path, col_names)
    # filtered_dataset = loaded_dataset[]copy()
    # print(loaded_dataset.head())
    generate_embeddings(loaded_dataset)

    # Question 1
    question1 = "List me the food scrap drop off sites available in Brooklyn borough for 2023 year?"
    response11 = get_openai_answer(question1)
    # print(response11)
    response12 = answer_question_with_context(question1, loaded_dataset)
    # print(response12)

    # Question 2
    question2 = "List me the food scrap drop off sites available in Manhattan for 2023 year?"
    response21 = get_openai_answer(question2)
    # print(response21)
    response22 = answer_question_with_context(question2, loaded_dataset)
    # print(response22)

    print(f"Question 1: {question1}\n\n Basic Answer: {response11}\n\n Custom Answer: {response12}\n\n")
    print(f"Question 2: {question2}\n\n Basic Answer: {response21}\n\n Custom Answer: {response22}\n\n")

In [14]:
if __name__ == "__main__":
    main()

Question 1: List me the food scrap drop off sites available in Brooklyn borough for 2023 year?

 Basic Answer: Unfortunately, it is impossible to provide a list of food scrap drop off sites for the year 2023 as they may change or new sites may be added in the next two years. The best option would be to check with local government or waste management agencies for updated information closer to the desired time.

 Custom Answer: 1. Food scrap drop off available at Flatbush Junction Food Scrap Drop-off in Flatbush, Brooklyn, hosted by GrowNYC, open Year Round on Fridays (Start Time: 8:30 AM - End Time: 2:30 PM).
2. Food scrap drop off available at Bay Parkway at 66th Street in Bensonhurst, Brooklyn, hosted by NYC Compost Project Hosted by LES Ecology Center, open Year Round on Tuesdays (Start Time: 10:00 AM - End Time: 2:00 PM).
3. Food scrap drop off available at Kensington Food Scrap Drop-off in Kensington, Brooklyn, hosted by GrowNYC, open Year Round on Saturdays (Start Time: 8:30 AM




In [8]:
# dataset_path = './data/nyc_food_scrap_drop_off_sites.csv'
# col_names = ["borough","ntaname","food_scrap_drop_off_site","location","hosted_by","open_months","operation_day_hours","website","notes"]
# loaded_dataset = load_dataset(dataset_path, col_names)
# f = reformat_rows(loaded_dataset)
# print(f.head())

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [9]:
question1 = "Can you let me know the drop off locations for food scrap in Brooklyn borough for 2023 year?"
# response11 = get_openai_answer(question1)
# print(response11)

In [10]:
# question1 = "Can you let me know the drop off locations for food scrap in Brooklyn borough for 2023 year?"
# response12 = answer_question_with_context(question1, loaded_dataset)
# print(response12)

### Question 2

In [11]:
question2 = "How many food scrap drop off locations are available in Queens in 2023?"
# question2 = "Which food scrap drop off locations are available in Queens in 2023?"
# response21 = get_openai_answer(question2)
# print(response21)


In [12]:
# response22 = answer_question_with_context(question2, loaded_dataset)
# print(response22)