# Custom Chatbot Project

For this project, I selected the most recent version of the NYC Food Scrap Drop-Off Sites dataset. While the dataset provided by Udacity was last refreshed in early 2023, I opted to retrieve the latest data directly from the City of New York Open Data portal, accessible via this link: <a href="https://dev.socrata.com/foundry/data.cityofnewyork.us/if26-z6xq" target="_blank" rel="noopener noreferrer">NYC Food Scrap Drop-Off Sites</a>.

This dataset was chosen for two primary reasons:

Firstly, GPT-3.5's training data extends only to January 2022 (later than the previous cutoff of September 2021), so the model has no knowledge of sites that opened, closed, or changed schedule after that date. Supplying the most current dataset lets a chatbot built on it give accurate, relevant answers about food scrap drop-off sites across New York City.

Secondly, this dataset is a good example of using Generative AI (Gen AI) for altruistic purposes. A chatbot equipped with up-to-date information on food scrap drop-off locations can help residents find a convenient site, which supports efforts to reduce food waste through community composting. This aligns with the ethical considerations that must be at the forefront of AI development and deployment. By focusing on beneficial and responsible use cases, we can harness AI's potential to address real-world issues and promote sustainability and community well-being.

In summary, the NYC Food Scrap Drop-Off Sites dataset is not only relevant and timely but also exemplifies the positive impact that AI can have when applied thoughtfully and ethically.
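
As a rough illustration of pulling the latest data from the portal, the sketch below builds a CSV download URL. It assumes the Socrata convention of serving each dataset at `/resource/<dataset-id>.csv` with a `$limit` query parameter. The helper name `build_csv_url` and the default limit are illustrative, and the column names returned by the API may differ from those in the CSV file used in this notebook.

```python
from urllib.parse import urlencode

# Assumption: Socrata serves each dataset at /resource/<dataset-id>.csv
BASE_URL = "https://data.cityofnewyork.us/resource"
DATASET_ID = "if26-z6xq"  # ID taken from the portal link in the introduction


def build_csv_url(dataset_id, limit=5000):
    """Return a CSV download URL with an explicit row limit (hypothetical helper)."""
    # Keep "$" unescaped so the parameter reads as $limit
    query = urlencode({"$limit": limit}, safe="$")
    return f"{BASE_URL}/{dataset_id}.csv?{query}"


url = build_csv_url(DATASET_ID)
print(url)
# The frame could then be loaded with: pd.read_csv(url)
```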

## Data Wrangling

In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [461]:
# import necessary libraries

import openai
import pandas as pd
import tiktoken
import re
pd.set_option('max_colwidth', None)
pd.set_option('display.max_columns', 1000)
pd.set_option("expand_frame_repr", False)

In [462]:
# Define api key and model
openai.api_key = 'YOUR API KEY'
openai.api_base = "https://openai.vocareum.com/v1"
MODEL_NAME = 'gpt-3.5-turbo-instruct'

In [463]:
# Load the updated dataset as of 2024-10-08
dataset = './data/nyc_food_scrap_drop_off_sites_updated.csv'
custom_data = pd.read_csv(dataset)

In [464]:
# Get summary statistics on the data
custom_data.describe()

Unnamed: 0,BoroCD,CouncilDis,ct2010,BBL,BIN,Latitude,Longitude,PolicePrec,Object ID,Assembly District,Congress District,Senate District
count,591.0,591.0,591.0,10.0,10.0,591.0,591.0,591.0,591.0,591.0,591.0,591.0
mean,227.377327,19.314721,2239067.0,2819645000.0,2831421.0,40.744548,-73.946515,54.529611,27500.0,61.560068,10.756345,30.064298
std,120.395062,15.018038,1218548.0,1231969000.0,1213051.0,0.064998,0.054846,37.176109,170.751281,14.803436,2.653081,12.335807
min,101.0,1.0,1000201.0,1000030000.0,1000000.0,40.578254,-74.161931,1.0,27205.0,25.0,3.0,10.0
25%,107.0,5.0,1015552.0,2282348000.0,2327232.0,40.690421,-73.979825,19.0,27352.5,53.0,9.0,21.0
50%,207.0,15.0,2026602.0,3024980000.0,3000000.0,40.742724,-73.946308,48.0,27500.0,66.0,11.0,28.0
75%,308.0,35.0,3041400.0,3040173000.0,3069012.0,40.789329,-73.918279,83.0,27647.5,73.0,13.0,32.5
max,502.0,51.0,5027702.0,5009550000.0,5000000.0,40.903544,-73.721002,122.0,27795.0,87.0,15.0,59.0


In [465]:
# Look at the first few rows of data
custom_data.head()

Unnamed: 0,Borough,NTAName,SiteName,SiteAddr,Hosted_By,Open_Month,Day_Hours,Notes,Website,BoroCD,CouncilDis,ct2010,BBL,BIN,Latitude,Longitude,PolicePrec,Object ID,Location Point,App Android,App iOS,Assembly District,Congress District,DSNY District,DSNY Section,DSNY Zone,Senate District
0,Brooklyn,Bay Ridge,4th Avenue Presbyterian Church,"6753 4th Avenue, Brooklyn, NY 11220",4th Avenue Presbyterian Church,Year Round,Every day (Start Time: Dawn - End Time: Dusk),"No meat, bones, or dairy.",,310,47,3012600,,,40.635514,-74.022767,68,27717,POINT (-74.022767 40.635514),,,51,10,BKS10,BKS101,BKS,17
1,Manhattan,East Midtown-Turtle Bay,Dag Hammarskjold Plaza Greenmarket,E 47th St & 2nd Ave,GrowNYC,Year Round,Wednesday (Start Time: 8:00 AM - End Time: 12:30 PM),,grownyc.org/compost,106,4,1009000,,,40.752606,-73.969036,17,27577,POINT (-73.969036 40.752606),,,74,12,MN06,MN063,MN,28
2,Manhattan,Hell's Kitchen,Hudson River Park's Pier 84 at W. 44th St.,Pier 84 at W. 44th St. near dog park,Staff at Hudson River Park,Year Round,Every day (Start Time: 7:00 AM - End Time: 7:00 PM),,https://hudsonriverpark.org/the-park/sustainability/community-compost-program/,104,3,1012901,,,40.76346,-74.00025,18,27545,POINT (-74.00025 40.76346),,,67,12,MN04,MN043,MN,47
3,Manhattan,East Midtown-Turtle Bay,58th Street Library FSDO,127 East 58th Street,GrowNYC,Year Round,Wednesdays (Start Time: 7:30 AM - End Time: 1:30 PM),,grownyc.org/compost,105,4,1011203,,,40.76198,-73.9693,18,27538,POINT (-73.9693 40.76198),,,73,12,MN05,MN052,MN,28
4,Manhattan,Tribeca-Civic Center,Tribeca Greenmarket,Greenwich St. & Duane St,GrowNYC,Year Round,Saturday (Start Time: 8:00 AM - End Time: 1:00 PM),,grownyc.org/compost,101,1,1003900,,,40.717424,-74.010793,1,27450,POINT (-74.010793 40.717424),,,66,10,MN01,MN013,MN,27


In [466]:
# Examine the length of the data
len(custom_data)

591

In [467]:
# Group by the 'Day_Hours' column and count occurrences
custom_data.groupby('Day_Hours').size().reset_index(name='counts')

Unnamed: 0,Day_Hours,counts
0,24/7,390
1,24/7 (Start Time: 24/7 - End Time: 24/7),21
2,"Alternating Tuesdays: 7/11, 7/25, 8/8, 8/22, 9/5, 9/19 (Start Time: 11:00 AM - End Time: 3:00 PM)",1
3,Every 3rd Saturday of the month (Start Time: 10:00 AM - End Time: 12:00 PM),1
4,Every Day (Start Time: 10:00 AM - End Time: 6:00 PM),1
...,...,...
141,"Wednesday, Thursday, Saturday (Start Time: 10:00 AM - End Time: Wednesday 1:00 PM, Thursday 1:00 PM, Saturday 2:00 PM)",1
142,Wednesdays (Start Time: 7:30 AM - End Time: 1:30 PM),1
143,Wednesdays (Start Time: 8:30 AM - End Time: 12:30 PM),1
144,Wednesdays (Start Time: 9:30 AM - End Time: 1:30 PM),1


In [468]:
# Function to separate days and hours
def split_day_hours(day_hours):
    # Check for "24/7" special cases, including "24/7 (Start Time: 24/7 - End Time: 24/7)"
    if day_hours.strip().startswith('24/7'):
        return 'Every day', '24 hours'

    # Check for single day patterns
    match = re.match(r"(.+?) \(Start Time: (.+?) - End Time: (.+?)\)", day_hours)
    if match:
        days = match.group(1).strip()
        start_time = match.group(2).strip()
        end_time = match.group(3).strip()
        hours = f"{start_time} - {end_time}"
        return days, hours

    # Handle multiple days and times
    days_hours_pattern = re.compile(r"(?P<day>\w+day) from (?P<start_time>[\d:]+\s?(?:AM|PM|am|pm)?) - (?P<end_time>[\d:]+\s?(?:AM|PM|am|pm)?)")
    matches = days_hours_pattern.findall(day_hours)

    days = []
    hours = []
    for match in matches:
        day, start_time, end_time = match
        days.append(day)
        hours.append(f"{start_time} - {end_time}")

    return '; '.join(days), '; '.join(hours)

# Apply the function to create new columns
custom_data[['Days_Open', 'Hours_Open']] = custom_data['Day_Hours'].apply(lambda x: pd.Series(split_day_hours(x)))
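
To make the parsing step concrete, here is a small standalone sketch that applies the same "Start Time / End Time" pattern to three `Day_Hours` values taken from the grouped output above. It isolates the single-pattern branch only and is not a replacement for `split_day_hours`.

```python
import re

# Same pattern as the single-pattern branch of split_day_hours:
# "<days> (Start Time: <start> - End Time: <end>)"
PATTERN = re.compile(r"(.+?) \(Start Time: (.+?) - End Time: (.+?)\)")

samples = [
    "Wednesdays (Start Time: 7:30 AM - End Time: 1:30 PM)",
    "Every 3rd Saturday of the month (Start Time: 10:00 AM - End Time: 12:00 PM)",
    "24/7 (Start Time: 24/7 - End Time: 24/7)",
]

parsed = []
for sample in samples:
    match = PATTERN.match(sample)
    days, start, end = (group.strip() for group in match.groups())
    parsed.append((days, f"{start} - {end}"))
    print(parsed[-1])
```

The third sample is the parenthesised 24/7 variant. The pattern by itself reports it as a literal `24/7` day with hours `24/7 - 24/7`, so that variant has to be caught before the pattern is tried if it should read as `Every day`.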

In [469]:
# Group by the 'Open_Month' column and count occurrences
custom_data.groupby('Open_Month').size().reset_index(name='counts')

Unnamed: 0,Open_Month,counts
0,April - August,1
1,April - November,2
2,April - October,4
3,April-November,1
4,Closed during Winter,1
5,End of March - January 30,1
6,July - December,1
7,July - November,1
8,July - September,1
9,July 14th-November 24th,1


In [470]:
# Function to clean up 'Open_Month' values
def clean_open_month(value):
    # Lowercase for consistency
    value = value.lower()
    
    # Standardize separators (hyphen, en dash, em dash)
    value = re.sub(r'\s*[-\u2013\u2014]\s*', ' - ', value)

    # Remove specific dates and keep only months
    value = re.sub(r'\b\d{1,2}(st|nd|rd|th)?\b', '', value)
    value = re.sub(r'\s+', ' ', value).strip()

    # Handle common terms
    if 'year round' in value:
        return 'Year Round'
    if 'seasonal' in value or 'spring - fall' in value:
        return 'Seasonal'
    if 'closed during winter' in value:
        return 'Closed during Winter'

    # Standardize month names
    month_mappings = {
        'january': 'January', 'february': 'February', 'march': 'March', 'april': 'April',
        'may': 'May', 'june': 'June', 'july': 'July', 'august': 'August', 'september': 'September',
        'october': 'October', 'november': 'November', 'december': 'December'
    }

    # Replace month names with standardized names
    for month in month_mappings:
        value = re.sub(month, month_mappings[month], value)

    return value

# Apply the cleaning function
custom_data['Open_Month'] = custom_data['Open_Month'].apply(clean_open_month)
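
The effect of the cleaning steps can be checked on a few `Open_Month` values from the grouped output above. The function below is a simplified stand-in (the name `normalize_range` is illustrative) that keeps only the separator, day-number, and month-capitalisation steps and leaves out the special cases such as "Year Round".

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def normalize_range(value):
    """Simplified stand-in for clean_open_month (no special-case handling)."""
    value = value.lower()
    # Put exactly one space on each side of a hyphen, en dash, or em dash
    value = re.sub(r"\s*[-\u2013\u2014]\s*", " - ", value)
    # Drop day numbers such as "14th" so only the months remain
    value = re.sub(r"\b\d{1,2}(st|nd|rd|th)?\b", "", value)
    value = re.sub(r"\s+", " ", value).strip()
    # Restore the capitalisation of month names
    for month in MONTHS:
        value = re.sub(rf"\b{month.lower()}\b", month, value)
    return value


for raw in ("April-November", "April - November", "July 14th-November 24th"):
    print(f"{raw!r} -> {normalize_range(raw)!r}")
```

With this normalisation, "April-November" and "April - November" fall into the same group, and the dated range reduces to its months.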

In [471]:
# Define which columns will be used in the text - only selecting the relevant columns
relevant_columns = ["SiteName","Hosted_By","Borough","NTAName","SiteAddr","Open_Month","Days_Open","Hours_Open"]

In [472]:
# Filter to only the relevant columns
# .copy() avoids pandas' SettingWithCopyWarning when columns are added or renamed later
custom_data = custom_data[relevant_columns].copy()

In [473]:
# Rename columns
custom_data = custom_data.rename(columns={'SiteAddr': 'Address'})

In [474]:
# Define a function to concatenate row values with column names and add context - also factors in missing values
def concatenate_row_with_colnames(row):
    pieces = ["NYC Food Scrap Site:"]
    for col in row.index:
        if pd.notna(row[col]):
            col_name = col.replace('_', ' ')
            pieces.append(f"{col_name}: {row[col]}")
    return ' '.join(pieces)

# Apply the function to each row to create the "text" column
custom_data['text'] = custom_data.apply(concatenate_row_with_colnames, axis=1)

In [475]:
# Convert the 'text' column to lowercase
custom_data['text'] = custom_data['text'].str.lower()
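
To show what one entry of the `"text"` column looks like, the sketch below applies the same labelling and lowercasing logic to a plain dictionary instead of a dataframe row. The values come from the Tribeca Greenmarket row shown earlier. The `Notes` key is included only to demonstrate that missing values are skipped, and `row_to_text` is an illustrative name.

```python
import math


def row_to_text(row):
    """Join 'Column Name: value' pairs, skipping missing values (illustrative helper)."""
    pieces = ["NYC Food Scrap Site:"]
    for col, value in row.items():
        # Treat None and NaN as missing, similar to pd.notna in the cell above
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        pieces.append(f"{col.replace('_', ' ')}: {value}")
    return " ".join(pieces).lower()


row = {
    "SiteName": "Tribeca Greenmarket",
    "Hosted_By": "GrowNYC",
    "Borough": "Manhattan",
    "Address": "Greenwich St. & Duane St",
    "Open_Month": "Year Round",
    "Notes": float("nan"),  # missing value, so it is left out of the text
}

text = row_to_text(row)
print(text)
```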

## Custom Query Completion

In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [476]:
CONTEXT = """
Answer the question based on the context below. If it can't be answered using the context, say "I don't know".

Context: 

{}

---

"""

PROMPT_TEMPLATE = """{}
Question: {}

Provide your answer in the following form:

Place
- Hosted By
- Borough
- NTAName
- Address
- Days Open
- Hours Open

Answer:"""

In [477]:
# Note: openai.embeddings_utils is only available in openai<1.0
from openai.embeddings_utils import get_embedding, distances_from_embeddings
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

In [478]:
BATCH_SIZE = 100

def get_embeddings(data, model_name, batch_size):
    """
    Generates embeddings for the 'text' column in the provided dataframe using the specified model and batch size.
    
    Parameters:
    data (pandas.DataFrame): The dataframe containing the text data.
    model_name (str): The name of the model to use for generating embeddings.
    batch_size (int): The number of rows to process in each batch.
    
    Returns:
    list: A list of embeddings corresponding to the text data.
    """
    
    
    embeddings = []
    for i in range(0, len(data), batch_size):
        try:
            # Send text data to OpenAI model to get embeddings
            response = openai.Embedding.create(
                input=data.iloc[i:i+batch_size]["text"].tolist(),
                engine=model_name
            )

            # Add embeddings to list
            embeddings.extend([data_point["embedding"] for data_point in response["data"]])
        except openai.error.OpenAIError as e:
            print(f"OpenAI API error: {e}")
            break
        except Exception as e:
            print(f"Unexpected error: {e}")
            break

    return embeddings

# Get embeddings for the custom data
embeddings = get_embeddings(custom_data, EMBEDDING_MODEL_NAME, BATCH_SIZE)

# Attach the embeddings and save them only if there is exactly one per row
if len(embeddings) == len(custom_data):
    custom_data["embeddings"] = embeddings
    custom_data.to_json("custom_embeddings.json", orient="records", lines=True)
    print("Embeddings have been saved to custom_embeddings.json")
else:
    print("Error: The number of embeddings does not match the number of rows in the DataFrame.")

Embeddings have been saved to custom_embeddings.json
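
The batching in `get_embeddings` can be illustrated without calling the API. The sketch below (with an illustrative helper name, `batch_ranges`) lists the row slices that a batch size of 100 produces for the 591 rows in this dataset.

```python
def batch_ranges(n_rows, batch_size):
    """Return (start, stop) row slices covering n_rows in steps of batch_size."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


ranges = batch_ranges(591, 100)
print(len(ranges))   # number of API requests
print(ranges[-1])    # the final, shorter batch
```

Six requests cover the dataset, and the last one carries the remaining 91 rows.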


In [479]:
def get_relevant_rows(question, df):
    """
    Sorts the rows of the dataframe based on their relevance to the given question.
    
    Parameters:
    question (str): The question string for which relevance is being determined.
    df (pandas.DataFrame): The dataframe containing rows of text and their associated embeddings.
    
    Returns:
    pandas.DataFrame: A dataframe sorted from most to least relevant rows based on the question.
    """
    try:

        # Get embeddings for the question text
        question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

        # Make a copy of the dataframe and add a "distances" column containing
        # the cosine distances between each row's embeddings and the
        # embeddings of the question
        df_copy = df.copy()
        df_copy["distances"] = distances_from_embeddings(
            question_embeddings,
            df_copy["embeddings"].values,
            distance_metric="cosine"
        )

        # Sort the copied dataframe by the distances and return it
        df_copy.sort_values("distances", ascending=True, inplace=True)
        return df_copy
    except Exception as e:
        print(f"Error calculating relevance: {e}")
        return df

In [480]:
def create_prompt(question, df, max_token_count, ask_with_context=True):
    """
    Creates a text prompt for a Completion model based on the given question and dataframe.
    
    Parameters:
    question (str): The question to be answered.
    df (pandas.DataFrame): The dataframe containing rows of text and their embeddings.
    max_token_count (int): The maximum number of tokens allowed in the prompt.
    ask_with_context (bool): If True, includes context in the prompt. If False, only the question is used.
    
    Returns:
    str: A formatted text prompt ready to be sent to the Completion model.
    """

    if not ask_with_context:
        return PROMPT_TEMPLATE.format('', question)

    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the tokens in both templates and the question
    current_token_count = len(tokenizer.encode(PROMPT_TEMPLATE.format('', ''))) + \
                          len(tokenizer.encode(CONTEXT.format(''))) + \
                          len(tokenizer.encode(question))

    context = []
    for text in get_relevant_rows(question, df)["text"].values:
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    context_string = CONTEXT.format("\n\n###\n\n".join(context))    
    return PROMPT_TEMPLATE.format(context_string, question)
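
The context-selection loop in `create_prompt` is a greedy fill under a token budget: rows are taken in relevance order until the next one would exceed the limit. The sketch below shows that idea in isolation. It uses a whitespace word count as a hypothetical stand-in for `tiktoken`, and the row texts are made up.

```python
def pack_context(rows, budget, count_tokens):
    """Take rows in order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for text in rows:
        used += count_tokens(text)
        if used > budget:
            break
        selected.append(text)
    return selected


def whitespace_tokens(text):
    # Hypothetical stand-in for a real tokenizer such as tiktoken
    return len(text.split())


# Made-up rows, already sorted from most to least relevant
rows = [
    "site a open every day",
    "site b open wednesday",
    "site c open saturday only",
]

packed = pack_context(rows, budget=9, count_tokens=whitespace_tokens)
print(packed)
```

The first two rows use exactly nine "tokens", so the third is left out. Because the loop stops at the first row that does not fit, a shorter row further down the ranking is never considered.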

In [481]:
def answer_question(question, df, max_prompt_tokens=500, max_answer_tokens=500, ask_with_context=True):
    """
    Answers the given question using a Completion model based on the provided dataframe and token limits.
    
    Parameters:
    question (str): The question to be answered.
    df (pandas.DataFrame): The dataframe containing rows of text and their embeddings.
    max_prompt_tokens (int): The maximum number of tokens allowed in the prompt.
    max_answer_tokens (int): The maximum number of tokens allowed in the response.
    ask_with_context (bool): If True, includes context in the prompt. If False, only the question is used.
    
    Returns:
    str: The answer to the question generated by the Completion model. If an error occurs, returns an empty string.
    """
    
    # Convert question to lowercase before creating the prompt
    question = question.lower()

    prompt = create_prompt(question, df, max_prompt_tokens, ask_with_context)
    # For debugging purposes
    #print(prompt)
    
    try:
        response = openai.Completion.create(
            model=MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except openai.error.OpenAIError as e:
        print(f"OpenAI API error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return ""

## Custom Performance Demonstration

In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [482]:
embeddings = pd.read_json("custom_embeddings.json", lines = True)

Each question will be asked first with context and subsequently without context.

### Question 1

In [483]:
question_1 = "What are the NYC food scrap sites that are in East Elmhurst?"

In [491]:
print(answer_question(question_1, embeddings, ask_with_context = True))

East Elmhurst Community School
- East Elmhurst Community School
- Queens
- East Elmhurst
- 26-25 97th St, Queens, NY, 11369
- Wednesday
- 1:00 pm - 3:00 pm

McIntosh Neighborhood Assoc.
- N/A
- Queens
- East Elmhurst
- 25-16 McIntosh St, East Elmhurst, NY, 11369
- Every 3rd Saturday of the month
- 10:00 am - 12:00 pm


In [485]:
print(answer_question(question_1, embeddings, ask_with_context=False))

Place: JFK Composting Drop-Off Food Scrap Drop-Off
Hosted By: NYC Department of Sanitation
Borough: Queens
NTAName: East Elmhurst
Address: Under the AirTrain JFK People Mover, on 94th Avenue between 165th and 166th Streets, East Elmhurst, NY 11369
Days Open: Mondays, Thursdays, and Fridays
Hours Open: 8am to 4pm


In [486]:
# Fact check
custom_data[(custom_data['NTAName'] == 'East Elmhurst')][['SiteName','Days_Open','Hours_Open']]

Unnamed: 0,SiteName,Days_Open,Hours_Open
65,McIntosh Neighborhood Assoc.,Every 3rd Saturday of the month,10:00 AM - 12:00 PM
474,East Elmhurst Community School,Wednesday,1:00 PM - 3:00 PM


### Question 2

In [487]:
question_2 = "What are all of the NYC food scrap sites that are hosted by Volunteers at St. James Compost?"

In [492]:
print(answer_question(question_2, embeddings, ask_with_context = True))

- St. James Compost
- Volunteers at St. James Compost
- Queens
- Elmhurst
- 86-02 Broadway, Elmhurst, NY 11373
- Every day
- 9:00 am - 6:00 pm


In [489]:
print(answer_question(question_2, embeddings, ask_with_context=False))

Place: McCarren Park Greenmarket
- Hosted By: St. James Compost
- Borough: Brooklyn
- NTAName: North Side-South Side
- Address: Union Ave and Driggs Ave, Brooklyn, NY 11249
- Days Open: Saturdays
- Hours Open: 8am-3pm

Place: Fort Greene Greenmarket
- Hosted By: St. James Compost
- Borough: Brooklyn
- NTAName: Clinton Hill
- Address: 175 Lafayette Ave, Brooklyn, NY 11238
- Days Open: Saturdays
- Hours Open: 8am-3pm

Place: Grand Army Plaza Greenmarket
- Hosted By: St. James Compost
- Borough: Brooklyn
- NTAName: Park Slope-Gowanus
- Address: Flatbush Ave and Prospect Park West, Brooklyn, NY 11238
- Days Open: Saturdays
- Hours Open: 8am-3pm

Place: Columbia Greenmarket
- Hosted By: St. James Compost
- Borough: Manhattan
- NTAName: Morningside Heights
- Address: Broadway and W 114th St, New York, NY 10027
- Days Open: Thursdays and Sundays
- Hours Open: 8am-3pm on Thursdays, 8am-2pm on Sundays

Place: 23rd Street Greenmarket
- Hosted By: St. James Compost
- Borough: Manhattan
- NTAName:

In [490]:
# Fact check
custom_data[(custom_data['Hosted_By'] == 'Volunteers at St. James Compost')][['Hosted_By','NTAName','SiteName','Borough','Days_Open','Hours_Open']]

Unnamed: 0,Hosted_By,NTAName,SiteName,Borough,Days_Open,Hours_Open
276,Volunteers at St. James Compost,Elmhurst,St. James Compost,Queens,Every day,9:00 AM - 6:00 PM


#### Analysis

Adding the context produced the correct results. For question 1, the model with context returned both food scrap sites in East Elmhurst, matching the fact check. Without context, it returned "JFK Composting Drop-Off Food Scrap Drop-Off", a site that does not appear in the dataset. Likewise, for question 2, the model with context returned the one correct site, St. James Compost, whereas without context it listed five sites (the last one cut off), all of which are incorrect.
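
The manual fact check could also be expressed as a small comparison between the site names in an answer and those in the dataset. The sketch below is only an illustration with a hypothetical helper, `sites_not_in_dataset`. It uses the question 1 names shown above, typed in by hand rather than parsed from the model output.

```python
def sites_not_in_dataset(answer_sites, dataset_sites):
    """Return the answer site names that have no case-insensitive match in the dataset."""
    known = {name.strip().lower() for name in dataset_sites}
    return [name for name in answer_sites if name.strip().lower() not in known]


# Site names from the question 1 fact check and answers, entered by hand
dataset_sites = ["McIntosh Neighborhood Assoc.", "East Elmhurst Community School"]
with_context = ["East Elmhurst Community School", "McIntosh Neighborhood Assoc."]
without_context = ["JFK Composting Drop-Off Food Scrap Drop-Off"]

print(sites_not_in_dataset(with_context, dataset_sites))
print(sites_not_in_dataset(without_context, dataset_sites))
```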