# Custom Chatbot Project

I chose the "nyc_food_scrap_drop_off_sites.csv" dataset because it is a little more challenging than the other two, which are rather small, and because it makes for a more realistic example. To stay within the prompt token limit, I use an embedding model (starting with all-MiniLM-L6-v2) so that only the rows most relevant to a query are added to its context.

## Data Wrangling

In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

First, we decide which fields to keep for the embedding and which to drop. The table below analyzes each column.

| **Header**              | **Meaning**                                                                                   | **Recommendation**             | **Reason**                                                                                       |
|--------------------------|-----------------------------------------------------------------------------------------------|---------------------------------|--------------------------------------------------------------------------------------------------|
| **Borough**              | The borough where the site is located (e.g., Manhattan, Queens, Bronx).                      | **Keep**                        | Essential for geographic filtering and answering borough-specific queries.                      |
| **NTAName**              | Neighborhood Tabulation Area Name, representing specific neighborhoods within the borough.    | **Keep**                        | Useful for more granular geographic filtering and responding to neighborhood-specific queries.   |
| **SiteName**             | The name of the compost site (e.g., "Queensbridge Compost Site").                             | **Keep**                        | Adds detail to responses and helps identify specific sites.                                      |
| **SiteAddr**             | The address of the compost site.                                                             | **Keep**                        | Critical for providing exact locations to users.                                                |
| **Hosted_By**            | Organization or entity managing the site (e.g., Department of Sanitation).                   | **Optional**                    | Useful if users care about who operates the site but not always necessary for basic queries.    |
| **Open_Month**           | Months the site is operational (e.g., "Year Round").                                         | **Keep**                        | Important for determining site availability based on time of year.                              |
| **Day_Hours**            | Days and hours of operation (e.g., "24/7").                                                  | **Keep**                        | Critical for answering questions about when the site is open.                                   |
| **Notes**                | Additional instructions or restrictions (e.g., "Accepts all food scraps, including meat").   | **Keep**                        | Useful for clarifying special conditions or restrictions.                                       |
| **Website**              | URL for more information about the site or program.                                          | **Optional**                    | Include if users might need external resources for further details.                             |
| **BoroCD**               | Borough Community District code, a numeric representation of the borough and community.      | **Exclude**                     | Likely unnecessary for general user queries unless hyper-specific geographic filtering is needed. |
| **CouncilDis**           | City Council District where the site is located.                                             | **Exclude**                     | Unlikely to be relevant for typical user queries.                                               |
| **ct2010**               | Census tract code for 2010, representing a specific geographic area.                         | **Exclude**                     | Too technical and irrelevant for most users.                                                    |
| **BBL**                  | Borough-Block-Lot number, representing a parcel of land.                                     | **Exclude**                     | Useful for internal geospatial analysis but not for general chatbot queries.                    |
| **BIN**                  | Building Identification Number, representing a building or site.                             | **Exclude**                     | Internal use; not useful for answering user queries.                                            |
| **Latitude**             | Geographical latitude of the site.                                                           | **Optional**                    | Could be used for mapping or geospatial features but not necessary for textual queries.         |
| **Longitude**            | Geographical longitude of the site.                                                         | **Optional**                    | Same as latitude.                                                                                |
| **PolicePrec**           | Police precinct serving the site’s location.                                                 | **Exclude**                     | Irrelevant for compost-related queries.                                                         |
| **Object ID**            | Unique identifier for the dataset entry.                                                     | **Exclude**                     | Internal metadata; not relevant for user-facing features.                                       |
| **Location Point**       | Geographic coordinates in a specific format (e.g., "POINT (longitude latitude)").            | **Optional**                    | Same as latitude and longitude; redundant unless formatting is required for a specific purpose. |
| **App Android**          | URL to download an Android app associated with the site.                                     | **Optional**                    | Include only if users might need app-related information.                                       |
| **App iOS**              | URL to download an iOS app associated with the site.                                         | **Optional**                    | Same as Android app URL.                                                                        |
| **Assembly District**    | State Assembly District covering the site.                                                   | **Exclude**                     | Unlikely to be relevant for general queries.                                                    |
| **Congress District**    | U.S. Congressional District covering the site.                                               | **Exclude**                     | Same as above; not relevant to the task.                                                        |
| **DSNY District**        | Department of Sanitation District code.                                                      | **Exclude**                     | Too technical for most queries.                                                                 |
| **DSNY Section**         | Subsection within a DSNY District.                                                           | **Exclude**                     | Same as above.                                                                                  |
| **DSNY Zone**            | Operational zone within a DSNY District.                                                     | **Exclude**                     | Same as above.                                                                                  |
| **Senate District**      | State Senate District covering the site.                                                     | **Exclude**                     | Unlikely to be relevant.                                                                        |


Now prepare the dataset: prune the columns that are not needed, preprocess the text, and create the combined text field used for the embedding.

In [2]:
import pandas as pd

# 1. Load the dataset into a pandas DataFrame
file_path = 'data/nyc_food_scrap_drop_off_sites.csv'
df = pd.read_csv(file_path)

# 2. Prune columns that are not to be kept (I left out the optional columns)
columns_to_keep = [
    'Borough', 'NTAName', 'SiteName', 'SiteAddr', 'Open_Month',
    'Day_Hours', 'Notes'
]
# Take an explicit copy so that the assignments below modify an independent
# DataFrame instead of a slice of df (this avoids pandas' SettingWithCopyWarning)
df_pruned = df[columns_to_keep].copy()

# 3. Additional Preprocessing
# a) Fill missing values (if any) with placeholders or appropriate defaults
# For critical fields, use specific placeholders; for Notes, use an empty string
df_pruned['Borough'] = df_pruned['Borough'].fillna('Unknown Borough')
df_pruned['NTAName'] = df_pruned['NTAName'].fillna('Unknown Neighborhood')
df_pruned['SiteName'] = df_pruned['SiteName'].fillna('Unknown Site')
df_pruned['SiteAddr'] = df_pruned['SiteAddr'].fillna('Address Not Provided')
df_pruned['Open_Month'] = df_pruned['Open_Month'].fillna('Availability Unknown')
df_pruned['Day_Hours'] = df_pruned['Day_Hours'].fillna('Hours Not Provided')
df_pruned['Notes'] = df_pruned['Notes'].fillna('')

# b) Strip extra whitespace from string fields
for column in columns_to_keep:
    df_pruned[column] = df_pruned[column].str.strip()

# c) Create a combined text column for embeddings
# This concatenates the relevant fields into a single descriptive string
df_pruned['Text'] = df_pruned.apply(
    lambda row: f"{row['Borough']}, {row['NTAName']}: {row['SiteName']} located at {row['SiteAddr']}. "
                f"Open: {row['Open_Month']} during {row['Day_Hours']}. Notes: {row['Notes']}",
    axis=1
)

# Save the preprocessed dataset for later use
output_path = 'data/nyc_food_scrap_sites_pruned.csv'
df_pruned.to_csv(output_path, index=False)

# Display a preview of the pruned and preprocessed dataset
df_pruned.head()


|   | Borough | NTAName | SiteName | SiteAddr | Open_Month | Day_Hours | Notes | Text |
|---|---------|---------|----------|----------|------------|-----------|-------|------|
| 0 | Brooklyn | Bay Ridge | 4th Avenue Presbyterian Church | 6753 4th Avenue, Brooklyn, NY 11220 | Year Round | Every day (Start Time: Dawn - End Time: Dusk) | No meat, bones, or dairy. | Brooklyn, Bay Ridge: 4th Avenue Presbyterian C... |
| 1 | Manhattan | East Midtown-Turtle Bay | Dag Hammarskjold Plaza Greenmarket | E 47th St & 2nd Ave | Year Round | Wednesday (Start Time: 8:00 AM - End Time: 12... |  | Manhattan, East Midtown-Turtle Bay: Dag Hammar... |
| 2 | Manhattan | Hell's Kitchen | Hudson River Park's Pier 84 at W. 44th St. | Pier 84 at W. 44th St. near dog park | Year Round | Every day (Start Time: 7:00 AM - End Time: 7:... |  | Manhattan, Hell's Kitchen: Hudson River Park's... |
| 3 | Manhattan | East Midtown-Turtle Bay | 58th Street Library FSDO | 127 East 58th Street | Year Round | Wednesdays (Start Time: 7:30 AM - End Time: 1... |  | Manhattan, East Midtown-Turtle Bay: 58th Stree... |
| 4 | Manhattan | Tribeca-Civic Center | Tribeca Greenmarket | Greenwich St. & Duane St | Year Round | Saturday (Start Time: 8:00 AM - End Time: 1:0... |  | Manhattan, Tribeca-Civic Center: Tribeca Green... |
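
The preview above shows that many rows have an empty `Notes` field, so the text template leaves a dangling `Notes: ` at the end of those strings. The following is a minimal sketch of a variant that appends the notes only when they exist. The helper name `build_text` is hypothetical, and the row is passed as a plain dict purely for illustration:

```python
def build_text(row):
    """Join the kept columns into one sentence-like string for embedding."""
    text = (
        f"{row['Borough']}, {row['NTAName']}: {row['SiteName']} "
        f"located at {row['SiteAddr']}. "
        f"Open: {row['Open_Month']} during {row['Day_Hours']}."
    )
    # Only mention notes when there is something to say
    notes = (row.get("Notes") or "").strip()
    if notes:
        text += f" Notes: {notes}"
    return text


example_row = {
    "Borough": "Manhattan",
    "NTAName": "Tribeca-Civic Center",
    "SiteName": "Tribeca Greenmarket",
    "SiteAddr": "Greenwich St. & Duane St",
    "Open_Month": "Year Round",
    "Day_Hours": "Saturday",  # shortened for the example
    "Notes": "",
}
print(build_text(example_row))
```

Assuming the missing values have already been filled as in the cell above, the same function could be applied with `df_pruned.apply(build_text, axis=1)`.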


Let's use a sentence transformer to create the embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the preprocessed dataset
file_path = 'data/nyc_food_scrap_sites_pruned.csv'
df_pruned = pd.read_csv(file_path)

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for the Text column
df_pruned['Text'] = df_pruned['Text'].fillna('')  # Ensure no null values
descriptions = df_pruned['Text'].tolist()
embeddings = model.encode(descriptions, show_progress_bar=True)

# Save the embeddings to a file for later use
np.save('data/food_scrap_embeddings.npy', embeddings)

# Optionally, add the embeddings back to the DataFrame (for debugging or combined use)
df_pruned['Embedding'] = list(embeddings)

# Save the DataFrame with embeddings (optional)
df_pruned.to_csv('data/nyc_food_scrap_sites_with_embeddings.csv', index=False)

print("Embeddings generated and saved successfully!")


Batches: 100%|██████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 33.00it/s]


Embeddings generated and saved successfully!
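
The saved embeddings are used for retrieval by ranking each row against the query vector with cosine similarity. To make that step concrete, here is a small dependency-free sketch. The function names and the toy two-dimensional vectors are made up for illustration (real all-MiniLM-L6-v2 vectors have 384 dimensions), and the notebook itself relies on `sentence_transformers.util.cos_sim` rather than this code:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec, row_vecs, k):
    """Indices of the k rows most similar to the query, best match first."""
    scores = [cosine_similarity(query_vec, v) for v in row_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy "embeddings": row 0 points almost the same way as the query
toy_rows = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query = [0.9, 0.1]
print(top_k_indices(toy_query, toy_rows, k=2))  # [0, 2]
```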


## Custom Query Completion

In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [3]:
from openai import OpenAI
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load precomputed embeddings and dataset
embeddings = np.load('data/food_scrap_embeddings.npy')
dataset = pd.read_csv('data/nyc_food_scrap_sites_pruned.csv')

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def process_query(user_query, top_k=5):
    # Encode the user's query
    query_embedding = model.encode(user_query, show_progress_bar=False)
    
    # Compute cosine similarity between query and dataset embeddings
    cosine_scores = util.cos_sim(query_embedding, embeddings)
    
    # Get the top-k most relevant rows
    top_indices = np.argsort(-cosine_scores.numpy().flatten())[:top_k]
    relevant_contexts = dataset.iloc[top_indices]['Text'].tolist()
    
    return relevant_contexts

def create_messages(user_query, relevant_contexts=()):  # immutable default instead of a shared mutable list
    # System message to set the assistant's behavior
    system_message = {
        "role": "system",
        "content": "You are a helpful assistant providing information on NYC food scrap drop-off sites."
    }
    
    # User message with the query
    user_message = {
        "role": "user",
        "content": user_query
    }
    
    # Combine relevant contexts into a single string
    if len(relevant_contexts) > 0:
        context = "\n".join(relevant_contexts)
    
        # Assistant message with the context
        assistant_message = {
            "role": "assistant",
            "content": f"Here is some information that might help:\n{context}"
        }
    else:
        # Assistant message when no relevant contexts are found
        assistant_message = {
            "role": "assistant",
            "content": "If you don't know the answer please say it politely."
        }
    
    # Return the list of messages
    return [system_message, assistant_message, user_message]

def get_openai_completion(client, messages, model="gpt-3.5-turbo", max_tokens=150, temperature=0.7):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )
    return response.model_dump_json(indent=2)  # full response serialized as a JSON string



In [None]:
# Define the user's query
user_query = "Where can I drop off food scraps in Queens?"

# Process the query to get relevant contexts
top_k = 5  # Number of relevant results to retrieve
relevant_contexts = process_query(user_query, top_k=top_k)

#print(relevant_contexts)

# Create messages for the Chat Completions API
messages = create_messages(user_query, relevant_contexts)

# Set your OpenAI API key
voc_key = "<openai-api-key>"

client = OpenAI(
    #base_url = "https://openai.vocareum.com/v1",
    api_key = voc_key
)

# Call the OpenAI Chat Completions API
response = get_openai_completion(client, messages)

# Print the response
print("\nOpenAI Response:")
print(response)

['Queens, Flushing-Willets Point: Queens Botanical Garden Public Food Scrap Drop-Off located at Parking Garden at QBG 42-80 Crommelin Street 42-80 Crommelin Avenue, Flushing, NY 11355. Open: Year Round during Every day (Start Time: 8:00 AM - End Time:  5:00 PM). Notes: ', 'Manhattan, Washington Heights (North): 181st Street Food Scrap Drop-off located at W 181st St & Fort Washington Ave, New York, NY 10033. Open: Year Round during Thursday (Start Time: 8:00 AM - End Time:  12:30 PM). Notes: ', 'Brooklyn, Dyker Heights: Lief Ericson Park Food Scrap Drop-off Site located at Fort Hamilton Pkwy b/t 66 & 67 Sts. Open: Year Round during Wednesdays (Start Time: 8:30 AM - End Time:  12:30 PM). Notes: ', 'Brooklyn, Crown Heights (North): Crown Heights Franklin Ave Food Scrap Drop-off located at Franklin Avenue & Eastern Parkway. Open: Year Round during Thursday (Start Time: 8:30 AM - End Time:  11:30 AM). Notes: ', 'Queens, Kew Gardens: Kew Gardens Food Scrap Drop-off located at Metropolitan Av
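
The `create_messages` function above passes the retrieved rows to the model as an `assistant` message. One alternative worth trying is to place them in the system prompt, so the model treats them as reference material rather than as something it said earlier, and to cap how much context is added so the prompt stays small. The sketch below assumes the contexts arrive ordered best match first. `build_messages` and `max_context_chars` are hypothetical names, and the character budget is only a crude stand-in for real token counting:

```python
def build_messages(user_query, contexts=(), max_context_chars=2000):
    """Build chat messages with retrieved context embedded in the system prompt."""
    system_text = (
        "You are a helpful assistant providing information on "
        "NYC food scrap drop-off sites."
    )

    # Keep whole context rows until the character budget would be exceeded
    kept, used = [], 0
    for ctx in contexts:
        if used + len(ctx) > max_context_chars:
            break
        kept.append(ctx)
        used += len(ctx)

    if kept:
        system_text += (
            " Answer using only the site information below. "
            "If it does not contain the answer, say so politely.\n\n"
            + "\n".join(f"- {c}" for c in kept)
        )
    else:
        system_text += " If you don't know the answer, say so politely."

    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_query},
    ]
```

The returned list has the same shape as the output of `create_messages`, so it could be passed to `get_openai_completion` unchanged.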

## Custom Performance Demonstration

In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
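
The cells in this section repeat the same steps for every question, once with and once without retrieved context. A compact way to express that comparison is sketched below. `compare_answers` is a hypothetical helper, and the three `fake_*` functions are stand-ins so the sketch runs without network access; in this notebook their roles would be played by `process_query`, `create_messages`, and a small wrapper around `get_openai_completion` with the client already bound:

```python
def compare_answers(user_query, retrieve, build, complete, top_k=3):
    """Return (answer_without_context, answer_with_context) for one query."""
    baseline = complete(build(user_query, []))
    contexts = retrieve(user_query, top_k)
    augmented = complete(build(user_query, contexts))
    return baseline, augmented


# Stand-ins for illustration only; they do not call any model
def fake_retrieve(query, top_k):
    return ["Queens, Astoria (Central): example site"][:top_k]


def fake_build(query, contexts):
    return {"query": query, "contexts": list(contexts)}


def fake_complete(messages):
    return f"{len(messages['contexts'])} context rows used"


print(compare_answers("Where in Queens?", fake_retrieve, fake_build, fake_complete))
```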

### Question 1

In [6]:
# Define the user's query
user_query = "What are the hours for food scrap drop-off sites in Manhattan?"

# Process the query to get relevant contexts
top_k = 3  # Number of relevant results to retrieve
relevant_contexts = process_query(user_query, top_k=top_k)

# Create messages for the Chat Completions API
messages = create_messages(user_query, relevant_contexts)

# Set your OpenAI API key
voc_key = "<openai-api-key>"  # placeholder; do not commit a real key

client = OpenAI(
    base_url = "https://openai.vocareum.com/v1",
    api_key = voc_key
)

# Call the OpenAI Chat Completions API
response = get_openai_completion(client, messages)

# Print the response
print("\nOpenAI Response:")
print(response)


OpenAI Response:
{
  "id": "chatcmpl-Ab2HydNvHyAfOJLvBRR1U1VQMPY6k",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Here are some food scrap drop-off sites in Manhattan with their hours of operation:\n1. Washington Heights (North): 181st Street Food Scrap Drop-off located at W 181st St & Fort Washington Ave. Open on Thursdays year-round from 8:00 AM to 12:30 PM.\n2. Upper East Side-Carnegie Hill: East 96th Street Food Scrap Drop-off located at 96th St & Lexington Ave. Open on Fridays year-round from 7:30 AM to 11:30 AM.\n\nPlease note that the hours of operation may be subject to change, so it's always a good idea to check with the specific drop-off site or the NYC Department of Sanitation for the most up-to-date",
        "refusal": null,
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1733390298,
  "mode
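
The response above reports `"finish_reason": "length"`: the answer was cut off by the `max_tokens=150` limit set in `get_openai_completion`, which is why the text stops mid-sentence. Because that function returns the whole response as a JSON string, the answer text and the truncation flag can be pulled out with the standard `json` module. The helper name `summarize_response` is hypothetical, and the sample below is a hand-written miniature of the response structure, not real API output:

```python
import json


def summarize_response(response_json):
    """Extract the answer text and flag answers cut off by max_tokens."""
    data = json.loads(response_json)
    choice = data["choices"][0]
    return {
        "content": choice["message"]["content"],
        "truncated": choice["finish_reason"] == "length",
    }


sample = json.dumps({
    "choices": [
        {
            "finish_reason": "length",
            "message": {"role": "assistant", "content": "Here are some sites"},
        }
    ]
})
print(summarize_response(sample))
```

When `truncated` is true, raising `max_tokens` or asking the model for a shorter answer are the obvious options to try.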

And here is the same question without the custom context:

In [7]:
# Define the user's query
user_query = "What are the hours for food scrap drop-off sites in Manhattan?"

# Skip retrieval: this baseline sends the query without any custom context

# Create messages for the Chat Completions API
messages = create_messages(user_query)

# Set your OpenAI API key
voc_key = "<openai-api-key>"  # placeholder; do not commit a real key

client = OpenAI(
    base_url = "https://openai.vocareum.com/v1",
    api_key = voc_key
)

# Call the OpenAI Chat Completions API
response = get_openai_completion(client, messages)

# Print the response
print("\nOpenAI Response:")
print(response)


OpenAI Response:
{
  "id": "chatcmpl-Ab2IMyyjMVZOMgnispYijvxulD9fz",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "I'm not sure about the specific hours for food scrap drop-off sites in Manhattan. I recommend checking the NYC Department of Sanitation website or contacting them directly for the most up-to-date information on hours of operation for food scrap drop-off sites in Manhattan.",
        "refusal": null,
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1733390322,
  "model": "gpt-3.5-turbo-0125",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 51,
    "prompt_tokens": 55,
    "total_tokens": 106,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
    

### Question 2

In [8]:
# Define the user's query
user_query = "Which sites in Queens accept food scraps, including meat and dairy?"

# Process the query to get relevant contexts
top_k = 3  # Number of relevant results to retrieve
relevant_contexts = process_query(user_query, top_k=top_k)

# Create messages for the Chat Completions API
messages = create_messages(user_query, relevant_contexts)

# Set your OpenAI API key
voc_key = "<openai-api-key>"  # placeholder; do not commit a real key

client = OpenAI(
    base_url = "https://openai.vocareum.com/v1",
    api_key = voc_key
)

# Call the OpenAI Chat Completions API
response = get_openai_completion(client, messages)

# Print the response
print("\nOpenAI Response:")
print(response)


OpenAI Response:
{
  "id": "chatcmpl-Ab2Ifv0IV70JtQIgQRQRtUpItVXTG",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "In Queens, the following food scrap drop-off sites accept all food scraps, including meat and dairy:\n\n1. Queens Botanical Garden Public Food Scrap Drop-Off located at Parking Garden at QBG\n   Address: 42-80 Crommelin Street, Flushing, NY 11355\n   Open: Year Round, Every day (8:00 AM - 5:00 PM)\n   Notes: Accepts all food scraps, including meat and dairy\n\n2. Astoria (Central): SE 31st Avenue & Crescent Street\n   Address: Not Provided\n   Open: Year Round, 24/7\n   Notes: Download the app to access bins. Accepts all food scraps, including meat and dairy. Do not",
        "refusal": null,
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1733390341,
  "model": "gpt-3.5-turbo-0125",
  "objec

And here is the same question without any context provided:

In [41]:
# Define the user's query
user_query = "Which sites in Queens accept food scraps, including meat and dairy?"

# Skip retrieval: this baseline sends the query without any custom context

# Create messages for the Chat Completions API
messages = create_messages(user_query)

# Set your OpenAI API key
voc_key = "<openai-api-key>"  # placeholder; do not commit a real key

client = OpenAI(
    base_url = "https://openai.vocareum.com/v1",
    api_key = voc_key
)

# Call the OpenAI Chat Completions API
response = get_openai_completion(client, messages)

# Print the response
print("\nOpenAI Response:")
print(response)


OpenAI Response:
{
  "id": "chatcmpl-Aaoodo1Yx9vaVcsSAVD4mjQiamszt",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "I'm not sure about specific sites in Queens that accept food scraps including meat and dairy. Would you like me to look up more information for you?",
        "refusal": null,
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1733338507,
  "model": "gpt-3.5-turbo-0125",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 55,
    "total_tokens": 85,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    

This was only a basic demo. The quality of the results depends strongly on the embedding used to select the context for the prompt and, of course, on the language model.