# Data Aggregation
As a first step in the Places to Go Demo, we will need static venue data to create recommendations from. In production, our venue sources will be managed by a web scraper bot that crawls social media and updates the list based on activity. For now, we will populate a static set of 2,500 locations, sourced from 5 cities.

The five cities that have been requested by the client for the demo are:
- **New York**
- **Scottsdale**
- **Miami**
- **Los Angeles**
- **Chicago**

We will use the Yelp API to gather the top 500 rated locations in each city. We will then feed the `name` and `categories` fields of each response to the AI model, which will associate each venue with a list of keywords.

We will need to take the following steps to achieve our task:
1. Gather JSON objects for the top 500 locations in each city
2. Extract an exhaustive list of all categories from the 2,500 locations
3. Provide the category list to ChatGPT and prompt it to create a list of 20 keywords for each archetype
4. Design a prompt for associating businesses with keywords based on the `name` and `categories` fields
5. Run the list of 2,500 businesses through the prompt and store the results in a JSON file

In [6]:
import os
from dotenv import load_dotenv
load_dotenv("../.env")

YELP_API_KEY = os.getenv("YELP_API_KEY")

## 1. Gather JSON Data of Locations
We want to start by using the `/businesses/search` endpoint of the Yelp Fusion API to gather the top 500 rated locations in each of our 5 cities. We will store these responses directly in JSON files to retrieve for future steps.

In [23]:
import requests
from tqdm import tqdm

# List of cities to search
CITIES = ["New%20York%20City", "Scottsdale", "Miami", "Los%20Angeles", "Chicago"]
CITY_CODES = ["NYC", "SCOTTSDALE", "MIAMI", "LA", "CHICAGO"]

CITY_TO_CODE = dict(zip(CITIES, CITY_CODES))

# Yelp Fusion API URL
API_URL = "https://api.yelp.com/v3"
BUSINESS_SEARCH_ENDPOINT = "/businesses/search"

# Search Params For API Request
LIMIT = 50
SORT_BY = "rating"

# Authorization
HEADERS = {
    "Authorization": "Bearer " + YELP_API_KEY,
}

def request_city_data(city: str):
    """Request data from Yelp API for a given city"""
    base_url = f"{API_URL}{BUSINESS_SEARCH_ENDPOINT}?location={city}&limit={LIMIT}&sort_by={SORT_BY}"
    url = base_url + "&offset={}"
    offset = 0
    data = []
    for _ in range(10):  # 10 pages of LIMIT results = 500 venues per city
        results = requests.get(url.format(offset), headers=HEADERS).json()

        # Tag each business with the city it was requested for
        for result in results["businesses"]:
            result["city"] = city

        offset += LIMIT
        data.extend(results["businesses"])
    return data


def extract_location_data():
    """Extract location data from Yelp API"""
    data = []
    for city in tqdm(CITIES):
        try:
            print("Requesting data for", city)
            city_results = request_city_data(city)
            data.extend(city_results)
            print(f"Received {len(city_results)} results for {city}. Total: {len(data)}")
        except Exception as e:
            print("Failed to request data for", city)
            print(e)
    return data
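
The city names above are stored already percent-encoded (`New%20York%20City`), so the encoded form also ends up in the stored `city` field and in the log output. A minimal sketch of an alternative, assuming plain city names and the standard library's `urlencode`; `build_search_url` and `page_offsets` are hypothetical helpers, not part of the notebook:

```python
from urllib.parse import quote, urlencode

API_URL = "https://api.yelp.com/v3"
BUSINESS_SEARCH_ENDPOINT = "/businesses/search"


def build_search_url(city: str, offset: int, limit: int = 50, sort_by: str = "rating") -> str:
    """Build a search URL from a plain (unencoded) city name."""
    query = urlencode(
        {"location": city, "limit": limit, "sort_by": sort_by, "offset": offset},
        quote_via=quote,  # encode spaces as %20, matching the hand-written URLs above
    )
    return f"{API_URL}{BUSINESS_SEARCH_ENDPOINT}?{query}"


def page_offsets(total: int, limit: int = 50) -> list:
    """Offsets needed to page through `total` results, `limit` at a time."""
    return list(range(0, total, limit))


urls = [build_search_url("New York City", offset) for offset in page_offsets(500)]
```

With this approach the human-readable name can be kept in each record while the encoding happens only at request time.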

In [21]:
data = extract_location_data()

  0%|          | 0/5 [00:00<?, ?it/s]

Requesting data for New%20York%20City


 20%|██        | 1/5 [00:06<00:25,  6.48s/it]

Received 500 results for New%20York%20City. Total: 500
Requesting data for Scottsdale


 40%|████      | 2/5 [00:13<00:20,  6.71s/it]

Received 500 results for Scottsdale. Total: 1000
Requesting data for Miami


 60%|██████    | 3/5 [00:22<00:15,  7.79s/it]

Received 500 results for Miami. Total: 1500
Requesting data for Los%20Angeles


 80%|████████  | 4/5 [00:30<00:08,  8.08s/it]

Received 500 results for Los%20Angeles. Total: 2000
Requesting data for Chicago


100%|██████████| 5/5 [00:37<00:00,  7.57s/it]

Received 500 results for Chicago. Total: 2500





In [24]:
import json

with open("../data/raw_location_data.json", "w", encoding="utf-8") as f:
    # Attach the short city code to each location before saving
    for location in data:
        location['city_code'] = CITY_TO_CODE[location['city']]
    json.dump(data, f, ensure_ascii=False, indent=4)

## 2. Extract Exhaustive Category List from Locations
We now want to get an exhaustive list of all the categories present in our 2,500 locations. To do this, we will extract the `categories` field from each location and append its values to a list. Once all values have been appended, we will convert the list to a set to remove duplicates.

We will store the category list in a JSON file for future use.

In [25]:
categories = []
for location in data:
    loc_categories = [category['alias'] for category in location['categories']]
    categories.extend(loc_categories)
categories = list(set(categories))


with open("../data/raw_categories.json", "w", encoding="utf-8") as f:
    json.dump(categories, f, ensure_ascii=False, indent=4)

## 3. Keyword List Generation
We have used ChatGPT to convert our categories into keywords to use for classification. The keywords can be found in `../data/keywords.json`.
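
The later cells read this file as a mapping from category name to a list of hyphenated keywords, and convert each keyword to a TitleCase token without spaces. A minimal sketch of that conversion follows; the sample categories and keywords are hypothetical stand-ins, since the real file contents are not shown here:

```python
def normalize_keyword(slug: str) -> str:
    """Turn a hyphenated keyword such as 'coffee-culture' into 'CoffeeCulture'."""
    return slug.replace("-", " ").title().replace(" ", "")


def flatten_keywords(keywords_by_category: dict) -> list:
    """Normalize every keyword and collect them into one flat list."""
    flat = []
    for words in keywords_by_category.values():
        flat.extend(normalize_keyword(word) for word in words)
    return flat


# Hypothetical sample in the assumed shape of keywords.json
sample = {
    "food": ["coffee-culture", "sweet-treats"],
    "nightlife": ["happy-hours"],
}
keyword_list = flatten_keywords(sample)
```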

## 4. Prompt Engineering
Now that we have our keywords set, we need to do some prompt engineering to create a GPT-3.5-Turbo prompt which associates a venue with a set of our keywords. To do this, there are a few considerations we must make:
- Prompt must provide the list of keywords to the model
- Model must accurately associate keywords with venues according to product needs
- Want to process as many venues in one prompt as possible

To provide the LLM with the list of keywords, we will simply include them in the system prompt. For token efficiency, we will fit as many venues as possible into every prompt, which limits the number of times we have to send the system prompt.

To get accurate results without fine-tuning, we will take a few-shot approach: we will use ChatGPT to label ~20 locations, then include these as an example in each prompt we send.

Finally, we should use as much of the context window as we can in each prompt. We have a 16k-token context window to work with, and we can use the examples to determine the optimal number of locations per prompt.
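
The batch-size reasoning above can be sketched as simple arithmetic. This is a rough estimate under stated assumptions: the ~4 characters per token figure is a rule of thumb rather than a real tokenizer, and every number in the example call is hypothetical, not measured from this dataset.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 characters per token is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def max_venues_per_prompt(
    context_window: int,
    fixed_tokens: int,
    input_tokens_per_venue: int,
    output_tokens_per_venue: int,
) -> int:
    """Largest batch whose input and expected output both fit in the context window.

    `fixed_tokens` covers everything sent with every request:
    the system prompt plus the few-shot example exchange.
    """
    available = context_window - fixed_tokens
    per_venue = input_tokens_per_venue + output_tokens_per_venue
    return max(available // per_venue, 0)


# Hypothetical figures: 16k window, 4k of fixed prompt, 60 tokens in and 40 out per venue
batch_size = max_venues_per_prompt(16_000, 4_000, 60, 40)
```

In practice it is safer to stay well below the computed ceiling, since the model's output length per venue varies.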

In [26]:
trimmed_data = []
for datum in data:
    trimmed_datum = {
        "id": datum['id'],
        "name": datum["name"],
        "rating": datum["rating"],
        "city": datum['city_code'],
        "categories": [category["alias"] for category in datum["categories"]],
    }
    trimmed_data.append(trimmed_datum)

with open("../data/trimmed_location_data.json", "w", encoding="utf-8") as f:
    json.dump(trimmed_data, f, ensure_ascii=False, indent=4)

In [63]:
with open("../data/keywords.json", "r", encoding="utf-8") as f:
    keywords = json.load(f)
    keyword_list = []
    for cat, keyword in keywords.items():
        words = [word.replace("-", " ") for word in keyword]
        words = [word.title().replace(" ", "") for word in words]
        keywords[cat] = words
        keyword_list.extend(words)

60

In [64]:
JSON_SCHEMA = {
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "venues": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": {
            "type": "string"
          },
          "keywords": {
            "type": "array",
            "items": {
              "type": "string",
              "enum": ["red", "green", "blue"]
            }
          }
        },
        "required": ["id", "keywords"]
      }
    }
  },
  "required": ["venues"]
}

JSON_SCHEMA['properties']['venues']['items']['properties']['keywords']['items']['enum'] = keyword_list


In [72]:
SYSTEM_TEMPLATE = """
# Instructions for Extracting Keywords for Venue Analysis
**Objective**: You will perform keyword extraction for a list of venues. Each venue will be
represented by a unique ID, its name, its city, a rating, and a list of categories. Your task is to associate 
each venue with relevant keywords based on its name and categories.

## Materials Provided:

- A JSON schema which defines the schema of your response object. This will be provided to you.
    - IMPORTANT: Every keyword you extract should come from the enum list detailed in this schema.
- A JSON list of venues, each containing an ID, name, rating, and a list of categories.

## Response Formatting:
You provide your response as a JSON object which strictly adheres to the following schema:
{json_schema}
"""

# Format the system template with the JSON schema and append the instructions for the task
SYSTEM_PROMPT = SYSTEM_TEMPLATE.format(json_schema=json.dumps(JSON_SCHEMA)) + """
## Steps

1. Understand the Keywords:
Familiarize yourself with the keywords. This will help you understand the themes and ideas each keyword represents.

2. Review the List of Venues:

You will receive a JSON list of venues. Each venue will have an id, name, rating, and a list of categories.
For example: 
```
[
    {
        "id": "unique-id-1",
        "name": "Venue Name 1",
        "rating": 4.5,
        "city": "NYC",
        "categories": ["category1", "category2"]
    },
    {
        "id": "unique-id-2",
        "name": "Venue Name 2",
        "rating": 3.5,
        "city": "LA",
        "categories": ["category3", "category4"]
    }
    // more venues...
]
```

3. Perform Keyword Extraction:

For each venue, go through its list of categories. Match each category with the relevant keywords 
from the provided JSON object. Consider the themes and ideas the venue's categories represent and select 
appropriate keywords from the enum values. It's important to use your judgment 
to best match the venue's categories to the keywords that represent its essence.

4. Create the Output JSON Object:

For each venue, create a JSON object containing its id and a list of extracted keywords.
The output should be an array of these objects.
Example format:
```
[
    {
        "id": "unique-id-1",
        "keywords": ["keyword1", "keyword2"]
    },
    {
        "id": "unique-id-2",
        "keywords": ["keyword3", "keyword4"]
    }
    // more venues with keywords...
]

# 'keyword1', 'keyword2', 'keyword3', 'keyword4' are keywords from the provided keywords enum.
```

Note: It's essential to interpret the categories creatively and contextually. The goal is to capture the 
essence of each venue in the keywords selected.

**IMPORTANT**: You are a keyword extraction expert. This task REQUIRES that every keyword you assign to a venue comes
from the discrete list of keywords provided. A venue can have multiple keywords, however, they all must be keywords
from the provided list of sluggified keywords. This is incredibly important for the constraints of the project.
"""

HUMAN_TEMPLATE = """
{venue_data}

Please extract the keywords for each of the provided venues. Remember, it is imperative that you only use keywords
from the provided list of keywords in the JSON schema. You can, and in many cases should, assign multiple keywords to a venue.

Begin!
"""

In [73]:
with open("../data/keywords.json", "r", encoding="utf-8") as f:
    keywords = json.load(f)
    raw_keywords = []
    for cat, keyword in keywords.items():
        words = [word.replace("-", " ") for word in keyword]
        words = [word.title().replace(" ", "") for word in words]
        raw_keywords.extend(words)
    keyword_set = set(raw_keywords)
    

with open("../data/example.json", 'r', encoding='utf-8') as f:
    example = json.load(f)

def check_outputs(outputs):
    """Raise ValueError if any venue uses a keyword outside the allowed list."""
    output_keywords = set()
    for output in outputs['venues']:
        output_keywords.update(output['keywords'])

    # Ensure that the keywords are all contained in the raw keywords
    unknown = output_keywords - keyword_set
    if unknown:
        raise ValueError(f"output_keywords is not a subset of raw_keywords: {sorted(unknown)}")

check_outputs(example['output'])


In [74]:
def load_messages():
    with open("../data/example.json", 'r', encoding='utf-8') as f:
        example = json.load(f)
    example_input = example['input']
    example_output = example['output']

    messages = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': HUMAN_TEMPLATE.format(venue_data=json.dumps(example_input))},
        {'role': 'assistant', 'content': json.dumps(example_output)},
    ]
    return messages

def format_messages(start: int, stop: int):
    input_values = trimmed_data[start:stop]
    messages = load_messages()
    messages.append({
        'role': 'user',
        'content': HUMAN_TEMPLATE.format(venue_data=json.dumps(input_values))
    })
    return messages

In [80]:
from openai import OpenAI



client = OpenAI()
def extract_keywords(start: int, stop: int, token_count: int):
    """Request keywords for trimmed_data[start:stop], retrying if invalid keywords come back."""
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=format_messages(start, stop),
        response_format={"type": "json_object"}
    )
    data = json.loads(completion.choices[0].message.content)

    # Count tokens for every attempt, including ones that fail validation
    token_count += completion.usage.total_tokens

    try:
        # Ensure that all the outputs are valid keywords
        check_outputs(data)
        return data['venues'], token_count

    except ValueError:
        # If there are invalid keywords, try again and return the retried result
        print(f"Invalid keywords at indices [{start}, {stop}]. Trying again...")
        return extract_keywords(start, stop, token_count)
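
The retry in `extract_keywords` has no upper bound, so a batch that keeps producing invalid keywords would keep issuing paid requests. A minimal sketch of a bounded retry, assuming the request and the validation are passed in as callables; `call_with_retries`, `fetch`, and `validate` are hypothetical names, not part of the notebook:

```python
def call_with_retries(fetch, validate, max_attempts: int = 3):
    """Call `fetch` until `validate` accepts the result or attempts run out.

    `validate` is expected to raise ValueError on a bad result.
    Returns the accepted result and the number of attempts used.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        try:
            validate(result)
            return result, attempt
        except ValueError as error:
            last_error = error
    raise RuntimeError(f"No valid result after {max_attempts} attempts") from last_error


# Toy stand-ins: the first response uses an unknown keyword, the second is valid
allowed = {"CoffeeCulture", "Nightlife"}
responses = iter([
    {"venues": [{"id": "a", "keywords": ["NotAKeyword"]}]},
    {"venues": [{"id": "a", "keywords": ["Nightlife"]}]},
])


def validate(response):
    for venue in response["venues"]:
        if not set(venue["keywords"]) <= allowed:
            raise ValueError("unknown keyword")


result, attempts = call_with_retries(lambda: next(responses), validate)
```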



In [None]:
BATCH_SIZE = 30
DATA_SIZE = len(trimmed_data)

extracted_keywords = []
token_count = 0

for start in tqdm(range(0, DATA_SIZE, BATCH_SIZE)):
    # Process BATCH_SIZE venues per prompt; the final batch may be smaller
    end = min(start + BATCH_SIZE, DATA_SIZE)

    # Generate the completion for trimmed_data[start:end]
    result, token_count = extract_keywords(start, end, token_count)
    extracted_keywords.extend(result)
    print(f"Extracted keywords for {len(result)} venues. Total Processed: {len(extracted_keywords)} ({token_count} Total Tokens)")


In [118]:
# with open("../data/venue_keywords.json", "w", encoding="utf-8") as f:
#     json.dump(extracted_keywords, f, ensure_ascii=False, indent=4)

In [90]:
with open("../data/venue_keywords.json", "r", encoding="utf-8") as f:
    extracted_keywords = json.load(f)

with open("../data/keywords.json", "r", encoding="utf-8") as f:
    keywords = json.load(f)
    raw_keywords = []
    for cat, keyword in keywords.items():
        words = [word.replace("-", " ") for word in keyword]
        raw_keywords.extend(words)

In [101]:
from typing import List

vector_data = []
# First, we need to embed our raw list of keywords
def embed_keyword(keyword: str):
    """Embed a single keyword"""    
    response = client.embeddings.create(
        input=keyword,
        model='text-embedding-ada-002'
    )
    return response.data[0].embedding

def embed_terms(terms: List[str]):
    response = client.embeddings.create(
        input=terms,
        model='text-embedding-ada-002'
    )
    return [datum.embedding for datum in response.data]


In [102]:
for keyword in raw_keywords:
    vector_data.append({
        "keyword": keyword,
        "vector": embed_keyword(keyword)
    })

In [105]:
import numpy as np
import pandas as pd

for vector in vector_data:
    vector['vector'] = np.array(vector['vector'])

df = pd.DataFrame(vector_data)

In [109]:

# Now, we want to create an embedding for each venue, by embedding the keywords
terms = []
for venue_keywords in extracted_keywords:
    term = ", ".join([keyword.replace("-", " ").title() for keyword in venue_keywords['keywords']])
    terms.append(term)

embeddings = [*embed_terms(terms[:1000]), *embed_terms(terms[1000:2000]), *embed_terms(terms[2000:])]
len(embeddings)

2458

In [115]:
normalized_keywords = []
for i, embed in enumerate(embeddings):
    embed = np.array(embed)
    vector_store = df.copy()
    vector_store['similarity'] = df['vector'].apply(lambda x: np.dot(x, embed))
    vector_store = vector_store.sort_values(by='similarity', ascending=False)
    results = vector_store.keyword.values[:3].tolist()

    normalized_keywords.append({
        'id': extracted_keywords[i]['id'],
        'keywords': results,
    })
normalized_keywords

[{'id': 'DZDc1dCf8Xa-e3X76vYTJQ',
  'keywords': ['Baking Passion', 'Sweet Treats', 'Culinary Adventure']},
 {'id': 'C6ohbrxuiGBuk7_a4oqy4Q',
  'keywords': ['Cocktail Culture', 'Nightlife', 'Happy Hours']},
 {'id': 'lonFxLWHRU-XiqneeIyhUQ',
  'keywords': ['Coffee Culture', 'Culinary Adventure', 'Street Food']},
 {'id': 't3JSO9KxofgrIb1sIYF4hw',
  'keywords': ['Culinary Adventure', 'Coffee Culture', 'Gourmet']},
 {'id': '0DxTgpLfIvnqJT8mfMqVcQ',
  'keywords': ['Outdoor Thrills', 'Fitness Challenge', 'Adventure Sports']},
 {'id': 'IfpSfI3PXnVuhepkjTfUqA',
  'keywords': ['Culinary Adventure', 'Gourmet', 'Gastronomy']},
 {'id': '9M4Rdn_iKt4F0cxla5aatg',
  'keywords': ['Sweet Treats', 'Seafood Delights', 'Exotic Flavors']},
 {'id': '3Mf7d2n8XJnyH94I4vUoYw',
  'keywords': ['Health Conscious Eating', 'Vegan Choices', 'Coffee Culture']},
 {'id': 'RZ-HIZ9oj7tKQ4i4qsohWQ',
  'keywords': ['Coffee Culture', 'Cocktail Culture', 'Gourmet']},
 {'id': 'QVc7uzk07IaBAcgXs_l_fg',
  'keywords': ['Culinary 

In [116]:
with open("../data/normalized_keywords.json", "w", encoding="utf-8") as f:
    json.dump(normalized_keywords, f, ensure_ascii=False, indent=4)
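
The normalization step above ranks every keyword embedding by its dot product with the venue embedding and keeps the top three. The sketch below shows the same lookup on toy two-dimensional vectors, written with plain lists so it runs without numpy; `top_k_keywords` and the vectors are hypothetical. Note that the dot product matches cosine similarity only when the vectors are unit length.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_keywords(venue_vector, keyword_vectors: dict, k: int = 3) -> list:
    """Return the k keywords whose vectors score highest against the venue vector."""
    ranked = sorted(
        keyword_vectors,
        key=lambda keyword: dot(keyword_vectors[keyword], venue_vector),
        reverse=True,
    )
    return ranked[:k]


# Toy unit-length vectors standing in for real embeddings
keyword_vectors = {
    "Coffee Culture": [1.0, 0.0],
    "Nightlife": [0.0, 1.0],
    "Gourmet": [0.6, 0.8],
}
closest = top_k_keywords([0.8, 0.6], keyword_vectors, k=2)
```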