# Data Aggregation
As a first step in the Places to Go Demo, we need static venue data to create recommendations from. In production, our venue sources will be managed by a web scraper bot that crawls social media and updates the list based on activity. For now, we will populate a static set of 2,500 locations, 500 from each of 5 cities.

The five cities that have been requested by the client for the demo are:
- **New York**
- **Scottsdale**
- **Miami**
- **Los Angeles**
- **Chicago**

We will use the Yelp API to gather the 500 top-rated locations in each city. We will then feed the `name` and `categories` fields of each response to the AI model, which will associate each venue with a list of keywords.

We will need to take the following steps to achieve our task:
1. Gather JSON objects for the top 500 locations in each city.
2. Extract an exhaustive list of all categories from the 2,500 locations.
3. Provide the category list to ChatGPT and prompt it to create a list of 20 keywords for each archetype.
4. Design a prompt for associating businesses with keywords based on the `name` and `categories` fields.
5. Run the list of 2,500 businesses through the model and store the results in a JSON file.

In [5]:
import os
from dotenv import load_dotenv
load_dotenv("../.env")

YELP_API_KEY = os.getenv("YELP_API_KEY")

## 1. Gather JSON Data of Locations
We start by using the `/businesses/search` endpoint of the Yelp Fusion API to gather the 500 top-rated locations in each of our 5 cities. The endpoint returns at most 50 results per request, so we page through the results with the `offset` parameter. We will store these responses directly in JSON files to retrieve in later steps.

In [29]:
import requests
from tqdm import tqdm

# List of cities to search (already percent-encoded for use in the query string)
CITIES = ["New%20York%20City", "Scottsdale", "Miami", "Los%20Angeles", "Chicago"]

# Yelp Fusion API URL
API_URL = "https://api.yelp.com/v3"
BUSINESS_SEARCH_ENDPOINT = "/businesses/search"

# Search Params For API Request
LIMIT = 50
SORT_BY = "rating"

# Authorization
HEADERS = {
    "Authorization": "Bearer " + YELP_API_KEY,
}

def request_city_data(city: str):
    """Request data from Yelp API for a given city"""
    base_url = f"{API_URL}{BUSINESS_SEARCH_ENDPOINT}?location={city}&limit={LIMIT}&sort_by={SORT_BY}"
    url = base_url + "&offset={}"
    offset = 0
    data = []
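    for _ in range(10):  # 10 pages x 50 results = 500 locations per city
        response = requests.get(url.format(offset), headers=HEADERS, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors (e.g. bad key, rate limit)
        results = response.json()
        offset += LIMIT
        data.extend(results["businesses"])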
    return data


def extract_location_data():
    """Extract location data from Yelp API"""
    data = []
    for city in tqdm(CITIES):
        try:
            print("Requesting data for", city)
            city_results = request_city_data(city)
            data.extend(city_results)
            print(f"Received {len(city_results)} results for {city}. Total: {len(data)}")
        except Exception as exc:  # report the failure and continue with the remaining cities
            print("Failed to request data for", city, "-", exc)
    return data
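
The pagination used above can also be expressed with explicit query parameters instead of a hand-built URL. The following is a minimal sketch, not the notebook's method: `build_search_urls` is a hypothetical helper, and it relies on the standard library's `urlencode` so that city names can be written with plain spaces rather than pre-encoded. It only builds the URLs, so it runs without an API key.

```python
from urllib.parse import urlencode

API_URL = "https://api.yelp.com/v3"
BUSINESS_SEARCH_ENDPOINT = "/businesses/search"


def build_search_urls(city: str, total: int = 500, limit: int = 50,
                      sort_by: str = "rating") -> list[str]:
    """Build one search URL per page of `limit` results (hypothetical helper)."""
    urls = []
    for offset in range(0, total, limit):
        params = {
            "location": city,   # urlencode escapes the spaces for us
            "limit": limit,
            "sort_by": sort_by,
            "offset": offset,
        }
        urls.append(f"{API_URL}{BUSINESS_SEARCH_ENDPOINT}?{urlencode(params)}")
    return urls


urls = build_search_urls("New York City")
print(len(urls))
print(urls[-1])
```

Each URL could then be passed to `requests.get` with the same `HEADERS` as above. Keeping the city names unencoded also makes the progress messages easier to read.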

In [30]:
data = extract_location_data()

  0%|          | 0/5 [00:00<?, ?it/s]

Requesting data for New%20York%20City


 20%|██        | 1/5 [00:06<00:26,  6.53s/it]

Received 500 results for New%20York%20City. Total: 500
Requesting data for Scottsdale


 40%|████      | 2/5 [00:13<00:20,  6.90s/it]

Received 500 results for Scottsdale. Total: 1000
Requesting data for Miami


 60%|██████    | 3/5 [00:21<00:14,  7.37s/it]

Received 500 results for Miami. Total: 1500
Requesting data for Los%20Angeles


 80%|████████  | 4/5 [00:29<00:07,  7.61s/it]

Received 500 results for Los%20Angeles. Total: 2000
Requesting data for Chicago


100%|██████████| 5/5 [00:36<00:00,  7.36s/it]

Received 500 results for Chicago. Total: 2500





In [33]:
import json

with open("../data/raw_location_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

## 2. Extract Exhaustive Category List from Locations
We now want an exhaustive list of all the categories present in our 2,500 locations. To do this, we extract the `categories` field from each location and append the category aliases to a list. Once all values have been appended, we convert the list to a set to remove duplicates.

We will store the category list in a JSON file for future use.

In [41]:
categories = []
for location in data:
    loc_categories = [category['alias'] for category in location['categories']]
    categories.extend(loc_categories)
categories = list(set(categories))


with open("../data/raw_categories.json", "w", encoding="utf-8") as f:
    json.dump(categories, f, ensure_ascii=False, indent=4)
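
One caveat with `list(set(...))` is that the order of the resulting list can differ between runs, so the saved JSON file may change even when the data does not. As a minimal sketch of an alternative, the hypothetical helper `unique_category_aliases` below collects the aliases with a set comprehension and sorts them. The two sample records are made up and only mimic the shape of the `categories` field.

```python
import json

# Two made-up records shaped like Yelp business objects (illustrative only)
sample_locations = [
    {"name": "Venue A", "categories": [{"alias": "pizza", "title": "Pizza"},
                                       {"alias": "bars", "title": "Bars"}]},
    {"name": "Venue B", "categories": [{"alias": "bars", "title": "Bars"},
                                       {"alias": "hiking", "title": "Hiking"}]},
]


def unique_category_aliases(locations: list[dict]) -> list[str]:
    """Return the de-duplicated category aliases in alphabetical order."""
    aliases = {
        category["alias"]
        for location in locations
        for category in location.get("categories", [])
    }
    return sorted(aliases)


sample_categories = unique_category_aliases(sample_locations)
print(json.dumps(sample_categories))
```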

## 3. Keyword List Generation
We have used ChatGPT to convert our categories into keywords to use for classification. The keywords can be found in `../data/keywords.json`.
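
Because the keywords come from a chat session, a quick sanity check before using them may be worthwhile. The sketch below assumes the file maps each archetype to a list of slugified keywords, which is the shape embedded in the system prompt later in this notebook. `check_keywords` is a hypothetical helper, and the tiny sample dictionary stands in for the real file contents.

```python
import re

# Lowercase words or digits joined by single hyphens, e.g. "farm-to-table"
SLUG_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def check_keywords(keywords: dict[str, list[str]],
                   expected_per_archetype: int = 20) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for archetype, words in keywords.items():
        if len(words) != expected_per_archetype:
            problems.append(
                f"{archetype}: expected {expected_per_archetype} keywords, found {len(words)}"
            )
        if len(set(words)) != len(words):
            problems.append(f"{archetype}: contains duplicate keywords")
        for word in words:
            if not SLUG_PATTERN.fullmatch(word):
                problems.append(f"{archetype}: '{word}' is not slugified")
    return problems


# Tiny illustrative input; the real data would come from json.load on the keywords file
sample_keywords = {
    "foodie": ["gourmet", "farm-to-table"],
    "party": ["nightlife", "Live Music"],
}
problems = check_keywords(sample_keywords, expected_per_archetype=2)
print(problems)
```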

## 4. Prompt Engineering
Now that we have our keywords set, we need to do some prompt engineering to create a GPT-3.5-Turbo prompt that associates a venue with a set of our keywords. There are a few considerations to keep in mind:
- The prompt must provide the list of keywords to the model.
- The model must accurately associate keywords with venues according to product needs.
- Each prompt should process as many venues as possible.

To provide the LLM with the list of keywords, we will simply include them in the system prompt. For token efficiency, we will fit as many venues as possible into every prompt, which limits the number of times we have to send the system prompt.

To get accurate results without fine-tuning, we will take a few-shot approach: we will use ChatGPT to label ~20 locations and then include these as an example in each prompt we send.

Finally, we should use as much of the context window as possible in each prompt. We have a 16k-token context window to work with, and it has to hold the system prompt, the example, the venue batch, and the model's response. We can use the examples to determine the optimal number of locations per prompt.
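
The batch-size reasoning comes down to simple arithmetic once per-venue token costs have been measured from the examples. The sketch below is a rough estimate only: `venues_per_prompt` is a hypothetical helper, every number in the usage lines is a placeholder rather than a measurement, and the optional output cap is an assumption about the model's response limit.

```python
from typing import Optional


def venues_per_prompt(context_window: int, fixed_prompt_tokens: int,
                      input_tokens_per_venue: int, output_tokens_per_venue: int,
                      max_output_tokens: Optional[int] = None) -> int:
    """Estimate how many venues fit in one request.

    fixed_prompt_tokens covers the system prompt plus the few-shot example,
    which are sent with every request regardless of batch size.
    """
    available = context_window - fixed_prompt_tokens
    if available <= 0:
        return 0
    # Each venue costs tokens twice: once in the input and once in the reply
    by_context = available // (input_tokens_per_venue + output_tokens_per_venue)
    if max_output_tokens is None:
        return by_context
    # A separate cap on the reply length can be the tighter constraint
    by_output = max_output_tokens // output_tokens_per_venue
    return min(by_context, by_output)


# Placeholder figures, not measurements from this dataset
estimate = venues_per_prompt(16000, 4000, 40, 35)
capped_estimate = venues_per_prompt(16000, 4000, 40, 35, max_output_tokens=4096)
print(estimate, capped_estimate)
```

In practice it seems sensible to stay well below the estimate, since token counts vary from venue to venue.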

In [55]:
trimmed_data = []
for datum in data:
    trimmed_datum = {
        "id": datum['id'],
        "name": datum["name"],
        "rating": datum["rating"],
        "categories": [category["alias"] for category in datum["categories"]],
    }
    trimmed_data.append(trimmed_datum)

with open("../data/trimmed_location_data.json", "w", encoding="utf-8") as f:
    json.dump(trimmed_data, f, ensure_ascii=False, indent=4)

In [57]:
SYSTEM_PROMPT = """
# Instructions for Extracting Keywords for Venue Analysis
**Objective**: You will perform keyword extraction for a list of venues. Each venue will be
represented by a unique ID, its name, a rating, and a list of categories. Your task is to associate 
each venue with relevant keywords based on its categories.

## Materials Provided:

- A JSON object of slugified keywords categorized under three archetypes: "foodie", "party", and "adventure". This will be provided to you.
- A JSON list of venues, each containing an ID, name, rating, and a list of categories.

## Slugified Keywords
```
{
    "foodie": [
        "gourmet", "artisanal", "organic", "exotic-flavors", "farm-to-table", "culinary-adventure", "gastronomy", "comfort-food",
        "street-food", "seafood-delights", "sweet-treats", "vegan-choices", "wine-lover", "coffee-culture", "cheese-aficionado",
        "spice-enthusiast", "health-conscious-eating", "baking-passion", "fusion-cuisine", "food-markets"
    ],
    "party": [
        "nightlife", "live-music", "dance-vibes", "social-gatherings", "festive-atmosphere", "dj-beats", "cocktail-culture",
        "laugh-out-loud", "trendy-spots", "group-fun", "vibrant-scenes", "themed-parties", "chill-out", "exclusive-events", "happy-hours",
        "sing-along", "glam-nights", "bar-hopping", "eclectic-mix", "celebrity-style"
    ],
    "adventure": [
        "exploration", "outdoor-thrills", "nature-lover", "extreme-sports",  "scenic-beauty", "water-adventures", "trail-seeker",  "wildlife-encounters",
        "eco-friendly", "adventure-sports", "fitness-challenge", "sky-high", "off-the-beaten-path", "cultural-immersion", "urban-exploration",
        "mountain-escapes", "underwater-worlds", "road-trips", "star-gazing", "historical-discovery"
    ]
}
```

## Steps

1. Understand the Keywords:
Familiarize yourself with the keywords under each category. This will help you understand the themes and ideas each keyword represents.

2. Review the List of Venues:

You will receive a JSON list of venues. Each venue will have an id, name, rating, and a list of categories.
For example: 
```
[
    {
        "id": "unique-id-1",
        "name": "Venue Name 1",
        "rating": 4.5,
        "categories": ["category1", "category2"]
    },
    {
        "id": "unique-id-2",
        "name": "Venue Name 2",
        "rating": 3.5,
        "categories": ["category3", "category4"]
    }
    // more venues...
]
```

3. Perform Keyword Extraction:

For each venue, go through its list of categories. Match each category with the relevant keywords 
from the provided JSON object. Consider the themes and ideas the venue's categories represent and select 
appropriate keywords from the "foodie", "party", and "adventure" lists. It's important to use your judgment 
to best match the venue's categories to the keywords that represent its essence.

4. Create the Output JSON Object:

For each venue, create a JSON object containing its id and a list of extracted keywords.
The output should be an array of these objects.
Example format:
```
[
    {
        "id": "unique-id-1",
        "keywords": ["keyword1", "keyword2"]
    },
    {
        "id": "unique-id-2",
        "keywords": ["keyword3", "keyword4"]
    }
    // more venues with keywords...
]
```

Note: It's essential to interpret the categories creatively and contextually. The goal is to capture the 
essence of each venue in the keywords selected.
"""

In [85]:
def load_messages():
    with open("../data/example.json", 'r', encoding='utf-8') as f:
        example = json.load(f)
    example_input = example['input']
    example_output = example['output']

    messages = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': json.dumps(example_input)},
        {'role': 'assistant', 'content': json.dumps(example_output)},
    ]
    return messages

def format_messages(start: int, stop: int):
    input_values = trimmed_data[start:stop]
    messages = load_messages()
    messages.append({
        'role': 'user',
        'content': json.dumps(input_values)
    })
    return messages

In [92]:
from openai import OpenAI

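client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_keywords(start: int, stop: int):
    """Request keywords for trimmed_data[start:stop] and return the list of venue results."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=format_messages(start, stop),
        response_format={"type": "json_object"}  # JSON mode: the reply is a single JSON object
    )
    data = json.loads(completion.choices[0].message.content)
    # Assumes the reply wraps the result list in a top-level "venues" key; raises KeyError otherwise
    return data['venues']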
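
The model's reply is not guaranteed to contain every venue that was sent, or only keywords from our list, so it can help to check each batch before storing it. The following is a minimal sketch under that assumption. `validate_batch` is a hypothetical helper, and the sample batch, reply, and allowed set are made up for illustration.

```python
def validate_batch(batch: list[dict], result: list[dict],
                   allowed: set[str]) -> tuple[list[dict], list[str]]:
    """Split a model reply into usable rows and the IDs that still need processing."""
    expected_ids = {venue["id"] for venue in batch}
    cleaned = {}
    for row in result:
        venue_id = row.get("id")
        if venue_id not in expected_ids:
            continue  # drop IDs that were not part of this batch
        # Keep only keywords that appear in our keyword list
        keywords = [kw for kw in row.get("keywords", []) if kw in allowed]
        cleaned[venue_id] = {"id": venue_id, "keywords": keywords}
    missing = sorted(expected_ids - set(cleaned))
    return list(cleaned.values()), missing


# Made-up batch and reply for illustration
sample_batch = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
sample_result = [
    {"id": "a", "keywords": ["gourmet", "made-up"]},
    {"id": "z", "keywords": ["nightlife"]},
    {"id": "c", "keywords": ["nightlife"]},
]
rows, missing = validate_batch(sample_batch, sample_result, {"gourmet", "nightlife"})
print(rows)
print(missing)
```

The missing IDs could be collected and sent again in a later, smaller batch.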



In [105]:
BATCH_SIZE = 50
DATA_SIZE = len(trimmed_data)

extracted_keywords = []

for start in tqdm(range(0, DATA_SIZE, BATCH_SIZE)):
    # Process up to BATCH_SIZE venues per prompt; the last batch may be smaller
    end = min(start + BATCH_SIZE, DATA_SIZE)

    # Generate the completion for trimmed_data[start:end]
    result = extract_keywords(start, end)
    extracted_keywords.extend(result)
    print(f"Extracted keywords for {len(result)} venues. Total Processed: {len(extracted_keywords)}")


  6%|▌         | 1/17 [00:46<12:17, 46.10s/it]

CompletionUsage(completion_tokens=2640, prompt_tokens=9663, total_tokens=12303)
Extracted keywords for 76 venues. Total Processed: 76


  6%|▌         | 1/17 [01:34<25:07, 94.21s/it]


KeyboardInterrupt: 
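
Since a long run can be interrupted, one option for the final step of storing the results in a JSON file is to save after every batch and skip venues that are already done on the next run. This is a minimal sketch, not the notebook's implementation: `load_checkpoint`, `save_checkpoint`, `run_batches`, and the file name `extracted_keywords.json` are all hypothetical, and `fake_process` stands in for the model call so that no API request is made.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> list[dict]:
    """Return previously saved results, or an empty list on a first run."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_checkpoint(path: str, results: list[dict]) -> None:
    """Write to a temporary file, then swap it in, so an interrupt cannot leave half a file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=4)
    os.replace(tmp_path, path)


def run_batches(items, batch_size, process_batch, path):
    """Process items in batches, skipping IDs already present in the checkpoint."""
    results = load_checkpoint(path)
    done = {row["id"] for row in results}
    pending = [item for item in items if item["id"] not in done]
    for start in range(0, len(pending), batch_size):
        results.extend(process_batch(pending[start:start + batch_size]))
        save_checkpoint(path, results)
    return results


def fake_process(batch):
    """Stand-in for the model call: returns an empty keyword list per venue."""
    return [{"id": item["id"], "keywords": []} for item in batch]


demo_items = [{"id": str(i)} for i in range(5)]
demo_path = os.path.join(tempfile.mkdtemp(), "extracted_keywords.json")
first = run_batches(demo_items[:3], 2, fake_process, demo_path)   # simulates a partial run
second = run_batches(demo_items, 2, fake_process, demo_path)      # picks up the remaining venues
print(len(first), len(second))
```

To use this with the notebook's code, `process_batch` would need to accept a list of venues rather than the `start` and `stop` indices that `extract_keywords` takes.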