# Data Aggregation
As a first step in the Places to Go Demo, we need static venue data to generate recommendations from. In production, our venue sources will be managed by a web scraper bot that crawls social media and updates the list based on activity. For now, we will populate a static set of 2,500 locations sourced from 5 cities.

The five cities that have been requested by the client for the demo are:
- **New York**
- **Scottsdale**
- **Miami**
- **Los Angeles**
- **Chicago**

We will use the Yelp API to gather the 500 top-rated locations in each city. We will then feed the `name` and `categories` fields of each response to the AI model, which will associate each venue with a list of keywords.

We will need to take the following steps to achieve our task:
1. Gather JSON objects for the top 500 locations in each city.
2. Extract an exhaustive list of all categories from the 2,500 locations.
3. Provide the category list to ChatGPT and prompt it to create a list of 20 keywords for each archetype.
4. Design a prompt for associating businesses with keywords based on the `name` and `categories` fields.
5. Run the list of 2,500 businesses through the prompt and store the results in a JSON file.

In [5]:
import os
from dotenv import load_dotenv
load_dotenv("../.env")

YELP_API_KEY = os.getenv("YELP_API_KEY")

## 1. Gather JSON Data of Locations
We want to start by using the `/businesses/search` endpoint of the Yelp Fusion API to gather the top 500 rated locations in each of our 5 cities. We will store these responses directly in JSON files to retrieve for future steps.

In [29]:
import requests
from tqdm import tqdm

# List of cities to search
CITIES = ["New%20York%20City", "Scottsdale", "Miami", "Los%20Angeles", "Chicago"]

# Yelp Fusion API URL
API_URL = "https://api.yelp.com/v3"
BUSINESS_SEARCH_ENDPOINT = "/businesses/search"

# Search Params For API Request
LIMIT = 50
SORT_BY = "rating"

# Authorization
HEADERS = {
    "Authorization": "Bearer " + YELP_API_KEY,
}

def request_city_data(city: str):
    """Request data from Yelp API for a given city"""
    base_url = f"{API_URL}{BUSINESS_SEARCH_ENDPOINT}?location={city}&limit={LIMIT}&sort_by={SORT_BY}"
    url = base_url + "&offset={}"
    offset = 0
    data = []
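    # 10 pages x 50 results per page = 500 businesses per city
    for _ in range(10):
        response = requests.get(url.format(offset), headers=HEADERS)
        response.raise_for_status()  # surface HTTP errors before reading the body
        results = response.json()
        offset += LIMIT
        data.extend(results["businesses"])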
    return data


def extract_location_data():
    """Extract location data from Yelp API"""
    data = []
    for city in tqdm(CITIES):
        try:
            print("Requesting data for", city)
            city_results = request_city_data(city)
            data.extend(city_results)
            print(f"Received {len(city_results)} results for {city}. Total: {len(data)}")
        except (requests.RequestException, KeyError, ValueError) as exc:
            print(f"Failed to request data for {city}: {exc}")
    return data
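
The cell above URL-encodes city names by hand (`New%20York%20City`) and pages through results in steps of `LIMIT`. As a minimal sketch of that pagination, the helper below builds the request URLs for one city using only the standard library. `build_search_urls` is a hypothetical helper that is not part of the notebook, and its `total` and `limit` defaults simply mirror the 500-per-city target and `LIMIT = 50` used above.

```python
from urllib.parse import urlencode

# Same endpoint as API_URL + BUSINESS_SEARCH_ENDPOINT in the notebook
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_urls(city: str, total: int = 500, limit: int = 50,
                      sort_by: str = "rating") -> list:
    """Build one search URL per page of results for a city (hypothetical helper)."""
    urls = []
    for offset in range(0, total, limit):
        # urlencode escapes spaces and special characters in the city name
        query = urlencode({
            "location": city,
            "limit": limit,
            "sort_by": sort_by,
            "offset": offset,
        })
        urls.append(f"{SEARCH_URL}?{query}")
    return urls


urls = build_search_urls("New York City")
print(len(urls))
print(urls[0])
```

`requests.get` also accepts a `params` dictionary, which would let the library perform this encoding itself so that `CITIES` could hold plain city names.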

In [30]:
data = extract_location_data()

  0%|          | 0/5 [00:00<?, ?it/s]

Requesting data for New%20York%20City


 20%|██        | 1/5 [00:06<00:26,  6.53s/it]

Received 500 results for New%20York%20City. Total: 500
Requesting data for Scottsdale


 40%|████      | 2/5 [00:13<00:20,  6.90s/it]

Received 500 results for Scottsdale. Total: 1000
Requesting data for Miami


 60%|██████    | 3/5 [00:21<00:14,  7.37s/it]

Received 500 results for Miami. Total: 1500
Requesting data for Los%20Angeles


 80%|████████  | 4/5 [00:29<00:07,  7.61s/it]

Received 500 results for Los%20Angeles. Total: 2000
Requesting data for Chicago


100%|██████████| 5/5 [00:36<00:00,  7.36s/it]

Received 500 results for Chicago. Total: 2500





In [33]:
import json

with open("../data/raw_location_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

## 2. Extract Exhaustive Category List from Locations
We now want an exhaustive list of all the categories present in our 2,500 locations. To do this, we will extract the `categories` field from each location and append the alias values to a list. Once all values have been appended, we will cast the list to a set to remove duplicates.

We will store the category list in a JSON file for future use.

In [41]:
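# Collect the alias of every category attached to every location
categories = []
for location in data:
    loc_categories = [category['alias'] for category in location['categories']]
    categories.extend(loc_categories)
# Deduplicate via a set; sort so the saved file is stable between runs
categories = sorted(set(categories))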


with open("../data/raw_categories.json", "w", encoding="utf-8") as f:
    json.dump(categories, f, ensure_ascii=False, indent=4)

## 3. Keyword List Generation
We have used ChatGPT to convert our categories into keywords to use for classification. The keywords can be found in `../data/keywords.json`.
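
Because the keywords were generated by a chat model, it may be worth sanity-checking them before they go into a prompt. The sketch below assumes the keyword file holds a mapping from archetype name to a list of slugified keywords, with 20 keywords per archetype; that structure is an assumption, and `check_keywords` is a hypothetical helper rather than part of the notebook.

```python
import re

# Lowercase words separated by single hyphens, e.g. "farm-to-table"
SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_keywords(keywords: dict, expected_per_archetype: int = 20) -> list:
    """Return a list of human-readable problems found in the keyword mapping."""
    problems = []
    seen = {}  # keyword -> archetype where it was first seen
    for archetype, words in keywords.items():
        if len(words) != expected_per_archetype:
            problems.append(
                f"{archetype}: expected {expected_per_archetype} keywords, found {len(words)}"
            )
        for word in words:
            if not SLUG_PATTERN.match(word):
                problems.append(f"{archetype}: '{word}' is not slugified")
            if word in seen:
                problems.append(f"'{word}' appears in both {seen[word]} and {archetype}")
            seen[word] = archetype
    return problems


# Tiny illustrative mapping (not the real keyword file)
sample = {
    "foodie": ["gourmet", "farm-to-table"],
    "party": ["nightlife", "Live Music"],
}
problems = check_keywords(sample, expected_per_archetype=2)
print(problems)
```

In the notebook, the same check could be run on the result of loading `../data/keywords.json` with `json.load`, provided the file has the assumed shape.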

## 4. Prompt Engineering
Now that we have our keywords, we need to engineer a GPT-3.5-Turbo prompt that associates each venue with a subset of them. There are a few requirements to consider:
- The prompt must provide the list of keywords to the model.
- The model must accurately associate keywords with venues according to product needs.
- Each prompt should process as many venues as possible.

To give the LLM the list of keywords, we will include them in the system prompt. For token efficiency, we want to fit as many venues as possible into every prompt, which limits the number of times we have to send the system prompt.

To get accurate results without fine-tuning, we will take a few-shot approach: we will use ChatGPT to label ~20 locations and include those labeled locations as an example in every prompt we send.

Finally, we should use as much of each prompt as we can. We have a 16k-token context window to work with, and we can use the examples to estimate the optimal number of locations per prompt.
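
The batch-size reasoning above can be sketched as simple arithmetic. Two limits apply: the whole exchange (system prompt, few-shot example, venue inputs, and venue outputs) has to fit in the context window, and the completion alone has to fit under the model's output cap. All numbers in the example call are illustrative placeholders, not measurements, and the 4,096-token completion cap is an assumption about `gpt-3.5-turbo-1106` that should be checked against OpenAI's model documentation. `max_venues_per_prompt` is a hypothetical helper.

```python
def max_venues_per_prompt(
    context_window: int,
    max_output_tokens: int,
    fixed_prompt_tokens: int,
    input_tokens_per_venue: int,
    output_tokens_per_venue: int,
) -> int:
    """Largest batch that fits both the context window and the completion cap."""
    # Limit 1: the completion (one result per venue) must fit in the output cap
    by_output = max_output_tokens // output_tokens_per_venue
    # Limit 2: fixed prompt + venue inputs + venue outputs must fit in the window
    remaining = context_window - fixed_prompt_tokens
    by_context = remaining // (input_tokens_per_venue + output_tokens_per_venue)
    return max(0, min(by_output, by_context))


# Placeholder figures: 16k window, 4,096-token output cap (assumed),
# 2,500 tokens of system prompt + example, ~35 tokens per venue each way
batch = max_venues_per_prompt(16_000, 4_096, 2_500, 35, 35)
print(batch)
```

Under these placeholder figures the output cap, not the context window, is the binding limit, which suggests measuring per-venue output tokens from the few-shot example before choosing a batch size.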

In [55]:
trimmed_data = []
for datum in data:
    trimmed_datum = {
        "id": datum['id'],
        "name": datum["name"],
        "rating": datum["rating"],
        "categories": [category["alias"] for category in datum["categories"]],
    }
    trimmed_data.append(trimmed_datum)

with open("../data/trimmed_location_data.json", "w", encoding="utf-8") as f:
    json.dump(trimmed_data, f, ensure_ascii=False, indent=4)

In [57]:
SYSTEM_PROMPT = """
# Instructions for Extracting Keywords for Venue Analysis
**Objective**: You will perform keyword extraction for a list of venues. Each venue will be
represented by a unique ID, its name, a rating, and a list of categories. Your task is to associate 
each venue with relevant keywords based on its categories.

## Materials Provided:

- A JSON object of slugified keywords categorized under three archetypes: "foodie", "party", and "adventure". This will be provided to you.
- A JSON list of venues, each containing an ID, name, rating, and a list of categories.

## Slugified Keywords
```
{
    "foodie": [
        "gourmet", "artisanal", "organic", "exotic-flavors", "farm-to-table", "culinary-adventure", "gastronomy", "comfort-food",
        "street-food", "seafood-delights", "sweet-treats", "vegan-choices", "wine-lover", "coffee-culture", "cheese-aficionado",
        "spice-enthusiast", "health-conscious-eating", "baking-passion", "fusion-cuisine", "food-markets"
    ],
    "party": [
        "nightlife", "live-music", "dance-vibes", "social-gatherings", "festive-atmosphere", "dj-beats", "cocktail-culture",
        "laugh-out-loud", "trendy-spots", "group-fun", "vibrant-scenes", "themed-parties", "chill-out", "exclusive-events", "happy-hours",
        "sing-along", "glam-nights", "bar-hopping", "eclectic-mix", "celebrity-style"
    ],
    "adventure": [
        "exploration", "outdoor-thrills", "nature-lover", "extreme-sports",  "scenic-beauty", "water-adventures", "trail-seeker",  "wildlife-encounters",
        "eco-friendly", "adventure-sports", "fitness-challenge", "sky-high", "off-the-beaten-path", "cultural-immersion", "urban-exploration",
        "mountain-escapes", "underwater-worlds", "road-trips", "star-gazing", "historical-discovery"
    ]
}
```

## Steps

1. Understand the Keywords:
Familiarize yourself with the keywords under each category. This will help you understand the themes and ideas each keyword represents.

2. Review the List of Venues:

You will receive a JSON list of venues. Each venue will have an id, name, rating, and a list of categories.
For example: 
```
[
    {
        "id": "unique-id-1",
        "name": "Venue Name 1",
        "rating": 4.5,
        "categories": ["category1", "category2"]
    },
    {
        "id": "unique-id-2",
        "name": "Venue Name 2",
        "rating": 3.5,
        "categories": ["category3", "category4"]
    }
    // more venues...
]
```

3. Perform Keyword Extraction:

For each venue, go through its list of categories. Match each category with the relevant keywords 
from the provided JSON object. Consider the themes and ideas the venue's categories represent and select 
appropriate keywords from the "foodie", "party", and "adventure" lists. It's important to use your judgment 
to best match the venue's categories to the keywords that represent its essence.

4. Create the Output JSON Object:

For each venue, create a JSON object containing its id and a list of extracted keywords.
The output should be an array of these objects.
Example format:
```
[
    {
        "id": "unique-id-1",
        "keywords": ["keyword1", "keyword2"]
    },
    {
        "id": "unique-id-2",
        "keywords": ["keyword3", "keyword4"]
    }
    // more venues with keywords...
]
```

Note: It's essential to interpret the categories creatively and contextually. The goal is to capture the 
essence of each venue in the keywords selected.
"""

In [85]:
def load_messages():
    """Build the system prompt plus one few-shot example exchange"""
    with open("../data/example.json", 'r', encoding='utf-8') as f:
        example = json.load(f)
    example_input = example['input']
    example_output = example['output']

    messages = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': json.dumps(example_input)},
        {'role': 'assistant', 'content': json.dumps(example_output)},
    ]
    return messages

def format_messages(start: int, stop: int):
    """Append trimmed_data[start:stop] as the user message after the few-shot example"""
    input_values = trimmed_data[start:stop]
    messages = load_messages()
    messages.append({
        'role': 'user',
        'content': json.dumps(input_values)
    })
    return messages

In [92]:
from openai import OpenAI

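client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_keywords(start: int, stop: int):
    """Request keywords for trimmed_data[start:stop] and return the list of venue results"""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=format_messages(start, stop),
        response_format={"type": "json_object"}
    )
    # Log token usage for each batch
    print(completion.usage)
    # JSON mode returns an object; the venue list is expected under the "venues" key
    data = json.loads(completion.choices[0].message.content)
    return data['venues']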
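
The system prompt's example output is a bare JSON array, while `extract_keywords` reads a `venues` key from a JSON object, so that key presumably comes from the few-shot example in `example.json`. As a hedged sketch, a more tolerant parser could accept either shape instead of raising a `KeyError` when the model deviates. `parse_venues` is a hypothetical helper, not part of the notebook.

```python
import json


def parse_venues(content: str) -> list:
    """Return the list of venue results from a raw JSON completion string."""
    payload = json.loads(content)
    # Shape 1: a bare array, as in the system prompt's example output
    if isinstance(payload, list):
        return payload
    # Shape 2: an object wrapping the array, as the notebook code expects
    if isinstance(payload, dict):
        if isinstance(payload.get("venues"), list):
            return payload["venues"]
        # Fall back to the first list-valued field, whatever its key
        for value in payload.values():
            if isinstance(value, list):
                return value
    raise ValueError("No venue list found in model output")


wrapped = '{"venues": [{"id": "a", "keywords": ["gourmet"]}]}'
bare = '[{"id": "a", "keywords": ["gourmet"]}]'
print(parse_venues(wrapped) == parse_venues(bare))
```

Raising a `ValueError` for anything else keeps a malformed batch visible, so it can be retried rather than silently dropped.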



In [107]:
BATCH_SIZE = 30
DATA_SIZE = len(trimmed_data)

extracted_keywords = []

for start in tqdm(range(0, DATA_SIZE, BATCH_SIZE)):
    # Send BATCH_SIZE venues per prompt; the final batch may be smaller
    end = min(start + BATCH_SIZE, DATA_SIZE)

    # Generate the completion for trimmed_data[start:end]
    result = extract_keywords(start, end)
    extracted_keywords.extend(result)
    print(f"Extracted keywords for {len(result)} venues. Total Processed: {len(extracted_keywords)}")


  1%|          | 1/84 [00:18<25:45, 18.62s/it]

CompletionUsage(completion_tokens=1130, prompt_tokens=3551, total_tokens=4681)
Extracted keywords for 30 venues. Total Processed: 30


  2%|▏         | 2/84 [00:31<20:36, 15.07s/it]

CompletionUsage(completion_tokens=964, prompt_tokens=3467, total_tokens=4431)
Extracted keywords for 30 venues. Total Processed: 60


  4%|▎         | 3/84 [00:47<20:50, 15.44s/it]

CompletionUsage(completion_tokens=978, prompt_tokens=3501, total_tokens=4479)
Extracted keywords for 30 venues. Total Processed: 90


  5%|▍         | 4/84 [01:07<22:58, 17.23s/it]

CompletionUsage(completion_tokens=1122, prompt_tokens=3571, total_tokens=4693)
Extracted keywords for 30 venues. Total Processed: 120


  6%|▌         | 5/84 [01:27<24:25, 18.55s/it]

CompletionUsage(completion_tokens=1087, prompt_tokens=3549, total_tokens=4636)
Extracted keywords for 30 venues. Total Processed: 150


  7%|▋         | 6/84 [01:45<23:32, 18.11s/it]

CompletionUsage(completion_tokens=1044, prompt_tokens=3501, total_tokens=4545)
Extracted keywords for 30 venues. Total Processed: 180


  8%|▊         | 7/84 [02:03<23:08, 18.03s/it]

CompletionUsage(completion_tokens=1046, prompt_tokens=3530, total_tokens=4576)
Extracted keywords for 30 venues. Total Processed: 210


 10%|▉         | 8/84 [02:21<23:02, 18.19s/it]

CompletionUsage(completion_tokens=1102, prompt_tokens=3555, total_tokens=4657)
Extracted keywords for 31 venues. Total Processed: 241


 11%|█         | 9/84 [02:36<21:35, 17.27s/it]

CompletionUsage(completion_tokens=913, prompt_tokens=3518, total_tokens=4431)
Extracted keywords for 27 venues. Total Processed: 268


 12%|█▏        | 10/84 [02:51<20:18, 16.46s/it]

CompletionUsage(completion_tokens=1012, prompt_tokens=3482, total_tokens=4494)
Extracted keywords for 30 venues. Total Processed: 298


 13%|█▎        | 11/84 [03:09<20:32, 16.88s/it]

CompletionUsage(completion_tokens=970, prompt_tokens=3505, total_tokens=4475)
Extracted keywords for 30 venues. Total Processed: 328


 14%|█▍        | 12/84 [03:24<19:30, 16.26s/it]

CompletionUsage(completion_tokens=1049, prompt_tokens=3529, total_tokens=4578)
Extracted keywords for 30 venues. Total Processed: 358


 15%|█▌        | 13/84 [03:41<19:41, 16.64s/it]

CompletionUsage(completion_tokens=985, prompt_tokens=3486, total_tokens=4471)
Extracted keywords for 30 venues. Total Processed: 388


 17%|█▋        | 14/84 [03:58<19:24, 16.63s/it]

CompletionUsage(completion_tokens=904, prompt_tokens=3487, total_tokens=4391)
Extracted keywords for 27 venues. Total Processed: 415


 18%|█▊        | 15/84 [04:13<18:27, 16.05s/it]

CompletionUsage(completion_tokens=1055, prompt_tokens=3517, total_tokens=4572)
Extracted keywords for 30 venues. Total Processed: 445


 19%|█▉        | 16/84 [04:28<18:01, 15.91s/it]

CompletionUsage(completion_tokens=1073, prompt_tokens=3543, total_tokens=4616)
Extracted keywords for 30 venues. Total Processed: 475


 20%|██        | 17/84 [04:49<19:20, 17.33s/it]

CompletionUsage(completion_tokens=1119, prompt_tokens=3561, total_tokens=4680)
Extracted keywords for 31 venues. Total Processed: 506


 21%|██▏       | 18/84 [05:08<19:34, 17.80s/it]

CompletionUsage(completion_tokens=1036, prompt_tokens=3582, total_tokens=4618)
Extracted keywords for 30 venues. Total Processed: 536


 23%|██▎       | 19/84 [05:27<19:51, 18.33s/it]

CompletionUsage(completion_tokens=1049, prompt_tokens=3568, total_tokens=4617)
Extracted keywords for 30 venues. Total Processed: 566


 24%|██▍       | 20/84 [05:42<18:20, 17.19s/it]

CompletionUsage(completion_tokens=1029, prompt_tokens=3563, total_tokens=4592)
Extracted keywords for 30 venues. Total Processed: 596


 25%|██▌       | 21/84 [05:58<17:43, 16.89s/it]

CompletionUsage(completion_tokens=1039, prompt_tokens=3521, total_tokens=4560)
Extracted keywords for 30 venues. Total Processed: 626


 26%|██▌       | 22/84 [06:19<18:52, 18.27s/it]

CompletionUsage(completion_tokens=1086, prompt_tokens=3550, total_tokens=4636)
Extracted keywords for 30 venues. Total Processed: 656


 27%|██▋       | 23/84 [06:36<18:07, 17.82s/it]

CompletionUsage(completion_tokens=914, prompt_tokens=3575, total_tokens=4489)
Extracted keywords for 27 venues. Total Processed: 683


 29%|██▊       | 24/84 [06:55<18:04, 18.07s/it]

CompletionUsage(completion_tokens=1044, prompt_tokens=3564, total_tokens=4608)
Extracted keywords for 30 venues. Total Processed: 713


 30%|██▉       | 25/84 [07:13<17:47, 18.09s/it]

CompletionUsage(completion_tokens=966, prompt_tokens=3516, total_tokens=4482)
Extracted keywords for 30 venues. Total Processed: 743


 31%|███       | 26/84 [07:33<18:08, 18.77s/it]

CompletionUsage(completion_tokens=1106, prompt_tokens=3555, total_tokens=4661)
Extracted keywords for 30 venues. Total Processed: 773


 32%|███▏      | 27/84 [07:53<17:58, 18.91s/it]

CompletionUsage(completion_tokens=1020, prompt_tokens=3554, total_tokens=4574)
Extracted keywords for 29 venues. Total Processed: 802


 33%|███▎      | 28/84 [08:09<17:04, 18.30s/it]

CompletionUsage(completion_tokens=998, prompt_tokens=3533, total_tokens=4531)
Extracted keywords for 29 venues. Total Processed: 831


 35%|███▍      | 29/84 [08:24<15:48, 17.24s/it]

CompletionUsage(completion_tokens=1015, prompt_tokens=3554, total_tokens=4569)
Extracted keywords for 29 venues. Total Processed: 860


 36%|███▌      | 30/84 [08:39<14:59, 16.65s/it]

CompletionUsage(completion_tokens=1052, prompt_tokens=3550, total_tokens=4602)
Extracted keywords for 30 venues. Total Processed: 890


 37%|███▋      | 31/84 [08:54<14:07, 15.98s/it]

CompletionUsage(completion_tokens=943, prompt_tokens=3559, total_tokens=4502)
Extracted keywords for 28 venues. Total Processed: 918


 38%|███▊      | 32/84 [09:11<14:10, 16.35s/it]

CompletionUsage(completion_tokens=1129, prompt_tokens=3612, total_tokens=4741)
Extracted keywords for 30 venues. Total Processed: 948


 39%|███▉      | 33/84 [09:28<14:03, 16.54s/it]

CompletionUsage(completion_tokens=1020, prompt_tokens=3543, total_tokens=4563)
Extracted keywords for 29 venues. Total Processed: 977


 40%|████      | 34/84 [09:46<14:06, 16.93s/it]

CompletionUsage(completion_tokens=1080, prompt_tokens=3554, total_tokens=4634)
Extracted keywords for 30 venues. Total Processed: 1007


 42%|████▏     | 35/84 [10:00<13:03, 16.00s/it]

CompletionUsage(completion_tokens=883, prompt_tokens=3593, total_tokens=4476)
Extracted keywords for 24 venues. Total Processed: 1031


 43%|████▎     | 36/84 [10:19<13:31, 16.91s/it]

CompletionUsage(completion_tokens=1083, prompt_tokens=3560, total_tokens=4643)
Extracted keywords for 30 venues. Total Processed: 1061


 44%|████▍     | 37/84 [10:34<12:49, 16.36s/it]

CompletionUsage(completion_tokens=1115, prompt_tokens=3554, total_tokens=4669)
Extracted keywords for 30 venues. Total Processed: 1091


 45%|████▌     | 38/84 [10:53<13:05, 17.07s/it]

CompletionUsage(completion_tokens=1086, prompt_tokens=3572, total_tokens=4658)
Extracted keywords for 30 venues. Total Processed: 1121


 46%|████▋     | 39/84 [11:10<12:50, 17.12s/it]

CompletionUsage(completion_tokens=1059, prompt_tokens=3577, total_tokens=4636)
Extracted keywords for 28 venues. Total Processed: 1149


 48%|████▊     | 40/84 [11:29<13:00, 17.74s/it]

CompletionUsage(completion_tokens=1011, prompt_tokens=3600, total_tokens=4611)
Extracted keywords for 30 venues. Total Processed: 1179


 49%|████▉     | 41/84 [11:47<12:49, 17.89s/it]

CompletionUsage(completion_tokens=1035, prompt_tokens=3594, total_tokens=4629)
Extracted keywords for 30 venues. Total Processed: 1209


 50%|█████     | 42/84 [12:02<11:55, 17.04s/it]

CompletionUsage(completion_tokens=1003, prompt_tokens=3581, total_tokens=4584)
Extracted keywords for 29 venues. Total Processed: 1238


 51%|█████     | 43/84 [12:20<11:51, 17.35s/it]

CompletionUsage(completion_tokens=1038, prompt_tokens=3562, total_tokens=4600)
Extracted keywords for 30 venues. Total Processed: 1268


 52%|█████▏    | 44/84 [12:38<11:39, 17.48s/it]

CompletionUsage(completion_tokens=1008, prompt_tokens=3559, total_tokens=4567)
Extracted keywords for 30 venues. Total Processed: 1298


 54%|█████▎    | 45/84 [12:56<11:26, 17.60s/it]

CompletionUsage(completion_tokens=1053, prompt_tokens=3550, total_tokens=4603)
Extracted keywords for 30 venues. Total Processed: 1328


 55%|█████▍    | 46/84 [13:11<10:41, 16.89s/it]

CompletionUsage(completion_tokens=1090, prompt_tokens=3518, total_tokens=4608)
Extracted keywords for 30 venues. Total Processed: 1358


 56%|█████▌    | 47/84 [13:28<10:19, 16.74s/it]

CompletionUsage(completion_tokens=1012, prompt_tokens=3600, total_tokens=4612)
Extracted keywords for 27 venues. Total Processed: 1385


 57%|█████▋    | 48/84 [13:44<09:55, 16.55s/it]

CompletionUsage(completion_tokens=1025, prompt_tokens=3564, total_tokens=4589)
Extracted keywords for 30 venues. Total Processed: 1415


 58%|█████▊    | 49/84 [14:00<09:33, 16.38s/it]

CompletionUsage(completion_tokens=1052, prompt_tokens=3595, total_tokens=4647)
Extracted keywords for 28 venues. Total Processed: 1443


 60%|█████▉    | 50/84 [14:19<09:45, 17.23s/it]

CompletionUsage(completion_tokens=1119, prompt_tokens=3588, total_tokens=4707)
Extracted keywords for 30 venues. Total Processed: 1473


 61%|██████    | 51/84 [14:37<09:32, 17.35s/it]

CompletionUsage(completion_tokens=1069, prompt_tokens=3553, total_tokens=4622)
Extracted keywords for 30 venues. Total Processed: 1503


 62%|██████▏   | 52/84 [14:52<09:01, 16.91s/it]

CompletionUsage(completion_tokens=1017, prompt_tokens=3520, total_tokens=4537)
Extracted keywords for 30 venues. Total Processed: 1533


 63%|██████▎   | 53/84 [15:07<08:24, 16.28s/it]

CompletionUsage(completion_tokens=1053, prompt_tokens=3572, total_tokens=4625)
Extracted keywords for 30 venues. Total Processed: 1563


 64%|██████▍   | 54/84 [15:23<08:05, 16.17s/it]

CompletionUsage(completion_tokens=1023, prompt_tokens=3569, total_tokens=4592)
Extracted keywords for 27 venues. Total Processed: 1590


 65%|██████▌   | 55/84 [15:39<07:43, 15.99s/it]

CompletionUsage(completion_tokens=1059, prompt_tokens=3533, total_tokens=4592)
Extracted keywords for 30 venues. Total Processed: 1620


 67%|██████▋   | 56/84 [15:56<07:42, 16.50s/it]

CompletionUsage(completion_tokens=1011, prompt_tokens=3529, total_tokens=4540)
Extracted keywords for 30 venues. Total Processed: 1650


 68%|██████▊   | 57/84 [16:13<07:25, 16.51s/it]

CompletionUsage(completion_tokens=1030, prompt_tokens=3569, total_tokens=4599)
Extracted keywords for 30 venues. Total Processed: 1680


 69%|██████▉   | 58/84 [16:30<07:12, 16.65s/it]

CompletionUsage(completion_tokens=1029, prompt_tokens=3538, total_tokens=4567)
Extracted keywords for 30 venues. Total Processed: 1710


 70%|███████   | 59/84 [16:49<07:13, 17.36s/it]

CompletionUsage(completion_tokens=1075, prompt_tokens=3546, total_tokens=4621)
Extracted keywords for 30 venues. Total Processed: 1740


 71%|███████▏  | 60/84 [17:06<06:51, 17.16s/it]

CompletionUsage(completion_tokens=1012, prompt_tokens=3561, total_tokens=4573)
Extracted keywords for 30 venues. Total Processed: 1770


 73%|███████▎  | 61/84 [17:23<06:32, 17.09s/it]

CompletionUsage(completion_tokens=1069, prompt_tokens=3529, total_tokens=4598)
Extracted keywords for 30 venues. Total Processed: 1800


 74%|███████▍  | 62/84 [17:40<06:16, 17.11s/it]

CompletionUsage(completion_tokens=991, prompt_tokens=3508, total_tokens=4499)
Extracted keywords for 30 venues. Total Processed: 1830


 75%|███████▌  | 63/84 [18:00<06:19, 18.09s/it]

CompletionUsage(completion_tokens=1134, prompt_tokens=3558, total_tokens=4692)
Extracted keywords for 30 venues. Total Processed: 1860


 76%|███████▌  | 64/84 [18:23<06:29, 19.45s/it]

CompletionUsage(completion_tokens=1171, prompt_tokens=3607, total_tokens=4778)
Extracted keywords for 30 venues. Total Processed: 1890


 77%|███████▋  | 65/84 [18:38<05:45, 18.17s/it]

CompletionUsage(completion_tokens=969, prompt_tokens=3531, total_tokens=4500)
Extracted keywords for 29 venues. Total Processed: 1919


 79%|███████▊  | 66/84 [18:57<05:29, 18.31s/it]

CompletionUsage(completion_tokens=1180, prompt_tokens=3580, total_tokens=4760)
Extracted keywords for 30 venues. Total Processed: 1949


 80%|███████▉  | 67/84 [19:15<05:11, 18.30s/it]

CompletionUsage(completion_tokens=1061, prompt_tokens=3549, total_tokens=4610)
Extracted keywords for 30 venues. Total Processed: 1979


 81%|████████  | 68/84 [19:28<04:27, 16.69s/it]

CompletionUsage(completion_tokens=1022, prompt_tokens=3588, total_tokens=4610)
Extracted keywords for 30 venues. Total Processed: 2009


 82%|████████▏ | 69/84 [19:45<04:11, 16.75s/it]

CompletionUsage(completion_tokens=1050, prompt_tokens=3565, total_tokens=4615)
Extracted keywords for 30 venues. Total Processed: 2039


 83%|████████▎ | 70/84 [20:00<03:49, 16.42s/it]

CompletionUsage(completion_tokens=932, prompt_tokens=3559, total_tokens=4491)
Extracted keywords for 27 venues. Total Processed: 2066


 85%|████████▍ | 71/84 [20:16<03:28, 16.05s/it]

CompletionUsage(completion_tokens=1025, prompt_tokens=3591, total_tokens=4616)
Extracted keywords for 29 venues. Total Processed: 2095


 86%|████████▌ | 72/84 [20:32<03:14, 16.18s/it]

CompletionUsage(completion_tokens=1077, prompt_tokens=3555, total_tokens=4632)
Extracted keywords for 30 venues. Total Processed: 2125


 87%|████████▋ | 73/84 [20:48<02:56, 16.02s/it]

CompletionUsage(completion_tokens=1038, prompt_tokens=3522, total_tokens=4560)
Extracted keywords for 29 venues. Total Processed: 2154


 88%|████████▊ | 74/84 [21:03<02:39, 15.96s/it]

CompletionUsage(completion_tokens=1036, prompt_tokens=3534, total_tokens=4570)
Extracted keywords for 30 venues. Total Processed: 2184


 89%|████████▉ | 75/84 [21:21<02:26, 16.28s/it]

CompletionUsage(completion_tokens=1076, prompt_tokens=3538, total_tokens=4614)
Extracted keywords for 30 venues. Total Processed: 2214


 90%|█████████ | 76/84 [21:36<02:09, 16.16s/it]

CompletionUsage(completion_tokens=1024, prompt_tokens=3550, total_tokens=4574)
Extracted keywords for 30 venues. Total Processed: 2244


 92%|█████████▏| 77/84 [21:51<01:48, 15.56s/it]

CompletionUsage(completion_tokens=978, prompt_tokens=3477, total_tokens=4455)
Extracted keywords for 30 venues. Total Processed: 2274


 93%|█████████▎| 78/84 [22:07<01:35, 15.90s/it]

CompletionUsage(completion_tokens=1052, prompt_tokens=3595, total_tokens=4647)
Extracted keywords for 29 venues. Total Processed: 2303


 94%|█████████▍| 79/84 [22:24<01:20, 16.16s/it]

CompletionUsage(completion_tokens=1068, prompt_tokens=3582, total_tokens=4650)
Extracted keywords for 30 venues. Total Processed: 2333


 95%|█████████▌| 80/84 [22:41<01:05, 16.36s/it]

CompletionUsage(completion_tokens=1136, prompt_tokens=3592, total_tokens=4728)
Extracted keywords for 30 venues. Total Processed: 2363


 96%|█████████▋| 81/84 [22:52<00:44, 14.81s/it]

CompletionUsage(completion_tokens=835, prompt_tokens=3524, total_tokens=4359)
Extracted keywords for 25 venues. Total Processed: 2388


 98%|█████████▊| 82/84 [23:10<00:31, 15.67s/it]

CompletionUsage(completion_tokens=1030, prompt_tokens=3566, total_tokens=4596)
Extracted keywords for 30 venues. Total Processed: 2418


 99%|█████████▉| 83/84 [23:23<00:14, 15.00s/it]

CompletionUsage(completion_tokens=978, prompt_tokens=3504, total_tokens=4482)
Extracted keywords for 30 venues. Total Processed: 2448


100%|██████████| 84/84 [23:31<00:00, 16.80s/it]

CompletionUsage(completion_tokens=362, prompt_tokens=2506, total_tokens=2868)
Extracted keywords for 10 venues. Total Processed: 2458





In [118]:
with open("../data/venue_keywords.json", "w", encoding="utf-8") as f:
    json.dump(extracted_keywords, f, ensure_ascii=False, indent=4)
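
The log above ends at 2,458 processed venues, fewer than the 2,500 sent, and individual batches of 30 returned anywhere from 24 to 31 results. The saved file therefore does not map one-to-one onto the input venues. The sketch below shows one way to find the gaps by comparing ids; it assumes every output entry carries the `id` field requested in the system prompt, and `reconcile` is a hypothetical helper. Whether the extra results are duplicates or invented ids is not confirmed by the log, so the function reports both cases.

```python
from collections import Counter


def reconcile(inputs: list, outputs: list) -> dict:
    """Compare input venue ids with the ids returned by the model."""
    input_ids = {venue["id"] for venue in inputs}
    output_counts = Counter(venue["id"] for venue in outputs)
    output_ids = set(output_counts)
    return {
        # Sent to the model but never returned
        "missing": sorted(input_ids - output_ids),
        # Returned by the model but never sent
        "unexpected": sorted(output_ids - input_ids),
        # Returned more than once
        "duplicated": sorted(i for i, n in output_counts.items() if n > 1),
    }


# Tiny illustrative data (not real venues)
inputs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
outputs = [
    {"id": "a", "keywords": ["gourmet"]},
    {"id": "a", "keywords": ["gourmet"]},
    {"id": "z", "keywords": ["nightlife"]},
]
report = reconcile(inputs, outputs)
print(report)
```

In the notebook this would be called as `reconcile(trimmed_data, extracted_keywords)`, and the venues listed under `missing` could then be sent through `extract_keywords` again in a smaller batch.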