# Venue Data Cleaning

Now that we have scraped the reviews of our businesses, we need to clean the data so that it can be used to create graph entities as well as documents for a Pinecone index. The goal of this notebook is to populate two files:
- `../data/cypher/venue-entities.json`: The data that will be used for writing the venue entities to Neo4j
- `../data/vector/venue-documents.json`: The data that will be used for writing the venue documents to Pinecone

## General Process

**Data Loading**:
We will start by reading the final output file of the scraping process: `../data/scrape/locations_finished`. We will extract only the fields needed to create the embedding documents and the Venue entities.

**Prompt Engineering**:
Next, we will engineer a prompt that has GPT-3.5-Turbo parse each Yelp business and make certain classifications for it. To do this, we need a prompt template that describes the JSON schema the model should output for each venue.


In [144]:
import os
import sys

from dotenv import load_dotenv
load_dotenv("../.env")

sys.path.append("../")


### Data Loading

Let's start by loading our scraped location data, and extracting the necessary fields that we will need for the notebook.

In [145]:
import json

with open("../data/scrape/reviews/locations_finished.json", "r", encoding="utf-8") as f:
    location_data = json.load(f)
    trimmed_location_data = []
    for location in location_data:
        trimmed_location_data.append({
            "id": location['id'],
            "name": location['name'],
            "city": location['city_code'],
            'rating': location['rating'],
            "reviews": location['reviews'],
            "url": location['url'],
            "image_url": location['image_url'],
            "categories": location['categories'],
        })
    location_data = trimmed_location_data
    del trimmed_location_data


### Prompt Engineering

There are tons of locations in our raw Yelp dataset that have no relevance to someone planning an upcoming vacation or trip. Let's use ChatGPT to trim down these results so that we are left with a list of venues that all make for relevant activity suggestions.

In [147]:
SELECTION_PROMPT = """
You are working for a travel agency. You are searching through Yelp Business Pages to find new and exciting vacation activities
for your clients. Your clients are often young adults (18-35) who are looking for fun, exciting, and trendy vacation activities
and restaurants to visit. Your task is to select only the most premium and interesting businesses for your clients to visit.

You will be given a JSON file containing a list of Yelp Businesses. For each business, you will be provided with the ID, name, 
categories, and yelp rating. Your task is to infer which of these businesses are interesting for your clients to visit based 
on this limited information. You are attempting to make your best guess based on the information you are provided with. 

The most IMPORTANT component of this task is removing all businesses that are not relevant for TOURISTS. Since your clients are planning
a vacation, you should only select businesses that are relevant for a tourist to visit. You MUST FILTER OUT all businesses
that are not relevant for tourism. 

To summarize your instructions:
For any business that is not relevant for a tourist, you filter it out. It is crucial that 
you are very selective in your process, and you only produce a list of businesses that are relevant 
for a TOURIST to visit.

Once you have made your decision on which businesses you are keeping, you should provide your list of relevant business IDs in a JSON 
object. Here is an example of the JSON object you should produce:
{
    "relevant_businesses": [
        "0EWwmbkJ97nDZCPOdrWMau",
        "fWdKIXkM66qoLrllGfT7LR",
        "Sg4N2msDQjYXuqmaR5a9OO",
        "Zvb2bH2rDtJpJwK0YTfpwy",
        "EDE2n455bQndLRbeLMn5xb",
        "A77ORqKZTOck45DDArSWoL",
        "RNqVLeLK7YjrbiTBU7YIFu",
        "ApKcaAL2HMxVtLvGrUVAPi",
        "cjCiLUeU5pnU5mnG0lvcKN",
        "NbUT2tBMs3c5cvIUmrUnCE",
        "rxN806Yvca8Wb1E37WXEc4",
        "eN3Y0CQm3Ht03Rz6wn9HBn"
    ]
}

It is crucial that you adhere strictly to this format, otherwise, your submission will not be accepted.
"""



In [148]:
from openai import OpenAI
from tqdm import tqdm

client = OpenAI()

class ParseError(Exception):
    pass

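def parse_output(output):
    """Extract the list of relevant business IDs from the model's JSON response."""
    try:
        output_object = json.loads(output)
        return output_object['relevant_businesses']
    except (json.JSONDecodeError, KeyError, TypeError) as error:
        # Catch only parsing problems, and keep the original error for debugging
        raise ParseError("Could not parse output") from error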

def select_businesses(batch):
    messages = [
        {'role': 'system', 'content': SELECTION_PROMPT},
        {'role': 'user', 'content': json.dumps(batch)}
    ]
    completion = client.chat.completions.create(
        model='gpt-3.5-turbo-1106',
        messages=messages,
        response_format={'type': 'json_object'}
    )
    biz_list = parse_output(completion.choices[0].message.content)
    return biz_list
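
Because the model treats the IDs as opaque strings, it could return an ID that was never in the batch. The following is a minimal sketch of a guard against that, assuming each batch item carries an `id` key as in `prompt_data`. The helper `keep_known_ids` is hypothetical and not part of the original notebook.

```python
def keep_known_ids(batch, returned_ids):
    """Keep only returned IDs that were in the batch, dropping duplicates."""
    known = {biz["id"] for biz in batch}
    seen = set()
    kept = []
    for biz_id in returned_ids:
        if biz_id in known and biz_id not in seen:
            kept.append(biz_id)
            seen.add(biz_id)
    return kept


# Toy batch with made-up IDs, for illustration only
demo_batch = [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]
demo_kept = keep_known_ids(demo_batch, ["b2", "zz", "a1", "b2"])
print(demo_kept)
```

One way to use it would be to wrap the result of `parse_output` inside `select_businesses` before returning.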


In [149]:
# Format the Data For the Prompt
prompt_data = [
    {
        'id': location['id'], 
        'name': location['name'], 
        'categories': location['categories'],
        'rating': location['rating'],
    } 
    for location in location_data
]

def save_progress(new_ids, new_failed):

    # Read the current progress
    with open("../data/trimming_progress.json", 'r', encoding='utf-8') as f:
        data = json.load(f)
        biz_ids = data['biz_ids'] + new_ids
        failed = data['failed_businesses'] + new_failed
        batch_count = data['batch_count'] + 1

    # Update the current progress with the new results
    with open("../data/trimming_progress.json", 'w', encoding='utf-8') as f:
        data = {
            'biz_ids': biz_ids,
            'failed_businesses': failed,
            'batch_count': batch_count
        }
        json.dump(data, f, indent=4)

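for i in tqdm(range(0, len(prompt_data), 30)):
    batch = prompt_data[i:i+30]
    # Reset both result lists on every batch so a failed batch
    # never re-saves the IDs selected for the previous one
    selected_ids = []
    failed = []
    try:
        selected_ids = select_businesses(batch)
    except ParseError:
        failed = batch

    save_progress(selected_ids, failed)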

100%|██████████| 94/94 [09:29<00:00,  6.06s/it]
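
Note that `save_progress` reads `trimming_progress.json` before writing to it, so the file has to exist before the first batch is processed. The following is a minimal sketch of how it could be seeded, using the same three keys that `save_progress` expects. The helper `init_progress` is hypothetical, and the demo writes to a temporary directory rather than `../data/`.

```python
import json
import os
import tempfile


def init_progress(path):
    """Create an empty progress file unless one already exists.

    Returns True if a new file was written, False if it was left untouched.
    """
    if os.path.exists(path):
        return False
    empty_state = {"biz_ids": [], "failed_businesses": [], "batch_count": 0}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(empty_state, f, indent=4)
    return True


# Demo in a throwaway directory so no real progress file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "trimming_progress.json")
    created_first = init_progress(demo_path)
    created_again = init_progress(demo_path)
    with open(demo_path, "r", encoding="utf-8") as f:
        demo_state = json.load(f)
```

Leaving an existing file untouched means an interrupted run keeps the results it has already saved.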


In [170]:
# Use the IDs provided by GPT to filter our list of locations
# 
# with open("../data/trimming_progress.json", 'r', encoding='utf-8') as f:
#     data = json.load(f)
#     valid_ids = data['biz_ids']

# with open("../data/scrape/reviews/locations_finished.json", "r", encoding="utf-8") as f:
#     location_data = json.load(f)
#     trimmed_location_data = []
#     for location in location_data:
#         if location['id'] in valid_ids:
#             trimmed_location_data.append(location)
#     location_data = trimmed_location_data
#     del trimmed_location_data

# with open("../data/trimmed_locations.json", 'w') as f:
#     json.dump(location_data, f, indent=4)

### Business Classification

Now that our list is trimmed down to relevant vacation activities and restaurants, we can begin classifying the businesses so that they can be stored in the graph and in a vector store. For the vector store, we will have ChatGPT write a one-sentence description of each business, then embed these descriptions and store the embeddings in a Pinecone index. Whenever our AI travel agent wants to recommend locations, it can first use the Pinecone index to get a set of businesses that match the user's query, then analyze those businesses in the graph to select the ones most closely aligned with the user's mood board.

For vector searching, we want a clear, concise, one-sentence description of each business. Then, when our agent queries the vector store, it will write a query that is itself a clear, concise, one-sentence description of the business it is looking for.

For storing venues in the graph, we want to associate each venue with each of the 8 personas. To do this, we will need to develop a labeled dataset of reviews, where each review is labeled with the personas associated with its venue. Then, we will train a BERT model on an 8-head classification task. To classify a venue, we pass the venue's reviews to the BERT classifier, which gives us a relevance score for each persona.

This creates a graph architecture where every venue is related to every persona and every social media post is related to every persona. Let $P$ be our set of 8 personas. Given a set $S$ of social media posts, we can use a user's query to first filter a set of potential venues $V$. Each venue is then related to every post through the 8 personas. We will compute a score $k_i$ for each venue $v_i$, $\forall i \in \{1, 2, \ldots, |V|\}$, as follows:
$$
k_i = \sum_{s_k \in S} \sum_{j = 1}^{8} \left[ \text{weight}(p_j, v_i) + \text{weight}(p_j, s_k) \right]
$$

Where $\text{weight}$ is the function that takes two nodes and returns the weight of the relationship between those two nodes. From this summation, we can see that our venue suggestion results will first be filtered, such that we are only ranking venues that are appropriate based on the user's query and then are ranked by computing the relevance between the posts provided by the user and the venue we are ranking. 
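
The scoring formula can be sketched in plain Python as follows. This is an illustration only, assuming the inner sum runs over the 8 personas. The function `venue_score`, the persona keys `p1` to `p8`, and the dictionary-based weights are all assumptions standing in for relationship weights that would be read from the graph.

```python
# Hypothetical persona keys p1..p8, standing in for the 8 persona nodes
PERSONAS = [f"p{j}" for j in range(1, 9)]


def venue_score(venue_weights, posts_weights):
    """Sum weight(p_j, v_i) + weight(p_j, s_k) over every post and persona.

    venue_weights: dict mapping persona key -> weight(p_j, v_i)
    posts_weights: list of dicts, one per post, mapping persona key -> weight(p_j, s_k)
    Missing relationships are treated as weight 0.0 (an assumption).
    """
    total = 0.0
    for post_weights in posts_weights:
        for persona in PERSONAS:
            total += venue_weights.get(persona, 0.0) + post_weights.get(persona, 0.0)
    return total


# Toy weights, made up for illustration
demo_venue = {"p1": 0.5, "p2": 0.25}
demo_posts = [{"p1": 1.0}, {"p2": 0.5, "p3": 0.25}]
demo_score = venue_score(demo_venue, demo_posts)
print(demo_score)  # 3.25
```

For the toy data, the first post contributes 0.75 + 1.0 and the second contributes 0.75 + 0.75, which gives 3.25.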

**Data Processing**:
To start, we need to write a one-sentence description for each business in our dataset. Let's use ChatGPT to read the page content of a business's Yelp page and return a one-sentence description of that business. In addition to the one-sentence description, we also need to use ChatGPT to extract each business's hours from the Yelp page. This will allow us to filter our vector query by time, so that we only return businesses that are open during the window in which the user wants to book an event.

**Persona Classification**:
Once our businesses have been processed by ChatGPT, we will feed that information about each business to a BERT model and receive scores for each of our 8 persona labels. This way, we can draw relationships between each of our persona nodes and each of our venue nodes.

### Data Processing

To start, plenty of data can be extracted using regex. We will begin by extracting the business hours from each HTML page and adding them to each entry in `location_data`. This will allow our agent to filter out any businesses that are not open at the time of the event requested by the user.

In [42]:
with open("../data/scrape/locations_finished.json", "r") as f:
    location_data = json.load(f)

In [43]:
import re

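# Match a weekday abbreviation followed by a newline, then capture the next line as its hours
hours_pattern = re.compile(r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun)\n(.*?)\n", re.DOTALL)

for loc in location_data:
    matches = hours_pattern.findall(loc['page_content'])
    # If a day appears more than once on the page, the last occurrence wins
    loc['hours'] = {day: hours for day, hours in matches}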
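
To make the behavior of the hours pattern concrete, here is a small sketch that runs the same regular expression on a made-up snippet. The layout of `sample_page` is an assumption about how the scraped page text looks (a day abbreviation on one line and its hours on the next), not actual scraped data.

```python
import re

# Same pattern as in the cell above
HOURS_PATTERN = re.compile(r"(Mon|Tue|Wed|Thu|Fri|Sat|Sun)\n(.*?)\n", re.DOTALL)

# Hypothetical page text, for illustration only
sample_page = "Location & Hours\nMon\n11:00 AM - 10:00 PM\nTue\nClosed\nAmenities\n"

# findall returns (day, hours) tuples, which dict() turns into a mapping
sample_hours = dict(HOURS_PATTERN.findall(sample_page))
print(sample_hours)  # {'Mon': '11:00 AM - 10:00 PM', 'Tue': 'Closed'}
```

Because the pattern is not anchored to the start of a line, any text ending in one of the abbreviations right before a newline would also match, so the extracted hours are worth spot-checking.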

Now that we have the hours collected, we can begin preparing the context for our LLM prompts. Since prompting the LLM with the entire webpage content of each business would take far too many tokens, we will instead use NLP techniques to analyze each business's page and produce a much shorter keyword analysis. The LLM will use this keyword analysis (in conjunction with other minor details) to write a `business_summary` and assign persona labels.

We will use the persona labels to train a BERT encoder. The BERT encoder will take a concatenated string of the keyword analysis as input and output the persona labels. BERT's classification scores will be used as the relationship weights between each venue and persona entity.
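
One possible way to turn the persona labels into training targets for an 8-output classifier is a multi-hot vector with one slot per persona. The sketch below assumes this encoding, which the notebook itself does not specify. Five of the keys match the camelCase names used in the example output later in this notebook. The other three (`beautyFashionAficionado`, `familyOriented`, `wellnessAdvocate`) are hypothetical placeholders, as is the `to_multi_hot` helper.

```python
# Persona keys in a fixed order. Keys marked "hypothetical" do not appear in the
# notebook's example output and are placeholders for illustration only.
PERSONA_KEYS = [
    "socialButterfly",
    "culinaryExplorer",
    "beautyFashionAficionado",  # hypothetical
    "familyOriented",           # hypothetical
    "artCultureEnthusiast",
    "wellnessAdvocate",         # hypothetical
    "adventurerExplorer",
    "ecoConsciousConsumer",
]


def to_multi_hot(personas):
    """Encode a list of persona keys as an 8-element vector of 0.0 and 1.0."""
    unknown = set(personas) - set(PERSONA_KEYS)
    if unknown:
        raise ValueError(f"Unknown persona labels: {sorted(unknown)}")
    return [1.0 if key in personas else 0.0 for key in PERSONA_KEYS]


demo_target = to_multi_hot(["ecoConsciousConsumer", "culinaryExplorer"])
print(demo_target)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

Raising on unknown labels makes it easy to spot persona names that the LLM spelled differently from the expected keys.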

In [50]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

content = [loc['page_content'].lower() for loc in location_data]

# strip numbers out of each item in content
content = [re.sub(r'\d+', '', item) for item in content]

# Adding custom common words to the stop words list
custom_stop_words = stopwords.words('english') + ['yelp', '2023', 'reviews', 'photos', 'chicago', 'york', 'los', 'angeles', 'phoenix', 'ave', 'st', 'rd']

# Initialize TF-IDF Vectorizer with updated parameters
vectorizer = TfidfVectorizer(max_features=200, stop_words=custom_stop_words, min_df=2, max_df=0.5)

# Fit and transform the Yelp texts
tfidf_matrix = vectorizer.fit_transform(content)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

top_n = 50
keywords = []
for doc in range(tfidf_matrix.shape[0]):
    feature_index = tfidf_matrix[doc,:].nonzero()[1]
    tfidf_scores = zip(feature_index, [tfidf_matrix[doc, x] for x in feature_index])
    top_keywords = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)[:top_n]
    top_keywords = [feature_names[i] for i, score in top_keywords]
    keywords.append(top_keywords)

words_and_biz = zip(keywords, location_data)
for words, biz in words_and_biz:
    biz['biz_features'] = ', '.join(words)

In [92]:
example_output_object = {
    "businesses": [
        {
            "id": "0EWwmbkJ97nDZCPOdrWMau",
            "category": "restaurant",
            "business_summary": "A trendy and modern restaurant with a focus on local ingredients and sustainable practices.",
            "personas": [
                "ecoConsciousConsumer",
                "culinaryExplorer",
            ]
        },
        {
            "id": "fWdKIXkM66qoLrllGfT7LR",
            "category": "entertainment",
            "business_summary": "A modern comedy club with a focus on local talent and a wide variety of shows.",
            "personas": [
                "socialButterfly",
                "artCultureEnthusiast",
            ]
        },
        {
            "id": "Sg4N2msDQjYXuqmaR5a9OO",
            "category": "activity",
            "business_summary": "An invigorating and exciting hike through the Smoky Mountains.",
            "personas": [
                "adventurerExplorer",
            ]
        }
    ]
}

In [93]:
SYSTEM_PROMPT = f"""
You work at a travel agency. You are helping to process web scraped data from Yelp. Each webpage has already
been preprocessed through a TF-IDF pipeline. Each business has been assigned a list of 50 keywords that
describe the business.

You will be provided with batches of 20 businesses.

For each business, you will be provided with the Yelp ID, the name of the business, the categories of the business (as 
classified by Yelp), and the list of keywords produced by the TF-IDF pipeline. Your task is to use this 
information to produce a JSON object that describes the business. The JSON object you produce should contain 
the following information: 
- Category: The category of the business. You should pick one of the following categories: restaurant, activity, entertainment
- Business Summary: A 1 sentence concise summary of the business that captures the essence and theme of the business.
- Personas: The traveler personas that this business is a good fit for. You should pick 1-3 personas from the following list:
    - 1. **The Social Butterfly**: A vibrant and outgoing individual who thrives in the energy of social gatherings, frequently found enjoying the nightlife at lively bars and clubs, and always up for a celebration with friends.
    - 2. **The Culinary Explorer**: A gourmet aficionado who revels in culinary adventures, exploring diverse cuisines at fine dining establishments, and sharing their love for unique and delicious food experiences.
    - 3. **The Beauty and Fashion Aficionado**: A trendsetter passionate about the latest in fashion and beauty, often seen at stylish shopping venues and beauty product launches, and always keeping up with the newest trends.
    - 4. **The Family-Oriented Individual**: A person who cherishes family time and creates memories with loved ones, often participating in family-friendly activities, visiting parks, and enjoying experiences that cater to all ages.
    - 5. **The Art and Culture Enthusiast**: A lover of the arts and culture, often found absorbing the rich experiences offered by museums and galleries, and always seeking to expand their horizons through artistic and cultural exploration.
    - 6. **The Wellness and Self-Care Advocate**: A seeker of tranquility and personal well-being, often indulging in self-care routines, visiting wellness retreats and spas, and embracing serene natural environments for relaxation.
    - 7. **The Adventurer and Explorer**: An intrepid soul with a thirst for adventure, often embarking on exciting journeys, exploring the great outdoors, and engaging in activities that offer a rush of adrenaline and connection with nature.
    - 8. **The Eco-Conscious Consumer**: A dedicated advocate for sustainability and eco-friendly living, preferring to shop at environmentally conscious stores, visit farmers' markets, and support initiatives that align with their green lifestyle.


Here is an example of a valid JSON object that you should output:
```
{json.dumps(example_output_object, indent=4)}
```
"""

BUSINESS_PROMPT = """
ID: {id} | Name: {name}

Yelp Categories: {categories}

Business Keywords: {keywords}
"""

def format_batch(batch):
    batch_str = ""
    for biz in batch:
        batch_str += BUSINESS_PROMPT.format(
            id=biz['id'],
            name=biz['name'],
            categories=biz['categories'],
            keywords=biz['biz_features']
        ) + '\n'
    return batch_str

In [124]:
import tiktoken    
from openai import OpenAI
from tqdm import tqdm

def process_batch(batch):
    client = OpenAI()
    human_prompt = format_batch(batch)
    messages = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': human_prompt}
    ]
    completion = client.chat.completions.create(
        model='gpt-3.5-turbo-1106',
        messages=messages,
        response_format={'type': 'json_object'}
    )
    return completion.choices[0].message.content

completions = []
for i in tqdm(range(0, len(location_data), 20)):
    batch = location_data[i:i+20]
    output = process_batch(batch)
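    try:
        json_output = json.loads(output)
        completions.extend(json_output['businesses'])
    except (json.JSONDecodeError, KeyError) as error:
        # Report malformed JSON as well as a missing 'businesses' key, then continue
        print(f"Error parsing output object: {error}")
        print(output)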

100%|██████████| 1/1 [00:01<00:00,  1.66s/it]


In [127]:
for comp in completions:
    for loc in location_data:
        if loc['id'] == comp['id']:
            loc['category'] = comp['category']
            loc['business_summary'] = comp['business_summary']
            loc['personas'] = comp['personas']
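
The nested loop above scans every location for every completion. As a sketch of an alternative, the locations can be indexed by ID once so that each completion is a single dictionary lookup. The helper `merge_completions` is hypothetical. Unlike the loop above, it also collects completions whose ID matches no location and fills any missing field with `None` instead of raising.

```python
def merge_completions(locations, completions):
    """Copy LLM fields onto matching locations; return completions with no match."""
    by_id = {loc["id"]: loc for loc in locations}
    unmatched = []
    for comp in completions:
        loc = by_id.get(comp.get("id"))
        if loc is None:
            unmatched.append(comp)
            continue
        for key in ("category", "business_summary", "personas"):
            loc[key] = comp.get(key)
    return unmatched


# Toy data with made-up IDs, for illustration only
demo_locations = [{"id": "a1", "name": "Cafe"}, {"id": "b2", "name": "Museum"}]
demo_completions = [
    {
        "id": "b2",
        "category": "entertainment",
        "business_summary": "A small art museum.",
        "personas": ["artCultureEnthusiast"],
    },
    {"id": "zz", "category": "activity", "business_summary": "Unknown.", "personas": []},
]
demo_unmatched = merge_completions(demo_locations, demo_completions)
print(demo_locations[1]["category"], len(demo_unmatched))
```

Inspecting the unmatched list after a run would show whether the model returned any IDs that are not in `location_data`.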