# Venue Entity Creation

Now that the venue processing pipeline is complete, it is time to create the Venue entities and store them in the database. We need two types of entities for each venue: a **vectorstore document** and a **graph node**. The two work together: we first search the vector database to find venues that match the user's query, then use the graph nodes to order those results by their relationship to the user's mood board.

**Vectorstore Documents**:

This is where the majority of the metadata for each venue is stored. These documents should carry all of the metadata that we are storing for each venue; the graph is used only for relationships with social media posts.

**Graph Nodes**: 

The graph nodes will be created using the BERT classifier. We will classify each venue and create relationships with each persona type, where the weight is the output score of BERT's label classification for that venue, after being passed through a softmax layer.
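To make the weighting concrete, here is a minimal sketch of what a softmax does to raw classifier outputs. The logits below are made-up values for three labels, not real model output; the point is only that the resulting scores are positive and sum to 1, which is what makes them usable as relationship weights.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw classifier logits into scores that sum to 1."""
    # Subtract the largest logit first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three persona labels (illustrative values only).
example_logits = [2.0, 1.0, 0.1]
weights = softmax(example_logits)
print(weights)
```

The largest logit always receives the largest weight, so the ordering of personas for a venue is preserved.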

### Vectorstore Documents

We start by creating the vectorstore documents for each venue, which will allow us to search and filter venues using specific criteria and an embedded query.

In [6]:
import json
from dotenv import load_dotenv

load_dotenv()

with open("../data/venues/yelp.json", "r") as f:
    location_data = json.load(f)

In [7]:
vector_data = []
for loc in location_data:
    vector_data.append({
        "id": loc["id"],
        "name": loc["name"],
        "description": loc["business_summary"],
        "city": loc['city_code'],
        "hours": loc["hours"],
        "category": loc['category']
    })

In [10]:
from openai import OpenAI
from typing import List

def embed_batch(batch: List[str]):
    client = OpenAI()
    response = client.embeddings.create(
        input=batch,
        model="text-embedding-ada-002"
    )
    return [datum.embedding for datum in response.data]

descriptions = [datum["description"] for datum in vector_data]
embeddings = embed_batch(descriptions)
for row, embedding in zip(vector_data, embeddings):
    row["embedding"] = embedding
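The cell above sends every description in a single request. That works for a small dataset, but a larger venue list may run into per-request input limits on the embedding endpoint. A minimal sketch of chunked embedding follows; `chunked`, `embed_in_chunks`, and `fake_embed` are hypothetical helpers introduced here for illustration, and the chunk size is an assumed value rather than a documented limit.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_chunks(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    chunk_size: int = 100,  # assumed value; tune to the provider's limits
) -> List[List[float]]:
    """Embed `texts` chunk by chunk, preserving the input order."""
    embeddings: List[List[float]] = []
    for chunk in chunked(texts, chunk_size):
        embeddings.extend(embed_fn(chunk))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in embedder so the sketch runs without network access."""
    return [[float(len(text))] for text in batch]


sample = ["a", "bb", "ccc", "dddd", "eeeee"]
result = embed_in_chunks(sample, fake_embed, chunk_size=2)
print(result)
```

In the notebook, the same helper could be called as `embed_in_chunks(descriptions, embed_batch)`, since `embed_batch` already takes a list of strings and returns one embedding per string.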

In [47]:
# Finally, we need to transform our vector data into a format that is consumable by
# the Pinecone API

pinecone_data = []
for row in vector_data:
    pinecone_data.append({
        "id": row["id"],
        "values": row["embedding"],
        "metadata": {
            "name": row["name"],
            "description": row["description"],
            "city": row["city"],
            "category": row["category"]
        }
    })

In [48]:
with open("../data/venues/vectorstore_docs.json", "w") as f:
    json.dump(pinecone_data, f)
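Before the records are loaded into an index, it can help to sanity-check them. The sketch below is one possible check, assuming the index rejects duplicate-free expectations such as mismatched vector dimensions and null metadata values; `find_record_problems` is a hypothetical helper, and the sample records are tiny made-up examples rather than real venue data.

```python
from typing import Any, Dict, List


def find_record_problems(records: List[Dict[str, Any]], expected_dim: int) -> List[str]:
    """Return human-readable problems found in Pinecone-style records."""
    problems: List[str] = []
    seen_ids = set()
    for record in records:
        record_id = record["id"]
        # A repeated id would overwrite an earlier record on upsert.
        if record_id in seen_ids:
            problems.append(f"{record_id}: duplicate id")
        seen_ids.add(record_id)
        # Every vector must have the dimension the index was created with.
        if len(record["values"]) != expected_dim:
            problems.append(
                f"{record_id}: expected {expected_dim} dimensions, got {len(record['values'])}"
            )
        # Flag missing metadata so it can be dropped or defaulted.
        for key, value in record["metadata"].items():
            if value is None:
                problems.append(f"{record_id}: metadata field '{key}' is null")
    return problems


# Tiny hypothetical records with 3-dimensional vectors, for illustration only.
sample_records = [
    {"id": "a", "values": [0.1, 0.2, 0.3], "metadata": {"name": "Cafe", "city": "chi"}},
    {"id": "b", "values": [0.1, 0.2], "metadata": {"name": "Park", "city": None}},
]
problems = find_record_problems(sample_records, expected_dim=3)
print(problems)
```

For the records built in this notebook, `expected_dim` would be 1536, the output size of `text-embedding-ada-002`.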

### Graph Nodes

Now that our vectorstore docs are set up, we need to use our model at `../models/bert-yelp` to classify persona labels for each venue. We will then use those persona labels to create relationships between each Venue entity and Persona entity, with a weight determined by the softmax-adjusted score of the label.

In [21]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_tokenizer = "distilbert-base-uncased"
model_path = "../models/bert-yelp"
tokenizer = AutoTokenizer.from_pretrained(model_tokenizer)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

def format_input(row):
    return (
        f"Name: {row.biz_name}\n"
        f"Categories: {row.categories}\n"
        f"Biz Features: {row.biz_features}\n"
        f"Summary: {row.summary}\n"
    )
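To see what the classifier actually receives, the sketch below applies the same template to a single made-up venue. The venue values are hypothetical, and `SimpleNamespace` stands in for a DataFrame row, since both expose fields as attributes.

```python
from types import SimpleNamespace


def format_input(row):
    """Same template as the notebook cell: one labeled field per line."""
    return (
        f"Name: {row.biz_name}\n"
        f"Categories: {row.categories}\n"
        f"Biz Features: {row.biz_features}\n"
        f"Summary: {row.summary}\n"
    )


# Hypothetical venue used only to illustrate the template.
example_row = SimpleNamespace(
    biz_name="Example Cafe",
    categories="Coffee, Bakeries",
    biz_features="coffee, pastry, wifi",
    summary="A quiet neighborhood cafe.",
)
text = format_input(example_row)
print(text)
```

Each field sits on its own labeled line, so the model sees a consistent layout for every venue.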


In [26]:
import pandas as pd

graph_data = []
for loc in location_data:
    graph_data.append({
        "id": loc["id"],
        "biz_name": loc["name"],
        "categories": ', '.join([cat['title'] for cat in loc["categories"]]),
        "biz_features": loc["biz_features"],
        'summary': loc["business_summary"],
    })

df = pd.DataFrame(graph_data)

df['input'] = df.apply(format_input, axis=1)
df.head()

| | id | biz_name | categories | biz_features | summary | input |
|---|---|---|---|---|---|---|
| 0 | ngB0iQM1Yz7Nx_tQZqU7NA | Daisies | Pasta Shops, New American, Cocktail Bars | bars, il, cocktail, american, brunch, italian,... | A delightful blend of New American cuisine, pa... | Name: Daisies\nCategories: Pasta Shops, New Am... |
| 1 | uri07mm-ffMohsZmX4-eZw | South Mountain Reservation | Hiking, Parks | south, parks, trails, hiking, trail, dog, hike... | A picturesque park with scenic trails, perfect... | Name: South Mountain Reservation\nCategories: ... |
| 2 | l3zSU4mh6YcNw6dfdkxStQ | Rickshaw Rick's | Tours, Pedicabs | tours, wedding, il, tour, boat, ride, knowledg... | Experience the city with guided tours and pedi... | Name: Rickshaw Rick's\nCategories: Tours, Pedi... |
| 3 | COfmsJPeRu_4qFzDpvAAgw | Rewined Beer And Wine Bar | Wine Bars, Beer Bar | beer, phoenix, bar, wine, bars, az, pizza, gam... | A cozy beer and wine bar offering a laid-back ... | Name: Rewined Beer And Wine Bar\nCategories: W... |
| 4 | PdFV7_kB6w9hkEcz5pijDA | Resident | Cocktail Bars, Music Venues, Beer Gardens | bars, beer, bar, cocktail, music, drinks, la, ... | A vibrant and lively venue with cocktails, mus... | Name: Resident\nCategories: Cocktail Bars, Mus... |


In [31]:
inputs = tokenizer(df.input.tolist(), return_tensors='pt', padding=True, truncation=True, max_length=512)

df['input_ids'] = inputs['input_ids'].tolist()
df['attention_mask'] = inputs['attention_mask'].tolist()

In [None]:
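import torch

model.eval()  # inference mode: disables dropout
batch_size = 32  # assumed value; lower it if memory runs out

input_ids = torch.tensor(df['input_ids'].tolist())
attention_mask = torch.tensor(df['attention_mask'].tolist())

predictions = []
with torch.no_grad():  # gradients are not needed for inference
    # Score the venues in small batches instead of one large forward pass
    for start in range(0, len(df), batch_size):
        end = start + batch_size
        outputs = model(input_ids=input_ids[start:end], attention_mask=attention_mask[start:end])
        # Softmax turns the logits into per-persona scores that sum to 1
        predictions.extend(torch.nn.functional.softmax(outputs.logits, dim=-1).tolist())

df['predictions'] = predictions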

In [39]:
personas = ["socialButterfly", "culinaryExplorer", "beautyFashionAficionado", "familyOrientedIndividual", "artCultureEnthusiast", "wellnessSelfCareAdvocate", "adventurerExplorer", "ecoConsciousConsumer"]
for i, persona in enumerate(personas):
    df[persona] = df['predictions'].apply(lambda x: x[i])
df.head()

| | id | biz_name | categories | biz_features | summary | input | input_ids | attention_mask | predictions | socialButterfly | culinaryExplorer | beautyFashionAficionado | familyOrientedIndividual | artCultureEnthusiast | wellnessSelfCareAdvocate | adventurerExplorer | ecoConsciousConsumer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ngB0iQM1Yz7Nx_tQZqU7NA | Daisies | Pasta Shops, New American, Cocktail Bars | bars, il, cocktail, american, brunch, italian,... | A delightful blend of New American cuisine, pa... | Name: Daisies\nCategories: Pasta Shops, New Am... | [101, 2171, 1024, 18765, 14625, 7236, 1024, 24... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0.27572759985923767, 0.4735364317893982, 0.04... | 0.275728 | 0.473536 | 0.044341 | 0.058704 | 0.0468 | 0.035647 | 0.036347 | 0.028897 |
| 1 | uri07mm-ffMohsZmX4-eZw | South Mountain Reservation | Hiking, Parks | south, parks, trails, hiking, trail, dog, hike... | A picturesque park with scenic trails, perfect... | Name: South Mountain Reservation\nCategories: ... | [101, 2171, 1024, 2148, 3137, 11079, 7236, 102... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0.03556738793849945, 0.023338479921221733, 0.... | 0.035567 | 0.023338 | 0.024519 | 0.119604 | 0.061853 | 0.056787 | 0.642034 | 0.036297 |
| 2 | l3zSU4mh6YcNw6dfdkxStQ | Rickshaw Rick's | Tours, Pedicabs | tours, wedding, il, tour, boat, ride, knowledg... | Experience the city with guided tours and pedi... | Name: Rickshaw Rick's\nCategories: Tours, Pedi... | [101, 2171, 1024, 6174, 17980, 6174, 1005, 105... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0.04238021746277809, 0.026886263862252235, 0.... | 0.04238 | 0.026886 | 0.025441 | 0.117018 | 0.091981 | 0.053619 | 0.608015 | 0.034659 |
| 3 | COfmsJPeRu_4qFzDpvAAgw | Rewined Beer And Wine Bar | Wine Bars, Beer Bar | beer, phoenix, bar, wine, bars, az, pizza, gam... | A cozy beer and wine bar offering a laid-back ... | Name: Rewined Beer And Wine Bar\nCategories: W... | [101, 2171, 1024, 2128, 21924, 2094, 5404, 199... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0.3058517277240753, 0.439247727394104, 0.0438... | 0.305852 | 0.439248 | 0.043808 | 0.060174 | 0.048609 | 0.034647 | 0.039497 | 0.028166 |
| 4 | PdFV7_kB6w9hkEcz5pijDA | Resident | Cocktail Bars, Music Venues, Beer Gardens | bars, beer, bar, cocktail, music, drinks, la, ... | A vibrant and lively venue with cocktails, mus... | Name: Resident\nCategories: Cocktail Bars, Mus... | [101, 2171, 1024, 6319, 7236, 1024, 18901, 696... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [0.3925231695175171, 0.3095736503601074, 0.044... | 0.392523 | 0.309574 | 0.044745 | 0.067839 | 0.077792 | 0.034211 | 0.045013 | 0.028303 |
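The per-persona columns rely on the order of the `personas` list matching the label indices the model was trained with. That is an assumption worth checking; if the label names were saved at training time, `model.config.id2label` can confirm it. The sketch below pairs the rounded scores shown for the first venue (Daisies) with the persona names and picks the strongest one.

```python
personas = [
    "socialButterfly", "culinaryExplorer", "beautyFashionAficionado",
    "familyOrientedIndividual", "artCultureEnthusiast",
    "wellnessSelfCareAdvocate", "adventurerExplorer", "ecoConsciousConsumer",
]

# Rounded scores for the first row (Daisies), copied from the displayed output.
scores = [0.275728, 0.473536, 0.044341, 0.058704, 0.0468, 0.035647, 0.036347, 0.028897]

# Pair each score with its persona; this is only valid if the list order
# matches the model's label indices (an assumption, see the note above).
persona_scores = dict(zip(personas, scores))
top_persona = max(persona_scores, key=persona_scores.get)
print(top_persona, persona_scores[top_persona])
```

Because the scores come out of a softmax, they sum to 1 (up to rounding), so each value can be read as the share of the venue's affinity assigned to that persona.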


In [43]:
# Now, we want to create an array of objects that can be used to create Cypher statements.

cypher_data = []
for _, row in df.iterrows():
    data = {
        "venue": {
            "id": row['id'],
            "name": row['biz_name'],
        },
        "personas": {
            "socialButterfly": row['socialButterfly'],
            "culinaryExplorer": row['culinaryExplorer'],
            "beautyFashionAficionado": row['beautyFashionAficionado'],
            "familyOrientedIndividual": row['familyOrientedIndividual'],
            "artCultureEnthusiast": row['artCultureEnthusiast'],
            "wellnessSelfCareAdvocate": row['wellnessSelfCareAdvocate'],
            "adventurerExplorer": row['adventurerExplorer'],
            "ecoConsciousConsumer": row['ecoConsciousConsumer'],
        }
    }
    cypher_data.append(data)

In [44]:
with open("../data/venues/cypher_data.json", "w") as f:
    json.dump(cypher_data, f)
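As a rough sketch of how this file might later be turned into Cypher, the code below flattens each entry into one row per venue and persona pair and pairs it with a parameterized `UNWIND`/`MERGE` query. The node labels `Venue` and `Persona`, the relationship type `APPEALS_TO`, and the helper `to_rows` are hypothetical names chosen for illustration; the notebook does not specify the graph schema.

```python
from typing import Any, Dict, List

# Hypothetical schema: labels and relationship type are assumptions.
UPSERT_QUERY = """
UNWIND $rows AS row
MERGE (v:Venue {id: row.venue_id})
SET v.name = row.venue_name
MERGE (p:Persona {name: row.persona})
MERGE (v)-[r:APPEALS_TO]->(p)
SET r.weight = row.weight
"""


def to_rows(cypher_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten entries into one parameter row per (venue, persona) pair."""
    rows = []
    for entry in cypher_data:
        for persona, weight in entry["personas"].items():
            rows.append({
                "venue_id": entry["venue"]["id"],
                "venue_name": entry["venue"]["name"],
                "persona": persona,
                "weight": float(weight),
            })
    return rows


# Made-up entry in the same shape as the objects written to cypher_data.json.
sample = [{
    "venue": {"id": "venue-1", "name": "Example Cafe"},
    "personas": {"socialButterfly": 0.7, "culinaryExplorer": 0.3},
}]
rows = to_rows(sample)
print(rows)
```

With the official Neo4j Python driver, the rows could then be sent in one call such as `session.run(UPSERT_QUERY, rows=rows)`. Passing the values as parameters rather than formatting them into the query string avoids quoting problems in venue names like "Rickshaw Rick's".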