# Storage of Venue Data
Now that we have extracted the keywords from our venue list, it is time to store the venues and keywords in a graph database. We also need to store each venue in our document vectorstore. To do this, we will parse the extracted keyword-venue JSON objects into Cypher statements that write the entities and relationships to our Neo4j database, then use the vectorstore docs to upsert our venue document vectors to Pinecone.

This notebook will walk us through three principal steps:
1. **Creating Cypher statements from the extracted JSON data**
2. **Using the Cypher statements to write to Neo4j**
3. **Writing venue documents to Pinecone**

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [3]:
import json
import re

with open("../data/venues/cypher_data.json", 'r') as location_data:
    entity_data = json.load(location_data)
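
The statement generator in the next section reads a `venue` dictionary and a `personas` dictionary of weights from each entry, and it writes the weights into the query unquoted. A quick shape check before generating anything can catch malformed entries early. The following is a minimal sketch under that assumption; `find_malformed_entries` and the sample records are hypothetical and not part of the original notebook.

```python
from numbers import Real
from typing import Any, Dict, List


def find_malformed_entries(entries: List[Dict[str, Any]]) -> List[int]:
    """Return the indexes of entries that do not match the assumed shape."""
    bad = []
    for i, entry in enumerate(entries):
        venue = entry.get("venue")
        personas = entry.get("personas")
        ok = (
            isinstance(venue, dict)
            and isinstance(personas, dict)
            # Weights are interpolated unquoted, so they must be plain numbers
            and all(
                isinstance(w, Real) and not isinstance(w, bool)
                for w in personas.values()
            )
        )
        if not ok:
            bad.append(i)
    return bad


# Hypothetical records, for illustration only
sample = [
    {"venue": {"id": "v1", "name": "Joe's Diner"}, "personas": {"foodie": 0.9}},
    {"venue": {"id": "v2", "name": "Night Owl"}, "personas": {"partygoer": "high"}},
    {"personas": {"foodie": 0.4}},
]
print(find_malformed_entries(sample))  # → [1, 2]
```

Running the same check on `entity_data` and expecting an empty list would be one way to confirm the file is ready for conversion.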

### Creating Cypher Statements from the extracted Venue data

Now that we have loaded our `entity_data`, we can use it to format our Cypher statements for writing to the graph database.

In [5]:
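from typing import Dict, List, Tuple, Union

def make_safe(val: Union[str, float]) -> Union[str, float]:
    """Escape backslashes and ' characters so a string is safe inside a single-quoted Cypher literal"""
    if isinstance(val, str):
        # Escape backslashes first so the quote escapes added next are not doubled
        return val.replace("\\", "\\\\").replace("'", "\\'")
    return val
    
def generate_cypher(venue: Dict[str, Dict[str, Union[str, float]]]) -> Tuple[List[str], List[str]]:
    """Return the entity (node) statements and relationship statements for one venue entry"""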
    e_statements = []
    r_statements = []

    venue_properties = ", ".join([f"{key}: '{make_safe(value)}'" for key, value in venue['venue'].items()]) if 'venue' in venue else ""
    venue_cypher = f"MERGE (v:Venue {{ {venue_properties} }})"
    e_statements.append(venue_cypher)

    for i, items in enumerate(venue['personas'].items()):
        persona, weight = items
        persona_cypher = f"MERGE (p{i+1}:Persona {{ value: '{make_safe(persona)}' }})"
        e_statements.append(persona_cypher)
        r_statements.append(f"MERGE (v)-[r{i+1}:PERSONA_RELEVANCE]->(p{i+1}) SET r{i+1}.weight = {weight}") 
    
    return e_statements, r_statements

In [6]:

cypher_statements = []

# Convert the Cypher entity data into Cypher statements
for venue in entity_data:
    e_statements, r_statements = generate_cypher(venue)
    cypher_statement = "\n".join(e_statements + r_statements)
    cypher_statements.append(cypher_statement)

assert len(cypher_statements) == len(entity_data)

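### Using Cypher Statements to Write to Neo4j

With our Cypher statements in hand, we will update the Neo4j database to store our venue and keyword entities (written as `Venue` and `Persona` nodes). Because `MERGE` only creates a node when no existing node matches the full pattern, including every property listed, re-running the statements will not duplicate nodes as long as their properties are unchanged.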

In [8]:
import os
from neo4j import GraphDatabase

DB_USER = os.getenv("NEO4J_DATABASE_USERNAME")
DB_URL = os.getenv("NEO4J_DATABASE_URL")
DB_PASSWORD = os.getenv("NEO4J_DATABASE_PASSWORD")

def get_driver():
    return GraphDatabase.driver(DB_URL, auth=(DB_USER, DB_PASSWORD))

driver = get_driver()

with driver.session() as session:
    for statement in cypher_statements:
        session.run(statement)

driver.close()
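
Since `MERGE` matches on every property in the pattern, a venue whose name or URL changes between runs would be written as a second node. One possible alternative, sketched here as an assumption and not as the notebook's method, is to merge on a stable key only and pass all values as query parameters, which also removes the need for manual quote escaping. The sketch assumes each venue carries an `id` property; `build_venue_query` and the sample entry are hypothetical.

```python
from typing import Any, Dict, Tuple


def build_venue_query(entry: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Build one parameterized Cypher query and its parameters for a venue entry."""
    props = dict(entry["venue"])
    personas = [
        {"value": name, "weight": weight}
        for name, weight in entry.get("personas", {}).items()
    ]
    query = (
        "MERGE (v:Venue {id: $id}) "    # match on the assumed stable key only
        "SET v += $props "              # then add or update the other properties
        "WITH v "
        "UNWIND $personas AS persona "
        "MERGE (p:Persona {value: persona.value}) "
        "MERGE (v)-[r:PERSONA_RELEVANCE]->(p) "
        "SET r.weight = persona.weight"
    )
    return query, {"id": props["id"], "props": props, "personas": personas}


# Hypothetical entry, for illustration only
query, params = build_venue_query(
    {"venue": {"id": "v1", "name": "Joe's Diner"}, "personas": {"foodie": 0.9}}
)
print(params["personas"])  # → [{'value': 'foodie', 'weight': 0.9}]
```

The pair could then be passed to the driver as `session.run(query, params)`, so the apostrophe in the venue name never appears in the query text.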

### Writing the Venue Documents to Pinecone

Next, we will load our vectorstore documents and write them all to Pinecone. As with `MERGE`, the `upsert` method overwrites any existing record with the same ID instead of creating a duplicate.

In [15]:
import pinecone

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT")
PINECONE_INDEX = os.getenv("PINECONE_INDEX")

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pinecone.Index(PINECONE_INDEX)


with open("../data/venues/vectorstore_docs.json", "r") as f:
    vectorstore_docs = json.load(f)
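
Because records that share an ID overwrite one another on upsert, repeated IDs inside the file would leave the index with fewer records than the file has entries. A small check can surface that before writing. This is a minimal sketch that assumes each document is a dictionary with an `id` key (the dictionary form accepted by `upsert`); `duplicate_ids` and the sample records are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def duplicate_ids(docs: List[Dict[str, Any]]) -> List[str]:
    """Return the IDs that appear more than once, in first-seen order."""
    counts = Counter(doc["id"] for doc in docs)
    return [doc_id for doc_id, n in counts.items() if n > 1]


# Hypothetical records, for illustration only
sample_docs = [
    {"id": "a", "values": [0.1, 0.2], "metadata": {"city": "Austin"}},
    {"id": "b", "values": [0.3, 0.4], "metadata": {"city": "Austin"}},
    {"id": "a", "values": [0.5, 0.6], "metadata": {"city": "Dallas"}},
]
print(duplicate_ids(sample_docs))  # → ['a']
```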


In [16]:
from tqdm import tqdm

def store_docs(docs):
    try:
        BATCH_SIZE = 250
        count = 0
        for start in tqdm(range(0, len(docs), BATCH_SIZE)):
            # Select the batch
            batch = docs[start:start+BATCH_SIZE]
            upsert_response = index.upsert(vectors=batch, namespace='venues')
            
            # Log the results
            count += upsert_response['upserted_count']

        print(f"Upserted {count} vectors")
        return count
    except Exception as e:
        print(e)
        return 0

upserted_count = store_docs(vectorstore_docs)

100%|██████████| 6/6 [00:10<00:00,  1.69s/it]

Upserted 1425 vectors





### Testing the Results

Finally, we want to see how well our results correspond to a user's query. This will help us determine what score threshold to set when filtering the results of our Pinecone query.

In [25]:
from typing import List, Tuple
from openai import OpenAI

import urllib.parse

openai_client = OpenAI()

def get_relevant_businesses(city: str, category: str, query: str) -> List[Tuple[str, str]]:
    """Get the relevant businesses from the databases."""
    # First, we need to embed the query
    response = openai_client.embeddings.create(input=query, model="text-embedding-ada-002")
    embeddings = response.data[0].embedding


    # Next, we query Pinecone to get the IDs of the relevant businesses
    query_response = index.query(vector=embeddings, top_k=100, namespace='venues', filter={'city': {'$eq': city}, 'category': {'$eq': category}})
    ids = [result['id'] for result in query_response['matches']]
    
    # Now we need to query Neo4j to get the details of the businesses
    driver = get_driver()
    with driver.session() as session:
        # Pass the IDs as a query parameter instead of formatting them into the Cypher text
        results = session.run("MATCH (v:Venue) WHERE v.id IN $ids RETURN v", ids=ids)
        
        data = []
        for record in results:
            venue = record['v']
            data.append((
                venue['name'], 
                urllib.parse.unquote(venue['url'])
            ))

    driver.close()

    return data
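
Once a threshold has been chosen from results like these, it can be applied to the query matches before the Neo4j lookup. The following is a minimal sketch, assuming each match exposes `id` and `score` keys and that the index uses a metric where a higher score means a closer match (such as cosine similarity). `filter_by_score`, the sample matches, and the threshold value are hypothetical.

```python
from typing import Any, Dict, List


def filter_by_score(matches: List[Dict[str, Any]], threshold: float) -> List[str]:
    """Keep the IDs of matches whose similarity score is at least `threshold`."""
    return [m["id"] for m in matches if m["score"] >= threshold]


# Hypothetical matches, for illustration only
sample_matches = [
    {"id": "v1", "score": 0.91},
    {"id": "v2", "score": 0.78},
    {"id": "v3", "score": 0.62},
]
print(filter_by_score(sample_matches, threshold=0.75))  # → ['v1', 'v2']
```

Inside `get_relevant_businesses`, the same filter could replace the list comprehension that builds `ids` from `query_response['matches']`.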