# Venue Data Cleaning

Now that we have scraped the reviews for our businesses, we need to clean the data so that it can be used to create graph entities as well as documents for a Pinecone index. The goal of this notebook is to populate two files:
- `../data/venues/cypher_entities.json`: The data that will be used for writing the venue entities to Neo4j
- `../data/venues/vectorstore_docs.json`: The data that will be used for writing the venue documents to Pinecone

## General Process

**Data Loading**:
We will start by reading the final output file of the scraping process: `../data/scrape/locations_finished.json`. We will extract only the fields that we need in order to create the embedding documents and the Venue entities.

**Venue Embeddings**:
To embed our venues, we will format an embedding term for each venue and use OpenAI's embedding model to create a venue embedding. This will allow us to compute similarity scores between venues and natural language queries.
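
To make "similarity score" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The helper name `cosine_similarity` and the toy three-dimensional vectors are illustrative assumptions, not code from this project.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (made-up values).
query_vec = [1.0, 0.0, 0.0]
venue_close = [2.0, 0.0, 0.0]  # points in the same direction as the query
venue_far = [0.0, 3.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query_vec, venue_close))
print(cosine_similarity(query_vec, venue_far))
```

A score near 1 means the two texts point in nearly the same direction in embedding space, while a score near 0 means they are unrelated.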

**Cypher Entity Creation**:
Next, we will use these embeddings to retrieve the three most similar keywords for each venue. These keywords will be linked to the Venue in the graph, and they will help us determine how well a location suits a given user.
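
The retrieval step can be pictured roughly as follows. This is only a sketch of the idea: the vectorstore classes used later in the notebook are not shown here, so the function `top_k_keywords`, the keyword names, and the two-dimensional vectors are all hypothetical. It also assumes unit-length vectors, in which case the dot product equals the cosine similarity.

```python
import heapq
from typing import Dict, List, Sequence


def top_k_keywords(
    venue_vec: Sequence[float],
    keyword_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the k keyword names whose vectors score highest against venue_vec."""

    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    # Iterating over the dict yields keyword names; each is ranked by its score.
    return heapq.nlargest(
        k, keyword_vecs, key=lambda name: dot(venue_vec, keyword_vecs[name])
    )


# Hypothetical keyword embeddings (unit length, made-up values).
keyword_vecs = {
    "nightlife": [1.0, 0.0],
    "family": [0.0, 1.0],
    "foodie": [0.6, 0.8],
    "outdoors": [-1.0, 0.0],
}
venue_vec = [0.8, 0.6]

print(top_k_keywords(venue_vec, keyword_vecs))
```

`heapq.nlargest` avoids sorting the full keyword list, which matters little for a handful of keywords but keeps the sketch cheap if the list grows.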

**Vectorstore Document Creation**:
Finally, we will use the embeddings to create the objects that will be written to a Pinecone index. We will keep the `category` and `city` fields as metadata so that we can filter our results.
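
As an illustration of why the metadata is worth keeping, the sketch below builds a filter in Pinecone's MongoDB-style metadata filter syntax. The helper `build_venue_filter` and the example values `"restaurant"` and `"NYC"` are assumptions made for illustration; the actual query code is outside this notebook.

```python
from typing import Any, Dict, Optional


def build_venue_filter(
    category: Optional[str] = None,
    city: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a metadata filter dict; fields left as None are not constrained."""
    conditions: Dict[str, Any] = {}
    if category is not None:
        conditions["category"] = {"$eq": category}
    if city is not None:
        conditions["city"] = {"$eq": city}
    return conditions


# Hypothetical values; real ones would come from the scraped data.
venue_filter = build_venue_filter(category="restaurant", city="NYC")
print(venue_filter)

# The dict would then be passed to a query call along the lines of
# index.query(vector=query_embedding, top_k=5, filter=venue_filter),
# where `index` and `query_embedding` are assumed to exist elsewhere.
```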


In [1]:
import os
import sys

from dotenv import load_dotenv
load_dotenv("../.env")

sys.path.append("../")


### Data Loading

Let's start by loading our scraped location data and extracting the fields that we will need for the rest of the notebook.

In [2]:
import json

with open("../data/scrape/locations_finished.json", "r", encoding="utf-8") as f:
    raw_locations = json.load(f)

# Keep only the fields needed for the embeddings and the venue entities
location_data = [
    {
        "id": location["id"],
        "name": location["name"],
        "city": location["city_code"],
        "rating": location["rating"],
        "reviews": location["reviews"],
    }
    for location in raw_locations
]
del raw_locations


### Venue Embeddings
With the data loaded, we will now move on to embed each of our venues. To do this, we will create an embedding string that contains the name of the location, as well as the content of all of its scraped reviews. This should allow us to semantically search for venues, by comparing the semantic meaning of a user's query against the semantics of the reviews.
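
To show what one embedding string looks like, here is a small standalone sketch that mirrors the formatting used in the next cell. The venue and reviews are invented, and the assumption that each review is a single-entry dict (one label mapped to the review text) is inferred from how the cell reads the data.

```python
from typing import Dict, List


def build_embed_term(name: str, reviews: List[Dict[str, str]]) -> str:
    """Join the venue name and its reviews into one string for embedding."""
    parts = []
    for review in reviews:
        # Assumption: each review is a single-entry dict of {label: text}.
        label, text = next(iter(review.items()))
        parts.append(f"{label}:\n{text}")
    return f"{name}\n\n" + "\n".join(parts)


# Made-up sample venue, not taken from the scraped data.
sample = {
    "name": "Blue Door Cafe",
    "reviews": [
        {"Alex": "Great espresso."},
        {"Sam": "Cozy, but crowded."},
    ],
}

term = build_embed_term(sample["name"], sample["reviews"])
print(term)
```

Putting the name first and the reviews after a blank line keeps the venue's identity and its review content in a single string, so one embedding covers both.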

In [3]:

# Add an embed term to each location
for loc in location_data:
    reviews = '\n'.join([f"{list(review.keys())[0]}:\n{list(review.values())[0]}" for review in loc['reviews']])
    term = f"{loc['name']}\n\n{reviews}"
    loc['embed_term'] = term

embed_terms = [location['embed_term'] for location in location_data]

In [4]:
from typing import List

from openai import OpenAI

def _embed_terms(terms: List[str]) -> List[List[float]]:
    client = OpenAI()
    response = client.embeddings.create(input=terms, model="text-embedding-ada-002")
    return [datum.embedding for datum in response.data]

def embed_locations(embedding_terms: List[str]) -> List[List[float]]:
    """Embed all terms in batches of 2000 per request, preserving input order."""
    results = []
    for i in range(0, len(embedding_terms), 2000):
        batch = embedding_terms[i:i+2000]
        results.extend(_embed_terms(batch))
    return results

In [5]:
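# Reuse cached embeddings if they exist and match the current data; otherwise re-embed
try:
    with open("../data/location_embeddings.json", "r") as f:
        location_embeddings = json.load(f)
    assert len(location_embeddings) == len(location_data) == len(embed_terms)
except (FileNotFoundError, json.JSONDecodeError, AssertionError):
    location_embeddings = embed_locations(embed_terms)
    assert len(location_embeddings) == len(location_data) == len(embed_terms)
    # Save the fresh embeddings so that later runs can skip the API calls
    with open("../data/location_embeddings.json", "w") as f:
        json.dump(location_embeddings, f)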

In [6]:
from vectorstores.categories import CategoryVectorstore, CATEGORIES

category_vectorstore = CategoryVectorstore()
categories = category_vectorstore.search_categories(location_embeddings)

In [7]:
from vectorstores.personas import PersonaVectorstore

persona_vectorstore = PersonaVectorstore()
relevant_personas = persona_vectorstore.search_keywords(location_embeddings)

### Cypher Entity Creation

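Now that we have our category and keyword associations for each location, we will loop over the locations and build one entity per venue, combining the venue's fields with its category and related personas.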

In [13]:
cypher_entities = []
for personas, category, location in zip(relevant_personas, categories, location_data):
    cypher_entities.append({
        'venue': {
            'id': location['id'],
            'name': location['name'],
            'city': location['city'],
            'category': category,
            'rating': location['rating'],
        },
        'personas': personas,
    })
assert len(cypher_entities) == len(location_data)

In [14]:
with open("../data/venues/cypher_entities.json", "w", encoding="utf-8") as f:
    json.dump(cypher_entities, f, ensure_ascii=False, indent=4)

### Vectorstore Document Creation

Finally, we want to create the objects that will be used for writing our vector store documents to the Pinecone index.

In [15]:
vector_documents = []
for location, embedding, category in zip(location_data, location_embeddings, categories):
    vector_documents.append({
        "id": location['id'],
        "values": embedding,
        "metadata": {
            "category": category,
            "city": location['city']
        }
    })

assert len(vector_documents) == len(location_data)

In [16]:
with open("../data/venues/vectorstore_docs.json", "w", encoding="utf-8") as f:
    json.dump(vector_documents, f, ensure_ascii=False, indent=4)
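
Before uploading, it can be worth sanity-checking the documents. The sketch below is one possible check and is not part of the original pipeline: the function `validate_vector_documents` is hypothetical, and the toy documents use three-dimensional vectors. For real data produced by `text-embedding-ada-002`, `expected_dim` would be 1536.

```python
from typing import Any, Dict, List


def validate_vector_documents(docs: List[Dict[str, Any]], expected_dim: int) -> None:
    """Raise ValueError if any document is malformed; return None otherwise."""
    seen_ids = set()
    for position, doc in enumerate(docs):
        missing = {"id", "values", "metadata"} - set(doc)
        if missing:
            raise ValueError(f"document {position} is missing keys: {sorted(missing)}")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate id: {doc['id']!r}")
        seen_ids.add(doc["id"])
        if len(doc["values"]) != expected_dim:
            raise ValueError(
                f"document {doc['id']!r} has {len(doc['values'])} values, "
                f"expected {expected_dim}"
            )
        for field in ("category", "city"):
            if field not in doc["metadata"]:
                raise ValueError(f"document {doc['id']!r} lacks metadata field {field!r}")


# Toy documents with made-up ids, vectors, and metadata.
docs = [
    {"id": "v1", "values": [0.1, 0.2, 0.3], "metadata": {"category": "cafe", "city": "NYC"}},
    {"id": "v2", "values": [0.4, 0.5, 0.6], "metadata": {"category": "bar", "city": "LA"}},
]

validate_vector_documents(docs, expected_dim=3)
print("all documents look valid")
```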