### Exploring Yelp Business Point Data in OmniSci

Dr. Michael Flaxman, Geodesign Technologies<p>
    November 2018

<table>
    <tr><td>
<img src="yelp/img/yelp_dataset_challenge_crop.png" width=300>
        </td>
        <td>
<img src="yelp/img/yelp_teaser_shot_vegas.png" width=280>
        </td>
    </tr>
    </table>

### Intro

Business point data from traditional sources falls a bit flat: it is usually only a geocoded address point, and often lacks even a reasonable business classification.  For example, government data typically groups businesses only by general "<a href="https://www.bls.gov/bls/naics.htm">NAICS</a>" codes, which lack any interesting level of detail.

Crowd-sourced business data can be much more interesting!  For starters, it is opinionated, including star ratings and reviews.  It also has much finer-grained classifications, allowing drill-down into interesting data subsets.  It even includes images.

Yelp data for 10 metro regions is available in aggregate form as part of the student challenge described here:

<a href="https://www.yelp.com/dataset">Yelp Student Challenge Dataset</a>

<li><b><i>By the way, if any of you reading this are students - Yelp is offering a very nice cash prize for the best creative use of this data.  So feel free to use our cleaned-up data and visualization techniques below to win this prize!</i></b>

The Yelp data are provided as one massive JSON file, which is a huge download (>3 GB!).  We have used Python Pandas and OmniSci to put the data into an interactive form ready for exploration.  We've also provided public links to analysis-ready data here:

<a href="https://s3-us-west-1.amazonaws.com/mapd-cloud/DataSets/yelp/yelp_academic_dataset_business_original.json.zip">Original Yelp Data Subset to Business Points Only (JSON 22.5 MB)</a><P>
<a href="https://s3-us-west-1.amazonaws.com/mapd-cloud/DataSets/yelp/yelp_academic_dataset_business_cleaned.csv.zip">Cleaned flattened CSV file of Yelp Business Points (CSV 13 MB)</a>

#### Understanding the Data

Yelp is so widely used that most people already have an intuitive idea of what data are collected.  However, from a data science and visualization perspective, data structure and organization are important.  Yelp data include a set of separate tables for businesses, reviews, check-ins, and photos. In this post, we'll start with the business points data.<P>
    The data from the businesses JSON look like this:

   {
    // string, 22 character unique string business id
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the business's name
    "name": "Garaje",

    // string, the neighborhood's name
    "neighborhood": "SoMa",

    // string, the full address of the business
    "address": "475 3rd St",

    // string, the city
    "city": "San Francisco",

    // string, 2 character state code, if applicable
    "state": "CA",

    // string, the postal code
    "postal_code": "94107",

    // float, latitude
    "latitude": 37.7817529521,

    // float, longitude
    "longitude": -122.39612197,

    // float, star rating, rounded to half-stars
    "stars": 4.5,

    // integer, number of reviews
    "review_count": 1198,

    // integer, 0 or 1 for closed or open, respectively
    "is_open": 1,

    // object, business attributes to values. note: some attribute values might be objects
    "attributes": {
        "RestaurantsTakeOut": true,
        "BusinessParking": {
            "garage": false,
            "street": true,
            "validated": false,
            "lot": false,
            "valet": false
        }
    },

    // an array of strings of business categories
    "categories": [
        "Mexican",
        "Burgers",
        "Gastropubs"
    ],

    // an object of key day to value hours, hours are using a 24hr clock
    "hours": {
        "Monday": "10:00-21:00",
        "Tuesday": "10:00-21:00",
        "Friday": "10:00-21:00",
        "Wednesday": "10:00-21:00",
        "Thursday": "10:00-21:00",
        "Sunday": "11:00-18:00",
        "Saturday": "10:00-21:00"
    }
}

Because we are importing this into a single table and don't need the sub-objects for attributes and hours, we can ignore the complexity of creating related subtables for these.

### Create a Pandas dataframe

If we were dealing with the full Yelp database, we would probably need to read directly into OmniSci and drop unneeded data there.  The OmniSci JSON reader is very efficient and can handle files larger than main memory.  <p>Since the sample data are 'only' 188k locations, they easily fit in typical machine memory, and we can use the Pandas library to read and manipulate them.  This technique also allows us to create a true "Point" field type from separate latitude and longitude columns, which is not currently possible within OmniSci's JSON reader (as of version 4.2).

Note: the lines=True argument is necessary because Yelp's business file is line-delimited, with one JSON object per line.

In [6]:
import pandas as pd

biz_df = pd.read_json('yelp/yelp_academic_dataset_business.json', lines=True)
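
To make the lines=True note concrete: a line-delimited file holds one complete JSON object per line rather than a single top-level array. The sketch below parses a tiny two-record sample using only the standard library. The records are made up for illustration and are not taken from the Yelp file.

```python
import io
import json

# Two made-up records in line-delimited form: one JSON object per line.
sample = io.StringIO(
    '{"business_id": "a1", "name": "Cafe One", "stars": 4.5}\n'
    '{"business_id": "b2", "name": "Taco Two", "stars": 3.0}\n'
)

# Parse each non-empty line independently, which is conceptually
# what pandas does when lines=True is passed to read_json.
records = [json.loads(line) for line in sample if line.strip()]

print(len(records))        # number of parsed objects
print(records[1]["name"])  # field access on the second record
```

Passing the whole text to a single json.loads call would fail instead, because a second object following the first is not valid JSON.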

OK, what have we got?

In [7]:
biz_df.shape

(188593, 15)

Great, 188k rows and 15 columns... what do they look like?

In [37]:
biz_df.head(3)

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,1314 44 Avenue NE,"{'BikeParking': 'False', 'BusinessAcceptsCredi...",Apn5Q_b6Nz61Tq4XzPdf9A,"Tours, Breweries, Pizza, Restaurants, Food, Ho...",Calgary,"{'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'...",1,51.091813,-114.031675,Minhas Micro Brewery,,T2E 6L6,24,4.0,AB
1,,"{'Alcohol': 'none', 'BikeParking': 'False', 'B...",AjEbIBw6ZFfln7ePHha9PA,"Chicken Wings, Burgers, Caterers, Street Vendo...",Henderson,"{'Friday': '17:0-23:0', 'Saturday': '17:0-23:0...",0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV
2,1335 rue Beaubien E,"{'Alcohol': 'beer_and_wine', 'Ambience': '{'ro...",O8S5hYJ1SMc8fA4QBtVujA,"Breakfast & Brunch, Restaurants, French, Sandw...",Montréal,"{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'...",0,45.540503,-73.5993,La Bastringue,Rosemont-La Petite-Patrie,H2G 1K7,5,4.0,QC


### Select Out What We Want

OK, for our immediate purposes, the nested objects seem like overkill, except for the categories, which are a simple nested list.  Since pandas by default quotes any string field that contains the delimiter, it is fine to have a comma-separated list inside a CSV.
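
As a small illustration of why embedded commas are safe, the sketch below writes and re-reads one made-up row with Python's csv module, using the minimal quoting mode that pandas' to_csv also defaults to. The row values are illustrative assumptions, not real Yelp records.

```python
import csv
import io

# Made-up row whose last field contains commas, like Yelp's categories.
header = ["business_id", "categories"]
row = ["a1", "Mexican, Burgers, Gastropubs"]

# QUOTE_MINIMAL quotes only the fields that need it; this is also the
# default quoting mode for pandas' to_csv.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)
writer.writerow(row)
print(buffer.getvalue())

# Reading the text back yields the original two fields, commas intact.
buffer.seek(0)
parsed = list(csv.reader(buffer))
```

Only the categories field is wrapped in double quotes in the written text, and the reader returns it as a single field.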

In [9]:
yelp_biz = biz_df[['business_id','name','address','neighborhood','city','postal_code','state','longitude','latitude','categories','review_count','stars']]

In [36]:
yelp_biz.head(3)

Unnamed: 0,business_id,name,address,neighborhood,city,postal_code,state,longitude,latitude,categories,review_count,stars
0,Apn5Q_b6Nz61Tq4XzPdf9A,Minhas Micro Brewery,1314 44 Avenue NE,,Calgary,T2E 6L6,AB,-114.031675,51.091813,"Tours, Breweries, Pizza, Restaurants, Food, Ho...",24,4.0
1,AjEbIBw6ZFfln7ePHha9PA,CK'S BBQ & Catering,,,Henderson,89002,NV,-114.939821,35.960734,"Chicken Wings, Burgers, Caterers, Street Vendo...",3,4.5
2,O8S5hYJ1SMc8fA4QBtVujA,La Bastringue,1335 rue Beaubien E,Rosemont-La Petite-Patrie,Montréal,H2G 1K7,QC,-73.5993,45.540503,"Breakfast & Brunch, Restaurants, French, Sandw...",5,4.0


### Export out clean table as CSV

Now that we have a Pandas dataframe with only the information required, it is easy to dump this out into a CSV (creating the same cleaned CSV file linked above).

In [11]:
of = 'yelp/yelp_biz_2018.csv'
yelp_biz.to_csv(of, index=False)

In [12]:
!ls -l yelp/yelp_biz_2018.csv

-rw-rw-r-- 1 mapdadmin mapdadmin 32859164 Nov 16 00:05 yelp/yelp_biz_2018.csv


### Set Up Import in OmniSci

We could do this in a fully-automated way using pymapd's load_table method, which accepts a pandas dataframe.  However, since that doesn't support "geo" features, and the number of columns is reasonably small, let's just hand-craft a nice efficient data definition.

#### OmniSci Utilities

In [13]:
import sys
import traceback

import pymapd

print(pymapd.__version__)

0.5.2.dev1+ga0feb21.d20181113


In [14]:
from pymapd import connect

dbname = 'mapd'
user = 'mapd'
host = 'localhost'
password = 'HyperInteractive!'
con = connect(user=user, password=password, host=host, dbname=dbname)

In [15]:
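def mapdql(query):
    """Run a SQL statement on the OmniSci connection; print the traceback on failure."""
    try:
        print('Executing query: {}'.format(query))
        return con.execute(query)
    except Exception:
        # Report the failure but keep the notebook running (returns None here)
        print('Exception executing query')
        print(traceback.format_exc())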

#### Figure out Column Schema

Which columns do we have?

In [16]:
cols = biz_df.columns
cols

Index(['address', 'attributes', 'business_id', 'categories', 'city', 'hours',
       'is_open', 'latitude', 'longitude', 'name', 'neighborhood',
       'postal_code', 'review_count', 'stars', 'state'],
      dtype='object')

Hmmm, the original data frame's columns are sorted alphabetically, but here order is important!  Let's just get the header from the CSV we just created.

In [17]:
!head -1 {of}

business_id,name,address,neighborhood,city,postal_code,state,longitude,latitude,categories,review_count,stars


Building a DDL is relatively straightforward.  In OmniSci, we can encode simple strings that represent categories as dictionaries, saving on graphics card memory.  The remaining column types are essentially standard SQL.

Then, instead of keeping the coordinates in separate columns, we use OmniSci's POINT geometry type to merge the columns on import.

In [18]:
ddl = 'business_id TEXT ENCODING DICT, '
ddl += 'name TEXT ENCODING DICT, '
ddl += 'address TEXT ENCODING DICT, '
ddl += 'neighborhood TEXT ENCODING DICT, '
ddl += 'city TEXT ENCODING DICT, '
ddl += 'postal_code TEXT ENCODING DICT, '
ddl += 'state TEXT ENCODING DICT, '
ddl += 'omnisci_geo GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32), '
#ddl += 'longitude FLOAT, '
#ddl += 'latitude FLOAT, '
ddl += 'categories TEXT, ' # comma-separated sublist; worth breaking out into its own table?
ddl += 'review_count INTEGER, '
ddl += 'stars FLOAT '
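
Because column order matters here, a quick sanity check can help before loading. The sketch below assumes the importer fills the POINT column from two adjacent CSV columns, longitude first and then latitude, and checks that the CSV header lines up with the DDL once that pair is collapsed. The helper name collapse_lon_lat is hypothetical and not part of pymapd or OmniSci.

```python
# Column order from the CSV header shown above.
csv_header = ("business_id,name,address,neighborhood,city,postal_code,"
              "state,longitude,latitude,categories,review_count,stars").split(",")

# Column names in the order the DDL declares them.
ddl_columns = ["business_id", "name", "address", "neighborhood", "city",
               "postal_code", "state", "omnisci_geo", "categories",
               "review_count", "stars"]


def collapse_lon_lat(columns, geo_name="omnisci_geo"):
    """Replace an adjacent longitude/latitude pair with one geo column name.

    Hypothetical helper for illustration only.
    """
    i = columns.index("longitude")
    if columns[i + 1:i + 2] != ["latitude"]:
        raise ValueError("latitude must directly follow longitude")
    return columns[:i] + [geo_name] + columns[i + 2:]


expected = collapse_lon_lat(csv_header)
print(expected == ddl_columns)
```

If the comparison came out false, reordering the columns in the pandas selection step would likely be simpler than changing the DDL.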

In [21]:
query = 'CREATE TABLE yelp_biz ({});'.format(ddl)

In [None]:
result = mapdql(query)
# result is empty table with correct schema

In [23]:
# find current working directory
# since COPY FROM requires a full path for data
cwd_list = !pwd
cwd = cwd_list[0]

In [27]:
import os
csv_full_path = os.path.join(cwd, 'yelp/yelp_biz_2018.csv')

In [31]:
table_name = 'yelp_biz' # output table name
query = "COPY {} FROM '{}';".format(table_name, csv_full_path)

In [29]:
result = mapdql(query)

Executing query: COPY yelp_biz FROM '/home/mapdadmin/demo/yelp/yelp_biz_2018.csv';


In [35]:
omnisci_tables = con.get_tables()

if table_name in omnisci_tables:
    print("Created yelp business point table: '{}'".format(table_name))

Created yelp business point table: 'yelp_biz'


### Results

With data imported into OmniSci, it is a breeze to construct interactive dashboards.  For example, here is the data above zoomed into Phoenix, Arizona.  The charts mostly use the defaults, except that I colored businesses on the main map from red to green based on their Yelp stars.  <p>
    I then added a bar chart showing the number of businesses by top-20 category types, and the distribution of star ratings.  With a little more work, I could have retitled and created custom colors for the bar charts, but I was too eager to interactively explore the data!

#### All Phoenix Business Points by Yelp Star Rating

<img src="yelp/img/Pheonix_business_points_by_Yelp_star_rating.png">

#### Best Mexican Restaurants in Phoenix, according to Yelp

<img src="yelp/img/Pheonixs_Best_Mexican_Restaurants_OmniSci.png">

<i>Exercise left to the reader.  Yelp's classification system is evidently not perfect.  For the Southwestern cities, our top two categories are: 'Mexican, Restaurants' and 'Restaurants, Mexican'.  How would you clean this up?</i>
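
One possible approach to this exercise, offered as a sketch rather than the definitive answer, is to make each category string order-independent by splitting, trimming, and sorting its parts. The function name normalize_categories is hypothetical.

```python
def normalize_categories(raw):
    """Return a canonical, order-independent form of a category string."""
    # Missing values (None or NaN) are treated as "no categories".
    if not isinstance(raw, str):
        return ""
    # Split on commas, trim whitespace, drop empties and duplicates.
    parts = {part.strip() for part in raw.split(",") if part.strip()}
    # Sorting makes 'Mexican, Restaurants' and 'Restaurants, Mexican' identical.
    return ", ".join(sorted(parts))


a = normalize_categories("Mexican, Restaurants")
b = normalize_categories("Restaurants, Mexican")
print(a == b)
```

With pandas, this could be applied to the categories column (for example via its apply method) before exporting the CSV, so that both spellings land in the same bar of the category chart.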

### Summary

The code above shows specifically how to load JSON point data with embedded separate latitude/longitude columns.  We also covered how to drop out or merge in-line objects within JSON files.  This workflow should also function well for any similar datasets which fit into machine memory.

The result is a very usable dashboard webapp which should allow you to quickly map Yelp ratings for any business type.