# Data Import, Cleaning, and Preparation

This module is used to query the postgreSQL database in order to retrieve the Yelp and Violations dataset. There will be several steps to do this:
1. Import flattened violations dataset. We will use a SQL query to perform aggregation of violation data by restaurant and inspeciton date. 
2. Import Yelp business data and join to inspection data    
4. Join the Yelp business and Inspection Data with the Yelp Review Data
    + Reviews for a given establishment will be aggregated so that reviews *after* the previous inspection (or the earliest review date) and *before* the date of a given inspection are in one batch. 
    +  Aggregate any review "count" features using this same logic
    + Combine the review documents for a restaurant into a CLOB using the same logic

## Import and Clean Data

In [88]:
import psycopg2 as psy
import pandas as pd
import re
import numpy as np

In [89]:
#set up connection to our DB
conn = psy.connect(database="sterndsyelp", 
                        user="mvsternds", 
                        password="nyustern123!", 
                        host="sterndsyelp.cawzspvmqd5q.us-east-1.rds.amazonaws.com", 
                        port="5432"
                       )
#open cursor and check our tables in the DB
cur = conn.cursor()

In [102]:
cur.execute("SELECT * FROM public.restaurants ")
biz = pd.DataFrame(cur.fetchall())

cur.execute("SELECT * FROM public.toronto_checkins LIMIT 50")
checkins = pd.DataFrame(cur.fetchall())

cur.execute("SELECT * FROM public.toronto_reviews")
reviews = pd.DataFrame(cur.fetchall())

**NOTE: ONLY LIMITING to 50 rows during build phase to limit processing time. **

### Yelp Business Data

In [103]:
biz.columns = ['bizID', 'name', 'address', 'zip', 'neighborhood', 'lat','long', 'categories','attributes','is_open','review_count','hours','stars']
biz.describe()

Unnamed: 0,bizID,name,address,zip,neighborhood,lat,long,categories,attributes,is_open,review_count,hours,stars
count,8138,8138,8138.0,8138.0,8138.0,8138.0,8138.0,8138,8138.0,8138,8138,8138.0,8138.0
unique,8138,6532,6711.0,3071.0,69.0,7210.0,7207.0,3997,6892.0,2,321,3758.0,9.0
top,v1uIObWcfiQiyr4EmtAixw,Starbucks,,,,43.653226,-79.3831843,"['Coffee & Tea', 'Food']",,1,3,,3.5
freq,1,132,39.0,42.0,1340.0,28.0,28.0,213,167.0,6063,785,2520.0,2113.0


**following section is to normalize addresses. a package is available but doing it manually seems easier / good enough. package:** https://github.com/pnpnpn/street-address 

In [None]:
#normalizes addresses
biz['address'] = [addr.replace('Street','St').replace('Boulevard','Blvd').replace('Avenue','Ave').replace('Road','Rd')
        .replace('North','N').replace('West','W').replace('South','S').replace('East','E') for addr in biz['address']]

#we should also think about removing pre and suffixes, like this example (insp data does not seem to have units etc):
biz['address'][37]

### Inspection Data

In [112]:
#import toronto inspection data
cur.execute("SELECT * FROM public.toronto_inspections LIMIT 50")
insp = pd.DataFrame(cur.fetchall())
insp.columns = ['insp_bizID','insp_biz_name','insp_biz_address','insp_date', 'last_inspection','insp_count_minor','insp_count_significant','insp_count_crucial','insp_count_na','insp_total_count_cs']
insp.describe()

Unnamed: 0,insp_bizID,insp_biz_name,insp_biz_address,insp_date,last_inspection,insp_count_minor,insp_count_significant,insp_count_crucial,insp_count_na,insp_total_count_cs
0,10187087,METROPOLIS BAKESHOP,2 BLOOR ST W,3/17/17,10/18/16,0,0,0,0,0
1,10289695,MILAN'S PIZZERIA & WINGS,1792 WESTON RD,11/9/15,11/2/15,0,0,0,0,0
2,10355463,PIZZA PASTA WAYS,3300 BLOOR ST W,12/15/16,10/7/15,2,0,0,0,0
3,10453689,TRINITY - ST. PAUL'S UNITED CHURCH,427 BLOOR ST W,2/18/16,12/21/15,0,0,0,0,0
4,10503134,INTERNATIONAL NEWS SHEPPARD,45 SHEPPARD AVE E,1/31/17,1/31/17,1,0,0,0,0


### Join Yelp Business Data with Inspection Dataset

In [None]:
#declare function to calculate levenshtein distance between 2 strings (not case sensitive)
def lev(string1, string2):
    #delete the ".lower()" in the following two lines to make distance case sensitive
    s1=str.strip(string1.lower())
    s2=str.strip(string2.lower())
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]

#test the function
print(lev('Hello',"hello"))
print(lev('dock','duck '))
print(lev('st','saint'))

**note: next cell should return matches once we include more than the 50 rows (fingers crossed)**

In [None]:
#set value of levenshtein distance threshold (4 means only distances of 3 and lower would be considered)
lev_dist_threshold = 4

#loop through each yelp bizID and find the restaurant with closest lev distance (currently matches using name only)
#left join the inspection data to the yelp business table so that the reviews can be aggregated on bizID and inspeciton 
bizrevs['lev_dist'] = lev_dist_threshold
bizrevs['insp_bizID'] = ""
for i in range(len(bizrevs['bizID'])):
    for x in range(len(insp['insp_bizID'])):
        dist = lev(str(bizrevs['name'][i]),str(insp['insp_biz_name'][x]))
        if dist < bizrevs['lev_dist'][i]:
            bizrevs['lev_dist'][i] = dist
            bizrevs['insp_bizID'][i] = insp['insp_bizID'][x]

df = pd.merge(bizrevs,insp,on='insp_bizID', how='left')            

df.head()

**Note: merge in cell above won't result in matches for name/address because of 50 row limit put in place. should work after removing limit**

### Yelp Review Data

In [104]:
reviews.columns = ['bizID','reviewID','userID','type','stars','text','useful','funny','cool','date']
#get dummies for star rating column
reviews = pd.concat([reviews, pd.get_dummies(reviews['stars'], prefix='stars')], axis=1)
reviews.head()

Unnamed: 0,bizID,reviewID,userID,type,stars,text,useful,funny,cool,date,stars_1,stars_2,stars_3,stars_4,stars_5
0,007Dg4ESDVacWcC4Vq704Q,RiJic78k_rMZLERUKFJvfw,AHXy4uTg_L8VFXNRufLYdQ,review,5,I have been with FDO for a year now and I love...,0,0,0,2013-10-31,0,0,0,0,1
1,007Dg4ESDVacWcC4Vq704Q,__uRY7WHzEddvIKCf-oFgA,9sDNyyANgUMNg0RsuR0E7A,review,5,"""This company is flexible, caring, and committ...",0,0,0,2015-05-20,0,0,0,0,1
2,007Dg4ESDVacWcC4Vq704Q,vueoOPpxrrfqbPqXshdr2A,TaJ3hRYUW9Z82HF0qc4hFQ,review,4,"""You know, I think I was in a super-good mood ...",3,0,2,2010-09-11,0,0,0,1,0
3,007Dg4ESDVacWcC4Vq704Q,9dURm92vIofTFMgA8_Aj8g,VVm-TFCpi9M1-k8ED0l1eA,review,4,"""I've been using this delivery service for alm...",2,0,0,2011-10-24,0,0,0,1,0
4,007Dg4ESDVacWcC4Vq704Q,ufdLhOT_xDN7Pld35F7QrA,Dl6Y6sjVGL7br1O44rXDQg,review,5,"I love this service, a client had referred me ...",0,0,0,2012-09-11,0,0,0,0,1


In [93]:
checkins.columns = ['bizID','type','datetime']
checkins.describe()

Unnamed: 0,bizID,type,datetime
count,50,50,50
unique,50,1,50
top,mhm5282-LI8Ddq3txkijYQ,b'checkin',"['Mon-0:1', 'Sun-0:1', 'Thu-0:2', 'Wed-0:1', '..."
freq,1,50,1


In [107]:
#get list of unique biz and create df
uniquebiz=np.unique(biz['bizID'])
bizrevs = pd.DataFrame(uniquebiz)
bizrevs.columns = ['bizID']
bizrevs = pd.merge(bizrevs,biz[['bizID','name','address']],on='bizID', how='left')
bizrevs.head()

Unnamed: 0,bizID,name,address
0,--DaPTJW3-tB1vP-PfdTEg,Sunnyside Grill,1218 Saint Clair Avenue W
1,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"3300 Midland Avenue, Unit 41"
2,-0DwB6Swi349EKfbBAOF7A,Qi Natural Foods,710 Bloor Street W
3,-0NhdsDJsdarxyDPR523ZQ,Akco Lounge,100 King St W
4,-0aOudcaAyac0VJbMX-L1g,Express Pizza & Grill,4917 Bathurst


In [108]:
#declare rest of columns
bizrevs['reviews'] = ""
bizrevs['checkins'] = 0
bizrevs['stars_1'] = 0
bizrevs['stars_2'] = 0
bizrevs['stars_3'] = 0
bizrevs['stars_4'] = 0
bizrevs['stars_5'] = 0
bizrevs['reviews_whole_words'] = ""
bizrevs.head()

Unnamed: 0,bizID,name,address,reviews,checkins,stars_1,stars_2,stars_3,stars_4,stars_5,reviews_whole_words
0,--DaPTJW3-tB1vP-PfdTEg,Sunnyside Grill,1218 Saint Clair Avenue W,,0,0,0,0,0,0,
1,--SrzpvFLwP_YFwB_Cetow,Keung Kee Restaurant,"3300 Midland Avenue, Unit 41",,0,0,0,0,0,0,
2,-0DwB6Swi349EKfbBAOF7A,Qi Natural Foods,710 Bloor Street W,,0,0,0,0,0,0,
3,-0NhdsDJsdarxyDPR523ZQ,Akco Lounge,100 King St W,,0,0,0,0,0,0,
4,-0aOudcaAyac0VJbMX-L1g,Express Pizza & Grill,4917 Bathurst,,0,0,0,0,0,0,


In [96]:
#look up each unique biz ID in reviews table
for i in range(len(unique(bizrevs['bizID'])):
    #add text of review to reviews column if biz IDs match
    for x in range(len(reviews['bizID'])):
        if bizrevs['bizID'][i] == reviews['bizID'][x]:
            bizrevs['reviews'][i] = bizrevs['reviews'][i] + reviews['text'][x]
    #count number of checkins
    for y in range(len(checkins['bizID'])):
        if bizrevs['bizID'][i] == checkins['bizID'][y]:
            bizrevs['checkins'][i] = bizrevs['checkins'][i] + 1
    #count number of reviews with each star rating
    for z in range(len(reviews['bizID'])):
        if bizrevs['bizID'][i] == reviews['bizID'][z]:
            bizrevs['stars_1'][i] = bizrevs['stars_1'][i] + reviews['stars_1'][z]
            bizrevs['stars_2'][i] = bizrevs['stars_2'][i] + reviews['stars_2'][z]
            bizrevs['stars_3'][i] = bizrevs['stars_3'][i] + reviews['stars_3'][z]
            bizrevs['stars_4'][i] = bizrevs['stars_4'][i] + reviews['stars_4'][z]
            bizrevs['stars_5'][i] = bizrevs['stars_5'][i] + reviews['stars_5'][z]
    #extract whole words from reviews
    bizrevs['reviews_whole_words'][i] = ' '.join(re.findall('[A-Za-z]+', bizrevs['reviews'][i]))
    
bizrevs.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,bizID,name,address,reviews,checkins,stars_1,stars_2,stars_3,stars_4,stars_5,reviews_whole_words
0,4MMPpETGn-3LG5xMpcPO7w,Pho King Fabulous!,2411 Yonge Street,,0,0,0,0,0,0,
1,4MU88s7YswXGq6KcX1W-Iw,Loblaws,17 Leslie Street,,0,0,0,0,0,0,
2,5H7AyjxmLGuEjigfXVApZg,"""Michel's Bakery Cafe""",3401 Dufferin Street,,0,0,0,0,0,0,
3,5O2Gk2Kg3QpKHmcNTCmBuw,Kkorae,3 Finch Avenue E,,0,0,0,0,0,0,
4,7eSbvHOOpRmEwywDOclevQ,Pizzaiolo,624 Queen St W,,0,0,0,0,0,0,
