# Data Import, Cleaning, and Preparation

This module queries the PostgreSQL database to retrieve the Yelp and violations datasets. The process has several steps:
* Import the flattened violations dataset. A SQL query aggregates violation data by restaurant and inspection date.
* Aggregate the Yelp data
    * Reviews for a given establishment are batched so that reviews *after* the previous inspection (or from the earliest review date) and *before* the date of a given inspection fall in one batch.
    * Aggregate "count" features using this same logic
    * Combine the review documents for a restaurant into a single CLOB using the same logic
* LEFT JOIN the violations dataset to the Yelp data after aggregation is complete for both datasets
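
The batching and join steps above can be sketched in pandas. This is only an illustration on made-up rows, not the project's actual SQL: the frames `inspections_demo` and `reviews_demo`, their column names, and the choice of `pd.merge_asof` are all assumptions. Each review is assigned to the next inspection that falls strictly after it, which places it in the window between the previous inspection and that one.

```python
import pandas as pd

# Hypothetical stand-ins for the violations and reviews tables.
inspections_demo = pd.DataFrame({
    "bizID": ["a", "a", "b", "c"],
    "inspection_date": pd.to_datetime(
        ["2015-01-10", "2015-06-01", "2015-03-15", "2015-02-01"]),
    "violations": [2, 0, 1, 3],
})
reviews_demo = pd.DataFrame({
    "bizID": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(
        ["2014-12-01", "2015-02-01", "2015-05-20", "2015-01-05", "2015-04-01"]),
    "text": ["dirty tables", "ok service", "still ok", "great food", "came back"],
})

# Attach to every review the first inspection of the same business that
# happens strictly after the review date.
windows = pd.merge_asof(
    reviews_demo.sort_values("date"),
    inspections_demo[["bizID", "inspection_date"]].sort_values("inspection_date"),
    left_on="date",
    right_on="inspection_date",
    by="bizID",
    direction="forward",
    allow_exact_matches=False,
)

# Reviews written after the last inspection have no window; drop them, then
# build one row per (business, inspection) with a count and a combined text.
batches = (
    windows.dropna(subset=["inspection_date"])
    .groupby(["bizID", "inspection_date"], as_index=False)
    .agg(review_count=("text", "count"), review_text=("text", " ".join))
)

# LEFT JOIN so inspections without any reviews are kept.
result = inspections_demo.merge(batches, on=["bizID", "inspection_date"], how="left")
result["review_count"] = result["review_count"].fillna(0).astype(int)
result["review_text"] = result["review_text"].fillna("")
print(result)
```

Business `c` has an inspection but no reviews, so it survives the join with a count of zero and an empty document.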

In [1]:
#set up connection to our DB
import psycopg2 as psy
import pandas as pd
conn = psy.connect(database="sterndsyelp", 
                        user="mvsternds", 
                        password="nyustern123!", 
                        host="sterndsyelp.cawzspvmqd5q.us-east-1.rds.amazonaws.com", 
                        port="5432"
                       )
#open cursor and check our tables in the DB
cur = conn.cursor()
cur.execute("SELECT table_name FROM information_schema.tables"
            " WHERE table_schema='public'" 
            " AND table_type='BASE TABLE'")
rows = cur.fetchall()
print(pd.DataFrame(rows))

                       0
0               business
1                checkin
2             trnt_insps
3  violations_pittsburgh
4                   tips
5                reviews


In [94]:
cur.execute("SELECT * FROM public.business LIMIT 50")
biz = pd.DataFrame(cur.fetchall())

cur.execute("SELECT * FROM public.checkin LIMIT 50")
checkins = pd.DataFrame(cur.fetchall())

cur.execute("SELECT * FROM public.reviews LIMIT 50")
reviews = pd.DataFrame(cur.fetchall())

**NOTE: Rows are limited to 50 only during the build phase to keep processing time down. Remove the limit once the dataset is filtered via SQL (i.e., Toronto only, restaurants only, etc.).**
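
One possible shape for that SQL-side filtering is a parameterized query. The sketch below runs against an in-memory SQLite table so it is self-contained. The table layout, the sample rows, and the assumption that categories are stored as a text list are all hypothetical. With psycopg2 the placeholders would be `%s` rather than `?`.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the business table (made-up rows and columns).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE business (business_id TEXT, name TEXT, city TEXT, categories TEXT)"
)
demo.executemany(
    "INSERT INTO business VALUES (?, ?, ?, ?)",
    [
        ("b1", "Example Noodle Bar", "Toronto", "['Restaurants', 'Ramen']"),
        ("b2", "Example Car Rental", "Las Vegas", "['Hotels & Travel', 'Car Rental']"),
        ("b3", "Example Salon", "Toronto", "['Beauty & Spas']"),
    ],
)
demo.commit()

# Filter in the database instead of pulling every row into pandas.
query = (
    "SELECT business_id, name FROM business "
    "WHERE city = ? AND categories LIKE ?"
)
toronto_restaurants = pd.read_sql_query(
    query, demo, params=("Toronto", "%Restaurants%")
)
print(toronto_restaurants)
```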

In [62]:
biz.columns = ['state','hours','type','review_count','neighborhood','longitude','is_open','attributes','name','address','city','latitude','categories','zip','bizID','rating']
biz.describe()

Unnamed: 0,state,hours,type,review_count,neighborhood,longitude,is_open,attributes,name,address,city,latitude,categories,zip,bizID,rating
count,50,50.0,50,50,50,50.0,50,50.0,50,50,50,50.0,50,50,50,50.0
unique,8,34.0,1,27,12,49.0,2,39.0,49,48,15,49.0,49,24,50,9.0
top,b'NV',,b'business',3,b'',36.1107323,1,,b'Hertz Rent A Car',b'',b'Las Vegas',-115.1722365,"['Hotels & Travel', 'Car Rental']",b'89109',b'HJe16HMwkPy269Vk1Sl4-Q',3.5
freq,25,14.0,50,9,19,2.0,38,10.0,2,3,25,2.0,2,11,1,13.0


**NOTE: The column order below is probably wrong. We still need to find the true positions of the IDs and the useful/cool/funny columns.**

In [104]:
reviews.columns = ['type','funny','bizID','userID','reviewID','rating','text','useful','cool','date']
reviews.describe()

Unnamed: 0,type,funny,bizID,userID,reviewID,rating,text,useful,cool,date
count,50,50,50,50,50,50,50,50,50,50
unique,1,5,7,50,50,5,50,7,7,49
top,b'review',0,b'4P-vTvE6cncJyUyLh73pxw',b'VlDz03s9VyODcVi1S9-Yfw',b'je5k8a3qIOM0VJE5MaxsfQ',5,b'Have dined in twice now and today was take o...,0,0,b'2014-08-23'
freq,50,34,17,1,1,18,1,27,38,2
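
One way to avoid guessing column positions is to read the names from the cursor itself. DB-API cursors, including psycopg2's, expose `description` after a `SELECT`, and the first element of each entry is the column name. The sketch below uses an in-memory SQLite table as a stand-in. The table and its column names are hypothetical and are not the real schema of `public.reviews`.

```python
import sqlite3
import pandas as pd

# Stand-in database with a made-up reviews table.
conn_demo = sqlite3.connect(":memory:")
cur_demo = conn_demo.cursor()
cur_demo.execute(
    "CREATE TABLE reviews (review_id TEXT, business_id TEXT, stars INTEGER)"
)
cur_demo.execute("INSERT INTO reviews VALUES ('r1', 'b1', 5)")

cur_demo.execute("SELECT * FROM reviews LIMIT 50")
# Take the column names in the order the database actually returns them.
colnames = [col[0] for col in cur_demo.description]
reviews_named = pd.DataFrame(cur_demo.fetchall(), columns=colnames)
print(reviews_named.columns.tolist())
```

With the names taken from the query result, any renaming to `bizID`, `userID`, and so on can be done by name rather than by position.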


In [103]:
checkins.columns = ['bizID','type','datetime']
checkins.describe()

Unnamed: 0,bizID,type,datetime
count,50,50,50
unique,50,1,50
top,b'iX2rl6mNNu2TjoRbiX6wSQ',b'checkin',"['Thu-0:1', 'Mon-1:1', 'Mon-12:1', 'Sat-16:1']"
freq,1,50,1


In [106]:
#This section lumps all reviews of a business into one row and counts checkins

#get list of unique biz IDs and create df
uniquebiz = list(set(reviews['bizID']))
bizrevs = pd.DataFrame(uniquebiz)
bizrevs.columns = ['bizID']
bizrevs['reviews'] = ""
bizrevs['checkins'] = 0

#look up each unique biz ID in reviews table
for i in range(len(bizrevs['bizID'])):
    #add text of review to reviews column if biz IDs match
    for x in range(len(reviews['bizID'])):
        if bizrevs['bizID'][i] == reviews['bizID'][x]:
            #write through .loc; chained indexing may assign to a copy and raises SettingWithCopyWarning
            bizrevs.loc[i, 'reviews'] = bizrevs.loc[i, 'reviews'] + reviews['text'][x]
    #count number of checkins
    for y in range(len(checkins['bizID'])):
        if bizrevs['bizID'][i] == checkins['bizID'][y]:
            bizrevs.loc[i, 'checkins'] = bizrevs.loc[i, 'checkins'] + 1

bizrevs.head()



Unnamed: 0,bizID,reviews,checkins
0,b'2aFiy99vNLklCx3T_tGS9A',"b""If you enjoy service by someone who is as co...",0
1,b'7GI_V9oLCUGdn2ogqB0IBg',"b""I highly doubt anyone uses Yelp to find a pl...",0
2,b'0czfEgv9KAD4VlIa7ANPWQ',"b""Overall, I'll never go back. Rewinding to th...",0
3,b'4P-vTvE6cncJyUyLh73pxw',"b""Unmmmm, no. It's a bar with tables in the ne...",0
4,b'2LfIuF3_sX6uwe-IR-P0jQ',"b""Highly recommended. Went in yesterday lookin...",0
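
The nested loops above compare every business against every review and every check-in row, which will be slow once the 50-row limit is lifted. A possible vectorized alternative uses `groupby`. This is a sketch on made-up frames (`reviews_small`, `checkins_small`) that reuse the `bizID` and `text` column names from this notebook. It is meant to give the same kind of result, not to stand in for the code above.

```python
import pandas as pd

# Made-up miniature versions of the reviews and checkins frames.
reviews_small = pd.DataFrame({
    "bizID": ["a", "b", "a"],
    "text": ["Great. ", "Bad. ", "Fine."],
})
checkins_small = pd.DataFrame({
    "bizID": ["a", "c"],
    "type": ["checkin", "checkin"],
})

# Concatenate all review text per business, in the original row order.
review_text = reviews_small.groupby("bizID")["text"].agg("".join).rename("reviews")
# Count check-in rows per business.
checkin_counts = checkins_small.groupby("bizID").size().rename("checkins")

# Keep every reviewed business; those without check-ins get a count of 0.
bizrevs_small = (
    review_text.to_frame()
    .join(checkin_counts, how="left")
    .fillna({"checkins": 0})
    .astype({"checkins": int})
    .reset_index()
)
print(bizrevs_small)
```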
