# Case Study 2: Data Science in Yelp Data

**Required Readings:** 
* [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge) 
* Please download the Yelp dataset from the above webpage.
* [TED Talks](https://www.ted.com/talks) for examples of 10-minute talks.


**NOTE**
* Please save the notebook frequently when working in Jupyter Notebook; otherwise, your changes can be lost.

---

Here is an example of the data format. More details are available on the [Yelp Dataset Challenge page](https://www.yelp.com/dataset_challenge).

## Business Objects

Business objects contain basic information about local businesses. The fields are as follows:

```json
{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)],
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
```
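
As a hedged illustration of how a record in this shape might be read, the sketch below parses one JSON line and keeps a handful of fields. The sample record and the `parse_business` helper are hypothetical (invented for this example, not taken from the dataset), and the sketch is written for Python 3.

```python
import json

# Hypothetical sample record: the values are invented for illustration only.
sample_line = json.dumps({
    "type": "business",
    "business_id": "example-id-001",
    "name": "Example Cafe",
    "city": "Las Vegas",
    "state": "NV",
    "stars": 4.5,
    "review_count": 120,
    "categories": ["Food", "Coffee & Tea"],
    "open": True,
})


def parse_business(line):
    """Parse one JSON line and keep only a few of the documented fields."""
    record = json.loads(line)
    return {
        "business_id": record.get("business_id"),
        "name": record.get("name"),
        "city": record.get("city"),
        "state": record.get("state"),
        "review_count": record.get("review_count", 0),
        # 'categories' may be missing or null, so fall back to an empty list
        "categories": record.get("categories") or [],
    }


business = parse_business(sample_line)
print("{0}: {1} reviews".format(business["name"], business["review_count"]))
```

Falling back to an empty list for `categories` avoids a separate `None` check wherever the categories are iterated later.
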
## Checkin Objects
```json
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for an hour-day block it will not be in the dict
}
```
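
The `hour-day` keys described above can be decoded with a simple split. The following is a minimal sketch under the assumption that day `0` is Sunday and day `6` is Saturday, as the comments in the format suggest; the sample record and both helper functions are hypothetical.

```python
# Day index to name, following the format comments above (0 = Sunday).
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

# Hypothetical checkin record with invented counts.
sample_checkin = {
    "type": "checkin",
    "business_id": "example-id-001",
    "checkin_info": {"0-0": 2, "14-4": 5, "23-6": 1},
}


def total_checkins(checkin):
    """Sum the counts over every hour-day block that is present."""
    return sum(checkin.get("checkin_info", {}).values())


def checkins_by_day(checkin):
    """Aggregate counts per weekday by decoding each 'hour-day' key."""
    totals = {}
    for key, count in checkin.get("checkin_info", {}).items():
        hour, day = key.split("-")
        day_name = DAY_NAMES[int(day)]
        totals[day_name] = totals.get(day_name, 0) + count
    return totals


print(total_checkins(sample_checkin))
print(checkins_by_day(sample_checkin))
```

Because missing hour-day blocks are simply absent from the dict, no special handling for zero counts is needed.
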

# Problem: pick a data science problem that you plan to solve using Yelp Data
* The problem should be important and interesting, with potential impact in some area.
* The problem should be solvable using Yelp data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?

In [0]:
# Using Yelp business and checkin data, we will determine which states have the
# most active Yelp users, and which states have the least active Yelp users.
# This will allow Yelp to determine into which geographies they need to invest 
# resources to drive engagement.
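
One way the question above could be framed in code is to roll per-business checkin totals up to the state level. This is only a sketch under stated assumptions: it presumes the totals have already been computed, and the sample data and the `checkins_per_state` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: business records and precomputed checkin totals.
businesses_sample = [
    {"business_id": "b1", "state": "NV"},
    {"business_id": "b2", "state": "AZ"},
    {"business_id": "b3", "state": "NV"},
]
checkin_totals_sample = {"b1": 10, "b2": 4, "b3": 6}


def checkins_per_state(businesses, checkin_totals):
    """Sum checkin totals by the state of the business they belong to."""
    state_of = {b["business_id"]: b["state"] for b in businesses}
    per_state = defaultdict(int)
    for business_id, total in checkin_totals.items():
        state = state_of.get(business_id)
        if state is not None:  # ignore checkins for unknown businesses
            per_state[state] += total
    return dict(per_state)


state_totals = checkins_per_state(businesses_sample, checkin_totals_sample)
# Most active states first
ranked_states = sorted(state_totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked_states)
```
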



# Data Collection/Processing: 

In [0]:
# READ ME

# Our team chose to use Google Colaboratory for this project to speed up
# collaboration. As a result, the json files from the Yelp Challenge were
# stored in a shared Google Drive folder.

# This made reading the json files difficult, so we chose the files.upload()
# setup (Option 2), which we judged to be the best process for reading in the
# data from the json files.

# Option 1 is the approach our team would have taken if we used Jupyter Notebook
# and the json files were stored on our local devices.

# Option 3 is a direct link to the Google Drive .json files but requires a login.

# Data Collection: Option 1

In [0]:
# Below is the approach our team would have taken if we used Jupyter Notebook
# and the json files were stored on our local devices.


import json
import io

# Need to import review.json information
review = []
with io.open('review.json', encoding = "utf-8-sig") as f:
  for line in f:
    review.append(json.loads(line))
  print("done")

  
# Need to import checkin.json information
checkins = []
with io.open('checkin.json', encoding = "utf-8-sig") as f:
  for line in f:
    checkins.append(json.loads(line))
  print("done")
  
  
# Need to import business.json information
businesses = []
with io.open('business.json', encoding = "utf-8-sig") as f:
  for line in f:
    businesses.append(json.loads(line))
  print("done")
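
The three loops above repeat the same pattern, so they could be folded into one helper. The sketch below is a hypothetical refactoring (the name `load_json_lines` is not from the original), assuming each file holds one JSON object per line; it writes a small temporary file so that it runs on its own.

```python
import io
import json
import os
import tempfile


def load_json_lines(path):
    """Read a file with one JSON object per line, skipping blank lines."""
    records = []
    with io.open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Self-contained demo using a temporary file with invented contents.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.json")
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u'{"business_id": "b1"}\n\n{"business_id": "b2"}\n')

demo_records = load_json_lines(demo_path)
print(len(demo_records))
```

With such a helper, each dataset would load in one line, for example `businesses = load_json_lines('business.json')`.
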


# Data Collection: Option 2

In [0]:
from google.colab import files
import json

# Upload the file. Run this cell only if needed: if checkins_file already exists, don't upload it again.
checkins_file = files.upload()

In [0]:
checkins = []
for k in checkins_file.keys():
  checkins_list = checkins_file[k].split('\n')
  for mLines in checkins_list:
    try:
      checkins.append(json.loads(mLines))
    except ValueError:  # skip blank or malformed lines
      pass

In [3]:
# Upload the file. Run this cell only if needed: if businesses_file already exists, don't upload it again.
businesses_file = files.upload()

In [0]:
businesses = []
for k in businesses_file.keys():
  businesses_list = businesses_file[k].split('\n')
  for mLines in businesses_list:
    try:
      businesses.append(json.loads(mLines))
    except ValueError:  # skip blank or malformed lines
      pass
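
Silently skipping lines can hide data problems, so it may help to count what was skipped. The sketch below assumes the upload arrives as a dict mapping file names to file contents (bytes under Python 3); the `parse_uploaded` helper and the fake upload are hypothetical stand-ins, so the code runs outside Colab.

```python
import json


def parse_uploaded(uploaded):
    """Parse JSON lines from a {filename: content} dict, counting bad lines."""
    records = []
    skipped = 0
    for content in uploaded.values():
        if isinstance(content, bytes):
            content = content.decode("utf-8")
        for line in content.split("\n"):
            line = line.strip()
            if not line:
                continue  # blank lines are expected, not errors
            try:
                records.append(json.loads(line))
            except ValueError:
                skipped += 1  # malformed JSON line
    return records, skipped


# Invented stand-in for the dict returned by an upload.
fake_upload = {
    "checkin.json": b'{"business_id": "b1"}\nnot json\n{"business_id": "b2"}\n'
}
records, skipped = parse_uploaded(fake_upload)
print("parsed {0} records, skipped {1} lines".format(len(records), skipped))
```
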

# Data Collection: Option 3

In [0]:
# run this cell to import the json data from google drive without uploading

!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

import json
checkin_file = drive.CreateFile({'id':'1g8kUdzMAiIVrE--3xiA_E3F3KrOUuy2A'})
checkin_file.GetContentFile('check.json')
with open('check.json', 'r') as handle:
    checkins = [json.loads(line) for line in handle]
bus_file = drive.CreateFile({'id':'1ZTl5dNxEX6ANbNpoYGFGSM9JONB95nUt'})
bus_file.GetContentFile('bus.json')
with open('bus.json', 'r') as handle:
    businesses = [json.loads(line) for line in handle]


# Data Exploration: Exploring the Yelp Dataset

**(1) Finding the most popular business categories:** 
* print the top 10 most popular business categories in the dataset and their counts (i.e., how many business objects in each category). Here we say a category is "popular" if there are many business objects in this category (such as 'restaurants').

In [4]:
!pip install pandas



In [5]:
# Count how many business objects fall into each category
import operator
categoriesCount = {}

for entry in range(len(businesses)): # iterate through the business list
  if businesses[entry]['categories'] is None: # skip over any empty entries
    pass
  else:
    # if we have already seen the category, add 1 to its count; otherwise set it to 1
    for each in businesses[entry]['categories']: # iterate business categories
      if each.lower() in categoriesCount:
        categoriesCount[each.lower()] = categoriesCount[each.lower()] + 1
      else:
        categoriesCount[each.lower()] = 1

# Return the top 10 categories and their respective counts in descending order
sorted_categoriesCount = sorted(categoriesCount.items(), key=operator.itemgetter(1))
top10_ca = sorted_categoriesCount[::-1][:10]
print top10_ca


[(u'restaurants', 54618), (u'shopping', 27971), (u'food', 24777), (u'beauty & spas', 17014), (u'home services', 16205), (u'health & medical', 14230), (u'nightlife', 12154), (u'local services', 11232), (u'automotive', 11052), (u'bars', 10563)]
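
The same counting could also be expressed with `collections.Counter` from the standard library. This is an alternative sketch rather than the code that produced the output above, and the sample data is invented.

```python
from collections import Counter

# Invented sample: one business has no categories at all.
businesses_sample = [
    {"categories": ["Restaurants", "Bars"]},
    {"categories": None},
    {"categories": ["restaurants", "Food"]},
]

# Lower-case each category so 'Restaurants' and 'restaurants' are merged.
category_counts = Counter(
    category.lower()
    for business in businesses_sample
    for category in (business.get("categories") or [])
)

top_category = category_counts.most_common(1)
print(top_category)
```

`most_common(10)` would return the ten largest counts in descending order, replacing the manual sort and slice.
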


In [6]:
# use pandas to display the result in a dataframe
import pandas as pd

temp = pd.DataFrame(top10_ca)

print temp

                  0      1
0       restaurants  54618
1          shopping  27971
2              food  24777
3     beauty & spas  17014
4     home services  16205
5  health & medical  14230
6         nightlife  12154
7    local services  11232
8        automotive  11052
9              bars  10563


**(2) Find the most popular business objects:**
* print the top 10 most popular business objects in the dataset and their counts (i.e., how many checkins in total for each business object).  Here we say a business object is "popular" if the business object attracts a large number of checkins from the users.

In [7]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

import operator

businessCount = {}
for entry in range(len(checkins)):
  business_id = checkins[entry]['business_id']
  # skip records with no business id
  if business_id is None:
    pass
  else:
    checkNum = 0
    for each in checkins[entry]['time'].values():
      for eachHour in each.values():
        checkNum = checkNum + eachHour
    # if we have already seen the id, add this record's checkins to its total; otherwise start the total
    if business_id in businessCount:
      businessCount[business_id] = businessCount[business_id] + checkNum
    else:
      businessCount[business_id] = checkNum
# top 10 business ids by checkins; we need to link each id to a name next
sorted_businessCount = sorted(businessCount.items(), key=operator.itemgetter(1))
top10_id = sorted_businessCount[::-1][:10]

# id_to_name maps each business id to its name
id_to_name = {}
for entry in range(len(businesses)):
  # skip records with no business id
  if businesses[entry]['business_id'] is None:
    pass
  else:
    if businesses[entry]['business_id'] not in id_to_name:
      id_to_name[businesses[entry]['business_id']] = businesses[entry]['name']

# generate a new list containing the top 10 business names and their checkin counts
res = [(id_to_name[top10_id[i][0]], top10_id[i][1]) for i in range(10)]
print res

[(u'McCarran International Airport', 131958), (u'Phoenix Sky Harbor International Airport', 112590), (u'Charlotte Douglas International Airport', 49934), (u'The Cosmopolitan of Las Vegas', 43995), (u'ARIA Resort & Casino', 32603), (u'Kung Fu Tea', 32393), (u'The Venetian Las Vegas', 30583), (u'Bellagio Hotel', 29271), (u'MGM Grand Hotel', 28272), (u'Caesars Palace Las Vegas Hotel & Casino', 27306)]


In [8]:
# use pandas to display the result in a dataframe

temp = pd.DataFrame(res)

print temp

                                          0       1
0            McCarran International Airport  131958
1  Phoenix Sky Harbor International Airport  112590
2   Charlotte Douglas International Airport   49934
3             The Cosmopolitan of Las Vegas   43995
4                      ARIA Resort & Casino   32603
5                               Kung Fu Tea   32393
6                    The Venetian Las Vegas   30583
7                            Bellagio Hotel   29271
8                           MGM Grand Hotel   28272
9   Caesars Palace Las Vegas Hotel & Casino   27306
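
The checkin totals can be computed more compactly with `collections.Counter`. The sketch below assumes, as the nested loops in the cell above imply, that each checkin record has a `time` dict of per-day dicts holding hourly counts. The day and hour key formats, the sample records, and the names are all invented for illustration.

```python
from collections import Counter

# Invented records: 'time' maps a day to a dict of hour -> checkin count.
sample_checkins = [
    {"business_id": "b1",
     "time": {"Monday": {"9:00": 2, "10:00": 1}, "Friday": {"20:00": 4}}},
    {"business_id": "b2",
     "time": {"Sunday": {"12:00": 3}}},
]
sample_names = {"b1": "Example Airport", "b2": "Example Cafe"}

checkin_totals = Counter()
for record in sample_checkins:
    for hours in record["time"].values():
        checkin_totals[record["business_id"]] += sum(hours.values())

# Replace ids with names; fall back to the id if no name is known.
top_businesses = [
    (sample_names.get(business_id, business_id), total)
    for business_id, total in checkin_totals.most_common(10)
]
print(top_businesses)
```

Using `sample_names.get(business_id, business_id)` avoids a `KeyError` when a checkin refers to a business that is missing from the business file.
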


# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

In [0]:

# Gather a list of the number of total reviews in each city. 
# Plot these total reviews on a map.
# Plot total reviews/businesses on a map to show average reviews per business.
# Yelp can use this information to deploy marketing efforts in underused areas.
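
The aggregation step in this idea can be sketched without any third-party library. The version below is a hypothetical plain-Python illustration of summing reviews, counting businesses, and averaging coordinates per city, with invented sample data.

```python
from collections import defaultdict

# Invented sample businesses.
sample_businesses = [
    {"city": "Las Vegas", "review_count": 100, "latitude": 36.0, "longitude": -115.0},
    {"city": "Las Vegas", "review_count": 50, "latitude": 36.2, "longitude": -115.2},
    {"city": "Phoenix", "review_count": 30, "latitude": 33.4, "longitude": -112.0},
]


def summarize_by_city(businesses):
    """Group businesses by city and compute totals and mean coordinates."""
    groups = defaultdict(list)
    for business in businesses:
        groups[business["city"]].append(business)

    summary = {}
    for city, rows in groups.items():
        n = float(len(rows))
        summary[city] = {
            "review_count": sum(r["review_count"] for r in rows),
            "business_count": len(rows),
            "latitude": sum(r["latitude"] for r in rows) / n,
            "longitude": sum(r["longitude"] for r in rows) / n,
        }
    return summary


city_summary = summarize_by_city(sample_businesses)
print(city_summary["Las Vegas"]["review_count"])
```

The mean latitude and longitude give a single point per city at which a bubble can be drawn on the map.
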



Write code to implement the solution in Python:

In [69]:
import pandas as pd

# create dataframe from businesses
df = pd.DataFrame(businesses)

# define aggregation for different columns - we want to sum reviews, count
# businesses, and use the avg latitude/longitude for each city name
d = {'review_count':'sum', 'business_id':'count','latitude':'mean','longitude':'mean'}

# aggregate the business data by city using the above parameters
cityData = df.groupby('city', as_index=False).agg(d)

# sort output by review count descending
sortedCityData = cityData.sort_values('review_count', ascending = False)

# rename business_id to business_count
sortedCityData = sortedCityData.rename(columns={'business_id':'business_count'}).reset_index()

# clean up prior index
del sortedCityData['index']

#print sorted output
print sortedCityData[['city','review_count']]


                                                   city  review_count
0                                             Las Vegas       1604173
1                                               Phoenix        576709
2                                               Toronto        430923
3                                            Scottsdale        308529
4                                             Charlotte        237115
5                                            Pittsburgh        179471
6                                             Henderson        166884
7                                                 Tempe        162772
8                                                  Mesa        130883
9                                              Montréal        122620
10                                             Chandler        122343
11                                              Gilbert         97204
12                                            Cleveland         92280
13                  

In [70]:
# This cell will create a visualization of the total review counts by city
# Adapted from Plotly US Bubble Map https://plot.ly/python/bubble-maps/

import plotly.plotly as py
py.sign_in('jennifersue', 'rXLUQ3ZBHBrX0iVs5TL6')

# plot number of reviews
df = pd.DataFrame(sortedCityData)
df.head()

# set hover text box
df['text'] = df['city']  + '<br>Businesses: ' + (df['business_count']).astype(str) +'<br>Reviews: ' + (df['review_count']).astype(str) 
# set row-index ranges (by review-count rank) for each bubble color group;
# note that each slice excludes its end index
limits = [(0,2), (3,15), (16,50), (51,100), (101,1650000)]
colors = ["rgb(0,116,217)","rgb(255,65,54)","rgb(133,200,75)","rgb(255,133,27)","rgb(50,150,133)"]
cities = []
scale = 500

# place bubbles
for i in range(len(limits)):
    lim = limits[i]
    df_sub = df[lim[0]:lim[1]]
    city = dict(
                type = 'scattergeo',
                locationmode = 'USA-states',
                lon = df_sub['longitude'],
                lat = df_sub['latitude'],
                text = df_sub['text'],
                marker = dict(
                              size = df_sub['review_count']/scale,
                              color = colors[i],
                              line = dict(width=0.0, color='rgb(40,40,40)'),
                              sizemode = 'area'
                              ),
                name = " " )
    cities.append(city)

# set map title and background
layout = dict(
              title = 'Yelp Reviews per City',
              showlegend = False,
              geo = dict(
                         scope='usa',
                         projection=dict( type='albers usa' ),
                         showland = True,
                         landcolor = 'rgb(217, 217, 217)',
                         subunitwidth=1,
                         countrywidth=1,
                         subunitcolor="rgb(255, 255, 255)",
                         countrycolor="rgb(255, 255, 255)"
                         ),
              )

# show map
fig = dict( data=cities, layout=layout )
py.iplot( fig, validate=False, filename='d3-bubble-map-populations.html')


High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~jennifersue/0 or inside your plot.ly account where it is named 'd3-bubble-map-populations.html'
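
One detail worth checking in the cell above: Python slices exclude their end index, so consecutive ranges such as `(0,2)` and `(3,15)` leave the row between them out of every group. The sketch below demonstrates this on a plain list standing in for the ranked cities; `contiguous_limits` is a hypothetical alternative, not the original authors' choice.

```python
rows = list(range(20))  # stand-in for 20 ranked cities

# Ranges written in the same style as the cell above.
original_style_limits = [(0, 2), (3, 15), (16, 20)]
covered = [r for start, end in original_style_limits for r in rows[start:end]]
missing = [r for r in rows if r not in covered]
print("rows left out: {0}".format(missing))

# Making each range start where the previous one ended covers every row.
contiguous_limits = [(0, 3), (3, 16), (16, 20)]
covered_all = [r for start, end in contiguous_limits for r in rows[start:end]]
print("all rows covered: {0}".format(covered_all == rows))
```
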


In [0]:
# calculate average review per business for each city to improve output
sortedCityData['avg'] = (sortedCityData.review_count / sortedCityData.business_count)
sortedCityData.avg = sortedCityData.avg.round()

# remove noisy values where there are fewer than 20 businesses in a city
avgData = sortedCityData.drop(sortedCityData[sortedCityData.business_count < 20].index)

# re-sort data based on avg and reset index
sortedAvgData = avgData.sort_values('avg', ascending = False).reset_index()

# cleanup prior index
del sortedAvgData['index']
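
The filtering and ranking logic of this cell can be illustrated on plain dictionaries. This is a hedged sketch with invented numbers; `rank_by_average` and its `min_businesses` parameter are hypothetical names, with the default of 20 mirroring the threshold used above.

```python
def rank_by_average(city_rows, min_businesses=20):
    """Drop cities with too few businesses, then rank by average reviews."""
    ranked = []
    for row in city_rows:
        if row["business_count"] < min_businesses:
            continue  # too few businesses for a stable average
        enriched = dict(row)  # copy so the input is not modified
        enriched["avg"] = round(row["review_count"] / float(row["business_count"]))
        ranked.append(enriched)
    ranked.sort(key=lambda r: r["avg"], reverse=True)
    return ranked


# Invented city summaries.
sample_city_rows = [
    {"city": "A", "review_count": 1000, "business_count": 25},
    {"city": "B", "review_count": 90, "business_count": 2},
    {"city": "C", "review_count": 3000, "business_count": 30},
]

ranked_cities = rank_by_average(sample_city_rows)
print([(r["city"], r["avg"]) for r in ranked_cities])
```

City `B` is excluded because an average over two businesses says little about the city as a whole.
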

In [66]:
# This cell will create an enhanced output for more meaningful comparison
# Adapted from Plotly US Bubble Map https://plot.ly/python/bubble-maps/

# plot average reviews per business
df = pd.DataFrame(sortedAvgData)
df.head()

# set hover text box
df['text'] = df['city'] + '<br>Avg Reviews per Business: ' + df['avg'].astype(str) + '<br>Total Reviews: ' + df['review_count'].astype(str)+ '<br>Total Businesses: ' + (df['business_count']).astype(str)
# set row-index ranges (by avg rank) for each bubble color group;
# note that each slice excludes its end index
limits = [(0,4), (5,24), (25,99), (100,174), (175,249)]
colors = ["rgb(0,116,217)","rgb(255,65,54)","rgb(133,200,75)","rgb(255,133,27)","rgb(50,150,133)"]
cities = []
scale = 0.1

# place bubbles
for i in range(len(limits)):
    lim = limits[i]
    df_sub = df[lim[0]:lim[1]]
    city = dict(
                type = 'scattergeo',
                locationmode = 'USA-states',
                lon = df_sub['longitude'],
                lat = df_sub['latitude'],
                text = df_sub['text'],
                marker = dict(
                              size = df_sub['avg']/scale,
                              color = colors[i],
                              line = dict(width=0.0, color='rgb(40,40,40)'),
                              sizemode = 'area'
                              ),
                name = '{0} - {1}'.format(lim[0]+1,lim[1]+1) )
    cities.append(city)

# set map title and background
layout = dict(
              title = 'Yelp Reviews/Businesses per City',
              showlegend = False,
              geo = dict(
                         scope='usa',
                         projection=dict( type='albers usa' ),
                         showland = True,
                         landcolor = 'rgb(217, 217, 217)',
                         subunitwidth=1,
                         countrywidth=1,
                         subunitcolor="rgb(255, 255, 255)",
                         countrycolor="rgb(255, 255, 255)"
                         ),
              )
# show map
fig = dict( data=cities, layout=layout )
py.iplot( fig, validate=False, filename='d3-bubble-map-populations.html')


High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~jennifersue/0 or inside your plot.ly account where it is named 'd3-bubble-map-populations.html'


# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.
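
As one possible way to present a result as a table, the sketch below formats the five largest review totals printed earlier in this notebook. The `format_table` helper is hypothetical and uses only the standard library.

```python
# Review totals copied from the sorted city output earlier in this notebook.
top_cities = [
    ("Las Vegas", 1604173),
    ("Phoenix", 576709),
    ("Toronto", 430923),
    ("Scottsdale", 308529),
    ("Charlotte", 237115),
]


def format_table(rows, headers=("City", "Reviews")):
    """Return a plain-text table with left-aligned names and right-aligned counts."""
    width = max(len(headers[0]), max(len(name) for name, _ in rows))
    lines = ["{0:<{w}}  {1:>10}".format(headers[0], headers[1], w=width)]
    for name, count in rows:
        lines.append("{0:<{w}}  {1:>10,}".format(name, count, w=width))
    return "\n".join(lines)


table = format_table(top_cities)
print(table)
```
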


---
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. Each team will present its case study in class for 10 minutes.

Please compress all the files in a zipped file.


**How to submit:**

        Please submit through Canvas, in the Assignment "Case Study 2".
        
**Note: Each team only needs to submit one submission in Canvas.**


# Peer-Review Grading Template:

**Total Points: (100 points)** Please don't worry about the absolute scores; we will rescale the final grading according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (10 points) how well did the team describe the problem they are trying to solve using the data? 
       0: not clear
       2: I can barely understand the problem
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
    
    2. (10 points) do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, which I want to know the answer myself
       10: very important, I would be happy to invest money in a project like this.
    
    ----------------------------------
    Data Collection and Processing:
    ----------------------------------
    
    3. (10 points) Do you think the data collected/processed are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect/process
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following task:
    
    (1) Finding the most popular business categories (5 points):
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (2) Find the most popular business objects (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    -----------------------------------
    The Solution
    -----------------------------------
    5.  how well did the team describe the solution they used to solve the problem? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
       
    6. how well does the solution solve the problem? (10 points)
       0: not relevant
       2: barely relevant to the problem
       4: okay solution, but there is an easier solution.
       6: good, but can be improved
       8: very good, but solution is simple/old
       10: innovative and technically sound
       
    7. how well did the team implement the solution in python? (10 points)
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
     8.  How well did the team present the results they found in the data? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
     9.  What do you think of the results they found in the data?  (5 points)
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: make sense, but not interesting
       5: make sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How well do all the different parts (data, problem, solution, result) fit together as a coherent story?
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 10 minutes for presentation?  
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. What do you think of the presentation (overall quality)?
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


    -----------------------------------
    Overall: 
    -----------------------------------
    13. How many points out of the 100 do you give to this project in total?  Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.
    Total score:
    
    14. What are the strengths of this project? Briefly, list up to 3 strengths.
       1: 
       2:
       3:
    
    15. What are the weaknesses of this project? Briefly, list up to 3 weaknesses.
       1:
       2:
       3:
    
    16. Detailed comments and suggestions. What suggestions do you have for this project to improve its quality further?
    
    
    

    ---------------------------------
    Your Vote: 
    ---------------------------------
    1. [Overall Quality] Between the two submissions that you are reviewing, which team would you vote for a better score?  (5 bonus points)
        0: I vote the other team is better than this team
        5: I vote this team is better than the other team 
        
    2. [Presentation] Among all the teams in the presentation, which team do you think deserves the best presentation award for this case study?  
        1: Team 1
        2: Team 2
        3: Team 3
        4: Team 4
        5: Team 5
        6: Team 6
        7: Team 7
        8: Team 8
        9: Team 9


