# Case Study 2: Data Science in Yelp Data

**Required Readings:** 
* [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge) 
* Please download the Yelp dataset from the above webpage.
* [TED Talks](https://www.ted.com/talks) for examples of 10-minute talks.


**NOTE**
* Please don't forget to save the notebook frequently when working in Jupyter Notebook, otherwise the changes you made can be lost.

*----------------------

Here is an example of the data format. More details are included [here](https://www.yelp.com/dataset_challenge).

## Business Objects

Business objects contain basic information about local businesses. The fields are as follows:

```json
{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)],
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
```
## Checkin Objects
```json
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for an hour-day block, it will not be in the dict
}
```
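
The `checkin_info` keys above pack an hour and a weekday into one string. As a minimal sketch, assuming the layout shown (hour first, then a day index running from 0 = Sunday to 6 = Saturday), the hypothetical helpers below decode a key and total the check-ins per weekday. The sample counts are made up for illustration and are not real Yelp data.

```python
from collections import defaultdict

# Day index as implied by the examples above: 0 = Sunday ... 6 = Saturday.
DAY_NAMES = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
             'Thursday', 'Friday', 'Saturday']


def decode_block(key):
    """Split an 'hour-day' key such as '14-4' into (hour, day name)."""
    hour_str, day_str = key.split('-')
    return int(hour_str), DAY_NAMES[int(day_str)]


def checkins_per_day(checkin_info):
    """Sum the check-in counts of every hour block, grouped by weekday."""
    totals = defaultdict(int)
    for key, count in checkin_info.items():
        _, day = decode_block(key)
        totals[day] += count
    return dict(totals)


# Made-up sample in the documented format (not real Yelp data).
sample = {'0-0': 3, '1-0': 1, '14-4': 7, '23-6': 2}
per_day = checkins_per_day(sample)
print(decode_block('14-4'))  # (14, 'Thursday')
print(per_day)
```

Blocks without any check-in are absent from the dictionary, so a missing weekday simply means zero check-ins for that day.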

# Problem: pick a data science problem that you plan to solve using Yelp Data
* The problem should be important and interesting, with a potential impact in some area.
* The problem should be solvable using Yelp data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?

In [None]:
In the previous case study, several groups performed sentiment analysis on tweets using dictionaries of words that were
weighted towards a positive or negative sentiment. Creating such a dictionary usually requires advanced machine learning
techniques. Yelp offers a simpler route: if we assume that reviews with higher ratings contain a larger
concentration of positive words, and that reviews with lower ratings contain a larger concentration of negative
words, then the star ratings can be used to discover which words and phrases carry a positive or negative
sentiment.

This data would also be useful for any business that is heavily reviewed on Yelp: it would provide a scalable way to learn which
things customers mention negatively and could be improved, and which good things the business is known for.























# Data Collection/Processing: 

In [None]:
# The data was downloaded from the Yelp Dataset Challenge website, so no collection code was needed.
# We used MRJob to process a portion of this data.







# Data Exploration: Exploring the Yelp Dataset

**(1) Finding the most popular business categories:** 
* print the top 10 most popular business categories in the dataset and their counts (i.e., how many business objects in each category). Here we say a category is "popular" if there are many business objects in this category (such as 'restaurants').

In [None]:
import json
from collections import Counter

import prettytable

# Count how many business objects list each category.
categ = Counter()
with open('yelp_academic_dataset_business.json', 'r') as data:
    for line in data:
        business = json.loads(line.strip())
        categories = business['categories']
        if categories is None:  # some businesses have no categories
            continue
        categ.update(categories)

# Get the ten most frequent categories as (category, count) pairs
top_categories = categ.most_common(10)

table = prettytable.PrettyTable(['Category','Count of businesses in category'])
for row in top_categories:
    table.add_row(list(row))
print(table)














In [None]:
+------------------+---------------------------------+
|     Category     | Count of businesses in category |
+------------------+---------------------------------+
|   Restaurants    |              48485              |
|     Shopping     |              22466              |
|       Food       |              21189              |
|  Beauty & Spas   |              13711              |
|  Home Services   |              11241              |
|    Nightlife     |              10524              |
| Health & Medical |              10476              |
|       Bars       |               9087              |
|    Automotive    |               8554              |
|  Local Services  |               8133              |
+------------------+---------------------------------+

**(2) Finding the most popular business objects:**
* print the top 10 most popular business objects in the dataset and their counts (i.e., how many checkins in total for each business object).  Here we say a business object is "popular" if the business object attracts a large number of checkins from the users.

1) Go through the ...checkin.json file, counting the number of checkins for each business id.
2) Go through the ...business.json file, creating a lookup table that maps each business id to a business name.
3) Use the lookup table to convert the business ids to names.
4) Display the table with counts.

In [None]:
# WARNING: This takes a very long time. Results have been
# recorded below.
import json, sys
import prettytable

business_id_counts = {}

i = 0
with open('yelp_academic_dataset_checkin.json', 'r') as data:
    for line in data:
        checkin = json.loads(line.strip())
        business_id = checkin['business_id']
        if business_id is not None:
            # NOTE: len() counts the entries in checkin['time']. If that field
            # holds one entry per hour-day block, this is the number of blocks
            # with at least one checkin (at most 24 * 7 = 168), not the total
            # number of checkins.
            business_id_counts[business_id] = (
                business_id_counts.get(business_id, 0) + len(checkin['time']))
        i += 1
        # Progress indicator, rewritten in place on a single line
        sys.stdout.write('{0} counted.\r'.format(i))

# Get the ten businesses with the highest counts
sorted_counts = sorted(business_id_counts.items(), key=lambda x: x[1], reverse=True)
table = prettytable.PrettyTable(['Business ID','Count'])
out_counts = {}
for row in sorted_counts[0:10]:
    table.add_row(list(row))
    out_counts[row[0]] = row[1]

# Save the table as text (write() only accepts strings, hence str())
with open('checkin_table.txt', 'w') as outfile:
    outfile.write(str(table))

with open('checkin_table.json', 'w') as outfile:
    outfile.write(json.dumps(out_counts, indent=2))



In [None]:
Generated table:
+------------------------+-------+
|      Business ID       | Count |
+------------------------+-------+
| FaHADZARwnY4yvlvpnsfGA |  168  |
| t-o_Sraneime4DDhWrQRBA |  168  |
| VxCnyVYn-FFgv6F1EqbdKA |  168  |
| -kG0N8sBhBotMbu0KVSPaw |  168  |
| na4Th5DrNauOv-c43QQFvA |  168  |
| gRCEObNuHtI61xR32ytqNQ |  168  |
| O3lQvyOADBs7f2W8A5D0Yg |  168  |
| jKmAswXvFVRHN4VP-88zOA |  168  |
| u_vPjx925UPEG9DFOAAvFQ |  168  |
| XXW_OFaYQkkGOGniujZFHg |  168  |
+------------------------+-------+
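
Every count in the table is 168, which equals 24 × 7, the number of hour-day blocks in a week. That suggests the count reflects how many blocks have at least one check-in rather than how many check-ins occurred. A minimal sketch of the difference, assuming the `checkin_info` layout documented in the Checkin Objects section (the dataset version used here may store the data differently, and the sample values and function names are made up):

```python
def occupied_blocks(checkin_info):
    """Number of hour-day blocks that have at least one check-in (max 168)."""
    return len(checkin_info)


def total_checkins(checkin_info):
    """Total number of check-ins, summed over all hour-day blocks."""
    return sum(checkin_info.values())


# Made-up samples: a business visited at all hours vs. a busy daytime one.
always_open = {'{0}-{1}'.format(h, d): 1 for h in range(24) for d in range(7)}
busy_daytime = {'12-5': 400, '13-5': 350, '19-6': 500}

print(occupied_blocks(always_open), total_checkins(always_open))    # 168 168
print(occupied_blocks(busy_daytime), total_checkins(busy_daytime))  # 3 1250
```

Ranking by `total_checkins` would put the second business first, while ranking by block count favours any business that is visited around the clock, such as a hotel or an airport.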

In [None]:
# Create a lookup table that maps each business id to its name
import json

business_lookup = {}

with open('yelp_academic_dataset_business.json','r') as data:
    for line in data:
        business = json.loads(line.strip())
        business_lookup[business['business_id']] = business['name']

with open('business_lookup.json','w') as lookup_file:
    lookup_file.write(json.dumps(business_lookup, indent=2))
print('Writing complete.')



Now we have created our lookup table and stored it in business_lookup.json.
We have also created a table of the most popular businesses (identified by business id) in checkin_table.json.
All we need to do is load both tables and look up each id.

In [8]:
import json, prettytable

# The file to decode each business id
with open('business_lookup.json','r') as lookup_file:
    read = lookup_file.read().strip()
    lookup_table = json.loads(read)

# The file with the top ten business ids and counts
with open('checkin_table.json','r') as checkin_file:
    read = checkin_file.read().strip()
    checkin_table = json.loads(read)

final_table = {}

for business_id in checkin_table.keys():
    business_name = lookup_table[business_id]
    business_count = checkin_table[business_id]

    final_table[business_name] = business_count

table = prettytable.PrettyTable(['Business Name','Checkin Count'])
sorted_items = sorted(final_table.items(), key=lambda x: x[1], reverse=True)
for row in sorted_items:
    table.add_row(row)

with open('business_count_table.txt', 'w') as outfile:
    outfile.write(str(table))
    
    


In [None]:
+---------------------------------------------+---------------+
|                Business Name                | Checkin Count |
+---------------------------------------------+---------------+
|                Bellagio Hotel               |      168      |
|    Toronto Pearson International Airport    |      168      |
|            24 Hour Fitness Sport            |      168      |
|          Gold Coast Hotel & Casino          |      168      |
|         Mandarin Oriental, Las Vegas        |      168      |
|        Palace Station Hotel & Casino        |      168      |
|                Wynn Las Vegas               |      168      |
|        McCarran International Airport       |      168      |
| The Peppermill Restaurant & Fireside Lounge |      168      |
|      Flamingo Las Vegas Hotel & Casino      |      168      |
+---------------------------------------------+---------------+

# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

In [None]:
Reverse Sentiment Analysis
To estimate the sentiment (positive, neutral, or negative) of any given word, we used the following process:
    1. Go through each review and remove all punctuation and other non-alphabetic characters. Convert the text to lowercase.
    2. Remove words with irrelevant parts of speech (coordinating conjunctions, prepositions, determiners, and "to").
    3. For each remaining word in the review, record the review's star rating for that word.
    4. When all reviews have been processed, average the recorded ratings for each word. This average is the word's final score. Words used fewer than 10 times are discarded.
    
    Under this simple preliminary analysis, words with a higher score appear mostly in positive reviews, and words with a lower
    score appear mostly in negative reviews.
    
We can then use these word scores to predict the rating of any given review: averaging the scores of the words in a review
gives a rough prediction. The graph in the figures section shows how effective this method is.
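
The process described above can be sketched in plain Python, without MRJob. This is a minimal illustration under stated assumptions: the three reviews are made up, a small hand-written stop-word set stands in for the part-of-speech filter, the minimum-count cut-off is omitted, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Stand-in for the part-of-speech filter (assumption: a tiny stop-word set).
STOP_WORDS = {'the', 'and', 'a', 'to', 'of', 'in', 'but', 'was', 'were'}


def tokenize(text):
    """Lowercase, keep letters and spaces only, drop stop words."""
    cleaned = re.sub(r'[^a-z ]', '', text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def word_scores(reviews):
    """Average the star ratings over every occurrence of each word."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for review in reviews:
        for word in tokenize(review['text']):
            totals[word] += review['stars']
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def predict(text, scores):
    """Predict a rating as the mean score of the known words in the text."""
    known = [scores[w] for w in tokenize(text) if w in scores]
    return sum(known) / len(known) if known else None


# Made-up reviews (not real Yelp data).
reviews = [
    {'text': 'Friendly staff and great food!', 'stars': 5},
    {'text': 'Great location, but the staff were rude.', 'stars': 2},
    {'text': 'Rude, slow service.', 'stars': 1},
]
scores = word_scores(reviews)
print(scores['great'])  # 3.5
print(scores['rude'])   # 1.5
print(predict('Great food, rude staff', scores))  # 3.375
```

With so few reviews, a word such as "great" lands in the middle simply because it appears once in a good review and once in a bad one, which is why a minimum-count cut-off matters on real data.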











Write code to implement the solution in Python:

In [None]:
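# This script is run from the command line, with the review file piped in on standard input.
from mrjob.job import MRJob
import json, nltk
from nltk.stem.snowball import SnowballStemmer
# Penn Treebank tags to drop: coordinating conjunction (CC), preposition or
# subordinating conjunction (IN), "to" (TO), and determiner (DT)
undesirables = ['CC','IN','TO','DT']
stemmer = SnowballStemmer('english')  # only referenced by the commented-out stemming step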

class MRWordScore(MRJob):

    def mapper(self, _, line):
        review = json.loads(line.strip())
        cleaned_text = clean_text(review['text'])
        clean_words = cleaned_text.split()
        cleaner_text = remove_unecessary_pos(cleaned_text).split()
        #stemmed_words = [stemmer.stem(word) for word in cleaner_text.split()]
        score = float(review['stars'])
        for word in cleaner_text:
            yield (word, score)

    '''def combiner(self, word, counts):
        length = len(list(counts))
        if length == 0:
            yield (word, 0)
        yield (word, int(sum(counts)/length)) # Average score'''

    def reducer(self, word, counts):
        # 'counts' holds the star ratings emitted by the mapper for this word
        total = 0
        count = 0
        for num in counts:
            total += num
            count += 1
        # Skip words that were used fewer than 10 times
        if count < 10:
            return
        yield (word, {'score':total/count, 'count':count}) # Average score

def clean_text(text):
    # Get rid of all bad characters, and uppercase.
    new_text = ''
    for c in text:
        if c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
            new_text += c.lower()
        elif c in 'abcdefghijklmnopqrstuvwxyz ':
            new_text += c
    return new_text

def remove_unecessary_pos(text):
    pos_list = nltk.pos_tag(text.split())
    #print pos_list
    new_text = ''
    for pos in pos_list:
        if pos[1] in undesirables:
            pass
        else:
            new_text += pos[0] + ' '
    return new_text


#dataset = open('yelp_academic_dataset_review.json','r')

#clean_reviews = []

def write_data(clean_reviews):
    # Unused helper: not called anywhere in this script.
    with open('clean_reviews.json','a') as outfile:
        print('Writing {0} reviews to file.'.format(len(clean_reviews)))
        for review in clean_reviews:
            outfile.write(json.dumps(review) + '\n')

if __name__ == '__main__':
    MRWordScore.run()
    

















In [None]:
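# Used to predict review scores.
# sys.argv[1]: word-score file (one word per line with its 'score' and 'count')
# sys.argv[2]: review file (one JSON review per line)
import sys, json, math, nltk
undesirables = ['CC','IN','TO','DT']

def log(n):
    # Weight for a word's average score: natural log of its use count, shifted down by 1.3
    return math.log(n) - 1.3

def calcscore(score):
    # Map a review's mean weighted word score to a 1-5 star prediction using fixed cut-offs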
    if score < 24.17:
        return 1
    elif score < 24.75:
        return 2
    elif score < 25.2:
        return 3
    elif score < 25.45:
        return 4
    else:
        return 5

# Load the word scores. Each line holds a quoted word, whitespace, then a JSON
# object: "word" {"score": ..., "count": ...}
words = {}
with open(sys.argv[1],'r') as wordlist:
    for line in wordlist:
        key, info = line.strip().split(None, 1)
        word = json.loads(key)
        infoDict = json.loads(info)
        # Weight the average score by the shifted log of the word's count
        infoDict['score'] = infoDict['score']*log(infoDict['count'])
        words[word] = infoDict

'''sorted_words = sorted(words, key=lambda x: x['score'], reverse=True)

for wordDict in sorted_words:
    score = wordDict['score']
    count = wordDict['count']
    print wordDict['word'], '{0:.2f}'.format(wordDict['score'])'''

def clean_text(text):
    # Get rid of all bad characters, and uppercase.
    new_text = ''
    for c in text:
        if c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
            new_text += c.lower()
        elif c in 'abcdefghijklmnopqrstuvwxyz ':
            new_text += c
    return new_text

def remove_unecessary_pos(text):
    pos_list = nltk.pos_tag(text.split())
    #print pos_list
    new_text = ''
    for pos in pos_list:
        if pos[1] in undesirables:
            pass
        else:
            new_text += pos[0] + ' '
    return new_text


scores = {}  # actual star rating -> list of mean weighted scores

i = 0
for line in open(sys.argv[2], 'r'):
    review = json.loads(line.strip())
    cleaned_text = clean_text(review['text'])
    cleaner_words = remove_unecessary_pos(cleaned_text).split()
    total = 0
    count = 0
    for word in cleaner_words:
        if word in words:
            total += words[word]['score']
            count += 1
    # Skip reviews that contain no known words
    if count == 0:
        continue
    score = total/count
    print('Our score: {0} | Their score: {1}'.format(calcscore(score), review['stars']))
    scores.setdefault(review['stars'], []).append(score)
    # Only evaluate a small sample of reviews
    i += 1
    if i > 50:
        break

# Mean weighted score for each actual star rating
for key, scorelist in scores.items():
    print('Average for {0} : {1}'.format(key, sum(scorelist)/len(scorelist)))
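
The script above does two things to turn word averages into a star prediction: it scales each word's average by `log(count) - 1.3`, and it maps the mean weighted score of a review to 1-5 stars with fixed cut-offs. A compact sketch of both steps follows, reusing the constants from the script. The function names and the use of `bisect` are illustrative choices, not part of the original code.

```python
import bisect
import math

# Cut-offs copied from calcscore in the script above.
THRESHOLDS = [24.17, 24.75, 25.2, 25.45]


def weighted_score(avg_stars, count):
    """Scale a word's average rating by log(count) - 1.3, as the script does."""
    return avg_stars * (math.log(count) - 1.3)


def stars_from_score(score):
    """Map a mean weighted score to 1-5 stars using the fixed cut-offs."""
    return bisect.bisect_right(THRESHOLDS, score) + 1


print(round(weighted_score(4.0, 10), 3))  # 4.01
print([stars_from_score(s) for s in (24.0, 24.17, 25.0, 25.3, 26.0)])
# [1, 2, 3, 4, 5]
```

Because the weight grows with the logarithm of the count, frequently used words contribute larger values to a review's mean than rare ones. The cut-offs are tied to this particular word list and would likely need re-tuning for a different one.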





            


# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.


In [None]:
Most positive words (with > 100 uses)
+-------------+-----------------------+
|     Word    | Score (Average stars) |
+-------------+-----------------------+
|   crossfit  |          4.81         |
|   listens   |          4.81         |
|    debbie   |          4.75         |
|  meticulous |          4.73         |
|  supportive |          4.70         |
|   talented  |          4.66         |
|   painless  |          4.66         |
| trustworthy |          4.66         |
|   workouts  |          4.63         |
|    kelly    |          4.63         |
+-------------+-----------------------+
Most negative words (with > 100 uses)
+----------------+-----------------------+
|      Word      | Score (Average stars) |
+----------------+-----------------------+
| unprofessional |          1.23         |
|     rudely     |          1.32         |
|  incompetent   |          1.32         |
|      scam      |          1.33         |
|     refund     |          1.37         |
|   dishonest    |          1.41         |
|   disgusted    |          1.44         |
|   unhelpful    |          1.44         |
|      bbb       |          1.44         |
|    apology     |          1.44         |
+----------------+-----------------------+
Most neutral words (with > 100 uses)
+-----------+-----------------------+
|    Word   | Score (Average stars) |
+-----------+-----------------------+
|   boxes   |          3.48         |
|    here   |          3.48         |
|    size   |          3.48         |
|   forget  |          3.48         |
|  toppings |          3.48         |
|    does   |          3.48         |
|  squeeze  |          3.49         |
| croissant |          3.49         |
|    gast   |          3.49         |
|   bench   |          3.49         |
+-----------+-----------------------+

Please also see the graph in the same folder as this notebook.
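
Tables like the ones above can be produced from the word-score output with a few sorts. The sketch below is an assumption about the procedure, not the code that generated these tables: it keeps words used more than `min_count` times, then takes the highest averages, the lowest averages, and, as one possible definition of "neutral", the words closest to the median average. The function name and the use counts in the sample are made up.

```python
def rank_words(word_stats, min_count=100, top_n=2):
    """Return (most positive, most negative, most neutral) word lists."""
    # Keep only words that were used more than min_count times.
    frequent = {w: s['score'] for w, s in word_stats.items()
                if s['count'] > min_count}
    by_score = sorted(frequent, key=frequent.get)  # ascending average
    median = frequent[by_score[len(by_score) // 2]]
    positive = by_score[::-1][:top_n]
    negative = by_score[:top_n]
    # Assumption: "neutral" means closest to the median average score.
    neutral = sorted(frequent, key=lambda w: abs(frequent[w] - median))[:top_n]
    return positive, negative, neutral


# Scores borrowed from the tables above; the counts are made up.
word_stats = {
    'talented': {'score': 4.66, 'count': 150},
    'scam': {'score': 1.33, 'count': 400},
    'refund': {'score': 1.37, 'count': 300},
    'size': {'score': 3.48, 'count': 900},
    'bench': {'score': 3.49, 'count': 120},
    'rare': {'score': 5.0, 'count': 12},  # dropped: too few uses
}
positive, negative, neutral = rank_words(word_stats)
print(positive)  # ['talented', 'bench']
print(negative)  # ['scam', 'refund']
print(neutral)   # ['size', 'bench']
```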







*-----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: Please prepare PPT slides (for a 10-minute talk) to present the case study. Each team will present its case study in class for 10 minutes.

Please compress all the files into a single zip file.


**How to submit:**

        Please submit through Canvas, in the Assignment "Case Study 2".
        
**Note: Each team only needs to submit one submission in Canvas.**


# Peer-Review Grading Template:

**Total Points: (100 points)** Please don't worry about the absolute scores; we will rescale the final grading according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (10 points) how well did the team describe the problem they are trying to solve using the data? 
       0: not clear
       2: I can barely understand the problem
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
    
    2. (10 points) do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, which I want to know the answer myself
       10: very important, I would be happy to invest money in a project like this.
    
    ----------------------------------
    Data Collection and Processing:
    ----------------------------------
    
    3. (10 points) Do you think the data collected/processed are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect/process
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following task:
    
    (1) Finding the most popular business categories (5 points):
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (2) Find the most popular business objects (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    -----------------------------------
    The Solution
    -----------------------------------
    5.  how well did the team describe the solution they used to solve the problem? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
       
    6. How well does the solution solve the problem? (10 points)
       0: not relevant
       2: barely relevant to the problem
       4: okay solution, but there is an easier solution.
       6: good, but can be improved
       8: very good, but solution is simple/old
       10: innovative and technically sound
       
    7. how well did the team implement the solution in python? (10 points)
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
     8.  How well did the team present the results they found in the data? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
     9.  What do you think of the results they found in the data?  (5 points)
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: makes sense, but not interesting
       5: makes sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How well do all the different parts (data, problem, solution, result) fit together as a coherent story?  
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 10 minutes for presentation?  
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. How would you rate the presentation (overall quality)?  
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


    -----------------------------------
    Overall: 
    -----------------------------------
    13. How many points out of the 100 do you give to this project in total?  Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.
    Total score:
    
    14. What are the strengths of this project? Briefly, list up to 3 strengths.
       1: 
       2:
       3:
    
    15. What are the weaknesses of this project? Briefly, list up to 3 weaknesses.
       1:
       2:
       3:
    
    16. Detailed comments and suggestions. What suggestions do you have for this project to improve its quality further.
    
    
    

    ---------------------------------
    Your Vote: 
    ---------------------------------
    1. [Overall Quality] Between the two submissions that you are reviewing, which team would you vote for a better score?  (5 bonus points)
        0: I vote the other team is better than this team
        5: I vote this team is better than the other team 
        
    2. [Presentation] Among all the teams in the presentation, which team do you think deserves the best presentation award for this case study?  
        1: Team 1
        2: Team 2
        3: Team 3
        4: Team 4
        5: Team 5
        6: Team 6
        7: Team 7
        8: Team 8
        9: Team 9


