# Data Mining Project

The goal of this task is to explore the Yelp data set to get a sense of what the data look like and what their characteristics are. The goal can be framed as answering questions such as:
1. What are the major topics in the reviews? Are they different in the positive and negative reviews? Are they different for different cuisines?   
2. What does the distribution of the number of reviews over other variables (e.g., cuisine, location) look like?
3. What does the distribution of ratings look like?

As with any project, we start with a little...

# Exploratory Data Analysis

In [1]:
import numpy as np
import pandas as pd

from string import punctuation    # used for cleaning the review text

First we read in the data with a little help from pandas.

In [2]:
reviews_df = pd.read_json("yelp_academic_dataset_review.json", lines=True)
reviews_df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,vcNAWiLM4dR7D2nwwJ7nCA,2007-05-17,15SdjuK7DmYqUAj6rjGowg,5,dr. goldberg offers everything i look for in a...,review,Xqd0DzHaiyRqVH3WRG7hzg,"{'funny': 0, 'useful': 2, 'cool': 1}"
1,vcNAWiLM4dR7D2nwwJ7nCA,2010-03-22,RF6UnRTtG7tWMcrO2GEoAg,2,"Unfortunately, the frustration of being Dr. Go...",review,H1kH6QZV7Le4zqTRNxoZow,"{'funny': 0, 'useful': 2, 'cool': 0}"
2,vcNAWiLM4dR7D2nwwJ7nCA,2012-02-14,-TsVN230RCkLYKBeLsuz7A,4,Dr. Goldberg has been my doctor for years and ...,review,zvJCcrpm2yOZrxKffwGQLA,"{'funny': 0, 'useful': 1, 'cool': 1}"
3,vcNAWiLM4dR7D2nwwJ7nCA,2012-03-02,dNocEAyUucjT371NNND41Q,4,Been going to Dr. Goldberg for over 10 years. ...,review,KBLW4wJA_fwoWmMhiHRVOA,"{'funny': 0, 'useful': 0, 'cool': 0}"
4,vcNAWiLM4dR7D2nwwJ7nCA,2012-05-15,ebcN2aqmNUuYNoyvQErgnA,4,Got a letter in the mail last week that said D...,review,zvJCcrpm2yOZrxKffwGQLA,"{'funny': 0, 'useful': 2, 'cool': 1}"


In [3]:
reviews_df.shape

(1125458, 8)

We have a problem: so many reviews! I want to take a subset of the data. Type of business seems like a good place to start. Let's see how many distinct businesses have reviews.

In [4]:
reviews_df.business_id.nunique()

41958

That's a lot of businesses. Since food is what gets me out of bed every day, I only want to look at restaurants. How to do that? We need to read in the business dataset and see which business_ids belong to restaurants. Let's go.

In [5]:
business_df = pd.read_json("yelp_academic_dataset_business.json", lines=True)
business_df.shape

(42153, 15)

Interestingly, that is more than the number of businesses in the reviews data (42,153 vs. 41,958), which suggests that not all places have reviews. That seems reasonable. This dataset is smaller than the reviews dataset but still huge! Let's take a sneak peek.

In [6]:
business_df.head()

Unnamed: 0,attributes,business_id,categories,city,full_address,hours,latitude,longitude,name,neighborhoods,open,review_count,stars,state,type
0,{'By Appointment Only': True},vcNAWiLM4dR7D2nwwJ7nCA,"[Doctors, Health & Medical]",Phoenix,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...","{'Tuesday': {'close': '17:00', 'open': '08:00'...",33.499313,-111.983758,"Eric Goldberg, MD",[],True,7,3.5,AZ,business
1,"{'Take-out': True, 'Good For': {'dessert': Fal...",JwUE5GmEO-sH1FuwJgKBlQ,[Restaurants],De Forest,"6162 US Highway 51\nDe Forest, WI 53532",{},43.238893,-89.335844,Pine Cone Restaurant,[],True,26,4.0,WI,business
2,"{'Take-out': True, 'Good For': {'dessert': Fal...",uGykseHzyS5xAMWoN6YUqA,"[American (Traditional), Restaurants]",De Forest,"505 W North St\nDe Forest, WI 53532","{'Monday': {'close': '22:00', 'open': '06:00'}...",43.252267,-89.353437,Deforest Family Restaurant,[],True,16,4.0,WI,business
3,"{'Take-out': True, 'Wi-Fi': 'free', 'Takes Res...",LRKJF43s9-3jG9Lgx4zODg,"[Food, Ice Cream & Frozen Yogurt, Fast Food, R...",De Forest,"4910 County Rd V\nDe Forest, WI 53532","{'Monday': {'close': '22:00', 'open': '10:30'}...",43.251045,-89.374983,Culver's,[],True,7,4.5,WI,business
4,"{'Take-out': True, 'Has TV': False, 'Outdoor S...",RgDg-k9S5YD_BaxMckifkg,"[Chinese, Restaurants]",De Forest,"631 S Main St\nDe Forest, WI 53532","{'Monday': {'close': '22:00', 'open': '11:00'}...",43.240875,-89.343722,Chang Jiang Chinese Kitchen,[],True,3,4.0,WI,business


We can see that the column we want is the categories column. It's a list of categories, so we look for the word Restaurants in that column to build a new dataframe. I am cheating and converting the column to a string because dealing with that list was too much.

In [7]:
business_df['categories'] = business_df['categories'].apply(', '.join)
restaurants_df = business_df[business_df['categories'].str.contains("Restaurants")]
restaurants_df.shape

(14303, 15)
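The string conversion works, but `str.contains` does substring matching, so it would also match any category whose name merely contains the text "Restaurants". A minimal sketch of filtering on the list column directly (that is, before the `', '.join` step), assuming every `categories` entry is a list. The frame `toy_business_df` and its rows are invented for illustration and are not from the dataset:

```python
import pandas as pd

# Toy stand-in for business_df; the ids and categories are invented.
toy_business_df = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": [["Doctors", "Health & Medical"],
                   ["Chinese", "Restaurants"],
                   ["Restaurants"]],
})

# Exact list membership: True only when "Restaurants" is one of the categories.
is_restaurant = toy_business_df["categories"].apply(lambda cats: "Restaurants" in cats)
toy_restaurants_df = toy_business_df[is_restaurant]

print(toy_restaurants_df["business_id"].tolist())
```

Whether the two approaches select the same rows on the real data depends on the category names present, which is not checked here.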

In [8]:
restaurants = list(restaurants_df['business_id'])
restaurants[0:10]

['JwUE5GmEO-sH1FuwJgKBlQ',
 'uGykseHzyS5xAMWoN6YUqA',
 'LRKJF43s9-3jG9Lgx4zODg',
 'RgDg-k9S5YD_BaxMckifkg',
 'rdAdANPNOcvUtoFgcaY9KA',
 '_wZTYYL7cutanzAnJUTGMA',
 'zOc8lbjViUZajbY7M0aUCQ',
 'UgjVZTSOaYoEvws_lAP_Dw',
 'SKLw05kEIlZcpTD5pqma8Q',
 '77ESrCo7hQ96VpCWWdvoxg']

Now I want to keep only the reviews in the main df whose business_id matches a restaurant id.

In [9]:
restaurant_reviews_df = reviews_df.loc[reviews_df['business_id'].isin(restaurants)]
restaurant_reviews_df.shape

(706646, 8)

Still pretty huge. Now we'll cut it down based on the number of reviews. We'll keep places that have at least 5 reviews because that seems like a reasonable number for the public to judge a place. Let's see what we are dealing with.

In [10]:
review_counts = dict(restaurant_reviews_df.business_id.value_counts())
review_counts

{'4bEjOyTaDG24SY5TxsaUNQ': 3695,
 '2e2e7WgqU1BnpxmQL5jbfw': 3263,
 'zt1TpTuJ6y9n551sw9TaEg': 3011,
 'YNQgak-ZLtYJQxlDwN-qIg': 2494,
 'Xhg93cMdemu5pAMkDoEdtQ': 2399,
 'tFU2Js_nbIZOrnKfYJYBBg': 2203,
 'CZjcFdvJhksq9dy58NVEzw': 2122,
 'sIyHTizqAiGu12XMLX3N3g': 2090,
 'xfwRO04KbAPw_zRotCfWQQ': 2027,
 'DO3Gk17RyJVW7zYMCtYPnw': 1819,
 'xyTJYlbE_MLouK6rCou6zg': 1690,
 'aGbjLWzcrnEx2ZmMCFm3EA': 1671,
 'lliksv-tglfUz1T3B3vgvA': 1647,
 'eq6lQI039SBLC6sHm3idGA': 1607,
 'QbmcCE_cLq4WO8ZMKImaLw': 1582,
 'PXviRcHR1mqdH4vRc2LEAQ': 1530,
 'BqrTtox0JbG-P_DKBB5bBw': 1358,
 'jOuERtVf7QePnK9ZcdH5XA': 1340,
 'HbUQ_3dlm3uCacmhTEMnuA': 1297,
 '8buIr1zBCO7OEcAQSZko7w': 1237,
 'vxxMqBaAHuWdx4impsLSSA': 1205,
 'FV16IeXJp2W6pnghTz2FAw': 1196,
 'rBPQuQgTcMtUq5-RYhY2uQ': 1191,
 'VVeogjZya58oiTxK7qUjAQ': 1178,
 'mDdqifuTrfXAOfxiLMGu5Q': 1110,
 'SsTxjxo8qvqBMvan1rzNzg': 1083,
 'JokKtdXU7zXHcr20Lrk29A': 1040,
 'DjOxXobyGDwWt89q4z1twg': 1038,
 'NsnZ5GhagXBKrCOelxVQxw': 1037,
 'tqu42L0qXzkvYKSruOz0IA': 1031,
 'EWMwV5V9

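So we see some places have over 3000 reviews and some places only have 1 (and some unlisted ones have none!). The plan was to kick out the ones with fewer than 5 reviews, but the cell below uses a much higher cutoff of 1000 to cut the data down further, so only the most-reviewed restaurants survive. Let's see what happens.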

In [11]:
good_ones = []
cutoff = 1000    # keep only restaurants with more than this many reviews

for key in review_counts.keys():
    if review_counts[key]>cutoff:
        good_ones.append(key)
        
len(good_ones)

32
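The loop in cell 11 can also be written without an explicit Python loop by masking the `value_counts()` result directly. A minimal sketch on an invented frame (`toy_reviews_df` and `toy_cutoff` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for restaurant_reviews_df; ids and the cutoff are invented.
toy_reviews_df = pd.DataFrame({
    "business_id": ["x", "x", "x", "y", "y", "z"],
    "stars": [5, 4, 3, 2, 5, 1],
})
toy_cutoff = 2

# value_counts() returns a Series indexed by business_id; a boolean mask
# selects the ids whose review count exceeds the cutoff.
counts = toy_reviews_df["business_id"].value_counts()
popular_ids = counts[counts > toy_cutoff].index

filtered_df = toy_reviews_df[toy_reviews_df["business_id"].isin(popular_ids)]
print(list(popular_ids), filtered_df.shape)
```

This skips the intermediate dict and keeps the counts available as a Series for later inspection.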

In [12]:
# .copy() avoids a possible SettingWithCopyWarning when a column is added later
restaurant_reviews_df = restaurant_reviews_df.loc[restaurant_reviews_df['business_id'].isin(good_ones)].copy()
restaurant_reviews_df.shape

(53222, 8)

That's a big decrease: from 706,646 reviews down to 53,222.

## Cleaning the Data

We still have a lot of reviews to look at (53,222), and we only really care about the text column for now. The next step is to clean the data: here we remove the punctuation and convert everything to lowercase.

In [13]:
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Converts all characters to lowercase
    3. Returns the cleaned text as a single string
    """
    # Keep only the characters that are not punctuation, lowercasing each one
    nopunc = [char.lower() for char in mess if char not in punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    return nopunc
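The character-by-character loop above is easy to read but does its work in Python for every character. A minimal sketch of the same cleaning step using `str.translate`, which should give the same result for typical English review text. The name `text_process_fast` is hypothetical and not part of the notebook:

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", punctuation)

def text_process_fast(text):
    """Remove ASCII punctuation and lowercase the text; returns a string."""
    return text.translate(_PUNCT_TABLE).lower()

print(text_process_fast("Dirty name, great pho!"))
```

Like the original, this only strips the ASCII characters in `string.punctuation`, so typographic quotes or other Unicode punctuation would pass through unchanged.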

In [14]:
restaurant_reviews_df['text_clean'] = restaurant_reviews_df['text'].apply(text_process)
restaurant_reviews_df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes,text_clean
99644,FV16IeXJp2W6pnghTz2FAw,2007-01-02,aH8o5WZ6mAu3FKEqUm9aDg,4,Me and la familia ate here 3 times over my 3 n...,review,sCU3KOdf_E1X1THWjhhxDw,"{'funny': 1, 'useful': 1, 'cool': 1}",me and la familia ate here 3 times over my 3 n...
99645,FV16IeXJp2W6pnghTz2FAw,2007-01-07,6WG0bt9wew39wmdyCoFteQ,4,"When we went to Vegas for New Year's, some peo...",review,YyPncV7fwWflne10BiCf6Q,"{'funny': 1, 'useful': 2, 'cool': 2}",when we went to vegas for new years some peopl...
99646,FV16IeXJp2W6pnghTz2FAw,2007-01-13,s6dt4gwoqV-0B_TOljcWEA,4,I'm giving it 3 stars because I only like 3 it...,review,C6wyE9k2vjpU4tIGsbN7Ww,"{'funny': 0, 'useful': 0, 'cool': 0}",im giving it 3 stars because i only like 3 ite...
99647,FV16IeXJp2W6pnghTz2FAw,2007-01-24,7EPKJm7bB6EsW0Ke0lRWKg,4,"I have to be honest, the first time I ate hear...",review,7Oj_nlhwmUoTQGSLegMCdQ,"{'funny': 0, 'useful': 0, 'cool': 0}",i have to be honest the first time i ate hear ...
99648,FV16IeXJp2W6pnghTz2FAw,2007-01-30,EFIVGXez5b9k43865kgJKg,4,"Dirty name, great pho! A wonderful choice on ...",review,R-2GWjkuiCyFojQZ5cmvIw,"{'funny': 1, 'useful': 1, 'cool': 0}",dirty name great pho a wonderful choice on ch...


Since we only care about the text column, we will pull that out into a list.

In [15]:
reviews = list(restaurant_reviews_df['text_clean'])
reviews[0]

'me and la familia ate here 3 times over my 3 nights stay over christmas  i think i wont be missing pho for a good couple of months i lovee the pho soup and i heard the entrees are pretty damn good too especially the clams and oysters the only reasons for the 4 stars are the service and the ventilation i always come out smelling like pho which i dont like i like my hair to smell flowery u know'

Now's the hard part. We want to use LDA to extract topics from the review text to understand what people have talked about in these reviews. First we want to vectorize the documents.

# Task 1.1: Topic Modelling

Use a topic model (e.g., PLSA or LDA) to extract topics from all the review text (or a large sample of them) and visualize the topics to understand what people have talked about in these reviews.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [17]:
# LDA is a probabilistic model over word counts, so it needs raw term counts
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(reviews)
# Note: in scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
tf_feature_names = tf_vectorizer.get_feature_names()
tf_feature_names[0:10]

['00', '000', '0000', '003100', '006', '00pm', '01', '010', '010407', '0110']

Well, that suggests that some reviews have some weird numbers in them. Oh well. Now to build the LDA model. LDA needs the number of topics to be chosen up front.
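The notebook simply picks a number of topics by hand. One common way to sanity-check such a choice is to compare held-out perplexity across a few candidate values (lower is better, although it does not always agree with how interpretable the topics look). A minimal sketch on a tiny invented corpus; the documents, the candidate values, and `random_state=0` are all assumptions made for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus, only to make the sketch runnable.
train_docs = [
    "great burger and fries", "burger with bacon and cheese",
    "long wait for a table", "server took our order after a long wait",
    "crab legs at the buffet", "buffet had fresh shrimp and crab",
]
heldout_docs = ["cheese burger and fries", "wait for a table at the buffet"]

vectorizer = CountVectorizer(stop_words="english")
train_counts = vectorizer.fit_transform(train_docs)
heldout_counts = vectorizer.transform(heldout_docs)

# Fit one model per candidate topic count and record held-out perplexity.
perplexities = {}
for n_topics in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(train_counts)
    perplexities[n_topics] = lda.perplexity(heldout_counts)

for n_topics, score in perplexities.items():
    print(n_topics, round(score, 1))
```

On a corpus this small the numbers are not meaningful; the point is only the shape of the loop, which could be applied to a held-out slice of `reviews`.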

In [18]:
no_topics=7
lda_model = LatentDirichletAllocation(n_components=no_topics, learning_method='online').fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

In [19]:
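def display_topics(H, W, feature_names, documents, no_top_words):
    """Print the top words of each topic and return them as a list of lists.

    H holds the topic-word weights (one row per topic). W and documents are
    accepted but not used here.
    """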
    all_topics = []
    print("\n\nLDA Topics \n\n")
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        topics = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        all_topics.append(topics)
        print(" ".join(topics)) 
        
    return all_topics
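The slice `topic.argsort()[:-no_top_words - 1:-1]` is the least obvious part of `display_topics`. A minimal sketch of what it does, using an invented five-word vocabulary and invented weights:

```python
import numpy as np

# Invented topic-word weights for a five-word vocabulary.
toy_feature_names = ["burger", "fries", "wait", "table", "crab"]
toy_topic = np.array([0.9, 0.7, 0.1, 0.3, 0.05])
n_top = 3

# argsort() orders indices from smallest to largest weight; the slice
# [:-n_top - 1:-1] walks backwards from the end and stops after n_top items.
top_indices = toy_topic.argsort()[:-n_top - 1:-1]
top_words = [toy_feature_names[i] for i in top_indices]

print(top_words)
```

So each topic's words come out ordered from highest to lowest weight, which is why the first word printed for a topic is its strongest one.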

In [20]:
no_top_words = 15
lda_clusters = display_topics(lda_H, lda_W, tf_feature_names, reviews, no_top_words)



LDA Topics 


Topic 0:
table wait food minutes time service came seated got server ordered restaurant took order asked
Topic 1:
burger fries sandwich ordered good chicken cheese burgers great french delicious bacon sweet bread place
Topic 2:
great ramen steak vegas best amazing service restaurant delicious excellent bouchon wine making meal perfect
Topic 3:
pizza burger gordon beer secret burgers bachi burgr fez serendipity bar bun ramsay holsteins best
Topic 4:
buffet food vegas good great place best line buffets breakfast worth quality selection price legs
Topic 5:
dessert crab good shrimp cream sauce seafood salad ice like cheese pork dishes fresh dish
Topic 6:
good just like place food really dont time im pretty didnt try think got wasnt


Now we want the data in a nice d3-friendly format for a radial dendrogram :). I want an array of objects in which each id is a path: root, topic, and top word, connected by a ".".

In [21]:
d3_array = [{'id':'Most Reviewed Yelp Restaurants'},]
count = 0

for topic in lda_clusters:
    first = {"id":'Most Reviewed Yelp Restaurants.Topic '+str(count)}
    d3_array.append(first)
    
    for word in topic:
        element = {"id":'Most Reviewed Yelp Restaurants.Topic '+str(count)+'.'+word}
        d3_array.append(element)
    
    count += 1
    
d3_array

[{'id': 'Most Reviewed Yelp Restaurants'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.table'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.wait'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.food'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.minutes'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.time'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.service'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.came'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.seated'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.got'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.server'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.ordered'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.restaurant'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.took'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.order'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic 0.asked'},
 {'id': 'Most Reviewed Yelp Restaurants.Topic

In [22]:
with open('task1topics.txt', 'w') as f:
    for x in d3_array:
        f.write('{}\n'.format(x))
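Cell 22 writes each dict with Python's string formatting, which produces single-quoted text rather than JSON. If the D3 side expects JSON (an assumption, since the visualization code is not shown here), `json.dump` would write a file that parses directly. A minimal sketch with an invented array and a temporary output path:

```python
import json
import os
import tempfile

# Small invented example in the same shape as d3_array.
toy_d3_array = [
    {"id": "Root"},
    {"id": "Root.Topic 0"},
    {"id": "Root.Topic 0.burger"},
]

# Write to a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "topics.json")
with open(out_path, "w") as f:
    json.dump(toy_d3_array, f, indent=2)

# Reading it back confirms the file is valid JSON.
with open(out_path) as f:
    loaded = json.load(f)
print(loaded == toy_d3_array)
```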

# Task 1.2
Do the same for two subsets of reviews that are interesting to compare (e.g., positive vs. negative reviews for a particular cuisine or restaurant), and visually compare the topics extracted from the two subsets to help understand the similarity and differences between these topics extracted from the two subsets.

## High and low ratings
I want to look at the different words used in high and low rating reviews. Let's see how to do this without being too overwhelmed. Let's use the same dataframe and see the spread of ratings we have.

In [23]:
# review_counts = dict(restaurant_reviews_df.business_id.value_counts())
# review_counts
restaurant_reviews_df.stars.value_counts()

5    19919
4    18671
3     8220
2     4118
1     2294
Name: stars, dtype: int64

So as expected there are many more good reviews than bad ones. Since we have so many good reviews, I'll define a good review as one with 5 stars, and a bad review as one with fewer than 3 stars. I'm going to have two lists now.
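To put numbers on that imbalance, the counts printed above can be turned into shares. A minimal sketch that copies the counts from the `value_counts()` output rather than recomputing them from the dataframe:

```python
# Star counts copied from the value_counts() output above.
star_counts = {5: 19919, 4: 18671, 3: 8220, 2: 4118, 1: 2294}
total = sum(star_counts.values())

# Share of each rating, plus the two groups compared in this task.
shares = {stars: count / total for stars, count in star_counts.items()}
good_share = shares[5]
bad_share = shares[1] + shares[2]

for stars in sorted(shares, reverse=True):
    print(stars, "{:.1%}".format(shares[stars]))
print("good (5 stars): {:.1%}, bad (< 3 stars): {:.1%}".format(good_share, bad_share))
```

With these definitions the good group (19,919 reviews) is about three times the size of the bad group (6,412 reviews), which is worth keeping in mind when comparing the topics from the two subsets.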

In [24]:
good_reviews_df = restaurant_reviews_df.loc[restaurant_reviews_df['stars'] == 5]
good_reviews = list(good_reviews_df['text_clean'])
bad_reviews_df = restaurant_reviews_df.loc[restaurant_reviews_df['stars'] < 3]
bad_reviews = list(bad_reviews_df['text_clean'])
len(good_reviews)

19919

In [25]:
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(good_reviews)
tf_feature_names = tf_vectorizer.get_feature_names()

no_topics=5
lda_model = LatentDirichletAllocation(n_components=no_topics, learning_method='online').fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

no_top_words = 15
good_review_topics = display_topics(lda_H, lda_W, tf_feature_names, good_reviews, no_top_words)



LDA Topics 


Topic 0:
burger fries cheese ordered steak good chicken delicious beef sweet amazing bacon sauce cream came
Topic 1:
place food good vegas great just time like love really best wait sandwich service delicious
Topic 2:
buffet food vegas best good great worth wait buffets time burger place quality ive went
Topic 3:
crab like rib shrimp ramen sushi buffet dishes dessert thai prime seafood food fresh good
Topic 4:
pizza bouchon secret crust mon ami gabi floor white bobby bianco hallway flay thomas pizzeria


In [26]:
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(bad_reviews)
tf_feature_names = tf_vectorizer.get_feature_names()

no_topics=5
lda_model = LatentDirichletAllocation(n_components=no_topics, learning_method='online').fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

no_top_words = 15
bad_review_topics = display_topics(lda_H, lda_W, tf_feature_names, bad_reviews, no_top_words)



LDA Topics 


Topic 0:
buffet crab buffets sushi legs food rib seafood selection prime good wynn dessert shrimp station
Topic 1:
burger good ordered like just food fries chicken hot chocolate really place got came didnt
Topic 2:
island ellis dogs yea earl tourists neon karaoke episode serendipity fez pack duke chutney chantrelles
Topic 3:
minutes table service food time order asked server came told restaurant seated took said wait
Topic 4:
food place good just like dont vegas time really wait line service better im people
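Besides comparing the topics visually, the overlap between top-word lists can be measured. A minimal sketch using Jaccard similarity (shared words divided by all distinct words) on three of the topics printed above. The helper name `jaccard` is hypothetical, and since no `random_state` is set in the notebook, a re-run would likely produce different topics:

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists, treated as sets."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Top words copied from the LDA output above.
good_topic_2 = ("buffet food vegas best good great worth wait buffets "
                "time burger place quality ive went").split()
bad_topic_0 = ("buffet crab buffets sushi legs food rib seafood selection "
               "prime good wynn dessert shrimp station").split()
bad_topic_3 = ("minutes table service food time order asked server came "
               "told restaurant seated took said wait").split()

print(round(jaccard(good_topic_2, bad_topic_0), 3))
print(round(jaccard(good_topic_2, bad_topic_3), 3))
```

A higher score means two topics share more of their top words; computing it for every good/bad pair would give a small matrix showing which themes appear in both subsets.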


In [36]:
good_d3_array = [{'id':'Most Reviewed Yelp Restaurants.5 Star Reviews', "value":5},]
count = 0

for topic in good_review_topics:
    first = {"id":'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic '+str(count), "value":5}
    good_d3_array.append(first)
    
    for word in topic:
        element = {"id":'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic '+str(count)+'.'+word, "value":5}
        good_d3_array.append(element)
    
    count += 1
    
good_d3_array

[{'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews', 'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0', 'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.burger',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.fries',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.cheese',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.ordered',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.steak',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.good',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.chicken',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.delicious',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.beef',
  'value': 5},
 {'id': 'Most Reviewed Yelp Restaurants.5 Star Reviews.Topic 0.sweet',
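The loop above is the same as the one in cell 21 apart from the root label and the extra `value` field. A minimal sketch of a helper that covers both cases; `build_d3_array` is a hypothetical name, not a function from the notebook, and the toy topics are invented:

```python
def build_d3_array(root_label, topics, value=None):
    """Flatten topics into dot-separated ids; `value` is attached when given."""
    def node(node_id):
        entry = {"id": node_id}
        if value is not None:
            entry["value"] = value
        return entry

    nodes = [node(root_label)]
    for topic_idx, words in enumerate(topics):
        topic_id = "{}.Topic {}".format(root_label, topic_idx)
        nodes.append(node(topic_id))
        nodes.extend(node("{}.{}".format(topic_id, word)) for word in words)
    return nodes

toy_topics = [["burger", "fries"], ["pizza"]]
toy_nodes = build_d3_array("Most Reviewed Yelp Restaurants.5 Star Reviews", toy_topics, value=5)
print(len(toy_nodes))
```

Using `enumerate` replaces the manual `count` variable, and leaving `value` as `None` reproduces the plain `{'id': ...}` entries from cell 21.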

In [35]:
bad_d3_array = [{'id':'Most Reviewed Yelp Restaurants.1 Star Reviews', "value":1},]
count = 0

for topic in bad_review_topics:
    first = {"id":'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic '+str(count), "value":1}
    bad_d3_array.append(first)
    
    for word in topic:
        element = {"id":'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic '+str(count)+'.'+word, "value":1}
        bad_d3_array.append(element)
    
    count += 1
    
bad_d3_array

[{'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews', 'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0', 'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.buffet',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.crab',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.buffets',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.sushi',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.legs',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.food',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.rib',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.seafood',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.selection',
  'value': 1},
 {'id': 'Most Reviewed Yelp Restaurants.1 Star Reviews.Topic 0.prime',
  '