# Capston Project | Yelp - Progress Report

Basic informatio on the dataset: 
- 2.2M reviews and 591K tips by 552K users for 77K businesses
- 566K business attributes, e.g., hours, parking availability, ambience.
- Social network of 552K users for a total of 3.5M social edges.
- Aggregated check-ins over time for each of the 77K businesses
- 200,000 pictures from the included businesses



5 files: businesss(77,445 rows), checkin(55,569 rows), review(2,225,213 rows), tip(591,864 rows), user(552,339 rows) 
- Business has a total of 36 attributes.
- (u'Accepts Credit Cards', 56528),
 (u'Price Range', 50070),
 (u'Parking', 44617),
 (u'Good for Kids', 30328),
 (u'Outdoor Seating', 26601),
 (u'Good For Groups', 26011),
 (u'Delivery', 23624),
 (u'Take-out', 23601),
 (u'Attire', 23487),
 (u'Alcohol', 23328),
 (u'Takes Reservations', 23072),
 (u'Has TV', 22703),
 (u'Wheelchair Accessible', 22511),
 (u'Good For', 22080),
 (u'Wi-Fi', 21624),
 (u'Ambience', 21447),
 (u'Waiter Service', 21332),
 (u'Noise Level', 21305),
- The missing attributes are either filled with 'unknown' or False based on what attribute it is. For example, attributes such as 'delivery', 'cater', 'take out', 'Wheelchair Accessible', 'Dogs Allowed' and 'Happy Hour' are more likely to be the determining factors of whether someone decides to visit that certain business, therefore the missing values are filled with 'False.'  Assuming some who has a dog with him/her yelping 'cafe' nearby,  that person is more likely to treat the missing 'Dog Allowed' information as if dogs are not allowed and find a place that he/she is sure that they can visit with dogs. However,  none-determining factors such as 'Coat Check' and 'Has TV', as well as factors we cannot assume like 'Noise Level', 'Price Range', and 'Takes Reservations' are filled with 'unknown.'

## Model Exploration 

### Cultural Trends: 
By adding a diverse set of cities, we want participants to compare and contrast what makes a particular city different. For example, are people in international cities less concerned about driving in to a business, indicated by their lack of mention about parking? What cuisines are Yelpers raving about in these different countries? Do Americans tend to eat out late compared to the Germans and English? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?



In [1]:
import pandas as pd

In [10]:

sample_df = pd.read_csv('../../capstone/merged_review_sample.csv')
# full_df = pd.read_csv('../../capstone/merged_review.csv')

In [12]:
drop = ['votes_y','friends','elite','compliments','votes_x','type_y','user_id','type_x','review_id','attributes',\
        'business_id','full_address','hours','neighborhoods','open','Unnamed: 0','city','name_x','name_y','type','compliment']
ndf = sample_df.drop(drop, 1)


In [13]:
c = ['state', 'credit_card', 'price', 'parking', 'kids', 'ourdoor_seating', 'groups',\
    'delivery', 'take_out', 'attire', 'alcohol', 'reservation', 'tv', 'wheelchair', 'wifi', 'waiter', \
     'noise', 'cater', 'appointment_only', 'happy_hour', 'dancing', 'coatcheck', 'dogs', 'drive_thru']
temp = pd.DataFrame()
for i in c:
    print i
    unique = list(ndf[i].unique())
    unique_n = {y:x for x, y in enumerate(unique)}
    temp[i] = ndf[i].apply(lambda x: unique_n[x])

state
credit_card
price
parking
kids
ourdoor_seating
groups
delivery
take_out
attire
alcohol
reservation
tv
wheelchair
wifi
waiter
noise
cater
appointment_only
happy_hour
dancing
coatcheck
dogs
drive_thru


In [28]:
x = temp.iloc[:,1:]

In [47]:
y = ndf['stars_y']

In [48]:
from sklearn.cross_validation import train_test_split, cross_val_score
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

In [49]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

Decision Tree Accuracy Score: 0.423129251701
Bagging Decision Tree Accuracy Score: 0.480272108844
Random Forest Accuracy Score: 0.477551020408
Extra Tree Accuracy Score: 0.451700680272
Ada Boost Accuracy Score: 0.538775510204
Gradient Boosting Accuracy Score: 0.531972789116


In [52]:
feature_importances_rf = pd.DataFrame(result_rfdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Random Forest'
print feature_importances_rf
print ''

feature_importances_et = pd.DataFrame(result_etdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Extra Tree'
print feature_importances_et
print ''

feature_importances_ab = pd.DataFrame(result_abdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Ada Boost'
print feature_importances_ab
print ''

feature_importances_gb = pd.DataFrame(result_gbdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Gradient Boosting'
print feature_importances_gb
print ''

Random Forest
                  importance
wifi                0.118548
price               0.088968
noise               0.078202
alcohol             0.065903
ourdoor_seating     0.061687
wheelchair          0.060873
cater               0.059226
tv                  0.056107
kids                0.054169
reservation         0.050601
parking             0.040821
delivery            0.040207
waiter              0.034185
coatcheck           0.026408
groups              0.026327
dogs                0.026213
attire              0.026132
take_out            0.025761
happy_hour          0.020393
credit_card         0.012880
dancing             0.010585
appointment_only    0.009381
drive_thru          0.006423

Extra Tree
                  importance
wifi                0.100537
price               0.095212
wheelchair          0.070046
noise               0.069911
ourdoor_seating     0.067862
tv                  0.060373
cater               0.056823
alcohol             0.056255
kids             

In [106]:
def clean(x):
    try: 
        z = x.replace('\n\n',' ')

        return z
    except: 
        pass
temp['text']= df['text'].apply(clean)

In [112]:
temp.text[595]

'The skylofts are truly amazing.  We stayed in the two bedroom loft which was about 3000 sq/ft.  There are incredible views of the strip and of the airport.  The rolls royce picked us up at the airport and we bypassed the mile long check in line upon arrival.  We got an escort to the room where the butler was waiting with fresh juice.  All the cokes, fiji water, and sprites included.  The room was incredible.  Completely sound proof and every room and living room has bang and olufson sound systems.  We ordered room service and the butler delivered and set everything up on the dining table.  The food was pricey but very good.  I would highly recommend staying at the skylofts.  You wont regret it!'

In [7]:
from nltk import *
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
# strip unnesscery words
stemmer = PorterStemmer()
temp['text'] = [stemmer.stem(t.decode('utf-8')) for t in temp['text']]

# set stop words and vectorize it 
cvec = CountVectorizer(stop_words = 'english')
cvec.fit(temp['text'])
review = pd.DataFrame(cvec.fit_transform(temp['text']).todense(),
                       columns=cvec.get_feature_names())

NameError: name 'temp' is not defined

In [None]:
x = review.values 
y = temp['star']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

NameError: name 'x_train' is not defined

In [129]:
feature_importances_rf = pd.DataFrame(result_rfdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Random Forest'
print feature_importances_rf.head()
print ''

feature_importances_et = pd.DataFrame(result_etdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Extra Tree'
print feature_importances_et.head()
print ''

feature_importances_ab = pd.DataFrame(result_abdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Gradient Boosting'
print feature_importances_gb.head()
print ''

Random Forest
            importance
vegas         0.028862
charlotte     0.013321
strip         0.006695
montreal      0.006199
pittsburgh    0.005126

Extra Tree
           importance
vegas        0.013041
las          0.010051
charlotte    0.007318
montreal     0.006986
phoenix      0.006513

Gradient Boosting
                 importance
parking            0.086447
tv                 0.072042
alcohol            0.067728
ourdoor_seating    0.062705
noise              0.061974



In [2]:
df_tip = pd.read_csv('../../capstone/merged_tip.csv')

In [146]:
df_tip.head()

Unnamed: 0.1,Unnamed: 0,attributes,business_id,categories,city,full_address,hours,latitude,longitude,name_x,...,number_friends,elites,compliment,number_compliments,cool_user,photo_user,hot_user,funny_user,pop_user,list_user
0,0,{u'Good for Kids': True},cE27W9VPgO88Qxe4ol6y_g,"[Active Life, Mini Golf, Golf]",Bethel Park,"1530 Hamilton Rd\nBethel Park, PA 15234",{},40.354116,-80.01466,Cool Springs Golf Center,...,3,0,"[note, plain, cool]",5,3,0,0,0,0,0
1,1,"{u'Alcohol': u'full_bar', u'Noise Level': u'av...",mVHrayjG3uZ_RLHkLj-AMg,"[Bars, American (New), Nightlife, Lounges, Res...",rankin,"414 Hawkins Ave\nrankin, PA 15104","{u'Tuesday': {u'close': u'19:00', u'open': u'1...",40.413464,-79.880247,Emil's Lounge,...,5,0,"[cool, more]",2,1,0,0,0,1,0
2,2,"{u'Alcohol': u'full_bar', u'Noise Level': u'lo...",KayYbHCt-RkbGcPdGOThNg,"[Bars, American (Traditional), Nightlife, Rest...",Carnegie,"141 Hawthorne St\nGreentree\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'11...",40.415517,-80.067534,Alexion's Bar & Grill,...,8,0,"[photos, hot]",2,0,2,1,0,0,0
3,3,"{u'Alcohol': u'full_bar', u'Noise Level': u'lo...",KayYbHCt-RkbGcPdGOThNg,"[Bars, American (Traditional), Nightlife, Rest...",Carnegie,"141 Hawthorne St\nGreentree\nCarnegie, PA 15106","{u'Monday': {u'close': u'02:00', u'open': u'11...",40.415517,-80.067534,Alexion's Bar & Grill,...,0,0,[],0,0,0,0,0,0,0
4,4,{},1_lU0-eSWJCRvNGk78Zh9Q,"[Libraries, Public Services & Government]",Carnegie,"300 Beechwood Ave\nCarnegie\nCarnegie, PA 15106",{},40.406842,-80.085866,Carnegie Free Library,...,5,0,"[plain, cool]",3,2,0,0,0,0,0


In [5]:
from nltk import *
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
stemmer = PorterStemmer()
tip = pd.DataFrame()
tip['text'] = [stemmer.stem(t.decode('utf-8')) for t in df_tip['text']]

# set stop words and vectorize it 
cvec = CountVectorizer(stop_words = 'english')
cvec.fit(tip['text'])
tip = pd.DataFrame(cvec.fit_transform(tip['text']).todense(),
                       columns=cvec.get_feature_names())

In [6]:
x = tip.values 
y = df_tip['number_compliments']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

NameError: name 'train_test_split' is not defined

In [None]:
# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

In [None]:
feature_importances_rf = pd.DataFrame(result_rfdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Random Forest'
print feature_importances_rf.head()
print ''

feature_importances_et = pd.DataFrame(result_etdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Extra Tree'
print feature_importances_et.head()
print ''

feature_importances_ab = pd.DataFrame(result_abdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Gradient Boosting'
print feature_importances_gb.head()
print ''

In [None]:
os.system('say "Ugh! Lydia, I am so done with this!!"')

### Location Mining and Urban Planning: 
How much of a business' success is really just location, location, location? Do you see reviewers' behavior change when they travel?

### Seasonal Trends: 
What about seasonal effects: Are HVAC contractors being reviewed just at onset of winter, and manicure salons at onset of summer? Are there more reviews for sports bars on major game days and if so, could you predict that?

### Infer Categories: 
Do you see any non-intuitive correlations between business categories e.g., how many karaoke bars also offer Korean food, and vice versa? What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just "Chinese restaurants"), and can you learn this from the review text?

### Natural Language Processing (NLP): 
How well can you guess a review's rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?

In [None]:
y = ndf['stars_x']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

NameError: name 'x_train' is not defined

### Changepoints and Events: 
Can you detect when things change suddenly (i.e. a business coming under new management)? Can you see when a city starts going nuts over cronuts?

### Social Graph Mining: 
Can you figure out who the trend setters are and who found the best waffle joint before waffles were cool? How much influence does my social circle have on my business choices and my ratings?

In [134]:
import spacy
