# Capston Project | Yelp - Progress Report

Basic informatio on the dataset: 
- 2.2M reviews and 591K tips by 552K users for 77K businesses
- 566K business attributes, e.g., hours, parking availability, ambience.
- Social network of 552K users for a total of 3.5M social edges.
- Aggregated check-ins over time for each of the 77K businesses
- 200,000 pictures from the included businesses



5 files: businesss(77,445 rows), checkin(55,569 rows), review(2,225,213 rows), tip(591,864 rows), user(552,339 rows) 
- Business has a total of 36 attributes.
- (u'Accepts Credit Cards', 56528),
 (u'Price Range', 50070),
 (u'Parking', 44617),
 (u'Good for Kids', 30328),
 (u'Outdoor Seating', 26601),
 (u'Good For Groups', 26011),
 (u'Delivery', 23624),
 (u'Take-out', 23601),
 (u'Attire', 23487),
 (u'Alcohol', 23328),
 (u'Takes Reservations', 23072),
 (u'Has TV', 22703),
 (u'Wheelchair Accessible', 22511),
 (u'Good For', 22080),
 (u'Wi-Fi', 21624),
 (u'Ambience', 21447),
 (u'Waiter Service', 21332),
 (u'Noise Level', 21305),
- The missing attributes are either filled with 'unknown' or False based on what attribute it is. For example, attributes such as 'delivery', 'cater', 'take out', 'Wheelchair Accessible', 'Dogs Allowed' and 'Happy Hour' are more likely to be the determining factors of whether someone decides to visit that certain business, therefore the missing values are filled with 'False.'  Assuming some who has a dog with him/her yelping 'cafe' nearby,  that person is more likely to treat the missing 'Dog Allowed' information as if dogs are not allowed and find a place that he/she is sure that they can visit with dogs. However,  none-determining factors such as 'Coat Check' and 'Has TV', as well as factors we cannot assume like 'Noise Level', 'Price Range', and 'Takes Reservations' are filled with 'unknown.'

## Model Exploration 

### Cultural Trends: 
By adding a diverse set of cities, we want participants to compare and contrast what makes a particular city different. For example, are people in international cities less concerned about driving in to a business, indicated by their lack of mention about parking? What cuisines are Yelpers raving about in these different countries? Do Americans tend to eat out late compared to the Germans and English? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers?



In [None]:
By adding a diverse set of cities, we want participants to compare and contrast what makes a particular city different. For example, are people in international cities less concerned about driving in to a business, indicated by their lack of mention about parking? What cuisines are Yelpers raving about in these different countries? Do Americans tend to eat out late compared to the Germans and English? In which countries are Yelpers sticklers for service quality? In international cities such as Montreal, are French speakers reviewing places differently than English speakers

In [28]:
import pandas as pd
df = pd.read_csv('../../capstone/merged_review.csv')

In [50]:
drop = ['votes_y','friends','elite','compliments','votes_x','type_y','user_id','type_x','review_id','attributes',\
        'business_id','full_address','hours','neighborhoods','open','Unnamed: 0','city','name_x','name_y','type','compliment']
ndf = df.drop(drop, 1)


In [42]:
c = ['state', 'credit_card', 'price', 'parking', 'kids', 'ourdoor_seating', 'groups',\
    'delivery', 'take_out', 'attire', 'alcohol', 'reservation', 'tv', 'wheelchair', 'wifi', 'waiter', \
     'noise', 'cater', 'appointment_only', 'happy_hour', 'dancing', 'coatcheck', 'dogs', 'drive_thru']
temp = pd.DataFrame()
for i in c:
    print i
    unique = list(ndf[i].unique())
    unique_n = {y:x for x, y in enumerate(unique)}
    temp[i] = ndf[i].apply(lambda x: unique_n[x])

In [58]:
x = temp.iloc[:,1:]

In [107]:
y = temp['state']

In [59]:
from sklearn.cross_validation import train_test_split, cross_val_score
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

In [60]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

Decision Tree Accuracy Score: 0.48843537415
Bagging Decision Tree Accuracy Score: 0.499319727891
Random Forest Accuracy Score: 0.478911564626
Extra Tree Accuracy Score: 0.503401360544
Ada Boost Accuracy Score: 0.102040816327
Gradient Boosting Accuracy Score: 0.500680272109


In [62]:
feature_importances_rf = pd.DataFrame(result_rfdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Random Forest'
print feature_importances_rf
print ''

feature_importances_et = pd.DataFrame(result_etdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Extra Tree'
print feature_importances_et
print ''

feature_importances_ab = pd.DataFrame(result_abdt.feature_importances_, 
                                   index = x.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Gradient Boosting'
print feature_importances_gb
print ''

Random Forest
                  importance
wifi                0.110719
price               0.094095
alcohol             0.079364
ourdoor_seating     0.069529
tv                  0.066923
noise               0.062370
wheelchair          0.057440
reservation         0.056857
cater               0.055550
waiter              0.046633
kids                0.045608
parking             0.041969
delivery            0.031881
groups              0.025191
coatcheck           0.024479
dogs                0.024330
happy_hour          0.022103
credit_card         0.020734
take_out            0.019872
attire              0.019321
appointment_only    0.009646
dancing             0.008333
drive_thru          0.007052

Extra Tree
                  importance
wifi                0.114921
price               0.079831
alcohol             0.078879
noise               0.072864
ourdoor_seating     0.069988
tv                  0.063693
wheelchair          0.054021
kids                0.048472
parking          

In [106]:
def clean(x):
    try: 
        z = x.replace('\n\n',' ')

        return z
    except: 
        pass
temp['text']= df['text'].apply(clean)

In [112]:
temp.text[595]

'The skylofts are truly amazing.  We stayed in the two bedroom loft which was about 3000 sq/ft.  There are incredible views of the strip and of the airport.  The rolls royce picked us up at the airport and we bypassed the mile long check in line upon arrival.  We got an escort to the room where the butler was waiting with fresh juice.  All the cokes, fiji water, and sprites included.  The room was incredible.  Completely sound proof and every room and living room has bang and olufson sound systems.  We ordered room service and the butler delivered and set everything up on the dining table.  The food was pricey but very good.  I would highly recommend staying at the skylofts.  You wont regret it!'

In [115]:
from nltk import *
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
# strip unnesscery words
stemmer = PorterStemmer()
temp['text'] = [stemmer.stem(t.decode('utf-8')) for t in temp['text']]

# set stop words and vectorize it 
cvec = CountVectorizer(stop_words = 'english')
cvec.fit(temp['text'])
review = pd.DataFrame(cvec.fit_transform(temp['text']).todense(),
                       columns=cvec.get_feature_names())

In [123]:
x = review.values 
y = temp['state']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

In [124]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

Decision Tree Accuracy Score: 0.466666666667
Bagging Decision Tree Accuracy Score: 0.533333333333
Random Forest Accuracy Score: 0.517006802721
Extra Tree Accuracy Score: 0.534693877551
Ada Boost Accuracy Score: 0.489795918367
Gradient Boosting Accuracy Score: 0.563265306122


In [129]:
feature_importances_rf = pd.DataFrame(result_rfdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Random Forest'
print feature_importances_rf.head()
print ''

feature_importances_et = pd.DataFrame(result_etdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Extra Tree'
print feature_importances_et.head()
print ''

feature_importances_ab = pd.DataFrame(result_abdt.feature_importances_, 
                                   index = review.columns,
                                    columns=['importance']).sort_values('importance',
                                                                        ascending=False)
print 'Gradient Boosting'
print feature_importances_gb.head()
print ''

Random Forest
            importance
vegas         0.028862
charlotte     0.013321
strip         0.006695
montreal      0.006199
pittsburgh    0.005126

Extra Tree
           importance
vegas        0.013041
las          0.010051
charlotte    0.007318
montreal     0.006986
phoenix      0.006513

Gradient Boosting
                 importance
parking            0.086447
tv                 0.072042
alcohol            0.067728
ourdoor_seating    0.062705
noise              0.061974



### Location Mining and Urban Planning: 
How much of a business' success is really just location, location, location? Do you see reviewers' behavior change when they travel?

### Seasonal Trends: 
What about seasonal effects: Are HVAC contractors being reviewed just at onset of winter, and manicure salons at onset of summer? Are there more reviews for sports bars on major game days and if so, could you predict that?

### Infer Categories: 
Do you see any non-intuitive correlations between business categories e.g., how many karaoke bars also offer Korean food, and vice versa? What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just "Chinese restaurants"), and can you learn this from the review text?

### Natural Language Processing (NLP): 
How well can you guess a review's rating from its text alone? What are the most common positive and negative words used in our reviews? Are Yelpers a sarcastic bunch? And what kinds of correlations do you see between tips and reviews: could you extract tips from reviews?

In [None]:
y = ndf['stars_x']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=3)

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier, BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier

# try other ensemble methods (random forest=, extra tree, adaboost, gradient boosting)
dt = DecisionTreeClassifier(random_state=3)
bdt = BaggingClassifier(DecisionTreeClassifier(random_state=3))
rfdt = RandomForestClassifier(random_state=3)
etdt = ExtraTreesClassifier(random_state=3)
abdt = AdaBoostClassifier(random_state=3)
gbdt = GradientBoostingClassifier(random_state=3)

# apply those models to the train set 
result_dt = dt.fit(x_train,y_train)
result_bdt = bdt.fit(x_train,y_train)
result_rfdt = rfdt.fit(x_train,y_train)
result_etdt = etdt.fit(x_train,y_train)
result_abdt = abdt.fit(x_train,y_train)
result_gbdt = gbdt.fit(x_train,y_train)

# print out the accuracy scores 
print "Decision Tree Accuracy Score: " + str(result_dt.score(x_test,y_test))
print "Bagging Decision Tree Accuracy Score: " + str(result_bdt.score(x_test,y_test))
print "Random Forest Accuracy Score: " + str(result_rfdt.score(x_test,y_test))
print "Extra Tree Accuracy Score: " + str(result_etdt.score(x_test,y_test))
print "Ada Boost Accuracy Score: " + str(result_abdt.score(x_test,y_test))
print "Gradient Boosting Accuracy Score: " + str(result_gbdt.score(x_test,y_test))

NameError: name 'x_train' is not defined

### Changepoints and Events: 
Can you detect when things change suddenly (i.e. a business coming under new management)? Can you see when a city starts going nuts over cronuts?

### Social Graph Mining: 
Can you figure out who the trend setters are and who found the best waffle joint before waffles were cool? How much influence does my social circle have on my business choices and my ratings?