### Effects of weather and price on Yelp star rating

It may be a good idea to split the data by business type (restaurant, hardware store, etc.), since each category would be easier to model and less computationally intensive. Still, it would be interesting to find very general features that help determine star ratings for all businesses.

There are three stages. First we look at the review text alone, to see how accurately we can predict the star rating from it. Then we add in weather effects. Finally, we consider the relative price of the business being reviewed. Since star ratings are highly subjective, many things may influence a user's rating. These factors will undoubtedly affect the review text as well, but there may also be subtle additional effects on the star rating.

The test will be done in a rather crude way: if including weather and price can improve the accuracy of star predictions, then we will conclude that weather and price have an effect.
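The crude test described above boils down to comparing two accuracy numbers on the same held-out reviews. Here is a minimal sketch of that comparison; the helper names `accuracy` and `accuracy_gain` and the toy labels are made up for illustration and are not results from this notebook.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def accuracy_gain(y_true, baseline_pred, augmented_pred):
    """Accuracy of the augmented model minus accuracy of the baseline."""
    return accuracy(y_true, augmented_pred) - accuracy(y_true, baseline_pred)

# Toy class labels, purely illustrative.
y_true    = [0, 1, 2, 3, 4, 4, 3, 2]
text_only = [0, 1, 1, 3, 4, 3, 3, 2]   # 6 of 8 correct
with_more = [0, 1, 2, 3, 4, 3, 3, 2]   # 7 of 8 correct

gain = accuracy_gain(y_true, text_only, with_more)
print(f'accuracy gain: {gain:+.3f}')
```

Both models should be scored on the same test split. A small gain from a single run could simply be noise from the random split and weight initialisation.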

In [1]:
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

from keras.models import Sequential
from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,
                          Convolution1D, MaxPooling1D, Bidirectional,
                         GlobalMaxPooling1D, Embedding, BatchNormalization,
                         SpatialDropout1D)
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
from sklearn.preprocessing import LabelEncoder

%matplotlib inline

Using TensorFlow backend.


In [2]:
PATH = "/d/data/yelpdata/dataset/"
WEAT = f'{PATH}processed_weather/'

In [5]:
#businesses = pd.read_csv(f'{PATH}business_on.csv', index_col=0)
reviews = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [6]:
reviews = reviews[['stars','text']]

In [7]:
reviews['text'].fillna('empty', inplace=True)

### Stage 1: Predicting star rating based on review text alone

In [3]:
# build the stopword set once; rebuilding it for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def clean_up(t):
    t = t.strip().lower()
    words = t.split()
    
    # first get rid of the stopwords, or a lemmatized stopword might not
    # be recognized as a stopword
    
    imp_words = ' '.join(w for w in words if w not in STOPWORDS)

    # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to
    # return only the base words (as opposed to stemming which can return
    # non-words). e.g. ponies -> poni with stemming, and pony with lemmatizing
    
    final_words = ''
    
    lemma = WordNetLemmatizer()
    for (w,tag) in pos_tag(word_tokenize(imp_words)):
        if tag.startswith('J'):
            final_words += ' '+ lemma.lemmatize(w, pos='a')
        elif tag.startswith('V'):
            final_words += ' '+ lemma.lemmatize(w, pos='v')
        elif tag.startswith('N'):
            final_words += ' '+ lemma.lemmatize(w, pos='n')
        elif tag.startswith('R'):
            final_words += ' '+ lemma.lemmatize(w, pos='r')
        else:
            final_words += ' '+ w
    
    return final_words

# what a great name. do_stuff

def do_stuff (df):
    text = df['text'].copy()
    
    text.replace(to_replace={r'[^\x00-\x7F]':' '},inplace=True,regex=True)
    text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)
    
    # Then lower case, tokenize and lemmatize

    # with over 600,000 entries, this is going to be one hell of a long apply...
    
    text = text.apply(clean_up)
    return text

In [15]:
def seq_model (X_train, y_train, test, val='no'):
    model=Sequential()
    model.add(Embedding(50000,128,input_length=1000))
    model.add(SpatialDropout1D(0.25))
    model.add(GlobalMaxPooling1D())
    model.add(BatchNormalization())
    model.add(Dense(64))
    model.add(Dropout(0.5))
    
    model.add(Dense(5,activation='softmax'))
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    
    if val == 'no':
        model.fit(X_train,y_train,batch_size=512,epochs=5)
    else:
        model.fit(X_train,y_train,batch_size=512,epochs=5,validation_split=0.2)
    pred = model.predict(test)
    return pred

In [None]:
#data = do_stuff(reviews)

In [None]:
#data.to_csv(f'{PATH}review_on_processed_text.csv')

In [8]:
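# Series.from_csv was removed in pandas 1.0; read_csv with header=None
# mirrors its old defaults, and iloc[:, 0] turns the single column into a Series
data = pd.read_csv(f'{PATH}review_on_processed_text.csv', index_col=0, header=None).iloc[:, 0]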

In [9]:
stars = reviews['stars']

In [10]:
enc = LabelEncoder()
enc.fit(stars)
y = enc.transform(stars)
dummy_y = np_utils.to_categorical(y)

In [11]:
data.fillna('empty', inplace=True)

In [12]:
tok = Tokenizer(num_words=50000)
tok.fit_on_texts(data)
     
# cap each review at 1000 tokens (words, not characters); some of these reviews are pretty long
sequenced = tok.texts_to_sequences(data)
padded = pad_sequences(sequenced,maxlen=1000)
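
To make the padding step concrete, here is a small NumPy sketch of what `pad_sequences` does with its default settings, as far as I understand them: shorter sequences are left-padded with zeros, and longer ones keep only their last `maxlen` tokens. The function `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for i, s in enumerate(seqs):
        if len(s) == 0:
            continue                     # an empty review stays all padding
        trunc = s[-maxlen:]              # drop tokens from the front if too long
        out[i, -len(trunc):] = trunc     # right-align the remaining tokens
    return out

toy = pad_pre([[3, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy)
# [[0 0 3 7]
#  [3 4 5 6]]
```

One consequence is that for reviews longer than 1000 tokens, the model only ever sees the end of the review.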

In [13]:
X_train, X_test, y_train, y_test = train_test_split(padded, dummy_y, test_size=0.2)

In [16]:
pred = seq_model (X_train, y_train, X_test, val='yes')

Train on 405993 samples, validate on 101499 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [20]:
roc_auc_score(y_test,pred)

0.869052128035148

In [28]:
preds = np.argmax(pred, axis=1)
ys = np.argmax(y_test, axis=1)

The class labels run from 0 to 4, so add 1 to each row label to get the star rating...
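
The offset comes from the label encoding: the sorted star values 1 to 5 become class indices 0 to 4 before being one-hot encoded. The NumPy sketch below imitates that round trip on toy data. It assumes all five star values occur in the data, and it uses `np.searchsorted` and `np.eye` as stand-ins for `LabelEncoder` and `to_categorical`.

```python
import numpy as np

star_values = np.array([1, 2, 3, 4, 5])      # sorted classes, as an encoder would store them
toy_stars = np.array([5, 1, 3, 4, 2])        # toy ratings, illustrative only

# Encode: star -> class index (0-4) -> one-hot row.
class_index = np.searchsorted(star_values, toy_stars)
one_hot = np.eye(len(star_values))[class_index]

# Decode: argmax gives the class index back; look up (or add 1) to get stars.
recovered = star_values[np.argmax(one_hot, axis=1)]
print(class_index, recovered)
```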

In [31]:
print(classification_report(ys,preds))

             precision    recall  f1-score   support

          0       0.72      0.75      0.73     15271
          1       0.46      0.33      0.39     13079
          2       0.49      0.43      0.46     22067
          3       0.54      0.61      0.57     39122
          4       0.69      0.69      0.69     37335

avg / total       0.59      0.59      0.59    126874



In [32]:
confusion_matrix (ys, preds)

array([[11481,  2240,   814,   392,   344],
       [ 3102,  4340,  4006,  1306,   325],
       [  809,  2242,  9502,  8468,  1046],
       [  348,   417,  4318, 23950, 10089],
       [  316,    98,   623, 10619, 25679]])

While 0.59 precision isn't great, the 0.869 AUC score is actually very, very okay, one of the better kinds of okay. The validation scores during training were also very good, which is always helpful. A benefit, no doubt, of using all 630,000 reviews of businesses in the Toronto area.

We can see why the AUC is pretty good: the vast majority of predicted scores (about 95% of the 126,874 test reviews, going by the confusion matrix) are within 1 star of the actual rating. Additionally, 1- and 5-star ratings had the greatest precision and recall, so our model is decent at picking up extreme sentiment (or the users are effusive in praise and unrestrained in condemnation).
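
The "within 1 star" claim can be checked directly from the confusion matrix printed above. This is a quick sketch that sums the main diagonal (exact hits) and the two neighbouring diagonals (off by one star).

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([[11481,  2240,   814,   392,   344],
               [ 3102,  4340,  4006,  1306,   325],
               [  809,  2242,  9502,  8468,  1046],
               [  348,   417,  4318, 23950, 10089],
               [  316,    98,   623, 10619, 25679]])

total = cm.sum()
exact = np.trace(cm) / total
# Main diagonal plus the diagonals directly above and below it.
within_one = (np.trace(cm) + np.trace(cm, offset=1) + np.trace(cm, offset=-1)) / total

print(f'exact: {exact:.3f}, within one star: {within_one:.3f}')
# exact: 0.591, within one star: 0.946
```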

But let's see if adding in weather and relative price can increase accuracy.

### Stage 2: weather effects

Star ratings are neither objective nor scientific. We humans often make bizarre, irrational and otherwise inconsistent choices due to many internal and external factors. Let's consider weather as one of the external factors, especially with regard to giving a business a star rating. While good weather and a good mood might influence me to leave a more positive review as well as a higher star rating, there is really no way to know the sort of review I would have left had the weather been different.

What we can do is see if the review text matches the score, and if knowing the weather conditions can improve the accuracy of our star predictions.

This raises an interesting question for businesses, since weather is something entirely out of their control. But it does suggest that if a business's sales are clearly affected by its star ratings, perhaps something extra can be done to improve customer satisfaction on a rainy day.

In [38]:
reviews_w = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [40]:
reviews_w = reviews_w[['stars','date','text']]

In [54]:
weather = pd.read_csv(f'{WEAT}all_weather.csv', index_col='Unnamed: 0')


In [63]:
weather.head()

Unnamed: 0,Date/Time,Year,Month,Day,Time,Data Quality,Temp (°C),Temp Flag,Dew Point Temp (°C),Dew Point Temp Flag,...,Wind Spd Flag,Visibility (km),Visibility Flag,Stn Press (kPa),Stn Press Flag,Hmdx,Hmdx Flag,Wind Chill,Wind Chill Flag,Weather
0.0,2006-01-01 00:00,2006.0,1.0,1.0,00:00,,,,,,...,,,,,,,,,,
1.0,2006-01-01 01:00,2006.0,1.0,1.0,01:00,,,,,,...,,,,,,,,,,
2.0,2006-01-01 02:00,2006.0,1.0,1.0,02:00,,-4.2,,-5.6,,...,,3.2,,100.2,,,,-8.0,,Snow
3.0,2006-01-01 03:00,2006.0,1.0,1.0,03:00,,-4.5,,-5.6,,...,,2.3,,100.27,,,,-10.0,,Freezing Drizzle
4.0,2006-01-01 04:00,2006.0,1.0,1.0,04:00,,-4.0,,-5.2,,...,,6.4,,100.27,,,,-8.0,,


In [64]:
weather['Year'] = weather['Year'].astype(int)
weather['Month'] = weather['Month'].astype(int)
weather['Day'] = weather['Day'].astype(int)

In [59]:
reviews_w['date'] = pd.to_datetime(reviews_w['date'])

In [61]:
reviews_w.head()

Unnamed: 0,stars,date,text
0,4,2012-05-11,Who would have guess that you would be able to...
1,4,2015-10-27,Always drove past this coffee house and wonder...
2,3,2013-02-09,"Not bad!! Love that there is a gluten-free, ve..."
3,5,2016-04-06,Love this place! Peggy is great with dogs and...
4,4,2013-05-01,This is currently my parents new favourite res...
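
With both frames loaded, one plausible next step is to attach a daily weather summary to each review by date. The sketch below uses tiny toy frames with the same column names as above. The aggregation choices (daily mean temperature, and a rain flag based on the `Weather` text) and the names `daily`, `mean_temp`, `any_rain` and `merged` are my assumptions, not something the notebook has settled on.

```python
import pandas as pd

# Toy stand-ins for the two frames above (values are illustrative only).
weather = pd.DataFrame({
    'Date/Time': ['2012-05-11 00:00', '2012-05-11 01:00', '2015-10-27 00:00'],
    'Temp (°C)': [10.0, 12.0, 4.0],
    'Weather': [None, 'Rain', None],
})
reviews_w = pd.DataFrame({
    'stars': [4, 4, 3],
    'date': pd.to_datetime(['2012-05-11', '2015-10-27', '2013-02-09']),
    'text': ['a', 'b', 'c'],
})

# Reduce each hourly timestamp to its calendar day and flag rainy hours.
weather['date'] = pd.to_datetime(weather['Date/Time']).dt.normalize()
weather['is_rain'] = weather['Weather'].fillna('').str.contains('Rain')

# Collapse hourly observations to one row per day.
daily = weather.groupby('date').agg(
    mean_temp=('Temp (°C)', 'mean'),
    any_rain=('is_rain', 'any'),
).reset_index()

# A left join keeps every review, even on days with no weather record.
merged = reviews_w.merge(daily, on='date', how='left')
print(merged)
```

Reviews on days with no weather record end up with missing values, which would need to be filled or dropped before modelling.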


to be continued...