### Effects of weather on Yelp star rating

It may be worth splitting the data by business type (restaurant, hardware store, etc.), since modelling one category at a time could be easier and less computationally intensive. Still, it would be more interesting to find very general features that help determine star ratings for all businesses.

Three stages: first we look at the review text alone, to see how accurately we can predict the star rating from it. Then we add in weather effects. Since star ratings are highly subjective, users may be influenced by many things when choosing a rating. These factors will undoubtedly affect the review text as well, but there may also be subtle additional effects on the star rating.

The goal isn't so much to painstakingly tune a NN for the last bit of accuracy, but rather to see whether adding two new features yields a significant improvement regardless of model.

In [26]:
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

from keras.models import Sequential
from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,
                          Convolution1D, MaxPooling1D, Bidirectional,
                         GlobalMaxPooling1D, Embedding, BatchNormalization,
                         SpatialDropout1D)
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.optimizers import SGD

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
from sklearn.preprocessing import LabelEncoder

%matplotlib inline

In [2]:
PATH = "/d/data/yelpdata/dataset/"
WEAT = f'{PATH}processed_weather/'

In [3]:
#businesses = pd.read_csv(f'{PATH}business_on.csv', index_col=0)
reviews = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [4]:
reviews = reviews[['stars','text']]

In [5]:
# assign back instead of using inplace=True on a column of a sliced frame
reviews['text'] = reviews['text'].fillna('empty')

### Stage 1: Predicting star rating based on review text alone

In [6]:
def clean_up(t):
    t = t.strip().lower()
    words = t.split()
    
    # first get rid of the stopwords, or a lemmatized stopword might not
    # be recognized as a stopword
    
    stop_words = set(stopwords.words('english'))  # build the set once, not once per word
    imp_words = ' '.join(w for w in words if w not in stop_words)

    # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to
    # return only the base words (as opposed to stemming which can return
    # non-words). e.g. ponies -> poni with stemming, and pony with lemmatizing
    
    final_words = ''
    
    lemma = WordNetLemmatizer()
    for (w,tag) in pos_tag(word_tokenize(imp_words)):
        if tag.startswith('J'):
            final_words += ' '+ lemma.lemmatize(w, pos='a')
        elif tag.startswith('V'):
            final_words += ' '+ lemma.lemmatize(w, pos='v')
        elif tag.startswith('N'):
            final_words += ' '+ lemma.lemmatize(w, pos='n')
        elif tag.startswith('R'):
            final_words += ' '+ lemma.lemmatize(w, pos='r')
        else:
            final_words += ' '+ w
    
    return final_words

# what a great name. do_stuff

def do_stuff (df):
    text = df['text'].copy()
    
    text.replace(to_replace={r'[^\x00-\x7F]':' '},inplace=True,regex=True)
    text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)
    
    # Then lower case, tokenize and lemmatize

    # with over 600,000 entries, this is going to be one hell of a long apply...
    
    text = text.apply(lambda t:clean_up(t))
    return text

In [42]:
# converging to a very conventional convolutional NN model to convert non-conversational text to star ratings
# uh... with a non-convex loss function
def cnn_model (X_train, y_train, test, val='no'):
    model=Sequential()
    model.add(Embedding(50000,128,input_length=1000))
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(35))
    model.add(Flatten())
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.2))
    
    model.add(Dense(5,activation='softmax'))
    
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)     
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=5)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=5,validation_split=0.2)
    pred = model.predict(test)
    return pred
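
The pool size of 35 in the last block is not arbitrary: it is the sequence length left over after the earlier layers. Here is a quick sketch of that arithmetic, assuming the Keras defaults of 'valid' padding and stride 1 for the convolutions, and a stride equal to the pool size for the pooling layers. The helper names `conv_out` and `pool_out` are mine, not part of Keras.

```python
def conv_out(length, kernel):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return length - kernel + 1


def pool_out(length, pool):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return length // pool


# (layer kind, kernel or pool size) in the order used by cnn_model
layers = [("conv", 5), ("pool", 5),
          ("conv", 5), ("pool", 5),
          ("conv", 5), ("pool", 35)]

length = 1000  # input_length of the Embedding layer
trace = [length]
for kind, size in layers:
    length = conv_out(length, size) if kind == "conv" else pool_out(length, size)
    trace.append(length)

print(trace)  # [1000, 996, 199, 195, 39, 35, 1]
```

Under those assumptions the final pooling layer collapses the remaining 35 steps to one, so `Flatten()` hands a 128-vector to the dense layer.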

In [9]:
#data = do_stuff(reviews)

In [10]:
#data.to_csv(f'{PATH}review_on_processed_text.csv')

In [11]:
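# Series.from_csv is deprecated (and removed in pandas 1.0); this read_csv call
# assumes the file was written without a header row, as from_csv did
data = pd.read_csv(f'{PATH}review_on_processed_text.csv', index_col=0, header=None).squeeze('columns')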

In [12]:
stars = reviews['stars']

In [13]:
del reviews

In [14]:
stars[:10]

0    4
1    4
2    3
3    5
4    4
5    3
6    1
7    3
8    5
9    1
Name: stars, dtype: int64

In [15]:
enc = LabelEncoder()
enc.fit(stars)
y = enc.transform(stars)
dummy_y = np_utils.to_categorical(y)

In [16]:
dummy_y

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.]])
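
It helps to keep the encoding in mind when reading the reports further down: class index 0 is a 1-star review and class index 4 is a 5-star review. The following is a minimal numpy sketch of what the `LabelEncoder` plus `to_categorical` pair does, on made-up ratings; it is an illustration of the mapping, not the library code.

```python
import numpy as np

stars_demo = np.array([4, 4, 3, 5, 1, 2])  # made-up ratings covering 1-5

# Like LabelEncoder: sorted unique labels, then each label's position in that list.
classes, y_demo = np.unique(stars_demo, return_inverse=True)

# Like to_categorical: one row per sample with a 1 in the column of its class.
one_hot = np.eye(len(classes))[y_demo]

print(classes)  # the star value behind each class index
print(y_demo)   # class index per review, i.e. stars - 1
print(one_hot)
```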

In [17]:
data.fillna('empty', inplace=True)

In [18]:
tok = Tokenizer(num_words=50000)
tok.fit_on_texts(data)
     
# set our max text length to 1000 tokens (words, not characters), some of these reviews are pretty long
sequenced = tok.texts_to_sequences(data)
padded = pad_sequences(sequenced,maxlen=1000)
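
As far as I know, `pad_sequences` defaults to padding and truncating at the front, so short reviews get leading zeros and very long reviews lose their beginning rather than their end. A small numpy sketch of that behaviour follows; `pad_pre` is a hypothetical stand-in written for illustration, not the Keras function.

```python
import numpy as np


def pad_pre(seqs, maxlen):
    """Hypothetical stand-in for pad_sequences with 'pre' padding and truncating."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for i, s in enumerate(seqs):
        s = s[-maxlen:]            # truncate from the front, keep the tail
        if s:
            out[i, -len(s):] = s   # left-pad with zeros
    return out


demo = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6], []], maxlen=4)
print(demo)
# [[0 0 7 8]
#  [3 4 5 6]
#  [0 0 0 0]]
```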

In [19]:
X_train, X_test, y_train, y_test = train_test_split(padded, dummy_y, test_size=0.2)

In [41]:
pred2 = cnn_model (X_train, y_train, X_test, val='yes')

Train on 405993 samples, validate on 101499 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


At 5 epochs the model seems to be slightly overfit, but for my purposes here I'm not too worried about that.

In [43]:
roc_auc_score(y_test,pred2)

0.8725433443648803

In [44]:
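# collapse one-hot labels and predicted probabilities to class indices (0-4 = 1-5 stars)
ys = np.argmax(y_test, axis=1)
preds2 = np.argmax(pred2, axis=1)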

In [45]:
print(classification_report(ys,preds2))

             precision    recall  f1-score   support

          0       0.74      0.69      0.71     15267
          1       0.46      0.31      0.37     13016
          2       0.49      0.50      0.49     22118
          3       0.53      0.65      0.59     39190
          4       0.70      0.65      0.67     37283

avg / total       0.59      0.59      0.59    126874



In [46]:
confusion_matrix (ys, preds2)

array([[10511,  2899,  1035,   519,   303],
       [ 2386,  4077,  5080,  1213,   260],
       [  662,  1520, 10966,  8172,   798],
       [  312,   216,  4539, 25325,  8798],
       [  305,    59,   642, 12139, 24138]])

While a weighted-average precision of 0.59 isn't great, the 0.8725 AUC score is actually very, very okay, one of the better kinds of okay. Also, the validation scores during training were very good, which is always helpful. A benefit, no doubt, of using all 630,000 reviews of businesses in the Toronto area.

About 95% of predicted scores are within 1 star of the actual rating. Additionally, 1- and 5-star ratings had the highest precision (0.74 and 0.70) and, together with 4-star ratings, the highest recall (0.65 to 0.69), so our model is decent at picking up extreme sentiment (or the users are effusive in praise and unrestrained in condemnation).
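
The within-one-star claim can be checked directly against the confusion matrix printed above. This is a small numpy sketch with the matrix copied in by hand, reading rows as actual classes and columns as predicted classes (the scikit-learn convention).

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = np.array([[10511,  2899,  1035,   519,   303],
               [ 2386,  4077,  5080,  1213,   260],
               [  662,  1520, 10966,  8172,   798],
               [  312,   216,  4539, 25325,  8798],
               [  305,    59,   642, 12139, 24138]])

total = cm.sum()
exact = np.trace(cm) / total  # predictions that hit the exact star rating

# Cells where the predicted class is at most one step from the actual class.
i, j = np.indices(cm.shape)
within_one = cm[np.abs(i - j) <= 1].sum() / total

# Per-class precision (column-wise) and recall (row-wise).
precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / cm.sum(axis=1)

print(f"exact: {exact:.3f}, within one star: {within_one:.3f}")
print(np.round(precision, 2), np.round(recall, 2))
```

The per-class precision and recall computed this way should line up with the classification report, which is a handy sanity check that the matrix was read in the right orientation.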

But let's see if adding in weather and relative price can increase accuracy.

### Stage 2: weather effects

Star ratings are neither objective nor scientific. We humans often make bizarre, irrational and otherwise inconsistent choices due to many internal and external factors. Let's consider weather as one of the external factors, especially with regard to giving a star rating for a business. While good weather and a good mood might influence me to leave a more positive review as well as a higher star rating, there is really no way to know the sort of review I would have left had the weather been different (the old problem of not knowing probabilities conditional on histories that haven't happened).

What we can do is see if the review text matches with the score, and if knowing the weather conditions can improve the accuracy of our star predictions.

This raises an interesting question for businesses, since weather is something entirely out of their control. But it does suggest that if a business's sales are clearly affected by its star ratings, perhaps something extra can be done to improve customer satisfaction on a rainy day.

In [None]:
reviews_w = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [None]:
reviews_w = reviews_w[['stars','date','text']]

In [None]:
weather = pd.read_csv(f'{WEAT}all_weather.csv', index_col='Unnamed: 0')

In [None]:
weather.head()

In [None]:
weather['Year'] = weather['Year'].astype(int)
weather['Month'] = weather['Month'].astype(int)
weather['Day'] = weather['Day'].astype(int)

In [None]:
reviews_w['date'] = pd.to_datetime(reviews_w['date'])

In [None]:
reviews_w.head()
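
One plausible next step, sketched here on toy data, is to attach each day's weather to the reviews written that day. Only the `Year`, `Month`, `Day` and `date` columns come from the cells above; the `mean_temp_c` column and all the values are made up for illustration, and the real weather file may need a different join.

```python
import pandas as pd

# Toy stand-ins for reviews_w and weather; every value here is invented.
reviews_toy = pd.DataFrame({
    "stars": [4, 1, 5],
    "date": pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-02"]),
})
weather_toy = pd.DataFrame({
    "Year": [2017, 2017],
    "Month": [1, 1],
    "Day": [2, 3],
    "mean_temp_c": [-3.5, 1.0],  # hypothetical column name
})

# Build one datetime key from the integer Year/Month/Day columns.
parts = weather_toy[["Year", "Month", "Day"]].rename(columns=str.lower)
weather_toy["date"] = pd.to_datetime(parts)

# Left join keeps every review; validate guards against duplicate weather days.
merged = reviews_toy.merge(
    weather_toy.drop(columns=["Year", "Month", "Day"]),
    on="date", how="left", validate="many_to_one",
)
print(merged)
```

A left join keeps reviews on days with no weather record (they end up with missing values), which makes gaps in the weather data visible instead of silently dropping reviews.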

to be continued...