| Methods | Accuracy |
| ------------- |:-------------:|
| TF-IDF + Random Forest | 0.39 |
| Word2Vec + Random Forest | 0.53 |
| Word2Vec + Neural Network | 0.49 |

In [1]:
import numpy as np
import pandas as pd
import datetime
from matplotlib import pyplot as plt
%matplotlib inline 

### Data preparation

In [2]:
data = pd.read_json('./chinese_restaurants_review.json', lines=True)
data.shape

(156690, 10)

In [3]:
# Select columns we need
data = data[['cool', 'funny','text','stars']]
data.head()

Unnamed: 0,cool,funny,text,stars
0,0,0,This place is a gem. We had friendly attentive...,5
1,0,0,This is perhaps the closest pho restaurant to ...,5
2,0,0,Happened to stumble upon this little quaint re...,4
3,0,0,Very good food for decent prices. The atmosphe...,4
4,1,1,The food was very fresh and quite tasty. The m...,3


In [4]:
# tokenize, remove stop words and punctuation
# (note: the PorterStemmer below is created but never applied)
import nltk
import string
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
ps = PorterStemmer()
def token(x):
    
    x = x.lower()
    token = nltk.word_tokenize(x)
    
    # remove stop words
    stopWords = set(stopwords.words('english'))
    token = [x for x in token if x not in stopWords]
    
    # transform each token into string and remove punctuation
    t = [x.encode('utf-8').translate(None, string.punctuation) for x in token ]
    
    # after removing punctuation, some tokens may become empty strings
    return [x for x in t if x != '']
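
The `translate(None, string.punctuation)` call above is the Python 2 byte-string form. As a rough sketch of the same cleaning step under Python 3, assuming a plain whitespace split in place of `nltk.word_tokenize` and a tiny hypothetical stop-word set (`STOP_WORDS`) in place of the NLTK list:

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"the", "is", "a", "we", "had", "this"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_token(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words and empties."""
    words = text.lower().split()
    cleaned = [w.translate(PUNCT_TABLE) for w in words]
    return [w for w in cleaned if w and w not in STOP_WORDS]

print(simple_token("This place is a gem. We had friendly, attentive service!"))
```

Unlike the cell above, this sketch strips punctuation before filtering stop words, so a token such as `the.` would also be removed.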

In [7]:
begin = datetime.datetime.now()
data['token'] = data.text.apply(lambda x: token(x))
print ('Total time spent:', datetime.datetime.now() - begin)
data.head()

('Total time spent:', datetime.timedelta(0, 339, 55161))


Unnamed: 0,cool,funny,text,stars,token
0,0,0,This place is a gem. We had friendly attentive...,5,"[place, gem, friendly, attentive, service, foo..."
1,0,0,This is perhaps the closest pho restaurant to ...,5,"[perhaps, closest, pho, restaurant, port, cred..."
2,0,0,Happened to stumble upon this little quaint re...,4,"[happened, stumble, upon, little, quaint, rest..."
3,0,0,Very good food for decent prices. The atmosphe...,4,"[good, food, decent, prices, atmosphere, nice,..."
4,1,1,The food was very fresh and quite tasty. The m...,3,"[food, fresh, quite, tasty, mango, salad, nice..."


###  word2vec

In [146]:
from gensim.models import word2vec
num_features = 300   # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10         # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

In [147]:
import time
begin = time.time()
w2v_model = word2vec.Word2Vec(data.token, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)
print ('Total time spent:', time.time() - begin)

('Total time spent:', 98.15893697738647)


After we train the word2vec model, each word in the vocabulary (the words that occur at least 40 times) is associated with a vector of shape (300,). Recall that this vector is derived from the input-to-hidden-layer weights of the neural network model. Once we have this vector, we can plug it into the softmax layer to get a distribution over all the words. This distribution measures how similar (in terms of probability) each word is to the input. For example, we can find the words most similar to 'friendly':

In [148]:
w2v_model.most_similar('friendly')

[('polite', 0.7394624352455139),
 ('courteous', 0.7326678037643433),
 ('friendliest', 0.6483472585678101),
 ('gracious', 0.6144683361053467),
 ('pleasant', 0.5970733165740967),
 ('nice', 0.5592578649520874),
 ('friendliness', 0.553424596786499),
 ('attentive', 0.5446797609329224),
 ('hospitable', 0.5388295650482178),
 ('tentative', 0.5190747976303101)]
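
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation, using made-up 3-dimensional vectors (the `toy_vectors` dictionary is purely illustrative and not taken from the trained model):

```python
import numpy as np

# Invented vectors for illustration only.
toy_vectors = {
    "friendly": np.array([0.9, 0.1, 0.3]),
    "polite":   np.array([0.8, 0.2, 0.4]),
    "noodle":   np.array([-0.2, 0.9, 0.1]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_toy(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(w, cosine_similarity(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar_toy("friendly", toy_vectors)
print(ranking)
```

In this toy setup 'polite' ranks first because its vector points in nearly the same direction as 'friendly'.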

Now each word has its corresponding feature vector of length 300. We then consider the features of a review. A natural idea is to compute the feature vector of each word in the review and then take the average.

In [153]:
def getAverage(token):
    
    # num_features (300) is the dimensionality of the word vectors,
    # i.e. the number of units in the word2vec hidden layer
    feature_vec = np.zeros((num_features,), dtype="float32")
    
    n_words = 0
    
    # set of words in the vocabulary of the trained w2v_model
    word_set = set(w2v_model.wv.index2word)
    
    for x in token:
        if x in word_set:
            n_words += 1
            feature_vec += w2v_model[x]
    
    return np.divide(feature_vec,float(n_words))

In [154]:
begin = time.time()
data['w2v_feature'] = data.token.apply(lambda token: getAverage(token))
print ('Total time spent:', time.time() - begin)

('Total time spent:', 211.63270497322083)


Here we should be careful: word2vec only learns vectors for words that occur at least `min_word_count` (40) times. If none of the tokens in a review is in the vocabulary, `n_words` is 0 and the average is 0/0, so every entry of the feature vector is NaN. For example, the w2v_feature of the following review is all NaN.

In [94]:
data.iloc[12]

cool                                                           0
funny                                                          0
text           店堂不大，我到的早，周末中午第一拨客人，随后陆续来了不少中国人外国人，拖家带口的，小情侣，嗯...
stars                                                          4
token          [店堂不大，我到的早，周末中午第一拨客人，随后陆续来了不少中国人外国人，拖家带口的，小情侣，...
w2v_feature    [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
Name: 12, dtype: object
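
The NaN values come from dividing a zero vector by a zero word count. A small sketch of that failure mode, together with one possible guard; the vocabulary `toy_vocab` and the helper `average_or_none` are assumptions made for illustration and are not part of the notebook:

```python
import numpy as np

num_features = 4
# Hypothetical one-word vocabulary.
toy_vocab = {"food": np.ones(num_features, dtype="float32")}

def average_unguarded(tokens):
    """Same logic as getAverage: sum known vectors, divide by their count."""
    total = np.zeros(num_features, dtype="float32")
    n_words = 0
    for t in tokens:
        if t in toy_vocab:
            total += toy_vocab[t]
            n_words += 1
    # Silence the 0/0 warning so the NaN result can be inspected.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.divide(total, float(n_words))

def average_or_none(tokens):
    """Return None instead of NaN when no token is in the vocabulary."""
    vectors = [toy_vocab[t] for t in tokens if t in toy_vocab]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)

print(average_unguarded(["unknown"]))
print(average_or_none(["unknown"]))
```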

To deal with this issue, we simply drop all data points whose features contain NaN. (There are 83 of them.)

In [155]:
ix = []
for i in range(data.shape[0]):
    if np.any(np.isnan(data.w2v_feature.iloc[i])): ix.append(i)
data = data.drop(ix, axis= 0)   
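
The row-by-row loop above could also be expressed as a single boolean mask. A sketch on a made-up list of feature vectors (the `features` list is invented for illustration):

```python
import numpy as np

# Three invented feature vectors; the middle one is all NaN.
features = [
    np.array([0.1, 0.2, 0.3]),
    np.array([np.nan, np.nan, np.nan]),
    np.array([0.4, 0.5, 0.6]),
]

matrix = np.stack(features)            # shape (n_rows, n_features)
keep = ~np.isnan(matrix).any(axis=1)   # True for rows without any NaN
clean = matrix[keep]

print(keep)
print(clean.shape)
```

A mask like `keep` has one entry per row, so it could in principle be used to filter the DataFrame as well, assuming the rows are in the same order.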

In [156]:
data.shape

(156607, 6)

In [157]:
# we can save the w2v_model 
w2v_model.save('./w2v_model')

OK! We are now ready to fit some machine learning algorithms on our data.

In [158]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stack the features into a 2-D array that sklearn accepts
feature = np.array([x for x in data.w2v_feature])
X_train,  X_test, y_train, y_test = train_test_split(feature, data['stars'], test_size = 0.3, random_state = 1024)

#### Random Forest

In [19]:
from sklearn.model_selection import GridSearchCV
n_trees = [100]
m_depth = [10,15,20]
clf = RandomForestClassifier()
grid = GridSearchCV(estimator = clf, param_grid= dict(n_estimators = n_trees, max_depth = m_depth))

In [31]:
begin = datetime.datetime.now()
grid.fit(X_train, y_train)
print ('Total time spent:', datetime.datetime.now() - begin)

('Total time spent:', datetime.timedelta(0, 3010, 277790))


In [32]:
print (grid.best_params_)

{'n_estimators': 100, 'max_depth': 15}


In [34]:
forest = RandomForestClassifier(n_estimators=100, max_depth=15)
# use a new name so the trained Word2Vec model (w2v_model) is not overwritten
rf_model = forest.fit(X_train, y_train)

In [35]:
w2v_predict = rf_model.predict(X_test)

In [36]:
# test Accuracy
np.sum([w2v_predict == y_test])/float(len(y_test))

0.52929783113040885

The overall accuracy is not that impressive. But a closer look at the results shows that, among reviews whose true rating is good (4 or 5 stars), 94% are also predicted as good (4 or 5 stars).

In [71]:
np.sum([(w2v_predict >= 4)& (y_test >= 4)])/float(np.sum(y_test >= 4))

0.94393311576399841

In [63]:
# fraction of truly good reviews (stars >= 4) predicted as bad (stars < 4)
np.sum([(w2v_predict < 4)& (y_test >= 4)])/float(np.sum(y_test >= 4))

0.056066884236001611

In [78]:
# fraction of 4-star reviews predicted as 3 stars
np.sum([(w2v_predict == 3)& (y_test == 4)])/float(np.sum(y_test ==4))

0.054036781935667119

In [79]:
# fraction of 3-star reviews predicted as 4 stars
np.sum([(w2v_predict == 4)& (y_test == 3)])/float(np.sum(y_test==3))

0.57437407952871866
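
The four ratios above are individual entries (or sums of entries) of a row-normalised confusion matrix. As a sketch of how all of them could be read off at once, here is a NumPy version on made-up labels (`y_true_toy` and `y_pred_toy` are invented and do not reproduce the numbers above):

```python
import numpy as np

# Invented star ratings for illustration only.
y_true_toy = np.array([5, 5, 4, 4, 4, 3, 3, 1])
y_pred_toy = np.array([5, 4, 4, 5, 3, 4, 3, 5])

n_class = 5
# counts[i, j] = number of reviews with true stars i+1 predicted as stars j+1
counts = np.zeros((n_class, n_class), dtype=int)
np.add.at(counts, (y_true_toy - 1, y_pred_toy - 1), 1)

# Divide each row by its total; rows with no samples stay at zero.
row_totals = counts.sum(axis=1, keepdims=True)
rates = np.divide(counts, row_totals, out=np.zeros(counts.shape), where=row_totals > 0)

frac_3_as_4 = rates[2, 3]                                  # 3 stars predicted as 4
good_recall = counts[3:, 3:].sum() / counts[3:, :].sum()   # true >= 4 predicted >= 4

print(rates)
print(frac_3_as_4, good_recall)
```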

#### Positive or Negative
Let's reduce the problem to binary classification: we divide the ratings into 'good' (stars = 4, 5) and 'bad' (stars = 1, 2, 3).

In [86]:
def good_or_bad(stars):
    if stars >= 4: return 'good'
    else: return 'bad'
data['two_label'] = data.stars.apply(lambda stars: good_or_bad(stars))
data.head(3)

Unnamed: 0,cool,funny,text,stars,token,w2v_feature,two_label
0,0,0,This place is a gem. We had friendly attentive...,5,"[place, gem, friendly, attentive, service, foo...","[0.23933, -0.652453, 0.746594, 0.98832, 1.2323...",good
1,0,0,This is perhaps the closest pho restaurant to ...,5,"[perhaps, closest, pho, restaurant, port, cred...","[0.485038, 0.363754, 0.0717848, 0.172526, 0.95...",good
2,0,0,Happened to stumble upon this little quaint re...,4,"[happened, stumble, upon, little, quaint, rest...","[-0.629566, -0.113229, 0.406629, 0.274055, 0.3...",good


In [87]:
X_train,  X_test, y_train, y_test = train_test_split(feature, data['two_label'], test_size = 0.3, random_state = 1024)
forest = RandomForestClassifier(n_estimators=100, max_depth=15)
binary_model = forest.fit(X_train, y_train)

In [89]:
binary_predict = binary_model.predict(X_test)
# binary accuracy
np.sum([binary_predict == y_test])/float(len(y_test))

0.82721409871655704

### TF-IDF
As a benchmark, we use TF-IDF to do the same analysis. The major difference is the way we extract features from words. In TF-IDF, a word's weight in a review combines how frequently the word appears in that review (TF) with how rare it is across all reviews (IDF).
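
To make the two factors concrete, here is a small sketch using the textbook definition tf(t, d) * log(N / df(t)) on a made-up corpus. This formula is an assumption chosen for illustration: scikit-learn's `TfidfTransformer` applies a smoothed IDF and L2 normalisation by default, so its numbers differ.

```python
import math
from collections import Counter

# Invented, already-tokenized reviews.
toy_corpus = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "price"],
]

n_docs = len(toy_corpus)
# Document frequency: in how many reviews each term appears at least once.
doc_freq = Counter(term for doc in toy_corpus for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

weights = tfidf(toy_corpus[0])
print(weights)
```

In the first toy review, 'service' gets the largest weight even though 'good' occurs twice, because 'service' appears in no other review.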

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
count_vec = CountVectorizer()
tfidf = TfidfTransformer()
cv = count_vec.fit_transform(data.text)
tfidf_feature = tfidf.fit_transform(cv)

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_feature, data['stars'], test_size = 0.3, random_state = 1024)

In [12]:
forest = RandomForestClassifier(n_estimators=100, max_depth=15)
tfidf_model = forest.fit(X_train, y_train)

In [13]:
tfidf_predict = tfidf_model.predict(X_test)
# Accuracy
np.sum([tfidf_predict == y_test])/float(len(y_test))

0.39028229838109219

### Two stage learning
Use word2vec to extract features, then run fully connected NN with two hidden layers.

In [13]:
import tensorflow as tf

In [176]:
feature_size = 300
n_node_hl1 = 100
n_node_hl2 = 100
n_class = 5
batch_size = 200
n_epoch = 10

In [177]:
# placeholders
x = tf.placeholder('float', [None, feature_size])

# one-hot labels, one column per class
y = tf.placeholder('float', [None, n_class])

In [178]:
def NN_model(data):
    # Weights and bias of each layer
    
    hl1 = {'weights': tf.Variable(tf.random_normal([feature_size, n_node_hl1])),
            'bias': tf.Variable(tf.random_normal([n_node_hl1])) }
    hl2 = {'weights': tf.Variable(tf.random_normal([n_node_hl1, n_node_hl2])),
            'bias': tf.Variable(tf.random_normal([n_node_hl2])) }

    output_layer = { 'weights': tf.Variable(tf.random_normal([n_node_hl2, n_class])),
                'bias' : tf.Variable(tf.random_normal([n_class]))}
    
    # Output of each layer
    # Relu((x*W + bias))
    s1 = tf.add(tf.matmul(data, hl1['weights']),hl1['bias'])
    a1 = tf.nn.relu(s1)
    
    s2 = tf.add(tf.matmul(a1, hl2['weights']), hl2['bias'])
    a2 = tf.nn.relu(s2)
    
    output = tf.add(tf.matmul(a2, output_layer['weights']), output_layer['bias'])
    
    #return tf.reshape(tf.cast(tf.argmax(output,1),'float'), [batch_size, 1])
    return output
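
As a shape check for the network above, the same forward pass can be sketched in NumPy. The weights here are random and untrained, and the batch of 4 inputs is made up; only the layer sizes follow the settings above.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_size, n_node_hl1, n_node_hl2, n_class = 300, 100, 100, 5

# Random (untrained) weights and biases, one pair per layer.
W1, b1 = rng.normal(size=(feature_size, n_node_hl1)), rng.normal(size=n_node_hl1)
W2, b2 = rng.normal(size=(n_node_hl1, n_node_hl2)), rng.normal(size=n_node_hl2)
W3, b3 = rng.normal(size=(n_node_hl2, n_class)), rng.normal(size=n_class)

def relu(z):
    return np.maximum(z, 0.0)

def forward(batch):
    """Two ReLU hidden layers followed by a linear output layer (logits)."""
    a1 = relu(batch @ W1 + b1)
    a2 = relu(a1 @ W2 + b2)
    return a2 @ W3 + b3

logits = forward(rng.normal(size=(4, feature_size)))
print(logits.shape)
```

The output has one row per input and one column per star class; the predicted class is the column with the largest logit.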

In [179]:
def train_NN_model(x,y, n_epoch):
    prediction = NN_model(x)
    cost = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits = prediction, labels=y))
    
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    
    
    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())
    
    
    for epoch in range(n_epoch):
        epoch_loss = 0
        
        # cut into batches
        i = 0
        while i < len(X_train):
            start = i
            end = i + batch_size
            batch_X = X_train[start: end]
            
            # the original y holds integer star labels, so we transform them into one-hot vectors;
            # the output of tf.one_hot is a tensor node, so we need to sess.run(batch_y)
            batch_y = \
            tf.one_hot(np.array(y_train[start: end]-1),5,on_value=1.0, off_value=0.0)
        
            _, c = sess.run([optimizer, cost],feed_dict={x: batch_X, y:sess.run(batch_y)})
            
            epoch_loss += c
            
            i += batch_size
        print  ('Epoch:', epoch, 'loss:', epoch_loss)
    
    
    correct = tf.equal(tf.argmax(prediction,1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
    
    one_hot_y_test = tf.one_hot(np.array(y_test-1),5,on_value=1.0, off_value=0.0)
    print ('Accuracy', accuracy.eval(feed_dict={x: X_test, y: sess.run(one_hot_y_test)}))
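
Calling `tf.one_hot` inside the loop adds a new node to the graph on every batch. One possible alternative, sketched here under the assumption that the labels are integer stars from 1 to 5, is to build the one-hot array directly in NumPy before feeding it. The helper name `one_hot_stars` is hypothetical.

```python
import numpy as np

def one_hot_stars(stars, n_class=5):
    """Map integer stars 1..n_class to one-hot rows of length n_class."""
    stars = np.asarray(stars, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(n_class, dtype="float32")[stars - 1]

batch = one_hot_stars([5, 1, 3])
print(batch)
```

An array like `batch` could then be passed in `feed_dict` for `y` without an extra `sess.run` call.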

In [180]:
begin = time.time()
train_NN_model(x,y,n_epoch)
print ('Total time spent:', time.time() - begin)

('Epoch:', 0, 'loss:', 7566256.7371826172)
('Epoch:', 1, 'loss:', 2794056.434387207)
('Epoch:', 2, 'loss:', 1613582.7296142578)
('Epoch:', 3, 'loss:', 990616.95877075195)
('Epoch:', 4, 'loss:', 604009.03515625)
('Epoch:', 5, 'loss:', 373846.35391998291)
('Epoch:', 6, 'loss:', 261112.11938858032)
('Epoch:', 7, 'loss:', 208930.66054153442)
('Epoch:', 8, 'loss:', 183264.86547851562)
('Epoch:', 9, 'loss:', 167344.09373474121)
('Accuracy', 0.48677182)
('Total time spent:', 10487.929569005966)


Hyperparameter settings tried (n_class = 5 in every run; a dash means the value was not recorded):

| feature_size | n_node_hl1 | n_node_hl2 | batch_size | n_epoch | Loss | Accuracy |
| ------------- |:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| - | 50 | 50 | 3000 | 10 | - | 0.40 |
| 100 | 50 | 50 | 3000 | 15 | 1400282 | 0.41 |
| 300 | 100 | 100 | 3000 | 10 | - | 0.435 |
| 300 | 100 | 100 | 500 | 10 | 352956 | 0.459 |
| 300 | 100 | 100 | 500 | 15 | 176313 | 0.44 |
| 300 | 100 | 100 | 200 | 10 | 167344 | 0.48 |