# Chinese food review analyzer

## Part 1: Sentiment analysis in Chinese food review data

The neural network architecture and parts of the preprocessing (i.e. tokenization) for this Chinese sentiment analysis are borrowed from https://github.com/Tony607/Chinese_sentiment_analysis.


In [4]:
from os import listdir
from os.path import isfile, join
import jieba
import codecs
from langconv import * # convert Traditional Chinese characters to Simplified Chinese characters
import pickle
import random
import numpy as np

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import TensorBoard
from keras.models import load_model

Using TensorFlow backend.


### Get files and prepare documents


In [3]:
negativeReviews = './data/reviews/neg/both_neg.txt'
positiveReviews = './data/reviews/pos/both_pos.txt'

documents = []
with codecs.open(positiveReviews, "rb") as doc_file:
    for line in doc_file:
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip lines that are not valid UTF-8
        # Convert from traditional to simplified Chinese
        text = Converter('zh-hans').convert(line)
        text = text.replace("\n", "")
        text = text.replace("\r", "")
        documents.append((text, "pos"))

with codecs.open(negativeReviews, "rb") as doc_file:
    for line in doc_file:
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip lines that are not valid UTF-8
        # Convert from traditional to simplified Chinese
        text = Converter('zh-hans').convert(line)
        text = text.replace("\n", "")
        text = text.replace("\r", "")
        documents.append((text, "neg"))


### Shuffle the data

Shuffle the documents so that positive and negative reviews are interleaved instead of all the positives coming first and all the negatives last.

In [5]:
random.shuffle(documents)
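
Shuffling matters for more than tidiness: Keras's `validation_split` holds out the last fraction of the samples as they are passed in, before any per-epoch shuffling. The sketch below uses made-up toy data (the names `toy_documents` and `tail_labels` are illustrative, not part of the notebook) to show what a tail hold-out would contain with and without shuffling.

```python
import random

# Toy stand-in for `documents`: positives appended first, then negatives.
# The counts are invented for illustration.
toy_documents = [("review %d" % i, "pos") for i in range(6)] + \
                [("review %d" % i, "neg") for i in range(6, 10)]

def tail_labels(docs, fraction=0.2):
    """Labels of the last `fraction` of docs, i.e. what a tail hold-out sees."""
    split_at = int(len(docs) * (1 - fraction))
    return [label for _, label in docs[split_at:]]

# Without shuffling, the held-out tail contains only negatives.
print(tail_labels(toy_documents))

# With shuffling, the tail is a random sample of both classes.
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = toy_documents[:]
rng.shuffle(shuffled)
print(tail_labels(shuffled))
```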

### Prepare the input and output for the model
Each review becomes a list of tokens; the output is a single label ("pos" or "neg").

In [6]:
# Tokenize
totalX = []
totalY = [str(doc[1]) for doc in documents]
for doc in documents:
    seg_list = jieba.cut(doc[0], cut_all=False)
    seg_list = list(seg_list)
    totalX.append(seg_list)


Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/2m/qy2t9vkn71g_l16mj4nxvbwc0000gn/T/jieba.cache
Loading model cost 1.094 seconds.
Prefix dict has been built succesfully.


### Determine max length
Choose the maximum input sequence length. Here the cutoff is taken at the 60% point of the sorted review lengths, so roughly 60% of the reviews fit without truncation. A longer input sequence increases training time but may improve prediction accuracy.

In [None]:
h = sorted([len(sentence) for sentence in totalX])
print(h)
maxLength = h[int(len(h) * 0.60)]
print("Max length is: ",h[-1])
print("60% cover length up to: ",maxLength)
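
As a minimal sketch of the cutoff rule above, the snippet below applies the same indexing to a made-up list of token counts. The helper name `coverage_length` and the toy numbers are assumptions for illustration only.

```python
def coverage_length(lengths, fraction):
    """Return the value at position int(n * fraction) of the sorted lengths."""
    ordered = sorted(lengths)
    return ordered[int(len(ordered) * fraction)]

# Made-up token counts for ten reviews.
toy_lengths = [3, 5, 8, 12, 20, 31, 40, 55, 70, 120]

cutoff = coverage_length(toy_lengths, 0.60)
covered = sum(1 for n in toy_lengths if n <= cutoff)
print("cutoff:", cutoff)
print("reviews that fit without truncation:", covered, "of", len(toy_lengths))
```

Because the index `int(n * fraction)` is zero-based, the share of reviews that fit can come out slightly above the nominal 60%.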

### Words to number tokens, padding
Pad each input sequence to the maximum input length if it is shorter.

Save the input tokenizer, since the same tokenizer must be used for new prediction data.

In [8]:
totalX = [" ".join(wordslist) for wordslist in totalX]  # Keras Tokenizer expects the word tokens to be separated by spaces
input_tokenizer = Tokenizer(30000) # Upper bound on the vocabulary size (num_words)
input_tokenizer.fit_on_texts(totalX)
vocab_size = len(input_tokenizer.word_index) + 1
totalX = np.array(pad_sequences(input_tokenizer.texts_to_sequences(totalX), maxlen=maxLength))
__pickleStuff("./data/input_tokenizer_chinese.p", input_tokenizer)

In [9]:
print("input vocab_size:",vocab_size)

input vocab_size: 3861


In [10]:
print(totalX[0])

[  0   0   0   0   0   0   0   0   0   0   0   0  28  54   6  21  21  21
 157  67   4  11 412   2 179  78   2  70   9   4  11]
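
The leading zeros in the row above come from padding. With its default settings, `pad_sequences` pads on the left (`padding='pre'`) and, for sequences longer than `maxlen`, drops tokens from the front (`truncating='pre'`). A pure-Python sketch of that behavior, where `pad_pre` is a hypothetical helper and not a Keras function:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(sequence) >= maxlen:
        return sequence[len(sequence) - maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

short = [28, 54, 6]
long_ = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short, 5))  # shorter than maxlen: zeros are added on the left
print(pad_pre(long_, 5))  # longer than maxlen: the first tokens are dropped
```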


### Output, array of 0s and 1s

In [11]:
target_tokenizer = Tokenizer(3) 
target_tokenizer.fit_on_texts(totalY)
print("output vocab_size:",len(target_tokenizer.word_index) + 1)
totalY = np.array(target_tokenizer.texts_to_sequences(totalY)) -1
totalY = totalY.reshape(totalY.shape[0])

output vocab_size: 3


In [12]:
totalY[40:50]

array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])

In [13]:
len(totalY)

693

### Turn output 0s and 1s into categories (one-hot vectors)

In [14]:
totalY = to_categorical(totalY, num_classes=2)

In [None]:
totalY[40:50]

In [16]:
output_dimen = totalY.shape[1] # which is 2

In [17]:
totalY.shape

(693, 2)
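
For readers unfamiliar with one-hot encoding, here is a small pure-Python sketch of what `to_categorical` does to a list of class ids. The helper `one_hot` is illustrative and returns nested lists rather than a NumPy array.

```python
def one_hot(labels, num_classes):
    """Turn each integer class id into a row with a single 1.0."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([1, 1, 0], 2)
for row in encoded:
    print(row)
```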

### Save metadata for later prediction
maxLength: the input sequence length

vocab_size: the input vocabulary size

output_dimen: the number of output classes, which is 2 in this example (pos or neg)

sentiment_tag: either ["neg","pos"] or ["pos","neg"], matching the index order of the target tokenizer

In [18]:
target_reverse_word_index = {v: k for k, v in list(target_tokenizer.word_index.items())}
sentiment_tag = [target_reverse_word_index[1],target_reverse_word_index[2]] 
metaData = {"maxLength":maxLength,"vocab_size":vocab_size,"output_dimen":output_dimen,"sentiment_tag":sentiment_tag}
__pickleStuff("./data/meta_sentiment_chinese.p", metaData)
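
The order of `sentiment_tag` depends on how the target tokenizer numbers the labels. As far as I understand the Keras `Tokenizer`, indices start at 1 and are assigned by descending frequency, so the more common label gets index 1. The sketch below imitates that rule with `collections.Counter`; `label_index` and the toy labels are assumptions for illustration, not the Keras implementation.

```python
from collections import Counter

def label_index(labels):
    """Assign 1-based indices by descending frequency (ties: first seen wins)."""
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: i + 1 for i, label in enumerate(ordered)}

toy_labels = ["pos", "neg", "neg", "pos", "neg"]
toy_word_index = label_index(toy_labels)

# Same construction as in the cell above, applied to the toy index.
toy_reverse = {v: k for k, v in toy_word_index.items()}
toy_sentiment_tag = [toy_reverse[1], toy_reverse[2]]

# Class ids are the tokenizer indices minus 1, so they index into the tag list.
class_ids = [toy_word_index[label] - 1 for label in toy_labels]
print(toy_sentiment_tag, class_ids)
```

Storing the tag list alongside the model is what lets the prediction code map an argmax position back to "pos" or "neg" without refitting the tokenizer.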

### Build the Model, train and save it


In [19]:
embedding_dim = 256

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim,input_length = maxLength))
# Each input has shape (maxLength x 256); the 256-dimensional vectors are fed into the GRU layer one time step at a time.
# return_sequences=True collects all the intermediate outputs and passes them on to the second GRU layer.
model.add(GRU(256, dropout=0.9, return_sequences=True))
# The second GRU layer consumes the intermediate outputs and returns only its final output
model.add(GRU(256, dropout=0.9))
# The output is then sent to a fully connected layer that gives us our final output_dimen classes
model.add(Dense(output_dimen, activation='softmax'))
# Log training to TensorBoard
tbCallBack = TensorBoard(log_dir='./Graph/sentiment_chinese', histogram_freq=0,
                            write_graph=True, write_images=True)
# We use the Adam optimizer instead of standard SGD since it usually converges faster
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(totalX, totalY, validation_split=0.1, batch_size=32, epochs=20, verbose=1, callbacks=[tbCallBack])
! rm -f mod.h5
model.save('mod.h5')
print("Saved model!")

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 31, 256)           988416    
_________________________________________________________________
gru_1 (GRU)                  (None, 31, 256)           393984    
_________________________________________________________________
gru_2 (GRU)                  (None, 256)               393984    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514       
Total params: 1,776,898
Trainable params: 1,776,898
Non-trainable params: 0
_________________________________________________________________
Train on 623 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
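
The numbers in the summary above can be reproduced by hand. The sketch below assumes the classic GRU parameterization with one bias vector per weight set (update gate, reset gate, candidate state), which matches the 393,984 figure printed here; newer Keras versions with `reset_after=True` would report a different count. It also reproduces the 623/70 train/validation split from `validation_split=0.1`.

```python
vocab_size, embedding_dim, units, classes = 3861, 256, 256, 2

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

def gru_params(input_dim, units):
    """Three weight sets, each with input weights, recurrent weights and a bias."""
    return 3 * (input_dim * units + units * units + units)

gru1_params = gru_params(embedding_dim, units)
gru2_params = gru_params(units, units)

# Dense: a weight per (unit, class) pair plus one bias per class.
dense_params = units * classes + classes

total_params = embedding_params + gru1_params + gru2_params + dense_params
print(embedding_params, gru1_params, gru2_params, dense_params, total_params)

# validation_split=0.1 on 693 samples: the split point is rounded down.
samples = 693
train = int(samples * (1 - 0.1))
validate = samples - train
print(train, validate)
```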


### Load the metadata and the model we just trained and saved

In [44]:
def __pickleStuff(filename, stuff):
    # Serialize `stuff` to `filename`
    with open(filename, "wb") as save_stuff:
        pickle.dump(stuff, save_stuff)

def __loadStuff(filename):
    # Load a pickled object from `filename`
    with open(filename, "rb") as saved_stuff:
        return pickle.load(saved_stuff)

# Load the model and the metadata needed for prediction into globals
def load_it(filepath):
    global model, sentiment_tag, maxLength
    metaData = __loadStuff("./data/meta_sentiment_chinese.p")
    maxLength = metaData.get("maxLength")
    sentiment_tag = metaData.get("sentiment_tag")
    model = load_model(filepath)

### Prediction code

In [41]:
# Function to convert sentence to model input
def findFeatures(text):
    text=Converter('zh-hans').convert(text)
    text = text.replace("\n", "")
    text = text.replace("\r", "") 
    seg_list = jieba.cut(text, cut_all=False)
    seg_list = list(seg_list)
    text = " ".join(seg_list)
    textArray = [text]
    input_tokenizer_load = __loadStuff("./data/input_tokenizer_chinese.p")
    textArray = np.array(pad_sequences(input_tokenizer_load.texts_to_sequences(textArray), maxlen=maxLength))
    return textArray

# Function to predict result
def predictResult(text):
    features = findFeatures(text)
    predicted = model.predict(features)[0] # we have only one sentence to predict, so take index 0
    predicted = np.array(predicted)
    probab = predicted.max()
    predition = sentiment_tag[predicted.argmax()]
    return predition, probab
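
The decoding step in `predictResult` boils down to an argmax over the softmax output followed by a lookup in `sentiment_tag`. A pure-Python sketch with a hand-written probability vector (the helper `decode_prediction` and the example numbers are illustrative):

```python
def decode_prediction(probabilities, tags):
    """Return the tag with the highest probability and that probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return tags[best], probabilities[best]

# Made-up softmax output for one review, with tags in ["neg", "pos"] order.
label, confidence = decode_prediction([0.03, 0.97], ["neg", "pos"])
print(label, confidence)
```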

### Actually load the model

In [37]:
load_it('mod.h5')

### Testing predictions

In [42]:
predictResult("姜葱焗龙虾是几乎每桌必点的招牌菜，分为两种规格，大龙虾是一只2.5磅以上的大龙虾，当天时价16.95刀/磅，双龙虾是两只约1磅的小龙虾（共计2磅左右），当天时价29.95刀/份，点菜时服务员直接推荐了按份计价，相对实惠的双龙虾，上菜后感觉菜量并不小，三个人吃还是可以的。龙虾的肉质鲜嫩饱满，葱姜炒的做法对足了中国人的胃口，吃着格外舒坦，尽管勾芡稍有些稠，味道也稍微偏咸，但瑕不掩瑜，依旧吃得很过瘾。")

('pos', 0.9678621)

In [43]:
print(predictResult("蒜蓉通菜上菜很快，菜量很大，摆盘略丑，通心菜有些老，色泽略微偏黄，炒得火候比较老，口感偏软，味道也比较一般。"))

('neg', 0.99999464)


In [10]:
print(predictResult("瑶柱蛋白炒饭上菜较快，菜量较大，米饭炒得颗粒分明，咸淡适中，葱花和干贝丝点缀其中，鲜而不咸，吃起来很香。"))

('pos', 0.999877)


In [21]:
print(predictResult("鱼香肉丝超级好吃的"))

('pos', 0.94182307)


## Part 2: Menu
The menu is hard-coded for now, but it could be built dynamically (PDF converter, HTML scraper, etc.).

In [5]:
menu = {}
menu["三文鱼"] = "Salmon"
menu["茶碗蒸"] = "Steamed Egg"
menu["黄瓜鳗鱼卷"] = "UNAGI KYURI MAKI"
menu["波龙"] = "Lobster with Ginger and Scallion"
menu["清蒸鱼"] = "Steamed Big Flounder"
menu["鱼香肉丝"] = "Shredded Pork with Garlic Sauce"
menu["宫保鸡丁"] = "Kung Pao Chicken"
menu["姜葱焗龙虾"] = "Lobster with Ginger and Scallion"
menu["蒜蓉通菜"] = "Stir Fried Tong Cai"
menu["瑶柱蛋白炒饭"] = "Dried Scallop with Egg White"
menu["红烧豆腐"] = "Fried Bean Curd with Mushroom"
menu["蚝油牛肉"] = "Beef with Oyster Sauce"
menu["生炒三鲜"] = "Pan Fried Assorted Seafood"
menu["节瓜粉丝煲"] = "Dried Shrimp, Vermicelli and Pork with Fuzzy Melon"
menu["红豆汤"] = "Red Bean Soup"
menu["西米露"] = "Sweet Milk with Boba"
menu["椒盐牛仔骨"] = "Spicy Salted Dry Fried Veal Ribs"
menu["红酒烩牛尾"] = "Ox Tail with Red Wine" 
menu["椒盐龙虾"] = "Spicy Salted Dry Fried Lobster"
menu["罗汉斋"] = "Buddha's Delight"
menu["珊瑚虾刺身"] = "Shrimp Sashimi"
menu["椒盐大蟹"] = "Spicy Salted Dry Fried Crab"
menu["豉汁蒸生蚝"] = "Steamed Oysters with Black Bean Sauce"

## Part 3: Parser 
Segments a real review

### Read file
Read the file contents into `text`.

In [27]:
test_review2 = './data/reviews/test_review2.txt'
filename = test_review2
# Read the whole review as UTF-8 text
with codecs.open(filename, "r", encoding="utf-8") as doc_file:
    text = doc_file.read()


### Create the mapping between dishes and reviews

The review passages are later fed into the sentiment analysis.

In [30]:
def get_dishes(phrase, menu):
    dishes = set()
    for d in menu:
        if d in phrase:
            dishes.add(d)
    return dishes

def contains_diff_dish(phrase, dishes, menu):
    for d in menu:
        if d in phrase and d not in dishes:
            return True
    return False

def create_mapping(review, menu):
    mapping = dict()
    sentences = review.split("。")
    for sentence in sentences:
        phrases = sentence.split("，")
        i = 0
        while i < len(phrases):
            phrase = phrases[i]
            dishes = list(get_dishes(phrase, menu))
            if len(dishes) >= 1:
                idea = phrase
                i += 1
                while i < (len(phrases)) and not contains_diff_dish(phrases[i], dishes, menu):
                    idea += "，"
                    idea += phrases[i]
                    i+=1
                for d in dishes:
                    mapping.update({d: idea})
            else:
                i+=1
    return mapping 

In [32]:
mapping = create_mapping(text, menu)
mapping

{'姜葱焗龙虾': '姜葱焗龙虾是几乎每桌必点的招牌菜，分为两种规格，大龙虾是一只2.5磅以上的大龙虾，当天时价16.95刀/磅，双龙虾是两只约1磅的小龙虾（共计2磅左右），当天时价29.95刀/份，点菜时服务员直接推荐了按份计价，相对实惠的双龙虾，上菜后感觉菜量并不小，三个人吃还是可以的',
 '蒜蓉通菜': '蒜蓉通菜和鱼香肉丝上菜很快，菜量很大，摆盘略丑，通心菜有些老，色泽略微偏黄，炒得火候比较老，口感偏软，味道也比较一般',
 '鱼香肉丝': '蒜蓉通菜和鱼香肉丝上菜很快，菜量很大，摆盘略丑，通心菜有些老，色泽略微偏黄，炒得火候比较老，口感偏软，味道也比较一般',
 '瑶柱蛋白炒饭': '瑶柱蛋白炒饭上菜较快，菜量较大，米饭炒得颗粒分明，咸淡适中，葱花和干贝丝点缀其中，鲜而不咸，吃起来很香'}
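
The core of the parser is the matching step: split on "。" and "，", then look for menu keys as substrings of each phrase. The sketch below isolates that step on a toy review; `dishes_per_phrase`, `toy_menu` and `toy_review` are made up for illustration and are not part of the notebook.

```python
toy_menu = {
    "蒜蓉通菜": "Stir Fried Tong Cai",
    "鱼香肉丝": "Shredded Pork with Garlic Sauce",
}
toy_review = "蒜蓉通菜和鱼香肉丝上菜很快，菜量很大。鱼香肉丝超级好吃的"

def dishes_per_phrase(review, menu):
    """List each comma-separated phrase with the menu dishes it mentions."""
    result = []
    for sentence in review.split("。"):
        for phrase in sentence.split("，"):
            if phrase:
                found = sorted(d for d in menu if d in phrase)
                result.append((phrase, found))
    return result

phrase_dishes = dishes_per_phrase(toy_review, toy_menu)
for phrase, found in phrase_dishes:
    print(phrase, "->", found)
```

A phrase that names two dishes is attributed to both, which is why 蒜蓉通菜 and 鱼香肉丝 share one passage in the output above. Note also that `create_mapping` stores one passage per dish, so a dish mentioned in two separate passages keeps only the last one.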

## Get results

Combine all three parts: run sentiment analysis on the review passage for each dish and assign the dish to "Good" or "Bad".

In [45]:
neg_list = []
pos_list = []
for dish_name in mapping:
    review = mapping[dish_name]
    if predictResult(review)[0] == "neg":
        neg_list.append(dish_name)
    else:
        pos_list.append(dish_name)

print("Good dishes: ")
for dish in pos_list:
    print(dish + " -- " + menu[dish])
print("Bad dishes: ")
for dish in neg_list:
    print(dish + " -- " + menu[dish])

Good dishes: 
姜葱焗龙虾 -- Lobster with Ginger and Scallion
瑶柱蛋白炒饭 -- Dried Scallop with Egg White
Bad dishes: 
蒜蓉通菜 -- Stir Fried Tong Cai
鱼香肉丝 -- Shredded Pork with Garlic Sauce
