# Tutorial: Chinese sentiment analysis with hotel review data
## Dependencies

Python 3.5, numpy, pickle, keras, tensorflow, [jieba](https://github.com/fxsjy/jieba)

## Optional for plotting

pylab, scipy


In [3]:
from os import listdir
from os.path import isfile, join
import jieba
import codecs
from profitero_data_scientist.chinese_sentiment_analysis_master.langconv import *  # convert Traditional Chinese characters to Simplified Chinese characters
import pickle
import random

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import TensorBoard


## Helper functions to pickle and load objects

In [4]:
def __pickleStuff(filename, stuff):
    save_stuff = open(filename, "wb")
    pickle.dump(stuff, save_stuff)
    save_stuff.close()
def __loadStuff(filename):
    saved_stuff = open(filename, "rb")
    stuff = pickle.load(saved_stuff)
    saved_stuff.close()
    return stuff
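
The same two helpers can be written with `with` blocks, so the file handle is closed even if pickling raises. This is a minimal sketch using only the standard library; the names `pickle_stuff` and `load_stuff` and the temporary file are illustrative and not part of the original notebook.

```python
import os
import pickle
import tempfile


def pickle_stuff(filename, stuff):
    """Serialize `stuff` to `filename`; the file is closed when the block exits."""
    with open(filename, "wb") as handle:
        pickle.dump(stuff, handle)


def load_stuff(filename):
    """Load and return the object that was pickled into `filename`."""
    with open(filename, "rb") as handle:
        return pickle.load(handle)


# Round trip through a throwaway file to check that the two helpers agree.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.p")
pickle_stuff(demo_path, {"maxLength": 68, "sentiment_tag": ["neg", "pos"]})
restored = load_stuff(demo_path)
print(restored)
```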


## Get the lists of positive and negative files

In [5]:
dataBaseDirPos = "./profitero_data_scientist/chinese_sentiment_analysis_master/data/ChnSentiCorp_htl_ba_6000/pos/"
dataBaseDirNeg = "./profitero_data_scientist/chinese_sentiment_analysis_master/data/ChnSentiCorp_htl_ba_6000/neg/"
positiveFiles = [dataBaseDirPos + f for f in listdir(dataBaseDirPos) if isfile(join(dataBaseDirPos, f))]
negativeFiles = [dataBaseDirNeg + f for f in listdir(dataBaseDirNeg) if isfile(join(dataBaseDirNeg, f))]
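
`listdir` returns entries in arbitrary order, so `positiveFiles[0]` may point at a different review on another machine. If a reproducible order matters, the listing can be sorted. The sketch below uses `pathlib` on a throwaway directory; the function name `list_files` and the file names are hypothetical stand-ins for the real `pos/` folder.

```python
import tempfile
from pathlib import Path


def list_files(directory):
    """Return the sorted paths (as strings) of regular files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).iterdir() if p.is_file())


# Demo on a temporary directory standing in for the pos/ folder.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "pos.1.txt").write_bytes("房间很大".encode("gb2312"))
(demo_dir / "pos.0.txt").write_bytes("服务不错".encode("gb2312"))
(demo_dir / "subdir").mkdir()  # directories are skipped by is_file()
demo_files = list_files(demo_dir)
print([Path(f).name for f in demo_files])
```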

## Show length of samples

In [6]:
print(len(positiveFiles))
print(len(negativeFiles))

## Have a look at what's in a file (one hotel review)

In [7]:
filename = positiveFiles[0]
with codecs.open(filename, "rb") as doc_file:
    text = doc_file.read()
    text = text.decode("GB2312")
    text = text.replace("\n", "")
    text = text.replace("\r", "")
print("==Orginal==:\n\r{}".format(text))
print(text)

==Orginal==:
海景不错，属于在威海最好的吧。服务也还可以，但也没留下特深刻的印象。房间内的设施还不错，只是七成新吧，跟蓝天酒店比还差些。早餐我觉得不比三十八的好多少，可是却要六十八元。房间的隔音很差，我隔壁入住了几个人，开始唱歌，后来打牌，我听得很清楚。
海景不错，属于在威海最好的吧。服务也还可以，但也没留下特深刻的印象。房间内的设施还不错，只是七成新吧，跟蓝天酒店比还差些。早餐我觉得不比三十八的好多少，可是却要六十八元。房间的隔音很差，我隔壁入住了几个人，开始唱歌，后来打牌，我听得很清楚。


## Test removing stop words
Demonstrate what tokenizing a sentence and removing stop words looks like.

In [8]:
filename = positiveFiles[110]
with codecs.open(filename, "rb") as doc_file:
    text = doc_file.read()
    text = text.decode("GB2312")
    text = text.replace("\n", "")
    text = text.replace("\r", "")
print("==Orginal==:\n\r{}".format(text))

stopwords = [line.rstrip() for line in
             codecs.open('./profitero_data_scientist/chinese_sentiment_analysis_master/data/chinese_stop_words.txt',
                         "r", encoding="utf-8")]
seg_list = jieba.cut(text, cut_all=False)
final = []
seg_list = list(seg_list)
for seg in seg_list:
    if seg not in stopwords:
        final.append(seg)
print("==Tokenized==\tToken count:{}\n\r{}".format(len(seg_list), " ".join(seg_list)))
print("==Stop Words Removed==\tToken count:{}\n\r{}".format(len(final), " ".join(final)))


Building prefix dict from the default dictionary ...


Loading model from cache /tmp/jieba.cache


==Orginal==:
很不错的一家五星的酒店,房间很大,设施也很新,重要它的位置在金融中心,去哪里都很方便,以后会考虑再次入住的.


Loading model cost 1.243 seconds.


Prefix dict has been built succesfully.


==Tokenized==	Token count:36
很 不错 的 一家 五星 的 酒店 , 房间 很大 , 设施 也 很 新 , 重要 它 的 位置 在 金融中心 , 去 哪里 都 很 方便 , 以后 会 考虑 再次 入住 的 .
==Stop Words Removed==	Token count:11
不错 一家 五星 酒店 房间 很大 设施 新 位置 金融中心 入住
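
Testing `seg not in stopwords` against a list scans the whole list for every token. Holding the stop words in a `set` makes each lookup constant time on average, which matters once the same filter runs over thousands of reviews. The sketch below reuses the tokens printed above; `demo_stopwords` is a hand-picked, hypothetical subset chosen to reproduce that output, not the contents of the real stop word file.

```python
# Tokens copied from the "==Tokenized==" output above.
tokens = ("很 不错 的 一家 五星 的 酒店 , 房间 很大 , 设施 也 很 新 , 重要 它 的 位置 在 "
          "金融中心 , 去 哪里 都 很 方便 , 以后 会 考虑 再次 入住 的 .").split()

# Hypothetical stop word subset; the real list comes from chinese_stop_words.txt.
demo_stopwords = {"很", "的", ",", "也", "重要", "它", "在", "去", "哪里", "都",
                  "方便", "以后", "会", "考虑", "再次", "."}

# Set membership is an average O(1) hash lookup instead of a list scan.
kept = [token for token in tokens if token not in demo_stopwords]

print(len(tokens), len(kept))
print(" ".join(kept))
```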


## Prepare "documents", a list of tuples
Some files contain bytes that are not valid GB2312, so decoding the whole file at once raises an error. Solution: read each file as bytes, decode it as GB2312 line by line, and skip any line that fails to decode. We also convert any Traditional Chinese characters to Simplified Chinese.

In [10]:
documents = []
for filename in positiveFiles:
    text = ""
    with codecs.open(filename, "rb") as doc_file:
        for line in doc_file:
            try:
                line = line.decode("GB2312")
            except UnicodeDecodeError:  # skip lines that are not valid GB2312
                continue
            text+=Converter('zh-hans').convert(line)# Convert from traditional to simplified Chinese

            text = text.replace("\n", "")
            text = text.replace("\r", "")
    documents.append((text, "pos"))

for filename in negativeFiles:
    text = ""
    with codecs.open(filename, "rb") as doc_file:
        for line in doc_file:
            try:
                line = line.decode("GB2312")
            except UnicodeDecodeError:  # skip lines that are not valid GB2312
                continue
            text+=Converter('zh-hans').convert(line)# Convert from traditional to simplified Chinese

            text = text.replace("\n", "")
            text = text.replace("\r", "")
    documents.append((text, "neg"))
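
The decode-and-skip logic is duplicated in the two loops above, so it could be pulled into one helper. The sketch below shows the idea on in-memory bytes rather than real files, and leaves out the Traditional-to-Simplified conversion; the function name `decode_skipping_bad_lines` and the sample bytes are illustrative assumptions.

```python
def decode_skipping_bad_lines(raw, encoding="gb2312"):
    """Decode `raw` bytes line by line, dropping lines that are not valid in `encoding`.

    Line breaks are discarded, matching the newline stripping in the cell above.
    """
    parts = []
    for line in raw.splitlines():
        try:
            parts.append(line.decode(encoding))
        except UnicodeDecodeError:
            continue  # skip only the offending line, keep the rest of the review
    return "".join(parts)


# Two valid GB2312 lines with an invalid byte sequence between them.
raw_review = "房间很大".encode("gb2312") + b"\n\xff\xfe\n" + "服务不错".encode("gb2312")
clean_text = decode_skipping_bad_lines(raw_review)
print(clean_text)
```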


## Optional step to save/load the documents as pickle file

In [11]:
# Uncomment these two lines to save/load the documents for later use, since the step above takes a while
# __pickleStuff("./data/chinese_sentiment_corpus.p", documents)
# documents = __loadStuff("./data/chinese_sentiment_corpus.p")
print(len(documents))
print(documents[4000])

## Shuffle the data

In [12]:
random.shuffle(documents)

## Prepare the input and output for the model
Each input (hotel review) will be a list of tokens; the output will be a single token ("pos" or "neg"). The stop words are not removed here, since the dataset is relatively small and removing them does not save much training time.

In [14]:
# Tokenize only
totalX = []
totalY = [str(doc[1]) for doc in documents]
for doc in documents:
    seg_list = jieba.cut(doc[0], cut_all=False)
    seg_list = list(seg_list)
    totalX.append(seg_list)

print(totalX[0])

# Switch to the code below to experiment with removing stop words
# Tokenize and remove stop words
# totalX = []
# totalY = [str(doc[1]) for doc in documents]
# stopwords = [ line.rstrip() for line in codecs.open('./data/chinese_stop_words.txt',"r", encoding="utf-8") ]
# for doc in documents:
#     seg_list = jieba.cut(doc[0], cut_all=False)
#     seg_list = list(seg_list)
#     Uncomment below code to experiment with removing stop words
#     final =[]
#     for seg in seg_list:
#         if seg not in stopwords:
#             final.append(seg)
#     totalX.append(final)


['酒店', '一般', '，', '设施', '还', '可以', '，', '就是', '床上', '可能', '有', '什么', '虫子', '，', '早晨', '起来', '感觉', '很痒', '，', '同住', '的', '同事', '也', '有', '同样', '的', '感觉', '，', '今后', '不敢', '再', '住', '了', '。']


## Visualize distribution of sentence length
Decide the maximum input sequence length; here we pick the length that covers about 60% of the reviews. A longer input sequence takes more time to train, but could improve prediction accuracy.

In [15]:
import numpy as np
import scipy.stats as stats
import pylab as pl

h = sorted([len(sentence) for sentence in totalX])
maxLength = h[int(len(h) * 0.60)]
print("Max length is: ", h[len(h) - 1])
print("60% cover length up to: ", maxLength)
h = h[:5000]
fit = stats.norm.pdf(h, np.mean(h), np.std(h))  # normal density fitted to the lengths

pl.plot(h, fit, '-o')
pl.hist(h, density=True)  # normalized histogram; older matplotlib releases spelled this normed=True
pl.show()


Max length is:  1804
60% cover length up to:  68


[Figure: normalized histogram of review lengths (in tokens) with the fitted normal density]
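
The cutoff itself needs neither scipy nor pylab: sort the lengths and read off the element at the 60% position. The sketch below isolates that step on made-up lengths; `coverage_length` and `demo_lengths` are hypothetical names and values, not data from the corpus.

```python
def coverage_length(lengths, fraction):
    """Return the length found at position `fraction` of the sorted list."""
    ordered = sorted(lengths)
    # Clamp the index so that fraction=1.0 still refers to the last element.
    index = min(int(len(ordered) * fraction), len(ordered) - 1)
    return ordered[index]


# Made-up token counts for ten reviews, including one very long outlier.
demo_lengths = [12, 30, 45, 68, 20, 150, 1804, 9, 70, 33]
cutoff = coverage_length(demo_lengths, 0.60)

# Share of reviews that fit into the cutoff without being truncated.
covered = sum(1 for n in demo_lengths if n <= cutoff) / len(demo_lengths)
print(cutoff, covered)
```

A single outlier (1804 tokens in the real data) barely moves this cutoff, which is one reason to pick a quantile rather than the maximum.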

## Words to number tokens, padding
Pad each input sequence to the max input length if it is shorter; with the default settings, `pad_sequences` also truncates sequences that are longer.


Save the input tokenizer, since we need to use the same tokenizer for new prediction data.

In [16]:
totalX = [" ".join(wordslist) for wordslist in
          totalX]  # Keras Tokenizer expects the word tokens to be separated by spaces
input_tokenizer = Tokenizer(30000)  # num_words: cap on the vocabulary used when converting texts
input_tokenizer.fit_on_texts(totalX)
vocab_size = len(input_tokenizer.word_index) + 1
print("input vocab_size:", vocab_size)
totalX = np.array(pad_sequences(input_tokenizer.texts_to_sequences(totalX), maxlen=maxLength))
__pickleStuff("./profitero_data_scientist/chinese_sentiment_analysis_master/data/input_tokenizer_chinese.p",
              input_tokenizer)


input vocab_size: 22123
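
To make the two steps concrete, here is a pure-Python stand-in for what the cell above asks Keras to do: assign each word an integer by frequency (index 0 is reserved for padding), then left-pad or left-truncate every sequence to a fixed length, which is the default behaviour of `pad_sequences`. It is a simplified sketch that ignores the tokenizer's character filtering and `num_words` cap; `build_word_index` and `texts_to_padded` are hypothetical names.

```python
from collections import Counter


def build_word_index(texts):
    """Map each space-separated token to an integer, most frequent first, starting at 1."""
    counts = Counter(token for text in texts for token in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to index lists, then truncate/pad on the left with 0 to `maxlen`."""
    padded = []
    for text in texts:
        seq = [word_index[t] for t in text.split() if t in word_index]
        seq = seq[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


demo_texts = ["酒店 不错 酒店", "房间 不错"]
demo_index = build_word_index(demo_texts)
print(demo_index)
print(texts_to_padded(demo_texts, demo_index, maxlen=4))
print(texts_to_padded(demo_texts, demo_index, maxlen=2))
```

Words that are missing from the index are silently dropped, which is why new reviews must be encoded with the tokenizer saved during training.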


## Output, array of 0s and 1s

In [17]:
target_tokenizer = Tokenizer(3)
target_tokenizer.fit_on_texts(totalY)
print("output vocab_size:",len(target_tokenizer.word_index) + 1)
totalY = np.array(target_tokenizer.texts_to_sequences(totalY)) -1
totalY = totalY.reshape(totalY.shape[0])

output vocab_size: 3


In [18]:
totalY[40:50]

array([1, 0, 0, 1, 1, 1, 0, 1, 1, 0])

## Turn output 0s and 1s into categories (one-hot vectors)

In [18]:
totalY = to_categorical(totalY, num_classes=2)

In [20]:
totalY[40:50]

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.]])
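
One-hot encoding simply puts a 1 in the column named by the label and 0 everywhere else. The pure-Python sketch below mirrors the conversion for a few labels; `one_hot` is a hypothetical helper, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Return one row per label with 1.0 in the label's column and 0.0 elsewhere."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


demo_labels = [1, 0, 0, 1]  # the first four labels printed above
encoded = one_hot(demo_labels, num_classes=2)
print(encoded)
```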

In [21]:
output_dimen = totalY.shape[1] # which is 2

## Save meta data for later prediction
maxLength: the input sequence length

vocab_size: Input vocab size

output_dimen: which is 2 in this example (pos or neg)

sentiment_tag: either ["neg","pos"] or ["pos","neg"] matching the target tokenizer

In [22]:
target_reverse_word_index = {v: k for k, v in list(target_tokenizer.word_index.items())}
sentiment_tag = [target_reverse_word_index[1], target_reverse_word_index[2]]
metaData = {"maxLength": maxLength, "vocab_size": vocab_size, "output_dimen": output_dimen,
            "sentiment_tag": sentiment_tag}
__pickleStuff("./profitero_data_scientist/chinese_sentiment_analysis_master/data/meta_sentiment_chinese.p", metaData)
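
The order of `sentiment_tag` matters because the model only ever sees class indices. The sketch below walks through the mapping with an assumed `word_index`; whether "pos" or "neg" receives index 1 depends on the data, so the dictionary here is a hypothetical example.

```python
# Assumed tokenizer result; the real assignment depends on label frequencies.
demo_word_index = {"pos": 1, "neg": 2}

# Tokenizer indices start at 1, and the notebook subtracts 1 to get class ids 0 and 1.
class_id = {word: index - 1 for word, index in demo_word_index.items()}

# Invert the index so that position i of the tag list names class i.
reverse_index = {index: word for word, index in demo_word_index.items()}
demo_sentiment_tag = [reverse_index[1], reverse_index[2]]

for word in demo_word_index:
    # Looking a class id up in the tag list must give back the original label.
    print(word, class_id[word], demo_sentiment_tag[class_id[word]])
```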


## Build the Model, train and save it
Training metrics are logged to TensorBoard. To view them, cd into the directory

"./Graph/sentiment_chinese" and run


"python -m tensorflow.tensorboard --logdir=."

(newer TensorFlow releases provide a standalone "tensorboard --logdir=." command instead).

In [23]:
embedding_dim = 256

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim,input_length = maxLength))
# Each input has shape (maxLength x 256); these 256-dimensional vectors are fed into the GRU layer one time step at a time.
# All the intermediate outputs are collected and then passed on to the second GRU layer.
model.add(GRU(256, dropout=0.9, return_sequences=True))
# The second GRU layer reads those intermediate outputs and returns only its final output
model.add(GRU(256, dropout=0.9))
# The output is then sent to a fully connected layer that gives us our final output_dimen classes
model.add(Dense(output_dimen, activation='softmax'))
# Log training progress for TensorBoard
tbCallBack = TensorBoard(log_dir='./Graph/sentiment_chinese', histogram_freq=0,
                            write_graph=True, write_images=True)
# We use the adam optimizer instead of standard SGD since it usually converges faster
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(totalX, totalY, validation_split=0.1, batch_size=32, epochs=1, verbose=1, callbacks=[tbCallBack])
model.save('./profitero_data_scientist/chinese_sentiment_analysis_master/data/sentiment_chinese_model.hdf5')

print("Saved model!")

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 68, 256)           5663488   
_________________________________________________________________
gru_1 (GRU)                  (None, 68, 256)           393984    
_________________________________________________________________
gru_2 (GRU)                  (None, 256)               393984    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514       
Total params: 6,451,970
Trainable params: 6,451,970
Non-trainable params: 0
_________________________________________________________________
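
The parameter counts in the summary can be checked by hand. The sketch below assumes the classic GRU formulation with three weight blocks (update gate, reset gate, candidate state), each holding an input kernel, a recurrent kernel and one bias vector; that assumption is consistent with the 393,984 figure shown above. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def gru_params(input_dim, units):
    """Three blocks, each with an input kernel, a recurrent kernel and one bias vector."""
    return 3 * (input_dim * units + units * units + units)


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output unit."""
    return input_dim * output_dim + output_dim


embedding = embedding_params(22123, 256)  # vocab_size printed earlier
gru_1 = gru_params(256, 256)              # reads the 256-dimensional embeddings
gru_2 = gru_params(256, 256)              # reads the 256-dimensional output of gru_1
dense = dense_params(256, 2)
total = embedding + gru_1 + gru_2 + dense
print(embedding, gru_1, gru_2, dense, total)
```

Almost 88% of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.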


Train on 5324 samples, validate on 592 samples
Epoch 1/1
  32/5324 [..............................] - ETA: 3:02 - loss: 0.7065 - acc: 0.4062
  64/5324 [..............................] - ETA: 2:31 - loss: 0.7052 - acc: 0.4375
  96/5324 [..............................] - ETA: 2:17 - loss: 0.7058 - acc: 0.3958
...
1152/5324 [=====>........................] - ETA: 1:32 - loss: 0.7033 - acc: 0.4931
1184/5324 [=====>........................] - ETA: 1:31 - loss: 0.7037 - acc: 0.4924
1216/5324 [=====>........................] - ETA: 1:31 - loss: 0.7035 - acc: 0.4926
...

Saved model!
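
The 5324/592 split reported in the training log follows from `validation_split=0.1`: Keras takes the validation samples from the end of the arrays, before any per-epoch shuffling, which is why the documents were shuffled earlier. The sketch below reproduces the arithmetic on placeholder data; `tail_split` is a hypothetical helper, and the fixed seed is an addition for reproducibility that the notebook does not use.

```python
import random


def tail_split(samples, validation_split):
    """Split off the last `validation_split` fraction of `samples` for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


# Placeholder ids standing in for the 5916 reviews (5324 + 592 in the log).
demo_samples = list(range(5916))
random.Random(0).shuffle(demo_samples)  # seeded so the split is repeatable

train_part, val_part = tail_split(demo_samples, 0.1)
print(len(train_part), len(val_part))
```

Without the shuffle, the tail of the list would contain only "neg" reviews, because the negative files were appended last.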


### Prediction code
Function to load the meta data and the model we just trained.

In [25]:
model = None
sentiment_tag = None
maxLength = None
def loadModel():
    global model, sentiment_tag, maxLength
    metaData = __loadStuff("./profitero_data_scientist/chinese_sentiment_analysis_master/data/meta_sentiment_chinese.p")
    maxLength = metaData.get("maxLength")
    vocab_size = metaData.get("vocab_size")
    output_dimen = metaData.get("output_dimen")
    sentiment_tag = metaData.get("sentiment_tag")
    embedding_dim = 256
    if model is None:
        model = Sequential()
        model.add(Embedding(vocab_size, embedding_dim, input_length=maxLength))
        # Each input would have a size of (maxLength x 256) and each of these 256 sized vectors are fed into the GRU layer one at a time.
        # All the intermediate outputs are collected and then passed on to the second GRU layer.
        model.add(GRU(256, dropout=0.9, return_sequences=True))
        # Using the intermediate outputs, we pass them to another GRU layer and collect the final output only this time
        model.add(GRU(256, dropout=0.9))
        # The output is then sent to a fully connected layer that would give us our final output_dim classes
        model.add(Dense(output_dimen, activation='softmax'))
        # We use the adam optimizer instead of standard SGD since it converges much faster
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.load_weights('./profitero_data_scientist/chinese_sentiment_analysis_master/data/sentiment_chinese_model.hdf5')
        model.summary()
    print("Model weights loaded!")

## Functions to convert sentence to model input, and predict result

In [26]:
def findFeatures(text):
    text=Converter('zh-hans').convert(text)
    text = text.replace("\n", "")
    text = text.replace("\r", "") 
    seg_list = jieba.cut(text, cut_all=False)
    seg_list = list(seg_list)
    text = " ".join(seg_list)
    textArray = [text]
    input_tokenizer_load = __loadStuff("./profitero_data_scientist/chinese_sentiment_analysis_master/data/input_tokenizer_chinese.p")
    textArray = np.array(pad_sequences(input_tokenizer_load.texts_to_sequences(textArray), maxlen=maxLength))
    return textArray
def predictResult(text):
    if model is None:
        print("Please run \"loadModel\" first.")
        return None
    features = findFeatures(text)
    predicted = model.predict(features)[0] # we have only one sentence to predict, so take index 0
    predicted = np.array(predicted)
    probab = predicted.max()
    prediction = sentiment_tag[predicted.argmax()]
    return prediction, probab
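
The last three lines of `predictResult` turn a softmax output into a label. The sketch below isolates that step without Keras or numpy; the probabilities and the tag order are made-up illustrative values, and `decode_prediction` is a hypothetical name.

```python
def decode_prediction(probabilities, tags):
    """Return the tag with the highest probability together with that probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return tags[best], probabilities[best]


# Made-up softmax output for one review: class 0 = "neg", class 1 = "pos".
demo_result = decode_prediction([0.36, 0.64], ["neg", "pos"])
print(demo_result)
```

With two classes the winning probability is always at least 0.5, so values close to 0.5 mean the model is barely preferring one label over the other.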

## Calling the load model function

In [27]:
loadModel()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 68, 256)           5663488   
_________________________________________________________________
gru_3 (GRU)                  (None, 68, 256)           393984    
_________________________________________________________________
gru_4 (GRU)                  (None, 256)               393984    
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
Total params: 6,451,970
Trainable params: 6,451,970
Non-trainable params: 0
_________________________________________________________________
Model weights loaded!


## Try some new comments, feel free to try your own
The result tuple consists of the predicted label and the probability the model assigns to it.

In [28]:
predictResult("还好，床很大而且很干净，前台很友好，很满意，下次还来。")

('pos', 0.63859636)

In [29]:
predictResult("床上有污渍，房间太挤不透气，空调不怎么好用。")

('neg', 0.50563675)

In [30]:
predictResult("房间有点小但是设备还齐全，没有异味。")

('pos', 0.5149448)

In [31]:
predictResult("房间还算干净，一般般吧，短住还凑合。")

('pos', 0.76030725)

('pos', 0.66190755)

In [32]:
predictResult("开始不太满意，前台好说话换了一间，房间很干净没有异味。")

('pos', 0.55261517)