# Sentiment analysis using CNN, LSTM, CNN+LSTM  
Nivit Nantanivattikul - 5833638023  
Tatchanon Kummalue - 5833630023

### Motivation
<p>&emsp;&emsp;Today's business world is largely driven by data, the so-called data-driven business. A company that has more knowledge or more data in hand has an advantage in developing its business. But if a company only holds data and cannot extract anything useful from it, that data is worthless. This is why many kinds of models are now trained on data to obtain insights. One way to measure the outcome of a marketing campaign is to measure the response on social media. Having humans label that data by hand wastes resources, which is why sentiment analysis is currently a popular topic.</p>
<p>&emsp;&emsp;We therefore want to use the model developed on this dataset to gauge feedback on marketing campaigns by looking at the response on social media, where a large volume of feed data flows continuously. This lets a company measure whether the ROI of a launched campaign is appropriate and whether it should keep investing in that campaign.</p>

### Objective
- To learn how to do feature extraction from text files, for example applying stemming and negation handling first.
- To understand and learn how to use Word2Vec and Doc2Vec for classification, predicting whether a given text is emotionally positive or negative.
- To apply what we learn to predict the sentiment of other texts.
- To compare performance with other common methods, such as a tf-idf classifier, logistic regression, or random forest.
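
As an illustration of the negation handling mentioned in the first objective, here is a minimal sketch. It is not the preprocessing used in the cells below (those rely on the pre-tokenized Keras IMDB data); the cue list, the `NOT_` prefix, and the function name `mark_negation` are assumptions made for this example.

```python
import re

# Hypothetical negation cues and clause-ending punctuation for this sketch.
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"^[.,;:!?]$")

def mark_negation(tokens):
    """Prefix tokens that follow a negation cue with 'NOT_' until punctuation."""
    marked, negating = [], False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            # Punctuation closes the negated scope and is kept unchanged.
            negating = False
            marked.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            # The cue itself is kept unchanged and opens a negated scope.
            negating = True
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negating else tok)
    return marked

tokens = "i did not like this movie , but the cast was good".split()
print(mark_negation(tokens))
```

With this kind of tagging, "like" and "NOT_like" become distinct features, which can help bag-of-words style classifiers such as the tf-idf baseline named above.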


### Characteristics of the chosen data
<p>&emsp;&emsp;Large Movie Review Dataset v1.0 consists of movie reviews from IMDb and is well suited to sentiment classification. It contains 50000 reviews, split into 70 % train and 30 % test. The data is evenly balanced, with 50 % positive and 50 % negative reviews: a positive review is one with a rating score >=7, and a negative review is one with a rating score <=4. Note that the cells below load the data with `imdb.load_data`, which returns 25000 training and 25000 test reviews. The data is available from http://ai.stanford.edu/~amaas/data/sentiment/ </p>
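
The labeling rule described above can be sketched as a small function. The function name `label_from_rating` and the string labels are assumptions for illustration; only the thresholds come from the dataset description.

```python
def label_from_rating(rating):
    """Map a 1-10 star rating to a sentiment label, or None for neutral ratings."""
    if rating >= 7:
        return "pos"
    if rating <= 4:
        return "neg"
    return None  # ratings 5 and 6 fall outside both thresholds

for rating in (2, 5, 8):
    print(rating, label_from_rating(rating))
```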
    
### Files included
- Train set and test set, each containing positive and negative reviews. The reviews are stored as text files named [[id]_[rating].txt], where [id] is a unique id and [rating] is the star rating on a 1-10 scale. For example, [test/pos/200_8.txt] is the review with unique id 200 and star rating 8/10 from IMDb.
- Tokenized bag-of-words features are already provided with the dataset for convenience. They are stored in .feat files in LIBSVM format in each train/test directory, as sparse vectors of word counts. For example, 0:7 in a .feat file means that the first word in [imdb.vocab] appears 7 times in that review.
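
The two file conventions above can be parsed with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the example `.feat` line is made up, and the leading value on each `.feat` line is assumed to be the review's star rating.

```python
import re

# Pattern for review file names such as "200_8.txt": <id>_<rating>.txt
FILENAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")

def parse_review_filename(name):
    """Return (review_id, rating) parsed from a file name like '200_8.txt'."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError("unexpected file name: %r" % name)
    return int(match.group(1)), int(match.group(2))

def parse_feat_line(line):
    """Parse '<label> <index>:<count> ...' into (label, {index: count})."""
    parts = line.split()
    counts = {}
    for pair in parts[1:]:
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    return int(parts[0]), counts

print(parse_review_filename("200_8.txt"))  # (200, 8)
print(parse_feat_line("8 0:7 3:1 10:2"))   # (8, {0: 7, 3: 1, 10: 2})
```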

![Example of Dataset](https://www.img.in.th/images/55bf9954485c17641335b18e6875c18c.jpg)

### Methods used
- Data >> preprocess >> Embedding Layer >> LSTM
- Data >> preprocess >> Embedding Layer >> CNN
- Data >> preprocess >> Embedding Layer >> CNN+LSTM

In [1]:
from __future__ import print_function
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from keras.wrappers.scikit_learn import KerasClassifier
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.layers import Dense, Embedding, Dropout, Activation, Flatten, MaxPooling1D
from keras.layers import LSTM
from keras.callbacks import TensorBoard, EarlyStopping
from keras.datasets import imdb
from IPython.display import SVG
from IPython.display import Image

Using TensorFlow backend.


### Test Data

In [2]:
NUM_WORDS=1000 
INDEX_FROM=3   

train,test = imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
x_train,y_train = train
x_test,y_test = test

word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNKNOWN>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
# Decode the first training review; `i` avoids shadowing the built-in `id`.
print(' '.join(id_to_word[i] for i in x_train[0]))

<START> this film was just brilliant casting <UNKNOWN> <UNKNOWN> story direction <UNKNOWN> really <UNKNOWN> the part they played and you could just imagine being there robert <UNKNOWN> is an amazing actor and now the same being director <UNKNOWN> father came from the same <UNKNOWN> <UNKNOWN> as myself so i loved the fact there was a real <UNKNOWN> with this film the <UNKNOWN> <UNKNOWN> throughout the film were great it was just brilliant so much that i <UNKNOWN> the film as soon as it was released for <UNKNOWN> and would recommend it to everyone to watch and the <UNKNOWN> <UNKNOWN> was amazing really <UNKNOWN> at the end it was so sad and you know what they say if you <UNKNOWN> at a film it must have been good and this definitely was also <UNKNOWN> to the two little <UNKNOWN> that played the <UNKNOWN> of <UNKNOWN> and paul they were just brilliant children are often left out of the <UNKNOWN> <UNKNOWN> i think because the stars that play them all <UNKNOWN> up are such a big <UNKNOWN> fo

In [3]:
[x_train[0][k] for k in range(10)]

[1, 14, 22, 16, 43, 530, 973, 2, 2, 65]
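
The decoding above can also be run in the other direction, which is what new text would need before being fed to the models. The sketch below is illustrative only: it uses a tiny excerpt of the shifted vocabulary read off the decoded review and its first indices, and the `encode` helper is a hypothetical name, not part of Keras.

```python
PAD, START, UNKNOWN = 0, 1, 2

# Excerpt of the shifted vocabulary, read off the decoded review above.
word_to_id_excerpt = {
    "this": 14, "film": 22, "was": 16,
    "just": 43, "brilliant": 530, "casting": 973,
}

def encode(text, vocab, num_words=1000):
    """Turn whitespace-separated text into an index sequence with a <START> marker."""
    ids = [START]
    for word in text.lower().split():
        idx = vocab.get(word, UNKNOWN)
        # Words outside the num_words most frequent ones map to <UNKNOWN>.
        ids.append(idx if idx < num_words else UNKNOWN)
    return ids

print(encode("This film was just brilliant casting location", word_to_id_excerpt))
# [1, 14, 22, 16, 43, 530, 973, 2]
```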

In [4]:
word_to_id

{'whiles': 53540,
 "'western": 57510,
 'mechanic': 9025,
 'aggresive': 63355,
 'pwnz': 69868,
 'mikal': 77342,
 'choca': 54835,
 'rummaged': 65625,
 'schecky': 74020,
 'sanitizes': 57422,
 'lybia': 56893,
 'introduced': 1725,
 'domineering': 15122,
 'mechenosets': 50178,
 "captain's": 26444,
 'kundera': 28727,
 "'pretty": 43475,
 'sjöberg': 73829,
 'reinterpretations': 35249,
 'employed': 5652,
 'decried': 63225,
 'wmd': 53687,
 "'on": 30019,
 'courtrooms': 32975,
 'intricacy': 27931,
 'lieutenant': 9280,
 'illogical': 4330,
 'celozzi': 46897,
 'melvyn': 6597,
 "dine's": 63598,
 "india'": 66410,
 "macarhur's": 77457,
 "lehar's": 53496,
 'fredrich': 71797,
 'notice': 1495,
 'calomari': 85977,
 'slowely': 71231,
 'chocked': 34094,
 'sidearm': 62458,
 'ustase': 52390,
 'transcendence': 34367,
 'horrormovies': 60055,
 'cateress': 67340,
 "carlin's": 44275,
 'suffocated': 40044,
 'jumbo': 10059,
 'reinterpretation': 79007,
 "gorris'": 48383,
 'ism': 19709,
 'blister': 69879,
 'kana': 70379,

In [5]:
id_to_word

{0: '<PAD>',
 1: '<START>',
 2: '<UNKNOWN>',
 4: 'the',
 5: 'and',
 6: 'a',
 7: 'of',
 8: 'to',
 9: 'is',
 10: 'br',
 11: 'in',
 12: 'it',
 13: 'i',
 14: 'this',
 15: 'that',
 16: 'was',
 17: 'as',
 18: 'for',
 19: 'with',
 20: 'movie',
 21: 'but',
 22: 'film',
 23: 'on',
 24: 'not',
 25: 'you',
 26: 'are',
 27: 'his',
 28: 'have',
 29: 'he',
 30: 'be',
 31: 'one',
 32: 'all',
 33: 'at',
 34: 'by',
 35: 'an',
 36: 'they',
 37: 'who',
 38: 'so',
 39: 'from',
 40: 'like',
 41: 'her',
 42: 'or',
 43: 'just',
 44: 'about',
 45: "it's",
 46: 'out',
 47: 'has',
 48: 'if',
 49: 'some',
 50: 'there',
 51: 'what',
 52: 'good',
 53: 'more',
 54: 'when',
 55: 'very',
 56: 'up',
 57: 'no',
 58: 'time',
 59: 'she',
 60: 'even',
 61: 'my',
 62: 'would',
 63: 'which',
 64: 'only',
 65: 'story',
 66: 'really',
 67: 'see',
 68: 'their',
 69: 'had',
 70: 'can',
 71: 'were',
 72: 'me',
 73: 'well',
 74: 'than',
 75: 'we',
 76: 'much',
 77: 'been',
 78: 'bad',
 79: 'get',
 80: 'will',
 81: 'do',
 82: 'als

In [6]:
print("Training data: ")
print("x_train : ",len(x_train)," x_test : ",len(x_test))
print("y_train : ",len(y_train)," y_test : ",len(y_test))

Training data: 
x_train :  25000  x_test :  25000
y_train :  25000  y_test :  25000


In [7]:
X = np.concatenate((x_train, x_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

In [8]:
print("Number of unique words: ")
print(len(np.unique(np.hstack(X))))

Number of unique words: 
998
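
The count of 998 can be explained with a little arithmetic, assuming every index below the cutoff occurs at least once in the corpus: with `NUM_WORDS=1000` and `INDEX_FROM=3`, real words occupy indices 4 to 999, index 1 marks the start of a review, index 2 stands for out-of-vocabulary words, and indices 0 and 3 never appear in the unpadded data.

```python
NUM_WORDS = 1000
INDEX_FROM = 3
START, UNKNOWN = 1, 2

# Word indices kept by the loader: shifted by INDEX_FROM and below NUM_WORDS.
word_ids = set(range(INDEX_FROM + 1, NUM_WORDS))  # 4 .. 999
all_ids = word_ids | {START, UNKNOWN}

print(len(word_ids))  # 996
print(len(all_ids))   # 998
```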


# Long short-term memory (LSTM)

## Import Library

In [2]:
# __future__ imports must be the first statement in the cell.
from __future__ import print_function
import numpy as np
import pickle



from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

## Set Parameter

In [5]:
max_features = 20000
maxlen = 80 
batch_size = 32

## Import Data

In [6]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

## Pad Data

In [7]:
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 80)
x_test shape: (25000, 80)
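
To make the padding step concrete, here is a NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): shorter reviews are left-padded with zeros and longer ones keep only their last `maxlen` tokens. The helper `pad_pre` is a hypothetical re-implementation for illustration, not the Keras function itself.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]           # keep the last maxlen tokens
        out[row, -len(trunc):] = trunc  # write them at the right edge
    return out

print(pad_pre([[1, 14, 22], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[ 0  1 14 22]
#  [ 3  4  5  6]]
```

One consequence of this default is that with `maxlen = 80` the LSTM only sees the final 80 tokens of each review.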


## Build LSTM Model And Run

In [5]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.0178739750412107
Test accuracy: 0.81036


## Model Summary

In [19]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
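
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the helper names are chosen for this example.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

emb = embedding_params(20000, 128)
lstm = lstm_params(128, 128)
dense = dense_params(128, 1)
print(emb, lstm, dense, emb + lstm + dense)
# 2560000 131584 129 2691713
```

Almost all of the parameters sit in the embedding table, so changing the number of LSTM units has a comparatively small effect on model size.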


## Tuning Hyperparameters

### Tune units in LSTM

- 64 units

In [6]:
# units: Positive integer, dimensionality of the output space.

print('Build model2 with LSTM 64 units...')
model2 = Sequential()
model2.add(Embedding(max_features, 64))
model2.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model2.add(Dense(1, activation='sigmoid'))

model2.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model2.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model2.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model2 with LSTM 64 units...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 0.9270928121328353
Test accuracy: 0.80888


---
- 256 units

In [7]:
print('Build model3 with LSTM 256 units...')
model3 = Sequential()
model3.add(Embedding(max_features, 256))
model3.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2))
model3.add(Dense(1, activation='sigmoid'))

model3.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model3.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model3.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model3 with LSTM 256 units...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.2637602571652644
Test accuracy: 0.80772


---
### Tune batch size

- batch size = 64

In [8]:
batch_size = 64
print('Build model4 with batch_size = 64...')
model4 = Sequential()
model4.add(Embedding(max_features, 128))
model4.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model4.add(Dense(1, activation='sigmoid'))

model4.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model4.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model4.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model4 with batch_size = 64...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 0.9715033755111694
Test accuracy: 0.8099999999809265


---
- batch size = 256

In [9]:
batch_size = 256
print('Build model5 with batch_size = 256...')
model5 = Sequential()
model5.add(Embedding(max_features, 128))
model5.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model5.add(Dense(1, activation='sigmoid'))

model5.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model5.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model5.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model5 with batch_size = 256...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 0.9567267325592042
Test accuracy: 0.8042400000190735


---
### Tune optimizer
- use Stochastic gradient descent optimizer

In [10]:
batch_size = 32
print('Build model6 with optimizer = SGD...')
model6 = Sequential()
model6.add(Embedding(max_features, 128))
model6.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model6.add(Dense(1, activation='sigmoid'))

model6.compile(loss='binary_crossentropy',
              optimizer='SGD',
              metrics=['accuracy'])

print('Train...')
model6.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model6.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model6 with optimizer = SGD...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 0.6854370599746704
Test accuracy: 0.59404


---
- use RMSProp optimizer

In [16]:
print('Build model7 with optimizer = RMSprop...')
model7 = Sequential()
model7.add(Embedding(max_features, 128))
model7.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model7.add(Dense(1, activation='sigmoid'))

model7.compile(loss='binary_crossentropy',
              optimizer='RMSprop',
              metrics=['accuracy'])

print('Train...')
model7.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model7.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model7 with optimizer = RMSprop...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 0.5830579057455063
Test accuracy: 0.82688


---
## Tune Layer

- optimizer = RMSprop + add hidden layer in LSTM + batch size = 128 + epochs = 30

In [15]:
batch_size = 128
print('Build model8 with optimizer = RMSprop + add hidden layer in lstm + batch size = 128')
model8 = Sequential()
model8.add(Embedding(max_features, 128))
model8.add(LSTM(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model8.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model8.add(Dense(1, activation='sigmoid'))

model8.compile(loss='binary_crossentropy',
              optimizer='RMSprop',
              metrics=['accuracy'])

print('Train...')
model8.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=30,
          validation_data=(x_test, y_test))
score, acc = model8.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model8 with optimizer = RMSprop + add hidden layer in lstm + batch size = 128
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Test score: 1.3992165909576415
Test accuracy: 0.8035999999809265


In [16]:
model8.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm_9 (LSTM)                (None, None, 128)         131584    
_________________________________________________________________
lstm_10 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129       
Total params: 2,823,297
Trainable params: 2,823,297
Non-trainable params: 0
_________________________________________________________________


In [14]:
batch_size = 32  # reset so the run matches the batch size named in the message below
print('Build model9 with optimizer = Adam + add hidden layer in lstm + batch size = 32...')
model9 = Sequential()
model9.add(Embedding(max_features, 128))
model9.add(LSTM(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model9.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model9.add(Dense(1, activation='sigmoid'))

model9.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model9.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model9.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model9 with optimizer = Adam + add hidden layer in lstm + batch size = 32...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.0762920999908447
Test accuracy: 0.8054000000381469


In [17]:
model9.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm_7 (LSTM)                (None, None, 128)         131584    
_________________________________________________________________
lstm_8 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 2,823,297
Trainable params: 2,823,297
Non-trainable params: 0
_________________________________________________________________


---
## Conclusion LSTM


- The best LSTM model appears to be model 7, which has the highest test accuracy of all the models (0.82688). Judging from the loss in the last epoch, adding more epochs could probably increase its accuracy further.
- For the LSTM models with an added hidden layer, the extra layer may cause overfitting to the training data, so the test accuracy is not as high as expected. Alternatively, the parameters may not have been tuned well enough.

|                               Model                              | Test Accuracy | Test Score (loss)  |
|:----------------------------------------------------------------:|---------------|--------------------|
|                            Base model                            |       0.81036 | 1.0178739750412107 |
|                           LSTM 64 units                          |       0.80888 | 0.9270928121328353 |
|                          LSTM 256 units                          |       0.80772 | 1.2637602571652644 |
|                          batch size = 64                         |       0.81000 | 0.9715033755111694 |
|                         batch size = 256                         |       0.80424 | 0.9567267325592042 |
|                           SGD optimizer                          |       0.59404 | 0.6854370599746704 |
|                         RMSProp optimizer                        |       0.82688 | 0.5830579057455063 |
| 2 LSTM layers + RMSprop + batch size = 128 + epochs = 30         |       0.80360 | 1.3992165909576415 |
| 2 LSTM layers + Adam + batch size = 32 + epochs = 15             |       0.80540 | 1.0762920999908447 |

#### Base model parameters
1. batch size = 32
2. epochs = 15
3. one LSTM layer with 128 units
4. Adam optimizer

## CNN

In [9]:
# set parameters:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 8

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()

model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))

model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))

model.add(GlobalMaxPooling1D())

model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'],
             )
model.summary()
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
loss, accuracy = model.evaluate(x_train, y_train, verbose=0)
print("Training: accuracy = %f  ;  loss = %f" % (accuracy, loss))
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("Validation: accuracy1 = %f  ;  loss1 = %f" % (accuracy, loss))


Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 400)
x_test shape: (25000, 400)
Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 50)           250000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)        

## CNN+LSTM

In [11]:
# Embedding
max_features = 20000
maxlen = 100
embedding_size = 128

# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# LSTM
lstm_output_size = 70

# Training
batch_size = 30
epochs = 4


print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Dropout(0.25))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
loss, accuracy = model.evaluate(x_train, y_train, verbose=0)
print("Training: accuracy = %f  ;  loss = %f" % (accuracy, loss))
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print("Validation: accuracy1 = %f  ;  loss1 = %f" % (accuracy, loss))

Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 100)
x_test shape: (25000, 100)
Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
dropout_4 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 96, 64)            41024     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 24, 64)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 70)                37800     
_________________________________________________________________
dense_4 (Dense)              (None, 1)          

# CNN, CNN+LSTM Model structure

### CNN  
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 50)           250000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 250)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Epoch 1/8
25000/25000 [==============================] - 87s 3ms/step - loss: 0.4027 - acc: 0.8010 - val_loss: 0.3173 - val_acc: 0.8617
Epoch 2/8
25000/25000 [==============================] - 86s 3ms/step - loss: 0.2280 - acc: 0.9099 - val_loss: 0.2778 - val_acc: 0.8835
Epoch 3/8
25000/25000 [==============================] - 86s 3ms/step - loss: 0.1627 - acc: 0.9374 - val_loss: 0.2691 - val_acc: 0.8910
Epoch 4/8
25000/25000 [==============================] - 86s 3ms/step - loss: 0.1114 - acc: 0.9590 - val_loss: 0.3591 - val_acc: 0.8806
Epoch 5/8
25000/25000 [==============================] - 86s 3ms/step - loss: 0.0791 - acc: 0.9719 - val_loss: 0.3894 - val_acc: 0.8819
Epoch 6/8
25000/25000 [==============================] - 86s 3ms/step - loss: 0.0575 - acc: 0.9790 - val_loss: 0.4305 - val_acc: 0.8744
Epoch 7/8
25000/25000 [==============================] - 85s 3ms/step - loss: 0.0455 - acc: 0.9835 - val_loss: 0.4175 - val_acc: 0.8795
Epoch 8/8
25000/25000 [==============================] - 86s 3ms/step - loss: 0.0398 - acc: 0.9854 - val_loss: 0.4689 - val_acc: 0.8836
```  
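
The output shapes and parameter counts in the CNN summary follow from the layer settings (`maxlen = 400`, `embedding_dims = 50`, `filters = 250`, `kernel_size = 3`, `hidden_dims = 250`). A short sketch of the arithmetic, with helper names chosen for this example:

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    # 'valid' padding: the kernel must fit entirely inside the sequence.
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of size kernel_size x in_channels per filter, plus a bias each.
    return kernel_size * in_channels * filters + filters

def dense_params(input_dim, units):
    return input_dim * units + units

print(5000 * 50)                     # embedding:      250000
print(conv1d_output_length(400, 3))  # conv time steps: 398
print(conv1d_params(50, 250, 3))     # conv1d:          37750
print(dense_params(250, 250))        # hidden dense:    62750
print(dense_params(250, 1))          # output dense:    251
```

`GlobalMaxPooling1D` then reduces the 398 time steps to a single maximum per filter, which is why the next layer sees a vector of length 250.
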
### CNN LSTM  
```
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
dropout_4 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 96, 64)            41024     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 24, 64)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 70)                37800     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 71        
_________________________________________________________________
activation_4 (Activation)    (None, 1)                 0         
=================================================================
Epoch 1/4
25000/25000 [==============================] - 62s 2ms/step - loss: 0.3866 - acc: 0.8199 - val_loss: 0.3418 - val_acc: 0.8480
Epoch 2/4
25000/25000 [==============================] - 61s 2ms/step - loss: 0.1982 - acc: 0.9250 - val_loss: 0.3437 - val_acc: 0.8575
Epoch 3/4
25000/25000 [==============================] - 60s 2ms/step - loss: 0.0948 - acc: 0.9672 - val_loss: 0.4149 - val_acc: 0.8420
Epoch 4/4
25000/25000 [==============================] - 61s 2ms/step - loss: 0.0441 - acc: 0.9858 - val_loss: 0.5634 - val_acc: 0.8386
```
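
In both logs the validation loss bottoms out early while the training loss keeps falling, which is the usual sign of overfitting. The sketch below replays the logged `val_loss` values through a simple patience-based stopping rule, in the spirit of the Keras `EarlyStopping` callback that is imported at the top of the notebook but not used. The function name and the patience of 2 are assumptions for illustration.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a simple early-stopping rule."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop: no improvement for `patience` epochs
    return len(val_losses), best_epoch

# val_loss values copied from the training logs above.
cnn_val_loss = [0.3173, 0.2778, 0.2691, 0.3591, 0.3894, 0.4305, 0.4175, 0.4689]
cnn_lstm_val_loss = [0.3418, 0.3437, 0.4149, 0.5634]

print(best_epoch_with_patience(cnn_val_loss))       # (5, 3)
print(best_epoch_with_patience(cnn_lstm_val_loss))  # (3, 1)
```

Under this rule the CNN would have stopped after epoch 5 with its best validation loss at epoch 3, and the CNN+LSTM after epoch 3 with its best at epoch 1.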

### Summary


#### LSTM
- Training: accuracy = 0.9789  ;  loss = 0.0610  
- Validation: accuracy1 = 0.82688  ;  loss1 = 0.5830579057455063

#### CNN  
- Training: accuracy = 0.998800  ;  loss = 0.006388  
- Validation: accuracy1 = 0.883600  ;  loss1 = 0.468917 

#### CNN+LSTM  
- Training: accuracy = 0.996320  ;  loss = 0.014933  
- Validation: accuracy1 = 0.838640  ;  loss1 = 0.563363
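
As a rough way to compare the three models, the gap between training and validation accuracy can be computed from the numbers above. This is only a sketch: the dictionary layout is an assumption, and the three models were trained with different settings, so the comparison is indicative rather than controlled.

```python
# (training accuracy, validation accuracy) copied from the summary above.
results = {
    "LSTM": (0.9789, 0.82688),
    "CNN": (0.998800, 0.883600),
    "CNN+LSTM": (0.996320, 0.838640),
}

gaps = {name: train_acc - val_acc for name, (train_acc, val_acc) in results.items()}
for name in results:
    print("%-9s val_acc = %.4f  gap = %.4f" % (name, results[name][1], gaps[name]))

best = max(results, key=lambda name: results[name][1])
print("Highest validation accuracy:", best)
```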
