# **Movie Review : Part II (Preprocessing and Modeling)**

---



## **Download Project**

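We will download the TensorFlow-Tutorials project from GitHub and use the imdb.py file it contains.

imdb.py downloads the movie-review data, which serves as both the training data and the test data.

Movie-review data: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

1. Use the ! git clone command to download the project from the URL.

2. Use the %cd command to change into the downloaded TensorFlow-Tutorials directory.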

In [None]:
# '''Download the project including imdb.py file'''
! git clone https://github.com/Hvass-Labs/TensorFlow-Tutorials.git
%cd TensorFlow-Tutorials

fatal: destination path 'TensorFlow-Tutorials' already exists and is not an empty directory.
/content/TensorFlow-Tutorials
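The fatal message above only says that the directory already exists from an earlier run; the %cd that follows still succeeds. As a minimal sketch, the clone step could be guarded so that re-running the cell stays quiet. The helper name clone_if_missing is hypothetical and not part of the TensorFlow-Tutorials project, and the sketch assumes git is available on the PATH.

```python
import os
import subprocess

def clone_if_missing(url, dest):
    """Clone url into dest unless dest already exists. Returns True if a clone ran."""
    if os.path.isdir(dest):
        # Already cloned by an earlier run: nothing to do.
        return False
    subprocess.run(["git", "clone", url, dest], check=True)
    return True

# Possible use in the notebook, followed by: %cd TensorFlow-Tutorials
# clone_if_missing("https://github.com/Hvass-Labs/TensorFlow-Tutorials.git",
#                  "TensorFlow-Tutorials")
```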


1. The ls command lists everything in the TensorFlow-Tutorials project directory.

In [None]:
ls

01_Simple_Linear_Model.ipynb           22_Image_Captioning.ipynb
02_Convolutional_Neural_Network.ipynb  23_Time-Series-Prediction.ipynb
03B_Layers_API.ipynb                   cache.py
03C_Keras_API.ipynb                    cifar10.py
03_PrettyTensor.ipynb                  coco.py
04_Save_Restore.ipynb                  convert.py*
05_Ensemble_Learning.ipynb             data/
06_CIFAR-10.ipynb                      dataset.py
07_Inception_Model.ipynb               download.py
08_Transfer_Learning.ipynb             europarl.py
09_Video_Data.ipynb                    forks.md
10_Fine-Tuning.ipynb                   images/
11_Adversarial_Examples.ipynb          imdb.py
12_Adversarial_Noise_MNIST.ipynb       inception5h.py
13B_Visual_Analysis_MNIST.ipynb        inception.py
13_Visual_Analysis.ipynb               knifey.py
14_DeepDream.ipynb                     LICENSE
15_Style_Transfer.ipynb                mnist.py
16_Reinforcement_Learning.ipynb        

## **Import Modules and Data**

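1. Import the tensorflow module, which provides machine-learning and deep-learning tools; in this project it is aliased as tf.

2. Import the numpy module, which provides tools for computing on large multi-dimensional arrays and matrices; in this project it is aliased as np.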

In [None]:
# '''Step1: Import two libraries'''
import tensorflow as tf
import numpy as np

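We import several modules from the keras package inside tensorflow to build the neural network and to preprocess the data.

1. From keras models, import Sequential, which builds a neural network by stacking layers one after another.

2. From keras layers, import Embedding, Dropout, GRU, LSTM and Dense. A typical network here passes through an embedding layer (Embedding), hidden layers (GRU or LSTM) and an output layer (Dense). Dropout is imported as well but is not used in the models shown here.

3. From keras optimizers, import Adam. Adam stands for Adaptive Moment Estimation and is one of the optimization methods used for training.

4. From keras preprocessing.text, import Tokenizer, which builds the token dictionary and converts input words into their tokens.

5. From keras preprocessing.sequence, import pad_sequences, which truncates or pads every token sequence to the same length.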

In [None]:
# '''Step2: Import Five modules from Keras'''
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, GRU, LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

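1. Import the imdb module, which is the imdb.py file mentioned earlier.

2. Call the imdb function maybe_download_and_extract() to download the IMDB movie-review data. IMDB is short for Internet Movie Database; the data set contains 50000 movie reviews.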

In [None]:
# '''Step3: Download and extract the data'''
import imdb
imdb.maybe_download_and_extract()

Data has apparently already been downloaded and unpacked.


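1. Call imdb.load_data(train=True) to load the training set; the results are stored as lists in x_train_text and y_train. x_train_text holds the review texts (inputs) and y_train holds the sentiment labels (label data); each review has exactly one label.

2. Call imdb.load_data(train=False) to load the test set; the results are stored as lists in x_test_text and y_test.

3. Print the sizes of the training set and the test set.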

In [None]:
# '''Step4: Load the data to build file lists'''
x_train_text, y_train = imdb.load_data(train=True)
x_test_text, y_test = imdb.load_data(train=False)
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

Train-set size:  25000
Test-set size:   25000


1. Print the first review in the training set.

2. Print the first label in the training set.

A label of 1.0 means the review is positive; a label of 0.0 means it is negative.

In [None]:
# '''Examine the details of x_train_text and y_train.'''
print('Movie review text:\n', x_train_text[0])
print('Sentiment label value:\n', y_train[0])

Movie review text:
 The movie was a suspenseful, and somewhat dark, look at the severe results of a genuinely human mistake. Connery and Fishburne work very well together in this thriller about murder and redemption. Keep your boots on for the strange turnaround at the end of the movie...you'd never expect it!
Sentiment label value:
 1.0


## **Tokenizer**

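1. Concatenate x_train_text and x_test_text into data_text.

2. Set num_popular_words to 5000.

3. Create a Tokenizer object named tokenizer with num_words set to num_popular_words, so that only the most frequent words (about 5000) are kept when texts are converted to tokens.

4. Pass data_text to the tokenizer's fit_on_texts function to build the token dictionary.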



In [None]:
# '''Step1: Instruct the tokenizer to scan through all the text'''
data_text = x_train_text + x_test_text
num_popular_words = 5000
tokenizer = Tokenizer(num_words=num_popular_words)
tokenizer.fit_on_texts(data_text)

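5. word_index is a dict attribute of tokenizer. It maps every word found in the reviews (x_train_text + x_test_text) to a positive integer that reflects how often the word occurs: the most frequent word gets 1, the next one gets 2, and so on. These integers are called tokens. Note that word_index itself is not cut off at 5000 entries; the num_words limit is applied later, when texts_to_sequences drops every token that is not below num_popular_words. The output below shows the beginning of the dictionary (truncated here).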

In [None]:
# '''Step2: Build a dictionary by converting all movie-review texts to lists of the fitted tokens'''
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

1. Pass x_train_text to the tokenizer's texts_to_sequences function; every word of each review is replaced by its token, and the result is stored in x_train_tokens.

2. Pass x_test_text to texts_to_sequences in the same way; the result is stored in x_test_tokens.

In [None]:
# '''Step3: Convert all texts to lists of tokens'''
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)
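To illustrate what fit_on_texts and texts_to_sequences do conceptually, here is a simplified pure-Python sketch. It is not the Keras implementation: the helper names, the regular expression used to split words and the toy texts are assumptions made for illustration. It ranks words by frequency and then drops every token that is not below num_words.

```python
from collections import Counter
import re

WORD_PATTERN = r"[a-z0-9']+"   # simplified word splitting, an assumption of this sketch

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in re.findall(WORD_PATTERN, t.lower()))
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index; drop unknown words and indices >= num_words."""
    sequences = []
    for t in texts:
        words = re.findall(WORD_PATTERN, t.lower())
        sequences.append([word_index[w] for w in words
                          if w in word_index and word_index[w] < num_words])
    return sequences

toy_texts = ["the movie was good", "the movie was bad", "the end"]
toy_index = build_word_index(toy_texts)
toy_tokens = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_tokens)
```

With num_words=4 only the tokens 1 to 3 survive, which mirrors how rare words disappear from the token sequences of the reviews.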

1. Print the text of the first review in x_train_text.

2. Print the same review after conversion to tokens.

In [None]:
# '''Investigate the response after text-to-token conversion'''
print('Original text: \n', x_train_text[0])
print('Text-to-token: \n', np.array(x_train_tokens[0]))

Original text: 
 The movie was a suspenseful, and somewhat dark, look at the severe results of a genuinely human mistake. Connery and Fishburne work very well together in this thriller about murder and redemption. Keep your boots on for the strange turnaround at the end of the movie...you'd never expect it!
Text-to-token: 
 [   1   17   13    3 2488    2  672  457  163   30    1 4780 1994    4
    3 2039  395 1408 3589    2  158   52   69  294    8   11  704   42
  593    2 3389  390  125   20   15    1  685   30    1  127    4    1
   17 1421  110  525    9]


## **Pad and Truncate Data**

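To make the computation easier for the neural network, all input sequences (x_train_tokens and x_test_tokens) must have the same length, so we truncate the long ones and pad the short ones.

1. Go through the token sequences in x_train_tokens and x_test_tokens, compute the length of each one, and store the lengths in num_tokens.

2. Compute the mean of all sequence lengths and store it in mean_tokens.

3. Print the length of every sequence.

4. Print the mean sequence length.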

In [None]:
# '''Step1: Count the number of tokens in each sequence and Calculate their average number'''
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
mean_tokens = int(np.mean(num_tokens))

print('Number of tokens in all the sequences:', num_tokens)
print('Average number of sequence length:', mean_tokens)

Number of tokens in all the sequences: [47, 141, 125, 120, 432, 241, 187, 145, 147, 50, 151, 68, 123, 130, 48, 95, 284, 138, 109, 358, 160, 366, 352, 139, 82, 132, 181, 171, 64, 206, 120, 138, 263, 130, 205, 273, 193, 174, 149, 289, 120, 221, 130, 171, 143, 201, 762, 182, 142, 55, 229, 466, 184, 120, 550, 764, 114, 110, 354, 270, 112, 183, 362, 154, 49, 879, 52, 225, 172, 118, 146, 131, 70, 48, 118, 53, 54, 166, 24, 141, 44, 322, 160, 682, 125, 141, 56, 160, 43, 125, 47, 111, 175, 147, 116, 128, 157, 467, 125, 33, 314, 112, 142, 99, 120, 769, 188, 117, 295, 295, 544, 133, 129, 119, 86, 147, 186, 164, 161, 276, 93, 103, 323, 326, 141, 219, 367, 533, 527, 93, 190, 512, 170, 840, 122, 182, 128, 193, 265, 107, 241, 275, 207, 304, 107, 269, 143, 132, 74, 161, 226, 178, 46, 87, 79, 91, 312, 222, 159, 176, 144, 200, 206, 215, 123, 204, 1681, 226, 129, 57, 607, 117, 124, 232, 61, 144, 422, 76, 140, 761, 838, 64, 826, 87, 29, 118, 120, 37, 123, 457, 94, 180, 442, 452, 64, 118, 145, 55, 75, 166,
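Choosing the mean length as the target length means that every longer review gets truncated. A rough sketch of how to check the share of reviews that fit, using only the first 20 lengths printed above as a sample (the helper name coverage is hypothetical, and this small sample is not representative of all 50000 reviews):

```python
import numpy as np

# First 20 sequence lengths copied from the output above (not the full data set).
sample_lengths = np.array([47, 141, 125, 120, 432, 241, 187, 145, 147, 50,
                           151, 68, 123, 130, 48, 95, 284, 138, 109, 358])

def coverage(lengths, max_len):
    """Fraction of sequences that fit into max_len tokens without truncation."""
    return float(np.mean(lengths <= max_len))

print("Mean length of the sample:", int(sample_lengths.mean()))
print("Share that fits into 211 tokens:", coverage(sample_lengths, 211))
```

A longer target length keeps more of each review at the cost of more padding and slower training.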

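1. Use pad_sequences to truncate or pad the sequences in x_train_tokens to length maxlen; the result is stored in x_train_pad (an array). Here maxlen is set to the mean sequence length mean_tokens. padding='pre' pads short sequences with zeros at the front, and truncating='post' cuts long sequences at the end.

2. Use pad_sequences in the same way on x_test_tokens; the result is stored in x_test_pad (an array).

3. Print the shape of x_train_pad.

4. Print the shape of x_test_pad.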

In [None]:
#'''Step2: Pad or truncate the tokened word-sequences'''
pad = 'pre'
truncate = 'post'
seq_len = mean_tokens #sequence length is optional

x_train_pad = pad_sequences(x_train_tokens, maxlen=seq_len, 
                            padding=pad, truncating=truncate)
x_test_pad = pad_sequences(x_test_tokens, maxlen=seq_len, padding=pad, truncating=truncate)
print(x_train_pad.shape, 'shows the array dimension of prepared training-set.')
print(x_test_pad.shape, 'shows the array dimension of prepared test-set.')

(25000, 211) shows the array dimension of prepared training-set.
(25000, 211) shows the array dimension of prepared test-set.
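The combination padding='pre' and truncating='post' can be sketched for a single sequence in NumPy. This is an illustrative re-implementation under those two settings only, not the Keras code; the function name pad_or_truncate is hypothetical.

```python
import numpy as np

def pad_or_truncate(tokens, maxlen):
    """Pre-pad with zeros or post-truncate a token list to exactly maxlen entries."""
    tokens = list(tokens)[:maxlen]              # 'post' truncation: drop the tail
    padding = [0] * (maxlen - len(tokens))      # 'pre' padding: zeros go in front
    return np.array(padding + tokens)

print(pad_or_truncate([1, 17, 13], 5))                    # short sequence gets padded
print(pad_or_truncate([1, 17, 13, 3, 2488, 2, 672], 5))   # long sequence gets cut
```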


1. Print the first token sequence in x_train_tokens.

2. Print the first sequence in x_train_pad, that is, the same tokens after padding or truncation.

In [None]:
# '''Investigate the response after padding or truncating'''
print('Before padding and truncating:\n', np.array(x_train_tokens[0]))
print('After padding and truncating:\n', x_train_pad[0])

Before padding and truncating:
 [   1   17   13    3 2488    2  672  457  163   30    1 4780 1994    4
    3 2039  395 1408 3589    2  158   52   69  294    8   11  704   42
  593    2 3389  390  125   20   15    1  685   30    1  127    4    1
   17 1421  110  525    9]
After padding and truncating:
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0   

## **Create the RNN-Classifier**

1. Create a Sequential model. As mentioned earlier, it builds a neural network by stacking layers one after another.

In [None]:
# '''Step1: Initialize the model'''
model = Sequential()

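The embedding layer converts the tokens of a review into word vectors.

1. Set the word-vector length embedding_size to 8.

2. Create the Embedding layer, which takes three parameters:

    a. input_dim is the size of the token dictionary; it is set to num_popular_words.

    b. output_dim is the word-vector length; it is set to embedding_size.

    c. input_length is the length of the input sequences; it is set to seq_len, which equals the mean sequence length mean_tokens.

3. Add the Embedding layer to model.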

In [None]:
# '''Step2: Add the embedding layer'''
embedding_size = 8
model.add(Embedding (input_dim=num_popular_words,
                     output_dim=embedding_size,
                     input_length=seq_len,
                     name='layer_embedding'))
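Conceptually, an embedding layer is a lookup table with one row per token. A minimal NumPy sketch of that idea follows; the random table is only a stand-in for the trained weights, and the short token sequence is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, vec_len = 5000, 8     # same sizes as num_popular_words and embedding_size
table = rng.uniform(-0.05, 0.05, size=(vocab_size, vec_len))  # stand-in for trained weights

padded_review = np.array([0, 0, 0, 1, 17, 13])   # a short, already padded token sequence
vectors = table[padded_review]                    # row lookup: one vector per token
print(vectors.shape)
```

For a real review of 211 tokens the same lookup yields an array of shape (211, 8), which matches the output shape of layer_embedding.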

Next we add four hidden layers to the model, all built from LSTM units.

1. The first LSTM layer has 32 units, so its output vector has length 32. return_sequences=True means the output at every time step is passed on as the input of the next layer.

2. The second LSTM layer has 16 units, again with return_sequences=True.

3. The third LSTM layer has 8 units, again with return_sequences=True.

4. The fourth LSTM layer has 4 units. return_sequences keeps its default value False, so only the output of the last time step is passed on to the next layer.

5. Each layer is added to model with model.add(), in this order.

In [None]:
# '''Step3: Add the RNN layers with LSTM'''
# The input shape is inferred from the embedding layer, so no input_shape is needed here.
model.add(LSTM(units = 32, return_sequences = True))
model.add(LSTM(units = 16, return_sequences = True))
model.add(LSTM(units = 8, return_sequences = True))
model.add(LSTM(units = 4))
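The effect of return_sequences can be sketched with a plain tanh recurrent layer in NumPy. This is a simplified stand-in, not an LSTM and not the Keras code; the function name and the random weights are assumptions for illustration. With return_sequences=True the layer returns one output per time step, otherwise only the last one.

```python
import numpy as np

def simple_recurrent_layer(x, w_in, w_rec, bias, return_sequences=False):
    """Run a tanh recurrent layer over x with shape (time_steps, input_dim)."""
    state = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t in x:
        # The new state depends on the current input and the previous state.
        state = np.tanh(x_t @ w_in + state @ w_rec + bias)
        outputs.append(state)
    return np.stack(outputs) if return_sequences else outputs[-1]

rng = np.random.default_rng(0)
x = rng.normal(size=(211, 8))          # one review: 211 steps of 8-dimensional vectors
w_in = rng.normal(size=(8, 4))
w_rec = rng.normal(size=(4, 4))
bias = np.zeros(4)

print(simple_recurrent_layer(x, w_in, w_rec, bias, return_sequences=True).shape)
print(simple_recurrent_layer(x, w_in, w_rec, bias).shape)
```

Stacked recurrent layers need the full sequence from the layer below, which is why only the last recurrent layer leaves return_sequences at False.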


Finally we add the output layer to the model.

 1. Use Dense to create the output layer. Its output vector has length 1 and its activation function is sigmoid, which squeezes the output into the range 0 to 1.


In [None]:
# '''Step4: Add the dense layer as the classification output'''
model.add(Dense(units=1, activation='sigmoid'))
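For reference, the sigmoid function itself is short enough to write out. A minimal NumPy sketch with made-up input scores:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, 0.0, 4.0])
print(sigmoid(scores))   # strongly negative -> near 0, zero -> 0.5, strongly positive -> near 1
```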

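1. Use model.summary() to display information about each layer of the network:

    a. 5000 is the number of words in the dictionary.

    b. 8 is the word-vector length.

    c. 211 is the length of the input sequence.

    d. The four LSTM layers have 32, 16, 8 and 4 units.

    e. The layers have 40000, 5248, 3136, 800, 208 and 5 parameters respectively, all of them trainable.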

In [None]:
# '''Step5: View a summary of the model'''
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer_embedding (Embedding)  (None, 211, 8)           40000     
                                                                 
 lstm (LSTM)                 (None, 211, 32)           5248      
                                                                 
 lstm_1 (LSTM)               (None, 211, 16)           3136      
                                                                 
 lstm_2 (LSTM)               (None, 211, 8)            800       
                                                                 
 lstm_3 (LSTM)               (None, 4)                 208       
                                                                 
 dense (Dense)               (None, 1)                 5         
                                                                 
Total params: 49,397
Trainable params: 49,397
Non-traina
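The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights and a bias; the helper name lstm_params is hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

embedding = 5000 * 8                 # one 8-dimensional vector per token
lstm_stack = [lstm_params(8, 32), lstm_params(32, 16),
              lstm_params(16, 8), lstm_params(8, 4)]
dense = 4 * 1 + 1                    # 4 weights plus 1 bias
total = embedding + sum(lstm_stack) + dense
print(embedding, lstm_stack, dense, total)
```

The results agree with the Param # column and with the total of 49,397 shown above.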

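1. Use the model's compile function to configure training:

    a. loss is the loss function; it is set to binary_crossentropy.

    b. optimizer is the optimization method; it is set to Adam with a learning rate of 1e-3 (10 to the power of -3).

    c. metrics is how the model's performance is measured during training and testing; it is set to accuracy.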

In [None]:
# '''Step6: Compile the model'''
optimizer = Adam(learning_rate=1e-3)  # 'lr' is a deprecated alias of 'learning_rate'
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
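To make the loss and the metric concrete, here is a minimal NumPy sketch of binary cross-entropy and accuracy on made-up values. The function names, the clipping constant eps and the toy data are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions that fall on the correct side of the threshold."""
    return float(np.mean((y_prob > threshold) == (y_true == 1.0)))

toy_labels = np.array([1.0, 0.0, 1.0, 0.0])
toy_probs = np.array([0.9, 0.2, 0.4, 0.1])
print(binary_crossentropy(toy_labels, toy_probs), accuracy(toy_labels, toy_probs))
```

The third toy sample is a positive review predicted at 0.4, so it counts as an error for the accuracy and contributes the largest term to the loss.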



Build the second model, using GRU layers as the recurrent network.

In [None]:
# '''Step1: Initialize the model'''
model2 = Sequential()

In [None]:
# '''Step2: Add the embedding layer'''
embedding_size = 8
model2.add(Embedding (input_dim=num_popular_words,
                     output_dim=embedding_size,
                     input_length=seq_len,
                     name='layer_embedding'))

In [None]:
# '''Step3: Add the RNN layers with GRU'''
model2.add(GRU(units = 32, return_sequences = True))
model2.add(GRU(units = 16, return_sequences = True))
model2.add(GRU(units = 8, return_sequences = True))
model2.add(GRU(units = 4))

In [None]:
# '''Step4: Add the dense layer as the classification output'''
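# NOTE: this run used 'relu'; 'sigmoid' (as in the first model) is the usual choice with
# binary_crossentropy, because ReLU outputs are not confined to the range 0 to 1.
model2.add(Dense(units=1, activation='relu'))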

In [None]:
# '''Step5: View a summary of the model'''
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer_embedding (Embedding)  (None, 211, 8)           40000     
                                                                 
 gru (GRU)                   (None, 211, 32)           4032      
                                                                 
 gru_1 (GRU)                 (None, 211, 16)           2400      
                                                                 
 gru_2 (GRU)                 (None, 211, 8)            624       
                                                                 
 gru_3 (GRU)                 (None, 4)                 168       
                                                                 
 dense_1 (Dense)             (None, 1)                 5         
                                                                 
Total params: 47,229
Trainable params: 47,229
Non-trai

In [None]:
# '''Step6: Compile the model'''
optimizer = Adam(learning_rate=1e-3)  # 'lr' is a deprecated alias of 'learning_rate'
model2.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['binary_accuracy'])



Build the third model, using SimpleRNN layers (the last recurrent layer is a GRU).

In [None]:
from tensorflow.keras.layers import SimpleRNN

In [None]:
# '''Step1: Initialize the model'''
model3 = Sequential()

In [None]:
# '''Step2: Add the embedding layer'''
embedding_size = 8
model3.add(Embedding (input_dim=num_popular_words,
                     output_dim=embedding_size,
                     input_length=seq_len,
                     name='layer_embedding'))

In [None]:
# '''Step3: Add the RNN layers'''
model3.add(SimpleRNN(units = 32, return_sequences = True))
model3.add(SimpleRNN(units = 16, return_sequences = True))
model3.add(SimpleRNN(units = 8, return_sequences = True))
model3.add(GRU(units = 4))

In [None]:
# '''Step4: Add the dense layer as the classification output'''
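# NOTE: as run, the output uses 'relu'; 'sigmoid' would keep it within 0 to 1 for binary_crossentropy.
model3.add(Dense(units=1, activation='relu'))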

In [None]:
model3.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 layer_embedding (Embedding)  (None, 211, 8)           40000     
                                                                 
 simple_rnn (SimpleRNN)      (None, 211, 32)           1312      
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 211, 16)           784       
                                                                 
 simple_rnn_2 (SimpleRNN)    (None, 211, 8)            200       
                                                                 
 gru_4 (GRU)                 (None, 4)                 168       
                                                                 
 dense_2 (Dense)             (None, 1)                 5         
                                                                 
Total params: 42,469
Trainable params: 42,469
Non-trai
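The smaller parameter counts of the GRU and SimpleRNN layers follow from their simpler cells. A sketch of the arithmetic, with hypothetical helper names; the GRU formula assumes the variant with two bias vectors per gate (reset_after=True), which is consistent with the counts in the summaries above.

```python
def simple_rnn_params(input_dim, units):
    """One weight set: input weights, recurrent weights and a bias."""
    return units * (input_dim + units) + units

def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and two biases."""
    return 3 * (units * (input_dim + units) + 2 * units)

gru_stack = [gru_params(8, 32), gru_params(32, 16), gru_params(16, 8), gru_params(8, 4)]
rnn_stack = [simple_rnn_params(8, 32), simple_rnn_params(32, 16), simple_rnn_params(16, 8)]
print(gru_stack)
print(rnn_stack)
```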

In [None]:
model3.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics =["accuracy"])

## **Train and Test the RNN-Classifier**

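1. %%time reports how long the code in this cell takes to run.

2. Use the model's fit function to train the model:

    a. x_train_pad is the training set of reviews after tokenizing and padding or truncation; it is the input for training.

    b. y_train holds the sentiment labels; it is the label data that the model's output is compared against to compute the error.

    c. validation_split=0.2 holds out 20% of the training set to validate performance during training.

    d. epochs=20 means the model passes over the training data 20 times.

    e. batch_size=64 means the model performs one optimization step for every 64 training samples.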
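As a rough sketch of the bookkeeping behind these settings: with validation_split=0.2, Keras sets aside the last 20% of the samples (taken before shuffling) for validation, and the remaining samples are processed in batches of 64. The helper name below is hypothetical.

```python
import math

def split_and_steps(num_samples, validation_split, batch_size):
    """Return (training size, validation size, optimization steps per epoch)."""
    num_val = round(num_samples * validation_split)
    num_train = num_samples - num_val
    return num_train, num_val, math.ceil(num_train / batch_size)

print(split_and_steps(25000, 0.2, 64))
```

Because the validation part is taken from the end of the arrays, it is worth checking that the data is not ordered by label before relying on the validation accuracy.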

In [None]:
y_train=np.array(y_train)

In [None]:
# '''Step1: Train the RNN-Classifier'''
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.2, epochs=20, batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
CPU times: user 4min 14s, sys: 7.78 s, total: 4min 22s
Wall time: 5min 28s


<keras.callbacks.History at 0x7ff9dcf76400>

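1. %%time reports how long the code in this cell takes to run.

2. Use the model's evaluate function to test the model's performance:

    a. x_test_pad is the test set of reviews after tokenizing and padding or truncation; it is the input for testing.

    b. y_test holds the sentiment labels; it is the label data that the model's output is compared against to compute the error.

    c. result stores the loss and the accuracy returned by evaluate.

3. Print the accuracy of the model on the test set.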

In [None]:
y_test=np.array(y_test)

In [None]:
# '''Step2: Performance on Test-Set'''
%%time
result = model.evaluate(x_test_pad, y_test)
print("\nAccuracy: {0:.2%}".format(result[1]))


Accuracy: 82.32%
CPU times: user 10.3 s, sys: 380 ms, total: 10.7 s
Wall time: 10.3 s


Train the second model

In [None]:
# '''Step1: Train the RNN-Classifier'''
%%time
model2.fit(x_train_pad, y_train,
          validation_split=0.2, epochs=20, batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
CPU times: user 4min 2s, sys: 6.41 s, total: 4min 8s
Wall time: 4min 27s


<keras.callbacks.History at 0x7ff95c09a430>

In [None]:
# '''Step2: Performance on Test-Set'''
%%time
result = model2.evaluate(x_test_pad, y_test)
print("\nAccuracy: {0:.2%}".format(result[1]))


Accuracy: 80.69%
CPU times: user 9.42 s, sys: 322 ms, total: 9.75 s
Wall time: 10.3 s


---


# **Tutorial_NLP+RNN : Part III (Analysis and Discussion)**

---



## **Example of Mis-Classified Text**

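1. Use model.predict() to predict all 25000 reviews in the test set.

     a. x_test_pad is the test set of reviews after tokenizing and padding or truncation; it is the input of model.predict().

     b. y_pred holds the predicted scores; it is the output of model.predict().

2. cls_pred turns the predictions into 0 or 1 using a threshold of 0.5; cls_true holds the true labels.

3. Use np.where(cls_pred != cls_true) to find the reviews that were classified incorrectly.

4. np.where returns a tuple of index arrays; mis_cls_idx takes its first element, a one-dimensional array of indices, and len() counts the misclassified reviews.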

In [None]:
# '''Step1: Calculate the total number of mis-classified text and its text index'''
y_pred = model.predict(x_test_pad)
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])
cls_true = np.array(y_test)
incorrect = np.where(cls_pred != cls_true)
mis_cls_idx = incorrect[0]
num_incorrect = len(mis_cls_idx)
print('Of the 25000 texts used, %s texts were mis-classified.'  %num_incorrect)
print('\nThe Array of all incorrect text number is:', mis_cls_idx)

Of the 25000 texts used, 4421 texts were mis-classified.

The Array of all incorrect text number is: [    5     9    10 ... 24995 24997 24999]
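The thresholding and the search for mismatches can also be written without a Python loop. A small sketch on made-up predictions; the _toy names are placeholders and not variables from the notebook.

```python
import numpy as np

y_pred_toy = np.array([[0.91], [0.12], [0.55], [0.40]])   # shape (4, 1), like model.predict()
y_true_toy = np.array([1.0, 0.0, 0.0, 1.0])

cls_pred_toy = (y_pred_toy.ravel() > 0.5).astype(float)   # threshold all rows at once
wrong_idx = np.flatnonzero(cls_pred_toy != y_true_toy)     # 1-D indices of the mismatches
print(cls_pred_toy, wrong_idx)
```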


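1. idx = mis_cls_idx[5] selects the sixth misclassified review; idx is its position among the 25000 test reviews.

2. mis_cls_text = x_test_text[idx] is the text of that misclassified review.

3. y_pred[idx] is the score predicted by the model, and cls_true[idx] is the true class of the review.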

In [None]:
# '''Step2: View one of the mis-classified text'''
idx = mis_cls_idx[5]
print('idx =', idx)

mis_cls_text = x_test_text[idx]
print('The mis-classified text is:\n 「', mis_cls_text, '」')

print('\nThe predicted class for this text:', y_pred[idx])
print('The true classes for this text:', cls_true[idx])

idx = 43
The mis-classified text is:
 「 First of all , you should watch this only if you don't mind the lack of subtitles , pornography , kinky sex and utter , horrifying and truly shocking depravity . I mentioned kinky sex , but to call sex in the second half of the movie " kinky " would be a great understatement . It's more like a punch in the face if you aren't prepared for this sort of sickness . That being said , I can go back to reviewing this morbid piece of pseudo - snuff genre brought to us by our fellow Japanese .<br /><br />The plot seems to be fairly basic , almost nonexistent : a girl is hired to perform in amateur porn movie . Don't expect much in first 30 - 40 minutes . There is some dialog - if you don't speak Japanese it's not going to mean much to you - that seems to be an occasional chatting between the girl and the crew & performer , then there is some porn ( straight sex ) , and after the scene is finished the performers and the crew take a break . And then ... it 

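Confusion Matrix

![Confusion Matrix](https://pic.pimg.tw/belleaya/1465822085-3562204760.jpg?v=1465822086)

The confusion matrix combines the true classes with the predicted classes.

A 2x2 confusion matrix has four possible outcomes:

True-Positive: the number of reviews the model predicted as positive that are actually positive.

False-Negative: the number of reviews the model predicted as negative that are actually positive.

False-Positive: the number of reviews the model predicted as positive that are actually negative.

True-Negative: the number of reviews the model predicted as negative that are actually negative.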


In [None]:
# '''Step3: Use the confusion matrix as the evaluation metric'''
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(cls_true, cls_pred)
print("The confusion matrix is:\n", cm)
ttl_mis_text = cm[0,1] + cm[1,0]
print("The total number of mis-classified texts is '%d+%d=%d'" % (cm[0,1], cm[1,0], ttl_mis_text))

The confusion matrix is:
 [[ 9667  2833]
 [ 1588 10912]]
The total number of mis-classified texts is '2833+1588=4421'
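Beyond the error count, the four cells of the matrix also give accuracy, precision and recall. A short sketch using the numbers printed above; in the scikit-learn layout the rows are the true classes and the columns are the predicted classes, so the cells read TN, FP, FN, TP. The variable names are chosen for this sketch.

```python
import numpy as np

# Values copied from the output above: rows = true class, columns = predicted class.
cm_reported = np.array([[9667, 2833],
                        [1588, 10912]])
tn, fp, fn, tp = cm_reported.ravel()

acc = (tp + tn) / cm_reported.sum()
precision = tp / (tp + fp)    # of the reviews predicted positive, how many really are
recall = tp / (tp + fn)       # of the truly positive reviews, how many were found
print(f"accuracy={acc:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way matches the 82.32% reported by model.evaluate() earlier.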


## **New text data testing**

Create new reviews text1 to text8 and feed them into the trained model as a test.


In [None]:
# '''Step1: Build some new text data'''
text1 = "This movie is fantastic! I really like it because it is so good!" 
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
new_texts = [text1, text2, text3, text4, text5, text6, text7, text8]

1. Use tokenizer.texts_to_sequences() to convert the new reviews into tokens.

2. Use pad_sequences() to pad or truncate the tokens.

3. Use model.predict() to predict the sentiment of text1 to text8.

In [None]:
# '''Step2: Predict the sentiment of new data'''
tokens_nd = tokenizer.texts_to_sequences(new_texts)
tokens_nd_pad = pad_sequences(tokens_nd, maxlen=seq_len,
                           padding=pad, truncating=truncate)
pred_nd = model.predict(tokens_nd_pad)
print('Probability of sentiment from text 1 to 8:\n', pred_nd)

Probability of sentiment from text 1 to 8:
 [[0.98655546]
 [0.04397453]
 [0.02593382]
 [0.02902037]
 [0.0151109 ]
 [0.01706331]
 [0.05142704]
 [0.0164567 ]]


In [None]:
# '''Step3: Predict the sentiment of new data with the second model (model2)'''
pred_nd2 = model2.predict(tokens_nd_pad)
print('Probability of sentiment from text 1 to 8:\n', pred_nd2)

Probability of sentiment from text 1 to 8:
 [[0.72604126]
 [0.27853864]
 [0.3781093 ]
 [0.18481424]
 [0.25790918]
 [0.12167244]
 [0.3087553 ]
 [0.00336485]]
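The two sets of probabilities are easier to compare side by side. A small sketch that labels each text with the same 0.5 threshold used earlier; the values are copied (rounded) from the two outputs above, and the names pred_lstm and pred_gru are chosen for this sketch.

```python
import numpy as np

# Probabilities copied (rounded to four decimals) from the two outputs above.
pred_lstm = np.array([0.9866, 0.0440, 0.0259, 0.0290, 0.0151, 0.0171, 0.0514, 0.0165])
pred_gru = np.array([0.7260, 0.2785, 0.3781, 0.1848, 0.2579, 0.1217, 0.3088, 0.0034])

for i, (p1, p2) in enumerate(zip(pred_lstm, pred_gru), start=1):
    label1 = "positive" if p1 > 0.5 else "negative"
    label2 = "positive" if p2 > 0.5 else "negative"
    print(f"text{i}: LSTM {p1:.2f} ({label1}) | GRU {p2:.2f} ({label2})")
```

Both models label only text1 as positive, so short positive reviews such as text2 ("Good movie!") end up on the negative side of the threshold.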


## **Embeddings**

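1. Use model.get_layer() to fetch layer_embedding from model.

2. Use get_weights() to extract the weight values.

3. Use shape to inspect the shape of the weights: (5000, 8) means the dictionary has 5000 entries and each word vector has length 8.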

In [None]:
# '''Step1: Take the weights out from the trained RNN-classifier'''
layer_embedding = model.get_layer('layer_embedding')
weights_embedding = layer_embedding.get_weights()[0]
print('The array dimension of weights_embedding is:', weights_embedding.shape)

The array dimension of weights_embedding is: (5000, 8)


1. tokenizer.word_index[] looks up the token of a word in the dictionary. Here we show the tokens of good and great, and use weights_embedding[] to display the word vector of each token.

In [None]:
# '''Step2: View two word vectors with positive sentiment'''
token_good = tokenizer.word_index['good']
print("The token number of 'good':", token_good)
print("The word vector of 'good': \n", weights_embedding[token_good])

token_great = tokenizer.word_index['great']
print("The token number of 'great':", token_great)
print("The word vector of 'great': \n", weights_embedding[token_great])

The token number of 'good': 49
The word vector of 'good': 
 [-0.0669536  -0.02946232 -0.02141166 -0.07207938 -0.04383925 -0.06601611
  0.0433591  -0.06836047]
The token number of 'great': 78
The word vector of 'great': 
 [-0.13109653 -0.2106202  -0.12394591 -0.19295937 -0.20537293 -0.2180386
  0.11041305 -0.12451705]


In [None]:
# '''Step3: View two word vectors with negative sentiment'''
token_bad = tokenizer.word_index['bad']
print("The token number of 'bad':", token_bad)
print("The word vector of 'bad': \n", weights_embedding[token_bad])

token_horrible = tokenizer.word_index['horrible']
print("The token number of 'horrible':", token_horrible)
print("The word vector of 'horrible': \n", weights_embedding[token_horrible])

The token number of 'bad': 74
The word vector of 'bad': 
 [ 0.12419138  0.17240778  0.17221008  0.20071897  0.19355962  0.211528
 -0.22701228  0.15973277]
The token number of 'horrible': 488
The word vector of 'horrible': 
 [ 0.11307114  0.15891303  0.23722452  0.23837747  0.16913286  0.22561371
 -0.3627046   0.31144097]
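One way to compare these word vectors is cosine similarity, which is positive when two vectors point in a similar direction and negative when they point in opposite directions. The sketch below uses the vectors printed above; the helper name cosine_similarity is chosen for this sketch.

```python
import numpy as np

# Word vectors copied from the outputs above.
good = np.array([-0.0669536, -0.02946232, -0.02141166, -0.07207938,
                 -0.04383925, -0.06601611, 0.0433591, -0.06836047])
great = np.array([-0.13109653, -0.2106202, -0.12394591, -0.19295937,
                  -0.20537293, -0.2180386, 0.11041305, -0.12451705])
bad = np.array([0.12419138, 0.17240778, 0.17221008, 0.20071897,
                0.19355962, 0.211528, -0.22701228, 0.15973277])
horrible = np.array([0.11307114, 0.15891303, 0.23722452, 0.23837747,
                     0.16913286, 0.22561371, -0.3627046, 0.31144097])

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("good vs great:   ", cosine_similarity(good, great))
print("bad vs horrible: ", cosine_similarity(bad, horrible))
print("good vs bad:     ", cosine_similarity(good, bad))
```

Words with the same sentiment give a positive similarity here, while good and bad give a negative one, because every component of the two vectors has the opposite sign.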
