## LSTM for binary sentiment classification, on IMDB

- 50,000 movie reviews
- 25,000 for training, 25,000 for testing
- Positive reviews labeled with 1
- Negative reviews labeled with 0
- Obtainable at http://ai.stanford.edu/~amaas/data/sentiment/

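**An LSTM has 4 times as many parameters as a plain RNN of the same size (one weight set for each of the three gates plus the candidate cell state)** <br/>
**A GRU has about three quarters of the LSTM's parameters** (it is faster than an LSTM and usually not much worse in accuracy, so GRUs are widely used as well)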
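
The parameter counts can be checked by hand. The sketch below is only an illustration: the helper `recurrent_params` is a hypothetical name, and it assumes the classic formulation in which every gate (or candidate) block has an input kernel, a recurrent kernel and one bias vector. Some GRU implementations keep a second bias vector per block, which would give slightly larger numbers. The sizes 32 and 100 are the embedding and hidden sizes used in this notebook.

```python
def recurrent_params(input_dim, units, n_weight_sets):
    """Count parameters of a recurrent layer.

    Each weight set holds an input kernel (input_dim x units),
    a recurrent kernel (units x units) and a bias (units).
    """
    per_set = units * (input_dim + units) + units
    return n_weight_sets * per_set


input_dim, units = 32, 100  # embedding size and hidden size from this notebook

simple_rnn = recurrent_params(input_dim, units, 1)  # one weight set
lstm = recurrent_params(input_dim, units, 4)        # 3 gates + candidate cell state
gru = recurrent_params(input_dim, units, 3)         # 2 gates + candidate state

print(simple_rnn, lstm, gru)
```

With these sizes the LSTM count is 53,200, which matches the `lstm` row of the model summary in the output.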

In [1]:
### Step 1: Import modules & set logging
import os
import logging

import numpy as np

import keras.backend as K

from keras.datasets import imdb
from keras.models import Model, Input
from keras.layers.core import Dense
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence


## Fix random seed for reproducibility
np.random.seed(20170704)


## Check proper working directory
path = os.getcwd()
os.chdir(path)
if os.getcwd().split('/')[-1] == 'DLdata':
    pass
else:
    path = os.getcwd()+'/DLdata'
    #raise OSError('Check current working directory.\n'
    #              'If not specified as instructed, '
    #              'more errors will occur throughout the code.\n'
    #              '- Current working directory: %s' % os.getcwd())
print(path)

## Set logging
def set_logging(testlog=False):
    # 1. Make 'logger' instance
    logger = logging.getLogger()
    # 2. Make 'formatter'
    formatter = logging.Formatter(
            '[%(levelname)s:%(lineno)s] %(asctime)s > %(message)s'
            )
    # 3. Make 'streamHandler'
    streamHandler = logging.StreamHandler()
    # 4. Set 'formatter' to 'streamHandler'
    streamHandler.setFormatter(formatter)
    # 5. Add streamHandler to 'logger' instance
    logger.addHandler(streamHandler)
    # 6. Set level of log; DEBUG -> INFO -> WARNING -> ERROR -> CRITICAL
    logger.setLevel(logging.DEBUG)
    # 7. Print test INFO message
    if testlog: # default is 'False'
        logging.info("Stream logging available!")
    
    return logger

_ = set_logging()


####################################################################################


### Step 2: Load, view & preprocess data

## 2-1. Load
# Load dataset, but only keep the top n words
logging.info('Loading imdb dataset...')
top_words = 5000 
# Keep only the 5,000 most frequent words (as indices)
# Conceptually, each word is a 5,000-dimensional one-hot vector
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

logging.debug('Shape of train data: {}'.format(X_train.shape))
logging.debug('Shape of test data: {}'.format(X_test.shape))


## 2-2. View
# word2idx, idx2word <- python dictionaries
word2idx = imdb.get_word_index()
idx2word = dict((v, k) for k, v in word2idx.items())
# idx2word is an (index, word) dictionary whose indices start at 1 (index 0 is used for padding below)
logging.info('Vocabulary size: {}'.format(len(idx2word)))

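# View the original review in text
# Convert word indices back to text (only the 5,000 most frequent words are kept)
# imdb.load_data shifts indices by 3 by default (0: padding, 1: start, 2: out-of-vocabulary)
def to_text(X, idx2word, index_from=3):
    text = [' '.join([idx2word.get(index - index_from, '?') for index in review]) for review in X]
    return text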

text_train = to_text(X_train, idx2word)
text_test = to_text(X_test, idx2word)
logging.info('\n{}\n- {}'.format(text_train[0], ('pos' if y_train[0] == 1 else 'neg')))
logging.info('\n{}\n- {}'.format(text_test[245], ('pos' if y_test[245] == 1 else 'neg')))


## 2-3. Preprocess
# Truncate and pad input sequences
max_review_len = 500 
# Fix the input length to 500 (a review is defined as at most 500 words)
# Reviews longer than 500 are truncated to 500; shorter ones are filled with 0

if X_train.shape == (25000, ) and X_test.shape == (25000, ):
    X_train = sequence.pad_sequences(X_train, maxlen=max_review_len,
                                     padding='pre', truncating='pre',
                                     value=0.)
    X_test = sequence.pad_sequences(X_test, maxlen=max_review_len,
                                    padding='pre', truncating='pre',
                                    value=0.)
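# padding='pre': zeros are added at the front
# : with zeros at the end, the signal would have to pass through many padding steps (vanishing gradient), and what was computed early could be lost
# truncating='pre': long reviews are cut from the front, since the conclusion usually comes at the end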
logging.info('Pad sequences shorter than %d with "0"' % max_review_len)
logging.info('Truncate sequences longer than {0} to {0}'.format(max_review_len))
logging.debug('Shape of train data (preprocessed): {}'.format(X_train.shape))
logging.debug('Shape of test data (preprocessed) : {}'.format(X_test.shape))


####################################################################################


### Step 3: Build model

## 3-1. Hyperparameters
epochs = 5
batch_size = 128
hidden_size = 100
embedding_vector_len = 32 
# A 5,000-dimensional one-hot vector goes in, and the embedding layer maps it to 32 dimensions
# 5,000 dimensions are too many to feed in directly, so each word is embedded into 32 dimensions

## 3-2. Define RNN model with LSTM cells for IMDB data

# Define input (SHAPE IS IMPORTANT!!!)
input_sequence = Input(shape=(max_review_len, ), # max_review_len = 500: number of timesteps the LSTM is unrolled over
                       dtype='int32', 
                       name='input_sequence')


# Define Embedding layer (takes word indices as input)
x = Embedding(input_dim=top_words, # top_words = 5000: size of the one-hot vector (vocabulary size)
              output_dim=embedding_vector_len, # embedding_vector_len = 32
              input_length=max_review_len, # max_review_len = 500: same as the number of LSTM timesteps
              mask_zero=True, # Masked timesteps (index 0, i.e. the padding zeros) are skipped, so they contribute no gradient
              name='embedding')(input_sequence)


# Define LSTM layer
x = LSTM(units=hidden_size,
         dropout=0.,
         recurrent_dropout=0., # Dropout helps against overfitting, but using this option is not recommended here:
                               # the LSTM is designed to mitigate vanishing gradients, so as much information as possible should reach the next timestep
         kernel_initializer='glorot_uniform', # Glorot (Xavier) uniform: a sensible choice of initial input weights
         recurrent_initializer='orthogonal',  # Random orthogonal matrix for the recurrent weights
         return_sequences=False, 
         # This model has a single LSTM layer, but several can be stacked.
         # return_sequences controls what the next layer receives: True returns the output at every timestep (needed to feed another LSTM), False returns only the last output
         name='lstm')(x)


# Define Dense layer
x = Dense(units=100, activation='relu', name='fc')(x)


# Define prediction layer; use sigmoid for binary classification
prediction = Dense(units=1, activation='sigmoid', name='prediction')(x)


# Instantiate model
model = Model(inputs=input_sequence,
              outputs=prediction,
              name='LSTM_imdb')


####################################################################################


### Step 4: Define callbacks

from keras.callbacks import ModelCheckpoint
from keras.callbacks import EarlyStopping
from keras.callbacks import ReduceLROnPlateau
from keras.callbacks import TensorBoard

# List of callbacks
callbacks = []

# Model checkpoints
ckpt_path = path+'/lstm_imdb_ckpts/weights.{epoch:02d}-{val_acc:.2f}.hdf5'
if not os.path.exists(os.path.dirname(ckpt_path)):
    os.makedirs(os.path.dirname(ckpt_path))

checkpoint = ModelCheckpoint(filepath=ckpt_path,
                             monitor='val_acc',
                             save_best_only=True,
                             verbose=1)
callbacks.append(checkpoint)

# Stop training early
earlystopping = EarlyStopping(monitor='val_loss',
                              patience=5,
                              verbose=1)
callbacks.append(earlystopping)

# Reduce learning rate when learning does not improve
reducelr = ReduceLROnPlateau(monitor='val_loss',
                             factor=0.1, 
                             patience=10,
                             verbose=1)
callbacks.append(reducelr)

# Tensorboard for visualization
if K.backend() == 'tensorflow':
    tb_logdir = path+'/lstm_imdb_logs/'
    if not os.path.exists(tb_logdir):
        os.makedirs(tb_logdir)
    tensorboard = TensorBoard(log_dir=tb_logdir,
                              histogram_freq=1,
                              write_graph=True)
    callbacks.append(tensorboard)

####################################################################################


### Step 5: Compile & train model

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop', # Adam is a good general default, but RMSprop is also known to work well for RNNs
              metrics=['accuracy'])


print(model.summary())


history = model.fit(X_train, y_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=0.1,
                    callbacks=callbacks,
                    verbose=1)


####################################################################################


### Step 6: Save & load weights

# Save model weights
model.save_weights(path+'/weights/lstm_imdb_weights.h5')

# Load model weights
model.load_weights(path+'/weights/lstm_imdb_weights_master.h5')


####################################################################################


### Step 7: Test final model performance

test_scores = model.evaluate(X_test, y_test, verbose=1)
logging.info('Test accuracy: %.2f%%' %(test_scores[1] * 100))
#print("Test accuracy: %.2f%%" % (test_scores[1] * 100))

#train_scores = model.evaluate(X_train, y_train, verbose=1)
#print("Train accuracy: %.2f%%" % (train_scores[1] * 100))

K.clear_session()
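
The effect of `padding='pre'` and `truncating='pre'` in step 2-3 can be illustrated without Keras. The function `pad_pre` below is a hypothetical stand-in written for this note, not the library implementation; it assumes `maxlen` is a positive integer and works on plain Python lists.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq` and left-pad with `value`.

    Mimics padding='pre' together with truncating='pre' for one sequence.
    """
    kept = seq[-maxlen:]                          # drop items from the front if too long
    return [value] * (maxlen - len(kept)) + kept  # add padding at the front if too short


short_review = pad_pre([7, 8, 9], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_review)  # [0, 0, 7, 8, 9]
print(long_review)   # [3, 4, 5, 6, 7]
```

In both cases the end of the review stays next to the last timestep, which is the one the LSTM output is read from when `return_sequences=False`.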

Using TensorFlow backend.
[INFO:66] 2017-07-05 23:58:39,705 > Loading imdb dataset...


/home/user/DataScience/DataScience/Study Note/Deep Learning/DLdata


[DEBUG:72] 2017-07-05 23:58:42,464 > Shape of train data: (25000,)
[DEBUG:73] 2017-07-05 23:58:42,465 > Shape of test data: (25000,)
[INFO:80] 2017-07-05 23:58:42,504 > Vocabulary size: 88584
[INFO:89] 2017-07-05 23:58:43,257 > 
the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s and with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over and for anyone of and br show's to whether from than o

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_sequence (InputLayer)  (None, 500)               0         
_________________________________________________________________
embedding (Embedding)        (None, 500, 32)           160000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
fc (Dense)                   (None, 100)               10100     
_________________________________________________________________
prediction (Dense)           (None, 1)                 101       
Total params: 223,401
Trainable params: 223,401
Non-trainable params: 0
_________________________________________________________________
None
Train on 22500 samples, validate on 2500 samples
INFO:tensorflow:Summary name embedding/embeddings:0 is illegal; using embedding/em

[INFO:82] 2017-07-05 23:58:45,130 > Summary name embedding/embeddings:0 is illegal; using embedding/embeddings_0 instead.


INFO:tensorflow:Summary name lstm/kernel:0 is illegal; using lstm/kernel_0 instead.


[INFO:82] 2017-07-05 23:58:45,132 > Summary name lstm/kernel:0 is illegal; using lstm/kernel_0 instead.


INFO:tensorflow:Summary name lstm/recurrent_kernel:0 is illegal; using lstm/recurrent_kernel_0 instead.


[INFO:82] 2017-07-05 23:58:45,134 > Summary name lstm/recurrent_kernel:0 is illegal; using lstm/recurrent_kernel_0 instead.


INFO:tensorflow:Summary name lstm/bias:0 is illegal; using lstm/bias_0 instead.


[INFO:82] 2017-07-05 23:58:45,136 > Summary name lstm/bias:0 is illegal; using lstm/bias_0 instead.


INFO:tensorflow:Summary name fc/kernel:0 is illegal; using fc/kernel_0 instead.


[INFO:82] 2017-07-05 23:58:45,139 > Summary name fc/kernel:0 is illegal; using fc/kernel_0 instead.


INFO:tensorflow:Summary name fc/bias:0 is illegal; using fc/bias_0 instead.


[INFO:82] 2017-07-05 23:58:45,141 > Summary name fc/bias:0 is illegal; using fc/bias_0 instead.


INFO:tensorflow:Summary name prediction/kernel:0 is illegal; using prediction/kernel_0 instead.


[INFO:82] 2017-07-05 23:58:45,144 > Summary name prediction/kernel:0 is illegal; using prediction/kernel_0 instead.


INFO:tensorflow:Summary name prediction/bias:0 is illegal; using prediction/bias_0 instead.


[INFO:82] 2017-07-05 23:58:45,146 > Summary name prediction/bias:0 is illegal; using prediction/bias_0 instead.


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

[INFO:251] 2017-07-06 00:20:39,340 > Test accuracy: 86.39%
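
The decoded review in the log above reads as word salad. That is what a raw `idx2word[index]` lookup produces, because `imdb.load_data` by default reserves 0 for padding, 1 for the start of a review and 2 for out-of-vocabulary words, and shifts every real word index by 3 relative to `imdb.get_word_index()`. The sketch below shows the corrected lookup on a hypothetical three-word vocabulary; `decode`, the toy dictionary and the placeholder tokens are assumptions made for illustration only.

```python
INDEX_FROM = 3  # default offset applied by imdb.load_data

# Hypothetical toy vocabulary: rank 1 is the most frequent word
word2idx = {"the": 1, "movie": 2, "great": 3}
idx2word = {index: word for word, index in word2idx.items()}


def decode(review, idx2word, index_from=INDEX_FROM):
    """Map encoded indices back to words, undoing the index offset."""
    special = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    words = []
    for index in review:
        if index in special:
            words.append(special[index])
        else:
            words.append(idx2word.get(index - index_from, "<oov>"))
    return " ".join(words)


encoded = [0, 1, 4, 5, 2, 6]  # pad, start, "the", "movie", unknown word, "great"
decoded = decode(encoded, idx2word)
print(decoded)  # <pad> <start> the movie <oov> great
```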



