# 5_B Sentiment Analysis with RNN & LSTM
- author: Eu-Bin KIM 
- source: https://www.tensorflow.org/text/tutorials/text_classification_rnn
- date: 5th of September 2021


## Table of Contents
1. Building the input pipeline
2. Integer-encoding the text
3. Defining the models (RNN, LSTM)
4. Training the models
5. Comparing the performance of the RNN and the BiRNN




## 1. Building the Input Pipeline


In [1]:
import numpy as np  # for building tensors
import tensorflow_datasets as tfds  # for loading the data
import tensorflow as tf  # for training the models
import matplotlib.pyplot as plt  # for visualizing the loss

In [2]:
# Check whether a GPU is available
# Source: https://colab.research.google.com/notebooks/gpu.ipynb#scrollTo=Y04m-jvKRDsJ
%tensorflow_version 2.x
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [3]:
dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [4]:
train_dataset, test_dataset = dataset['train'], dataset['test']
train_dataset.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [5]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0


In [6]:
# What is a suitable buffer_size for shuffle? https://helloyjam.github.io/tensorflow/buffer-size-in-shuffle/
# What is a suitable buffer_size for prefetch? https://stackoverflow.com/questions/56613155/tensorflow-tf-data-autotune
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE) \
                             .batch(BATCH_SIZE) \
                             .prefetch(tf.data.AUTOTUNE)
# There is no need to shuffle when testing.
test_dataset = test_dataset.batch(BATCH_SIZE) \
                           .prefetch(tf.data.AUTOTUNE)
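
As a rough illustration of the `buffer_size` question raised in the comments above, the following pure-Python sketch mimics how a fixed-size shuffle buffer emits elements. The function name `buffered_shuffle` and its `seed` argument are hypothetical and not part of `tf.data`; the sketch only assumes the commonly described behaviour (fill a buffer, emit a random slot, refill it with the next element).

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        i = rng.randrange(buffer_size)   # pick a random slot to emit
        yield buffer[i]
        buffer[i] = item                 # refill the slot with the new item
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer

no_shuffle = list(buffered_shuffle(range(10), buffer_size=1))
partial = list(buffered_shuffle(range(100), buffer_size=10))
print(no_shuffle)
print(partial[:10])
```

With `buffer_size=1` the order is unchanged, and with a buffer of 10 the element at output position `k` can only come from the first `k + 10` inputs. By the same reasoning, a `BUFFER_SIZE` of 10000 on the 25,000 training reviews mixes them well but not perfectly uniformly.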

In [7]:
for example, label in train_dataset.take(1):
  # Inspect the first 3 samples in the batch
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

 b'Spoiler alert \xc2\x96 although I think this one was spoiled coming out of the can\xc2\x85 It\'s hard to even imagine that a film with these stars, from this studio, made at this time period, could be so awful, but it is. It is the film\'s biggest flaw by far that it just doesn\'t make any damn sense.<br /><br />Rich widower American aristocrat Penn Gaylord leaves his small daughter "in charge" and goes off to World War I where he is killed. Then we flash forward to present day (1942) and total confusion. The three sisters are in court where they are said to have spent the last twenty years, and some jerk named Barclay is trying to take their home away from them. This is just the beginning of an endless series of unanswered questions that comprises the script, more holes in it than The Warren Report. What happened to the Gaylord fortune? If the will is worth half a billion, why has the family home gone from an opulent palace to the house on The Munsters? Who the devil is this Barcla

## 2. Integer-Encoding the Text

In [8]:
VOCAB_SIZE = 1000
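# Maps each review to a sequence of integer token ids. max_tokens caps the
# vocabulary size, including the padding token '' and the '[UNK]' token.
# (From TF 2.6 onward the same layer is also exposed as tf.keras.layers.TextVectorization.)
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)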
encoder.adapt(train_dataset.map(lambda text, label: text))

In [9]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')
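
To make the vocabulary above less of a black box, here is a minimal pure-Python sketch of frequency-based integer encoding. It is not the `TextVectorization` implementation; `tokenize`, `build_vocab`, and `encode` are hypothetical helpers that only approximate the layer's default behaviour (lowercasing, stripping punctuation, splitting on whitespace, reserving id 0 for padding and id 1 for `[UNK]`).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; ids 0 and 1 are reserved for '' and '[UNK]'."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def encode(text, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in tokenize(text)]  # unknown words map to 1

texts = ["the movie was good", "the movie was bad", "the movie", "the"]
toy_vocab = build_vocab(texts, max_tokens=5)
toy_ids = encode("The movie was terrible!", toy_vocab)
print(toy_vocab)
print(toy_ids)
```

Because only the most frequent words receive their own id, a small `VOCAB_SIZE` such as 1000 turns many words of a review into `[UNK]`.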

In [10]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[ 10, 796,  42, ...,   0,   0,   0],
       [  1,   1, 456, ...,   0,   0,   0],
       [ 56, 340,   1, ...,   0,   0,   0]])

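## Padding
vocvec = [

  [a, b, c],

  [a, b, c, d],

  [a, c]

]

When the sentences have different lengths like this, the number of time steps t of the RNN is set to the length of the longest sentence.
In the example above, t = 4.
Every sentence is brought to length t by filling the empty positions with 0.


vocvec = [

  [a, b, c, 0],

  [a, b, c, d],

  [a, c, 0, 0]

]
This is why the encoded reviews above end in runs of 0: index 0 of the vocabulary is the padding token ''.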
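
The padding rule described above can be sketched in a few lines of plain Python. The helper `pad_sequences` and the integer ids standing in for a, b, c, d are assumptions made for illustration only; in the notebook the encoder layer performs the padding itself.

```python
def pad_sequences(seqs, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length t."""
    t = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (t - len(s)) for s in seqs]

# Hypothetical token ids standing in for a, b, c, d.
a, b, c, d = 2, 3, 4, 5
vocvec = [[a, b, c], [a, b, c, d], [a, c]]
padded = pad_sequences(vocvec)
print(padded)  # every row now has length t = 4
```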

In [16]:
encoder([["[UNK]"]]).numpy()

array([[1]])

In [14]:
encoded_example[0]

array([ 10, 796,  42,  31, 212, 104, 168,   1,  18,  27, 366,  46,  17,
         4,   1,   5,  50,  93,  92,   1,  78, 330, 526,  84, 385, 161,
         3,   4,   1,   1,  27, 733, 411,  70,   4, 164,  93, 598,  27,
         1,  87,   6,  94,   4, 528,  18, 150, 228,   1, 173,  11,  20,
         7, 632, 455,  21,  99, 680,  23, 451,   6, 398,  48,   7,   4,
       169,   5,   1,   1, 166,  21, 132,  17,  57,   1,   1, 893,  30,
       946,  15,   4,   1, 416,   2,   1,  13,   1, 194,  26,  67,   4,
         1,  16, 106,   5,  25, 137,  15,   9,  14, 389, 100, 455,   6,
       360,   3, 283,   1, 575,  14, 334,  88,   5,   2,  62,   1,   2,
        61, 218, 109,  14,   2,  84, 230,  37, 547,  25,   1,   6,   1,
         1,  92,   1,   6,   1,   1,  32, 126,   2, 269, 869, 231,  51,
        82,  72, 269,  34,   1,   1,   8,   2,   1, 684,  17,   4,   1,
       578,  11,  84, 230,  14,  31, 212,   1,   3,  53,   1, 213, 741,
         5, 630,  45,  23,  39,  12, 682,   5, 151,   9, 581,   

In [11]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Round-trip:  i buy or at least watch every [UNK] movie he came out with a [UNK] of good movies then [UNK] into poor stories bad camera work and a [UNK] [UNK] he nearly lost me a few movies ago he [UNK] how to make a decent movie now hes [UNK] again this film is seriously dark on any level you care to name there is a lot of [UNK] [UNK] going on here with no [UNK] [UNK] unless its meant as a [UNK] against the [UNK] br [UNK] may have had a [UNK] for many of his scenes as it was often too dark to tell and someone [UNK] voice was used most of the time [UNK] the only interesting character was the bad guy who killed his [UNK] to [UNK] [UNK] then [UNK] to [UNK] [UNK] all over the place okay since when do we place an [UNK] [UNK] in the [UNK] room with a [UNK] anyway this bad guy was at least [UNK] and very [UNK] theres lots of gore if you like that king of thing it looked to me like the bad guys [UNK] the same [UNK] every time im just [UNK] they didnt [UNK] the blood from their [UNK] [UNK] i [U

## 3. Defining a Bidirectional RNN Model for Sentiment Analysis

![image.png](https://github.com/tensorflow/text/blob/master/docs/tutorials/images/bidirectional.png?raw=1)

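1. Integer-encoding layer. (N,) -> (N, T)
2. Embedding layer. (N, T) -> (N, T, 50)
3. LSTM layer. (N, T, 50) -> (N, 16)
4. Dense layer. (N, 16) -> (N, 1)

Here N is the batch size and T is the padded sequence length. The (N, 16) output of step 3 assumes the recurrent layer returns only its last hidden state.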
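
The flow of shapes through these layers can be traced with a small NumPy sketch. It uses random weights and a plain tanh RNN in place of the Keras layers, so it shows shapes and parameter counts only, not trained behaviour. The batch size `N = 4` and padded length `T = 7` are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 7                                   # hypothetical batch size and padded length
VOCAB_SIZE, EMB_SIZE, HIDDEN_SIZE = 1000, 50, 16

token_ids = rng.integers(0, VOCAB_SIZE, size=(N, T))   # after encoding: (N, T)

embedding = rng.normal(size=(VOCAB_SIZE, EMB_SIZE))
x = embedding[token_ids]                               # lookup: (N, T, 50)

# Plain tanh RNN standing in for the recurrent layer; keep only the last state.
W_x = rng.normal(size=(EMB_SIZE, HIDDEN_SIZE)) * 0.1
W_h = rng.normal(size=(HIDDEN_SIZE, HIDDEN_SIZE)) * 0.1
b_h = np.zeros(HIDDEN_SIZE)
h = np.zeros((N, HIDDEN_SIZE))
for t in range(T):
    h = np.tanh(x[:, t] @ W_x + h @ W_h + b_h)         # (N, 16)

# Dense layer with a sigmoid: one probability per review.
W_o = rng.normal(size=(HIDDEN_SIZE, 1)) * 0.1
b_o = np.zeros(1)
prob = 1.0 / (1.0 + np.exp(-(h @ W_o + b_o)))          # (N, 1)

rnn_params = W_x.size + W_h.size + b_h.size
print(token_ids.shape, x.shape, h.shape, prob.shape, rnn_params)
```

The recurrent weights in this sketch add up to 50·16 + 16·16 + 16 = 1072 parameters, the same count that `model_rnn.summary()` reports for the `SimpleRNN` layer.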

In [12]:
VOCAB_SIZE = len(encoder.get_vocabulary())
EMB_SIZE = 50
HIDDEN_SIZE = 16
DENSE_SIZE = 1

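model_rnn = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_SIZE),
    # Note: return_sequences=True keeps the hidden state of every time step, so
    # the output is (N, T, 16) and the Dense layer predicts once per token.
    # return_sequences=False would give a single (N, 16) state per review.
    tf.keras.layers.SimpleRNN(HIDDEN_SIZE, activation='tanh', return_sequences=True),
    tf.keras.layers.Dense(DENSE_SIZE, activation='sigmoid')
])

model_lstm = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_SIZE),
    # Long Short-Term Memory (the same return_sequences note applies here)
    tf.keras.layers.LSTM(HIDDEN_SIZE, return_sequences=True),
    tf.keras.layers.Dense(units=DENSE_SIZE, activation='sigmoid')
])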

LR = 0.0001  # learning rate
model_rnn.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                  optimizer=tf.keras.optimizers.Adam(LR),
                  metrics=['accuracy'])
model_lstm.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                    optimizer=tf.keras.optimizers.Adam(LR),
                    metrics=['accuracy'])


In [17]:
model_rnn.summary()
model_lstm.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 50)          50000     
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, None, 16)          1072      
_________________________________________________________________
dense (Dense)                (None, None, 1)           17        
Total params: 51,089
Trainable params: 51,089
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
______________________________

## 4. Training the Models

In [18]:
# Training hyperparameters
EPOCHS = 3
VAL_STEPS = 30
STEPS_PER_EPOCH = 100

In [19]:
history_rnn = model_rnn.fit(train_dataset, 
                            epochs=EPOCHS,
                            # number of batches (gradient-descent updates) to run per epoch
                            steps_per_epoch=STEPS_PER_EPOCH,
                            validation_data=test_dataset,
                            validation_steps=VAL_STEPS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [20]:
history_lstm = model_lstm.fit(train_dataset, epochs=EPOCHS,
                              steps_per_epoch = STEPS_PER_EPOCH,
                              validation_data=test_dataset,
                              validation_steps=VAL_STEPS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [21]:
test_loss, test_acc = model_rnn.evaluate(test_dataset)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 0.6926727294921875
Test Accuracy: 0.5066515803337097


In [22]:
test_loss, test_acc = model_lstm.evaluate(test_dataset)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 0.6920323371887207
Test Accuracy: 0.5179228782653809
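
One way to read the two test losses is to compare them with the loss of a model that outputs 0.5 for every review. The sketch below computes binary cross-entropy by hand; `binary_crossentropy` is a hypothetical helper written for this comparison, not the Keras loss class, and the toy labels are made up.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for probabilities (the from_logits=False case)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [0, 1, 0, 1]                      # toy labels
chance = binary_crossentropy(labels, [0.5] * len(labels))   # equals ln 2

reported = {"rnn": 0.6926727294921875, "lstm": 0.6920323371887207}
gaps = {name: chance - loss for name, loss in reported.items()}
print(round(chance, 4), gaps)
```

Both reported losses sit within about 0.001 of ln 2 ≈ 0.6931, and the accuracies are close to 0.5, so after these 300 updates neither model is far from chance level. The LSTM is ahead only by a small margin.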


## To-do

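Why does the LSTM perform better than the plain RNN? What problem of the RNN does the LSTM, which we cover in this class, solve, so that simply swapping the model for an LSTM improves performance? Read [this blog post](https://dgkim5360.tistory.com/entry/understanding-long-short-term-memory-lstm-kr) and try to answer!

---
Answer: An RNN suffers from short-term memory.
As the input sequence gets longer, the vanishing (or exploding) gradient problem gets worse.

For example, the gradient signal from the 32nd RNN cell fails to reach the 1st RNN cell.
The LSTM mitigates this by adding a long-term memory (the cell state).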


---
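
The vanishing-gradient argument in the answer can be checked numerically on a toy model. The sketch below assumes a scalar RNN `h_t = tanh(w * h_{t-1} + x_t)` with a hand-picked weight and constant inputs. This is a deliberate simplification, not the Keras `SimpleRNN`, and `gradient_through_time` is a hypothetical helper.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """d h_T / d h_1 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w * h + x)
        states.append(h)
    grad = 1.0
    # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2), multiplied over steps 2..T.
    for h_t in states[1:]:
        grad *= w * (1.0 - h_t ** 2)
    return grad

short = gradient_through_time(0.5, [0.5] * 2)    # signal travels 1 step back
long_ = gradient_through_time(0.5, [0.5] * 32)   # signal travels 31 steps back
print(short, long_)
```

Each step multiplies the gradient by a factor of at most `w`, so over 31 steps with `w = 0.5` it cannot exceed 0.5^31, which is below 1e-9. That is the sense in which the signal from cell 32 does not reach cell 1.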