# 5_B Sentiment Analysis with RNN & LSTM
- author: Eu-Bin KIM 
- source: https://www.tensorflow.org/text/tutorials/text_classification_rnn
- date: 5th of September 2021


## Table of contents
1. Building the input pipeline
2. Integer-encoding the text
3. Defining the models (RNN, LSTM)
4. Training the models
5. Comparing the performance of the RNN and the BiRNN




## 1. Building the input pipeline


In [1]:
import numpy as np  # for building tensors
import tensorflow_datasets as tfds  # for loading the data
import tensorflow as tf  # for training the models
import matplotlib.pyplot as plt  # for visualizing the loss

In [2]:
# check whether a GPU is available
# source: https://colab.research.google.com/notebooks/gpu.ipynb#scrollTo=Y04m-jvKRDsJ
%tensorflow_version 2.x
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [3]:
# TensorFlow Datasets (tfds)
# each element is a (review, label) pair
dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [4]:
train_dataset, test_dataset = dataset['train'], dataset['test']
train_dataset.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [5]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0


In [6]:
# What is a sensible buffer_size for shuffle? https://helloyjam.github.io/tensorflow/buffer-size-in-shuffle/
# What is a sensible buffer_size for prefetch? https://stackoverflow.com/questions/56613155/tensorflow-tf-data-autotune
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE) \
                             .batch(BATCH_SIZE) \
                             .prefetch(tf.data.AUTOTUNE)
# There is no need to shuffle at test time.
test_dataset = test_dataset.batch(BATCH_SIZE) \
                           .prefetch(tf.data.AUTOTUNE)
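
To make the `buffer_size` question above more concrete, here is a minimal pure-Python sketch of how a fixed-size shuffle buffer behaves: it fills a buffer, emits a random buffered element, and replaces it with the next incoming one. The function `buffered_shuffle` is a hypothetical helper written for illustration, not part of `tf.data`, and it only approximates what `Dataset.shuffle` does.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill the buffer first
            continue
        idx = rng.randrange(len(buffer))
        yield buffer[idx]              # emit a random buffered element...
        buffer[idx] = item             # ...and replace it with the incoming one
    rng.shuffle(buffer)                # drain what is left once the input ends
    yield from buffer

data = list(range(20))
small = list(buffered_shuffle(data, buffer_size=1))    # no real shuffling
large = list(buffered_shuffle(data, buffer_size=20))   # a full permutation
print(small)
print(large)
```

With a buffer of 1 the original order is preserved, while a buffer at least as large as the dataset gives a uniform shuffle. An element can never be emitted more than `buffer_size - 1` positions earlier than where it started, which is why a small buffer mixes a sorted dataset poorly.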

In [7]:
for example, label in train_dataset.take(1):
  # inspect the first 3 samples in the batch
  # example: (64,) raw review strings
  # label: (64,)
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts:  [b'This movie makes me want to fall in love all over again!I am naming my next daughter "Adelaide". Just so that someone who sings like Ol Blue eyes can swoon her one day, and feel the butterflies I felt hearing it sung, and it wasn\'t even to me! I give it a 9/10'
 b'hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen the show you can sing along as though you are part of the show singing and dancing . dancing and singing. the song ONE is an all time fave musical song too and the strutters at the end with the mirror its so oh you have to watch this one']

labels:  [1 0 1]


## 2. Integer-encoding the text

In [8]:
VOCAB_SIZE = 1000
# An integer-encoding layer can be added inside the model itself
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
# Learn the word-to-integer mapping from the corpus.
# Pass only the text so that the encoding is fit on the review data.
encoder.adapt(train_dataset.map(lambda text, label: text))

In [9]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')

In [10]:
print(len(vocab))

1000


In [11]:
# calling the layer (__call__) integer-encodes a batch of strings
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[ 11,  18, 159, ...,   0,   0,   0],
       [ 50,   1,  49, ...,   0,   0,   0],
       [  1,  16,  32, ...,   0,   0,   0]])

In [12]:
encoder(['[PAD]']).numpy()

array([[1]])
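
The following pure-Python sketch mimics, in a simplified way, what the encoding layer appears to do: lowercase the text, strip punctuation, split on whitespace, and map each token to an index where 0 is reserved for padding and 1 for unknown words. The helpers `standardize`, `build_vocab` and `encode` and the toy corpus are assumptions made for illustration; they are not the `TextVectorization` implementation. Under these assumptions, `'[PAD]'` becomes the ordinary token `pad`, which is not in the vocabulary, so it is encoded as 1 just like in the output above.

```python
import string
from collections import Counter

PAD, UNK = "", "[UNK]"   # index 0 = padding, index 1 = out-of-vocabulary

def standardize(text):
    # lowercase and strip punctuation (an approximation of the layer's default)
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus, max_tokens):
    # keep the most frequent tokens, leaving two slots for PAD and UNK
    counts = Counter(tok for text in corpus for tok in standardize(text).split())
    return [PAD, UNK] + [tok for tok, _ in counts.most_common(max_tokens - 2)]

def encode(text, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

toy_corpus = ["This movie was great!",
              "This movie was terrible.",
              "Great acting, great movie"]
toy_vocab = build_vocab(toy_corpus, max_tokens=6)
print(toy_vocab)
print(encode("This movie was terrible", toy_vocab))  # the rare word maps to 1
print(encode("[PAD]", toy_vocab))                    # 'pad' is unknown, so 1
```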

In [None]:
# The integer-encoding layer also applies padding.
print(encoded_example[0])


```
sents = [
  [a, b, c],
  [a, b, c, d],
  [a, b]
]
```
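When sentences have different lengths like this (sequences with variable length), how many time steps t should the RNN's x_t run for?
Inspect the data and set t to the length of the longest sentence.

In the example above, t = 4.
The other sentences are then padded to match the longest one.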
```
sents = [
  [a, b, c, PAD],
  [a, b, c, d],
  [a, b, PAD, PAD]
]
```

With integer encoding, if PAD maps to 0 for example, the preprocessed data
that is fed to the RNN looks like this.

```
sents = [
  [1, 2, 3, 0],
  [1, 2, 3, 4],
  [1, 2, 0, 0]
]
```
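
The padding step described above can be sketched with NumPy as follows. `pad_sequences` here is a hypothetical helper written for this note (it is not the Keras utility of the same name), and the ids 1 to 4 stand for the tokens a to d, with 0 assumed to be the PAD id.

```python
import numpy as np

def pad_sequences(seqs, pad_id=0):
    """Right-pad integer sequences to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s            # copy the real tokens, leave the rest as PAD
    return out

sents = [[1, 2, 3], [1, 2, 3, 4], [1, 2]]
padded = pad_sequences(sents)
mask = padded != 0                     # True where a real token is present
print(padded)
print(mask.sum(axis=1))                # original length of each sentence
```

The boolean mask is one common way to tell real tokens from padding later on.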



In [14]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Original:  b'This movie makes me want to fall in love all over again!I am naming my next daughter "Adelaide". Just so that someone who sings like Ol Blue eyes can swoon her one day, and feel the butterflies I felt hearing it sung, and it wasn\'t even to me! I give it a 9/10'
Round-trip:  this movie makes me want to fall in love all over [UNK] am [UNK] my next daughter [UNK] just so that someone who [UNK] like [UNK] [UNK] eyes can [UNK] her one day and feel the [UNK] i felt [UNK] it [UNK] and it wasnt even to me i give it a [UNK]

## 3. Defining an LSTM model for sentiment analysis

![image.png](https://github.com/tensorflow/text/blob/master/docs/tutorials/images/bidirectional.png?raw=1)

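1. Integer-encoding layer. (N,) -> (N, L)
2. Embedding layer. (N, L) -> (N, L, 50)
3. LSTM layer. (N, L, 50) -> (N, 16) when only the last hidden state is kept; with `return_sequences=True`, as in the code below, the output is (N, L, 16)
4. Dense layer. (N, 16) -> (N, 1), or (N, L, 16) -> (N, L, 1) when every time step is kept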

In [15]:
VOCAB_SIZE = len(encoder.get_vocabulary())
EMB_SIZE = 50
HIDDEN_SIZE = 16
DENSE_SIZE = 1

# Shapes as the layers are configured here (N = batch size, L = sequence length):
# encoder:   (N,) -> (N, L)
# embedding: (N, L) -> (N, L, 50)
# rnn:       (N, L, 50) -> (N, L, 16), because return_sequences=True keeps every time step
# dense:     (N, L, 16) -> (N, L, 1), a positive-review probability per time step

model_rnn = tf.keras.Sequential([
    encoder,   # integer-encoding layer
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_SIZE),
    tf.keras.layers.SimpleRNN(HIDDEN_SIZE, activation='tanh', return_sequences=True),
    tf.keras.layers.Dense(DENSE_SIZE, activation='sigmoid')
])

model_lstm = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_SIZE),
    # Long Short-Term Memory
    tf.keras.layers.LSTM(HIDDEN_SIZE, return_sequences=True),
    tf.keras.layers.Dense(units=DENSE_SIZE, activation='sigmoid')
])

LR = 0.0001  # learning rate
model_rnn.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                  optimizer=tf.keras.optimizers.Adam(LR),
                  metrics=['accuracy'])
model_lstm.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                    optimizer=tf.keras.optimizers.Adam(LR),
                    metrics=['accuracy'])
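
To see what `return_sequences` changes, here is a small NumPy sketch of a simple RNN forward pass, h_t = tanh(x_t W + h_{t-1} U + b). The function `simple_rnn_forward`, the random weights and the toy sizes are assumptions for illustration only; this is not how Keras implements `SimpleRNN` internally, and masking of padded positions is ignored.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b, return_sequences):
    """x: (N, L, D) -> (N, L, H) if return_sequences else (N, H)."""
    N, L, _ = x.shape
    h = np.zeros((N, U.shape[0]))          # initial hidden state
    states = []
    for t in range(L):
        h = np.tanh(x[:, t, :] @ W + h @ U + b)
        states.append(h)
    # keep every time step, or only the last hidden state
    return np.stack(states, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
N, L, D, H = 4, 7, 50, 16                  # batch, length, embedding, hidden
x = rng.normal(size=(N, L, D))
W = rng.normal(scale=0.1, size=(D, H))
U = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)

all_steps = simple_rnn_forward(x, W, U, b, return_sequences=True)
last_step = simple_rnn_forward(x, W, U, b, return_sequences=False)
print(all_steps.shape, last_step.shape)
```

For one label per review, a common choice is to keep only the last hidden state so that the final Dense layer outputs a single probability per review; whether that suits this notebook is a design decision left to the reader.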


In [16]:
model_rnn.summary()
model_lstm.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 50)          50000     
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, None, 16)          1072      
_________________________________________________________________
dense (Dense)                (None, None, 1)           17        
Total params: 51,089
Trainable params: 51,089
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
______________________________
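
The parameter counts in the first summary can be checked by hand. The sketch below uses the sizes defined earlier (vocabulary 1000, embedding 50, hidden 16, dense 1). The LSTM figure is derived from the usual four-gate formula and is an assumption here, since the second summary is cut off in the output above.

```python
VOCAB_SIZE, EMB_SIZE, HIDDEN_SIZE, DENSE_SIZE = 1000, 50, 16, 1

# Embedding: one EMB_SIZE-dimensional vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMB_SIZE
# SimpleRNN: input weights + recurrent weights + bias
rnn_params = EMB_SIZE * HIDDEN_SIZE + HIDDEN_SIZE * HIDDEN_SIZE + HIDDEN_SIZE
# Dense: weights + bias
dense_params = HIDDEN_SIZE * DENSE_SIZE + DENSE_SIZE
# LSTM: four gates, each shaped like a SimpleRNN cell (not shown in the summary)
lstm_params = 4 * rnn_params

print(embedding_params, rnn_params, dense_params)
print(embedding_params + rnn_params + dense_params)
print(lstm_params)
```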

## 4. Training the models

In [17]:
# training hyperparameters
EPOCHS = 3
VAL_STEPS = 30
STEPS_PER_EPOCH = 100  # train on only 100 batches per epoch
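
A quick back-of-the-envelope check shows how much data these settings actually touch. The sketch assumes the standard 25,000-review IMDB training split and the batch size of 64 used earlier; the variable names are made up for this note.

```python
import math

TRAIN_EXAMPLES = 25000   # assumption: size of the IMDB training split
BATCH_SIZE = 64
EPOCHS = 3
STEPS_PER_EPOCH = 100
VAL_STEPS = 30

batches_per_full_pass = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
examples_per_epoch = STEPS_PER_EPOCH * BATCH_SIZE
total_updates = EPOCHS * STEPS_PER_EPOCH
val_examples = VAL_STEPS * BATCH_SIZE

print(batches_per_full_pass)   # batches needed to see every review once
print(examples_per_epoch)      # reviews seen in one shortened epoch
print(total_updates)           # gradient updates over the whole run
print(val_examples)            # reviews used for each validation pass
```

Under these assumptions the whole run performs fewer gradient updates than there are batches in a single full pass over the training set, so both models are only lightly trained.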

In [18]:
history_rnn = model_rnn.fit(train_dataset, 
                            epochs=EPOCHS,
                            # how many gradient-descent steps per epoch (number of batches the loss is computed on)
                            steps_per_epoch=STEPS_PER_EPOCH,
                            validation_data=test_dataset,
                            validation_steps=VAL_STEPS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [19]:
history_lstm = model_lstm.fit(train_dataset, epochs=EPOCHS,
                              steps_per_epoch = STEPS_PER_EPOCH,
                              validation_data=test_dataset,
                              validation_steps=VAL_STEPS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [20]:
test_loss, test_acc = model_rnn.evaluate(test_dataset)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 0.6181204915046692
Test Accuracy: 0.7018077969551086


In [21]:
test_loss, test_acc = model_lstm.evaluate(test_dataset)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 0.6923380494117737
Test Accuracy: 0.5127862691879272
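
For reference, binary cross-entropy can be computed by hand as in the sketch below. The helper `binary_cross_entropy` and the toy labels are assumptions for illustration. A model that always predicts 0.5 scores ln 2 ≈ 0.6931, and the LSTM test loss of 0.6923 above is close to that value, which may suggest the LSTM had barely moved away from chance after this short training run.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)    # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.3])
print(uninformed, math.log(2))   # both are ln 2
print(confident)                 # lower loss for better predictions
```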


## To-do

Why does the LSTM perform better than the plain RNN? What problem of the RNN does the LSTM, which we study in this class, solve so that simply swapping the model for an LSTM is enough to raise performance? Read [this blog post](https://dgkim5360.tistory.com/entry/understanding-long-short-term-memory-lstm-kr) and try to answer!

---
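Answer:
An RNN suffers from short-term memory loss (as the input sequence gets longer, the vanishing or exploding gradient problem gets worse).
The gradient signal from the 32nd RNN cell fails to reach the 1st RNN cell.
The LSTM mitigates this problem by adding a long-term memory.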

---
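
The vanishing-gradient argument in the answer can be illustrated with a scalar toy RNN, h_t = tanh(w * h_{t-1} + x_t). By the chain rule, the gradient of the last state with respect to the initial state is a product of per-step factors w * (1 - h_t^2), so it shrinks quickly when |w| < 1. The function `gradient_through_time`, the weight 0.5 and the constant inputs are all assumptions chosen for this sketch.

```python
import math

def gradient_through_time(w, inputs):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, history = 0.0, 1.0, []
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)      # chain rule: d h_t / d h_{t-1}
        history.append(abs(grad))
    return history

history = gradient_through_time(w=0.5, inputs=[0.1] * 32)
print(history[0])     # gradient magnitude after 1 step
print(history[-1])    # gradient magnitude after 32 steps, vanishingly small
```

With a weight larger than 1 and small activations the same product can grow instead, which corresponds to the exploding case mentioned in the answer.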