# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

### 4-3. Text classification with an RNN

Let's build a text classification model with an RNN on the IMDB dataset for sentiment analysis.

In [1]:
try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow_datasets as tfds
import tensorflow as tf

#### Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review has either a positive or a negative sentiment.

Let's download the dataset using [`TensorFlow Datasets`](https://www.tensorflow.org/datasets).

In [2]:
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']



Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.


Since the dataset comes with a built-in subword tokenizer, we can use it to encode any string into a list of integer tokens.

In [8]:
tokenizer = info.features['text'].encoder
print(f'Vocabulary size: {tokenizer.vocab_size}')

Vocabulary size: 8185


In [9]:
sample_string = 'TensorFlow is cool.'

# Encode the sample string into a list of integer token ids
tokenized_string = tokenizer.encode(sample_string)
print(f'Tokenized string is {tokenized_string}')

# Decode the integer token ids back into the string
original_string = tokenizer.decode(tokenized_string)
print(f'The original string: {original_string}')

assert original_string == sample_string

Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
The original string: TensorFlow is cool.


If a word is not in its dictionary, the tokenizer encodes the word by breaking it into subwords.

In [10]:
for ts in tokenized_string:
    print(f'{ts} ----> {tokenizer.decode([ts])}')

6307 ----> Ten
2327 ----> sor
4043 ----> Fl
4265 ----> ow 
9 ----> is 
2724 ----> cool
7975 ----> .
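
The fallback to subwords can be illustrated with a toy greedy longest-match encoder. This is only a minimal sketch and is not claimed to be the algorithm the dataset's tokenizer uses; the vocabulary `TOY_VOCAB` and the functions `encode_greedy` and `decode` are hypothetical names invented for this example, and the ids are made up.

```python
# Toy vocabulary (hypothetical): maps subword pieces to made-up integer ids.
TOY_VOCAB = {"Ten": 1, "sor": 2, "Fl": 3, "ow ": 4, "is ": 5, "cool": 6, ".": 7}

def encode_greedy(text, vocab):
    """Encode text by repeatedly taking the longest vocabulary piece that matches."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no subword covers position {pos}: {text[pos:]!r}")
    return ids

def decode(ids, vocab):
    """Map ids back to their pieces and join them into a string."""
    inverse = {i: piece for piece, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

ids = encode_greedy("TensorFlow is cool.", TOY_VOCAB)
print(ids)
print(decode(ids, TOY_VOCAB))
```

Because `TensorFlow` is not a single entry in the toy vocabulary, it is covered by the pieces `Ten`, `sor`, `Fl`, and `ow `, mirroring the split shown in the output above.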


Now, let's combine consecutive elements of this dataset into padded batches using [`tf.data.Dataset.padded_batch()`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#padded_batch).

In [12]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)
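
Reviews have different lengths, so each batch has to be padded to a common length before it can be stacked into one tensor. The sketch below shows the idea in plain Python, assuming padding with zeros at the end of each sequence (the default behaviour of `padded_batch` for integer tensors); `pad_batch` is a hypothetical helper, not part of `tf.data`.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three toy "reviews" of different lengths, already encoded as token ids.
batch = pad_batch([[6307, 2327], [9], [2724, 7975, 4043]])
for row in batch:
    print(row)
```

Each batch is padded only to the length of its own longest review, so different batches can have different widths.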

#### Build and train the model
Let's build a recurrent neural network using `tf.keras.Sequential`. Here, we will use `tf.keras.layers.LSTM` as the recurrent layer for the model.

In [15]:
model = tf.keras.Sequential([
  tf.keras.layers.Embedding(tokenizer.vocab_size, 64),  # token id -> 64-dimensional vector
  tf.keras.layers.LSTM(64),  # 64 units (output size); the input is the 64-dimensional embedding
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])
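
To get a feel for the size of this model, the number of trainable parameters per layer can be worked out by hand. The arithmetic below is a sketch that assumes the vocabulary size of 8185 printed earlier and a standard LSTM cell with four gates and bias terms; the variable names are illustrative.

```python
vocab_size = 8185   # reported by tokenizer.vocab_size above
embed_dim = 64
lstm_units = 64

# Embedding: one embed_dim-sized vector per token id.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense layers: a weight matrix plus one bias per output unit.
dense_hidden_params = lstm_units * 64 + 64
dense_output_params = 64 * 1 + 1

total = embedding_params + lstm_params + dense_hidden_params + dense_output_params
print(embedding_params, lstm_params, dense_hidden_params, dense_output_params, total)
```

Most of the parameters sit in the embedding table, which is typical for text models with a large vocabulary and a small recurrent layer.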

Compile the model to configure the training process.

In [16]:
model.compile(
  loss = 'binary_crossentropy',
  optimizer='adam',
  metrics=['accuracy']
)

Then, train the model on `train_dataset`, using `test_dataset` as the validation data.

In [17]:
history = model.fit(
  train_dataset, 
  epochs=10,
  validation_data = test_dataset
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


![Text classification loss](https://github.com/keai-kaist/CS470/blob/main/Lab3/May%2011/images/text-classification-loss.PNG?raw=true)

Let's evaluate the trained model.

In [18]:
test_loss, test_acc = model.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

Test Loss: 0.6946749091148376
Test Accuracy: 0.5012400150299072
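
A test accuracy of about 0.50 is chance level for a balanced binary task. One plausible contributor, stated here as an assumption rather than something verified on this run, is that `padded_batch` right-pads shorter reviews with zeros and the unidirectional LSTM keeps updating its state on those padding steps before its final output is read. The toy recurrence below (hypothetical function `run_rnn`, hand-picked weights) sketches the effect: without masking, the final state after a long run of padding barely depends on the real tokens.

```python
import math

def run_rnn(inputs, mask_padding, w_rec=0.5, w_in=1.0, pad_value=0.0):
    """Run h_t = tanh(w_rec * h_{t-1} + w_in * x_t) and return the final state."""
    h = 0.0
    for x in inputs:
        if mask_padding and x == pad_value:
            continue  # masked step: carry the previous state forward unchanged
        h = math.tanh(w_rec * h + w_in * x)
    return h

padding = [0.0] * 50
positive = [0.9, 0.8, 0.7] + padding     # toy "positive" signal, then padding
negative = [-0.9, -0.8, -0.7] + padding  # toy "negative" signal, then padding

# How far apart are the final states of the two inputs?
gap_unmasked = abs(run_rnn(positive, False) - run_rnn(negative, False))
gap_masked = abs(run_rnn(positive, True) - run_rnn(negative, True))
print(f"unmasked gap: {gap_unmasked:.2e}")
print(f"masked gap:   {gap_masked:.2e}")
```

A real LSTM has gates that can hold its state, so the effect is less extreme than in this toy. In Keras, passing `mask_zero=True` to the `Embedding` layer is the usual way to make downstream recurrent layers skip padding, assuming id 0 is reserved for padding in the vocabulary.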


In [20]:
text = 'The movie was cool. The animation and the graphics were out of this world. I would recommend this movie.'

predictions = model.predict([
  tokenizer.encode(text)
])
print(predictions)

[[0.45037454]]


#### Bidirectional LSTM layer
Wrapping a recurrent layer in `tf.keras.layers.Bidirectional` runs the input through two copies of the layer, one reading the sequence forward and one reading it backward, and by default concatenates their outputs. This gives the model context from both directions and helps the RNN learn long-range dependencies.

![Bidirectional](https://github.com/keai-kaist/CS470/blob/main/Lab3/May%2011/images/bidirectional.jpg?raw=true)
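
The idea in the figure can be sketched with a toy single-unit recurrence: one pass reads the sequence left to right, a second pass with its own weights reads it right to left, and the two final states are concatenated. The functions `rnn_last_state` and `bidirectional_last_state` and all weight values are hypothetical, chosen only for illustration.

```python
import math

def rnn_last_state(inputs, w_rec, w_in):
    """Final state of the toy recurrence h_t = tanh(w_rec * h_{t-1} + w_in * x_t)."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_rec * h + w_in * x)
    return h

def bidirectional_last_state(inputs):
    """Concatenate the final states of a forward pass and a backward pass."""
    # The two directions have separate (here hand-picked) weights.
    forward = rnn_last_state(inputs, w_rec=0.5, w_in=1.0)
    backward = rnn_last_state(list(reversed(inputs)), w_rec=0.3, w_in=0.8)
    return [forward, backward]

sequence = [0.9, 0.1, -0.4, 0.0, 0.0]  # trailing zeros play the role of padding
features = bidirectional_last_state(sequence)
print(features)
```

The output has twice as many features as a single direction. With `LSTM(64)` inside `Bidirectional`, the next `Dense` layer therefore receives 128 features per review. Note also that the backward pass reads the trailing padding first and the real tokens last.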

In [23]:
model_bidirectional = tf.keras.Sequential([
  tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)), 
  tf.keras.layers.Dense(64, activation = 'relu'),
  tf.keras.layers.Dense(1, activation = 'sigmoid')
])

In [24]:
model_bidirectional.compile(
  loss='binary_crossentropy', 
  optimizer='adam',
  metrics=['accuracy'],
)

In [25]:
history_bidirectional = model_bidirectional.fit(
  train_dataset,
  epochs=10,
  validation_data = test_dataset
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [26]:
test_loss, test_acc = model_bidirectional.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

Test Loss: 0.4711703062057495
Test Accuracy: 0.8389999866485596


#### Stack two or more LSTM layers
Keras recurrent layers have two output modes, controlled by the `return_sequences` constructor argument:
- `return_sequences=True`: return the full sequence of outputs, one per timestep, with shape `(batch_size, timesteps, output_features)`
- `return_sequences=False` (the default): return only the last output for each input sequence, with shape `(batch_size, output_features)`

To stack two or more LSTM layers, every LSTM layer except the last must set `return_sequences=True`, so that the next layer receives a sequence rather than a single vector.
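
The two modes can be sketched with a toy single-unit RNN over one sequence. The function `toy_rnn` is hypothetical; it only borrows the `return_sequences` argument name from Keras to show why the first layer of a stack has to keep every timestep.

```python
import math

def toy_rnn(inputs, return_sequences=False, w_rec=0.5, w_in=1.0):
    """Toy single-unit RNN over one sequence of floats."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(w_rec * h + w_in * x)
        outputs.append(h)
    # Either every per-timestep output, or only the last one.
    return outputs if return_sequences else outputs[-1]

sequence = [0.9, 0.1, -0.4]

# First layer keeps one output per timestep so the second layer has a sequence to read.
hidden_sequence = toy_rnn(sequence, return_sequences=True)
# Last layer returns only its final output, which would feed the Dense layers.
final_output = toy_rnn(hidden_sequence, return_sequences=False)

print(len(hidden_sequence), final_output)
```

If the first layer returned only its last output, the second layer would receive a single value instead of a sequence and would have nothing to iterate over.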

In [29]:
model_stacked = tf.keras.Sequential([
  tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
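  # First recurrent layer returns the full sequence so the next one can read it
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
  tf.keras.layers.Dense(32, activation='relu'),
  # Sigmoid keeps the output in (0, 1), as binary_crossentropy expects
  tf.keras.layers.Dense(1, activation='sigmoid')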
])

In [30]:
model_stacked.compile(
  loss = 'binary_crossentropy', 
  optimizer ='adam',
  metrics=['accuracy']
)

In [31]:
history_stacked = model_stacked.fit(
  train_dataset, 
  epochs=10,
  validation_data=test_dataset
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [32]:
test_loss, test_acc = model_stacked.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

Test Loss: 7.712468147277832
Test Accuracy: 0.5


In [33]:
text = 'The movie was cool. The animation and the graphics were out of this world. I would recommend this movie.'

predictions = model_stacked.predict([
    tokenizer.encode(text)
])
print(predictions)

[[0.]]
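
The recorded loss of about 7.71, accuracy of exactly 0.5, and prediction of 0 come from a run whose final layer used `activation='relu'`. A `relu` output is not confined to (0, 1) and can collapse to 0 for every input, which `binary_crossentropy` punishes heavily on every positive review. The sketch below reproduces the arithmetic, assuming the TF 2.x Keras backend clips predictions to [ε, 1 − ε] with ε = 1e-7 and adds ε inside the logarithm; `keras_style_bce` is a hypothetical re-implementation, not a library function.

```python
import math

EPSILON = 1e-7  # assumed to match the Keras backend default epsilon

def keras_style_bce(label, prediction, eps=EPSILON):
    """Binary cross-entropy with the clip-then-add-epsilon scheme assumed above."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))

# Balanced labels, and a model that outputs 0 for every review.
labels = [1, 0] * 4
predictions = [0.0] * len(labels)

losses = [keras_style_bce(y, p) for y, p in zip(labels, predictions)]
mean_loss = sum(losses) / len(losses)
# A prediction counts as "positive" only if it exceeds 0.5.
accuracy = sum((p > 0.5) == bool(y) for y, p in zip(labels, predictions)) / len(labels)

print(f"mean loss: {mean_loss:.4f}, accuracy: {accuracy}")
```

Under these assumptions the mean loss is about 7.71 and the accuracy is 0.5, matching the recorded output. Using `activation='sigmoid'` on the final `Dense(1)` layer, as in the two earlier models, keeps the output a valid probability.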
