# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

### 4-3. Text classification with an RNN

Let's build a text classification model with an RNN on the IMDB dataset for sentiment analysis.

In [1]:
try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow_datasets as tfds
import tensorflow as tf

#### Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.

Let's download the dataset using [`TensorFlow Datasets`](https://www.tensorflow.org/datasets).


#### What is a tokenizer?

For a computer to process text, we first need to break the text down into units that a model can handle. Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. In this practice, we will use a subword tokenizer. Before using it, let's briefly compare the three approaches.

* Word Tokenization
Word tokenization splits a piece of text into individual words based on a delimiter such as a space. For example, the sentence 'I like playing the guitar' will be tokenized into 'I', 'like', 'playing', 'the', 'guitar'. Although this method is very simple, it cannot handle out-of-vocabulary (OOV) words well.

* Character Tokenization
Character tokenization splits a piece of text into a sequence of characters. OOV words do not occur because the vocabulary is a small, fixed character set (for example, the 26 letters of the lowercase English alphabet). However, the input and output sequences become much longer because each sentence is represented character by character. It also becomes harder for the model to learn how characters combine into meaningful words.

* Subword Tokenization
Subword tokenization splits a piece of text into subwords. For example, lower can be segmented as low-er, smartest as smart-est, and unfriendly as un-friend-ly. A word that is in the tokenizer's vocabulary is kept whole; otherwise it is split into smaller units until every piece is found in the vocabulary.
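
The three strategies above can be compared with a small pure-Python sketch. The subword function below is a simplified greedy longest-match splitter over a hypothetical toy vocabulary (`toy_vocab`), written only for illustration; it is not the algorithm used by the dataset's built-in tokenizer.

```python
def word_tokenize(text):
    """Split on whitespace: simple, but unseen words have no token."""
    return text.split()


def char_tokenize(text):
    """One token per character: no OOV problem, but long sequences."""
    return list(text)


def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word into known subwords."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Fall back to a single character so every word can be encoded.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {"low", "er", "smart", "est", "un", "friend", "ly"}

sentence = "I like playing the guitar"
print(word_tokenize(sentence))
print(char_tokenize("guitar"))
for word in ["lower", "smartest", "unfriendly"]:
    print(word, "->", subword_tokenize(word, toy_vocab))
```

Because of the single-character fallback, a word that is missing from the toy vocabulary is still encoded, just as a longer sequence of smaller pieces.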

In [2]:
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']



Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...





Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.


Since the dataset comes with a built-in subword tokenizer, we can use the tokenizer to encode any string into tokens.

In [3]:
tokenizer = info.features['text'].encoder
print(f'Vocabulary size: {tokenizer.vocab_size}')

Vocabulary size: 8185


In [5]:
sample_string = "Yesterday was my grandmother's birthday."

# Encode the sample string to a list of integer token ids
tokenized_string = tokenizer.encode(sample_string)
print(f'Tokenized string is {tokenized_string}')

# Decode the token ids back to the original string
original_string = tokenizer.decode(tokenized_string)
print(f'The original string: {original_string}')

assert original_string == sample_string

Tokenized string is [1071, 487, 414, 18, 82, 1481, 1300, 7968, 8, 3534, 606, 7975]
The original string: Yesterday was my grandmother's birthday.


If a word is not in the tokenizer's vocabulary, the tokenizer encodes it by breaking it into subwords. Decoding the token ids one at a time shows the pieces.

In [6]:
for ts in tokenized_string:
    print(f'{ts} ----> {tokenizer.decode([ts])}')

1071 ----> Yes
487 ----> ter
414 ----> day 
18 ----> was 
82 ----> my 
1481 ----> grand
1300 ----> mother
7968 ----> '
8 ----> s 
3534 ----> birth
606 ----> day
7975 ----> .


Now, let's combine consecutive elements of this dataset into padded batches using [`tf.data.Dataset.padded_batch()`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#padded_batch).

In [10]:
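BUFFER_SIZE = 10000
BATCH_SIZE = 64

# Run this cell only once: padding an already batched dataset with these
# shapes fails with a ValueError because the element ranks no longer match.
# Reviews are padded to the longest review in each batch; labels are scalars.
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE, padded_shapes=((None,), ()))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None,), ()))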
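
To make the effect of padded batching concrete, here is a minimal pure-Python sketch of the idea. The helper `padded_batches` is hypothetical and only mimics the behavior on plain lists, assuming 0 as the padding value (the default for numeric types in `padded_batch`); it is not how `tf.data` is implemented.

```python
def padded_batches(sequences, batch_size, pad_id=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]
        max_len = max(len(seq) for seq in batch)
        # Append pad_id until every sequence in the batch has the same length.
        yield [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


# Token ids of varying length, standing in for encoded reviews.
encoded_reviews = [[1071, 487, 414], [18, 82], [1481, 1300, 7968, 8], [3534]]

for batch in padded_batches(encoded_reviews, batch_size=2):
    print(batch)
```

Each batch is padded only to its own longest sequence, so different batches can have different lengths. That is what the `None` in `padded_shapes` expresses.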

#### Build and train the model
Let's build a recurrent neural network using `tf.keras.Sequential`. Here, we will use `tf.keras.layers.LSTM` as the recurrent layer for the model.

In [None]:
model = tf.keras.Sequential([

])

Compile the model to configure the training process.

In [None]:
model.compile(

)
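
The build and compile cells above are left blank as an exercise. The sketch below shows one possible way to fill them in, under stated assumptions: the layer sizes, the learning rate, and the name `model_sketch` are illustrative choices, not values prescribed by this lab. The final layer outputs a single logit, so the loss is configured with `from_logits=True`.

```python
import tensorflow as tf

VOCAB_SIZE = 8185  # vocabulary size printed by the tokenizer earlier

# One possible architecture; all layer sizes are assumptions.
model_sketch = tf.keras.Sequential([
    # Map each token id to a trainable 64-dimensional vector.
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    # Summarize the whole sequence in the final LSTM output.
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    # Single logit for the binary sentiment label.
    tf.keras.layers.Dense(1),
])

# from_logits=True matches the last layer, which has no sigmoid.
model_sketch.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(1e-4),
    metrics=['accuracy'],
)

# Forward pass on a dummy padded batch of token ids.
dummy_batch = tf.constant([[1071, 487, 414, 18], [82, 1481, 0, 0]])
logits = model_sketch(dummy_batch)
print(logits.shape)  # one logit per review
```

With the notebook's datasets, training would then be along the lines of `model_sketch.fit(train_dataset, epochs=10, validation_data=test_dataset)`, where the epoch count is an arbitrary choice.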

Then, train the model using `train_dataset` with validation data as `test_dataset`.

In [None]:
history = model.fit(

)

![Text classification loss](https://github.com/keai-kaist/CS470-Spring-2022-/blob/main/Lab3/images/text-classification-loss.PNG?raw=1)

Let's evaluate the trained model.

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

In [None]:
text = 'The movie was cool. The animation and the graphics were out of this world. I would recommend this movie.'

predictions = model.predict([

])
print(predictions)

#### Bidirectional LSTM layer
When you wrap a recurrent layer with `tf.keras.layers.Bidirectional`, the input sequence is processed both forward and backward by two copies of the layer, and their outputs are combined. This lets the model use context from both before and after each position, which helps it learn long-range dependencies.

![Bidirectional](https://github.com/keai-kaist/CS470-Spring-2022-/blob/main/Lab3/images/bidirectional.jpg?raw=1)

#### Why can a bidirectional LSTM be effective?

Let's say that we want our model to find a word to fill in the blank in the following sentence.

* I like to eat [________] because today is too hot.

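To predict the blank accurately in this sentence, the words after the blank ("because today is too hot") matter more than the words before it. A unidirectional LSTM reading left to right has not seen those words yet when it reaches the blank, whereas a bidirectional LSTM also reads the sentence from right to left.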

In [None]:
model_bidirectional = tf.keras.Sequential([

])
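
The cell above is left blank as an exercise. As a hedged sketch of one possible completion, the code below wraps an LSTM in `tf.keras.layers.Bidirectional` and checks the output width. With the default merge mode, the forward and backward outputs are concatenated, so 64 units per direction give 128 features. The layer sizes and the name `bidirectional_sketch` are assumptions.

```python
import tensorflow as tf

VOCAB_SIZE = 8185  # vocabulary size printed by the tokenizer earlier

embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 64)
# One LSTM reads left to right, a second reads right to left;
# their final outputs are concatenated (64 + 64 = 128 features).
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))

dummy_batch = tf.constant([[1071, 487, 414, 18], [82, 1481, 0, 0]])
features = bi_lstm(embedding(dummy_batch))
print(features.shape)  # (batch_size, 128)

# The same layers inside a full classifier; sizes are assumptions.
bidirectional_sketch = tf.keras.Sequential([
    embedding,
    bi_lstm,
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),
])
bi_logits = bidirectional_sketch(dummy_batch)
print(bi_logits.shape)
```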

In [None]:
model_bidirectional.compile(

)

In [None]:
history_bidirectional = model_bidirectional.fit(

)

In [None]:
test_loss, test_acc = model_bidirectional.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

#### Stack two or more LSTM layers
Keras recurrent layers have two available modes that are controlled by the `return_sequences` constructor argument:
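- `return_sequences=True`: return the full sequence of outputs, one per timestep, with shape `(batch_size, timesteps, output_features)`
- `return_sequences=False` (the default): return only the last output for each input sequence, with shape `(batch_size, output_features)`

To stack two or more LSTM layers, set `return_sequences=True` on every LSTM layer that feeds another LSTM layer, so that the next layer receives a full sequence as input.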

In [None]:
model_stacked = tf.keras.Sequential([

])
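
The cell above is left blank as an exercise. The sketch below first shows the output shape of each `return_sequences` mode on random dummy data, then one possible stacked model. The unit counts and the name `stacked_sketch` are illustrative assumptions.

```python
import tensorflow as tf

# Dummy input with shape (batch_size, timesteps, features).
dummy_embeddings = tf.random.normal((2, 5, 8))

# return_sequences=True keeps one output per timestep.
all_steps = tf.keras.layers.LSTM(16, return_sequences=True)(dummy_embeddings)
# The default (return_sequences=False) keeps only the last output.
last_step = tf.keras.layers.LSTM(16)(dummy_embeddings)
print(all_steps.shape)  # (batch_size, timesteps, 16)
print(last_step.shape)  # (batch_size, 16)

# Stacking: the first LSTM must return full sequences because the
# second LSTM expects a 3D input of shape (batch, timesteps, features).
stacked_sketch = tf.keras.Sequential([
    tf.keras.layers.Embedding(8185, 64),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
stacked_logits = stacked_sketch(tf.constant([[1071, 487, 414], [18, 82, 0]]))
print(stacked_logits.shape)
```

Leaving out `return_sequences=True` on the first LSTM would hand the second LSTM a 2D tensor, which it cannot process as a sequence.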

In [None]:
model_stacked.compile(

)

In [None]:
history_stacked = model_stacked.fit(

)

In [None]:
test_loss, test_acc = model_stacked.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

In [None]:
text = 'The movie was cool. The animation and the graphics were out of this world. I would recommend this movie.'

predictions = model_stacked.predict([
    
])
print(predictions)