# CS492 Special Topics in Computer Science: AI Industry and Smart Energy
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

### 2-2 Text classification using IMDB dataset

#### IMDB dataset

The IMDB dataset is used to classify movie reviews as positive or negative based on the text of the review. It contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/), split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf
print('TensorFlow:', tf.__version__)

import numpy as np
import matplotlib.pyplot as plt

To handle the text dataset, we first need to transform the data into a form that can be used for training. Two common methods are one-hot encoding and word embedding. In this practice, we'll use one-hot encoding (word embedding will be covered later in the RNN part). Another reason for using one-hot encoding is that it lets us examine the overfitting problem in the next practice.


<img src="images/one_vs_we.png" width=1000>

Let's first download the IMDB dataset:

In [None]:
NUM_WORDS = 10000
(train_data, train_labels), (test_data, test_labels) = tf.keras.datasets.imdb.load_data(num_words=NUM_WORDS)

The argument `num_words=10000` keeps the 10,000 most frequently occurring words in the training data. Rarer words are dropped from the vocabulary (Keras replaces them with a single out-of-vocabulary index) to keep the size of the data manageable.
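
How the vocabulary cap acts on a single review can be sketched in plain Python. The helper name `cap_vocabulary` and the toy review are hypothetical. The behavior shown, replacing out-of-range indices with a shared out-of-vocabulary index (2 by default, via the `oov_char` argument) instead of deleting them, is meant to mirror the documented behavior of `imdb.load_data`. It is an illustration, not the library's code.

```python
OOV_INDEX = 2  # assumption: Keras' default `oov_char` for the IMDB loader


def cap_vocabulary(review, num_words, oov_index=OOV_INDEX):
    """Hypothetical helper: keep word indices below num_words, map the rest to oov_index."""
    return [w if w < num_words else oov_index for w in review]


# Toy review; a smaller index means a more frequent word.
toy_review = [1, 14, 22, 9998, 10000, 25431, 16]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)
```

The review keeps its length, and every remaining index is below 10,000, which is what lets each review fit into a 10,000-dimensional vector later on.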

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the movie review. Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

In [None]:
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
print("Train data[0]: {}. \nlabel[0]: {}".format(train_data[0], train_labels[0]))

From each review, we'll create a multi-hot encoded vector (the one-hot vectors of all its words combined) with 10,000 dimensions.
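
To make the encoding concrete, here is a toy sketch on a hypothetical 8-word vocabulary. The name `toy_multi_hot` and the two toy reviews are invented for illustration and are not part of the exercise.

```python
import numpy as np


def toy_multi_hot(sequences, dimension):
    """Hypothetical helper: one row per sequence, 1.0 where a word index occurs."""
    encoded = np.zeros((len(sequences), dimension))
    for row, word_indices in enumerate(sequences):
        encoded[row, word_indices] = 1.0  # repeated indices still yield a single 1.0
    return encoded


# Two toy "reviews" over a hypothetical 8-word vocabulary.
toy_reviews = [[3, 5, 3], [0, 7]]
toy_encoded = toy_multi_hot(toy_reviews, dimension=8)
print(toy_encoded)
```

Note that word 3 occurs twice in the first toy review but is marked only once: this encoding records which words appear, not how often or in which order.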

In [None]:
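def multi_hot_sequences(sequences, dimension):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        # Set the columns of the words that occur in review i to 1
        results[i, word_indices] = 1.0
    return results


train_data = multi_hot_sequences(train_data, dimension=NUM_WORDS)
test_data = multi_hot_sequences(test_data, dimension=NUM_WORDS)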

In [None]:
print("Train data[0]: {}. \nlabel[0]: {}".format(train_data[0], train_labels[0]))

In [None]:
plt.plot(train_data[0])

#### Build model

Similar to the previous model, we'll define a simple model using the [`tf.keras.layers.Dense`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Dense) layer.

In [None]:
# TODO: define a sequential model
model = tf.keras.Sequential([
    # TODO: add a first dense layer with 16 nodes applying ReLU activation function
    
    # TODO: add a second dense layer with 16 nodes applying ReLU activation function
    
    # TODO: add a last dense layer with only 1 node applying the sigmoid activation function
    
])
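
A note on the output layer: with a single output node, a sigmoid maps the logit to a probability between 0 and 1, whereas a softmax normalizes over the outputs of the layer and therefore always returns 1 when there is only one. The plain-Python sketch below illustrates this. The function names are ours and the logits are arbitrary.

```python
import math


def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Normalize a list of logits so the outputs sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


for logit in (-2.0, 0.0, 3.0):
    print(logit, sigmoid(logit), softmax([logit]))
```

Because the single-node softmax output never changes with the logit, it carries no information about the review. For a one-node binary classifier trained with `binary_crossentropy`, sigmoid is the activation that pairs with the loss.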

In [None]:
model.summary()
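
The parameter counts in the summary can be checked by hand. The sketch below assumes the model is built on 10,000-dimensional inputs with the 16, 16, and 1 units described in the TODOs. The helper `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [10000, 16, 16, 1]  # input dimension followed by the units of each Dense layer
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

Almost all parameters sit in the first layer, because every one of the 10,000 input dimensions connects to each of its 16 nodes.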

#### Compile the model

After defining the model structure, we configure how the model is trained: the optimizer, the loss, and the metrics.

In [None]:
# TODO: compile the model with the following parameters
# - Optimizer: adam optimizer
# - Loss: binary_crossentropy
# - Metrics: accuracy

model.compile(

)
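
To see what the loss and the metric measure, here is a plain-Python sketch of binary cross-entropy and accuracy on a handful of made-up predictions. It is an illustration of the formulas under our own function names, not the Keras implementation. The labels and probabilities are hypothetical.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


toy_labels = [1, 0, 1, 0]          # hypothetical labels
toy_probs = [0.9, 0.2, 0.4, 0.6]   # hypothetical model outputs
print(binary_crossentropy(toy_labels, toy_probs), accuracy(toy_labels, toy_probs))
```

Accuracy only counts whether each prediction lands on the right side of 0.5, while the loss also penalizes how confident a wrong prediction is. This is why the two curves can move differently during training.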

#### Create a validation set

When training, we want to check the accuracy of the model on data it hasn't seen before. Create a validation set by setting apart 10,000 examples from the original training data. (Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy).

In [None]:
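# Set apart the first 10,000 training examples for validation
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]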

#### Train the model

Then, we'll start to train the model.

In [None]:
# Define the batch size and the number of epochs to use during training
BATCH_SIZE = 512
EPOCHS = 20
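
These two settings determine how many weight updates the optimizer performs. The quick calculation below assumes training runs on the 15,000 examples left after setting apart 10,000 of the 25,000 training reviews for validation.

```python
import math

BATCH_SIZE = 512
EPOCHS = 20
n_train = 25000 - 10000  # assumption: training examples left after the validation split

steps_per_epoch = math.ceil(n_train / BATCH_SIZE)  # the last batch is smaller
last_batch_size = n_train - (steps_per_epoch - 1) * BATCH_SIZE
total_updates = steps_per_epoch * EPOCHS
print(steps_per_epoch, last_batch_size, total_updates)
```

Under that assumption, each epoch consists of 30 batches, which should match the step counter shown in the training progress output.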

In [None]:
# TODO: train the model using (partial_x_train, partial_y_train)
#       do not forget to set epochs and batch size as the parameters!
# To compare the loss trend of the training and validation data, pass (x_val, y_val) as the validation_data parameter of the fit function
model_history = model.fit(

)

With this approach, our model reaches a validation accuracy of around 88%. Note that the model is overfitting: the training accuracy is significantly higher.

#### Evaluate the model
Now let's see how the model performs on the test set. `evaluate` returns two values: the loss (a number that represents our error; lower values are better) and the accuracy.

In [None]:
results = model.evaluate(test_data, test_labels)

print(results)

This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.

#### Plotting network performance trend

In [None]:
history_dict = model_history.history
history_dict.keys()

In [None]:
history_dict = model_history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()

In these plots, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

Notice that the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected with gradient descent optimization, which is designed to keep reducing the training loss.

This isn't the case for the validation loss and accuracy: they seem to peak partway through training (read the exact epoch off your own plots). **This is an example of _overfitting_: the model performs better on the training data than it does on data it has never seen before.** After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.

For this particular case, we could prevent overfitting by simply stopping the training once the validation loss stops improving. Later, you'll see how to do this automatically with a callback.
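
The stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical illustration of the idea behind callbacks such as `tf.keras.callbacks.EarlyStopping` (which exposes `monitor` and `patience` arguments), not that callback's implementation. The loss values are made up, not results from this notebook.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, not measured results from this notebook.
toy_val_loss = [0.50, 0.38, 0.31, 0.29, 0.30, 0.33, 0.37]
print(best_epoch_with_patience(toy_val_loss, patience=2))
```

With these toy values, the loss is lowest at epoch 4 and training would stop at epoch 6, after two epochs without improvement. A larger `patience` tolerates noisier validation curves at the cost of a few extra epochs.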