# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## Recurrent Neural Networks
### Word embeddings
Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.

#### Representing text as numbers: One-hot encodings 
The first method for representing text as numbers is to "one-hot" encode each word in the vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the set of unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector with length equal to the vocabulary size, then place a one at the index that corresponds to the word. This approach is shown in the following diagram.

![Diagram of one-hot encodings](images/one-hot.png)

To create a vector that contains the encoding of the sentence, we could then concatenate the one-hot vectors for each word.

**Key point:** This approach is **inefficient**. A one-hot encoded vector is **sparse (meaning most indices are zero)**. Imagine we have 10,000 words in the vocabulary. To one-hot encode each word, we would create a vector where 99.99% of the elements are zero.
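The one-hot scheme described above can be sketched in plain Python. This is a minimal illustration, assuming lowercase whitespace tokenization and an alphabetically sorted vocabulary; the names `index_of` and `one_hot` are hypothetical helpers, not part of any library.

```python
# Build the vocabulary from the example sentence (sorted alphabetically)
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()
vocab = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']
index_of = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a zero vector of vocabulary length with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index_of[word]] = 1
    return vec


# Encode the sentence as one vector per token
encoded = [one_hot(t) for t in tokens]
for token, vec in zip(tokens, encoded):
    print(token, vec)

# Fraction of zero entries in a single one-hot vector for a 10,000-word vocabulary
sparsity = (10_000 - 1) / 10_000
print(sparsity)
```

Each vector has exactly one non-zero entry, which is why the representation becomes wasteful as the vocabulary grows.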


#### Encode each word with a unique number
A second approach is to encode each word with a unique number. Continuing the example above, we could assign 1 to "cat", 2 to "mat", and so on. We could then **encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]**. This approach is efficient. Instead of a sparse vector, we now have a dense one (where every element carries information).

There are two downsides to this approach:
- The integer-encoding is arbitrary (**it does not capture any relationship between words**).
- An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
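As a small sketch of the integer encoding above, the snippet below numbers the alphabetically sorted vocabulary starting at 1. The tokenization and the name `word_to_id` are assumptions made for illustration.

```python
tokens = "The cat sat on the mat".lower().split()
vocab = sorted(set(tokens))

# Start numbering at 1, as in the text ("cat" -> 1, "mat" -> 2, ...)
word_to_id = {word: i for i, word in enumerate(vocab, start=1)}

encoded = [word_to_id[t] for t in tokens]
print(encoded)  # [5, 1, 4, 3, 5, 2]
```

The numbering depends only on alphabetical order here, which illustrates the first downside: "on" (3) sits between "mat" (2) and "sat" (4) numerically, yet that ordering says nothing about meaning.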

#### Word embeddings
Word embeddings give us a way to use an **efficient, dense representation in which similar words have a similar encoding**. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify).

Rather than being specified manually, the values of the embedding are **trainable parameters** (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensional when working with large datasets. **A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.**

![Diagram of an embedding](images/embeddings.png)

Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as a "lookup table". After these weights have been learned, we can encode each word by looking up the dense vector it corresponds to in the table.

- Visualization of trained word embedding through [Embedding Projector](http://projector.tensorflow.org/)

#### Using the Embedding layer
Keras makes it easy to use word embeddings. Let's take a look at the [Embedding layer](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Embedding).

`tf.keras.layers.Embedding(input_dim, output_dim)`: Turns positive integers (indexes) into dense vectors of fixed size.
 - input_dim:  int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
 - output_dim: int >= 0. Dimension of the dense embedding.
 - Input shape: 2D tensor with shape: (batch_size, input_length).
 - Output shape: 3D tensor with shape: (batch_size, input_length, output_dim).

In [None]:
import tensorflow as tf

# The Embedding layer takes at least two arguments:
# the number of possible words in the vocabulary, here 1000 (1 + maximum word index),
# and the dimensionality of the embeddings, here 32.
embedding_layer = tf.keras.layers.Embedding(1000, 32)

The Embedding layer can be understood as a **lookup table** that maps from integer indices (which stand for specific words) to dense vectors. The dimensionality of the embedding is a parameter you can experiment with to see what works well for your problem (much in the same way you would experiment with the number of neurons in a Dense layer).

When you create an Embedding layer, the weights for the embedding are randomly initialized. During training, they are gradually adjusted via backpropagation. Once trained, **the learned word embeddings will roughly encode similarities between words** (as they were learned for the specific problem your model is trained on).

As input, the Embedding layer takes a 2D tensor of integers, of shape (batches, sequence_length), where each row is a _sequence of integers_. It can embed sequences of variable lengths: for example, you could feed the embedding layer above batches with shape (32, 10) (a batch of 32 sequences of length 10). All sequences in a batch must have the same length, so sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated.

As output, the embedding layer returns a 3D floating point tensor, of shape (batches, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer, or can simply be flattened or pooled and processed by a Dense layer.
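The input and output shapes described above can be illustrated without TensorFlow. The following is a rough sketch that treats the layer as a plain lookup table stored in nested lists; `table` and `embed_batch` are hypothetical names, and the uniform initialization range is an arbitrary choice for illustration, not a statement about how Keras initializes its weights.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 1000, 32

# Randomly initialized table with one row per word index (stand-in for the layer weights)
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]


def embed_batch(batch):
    """Replace every integer index with its table row, adding a trailing dimension."""
    return [[table[idx] for idx in sequence] for sequence in batch]


# A batch of 32 integer sequences, each of length 10: shape (32, 10)
batch = [[random.randrange(vocab_size) for _ in range(10)] for _ in range(32)]

output = embed_batch(batch)
shape = (len(output), len(output[0]), len(output[0][0]))
print(shape)  # (32, 10, 32)
```

The lookup itself involves no arithmetic: each integer simply selects a row, which is why the output gains one dimension of size `embedding_dim`.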

#### Learning embeddings from scratch
We will train a sentiment classifier on the IMDB movie reviews dataset that we used previously. In the process, we will learn embeddings from scratch.

In [None]:
vocab_size = 10000
imdb = tf.keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocab_size)

As imported, the text of reviews is integer-encoded (each integer represents a specific word in a dictionary).

In [None]:
print(train_data[0])

#### Convert the integers back to words
It may be useful to know how to convert integers back to text. Here, we'll create a helper function to query a dictionary object that contains the integer to string mapping:

In [None]:
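# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved, so shift every word index up by 3
# and rebuild the dict with key (word) -> value (integer)
word_index = {k: (v + 3) for k, v in word_index.items()}

# Add the reserved special tokens
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

# Key: integers, Value: words
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Join the looked-up words; indices missing from the mapping become '?'
    return ' '.join(reverse_word_index.get(i, '?') for i in text)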

print("Train_data[0]: {}\n".format(train_data[0]))
decode_review(train_data[0])

Movie reviews can be different lengths. We will use the [`pad_sequences`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function to standardize the lengths of the reviews.

In [None]:
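# Standardize the lengths of the data to use it as model input
maxlen = 500

# Return a 2D NumPy array of shape (num_reviews, maxlen)
# by padding the input data with zeros (padding at the end is an assumption here)
train_data = tf.keras.preprocessing.sequence.pad_sequences(
    train_data, value=word_index["<PAD>"], padding='post', maxlen=maxlen)

test_data = tf.keras.preprocessing.sequence.pad_sequences(
    test_data, value=word_index["<PAD>"], padding='post', maxlen=maxlen)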

In [None]:
print("Len of data: {}".format(len(train_data[0])))
print(train_data[0])
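To make the effect of length standardization concrete, here is a simplified pure-Python sketch of padding and truncating to a fixed length. It is not the Keras implementation: `pad_or_truncate` is a hypothetical helper, and both the choice to pad at the end and the choice to keep the last `maxlen` items of long sequences are assumptions.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Return a list of exactly maxlen items.

    Longer sequences keep their last maxlen items; shorter ones get
    pad_value appended at the end. Both choices are assumptions.
    """
    trimmed = sequence[-maxlen:] if maxlen > 0 else []
    return trimmed + [pad_value] * (maxlen - len(trimmed))


# Two toy "reviews" of different lengths (made-up word indices)
reviews = [[1, 14, 22], [1, 5, 9, 7, 3, 8, 2]]

padded = [pad_or_truncate(r, maxlen=5) for r in reviews]
print(padded)  # [[1, 14, 22, 0, 0], [9, 7, 3, 8, 2]]
```

After this step every row has the same length, so the batch can be stored as a single 2D array.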

#### Create simple embedding model 

We will use the Keras Sequential API to define our model.
- The first layer is an **Embedding layer**. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index (converting integers to floating point vectors). These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are `(batch, sequence, embedding)` (3D tensor).
- Next, a **GlobalAveragePooling1D layer** returns a fixed-length output vector for each example by averaging over the sequence dimension. The resulting dimensions are `(batch, features)` (2D tensor). This allows the model to handle input of variable length in the simplest way possible.
<img src=https://jsideas.net/assets/materials/20180104/GAP_GMP.png width=600>

- This fixed-length output vector is piped through a **fully-connected (Dense) layer** with 16 hidden units.
- The last layer is **densely connected with a single output node**. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability (or confidence level) that the review is positive.
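The pooling and output steps in the list above can be sketched in plain Python. This is an illustration under simplifying assumptions: the embedded vectors, the weights, and the bias are made-up values, the 16-unit hidden layer is omitted, and `global_average_pool` and `sigmoid` are hypothetical helper names.

```python
import math


def global_average_pool(sequence_vectors):
    """Average a (sequence, embedding) list of vectors over the sequence dimension."""
    n = len(sequence_vectors)
    dim = len(sequence_vectors[0])
    return [sum(vec[d] for vec in sequence_vectors) / n for d in range(dim)]


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Two reviews of different lengths, already embedded (hypothetical 2-dimensional vectors)
short_review = [[1.0, 3.0], [3.0, 5.0]]
long_review = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]

# Pooling yields one fixed-length vector per review, regardless of review length
pooled_short = global_average_pool(short_review)
pooled_long = global_average_pool(long_review)
print(pooled_short, pooled_long)

# A hypothetical single output unit: weighted sum plus bias, passed through a sigmoid
weights, bias = [0.5, -0.25], 0.1
score = sigmoid(sum(w * x for w, x in zip(weights, pooled_short)) + bias)
print(score)
```

Both pooled vectors have length 2 even though the reviews contain two and three tokens, which is what lets the following Dense layers use a fixed number of weights.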

In [None]:
embedding_dim = 16

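model = tf.keras.Sequential([
    tf.keras.Input(shape=(maxlen,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation='relu'),  # ReLU is an assumption
    tf.keras.layers.Dense(1, activation='sigmoid')
])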
model.summary()

#### Compile and train the model 

In [None]:
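# Compile the model (the Adam optimizer is an assumption; binary cross-entropy matches the sigmoid output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])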

In [None]:
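# Then, train the model (epochs, batch size, and validation split are assumed values)
history = model.fit(train_data, train_labels, epochs=30, batch_size=512, validation_split=0.2)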

With this approach our model reaches a validation accuracy of around 88% (note that the model is overfitting: training accuracy is significantly higher).

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5,1))
plt.show()

#### Retrieve the learned embeddings
Next, let's retrieve the word embeddings learned during training. This will be a matrix of shape `(vocab_size, embedding_dimension)`.

In [None]:
# model.layers[0] is the Embedding layer; its first weight array is the embedding matrix
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

We will now write the weights to disk. To use the [`Embedding Projector`](http://projector.tensorflow.org/), we will upload two files in tab-separated format: a file of vectors (containing the embedding), and a file of metadata (containing the words).

In [None]:
import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

In [None]:
# For google colab
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')

#### Visualize the embeddings
To visualize our embeddings we will upload them to the embedding projector.

Open the Embedding Projector.
- Click on "Load data".
- Upload the two files we created above: vecs.tsv and meta.tsv. 

The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful". Note: your results may be a bit different, depending on how weights were randomly initialized before training the embedding layer.
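A closest-neighbor search of this kind can also be approximated offline. The sketch below assumes cosine similarity as the measure and uses made-up 3-dimensional vectors; with real data the vectors would come from the learned `weights` matrix. The helpers `cosine_similarity` and `nearest_neighbors` are hypothetical, and the code assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=2):
    """Return the k words whose vectors are most similar to the query word's vector."""
    scores = [(cosine_similarity(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return [word for _, word in sorted(scores, reverse=True)[:k]]


# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "beautiful": [0.9, 0.8, 0.1],
    "wonderful": [0.8, 0.9, 0.2],
    "boring": [-0.7, -0.6, 0.3],
    "awful": [-0.9, -0.8, 0.1],
}

print(nearest_neighbors("beautiful", toy_vectors, k=1))  # ['wonderful']
```

With trained embeddings, the neighbors found this way may differ from run to run for the same reason noted above: the weights start from a random initialization.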