# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## 4. Recurrent Neural Networks
### 4-2.  Word embeddings
To handle text data, we must first convert the text into numbers (i.e., preprocess the data).

#### One-hot encodings 
The first method for representing text as numbers is to "one-hot" encode each word in the vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the set of unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector whose length equals the vocabulary size, then place a one at the index that corresponds to the word. This approach is shown in the following diagram.

![One Hot Encodings](images/one-hot.png)

To create a vector that contains the encoding of the sentence, we could then concatenate the one-hot vectors for each word.
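
The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the names `word_to_index` and `one_hot` are made up for this sketch, and the vocabulary order (alphabetical) is an assumption chosen to match the listing in the text.

```python
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Sorted unique words form the vocabulary: ['cat', 'mat', 'on', 'sat', 'the']
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


# Encode each word, then concatenate the vectors to encode the whole sentence.
encoded_words = [one_hot(word) for word in tokens]
encoded_sentence = [value for vector in encoded_words for value in vector]

print(vocab)
print(one_hot("cat"))         # [1, 0, 0, 0, 0]
print(len(encoded_sentence))  # 6 words * 5 vocabulary entries = 30
```

Even for this six-word sentence, the concatenated encoding already has 30 entries, only 6 of which are non-zero.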

However, this approach has several downsides.
1. One-hot encoding is very inefficient.
- A one-hot encoded vector is sparse (meaning most elements are zero). Imagine we have 10,000 words in the vocabulary. To one-hot encode each word, we would create a vector where 99.99% of the elements are zero.
2. One-hot encoding cannot capture semantic relationships between words.
- Every token in a one-hot encoding is equally distant from all the others. That is,

![One-hot distance](images/one-hot-distance.PNG)
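
The equal-distance property can be checked numerically. The sketch below assumes Euclidean distance as the measure (the figure may use a different one) and uses a hypothetical helper named `euclidean`.

```python
import math

vocab = ["cat", "mat", "on", "sat", "the"]

# Build one one-hot vector per word in the vocabulary.
one_hot_vectors = {
    word: [1 if i == j else 0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Distance for every unordered pair of distinct words.
distances = {
    (w1, w2): euclidean(one_hot_vectors[w1], one_hot_vectors[w2])
    for w1 in vocab
    for w2 in vocab
    if w1 < w2
}

for pair, distance in distances.items():
    print(pair, round(distance, 4))
```

Every pair comes out at the same distance (the square root of 2), so the encoding says nothing about which words are related.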


#### Word embeddings
Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, we do not have to specify this encoding by hand.

The values of the embedding are not specified manually; they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). The neural network thus captures each token's meaning as a vector.  
It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensional when working with large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

![Word Embeddings](images/embeddings.png)

Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating-point values. Another way to think of an embedding is as a "lookup table". After these weights have been learned, we can encode each word by looking up the dense vector it corresponds to in the table.
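
The lookup-table view can be illustrated without any deep learning library. In this sketch the table values are made up for illustration (a real model would learn them during training), and `embedding_table` and `embed` are hypothetical names.

```python
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One 4-dimensional row per vocabulary entry.
# These numbers are arbitrary placeholders, not learned weights.
embedding_table = [
    [1.2, -0.1, 4.3, 3.2],   # cat
    [0.4, 2.5, -0.9, 0.5],   # mat
    [2.1, 0.3, 0.1, 0.4],    # on
    [-0.7, 1.8, 0.6, -1.1],  # sat
    [0.0, 0.9, -0.3, 2.2],   # the
]


def embed(words):
    """Encode words by looking up the row that matches each word's index."""
    return [embedding_table[word_to_index[word]] for word in words]


for word, vector in zip(["the", "cat"], embed(["the", "cat"])):
    print(word, vector)
```

Encoding a word is just an indexing operation, which is why an embedding layer is cheap compared with multiplying a one-hot vector by a weight matrix.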

- Papers related to word-embeddings
1. CBoW, Skip-gram: https://arxiv.org/pdf/1301.3781v3.pdf
2. GloVe: https://www.aclweb.org/anthology/D14-1162.pdf
3. fastText: https://arxiv.org/pdf/1607.04606v2.pdf

In [1]:
try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf
import matplotlib.pyplot as plt

#### Using the Embedding Layer

![Keras embedding layer](images/keras-embedding.jpg)

Keras makes it easy to use word embeddings. Let's take a look at [`tf.keras.layers.Embedding(input_dim, output_dim)`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Embedding). This layer turns non-negative integers (indices) into dense vectors of fixed size.
- `input_dim`: Size of the vocabulary. The layer's input is a 2D integer tensor with shape `(batch_size, input_length)`.
- `output_dim`: Dimension of the dense embedding. The layer's output is a 3D tensor with shape `(batch_size, input_length, output_dim)`.
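
The input and output shapes of the layer can be checked with a small experiment. The sizes used here (`input_dim=1000`, `output_dim=5`, and a batch of two sequences of length three) are arbitrary values assumed for this sketch.

```python
import tensorflow as tf

# input_dim=1000 and output_dim=5 are arbitrary values chosen for this sketch.
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=5)

# A batch of 2 sequences, each holding 3 integer word indices in [0, 1000).
batch = tf.constant([[0, 1, 2], [3, 4, 5]])
result = embedding_layer(batch)

print(batch.shape)   # (2, 3): (batch_size, input_length)
print(result.shape)  # (2, 3, 5): (batch_size, input_length, output_dim)
```

The weights are randomly initialized at this point, so the vectors themselves carry no meaning until the layer has been trained.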

#### Learning embeddings from scratch
We will train a sentiment classifier on the IMDB movie reviews dataset that we used previously. In the process, we will learn embeddings from scratch.

In [None]:
vocab_size = 
imdb = tf.keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocab_size)

As imported, the text of reviews is integer-encoded (each integer represents a specific word in a dictionary).

In [None]:
print(f'Length of data: {len(train_data[0])}')
print(train_data[0])
print(train_labels[0])

#### Convert the integers back to words
It may be useful to know how to convert integers back to text. Here, we'll create a helper function to query a dictionary object that contains the integer to string mapping.

By default, [`tf.keras.datasets.imdb.load_data`](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data) reserves integer index 1 for the start of a sequence and 2 for out-of-vocabulary words. As a result, 4 is the first index at which actual words appear.

#### Things to do

1. First, we will load the IMDB dictionary (word_int_dict) that maps words to integer indices
2. We will make a dictionary that maps integer indices to words (int_word_dict, which is the inverse of word_int_dict)
3. The actual words must start from index 4 in int_word_dict
4. We will define indices for the special tokens (0-PAD, 1-START, 2-UNK, 3-UNUSED)
5. Finally, we will define a function (decode_review) that converts an array of indices into a sequence of words
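
The steps above can be sketched on a toy example. This is one possible approach, not the reference solution: `toy_word_int_dict` is a tiny hypothetical stand-in for the dictionary that `tf.keras.datasets.imdb.get_word_index()` returns, and the `<PAD>`, `<START>`, `<UNK>`, and `<UNUSED>` strings are names assumed for this sketch.

```python
# Toy stand-in for the IMDB word index (hypothetical entries, 1-based).
toy_word_int_dict = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Invert the mapping and shift every index by 3 so real words start at 4.
int_word_dict = {index + 3: word for word, index in toy_word_int_dict.items()}

# Reserve the first four indices for special tokens.
int_word_dict[0] = "<PAD>"
int_word_dict[1] = "<START>"
int_word_dict[2] = "<UNK>"
int_word_dict[3] = "<UNUSED>"


def decode_review(indices):
    """Convert a sequence of integer indices back into a string of words."""
    return " ".join(int_word_dict.get(i, "<UNK>") for i in indices)


print(decode_review([1, 4, 5, 6, 7, 2, 0]))
# <START> the movie was great <UNK> <PAD>
```

Using `dict.get` with a fallback keeps the function from raising a `KeyError` on indices that are missing from the dictionary.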

In [None]:
# A dictionary mapping words to an integer index
word_int_dict = 

In [6]:
# Create the inverse dictionary, mapping integers (keys) to words (values)
int_word_dict = 


def decode_review(text):


In [None]:

print("Train_data[0]: {}\n".format(train_data[0]))
decode_review(train_data[0])


Movie reviews can be different lengths. We will use the [`pad_sequences`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function to standardize the lengths of the reviews.
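
The effect of `pad_sequences` is easiest to see on a toy input. The `maxlen=5` value and the choice of `'post'` padding and truncation are assumptions made for this sketch; by default the function pads and truncates at the front (`'pre'`).

```python
import tensorflow as tf

# Two toy "reviews" of different lengths.
sequences = [[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]]

padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=5,           # assumed target length for this sketch
    padding='post',     # append zeros to short sequences
    truncating='post',  # cut long sequences at the end
    value=0,            # index 0 is the padding token
)

print(padded.shape)  # (2, 5)
print(padded)
# [[1 2 3 0 0]
#  [1 2 3 4 5]]
```

The result is a 2D NumPy array in which every row has the same length, so it can be fed to a model as a single batch.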

In [10]:
# Standardize the lengths of the data so it can be used as model input
maxlen = 

# Return a 2D NumPy array of shape (len(train_data), maxlen)
# by padding the input data with zeros
train_data = 

test_data = 

In [None]:
print(f'Length of data: {len(train_data[0])}')
print(train_data[0])

#### Create simple embedding model 

Let's define our model using `tf.keras.Sequential`.
- The first layer is a `tf.keras.layers.Embedding` layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index (converting integers to floating-point vectors). These vectors are learned as the model trains.
- Next, a `tf.keras.layers.GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.  
![Global average pool 1D](images/global-average-pool-1d.png)
- This fixed-length output vector is piped through a `tf.keras.layers.Dense` layer with 16 hidden units.
- The last layer is a `tf.keras.layers.Dense` layer with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability (or confidence level) that the review is positive.
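
One possible way to assemble the layers described above is sketched below. The values `vocab_size = 10000` and `embedding_dim = 16`, as well as the `relu` activation of the hidden layer, are assumptions made for this sketch and are not fixed by the text.

```python
import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size for this sketch
embedding_dim = 16   # assumed embedding dimension for this sketch

model = tf.keras.Sequential([
    # Look up a trainable embedding_dim-sized vector for each word index.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Average over the sequence dimension to get one vector per review.
    tf.keras.layers.GlobalAveragePooling1D(),
    # Hidden layer with 16 units ('relu' is an assumed activation).
    tf.keras.layers.Dense(16, activation='relu'),
    # Single sigmoid output: probability that the review is positive.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Run an untrained forward pass on a dummy batch of 2 padded reviews.
dummy_batch = tf.constant([[1, 14, 22, 0], [1, 5, 0, 0]])
probabilities = model(dummy_batch)

print(probabilities.shape)  # (2, 1): one probability per review
```

Because the pooling layer averages over the sequence dimension, the same model accepts batches with any sequence length.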

In [12]:
# Let's use GlobalAveragePooling1D!
pool_func = 

In [14]:
embedding_dim = 

model = tf.keras.Sequential([

    
])

model.summary()

#### Compile and train the model 

In [None]:
# Compile the model
model.compile(
    
)

# Then, train the model
history = model.fit(


)

With this approach our model reaches a validation accuracy of around 88%.

In [None]:
history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12, 9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(12, 9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5, 1))
plt.show()

#### Explore the learned embeddings
Next, let's explore the word embeddings learned during training. To use the Embedding Projector, we will dump the embedding layer's weights into two files:
- A file of embeddings
- A file of metadata containing the words

In [18]:
embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]

with open('embeddings.tsv', 'w', encoding='utf-8') as embeddings_file:
    with open('meta.tsv', 'w', encoding='utf-8') as meta_file:
        for word_index in range(vocab_size):
            word = int_word_dict[word_index]
            embedding = weights[word_index]
            
            print('\t'.join(str(x) for x in embedding), file=embeddings_file)
            print(word, file=meta_file)

In [None]:
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('embeddings.tsv')
    files.download('meta.tsv')

Then, we can visualize the embeddings using the [Embedding Projector](http://projector.tensorflow.org).

Once you load the dumped files, the embeddings you trained will be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful". (Note: your results may differ somewhat, depending on how the weights were randomly initialized before training the embedding layer.)