### Adding an embedding layer

This notebook uses input data that was processed and pickled in the notebook 'Embedded Data Pre-Processing'.

It shows how to add a trainable embedding layer, in contrast to using pretrained embeddings as shown in the notebook 'GloVe'.

This notebook is adapted from a [Keras blog post](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html).


In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

Read in the preprocessed data. See the Embedding Data Pre-Processing notebook for details.

In [2]:
import pickle

def load_pickle(path):
    # open in binary mode and close the file once the object is loaded
    with open(path, 'rb') as f:
        return pickle.load(f)

train_samples = load_pickle('data/train_samples.pkl')
train_labels = load_pickle('data/train_labels.pkl')

val_samples = load_pickle('data/val_samples.pkl')
val_labels = load_pickle('data/val_labels.pkl')

test_samples = load_pickle('data/test_samples.pkl')
test_labels = load_pickle('data/test_labels.pkl')

class_names = load_pickle('data/class_names.pkl')

### Set up the vectorizer

Use Keras's TextVectorization layer to vectorize the data, with the vocabulary capped at 20,000 tokens. Each sample will be truncated or padded to a length of 200 tokens.

In [3]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)
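
To make the vocabulary cap concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. It only approximates what adapt() does and is not the layer's actual implementation: the helper name `build_vocabulary` and the toy corpus are made up for illustration, and the real layer also strips punctuation by default. Indices 0 and 1 are reserved for padding and out-of-vocabulary tokens, mirroring the `''` and `'[UNK]'` entries that get_vocabulary() returns.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Return a token list capped at max_tokens, most frequent first.

    Slots 0 and 1 are reserved for padding ('') and unknown ('[UNK]').
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Leave room for the two reserved entries.
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Toy corpus (hypothetical); "the" appears 3 times, "cat" twice.
corpus = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocabulary(corpus, max_tokens=4)
print(vocab)  # ['', '[UNK]', 'the', 'cat']
```

With a cap of 4, only the two most frequent words survive; everything else would map to the unknown slot.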

In [6]:
# create a word index dictionary in which words map to indices

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [7]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 3509, 1657, 15, 2, 5562]
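
A direct `word_index[w]` lookup raises a KeyError for any word outside the vocabulary. The sketch below shows one possible fallback, assuming index 1 is the out-of-vocabulary slot (`'[UNK]'`), which is where TextVectorization places it by default. The small `word_index` here is a stand-in built from the indices shown above, not the notebook's full dictionary, and `lookup` is a hypothetical helper.

```python
# Toy index for illustration; the real one comes from vectorizer.get_vocabulary().
word_index = {"": 0, "[UNK]": 1, "the": 2, "on": 15,
              "sat": 1657, "cat": 3509, "mat": 5562}

OOV_INDEX = 1  # assumed position of '[UNK]'

def lookup(words, index, oov=OOV_INDEX):
    """Map words to indices, falling back to the OOV index for unknown words."""
    return [index.get(w, oov) for w in words]

ids = lookup(["the", "cat", "sat", "on", "the", "zyzzyva"], word_index)
print(ids)  # [2, 3509, 1657, 15, 2, 1]
```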

### Set up the embedding layer

In [19]:
from tensorflow.keras import layers

EMBEDDING_DIM = 128
MAX_SEQUENCE_LENGTH = 200

embedding_layer = layers.Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)
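
Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary index. The NumPy sketch below illustrates only the lookup and the resulting shapes. It uses a small random matrix and made-up sizes rather than the Keras layer itself, and it involves no training.

```python
import numpy as np

vocab_size = 10      # illustrative; the notebook uses len(word_index) + 1
embedding_dim = 4    # illustrative; the notebook uses 128

rng = np.random.default_rng(0)
# One row of weights per vocabulary index.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 integer sequences, each of length 3 (0 is the padding index).
token_ids = np.array([[2, 5, 0],
                      [7, 2, 2]])

# Indexing with an integer array selects one row per token.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (2, 3, 4)

# Weight count is rows * columns; with 20,001 rows and 128 columns:
n_params = 20001 * 128
print(n_params)  # 2560128
```

If the vocabulary holds 20,000 entries, so that the layer has 20,001 rows, the last figure agrees with the embedding parameter count that model.summary() reports.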

### Build the model

The model stacks three Conv1D layers with max pooling between them, followed by a dense layer with dropout and a softmax classification layer. Instead of the Sequential API, this example uses the Keras Functional API: https://www.tensorflow.org/guide/keras/functional

In [20]:
# embedding -> three Conv1D blocks with pooling -> dense layer with dropout -> softmax

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "functional_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_2 (Embedding)      (None, None, 128)         2560128   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_5 (Conv1D)            (None, None, 128)        
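
The summary reports None for the sequence dimension because the input shape is (None,). For the 200-token inputs used here, the lengths can be worked out by hand. The sketch below assumes the Keras defaults that the model code relies on (padding="valid" and stride 1 for Conv1D, stride equal to the pool size for MaxPooling1D); the helper names are made up for illustration.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of max pooling with stride == pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1

length = 200
length = conv1d_valid_length(length, 5)   # 196
length = maxpool1d_length(length, 5)      # 39
length = conv1d_valid_length(length, 5)   # 35
length = maxpool1d_length(length, 5)      # 7
length = conv1d_valid_length(length, 5)   # 3
print(length)  # 3

# Conv1D weights: kernel_size * in_channels * filters, plus one bias per filter.
conv_params = 5 * 128 * 128 + 128
print(conv_params)  # 82048
```

The final GlobalMaxPooling1D layer then reduces the remaining 3 time steps to a single 128-dimensional vector per sample.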

### Vectorize train and validation sets

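Calling the vectorizer directly on the samples returns integer sequences of length 200: shorter samples are right-padded with zeros and longer ones are truncated.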

In [21]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)
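
For reference, right-padding and truncation to a fixed length can be sketched in plain Python. This mirrors the effect of output_sequence_length on sequences that are already tokenized; `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Cut a sequence to `length`, or append pad_value until it reaches `length`."""
    trimmed = list(token_ids[:length])
    return trimmed + [pad_value] * (length - len(trimmed))

short = pad_or_truncate([2, 3509, 1657], length=5)
long = pad_or_truncate([2, 3509, 1657, 15, 2, 5562], length=5)
print(short)  # [2, 3509, 1657, 0, 0]
print(long)   # [2, 3509, 1657, 15, 2]
```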

### Train the model

Sparse categorical crossentropy is used because the labels are integer class indices (not one-hot vectors) and the final layer is a multi-class softmax layer.
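
As a sketch of what this loss computes: for integer labels, it is the negative log of the probability the model assigns to the true class, averaged over the batch. The pure-Python version below is for illustration only. The probabilities are made up, and Keras additionally clips values for numerical stability.

```python
import math

def sparse_categorical_crossentropy(labels, probabilities):
    """Mean negative log-probability of the true class for integer labels."""
    losses = [-math.log(p[y]) for y, p in zip(labels, probabilities)]
    return sum(losses) / len(losses)

# Two samples, three classes; each row sums to 1 like a softmax output.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]  # integer class indices, not one-hot vectors

loss = sparse_categorical_crossentropy(labels, probs)
print(round(loss, 4))  # 0.2899
```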

In [22]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

Epoch 1/20
...
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7feaba85c940>

### Export the model

The next code block shows how you could create an end-to-end system in which the input is a text string and the output is the predicted label.

In [23]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)

class_names[np.argmax(probabilities[0])]

'comp.graphics'
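
Because predict returns a full probability vector, it can be useful to look at more than the single best class. The sketch below uses made-up probabilities and a shortened, illustrative class list rather than real model output; `top_k` is a hypothetical helper.

```python
def top_k(probabilities, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    ranked = sorted(zip(names, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative values only, not output from the trained model.
names = ["alt.atheism", "comp.graphics", "rec.autos", "sci.space"]
probs = [0.05, 0.70, 0.05, 0.20]

for name, p in top_k(probs, names, k=2):
    print(f"{name}: {p:.2f}")
```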

### Evaluate on the test data

In [24]:
x_test = vectorizer(np.array([[s] for s in test_samples])).numpy()

# predicted class = index of the highest softmax probability for each sample
preds = model.predict(x_test)
pred_labels = np.argmax(preds, axis=1)
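
Before summarizing these predictions, it may help to recall how per-class precision, recall, and F1 are defined. The sketch below computes them by hand on a tiny made-up label set, treating one class at a time as the positive class. `per_class_metrics` is a hypothetical helper for illustration and not a replacement for scikit-learn.

```python
def per_class_metrics(true_labels, predicted_labels, label):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: class 0 has 3 true samples, 2 of which are predicted correctly.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

precision, recall, f1 = per_class_metrics(y_true, y_pred, label=0)
print(precision, round(recall, 2), round(f1, 2))  # 1.0 0.67 0.8
```

The support column in a classification report is the number of true samples of each class, which is 3 for class 0 in this toy example.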

In [25]:
from sklearn.metrics import classification_report

print(classification_report(test_labels, pred_labels))

              precision    recall  f1-score   support

           0       0.61      0.70      0.65       200
           1       0.53      0.49      0.51       202
           2       0.58      0.55      0.56       196
           3       0.51      0.66      0.58       192
           4       0.68      0.78      0.72       196
           5       0.86      0.63      0.72       190
           6       0.69      0.58      0.63       201
           7       0.73      0.73      0.73       200
           8       0.74      0.82      0.78       196
           9       0.83      0.89      0.86       213
          10       0.85      0.89      0.87       188
          11       0.80      0.84      0.82       196
          12       0.53      0.60      0.56       206
          13       0.80      0.64      0.71       190
          14       0.81      0.75      0.78       206
          15       0.78      0.65      0.71       193
          16       0.70      0.70      0.70       223
          17       0.91    