This is the last milestone of this course. Here, you will use either the text you generated in the previous project or the text provided here to augment the training data. In this milestone, the training data consists of 6250 generated positive reviews, 6250 original positive reviews, and 12500 negative reviews. These are concatenated and then shuffled for model training. Steps 1 through 9 load, encode, and shuffle these data. From step 10 onward, the process of building and training the model is identical to that in project 2, milestone 4, where you built and trained a model using oversampled data.

## Workflow

1. Download `all_train_text.pkl` and `y_train_assembled.npy` from the GitHub repo provided by the author. These files were generated in the previous milestone of this project. You may use the `git clone` command (pre-installed in Google Colab) as below:

In [1]:
#%%sh
#git clone https://github.com/shinchan75034/ManningPublishing-ImbalancedTextLP.git

In [2]:
#!ls -lrt ./ManningPublishing-ImbalancedTextLP/project4/milestone2

The listing above shows that the two files generated in the previous milestone are now at your disposal, along with the directory path to them.

2. Load libraries and open these two files. The `.pkl` file has to be read with the `pickle` module, while the `.npy` file has to be read with `numpy`.

In [3]:
import tensorflow as tf
import os
import pickle
import numpy as np
import pandas as pd
print(tf.__version__)

2.6.0


In [4]:
with open('sample_files/milestone2/all_train_text.pkl', 'rb') as f:
    all_training_text = pickle.load(f)

In [5]:
with open('sample_files/milestone2/y_train_assembled.npy', 'rb') as f:
    y_train_assembled = np.load(f)
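
To make the two file formats concrete, here is a minimal round-trip sketch. The file names `toy_text.pkl` and `toy_labels.npy` and their contents are made up for illustration; they are not the files from the repo. The sketch writes a list with `pickle` and an array with `np.save`, then reads both back the same way as the cells above.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the real files (hypothetical contents, for illustration only)
toy_text = ["a great movie", "dull and slow"]
toy_labels = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, "toy_text.pkl")
    label_path = os.path.join(tmp_dir, "toy_labels.npy")

    # Write: pickle serializes arbitrary Python objects, np.save writes arrays
    with open(text_path, "wb") as f:
        pickle.dump(toy_text, f)
    np.save(label_path, toy_labels)

    # Read back, mirroring the two loading cells above
    with open(text_path, "rb") as f:
        loaded_text = pickle.load(f)
    with open(label_path, "rb") as f:
        loaded_labels = np.load(f)

print(loaded_text, loaded_labels)
```

A list of strings has no natural array shape, which is presumably why the text is stored as a pickle while the labels are stored in the `.npy` format.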

3. Since the generated text almost certainly contains misspelled words, it makes sense to use character-based tokenization instead of the word-based tokenization that came with the dataset. The plan here is to encode the reviews at the character level: decode the original reviews from tokens to plain text, merge that text with the generated text, then tokenize all text at the character level. Remember that in `tf.keras.datasets.imdb`, the first four integers are reserved tokens and must be accounted for:

In [6]:
word_index = tf.keras.datasets.imdb.get_word_index()
# These indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
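
The effect of the index shift is easier to see on a tiny vocabulary. The sketch below uses a made-up four-word index (`toy_word_index`) in place of the real IMDB index, so it runs without downloading anything; the names prefixed with `toy_` are hypothetical.

```python
# Toy word index standing in for tf.keras.datasets.imdb.get_word_index()
# (hypothetical entries; the real index is far larger)
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Shift every word by 3 so that indices 0..3 stay free for reserved tokens
toy_word_index = {word: idx + 3 for word, idx in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}


def toy_decode_review(tokens):
    # Ids missing from the reverse index are rendered as '?'
    return " ".join(toy_reverse_index.get(i, "?") for i in tokens)


encoded = [1, 4, 5, 6, 7, 99]  # <START>, four known words, one unseen id
decoded = toy_decode_review(encoded)
print(decoded)  # <START> the movie was great ?
```

Because the original word ranks start at 1, adding 3 places the first real word at index 4, right after the four reserved tokens.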

4. Build a new lookup dictionary that maps each character to an index from `all_training_text`. This ensures every character in `all_training_text` is accounted for.

In [7]:
text = ' '.join(all_training_text)

In [8]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

116 unique characters


In [9]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}

5. Add padding and the other reserved tokens to the dictionary.

In [10]:
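# Indices 0-3 are reserved. enumerate() starts at 0, so shift by 4
# (not 3) to keep the first character from colliding with <UNUSED>.
char2idx = {k:(v+4) for k,v in char2idx.items()}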
char2idx["<PAD>"] = 0
char2idx["<START>"] = 1
char2idx["<UNK>"] = 2  # unknown
char2idx["<UNUSED>"] = 3

6. Convert plain text to tokens. Create a helper function named `encode_review` that uses `char2idx` to map each character of a review to an integer, falling back to `<UNK>` for any character not in the dictionary. Then use `map` to apply `encode_review` to every item in `all_training_text`.

In [11]:
def encode_review(plain_text):
    encoded_list = []
    for character in plain_text:
        if character in char2idx:
            token = char2idx[character]
        else:
            token = char2idx['<UNK>']
        encoded_list.append(token)
    return encoded_list

In [12]:
encoded_all_list = list(map(encode_review, all_training_text))

In [13]:
if len(encoded_all_list) == len(all_training_text):
    print(len(encoded_all_list))

25000
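
As a quick sanity check of steps 4 through 6, the sketch below repeats them on a two-review toy corpus. All `toy_` names are hypothetical, and the offset of 4 is an assumption chosen so that characters never share an index with the four reserved tokens.

```python
toy_reviews = ["good film", "bad"]

# Character vocabulary built from the toy corpus only
toy_vocab = sorted(set(" ".join(toy_reviews)))

# Offset by 4 so characters start right after the reserved indices 0..3
toy_char2idx = {ch: i + 4 for i, ch in enumerate(toy_vocab)}
toy_char2idx["<PAD>"] = 0
toy_char2idx["<START>"] = 1
toy_char2idx["<UNK>"] = 2
toy_char2idx["<UNUSED>"] = 3


def toy_encode_review(plain_text):
    # Characters missing from the dictionary map to the <UNK> index
    unk = toy_char2idx["<UNK>"]
    return [toy_char2idx.get(ch, unk) for ch in plain_text]


print(toy_encode_review("bad"))  # [6, 5, 7]
print(toy_encode_review("zoo"))  # [2, 13, 13]
```

The letter `z` never appears in the toy corpus, so it is encoded as `<UNK>` (2). The same fallback applies to test reviews that contain characters absent from the training text.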


7. Convert the tokenized list to a `numpy` array. The tokenized reviews are in Python list format. To use them for model training, you need to convert them to a `numpy` array. 

In [14]:
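# Reviews differ in length, so build an object array explicitly;
# recent NumPy versions reject ragged input without dtype=object.
all_training_np = np.array(encoded_all_list, dtype=object)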


8. Perform the padding operation. \
Text classification models generally require all inputs to have the same length. This length is up to you: look at the lengths of your data and choose a value that retains enough information for the model to learn. In this code, a length of 256 characters is set, since each token is now a character. You may experiment with a different value. Reviews shorter than this length are padded with the `<PAD>` token at the front of the text; reviews longer than this length are truncated to this length. Let's pad each review to a maximum length of 256 characters. We can take advantage of the `pad_sequences` function to simplify the task.

In [15]:
train_data = tf.keras.preprocessing.sequence.pad_sequences(all_training_np,
                                                        value=char2idx["<PAD>"],  # character-level <PAD> index (0)
                                                        padding='pre',
                                                        maxlen=256)

In [16]:
train_data.shape

(25000, 256)
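
Since no `truncating` argument is passed above, `pad_sequences` falls back to its default, which to my understanding is `'pre'`: a review longer than `maxlen` keeps its last 256 tokens, not its first. The plain-Python helper below (`pad_pre`, a hypothetical name) is only a sketch that mimics this behavior on short lists; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


toy_sequences = [[5, 6], [7, 8, 9, 10, 11, 12]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
print(toy_padded)  # [[0, 0, 5, 6], [9, 10, 11, 12]]
```

If the opening of a review matters more than its ending for your task, passing `truncating='post'` to `pad_sequences` would be worth an experiment.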

9. Create a shuffling index. This index is a randomized array of indices. It is used to shuffle both the training data and the labels, so that each review stays paired with its label during shuffling.

In [17]:
#Create a shuffling index
np.random.seed(100) 
shuffler = np.random.permutation(len(train_data))

In [18]:
#Shuffle both training data and label using the same shuffler
train_data_shuffled =  train_data[shuffler] # YOUR WORK: Use shuffler to re-order train_data
y_train_assembled_shuffled = y_train_assembled[shuffler] # YOUR WORK: Use shuffler to re-order label (y_train_assembled)
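
The reason a single permutation is used for both arrays can be checked on toy data. In the sketch below (all `toy_` names are made up), each label is tied to the first value of its row, so any mismatch after shuffling would be visible immediately.

```python
import numpy as np

# Row k starts with k * 10, and its label is k, so pairs are easy to verify
toy_data = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
toy_labels = np.array([1, 2, 3, 4])

np.random.seed(100)
toy_shuffler = np.random.permutation(len(toy_data))

# Index both arrays with the same permutation
toy_data_shuffled = toy_data[toy_shuffler]
toy_labels_shuffled = toy_labels[toy_shuffler]

# Every row is still paired with its original label
pairs_intact = np.array_equal(toy_data_shuffled[:, 0], toy_labels_shuffled * 10)
print(pairs_intact)  # True
```

Shuffling the two arrays with two separate calls to a shuffle function would break this pairing unless the random state were reset to the same seed before each call.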

10. Build and compile a text classification model with a structure just like the one in project 2. Start with an embedding layer that converts each token into a multi-dimensional vector representation (256 dimensions in the code below). That representation is fed to a bidirectional Long Short-Term Memory (LSTM) layer with 16 units per direction (a hyperparameter, arbitrarily chosen; feel free to experiment) to represent the text sequence, followed by a dense layer that aggregates the LSTM output before making a classification. Note that in the code below the constant `MAX_SENTENCE_LENGTH` is passed as the embedding dimension and `EMBEDDING_SIZE` as the number of LSTM units.

In [107]:
# Size of the embedding's input vocabulary. This reuses the word-level index,
# which is much larger than needed; len(char2idx) would cover the character tokens.
vocab_size = len(word_index)

MAX_SENTENCE_LENGTH=256
EMBEDDING_SIZE=16
HIDDEN_LAYER_SIZE=64
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, output_dim=MAX_SENTENCE_LENGTH),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(EMBEDDING_SIZE)),
    tf.keras.layers.Dense(HIDDEN_LAYER_SIZE),
    tf.keras.layers.Dense(1)
])

model.summary()

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, None, 256)         22678528  
_________________________________________________________________
bidirectional_4 (Bidirection (None, 32)                34944     
_________________________________________________________________
dense_6 (Dense)              (None, 64)                2112      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 65        
Total params: 22,715,649
Trainable params: 22,715,649
Non-trainable params: 0
_________________________________________________________________
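
The parameter counts in the summary can be reproduced by hand, which also shows where almost all of the model's size comes from. The sketch below is plain arithmetic; the vocabulary size of 88,588 is inferred by dividing the embedding parameter count by 256, not read from the code.

```python
vocab_size = 88588     # inferred: 22,678,528 embedding params / 256
embedding_dim = 256    # MAX_SENTENCE_LENGTH is passed as output_dim
lstm_units = 16        # EMBEDDING_SIZE is passed to the LSTM layer
dense_units = 64       # HIDDEN_LAYER_SIZE

embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_one_direction

# Dense layers: weights plus biases; the bidirectional output has 2 * 16 features
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + bilstm_params + dense_params + output_params
print(embedding_params, bilstm_params, dense_params, output_params, total_params)
# 22678528 34944 2112 65 22715649
```

The embedding table accounts for nearly all of the parameters. With a character vocabulary of roughly 120 entries, the same layer would need only about 30,000 parameters.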


In [108]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

11. Set up cross validation. It is good practice to hold out a portion of the training data for validation at the end of each training epoch. This helps identify a suitable number of training epochs and reveals overfitting, where the model merely memorizes the training data.

In [109]:
# Shuffle training data for cross validation during training cycle
FRAC = 0.8 # fraction of training data used for training. Remaining is for cross validation.
idx = np.arange(len(train_data_shuffled))
np.random.seed(seed=400)
np.random.shuffle(idx)

idxs = idx[:round(len(idx)*FRAC)] # Select random 80% for training data
partial_x_train = train_data_shuffled[idxs]
partial_y_train = y_train_assembled_shuffled[idxs]

x_val = np.delete(train_data_shuffled, idxs.tolist(), axis=0) # select remaining as cross validation data
y_val = np.delete(y_train_assembled_shuffled, idxs.tolist(), axis=0)

In [110]:
print(partial_x_train.shape, partial_y_train.shape)

(20000, 256) (20000,)


The output above shows the training data size. The cross validation data size is shown below:

In [111]:
print(x_val.shape, y_val.shape)

(5000, 256) (5000,)
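
An equivalent way to obtain the held-out rows, sketched here on toy arrays, is to slice the shuffled index twice instead of calling `np.delete`. The `toy_` names are hypothetical. The selected rows are the same; the only difference is that `np.delete` returns the validation rows in their original order, while slicing returns them in shuffled order.

```python
import numpy as np

FRAC = 0.8
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

idx = np.arange(len(toy_x))
np.random.seed(400)
np.random.shuffle(idx)

# First 80% of the shuffled index for training, the rest for validation
n_train = round(len(idx) * FRAC)
train_idx, val_idx = idx[:n_train], idx[n_train:]

toy_x_train, toy_y_train = toy_x[train_idx], toy_y[train_idx]
toy_x_val, toy_y_val = toy_x[val_idx], toy_y[val_idx]

print(toy_x_train.shape, toy_x_val.shape)  # (8, 2) (2, 2)
```

Either approach guarantees that no row appears in both subsets, since the two index slices do not overlap.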


12. Add checkpointing and early stopping to the training routine. Since there is no guarantee that the model at the end of training is the best-fitted one (it may be overfitted), it is important to save the model during training, so you can determine which epoch gives the best model based on validation accuracy. A way to achieve this is to specify a model checkpoint configuration through the `tf.keras.callbacks.ModelCheckpoint` API. With the `save_best_only` option, the model is saved only when it is the best so far.

Another concept is stopping the training routine once the model ceases to improve for `n` consecutive epochs. This is done through the `tf.keras.callbacks.EarlyStopping` API. The code below sets it up so that training stops if the monitored quantity, here the training `loss`, does not improve for five epochs; set `monitor` to `'val_accuracy'` to stop on validation accuracy instead. 

In [112]:
# Checkpoint and early stopping callbacks

checkpoint_prefix = os.path.join('./checkpoint', "ckpt-{epoch}")
print(checkpoint_prefix)

myCheckPoint = tf.keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_prefix,
    monitor = 'val_accuracy',
    mode = 'max',
    save_best_only = True
)

myEarlyStop = tf.keras.callbacks.EarlyStopping(
    monitor = 'loss',
    patience = 5
)

CALLBACKS = [myCheckPoint, myEarlyStop]

./checkpoint\ckpt-{epoch}
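
To build intuition for how `save_best_only` and `patience` interact, here is a simplified plain-Python simulation. It is not the Keras implementation: the function name `simulate_callbacks` and the score list are made up, and for simplicity both behaviors watch the same score, whereas the callbacks above monitor `val_accuracy` and `loss` separately.

```python
def simulate_callbacks(scores, patience=5):
    """Return (epochs_saved, last_epoch_run) for per-epoch scores, higher is better."""
    best = float("-inf")
    wait = 0
    saved = []
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # New best: a checkpoint would be written and the counter resets
            best = score
            wait = 0
            saved.append(epoch)
        else:
            # No improvement: stop once `patience` such epochs have accumulated
            wait += 1
            if wait >= patience:
                return saved, epoch
    return saved, len(scores)


toy_scores = [0.60, 0.62, 0.61, 0.61, 0.63, 0.62, 0.62, 0.62, 0.62, 0.62, 0.70]
saved_epochs, last_epoch = simulate_callbacks(toy_scores, patience=5)
print(saved_epochs, last_epoch)  # [1, 2, 5] 10
```

In this toy run the best score (0.70 at epoch 11) is never reached because training stops at epoch 10. That trade-off is worth keeping in mind when choosing `patience`.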


13. Launch the training process with training and validation data:

In [113]:
hist = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=16,
                    validation_data=(x_val, y_val),
                    callbacks=CALLBACKS,
                    verbose=1).history

Epoch 1/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-1\assets
Epoch 2/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-2\assets
Epoch 3/40
Epoch 4/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-4\assets
Epoch 5/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-5\assets
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-9\assets
Epoch 10/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-10\assets
Epoch 11/40
Epoch 12/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-12\assets
Epoch 13/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-13\assets
Epoch 14/40
Epoch 15/40
Epoch 16/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-16\assets
Epoch 17/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-17\assets
Epoch 18/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-18\assets
Epoch 19/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-19\assets
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-24\assets
Epoch 25/40
Epoch 26/40
Epoch 27/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-27\assets
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
INFO:tensorflow:Assets written to: ./checkpoint\ckpt-33\assets
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


Check the output from the above cell and observe that not every epoch of training results in saving the model weights and biases. These parameters are saved only if the validation accuracy is the best compared to all previous epochs. Therefore, the last epoch that saves these parameters is the epoch where the best model of this training run is produced.

14. Load the test data and encode it with character-based tokens. Since the model was trained on character-based tokenized data, you need to encode the test data with character-based tokens as well. First decode the test data, which is encoded with the word-based dictionary in the original dataset, and then encode it with the character-based dictionary.

In [114]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    path='imdb.npz',
    num_words=None,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3
)

In [115]:
#Remove the reserved <START> token from the beginning of every test review
for record in x_test:
    record.pop(0)

In [116]:
test_plain_text_holder = []
for review in x_test:
    plain_text = decode_review(review)
    test_plain_text_holder.append(plain_text)

In [117]:
char_encoded_test_list = list(map(encode_review, test_plain_text_holder))

15. Preprocess the test data. You need to apply the same preprocessing steps to the test data as you did to the training data. Namely, the encoded list that holds the test data has to be converted to a `numpy` array, and the maximum length for the test data has to be the same as the one set for training, which is 256 characters. Finally, test data shorter than the maximum length gets `<PAD>` tokens inserted at the front. After all these processing steps are done, you have a `numpy` array ready for the model to score.

In [118]:
# Reviews differ in length, so build an object array explicitly;
# recent NumPy versions reject ragged input without dtype=object.
all_test_np = np.array(char_encoded_test_list, dtype=object)


In [119]:
test_data = tf.keras.preprocessing.sequence.pad_sequences(all_test_np,
                                                       value=char2idx["<PAD>"],  # character-level <PAD> index (0)
                                                       padding='pre',
                                                       maxlen=256)

16. Find the epoch with the best validation accuracy. You may use the code below to extract from the training history which epoch produced the model with the highest validation accuracy. Then build the file path as a text string and load the model from that epoch using the `tf.keras.models.load_model` API.

In [120]:
max_value = max(hist['val_accuracy'])
max_index = hist['val_accuracy'].index(max_value)
best_epoch = max_index + 1
print('Best epoch: ', best_epoch)

Best epoch:  33


In [121]:
best_trained_model_path = 'checkpoint/ckpt-' + str(best_epoch)
my_best_model = tf.keras.models.load_model(best_trained_model_path)

In [122]:
predicted = my_best_model.predict(test_data)

In [123]:
predicted

array([[ 0.38588992],
       [ 0.34800476],
       [ 0.2983358 ],
       ...,
       [ 0.22702295],
       [-0.0822449 ],
       [ 0.57778704]], dtype=float32)

17. Create a confusion matrix of the predictions. Now you need to make sense of `predicted`. It is an array with one score per review. Because the final `Dense(1)` layer has no sigmoid activation, these scores are not strictly probabilities and can fall outside the range 0 to 1, as the negative value above shows. Remember that we decided that a value greater than 0.5 is a positive review, and a value less than or equal to 0.5 is a negative review. It's a good idea to display the prediction outcome in a confusion table, so we can see the breakdown of actual vs. predicted in each class. To do that, you may use the `pandas` library to create a `pandas` series, then use the `crosstab` function to build a confusion matrix:

In [124]:
predicted[predicted > 0.5] = 1
predicted[predicted <= 0.5] = 0
predictedf = predicted.flatten().astype(int)

import pandas as pd
df3 = pd.DataFrame(data=predictedf, columns=['predicted'])
refdf = pd.DataFrame(data=y_test, columns=['actual'])

y_actu = pd.Series(data=refdf.squeeze(), name='ACTUAL')
y_pred = pd.Series(data=df3.squeeze(), name='PREDICTED')
predicted_results = y_pred.tolist()
truth = y_actu.tolist()

dl_confusion = pd.crosstab(y_actu, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)

In [125]:
dl_confusion

Predicted      0      1    All
Actual
0           9274   3226  12500
1           4984   7516  12500
All        14258  10742  25000


18. Produce a performance report. You need to produce a report that shows precision, recall, and F1 for each class. A convenient method is using `sklearn`'s `classification_report` API.

In [126]:
from sklearn.metrics import classification_report
report = classification_report(refdf, df3)
print(report)

              precision    recall  f1-score   support

           0       0.65      0.74      0.69     12500
           1       0.70      0.60      0.65     12500

    accuracy                           0.67     25000
   macro avg       0.68      0.67      0.67     25000
weighted avg       0.68      0.67      0.67     25000
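
The figures in this report follow directly from the confusion matrix above. The sketch below recomputes them by hand from the four counts; the helper name `precision_recall_f1` is made up for illustration.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 9274, 3226  # actual negative reviews
fn, tp = 4984, 7516  # actual positive reviews


def precision_recall_f1(true_hits, predicted_total, actual_total):
    precision = true_hits / predicted_total  # correct share of what was predicted
    recall = true_hits / actual_total        # found share of what actually exists
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)


negative_scores = precision_recall_f1(tn, tn + fn, tn + fp)
positive_scores = precision_recall_f1(tp, tp + fp, tp + fn)
accuracy = round((tn + tp) / (tn + fp + fn + tp), 2)

print(negative_scores)  # (0.65, 0.74, 0.69)
print(positive_scores)  # (0.7, 0.6, 0.65)
print(accuracy)         # 0.67
```

The lower recall for class 1 reflects the 4984 positive reviews that the model labeled as negative.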



The report should indicate that model accuracy improved over what you saw at the conclusion of project 2 (the oversampling project), where you oversampled the positive reviews to balance the data. This result suggests that a model that learned from text can generate new text that resembles the original text, and that the quality of the generated text is good enough to help train a text classification model.