This is the last milestone of this course. Here, you will use either the text you generated in the previous project or the samples provided here to augment the training data. In this milestone, the training data consists of 6250 generated positive reviews, 6250 original positive reviews, and 12500 negative reviews. These are concatenated and then shuffled for model training. Steps 1 through 18 deal with concatenating these data. From step 19 onwards, the process of building and training the model is identical to that in project 2, milestone 4, where you built and trained a model using oversampled data.

## Workflow


1. Load the necessary libraries. Besides the usual TensorFlow, NumPy, and pandas libraries, you also need the `pickle` library, because the provided sample data are serialized in pickle format.

In [1]:
import tensorflow as tf
import pickle
import numpy as np
import pandas as pd
print(tf.__version__)

2.6.0


2. Sample files needed for this project are provided for your convenience. These files are:

`generated.pickle`: generated texts\
`x_trains_subset_positive_reviews.pickle`: randomly selected positive reviews from training data\
`y_trains_subset_positive_labels.pickle`: positive labels\
`x_train0_negative_reviews.pickle`: all negative reviews from training data\
`y_train0_negative_labels.pickle`: all negative labels from training data

In [2]:
generated = pickle.load( open( "sample_files/generated.pickle", "rb" ) )
x_trains = pickle.load( open( "sample_files/x_trains_subset_positive_reviews.pickle", "rb" ) )
y_trains = pickle.load( open( "sample_files/y_trains_subset_positive_labels.pickle", "rb" ))
x_train0 = pickle.load( open( "sample_files/x_train0_negative_reviews.pickle", "rb" ))
y_train0 = pickle.load( open( "sample_files/y_train0_negative_labels.pickle", "rb" ))
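The cell above leaves each file handle open until it is garbage collected. As a minimal sketch, the same loading can be wrapped in a small helper that closes the file explicitly. The helper name `load_pickle` and the throwaway `demo.pickle` file are assumptions made for illustration; they are not part of the sample files.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled object and close the file handle afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip demonstration on a throwaway file (not one of the sample files).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pickle"
    with open(demo_path, "wb") as handle:
        pickle.dump(["a generated review", "another one"], handle)
    demo = load_pickle(demo_path)

print(type(demo), len(demo))
```

With such a helper, a call like `load_pickle("sample_files/generated.pickle")` would replace the corresponding `pickle.load(open(...))` line.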

3. Verify the type of `generated` and take a look at a sample. These texts are machine-generated, not written by a human.

In [3]:
type(generated)

list

In [4]:
x_trains.shape

(6250, 1)

In [5]:
generated[0]

"the same person does a great author roland go about not unthurming in fact from the end of more it's just doing soliday forces her to catch me aher 3 and his purshe doommates sandler's lily richardson who are supposed directed to childhood to bring characterization with a gunaumod on the flock notable are the best comedy that captures to check he played her dandy wears not being a bit more hamr but in her episoning them in fuction desire kept a sleazy sniper who is beautiful and an aflabiane only to make his life for native personality john forlive name school the entire cameo from the drunken a russian norman the black f the decade and she took apparently now what's great here this is always hard to feel like an the that would be inore i've have seen it again and way and we're a good aspect of the crowd it's nronse for yourself why i suppose this as engaging with the way i'm bad for years set at the cinematography by a brave tv series and absurdish floray and fully attached to reconc

4. Apply squeeze() to x_trains, y_trains, x_train0, y_train0 to get rid of extra dimensions

In [6]:
## Apply squeeze() to x_trains, y_trains, x_train0, y_train0 to get rid of extra dimensions
x_trains = x_trains.squeeze()
y_trains = y_trains.squeeze()
x_train0 = x_train0.squeeze()
y_train0 = y_train0.squeeze()

In [7]:
x_trains.shape

(6250,)
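To see what `squeeze()` does here, the following sketch builds a toy object array shaped like `x_trains` (one variable-length list per row, in a single column) and removes the length-1 axis. The array `toy` and its contents are made up for illustration.

```python
import numpy as np

# Toy stand-in for x_trains: three variable-length reviews in a (3, 1) object array.
toy = np.empty((3, 1), dtype=object)
toy[0, 0] = [1, 14, 22]
toy[1, 0] = [1, 4, 12475, 9]
toy[2, 0] = [1, 6]

# squeeze() drops every axis of length 1, leaving one review per element.
flat = toy.squeeze()
print(toy.shape, "->", flat.shape)

# Naming the axis is a little safer: it only ever removes that one axis.
flat_axis1 = toy.squeeze(axis=1)
```

Passing `axis=1` matters mostly for edge cases: a `(1, 1)` array would collapse to a zero-dimensional array with a bare `squeeze()`, but stays one-dimensional with `squeeze(axis=1)`.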

In [8]:
x_trains[:5][0]

[1,
 165,
 14,
 20,
 16,
 24,
 38,
 78,
 12,
 1367,
 206,
 212,
 5,
 2318,
 50,
 26,
 52,
 156,
 11,
 14,
 22,
 18,
 1825,
 7517,
 39973,
 18746,
 39,
 4,
 1419,
 3693,
 37,
 299,
 18063,
 160,
 73,
 573,
 284,
 9,
 3546,
 4224,
 39,
 2039,
 5,
 289,
 8822,
 4,
 293,
 105,
 26,
 256,
 34,
 3546,
 5788,
 17,
 6184,
 37,
 16,
 184,
 52,
 5,
 82,
 163,
 21,
 4,
 31,
 37,
 91,
 770,
 72,
 16,
 628,
 8335,
 17,
 4500,
 39520,
 29,
 299,
 6,
 275,
 109,
 74,
 29,
 633,
 127,
 88,
 11,
 85,
 108,
 40,
 4,
 1419,
 3693,
 1395,
 5808,
 4,
 31025,
 42,
 4,
 43737,
 29,
 299,
 6,
 55,
 2259,
 415,
 5,
 11,
 7242,
 4,
 299,
 220,
 4,
 1961,
 6,
 132,
 209,
 101,
 1438,
 63,
 16,
 327,
 8,
 67,
 4,
 64,
 66,
 1566,
 155,
 44,
 14,
 22,
 26,
 4,
 450,
 1268,
 7,
 4,
 182,
 3331,
 2216,
 63,
 166,
 14,
 22,
 382,
 168,
 6,
 117,
 1967,
 444,
 13,
 197,
 14,
 16,
 6,
 184,
 52,
 117,
 22]

5. Verify the length of `x_trains`

In [9]:
len(x_trains)

6250

6. Since the generated text surely contains misspelled words, it makes sense to use character-based tokenization instead of the word-based tokenization that came with the dataset. The plan is to encode these reviews at the character level. Therefore, you need to decode the reviews from tokens to plain text, merge this text with the generated text, and then tokenize all of the text at the character level. Remember that in `tf.keras.datasets.imdb`, the first four integers need to be accounted for as reserved tokens:

In [10]:
word_index = tf.keras.datasets.imdb.get_word_index()
# These indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

In [11]:
x_trains

array([list([1, 165, 14, 20, 16, 24, 38, 78, 12, 1367, 206, 212, 5, 2318, 50, 26, 52, 156, 11, 14, 22, 18, 1825, 7517, 39973, 18746, 39, 4, 1419, 3693, 37, 299, 18063, 160, 73, 573, 284, 9, 3546, 4224, 39, 2039, 5, 289, 8822, 4, 293, 105, 26, 256, 34, 3546, 5788, 17, 6184, 37, 16, 184, 52, 5, 82, 163, 21, 4, 31, 37, 91, 770, 72, 16, 628, 8335, 17, 4500, 39520, 29, 299, 6, 275, 109, 74, 29, 633, 127, 88, 11, 85, 108, 40, 4, 1419, 3693, 1395, 5808, 4, 31025, 42, 4, 43737, 29, 299, 6, 55, 2259, 415, 5, 11, 7242, 4, 299, 220, 4, 1961, 6, 132, 209, 101, 1438, 63, 16, 327, 8, 67, 4, 64, 66, 1566, 155, 44, 14, 22, 26, 4, 450, 1268, 7, 4, 182, 3331, 2216, 63, 166, 14, 22, 382, 168, 6, 117, 1967, 444, 13, 197, 14, 16, 6, 184, 52, 117, 22]),
       list([1, 4, 12475, 9, 6, 680, 22, 94, 293, 109, 5264, 256, 19, 35, 1732, 1493, 7, 1382, 5, 39453, 34, 3977, 9, 24, 129, 801, 632, 5264, 9, 6, 569, 132, 37, 9, 6606, 6, 522, 1696, 113, 18358, 260, 1084, 5284, 153, 11, 4, 5951, 7, 1043, 6448, 588, 29, 1

In [12]:
decode_review(x_trains[0])

'<START> actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film'
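The effect of the `+3` shift is easier to see on a tiny vocabulary. The sketch below uses a hypothetical three-word index (`toy_word_index`) in place of the real one returned by `get_word_index()`, so it runs without downloading anything.

```python
# Hypothetical three-word vocabulary, ranked from 1 like the real word index.
toy_word_index = {"the": 1, "movie": 2, "good": 3}

# Shift every rank by 3 to make room for the four reserved tokens.
shifted = {word: rank + 3 for word, rank in toy_word_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so token ids can be turned back into words.
reverse = {index: word for word, index in shifted.items()}


def decode(tokens):
    """Map token ids back to words; ids missing from the index become '?'."""
    return " ".join(reverse.get(token, "?") for token in tokens)


decoded = decode([1, 4, 5, 6, 99])
print(decoded)  # <START> the movie good ?
```

Without the shift, token 1 would decode to "the" instead of `<START>`, and every word in the review would be off by three positions in the vocabulary.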

7. Remove the `<START>` token. The review is decoded properly, as it reads coherently. However, the first token is `<START>`. You need to remove it from each review because it does not provide any knowledge for text generation; seeing `<START>` doesn't help you predict what the next word is. `<START>` is merely a leading label. What you want the model to pay attention to is the text. Since each review is a list of tokens, you may use the [`pop`](https://www.geeksforgeeks.org/python-removing-first-element-of-list/) method to remove the first element (token) of each review.

In [13]:
#Remove reserved token in subset positive reviews
for record in x_trains:
  record.pop(0)

In [14]:
x_trains

array([list([165, 14, 20, 16, 24, 38, 78, 12, 1367, 206, 212, 5, 2318, 50, 26, 52, 156, 11, 14, 22, 18, 1825, 7517, 39973, 18746, 39, 4, 1419, 3693, 37, 299, 18063, 160, 73, 573, 284, 9, 3546, 4224, 39, 2039, 5, 289, 8822, 4, 293, 105, 26, 256, 34, 3546, 5788, 17, 6184, 37, 16, 184, 52, 5, 82, 163, 21, 4, 31, 37, 91, 770, 72, 16, 628, 8335, 17, 4500, 39520, 29, 299, 6, 275, 109, 74, 29, 633, 127, 88, 11, 85, 108, 40, 4, 1419, 3693, 1395, 5808, 4, 31025, 42, 4, 43737, 29, 299, 6, 55, 2259, 415, 5, 11, 7242, 4, 299, 220, 4, 1961, 6, 132, 209, 101, 1438, 63, 16, 327, 8, 67, 4, 64, 66, 1566, 155, 44, 14, 22, 26, 4, 450, 1268, 7, 4, 182, 3331, 2216, 63, 166, 14, 22, 382, 168, 6, 117, 1967, 444, 13, 197, 14, 16, 6, 184, 52, 117, 22]),
       list([4, 12475, 9, 6, 680, 22, 94, 293, 109, 5264, 256, 19, 35, 1732, 1493, 7, 1382, 5, 39453, 34, 3977, 9, 24, 129, 801, 632, 5264, 9, 6, 569, 132, 37, 9, 6606, 6, 522, 1696, 113, 18358, 260, 1084, 5284, 153, 11, 4, 5951, 7, 1043, 6448, 588, 29, 150, 65

In [15]:
#Remove reserved token in all negative reviews
for record in x_train0:
  record.pop(0)
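Note that `pop(0)` modifies each list in place, so running either cell a second time would remove a real word from every review. A minimal sketch of a rerun-safe alternative is shown below; the helper `strip_start` and the toy reviews are assumptions for illustration, not part of the course code.

```python
START_TOKEN = 1  # index of <START> after the +3 shift used above


def strip_start(review):
    """Return a copy of the review without a leading <START> token, if present."""
    if review and review[0] == START_TOKEN:
        return review[1:]
    return list(review)


toy_reviews = [[1, 165, 14], [1, 4, 12475]]
once = [strip_start(r) for r in toy_reviews]
twice = [strip_start(r) for r in once]  # applying it again changes nothing

print(once)
print(twice)
```

Because the helper only removes a token that is actually `<START>`, applying it repeatedly leaves the reviews unchanged after the first pass.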

8. Now the first token is not 1, which means `<START>` tokens are removed as expected. Next, you are going to put together (by concatenation) positive and negative reviews and convert them to plain text.

In [16]:
reviews_np = np.concatenate((x_trains, x_train0), axis=None) # axis=None flattens the inputs before joining; for these 1-D arrays the result matches axis=0
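The following sketch reproduces the concatenation on toy data so the resulting shape and order can be checked by eye. The helper `as_object_array` and the toy reviews are hypothetical; they only mimic the structure of `x_trains` and `x_train0` after `squeeze()`.

```python
import numpy as np


def as_object_array(reviews):
    """Pack variable-length lists into a 1-D object array, one list per slot."""
    out = np.empty(len(reviews), dtype=object)
    for i, review in enumerate(reviews):
        out[i] = review
    return out


positives = as_object_array([[165, 14], [4, 12475, 9]])
negatives = as_object_array([[6, 680], [22], [94, 293, 109]])

# Positives come first, then negatives, exactly in the order they are passed.
joined = np.concatenate((positives, negatives), axis=None)
print(joined.shape)  # (5,)
```

Each element of `joined` is still a Python list of token ids, so it can be passed directly to a decoding function.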

9. Decode the numpy array to plain text so that you can tokenize the text. Notice the order of the arguments to `np.concatenate` in the previous step: first the positive subset of original reviews, then all negative reviews.

In [28]:
plain_text_holder = []
for review in reviews_np:
    plain_text = decode_review(review)
    plain_text_holder.append(plain_text)


10. Inspect the text in `plain_text_holder`. Make sure the text is decoded properly. Simply inspect a few samples to make sure the sentences are coherent.

In [29]:
plain_text_holder[:1]

['actually this movie was not so bad it contains action comedy and excitement there are good actors in this film for instance doug hutchison percy from the green mile who plays bristol another well known actor is jamie kennedy from scream and three kings the main characters are played by jamie foxx as alvin who was pretty good and also funny but the one who most surprised me was david morse as edgar clenteen he plays a different character than he usually does because in other films like the green mile indian runner the negotiator or the langoliers he plays a very sympathetic person and in bait the plays almost the opposite a man without any emotions which was nice to see the only really negative thing about this film are the several pictures of the world trade center which makes this film perhaps look a little dated overall i thought this was a pretty good little film']

In [30]:
len(plain_text_holder)

18750

11. Merge the generated text into the training text. The result should be a list. You may use `+` to combine two lists into one.

In [31]:
all_training_text = generated + plain_text_holder

Notice the order of `all_training_text`: first the generated text, then `plain_text_holder`. Therefore, the order of the records is: generated positive text, then the positive subset of original reviews, then all negative reviews. This order is important, because you have to construct the label array to match it.

In [32]:
# Original and generated reviews are now combined in one list; confirm the type before tokenizing at character level.
type(all_training_text)

list
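Character-level tokenization itself is not performed in this milestone, but a small sketch may help show why misspelled words are not a problem for it. The version below is plain Python with a hand-built vocabulary; it is an illustration under that assumption, not the tokenizer used later in the course, and the toy texts are made up.

```python
toy_texts = ["a good film", "unthurming plot"]  # second text has a made-up word

# Assign each distinct character an id, reserving 0 for padding.
chars = sorted(set("".join(toy_texts)))
char_to_id = {ch: i + 1 for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Every text becomes a sequence of character ids, misspellings included.
encoded = [[char_to_id[ch] for ch in text] for text in toy_texts]
restored = ["".join(id_to_char[i] for i in seq) for seq in encoded]

print(len(char_to_id), "distinct characters")
print(encoded[0])
```

Because the vocabulary is built from characters, a nonsense word such as "unthurming" is encoded without any out-of-vocabulary token.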

12.  Now save `all_training_text` as a pickle file. It will be used as training data in the next milestone.

In [33]:
with open('all_train_text.pkl', 'wb') as myfile:
    pickle.dump(all_training_text, myfile)

These steps demonstrate how to merge generated text with part of the training dataset. The merged text is stored as a pickle file for the next milestone.

13. There is also a need to label the generated text as positive reviews. You can reuse the training labels already provided with the original data, then concatenate them with the labels for the positive and negative reviews. Remember the order of `all_training_text`: `generated` first, followed by the positives, and finally the negatives. Your label array needs to follow the same order. First, make a label array for the generated text: a `numpy` array of 1s, repeated `len(generated)` times. Then concatenate this array, `y_trains`, and `y_train0`, and name the result `y_train_assembled`. It is common practice to save `numpy` arrays in `numpy` format, so save `y_train_assembled` as a `.npy` file for the next milestone.

In [None]:
y_trains_generated = y_trains[:len(generated)]

In [None]:
len(y_trains_generated) # Make sure it equals len(generated), the number of generated texts.

In [41]:
y_trains

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [42]:
y_train0

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
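An alternative to slicing existing labels is to build the label array for the generated text directly with `np.ones`, as the step description suggests. The sketch below does this on toy data and assembles the labels in the required order; all names ending in `_toy` are placeholders for the real arrays.

```python
import numpy as np

# Toy sizes standing in for the generated, positive and negative sets.
generated_toy = ["gen a", "gen b"]
y_pos_toy = np.array([1, 1, 1], dtype=np.int64)
y_neg_toy = np.array([0, 0, 0, 0], dtype=np.int64)

# One positive label per generated text.
y_generated_toy = np.ones(len(generated_toy), dtype=np.int64)

# Same order as the text: generated, then positives, then negatives.
y_assembled_toy = np.concatenate((y_generated_toy, y_pos_toy, y_neg_toy))

print(y_assembled_toy)
```

The assembled array should always have exactly as many entries as the combined text list; a mismatch means some group of labels was left out.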

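In [50]:
# Labels must follow the same order as all_training_text: generated, positive, negative
y_train_assembled = np.concatenate((y_trains_generated, y_trains, y_train0))

In [51]:
np.save('y_train_assembled.npy', y_train_assembled)

In [52]:
#FYI. This is how you will load a numpy array from a file
with open('y_train_assembled.npy', 'rb') as f:
    a = np.load(f)

In [53]:
len(a) # Verify length: 6250 generated + 6250 positive + 12500 negative

25000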

In [54]:
#FYI. This is how you will load a pickle file
with open('all_train_text.pkl', 'rb') as f:
    b = pickle.load(f)
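The introduction mentions that the concatenated data are shuffled before training. Since text and labels are stored separately, they have to be shuffled with the same permutation. The sketch below shows one way to do this on toy data; the variable names and the seed value are arbitrary choices for illustration.

```python
import numpy as np

texts_toy = ["gen a", "pos a", "neg a", "neg b"]
labels_toy = np.array([1, 1, 0, 0])

# Text and labels must line up one-to-one before shuffling.
assert len(texts_toy) == len(labels_toy)

rng = np.random.default_rng(seed=42)  # fixed seed, chosen arbitrarily
order = rng.permutation(len(texts_toy))

# Apply the same permutation to both so each text keeps its label.
shuffled_texts = [texts_toy[i] for i in order]
shuffled_labels = labels_toy[order]

for text, label in zip(shuffled_texts, shuffled_labels):
    print(label, text)
```

The same length check, applied to the loaded `a` and `b`, is a quick way to confirm that the saved labels and text still correspond.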