## Working with embedding layers and 1D convolution layers
- examples using sequential models
- demos of dropout for sequential models and bidirectional sequential layers along the way

Code adapted from the book *Deep Learning with Python*

In [None]:
# Get raw imdb dataset
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2022-04-11 22:06:32--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-04-11 22:06:35 (36.6 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [None]:
# Untar it to a new folder
! tar xf aclImdb_v1.tar.gz

In [None]:
# Build corpus of docs and labels
import os

imdb_dir = 'aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

# Reviews live in train/neg and train/pos; label them 0 and 1 respectively
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The reviews are UTF-8 text; the context manager closes the file
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)

In [None]:
len(texts)

25000

In [None]:
print(texts[0])
print(labels[0])

Bugs Bunny accidentally ends up at the South Pole while trying to vacation in Florida. Where he meets a little penquin, which he tries to save from an Eskimo. This short tries and the penquin is adorable, but in the end it's a bit too light in the laughs department. The Eskimo isn't really that great of a foil for Bugs and I just seen a lot better Bugs Bunny cartoons frankly, even other shorts when he's paired with other unknown antagonists. So I can't in good conscience recommend this one. However it is nice to see it in it's uncut form. This cartoon is on Disk 3 of the "Looney Tunes Golden Collection Volume 1" <br /><br />My Grade: C
0


In [None]:
# Tokenize the data into sequences of integer word indices
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # Keep 100 tokens per review (pad_sequences keeps the last 100 by default)
training_samples = 10000  # We will be training on 10,000 samples
validation_samples = 10000  # We will be validating on 10,000 samples
max_words = 10000  # We will only consider the top 10,000 words in the dataset

# Instantiate the tokenizer
tokenizer = Tokenizer(num_words=max_words)

# Build the word-to-index dictionary from the training texts
tokenizer.fit_on_texts(texts)

# Convert each text to a sequence of integer word indices
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Left-pad short sequences with zeros and truncate long ones to maxlen
data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Split the data into a training set and a validation set
# But first, shuffle the data, since we started from data
# where samples are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]  # each row holds maxlen (100) word indices
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
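
The tokenizing and padding steps above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it splits on whitespace only, and the helper names (`fit_word_index`, `to_sequence`, `pad_pre`) and the toy texts are made up for this example. It mirrors three behaviours of the defaults used above: the most frequent word gets index 1, only indices below `num_words` are kept, and padding and truncation both happen at the start of the sequence.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]


def pad_pre(seq, maxlen):
    """Keep only the last `maxlen` indices, then left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


# Toy corpus (hypothetical), standing in for the IMDB reviews
toy_texts = ["the movie was great", "the movie was bad", "great acting"]
toy_index = fit_word_index(toy_texts)
toy_seqs = [to_sequence(t, toy_index, num_words=5) for t in toy_texts]
toy_padded = [pad_pre(s, maxlen=3) for s in toy_seqs]

print(toy_index)
print(toy_padded)
```

Because index 0 is never assigned to a word, it is free to act as the padding value.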


In [None]:
data[0]

array([ 107,    3,  173,    4, 2793, 6889,  105,  448,    4,    1,  198,
        303,    5,  101,   11,    6,    3,   49,   19, 4223,    9,   20,
         42,  202,    9,   13,  181,  354,    7,    7,   22,  112,   76,
          3,  144,   49,  165,   30,    1, 5070,  668,  390,   18,   48,
         10,  119,   64,   13,  181, 2671,    1,  338,  136,  303,    5,
         27, 3930,    5, 1894,   53,    1,   88,   19,   21,    5,  199,
         98, 3481,  276,   44,   10,  216,    1, 2764, 2028,  602,   10,
        200,   27,   50, 1553,    7,    7,    1,  274,   16,    1,  516,
       1310,   16,    2,  668,   13,  181, 1593,  115,  170,    4,    1,
         17], dtype=int32)

In [None]:
# Let's start with a model that ignores the sequential steps that make up each 
# observation
from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
# Specify the size of your vocabulary (i.e., 10,000 terms)
# Specify the number of features you want to extract by fitting weights in your
# embedding matrix (e.g., 16)
# We also specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
model.add(Embedding(10000, 16, input_length=maxlen))
# After the Embedding layer, 
# our activations have shape `(samples, maxlen, 16)`.

# We flatten the 3D tensor of embeddings 
# into a 2D tensor of shape `(samples, maxlen * 16)`
model.add(Flatten())

# We add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 16)           160000    
                                                                 
 flatten_1 (Flatten)         (None, 1600)              0         
                                                                 
 dense_1 (Dense)             (None, 1)                 1601      
                                                                 
Total params: 161,601
Trainable params: 161,601
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
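
The parameter counts in the summary above can be checked by hand. The short sketch below reproduces the arithmetic; the variable names are chosen for this example only.

```python
vocab_size, embed_dim, seq_len = 10000, 16, 100

# Embedding: one 16-dimensional vector per vocabulary index
embedding_params = vocab_size * embed_dim

# Flatten has no weights; it reshapes (100, 16) into 1600 features
flat_features = seq_len * embed_dim

# Dense(1): one weight per flattened feature plus one bias
dense_params = flat_features * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flat_features, dense_params, total_params)
```

Nearly all of the weights sit in the embedding matrix, which is why the choice of vocabulary size and embedding dimension dominates the model size.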


In [None]:
# What does the output of the embedding layer look like?
# (It maps each integer word index to a dense vector, preserving the order
# of the input sequence)
# Credit: Code adjusted from ML Mastery

import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(1000, 2, input_length=10))
# The model will take as input an integer matrix of size (batch, input_length).  
# Now model.output_shape is (None, 10, 2), where `None` is the batch  
# dimension.  
input_array = np.random.randint(1000, size=(1, 10))
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
print(output_array.shape)
print(output_array[0])

(1, 10, 2)
[[-0.01537427 -0.01595243]
 [-0.02226641 -0.02764404]
 [ 0.03548204  0.04067739]
 [ 0.03737131  0.03836199]
 [ 0.03798166  0.0163169 ]
 [-0.01775052  0.04169117]
 [ 0.04707558  0.0202958 ]
 [-0.03491955 -0.03958663]
 [-0.01171577  0.02444894]
 [-0.011288    0.00346855]]
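
Conceptually, an embedding layer is a lookup table: each integer index selects one row of a weight matrix. The NumPy sketch below illustrates that idea under stated assumptions; `weights` is a random stand-in for the layer's weights (not values read from the Keras model), and the sizes match the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 1000, 2, 10

# Stand-in for the layer's weight matrix: one row per vocabulary index
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# One batch containing a single sequence of 10 word indices
token_ids = rng.integers(0, vocab_size, size=(1, seq_len))

# Integer-array indexing picks one row per token and keeps the sequence order
embedded = weights[token_ids]
print(embedded.shape)
```

Training an embedding layer amounts to adjusting the rows of this matrix so that the vectors become useful for the task.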


In [None]:
# What if we wanted to use a matrix of pretrained embeddings?  
# Same as transfer learning before, but now we are importing a pretrained 
# Embedding matrix:
# Download GloVe embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2022-04-11 22:31:20--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2022-04-11 22:31:20--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2022-04-11 22:31:20--  http://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [applic

In [None]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [None]:
# Extract embedding data for the 100-feature embedding matrix
glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds a word followed by the components of its vector
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [None]:
# Build embedding matrix
embedding_dim = 100  # must match the GloVe file loaded above (50, 100, 200 or 300)

# Rows for words not found in the GloVe index stay all-zeros
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
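
It can be useful to check how many vocabulary rows actually received a pretrained vector. The sketch below runs the same matrix-building logic on a tiny, made-up vocabulary and counts the covered rows; `toy_word_index`, `toy_vectors` and the 3-dimensional vectors are hypothetical stand-ins for `word_index` and `embeddings_index`.

```python
import numpy as np

# Hypothetical tokenizer dictionary and pretrained vectors
toy_word_index = {"the": 1, "movie": 2, "penquin": 3, "great": 4}
toy_vectors = {
    "the": np.array([0.1, 0.2, 0.3]),
    "movie": np.array([0.4, 0.5, 0.6]),
    "great": np.array([0.7, 0.8, 0.9]),
}
toy_max_words, toy_dim = 4, 3

toy_matrix = np.zeros((toy_max_words, toy_dim))
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    # Skip indices outside the vocabulary cut-off and words without a vector
    if i < toy_max_words and vector is not None:
        toy_matrix[i] = vector

# Row 0 is the padding index, so only rows 1 and up are counted
covered = int(np.count_nonzero(toy_matrix[1:].any(axis=1)))
print(f"{covered} of {toy_max_words - 1} vocabulary rows have a pretrained vector")
```

Here the misspelled "penquin" has no pretrained vector and "great" falls outside the cut-off, so both end up without a usable row.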

In [None]:
# Set up a similar model architecture to before (with an extra Dense(32) hidden
# layer) and then import GloVe weights to the Embedding layer:

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()



Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 flatten_2 (Flatten)         (None, 10000)             0         
                                                                 
 dense_2 (Dense)             (None, 32)                320032    
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________


In [None]:

# Add weights in the same manner as transfer learning and turn off the trainable
# option before fitting the model, to freeze the weights.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False



model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')


# Training data small to speed up training. Increase for better fit.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Evaluate model on test set (need to preprocess test data to same structure first)

test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)

# Reuse the tokenizer object we fit to the training data above
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

In [None]:
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)





[0.874281644821167, 0.6893600225448608]

## Fitting sequential models to text data

In [None]:
# Example 1: simple RNN
from tensorflow.keras.layers import Dense, Embedding, LSTM, SimpleRNN
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

# Small training data.  Increase for model improvement

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
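
What a simple recurrent layer computes can be sketched in NumPy: at each time step the new state is `tanh(x_t @ W + state @ U + b)`. This is a minimal illustration for a single sequence, not the Keras implementation; the function name `simple_rnn_forward` and the random weights are assumptions made for this example.

```python
import numpy as np


def simple_rnn_forward(inputs, W, U, b, return_sequences=False):
    """inputs has shape (timesteps, input_dim); the state has shape (units,)."""
    state = np.zeros(U.shape[0])
    outputs = []
    for x_t in inputs:
        # Combine the current input with the previous state
        state = np.tanh(x_t @ W + state @ U + b)
        outputs.append(state)
    # Either every intermediate state, or only the final one
    return np.stack(outputs) if return_sequences else state


rng = np.random.default_rng(0)
timesteps, input_dim, units = 100, 32, 32
toy_inputs = rng.normal(size=(timesteps, input_dim))
W = rng.normal(scale=0.1, size=(input_dim, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

last_state = simple_rnn_forward(toy_inputs, W, U, b)
all_states = simple_rnn_forward(toy_inputs, W, U, b, return_sequences=True)
print(last_state.shape, all_states.shape)
```

A recurrent layer that feeds another recurrent layer needs the full sequence of states, which is what `return_sequences=True` provides; the last recurrent layer before a `Dense` classifier only needs the final state.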


In [None]:
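# Example 2: Stacked recurrent layers (an LSTM followed by SimpleRNN layers)
# return_sequences=True makes a layer output its state at every time step,
# which the next recurrent layer needs as its input sequence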

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

# Small training data.  Increase for model improvement

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Example 3: LSTM layer

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

# Small training data.  Increase for model improvement

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# Example 4: Bidirectional LSTM
from tensorflow.keras.layers import Embedding, Bidirectional

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
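
To compare the size of the recurrent layers used in Examples 1 to 4, the weight counts can be worked out by hand. The helper names below are made up for this sketch, and the formulas assume the standard layer definitions with bias terms: a simple RNN has one input kernel, one recurrent kernel and one bias, an LSTM has four such sets (one per gate plus the candidate state), and a bidirectional wrapper holds two independent copies of the wrapped layer.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias."""
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    """Four weight sets of the same shape as a simple RNN's."""
    return 4 * simple_rnn_params(input_dim, units)


input_dim, units = 32, 32  # embedding size and layer width used above

rnn_count = simple_rnn_params(input_dim, units)
lstm_count = lstm_params(input_dim, units)
bidirectional_count = 2 * lstm_count  # forward and backward copies

print(rnn_count, lstm_count, bidirectional_count)
```

With the default merge mode, the bidirectional layer concatenates the two directions, so the `Dense` classifier after it receives 64 features instead of 32.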


### Some Dropout Examples for LSTM layers
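Applying dropout to sequential models was a difficult puzzle for a long time: dropping different units at each time step disrupts the signal a recurrent layer carries through the sequence. It was solved relatively recently by dropping out the same hidden node locations (the same dropout mask) at every time step.

`LSTM(128, dropout=0.2, recurrent_dropout=0.2)`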

    dropout: Float between 0 and 1.  
        Fraction of the units to drop for  
        the linear transformation of the inputs.  
    recurrent_dropout: Float between 0 and 1.  
        Fraction of the units to drop for  
        the linear transformation of the recurrent state.
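
The idea of dropping the same units at every time step can be illustrated with NumPy broadcasting. This is a sketch of the concept only, not the internal Keras code; the array sizes and names are arbitrary choices for the example, and the rescaling by `1 / (1 - rate)` is the usual convention for keeping the expected activation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, features, rate = 2, 5, 8, 0.25

# Toy activations of shape (batch, timesteps, features)
toy_batch = np.ones((batch, timesteps, features))

# One mask per sample, with a length-1 time axis so it broadcasts over steps
keep = rng.random((batch, 1, features)) >= rate
dropped = toy_batch * keep / (1.0 - rate)

# Every time step of a sample has exactly the same units zeroed out
same_units_every_step = bool(np.all(dropped == dropped[:, :1, :]))
print(same_units_every_step)
```

Drawing the mask with shape `(batch, timesteps, features)` instead would give a different mask at each step, which is the variant described above as problematic for recurrent layers.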

In [None]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.2)



In [None]:
score, acc = model.evaluate(x_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.3928277790546417
Test accuracy: 0.8240000009536743


## Sequential models using 1D Convnets

In [None]:
# Use 1D Conv layers rather than RNN, LSTM or GRU layers to fit the model
# Why? It is a much lighter model to fit. Here we fit it on x_train, as before.
# If you try to build a model using the LSTM code after running this one, it
# will be much slower.

from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Embedding(10000, 8, input_length=maxlen))
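model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))  # downsample the sequence by a factor of 5
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
# No activation: the model outputs a raw, unbounded score rather than a probability
model.add(layers.Dense(1))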

model.summary()





Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_9 (Embedding)     (None, 100, 8)            80000     
                                                                 
 conv1d (Conv1D)             (None, 94, 32)            1824      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 18, 32)           0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, 12, 32)            7200      
                                                                 
 global_max_pooling1d (Globa  (None, 32)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_9 (Dense)             (None, 1)                
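
The output shapes and parameter counts in the summary above follow from simple arithmetic. The helpers below are written for this sketch (they are not Keras functions) and assume unpadded convolutions with stride 1 and non-overlapping pooling, which are the defaults used in the model.

```python
def conv1d_length(length, kernel_size):
    """Output length of an unpadded convolution with stride 1."""
    return length - kernel_size + 1


def pool1d_length(length, pool_size):
    """Output length of max pooling whose stride equals the pool size."""
    return (length - pool_size) // pool_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of shape (kernel_size, in_channels) per filter, plus biases."""
    return in_channels * kernel_size * filters + filters


lengths = [100]                                        # maxlen
lengths.append(conv1d_length(lengths[-1], 7))          # first Conv1D
lengths.append(pool1d_length(lengths[-1], 5))          # MaxPooling1D
lengths.append(conv1d_length(lengths[-1], 7))          # second Conv1D

first_conv = conv1d_params(8, 32, 7)
second_conv = conv1d_params(32, 32, 7)
print(lengths)
print(first_conv, second_conv)
```

Each kernel of width 7 sees seven consecutive word vectors at a time, so the convolution picks up local word patterns wherever they occur in the review.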

In [None]:
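# The final Dense(1) layer has no sigmoid, so the loss and the accuracy metric
# must treat the model outputs as logits
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.metrics import BinaryAccuracy

model.compile(optimizer=RMSprop(learning_rate=1e-4),
              loss=BinaryCrossentropy(from_logits=True),
              metrics=[BinaryAccuracy(name='acc', threshold=0.0)])
history = model.fit(x_train, y_train,
                    epochs=1,
                    batch_size=128,
                    validation_split=0.2)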



In [None]:
# How to write a preprocessor function for text preprocessing using Keras?

def preprocessor(textinput, maxlen=100):
    """Convert a list of raw strings into padded sequences of word indices."""
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Reuses the tokenizer fit on the training texts earlier in the notebook
    sequences = tokenizer.texts_to_sequences(textinput)

    # Left-pads (or truncates) every sequence to exactly maxlen indices
    data = pad_sequences(sequences, maxlen=maxlen)
    return data



# See preprocessor output
review = ["This movie is amazing, wonderful, unique and beautiful! I give it five stars."]
print(preprocessor(review).shape)
print(preprocessor(review))

# The last Dense layer has no sigmoid, so these predictions are raw scores, not probabilities
print(model.predict(preprocessor(review)))
print(model.predict(preprocessor(["This movie is horrible."])))


(1, 100)
[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  11  17   6
  477 386 952   2 304  10 199   9 674 379]]
[[-0.0392569]]
[[-0.01441896]]
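
Because the final `Dense(1)` layer of the 1D convnet has no activation, the two values printed above are raw scores rather than probabilities. If they are treated as logits, which is an assumption about how the model is meant to be read, a sigmoid maps them into the 0 to 1 range. The sketch below applies it to the two printed values.

```python
import math


def sigmoid(score):
    """Map an unbounded score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


# The two raw scores printed above
raw_scores = [-0.0392569, -0.01441896]
probabilities = [sigmoid(s) for s in raw_scores]
print(probabilities)
```

Both values land very close to 0.5, which is what one would expect from a model trained for a single epoch at a small learning rate: it has barely moved from an uninformed guess.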
