In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf # pip install tensorflow-macos
import os
tf.random.set_seed(1)

# The typical architecture of a Recurrent Neural Network (RNN)
- The premise of an RNN is simple: use information from the past to help you with the future (this is where the term recurrent comes from). In other words, take an input (X) and compute an output (y) based on all previous inputs.

When an RNN looks at a sequence of text (already in numerical form), the patterns it learns are continually updated based on the order of the sequence.

For a simple example, take two sentences:

1. Massive earthquake last week, no?
2. No massive earthquake last week.

Both contain exactly the same words but have different meanings. The order of the words determines the meaning (one could argue punctuation marks also dictate the meaning, but for simplicity's sake, let's stay focused on the words).
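
To make the point about word order concrete, here is a minimal sketch in plain Python (not part of the original notebook). The helper `words` is a hypothetical name; it lowercases, strips punctuation and splits on whitespace. An order-free view (a bag of words) cannot tell the two sentences apart, while the ordered token lists differ.

```python
from collections import Counter
import string

def words(sentence):
    """Lowercase, strip punctuation and split on whitespace (illustrative helper)."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

a = words("Massive earthquake last week, no?")
b = words("No massive earthquake last week.")

same_bag = Counter(a) == Counter(b)  # order-free view: identical word counts
same_order = a == b                  # order-aware view: the sequences differ

print(f"Same bag of words: {same_bag}")
print(f"Same word order:   {same_order}")
```

A model that only sees word counts gets the same input for both sentences, which is the motivation for sequence models such as RNNs.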

### actual architecture
1. input layer
2. text vectorization layer
3. embedding
4. RNN cell(s)
5. Hidden activation
6. pooling layer (sometimes, usually for Conv1D models)
7. fully connected layer
8. output layer



# Preparing a notebook for our first NLP with TensorFlow project
[data from kaggle](https://www.kaggle.com/competitions/nlp-getting-started/leaderboard)

In [2]:
from helper_functions import create_tensorboard_callback, unzip_data, plot_loss_curves, compare_historys

In [3]:
zip_path = "nlp_getting_started.zip"
if not os.path.isfile(zip_path):
    os.chdir("data")
unzip_data(zip_path)

# Becoming one with the data and visualizing a text dataset

In [4]:
# we can use pandas because the data isn't too big
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# shuffle training data
train_df = train_df.sample(frac=1, random_state=1)

In [5]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
3228,4632,emergency%20services,"Sydney, New South Wales",Goulburn man Henry Van Bilsen missing: Emergen...,1
3706,5271,fear,,The things we fear most in organizations--fluc...,0
6957,9982,tsunami,Land Of The Kings,@tsunami_esh ?? hey Esh,0
2887,4149,drown,,@POTUS you until you drown by water entering t...,0
7464,10680,wounds,"cody, austin follows ?*?",Crawling in my skin\nThese wounds they will no...,1


In [6]:
train_df.iloc[0]["text"]

'Goulburn man Henry Van Bilsen missing: Emergency services are searching for a Goulburn man who disappeared from his\x89Û_ http://t.co/z99pKJzTRp'

In [7]:
# test data frame looks the same but without targets
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [8]:
train_df["target"].value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
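
As a rough reference point for the accuracies reported later, the sketch below computes what a classifier that always predicts the majority class would score. The counts are copied from the `value_counts()` output above. The figure is for the full training frame, so it is only an approximation of what the same strategy would score on the validation split.

```python
# Class counts copied from the value_counts() output above
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
majority_accuracy = counts[majority_class] / total

print(f"Total samples: {total}")
print(f"Always predicting class {majority_class} gives accuracy {majority_accuracy:.3f}")
```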

In [9]:
print(len(train_df), len(test_df))

7613 3263


In [10]:
# let's visualize some random samples
import random
random_index = random.randint(0, len(train_df)-5)
for row in train_df[["text", "target"]][random_index:random_index+5].itertuples():
    _, text, target = row
    print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
    print(f"Text:\n{text}\n")
    print("---\n")

Target: 1 (real disaster)
Text:
Militants attack police post in Udhampur; 2 SPOs injured

Suspected militants Thursday  attacked a police post in... http://t.co/1o0j9FCPBi

---

Target: 0 (not real disaster)
Text:
I hear the mumbling i hear the cackling i got em scared shook panicking

---

Target: 1 (real disaster)
Text:
8/6/2015@2:09 PM: TRAFFIC ACCIDENT NO INJURY at 2781 WILLIS FOREMAN RD http://t.co/VCkIT6EDEv

---

Target: 1 (real disaster)
Text:
Evacuation order lifted for town of Roosevelt Wash. though residents warned to be ready to leave quickly http://t.co/Na0ptN0dTr

---

Target: 0 (not real disaster)
Text:
You messed up my feeling like a hurricane damaged this broken home

---



# Splitting data into training and validation sets

In [11]:
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df["text"].to_numpy(),
                                                                            train_df["target"].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=1)

In [12]:
print(len(train_sentences), len(train_labels), len(val_sentences), len(val_labels))

6851 6851 762 762


In [13]:
train_sentences[:-10]

array(['Rly tragedy in MP: Some live to recount horror: \x89ÛÏWhen I saw coaches of my train plunging into water I called my daughters and said t...',
       'A river of lava in the sky this evening! It was indeed a beautiful sunset sky tonight. (8-4-15) http://t.co/17EGMlNi80',
       'Los Angeles Times: Arson suspect linked to 30 fires caught in Northern ... - http://t.co/xwMs1AWW8m #NewsInTweets http://t.co/TE2YeRugsi',
       ...,
       "When you lowkey already know you're gonna drown in school this year :) http://t.co/aCMrm833zq",
       '@JamesMelville Some old testimony of weapons used to promote conflicts\nTactics - corruption &amp; infiltration of groups\nhttps://t.co/cyU8zxw1oH',
       "Just got evacuated from the movie theatre for an emergency. Saw people running from another they're."],
      dtype=object)

In [14]:
train_labels[:-10]

array([1, 0, 1, ..., 0, 0, 1])

# Converting text data to numbers using tokenisation and embeddings (overview)
### Tokenization
- A straight mapping from word or character or sub-word to a numerical value. There are three main levels of tokenization:
1. Using word-level tokenization with the sentence "I love TensorFlow" might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
2. Character-level tokenization, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
3. Sub-word tokenization is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple tokens.
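
The first two levels can be sketched in a few lines of plain Python. This is an illustration only: `build_vocab` is a hypothetical helper that assigns ids in order of first appearance, which is not how a real tokenizer would order its vocabulary.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

sentence = "I love TensorFlow"

# Word-level: every word is one token
word_tokens = sentence.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]

# Character-level: every letter is one token, mapped a-z -> 1-26
char_tokens = list(sentence.lower().replace(" ", ""))
char_ids = [ord(c) - ord("a") + 1 for c in char_tokens]

print(word_ids)
print(char_ids)
```

The same sentence is 3 tokens at word level and 15 tokens at character level, which shows the trade-off: smaller vocabulary, longer sequences.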


### Embeddings
- An embedding is a representation of natural language which can be learned. Representation comes in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings:
1. Create your own embedding - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as tf.keras.layers.Embedding) and an embedding representation will be learned during model training.
2. Reuse a pre-learned embedding - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
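
Mechanically, an embedding layer can be thought of as a lookup table: one row of numbers per token id. The NumPy sketch below illustrates that idea with a tiny vocabulary. The sizes and the random initial range are assumptions chosen for the example, and no learning happens here; in a real model the rows are updated during training.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab_size, embedding_dim = 10, 5  # tiny sizes, for illustration only
# One row per token id, initialised with small random values (assumed range)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

token_ids = np.array([3, 7, 3, 0])      # a tokenized "sentence"
embedded = embedding_matrix[token_ids]  # row lookup: one vector per token

print(embedded.shape)
print(np.array_equal(embedded[0], embedded[2]))  # same id -> same vector
```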

# Setting up a TensorFlow TextVectorization layer to convert text to numbers

In [15]:
from tensorflow.keras.layers import TextVectorization

In [16]:
# these are the default parameters. this cell is just for demonstration
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None) # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True) # Not valid if using max_tokens=None

In [17]:
# Find average number of tokens (words) in training Tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [18]:
# Setup text vectorization with custom variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

# Mapping the TextVectorization layer to text data and turning it into numbers

In [19]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

2023-06-19 12:06:00.248538: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


In [20]:
# Create sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[319,   3, 199,   4,  13, 699,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
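
To see roughly what is happening inside the layer, here is a simplified sketch in plain Python. It is an approximation, not the Keras implementation: `standardize`, `adapt` and `vectorize` are hypothetical helpers, and details such as tie-breaking between equally frequent words may differ from `TextVectorization`. It does mirror the behaviour seen above: index 0 is padding, index 1 is `[UNK]`, and outputs are truncated or padded to a fixed length.

```python
from collections import Counter
import string

def standardize(text):
    """Lowercase and strip punctuation (so "There's" becomes "theres")."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def adapt(corpus, max_tokens):
    """Build a vocabulary: padding, unknown token, then words by frequency."""
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length):
    """Map words to ids, then truncate or zero-pad to sequence_length."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]  # 1 = [UNK]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))                 # 0 = padding

corpus = ["There's a flood in my street!", "A fire in the street"]
vocab = adapt(corpus, max_tokens=10)
ids = vectorize("A flood in my house", vocab, sequence_length=8)

print(vocab)
print(ids)
```

The word "house" never appeared in the toy corpus, so it maps to 1, and the three trailing zeros are padding.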

In [21]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
@oooureli @Abu_Baraa1 You mean like the tolerance you showed when sharing 'democracy' with the Iraqis? Wait you mutilated and bombed them.      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[   1, 6264,   12, 1202,   25,    2, 7486,   12, 3418,   45, 4641,
        3064,   14,    2,    1]])>

In [22]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}")
print(f"Bottom 5 least common words: {bottom_5_words}")

# [UNK] is the unknown token, which replaces out-of-vocabulary words. Increasing max_tokens will reduce how often it appears

Number of words in vocab: 10000
Top 5 most common words: ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words: ['palmer', 'palm', 'palinfoen', 'palestinian\x89Û', 'paleface']


# Creating an Embedding layer to turn tokenised text into embedding vectors

In [23]:
embedding = tf.keras.layers.Embedding(input_dim=max_vocab_length,  # size of the vocabulary
                                      output_dim=128,  # a common starting point, multiples of 8 tend to run faster
                                      input_length=max_length,  # how long is the input?
                                      name="embedding_1")

In [24]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
@fadelurker @dalinthanelan &lt; right now.

Even after two years there were still refugees camped just south of Redcliffe village and Aidan &gt;      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.00044354, -0.00959869, -0.01488503, ...,  0.0157225 ,
         -0.01731045,  0.04999502],
        [ 0.00044354, -0.00959869, -0.01488503, ...,  0.0157225 ,
         -0.01731045,  0.04999502],
        [ 0.03572508,  0.00295686, -0.01646795, ..., -0.04597085,
          0.00974417, -0.02150788],
        ...,
        [ 0.0406287 ,  0.03047732, -0.0343025 , ...,  0.03056096,
         -0.04152142,  0.03783656],
        [ 0.00044354, -0.00959869, -0.01488503, ...,  0.0157225 ,
         -0.01731045,  0.04999502],
        [-0.01843027,  0.04141413, -0.0358417 , ...,  0.01633341,
         -0.02188298,  0.00408796]]], dtype=float32)>

In [25]:
# Check out a single token's embedding
display(sample_embed[0][0])
display(sample_embed[0][0].shape)
display(random_sentence)

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.00044354, -0.00959869, -0.01488503,  0.01583595, -0.03460299,
       -0.01922628,  0.03886281,  0.03049261,  0.02703375,  0.03125577,
       -0.00640028,  0.04225099, -0.04081216, -0.01396431, -0.02797878,
       -0.00030531, -0.03165816,  0.02005385, -0.02563877, -0.04627147,
        0.01255215, -0.01807749,  0.04352308, -0.04396086, -0.00285302,
        0.01801059,  0.02089453, -0.02251923,  0.03271903, -0.04593432,
        0.04534021,  0.01232741,  0.01351071, -0.00348172, -0.00295175,
        0.01398399, -0.04538589,  0.03654433,  0.0401935 ,  0.01342661,
        0.02652259, -0.00834689,  0.04488019,  0.04263305, -0.01620691,
       -0.03888427, -0.00747744, -0.01589997,  0.0127697 , -0.04039965,
        0.00297211,  0.02234465,  0.01354844, -0.03515201,  0.00236771,
       -0.01941361,  0.0042451 , -0.03211303,  0.01056878,  0.01654197,
       -0.02855355, -0.00332244, -0.03632966, -0.03077023, -0.00369436,
       -0.007075

TensorShape([128])

'@fadelurker @dalinthanelan &lt; right now.\n\nEven after two years there were still refugees camped just south of Redcliffe village and Aidan &gt;'

# the various modelling experiments we're going to run
- [Model 0: Sklearn Naive Bayes (baseline)](https://scikit-learn.org/stable/modules/naive_bayes.html)
- Model 1: Feed-forward neural network (dense model)
- Model 2: LSTM model
- Model 3: GRU model
- Model 4: Bidirectional-LSTM model
- Model 5: 1D Convolutional Neural Network
- Model 6: TensorFlow Hub Pretrained Feature Extractor
- Model 7: Same as model 6 with 10% of training data

# Model 0: Building a baseline model to try and improve upon [Sklearn Naive Bayes (baseline)](https://scikit-learn.org/stable/modules/naive_bayes.html)

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline (a Pipeline is a bit like tf.keras.models.Sequential())
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using tfidf
                    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [27]:
# evaluate the model:
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 78.22%
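
For intuition about the `tfidf` step, here is a small hand-rolled TF-IDF sketch in plain Python. It assumes the smoothed idf formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as its defaults, and it skips scikit-learn's own tokenization rules. The three documents are made up for the example.

```python
import math
from collections import Counter

docs = ["flood in the city", "fire in the forest", "the city sleeps"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(tok for doc in tokenized for tok in set(doc))

def tfidf(doc):
    """Term count times smoothed idf, then scaled to unit L2 length."""
    tf = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:<6} {weight:.3f}")
```

Words that appear in every document (here "the") get the lowest weight, while words unique to one document (here "flood") get the highest.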


In [28]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0])

# Creating a function to track and evaluate our model's results

In [29]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """
    Calculates accuracy, precision, recall and f1 score of a binary classification model.
    Precision, recall and f1 are averaged across classes, weighted by class support.
    Args:
    -----
    y_true = true labels in the form of a 1D array
    y_pred = predicted labels in the form of a 1D array

    Returns a dictionary of accuracy, precision, recall, f1-score.
    """
    # Calculate model accuracy
    model_accuracy = accuracy_score(y_true, y_pred)
    # Calculate model precision, recall and f1 score using "weighted" average
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
    return model_results

In [30]:
# Baseline results
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 0.7821522309711286,
 'precision': 0.7922635210264124,
 'recall': 0.7821522309711286,
 'f1': 0.7730378142460546}
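
Notice that recall and accuracy are identical in the results above. That follows from the `"weighted"` average: each class's recall is weighted by its share of the true labels, so the sum collapses to (correct predictions) / (total samples). The sketch below re-implements the weighted scores in plain Python to show this; `weighted_scores` is a hypothetical helper and the labels are made up.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and f1, plus plain accuracy."""
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # how often c was predicted
        support = sum(t == c for t in y_true)    # how often c is the truth
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support if support else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
scores = weighted_scores(y_true, y_pred)
print(scores)
```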

# Model 1: Building, fitting and evaluating our first deep model (feed forward) on text data

In [31]:
# Create tensorboard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# Create directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [32]:
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x)  # create embedding of numberized inputs
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding
outputs = layers.Dense(1, activation="sigmoid")(x)

model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model

In [33]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N
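
The parameter counts in the summary can be checked by hand, and the pooling step can be mimicked with NumPy. This sketch assumes that global average pooling simply takes the mean over the sequence axis, and it uses a dummy batch of zeros purely to show the shapes.

```python
import numpy as np

vocab_size, embedding_dim, seq_len = 10000, 128, 15

# Embedding: one 128-dimensional vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# Dense(1): one weight per pooled feature plus one bias
dense_params = embedding_dim * 1 + 1
total_params = embedding_params + dense_params

# Pooling: average over the 15 token positions of each sample
batch = np.zeros((2, seq_len, embedding_dim))  # (batch, tokens, features)
pooled = batch.mean(axis=1)                    # (batch, features)

print(f"{embedding_params:,} + {dense_params:,} = {total_params:,}")
print(batch.shape, "->", pooled.shape)
```

Pooling is what turns one vector per token into one vector per Tweet, so the final dense layer can produce a single probability.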

In [34]:
model_1.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])



In [35]:
# Fit the model
model_1_history = model_1.fit(train_sentences, # inputs can be raw strings because the text preprocessing layers are built into the model
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20230619-120600
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [36]:
# Check the results
model_1.evaluate(val_sentences, val_labels)



[0.5178927779197693, 0.7808399200439453]

In [37]:
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs[:10] # only print out the first 10 prediction probabilities



array([[0.9963774 ],
       [0.9327017 ],
       [0.8655261 ],
       [0.00449484],
       [0.28397885],
       [0.30158865],
       [0.84945637],
       [0.20643485],
       [0.11917296],
       [0.07175671]], dtype=float32)

In [38]:
# Turn prediction probabilities into single-dimension tensor of floats
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs)) # squeeze removes single dimensions
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([1., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0.,
       0., 1., 1.], dtype=float32)>

In [39]:
# Calculate model_1 metrics
model_1_results = calculate_results(y_true=val_labels,
                                    y_pred=model_1_preds)
model_1_results

{'accuracy': 0.7808398950131233,
 'precision': 0.7798665327287134,
 'recall': 0.7808398950131233,
 'f1': 0.7779809505888746}

In [40]:
# Is our simple Keras model better than our baseline model?
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))
# mostly not: only f1 is higher; accuracy, precision and recall are slightly lower

array([False, False, False,  True])
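
A boolean array hides how close the two models actually are. The sketch below prints the per-metric differences instead, using the values copied from the two result dictionaries above (pasted in so the snippet runs on its own).

```python
# Values copied from the baseline_results and model_1_results outputs above
baseline_results = {"accuracy": 0.7821522309711286,
                    "precision": 0.7922635210264124,
                    "recall": 0.7821522309711286,
                    "f1": 0.7730378142460546}
model_1_results = {"accuracy": 0.7808398950131233,
                   "precision": 0.7798665327287134,
                   "recall": 0.7808398950131233,
                   "f1": 0.7779809505888746}

differences = {m: model_1_results[m] - baseline_results[m] for m in baseline_results}

for metric, diff in differences.items():
    print(f"{metric:<10} baseline: {baseline_results[metric]:.4f}  "
          f"new: {model_1_results[metric]:.4f}  diff: {diff:+.4f}")
```

A helper along these lines could be reused for every later model, since all of them are compared against the same baseline.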

# Visualizing our model's learned word embeddings with [TensorFlow's projector tool](https://projector.tensorflow.org/)

In [41]:
# redoing this from above for practice
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [42]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N

In [43]:
# weight matrix of embedding layer
# (these are the numerical patterns between the text in the training dataset the model has learned)
embed_weights = model_1.get_layer("embedding_1").get_weights()[0]
print(embed_weights.shape) # (vocab size, embedding_dim): each word is an embedding_dim-sized vector
print(3)

(10000, 128)
3


In [44]:
import io

# Write embedding vectors and their matching words to two TSV files
with io.open("embedding_vectors.tsv", "w", encoding="utf-8") as out_v, \
     io.open("embedding_metadata.tsv", "w", encoding="utf-8") as out_m:
    for num, word in enumerate(words_in_vocab):
        if num == 0:
            continue # skip padding token
        vec = embed_weights[num]
        out_m.write(word + "\n") # write words to file
        out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file

# High-level overview of Recurrent Neural Networks (RNNs) + where to learn more
- RNNs are useful for sequence data
- The premise of an RNN is simple: use information from the past to help you with the future (this is where the term recurrent comes from). In other words, take an input (X) and compute an output (y) based on all previous inputs
### types of RNNs
- Long short-term memory cells (LSTMs).
- Gated recurrent units (GRUs).
- Bidirectional RNNs (pass forward and backward along a sequence, left to right and right to left).
### types of problems RNNs can be used in
- One to one: one input, one output, such as image classification.
- One to many: one input, many outputs, such as image captioning (image input, a sequence of text as caption output).
- Many to one: many inputs, one output, such as text classification (classifying a Tweet as real disaster or not real disaster).
- Many to many: many inputs, many outputs, such as machine translation (translating English to Spanish) or speech to text (audio wave as input, text as output)
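
The recurrence itself is short enough to write out. The NumPy sketch below uses a plain tanh cell (not an LSTM or GRU) with random, untrained weights, so it only illustrates the mechanics: the hidden state at each step depends on the current input and on the previous hidden state, and therefore on every earlier input. Returning every state corresponds to the many-to-many case, and keeping only the last one corresponds to many-to-one. The function name and sizes are chosen for this example.

```python
import numpy as np

def simple_rnn(inputs, w_x, w_h, b):
    """Run a tanh RNN cell over a sequence and return the state at every step."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:                        # one step per token
        h = np.tanh(x_t @ w_x + h @ w_h + b)  # new state mixes input and past
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
timesteps, features, units = 15, 128, 64      # 15 tokens, 128-d embeddings

inputs = rng.normal(size=(timesteps, features))
w_x = rng.normal(scale=0.1, size=(features, units))
w_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

all_states = simple_rnn(inputs, w_x, w_h, b)  # many to many: (15, 64)
last_state = all_states[-1]                   # many to one:  (64,)

print(all_states.shape, last_state.shape)
```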
### Resources
- [MIT Deep Learning Lecture on Recurrent Neural Networks](https://youtu.be/SEnXr6v2ifU)  - explains the background of recurrent neural networks and introduces LSTMs.
- [The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) - demonstrates the power of RNNs with examples generating various sequences.
- [Understanding LSTMs by Chris Olah](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) - an in-depth (and technical) look at the mechanics of the LSTM cell, possibly the most popular RNN building block.

# Model 2: Building, fitting and evaluating our first TensorFlow RNN model (LSTM)

We create a new embedding layer for each model because an embedding layer is a learned representation of words (as numbers). If we reused the same layer (embedding_1) for every model, we'd be mixing what one model learned with the next. Since we want to compare our models later on, giving each one its own embedding layer is a better idea.

In [45]:
from tensorflow.keras import layers
model_2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_2")


# Create LSTM model
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_2_embedding(x)
print(x.shape)
# x = layers.LSTM(64, return_sequences=True)(x) # return vector for each word in the Tweet (you can stack RNN cells as long as return_sequences=True)
x = layers.LSTM(64)(x) # return vector for whole sequence
print(x.shape)
# x = layers.Dense(64, activation="relu")(x) # optional dense layer on top of output of LSTM cell
outputs = layers.Dense(1, activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

(None, 15, 128)
(None, 64)


In [46]:
model_2.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])



In [47]:
model_2_history = model_2.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels), # validate on held-out data, not the training set
                              callbacks=[create_tensorboard_callback(SAVE_DIR, "model_2_experiment")])

Saving TensorBoard log files to: model_logs/model_2_experiment/20230619-120607
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [48]:
model_2.summary()

Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_2 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 lstm (LSTM)                 (None, 64)                49408     
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,329,473
Trainable params: 1,329,473
Non-trainable params: 0
____________________________________________
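
The 49,408 parameters of the LSTM layer can be reproduced by hand. An LSTM has four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias. The helper name below is made up for this sketch.

```python
def lstm_params(input_dim, units):
    """4 weight sets, each: input weights + recurrent weights + bias."""
    return 4 * ((input_dim + units) * units + units)

embedding_params = 10000 * 128   # embedding_2
lstm_layer = lstm_params(128, 64)
dense_params = 64 * 1 + 1        # dense_1

total = embedding_params + lstm_layer + dense_params
print(f"LSTM layer: {lstm_layer:,}")
print(f"Total:      {total:,}")
```

Both numbers match the summary above. The embedding still accounts for the large majority of the model's parameters.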

In [49]:
# Make predictions on the validation dataset
model_2_pred_probs = model_2.predict(val_sentences)
print(model_2_pred_probs.shape, model_2_pred_probs[:10]) # view the first 10

# Round out predictions and reduce to 1-dimensional array
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

(762, 1) [[0.99987584]
 [0.99068624]
 [0.99947435]
 [0.00554314]
 [0.17340145]
 [0.0267905 ]
 [0.9998617 ]
 [0.15163232]
 [0.10999082]
 [0.00434084]]


<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 0., 0., 0., 1., 0., 0., 0.], dtype=float32)>

In [50]:
# Calculate LSTM model results
model_2_results = calculate_results(y_true=val_labels,
                                    y_pred=model_2_preds)
display(model_2_results)  # worse than baseline
display(baseline_results)

{'accuracy': 0.7414698162729659,
 'precision': 0.7411148256584495,
 'recall': 0.7414698162729659,
 'f1': 0.7412807067637125}

{'accuracy': 0.7821522309711286,
 'precision': 0.7922635210264124,
 'recall': 0.7821522309711286,
 'f1': 0.7730378142460546}

# Model 3: Building, fitting and evaluating a GRU-cell powered RNN
- The GRU cell has similar features to an LSTM cell but has fewer parameters
- again, we'll use this architecture: Input (text) -> Tokenize -> Embedding -> Layers -> Output (label probability)

In [51]:
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
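x = embedding(x) # NOTE: this reuses embedding_1, which model_1 has already trained; create a fresh Embedding layer for a fair comparison
x = layers.GRU(64, return_sequences=False)(x) # return_sequences=True is only needed when stacking recurrent layers
# x = layers.LSTM(64, return_sequences=True)(x)
# x = layers.GRU(64)(x)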
# x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model_3 = tf.keras.Model(inputs, outputs)

In [52]:
model_3.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 gru (GRU)                   (None, 64)                37248     
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,317,313
Trainable params: 1,317,313
Non-trainable params: 0
___________________________________________________
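
The GRU's 37,248 parameters can be checked the same way as the LSTM's. A GRU has three weight sets instead of four. The formula below assumes two bias vectors per set (the `reset_after=True` variant), which is the only assumption needed to reproduce the number in the summary. Both helper names are made up for this sketch.

```python
def gru_params(input_dim, units):
    """3 weight sets, each: input weights + recurrent weights + 2 biases.

    Assumes the reset_after=True variant of the GRU (two bias vectors).
    """
    return 3 * ((input_dim + units) * units + 2 * units)

def lstm_params(input_dim, units):
    """4 weight sets, each: input weights + recurrent weights + bias."""
    return 4 * ((input_dim + units) * units + units)

gru_layer = gru_params(128, 64)
lstm_layer = lstm_params(128, 64)
total = 10000 * 128 + gru_layer + (64 + 1)  # embedding + GRU + dense

print(f"GRU:  {gru_layer:,}")
print(f"LSTM: {lstm_layer:,}")
print(f"GRU model total: {total:,}")
```

For the same input size and number of units, the GRU layer is about a quarter smaller than the LSTM layer.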

In [53]:
model_3.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

model_3_history = model_3.fit(train_sentences,
            train_labels,
            epochs=5,
            validation_data=(val_sentences, val_labels),
            callbacks=[create_tensorboard_callback(SAVE_DIR, "model_3_GRU")])



Saving TensorBoard log files to: model_logs/model_3_GRU/20230619-120622
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [54]:
# make predictions with model 3
model_3_pred_probs = model_3.predict(val_sentences)



In [55]:
# Convert prediction probabilities to prediction classes
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))

In [56]:
# Calculate model_3 results
model_3_results = calculate_results(y_true=val_labels,
                                    y_pred=model_3_preds)
model_3_results  # still not as good as baseline

{'accuracy': 0.7519685039370079,
 'precision': 0.7512241125208886,
 'recall': 0.7519685039370079,
 'f1': 0.7515319432391477}

# Model 4: Building, fitting and evaluating a bidirectional RNN model
- A standard RNN processes a sequence from left to right, whereas a bidirectional RNN processes the sequence from left to right and also from right to left, then combines the two results.
- Intuitively, this is like reading a sentence for the first time in the normal fashion (left to right), finding it didn't make sense, and then going back over the words in reverse (right to left).
- In practice, many sequence models see an improvement in performance when using bidirectional RNNs.
- However, this improvement often comes at the cost of longer training times and more model parameters (since the model goes left to right and right to left, the number of trainable parameters in the recurrent layer doubles).
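The parameter doubling can be checked by hand. The sketch below is plain arithmetic rather than a call into Keras, and the helper names `lstm_params` and `bidirectional_lstm_params` are made up for this illustration. It assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and the 128-dimensional embedding used in this notebook.

```python
def lstm_params(input_dim, units):
    """Trainable weights in one LSTM layer (hypothetical helper).

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units) and a bias (units).
    """
    return 4 * (input_dim * units + units * units + units)


def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs, one per direction."""
    return 2 * lstm_params(input_dim, units)


embedding_dim = 128  # output_dim of the embedding layer in this notebook
units = 64           # units passed to layers.LSTM

one_way = lstm_params(embedding_dim, units)
two_way = bidirectional_lstm_params(embedding_dim, units)
print(one_way, two_way)  # 49408 98816
```

The second number matches the 98,816 parameters that `model_4.summary()` reports for the `bidirectional` layer. The output width also doubles (64 units per direction, concatenated to 128), which is why the final `Dense` layer has 129 parameters instead of 65.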

In [57]:
# inputs = layers.Input(shape=(1,), dtype=tf.string)
# x = text_vectorizer(inputs)
# x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# x = layers.Bidirectional(layers.GRU(64, return_sequences=False))(x)
# outputs = layers.Dense(1, activation="sigmoid")(x)
#
# model_4 = tf.keras.Model(inputs, outputs)

# his
# inputs = layers.Input(shape=(1,), dtype="string")
# x = text_vectorizer(inputs)
# x = model_4_embedding(x)
# # x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # stacking RNN layers requires return_sequences=True
# x = layers.Bidirectional(layers.LSTM(64))(x) # bidirectional goes both ways so has double the parameters of a regular LSTM layer
# outputs = layers.Dense(1, activation="sigmoid")(x)
# model_4 = tf.keras.Model(inputs, outputs, name="model_4_Bidirectional")

# his modified
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
# x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x) # stacking RNN layers requires return_sequences=True
x = layers.Bidirectional(layers.LSTM(64))(x) # bidirectional goes both ways so has double the parameters of a regular LSTM layer
outputs = layers.Dense(1, activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs, outputs, name="model_4_Bidirectional")

In [58]:
model_4.summary()

Model: "model_4_Bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 bidirectional (Bidirectiona  (None, 128)              98816     
 l)                                                              
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,378,945
Trainable params: 1,3

In [59]:
model_4.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
model_4_history = model_4.fit(train_sentences,
            train_labels,
            epochs=5,
            validation_data=(val_sentences, val_labels),
            callbacks=[create_tensorboard_callback(SAVE_DIR, "model_4_bidirectional")])



Saving TensorBoard log files to: model_logs/model_4_bidirectional/20230619-120634
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [60]:
model_4_pred_probs = model_4.predict(val_sentences)
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_results = calculate_results(val_labels, model_4_preds)
model_4_results  # still worse than baseline, also worse than uni-directional in this case



{'accuracy': 0.7493438320209974,
 'precision': 0.7482324836878728,
 'recall': 0.7493438320209974,
 'f1': 0.74862942348961}

# Conv1D neural networks for text and sequences
Let's look at what a Conv1D layer does to a sequence.

In [61]:
# Test out the embedding, 1D convolutional and max pooling
embedding_test = embedding(text_vectorizer(["this is a test sentence"])) # turn target sentence into embedding
conv_1d = layers.Conv1D(filters=32, kernel_size=5, activation="relu") # convolve over target sequence 5 words at a time
conv_1d_output = conv_1d(embedding_test) # pass embedding through 1D convolutional layer
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output) # get the most important features
embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 11, 32]), TensorShape([1, 32]))

The embedding output shape follows from the parameters we set (input_length=15 and output_dim=128).

The 1-dimensional convolutional layer compresses the sequence in line with its parameters: with kernel_size=5 and no padding, the 15 token positions become 11 window positions, and filters=32 gives 32 values per position. The max pooling layer then keeps the largest value of each filter across all positions.

Our text starts out as a string but gets converted to a feature vector of length 32 through various transformation steps (from tokenization to embedding to 1-dimensional convolution to max pool).
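To make the shape changes concrete, here is a minimal NumPy sketch of a "valid" 1D convolution followed by global max pooling. It is an illustration only, not how Keras implements `Conv1D`: the input and kernel are random stand-ins, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, filters, kernel_size = 15, 128, 32, 5

# Random stand-ins for one embedded sentence and the layer's weights
embedded = rng.normal(size=(1, seq_len, embed_dim))
kernel = rng.normal(size=(kernel_size, embed_dim, filters))
bias = np.zeros(filters)

# "valid" padding: the window only sits where it fully fits
out_len = seq_len - kernel_size + 1  # 15 - 5 + 1 = 11
conv_out = np.zeros((1, out_len, filters))
for start in range(out_len):
    window = embedded[:, start:start + kernel_size, :]  # (1, 5, 128)
    # Sum over the window's positions and embedding dims for every filter
    conv_out[:, start, :] = np.tensordot(window, kernel, axes=([1, 2], [0, 1])) + bias
conv_out = np.maximum(conv_out, 0.0)  # ReLU activation

# Global max pooling: keep the strongest response of each filter
pooled = conv_out.max(axis=1)

n_params = kernel.size + bias.size  # 5 * 128 * 32 + 32
print(embedded.shape, conv_out.shape, pooled.shape, n_params)
```

The shapes are (1, 15, 128), (1, 11, 32) and (1, 32), the same as the Keras output above, and the weight count of 20,512 agrees with the `Conv1D` layer in the `model_5` summary.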

Let's take a peek at what each of these transformations looks like.

In [62]:
# See the outputs of each layer
embedding_test[:1], conv_1d_output[:1], max_pool_output[:1]

(<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
 array([[[ 2.47879792e-02, -3.76283042e-02, -4.78201099e-02, ...,
          -9.77362916e-02, -2.47036237e-02,  1.04545252e-02],
         [ 4.81648147e-02,  1.05584515e-02, -1.94007379e-03, ...,
          -3.20963301e-02,  2.00950690e-02,  3.29664797e-02],
         [-2.23150402e-02, -4.24621534e-03,  3.29113677e-02, ...,
          -2.85842549e-02, -4.58227694e-02, -1.35427881e-02],
         ...,
         [-2.03758404e-02, -5.34846149e-02, -5.64603251e-05, ...,
           1.46892974e-02, -3.05895861e-02,  1.38282441e-02],
         [-2.03758404e-02, -5.34846149e-02, -5.64603251e-05, ...,
           1.46892974e-02, -3.05895861e-02,  1.38282441e-02],
         [-2.03758404e-02, -5.34846149e-02, -5.64603251e-05, ...,
           1.46892974e-02, -3.05895861e-02,  1.38282441e-02]]],
       dtype=float32)>,
 <tf.Tensor: shape=(1, 11, 32), dtype=float32, numpy=
 array([[[6.11400232e-02, 4.75114733e-02, 0.00000000e+00, 1.16474926e-03,
         

### Making the conv1d model

In [64]:
tf.random.set_seed(42)
from tensorflow.keras import layers
model_5_embedding = layers.Embedding(input_dim=max_vocab_length,
                                     output_dim=128,
                                     embeddings_initializer="uniform",
                                     input_length=max_length,
                                     name="embedding_5")

# Create 1-dimensional convolutional layer to model sequences
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model_5_embedding(x)
x = layers.Conv1D(filters=32, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPool1D()(x)
# x = layers.Dense(64, activation="relu")(x) # optional dense layer
outputs = layers.Dense(1, activation="sigmoid")(x)
model_5 = tf.keras.Model(inputs, outputs, name="model_5_Conv1D")

# Compile Conv1D model
model_5.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

model_5.summary()



Model: "model_5_Conv1D"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding_5 (Embedding)     (None, 15, 128)           1280000   
                                                                 
 conv1d_2 (Conv1D)           (None, 11, 32)            20512     
                                                                 
 global_max_pooling1d_2 (Glo  (None, 32)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_5 (Dense)             (None, 1)              

In [65]:
# Fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "Conv1D")])

Saving TensorBoard log files to: model_logs/Conv1D/20230619-120811
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [66]:
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]



array([[0.99992543],
       [0.9976896 ],
       [0.986874  ],
       [0.00708563],
       [0.00330716],
       [0.14532353],
       [0.9840111 ],
       [0.14718454],
       [0.16802555],
       [0.00865445]], dtype=float32)

In [67]:
# Convert model_5 prediction probabilities to labels
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 0., 0., 0., 1., 0., 0., 0.], dtype=float32)>
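Rounding a sigmoid output is the same as applying a 0.5 decision threshold. The sketch below reuses the ten probabilities printed above and compares the two approaches with NumPy (used here in place of TensorFlow so the example runs on its own; both `np.round` and `tf.round` round halves to even, so a probability of exactly 0.5 would become 0).

```python
import numpy as np

# First ten model_5 prediction probabilities, copied from the output above
pred_probs = np.array([[0.99992543], [0.9976896], [0.986874], [0.00708563],
                       [0.00330716], [0.14532353], [0.9840111], [0.14718454],
                       [0.16802555], [0.00865445]], dtype=np.float32)

# Rounding, as done in the notebook
rounded = np.squeeze(np.round(pred_probs))

# Explicit threshold: class 1 if the probability is above 0.5
threshold = 0.5
thresholded = np.squeeze((pred_probs > threshold).astype(np.float32))

print(rounded)
print(np.array_equal(rounded, thresholded))
```

Writing the threshold explicitly makes it easy to try other cut-offs, for example a lower value if missing a real disaster tweet is costlier than a false alarm.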

In [68]:
# Calculate model_5 evaluation metrics
model_5_results = calculate_results(y_true=val_labels,
                                    y_pred=model_5_preds)
model_5_results

{'accuracy': 0.7414698162729659,
 'precision': 0.7398429301909109,
 'recall': 0.7414698162729659,
 'f1': 0.7402801500377935}

# Using TensorFlow Hub for pretrained word embeddings (transfer learning for NLP)
- We'll use [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4)

In [70]:
# Example of pretrained embedding with universal sentence encoder - https://tfhub.dev/google/universal-sentence-encoder/4
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4") # load Universal Sentence Encoder
embed_samples = embed([sample_sentence,
                      "When you call the universal sentence encoder on a sentence, it turns it into numbers."])

print(embed_samples[0][:50])

tf.Tensor(
[-0.01157025  0.02485909  0.0287805  -0.01271501  0.03971541  0.08827759
  0.02680986  0.05589836 -0.01068732 -0.00597294  0.00639323 -0.01819521
  0.00030815  0.09105891  0.05874644 -0.03180626  0.01512474 -0.05162928
  0.00991368 -0.06865346 -0.04209306  0.0267898   0.03011009  0.00321064
 -0.00337969 -0.0478736   0.02266718 -0.00985928 -0.04063616 -0.01292095
 -0.04666382  0.05630299 -0.03949255  0.00517684  0.02495827 -0.07014439
  0.02871509  0.04947678 -0.00633972 -0.08960193  0.0280712  -0.00808366
 -0.01360598  0.0599865  -0.10361788 -0.05195374  0.00232957 -0.0233253
 -0.03758107  0.0332773 ], shape=(50,), dtype=float32)


In [71]:
# Each sentence has been encoded into a 512-dimensional vector
embed_samples[0].shape

TensorShape([512])

# Model 6: Building, training and evaluating a transfer learning model for NLP

In [72]:
# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[], # shape of inputs coming to our model
                                        dtype=tf.string, # data type of inputs coming to the USE layer
                                        trainable=False, # keep the pretrained weights (we'll create a feature extractor)
                                        name="USE")

Now that we've got the USE as a Keras layer, we can use it in a Keras Sequential model.

In [73]:
# Create model 6 (transfer learning) using the Sequential API
model_6 = tf.keras.Sequential([
  sentence_encoder_layer, # take in sentences and then encode them into an embedding
  layers.Dense(64, activation="relu"),
  layers.Dense(1, activation="sigmoid")
], name="model_6_USE")

# Compile model
model_6.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

model_6.summary()



Model: "model_6_USE"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_6 (Dense)             (None, 64)                32832     
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 256,830,721
Trainable params: 32,897
Non-trainable params: 256,797,824
_________________________________________________________________
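The split between trainable and non-trainable parameters in this summary can be reproduced with a little arithmetic. This is a sketch, with `dense_params` as a made-up helper; the frozen count is simply copied from the summary, since `trainable=False` keeps every USE weight fixed.

```python
def dense_params(n_inputs, units):
    """Weights plus biases of a fully connected layer (hypothetical helper)."""
    return n_inputs * units + units


use_output_dim = 512            # size of each USE sentence embedding
frozen_use_params = 256_797_824  # copied from the summary, not trained

hidden = dense_params(use_output_dim, 64)  # Dense(64) on top of the USE
output = dense_params(64, 1)               # Dense(1) sigmoid output

trainable = hidden + output
total = trainable + frozen_use_params
print(hidden, output, trainable, total)  # 32832 65 32897 256830721
```

Only the 32,897 weights in the two `Dense` layers are updated during `fit`, which is what makes this model a feature extractor.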


In [74]:
# Train a classifier on top of pretrained embeddings
model_6_history = model_6.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "tf_hub_sentence_encoder")])

Saving TensorBoard log files to: model_logs/tf_hub_sentence_encoder/20230619-123642
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [75]:
# Make predictions with USE TF Hub model
model_6_pred_probs = model_6.predict(val_sentences)
model_6_pred_probs[:10]



array([[0.963828  ],
       [0.857974  ],
       [0.81991976],
       [0.03819581],
       [0.05435634],
       [0.48794934],
       [0.8404128 ],
       [0.15077174],
       [0.50047183],
       [0.60435563]], dtype=float32)

In [76]:
# Convert prediction probabilities to labels
model_6_preds = tf.squeeze(tf.round(model_6_pred_probs))
model_6_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 0., 0., 0., 1., 0., 1., 1.], dtype=float32)>

In [77]:
# Calculate model 6 performance metrics
model_6_results = calculate_results(val_labels, model_6_preds)
model_6_results

{'accuracy': 0.7992125984251969,
 'precision': 0.7985566013932281,
 'recall': 0.7992125984251969,
 'f1': 0.7969086902314371}

# Preparing subsets of data for model 7 (same as model 6 but 10% of data)
- One of the benefits of transfer learning methods, such as the pretrained embeddings within the USE, is the ability to get good results on a small amount of data (the USE paper even mentions this in the abstract).

- To put this to the test, we're going to make a small subset of the training data (10%), train a model on it and evaluate it.