## Natural Language Processing (NLP) with TensorFlow

Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, generate, and respond to human language in a way that is both meaningful and useful.

### 🧠 In Simple Terms:
NLP is how we teach computers to read, write, and understand human language—whether that's spoken or written. It bridges the gap between how humans naturally communicate and how computers process information.

### 🔍 Key Goals of NLP:

- Understanding language (e.g., “What is the meaning of this sentence?”)

- Generating language (e.g., chatbots or translation tools)

- Classifying text (e.g., spam detection)

- Extracting information (e.g., pulling names or dates from documents)

- Conversational AI (e.g., Siri, Alexa, ChatGPT)

### 🛠️ Examples of NLP in Action:

- Google Translate: Translates text between languages.

- Spam Filters: Detect spam based on words and phrases.

- Chatbots: Understand and respond to customer questions.

- Sentiment Analysis: Determines whether a sentence expresses a positive or negative opinion.

- Search Engines: Interpret what users really mean by their queries.

### 🔗 Relation to AI and ML:

NLP often combines linguistics (rules of language) with machine learning (learning from data) so that systems can improve over time with more examples.

As you may already know, training deep learning models on large datasets is much faster on a GPU (either a local one or one provided by a cloud service). So let us first check the specifications of the GPU we have at our disposal.

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


The next step is to get the helper functions written by **Daniel Bourke**, which he uses frequently in his own tutorials. Click [here](https://github.com/mrdbourke/tensorflow-deep-learning/blob/main/extras/helper_functions.py) to open `helper_functions.py` on his GitHub.

P.S.: This notebook, like many others, is inspired by his work, so I would like to thank him for the great work he has done. To learn more about his tutorials, visit [Zero to Mastery (ZTM)](https://zerotomastery.io/).

In [2]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2025-05-08 19:10:23--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2025-05-08 19:10:23 (8.37 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [3]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

Oh! What about data?

Well, let us also download a dataset from Kaggle. You can read about the specifications of the dataset at [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started/data).

In [4]:
# Download dataset
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2025-05-08 19:10:36--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.8.207, 142.251.170.207, 173.194.174.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.8.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-05-08 19:10:37 (951 KB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [5]:
# Visualize data

import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
# Shuffle train frame
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 2644 | 3796 | destruction |  | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge |  | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock |  | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


The model we train on this dataset should classify whether a Tweet is about a real disaster or not.

In [7]:
# Extract the number of samples in each class
# This can help us understand how well the dataset is balanced
train_df.target.value_counts()

| target | count |
|--------|-------|
| 0 | 4342 |
| 1 | 3271 |


In [8]:
import random

random_index = random.randint(0, len(train_df)-5) # create random indexes not higher than the total number of samples
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
Sitting around a fire sounds great right about now

---

Target: 0 (not real disaster)
Text:
to whomever is hijacking my wifi hotspot. I have a very specific skill set. I will create a character and perform a one-man show about you

---

Target: 0 (not real disaster)
Text:
I tried making a chocolate and peanut butter lava cake using my #shakeology protein shake mix and aÛ_ https://t.co/APoD4EIVBa

---

Target: 1 (real disaster)
Text:
oc73x mhtw4fnet

Officials: Alabama home quarantined over possible Ebola case - Washington Post

---

Target: 1 (real disaster)
Text:
Rainstorm Destroys 600 Houses in Yobe State http://t.co/nU0D3uANNZ

---



**NOTE!** When creating the random index, 5 is subtracted from the upper bound (`len(train_df)-5`) because the code that follows reads 5 consecutive rows starting at `random_index`. The subtraction guarantees that the slice `random_index:random_index+5` always contains 5 rows.
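A minimal sketch of why this bound works, using a hypothetical sample count `n_samples` in place of `len(train_df)`. `random.randint` includes both endpoints, so the largest possible start index is `n_samples - 5`, and the slice starting there still ends exactly at the last row.

```python
import random

n_samples = 20    # hypothetical stand-in for len(train_df)
window = 5        # number of consecutive rows to display

rows = list(range(n_samples))

# random.randint includes both endpoints, so start is at most n_samples - window
start = random.randint(0, n_samples - window)
selected = rows[start:start + window]

# Even the largest possible start index still yields a full window
last_window = rows[n_samples - window:n_samples]
print(selected, last_window)
```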

Now let's split our data...

In [9]:
from sklearn.model_selection import train_test_split

# Split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, # dedicate 10% of samples to validation set
                                                                            random_state=42) # random state for reproducibility

**NOTE!** `.to_numpy()` converts a pandas object (here a single DataFrame column, i.e. a Series) to a NumPy array. Many machine learning tools work with arrays, and passing plain arrays keeps the inputs consistent across the scikit-learn and TensorFlow models used in this notebook. It may also avoid some pandas overhead, though any speed difference is usually small.

Before feeding an NLP model with textual data, there are a series of preprocessing steps typically performed to clean, structure, and convert the text into a model-friendly format.

### tf.keras.layers.TextVectorization

`tf.keras.layers.TextVectorization` is a built-in TensorFlow Keras preprocessing layer used to convert raw text into numeric tensors—a crucial step before feeding text into neural networks.

#### 🔍 What It Does
The TextVectorization layer automates text standardization, tokenization, and vectorization, enabling a full text preprocessing pipeline inside the model.

In [10]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

max_vocab_length = 10000
max_length = 15

# Other than the two values for max_tokens and output_sequence_length, the rest are default
text_vectorizer = TextVectorization(max_tokens=max_vocab_length, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=max_length) # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True) # Not valid if using max_tokens=None

### text_vectorizer.adapt()

#### 🔍 What it does:
This step builds the vocabulary from your dataset (texts). Think of it like training the TextVectorization layer to understand what words exist in your data and how to index them.

#### 🧠 Internally:
- It standardizes the text (e.g., lowercases, removes punctuation, etc.).
- Then it tokenizes the text into words (or characters, based on config).
- Finally, it counts the frequency of tokens and keeps the most frequent `max_tokens` - further explanation will be given a bit later.
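The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the layer's actual implementation: `standardize` and `build_vocabulary` are hypothetical helper names, the punctuation set is Python's `string.punctuation`, and ties between equally frequent words may be ordered differently by the real layer.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (a rough stand-in for
    standardize="lower_and_strip_punctuation")."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts, max_tokens):
    """Return a vocabulary list: padding token, OOV token, then the most frequent words."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())  # split on whitespace
    # Two slots are reserved, so only max_tokens - 2 real words are kept
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

toy_texts = ["The flood hit the town!", "The fire spread.", "Flood warning issued"]
vocab = build_vocabulary(toy_texts, max_tokens=4)
print(vocab)  # ['', '[UNK]', 'the', 'flood']
```

With `max_tokens=4`, only the two most frequent words survive; every other word would later be mapped to `[UNK]`.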

In [11]:
# Map TextVectorization instance text_vectorizer to data
# In other words, fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

### vectorized_text = text_vectorizer(texts)
#### 🔍 What it does:
This converts your raw text into a sequence of integers, where each word is replaced by its corresponding index from the vocabulary built during `adapt()`.

#### 🧠 Internally:
- Each text string is standardized and tokenized the same way as during `adapt()`.
- Each token is replaced with its index (from the learned vocab).
- If `output_sequence_length` is set, the sequences are padded/truncated to that length.

| Step                      | Purpose                          | Outcome                        |
|---------------------------|-----------------------------------|--------------------------------|
| `text_vectorizer.adapt()`            | Learn vocabulary from data        | Builds word → index mapping   |
| `text_vectorizer(texts)`       | Vectorize text using vocab        | Converts text to integer sequences |
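As a rough illustration of the second row of the table, the sketch below maps words to indices and then pads or truncates to a fixed length. The function `vectorize` and the tiny vocabulary are hypothetical; the real layer also applies its own standardization first.

```python
def vectorize(text, vocab, output_sequence_length):
    """Map each word to its vocabulary index, then pad or truncate to a fixed length."""
    word_to_index = {word: index for index, word in enumerate(vocab)}
    tokens = text.lower().split()
    # Unknown words fall back to index 1, the OOV token
    ids = [word_to_index.get(token, 1) for token in tokens]
    ids = ids[:output_sequence_length]                      # truncate long inputs
    ids = ids + [0] * (output_sequence_length - len(ids))   # pad short inputs with 0
    return ids

toy_vocab = ["", "[UNK]", "the", "flood", "in", "street"]  # hypothetical vocabulary
print(vectorize("the flood in my street", toy_vocab, 8))             # [2, 3, 4, 1, 5, 0, 0, 0]
print(vectorize("the flood in the street the flood", toy_vocab, 4))  # [2, 3, 4, 2]
```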


In [12]:
# Create a random sample sentence and tokenize it
sample_sentence = "There is a flood in my street!"
vectorized_text = text_vectorizer([sample_sentence])
vectorized_text

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 74,   9,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,
          0,   0]])>

In [13]:
longer_sample_sentence = "Since the test set has no labels and we need a way to evalaute our trained models, we'll split off some of the training data and create a validation set."
longer_vectorized_text = text_vectorizer([longer_sample_sentence])
longer_vectorized_text

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 216,    2, 1246,  284,   41,   40,    1,    7,   46,  162,    3,
         147,    5,    1,  103]])>

The longer sentence in `longer_vectorized_text` produces different values but the same sequence length (15): it contains more than 15 tokens, so everything after the 15th token is truncated. The length is fixed by `output_sequence_length=max_length=15`, which matches the average number of tokens per Tweet in the training set (computed in the next cell). The `1` values are the index of the out-of-vocabulary token `[UNK]`, assigned to words that are not in the vocabulary.

**NOTE!** The 0's at the end of the first tensor (`vectorized_text`) are padding. Because `output_sequence_length=15`, `text_vectorizer` always returns a sequence of length 15 regardless of the input length: shorter inputs are padded with 0's and longer ones are truncated.

In [14]:
# Find average number of tokens (words) in training Tweets
avg_no_tokens = round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))
avg_no_tokens

15

Also, as explained by Daniel himself...

"For `max_tokens` (the number of words in the vocabulary), multiples of 10,000 (10,000, 20,000, 30,000) or the exact number of unique words in your text (e.g. 32,179) are common values."

However, in TensorFlow documentation the explanation below is given on `max_tokens`:

"Maximum size of the vocabulary for this layer. This should only be specified when adapting a vocabulary or when setting `pad_to_max_tokens=True`. Note that this vocabulary contains 1 OOV token, so the effective number of tokens is (`max_tokens - 1 - (1 if output_mode == "int" else 0)`)."

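The formula in the quoted documentation can be checked with a few lines of arithmetic. The helper name `effective_vocab_size` is made up for this sketch; only the formula itself comes from the quote.

```python
def effective_vocab_size(max_tokens, output_mode="int"):
    """Number of real tokens kept, following the formula quoted from the documentation."""
    return max_tokens - 1 - (1 if output_mode == "int" else 0)

print(effective_vocab_size(10000))           # 9998
print(effective_vocab_size(10000, "count"))  # 9999
```

With `max_tokens=10000` and `output_mode="int"`, two slots are reserved (padding and OOV), which leaves 9,998 slots for actual words.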
Let us try it also with some random sentences...

In [15]:
# Set seed to produce the same result/sentence
# You can comment out the line below to produce different random sentences
random.seed(42)  # random.seed() returns None, so there is nothing to assign

random_sentence = random.choice(train_sentences)
print(f"Original Sentence: {random_sentence}")
print(f"Vectorized Sentence: {text_vectorizer([random_sentence])}")

Original Sentence: You are listening to LLEGASTE TU - TWISTER EL REY
Vectorized Sentence: [[  12   22 1820    5    1 7321  358 1684 4739    0    0    0    0    0
     0]]


There is also another method which returns the current vocabulary of the layer:

In [16]:
# Get the unique words in the vocabulary
words = text_vectorizer.get_vocabulary()
top_3_words = words[:3]
top_3_words

['', '[UNK]', np.str_('the')]

In [17]:
# Get vocab size
vocab_size = text_vectorizer.vocabulary_size()
print(vocab_size)

# text_vectorizer.vocabulary_size() vs. len(text_vectorizer.get_vocabulary())
vocab_size == len(words)

10000


True

### Create an Embedding Using an Embedding Layer

`tf.keras.layers.Embedding` is a key layer used in NLP models after vectorizing text, and it plays a crucial role in teaching the model how to understand words numerically.

#### What is `tf.keras.layers.Embedding`?
It’s a lookup table that maps each word (represented by an integer index) to a dense vector of fixed size. If you're using the Embedding layer as part of a trainable model and you haven't trained it yet, here's what happens:

🚧 Before Training:

- The Embedding layer assigns random vectors to each word index.
- These vectors have no semantic meaning yet.
- So, when you pass in a sentence like "I love pizza":

  - It's tokenized and mapped to integers (e.g., `[12, 85, 210]`)
  - Each integer gets a random embedding vector (e.g., shape `(3, 128)` if `output_dim=128`)
  - These vectors are not meaningful yet — just initial placeholders. In fact, they are learned during training to capture semantic meaning.

🧠 During Training:

- The embedding vectors are updated via backpropagation.
- The model learns to adjust these vectors so that:
  - Words with similar contexts get closer in vector space.
  - Semantic relationships start to emerge (e.g., "king" and "queen" become related).

✅ After Training:

- The embeddings now encode semantics and syntax.
- They can be visualized, analyzed, or reused in other models.
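The "lookup table" idea before training can be sketched without TensorFlow. The sizes below are hypothetical, and the range `(-0.05, 0.05)` is an assumption about what the `"uniform"` initializer draws from; the point is only that each index selects one row of random numbers.

```python
import random

random.seed(42)

vocab_size = 10    # hypothetical tiny vocabulary
output_dim = 4     # hypothetical embedding size (the notebook uses 128)

# One row of random values per word index, drawn from a small uniform range
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(output_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up one vector per token index."""
    return [embedding_table[i] for i in token_ids]

sentence_ids = [2, 7, 2]          # the same word appears at positions 0 and 2
vectors = embed(sentence_ids)

print(len(vectors), len(vectors[0]))   # 3 4
print(vectors[0] == vectors[2])        # True
```

Identical indices always return identical vectors, which is why repeated tokens in a sentence show up as repeated rows in the embedded output.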

In [18]:
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary (number of rows in the lookup table)
                             output_dim=128, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input (newer Keras versions deprecate this argument and only warn about it)
                             name="embedding_1")

embedding



<Embedding name=embedding_1, built=False>

In [19]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
rnd_sentence_embedding = embedding(text_vectorizer([random_sentence]))
rnd_sentence_embedding

Original text:
@eunice_njoki aiii she needs to chill and answer calmly its not like she's being attacked      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.03079859, -0.0189546 , -0.01243884, ..., -0.03588306,
          0.03818712,  0.01506058],
        [-0.03079859, -0.0189546 , -0.01243884, ..., -0.03588306,
          0.03818712,  0.01506058],
        [ 0.02519022, -0.03579969,  0.00304385, ...,  0.00691248,
         -0.02343527, -0.03144266],
        ...,
        [ 0.03271942, -0.04625431, -0.02785697, ...,  0.02894186,
          0.00474759,  0.03827249],
        [ 0.00036204,  0.04174695, -0.04853138, ..., -0.04821759,
          0.00393968, -0.0415694 ],
        [ 0.01250075, -0.01274125, -0.03891511, ..., -0.00288209,
         -0.0383091 , -0.03247384]]], dtype=float32)>

Let's have a look at a single token's embedding...

In [20]:
# Check out a single token's embedding
rnd_sentence_embedding[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-3.0798590e-02, -1.8954599e-02, -1.2438845e-02, -2.6422739e-04,
       -1.6225576e-02,  2.6150886e-02,  5.8179125e-03, -5.5372491e-03,
       -2.9799486e-02, -3.5299398e-02, -3.6157310e-02, -1.2291051e-02,
        9.9672899e-03,  4.3678608e-02, -2.1052910e-02, -6.5875649e-03,
       -3.3069685e-02, -1.8835330e-02, -2.2636104e-02, -4.6077110e-02,
        2.2921469e-02,  3.2275271e-02,  2.1131728e-02, -4.1691326e-02,
       -3.2879338e-03,  3.7835207e-02,  3.4914780e-02, -3.2779716e-02,
        3.7594210e-02, -1.6118847e-02,  3.0581180e-02,  1.1982083e-02,
       -4.8280180e-02, -2.0977175e-02, -1.1316787e-02,  1.0697555e-02,
        3.8902331e-02,  3.4880076e-02, -2.6882147e-02,  3.1061102e-02,
        3.2911886e-02,  3.0359019e-02,  1.1458147e-02,  1.3435606e-02,
        3.3494834e-02, -4.5481481e-02,  1.6185392e-02, -6.1620027e-05,
       -1.0583520e-02,  4.4308927e-02,  1.9534957e-02, -2.6562130e-02,
        2.4622679e-04, -2.465

Summary:

| Stage            | Meaning Captured? | Description                               |
|------------------|-------------------|-------------------------------------------|
| Before Training  | ❌ No              | Random vectors; no understanding          |
| During Training  | ⚙️ Gradual        | Vectors updated to reflect meaning        |
| After Training   | ✅ Yes            | Embeddings reflect word semantics         |


### Modelling
With the groundwork above in place, the stage is set to build our models. As is conventional, we will start with a baseline and then experiment with alternatives, trying to improve performance based on the results achieved.

More specifically, we'll be building the following:

- **Model 0**: Naive Bayes (baseline)
- **Model 1**: Feed-forward neural network (dense model)
- **Model 2**: LSTM model
- **Model 3**: GRU model
- **Model 4**: Bidirectional-LSTM model
- **Model 5**: 1D Convolutional Neural Network
- **Model 6**: TensorFlow Hub Pretrained Feature Extractor
- **Model 7**: Same as model 6 with 10% of training data

#### Model 0: Naive Bayes (Baseline)


A **baseline** in machine learning is a simple model or method used as a point of comparison for more complex models. It might be as basic as predicting the most frequent class (in classification) or the mean value (in regression). A **benchmark** refers to the standard performance level—often set by the baseline or an existing best model—against which new models are evaluated.

In short, baselines provide simple starting points to evaluate whether a more advanced model is truly learning something meaningful.
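As a quick worked example of the simplest baseline mentioned above (always predicting the most frequent class), the sketch below uses the class counts printed earlier by `value_counts()`. Note that it is computed on the full training set, not on the validation split.

```python
class_counts = {0: 4342, 1: 3271}   # counts reported by value_counts() earlier

majority_class = max(class_counts, key=class_counts.get)
total = sum(class_counts.values())
majority_accuracy = class_counts[majority_class] / total

print(majority_class, f"{majority_accuracy * 100:.2f}%")   # 0 57.03%
```

Any model worth keeping should beat this figure by a clear margin.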

The combination used below, TF-IDF features fed into a Multinomial Naive Bayes classifier, is widely used as a lightweight, interpretable baseline for tasks like spam detection, sentiment analysis, and topic classification.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create pipeline
model0 = Pipeline([
    ("tfidf", TfidfVectorizer()),   # convert words to numerical representations
    ("classifier", MultinomialNB()) # model the converted data
])

In [22]:
# Fit the model
model0.fit(train_sentences, train_labels)

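So when the model is fit, it learns (taking an e-mail classifier as an example) which words are typical of *Spam* and which of *Ham*, based on how often each word occurs in each class. In the pipeline above, those occurrences are weighted by TF-IDF rather than used as raw word counts.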

For example, after fitting:

- The model knows "buy" and "now" are common in class 1 (spam).
- "hello" and "friend" are seen in class 0 (not spam).
- It can now classify new texts like "buy friend" based on learned probabilities.
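The spam example above can be sketched with a hand-written multinomial Naive Bayes. This is a simplified illustration under stated assumptions: it uses raw word counts with add-one smoothing instead of TF-IDF weights, and the four training texts are made up.

```python
import math
from collections import Counter

# Hypothetical toy training set: 1 = spam, 0 = not spam
train = [
    ("buy now", 1),
    ("buy cheap now", 1),
    ("hello friend", 0),
    ("hello dear friend", 0),
]

word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocabulary = set(word_counts[0]) | set(word_counts[1])

def log_score(text, label, alpha=1.0):
    """Log prior plus the sum of log word likelihoods with add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total_words + alpha * len(vocabulary)))
    return score

def predict(text):
    """Return the class with the higher log score."""
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("buy now"))           # 1
print(predict("hello friend"))      # 0
print(predict("buy friend now"))    # 1
```

In the last example two of the three words point towards spam, so the spam class wins even though "friend" was only seen in non-spam texts.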

In [23]:
# Evaluate model
score_baseline = model0.score(val_sentences, val_labels)
print(f"Outcome: The baseline model achieves an accuracy of {score_baseline*100:.2f}%.")

Outcome: The baseline model achieves an accuracy of 79.27%.


In [24]:
# Make prediction
model0.predict(val_sentences[:5])

array([1, 1, 1, 0, 0])

In [25]:
# Get baseline predictions
baseline_preds = model0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

### Create Evaluation Function

In [26]:
# Define function to evaluate accuracy, precision, recall, fscore

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_results(y_true: list, y_pred: list) -> dict:
  """
  Computes accuracy, precision, recall and f1-score of a binary classification model.

  Args:
  ----
  y_true (list): list of true labels
  y_pred (list): list of predicted labels

  Returns a dictionary of accuracy (as a percentage) plus precision, recall and f1-score (each between 0 and 1)
  """

  # Compute model accuracy
  accuracy = accuracy_score(y_true, y_pred) * 100

  # Compute model precision, recall and f1-score using "weighted" average
  precision, recall, f1score, _ = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred, average="weighted")
  results = {"accuracy": accuracy,
             "precision": precision,
             "recall": recall,
             "f1score": f1score}

  return results

**NOTE!** The term *weighted average* means that when combining the precision, recall, and F1-score values across different classes, each class contributes to the final score proportionally to its support — i.e., the number of true instances for that class in the dataset.
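To make the weighting concrete, here is a small hand-rolled sketch of support-weighted precision. The function name and the six labels are hypothetical; in the notebook this calculation is done by scikit-learn.

```python
def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's support."""
    classes = sorted(set(y_true))
    total = len(y_true)
    result = 0.0
    for cls in classes:
        predicted = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        precision = correct / predicted if predicted else 0.0
        support = sum(1 for t in y_true if t == cls)   # true instances of this class
        result += precision * support / total
    return result

y_true = [0, 0, 0, 0, 1, 1]   # hypothetical labels: support is 4 for class 0, 2 for class 1
y_pred = [0, 0, 0, 1, 1, 0]

print(round(weighted_precision(y_true, y_pred), 4))   # 0.6667
```

Class 0 has precision 0.75 and class 1 has precision 0.5; because class 0 has twice the support, the result (about 0.667) sits closer to 0.75 than the unweighted mean of 0.625 would.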

In [27]:
# Produce baseline results
baseline_results = compute_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1score': 0.7862189758049549}

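To keep track of the results achieved by different models and configurations, most of which are built with TensorFlow, it is worth creating a TensorBoard callback with `tf.keras.callbacks.TensorBoard()`. The training cells later in this notebook use the `create_tensorboard_callback` helper imported earlier; `create_tb_callback` below is a local version of the same idea.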

In [28]:
from datetime import datetime


def create_tb_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.

  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"

  Args:
  -----
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)

  Returns a TensorBoard callback.
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.now().strftime("%Y%m%d-%H%M%S")  # `datetime` here is the class imported via `from datetime import datetime`
  tb_callback = tf.keras.callbacks.TensorBoard(
      log_dir=log_dir
  )
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tb_callback

In [29]:
# Directory to Save logs
SAVE_DIR = "model_logs"

#### Model 1: Dense Model

In [30]:
# Build model with the Functional API

from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional since they're raw strings

x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GlobalAveragePooling1D()(x) # [Optional] lower the dimensionality of the embedding (try running the model without this layer and see what happens)

# Create the output layer
outputs = layers.Dense(1, activation="sigmoid")(x) # want binary outputs so use sigmoid activation

# Construct the model
model1 = tf.keras.Model(inputs, outputs, name="model_1_dense")
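To see what the pooling layer in this model does to the embedded sequence, the sketch below averages a small, hypothetical `(timesteps, features)` input over the timesteps axis. It is a plain-Python illustration of the idea, not the Keras layer itself.

```python
def global_average_pooling_1d(sequence):
    """Average a (timesteps, features) list of vectors over the timesteps axis."""
    timesteps = len(sequence)
    features = len(sequence[0])
    return [sum(step[f] for step in sequence) / timesteps for f in range(features)]

# Hypothetical embedded sentence: 3 tokens, 2 features each
embedded = [[1.0, 4.0],
            [2.0, 5.0],
            [3.0, 6.0]]

print(global_average_pooling_1d(embedded))   # [2.0, 5.0]
```

In the model above the same operation turns each `(15, 128)` embedded Tweet into a single vector of 128 values, which the final `Dense` layer can consume.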

In [31]:
# Compile model
model1.compile(loss="binary_crossentropy",
               optimizer=tf.keras.optimizers.Adam(),
               metrics=["accuracy"])

In [32]:
model1.summary()

Now that our model is compiled, let us fit it to our training data for a few epochs...

In [33]:
callbacks = [create_tensorboard_callback(dir_name=SAVE_DIR,
                             experiment_name="simple_dense_model")]

model1_history = model1.fit(
    train_sentences,
    train_labels,
    epochs=6,
    validation_data=(val_sentences, val_labels),
    callbacks=callbacks
)

Saving TensorBoard log files to: model_logs/simple_dense_model/20250508-191041
Epoch 1/6
215/215 ━━━━━━━━━━━━━━━━━━━━ 11s 21ms/step - accuracy: 0.6362 - loss: 0.6493 - val_accuracy: 0.7585 - val_loss: 0.5330
Epoch 2/6
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 15ms/step - accuracy: 0.8090 - loss: 0.4658 - val_accuracy: 0.7900 - val_loss: 0.4732
Epoch 3/6
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.8531 - loss: 0.3618 - val_accuracy: 0.7940 - val_loss: 0.4611
Epoch 4/6
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.8866 - loss: 0.2955 - val_accuracy: 0.7900 - val_loss: 0.4674
Epoch 5/6
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.9058 - loss: 0.2467 - val_accuracy: 0.7808 - val_loss: 0.4831
Epoch 6/6
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 15ms/step - accuracy: 0.9244 - los

In [34]:
# Check the results
model1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7663 - loss: 0.5384


[0.5043545365333557, 0.778215229511261]

In [35]:
embedding.weights

[<Variable path=embedding_1/embeddings, shape=(10000, 128), dtype=float32, value=[[ 0.03434306 -0.0095758  -0.02374553 ... -0.03609192  0.01311912
    0.02618703]
  [-0.04414263 -0.03038486 -0.02443877 ... -0.02267646  0.05049562
    0.02830759]
  [ 0.03306349 -0.05248832  0.02045467 ...  0.05758527  0.05398652
    0.01203938]
  ...
  [ 0.02622915 -0.0330107   0.02841885 ... -0.0378751   0.02547008
   -0.009432  ]
  [-0.04829478  0.00047479  0.04407792 ...  0.06949612  0.03680482
    0.05556997]
  [-0.03496158 -0.04570995  0.06969726 ...  0.06450178  0.04361736
    0.05325269]]>]

In [36]:
embedding_weights = model1.get_layer("embedding_1").get_weights()[0]
embedding_weights, embedding_weights.shape

(array([[ 0.03434306, -0.0095758 , -0.02374553, ..., -0.03609192,
          0.01311912,  0.02618703],
        [-0.04414263, -0.03038486, -0.02443877, ..., -0.02267646,
          0.05049562,  0.02830759],
        [ 0.03306349, -0.05248832,  0.02045467, ...,  0.05758527,
          0.05398652,  0.01203938],
        ...,
        [ 0.02622915, -0.0330107 ,  0.02841885, ..., -0.0378751 ,
          0.02547008, -0.009432  ],
        [-0.04829478,  0.00047479,  0.04407792, ...,  0.06949612,
          0.03680482,  0.05556997],
        [-0.03496158, -0.04570995,  0.06969726, ...,  0.06450178,
          0.04361736,  0.05325269]], dtype=float32),
 (10000, 128))

In [37]:
# Make predictions in the form of probabilities
model1_pred_probs = model1.predict(val_sentences)
model1_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step


array([[0.33830914],
       [0.73765236],
       [0.9987248 ],
       [0.17282753],
       [0.06250986],
       [0.9596436 ],
       [0.8924247 ],
       [0.9973208 ],
       [0.98085046],
       [0.46200037]], dtype=float32)

In [38]:
model1_pred_probs[:10].round()  # round each probability to the nearest integer (0 or 1)

array([[0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.]], dtype=float32)

In [39]:
# model1_preds = tf.squeeze(model1_pred_probs.round())

# OR alternatively
model1_preds = model1_pred_probs.round().squeeze()

model1_preds[:10]

array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)

Now that the predictions are produced in the form of 0/1 values (binary classification), we can compare them to the ground truth values and produce results for different metrics.

In [40]:
# Compute model_1 metrics
model1_results = compute_results(y_true=val_labels,
                                 y_pred=model1_preds)
model1_results

{'accuracy': 77.82152230971128,
 'precision': 0.7798979990634543,
 'recall': 0.7782152230971129,
 'f1score': 0.7762659531210079}

Let us compare the results achieved by the two models so far...

In [41]:
import numpy as np
np.array(list(model1_results.values())) >= np.array(list(baseline_results.values()))

array([False, False, False, False])

In [42]:
def compare_models(baseline_res, new_res):
  """Prints each metric for the baseline and the new model, plus the difference (new - baseline)."""
  for key, value in baseline_res.items():
    print(f"Baseline {key}: {value:.2f}, New Model {key}: {new_res[key]:.2f}, Difference: {new_res[key] - value:.2f}")

compare_models(baseline_results, model1_results)

Baseline accuracy: 79.27, New Model accuracy: 77.82, Difference: -1.44
Baseline precision: 0.81, New Model precision: 0.78, Difference: -0.03
Baseline recall: 0.79, New Model recall: 0.78, Difference: -0.01
Baseline f1score: 0.79, New Model f1score: 0.78, Difference: -0.01


The first model (`model1`) contained an embedding layer (`embedding`) which learned a way of representing words as feature vectors by passing over the training data.

### Visualize Embeddings
Let us now visualize the embedding our model has learned.

But first let me ask you a question...

**Question:** Have you ever wondered how the embeddings learned during training differ from the embedding models used to turn text into numerical representations in, for example, RAG systems?

**Answer:** The former are task-specific embeddings, learned through backpropagation as part of training a deep learning model such as a classifier or a language model. The latter are precomputed, general-purpose semantic embeddings used to represent chunks of documents, questions, or passages. They are generated by pretrained embedding models like:

- OpenAI's text-embedding-ada-002
- Hugging Face's sentence-transformers
- BERT-based encoders

To visualize our embedding using the **[TensorFlow Embedding Projector Tool](http://projector.tensorflow.org/)**, we will need two objects/files:
- The embedding vectors (same as embedding weights).
- The meta data of the embedding vectors (the words they represent - our vocabulary).
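A minimal sketch of how these two files could be written, following the same approach as the commented-out cell below but with context managers so the files are always closed. The function name `write_projector_files` and the toy data are hypothetical; in the notebook, `words` and `embedding_weights` would take their place.

```python
import os
import tempfile

def write_projector_files(words, vectors, out_dir):
    """Write one tab-separated vector per line and the matching word per line."""
    vectors_path = os.path.join(out_dir, "embedding_vectors.tsv")
    metadata_path = os.path.join(out_dir, "embedding_metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8") as out_v, \
         open(metadata_path, "w", encoding="utf-8") as out_m:
        for index, word in enumerate(words):
            if index == 0:
                continue  # skip the padding token at index 0
            out_m.write(word + "\n")
            out_v.write("\t".join(str(x) for x in vectors[index]) + "\n")
    return vectors_path, metadata_path

# Hypothetical toy vocabulary and weights standing in for `words` and `embedding_weights`
toy_words = ["", "[UNK]", "the"]
toy_vectors = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4]]

out_dir = tempfile.mkdtemp()
vectors_path, metadata_path = write_projector_files(toy_words, toy_vectors, out_dir)
with open(metadata_path, encoding="utf-8") as f:
    print(f.read().splitlines())   # ['[UNK]', 'the']
```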

In [44]:
# # Code below is adapted from: https://www.tensorflow.org/tutorials/text/word_embeddings#retrieve_the_trained_word_embeddings_and_save_them_to_disk
# import io

# # Create output writers
# out_v = io.open("embedding_vectors.tsv", "w", encoding="utf-8")
# out_m = io.open("embedding_metadata.tsv", "w", encoding="utf-8")

# # Write embedding vectors and words to file
# for num, word in enumerate(words):
#   if num == 0:
#      continue # skip padding token
#   vec = embedding_weights[num]
#   out_m.write(word + "\n") # write words to file
#   out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file
# out_v.close()
# out_m.close()

# # Download files locally to upload to Embedding Projector
# try:
#   from google.colab import files
# except ImportError:
#   pass
# else:
#   files.download("embedding_vectors.tsv")
#   files.download("embedding_metadata.tsv")


Once you have downloaded the embedding vectors and metadata (uncomment and run the cell above to generate them), you can visualize them using the **Embedding Projector**:

1. Go to http://projector.tensorflow.org/
2. Click on "Load data"
3. Upload the two files you downloaded (`embedding_vectors.tsv` and `embedding_metadata.tsv`)
4. Explore
5. Optional: You can share the data you've created by clicking "Publish"

### Recurrent Neural Networks (RNNs)

For our next series of modelling experiments we're going to use a special kind of neural network called a **Recurrent Neural Network (RNN)**, an architecture designed for sequence data. Unlike traditional feedforward networks, RNNs keep a memory of previous inputs, which lets them process data where context or order matters, such as time series, text, or speech.

In other words, RNNs process one element of a sequence at a time and maintain a hidden state that gets updated at each step. This hidden state acts like memory, carrying information from previous inputs forward to influence future predictions.
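The hidden-state update can be sketched with a scalar toy RNN. The update rule `tanh(w_x * x + w_h * h_prev + b)` is the common textbook form of a simple recurrent cell, and the weights here are fixed, hypothetical values; in a real network they are learned and the state is a vector.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent update for a scalar input and a scalar hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# Hypothetical fixed weights; in a real network these are learned
w_x, w_h, b = 0.5, 0.8, 0.0

def run_rnn(sequence):
    """Process the sequence one element at a time, carrying the hidden state forward."""
    h = 0.0                      # initial hidden state
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
    return h

# Same elements, different order: the final hidden state differs
print(run_rnn([1.0, 0.0, 0.0]))
print(run_rnn([0.0, 0.0, 1.0]))
```

Because each step depends on the previous hidden state, the two sequences end in different states even though they contain the same values, which is exactly the order sensitivity that text classification benefits from.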

Recurrent neural networks can be used for a number of sequence-based problems:

- **One to one:** one input, one output, such as image classification.
- **One to many:** one input, many outputs, such as image captioning (image input, a sequence of text as caption output).
- **Many to one:** many inputs, one output, such as text classification (classifying a Tweet as real disaster or not real disaster).
- **Many to many:** many inputs, many outputs, such as machine translation (translating English to Spanish) or speech to text (audio wave as input, text as output).

(List taken from [Zero to Mastery TensorFlow for Deep Learning](https://dev.mrdbourke.com/tensorflow-deep-learning/08_introduction_to_nlp_in_tensorflow/).)

#### Challenges:

- Vanishing/exploding gradient problems for long sequences.
- Limited long-term memory.

These limitations motivated variants and extensions such as:

- Long short-term memory cells (LSTMs).
- Gated recurrent units (GRUs).
- Bidirectional RNNs (which pass forward and backward along a sequence, left to right and right to left).

#### Use Cases:

- Language modeling
- Time series prediction
- Speech recognition
- Machine translation (as part of encoder-decoder models)

#### Model 2: LSTM

The main difference from the previous model is that we add an LSTM layer between the embedding layer and the output. The previously trained embedding is not reused; a new embedding layer replaces it. The text vectorizer, however, can be reused because it is not updated during training.

In [46]:
from tensorflow.keras import layers

tf.random.set_seed(42)

model2_embedding = layers.Embedding(input_dim=max_vocab_length,
                                    output_dim=128,
                                    embeddings_initializer="uniform",
                                    input_length=max_length,
                                    name="embedding_2")

In [47]:
# Create LSTM model
inputs = layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = model2_embedding(x)
print(x.shape)
# x = layers.LSTM(64, return_sequences=True)(x) # return vector for each word in the Tweet (you can stack RNN cells as long as return_sequences=True)
x = layers.LSTM(64)(x) # return vector for whole sequence
print(x.shape)
# x = layers.Dense(64, activation="relu")(x) # optional dense layer on top of output of LSTM cell
outputs = layers.Dense(1, activation="sigmoid")(x)
model2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

(None, 15, 128)
(None, 64)
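The two printed shapes, and the size of the LSTM layer, can be reproduced with a little arithmetic. The helper names below are made up for this sketch, and the parameter formula is the standard one for an LSTM with four gates and a bias per gate; treat it as an estimate to compare against `model2.summary()` rather than a guaranteed figure.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def lstm_output_shape(timesteps, units, return_sequences=False):
    """Shape of the LSTM output, with None standing for the batch dimension."""
    return (None, timesteps, units) if return_sequences else (None, units)

print(lstm_output_shape(15, 64))                          # (None, 64)
print(lstm_output_shape(15, 64, return_sequences=True))   # (None, 15, 64)
print(lstm_param_count(input_dim=128, units=64))          # 49408
```

With `return_sequences=True` the layer keeps one 64-dimensional vector per token, which is what a stacked second LSTM layer would need as input.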
