# <font color="#418FDE" size="6.5" uppercase>**Text Preprocessing**</font>

>Last update: 20260121.
    
By the end of this lecture, you will be able to:
- Tokenize raw text into integer sequences using Keras text preprocessing tools. 
- Build tf.data pipelines that batch and pad text sequences for model input. 
- Create and configure embedding layers to represent tokens as dense vectors. 


## **1. Keras Text Tokenization**

### **1.1. Using TextVectorization Layer**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_01_01.jpg?v=1768976388" width="250">



>* Layer turns raw text into integer sequences
>* Integrates preprocessing into Keras models for deployment

>* Configure output type, vocabulary size, sequence length
>* Choose word or character tokens for different tasks

>* Adapt layer to data to learn vocabulary
>* Ensures consistent token mapping and simpler deployment



In [None]:
#@title Python Code - Using TextVectorization Layer

# This script demonstrates basic Keras TextVectorization layer usage clearly.
# It shows how raw text becomes padded integer sequences for neural networks.
# It keeps the example small, simple, and fully runnable in Google Colab.

# !pip install tensorflow==2.20.0

# Import required libraries including TensorFlow and NumPy.
import os
import random
import numpy as np
import tensorflow as tf

# Print TensorFlow version information for reproducibility reference.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds for reproducible vectorization behavior.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Prepare a small list of example review sentences for demonstration.
raw_text_data = [
    "I loved this movie so much",
    "This movie was okay not great",
    "I really disliked this boring movie",
    "Amazing acting and great story overall",
]

# Convert the raw text list into a TensorFlow constant tensor.
text_tensor = tf.constant(raw_text_data)

# Define maximum vocabulary size and sequence length hyperparameters.
max_vocabulary_size = 20
max_sequence_length = 8

# Create a TextVectorization layer configured for integer sequence output.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_vocabulary_size,
    output_mode="int",
    output_sequence_length=max_sequence_length,
)

# Adapt the TextVectorization layer vocabulary using the example text data.
vectorize_layer.adapt(text_tensor)

# Apply the vectorization layer to the raw text tensor to obtain sequences.
vectorized_sequences = vectorize_layer(text_tensor)

# Retrieve the learned vocabulary list from the TextVectorization layer.
vocabulary_list = vectorize_layer.get_vocabulary()

# Print the original sentences and their corresponding integer sequences.
for original_sentence, sequence in zip(raw_text_data, vectorized_sequences.numpy()):
    print("Sentence:", original_sentence)
    print("Sequence:", sequence)

# Print a small slice of the learned vocabulary for quick inspection.
print("Vocabulary sample:", vocabulary_list[:10])
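
The cell above uses word-level tokens with integer output. The bullets also mention character tokens and other output types, so here is a minimal sketch of both. The sentences, the variable names, and the sizes 12 and 10 are illustrative assumptions, not values from the lecture.

```python
import tensorflow as tf

# Two toy sentences (illustrative only).
sentences = tf.constant(["great movie", "boring movie"])

# Character-level tokens: each character, including the space, becomes one token.
char_vectorizer = tf.keras.layers.TextVectorization(
    split="character",
    output_mode="int",
    output_sequence_length=12,
)
char_vectorizer.adapt(sentences)
char_ids = char_vectorizer(sentences)

# Multi-hot output: one fixed-width vector per sentence; word order is discarded.
# pad_to_max_tokens=True fixes the vector width at max_tokens.
bag_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10,
    output_mode="multi_hot",
    pad_to_max_tokens=True,
)
bag_vectorizer.adapt(sentences)
bag_vectors = bag_vectorizer(sentences)

print("Character ids shape:", char_ids.shape)
print("Multi-hot shape:", bag_vectors.shape)
```

Character tokens keep the vocabulary tiny and avoid unknown words, at the cost of longer sequences. Multi-hot vectors suit simple bag-of-words models that do not need token order.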




### **1.2. Handling Out-of-Vocabulary Tokens**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_01_02.jpg?v=1768976419" width="250">



>* Real text includes unseen, out-of-vocabulary words
>* Keras maps unknown words to special reserved indices

>* Unknown words become one shared placeholder token
>* Model learns to use this token contextually

>* Adjust vocabulary size to control rare words
>* Balance memory, model complexity, and information loss



In [None]:
#@title Python Code - Handling out of vocabulary tokens

# This script demonstrates Keras handling unknown out of vocabulary tokens.
# It shows how unseen words become a shared unknown token index.
# It compares training vocabulary words with new unseen review words.

# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow and Python.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
import numpy as np

# Print TensorFlow version for environment clarity.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds for reproducible behavior.
seed_value = 42
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Define small training sentences with restaurant review style text.
train_sentences = [
    "The burger was delicious and very juicy",
    "Service was slow but the food tasted great",
    "I loved the crispy fries and friendly staff",
]

# Define new sentences containing unseen restaurant and cuisine words.
new_sentences = [
    "The sushi at Moonlight Bistro was incredible",
    "I tried the new dragonfire burger and galaxy shake",
]

# Create TextVectorization layer with limited vocabulary size.
max_tokens = 15
output_sequence_length = 10
text_vectorizer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=output_sequence_length,
)

# Adapt vectorizer vocabulary using training sentences dataset.
text_vectorizer.adapt(tf.data.Dataset.from_tensor_slices(train_sentences))

# Obtain and print learned vocabulary including unknown token entry.
vocab = text_vectorizer.get_vocabulary()
print("\nLearned vocabulary tokens:")
print(vocab)

# Vectorize new sentences that contain unseen out of vocabulary words.
new_sequences = text_vectorizer(new_sentences)

# Convert sequences to numpy arrays for easier inspection.
new_sequences_np = new_sequences.numpy()

# Print vectorized sequences to show unknown token indices.
print("\nVectorized new sentences with OOV handling:")
print(new_sequences_np)

# Identify unknown token index from vocabulary list position.
unknown_token = vocab[1]
unknown_index = 1

# Print explanation line describing unknown token mapping behavior.
print("\nUnknown token:", unknown_token, "mapped at index", unknown_index)

# Verify that at least one sequence element equals unknown index value.
contains_unknown = bool((new_sequences_np == unknown_index).any())
print("Sequences contain unknown index:", contains_unknown)
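
To make the vocabulary-size trade-off concrete, the sketch below adapts two layers on the same training sentences and counts how many tokens of a new sentence fall back to the unknown index. The helper `count_unknown`, the sentence `new_sentence`, and the two `max_tokens` values are hypothetical choices for illustration.

```python
import tensorflow as tf

train_sentences = tf.constant([
    "the burger was delicious and very juicy",
    "service was slow but the food tasted great",
    "i loved the crispy fries and friendly staff",
])
new_sentence = tf.constant(["the burger was great but the fries were cold"])

def count_unknown(max_tokens):
    """Count tokens of new_sentence that map to the [UNK] index (1)."""
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens, output_mode="int"
    )
    vectorizer.adapt(train_sentences)
    token_ids = vectorizer(new_sentence).numpy()
    return int((token_ids == 1).sum())

# A tiny vocabulary keeps only the most frequent training words.
small_vocab_unknowns = count_unknown(max_tokens=5)
# A generous vocabulary keeps every training word.
large_vocab_unknowns = count_unknown(max_tokens=50)

print("Unknown tokens with max_tokens=5:", small_vocab_unknowns)
print("Unknown tokens with max_tokens=50:", large_vocab_unknowns)
```

Even with the larger vocabulary, words that never appeared during `adapt` (here "were" and "cold") still map to the unknown index.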



### **1.3. Building and Saving Vocabularies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_01_03.jpg?v=1768976453" width="250">



>* Vocabulary maps text tokens to integer indices
>* Adapt layer on text to build ordered mapping

>* Saved vocabularies keep word-to-index mapping consistent
>* Export, store, and reload vocabularies with models

>* Versioned vocabularies support collaboration, experiments, and audits
>* Treat vocabularies as stable, reusable project artifacts



In [None]:
#@title Python Code - Building and saving vocabularies

# This script builds a vocabulary using Keras TextVectorization layer.
# It then saves the learned vocabulary into a small text file.
# Finally it reloads the vocabulary and recreates an identical layer.

# !pip install tensorflow==2.20.0

# Import required standard libraries for paths and randomness.
import os
import random
import pathlib

# Import TensorFlow and Keras preprocessing layers.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Set deterministic seeds for reproducible vocabulary building.
random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version for environment clarity and debugging.
print("TensorFlow version:", tf.__version__)

# Create a tiny example corpus with short movie style reviews.
corpus = [
    "This movie was excellent and very enjoyable",
    "The movie was terrible and extremely boring",
    "Excellent acting but the story was average",
]

# Convert the Python list into a TensorFlow dataset object.
ds_text = tf.data.Dataset.from_tensor_slices(corpus)

# Create a TextVectorization layer with limited vocabulary size.
vectorizer = TextVectorization(max_tokens=20, output_mode="int")

# Adapt the vectorizer on the dataset to build its vocabulary.
vectorizer.adapt(ds_text)

# Get the learned vocabulary list from the vectorizer layer.
vocab_list = vectorizer.get_vocabulary()

# Print a few vocabulary entries to inspect token to index mapping.
print("First vocabulary tokens:", vocab_list[:10])

# Define a small file path for saving the vocabulary tokens.
vocab_path = pathlib.Path("saved_vocabulary.txt")

# Save the vocabulary tokens into a plain text file safely.
with vocab_path.open("w", encoding="utf-8") as vocab_file:
    for token in vocab_list:
        vocab_file.write(f"{token}\n")

# Confirm that the vocabulary file now exists on disk.
print("Vocabulary file exists:", vocab_path.exists())

# Load the saved vocabulary tokens back from the text file.
with vocab_path.open("r", encoding="utf-8") as vocab_file:
    loaded_vocab = [line.rstrip("\n") for line in vocab_file]

# Create a new TextVectorization layer using loaded vocabulary.
new_vectorizer = TextVectorization(max_tokens=len(loaded_vocab), output_mode="int", vocabulary=loaded_vocab)

# Define a sample sentence to compare original and new vectorizers.
sample_sentence = tf.constant(["This movie was excellent but slightly boring overall"])

# Vectorize the sample sentence using the original vectorizer.
original_tokens = vectorizer(sample_sentence)

# Vectorize the same sentence using the recreated vectorizer.
reloaded_tokens = new_vectorizer(sample_sentence)

# Print both token sequences to verify identical integer mappings.
print("Original token sequence:", original_tokens.numpy())
print("Reloaded token sequence:", reloaded_tokens.numpy())
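
A one-token-per-line text file works for simple vocabularies, but it cannot represent tokens that contain newlines or meaningful surrounding whitespace. As an alternative sketch, the vocabulary can be stored as JSON, which round-trips every string exactly. The vocabulary contents and the file name below are hypothetical.

```python
import json
import pathlib
import tempfile

# Hypothetical vocabulary in the order a TextVectorization layer reports it:
# padding token first, then the unknown token, then words by frequency.
vocab_list = ["", "[UNK]", "movie", "was", "excellent"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = pathlib.Path(tmp_dir) / "vocabulary.json"

    # Save the ordered list; list position is the integer index.
    vocab_path.write_text(json.dumps(vocab_list, ensure_ascii=False), encoding="utf-8")

    # Reload and confirm the order survived the round trip.
    loaded_vocab = json.loads(vocab_path.read_text(encoding="utf-8"))

# Build an explicit token-to-index mapping for audits or debugging.
token_to_index = {token: index for index, token in enumerate(loaded_vocab)}

print("Round trip identical:", loaded_vocab == vocab_list)
print("Index of 'movie':", token_to_index["movie"])
```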




## **2. Batching Text Sequences**

### **2.1. Sequence Padding Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_02_01.jpg?v=1768976492" width="250">



>* Padding makes all text sequences equal length
>* Special padding tokens enable efficient batch processing

>* Choose a special padding token and value
>* Decide between pre-padding or post-padding consistently

>* Choose a max sequence length to reduce computation
>* Truncate long texts while preserving key information



In [None]:
#@title Python Code - Sequence Padding Basics

# This script shows basic sequence padding with simple integer token sequences.
# It demonstrates pre padding and post padding using TensorFlow Keras utilities.
# It prints padded sequences and shapes to clarify batch friendly dimensions.

# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow and standard Python libraries.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds for reproducible padding demonstration.
np.random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version information for environment transparency and reproducibility.
print("Using TensorFlow version:", tf.__version__)

# Create small example sequences representing tokenized short text sentences.
sequences = [[4, 7, 2], [10, 3], [6, 9, 1, 5]]

# Convert sequences into a TensorFlow ragged tensor for flexible length handling.
ragged_sequences = tf.ragged.constant(sequences)

# Print original ragged sequences and their varying lengths for comparison.
print("Original ragged sequences:", ragged_sequences)

# Choose a maximum sequence length for padding to create uniform batch shapes.
max_length = 6

# Apply post padding using pad_sequences with padding tokens appended at sequence ends.
post_padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences=sequences,
    maxlen=max_length,
    dtype="int32",
    padding="post",
    truncating="post",
    value=0,
)

# Apply pre padding using pad_sequences with padding tokens added at sequence beginnings.
pre_padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences=sequences,
    maxlen=max_length,
    dtype="int32",
    padding="pre",
    truncating="pre",
    value=0,
)

# Print post padded sequences and their shapes to show uniform batch friendly dimensions.
print("Post padded sequences:\n", post_padded, "\nShape:", post_padded.shape)

# Print pre padded sequences and their shapes to compare padding placement strategies.
print("Pre padded sequences:\n", pre_padded, "\nShape:", pre_padded.shape)
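
Every sequence in the cell above is shorter than `max_length`, so truncation never triggers. The sketch below reuses the same sequences with a deliberately small `maxlen=3` to show how the `truncating` argument decides which end of a long sequence is kept. It uses `tf.keras.utils.pad_sequences`, another path to the same utility.

```python
import tensorflow as tf

sequences = [[4, 7, 2], [10, 3], [6, 9, 1, 5]]

# truncating="post" drops tokens from the end, keeping the start of long sequences.
keep_start = tf.keras.utils.pad_sequences(
    sequences, maxlen=3, padding="post", truncating="post"
)

# truncating="pre" drops tokens from the beginning, keeping the end.
keep_end = tf.keras.utils.pad_sequences(
    sequences, maxlen=3, padding="post", truncating="pre"
)

print("Keep start:\n", keep_start)  # last row becomes [6, 9, 1]
print("Keep end:\n", keep_end)      # last row becomes [9, 1, 5]
```

Which end to keep depends on where the useful signal sits: for reviews whose verdict comes last, keeping the end may preserve more information.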



### **2.2. Handling Padding Masks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_02_02.jpg?v=1768976523" width="250">



>* Padding masks mark real tokens versus padding
>* They prevent models learning from meaningless padded positions

>* Create masks during padding and pass along
>* Use masks so models ignore padded tokens

>* Use one padding mask across all model layers
>* Treat masks as core data to preserve meaning



In [None]:
#@title Python Code - Handling Padding Masks

# This script shows padded sequences and corresponding padding masks clearly.
# It builds an explicit mask for inspection and relies on mask_zero=True inside the model.
# It helps beginners see how models ignore padded time steps using masks.

# !pip install tensorflow==2.20.0

# Import required modules including TensorFlow and NumPy libraries.
import tensorflow as tf
import numpy as np
import os

# Print TensorFlow version information for reproducibility and clarity.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds for reproducible behavior and outputs.
seed_value = 42
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Define a tiny corpus with variable length example sentences.
texts = [
    "I love this movie",
    "This film was okay",
    "Bad",
    "Absolutely fantastic acting today",
]

# Create a Keras Tokenizer and fit it on the example texts.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# Convert texts into integer sequences using the fitted tokenizer.
sequences = tokenizer.texts_to_sequences(texts)
print("Original sequences:", sequences)

# Pad sequences to equal length using post padding with zeros values.
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post", value=0)
print("Padded sequences shape:", padded.shape)

# Create a float padding mask where 1.0 marks real tokens and 0.0 marks padding.
mask = tf.cast(tf.math.not_equal(padded, 0), dtype=tf.float32)
print("Mask shape:", mask.shape)

# Validate that padded and mask shapes match exactly for safety.
assert padded.shape == mask.shape

# Build a simple model that accepts mask information from inputs.
inputs = tf.keras.Input(shape=(padded.shape[1],), dtype="int32")
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8, mask_zero=True)(inputs)
pooled = tf.keras.layers.GlobalAveragePooling1D()(embedding)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the model with binary crossentropy loss and Adam optimizer.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Create dummy labels for demonstration of forward pass behavior.
labels = np.array([1, 1, 0, 1], dtype="float32")

# Run one training epoch to show model using internal masking.
history = model.fit(padded, labels, epochs=1, batch_size=2, verbose=0)

# Show model predictions to confirm successful masked forward pass.
predictions = model.predict(padded, verbose=0)
print("Predictions shape:", predictions.shape)
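
To see why masking matters for pooling, here is a small NumPy sketch of the arithmetic behind a masked average. It is an illustration of the idea, not Keras's internal implementation. The token IDs and per-token values are made up, with 9.0 standing in for meaningless values at padded positions.

```python
import numpy as np

# Two padded sequences; 0 is the padding ID.
token_ids = np.array([[5, 9, 0, 0], [7, 0, 0, 0]])

# One hypothetical feature value per position; padded positions hold junk (9.0).
token_values = np.array([[2.0, 4.0, 9.0, 9.0], [6.0, 9.0, 9.0, 9.0]])

# 1.0 for real tokens, 0.0 for padding.
mask = (token_ids != 0).astype(np.float32)

# A naive mean treats padded positions as if they were real tokens.
naive_mean = token_values.mean(axis=1)

# A masked mean sums only real positions and divides by the real token count.
masked_mean = (token_values * mask).sum(axis=1) / mask.sum(axis=1)

print("Naive mean:", naive_mean)    # [6.   8.25]
print("Masked mean:", masked_mean)  # [3. 6.]
```

The naive average is dominated by padding, especially for the short second sequence, while the masked average reflects only real tokens.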




### **2.3. tf.data Text Batching**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_02_03.jpg?v=1768976563" width="250">



>* Use tf.data to batch and pad sequences
>* Get uniform batches, better performance, easier training

>* Dataset pairs token sequences with their labels
>* Pipeline batches, pads, and tracks consistent tensor shapes

>* Combine tokenize, filter, batch, and pad steps
>* Shuffle, repeat, prefetch to train efficiently



In [None]:
#@title Python Code - tf data text batching

# Demonstrate TensorFlow text dataset batching and padding pipeline usage.
# Show how sequences become padded batches for model ready tensors.
# Keep example simple, deterministic, and beginner friendly throughout.

# !pip install tensorflow==2.20.0

# Import required TensorFlow and operating system modules.
import tensorflow as tf
import os as os_module

# Set deterministic random seeds for reproducible dataset behavior.
tf.random.set_seed(7)

# Print TensorFlow version information for environment transparency.
print("TensorFlow version:", tf.__version__)

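# Define small toy tokenized sequences of different lengths as plain Python lists.
sequences = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]

# Define simple integer labels for each corresponding tokenized sequence.
labels = [0, 1, 0, 1]

# Yield one unpadded (sequence, label) pair at a time so padded_batch does the padding.
def generate_examples():
    for sequence, label in zip(sequences, labels):
        yield sequence, label

# Create the dataset from the generator so each sequence keeps its own length.
base_dataset = tf.data.Dataset.from_generator(
    generate_examples,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)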

# Shuffle dataset examples with fixed buffer size and deterministic seed.
shuffled_dataset = base_dataset.shuffle(buffer_size=4, seed=7, reshuffle_each_iteration=False)

# Define the batch size hyperparameter for batching.
batch_size = 2

# Define padded shapes for sequences and labels within each dataset batch.
padded_shapes = ([None], [])

# Define padding values for sequences and labels within padded batches.
padding_values = (0, tf.constant(-1, dtype=tf.int32))

# Apply padded_batch transformation to dataset using shapes and padding values.
batched_dataset = shuffled_dataset.padded_batch(batch_size=batch_size, padded_shapes=padded_shapes, padding_values=padding_values)

# Prefetch batches (buffer size chosen by AUTOTUNE) to overlap data preparation with model execution.
prefetched_dataset = batched_dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

# Iterate over dataset batches and print shapes and contents for inspection.
for batch_index, (batch_sequences, batch_labels) in enumerate(prefetched_dataset):

    # Print batch index and tensor shapes for sequences and labels.
    print("Batch", batch_index, "shapes", batch_sequences.shape, batch_labels.shape)

    # Print padded batch sequences tensor to observe padding behavior.
    print("Padded sequences batch:", batch_sequences.numpy())

    # Print batch labels tensor to confirm correct alignment with sequences.
    print("Batch labels:", batch_labels.numpy())

# Confirm script completed without errors by printing final confirmation message.
print("Dataset batching and padding demonstration finished successfully.")
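
The cell above starts from sequences that are already tokenized. As a sketch of the combined tokenize, filter, batch, and pad steps named in the bullets, the pipeline below starts from raw strings instead. The texts, labels, and the choice to vectorize after batching are illustrative assumptions. With no `output_sequence_length` set, the layer pads each batch to its own longest sentence.

```python
import tensorflow as tf

# Toy raw texts; the empty string stands in for a record that should be dropped.
texts = ["great movie", "", "boring and far too long", "fine"]
labels = [1, 0, 0, 1]

# Learn the vocabulary from the non-empty texts only.
vectorizer = tf.keras.layers.TextVectorization(output_mode="int")
vectorizer.adapt(tf.constant([text for text in texts if text]))

dataset = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    # Filter: drop empty strings before they reach the model.
    .filter(lambda text, label: tf.strings.length(text) > 0)
    # Shuffle with a fixed seed; add .repeat() here when training for many epochs.
    .shuffle(buffer_size=4, seed=7)
    # Batch raw strings, then tokenize and pad each batch in one call.
    .batch(2)
    .map(lambda text, label: (vectorizer(text), label))
    .prefetch(tf.data.AUTOTUNE)
)

batch_shapes = []
for token_ids, batch_labels in dataset:
    batch_shapes.append(tuple(token_ids.shape))
    print("Token ids:\n", token_ids.numpy(), "\nLabels:", batch_labels.numpy())
```

Vectorizing after batching lets the layer process whole batches at once. The width of each batch then depends on its longest sentence, so shapes can differ between batches.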



## **3. Configuring Embedding Layers**

### **3.1. Embedding Layer Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_03_01.jpg?v=1768976627" width="250">



>* Embedding layers turn token IDs into vectors
>* They act as a learned lookup matrix

>* Embeddings learn token relationships from training data
>* Similar contexts cluster; vectors encode meaning and sentiment

>* Embeddings turn token sequences into uniform vectors
>* They reduce dimensionality and share parameters efficiently



In [None]:
#@title Python Code - Embedding Layer Basics

# This script shows how embedding layers map token indices to dense vectors.
# It builds a tiny Keras model using an Embedding layer for toy sentences.
# It prints token indices and corresponding embedding vectors for clear understanding.

# !pip install tensorflow==2.20.0

# Import required modules including TensorFlow and NumPy for this example.
import tensorflow as tf
import numpy as np
import os

# Set deterministic seeds for reproducible random behavior in this simple script.
np.random.seed(7)
tf.random.set_seed(7)

# Print TensorFlow version information for environment clarity and reproducibility.
print("TensorFlow version:", tf.__version__)

# Define a tiny corpus of example sentences for demonstrating token embeddings.
texts = [
    "this movie was great",
    "this movie was awful",
]

# Create a Keras Tokenizer to map words into integer indices for the vocabulary.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=20, oov_token="<OOV>")

# Fit the tokenizer on the example texts to build the word index mapping.
tokenizer.fit_on_texts(texts)

# Convert the texts into sequences of integer token indices using the tokenizer.
sequences = tokenizer.texts_to_sequences(texts)

# Pad the sequences so they share equal length for batch processing convenience.
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")

# Retrieve vocabulary size from tokenizer plus one for reserved padding index.
vocab_size = len(tokenizer.word_index) + 1

# Define embedding dimension size which controls vector length for each token.
embedding_dim = 4

# Build a simple Sequential model containing only a single Embedding layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
])

# Run the padded integer sequences through the embedding layer to obtain vectors.
embeddings = model(padded_sequences)

# Convert embeddings to NumPy arrays for easier printing and shape inspection.
embeddings_array = embeddings.numpy()

# Print token index sequences and corresponding embedding shapes for both sentences.
print("Token index sequences:", padded_sequences)
print("Embeddings shape:", embeddings_array.shape)

# Print embeddings for first sentence tokens to show lookup table behavior.
print("First sentence embeddings:")
print(embeddings_array[0])
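
The "learned lookup matrix" description can be checked directly. In this minimal sketch, the layer's output is compared against plain NumPy row indexing into its weight matrix. The sizes and token IDs are arbitrary illustrative values.

```python
import numpy as np
import tensorflow as tf

# A small embedding layer: 6 possible token IDs, 3 values per token.
embedding_layer = tf.keras.layers.Embedding(input_dim=6, output_dim=3)

# Two short sequences of token IDs.
token_ids = np.array([[1, 4, 4], [2, 0, 5]], dtype=np.int32)

# Output produced by calling the layer.
layer_output = embedding_layer(tf.constant(token_ids)).numpy()

# The same result obtained by indexing rows of the weight matrix by token ID.
lookup_matrix = embedding_layer.get_weights()[0]
manual_lookup = lookup_matrix[token_ids]

print("Weight matrix shape:", lookup_matrix.shape)
print("Outputs match manual lookup:", np.allclose(layer_output, manual_lookup))
```

Repeated IDs (the two 4s in the first sequence) return the same row, which is how parameters are shared across every occurrence of a token.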



### **3.2. Understanding Embedding Size**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_03_02.jpg?v=1768976662" width="250">



>* Embedding size controls how much meaning fits
>* Small compresses information; large captures finer nuances

>* Balance size with data and compute limits
>* Start with defaults, tune using validation performance

>* Complex tasks and diverse vocabularies need larger embeddings
>* Match embedding dimension to task complexity and scope



In [None]:
#@title Python Code - Understanding Embedding Size

# This script compares different embedding sizes visually and numerically for beginners.
# It builds tiny models with various embedding dimensions and prints parameter counts.
# It helps understand how embedding size affects model capacity and memory usage.

# !pip install tensorflow==2.20.0

# Import required modules including TensorFlow and NumPy for computations.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible random behavior across different runs.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version information for environment transparency and reproducibility.
print("TensorFlow version:", tf.__version__)

# Define a small example vocabulary size for our toy text dataset.
vocab_size = 50

# Define several embedding dimensions to compare model size and behavior.
embedding_dims = [4, 8, 16]

# Create a tiny batch of token id sequences representing short example sentences.
example_sequences = np.array([[1, 5, 9, 2], [3, 7, 0, 0]], dtype=np.int32)

# Loop through each embedding dimension and build a simple embedding model.
for dim in embedding_dims:

    # Create a simple Sequential model containing only an Embedding layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=dim)
    ])

    # Build the model explicitly to ensure weights are created before inspection.
    model.build(input_shape=(None, 4))

    # Compute total trainable parameters which equals vocab size times embedding dimension.
    total_params = model.count_params()

    # Run the example sequences through the embedding layer to obtain dense vectors.
    embedded_output = model(example_sequences)

    # Convert embedded output to NumPy array and inspect resulting shape dimensions.
    output_array = embedded_output.numpy()

    # Print concise information about current embedding dimension and parameter count.
    print("Embedding dim:", dim, "Total params:", total_params)

    # Print the output shape to show how embedding dimension changes last axis size.
    print("Output shape:", output_array.shape)

# Confirm script finished successfully without errors or unexpected behavior.
print("Finished comparing embedding sizes.")
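
The parameter count of an embedding layer is vocabulary size times embedding dimension, so memory grows linearly with both. The sketch below estimates the footprint for a hypothetical 20,000-token vocabulary stored as 32-bit floats. The helper name and the sizes are assumptions chosen for illustration.

```python
def embedding_footprint(vocab_size, embedding_dim, bytes_per_value=4):
    """Return (parameter count, size in MiB) for an embedding matrix."""
    params = vocab_size * embedding_dim
    size_mib = params * bytes_per_value / (1024 ** 2)
    return params, size_mib

# Compare a few embedding dimensions for the same hypothetical vocabulary.
for dim in (32, 128, 512):
    params, size_mib = embedding_footprint(vocab_size=20_000, embedding_dim=dim)
    print(f"dim={dim:>3}  params={params:>10,}  size={size_mib:.2f} MiB")
```

This counts only the weights themselves. Optimizers such as Adam keep extra state per trainable parameter, so the training-time cost is higher.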



### **3.3. Initializing from Pretrained Vectors**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_03_03.jpg?v=1768976694" width="250">



>* Use pretrained embeddings instead of learning from scratch
>* They encode word similarities and boost performance quickly

>* Align your tokens with pretrained embedding vocabulary
>* Build embedding matrix, handle unknown words, initialize layer

>* Choose to freeze or train embeddings
>* Often freeze first, then unfreeze for specialization



In [None]:
#@title Python Code - Initializing from pretrained vectors

# This script shows initializing an embedding layer from pretrained vectors.
# It uses a tiny fake pretrained embedding file for demonstration purposes.
# It builds a simple Keras model using the loaded embedding matrix.

# !pip install tensorflow==2.20.0

# Import required standard libraries and TensorFlow framework.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible random behavior everywhere.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Print TensorFlow version information for environment verification purposes.
print("TensorFlow version:", tf.__version__)

# Define a small example vocabulary that our tokenizer might have produced.
vocab = ["<pad>", "<unk>", "refund", "return", "exchange", "banana"]

# Define embedding dimensionality matching our pretend pretrained vectors.
embedding_dim = 4

# Create a fake pretrained embedding dictionary with simple numeric patterns.
pretrained_vectors = {
    "refund": np.array([0.9, 0.1, 0.0, 0.0], dtype=np.float32),
    "return": np.array([0.88, 0.12, 0.0, 0.0], dtype=np.float32),
}

# Initialize an empty embedding matrix with zeros for every vocabulary token.
embedding_matrix = np.zeros((len(vocab), embedding_dim), dtype=np.float32)

# Fill the embedding matrix using pretrained vectors or random fallbacks.
for index, token in enumerate(vocab):
    if token in pretrained_vectors:
        embedding_matrix[index] = pretrained_vectors[token]
    elif token == "<pad>":
        embedding_matrix[index] = np.zeros(embedding_dim, dtype=np.float32)
    else:
        embedding_matrix[index] = np.random.uniform(-0.05, 0.05, embedding_dim)

# Wrap the NumPy embedding matrix in a Keras constant initializer.
embedding_initializer = tf.keras.initializers.Constant(embedding_matrix)

# Define a simple Keras Sequential model using the initialized embedding layer.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=len(vocab),
        output_dim=embedding_dim,
        embeddings_initializer=embedding_initializer,
        trainable=False,
        mask_zero=True,
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile the model with a simple optimizer and binary crossentropy loss.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Create a tiny batch of token index sequences representing example sentences.
example_sequences = np.array([[2, 3, 4], [4, 5, 0]], dtype=np.int32)

# Run a forward pass to obtain predictions and verify everything works.
predictions = model.predict(example_sequences, verbose=0)

# Print the embedding matrix rows for refund and banana tokens for comparison.
print("Embedding for 'refund':", embedding_matrix[vocab.index("refund")])
print("Embedding for 'banana':", embedding_matrix[vocab.index("banana")])

# Print the model predictions to show the initialized embeddings are usable.
print("Model predictions shape:", predictions.shape)
print("Model predictions values:", predictions)
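
The "freeze first, then unfreeze" strategy from the bullets can be sketched by toggling the layer's `trainable` flag and recompiling. The random matrix below stands in for real pretrained vectors, and the lower learning rate used after unfreezing is a common choice rather than a requirement.

```python
import numpy as np
import tensorflow as tf

# Stand-in for a pretrained matrix: 6 tokens, 4 dimensions (random, illustrative).
embedding_matrix = np.random.default_rng(0).normal(size=(6, 4)).astype("float32")

embedding_layer = tf.keras.layers.Embedding(
    input_dim=6,
    output_dim=4,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # Phase 1: keep the pretrained vectors fixed.
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,), dtype="int32"),
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Only the Dense kernel and bias are trainable while the embedding is frozen.
frozen_count = len(model.trainable_weights)

# Phase 2: unfreeze the embedding and recompile so training picks up the change.
embedding_layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
unfrozen_count = len(model.trainable_weights)

print("Trainable weight tensors while frozen:", frozen_count)
print("Trainable weight tensors after unfreezing:", unfrozen_count)
```

In practice, a call to `model.fit` would sit between the two phases, so the classifier head settles before the embeddings start to move.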



# <font color="#418FDE" size="6.5" uppercase>**Text Preprocessing**</font>


In this lecture, you learned to:
- Tokenize raw text into integer sequences using Keras text preprocessing tools. 
- Build tf.data pipelines that batch and pad text sequences for model input. 
- Create and configure embedding layers to represent tokens as dense vectors. 

In the next lecture (Lecture B), we will cover 'Sequence Models'.