# <font color="#418FDE" size="6.5" uppercase>**Text Preprocessing**</font>

>Last update: 20260126.
    
By the end of this lecture, you will be able to:
- Tokenize raw text into integer sequences using Keras text preprocessing tools. 
- Build tf.data pipelines that batch and pad text sequences for model input. 
- Create and configure embedding layers to represent tokens as dense vectors. 


## **1. Keras Text Tokenization**

### **1.1. Using TextVectorization Layer**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_01_01.jpg?v=1769419323" width="250">



>* TextVectorization turns raw text into integer sequences
>* Handles preprocessing and plugs directly into Keras models

>* Configure vocabulary size and output token format
>* Adapt on sample texts to produce consistent integers

>* Stores rules and vocabulary for consistent preprocessing
>* Ensures reliable deployment, collaboration, and model updates



In [None]:
#@title Python Code - Using TextVectorization Layer

# This script demonstrates Keras TextVectorization usage.
# It shows how raw text becomes integer token sequences.
# Run cells to observe simple deterministic preprocessing behavior.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras preprocessing tools.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Set deterministic random seeds for reproducibility.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Print TensorFlow version in one concise line.
print("TensorFlow version:", tf.__version__)

# Create a tiny corpus of example text sentences.
raw_texts = [
    "I loved this movie so much",
    "This movie was terrible and boring",
    "Absolutely fantastic acting and great story",
    "I would not recommend this movie",
]

# Convert list to TensorFlow constant tensor.
text_ds = tf.constant(raw_texts)

# Wrap tensor in a small Dataset for adaptation.
text_dataset = tf.data.Dataset.from_tensor_slices(text_ds)

# Define maximum vocabulary size and sequence length.
max_tokens = 20
sequence_length = 8

# Create a TextVectorization layer with basic settings.
vectorizer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Adapt the layer on the small text dataset.
vectorizer.adapt(text_dataset)

# Retrieve the learned vocabulary list from the layer.
vocab = vectorizer.get_vocabulary()

# Print a few vocabulary entries with their indices.
print("Vocabulary size:", len(vocab))
print("First ten tokens:", vocab[:10])

# Apply the vectorizer to the original raw texts.
vectorized_texts = vectorizer(text_ds)

# Confirm the resulting tensor shape is as expected.
print("Vectorized shape:", vectorized_texts.shape)

# Convert vectorized tensor to a NumPy array for inspection.
vectorized_array = vectorized_texts.numpy()

# Print each original sentence with its integer sequence.
for i, sentence in enumerate(raw_texts):
    print("Sentence:", sentence)
    print("Sequence:", vectorized_array[i])

# Build a simple tf.data pipeline with batching and padding.
text_pipeline = (
    text_dataset
    .batch(2)
    .map(vectorizer, num_parallel_calls=tf.data.AUTOTUNE)
)

# Take one batch from the pipeline and inspect its shape.
for batch in text_pipeline.take(1):
    print("Batch shape from pipeline:", batch.shape)
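
To make the layer's steps concrete, here is a minimal plain-Python sketch of the idea behind `output_mode="int"`: standardize, split on whitespace, rank tokens by frequency, then encode with truncation and right-padding. The helper names (`standardize`, `build_vocab`, `encode`) and the alphabetical tie-breaking are assumptions of this sketch, not Keras internals, so exact indices may differ from what `TextVectorization` assigns.

```python
import re
from collections import Counter


def standardize(text):
    """Lowercase and drop punctuation, similar in spirit to the layer's default."""
    return re.sub(r"[^\w\s]", "", text.lower())


def build_vocab(texts, max_tokens):
    """Return a token list with index 0 for padding and index 1 for unknowns."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent first; ties are broken alphabetically (a choice of this sketch).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


def encode(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or right-pad with 0."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["I loved this movie so much", "This movie was terrible and boring"]
vocab = build_vocab(texts, max_tokens=20)
encoded = encode("This movie was great!", vocab, sequence_length=8)

print(vocab[:4])  # ['', '[UNK]', 'movie', 'this']
print(encoded)    # [3, 2, 11, 1, 0, 0, 0, 0]
```

The unseen word "great" becomes index 1, and the four trailing zeros are padding up to the requested length of 8.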



### **1.2. Handling Unknown Tokens**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_01_02.jpg?v=1769419363" width="250">



>* Limited vocabularies create unknown or rare tokens
>* Keras maps unknown words to special indices

>* Unknown token handling affects model generalization strongly
>* Mapping unseen words to one index keeps sequence length but discards word identity

>* Balance vocab size, coverage, and model capacity
>* Use Keras controls and multiple unknown token indices



In [None]:
#@title Python Code - Handling Unknown Tokens

# This script shows Keras unknown token handling.
# It focuses on simple text tokenization concepts.
# Run each part to observe unknown token behavior.

# !pip install tensorflow==2.20.0

# Import TensorFlow and Keras preprocessing utilities.
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Print TensorFlow version for reproducibility.
print("TensorFlow version:", tf.__version__)

# Define a tiny training corpus with simple sentences.
train_sentences = [
    "this phone is great",
    "this phone is bad",
    "this camera is great",
]

# Define some new sentences containing unknown words.
new_sentences = [
    "this phone is absolutely fire",
    "this tablet is great",
]

# Limit the vocabulary. With num_words=5 only indices 1 to 4 are kept
# (index 1 is the OOV token), so rarer training words also map to <OOV>.
max_words = 5

# Reserve index for out of vocabulary tokens.
oov_token = "<OOV>"

# Initialize tokenizer with max words and oov token.
tokenizer = Tokenizer(num_words=max_words, oov_token=oov_token)

# Fit tokenizer on training sentences only.
tokenizer.fit_on_texts(train_sentences)

# Show the learned word index dictionary.
print("Word index:", tokenizer.word_index)

# Convert training sentences to integer sequences.
train_sequences = tokenizer.texts_to_sequences(train_sentences)

# Print tokenized training sequences for inspection.
print("Training sequences:", train_sequences)

# Convert new sentences containing unknown words.
new_sequences = tokenizer.texts_to_sequences(new_sentences)

# Print tokenized new sequences with unknown handling.
print("New sequences:", new_sequences)

# Extract the index used for unknown tokens.
unknown_index = tokenizer.word_index.get(oov_token, None)

# Safely print the unknown token index value.
print("Unknown token index:", unknown_index)

# Count how many unknown tokens appear in new sequences.
unknown_count = sum(
    1 for seq in new_sequences for token in seq if token == unknown_index
)

# Print how many tokens were mapped as unknown.
print("Number of unknown tokens:", unknown_count)

# Demonstrate that sequence lengths are still preserved.
lengths = [len(seq) for seq in new_sequences]

# Print the lengths of new sequences after tokenization.
print("New sequence lengths:", lengths)
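
The idea of several unknown-token indices can be sketched in plain Python: instead of sending every unseen word to one index, hash it into one of a few OOV buckets so that different unseen words can sometimes be told apart. Keras exposes this idea through the `num_oov_indices` argument of `StringLookup`; the index layout and the `zlib.crc32` hash below are assumptions of this sketch and not the scheme Keras uses.

```python
import zlib


def make_encoder(vocab, num_oov_buckets):
    """Return a function that maps a word to an integer index.

    Layout used by this sketch: 0 is padding, 1..num_oov_buckets are
    OOV buckets, and known words follow.
    """
    first_known = 1 + num_oov_buckets
    index = {word: i + first_known for i, word in enumerate(vocab)}

    def encode_word(word):
        if word in index:
            return index[word]
        # A stable hash keeps the same unseen word in the same bucket.
        return 1 + zlib.crc32(word.encode("utf-8")) % num_oov_buckets

    return encode_word


encode_word = make_encoder(["this", "phone", "is", "great"], num_oov_buckets=3)
sentence = "this tablet is absolutely great".split()
ids = [encode_word(word) for word in sentence]
print(ids)
```

Known words receive indices 4 to 7 here, while "tablet" and "absolutely" each land in one of the buckets 1 to 3. More buckets reduce collisions between unseen words at the cost of extra embedding rows.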



### **1.3. Building and Saving Vocabularies**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_01_03.jpg?v=1769419400" width="250">



>* Vocabulary maps unique tokens to integer indices
>* Control vocab size and frequency to balance coverage

>* Keep token-to-index mapping stable across runs
>* Save and reload vocab files for consistency

>* Update vocabularies as language and data change
>* Version, reuse, and align vocabularies for compatibility



In [None]:
#@title Python Code - Building and Saving Vocabularies

# This script shows basic Keras text tokenization.
# It focuses on building and saving vocabularies.
# Run cells in order inside Google Colab.

# Install TensorFlow if not already available.
# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import json
import random

# Import TensorFlow and Keras preprocessing tools.
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Set deterministic seeds for reproducibility.
SEED = 42
random.seed(SEED)

# Set TensorFlow random seed for reproducibility.
tf.random.set_seed(SEED)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a tiny sample corpus of sentences.
corpus = [
    "This movie was great and very fun",
    "The movie was okay but a bit long",
    "I did not enjoy this movie at all",
]

# Show the raw corpus to the learner.
print("Sample corpus:")
print(corpus)

# Define basic TextVectorization configuration.
max_tokens = 20
output_sequence_length = 8

# Create the TextVectorization layer instance.
vectorizer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=output_sequence_length,
)

# Convert corpus to a TensorFlow Dataset.
text_ds = tf.data.Dataset.from_tensor_slices(corpus)

# Adapt the vectorizer to build the vocabulary.
vectorizer.adapt(text_ds)

# Get the learned vocabulary as a Python list.
vocab = vectorizer.get_vocabulary()

# Show the vocabulary and its size.
print("\nLearned vocabulary:")
print(vocab)
print("Vocabulary size:", len(vocab))

# Build a mapping from token to index for clarity.
word_to_index = {word: index for index, word in enumerate(vocab)}

# Show a few example token to index pairs.
print("\nExample token to index mapping:")
for word in ["", "[UNK]", "movie", "great"]:
    if word in word_to_index:
        print(word, "->", word_to_index[word])

# Choose a filename for the saved vocabulary.
vocab_filename = "vocab_example.json"

# Save the vocabulary list as a JSON file.
with open(vocab_filename, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)

# Confirm that the file now exists on disk.
file_exists = os.path.exists(vocab_filename)
print("\nVocabulary file saved:", file_exists)

# Load the vocabulary back from the JSON file.
with open(vocab_filename, "r", encoding="utf-8") as f:
    loaded_vocab = json.load(f)

# Verify that loaded vocabulary matches the original.
print("Loaded vocabulary matches:", loaded_vocab == vocab)

# Create a new TextVectorization layer using loaded vocabulary.
new_vectorizer = TextVectorization(
    max_tokens=len(loaded_vocab),
    output_mode="int",
    output_sequence_length=output_sequence_length,
)

# Set the vocabulary directly without adapting again.
new_vectorizer.set_vocabulary(loaded_vocab)

# Pick a test sentence to compare encodings.
test_sentence = tf.constant(["This movie was great"])

# Vectorize with the original adapted layer.
original_encoded = vectorizer(test_sentence)

# Vectorize with the new layer using loaded vocabulary.
reloaded_encoded = new_vectorizer(test_sentence)

# Show that both encodings are identical.
print("\nOriginal encoding:", original_encoded.numpy())
print("Reloaded encoding:", reloaded_encoded.numpy())

# Final check that shapes are as expected.
print("Encoding shape:", original_encoded.shape)
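
Because token order defines the index of every word, it can help to store a fingerprint next to a saved vocabulary and check it on load. The sketch below is one possible approach using only the standard library; the function names and the record layout (`fingerprint`, `tokens`) are hypothetical and not part of Keras.

```python
import hashlib
import json


def vocab_fingerprint(vocab):
    """Hash the ordered token list; any reordering changes the fingerprint."""
    payload = json.dumps(vocab, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dump_vocab(vocab):
    """Serialize the vocabulary together with its fingerprint."""
    record = {"fingerprint": vocab_fingerprint(vocab), "tokens": vocab}
    return json.dumps(record, ensure_ascii=False)


def load_vocab(serialized):
    """Return the token list, refusing data whose fingerprint does not match."""
    record = json.loads(serialized)
    if vocab_fingerprint(record["tokens"]) != record["fingerprint"]:
        raise ValueError("Vocabulary does not match its recorded fingerprint")
    return record["tokens"]


vocab = ["", "[UNK]", "movie", "this", "was"]
restored = load_vocab(dump_vocab(vocab))
print("Round trip OK:", restored == vocab)
```

The same fingerprint can be logged with a trained model so that a mismatched vocabulary is caught before inference rather than showing up as silently wrong predictions.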



## **2. Batching Text Sequences**

### **2.1. Padding to Fixed Length**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_02_01.jpg?v=1769419440" width="250">



>* Text sequences vary; models need uniform lengths
>* Pad shorter sequences with special tokens for batching

>* Balance truncation risk against padding overhead
>* Choose max length from dataset length distribution

>* Choose padding start or end by model
>* Use unique pad token and thoughtful length



In [None]:
#@title Python Code - Padding to Fixed Length

# This script shows padding text sequences clearly.
# It focuses on batching sequences to fixed length.
# Run each part to observe shapes and padding.

# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow.
import tensorflow as tf

# Set a deterministic random seed value.
tf.random.set_seed(7)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Create a tiny list of example sentences.
raw_texts = [
    "I love TensorFlow",
    "Padding makes batches rectangular",
    "Short",
    "This sentence is a bit longer than others",
]

# Show the raw texts to understand input.
print("Raw texts:")
print(raw_texts)

# Create a TextVectorization layer for tokenization.
vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="int",
)

# Adapt the vectorizer on the raw texts.
vectorizer.adapt(raw_texts)

# Vectorize one sentence at a time so each sequence keeps its own length.
# Calling the layer on the whole batch would already pad every row with
# zeros up to the longest sentence, leaving nothing to pad afterwards.
seq_ds = tf.data.Dataset.from_tensor_slices(raw_texts).map(vectorizer)

# Print unpadded integer sequences and shapes.
print("\nInteger sequences before padding:")
for seq in seq_ds:
    print(seq.numpy(), "shape:", seq.shape)

# Decide a fixed maximum sequence length.
max_length = 8

# Define a function to pad each sequence.
def pad_to_fixed_length(seq):
    seq = seq[:max_length]
    padded = tf.pad(
        seq,
        paddings=[[0, max_length - tf.shape(seq)[0]]],
        constant_values=0,
    )
    return padded

# Map padding function and batch the dataset.
batched_ds = (
    seq_ds.map(pad_to_fixed_length)
    .batch(2)
)

# Take two batches and inspect their shapes.
for batch in batched_ds.take(2):
    print("\nBatch shape:", batch.shape)
    print("Batch contents:\n", batch.numpy())

# Create an Embedding layer for padded tokens.
vocab_size = vectorizer.vocabulary_size()
embedding_dim = 4
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    mask_zero=True,
)

# Pass one padded batch through the embedding layer.
for batch in batched_ds.take(1):
    embedded = embedding_layer(batch)
    print("\nEmbedded batch shape:", embedded.shape)
    print("Mask from embedding:", embedding_layer.compute_mask(batch))
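
Choosing the maximum length from the length distribution can be sketched without TensorFlow. The example below picks the smallest length that covers a given percentage of sequences, then truncates or pads to it on either side. The helper names and the 75 percent threshold are illustrative assumptions, not recommendations from Keras.

```python
import math


def length_at_percentile(lengths, percentile):
    """Smallest length that covers at least `percentile` percent of sequences."""
    ordered = sorted(lengths)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def pad_or_truncate(seq, max_len, padding="post", pad_id=0):
    """Cut the sequence to max_len, then fill with pad_id at the end or start."""
    seq = seq[:max_len]
    fill = [pad_id] * (max_len - len(seq))
    return seq + fill if padding == "post" else fill + seq


sequences = [[5, 2], [7, 3, 9, 4], [8], [6, 1, 2, 3, 4, 5, 9, 7, 2, 1]]
max_len = length_at_percentile([len(s) for s in sequences], 75)
padded = [pad_or_truncate(s, max_len) for s in sequences]

print(max_len)  # 4
print(padded)   # [[5, 2, 0, 0], [7, 3, 9, 4], [8, 0, 0, 0], [6, 1, 2, 3]]
print(pad_or_truncate([8], max_len, padding="pre"))  # [0, 0, 0, 8]
```

Only the one unusually long sequence is truncated, while the other three need at most three padding positions each.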




### **2.2. Handling Padding Masks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_02_02.jpg?v=1769419474" width="250">



>* Padding masks mark real tokens versus padding
>* They prevent models learning from meaningless padded positions

>* Padding masks mirror batch and sequence shapes
>* They flag real tokens and ignore padded positions

>* Create masks from padded batches and propagate
>* Use separate masks for each padding level



In [None]:
#@title Python Code - Handling Padding Masks

# This script shows padding masks with text batches.
# It uses TensorFlow text preprocessing and masking.
# Focus on batching variable length sequences safely.

# !pip install tensorflow==2.20.0

# Import required TensorFlow modules.
import tensorflow as tf

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Create a tiny toy corpus of short sentences.
texts = [
    "I love TensorFlow",
    "Masks ignore padding",
    "Short",
    "Variable length text sequences",
]

# Build a TextVectorization layer for tokenization.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=50,
    output_mode="int",
    output_sequence_length=None,
)

# Convert Python list to a TensorFlow dataset.
text_ds = tf.data.Dataset.from_tensor_slices(texts)

# Adapt vectorizer vocabulary using the dataset.
vectorizer.adapt(text_ds.batch(4))

# Map raw text to integer sequences using vectorizer.
int_ds = text_ds.map(lambda x: vectorizer(x))

# Define a small maximum sequence length for padding.
max_len = 6

# Function to pad sequences to fixed length.
def pad_to_length(x):
    x = x[:max_len]
    pad_amount = max_len - tf.shape(x)[0]
    paddings = tf.stack([[0, pad_amount]])
    return tf.pad(x, paddings, constant_values=0)

# Apply padding and batch the sequences.
batched_ds = int_ds.map(pad_to_length).batch(2)

# Function to create a boolean padding mask.
def make_mask(batch):
    mask = tf.not_equal(batch, 0)
    return batch, mask

# Attach masks to each padded batch.
masked_ds = batched_ds.map(make_mask)

# Take one example batch and mask from dataset.
for batch_tokens, batch_mask in masked_ds.take(1):
    example_tokens = batch_tokens
    example_mask = batch_mask

# Print the token and mask shapes; they should match.
print("Batch tokens shape:", example_tokens.shape)
print("Batch mask shape:", example_mask.shape)

# Show the padded integer token sequences.
print("Padded token batch:\n", example_tokens.numpy())

# Show the corresponding boolean padding mask.
print("Padding mask batch:\n", example_mask.numpy())

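# Multiply by the mask to zero out padded positions.
# Padding is already 0 here, so the token values are unchanged; the same
# pattern matters for per-token features such as embeddings or losses.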
example_tokens = tf.cast(example_tokens, tf.int32)
masked_values = tf.cast(example_mask, tf.int32) * example_tokens

# Print masked values where padding positions become zeros.
print("Tokens after applying mask:\n", masked_values.numpy())
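
A small numeric example shows why the mask matters once per-position values are aggregated. The sketch below averages one hypothetical feature per token with and without the mask; the values are made up for illustration and `masked_mean` is a helper of this sketch, not a Keras function.

```python
def masked_mean(values, mask):
    """Average only the positions where the mask is True."""
    kept = [value for value, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


token_ids = [4, 9, 0, 0]         # two real tokens followed by two pad positions
features = [2.0, 4.0, 0.0, 0.0]  # one hypothetical feature value per position
mask = [token_id != 0 for token_id in token_ids]

naive = sum(features) / len(features)
masked = masked_mean(features, mask)

print(naive)   # 1.5
print(masked)  # 3.0
```

Without the mask, the two padded positions pull the average down to 1.5, so the result would depend on how much padding a sentence happened to receive.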



### **2.3. tf.data Batching**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_02_03.jpg?v=1769419527" width="250">



>* Use data pipelines to batch many text sequences
>* Pipelines output padded batches for scalable NLP training

>* Pipelines separate data handling from model logic
>* They shuffle, batch, pad, prefetch, and optimize

>* Pipelines flexibly handle complex, varied real-world text
>* Composable steps keep batches consistent and scalable



In [None]:
#@title Python Code - tf.data Batching

# This script demonstrates tf.data text batching.
# It focuses on padding and batching sequences.
# Run it in Colab to follow along.

# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy for this demo.
import tensorflow as tf
import numpy as np

# Set deterministic seeds for reproducibility.
tf.random.set_seed(7)
np.random.seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a few short example text sentences.
raw_texts = [
    "I love TensorFlow for NLP",
    "Batching sequences keeps training efficient",
    "Padding makes shapes consistent",
    "tf data pipelines are powerful",
]

# Create a simple TextVectorization layer.
# Leave output_sequence_length unset so sequences keep their own length.
vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="int",
)

# Adapt the vectorizer on the raw texts.
text_ds = tf.data.Dataset.from_tensor_slices(raw_texts)
vectorizer.adapt(text_ds)

# Vectorize one sentence at a time to get variable-length sequences.
seq_ds = text_ds.map(vectorizer)

# Inspect one example sequence length safely.
for one_seq in seq_ds.take(1):
    print("One sequence length:", one_seq.shape[0])

# Define a small batch size for demonstration.
batch_size = 2

# Shuffle the dataset to randomize order.
shuffled_ds = seq_ds.shuffle(buffer_size=len(raw_texts))

# Batch sequences and let dataset handle padding.
batched_padded_ds = shuffled_ds.padded_batch(
    batch_size=batch_size,
    padded_shapes=(None,),
    padding_values=tf.constant(0, dtype=tf.int64),
)

# Prefetch to overlap data preparation and usage.
final_ds = batched_padded_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# Iterate over a few batches and print shapes.
for batch_index, batch in enumerate(final_ds):
    print("Batch", batch_index, "shape:", batch.shape)

# Show one batch content to visualize padding.
for example_batch in final_ds.take(1):
    print("Example batch tensor:\n", example_batch.numpy())
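
What per-batch padding does can be sketched in a few lines of plain Python: each batch is padded only to the length of its own longest row, so batches may have different widths. The generator below is a simplified illustration under that assumption and leaves out shuffling and prefetching.

```python
def padded_batches(sequences, batch_size, pad_id=0):
    """Yield batches whose rows are padded to that batch's longest row."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        yield [seq + [pad_id] * (width - len(seq)) for seq in chunk]


sequences = [[3, 1], [4, 1, 5, 9], [2], [6, 5, 3]]
batches = list(padded_batches(sequences, batch_size=2))

for batch in batches:
    print(batch)
# [[3, 1, 0, 0], [4, 1, 5, 9]]
# [[2, 0, 0], [6, 5, 3]]
```

The first batch is four tokens wide and the second only three, which wastes less padding than forcing every batch to one global length.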




## **3. Configuring Embedding Layers**

### **3.1. Embedding Layer Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_03_01.jpg?v=1769419562" width="250">



>* Embedding layers turn token IDs into dense vectors
>* Similar tokens learn nearby vectors, unlike one-hot

>* Embedding layer is a trainable lookup matrix
>* Learns compact vectors that preserve task-specific meaning

>* Embeddings turn token IDs into rich vectors
>* They reveal patterns and power advanced text models



In [None]:
#@title Python Code - Embedding Layer Basics

# This script explains basic embedding layers.
# It uses tiny text data for clarity.
# Run each part to observe printed outputs.

# Install TensorFlow if missing in your environment.
# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy modules.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds for reproducibility.
tf.random.set_seed(7)
np.random.seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a tiny toy corpus of short sentences.
corpus = [
    "this movie was great",
    "this movie was terrible",
    "the film was fantastic",
]

# Create a simple TextVectorization layer.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20,
    output_mode="int",
    output_sequence_length=5,
)

# Adapt the vectorizer on the small corpus.
vectorizer.adapt(corpus)

# Vectorize the corpus into integer token sequences.
int_sequences = vectorizer(corpus)

# Convert sequences to NumPy for easy inspection.
int_sequences_np = int_sequences.numpy()

# Print the integer sequences for each sentence.
print("Integer sequences for corpus:")
print(int_sequences_np)

# Get vocabulary size from the vectorizer.
vocab_size = len(vectorizer.get_vocabulary())

# Define embedding dimension for dense token vectors.
embedding_dim = 4

# Create a basic Embedding layer instance.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    mask_zero=True,
)

# Pass integer sequences through the embedding layer.
embedded_sequences = embedding_layer(int_sequences)

# Print the embedded tensor shape: (sentences, tokens, embedding_dim).
print("Embedded tensor shape:", embedded_sequences.shape)

# Select first sentence embeddings for inspection.
first_sentence_embeddings = embedded_sequences[0]

# Convert first sentence embeddings to NumPy array.
first_sentence_np = first_sentence_embeddings.numpy()

# Print embeddings for the first sentence tokens.
print("Embeddings for first sentence tokens:")
print(first_sentence_np)
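
The statement that an embedding layer is a lookup matrix can be checked by hand: selecting row `i` of the table gives the same result as multiplying a one-hot vector for `i` by the table. The tiny table below uses made-up values and plain Python lists, so it is only an illustration of the equivalence, not of how Keras stores weights.

```python
embedding_table = [
    [0.0, 0.0],   # id 0: padding
    [0.1, -0.3],  # id 1
    [0.7, 0.2],   # id 2
]


def lookup(token_ids, table):
    """Pick one row of the table per token id."""
    return [table[token_id] for token_id in token_ids]


def one_hot_matmul(token_ids, table):
    """Build a one-hot row per token id and multiply it by the table."""
    vocab_size, dim = len(table), len(table[0])
    rows = []
    for token_id in token_ids:
        one_hot = [1.0 if j == token_id else 0.0 for j in range(vocab_size)]
        row = [
            sum(one_hot[j] * table[j][d] for j in range(vocab_size))
            for d in range(dim)
        ]
        rows.append(row)
    return rows


token_ids = [2, 1, 0]
by_lookup = lookup(token_ids, embedding_table)
by_matmul = one_hot_matmul(token_ids, embedding_table)
print(by_lookup)
print("Same result:", by_lookup == by_matmul)
```

The lookup form skips the multiplications entirely, which is why it scales to vocabularies where one-hot vectors would be impractically wide.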



### **3.2. Understanding Embedding Size**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_03_02.jpg?v=1769419597" width="250">



>* Embedding size sets features per vocabulary token
>* Too small loses nuance; larger captures richer relationships

>* Choose size based on task and data
>* Too small underfits; too large overfits, costly

>* Embedding size affects parameters, speed, and memory
>* Start with a modest size, then tune it for each task



In [None]:
#@title Python Code - Understanding Embedding Size

# This script explores embedding size choices.
# It uses tiny text data for clarity.
# Run all cells together in Google Colab.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras layers.
import tensorflow as tf
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a tiny corpus of short sentences.
corpus = [
    "I love this movie",
    "This movie is terrible",
    "Amazing acting and great story",
    "The plot was boring",
]

# Build a simple word index manually.
word_index = {"<pad>": 0}
for sentence in corpus:
    for word in sentence.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index)

# Show the vocabulary size including padding.
vocab_size = len(word_index)
print("Vocabulary size including pad:", vocab_size)

# Convert sentences to integer sequences.
sequences = []
for sentence in corpus:
    tokens = []
    for word in sentence.lower().split():
        tokens.append(word_index[word])
    sequences.append(tokens)

# Pad sequences to the same length.
max_len = max(len(seq) for seq in sequences)
padded = tf.keras.utils.pad_sequences(
    sequences, maxlen=max_len, padding="post"
)

# Wrap padded data in a small tf.data Dataset.
dataset = tf.data.Dataset.from_tensor_slices(padded)
dataset = dataset.batch(2)

# Define two different embedding sizes.
embedding_dim_small = 4
embedding_dim_large = 16

# Create a small embedding layer instance.
small_embedding = layers.Embedding(
    input_dim=vocab_size, output_dim=embedding_dim_small
)

# Create a large embedding layer instance.
large_embedding = layers.Embedding(
    input_dim=vocab_size, output_dim=embedding_dim_large
)

# Take one batch of token ids from dataset.
for batch_tokens in dataset.take(1):
    example_tokens = batch_tokens

# Print the batch shape before embedding.
print("Token batch shape:", example_tokens.shape)

# Compute small embedding representations.
small_emb_output = small_embedding(example_tokens)

# Compute large embedding representations.
large_emb_output = large_embedding(example_tokens)

# Print shapes to compare embedding sizes.
print("Small embedding output shape:", small_emb_output.shape)
print("Large embedding output shape:", large_emb_output.shape)

# Show a single token id and its small embedding.
first_token_id = int(example_tokens[0, 0].numpy())
print("First token id:", first_token_id)
print("Small embedding vector:", small_emb_output[0, 0].numpy())

# Show the same token id and its large embedding.
print("Large embedding vector:", large_emb_output[0, 0].numpy())
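
The cost of a larger embedding size follows from simple arithmetic: the table holds `vocab_size * embedding_dim` weights. The sketch below prints the parameter count and an approximate memory footprint, assuming 4 bytes per value (float32) and a hypothetical vocabulary of 20,000 tokens.

```python
def embedding_params(vocab_size, embedding_dim):
    """Number of trainable weights in an embedding table."""
    return vocab_size * embedding_dim


def embedding_megabytes(vocab_size, embedding_dim, bytes_per_value=4):
    """Approximate table size in megabytes (10**6 bytes)."""
    return embedding_params(vocab_size, embedding_dim) * bytes_per_value / 1e6


vocab_size = 20_000  # hypothetical vocabulary size
for dim in (4, 16, 128):
    params = embedding_params(vocab_size, dim)
    megabytes = embedding_megabytes(vocab_size, dim)
    print(f"dim={dim:>3}  params={params:>9,}  approx {megabytes:.2f} MB")
```

Parameters grow linearly with both the vocabulary size and the embedding size, so doubling either one doubles the table.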




### **3.3. Using Pretrained Embeddings**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_A/image_03_03.jpg?v=1769419637" width="250">



>* Pretrained embeddings give models rich language knowledge
>* They speed training and boost performance on NLP tasks

>* Match vocabulary, load known pretrained word vectors
>* Initialize missing words, model starts with semantics

>* Choose to freeze or fine-tune embeddings
>* Balance stability with domain adaptation for performance



In [None]:
#@title Python Code - Using Pretrained Embeddings

# This script shows pretrained embeddings usage simply.
# It builds a tiny model with a frozen embedding.
# Focus is on configuring the Keras Embedding layer.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras layers.
import tensorflow as tf
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define a tiny example vocabulary list.
vocab = ["<pad>", "<unk>", "i", "love", "nlp", "and", "tensorflow"]

# Build a mapping from token to integer index.
word_index = {word: idx for idx, word in enumerate(vocab)}

# Set vocabulary size and embedding dimension.
vocab_size = len(vocab)
embedding_dim = 4

# Create a fake pretrained embedding matrix.
pretrained_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")

# Fill non special tokens with simple deterministic vectors.
for word, idx in word_index.items():
    if word in ("<pad>", "<unk>"):
        continue
    base = float(idx)
    pretrained_matrix[idx] = np.array([
        base * 0.1,
        base * 0.1 + 0.1,
        base * 0.1 + 0.2,
        base * 0.1 + 0.3,
    ], dtype="float32")

# Verify the pretrained matrix shape safely.
assert pretrained_matrix.shape == (vocab_size, embedding_dim)

# Build a simple Keras model using the matrix.
inputs = layers.Input(shape=(3,), dtype="int32", name="token_ids")

# Configure an Embedding layer with pretrained weights.
embedding_layer = layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
    trainable=False,
    mask_zero=True,
    name="pretrained_embedding",
)

# Apply the embedding layer to the inputs.
embedded = embedding_layer(inputs)

# Pool token embeddings into a single vector.
pooled = layers.GlobalAveragePooling1D()(embedded)

# Add a tiny dense output layer for demonstration.
outputs = layers.Dense(1, activation="sigmoid")(pooled)

# Create the final model object.
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the model with simple settings.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Print the model summary.
model.summary()

# Prepare two tiny example sentences as token ids.
example_sentences = np.array([
    [word_index["i"], word_index["love"], word_index["nlp"]],
    [word_index["i"], word_index["love"], word_index["tensorflow"]],
], dtype="int32")

# Run a forward pass to get predictions.
preds = model.predict(example_sentences, verbose=0)

# Fetch the embedding weights from the layer.
emb_weights = embedding_layer.get_weights()[0]

# Print a few key results for inspection.
print("Vocabulary:", vocab)
print("Embedding matrix shape:", emb_weights.shape)
print("Embedding for 'love':", emb_weights[word_index["love"]])
print("Embedding for 'tensorflow':", emb_weights[word_index["tensorflow"]])
print("Model predictions shape:", preds.shape)
print("Predictions sample:", preds[0])
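
With real pretrained vectors, the matrix is usually filled by matching each vocabulary token against the pretrained set and giving a fallback row to tokens that are missing. The sketch below shows that alignment step in plain Python; the `pretrained` dictionary holds made-up two-dimensional vectors, and using zeros for missing tokens is one possible choice among several.

```python
# Hypothetical pretrained vectors; real ones would be loaded from a file.
pretrained = {"love": [0.9, 0.1], "nlp": [0.2, 0.8]}
vocab = ["<pad>", "<unk>", "i", "love", "nlp"]


def build_matrix(vocab, vectors, dim):
    """Return one row per vocabulary token plus the number of matches."""
    matrix, hits = [], 0
    for token in vocab:
        if token in vectors:
            matrix.append(list(vectors[token]))
            hits += 1
        else:
            # Fallback for tokens without a pretrained vector.
            matrix.append([0.0] * dim)
    return matrix, hits


matrix, hits = build_matrix(vocab, pretrained, dim=2)
print("Matched tokens:", hits, "of", len(vocab))
print("Row for 'love':", matrix[vocab.index("love")])
```

Row order follows the vocabulary, so row `i` is the vector for token id `i`. The match count is worth checking, since low coverage means most of the table carries no pretrained information.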




# <font color="#418FDE" size="6.5" uppercase>**Text Preprocessing**</font>


In this lecture, you learned to:
- Tokenize raw text into integer sequences using Keras text preprocessing tools. 
- Build tf.data pipelines that batch and pad text sequences for model input. 
- Create and configure embedding layers to represent tokens as dense vectors. 

In the next lecture (Lecture B), we will cover 'Sequence Models'.