diff --git a/examples/README.md b/examples/README.md deleted file mode 100644 index e7426169c8..0000000000 --- a/examples/README.md +++ /dev/null @@ -1,5 +0,0 @@ -# KerasNLP Examples - -The `examples/` directory contains scripts built on top of the library that do not fit well into -the colab format used on [keras.io](https://keras.io/examples/). This includes recipes for -pre-training models and evaluating models on benchmarks such as GLUE. diff --git a/examples/__init__.py b/examples/__init__.py deleted file mode 100644 index 3364a6bd16..0000000000 --- a/examples/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/bert_pretraining/README.md b/examples/bert_pretraining/README.md deleted file mode 100644 index acd47313fd..0000000000 --- a/examples/bert_pretraining/README.md +++ /dev/null @@ -1,199 +0,0 @@ -# BERT with KerasNLP - -This example demonstrates how to train a Bidirectional Encoder -Representations from Transformers (BERT) model end-to-end using the KerasNLP -library. This README contains instructions on how to run pretraining directly -from raw data, followed by finetuning and evaluation on the GLUE dataset. - -## Quickly test out the code - -To exercise the code in this directory by training a tiny BERT model, you can -run the following commands from the base directory of the repository. This can -be useful to validate any code changes, but note that a useful BERT model would -need to be trained for much longer on a much larger dataset. - -```shell -OUTPUT_DIR=~/bert_test_output -DATA_URL=https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert - -# Create the output directory. -mkdir -p $OUTPUT_DIR - -# Download example data. -wget ${DATA_URL}/bert_vocab_uncased.txt -O $OUTPUT_DIR/bert_vocab_uncased.txt -wget ${DATA_URL}/wiki_example_data.txt -O $OUTPUT_DIR/wiki_example_data.txt - -# Parse input data and split into sentences. -python3 examples/tools/split_sentences.py \ - --input_files $OUTPUT_DIR/wiki_example_data.txt \ - --output_directory $OUTPUT_DIR/sentence-split-data -# Preprocess input for pretraining. -python3 examples/bert_pretraining/bert_create_pretraining_data.py \ - --input_files $OUTPUT_DIR/sentence-split-data/ \ - --vocab_file $OUTPUT_DIR/bert_vocab_uncased.txt \ - --output_file $OUTPUT_DIR/pretraining-data/pretraining.tfrecord -# Run pretraining for 100 train steps only. -python3 examples/bert_pretraining/bert_pretrain.py \ - --input_directory $OUTPUT_DIR/pretraining-data/ \ - --vocab_file $OUTPUT_DIR/bert_vocab_uncased.txt \ - --saved_model_output $OUTPUT_DIR/model/ \ - --num_train_steps 100 -``` - -## Installing dependencies - -This example needs a few extra dependencies to run (e.g. wikiextractor for -using wikipedia downloads). You can install these into a KerasNLP development -environment with: - -```shell -pip install -r "examples/bert_pretraining/requirements.txt" -``` - -## Pretraining BERT - -Training a BERT model happens in two stages. First, the model is "pretrained" on -a large corpus of input text.
This is computationally expensive. After -pretraining, the model can be "finetuned" on a downstream task with a much -smaller amount of labeled data. - -### Downloading pretraining data - -The BERT pretraining data (Wikipedia + BooksCorpus) is fairly large. The raw -input data takes roughly 20GB of space, and after preprocessing, the full -corpus will take ~400GB. - -The latest Wikipedia dump can be downloaded -[at this link](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), -or via the command line: - -```shell -curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 -``` -The dump can be extracted with the `wikiextractor` tool. - -```shell -python3 -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 -``` - -BooksCorpus is no longer hosted by -[its creators](https://yknzhu.wixsite.com/mbweb), but you can find instructions -for downloading or reproducing the corpus in -[this repository](https://github.com/soskek/bookcorpus). We suggest the pre-made file -downloads listed at the top of the README. Alternatively, you can forgo it -entirely and pretrain solely on Wikipedia. - -Preparing the pretraining data happens in two stages. First, raw text needs -to be split into lists of sentences per document. Second, this sentence-split -data is used to create training examples with both masked words and -next sentence predictions. - -### Splitting raw text into sentences - -Next, use `examples/tools/split_sentences.py` to process raw input files and -split them into output files where each line contains a sentence, and a blank -line marks the start of a new document. We need this for the next-sentence -prediction task used by BERT. - -For example, if Wikipedia files are located in `~/datasets/wikipedia` and -BooksCorpus files in `~/datasets/bookscorpus`, the following command will output -sentence-split documents to a configurable number of output file shards: - -```shell -python3 examples/tools/split_sentences.py \ - --input_files ~/datasets/wikipedia,~/datasets/bookscorpus \ - --output_directory ~/datasets/sentence-split-data -``` - -### Computing a WordPiece vocabulary - -The easiest and best approach when training BERT is to use the official -vocabularies from the original project, which have become somewhat standard. - -You can download the English uncased vocabulary -[here](https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt), -or in your terminal run: - -```shell -curl -O https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt -``` - -You can also use `examples/tools/train_word_piece_vocab.py` to train your own. - -### Tokenize, mask, and combine sentences into training examples - -The `bert_create_pretraining_data.py` script will take in a set of sentence-split files, and -set up training examples for the next sentence prediction and masked word tasks. - -The output of the script will be TFRecord files with a number of fields per -example. Below is a complete output example, with the addition of a string -`tokens` field for clarity. The actual script will only serialize the token ids -to conserve disk space.
- -```python -tokens: ['[CLS]', 'resin', '##s', 'are', 'today', '[MASK]', 'produced', 'by', - 'ang', '##ios', '##per', '##ms', ',', 'and', 'tend', 'to', '[SEP]', - '[MASK]', 'produced', 'a', '[MASK]', '[MASK]', 'of', 'resin', ',', - 'which', '[MASK]', 'often', 'found', 'as', 'amber', '[SEP]'] -input_ids: [101, 24604, 2015, 2024, 2651, 103, 2550, 2011, 17076, 10735, 4842, - 5244, 1010, 1998, 7166, 2000, 102, 103, 2550, 1037, 103, 103, 1997, - 24604, 1010, 2029, 103, 2411, 2179, 2004, 8994, 102] -input_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] -segment_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] -masked_lm_positions: [5, 17, 20, 21, 26] -masked_lm_ids: [2069, 3619, 2353, 2828, 2003] -masked_lm_weights: [1.0, 1.0, 1.0, 1.0, 1.0] -next_sentence_labels: [0] -``` - -In order to set up the next sentence prediction task, the script will load the -entire input into memory. As such, it is recommended to run this script on a -subset of the input data at a time. - -For example, you can run the script on each file shard in a directory -with the following: - -```shell -for file in path/to/sentence-split-data/*; do - output="path/to/pretraining-data/$(basename -- "$file" .txt).tfrecord" - python3 examples/bert_pretraining/bert_create_pretraining_data.py \ - --input_files ${file} \ - --vocab_file bert_vocab_uncased.txt \ - --output_file ${output} -done -``` - -If enough memory is available, this could be further sped up by running this script -multiple times in parallel. The following will take 3-4 hours on the entire dataset -on an 8 core machine. - -```shell -NUM_JOBS=5 -for file in path/to/sentence-split-data/*; do - output="path/to/pretraining-data/$(basename -- "$file" .txt).tfrecord" - echo python3 examples/bert_pretraining/bert_create_pretraining_data.py \ - --input_files ${file} \ - --vocab_file bert_vocab_uncased.txt \ - --output_file ${output} -done | parallel -j ${NUM_JOBS} -``` - -To preview a sample of generated data files, you can run the command below: - -```shell -python3 -c "from examples.utils.data_utils import preview_tfrecord; preview_tfrecord('path/to/tfrecord_file')" -``` - -### Running BERT pretraining - -After preprocessing, we can run pretraining with the `bert_pretrain.py` -script. This will train a model and save it to the `--saved_model_output` -directory. If you are willing to train from data stored on google cloud storage bucket (GCS), you can do it by setting the file path to -the URL of GCS bucket. For example, `--input_directory=gs://your-bucket-name/you-data-path`. You can also save models directly to GCS by the same approach. - -```shell -python3 examples/bert_pretraining/bert_pretrain.py \ - --input_directory path/to/data/ \ - --vocab_file path/to/bert_vocab_uncased.txt \ - --model_size tiny \ - --saved_model_output path/to/model/ -``` diff --git a/examples/bert_pretraining/__init__.py b/examples/bert_pretraining/__init__.py deleted file mode 100644 index 3364a6bd16..0000000000 --- a/examples/bert_pretraining/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/bert_pretraining/bert_config.py b/examples/bert_pretraining/bert_config.py deleted file mode 100644 index 5c28ceae70..0000000000 --- a/examples/bert_pretraining/bert_config.py +++ /dev/null @@ -1,79 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# TODO(jbischof): remove in favor of presets with load_weights=False -MODEL_CONFIGS = { - "tiny": { - "num_layers": 2, - "hidden_dim": 128, - "dropout": 0.1, - "num_heads": 2, - "intermediate_dim": 512, - }, - "mini": { - "num_layers": 4, - "hidden_dim": 256, - "dropout": 0.1, - "num_heads": 4, - "intermediate_dim": 1024, - }, - "small": { - "num_layers": 4, - "hidden_dim": 512, - "dropout": 0.1, - "num_heads": 8, - "intermediate_dim": 2048, - }, - "medium": { - "num_layers": 8, - "hidden_dim": 512, - "dropout": 0.1, - "num_heads": 8, - "intermediate_dim": 2048, - }, - "base": { - "num_layers": 12, - "hidden_dim": 768, - "dropout": 0.1, - "num_heads": 12, - "intermediate_dim": 3072, - }, - "large": { - "num_layers": 24, - "hidden_dim": 1024, - "dropout": 0.1, - "num_heads": 16, - "intermediate_dim": 4096, - }, -} - -# Currently we have the same set of training parameters for all configurations. -# We should see if we need to split this for different architecture sizes. - -PREPROCESSING_CONFIG = { - "max_seq_length": 512, - "max_predictions_per_seq": 76, - "dupe_factor": 10, - "masked_lm_prob": 0.15, - "short_seq_prob": 0.1, -} - -TRAINING_CONFIG = { - "batch_size": 256, - "epochs": 10, - "learning_rate": 1e-4, - "num_train_steps": 1_000_000, - # Percentage of training steps used for learning rate warmup. - "warmup_percentage": 0.1, -} diff --git a/examples/bert_pretraining/bert_create_pretraining_data.py b/examples/bert_pretraining/bert_create_pretraining_data.py deleted file mode 100644 index e0edeb7b08..0000000000 --- a/examples/bert_pretraining/bert_create_pretraining_data.py +++ /dev/null @@ -1,512 +0,0 @@ -# Copyright 2024 The KerasNLP Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Create masked LM/next sentence masked_lm TF examples for BERT. 
- -This script will create TFRecord files containing BERT training examples with -both word masking and next sentence prediction. - -This script will load the entire dataset into memory to set up the next sentence -prediction task, so it is recommended to run this on shards of data at a time to -avoid memory issues. - -By default, it will duplicate the input data 10 times with different masks and -sentence pairs, as in the original paper. So a 20GB source of Wikipedia and -BooksCorpus will result in a 400GB dataset. - -This script is adapted from the original BERT repository: -https://github.com/google-research/bert/blob/master/create_pretraining_data.py - -Usage: -python bert_create_pretraining_data.py \ - --input_files ~/datasets/bert-sentence-split-data/shard_0.txt \ - --output_file ~/datasets/bert-pretraining-data/shard_0.tfrecord \ - --vocab_file vocab.txt -""" - -import collections -import os -import random -import sys - -import tensorflow as tf -import tensorflow_text as tf_text -from absl import app -from absl import flags - -from examples.bert_pretraining.bert_config import PREPROCESSING_CONFIG -from examples.utils.scripting_utils import list_filenames_for_arg - -# Tokenization will happen with TensorFlow and can easily OOM a GPU. -# Restrict the script to run on CPU, as GPU will not offer a speedup here anyway. -os.environ["CUDA_VISIBLE_DEVICES"] = "-1" - -FLAGS = flags.FLAGS - -flags.DEFINE_string( - "input_files", - None, - "Comma-separated list of directories, globs or files.", -) - -flags.DEFINE_string( - "output_file", - None, - "Output TF record file.", -) - -flags.DEFINE_string( - "vocab_file", - None, - "The vocabulary file for tokenization.", -) - -flags.DEFINE_bool( - "do_lower_case", - True, - "Whether to lower case the input text.", -) - -flags.DEFINE_integer( - "random_seed", - 12345, - "Random seed for data generation.", -) - - -def convert_to_unicode(text): - """Converts text to Unicode if it's not already, assuming utf-8 input.""" - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - - -def printable_text(text): - """Returns text encoded in a way suitable for print.""" - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - - -# This tuple holds a complete training instance of data ready for serialization.
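# Field meanings: `tokens` are the wordpiece strings for the packed
# [CLS] A [SEP] B [SEP] pair, `segment_ids` mark each token as segment A (0) or
# B (1), `is_random_next` is the next-sentence-prediction label, and
# `masked_lm_positions`/`masked_lm_labels` record which positions were masked
# and the original tokens at those positions.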
-TrainingInstance = collections.namedtuple( - "TrainingInstance", - [ - "tokens", - "segment_ids", - "is_random_next", - "masked_lm_positions", - "masked_lm_labels", - ], -) - - -def write_instance_to_example_files( - instances, vocab, max_seq_length, max_predictions_per_seq, output_filename -): - """Create TF example files from `TrainingInstance`s.""" - writer = tf.io.TFRecordWriter(output_filename) - total_written = 0 - lookup = dict(zip(vocab, range(len(vocab)))) - for inst_index, instance in enumerate(instances): - token_ids = [lookup[x] for x in instance.tokens] - padding_mask = [1] * len(token_ids) - segment_ids = list(instance.segment_ids) - assert len(token_ids) <= max_seq_length - - while len(token_ids) < max_seq_length: - token_ids.append(0) - padding_mask.append(0) - segment_ids.append(0) - - assert len(token_ids) == max_seq_length - assert len(padding_mask) == max_seq_length - assert len(segment_ids) == max_seq_length - - masked_lm_positions = list(instance.masked_lm_positions) - masked_lm_ids = [lookup[x] for x in instance.masked_lm_labels] - masked_lm_weights = [1.0] * len(masked_lm_ids) - - while len(masked_lm_positions) < max_predictions_per_seq: - masked_lm_positions.append(0) - masked_lm_ids.append(0) - masked_lm_weights.append(0.0) - - next_sentence_label = 1 if instance.is_random_next else 0 - - features = collections.OrderedDict() - features["token_ids"] = int_feature(token_ids) - features["padding_mask"] = int_feature(padding_mask) - features["segment_ids"] = int_feature(segment_ids) - features["masked_lm_positions"] = int_feature(masked_lm_positions) - features["masked_lm_ids"] = int_feature(masked_lm_ids) - features["masked_lm_weights"] = float_feature(masked_lm_weights) - features["next_sentence_labels"] = int_feature([next_sentence_label]) - - tf_example = tf.train.Example( - features=tf.train.Features(feature=features) - ) - - writer.write(tf_example.SerializeToString()) - total_written += 1 - - writer.close() - print(f"Wrote {total_written} total instances") - - -def int_feature(values): - return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) - - -def float_feature(values): - return tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) - - -def create_training_instances( - input_filenames, - tokenizer, - vocab, - max_seq_length, - dupe_factor, - short_seq_prob, - masked_lm_prob, - max_predictions_per_seq, - rng, -): - """Create `TrainingInstance`s from raw text.""" - # Input file format: - # (1) One sentence per line. These should ideally be actual sentences, not - # entire paragraphs or arbitrary spans of text. (Because we use the - # sentence boundaries for the "next sentence prediction" task). - # (2) Blank lines between documents. Document boundaries are needed so - # that the "next sentence prediction" task doesn't span between documents. 
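# Note: each line is tokenized with the TF Text BertTokenizer below; a blank
# line tokenizes to an empty tensor, which is what the `line.size == 0` check
# uses to close out the current document.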
- dataset = tf.data.TextLineDataset(input_filenames) - dataset = dataset.map( - lambda x: tokenizer.tokenize(x).flat_values, - num_parallel_calls=tf.data.AUTOTUNE, - ) - all_documents = [] - current_document = [] - for line in dataset.as_numpy_iterator(): - if line.size == 0 and current_document: - all_documents.append(current_document) - current_document = [] - else: - line = [x.decode("utf-8") for x in line] - if line: - current_document.append(line) - rng.shuffle(all_documents) - - instances = [] - for _ in range(dupe_factor): - for document_index in range(len(all_documents)): - instances.extend( - create_instances_from_document( - all_documents, - document_index, - max_seq_length, - short_seq_prob, - masked_lm_prob, - max_predictions_per_seq, - vocab, - rng, - ) - ) - rng.shuffle(instances) - return instances - - -def create_instances_from_document( - all_documents, - document_index, - max_seq_length, - short_seq_prob, - masked_lm_prob, - max_predictions_per_seq, - vocab_words, - rng, -): - """Creates `TrainingInstance`s for a single document.""" - document = all_documents[document_index] - - # Account for [CLS], [SEP], [SEP] - max_num_tokens = max_seq_length - 3 - - # We *usually* want to fill up the entire sequence since we are padding - # to `max_seq_length` anyways, so short sequences are generally wasted - # computation. However, we *sometimes* - # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter - # sequences to minimize the mismatch between pre-training and fine-tuning. - # The `target_seq_length` is just a rough target however, whereas - # `max_seq_length` is a hard limit. - target_seq_length = max_num_tokens - if rng.random() < short_seq_prob: - target_seq_length = rng.randint(2, max_num_tokens) - - # We DON'T just concatenate all of the tokens from a document into a long - # sequence and choose an arbitrary split point because this would make the - # next sentence prediction task too easy. Instead, we split the input into - # segments "A" and "B" based on the actual "sentences" provided by the user - # input. - instances = [] - current_chunk = [] - current_length = 0 - i = 0 - while i < len(document): - segment = document[i] - current_chunk.append(segment) - current_length += len(segment) - if i == len(document) - 1 or current_length >= target_seq_length: - if current_chunk: - # `a_end` is how many segments from `current_chunk` go into the - # `A` (first) sentence. - a_end = 1 - if len(current_chunk) >= 2: - a_end = rng.randint(1, len(current_chunk) - 1) - - tokens_a = [] - for j in range(a_end): - tokens_a.extend(current_chunk[j]) - - tokens_b = [] - # Random next - is_random_next = False - if len(current_chunk) == 1 or rng.random() < 0.5: - is_random_next = True - target_b_length = target_seq_length - len(tokens_a) - - # This should rarely go for more than one iteration for - # large corpora. However, just to be careful, we try to make - # sure that the random document is not the same as the - # document we're processing. - for _ in range(10): - random_document_index = rng.randint( - 0, len(all_documents) - 1 - ) - if random_document_index != document_index: - break - - random_document = all_documents[random_document_index] - random_start = rng.randint(0, len(random_document) - 1) - for j in range(random_start, len(random_document)): - tokens_b.extend(random_document[j]) - if len(tokens_b) >= target_b_length: - break - # We didn't actually use these segments so we "put them - # back" so they don't go to waste. 
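# Rewinding `i` by the number of unused segments makes the enclosing while
# loop revisit those sentences as the start of the next chunk.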
- num_unused_segments = len(current_chunk) - a_end - i -= num_unused_segments - # Actual next - else: - is_random_next = False - for j in range(a_end, len(current_chunk)): - tokens_b.extend(current_chunk[j]) - truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) - - assert len(tokens_a) >= 1 - assert len(tokens_b) >= 1 - - tokens = [] - segment_ids = [] - tokens.append("[CLS]") - segment_ids.append(0) - for token in tokens_a: - tokens.append(token) - segment_ids.append(0) - - tokens.append("[SEP]") - segment_ids.append(0) - - for token in tokens_b: - tokens.append(token) - segment_ids.append(1) - tokens.append("[SEP]") - segment_ids.append(1) - - ( - tokens, - masked_lm_positions, - masked_lm_labels, - ) = create_masked_lm_predictions( - tokens, - masked_lm_prob, - max_predictions_per_seq, - vocab_words, - rng, - ) - instance = TrainingInstance( - tokens=tokens, - segment_ids=segment_ids, - is_random_next=is_random_next, - masked_lm_positions=masked_lm_positions, - masked_lm_labels=masked_lm_labels, - ) - instances.append(instance) - current_chunk = [] - current_length = 0 - i += 1 - - return instances - - -MaskedLmInstance = collections.namedtuple( - "MaskedLmInstance", ["index", "label"] -) - - -def create_masked_lm_predictions( - tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng -): - """Creates the predictions for the masked LM objective.""" - - # TODO(jbischof): replace with keras_nlp.layers.MaskedLMMaskGenerator - # (Issue #166) - - cand_indexes = [] - for i, token in enumerate(tokens): - if token == "[CLS]" or token == "[SEP]": - continue - cand_indexes.append([i]) - - rng.shuffle(cand_indexes) - - output_tokens = list(tokens) - - num_to_predict = min( - max_predictions_per_seq, - max(1, int(round(len(tokens) * masked_lm_prob))), - ) - - masked_lms = [] - covered_indexes = set() - for index_set in cand_indexes: - if len(masked_lms) >= num_to_predict: - break - # If adding a whole-word mask would exceed the maximum number of - # predictions, then just skip this candidate. 
- if len(masked_lms) + len(index_set) > num_to_predict: - continue - is_any_index_covered = False - for index in index_set: - if index in covered_indexes: - is_any_index_covered = True - break - if is_any_index_covered: - continue - for index in index_set: - covered_indexes.add(index) - - masked_token = None - # 80% of the time, replace with [MASK] - if rng.random() < 0.8: - masked_token = "[MASK]" - else: - # 10% of the time, keep original - if rng.random() < 0.5: - masked_token = tokens[index] - # 10% of the time, replace with random word - else: - masked_token = vocab_words[ - rng.randint(0, len(vocab_words) - 1) - ] - - output_tokens[index] = masked_token - - masked_lms.append( - MaskedLmInstance(index=index, label=tokens[index]) - ) - assert len(masked_lms) <= num_to_predict - masked_lms = sorted(masked_lms, key=lambda x: x.index) - - masked_lm_positions = [] - masked_lm_labels = [] - for p in masked_lms: - masked_lm_positions.append(p.index) - masked_lm_labels.append(p.label) - - return (output_tokens, masked_lm_positions, masked_lm_labels) - - -def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): - """Truncates a pair of sequences to a maximum sequence length.""" - while True: - total_length = len(tokens_a) + len(tokens_b) - if total_length <= max_num_tokens: - break - - trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b - assert len(trunc_tokens) >= 1 - - # We want to sometimes truncate from the front and sometimes from the - # back to add more randomness and avoid biases. - if rng.random() < 0.5: - del trunc_tokens[0] - else: - trunc_tokens.pop() - - -def main(_): - print(f"Reading input data from {FLAGS.input_files}") - input_filenames = list_filenames_for_arg(FLAGS.input_files) - if not input_filenames: - print("No input files found. Check `input_files` flag.") - sys.exit(1) - - # Load the vocabulary. - vocab = [] - with open(FLAGS.vocab_file, "r") as vocab_file: - for line in vocab_file: - vocab.append(line.strip()) - tokenizer = tf_text.BertTokenizer( - FLAGS.vocab_file, - lower_case=FLAGS.do_lower_case, - token_out_type=tf.string, - ) - - rng = random.Random(FLAGS.random_seed) - instances = create_training_instances( - input_filenames, - tokenizer, - vocab, - PREPROCESSING_CONFIG["max_seq_length"], - PREPROCESSING_CONFIG["dupe_factor"], - PREPROCESSING_CONFIG["short_seq_prob"], - PREPROCESSING_CONFIG["masked_lm_prob"], - PREPROCESSING_CONFIG["max_predictions_per_seq"], - rng, - ) - - print(f"Outputting to {FLAGS.output_file}.") - output_directory = os.path.dirname(FLAGS.output_file) - if not os.path.exists(output_directory): - os.mkdir(output_directory) - write_instance_to_example_files( - instances, - vocab, - PREPROCESSING_CONFIG["max_seq_length"], - PREPROCESSING_CONFIG["max_predictions_per_seq"], - FLAGS.output_file, - ) - - -if __name__ == "__main__": - flags.mark_flag_as_required("input_files") - flags.mark_flag_as_required("output_file") - flags.mark_flag_as_required("vocab_file") - app.run(main) diff --git a/examples/bert_pretraining/bert_pretrain.py b/examples/bert_pretraining/bert_pretrain.py deleted file mode 100644 index 95e2b77d4b..0000000000 --- a/examples/bert_pretraining/bert_pretrain.py +++ /dev/null @@ -1,457 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import datetime -import sys - -import tensorflow as tf -from absl import app -from absl import flags -from absl import logging -from tensorflow import keras - -import keras_nlp -from examples.bert_pretraining.bert_config import MODEL_CONFIGS -from examples.bert_pretraining.bert_config import PREPROCESSING_CONFIG -from examples.bert_pretraining.bert_config import TRAINING_CONFIG - -FLAGS = flags.FLAGS - -flags.DEFINE_string( - "input_directory", - None, - "The directory of training data. It can be a local disk path, or the URL " - "of a Google Cloud Storage bucket.", -) - -flags.DEFINE_string( - "saved_model_output", - None, - "Output directory to save the model to.", -) - -flags.DEFINE_string( - "checkpoint_save_directory", - None, - "Output directory to save checkpoints to.", -) - -flags.DEFINE_bool( - "skip_restore", - False, - "Skip restoring from checkpoint if True.", -) - -flags.DEFINE_string( - "tpu_name", - None, - "The TPU to connect to. If None, TPU will not be used.", -) - -flags.DEFINE_bool( - "enable_cloud_logging", - False, - "If True, the script will use cloud logging.", -) - -flags.DEFINE_string( - "tensorboard_log_path", - None, - "The path to save tensorboard log to.", -) - -flags.DEFINE_string( - "model_size", - "tiny", - "One of: tiny, mini, small, medium, base, or large.", -) - -flags.DEFINE_string( - "vocab_file", - None, - "The vocabulary file for tokenization.", -) - -flags.DEFINE_integer( - "num_train_steps", - None, - "Override the pre-configured number of train steps.", -) - - -class MaskedLMHead(keras.layers.Layer): - """Masked language model network head for BERT. - - This layer implements a masked language model on top of the embedding table - of a transformer based encoder (for `keras_nlp.models.BertBackbone`, this is - `encoder.token_embedding.embeddings`). - - Example: - ```python - encoder = keras_nlp.models.BertBackbone( - vocabulary_size=30552, - num_layers=12, - num_heads=12, - hidden_dim=768, - intermediate_dim=3072, - max_sequence_length=12, - ) - lm_layer = MaskedLMHead(embedding_table=encoder.token_embedding.embeddings) - ``` - - Args: - embedding_table: The embedding table from the encoder network. - intermediate_activation: The activation, if any, for the inner dense - layer. - initializer: The initializer for the dense layer. Defaults to a Glorot - uniform initializer.
- """ - - def __init__( - self, - embedding_table, - intermediate_activation="gelu", - initializer="glorot_uniform", - **kwargs, - ): - super().__init__(**kwargs) - self.embedding_table = embedding_table - self.intermediate_activation = keras.activations.get( - intermediate_activation - ) - self.initializer = initializer - - def build(self, input_shape): - self._vocab_size, hidden_dim = self.embedding_table.shape - self.dense = keras.layers.Dense( - hidden_dim, - activation=self.intermediate_activation, - kernel_initializer=self.initializer, - name="transform/dense", - ) - self.layer_norm = keras.layers.LayerNormalization( - axis=-1, epsilon=1e-12, name="transform/LayerNorm" - ) - self.bias = self.add_weight( - name="output_bias/bias", - shape=(self._vocab_size,), - initializer="zeros", - trainable=True, - ) - - super().build(input_shape) - - def call(self, sequence_data, masked_positions): - masked_lm_input = self._gather_indexes(sequence_data, masked_positions) - lm_data = self.dense(masked_lm_input) - lm_data = self.layer_norm(lm_data) - lm_data = tf.matmul(lm_data, self.embedding_table, transpose_b=True) - logits = tf.nn.bias_add(lm_data, self.bias) - masked_positions_length = ( - masked_positions.shape.as_list()[1] or tf.shape(masked_positions)[1] - ) - return tf.reshape( - logits, [-1, masked_positions_length, self._vocab_size] - ) - - def _gather_indexes(self, sequence_tensor, positions): - """Gathers the vectors at the specific positions, for performance. - - Args: - sequence_tensor: Sequence output of shape - (`batch_size`, `seq_length`, `hidden_dim`) where `hidden_dim` - is number of hidden units. - positions: Positions ids of tokens in sequence to mask for - pretraining of with dimension (batch_size, num_predictions) - where `num_predictions` is maximum number of tokens to mask out - and predict per each sequence. - - Returns: - Masked out sequence tensor of shape (batch_size * num_predictions, - `hidden_dim`). 
- """ - sequence_shape = tf.shape(sequence_tensor) - batch_size, seq_length = sequence_shape[0], sequence_shape[1] - width = sequence_tensor.shape.as_list()[2] or sequence_shape[2] - - flat_offsets = tf.reshape( - tf.range(0, batch_size, dtype="int32") * seq_length, [-1, 1] - ) - flat_positions = tf.reshape(positions + flat_offsets, [-1]) - flat_sequence_tensor = tf.reshape( - sequence_tensor, [batch_size * seq_length, width] - ) - output_tensor = tf.gather(flat_sequence_tensor, flat_positions) - - return output_tensor - - -class BertPretrainingModel(keras.Model): - """MaskedLM + NSP model with Bert encoder.""" - - def __init__(self, encoder, **kwargs): - super().__init__(**kwargs) - self.encoder = encoder - # TODO(jbischof): replace with keras_nlp.layers.MaskedLMHead (Issue #166) - self.masked_lm_head = MaskedLMHead( - embedding_table=encoder.token_embedding.embeddings, - initializer=keras.initializers.TruncatedNormal(stddev=0.02), - name="mlm_layer", - ) - self.next_sentence_head = keras.layers.Dense( - encoder.num_segments, - kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02), - name="nsp_layer", - ) - - def call(self, data): - encoder_output = self.encoder( - { - "token_ids": data["token_ids"], - "segment_ids": data["segment_ids"], - "padding_mask": data["padding_mask"], - } - ) - sequence_output, pooled_output = ( - encoder_output["sequence_output"], - encoder_output["pooled_output"], - ) - lm_preds = self.masked_lm_head( - sequence_output, data["masked_lm_positions"] - ) - nsp_preds = self.next_sentence_head(pooled_output) - return {"mlm": lm_preds, "nsp": nsp_preds} - - -class LinearDecayWithWarmup(keras.optimizers.schedules.LearningRateSchedule): - """ - A learning rate schedule with linear warmup and decay. - - This schedule implements a linear warmup for the first `num_warmup_steps` - and a linear ramp down until `num_train_steps`. - """ - - def __init__(self, learning_rate, num_warmup_steps, num_train_steps): - self.learning_rate = learning_rate - self.warmup_steps = num_warmup_steps - self.train_steps = num_train_steps - - def __call__(self, step): - peak_lr = tf.cast(self.learning_rate, dtype="float32") - warmup = tf.cast(self.warmup_steps, dtype="float32") - training = tf.cast(self.train_steps, dtype="float32") - step = tf.cast(step, dtype="float32") - - is_warmup = step < warmup - - # Linear Warmup will be implemented if current step is less than - # `num_warmup_steps` else Linear Decay will be implemented. 
- return tf.cond( - is_warmup, - lambda: peak_lr * (step / warmup), - lambda: tf.math.maximum( - 0.0, peak_lr * (training - step) / (training - warmup) - ), - ) - - def get_config(self): - return { - "learning_rate": self.learning_rate, - "num_warmup_steps": self.warmup_steps, - "num_train_steps": self.train_steps, - } - - -def decode_record(record): - """Decodes a record to a TensorFlow example.""" - seq_length = PREPROCESSING_CONFIG["max_seq_length"] - lm_length = PREPROCESSING_CONFIG["max_predictions_per_seq"] - name_to_features = { - "token_ids": tf.io.FixedLenFeature([seq_length], "int64"), - "padding_mask": tf.io.FixedLenFeature([seq_length], "int64"), - "segment_ids": tf.io.FixedLenFeature([seq_length], "int64"), - "masked_lm_positions": tf.io.FixedLenFeature([lm_length], "int64"), - "masked_lm_ids": tf.io.FixedLenFeature([lm_length], "int64"), - "masked_lm_weights": tf.io.FixedLenFeature([lm_length], "float32"), - "next_sentence_labels": tf.io.FixedLenFeature([1], "int64"), - } - # tf.Example only supports "int64", but the TPU only supports "int32". - # So cast all int64 to int32. - example = tf.io.parse_single_example(record, name_to_features) - for name in list(example.keys()): - value = example[name] - if value.dtype == "int64": - value = tf.cast(value, "int32") - example[name] = value - - inputs = { - "token_ids": example["token_ids"], - "padding_mask": example["padding_mask"], - "segment_ids": example["segment_ids"], - "masked_lm_positions": example["masked_lm_positions"], - } - labels = { - "mlm": example["masked_lm_ids"], - "nsp": example["next_sentence_labels"], - } - sample_weights = {"mlm": example["masked_lm_weights"], "nsp": tf.ones((1,))} - sample = (inputs, labels, sample_weights) - return sample - - -def get_checkpoint_callback(): - if tf.io.gfile.exists(FLAGS.checkpoint_save_directory): - if not tf.io.gfile.isdir(FLAGS.checkpoint_save_directory): - raise ValueError( - "`checkpoint_save_directory` should be a directory, " - f"but {FLAGS.checkpoint_save_directory} is not a " - "directory. Please set `checkpoint_save_directory` as " - "a directory." - ) - - elif FLAGS.skip_restore: - # Clear up the directory if users want to skip restoring. - tf.io.gfile.rmtree(FLAGS.checkpoint_save_directory) - checkpoint_path = FLAGS.checkpoint_save_directory - return keras.callbacks.BackupAndRestore( - backup_dir=checkpoint_path, - ) - - -def get_tensorboard_callback(): - timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") - log_dir = FLAGS.tensorboard_log_path + timestamp - return keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1) - - -def main(_): - if FLAGS.enable_cloud_logging: - # If the job is on cloud, we will use cloud logging. - import google.cloud.logging - - keras.utils.disable_interactive_logging() - client = google.cloud.logging.Client() - client.setup_logging() - - logging.info(f"Reading input data from {FLAGS.input_directory}") - if not tf.io.gfile.isdir(FLAGS.input_directory): - raise ValueError( - "`input_directory` should be a directory, " - f"but {FLAGS.input_directory} is not a directory. Please " - "set `input_directory` flag as a directory." - ) - files = tf.io.gfile.listdir(FLAGS.input_directory) - input_filenames = [FLAGS.input_directory + "/" + file for file in files] - - if not input_filenames: - logging.info("No input files found. 
Check `input_directory` flag.") - sys.exit(1) - - vocab = [] - with tf.io.gfile.GFile(FLAGS.vocab_file) as vocab_file: - for line in vocab_file: - vocab.append(line.strip()) - - model_config = MODEL_CONFIGS[FLAGS.model_size] - - if FLAGS.tpu_name is None: - # Use default strategy if not using TPU. - strategy = tf.distribute.get_strategy() - else: - # Connect to TPU and create TPU strategy. - resolver = tf.distribute.cluster_resolver.TPUClusterResolver.connect( - tpu=FLAGS.tpu_name - ) - strategy = tf.distribute.TPUStrategy(resolver) - - # Decode and batch data. - dataset = tf.data.TFRecordDataset(input_filenames) - dataset = dataset.map( - lambda record: decode_record(record), - num_parallel_calls=tf.data.experimental.AUTOTUNE, - ) - dataset = dataset.batch(TRAINING_CONFIG["batch_size"], drop_remainder=True) - dataset = dataset.repeat() - - with strategy.scope(): - # Create a Bert model the input config. - encoder = keras_nlp.models.BertBackbone( - vocabulary_size=len(vocab), **model_config - ) - # Make sure model has been called. - encoder(encoder.inputs) - encoder.summary() - - # Allow overriding train steps from the command line for quick testing. - if FLAGS.num_train_steps is not None: - num_train_steps = FLAGS.num_train_steps - else: - num_train_steps = TRAINING_CONFIG["num_train_steps"] - num_warmup_steps = int( - num_train_steps * TRAINING_CONFIG["warmup_percentage"] - ) - learning_rate_schedule = LinearDecayWithWarmup( - learning_rate=TRAINING_CONFIG["learning_rate"], - num_warmup_steps=num_warmup_steps, - num_train_steps=num_train_steps, - ) - optimizer = keras.optimizers.Adam(learning_rate=learning_rate_schedule) - - lm_loss = keras.losses.SparseCategoricalCrossentropy( - from_logits=True, - name="lm_loss", - ) - nsp_loss = keras.losses.SparseCategoricalCrossentropy( - from_logits=True, - name="nsp_loss", - ) - - lm_accuracy = keras.metrics.SparseCategoricalAccuracy(name="accuracy") - nsp_accuracy = keras.metrics.SparseCategoricalAccuracy(name="accuracy") - - pretraining_model = BertPretrainingModel(encoder) - pretraining_model.compile( - optimizer=optimizer, - loss={"mlm": lm_loss, "nsp": nsp_loss}, - weighted_metrics={"mlm": lm_accuracy, "nsp": nsp_accuracy}, - ) - - epochs = TRAINING_CONFIG["epochs"] - steps_per_epoch = num_train_steps // epochs - - callbacks = [] - if FLAGS.checkpoint_save_directory: - callbacks.append(get_checkpoint_callback()) - if FLAGS.tensorboard_log_path: - callbacks.append(get_tensorboard_callback()) - - pretraining_model.fit( - dataset, - epochs=epochs, - steps_per_epoch=steps_per_epoch, - callbacks=callbacks, - ) - - model_path = FLAGS.saved_model_output - logging.info(f"Saving to {FLAGS.saved_model_output}") - encoder.save(model_path) - - -if __name__ == "__main__": - flags.mark_flag_as_required("input_directory") - flags.mark_flag_as_required("vocab_file") - flags.mark_flag_as_required("saved_model_output") - app.run(main) diff --git a/examples/bert_pretraining/requirements.txt b/examples/bert_pretraining/requirements.txt deleted file mode 100644 index 3d0b35fbc8..0000000000 --- a/examples/bert_pretraining/requirements.txt +++ /dev/null @@ -1,2 +0,0 @@ -nltk -wikiextractor diff --git a/examples/glue_benchmark/README.md b/examples/glue_benchmark/README.md deleted file mode 100644 index 4dde0b93d7..0000000000 --- a/examples/glue_benchmark/README.md +++ /dev/null @@ -1,117 +0,0 @@ -# GLUE Finetuning Script - -This script is written to help you evaluate your model on GLUE benchmarking. 
It provides the following functionality: - -- Load and preprocess GLUE data. -- Finetune your Keras text classification model. -- Generate GLUE submission files. - -To use the script, you need to change the code to load your pretrained model, -and run the command below: - -```shell -python glue.py --task_name="mrpc" --batch_size=32 \ - --submission_directory="glue_submissions/" -``` - -By default, the script finetunes on the tiniest BERT model we have available -(this will be fast but not top performing). - -To make a real GLUE leaderboard submission, you need to run the finetuning on -all tasks, then enter the submission directory and zip the submission files: -```shell -for task in cola sst2 mrpc rte stsb qnli qqp; do - python glue.py --task_name="$task" --submission_directory="glue_submissions/" -done - -python glue.py --task_name="mnli_matched" \ - --submission_directory="glue_submissions/" \ - --save_finetuning_model="saved/mnli" - -python glue.py --task_name="mnli_mismatched" \ - --submission_directory="glue_submissions/" \ - --load_finetuning_model="saved/mnli" - -python glue.py --task_name="ax" \ - --submission_directory="glue_submissions/" \ - --load_finetuning_model="saved/mnli" - -cd glue_submissions -zip -r submission.zip *.tsv -``` - -Please note that `mnli_matched`, `mnli_mismatched` and `ax` share the same -training set, so we only train once on `mnli_matched` and use the saved model -to evaluate on `mnli_mismatched` and `ax`. - -A GLUE submission requires that `submission.zip` contain a `.tsv` file for all -tasks, otherwise the submission will fail. An empty `.tsv` will also fail -because its content is checked. If you only want to evaluate on certain tasks, -you can download the sample submission, and put the `.tsv` files for the tasks you -don't run inside your submission file. For example, if you don't want to -run the `ax` task, you can do: - -``` -curl -O https://gluebenchmark.com/assets/CBOW.zip -unzip CBOW.zip -d sample_submissions -cp sample_submissions/AX.tsv glue_submissions -``` - -## How to Use the Script - -To use this script on your model, you need to do 3 things: - -1. Implement your custom preprocessing in `preprocess_fn()`. -2. Load your pretrained model. -3. Make the finetuning model from your model. - -Code needing customization is wrapped between the comments -`Custom code block starts` and -`Custom code block ends`. See instructions on each step below. - -### Custom Preprocessing - -In all GLUE datasets, each record comes with one or two sentences as features, -and one label. In the script, we load the GLUE dataset in the format -`(features, labels)`, where `features` is a tuple of either 1 sentence or 2 -sentences. You need to write custom preprocessing logic to convert the data -to the input format required by your model. For example, the current script -(which finetunes KerasNLP BERT) does the following: - -```python -bert_preprocessor = keras_nlp.models.BertPreprocessor.from_preset( - "bert_tiny_en_uncased" -) -def preprocess_fn(feature, label): - return bert_preprocessor(feature), label -``` -It uses `BertPreprocessor` to convert the input features to the format the -model expects. - -### Load Pretrained Model - -As long as your pretrained model is a Keras model, you can use it with this script. - -### Make the Finetuning Model - -You need to make a classification model based on your pretrained model for -evaluation purposes.
For example, [`BertClassifier`](https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/bert/bert_classifier.py) takes a `Bert` model as backbone, -and adds a dense layer on top of it. Please pay attention that different model -could use different classifier structure, e.g., in [RoBERTa](https://github.com/huggingface/transformers/blob/94b3f544a1f5e04b78d87a2ae32a7ac252e22e31/src/transformers/models/roberta/modeling_roberta.py#L1437-L1456), -it has 2 dense layers. If you are using pretrained model from an OSS package, -please find the correct classifier. If you use a custom model, you can start -experimenting with a simple dense layer, and adjust the structure based on -its performance. - -## Flags Table - -| Flags Name | Explanation | Default | -|---------------------------- |------------------------------------------------- |--------- | -| task_name | The name of the GLUE task to finetune on. | "mrpc" | -| batch_size | Data batch size | 32 | -| epochs | Number of epochs to run finetuning. | 2 | -| learning_rate | The optimizer's learning rate. | 5e-5 | -| tpu_name | The name of TPU to connect to. | None | -| submission_directory | The file path to save the glue submission file. | None | -| load_finetuning_model | The path to load the finetuning model. | None | -| save_finetuning_model | The path to save the finetuning model. | None | diff --git a/examples/glue_benchmark/scores.md b/examples/glue_benchmark/scores.md deleted file mode 100644 index 4435a29911..0000000000 --- a/examples/glue_benchmark/scores.md +++ /dev/null @@ -1,141 +0,0 @@ -# GLUE Benchmark Score on KerasNLP Pretrained Models - -We use `glue.py` to test out KerasNLP pretrained models, and report scores in -this doc. Our goal is to quickly verify our model's performance instead of -searching for the best hyperparameters, so the reported score can be a little -worse than reported by the original paper. - -Unless specifically noted, hyperparameter settings are the same across all GLUE -tasks. - -## BERT - -Test target is `keras_nlp.models.BertClassifier()`. WNLI is skipped because it -was not evaluated at the original paper. - -### Hyperparameter Settings - -- Learning Rate: - We use a `PolynomialDecay` learning rate, with `initial_learning_rate=5e-5`. - ```python - lr = tf.keras.optimizers.schedules.PolynomialDecay( - 5e-5, - decay_steps={total_training_steps}, - end_learning_rate=0.0, - ) - ``` -- Optimizer: - We use `AdamW` optimizer, and exclude `bias` and variables in - `LayerNormalization` from weight decay. - - ```python - optimizer = tf.keras.optimizers.experimental.AdamW( - lr, weight_decay=0.01, global_clipnorm=1.0 - ) - optimizer.exclude_from_weight_decay( - var_names=["LayerNorm", "layer_norm", "bias"] - ) - ``` -- Others: - | Hyperparameter Name | Value | - |---------------------|-------| - | batch_size | 32 | - | epochs | 3 | - | dropout | 0.1 | - -### Benchmark Score - -| Task Name | Metrics | Score | -|-----------|-----------------------|-----------| -| CoLA | Matthew's Corr | 52.2 | -| SST-2 | Accuracy | 93.5 | -| MRPC | F1 / Accuracy | 88.2/83.9 | -| STSB | Pearson-Spearman Corr | 84.5/83.1 | -| QQP | F1 / Accuracy | 71.3/89.3 | -| MNLI_M | Accuracy | 84.3 | -| MNLI_Mis | Accuracy | 83.3 | -| QNLI | Accuracy | 90.4 | -| RTE | Accuracy | 66.7 | -| AX | Matthew's Corr | 34.8 | - -See the actual submission in this [link](https://gluebenchmark.com/submission/gnG9xUQGkjfVq6loRQYKTcM1YjG3/-NIe3Owl8pjHLXpistkI). - -## RoBERTa - -Test target is `keras_nlp.models.RobertaClassifier()`. 
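For context, the classifier under test can be built directly from a KerasNLP preset. The sketch below is illustrative only: the `roberta_base_en` preset name and the plain Adam optimizer are assumptions, and `glue.py` wires the classifier into its own preprocessing and training setup described in the hyperparameter settings that follow.

```python
import keras_nlp
from tensorflow import keras

# Build a two-class RoBERTa classifier (e.g. for MRPC) from a preset.
# "roberta_base_en" is an assumed preset name for this sketch.
classifier = keras_nlp.models.RobertaClassifier.from_preset(
    "roberta_base_en",
    num_classes=2,
)

# The classifier outputs logits, so compile with `from_logits=True`.
classifier.compile(
    optimizer=keras.optimizers.Adam(2e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
```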
- -### Hyperparameter Settings - -#### WNLI - -We choose a special setting for WNLI from other tasks. - -- Learning Rate: - We use a `PolynomialDecay` learning rate, with `initial_learning_rate=2e-5`. - ```python - lr = tf.keras.optimizers.schedules.PolynomialDecay( - 2e-5, - decay_steps={total_training_steps}, - end_learning_rate=0.0, - ) - ``` -- Optimizer: - We use `Adam` optimizer. - - ```python - optimizer = tf.keras.optimizers.Adam(lr) - ``` -- Others: - | Hyperparameter Name | Value | - |---------------------|-------| - | batch_size | 32 | - | epochs | 10 | - | dropout | 0.1 | - -#### Other GLUE Tasks - -- Learning Rate: - We use a `PolynomialDecay` learning rate, with `initial_learning_rate=2e-5`. - ```python - lr = tf.keras.optimizers.schedules.PolynomialDecay( - 2e-5, - decay_steps={total_training_steps}, - end_learning_rate=0.0, - ) - ``` -- Optimizer: - We use `AdamW` optimizer, and exclude `bias` and variables in - `LayerNormalization` from weight decay. - - ```python - optimizer = tf.keras.optimizers.experimental.AdamW( - lr, weight_decay=0.01, global_clipnorm=1.0 - ) - optimizer.exclude_from_weight_decay( - var_names=["LayerNorm", "layer_norm", "bias"] - ) - ``` -- Others: - | Hyperparameter Name | Value | - |---------------------|-------| - | batch_size | 32 | - | epochs | 3 | - | dropout | 0.1 | - -### Benchmark Score - -| Task Name | Metrics | Score | -|-----------|-----------------------|-----------| -| CoLA | Matthew's Corr | 56.3 | -| SST-2 | Accuracy | 96.1 | -| MRPC | F1 / Accuracy | 89.8/86.3 | -| STSB | Pearson-Spearman Corr | 88.4/87.7 | -| QQP | F1 / Accuracy | 72.3/89.0 | -| MNLI_M | Accuracy | 87.7 | -| MNLI_Mis | Accuracy | 87.1 | -| QNLI | Accuracy | 92.8 | -| RTE | Accuracy | 69.2 | -| WNLI | Accuracy | 65.1 | -| AX | Matthew's Corr | 40.6 | - -See the actual submission in this [link](https://gluebenchmark.com/submission/gnG9xUQGkjfVq6loRQYKTcM1YjG3/-NJS0XAX1o9p8DJst3wM). \ No newline at end of file diff --git a/examples/machine_translation/README.md b/examples/machine_translation/README.md deleted file mode 100644 index ac836f933a..0000000000 --- a/examples/machine_translation/README.md +++ /dev/null @@ -1,48 +0,0 @@ -# English-Spanish machine translation with keras-nlp - -This example will show how to train a Transformer-based machine translation -model using APIs provided by Keras-NLP. This instruction shows how to train the -model, and evaluate with customized English sentences. - -## Installing dependencies - -Pip dependencies for all keras-nlp examples are listed in `setup.py`. To install -both the keras-nlp library from source and all other dependencies required to -run the example, run the below command. You may want to install to a self -contained environment (e.g. a container or a virtualenv). - -```shell -pip install -e ".[examples]" -``` - -## Train the machine translation model and save to disk - -At the root directory of keras-nlp, run the following command: - -```shell -python ./examples/machine_translation/train.py \ - --num_epochs=3 \ - --saved_model_path="saved_models/machine_translation" -``` - -If it finishes successfully, you should see your console print out the -following information: -``` -Successfully saved model to saved_models/machine_translation. -``` - -## Running machine translation on customized inputs - -Once you have a model saved successfully, you can play around it via the -inference.py script. 
To run inference on customized inputs, please run the -following command: - -```shell -python ./examples/machine_translation/inference.py \ - --inputs="Have a nice day" \ - --saved_model_path="saved_models/machine_translation" -``` - -You can set the `inputs` value to any English sentence, or leave it unset, in -which case the script will run against some predefined English sentences. - diff --git a/examples/machine_translation/__init__.py b/examples/machine_translation/__init__.py deleted file mode 100644 index 3364a6bd16..0000000000 --- a/examples/machine_translation/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/machine_translation/data.py b/examples/machine_translation/data.py deleted file mode 100644 index 6015820eac..0000000000 --- a/examples/machine_translation/data.py +++ /dev/null @@ -1,148 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
-import pathlib -import random -import re -import string - -import tensorflow as tf -from tensorflow import keras - - -def download_data(): - text_file = keras.utils.get_file( - fname="spa-eng.zip", - origin=( - "http://storage.googleapis.com/download.tensorflow.org/data/" - + "spa-eng.zip" - ), - extract=True, - ) - return pathlib.Path(text_file).parent / "spa-eng" / "spa.txt" - - -def read_data(filepath): - with open(filepath) as f: - lines = f.read().split("\n")[:-1] - text_pairs = [] - for line in lines: - eng, spa = line.split("\t") - spa = "[start] " + spa + " [end]" - text_pairs.append((eng, spa)) - return text_pairs - - -def split_train_val_test(text_pairs): - random.shuffle(text_pairs) - num_val_samples = int(0.15 * len(text_pairs)) - num_train_samples = len(text_pairs) - 2 * num_val_samples - train_pairs = text_pairs[:num_train_samples] - val_end_index = num_train_samples + num_val_samples - val_pairs = text_pairs[num_train_samples:val_end_index] - test_pairs = text_pairs[val_end_index:] - return train_pairs, val_pairs, test_pairs - - -strip_chars = string.punctuation + "¿" -strip_chars = strip_chars.replace("[", "") -strip_chars = strip_chars.replace("]", "") - - -@keras.saving.register_keras_serializable() -def custom_standardization(input_string): - lowercase = tf.strings.lower(input_string) - return tf.strings.regex_replace( - lowercase, - "[%s]" % re.escape(strip_chars), - "", - ) - - -def prepare_tokenizer(train_pairs, sequence_length, vocab_size): - """Preapare English and Spanish tokenizer.""" - eng_tokenizer = keras.layers.TextVectorization( - max_tokens=vocab_size, - output_mode="int", - output_sequence_length=sequence_length, - ) - spa_tokenizer = keras.layers.TextVectorization( - max_tokens=vocab_size, - output_mode="int", - output_sequence_length=sequence_length + 1, - standardize=custom_standardization, - ) - eng_texts, spa_texts = zip(*train_pairs) - eng_tokenizer.adapt(eng_texts) - spa_tokenizer.adapt(spa_texts) - return eng_tokenizer, spa_tokenizer - - -def prepare_datasets(text_pairs, batch_size, eng_tokenizer, spa_tokenizer): - """Transform raw text pairs to tf datasets.""" - eng_texts, spa_texts = zip(*text_pairs) - eng_texts = list(eng_texts) - spa_texts = list(spa_texts) - - def format_dataset(eng, spa): - """Format the dataset given input English and Spanish text. - - The output format is: - x: a pair of English and Spanish sentence. - y: The Spanish sentence in x shifts 1 token towards right, because - we are predicting the next token. 
- """ - eng = eng_tokenizer(eng) - spa = spa_tokenizer(spa) - return ( - { - "encoder_inputs": eng, - "decoder_inputs": spa[:, :-1], - }, - spa[:, 1:], - tf.cast((spa[:, 1:] != 0), "float32"), # mask as sample weights - ) - - dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts)) - dataset = dataset.batch(batch_size) - dataset = dataset.map(format_dataset) - return dataset.shuffle(2048).prefetch(tf.data.AUTOTUNE).cache() - - -def get_dataset_and_tokenizer(sequence_length, vocab_size, batch_size): - """Main method to get the formatted machine translation dataset.""" - filepath = download_data() - text_pairs = read_data(filepath) - train_pairs, val_pairs, test_pairs = split_train_val_test(text_pairs) - eng_tokenizer, spa_tokenizer = prepare_tokenizer( - train_pairs, sequence_length, vocab_size - ) - train_ds = prepare_datasets( - train_pairs, - batch_size, - eng_tokenizer, - spa_tokenizer, - ) - val_ds = prepare_datasets( - val_pairs, - batch_size, - eng_tokenizer, - spa_tokenizer, - ) - test_ds = prepare_datasets( - test_pairs, - batch_size, - eng_tokenizer, - spa_tokenizer, - ) - return (train_ds, val_ds, test_ds), (eng_tokenizer, spa_tokenizer) diff --git a/examples/machine_translation/inference.py b/examples/machine_translation/inference.py deleted file mode 100644 index 5a3a1118e4..0000000000 --- a/examples/machine_translation/inference.py +++ /dev/null @@ -1,140 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import numpy as np -import tensorflow as tf -from absl import app -from absl import flags -from absl import logging -from tensorflow import keras - -# Import data module to include the customized serializable, required for -# loading tokenizer. -import examples.machine_translation.data # noqa: F401. - -FLAGS = flags.FLAGS - -flags.DEFINE_integer( - "sequence_length", - 20, - "Input and output sequence length.", -) - -flags.DEFINE_string( - "saved_model_path", - "saved_models/machine_translation_model", - "The path to saved model", -) - -flags.DEFINE_string("inputs", None, "The inputs to run machine translation on.") - -EXAMPLES = [ - ( - "Tom doesn't listen to anyone.", - "[start] Tomás no escucha a nadie. [end]", - ), - ("I got soaked to the skin.", "[start] Estoy chorreando. [end]"), - ("I imagined that.", "[start] Me imaginé eso. [end]"), - ("The baby is crying.", "[start] El bebé está llorando. [end]"), - ( - "I've never felt so exhilarated.", - "[start] Nunca me he sentido tan animado. [end]", - ), - ( - "Please forgive me for not having written sooner.", - "[start] Perdóname por no haberte escrito antes, por favor. [end]", - ), - ("I expected more from you.", "[start] Esperaba más de vos. [end]"), - ("I have a computer.", "[start] Tengo un computador. [end]"), - ("Dinner's ready!", "[start] ¡La cena está lista! [end]"), - ("Let me finish.", "[start] Déjame terminar. 
[end]"), -] - - -def decode_sequence(input_sentence, model, max_sequence_length, lookup_table): - encoder_tokenizer = model.encoder_tokenizer - decoder_tokenizer = model.decoder_tokenizer - tokenized_input = encoder_tokenizer([input_sentence]) - - start_token = decoder_tokenizer("[start]")[0].numpy() - end_token = decoder_tokenizer("[end]")[0].numpy() - - decoded_sentence = [start_token] - for i in range(max_sequence_length): - decoder_inputs = tf.convert_to_tensor( - [decoded_sentence], - dtype="int64", - ) - decoder_inputs = tf.concat( - [ - decoder_inputs, - tf.zeros( - [1, max_sequence_length - i - 1], - dtype="int64", - ), - ], - axis=1, - ) - input = { - "encoder_inputs": tokenized_input, - "decoder_inputs": decoder_inputs, - } - predictions = model(input) - predicted_token = np.argmax(predictions[0, i, :]) - decoded_sentence.append(predicted_token) - if predicted_token == end_token: - break - - detokenized_output = [] - for token in decoded_sentence: - detokenized_output.append(lookup_table[token]) - return " ".join(detokenized_output) - - -def main(_): - loaded_model = keras.models.load_model(FLAGS.saved_model_path) - - decoder_tokenizer = loaded_model.decoder_tokenizer - vocab = decoder_tokenizer.get_vocabulary() - index_lookup_table = dict(zip(range(len(vocab)), vocab)) - - if FLAGS.inputs is not None: - # Run inference on user-specified sentence. - translated = decode_sequence( - FLAGS.inputs, - loaded_model, - FLAGS.sequence_length, - index_lookup_table, - ) - logging.info(f"Translated results: {translated}") - - else: - translated = [] - for example in EXAMPLES: - translated.append( - decode_sequence( - example[0], - loaded_model, - FLAGS.sequence_length, - index_lookup_table, - ) - ) - - for i in range(len(EXAMPLES)): - print("ENGLISH SENTENCE: ", EXAMPLES[i][0]) - print("MACHINE TRANSLATED RESULT: ", translated[i]) - print("GOLDEN: ", EXAMPLES[i][1]) - - -if __name__ == "__main__": - app.run(main) diff --git a/examples/machine_translation/model.py b/examples/machine_translation/model.py deleted file mode 100644 index 99a115f6c9..0000000000 --- a/examples/machine_translation/model.py +++ /dev/null @@ -1,125 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import tensorflow as tf -from tensorflow import keras - -from keras_nlp.layers import TransformerDecoder -from keras_nlp.layers import TransformerEncoder - - -class PositionalEmbedding(keras.layers.Layer): - """The positional embedding class.""" - - def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs): - super().__init__(**kwargs) - self.token_embeddings = keras.layers.Embedding( - input_dim=vocab_size, output_dim=embed_dim - ) - self.position_embeddings = keras.layers.Embedding( - input_dim=sequence_length, output_dim=embed_dim - ) - self.sequence_length = sequence_length - self.vocab_size = vocab_size - self.embed_dim = embed_dim - - def call(self, inputs): - length = tf.shape(inputs)[-1] - positions = tf.range(start=0, limit=length, delta=1) - embedded_tokens = self.token_embeddings(inputs) - embedded_positions = self.position_embeddings(positions) - return embedded_tokens + embedded_positions - - def compute_mask(self, inputs, mask=None): - return tf.math.not_equal(inputs, 0) - - -class TranslationModel(keras.Model): - """The machine translation model. - - The model is an encoder-decoder structure model. The encoder is a stack of - `keras_nlp.TransformerEncoder`, and the decoder is a stack of - `keras_nlp.TransformerDecoder`. We also pass in the tokenizer for encoder - and decoder so that during save/load, the tokenizer is also kept. - """ - - def __init__( - self, - encoder_tokenizer, - decoder_tokenizer, - num_encoders, - num_decoders, - num_heads, - transformer_intermediate_dim, - encoder_vocab_size, - decoder_vocab_size, - embed_dim, - sequence_length, - ): - super().__init__() - self.encoders = [] - self.decoders = [] - for _ in range(num_encoders): - self.encoders.append( - TransformerEncoder( - num_heads=num_heads, - intermediate_dim=transformer_intermediate_dim, - ) - ) - for _ in range(num_decoders): - self.decoders.append( - TransformerDecoder( - num_heads=num_heads, - intermediate_dim=transformer_intermediate_dim, - ) - ) - - self.encoder_tokenizer = encoder_tokenizer - self.decoder_tokenizer = decoder_tokenizer - - self.encoder_embedding = PositionalEmbedding( - sequence_length=sequence_length, - vocab_size=encoder_vocab_size, - embed_dim=embed_dim, - ) - - self.decoder_embedding = PositionalEmbedding( - sequence_length=sequence_length, - vocab_size=decoder_vocab_size, - embed_dim=embed_dim, - ) - - self.dense = keras.layers.Dense( - decoder_vocab_size, - activation="softmax", - ) - - def call(self, inputs): - encoder_input, decoder_input = ( - inputs["encoder_inputs"], - inputs["decoder_inputs"], - ) - encoded = self.encoder_embedding(encoder_input) - for encoder in self.encoders: - encoded = encoder(encoded) - - decoded = self.decoder_embedding(decoder_input) - for decoder in self.decoders: - decoded = decoder( - decoded, - encoded, - use_causal_mask=True, - ) - - output = self.dense(decoded) - return output diff --git a/examples/machine_translation/train.py b/examples/machine_translation/train.py deleted file mode 100644 index dcc026e1fb..0000000000 --- a/examples/machine_translation/train.py +++ /dev/null @@ -1,113 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from absl import app -from absl import flags -from tensorflow import keras - -from examples.machine_translation.data import get_dataset_and_tokenizer -from examples.machine_translation.model import TranslationModel - -FLAGS = flags.FLAGS - -flags.DEFINE_integer("num_epochs", 1, "Number of epochs to train.") -flags.DEFINE_integer("steps_per_epoch", None, "Number of steps per epoch.") -flags.DEFINE_integer("num_encoders", 2, "Number of Transformer encoder layers.") -flags.DEFINE_integer("num_decoders", 2, "Number of Transformer decoder layers.") -flags.DEFINE_integer("batch_size", 64, "The training batch size.") -flags.DEFINE_float("learning_rate", 0.001, "The initial learning rate.") -flags.DEFINE_integer("model_dim", 64, "Embedding size.") -flags.DEFINE_integer( - "intermediate_dim", - 128, - "Intermediate dimension (feedforward network) of transformer.", -) -flags.DEFINE_integer( - "num_heads", - 8, - "Number of head of the multihead attention.", -) -flags.DEFINE_integer( - "sequence_length", - 20, - "Input and output sequence length.", -) -flags.DEFINE_integer( - "vocab_size", - 15000, - "Vocabulary size, required by tokenizer.", -) - -flags.DEFINE_string( - "saved_model_path", - "saved_models/machine_translation_model", - "The path to saved model", -) - - -def run_training(model, train_ds, val_ds): - learning_rate = keras.optimizers.schedules.ExponentialDecay( - initial_learning_rate=FLAGS.learning_rate, - decay_steps=20, - decay_rate=0.98, - ) - optimizer = keras.optimizers.Adam(learning_rate) - loss_fn = keras.losses.SparseCategoricalCrossentropy( - reduction=keras.losses.Reduction.NONE - ) - metrics = keras.metrics.SparseCategoricalAccuracy() - model.compile(optimizer=optimizer, metrics=[metrics], loss=loss_fn) - model.fit( - train_ds, - epochs=FLAGS.num_epochs, - validation_data=val_ds, - steps_per_epoch=FLAGS.steps_per_epoch, - ) - - -def main(_): - ( - (train_ds, val_ds, test_ds), - ( - eng_tokenizer, - spa_tokenizer, - ), - ) = get_dataset_and_tokenizer( - FLAGS.sequence_length, FLAGS.vocab_size, FLAGS.batch_size - ) - english_vocab_size = eng_tokenizer.vocabulary_size() - spanish_vocab_size = spa_tokenizer.vocabulary_size() - model = TranslationModel( - encoder_tokenizer=eng_tokenizer, - decoder_tokenizer=spa_tokenizer, - num_encoders=FLAGS.num_encoders, - num_decoders=FLAGS.num_decoders, - num_heads=FLAGS.num_heads, - transformer_intermediate_dim=FLAGS.intermediate_dim, - encoder_vocab_size=english_vocab_size, - decoder_vocab_size=spanish_vocab_size, - embed_dim=FLAGS.model_dim, - sequence_length=FLAGS.sequence_length, - ) - - run_training(model, train_ds, val_ds) - - print(f"Saving to {FLAGS.saved_model_path}") - model.save(FLAGS.saved_model_path) - - print(f"Successfully saved model to {FLAGS.saved_model_path}") - - -if __name__ == "__main__": - app.run(main) diff --git a/examples/tools/README.md b/examples/tools/README.md deleted file mode 100644 index efe525e96f..0000000000 --- a/examples/tools/README.md +++ /dev/null @@ -1,37 +0,0 @@ -# KerasNLP Modeling Tools - -This directory contains runnable scripts that are not specific to a specific -model architecture, 
but are still useful for end-to-end workflows.
-
-## split_sentences.py
-
-The `split_sentences.py` script processes raw input files and splits them into
-output files where each line contains a sentence and a blank line marks the
-start of a new document. This is useful for tasks like next sentence prediction,
-where sentence boundaries are needed for training.
-
-The script supports two types of input files: plain text files, where each
-individual file is assumed to be a single document, and wikipedia dump files
-in the format output by the wikiextractor tool (each document is enclosed in
-`<doc>` tags).
-
-Example usage:
-
-```shell
-python examples/tools/split_sentences.py \
-    --input_files ~/datasets/wikipedia,~/datasets/bookscorpus \
-    --output_directory ~/datasets/sentence-split-data
-```
-
-## train_word_piece_vocabulary.py
-
-The `train_word_piece_vocabulary.py` script allows you to compute your own
-WordPiece vocabulary.
-
-Example usage:
-
-```shell
-python examples/tools/train_word_piece_vocabulary.py \
-    --input_files ~/datasets/my-raw-dataset/ \
-    --output_file vocab.txt
-```
diff --git a/examples/tools/__init__.py b/examples/tools/__init__.py
deleted file mode 100644
index 3364a6bd16..0000000000
--- a/examples/tools/__init__.py
+++ /dev/null
@@ -1,13 +0,0 @@
-# Copyright 2024 The KerasNLP Authors
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# https://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/examples/tools/split_sentences.py b/examples/tools/split_sentences.py
deleted file mode 100644
index d3897cb6d3..0000000000
--- a/examples/tools/split_sentences.py
+++ /dev/null
@@ -1,175 +0,0 @@
-# Copyright 2024 The KerasNLP Authors
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# https://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Split sentences from raw input documents using nltk.
-
-A script to sentence-split a raw dataset (e.g. wikipedia or bookscorpus) into
-sentences for further preprocessing for BERT. The output file format is the
-format expected by `bert_create_pretraining_data.py`, where each file contains
-one line per sentence, with empty newlines between documents.
-
-This script runs multiprocessed, and the number of concurrent processes and
-output file shards can be controlled with `--num_jobs` and `--num_shards`.
-
-Usage:
-python examples/tools/split_sentences.py \
-    --input_files ~/datasets/wikipedia,~/datasets/bookscorpus \
-    --output_directory ~/datasets/bert-sentence-split-data
-"""
-
-import contextlib
-import multiprocessing
-import os
-import random
-import sys
-
-import nltk
-from absl import app
-from absl import flags
-from tensorflow import keras
-
-from examples.utils.scripting_utils import list_filenames_for_arg
-
-FLAGS = flags.FLAGS
-
-flags.DEFINE_string(
-    "input_files",
-    None,
-    "Comma separated list of directories, files, or globs for input data.",
-)
-
-flags.DEFINE_string(
-    "output_directory",
-    None,
-    "Directory for output data.",
-)
-
-flags.DEFINE_integer("num_jobs", None, "Number of parallel jobs to use.")
-
-flags.DEFINE_integer("num_shards", 500, "Number of output file shards to use.")
-
-flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
-
-
-def parse_wiki_file(file):
-    """Read documents from a wikipedia dump file."""
-    documents = []
-    in_article = False
-    article_lines = []
-    for line in file:
-        line = line.strip()
-        # Skip empty lines.
-        if line == "":
-            continue
-        elif "<doc id=" in line:
-            in_article = True
-        elif "</doc>" in line:
-            in_article = False
-            # There are many wikipedia articles that are only titles (one
-            # line) or redirects (two lines); we will skip these.
-            if len(article_lines) > 2:
-                # Skip the title.
-                documents.append(" ".join(article_lines[1:]))
-            article_lines = []
-        elif in_article:
-            article_lines.append(line)
-    return documents
-
-
-def parse_text_file(file):
-    """Read documents from a plain text file."""
-    documents = []
-    file_lines = []
-    for line in file:
-        line = line.strip()
-        # Skip empty lines.
-        if line == "":
-            continue
-        file_lines.append(line)
-    documents.append(" ".join(file_lines))
-    return documents
-
-
-def read_file(filename):
-    """Read documents from an input file."""
-    with open(filename, mode="r") as file:
-        firstline = file.readline()
-        file.seek(0)
-        # Very basic autodetection of file type.
-        # Wikipedia dump files all start with a doc id tag.
-        if "