diff --git a/examples/README.md b/examples/README.md deleted file mode 100644 index e7426169c8..0000000000 --- a/examples/README.md +++ /dev/null @@ -1,5 +0,0 @@ -# KerasNLP Examples - -The `examples/` directory contains scripts built on top of the library that do not fit well into -the colab format used on [keras.io](https://keras.io/examples/). This includes recipes for -pre-training models and evaluating models on benchmarks such as GLUE. diff --git a/examples/__init__.py b/examples/__init__.py deleted file mode 100644 index 3364a6bd16..0000000000 --- a/examples/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/bert_pretraining/README.md b/examples/bert_pretraining/README.md deleted file mode 100644 index acd47313fd..0000000000 --- a/examples/bert_pretraining/README.md +++ /dev/null @@ -1,199 +0,0 @@ -# BERT with KerasNLP - -This example demonstrates how to train a Bidirectional Encoder -Representations from Transformers (BERT) model end-to-end using the KerasNLP -library. This README contains instructions on how to run pretraining directly -from raw data, followed by finetuning and evaluation on the GLUE dataset. - -## Quickly test out the code - -To exercise the code in this directory by training a tiny BERT model, you can -run the following commands from the base directory of the repository. This can -be useful to validate any code changes, but note that a useful BERT model would -need to be trained for much longer on a much larger dataset. - -```shell -OUTPUT_DIR=~/bert_test_output -DATA_URL=https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert - -# Create the output directory. -mkdir -p $OUTPUT_DIR - -# Download example data. -wget ${DATA_URL}/bert_vocab_uncased.txt -O $OUTPUT_DIR/bert_vocab_uncased.txt -wget ${DATA_URL}/wiki_example_data.txt -O $OUTPUT_DIR/wiki_example_data.txt - -# Parse input data and split into sentences. -python3 examples/tools/split_sentences.py \ - --input_files $OUTPUT_DIR/wiki_example_data.txt \ - --output_directory $OUTPUT_DIR/sentence-split-data -# Preprocess input for pretraining. -python3 examples/bert_pretraining/bert_create_pretraining_data.py \ - --input_files $OUTPUT_DIR/sentence-split-data/ \ - --vocab_file $OUTPUT_DIR/bert_vocab_uncased.txt \ - --output_file $OUTPUT_DIR/pretraining-data/pretraining.tfrecord -# Run pretraining for 100 train steps only. -python3 examples/bert_pretraining/bert_pretrain.py \ - --input_directory $OUTPUT_DIR/pretraining-data/ \ - --vocab_file $OUTPUT_DIR/bert_vocab_uncased.txt \ - --saved_model_output $OUTPUT_DIR/model/ \ - --num_train_steps 100 -``` - -## Installing dependencies - -This example needs a few extra dependencies to run (e.g. wikiextractor for -using wikipedia downloads). You can install these into a KerasNLP development -environment with: - -```shell -pip install -r "examples/bert_pretraining/requirements.txt" -``` - -## Pretraining BERT - -Training a BERT model happens in two stages. First, the model is "pretrained" on -a large corpus of input text.
This is computationally expensive. After -pretraining, the model can be "finetuned" on a downstream task with a much -smaller amount of labeled data. - -### Downloading pretraining data - -The BERT pretraining data (Wikipedia + BooksCorpus) is fairly large. The raw -input data takes roughly 20GB of space, and after preprocessing, the full -corpus will take ~400GB. - -The latest Wikipedia dump can be downloaded -[at this link](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2), -or via the command line: - -```shell -curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 -``` -The dump can be extracted with the `wikiextractor` tool. - -```shell -python3 -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 -``` - -BooksCorpus is no longer hosted by -[its creators](https://yknzhu.wixsite.com/mbweb), but you can find instructions -for downloading or reproducing the corpus in -[this repository](https://github.com/soskek/bookcorpus). We suggest the pre-made file -downloads listed at the top of the README. Alternatively, you can forgo it -entirely and pretrain solely on Wikipedia. - -Preparing the pretraining data happens in two stages. First, raw text needs -to be split into lists of sentences per document. Second, this sentence-split -data is used to create training examples with both masked words and -next sentence predictions. - -### Splitting raw text into sentences - -Next, use `examples/tools/split_sentences.py` to process raw input files and -split them into output files where each line contains a sentence, and a blank -line marks the start of a new document. We need this for the next-sentence -prediction task used by BERT. - -For example, if Wikipedia files are located in `~/datasets/wikipedia` and -BooksCorpus files in `~/datasets/bookscorpus`, the following command will output -sentence-split documents to a configurable number of output file shards: - -```shell -python3 examples/tools/split_sentences.py \ - --input_files ~/datasets/wikipedia,~/datasets/bookscorpus \ - --output_directory ~/datasets/sentence-split-data -``` - -### Computing a WordPiece vocabulary - -The easiest and best approach when training BERT is to use the official -vocabularies from the original project, which have become somewhat standard. - -You can download the English uncased vocabulary -[here](https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt), -or in your terminal run: - -```shell -curl -O https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt -``` - -You can also use `examples/tools/train_word_piece_vocab.py` to train your own. - -### Tokenize, mask, and combine sentences into training examples - -The `bert_create_pretraining_data.py` script will take in a set of sentence-split files, and -set up training examples for the next sentence prediction and masked word tasks. - -The output of the script will be TFRecord files with a number of fields per -example. Below is a complete output example, with the addition of a string -`tokens` field for clarity. The actual script will only serialize the token ids -to conserve disk space.
- -```python -tokens: ['[CLS]', 'resin', '##s', 'are', 'today', '[MASK]', 'produced', 'by', - 'ang', '##ios', '##per', '##ms', ',', 'and', 'tend', 'to', '[SEP]', - '[MASK]', 'produced', 'a', '[MASK]', '[MASK]', 'of', 'resin', ',', - 'which', '[MASK]', 'often', 'found', 'as', 'amber', '[SEP]'] -input_ids: [101, 24604, 2015, 2024, 2651, 103, 2550, 2011, 17076, 10735, 4842, - 5244, 1010, 1998, 7166, 2000, 102, 103, 2550, 1037, 103, 103, 1997, - 24604, 1010, 2029, 103, 2411, 2179, 2004, 8994, 102] -input_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] -segment_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] -masked_lm_positions: [5, 17, 20, 21, 26] -masked_lm_ids: [2069, 3619, 2353, 2828, 2003] -masked_lm_weights: [1.0, 1.0, 1.0, 1.0, 1.0] -next_sentence_labels: [0] -``` - -In order to set up the next sentence prediction task, the script will load the -entire input into memory. As such, it is recommended to run this script on a -subset of the input data at a time. - -For example, you can run the script on each file shard in a directory -with the following: - -```shell -for file in path/to/sentence-split-data/*; do - output="path/to/pretraining-data/$(basename -- "$file" .txt).tfrecord" - python3 examples/bert_pretraining/bert_create_pretraining_data.py \ - --input_files ${file} \ - --vocab_file bert_vocab_uncased.txt \ - --output_file ${output} -done -``` - -If enough memory is available, this could be further sped up by running this script -multiple times in parallel. The following will take 3-4 hours on the entire dataset -on an 8 core machine. - -```shell -NUM_JOBS=5 -for file in path/to/sentence-split-data/*; do - output="path/to/pretraining-data/$(basename -- "$file" .txt).tfrecord" - echo python3 examples/bert_pretraining/bert_create_pretraining_data.py \ - --input_files ${file} \ - --vocab_file bert_vocab_uncased.txt \ - --output_file ${output} -done | parallel -j ${NUM_JOBS} -``` - -To preview a sample of generated data files, you can run the command below: - -```shell -python3 -c "from examples.utils.data_utils import preview_tfrecord; preview_tfrecord('path/to/tfrecord_file')" -``` - -### Running BERT pretraining - -After preprocessing, we can run pretraining with the `bert_pretrain.py` -script. This will train a model and save it to the `--saved_model_output` -directory. If you are willing to train from data stored on google cloud storage bucket (GCS), you can do it by setting the file path to -the URL of GCS bucket. For example, `--input_directory=gs://your-bucket-name/you-data-path`. You can also save models directly to GCS by the same approach. - -```shell -python3 examples/bert_pretraining/bert_pretrain.py \ - --input_directory path/to/data/ \ - --vocab_file path/to/bert_vocab_uncased.txt \ - --model_size tiny \ - --saved_model_output path/to/model/ -``` diff --git a/examples/bert_pretraining/__init__.py b/examples/bert_pretraining/__init__.py deleted file mode 100644 index 3364a6bd16..0000000000 --- a/examples/bert_pretraining/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/bert_pretraining/bert_config.py b/examples/bert_pretraining/bert_config.py deleted file mode 100644 index 5c28ceae70..0000000000 --- a/examples/bert_pretraining/bert_config.py +++ /dev/null @@ -1,79 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# TODO(jbischof): remove in favor of presets with load_weights=False -MODEL_CONFIGS = { - "tiny": { - "num_layers": 2, - "hidden_dim": 128, - "dropout": 0.1, - "num_heads": 2, - "intermediate_dim": 512, - }, - "mini": { - "num_layers": 4, - "hidden_dim": 256, - "dropout": 0.1, - "num_heads": 4, - "intermediate_dim": 1024, - }, - "small": { - "num_layers": 4, - "hidden_dim": 512, - "dropout": 0.1, - "num_heads": 8, - "intermediate_dim": 2048, - }, - "medium": { - "num_layers": 8, - "hidden_dim": 512, - "dropout": 0.1, - "num_heads": 8, - "intermediate_dim": 2048, - }, - "base": { - "num_layers": 12, - "hidden_dim": 768, - "dropout": 0.1, - "num_heads": 12, - "intermediate_dim": 3072, - }, - "large": { - "num_layers": 24, - "hidden_dim": 1024, - "dropout": 0.1, - "num_heads": 16, - "intermediate_dim": 4096, - }, -} - -# Currently we have the same set of training parameters for all configurations. -# We should see if we need to split this for different architecture sizes. - -PREPROCESSING_CONFIG = { - "max_seq_length": 512, - "max_predictions_per_seq": 76, - "dupe_factor": 10, - "masked_lm_prob": 0.15, - "short_seq_prob": 0.1, -} - -TRAINING_CONFIG = { - "batch_size": 256, - "epochs": 10, - "learning_rate": 1e-4, - "num_train_steps": 1_000_000, - # Percentage of training steps used for learning rate warmup. - "warmup_percentage": 0.1, -} diff --git a/examples/bert_pretraining/bert_create_pretraining_data.py b/examples/bert_pretraining/bert_create_pretraining_data.py deleted file mode 100644 index e0edeb7b08..0000000000 --- a/examples/bert_pretraining/bert_create_pretraining_data.py +++ /dev/null @@ -1,512 +0,0 @@ -# Copyright 2024 The KerasNLP Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Create masked LM/next sentence masked_lm TF examples for BERT. 
- -This script will create TFRecord files containing BERT training examples with -both word masking and next sentence prediction. - -This script will load the entire dataset into memory to set up the next sentence -prediction task, so it is recommended to run this on shards of data at a time to -avoid memory issues. - -By default, it will duplicate the input data 10 times with different masks and -sentence pairs, as in the original paper. So a 20GB source of Wikipedia and -BooksCorpus will result in a 400GB dataset. - -This script is adapted from the original BERT repository: -https://github.com/google-research/bert/blob/master/create_pretraining_data.py - -Usage: -python bert_create_pretraining_data.py \ - --input_files ~/datasets/bert-sentence-split-data/shard_0.txt \ - --output_file ~/datasets/bert-pretraining-data/shard_0.tfrecord \ - --vocab_file vocab.txt -""" - -import collections -import os -import random -import sys - -import tensorflow as tf -import tensorflow_text as tf_text -from absl import app -from absl import flags - -from examples.bert_pretraining.bert_config import PREPROCESSING_CONFIG -from examples.utils.scripting_utils import list_filenames_for_arg - -# Tokenization will happen with TensorFlow and can easily OOM a GPU. -# Restrict the script to run on CPU, as GPU will not offer a speedup here anyway. -os.environ["CUDA_VISIBLE_DEVICES"] = "-1" - -FLAGS = flags.FLAGS - -flags.DEFINE_string( - "input_files", - None, - "Comma-separated list of directories, globs or files.", -) - -flags.DEFINE_string( - "output_file", - None, - "Output TF record file.", -) - -flags.DEFINE_string( - "vocab_file", - None, - "The vocabulary file for tokenization.", -) - -flags.DEFINE_bool( - "do_lower_case", - True, - "Whether to lower case the input text.", -) - -flags.DEFINE_integer( - "random_seed", - 12345, - "Random seed for data generation.", -) - - -def convert_to_unicode(text): - """Converts text to Unicode if it's not already, assuming utf-8 input.""" - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - - -def printable_text(text): - """Returns text encoded in a way suitable for print.""" - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - - -# This tuple holds a complete training instance of data ready for serialization.
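# Field meanings: `tokens` are the wordpiece strings for the packed
# [CLS] A [SEP] B [SEP] pair, `segment_ids` mark each token as segment A (0) or
# B (1), `is_random_next` is the next-sentence-prediction label, and
# `masked_lm_positions`/`masked_lm_labels` record which positions were masked
# and the original tokens at those positions.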
-TrainingInstance = collections.namedtuple( - "TrainingInstance", - [ - "tokens", - "segment_ids", - "is_random_next", - "masked_lm_positions", - "masked_lm_labels", - ], -) - - -def write_instance_to_example_files( - instances, vocab, max_seq_length, max_predictions_per_seq, output_filename -): - """Create TF example files from `TrainingInstance`s.""" - writer = tf.io.TFRecordWriter(output_filename) - total_written = 0 - lookup = dict(zip(vocab, range(len(vocab)))) - for inst_index, instance in enumerate(instances): - token_ids = [lookup[x] for x in instance.tokens] - padding_mask = [1] * len(token_ids) - segment_ids = list(instance.segment_ids) - assert len(token_ids) <= max_seq_length - - while len(token_ids) < max_seq_length: - token_ids.append(0) - padding_mask.append(0) - segment_ids.append(0) - - assert len(token_ids) == max_seq_length - assert len(padding_mask) == max_seq_length - assert len(segment_ids) == max_seq_length - - masked_lm_positions = list(instance.masked_lm_positions) - masked_lm_ids = [lookup[x] for x in instance.masked_lm_labels] - masked_lm_weights = [1.0] * len(masked_lm_ids) - - while len(masked_lm_positions) < max_predictions_per_seq: - masked_lm_positions.append(0) - masked_lm_ids.append(0) - masked_lm_weights.append(0.0) - - next_sentence_label = 1 if instance.is_random_next else 0 - - features = collections.OrderedDict() - features["token_ids"] = int_feature(token_ids) - features["padding_mask"] = int_feature(padding_mask) - features["segment_ids"] = int_feature(segment_ids) - features["masked_lm_positions"] = int_feature(masked_lm_positions) - features["masked_lm_ids"] = int_feature(masked_lm_ids) - features["masked_lm_weights"] = float_feature(masked_lm_weights) - features["next_sentence_labels"] = int_feature([next_sentence_label]) - - tf_example = tf.train.Example( - features=tf.train.Features(feature=features) - ) - - writer.write(tf_example.SerializeToString()) - total_written += 1 - - writer.close() - print(f"Wrote {total_written} total instances") - - -def int_feature(values): - return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) - - -def float_feature(values): - return tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) - - -def create_training_instances( - input_filenames, - tokenizer, - vocab, - max_seq_length, - dupe_factor, - short_seq_prob, - masked_lm_prob, - max_predictions_per_seq, - rng, -): - """Create `TrainingInstance`s from raw text.""" - # Input file format: - # (1) One sentence per line. These should ideally be actual sentences, not - # entire paragraphs or arbitrary spans of text. (Because we use the - # sentence boundaries for the "next sentence prediction" task). - # (2) Blank lines between documents. Document boundaries are needed so - # that the "next sentence prediction" task doesn't span between documents. 
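# Note: each line is tokenized with the TF Text BertTokenizer below; a blank
# line tokenizes to an empty tensor, which is what the `line.size == 0` check
# uses to close out the current document.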
- dataset = tf.data.TextLineDataset(input_filenames) - dataset = dataset.map( - lambda x: tokenizer.tokenize(x).flat_values, - num_parallel_calls=tf.data.AUTOTUNE, - ) - all_documents = [] - current_document = [] - for line in dataset.as_numpy_iterator(): - if line.size == 0 and current_document: - all_documents.append(current_document) - current_document = [] - else: - line = [x.decode("utf-8") for x in line] - if line: - current_document.append(line) - rng.shuffle(all_documents) - - instances = [] - for _ in range(dupe_factor): - for document_index in range(len(all_documents)): - instances.extend( - create_instances_from_document( - all_documents, - document_index, - max_seq_length, - short_seq_prob, - masked_lm_prob, - max_predictions_per_seq, - vocab, - rng, - ) - ) - rng.shuffle(instances) - return instances - - -def create_instances_from_document( - all_documents, - document_index, - max_seq_length, - short_seq_prob, - masked_lm_prob, - max_predictions_per_seq, - vocab_words, - rng, -): - """Creates `TrainingInstance`s for a single document.""" - document = all_documents[document_index] - - # Account for [CLS], [SEP], [SEP] - max_num_tokens = max_seq_length - 3 - - # We *usually* want to fill up the entire sequence since we are padding - # to `max_seq_length` anyways, so short sequences are generally wasted - # computation. However, we *sometimes* - # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter - # sequences to minimize the mismatch between pre-training and fine-tuning. - # The `target_seq_length` is just a rough target however, whereas - # `max_seq_length` is a hard limit. - target_seq_length = max_num_tokens - if rng.random() < short_seq_prob: - target_seq_length = rng.randint(2, max_num_tokens) - - # We DON'T just concatenate all of the tokens from a document into a long - # sequence and choose an arbitrary split point because this would make the - # next sentence prediction task too easy. Instead, we split the input into - # segments "A" and "B" based on the actual "sentences" provided by the user - # input. - instances = [] - current_chunk = [] - current_length = 0 - i = 0 - while i < len(document): - segment = document[i] - current_chunk.append(segment) - current_length += len(segment) - if i == len(document) - 1 or current_length >= target_seq_length: - if current_chunk: - # `a_end` is how many segments from `current_chunk` go into the - # `A` (first) sentence. - a_end = 1 - if len(current_chunk) >= 2: - a_end = rng.randint(1, len(current_chunk) - 1) - - tokens_a = [] - for j in range(a_end): - tokens_a.extend(current_chunk[j]) - - tokens_b = [] - # Random next - is_random_next = False - if len(current_chunk) == 1 or rng.random() < 0.5: - is_random_next = True - target_b_length = target_seq_length - len(tokens_a) - - # This should rarely go for more than one iteration for - # large corpora. However, just to be careful, we try to make - # sure that the random document is not the same as the - # document we're processing. - for _ in range(10): - random_document_index = rng.randint( - 0, len(all_documents) - 1 - ) - if random_document_index != document_index: - break - - random_document = all_documents[random_document_index] - random_start = rng.randint(0, len(random_document) - 1) - for j in range(random_start, len(random_document)): - tokens_b.extend(random_document[j]) - if len(tokens_b) >= target_b_length: - break - # We didn't actually use these segments so we "put them - # back" so they don't go to waste. 
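# Rewinding `i` by the number of unused segments makes the enclosing while
# loop revisit those sentences as the start of the next chunk.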
- num_unused_segments = len(current_chunk) - a_end - i -= num_unused_segments - # Actual next - else: - is_random_next = False - for j in range(a_end, len(current_chunk)): - tokens_b.extend(current_chunk[j]) - truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) - - assert len(tokens_a) >= 1 - assert len(tokens_b) >= 1 - - tokens = [] - segment_ids = [] - tokens.append("[CLS]") - segment_ids.append(0) - for token in tokens_a: - tokens.append(token) - segment_ids.append(0) - - tokens.append("[SEP]") - segment_ids.append(0) - - for token in tokens_b: - tokens.append(token) - segment_ids.append(1) - tokens.append("[SEP]") - segment_ids.append(1) - - ( - tokens, - masked_lm_positions, - masked_lm_labels, - ) = create_masked_lm_predictions( - tokens, - masked_lm_prob, - max_predictions_per_seq, - vocab_words, - rng, - ) - instance = TrainingInstance( - tokens=tokens, - segment_ids=segment_ids, - is_random_next=is_random_next, - masked_lm_positions=masked_lm_positions, - masked_lm_labels=masked_lm_labels, - ) - instances.append(instance) - current_chunk = [] - current_length = 0 - i += 1 - - return instances - - -MaskedLmInstance = collections.namedtuple( - "MaskedLmInstance", ["index", "label"] -) - - -def create_masked_lm_predictions( - tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng -): - """Creates the predictions for the masked LM objective.""" - - # TODO(jbischof): replace with keras_nlp.layers.MaskedLMMaskGenerator - # (Issue #166) - - cand_indexes = [] - for i, token in enumerate(tokens): - if token == "[CLS]" or token == "[SEP]": - continue - cand_indexes.append([i]) - - rng.shuffle(cand_indexes) - - output_tokens = list(tokens) - - num_to_predict = min( - max_predictions_per_seq, - max(1, int(round(len(tokens) * masked_lm_prob))), - ) - - masked_lms = [] - covered_indexes = set() - for index_set in cand_indexes: - if len(masked_lms) >= num_to_predict: - break - # If adding a whole-word mask would exceed the maximum number of - # predictions, then just skip this candidate. 
- if len(masked_lms) + len(index_set) > num_to_predict: - continue - is_any_index_covered = False - for index in index_set: - if index in covered_indexes: - is_any_index_covered = True - break - if is_any_index_covered: - continue - for index in index_set: - covered_indexes.add(index) - - masked_token = None - # 80% of the time, replace with [MASK] - if rng.random() < 0.8: - masked_token = "[MASK]" - else: - # 10% of the time, keep original - if rng.random() < 0.5: - masked_token = tokens[index] - # 10% of the time, replace with random word - else: - masked_token = vocab_words[ - rng.randint(0, len(vocab_words) - 1) - ] - - output_tokens[index] = masked_token - - masked_lms.append( - MaskedLmInstance(index=index, label=tokens[index]) - ) - assert len(masked_lms) <= num_to_predict - masked_lms = sorted(masked_lms, key=lambda x: x.index) - - masked_lm_positions = [] - masked_lm_labels = [] - for p in masked_lms: - masked_lm_positions.append(p.index) - masked_lm_labels.append(p.label) - - return (output_tokens, masked_lm_positions, masked_lm_labels) - - -def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): - """Truncates a pair of sequences to a maximum sequence length.""" - while True: - total_length = len(tokens_a) + len(tokens_b) - if total_length <= max_num_tokens: - break - - trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b - assert len(trunc_tokens) >= 1 - - # We want to sometimes truncate from the front and sometimes from the - # back to add more randomness and avoid biases. - if rng.random() < 0.5: - del trunc_tokens[0] - else: - trunc_tokens.pop() - - -def main(_): - print(f"Reading input data from {FLAGS.input_files}") - input_filenames = list_filenames_for_arg(FLAGS.input_files) - if not input_filenames: - print("No input files found. Check `input_files` flag.") - sys.exit(1) - - # Load the vocabulary. - vocab = [] - with open(FLAGS.vocab_file, "r") as vocab_file: - for line in vocab_file: - vocab.append(line.strip()) - tokenizer = tf_text.BertTokenizer( - FLAGS.vocab_file, - lower_case=FLAGS.do_lower_case, - token_out_type=tf.string, - ) - - rng = random.Random(FLAGS.random_seed) - instances = create_training_instances( - input_filenames, - tokenizer, - vocab, - PREPROCESSING_CONFIG["max_seq_length"], - PREPROCESSING_CONFIG["dupe_factor"], - PREPROCESSING_CONFIG["short_seq_prob"], - PREPROCESSING_CONFIG["masked_lm_prob"], - PREPROCESSING_CONFIG["max_predictions_per_seq"], - rng, - ) - - print(f"Outputting to {FLAGS.output_file}.") - output_directory = os.path.dirname(FLAGS.output_file) - if not os.path.exists(output_directory): - os.mkdir(output_directory) - write_instance_to_example_files( - instances, - vocab, - PREPROCESSING_CONFIG["max_seq_length"], - PREPROCESSING_CONFIG["max_predictions_per_seq"], - FLAGS.output_file, - ) - - -if __name__ == "__main__": - flags.mark_flag_as_required("input_files") - flags.mark_flag_as_required("output_file") - flags.mark_flag_as_required("vocab_file") - app.run(main) diff --git a/examples/bert_pretraining/bert_pretrain.py b/examples/bert_pretraining/bert_pretrain.py deleted file mode 100644 index 95e2b77d4b..0000000000 --- a/examples/bert_pretraining/bert_pretrain.py +++ /dev/null @@ -1,457 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import datetime -import sys - -import tensorflow as tf -from absl import app -from absl import flags -from absl import logging -from tensorflow import keras - -import keras_nlp -from examples.bert_pretraining.bert_config import MODEL_CONFIGS -from examples.bert_pretraining.bert_config import PREPROCESSING_CONFIG -from examples.bert_pretraining.bert_config import TRAINING_CONFIG - -FLAGS = flags.FLAGS - -flags.DEFINE_string( - "input_directory", - None, - "The directory of training data. It can be a local disk path, or the URL " - "of a Google Cloud Storage bucket.", -) - -flags.DEFINE_string( - "saved_model_output", - None, - "Output directory to save the model to.", -) - -flags.DEFINE_string( - "checkpoint_save_directory", - None, - "Output directory to save checkpoints to.", -) - -flags.DEFINE_bool( - "skip_restore", - False, - "Skip restoring from checkpoint if True.", -) - -flags.DEFINE_string( - "tpu_name", - None, - "The TPU to connect to. If None, TPU will not be used.", -) - -flags.DEFINE_bool( - "enable_cloud_logging", - False, - "If True, the script will use cloud logging.", -) - -flags.DEFINE_string( - "tensorboard_log_path", - None, - "The path to save tensorboard log to.", -) - -flags.DEFINE_string( - "model_size", - "tiny", - "One of: tiny, mini, small, medium, base, or large.", -) - -flags.DEFINE_string( - "vocab_file", - None, - "The vocabulary file for tokenization.", -) - -flags.DEFINE_integer( - "num_train_steps", - None, - "Override the pre-configured number of train steps.", -) - - -class MaskedLMHead(keras.layers.Layer): - """Masked language model network head for BERT. - - This layer implements a masked language model on top of the embedding table - of a transformer based encoder (for `keras_nlp.models.BertBackbone`, this is - `encoder.token_embedding.embeddings`). - - Example: - ```python - encoder = keras_nlp.models.BertBackbone( - vocabulary_size=30552, - num_layers=12, - num_heads=12, - hidden_dim=768, - intermediate_dim=3072, - max_sequence_length=12, - ) - lm_layer = MaskedLMHead(embedding_table=encoder.token_embedding.embeddings) - ``` - - Args: - embedding_table: The embedding table from the encoder network. - intermediate_activation: The activation, if any, for the inner dense - layer. - initializer: The initializer for the dense layer. Defaults to a Glorot - uniform initializer.
- """ - - def __init__( - self, - embedding_table, - intermediate_activation="gelu", - initializer="glorot_uniform", - **kwargs, - ): - super().__init__(**kwargs) - self.embedding_table = embedding_table - self.intermediate_activation = keras.activations.get( - intermediate_activation - ) - self.initializer = initializer - - def build(self, input_shape): - self._vocab_size, hidden_dim = self.embedding_table.shape - self.dense = keras.layers.Dense( - hidden_dim, - activation=self.intermediate_activation, - kernel_initializer=self.initializer, - name="transform/dense", - ) - self.layer_norm = keras.layers.LayerNormalization( - axis=-1, epsilon=1e-12, name="transform/LayerNorm" - ) - self.bias = self.add_weight( - name="output_bias/bias", - shape=(self._vocab_size,), - initializer="zeros", - trainable=True, - ) - - super().build(input_shape) - - def call(self, sequence_data, masked_positions): - masked_lm_input = self._gather_indexes(sequence_data, masked_positions) - lm_data = self.dense(masked_lm_input) - lm_data = self.layer_norm(lm_data) - lm_data = tf.matmul(lm_data, self.embedding_table, transpose_b=True) - logits = tf.nn.bias_add(lm_data, self.bias) - masked_positions_length = ( - masked_positions.shape.as_list()[1] or tf.shape(masked_positions)[1] - ) - return tf.reshape( - logits, [-1, masked_positions_length, self._vocab_size] - ) - - def _gather_indexes(self, sequence_tensor, positions): - """Gathers the vectors at the specific positions, for performance. - - Args: - sequence_tensor: Sequence output of shape - (`batch_size`, `seq_length`, `hidden_dim`) where `hidden_dim` - is number of hidden units. - positions: Positions ids of tokens in sequence to mask for - pretraining of with dimension (batch_size, num_predictions) - where `num_predictions` is maximum number of tokens to mask out - and predict per each sequence. - - Returns: - Masked out sequence tensor of shape (batch_size * num_predictions, - `hidden_dim`). 
- """ - sequence_shape = tf.shape(sequence_tensor) - batch_size, seq_length = sequence_shape[0], sequence_shape[1] - width = sequence_tensor.shape.as_list()[2] or sequence_shape[2] - - flat_offsets = tf.reshape( - tf.range(0, batch_size, dtype="int32") * seq_length, [-1, 1] - ) - flat_positions = tf.reshape(positions + flat_offsets, [-1]) - flat_sequence_tensor = tf.reshape( - sequence_tensor, [batch_size * seq_length, width] - ) - output_tensor = tf.gather(flat_sequence_tensor, flat_positions) - - return output_tensor - - -class BertPretrainingModel(keras.Model): - """MaskedLM + NSP model with Bert encoder.""" - - def __init__(self, encoder, **kwargs): - super().__init__(**kwargs) - self.encoder = encoder - # TODO(jbischof): replace with keras_nlp.layers.MaskedLMHead (Issue #166) - self.masked_lm_head = MaskedLMHead( - embedding_table=encoder.token_embedding.embeddings, - initializer=keras.initializers.TruncatedNormal(stddev=0.02), - name="mlm_layer", - ) - self.next_sentence_head = keras.layers.Dense( - encoder.num_segments, - kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02), - name="nsp_layer", - ) - - def call(self, data): - encoder_output = self.encoder( - { - "token_ids": data["token_ids"], - "segment_ids": data["segment_ids"], - "padding_mask": data["padding_mask"], - } - ) - sequence_output, pooled_output = ( - encoder_output["sequence_output"], - encoder_output["pooled_output"], - ) - lm_preds = self.masked_lm_head( - sequence_output, data["masked_lm_positions"] - ) - nsp_preds = self.next_sentence_head(pooled_output) - return {"mlm": lm_preds, "nsp": nsp_preds} - - -class LinearDecayWithWarmup(keras.optimizers.schedules.LearningRateSchedule): - """ - A learning rate schedule with linear warmup and decay. - - This schedule implements a linear warmup for the first `num_warmup_steps` - and a linear ramp down until `num_train_steps`. - """ - - def __init__(self, learning_rate, num_warmup_steps, num_train_steps): - self.learning_rate = learning_rate - self.warmup_steps = num_warmup_steps - self.train_steps = num_train_steps - - def __call__(self, step): - peak_lr = tf.cast(self.learning_rate, dtype="float32") - warmup = tf.cast(self.warmup_steps, dtype="float32") - training = tf.cast(self.train_steps, dtype="float32") - step = tf.cast(step, dtype="float32") - - is_warmup = step < warmup - - # Linear Warmup will be implemented if current step is less than - # `num_warmup_steps` else Linear Decay will be implemented. 
- return tf.cond( - is_warmup, - lambda: peak_lr * (step / warmup), - lambda: tf.math.maximum( - 0.0, peak_lr * (training - step) / (training - warmup) - ), - ) - - def get_config(self): - return { - "learning_rate": self.learning_rate, - "num_warmup_steps": self.warmup_steps, - "num_train_steps": self.train_steps, - } - - -def decode_record(record): - """Decodes a record to a TensorFlow example.""" - seq_length = PREPROCESSING_CONFIG["max_seq_length"] - lm_length = PREPROCESSING_CONFIG["max_predictions_per_seq"] - name_to_features = { - "token_ids": tf.io.FixedLenFeature([seq_length], "int64"), - "padding_mask": tf.io.FixedLenFeature([seq_length], "int64"), - "segment_ids": tf.io.FixedLenFeature([seq_length], "int64"), - "masked_lm_positions": tf.io.FixedLenFeature([lm_length], "int64"), - "masked_lm_ids": tf.io.FixedLenFeature([lm_length], "int64"), - "masked_lm_weights": tf.io.FixedLenFeature([lm_length], "float32"), - "next_sentence_labels": tf.io.FixedLenFeature([1], "int64"), - } - # tf.Example only supports "int64", but the TPU only supports "int32". - # So cast all int64 to int32. - example = tf.io.parse_single_example(record, name_to_features) - for name in list(example.keys()): - value = example[name] - if value.dtype == "int64": - value = tf.cast(value, "int32") - example[name] = value - - inputs = { - "token_ids": example["token_ids"], - "padding_mask": example["padding_mask"], - "segment_ids": example["segment_ids"], - "masked_lm_positions": example["masked_lm_positions"], - } - labels = { - "mlm": example["masked_lm_ids"], - "nsp": example["next_sentence_labels"], - } - sample_weights = {"mlm": example["masked_lm_weights"], "nsp": tf.ones((1,))} - sample = (inputs, labels, sample_weights) - return sample - - -def get_checkpoint_callback(): - if tf.io.gfile.exists(FLAGS.checkpoint_save_directory): - if not tf.io.gfile.isdir(FLAGS.checkpoint_save_directory): - raise ValueError( - "`checkpoint_save_directory` should be a directory, " - f"but {FLAGS.checkpoint_save_directory} is not a " - "directory. Please set `checkpoint_save_directory` as " - "a directory." - ) - - elif FLAGS.skip_restore: - # Clear up the directory if users want to skip restoring. - tf.io.gfile.rmtree(FLAGS.checkpoint_save_directory) - checkpoint_path = FLAGS.checkpoint_save_directory - return keras.callbacks.BackupAndRestore( - backup_dir=checkpoint_path, - ) - - -def get_tensorboard_callback(): - timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") - log_dir = FLAGS.tensorboard_log_path + timestamp - return keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1) - - -def main(_): - if FLAGS.enable_cloud_logging: - # If the job is on cloud, we will use cloud logging. - import google.cloud.logging - - keras.utils.disable_interactive_logging() - client = google.cloud.logging.Client() - client.setup_logging() - - logging.info(f"Reading input data from {FLAGS.input_directory}") - if not tf.io.gfile.isdir(FLAGS.input_directory): - raise ValueError( - "`input_directory` should be a directory, " - f"but {FLAGS.input_directory} is not a directory. Please " - "set `input_directory` flag as a directory." - ) - files = tf.io.gfile.listdir(FLAGS.input_directory) - input_filenames = [FLAGS.input_directory + "/" + file for file in files] - - if not input_filenames: - logging.info("No input files found. 
Check `input_directory` flag.") - sys.exit(1) - - vocab = [] - with tf.io.gfile.GFile(FLAGS.vocab_file) as vocab_file: - for line in vocab_file: - vocab.append(line.strip()) - - model_config = MODEL_CONFIGS[FLAGS.model_size] - - if FLAGS.tpu_name is None: - # Use default strategy if not using TPU. - strategy = tf.distribute.get_strategy() - else: - # Connect to TPU and create TPU strategy. - resolver = tf.distribute.cluster_resolver.TPUClusterResolver.connect( - tpu=FLAGS.tpu_name - ) - strategy = tf.distribute.TPUStrategy(resolver) - - # Decode and batch data. - dataset = tf.data.TFRecordDataset(input_filenames) - dataset = dataset.map( - lambda record: decode_record(record), - num_parallel_calls=tf.data.experimental.AUTOTUNE, - ) - dataset = dataset.batch(TRAINING_CONFIG["batch_size"], drop_remainder=True) - dataset = dataset.repeat() - - with strategy.scope(): - # Create a Bert model the input config. - encoder = keras_nlp.models.BertBackbone( - vocabulary_size=len(vocab), **model_config - ) - # Make sure model has been called. - encoder(encoder.inputs) - encoder.summary() - - # Allow overriding train steps from the command line for quick testing. - if FLAGS.num_train_steps is not None: - num_train_steps = FLAGS.num_train_steps - else: - num_train_steps = TRAINING_CONFIG["num_train_steps"] - num_warmup_steps = int( - num_train_steps * TRAINING_CONFIG["warmup_percentage"] - ) - learning_rate_schedule = LinearDecayWithWarmup( - learning_rate=TRAINING_CONFIG["learning_rate"], - num_warmup_steps=num_warmup_steps, - num_train_steps=num_train_steps, - ) - optimizer = keras.optimizers.Adam(learning_rate=learning_rate_schedule) - - lm_loss = keras.losses.SparseCategoricalCrossentropy( - from_logits=True, - name="lm_loss", - ) - nsp_loss = keras.losses.SparseCategoricalCrossentropy( - from_logits=True, - name="nsp_loss", - ) - - lm_accuracy = keras.metrics.SparseCategoricalAccuracy(name="accuracy") - nsp_accuracy = keras.metrics.SparseCategoricalAccuracy(name="accuracy") - - pretraining_model = BertPretrainingModel(encoder) - pretraining_model.compile( - optimizer=optimizer, - loss={"mlm": lm_loss, "nsp": nsp_loss}, - weighted_metrics={"mlm": lm_accuracy, "nsp": nsp_accuracy}, - ) - - epochs = TRAINING_CONFIG["epochs"] - steps_per_epoch = num_train_steps // epochs - - callbacks = [] - if FLAGS.checkpoint_save_directory: - callbacks.append(get_checkpoint_callback()) - if FLAGS.tensorboard_log_path: - callbacks.append(get_tensorboard_callback()) - - pretraining_model.fit( - dataset, - epochs=epochs, - steps_per_epoch=steps_per_epoch, - callbacks=callbacks, - ) - - model_path = FLAGS.saved_model_output - logging.info(f"Saving to {FLAGS.saved_model_output}") - encoder.save(model_path) - - -if __name__ == "__main__": - flags.mark_flag_as_required("input_directory") - flags.mark_flag_as_required("vocab_file") - flags.mark_flag_as_required("saved_model_output") - app.run(main) diff --git a/examples/bert_pretraining/requirements.txt b/examples/bert_pretraining/requirements.txt deleted file mode 100644 index 3d0b35fbc8..0000000000 --- a/examples/bert_pretraining/requirements.txt +++ /dev/null @@ -1,2 +0,0 @@ -nltk -wikiextractor diff --git a/examples/glue_benchmark/README.md b/examples/glue_benchmark/README.md deleted file mode 100644 index 4dde0b93d7..0000000000 --- a/examples/glue_benchmark/README.md +++ /dev/null @@ -1,117 +0,0 @@ -# GLUE Finetuning Script - -This script is written to help you evaluate your model on GLUE benchmarking. 
It provides the following functionality: - -- Load and preprocess GLUE data. -- Finetune your Keras text classification model. -- Generate GLUE submission files. - -To use the script, you need to change the code to load your pretrained model, -and run the command below: - -```shell -python glue.py --task_name="mrpc" --batch_size=32 \ - --submission_directory="glue_submissions/" -``` - -By default, the script finetunes on the tiniest BERT model we have available -(this will be fast but not top performing). - -To make a real GLUE leaderboard submission, you need to run the finetuning on -all tasks, then enter the submission directory and zip the submission files: -```shell -for task in cola sst2 mrpc rte stsb qnli qqp; do - python glue.py --task_name="$task" --submission_directory="glue_submissions/" -done - -python glue.py --task_name="mnli_matched" \ - --submission_directory="glue_submissions/" \ - --save_finetuning_model="saved/mnli" - -python glue.py --task_name="mnli_mismatched" \ - --submission_directory="glue_submissions/" \ - --load_finetuning_model="saved/mnli" - -python glue.py --task_name="ax" \ - --submission_directory="glue_submissions/" \ - --load_finetuning_model="saved/mnli" - -cd glue_submissions -zip -r submission.zip *.tsv -``` - -Please note that `mnli_matched`, `mnli_mismatched` and `ax` share the same -training set, so we only train once on `mnli_matched` and use the saved model -to evaluate on `mnli_mismatched` and `ax`. - -A GLUE submission requires that `submission.zip` contain a `.tsv` file for all -tasks, otherwise the submission will fail. An empty `.tsv` will also fail -because its content is checked. If you only want to evaluate on certain tasks, -you can download the sample submission, and put the `.tsv` files for the tasks you -don't run inside your submission file. For example, if you don't want to -run the `ax` task, you can do: - -``` -curl -O https://gluebenchmark.com/assets/CBOW.zip -unzip CBOW.zip -d sample_submissions -cp sample_submissions/AX.tsv glue_submissions -``` - -## How to Use the Script - -To use this script on your model, you need to do 3 things: - -1. Implement your custom preprocessing in `preprocess_fn()`. -2. Load your pretrained model. -3. Make the finetuning model from your model. - -Code needing customization is wrapped between the comments -`Custom code block starts` and -`Custom code block ends`. See instructions on each step below. - -### Custom Preprocessing - -In all GLUE datasets, each record comes with one or two sentences as features, -and one label. In the script, we load the GLUE dataset in the format -`(features, labels)`, where `features` is a tuple of either 1 sentence or 2 -sentences. You need to write custom preprocessing logic to convert the data -to the input format required by your model. For example, the current script -(which finetunes KerasNLP BERT) does the following: - -```python -bert_preprocessor = keras_nlp.models.BertPreprocessor.from_preset( - "bert_tiny_en_uncased" -) -def preprocess_fn(feature, label): - return bert_preprocessor(feature), label -``` -It uses `BertPreprocessor` to convert the input features to the format the -model expects. - -### Load Pretrained Model - -As long as your pretrained model is a Keras model, you can use it with this script. - -### Make the Finetuning Model - -You need to make a classification model based on your pretrained model for -evaluation purposes.
For example, [`BertClassifier`](https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/bert/bert_classifier.py) takes a `Bert` model as backbone, -and adds a dense layer on top of it. Please pay attention that different model -could use different classifier structure, e.g., in [RoBERTa](https://github.com/huggingface/transformers/blob/94b3f544a1f5e04b78d87a2ae32a7ac252e22e31/src/transformers/models/roberta/modeling_roberta.py#L1437-L1456), -it has 2 dense layers. If you are using pretrained model from an OSS package, -please find the correct classifier. If you use a custom model, you can start -experimenting with a simple dense layer, and adjust the structure based on -its performance. - -## Flags Table - -| Flags Name | Explanation | Default | -|---------------------------- |------------------------------------------------- |--------- | -| task_name | The name of the GLUE task to finetune on. | "mrpc" | -| batch_size | Data batch size | 32 | -| epochs | Number of epochs to run finetuning. | 2 | -| learning_rate | The optimizer's learning rate. | 5e-5 | -| tpu_name | The name of TPU to connect to. | None | -| submission_directory | The file path to save the glue submission file. | None | -| load_finetuning_model | The path to load the finetuning model. | None | -| save_finetuning_model | The path to save the finetuning model. | None | diff --git a/examples/glue_benchmark/scores.md b/examples/glue_benchmark/scores.md deleted file mode 100644 index 4435a29911..0000000000 --- a/examples/glue_benchmark/scores.md +++ /dev/null @@ -1,141 +0,0 @@ -# GLUE Benchmark Score on KerasNLP Pretrained Models - -We use `glue.py` to test out KerasNLP pretrained models, and report scores in -this doc. Our goal is to quickly verify our model's performance instead of -searching for the best hyperparameters, so the reported score can be a little -worse than reported by the original paper. - -Unless specifically noted, hyperparameter settings are the same across all GLUE -tasks. - -## BERT - -Test target is `keras_nlp.models.BertClassifier()`. WNLI is skipped because it -was not evaluated at the original paper. - -### Hyperparameter Settings - -- Learning Rate: - We use a `PolynomialDecay` learning rate, with `initial_learning_rate=5e-5`. - ```python - lr = tf.keras.optimizers.schedules.PolynomialDecay( - 5e-5, - decay_steps={total_training_steps}, - end_learning_rate=0.0, - ) - ``` -- Optimizer: - We use `AdamW` optimizer, and exclude `bias` and variables in - `LayerNormalization` from weight decay. - - ```python - optimizer = tf.keras.optimizers.experimental.AdamW( - lr, weight_decay=0.01, global_clipnorm=1.0 - ) - optimizer.exclude_from_weight_decay( - var_names=["LayerNorm", "layer_norm", "bias"] - ) - ``` -- Others: - | Hyperparameter Name | Value | - |---------------------|-------| - | batch_size | 32 | - | epochs | 3 | - | dropout | 0.1 | - -### Benchmark Score - -| Task Name | Metrics | Score | -|-----------|-----------------------|-----------| -| CoLA | Matthew's Corr | 52.2 | -| SST-2 | Accuracy | 93.5 | -| MRPC | F1 / Accuracy | 88.2/83.9 | -| STSB | Pearson-Spearman Corr | 84.5/83.1 | -| QQP | F1 / Accuracy | 71.3/89.3 | -| MNLI_M | Accuracy | 84.3 | -| MNLI_Mis | Accuracy | 83.3 | -| QNLI | Accuracy | 90.4 | -| RTE | Accuracy | 66.7 | -| AX | Matthew's Corr | 34.8 | - -See the actual submission in this [link](https://gluebenchmark.com/submission/gnG9xUQGkjfVq6loRQYKTcM1YjG3/-NIe3Owl8pjHLXpistkI). - -## RoBERTa - -Test target is `keras_nlp.models.RobertaClassifier()`. 
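For context, the classifier under test can be built directly from a KerasNLP preset. The sketch below is illustrative only: the `roberta_base_en` preset name and the plain Adam optimizer are assumptions, and `glue.py` wires the classifier into its own preprocessing and training setup described in the hyperparameter settings that follow.

```python
import keras_nlp
from tensorflow import keras

# Build a two-class RoBERTa classifier (e.g. for MRPC) from a preset.
# "roberta_base_en" is an assumed preset name for this sketch.
classifier = keras_nlp.models.RobertaClassifier.from_preset(
    "roberta_base_en",
    num_classes=2,
)

# The classifier outputs logits, so compile with `from_logits=True`.
classifier.compile(
    optimizer=keras.optimizers.Adam(2e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
```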
- -### Hyperparameter Settings - -#### WNLI - -We choose a special setting for WNLI from other tasks. - -- Learning Rate: - We use a `PolynomialDecay` learning rate, with `initial_learning_rate=2e-5`. - ```python - lr = tf.keras.optimizers.schedules.PolynomialDecay( - 2e-5, - decay_steps={total_training_steps}, - end_learning_rate=0.0, - ) - ``` -- Optimizer: - We use `Adam` optimizer. - - ```python - optimizer = tf.keras.optimizers.Adam(lr) - ``` -- Others: - | Hyperparameter Name | Value | - |---------------------|-------| - | batch_size | 32 | - | epochs | 10 | - | dropout | 0.1 | - -#### Other GLUE Tasks - -- Learning Rate: - We use a `PolynomialDecay` learning rate, with `initial_learning_rate=2e-5`. - ```python - lr = tf.keras.optimizers.schedules.PolynomialDecay( - 2e-5, - decay_steps={total_training_steps}, - end_learning_rate=0.0, - ) - ``` -- Optimizer: - We use `AdamW` optimizer, and exclude `bias` and variables in - `LayerNormalization` from weight decay. - - ```python - optimizer = tf.keras.optimizers.experimental.AdamW( - lr, weight_decay=0.01, global_clipnorm=1.0 - ) - optimizer.exclude_from_weight_decay( - var_names=["LayerNorm", "layer_norm", "bias"] - ) - ``` -- Others: - | Hyperparameter Name | Value | - |---------------------|-------| - | batch_size | 32 | - | epochs | 3 | - | dropout | 0.1 | - -### Benchmark Score - -| Task Name | Metrics | Score | -|-----------|-----------------------|-----------| -| CoLA | Matthew's Corr | 56.3 | -| SST-2 | Accuracy | 96.1 | -| MRPC | F1 / Accuracy | 89.8/86.3 | -| STSB | Pearson-Spearman Corr | 88.4/87.7 | -| QQP | F1 / Accuracy | 72.3/89.0 | -| MNLI_M | Accuracy | 87.7 | -| MNLI_Mis | Accuracy | 87.1 | -| QNLI | Accuracy | 92.8 | -| RTE | Accuracy | 69.2 | -| WNLI | Accuracy | 65.1 | -| AX | Matthew's Corr | 40.6 | - -See the actual submission in this [link](https://gluebenchmark.com/submission/gnG9xUQGkjfVq6loRQYKTcM1YjG3/-NJS0XAX1o9p8DJst3wM). \ No newline at end of file diff --git a/examples/machine_translation/README.md b/examples/machine_translation/README.md deleted file mode 100644 index ac836f933a..0000000000 --- a/examples/machine_translation/README.md +++ /dev/null @@ -1,48 +0,0 @@ -# English-Spanish machine translation with keras-nlp - -This example will show how to train a Transformer-based machine translation -model using APIs provided by Keras-NLP. This instruction shows how to train the -model, and evaluate with customized English sentences. - -## Installing dependencies - -Pip dependencies for all keras-nlp examples are listed in `setup.py`. To install -both the keras-nlp library from source and all other dependencies required to -run the example, run the below command. You may want to install to a self -contained environment (e.g. a container or a virtualenv). - -```shell -pip install -e ".[examples]" -``` - -## Train the machine translation model and save to disk - -At the root directory of keras-nlp, run the following command: - -```shell -python ./examples/machine_translation/train.py \ - --num_epochs=3 \ - --saved_model_path="saved_models/machine_translation" -``` - -If it finishes successfully, you should see your console print out the -following information: -``` -Successfully saved model to saved_models/machine_translation. -``` - -## Running machine translation on customized inputs - -Once you have a model saved successfully, you can play around it via the -inference.py script. 
To run inference on customized inputs, please run the -following command: - -```shell -python ./examples/machine_translation/inference.py \ - --inputs="Have a nice day" \ - --saved_model_path="saved_models/machine_translation" -``` - -You can set the `inputs` value to any English sentence, or leave it unset, in -which case the script will run against some predefined English sentences. - diff --git a/examples/machine_translation/__init__.py b/examples/machine_translation/__init__.py deleted file mode 100644 index 3364a6bd16..0000000000 --- a/examples/machine_translation/__init__.py +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/examples/machine_translation/data.py b/examples/machine_translation/data.py deleted file mode 100644 index 6015820eac..0000000000 --- a/examples/machine_translation/data.py +++ /dev/null @@ -1,148 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
-import pathlib -import random -import re -import string - -import tensorflow as tf -from tensorflow import keras - - -def download_data(): - text_file = keras.utils.get_file( - fname="spa-eng.zip", - origin=( - "http://storage.googleapis.com/download.tensorflow.org/data/" - + "spa-eng.zip" - ), - extract=True, - ) - return pathlib.Path(text_file).parent / "spa-eng" / "spa.txt" - - -def read_data(filepath): - with open(filepath) as f: - lines = f.read().split("\n")[:-1] - text_pairs = [] - for line in lines: - eng, spa = line.split("\t") - spa = "[start] " + spa + " [end]" - text_pairs.append((eng, spa)) - return text_pairs - - -def split_train_val_test(text_pairs): - random.shuffle(text_pairs) - num_val_samples = int(0.15 * len(text_pairs)) - num_train_samples = len(text_pairs) - 2 * num_val_samples - train_pairs = text_pairs[:num_train_samples] - val_end_index = num_train_samples + num_val_samples - val_pairs = text_pairs[num_train_samples:val_end_index] - test_pairs = text_pairs[val_end_index:] - return train_pairs, val_pairs, test_pairs - - -strip_chars = string.punctuation + "¿" -strip_chars = strip_chars.replace("[", "") -strip_chars = strip_chars.replace("]", "") - - -@keras.saving.register_keras_serializable() -def custom_standardization(input_string): - lowercase = tf.strings.lower(input_string) - return tf.strings.regex_replace( - lowercase, - "[%s]" % re.escape(strip_chars), - "", - ) - - -def prepare_tokenizer(train_pairs, sequence_length, vocab_size): - """Preapare English and Spanish tokenizer.""" - eng_tokenizer = keras.layers.TextVectorization( - max_tokens=vocab_size, - output_mode="int", - output_sequence_length=sequence_length, - ) - spa_tokenizer = keras.layers.TextVectorization( - max_tokens=vocab_size, - output_mode="int", - output_sequence_length=sequence_length + 1, - standardize=custom_standardization, - ) - eng_texts, spa_texts = zip(*train_pairs) - eng_tokenizer.adapt(eng_texts) - spa_tokenizer.adapt(spa_texts) - return eng_tokenizer, spa_tokenizer - - -def prepare_datasets(text_pairs, batch_size, eng_tokenizer, spa_tokenizer): - """Transform raw text pairs to tf datasets.""" - eng_texts, spa_texts = zip(*text_pairs) - eng_texts = list(eng_texts) - spa_texts = list(spa_texts) - - def format_dataset(eng, spa): - """Format the dataset given input English and Spanish text. - - The output format is: - x: a pair of English and Spanish sentence. - y: The Spanish sentence in x shifts 1 token towards right, because - we are predicting the next token. 
- """ - eng = eng_tokenizer(eng) - spa = spa_tokenizer(spa) - return ( - { - "encoder_inputs": eng, - "decoder_inputs": spa[:, :-1], - }, - spa[:, 1:], - tf.cast((spa[:, 1:] != 0), "float32"), # mask as sample weights - ) - - dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts)) - dataset = dataset.batch(batch_size) - dataset = dataset.map(format_dataset) - return dataset.shuffle(2048).prefetch(tf.data.AUTOTUNE).cache() - - -def get_dataset_and_tokenizer(sequence_length, vocab_size, batch_size): - """Main method to get the formatted machine translation dataset.""" - filepath = download_data() - text_pairs = read_data(filepath) - train_pairs, val_pairs, test_pairs = split_train_val_test(text_pairs) - eng_tokenizer, spa_tokenizer = prepare_tokenizer( - train_pairs, sequence_length, vocab_size - ) - train_ds = prepare_datasets( - train_pairs, - batch_size, - eng_tokenizer, - spa_tokenizer, - ) - val_ds = prepare_datasets( - val_pairs, - batch_size, - eng_tokenizer, - spa_tokenizer, - ) - test_ds = prepare_datasets( - test_pairs, - batch_size, - eng_tokenizer, - spa_tokenizer, - ) - return (train_ds, val_ds, test_ds), (eng_tokenizer, spa_tokenizer) diff --git a/examples/machine_translation/inference.py b/examples/machine_translation/inference.py deleted file mode 100644 index 5a3a1118e4..0000000000 --- a/examples/machine_translation/inference.py +++ /dev/null @@ -1,140 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import numpy as np -import tensorflow as tf -from absl import app -from absl import flags -from absl import logging -from tensorflow import keras - -# Import data module to include the customized serializable, required for -# loading tokenizer. -import examples.machine_translation.data # noqa: F401. - -FLAGS = flags.FLAGS - -flags.DEFINE_integer( - "sequence_length", - 20, - "Input and output sequence length.", -) - -flags.DEFINE_string( - "saved_model_path", - "saved_models/machine_translation_model", - "The path to saved model", -) - -flags.DEFINE_string("inputs", None, "The inputs to run machine translation on.") - -EXAMPLES = [ - ( - "Tom doesn't listen to anyone.", - "[start] Tomás no escucha a nadie. [end]", - ), - ("I got soaked to the skin.", "[start] Estoy chorreando. [end]"), - ("I imagined that.", "[start] Me imaginé eso. [end]"), - ("The baby is crying.", "[start] El bebé está llorando. [end]"), - ( - "I've never felt so exhilarated.", - "[start] Nunca me he sentido tan animado. [end]", - ), - ( - "Please forgive me for not having written sooner.", - "[start] Perdóname por no haberte escrito antes, por favor. [end]", - ), - ("I expected more from you.", "[start] Esperaba más de vos. [end]"), - ("I have a computer.", "[start] Tengo un computador. [end]"), - ("Dinner's ready!", "[start] ¡La cena está lista! [end]"), - ("Let me finish.", "[start] Déjame terminar. 
[end]"), -] - - -def decode_sequence(input_sentence, model, max_sequence_length, lookup_table): - encoder_tokenizer = model.encoder_tokenizer - decoder_tokenizer = model.decoder_tokenizer - tokenized_input = encoder_tokenizer([input_sentence]) - - start_token = decoder_tokenizer("[start]")[0].numpy() - end_token = decoder_tokenizer("[end]")[0].numpy() - - decoded_sentence = [start_token] - for i in range(max_sequence_length): - decoder_inputs = tf.convert_to_tensor( - [decoded_sentence], - dtype="int64", - ) - decoder_inputs = tf.concat( - [ - decoder_inputs, - tf.zeros( - [1, max_sequence_length - i - 1], - dtype="int64", - ), - ], - axis=1, - ) - input = { - "encoder_inputs": tokenized_input, - "decoder_inputs": decoder_inputs, - } - predictions = model(input) - predicted_token = np.argmax(predictions[0, i, :]) - decoded_sentence.append(predicted_token) - if predicted_token == end_token: - break - - detokenized_output = [] - for token in decoded_sentence: - detokenized_output.append(lookup_table[token]) - return " ".join(detokenized_output) - - -def main(_): - loaded_model = keras.models.load_model(FLAGS.saved_model_path) - - decoder_tokenizer = loaded_model.decoder_tokenizer - vocab = decoder_tokenizer.get_vocabulary() - index_lookup_table = dict(zip(range(len(vocab)), vocab)) - - if FLAGS.inputs is not None: - # Run inference on user-specified sentence. - translated = decode_sequence( - FLAGS.inputs, - loaded_model, - FLAGS.sequence_length, - index_lookup_table, - ) - logging.info(f"Translated results: {translated}") - - else: - translated = [] - for example in EXAMPLES: - translated.append( - decode_sequence( - example[0], - loaded_model, - FLAGS.sequence_length, - index_lookup_table, - ) - ) - - for i in range(len(EXAMPLES)): - print("ENGLISH SENTENCE: ", EXAMPLES[i][0]) - print("MACHINE TRANSLATED RESULT: ", translated[i]) - print("GOLDEN: ", EXAMPLES[i][1]) - - -if __name__ == "__main__": - app.run(main) diff --git a/examples/machine_translation/model.py b/examples/machine_translation/model.py deleted file mode 100644 index 99a115f6c9..0000000000 --- a/examples/machine_translation/model.py +++ /dev/null @@ -1,125 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import tensorflow as tf -from tensorflow import keras - -from keras_nlp.layers import TransformerDecoder -from keras_nlp.layers import TransformerEncoder - - -class PositionalEmbedding(keras.layers.Layer): - """The positional embedding class.""" - - def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs): - super().__init__(**kwargs) - self.token_embeddings = keras.layers.Embedding( - input_dim=vocab_size, output_dim=embed_dim - ) - self.position_embeddings = keras.layers.Embedding( - input_dim=sequence_length, output_dim=embed_dim - ) - self.sequence_length = sequence_length - self.vocab_size = vocab_size - self.embed_dim = embed_dim - - def call(self, inputs): - length = tf.shape(inputs)[-1] - positions = tf.range(start=0, limit=length, delta=1) - embedded_tokens = self.token_embeddings(inputs) - embedded_positions = self.position_embeddings(positions) - return embedded_tokens + embedded_positions - - def compute_mask(self, inputs, mask=None): - return tf.math.not_equal(inputs, 0) - - -class TranslationModel(keras.Model): - """The machine translation model. - - The model is an encoder-decoder structure model. The encoder is a stack of - `keras_nlp.TransformerEncoder`, and the decoder is a stack of - `keras_nlp.TransformerDecoder`. We also pass in the tokenizer for encoder - and decoder so that during save/load, the tokenizer is also kept. - """ - - def __init__( - self, - encoder_tokenizer, - decoder_tokenizer, - num_encoders, - num_decoders, - num_heads, - transformer_intermediate_dim, - encoder_vocab_size, - decoder_vocab_size, - embed_dim, - sequence_length, - ): - super().__init__() - self.encoders = [] - self.decoders = [] - for _ in range(num_encoders): - self.encoders.append( - TransformerEncoder( - num_heads=num_heads, - intermediate_dim=transformer_intermediate_dim, - ) - ) - for _ in range(num_decoders): - self.decoders.append( - TransformerDecoder( - num_heads=num_heads, - intermediate_dim=transformer_intermediate_dim, - ) - ) - - self.encoder_tokenizer = encoder_tokenizer - self.decoder_tokenizer = decoder_tokenizer - - self.encoder_embedding = PositionalEmbedding( - sequence_length=sequence_length, - vocab_size=encoder_vocab_size, - embed_dim=embed_dim, - ) - - self.decoder_embedding = PositionalEmbedding( - sequence_length=sequence_length, - vocab_size=decoder_vocab_size, - embed_dim=embed_dim, - ) - - self.dense = keras.layers.Dense( - decoder_vocab_size, - activation="softmax", - ) - - def call(self, inputs): - encoder_input, decoder_input = ( - inputs["encoder_inputs"], - inputs["decoder_inputs"], - ) - encoded = self.encoder_embedding(encoder_input) - for encoder in self.encoders: - encoded = encoder(encoded) - - decoded = self.decoder_embedding(decoder_input) - for decoder in self.decoders: - decoded = decoder( - decoded, - encoded, - use_causal_mask=True, - ) - - output = self.dense(decoded) - return output diff --git a/examples/machine_translation/train.py b/examples/machine_translation/train.py deleted file mode 100644 index dcc026e1fb..0000000000 --- a/examples/machine_translation/train.py +++ /dev/null @@ -1,113 +0,0 @@ -# Copyright 2024 The KerasNLP Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# https://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from absl import app -from absl import flags -from tensorflow import keras - -from examples.machine_translation.data import get_dataset_and_tokenizer -from examples.machine_translation.model import TranslationModel - -FLAGS = flags.FLAGS - -flags.DEFINE_integer("num_epochs", 1, "Number of epochs to train.") -flags.DEFINE_integer("steps_per_epoch", None, "Number of steps per epoch.") -flags.DEFINE_integer("num_encoders", 2, "Number of Transformer encoder layers.") -flags.DEFINE_integer("num_decoders", 2, "Number of Transformer decoder layers.") -flags.DEFINE_integer("batch_size", 64, "The training batch size.") -flags.DEFINE_float("learning_rate", 0.001, "The initial learning rate.") -flags.DEFINE_integer("model_dim", 64, "Embedding size.") -flags.DEFINE_integer( - "intermediate_dim", - 128, - "Intermediate dimension (feedforward network) of transformer.", -) -flags.DEFINE_integer( - "num_heads", - 8, - "Number of head of the multihead attention.", -) -flags.DEFINE_integer( - "sequence_length", - 20, - "Input and output sequence length.", -) -flags.DEFINE_integer( - "vocab_size", - 15000, - "Vocabulary size, required by tokenizer.", -) - -flags.DEFINE_string( - "saved_model_path", - "saved_models/machine_translation_model", - "The path to saved model", -) - - -def run_training(model, train_ds, val_ds): - learning_rate = keras.optimizers.schedules.ExponentialDecay( - initial_learning_rate=FLAGS.learning_rate, - decay_steps=20, - decay_rate=0.98, - ) - optimizer = keras.optimizers.Adam(learning_rate) - loss_fn = keras.losses.SparseCategoricalCrossentropy( - reduction=keras.losses.Reduction.NONE - ) - metrics = keras.metrics.SparseCategoricalAccuracy() - model.compile(optimizer=optimizer, metrics=[metrics], loss=loss_fn) - model.fit( - train_ds, - epochs=FLAGS.num_epochs, - validation_data=val_ds, - steps_per_epoch=FLAGS.steps_per_epoch, - ) - - -def main(_): - ( - (train_ds, val_ds, test_ds), - ( - eng_tokenizer, - spa_tokenizer, - ), - ) = get_dataset_and_tokenizer( - FLAGS.sequence_length, FLAGS.vocab_size, FLAGS.batch_size - ) - english_vocab_size = eng_tokenizer.vocabulary_size() - spanish_vocab_size = spa_tokenizer.vocabulary_size() - model = TranslationModel( - encoder_tokenizer=eng_tokenizer, - decoder_tokenizer=spa_tokenizer, - num_encoders=FLAGS.num_encoders, - num_decoders=FLAGS.num_decoders, - num_heads=FLAGS.num_heads, - transformer_intermediate_dim=FLAGS.intermediate_dim, - encoder_vocab_size=english_vocab_size, - decoder_vocab_size=spanish_vocab_size, - embed_dim=FLAGS.model_dim, - sequence_length=FLAGS.sequence_length, - ) - - run_training(model, train_ds, val_ds) - - print(f"Saving to {FLAGS.saved_model_path}") - model.save(FLAGS.saved_model_path) - - print(f"Successfully saved model to {FLAGS.saved_model_path}") - - -if __name__ == "__main__": - app.run(main) diff --git a/examples/tools/README.md b/examples/tools/README.md deleted file mode 100644 index efe525e96f..0000000000 --- a/examples/tools/README.md +++ /dev/null @@ -1,37 +0,0 @@ -# KerasNLP Modeling Tools - -This directory contains runnable scripts that are not specific to a specific -model architecture, 
but are still useful for end-to-end workflows.
-
-## split_sentences.py
-
-The `split_sentences.py` script processes raw input files and splits them into
-output files where each line contains a sentence and a blank line marks the
-start of a new document. This is useful for tasks like next sentence prediction,
-where sentence boundaries are needed for training.
-
-The script supports two types of input files: plain text files, where each
-individual file is assumed to be a single document, and wikipedia dump files
-in the format output by the wikiextractor tool (each document is enclosed in
-`<doc>` tags).
-
-Example usage:
-
-```shell
-python examples/tools/split_sentences.py \
-    --input_files ~/datasets/wikipedia,~/datasets/bookscorpus \
-    --output_directory ~/datasets/sentence-split-data
-```
-
-## train_word_piece_vocabulary.py
-
-The `train_word_piece_vocabulary.py` script allows you to compute your own
-WordPiece vocabulary.
-
-Example usage:
-
-```shell
-python examples/tools/train_word_piece_vocabulary.py \
-    --input_files ~/datasets/my-raw-dataset/ \
-    --output_file vocab.txt
-```
diff --git a/examples/tools/__init__.py b/examples/tools/__init__.py
deleted file mode 100644
index 3364a6bd16..0000000000
--- a/examples/tools/__init__.py
+++ /dev/null
@@ -1,13 +0,0 @@
-# Copyright 2024 The KerasNLP Authors
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# https://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/examples/tools/split_sentences.py b/examples/tools/split_sentences.py
deleted file mode 100644
index d3897cb6d3..0000000000
--- a/examples/tools/split_sentences.py
+++ /dev/null
@@ -1,175 +0,0 @@
-# Copyright 2024 The KerasNLP Authors
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# https://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Split sentences from raw input documents using nltk.
-
-A script to sentence-split a raw dataset (e.g. wikipedia or bookscorpus) into
-sentences for further preprocessing for BERT. The output file format is the
-format expected by `bert_create_pretraining_data.py`, where each file contains
-one line per sentence, with empty newlines between documents.
-
-This script runs multiprocessed, and the number of concurrent processes and
-output file shards can be controlled with `--num_jobs` and `--num_shards`.
-
-Usage:
-python examples/tools/split_sentences.py \
-    --input_files ~/datasets/wikipedia,~/datasets/bookscorpus \
-    --output_directory ~/datasets/bert-sentence-split-data
-"""
-
-import contextlib
-import multiprocessing
-import os
-import random
-import sys
-
-import nltk
-from absl import app
-from absl import flags
-from tensorflow import keras
-
-from examples.utils.scripting_utils import list_filenames_for_arg
-
-FLAGS = flags.FLAGS
-
-flags.DEFINE_string(
-    "input_files",
-    None,
-    "Comma separated list of directories, files, or globs for input data.",
-)
-
-flags.DEFINE_string(
-    "output_directory",
-    None,
-    "Directory for output data.",
-)
-
-flags.DEFINE_integer("num_jobs", None, "Number of parallel jobs to use.")
-
-flags.DEFINE_integer("num_shards", 500, "Number of output file shards to use.")
-
-flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.")
-
-
-def parse_wiki_file(file):
-    """Read documents from a wikipedia dump file."""
-    documents = []
-    in_article = False
-    article_lines = []
-    for line in file:
-        line = line.strip()
-        # Skip empty lines.
-        if line == "":
-            continue
-        elif "<doc id=" in line:
-            in_article = True
-        elif "</doc>" in line:
-            in_article = False
-            # There are many wikipedia articles that are only titles (one
-            # line) or redirects (two lines); we will skip these.
-            if len(article_lines) > 2:
-                # Skip the title.
-                documents.append(" ".join(article_lines[1:]))
-            article_lines = []
-        elif in_article:
-            article_lines.append(line)
-    return documents
-
-
-def parse_text_file(file):
-    """Read documents from a plain text file."""
-    documents = []
-    file_lines = []
-    for line in file:
-        line = line.strip()
-        # Skip empty lines.
-        if line == "":
-            continue
-        file_lines.append(line)
-    documents.append(" ".join(file_lines))
-    return documents
-
-
-def read_file(filename):
-    """Read documents from an input file."""
-    with open(filename, mode="r") as file:
-        firstline = file.readline()
-        file.seek(0)
-        # Very basic autodetection of file type.
-        # Wikipedia dump files all start with a doc id tag.
-        if "