
Adding to_tf_dataset method #2731

Merged: 46 commits from tf_dataset_conversion into master, Sep 16, 2021
Conversation

Rocketknight1 (Member)

Oh my god do not merge this yet, it's just a draft.

I've added a method (via a mixin) to the arrow_dataset.Dataset class that automatically converts our Dataset classes to TF Dataset classes ready for training. It hopefully has most of the features we want, including streaming from disk (no need to load the whole dataset into memory!), correct shuffling, variable-length batches to reduce compute, and correct support for unusual padding. It achieves this by calling the tokenizer's pad method in the middle of a TF compute graph via a very hacky call to tf.py_function, which is heretical but seems to work.
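For concreteness, a minimal sketch of that draft approach (hypothetical helper, not the actual PR code; it assumes the requested columns are already tokenizer outputs and all integer-typed):

```python
import tensorflow as tf

def to_tf_dataset_draft(dataset, tokenizer, columns, batch_size):
    def pad_batch(indices):
        # Runs eagerly inside the graph via tf.py_function: gather rows from the
        # Arrow-backed dataset and let the tokenizer pad them to a common length.
        batch = dataset[indices.numpy().tolist()]
        padded = tokenizer.pad(batch, return_tensors="np")
        return [padded[col] for col in columns]

    index_ds = (
        tf.data.Dataset.range(len(dataset))
        .shuffle(len(dataset))
        .batch(batch_size)
    )
    # Tout assumes integer columns (input_ids, attention_mask, ...).
    return index_ds.map(
        lambda idx: tf.py_function(pad_batch, inp=[idx], Tout=[tf.int64] * len(columns))
    )
```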

A number of issues need to be resolved before it's ready to merge, though:

  1. Is a mixin the right way to do this? Do other classes besides arrow_dataset.Dataset need this method too?
  2. Needs an argument to support constant-length batches for TPU training - this is easy to add and I'll do it soon.
  3. Needs the user to supply the list of columns to drop from the arrow Dataset. Is there some automatic way to get the columns we want, or see which columns were added by the tokenizer?
  4. Assumes the label column is always present and always called "label" - this is probably not great, but I'm not sure what the 'correct' thing to do here is.

@sgugger (Contributor) left a comment

This is very... PyTorchic XD
I like the design, I just think it can be made more general by using a data_collator instead of a tokenizer. My only concern is how it will go in terms of performance (since TF might not like the PyTorch-iness of it all), but since we're just grabbing tokenized texts and maybe padding, this shouldn't be too much of a problem.

For computer vision though, we should see if there is a way to make sure several processes are used to prepare the batches, or check whether the final map does that automatically. Knowing TF, I doubt it, but one can hope.
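For illustration, the collator idea in isolation (this assumes a transformers version where collators accept return_tensors, and a tokenizer variable in scope):

```python
from transformers import DataCollatorWithPadding

# A data_collator generalizes the tokenizer-only hook: the same callable can
# pad text, apply masking, or batch vision features.
collator = DataCollatorWithPadding(tokenizer, return_tensors="np")
batch = collator([dataset[i] for i in range(8)])  # dict of padded numpy arrays
```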

(Three inline review threads on src/datasets/arrow_dataset.py, since resolved.)
@Rocketknight1 (Member, Author)

This seems to be working reasonably well in testing, and performance is way better. tf.py_function has been dropped in favor of an input generator, but I moved as much of the code as possible outside the generator to allow TF to compile it correctly. I also avoid tf.RaggedTensor at all costs, and do the shuffle in the dataset followed by accessing sequential chunks, instead of shuffling an index tensor. The combination of all of these gives us a more flexible data loader as well as a ~20× performance boost compared to the first solution.
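Roughly, the new pipeline looks like this (a sketch under the same assumptions as before, not the exact PR code):

```python
import tensorflow as tf

def to_tf_dataset_v2(dataset, tokenizer, columns, batch_size, shuffle):
    # Shuffle the underlying dataset once, then read sequential chunks;
    # this avoids shuffling a large index tensor inside the TF graph.
    if shuffle:
        dataset = dataset.shuffle()

    def gen():
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start : start + batch_size]
            padded = tokenizer.pad(batch, return_tensors="np")
            yield {col: padded[col] for col in columns}

    # Plain dense tensors in the signature: no tf.RaggedTensor anywhere.
    signature = {
        col: tf.TensorSpec(shape=(None, None), dtype=tf.int64) for col in columns
    }
    return tf.data.Dataset.from_generator(
        gen, output_signature=signature
    ).prefetch(tf.data.AUTOTUNE)
```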

@sgugger (Contributor) left a comment

Looking good! Just a few more comments on the API.

(Two inline review threads on src/datasets/arrow_dataset.py, since resolved.)
@Rocketknight1 (Member, Author)

I made a change to the TFFormatter in this PR that will need some changes to the tests, so I wanted to ping @lhoestq and anyone else before I made those changes.

The key problem is that, up until now, the TFFormatter has always returned RaggedTensor, created using the very slow tf.ragged.constant function. This is a big performance penalty, but it's also (imo) surprising for users: RaggedTensor is meant for tensors where one dimension has variable length. That's a good fit for tokenized datasets with variable sequence length, but it's an odd choice when the non-batch dimensions are constant, such as in image datasets, or in datasets where all samples are padded to the same length (e.g. for TPU training).

The change I made is to return standard Tensor objects instead of RaggedTensor when all the samples in the batch have the same shape, to fall back to fast RaggedTensor creation with tf.ragged.stack when they don't, and to use the very slow tf.ragged.constant function only as a last resort. I think this will match user expectations in most cases and greatly improve performance, but it's a (very slightly) breaking change, so any feedback is welcome!
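In pseudocode, the fallback chain is roughly this (a sketch of the idea, not the exact TFFormatter code):

```python
import numpy as np
import tensorflow as tf

def batch_to_tf(batch):
    try:
        # Uniform shapes: np.array succeeds, so return a plain dense Tensor.
        return tf.convert_to_tensor(np.array(batch))
    except ValueError:
        pass
    try:
        # Variable-length samples: tf.ragged.stack is much faster than
        # tf.ragged.constant because it stacks per-sample tensors directly.
        return tf.ragged.stack([tf.convert_to_tensor(sample) for sample in batch])
    except (ValueError, tf.errors.InvalidArgumentError):
        # Last resort for deeply nested or otherwise irregular data.
        return tf.ragged.constant(batch)
```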

@Rocketknight1 (Member, Author)

Also, I really can't emphasize enough how slow tf.ragged.constant is; it's bad enough to create a data pipeline bottleneck in more or less any training setup:

[screenshot: timing comparison illustrating the tf.ragged.constant bottleneck]
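The gap is easy to reproduce with a toy benchmark (illustrative only; absolute numbers depend on the machine):

```python
import timeit
import tensorflow as tf

rows = [[1] * n for n in range(1, 257)]  # a toy variable-length batch

t_constant = timeit.timeit(lambda: tf.ragged.constant(rows), number=10)
t_stack = timeit.timeit(
    lambda: tf.ragged.stack([tf.constant(r) for r in rows]), number=10
)
print(f"tf.ragged.constant: {t_constant:.3f}s  tf.ragged.stack: {t_stack:.3f}s")
```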

@lhoestq (Member) left a comment

I'm fine with this change to use tf tensors instead of ragged tensors, and it's nice to see that there is a fallback on ragged tensors anyway. I'm very impressed by the speed gains.

This is indeed a breaking change, but I agree with you that in the end it's the only way to get proper speed. It's also always better to get actual tensors rather than ragged tensors when possible.

The API looks fine to me :)

Maybe in the future people will be happy to have more control over the shuffling (setting the parameters to pass to Dataset.shuffle), but for now I think it's fine.

Comment on lines 219 to 220:

```python
columns,
batch_size,
shuffle,
```
Member

Could they be optional parameters?

Rocketknight1 (Member, Author)

I was thinking about that! It's unclear to me what the defaults should be for columns or batch_size, though, and I really wanted shuffle to be a required parameter to ensure people are aware of it, so that they don't accidentally shuffle, or skip shuffling, their data when they didn't mean to.

I could maybe set batch_size to something like 32 by default and leave the other two as required parameters?

@lhoestq (Member) commented Aug 30, 2021

Oh, I see your point about shuffle.
And actually, thinking more about it, it looks like we should require batch_size as well, no?

Maybe if columns is not specified, then all of them are used?

(These are just some random ideas; in the end we should just pick whatever fits the TF paradigm best.)

Rocketknight1 (Member, Author)

I was thinking about that, but usually our datasets have one or more string columns that we don't want, so defaulting to all columns will probably not work most of the time. It'd be nice if we had some way to auto-detect the relevant columns, but I can't think of how we'd do that, so I think the safest thing is to just ask users to specify them.
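For illustration, the kind of call this converges on, with all three parameters explicit (hypothetical column names):

```python
tf_train = dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask", "label"],  # string columns left out
    batch_size=32,
    shuffle=True,
)
```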

@Rocketknight1 (Member, Author)

Hi @lhoestq, the tests have been modified and everything is passing. The Windows tests look to be failing for an unrelated reason, but other than that I'm ready to merge if you are!

@Rocketknight1 changed the title from "First draft of a method to auto-convert our datasets to TF datasets!" to "Adding to_tf_dataset method" on Sep 2, 2021
@lhoestq (Member) commented Sep 6, 2021

Hi @Rocketknight1! Feel free to merge master into this branch to fix and run the full CI :)

@Rocketknight1 (Member, Author)

@lhoestq rebased onto master and it looks good! I'm doing some testing with new notebook examples, but are you happy to merge if that looks good?

@mariosasko (Collaborator) left a comment

This feature seems super cool.

A few nits in terms of style:

(Three inline review threads on src/datasets/arrow_dataset.py, since resolved.)
@lhoestq (Member) left a comment

Thanks for pushing this :)
Feel free to add docstrings + type hints + tests.
Let me know if I can help you with this.

Also, what do you think of adding it to the documentation as well?

```python
# We assume that if you're shuffling it's the train set, so we drop the remainder unless told not to
drop_remainder = shuffle
dataset = self.remove_columns([col for col in self.features if col not in cols_to_retain])
dataset.set_format("python")
```
Member

Note that it is faster to use the numpy format rather than python, especially for tensors (there's a zero-copy conversion from the Arrow data to numpy).

Rocketknight1 (Member, Author)

Noted! I'll try to make it work in numpy format.
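The suggested change is small at the call site (a sketch of the idea):

```python
dataset.set_format("numpy")  # zero-copy views of the Arrow data for numeric columns
batch = dataset[:32]         # columns come back as np.ndarray rather than Python lists
```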

Comment on lines +303 to +347:

```python
cast_dtype = np.int64 if np.issubdtype(array.dtype, np.integer) else np.float32
array = array.astype(cast_dtype)
```
Member

Would this work for string types or nested types?

Rocketknight1 (Member, Author)

I've had some success with nested dtypes (in multiple-choice datasets). It does fail on string types, though: the tf.data.Dataset is intended to be passed straight to a model, so the assumption was that everything coming out of it would be convertible to a tf.Tensor. We could possibly make strings work in this context, but I'd need to think about a more generic approach to building the dataset and doing shape inference.
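A quick illustration of why string columns break the cast above:

```python
import numpy as np

arr = np.array(["hello", "world"])
np.issubdtype(arr.dtype, np.integer)  # False, so the code falls through...
arr.astype(np.float32)                # ...and this raises ValueError for strings
```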

Member

Ok! Maybe we can mention this in the docstring?

Member

I just mentioned in the docstring that only numeric data is expected :)

@lhoestq (Member) left a comment

Thanks a lot! I think the PR is ready to be merged now :)

After that we may want to update parts of the documentation:

  • add the method to the list of documented Dataset methods in main_classes.rst
  • update the demo google colab
  • update the tensorflow parts of the documentation

Are there other changes that you wanted to do before merging ?

@Rocketknight1 (Member, Author)

@lhoestq No, I'm happy to merge it as-is and add documentation afterwards!

@lhoestq (Member) left a comment

Perfect then :)

@Rocketknight1 Rocketknight1 merged commit fa09d37 into master Sep 16, 2021
@Rocketknight1 Rocketknight1 deleted the tf_dataset_conversion branch September 16, 2021 13:50
Successfully merging this pull request may close these issues:

  • Mutable columns argument breaks set_format