
Added max_sample_ arguments #10551

Merged: 8 commits merged into huggingface:master from bhadreshpsavani:argument-addition on Mar 8, 2021

Conversation

@bhadreshpsavani (Contributor) commented on Mar 5, 2021:

What does this PR do?

Fixes #10437 and #10423

Notes:

With these changes, all the PyTorch-based examples gain support for the new arguments, except for the two files below.

  1. run_mlm_flax.py: the same changes could be applied here, but I left this file untouched since I could not test them.
  2. run_generation.py
  • I have reverted the code changes for the three TF-based examples since they were causing errors and we want to keep those files as they are.
  • The test/predict code addition is still pending; I will do it next.

Review:

@stas00 @sgugger

@stas00 (Contributor) left a review:

Excellent work, @bhadreshpsavani!

There are a few small tweak requests that I left in the comments.

Thank you!

    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_eval:
stas00:

Suggested change: insert a blank line before if training_args.do_eval:

Please add a new line between the ifs so that they don't mesh together (same in all other scripts).

if os.path.exists(path):
    with open(path, "r") as f:
        results = json.load(f)
    return results
stas00:

if we are expecting this to always work, then perhaps:

else:
    raise ValueError(f"can't find {path}")

otherwise result["eval_accuracy"] will complain about a missing key, which hides the real problem.
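For illustration, a minimal sketch of how the helper could look with that change (the function name get_results and the all_results.json file name are assumptions for this sketch, not taken from the PR):

import json
import os


def get_results(output_dir):
    # Hypothetical helper: load the metrics the example script wrote out.
    path = os.path.join(output_dir, "all_results.json")
    if os.path.exists(path):
        with open(path, "r") as f:
            results = json.load(f)
        return results
    else:
        # Fail loudly here instead of letting a later KeyError
        # (e.g. on results["eval_accuracy"]) hide the real problem.
        raise ValueError(f"can't find {path}")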

@sgugger (Collaborator) left a review:

Thanks for diving into this. There is a slight problem with the language modeling examples and the QA examples: for both, the number of samples in the dataset is actually changed in the preprocessing, so we must take more care to have the right number of samples in the final dataset.

In particular, I don't think we can avoid preprocessing the whole dataset in the language modeling examples.

Comment on lines 350 to 353
def preprocess_function(examples):
    examples = tokenizer(examples[text_column_name])
    return group_texts(examples)

sgugger:

This will not work when grouped like this, as the group_texts function relies on the length of the tokenized samples. Moreover, in this case, the number of samples is actually the number of elements in lm_datasets (in the version before your PR), since group_texts changes the number of examples.

Therefore, all the preprocessing should be left as is here and the number of samples selected at the end.
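As an illustration of the suggestion (a sketch that assumes the lm_datasets variable from the pre-PR version of the script), the cap would be applied after grouping:

# Keep the tokenize + group_texts preprocessing exactly as it was, then
# cap the number of samples on the grouped dataset:
train_dataset = lm_datasets["train"]
if data_args.max_train_samples is not None:
    # The cap now counts final (grouped) examples, not raw ones.
    train_dataset = train_dataset.select(range(data_args.max_train_samples))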

Comment on lines 401 to 405
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
sgugger:

Same comment as before. The selecting should be done after the preprocessing.

    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_train:
sgugger:

Same comment on this script too.

train_dataset = datasets["train"]
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
sgugger:

The prepare_train_features function will create multiple entries for each example. So we should do a second select after the preprocessing.
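For illustration, the double-select pattern in run_qa.py could look roughly like this (a sketch that assumes the surrounding script variables; the map arguments simply mirror the snippets shown in this review):

if training_args.do_train:
    train_dataset = datasets["train"]
    if data_args.max_train_samples is not None:
        # First select: cap the raw examples so we don't preprocess the
        # whole dataset.
        train_dataset = train_dataset.select(range(data_args.max_train_samples))
    train_dataset = train_dataset.map(
        prepare_train_features,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
    )
    if data_args.max_train_samples is not None:
        # Second select: prepare_train_features can turn one example into
        # several features, so re-apply the cap on the processed dataset.
        train_dataset = train_dataset.select(range(data_args.max_train_samples))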

eval_dataset = datasets["validation"]
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
eval_dataset = eval_dataset.map(
sgugger:

Same for validation.

train_dataset = datasets["train"]
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
train_dataset = train_dataset.map(
sgugger:

Same as in the run_qa script, we should do a second select after preprocessing.

eval_dataset = datasets["validation"]
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
eval_dataset = eval_dataset.map(
sgugger:

Same for validation.

@stas00 (Contributor) commented on Mar 5, 2021:

Thank you for having a closer look than I did, @sgugger.

Ideally, we should have tests that would have caught this.

@bhadreshpsavani (Contributor, Author) commented:

Hi @stas00,

How can we add test cases for this? If we check max_train_samples and max_val_samples from the metrics and add assert statements, that might be possible.

@stas00 (Contributor) commented on Mar 6, 2021:

Yes, that's exactly the idea.
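A rough sketch of such a test, loosely modeled on the example tests (the metric keys train_samples / eval_samples, the all_results.json layout, and the exact flags are assumptions for illustration, not part of this PR):

import json
import os
import sys
import unittest
from unittest.mock import patch

import run_glue  # the example script under test (assumed importable)


class MaxSamplesTest(unittest.TestCase):
    def test_run_glue_max_samples(self):
        output_dir = "/tmp/test_glue_max_samples"
        testargs = f"""
            run_glue.py
            --model_name_or_path distilbert-base-uncased
            --task_name mrpc
            --do_train
            --do_eval
            --max_train_samples 16
            --max_val_samples 16
            --output_dir {output_dir}
            --overwrite_output_dir
            """.split()

        with patch.object(sys, "argv", testargs):
            run_glue.main()

        # The scripts log the number of samples actually used; asserting on
        # those values would catch regressions in the new arguments.
        with open(os.path.join(output_dir, "all_results.json")) as f:
            results = json.load(f)
        self.assertEqual(results["train_samples"], 16)
        self.assertEqual(results["eval_samples"], 16)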

@bhadreshpsavani (Contributor, Author) commented:

Hi @stas00,

What should I do if I get this error while using git?

$ git push origin argument-addition
To https://github.com/bhadreshpsavani/transformers.git
 ! [rejected]            argument-addition -> argument-addition (non-fast-forward)
error: failed to push some refs to 'https://github.com/bhadreshpsavani/transformers.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

@bhadreshpsavani (Contributor, Author) commented:

I found that I needed to use git push -f origin argument-addition, with the force flag.
Thanks, @stas00, I used your rebase script. It's cool! I did it for the first time!

@stas00 (Contributor) commented on Mar 6, 2021:

Some CI tests unrelated to your work were failing, so I rebased your PR branch on master and then they passed. You may not have noticed that.

So you needed to do git pull before continuing to push. If you have already made some changes and git pull doesn't work because an update was made to files that you have modified locally, you normally do:

git stash
git pull
git stash pop

and deal with merge conflicts if any emerge.

In general, force-pushing should be reserved for when a bad mistake was made and you need to undo some damage.

Your force-push undid the changes I had pushed, but since you then rebased, the end result is the same as what I did, so no damage was done in this situation.

But please be careful in the future and first understand why you think you need to force-push.

@bhadreshpsavani (Contributor, Author) commented:

Okay @stas00,
I will be careful when using force push, and I will use stash instead.
Now I understand.

@stas00 added the Examples label on Mar 7, 2021
@sgugger (Collaborator) left a review:

Thanks for addressing my comments! I have a few more and then it should be ready to be merged :-)

Comment on lines 368 to 373
train_dataset = tokenized_datasets["train"].map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
sgugger:

I think this map can be done as before (deleted lines 349 to 354 in the diff) since it's the same for training and validation.

bhadreshpsavani (author):

Hi @sgugger,

So we should do it like below:

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)

and we simply select samples for train and validation?

sgugger:

Yes. It avoids duplicating the same code this way.
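Roughly, the resulting structure in the language modeling scripts would then be (a sketch using the names from the snippets above, assuming the surrounding script variables):

# One shared preprocessing pass for all splits...
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)

# ...then each split simply selects the requested number of samples.
if training_args.do_train:
    train_dataset = lm_datasets["train"]
    if data_args.max_train_samples is not None:
        train_dataset = train_dataset.select(range(data_args.max_train_samples))

if training_args.do_eval:
    eval_dataset = lm_datasets["validation"]
    if data_args.max_val_samples is not None:
        eval_dataset = eval_dataset.select(range(data_args.max_val_samples))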

bhadreshpsavani (author):

I did this for almost all the examples; I thought preprocessing would be done only if it was required.
Shall I make these changes for all the examples or only the ones mentioned here?

sgugger (Mar 8, 2021):

For the other examples, you are doing the select before the map (to avoid preprocessing the whole dataset), so it's not possible to group all the preprocessing together. I think this only applies to the three scripts in language_modeling.

Comment on lines 378 to 383
train_dataset = tokenized_datasets["train"].map(
    group_texts,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
)
sgugger:

Same here.

Comment on lines 386 to 387
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))
sgugger:

We should still do this before the map: the map adds samples but does not reduce their number. So in this example, we can speed up preprocessing by doing train_dataset = train_dataset.select(range(data_args.max_train_samples)) before the map, so that we preprocess at most max_train_samples examples, and then doing it once more after the map to make sure we have the right number of examples.

Comment on lines 441 to 442
if data_args.max_val_samples is not None:
    eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
sgugger:

Same comment here for the validation set.

Comment on lines 399 to 401
if data_args.max_train_samples is not None:
    train_dataset = train_dataset.select(range(data_args.max_train_samples))

sgugger:

Same comment as in run_qa

@bhadreshpsavani (Contributor, Author) commented:

Hello @stas00 and @sgugger,
I have made the suggested changes. Please let me know if any other changes are required.
Thanks!

@sgugger (Collaborator) commented on Mar 8, 2021:

@LysandreJik I think this is ready for final review and merge if you're happy with it.

@sgugger requested a review from LysandreJik on Mar 8, 2021 at 18:25
@LysandreJik (Member) left a review:

Great, LGTM!

@LysandreJik merged commit dfd16af into huggingface:master on Mar 8, 2021
@bhadreshpsavani deleted the argument-addition branch on Mar 8, 2021 at 21:58
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request on Jul 15, 2021:
* reverted changes of logging and saving metrics

* added max_sample arguments

* fixed code

* white space diff

* reformetting code

* reformatted code
Labels: Examples (Which is related to examples in general)

Successfully merging this pull request may close these issues:
[Trainer] add --max_train_samples --max_val_samples --max_test_samples

4 participants