Switch from using `sum` for flattening lists of lists in `group_texts` #14472
Conversation
Thanks for investigating the performance of that line! I initially used the `sum(examples, [])` idiom. Wdyt @LysandreJik?

I was actually confused at first because I didn't know that `sum` could flatten a list of lists. Alternatively, would `itertools.chain` work here?

I think `chain` would be the clearer choice.

Ok I'll make the changes. Do you know why `make style` moves `import functools` to its own line?
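(For readers who share that confusion: `sum(iterable, start)` just folds `+` over the iterable, and `+` on lists is concatenation, so an empty-list start value turns `sum` into a flattener. A tiny illustration with made-up data, not code from the PR:)

```python
# sum(nested, []) expands to [] + [1, 2] + [3] + [4, 5]: each + allocates a
# brand-new list holding everything so far, hence quadratic time overall.
nested = [[1, 2], [3], [4, 5]]
assert sum(nested, []) == [] + [1, 2] + [3] + [4, 5] == [1, 2, 3, 4, 5]
```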
Commits 3d709f6 to 7f2b01d
This looks good to me! `chain(*examples)` is indeed way clearer than `functools.reduce(operator.iconcat, list_of_lists, [])` and `sum(examples, [])`.
Thanks for looking into it, @nbroad1881!
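(For context, the three idioms compared in this review produce identical results; a minimal sketch with a hypothetical batch, not code from the PR itself:)

```python
from functools import reduce
from itertools import chain
from operator import iconcat

# Hypothetical batch: one column's worth of token lists.
examples = [[1, 2], [3, 4, 5], [6]]

flat_sum = sum(examples, [])                 # concise, but O(n^2): each + copies everything so far
flat_reduce = reduce(iconcat, examples, [])  # in-place += on the accumulator, linear
flat_chain = list(chain(*examples))          # lazy iterator over all elements, linear

assert flat_sum == flat_reduce == flat_chain == [1, 2, 3, 4, 5, 6]
```

(`chain.from_iterable(examples)` is equivalent to `chain(*examples)` and avoids unpacking the outer list eagerly.)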
LGTM as well, thanks for amending your PR! Let's just remove all those blank new lines before merging.
I did a couple more tests in this notebook: https://colab.research.google.com/drive/1Kxj_JbM9HMLFpjUduy6i3tfqDob_pYIp

Edit: This actually didn't work. Let me try to fix it. This works:

Edit: it is the same with/without `partial`.

Here is a summary of the methods:

[summary table of the benchmarked flattening methods]
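(The notebook's numbers aren't reproduced above; a self-contained micro-benchmark in the same spirit, with made-up list sizes, might look like:)

```python
import timeit

setup = """
from functools import reduce
from itertools import chain
from operator import iconcat
list_of_lists = [list(range(10)) for _ in range(1000)]
"""

for label, stmt in [
    ("option 1: sum(list_of_lists, [])", "sum(list_of_lists, [])"),
    ("option 2: reduce(iconcat, ...)", "reduce(iconcat, list_of_lists, [])"),
    ("option 3: list(chain(*...))", "list(chain(*list_of_lists))"),
]:
    # sum is quadratic in the total number of elements; the other two are linear.
    print(label, timeit.timeit(stmt, setup=setup, number=100))
```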
I think that option 3 (`list(chain(*examples))`) is the best one. Thanks a lot for benchmarking all options!
per sgugger's suggestions
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Speed up list flattening in `group_texts` by changing `sum(list_of_lists, [])` to `functools.reduce(operator.iconcat, list_of_lists, [])`

I changed all list flattening from `sum(list_of_lists, [])` to `functools.reduce(operator.iconcat, list_of_lists, [])`. Here is a Stack Overflow thread about which method is fastest: https://stackoverflow.com/a/45323085
Here is a colab notebook with a quick comparison between the old way and the new way, plus a couple of timed examples. The new way is about 5-6x faster: https://colab.research.google.com/drive/1Kxj_JbM9HMLFpjUduy6i3tfqDob_pYIp?usp=sharing
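(For reference, a simplified sketch of `group_texts` as it ends up in the run_clm example after this discussion; the `block_size` value and the `labels` column are assumptions carried over from that script, not part of this PR's diff:)

```python
from itertools import chain

block_size = 1024  # assumption: the real scripts derive this from the tokenizer/model


def group_texts(examples):
    # `examples` is a batch from Dataset.map(batched=True): a dict mapping
    # column names (e.g. "input_ids") to lists of token lists.
    # Old: {k: sum(examples[k], []) for k in examples.keys()}
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk is exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split the concatenated stream into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```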
I discovered this while trying to use `group_texts` on many GB of data, and the speedup was greatly appreciated.

Nearly all of these changes are in the `run_mlm` or `run_clm` examples, but there are a couple in `run_swag` and another in `file_utils.py`, which might be unnecessary.

I don't know why `make style` is moving `import functools` to its own line above the other imports in examples/flax/language-modeling/run_t5_mlm_flax.py and examples/tensorflow/language-modeling/run_clm.py.

Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I think @sgugger wrote the original `group_texts`.