
[s2s] wmt download script use less ram #6405

Merged
merged 3 commits into huggingface:master from convert-wmt on Aug 11, 2020
Conversation

stas00 (Contributor) commented Aug 11, 2020

A few enhancements to #6403.

the main change:

  • rewrote the script so it doesn't load ~100GB into RAM - wmt19 is huge! (a sketch of the streaming approach follows this list)

and then a few small things:

  • changed the default dataset to wmt16, as it's much smaller than wmt19 for experimenting (also, wmt19-ru-en seems to be missing a test dataset, while wmt16-ru-en has one)
  • added language-pair defaults so it's easy to start experimenting
  • moved tqdm to where detailed progress can be seen
  • added some extra notes
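
For readers who want to see what "don't load everything into RAM" looks like in practice, here is a minimal sketch of the streaming approach. It is not the PR's exact code: the `datasets` library (successor of the `nlp` package used at the time), the `save_split` helper name, and the output layout are assumptions for illustration.

```python
# Hypothetical sketch: stream a WMT split to disk record by record instead of
# materializing the whole split in memory. Names and layout are illustrative.
from pathlib import Path

from datasets import load_dataset  # assumption: modern `datasets`; the PR era used the `nlp` package
from tqdm import tqdm


def save_split(dataset_name, pair, split, out_dir, src_lang, tgt_lang):
    """Write one split to <split>.source / <split>.target, one record at a time."""
    ds = load_dataset(dataset_name, pair, split=split)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{split}.source", "w", encoding="utf-8") as src_f, \
         open(out_dir / f"{split}.target", "w", encoding="utf-8") as tgt_f:
        # tqdm wraps the record loop, so detailed progress is visible per example
        for ex in tqdm(ds, desc=f"{dataset_name}/{pair} {split}"):
            src_f.write(ex["translation"][src_lang] + "\n")
            tgt_f.write(ex["translation"][tgt_lang] + "\n")


if __name__ == "__main__":
    # e.g. the smaller default pair mentioned above
    save_split("wmt16", "ru-en", "train", "wmt16-ru-en", "ru", "en")
```

With this pattern, peak memory stays roughly constant regardless of dataset size, since only one example is held in memory at a time.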

@sshleifer

- rewrite not to load 100GB into RAM - wmt19 is huge! instead read/write in chunks
- replaced the default dataset to wmt16 as it's much much smaller than wmt19 to experiment with
- added lang defaults so it's easy to start experimenting with
codecov bot commented Aug 11, 2020

Codecov Report

Merging #6405 into master will decrease coverage by 2.10%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6405      +/-   ##
==========================================
- Coverage   80.10%   78.00%   -2.11%     
==========================================
  Files         149      149              
  Lines       27680    27680              
==========================================
- Hits        22173    21591     -582     
- Misses       5507     6089     +582     
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/optimization.py | 28.94% <0.00%> (-67.11%) ⬇️ |
| src/transformers/pipelines.py | 26.98% <0.00%> (-52.81%) ⬇️ |
| src/transformers/optimization_tf.py | 33.33% <0.00%> (-24.33%) ⬇️ |
| src/transformers/modeling_tf_auto.py | 48.79% <0.00%> (-18.08%) ⬇️ |
| src/transformers/modeling_auto.py | 63.95% <0.00%> (-14.54%) ⬇️ |
| src/transformers/data/processors/squad.py | 13.76% <0.00%> (-14.38%) ⬇️ |
| src/transformers/modeling_tf_gpt2.py | 65.68% <0.00%> (-6.16%) ⬇️ |
| src/transformers/modelcard.py | 82.71% <0.00%> (-2.47%) ⬇️ |
| src/transformers/modeling_distilbert.py | 96.19% <0.00%> (-1.64%) ⬇️ |
| src/transformers/configuration_utils.py | 95.20% <0.00%> (-1.37%) ⬇️ |
| ... and 10 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9ecd92...de72b0a. Read the comment docs.

stas00 (Contributor, Author) commented Aug 11, 2020

Further testing showed that chunking doesn't make much of a difference: writing one record at a time is almost as fast as writing in chunks of 10K records. I think the reading is the bottleneck here, and we can't optimize that, so I removed the chunking from this PR.

wmt19-ru-en, with 37M records, converted in ~40 minutes on my machine using this PR.
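
For context, here is a rough sketch of the buffered ("chunked") variant that was tried and then dropped; the function name and the 10K chunk size are illustrative, not the PR's exact code. Because the runtime is dominated by reading/decoding the dataset, flushing every N records buys almost nothing over writing each record as it arrives.

```python
# Hypothetical sketch of the dropped chunked-write variant: buffer N lines,
# then flush them together. In practice this was barely faster than writing
# each record immediately, since reading dominates the runtime.
def save_split_chunked(ds, src_path, tgt_path, src_lang, tgt_lang, chunk_size=10_000):
    src_buf, tgt_buf = [], []
    with open(src_path, "w", encoding="utf-8") as src_f, \
         open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for ex in ds:
            src_buf.append(ex["translation"][src_lang] + "\n")
            tgt_buf.append(ex["translation"][tgt_lang] + "\n")
            if len(src_buf) >= chunk_size:
                src_f.writelines(src_buf)
                tgt_f.writelines(tgt_buf)
                src_buf, tgt_buf = [], []
        # flush whatever is left in the buffers
        src_f.writelines(src_buf)
        tgt_f.writelines(tgt_buf)
```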

@sshleifer sshleifer changed the title from "don't read the whole dataset into memory" to "[s2s] wmt download script use less ram" on Aug 11, 2020
@sshleifer sshleifer merged commit f6cb0f8 into huggingface:master Aug 11, 2020
@stas00 stas00 deleted the convert-wmt branch August 11, 2020 16:36
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020