
[s2s] wmt download script use less ram #6405

Merged
merged 3 commits into huggingface:master from convert-wmt on Aug 11, 2020
Conversation

stas00 (Contributor) commented Aug 11, 2020

A few enhancements to #6403.

the main change:

  • rewrote the script so it doesn't load ~100GB into RAM - wmt19 is huge! (a sketch of the streaming approach follows this list)

and then a few small things:

  • changed the default dataset to wmt16, as it's much smaller than wmt19 for experimenting (also, wmt19-ru-en seems to be missing a test dataset, while wmt16-ru-en has one)
  • added language-pair defaults so it's easy to start experimenting
  • moved tqdm to where detailed progress can be seen
  • added some extra notes
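
For readers who want to see what "don't load everything into RAM" looks like in practice, here is a minimal sketch of the streaming approach. It is not the PR's exact code: the `datasets` library (successor of the `nlp` package used at the time), the `save_split` helper name, and the output layout are assumptions for illustration.

```python
# Hypothetical sketch: stream a WMT split to disk record by record instead of
# materializing the whole split in memory. Names and layout are illustrative.
from pathlib import Path

from datasets import load_dataset  # assumption: modern `datasets`; the PR era used the `nlp` package
from tqdm import tqdm


def save_split(dataset_name, pair, split, out_dir, src_lang, tgt_lang):
    """Write one split to <split>.source / <split>.target, one record at a time."""
    ds = load_dataset(dataset_name, pair, split=split)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{split}.source", "w", encoding="utf-8") as src_f, \
         open(out_dir / f"{split}.target", "w", encoding="utf-8") as tgt_f:
        # tqdm wraps the record loop, so detailed progress is visible per example
        for ex in tqdm(ds, desc=f"{dataset_name}/{pair} {split}"):
            src_f.write(ex["translation"][src_lang] + "\n")
            tgt_f.write(ex["translation"][tgt_lang] + "\n")


if __name__ == "__main__":
    # e.g. the smaller default pair mentioned above
    save_split("wmt16", "ru-en", "train", "wmt16-ru-en", "ru", "en")
```

With this pattern, peak memory stays roughly constant regardless of dataset size, since only one example is held in memory at a time.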

@sshleifer

- rewrite not to load 100GB into RAM - wmt19 is huge! instead read/write in chunks
- replaced the default dataset to wmt16 as it's much much smaller than wmt19 to experiment with
- added lang defaults so it's easy to start experimenting with
codecov bot commented Aug 11, 2020

Codecov Report

Merging #6405 into master will decrease coverage by 2.10%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6405      +/-   ##
==========================================
- Coverage   80.10%   78.00%   -2.11%     
==========================================
  Files         149      149              
  Lines       27680    27680              
==========================================
- Hits        22173    21591     -582     
- Misses       5507     6089     +582     
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/optimization.py | 28.94% <0.00%> (-67.11%) ⬇️ |
| src/transformers/pipelines.py | 26.98% <0.00%> (-52.81%) ⬇️ |
| src/transformers/optimization_tf.py | 33.33% <0.00%> (-24.33%) ⬇️ |
| src/transformers/modeling_tf_auto.py | 48.79% <0.00%> (-18.08%) ⬇️ |
| src/transformers/modeling_auto.py | 63.95% <0.00%> (-14.54%) ⬇️ |
| src/transformers/data/processors/squad.py | 13.76% <0.00%> (-14.38%) ⬇️ |
| src/transformers/modeling_tf_gpt2.py | 65.68% <0.00%> (-6.16%) ⬇️ |
| src/transformers/modelcard.py | 82.71% <0.00%> (-2.47%) ⬇️ |
| src/transformers/modeling_distilbert.py | 96.19% <0.00%> (-1.64%) ⬇️ |
| src/transformers/configuration_utils.py | 95.20% <0.00%> (-1.37%) ⬇️ |
| ... and 10 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9ecd92...de72b0a. Read the comment docs.

stas00 (Contributor, Author) commented Aug 11, 2020

Further testing showed that chunking doesn't make much of a difference: writing one record at a time is almost as fast as writing in chunks of 10K records. I think the reading is the bottleneck here, and we can't optimize that, so I removed the chunking from this PR.

wmt19-ru-en, with 37M records, converted in ~40 minutes on my machine using this PR.
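
For context, here is a rough sketch of the buffered ("chunked") variant that was tried and then dropped; the function name and the 10K chunk size are illustrative, not the PR's exact code. Because the runtime is dominated by reading/decoding the dataset, flushing every N records buys almost nothing over writing each record as it arrives.

```python
# Hypothetical sketch of the dropped chunked-write variant: buffer N lines,
# then flush them together. In practice this was barely faster than writing
# each record immediately, since reading dominates the runtime.
def save_split_chunked(ds, src_path, tgt_path, src_lang, tgt_lang, chunk_size=10_000):
    src_buf, tgt_buf = [], []
    with open(src_path, "w", encoding="utf-8") as src_f, \
         open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for ex in ds:
            src_buf.append(ex["translation"][src_lang] + "\n")
            tgt_buf.append(ex["translation"][tgt_lang] + "\n")
            if len(src_buf) >= chunk_size:
                src_f.writelines(src_buf)
                tgt_f.writelines(tgt_buf)
                src_buf, tgt_buf = [], []
        # flush whatever is left in the buffers
        src_f.writelines(src_buf)
        tgt_f.writelines(tgt_buf)
```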

@sshleifer sshleifer changed the title from "don't read the whole dataset into memory" to "[s2s] wmt download script use less ram" on Aug 11, 2020
@sshleifer sshleifer merged commit f6cb0f8 into huggingface:master Aug 11, 2020
@stas00 stas00 deleted the convert-wmt branch August 11, 2020 16:36
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020