Not able to load the dataset #54

Closed
nvnvashisth opened this issue May 19, 2019 · 7 comments

nvnvashisth commented May 19, 2019

I have been trying to train the 117M model on a 1.03 GB dataset, on a machine with 64 GB of RAM. While it loads the dataset, it stays stuck there, and after some 30 minutes it just terminates. Here is the log.

Fetching checkpoint: 1.00kit [00:00, 679kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 16.5Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 573kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001:  11%|#8               | 53.6M/498M [00:00<00:07, 62.2Mit/s]
Fetching model.ckpt.data-00000-of-00001:  28%|#####3             | 141M/498M [00:01<00:03, 105Mit/s]
Fetching model.ckpt.data-00000-of-00001:  46%|########7          | 230M/498M [00:02<00:02, 108Mit/s]
Fetching model.ckpt.data-00000-of-00001:  63%|###########4      | 316M/498M [00:03<00:02, 66.6Mit/s]
Fetching model.ckpt.data-00000-of-00001:  77%|#############8    | 384M/498M [00:04<00:01, 58.8Mit/s]
Fetching model.ckpt.data-00000-of-00001:  92%|################6 | 460M/498M [00:06<00:00, 44.8Mit/s]
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 72.4Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 3.39Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 9.86Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 9.54Mit/s]
2019-05-19 16:12:23.408514: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

  0%|          | 0/1 [00:00<?, ?it/s]

I also saw another issue which suggests cutting the text file down. What would be the ideal size to train on? Or, failing that, which model size would work with a 1 GB text file?

Help will be appreciated 👍
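
(For reference, this is essentially the standard gpt-2-simple finetune flow; the dataset filename and step count below are placeholders rather than my exact values.)

```python
import gpt_2_simple as gpt2

# Download the 117M checkpoint (the ~500 MB fetch shown in the log above).
gpt2.download_gpt2(model_name="117M")

# Finetune on the raw text file; the hang happens while the dataset is
# being loaded, before any training steps run.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "dataset.txt", model_name="117M", steps=1000)
```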

WAUthethird commented May 19, 2019

Your dataset should be the same size as the model or smaller (for 117M, the checkpoint is roughly 500 MB, as in the fetch log above). It won't work correctly if you train on any more than that, as per #53 (comment).

And a question: You're saying that it stays at the bottom line and doesn't do anything?

nvnvashisth (Author) commented May 19, 2019

Thanks.
Yes, it stays there and doesn't do anything after that.

bob80333 commented May 19, 2019

As I mention in #19, the text has to be encoded into tokens before it can be used for training. This process is extremely RAM-intensive with larger files.

I would suggest creating a directory of individual files if you can; the encoder can handle a path, and it will encode each file separately and then concatenate the encoded results, significantly lowering RAM and time usage (a rough splitting sketch is below).

If you will be training more than once (or trying different things), you may want to save time by pre-encoding the files on your own computer with encoder.py in nshepperd's fork of gpt-2.

I was able to successfully pre-encode ~544MB of files into a single ~170MB file of tokens that can be loaded just the same as your previous text file.
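
A rough sketch of the kind of split I mean, assuming everything currently lives in one big dataset.txt; the chunk size and file names are arbitrary, and a real split would ideally break on document boundaries rather than mid-text:

```python
import os

def split_corpus(src_path, out_dir, chunk_bytes=50 * 1024 * 1024):
    """Split one large text file into ~50 MB pieces so the encoder can
    process them one at a time instead of holding everything in RAM."""
    os.makedirs(out_dir, exist_ok=True)
    part = 0
    with open(src_path, encoding="utf-8", errors="ignore") as src:
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            out_name = os.path.join(out_dir, "part_{:04d}.txt".format(part))
            with open(out_name, "w", encoding="utf-8") as out:
                out.write(chunk)
            part += 1

split_corpus("dataset.txt", "dataset_chunks")
# Point the trainer at the "dataset_chunks" directory instead of the single file.
```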

@minimaxir (Owner)

I'll add an encode function in 0.5

@minimaxir (Owner)

Added as encode_dataset() in 0.5
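
A minimal usage sketch (keyword argument names here may differ slightly from the final 0.5 API; see the README for the exact signature):

```python
import gpt_2_simple as gpt2

# One-time pre-encoding: turns the raw text into a compressed .npz of tokens.
gpt2.encode_dataset("dataset.txt", out_path="dataset_encoded.npz",
                    model_name="117M")

# The encoded file can then be passed to finetune() in place of the .txt,
# skipping the RAM-heavy encoding step on every run.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "dataset_encoded.npz", model_name="117M", steps=1000)
```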

@ziqizhang

Can I ask a question on this again... when it is said 'your dataset should be the same size as the model or less', what exactly does this mean?

My dataset is massive, 11GB. I can sample it and use encode_dataset() to compress it, but I would really appreciate some guidance on, roughly, what would be the maximum dataset size (after compression) to use with the 117M model?

Many thanks in advance

@omarmagdy217

> Can I ask a question on this again... when it is said 'your dataset should be the same size as the model or less', what exactly does this mean?
>
> My dataset is massive, 11GB. I can sample it and use encode_dataset() to compress it, but I would really appreciate some guidance on, roughly, what would be the maximum dataset size (after compression) to use with the 117M model?
>
> Many thanks in advance

I have the same problem with a dataset of 6.6 GB. Have you reached a solution for training on such a massive dataset, @ziqizhang?
