Not able to load the dataset #54

Closed
nvnvashisth opened this issue May 19, 2019 · 7 comments

nvnvashisth commented May 19, 2019

I have been trying to train the 117M model on a 1.03 GB dataset, on a machine with 64 GB of RAM. While it loads the dataset, it stays stuck there, and after some 30 minutes it just terminates. Here is the log.

Fetching checkpoint: 1.00kit [00:00, 679kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 16.5Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 573kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001:  11%|#8               | 53.6M/498M [00:00<00:07, 62.2Mit/s]
Fetching model.ckpt.data-00000-of-00001:  28%|#####3             | 141M/498M [00:01<00:03, 105Mit/s]
Fetching model.ckpt.data-00000-of-00001:  46%|########7          | 230M/498M [00:02<00:02, 108Mit/s]
Fetching model.ckpt.data-00000-of-00001:  63%|###########4      | 316M/498M [00:03<00:02, 66.6Mit/s]
Fetching model.ckpt.data-00000-of-00001:  77%|#############8    | 384M/498M [00:04<00:01, 58.8Mit/s]
Fetching model.ckpt.data-00000-of-00001:  92%|################6 | 460M/498M [00:06<00:00, 44.8Mit/s]
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 72.4Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 3.39Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 9.86Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 9.54Mit/s]
2019-05-19 16:12:23.408514: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

  0%|          | 0/1 [00:00<?, ?it/s]

I also saw another issue which suggests cutting the text file down. What would be the ideal size to train on? Or, failing that, which model size would work with a 1 GB text file?

Help will be appreciated 👍
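
(For reference, this is essentially the standard gpt-2-simple finetune flow; the dataset filename and step count below are placeholders rather than my exact values.)

```python
import gpt_2_simple as gpt2

# Download the 117M checkpoint (the ~500 MB fetch shown in the log above).
gpt2.download_gpt2(model_name="117M")

# Finetune on the raw text file; the hang happens while the dataset is
# being loaded, before any training steps run.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "dataset.txt", model_name="117M", steps=1000)
```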

WAUthethird commented May 19, 2019

Your dataset should be the same size as the model or smaller (for 117M, the checkpoint is roughly 500 MB, as in the fetch log above). It won't work correctly if you train on any more than that, as per #53 (comment).

And a question: You're saying that it stays at the bottom line and doesn't do anything?

nvnvashisth (Author) commented May 19, 2019

Thanks.
Yes, it stays there and doesn't do anything after that.

bob80333 commented May 19, 2019

As I mention in #19, the text has to be encoded into tokens before it can be used for training. This process is extremely RAM-intensive with larger files.

I would suggest creating a directory of individual files if you can; the encoder can handle a path, and it will encode each file separately and then concatenate the encoded results, significantly lowering RAM and time usage (a rough splitting sketch is below).

If you will be training more than once (or trying different things), you may want to save time by pre-encoding the files on your own computer with encoder.py in nshepperd's fork of gpt-2.

I was able to successfully pre-encode ~544MB of files into a single ~170MB file of tokens that can be loaded just the same as your previous text file.
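
A rough sketch of the kind of split I mean, assuming everything currently lives in one big dataset.txt; the chunk size and file names are arbitrary, and a real split would ideally break on document boundaries rather than mid-text:

```python
import os

def split_corpus(src_path, out_dir, chunk_bytes=50 * 1024 * 1024):
    """Split one large text file into ~50 MB pieces so the encoder can
    process them one at a time instead of holding everything in RAM."""
    os.makedirs(out_dir, exist_ok=True)
    part = 0
    with open(src_path, encoding="utf-8", errors="ignore") as src:
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            out_name = os.path.join(out_dir, "part_{:04d}.txt".format(part))
            with open(out_name, "w", encoding="utf-8") as out:
                out.write(chunk)
            part += 1

split_corpus("dataset.txt", "dataset_chunks")
# Point the trainer at the "dataset_chunks" directory instead of the single file.
```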

@minimaxir (Owner)

I'll add an encode function in 0.5

@minimaxir (Owner)

Added as encode_dataset() in 0.5
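
A minimal usage sketch (keyword argument names here may differ slightly from the final 0.5 API; see the README for the exact signature):

```python
import gpt_2_simple as gpt2

# One-time pre-encoding: turns the raw text into a compressed .npz of tokens.
gpt2.encode_dataset("dataset.txt", out_path="dataset_encoded.npz",
                    model_name="117M")

# The encoded file can then be passed to finetune() in place of the .txt,
# skipping the RAM-heavy encoding step on every run.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "dataset_encoded.npz", model_name="117M", steps=1000)
```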

@ziqizhang

Can I ask a question on this again... when it is said 'your dataset should be the same size as the model or less', what exactly does this mean?

My dataset is massive, 11GB. I can sample it and use encode_dataset() to compress it, but I would really appreciate some guidance on, roughly, what would be the maximum dataset size (after compression) to use with the 117M model?

Many thanks in advance

@omarmagdy217

> Can I ask a question on this again... when it is said 'your dataset should be the same size as the model or less', what exactly does this mean?
>
> My dataset is massive, 11GB. I can sample it and use encode_dataset() to compress it, but I would really appreciate some guidance on, roughly, what would be the maximum dataset size (after compression) to use with the 117M model?
>
> Many thanks in advance

I have the same problem with a dataset of 6.6 GB. Have you reached a solution for training on such a massive dataset, @ziqizhang?
