Not able to load the dataset #54
Your dataset should be the same size as the model or less; training won't work correctly with anything larger, as per #53 (comment). And a question: are you saying that it stays at the last line of output and doesn't do anything?
Thanks.
As I mention in #19, it has to encode your text into tokens before it can use it for training. This process is extremely RAM-intensive with larger files. I would suggest creating a directory of individual files if you can; the encoder can handle a path, and it will encode each file separately and then concatenate the encoded files, significantly lowering RAM and time usage. If you will be training more than once (or trying different things), you may want to save time by pre-encoding the files on your own computer. I was able to successfully pre-encode ~544 MB of files into a single ~170 MB file of tokens that can be loaded just the same as your previous text file.
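For illustration, here is a minimal sketch of that pre-encoding approach. It assumes the GPT-2 reference encoder (`src/encoder.py` with a `get_encoder` helper); the import path, the `get_encoder` signature, the `corpus/*.txt` glob, and the output file name are placeholders, not exact commands from this repo:

```python
# Hedged sketch: pre-encode a directory of text files into one .npz of tokens.
# Assumes the repo's BPE encoder exposes encode(); the exact import and
# get_encoder signature follow the GPT-2 reference code and may differ
# in your checkout.
import glob
import numpy as np
import encoder  # from the GPT-2 source tree (src/encoder.py)

enc = encoder.get_encoder("117M")  # load the BPE vocab for the 117M model

chunks = []
for path in sorted(glob.glob("corpus/*.txt")):
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Encoding file-by-file keeps peak RAM proportional to the largest
    # single file rather than the whole corpus.
    chunks.append(np.asarray(enc.encode(text), dtype=np.int32))

tokens = np.concatenate(chunks)  # assumes at least one input file matched
np.savez_compressed("corpus_encoded.npz", tokens)
print(f"{len(tokens)} tokens written")
```

The resulting .npz holds a single int32 array of token IDs, which matches the comment above: it can then be loaded in place of the raw text file without re-encoding on every run.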
I'll add a built-in way to do this pre-encoding.
Added.
Can I ask a question on this again... when it is said that "your dataset should be the same size as the model or less", what exactly does this mean? My dataset is massive, 11 GB. I can sample it down and use a smaller subset. Many thanks in advance.
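As a hedged illustration of that sampling idea (the file names and the 10% keep rate below are made up for the example, not taken from this thread), a streaming down-sampler needs only constant memory even for an 11 GB corpus:

```python
# Hedged sketch: down-sample a huge text corpus without loading it into RAM.
import random

keep_rate = 0.1  # keep roughly 10% of lines -> ~1.1 GB from an 11 GB corpus
random.seed(0)   # fixed seed so the sample is reproducible

with open("big_corpus.txt", encoding="utf-8") as src, \
     open("sampled_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:  # streams line by line; constant memory use
        if random.random() < keep_rate:
            dst.write(line)
```

Line-level sampling only keeps documents intact if each line is a document; for multi-line documents you would sample at file or paragraph granularity instead.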
I have the same problem with a dataset of 6.6 GB. Have you reached a solution for training on such a massive dataset? @ziqizhang
I have been trying to train the 117M model with a dataset of size 1.03 GB on a machine with 64 GB of RAM. While loading the dataset, it gets stuck, and after some 30 minutes it just terminates. Here is the log.
I also saw another issue that suggests cutting the text file down. What would the ideal dataset size be for training? Alternatively, what model size could go with a 1 GB text file?
Help will be appreciated 👍
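On the size question: one possible reading of the "same size as the model or less" advice (an assumption on my part; the thread never defines it) is token count versus parameter count, which is easy to estimate from the file size:

```python
# Hedged back-of-the-envelope estimate, assuming the common ~4 bytes per
# BPE token heuristic for English text. "dataset.txt" is a placeholder name.
import os

def approx_token_count(path: str, bytes_per_token: float = 4.0) -> float:
    """Rough token estimate from file size; exact counts require encoding."""
    return os.path.getsize(path) / bytes_per_token

n = approx_token_count("dataset.txt")
# Under this heuristic a 1.03 GB file is roughly 260M tokens, i.e. more than
# twice the 117M parameters of the smallest GPT-2 model on this reading.
print(f"~{n / 1e6:.0f}M tokens (estimated)")
```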