dataset description #7
Comments
Just added some info to the dataset card: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K
Both dataset cards specify --dataset_size=100m. However, my calculation shows that the 256K dataset contains 1B tokens and the 1M dataset contains 5B tokens.
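For anyone who wants to repeat this kind of check, here is a minimal sketch of the arithmetic. It assumes each row of the dataset is one fixed-length chunk and that "256K" and "1M" mean 256 × 1024 and 1024 × 1024 tokens; neither assumption is confirmed in this thread. The row counts below are hypothetical placeholders, not the actual dataset sizes, so substitute the real row counts from the dataset viewer.

```python
def total_tokens(num_rows: int, chunk_len: int) -> int:
    """Total token count, assuming every row holds exactly chunk_len tokens."""
    return num_rows * chunk_len


def ratio_to_declared(total: int, declared: int) -> float:
    """How many times larger the counted total is than the declared size."""
    return total / declared


# Assumed chunk lengths (an interpretation of the dataset names).
CHUNK_256K = 256 * 1024
CHUNK_1M = 1024 * 1024

# The value passed as --dataset_size=100m, read here as 100 million tokens.
DECLARED = 100_000_000

# Hypothetical row counts for illustration only.
rows_256k = 4000
rows_1m = 5000

tokens_256k = total_tokens(rows_256k, CHUNK_256K)
tokens_1m = total_tokens(rows_1m, CHUNK_1M)

print(f"256K dataset: {tokens_256k:,} tokens, {ratio_to_declared(tokens_256k, DECLARED):.1f}x declared")
print(f"1M dataset:   {tokens_1m:,} tokens, {ratio_to_declared(tokens_1m, DECLARED):.1f}x declared")
```

If the rows are not all the same length, summing the length of each row's token list would be the safer check.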
@Bostoncake Yes, you are correct. I will update the dataset card. Sorry for the typo.
Great work! Would it be possible to add some descriptions to clarify how the training dataset is generated? For example, the two datasets used in the script: PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K and PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M. Thanks!