dataset description #7

Closed
sunying2018 opened this issue Apr 8, 2024 · 3 comments

Comments

@sunying2018

Great work! Would it be possible to add some descriptions to clarify how the training dataset is generated? For example, the two datasets used in the script: PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K and PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M. Thanks!

@jzhang38
Owner

jzhang38 commented Apr 8, 2024

Just added some info to the dataset card: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K

@Bostoncake

Both dataset cards specify --dataset_size=100m. However, my calculation shows that the 256K dataset contains 1B tokens and the 1M dataset contains 5B tokens.
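
For anyone who wants to reproduce this kind of count, here is a minimal sketch of one way to do it: sum the length of the token sequence in every row. The helper name `count_tokens` and the column name `input_ids` are assumptions for illustration; the actual column name should be checked on the dataset card. The toy rows below are synthetic and only demonstrate the counting logic.

```python
from typing import Iterable, Mapping, Sequence


def count_tokens(
    rows: Iterable[Mapping[str, Sequence[int]]],
    column: str = "input_ids",  # assumed column name, not confirmed by the dataset card
) -> int:
    """Sum the lengths of the token sequences stored in `column` over all rows."""
    return sum(len(row[column]) for row in rows)


# Synthetic stand-in for a tokenized dataset: 3 rows of 4 tokens each.
toy_rows = [{"input_ids": [1, 2, 3, 4]} for _ in range(3)]
print(count_tokens(toy_rows))  # 12
```

The same function should work on any iterable of dict-like rows, so rows streamed from a Hugging Face dataset could be passed in directly, assuming the token column is named as above.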

@jzhang38
Owner

jzhang38 commented Apr 19, 2024

@Bostoncake Yes, you are correct. I will update the dataset cards. Sorry for the typo.
