dataset description #7

sunying2018 · 2024-04-08T06:43:27Z

Great work! Would it be possible to add some descriptions to clarify how the training dataset is generated? For example, the two datasets used in the script: PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K and PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M. Thanks!

jzhang38 · 2024-04-08T06:49:24Z

Just added some info to the dataset card: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K

Bostoncake · 2024-04-18T06:58:28Z

Both dataset cards specify --dataset_size=100m. However, by my calculation the 256K dataset contains 1B tokens and the 1M dataset contains 5B tokens.
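For anyone who wants to sanity-check numbers like these, here is a minimal sketch of the arithmetic. It assumes every row of the dataset is one fixed-length chunk and that "256K" means 256 * 1024 tokens; neither assumption is confirmed by the dataset card. The helper names `total_tokens` and `rows_for_budget` are hypothetical and not part of any script in the repository.

```python
def total_tokens(num_rows: int, chunk_len: int) -> int:
    """Total tokens when every row holds exactly chunk_len token ids."""
    return num_rows * chunk_len


def rows_for_budget(token_budget: int, chunk_len: int) -> int:
    """Number of full chunks of chunk_len tokens that fit in token_budget."""
    return token_budget // chunk_len


# Assumption: "256K" in the dataset name means 256 * 1024 tokens per row.
CHUNK_256K = 256 * 1024

# Rows expected if the dataset really held 100M tokens (--dataset_size=100m)
rows_100m = rows_for_budget(100_000_000, CHUNK_256K)

# Rows expected if it holds roughly 1B tokens instead
rows_1b = rows_for_budget(1_000_000_000, CHUNK_256K)

print(rows_100m, rows_1b)
```

Comparing the dataset's actual row count against these two figures would show which size the data matches; the two differ by about a factor of ten, so the check is unambiguous.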

jzhang38 · 2024-04-19T00:48:07Z

@Bostoncake Yes, you are correct. I will update the dataset card. Sorry for the typo.

puppet101 mentioned this issue Apr 10, 2024

error when finetuning yi-34b #13

Open

sunying2018 closed this as completed Jul 8, 2024
