Heuristic classification method selections #1

Closed
mataney opened this issue Feb 22, 2023 · 1 comment

Comments

@mataney

mataney commented Feb 22, 2023

Great work!
Thanks for releasing your DSIR-selected dataset through Hugging Face.

Any plans to release the dataset selected with the heuristic classification method as well?

That would be great for reproduction efforts.

@sangmichaelxie
Member

Hi, sorry for the delay. The heuristic classification dataset is at https://huggingface.co/datasets/stanford-crfm/heuristic_classification-filtered-pile-50M.

Note that I've also updated the DSIR-selected dataset to merge the validation and test sets into the train split, which reflects how it was used in the paper. The heuristic classification dataset also has only a train split. If you want to recover the previous validation and test sets, they should be the last two 50k chunks of the train split.
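For anyone who wants to recover those two chunks, here is a rough sketch of the index arithmetic. It assumes "50k" means 50,000 examples per chunk and that the two chunks sit back to back at the very end of the train split. Which of the two chunks is validation and which is test is not stated above, so the sketch just calls them `first` and `second`. The helper name `trailing_chunk_ranges` is made up for illustration.

```python
def trailing_chunk_ranges(num_rows, chunk_size=50_000, num_chunks=2):
    """Return index ranges covering the last `num_chunks` chunks of a split.

    Assumption: each chunk holds `chunk_size` examples and the chunks are
    contiguous at the end of the split.
    """
    needed = chunk_size * num_chunks
    if num_rows < needed:
        raise ValueError(
            f"split has {num_rows} rows, need at least {needed}"
        )
    start = num_rows - needed
    return [
        range(start + i * chunk_size, start + (i + 1) * chunk_size)
        for i in range(num_chunks)
    ]


# Toy example with a made-up row count, not the real dataset size.
first, second = trailing_chunk_ranges(num_rows=1_000_000)
print(first.start, first.stop)    # 900000 950000
print(second.start, second.stop)  # 950000 1000000
```

With the Hugging Face `datasets` library, the resulting ranges could be passed to `Dataset.select`, for example `train.select(first)`, where `train` is the loaded train split and `num_rows` is `len(train)`.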
