a question, thank you for your reply #15

Closed
zhouyuan888888 opened this issue May 9, 2024 · 3 comments

@zhouyuan888888

Hi, thank you for your nice work. I have a question about training: if I want to train your model on the pile-uncopyrighted dataset (i.e., just the uncopyrighted Pile), how should I prepare or preprocess the dataset?

@simran-arora
Collaborator

Hi!
The option we used was to follow the EleutherAI tokenization instructions ("prepare data") provided here: https://github.com/EleutherAI/gpt-neox
We then plugged in the file paths for the resulting train, val, and test splits here:

"train": ["/var/cr06_data/sim_data/pile/pile/pile_text_document"],

Our codebase also supports tokenization for any Hugging Face dataset without any additional effort on your part, e.g. if you want to try FineWeb, SlimPajama, etc.
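As a rough illustration of that route, here is a generic `datasets`/`transformers` sketch (not this repo's built-in pipeline; the dataset id `monology/pile-uncopyrighted` and the GPT-2 tokenizer are assumptions):

```python
# Generic Hugging Face tokenization sketch; not this repo's built-in
# pipeline. The dataset id and tokenizer choice are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the uncopyrighted-Pile mirror so nothing is downloaded up front.
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    # Pile records keep the raw document under the "text" field.
    return tokenizer(batch["text"])

tokenized = ds.map(tokenize, batched=True, remove_columns=["text"])

# Peek at the first tokenized document.
print(next(iter(tokenized))["input_ids"][:10])
```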

@zhouyuan888888
Author

Thank you for your timely reply!

@lisc110h

I am currently trying to train a model from scratch using the Pile dataset. I would like to add that it was necessary to run `make` in `based/train/datamodules/neox_utils` to compile `helpers.cpp`.
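For anyone scripting the setup, a minimal sketch of that build step (it assumes `make` is on your PATH and that the repository root is the working directory):

```python
# Minimal sketch: run make in the neox utils directory to compile the
# data helpers. Assumes make is installed and the repo root is the cwd.
import subprocess

subprocess.run(
    ["make"],
    cwd="based/train/datamodules/neox_utils",
    check=True,  # raise CalledProcessError if the build fails
)
```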
