List the "publicly available sources" 15T dataset list from Llama 3 #39

Open
bennmann opened this issue Apr 18, 2024 · 1 comment

@bennmann

Llama 3 is not reproducible in any meaningful capacity without a list of the dataset sources.

Please release a list of the sources.

@bennmann bennmann changed the title List the "publically available sources" 15T dataset list from Llama 3 List the "publicly available sources" 15T dataset list from Llama 3 Apr 18, 2024
@grothedev

Related question: why train only on publicly available data from the internet? If you want quality language and good knowledge, wouldn't you want to train on things like textbooks, historical documents, and scientific research papers, the kind of material you could get in a library? I'm talking about classic, fundamental knowledge. Training on classical philosophy would probably improve reasoning skills, and training on the OG programming textbooks would be very good for programming.
