Adds a concrete finetuning example from a custom dataset #156

Merged: 10 commits merged into main on May 18, 2023

alextrott16 (Contributor)

Adds a finetune_example directory in the train directory, which includes:

  • A toy local dataset
  • A preprocessing function
  • A runnable YAML that puts it together for fine-tuning GPT2
  • A README that explains each component
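As a rough illustration of how a local dataset and a preprocessing function fit together, here is a minimal sketch. It is not the code from this PR: the record fields (`query`, `choices`, `gold`), the function name `multiple_choice_to_prompt_response`, the `prompt`/`response` output keys, and the toy records are all assumptions made up for this example.

```python
import json
from typing import Dict, List


def multiple_choice_to_prompt_response(example: Dict) -> Dict[str, str]:
    """Turn one multiple-choice record into a prompt/response pair.

    Assumes hypothetical fields: 'query' (str), 'choices' (list of str),
    and 'gold' (index of the correct choice).
    """
    options = "\n".join(f"- {choice}" for choice in example["choices"])
    prompt = f"{example['query']}\nOptions:\n{options}\nAnswer: "
    response = example["choices"][example["gold"]]
    return {"prompt": prompt, "response": response}


def load_jsonl(lines: List[str]) -> List[Dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Toy records standing in for a local JSONL file (contents are made up).
raw_lines = [
    '{"query": "Which is a gas at room temperature?", '
    '"choices": ["iron", "oxygen", "granite"], "gold": 1}',
    '{"query": "What do plants absorb from sunlight?", '
    '"choices": ["energy", "soil"], "gold": 0}',
]

examples = [multiple_choice_to_prompt_response(r) for r in load_jsonl(raw_lines)]
```

In a setup like the one described, a YAML config would then point at the data file and at a function of this kind. The exact config keys depend on the training code and are not shown here.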

samhavens (Contributor) commented May 17, 2023

This is going to be useful! Everything looks clear and merge-ready, with one exception: arc_easy is under the CC BY-SA license, which requires attribution. We can't put the attribution in the JSONL file, since JSONL doesn't support comments, but adding it to the finetune_example README should be fine. The license requirements are:

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Where "appropriate credit" means "you must provide the name of the creator and attribution parties, a copyright notice, a license notice, a disclaimer notice, and a link to the material."

So I'd add the following to the README (correcting the split details if we actually used a different split):

This work uses data from the AI2 Reasoning Challenge (ARC) 2018, created by and copyright AI2 and Aristo. The data consists of 7,787 science exam questions and is intended for non-commercial, research purposes only. We use the Easy Dev split, consisting of 570 questions and answers, which we do not modify. The original data is available at https://allenai.org/data/arc and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

I am not sure why they chose a license that permits commercial use but then added a noncommercial disclaimer.
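The point about JSONL not supporting comments can be checked directly: each line of a JSONL file must be a complete JSON value, and JSON has no comment syntax. A small sketch using only Python's standard `json` module follows. The sample lines are made up for illustration.

```python
import json

# Two candidate lines for a JSONL file (contents are made up).
lines = [
    "# Data from ARC, licensed under CC BY-SA 4.0",  # comment-style attribution
    '{"query": "Which is a gas?", "choices": ["iron", "oxygen"], "gold": 1}',
]

results = []
for line in lines:
    try:
        json.loads(line)
        results.append("ok")
    except json.JSONDecodeError:
        # A strict JSON parser rejects anything that is not a JSON value.
        results.append("invalid")
```

A strict parser rejects the first line, which is consistent with placing the attribution in the README rather than in the data file.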

@samhavens samhavens left a comment

🤩

alextrott16 merged commit 2a77fb8 into main on May 18, 2023
6 checks passed
alextrott16 deleted the alex/local-finetune-clarity branch on May 18, 2023, 21:11
hanlint linked an issue on May 19, 2023 that may be closed by this pull request
bmosaicml pushed a commit that referenced this pull request on Jun 6, 2023:
* Add default to download script and adjust yamls

Co-authored-by: dblalock <davis@mosaicml.com>
Development

Successfully merging this pull request may close these issues.

Finetune MPT models with local dataset