Updates run_lora_clm.py with enhanced dataset support #955
Conversation
@dmsuehir Can you please add this case to CI and make sure it doesn't break anything?
@yeonsily I have added a test. Please let me know if there are any other updates needed to my PR. Thanks!
@dmsuehir Thank you! What setup did you use to test the baseline number?
I used a single card from a Gaudi2 machine in SDP cloud with the v1.15.1 Gaudi software/driver that is part of the Kubernetes cluster. I had the memory allocated at 120Gi and hugepages-2Mi at 4400Mi. The base container that I used was
@yeonsily Is there anything else needed for this PR?
Nice PR!
What does this PR do?
This PR updates the `examples/language-modeling/run_lora_clm.py` script with the flexibility to support more named datasets from the Hugging Face Hub. Previously, this example script only supported two named datasets for non-SQL prompts: `tatsu-lab/alpaca` and `timdettmers/openassistant-guanaco` (specifying other datasets would raise an error saying that the dataset is unsupported). The `tatsu-lab/alpaca` dataset has standard columns for `instruction`, `input`, and `output` that are formed into a prompt string before tokenization. `timdettmers/openassistant-guanaco` has a single `text` column.

To support more datasets, I added arguments that allow the user to specify the column names for "instruction", "input", and "output" (or, in the case of SQL prompts, "question", "context", and "answer"). If the user provides custom column names, those columns are renamed to the standard names (instruction/input/output or question/context/answer) so that the rest of the code in the script works as expected. Also, for non-SQL prompt datasets, the "input" is optional (which is why there are "prompt_with_input" and "prompt_without_input" templates). The existing code handled the case where the "input" column was blank, but could not handle the "input" column not existing, so I updated the code to also handle the "input" column being absent entirely.
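The column-mapping behavior described above can be sketched roughly as follows. This is an illustrative, pure-Python sketch of the logic, not the script's actual code: the real script operates on a Hugging Face `datasets.Dataset`, and the function and variable names here are assumptions.

```python
# Sketch: map user-specified column names onto the standard
# instruction/input/output names used by the rest of the script.
# All names here are illustrative, not taken from run_lora_clm.py.

def standardize_example(example, instruction_column, output_column,
                        input_column=None):
    """Rename a record's columns to the standard names; 'input' is optional."""
    standardized = {
        "instruction": example[instruction_column],
        "output": example[output_column],
    }
    # Handle both a missing input column and a blank input value.
    if input_column is not None and example.get(input_column):
        standardized["input"] = example[input_column]
    else:
        standardized["input"] = ""
    return standardized

# Example: a databricks/databricks-dolly-15k-style record, which uses
# "context" and "response" instead of "input" and "output".
record = {"instruction": "Summarize the text.",
          "context": "LoRA adds low-rank adapters.",
          "response": "LoRA trains small adapter matrices."}
print(standardize_example(record, "instruction", "response",
                          input_column="context"))
```

With `datasets`, the same effect would be achieved by renaming the columns once on the whole dataset rather than per record, but the fallback for an absent or blank "input" is the behavior this PR adds.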
I left the code specific to `timdettmers/openassistant-guanaco` alone, since that is a more special use case: it has only the one column and does not get preprocessed with the prompt template.

I tested this update with several different HF datasets: `databricks/databricks-dolly-15k`, `ruggsea/stanford-encyclopedia-of-philosophy_instruct`, `b-mc2/sql-create-context`, `flytech/python-codes-25k`, `gbharti/finance-alpaca`, and `medalpaca/medical_meadow_medical_flashcards`.
As an example, `databricks/databricks-dolly-15k` is a dataset that has different column names than the original code expects (it has "instruction", "context", and "response" instead of "instruction", "input", and "output"). To use `run_lora_clm.py` with this dataset, you can include the args `--input_column_name "context"` and `--output_column_name "response"`.

This can also be run without the `--input_column_name` arg, which then exercises the use case where a dataset does not have an "input" column at all.

Fixes #629
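The with-input/without-input template selection that the no-`--input_column_name` case exercises can be sketched as below. The template wording here is an assumption modeled on common Alpaca-style prompts, not copied from the script; only the fallback logic reflects what this PR describes.

```python
# Sketch: choose between the "prompt_with_input" and "prompt_without_input"
# templates. Template text is an Alpaca-style assumption, not the script's.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_WITHOUT_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(example):
    # Fall back to the no-input template when the "input" column is
    # absent from the dataset or its value is blank.
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(**example)
    return PROMPT_WITHOUT_INPUT.format(instruction=example["instruction"])
```

A record with a populated "input" gets the longer template; a record with a blank or missing "input" (the case this PR adds support for) gets the shorter one.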
Before submitting