Updates run_lora_clm.py with enhanced dataset support #955
Conversation
@dmsuehir Can you please add this case to CI and make sure it doesn't break anything?
@yeonsily I have added a test. Please let me know if there are any other updates needed to my PR. Thanks!
@dmsuehir Thank you! What setup did you use to test the baseline number?
I used a single card from a Gaudi2 machine in SDP cloud with the v1.15.1 Gaudi software/driver that is part of the Kubernetes cluster. I had the memory allocated at 120Gi and hugepages-2Mi at 4400Mi. The base container that I used was
@yeonsily Is there anything else needed for this PR?
Nice PR!
What does this PR do?
This PR updates the `examples/language-modeling/run_lora_clm.py` script with the flexibility to support more named datasets from the Hugging Face Hub. Previously, this example script only supported two named datasets for non-SQL prompts: `tatsu-lab/alpaca` and `timdettmers/openassistant-guanaco` (specifying other datasets would raise an error saying that the dataset is unsupported). The `tatsu-lab/alpaca` dataset has standard columns for `instruction`, `input`, and `output` that are formed into a prompt string before tokenization. `timdettmers/openassistant-guanaco` has a single `text` column.

To support more datasets, I added arguments that allow the user to specify the column names for "instruction", "input", and "output" (or, in the case of SQL prompts, "question", "context", and "answer"). If the user provides custom column names, those columns are renamed to the standard names (instruction/input/output or question/context/answer) so that the rest of the code in the script works as expected. Also, for non-SQL prompt datasets, the "input" is optional (which is why there are "prompt_with_input" and "prompt_without_input" templates). The existing code handled the case where the "input" column was blank, but could not handle the "input" column not existing, so I updated the code to also handle the "input" column being absent entirely.
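The column-mapping behavior described above can be sketched roughly as follows. This is an illustrative, pure-Python sketch of the logic, not the script's actual code: the real script operates on a Hugging Face `datasets.Dataset`, and the function and variable names here are assumptions.

```python
# Sketch: map user-specified column names onto the standard
# instruction/input/output names used by the rest of the script.
# All names here are illustrative, not taken from run_lora_clm.py.

def standardize_example(example, instruction_column, output_column,
                        input_column=None):
    """Rename a record's columns to the standard names; 'input' is optional."""
    standardized = {
        "instruction": example[instruction_column],
        "output": example[output_column],
    }
    # Handle both a missing input column and a blank input value.
    if input_column is not None and example.get(input_column):
        standardized["input"] = example[input_column]
    else:
        standardized["input"] = ""
    return standardized

# Example: a databricks/databricks-dolly-15k-style record, which uses
# "context" and "response" instead of "input" and "output".
record = {"instruction": "Summarize the text.",
          "context": "LoRA adds low-rank adapters.",
          "response": "LoRA trains small adapter matrices."}
print(standardize_example(record, "instruction", "response",
                          input_column="context"))
```

With `datasets`, the same effect would be achieved by renaming the columns once on the whole dataset rather than per record, but the fallback for an absent or blank "input" is the behavior this PR adds.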
I left the code specific to `timdettmers/openassistant-guanaco` alone, since that is a more special use case: it has only the one column and does not get preprocessed with the prompt template.

I tested this update with several different HF datasets: `databricks/databricks-dolly-15k`, `ruggsea/stanford-encyclopedia-of-philosophy_instruct`, `b-mc2/sql-create-context`, `flytech/python-codes-25k`, `gbharti/finance-alpaca`, and `medalpaca/medical_meadow_medical_flashcards`.
As an example, `databricks/databricks-dolly-15k` is a dataset that has different column names than the original code expects (it has "instruction", "context", and "response" instead of "instruction", "input", and "output"). To use `run_lora_clm.py` with this dataset, you can include the args `--input_column_name "context"` and `--output_column_name "response"`.

This can also be run without the `--input_column_name` arg, which then exercises the use case where a dataset does not have an "input" column at all.

Fixes #629
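The with-input/without-input template selection that the no-`--input_column_name` case exercises can be sketched as below. The template wording here is an assumption modeled on common Alpaca-style prompts, not copied from the script; only the fallback logic reflects what this PR describes.

```python
# Sketch: choose between the "prompt_with_input" and "prompt_without_input"
# templates. Template text is an Alpaca-style assumption, not the script's.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_WITHOUT_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(example):
    # Fall back to the no-input template when the "input" column is
    # absent from the dataset or its value is blank.
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(**example)
    return PROMPT_WITHOUT_INPUT.format(instruction=example["instruction"])
```

A record with a populated "input" gets the longer template; a record with a blank or missing "input" (the case this PR adds support for) gets the shorter one.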
Before submitting