How to optimize CodeGen for my code before launching FauxPilot #62
I don't know of a good guide to fine-tuning unfortunately! One of my colleagues, @shailja-thakur, has fine-tuned CodeGen on Verilog code, but it takes a lot of VRAM to fine-tune the 16B model (we had to use 80GB A100s). The
(You can add other keys if you want; the only field used by the training script is …)
You can see an example of a dataset I put together of C/C++ code found in Debian here: https://huggingface.co/datasets/moyix/debian_csrc
I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code. Also note that it is still a bit tricky to get a custom model working: you'll have to run the conversion from HF to FasterTransformer after training it, and create a configuration file for the new model (there is a script for this in the converter directory: https://github.com/moyix/fauxpilot/blob/main/converter/triton_config_gen.py). |
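As a rough illustration of the dataset format described above, here is a minimal sketch that packs a directory of source files into a JSON-lines file, one record per file. The field name `text` is an assumption: in the versions of `run_clm.py` I am aware of, a column named `text` is used when present, but check the script you actually run. The function name, the suffix list, and the extra `path` key are hypothetical.

```python
import json
from pathlib import Path

# Assumption: these suffixes stand in for "your code"; adjust them to your project.
SOURCE_SUFFIXES = {".c", ".h", ".cpp", ".py"}


def build_jsonl(source_dir, out_path, text_field="text"):
    """Write one JSON object per source file, with the file contents under `text_field`.

    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).rglob("*")):
            if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
                continue
            code = path.read_text(encoding="utf-8", errors="ignore")
            if not code.strip():
                continue  # skip empty files
            # Extra keys (here: the relative path) are allowed alongside the text field.
            record = {text_field: code, "path": str(path.relative_to(source_dir))}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A file produced this way (for example `train.json`) could then probably be passed to `run_clm.py` with `--train_file train.json` in place of `--dataset_name`, assuming your copy of the script accepts local JSON files.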
Yepp, I think so. :) |
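@moyix, @shailja-thakur, I got an unexpected OOM error (torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated; 200.50 MiB free; 10.70 GiB reserved in total by PyTorch)) while running the fine-tuning task with the smallest model (350M) and your Debian dataset on my Ubuntu 22.04 machine (DRAM 32GB) with an Nvidia TITAN Xp GPU (VRAM 12GB). Have you had a similar experience? Did you have to use an Nvidia A100 (80GB or 40GB of VRAM) even when fine-tuning the smallest model, such as the 350M? Can we change the 'ds_config.json' file to reduce GPU VRAM consumption so that the fine-tuning run completes successfully? Any feedback will be appreciated.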
|
Hello Geunsik,
Thank you for your email
I will be happy to help. Can you share your my-codegen-350m-deepspeed-finetune.sh,
ds_config.json, and the size of the training data, so I get an
idea of what could be happening in your case?
Thank you
shailja
…On Thu, Nov 3, 2022 at 7:46 PM Geunsik Lim ***@***.***> wrote:
I don't know of a good guide to fine-tuning unfortunately! One of my
colleagues, @shailja-thakur <https://github.com/shailja-thakur>, has
fine-tuned CodeGen on Verilog code, but it takes a lot of VRAM to fine-tune
the 16B model (we had to use 80GB A100s).
@moyix <https://github.com/moyix>, @shailja-thakur
<https://github.com/shailja-thakur>, I got the unexpected OOM issue
(e.g., torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate
198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated;
200.50 MiB free; 10.70 GiB reserved in total by PyTorch) while running
the fine-tuning task with the smallest model (e.g., 350M) and your debian
dataset on my Ubuntu 22.04 (DRAM 32GB)+ Nvidia GPU Xp (Vram 12GB).
Have you had a similar experience? Did you have to utilize Nvidia A100
VRAM 80GB (or 40GB) at the time, even if you tried to fine-tune tasks using
the smallest model, such as the 350M? Can we try to change the 'ds
config.json' file to reduce the memory consumption of the GPU VRAM in order
to complete the fine-tuning operation successfully? Any feedback will be
appreciated.
- Screenshot:
$ my-codegen-350m-deepspeed-finetune.sh
......... OMISSION ..........
[INFO|trainer.py:1608] 2022-11-04 11:17:11,278 >> ***** Running training *****
[INFO|trainer.py:1609] 2022-11-04 11:17:11,278 >> Num examples = 3786289
[INFO|trainer.py:1610] 2022-11-04 11:17:11,278 >> Num Epochs = 1
[INFO|trainer.py:1611] 2022-11-04 11:17:11,278 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1612] 2022-11-04 11:17:11,278 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1613] 2022-11-04 11:17:11,278 >> Gradient Accumulation steps = 32
[INFO|trainer.py:1614] 2022-11-04 11:17:11,278 >> Total optimization steps = 118321
[INFO|trainer.py:1615] 2022-11-04 11:17:11,278 >> Number of trainable parameters = 354858103
0%| /work/qtlab/transformers/src/transformers/models/codegen/modeling_codegen.py:167: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version
attn_weights = torch.where(causal_mask, attn_weights, mask_value)
Traceback (most recent call last):
File "/work/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 580, in <module>
main()
File "/work/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 528, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/work/qtlab/transformers/src/transformers/trainer.py", line 1501, in train
return inner_training_loop(
File "/work/qtlab/transformers/src/transformers/trainer.py", line 1749, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/work/qtlab/transformers/src/transformers/trainer.py", line 2508, in training_step
loss = self.compute_loss(model, inputs)
File "/work/qtlab/transformers/src/transformers/trainer.py", line 2540, in compute_loss
outputs = model(**inputs)
File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/invain/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/home/invain/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1680, in forward
loss = self.module(*inputs, **kwargs)
File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/work/qtlab/transformers/src/transformers/models/codegen/modeling_codegen.py", line 711, in forward
lm_logits = self.lm_head(hidden_states).to(torch.float32)
File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated; 200.50 MiB free; 10.70 GiB reserved in total by PyTorch) If re
0%|
[2022-11-04 11:17:13,621] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3296
[2022-11-04 11:17:13,621] [ERROR] [launch.py:324:sigkill_handler] ['/home/invain/anaconda3/envs/deepspeed/bin/python', '-u', './run_clm.py', '--local_rank= 'moyix/debian_csrc', '--tokenizer_name', 'Salesforce/codegen-350M-multi', '--block_size', '2048', '--gradient_accumulation_steps', '32', '--do_train', '--fp16', '--overwrite_output_dir', '--deepspeed',
real 94m15.273s
user 461m18.611s
sys 3m52.003s
|
@shailja-thakur, I don't understand why this training strategy still runs out of CUDA memory on an older Nvidia GPU (e.g., one with 12GB of VRAM).
I focused on reducing the memory used by parameters, gradients, and optimizer states, but I still could not find a recipe that avoids the CUDA OOM error with 12GB of VRAM. |
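To put the parameters, gradients, and optimizer states mentioned above into numbers, here is a back-of-the-envelope sketch. It assumes mixed-precision Adam at 16 bytes per parameter (the accounting used in the ZeRO paper) with no ZeRO partitioning or offload; the parameter count is taken from the training log quoted earlier in this thread. Activations, logits, and allocator overhead are not included.

```python
# Rough model-state memory for mixed-precision Adam training.
# Assumption: 2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32
# optimizer states per parameter, with nothing partitioned or offloaded.
NUM_PARAMS = 354_858_103  # "Number of trainable parameters" from the training log

BYTES_PER_PARAM = {
    "fp16 parameters": 2,
    "fp16 gradients": 2,
    "fp32 optimizer states (master weights, momentum, variance)": 12,
}


def model_state_gib(num_params):
    """Return the estimated size of each model-state component in GiB."""
    return {name: num_params * nbytes / 2**30 for name, nbytes in BYTES_PER_PARAM.items()}


breakdown = model_state_gib(NUM_PARAMS)
total_gib = sum(breakdown.values())
for name, gib in breakdown.items():
    print(f"{name}: {gib:.2f} GiB")
print(f"total model states: {total_gib:.2f} GiB")
```

Under these assumptions the model states alone come to roughly 5.3 GiB, so the remainder of the 10.55 GiB reported as allocated would have to come from activations, logits, and temporary buffers, which grow with `--block_size`.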
@shailja-thakur, are there any hints for fine-tuning on an NVIDIA TITAN Xp? I tried various things, but all of them failed. For now, I use a high-performance GPU (e.g., an NVIDIA A100 with 80GB of VRAM) to avoid the CUDA OOM error reported above. |
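One direction that might be worth trying on a 12GB card is ZeRO stage 2 with the optimizer states offloaded to CPU memory. The sketch below writes such a `ds_config.json`. It is a hypothetical configuration, not the one used in this thread, and nothing here confirms that it fits the 350M model into 12GB at `--block_size 2048`; the `"auto"` values rely on the Hugging Face Trainer's DeepSpeed integration filling them in from the command-line arguments.

```python
import json

# Hypothetical ds_config.json contents; NOT the configuration used in this thread.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        # Keep the fp32 optimizer states in CPU RAM instead of GPU VRAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    # "auto" lets the Hugging Face Trainer copy these from its own arguments.
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}


def write_ds_config(path="ds_config.json"):
    """Serialize the configuration above to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ds_config, f, indent=2)
    return path
```

If the run still fails inside the forward pass, as in the traceback above, lowering `--block_size` is another knob to consider, since activation and logit memory scale with sequence length.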
@moyix, First of all, thank you for sharing your experiences.
$ tree ./codegen-350M-multi-finetuned/
./codegen-350M-multi-finetuned/
├── added_tokens.json
├── all_results.json
├── config.json
├── merges.txt
├── pytorch_model.bin
├── README.md
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── trainer_state.json
├── training_args.bin
├── train_results.json
└── vocab.json
$ ls -al ./codegen-350M-multi-finetuned/
total 778380
drwxr-xr-x 2 leemgs leemgs 4096 Nov 10 16:40 .
drwxr-xr-x 6 leemgs leemgs 4096 Nov 10 16:43 ..
-rw-r--r-- 1 leemgs leemgs 1080 Nov 10 16:31 added_tokens.json
-rw-r--r-- 1 leemgs leemgs 582 Nov 10 16:31 all_results.json
-rw-r--r-- 1 leemgs leemgs 1011 Nov 10 16:31 config.json
-rw-r--r-- 1 leemgs leemgs 456356 Nov 10 16:31 merges.txt
-rw-r--r-- 1 leemgs leemgs 793630000 Nov 10 16:31 pytorch_model.bin
-rw-r--r-- 1 leemgs leemgs 1149 Nov 10 16:31 README.md
-rw-r--r-- 1 leemgs leemgs 99 Nov 10 16:31 special_tokens_map.json
-rw-r--r-- 1 leemgs leemgs 283 Nov 10 16:31 tokenizer_config.json
-rw-r--r-- 1 leemgs leemgs 2114827 Nov 10 16:31 tokenizer.json
-rw-r--r-- 1 leemgs leemgs 998 Nov 10 16:31 trainer_state.json
-rw-r--r-- 1 leemgs leemgs 4539 Nov 10 16:31 training_args.bin
-rw-r--r-- 1 leemgs leemgs 582 Nov 10 16:31 train_results.json
-rw-r--r-- 1 leemgs leemgs 798156 Nov 10 16:31 vocab.json
(deepspeed) leemgs@ai02:~/qtlab/CodeGen/checkpoints$
Using the generated fine-tuned model, I ran the "def hello_world" test. However, I got an unexpected error message like this:
(.venv) $ python3 -m jaxformer.hf.sample --model codegen-350M-multi --context "def hello_world():"
loading parameters
loading parameters took 9.95s
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 253, in <module>
main()
File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 225, in main
model = create_model(ckpt=ckpt, fp16=use_fp16).to(device)
File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 63, in create_model
return CodeGenForCausalLM.from_pretrained(ckpt, revision='float16', torch_dtype=torch.float16, low_cpu_mem_usage=True)
File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1526, in from_pretrained
cls._load_state_dict_into_model_low_mem(model, loaded_state_dict_keys, resolved_archive_file)
File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1786, in _load_state_dict_into_model_low_mem
new_val = getattr(submodule, param_name)
File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CodeGenAttention' object has no attribute 'causal_mask' |
FIXED. I figured out what was causing this problem: the Transformers version I used for training and the one I used for sampling were different. The problem was resolved by using the latest Transformers version (e.g., 4.25.0.dev0) and fixing the incorrect weights entry in the config.json file. I hope this report is useful to anyone who runs into a similar problem. 😄
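A quick way to catch this kind of mismatch early is to compare the library version recorded in the checkpoint with the one installed in the sampling environment. The sketch below assumes the checkpoint's `config.json` contains a `transformers_version` key, which `save_pretrained` normally writes; the helper names are hypothetical.

```python
import json
from pathlib import Path


def training_version(checkpoint_dir):
    """Return the `transformers_version` recorded in config.json, or None if absent."""
    config_path = Path(checkpoint_dir, "config.json")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("transformers_version")


def versions_match(checkpoint_dir, installed_version):
    """True only if the checkpoint records a version and it equals the installed one."""
    saved = training_version(checkpoint_dir)
    return saved is not None and saved == installed_version


# Example use in the sampling environment (requires transformers to be installed):
#   import transformers
#   versions_match("./codegen-350M-multi-finetuned", transformers.__version__)
```

A `False` result does not prove that loading will fail, but it is a hint to align the two environments before debugging errors such as the missing `causal_mask` attribute above.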
license: bsd-3-clause
tags:
This model card has been generated automatically according to the information the Trainer had access to. You
codegen-350M-finetuned
This model is a fine-tuned version of Salesforce/codegen-350M-multi on the moyix/debian_csrc dataset.
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
Training results
Framework versions
|
@moyix, I have one question about the fine-tuned CodeGen model. With the 350M CodeGen model, how can I compare the quality/accuracy of the original model and the fine-tuned model? I'm curious whether there are any well-known benchmarking tools or general methods for comparing the two. |
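One general method, offered here as a sketch and not as an answer from the maintainers, is to evaluate both models on the same held-out set of your own code and compare perplexity, which is the exponential of the mean cross-entropy loss. The loss values in the usage comment are placeholders, not measurements from this thread.

```python
import math


def perplexity(eval_loss):
    """Perplexity is exp(mean per-token cross-entropy loss)."""
    return math.exp(eval_loss)


def compare(base_eval_loss, finetuned_eval_loss):
    """Compare two models evaluated on the SAME held-out data; lower perplexity is better."""
    base_ppl = perplexity(base_eval_loss)
    tuned_ppl = perplexity(finetuned_eval_loss)
    return {
        "base_ppl": base_ppl,
        "finetuned_ppl": tuned_ppl,
        # Negative values mean the fine-tuned model has lower perplexity.
        "relative_change": (tuned_ppl - base_ppl) / base_ppl,
    }


# Placeholder losses for illustration only:
#   compare(base_eval_loss=1.0, finetuned_eval_loss=0.5)
```

The held-out files must not appear in the fine-tuning data, otherwise the comparison favors the fine-tuned model unfairly. Perplexity measures fit to your code, not functional correctness, so it complements execution-based benchmarks without replacing them.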
With FauxPilot, we can run inference tasks based on the CodeGen model. I recently read on the following website that the CodeGen model can also be fine-tuned:
$ deepspeed --num_gpus 1 --num_nodes 1 run_clm.py --model_name_or_path=Salesforce/codegen-6B-multi --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 --output_dir=./codegen-6B-finetuned --dataset_name your_dataset --tokenizer_name Salesforce/codegen-6B-multi --block_size 2048 --gradient_accumulation_steps 32 --do_train --fp16 --overwrite_output_dir --deepspeed ds_config.json
I'm curious whether there is a GitHub repository that describes how to fine-tune the model on additional source code (e.g., my own source code) using DeepSpeed. In particular, we are looking for a more detailed description of the "--dataset_name your_dataset" option. Where is the applicable GitHub repository located? Are there any web pages that explain how to run fine-tuning with DeepSpeed? Any comments on this issue are welcome.
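For orientation, the batch-related flags in a command like the one above combine as sketched below. The step count assumes the Trainer floor-divides the number of batches per epoch by the accumulation steps; the example figures come from the 350M training log quoted elsewhere in this thread, not from a 6B run.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(num_examples, per_device_batch, grad_accum_steps, num_gpus):
    """Assumed rule: batches per epoch, floor-divided by the accumulation steps."""
    batches = math.ceil(num_examples / (per_device_batch * num_gpus))
    return max(batches // grad_accum_steps, 1)


# --per_device_train_batch_size=1, --gradient_accumulation_steps 32, --num_gpus 1
batch = effective_batch_size(1, 32, 1)
# 3786289 is the "Num examples" value from the quoted 350M log.
steps = optimizer_steps_per_epoch(3_786_289, 1, 32, 1)
print(batch, steps)
```

With these inputs the sketch gives a total batch size of 32 and 118321 optimizer steps, the same values as the "Total train batch size" and "Total optimization steps" lines in that log.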