How to optimize CodeGen for my code before launching FauxPilot #62

leemgs opened this issue Sep 21, 2022 · 9 comments

leemgs commented Sep 21, 2022

Using a well-configured FauxPilot, we can run inference tasks with the CodeGen model. I recently read that I can fine-tune the CodeGen model with a command like the following:

$ deepspeed --num_gpus 1 --num_nodes 1 run_clm.py --model_name_or_path=Salesforce/codegen-6B-multi --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 --output_dir=./codegen-6B-finetuned --dataset_name your_dataset --tokenizer_name Salesforce/codegen-6B-multi --block_size 2048 --gradient_accumulation_steps 32 --do_train --fp16 --overwrite_output_dir --deepspeed ds_config.json

I'm curious whether there is a GitHub repository that describes how to fine-tune on additional source code (e.g., my own source code) using DeepSpeed. In particular, I'm looking for more detail on the "--dataset_name your_dataset" option: which repository or dataset should it point to? Are there any web pages that cover fine-tuning with DeepSpeed? Any comments on this issue are welcome.

moyix commented Sep 21, 2022

I don't know of a good guide to fine-tuning unfortunately! One of my colleagues, @shailja-thakur, has fine-tuned CodeGen on Verilog code, but it takes a lot of VRAM to fine-tune the 16B model (we had to use 80GB A100s).

The --dataset_name is just the location of the code you want to train on in a format that Huggingface Datasets recognizes. The simplest is probably to use JSONL format – a JSON file with one dictionary per line, using the format:

{"text": "content_of_source_file_1", "url": "path_to_source_file_1"}
{"text": "content_of_source_file_2", "url": "path_to_source_file_2"}
...

(You can add other keys if you want; the only field used by the training script is text, but I find it helpful to include some extra metadata so I can keep track of where the code came from.)
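
For illustration, here is a minimal sketch of how such a JSONL file could be built from a directory of source files (the directory path and file extensions below are just placeholders; adjust them to your codebase):

import json
from pathlib import Path

SRC_DIR = Path("my_project")              # placeholder: root of your source tree
OUT_FILE = "my_dataset.jsonl"
EXTENSIONS = {".c", ".h", ".cc", ".cpp"}  # placeholder: file types to include

with open(OUT_FILE, "w", encoding="utf-8") as out:
    for path in SRC_DIR.rglob("*"):
        if path.is_file() and path.suffix in EXTENSIONS:
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # skip files that are not valid UTF-8
            out.write(json.dumps({"text": text, "url": str(path)}) + "\n")

A local file like this can be passed to run_clm.py with --train_file my_dataset.jsonl instead of --dataset_name, or uploaded to the Hugging Face Hub and referenced by name.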

You can see an example of a dataset I put together of C/C++ code found in Debian here: https://huggingface.co/datasets/moyix/debian_csrc

I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code.

Also note that it is still a bit tricky to get a custom model working – you'll have to run the conversion from HF to FasterTransformer after training it, and create a configuration file for the new model (there is a script for this in the converter directory: https://github.com/moyix/fauxpilot/blob/main/converter/triton_config_gen.py).

leemgs commented Sep 22, 2022

> I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code.

Yepp, I think so. :)

leemgs commented Nov 4, 2022

> I don't know of a good guide to fine-tuning unfortunately! One of my colleagues, @shailja-thakur, has fine-tuned CodeGen on Verilog code, but it takes a lot of VRAM to fine-tune the 16B model (we had to use 80GB A100s).

@moyix, @shailja-thakur, I hit an unexpected OOM issue (torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated; 200.50 MiB free; 10.70 GiB reserved in total by PyTorch)) while running the fine-tuning task with the smallest model (350M) and your Debian dataset on my Ubuntu 22.04 machine (32 GB DRAM + NVIDIA Titan Xp with 12 GB VRAM).

Have you had a similar experience? Did you have to use an NVIDIA A100 with 80 GB (or 40 GB) of VRAM even when fine-tuning the smallest model, such as the 350M one? Can we change the ds_config.json file to reduce GPU VRAM consumption so that the fine-tuning run completes successfully? Any feedback will be appreciated.
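
For reference, DeepSpeed ships an estimator for the memory needed by the model states (parameters, gradients, optimizer states). A minimal sketch I can run before training, assuming the 350M checkpoint (it does not account for activation memory, so a passing estimate is not a guarantee):

from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the checkpoint on CPU and print the GPU/CPU memory required for parameters,
# gradients, and optimizer states under ZeRO-3, with and without offloading.
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)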

  • Screenshot:
$ my-codegen-350m-deepspeed-finetune.sh
     ......... OMISSION ..........
[INFO|trainer.py:1608] 2022-11-04 11:17:11,278 >> ***** Running training *****
[INFO|trainer.py:1609] 2022-11-04 11:17:11,278 >>   Num examples = 3786289
[INFO|trainer.py:1610] 2022-11-04 11:17:11,278 >>   Num Epochs = 1
[INFO|trainer.py:1611] 2022-11-04 11:17:11,278 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1612] 2022-11-04 11:17:11,278 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1613] 2022-11-04 11:17:11,278 >>   Gradient Accumulation steps = 32
[INFO|trainer.py:1614] 2022-11-04 11:17:11,278 >>   Total optimization steps = 118321
[INFO|trainer.py:1615] 2022-11-04 11:17:11,278 >>   Number of trainable parameters = 354858103
  0%|                                                                                                                                                                                                      /work/qtlab/transformers/src/transformers/models/codegen/modeling_codegen.py:167: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version
  attn_weights = torch.where(causal_mask, attn_weights, mask_value)
Traceback (most recent call last):
  File "/work/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 580, in <module>
    main()
  File "/work/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 528, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 1501, in train
    return inner_training_loop(
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 1749, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 2508, in training_step
    loss = self.compute_loss(model, inputs)
  File "/work/qtlab/transformers/src/transformers/trainer.py", line 2540, in compute_loss
    outputs = model(**inputs)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/invain/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/invain/anaconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1680, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/qtlab/transformers/src/transformers/models/codegen/modeling_codegen.py", line 711, in forward
    lm_logits = self.lm_head(hidden_states).to(torch.float32)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/invain/.local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 11.90 GiB total capacity; 10.55 GiB already allocated; 200.50 MiB free; 10.70 GiB reserved in total by PyTorch) If re
  0%|
[2022-11-04 11:17:13,621] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3296
[2022-11-04 11:17:13,621] [ERROR] [launch.py:324:sigkill_handler] ['/home/invain/anaconda3/envs/deepspeed/bin/python', '-u', './run_clm.py', '--local_rank= 'moyix/debian_csrc', '--tokenizer_name', 'Salesforce/codegen-350M-multi', '--block_size', '2048', '--gradient_accumulation_steps', '32', '--do_train', '--fp16', '--overwrite_output_dir', '--deepspeed',

real    94m15.273s
user    461m18.611s
sys     3m52.003s

shailja-thakur commented Nov 4, 2022 via email

leemgs commented Nov 6, 2022

> Can you share your my-codegen-350m-deepspeed-finetune.sh, ds_config.json, and the size of the training data, so I get an idea of what could be happening in your case?

@shailja-thakur, here are the details. I still don't know why this training setup runs out of CUDA memory on an older NVIDIA GPU (12 GB VRAM).

  • fine-tune option with deepspeed framework (e.g., my-codegen-350m-deepspeed-finetune.sh)
    • 12th Gen Intel Core i7 + 31 GB DRAM + NVIDIA Titan Xp (12 GB VRAM): failed due to CUDA OOM 😭
    • 12th Gen Intel Core i7 + 31 GB DRAM + NVIDIA A100 (80 GB VRAM): succeeded thanks to the 80 GB of VRAM 😄
deepspeed --num_gpus 1 --num_nodes 1 $RUN_CLM --model_name_or_path=Salesforce/codegen-${PARAM_SIZE}-multi \
 --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 \
 --output_dir=./codegen-${PARAM_SIZE}-finetuned --dataset_name $MY_DATASET \
 --tokenizer_name Salesforce/codegen-${PARAM_SIZE}-multi  \
 --block_size 2048 --gradient_accumulation_steps 32 --do_train --fp16 --overwrite_output_dir \
 --deepspeed $DS_CONFIG
  • ds_config.json
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false

  • the size of the training data
    • 153G ~/.cache/huggingface/datasets/moyix___parquet/

At that time, I focused on the memory for parameters, gradients, and optimizer states to avoid the CUDA OOM issue on an NVIDIA GPU with 12 GB of VRAM. However, I still could not find a recipe that avoids CUDA OOM on a 12 GB GPU.
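
One direction I have not verified yet is moving from ZeRO stage 2 to stage 3 with parameter offload, roughly like the fragment below (an untested sketch; activation memory from block_size 2048 is unaffected, so reducing block_size might also be necessary):

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },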

leemgs commented Nov 10, 2022

> 12th Gen Intel Core i7 + 31 GB DRAM + NVIDIA Titan Xp (12 GB VRAM): failed due to CUDA OOM 😭
> 12th Gen Intel Core i7 + 31 GB DRAM + NVIDIA A100 (80 GB VRAM): succeeded thanks to the 80 GB of VRAM 😄

@shailja-thakur, are there any hints or clues for fine-tuning on an NVIDIA Titan Xp? I tried various things, but failed. So for now I use a higher-performance GPU (an NVIDIA A100 with 80 GB of VRAM) to avoid the CUDA OOM reported above.

leemgs commented Nov 10, 2022

> Also note that it is still a bit tricky to get a custom model working – you'll have to run the conversion from HF to FasterTransformer after training it,

@moyix, First of all, thank you for sharing your experiences.
Thanks to your notes, I was able to create a fine-tuned model (codegen-350M-multi-finetuned) as follows.

$ tree ./codegen-350M-multi-finetuned/
./codegen-350M-multi-finetuned/
├── added_tokens.json
├── all_results.json
├── config.json
├── merges.txt
├── pytorch_model.bin
├── README.md
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── trainer_state.json
├── training_args.bin
├── train_results.json
└── vocab.json

$ ls -al ./codegen-350M-multi-finetuned/
total 778380
drwxr-xr-x 2 leemgs leemgs      4096 Nov 10 16:40 .
drwxr-xr-x 6 leemgs leemgs      4096 Nov 10 16:43 ..
-rw-r--r-- 1 leemgs leemgs      1080 Nov 10 16:31 added_tokens.json
-rw-r--r-- 1 leemgs leemgs       582 Nov 10 16:31 all_results.json
-rw-r--r-- 1 leemgs leemgs      1011 Nov 10 16:31 config.json
-rw-r--r-- 1 leemgs leemgs    456356 Nov 10 16:31 merges.txt
-rw-r--r-- 1 leemgs leemgs 793630000 Nov 10 16:31 pytorch_model.bin
-rw-r--r-- 1 leemgs leemgs      1149 Nov 10 16:31 README.md
-rw-r--r-- 1 leemgs leemgs        99 Nov 10 16:31 special_tokens_map.json
-rw-r--r-- 1 leemgs leemgs       283 Nov 10 16:31 tokenizer_config.json
-rw-r--r-- 1 leemgs leemgs   2114827 Nov 10 16:31 tokenizer.json
-rw-r--r-- 1 leemgs leemgs       998 Nov 10 16:31 trainer_state.json
-rw-r--r-- 1 leemgs leemgs      4539 Nov 10 16:31 training_args.bin
-rw-r--r-- 1 leemgs leemgs       582 Nov 10 16:31 train_results.json
-rw-r--r-- 1 leemgs leemgs    798156 Nov 10 16:31 vocab.json
(deepspeed) leemgs@ai02:~/qtlab/CodeGen/checkpoints$

Using the generated fine-tuned model, I ran the "def hello_world" test, following the official CodeGen documentation.

However, I get an unexpected error message:

  • error message: 'CodeGenAttention' object has no attribute 'causal_mask'
    I am puzzled as to why the pytorch_model.bin file produced by the fine-tuning process is incompatible.
    Any feedback or experience with this error message would be helpful.
(.venv) $ python3 -m jaxformer.hf.sample --model codegen-350M-multi --context "def hello_world():"


loading parameters
loading parameters took 9.95s
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 253, in <module>
    main()
  File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 225, in main
    model = create_model(ckpt=ckpt, fp16=use_fp16).to(device)
  File "/data/home/leemgs/qtlab/CodeGen/jaxformer/hf/sample.py", line 63, in create_model
    return CodeGenForCausalLM.from_pretrained(ckpt, revision='float16', torch_dtype=torch.float16, low_cpu_mem_usage=True)
  File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1526, in from_pretrained
    cls._load_state_dict_into_model_low_mem(model, loaded_state_dict_keys, resolved_archive_file)
  File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1786, in _load_state_dict_into_model_low_mem
    new_val = getattr(submodule, param_name)
  File "/data/home/leemgs/qtlab/CodeGen/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CodeGenAttention' object has no attribute 'causal_mask'

leemgs commented Nov 10, 2022

> AttributeError: 'CodeGenAttention' object has no attribute 'causal_mask'

FIXED. I figured out what was causing this problem: the transformers version I used for fine-tuning and the version I used for sampling were different. The problem was resolved by using the most recent transformers version (e.g., 4.25.0.dev0) and correcting the incorrect values in the config.json file. I hope this report is useful to anyone who runs into a similar difficulty in the near future. 😄
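
As a quick sanity check that the fine-tuned checkpoint loads and samples correctly with the newer transformers version, I used a minimal script along these lines (the local checkpoint path is an assumption; adjust as needed):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "./codegen-350M-multi-finetuned"   # assumed path to the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).cuda()

inputs = tokenizer("def hello_world():", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=True, temperature=0.2, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))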

  • The model card information: fine-tuned codegen-350M-multi model
    • /mylab/fine-tuning-codegen/codegen-350M-finetuned$ cat ./README.md

---
license: bsd-3-clause
tags:
- generated_from_trainer
datasets:
- moyix/debian_csrc
model-index:
- name: codegen-350M-finetuned
  results: []
---

This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

codegen-350M-finetuned

This model is a fine-tuned version of Salesforce/codegen-350M-multi on the moyix/debian_csrc dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0

Training results

Framework versions

  • Transformers 4.25.0.dev0
  • Pytorch 1.13.0
  • Datasets 2.6.1
  • Tokenizers 0.11.0

leemgs commented Nov 10, 2022

> I would not expect the bigger models to get much better from being fine-tuned on a relatively small amount of code, but the smallest models (like 350M) might benefit from seeing your code.

@moyix, I have one query about the fine-tuned Codegen model. With the 350M Codegen model, how can I compare the quality/accuracy of the original Codegen model and the fine-tuned Codegen model? I'm curious if there are any well-known benchmarking tools or general methods for comparing the quality/accuracy of these two models.
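
For now, the simplest comparison I can think of is held-out perplexity on my own code. A rough sketch, assuming eval.jsonl is a held-out file in the same {"text": ...} format as the training data (a functional benchmark such as HumanEval would measure correctness more directly):

import json, math, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def perplexity(ckpt, texts, max_length=1024):
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).cuda().eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=max_length).input_ids.cuda()
            if ids.shape[1] < 2:
                continue  # need at least two tokens for a shifted LM loss
            losses.append(model(ids, labels=ids).loss.item())
    # simple per-file average; a token-weighted average would be more precise
    return math.exp(sum(losses) / len(losses))

texts = [json.loads(line)["text"] for line in open("eval.jsonl")][:200]
for ckpt in ["Salesforce/codegen-350M-multi", "./codegen-350M-multi-finetuned"]:
    print(ckpt, perplexity(ckpt, texts))

A lower perplexity for the fine-tuned checkpoint on this held-out code would at least suggest it fits my codebase better.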
