# Code Llama Finetune

**In this notebook, I'll show you how to fine-tune Code Llama into a C++ code generation specialist. For coding tasks, Code Llama generally performs better than Llama 2, especially when the model is specialized for a specific code generation task:**

- Fine-tune with the [LoRA](https://github.com/microsoft/LoRA) method: quantize the base model to int8, freeze its weights, and train only the adapter.
- The code is adapted from [alpaca-lora](https://github.com/tloen/alpaca-lora), with some improvements and optimizations.


### 1. Install and load necessary libraries

In [5]:
# !pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
# !pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip -qqq install bitsandbytes accelerate
!pip install datasets==2.10.1
!pip install wandb
!pip install -q -U peft
import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround


I ran the code in this article on a server with four NVIDIA A40-48Q GPUs, Python 3.10, and CUDA 12.2. Training took about eight hours. (To verify portability, I also ran the code on Colab, and the results were good.)

### Load libraries
(If you get import errors, try restarting the Jupyter kernel.)


In [9]:
from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq


### 2. Load dataset 'codeparrot/xlcost-text-to-code'


The cells below pull the dataset from the Hugging Face Hub and hold out 10% of it as an evaluation set to check how well the model performs during training. If you want to load your own dataset, do the following:

```
train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
```
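For reference, here is a minimal sketch of how such a JSONL file could be produced with the standard library. It assumes your records use the same `text` and `code` fields as the XLCoST samples; the two example records and the temporary directory are purely illustrative.

```python
import json
import os
import tempfile

# Hypothetical records; a real dataset would contain many more.
records = [
    {"text": "Return the sum of two integers", "code": "int add(int a, int b) { return a + b; }"},
    {"text": "Return the larger of two integers", "code": "int mx(int a, int b) { return a > b ? a : b; }"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train_set.jsonl")
write_jsonl(train_path, records)

loaded = read_jsonl(train_path)
print(len(loaded), sorted(loaded[0].keys()))
```

Reading the file back before training is a cheap way to catch malformed lines early.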


In [10]:
from datasets import load_dataset
data = load_dataset("codeparrot/xlcost-text-to-code", "C++-program-level", split="train")
data

Each sample has a `text` field (the task description) and a `code` field (the C++ solution).

In [11]:
# Split once with a fixed seed so the train and eval sets are disjoint and reproducible.
split = data.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
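A train/eval split is only meaningful when the two sets are disjoint, which requires both halves to come from the same shuffle. The pure-Python sketch below illustrates the point with index lists; `split_indices` is a hypothetical helper, not part of `datasets`, and the sizes and seeds are arbitrary.

```python
import random

def split_indices(n_items, test_fraction, seed):
    """Shuffle once, then cut the shuffled list into (train, test)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]

# One shuffle, one cut: the two sets cannot overlap.
train_idx, eval_idx = split_indices(1000, 0.1, seed=42)
leak_single = set(train_idx) & set(eval_idx)

# Two independent shuffles: train taken from one, eval from the other.
train_a, _ = split_indices(1000, 0.1, seed=1)
_, eval_b = split_indices(1000, 0.1, seed=2)
leak_double = set(train_a) & set(eval_b)

print("overlap with a single split:", len(leak_single))
print("overlap with two independent splits:", len(leak_double))
```

With two independent shuffles, most evaluation samples also appear in the training set, so the validation loss would look better than it should.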

To inspect any sample in the dataset, index into it:

In [14]:
print(train_dataset[3])

### 3. Load model
I load Code Llama in int8 from Hugging Face, as is standard for this kind of LoRA fine-tuning:

In [16]:
base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")


Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.64s/it]


`torch_dtype=torch.float16` means that computation uses a float16 representation, even though the quantized weights themselves are stored as 8-bit integers.
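To get a feel for why int8 loading matters, here is a rough back-of-the-envelope estimate of the weight memory. It uses the total parameter count that this notebook prints for the base model and assumes a uniform number of bytes per parameter, which is a simplification: some modules stay in float16, and activations, gradients, and optimizer state are not counted.

```python
N_PARAMS = 6_738_546_688  # total parameter count reported for the base model
GIB = 1024 ** 3

def weights_gib(n_params, bytes_per_param):
    """Approximate weight storage in GiB for a uniform dtype."""
    return n_params * bytes_per_param / GIB

fp16_gib = weights_gib(N_PARAMS, 2)  # 2 bytes per float16 value
int8_gib = weights_gib(N_PARAMS, 1)  # 1 byte per int8 value

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"int8 weights:    ~{int8_gib:.1f} GiB")
```

The real footprint of the 8-bit model lies somewhere between the two figures.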

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your `transformers` version is 4.33.0 or later and `accelerate>=0.20.3`.

Below is our test task.

In [16]:
eval_prompt = """
Use the Task below and write the Response, which is a programming code that can solve the Task.
### Task:
Generate a C++ program that accepts numeric input from the user and maintains a record of previous user inputs with timestamps. Ensure the program sorts the user inputs in ascending order based on the provided numeric input. Enhance the program to display timestamps along with the sorted user inputs.
### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=500, do_sample=True, top_p=0.9,temperature=0.5)[0]))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> 
Use the Task below and write the Response, which is a programming code that can solve the Task.
### Task:
Generate a C++ program that accepts numeric input from the user and maintains a record of previous user inputs with timestamps. Ensure the program sorts the user inputs in ascending order based on the provided numeric input. Enhance the program to display timestamps along with the sorted user inputs.
### Response:
```cpp
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>
#include <chrono>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <iterator>
using namespace std;

int main()
{
	int n;
	cout << "Enter the number of inputs: ";
	cin >> n;
	vector<string> v;
	vector<string> v2;
	vector<string> v3;
	for (int i = 0; i < n; i++)
	{
		string s;
		cout << "Enter the input: ";
		cin >> s;
		v.push_back(s);
	}
	for (int i = 0; i < n; i++)
	{
		stringstream ss;
		ss << v[i];
		int x;
		ss >> x;
		stringstream s;
		s << x;
		

The output is not good enough and fails the test. Let's fine-tune the model and check the results again afterwards.

We can count the model's parameters and find out how many of them are trainable. The following function does that; at this stage, you do not need to understand its details.

In [7]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 262541312
all model parameters: 6738546688
percentage of trainable model parameters: 3.90%
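One plausible reading of the 3.90% figure: the int8-quantized linear layers are frozen, so what still counts as trainable is the embedding matrix, the output head, and the normalization weights. The sketch below checks that arithmetic. The dimensions are assumptions taken from the public CodeLlama-7b configuration, not from this notebook.

```python
# Assumed CodeLlama-7b dimensions (from the public model config).
VOCAB_SIZE = 32016
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

embedding = VOCAB_SIZE * HIDDEN_SIZE  # input token embeddings
lm_head = VOCAB_SIZE * HIDDEN_SIZE    # output projection onto the vocabulary
# Two RMSNorm weight vectors per decoder layer, plus one final norm.
norms = (2 * NUM_LAYERS + 1) * HIDDEN_SIZE

not_quantized = embedding + lm_head + norms
print(not_quantized)  # 262541312
```

The result equals the trainable count printed above, which supports this reading.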


### 4. Tokenization
Configure the tokenizer, including left padding, since it [uses less memory during training](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa):

In [17]:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
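To make the padding setting concrete, here is a small pure-Python sketch of what left and right padding do to a batch of token ID lists. `pad_batch` is a hypothetical helper written for illustration, and the token IDs are made up; only the pad ID `0` comes from the settings above.

```python
PAD_ID = 0  # same value as tokenizer.pad_token_id above

def pad_batch(sequences, pad_id=PAD_ID, side="left"):
    """Pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        if side == "left":
            padded.append(pad + seq)
            masks.append([0] * len(pad) + [1] * len(seq))
        else:
            padded.append(seq + pad)
            masks.append([1] * len(seq) + [0] * len(pad))
    return padded, masks

batch = [[5, 6, 7], [8, 9]]  # made-up token IDs
left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")

print(left_ids, left_mask)
print(right_ids, right_mask)
```

With left padding, the last position of every row is a real token, which is what a decoder-only model continues from.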

Define the tokenize function so that `labels` and `input_ids` are the same. This is basically [self-supervised fine-tuning](https://neptune.ai/blog/self-supervised-learning):

In [18]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
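Copying `input_ids` into `labels` works because causal language models in `transformers` shift the labels by one position internally, so each position is trained to predict the next token. The sketch below shows that pairing on a made-up sequence; `next_token_pairs` is a hypothetical helper for illustration only.

```python
def next_token_pairs(input_ids, labels):
    """Pair each prefix of the input with the token it should predict next."""
    return [
        (input_ids[: i + 1], labels[i + 1])
        for i in range(len(input_ids) - 1)
    ]

input_ids = [1, 42, 43, 2]  # made-up IDs: start token, two tokens, end token
labels = input_ids.copy()   # same trick as in tokenize()

pairs = next_token_pairs(input_ids, labels)
for context, target in pairs:
    print(context, "->", target)
```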

Next, define a function that turns each `data_point` into a full prompt and tokenizes it:

In [27]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = f"""
Use the Task below and write the Response, which is a programming code that can solve the Task.

### Task:
{data_point["text"]}

### Response:
{data_point["code"]}
"""
    return tokenize(full_prompt)
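Since the model learns this exact layout, it can help to build the training and inference prompts from a single template so that they cannot drift apart. The following is a sketch under that assumption; `build_prompt` and the sample record are hypothetical and not part of the original notebook.

```python
INSTRUCTION = (
    "Use the Task below and write the Response, "
    "which is a programming code that can solve the Task."
)

def build_prompt(task, code=None):
    """Return a training prompt if code is given, otherwise an inference prompt."""
    prompt = f"\n{INSTRUCTION}\n\n### Task:\n{task}\n\n### Response:\n"
    if code is not None:
        prompt += f"{code}\n"
    return prompt

# Made-up sample with the same fields as the dataset.
sample = {"text": "Add two numbers", "code": "int add(int a, int b) { return a + b; }"}

train_prompt = build_prompt(sample["text"], sample["code"])
infer_prompt = build_prompt(sample["text"])
print(train_prompt)
```

The inference prompt is then a strict prefix of the training prompt, so generation starts exactly where the training examples place the code.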



Apply the prompt formatting and tokenization to every sample:

In [28]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

Map: 100%|██████████| 8817/8817 [00:10<00:00, 846.22 examples/s]
Map: 100%|██████████| 980/980 [00:00<00:00, 1332.20 examples/s]


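### 5. Set up LoRA

Add LoRA adapter layers/parameters to the original LLM; only these will be trained. (In recent `peft` releases, `prepare_model_for_int8_training` has been superseded by `prepare_model_for_kbit_training`; if the import fails, use the newer name.)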

In [17]:
model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)


lora_alpha = 16
lora_dropout = 0.05
lora_r = 16

config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "v_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

In [18]:
print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 8388608
all model parameters: 6746935296
percentage of trainable model parameters: 0.12%
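The trainable count can be checked by hand. LoRA adds two small matrices per targeted module: `A` of shape `r x in_features` and `B` of shape `out_features x r`. The sketch below assumes the public CodeLlama-7b dimensions (hidden size 4096, 32 decoder layers) and square `q_proj` and `v_proj` projections; those values are assumptions, not taken from this notebook.

```python
HIDDEN_SIZE = 4096  # assumed from the public model config
NUM_LAYERS = 32     # assumed from the public model config
LORA_R = 16         # same rank as lora_r above
TARGET_MODULES = ["q_proj", "v_proj"]
ALL_PARAMS = 6_746_935_296  # total reported after adding the adapter

def lora_params_per_module(in_features, out_features, r):
    """A is (r x in_features) and B is (out_features x r)."""
    return r * in_features + out_features * r

per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
share = 100 * total / ALL_PARAMS

print(total)            # 8388608
print(f"{share:.2f}%")  # 0.12%
```

Doubling `r` or adding more target modules scales the adapter size linearly.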


To resume from a checkpoint, set `resume_from_checkpoint` to the path of the `adapter_model.bin` file to resume from. The code below loads those weights into the LoRA adapter attached to the model, replacing its current weights.

In [33]:
resume_from_checkpoint = "" # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")

Optionally, set up Weights & Biases to view training graphs:

In [34]:
wandb_project = "CodeLlama_finetune_CPP_2"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project


In [35]:
if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

### 6. Training Parameters
If GPU memory is insufficient, reduce `per_device_train_batch_size`. Because `gradient_accumulation_steps` is derived from it, the effective batch size (`batch_size`) stays the same during training.

In [36]:
batch_size = 128  # effective batch size per optimizer step
per_device_train_batch_size = 16
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "CodeLlama_CPP_FineTuned_2"

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        # save_total_limit=3,
        load_best_model_at_end=False,
        # ddp_find_unused_parameters=False if ddp else None,
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="wandb", # if use_wandb else "none",
        run_name=f"codellama-cpp-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
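The relationship between these batch settings can be sketched in plain Python. The numbers are the ones used in this notebook (effective batch size 128, 400 steps, 8817 training examples); `accumulation_steps` is a hypothetical helper, and the calculation assumes a single training process, with no data-parallel multiplier.

```python
batch_size = 128           # effective batch size per optimizer step
max_steps = 400
num_train_examples = 8817  # size of the tokenized training set

def accumulation_steps(target_batch, per_device_batch):
    """Number of micro-batches to accumulate to reach the target batch size."""
    if target_batch % per_device_batch != 0:
        raise ValueError("target batch must be a multiple of the per-device batch")
    return target_batch // per_device_batch

steps_at_16 = accumulation_steps(batch_size, 16)  # setting used in this notebook
steps_at_8 = accumulation_steps(batch_size, 8)    # fallback if memory is tight

# Both settings process the same number of samples per optimizer step.
epochs = max_steps * batch_size / num_train_examples

print(steps_at_16, steps_at_8)
print(f"{epochs:.1f} epochs")
```

Halving the per-device batch doubles the accumulation steps, trading speed for memory while leaving the optimization schedule unchanged.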


Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


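Next, we apply a few PyTorch-related tweaks: disable the generation cache during training, patch `state_dict` so that it returns only the adapter weights, and compile the model where supported. These aim to speed up training without affecting accuracy: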

In [37]:
model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)

compiling the model
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
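The `state_dict` patch above relies on a Python detail: calling `__get__` on a plain function returns a method bound to a specific instance. The sketch below reproduces the pattern on a toy class. `TinyModel` and `adapter_only` are hypothetical stand-ins for the real model and for `get_peft_model_state_dict`, and filtering on the `lora_` prefix is an assumption made for illustration.

```python
class TinyModel:
    """Toy stand-in for a model with base and adapter weights."""

    def __init__(self):
        self.weights = {"base.w": 1.0, "lora_A.w": 0.1, "lora_B.w": 0.2}

    def state_dict(self):
        return dict(self.weights)

def adapter_only(model, full_state):
    # Hypothetical stand-in for get_peft_model_state_dict: keep adapter keys only.
    return {k: v for k, v in full_state.items() if k.startswith("lora_")}

tiny = TinyModel()
old_state_dict = tiny.state_dict  # bound method captured before patching

# __get__ binds the lambda to this instance, so it behaves like a method.
tiny.state_dict = (
    lambda self, *_, **__: adapter_only(self, old_state_dict())
).__get__(tiny, type(tiny))

patched = tiny.state_dict()
print(patched)
```

Only the patched instance is affected; other instances of the class keep the original method.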

In [38]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: medxiaorudan. Use `wandb login --relogin` to force relogin


You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


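| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 20   | 1.5907        | 1.391939        |
| 40   | 1.271         | 1.081857        |
| 60   | 0.8369        | 0.793175        |
| 80   | 0.782         | 0.755939        |
| 100  | 0.6838        | 0.731848        |
| 120  | 0.6527        | 0.720158        |
| 140  | 0.6797        | 0.714182        |
| 160  | 0.7195        | 0.705022        |
| 180  | 0.7141        | 0.698529        |
| 200  | 0.6961        | 0.694456        |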


TrainOutput(global_step=400, training_loss=0.758864541053772, metrics={'train_runtime': 28420.8634, 'train_samples_per_second': 1.801, 'train_steps_per_second': 0.014, 'total_flos': 7.593082346579558e+17, 'train_loss': 0.758864541053772, 'epoch': 5.8})
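The throughput numbers in `TrainOutput` can be cross-checked against the training settings. This sketch only uses values that appear in this notebook: the reported runtime, 400 steps, and an effective batch size of 128.

```python
train_runtime_s = 28420.8634  # 'train_runtime' from TrainOutput
max_steps = 400
batch_size = 128              # effective batch size per optimizer step

hours = train_runtime_s / 3600
samples_per_second = max_steps * batch_size / train_runtime_s
steps_per_second = max_steps / train_runtime_s
seconds_per_step = train_runtime_s / max_steps

print(f"{hours:.1f} h total")
print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s")
print(f"{seconds_per_step:.0f} s per optimizer step")
```

The results agree with the reported metrics and with the roughly eight hours of training mentioned at the start.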

### Load the final checkpoint


In [22]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.52s/it]


To load a fine-tuned LoRA/QLoRA adapter, use `PeftModel.from_pretrained`. The directory you pass (here, a checkpoint under `output_dir`) should contain `adapter_config.json` and `adapter_model.bin`:

In [23]:
from peft import PeftModel
model = PeftModel.from_pretrained(model, "/srv/users/rudxia/Developer_MICCAI/CodeLlama/CodeLlama_CPP_FineTuned_2/checkpoint-400/")

The number of trainable parameters is now `0` because `PeftModel.from_pretrained` defaults to `is_trainable=False`:

In [25]:
print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 0
all model parameters: 6746935296
percentage of trainable model parameters: 0.00%


Try the same prompt as before:

In [6]:
eval_prompt = """
Use the Task below and write the Response, which is a programming code that can solve the Task.
### Task:
Generate a C++ program that accepts numeric input from the user and maintains a record of previous user inputs with timestamps. Ensure the program sorts the user inputs in ascending order based on the provided numeric input. Enhance the program to display timestamps along with the sorted user inputs.
### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id,do_sample=True, top_p=0.9,temperature=0.5)[0]))

<s> 
Use the Task below and write the Response, which is a programming code that can solve the Task.
### Task:
Generate a C++ program that accepts numeric input from the user and maintains a record of previous user inputs with timestamps. Ensure the program sorts the user inputs in ascending order based on the provided numeric input. Enhance the program to display timestamps along with the sorted user inputs.
### Response:
```
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <ctime>
#include <iomanip>

using namespace std;

int main()
{
    vector<string> inputs;
    string input;
    string time;

    cout << "Enter input: ";
    getline(cin, input);

    while (input != "exit")
    {
        time = to_string(time_t(time(0)));
        inputs.push_back(input + " " + time);
        cout << "Enter input: ";
        getline(cin, input);
    }

    sort(inputs.begin(), inputs.end());

    for (int i = 0; i < inputs.size(); i++)
    {
        cout << lef

The results show that fine-tuning is effective: the generated code now passes the compilation tests. You can also convert this adapter for use with llama.cpp to run the model locally.

