# Supervised Fine Tuning of StarCoder2-3B for Text to Cypher Generation

**Sources:**  
- [Phil Schmid, How to Fine-Tune LLMs in 2024 with Hugging Face](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl)  
- [Sebastian Raschka, Practical Tips for Finetuning LLMs using LoRA](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms)
- [Sebastian Raschka, Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments](https://lightning.ai/pages/community/lora-insights/)
- [Hugging Face, The Alignment Handbook](https://github.com/huggingface/alignment-handbook/tree/main)

## Workspace Setup

This setup reflects the library versions available in May 2024 and will probably change over time. The notebook was run in Google Colab Pro with an A100 GPU and the High-RAM setting.

In [None]:
#@title Necessary Installs

!pip install -U accelerate
!pip install -U bitsandbytes
!pip install -U datasets
!pip install -U einops
!pip install -U evaluate
!pip install -U ninja
!pip install -U packaging
!pip install -U peft
!pip install -U tensorboard
!pip install -U torch
!pip install -U transformers
!pip install -U trl

In [4]:
#@title Required Imports

# Python packages
import json
import gc

# For Google Colab settings
from google.colab import userdata, drive

# For Hugging Face Hub setting
from huggingface_hub import login

# To work with data in Hugging Face format
import datasets
from datasets import load_dataset, Dataset

import torch

# To parse the data with chat_template
from multiprocessing import cpu_count

#Imports for model, tokenizer, training
import transformers
from transformers import (AutoTokenizer,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          pipeline
                          )

from peft import (LoraConfig,
                  PeftConfig,
                  PeftModel,
                  get_peft_model,
                  prepare_model_for_kbit_training,
                  AutoPeftModelForCausalLM
                  )
import trl
from trl import (SFTTrainer,
                 setup_chat_format,
                 )

In [5]:
#@title Display Relevant Libraries Versions

print(f"The PyTorch version is {torch.__version__}.")
print(f"Datasets version is {datasets.__version__}.")
print(f"Transformers version is {transformers.__version__}.")
print(f"TRL version is {trl.__version__}.")

The PyTorch version is 2.3.0+cu121.
Datasets version is 2.19.1.
Transformers version is 4.41.0.
TRL version is 0.8.6.


In [6]:
#@title Assert Cuda Capability for Flash Attention

major_version, minor_version = torch.cuda.get_device_capability()
print(f"Cuda major version: {major_version}.\nCuda minor version: {minor_version}")
assert major_version >= 8, "Hardware not supported by Flash Attention."

Cuda major version: 8.
Cuda minor version: 0


In [None]:
#@title Install Flash Attention

# Limit the number of parallel build jobs to accommodate the compute capabilities of Google Colab
# (keep the comment on its own line: %env takes everything after "=" as the value)
%env MAX_JOBS=2

# Install flash attention - for Ampere GPUs
!pip install -U flash-attn --no-build-isolation

In [8]:
#@title Resources Estimation
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
0.0 GB of memory reserved.


In [9]:
#@title Hugging Face Credentials

# Load the Hugging Face token (it should have WRITE access) from Colab secrets
HF = userdata.get('HF')

# This is needed to upload the model to the Hugging Face Hub
login(token=HF, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [10]:
#@title Google Colab Drive Helper

# This will prompt for authorization
drive.mount('/content/drive')

# Set the working directory
%cd '/content/drive/MyDrive/finetuneCypher/'

Mounted at /content/drive
/content/drive/MyDrive/finetuneCypher


In [11]:
#@title Path Variables

# Create a path variable for the data folder
data_path = '/content/drive/MyDrive/finetuneCypher/datas/'

# Supervised fine-tuning dataset
trainer_with_repeats_file = 'parametric_trainer_with_repeats.json'

# Create a path variable for the SFT model to be saved locally
model_path = '/content/drive/MyDrive/finetuneCypher/cypherStarCoder2/'

In [12]:
#@title The Model Name

model_id =  "bigcode/starcoder2-3b"

## Parse the Dataset with Chat Template

In [13]:
#@title Load the SFT Dataset

with open(data_path + trainer_with_repeats_file, 'rb') as f:
    sampler = json.load(f)

# Display an entry
sampler[123]

{'Prompt': 'Convert the following question into a Cypher query using the provided graph schema!',
 'Question': 'Fetch the Author nodes and extract their affiliation property!',
 'Schema': 'Graph schema: Relevant node labels and their properties (with datatypes) are:\nAuthor {affiliation: STRING}',
 'Cypher': 'MATCH (n:Author) RETURN n.affiliation'}

In [14]:
#@title Parse to Conversational Format

system_message = """
You are a text to Cypher query translator. {prompt}\n{schema}
"""

# Function to transform the data to conversational format {role:, content: }
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system","content": system_message.format(prompt=sample["Prompt"], schema=sample["Schema"])},
            {"role": "user", "content": sample["Question"]},
            {"role": "assistant", "content": sample["Cypher"]}
        ]
    }
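
As a quick check, the function can be applied to the entry displayed earlier. The block below repeats the definitions from the cell above so that it runs on its own; the `sample` dictionary is copied from the output of `sampler[123]`.

```python
# Standalone check of the conversational format (definitions repeated from the cell above)
system_message = """
You are a text to Cypher query translator. {prompt}\n{schema}
"""

def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": system_message.format(prompt=sample["Prompt"], schema=sample["Schema"])},
            {"role": "user", "content": sample["Question"]},
            {"role": "assistant", "content": sample["Cypher"]},
        ]
    }

# Entry copied from the output of sampler[123]
sample = {
    "Prompt": "Convert the following question into a Cypher query using the provided graph schema!",
    "Question": "Fetch the Author nodes and extract their affiliation property!",
    "Schema": "Graph schema: Relevant node labels and their properties (with datatypes) are:\nAuthor {affiliation: STRING}",
    "Cypher": "MATCH (n:Author) RETURN n.affiliation",
}

conversation = create_conversation(sample)
for message in conversation["messages"]:
    print(f"{message['role']:>9}: {message['content']!r}")
```

Note that the triple-quoted `system_message` starts and ends with a newline, which is why the system content in the samples shown later begins and ends with `\n`.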


In [15]:
#@title Convert Data to HuggingFace Format

from datasets import load_dataset, Dataset

dataset = Dataset.from_list(sampler)
dataset = dataset.shuffle()

# Transform to required format
dataset = dataset.map(create_conversation,
                      remove_columns=dataset.features,
                      batched=False)


Map:   0%|          | 0/30116 [00:00<?, ? examples/s]

In [16]:
#@title Split Data into Train and Test Sets
dataset = dataset.train_test_split(test_size=0.1, seed=23)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 27104
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 3012
    })
})
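
The split sizes above can be reproduced by hand. This is a small arithmetic sketch, assuming that a fractional `test_size` is rounded up to the next whole example and that the remainder goes to the training set; the total of 30116 rows is taken from the `Map` progress bar.

```python
import math

n_samples = 30116  # total rows, as reported by the Map progress bar
test_size = 0.1    # fraction passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```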


In [17]:
#@title Display a Sample
dataset["train"][32]["messages"]

[{'content': '\nYou are a text to Cypher query translator. Convert the following question into a Cypher query using the provided graph schema!\nGraph schema: Relevant node labels and their properties (with datatypes) are:\nDOI {name: STRING}\n',
  'role': 'system'},
 {'content': 'Find 10 DOI that have the name recorded and return these values!',
  'role': 'user'},
 {'content': 'MATCH (n:DOI) WHERE n.name IS NOT NULL RETURN n.name LIMIT 10',
  'role': 'assistant'}]

In [18]:
#@title Save the Dataset Splits
dataset["train"].to_json(data_path+"train_dataset.json", orient="records")
dataset["test"].to_json(data_path+"test_dataset.json", orient="records")

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/4 [00:00<?, ?ba/s]

2184680

## Tokenizer Settings

In [19]:
#@title Load Tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'right' # added to prevent warnings

# Set a maximum length
tokenizer.model_max_length = 2048

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.88k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/958 [00:00<?, ?B/s]

In [20]:
#@title  Define a Chat Template

CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|im_start|>user\n' + message['content'] + '<|im_end|>'+eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|im_start|>system\n' + message['content'] + '<|im_end|>'+eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|im_start|>assistant\n'  + message['content'] + '<|im_end|>'+eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|im_start|>assistant' }}\n{% endif %}\n{% endfor %}"

tokenizer.chat_template = CHAT_TEMPLATE
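
To make the template easier to read, here is a plain-Python approximation of what it renders. The helper `render_chat` and the constant `EOS_TOKEN` are hypothetical names introduced only for this illustration; the EOS value is assumed from the `<|endoftext|>` marker visible in the rendered samples, and the exact blank lines produced by the real Jinja template may differ.

```python
EOS_TOKEN = "<|endoftext|>"  # assumed EOS token, as seen in the rendered samples

def render_chat(messages, add_generation_prompt=False):
    """Plain-Python approximation of CHAT_TEMPLATE (illustrative helper, not the tokenizer API)."""
    parts = []
    for message in messages:
        if message["role"] in ("system", "user", "assistant"):
            # Each turn: <|im_start|>role, newline, content, <|im_end|>, EOS
            parts.append(
                f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>{EOS_TOKEN}\n"
            )
    # The template opens an assistant turn after the last message when asked to
    if add_generation_prompt and messages:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

demo = [
    {"role": "system", "content": "You are a text to Cypher query translator."},
    {"role": "user", "content": "Fetch the Author nodes and extract their affiliation property!"},
]
print(render_chat(demo, add_generation_prompt=True))
```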

In [21]:
#@title Reload Dataset if Necessary

train_dataset = load_dataset("json",
                       data_files=data_path+"train_dataset.json",
                       split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [23]:
#@title Apply the Chat Template

# Process the dataset function
def apply_chat_template(sample, tokenizer):
    messages = sample["messages"]

    # We add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})

    sample["text"] = tokenizer.apply_chat_template(messages,
                                                   tokenize=False)

    return sample

# Apply the chat template to the entire dataset
train_dataset = train_dataset.map(apply_chat_template,
                      fn_kwargs={"tokenizer": tokenizer},
                      remove_columns=train_dataset.features,
                      )
print(train_dataset)

Map:   0%|          | 0/27104 [00:00<?, ? examples/s]

Dataset({
    features: ['text'],
    num_rows: 27104
})


In [25]:
#@title Display a Sample
print(train_dataset[123]["text"])

<|im_start|>system

You are a text to Cypher query translator. Convert the following question into a Cypher query using the provided graph schema!
Relevant node labels and their properties (with datatypes) are:
Article {comments: STRING}
Journal {name: STRING}

Relevant relationships are:
{'start': Article, 'type': PUBLISHED_IN, 'end': Journal }


Relevant relationship properties (with datatypes) are:
PUBLISHED_IN {meta: STRING}
<|im_end|><|endoftext|>
<|im_start|>user
Calculate the average name for Journal that is linked to Article via PUBLISHED_IN where meta is 220 and has comments date before December 31, 2020!<|im_end|><|endoftext|>
<|im_start|>assistant
MATCH (n:Article) -[:PUBLISHED_IN{meta: '220'}]->(m:Journal) WHERE m.comments < date('2020-12-31') RETURN avg(m.name) AS avg_name<|im_end|><|endoftext|>



## Load Base Model

In [26]:
#@title Quantization Parameters

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16   # change to float16 if using non-Ampere GPU
)

In [27]:
#@title Device Map

device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

In [28]:
#@title Load the Model in 4-bit

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    attn_implementation="flash_attention_2", # remove if using non-Ampere GPU
    torch_dtype=torch.bfloat16, # change to float16 if using non-Ampere GPU
    quantization_config=bnb_config
)

model.config.use_cache = False

config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/12.1G [00:00<?, ?B/s]

## SFT Train Model and Save

In [29]:
#@title Prepare Model for Training

model = prepare_model_for_kbit_training(model)

In [30]:
#@title Identify the Linear Layers

print(model)

Starcoder2ForCausalLM(
  (model): Starcoder2Model(
    (embed_tokens): Embedding(49152, 3072)
    (layers): ModuleList(
      (0-29): 30 x Starcoder2DecoderLayer(
        (self_attn): Starcoder2FlashAttention2(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=True)
          (k_proj): Linear4bit(in_features=3072, out_features=256, bias=True)
          (v_proj): Linear4bit(in_features=3072, out_features=256, bias=True)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=True)
          (rotary_emb): Starcoder2RotaryEmbedding()
        )
        (mlp): Starcoder2MLP(
          (c_fc): Linear4bit(in_features=3072, out_features=12288, bias=True)
          (c_proj): Linear4bit(in_features=12288, out_features=3072, bias=True)
          (act): PytorchGELUTanh()
        )
        (input_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((3072,), eps=1e-05, elementwise_affine=True)
      )
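
The printed shapes are enough for a rough parameter count. The sketch below only counts the modules visible in the (truncated) printout, so the final norm and the output head are not included; the 4-byte and 0.5-byte storage sizes are assumptions used for a back-of-the-envelope estimate.

```python
# (in_features, out_features) copied from the printed module tree
linear_shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 256),
    "v_proj": (3072, 256),
    "o_proj": (3072, 3072),
    "c_fc": (3072, 12288),
    "c_proj": (12288, 3072),
}
n_layers = 30
vocab_size, hidden_size = 49152, 3072

linear_weights = n_layers * sum(i * o for i, o in linear_shapes.values())
linear_biases = n_layers * sum(o for _, o in linear_shapes.values())
layer_norms = n_layers * 2 * 2 * hidden_size  # two LayerNorms per layer, weight + bias each
embeddings = vocab_size * hidden_size

total = linear_weights + linear_biases + layer_norms + embeddings
print(f"Counted parameters: {total:,}")
print(f"Size at 4 bytes per parameter: {total * 4 / 1e9:.2f} GB")
print(f"Linear weights at roughly 0.5 bytes each (4-bit): {linear_weights * 0.5 / 1e9:.2f} GB")
```

About 3.03 billion parameters at 4 bytes each is in line with the 12.1G `model.safetensors` download, assuming the checkpoint is stored in 32-bit precision.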

In [31]:
#@title LoRA Configuration

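# r=256 and targeting all linear layers follow Sebastian Raschka's findings;
# note that alpha is r/2 here, whereas his rule of thumb is alpha = 2*r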
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "c_fc", "c_proj"], # use all linear layers
        task_type="CAUSAL_LM",
)
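
With `r=256` on all six linear layers, the adapter is fairly large. The count below is a sketch that assumes the standard LoRA factorisation, where each adapted layer gains two matrices of shapes `r x in_features` and `out_features x r` and, with `bias="none"`, no extra bias terms. The layer shapes are copied from the printed model.

```python
# (in_features, out_features) of the targeted modules, from the printed model
target_shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 256),
    "v_proj": (3072, 256),
    "o_proj": (3072, 3072),
    "c_fc": (3072, 12288),
    "c_proj": (12288, 3072),
}
n_layers = 30
r, lora_alpha = 256, 128

# Each adapted layer adds A (r x in) and B (out x r)
lora_params_per_layer = sum(r * (i + o) for i, o in target_shapes.values())
lora_params = n_layers * lora_params_per_layer
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"Trainable LoRA parameters: {lora_params:,}")
print(f"Scaling factor alpha / r: {scaling}")
print(f"Adapter size at 2 bytes per parameter: {lora_params * 2 / 1e6:.0f} MB")
```

Roughly 381 million adapter parameters at 2 bytes each come to about 763 MB, which is consistent with the size of the `adapter_model.safetensors` file downloaded in the inference section, assuming 16-bit storage.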

In [32]:
# @title SFT Training Arguments

# Adapted from Phil Schmid's blog post
args = TrainingArguments(
    output_dir=model_path,                  # directory to save the model and repository id
    num_train_epochs=1,                     # number of training epochs, use 3 at most
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    gradient_checkpointing_kwargs={"use_reentrant": False}, # needed if gradient checkpoint is used
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=100,                      # number of steps between two logs
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision for better performance
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper; has no effect with the "constant" scheduler
    lr_scheduler_type="constant",           # use constant learning rate scheduler ("constant_with_warmup" would apply the warmup)
    push_to_hub=True,                       # push model to Hugging Face hub, optional
    hub_model_id="starcoder2-3b-sft-qlora-cypher", # if the model is pushed on HuggingFace
    report_to="tensorboard",                # report metrics to tensorboard, optional, if model is pushed to HuggingFace
)

In [33]:
# @title SFTTrainer Parameters

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=tokenizer.model_max_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # the template adds the special tokens
        "append_concat_token": False, # no need to add additional separator token
    }
)

Generating train split: 0 examples [00:00, ? examples/s]
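
With `packing=True`, short examples are concatenated into fixed-length training sequences instead of being padded individually. The toy function below sketches the idea on small integer lists; `pack_sequences` is a hypothetical helper written for this illustration and is not the TRL implementation.

```python
from itertools import chain

def pack_sequences(tokenized_examples, seq_length):
    """Concatenate token-id lists and cut the stream into blocks of seq_length (toy sketch)."""
    stream = list(chain.from_iterable(tokenized_examples))
    blocks = [stream[i:i + seq_length] for i in range(0, len(stream), seq_length)]
    # In this sketch, a trailing block shorter than seq_length is dropped
    return [block for block in blocks if len(block) == seq_length]

# Four "tokenized" examples of different lengths
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack_sequences(examples, seq_length=4))
```

Because example boundaries no longer line up with sequence boundaries, the end-of-turn and EOS markers added by the chat template are what separate one example from the next inside a packed sequence.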

In [34]:
#@title Train the Model

trainer.train()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss |
|------|---------------|
| 100  | 0.5703        |
| 200  | 0.1521        |
| 300  | 0.0982        |
| 400  | 0.0735        |




TrainOutput(global_step=471, training_loss=0.19977091426808868, metrics={'train_runtime': 1471.5846, 'train_samples_per_second': 1.92, 'train_steps_per_second': 0.32, 'total_flos': 1.131939646930944e+17, 'train_loss': 0.19977091426808868, 'epoch': 1.0})
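
The step count can be related to the batch settings with a little arithmetic. This sketch assumes a single GPU and that every packed sequence has the full length of 2048 tokens, so the token figure is an upper bound rather than a measured value.

```python
per_device_train_batch_size = 3
gradient_accumulation_steps = 2
global_steps = 471        # from TrainOutput
max_seq_length = 2048     # tokenizer.model_max_length

# Sequences consumed per optimizer step (single GPU assumed)
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps
packed_sequences = global_steps * sequences_per_step
max_tokens = packed_sequences * max_seq_length

print(f"Sequences per optimizer step: {sequences_per_step}")
print(f"Packed sequences in one epoch (at most): {packed_sequences}")
print(f"Upper bound on tokens seen: {max_tokens:,}")
```

The figure of about 2826 packed sequences agrees with the reported throughput: 1.92 samples per second over 1471.58 seconds is roughly 2825 samples. The 27104 original examples were therefore packed into about a tenth as many training sequences.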

In [35]:
#@title Save Model Locally

trainer.save_model()



In [36]:
#@title Clear Memory

del trainer
del model
del tokenizer

gc.collect()
torch.cuda.empty_cache()

## Basic Inference Tests

In [37]:
#@title Load Tokenizer & Model

peft_model_id = "solanaO/starcoder2-3b-sft-qlora-cypher"  # if using HuggingFace model
#peft_model_id = model_path  # if using the local model

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  device_map="auto",  # use auto for inference
  torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
#model.resize_token_embeddings(len(tokenizer))

# Text generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

adapter_config.json:   0%|          | 0.00/703 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/8.40k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.33k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/763M [00:00<?, ?B/s]

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalL

In [38]:
#@title Load Test Dataset

test_dataset = load_dataset("json", data_files=data_path+"test_dataset.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [39]:
#@title One Sample Test

prompt = pipe.tokenizer.apply_chat_template(test_dataset[101]["messages"][:2],
                                            tokenize=False,
                                            add_generation_prompt=True)
print(prompt)

<|im_start|>system

You are a text to Cypher query translator. Convert the following question into a Cypher query using the provided graph schema!
Graph schema: Relevant node labels and their properties (with datatypes) are:
Author {affiliation: STRING}
Author {first_name: STRING}
<|im_end|><|endoftext|>
<|im_start|>user
Retrieve the Author where affiliation or first_name contains unspecified!<|im_end|><|endoftext|>
<|im_start|>assistant



In [55]:
# Generate the output and print it

outputs = pipe(prompt,
               max_new_tokens=256,
               do_sample=False  # greedy decoding; temperature, top_k and top_p would be ignored
               )

print(f"Question: {test_dataset[101]['messages'][1]['content']}")
print(f"Correct Cypher: {test_dataset[101]['messages'][2]['content']}")
print(f"Generated Cypher: {outputs[0]['generated_text'][len(prompt):-10].strip()}") # strip the 10-character '<|im_end|>' marker at the end

Question: Retrieve the Author where affiliation or first_name contains unspecified!
Correct Cypher: MATCH (n:Author) WHERE n.affiliation CONTAINS 'unspecified' RETURN n AS node UNION ALL MATCH (m:Author) WHERE m.first_name CONTAINS 'unspecified' RETURN m AS node
Generated Cypher: MATCH (n:Author) WHERE n.affiliation CONTAINS 'unspecified' RETURN n AS node UNION ALL MATCH (m:Author) WHERE m.first_name CONTAINS 'unspecified' RETURN m AS node
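
Slicing off a fixed number of characters only works when the generation ends exactly with the end-of-turn marker. A slightly more defensive variant is sketched below; `extract_answer` and `END_OF_TURN` are hypothetical names introduced for this example, and the sample strings are made up to exercise the function.

```python
END_OF_TURN = "<|im_end|>"

def extract_answer(generated_text, prompt):
    """Return the completion after the prompt, cut at the first end-of-turn marker."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # Keep everything before the first marker; if it is missing, keep the whole completion
    return completion.split(END_OF_TURN, 1)[0].strip()

# Made-up strings that mimic the pipeline output
demo_prompt = "<|im_start|>user\nFetch the Author nodes!<|im_end|>\n<|im_start|>assistant\n"
demo_output = demo_prompt + "MATCH (n:Author) RETURN n<|im_end|>"
print(extract_answer(demo_output, demo_prompt))
```

Unlike the fixed `-10` slice, this also behaves sensibly when generation stops at `max_new_tokens` before the marker is produced.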


## Test the Fine-Tuned Model

In [58]:
#@title Test on a Subset of Samples
from tqdm import tqdm

# Compare the generated text with provided Cypher statement

def evaluate(sample):
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2],
                                                tokenize=False,
                                                add_generation_prompt=True)
    outputs = pipe(prompt,
                   max_new_tokens=256,
                   do_sample=True,
                   temperature=0.7,
                   top_k=50,
                   top_p=0.95
                   )

    predicted_answer = outputs[0]['generated_text'][len(prompt):-10].strip() # strip the 10-character '<|im_end|>' marker at the end

    if predicted_answer == sample["messages"][2]["content"]:
        return 1
    else:
        return 0

success_rate = []
number_of_eval_samples = 100

# Iterate over sample dataset and predict
for s in tqdm(test_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# compute accuracy
accuracy = sum(success_rate)/len(success_rate)

print(f"Accuracy: {accuracy*100:.2f}%")

100%|██████████| 100/100 [05:25<00:00,  3.26s/it]

Accuracy: 85.00%
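
Exact string match is a strict metric: a query that differs from the reference only in spacing or a trailing semicolon counts as a failure. One possible relaxation is sketched below. The helpers `normalize_cypher` and `relaxed_match` are hypothetical and were not used for the accuracy reported above; note that this simple normalisation also collapses whitespace inside string literals, which may not always be desirable.

```python
import re

def normalize_cypher(query):
    """Collapse runs of whitespace and drop a trailing semicolon (illustrative helper)."""
    collapsed = re.sub(r"\s+", " ", query).strip()
    return collapsed.rstrip(";").strip()

def relaxed_match(predicted, reference):
    """Compare two Cypher strings after light normalisation."""
    return normalize_cypher(predicted) == normalize_cypher(reference)

reference = "MATCH (n:Author) RETURN n.affiliation"
print(relaxed_match("MATCH (n:Author)  RETURN n.affiliation;", reference))
print(relaxed_match("MATCH (n:Author) RETURN n.first_name", reference))
```

A stricter test of correctness would run both queries against a database and compare the results, since two syntactically different Cypher statements can return the same data.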



