### Fine-tuning with SpaRTA on a text-generation task 

#### Setup

In [1]:
import os

In [2]:
# pre-trained model 
model_id = 'Qwen/Qwen2.5-0.5B'

# dir path for saving SpaRTA adapter
home_dir = os.path.expanduser('~')
save_dir = os.path.join(home_dir, 'sparta_examples/output/generative_model/')

In [None]:
print(save_dir)

#### Load pre-trained causal language model (for text-generation tasks)

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [5]:
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

model.config.pad_token_id = tokenizer.pad_token_id
tokenizer.padding_side = 'right'
tokenizer.truncation_side = 'right'

#### Create SpaRTA adapter

Instead of full fine-tuning (which is computationally expensive), we train a much smaller subset of the model parameters using a SpaRTA adapter. We choose a sparsity of 99% (`sparsity = 0.99`), which reduces the number of trainable parameters to around 1% of the original model's parameters.
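
To make the idea of training only a sparse subset concrete, here is a minimal plain-Python sketch. It assumes the trainable subset is picked uniformly at random and that only those entries receive updates. `sample_trainable_indices` and `masked_sgd_step` are hypothetical helpers written for illustration, not functions from `peft_sparta`, and the library's actual selection and update logic may differ.

```python
import random

def sample_trainable_indices(num_params, sparsity, seed=0):
    """Hypothetical helper: choose which flat parameter indices stay trainable."""
    rng = random.Random(seed)
    num_trainable = round(num_params * (1.0 - sparsity))
    return sorted(rng.sample(range(num_params), num_trainable))

def masked_sgd_step(weights, grads, trainable, lr):
    """Hypothetical helper: update only the selected entries, freeze the rest."""
    updated = list(weights)
    for idx in trainable:
        updated[idx] -= lr * grads[idx]
    return updated

weights = [1.0] * 1000          # toy "model" with 1,000 parameters
grads = [0.5] * 1000            # pretend every parameter has a gradient
trainable = sample_trainable_indices(len(weights), sparsity=0.99)
new_weights = masked_sgd_step(weights, grads, trainable, lr=0.1)

# Count how many parameters actually moved after one step.
num_changed = sum(1 for old, new in zip(weights, new_weights) if old != new)
print(f"trainable: {len(trainable)}, changed: {num_changed}")
```

Even though every parameter has a gradient in this toy setup, only the selected 1% of entries change, which is the property the adapter relies on.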

In [6]:
from peft_sparta import SpaRTA

In [7]:
model = SpaRTA(model, 0.99).to('cuda')

In [None]:
# print(model)

In [8]:
model.num_trainable_parameters()

Num trainable parameters: 4,943,849 (1.00071%)
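
As a quick sanity check on the reported figures, the short calculation below recovers the total parameter count implied by the output and compares the trainable count with an exact 1% target. It uses only the two numbers printed above, and the implied total is approximate because the percentage is rounded.

```python
num_trainable = 4_943_849      # from the output above
reported_percent = 1.00071     # from the output above

# Total parameter count implied by the reported percentage (approximate).
implied_total = num_trainable / (reported_percent / 100)
print(f"implied total parameters: ~{implied_total / 1e6:.0f}M")

# Number of trainable parameters an exact 99% sparsity would give.
target_trainable = implied_total * (1 - 0.99)
print(f"expected at exactly 99% sparsity: ~{target_trainable:,.0f}")
print(f"difference from reported count: ~{num_trainable - target_trainable:,.0f}")
```

The small gap between the two counts is consistent with the reported percentage being slightly above 1%.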


#### Dataset

We use `trl-lib/Capybara`, a conversational dataset of multi-turn dialogue demonstrations, to train our SpaRTA adapter so that the adapted base model can follow user instructions and hold conversations when user/assistant turns are formatted with the chat template.

In [9]:
from datasets import load_dataset

In [10]:
train_dataset = load_dataset('trl-lib/Capybara', split='train')

Let's look at a training example from the dataset, which is a list of user/assistant chat messages, using the helper function `print_chat`.

In [11]:
def print_chat(messages):
    for i, msg in enumerate(messages):
        role = msg['role']
        content = msg['content']
        print(f"{i+1}. \033[1m{role.upper()}:\033[0m {content.strip()}\n")

In [12]:
# conversation example
print_chat(train_dataset[1]['messages'])

1. [1mUSER:[0m Determine the result obtained by evaluating 5338245-50629795848152. Numbers and symbols only, please.

2. [1mASSISTANT:[0m 5338245 - 50629795848152 = -50629790509907



#### Train SpaRTA

We use `SFTTrainer` from the `trl` library for supervised fine-tuning of the adapter.

In [None]:
model.train()

In [13]:
from trl import SFTConfig, SFTTrainer

In [15]:
sft_config = SFTConfig(
    output_dir=save_dir,
    max_length=512,
    num_train_epochs=1, # max_steps=100
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    max_grad_norm=1.,
    logging_strategy="steps",
    logging_steps=100,
    logging_first_step=True,
    gradient_checkpointing=False,
    save_strategy="no",
    load_best_model_at_end=False,
    report_to="none",
    push_to_hub=False,
    )

In [16]:
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,
    args=sft_config,
    )

Truncating train dataset:   0%|          | 0/15806 [00:00<?, ? examples/s]

In [17]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


Step,Training Loss
1,2.072824
100,1.687627
200,1.663403
300,1.614137
400,1.619442
500,1.610936
600,1.609767
700,1.602426
800,1.602324
900,1.612635


TrainOutput(global_step=988, training_loss=1.6218637413824135, metrics={'train_runtime': 314.7204, 'train_samples_per_second': 50.222, 'train_steps_per_second': 3.139, 'total_flos': 0.0, 'train_loss': 1.6218637413824135})
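
The step count and throughput in `TrainOutput` can be reproduced from the configuration. The sketch below assumes a single-device run (`num_devices = 1` is an assumption, not something stated in the notebook) and takes the dataset size of 15806 examples from the trainer's progress line.

```python
import math

num_examples = 15_806        # training examples reported when the trainer was created
batch_size = 16              # per_device_train_batch_size
grad_accum = 1               # gradient_accumulation_steps
num_devices = 1              # assumption: single-GPU run
train_runtime = 314.7204     # seconds, from TrainOutput

# One optimizer step consumes batch_size * grad_accum * num_devices examples.
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum * num_devices))
steps_per_second = steps_per_epoch / train_runtime
samples_per_second = num_examples / train_runtime

print(steps_per_epoch)
print(round(steps_per_second, 3))
print(round(samples_per_second, 3))
```

Under these assumptions the numbers line up with `global_step`, `train_steps_per_second`, and `train_samples_per_second` in the output.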

Note how the training loss decreases from approximately 2.07 to about 1.61.
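
One way to read these loss values is as perplexity. The conversion below assumes the logged loss is the mean per-token cross-entropy in nats, which is the usual causal language modeling loss. If the trainer reports something else, the interpretation does not hold.

```python
import math

first_loss = 2.072824   # logged at step 1
last_loss = 1.612635    # logged at step 900

# Perplexity is the exponential of the mean per-token cross-entropy.
first_ppl = math.exp(first_loss)
last_ppl = math.exp(last_loss)

print(f"perplexity at step 1:   {first_ppl:.2f}")
print(f"perplexity at step 900: {last_ppl:.2f}")
```

Under that assumption, the model goes from choosing among roughly eight equally likely tokens per position to roughly five.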

#### Inference

Let's test the SpaRTA-adapted model.

In [19]:
input_chat = [
    {'content': 'What is the capital of France?', 'role': 'user'}
]
formatted_input = tokenizer.apply_chat_template(input_chat, tokenize=False, add_generation_prompt=True)
tokenized_input = tokenizer(formatted_input, return_tensors="pt").to("cuda")
model_response = model.generate(**tokenized_input, max_new_tokens=40, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(model_response[0]))

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris. It is the largest city in Europe and the third-largest city in the world. It is also the seat of the French government and the country's cultural and political center
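
The prompt layout in the output above can be approximated by hand. The sketch below is a hypothetical `format_chatml` helper that mimics what is printed (a default system message, `<|im_start|>`/`<|im_end|>` delimiters, and a trailing assistant header). The authoritative template is the one stored with the tokenizer, so treat this only as an illustration of the format.

```python
def format_chatml(messages, add_generation_prompt=True,
                  default_system="You are a helpful assistant."):
    """Hypothetical helper approximating the chat layout seen in the output above."""
    # Prepend a default system message when the conversation has none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": default_system}] + list(messages)
    text = ""
    for msg in messages:
        text += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    # The open assistant header tells the model to continue with its reply.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

prompt = format_chatml([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```

With `add_generation_prompt=True` the prompt ends on an open assistant turn, which is why the generated text starts directly with the answer.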


In [20]:
# print only the model's response (the tokens generated after the prompt)
input_length = tokenized_input.input_ids.shape[1]
new_tokens = model_response[0][input_length:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

The capital of France is Paris. It is the largest city in Europe and the third-largest city in the world. It is also the seat of the French government and the country's cultural and political center


#### Saving the SpaRTA adapter

We save the adapter using the `save` method with `merged=False`, so that only the adapter is saved rather than the entire merged model, which would duplicate most of the parameters of the original model.
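
The trade-off between saving the adapter alone and saving a merged model can be illustrated with a toy example. The sketch below uses made-up weights and JSON as a stand-in serialization format. It does not reflect the on-disk format written by `peft_sparta`, and the `merge` helper is hypothetical.

```python
import json

base = [0.1 * i for i in range(1000)]      # toy frozen base weights
deltas = {3: 0.5, 250: -0.2, 999: 1.0}     # toy sparse deltas: index -> change

def merge(base_weights, sparse_deltas):
    """Hypothetical helper: add each delta to its base weight."""
    merged = list(base_weights)
    for idx, delta in sparse_deltas.items():
        merged[idx] += delta
    return merged

# Adapter-only payload: just the sparse deltas (JSON keys must be strings).
adapter_payload = json.dumps({str(i): d for i, d in deltas.items()})
# Merged payload: every weight, including the unchanged ones.
merged_payload = json.dumps(merge(base, deltas))

# Reloading the adapter and merging it reproduces the merged weights.
restored = {int(i): d for i, d in json.loads(adapter_payload).items()}
assert merge(base, restored) == json.loads(merged_payload)
print(len(adapter_payload), len(merged_payload))
```

The adapter-only payload is much smaller, at the cost of needing the base weights again at load time.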

In [None]:
model.save(save_dir, merged=False)

In [23]:
os.listdir(save_dir)

['config.json', 'sparse_deltas.safetensors']

#### Reloading the adapter for inference

In [24]:
del model, tokenizer

We use the `SpaRTAforCausalLM` class to reload the SpaRTA adapter and merge it into the original pre-trained base model, `Qwen/Qwen2.5-0.5B`.

In [25]:
from peft_sparta import SpaRTAforCausalLM
from transformers import AutoTokenizer

In [26]:
model = SpaRTAforCausalLM(save_dir, 'cuda')

`torch_dtype` is deprecated! Use `dtype` instead!


In [27]:
print(model)

(SpaRTA)ModelForCausalLM(
	adapter = '/u/jriosal/sparta_examples/output/generative_model/'
	base model = 'Qwen/Qwen2.5-0.5B'
)


In [28]:
tokenizer = AutoTokenizer.from_pretrained(model.base_model.config._name_or_path)

In [29]:
tokenizer.padding_side = 'left'
model.generation_config.pad_token_id = tokenizer.pad_token_id
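
The switch to left padding matters when several prompts of different lengths are generated in one batch: a decoder-only model appends new tokens after the last position, so padding on the left keeps each prompt's final real token next to the generated text. The sketch below illustrates the two layouts with a hypothetical `pad_batch` helper and made-up token ids. With a single prompt, as in this notebook, no padding is added either way.

```python
def pad_batch(sequences, pad_id, side):
    """Hypothetical helper: pad token-id lists to a common length on one side."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        padded.append(padding + seq if side == "left" else seq + padding)
    return padded

PAD = 0                                   # made-up pad id for illustration
batch = [[11, 12, 13, 14], [21, 22]]      # made-up token ids for two prompts

left = pad_batch(batch, PAD, side="left")
right = pad_batch(batch, PAD, side="right")
print(left)    # real tokens end at the last position in every row
print(right)   # the shorter row ends in pad tokens
```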

In [30]:
# token ids used as stopping criteria during text generation
gen_terminators = [
    tokenizer.eos_token_id, 
    tokenizer.convert_tokens_to_ids('<|im_end|>')
]
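
Passing a list of ids as `eos_token_id` makes generation stop as soon as any one of them is produced. The toy function below mimics that effect on an already generated sequence. `truncate_at_terminator` is a hypothetical helper and the token ids are made up.

```python
def truncate_at_terminator(token_ids, terminators):
    """Hypothetical helper: keep tokens up to and including the first terminator."""
    stop_ids = set(terminators)
    for position, token_id in enumerate(token_ids):
        if token_id in stop_ids:
            return token_ids[: position + 1]
    # No terminator found: the sequence ran to its length limit.
    return token_ids

EOS, IM_END = 1, 2                        # made-up ids for illustration
generated = [40, 41, 42, IM_END, 43, 44]
stopped = truncate_at_terminator(generated, [EOS, IM_END])
print(stopped)
```

Including `<|im_end|>` alongside the regular end-of-sequence token lets a chat-formatted reply end at the close of the assistant turn instead of running on to `max_new_tokens`.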

Let's test the reloaded adapted model on the same prompt as before.

In [31]:
input_chat = [
    {'content': 'What is the capital of France?', 'role': 'user'}
]
formatted_input = tokenizer.apply_chat_template(input_chat, tokenize=False, add_generation_prompt=True)
tokenized_input = tokenizer(formatted_input, return_tensors="pt").to("cuda")
model_response = model.generate(**tokenized_input, max_new_tokens=40, eos_token_id=gen_terminators)
print(tokenizer.decode(model_response[0]))

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris. It is the largest city in Europe and the third-largest city in the world. It is also the seat of the French government and the country's cultural and political center


The response of the reloaded adapted model is identical to the one generated before the adapter was saved.