# Fine-tuning DeepSeek R1 Distill Qwen 1.5B model for ReAct prompts

We will load the base DeepSeek R1 Distill Qwen 1.5B model and fine-tune it on a ReAct dataset that we created. This lets us build agents that run locally, since ReAct prompts don't work well with the base model as is. We will then save the LoRA weights from the trained model to use for inference later on.

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Loading base model

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.4: Fast Qwen2 patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# Preparing training dataset

Get an idea of the prompt format the model uses by passing an example conversation through the model's tokenizer.

In [None]:
messages=[
  {"role":"user","content":"this is the first question"},
  {"role":"assistant","content":"response to first question"},
  {"role":"user","content":"followup second question"},
  {"role":"assistant","content":"response to second question"},
  {"role":"user","content":"followup third question"},
  {"role":"assistant","content":"response to third question"}
]

output = tokenizer.apply_chat_template(messages)
print(tokenizer.decode(output))

<｜begin▁of▁sentence｜><｜User｜>this is the first question<｜Assistant｜>response to first question<｜end▁of▁sentence｜><｜User｜>followup second question<｜Assistant｜>response to second question<｜end▁of▁sentence｜><｜User｜>followup third question<｜Assistant｜>response to third question<｜end▁of▁sentence｜>


Use the prompt format above as a reference for formatting the training data points. We also add a think step to each assistant message (we are not sure this is correct).

In [None]:
from datasets import Dataset
import json


# Function to process messages into the desired format
def format_conversation(example):
    messages = example["messages"]

    formatted_conversation = "<｜begin▁of▁sentence｜>"

    for msg in messages:
        content = msg["content"].strip()
        if msg["role"] == "system":
            formatted_conversation += content
        elif msg["role"] == "user":
            formatted_conversation += f"<｜User｜>{content}"
        elif msg["role"] == "assistant":
            # Hardcoded think message; this could be improved by varying it per data point.
            formatted_conversation += f"<｜Assistant｜><think>Since user has anyway asked to put thoughts in \"Thought\" section I'll put the thoughts there.</think>{content}<｜end▁of▁sentence｜>"

    return {"text": formatted_conversation}

# Load JSONL file and create a dataset (skip blank lines, close the file when done)
jsonl_file = "react_dataset.jsonl"
with open(jsonl_file, "r", encoding="utf-8") as f:
    data_list = [json.loads(line) for line in f if line.strip()]

# Convert to Hugging Face dataset and process
dataset = Dataset.from_list(data_list).map(format_conversation)
dataset = dataset.shuffle(seed=42)  # Randomly shuffle the data points

Map:   0%|          | 0/60 [00:00<?, ? examples/s]
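
Note that `format_conversation` silently drops any message whose role is not `system`, `user` or `assistant`, so it can help to check the records before mapping. The sketch below is only an illustration: `validate_record` is a hypothetical helper, and the sample record is made up in the shape that `react_dataset.jsonl` is assumed to use (a `messages` list of `role`/`content` dicts).

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        content = msg.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    return problems

# Hypothetical JSONL line, in the assumed shape of the training file
sample_line = json.dumps({
    "messages": [
        {"role": "system", "content": "You run in a loop of Thought, Action, Observation."},
        {"role": "user", "content": "Can you create a playlist with 'Shape of You'?"},
        {"role": "assistant", "content": "Thought: I should create the playlist.\nAction: create_playlist: ['Shape of You']"},
    ]
})

print(validate_record(json.loads(sample_line)))
```

An empty list means the record would be formatted without losing any message.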

In [None]:
dataset

Dataset({
    features: ['messages', 'text'],
    num_rows: 60
})

In [None]:
print(dataset[0]["text"])

<｜begin▁of▁sentence｜>You run in a loop of Thought, Action, Observation.
At the end of the loop, you output an Answer.

Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then wait for getting the observation of action in the next message.
Observation will be the result of running those actions.

Your available actions are:
```
create_playlist:
usage - create_playlist: <songs: list of strings>
e.g. create_playlist: ['Song 1', 'Song 2', 'Song 3']
Creates a playlist from the provided list of songs.

add_to_playlist:
usage - add_to_playlist: <playlist_name: str>, <song: str>
e.g. add_to_playlist: 'My Playlist', 'Song 4'
Adds a song to an existing playlist.

get_playlist_details:
usage - get_playlist_details: <playlist_name: str>
e.g. get_playlist_details: 'My Playlist'
Returns the details of the specified playlist.
```<｜User｜>Can you create a playlist with 'Shape of You', 'Blinding Lights', and 'Levitating'?

# Model training

In [None]:
# Increased r (adds trainable parameters) and lora_alpha (scales the adapter update) because a smaller adapter was not giving good results

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha = 24,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
    use_rslora = False,
    loftq_config = None,
)

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.2.4 patched 28 layers with 28 QKV layers, 28 O layers and 0 MLP layers.
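
As a sanity check on the adapter size, the number of trainable parameters can be worked out by hand: a LoRA adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` weights. The sketch below assumes the attention dimensions of the 1.5B Qwen2 backbone (hidden size 1536, 2 key/value heads of dimension 128); these values are not stated in this notebook and are an assumption. Under that assumption the total matches the 4,358,144 trainable parameters reported in the training banner.

```python
# Assumed architecture values (not stated in this notebook)
hidden_size = 1536
num_kv_heads = 2
head_dim = 128

num_layers = 28   # from the Unsloth message: "patched 28 layers"
r = 16            # LoRA rank used in get_peft_model

kv_dim = num_kv_heads * head_dim  # output size of k_proj and v_proj

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(per_layer, total)
```

Only `r` changes this count; `lora_alpha` rescales the adapter output and adds no parameters.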


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Only played around a bit with warmup_steps, max_steps and learning_rate

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 2,
        max_steps = 50,
        learning_rate = 5e-3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # "none" disables logging integrations; set e.g. "wandb" to enable one
    ),
)

Map (num_proc=2):   0%|          | 0/60 [00:00<?, ? examples/s]
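
To see what `warmup_steps = 2`, `max_steps = 50` and `lr_scheduler_type = "linear"` mean for the learning rate, here is a small stand-alone sketch of a linear warmup followed by a linear decay to zero. `linear_lr` is a hypothetical helper written for illustration; it is meant to mirror the usual linear schedule, not to reproduce the trainer's internals exactly.

```python
def linear_lr(step, base_lr=5e-3, warmup_steps=2, max_steps=50):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0.0, (max_steps - step) / (max_steps - warmup_steps))
    return base_lr * remaining

# The rate climbs to 5e-3 over the first two steps, then falls linearly to 0 at step 50
for step in (0, 1, 2, 26, 50):
    print(step, linear_lr(step))
```

With only two warmup steps the model sees the peak rate almost immediately, which is one reason these three settings are the ones worth tuning together.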

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 60 | Num Epochs = 8
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 50
 "-____-"     Number of trainable parameters = 4,358,144


Step,Training Loss
1,2.5309
2,2.5996
3,1.9231
4,1.8086
5,1.4773
6,1.2296
7,1.0406
8,0.79
9,0.7697
10,0.6449
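
The batch size and epoch count in the banner above follow from the training arguments. The arithmetic below is one plausible reconstruction, assuming the trainer counts optimizer updates per epoch by integer division and rounds the epoch count up; the exact bookkeeping can differ between library versions.

```python
import math

num_examples = 60
per_device_batch = 2
grad_accum = 4
max_steps = 50

# Samples consumed per optimizer update
effective_batch = per_device_batch * grad_accum

# Micro-batches in one pass over the data, and optimizer updates they yield
micro_batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = max(micro_batches_per_epoch // grad_accum, 1)

# Epochs needed to reach max_steps, rounded up
epochs_reported = math.ceil(max_steps / updates_per_epoch)

# Rough number of times each example is seen
passes_over_data = max_steps * effective_batch / num_examples

print(effective_batch, updates_per_epoch, epochs_reported, round(passes_over_data, 2))
```

So each of the 60 examples is seen roughly six to seven times, which is worth keeping in mind when reading the rapidly falling loss.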


# Inference with the trained model

In [None]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model)

# System prompt, split across adjacent string literals for readability
system_prompt = (
    "You run in a loop of Thought, Action, Observation.\n"
    "At the end of the loop, you output an Answer.\n\n"
    "Use Thought to describe your thoughts about the question you have been asked.\n"
    "Use Action to run one of the actions available to you - then wait for getting the observation of action in next message.\n"
    "Observation will be the result of running those actions.\n\n"
    "If the tool requires arguments and the user has not provided them, **ask the user for the missing details** instead of assuming values. "
    "Do not make arbitrary selections when multiple options are available—ask the user to choose.\n\n"
    "Your available actions are:\n"
    "```\n"
    "find_restaurant:\n"
    "usage - find_restaurant: <cuisine: str>\n"
    "e.g. find_restaurant: Italian\n"
    "Returns a list of restaurants matching the given cuisine.\n\n"
    "book_table:\n"
    "usage - book_table: <restaurant_id: str>, <time: str>, <party_size: int>\n"
    "e.g. book_table: 1234, 7:00 PM, 4\n"
    "Books a table at the specified restaurant for the given time and party size.\n"
    "```\n\n"
    "Example session:\n"
    "```\n"
    "Question: I'm hungry. Thought: The user hasn't specified what type of food they are in the mood for. I need to ask for their cuisine preference. Answer: Sure! What type of food are you in the mood for?\n\n"
    "Question: I'm in the mood for Italian food. Thought: I should search for Italian restaurants. Action: find_restaurant: Italian\n\n"
    "<end message, action will be performed and observation will be sent in next message>\n\n"
    "Observation: The search returned the following Italian restaurants:\n- La Trattoria\n- Bella Cucina\n- Pasta & Vino Answer: Here are some Italian restaurants:\n1. La Trattoria\n2. Bella Cucina\n3. Pasta & Vino\nWould you like to book a table at one of these?\n\n"
    "Question: I'd like to book a table at La Trattoria. Thought: The user wants to book a table at La Trattoria. I need to confirm the time and party size. Answer: Sure! What time and how many people will be joining you?\n\n"
    "Question: 7:00 PM for 4 people. Thought: The user has specified the time and party size. I will go ahead and book the table. Action: book_table: 1234, 7:00 PM, 4\n\n"
    "<end message, action will be performed and observation will be sent in next message>\n"
    "```"
)

# Conversation so far; the last user message carries the tool observation
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "I'm hungry"},
    {"role": "assistant", "content": "Thought: The user hasn't specified what type of food they are in the mood for. I need to ask for their cuisine preference. Answer: Sure! What type of food are you in the mood for?"},
    {"role": "user", "content": "I'm in the mood for Italian food."},
    {"role": "assistant", "content": "Thought: I should search for Italian restaurants. Action: find_restaurant: Italian"},
    {"role": "user", "content": "Observation: The search returned the following Italian restaurants:\n- La Trattoria\n- Bella Cucina\n- Pasta & Vino"},
]


inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<｜begin▁of▁sentence｜>You run in a loop of Thought, Action, Observation.
At the end of the loop, you output an Answer.

Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then wait for getting the observation of action in next message.
Observation will be the result of running those actions.

If the tool requires arguments and the user has not provided them, **ask the user for the missing details** instead of assuming values. Do not make arbitrary selections when multiple options are available—ask the user to choose.

Your available actions are:
```
find_restaurant:
usage - find_restaurant: <cuisine: str>
e.g. find_restaurant: Italian
Returns a list of restaurants matching the given cuisine.

book_table:
usage - book_table: <restaurant_id: str>, <time: str>, <party_size: int>
e.g. book_table: 1234, 7:00 PM, 4
Books a table at the specified restaurant for the given time and party size.
```

Example session
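
To use such a response in an agent loop, the generated text has to be split back into its Thought, Action and Answer parts. The following is a minimal sketch of that step, assuming the model follows the `Thought: ... Action: <tool>: <args>` and `Thought: ... Answer: ...` layout from the system prompt and may emit a `<think>...</think>` block first. `parse_react_response` is a hypothetical helper, not part of any library.

```python
import re

def parse_react_response(text):
    """Split a model response into its Thought, Action, and Answer parts."""
    # Drop the reasoning block that the model may emit before the reply
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    parsed = {"thought": None, "action": None, "action_input": None, "answer": None}

    # The Thought runs until the next Action:/Answer: marker or the end of the text
    thought = re.search(r"Thought:\s*(.*?)(?=\s*(?:Action:|Answer:)|$)", text, flags=re.DOTALL)
    if thought:
        parsed["thought"] = thought.group(1).strip()

    # An Action has the form "<tool_name>: <arguments>"
    action = re.search(r"Action:\s*(\w+):\s*(.*)", text, flags=re.DOTALL)
    if action:
        parsed["action"] = action.group(1)
        parsed["action_input"] = action.group(2).strip()

    answer = re.search(r"Answer:\s*(.*)", text, flags=re.DOTALL)
    if answer:
        parsed["answer"] = answer.group(1).strip()
    return parsed

reply = "<think>internal notes</think>Thought: I should search for Italian restaurants. Action: find_restaurant: Italian"
print(parse_react_response(reply))
```

If `action` is set, the agent would run that tool and send the result back as an `Observation:` user message; if `answer` is set, the loop ends for that turn.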

# Save LoRA weights

This saves only the LoRA adapter, not the complete model.

In [None]:
model.save_pretrained("react_model")

In [None]:
tokenizer.save_pretrained("react_model")

('react_model/tokenizer_config.json',
 'react_model/special_tokens_map.json',
 'react_model/tokenizer.json')

Zip the adapter directory so that it can be downloaded from Google Colab.

In [None]:
!zip -r ./react_model.zip ./react_model

  adding: react_model/ (stored 0%)
  adding: react_model/README.md (deflated 66%)
  adding: react_model/special_tokens_map.json (deflated 70%)
  adding: react_model/adapter_model.safetensors (deflated 7%)
  adding: react_model/tokenizer_config.json (deflated 84%)
  adding: react_model/tokenizer.json (deflated 81%)
  adding: react_model/adapter_config.json (deflated 54%)


At this point we have trained the model and downloaded its LoRA weights to the local machine. The adapter should be fairly small compared with the full model.
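
For a rough idea of the adapter size, multiply the 4,358,144 trainable parameters reported during training by the bytes used per parameter. The storage precision of `adapter_model.safetensors` is an assumption here, so the sketch shows both common cases; `adapter_size_mb` is a hypothetical helper.

```python
def adapter_size_mb(num_params, bytes_per_param):
    """Approximate size of the adapter weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

trainable_params = 4_358_144  # from the training banner

# Assumed storage precisions; the actual dtype of the saved file may differ
for dtype, nbytes in {"float32": 4, "float16": 2}.items():
    print(dtype, round(adapter_size_mb(trainable_params, nbytes), 1), "MB")
```

Either way the adapter is far smaller than the 1.81 GB base checkpoint.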

We will see how to make use of these weights in the next notebook.