<a href="https://colab.research.google.com/github/johntango/DeepSeek-R1-Medical/blob/main/fine_tune_deepseek_r1_math_gsm8k.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Setting Up the Working Environment

In [None]:
%%capture
!pip install unsloth vllm
!pip install datasets
# Optional: uncomment to reinstall the latest Unsloth directly from GitHub.
# !pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git



In [None]:
from huggingface_hub import login
from google.colab import userdata

# 'Lora2' is the name of the Colab secret that holds the Hugging Face token.
# Change it to match the name of your own secret.
hf_token = userdata.get('Lora2')

# If the model is private or gated, ensure the token has the necessary permissions.
login(token=hf_token)


In [None]:
import wandb

wb_token = userdata.get('WANDB_API_KEY')

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjrwtango[0m ([33mjrwtango-massachusetts-institute-of-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


# 2. Loading the Model and Tokenizer

In [None]:
from unsloth import FastLanguageModel

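max_seq_length = 2048  # maximum sequence length (in tokens) for training and inference
dtype = None           # None lets Unsloth choose the dtype for the detected GPU
load_in_4bit = True    # load the weights with 4-bit quantization to reduce GPU memory use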



In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,

)

==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# 3. Test the Model with Zero Shot Inference

In [None]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a math expert with advanced knowledge in math reasoning.
Please answer the following math question.

### Question:
{}

### Response:
<think>{}"""

In [None]:
question = "Amy buy 10 tomatoes at 10 cents each. She sells nearly a third of them. Whats the value of the ones remaining?"

FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so Amy bought 10 tomatoes, each costing 10 cents. I need to figure out the value of the tomatoes she has left after selling nearly a third of them. Let me break this down step by step.

First, Amy bought 10 tomatoes. Each tomato costs 10 cents. So, the total cost for all 10 tomatoes would be 10 tomatoes multiplied by 10 cents per tomato. That should give me the total amount she spent.

Next, she sells nearly a third of them. "Nearly a third" isn't a precise fraction, but for the sake of calculation, I can assume she sold exactly one-third. So, one-third of 10 tomatoes is 10 divided by 3, which is approximately 3.333 tomatoes. But since you can't sell a fraction of a tomato, maybe she sold 3 tomatoes, leaving her with 7. Or perhaps she sold 4, leaving her with 6. The problem isn't entirely clear, but I think it's safer to assume she sold exactly one-third, which is about 3.333, but since you can't have a third of a tomato, maybe the problem just wants us to consider the r

# 4. Loading and Processing the Dataset

In [None]:
# Choose which dataset and prompt to fine-tune on: "medical" or "math".
# The template text is kept flush-left so that the training prompt carries no
# stray indentation and matches the layout of prompt_style used at inference.
style = "medical"
if style == "medical":
    train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""
else:
    train_prompt_style = """Below is a question that involves math.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a math expert with advanced knowledge in math reasoning.
Please answer the following math question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

In [None]:
EOS_TOKEN = tokenizer.eos_token  # appended to every example so the model learns where to stop

if style == "medical":
    def formatting_prompts_func(examples):
        questions = examples["Question"]
        cots = examples["Complex_CoT"]
        responses = examples["Response"]
        texts = []
        for question, cot, response in zip(questions, cots, responses):
            texts.append(train_prompt_style.format(question, cot, response) + EOS_TOKEN)
        return {"text": texts}
else:
    def formatting_prompts_func(examples):
        questions = examples["question"]
        answers = examples["answer"]
        texts = []
        for question, answer in zip(questions, answers):
            # GSM8K stores the worked solution and the final answer in one string,
            # separated by "####". Split them to fill all three template slots.
            cot, _, final = answer.partition("####")
            texts.append(train_prompt_style.format(question, cot.strip(), final.strip()) + EOS_TOKEN)
        return {"text": texts}
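With `batched = True`, `dataset.map` passes the formatting function a dictionary that maps each column name to a list of values, and expects a dictionary of equal-length lists back. The toy example below sketches that contract without loading a model or dataset. `TEMPLATE`, `EOS`, `format_batch`, and `toy_batch` are hypothetical stand-ins for `train_prompt_style`, `tokenizer.eos_token`, `formatting_prompts_func`, and a real batch.

```python
# Shortened stand-in for train_prompt_style with the same three slots.
TEMPLATE = "### Question:\n{}\n\n### Response:\n<think>\n{}\n</think>\n{}"
EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Turn a batch of columns into a batch of formatted training strings."""
    texts = [
        TEMPLATE.format(question, cot, response) + EOS
        for question, cot, response in zip(
            examples["Question"], examples["Complex_CoT"], examples["Response"]
        )
    ]
    return {"text": texts}


# A made-up batch with two rows, shaped the way dataset.map would pass it.
toy_batch = {
    "Question": ["Question one?", "Question two?"],
    "Complex_CoT": ["Reasoning one.", "Reasoning two."],
    "Response": ["Answer one.", "Answer two."],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

The returned `"text"` list has one entry per input row, which is what lets `map` attach it as a new column.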

In [None]:
from datasets import load_dataset

if style == "medical":
    dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:500]", trust_remote_code=True)
else:
    dataset = load_dataset("openai/gsm8k", "main", split="train[0:1000]", trust_remote_code=True)

# Add a "text" column holding the fully formatted training prompt for each example.
dataset = dataset.map(formatting_prompts_func, batched=True)
dataset["text"][0]  # inspect the first formatted example

DatasetNotFoundError: Dataset 'FreedomIntelligence/medical-o1-reasoning-SFT' doesn't exist on the Hub or cannot be accessed.

# 5. Fine-Tune the LLM

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.1.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
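The number of trainable parameters that Unsloth reports when training starts (41,943,040) can be reproduced by hand. A LoRA adapter of rank `r` on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below assumes the standard Llama-3.1-8B layer shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers). These shapes are not printed anywhere in this notebook, so treat them as assumptions.

```python
# Assumed Llama-3.1-8B shapes (not taken from the notebook output).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads of size 128 (grouped-query attention)
NUM_LAYERS = 32
RANK = 16  # matches r=16 in get_peft_model

# (in_features, out_features) for each module listed in target_modules.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# → 1,310,720 per layer, 41,943,040 in total
```

Doubling `r` doubles this count, since every term is linear in the rank.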


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/500 [00:00<?, ? examples/s]
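The arguments `warmup_steps=5`, `max_steps=60`, `learning_rate=2e-4`, and `lr_scheduler_type="linear"` together describe a learning rate that ramps up linearly for 5 steps and then decays linearly to zero at step 60. The function below is a plain-Python sketch of that shape. `linear_schedule_lr` is a hypothetical helper written for illustration and is not called by the trainer.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at a given optimizer step for linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

With only 60 steps in total, the rate spends most of the run decaying. Longer runs would typically scale the warmup with the run length, which is what the `warmup_ratio` hint in the code comment refers to.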

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,1.9188
20,1.4615
30,1.4023
40,1.3088
50,1.3443
60,1.314
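The figures in the training banner above follow from the trainer arguments. The short calculation below shows how the total batch size of 8 arises and how much of the 500-example dataset the 60 steps cover. All values are copied from the notebook; only the variable names are new.

```python
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1           # "Num GPUs = 1" in the banner
max_steps = 60
num_examples = 500     # "Num examples = 500" in the banner

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
# Examples consumed over the whole run.
examples_seen = effective_batch * max_steps
# Fraction of one pass over the dataset.
epochs = examples_seen / num_examples

print(effective_batch, examples_seen, epochs)
# → 8 480 0.96
```

The run therefore stops slightly short of one full pass over the data, which the banner reports as a single epoch.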


# 6. Model Inference After Fine-Tuning

In [None]:
question = "John is 5 years older than his brother who is younger by the number of letters in Italy. How old is the sister?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])



<think>
Okay, let's see. We have a 55-year-old woman with a history of urine leakage whenever she does things like laugh or lift heavy objects. That sounds like a urinary incontinence issue. She's not having any urgency or leakage at night, which is good news because it means she's not dealing with nocturnal enuresis. 

Now, let's think about what might be causing her symptoms. Since she doesn't have urgency, it's unlikely to be a problem with her bladder emptying completely on its own. That points us towards a possible urethral issue, like a urethral sphincter deficiency. 

We've already done a Q-tip test, which helps check for urethral resistance. If the test shows low resistance, it's a strong indicator that we're dealing with urethral sphincter deficiency. This makes me think about how the bladder might function in such a case. 

In urethral sphincter deficiency, the bladder is usually able to empty on its own without any need for detrusor muscle activity. So, during cystometry, I

# 7. Saving the Model Locally and to the Hugging Face Hub

In [None]:
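new_model_local = "DeepSeek-R1-Fine-tuned-Medical"

# Save only the LoRA adapter weights and the tokenizer.
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

# Merge the LoRA weights into the base model and save the result in 16-bit precision.
model.save_pretrained_merged(new_model_local, tokenizer, save_method="merged_16bit")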

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.77 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 38%|███▊      | 12/32 [00:01<00:02,  8.44it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:56<00:00,  3.64s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving DeepSeek-R1-Fine-tuned-Medical/pytorch_model-00001-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Fine-tuned-Medical/pytorch_model-00002-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Fine-tuned-Medical/pytorch_model-00003-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Fine-tuned-Medical/pytorch_model-00004-of-00004.bin...
Done.


In [None]:
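# Replace the namespace with your own Hugging Face username or organization.
new_model_online = "YoussefHosni/DeepSeek-R1-Fine-tuned-Medical"

# Upload the LoRA adapter and the tokenizer.
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)

# Upload the merged 16-bit model.
model.push_to_hub_merged(new_model_online, tokenizer, save_method="merged_16bit")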