# DeepSeek-1.5-FinQA: Financial Reasoning Tutorial

Welcome to this Google Colab tutorial for training **DeepSeek-1.5B-FinQA_Reasoner** - a specialized 1.5B-parameter language model optimized for finance domain reasoning and mathematical problem-solving. This guide will walk you through the process of fine-tuning and deploying a model that combines financial expertise with structured reasoning capabilities.

**Key Tutorial Focuses**:
- 🧠 Leveraging GRPO (Group Relative Policy Optimization) for finance domain adaptation
- 📊 Curated training data from FinQA
- ⚡ Efficient deployment using 4-bit quantization via Unsloth
- 🩺 Practical applications in financial analysis
- 🪽 Reward Engineering and Evaluations via Pegasi

In [2]:
%%capture
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook locally, you also need to install `diffusers`
# !pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
!pip install ipywidgets
!pip install diffusers
!pip install -i https://test.pypi.org/simple/ pegasi==0.2.6

We will be using the amazing Unsloth library for this tutorial.

In [3]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-12 18:22:30 __init__.py:190] Automatically detected platform cuda.


## Download and initialize the model
We first download the model and reserve about 40% of the GPU memory (`gpu_memory_utilization = 0.4`) for vLLM inference, which speeds up generation during GRPO training with QLoRA.
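As a rough check on what that memory fraction leaves for generation, the figures vLLM logs when the next cell runs can be added up by hand. The sketch below only redoes that arithmetic with numbers copied from this notebook's output (a 14.74 GiB T4 and an actual utilization of 39.73%); the variable names are illustrative and nothing here queries the GPU.

```python
# Figures copied from the vLLM log lines shown in this notebook (all in GiB).
total_gpu_memory = 14.74
actual_utilization = 0.3973   # logged as "actual GPU utilization = 39.73%"
model_weights = 1.54
non_torch_memory = 0.05
activation_peak = 1.05

# Memory vLLM is allowed to use, then what remains for the KV cache.
vllm_budget = total_gpu_memory * actual_utilization
kv_cache = vllm_budget - model_weights - non_torch_memory - activation_peak

print(f"vLLM budget: {vllm_budget:.2f} GiB")
print(f"KV cache:    {kv_cache:.2f} GiB")
```

The two values round to the 5.86 GiB budget and the 3.22 GiB KV cache that the log reports, so lowering `gpu_memory_utilization` shrinks the KV cache first.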

In [4]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.4, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

Unsloth: Switching from Unsloth dynamic quant to normal quant since
we do not yet support fast inference for unsloth/deepseek-r1-distill-qwen-1.5b-unsloth-bnb-4bit
==((====))==  Unsloth 2025.2.5: Fast Qwen2 patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Unsloth: vLLM loading unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit with actual GPU utilization = 39.73%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 4.09 GB. Also swap space = 2 GB.
INFO 02-12 18:23:04 config.py:542] This model supports multiple tasks: {'classify', 'embed', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'float16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['lm_head', 'multi_modal_projector', 'merger', 'modality_projection'], 'llm_int8_threshold': 6.0}
INFO 02-12 18:23:04 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2

tokenizer_config.json:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

INFO 02-12 18:23:08 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-12 18:23:08 cuda.py:227] Using XFormers backend.
INFO 02-12 18:23:09 model_runner.py:1110] Starting to load model unsloth/deepseek-r1-distill-qwen-1.5b-bnb-4bit...
INFO 02-12 18:23:09 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 02-12 18:23:10 weight_utils.py:252] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/1.61G [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-12 18:23:33 model_runner.py:1115] Loading model weights took 1.5365 GB
INFO 02-12 18:23:33 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-12 18:23:50 worker.py:267] Memory profiling takes 15.94 seconds
INFO 02-12 18:23:50 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.40) = 5.86GiB
INFO 02-12 18:23:50 worker.py:267] model weights take 1.54GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 3.22GiB.
INFO 02-12 18:23:50 executor_base.py:110] # CUDA blocks: 7545, # CPU blocks: 4681
INFO 02-12 18:23:51 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 58.95x
INFO 02-12 18:23:54 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error oc

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:46<00:00,  1.74s/it]

INFO 02-12 18:24:41 model_runner.py:1562] Graph capturing finished in 47 secs, took 0.47 GiB
INFO 02-12 18:24:41 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 67.40 seconds





Unsloth 2025.2.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
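To see how the LoRA rank translates into trainable parameters, the count can be worked out from the layer shapes. This is a minimal sketch, assuming the published Qwen2.5-1.5B dimensions (hidden size 1536, MLP size 8960, 2 key/value heads of width 128); those numbers are not stated in this tutorial, and only the 28 layers and the rank of 64 come from the cell above.

```python
# Assumed Qwen2.5-1.5B dimensions (not stated in this tutorial).
hidden_size = 1536
intermediate_size = 8960
head_dim = 128
num_kv_heads = 2

# Values taken from the cell above.
num_layers = 28
lora_rank = 64

kv_dim = num_kv_heads * head_dim  # output width of k_proj and v_proj

# (in_features, out_features) for every module listed in target_modules.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, lora_rank) for i, o in target_shapes.values())
total_trainable = per_layer * num_layers
print(f"{total_trainable:,}")
```

Under these assumptions the total is 73,859,072, which agrees with the "Number of trainable parameters" line in the training banner later in the notebook. The count scales linearly with `lora_rank`.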


## Continual Pretraining

Now we prepare the data for fine-tuning. We use the FinQA dataset from the Hugging Face Hub (`dreamerdeo/finqa`); a commented-out block in `get_datasets` shows how `openai/gsm8k` could be added as well. As you can see in the code, we keep only examples whose `gold_evidence` is shorter than 1,024 characters, since longer contexts could cause out-of-memory issues during training (in this tutorial we are aiming for a T4 or A10 GPU with 16/24 GB of memory).

We also drop examples with an empty answer. After both filters, 6,094 of the original 6,251 training examples remain.

In [81]:
import re
from datasets import load_dataset, Dataset, interleave_datasets, concatenate_datasets

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def combine_context(example):
    # If pre_text is a list, join its elements into a string
    pre_text = " ".join(example["pre_text"]) if isinstance(example["pre_text"], list) else example["pre_text"]
    # If post_text is a list, join its elements into a string
    post_text = " ".join(example["post_text"]) if isinstance(example["post_text"], list) else example["post_text"]

    # Create the new context field by concatenating the two strings
    example["context"] = pre_text + " " + post_text
    return example

# uncomment middle messages for 1-shot prompting
def get_datasets(split = "train") -> Dataset:
    """
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer']),
        'db_set':'gsm8k'
    }) # type: ignore
    data = data.remove_columns(['question'])
    """

    data_qa = load_dataset("dreamerdeo/finqa")
    data_qa = data_qa["train"]
    data_qa = data_qa.map(combine_context)

    data_qa = data_qa.filter(lambda x: len("\n".join(x['gold_evidence'])) < 1024) # avoid long traces
    data_qa = data_qa.filter(lambda x: len("\n".join(x['answer'])) > 0) # avoid empty answers
    data_qa = data_qa.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {
                "role": "user",
                "content": "Given the financial context below:\n" +
                          "\n".join(x['gold_evidence']) +
                          "\n\nAnswer the following question:\n" +
                          x['question'] +
                          " with careful numerical precision. You need to carefully review the context and reason before answering."
            },
        ],
        'answer': x['answer'],
        'db_set': 'finqa'
    }) # type: ignore
    data_qa = data_qa.remove_columns(['id', 'post_text', 'pre_text', 'context', 'table'])

    dataset = concatenate_datasets([data_qa])
    return dataset


In [82]:
dataset = get_datasets()
dataset

Filter:   0%|          | 0/6251 [00:00<?, ? examples/s]

Filter:   0%|          | 0/6141 [00:00<?, ? examples/s]

Map:   0%|          | 0/6094 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'answer', 'gold_evidence', 'prompt', 'db_set'],
    num_rows: 6094
})
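To make the resulting `prompt` field concrete, here is a small sketch that builds the same two-message prompt for a single record without loading the dataset. The record is copied from the first question printed in the training log further down; treating `gold_evidence` as a list of strings is an assumption based on how the code joins it, and `build_prompt` is an illustrative helper, not part of the notebook.

```python
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""


def build_prompt(record: dict) -> list[dict]:
    """Mirror the prompt construction used inside get_datasets for one record."""
    user_content = (
        "Given the financial context below:\n"
        + "\n".join(record["gold_evidence"])
        + "\n\nAnswer the following question:\n"
        + record["question"]
        + " with careful numerical precision. You need to carefully review"
        + " the context and reason before answering."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Record copied from the training log shown later in this notebook.
record = {
    "gold_evidence": [
        "( in millions ) the $ 8166 of operating income reconciliation is 2015 ccg operating income ;",
        "( in millions ) the $ 10327 of operating income reconciliation is 2014 ccg operating income ;",
    ],
    "question": "what is the growth rate in ccg operating income in 2015?",
    "answer": "-20.9%",
}

prompt = build_prompt(record)
print(prompt[1]["content"])
```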

In [83]:
dataset = get_datasets()
dataset = dataset.shuffle(seed=42)
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
print(f"train size: {len(train_dataset)}, test size: {len(test_dataset)}")

train size: 5484, test size: 610
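The split sizes can be reproduced with a line of arithmetic. This sketch assumes the test share is rounded up to a whole example, which is consistent with the sizes printed above; it does not call the `datasets` library.

```python
import math

num_rows = 6094   # rows left after filtering
test_size = 0.1   # fraction passed to train_test_split

# Assumption: the test split is rounded up, the rest goes to training.
n_test = math.ceil(test_size * num_rows)
n_train = num_rows - n_test

print(f"train size: {n_train}, test size: {n_test}")
```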


# Designing Reward Functions

Personally, I believe the trick to getting good performance with GRPO is well-designed reward functions. As when teaching a dog to perform tricks, we want to give the model larger rewards for difficult tasks and smaller treats when it gets easier ones right. This means we reward the model both for following the response format we want (the `reasoning` and `answer` tags) and for the quality and correctness of its answer.

Let's quickly review them:

## correctness_reward_func

This one checks that the final answer is correct. For `finqa`, both the extracted answer and the ground truth are normalized (whitespace collapsed, text lowercased, and a bare number extracted when the string contains only digits and basic separators). An exact match, or a numeric match within 1% relative difference, earns a reward of 2. Sometimes the model answers with something like `The final answer is 80`, which does not perfectly match the ground truth `80`; the substring check captures such cases to some extent, but the reward is only 1 since we do not want to encourage verbosity. The function also keeps a `gsm8k` branch with the same exact-match and substring logic, and any other dataset falls back to a case-insensitive exact match.

The other reward functions check the format, so that the model responds with proper `reasoning` and `answer` tags.
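The numeric branch of `correctness_reward_func` in the next cell gives full credit when a prediction is within 1% relative difference of the ground truth. The check can be sketched on its own; `within_relative_tolerance` is an illustrative name, and the predictions below are made-up values, not model outputs.

```python
def within_relative_tolerance(pred: float, actual: float, rel_tol: float = 0.01) -> bool:
    """Same comparison as the numeric branch: relative error below rel_tol."""
    # The floor of 1e-10 avoids a division by zero when the ground truth is 0.
    return abs(pred - actual) / max(abs(actual), 1e-10) < rel_tol


# Ground truth 0.13, as in one of the logged FinQA examples.
print(within_relative_tolerance(0.1302, 0.13))  # relative error of about 0.15%
print(within_relative_tolerance(0.14, 0.13))    # relative error of about 7.7%
print(within_relative_tolerance(0.001, 0.0))    # any non-zero miss fails when the truth is 0
```

Because the tolerance is relative, a ground truth of exactly zero only accepts predictions that are essentially zero as well.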

In [84]:
from typing import List, Dict, Any, Union
import re

def normalize_finqa_answer(answer: str) -> str:
    """Normalize FinQA answers by removing whitespace and converting to lowercase."""
    if not isinstance(answer, str):
        return ""
    # Remove extra whitespace and convert to lowercase
    normalized = ' '.join(str(answer).strip().split()).lower()

    # Extract numbers if the answer contains only digits and basic separators
    number_match = re.search(r'[-+]?\d*\.?\d+', normalized)
    if number_match and all(c in '0123456789.-+$ ' for c in normalized):
        return number_match.group(0)

    return normalized

def extract_xml_answer(response: str) -> str:
    """Extract answer from XML tags in the response."""
    if not isinstance(response, str):
        return ""
    match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

def ensure_list_length(rewards: List[float], expected_length: int, default_value: float = 0.0) -> List[float]:
    """Ensure the rewards list has the expected length."""
    if len(rewards) < expected_length:
        rewards.extend([default_value] * (expected_length - len(rewards)))
    return rewards[:expected_length]

def correctness_reward_func(prompts, completions, answer, db_set, **kwargs) -> list[float]:
    """Calculate correctness rewards for model completions."""
    if not completions or not answer or not db_set:
        return [0.0] * len(completions)

    responses = [completion[0]['content'] if isinstance(completion, list) and len(completion) > 0
                and isinstance(completion[0], dict) and 'content' in completion[0]
                else "" for completion in completions]

    if prompts and len(prompts) > 0 and len(prompts[0]) > 0:
        q = prompts[0][-1].get('content', '')
        print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}",
              f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extract_xml_answer(responses[0])}")

    extracted_responses = [extract_xml_answer(r) for r in responses]
    rewards = []

    for r, a, dt in zip(extracted_responses, answer, db_set):
        if dt == "gsm8k":
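            # Check the exact match first; otherwise the substring test below
            # would also catch it and the 2.0 reward could never be reached.
            if r == a:
                rewards.append(2.0)
            elif a in r:
                rewards.append(1.0)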
            else:
                rewards.append(0.0)
        elif dt == "finqa":
            pred_norm = normalize_finqa_answer(r)
            actual_norm = normalize_finqa_answer(a)

            # Handle numeric answers
            if pred_norm.replace('.','').isdigit() and actual_norm.replace('.','').isdigit():
                try:
                    pred_num = float(pred_norm)
                    actual_num = float(actual_norm)
                    # Allow small relative difference for floating point numbers
                    if abs(pred_num - actual_num) / max(abs(actual_num), 1e-10) < 0.01:
                        rewards.append(2.0)
                        continue
                except ValueError:
                    pass

            # Handle text answers
            if pred_norm == actual_norm:
                rewards.append(2.0)
            elif actual_norm in pred_norm:
                rewards.append(1.0)
            else:
                rewards.append(0.0)
        else:
            rewards.append(2.0 if r.lower() == a.strip().lower() else 0.0)

    return ensure_list_length(rewards, len(completions))

def int_reward_func(completions, db_set, **kwargs) -> list[float]:
    """Calculate intermediate rewards based on answer format."""
    if not completions or not db_set:
        return [0.0] * len(completions)

    responses = [completion[0]['content'] if isinstance(completion, list) and len(completion) > 0
                and isinstance(completion[0], dict) and 'content' in completion[0]
                else "" for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    rewards = []

    for r, dt in zip(extracted_responses, db_set):
        if dt == "gsm8k":
            rewards.append(0.5 if r.isdigit() else 0.0)
        elif dt == "pubmedqa":
            rewards.append(0.5 if ('yes' in r.lower() or 'no' in r.lower() or 'maybe' in r.lower()) else 0.0)
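        else:
            # Fallback for every other dataset, including `finqa`: a letter check
            # (a/b/c/d) of the kind used for multiple-choice answers.
            rewards.append(0.5 if ('a' in r.lower() or 'b' in r.lower() or 'c' in r.lower() or 'd' in r.lower()) else 0.0)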

    return ensure_list_length(rewards, len(completions))

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format with exact newlines."""
    if not completions:
        return [0.0]

    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]['content'] if isinstance(completion, list) and len(completion) > 0
                and isinstance(completion[0], dict) and 'content' in completion[0]
                else "" for completion in completions]
    matches = [bool(re.match(pattern, r, re.DOTALL)) for r in responses]
    return ensure_list_length([0.5 if match else 0.0 for match in matches], len(completions))

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has XML tags in any format."""
    if not completions:
        return [0.0]

    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]['content'] if isinstance(completion, list) and len(completion) > 0
                and isinstance(completion[0], dict) and 'content' in completion[0]
                else "" for completion in completions]
    matches = [bool(re.search(pattern, r, re.DOTALL)) for r in responses]
    return ensure_list_length([0.5 if match else 0.0 for match in matches], len(completions))

def count_xml(text: str) -> float:
    """Count XML tags and calculate granular format rewards."""
    if not isinstance(text, str):
        return 0.0

    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        try:
            count -= len(text.split("\n</answer>\n")[-1])*0.001
        except Exception:
            pass
    if text.count("\n</answer>") == 1:
        count += 0.125
        try:
            count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
        except Exception:
            pass
    return max(0.0, min(0.5, count))  # Ensure reward is between 0 and 0.5

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """Calculate granular rewards based on XML tag counts and formatting."""
    if not completions:
        return [0.0]

    contents = [completion[0]['content'] if isinstance(completion, list) and len(completion) > 0
               and isinstance(completion[0], dict) and 'content' in completion[0]
               else "" for completion in completions]
    rewards = [count_xml(c) for c in contents]
    return ensure_list_length(rewards, len(completions))
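To see how the strict and soft format checks differ, the two regular expressions from the cell above can be run on a few hand-written completions. This is a small standalone sketch: the patterns are copied from `strict_format_reward_func` and `soft_format_reward_func`, while `format_flags` and the sample strings are illustrative.

```python
import re

# Patterns copied from the reward functions above.
STRICT_PATTERN = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"


def format_flags(text: str) -> dict:
    """Report which of the two format checks a completion passes."""
    return {
        "strict": bool(re.match(STRICT_PATTERN, text, re.DOTALL)),
        "soft": bool(re.search(SOFT_PATTERN, text, re.DOTALL)),
    }


# Hand-written completions, not model outputs.
samples = {
    "exact layout": "<reasoning>\n8166 / 10327 - 1\n</reasoning>\n<answer>\n-20.9%\n</answer>\n",
    "tags on one line": "<reasoning>8166 / 10327 - 1</reasoning> <answer>-20.9%</answer>",
    "think tag only": "<think>\nOkay, so I need to figure out the ratio\n</think>",
}

for name, text in samples.items():
    print(name, format_flags(text))
```

Only the first sample passes the strict check, the first two pass the soft check, and a completion that uses `<think>` alone passes neither. Note that the trainer cell below passes only the strict variant to `reward_funcs`.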

# Setup Training Arguments

We will use the TRL library from Hugging Face, which supports GRPO.

In [85]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 2, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 1024,
    #num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 750,
    save_steps = 100,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
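The combination of `warmup_ratio = 0.1`, `lr_scheduler_type = "cosine"` and `max_steps = 750` implies a learning rate that ramps up linearly and then decays along a half cosine. The sketch below computes that shape in plain Python, assuming the usual linear-warmup-plus-cosine formula; it is an approximation for intuition, not a call into the trainer's scheduler.

```python
import math

BASE_LR = 5e-6      # learning_rate from the config above
MAX_STEPS = 750     # max_steps from the config above
WARMUP_RATIO = 0.1  # warmup_ratio from the config above


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, MAX_STEPS - warmup_steps)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 25, 75, 400, 750):
    print(step, f"{learning_rate(step):.2e}")
```

With these settings the warmup covers roughly the first 75 steps, so the first logged steps of a run train with a learning rate far below `5e-6`.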


In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5,484 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 750
 "-____-"     Number of trainable parameters = 73,859,072


-------------------- Question:
Given the financial context below:
( in millions ) the $ 8166 of operating income reconciliation is 2015 ccg operating income ;
( in millions ) the $ 10327 of operating income reconciliation is 2014 ccg operating income ;

Answer the following question:
what is the growth rate in ccg operating income in 2015? with careful numerical precision. You need to carefully review the context and reason before answering. 
Answer:
-20.9% 
Response:
<think>
Okay, so I'm trying to figure out the net cash outflow related to future lease payments in 2005 based on the information provided. Let me start by going through the details again to make sure I understand everything correctly.

First, there are two fiscal years mentioned: 2005 and 2007. Under each year, there are two figures provided: future minimum lease payments and future minimum sublease income. For 2007, the payments are 12,188, and for 2005, it's 20,746. On the income side, in 2007, it's 3,819, and in 2005, 

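| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.0 | 0.5 | 0.0 | 860.5 | 1.1e-05 |
| 2 | 0.0 | 1.5 | 0.0 | 937.5 | 5e-06 |
| 3 | 0.0 | 1.0 | 0.707107 | 662.0 | 7e-06 |
| 4 | 0.0 | 0.5 | 0.0 | 1024.0 | 5e-06 |
| 5 | 0.0 | 0.5 | 0.0 | 784.0 | 6e-06 |
| 6 | 0.0 | 0.5 | 0.0 | 870.5 | 7e-06 |
| 7 | 0.0 | 0.5 | 0.0 | 780.5 | 6e-06 |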


-------------------- Question:
Given the financial context below:
in millions the lease obligations of 2014 is $ 171 ; the lease obligations of 2015 is $ 133 ; the lease obligations of 2016 is $ 97 ; the lease obligations of 2017 is $ 74 ; the lease obligations of 2018 is $ 59 ; the lease obligations of thereafter is $ 162 ;
in millions the purchase obligations ( a ) of 2014 is 3170 ; the purchase obligations ( a ) of 2015 is 770 ; the purchase obligations ( a ) of 2016 is 642 ; the purchase obligations ( a ) of 2017 is 529 ; the purchase obligations ( a ) of 2018 is 453 ; the purchase obligations ( a ) of thereafter is 2404 ;

Answer the following question:
what was the ratio of the lease obligations to purchase obligations with careful numerical precision. You need to carefully review the context and reason before answering. 
Answer:
0.13 
Response:
<think>
Okay, so I need to figure out the ratio of lease obligations to purchase obligations for each year mentioned. The user provided 
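The `reward` and `reward_std` columns logged during training hint at how GRPO turns rewards into a learning signal: each completion is scored relative to the other completions generated for the same prompt. The following is a minimal sketch of that group-relative normalization, assuming the sample standard deviation and a small stabilizing `eps`; the exact constant used by TRL is an assumption here, and `group_advantages` is an illustrative name.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)  # sample standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# With num_generations = 2, a mean reward of 1.0 and a reward_std of 0.707107
# (step 3 in the log) correspond to the two completions scoring 0.5 and 1.5.
mixed_group = [0.5, 1.5]
print(statistics.mean(mixed_group), round(statistics.stdev(mixed_group), 6))
print(group_advantages(mixed_group))

# When both completions receive the same reward, the advantages vanish.
tied_group = [0.5, 0.5]
print(group_advantages(tied_group))
```

Steps with a `reward_std` of 0.0 therefore carry no preference between the two completions, which is one reason to raise `num_generations` when memory allows.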

# Testing time

First we test the model without the `QLoRA` adapter. Then we load the adapter and compare the outputs.

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Is Aspirin good for cardiovascular function?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

## Let's Add the QLoRA Weights

Now we add the QLoRA weights that we just fine-tuned to see the difference.

In [None]:
model.save_lora("grpo_saved_lora")

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Is Aspirin good for cardiovascular function?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output
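Once the adapter follows the requested format, the generated text can be split into its reasoning and answer parts for inspection or scoring. The sketch below does this with two regular expressions; `split_completion` is an illustrative helper, and `sample_output` is a hand-written string standing in for `output`, not something the model produced.

```python
import re


def split_completion(text: str) -> dict:
    """Pull the <reasoning> and <answer> bodies out of a completion, if present."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reasoning": reasoning.group(1).strip() if reasoning else None,
        "answer": answer.group(1).strip() if answer else None,
    }


# Hand-written stand-in for a model completion.
sample_output = (
    "<reasoning>\n"
    "(8166 - 10327) / 10327 is about -0.209\n"
    "</reasoning>\n"
    "<answer>\n"
    "-20.9%\n"
    "</answer>\n"
)

parts = split_completion(sample_output)
print(parts["answer"])
```

Returning `None` for a missing tag makes it easy to count how often the model ignores the format on held-out prompts.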

In [None]:
model.save_pretrained_merged("model", tokenizer)

# Push to Hugging Face Hub

If you would like to push your fine-tuned model to the Hub, simply run:

In [None]:
model.push_to_hub_merged("myMedModel", tokenizer, token = "GET YOUR TOKEN from HUGGINGFACE")
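Rather than pasting the access token into the notebook, it can be read from an environment variable. This is a small sketch under the assumption that the token has been exported as `HF_TOKEN`; `get_hf_token` is an illustrative helper and not part of Unsloth.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read a Hugging Face access token from the environment, failing loudly if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a Hugging Face access token."
        )
    return token


# Usage in the notebook (not executed here, because `model` and `tokenizer`
# come from the earlier cells):
# model.push_to_hub_merged("myMedModel", tokenizer, token = get_hf_token())
```

Keeping the token out of the cell also keeps it out of any copy of the notebook that gets shared.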