To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**Read our [blog post](https://unsloth.ai/blog/r1-reasoning) for guidance on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

### Unsloth

Use `PatchFastRL` before all functions to patch GRPO and other RL algorithms!

In [1]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!


2025-02-27 14:44:27,398	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


Load up the `kings-crown/Isabelle_FVELer_SFT` checkpoint (detected as a Qwen2 model in the log below) and set parameters.

In [2]:
from unsloth import is_bfloat16_supported
import torch
import time

max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 128 # Upper bound given to vLLM (max_lora_rank); also reused as lora_alpha. Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "kings-crown/Isabelle_FVELer_SFT",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.8, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # LoRA rank actually trained; choose any number > 0 up to max_lora_rank. Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

INFO 02-27 14:44:28 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.2.15: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA RTX A6000. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading kings-crown/Isabelle_FVELer_SFT with actual GPU utilization = 79.47%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 47.54 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 22.99 GB. Also swap space = 6 GB.
INFO 02-27 14:44:33 config.py:549] This model supports multiple tasks: {'embed', 'generate', 'reward', 'score', 'classify'}. Defaulting t

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.47it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  3.40it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.61it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.15it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.35it/s]



INFO 02-27 14:44:36 model_runner.py:1115] Loading model weights took 14.3854 GB
INFO 02-27 14:44:36 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-27 14:44:38 worker.py:267] Memory profiling takes 1.32 seconds
INFO 02-27 14:44:38 worker.py:267] the current vLLM instance can use total_gpu_memory (47.54GiB) x gpu_memory_utilization (0.79) = 37.78GiB
INFO 02-27 14:44:38 worker.py:267] model weights take 14.39GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.58GiB; the rest of the memory reserved for KV Cache is 21.76GiB.
INFO 02-27 14:44:38 executor_base.py:111] # cuda blocks: 25460, # CPU blocks: 7021
INFO 02-27 14:44:38 executor_base.py:116] Maximum concurrency for 2048 tokens per request: 198.91x
INFO 02-27 14:44:42 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory err

Capturing CUDA graph shapes: 100%|██████████| 39/39 [00:21<00:00,  1.83it/s]

INFO 02-27 14:45:03 model_runner.py:1562] Graph capturing finished in 21 secs, took 2.12 GiB
INFO 02-27 14:45:03 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 27.04 seconds

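The memory figures in the vLLM log above can be checked by hand. The short sketch below redoes that arithmetic using the rounded values printed in the log; the variable names are illustrative and are not vLLM settings.

```python
# Values copied from the log above (GiB, rounded as printed).
total_gpu_memory = 47.54
gpu_memory_utilization = 0.7947  # "actual GPU utilization = 79.47%"
model_weights = 14.39
non_torch_memory = 0.06
activation_peak = 1.58

# Budget vLLM may use, then what remains for the KV cache.
usable = total_gpu_memory * gpu_memory_utilization
kv_cache = usable - model_weights - non_torch_memory - activation_peak

print(f"usable budget: {usable:.2f} GiB")
print(f"KV cache:      {kv_cache:.2f} GiB")
```

The result lands within rounding of the 21.76 GiB the log reports, which is why lowering `gpu_memory_utilization` is the first knob to try when memory runs out.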


Unsloth 2025.2.15 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.

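To see how the LoRA rank translates into trainable parameters, here is a rough count. It is a sketch that assumes the checkpoint uses the published Qwen2.5-7B layer shapes (hidden size 3584, MLP size 18944, 4 key/value heads of dimension 128); those dimensions are not stated in this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed layer shapes (in_features, out_features); not stated in this notebook.
HIDDEN, MLP, KV = 3584, 18944, 4 * 128
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank, num_layers=28):
    # Each adapted matrix gains two factors: A (rank x in) and B (out x rank).
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in SHAPES.values())
    return per_layer * num_layers

print(f"{lora_params(64):,}")
```

Under these assumptions, rank 64 over 28 layers gives 161,480,704 parameters, which agrees with the "Number of trainable parameters" line in the training banner further down.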

In [3]:
import os
import sys
import time

class Checker(object):
    """A modified version of the Draft, Sketch, Prove proof-checking client.
    (https://github.com/albertqjiang/draft_sketch_prove/blob/main/autoformalization/checker.py)

    This checker supports Isabelle2022 via the new version of PISA
    (https://albertqjiang.github.io/Portal-to-ISAbelle/).

    It supports checking a miniF2F-style proof via `check`.

    Finally, it replaces `sledgehammer` with a call to `normalhammer`.
    """
    def __init__(self, working_dir, isa_path, theory_file_path, port=9000):
        sys.path.append(os.environ.get('PISA_PATH', ''))
        try:
            from pisa_client import initialise_env
            self.initialise_env = initialise_env
        except ImportError:
            print("Set $PISA_PATH to /yourpath/to/Portal-to-ISAbelle/src/main/python")

        self.working_dir = working_dir
        self.isa_path = isa_path
        self.theory_file_path = theory_file_path
        self.port = port

    def _initialize(self):
        """Initialize the PISA environment."""
        env = self.initialise_env(
            self.port,
            isa_path=self.isa_path,
            theory_file_path=self.theory_file_path,
            working_directory=self.working_dir
        )
        return env

    def _exit(self, env):
        """Exit the environment and clean up resources."""
        try:
            env.post('exit')
        except Exception:
            pass
        os.system("ps aux | grep Isabelle | awk '{print $2}' | xargs kill -9 > /dev/null 2>&1")
        os.system("ps aux | grep poly | awk '{print $2}' | xargs kill -9 > /dev/null 2>&1")

    def _parse_output(self, obs):
        """Parse the sledgehammer output, returning the relevant part."""
        return obs.split('<hammer>')[0] if '<hammer>' in obs else ''

    def _run_step(self, step, i, tls_name, env):
        """Run a single proof step."""
        try:
            obs, reward, done, metadata = env.step_to_top_level_state(
                action=step,
                tls_name=tls_name,
                new_name=f'default_{i}'
            )
            return obs, reward, done, metadata, None
        except Exception as e:
            return '', 0, False, None, str(e)

    def _run_sledgehammer(self, step, i, tls_name, env):
        """Run sledgehammer or fallback heuristics on a step."""
        heuristics = [
            'by auto', 'by simp', 'by blast', 'by fastforce',
            'by force', 'by eval', 'by presburger', 'by sos',
            'by arith', 'by linarith', 'by (auto simp: field_simps)'
        ]
        for heuristic in heuristics:
            step_ = step.replace('normalhammer', heuristic)
            obs, reward, done, metadata, error = self._run_step(step_, i, tls_name, env)
            if error is None:
                obs = f'{heuristic} <hammer> {obs}'
                return obs, reward, done, metadata, error
        return self._run_step(step.replace("normalhammer", "sledgehammer"), i, tls_name, env)

    def check(self, statement_and_proof):
        """Check the given proof."""
        env = self._initialize()
        env.initialise()

        theory = self.wrap_theorem(statement_and_proof)
        steps = self.get_parsed(env, theory)

        result = self._check(env, steps)
        self._exit(env)

        # Output the result
        #print("\n==== Success: %s" % result['success'])
        #print("--- Complete proof:\n%s" % result['theorem_and_proof'])
        return result

    def _check(self, env, steps):
        """Run the proof steps and collect results."""
        success, reason, done = False, '', False
        step_results = []
        tls_name = 'default'

        for i, step in enumerate(steps):
            time0 = time.time()
            if 'normalhammer' in step or 'sledgehammer' in step:
                obs, reward, done, metadata, error = self._run_sledgehammer(step, i, tls_name, env)
            else:
                obs, reward, done, metadata, error = self._run_step(step, i, tls_name, env)

            step_time = time.time() - time0
            step_results.append({
                'index': i, 'step': step, 
                'output': self._parse_output(obs), 
                'step_time': step_time
            })

            if error:
                reason = error
                break
            tls_name = f'default_{i}'

        success = done and reward == 1.0
        return {
            'success': success,
            'reason': reason,
            'num_steps': len(steps),
            'last_step': len(step_results),
            'step_results': step_results,
            'theorem_and_proof': self.reconstruct(step_results) if success else ''
        }

    @staticmethod
    def reconstruct(step_results):
        """Reconstruct the complete proof."""
        return '\n'.join(
            step_result['output'].strip() if step_result['output'] else step_result['step'].strip()
            for step_result in step_results[1:]
        )

    @staticmethod
    def wrap_theorem(theorem):
        """Wrap the theorem in a theory file."""
        return (
            'theory Interactive imports HOL.HOL Complex_Main '
            '"HOL-Library.Code_Target_Numeral" "HOL-Library.Sum_of_Squares" '
            '"Symmetric_Polynomials.Vieta" "HOL-Computational_Algebra.Computational_Algebra" '
            '"HOL-Number_Theory.Number_Theory" \n begin\n%s' % theorem
        )

    @staticmethod
    def get_parsed(env, theory):
        """Parse the theory and extract proof steps."""
        raw_steps = env.post(f"<parse text> ${theory}")
        steps = [s.strip() for s in raw_steps.split('<SEP>') if s.strip() and s != '$']
        processed_steps = []
        for i, step in enumerate(steps):
            if step.lower() == "then" and (i == 0 or steps[i - 1].startswith("proof")):
                continue
            processed_steps.append(step)
        return processed_steps

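The step filtering in `get_parsed` can be tried without a running PISA server. The sketch below copies that filtering into a standalone function (`split_steps`, a hypothetical name) and feeds it a made-up `<SEP>`-delimited string; the exact text PISA returns is an assumption here.

```python
def split_steps(raw_steps):
    """Split a <SEP>-delimited parse result and drop a dangling leading `then`."""
    steps = [s.strip() for s in raw_steps.split('<SEP>') if s.strip() and s != '$']
    processed = []
    for i, step in enumerate(steps):
        # A `then` with no preceding fact (first step, or right after `proof`) is skipped.
        if step.lower() == "then" and (i == 0 or steps[i - 1].startswith("proof")):
            continue
        processed.append(step)
    return processed

# Made-up parse result, for illustration only.
raw = 'lemma foo: "x = x"<SEP>proof -<SEP>then<SEP>show ?thesis<SEP>by simp<SEP>qed<SEP>'
for step in split_steps(raw):
    print(step)
```

The `then` that directly follows `proof -` is removed, and the remaining five steps are passed on one at a time.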
In [4]:
import sys
import os

sys.path.append('../')
os.environ['PISA_PATH'] = '/home/siai/Portal-to-ISAbelle/src/main/python'

#import dsp_utils

checker = Checker(
    working_dir='/home/siai/Isabelle2022/src/HOL/Examples',
    isa_path='/home/siai/Isabelle2022',
    theory_file_path='/home/siai/Isabelle2022/src/HOL/Examples/Interactive.thy',
    port=9000
)

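To make the `normalhammer` fallback easier to follow, here is a minimal sketch of the same loop run against a stand-in environment. `FakeEnv` and `try_heuristics` are hypothetical and only mimic the call shape used by `Checker`; unlike the real class, the sketch does not fall back to `sledgehammer` when every heuristic fails.

```python
HEURISTICS = ["by auto", "by simp", "by blast"]

class FakeEnv:
    """Hypothetical stand-in for the PISA environment: only `by simp` succeeds."""
    def step_to_top_level_state(self, action, tls_name, new_name):
        if action != "by simp":
            raise RuntimeError(f"tactic failed: {action}")
        return "proved", 1.0, True, {}

def try_heuristics(step, env):
    """Substitute each heuristic for `normalhammer` until one is accepted."""
    for heuristic in HEURISTICS:
        candidate = step.replace("normalhammer", heuristic)
        try:
            obs, reward, done, _ = env.step_to_top_level_state(
                action=candidate, tls_name="default", new_name="default_0"
            )
        except Exception:
            continue  # this tactic failed, try the next one
        # Prefix the winning tactic so it can be recovered from the output later.
        return f"{heuristic} <hammer> {obs}", reward, done
    return "", 0, False

obs, reward, done = try_heuristics("normalhammer", FakeEnv())
chosen = obs.split("<hammer>")[0].strip()
print(chosen, reward, done)
```

The text before `<hammer>` is what `_parse_output` keeps, so the reconstructed proof contains the tactic that worked in place of `normalhammer`.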

### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [None]:
import re
import sys
import os
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """Write a proof in Isabelle that appropriately proves the given statement in natural language.
Make sure to wrap the proof within ```isabelle and ``` tags inside answer.
Respond in the following format:
<reasoning>
[Your explanation or chain of thought]
</reasoning>
<answer>
```isabelle
[Your formal Isabelle code]
```
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_isabelle_snippet(text: str) -> str | None:
    """
    Extracts Isabelle proof content from text. Handles both Markdown code blocks 
    and inline structured proofs by detecting `lemma`, `proof`, and `qed`.
    """
    if not isinstance(text,str) or text.strip() == "":
        return None
    
    if ":" in text and "```isabelle" in text:
        text = text.split("```isabelle",1)[1]
        text = "```isabelle" + text
        
    code_pattern = r"```isabelle\s*(.+?)\s*```"
    matches = re.findall(code_pattern, text, flags=re.DOTALL | re.IGNORECASE)
    
    if matches:
        return matches[0].strip() 
    inline_pattern = r"(lemma.*?proof.*?qed)"
    matches = re.findall(inline_pattern, text, flags=re.DOTALL | re.IGNORECASE)

    if matches:
        return matches[0].strip() 
    return None


# Build chat prompts from the FVELer PISA dataset (function name kept from the GSM8K template)
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('kings-crown/FVELer_PISA_Proven', 'default')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['natural_language_statement']}
        ],
        'answer': extract_isabelle_snippet(x['formal_proof'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_isabelle_snippet(r) for r in responses]
    #print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def checker_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_snippets = [extract_isabelle_snippet(r) for r in responses]

    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_snippets[0]}")

    # Build one checker and verify each extracted proof exactly once.
    checker = Checker(
        working_dir='/home/siai/Isabelle2022/src/HOL/Examples',
        isa_path='/home/siai/Isabelle2022',
        theory_file_path='/home/siai/Isabelle2022/src/HOL/Examples/Interactive.thy',
        port=9000
    )
    rewards = []
    for content in extracted_snippets:
        if content is None:
            rewards.append(0.0)  # no proof could be extracted, so nothing to check
            continue
        rewards.append(2.0 if checker.check(content).get("success", False) else 0.0)
    return rewards

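# Carried over from the GSM8K template: an answer holding an Isabelle proof is not
# all digits, so this function normally contributes 0.0 here.
def int_reward_func(completions, **kwargs) -> list[float]: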
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span the newlines inside multi-line reasoning and proofs
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span the newlines inside multi-line reasoning and proofs
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

Map: 100%|██████████| 1138/1138 [00:00<00:00, 14085.10 examples/s]

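To illustrate what the extraction step pulls out of a completion, here is a small standalone sketch. It rebuilds only the fenced-block branch of `extract_isabelle_snippet` under a hypothetical name, `extract_snippet`, and runs it on a made-up response; the fence marker is assembled in code purely to keep the example readable.

```python
import re

FENCE = "`" * 3  # three backticks, built in code to avoid nesting fences

def extract_snippet(text):
    """Return the body of the first fenced isabelle block, or None."""
    pattern = FENCE + r"isabelle\s*(.+?)\s*" + FENCE
    matches = re.findall(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    return matches[0].strip() if matches else None

# Made-up completion in the format the system prompt asks for.
response = (
    "<reasoning>\nSimplification is enough.\n</reasoning>\n"
    "<answer>\n" + FENCE + "isabelle\n"
    'lemma "a + 0 = (a::nat)"\n  by simp\n' + FENCE + "\n</answer>\n"
)

print(extract_snippet(response))
print(extract_snippet("no code here"))
```

Only the proof text between the fences is kept, which is the string that gets compared with the reference answer and handed to the checker.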

In [6]:
dataset[4]

{'natural_language_statement': 'The lemma states that if set \\( S \\) is a subset of set \\( T \\), and there exists an element \\( x \\) such that \\( S \\) is the singleton set containing \\( x \\), and there exists an element \\( y \\) such that \\( T \\) is the singleton set containing \\( y \\), then \\( S \\) is equal to \\( T \\).',
 'formal_proof': 'To translate the informal proof into a structured Isabelle proof, we will follow the steps outlined in the informal reasoning and use `sledgehammer` to assist in finding any necessary lemmas or theorems. Here\'s how the structured proof in Isabelle might look:\n\n```isabelle\nlemma eq:\n  assumes "S \\<subseteq> T"\n    and "\\<exists>x. S = {x}"\n    and "\\<exists>y. T = {y}"\n  shows "S = T"\nproof -\n  from `\\<exists>x. S = {x}` obtain x where "S = {x}" by auto\n  from `\\<exists>y. T = {y}` obtain y where "T = {y}" by auto\n  have "x \\<in> T" using `S \\<subseteq> T` `S = {x}` by auto\n  then have "x = y" using `T = {y}` by 

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [7]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4, # Higher values give smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 3000, # Note: prompt + completion (3256) exceeds max_seq_length (2048) set above
    num_train_epochs = 3, # Overridden by max_steps below while max_steps is positive
    max_steps = 500,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "output_RL",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 2 to the `num_generations` of 6

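The batch-size adjustment reported above determines how many prompts each optimizer step sees. The arithmetic below is a sketch that assumes every group of `num_generations` completions shares one prompt, and that the "Num Epochs" figure in the training banner is the step count converted to dataset passes and rounded up; both are assumptions about the trainer rather than documented behaviour.

```python
import math

per_device_batch = 6   # raised from 2 to match num_generations
grad_accum_steps = 4
num_gpus = 1
num_generations = 6
num_examples = 1138
max_steps = 500

# Completions processed per optimizer step, and the prompts behind them.
completions_per_step = per_device_batch * grad_accum_steps * num_gpus
prompts_per_step = completions_per_step // num_generations

# Dataset passes needed to cover max_steps, rounded up.
epochs = math.ceil(max_steps * prompts_per_step / num_examples)

print(completions_per_step, prompts_per_step, epochs)
```

Under these assumptions the numbers line up with the banner printed when training starts: a total batch size of 24 and 2 epochs for 500 steps.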

And let's run the trainer! During training you'll see a table of rewards like the sample below. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |

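One way a negative `reward` like the one in step 3 can arise is the length penalty inside `count_xml`. The sketch below copies that function and scores two made-up completions; the sample strings are illustrative only.

```python
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        # Penalise every character that follows the closing answer tag.
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count

clean = "<reasoning>\nUse simp.\n</reasoning>\n<answer>\nby simp\n</answer>\n"
rambling = clean + "x" * 300  # 300 stray characters after the answer

print(round(count_xml(clean), 3))
print(round(count_xml(rambling), 3))
```

A clean completion earns the full 0.5, while trailing text is charged twice per character, so 300 stray characters are enough to push this component below zero.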

In [None]:
sys.path.append('../')
os.environ['PISA_PATH'] = '/home/siai/Portal-to-ISAbelle/src/main/python'


trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
        checker_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,138 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 6 | Gradient Accumulation steps = 4
\        /    Total batch size = 24 | Total steps = 500
 "-____-"     Number of trainable parameters = 161,480,704
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mbalaji-vir1997[0m ([33mbalaji-vir1997-stevens-institute-of-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


-------------------- Question:
The lemma states that for any natural numbers \( a \) and \( b \), the product of the integer division of \( a \) by \( b \) and \( b \) is less than or equal to \( a \). 
Answer:
lemma div_mult_le:
  "a div b * b \<le> a" for a b :: nat
proof -
  have "a = (a div b) * b + (a mod b)"
    by (simp add: div_mult_mod_eq)
  then have "(a div b) * b \<le> a"
    by simp
  thus ?thesis .
qed 
Response:
Okay, so I need to prove that for any natural numbers \( a \) and \( b \), the product of the integer division of \( a \) by \( b \) and \( b \) is less than or equal to \( a \). Hmm, let me break this down step by step.

First, what does integer division mean? I think it's when you divide two numbers and take the floor of the result. So, \( a \div b \) would give a quotient which is the largest integer less than or equal to the actual division result of \( a \) by \( b \). For example, if \( a = 10 \) and \( b = 3 \), then \( 10 \div 3 = 3 \) because 3 times 3 i

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / int_reward_func,rewards / correctness_reward_func,rewards / checker_reward_func


-------------------- Question:
The lemma `list_all_dest` can be translated into the following natural language statements:

1. For a list containing a single pair \((x, y)\), the predicate `list_all P` holds if and only if the predicate \(P\) holds for the pair \((x, y)\).

2. For a list starting with a pair \((x, y)\) followed by another element \(z\) and the rest of the list \(xs\), the predicate `list_all P` holds if and only if the predicate \(P\) holds for the pair \((x, y)\) and `list_all P` holds for the remainder of the list starting with \(z\) and followed by \(xs\). 
Answer:
lemma list_all_dest:
  shows "list_all P [(x, y)] \<equiv> P (x, y)"
    and "list_all P ((x, y) # z # xs) \<equiv> (P (x, y) \<and> list_all P (z # xs))"
proof -
  (* Proof for the first statement *)
  show "list_all P [(x, y)] \<equiv> P (x, y)"
  proof -
    have "list_all P [(x, y)] = (\<forall>p \<in> set [(x, y)]. P p)"
      by (simp add: list_all_iff)
    also have "... = P (x, y)"
      by simp
 

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

And now with the LoRA we just trained with GRPO. We save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

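A quick way to sanity-check a generated `output` by hand is to test whether it follows the `<reasoning>` / `<answer>` layout. The sketch below uses a hypothetical helper, `follows_format`, built on the same pattern as the soft format reward; the sample strings are made up.

```python
import re

PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

def follows_format(text):
    """True if the text contains a reasoning block followed by an answer block."""
    # re.DOTALL lets "." cross newlines, which multi-line proofs need.
    return re.search(PATTERN, text, flags=re.DOTALL) is not None

good = "<reasoning>\nUse simp.\n</reasoning>\n<answer>\nby simp\n</answer>\n"
bad = "The proof is by simp."

print(follows_format(good))
print(follows_format(bad))
```

Calling `follows_format(output)` on the strings produced by the cells above shows whether a completion would be eligible for the format rewards before any proof checking takes place.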
<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
