# Can we discover novel in-context learning behavior?
1) Use Llama-3.1-8B-Instruct
2) Use GRPO to RL-train on 1-shot GSM8K
    - evaluate on 0-shot, 1-shot, 5-shot
3) Use GRPO to RL-train on 0-shot GSM8K
    - evaluate on 0-shot, 1-shot, 5-shot
4) Use GRPO to RL-train on 5-shot GSM8K
    - evaluate on 0-shot, 1-shot, 5-shot
5) Experiment with LoRA dropout and with targeting all linear modules (based on the QLoRA paper)
6) If nothing happens, try curriculum learning: 1-shot, then 2-shot, 3-shot, and so on up to 10-shot
7) If still nothing happens, try full bfloat16 tuning on A100s and/or QLoRA with a 70B model (in that case, distill using DeepSeek's method and release it)
8) Two options:
    - If something happens, add code and feature engineering
    - If nothing happens, then ICL is more about which samples matter
9) Can we do a qualitative analysis of the new behavior and come up with zero-shot prompting strategies? Do we lose zero-shot performance by focusing on ICL? Are the two mutually exclusive? If I could only use 1K samples, should I use a mixed strategy? RL or SFT for practical purposes? Release the SFT dataset?
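
The 0-shot, 1-shot, and 5-shot settings in the plan differ only in how many worked examples precede the target question. A minimal sketch of one way to build such prompts follows. The function name `build_k_shot_messages`, the demo dictionary layout, and the toy demos are assumptions made for illustration; they are not part of the notebook.

```python
# Mirrors the XML answer format used later in the notebook.
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def build_k_shot_messages(system_prompt, demos, question, k):
    """Return chat messages with k worked examples before the target question.

    `demos` is a list of {"question", "reasoning", "answer"} dicts
    (a hypothetical layout chosen for this sketch).
    """
    if k < 0 or k > len(demos):
        raise ValueError(f"k must be between 0 and {len(demos)}, got {k}")
    messages = [{"role": "system", "content": system_prompt}]
    for demo in demos[:k]:
        # Each demonstration is a user question plus a formatted assistant reply.
        messages.append({"role": "user", "content": demo["question"]})
        messages.append({
            "role": "assistant",
            "content": XML_COT_FORMAT.format(
                reasoning=demo["reasoning"], answer=demo["answer"]
            ),
        })
    messages.append({"role": "user", "content": question})
    return messages

# Toy demonstrations, made up for this example.
toy_demos = [
    {"question": "What is 2 + 3?", "reasoning": "2 + 3 = 5", "answer": "5"},
    {"question": "What is 4 * 6?", "reasoning": "4 * 6 = 24", "answer": "24"},
]

one_shot = build_k_shot_messages("Respond in XML.", toy_demos, "What is 7 - 2?", k=1)
```

With `k=0` this reduces to a system message plus the question, so the same helper could serve all three evaluation settings.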



In [1]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    # lora_dropout = 0,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-24 23:43:08 [__init__.py:256] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.50.0. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.586 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 58.68%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 23.59 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 03-24 23:43:18 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-24 23:43:19 [model_runner.py:1146] Model loading took 5.7737 GB and 2.502195 seconds
INFO 03-24 23:43:22 [worker.py:267] Memory profiling takes 3.22 seconds
INFO 03-24 23:43:22 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.59GiB) x gpu_memory_utilization (0.59) = 13.84GiB
INFO 03-24 23:43:22 [worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 0.90GiB; the rest of the memory reserved for KV Cache is 7.11GiB.
INFO 03-24 23:43:22 [executor_base.py:111] # cuda blocks: 3638, # CPU blocks: 3072
INFO 03-24 23:43:22 [executor_base.py:116] Maximum concurrency for 2048 tokens per request: 28.42x
INFO 03-24 23:43:26 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If o

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:26<00:00,  1.02it/s]

INFO 03-24 23:43:53 [model_runner.py:1570] Graph capturing finished in 26 secs, took 0.60 GiB
INFO 03-24 23:43:53 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 34.14 seconds



Unsloth 2025.3.18 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [2]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# 0-shot prompt; for 1-shot prompting, insert a worked user/assistant example between the system and user messages
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact newline-delimited XML format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that start with the XML tags, allowing loose whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

In [3]:
max_prompt_length = 512

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4, # Higher values give smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
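
The `num_generations` setting matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization in simplified form. It is not TRL's implementation: the function name, the `eps` value, the use of the population standard deviation, and the sample rewards are all assumptions for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group.

    Each advantage is (reward - group mean) / (group std + eps). Real
    implementations may differ in details such as sample vs. population std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up total rewards for six completions of a single prompt.
rewards = [2.0, 0.0, 0.0, 2.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)

# A group where every completion gets the same reward carries no signal.
flat = group_relative_advantages([1.0] * 6)
```

Under this scheme a prompt whose completions all score the same contributes zero advantage, which is one reason a larger group size can help when correct answers are rare.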


In [4]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 4 x 1) = 24
 "-____-"     Trainable parameters = 83,886,080/8,000,000,000 (1.05% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
Let's break down the problem into parts.

The first 10 tickets cost the full price of $40 each.
The remaining tickets, which is 12 - 10 = 2 tickets, get a 5% discount.

To calculate the discount on each ticket, we first find 5% of $40.
5% of $40 = 0.05 * $40 = $2

The discount on each discount-eligible ticket is $2.

So the price of each discounted ticket is $40 - $2 = $38. Since there are 2 of these tickets, their total price is 2 * $38 = $76.

The price of the full-priced tickets is 10 * $40 = $400.

To find the total amount paid by Mr. Benson, we add the cost of full-priced and discounted tickets together.
Total cost = $400 + $76 = $476

<answer> $476 </answer> 
Extracted:
$476
-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / int_reward_func,rewards / correctness_reward_func
1,0.0,0.447292,0.528245,246.541676,0.0,-0.073542,0.0,0.0,0.104167,0.416667
2,0.0,0.300042,0.76004,219.91667,0.0,-0.012458,0.0,0.0,0.0625,0.25
3,0.0,0.174125,0.386854,184.250008,0.000383,-0.034208,0.0,0.0,0.041667,0.166667
4,0.0,0.628167,0.878327,168.16667,0.000355,-0.017667,0.0,0.0,0.145833,0.5
5,0.0,0.101125,0.370251,177.666668,0.000418,-0.065542,0.0,0.0,0.083333,0.083333
6,0.0,0.254083,0.427716,136.916672,0.000349,0.004083,0.0,0.0,0.083333,0.166667
7,0.0,0.068958,0.366913,173.666672,0.000405,-0.035208,0.0,0.0,0.020833,0.083333
8,0.0,0.3065,0.745095,143.79167,0.000309,-0.026833,0.0,0.0,0.083333,0.25
9,0.0,0.18125,0.563033,151.41667,0.00035,-0.027083,0.0,0.0,0.041667,0.166667
10,0.0,0.110625,0.366097,172.083336,0.000436,-0.014375,0.0,0.0,0.041667,0.083333


-------------------- Question:
Over the past five years, on July 4th, the high temperature for Washington, DC has been: 90 degrees in 2020, 90 degrees in 2019, 90 degrees in 2018, 79 degrees in 2017 and 71 degrees in 2016. What is the average temperature for July 4th in Washington, DC over the past 5 years? 
Answer:
84 
Response:
To find the average temperature for July 4th in Washington, DC over the past 5 years, we need to add up all the temperatures and then divide by the number of years.

We have the temperatures for 2020: 90 degrees, 2019: 90 degrees, 2018: 90 degrees, 2017: 79 degrees, and 2016: 71 degrees.

First, we add up all the temperatures:
90 + 90 + 90 + 79 + 71 = 320

Then, we divide by the number of years (5):
320 / 5 = 64

So, the average temperature for July 4th in Washington, DC over the past 5 years is 64 degrees. 
Extracted:
To find the average temperature for July 4th in Washington, DC over the past 5 years, we need to add up all the temperatures and then divide by

TrainOutput(global_step=250, training_loss=0.0014389606379608892, metrics={'train_runtime': 14291.4116, 'train_samples_per_second': 0.42, 'train_steps_per_second': 0.017, 'total_flos': 0.0, 'train_loss': 0.0014389606379608892})
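
In the first logged sample above, the extracted answer is `$476` while the reference is `476`. The exact string comparison in `correctness_reward_func` therefore gives no reward to a numerically correct response, and `int_reward_func` also returns zero because `"$476".isdigit()` is false. One possible remedy is to normalize answers before comparing them. The helpers below are a hypothetical sketch, not part of the notebook, and loosening the reward this way would change what the policy is trained toward.

```python
import re

def normalize_numeric_answer(text):
    """Return a canonical numeric string, or None if `text` is not a plain number.

    Strips dollar signs, thousands separators, and surrounding whitespace.
    """
    cleaned = text.replace(",", "").replace("$", "").strip()
    if re.fullmatch(r"-?\d+(?:\.\d+)?", cleaned) is None:
        return None
    value = float(cleaned)
    # Render whole numbers without a trailing ".0" so "476.0" equals "476".
    return str(int(value)) if value.is_integer() else str(value)

def answers_match(predicted, reference):
    """Compare two answers after numeric normalization."""
    p = normalize_numeric_answer(predicted)
    r = normalize_numeric_answer(reference)
    return p is not None and p == r
```

A completion with no `<answer>` tags, like the last logged sample, still fails this check, because the extracted text is a full sentence rather than a number.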

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

Processed prompts: 100%|██████████| 1/1 [00:07<00:00,  7.87s/it, est. speed input: 4.96 toks/s, output: 76.16 toks/s]

In [7]:
print(output)

Calculating the exact value of pi is a complex task, as it is an irrational number that goes on forever without repeating. However, we can use various mathematical techniques to approximate its value.

Here are a few methods to calculate pi:

**Method 1: Archimedes' Approximation**

In the 3rd century BC, the Greek mathematician Archimedes approximated pi by inscribing and circumscribing polygons around a circle and calculating their perimeters.

```python
import math

# Define the number of sides for the polygon
n = 100

# Calculate the perimeter of the inscribed polygon
perimeter_inscribed = n * 2 * math.sin(math.pi / n)

# Calculate the perimeter of the circumscribed polygon
perimeter_circumscribed = n * 2 * math.tan(math.pi / n)

# Calculate pi using the average of the perimeters
pi_approximation = (perimeter_inscribed + perimeter_circumscribed) / 2

print(pi_approximation)
```

**Method 2: Leibniz Formula**

In 1676, the German mathematician Gottfried Wilhelm Leibniz discovered a 

In [5]:
model.save_lora("grpo_saved_lora")

In [8]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

print(output)

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.62s/it, est. speed input: 10.86 toks/s, output: 59.47 toks/s]

To calculate pi, we can use the formula:

π = ∑[n=0 to ∞] ((-1)^n * (1/n!))

where n! represents the factorial of n, which is the product of all positive integers up to n.

However, calculating this infinite series directly can be challenging. A more practical approach is to use a numerical method. One common method is the Monte Carlo method, where we simulate many random points within a square and a circle, and calculate the ratio of points within the circle to the total number of points.

Another approach is to use a Taylor series approximation, which is given by:

π/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...

This series can be truncated to a finite number of terms to achieve a desired level of accuracy.

A simple way to approximate π is to use the Leibniz formula:

π/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...

Using this formula, we can calculate a reasonable approximation of π:

π ≈ 3.14159265359

However, this is an infinite series, so we need to stop at a certain number of terms to get a prac




### Evaluate on GSM8K test set

In [1]:
import sys
sys.path.append("../")
from utils import analyze_errors
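
The `analyze_errors` helper imported above is not shown in this notebook. Independently of it, a minimal sketch of exact-match accuracy on a set of examples could look like the following. `generate_fn` is a hypothetical stand-in for whatever produces a completion from a question (for example, a wrapper around `model.fast_generate`), and the toy examples are made up.

```python
def extract_xml_answer(text):
    """Same extraction logic as the notebook: take the text inside <answer> tags."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def exact_match_accuracy(generate_fn, examples):
    """Fraction of examples whose extracted answer equals the reference string.

    `generate_fn` maps a question string to a completion string (hypothetical).
    `examples` is a list of {"question", "answer"} dicts.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for example in examples:
        completion = generate_fn(example["question"])
        if extract_xml_answer(completion) == example["answer"]:
            correct += 1
    return correct / len(examples)

# Toy data and a fake generator, used only to exercise the function.
toy_examples = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 4 * 6?", "answer": "24"},
]

def fake_generate(question):
    return "<reasoning>\n...\n</reasoning>\n<answer>\n5\n</answer>\n"

toy_accuracy = exact_match_accuracy(fake_generate, toy_examples)
```

Running the same function with 0-shot, 1-shot, and 5-shot prompts would give the three numbers the plan at the top calls for.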
