<a href="https://colab.research.google.com/github/hasonsk/Finetune-LLM-for-deteting-cookie/blob/main/Llama3_1_(8B)_GRPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Unsloth

Load up `Llama 3.1 8B Instruct`, and set parameters

In [3]:
from unsloth import FastLanguageModel
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-04 15:26:30 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-04 15:26:30 [__init__.py:239] Automatically detected platform cuda.


In [4]:
import os, wandb
from huggingface_hub import login

os.environ["WANDB_API_KEY"] = "810a489e4029d94651a3868c08a49812cc1e4cc7"
os.environ["HUGGINGFACE_API_KEY"] = "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"

In [5]:
hf_token = os.environ["HUGGINGFACE_API_KEY"]
login(hf_token)

secret_key = os.environ["WANDB_API_KEY"]
wandb.login(key=secret_key)

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33msonhh2003[0m ([33msonhh2003-hanoi-university-of-science-and-technology[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

### Load Tokenizer & Model

In [6]:
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 16 # Larger rank = smarter, but slower
model_name = "meta-llama/meta-Llama-3.1-8B-Instruct"    # Gemma 3.1 8B‑Instruct checkpoint
output_dir = "./grpo_llama3_8b"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        # "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.66 GB. Also swap space = 2 GB.
INFO 05-04 15:27:13 [config.py:717] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes 

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

INFO 05-04 15:27:17 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-04 15:27:17 [cuda.py:289] Using XFormers backend.
INFO 05-04 15:27:18 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-04 15:27:18 [model_runner.py:1108] Starting to load model unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit...
INFO 05-04 15:27:18 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 05-04 15:27:19 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

INFO 05-04 15:29:46 [weight_utils.py:281] Time spent downloading weights for unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit: 147.789338 seconds
INFO 05-04 15:29:47 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 05-04 15:30:29 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-04 15:30:29 [model_runner.py:1140] Model loading took 5.6877 GiB and 190.899800 seconds
INFO 05-04 15:30:41 [worker.py:287] Memory profiling takes 11.42 seconds
INFO 05-04 15:30:41 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 05-04 15:30:41 [worker.py:287] model weights take 5.69GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.75GiB; the rest of the memory reserved for KV Cache is 2.29GiB.
INFO 05-04 15:30:42 [executor_base.py:112] # cuda blocks: 1174, # CPU blocks: 1024
INFO 05-04 15:30:42 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 9.17x
INFO 05-04 15:30:43 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If

Capturing CUDA graph shapes:   0%|          | 0/23 [00:00<?, ?it/s]

INFO 05-04 15:31:34 [model_runner.py:1592] Graph capturing finished in 51 secs, took 0.53 GiB
INFO 05-04 15:31:34 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 64.96 seconds


tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.4.7 patched 32 layers with 0 QKV layers, 0 O layers and 32 MLP layers.


In [7]:
print(tokenizer.model_max_length)

131072


### Data Prep
<a name="Data"></a>

We directly leverage [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) for data prep and all reward functions. You are free to create your own!

In [8]:
!pip install datasets



In [9]:
from datasets import Dataset, load_dataset

SYSTEM_PROMPT = """
ROLE: You are a cookie policy analysis expert.
Your task is to read a cookie policy text and extract detailed information about each cookie mentioned in that policy.
RESPONSE Format: If no cookies are specifically described, return []. Otherwise, return the result as a JSON array following structure:
  [
    {
      "cookie_name": "cookie_name",
      "declared_purpose": "declared_purpose",
      "declared_retention": "declared
      "declared_third_parties": ["declared_third_parties"],
      "declared_description": "declared_description"
    }
  ]

Specific Requirements:
  Read and Analyze the Text: Carefully read the entire content of the cookie policy to understand the types of cookies used, their purposes, storage duration, detailed descriptions, and information about the owner (first party or third party).
  Identify Described Cookies: Search for sections, paragraphs, or text that describe specific cookies. Each cookie will usually have its own entry or be listed in a list. If not found cookies, return [] and dont say anymore.

  Extract Detailed Information for Each Cookie: For each identified cookie, carefully extract the following information from the policy:
    +) "cookie_name": Find the technical name or descriptive name of the cookie exactly as it is mentioned in the policy.
    +) "declared_purpose": A detailed description of the cookie's purpose. With the following options:
        +) Strictly Necessary: Cookies essential for the website to function correctly with basic features (e.g., managing login sessions, shopping cart functionality, secure navigation).
        +) Functionality: Cookies that help personalize the user experience and provide additional features (e.g., remembering language preferences, page layouts).
        +) Analytical: Cookies used to collect data and understand user behavior on the website (e.g., number of visits, most viewed pages).
        +) Targeting/Advertising/Marketing: Cookies used to display personalized advertisements based on user behavior or interests.
        +) Performance: Cookies that help improve the technical performance of the website (e.g., optimizing page loading, content distribution).
        +) Social Sharing: Cookies that enable sharing content from the website to social media platforms.
        +) Null if the policy does not provide any specific information about the purpose of this cookie.
    +) "declared_retention": Find information about the duration this cookie will be stored on the user's device (e.g., "6 months", "24 hours", "Session", "Persistent", "1 minute", "Until deleted"). Record this information as accurately as possible according to the original text.
    +) "declared_third_parties": An array of third parties involved in the use of this cookie (if any). ["First Party"] indicates a first-party cookie set by the website itself.
    +) "declared_description": The content taken directly from the text, without fabrication, using exactly what they have described.
"""

def preprocess(example):
    if example['english_content']:
        content = f"Cookie policy: {example['english_content']}\nTable: {example['english_table']}"
    elif example['english_table']:
        content = f"Table: {example['english_table']}"
    else:
        content = f"Content: {example['english_table']}"

    prompt = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': content}
    ]

    return {
        'prompt': prompt,
        'answer': example['label']
        # 'answer': '[]'
    }

dataset = load_dataset("sonhask/detect-cookie-with-label")
dataset = dataset['train'].map(preprocess)


final_merged_dataset.csv:   0%|          | 0.00/127M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/12441 [00:00<?, ? examples/s]

In [10]:
def compute_token_length(example):
    length =  (
              len(example['answer']) +
               len(example["prompt"][0]['content']) + len(example["prompt"][1]['content']))*0.331 + 50
    return length

dataset = dataset.filter(lambda example: compute_token_length(example) <= 2048)
len(dataset)

Filter:   0%|          | 0/12441 [00:00<?, ? examples/s]

7285

In [11]:
def thong_ke(dataset):
    empty_label_count = 0
    non_empty_label_count = 0

    for i in dataset:
        if i['answer'] == '[]':
            empty_label_count += 1
        else:
            non_empty_label_count += 1

    print("#[]: ", empty_label_count)
    print("#Has cookies: ", non_empty_label_count)

thong_ke(dataset)

#[]:  6690
#Has cookies:  595


In [12]:
# 2. remove every column except the one GRPOTrainer expects:
keep = ["prompt", "answer"]
to_remove = [c for c in dataset.column_names if c not in keep]
dataset = dataset.remove_columns(to_remove)

# 3. now dataset.features == ['prompt']
print(dataset.features)

{'prompt': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}], 'answer': Value(dtype='string', id=None)}


In [13]:
dataset[0]

{'prompt': [{'content': '\nROLE: You are a cookie policy analysis expert.\nYour task is to read a cookie policy text and extract detailed information about each cookie mentioned in that policy.\nRESPONSE Format: If no cookies are specifically described, return []. Otherwise, return the result as a JSON array following structure:\n  [\n    {\n      "cookie_name": "cookie_name",\n      "declared_purpose": "declared_purpose",\n      "declared_retention": "declared\n      "declared_third_parties": ["declared_third_parties"],\n      "declared_description": "declared_description"\n    }\n  ]\n\nSpecific Requirements:\n  Read and Analyze the Text: Carefully read the entire content of the cookie policy to understand the types of cookies used, their purposes, storage duration, detailed descriptions, and information about the owner (first party or third party).\n  Identify Described Cookies: Search for sections, paragraphs, or text that describe specific cookies. Each cookie will usually have it

### Define reward functions

In [14]:
import jsonschema
import json
from jsonschema import validate, ValidationError
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from itertools import zip_longest

embedder = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "cookie_name": {"type": "string"},
            "declared_purpose": {"type": ["string", "null"]},
            "declared_retention": {"type": ["string", "null"]},
            "declared_third_parties": {"type": "array"},
            "declared_description": {"type": ["string", "null"]}
        },
        "required": ["cookie_name"]
    }
}

def strict_format_reward_func(completions, **kwargs):
    rewards = []
    for completion in completions:
        try:
            response = json.loads(completion[0]['content'])
            validate(instance=response, schema=schema)
            rewards.append(1.0)
        except (json.JSONDecodeError, ValidationError):
            rewards.append(-1.0)
    return rewards

# Hàm này kiểm tra xem đầu ra có thể được phân tích cú pháp thành JSON hợp lệ hay không, mà không cần xác thực theo schema.
def soft_format_reward_func(completions, **kwargs):
    rewards = []
    for completion in completions:
        try:
            json.loads(completion[0]['content'])
            rewards.append(0.5)
        except json.JSONDecodeError:
            rewards.append(-0.5)
    return rewards

# def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
#     rewards = []
#     for completion in completions:
#         try:
#             response = json.loads(completion[0]['content'])
#             validate(instance=response, schema=schema)
#             correct_items = sum(1 for r_item in response if r_item in answer)
#             total_items = len(answer)
#             score = correct_items / total_items if total_items > 0 else 0.0
#             rewards.append(score)
#         except (json.JSONDecodeError, ValidationError):
#             rewards.append(-1.0)
#     return rewards

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    rewards = []
    print("Answer is: ", answer)
    for completion, answer in zip(completions, answer):
        try:
            response = json.loads(completion[0]['content'])
            ground_truth = json.loads(answer)
            validate(instance=response, schema=schema)
            correct_items = sum(1 for r_item in response if r_item in ground_truth)
            total_items = len(ground_truth)
            score = correct_items / total_items if total_items > 0 else 0.0
            rewards.append(score)
        except (json.JSONDecodeError, ValidationError):
            rewards.append(-1.0)
    return rewards

# Hàm này đánh giá mức độ đầy đủ của các trường trong mỗi mục của đầu ra, tính điểm dựa trên tỷ lệ các trường không rỗng.
def field_completeness_reward_func(completions, **kwargs) -> list[float]:
    rewards = []
    for comp in completions:
        try:
            arr = json.loads(comp[0]["content"])
            total = len(arr) * 5
            filled = 0
            for item in arr:
                for v in [
                    item.get("cookie_name"),
                    item.get("declared_purpose"),
                    item.get("declared_retention"),
                    item.get("declared_third_parties"),
                    item.get("declared_description"),
                ]:
                    if v not in (None, "", []):
                        filled += 1
            rewards.append(filled / total if total else 0.0)
        except Exception:
            rewards.append(-1.0)
    return rewards

# FIELD-LEVEL REWARD
def field_level_reward(prompts, completions, answer, **kwargs) -> list[float]:
    required_fields = [
        "cookie_name",
        "declared_purpose",
        "declared_retention",
        "declared_third_parties",
        "declared_description",
    ]
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]
    print('*'*20, f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}")

    rewards = []
    for comp, answer in zip(completions, answer):
        print("Len of comp: ", len(comp))
        print("Answer: ", answer)
        print("---> PREDICT: ", comp[0]["content"])
        try:
            pred = json.loads(comp[0]["content"])
            ground_truth = json.loads(answer)
        except Exception:
            rewards.append(-1.0)
            continue

        if not isinstance(pred, list):
            rewards.append(0.0)
            continue

        score = 0.0
        matched = 0
        for p_item, g_item in zip_longest(pred, ground_truth, fillvalue={}):
            if not isinstance(p_item, dict):
                score -= 1.0
                matched += 1
                continue

            match = 0
            for f in required_fields:
                if p_item.get(f) == g_item.get(f):
                    match += 1
            score += match / len(required_fields)
            matched += 1

        rewards.append(score / matched if matched else 0.0)
    return rewards

# JACCARD ON THIRD PARTIES
def jaccard_third_party_reward(prompts, completions, answer, **kwargs) -> list[float]:
    rewards = []

    for comp, answer in zip(completions, answer):
        try:
            pred = json.loads(comp[0]["content"])
            ground_truth = json.loads(answer)
        except Exception:
            rewards.append(-3.0)
            continue

        if not isinstance(pred, list):
            rewards.append(-3.0)
            continue

        scores = []
        for p_item, g_item in zip_longest(pred, ground_truth, fillvalue={}):
            try:
                set_p = set(p_item.get("declared_third_parties", []))
                set_g = set(g_item.get("declared_third_parties", []))
                inter = len(set_p & set_g)
                union = len(set_p | set_g)
                scores.append(inter / union if union > 0 else 0.0)

            except AttributeError, NoneType:
                # set_p = set()
                scores.append(-1.0)
                continue

        rewards.append(sum(scores) / len(scores) if scores else 0.0)
    return rewards

# def semantic_description_reward(prompts, completions, answer, **kwargs) -> list[float]:
#     rewards = []

#     for comp in completions:
#         try:
#             pred = json.loads(comp[0]["content"])
#         except:
#             rewards.append(0.0)
#             continue

#         sims = []
#         ground_truth = json.loads(gold)
#         for p_item, g_item in zip_longest(pred, ground_truth, fillvalue={}): # Trường hợp p_item có số lượng phần tử khác với g_item thì sao?
#             desc_p = p_item.get("declared_description", "")
#             desc_g = g_item.get("declared_description", "")
#             emb = embedder.encode([desc_p, desc_g])
#             sims.append(float(cosine_similarity([emb[0]], [emb[1]])[0][0]))
#         rewards.append(sum(sims) / len(sims) if sims else 0.0)
#     return rewards

def semantic_description_reward(prompts, completions, answer, **kwargs) -> list[float]:
    rewards = []
    for comp, answer in zip(completions, answer):
        try:
            pred = json.loads(comp[0]["content"])
            ground_truth = json.loads(answer)
        except:
            rewards.append(-1.0)
            continue

        sims = []
        for p_item, g_item in zip(pred, ground_truth):
            try:
                desc_p = str(p_item.get("declared_description", ""))
                desc_g = g_item.get("declared_description", "")
                emb = embedder.encode([desc_p, desc_g])
                sims.append(float(cosine_similarity([emb[0]], [emb[1]])[0][0]))
            # except AttributeError, Exception:
            except:
                # desc_p = ""
                scores.append(-1.0)
                continue
        rewards.append(sum(sims) / len(sims) if sims else 0.0)
    return rewards

# def composite_reward(prompts, completions, answers, **kwargs):
#     r1 = strict_format_reward_func(completions, **kwargs)
#     r2 = soft_format_reward_func(completions, **kwargs)
#     r3 = field_completeness_reward_func(completions, **kwargs)
#     r4 = field_level_reward(prompts, completions, answers, **kwargs)
#     r5 = jaccard_third_party_reward(prompts, completions, answers, **kwargs)
#     r6 = semantic_description_reward(prompts, completions, answers, **kwargs)
#     # weights
#     w = [1.0, 0.5, 1.0, 2.0, 1.0, 1.0]
#     return [
#         w[0]*a + w[1]*b + w[2]*c + w[3]*d + w[4]*e + w[5]*f
#         for a,b,c,d,e,f in zip(r1,r2,r3,r4,r5,r6)
#     ]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [15]:
run = wandb.init(
    project="Fine-tune Llama 3 8B Instruct GRPO",
)

In [18]:
max_prompt_length = 256

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4, # Increase to 4 for smoother training
    num_generations = 2, # Decrease if out of memory
    save_strategy="steps",                # Hoặc "epoch" nếu bạn muốn lưu sau mỗi epoch
    save_steps=10,                       # Lưu mỗi 500 steps (tùy bạn chỉnh)
    save_total_limit=3,                   # Giới hạn số lượng checkpoint (xóa cũ)
    logging_dir="./logs",                 # Thư mục lưu log TensorBoard
    # logging_steps=50,                    # Ghi log mỗi 100 bước
    # max_prompt_length = max_prompt_length,
    # max_completion_length = max_seq_length - max_prompt_length,
    num_train_epochs = 1, # Set to 1 for a full training run
    # max_steps = 250,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "llama_3_1_8B_GRPO",
    max_prompt_length=2048
)

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [19]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        strict_format_reward_func,
        soft_format_reward_func,
        correctness_reward_func,
        field_completeness_reward_func,
        field_level_reward,
        jaccard_third_party_reward,
        semantic_description_reward,
    ],
    args = training_args,
    train_dataset = dataset,
)

trainer.train()
# trainer.train(resume_from_checkpoint=True)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,285 | Num Epochs = 1 | Total steps = 1,821
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 28,311,552/8,000,000,000 (0.35% trained)


Answer is:  ['[]', '[]']
******************** Question:
Content: None 
Answer:
[] 
Response:
[]
Len of comp:  1
Answer:  []
---> No.0  []
Len of comp:  1
Answer:  []
---> No.0  []
Answer is:  ['[]', '[]']
******************** Question:
Cookie policy: Home Cookie policy (EU)
Cookie policy (EU)
This Cookie Policy was last updated on March 5, 2024 and applies to citizens and permanent residents of the European Economic Area.
1. Introduction
Our website www.gardis.cz (hereinafter referred to as the "website") uses cookies and other related technologies (for convenience, all technologies are referred to as "cookies"). Cookies are also placed by third parties that we have engaged. In the document below, we inform you about the use of cookies on our website.
2. What are cookies?
A cookie is a small, simple file that is sent along with the pages of this website and stored by your browser on the hard drive of your computer or other device. The information stored in them may be returned to our s

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / strict_format_reward_func,rewards / soft_format_reward_func,rewards / correctness_reward_func,rewards / field_completeness_reward_func,rewards / field_level_reward,rewards / jaccard_third_party_reward,rewards / semantic_description_reward
1,-0.0,-0.375,3.005204,29.25,0.0,0.5,0.25,-0.25,-0.125,-0.25,-0.25,-0.25
2,0.0,-3.5,1.414214,70.0,0.0,-0.25,-0.125,-0.625,-0.625,-0.625,-0.625,-0.625
3,0.0,0.75,1.414214,48.875,0.0,0.75,0.375,-0.125,0.125,-0.125,-0.125,-0.125
4,-0.0,-2.087637,6.240023,143.0,0.0,0.0,0.0,-0.5,-0.291667,-0.433333,-0.458333,-0.404304
5,0.0,-3.5,4.242641,48.25,9e-06,-0.25,-0.125,-0.625,-0.625,-0.625,-0.625,-0.625
6,0.0,1.625,0.176777,33.75,1.8e-05,1.0,0.5,0.0,0.125,0.0,0.0,0.0
7,0.0,-4.070693,3.435559,95.125,8e-06,-0.5,-0.25,-0.75,-0.625,-0.65,-0.625,-0.670693
8,0.0,-2.25,2.828427,110.125,1.3e-05,0.0,0.0,-0.5,-0.25,-0.5,-0.5,-0.5
9,0.0,-0.5,2.828427,11.5,1e-05,0.5,0.25,-0.25,-0.25,-0.25,-0.25,-0.25
10,0.0,-0.375,3.005204,26.0,1.8e-05,0.5,0.25,-0.25,-0.125,-0.25,-0.25,-0.25


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Made with real coconut, our bite-sized treats are packed with flavor and all-natural ingredients. They’re perfect for anyone looking for a guilt-free indulgence.
Michi’s Coconut Cookie Candy is perfect for any occasion, whether you need a quick snack on the go or a sweet treat to enjoy with friends and family. Plus, they’re great for satisfying your cravings without derailing your healthy lifestyle.
So why wait? Try our Michi’s Coconut Cookie Candy today and experience the delicious taste of all-natural ingredients in every bite. You won’t regret it!

Table: [] 
Answer:
[] 
Response:
[] 

(The provided policy does not mention any specific cookies.)
Len of comp:  1
Answer:  []
---> No.0  [] 

(The provided policy does not mention any specific cookies.)
Len of comp:  1
Answer:  []
---> No.0  []
Answer is:  ['[]', '[]']
******************** Question:
Cookie policy: Cookies Policy
1. What is a cookie? And what do we use this 

TypeError: 'NoneType' object is not iterable

In [None]:
wandb.finish()
model.config.use_cache = True

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's first try the model without any GRPO trained:

In [None]:
def infer():
  prompt = input("Enter your cookie policy")
  text = tokenizer.apply_chat_template([
      {'role': 'system', 'content': SYSTEM_PROMPT},
      {"role" : "user", "content" : prompt},
  ], tokenize = False, add_generation_prompt = True)

  from vllm import SamplingParams
  sampling_params = SamplingParams(
      temperature = 0.8,
      top_p = 0.95,
      max_tokens = 1024,
  )
  output = model.fast_generate(
      [text],
      sampling_params = sampling_params,
      lora_request = None,
  )[0].outputs[0].text

  print(output)

infer()

And now with the LoRA we just trained with GRPO - we first save the LoRA first!

In [None]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [None]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi."},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!

<a name="Save"></a>
### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
