<a href="https://colab.research.google.com/github/jadericdawson/Adafruit_SSD1306/blob/master/WBI_TRL_training_12May2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Colab Notebook: Fine-tuning with Unsloth GRPO for WBI DoD Proposals
This notebook demonstrates how to fine-tune a language model using Unsloth and Group Relative Policy Optimization (GRPO) to generate proposal content for the Wright Brothers Institute (WBI) in response to hypothetical Department of Defense (DoD) solicitations.

Phase 1: Setup and Installation
First, we install the necessary libraries, primarily Unsloth and its dependencies.



In [1]:
#@title 1.1 Install Libraries
# Install Unsloth first - let it pull its specific compatible dependencies like TRL
# Use the latest recommended command from Unsloth's GitHub for Colab
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install unsloth # Alternative if the above causes issues, but git install is often preferred

# Upgrade Pillow separately (usually safe)
!pip install --upgrade pillow

# Install other dependencies - Check Unsloth docs if specific versions are needed
# Often Unsloth's install handles PEFT, Accelerate, Bitsandbytes correctly.
# Avoid explicitly upgrading TRL here. If you need specific versions of others, pin them:
# Example: !pip install "peft==0.11.1" "accelerate==0.30.1" # <== FIND CORRECT VERSIONS IF NEEDED

# Install vLLM if you plan to use it (might have its own dependencies)
!pip install vllm

print("Installation complete.")
print("IMPORTANT: You MUST RESTART THE RUNTIME now for all changes to take effect.")
print("Runtime > Restart Runtime (or Ctrl+M .)")

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-icsg5f5p/unsloth_dd455495a7024f11a7d6843a309cec5c
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-icsg5f5p/unsloth_dd455495a7024f11a7d6843a309cec5c
  Resolved https://github.com/unslothai/unsloth.git to commit c281a787b20e1dd564ee10755f9aaa86191b3e0e
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting trl!=0.15.0,!=0.9.0,!=0.9.1,!=0.9.2,!=0.9.3,<=0.15.2,>=0.7.9 (from unsloth_zoo>=2025.5.1->unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Using cached trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Using cached trl-0.15.2

Text Cell:
After running the cell above, restart your Colab runtime by navigating to "Runtime" > "Restart Runtime" (or using the shortcut Ctrl+M .). This is crucial for the newly installed libraries to be correctly loaded.

Phase 2: Model Loading and LoRA Configuration with Unsloth
We'll load a base model efficiently using Unsloth's FastModel and then apply LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

In [1]:
#@title 2.1 Load Model and Configure LoRA
import torch
from unsloth import FastModel
# from peft import LoraConfig # Unsloth's FastModel.get_peft_model can handle this internally

# --- Configuration ---
max_seq_length = 2048  # Max sequence length for the model (prompts + completions)
load_in_4bit = True    # Use 4-bit quantization for memory efficiency on Colab T4

# Model Selection: Choose a suitable instruction-tuned model from Unsloth's offerings.
# Smaller models like Gemma-1B/3B or Llama-3.1-3B are good for Colab.
# Using a pre-quantized 4-bit model from Unsloth is recommended.
model_name = "unsloth/gemma-3-1b-it-bnb-4bit"
# Alternatives:
# model_name = "unsloth/gemma-1b-it-bnb-4bit" # Even smaller Gemma
# model_name = "unsloth/Llama-3.1-8B-bnb-4bit" # Larger, might require Colab Pro or careful memory management

print(f"Loading model: {model_name}")
model, tokenizer = FastModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    # token="hf_YOUR_TOKEN_HERE",  # Add your Hugging Face token if using a gated model
)
print("Model loaded.")

# --- Add LoRA Adapters ---
# Configure LoRA using Unsloth's helper function for PEFT.
print("Configuring LoRA adapters...")
model = FastModel.get_peft_model(
    model,
    r=16,  # LoRA rank (e.g., 8, 16, 32).
    lora_alpha=16,  # Scaling factor, often equal to r or 2*r.
    lora_dropout=0.05,  # Dropout for LoRA layers.
    bias="none",  # Type of bias training. "none" is common for LoRA.
    use_gradient_checkpointing="unsloth", # Unsloth's method for memory saving.
    random_state=3407, # For reproducibility.
    target_modules=[ # Specify layers to apply LoRA. Unsloth might auto-detect.
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
print("LoRA adapters configured.")
model.print_trainable_parameters()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-12 16:56:19 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-12 16:56:19 [__init__.py:239] Automatically detected platform cuda.
Loading model: unsloth/gemma-3-1b-it-bnb-4bit
==((====))==  Unsloth 2025.5.1: Fast Gemma3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded.
Configuring LoRA adapters...
Unsloth: Making `model.base_model.model.model` require gradients
LoRA adapters configured.
trainable params: 13,045,760

Phase 3: Data Preparation for WBI DoD Proposals
This is a critical step where we define the structure for our "thinking" LLM and prepare a dataset based on WBI's information and hypothetical DoD solicitation prompts.

In [2]:
#@title 3.1 Define Custom Tags, WBI Info, System Prompt, and Dataset
from datasets import Dataset
import random # Not used in this snippet, but often useful for dataset manipulation

# --- Define Custom Tags for WBI Proposal Structure ---
WBI_THINK_START_TAG = "<WBI_THINK_START>"
WBI_THINK_END_TAG = "<WBI_THINK_END>"
WBI_PROPOSAL_SECTION_START_TAG = "<WBI_PROPOSAL_SECTION_START>"
WBI_PROPOSAL_SECTION_END_TAG = "<WBI_PROPOSAL_SECTION_END>"

# --- Expanded WBI Core Concepts & Program Examples (for rewards and system prompt context) ---
# Derived comprehensively from the WBI company information you provided.

WBI_CORE_CONCEPTS_FOR_REWARD = [
    # Names & Aliases
    "Wright Brothers Institute", "WBI",
    # Key Partners & Entities
    "AFRL", "Air Force Research Laboratory", "$4 billion technology powerhouse", "AFRL/RX", "AFRL's Materials and Manufacturing Directorate",
    "AFLCMC", "Air Force Life Cycle Management Center", "AFMC", "Air Force Materiel Command",
    "AFIMSC", "Air Force Installation and Mission Support Center", "AFICC", "Air Force Installation Contracting Center",
    "NSWC Crane", "Naval Surface Warfare Center Crane Division", "US Navy",
    "Universities", "University of Dayton", "Wright State University", "Purdue University", "University of Cincinnati", "Miami University",
    "Industry", "Industry partners", "Small Businesses", "Start-ups", "SBIR companies",
    "The Entrepreneurs' Center", "The Collaboratory", "Dayton Development Coalition", "National Instruments", "Emerson", "FDA",
    # Locations & Facilities
    "Dayton", "Ohio", "Wright-Patterson Air Force Base", "WPAFB",
    "WBI Headquarters", "5000 Springfield Street", "Tec^Edge Innovation and Collaboration Center", "Tec^Edge",
    "IDEA Lab", "Discovery Labs", "high tech 'monster garage'", "25,000 sq. ft. facility",
    "Proving Ground", "Works", "105 Janney Road", "The Hub", "444 East Second St.",
    # Organizational & Legal
    "Non-profit", "501(c)(3)", "Tax-Exempt since May 2003", "Founded 2002",
    "Partnership Intermediary Agreement", "PIA", "First PIA with Air Force",
    "Neutral convener", "Agile facilitator", "Innovation intermediary",
    "Independent Board of Trustees", "Financial stewardship", "Program service revenue",
    # Mission, Vision & Framework
    "Discover, Develop, Deliver", "Accelerate solutions", "Cutting-edge solutions", "Warfighter solutions",
    "Empower and enhance decision-making", "Shap[e] the Future of Defense",
    "Transferring real-world solutions and capabilities to the warfighter", "First stop for innovation",
    "Uniting stakeholders", "end-users", "technologists", "acquirers", "experts", "Subject Matter Experts", "SMEs",
    "Bridging the 'valley of death'", "Unexpected and undiscovered Warfighter Solutions",
    # Capabilities & Activities
    "Spearheading defense innovation", "Multi-sector collaboration", "Collaborative environments",
    "Technology transfer", "Tech transfer", "T3", "Technology transition", "Commercialization",
    "Spin-in", "Spin-out", "Rapid prototyping", "Workforce development",
    "Disruptive innovation processes", "Ecosystem building", "WBI Ecosystem",
    "Problem-solving methodologies", "sprints", "open innovation", "Divergent Collaboration tool",
    "Market analytics", "Manpower solutions", "Predictive modeling", "Inter-service collaboration",
    "Technology scouting", "IP bundling", "Licensing opportunities",
    # Economic & Regional Impact
    "Regional economic development", "Stimulating industrial base", "Dayton aerospace advancement",
    # General Strategic Terms
    "National security", "Defense innovation ecosystem", "Complex challenges", "Operational capabilities",
    "Emerging technologies", "AI", "Artificial Intelligence", "quantum computing", "advanced space technologies", "synthetic biology"
]

WBI_PROGRAM_EXAMPLES_FOR_REWARD = [
    # Flagship & Recurring Programs
    "Software Defined Radio University Challenge", "SDR University Challenge", "SDR Challenge",
    "Tec^Edge Innovation and Collaboration Center", "Tec^Edge", # Also a facility, but used programmatically
    "DoD SkillBridge Program", "SkillBridge",
    "TECH-ARTS Collaboration", "TECH-ARTS",
    "Collaboration Accelerator",
    "AFRL Small Business Hub", "Small Business Hub",
    "Lunch & Learn Program",
    # Specific Events & Initiatives
    "Demystifying the Acquisition Process",
    "Small Business Infrastructure and matchmaking Collider", "Collider events",
    "Developing Elite Acquisition Workforce",
    "AI Manufacturing Network", # WBI supported setting up collaborative environment
    "AFRL Manpower Analytics",
    "Summer of Innovation", # Activity type WBI supports
    "Accelerators", # Commercialization service/program type
    "AFRL Regional Network", # Pilot initiative
    # Methodologies named as initiatives/tools
    "Divergent Collaboration" # Ideation tool developed
]

# --- Create a System Prompt for WBI Proposal Generation ---
# This guides the LLM on its role, expected output format, and knowledge domain.
wbi_system_prompt_content = f"""You are an expert assistant for the Wright Brothers Institute (WBI), a non-profit 501(c)(3) innovation intermediary.
Your task is to generate content to satisfy Department of Defense (DoD) solicitations by creating proposal sections, showcasing WBI's capabilities, experience, and value.
Structure your response as follows:
1.  First, provide your thinking process and rationale. Enclose this detailed reasoning within {WBI_THINK_START_TAG} and {WBI_THINK_END_TAG} tags. In this section, explain how WBI's strengths, past projects, and operational model (like "Discover, Develop, Deliver") address the user's request.
2.  Second, provide the drafted proposal section text. Enclose this formal text within {WBI_PROPOSAL_SECTION_START_TAG} and {WBI_PROPOSAL_SECTION_END_TAG} tags.

When generating content:
-   Accurately reflect WBI's mission to accelerate solutions for the warfighter by uniting end-users, technologists, acquirers, and experts from industry and academia.
-   Emphasize WBI's cornerstone Partnership Intermediary Agreement (PIA) with the Air Force Research Laboratory (AFRL).
-   Incorporate relevant WBI programs (e.g., {', '.join(WBI_PROGRAM_EXAMPLES_FOR_REWARD[:3])}), partnerships (e.g., AFRL, AFLCMC, NSWC Crane), and core capabilities (e.g., rapid prototyping, technology transfer, commercialization, multi-sector collaboration).
-   Highlight WBI's role as a neutral convener and agile facilitator in the Dayton, Ohio, defense ecosystem and beyond.
-   Cite specific WBI initiatives or achievements as evidence where appropriate, drawing from WBI's established record and programs.
"""

# --- Create a Dataset of DoD-Style Prompts and Reference Data for Rewards ---
# These prompts simulate questions from a DoD solicitation.
# `expected_keywords` and `expected_programs` are used by reward functions.
# For effective fine-tuning, you'll need a much larger and more diverse dataset (hundreds or thousands of examples).
wbi_dod_prompts_with_references = [
    {
        "id": "wbi_dod_001",
        "question": "Describe WBI's methodology for leveraging its Partnership Intermediary Agreement with AFRL to accelerate technology development and transition. Provide examples of collaborative environments WBI utilizes.",
        "expected_keywords": ["PIA", "AFRL", "technology transfer", "Discover, Develop, Deliver", "collaboration", "Tec^Edge", "neutral convener"],
        "expected_programs": ["Tec^Edge Innovation and Collaboration Center", "SDR University Challenge"]
    },
    {
        "id": "wbi_dod_002",
        "question": "Explain WBI's approach to enhancing Air Force operational capabilities through talent development and fostering novel problem-solving methodologies.",
        "expected_keywords": ["SkillBridge", "workforce development", "TECH-ARTS", "SDR Challenge", "Collaboration Accelerator", "problem-solving", "Divergent Collaboration"],
        "expected_programs": ["DoD SkillBridge Program", "TECH-ARTS Collaboration", "SDR University Challenge", "Developing Elite Acquisition Workforce"]
    },
    {
        "id": "wbi_dod_003",
        "question": "Detail WBI's experience in engaging with the Air Force Life Cycle Management Center (AFLCMC) and supporting small business participation in defense contracts.",
        "expected_keywords": ["AFLCMC", "Small Business Hub", "Collider events", "acquisition process", "matchmaking", "SBIR companies"],
        "expected_programs": ["AFRL Small Business Hub", "Demystifying the Acquisition Process", "Small Business Infrastructure and matchmaking Collider"]
    },
    {
        "id": "wbi_dod_004",
        "question": "Outline WBI's contribution to the Dayton regional defense industrial base and its strategy for technology commercialization benefiting both defense and commercial markets.",
        "expected_keywords": ["Dayton", "regional economic development", "commercialization", "dual-use", "industrial base", "The Entrepreneurs' Center", "spin-out", "spin-in"],
        "expected_programs": ["Accelerators"]
    },
    {
        "id": "wbi_dod_005",
        "question": "How does WBI's 'Discover, Develop, Deliver' framework specifically ensure that solutions are ready for acquisition and meet warfighter needs?",
        "expected_keywords": ["Discover", "Develop", "Deliver", "acquisition readiness", "warfighter needs", "risk minimization", "SMEs", "valley of death"],
        "expected_programs": []
    },
    {
        "id": "wbi_dod_006",
        "question": "Discuss WBI's role in facilitating inter-service collaboration, citing an example such as its PIA with NSWC Crane.",
        "expected_keywords": ["NSWC Crane", "PIA", "inter-service", "Navy", "Air Force", "joint commercialization", "IP bundling", "Force Multiplier"],
        "expected_programs": ["NSWC Crane PIA"] # This is the key program example
    },
    {
        "id": "wbi_dod_007",
        "question": "What are the key features and benefits of WBI's Tec^Edge Innovation and Collaboration Center for AFRL and its partners?",
        "expected_keywords": ["Tec^Edge", "IDEA Lab", "Discovery Labs", "rapid prototyping", "monster garage", "R&D collaborations", "25,000 sq. ft. facility", "subject matter experts"],
        "expected_programs": ["Tec^Edge Innovation and Collaboration Center"]
    },
    {
        "id": "wbi_dod_008",
        "question": "How does WBI's non-profit status and role as a neutral convener benefit its mission to unite diverse stakeholders for defense innovation?",
        "expected_keywords": ["non-profit", "501(c)(3)", "neutral convener", "uniting stakeholders", "industry", "academia", "government", "trust", "open interchanges"],
        "expected_programs": []
    },
    {
        "id": "wbi_dod_009",
        "question": "Describe WBI's approach to technology scouting and commercialization, including both 'spin-in' and 'spin-out' strategies.",
        "expected_keywords": ["technology scouting", "commercialization", "spin-in", "spin-out", "Leveraging the Lab", "Leveraging the Marketplace", "Accelerators"],
        "expected_programs": ["Accelerators"]
    },
    {
        "id": "wbi_dod_010",
        "question": "Explain WBI's involvement in workforce development initiatives beyond the DoD SkillBridge program, such as supporting the AFMC/AFLCMC acquisition workforce.",
        "expected_keywords": ["workforce development", "Developing Elite Acquisition Workforce", "AFMC", "AFLCMC", "talent pipeline", "STEM"],
        "expected_programs": ["Developing Elite Acquisition Workforce", "SDR University Challenge", "SkillBridge"]
    }
]

# --- Map Raw Data to GRPOTrainer Format ---
# 'prompt' will be fed to the model (as a chat interaction).
# 'answer' dictionary will be passed to reward functions for checking against generated completions.
def format_wbi_dataset_for_grpo(example):
    return {
        "prompt": [
            {"role": "system", "content": wbi_system_prompt_content},
            {"role": "user", "content": example["question"]},
        ],
        "answer": {  # This 'answer' dict is passed to reward functions
            "id": example["id"],
            "reference_keywords": example.get("expected_keywords", []), # Use .get for safety
            "reference_programs": example.get("expected_programs", []), # Use .get for safety
        }
    }

train_dataset_wbi = Dataset.from_list(wbi_dod_prompts_with_references).map(format_wbi_dataset_for_grpo)

print("--- Formatted WBI Training Dataset Example ---")
print(f"Number of examples: {len(train_dataset_wbi)}")
if len(train_dataset_wbi) > 0:
    print(train_dataset_wbi[0])
else:
    print("Dataset is empty. Please add examples to `wbi_dod_prompts_with_references`.")
print("--- End of Dataset Example ---")

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

--- Formatted WBI Training Dataset Example ---
Number of examples: 10
{'id': 'wbi_dod_001', 'question': "Describe WBI's methodology for leveraging its Partnership Intermediary Agreement with AFRL to accelerate technology development and transition. Provide examples of collaborative environments WBI utilizes.", 'expected_keywords': ['PIA', 'AFRL', 'technology transfer', 'Discover, Develop, Deliver', 'collaboration', 'Tec^Edge', 'neutral convener'], 'expected_programs': ['Tec^Edge Innovation and Collaboration Center', 'SDR University Challenge'], 'prompt': [{'content': 'You are an expert assistant for the Wright Brothers Institute (WBI), a non-profit 501(c)(3) innovation intermediary.\nYour task is to generate content to satisfy Department of Defense (DoD) solicitations by creating proposal sections, showcasing WBI\'s capabilities, experience, and value.\nStructure your response as follows:\n1.  First, provide your thinking process and rationale. Enclose this detailed reasoning within <W

Phase 4: Define Custom Reward Functions
These functions are the heart of GRPO. They score the LLM's generated outputs based on adherence to our desired format and relevance to WBI's information and the specific prompt.

In [6]:
#@title 4.1 Define Reward Functions for WBI DoD Proposals
import re

# --- Regex to match our desired WBI output format accurately ---
# Using [\s\S] to match any character including newlines.
# Using non-greedy `+?` for content within tags.
# Anchoring with `^` and `$` and using `fullmatch` ensures the entire string conforms.
wbi_output_format_regex = re.compile(
    rf"^{WBI_THINK_START_TAG}([\s\S]+?){WBI_THINK_END_TAG}[\s\S]*?"
    rf"{WBI_PROPOSAL_SECTION_START_TAG}([\s\S]+?){WBI_PROPOSAL_SECTION_END_TAG}$",
    flags=re.MULTILINE | re.DOTALL
)

# --- Helper to extract answer references ---
def get_answer_references(kwargs):
    """Safely extracts the answer references batch from kwargs."""
    answer_refs = kwargs.get("answer") # Use the dataset column name "answer" as the key
    if answer_refs is None:
        print("WARNING: 'answer' key not found in reward_kwargs! Cannot score content.")
        # Return None or an empty list depending on how you want to handle downstream
        return None
    return answer_refs

def reward_wbi_exact_format_adherence(completions, prompts, **kwargs): # Removed answer_references_batch, added **kwargs
    """
    Rewards completions that perfectly match the specified WBI THINK and PROPOSAL_SECTION format
    and have substantial content in both sections.
    'completions': List of strings (generated outputs).
    'prompts': List of prompt inputs for the batch.
    'kwargs': Dictionary possibly containing extra dataset columns like 'answer'.
    """
    # NOTE: This specific function doesn't actually NEED answer_references_batch,
    # so you could technically just remove it without adding the extraction logic below.
    # However, keeping the pattern consistent with the other functions is good practice.
    # answer_references_batch = get_answer_references(kwargs) # Extract if needed later

    scores = []
    for completion_text in completions:
        score = 0.0
        match = wbi_output_format_regex.fullmatch(completion_text.strip())
        if match:
            think_content = match.group(1).strip()
            proposal_content = match.group(2).strip()
            if len(think_content) > 25 and len(proposal_content) > 40:
                score += 3.0
            elif len(think_content) > 10 or len(proposal_content) > 20:
                score += 1.0
            else:
                score += 0.2
        scores.append(score)
    return scores

def reward_wbi_approximate_format_presence(completions, prompts, **kwargs): # Removed answer_references_batch, added **kwargs
    """
    Rewards completions for the presence of individual tags, even if the overall format isn't perfect.
    Penalizes if tags are missing or duplicated, encouraging proper tag usage.
    """
    # NOTE: This function also doesn't NEED answer_references_batch.
    # answer_references_batch = get_answer_references(kwargs) # Extract if needed

    scores = []
    all_defined_tags = [WBI_THINK_START_TAG, WBI_THINK_END_TAG, WBI_PROPOSAL_SECTION_START_TAG, WBI_PROPOSAL_SECTION_END_TAG]
    for completion_text in completions:
        score = 0.0
        for tag in all_defined_tags:
            tag_count = completion_text.count(tag)
            if tag_count == 1:
                score += 0.5
            elif tag_count > 1:
                score -= 0.5
        scores.append(score)
    return scores

def reward_wbi_content_and_evidence(completions, prompts, **kwargs): # Removed answer_references_batch, added **kwargs
    """
    Rewards completions for including relevant WBI keywords (general and prompt-specific)
    and specific WBI program examples within the PROPOSAL_SECTION.
    Also gives minor rewards for keywords in THINK section.
    """
    answer_references_batch = get_answer_references(kwargs) # EXTRACT THE DATA HERE
    if answer_references_batch is None:
         # If data is missing, return default low scores as we can't evaluate content
         return [-2.0] * len(completions) # Penalize slightly or return 0

    batch_scores = []
    # Ensure length matches completions in case extraction failed partially (though get_answer_references handles full failure)
    if len(answer_references_batch) != len(completions):
        print(f"WARNING: Mismatch between completions ({len(completions)}) and answer_references ({len(answer_references_batch)}) lengths.")
        # Handle mismatch, e.g., return default scores
        return [-2.0] * len(completions)

    for i, completion_text in enumerate(completions):
        current_score = 0.0
        # Get the 'answer' reference dictionary for this specific completion.
        current_prompt_answer_ref = answer_references_batch[i] # Now correctly indexed
        expected_prompt_specific_keywords = current_prompt_answer_ref.get("reference_keywords", [])
        expected_prompt_specific_programs = current_prompt_answer_ref.get("reference_programs", [])

        # --- REST OF THE FUNCTION LOGIC REMAINS THE SAME ---
        format_match = wbi_output_format_regex.search(completion_text.strip())
        if format_match:
            think_section_content = format_match.group(1).lower().strip()
            proposal_section_content = format_match.group(2).lower().strip()

            # ... (keyword and program checking logic) ...
            for core_keyword in WBI_CORE_CONCEPTS_FOR_REWARD:
                 if core_keyword.lower() in proposal_section_content: current_score += 0.05
                 elif core_keyword.lower() in think_section_content: current_score += 0.02

            for expected_keyword in expected_prompt_specific_keywords:
                 if expected_keyword.lower() in proposal_section_content: current_score += 0.4
                 elif expected_keyword.lower() in think_section_content: current_score += 0.1

            for program_keyword in expected_prompt_specific_programs:
                 if program_keyword.lower() in proposal_section_content: current_score += 0.6
                 elif program_keyword.lower() in think_section_content: current_score += 0.2

            if len(think_section_content) > 50 and len(proposal_section_content) > 100:
                current_score += 0.5
        else:
            current_score -= 1.0

        batch_scores.append(min(max(current_score, -2.0), 5.0))
        # --- END OF ORIGINAL LOGIC ---

    return batch_scores

# --- List of reward functions (remains the same) ---
wbi_custom_reward_functions = [
    reward_wbi_exact_format_adherence,
    reward_wbi_approximate_format_presence,
    reward_wbi_content_and_evidence,
]
print(f"Defined {len(wbi_custom_reward_functions)} reward functions with updated signatures.")

Defined 3 reward functions with updated signatures.


Phase 5: Model Training with GRPOTrainer
This phase configures and runs TRL's GRPOTrainer using our Unsloth-prepared model, custom WBI dataset, and the reward functions defined above.

In [7]:
#@title 5.1 Configure and Run GRPOTrainer
from trl import GRPOConfig, GRPOTrainer
# import wandb # Uncomment if using Weights & Biases for logging

# --- Training Configuration ---
# Adjust these parameters based on your Colab resources (GPU type, RAM) and dataset size.
num_generations_per_prompt_config = 16  # (K in GRPO paper) Completions per prompt. Higher demands more VRAM. (4-12 typical).
per_device_batch_size_config = 1       # Number of PROMPTS processed per device in one optimizer step.
gradient_accumulation_steps_config = 4 # Effective batch size = per_device_bs * grad_acc_steps.
max_training_steps_config = 150        # For initial testing (e.g., 50-200). For a full run, use num_train_epochs or more steps.
                                       # Unsloth's example uses 50 for a quick run. More data needs more steps.
learning_rate_config = 5e-6            # Learning rate, often lower for fine-tuning.

# Calculate max_prompt_length and max_completion_length for GRPOTrainer.
# This is crucial. GRPOTrainer will truncate prompts/completions if they exceed these token limits.
# max_prompt_length should accommodate your system + user prompt after tokenization.
# max_completion_length is for the generated text (<THINK>...</THINK><PROPOSAL>...</PROPOSAL>).
# Ensure: max_prompt_length + max_completion_length <= model.config.max_position_embeddings (or `max_seq_length` used for model loading)

# Estimate average tokenized prompt length (can be refined with actual tokenization).
# The system prompt can be quite long.
avg_tokenized_prompt_len_estimate = 400 # Adjust this based on your data.
cfg_max_prompt_len = min(avg_tokenized_prompt_len_estimate + 150, max_seq_length // 2) # Buffer for variability.
cfg_max_completion_len = max_seq_length - cfg_max_prompt_len - 30 # Buffer for special tokens, etc.

print(f"--- GRPO Training Configuration ---")
print(f"Max Sequence Length (Model): {max_seq_length}")
print(f"Target Max Prompt Length (Config): {cfg_max_prompt_len}")
print(f"Target Max Completion Length (Config): {cfg_max_completion_len}")
if cfg_max_prompt_len + cfg_max_completion_len > max_seq_length:
    print("WARNING: Sum of prompt and completion lengths might exceed model's max sequence length!")

wbi_grpo_training_args = GRPOConfig(
    output_dir="wbi_grpo_unsloth_outputs_run2", # Change for different runs
    learning_rate=learning_rate_config,
    optim="adamw_torch_fused",  # Unsloth often recommends for speed with their models.
    logging_steps=10,  # Log metrics frequently for observation.
    per_device_train_batch_size=per_device_batch_size_config,
    gradient_accumulation_steps=gradient_accumulation_steps_config,
    num_generations=num_generations_per_prompt_config,
    max_prompt_length=cfg_max_prompt_len,
    max_completion_length=cfg_max_completion_len,
    max_steps=max_training_steps_config,  # For quick test; use num_train_epochs for full run (e.g., 1-3).
    # num_train_epochs=1, # Uncomment for a full epoch run, and comment out max_steps.
    save_strategy="steps",
    save_steps=max_training_steps_config // 3 if max_training_steps_config >= 60 else max_training_steps_config,
    max_grad_norm=0.1,  # Gradient clipping, from Unsloth example.
    report_to="none",  # Change to "wandb" if using Weights & Biases.
    remove_unused_columns=False,  # CRITICAL: Keep "answer" column for reward functions.
    bf16=torch.cuda.is_bf16_supported(), # Use bfloat16 if available (A100+).
    # fp16=not torch.cuda.is_bf16_supported(), # Use float16 if bf16 not available (like on T4).
    # load_best_model_at_end=True, # Optional, if using evaluation and want to keep the best model.
    # evaluation_strategy="steps", # If eval_dataset is provided.
    # eval_steps=50, # How often to evaluate.
)

# --- Initialize GRPOTrainer ---
# Ensure the tokenizer is correctly passed.
wbi_grpo_trainer = GRPOTrainer(
    model=model,  # Your Unsloth LoRA-adapted model.
    tokenizer=tokenizer, # Pass the Unsloth-loaded tokenizer.
    args=wbi_grpo_training_args,
    reward_funcs=wbi_custom_reward_functions, # Your list of custom WBI reward functions.
    train_dataset=train_dataset_wbi,
    # eval_dataset=eval_dataset_wbi, # Uncomment if you have an evaluation dataset.
)

# --- (Optional) Initialize W&B if used ---
# project_name_wandb = "WBI-DoD-GRPO-Unsloth-FineTuning"
# if wbi_grpo_training_args.report_to == "wandb":
#     try:
#         import wandb
#         wandb.init(project=project_name_wandb, config=wbi_grpo_training_args)
#     except ImportError:
#         print("wandb not installed. Skipping W&B initialization.")
#         wbi_grpo_training_args.report_to = "none"


print("\nStarting WBI GRPO fine-tuning with Unsloth...")
try:
    wbi_grpo_trainer.train()
    print("WBI GRPO fine-tuning finished successfully.")
except Exception as e:
    print(f"An error occurred during training: {e}")
    import traceback
    traceback.print_exc()


# --- (Optional) Finish W&B run ---
# if wbi_grpo_training_args.report_to == "wandb" and wandb.run is not None:
#     wandb.finish()

--- GRPO Training Configuration ---
Max Sequence Length (Model): 2048
Target Max Prompt Length (Config): 550
Target Max Completion Length (Config): 1468
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 16

Starting WBI GRPO fine-tuning with Unsloth...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10 | Num Epochs = 75 | Total steps = 150
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 4 x 1) = 64
 "-____-"     Trainable parameters = 13,045,760/1,000,000,000 (1.30% trained)


An error occurred during training: 'list' object has no attribute 'strip'


Traceback (most recent call last):
  File "<ipython-input-7-ad391ba4c7b6>", line 81, in <cell line: 0>
    wbi_grpo_trainer.train()
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 315, in _fast_inner_training_loop
  File "<string>", line 25, in _unsloth_training_step
  File "/content/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 1034, in _prepare_inputs
    output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<ipython-input-6-7dbad295b64e>", line 40, in reward_wbi_exact_format_adherence
    match = wbi_output_format_regex.fullmatch(completion_text.strip())
                                              ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'strip'


Phase 6: Inference and Saving the Model
After training, save your fine-tuned LoRA adapters. Then, test the model's generation capabilities on new prompts.

In [5]:
#@title 6.1 Save LoRA Adapters and Perform Inference

# --- Save LoRA Adapters and Tokenizer ---
# This saves only the learned LoRA weights, not the full model.
lora_output_directory = "wbi_dod_grpo_lora_adapters_final_run2" # Change for different runs
model.save_pretrained(lora_output_directory)
tokenizer.save_pretrained(lora_output_directory)
print(f"LoRA adapters and tokenizer configuration saved to: {lora_output_directory}")

# --- Inference with the Fine-tuned Model ---
# The `model` variable in this session currently holds the LoRA-adapted fine-tuned model.
from transformers import TextStreamer
import gc # Garbage collector

print("\n--- Inference Example using Fine-tuned LoRA Model ---")

# Example DoD-style prompt for testing inference
test_dod_prompt_for_inference = "Explain WBI's unique value proposition as a neutral convener in the Dayton defense ecosystem when tackling complex AFRL challenges."

# Format the prompt using the chat template, including the system prompt
inference_chat_messages_list = [
    {"role": "system", "content": wbi_system_prompt_content},
    {"role": "user", "content": test_dod_prompt_for_inference},
]

# Apply chat template to prepare the input text for the model
# `add_generation_prompt=True` is crucial for instruction-tuned models.
inference_input_text_formatted = tokenizer.apply_chat_template(
    inference_chat_messages_list,
    add_generation_prompt=True,
    tokenize=False, # Get the string first
)

# Tokenize the formatted prompt string
# Ensure the model and inputs are on the same device (usually "cuda" if GPU is available).
device = model.device
inference_tokenized_inputs = tokenizer(inference_input_text_formatted, return_tensors="pt").to(device)

print(f"\nGenerating response for DoD Prompt:")
print(f"'{test_dod_prompt_for_inference}'\n")

# Use a TextStreamer for real-time output if desired.
# Ensure tokenizer is passed correctly if it has special tokens for skipping.
text_streamer_for_inference = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Clear some memory before generation if needed, especially on constrained environments
torch.cuda.empty_cache()
gc.collect()

# Generate text
try:
    with torch.no_grad(): # Important for inference to save memory and speed up
        generated_ids = model.generate(
            input_ids=inference_tokenized_inputs.input_ids,
            attention_mask=inference_tokenized_inputs.attention_mask,
            max_new_tokens=cfg_max_completion_len,  # Allow enough tokens for THINK + PROPOSAL_SECTION
            temperature=0.5,  # Lower for more factual/deterministic, higher for creative.
            top_p=0.9,        # Nucleus sampling.
            do_sample=True,   # Enable sampling for diverse outputs. Set to False for greedy.
            streamer=text_streamer_for_inference,
            pad_token_id=tokenizer.eos_token_id # Crucial for consistent generation.
        )
    # The TextStreamer prints the output. If not using streamer, decode `generated_ids`.
    # decoded_output = tokenizer.decode(generated_ids[0, inference_tokenized_inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    # print(decoded_output)

except Exception as e:
    print(f"An error occurred during inference: {e}")
    import traceback
    traceback.print_exc()

print("\n--- End of Inference Example ---")


# --- (Optional) Merging LoRA and Saving for Deployment ---
# Unsloth provides methods to merge LoRA weights with the base model for standalone deployment.
# This creates a larger model file but doesn't require PEFT library at inference time.
# Refer to official Unsloth documentation for `save_pretrained_merged` and `save_pretrained_gguf`.

# Example for saving a merged float16 model:
# merged_model_dir_float16 = "wbi_dod_grpo_merged_model_float16_run2"
# should_save_merged_model = False  # Set to True to execute this block
# if should_save_merged_model:
#     print(f"\nMerging LoRA adapter and saving full model to float16 at: {merged_model_dir_float16}")
#     # Important: Ensure model is not loaded in 4-bit/8-bit if you want a true float16/bfloat16 merge.
#     # You might need to reload the base model in full precision, then apply adapters, then merge.
#     # Or, Unsloth's `save_pretrained_merged` might handle dequantization. Check their docs.
#     try:
#         # For Unsloth, you might need to reload the model without 4-bit, then apply adapter and merge.
#         # This is a simplified example; actual steps for unquantized merge might vary.
#         # If model is already FastModel with LoRA:
#         model.save_pretrained_merged(merged_model_dir_float16, tokenizer, save_method="merged_16bit")
#         print(f"Merged float16 model saved to {merged_model_dir_float16}")
#     except Exception as e:
#         print(f"Error during merged model saving: {e}")

# Example for saving to GGUF format (for llama.cpp, etc.):
# gguf_output_file_name = "wbi_dod_grpo_merged_q8_0_run2.gguf" # Note: GGUF path is often a filename
# should_save_gguf_model = False # Set to True to execute
# if should_save_gguf_model:
#     print(f"\nSaving model to GGUF format (Q8_0) as: {gguf_output_file_name}")
#     try:
#         # Ensure model is in a state suitable for GGUF conversion (e.g., merged).
#         model.save_pretrained_gguf(gguf_output_file_name, tokenizer, quantization_method="q8_0")
#         print(f"GGUF Q8_0 model saved as {gguf_output_file_name}")
#     except Exception as e:
#         print(f"Error during GGUF model saving: {e}")

LoRA adapters and tokenizer configuration saved to: wbi_dod_grpo_lora_adapters_final_run2

--- Inference Example using Fine-tuned LoRA Model ---

Generating response for DoD Prompt:
'Explain WBI's unique value proposition as a neutral convener in the Dayton defense ecosystem when tackling complex AFRL challenges.'

user
You are an expert assistant for the Wright Brothers Institute (WBI), a non-profit 501(c)(3) innovation intermediary.
Your task is to generate content to satisfy Department of Defense (DoD) solicitations by creating proposal sections, showcasing WBI's capabilities, experience, and value.
Structure your response as follows:
1.  First, provide your thinking process and rationale. Enclose this detailed reasoning within <WBI_THINK_START> and <WBI_THINK_END> tags. In this section, explain how WBI's strengths, past projects, and operational model (like "Discover, Develop, Deliver") address the user's request.
2.  Second, provide the drafted proposal section text. Enclose this 

Phase 7: Concluding Remarks and Next Steps
Text Cell:

This Colab notebook provides a comprehensive template for fine-tuning a language model using Unsloth and GRPO, tailored for generating WBI-specific DoD proposal content.

Key Next Steps and Considerations:

Expand Your Dataset: The quality and quantity of your wbi_dod_prompts_with_references dataset are paramount. Aim for hundreds, if not thousands, of diverse and high-quality examples that accurately reflect the types of queries in DoD solicitations and the corresponding relevant WBI information.
Iterate on Reward Functions: This is the most art-and-science part of GRPO. Continuously test and refine your wbi_custom_reward_functions. Adjust scoring weights, add new reward components (e.g., for conciseness, avoiding repetition, or stronger evidence linkage), and ensure they align with your definition of a "good" proposal section.
Hyperparameter Tuning: Experiment with learning_rate, LoRA parameters (r, lora_alpha), num_generations, batch sizes, and the number of training steps (max_steps or num_train_epochs). Use Weights & Biases (W&B) or similar tools to track experiments and identify optimal configurations.
Thorough Evaluation: Beyond the automated reward scores from GRPO, perform rigorous human evaluation of the generated outputs. Subject Matter Experts (SMEs) in proposal writing and WBI's operations should review the content for factual accuracy, relevance, persuasiveness, coherence, and adherence to the desired tone and style.
Memory Management (Colab): Continuously monitor VRAM usage in Colab. If you encounter Out-Of-Memory (OOM) errors, try:
Reducing max_seq_length.
Decreasing num_generations_per_prompt_config.
Lowering per_device_batch_size_config (though gradient_accumulation_steps_config can help maintain effective batch size).
Using smaller base models if necessary.
Ensuring 4-bit quantization is active (load_in_4bit=True).
Consult Unsloth Documentation: The Unsloth library is actively developed. Always refer to the official Unsloth GitHub repository and documentation for the latest installation instructions, API changes, new features, and best practices for fine-tuning and deployment.