# Reinforcement Fine-Tuning (RFT) with OpenR1-Math-220k Dataset

This notebook demonstrates how to fine-tune language models using **Reinforcement Fine-Tuning (RFT)** with the OpenR1-Math-220k dataset, a collection of 220,000 advanced mathematical reasoning problems with verified step-by-step solutions.

## What You'll Learn
1. Understand reinforcement fine-tuning (RFT) and how it differs from supervised fine-tuning (SFT)
2. Define a grader/reward function for mathematical reasoning
3. Prepare and upload mathematical reasoning datasets
4. Create and configure an RFT job using the OpenAI method
5. Monitor training progress and evaluate model performance
6. Deploy and test your RFT fine-tuned model

**Note**: Execute each cell in sequence.

## 1. Setup and Installation

Install all required packages from `requirements.txt`.

In [1]:
pip install -r requirements.txt

Collecting azure-ai-projects>=2.0.0b1 (from -r requirements.txt (line 2))
  Using cached azure_ai_projects-2.0.0b3-py3-none-any.whl.metadata (68 kB)
Collecting openai (from -r requirements.txt (line 5))
  Using cached openai-2.14.0-py3-none-any.whl.metadata (29 kB)
Collecting azure-identity (from -r requirements.txt (line 8))
  Using cached azure_identity-1.25.1-py3-none-any.whl.metadata (88 kB)
Collecting azure-mgmt-cognitiveservices (from -r requirements.txt (line 9))
  Using cached azure_mgmt_cognitiveservices-14.1.0-py3-none-any.whl.metadata (32 kB)
Collecting python-dotenv (from -r requirements.txt (line 12))
  Using cached python_dotenv-1.2.1-py3-none-any.whl.metadata (25 kB)
Collecting isodate>=0.6.1 (from azure-ai-projects>=2.0.0b1->-r requirements.txt (line 2))
  Using cached isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Collecting azure-core>=1.35.0 (from azure-ai-projects>=2.0.0b1->-r requirements.txt (line 2))
  Using cached azure_core-1.37.0-py3-none-any.whl.metadata (47




## 2. Import Libraries

In [1]:
import os
from dotenv import load_dotenv
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

print("All libraries imported successfully")

All libraries imported successfully


## 3. Configure Azure Environment

Set your Microsoft Foundry Project endpoint, model name and other environment variables. We're using **o4-mini** in this example for RFT.

Create a .env file with:

```
MICROSOFT_FOUNDRY_PROJECT_ENDPOINT=<your-endpoint>
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>
AZURE_AOAI_ACCOUNT=<your-foundry-account-name>
MODEL_NAME=o4-mini
```

In [2]:
load_dotenv()

endpoint = os.environ.get("MICROSOFT_FOUNDRY_PROJECT_ENDPOINT")
model_name = os.environ.get("MODEL_NAME", "o4-mini")

# Define RFT dataset file paths
training_file_path = "training_rft.jsonl"
validation_file_path = "validation_rft.jsonl"
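
Because `os.environ.get` returns `None` for an unset variable, a missing `.env` entry only surfaces later as a confusing connection or deployment error. The following is a minimal sketch of an early check, assuming the variable names from the `.env` template above; the helper name `missing_env_vars` is hypothetical and not part of any Azure SDK.

```python
import os

# Names taken from the .env template above. Adjust the list if your
# setup uses different variables (this list is an assumption).
REQUIRED_VARS = [
    "MICROSOFT_FOUNDRY_PROJECT_ENDPOINT",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_AOAI_ACCOUNT",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Run after load_dotenv() so values from the .env file are visible.
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

`MODEL_NAME` is left out of the list because the cell above already falls back to `o4-mini` when it is unset.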

## 4. Connect to Microsoft Foundry Project

Connect to Microsoft Foundry Project using Azure credential authentication. This initializes the project client and OpenAI client needed for fine-tuning workflows.

**Important**: Ensure you have the **Azure AI User** role assigned to your account for the Microsoft Foundry Project resource.

In [3]:
credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=endpoint, credential=credential)
openai_client = project_client.get_openai_client()

print("Connected to Microsoft Foundry Project")

Connected to Microsoft Foundry Project


## 5. Define Mathematical Grader for RFT

Reinforcement Fine-Tuning (RFT) requires a grader function to evaluate model outputs. Unlike SFT, which learns by imitating example completions, RFT learns from the reward signal that the grader assigns to each sampled response.

For mathematical reasoning, we use a **Python-based grader** that deterministically verifies the correctness of the final answer by:
1. Extracting the answer from the model's response using `\boxed{}` notation
2. Comparing it against the reference answer from the training data
3. Returning a score of 1.0 for correct answers and 0.0 for incorrect ones

This approach is more appropriate for math problems than LLM-based scoring, as mathematical correctness is objective.

In [4]:
# Python-based grader for Azure OpenAI RFT
# Compares final answers using \\boxed{} notation only

grading_function = """import re

def normalize(ans: str):
    try:
        if not isinstance(ans, str):
            return []
        parts = re.split(r"[,\\s]+", ans.strip())
        return sorted(p for p in parts if p)
    except Exception:
        return []


def extract_model_answer(text: str):
    try:
        if not text or not isinstance(text, str):
            return ""
        
        pattern = r"\\\\boxed\\{"
        matches = list(re.finditer(pattern, text))
        if not matches:
            return ""
        
        last_match = matches[-1]
        start = last_match.end()
        
        brace_count = 1
        i = start
        while i < len(text) and brace_count > 0:
            if text[i] == '{':
                brace_count += 1
            elif text[i] == '}':
                brace_count -= 1
            i += 1
        
        if brace_count == 0:
            return text[start:i-1].strip()
        
        return ""
    except Exception:
        return ""


def grade(sample, item):
    try:
        # Get model output - handle both dict and object access
        if isinstance(sample, dict):
            output_text = sample.get("output_text", "") or sample.get("output_json", "")
        else:
            output_text = getattr(sample, "output_text", "") or getattr(sample, "output_json", "")
        
        # Get reference answer
        if isinstance(item, dict):
            ref_raw = item.get("answer", "")
        else:
            ref_raw = getattr(item, "answer", "")
        
        # Handle None or empty values
        if not output_text:
            return 0.0
        if not ref_raw:
            return 0.0
            
        # Convert output_json to string if it's a dict/object
        if isinstance(output_text, dict):
            output_text = str(output_text)
        
        pred_raw = extract_model_answer(str(output_text))
        
        if not pred_raw:
            return 0.0

        pred = normalize(pred_raw)
        ref = normalize(str(ref_raw))

        return 1.0 if pred == ref else 0.0
    except Exception:
        # Always return 0.0 on any error to prevent job failure
        return 0.0
"""

grader = {
    "type": "python",
    "source": grading_function
}

print("Python RFT grader configured successfully")

Python RFT grader configured successfully


## 5.1. Test Grader Function (Optional)

Test the grader locally with a few hand-written cases to verify that it behaves as expected before submitting the RFT job.

In [5]:
exec(grading_function, globals())

test_cases = [
    {
        "name": "Simple numeric answer - CORRECT",
        "sample": {"output_text": "The final answer is \\boxed{42}"},
        "item": {"answer": "42"},
        "expected": 1.0
    },
    {
        "name": "Simple numeric answer - WRONG",
        "sample": {"output_text": "The final answer is \\boxed{99}"},
        "item": {"answer": "42"},
        "expected": 0.0
    },
    {
        "name": "Missing \\boxed{} - should FAIL even if answer is correct",
        "sample": {"output_text": "The answer is 42"},
        "item": {"answer": "42"},
        "expected": 0.0
    },
    {
        "name": "Multiple \\boxed{} - uses LAST one",
        "sample": {"output_text": "First: \\boxed{wrong}, Final: \\boxed{42}"},
        "item": {"answer": "42"},
        "expected": 1.0
    },
    {
        "name": "Comma-separated values - order doesn't matter",
        "sample": {"output_text": "Answer: \\boxed{3, 2, 1}"},
        "item": {"answer": "1, 2, 3"},
        "expected": 1.0
    },
    {
        "name": "Space-separated values - normalized",
        "sample": {"output_text": "Answer: \\boxed{9 6 4}"},
        "item": {"answer": "4, 6, 9"},
        "expected": 1.0
    },
    {
        "name": "LaTeX expression",
        "sample": {"output_text": "Solution: \\boxed{\\sqrt{2}}"},
        "item": {"answer": "\\sqrt{2}"},
        "expected": 1.0
    },
    {
        "name": "Empty \\boxed{}",
        "sample": {"output_text": "Answer: \\boxed{}"},
        "item": {"answer": "42"},
        "expected": 0.0
    }
]
all_passed = True

for test in test_cases:
    score = grade(test["sample"], test["item"])
    passed = (score == test["expected"])
    all_passed = all_passed and passed
    status = "PASS" if passed else "FAIL"
    print(f"{status} | {test['name']}")
    if not passed:
        print(f"       Expected: {test['expected']}, Got: {score}")
if all_passed:
    print("ALL TESTS PASSED")
else:
    print("SOME TESTS FAILED")

PASS | Simple numeric answer - CORRECT
PASS | Simple numeric answer - WRONG
PASS | Missing \boxed{} - should FAIL even if answer is correct
PASS | Multiple \boxed{} - uses LAST one
PASS | Comma-separated values - order doesn't matter
PASS | Space-separated values - normalized
PASS | LaTeX expression
PASS | Empty \boxed{}
ALL TESTS PASSED
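
One limitation worth knowing: the grader compares normalized strings, so numerically equal answers written differently (for example `0.5` and `1/2`, or `42.0` and `42`) score 0.0. The sketch below shows one possible way to compare single tokens numerically using the standard library's `fractions.Fraction`. It is not part of the notebook's grader, the helper names `to_number` and `numerically_equal` are hypothetical, and it falls back to string equality for anything that is not a plain number, such as LaTeX.

```python
from fractions import Fraction


def to_number(token: str):
    """Return the token as an exact Fraction, or None if it is not a plain number."""
    try:
        return Fraction(token.strip())
    except (ValueError, ZeroDivisionError):
        return None


def numerically_equal(pred: str, ref: str) -> bool:
    """Compare two answer tokens numerically when both parse, else as strings."""
    a, b = to_number(pred), to_number(ref)
    if a is None or b is None:
        return pred.strip() == ref.strip()
    return a == b


print(numerically_equal("0.5", "1/2"))
print(numerically_equal("42.0", "42"))
print(numerically_equal("\\sqrt{2}", "\\sqrt{2}"))
```

Whether this leniency is desirable depends on the dataset: if reference answers are expected in one exact form, the stricter string comparison may be the better reward signal.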


## 6. Upload Training Files

Upload the training and validation JSONL files to Microsoft Foundry. Each file is assigned a unique ID that will be referenced when creating the fine-tuning job.
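
Before uploading, it can help to check the files locally, since a malformed line is cheaper to catch here than after the service rejects the file. The sketch below assumes each JSONL line is an object with a non-empty `messages` list and a top-level `answer` field; the `answer` key is what the grader above reads from `item`, while the `messages` requirement is an assumption about the chat-style training format. The function name `validate_rft_jsonl` is hypothetical.

```python
import json


def validate_rft_jsonl(path, answer_key="answer"):
    """Return a list of (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            # Assumed format: a non-empty chat-style "messages" list.
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append((line_number, "missing or empty 'messages' list"))
            # The grader returns 0.0 when the reference answer is missing.
            if not record.get(answer_key):
                problems.append((line_number, f"missing '{answer_key}' field"))
    return problems
```

In this notebook it could be called as `validate_rft_jsonl(training_file_path)` and `validate_rft_jsonl(validation_file_path)` before the upload cell.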


In [6]:
print("Uploading training file...")
with open(training_file_path, "rb") as f:
    train_file = openai_client.files.create(file=f, purpose="fine-tune")

print("Uploading validation file...")
with open(validation_file_path, "rb") as f:
    validation_file = openai_client.files.create(file=f, purpose="fine-tune")

train_file_id = train_file.id
val_file_id = validation_file.id

print(f"Training file ID: {train_file_id}")
print(f"Validation file ID: {val_file_id}")

Uploading training file...
Uploading validation file...
Training file ID: file-ef920ef1469b4750971a7601ec5f9189
Validation file ID: file-2d11b1d6f7414ff996c9bc9c25382927


## 7. Wait for File Processing

Microsoft Foundry needs to process the uploaded files before they can be used for fine-tuning. This step ensures the files are validated and ready for the RFT job.

In [7]:
print("Waiting for files to be processed...")
openai_client.files.wait_for_processing(train_file_id)
openai_client.files.wait_for_processing(val_file_id)
print("Files ready!")

Waiting for files to be processed...
Files ready!


## 8. Create Reinforcement Fine-Tuning Job

Create a reinforcement fine-tuning job with your uploaded datasets and grader function. Configure hyperparameters to control the training process.

In [8]:
print("Creating Reinforcement Fine-Tuning job...")

fine_tuning_job = openai_client.fine_tuning.jobs.create(
    training_file=train_file_id,
    validation_file=val_file_id,
    model=model_name,
    method={
        "type": "reinforcement",
        "reinforcement": {
            "grader": grader,
            "hyperparameters": {
                "n_epochs": 2,
                "batch_size": 1,
                "learning_rate_multiplier": 1.0,
                "eval_interval": 5,
                "eval_samples": 2,
                "reasoning_effort": "medium"
            }
        }
    },
    extra_body={
        "trainingType": "Standard"
    },
    suffix="math-reasoning-rft"
)

print("Fine-tuning job created!")
print(f"Job ID: {fine_tuning_job.id}")
print(f"Status: {fine_tuning_job.status}")
print(f"Model: {fine_tuning_job.model}")

Creating Reinforcement Fine-Tuning job...
Fine-tuning job created!
Job ID: ftjob-927c273b843b483a9b4512204cd01c39
Status: pending
Model: o4-mini-2025-04-16


## 9. Monitor Training Progress

Track the status of your fine-tuning job. You can view the current status and recent training events. Training duration varies with dataset size, model, and hyperparameters, typically ranging from minutes to several hours.

In [9]:
job_status = openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id)
print(f"Status: {job_status.status}")

Status: pending
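
The cell above checks the status once. For a long-running job, a polling loop avoids re-running the cell by hand. The following is a minimal sketch, assuming the job reaches one of the terminal statuses `succeeded`, `failed`, or `cancelled`; the helper `wait_for_job` and its parameters are hypothetical, and the default poll interval and timeout are arbitrary choices.

```python
import time

# Statuses after which the job will not change any further (assumed set).
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60.0, timeout_seconds=6 * 60 * 60,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(job_id) until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        job = retrieve(job_id)
        print(f"Status: {job.status}")
        if job.status in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still '{job.status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_job(openai_client.fine_tuning.jobs.retrieve, fine_tuning_job.id)`. Passing the retrieve function in, rather than the client, keeps the helper easy to test with a stub.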


## 10. Retrieve Fine-Tuned Model

After the fine-tuning job succeeds, retrieve the fine-tuned model ID. This ID is required to deploy your customized model and make inference calls with it. If the job has not finished yet, the cell prints its current status instead.

In [18]:
completed_job = openai_client.fine_tuning.jobs.retrieve(fine_tuning_job.id)

if completed_job.status == "succeeded":
    fine_tuned_model = completed_job.fine_tuned_model
    print(f"Fine-tuned Model ID: {fine_tuned_model}")
else:
    print(f"Status: {completed_job.status}")

Status: pending


## 11. Deploy Fine-Tuned Model

Deploy the fine-tuned model to Azure OpenAI as a deployment endpoint. This step is required before making inference calls. The deployment uses the GlobalStandard SKU with a capacity of 200.

In [None]:
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient
from azure.mgmt.cognitiveservices.models import Deployment, DeploymentProperties, DeploymentModel, Sku

subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group = os.environ.get("AZURE_RESOURCE_GROUP")
account_name = os.environ.get("AZURE_AOAI_ACCOUNT")

deployment_name = "o4-mini-math-reasoning-rft"

with CognitiveServicesManagementClient(credential=credential, subscription_id=subscription_id) as cogsvc_client:
    deployment_model = DeploymentModel(format="OpenAI", name=fine_tuned_model, version="1")
    deployment_properties = DeploymentProperties(model=deployment_model)
    deployment_sku = Sku(name="GlobalStandard", capacity=200)
    deployment_config = Deployment(properties=deployment_properties, sku=deployment_sku)
    
    print(f"Deploying fine-tuned model: {fine_tuned_model}")
    deployment = cogsvc_client.deployments.begin_create_or_update(
        resource_group_name=resource_group,
        account_name=account_name,
        deployment_name=deployment_name,
        deployment=deployment_config,
    )
    
    print("Waiting for deployment to complete...")
    deployment.result()

print(f"Model deployment completed: {deployment_name}")

## 12. Test Fine-Tuned Model

Test your fine-tuned model by solving a mathematical reasoning problem.

In [None]:
test_problem = """Solve for x: 3x + 7 = 22"""

print("Testing fine-tuned model...")
print(f"Problem: {test_problem}")

response = openai_client.responses.create(
    model=deployment_name,
    input=[
        {"role": "system", "content": "You are a mathematical reasoning assistant. Solve problems step-by-step, showing your work clearly. Provide the final answer in \\boxed{} format."},
        {"role": "user", "content": test_problem}
    ]
)

print(f"Model Response: {response.output_text}")
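
A single problem gives only an anecdotal check. To get a rough accuracy figure, the same grading logic can be averaged over several held-out problems. The sketch below is one way to do that; the helper `mean_score` is hypothetical, and `toy_grade` is a stand-in used only to keep the example self-contained. In this notebook you would pass the `grade` function loaded in section 5.1 instead.

```python
def mean_score(pairs, grade_fn):
    """Average grade_fn(sample, item) over (output_text, reference_answer) pairs."""
    scores = [
        grade_fn({"output_text": output_text}, {"answer": reference})
        for output_text, reference in pairs
    ]
    return sum(scores) / len(scores) if scores else 0.0


def toy_grade(sample, item):
    """Stand-in grader: 1.0 if the output contains the boxed reference answer, else 0.0."""
    boxed = "\\boxed{" + item["answer"] + "}"
    return 1.0 if boxed in sample["output_text"] else 0.0


# Hypothetical model outputs paired with reference answers.
results = [
    ("3x = 15, so x = \\boxed{5}", "5"),
    ("x = \\boxed{6}", "5"),
]
print(f"Mean score: {mean_score(results, toy_grade):.2f}")
```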
