# Understanding ChatCompletion to Model-Ready Format Transformation for SFT

This notebook explains how OpenAI's fine-tuning framework internally transforms ChatCompletion-style training data into model-ready format for Supervised Fine-Tuning (SFT), and how the loss is computed during training.

This addresses a common question about what happens "under the hood" when you provide training data in the ChatCompletion format for fine-tuning.

## Overview of the Transformation Process

When you provide training data in ChatCompletion format, the framework performs several transformation steps:

1. **Message Concatenation**: Converts the structured conversation into a continuous text sequence
2. **Special Token Insertion**: Adds role markers and message boundaries
3. **Tokenization**: Converts text to token IDs that the model can process
4. **Loss Mask Creation**: Determines which tokens contribute to the training loss
5. **Sequence Padding**: Pads sequences to a uniform length within each batch for efficient training
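
The padding step is not demonstrated elsewhere in this notebook. A minimal sketch of what it could look like, assuming right-padding and a hypothetical pad ID of 0 (the framework's actual batching strategy is not documented here):

```python
# Hypothetical values: PAD_ID and right-padding are assumptions for illustration.
PAD_ID = 0

def pad_batch(token_seqs, mask_seqs, pad_id=PAD_ID):
    """Right-pad token and loss-mask sequences to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in token_seqs)
    padded_tokens, padded_masks = [], []
    for tokens, mask in zip(token_seqs, mask_seqs):
        n_pad = max_len - len(tokens)
        padded_tokens.append(tokens + [pad_id] * n_pad)
        # Padding positions get mask 0 so they never contribute to the loss
        padded_masks.append(mask + [0] * n_pad)
    return padded_tokens, padded_masks

batch_tokens, batch_masks = pad_batch(
    [[11, 12, 13, 14], [21, 22]],
    [[0, 0, 1, 1], [0, 1]],
)
print(batch_tokens)  # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(batch_masks)   # [[0, 0, 1, 1], [0, 1, 0, 0]]
```

Extending the loss mask with zeros alongside the tokens keeps padding out of the loss without any extra bookkeeping.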

## Step 1: ChatCompletion Format Input

Your training data starts in this familiar format:

In [None]:
# Example training conversation
training_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant specialized in explaining technical concepts."
        },
        {
            "role": "user",
            "content": "What is gradient descent?"
        },
        {
            "role": "assistant",
            "content": "Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent."
        }
    ]
}

import json
print(json.dumps(training_example, indent=2))

## Step 2: Internal Format Transformation

The framework transforms this structured conversation into a linear sequence with special tokens that indicate role boundaries and message structure:

In [None]:
# Simplified representation of the internal transformation
# Note: Actual tokens and format may vary by model

def transform_to_model_format(messages):
    """
    Demonstrates how ChatCompletion messages are transformed into
    a continuous sequence for the model.
    """
    # Every role uses the same template (simplified representation):
    #   <|im_start|>{role}\n{content}<|im_end|>\n
    known_roles = {"system", "user", "assistant"}
    
    model_input = ""
    
    for message in messages:
        role = message["role"]
        content = message["content"]
        
        # Fail loudly instead of silently dropping a message with an unknown role
        if role not in known_roles:
            raise ValueError(f"Unsupported role: {role!r}")
        
        model_input += f"<|im_start|>{role}\n{content}<|im_end|>\n"
    
    return model_input

# Transform our example
model_ready_format = transform_to_model_format(training_example["messages"])
print("Model-ready format:")
print(model_ready_format)
print("\n" + "="*50)
print("This continuous sequence is what the model actually sees during training.")
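
A quick way to check that the linear format preserves the conversation is to parse it back into messages. The sketch below is an illustration only: `to_chatml` and `from_chatml` are hypothetical helper names, and the regular expression assumes the simplified template shown above.

```python
import re

# Matches one "<|im_start|>{role}\n{content}<|im_end|>\n" segment
CHATML_PATTERN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>\n", re.DOTALL)

def to_chatml(messages):
    """Apply the simplified role-marker template to every message."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def from_chatml(text):
    """Recover the list of messages from the linear sequence."""
    return [
        {"role": role, "content": content}
        for role, content in CHATML_PATTERN.findall(text)
    ]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Line one\nLine two"},
    {"role": "assistant", "content": "Two lines received."},
]
linear = to_chatml(messages)
assert from_chatml(linear) == messages  # lossless for well-formed content
print(linear)
```

The round trip breaks if a message's content itself contains the literal text `<|im_end|>`, which is one reason real tokenizers reserve dedicated token IDs for such markers instead of relying on plain text.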

## Step 3: Tokenization Process

The text sequence is then converted to numerical tokens that the model can process:

In [None]:
import tiktoken

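# Initialize tokenizer (using cl100k_base as example).
# Note: the public cl100k_base encoding does not register <|im_start|> / <|im_end|>
# as special tokens, so encode() splits them into ordinary text tokens here.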
encoding = tiktoken.get_encoding("cl100k_base")

def demonstrate_tokenization(text):
    """
    Shows how text is converted to tokens.
    """
    tokens = encoding.encode(text)
    
    print(f"Original text length: {len(text)} characters")
    print(f"Number of tokens: {len(tokens)}")
    print(f"\nFirst 20 tokens: {tokens[:20]}")
    
    # Decode back to show token boundaries
    print("\nToken boundaries (first 10 tokens):")
    for i, token_id in enumerate(tokens[:10]):
        token_text = encoding.decode([token_id])
        print(f"Token {i}: [{token_id}] = {token_text!r}")
    
    return tokens

# Tokenize the model-ready format
tokens = demonstrate_tokenization(model_ready_format)

## Step 4: Loss Computation Strategy

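A critical aspect of SFT is determining which tokens contribute to the training loss. In the usual setup, only the tokens the model is expected to generate (the assistant responses) are scored; system and user tokens serve as context and are masked out of the loss: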

In [None]:
def create_loss_mask(messages):
    """
    Demonstrates how the loss mask is created.
    In SFT, typically only the assistant's responses contribute to the loss.

    Simplified: each piece is tokenized on its own, so the mask may be off by
    a token or two compared with tokenizing the full sequence at once.
    """
    loss_mask = []
    
    # 1 = contribute to loss, 0 = don't contribute
    for message in messages:
        role = message["role"]
        # Count the role marker and the end marker too, so the mask covers
        # the same text that transform_to_model_format produces
        header_tokens = encoding.encode(f"<|im_start|>{role}\n")
        body_tokens = encoding.encode(message["content"] + "<|im_end|>\n")
        
        # In this sketch, role markers are treated as prompt (mask=0)
        loss_mask.extend([0] * len(header_tokens))
        # Assistant content and its end marker contribute; other roles don't
        weight = 1 if role == "assistant" else 0
        loss_mask.extend([weight] * len(body_tokens))
    
    return loss_mask

# Demonstrate loss masking
loss_mask = create_loss_mask(training_example["messages"])
print(f"Mask length: {len(loss_mask)} | positions with loss: {sum(loss_mask)}")
print("Loss Masking Strategy:")
print("="*50)
print("✓ Assistant responses: Contribute to loss (mask=1)")
print("✗ System messages: Don't contribute to loss (mask=0)")
print("✗ User messages: Don't contribute to loss (mask=0)")
print("\nThis ensures the model learns to generate appropriate assistant responses")
print("given the context of system instructions and user queries.")
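
One way to guarantee that the mask lines up with the token sequence is to build both in lockstep, one segment at a time. The sketch below uses a whitespace splitter (`toy_encode`, a stand-in rather than a real tokenizer) so that it runs without dependencies. It also handles multi-turn conversations and converts the mask into labels using -100, the default `ignore_index` of PyTorch's cross-entropy loss. Whether OpenAI's framework follows the same convention is not known.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy

def toy_encode(text):
    """Stand-in tokenizer: one 'token' per whitespace-separated piece."""
    return text.split()

def build_tokens_and_mask(messages, encode=toy_encode):
    """Return (tokens, mask) built segment by segment so lengths always match."""
    tokens, mask = [], []
    for message in messages:
        role = message["role"]
        header = encode(f"<|im_start|>{role}")
        body = encode(message["content"]) + encode("<|im_end|>")
        weight = 1 if role == "assistant" else 0
        tokens += header + body
        # Role markers are treated as prompt; only assistant bodies are scored
        mask += [0] * len(header) + [weight] * len(body)
    return tokens, mask

def mask_to_labels(token_ids, mask):
    """Replace the IDs of masked-out positions with IGNORE_INDEX."""
    return [tid if m == 1 else IGNORE_INDEX for tid, m in zip(token_ids, mask)]

conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello there"},
    {"role": "user", "content": "Bye"},
    {"role": "assistant", "content": "Goodbye"},
]
tokens, mask = build_tokens_and_mask(conversation)
for token, m in zip(tokens, mask):
    print(m, token)
```

Because every segment appends to `tokens` and `mask` together, the two lists cannot drift apart, and each assistant turn in a multi-turn conversation is scored independently.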

## Step 5: Training Process Visualization

Here's how the transformed data flows through the training process:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def visualize_training_flow():
    """
    Creates a visual representation of the training data flow.
    """
    fig, axes = plt.subplots(3, 1, figsize=(12, 10))
    
    # Simplified token sequence
    sequence_length = 50
    token_types = ['System'] * 10 + ['User'] * 15 + ['Assistant'] * 25
    
    # Color mapping
    colors = {'System': 'blue', 'User': 'green', 'Assistant': 'red'}
    color_sequence = [colors[t] for t in token_types]
    
    # Plot 1: Token sequence
    ax1 = axes[0]
    positions = np.arange(sequence_length)
    ax1.bar(positions, [1]*sequence_length, color=color_sequence, width=1.0)
    ax1.set_title('Token Sequence by Role', fontsize=14, fontweight='bold')
    ax1.set_ylabel('Token Presence')
    ax1.set_ylim(0, 1.5)
    ax1.legend([plt.Rectangle((0,0),1,1, fc=c) for c in colors.values()], 
               colors.keys(), loc='upper right')
    
    # Plot 2: Loss mask
    ax2 = axes[1]
    loss_mask = [0] * 10 + [0] * 15 + [1] * 25  # Only assistant tokens have loss
    ax2.bar(positions, loss_mask, color=['gray' if m == 0 else 'orange' for m in loss_mask], width=1.0)
    ax2.set_title('Loss Mask (Which Tokens Contribute to Training)', fontsize=14, fontweight='bold')
    ax2.set_ylabel('Loss Weight')
    ax2.set_ylim(0, 1.5)
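    # Build legend handles explicitly; a bare label list would only label the single bar container
    ax2.legend([plt.Rectangle((0,0),1,1, fc=c) for c in ('gray', 'orange')],
               ['No Loss', 'Has Loss'], loc='upper right')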
    
    # Plot 3: Per-token loss contribution (simulated values, for illustration only)
    ax3 = axes[2]
    rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible
    token_loss = np.array(loss_mask) * rng.uniform(0.5, 1.0, sequence_length)
    ax3.plot(positions, token_loss, 'r-', linewidth=2)
    ax3.fill_between(positions, 0, token_loss, alpha=0.3, color='red')
    ax3.set_title('Simulated Per-Token Loss Contribution', fontsize=14, fontweight='bold')
    ax3.set_ylabel('Loss Contribution')
    ax3.set_xlabel('Token Position')
    ax3.set_ylim(0, 1.5)
    
    plt.tight_layout()
    plt.savefig('sft_transformation_flow.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("\nKey Insights:")
    print("1. The entire conversation is processed as a continuous sequence")
    print("2. Only assistant tokens contribute to the loss calculation")
    print("3. Loss terms come only from assistant positions; gradients still reach context tokens through attention")
    print("4. The model learns to generate appropriate responses in context")

visualize_training_flow()

## Step 6: Loss Calculation Details

The actual loss computation uses cross-entropy loss on the masked tokens:

In [None]:
def explain_loss_calculation():
    """
    Explains how the loss is calculated during SFT.
    """
    print("Loss Calculation in SFT")
    print("="*50)
    print()
    print("1. Forward Pass:")
    print("   - Input: Full conversation sequence (system + user + assistant)")
    print("   - Model generates: Probability distribution over vocabulary for each token")
    print()
    print("2. Loss Computation:")
    print("   - For each token position:")
    print("     • If mask = 0 (system/user): Skip this token")
    print("     • If mask = 1 (assistant): Calculate cross-entropy loss")
    print("   - Formula: L = -Σ(mask_i * log(P(token_i|context))) / Σ(mask_i)")
    print()
    print("3. Gradient Calculation:")
    print("   - Only positions with mask = 1 produce loss terms, so only they drive the gradient")
    print("   - This focuses learning on generating good assistant responses")
    print()
    print("4. Parameter Update:")
    print("   - Model weights are updated to minimize the loss")
    print("   - Over many examples, the model learns the assistant's behavior pattern")
    
    # Simulate a simple loss calculation
    print("\n" + "="*50)
    print("Example Loss Calculation:")
    print("="*50)
    
    # Simulated probabilities for assistant tokens
    assistant_tokens = ["Gradient", "descent", "is", "an", "optimization", "algorithm"]
    predicted_probs = [0.95, 0.88, 0.99, 0.97, 0.82, 0.91]
    
    losses = [-np.log(p) for p in predicted_probs]
    
    for i, (token, prob, loss) in enumerate(zip(assistant_tokens, predicted_probs, losses)):
        print(f"Token {i+1}: '{token}'")
        print(f"  Predicted probability: {prob:.3f}")
        print(f"  Cross-entropy loss: {loss:.3f}")
    
    avg_loss = np.mean(losses)
    print(f"\nAverage loss for this example: {avg_loss:.3f}")
    print("\nLower loss = model is more confident in generating the correct tokens")

explain_loss_calculation()
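
The formula above leaves one detail implicit: the prediction made at position i is scored against the token at position i + 1, so logits and targets are offset by one before the mask is applied. A dependency-free sketch with made-up logits over a hypothetical 3-token vocabulary:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]

def masked_next_token_loss(logits, token_ids, mask):
    """Mean cross-entropy over positions whose *target* token has mask == 1.

    logits[i] is the model's prediction for token_ids[i + 1], so the last
    row of logits and the first token are never scored.
    """
    total, count = 0.0, 0
    for i in range(len(token_ids) - 1):
        target = token_ids[i + 1]
        if mask[i + 1] == 1:
            total += -log_softmax(logits[i])[target]
            count += 1
    return total / count if count else 0.0

# Made-up example: 4 tokens, vocabulary of size 3, last two tokens are assistant tokens
token_ids = [0, 1, 2, 1]
mask = [0, 0, 1, 1]
logits = [
    [0.0, 0.0, 0.0],  # predicts token_ids[1] (masked out)
    [0.0, 0.0, 2.0],  # predicts token_ids[2] = 2 (scored)
    [0.0, 3.0, 0.0],  # predicts token_ids[3] = 1 (scored)
    [0.0, 0.0, 0.0],  # no next token to predict
]
loss = masked_next_token_loss(logits, token_ids, mask)
print(f"Masked loss: {loss:.4f}")
```

Changing the first row of `logits` leaves the result untouched, because its target is masked out. This is the sense in which system and user tokens do not contribute to the loss.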

## Practical Implications for Fine-Tuning

Understanding this transformation process helps explain several important aspects of fine-tuning:

In [None]:
def summarize_implications():
    """
    Summarizes the practical implications of the transformation process.
    """
    implications = [
        {
            "aspect": "Why Assistant Messages Matter Most",
            "explanation": "Only assistant tokens contribute to loss, so the quality and accuracy of assistant responses in your training data directly impacts model performance."
        },
        {
            "aspect": "Token Limits",
            "explanation": "The per-example token limit depends on the model being fine-tuned (4096 for some models). It applies to the entire conversation after transformation, including special tokens for role markers."
        },
        {
            "aspect": "Context Learning",
            "explanation": "The model learns to generate responses conditioned on the full context (system + user messages), even though only assistant tokens contribute to loss."
        },
        {
            "aspect": "Format Consistency",
            "explanation": "Maintaining consistent formatting in your training data helps the model learn the expected structure and improves generation quality."
        },
        {
            "aspect": "Multi-turn Conversations",
            "explanation": "Including multi-turn examples helps the model learn to maintain context across multiple exchanges."
        }
    ]
    
    print("Practical Implications for Your Fine-Tuning")
    print("="*60)
    
    for i, item in enumerate(implications, 1):
        print(f"\n{i}. {item['aspect']}")
        print(f"   {item['explanation']}")
    
    print("\n" + "="*60)
    print("\nBest Practices Based on This Understanding:")
    print("• Ensure high-quality assistant responses in your training data")
    print("• Include diverse examples that cover your use case")
    print("• Keep conversations within token limits to avoid truncation")
    print("• Use consistent formatting across all training examples")
    print("• Test with examples similar to your training format")

summarize_implications()
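
Several of these practices can be checked mechanically before uploading data. The sketch below is one possible pre-flight check. `validate_example`, the `max_tokens` default, the per-message overhead, and the whitespace-based `count_tokens` are all illustrative assumptions, not OpenAI's actual validation rules. For realistic budgets, swap in a tokenizer-based counter.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def validate_example(example, max_tokens=4096, per_message_overhead=4):
    """Return a list of human-readable problems found in one training example."""
    problems = []
    messages = example.get("messages", [])
    if not messages:
        return ["example has no messages"]
    total = 0
    for i, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
            continue
        # Assumed allowance for role markers and message boundaries
        total += count_tokens(content) + per_message_overhead
    if not any(m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message, so nothing would contribute to the loss")
    if total > max_tokens:
        problems.append(f"estimated {total} tokens exceeds limit of {max_tokens}")
    return problems

good = {"messages": [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}]}
bad = {"messages": [{"role": "user", "content": "Hi"},
                    {"role": "bot", "content": ""}]}
print(validate_example(good))
for problem in validate_example(bad):
    print("-", problem)
```

Returning a list of problems rather than raising on the first one makes it easy to run the check over a whole JSONL file and report everything at once.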

## Complete Example: From ChatCompletion to Training

Let's walk through a complete example showing the entire transformation pipeline:

In [None]:
def complete_transformation_example():
    """
    Demonstrates the complete transformation pipeline from ChatCompletion
    format to model training.
    """
    # Step 1: Original ChatCompletion format
    original_data = {
        "messages": [
            {"role": "system", "content": "You are a Python expert."},
            {"role": "user", "content": "How do I read a file in Python?"},
            {"role": "assistant", "content": "You can read a file using the open() function with a context manager:\n\nwith open('file.txt', 'r') as f:\n    content = f.read()"}
        ]
    }
    
    print("STEP 1: Original ChatCompletion Format")
    print("="*50)
    print(json.dumps(original_data, indent=2))
    
    # Step 2: Transform to linear sequence
    print("\n\nSTEP 2: Transformed to Linear Sequence")
    print("="*50)
    linear_sequence = transform_to_model_format(original_data["messages"])
    print((linear_sequence[:200] + "...") if len(linear_sequence) > 200 else linear_sequence)
    
    # Step 3: Tokenization
    print("\n\nSTEP 3: Tokenization")
    print("="*50)
    tokens = encoding.encode(linear_sequence)
    print(f"Total tokens: {len(tokens)}")
    print(f"First 30 tokens: {tokens[:30]}")
    
    # Step 4: Loss mask creation
    print("\n\nSTEP 4: Loss Mask Creation")
    print("="*50)
    
    # Simplified: find where the assistant's content starts (single-turn only).
    # The role header itself is treated as part of the prompt.
    assistant_header = "<|im_start|>assistant\n"
    header_pos = linear_sequence.find(assistant_header)
    
    if header_pos != -1:
        # Everything up to and including the header gets mask=0
        content_start = header_pos + len(assistant_header)
        pre_assistant_tokens = encoding.encode(linear_sequence[:content_start])
        mask = [0] * len(pre_assistant_tokens) + [1] * (len(tokens) - len(pre_assistant_tokens))
        
        print(f"Tokens before assistant response: {len(pre_assistant_tokens)} (mask=0)")
        print(f"Assistant response tokens: {len(tokens) - len(pre_assistant_tokens)} (mask=1)")
        print(f"\nOnly the {len(tokens) - len(pre_assistant_tokens)} assistant tokens will contribute to the loss.")
    
    # Step 5: Training implications
    print("\n\nSTEP 5: Training Process")
    print("="*50)
    print("During training:")
    print("1. The model processes the entire sequence")
    print("2. It predicts the next token at each position")
    print("3. Loss is calculated only for assistant tokens")
    print("4. The model learns to generate appropriate Python help responses")
    print("5. After many examples, it generalizes to new Python questions")

complete_transformation_example()

## Conclusion

This notebook has demonstrated how OpenAI's fine-tuning framework transforms ChatCompletion format training data into model-ready format for Supervised Fine-Tuning (SFT). Key takeaways:

1. **Structured to Sequential**: The conversation structure is converted to a linear sequence with special tokens
2. **Selective Loss Computation**: Only assistant responses contribute to the training loss
3. **Context-Aware Learning**: The model learns to generate responses based on the full conversation context
4. **Focused Training**: Loss masking concentrates the training signal on the assistant behavior you want the model to learn

Understanding this process helps you:
- Design better training data
- Understand why certain formatting matters
- Debug fine-tuning issues more effectively
- Optimize your training examples for better results

For more information on fine-tuning best practices, refer to the [OpenAI Fine-tuning Guide](https://platform.openai.com/docs/guides/fine-tuning).