# Model Publishing Tutorial

This tutorial demonstrates how to publish a trained model to the Hugging Face Hub using the Continual Pretraining Framework. We'll cover the following topics:

1. Understanding the model publishing workflow
2. Setting up the publishing configuration
3. Converting FSDP checkpoints to HuggingFace format
4. Uploading models to the Hugging Face Hub
5. Validating the published model
6. Best practices for model publishing

This tutorial assumes you have already completed the CLM training tutorial and have a trained model checkpoint available.

## 1. Understanding the Model Publishing Workflow

The publish module in the Continual Pretraining Framework converts trained model checkpoints, in particular those trained with FSDP (Fully Sharded Data Parallel), into a format that can be shared with the community through the Hugging Face Hub.

The publishing workflow consists of two main steps:

1. **Format Conversion**: Converting the model checkpoint from the training format (e.g., FSDP) to a standard HuggingFace format.
2. **Model Upload**: Uploading the converted model and its tokenizer to the Hugging Face Hub.

The `PublishOrchestrator` class orchestrates this entire workflow, making it easy to publish your models with minimal effort.
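
The two-step flow can be pictured with a minimal sketch. The names below (`ToyOrchestrator`, `fake_convert`, `fake_upload`) are hypothetical stand-ins for illustration and are not the framework's actual `PublishOrchestrator` API. The sketch only shows the pattern of converting first and handing the result to the upload step; nothing is uploaded.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyOrchestrator:
    """Hypothetical stand-in: runs a convert step, then an upload step."""
    convert: Callable[[str], dict]
    upload: Callable[[dict, str], str]
    log: List[str] = field(default_factory=list)

    def execute(self, checkpoint_path: str, repo_id: str) -> str:
        # Step 1: format conversion (training format -> shareable format)
        model = self.convert(checkpoint_path)
        self.log.append(f"converted {checkpoint_path}")
        # Step 2: upload the converted artifact to the target repository
        url = self.upload(model, repo_id)
        self.log.append(f"uploaded to {url}")
        return url


def fake_convert(path: str) -> dict:
    # Placeholder for the real conversion; returns a description only
    return {"source": path, "format": "huggingface"}


def fake_upload(model: dict, repo_id: str) -> str:
    # Placeholder for the real upload; no network access happens here
    return f"https://huggingface.co/{repo_id}"


orchestrator_demo = ToyOrchestrator(convert=fake_convert, upload=fake_upload)
demo_url = orchestrator_demo.execute(
    "sample_checkpoint/checkpoint.pt", "your-username/your-model-name"
)
print(demo_url)
print(orchestrator_demo.log)
```

Keeping the two steps behind separate callables makes it possible to test conversion without touching the Hub, which is the same separation the individual components later in this tutorial provide.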

## 2. Setting Up the Environment

First, let's import the necessary modules and set up our environment:

In [None]:
import os
import torch
import yaml
from box import Box
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import login

# Import the publish components
from src.tasks.publish import execute
from src.tasks.publish.orchestrator import PublishOrchestrator
from src.tasks.publish.format.fsdp import ConvertFSDPCheckpoint
from src.tasks.publish.upload.huggingface import UploadHuggingface
from src.config.config_loader import ConfigValidator

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device count: {torch.cuda.device_count()}")
    print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

## 3. Creating a Sample Checkpoint

For this tutorial, we'll create a small sample checkpoint to demonstrate the publishing process. In a real-world scenario, you would use a checkpoint from your trained model.

In [None]:
# Create a sample checkpoint directory
sample_checkpoint_dir = Path("sample_checkpoint")
sample_checkpoint_dir.mkdir(exist_ok=True)

# Function to create a sample FSDP checkpoint
def create_sample_checkpoint():
    # Load a small model for demonstration
    model_name = "gpt2"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    # Create a state dict that mimics an FSDP checkpoint structure
    fsdp_state_dict = {}
    
    # Convert regular state dict to FSDP-like format
    for key, value in model.state_dict().items():
        # Mimic FSDP key structure
        if key.startswith("transformer."):
            # Transform transformer.h.0.attn.c_attn.weight to model.model.layers.0.attn.c_attn.weight
            new_key = key.replace("transformer.", "model.model.")
            new_key = new_key.replace(".h.", ".layers.")
            fsdp_state_dict[new_key] = value
        elif key == "lm_head.weight":
            # lm_head keeps its name and only gains the outer "model." wrapper prefix
            fsdp_state_dict["model.lm_head.weight"] = value
        else:
            # Other parameters
            fsdp_state_dict[f"model.{key}"] = value
    
    # Create a checkpoint dictionary with model_state_dict
    checkpoint = {
        "model_state_dict": fsdp_state_dict,
        "step_count": 1000,
        "current_epoch": 1
    }
    
    # Save the checkpoint
    checkpoint_path = sample_checkpoint_dir / "checkpoint.pt"
    torch.save(checkpoint, checkpoint_path)
    
    return str(checkpoint_path)

# Create the sample checkpoint
checkpoint_path = create_sample_checkpoint()
print(f"Sample checkpoint created at: {checkpoint_path}")

# Load and verify the checkpoint
checkpoint = torch.load(checkpoint_path, map_location="cpu")
print(f"Checkpoint keys: {list(checkpoint.keys())}")
print(f"Number of parameters in checkpoint: {len(checkpoint['model_state_dict'])}")
print(f"Sample parameter key: {list(checkpoint['model_state_dict'].keys())[0]}")
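
The key renaming used for the sample checkpoint can also be looked at in isolation. The sketch below repeats the forward mapping from the cell above as a pure function and adds a hypothetical inverse (`from_fsdp_like`) that recovers the original GPT-2 key names. The inverse is an assumption for illustration; the framework's `ConvertFSDPCheckpoint` may map keys differently.

```python
def to_fsdp_like(key: str) -> str:
    """Forward mapping, mirroring the sample-checkpoint cell above."""
    if key.startswith("transformer."):
        return key.replace("transformer.", "model.model.").replace(".h.", ".layers.")
    return f"model.{key}"


def from_fsdp_like(key: str) -> str:
    """Hypothetical inverse mapping back to the original GPT-2 key."""
    if key.startswith("model.model."):
        inner = key[len("model.model."):]
        # The block list was renamed from "h" to "layers"; undo that rename
        if inner.startswith("layers."):
            inner = "h." + inner[len("layers."):]
        return "transformer." + inner
    if key.startswith("model."):
        # Keys outside the transformer only received the outer wrapper prefix
        return key[len("model."):]
    return key


sample_keys = [
    "transformer.h.0.attn.c_attn.weight",
    "transformer.wte.weight",
    "lm_head.weight",
]
for original in sample_keys:
    wrapped = to_fsdp_like(original)
    print(f"{original} -> {wrapped} -> {from_fsdp_like(wrapped)}")
```

A round trip that returns every key unchanged is a quick sanity check that no parameter name is lost or collides during conversion.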

## 4. Setting Up the Publishing Configuration

To publish a model, we need to create a configuration that specifies the model, checkpoint path, and publishing details. The Continual Pretraining Framework uses YAML configuration files, but we can also create a configuration programmatically:

In [None]:
# Create a configuration for model publishing
def create_publish_config(checkpoint_path, base_model="gpt2", repo_id="your-username/your-model-name"):
    """
    Create a configuration for model publishing.
    
    Args:
        checkpoint_path: Path to the model checkpoint
        base_model: Name or path of the base model used for training
        repo_id: HuggingFace repository ID where the model will be published
        
    Returns:
        Box: Configuration object
    """
    config = {
        "task": "publish",
        "experiment_name": "tutorial_publish",
        "verbose_level": 4,
        
        # Publish configuration
        "publish": {
            # Format conversion
            "format": "fsdp",
            "base_model": base_model,
            "checkpoint_path": checkpoint_path,
            
            # Upload configuration
            "host": "huggingface",
            "repo_id": repo_id,
            "commit_message": f"Add {base_model} model trained with Continual Pretraining Framework",
            
            # Advanced options
            "max_shard_size": "5GB",
            "safe_serialization": True,
            "create_pr": False
        }
    }
    
    # Convert to Box object for dot notation access
    return Box(config)

# Create the configuration
# Note: Replace 'your-username/your-model-name' with your actual HuggingFace username and desired model name
publish_config = create_publish_config(
    checkpoint_path=checkpoint_path,
    base_model="gpt2",
    repo_id="your-username/your-model-name"
)

# Print the configuration
print("Publish Configuration:")
for key, value in publish_config.items():
    if isinstance(value, dict):
        print(f"{key}:")
        for k, v in value.items():
            print(f"  {k}: {v}")
    else:
        print(f"{key}: {value}")

# Save the configuration to a YAML file
config_path = "publish_config.yaml"
with open(config_path, "w") as f:
    yaml.dump(publish_config.to_dict(), f)

print(f"\nConfiguration saved to {config_path}")

## 5. Understanding the Configuration Parameters

Let's go through the key configuration parameters for model publishing:

### Format Conversion Parameters

- **format**: The format of the checkpoint to convert. Currently, only "fsdp" is supported.
- **base_model**: The name or path of the base model used for training. This is used to get the model architecture and configuration.
- **checkpoint_path**: The path to the checkpoint file to convert.

### Upload Parameters

- **host**: The host where the model will be published. Currently, only "huggingface" is supported.
- **repo_id**: The HuggingFace repository ID where the model will be published, in the format "username/model-name".
- **commit_message**: The commit message for the model upload.

### Advanced Parameters

- **max_shard_size**: The maximum size of each model shard when uploading to HuggingFace. Default is "5GB".
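- **safe_serialization**: Whether to use safe serialization when uploading the model. In Hugging Face Transformers, this option stores the weights in the safetensors format instead of as pickled PyTorch files. Default is True.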
- **create_pr**: Whether to create a pull request instead of pushing directly to the repository. Default is False.
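
Because a typo in these parameters only surfaces when the publish task runs, it can help to check the configuration first. The following is a minimal sketch under stated assumptions: `check_publish_config` is a hypothetical helper (not the framework's `ConfigValidator`), the required keys are taken from the lists above, and the `repo_id` pattern is deliberately simple rather than the Hub's full naming rules.

```python
import re

# Keys described in the format conversion and upload parameter lists above
REQUIRED_PUBLISH_KEYS = ("format", "base_model", "checkpoint_path", "host", "repo_id")

# Simplified "username/model-name" check (an assumption, not the Hub's exact rules)
REPO_ID_PATTERN = re.compile(r"^[A-Za-z0-9][\w.\-]*/[A-Za-z0-9][\w.\-]*$")


def check_publish_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    publish = config.get("publish")
    if not isinstance(publish, dict):
        return ["missing 'publish' section"]

    problems = [f"missing publish.{k}" for k in REQUIRED_PUBLISH_KEYS if not publish.get(k)]

    # The tutorial states that only these values are currently supported
    if publish.get("format") and publish["format"] != "fsdp":
        problems.append(f"unsupported format: {publish['format']}")
    if publish.get("host") and publish["host"] != "huggingface":
        problems.append(f"unsupported host: {publish['host']}")

    repo_id = publish.get("repo_id")
    if repo_id and not REPO_ID_PATTERN.match(repo_id):
        problems.append(f"repo_id should look like 'username/model-name': {repo_id}")
    return problems


good_config = {
    "publish": {
        "format": "fsdp",
        "base_model": "gpt2",
        "checkpoint_path": "sample_checkpoint/checkpoint.pt",
        "host": "huggingface",
        "repo_id": "your-username/your-model-name",
    }
}
print(check_publish_config(good_config))
```

Returning a list of problems instead of raising on the first one lets all configuration mistakes be reported in a single pass.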

## 6. Converting FSDP Checkpoints

Let's convert our FSDP checkpoint to HuggingFace format using the `ConvertFSDPCheckpoint` class:

In [None]:
# Create an instance of the FSDP conversion handler
converter = ConvertFSDPCheckpoint(
    host=publish_config.publish.host,
    base_model=publish_config.publish.base_model,
    checkpoint_path=publish_config.publish.checkpoint_path
)

# Convert the checkpoint
try:
    model = converter.execute()
    print("Model successfully converted from FSDP checkpoint")
    print(f"Model type: {type(model).__name__}")
    print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")
except Exception as e:
    print(f"Error converting checkpoint: {str(e)}")
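    # For the tutorial only: fall back to the pretrained base model so later cells still run.
    # The fallback does not contain your trained weights, so do not publish it as your model.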
    print("Loading pretrained model as fallback...")
    model = AutoModelForCausalLM.from_pretrained(publish_config.publish.base_model)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(publish_config.publish.base_model)
print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")

## 7. Uploading Models to HuggingFace

Now that we have converted our model, let's upload it to the HuggingFace Hub. First, we need to log in to HuggingFace:

In [None]:
# Log in to HuggingFace
# Uncomment the following line and replace the placeholder with your access token.
# Avoid committing a real token to version control.
# login(token="your_huggingface_token")

# For the tutorial, we'll skip the actual upload to avoid modifying HuggingFace repositories
# In a real scenario, you would uncomment the following code and run it

'''
# Create an instance of the HuggingFace upload handler
uploader = UploadHuggingface(
    base_model=publish_config.publish.base_model,
    model=model,
    tokenizer=tokenizer,
    repo_id=publish_config.publish.repo_id
)

# Upload the model and tokenizer
try:
    uploader.execute(
        message=publish_config.publish.commit_message,
        max_shard_size=publish_config.publish.max_shard_size,
        safe_serialization=publish_config.publish.safe_serialization,
        create_pr=publish_config.publish.create_pr
    )
    print(f"Model and tokenizer successfully uploaded to {publish_config.publish.repo_id}")
    print(f"Model available at: https://huggingface.co/{publish_config.publish.repo_id}")
except Exception as e:
    print(f"Error uploading model: {str(e)}")
'''

print("To upload the model to HuggingFace:")
print("1. Uncomment the login line and replace with your HuggingFace token")
print("2. Uncomment the upload code block")
print("3. Run the cell")
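
Hard-coding a token in a notebook makes it easy to leak. One alternative, sketched below, is to read the token from an environment variable. The helper name `resolve_hf_token` is hypothetical; `HF_TOKEN` is the environment variable name that `huggingface_hub` conventionally reads. The `login` call is left commented out, in keeping with the rest of this tutorial.

```python
import os


def resolve_hf_token(env=None):
    """Return a non-empty token from the given mapping (default: os.environ), or None."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None


hf_token = resolve_hf_token()
if hf_token is None:
    print("HF_TOKEN is not set; skipping login.")
else:
    print("HF_TOKEN found; ready to log in.")
    # from huggingface_hub import login
    # login(token=hf_token)
```

Passing the mapping as an argument keeps the helper easy to test without modifying the real environment.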

## 8. Publishing a Model with PublishOrchestrator

Instead of using the individual components, we can use the `PublishOrchestrator` to handle the entire publishing workflow:

In [None]:
# Create an instance of the PublishOrchestrator
orchestrator = PublishOrchestrator(publish_config)

# For the tutorial, we'll skip the actual execution to avoid modifying HuggingFace repositories
# In a real scenario, you would uncomment the following code and run it

'''
# Execute the publish workflow
try:
    orchestrator.execute()
    print(f"Model successfully published to {publish_config.publish.repo_id}")
    print(f"Model available at: https://huggingface.co/{publish_config.publish.repo_id}")
except Exception as e:
    print(f"Error publishing model: {str(e)}")
'''

# Alternatively, you can use the execute function directly with the config file
'''
from src.tasks.publish import execute
execute(config_path)
'''

print("To publish the model using the orchestrator:")
print("1. Ensure you are logged in to HuggingFace")
print("2. Uncomment the execution code block")
print("3. Run the cell")

## 9. Validating the Published Model

After publishing a model, it's important to validate that it works correctly:

In [None]:
def validate_published_model(repo_id):
    """
    Validate a published model by loading it and running a test inference.
    
    Args:
        repo_id: HuggingFace repository ID of the published model
    """
    print(f"Validating model from {repo_id}...")
    
    try:
        # Load model and tokenizer from repo
        model = AutoModelForCausalLM.from_pretrained(repo_id)
        tokenizer = AutoTokenizer.from_pretrained(repo_id)
        
        # Print model information
        print(f"Model type: {type(model).__name__}")
        print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")
        
        # Test tokenization
        test_text = "Hello, world!"
        tokens = tokenizer(test_text, return_tensors="pt")
        print(f"Tokenized '{test_text}' to {tokens['input_ids'].shape[1]} tokens")
        
        # Test generation
        with torch.no_grad():
            outputs = model.generate(
                tokens["input_ids"],
                attention_mask=tokens["attention_mask"],
                max_length=20,
                num_return_sequences=1,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode and print the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Generated text: {generated_text}")
        
        print("✅ Model validation successful!")
        return True
    except Exception as e:
        print(f"❌ Model validation failed: {str(e)}")
        return False

# For the tutorial, we'll validate the base model instead of the published model
# In a real scenario, you would validate the published model
print("Validating the base model...")
validate_published_model(publish_config.publish.base_model)

# To validate the published model, uncomment the following line
# validate_published_model(publish_config.publish.repo_id)

## 10. Best Practices for Model Publishing

Here are some best practices to follow when publishing models:

### Model Preparation

- **Clean Checkpoints**: Ensure your checkpoint is clean and contains only the necessary model weights.
- **Weight Tying**: Make sure weight tying is properly applied, especially for language models where the embedding and output layers often share weights.
- **Parameter Validation**: Verify that all parameters are loaded correctly by checking the percentage of loaded parameters.
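
The parameter validation point can be made concrete with a small sketch. The helper `parameter_coverage` is hypothetical and works on key names only; it assumes you can list the parameter names the model expects (for example from `model.state_dict().keys()`) and the names present in the converted checkpoint.

```python
def parameter_coverage(expected_keys, loaded_keys):
    """Compare expected parameter names against those found in a checkpoint."""
    expected = set(expected_keys)
    loaded = set(loaded_keys)
    matched = expected & loaded
    percent = 100.0 * len(matched) / len(expected) if expected else 0.0
    return {
        "percent_loaded": percent,
        "missing": sorted(expected - loaded),      # expected but absent from the checkpoint
        "unexpected": sorted(loaded - expected),   # present but unknown to the model
    }


# Toy key names for illustration; real keys would come from the model and checkpoint
expected_demo = ["wte.weight", "h.0.attn.weight", "h.0.mlp.weight", "lm_head.weight"]
loaded_demo = ["wte.weight", "h.0.attn.weight", "h.0.mlp.weight", "extra.bias"]
coverage = parameter_coverage(expected_demo, loaded_demo)
print(coverage)
```

When weights are tied, an output-layer key may legitimately be absent from a checkpoint, so a missing key is a prompt to investigate rather than proof of a failed conversion.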

### Repository Setup

- **Clear Repository Name**: Choose a clear and descriptive repository name that reflects the model's purpose and architecture.
- **Model Card**: Create a comprehensive model card that describes the model, its training data, performance, limitations, and intended use cases.
- **License**: Include a clear license that specifies how the model can be used.

### Upload Configuration

- **Shard Size**: For large models, use an appropriate shard size to avoid timeout issues during upload.
- **Safe Serialization**: Enable safe serialization so the weights are stored in the safetensors format, which can be loaded without executing pickled code.
- **Pull Requests**: Consider using pull requests for collaborative model development.
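
To get a feel for how the shard size relates to model size, the sketch below estimates a lower bound on the number of shards. Both helpers (`parse_size`, `estimate_shards`) are hypothetical, and the unit table is an assumption (decimal "GB", binary "GiB"). Real sharding keeps each tensor whole, so the actual count can be higher than this estimate.

```python
import math

# Assumed unit table: decimal units for KB/MB/GB, binary for KiB/MiB/GiB
_UNITS = {"KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size: str) -> int:
    """Convert a string such as '5GB' into a number of bytes."""
    text = size.strip().upper()
    # Check three-letter units first so 'GIB' is not mistaken for 'B'-suffixed units
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * _UNITS[unit])
    raise ValueError(f"unrecognized size: {size!r}")


def estimate_shards(num_params: int, bytes_per_param: int, max_shard_size: str) -> int:
    """Lower-bound estimate of the shard count for a given parameter count."""
    total_bytes = num_params * bytes_per_param
    return max(1, math.ceil(total_bytes / parse_size(max_shard_size)))


# Roughly GPT-2 small in 32-bit floats, and a 7-billion-parameter model in 16-bit floats
small_model_shards = estimate_shards(124_000_000, 4, "5GB")
large_model_shards = estimate_shards(7_000_000_000, 2, "5GB")
print(small_model_shards, large_model_shards)
```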

### Validation

- **Functional Testing**: Verify that the model can generate text correctly after uploading.
- **Performance Comparison**: Compare the performance of the uploaded model with the original checkpoint to ensure no degradation.
- **Integration Testing**: Test the model in the intended application context to ensure it meets requirements.
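
One simple way to approach the performance comparison is to run the same input through the original and the uploaded model and compare the outputs numerically. The sketch below uses plain Python lists so it stays self-contained. The helper names and the tolerance of `1e-4` are assumptions; with real models you would compare logits tensors instead, for example with `torch.allclose`.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference, candidate, atol=1e-4):
    """True when every element differs by at most atol (an assumed tolerance)."""
    return max_abs_diff(reference, candidate) <= atol


# Toy values standing in for logits from the original and the uploaded model
original_logits = [0.10, 0.20, -1.50]
uploaded_logits = [0.10, 0.20001, -1.50]
print(outputs_match(original_logits, uploaded_logits))
```

A small tolerance is usually needed because changing dtype or serialization format can introduce tiny numerical differences that do not indicate a problem.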

## 11. Integration with CLM Training

The publish module is designed to work seamlessly with the CLM training module. Here's a typical workflow that integrates CLM training and model publishing:

In [None]:
# Example workflow integrating CLM training and model publishing

# 1. CLM training configuration
clm_training_config = {
    "task": "pretraining",
    "model_name": "gpt2",
    "dataset": {
        "source": "local",
        "nameOrPath": "tokenized_dataset"
    },
    "output_dir": "trained_model",
    # Other training parameters...
}

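# 2. Publish configuration using the trained model checkpoint
# Note: assigning this plain dict replaces the Box-based publish_config created earlier,
# so re-run the configuration cell before going back to earlier steps.
publish_config = {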
    "task": "publish",
    "publish": {
        "format": "fsdp",
        "base_model": "gpt2",
        "checkpoint_path": "trained_model/checkpoint.pt",  # Output from CLM training
        "host": "huggingface",
        "repo_id": "your-username/your-model-name"
    }
}

# In a real scenario, you would run:
'''
# 1. Execute CLM training
from src.tasks.clm_training import execute as train
train("path/to/clm_training_config.yaml")

# 2. Execute model publishing
from src.tasks.publish import execute as publish
publish("path/to/publish_config.yaml")
'''

print("This example shows how to integrate CLM training and model publishing tasks.")

## 12. Conclusion

In this tutorial, we've covered the basics of model publishing using the Continual Pretraining Framework. We've learned how to:

1. Set up a publishing configuration
2. Convert FSDP checkpoints to HuggingFace format
3. Upload models to the HuggingFace Hub
4. Validate published models
5. Follow best practices for model publishing
6. Integrate model publishing with CLM training

The publish module provides a simple and efficient way to share your trained models with the community, making it easy to collaborate and build upon your work.

For more advanced usage, refer to the framework documentation and experiment with different configurations to find what works best for your specific use case.