# Model Publishing Tutorial

This tutorial demonstrates how to publish a trained model to the Hugging Face Hub using the Continual Pretraining Framework. We'll cover the following topics:

1. Understanding the model publishing workflow
2. Setting up the publishing configuration
3. Converting FSDP checkpoints to HuggingFace format
4. Uploading models to the Hugging Face Hub
5. Validating the published model
6. Best practices for model publishing

This tutorial assumes you have already completed the CLM training tutorial and have a trained model checkpoint available.

## Understanding the Model Publishing Workflow

The publish module in the Continual Pretraining Framework converts trained model checkpoints, in particular those trained with FSDP (Fully Sharded Data Parallel), into the standard HuggingFace format so they can be shared with the community via the Hugging Face Hub.

The publishing workflow consists of two main steps:

1. **Format Conversion**: Converting the model checkpoint from the training format (e.g., FSDP) to a standard HuggingFace format.
2. **Model Upload**: Uploading the converted model and its tokenizer to the Hugging Face Hub.

The `PublishOrchestrator` class runs both steps in sequence, so publishing a model only requires a config file and a single command.

## Understanding the Configuration Parameters

Let's go through the key configuration parameters for model publishing:

### Format Conversion Parameters

- **format**: The format of the checkpoint to convert. Currently, only "fsdp" is supported.
- **base_model**: The name or path of the base model used for training. This is used to get the model architecture and configuration.
- **checkpoint_path**: The path to the checkpoint file to convert.

### Upload Parameters

- **host**: The host where the model will be published. Currently, only "huggingface" is supported.
- **repo_id**: The HuggingFace repository ID where the model will be published, in the format "username/model-name".
- **commit_message**: The commit message for the model upload.

### Advanced Parameters

- **max_shard_size**: The maximum size of each model shard when uploading to HuggingFace. Default is "5GB".
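- **safe_serialization**: Whether to use safe serialization when uploading the model. In Hugging Face libraries this option stores the weights in the safetensors format instead of pickle-based PyTorch files. Default is True.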
- **create_pr**: Whether to create a pull request instead of pushing directly to the repository. Default is False.
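The parameters above can be checked before a publish run is started. The following is a minimal sketch, not the framework's own validation: the `PublishSettings` class, its `problems` method, and the loose `repo_id` pattern are hypothetical names introduced here for illustration, and the Hub applies its own, stricter naming rules.

```python
import re
from dataclasses import dataclass

# Values the tutorial lists as currently supported.
SUPPORTED_FORMATS = {"fsdp"}
SUPPORTED_HOSTS = {"huggingface"}
# Loose "username/model-name" shape check (assumption, not the Hub's exact rule).
REPO_ID_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")


@dataclass
class PublishSettings:
    """Hypothetical container mirroring the `publish` section of the config."""

    format: str
    host: str
    base_model: str
    checkpoint_path: str
    repo_id: str
    commit_message: str
    max_shard_size: str = "5GB"
    safe_serialization: bool = True
    create_pr: bool = False

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means none were found."""
        found = []
        if self.format not in SUPPORTED_FORMATS:
            found.append(f"unsupported format: {self.format!r}")
        if self.host not in SUPPORTED_HOSTS:
            found.append(f"unsupported host: {self.host!r}")
        if not REPO_ID_PATTERN.match(self.repo_id):
            found.append(f"repo_id is not 'username/model-name': {self.repo_id!r}")
        return found


settings = PublishSettings(
    format="fsdp",
    host="huggingface",
    base_model="openai-community/gpt2",
    checkpoint_path="tutorials/output/epoch-001-final-ckpt.pth",
    repo_id="FabioDataGeek/quijote-gpt2-clm",
    commit_message="Add tutorial-trained GPT-2 checkpoint",
)
print(settings.problems())
```

With the tutorial's values the list of problems is empty; changing `format` to an unsupported value or dropping the slash from `repo_id` adds an entry to it.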

## Setting Up the Config File

First, let's import the necessary modules and set up our environment:

In [1]:
import yaml
from box import Box

In [2]:
with open("/workspace/tutorials/configs/publish_tutorial.yaml", "r") as f:
    publish_config = Box(yaml.safe_load(f), default_box=True)

print("Loaded publish config:")
print(publish_config)

Loaded publish config:
{'task': 'publish', 'experiment_name': 'tutorial_publish', 'verbose_level': 4, 'publish': {'format': 'fsdp', 'host': 'huggingface', 'base_model': 'openai-community/gpt2', 'checkpoint_path': 'tutorials/output/epoch-001-final-ckpt.pth', 'repo_id': 'FabioDataGeek/quijote-gpt2-clm', 'commit_message': 'Add tutorial-trained GPT-2 checkpoint', 'max_shard_size': '5GB', 'safe_serialization': True, 'create_pr': False}}


## Publish

In [None]:
import os

os.chdir("/workspace")  # or your project root if needed
os.environ["HF_TOKEN"] = "YOUR_HUGGINGFACE_TOKEN"  # "!export" would not persist to the next "!" command
!python src/main.py --config tutorials/configs/publish_tutorial.yaml

2025-06-16 16:39:48 - src.tasks.publish.orchestrator - DEBUG - Loaded tokenizer for base model: openai-community/gpt2
2025-06-16 16:39:48 - src.tasks.publish.orchestrator - INFO - Starting publish workflow
2025-06-16 16:39:53 - src.tasks.publish.format.fsdp - INFO - ✅ All parameters resolved after weight tying!
[INFO | src.tasks.publish.format.fsdp]: ✅ All parameters resolved after weight tying!
2025-06-16 16:39:53 - src.tasks.publish.format.fsdp - INFO - ✅ Final result: 149/149 parameters (100.0%)
[INFO | src.tasks.publish.format.fsdp]: ✅ Final result: 149/149 parameters (100.0%)
2025-06-16 16:39:53 - src.tasks.publish.format.fsdp - INFO - Skipping weight tying check in test environment
[INFO | src.tasks.publish.format.fsdp]: Skipping weight tying check in test environment
2025-06-16 16:39:53 - src.tasks.publish
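
One detail about passing `HF_TOKEN` to the publish command: in IPython each `!` line runs in its own subshell, so a variable exported on one `!` line is not visible to the next, whereas values set through `os.environ` in the kernel are inherited by child processes. The following minimal sketch illustrates that inheritance with a throwaway child process; the one-line child script is purely illustrative and is not part of the framework.

```python
import os
import subprocess
import sys

# Placeholder value; never hard-code a real token in a notebook you share.
os.environ["HF_TOKEN"] = "YOUR_HUGGINGFACE_TOKEN"

# The child process inherits the kernel's environment, so it can read the token.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('HF_TOKEN', 'missing'))"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

The child prints the placeholder value rather than `missing`, which is the behavior the publish command relies on when it is launched from the same kernel.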

## Best Practices for Model Publishing

Here are some best practices to follow when publishing models:

### Model Preparation

- **Clean Checkpoints**: Ensure your checkpoint is clean and contains only the necessary model weights.
- **Weight Tying**: Make sure weight tying is properly applied, especially for language models where the embedding and output layers often share weights.
- **Parameter Validation**: Verify that all parameters are loaded correctly by checking the percentage of loaded parameters.
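
The parameter check can be automated by reading the summary line from the publish log shown earlier. This is a rough sketch that assumes the log keeps the `Final result: N/M parameters` wording seen in that output; the `parameter_coverage` helper is a hypothetical name, and the pattern would need updating if the framework changes its log format.

```python
import re

# Matches the summary line emitted by the FSDP conversion step in the log above.
RESULT_PATTERN = re.compile(r"Final result: (\d+)/(\d+) parameters")


def parameter_coverage(log_text: str) -> tuple[int, int]:
    """Return (resolved, total) from the first matching summary line."""
    match = RESULT_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no 'Final result' line found in log output")
    return int(match.group(1)), int(match.group(2))


sample = "src.tasks.publish.format.fsdp - INFO - ✅ Final result: 149/149 parameters (100.0%)"
resolved, total = parameter_coverage(sample)
print(resolved == total)
```

A mismatch between the two numbers is a signal to inspect the checkpoint before uploading it.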

### Repository Setup

- **Clear Repository Name**: Choose a clear and descriptive repository name that reflects the model's purpose and architecture.
- **Model Card**: Create a comprehensive model card that describes the model, its training data, performance, limitations, and intended use cases.
- **License**: Include a clear license that specifies how the model can be used.
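
A model card on the Hub is a `README.md` file whose YAML front matter carries metadata such as the license and base model. The sketch below renders a bare-bones card as a string; `build_model_card` is a hypothetical helper, and the license identifier and summary text are placeholders to replace with the values that actually apply to your model.

```python
def build_model_card(repo_id: str, base_model: str, license_id: str, summary: str) -> str:
    """Render a minimal README.md with YAML front matter for the Hub."""
    model_name = repo_id.split("/", 1)[1]
    lines = [
        "---",
        f"license: {license_id}",
        f"base_model: {base_model}",
        "library_name: transformers",
        "pipeline_tag: text-generation",
        "---",
        "",
        f"# {model_name}",
        "",
        summary,
        "",
        "## Training data",
        "",
        "TODO: describe the corpus and preprocessing.",
        "",
        "## Limitations and intended use",
        "",
        "TODO: describe known limitations and intended use cases.",
    ]
    return "\n".join(lines) + "\n"


card = build_model_card(
    repo_id="FabioDataGeek/quijote-gpt2-clm",
    base_model="openai-community/gpt2",
    license_id="mit",  # placeholder: pick the license that applies to your model
    summary="Placeholder summary: GPT-2 checkpoint produced by the CLM training tutorial.",
)
print(card)
```

The TODO sections are left empty on purpose, since the training data, performance, and limitations have to come from your own run.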

### Upload Configuration

- **Shard Size**: For large models, use an appropriate shard size to avoid timeout issues during upload.
- **Safe Serialization**: Keep safe serialization enabled. In Hugging Face libraries this stores the weights as safetensors files, which can be loaded without executing pickled code.
- **Pull Requests**: Consider using pull requests for collaborative model development.
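
To get a feel for the shard size setting, the following sketch estimates how many shards a model would need for a given `max_shard_size`. It is an approximation under stated assumptions: `parse_size` and `estimate_shards` are hypothetical helpers, "GB" is treated as 10**9 bytes, and real sharding splits on tensor boundaries, so the actual file count can differ.

```python
import math

# Decimal units are assumed ("GB" = 10**9 bytes); binary units use the "iB" suffix.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "KIB": 2**10, "MIB": 2**20, "GIB": 2**30}


def parse_size(size: str) -> int:
    """Convert a string such as "5GB" into a number of bytes."""
    text = size.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized size: {size!r}")


def estimate_shards(num_params: int, bytes_per_param: int, max_shard_size: str = "5GB") -> int:
    """Estimate the shard count from the total weight size; always at least 1."""
    total_bytes = num_params * bytes_per_param
    return max(1, math.ceil(total_bytes / parse_size(max_shard_size)))


print(estimate_shards(124_000_000, 4))    # roughly GPT-2 small stored in float32
print(estimate_shards(7_000_000_000, 2))  # hypothetical 7B-parameter model in 16-bit
```

The first call prints 1 and the second prints 3, so the default "5GB" only starts to matter for models well beyond the size used in this tutorial.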

### Validation

- **Functional Testing**: Verify that the model can generate text correctly after uploading.
- **Performance Comparison**: Compare the performance of the uploaded model with the original checkpoint to ensure no degradation.
- **Integration Testing**: Test the model in the intended application context to ensure it meets requirements.
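
One simple way to run the performance comparison is to compute perplexity on the same held-out text with both the original checkpoint and the uploaded model, then check that the two values agree within a tolerance. The sketch below shows only the arithmetic: the per-token negative log-likelihood values are made-up inputs for illustration, not measured results, and `perplexity` and `same_quality` are hypothetical helper names.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def same_quality(nlls_a: list[float], nlls_b: list[float], rel_tol: float = 1e-3) -> bool:
    """True when both perplexities agree within the relative tolerance."""
    return math.isclose(perplexity(nlls_a), perplexity(nlls_b), rel_tol=rel_tol)


# Illustrative values only; in practice these come from evaluating each model.
original_nlls = [2.31, 1.87, 3.02, 2.45]
uploaded_nlls = [2.31, 1.87, 3.02, 2.45]
degraded_nlls = [2.90, 2.50, 3.60, 3.10]

print(same_quality(original_nlls, uploaded_nlls))
print(same_quality(original_nlls, degraded_nlls))
```

A correct conversion should give near-identical values, so a noticeable gap usually points to missing or mismatched weights rather than to numerical noise.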

## Integration with Tokenization and CLM Training using "El Quijote"

The publish module is designed to work with the tokenization and CLM training modules. Here's a typical workflow that chains tokenization, CLM training, and model publishing:

In [None]:
# Example workflow integrating CLM training and model publishing

import os
os.chdir("/workspace")


# 1. Run tokenization
print("Running tokenization...")
!python src/main.py --config tutorials/configs/tokenization_tutorial.yaml

# 2. Run CLM training
print("Running CLM training...")
!python src/main.py --config tutorials/configs/clm_training_tutorial.yaml

# 3. Publish the model
print("Publishing the model...")
!python src/main.py --config tutorials/configs/publish_tutorial.yaml
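
Outside a notebook, the same three stages can be chained with `subprocess` so that a failing stage stops the pipeline instead of letting later stages run on incomplete output. This is a minimal sketch that reuses the command line and config paths from the cell above; `build_command` and `run_pipeline` are hypothetical helpers, and the `runner` argument exists only to make the function easy to test.

```python
import subprocess
import sys

# Config paths taken from the workflow above, in execution order.
STAGE_CONFIGS = [
    "tutorials/configs/tokenization_tutorial.yaml",
    "tutorials/configs/clm_training_tutorial.yaml",
    "tutorials/configs/publish_tutorial.yaml",
]


def build_command(config_path: str) -> list[str]:
    """Build the same command the notebook runs with "!python"."""
    return [sys.executable, "src/main.py", "--config", config_path]


def run_pipeline(config_paths, runner=subprocess.run):
    """Run each stage in order; check=True raises CalledProcessError on a non-zero exit."""
    for config_path in config_paths:
        print(f"Running stage: {config_path}")
        runner(build_command(config_path), check=True)


# Uncomment to run from the project root:
# run_pipeline(STAGE_CONFIGS)
```

Because `check=True` raises on the first non-zero exit code, publishing is never attempted after a failed training run.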

## Conclusion

In this tutorial, we've covered the basics of model publishing using the Continual Pretraining Framework. We've learned how to:

1. Set up a publishing configuration
2. Convert FSDP checkpoints to HuggingFace format
3. Upload models to the Hugging Face Hub
4. Validate published models
5. Follow best practices for model publishing
6. Integrate model publishing with CLM training

The publish module provides a simple and efficient way to share your trained models with the community, making it easy to collaborate and build upon your work.

For more advanced usage, refer to the framework documentation and experiment with different configurations to find what works best for your specific use case.