
[RFC][Drafting] LLM Fine-tuning Microservice #268

Closed
4 tasks
xwu99 opened this issue Jul 2, 2024 · 0 comments
xwu99 commented Jul 2, 2024

LLM Fine-tuning Microservice

Fine-tuning is the process of adapting a base language model to a specific task or domain by continuing its training on a smaller, task-specific dataset. This microservice provides OpenAI Fine-tuning API compatible interfaces for users to easily submit fine-tuning jobs in a consistent way. Based on Ray unified framework, it is a scalable solution for distributed LLM fine-tuning.

RFC Content

Author

xwu99

Status

Drafting, will change to Under Review when ready

Objective

This RFC introduces the OPEA LLM fine-tuning microservice design. The objective is to describe the overall architecture, workflow, and key design decisions.

Motivation

LLM serving is already an integral part of OPEA. Adding a fine-tuning microservice complements it by letting users customize their own models with their own datasets, adapting them to a specific task or domain.

Design Proposal

The fine-tuning microservice provides the following OpenAI-compatible features:

  • Create fine-tuning job
  • List fine-tuning jobs
  • List fine-tuning events
  • List fine-tuning checkpoints
  • Retrieve fine-tuning job
  • Cancel fine-tuning
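
Since these endpoints mirror the OpenAI fine-tuning API, a client can target the microservice with plain HTTP. The following sketch is illustrative only: the base URL, port, file name, and model name are assumptions, not values defined by this RFC.

```python
# Sketch of submitting a fine-tuning job to the microservice's
# OpenAI-compatible endpoint. BASE_URL is a hypothetical address;
# substitute wherever the service is actually deployed.
import json
import urllib.request

BASE_URL = "http://localhost:8015/v1"  # assumed service address

def create_fine_tuning_job_payload(training_file: str, model: str) -> dict:
    """Build an OpenAI-style request body for POST /fine_tuning/jobs."""
    return {"training_file": training_file, "model": model}

payload = create_fine_tuning_job_payload(
    "train.jsonl", "meta-llama/Llama-2-7b-hf"  # example inputs
)
req = urllib.request.Request(
    f"{BASE_URL}/fine_tuning/jobs",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would submit the job. The other features
# follow the same OpenAI URL pattern: GET /fine_tuning/jobs (list jobs),
# GET /fine_tuning/jobs/{id} (retrieve), GET /fine_tuning/jobs/{id}/events,
# GET /fine_tuning/jobs/{id}/checkpoints, POST /fine_tuning/jobs/{id}/cancel.
```

Because the interface is OpenAI-compatible, existing OpenAI client SDKs should also work by pointing their base URL at the microservice.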

The following figure shows the architecture of Fine-tuning microservice:

Architecture

In the Kubernetes cluster, the KubeRay operator fully manages the lifecycle of the "RayCluster" custom resource definition (CRD), including cluster creation and deletion, autoscaling, and fault tolerance. When the fine-tuning service processes a request and submits a Ray job to the RayCluster, the cluster can scale its worker nodes based on the job's resource requirements.
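
One way the service could hand an accepted request to the RayCluster is through Ray's job submission API. `JobSubmissionClient` is the real Ray Jobs client; the head-node address, entrypoint script name, and dependency list below are illustrative assumptions.

```python
# Sketch of translating an accepted fine-tuning request into a Ray job.
# The script name and pip dependencies are assumptions for illustration.

def build_submit_kwargs(config_path: str) -> dict:
    """Assemble keyword arguments for JobSubmissionClient.submit_job()."""
    return {
        "entrypoint": f"python finetune.py --config {config_path}",
        "runtime_env": {
            "working_dir": "./",
            "pip": ["transformers", "datasets"],  # assumed training deps
        },
    }

kwargs = build_submit_kwargs("job_cfg.yaml")
# With a running RayCluster, submission would look like:
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://raycluster-head-svc:8265")  # assumed address
#   job_id = client.submit_job(**kwargs)
# KubeRay's autoscaler then adds worker pods when the job's resource
# requests exceed current capacity, and removes them when they go idle.
```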

For deployment, given an existing Kubernetes cluster, take the following steps:

  • Deploy a KubeRay Operator using Helm Charts.
  • Build a custom LLM fine-tuning image and upload it to the cluster.
  • Create a Ray cluster with autoscaling enabled.
  • Start the LLM fine-tuning service.

Alternatives Considered

N/A

Compatibility

This microservice provides OpenAI-compatible fine-tuning interfaces. See the following documents for details:

Miscs

Task List:

@xwu99 xwu99 changed the title [RFC] LLM Fine-tuning Microservice [RFC][Drafting] LLM Fine-tuning Microservice Jul 2, 2024
@xwu99 xwu99 closed this as completed Jul 3, 2024