
[RFC][Drafting] LLM Fine-tuning Microservice #268

Closed
4 tasks
xwu99 opened this issue Jul 2, 2024 · 0 comments
xwu99 commented Jul 2, 2024

LLM Fine-tuning Microservice

Fine-tuning is the process of adapting a base language model to a specific task or domain by continuing its training on a smaller, task-specific dataset. This microservice provides OpenAI Fine-tuning API compatible interfaces for users to easily submit fine-tuning jobs in a consistent way. Based on Ray unified framework, it is a scalable solution for distributed LLM fine-tuning.

RFC Content

Author

xwu99

Status

Drafting, will change to Under Review when ready

Objective

This RFC introduces the OPEA LLM fine-tuning microservice design. The objective is to describe the overall architecture, workflow, and key design decisions.

Motivation

LLM serving is already an integral part of OPEA. Adding a fine-tuning microservice complements it by letting users customize their own models with their own datasets, adapting them to a specific task or domain.

Design Proposal

The fine-tuning microservice provides the following OpenAI-compatible features:

  • Create fine-tuning job
  • List fine-tuning jobs
  • List fine-tuning events
  • List fine-tuning checkpoints
  • Retrieve fine-tuning job
  • Cancel fine-tuning
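
Since these endpoints mirror the OpenAI fine-tuning API, a client can target the microservice with plain HTTP. The following sketch is illustrative only: the base URL, port, file name, and model name are assumptions, not values defined by this RFC.

```python
# Sketch of submitting a fine-tuning job to the microservice's
# OpenAI-compatible endpoint. BASE_URL is a hypothetical address;
# substitute wherever the service is actually deployed.
import json
import urllib.request

BASE_URL = "http://localhost:8015/v1"  # assumed service address

def create_fine_tuning_job_payload(training_file: str, model: str) -> dict:
    """Build an OpenAI-style request body for POST /fine_tuning/jobs."""
    return {"training_file": training_file, "model": model}

payload = create_fine_tuning_job_payload(
    "train.jsonl", "meta-llama/Llama-2-7b-hf"  # example inputs
)
req = urllib.request.Request(
    f"{BASE_URL}/fine_tuning/jobs",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would submit the job. The other features
# follow the same OpenAI URL pattern: GET /fine_tuning/jobs (list jobs),
# GET /fine_tuning/jobs/{id} (retrieve), GET /fine_tuning/jobs/{id}/events,
# GET /fine_tuning/jobs/{id}/checkpoints, POST /fine_tuning/jobs/{id}/cancel.
```

Because the interface is OpenAI-compatible, existing OpenAI client SDKs should also work by pointing their base URL at the microservice.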

The following figure shows the architecture of Fine-tuning microservice:

Architecture

In the Kubernetes cluster, the KubeRay operator fully manages the lifecycle of the "RayCluster" custom resource definition (CRD), including cluster creation and deletion, autoscaling, and fault tolerance. When the fine-tuning service processes a request and submits a Ray job to the RayCluster, the cluster can scale its worker nodes based on the job's resource requirements.
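
One way the service could hand an accepted request to the RayCluster is through Ray's job submission API. `JobSubmissionClient` is the real Ray Jobs client; the head-node address, entrypoint script name, and dependency list below are illustrative assumptions.

```python
# Sketch of translating an accepted fine-tuning request into a Ray job.
# The script name and pip dependencies are assumptions for illustration.

def build_submit_kwargs(config_path: str) -> dict:
    """Assemble keyword arguments for JobSubmissionClient.submit_job()."""
    return {
        "entrypoint": f"python finetune.py --config {config_path}",
        "runtime_env": {
            "working_dir": "./",
            "pip": ["transformers", "datasets"],  # assumed training deps
        },
    }

kwargs = build_submit_kwargs("job_cfg.yaml")
# With a running RayCluster, submission would look like:
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://raycluster-head-svc:8265")  # assumed address
#   job_id = client.submit_job(**kwargs)
# KubeRay's autoscaler then adds worker pods when the job's resource
# requests exceed current capacity, and removes them when they go idle.
```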

For deployment, given an existing Kubernetes cluster, take the following steps:

  • Deploy a KubeRay Operator using Helm Charts.
  • Build a custom LLM fine-tuning image and upload it to the cluster.
  • Create a Ray cluster with autoscaling enabled.
  • Start the LLM fine-tuning service.

Alternatives Considered

N/A

Compatibility

This microservice provides OpenAI-compatible fine-tuning interfaces. See the following documents for details:

Miscs

Task List:

@xwu99 xwu99 changed the title [RFC] LLM Fine-tuning Microservice [RFC][Drafting] LLM Fine-tuning Microservice Jul 2, 2024
@xwu99 xwu99 closed this as completed Jul 3, 2024