# Signature-Aware Model Serving from MLflow with Ray Serve

## Introduction

Deploying a model from a model registry is a particularly important stage of the model lifecycle because the end result is production-facing. At this stage, our model becomes a microservice, which means that we need to contend with all elements of service ownership, including:
* standardizing and enforcing API backwards compatibility;
* logging, metrics, and general observability concerns.

In this example, we will use the following minimal stack:
* MLflow for model registry;
* Ray Serve for model serving.

For demo purposes, we will use off-the-shelf open-source models from HuggingFace Hub. We will not use GPUs for inference because inference performance is orthogonal to our focus here.

In [None]:
!pip install -qU transformers mlflow-skinny ray[serve] torch

## Register the model

For demo purposes, we will use a simple text translation model, where the source and destination languages are configurable at registration time. This also means that different "versions" of the model can be registered to translate different languages, while the underlying model architecture and weights stay the same.

In [3]:
import mlflow
from transformers import pipeline

class MyTranslationModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.lang_from = context.model_config.get('lang_from', 'en')
        self.lang_to = context.model_config.get('lang_to', 'de')

        self.input_label: str = context.model_config.get('input_label', 'prompt')

        self.model_ref: str = context.model_config.get('hfhub_name', 'google-t5/t5-base')

        self.pipeline = pipeline(
            f"translation_{self.lang_from}_to_{self.lang_to}",
            self.model_ref
        )

    def predict(self, context, model_input, params=None):
        prompt = model_input[self.input_label].tolist()

        result = self.pipeline(prompt)

        return result



After we define our model, we need to register an actual version of it. This particular version will use Google's T5 Base model and is configured to translate from English to German.

In [None]:
import pandas as pd

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        'translation_model',
        registered_model_name='translation_model',
        python_model=MyTranslationModel(),
        pip_requirements=['transformers'],
        input_example=pd.DataFrame(
            {
                'prompt': ['Hello my name is Bin.']
            }
        ),
        model_config={
            'lang_from': 'en',
            'lang_to': 'de',
            'hfhub_name': 'google-t5/t5-base'
        }
    )

We can keep track of this exact version:

In [5]:
en_to_de_version: str = str(model_info.registered_model_version)

The registered model metadata contains useful information for us.

In [6]:
model_info.signature

inputs: 
  ['prompt': string (required)]
outputs: 
  ['translation_text': string (required)]
params: 
  None

The registered model version is associated with a strict **signature** that denotes the expected shape of its input and output.
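To make concrete what a strict signature implies for callers, here is a minimal sketch that checks a request payload against a signature. It assumes the signature has been flattened into a plain dict of column name to Python type; `SIGNATURE` and `validate_payload` are hypothetical names used for illustration and are not part of MLflow.

```python
# Hypothetical, simplified stand-in for the input schema of an MLflow signature.
SIGNATURE = {"prompt": str}


def validate_payload(payload: dict, signature: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    # Every column declared by the signature must be present with the right type.
    for name, expected_type in signature.items():
        if name not in payload:
            problems.append(f"missing required field '{name}'")
        elif not isinstance(payload[name], expected_type):
            problems.append(
                f"field '{name}' should be {expected_type.__name__}, "
                f"got {type(payload[name]).__name__}"
            )
    # Columns the signature does not declare are rejected as well.
    for name in payload:
        if name not in signature:
            problems.append(f"unexpected field '{name}'")
    return problems


print(validate_payload({"prompt": "Hello"}, SIGNATURE))
print(validate_payload({"input_string": "Hello"}, SIGNATURE))
```

The second call reports both a missing `prompt` field and an unexpected `input_string` field, which is the kind of mismatch a strict signature is meant to surface early.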

## Serve the model

Now that we have registered our model in MLflow, we can set up our serving scaffolding using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). In this example, we limit our "deployment" to the following behavior:
* Source the selected model and version from MLflow;
* Receive inference requests and return inference responses via a simple REST API.

In [7]:
import mlflow
import pandas as pd

from ray import serve
from fastapi import FastAPI


app = FastAPI()

@serve.deployment
@serve.ingress(app)
class ModelDeployment:
    def __init__(self, model_name: str = 'translation_model', default_version: str = '1'):
        self.model_name = model_name
        self.default_version = default_version

        self.model = mlflow.pyfunc.load_model(f"models:/{self.model_name}/{self.default_version}")


    @app.post('/serve')
    async def serve(self, input_string: str):
        return self.model.predict(pd.DataFrame({'prompt': [input_string]}))

In [8]:
deployment = ModelDeployment.bind(default_version=en_to_de_version)

Hard-coding `'prompt'` as the input label introduces hidden coupling between the registered model's signature and the deployment implementation.
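One way to loosen this coupling is to read the input column name from the signature instead of repeating it in the deployment. The sketch below assumes the signature has been serialized to a dict whose `inputs` entry is a JSON string listing the columns, which is the shape we would expect from `model_info.signature.to_dict()`; the `signature_dict` literal and both helper functions are illustrative and should be checked against your MLflow version.

```python
import json

# Illustrative literal, assumed to mirror the serialized signature of the model above.
signature_dict = {
    "inputs": '[{"type": "string", "name": "prompt", "required": true}]',
    "outputs": '[{"type": "string", "name": "translation_text", "required": true}]',
}


def input_column_names(signature: dict) -> list[str]:
    """Extract the input column names from a serialized signature."""
    return [column["name"] for column in json.loads(signature["inputs"])]


def build_record(signature: dict, values: list) -> dict:
    """Pair positional values with the input columns declared by the signature."""
    names = input_column_names(signature)
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} value(s), got {len(values)}")
    return dict(zip(names, values))


print(build_record(signature_dict, ["The weather is lovely today"]))
```

With a helper like this, renaming the model's input column would only require registering a new version, not editing the deployment code.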

Now we can run the deployment and play around with it:

In [9]:
serve.run(deployment, blocking=False)

2025-04-06 14:04:42,083 INFO worker.py:1843 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=1744) INFO 2025-04-06 14:04:51,995 proxy 172.28.0.12 -- Proxy starting on node de6ca8f11920304b8fb71a115545195c4f2a588bf373b575776d4828 (HTTP port: 8000).
(ProxyActor pid=1744) INFO 2025-04-06 14:04:52,071 proxy 172.28.0.12 -- Got updated endpoints: {}.
INFO 2025-04-06 14:04:52,175 serve 263 -- Started Serve in namespace "serve".
(ServeController pid=1745) INFO 2025-04-06 14:04:52,276 controller 1745 -- Deploying new version of Deployment(name='ModelDeployment', app='default') (initial target replicas: 1).
(ProxyActor pid=1744) INFO 2025-04-06 14:04:52,280 proxy 172.28.0.12 -- Got updated endpoints: {Deployment(name='ModelDeployment', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=1744) INFO 2025-04-06 14:04:52,298 proxy 172.28.0.12 -- Started <ray

DeploymentHandle(deployment='ModelDeployment')

In [10]:
import requests

response = requests.post(
    'http://127.0.0.1:8000/serve',
    params={
        'input_string': "The weather is lovely today"
    }
)
response.json()

[{'translation_text': 'Das Wetter ist heute nett.'}]

We can see that the REST API does not line up with the model signature. It uses the label `'input_string'` while our served model version itself uses the input label `'prompt'`.

## Multiple versions, one endpoint

We now have a basic endpoint set up for our model, but this deployment is strictly tethered to a single version of the model: specifically, version `1` of the registered `translation_model`.

Suppose we would like to refine this model and register a new version. With our current deployment implementation, we would need to set up a whole new endpoint for `translation_model/2`, which requires our users to remember which address and port correspond to which version of the model.

What if we could reuse the exact same endpoint (same signature, same address and port, same query conventions, etc.) to serve both versions of this model? Users could then simply specify which version of the model they would like to use, and we could treat one of them as the "default" for cases where users do not explicitly request one.
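The "default version" behavior described here boils down to a small piece of resolution logic. The following is a minimal sketch under the assumption that a missing version arrives as `None` or an empty string; `resolve_version` and `model_uri` are hypothetical helper names, not part of MLflow or Ray Serve.

```python
from typing import Optional


def resolve_version(requested: Optional[str], default_version: str) -> str:
    """Return the requested model version, or the default when none was given."""
    if requested is None or requested.strip() == "":
        return default_version
    return requested.strip()


def model_uri(model_name: str, version: str) -> str:
    """Build a registry URI in the `models:/<name>/<version>` form used in this notebook."""
    return f"models:/{model_name}/{version}"


# No version requested: fall back to the default.
print(model_uri("translation_model", resolve_version("", "1")))
# Explicit version requested: use it as-is.
print(model_uri("translation_model", resolve_version("2", "1")))
```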

This is where we apply a **Ray Serve** feature called **model multiplexing**, which allows us to load multiple "versions" of our model, hot-swap them dynamically as needed, and unload the versions that are no longer being used.

Before adding a new version of the model, we need to extend the model server with multiplexing support:

In [11]:
from ray import serve
from fastapi import FastAPI

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class MultiplexedModelDeployment:

    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, version: str):
        return mlflow.pyfunc.load_model(f"models:/{self.model_name}/{version}")

    def __init__(
            self,
            model_name: str = 'translation_model',
            default_version: str = en_to_de_version
    ):
        self.model_name = model_name
        self.default_version = default_version

    @app.post('/serve')
    async def serve(self, input_string: str):
        model = await self.get_model(serve.get_multiplexed_model_id())
        return model.predict(pd.DataFrame({'prompt': [input_string]}))

(ServeReplica:default:ModelDeployment pid=1924) INFO 2025-04-06 14:05:25,734 default_ModelDeployment a1niwgvi 8f765cca-0c27-4343-bee4-86893db9cae3 -- POST /serve 200 1681.4ms


In [12]:
multiplexed_deployment = MultiplexedModelDeployment.bind(model_name='translation_model')
serve.run(multiplexed_deployment, blocking=False)

INFO 2025-04-06 14:05:25,964 serve 263 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=1745) INFO 2025-04-06 14:05:26,041 controller 1745 -- Deploying new version of Deployment(name='MultiplexedModelDeployment', app='default') (initial target replicas: 1).
(ProxyActor pid=1744) INFO 2025-04-06 14:05:26,046 proxy 172.28.0.12 -- Got updated endpoints: {Deployment(name='MultiplexedModelDeployment', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ServeController pid=1745) INFO 2025-04-06 14:05:26,191 controller 1745 -- Removing 1 replica from Deployment(name='ModelDeployment', app='default').
(ServeController pid=1745) INFO 2025-04-06 14:05:26,207 controller 1745 -- Adding 1 replica to Deployment(name='MultiplexedModelDeployment', app='default').
(ServeController pid=1745) INFO 2025-04-06 14:05:28,676 controller 1745 -- Replica(id='a1niwgvi', deployment=

DeploymentHandle(deployment='MultiplexedModelDeployment')

Now we will register another version of the model to translate from English to French under the version `"2"`.

In [13]:
import pandas as pd

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        'translation_model',
        registered_model_name='translation_model',
        python_model=MyTranslationModel(),
        pip_requirements=['transformers'],
        input_example=pd.DataFrame(
            {
                'prompt': [
                    'Hello my name is Bin.'
                ]
            }
        ),
        model_config={
            'hfhub_name': 'google-t5/t5-base',
            'lang_from': 'en',
            'lang_to': 'fr'
        }
    )


en_to_fr_version: str = str(model_info.registered_model_version)
en_to_fr_version

2025/04/06 14:05:36 INFO mlflow.pyfunc: Inferring model signature from input example
Device set to use cpu
Device set to use cpu
Registered model 'translation_model' already exists. Creating a new version of this model...
Created version '2' of model 'translation_model'.


'2'

Now that this model is registered, we can query it via the model server:

In [14]:
import requests

response = requests.post(
    'http://127.0.0.1:8000/serve/',
    params={
        'input_string': 'The weather is lovely today'
    },
    headers={
        'serve_multiplexed_model_id': en_to_fr_version
    }
)
response.json()

(ServeReplica:default:MultiplexedModelDeployment pid=2114) INFO 2025-04-06 14:05:53,806 default_MultiplexedModelDeployment invjul99 aa1dfb07-e4e0-4a5a-92e5-440f0429a3dc -- POST /serve/ 307 3.8ms
(ServeReplica:default:MultiplexedModelDeployment pid=2114) INFO 2025-04-06 14:05:55,830 default_MultiplexedModelDeployment invjul99 c755eaed-a863-4464-ae64-ad7ab8159e47 -- Loading model '2'.
(ServeReplica:default:MultiplexedModelDeployment pid=2114) 2025-04-06 14:06:08.624975: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(ServeReplica:default:MultiplexedModelDeployment pid=2114) E0000 00:00:1743948368.658770    2186 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(ServeReplica:default:MultiplexedModelDeployment pid=2114) E0000 0

[{'translation_text': "Le temps est beau aujourd'hui"}]

We can see that we were able to access the new model version immediately, **without redeploying the model server**.

If version 2 is never requested, it never gets loaded. This conserves compute resources for the models that do get queried.

If the number of loaded models exceeds the configured maximum (`max_num_models_per_replica`), the *least recently used* model version gets evicted.
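The load-on-demand and eviction behavior can be illustrated with a toy cache built on the standard library. This is only a sketch of the least-recently-used policy described above, not Ray Serve's implementation; `LRUModelCache` and its `loader` argument are hypothetical.

```python
from collections import OrderedDict


class LRUModelCache:
    """Toy stand-in for per-replica multiplexing: load on demand, evict the least recently used."""

    def __init__(self, loader, max_models: int):
        self._loader = loader
        self._max_models = max_models
        self._models = OrderedDict()  # ordered from least to most recently used
        self.evicted = []

    def get(self, model_id: str):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        model = self._loader(model_id)  # cold start: load on first request
        self._models[model_id] = model
        if len(self._models) > self._max_models:
            oldest_id, _ = self._models.popitem(last=False)  # drop the LRU entry
            self.evicted.append(oldest_id)
        return model

    @property
    def loaded(self):
        return list(self._models)


cache = LRUModelCache(loader=lambda model_id: f"model-{model_id}", max_models=2)
for requested in ["1", "2", "1", "3"]:
    cache.get(requested)

print(cache.loaded)   # ['1', '3']
print(cache.evicted)  # ['2']
```

Version `2` is evicted rather than version `1` because version `1` was requested again just before version `3` arrived.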

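In this example, we set `max_num_models_per_replica=2` above, so the replica can hold both versions at once: requesting the "default" English-to-German version does not evict the English-to-French one. This replica has not served version 1 before, so the first request still pays a one-time load cost, after which the version stays loaded and readily available: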

In [15]:
response = requests.post(
    'http://127.0.0.1:8000/serve',
    params={
        'input_string': "The weather is lovely today"
    },
    headers={
        'serve_multiplexed_model_id': en_to_de_version
    }
)
response.json()

(ServeReplica:default:MultiplexedModelDeployment pid=2114) INFO 2025-04-06 14:06:23,226 default_MultiplexedModelDeployment invjul99 c755eaed-a863-4464-ae64-ad7ab8159e47 -- POST /serve 200 27402.6ms
(ServeReplica:default:MultiplexedModelDeployment pid=2114) INFO 2025-04-06 14:06:23,236 default_MultiplexedModelDeployment invjul99 ae6501ad-a38f-4648-a993-b7fe0c7ec56f -- Loading model '1'.
(ServeReplica:default:MultiplexedModelDeployment pid=2114) Device set to use cpu
(ServeReplica:default:MultiplexedModelDeployment pid=2114) INFO 2025-04-06 14:06:30,552 default_MultiplexedModelDeployment invjul99 ae6501ad-a38f-4648-a993-b7fe0c7ec56f -- Successfully loaded model '1' in 7315.2ms.


[{'translation_text': 'Das Wetter ist heute nett.'}]

## Auto-Signature

When defining a server, we need to define a separate signature for the API itself. That signature can drift from the signatures of the models behind it, which leads to inconsistent APIs across models.

In this example, what if we want to deploy another set of models that have nothing to do with language translation? The `/serve` API defined above, with its `input_string` parameter and translation-specific JSON response, would no longer make sense. What if the API signature for **MultiplexedModelDeployment** could automatically mirror the signature of the underlying models it serves?

With MLflow Model Registry metadata and Python dynamic-class-creation shenanigans, the above feature is possible.

We will infer the model server signature from the registered model. Since different versions of an MLflow model can have different signatures, we will use the "default version" to "pin" the signature; any attempt to multiplex a model version with an incompatible signature will raise an error.
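The pinning rule can be sketched independently of MLflow. In the example below, signatures are represented as plain nested dicts, which is an assumption made for illustration; `check_compatible` and `IncompatibleSignatureError` are hypothetical names.

```python
class IncompatibleSignatureError(ValueError):
    """Raised when a model version's signature differs from the pinned one."""


def check_compatible(pinned: dict, candidate: dict, *, version: str) -> None:
    """Raise if `candidate` does not exactly match the `pinned` signature."""
    if candidate != pinned:
        raise IncompatibleSignatureError(
            f"version {version} has signature {candidate}, expected {pinned}"
        )


# Hypothetical signatures, written as {"inputs": {...}, "outputs": {...}}.
pinned = {"inputs": {"prompt": "string"}, "outputs": {"translation_text": "string"}}
same = {"inputs": {"prompt": "string"}, "outputs": {"translation_text": "string"}}
renamed = {"inputs": {"text_to_translate": "string"}, "outputs": {"translation_text": "string"}}

check_compatible(pinned, same, version="2")  # passes silently
try:
    check_compatible(pinned, renamed, version="3")
except IncompatibleSignatureError as err:
    print(f"rejected: {err}")
```

Exact equality is the strictest possible rule; a looser policy (for example, allowing extra optional inputs) would be a different design choice.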

Since the request and response signatures of the endpoint are bound when the deployment class is defined, we will define that class inside a factory function, so that the signatures become a function of the specified model name and default model version.
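The idea of fixing a request type at the moment a class is created can be shown with the standard library alone. This sketch uses `dataclasses.make_dataclass` as a rough analogue of a dynamically created request model; `handler_from_fields` and `Handler` are hypothetical names and no serving framework is involved.

```python
from dataclasses import asdict, make_dataclass


def handler_from_fields(fields: dict):
    """Build a handler class whose request type is fixed when the class is created."""
    # Create the request type dynamically from (name, type) pairs.
    InputModel = make_dataclass("InputModel", list(fields.items()))

    class Handler:
        REQUEST_TYPE = InputModel

        def handle(self, payload: dict) -> dict:
            # Constructing the request raises TypeError on unknown or missing fields.
            request = self.REQUEST_TYPE(**payload)
            return asdict(request)

    return Handler


TranslationHandler = handler_from_fields({"prompt": str})
QuestionHandler = handler_from_fields({"question": str})

print(TranslationHandler().handle({"prompt": "Hello"}))
print(QuestionHandler().handle({"question": "Where do you live?"}))
```

Each call to the factory yields a distinct class with its own request type, which is why one factory can serve models with unrelated signatures.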

In [16]:
import mlflow
import pydantic

def schema_to_pydantic(schema: mlflow.types.schema.Schema, *, name: str) -> type[pydantic.BaseModel]:
    # Build a pydantic model class with one required field per column in the schema.
    return pydantic.create_model(
        name,
        **{
            k: (v.type.to_python(), ...)  # `...` marks the field as required
            for k, v in schema.input_dict().items()
        }
    )


def get_req_resp_signature(
        model_signature: mlflow.models.ModelSignature
) -> tuple[type[pydantic.BaseModel], type[pydantic.BaseModel]]:
    inputs: mlflow.types.schema.Schema = model_signature.inputs
    outputs: mlflow.types.schema.Schema = model_signature.outputs

    return (
        schema_to_pydantic(inputs, name='InputModel'),
        schema_to_pydantic(outputs, name='OutputModel')
    )

(ServeReplica:default:MultiplexedModelDeployment pid=2114) INFO 2025-04-06 14:06:32,063 default_MultiplexedModelDeployment invjul99 ae6501ad-a38f-4648-a993-b7fe0c7ec56f -- POST /serve 200 8830.4ms


In [22]:
import mlflow
from fastapi import FastAPI, Response, status
from ray import serve
from typing import List


def deployment_from_model_name(model_name: str, default_version: str = '1'):
    app = FastAPI()
    model_info = mlflow.models.get_model_info(f"models:/{model_name}/{default_version}")
    input_datamodel, output_datamodel = get_req_resp_signature(model_info.signature)


    @serve.deployment
    @serve.ingress(app)
    class DynamicallyDefinedDeployment:

        MODEL_NAME: str = model_name
        DEFAULT_VERSION: str = default_version

        @serve.multiplexed(max_num_models_per_replica=2)
        async def get_model(self, model_version: str):
            model = mlflow.pyfunc.load_model(f"models:/{self.MODEL_NAME}/{model_version}")

            if model.metadata.get_model_info().signature != model_info.signature:
                raise ValueError(
                    f"Requested version {model_version} has signature incompatible with that of the default version {self.DEFAULT_VERSION}"
                )
            return model


        @app.post('/serve', response_model=List[output_datamodel])
        async def serve(self, model_input: input_datamodel, response: Response):
            model_id = serve.get_multiplexed_model_id()
            if model_id == '':
                model_id = self.DEFAULT_VERSION

            try:
                model = await self.get_model(model_id)
            except ValueError:
                response.status_code = status.HTTP_409_CONFLICT
                return [{'translation_text': 'FAILED'}]

            return model.predict(model_input.dict())

    return DynamicallyDefinedDeployment

In [23]:
deployment = deployment_from_model_name('translation_model', default_version=en_to_fr_version)

In [24]:
serve.run(deployment.bind(), blocking=False)

INFO 2025-04-06 14:12:36,336 serve 263 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=1745) INFO 2025-04-06 14:12:36,402 controller 1745 -- Deploying new version of Deployment(name='DynamicallyDefinedDeployment', app='default') (initial target replicas: 1).
(ServeController pid=1745) INFO 2025-04-06 14:12:36,518 controller 1745 -- Stopping 1 replicas of Deployment(name='DynamicallyDefinedDeployment', app='default') with outdated versions.
(ServeController pid=1745) INFO 2025-04-06 14:12:36,519 controller 1745 -- Adding 1 replica to Deployment(name='DynamicallyDefinedDeployment', app='default').
(ServeReplica:default:DynamicallyDefinedDeployment pid=2434) INFO 2025-04-06 14:12:38,543 default_DynamicallyDefinedDeployment khidw42p -- Unloading model '2'.
(ServeReplica:default:DynamicallyDefinedDeployment pid=2434) INFO 2025-04-06 14:12:38,547 default_DynamicallyDefinedDeploym

DeploymentHandle(deployment='DynamicallyDefinedDeployment')

In [None]:
import requests

resp = requests.post(
    'http://127.0.0.1:8000/serve',
    json={'prompt': 'The weather is lovely today'},
)

assert resp.ok
assert resp.status_code == 200

resp.json()

In [26]:
resp.status_code

500

In [None]:
resp = requests.post(
    'http://127.0.0.1:8000/serve',
    json={'prompt': 'The weather is lovely today'},
    headers={
        'serve_multiplexed_model_id': str(en_to_fr_version)
    },
)

assert resp.ok
assert resp.status_code == 200

resp.json()

404

We can now confirm that the signature-check provision we put in place actually works. For this, we can register this same model with a slightly different signature.

In [None]:
import pandas as pd

with mlflow.start_run():
    incompatible_version = str(
        mlflow.pyfunc.log_model(
            'translation_model',
            registered_model_name='translation_model',
            python_model=MyTranslationModel(),
            pip_requirements=['transformers'],
            input_example=pd.DataFrame(
                {
                    'text_to_translate': [
                        'Hello my name is Bin.'
                    ]
                }
            ),
            model_config={
                'input_label': 'text_to_translate',
                'hfhub_name': 'google-t5/t5-base',
                'lang_from': 'en',
                'lang_to': 'de'
            }
        ).registered_model_version
    )

In [None]:
import requests

resp = requests.post(
    'http://127.0.0.1:8000/serve',
    json={'prompt': 'The weather is lovely today'},
    headers={
        'serve_multiplexed_model_id': incompatible_version
    }
)

assert not resp.ok
assert resp.status_code == 409

assert resp.json()[0]['translation_text'] == 'FAILED'

A good thing to do here would be to implement a response container that allows for an "error message" to be defined as part of the actual response, rather than "abusing" the `translation_text` field.
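One possible shape for such a response container is sketched below using standard-library dataclasses. The `ResponseEnvelope` name and its `predictions` and `error` fields are hypothetical; wiring it into the deployment (for example as the route's response model) is left out.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ResponseEnvelope:
    """Hypothetical response container: predictions and errors live in separate fields."""

    predictions: list = field(default_factory=list)
    error: Optional[str] = None

    @classmethod
    def success(cls, predictions: list) -> "ResponseEnvelope":
        return cls(predictions=predictions)

    @classmethod
    def failure(cls, message: str) -> "ResponseEnvelope":
        return cls(error=message)


ok = ResponseEnvelope.success([{"translation_text": "Das Wetter ist heute nett."}])
failed = ResponseEnvelope.failure("requested version has an incompatible signature")

print(asdict(ok))
print(asdict(failed))
```

Because the error lives in its own field, the container does not depend on any particular model's output columns and works equally well for non-translation models.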

We can also try registering an entirely different model - with an entirely different signature - and deploying that via `deployment_from_model_name()`. This will help us confirm that the entire signature is defined from the loaded model.

In [34]:
import mlflow
from transformers import pipeline


class QuestionAnswererModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.model_context = context.model_config.get(
            'model_context',
            'My name is Bin and I live in North Carolina.'
        )
        self.model_name = context.model_config.get(
            'model_name',
            'deepset/roberta-base-squad2'
        )
        self.tokenizer_name = context.model_config.get(
            'tokenizer_name',
            'deepset/roberta-base-squad2'
        )
        self.pipeline = pipeline(
            'question-answering',
            model=self.model_name,
            tokenizer=self.tokenizer_name
        )

    def predict(self, context, model_input, params=None):
        resp = self.pipeline(
            question=model_input['question'].tolist(),
            context=self.model_context
        )

        return resp if isinstance(resp, list) else [resp]



In [35]:
import pandas as pd

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        'question_answerer',
        registered_model_name='question_answerer',
        python_model=QuestionAnswererModel(),
        pip_requirements=['transformers'],
        input_example=pd.DataFrame(
            {
                'question': [
                    'Where do you live?',
                    'What is your name?'
                ]
            }
        ),
        model_config={
            'model_context': 'My name is Bin and I live in North Carolina.',
        }
    )

2025/04/06 14:23:55 INFO mlflow.pyfunc: Inferring model signature from input example
Device set to use cpu
Device set to use cpu
Registered model 'question_answerer' already exists. Creating a new version of this model...
Created version '2' of model 'question_answerer'.


In [36]:
model_info.signature

inputs: 
  ['question': string (required)]
outputs: 
  ['score': double (required), 'start': long (required), 'end': long (required), 'answer': string (required)]
params: 
  None

In [37]:
from ray import serve

serve.run(
    deployment_from_model_name(
        "question_answerer",
        default_version=str(model_info.registered_model_version),
    ).bind(),
    blocking=False,
)

INFO 2025-04-06 14:24:13,122 serve 263 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=1745) INFO 2025-04-06 14:24:13,133 controller 1745 -- Deploying new version of Deployment(name='DynamicallyDefinedDeployment', app='default') (initial target replicas: 1).
(ServeController pid=1745) INFO 2025-04-06 14:24:13,241 controller 1745 -- Stopping 1 replicas of Deployment(name='DynamicallyDefinedDeployment', app='default') with outdated versions.
(ServeController pid=1745) INFO 2025-04-06 14:24:13,242 controller 1745 -- Adding 1 replica to Deployment(name='DynamicallyDefinedDeployment', app='default').
(ServeReplica:default:DynamicallyDefinedDeployment pid=4008) INFO 2025-04-06 14:24:15,258 default_DynamicallyDefinedDeployment qy94ady8 -- Unloading model '2'.
(ServeReplica:default:DynamicallyDefinedDeployment pid=4008) INFO 2025-04-06 14:24:15,262 default_DynamicallyDefinedDeploym

DeploymentHandle(deployment='DynamicallyDefinedDeployment')

In [None]:
import requests

resp = requests.post(
    "http://127.0.0.1:8000/serve/",
    json={"question": "The weather is lovely today"},
)
resp.json()