
%md
# Intfloat/multilingual-e5-large-instruct Model Deployment

## Prerequisites
- Databricks workspace with appropriate permissions
- Access to Hugging Face models
- Sufficient GPU resources for model deployment

## Cluster Configuration
- Runtime: 16.3 ML (includes Apache Spark 3.5.2, GPU, Scala 2.12)
- Node Type: Standard_NC40ads_H100_v5 [H100] (beta)
- 320 GB Memory
- 1 GPU
- 40 Cores

## Install Required Dependencies

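Install the required packages:
- OpenAI client
- vLLM (pinned to 0.8.5.post1) for efficient model serving
- MLflow (pinned to 2.19.0) and the `mlflow-extensions` branch with vLLM embeddings support
- Transformers (installed from the GitHub source), `accelerate`, and `qwen-vl-utils`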

In [0]:
PIP_REQUIREMENTS = (
    "openai vllm==0.8.5.post1 optree "
    "git+https://github.com/huggingface/transformers accelerate  "
    "mlflow==2.19.0 "
    "git+https://github.com/stikkireddy/mlflow-extensions.git@vllm-embeddings-support "
    "qwen-vl-utils "
)

%pip install {PIP_REQUIREMENTS}
dbutils.library.restartPython()
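
After the restart, it can help to confirm that the pinned versions were actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the `check_pins` helper, its `lookup` parameter, and the `EXPECTED_PINS` dictionary are hypothetical names introduced for illustration and are not part of `mlflow-extensions`.

```python
from importlib import metadata

# Pins copied from PIP_REQUIREMENTS above (hypothetical helper dictionary).
EXPECTED_PINS = {"vllm": "0.8.5.post1", "mlflow": "2.19.0"}


def check_pins(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            found = lookup(package)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# An empty dictionary means every pinned package matches.
print(check_pins(EXPECTED_PINS) or "all pins match")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed; in the notebook the default is sufficient.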

## Configuration

Set up the necessary configuration parameters for model deployment:
- Catalog and schema for model registration
- Model and endpoint names
- Environment variables for VLLM

In [0]:
# Configuration parameters
CATALOG = "..."
SCHEMA = "..."
MODEL_NAME = "..."
ENDPOINT_NAME = "..."

# Optional vLLM environment variables (commented out; enable them if your setup needs them)
import os
# os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
# os.environ['VLLM_USE_V1'] = "0"
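
Because the values above are placeholders, a small check before the model download can catch an unfilled `"..."` early. This is a sketch only: `qualified_model_name` is a hypothetical helper, and the rule that a name part must not contain a period is a simplifying assumption about Unity Catalog's three-level `catalog.schema.model` naming.

```python
def qualified_model_name(catalog: str, schema: str, model: str) -> str:
    """Build catalog.schema.model, rejecting empty or placeholder parts."""
    parts = {"CATALOG": catalog, "SCHEMA": schema, "MODEL_NAME": model}
    for label, value in parts.items():
        if not value or value.strip() in {"", "..."}:
            raise ValueError(f"{label} is still a placeholder: {value!r}")
        if "." in value:
            # A period inside one part would change the number of name levels.
            raise ValueError(f"{label} must be a single name part, got {value!r}")
    return ".".join(parts.values())
```

In the notebook this would be called as `qualified_model_name(CATALOG, SCHEMA, MODEL_NAME)` once the placeholders are filled in.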

In [0]:
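# Re-define the requirements: restartPython() cleared the notebook state,
# and PIP_REQUIREMENTS is needed again for the deployment config below
PIP_REQUIREMENTS = (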
    "openai vllm==0.8.5.post1 optree "
    "git+https://github.com/huggingface/transformers accelerate  "
    "mlflow==2.19.0 "
    "git+https://github.com/stikkireddy/mlflow-extensions.git@vllm-embeddings-support "
    "qwen-vl-utils "
)

In [0]:
from huggingface_hub import login
from mlflow_extensions.serving.engines import VLLMEngineProcess
from mlflow_extensions.serving.engines.vllm_engine import VLLMEngineConfig
from mlflow_extensions.databricks.deploy.ez_deploy import EzDeployConfig, ServingConfig, EzDeployVllmOpenCompat

# Uncomment to authenticate with Hugging Face; login() prompts for your access token
# login()

In [0]:
# Initialize the deployer with VLLM OpenAI compatibility layer
deployer = EzDeployVllmOpenCompat(
    config=EzDeployConfig(
        # Model name/path on Hugging Face
        name="intfloat/multilingual-e5-large-instruct",
        # Serve with the vLLM engine process
        engine_proc=VLLMEngineProcess,
        engine_config=VLLMEngineConfig(
            # Model identifier on Hugging Face
            model="intfloat/multilingual-e5-large-instruct",
            # Maximum input sequence length in tokens
            max_model_len=512,
            # vLLM command-line flags
            vllm_command_flags={
                # Target fraction of GPU memory vLLM may use (95%)
                "--gpu-memory-utilization": 0.95,
                # Run the model as an embedding model
                "--task": "embedding",
            },
        ),
        serving_config=ServingConfig(
            # Minimum memory required for model serving (in GB);
            # covers model weights, KV cache, overhead, and intermediate states
            minimum_memory_in_gb=60,
        ),
        # Reuse the pip requirements defined earlier; split() with no
        # argument avoids empty entries from repeated or trailing spaces
        pip_config_override=PIP_REQUIREMENTS.split(),
    ),
    # Register the model under its fully qualified Unity Catalog name
    registered_model_name=f"{CATALOG}.{SCHEMA}.{MODEL_NAME}",
)

## Model Registration and Deployment

Download and register the model in Unity Catalog.

In [0]:
# Download and register the model
deployer.artifacts = deployer._config.download_artifacts(local_dir="/tmp/")  # local_dir can also be a volume path
deployer._downloaded = True

In [0]:
deployer.register()  # An error here can be ignored on serverless compute, where this step fails because no GPUs are available

# Below is the code to deploy the endpoint to model serving

## Model Deployment to Serving Endpoint

Deploy the registered model to a serving endpoint. This will:
1. Create a new serving endpoint with the specified name
2. Load the model into memory
3. Make it available for inference requests

Note: `scale_to_zero=False` means the endpoint will maintain at least one instance running,
which helps reduce cold start times but may incur higher costs.

In [0]:
deployer.deploy(ENDPOINT_NAME, scale_to_zero=False)
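
Whether `deploy` blocks until the endpoint is ready is not stated here, so a caller that needs to wait may want to poll the endpoint state. The sketch below assumes the `state` object returned by the Databricks serving endpoints REST API (`GET /api/2.0/serving-endpoints/{name}`), with a `ready` field of `READY` or `NOT_READY` and a `config_update` field that can be `UPDATE_FAILED`. `wait_until_ready` and `fetch_state` are hypothetical names; `fetch_state` is any callable you supply that returns that `state` dictionary.

```python
import time


def wait_until_ready(fetch_state, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll fetch_state() until the endpoint is READY, fails, or times out.

    fetch_state must return the endpoint's `state` dict, for example
    {"ready": "NOT_READY", "config_update": "IN_PROGRESS"}.
    Returns True when ready and False on timeout.
    """
    waited = 0
    while True:
        state = fetch_state()
        if state.get("ready") == "READY":
            return True
        if state.get("config_update") == "UPDATE_FAILED":
            raise RuntimeError(f"endpoint update failed: {state}")
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

Injecting `fetch_state` keeps the polling logic independent of how the state is retrieved (REST call, SDK, or a test stub), and the default timeout and poll interval are arbitrary starting points rather than recommended values.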

## Process Management

### Restarting Model Processes

Sometimes you may need to restart the model processes, for example:
- After making configuration changes
- If the model becomes unresponsive
- To free up GPU memory

The following code will:
1. Kill any existing VLLM processes
2. Kill any Ray processes (used for distributed computing)
3. Kill any multiprocessing processes

Run this cell whenever you need to restart the model processes.

In [0]:
from mlflow_extensions.testing.helper import kill_processes_containing

# Kill existing processes to free up resources
kill_processes_containing("vllm")  # Kill VLLM model serving processes
kill_processes_containing("ray")   # Kill Ray distributed computing processes
kill_processes_containing("from multiprocessing")  # Kill any multiprocessing processes
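
To confirm that killing the processes actually released GPU memory, one option is to query `nvidia-smi` before and after. This is a sketch under the assumption that `nvidia-smi` is on the cluster's `PATH`; `parse_gpu_memory` and `gpu_memory` are hypothetical helper names.

```python
import subprocess

# Ask nvidia-smi for used and total memory (MiB) as bare CSV, one row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `gpu_memory()` before and after the kill cell shows whether the used figure dropped.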

## Model Serving Setup

Initialize the model for serving and set up the client for inference.
This section will:
1. Set up MLflow registry URI
2. Fetch the latest model version
3. Load the model for serving

In [0]:
import mlflow
from mlflow.tracking import MlflowClient

# Set up MLflow registry
mlflow.set_registry_uri('databricks-uc')

# Initialize MLflow client
client = MlflowClient()

# Get the latest model version
model_name = f"{CATALOG}.{SCHEMA}.{MODEL_NAME}"

# Search all registered versions and take the highest version number
versions = client.search_model_versions(f"name='{model_name}'")
if not versions:
    raise RuntimeError(f"No registered versions found for {model_name}")

latest_version = max(int(mv.version) for mv in versions)

print(f"Using latest model version: {latest_version}")

# Load the registered model
model_uri = f"models:/{model_name}/{latest_version}"
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
base_url = str(pyfunc_model.unwrap_python_model()._engine._server_http_client.base_url)

print("Model serving base URL:", base_url)

## Inference Examples

Demonstrate model inference with a text input.

### Text-only Inference

Basic text embedding example: the payload's `input` field holds a list of strings to embed.

In [0]:
serving_payload = {"input": ["this is a new test"]}

response = pyfunc_model.predict(serving_payload)
print(response)
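
For retrieval-style use, the model card for `intfloat/multilingual-e5-large-instruct` describes prefixing each query with a one-sentence task instruction in the form `Instruct: {task}\nQuery: {query}`, while documents are embedded without an instruction. The sketch below builds such a payload and includes a plain-Python cosine similarity for comparing two embedding vectors. `format_query` and `cosine_similarity` are hypothetical helper names, and the passage text is illustrative. How to pull the vectors out of `response` depends on the response structure, which is not shown here.

```python
import math


def format_query(task: str, query: str) -> str:
    """Prefix a query with its task instruction, following the model card format."""
    return f"Instruct: {task}\nQuery: {query}"


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


task = "Given a web search query, retrieve relevant passages that answer the query"
serving_payload = {
    "input": [
        format_query(task, "how much protein should a female eat"),
        # Documents are embedded as-is, without an instruction prefix.
        "Protein requirements vary with age, weight, and activity level.",
    ]
}
```

The payload can then be passed to `pyfunc_model.predict(serving_payload)` as in the cell above, and the two returned vectors compared with `cosine_similarity`.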