

# Production Blueprint: High-Throughput OpenCLIP Inference

This document outlines the architecture for serving OpenCLIP at scale using **NVIDIA Triton Inference Server**, **DALI**, and **TensorRT** on an **A100 20GB MIG slice**.

## 1. System Architecture

A **FastAPI Gateway** accepts incoming REST requests and translates them into **gRPC** calls to the **Triton Ensemble**.

* **Gateway:** Accepts HTTP/REST for client compatibility and forwards each request to Triton over binary gRPC for lower overhead.
* **Triton Ensemble:** Chains DALI preprocessing and TensorRT inference so that intermediate tensors stay in GPU memory instead of round-tripping through the host ("zero-copy" between steps).
* **Infrastructure:** Deployed via Docker with A100 MIG partitioning for hardware isolation.

---

## 2. Model Repository Structure

Your repository should follow this standard layout for the Triton Ensemble:

```text
model_repository/
├── preprocess/
│   ├── 1/
│   │   └── model.dali     # Serialized DALI pipeline (default file name for the DALI backend)
│   └── config.pbtxt
├── clip_vision/
│   ├── 1/
│   │   └── model.plan     # TensorRT engine
│   └── config.pbtxt
└── openclip_ensemble/
    ├── 1/                 # Empty version directory (Triton requires one for ensembles too)
    └── config.pbtxt       # Defines the pipeline flow
```
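
Before starting Triton, it can help to check the repository on disk. The sketch below uses only the standard library and checks two things Triton expects of every model, including the ensemble: a `config.pbtxt` and at least one numeric version directory (which may be empty for an ensemble). The helper name `check_model_repository` is hypothetical, and the check does not validate the contents of any file.

```python
from pathlib import Path


def check_model_repository(root: str, models: list[str]) -> list[str]:
    """Return a list of layout problems; an empty list means the layout looks sane."""
    problems = []
    for name in models:
        model_dir = Path(root) / name
        if not model_dir.is_dir():
            problems.append(f"{name}: directory not found")
            continue
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{name}: missing config.pbtxt")
        # Triton loads model artifacts from numeric version directories such as 1/.
        has_version = any(p.is_dir() and p.name.isdigit() for p in model_dir.iterdir())
        if not has_version:
            problems.append(f"{name}: no numeric version directory")
    return problems


if __name__ == "__main__":
    issues = check_model_repository(
        "model_repository", ["preprocess", "clip_vision", "openclip_ensemble"]
    )
    print("\n".join(issues) or "layout OK")
```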

### DALI Backend Config (`preprocess/config.pbtxt`)

The batch size and instance count are kept modest so that preprocessing buffers fit alongside the TensorRT engine within a 20GB MIG slice.

```protobuf
name: "preprocess"
backend: "dali"
max_batch_size: 64

instance_group [{ count: 2, kind: KIND_GPU }]

```
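
As a rough sanity check on these numbers, the size of the preprocessed batch tensor can be computed directly. This is a back-of-the-envelope sketch that assumes FP32 output at `3x224x224` (the shape used in the deployment checklist); the function name is hypothetical. It counts only the output tensor per DALI instance, not decoder scratch buffers or TensorRT activations, which usually dominate and should be measured on the actual slice.

```python
def batch_tensor_mib(batch_size: int, channels: int = 3, height: int = 224,
                     width: int = 224, bytes_per_element: int = 4) -> float:
    """Size in MiB of one dense batch tensor (FP32 by default)."""
    return batch_size * channels * height * width * bytes_per_element / 2**20


if __name__ == "__main__":
    per_instance = batch_tensor_mib(64)
    # Two DALI instances (count: 2) can each hold a full batch at the same time.
    print(f"per instance: {per_instance} MiB, two instances: {2 * per_instance} MiB")
```

The output tensors are small compared with 20GB, so if memory runs out, the cause is more likely decode buffers or the TensorRT workspace than the batch tensor itself.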

### TensorRT Vision Config (`clip_vision/config.pbtxt`)

```protobuf
name: "clip_vision"
platform: "tensorrt_plan"
max_batch_size: 64

instance_group [{ count: 1, kind: KIND_GPU }]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 2000
}

```
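
To get a feel for the `max_queue_delay_microseconds` trade-off: the dynamic batcher waits at most that long for a preferred batch size to fill, so the setting adds up to 2 ms of queueing latency, and a preferred size is only reached if enough requests arrive within that window. The sketch below assumes single-image requests arriving at a steady rate (real traffic is burstier), and both function names are hypothetical.

```python
def expected_requests_in_window(requests_per_second: float, max_queue_delay_us: int) -> float:
    """Average number of requests arriving during one queue-delay window."""
    return requests_per_second * max_queue_delay_us / 1_000_000


def min_rate_to_fill(batch_size: int, max_queue_delay_us: int) -> float:
    """Steady request rate (req/s) needed to fill a batch before the delay expires."""
    return batch_size * 1_000_000 / max_queue_delay_us


if __name__ == "__main__":
    for size in (16, 32, 64):
        print(size, min_rate_to_fill(size, 2000))
```

Under these assumptions, filling even the smallest preferred batch of 16 within 2000 µs takes about 8000 requests per second. At lower traffic the batcher dispatches whatever has queued once the delay expires, so raising the delay trades latency for larger batches.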

---

## 3. Infrastructure: Docker & MIG

Use the following `docker-compose.yml` to ensure your Triton server binds correctly to the MIG device.

```yaml
services:
  triton-backend:
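    image: nvcr.io/nvidia/tritonserver:24.01-py3
    runtime: nvidia  # Required for NVIDIA_VISIBLE_DEVICES to take effect; needs the NVIDIA Container Toolkit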
    environment:
      - NVIDIA_VISIBLE_DEVICES=MIG-GPU-xxxx-xxxx-xxxx  # Insert your MIG UUID
    shm_size: '2gb' # Critical for internal tensor passing
    volumes:
      - ./model_repository:/models
    command: ["tritonserver", "--model-repository=/models"]

  rest-gateway:
    build: ./api_gateway
    ports:
      - "8080:8080"
    environment:
      - TRITON_URL=triton-backend:8001

```
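
Once the containers are up, a readiness probe confirms that Triton has loaded the models. The sketch below polls Triton's standard HTTP health endpoint (`/v2/health/ready`, served on port 8000 by default) using only the standard library. Note that the compose file above does not publish port 8000, so this would run from another container on the same network; the function name `wait_until_ready` and the timeout values are assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll Triton's readiness endpoint until it returns 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable or not ready yet; retry after a pause.
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready("http://triton-backend:8000", timeout_s=10) else "not ready")
```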

---

## 4. Gateway Implementation (`main.py`)

This Python service acts as the REST-to-gRPC bridge.

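```python
import os

import numpy as np
import tritonclient.grpc.aio as grpc_client
from fastapi import FastAPI, File, HTTPException, UploadFile

# Matches the TRITON_URL variable set in docker-compose.yml
TRITON_URL = os.environ.get("TRITON_URL", "triton-backend:8001")

app = FastAPI()
client = None  # Created at startup so the gRPC channel binds to the running event loop

@app.on_event("startup")
async def connect() -> None:
    global client
    client = grpc_client.InferenceServerClient(url=TRITON_URL)

@app.post("/v1/embed")
async def embed(file: UploadFile = File(...)):
    # Read the encoded image as raw bytes; decoding happens in the DALI step
    content = await file.read()
    if not content:
        raise HTTPException(status_code=400, detail="Empty file")
    image_bytes = np.frombuffer(content, dtype=np.uint8).reshape(1, -1)

    # Tensor names must match the ensemble's config.pbtxt
    inputs = [grpc_client.InferInput("IMAGE_BYTES", list(image_bytes.shape), "UINT8")]
    inputs[0].set_data_from_numpy(image_bytes)

    response = await client.infer("openclip_ensemble", inputs)
    return {"embedding": response.as_numpy("EMBEDDING").tolist()}
```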
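
For completeness, a client can call the endpoint with a multipart upload. The sketch below uses only the standard library and assumes the gateway is reachable at `http://localhost:8080` (the port published in the compose file) and that the form field is named `file`, as in the handler's signature. The helper names `build_multipart` and `embed_image` are hypothetical.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def embed_image(path: str, gateway_url: str = "http://localhost:8080") -> list:
    """POST an image file to /v1/embed and return the embedding as nested lists."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("file", os.path.basename(path), f.read())
    request = urllib.request.Request(
        gateway_url.rstrip("/") + "/v1/embed",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["embedding"]
```

Because the handler sends a `[1, N]` input, the returned embedding is a nested list with one row per image.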

---

## 5. Deployment Checklist for 20GB MIG

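1. **Engine Build:** Ensure your `trtexec` build command sets `--minShapes` and `--maxShapes` (and typically `--optShapes`) for the input tensor, e.g., `1x3x224x224` to `64x3x224x224`, so the engine's batch range covers the `max_batch_size` in `clip_vision/config.pbtxt`.
2. **Memory Management:** If you hit out-of-memory (OOM) errors during inference, lower `max_batch_size` in your `config.pbtxt` files to `32`, and remove `64` from `preferred_batch_size`, since preferred sizes may not exceed `max_batch_size`.
3. **MIG Isolation:** Always verify your MIG UUID via `nvidia-smi -L` before launching containers so that Triton binds to the intended MIG slice rather than the full GPU.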
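
The engine build in step 1 can be scripted so the shape range stays in sync with the Triton config. This is a minimal sketch that only assembles the `trtexec` argument list. The ONNX path, the input tensor name `input`, the optimal batch of 32, and the use of `--fp16` are all assumptions to adapt to your exported model.

```python
def trtexec_command(onnx_path: str, engine_path: str, input_name: str,
                    min_batch: int = 1, opt_batch: int = 32, max_batch: int = 64,
                    chw: tuple = (3, 224, 224), fp16: bool = True) -> list[str]:
    """Build a trtexec argument list with a dynamic batch dimension."""
    def shape(batch: int) -> str:
        # trtexec expects shapes as <tensor_name>:<d0>x<d1>x...
        return f"{input_name}:" + "x".join(str(d) for d in (batch, *chw))

    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--minShapes={shape(min_batch)}",
        f"--optShapes={shape(opt_batch)}",
        f"--maxShapes={shape(max_batch)}",
    ]
    if fp16:
        cmd.append("--fp16")
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; the input name must match the ONNX graph.
    print(" ".join(trtexec_command("clip_vision.onnx",
                                   "model_repository/clip_vision/1/model.plan",
                                   "input")))
```

Building on the same MIG slice that will serve the model is advisable, since TensorRT engines are tuned for the device they are built on.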

---

[How to optimize NVIDIA TensorRT performance](https://www.youtube.com/watch?v=67ev-6Xn30U)

This video provides excellent insights into profiling your TensorRT engines to ensure your model is running at peak efficiency on your specific A100 hardware.