feat: add tensorrt support (#688)
* fix: drafts tempt

* feat: support tensorrt backend

* fix: offline tensorrt loading

* fix: remove draft Dockerfile

* fix: setup

* fix: s3 bucket for tensorrt

* fix: polish codes

* fix: int64 input tensors

* fix: rebase

* fix: refactor preprocess

* fix: deprecate _preprocess_blob

* fix: refactor

* fix: available model for tensorrt

* fix: warning message

* fix: imports

* fix: draft tensorrt tests

* fix: tensorrt output

* fix: trt test

* fix: tensorrt deps

* fix: runtime error: re-initialized torch in pytest

* fix: compute capability

* fix: dynamic trt converting

* fix: ci workflow

* fix: import onnxruntime

* fix: errors

* fix: upload cov file

* fix: upload cov file

* fix: minor revision

* chore: update docs

* fix: try to fix cov uploader

* fix: temp disable netlify

* fix: temp disable netlify

* fix: revert netlify

* fix: gpu test

* fix: runs on

* fix: upgrade cov action

* fix: use cov bash uploader

* fix: action checkout fetch-depth

* fix: rebase conflict

* fix: cov uploader

* fix: setup script
numb3r3 committed May 4, 2022
1 parent 3f34d46 commit f7b9af4
Showing 17 changed files with 718 additions and 95 deletions.
52 changes: 50 additions & 2 deletions .github/workflows/ci.yml
@@ -127,17 +127,65 @@ jobs:
with:
files: "coverage.xml"
- name: Upload coverage from test to Codecov
uses: codecov/codecov-action@v2
uses: codecov/codecov-action@v3
if: steps.check_files.outputs.files_exists == 'true' && ${{ matrix.python-version }} == '3.7'
with:
file: coverage.xml
name: ${{ matrix.test-path }}-codecov
flags: ${{ steps.test.outputs.codecov_flag }}
fail_ci_if_error: false
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos

gpu-test:
needs: prep-testbed
runs-on: [self-hosted, x64, gpu, linux]
strategy:
fail-fast: false
matrix:
python-version: [ 3.7 ]
steps:
- uses: actions/checkout@v2
with:
# For coverage builds fetch the whole history
fetch-depth: 0
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Prepare environment
run: |
python -m pip install --upgrade pip
python -m pip install wheel pytest pytest-cov nvidia-pyindex
pip install -e "client/[test]"
pip install -e "server/[tensorrt]"
- name: Test
id: test
run: |
pytest --suppress-no-test-exit-code --cov=clip_client --cov=clip_server --cov-report=xml \
-v -s -m "gpu" ./tests/test_tensorrt.py
echo "::set-output name=codecov_flag::cas"
timeout-minutes: 30
env:
# fix re-initialized torch runtime error on cuda device
JINA_MP_START_METHOD: spawn
- name: Check codecov file
id: check_files
uses: andstor/file-existence-action@v1
with:
files: "coverage.xml"
- name: Upload coverage from test to Codecov
uses: codecov/codecov-action@v3
if: steps.check_files.outputs.files_exists == 'true' && ${{ matrix.python-version }} == '3.7'
with:
file: coverage.xml
name: gpu-related-codecov
flags: ${{ steps.test.outputs.codecov_flag }}
fail_ci_if_error: false
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos

# just for blocking the merge until all parallel core-test are successful
success-all-test:
needs: core-test
needs: [core-test, gpu-test]
if: always()
runs-on: ubuntu-latest
steps:
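The new `gpu-test` job above selects GPU-only tests with `pytest -m "gpu"` and works around the "cannot re-initialize CUDA in a forked subprocess" error by forcing Jina to spawn worker processes (`JINA_MP_START_METHOD: spawn`). As a rough illustration of how a test might opt into that marker (a hypothetical test, not part of this commit):

```python
import pytest


# A test carrying the custom "gpu" marker is collected only when pytest is
# invoked with `-m "gpu"`, as the CI job above does for tests/test_tensorrt.py.
@pytest.mark.gpu
def test_tensorrt_is_usable():
    trt = pytest.importorskip('tensorrt')  # skip gracefully on machines without TensorRT
    assert trt.__version__
```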
12 changes: 11 additions & 1 deletion README.md
@@ -19,7 +19,7 @@

CLIP-as-service is a low-latency high-scalability service for embedding images and text. It can be easily integrated as a microservice into neural search solutions.

**Fast**: Serve CLIP models with ONNX runtime and PyTorch JIT with 800QPS<sup>[*]</sup>. Non-blocking duplex streaming on requests and responses, designed for large data and long-running tasks.
**Fast**: Serve CLIP models with TensorRT, ONNX runtime and PyTorch JIT with 800QPS<sup>[*]</sup>. Non-blocking duplex streaming on requests and responses, designed for large data and long-running tasks.

🫐 **Elastic**: Horizontally scale up and down multiple CLIP models on single GPU, with automatic load balancing.

@@ -58,6 +58,16 @@ To run CLIP model via ONNX (default is via PyTorch):
pip install "clip-server[onnx]"
```

To run CLIP model via TensorRT:

```bash
# You must first install the nvidia-pyindex package, which is required in order to set up your pip installation
# to fetch additional Python modules from the NVIDIA NGC™ PyPI repo.
pip install nvidia-pyindex

pip install "clip-server[tensorrt]"
```
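
As an optional sanity check (not part of the README), one can confirm that the TensorRT Python bindings pulled from the NGC repo are importable:

```python
# Optional sanity check after the install above; an ImportError here means the
# nvidia-pyindex / clip-server[tensorrt] installation did not succeed.
import tensorrt as trt

print(trt.__version__)
```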

### Install client

```bash
47 changes: 40 additions & 7 deletions docs/user-guides/server.md
@@ -2,7 +2,7 @@

CLIP-as-service is designed in a client-server architecture. A server is a long-running program that receives raw sentences and images from clients, and returns CLIP embeddings to the client. Additionally, `clip_server` is optimized for speed, low memory footprint and scalability.
- Horizontal scaling: adding more replicas easily with one argument.
- Vertical scaling: using PyTorch JIT or ONNX runtime to speedup single GPU inference.
- Vertical scaling: using PyTorch JIT, ONNX or TensorRT runtime to speedup single GPU inference.
- Supporting gRPC, HTTP, Websocket protocols with their TLS counterparts, w/o compressions.

This chapter introduces the API of the client.
@@ -34,22 +34,37 @@ To use ONNX runtime for CLIP, you can run:
```bash
pip install "clip_server[onnx]"

python -m clip_server onnx_flow.yml
python -m clip_server onnx-flow.yml
```

One may wonder where is this `onnx_flow.yml` come from. Must be a typo? Believe me, just run it. It should work. I will explain this YAML file in the next section.
We also support the TensorRT runtime for CLIP; to use it, run:

```bash
# You must first install the nvidia-pyindex package, which is required in order to set up your pip installation
# to fetch additional Python modules from the NVIDIA NGC™ PyPI repo.
pip install nvidia-pyindex

pip install "clip_server[tensorrt]"

python -m clip_server tensorrt-flow.yml
```

One may wonder where this `onnx-flow.yml` (or `tensorrt-flow.yml`) comes from. Must be a typo? Believe me, just run it. It should work. I will explain this YAML file in the next section.

The procedure and UI of ONNX runtime would look the same as Pytorch runtime.


The procedure and UI of ONNX and TensorRT runtime would look the same as Pytorch runtime.



## YAML config

You may notice that there is a YAML file in our last ONNX example. All configurations are stored in this file. In fact, `python -m clip_server` does **not support** any other argument besides a YAML file. So it is the only source of the truth of your configs.

And to answer your doubt, `clip_server` has two built-in YAML configs as a part of the package resources: one for PyTorch backend, one for ONNX backend. When you do `python -m clip_server` it loads the Pytorch config, and when you do `python -m clip_server onnx-flow.yml` it loads the ONNX config.
And to answer your doubt, `clip_server` has three built-in YAML configs as a part of the package resources. When you do `python -m clip_server` it loads the Pytorch config, and when you do `python -m clip_server onnx-flow.yml` it loads the ONNX config.
In the same way, when you do `python -m clip_server tensorrt-flow.yml` it loads the TensorRT config.

Let's look at these two built-in YAML configs:
Let's look at these three built-in YAML configs:

````{tab} torch-flow.yml
@@ -85,6 +100,24 @@ executors:
```
````


````{tab} tensorrt-flow.yml
```yaml
jtype: Flow
version: '1'
with:
port: 51000
executors:
- name: clip_r
uses:
jtype: CLIPEncoder
metas:
py_modules:
- executors/clip_trt.py
```
````

Basically, each YAML file defines a [Jina Flow](https://docs.jina.ai/fundamentals/flow/). The complete Jina Flow YAML syntax [can be found here](https://docs.jina.ai/fundamentals/flow/flow-yaml/#configure-flow-meta-information). General parameters of the Flow and Executor can be used here as well. But now we only highlight the most important parameters.
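
For readers who prefer code to YAML, a rough Python equivalent of `tensorrt-flow.yml` might look like the sketch below. It assumes `CLIPEncoder` can be imported from `clip_server.executors.clip_trt` (mirroring the `executors/clip_trt.py` module listed under `py_modules`); the YAML route shown above remains the supported way to start the server.

```python
from jina import Flow

# Assumed import path for the TensorRT executor; the YAML config loads the same
# class via `py_modules: executors/clip_trt.py`.
from clip_server.executors.clip_trt import CLIPEncoder

f = Flow(port=51000).add(name='clip_r', uses=CLIPEncoder)

with f:
    f.block()  # serve until interrupted, like `python -m clip_server tensorrt-flow.yml`
```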

Looking at the YAML file again, we can put it into three subsections as below:
@@ -162,7 +195,7 @@ executors:

### CLIP model config

For PyTorch & ONNX backend, you can set the following parameters via `with`:
For all backends, you can set the following parameters via `with`:

| Parameter | Description |
|-----------|--------------------------------------------------------------------------------------------------------------------------------|
3 changes: 3 additions & 0 deletions scripts/benchmark.py
@@ -16,6 +16,9 @@ def warn(*args, **kwargs):
warnings.warn = warn


np.random.seed(123)


class BenchmarkClient(threading.Thread):
def __init__(
self,
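Seeding NumPy's global RNG pins down the randomly generated benchmark payloads so that repeated runs are comparable. A minimal illustration of the effect (not from this script):

```python
import numpy as np

np.random.seed(123)
first = np.random.rand(4)

np.random.seed(123)
second = np.random.rand(4)

assert (first == second).all()  # same seed, same synthetic benchmark data
```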
2 changes: 1 addition & 1 deletion scripts/get-all-test-paths.sh
@@ -6,7 +6,7 @@ BATCH_SIZE=3
#declare -a array1=( "tests/unit/test_*.py" )
#declare -a array2=( $(ls -d tests/unit/*/ | grep -v '__pycache__' | grep -v 'array') )
#declare -a array3=( "tests/unit/array/*.py" )
declare -a mixins=( $(find tests -name "test_*.py") )
declare -a mixins=( $(find tests -name "test_*.py" | grep -v 'test_tensorrt.py') )
declare -a array4=( "$(echo "${mixins[@]}" | xargs -n$BATCH_SIZE)" )
# array5 is currently empty because in the array/ directory, mixins is the only directory
# but add the following in case new directories are created in array/
61 changes: 14 additions & 47 deletions server/clip_server/executors/clip_onnx.py
@@ -1,26 +1,15 @@
import os
import warnings
from multiprocessing.pool import ThreadPool, Pool
from typing import List, Tuple, Optional
import numpy as np
from functools import partial
from multiprocessing.pool import ThreadPool
from typing import Optional
import onnxruntime as ort

from jina import Executor, requests, DocumentArray

from clip_server.model import clip
from clip_server.model.clip_onnx import CLIPOnnxModel

_SIZE = {
'RN50': 224,
'RN101': 224,
'RN50x4': 288,
'RN50x16': 384,
'RN50x64': 448,
'ViT-B/32': 224,
'ViT-B/16': 224,
'ViT-L/14': 224,
'ViT-L/14@336px': 336,
}
from clip_server.executors.helper import split_img_txt_da, preproc_image, preproc_text


class CLIPEncoder(Executor):
@@ -34,8 +23,7 @@ def __init__(
):
super().__init__(**kwargs)

self._preprocess_blob = clip._transform_blob(_SIZE[name])
self._preprocess_tensor = clip._transform_ndarray(_SIZE[name])
self._preprocess_tensor = clip._transform_ndarray(clip.MODEL_SIZE[name])
self._pool = ThreadPool(processes=num_worker_preprocess)

self._minibatch_size = minibatch_size
@@ -86,51 +74,30 @@ def __init__(

self._model.start_sessions(sess_options=sess_options, providers=providers)

def _preproc_image(self, da: 'DocumentArray') -> 'DocumentArray':
for d in da:
if d.tensor is not None:
d.tensor = self._preprocess_tensor(d.tensor)
else:
if not d.blob and d.uri:
# in case user uses HTTP protocol and send data via curl not using .blob (base64), but in .uri
d.load_uri_to_blob()
d.tensor = self._preprocess_blob(d.blob)
da.tensors = da.tensors.detach().cpu().numpy().astype(np.float32)
return da

def _preproc_text(self, da: 'DocumentArray') -> Tuple['DocumentArray', List[str]]:
texts = da.texts
da.tensors = clip.tokenize(texts).detach().cpu().numpy().astype(np.int64)
da[:, 'mime_type'] = 'text'
return da, texts

@requests
async def encode(self, docs: 'DocumentArray', **kwargs):
_img_da = DocumentArray()
_txt_da = DocumentArray()
for d in docs:
if d.text:
_txt_da.append(d)
elif (d.blob is not None) or (d.tensor is not None):
_img_da.append(d)
elif d.uri:
_img_da.append(d)
else:
warnings.warn(
f'The content of document {d.id} is empty, cannot be processed'
)
split_img_txt_da(d, _img_da, _txt_da)

# for image
if _img_da:
for minibatch in _img_da.map_batch(
self._preproc_image, batch_size=self._minibatch_size, pool=self._pool
partial(
preproc_image, preprocess_fn=self._preprocess_tensor, return_np=True
),
batch_size=self._minibatch_size,
pool=self._pool,
):
minibatch.embeddings = self._model.encode_image(minibatch.tensors)

# for text
if _txt_da:
for minibatch, _texts in _txt_da.map_batch(
self._preproc_text, batch_size=self._minibatch_size, pool=self._pool
partial(preproc_text, return_np=True),
batch_size=self._minibatch_size,
pool=self._pool,
):
minibatch.embeddings = self._model.encode_text(minibatch.tensors)
minibatch.texts = _texts
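The refactor above moves per-document routing and preprocessing out of the executor and into `clip_server/executors/helper.py`. A minimal sketch of what two of those helpers might look like, reconstructed from the inline code they replace in this diff (not necessarily the actual implementation):

```python
import warnings
from typing import List, Tuple

from jina import Document, DocumentArray

from clip_server.model import clip


def split_img_txt_da(doc: 'Document', img_da: 'DocumentArray', txt_da: 'DocumentArray'):
    """Route a document into the text or image batch, mirroring the removed if/elif block."""
    if doc.text:
        txt_da.append(doc)
    elif (doc.blob is not None) or (doc.tensor is not None):
        img_da.append(doc)
    elif doc.uri:
        img_da.append(doc)
    else:
        warnings.warn(f'The content of document {doc.id} is empty, cannot be processed')


def preproc_text(da: 'DocumentArray', return_np: bool = False) -> Tuple['DocumentArray', List[str]]:
    """Tokenize texts; cast to int64 numpy arrays when the ONNX/TensorRT sessions need them."""
    texts = da.texts
    tensors = clip.tokenize(texts)
    da.tensors = tensors.detach().cpu().numpy().astype('int64') if return_np else tensors
    da[:, 'mime_type'] = 'text'
    return da, texts
```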