feat: add tensorrt support (#688)
* fix: drafts tempt

* feat: support tensorrt backend

* fix: offline tensorrt loading

* fix: remove draft Dockerfile

* fix: setup

* fix: s3 bucket for tensorrt

* fix: polish codes

* fix: int64 input tensors

* fix: rebase

* fix: refactor preprocess

* fix: deprecate _preprocess_blob

* fix: refactor

* fix: available model for tensorrt

* fix: warning message

* fix: imports

* fix: draft tensorrt tests

* fix: tensorrt output

* fix: trt test

* fix: tensorrt deps

* fix: runtime error: re-initialized torch in pytest

* fix: compute capability

* fix: dynamic trt converting

* fix: ci workflow

* fix: import onnxruntime

* fix: errors

* fix: upload cov file

* fix: upload cov file

* fix: minor revision

* chore: update docs

* fix: try to fix cov uploader

* fix: temp disable netlify

* fix: temp disable netlify

* fix: revert netlify

* fix: gpu test

* fix: runs on

* fix: upgrade cov action

* fix: use cov bash uploader

* fix: action checkout fetch-depth

* fix: rebase conflict

* fix: cov uploader

* fix: setup script
numb3r3 committed May 4, 2022
1 parent 3f34d46 commit f7b9af4
Showing 17 changed files with 718 additions and 95 deletions.
52 changes: 50 additions & 2 deletions .github/workflows/ci.yml
@@ -127,17 +127,65 @@ jobs:
with:
files: "coverage.xml"
- name: Upload coverage from test to Codecov
uses: codecov/codecov-action@v2
uses: codecov/codecov-action@v3
if: steps.check_files.outputs.files_exists == 'true' && ${{ matrix.python-version }} == '3.7'
with:
file: coverage.xml
name: ${{ matrix.test-path }}-codecov
flags: ${{ steps.test.outputs.codecov_flag }}
fail_ci_if_error: false
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos

gpu-test:
needs: prep-testbed
runs-on: [self-hosted, x64, gpu, linux]
strategy:
fail-fast: false
matrix:
python-version: [ 3.7 ]
steps:
- uses: actions/checkout@v2
with:
# For coverage builds fetch the whole history
fetch-depth: 0
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Prepare environment
run: |
python -m pip install --upgrade pip
python -m pip install wheel pytest pytest-cov nvidia-pyindex
pip install -e "client/[test]"
pip install -e "server/[tensorrt]"
- name: Test
id: test
run: |
pytest --suppress-no-test-exit-code --cov=clip_client --cov=clip_server --cov-report=xml \
-v -s -m "gpu" ./tests/test_tensorrt.py
echo "::set-output name=codecov_flag::cas"
timeout-minutes: 30
env:
# fix re-initialized torch runtime error on cuda device
JINA_MP_START_METHOD: spawn
- name: Check codecov file
id: check_files
uses: andstor/file-existence-action@v1
with:
files: "coverage.xml"
- name: Upload coverage from test to Codecov
uses: codecov/codecov-action@v3
if: steps.check_files.outputs.files_exists == 'true' && ${{ matrix.python-version }} == '3.7'
with:
file: coverage.xml
name: gpu-related-codecov
flags: ${{ steps.test.outputs.codecov_flag }}
fail_ci_if_error: false
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos

# just for blocking the merge until all parallel core-test are successful
success-all-test:
needs: core-test
needs: [core-test, gpu-test]
if: always()
runs-on: ubuntu-latest
steps:
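The new `gpu-test` job above selects GPU-only tests with `pytest -m "gpu"` and works around the "cannot re-initialize CUDA in a forked subprocess" error by forcing Jina to spawn worker processes (`JINA_MP_START_METHOD: spawn`). As a rough illustration of how a test might opt into that marker (a hypothetical test, not part of this commit):

```python
import pytest


# A test carrying the custom "gpu" marker is collected only when pytest is
# invoked with `-m "gpu"`, as the CI job above does for tests/test_tensorrt.py.
@pytest.mark.gpu
def test_tensorrt_is_usable():
    trt = pytest.importorskip('tensorrt')  # skip gracefully on machines without TensorRT
    assert trt.__version__
```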
12 changes: 11 additions & 1 deletion README.md
@@ -19,7 +19,7 @@

CLIP-as-service is a low-latency high-scalability service for embedding images and text. It can be easily integrated as a microservice into neural search solutions.

**Fast**: Serve CLIP models with ONNX runtime and PyTorch JIT with 800QPS<sup>[*]</sup>. Non-blocking duplex streaming on requests and responses, designed for large data and long-running tasks.
**Fast**: Serve CLIP models with TensorRT, ONNX runtime and PyTorch JIT with 800QPS<sup>[*]</sup>. Non-blocking duplex streaming on requests and responses, designed for large data and long-running tasks.

🫐 **Elastic**: Horizontally scale up and down multiple CLIP models on single GPU, with automatic load balancing.

@@ -58,6 +58,16 @@ To run CLIP model via ONNX (default is via PyTorch):
pip install "clip-server[onnx]"
```

To run CLIP model via TensorRT:

```bash
# You must first install the nvidia-pyindex package, which is required in order to set up your pip installation
# to fetch additional Python modules from the NVIDIA NGC™ PyPI repo.
pip install nvidia-pyindex

pip install "clip-server[tensorrt]"
```
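
As an optional sanity check (not part of the README), one can confirm that the TensorRT Python bindings pulled from the NGC repo are importable:

```python
# Optional sanity check after the install above; an ImportError here means the
# nvidia-pyindex / clip-server[tensorrt] installation did not succeed.
import tensorrt as trt

print(trt.__version__)
```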

### Install client

```bash
47 changes: 40 additions & 7 deletions docs/user-guides/server.md
@@ -2,7 +2,7 @@

CLIP-as-service is designed in a client-server architecture. A server is a long-running program that receives raw sentences and images from clients, and returns CLIP embeddings to the client. Additionally, `clip_server` is optimized for speed, low memory footprint and scalability.
- Horizontal scaling: adding more replicas easily with one argument.
- Vertical scaling: using PyTorch JIT or ONNX runtime to speedup single GPU inference.
- Vertical scaling: using PyTorch JIT, ONNX or TensorRT runtime to speedup single GPU inference.
- Supporting gRPC, HTTP, Websocket protocols with their TLS counterparts, w/o compressions.

This chapter introduces the API of the client.
@@ -34,22 +34,37 @@ To use ONNX runtime for CLIP, you can run:
```bash
pip install "clip_server[onnx]"

python -m clip_server onnx_flow.yml
python -m clip_server onnx-flow.yml
```

One may wonder where is this `onnx_flow.yml` come from. Must be a typo? Believe me, just run it. It should work. I will explain this YAML file in the next section.
We also support the TensorRT runtime for CLIP; to use it, run:

```bash
# You must first install the nvidia-pyindex package, which is required in order to set up your pip installation
# to fetch additional Python modules from the NVIDIA NGC™ PyPI repo.
pip install nvidia-pyindex

pip install "clip_server[tensorrt]"

python -m clip_server tensorrt-flow.yml
```

One may wonder where this `onnx-flow.yml` (or `tensorrt-flow.yml`) comes from. Must be a typo? Believe me, just run it. It should work. I will explain this YAML file in the next section.

The procedure and UI of ONNX runtime would look the same as Pytorch runtime.


The procedure and UI of ONNX and TensorRT runtime would look the same as Pytorch runtime.



## YAML config

You may notice that there is a YAML file in our last ONNX example. All configurations are stored in this file. In fact, `python -m clip_server` does **not support** any other argument besides a YAML file. So it is the only source of the truth of your configs.

And to answer your doubt, `clip_server` has two built-in YAML configs as a part of the package resources: one for PyTorch backend, one for ONNX backend. When you do `python -m clip_server` it loads the Pytorch config, and when you do `python -m clip_server onnx-flow.yml` it loads the ONNX config.
And to answer your doubt, `clip_server` has three built-in YAML configs as a part of the package resources. When you do `python -m clip_server` it loads the Pytorch config, and when you do `python -m clip_server onnx-flow.yml` it loads the ONNX config.
In the same way, when you do `python -m clip_server tensorrt-flow.yml` it loads the TensorRT config.

Let's look at these two built-in YAML configs:
Let's look at these three built-in YAML configs:

````{tab} torch-flow.yml
@@ -85,6 +100,24 @@ executors:
```
````


````{tab} tensorrt-flow.yml
```yaml
jtype: Flow
version: '1'
with:
port: 51000
executors:
- name: clip_r
uses:
jtype: CLIPEncoder
metas:
py_modules:
- executors/clip_trt.py
```
````

Basically, each YAML file defines a [Jina Flow](https://docs.jina.ai/fundamentals/flow/). The complete Jina Flow YAML syntax [can be found here](https://docs.jina.ai/fundamentals/flow/flow-yaml/#configure-flow-meta-information). General parameters of the Flow and Executor can be used here as well. But now we only highlight the most important parameters.
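
For readers who prefer code to YAML, a rough Python equivalent of `tensorrt-flow.yml` might look like the sketch below. It assumes `CLIPEncoder` can be imported from `clip_server.executors.clip_trt` (mirroring the `executors/clip_trt.py` module listed under `py_modules`); the YAML route shown above remains the supported way to start the server.

```python
from jina import Flow

# Assumed import path for the TensorRT executor; the YAML config loads the same
# class via `py_modules: executors/clip_trt.py`.
from clip_server.executors.clip_trt import CLIPEncoder

f = Flow(port=51000).add(name='clip_r', uses=CLIPEncoder)

with f:
    f.block()  # serve until interrupted, like `python -m clip_server tensorrt-flow.yml`
```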

Looking at the YAML file again, we can put it into three subsections as below:
@@ -162,7 +195,7 @@ executors:

### CLIP model config

For PyTorch & ONNX backend, you can set the following parameters via `with`:
For all backends, you can set the following parameters via `with`:

| Parameter | Description |
|-----------|--------------------------------------------------------------------------------------------------------------------------------|
3 changes: 3 additions & 0 deletions scripts/benchmark.py
@@ -16,6 +16,9 @@ def warn(*args, **kwargs):
warnings.warn = warn


np.random.seed(123)


class BenchmarkClient(threading.Thread):
def __init__(
self,
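Seeding NumPy's global RNG pins down the randomly generated benchmark payloads so that repeated runs are comparable. A minimal illustration of the effect (not from this script):

```python
import numpy as np

np.random.seed(123)
first = np.random.rand(4)

np.random.seed(123)
second = np.random.rand(4)

assert (first == second).all()  # same seed, same synthetic benchmark data
```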
2 changes: 1 addition & 1 deletion scripts/get-all-test-paths.sh
@@ -6,7 +6,7 @@ BATCH_SIZE=3
#declare -a array1=( "tests/unit/test_*.py" )
#declare -a array2=( $(ls -d tests/unit/*/ | grep -v '__pycache__' | grep -v 'array') )
#declare -a array3=( "tests/unit/array/*.py" )
declare -a mixins=( $(find tests -name "test_*.py") )
declare -a mixins=( $(find tests -name "test_*.py" | grep -v 'test_tensorrt.py') )
declare -a array4=( "$(echo "${mixins[@]}" | xargs -n$BATCH_SIZE)" )
# array5 is currently empty because in the array/ directory, mixins is the only directory
# but add the following in case new directories are created in array/
61 changes: 14 additions & 47 deletions server/clip_server/executors/clip_onnx.py
@@ -1,26 +1,15 @@
import os
import warnings
from multiprocessing.pool import ThreadPool, Pool
from typing import List, Tuple, Optional
import numpy as np
from functools import partial
from multiprocessing.pool import ThreadPool
from typing import Optional
import onnxruntime as ort

from jina import Executor, requests, DocumentArray

from clip_server.model import clip
from clip_server.model.clip_onnx import CLIPOnnxModel

_SIZE = {
'RN50': 224,
'RN101': 224,
'RN50x4': 288,
'RN50x16': 384,
'RN50x64': 448,
'ViT-B/32': 224,
'ViT-B/16': 224,
'ViT-L/14': 224,
'ViT-L/14@336px': 336,
}
from clip_server.executors.helper import split_img_txt_da, preproc_image, preproc_text


class CLIPEncoder(Executor):
@@ -34,8 +23,7 @@ def __init__(
):
super().__init__(**kwargs)

self._preprocess_blob = clip._transform_blob(_SIZE[name])
self._preprocess_tensor = clip._transform_ndarray(_SIZE[name])
self._preprocess_tensor = clip._transform_ndarray(clip.MODEL_SIZE[name])
self._pool = ThreadPool(processes=num_worker_preprocess)

self._minibatch_size = minibatch_size
@@ -86,51 +74,30 @@ def __init__(

self._model.start_sessions(sess_options=sess_options, providers=providers)

def _preproc_image(self, da: 'DocumentArray') -> 'DocumentArray':
for d in da:
if d.tensor is not None:
d.tensor = self._preprocess_tensor(d.tensor)
else:
if not d.blob and d.uri:
# in case user uses HTTP protocol and send data via curl not using .blob (base64), but in .uri
d.load_uri_to_blob()
d.tensor = self._preprocess_blob(d.blob)
da.tensors = da.tensors.detach().cpu().numpy().astype(np.float32)
return da

def _preproc_text(self, da: 'DocumentArray') -> Tuple['DocumentArray', List[str]]:
texts = da.texts
da.tensors = clip.tokenize(texts).detach().cpu().numpy().astype(np.int64)
da[:, 'mime_type'] = 'text'
return da, texts

@requests
async def encode(self, docs: 'DocumentArray', **kwargs):
_img_da = DocumentArray()
_txt_da = DocumentArray()
for d in docs:
if d.text:
_txt_da.append(d)
elif (d.blob is not None) or (d.tensor is not None):
_img_da.append(d)
elif d.uri:
_img_da.append(d)
else:
warnings.warn(
f'The content of document {d.id} is empty, cannot be processed'
)
split_img_txt_da(d, _img_da, _txt_da)

# for image
if _img_da:
for minibatch in _img_da.map_batch(
self._preproc_image, batch_size=self._minibatch_size, pool=self._pool
partial(
preproc_image, preprocess_fn=self._preprocess_tensor, return_np=True
),
batch_size=self._minibatch_size,
pool=self._pool,
):
minibatch.embeddings = self._model.encode_image(minibatch.tensors)

# for text
if _txt_da:
for minibatch, _texts in _txt_da.map_batch(
self._preproc_text, batch_size=self._minibatch_size, pool=self._pool
partial(preproc_text, return_np=True),
batch_size=self._minibatch_size,
pool=self._pool,
):
minibatch.embeddings = self._model.encode_text(minibatch.tensors)
minibatch.texts = _texts
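The refactor above moves per-document routing and preprocessing out of the executor and into `clip_server/executors/helper.py`. A minimal sketch of what two of those helpers might look like, reconstructed from the inline code they replace in this diff (not necessarily the actual implementation):

```python
import warnings
from typing import List, Tuple

from jina import Document, DocumentArray

from clip_server.model import clip


def split_img_txt_da(doc: 'Document', img_da: 'DocumentArray', txt_da: 'DocumentArray'):
    """Route a document into the text or image batch, mirroring the removed if/elif block."""
    if doc.text:
        txt_da.append(doc)
    elif (doc.blob is not None) or (doc.tensor is not None):
        img_da.append(doc)
    elif doc.uri:
        img_da.append(doc)
    else:
        warnings.warn(f'The content of document {doc.id} is empty, cannot be processed')


def preproc_text(da: 'DocumentArray', return_np: bool = False) -> Tuple['DocumentArray', List[str]]:
    """Tokenize texts; cast to int64 numpy arrays when the ONNX/TensorRT sessions need them."""
    texts = da.texts
    tensors = clip.tokenize(texts)
    da.tensors = tensors.detach().cpu().numpy().astype('int64') if return_np else tensors
    da[:, 'mime_type'] = 'text'
    return da, texts
```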