Adding support for CTranslate2 acceleration #496

Closed
michaelfeil opened this issue Jun 26, 2023 · 6 comments

michaelfeil commented Jun 26, 2023

Feature request

The feature would be to support accelerated inference with the CTranslate2 framework
https://github.com/OpenNMT/CTranslate2

Motivation

Reasons to use CTranslate2:

faster float16 generation

In my case, it outperforms vLLM (#478) by a factor of roughly 1.8x and Transformers by roughly 1.9x.
This was measured for gpt_bigcode_starcoder with an input length of 10, a batch_size of 2, and max_new_tokens=32.

even faster int8 quantization

  • int8 quantization in most cases outperforms float16 inference, with minimal quality loss; a conversion sketch is shown below.
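
Not from the original comment, but for reference: quantized conversion is a short call to CTranslate2's Transformers converter. A minimal sketch, assuming the santacoder checkpoint used later in this thread; the output directory name is a placeholder:

```python
# Sketch: convert and int8-quantize a Transformers checkpoint for CTranslate2.
# Assumes `pip install ctranslate2 transformers`; "santacoder-ct2-int8" is an illustrative path.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("bigcode/gpt_bigcode-santacoder")
converter.convert(
    output_dir="santacoder-ct2-int8",
    quantization="int8_float16",  # or "int8" for CPU targets
)
```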

Wide range of model support

This could be a new "best effort" option - https://opennmt.net/CTranslate2/guides/transformers.html

Stability

Compared to other frameworks, CTranslate2 ships with cross-platform unit tests.
It is the backend of LibreTranslate: https://github.com/LibreTranslate/LibreTranslate

Streaming Support

https://opennmt.net/CTranslate2/generation.html?highlight=stream
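
A minimal streaming sketch along the lines of that guide (not from the original comment; it assumes a locally converted model directory, here named "santacoder-ct2-int8", plus the original HF tokenizer):

```python
# Sketch: token-by-token streaming with CTranslate2's generate_tokens.
# Assumes the "santacoder-ct2-int8" directory produced by the converter sketch above.
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")
generator = ctranslate2.Generator("santacoder-ct2-int8", device="cuda", compute_type="int8_float16")

prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("def hello_world"))
for step in generator.generate_tokens(prompt_tokens, max_length=32):
    # Each step result carries one newly generated token; emit it as soon as it is available.
    print(tokenizer.convert_tokens_to_string([step.token]), end="", flush=True)
```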

Your contribution

Here is an example script comparing the three frameworks, using ray[serve] as a replacement for TGI's dynamic batching.

With the code below, my timings are:
Input: "def hello_world" -> 32 new tokens
Median time over 10 runs:

  • Transformers : 287.6ms
  • VLLM : 278.8ms
  • CTranslate2 : 193.5ms

```python
from fastapi import FastAPI
from ray import serve, get_gpu_ids
import time
from typing import List, Dict
from pydantic import BaseModel
from enum import Enum

app = FastAPI(docs_url="/docs")  # let's not do any authentication for now

class HFAPIRequest(BaseModel):
    inputs: str
    parameters: dict = {}

class Framework(Enum):
    ctranslate2 = 1
    transformers = 2
    vllm = 3


@serve.deployment
@serve.ingress(app)
class FastAPIWrapper:
    def __init__(self, model_handle) -> None:
        self.model_handle = model_handle
    
    @app.get("/v1/heartbeat")
    async def heartbeat(self) -> float:
        """check that the application is running"""
        return time.time()
        
    @app.post("/v1/complete")
    async def complete(self, data: HFAPIRequest) -> Dict[str,str]:
        """send json-data per http request. returns a completion"""
        try:
            d_dict=data.dict()
            # send the call, get ray_ref as receipt
            ray_ref = await self.model_handle.predict.remote(inputs=d_dict["inputs"])
            # wait until ray_ref finished
            completion = await ray_ref
            return {"generated_text": completion}
        except Exception as ex:
            return  {"error": str(ex)}

@serve.deployment(ray_actor_options={"num_gpus": 1})
class CodeCompletionModel:
    def __init__(self, model_name = "michaelfeil/ct2fast-gpt_bigcode-santacoder", framework: Framework = Framework.ctranslate2 ) -> None:
        self.framework = framework
        gpus = bool(get_gpu_ids())
        
        if self.framework == Framework.ctranslate2:
            # 1784MiB VRAM, 314.3ms for generation
            from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
            
            self.model = GeneratorCT2fromHfHub(
                    # use a pre-quantized model here
                    model_name_or_path=model_name, 
                    device="cuda" if gpus else "cpu",
                    compute_type="int8_float16" if gpus else "int8",
            )
        elif self.framework == Framework.transformers:
            # 2918MiB VRAM, ~1146.8ms for generation
            from transformers import pipeline
            import torch
            self.model = pipeline(
                task="text-generation",
                model="bigcode/gpt_bigcode-santacoder", 
                device="cuda:0" if gpus else "cpu",
                torch_dtype=torch.float16
            )
        elif self.framework == Framework.vllm:
            from vllm import LLM, SamplingParams
            self.sampling_params = SamplingParams(max_tokens=32)
            self.model  = LLM(model="bigcode/gpt_bigcode-santacoder")
        else:
            raise ValueError(
                f"{self.framework} is not a supported framework."
                f"Please use one of the following: {list(Framework)}"
            )
    
    
    @serve.batch(max_batch_size=4)
    async def predict(self, inputs: List[str]) -> List[str]:
        if self.framework == Framework.ctranslate2:
            return self.model.generate(
                text=inputs,
                max_length=32, 
                include_prompt_in_result=False
            )
        elif self.framework == Framework.transformers:
            out = self.model(inputs, max_new_tokens=32, return_full_text=True)
            return [o[0]["generated_text"] for o in out]
        elif self.framework == Framework.vllm:
            outputs = self.model.generate(inputs, self.sampling_params)
            return [o.outputs[0].text for o in outputs]
    
deploy_handle = FastAPIWrapper.bind(model_handle=CodeCompletionModel.bind(framework=Framework.ctranslate2))

if __name__ == "__main__":
    serve.run(deploy_handle, port=5000)
```
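
For completeness, a minimal client sketch against the /v1/complete route defined above (not part of the original comment; it assumes the app is served locally on port 5000, as in the serve.run call):

```python
# Sketch: query the deployed completion endpoint.
# Assumes the Ray Serve application above is running locally on port 5000.
import requests

resp = requests.post(
    "http://localhost:5000/v1/complete",
    json={"inputs": "def hello_world", "parameters": {}},
)
print(resp.json())  # expected: {"generated_text": "..."}
```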

Atry (Contributor) commented Jun 28, 2023

A batch size of 2 seems very small.

michaelfeil (Author) commented:

Feel free to try your own example.

```python
if __name__ == "__main__":
    import timeit
    for framework in Framework:
        start = timeit.default_timer()        
        model = CodeCompletionModel(framework=framework)
        end = timeit.default_timer()     
        model.predict_batch(["warmup"])
        inp = [("def " * i) for i in range(1, 16)]  # 15 prompts with an increasing number of tokens
        t = timeit.timeit("model.predict_batch(inp)", globals=locals(), number=10) / 10
        print("framework: ", framework, " mean time per batch: ", t, "s, model load time was ", end -start, "s")

e.g. with batch size 64 and increasing prompt length:

```
framework:  Framework.vllm  mean time per batch:  0.9124439172912389 s, model load time was  6.033201194833964 s
framework:  Framework.ctranslate2  mean time per batch:  0.5082830318249763 s, model load time was  7.9577884785830975 s
```

aliswel-mt commented Jul 3, 2023

@michaelfeil Hi, I was working on a ctranslate2 version of wizardcoder, and the latency with max_new_token=128 is a lot higher (about 4s) compared to yours. May I ask what machine you are using, or did you modify anything to make it faster?

In my experience, the mean time of your example should grow roughly 4x to about 2s with max_new_token=128, which is still 2x faster than my results.

michaelfeil (Author) commented Jul 3, 2023

> @michaelfeil Hi, I was working on a ctranslate2 version of wizardcoder, and the latency with max_new_token=128 is a lot higher (about 4s) compared to yours. May I ask what machine you are using, or did you modify anything to make it faster?
>
> In my experience, the mean time of your example should grow roughly 4x to about 2s with max_new_token=128, which is still 2x faster than my results.

Yung-Kai: The above uses santacoder (1.1B params) with 32 tokens. What's the relative speed difference across all frameworks?

aliswel-mt commented Jul 4, 2023

@michaelfeil Thanks for your reply. I got around 16s on HF, and 5s for 4-bit quantization with AutoGPTQ.
I haven't tried vLLM due to OOM in my case (I enlarged max_num_seq).
I guess starcoder should take longer than santacoder, but I'm not sure how long it would be.

github-actions bot commented:

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 15, 2024
github-actions bot closed this as not planned on May 21, 2024