Adding support for CTranslate2 acceleration #496

Closed
michaelfeil opened this issue Jun 26, 2023 · 6 comments

michaelfeil commented Jun 26, 2023

Feature request

The feature would be to support accelerated inference with the CTranslate2 framework
https://github.com/OpenNMT/CTranslate2

Motivation

Reasons to use CTranslate2:

faster float16 generation

In my case, it outperforms vLLM (#478) by a factor of roughly 1.8x and Transformers by roughly 1.9x.
This was measured for gpt_bigcode_starcoder with an input length of 10, a batch_size of 2, and max_new_tokens=32.

even faster int8 quantization

  • int8 quantization in most cases outperforms float16 inference, with minimal quality loss; a conversion sketch is shown below.
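
Not from the original comment, but for reference: quantized conversion is a short call to CTranslate2's Transformers converter. A minimal sketch, assuming the santacoder checkpoint used later in this thread; the output directory name is a placeholder:

```python
# Sketch: convert and int8-quantize a Transformers checkpoint for CTranslate2.
# Assumes `pip install ctranslate2 transformers`; "santacoder-ct2-int8" is an illustrative path.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("bigcode/gpt_bigcode-santacoder")
converter.convert(
    output_dir="santacoder-ct2-int8",
    quantization="int8_float16",  # or "int8" for CPU targets
)
```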

Wide range of model support

This could be a new "best effort" option - https://opennmt.net/CTranslate2/guides/transformers.html

Stability

Compared to other frameworks, CTranslate2 ships with cross-platform unit tests.
It is the backend of LibreTranslate: https://github.com/LibreTranslate/LibreTranslate

Streaming Support

https://opennmt.net/CTranslate2/generation.html?highlight=stream
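
A minimal streaming sketch along the lines of that guide (not from the original comment; it assumes a locally converted model directory, here named "santacoder-ct2-int8", plus the original HF tokenizer):

```python
# Sketch: token-by-token streaming with CTranslate2's generate_tokens.
# Assumes the "santacoder-ct2-int8" directory produced by the converter sketch above.
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")
generator = ctranslate2.Generator("santacoder-ct2-int8", device="cuda", compute_type="int8_float16")

prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("def hello_world"))
for step in generator.generate_tokens(prompt_tokens, max_length=32):
    # Each step result carries one newly generated token; emit it as soon as it is available.
    print(tokenizer.convert_tokens_to_string([step.token]), end="", flush=True)
```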

Your contribution

Here is an example script comparing the three frameworks, using ray[serve] as a replacement for TGI's dynamic batching.

With the code below, my timings are:
Input: "def hello_world" -> 32 new tokens
Median time over 10 runs:

  • Transformers : 287.6ms
  • VLLM : 278.8ms
  • CTranslate2 : 193.5ms

```python
from fastapi import FastAPI
from ray import serve, get_gpu_ids
import time
from typing import List, Dict
from pydantic import BaseModel
from enum import Enum

app = FastAPI(docs_url="/docs")  # let's not do any authentication for now

class HFAPIRequest(BaseModel):
    inputs: str
    parameters: dict = {}

class Framework(Enum):
    ctranslate2 = 1
    transformers = 2
    vllm = 3


@serve.deployment
@serve.ingress(app)
class FastAPIWrapper:
    def __init__(self, model_handle) -> None:
        self.model_handle = model_handle
    
    @app.get("/v1/heartbeat")
    async def heartbeat(self) -> float:
        """check that the application is running"""
        return time.time()
        
    @app.post("/v1/complete")
    async def complete(self, data: HFAPIRequest) -> Dict[str,str]:
        """send json-data per http request. returns a completion"""
        try:
            d_dict=data.dict()
            # send the call, get ray_ref as receipt
            ray_ref = await self.model_handle.predict.remote(inputs=d_dict["inputs"])
            # wait until ray_ref finished
            completion = await ray_ref
            return {"generated_text": completion}
        except Exception as ex:
            return  {"error": str(ex)}

@serve.deployment(ray_actor_options={"num_gpus": 1})
class CodeCompletionModel:
    def __init__(self, model_name = "michaelfeil/ct2fast-gpt_bigcode-santacoder", framework: Framework = Framework.ctranslate2 ) -> None:
        self.framework = framework
        gpus = bool(get_gpu_ids())
        
        if self.framework == Framework.ctranslate2:
            # 1784MiB VRAM, 314.3ms for generation
            from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
            
            self.model = GeneratorCT2fromHfHub(
                    # use a pre-quantized model here
                    model_name_or_path=model_name, 
                    device="cuda" if gpus else "cpu",
                    compute_type="int8_float16" if gpus else "int8",
            )
        elif self.framework == Framework.transformers:
            # 2918MiB VRAM, ~1146.8ms for generation
            from transformers import pipeline
            import torch
            self.model = pipeline(
                task="text-generation",
                model="bigcode/gpt_bigcode-santacoder", 
                device="cuda:0" if gpus else "cpu",
                torch_dtype=torch.float16
            )
        elif self.framework == Framework.vllm:
            from vllm import LLM, SamplingParams
            self.sampling_params = SamplingParams(max_tokens=32)
            self.model  = LLM(model="bigcode/gpt_bigcode-santacoder")
        else:
            raise ValueError(
                f"{self.framework} is not a supported framework."
                f"Please use one of the following: {list(Framework)}"
            )
    
    
    @serve.batch(max_batch_size=4)
    async def predict(self, inputs: List[str]) -> List[str]:
        if self.framework == Framework.ctranslate2:
            return self.model.generate(
                text=inputs,
                max_length=32, 
                include_prompt_in_result=False
            )
        elif self.framework == Framework.transformers:
            out = self.model(inputs, max_new_tokens=32, return_full_text=True)
            return [o[0]["generated_text"] for o in out]
        elif self.framework == Framework.vllm:
            outputs = self.model.generate(inputs, self.sampling_params)
            return [o.outputs[0].text for o in outputs]
    
deploy_handle = FastAPIWrapper.bind(model_handle=CodeCompletionModel.bind(framework=Framework.ctranslate2))

if __name__ == "__main__":
    serve.run(deploy_handle, port=5000)
```
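
For completeness, a minimal client sketch against the /v1/complete route defined above (not part of the original comment; it assumes the app is served locally on port 5000, as in the serve.run call):

```python
# Sketch: query the deployed completion endpoint.
# Assumes the Ray Serve application above is running locally on port 5000.
import requests

resp = requests.post(
    "http://localhost:5000/v1/complete",
    json={"inputs": "def hello_world", "parameters": {}},
)
print(resp.json())  # expected: {"generated_text": "..."}
```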

Atry (Contributor) commented Jun 28, 2023

A batch size of 2 seems very small.

michaelfeil (Author) commented:

Feel free to try your own example.

```python
if __name__ == "__main__":
    import timeit
    for framework in Framework:
        start = timeit.default_timer()        
        model = CodeCompletionModel(framework=framework)
        end = timeit.default_timer()     
        model.predict_batch(["warmup"])
        inp = [("def " * i) for i in range(1, 16)]  # 15 prompts with an increasing number of tokens
        t = timeit.timeit("model.predict_batch(inp)", globals=locals(), number=10) / 10
        print("framework: ", framework, " mean time per batch: ", t, "s, model load time was ", end -start, "s")

e.g. with batch size 64 and increasing prompt length:

```
framework:  Framework.vllm  mean time per batch:  0.9124439172912389 s, model load time was  6.033201194833964 s
framework:  Framework.ctranslate2  mean time per batch:  0.5082830318249763 s, model load time was  7.9577884785830975 s
```

aliswel-mt commented Jul 3, 2023

@michaelfeil Hi, I was working on a ctranslate2 version of wizardcoder, and the latency with max_new_token=128 is a lot higher (about 4s) compared to yours. May I ask what machine you are using, or did you modify anything to make it faster?

In my experience, the mean time of your example should grow roughly 4x to about 2s with max_new_token=128, which is still 2x faster than my results.

michaelfeil (Author) commented Jul 3, 2023

> @michaelfeil Hi, I was working on a ctranslate2 version of wizardcoder, and the latency with max_new_token=128 is a lot higher (about 4s) compared to yours. May I ask what machine you are using, or did you modify anything to make it faster?
>
> In my experience, the mean time of your example should grow roughly 4x to about 2s with max_new_token=128, which is still 2x faster than my results.

Yung-Kai: The above uses santacoder (1.1B params) with 32 tokens. What's the relative speed difference across all frameworks?

aliswel-mt commented Jul 4, 2023

@michaelfeil Thanks for your reply. I got around 16s on HF, and 5s for 4-bit quantization with AutoGPTQ.
I haven't tried vLLM due to OOM in my case (I enlarged max_num_seq).
I guess starcoder should take longer than santacoder, but I'm not sure how long it would be.

github-actions bot commented:

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 15, 2024
github-actions bot closed this as not planned on May 21, 2024