OpenAI API Wrapper #735

Closed
Tracked by #390
Ichigo3766 opened this issue Jul 28, 2023 · 35 comments

Comments

@Ichigo3766

Feature request

Hi,

I was wondering if it would be possible to have an OpenAI-compatible API.

Motivation

Many projects have been built around the OpenAI API, similar to what vLLM and a few other inference servers provide. If TGI had this, we could just swap the base URL in projects such as aider and many more and use them without the hassle of changing their code.

https://github.com/paul-gauthier/aider
https://github.com/AntonOsika/gpt-engineer
https://github.com/Significant-Gravitas/Auto-GPT

And many more.

For reference, vLLM has a wrapper, and text-generation-webui has one too.

Your contribution

discuss.

@philschmid
Member

Hello @bloodsucker99, I am not sure that's possible on the server side, since models have different prompts.
So it might make sense to implement this on the client side, converting the OpenAI schema (a list of dicts) into a single prompt.
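
For illustration, such a client-side conversion boils down to something like the sketch below (model-agnostic and hypothetical; the actual role markers and separators depend entirely on how the target model was fine-tuned):

# Hypothetical helper: flatten an OpenAI-style message list into a single
# prompt string that could then be sent to TGI's /generate endpoint.
# The "role: content" layout is only a placeholder; a real client must use
# the exact template the target model was trained with.
def messages_to_prompt(messages: list[dict]) -> str:
    parts = [f"{m['role']}: {m['content']}" for m in messages]
    parts.append("assistant:")  # cue the model to answer
    return "\n".join(parts)

prompt = messages_to_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sun?"},
])
print(prompt)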

@Ichigo3766
Author

Yeah, I kind of suspected that doing it on the server side would not be possible :(

Any chance you/anyone would be interested in building a middleman for this? A Python wrapper that just sits in the middle would be cool.

@philschmid
Member

I had some time and started working on something. I will share the first version here. I would love to get feedback if you are willing to try it out.

@Ichigo3766
Author

I'd love to try it out. Also, would it be possible to communicate over Discord? It would make things much easier :)

@philschmid
Member

Okay, I rushed out the first version. It is in a package I started called easyllm.

Github: https://github.com/philschmid/easyllm
Documentation: https://philschmid.github.io/easyllm/

The documentation also includes examples for streaming

Example

Install EasyLLM via pip:

pip install easyllm

Then import and start using the clients:

from easyllm.clients import huggingface
from easyllm.prompt_utils import build_llama2_prompt

# helper to build llama2 prompt
huggingface.prompt_builder = build_llama2_prompt

response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
        {"role": "user", "content": "What is the sun?"},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=256,
)

print(response)

@Ichigo3766
Author

Ichigo3766 commented Jul 29, 2023

This is interesting. Could you give me an example of connecting this to the TGI API? There is a model space, but that would load the model again, right? So if I am instead using TGI, which already has the model loaded, how would I use its API here and get an OpenAI API out?

@philschmid
Member

No, it's a client. How would you add a wrapper when you don't know the prompt format on the server side?
It might be possible to write a different server.rs which implements common templating, where you could define what you want when starting it, but that's a lot of work.

@Ichigo3766
Author

Hi! I am a bit confused about what you mean by "you don't know the prompt format on the server side". So, there is a wrapper made by langchain:
https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/huggingface_text_gen_inference.py

I was thinking of something along those lines, but for the OpenAI API, if that makes sense.

@Narsil
Collaborator

Narsil commented Jul 31, 2023

"you don't know the prompt format on the server side"

I think what @philschmid meant is: how are you supposed to send a final, fully formed token sequence?
TGI doesn't know how the model was trained/fine-tuned, so it doesn't know what a system prompt or user prompt is. It expects a single full string, which is what the langchain wrapper sends.

So the missing step is going from

messages=[
    {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
    {"role": "user", "content": "What is the sun?"},
],

To:

[[SYS]\nYou are a helpful assistant speaking like a pirate. argh[/SYS] What is the sun <s>

Which is needed for good results with https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, for instance (don't quote me on the prompt, I wrote it from memory).
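
For reference, the canonical Llama 2 chat template wraps each turn in [INST] ... [/INST] and the system prompt in <<SYS>> ... <</SYS>>. A rough single-turn builder looks like the sketch below (easyllm's build_llama2_prompt covers the full multi-turn case):

def build_llama2_single_turn(messages: list[dict]) -> str:
    # Rough sketch for one optional system message plus one user message;
    # multi-turn chats additionally interleave "</s><s>" between exchanges.
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    user = next(m["content"] for m in messages if m["role"] == "user")
    if system:
        return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    return f"<s>[INST] {user} [/INST]"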

@paulcx

paulcx commented Aug 3, 2023

I agree with @Narsil's point. Some people or projects don't use OpenAI-style prompts. Eventually, all messages are merged into a single string for input to the LLM, which limits flexibility. One possible solution is to create an API template on the server side, allowing users to define their preferred API. However, implementing this approach might require a substantial amount of work and could potentially introduce bugs.

I have a question: Why is the TGI API slightly different from the TGI client SDK? For instance, the parameter 'detail' is ignored in the TGI client source code. Shouldn't they be exactly the same?

@Narsil
Collaborator

Narsil commented Aug 3, 2023

Why is the TGI API slightly different from the TGI client SDK?

I'm not sure what you are referring to. There could be something slightly out of sync between the Python client and the server, but that's not intentional.

@Narsil
Collaborator

Narsil commented Aug 3, 2023

One possible solution is to create an API template on the server side

That's definitely an option, and if we were to do it, I would like it alongside guidance and token healing; they seem to serve the same purpose: extending the querying API in a user-defined way (both for the server operator and the actual querying user).

@viniciusarruda

I've implemented a small wrapper around the chat completions for Llama 2.
The easyllm from @philschmid seems good; I've compared it with my Llama 2 implementation and it gives the same result!

@paulcx

paulcx commented Aug 4, 2023

Why is the TGI API slightly different from the TGI client SDK?

I'm not sure what you are referring to. There could be something slightly out of sync between the Python client and the server, but that's not intentional.

Here is what I'm referring to: the request parameters are slightly different from the ones in the API. That's OK, but why is 'detail' manually set to True there?

@jcushman

jcushman commented Sep 8, 2023

In case this is helpful, llama.cpp does this via api_like_OAI.py. This PR would update that script to use fastchat's conversation.py to handle the serialization problem discussed upthread.

@jcushman

jcushman commented Sep 8, 2023

And here is fastchat's own version of this: https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py

@zfang

zfang commented Sep 20, 2023

I would love to have this supported.

@abhinavkulkarni
Contributor

LiteLLM has support for TGI: https://docs.litellm.ai/docs/providers/huggingface#text-generation-interface-tgi---llms

@krrishdholakia

Thanks for mentioning us @abhinavkulkarni

Hey @Narsil @jcushman @zfang
Happy to help here.

This is the basic code:

import os 
from litellm import completion 

# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key" 

messages = [{"role": "user", "content": "There's a llama in my garden 😱 What should I do?"}]

# e.g. call 'WizardLM/WizardCoder-Python-34B-V1.0' hosted on HF Inference Endpoints
response = completion(
    model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0",
    messages=messages,
    api_base="https://my-endpoint.huggingface.cloud",
)

print(response)

We also handle prompt formatting - https://docs.litellm.ai/docs/providers/huggingface#models-with-prompt-formatting based on the lmsys/fastchat implementation.

But you can override this with your own template if necessary - https://docs.litellm.ai/docs/providers/huggingface#custom-prompt-templates
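
For example, the custom-template hook in those docs looks roughly like this (a sketch based on the linked LiteLLM docs; the model id and role markers below are placeholders, so check the current docs for the exact fields):

import litellm

# Register a chat template for a (hypothetical) Hugging Face chat model so
# LiteLLM knows how to flatten OpenAI-style messages into a single prompt.
litellm.register_prompt_template(
    model="my-org/my-chat-model",  # placeholder model id
    roles={
        "system": {"pre_message": "<|system|>\n", "post_message": "\n"},
        "user": {"pre_message": "<|user|>\n", "post_message": "\n"},
        "assistant": {"pre_message": "<|assistant|>\n", "post_message": "\n"},
    },
)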

@zfang

zfang commented Sep 24, 2023

Hi @krrishdholakia,

Thanks for the info. Instead of a client, I actually need a middle service, because I'm trying to host an API server for the Chatbot Arena: https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model

I can use vLLM to host a service and provide an OpenAI-compatible API, but it's quite a bit slower than TGI. It pains me that TGI doesn't support this. I will probably need to hack a FastChat service to redirect calls to TGI.

Regards,

Felix

@krrishdholakia

krrishdholakia commented Sep 24, 2023

@zfang we have an open-source proxy you can fork and run this through - https://github.com/BerriAI/liteLLM-proxy

Would it be helpful if we exposed a CLI command to deploy this?

litellm --deploy

@abhinavkulkarni
Contributor

abhinavkulkarni commented Sep 28, 2023

LiteLLM has developed an OpenAI wrapper for TGI (and for lots of other model-serving frameworks).

Here are more details: https://docs.litellm.ai/docs/proxy_server

You can set it up as follows:

Set up a local TGI endpoint first:

$ text-generation-launcher \
  --model-id abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
  --trust-remote-code --port 8080 \
  --max-input-length 5376 --max-total-tokens 6144 --max-batch-prefill-tokens 6144 \
  --quantize awq

Then I run a LiteLLM proxy server on top of that:

$ litellm \
  --model huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
  --api_base http://localhost:8080

I am able to successfully obtain responses from the openai.ChatCompletion.create endpoint as follows:

>>> import openai
>>> openai.api_key = "xyz"
>>> openai.api_base = "http://0.0.0.0:8000"
>>> model = "huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq"
>>> completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": "How are you?"}])
>>> print(completion)
{
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "I'm fine, thanks. I'm glad to hear that.\n\nI'm",
        "role": "assistant",
        "logprobs": -18.19830319
      }
    }
  ],
  "id": "chatcmpl-7f8f5312-893a-4dab-aff5-3a97a354c2be",
  "created": 1695869575.316254,
  "model": "abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 15,
    "total_tokens": 19
  }
}

@michaelfeil

@zfang @paulcx I implemented this feature in the Apache-2.0 licensed fork of the project, directly in Rust.

https://github.com/Preemo-Inc/text-generation-inference

@krrishdholakia

Hey @michaelfeil - is TGI closed-source now? I can't find other info on this.

@Narsil
Collaborator

Narsil commented Oct 4, 2023

We added a restriction in 1.0 which means you cannot use it as a cloud provider as-is without getting a license from us.
Most likely it doesn't change anything for you.

More details here:
#744

@adrianog

@zfang @paulcx I implemented this feature in the Apache-2.0 licensed fork of the project, directly in Rust.

https://github.com/Preemo-Inc/text-generation-inference

Can I use this to wrap the official Inference API as published by HF?
I can't seem to find an example of how to create models using the HF Inference API from LlamaIndex.

@batindfa

batindfa commented Oct 20, 2023

@abhinavkulkarni Hi, how do I run a LiteLLM proxy server?
I ran litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://0.0.0.0:8080/generate on the Linux command line, but I get bash: litellm: command not found.

@abhinavkulkarni
Contributor

@yanmengxiang1: Please install litellm using pip.

@batindfa

@abhinavkulkarni Yes, I know that. Should I use something like Flask to wrap this TGI?

@abhinavkulkarni
Contributor

Hey @yanmengxiang1:

Run TGI at port 8080. Then run litellm so that it points to TGI:

litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://localhost:8080 --port 8000

You now have an OpenAI-compatible API endpoint at port 8000.

@krrishdholakia

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

@LarsHill

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?

@krrishdholakia

krrishdholakia commented Nov 1, 2023

Hey @LarsHill the LiteLLM community is discussing the best approach right now - BerriAI/litellm#648 (comment)

Some context
We'd initially planned on the Docker container being an easier replacement (consistent environment + easier to deploy), but it might not be ideal. So we're trying to understand what works best (how do you provide a consistent experience + an easy way to set up configs, etc.).

DM'ing you to understand what a good experience here looks like.

@bitsnaps

bitsnaps commented Feb 6, 2024

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?

text-generation-webui has made huge progress on supporting other providers by including extensions; you can serve an OpenAI-compatible API using these commands:

# clone the repo, then cd into it

# install deps:
!pip install -q -r requirements.txt --upgrade
# install the openai extension deps:
!pip install -q -r extensions/openai/requirements.txt --upgrade

# download your model (this way allows you to download large models):
!python download-model.py https://huggingface.co/TheBloke/SauerkrautLM-UNA-SOLAR-Instruct-GPTQ
# this one works better for MemGPT

# serve your model (check the name of the downloaded file/directory):
!python server.py --model TheBloke_SauerkrautLM-UNA-SOLAR-Instruct-GPTQ --n-gpu-layers 24 --n_ctx 2048 --api --nowebui --extensions openai

# or download a specific file (if using GGUF models):
!python download-model.py https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF --specific-file dolphin-2.7-mixtral-8x7b.Q2_K.gguf

Your server should then be up and running on port 5000 (by default):

!curl http://0.0.0.0:5000/v1/completions -H "Content-Type: application/json" -d '{ "prompt": "This is a cake recipe:\n\n1.","max_tokens": 200, "temperature": 1,  "top_p": 0.9, "seed": 10 }'

This approach lets you run any model (even those that aren't available as Ollama Docker images), without hitting Hugging Face's API, including large models (>= 10 GB) and models that don't have an Inference API. Neither litellm nor Ollama is required.

slimsag added a commit to sourcegraph/cody that referenced this issue Feb 20, 2024
Increasingly, LLM software is standardizing around the use of OpenAI-esque
compatible endpoints. Some examples:

* [OpenLLM](https://github.com/bentoml/OpenLLM) (commonly used to self-host/deploy various LLMs in enterprises)
* [Huggingface TGI](huggingface/text-generation-inference#735) (and, by extension, [AWS SageMaker](https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/))
* [Ollama](https://github.com/ollama/ollama) (commonly used for running LLMs locally, useful for local testing)

All of these projects either have OpenAI-compatible API endpoints already,
or are actively building out support for it. On strat we are regularly
working with enterprise customers that self-host their own specific-model
LLM via one of these methods, and wish for Cody to consume an OpenAI
endpoint (understanding some specific model is on the other side and that
Cody should optimize for / target that specific model.)

Since Cody needs to tailor to a specific model (prompt generation, stop
sequences, context limits, timeouts, etc.) and handle other provider-specific
nuances, it is insufficient to simply expect that a customer-provided OpenAI
compatible endpoint is in fact 1:1 compatible with e.g. GPT-3.5 or GPT-4.
We need to be able to configure/tune many of these aspects to the specific
provider/model, even though it presents as an OpenAI endpoint.

In response to these needs, I am working on adding an 'OpenAI-compatible'
provider proper: the ability for a Sourcegraph enterprise instance to
advertise that although it is connected to an OpenAI compatible endpoint,
there is in fact a specific model on the other side (starting with Starchat
and Starcoder) and that Cody should target that configuration. The _first
step_ of this work is this change.

After this change, an existing (current-version) Sourcegraph enterprise
instance can configure an OpenAI endpoint for completions via the site
config such as:

```
  "cody.enabled": true,
  "completions": {
    "provider": "openai",
    "accessToken": "asdf",
    "endpoint": "http://openllm.foobar.com:3000",
    "completionModel": "gpt-4",
    "chatModel": "gpt-4",
    "fastChatModel": "gpt-4",
  },
```

The `gpt-4` model parameters will be sent to the OpenAI-compatible endpoint
specified, but will otherwise be unused today. Users may then specify in
their VS Code configuration that Cody should treat the LLM on the other
side as if it were e.g. Starchat:

```
    "cody.autocomplete.advanced.provider": "experimental-openaicompatible",
    "cody.autocomplete.advanced.model": "starchat-16b-beta",
    "cody.autocomplete.advanced.timeout.multiline": 10000,
    "cody.autocomplete.advanced.timeout.singleline": 10000,
```

In the future, we will make it possible to configure the above options
via the Sourcegraph site configuration instead of each user needing to
configure it in their VS Code settings explicitly.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
@drbh
Collaborator

drbh commented Mar 6, 2024

This functionality is now supported in TGI with the introduction of the Messages API and can be used like this:

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)

Please see the docs here for more details https://huggingface.co/docs/text-generation-inference/messages_api
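
For completeness, streaming works through the same client by passing stream=True; a minimal sketch reusing the setup above:

from openai import OpenAI

# point the standard OpenAI client at the local TGI server, as above
client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")

stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True,
)

for chunk in stream:
    # each streamed chunk carries a delta; content may be None on some chunks
    token = chunk.choices[0].delta.content
    if token:
        print(token, end="", flush=True)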

@drbh drbh closed this as completed Mar 6, 2024
slimsag added a commit to sourcegraph/cody that referenced this issue Mar 26, 2024