Make the vllm-openai Docker container compatible with HuggingFace Inference Endpoints. Specifically, recent vLLM versions support vision-language models such as Phi-3-vision that Text Generation Inference (TGI) does not yet support, so this repo is useful for deploying VLMs that TGI cannot yet serve.
This repo was heavily inspired by https://github.com/philschmid/vllm-huggingface, but is simpler because it does not fork from vllm.
- Install dependencies with `poetry install`. If using `poetry` as your environment manager, run `poetry shell` to activate your environment.
- Add a `.env` file in the root directory with `HF_TOKEN` defined as a read/write token from HuggingFace. See `.env.example` for how to format it (an illustrative `.env` is also sketched after this list).
- View/edit the details in `examples/deploy.py`. It is set up to deploy a HuggingFace Inference Endpoint for the Phi-3-vision model. Once you have set the necessary variables, run `python examples/deploy.py` (a minimal deployment sketch is shown after this list).
- Go to the link printed by the `deploy.py` script to watch the endpoint deployment status and to retrieve the inference base URL once deployment finishes.
- Copy the Endpoint URL from the previous step and add the env variable `HF_ENDPOINT_URL` with this copied value. Again, see `.env.example` for how to format it.
- The endpoint you have deployed above is OpenAI API compatible, meaning you can use the OpenAI library, and any other library built on top of it, with your endpoint. For an example of how to call inference against your new endpoint, see `examples/inference.py` (a minimal call is also sketched after this list).
- To run the inference example, run `python examples/inference.py`.
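For reference, a `.env` for this workflow might look like the sketch below. The values are placeholders only; `.env.example` in the repo remains the authoritative template, and `HF_ENDPOINT_URL` is only filled in after the endpoint has been deployed.

```
# Placeholder values only -- see .env.example for the authoritative format.
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
HF_ENDPOINT_URL=https://your-endpoint-id.endpoints.huggingface.cloud
```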
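For orientation, the sketch below shows one way a script like `examples/deploy.py` could create an endpoint that runs the vllm-openai image, using `huggingface_hub.create_inference_endpoint`. The endpoint name, model ID, instance type, region, image tag, and environment variables here are illustrative assumptions, not the repo's exact configuration; `examples/deploy.py` is the source of truth. The sketch also assumes `python-dotenv` is available for loading `.env`.

```python
# Sketch only: examples/deploy.py in this repo is the source of truth.
# Endpoint name, model, instance type, region, image, and env vars below are assumptions.
import os

from dotenv import load_dotenv
from huggingface_hub import create_inference_endpoint

load_dotenv()  # reads HF_TOKEN from the .env file described above

endpoint = create_inference_endpoint(
    "phi-3-vision-vllm",                                # endpoint name (assumption)
    repository="microsoft/Phi-3-vision-128k-instruct",  # model to serve (assumption)
    framework="pytorch",
    task="custom",
    accelerator="gpu",
    vendor="aws",                                       # cloud vendor/region are assumptions
    region="us-east-1",
    type="protected",
    instance_size="x1",                                 # instance sizing is an assumption
    instance_type="nvidia-a10g",
    custom_image={
        # Run a vLLM OpenAI-compatible image instead of the default TGI container.
        "url": "vllm/vllm-openai:latest",               # image tag is an assumption
        "health_route": "/health",
        "env": {
            # Actual variables depend on the image; this one is purely illustrative.
            "MODEL": "microsoft/Phi-3-vision-128k-instruct",
        },
    },
    token=os.environ["HF_TOKEN"],
)
print(endpoint)  # deployment status and the final URL can be watched from the printed endpoint page
```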
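Because the endpoint speaks the OpenAI API served by vLLM, inference can be done by pointing the official `openai` Python client at `HF_ENDPOINT_URL`. The sketch below is illustrative only; `examples/inference.py` is the repo's actual example, and the `/v1` route suffix, model name, image URL, and use of `python-dotenv` are assumptions.

```python
# Sketch only: see examples/inference.py for the repo's actual inference example.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads HF_ENDPOINT_URL and HF_TOKEN from .env

client = OpenAI(
    base_url=os.environ["HF_ENDPOINT_URL"].rstrip("/") + "/v1",  # assumes OpenAI routes live under /v1
    api_key=os.environ["HF_TOKEN"],  # HF token serves as the bearer token for a protected endpoint
)

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # served model name is an assumption
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # placeholder image
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```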