
update vllm patch to v0.2.6 and add regex support for /generate #481

Merged (3 commits) on Dec 29, 2023

Conversation

@tscholak (Contributor) commented Dec 24, 2023

closes #478

Here's a Dockerfile that I used for testing:

# syntax=docker/dockerfile:1.2

# Use the vllm/vllm-openai image with the specified tag
ARG BASE_TAG=latest
FROM vllm/vllm-openai:${BASE_TAG}

# Install git and openssh-client, then clean up to keep the image small
RUN apt-get update && \
    apt-get install -y git openssh-client python3.10-venv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Clone the repository, checkout the specific commit, build the source distribution, install it, and clean up
ARG COMMIT
RUN --mount=type=secret,id=ssh_key \
    mkdir -p -m 0700 ~/.ssh && \
    ssh-keyscan github.com >> ~/.ssh/known_hosts && \
    cp /run/secrets/ssh_key ~/.ssh/id_rsa && \
    chmod 600 ~/.ssh/id_rsa && \
    git clone git@github.com:tscholak/outlines.git && \
    cd outlines && \
    git checkout ${COMMIT} && \
    python3 -m pip install build && \
    python3 -m build && \
    pip install dist/outlines-*.whl && \
    cd .. && \
    rm -rf outlines ~/.ssh /root/.cache/pip

# Install ray with pydantic-2.5 support and clean up
RUN pip install ray==2.9.0 && \
    rm -rf /root/.cache/pip

and the accompanying Makefile:

TAG=v0.2.6
COMMIT=c4e0b5ae2e97ec0377c726aa65a2207c939377de

# Build the Docker image with a configurable tag
build:
	DOCKER_BUILDKIT=1 \
	docker build \
	--secret id=ssh_key,src=$(SSH_KEY) \
	--build-arg BASE_TAG=$(TAG) \
	--build-arg COMMIT=$(COMMIT) \
	-t $(FULL_IMAGE_NAME) \
	.
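
With this setup the image can be built with something like `make build SSH_KEY=$HOME/.ssh/id_rsa FULL_IMAGE_NAME=vllm-outlines:test`. Note that `SSH_KEY` and `FULL_IMAGE_NAME` are assumed to be supplied on the make command line or in the environment; they are not defined in the Makefile above, and the key only needs to be able to clone the fork.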

@tscholak (Contributor, Author) commented Dec 24, 2023

Problem with `nest_asyncio` and uvloop:
/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py:474: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected set' found for argument 'fsm_finals' of function '_walk_fsm'.

For more information visit https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 226:
@numba.njit(nogil=True, cache=True)
def _walk_fsm(
^

  state_seq = _walk_fsm(
/usr/local/lib/python3.10/dist-packages/numba/core/ir_utils.py:2172: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected set' found for argument 'fsm_finals' of function 'state_scan_tokens'.

For more information visit https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 462:
@numba.njit(cache=True, nogil=True)
def state_scan_tokens(
^

  warnings.warn(NumbaPendingDeprecationWarning(msg, loc=loc))
INFO 12-24 15:45:08 async_llm_engine.py:379] Received request 891d2ad827604ec28c56f22d882ae045: prompt: 'What is the capital of France?', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7fe7969551b0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7fe786667520>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7fe7969551b0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7fe786667520>)>
Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/workspace/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 338, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
  File "/workspace/vllm/engine/async_llm_engine.py", line 219, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/usr/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
INFO 12-24 15:45:08 async_llm_engine.py:134] Aborted request 891d2ad827604ec28c56f22d882ae045.
  File "/usr/lib/python3.10/asyncio/futures.py", line 285, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
ray.exceptions.RayTaskError(RaySystemError): ray::RayWorkerVllm.execute_method() (pid=762, ip=10.210.140.102, actor_id=81008059c117df77dfd1d76001000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fb61c7604f0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: Can't patch loop of type <class 'uvloop.Loop'>
traceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/outlines/__init__.py", line 3, in <module>
INFO:     10.210.93.238:40022 - "POST /generate HTTP/1.1" 500 Internal Server Error
    import outlines.models
  File "/usr/local/lib/python3.10/dist-packages/outlines/models/__init__.py", line 12, in <module>
    from .openai import OpenAI, openai
  File "/usr/local/lib/python3.10/dist-packages/outlines/models/openai.py", line 11, in <module>
    from outlines.base import vectorize
  File "/usr/local/lib/python3.10/dist-packages/outlines/base.py", line 16, in <module>
    nest_asyncio.apply()
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 19, in apply
    loop = loop or asyncio.get_event_loop()
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 41, in _get_event_loop
    loop = events.get_event_loop_policy().get_event_loop()
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 68, in get_event_loop
    _patch_loop(loop)
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 184, in _patch_loop
    raise ValueError('Can\'t patch loop of type %s' % type(loop))
ValueError: Can't patch loop of type <class 'uvloop.Loop'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/workspace/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/workspace/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/workspace/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/workspace/vllm/engine/async_llm_engine.py", line 359, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/workspace/vllm/engine/async_llm_engine.py", line 338, in engine_step
    request_outputs = await self.engine.step_async()
  File "/workspace/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
  File "/workspace/vllm/engine/async_llm_engine.py", line 219, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/usr/lib/python3.10/asyncio/tasks.py", line 234, in __step
    result = coro.throw(exc)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/usr/lib/python3.10/asyncio/futures.py", line 285, in __await__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
    future.result()
  File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
ray.exceptions.RayTaskError(RaySystemError): ray::RayWorkerVllm.execute_method() (pid=762, ip=10.210.140.102, actor_id=81008059c117df77dfd1d76001000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fb61c7604f0>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RaySystemError: System error: Can't patch loop of type <class 'uvloop.Loop'>
traceback: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/outlines/__init__.py", line 3, in <module>
    import outlines.models
  File "/usr/local/lib/python3.10/dist-packages/outlines/models/__init__.py", line 12, in <module>
    from .openai import OpenAI, openai
  File "/usr/local/lib/python3.10/dist-packages/outlines/models/openai.py", line 11, in <module>
    from outlines.base import vectorize
  File "/usr/local/lib/python3.10/dist-packages/outlines/base.py", line 16, in <module>
    nest_asyncio.apply()
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 19, in apply
    loop = loop or asyncio.get_event_loop()
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 41, in _get_event_loop
    loop = events.get_event_loop_policy().get_event_loop()
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 68, in get_event_loop
    _patch_loop(loop)
  File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 184, in _patch_loop
    raise ValueError('Can\'t patch loop of type %s' % type(loop))
ValueError: Can't patch loop of type <class 'uvloop.Loop'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/outlines/serve/serve.py", line 84, in generate
    async for request_output in results_generator:
  File "/workspace/vllm/engine/async_llm_engine.py", line 445, in generate
    raise e
  File "/workspace/vllm/engine/async_llm_engine.py", line 439, in generate
    async for request_output in stream:
  File "/workspace/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/workspace/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/workspace/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
(RayWorkerVllm pid=762) Can't patch loop of type <class 'uvloop.Loop'>
(RayWorkerVllm pid=762) Traceback (most recent call last):
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 404, in deserialize_objects
(RayWorkerVllm pid=762)     obj = self._deserialize_object(data, metadata, object_ref)
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 270, in _deserialize_object
(RayWorkerVllm pid=762)     return self._deserialize_msgpack_data(data, metadata_fields)
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 225, in _deserialize_msgpack_data
(RayWorkerVllm pid=762)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/serialization.py", line 215, in _deserialize_pickle5_data
(RayWorkerVllm pid=762)     obj = pickle.loads(in_band)
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/outlines/__init__.py", line 3, in <module>
(RayWorkerVllm pid=762)     import outlines.models
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/outlines/models/__init__.py", line 12, in <module>
(RayWorkerVllm pid=762)     from .openai import OpenAI, openai
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/outlines/models/openai.py", line 11, in <module>
(RayWorkerVllm pid=762)     from outlines.base import vectorize
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/outlines/base.py", line 16, in <module>
(RayWorkerVllm pid=762)     nest_asyncio.apply()
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 19, in apply
(RayWorkerVllm pid=762)     loop = loop or asyncio.get_event_loop()
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 41, in _get_event_loop
(RayWorkerVllm pid=762)     loop = events.get_event_loop_policy().get_event_loop()
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 68, in get_event_loop
(RayWorkerVllm pid=762)     _patch_loop(loop)
(RayWorkerVllm pid=762)   File "/usr/local/lib/python3.10/dist-packages/nest_asyncio.py", line 184, in _patch_loop
(RayWorkerVllm pid=762)     raise ValueError('Can\'t patch loop of type %s' % type(loop))
(RayWorkerVllm pid=762) ValueError: Can't patch loop of type <class 'uvloop.Loop'>
(RayWorkerVllm pid=762) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
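
The root cause visible in the traceback is that importing outlines runs nest_asyncio.apply(), and nest_asyncio can only patch the pure-Python asyncio event loop, not the uvloop loop used by the vLLM API server and its Ray workers. A minimal sketch that reproduces just that failure outside of vLLM (assuming nest_asyncio and uvloop are installed):

import asyncio

import nest_asyncio
import uvloop

# Use uvloop's event loop, as the vLLM API server does.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

# nest_asyncio refuses to patch non-asyncio loops, so this raises
# ValueError: Can't patch loop of type <class 'uvloop.Loop'>
nest_asyncio.apply()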

@tscholak (Contributor, Author) commented:

Some success:

INFO 12-24 21:56:26 llm_engine.py:73] Initializing an LLM engine with config: model='/mnt/llmd/base_models/Mistral-7B-Instruct-v0.2', tokenizer='/mnt/llmd/base_models/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
INFO 12-24 21:57:40 llm_engine.py:223] # GPU blocks: 25213, # CPU blocks: 2048
INFO 12-24 21:57:42 model_runner.py:394] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-24 21:57:48 model_runner.py:437] Graph capturing finished in 6 secs.
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py:474: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected set' found for argument 'fsm_finals' of function '_walk_fsm'.

For more information visit https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 226:
@numba.njit(nogil=True, cache=True)
def _walk_fsm(
^

  state_seq = _walk_fsm(
/usr/local/lib/python3.10/dist-packages/numba/core/ir_utils.py:2172: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected set' found for argument 'fsm_finals' of function 'state_scan_tokens'.

For more information visit https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 462:
@numba.njit(cache=True, nogil=True)
def state_scan_tokens(
^

  warnings.warn(NumbaPendingDeprecationWarning(msg, loc=loc))
INFO 12-24 21:58:38 async_llm_engine.py:379] Received request c96271e42d38463b91549220e3c26e81: prompt: 'What is the capital of France?', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.
INFO 12-24 21:58:38 llm_engine.py:653] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 12-24 21:58:38 async_llm_engine.py:111] Finished request c96271e42d38463b91549220e3c26e81.
INFO:     10.210.93.238:40684 - "POST /generate HTTP/1.1" 200 OK

At least it doesn't crash anymore.
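
For reference, the kind of request used in this test can be sent with a few lines of Python (a sketch, assuming the server listens on localhost:8000 and that /generate accepts an optional regex field alongside prompt, which is what this PR adds):

import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "What is the capital of France?",
        # Constrain the completion with a regular expression (new in this PR).
        "regex": "(Paris|London|Berlin)",
        "max_tokens": 16,
    },
)
print(response.json())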

@tscholak changed the title from "update vllm patch to v0.2.6" to "update vllm patch to v0.2.6 and add regex support for /generate" on Dec 26, 2023
@tscholak (Contributor, Author) commented:

Everything works.

@tscholak mentioned this pull request on Dec 26, 2023
@tscholak requested a review from @rlouf on December 26, 2023 at 18:10
@rlouf (Member) left a comment:
Should we default to an empty list of logits processors in case neither a schema nor a regex string is provided?

We also need to update the documentation to reflect the changes before merging.

Besides this, everything looks good.
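
A minimal sketch of that fallback, as an illustration only (the import paths and handler variable names are assumptions based on the serve module's conventions, not the actual diff):

# Hypothetical fragment of the /generate handler: fall back to unconstrained
# generation when neither a JSON schema nor a regex is supplied.
from vllm import SamplingParams

from outlines.serve.vllm import JSONLogitsProcessor, RegexLogitsProcessor

json_schema = request_dict.pop("schema", None)
regex_string = request_dict.pop("regex", None)

if json_schema is not None:
    logits_processors = [JSONLogitsProcessor(json_schema, engine.engine)]
elif regex_string is not None:
    logits_processors = [RegexLogitsProcessor(regex_string, engine.engine)]
else:
    logits_processors = []

# logits_processors is accepted here thanks to the vLLM patch this PR updates.
sampling_params = SamplingParams(**request_dict, logits_processors=logits_processors)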

@tscholak marked this pull request as ready for review on December 26, 2023 at 20:14
@tscholak (Contributor, Author) commented:

@rlouf thanks for the review! I made some changes based on your feedback.

@rlouf (Member) commented Dec 29, 2023

Great contribution, thank you!

@rlouf merged commit 5f6166a into outlines-dev:main on Dec 29, 2023 (5 checks passed).
Linked issue: Support regexs in vLLM