bin /app/venv/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
CUDA SETUP: Loading binary /app/venv/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
{"asctime": "2024-02-01 16:51:03,775", "name": "torch.distributed.nn.jit.instantiator", "levelname": "INFO", "message": "Created a temporary directory at /tmp/tmp7h8azrb7"}
{"asctime": "2024-02-01 16:51:03,775", "name": "torch.distributed.nn.jit.instantiator", "levelname": "INFO", "message": "Writing /tmp/tmp7h8azrb7/_remote_module_non_scriptable.py"}
/app/venv/lib/python3.8/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
{"asctime": "2024-02-01 16:52:11,342", "name": "root", "levelname": "INFO", "message": "PATH_TO_MODEL is set to ==>/mnt/models/"}
{"asctime": "2024-02-01 16:52:11,342", "name": "root", "levelname": "INFO", "message": "Registering model: llama-2-7b-chat-hf"}
{"asctime": "2024-02-01 16:52:11,343", "name": "root", "levelname": "INFO", "message": "Setting max asyncio worker threads as 8"}
{"asctime": "2024-02-01 16:52:11,344", "name": "root", "levelname": "INFO", "message": "Starting uvicorn with 2 workers"}
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:48<00:48, 48.85s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:06<00:00, 30.56s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [01:06<00:00, 33.31s/it]
{"asctime": "2024-02-01 16:52:11,368", "name": "root", "levelname": "INFO", "message": "Starting gRPC server on [::]:8081"}
{"asctime": "2024-02-01 16:52:11,369", "name": "root", "levelname": "ERROR", "message": "uncaught exception", "exc_info": "Traceback (most recent call last):\n File \"serve.py\", line 56, in <module>\n kserve.ModelServer(workers=NUM_WORKERS).start([model])\n File \"/app/venv/lib/python3.8/site-packages/kserve/model_server.py\", line 167, in start\n asyncio.run(servers_task())\n File \"/usr/lib/python3.8/asyncio/runners.py\", line 43, in run\n return loop.run_until_complete(main)\n File \"/usr/lib/python3.8/asyncio/base_events.py\", line 608, in run_until_complete\n return future.result()\n File \"/app/venv/lib/python3.8/site-packages/kserve/model_server.py\", line 165, in servers_task\n await asyncio.gather(*servers)\n File \"/app/venv/lib/python3.8/site-packages/kserve/model_server.py\", line 154, in serve\n multiprocessing.set_start_method('fork')\n File \"/usr/lib/python3.8/multiprocessing/context.py\", line 243, in set_start_method\n raise RuntimeError('context has already been set')\nRuntimeError: context has already been set"}
Exception ignored in: <function Server.__del__ at 0x7f3b818b5ca0>
Traceback (most recent call last):
File "/app/venv/lib/python3.8/site-packages/grpc/aio/_server.py", line 170, in __del__
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 118, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 110, in grpc._cython.cygrpc.schedule_coro_threadsafe
File "/usr/lib/python3.8/asyncio/base_events.py", line 425, in create_task
File "/usr/lib/python3.8/asyncio/base_events.py", line 504, in _check_closed
RuntimeError: Event loop is closed
/kind bug
I'm trying to start a KServe model server with 2 workers like this:
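A minimal sketch of the entrypoint, reconstructed from the traceback in the log (the call kserve.ModelServer(workers=NUM_WORKERS).start([model]) appears there verbatim; the Model subclass, its method bodies, and the model name are illustrative placeholders, not the actual serve.py):

```python
# serve.py -- sketch reconstructed from the traceback; the Model subclass
# and its load/predict bodies are placeholders for illustration only.
import kserve

NUM_WORKERS = 2

class LlamaModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.load()

    def load(self):
        # load tokenizer and checkpoint shards here
        self.ready = True

    def predict(self, payload, headers=None):
        # run inference here
        return {"predictions": []}

if __name__ == "__main__":
    model = LlamaModel("llama-2-7b-chat-hf")
    kserve.ModelServer(workers=NUM_WORKERS).start([model])
```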
But I get the error shown in the log above.
I would think kserve should handle multiprocessing gracefully, since it supports passing more workers to kserve.ModelServer().start(), but maybe I'm wrong?
What steps did you take and what happened:
[A clear and concise description of what the bug is.]
What did you expect to happen:
What's the InferenceService yaml:
[To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
- Kubernetes version: (use kubectl version):
- OS (e.g. from /etc/os-release):