Add support for text-generation-server, gradio inference server, OpenAI inference server. #295
Conversation
Falcon 40B
8-bit h2oGPT 12B on 2xA6000Ada 48GB. This works:

8xA100 80GB Falcon 40B
To avoid re-downloading weights, mount the local Hugging Face cache into the container, as done with the -v $HOME/.cache/huggingface/hub/:/data flag below.
This finally worked, but it is very slow, and it is unclear why running without sharding fails:
SERVER on 192.168.1.46:
CUDA_VISIBLE_DEVICES=0,1,2,3 docker run --gpus all --shm-size 2g \
  -e NCCL_SHM_DISABLE=1 -e TRANSFORMERS_CACHE="/.cache/" \
  -p 6112:80 -v $HOME/.cache:/.cache/ -v $HOME/.cache/huggingface/hub/:/data \
  ghcr.io/huggingface/text-generation-inference:0.8.2 \
  --model-id h2oai/h2ogpt-oasst1-512-12b \
  --max-input-length 2048 --max-total-tokens 3072 \
  --sharded=true --num-shard=4 --disable-custom-kernels

CLIENT:
python generate.py --base_model="http://192.168.1.46:6112"
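For quick checks against the TGI endpoint without going through generate.py, something like this should work (a minimal sketch using the text-generation client package that pairs with text-generation-inference; the prompt and token budget are placeholders):

from text_generation import Client

# Point at the TGI server started above (same host/port as in this thread).
client = Client("http://192.168.1.46:6112")

# Single-shot generation; max_new_tokens is an arbitrary placeholder budget.
response = client.generate("What is h2oGPT?", max_new_tokens=64)
print(response.generated_text)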
OpenAI tests pass except the embedding one:
https://community.openai.com/t/getting-embeddings-of-length-1/263285/4?u=pseudotensor
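For reference, the failing check is of this form (a sketch against the openai 0.x client of that era; the model name and key handling are assumptions, not taken from the PR):

import openai

openai.api_key = "sk-..."  # placeholder; normally read from OPENAI_API_KEY

# text-embedding-ada-002 should return a 1536-dimensional vector,
# not the length-1 result described in the linked thread.
resp = openai.Embedding.create(model="text-embedding-ada-002", input="hello world")
print(len(resp["data"][0]["embedding"]))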
Traceback (most recent call last):
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/gradio/blocks.py", line 1346, in process_api
    result = await self.call_function(
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/gradio/blocks.py", line 1090, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/gradio/utils.py", line 341, in async_iteration
    return await iterator.__anext__()
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/gradio/utils.py", line 334, in __anext__
    return await anyio.to_thread.run_sync(
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/gradio/utils.py", line 317, in run_sync_iterator_async
    return next(iterator)
  File "/home/jon/h2ogpt/gradio_runner.py", line 1109, in bot
    for output_fun in fun1(*tuple(args_list)):
  File "/home/jon/h2ogpt/generate.py", line 1263, in evaluate
    from gpt_langchain import run_qa_db
  File "/home/jon/h2ogpt/gpt_langchain.py", line 286, in <module>
    class GradioInference(LLM):
  File "/home/jon/h2ogpt/gpt_langchain.py", line 315, in GradioInference
    def validate_environment(cls, values: Dict) -> Dict:
  File "pydantic/class_validators.py", line 134, in pydantic.class_validators.root_validator.dec
  File "pydantic/class_validators.py", line 156, in pydantic.class_validators._prepare_validator
pydantic.errors.ConfigError: duplicate validator function "gpt_langchain.GradioInference.validate_environment"; if this is intended, set `allow_reuse=True`

Related: streamlit/streamlit@2682614
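The error message itself points at the fix. A minimal sketch under pydantic v1 semantics (the class body here is a hypothetical simplification of gpt_langchain.GradioInference, which really subclasses langchain's LLM):

from typing import Dict
from pydantic import BaseModel, root_validator

class GradioInference(BaseModel):
    # allow_reuse=True prevents the ConfigError when this validator gets
    # registered more than once, e.g. when the module is re-imported.
    @root_validator(allow_reuse=True)
    def validate_environment(cls, values: Dict) -> Dict:
        return values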
Separate PR
Also see:
GPTQ: huggingface/text-generation-inference#438
3x faster llama: https://github.com/turboderp/exllama
docker with mounted .cache
Compiled locally but doesn't start properly: