Evaluate: core42/jais-XX #43

ggbetz · 2024-04-10T11:58:33Z

For XX in [13b, 13b-chat, 30b-v3, 30b-chat-v3]:

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=core42/jais-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8
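
As a minimal sketch, the parameter template above can be expanded once per variant listed at the top of this issue. The `expand` helper and the dictionary layout are hypothetical, for illustration only, and are not part of the cot-eval pipeline.

```python
# Hypothetical helper: fills the {XX} placeholder for each jais variant in this issue.
VARIANTS = ["13b", "13b-chat", "30b-v3", "30b-chat-v3"]

# Parameter template copied from the issue; values are kept as strings.
TEMPLATE = {
    "NEXT_MODEL_PATH": "core42/jais-{XX}",
    "NEXT_MODEL_REVISION": "main",
    "NEXT_MODEL_PRECISION": "bfloat16",
    "MAX_LENGTH": "2048",
    "GPU_MEMORY_UTILIZATION": "0.7",
    "VLLM_SWAP_SPACE": "8",
}


def expand(variant: str) -> dict[str, str]:
    """Return the parameter set for one variant with {XX} substituted."""
    return {key: value.replace("{XX}", variant) for key, value in TEMPLATE.items()}


if __name__ == "__main__":
    for variant in VARIANTS:
        # Print one KEY=VALUE block per variant, separated by a blank line.
        print("\n".join(f"{k}={v}" for k, v in expand(variant).items()))
        print()
```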

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

yakazimir · 2024-05-10T23:21:44Z

Looks like a tricky one here, will look into where this is coming in:

2024-05-10T22:36:35.181385695Z INFO 05-10 22:36:35 selector.py:16] Using FlashAttention backend.
2024-05-10T22:36:36.885485950Z (RayWorkerVllm pid=7595) INFO 05-10 22:36:36 selector.py:16] Using FlashAttention backend.
2024-05-10T22:36:36.885536750Z (RayWorkerVllm pid=7595) INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1
2024-05-10T22:36:36.885543520Z INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1
2024-05-10T22:36:41.312005618Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution.
2024-05-10T22:36:41.312037008Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] Traceback (most recent call last):
2024-05-10T22:36:41.312043168Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 37, in execute_method
2024-05-10T22:36:41.312049198Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     return executor(*args, **kwargs)
2024-05-10T22:36:41.312054648Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in load_model
2024-05-10T22:36:41.312060448Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model_runner.load_model()
2024-05-10T22:36:41.312065698Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 95, in load_model
2024-05-10T22:36:41.312071528Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model = get_model(
2024-05-10T22:36:41.312098688Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 91, in get_model
2024-05-10T22:36:41.312104668Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     model = model_class(model_config.hf_config, linear_method)
2024-05-10T22:36:41.312110048Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 270, in __init__
2024-05-10T22:36:41.312115687Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.transformer = JAISModel(config, linear_method)
2024-05-10T22:36:41.312120977Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 230, in __init__
2024-05-10T22:36:41.312126507Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.h = nn.ModuleList([
2024-05-10T22:36:41.312132687Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 231, in <listcomp>
2024-05-10T22:36:41.312138737Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     JAISBlock(config, linear_method)
2024-05-10T22:36:41.312144097Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 183, in __init__
2024-05-10T22:36:41.312149707Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.mlp = JAISMLP(inner_dim, config, linear_method)
2024-05-10T22:36:41.312155027Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 137, in __init__
2024-05-10T22:36:41.312160747Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.c_fc = ColumnParallelLinear(
2024-05-10T22:36:41.312165967Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 173, in __init__
2024-05-10T22:36:41.312171587Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.output_size_per_partition = divide(output_size, tp_size)
2024-05-10T22:36:41.312176897Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 19, in divide
2024-05-10T22:36:41.312182467Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     ensure_divisibility(numerator, denominator)
2024-05-10T22:36:41.312187737Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 12, in ensure_divisibility
2024-05-10T22:36:41.312199477Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     assert numerator % denominator == 0, "{} is not divisible by {}".format(
2024-05-10T22:36:41.312205177Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] AssertionError: 13653 is not divisible by 4
2024-05-10T22:36:41.312211187Z (RayWorkerVllm pid=7380) INFO 05-10 22:36:36 selector.py:16] Using FlashAttention backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
2024-05-10T22:36:41.313330131Z Traceback (most recent call last):
2024-05-10T22:36:41.313366891Z   File "/usr/local/bin/cot-eval", line 8, in <module>
2024-05-10T22:36:41.313526370Z     sys.exit(main())
2024-05-10T22:36:41.313550730Z   File "/workspace/cot-eval/src/cot_eval/__main__.py", line 149, in main
2024-05-10T22:36:41.313593179Z     llm = VLLM(
2024-05-10T22:36:41.313605389Z   File "/usr/local/lib/python3.10/dist-packages/langchain_core/load/serializable.py", line 120, in __init__
2024-05-10T22:36:41.313659589Z     super().__init__(**kwargs)
2024-05-10T22:36:41.313672039Z   File "/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py", line 341, in __init__
2024-05-10T22:36:41.313752498Z     raise validation_error
2024-05-10T22:36:41.313823148Z pydantic.v1.error_wrappers.ValidationError: 1 validation error for VLLM
2024-05-10T22:36:41.313833768Z __root__
2024-05-10T22:36:41.313839598Z   13653 is not divisible by 4 (type=assertion_error)
2024-05-10T22:36:43.599831555Z (RayWorkerVllm pid=7380) INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
2024-05-10T22:36:43.599894725Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution. [repeated 2x across cluster]
2024-05-10T22:36:43.599908335Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] Traceback (most recent call last): [repeated 2x across cluster]
2024-05-10T22:36:43.599914535Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 37, in execute_method [repeated 2x across cluster]
2024-05-10T22:36:43.599923215Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     return executor(*args, **kwargs) [repeated 2x across cluster]
2024-05-10T22:36:43.599931985Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 95, in load_model [repeated 4x across cluster]
2024-05-10T22:36:43.599940855Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model_runner.load_model() [repeated 2x across cluster]
2024-05-10T22:36:43.599977715Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model = get_model( [repeated 2x across cluster]
2024-05-10T22:36:43.599983425Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 91, in get_model [repeated 2x across cluster]
2024-05-10T22:36:43.599992975Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     model = model_class(model_config.hf_config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.599998554Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 173, in __init__ [repeated 10x across cluster]
2024-05-10T22:36:43.600010064Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.transformer = JAISModel(config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600016424Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.h = nn.ModuleList([ [repeated 2x across cluster]
2024-05-10T22:36:43.600022014Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 231, in <listcomp> [repeated 2x across cluster]
2024-05-10T22:36:43.600032144Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     JAISBlock(config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600040194Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.mlp = JAISMLP(inner_dim, config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600045794Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.c_fc = ColumnParallelLinear( [repeated 2x across cluster]
2024-05-10T22:36:43.600054054Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.output_size_per_partition = divide(output_size, tp_size) [repeated 2x across cluster]
2024-05-10T22:36:43.600059674Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 19, in divide [repeated 2x across cluster]
2024-05-10T22:36:43.600068564Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     ensure_divisibility(numerator, denominator) [repeated 2x across cluster]
2024-05-10T22:36:43.600076354Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 12, in ensure_divisibility [repeated 2x across cluster]
2024-05-10T22:36:43.600085134Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     assert numerator % denominator == 0, "{} is not divisible by {}".format( [repeated 2x across cluster]
2024-05-10T22:36:43.600100124Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] AssertionError: 13653 is not divisible by 4 [repeated 2x across cluster]

ggbetz · 2024-05-13T14:17:53Z

Yes, might be tricky because I just tried to load core42/jais-13b-chat on a single NVIDIA A100-SXM4-40GB and run inference with VLLM, which worked fine.
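
For reference, the assertion in the traceback can be reproduced in isolation. The traceback shows `divide(output_size, tp_size)` failing for the `c_fc` layer, so the `4` is the tensor-parallel size and `13653` is the output size being split across workers. The sketch below re-implements the check from the log and lists the tensor-parallel sizes that would pass for this one dimension. `compatible_tp_sizes` is a hypothetical helper, not a vLLM function, and other model dimensions (for example the number of attention heads) would presumably need to pass the same kind of check.

```python
def ensure_divisibility(numerator: int, denominator: int) -> None:
    """Re-implementation of the check shown in the traceback (not imported from vLLM)."""
    assert numerator % denominator == 0, "{} is not divisible by {}".format(
        numerator, denominator)


def compatible_tp_sizes(dim: int, max_gpus: int) -> list[int]:
    """Hypothetical helper: tensor-parallel sizes in 1..max_gpus that divide `dim` evenly."""
    return [tp for tp in range(1, max_gpus + 1) if dim % tp == 0]


# Output size taken from the AssertionError in the log above.
OUTPUT_SIZE = 13653

if __name__ == "__main__":
    # 13653 = 3 * 3 * 37 * 41, so only 1 and 3 divide it among sizes up to 8.
    print(compatible_tp_sizes(OUTPUT_SIZE, 8))  # prints [1, 3]

    try:
        ensure_divisibility(OUTPUT_SIZE, 4)
    except AssertionError as err:
        print(err)  # prints "13653 is not divisible by 4"
```

This is consistent with the single-GPU run working: with a tensor-parallel size of 1 the divisibility check always passes, whereas the failing run split the layer across 4 workers.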

ggbetz added the eval_request label Apr 10, 2024

ggbetz assigned yakazimir Apr 10, 2024
