Evaluate: core42/jais-XX #43

Open
3 of 6 tasks
ggbetz opened this issue Apr 10, 2024 · 2 comments

@ggbetz
Contributor

ggbetz commented Apr 10, 2024

For XX in [13b, 13b-chat, 30b-v3, 30b-chat-v3]:

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=core42/jais-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8
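
A minimal sketch of how the parameters above could be turned into vLLM engine arguments for each model variant. The helper name `engine_kwargs_from_env` and the mapping from variable names to keyword names are assumptions made for illustration; the cot-eval pipeline may wire these values differently.

```python
# Values copied from the "Parameters" block of this issue.
ISSUE_ENV = {
    "NEXT_MODEL_PATH": "core42/jais-{XX}",
    "NEXT_MODEL_REVISION": "main",
    "NEXT_MODEL_PRECISION": "bfloat16",
    "MAX_LENGTH": "2048 ",  # trailing space kept as in the issue
    "GPU_MEMORY_UTILIZATION": "0.7",
    "VLLM_SWAP_SPACE": "8",
}

# Model variants listed at the top of the issue.
VARIANTS = ["13b", "13b-chat", "30b-v3", "30b-chat-v3"]


def engine_kwargs_from_env(env, variant):
    """Hypothetical helper: map the issue's variables to vLLM-style keyword names.

    The variable-to-keyword mapping is an assumption, not taken from cot-eval.
    """
    return {
        "model": env["NEXT_MODEL_PATH"].replace("{XX}", variant),
        "revision": env["NEXT_MODEL_REVISION"].strip(),
        "dtype": env["NEXT_MODEL_PRECISION"].strip(),
        "max_model_len": int(env["MAX_LENGTH"].strip()),
        "gpu_memory_utilization": float(env["GPU_MEMORY_UTILIZATION"].strip()),
        "swap_space": int(env["VLLM_SWAP_SPACE"].strip()),
    }


if __name__ == "__main__":
    for variant in VARIANTS:
        print(engine_kwargs_from_env(ISSUE_ENV, variant))
```

Stripping whitespace before conversion keeps stray spaces, like the one after `MAX_LENGTH=2048`, from leaking into string-valued settings.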

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datasets (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)
@yakazimir
Collaborator

Looks like a tricky one here; I will look into where this is coming from:

2024-05-10T22:36:35.181385695Z INFO 05-10 22:36:35 selector.py:16] Using FlashAttention backend.
2024-05-10T22:36:36.885485950Z (RayWorkerVllm pid=7595) INFO 05-10 22:36:36 selector.py:16] Using FlashAttention backend.
2024-05-10T22:36:36.885536750Z (RayWorkerVllm pid=7595) INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1
2024-05-10T22:36:36.885543520Z INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1
2024-05-10T22:36:41.312005618Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution.
2024-05-10T22:36:41.312037008Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] Traceback (most recent call last):
2024-05-10T22:36:41.312043168Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 37, in execute_method
2024-05-10T22:36:41.312049198Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     return executor(*args, **kwargs)
2024-05-10T22:36:41.312054648Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in load_model
2024-05-10T22:36:41.312060448Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model_runner.load_model()
2024-05-10T22:36:41.312065698Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 95, in load_model
2024-05-10T22:36:41.312071528Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model = get_model(
2024-05-10T22:36:41.312098688Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 91, in get_model
2024-05-10T22:36:41.312104668Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     model = model_class(model_config.hf_config, linear_method)
2024-05-10T22:36:41.312110048Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 270, in __init__
2024-05-10T22:36:41.312115687Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.transformer = JAISModel(config, linear_method)
2024-05-10T22:36:41.312120977Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 230, in __init__
2024-05-10T22:36:41.312126507Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.h = nn.ModuleList([
2024-05-10T22:36:41.312132687Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 231, in <listcomp>
2024-05-10T22:36:41.312138737Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     JAISBlock(config, linear_method)
2024-05-10T22:36:41.312144097Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 183, in __init__
2024-05-10T22:36:41.312149707Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.mlp = JAISMLP(inner_dim, config, linear_method)
2024-05-10T22:36:41.312155027Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 137, in __init__
2024-05-10T22:36:41.312160747Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.c_fc = ColumnParallelLinear(
2024-05-10T22:36:41.312165967Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 173, in __init__
2024-05-10T22:36:41.312171587Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.output_size_per_partition = divide(output_size, tp_size)
2024-05-10T22:36:41.312176897Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 19, in divide
2024-05-10T22:36:41.312182467Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     ensure_divisibility(numerator, denominator)
2024-05-10T22:36:41.312187737Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 12, in ensure_divisibility
2024-05-10T22:36:41.312199477Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     assert numerator % denominator == 0, "{} is not divisible by {}".format(
2024-05-10T22:36:41.312205177Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] AssertionError: 13653 is not divisible by 4
2024-05-10T22:36:41.312211187Z (RayWorkerVllm pid=7380) INFO 05-10 22:36:36 selector.py:16] Using FlashAttention backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
2024-05-10T22:36:41.313330131Z Traceback (most recent call last):
2024-05-10T22:36:41.313366891Z   File "/usr/local/bin/cot-eval", line 8, in <module>
2024-05-10T22:36:41.313526370Z     sys.exit(main())
2024-05-10T22:36:41.313550730Z   File "/workspace/cot-eval/src/cot_eval/__main__.py", line 149, in main
2024-05-10T22:36:41.313593179Z     llm = VLLM(
2024-05-10T22:36:41.313605389Z   File "/usr/local/lib/python3.10/dist-packages/langchain_core/load/serializable.py", line 120, in __init__
2024-05-10T22:36:41.313659589Z     super().__init__(**kwargs)
2024-05-10T22:36:41.313672039Z   File "/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py", line 341, in __init__
2024-05-10T22:36:41.313752498Z     raise validation_error
2024-05-10T22:36:41.313823148Z pydantic.v1.error_wrappers.ValidationError: 1 validation error for VLLM
2024-05-10T22:36:41.313833768Z __root__
2024-05-10T22:36:41.313839598Z   13653 is not divisible by 4 (type=assertion_error)
2024-05-10T22:36:43.599831555Z (RayWorkerVllm pid=7380) INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
2024-05-10T22:36:43.599894725Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution. [repeated 2x across cluster]
2024-05-10T22:36:43.599908335Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] Traceback (most recent call last): [repeated 2x across cluster]
2024-05-10T22:36:43.599914535Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 37, in execute_method [repeated 2x across cluster]
2024-05-10T22:36:43.599923215Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     return executor(*args, **kwargs) [repeated 2x across cluster]
2024-05-10T22:36:43.599931985Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 95, in load_model [repeated 4x across cluster]
2024-05-10T22:36:43.599940855Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model_runner.load_model() [repeated 2x across cluster]
2024-05-10T22:36:43.599977715Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model = get_model( [repeated 2x across cluster]
2024-05-10T22:36:43.599983425Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 91, in get_model [repeated 2x across cluster]
2024-05-10T22:36:43.599992975Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     model = model_class(model_config.hf_config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.599998554Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 173, in __init__ [repeated 10x across cluster]
2024-05-10T22:36:43.600010064Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.transformer = JAISModel(config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600016424Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.h = nn.ModuleList([ [repeated 2x across cluster]
2024-05-10T22:36:43.600022014Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 231, in <listcomp> [repeated 2x across cluster]
2024-05-10T22:36:43.600032144Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     JAISBlock(config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600040194Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.mlp = JAISMLP(inner_dim, config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600045794Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.c_fc = ColumnParallelLinear( [repeated 2x across cluster]
2024-05-10T22:36:43.600054054Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.output_size_per_partition = divide(output_size, tp_size) [repeated 2x across cluster]
2024-05-10T22:36:43.600059674Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 19, in divide [repeated 2x across cluster]
2024-05-10T22:36:43.600068564Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     ensure_divisibility(numerator, denominator) [repeated 2x across cluster]
2024-05-10T22:36:43.600076354Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 12, in ensure_divisibility [repeated 2x across cluster]
2024-05-10T22:36:43.600085134Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     assert numerator % denominator == 0, "{} is not divisible by {}".format( [repeated 2x across cluster]
2024-05-10T22:36:43.600100124Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] AssertionError: 13653 is not divisible by 4 [repeated 2x across cluster]
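
The last frames of the traceback can be reproduced in isolation. The sketch below re-creates the divisibility check that the log shows in `parallel_utils/utils.py`; the function bodies are reconstructed from the traceback lines rather than copied from vLLM, and `try_partition` is a hypothetical wrapper added only for illustration. Reading 13653 as the `output_size` of the `c_fc` layer and 4 as `tp_size` follows from the `divide(output_size, tp_size)` frame.

```python
def ensure_divisibility(numerator, denominator):
    # Mirrors the check in the traceback; written as an explicit raise so it
    # still fires when Python runs with assertions disabled (-O).
    if numerator % denominator != 0:
        raise AssertionError(
            "{} is not divisible by {}".format(numerator, denominator)
        )


def divide(numerator, denominator):
    # Size of each shard when a dimension is split evenly across workers.
    ensure_divisibility(numerator, denominator)
    return numerator // denominator


def try_partition(output_size, tp_size):
    # Hypothetical helper: return the per-partition size, or the error text.
    try:
        return divide(output_size, tp_size)
    except AssertionError as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_partition(13653, 4))  # 13653 is not divisible by 4
    print(try_partition(13653, 1))  # 13653
```

Since 13653 is odd, no even tensor-parallel size can split this layer evenly.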

@ggbetz
Contributor Author

ggbetz commented May 13, 2024

Yes, might be tricky, because I just tried to load core42/jais-13b-chat on a single NVIDIA A100-SXM4-40GB and run inference with vLLM, which worked fine.
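
One reading that fits both observations is that the single-GPU run uses a tensor-parallel size of 1, which divides any dimension, while the failing run used 4. The sketch below lists which tensor-parallel sizes would pass the divisibility check for a set of sharded dimensions. The helper `usable_tp_sizes` is hypothetical, and only the dimension 13653 comes from the log; other sharded dimensions, such as the attention head count, may add constraints that are not shown here.

```python
def usable_tp_sizes(sharded_dims, gpu_count):
    """Return tensor-parallel sizes up to gpu_count that evenly divide
    every dimension in sharded_dims (hypothetical pre-flight check)."""
    return [
        tp
        for tp in range(1, gpu_count + 1)
        if all(dim % tp == 0 for dim in sharded_dims)
    ]


if __name__ == "__main__":
    # 13653 is the only dimension reported in the traceback.
    print(usable_tp_sizes([13653], gpu_count=4))  # [1, 3]
```

Running such a check before launching the pipeline would surface an incompatible GPU count before any workers are started.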
