[Bug] open compass hangs when evaluating chat musician trained model - waiting for semaphore? #1034

petergreis · 2024-04-10T08:43:50Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version.

Type

I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.

Environment

(opencompass) ml@nmtc5um8qv:/notebooks/ChatMusician/eval$ python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"
{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'GPU 0': 'Quadro P6000',
 'MMEngine': '0.8.2',
 'NVCC': 'Cuda compilation tools, release 11.6, V11.6.124',
 'OpenCV': '4.9.0',
 'PyTorch': '1.13.1+cu117',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201402\n'
                              '  - Intel(R) Math Kernel Library Version '
                              '2020.0.0 Product Build 20191122 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.6.0 (Git Hash '
                              '52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.7\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
                              '  - CuDNN 8.5\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
                              'CUDNN_VERSION=8.5.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-DEDGE_PROFILER_USE_KINETO -O2 -fPIC '
                              '-Wno-narrowing -Wall -Wextra '
                              '-Werror=return-type -Werror=non-virtual-dtor '
                              '-Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
                              'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.1.4+',
 'sys.platform': 'linux'}
(opencompass) ml@nmtc5um8qv:/notebooks/ChatMusician/eval$

Reproduces the problem - code/configuration sample

(opencompass) ml@nmtc5um8qv:/notebooks/ChatMusician/eval$ cat configs/eval_chat_musician_7b.py 

from mmengine.config import read_base
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
from opencompass.partitioners import NaivePartitioner


with read_base():
    from .datasets.collections.base_medium_llama import mmlu_datasets
    from .datasets.music_theory_bench.music_theory_bench_ppl_zero_shot import music_theory_bench_datasets_zero_shot
    from .datasets.music_theory_bench.music_theory_bench_ppl_few_shot import music_theory_bench_datasets_few_shot
    from .models.chat_musician.hf_chat_musician import models

datasets = [
    *mmlu_datasets,
    *music_theory_bench_datasets_zero_shot,
    *music_theory_bench_datasets_few_shot
]

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,
        task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=64,
        task=dict(type=OpenICLEvalTask)
    ),
)

Reproduces the problem - command or script

Note that open compass is used in the evaluation of the ChatMusician project

python run.py configs/eval_chat_musician_7b.py

Reproduces the problem - error message

python run.py configs/eval_chat_musician_7b.py
04/10 08:30:09 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 08:30:09 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]^CTraceback (most recent call last):
  File "/notebooks/ChatMusician/eval/run.py", line 327, in <module>
    main()
  File "/notebooks/ChatMusician/eval/run.py", line 282, in main
    runner(tasks)
  File "/notebooks/ChatMusician/eval/opencompass/runners/base.py", line 38, in __call__
    status = self.launch(tasks)
  File "/notebooks/ChatMusician/eval/opencompass/runners/local.py", line 125, in launch
    with ThreadPoolExecutor(
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py", line 649, in __exit__
    self.shutdown(wait=True)
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1537, in _shutdown
    atexit_call()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt:

Other information

After letting this run for a while, I see no progress. When I kill it with Ctrl-C, the frame of interest is:

File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

It looks like it is stuck waiting for a thread. Any ideas?
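For readers following the traceback: the frames above show only the main thread waiting in `t.join()`, which is what leaving a `with ThreadPoolExecutor(...)` block does. A minimal sketch of that behaviour follows. The `slow_task` and `run_with_executor` names are hypothetical stand-ins, not OpenCompass code, and the sleep stands in for whatever the real worker is blocked on.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(seconds: float) -> str:
    # Hypothetical stand-in for a worker that blocks for a while.
    time.sleep(seconds)
    return "done"


def run_with_executor(seconds: float) -> float:
    """Return how long it takes to get out of the `with` block."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(slow_task, seconds)
        # Leaving the block calls shutdown(wait=True), which joins every
        # worker thread, so the main thread waits here for slow_task.
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"with-block exited after {run_with_executor(0.5):.2f}s")
```

If the worker never returns, the main thread waits in `join()` indefinitely, so the Ctrl-C traceback points at the executor rather than at the code that is actually stuck.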


petergreis · 2024-04-10T09:08:51Z

After the first ctrl-c to kill the job:

0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]^C╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /notebooks/ChatMusician/eval/opencompass/run.py:4 in <module>                                    │
│                                                                                                  │
│   1 from opencompass.cli.main import main                                                        │
│   2                                                                                              │
│   3 if __name__ == '__main__':                                                                   │
│ ❱ 4 │   main()                                                                                   │
│   5                                                                                              │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/cli/main.py:309 in main                     │
│                                                                                                  │
│   306 │   │   │   for task in tasks:                                                             │
│   307 │   │   │   │   cfg.attack.dataset = task.datasets[0][0].abbr                              │
│   308 │   │   │   │   task.attack = cfg.attack                                                   │
│ ❱ 309 │   │   runner(tasks)                                                                      │
│   310 │                                                                                          │
│   311 │   # evaluate                                                                             │
│   312 │   if args.mode in ['all', 'eval']:                                                       │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/base.py:38 in __call__              │
│                                                                                                  │
│   35 │   │   │   tasks (list[dict]): A list of task configs, usually generated by                │
│   36 │   │   │   │   Partitioner.                                                                │
│   37 │   │   """                                                                                 │
│ ❱ 38 │   │   status = self.launch(tasks)                                                         │
│   39 │   │   status_list = list(status)  # change into list format                               │
│   40 │   │   self.summarize(status_list)                                                         │
│   41                                                                                             │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/local.py:148 in launch              │
│                                                                                                  │
│   145 │   │   │   │                                                                              │
│   146 │   │   │   │   return res                                                                 │
│   147 │   │   │                                                                                  │
│ ❱ 148 │   │   │   with ThreadPoolExecutor(                                                       │
│   149 │   │   │   │   │   max_workers=self.max_num_workers) as executor:                         │
│   150 │   │   │   │   status = executor.map(submit, tasks, range(len(tasks)))                    │
│   151                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py:649 in __exit__   │
│                                                                                                  │
│   646 │   │   return self                                                                        │
│   647 │                                                                                          │
│   648 │   def __exit__(self, exc_type, exc_val, exc_tb):                                         │
│ ❱ 649 │   │   self.shutdown(wait=True)                                                           │
│   650 │   │   return False                                                                       │
│   651                                                                                            │
│   652                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py:235 in shutdown  │
│                                                                                                  │
│   232 │   │   │   self._work_queue.put(None)                                                     │
│   233 │   │   if wait:                                                                           │
│   234 │   │   │   for t in self._threads:                                                        │
│ ❱ 235 │   │   │   │   t.join()                                                                   │
│   236 │   shutdown.__doc__ = _base.Executor.shutdown.__doc__                                     │
│   237                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1096 in join                     │
│                                                                                                  │
│   1093 │   │   │   raise RuntimeError("cannot join current thread")                              │
│   1094 │   │                                                                                     │
│   1095 │   │   if timeout is None:                                                               │
│ ❱ 1096 │   │   │   self._wait_for_tstate_lock()                                                  │
│   1097 │   │   else:                                                                             │
│   1098 │   │   │   # the behavior of a negative timeout isn't documented, but                    │
│   1099 │   │   │   # historically .join(timeout=x) for x<0 has acted as if timeout=0             │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1116 in _wait_for_tstate_lock    │
│                                                                                                  │
│   1113 │   │   │   return                                                                        │
│   1114 │   │                                                                                     │
│   1115 │   │   try:                                                                              │
│ ❱ 1116 │   │   │   if lock.acquire(block, timeout):                                              │
│   1117 │   │   │   │   lock.release()                                                            │
│   1118 │   │   │   │   self._stop()                                                              │
│   1119 │   │   except:                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyboardInterrupt

liushz · 2024-04-10T14:14:07Z

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|

and, can I see the content of your model config .models.chat_musician.hf_chat_musician ?

petergreis · 2024-04-10T14:45:07Z

How long will the process stack in

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|

I am sorry, I don't understand what you're asking for here...

and, can I see the content of your model config .models.chat_musician.hf_chat_musician ?

from opencompass.models import HuggingFaceCausalLM

model_path_mapping = {
    "ChatMusician": "m-a-p/ChatMusician",
    "ChatMusician-Base": "m-a-p/ChatMusician-Base"
}

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr=model_abbr,
        path=model_path,
        tokenizer_path=model_path,
        tokenizer_kwargs=dict(
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False, # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
    for model_abbr, model_path in model_path_mapping.items()
]

petergreis · 2024-04-10T17:03:13Z

Just for reference, this has been stuck for 45 minutes:

(opencompass) ml@njdfaracvb:/notebooks/ChatMusician/eval$ date; python run.py configs/eval_chat_musician_7b.py
Wed Apr 10 16:13:49 UTC 2024
04/10 16:13:57 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 16:13:57 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]

liushz · 2024-04-11T04:58:53Z

It seems like the system is in the process of caching the ChatMusician model on the backend. Please try again once the model has been fully downloaded and cached.

petergreis · 2024-04-11T06:23:39Z

The models are cached, and it is still just sitting there (I have just re-run the predict code to bring down the models again):

(opencompass) ml@nwlciszbmg:/notebooks/ChatMusician/eval$ cd
(opencompass) ml@nwlciszbmg:~$ ls -al .cache/huggingface/hub/
total 24
drwxrwxr-x 5 ml ml 4096 Apr 11 06:13 .
drwxrwxr-x 3 ml ml 4096 Apr 11 06:07 ..
drwxrwxr-x 4 ml ml 4096 Apr 11 06:13 .locks
drwxrwxr-x 6 ml ml 4096 Apr 11 06:19 models--m-a-p--ChatMusician
drwxrwxr-x 6 ml ml 4096 Apr 11 06:12 models--m-a-p--ChatMusician-Base
-rw-rw-r-- 1 ml ml    1 Apr 11 06:07 version.txt
(opencompass) ml@nwlciszbmg:~$ du -s !$
du -s .cache/huggingface/hub/
26326872        .cache/huggingface/hub/
(opencompass) ml@nwlciszbmg:~$ du -hs .cache/huggingface/hub/*
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician-Base
4.0K    .cache/huggingface/hub/version.txt

petergreis · 2024-04-11T06:56:29Z

I think something is hanging in the partitioner; is there an easy way to debug this?
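One generic way to see where each thread is waiting is Python's standard `faulthandler` module. The sketch below is not an OpenCompass feature. `enable_stack_dumps` is a hypothetical helper name, and calling it near the top of `run.py` is an assumption about where it would go.

```python
import faulthandler
import signal
import sys


def enable_stack_dumps(timeout_s: float = 600.0, file=sys.stderr) -> None:
    """Arrange for the stacks of all threads to be written to `file`."""
    # Dump all thread stacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=file, all_threads=True)
    # Dump all thread stacks every `timeout_s` seconds while the process runs.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    # Where available (not on Windows), `kill -USR1 <pid>` dumps on demand.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)


if __name__ == "__main__":
    enable_stack_dumps(timeout_s=600.0)
```

This only covers the process it is called in. If the hang is inside a separate child process, the dump will again show the parent waiting.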

liushz · 2024-04-11T07:19:45Z

Could you please provide the contents of your log? It can be found in output/WORK_DIR/logs. Is it empty?

petergreis · 2024-04-11T13:08:35Z

Nothing found like output/WORK_DIR/logs.

I only have outputs:

ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name output -print
ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name outputs -print
./outputs

from outputs/default/20240411_124955/configs/20240411_
20240411_124955.py.txt
124955'.py

Leymore · 2024-04-12T04:06:22Z

Please try --debug option in cli, e.g. python run.py configs/eval_chat_musician_7b.py --debug, and paste the stuck outputs here.

Also, does the following code run successfully?

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'm-a-p/ChatMusician'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map='cuda').eval()
prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt')
response = model.generate(input_ids=inputs['input_ids'].to(model.device),)
response = tokenizer.decode(response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

petergreis · 2024-04-12T08:00:31Z

Yes, your code runs successfully:

(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ python test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.72s/it]
/home/ml/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py:1132: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(


I'm doing well, thank you.


(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ conda list | grep transformers
sentence-transformers     2.2.2                    pypi_0    pypi
transformers              4.39.3                   pypi_0    pypi

Solution

After a marathon deep dive, here is the TLDR version. There was indeed a threading deadlock, as I suspected. It is caused by a conflict between Intel's Math Kernel Library (MKL) and the GNU OpenMP library (libgomp.so.1) in the platform environment I am using, Digital Ocean Paperspace.

The steps in brief:

1. Put in a fresh install of OpenCompass.
2. Copy over the config and dataset for ChatMusician:
   configs/eval_chat_musician_7b.py
   configs/datasets/music_theory_bench
3. Install python-Levenshtein: conda install python-Levenshtein -y
4. Pin the MKL threading layer to the GNU version: export MKL_THREADING_LAYER=GNU
5. Rerun: time python run.py configs/eval_chat_musician_7b.py

A few errors result, but it is running on a 45GB A6000 at Paperspace. The first run took approximately 5h30m. Thank you all for your patience and feedback in tracking this down.
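The same pinning can also be done from Python instead of the shell. This is a minimal sketch, assuming it runs before numpy, torch, or anything else that loads MKL is imported. `pin_mkl_threading_layer` is a hypothetical helper name, not part of OpenCompass.

```python
import os


def pin_mkl_threading_layer(layer: str = "GNU") -> str:
    """Set MKL_THREADING_LAYER unless a value is already present.

    Returns the value that is in effect afterwards.
    """
    # setdefault keeps an existing value, so a choice exported in the
    # shell still wins; child processes inherit os.environ.
    return os.environ.setdefault("MKL_THREADING_LAYER", layer)


if __name__ == "__main__":
    # Call this before importing numpy/torch so MKL sees the setting.
    print("MKL_THREADING_LAYER =", pin_mkl_threading_layer())
```

Using `setdefault` rather than plain assignment is a design choice here: it leaves a value exported in the shell untouched.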

mm-assistant bot assigned liushz Apr 10, 2024

petergreis mentioned this issue Apr 10, 2024

Open compass hangs... hf-lin/ChatMusician#17

Closed

petergreis closed this as completed Apr 12, 2024
