[Bug] open compass hangs when evaluating chat musician trained model - waiting for semaphore? #1034

Closed
2 tasks done
petergreis opened this issue Apr 10, 2024 · 11 comments

@petergreis

Prerequisite

Type

I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.

Environment

(opencompass) ml@nmtc5um8qv:/notebooks/ChatMusician/eval$ python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"
{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'GPU 0': 'Quadro P6000',
 'MMEngine': '0.8.2',
 'NVCC': 'Cuda compilation tools, release 11.6, V11.6.124',
 'OpenCV': '4.9.0',
 'PyTorch': '1.13.1+cu117',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201402\n'
                              '  - Intel(R) Math Kernel Library Version '
                              '2020.0.0 Product Build 20191122 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.6.0 (Git Hash '
                              '52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.7\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
                              '  - CuDNN 8.5\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
                              'CUDNN_VERSION=8.5.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-DEDGE_PROFILER_USE_KINETO -O2 -fPIC '
                              '-Wno-narrowing -Wall -Wextra '
                              '-Werror=return-type -Werror=non-virtual-dtor '
                              '-Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
                              'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.1.4+',
 'sys.platform': 'linux'}
(opencompass) ml@nmtc5um8qv:/notebooks/ChatMusician/eval$ 

Reproduces the problem - code/configuration sample

(opencompass) ml@nmtc5um8qv:/notebooks/ChatMusician/eval$ cat configs/eval_chat_musician_7b.py 

from mmengine.config import read_base
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
from opencompass.partitioners import NaivePartitioner


with read_base():
    from .datasets.collections.base_medium_llama import mmlu_datasets
    from .datasets.music_theory_bench.music_theory_bench_ppl_zero_shot import music_theory_bench_datasets_zero_shot
    from .datasets.music_theory_bench.music_theory_bench_ppl_few_shot import music_theory_bench_datasets_few_shot
    from .models.chat_musician.hf_chat_musician import models

datasets = [
    *mmlu_datasets,
    *music_theory_bench_datasets_zero_shot,
    *music_theory_bench_datasets_few_shot
]

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,
        task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=64,
        task=dict(type=OpenICLEvalTask)
    ),
)
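When narrowing down a hang like this, it can help to shrink the run to one dataset and one worker so that very few inference tasks are launched. The fragment below is only a sketch along those lines. It reuses names from the config above; the file name, the slice `mmlu_datasets[:1]` and `max_num_workers=1` are illustrative assumptions, not values from the original report.

```python
# Hypothetical reduced config for debugging, e.g. saved as
# configs/eval_chat_musician_debug.py (file name is an assumption).
from mmengine.config import read_base
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.partitioners import NaivePartitioner

with read_base():
    from .datasets.collections.base_medium_llama import mmlu_datasets
    from .models.chat_musician.hf_chat_musician import models

# Keep only the first MMLU subset so the partitioner produces very few tasks.
datasets = mmlu_datasets[:1]

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=LocalRunner,
        max_num_workers=1,  # one worker thread makes a stuck task easier to spot
        task=dict(type=OpenICLInferTask)),
)
```

If this reduced run also stalls at 0%, the dataset list is probably not the cause.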

Reproduces the problem - command or script

Note that OpenCompass is used here as part of the ChatMusician project's evaluation.

python run.py configs/eval_chat_musician_7b.py

Reproduces the problem - error message

python run.py configs/eval_chat_musician_7b.py
04/10 08:30:09 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 08:30:09 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]^CTraceback (most recent call last):
  File "/notebooks/ChatMusician/eval/run.py", line 327, in <module>
    main()
  File "/notebooks/ChatMusician/eval/run.py", line 282, in main
    runner(tasks)
  File "/notebooks/ChatMusician/eval/opencompass/runners/base.py", line 38, in __call__
    status = self.launch(tasks)
  File "/notebooks/ChatMusician/eval/opencompass/runners/local.py", line 125, in launch
    with ThreadPoolExecutor(
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py", line 649, in __exit__
    self.shutdown(wait=True)
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1537, in _shutdown
    atexit_call()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt:

Other information

After letting this run for a while I see no progress. When the job is killed with Ctrl-C, the frame of interest is:

File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

It looks like it is stuck waiting for a thread. Any ideas?
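The Ctrl-C traceback only shows the main thread, which is blocked in `t.join()` while the executor shuts down; it does not show what the worker threads are doing. One way to see every thread's stack is Python's standard `faulthandler` module. The sketch below is a generic illustration, not OpenCompass code: the helper name `dump_all_thread_stacks` and the "stuck worker" are hypothetical stand-ins.

```python
import faulthandler
import sys
import threading
import time


def dump_all_thread_stacks(file=sys.stderr):
    """Write the current stack of every Python thread to `file`.

    `file` must be a real file object with a file descriptor.
    """
    faulthandler.dump_traceback(file=file, all_threads=True)


if __name__ == "__main__":
    # Hypothetical stand-in for a stuck worker: it blocks on an event
    # that is only set after the stacks have been dumped.
    release = threading.Event()
    worker = threading.Thread(target=release.wait, name="stuck-worker")
    worker.start()
    time.sleep(0.1)  # give the worker time to block

    dump_all_thread_stacks()  # shows the worker waiting inside threading.py

    release.set()
    worker.join()
```

For a long-running script, calling `faulthandler.dump_traceback_later(600)` near the top would dump all thread stacks after ten minutes without needing Ctrl-C. Note that this only covers threads in the current process, not child processes.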

@petergreis
Author

After the first Ctrl-C to kill the job:

0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]^C╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /notebooks/ChatMusician/eval/opencompass/run.py:4 in <module>                                    │
│                                                                                                  │
│   1 from opencompass.cli.main import main                                                        │
│   2                                                                                              │
│   3 if __name__ == '__main__':                                                                   │
│ ❱ 4 │   main()                                                                                   │
│   5                                                                                              │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/cli/main.py:309 in main                     │
│                                                                                                  │
│   306 │   │   │   for task in tasks:                                                             │
│   307 │   │   │   │   cfg.attack.dataset = task.datasets[0][0].abbr                              │
│   308 │   │   │   │   task.attack = cfg.attack                                                   │
│ ❱ 309 │   │   runner(tasks)                                                                      │
│   310 │                                                                                          │
│   311 │   # evaluate                                                                             │
│   312 │   if args.mode in ['all', 'eval']:                                                       │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/base.py:38 in __call__              │
│                                                                                                  │
│   35 │   │   │   tasks (list[dict]): A list of task configs, usually generated by                │
│   36 │   │   │   │   Partitioner.                                                                │
│   37 │   │   """                                                                                 │
│ ❱ 38 │   │   status = self.launch(tasks)                                                         │
│   39 │   │   status_list = list(status)  # change into list format                               │
│   40 │   │   self.summarize(status_list)                                                         │
│   41                                                                                             │
│                                                                                                  │
│ /notebooks/ChatMusician/eval/opencompass/opencompass/runners/local.py:148 in launch              │
│                                                                                                  │
│   145 │   │   │   │                                                                              │
│   146 │   │   │   │   return res                                                                 │
│   147 │   │   │                                                                                  │
│ ❱ 148 │   │   │   with ThreadPoolExecutor(                                                       │
│   149 │   │   │   │   │   max_workers=self.max_num_workers) as executor:                         │
│   150 │   │   │   │   status = executor.map(submit, tasks, range(len(tasks)))                    │
│   151                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/_base.py:649 in __exit__   │
│                                                                                                  │
│   646 │   │   return self                                                                        │
│   647 │                                                                                          │
│   648 │   def __exit__(self, exc_type, exc_val, exc_tb):                                         │
│ ❱ 649 │   │   self.shutdown(wait=True)                                                           │
│   650 │   │   return False                                                                       │
│   651                                                                                            │
│   652                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/concurrent/futures/thread.py:235 in shutdown  │
│                                                                                                  │
│   232 │   │   │   self._work_queue.put(None)                                                     │
│   233 │   │   if wait:                                                                           │
│   234 │   │   │   for t in self._threads:                                                        │
│ ❱ 235 │   │   │   │   t.join()                                                                   │
│   236 │   shutdown.__doc__ = _base.Executor.shutdown.__doc__                                     │
│   237                                                                                            │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1096 in join                     │
│                                                                                                  │
│   1093 │   │   │   raise RuntimeError("cannot join current thread")                              │
│   1094 │   │                                                                                     │
│   1095 │   │   if timeout is None:                                                               │
│ ❱ 1096 │   │   │   self._wait_for_tstate_lock()                                                  │
│   1097 │   │   else:                                                                             │
│   1098 │   │   │   # the behavior of a negative timeout isn't documented, but                    │
│   1099 │   │   │   # historically .join(timeout=x) for x<0 has acted as if timeout=0             │
│                                                                                                  │
│ /home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py:1116 in _wait_for_tstate_lock    │
│                                                                                                  │
│   1113 │   │   │   return                                                                        │
│   1114 │   │                                                                                     │
│   1115 │   │   try:                                                                              │
│ ❱ 1116 │   │   │   if lock.acquire(block, timeout):                                              │
│   1117 │   │   │   │   lock.release()                                                            │
│   1118 │   │   │   │   self._stop()                                                              │
│   1119 │   │   except:                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyboardInterrupt

@liushz
Collaborator

liushz commented Apr 10, 2024

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%| 

And can I see the content of your model config, .models.chat_musician.hf_chat_musician?

@petergreis
Author

How long does the process stay stuck at

launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%| 

I am sorry, I don't understand what you're asking for here...

And can I see the content of your model config, .models.chat_musician.hf_chat_musician?

from opencompass.models import HuggingFaceCausalLM

model_path_mapping = {
    "ChatMusician": "m-a-p/ChatMusician",
    "ChatMusician-Base": "m-a-p/ChatMusician-Base"
}

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr=model_abbr,
        path=model_path,
        tokenizer_path=model_path,
        tokenizer_kwargs=dict(
            trust_remote_code=True,
        ),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        batch_padding=False, # if false, inference with for-loop without batch padding
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
    for model_abbr, model_path in model_path_mapping.items()
]

@petergreis
Author

Just for reference, this has been stuck for 45 minutes:

(opencompass) ml@njdfaracvb:/notebooks/ChatMusician/eval$ date; python run.py configs/eval_chat_musician_7b.py
Wed Apr 10 16:13:49 UTC 2024
04/10 16:13:57 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
04/10 16:13:57 - OpenCompass - INFO - Partitioned into 122 tasks.
launch OpenICLInfer[ChatMusician/lukaemon_mmlu_college_biology] on GPU 0                                                                                                                   
  0%|                                                                                                                                                              | 0/122 [00:00<?, ?it/s]

@liushz
Collaborator

liushz commented Apr 11, 2024

It seems like the system is in the process of caching the ChatMusician model on the backend. Please try again once the model has been fully downloaded and cached.

@petergreis
Author

petergreis commented Apr 11, 2024

The models are cached, but it is still just sitting there (I have just re-run the predict code to bring down the models again):

(opencompass) ml@nwlciszbmg:/notebooks/ChatMusician/eval$ cd
(opencompass) ml@nwlciszbmg:~$ ls -al .cache/huggingface/hub/
total 24
drwxrwxr-x 5 ml ml 4096 Apr 11 06:13 .
drwxrwxr-x 3 ml ml 4096 Apr 11 06:07 ..
drwxrwxr-x 4 ml ml 4096 Apr 11 06:13 .locks
drwxrwxr-x 6 ml ml 4096 Apr 11 06:19 models--m-a-p--ChatMusician
drwxrwxr-x 6 ml ml 4096 Apr 11 06:12 models--m-a-p--ChatMusician-Base
-rw-rw-r-- 1 ml ml    1 Apr 11 06:07 version.txt
(opencompass) ml@nwlciszbmg:~$ du -s !$
du -s .cache/huggingface/hub/
26326872        .cache/huggingface/hub/
(opencompass) ml@nwlciszbmg:~$ du -hs .cache/huggingface/hub/*
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician
13G     .cache/huggingface/hub/models--m-a-p--ChatMusician-Base
4.0K    .cache/huggingface/hub/version.txt
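The same check can be sketched in Python, for example to run it from a notebook. This is a minimal illustration using only the standard library; the helper names are hypothetical, and the sizes are apparent file sizes, so they can differ slightly from what `du` reports.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all files under `root` without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat counts a symlink itself, not its target, so blobs that
            # are linked from snapshot directories are not counted twice.
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


def cached_model_sizes(hub_dir="~/.cache/huggingface/hub"):
    """Map each `models--*` directory in the hub cache to its size in bytes."""
    hub = Path(hub_dir).expanduser()
    return {
        p.name: dir_size_bytes(p)
        for p in sorted(hub.glob("models--*"))
        if p.is_dir()
    }


if __name__ == "__main__":
    for name, size in cached_model_sizes().items():
        print(f"{size / 1e9:6.1f} GB  {name}")
```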

@petergreis
Author

I think something is hanging in the partitioner; is there an easy way to debug this?

@liushz
Collaborator

liushz commented Apr 11, 2024

Could you please provide the contents of your log? It can be found in output/WORK_DIR/logs. Is it empty?

@petergreis
Author

petergreis commented Apr 11, 2024

Nothing found like output/WORK_DIR/logs.

I only have outputs:

ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name output -print
ml@ng4x7onc3t:/notebooks/ChatMusician/eval$ find . -name outputs -print
./outputs

From outputs/default/20240411_124955/configs/20240411_124955.py:

20240411_124955.py.txt
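To check quickly whether a run wrote any logs, the newest timestamped directory under `outputs/default` can be inspected with a few lines of Python. This is a sketch: the `outputs/default/<timestamp>/configs` layout is taken from the path above, while the `logs` subdirectory name is an assumption based on the maintainer's hint, and both helper names are hypothetical.

```python
from pathlib import Path


def latest_run_dir(work_dir="outputs/default"):
    """Return the newest timestamped run directory, or None if there is none."""
    base = Path(work_dir)
    if not base.is_dir():
        return None
    # Names like 20240411_124955 sort chronologically as plain strings.
    runs = sorted(p for p in base.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def list_run_files(run_dir, subdir="logs"):
    """List files under `run_dir/subdir`, relative to `run_dir`."""
    base = Path(run_dir) / subdir
    if not base.is_dir():
        return []
    return sorted(p.relative_to(run_dir) for p in base.rglob("*") if p.is_file())


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("no run directory found")
    else:
        print(run)
        for name in ("logs", "configs"):
            print(name, [str(p) for p in list_run_files(run, name)])
```

An empty `logs` listing would suggest that the task stalled before it produced any output.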



@Leymore
Contributor

Leymore commented Apr 12, 2024

Please try the --debug option on the command line, e.g. python run.py configs/eval_chat_musician_7b.py --debug, and paste the output from the point where it gets stuck.

Also, does the following code run successfully?

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'm-a-p/ChatMusician'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map='cuda').eval()
prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt')
response = model.generate(input_ids=inputs['input_ids'].to(model.device),)
response = tokenizer.decode(response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

@petergreis
Author

Yes, your code runs successfully:

(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ python test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.72s/it]
/home/ml/anaconda3/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py:1132: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(


I'm doing well, thank you.


(opencompass) ml@nsuco72z84:/notebooks/ChatMusician/eval$ conda list | grep transformers
sentence-transformers     2.2.2                    pypi_0    pypi
transformers              4.39.3                   pypi_0    pypi

Solution

After a marathon deep dive, here is the TL;DR version. There was indeed a threading deadlock, as I suspected. It is caused by a conflict between Intel's Math Kernel Library (MKL) and the GNU OpenMP library (libgomp.so.1) in the platform environment I am using, DigitalOcean Paperspace.

The steps in brief:

  • Do a fresh install of OpenCompass
  • Copy over the config and dataset for ChatMusician:
    configs/eval_chat_musician_7b.py
    configs/datasets/music_theory_bench
  • Install the missing dependency: conda install python-Levenshtein -y
  • Pin the MKL threading layer to the GNU version:
    export MKL_THREADING_LAYER=GNU
  • Rerun: time python run.py configs/eval_chat_musician_7b.py

A few errors still occur, but it now runs on a 45GB A6000 at Paperspace. The first run took approximately 5h30m. Thank you all for your patience and feedback in tracking this down.
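For anyone launching the evaluation from Python or a notebook rather than a shell, the same pin can be applied in code. The sketch below assumes that the variable has to be set before NumPy or PyTorch is imported, since that is when MKL gets loaded; the helper name `pin_mkl_threading_layer` is hypothetical.

```python
import os


def pin_mkl_threading_layer(layer="GNU"):
    """Set MKL_THREADING_LAYER unless it is already set.

    Returns the value that is in effect afterwards, so an existing
    user setting is respected rather than overwritten.
    """
    return os.environ.setdefault("MKL_THREADING_LAYER", layer)


# Call this before importing numpy, torch or anything else linked against MKL.
effective = pin_mkl_threading_layer()
print(f"MKL_THREADING_LAYER={effective}")

# import torch  # heavy imports would go here, after the variable is set
```

Child processes started from this interpreter inherit the variable, so it should also reach tasks that are launched as subprocesses.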
