[Bug] open compass hangs when evaluating chat musician trained model - waiting for semaphore? #1034
Comments
After the first Ctrl-C to kill the job:
How long will the process be stuck? And can I see the content of your model config?
I am sorry, I don't understand what you're asking for here...
Just for reference, this has been stuck for 45 minutes:
It seems like the system is in the process of caching the model.
Models are cached, and it is still just sitting there (I have just re-run the predict code to download the models again):
I think something is hanging in the partitioner; is there an easy way to debug this?
Could you please provide the contents of your log? It can be found in output/WORK_DIR/logs. Is it empty?
Nothing like output/WORK_DIR/logs was found. I only have outputs:
from outputs/default/20240411_124955/configs/20240411_
Please try. Besides, will the following code run successfully?

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'm-a-p/ChatMusician'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map='cuda').eval()

prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt')
response = model.generate(input_ids=inputs['input_ids'].to(model.device))
response = tokenizer.decode(response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```
Yes, your code runs successfully:
Solution

After a marathon deep dive, here is the TL;DR: there was indeed a threading deadlock, as I suspected. It is a conflict between Intel's Math Kernel Library (MKL) and the GNU OpenMP library (libgomp.so.1) in the platform environment I am using (DigitalOcean Paperspace). The steps in brief:
A few errors result, but it is running on a 45GB A6000 at Paperspace. The first runtime was approximately 5h30m. Thank you all for your patience and feedback in tracking this down.
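For reference, a common mitigation for this class of MKL-vs-libgomp deadlock (a general sketch, not necessarily the exact steps taken above) is to force MKL onto the GNU threading layer before any MKL-backed library is imported:

```python
import os

# Force Intel MKL to use the GNU OpenMP threading layer so it does not
# fight with an already-loaded libgomp.so.1. This must be set before
# NumPy/PyTorch (and therefore MKL) are imported for the first time.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# Only import MKL-backed libraries after the variable is in place, e.g.:
# import torch
```

Equivalently, `export MKL_THREADING_LAYER=GNU` in the shell before running `python run.py ...`.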
Prerequisite
Type
I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.
Environment
Reproduces the problem - code/configuration sample
Reproduces the problem - command or script
Note that OpenCompass is used in the evaluation of the ChatMusician project.

```shell
python run.py configs/eval_chat_musician_7b.py
```
Reproduces the problem - error message
Other information
Letting this run for a while, I see no progress. When killed with Ctrl-C, here is what is of interest:
```
File "/home/ml/anaconda3/envs/opencompass/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
```
It looks like it is stuck waiting for a thread. Any ideas?
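When a Python process hangs like this, one way to see where every thread is blocked, without killing the process, is the standard-library faulthandler module. A minimal sketch (the `worker` thread here is just a stand-in for whatever OpenCompass is waiting on):

```python
import faulthandler
import sys
import threading
import time

def worker():
    # Stand-in for a long-running or stuck thread.
    time.sleep(5)

t = threading.Thread(target=worker, daemon=True)
t.start()

# Print every thread's current stack to stderr without stopping the process.
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

On POSIX systems, `faulthandler.register(signal.SIGUSR1)` lets you trigger the same dump from outside with `kill -USR1 <pid>`; external tools such as py-spy can also attach to an already-hung process.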