
There was an error when I ran finetune.py. #14

Closed
ctjian opened this issue Mar 29, 2023 · 4 comments

Comments

ctjian commented Mar 29, 2023

Traceback (most recent call last):
File "finetune.py", line 231, in
trainer.train()
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 1648, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 1911, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 2657, in training_step
loss = self.compute_loss(model, inputs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/trainer.py", line 2689, in compute_loss
outputs = model(**inputs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 3 on device 3.
Original Traceback (most recent call last):
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/peft/peft_model.py", line 538, in forward
**kwargs,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 714, in forward
return_dict=return_dict,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 590, in forward
None,
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 581, in custom_forward
return module(*inputs, output_attentions, None)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 324, in forward
hidden_states = self.mlp(hidden_states)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/transformers/models/llama/modeling_llama.py", line 155, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/jct/.conda/envs/alpaca_cot_envs/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x3 and 4x4096)
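The final RuntimeError is PyTorch's standard inner-dimension check for matrix multiplication: `A @ B` requires the column count of `A` to equal the row count of `B`. Here the quantized weight slice (`state.subB`) has 4 rows while the replica's input slice (`subA`) ended up with 3 columns, which is consistent with `DataParallel` scattering tensors across a different number of replicas than the weights were prepared for. A minimal sketch of the shape rule (pure Python; the function name is illustrative, not part of PyTorch):

```python
def matmul_shape(a_shape, b_shape):
    """Return the result shape of a 2-D matmul A @ B, mirroring
    torch.matmul's rule that the inner dimensions must agree."""
    (m, k1), (k2, n) = a_shape, b_shape
    if k1 != k2:
        # Same wording PyTorch uses in its error message.
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied ({m}x{k1} and {k2}x{n})"
        )
    return (m, n)

# A compatible pair multiplies fine:
print(matmul_shape((2048, 4096), (4096, 4096)))  # (2048, 4096)

# The failing pair from the traceback: inner dims 3 != 4.
try:
    matmul_shape((2048, 3), (4, 4096))
except ValueError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (2048x3 and 4x4096)
```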

PhoebusSi (Owner) commented:

It looks like the number of GPUs detected (3 or 4) differs from the number you passed in (4 or 3).


ctjian commented Mar 29, 2023

> It looks like the number of GPUs detected (3 or 4) differs from the number you passed in (4 or 3).

Although the server has four GPUs, I am training on a single GPU:
python3 finetune.py --size 7 --data alpaca-belle-cot
How do I fix it?

PhoebusSi (Owner) commented:

Please limit the number of visible GPUs, for example:

CUDA_VISIBLE_DEVICES=0 python3 finetune.py --size 7 --data alpaca-belle-cot
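If the shell prefix is inconvenient, the same restriction can be set from inside the script. A minimal sketch, assuming the variable is assigned before `torch` (or anything else that initializes CUDA) is imported — otherwise PyTorch has already enumerated all four GPUs and the setting is ignored:

```python
import os

# Expose only GPU 0 to this process; must happen before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Only now import CUDA-touching libraries:
# import torch
# import transformers
# ...rest of finetune.py...
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0
```

With a single visible device, `torch.cuda.device_count()` reports 1 and the Trainer no longer wraps the model in `DataParallel`, which avoids the replica shape mismatch above.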


ctjian commented Mar 29, 2023

Thank you so much for your help! Your contribution was incredibly helpful and I really appreciate it.
