[BUG]: type object 'ChunkManager' has no attribute 'search_chunk_size' #2166

Closed
Alfred-Duncan opened this issue Dec 21, 2022 · 16 comments
Labels
bug Something isn't working

Comments

@Alfred-Duncan

🐛 Describe the bug

When I train the diffusion model, the following error occurs:

Setting up LambdaLR scheduler...
Traceback (most recent call last):
File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
trainer.fit(model, data)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
call._call_and_handle_interrupt(
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
self.strategy.setup(self)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
self.setup_precision_plugin()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'
Setting up LambdaLR scheduler...
/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Summoning checkpoint.

Traceback (most recent call last):
File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
trainer.fit(model, data)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 578, in fit
call._call_and_handle_interrupt(
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 620, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1038, in _run
self.strategy.setup(self)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 333, in setup
self.setup_precision_plugin()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 270, in setup_precision_plugin
chunk_size = self.chunk_size or ChunkManager.search_chunk_size(
AttributeError: type object 'ChunkManager' has no attribute 'search_chunk_size'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 806, in <module>
melk()
File "/home/tongange/ColossalAI/examples/images/diffusion/main.py", line 789, in melk
trainer.save_checkpoint(ckpt_path)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1900, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 512, in save_checkpoint
_checkpoint = self.dump_checkpoint(weights_only)
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 444, in dump_checkpoint
"state_dict": self._get_lightning_module_state_dict(),
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 526, in _get_lightning_module_state_dict
state_dict = self.trainer.strategy.lightning_module_state_dict()
File "/root/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 383, in lightning_module_state_dict
assert isinstance(self.model, ZeroDDP)
AssertionError

Environment

I followed the training steps from the example below exactly:
https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion

@Alfred-Duncan Alfred-Duncan added the bug Something isn't working label Dec 21, 2022
@Alfred-Duncan
Author

Ah, I checked it. Under colossalai/gemini/chunk/manager,

the type object 'ChunkManager' really has no attribute 'search_chunk_size'.

I want to know where it went.
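
For anyone who wants to run the same check without opening the source tree, here is a minimal sketch. It assumes the module path `colossalai.gemini.chunk.manager` mentioned in the comment above (the path may differ in other colossalai versions), and the helper names `installed_version` and `class_has_attr` are made up for this example.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def class_has_attr(module_path, class_name, attr):
    """Return True/False if the class can be found, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    cls = getattr(module, class_name, None)
    if cls is None:
        return None
    return hasattr(cls, attr)


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to what `pip list` shows.
    for dist in ("colossalai", "pytorch-lightning", "lightning"):
        print(dist, installed_version(dist))
    # Module path taken from the comment above; it may not exist in every release.
    print(class_has_attr("colossalai.gemini.chunk.manager", "ChunkManager", "search_chunk_size"))
```

Posting the printed versions together with the result of the attribute check makes a version mismatch easier to spot.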

@feifeibear
Contributor

feifeibear commented Dec 21, 2022

Sorry for the bug. It is caused by a version mismatch between your colossalai and lightning installations.
There are two solutions:

  1. Install the latest lightning: clone the master branch and install it from source.
  2. Install colossalai v0.1.10.

@Alfred-Duncan
Author

Thank you for that, I'll try it later. @feifeibear

@Thomas2419

Hello, did this solution work for you?

@feifeibear
Contributor

@Alfred-Duncan Is your problem solved?

@FrankieDong

@feifeibear I have the same problem. I wanted to fine-tune the Stable Diffusion 2.0 model. Following the steps, I upgraded colossalai to 0.1.12 and installed lightning from source, but I still get the error "type object 'ChunkManager' has no attribute 'search_chunk_size'".

Also, can I fine-tune the SD 2.0 model now?

@Alfred-Duncan
Author

Yeah, it fixed this problem, but after that another problem appeared.

@feifeibear
Contributor

@Alfred-Duncan Could you please post another issue for the other problem?

@feifeibear
Contributor

search_chunk_size

Hey, the bug occurs when your colossalai version is lower than v0.1.10. I guess 0.1.12 was not installed correctly. Can you try one of the following?

  1. Remove the previously installed version and reinstall it from source.
  2. Install it from the official website.
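
To confirm which side of the v0.1.10 threshold an installed version falls on, a small comparison like the sketch below may help. The helpers `parse_version` and `is_older` are hypothetical names written for this example and are not part of colossalai or lightning. They compare numeric components rather than raw strings, because a plain string comparison would rank "0.1.9" above "0.1.10".

```python
import re


def parse_version(version):
    """Turn a version string such as '0.1.12' into a tuple of ints, e.g. (0, 1, 12)."""
    # Drop a local build suffix such as '+torch1.12cu11.3' before splitting.
    public = version.split("+")[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed, minimum):
    """Return True if `installed` is strictly lower than `minimum`."""
    return parse_version(installed) < parse_version(minimum)


if __name__ == "__main__":
    print(is_older("0.1.9", "0.1.10"))
    print(is_older("0.1.12", "0.1.10"))
```

This only handles simple dotted numeric versions; pre-release tags and other suffixes are ignored, so treat the result as a quick sanity check.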

@Alfred-Duncan
Author

I already posted it; check the issues after this one. Training is nearly working, though.

@Alfred-Duncan
Author

#2176

@feifeibear
Contributor

@Alfred-Duncan Thanks, I've already seen the issue. The related personnel will be back this afternoon. We will try to reproduce your bug ASAP.

@Alfred-Duncan
Author

@FrankieDong
I noticed that in this repository's training method, the main.py of ldm is from SD1.4, so maybe you should check the .py and the .yaml of the network? I'm not sure whether it can be used with the SD2 model.
Maybe you can try the example first, on SD1.4 and the 'teyvat' way...

@Alfred-Duncan
Author

@Thomas2419
#2176
'half' work

@FrankieDong

@FrankieDong I noticed that in this repository's training method, the main.py of ldm is from SD1.4, so maybe you should check the .py and the .yaml of the network? I'm not sure whether it can be used with the SD2 model. Maybe you can try the example first, on SD1.4 and the 'teyvat' way...

I found the problem and reinstalled lightning from source, and it works now. Another problem has occurred, which I am working on.

@feifeibear
Contributor

@FrankieDong nice! I closed the issue.
