
[BUG]: examples/images/diffusion ran failed #1951

Closed
GxjGit opened this issue Nov 15, 2022 · 13 comments
Labels: bug (Something isn't working)

Comments

GxjGit commented Nov 15, 2022

🐛 Describe the bug

I ran the diffusion example following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion:
steps:
conda env create -f environment.yaml
conda activate ldm
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa
pip install -r requirements.txt && pip install .

dataset:
laion-400m

run:
bash train.sh

Failure info:

/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811, in
trainer.fit(model, data)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
closure_result = closure()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in call
self._result = self.closure(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
step_output = self._step_fn()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
return self.model(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
outputs = self.module(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
output = self._forward_module.training_step(*inputs, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
loss, loss_dict = self.shared_step(batch)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
loss = self(x, c)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
return self.p_losses(x, c, t, *args, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
model_output = self.apply_model(x_noisy, t, cond)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
out = self.diffusion_model(x, t, context=cc)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
h = th.cat([h, hs.pop()], dim=1)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in torch_function
ret = func(*args, kwargs)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

Environment

(environment details were provided as a screenshot)

GxjGit added the bug label on Nov 15, 2022

binmakeswell (Member)

Hi @GxjGit, thank you for your feedback. We will try to reproduce your issue and fix it soon.

Fazziekey (Contributor)

Can you upload your train.yaml?

GxjGit (Author) commented Nov 15, 2022

I have not modified the YAML settings.
In addition, since I could not find the pretrained model, I commented out the code that loads the pretrained weights for UNetModel and AutoencoderKL. I am not sure whether that is related.

(screenshots of the commented-out pretrained-model loading code)

model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: false   # Note: different from the one we trained before
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False
    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1.e-4 ]
        f_min: [ 1.e-10 ]
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin'
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: False
        legacy: False
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/vae/diffusion_pytorch_model.bin'
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      params:
        use_fp16: True
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 64
    wrap: False
    train:
      target: ldm.data.base.Txt2ImgIterableBaseDataset
      params:
        file_path: "/home/notebook/data/group/huangxin/laion-400m/e-commerce/e-commerce-0.tsv"
        world_size: 1
        rank: 0
lightning:
  trainer:
    accelerator: 'gpu' 
    devices: 4
    log_gpu_memory: all
    max_epochs: 2
    precision: 16
    auto_select_gpus: False
    strategy:
      target: pytorch_lightning.strategies.ColossalAIStrategy
      params:
        use_chunk: False
        enable_distributed_storage: True
        placement_policy: cuda
        force_outputs_fp32: False
    log_every_n_steps: 2
    logger: True
    default_root_dir: "/tmp/diff_log/"
    profiler: pytorch
  logger_config:
    wandb:
      target: pytorch_lightning.loggers.WandbLogger
      params:
          name: nowname
          save_dir: "/tmp/diff_log/"
          offline: opt.debug
          id: nowname

Fazziekey (Contributor)

Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

liuchenbaidu

Running conda env create -f environment.yaml gives:

ResolvePackageNotFound:

  • cudatoolkit=11.3
  • libgcc-ng[version='>=9.3.0']
  • __glibc[version='>=2.17']
  • cudatoolkit=11.3
  • libstdcxx-ng[version='>=9.3.0']

Can it run on CPU?

GxjGit (Author) commented Nov 15, 2022

> Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

OK, I am downloading it and will try again.

GxjGit (Author) commented Nov 15, 2022

> Running conda env create -f environment.yaml gives ResolvePackageNotFound (cudatoolkit=11.3, libgcc-ng, __glibc, libstdcxx-ng). Can it run on CPU?

I ran it in a GPU environment.

GxjGit (Author) commented Nov 15, 2022

> Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

@Fazziekey I have updated the pretrained model and the code, but I encountered the same problem.

How should we interpret this error:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

What does "size 8" stand for?

flynnamy

Hi, could you please share a direct link to the pretrained model? I only found some *.ckpt models. @GxjGit

GxjGit (Author) commented Nov 15, 2022

> Hi, could you please share a direct link to the pretrained model? I only found some *.ckpt models. @GxjGit

Look at this page and click the "Files and versions" tab:
(screenshot of the Hugging Face "Files and versions" page)

You can also download it from the command line:
(screenshot of the download command)
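
Alternatively, a minimal sketch using the huggingface_hub Python package (assuming it is installed; the stable-diffusion-v1-4 repository is gated, so you may first need to accept its license on the Hugging Face website and log in with huggingface-cli login):

from huggingface_hub import snapshot_download

# Downloads the whole repository snapshot, including
# unet/diffusion_pytorch_model.bin and vae/diffusion_pytorch_model.bin.
local_dir = snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")
print(local_dir)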

GxjGit (Author) commented Nov 15, 2022

> @Fazziekey I have updated the pretrained model and the code, but I encountered the same problem. How should we interpret this error: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list. What does "size 8" stand for?

@Fazziekey I have solved my problem. The cause was an incorrect image size. The images in my dataset have varying sizes, so training first reported "RuntimeError: stack expects each tensor to be equal size, but got [140, 140, 3] at entry 0 and [300, 500, 3] at entry 1". I then resized them to 224 x 224, which produced the error above. When I changed the resize to 256 x 256, matching the YAML setting, it ran successfully.

However, I cannot find a resize operation in the original code. Are the images in your dataset all a fixed size of 256 x 256?
(screenshot of the dataset loading code)

I suggest adding a description of the resolution requirements for the input dataset images. Anyway, thanks a lot.
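
The 224 vs 256 behaviour can be traced through the spatial sizes: the AutoencoderKL downsamples by a factor of 8, and with channel_mult [1, 2, 4, 4] the UNet halves the latent three more times, then upsamples by 2 at each decoder level and concatenates the stored skip map. So in this configuration the image size effectively has to be a multiple of 64, which 256 is and 224 is not. A rough sketch of that arithmetic (the ceil-division behaviour of the stride-2 downsampling convolutions is an assumption about this particular UNet):

import math

def skip_sizes(image_size, n_down=3):
    latent = image_size // 8                  # AutoencoderKL: 256 -> 32, 224 -> 28
    sizes = [latent]
    for _ in range(n_down):                   # UNet encoder path
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

for image_size in (256, 224):
    sizes = skip_sizes(image_size)
    upsampled = sizes[-1] * 2                 # first decoder upsample
    skip = sizes[-2]                          # matching encoder skip map
    print(image_size, sizes, upsampled, "vs", skip)

# 256 -> [32, 16, 8, 4], first concat is 8 vs 8: sizes match
# 224 -> [28, 14, 7, 4], first concat is 8 vs 7: "Expected size 8 but got size 7"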

flynnamy commented Nov 15, 2022

Thanks! @GxjGit I downloaded it and ran into some problems: 1. some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModelZero; 2. errors like the following:
(screenshot of the error output)

Fazziekey (Contributor)

> @Fazziekey I have solved my problem. The cause was an incorrect image size. [...] When I changed the resize to 256 x 256, matching the YAML setting, it ran successfully. But I cannot find a resize operation in the original code. Are the images in your dataset all a fixed size of 256 x 256?

Yes, the input image size must be 256 x 256 for Latent Diffusion.
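
A minimal sketch of a preprocessing step that enforces this, assuming PIL and torchvision are available; the actual hook inside ldm.data.base.Txt2ImgIterableBaseDataset may expect a different layout or normalization, so adapt as needed:

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),                                   # shorter side -> 256
    transforms.CenterCrop(256),                               # force 256 x 256
    transforms.ToTensor(),                                    # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),   # map to [-1, 1]
])

img = Image.open("example.jpg").convert("RGB")                # hypothetical input file
x = preprocess(img)
assert x.shape[-2:] == (256, 256)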
