
[BUG]: examples/images/diffusion ran failed #1951

Closed
GxjGit opened this issue Nov 15, 2022 · 13 comments
Labels: bug (Something isn't working)

Comments

GxjGit commented Nov 15, 2022

🐛 Describe the bug

I ran the diffusion example following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion:
steps:
conda env create -f environment.yaml
conda activate ldm
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa
pip install -r requirements.txt && pip install .

dataset:
laion-400m

run:
bash train.sh

Failure info:

/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811, in
trainer.fit(model, data)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
closure_result = closure()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in call
self._result = self.closure(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
step_output = self._step_fn()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
return self.model(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
outputs = self.module(*args, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
output = self._forward_module.training_step(*inputs, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
loss, loss_dict = self.shared_step(batch)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
loss = self(x, c)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
return self.p_losses(x, c, t, *args, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
model_output = self.apply_model(x_noisy, t, cond)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
x_recon = self.model(x_noisy, t, **cond)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
out = self.diffusion_model(x, t, context=cc)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
h = th.cat([h, hs.pop()], dim=1)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in torch_function
ret = func(*args, kwargs)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

Environment

(environment details were provided as a screenshot)

GxjGit added the bug label on Nov 15, 2022

binmakeswell (Member)

Hi @GxjGit, thank you for your feedback. We will try to reproduce your issue and fix it soon.

Fazziekey (Contributor)

Can you upload your train.yaml?

GxjGit (Author) commented Nov 15, 2022

I have not modified the YAML settings.
In addition, since I could not find the pretrained model, I commented out the code that loads the pretrained weights for UNetModel and AutoencoderKL. I am not sure whether that is related.

(screenshots of the commented-out pretrained-model loading code)

model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: false   # Note: different from the one we trained before
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False
    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1.e-4 ]
        f_min: [ 1.e-10 ]
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin'
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: False
        legacy: False
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/vae/diffusion_pytorch_model.bin'
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      params:
        use_fp16: True
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 64
    wrap: False
    train:
      target: ldm.data.base.Txt2ImgIterableBaseDataset
      params:
        file_path: "/home/notebook/data/group/huangxin/laion-400m/e-commerce/e-commerce-0.tsv"
        world_size: 1
        rank: 0
lightning:
  trainer:
    accelerator: 'gpu' 
    devices: 4
    log_gpu_memory: all
    max_epochs: 2
    precision: 16
    auto_select_gpus: False
    strategy:
      target: pytorch_lightning.strategies.ColossalAIStrategy
      params:
        use_chunk: False
        enable_distributed_storage: True
        placement_policy: cuda
        force_outputs_fp32: False
    log_every_n_steps: 2
    logger: True
    default_root_dir: "/tmp/diff_log/"
    profiler: pytorch
  logger_config:
    wandb:
      target: pytorch_lightning.loggers.WandbLogger
      params:
          name: nowname
          save_dir: "/tmp/diff_log/"
          offline: opt.debug
          id: nowname

Fazziekey (Contributor)

Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

liuchenbaidu

Running conda env create -f environment.yaml gives:

ResolvePackageNotFound:

  • cudatoolkit=11.3
  • libgcc-ng[version='>=9.3.0']
  • __glibc[version='>=2.17']
  • cudatoolkit=11.3
  • libstdcxx-ng[version='>=9.3.0']

Can it run on CPU?

GxjGit (Author) commented Nov 15, 2022

> Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

OK, I am downloading it and will try again.

GxjGit (Author) commented Nov 15, 2022

> Running conda env create -f environment.yaml gives ResolvePackageNotFound (cudatoolkit=11.3, libgcc-ng, __glibc, libstdcxx-ng). Can it run on CPU?

I ran it in a GPU environment.

GxjGit (Author) commented Nov 15, 2022

> Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

@Fazziekey I have updated the pretrained model and the code, but I encountered the same problem.

How should we interpret this error:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

What does "size 8" stand for?

flynnamy

Hi, could you please share a direct link to the pretrained model? I only found some *.ckpt models. @GxjGit

GxjGit (Author) commented Nov 15, 2022

> Hi, could you please share a direct link to the pretrained model? I only found some *.ckpt models. @GxjGit

Look at this page and click the "Files and versions" tab:
(screenshot of the Hugging Face "Files and versions" page)

You can also download it from the command line:
(screenshot of the download command)
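
Alternatively, a minimal sketch using the huggingface_hub Python package (assuming it is installed; the stable-diffusion-v1-4 repository is gated, so you may first need to accept its license on the Hugging Face website and log in with huggingface-cli login):

from huggingface_hub import snapshot_download

# Downloads the whole repository snapshot, including
# unet/diffusion_pytorch_model.bin and vae/diffusion_pytorch_model.bin.
local_dir = snapshot_download(repo_id="CompVis/stable-diffusion-v1-4")
print(local_dir)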

GxjGit (Author) commented Nov 15, 2022

> @Fazziekey I have updated the pretrained model and the code, but I encountered the same problem. How should we interpret this error: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list. What does "size 8" stand for?

@Fazziekey I have solved my problem. The cause was an incorrect image size. The images in my dataset have varying sizes, so training first reported "RuntimeError: stack expects each tensor to be equal size, but got [140, 140, 3] at entry 0 and [300, 500, 3] at entry 1". I then resized them to 224 x 224, which produced the error above. When I changed the resize to 256 x 256, matching the YAML setting, it ran successfully.

However, I cannot find a resize operation in the original code. Are the images in your dataset all a fixed size of 256 x 256?
(screenshot of the dataset loading code)

I suggest adding a description of the resolution requirements for the input dataset images. Anyway, thanks a lot.
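
The 224 vs 256 behaviour can be traced through the spatial sizes: the AutoencoderKL downsamples by a factor of 8, and with channel_mult [1, 2, 4, 4] the UNet halves the latent three more times, then upsamples by 2 at each decoder level and concatenates the stored skip map. So in this configuration the image size effectively has to be a multiple of 64, which 256 is and 224 is not. A rough sketch of that arithmetic (the ceil-division behaviour of the stride-2 downsampling convolutions is an assumption about this particular UNet):

import math

def skip_sizes(image_size, n_down=3):
    latent = image_size // 8                  # AutoencoderKL: 256 -> 32, 224 -> 28
    sizes = [latent]
    for _ in range(n_down):                   # UNet encoder path
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes

for image_size in (256, 224):
    sizes = skip_sizes(image_size)
    upsampled = sizes[-1] * 2                 # first decoder upsample
    skip = sizes[-2]                          # matching encoder skip map
    print(image_size, sizes, upsampled, "vs", skip)

# 256 -> [32, 16, 8, 4], first concat is 8 vs 8: sizes match
# 224 -> [28, 14, 7, 4], first concat is 8 vs 7: "Expected size 8 but got size 7"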

flynnamy commented Nov 15, 2022

Thanks! @GxjGit I downloaded it and ran into some problems: 1. some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModelZero; 2. errors like the following:
(screenshot of the error output)

Fazziekey (Contributor)

> @Fazziekey I have solved my problem. The cause was an incorrect image size. [...] When I changed the resize to 256 x 256, matching the YAML setting, it ran successfully. But I cannot find a resize operation in the original code. Are the images in your dataset all a fixed size of 256 x 256?

Yes, the input image size must be 256 x 256 for Latent Diffusion.
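
A minimal sketch of a preprocessing step that enforces this, assuming PIL and torchvision are available; the actual hook inside ldm.data.base.Txt2ImgIterableBaseDataset may expect a different layout or normalization, so adapt as needed:

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),                                   # shorter side -> 256
    transforms.CenterCrop(256),                               # force 256 x 256
    transforms.ToTensor(),                                    # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),   # map to [-1, 1]
])

img = Image.open("example.jpg").convert("RGB")                # hypothetical input file
x = preprocess(img)
assert x.shape[-2:] == (256, 256)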
