
Train stable diffusion finetune stopped at "Summoning checkpoint" #2347

Closed

yufengyao-lingoace opened this issue Jan 5, 2023 · 8 comments

@yufengyao-lingoace

My machine: 32 GB CPU RAM, 16 GB GPU, batch_size = 1. It seems ColossalAI is not working well.

{'accelerator': 'gpu', 'devices': 1, 'log_gpu_memory': 'all', 'max_epochs': 2, 'precision': 16, 'auto_select_gpus': False, 'strategy': {'target': 'strategies.ColossalAIStrategy', 'params': {'use_chunk': True, 'enable_distributed_storage': True, 'placement_policy': 'cuda', 'force_outputs_fp32': True}}, 'log_every_n_steps': 2, 'logger': True, 'default_root_dir': '/tmp/diff_log/'}
Running on GPU
Using FP16 = True
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using strategy: strategies.ColossalAIStrategy
Monitoring val/loss_simple_ema as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'lightning.pytorch.callbacks.ModelCheckpoint', 'params': {'dirpath': '/tmp/2023-01-05T10-52-57_train_colossalai_teyvat/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 3}}
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

....
....

Lightning config
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 2
  precision: 16
  auto_select_gpus: false
  strategy:
    target: strategies.ColossalAIStrategy
    params:
      use_chunk: true
      enable_distributed_storage: true
      placement_policy: cuda
      force_outputs_fp32: true
  log_every_n_steps: 2
  logger: true
  default_root_dir: /tmp/diff_log/
logger_config:
  wandb:
    target: loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname

/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/loggers/tensorboard.py:261: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
Epoch 0: 0%| | 0/234 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:233: UserWarning: You called self.log('global_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
Summoning checkpoint.
Killed

@Thomas2419

Are you using data workers? I found them to be unstable, though that may just be because I was using my own dataset.
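If that is the cause, one quick way to rule it out (a sketch against the DataModuleFromConfig section shown later in this thread, not a confirmed fix) is to disable the worker subprocesses entirely:

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 0   # 0 loads batches in the main process, bypassing worker instability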

@yufengyao-lingoace commented Jan 6, 2023

Are you using data workers? I found them to be unstable, though that may just be because I was using my own dataset.

I just ran the official example: python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml.

@yufengyao-lingoace

No, I just ran the official example: python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml

@yufengyao-lingoace

This is my configs/Teyvat/train_colossalai_teyvat.yaml:

model:
  base_learning_rate: 1.0e-4
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    parameterization: "v"
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: txt
    image_size: 64
    channels: 4
    cond_stage_trainable: false
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False # we set this to false because this is an inference only config

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1.e-4 ]
        f_min: [ 1.e-10 ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        use_checkpoint: True
        use_fp16: True
        image_size: 32 # unused
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_head_channels: 64 # need to fix for flash-attn
        use_spatial_transformer: True
        use_linear_in_transformer: True
        transformer_depth: 1
        context_dim: 1024
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          #attn_type: "vanilla-xformers"
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenOpenCLIPEmbedder
      params:
        freeze: True
        layer: "penultimate"

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 1
    train:
      target: ldm.data.teyvat.hf_dataset
      params:
        path: Fazzie/Teyvat
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 512
        - target: torchvision.transforms.RandomCrop
          params:
            size: 512
        - target: torchvision.transforms.RandomHorizontalFlip

lightning:
  trainer:
    accelerator: 'gpu'
    devices: 1
    log_gpu_memory: all
    max_epochs: 2
    precision: 16
    auto_select_gpus: True
    strategy:
      target: strategies.ColossalAIStrategy
      params:
        use_chunk: True
        enable_distributed_storage: True
        placement_policy: cuda
        force_outputs_fp32: true

    log_every_n_steps: 2
    logger: True

    default_root_dir: "/tmp/diff_log/"
    # profiler: pytorch

  logger_config:
    wandb:
      target: loggers.WandbLogger
      params:
        name: nowname
        save_dir: "/tmp/diff_log/"
        offline: opt.debug
        id: nowname

@yufengyao-lingoace

KiB Mem : 32408372 total, 200480 free, 31977184 used, 230708 buff/cache
KiB Swap: 18874364 total, 16226848 free, 2647516 used. 13100 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15065 ubuntu 20 0 59.505g 0.027t 151660 S 12.3 88.0 2:08.71 python

@yufengyao-lingoace

KiB Mem : 32408372 total, 197792 free, 31973816 used, 236764 buff/cache
KiB Swap: 18874364 total, 5554208 free, 13320156 used. 13156 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15065 ubuntu 20 0 65.129g 0.029t 152660 D 15.0 97.4 2:25.59 python

@yufengyao-lingoace commented Jan 6, 2023

Now it has used 48 GB of CPU memory, and it will be killed soon...

@yufengyao-lingoace

I know what happened: fit() raised an error, so the model tried to save its parameters ("Summoning checkpoint"), and that used a large amount of CPU memory. The error came from the GPU: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 14.76 GiB total capacity; 13.62 GiB already allocated; 9.75 MiB free; 13.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Then the question is: ColossalAI is supposed to use little GPU memory, so why is a 16 GB GPU not enough?
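For reference, two things that are commonly tried in this situation (a sketch against the strategy block above, not a verified fix for this issue): follow the allocator hint from the error by setting PYTORCH_CUDA_ALLOC_CONF (e.g. max_split_size_mb:128) in the environment before launching main.py, and let ColossalAI place parameters in CPU RAM instead of keeping everything on the 16 GB GPU:

lightning:
  trainer:
    strategy:
      target: strategies.ColossalAIStrategy
      params:
        use_chunk: True
        enable_distributed_storage: True
        placement_policy: cpu   # offload parameters to CPU RAM rather than 'cuda'
        force_outputs_fp32: True

Note that with only 32 GB of system RAM, CPU placement trades GPU pressure for CPU pressure, so it may simply move the bottleneck.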
