
Train stable diffusion finetune stopped at "Summoning checkpoint" #2347

Closed

yufengyao-lingoace opened this issue Jan 5, 2023 · 8 comments

@yufengyao-lingoace

My machine: 32 GB CPU RAM, 16 GB GPU, batch_size = 1. It seems ColossalAI is not working well.

{'accelerator': 'gpu', 'devices': 1, 'log_gpu_memory': 'all', 'max_epochs': 2, 'precision': 16, 'auto_select_gpus': False, 'strategy': {'target': 'strategies.ColossalAIStrategy', 'params': {'use_chunk': True, 'enable_distributed_storage': True, 'placement_policy': 'cuda', 'force_outputs_fp32': True}}, 'log_every_n_steps': 2, 'logger': True, 'default_root_dir': '/tmp/diff_log/'}
Running on GPU
Using FP16 = True
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using strategy: strategies.ColossalAIStrategy
Monitoring val/loss_simple_ema as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'lightning.pytorch.callbacks.ModelCheckpoint', 'params': {'dirpath': '/tmp/2023-01-05T10-52-57_train_colossalai_teyvat/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 3}}
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

....
....

Lightning config
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 2
  precision: 16
  auto_select_gpus: false
  strategy:
    target: strategies.ColossalAIStrategy
    params:
      use_chunk: true
      enable_distributed_storage: true
      placement_policy: cuda
      force_outputs_fp32: true
  log_every_n_steps: 2
  logger: true
  default_root_dir: /tmp/diff_log/
logger_config:
  wandb:
    target: loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname

/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/loggers/tensorboard.py:261: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
Epoch 0: 0%| | 0/234 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
warning_cache.warn(
/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:233: UserWarning: You called self.log('global_step', ...) in your training_step but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
Summoning checkpoint.
Killed

@Thomas2419

Are you using data workers? I found them to be unstable, though that may just be because I was using my own dataset.
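If that is the cause, one quick way to rule it out (a sketch against the DataModuleFromConfig section shown later in this thread, not a confirmed fix) is to disable the worker subprocesses entirely:

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 0   # 0 loads batches in the main process, bypassing worker instability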

@yufengyao-lingoace commented Jan 6, 2023

Are you using data workers? I found them to be unstable, though that may just be because I was using my own dataset.

I just ran the official example: python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml.

@yufengyao-lingoace

No, I just ran the official example: python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml

@yufengyao-lingoace

This is my configs/Teyvat/train_colossalai_teyvat.yaml:

model:
  base_learning_rate: 1.0e-4
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    parameterization: "v"
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: txt
    image_size: 64
    channels: 4
    cond_stage_trainable: false
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False # we set this to false because this is an inference only config

    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1.e-4 ]
        f_min: [ 1.e-10 ]

    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        use_checkpoint: True
        use_fp16: True
        image_size: 32 # unused
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_head_channels: 64 # need to fix for flash-attn
        use_spatial_transformer: True
        use_linear_in_transformer: True
        transformer_depth: 1
        context_dim: 1024
        legacy: False

    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          #attn_type: "vanilla-xformers"
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenOpenCLIPEmbedder
      params:
        freeze: True
        layer: "penultimate"

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 1
    train:
      target: ldm.data.teyvat.hf_dataset
      params:
        path: Fazzie/Teyvat
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 512
        - target: torchvision.transforms.RandomCrop
          params:
            size: 512
        - target: torchvision.transforms.RandomHorizontalFlip

lightning:
  trainer:
    accelerator: 'gpu'
    devices: 1
    log_gpu_memory: all
    max_epochs: 2
    precision: 16
    auto_select_gpus: True
    strategy:
      target: strategies.ColossalAIStrategy
      params:
        use_chunk: True
        enable_distributed_storage: True
        placement_policy: cuda
        force_outputs_fp32: true

    log_every_n_steps: 2
    logger: True

    default_root_dir: "/tmp/diff_log/"
    # profiler: pytorch

  logger_config:
    wandb:
      target: loggers.WandbLogger
      params:
        name: nowname
        save_dir: "/tmp/diff_log/"
        offline: opt.debug
        id: nowname

@yufengyao-lingoace

KiB Mem : 32408372 total, 200480 free, 31977184 used, 230708 buff/cache
KiB Swap: 18874364 total, 16226848 free, 2647516 used. 13100 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15065 ubuntu 20 0 59.505g 0.027t 151660 S 12.3 88.0 2:08.71 python

@yufengyao-lingoace

KiB Mem : 32408372 total, 197792 free, 31973816 used, 236764 buff/cache
KiB Swap: 18874364 total, 5554208 free, 13320156 used. 13156 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15065 ubuntu 20 0 65.129g 0.029t 152660 D 15.0 97.4 2:25.59 python

@yufengyao-lingoace commented Jan 6, 2023

Now it has used 48 GB of CPU memory, and it will be killed soon...

@yufengyao-lingoace

I know what happened: fit() raised an error, so the model tried to save its parameters ("Summoning checkpoint"), and that used a large amount of CPU memory. The error came from the GPU: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 14.76 GiB total capacity; 13.62 GiB already allocated; 9.75 MiB free; 13.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Then the question is: ColossalAI is supposed to use little GPU memory, so why is a 16 GB GPU not enough?
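For reference, two things that are commonly tried in this situation (a sketch against the strategy block above, not a verified fix for this issue): follow the allocator hint from the error by setting PYTORCH_CUDA_ALLOC_CONF (e.g. max_split_size_mb:128) in the environment before launching main.py, and let ColossalAI place parameters in CPU RAM instead of keeping everything on the 16 GB GPU:

lightning:
  trainer:
    strategy:
      target: strategies.ColossalAIStrategy
      params:
        use_chunk: True
        enable_distributed_storage: True
        placement_policy: cpu   # offload parameters to CPU RAM rather than 'cuda'
        force_outputs_fp32: True

Note that with only 32 GB of system RAM, CPU placement trades GPU pressure for CPU pressure, so it may simply move the bottleneck.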
