Stable Diffusion finetune training stopped at "Summoning checkpoint" #2347
Comments
Are you using data workers? I found them to be unstable, though that may just be because I was using my own dataset.
No, I just ran the official example: `python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml`
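(For anyone who does want to rule data workers out: forcing single-process loading is a one-line change. A minimal sketch with a placeholder dataset, not the repo's Teyvat loader:)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the repo's Teyvat data.
dataset = TensorDataset(torch.zeros(8, 3, 256, 256))

# num_workers=0 keeps all loading in the main process, so worker
# crashes or deadlocks are taken out of the picture entirely.
loader = DataLoader(dataset, batch_size=1, num_workers=0)
```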
This is my configs/Teyvat/train_colossalai_teyvat.yaml (sections collapsed):

```yaml
model:
  unet_config:
  cond_stage_config:
data:
lightning:
  default_root_dir: "/tmp/diff_log/"
  logger_config:
```
Memory usage from `top` while training:

```
KiB Mem : 32408372 total,   200480 free, 31977184 used,   230708 buff/cache
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
KiB Mem : 32408372 total,   197792 free, 31973816 used,   236764 buff/cache
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
```
It has now used 48 GB of CPU memory, and it will be killed soon ...
I know what happened: `fit()` raised an error, so the model saved its parameters, and that used a large amount of CPU memory. The error came from the GPU:

```
CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 14.76 GiB total capacity; 13.62 GiB already allocated; 9.75 MiB free; 13.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
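For reference, the workaround that error message points at is set through an environment variable before CUDA initializes. A sketch; the 128 MiB split size is an assumed starting value, not something verified on this setup:

```python
import os

# Must be set before torch initializes CUDA (top of main.py, or exported
# in the shell before launching). 128 MiB is an assumed starting value;
# the error message only says to set max_split_size_mb.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the allocator sees it
```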
My machine: 32 GB CPU RAM, 16 GB GPU, batch_size = 1. It seems ColossalAI is not working well.
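On the "Summoning checkpoint." line itself: the training script's crash handler saves a rescue checkpoint when `fit()` raises. A sketch of that pattern, using assumed names rather than the repo's literal code:

```python
import lightning.pytorch as pl

# Rough shape of the save-on-crash behavior; NOT the repo's literal code.
# trainer/model/data are whatever main.py built from the config.
def fit_with_rescue(trainer: pl.Trainer, model, data, ckpt_path="last.ckpt"):
    try:
        trainer.fit(model, data)
    except Exception:
        print("Summoning checkpoint.")
        # save_checkpoint materializes the full state dict on the host first;
        # with ~865M params that is several GB of CPU RAM at once, which is a
        # plausible (assumed) reason the process was then OOM-killed here.
        trainer.save_checkpoint(ckpt_path)
        raise
```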
```
{'accelerator': 'gpu', 'devices': 1, 'log_gpu_memory': 'all', 'max_epochs': 2, 'precision': 16, 'auto_select_gpus': False, 'strategy': {'target': 'strategies.ColossalAIStrategy', 'params': {'use_chunk': True, 'enable_distributed_storage': True, 'placement_policy': 'cuda', 'force_outputs_fp32': True}}, 'log_every_n_steps': 2, 'logger': True, 'default_root_dir': '/tmp/diff_log/'}
Running on GPU
Using FP16 = True
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using strategy: strategies.ColossalAIStrategy
Monitoring val/loss_simple_ema as checkpoint metric.
Merged modelckpt-cfg:
{'target': 'lightning.pytorch.callbacks.ModelCheckpoint', 'params': {'dirpath': '/tmp/2023-01-05T10-52-57_train_colossalai_teyvat/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': 'val/loss_simple_ema', 'save_top_k': 3}}
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
....
....
```
Lightning config

```yaml
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 2
  precision: 16
  auto_select_gpus: false
  strategy:
    target: strategies.ColossalAIStrategy
    params:
      use_chunk: true
      enable_distributed_storage: true
      placement_policy: cuda
      force_outputs_fp32: true
  log_every_n_steps: 2
  logger: true
  default_root_dir: /tmp/diff_log/
logger_config:
  wandb:
    target: loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname
```
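One untested thought on the OOM, using the same keys as the config above: ColossalAI's placement_policy controls where model data lives, and cpu or auto offload more aggressively to host RAM than cuda does, which may relieve a 16 GB GPU:

```yaml
strategy:
  target: strategies.ColossalAIStrategy
  params:
    use_chunk: true
    enable_distributed_storage: true
    placement_policy: cpu  # assumption: 'cpu'/'auto' offload more than 'cuda'; untested here
    force_outputs_fp32: true
```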
```
/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/loggers/tensorboard.py:261: UserWarning: Could not log computational graph to TensorBoard: The `model.example_input_array` attribute is not set or `input_array` was not given.
  rank_zero_warn(
Epoch 0:   0%|          | 0/234 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/utilities/data.py:85: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Summoning checkpoint.
Killed
```
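Side note on the batch_size warning in that log: it is harmless, but the fix Lightning suggests looks roughly like this inside a LightningModule (a sketch with a placeholder batch key and loss, not the repo's actual training_step):

```python
from lightning.pytorch import LightningModule

class Sketch(LightningModule):
    def training_step(self, batch, batch_idx):
        x = batch["image"]       # placeholder key; the repo's batches differ
        loss = x.float().mean()  # placeholder loss computation
        # Passing batch_size explicitly avoids the "ambiguous collection" warning.
        self.log("train/loss", loss, batch_size=x.shape[0])
        return loss
```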