
Google Drive disconnected and ended my training early #3892

Open
Linaqruf opened this issue Aug 3, 2023 · 4 comments
Linaqruf commented Aug 3, 2023

Describe the current behavior
The title says it all. I'm training SDXL models. Last night I put my datasets (latents and caption files) in Google Drive; today, mid-training, the Google Drive mount disconnected and the training ended early. I have Colab Pro+ and I'm disappointed.

Describe the expected behavior
Training keeps running, and the Google Drive mount never disconnects.

What web browser are you using?
Chrome

Additional context

steps:  38% 3753/10000 [3:07:50<5:12:39,  3.00s/it, loss=0.128]
Traceback (most recent call last):
  File "/content/kohya-trainer/sdxl_train.py", line 649, in <module>
    train(args)
  File "/content/kohya-trainer/sdxl_train.py", line 371, in train
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataset.py", line 243, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/content/kohya-trainer/library/train_util.py", line 1045, in __getitem__
    latents, original_size, crop_ltrb, flipped_latents = load_latents_from_disk(image_info.latents_npz)
  File "/content/kohya-trainer/library/train_util.py", line 1883, in load_latents_from_disk
    npz = np.load(npz_path)
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/AnimagineXL/AnimagineXL-dataset/girl_cute_gothic_many_decorations_bouquet_gold_color_9afcee8a-597d-47a9-86ef-e749aa1ce9a8.npz'


wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 
wandb: Run history:
wandb: loss ▆▅▅▅▅▇▂▂▅▅▃▄▄▃▇▄▃▅▄▂▁▆▃▅▇▆▃▆▇▅█▅▄▂▅▃▃▃█▂
wandb:   lr ▁███████████████████████████████████████
wandb: 
wandb: Run summary:
wandb: loss 0.14937
wandb:   lr 0.0
wandb: 
wandb: 🚀 View run dry-night-4 at: https://wandb.ai/linaqruf/animagine-xl-real/runs/uc0q3jh7
wandb: ️⚡ View job at https://wandb.ai/linaqruf/animagine-xl-real/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjg3NDk2OTk2/version_details/v1
wandb: Synced 5 W&B file(s), 18 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: /content/fine_tune/logs/20230803014004/wandb/run-20230803_014036-uc0q3jh7/logs
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 918, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 580, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'sdxl_train.py', '--sample_prompts=/content/fine_tune/config/sample_prompt.toml', '--config_file=/content/fine_tune/config/config_file.toml', '--wandb_api_key=']' returned non-zero exit status 1.
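The crash originates in `train_util.load_latents_from_disk`, which calls `np.load` directly, so a single FUSE hiccup on the Drive mount is fatal to the whole run. A small retry wrapper could paper over brief dropouts. This is only a sketch: `load_npz_with_retry` is not part of kohya-trainer, and the retry counts are guesses.

```python
import time

import numpy as np


def load_npz_with_retry(path, retries=5, delay=10.0):
    """Load an .npz file, retrying on FileNotFoundError.

    A transient Drive FUSE dropout can make a file briefly
    disappear under /content/drive; waiting and retrying gives
    the mount a chance to recover before the run is aborted.
    """
    for attempt in range(retries):
        try:
            return np.load(path)
        except FileNotFoundError:
            if attempt == retries - 1:
                raise  # still missing after all retries: give up
            time.sleep(delay)
```

Patching something like this into `load_latents_from_disk` would let training survive dropouts of up to roughly `retries * delay` seconds; a full unmount would still fail once the retries are exhausted.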
@Linaqruf Linaqruf added the bug label Aug 3, 2023
Linaqruf commented Aug 3, 2023

It happened again and wasted 150 compute units today. I thought that by subscribing to Google One I could reliably save and load my dataset from Google Drive. Please do something about this, thanks.

steps:  43% 4261/10000 [3:34:31<4:48:55,  3.02s/it, loss=0.127]
Traceback (most recent call last):
  File "/content/kohya-trainer/sdxl_train.py", line 649, in <module>
    train(args)
  File "/content/kohya-trainer/sdxl_train.py", line 371, in train
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataset.py", line 243, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/content/kohya-trainer/library/train_util.py", line 1045, in __getitem__
    latents, original_size, crop_ltrb, flipped_latents = load_latents_from_disk(image_info.latents_npz)
  File "/content/kohya-trainer/library/train_util.py", line 1883, in load_latents_from_disk
    npz = np.load(npz_path)
  File "/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/AnimagineXL/AnimagineXL-dataset/kikkkaharu_Cheerful_girl_about_19_years_old_cheerful_expression_2b6a7f04-3707-4473-a615-fd412b4ee75d.npz'


wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 
wandb: Run history:
wandb: loss ▄▄▄▄▁▇▅▇▄▅▅▁▆▄▁▃▄▅█▁▆▁▂▇▄▂▆▂▁▂▃▆▂▄▅▂▂▇▅▁
wandb:   lr ▁███████████████████████████████████████
wandb: 
wandb: Run summary:
wandb: loss 0.12345
wandb:   lr 0.0
wandb: 
wandb: 🚀 View run worldly-microwave-1 at: https://wandb.ai/linaqruf/animagine-xl-real2/runs/lxx7ztt9
wandb: ️⚡ View job at https://wandb.ai/linaqruf/animagine-xl-real2/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjg3NjEwNjYw/version_details/v0
wandb: Synced 5 W&B file(s), 21 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: /content/fine_tune/logs/20230803072957/wandb/run-20230803_073030-lxx7ztt9/logs
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 918, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 580, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'sdxl_train.py', '--sample_prompts=/content/fine_tune/config/sample_prompt.toml', '--config_file=/content/fine_tune/config/config_file.toml', '--wandb_api_key=']' returned non-zero exit status 1.
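Since the root cause is the Drive FUSE mount itself, the most robust workaround is probably not to read training data through it at all: copy the dataset onto the Colab VM's local disk once at startup and point the trainer there. A minimal sketch, where the helper name is made up and the paths merely mirror the ones in the traceback:

```python
import shutil


def stage_dataset_locally(drive_dir, local_dir):
    """Copy the dataset off the Drive mount onto the VM's local disk.

    Training then reads only local files, so a mid-run Drive
    disconnect can no longer make .npz latents vanish under the
    DataLoader workers.
    """
    shutil.copytree(drive_dir, local_dir, dirs_exist_ok=True)
    return local_dir


# Example (hypothetical paths, matching the traceback above):
# train_data_dir = stage_dataset_locally(
#     "/content/drive/MyDrive/AnimagineXL/AnimagineXL-dataset",
#     "/content/AnimagineXL-dataset",
# )
```

The trade-off is a one-time copy cost at startup and enough free space on the VM's disk, but after that the run is independent of the mount's health.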

@mpdx-mods

This comment was marked as abuse.

@xerxes-k

Something similar happened to me. In my case Google Drive can't save anything after training. Users seem to have raised this issue many times; can somebody do something? If it goes on like this, Pro+ is useless for me.
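For the save-side failures, one option is to write checkpoints to the VM's local disk and copy them to Drive afterwards, verifying that the copy actually landed. The helper below is a sketch (the name, signature, and retry counts are made up for illustration); comparing file sizes catches a silently truncated copy, and retrying gives a flaky mount a chance to recover.

```python
import os
import shutil
import time


def save_to_drive(local_path, drive_dir, retries=3, delay=30.0):
    """Copy a local checkpoint to Drive and verify the copy by size."""
    os.makedirs(drive_dir, exist_ok=True)
    dst = os.path.join(drive_dir, os.path.basename(local_path))
    for attempt in range(retries):
        try:
            shutil.copy2(local_path, dst)
            if os.path.getsize(dst) == os.path.getsize(local_path):
                return dst  # copy verified
        except OSError:
            pass  # mount hiccup: fall through and retry
        if attempt < retries - 1:
            time.sleep(delay)
    raise OSError(f"could not copy {local_path} to {drive_dir}")
```

A byte-for-byte hash comparison would be stricter than a size check, at the cost of re-reading multi-gigabyte checkpoints through the mount.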

@foureyednymph

Please fix this!


5 participants