Google Drive disconnected and it makes my training ended early #3892
It happened again and wasted 150 compute units today. I thought that by subscribing to Google One I could easily save and load my dataset from Google Drive. Please do something, thanks.

steps:  43% 4261/10000 [3:34:31<4:48:55, 3.02s/it, loss=0.127]
Traceback (most recent call last):
File "/content/kohya-trainer/sdxl_train.py", line 649, in <module>
train(args)
File "/content/kohya-trainer/sdxl_train.py", line 371, in train
for step, batch in enumerate(train_dataloader):
File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 388, in __iter__
next_batch = next(dataloader_iter)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 644, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataset.py", line 243, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/content/kohya-trainer/library/train_util.py", line 1045, in __getitem__
latents, original_size, crop_ltrb, flipped_latents = load_latents_from_disk(image_info.latents_npz)
File "/content/kohya-trainer/library/train_util.py", line 1883, in load_latents_from_disk
npz = np.load(npz_path)
File "/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py", line 407, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/AnimagineXL/AnimagineXL-dataset/kikkkaharu_Cheerful_girl_about_19_years_old_cheerful_expression_2b6a7f04-3707-4473-a615-fd412b4ee75d.npz'
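The traceback shows `np.load` failing on a cached-latents `.npz` file read through the Drive FUSE mount. A common workaround (a sketch of my own, not an official fix, with hypothetical paths) is to copy the dataset to the Colab VM's local disk before training, so the DataLoader never reads through the mount:

```python
import os
import shutil

def stage_dataset(src: str, dst: str) -> str:
    """Copy a dataset directory to local disk if it is not already staged.

    Reading .npz latents through the Drive FUSE mount can fail mid-epoch
    when the mount drops; reading from the VM's local disk avoids that.
    """
    if not os.path.isdir(dst):
        shutil.copytree(src, dst)
    return dst

# Hypothetical paths; substitute your own Drive folder.
# local_dataset = stage_dataset(
#     "/content/drive/MyDrive/AnimagineXL/AnimagineXL-dataset",
#     "/content/dataset",
# )
```

You would then point `config_file.toml` at the local copy instead of the Drive path.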
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:
wandb: Run history:
wandb: loss ▄▄▄▄▁▇▅▇▄▅▅▁▆▄▁▃▄▅█▁▆▁▂▇▄▂▆▂▁▂▃▆▂▄▅▂▂▇▅▁
wandb: lr ▁███████████████████████████████████████
wandb:
wandb: Run summary:
wandb: loss 0.12345
wandb: lr 0.0
wandb:
wandb: 🚀 View run worldly-microwave-1 at: https://wandb.ai/linaqruf/animagine-xl-real2/runs/lxx7ztt9
wandb: ️⚡ View job at https://wandb.ai/linaqruf/animagine-xl-real2/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjg3NjEwNjYw/version_details/v0
wandb: Synced 5 W&B file(s), 21 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: /content/fine_tune/logs/20230803072957/wandb/run-20230803_073030-lxx7ztt9/logs
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/local/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.p │
│ y:45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:918 in │
│ launch_command │
│ │
│ 915 │ elif defaults is not None and defaults.compute_environment == Comp │
│ 916 │ │ sagemaker_launcher(defaults, args) │
│ 917 │ else: │
│ ❱ 918 │ │ simple_launcher(args) │
│ 919 │
│ 920 │
│ 921 def main(): │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:580 in │
│ simple_launcher │
│ │
│ 577 │ process.wait() │
│ 578 │ if process.returncode != 0: │
│ 579 │ │ if not args.quiet: │
│ ❱ 580 │ │ │ raise subprocess.CalledProcessError(returncode=process.ret │
│ 581 │ │ else: │
│ 582 │ │ │ sys.exit(1) │
│ 583 │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/usr/bin/python3', 'sdxl_train.py',
'--sample_prompts=/content/fine_tune/config/sample_prompt.toml',
'--config_file=/content/fine_tune/config/config_file.toml',
'--wandb_api_key=']' returned non-zero
exit status 1.
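The worker process dies on the first `FileNotFoundError`, even though a brief Drive dropout might recover within seconds. A hypothetical shim (not part of kohya-trainer; `load_latents_from_disk` in `library/train_util.py` would call it in place of a bare `np.load`) could retry transient failures:

```python
import time

def load_with_retry(loader, path, retries=5, delay=10.0):
    """Call loader(path), retrying on FileNotFoundError.

    Hypothetical helper for transient Drive FUSE dropouts; in
    library/train_util.py the loader would be np.load.
    """
    last_exc = None
    for _ in range(retries):
        try:
            return loader(path)
        except FileNotFoundError as exc:
            last_exc = exc
            time.sleep(delay)  # give the mount time to recover
    raise last_exc

# e.g. npz = load_with_retry(np.load, image_info.latents_npz)
```

This only papers over short dropouts; if the mount is gone for good, the final retry still raises and training ends as before.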
Something similar happened to me. In my case, Google Drive can't save anything after training. Users seem to have raised this issue many times; can somebody do anything? If it goes on like this, Pro+ is useless for me.
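For the save-side failure, writing large checkpoints straight to the Drive mount can fail silently. A sketch of a defensive alternative (hypothetical helper, not from this repo): save to the VM's local disk first, then copy to Drive and verify the copy landed:

```python
import os
import shutil

def save_then_sync(local_path: str, drive_path: str) -> bool:
    """Copy a finished checkpoint from local disk to Drive and verify it.

    Hypothetical helper: comparing file sizes after the copy makes a
    silent Drive FUSE write failure visible instead of losing the run.
    """
    shutil.copy2(local_path, drive_path)
    return (
        os.path.exists(drive_path)
        and os.path.getsize(drive_path) == os.path.getsize(local_path)
    )
```

If the check returns False you can retry the copy or at least keep the local file before the Colab VM is recycled.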
Please fix this!
Describe the current behavior
The title says it all. I train SDXL models. Last night I put my datasets (latents and caption files) in Google Drive; today, while training, Google Drive disconnected and that ended the training early. I have Colab Pro+ and I'm disappointed.
Describe the expected behavior
Training keeps running and Google Drive never disconnects.
What web browser you are using
Chrome
Additional context