Error when reproducing the pre-training #16

Open
richardsun-voyager opened this issue Apr 10, 2024 · 12 comments

@richardsun-voyager

I tried to reproduce the pre-training experiment using the command

python -m train   experiment=hg38/hg38   callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500   dataset.max_length=1024   dataset.batch_size=1024   dataset.mlm=true   dataset.mlm_probability=0.15   dataset.rc_aug=false   model=caduceus   model.config.d_model=64   model.config.n_layer=1   model.config.bidirectional=true   model.config.bidirectional_strategy=add   model.config.bidirectional_weight_tie=true   model.config.rcps=true   optimizer.lr="8e-3"   train.global_batch_size=8   trainer.max_steps=10000   +trainer.val_check_interval=100   wandb=null

But partway through the first epoch, the run failed with the following error:

 File "/home/users/**/miniforge3/envs/caduceus_env/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:   0%|▏                                                                                               | 245/122081 [02:04<17:07:50,  1.98it/s, loss=1.23]

Could you please look into this issue?
Thanks!
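
For reference, the RuntimeError comes from PyTorch's default_collate, which stacks all samples in a batch and therefore needs them to be the same length. A minimal standalone sketch (not code from this repo) that reproduces the same message:

import torch
from torch.utils.data import default_collate

# One empty sample next to a normal 1024-token sample, as in the traceback above,
# makes torch.stack (and hence default_collate) fail.
batch = [torch.zeros(0, dtype=torch.long), torch.zeros(1024, dtype=torch.long)]
default_collate(batch)
# RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1

So the question is why the dataset occasionally returns a length-0 sequence.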

@zhan8855

The same issue!

@yair-schiff
Contributor

I haven't seen this error before. Seems like you are hitting an empty batch randomly during the first epoch. Can you do:

export HYDRA_FULL_ERROR=1

and re-run to get a fuller stack trace, then post it here?

@zhan8855

For me, the error occurred halfway during pre-training.

Validation DataLoader 1:   3%|▎         | 4/121 [00:00<00:20,  5.79it/s]
Epoch 0:  11%|█▏        | 4401/38424 [19:32<2:31:05,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]

Validation DataLoader 1:   4%|▍         | 5/121 [00:00<00:19,  5.83it/s]
Epoch 0:  11%|█▏        | 4402/38424 [19:32<2:31:04,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]

Validation DataLoader 1:   5%|▍         | 6/121 [00:01<00:19,  5.86it/s]
Epoch 0:  11%|█▏        | 4403/38424 [19:32<2:31:03,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]

Validation DataLoader 1:   6%|▌         | 7/121 [00:01<00:19,  5.88it/s]
Epoch 0:  11%|█▏        | 4404/38424 [19:33<2:31:02,  3.75it/s, loss=1.13, v_num=6e-4, val/loss=1.140, val/num_tokens=7.34e+7, val/perplexity=3.130, test/loss=1.130, test/num_tokens=6.45e+7, test/perplexity=3.110]wandb: Waiting for W&B process to finish... (failed 1).
wandb: 
wandb: Run history:
wandb:               epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           test/loss ▁
wandb:     test/num_tokens ▁
wandb:     test/perplexity ▁
wandb:          timer/step █▁▂▁▁▁▁▂▂▁▁▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂
wandb:    timer/validation ▁█
wandb:       trainer/epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:        trainer/loss █▅▅▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▂▁▁▂▂▁▁▁▂▂▂▁▁▂
wandb:      trainer/lr/pg1 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb:      trainer/lr/pg2 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb:      trainer/lr/pg3 ▁▂▂▃▄▄▅▆▇▇████████████████▇▇▇▇▇▇▇▇▇▇▇▇▇▆
wandb:            val/loss ▁
wandb:      val/num_tokens ▁
wandb:      val/perplexity ▁
wandb: 
wandb: Run summary:
wandb:               epoch 0
wandb:           test/loss 1.1343
wandb:     test/num_tokens 64487424
wandb:     test/perplexity 3.10899
wandb:          timer/step 0.26093
wandb:    timer/validation 55.69303
wandb:       trainer/epoch 0.0
wandb: trainer/global_step 3999
wandb:        trainer/loss 1.14593
wandb:      trainer/lr/pg1 0.00048
wandb:      trainer/lr/pg2 0.00048
wandb:      trainer/lr/pg3 0.00048
wandb:            val/loss 1.14128
wandb:      val/num_tokens 73400320
wandb:      val/perplexity 3.13077
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ./wandb/offline-run-20240317_082813-hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4
wandb: Find logs at: ./wandb/offline-run-20240317_082813-hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4/logs
Error executing job with overrides: ['experiment=hg38/hg38', 'callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500', 'dataset.max_length=1024', 'dataset.batch_size=256', 'dataset.mlm=false', 'dataset.mlm_probability=0.0', 'dataset.rc_aug=true', 'model=hyena', 'model.d_model=256', 'model.n_layer=4', 'optimizer.lr=6e-4', 'train.global_batch_size=1024', 'trainer.max_steps=10000', 'trainer.devices=4', '+trainer.val_check_interval=2000', 'wandb.group=pretrain_hg38', 'wandb.name=hyena_rc_aug_seqlen-1k_dmodel-256_nlayer-4_lr-6e-4', 'wandb.mode=offline']
Traceback (most recent call last):
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zhan8855/Caduceus/train.py", line 719, in <module>
    main()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/zhan8855/Caduceus/train.py", line 715, in main
    train(config)
  File "/home/zhan8855/Caduceus/train.py", line 680, in train
    trainer.fit(model)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 7.
Original Traceback (most recent call last):
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 277, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 144, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 121, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/home/zhan8855/miniconda3/envs/caduceus/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1024] at entry 1

@yair-schiff
Contributor

I'm not sure why you are encountering an empty batch. I would recommend adding a check, some print statements, or breakpoints in the HG38 dataset's __getitem__ here to see whether it is returning an empty sequence, and go from there.
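
As a rough sketch of what that check could look like (the self.fasta call is copied from later in this thread; the exact variable names in hg38_dataset.py may differ):

seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)
# Temporary debug guard: log the offending coordinates instead of letting collate fail later.
if len(seq) == 0:
    print(f"Empty sequence: chr={chr_name}, start={start}, end={end}, shift={shift_idx}")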

@richardsun-voyager
Author

For me, the error occurred halfway during pre-training.

Did you solve this problem?

@zhan8855

Not yet... For now, I have disabled validation during pre-training.
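
For anyone who wants the same workaround: assuming the Lightning Trainer options are exposed under trainer in this repo's Hydra config (as the other overrides above suggest), adding something like

+trainer.limit_val_batches=0

to the launch command should skip the validation loop entirely, since limit_val_batches=0 disables validation in PyTorch Lightning. Treat the exact key as an assumption.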

@leannmlindsey

I just want to report that I have also seen this problem in some runs, but not all. I am testing on two different clusters, and it only happens on the NCSA DELTA system. I have not yet figured out what is different about the runs where the problem appears.

@leannmlindsey

@zhan8855 does it only happen in the validation step?

@zhan8855

@leannmlindsey For me, yes, and it seems to happen most often when the training loader and validation loader are running simultaneously.

@GengGengJiuXi

The same issue!

I am on aarch64.

@smdrnks

smdrnks commented May 17, 2024

I am facing the same issue unfortunately. It happens during eval. Investigating now...

@GengGengJiuXi

GengGengJiuXi commented Jun 4, 2024

I have solved this issue. The file is '02_caduceus/src/dataloaders/datasets/hg38_dataset.py'. In its __getitem__ function, the line seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs) sometimes returns an empty value even though a result actually exists. A simple retry loop resolves the issue, for example:

i = 0
while len(seq) == 0 and i < 100:
    seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)
    i += 1
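
A slightly more defensive variant of the same idea (hypothetical, reusing the names from the snippet above) gives up with a clear error instead of passing an empty tensor on to the collate function:

for _ in range(100):
    seq = self.fasta(chr_name, start, end, max_length=self.max_length, i_shift=shift_idx, return_augs=self.return_augs)
    if len(seq) > 0:
        break
else:
    raise RuntimeError(f"Got an empty sequence for {chr_name}:{start}-{end} after 100 retries")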
