
AssertionError: Currently only supports dynamic loading from each domain for once. #15

Closed
Longyichen opened this issue Nov 13, 2023 · 17 comments


@Longyichen
Contributor

When I use a single-node, 8×A100 80G configuration, the following error occurs:

LLM-Shearing/llmshearing/datasets/streaming_da │
│ taset.py:46 in generate_work                                                                     │
│                                                                                                  │
│    43 │   │   List[List[int]]: The epoch for each domain of data (num physical nodes,            │
│    44 │   │   ranks per node, workers per rank, batches per worker, batch size).                 │
│    45 │   """                                                                                    │
│ ❱  46 │   assert epoch == 0, "Currently only supports dynamic loading from each domain for onc   │
│    47 │   # Ensure that num_canonical_nodes has been set.                                        │
│    48 │   if dataset.num_canonical_nodes is None:                                                │
│    49 │   │   raise RuntimeError(f'`num_canonical_nodes` can never be None. ' +                  │
╰───────────────────────────────────────────
AssertionError: Currently only supports dynamic loading from each domain for once.

If I delete the line `assert epoch == 0, "Currently only supports dynamic loading from each domain for once."`, another error occurs in:

        # Currently only supports dynamically loading data from each domain for once.
        # Issues could occur if one domain of data is used up.
        while True:
            proportion = self.proportion
            stream_id = np.random.choice(range(self.num_streams), 1, p=proportion)[0].item()
            domain_sample_id = sample_ids_per_stream[stream_id]
            domain_sample_id = domain_sample_id[self.used_num_samples_per_stream[stream_id] \
                                % self.samples_per_stream[stream_id]]
            self.used_num_samples_per_stream[stream_id] += 1
            yield self[domain_sample_id]
---
IndexError: index 24 is out of bounds for axis 0 with size 24

If I add `if world.is_local_leader and epoch == 0:`, then SharedMemory in `_attach_work` raises an error:

│   304 │   │   │   # Load the generated epoch shape from shared memory.                           │
│   305 │   │   │   name = _get_path(self._shm_prefix_int, EPOCH_SHAPE + f"_{stream_id}")          │
│   306 │   │   │   size = ndim * np.int64().nbytes                                                │
│ ❱ 307 │   │   │   shape_shm = SharedMemory(name=name, create=False, size=size, auto_cleanup=Fa   │
│   308 │   │   │   shape = tuple(np.ndarray(5, buffer=shape_shm.buf, dtype=np.int64))             │
│   309 │   │   │                                                                                  │
│   310 │   │   │   # Attach to the generated epoch data in shared memory.      
---
FileNotFoundError: [Errno 2] No such file or directory: '/000000_epoch_shape_0'
Exception ignored in atexit callback: <function Engine._close at 0x7fe6dc766a70>
@xiamengzhou
Contributor

Hi! How large is your dataset? We currently only support using all the data points once, and exceeding one epoch of data will cause errors. Supporting multiple epochs requires modifying the StreamingDataset logic.

@Longyichen
Contributor Author

Longyichen commented Nov 13, 2023

@xiamengzhou I processed the entire RedPajama-1T dataset according to your README, including tokenizing and sampling. This error occurred at batch [7/3200]; it seems the run has already reached epoch 1 at that point, which triggers the error.
Is the algorithm only supposed to get through 7 of the 3200 batches? The details are as follows:

[batch=6/3200]:
        Train time/batch: 5
        Train time/sample: 160
        Train time/batch_in_epoch: 5
        Train time/sample_in_epoch: 160
        Train time/token: 655360
        Train time/token_in_epoch: 655360
        Train metrics/train/cc_weight: 0.6700
        Train metrics/train/github_weight: 0.0450
        Train metrics/train/book_weight: 0.0450
        Train metrics/train/wiki_weight: 0.0450
        Train metrics/train/arxiv_weight: 0.0450
        Train metrics/train/c4-rp_weight: 0.1500
        Train memory/current_allocated_mem: 14.6140
        Train memory/current_active_mem: 14.6140
        Train memory/current_inactive_mem: 1.9267
        Train memory/current_reserved_mem: 39.3450
        Train memory/peak_allocated_mem: 28.0700
        Train memory/peak_active_mem: 28.0700
        Train memory/peak_inactive_mem: 11.7290
        Train memory/peak_reserved_mem: 39.3450
        Train memory/alloc_retries: 0
        Train metrics/train/expected_head_sparsity: 0.0039
        Train metrics/train/target_head_sparsity: 0.0029
        Train metrics/train/expected_intermediate_sparsity: 0.0039
        Train metrics/train/target_intermediate_sparsity: 0.0029
        Train metrics/train/expected_layer_sparsity: 0.0039
        Train metrics/train/target_layer_sparsity: 0.0000
        Train metrics/train/expected_hidden_sparsity: 0.0039
        Train metrics/train/target_hidden_sparsity: 0.0029
        Train metrics/train/expected_sparsity: 0.0117
        Train metrics/train/target_sparsity: 0.0048
        Train trainer/device_train_microbatch_size: 4
        Train loss/train/total: 1.8510
        Train loss/train/ce_loss: 1.8509
        Train loss/train/lag_loss: 0.0001
        Train metrics/train/LanguageCrossEntropy: 1.8509
        Train metrics/train/Perplexity: 6.3655
        Train metrics/train/cc_LanguageCrossEntropy: 1.9415
        Train metrics/train/cc_count: 121
        Train metrics/train/github_LanguageCrossEntropy: 0.8384
        Train metrics/train/github_count: 11
        Train metrics/train/book_LanguageCrossEntropy: nan
        Train metrics/train/book_count: 7
        Train metrics/train/wiki_LanguageCrossEntropy: 1.6548
        Train metrics/train/wiki_count: 8
        Train metrics/train/arxiv_LanguageCrossEntropy: nan
        Train metrics/train/arxiv_count: 5
        Train metrics/train/c4-rp_LanguageCrossEntropy: 1.9918
        Train metrics/train/c4-rp_count: 40
        Train time/train: 0.0152
        Train time/val: 0.0000
        Train time/total: 0.0152
[batch=7/3200]:
        Train time/batch: 6
        Train time/sample: 192
        Train time/batch_in_epoch: 6
        Train time/sample_in_epoch: 192
        Train time/token: 786432
        Train time/token_in_epoch: 786432
        Train metrics/train/cc_weight: 0.6700
        Train metrics/train/github_weight: 0.0450
        Train metrics/train/book_weight: 0.0450
        Train metrics/train/wiki_weight: 0.0450
        Train metrics/train/arxiv_weight: 0.0450
        Train metrics/train/c4-rp_weight: 0.1500
        Train memory/current_allocated_mem: 14.6140
        Train memory/current_active_mem: 14.6140
        Train memory/current_inactive_mem: 1.9267
        Train memory/current_reserved_mem: 39.3450
        Train memory/peak_allocated_mem: 28.0700
        Train memory/peak_active_mem: 28.0700
        Train memory/peak_inactive_mem: 11.7290
        Train memory/peak_reserved_mem: 39.3450
        Train memory/alloc_retries: 0
        Train metrics/train/expected_head_sparsity: 0.0039
        Train metrics/train/target_head_sparsity: 0.0035
        Train metrics/train/expected_intermediate_sparsity: 0.0039
        Train metrics/train/target_intermediate_sparsity: 0.0035
        Train metrics/train/expected_layer_sparsity: 0.0039
        Train metrics/train/target_layer_sparsity: 0.0000
        Train metrics/train/expected_hidden_sparsity: 0.0039
        Train metrics/train/target_hidden_sparsity: 0.0035
        Train metrics/train/expected_sparsity: 0.0117
        Train metrics/train/target_sparsity: 0.0057
        Train trainer/device_train_microbatch_size: 4
        Train loss/train/total: 1.8914
        Train loss/train/ce_loss: 1.8913
        Train loss/train/lag_loss: 0.0001
        Train metrics/train/LanguageCrossEntropy: 1.8913
        Train metrics/train/Perplexity: 6.6280
        Train metrics/train/cc_LanguageCrossEntropy: 1.8021
        Train metrics/train/cc_count: 140
        Train metrics/train/github_LanguageCrossEntropy: nan
        Train metrics/train/github_count: 11
        Train metrics/train/book_LanguageCrossEntropy: 1.9494
        Train metrics/train/book_count: 8
        Train metrics/train/wiki_LanguageCrossEntropy: 1.7889
        Train metrics/train/wiki_count: 9
        Train metrics/train/arxiv_LanguageCrossEntropy: nan
        Train metrics/train/arxiv_count: 5
        Train metrics/train/c4-rp_LanguageCrossEntropy: 2.0495
        Train metrics/train/c4-rp_count: 51
        Train time/train: 0.0172
        Train time/val: 0.0000
        Train time/total: 0.0172
Traceback (most recent call last):
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/train.py", line 319, in <module>
   main(cfg)
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/train.py", line 299, in main
   trainer.fit()
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1876, in fit
   self._train_loop()
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2018, in _train_loop
   for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 3024, in _iter_dataloader
   batch = next(dataloader_iter)
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
   data = self._next_data()
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
   data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
   data.append(next(self.dataset_iter))
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 401, in __iter__
   sample_ids_per_stream = self._get_work(world, epoch, used_sample_ids)
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 355, in _get_work
   sample_ids_per_stream = generate_work(self, world, epoch, used_domain_ids)
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 46, in generate_work
   assert epoch == 0, "Currently only supports dynamic loading from each domain for once."
AssertionError: Currently only supports dynamic loading from each domain for once.

@xiamengzhou
Contributor

Hiiii, I am not sure why it is happening here -- I will need to take a closer look at it and will get back to you later. Could you share the configuration you are using, and the number of data points in each domain?

@Longyichen
Contributor Author

Hi, thanks for your help. I think you are right; there may be something wrong with the data. Although I cannot directly read the mds files to see the number of data points, I found that the data files produced by the sampling step are smaller than expected. I'll check carefully what's wrong with the sampled files.

@xiamengzhou
Contributor

You can use the TextStreamingDataset to load the data and count the number of data points by simply calling len() on it. You can also check the index.json file to see the number of samples.
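
For example, a minimal sketch of the index.json route, assuming the standard streaming index layout with a top-level "shards" list whose entries record a "samples" count (the directory path below is a placeholder):

import json
import os

def count_samples(local_dir: str) -> int:
    # Sum the per-shard sample counts recorded in the mds index.json.
    with open(os.path.join(local_dir, "index.json")) as f:
        index = json.load(f)
    return sum(shard["samples"] for shard in index["shards"])

# Hypothetical path to one tokenized domain folder.
print(count_samples("/path/to/for_prune/cc"))

The total should match len() of a TextStreamingDataset constructed on the same folder.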

@Longyichen
Contributor Author

Thank you for your patient reply. Previously I used the default script you provided without setting the number of sampled tokens, which led to these problems. I'm resampling the data now and waiting to see whether that fixes it; it may take a while.
There is also a related question:
If I want to disable DoReMi, how should I modify the settings?
I found multiple places that seem to be related to the DoReMi configuration, including

In addition, the yaml file also configures the data path. When will this be used, or will it be overwritten?

@xiamengzhou
Contributor

xiamengzhou commented Nov 13, 2023

There are two ways to use a fixed data loading proportion!

The first way:

  • dynamic: false
  • split: wikipedia (make sure this directory contains mds files)
    This setup allows you to load data from a single data folder with mds files.

The second way:

  • dynamic: true
  • update_type: constant
  • set_names: specify the set names
  • proportion: specify the loading proportion
    This setup allows you to load data from multiple data folders with mds files and use a constant proportion.

You can refer to the callback function of dynamic loading here: https://github.com/princeton-nlp/LLM-Shearing/blob/main/llmshearing/callbacks/dynamic_loading_callback.py#L32
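
Put together, a rough yaml sketch of the two options (field names follow the bullets above; the set names and proportions are just the example values from the logs in this thread, and any other keys from the repo's example configs are omitted):

# Option 1: fixed proportion, single data folder of mds files
dynamic: false
split: wikipedia

# Option 2: multiple domains loaded with a constant proportion
dynamic: true
update_type: constant
set_names: [cc, github, book, wiki, arxiv, c4-rp]
proportion: [0.67, 0.045, 0.045, 0.045, 0.045, 0.15]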

@Longyichen
Contributor Author

Longyichen commented Nov 14, 2023

Thanks for your help, the code runs smoothly. But sometimes the loss will be nan. Is this normal?

[batch=3194/3200]:
         Train time/batch: 3193
         Train time/sample: 102176
         Train time/batch_in_epoch: 3193
         Train time/sample_in_epoch: 102176
         Train time/token: 418512896
         Train time/token_in_epoch: 418512896
         Train metrics/train/cc_weight: 0.0450
         Train metrics/train/github_weight: 0.0017
         Train metrics/train/book_weight: 0.0007
         Train metrics/train/stackexchange_weight: 0.0023
         Train metrics/train/wiki_weight: 0.0121
         Train metrics/train/arxiv_weight: 0.0011
         Train metrics/train/c4-rp_weight: 0.9370
         Train memory/current_allocated_mem: 14.6140
         Train memory/current_active_mem: 14.6140
         Train memory/current_inactive_mem: 1.9286
         Train memory/current_reserved_mem: 43.5430
         Train memory/peak_allocated_mem: 28.0710
         Train memory/peak_active_mem: 28.0710
         Train memory/peak_inactive_mem: 11.7290
         Train memory/peak_reserved_mem: 43.5430
         Train memory/alloc_retries: 0
         Train metrics/train/expected_head_sparsity: 0.3750
         Train metrics/train/target_head_sparsity: 0.3750
         Train metrics/train/expected_intermediate_sparsity: 0.3714
         Train metrics/train/target_intermediate_sparsity: 0.3721
         Train metrics/train/expected_layer_sparsity: 0.0039
         Train metrics/train/target_layer_sparsity: 0.0000
         Train metrics/train/expected_hidden_sparsity: 0.3734
         Train metrics/train/target_hidden_sparsity: 0.3750
         Train metrics/train/expected_sparsity: 0.6085
         Train metrics/train/target_sparsity: 0.6082
         Train trainer/device_train_microbatch_size: 4
         Train loss/train/total: 9.1105
         Train loss/train/ce_loss: 2.4873
         Train loss/train/lag_loss: 6.6233
         Train metrics/train/LanguageCrossEntropy: 2.4873
         Train metrics/train/Perplexity: 12.0283
         Train metrics/train/cc_LanguageCrossEntropy: nan
         Train metrics/train/cc_count: 24363
         Train metrics/train/github_LanguageCrossEntropy: nan
         Train metrics/train/github_count: 1389
         Train metrics/train/book_LanguageCrossEntropy: nan
         Train metrics/train/book_count: 1033
         Train metrics/train/stackexchange_LanguageCrossEntropy: nan
         Train metrics/train/stackexchange_count: 741
         Train metrics/train/wiki_LanguageCrossEntropy: 1.9528
         Train metrics/train/wiki_count: 14344
         Train metrics/train/arxiv_LanguageCrossEntropy: nan
         Train metrics/train/arxiv_count: 783
         Train metrics/train/c4-rp_LanguageCrossEntropy: 2.5045
         Train metrics/train/c4-rp_count: 59555
         Train throughput/batches_per_sec: 0.1416
         Train throughput/samples_per_sec: 4.5306
         Train throughput/device/batches_per_sec: 0.0177
         Train throughput/device/samples_per_sec: 0.5663
         Train throughput/tokens_per_sec: 18557.1898
         Train throughput/device/tokens_per_sec: 2319.6487
         Train throughput/flops_per_sec: 869869884376190.8750
         Train throughput/device/flops_per_sec: 108733735547023.8594
         Train throughput/device/mfu: 0.3485
         Train time/train: 6.3065
         Train time/val: 1.3523
         Train time/total: 7.6588

@xiamengzhou
Contributor

When a batch does not contain data from a specific domain, that domain's loss becomes nan, so it should be normal! For a sanity check, you can print the amount of data from each domain in each batch to verify.
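
As an illustration (not the repo's code), averaging a per-domain loss over an empty selection is exactly what produces entries such as arxiv_LanguageCrossEntropy: nan above:

import torch

# Hypothetical per-sample losses and domain ids for one small batch.
losses = torch.tensor([1.9, 2.1, 1.7])
domain_ids = torch.tensor([0, 0, 2])

for d in range(3):
    domain_loss = losses[domain_ids == d].mean()  # mean over an empty tensor is nan
    print(f"domain {d}: {domain_loss.item():.4f}")
# domain 1 has no samples in this batch, so its loss prints as nan.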

@lippman1125

@Longyichen Would you like to share your pruning script?

@lippman1125

Thanks for your help, the code runs smoothly. But sometimes the loss will be nan. Is this normal?
[batch=3194/3200 log quoted above]

@Longyichen Have you noticed that c4-rp_weight is 0.9370, which is not consistent with the values reported in the paper?

@Longyichen
Contributor Author

@lippman1125 Yes, I have a similar problem, but I don't know what causes it. It seems the model's performance does not suffer much compared to the paper. For details, we can ask @xiamengzhou for help.

@lippman1125

@Longyichen Because the eval CE loss determines the proportion, but the new proportion only affects the train CE loss. My guess is that if there is a gap between the training samples and the eval samples, it could lead to this problem.

@Longyichen
Contributor Author

@lippman1125 Have you tried continued pre-training? You could try using the pre-training dataset and evaluation set to see whether the same problem occurs.

@coderchem

@Longyichen Could you share your scripts? I ran into the same problem and have not solved it.

@coderchem

I tried all of the methods above and it still does not work. The dataset I am using is the sample dataset. Could you help explain where the problem might be?

@xiamengzhou
Contributor

xiamengzhou commented Jan 11, 2024

@coderchem Have you been using the data shared on the Google Drive?
