Why does training always fail on the full dataset? #253
Comments
Hi @yaqlee, if I understand correctly, a worker here means an instance/machine with 352GB of memory and 8 GPUs. However, we do not use Ray for distributed training, only for multi-processing. Can you first try to train on one machine?
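For readers following along, a single-machine run typically looks something like the sketch below. This is only an illustration based on the devkit's tutorial-style Hydra overrides; the experiment name, model config, and parameter values are assumptions, not the exact command used in this issue.

```bash
# Illustrative single-machine training run (experiment/model names and values are assumptions).
# worker=single_machine_thread_pool keeps everything on one node (no Ray cluster), and
# scenario_filter.limit_total_scenarios keeps the first run small before scaling up.
python nuplan_devkit/nuplan/planning/script/run_training.py \
    py_func=train \
    +training=training_raster_model \
    experiment_name=single_machine_debug \
    scenario_builder=nuplan \
    scenario_filter.limit_total_scenarios=500 \
    worker=single_machine_thread_pool \
    data_loader.params.batch_size=8 \
    data_loader.params.num_workers=8
```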
As described in advanced_model_training.ipynb, I should run caching first. But when I run caching distributed across 4 CPU instances, it takes too long (one third of the full data would take almost 9 days), which makes me wonder whether my configuration is right. Could you please give a demonstration of how to run caching across multiple machines? Thanks a lot.
Anyway, what is the normal duration for running caching? It would help me estimate whether the preprocessor is running normally.
@patk-motional Bump on this? For me, caching 10% of the dataset takes nearly 3 days and I'm wondering if this is expected. I've tried both. Maybe it's just a hardware difference?
Hi @bhyang, can you share which machines and command you are using? Internally we use 4 CPUs with 70 cores; that takes about an hour. Edit: this is on the training set.
@patk-motional Is it possible for you to share the full scenario_filter/scenario_builder Hydra config that you are using for baseline training? Are you using the tagged scenarios or the complete logs?
Here's my CPU info, and here's the exact command I'm using:
@patk-motional Any update on this? Thanks!
@patk-motional Same here. My lanegcn training task keeps getting killed because memory fills up. I already set limit_total_scenarios=0.01 but still run out of memory. I only have a few machines, each with one or two GPUs. How am I supposed to train on my machines with your dataset? PyTorch DDP works across distributed machines, so there is no need for the Ray distributed system. Your code is very tightly encapsulated: how can we remove the Ray framework and use PyTorch DDP for multi-node training? Do you have any worker code example available internally for this purpose? If not, can you tell us how to proceed without using the Ray distributed system?
I found that setting …
Hi all, here is the full config to cache the features for the urban driver model. You will see that there is nothing special about it:
We have a system to dispatch jobs for distributed caching and training internally. There are a few environment variables to set:
Caching job

There is no node-to-node communication required. The code looks at the environment variable … Lastly, there is an assumption that your data (db files) are stored somewhere in AWS for multi-node caching. Otherwise, you will need to load all of the data you need onto each machine prior to running the job. You will also need to modify the code in this function here if you are not using AWS S3.

Training job

You can follow https://pytorch-lightning.readthedocs.io/en/1.6.5/clouds/cluster.html as a general guide. We are using pl.plugins.DDPPlugin, not Ray, for node-to-node communication. The same note about requiring AWS S3 applies to training as well; otherwise, make the changes stated in the caching section.
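To make the environment-variable part concrete, here is a rough sketch of a two-node launch following the PyTorch Lightning cluster guide linked above. The variable names are the standard ones Lightning's DDP reads when launched manually on a cluster; the specific values and the model config are illustrative assumptions, not the internal dispatch system described here.

```bash
# A sketch of a 2-node launch (values are illustrative, not the internal setup).
export MASTER_ADDR=10.0.0.1   # address of the rank-0 node (assumed)
export MASTER_PORT=29500
export NODE_RANK=0            # 0 on the first machine, 1 on the second, ...
export WORLD_SIZE=16          # total GPU processes across all nodes (2 nodes x 8 GPUs)

# Then run the same training command on every node; with these variables set,
# pl.plugins.DDPPlugin (not Ray) handles the node-to-node communication.
python nuplan_devkit/nuplan/planning/script/run_training.py \
    py_func=train \
    +training=training_urban_driver_open_loop_model
```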
Thank you for your information. I will follow your instructions to give it a try. |
Hi @patk-motional, I've encountered a failure while attempting to train on the full dataset using the DDPPlugin after caching locally. Loading the complete dataset from the cache files takes more than 30 minutes. I measured the data-loading time (approximately 19 million cached samples) in different sections of the extract_scenarios_from_cache function in scenario_builder.py, and the results were as follows: while placing a barrier and synchronizing until processing on all other GPUs is finished, most of the time is spent reading the candidate_scenario_dirs paths. This leads to a timeout and a failure during the preprocessing stage of training. I suspect the size of the dataset is the underlying reason. The problem might be resolved by adjusting the timeout argument introduced in DDPStrategy in PyTorch Lightning 2.0.7, but the version currently used by nuPlan is PyTorch Lightning 1.4.9, which employs the DDPPlugin. Could you give me some advice?
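For reference, a minimal sketch of the timeout workaround mentioned above, assuming an upgrade to a PyTorch Lightning version where DDPStrategy exposes a timeout argument; the trainer arguments shown are illustrative, not the devkit's actual training entry point.

```python
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Raise the collective-communication timeout above the 30-minute default so the
# barrier during cache scanning does not kill the job (the value is an example).
strategy = DDPStrategy(timeout=timedelta(hours=2))

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=1,
    strategy=strategy,
)
```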
Hi @bhyang, I am facing the same memory problem during cache generation, and I found your solution interesting. Where do you set …?
@CristianGariboldi You can see my earlier comment for the exact command I used, except you can replace the worker arguments with …
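For anyone hitting the same out-of-memory problem during caching, a cache run with the worker overridden away from Ray generally looks like the sketch below. The cache path, model config, and worker choice are illustrative assumptions based on the devkit's Hydra configs, not the exact command from this thread.

```bash
# Illustrative cache run (paths and experiment names are assumptions).
# Switching worker away from ray_distributed (e.g. to single_machine_thread_pool
# or sequential) is one way to keep memory usage on a single machine under control.
python nuplan_devkit/nuplan/planning/script/run_training.py \
    py_func=cache \
    +training=training_urban_driver_open_loop_model \
    scenario_builder=nuplan \
    cache.cache_path=/data/nuplan/cache \
    cache.force_feature_computation=true \
    worker=single_machine_thread_pool
```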
Hello, I ran into the same problem. How did you solve it? Thanks.
Hello, I ran into the same issue. How did you deal with it? Thanks.
When training with the full dataset, I always encounter one of the following two issues:
Each worker used for training has 352GB of memory and 8 GPUs, and the training command is as follows:
The configuration I composed is as follows:
Could you please advise on how to resolve the above issues? What adjustments should be made to the training configuration or command?