Question for Submitting the ml planner from #306 #308

Closed
changliucoding opened this issue May 18, 2023 · 29 comments

@changliucoding

Hi, this is a follow-up to my question in #306. I did exactly what you told us in #302, but it still failed. What should I put in checkpoint_path? I checked the files in my Docker image, and my model.ckpt is indeed in nuplan-devkit. I don't know what I missed. Thanks!

@abbyxxn

abbyxxn commented May 20, 2023

I have the same problem.
My submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:4ea52d5d-51b1-4faf-9331-e89f24ae6b53"}
and I get this in Merged stderr:
validation_challenge99.log
2023-05-20 02:12:53,117 : ERROR : Planner initialization failed!
2023-05-20 02:12:58,389 : ERROR : Planner initialization failed!

I ran docker-compose up --build and it succeeded on my local server.

@changliucoding
Author

> I have the same problem.
> My submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:4ea52d5d-51b1-4faf-9331-e89f24ae6b53"}
> and I get this in Merged stderr:
> validation_challenge99.log
> 2023-05-20 02:12:53,117 : ERROR : Planner initialization failed!
> 2023-05-20 02:12:58,389 : ERROR : Planner initialization failed!
>
> I ran docker-compose up --build and it succeeded on my local server.

This seems to be the same problem as mine. I don't know how to fix it.

@patk-motional
Collaborator

@abbyxxn,

This is the error during initialization:

Traceback (most recent call last):
--
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/grpc/_server.py", line 443, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/nuplan_devkit/nuplan/submission/challenge_servicers.py", line 98, in InitializePlanner
    planners = build_planners(self._planner_config, None)
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 70, in build_planners
    planner = cache.get(name, _build_planner(config, scenario))
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 24, in _build_planner
    if is_target_type(planner_cfg, MLPlanner):
  File "/nuplan_devkit/nuplan/planning/script/builders/utils/utils_type.py", line 23, in is_target_type
    return bool(_locate(cfg._target_) == target_type)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/utils.py", line 577, in _locate
    raise ImportError(
ImportError: Encountered error: `No module named 'transformer4planning.submission'` when loading module 'transformer4planning.submission.planner.ControlTFPlanner'
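
For anyone hitting a similar ImportError: one way to catch it before submitting might be to resolve the planner's `_target_` string inside the built submission image. The sketch below is a stdlib-only approximation of that lookup, not Hydra's actual `_locate` implementation; `locate_target` and `check_target` are hypothetical helper names.

```python
import importlib
from typing import Any, Optional


def locate_target(dotted_path: str) -> Any:
    """Import the object named by a dotted path such as 'pkg.module.ClassName'."""
    parts = dotted_path.split(".")
    # Try the longest module prefix first, then walk the remaining names as attributes.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            continue
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"no importable module prefix in {dotted_path!r}")


def check_target(dotted_path: str) -> Optional[str]:
    """Return None if the target resolves, otherwise a short error description."""
    try:
        locate_target(dotted_path)
    except (ImportError, AttributeError) as exc:
        return f"{dotted_path}: {exc}"
    return None


if __name__ == "__main__":
    # Target string taken from the traceback above; run this inside the container.
    print(check_target("transformer4planning.submission.planner.ControlTFPlanner"))
```

If this prints an error inside the container but not on the host, the package was probably not copied or installed into the image.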

@patk-motional
Collaborator

@changliucoding,

What is your team name and time of submission? I can look up the detailed logs for you.

@changliucoding
Author

Hi, my team name is changdrive. Thank you very much!

@Fan-Yixuan

Hi @patk-motional, same problem here. The submitted file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16281:6095e0da-394d-4c89-8e37-159cc30b67fc"}. Could you please find the detailed logs for me? Thanks a lot.

@MMz000

MMz000 commented May 23, 2023

Hi @patk-motional, we have also encountered a similar problem. We followed the tutorial and used the COPY command in the Dockerfile to copy the ckpt file. However, when running docker-compose up --build, we received a file-not-found error.

To troubleshoot this, we added an "ls" command to check if the ckpt file is present in the nuplan-devkit folder. The result showed that the file does exist.

It is worth noting that we were able to successfully run the simple planner.

Here are some of our configurations.

Dockerfile & Dockerfile.submission

[image]
[image]

entrypoint_submission.sh & entrypoint_simulation.sh

[image]
[image]

ml_planner.yaml

[image]

Here is the output of the ls command executed in the entrypoint_simulation.sh script

[image]

Following that, the command docker-compose up --build encountered the following error:

[image]

@gianmarco-motional
Contributor

@MMz000, the ls command was run in the wrong container (the simulation one). You should not modify that entrypoint anyway, as we will use our own version on our servers.

The error occurs because you are looking for /nuplan-devkit/ub_ours.ckpt, while the file is copied to nuplan_devkit/ub_ours.ckpt (note the hyphen vs. underscore in the devkit name).
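
A hyphen/underscore mix-up like this is easy to miss by eye. As a rough illustration, a small check run inside the submission container could flag it. `diagnose_checkpoint` is a hypothetical helper, not part of nuplan-devkit, and it only tries the two simplest spelling variants.

```python
from pathlib import Path


def diagnose_checkpoint(path: str) -> str:
    """Report whether `path` is a file, suggesting a '-'/'_' variant if one exists."""
    if Path(path).is_file():
        return f"OK: {path}"
    # Try swapping all hyphens for underscores, and the reverse.
    for candidate in (path.replace("-", "_"), path.replace("_", "-")):
        if candidate != path and Path(candidate).is_file():
            return f"NOT FOUND: {path} (did you mean {candidate}?)"
    return f"NOT FOUND: {path}"


if __name__ == "__main__":
    # Path taken from the comment above.
    print(diagnose_checkpoint("/nuplan-devkit/ub_ours.ckpt"))
```

Running the check from the submission entrypoint rather than the simulation one keeps it in the container that actually loads the checkpoint.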

@patk-motional
Collaborator

@Fan-Yixuan,

There are no detailed logs for you as your container failed to start.

@patk-motional
Collaborator

@changliucoding,

Your Hydra config isn't set up properly:

Traceback (most recent call last):
--
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/grpc/_server.py", line 443, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/nuplan_devkit/nuplan/submission/challenge_servicers.py", line 98, in InitializePlanner
    planners = build_planners(self._planner_config, None)
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in build_planners
    return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in <listcomp>
    return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 25, in _build_planner
    torch_module_wrapper = build_torch_module_wrapper(planner_cfg.model_config)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 357, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 445, in _get_impl
    return self._resolve_with_default(
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 58, in _resolve_with_default
    raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: planner.ml_planner.model_config
    full_key: planner.ml_planner.model_config
    object_type=dict
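
In OmegaConf, `???` marks a mandatory value that must be supplied before it is read, which is what this error reports for `model_config`. As a rough pre-submission check, the sketch below walks a plain nested dict and lists keys still set to `???`. It deliberately avoids importing OmegaConf; `find_missing` is a hypothetical helper and the config values are only an example.

```python
from typing import Any, Dict, List

MISSING = "???"  # OmegaConf's marker for a mandatory value that has not been set


def find_missing(cfg: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return the dotted keys whose value is still the '???' marker."""
    missing: List[str] = []
    for key, value in cfg.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(find_missing(value, full_key))
        elif value == MISSING:
            missing.append(full_key)
    return missing


# Example config mirroring the key named in the error (values are illustrative).
config = {
    "planner": {
        "ml_planner": {
            "model_config": "???",
            "checkpoint_path": "/nuplan_devkit/best_model.ckpt",
        }
    }
}
print(find_missing(config))  # ['planner.ml_planner.model_config']
```

Any key this reports needs a value, either in the YAML file itself or as an override on the command line in entrypoint_submission.sh.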

@abbyxxn

abbyxxn commented May 24, 2023

My submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:92e91991-f365-46b5-9493-acfa2dd57e96"}
and I get this in Merged stderr:
validation_challenge99.log
2023-05-24 06:37:08,919 : ERROR : Trajectory computation service failed!
2023-05-24 06:37:08,919 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED
2023-05-24 06:37:46,718 : ERROR : Trajectory computation service failed!
2023-05-24 06:37:46,718 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED

I ran docker-compose up --build and it succeeded on my local server.
May I see the detailed error message?

@patk-motional
Collaborator

Hi @abbyxxn,

I only see logs for the initialization stage. This usually indicates that your planner timed out in the first iteration. Have you profiled your planner locally?

INFO:nuplan.submission.submission_planner:Server starting...
--
INFO:nuplan.submission.submission_planner:Server started!
2023-05-24 06:36:44.757755: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-24 06:36:44.893808: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-05-24 06:36:47.446118: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-05-24 06:36:47.446229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-05-24 06:36:47.446244: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Pretrained GPT nonautofrom /nuplan_devkit/test-loss0.25
INFO:nuplan.submission.challenge_servicers:Initialization request received..
INFO:root:Planner initialized!
count:  1 True
/nuplan_devkit/nuplan/common/maps/nuplan_map/utils.py:413: RuntimeWarning: invalid value encountered in cast
return elements.iloc[np.where(elements[column_label].to_numpy().astype(int) == int(desired_value))]
/opt/conda/envs/nuplan/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Pretrained GPT nonautofrom /nuplan_devkit/test-loss0.25
INFO:nuplan.submission.challenge_servicers:Initialization request received..
INFO:root:Planner initialized!
count:  2 True

@abbyxxn

abbyxxn commented May 24, 2023

Thank you for your very useful reply!
My new submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:6f8860cd-d1c7-4e80-92b6-5dde3889c687"}
and I get this in Merged stderr:
validation_challenge99.log
2023-05-24 17:15:08,181 : ERROR : Trajectory computation service failed!
2023-05-24 17:15:08,181 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED
2023-05-24 17:15:51,685 : ERROR : Trajectory computation service failed!
2023-05-24 17:15:51,686 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED

May I see the detailed error message?

@gianmarco-motional
Contributor

@abbyxxn do you mean the logs from your container? I see many prints like this (this is the last one; most show around 0.89 for time consumed):

count:  24 True
--
(224, 224, 109) (224, 224, 109) (10, 4) 22
time after ratser build 0.8149843215942383
time after gpt 0.943516731262207
time consumed 1.1793630123138428
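
For local profiling, a small harness along these lines could record which iterations exceed a chosen time budget. This is only a sketch: `profile_iterations` is a hypothetical helper, `step` stands in for one call to the planner's trajectory computation, and the budget is a parameter because the server-side limit is not stated in this thread.

```python
import time
from typing import Callable, List, Tuple


def profile_iterations(
    step: Callable[[], object], n_iters: int, budget_s: float
) -> List[Tuple[int, float]]:
    """Run `step` n_iters times; return (iteration, seconds) for runs over budget."""
    over_budget: List[Tuple[int, float]] = []
    for i in range(n_iters):
        start = time.perf_counter()
        step()
        elapsed = time.perf_counter() - start
        if elapsed > budget_s:
            over_budget.append((i, elapsed))
    return over_budget


if __name__ == "__main__":
    # Stand-in workload; replace with a real planner call when profiling.
    slow = profile_iterations(lambda: time.sleep(0.05), n_iters=5, budget_s=0.01)
    for iteration, seconds in slow:
        print(f"iteration {iteration} took {seconds:.3f}s")
```

Since the evaluation hardware may differ from a local machine, it is probably worth choosing a budget with some headroom rather than one the planner only just meets.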

@tinkei

tinkei commented May 25, 2023

Could I also ask about the error logs for my submissions for team NaNny?
It's an ml_planner using raster_model with the default resnet50 backbone.
Both the trained model and the huggingface/timm cache are copied to the image.
I keep getting this error without any details that I could act on:

Merged stderr:
validation_challenge99.log
2023-05-25 10:15:49,113 : ERROR : Planner initialization failed!
2023-05-25 10:16:29,764 : ERROR : Planner initialization failed!

I tried the following syntax:


Submitted at May 25, 2023 11:50:58 AM

{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16386:271ffe90-3fad-460b-90ed-cd662567e140"}

ml_planner.yaml:

model_config: ???
checkpoint_path: /nuplan_devkit/best_model.ckpt

entrypoint_submission.sh:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=raster_model planner=ml_planner planner.ml_planner.model_config=\${model}

Submitted at May 25, 2023 1:26:01 PM

{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16386:6107a367-7691-41f0-a588-13d249104f19"}

ml_planner.yaml:

model_config: ???
checkpoint_path: /nuplan_devkit/best_model.ckpt

entrypoint_submission.sh:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=raster_model planner=ml_planner planner.ml_planner.model_config=raster_model planner.ml_planner.checkpoint_path=/nuplan_devkit/best_model.ckpt

Submitted at May 25, 2023 2:33:31 PM

{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-16386:fa16eba8-d729-4730-ae7a-685ccb56db4a"}

ml_planner.yaml:

model_config: ${model}
checkpoint_path: /nuplan_devkit/best_model.ckpt

entrypoint_submission.sh:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=raster_model planner=ml_planner

@patk-motional
Collaborator

@tinkei

tinkei commented May 25, 2023

https://evalai.s3.amazonaws.com/media/submission_files/submission_283754/236ccfc8-a510-4940-96aa-ce3a0a0f8a2a.txt

Thank you for your prompt response. But this is an old submission from last night, before I included the huggingface/timm cache in Dockerfile.submission. The submissions I cited above are still failing this afternoon.

@patk-motional
Collaborator

Traceback (most recent call last):
--
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/grpc/_server.py", line 443, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/nuplan_devkit/nuplan/submission/challenge_servicers.py", line 98, in InitializePlanner
    planners = build_planners(self._planner_config, None)
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in build_planners
    return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 58, in <listcomp>
    return [_build_planner(planner, scenario) for planner in planner_cfg.values()]
  File "/nuplan_devkit/nuplan/planning/script/builders/planner_builder.py", line 25, in _build_planner
    torch_module_wrapper = build_torch_module_wrapper(planner_cfg.model_config)
  File "/nuplan_devkit/nuplan/planning/script/builders/model_builder.py", line 19, in build_torch_module_wrapper
    model = instantiate(cfg)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 180, in instantiate
    return instantiate_node(config, *args, recursive=_recursive_, convert=_convert_)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 249, in instantiate_node
    return _call_target(target, *args, **kwargs)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 64, in _call_target
    raise type(e)(
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 62, in _call_target
    return target(*args, **kwargs)
  File "/nuplan_devkit/nuplan/planning/training/modeling/models/raster_model.py", line 62, in __init__
    self._model = timm.create_model(model_name, pretrained=pretrained, num_classes=0, in_chans=num_input_channels)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_factory.py", line 114, in create_model
    model = create_fn(
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/resnet.py", line 1276, in resnet50
    return _create_resnet('resnet50', pretrained, **dict(model_args, **kwargs))
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/resnet.py", line 547, in _create_resnet
    return build_model_with_cfg(ResNet, variant, pretrained, **kwargs)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_builder.py", line 393, in build_model_with_cfg
    load_pretrained(
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_builder.py", line 186, in load_pretrained
    state_dict = load_state_dict_from_hf(pretrained_loc)
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/timm/models/_hub.py", line 183, in load_state_dict_from_hf
    return safetensors.torch.load_file(cached_safe_file, device="cpu")
  File "/opt/conda/envs/nuplan/lib/python3.9/site-packages/safetensors/torch.py", line 261, in load_file
    result[k] = f.get_tensor(k)
AttributeError: Error instantiating 'nuplan.planning.training.modeling.models.raster_model.RasterModel' : module 'torch' has no attribute 'frombuffer'

This was from your latest submission

@tinkei

tinkei commented May 25, 2023

@patk-motional Would you mind sharing the error logs of my two latest submissions? In one of them I downgraded timm, and in the other I didn't even use timm, but somehow they are still failing.

@gianmarco-motional
Contributor

@tinkei Can you share your Dockerfile.submission and entrypoint_submission.sh? If you prefer, send me a DM on Slack: https://join.slack.com/t/opendrivelab/shared_invite/zt-1uhny7uci-T5~otGGdwUtGo8L1j0~NUA

@gianmarco-motional
Contributor

gianmarco-motional commented May 25, 2023

@tinkei one problem is definitely that you commented out this line in entrypoint_submission.sh:
# [ -d "/mnt/data" ] && cp -r /mnt/data/nuplan-v1.1/maps/* $NUPLAN_MAPS_ROOT

@sindhu-pr

@patk-motional I am getting exactly the same error as in #308 (comment). My submission details are:
{"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-18335:6151b08b-04a0-4484-af5a-ed404b9a600d"}
If you could share the detailed logs, it would be helpful for my team.

@patk-motional
Collaborator

Can you share your submission id?

@sindhu-pr

> Can you share your submission id?

284337

Thanks

@patk-motional
Collaborator

I've pushed to your stderr

@sindhu-pr

sindhu-pr commented May 26, 2023

> I've pushed to your stderr

Hi, the following is the error:


Could not override 'planner.ml_planner.checkpoint_path'.
To append to your config use +planner.ml_planner.checkpoint_path=/model.ckpt
Key 'checkpoint_path' is not in struct
    full_key: planner.ml_planner.checkpoint_path
    object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

For these errors, where do you think we need to make changes to resolve them? And where should we set HYDRA_FULL_ERROR=1?

In entrypoint_submission.sh, we have the following line:

conda run -n nuplan --no-capture-output python -u nuplan/planning/script/run_submission_planner.py output_dir=/tmp/ model=<OUR MODEL> planner=ml_planner planner.ml_planner.model_config=\${model} planner.ml_planner.checkpoint_path="${NUPLAN_HOME}/<chkpoint file name>"

@tinkei

tinkei commented May 26, 2023

> @tinkei one problem is definitely that you commented out this line in entrypoint_submission.sh:
> # [ -d "/mnt/data" ] && cp -r /mnt/data/nuplan-v1.1/maps/* $NUPLAN_MAPS_ROOT

Thanks! Everything is working now!
I commented it out a few submissions ago to debug an issue with my local docker-compose, but forgot to revert it afterwards. Nice catch!

@tinkei

tinkei commented May 26, 2023

@sindhu-pr ${NUPLAN_HOME} is /nuplan_devkit in Dockerfile.submission, but your hydra config is pointing to /model.ckpt. I guess for some reason ${NUPLAN_HOME} evaluated to an empty string?
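
If the guess about an empty variable is right, a guard like the following would turn the silent collapse to `/model.ckpt` into an explicit error. It is only a sketch: `checkpoint_override` is a hypothetical helper, and it does not address the separate "Key 'checkpoint_path' is not in struct" message from Hydra.

```python
import os
from typing import Mapping, Optional


def checkpoint_override(filename: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build the checkpoint_path override, refusing an unset or empty NUPLAN_HOME."""
    env = os.environ if env is None else env
    home = env.get("NUPLAN_HOME", "")
    if not home:
        # With an empty prefix the path would become '/<filename>' at the filesystem root.
        raise RuntimeError(
            f"NUPLAN_HOME is unset or empty; the path would collapse to '/{filename}'"
        )
    return f"planner.ml_planner.checkpoint_path={home.rstrip('/')}/{filename}"


if __name__ == "__main__":
    print(checkpoint_override("model.ckpt", {"NUPLAN_HOME": "/nuplan_devkit"}))
```

Passing the environment as a parameter makes the failure case easy to reproduce without changing the real environment.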

@XZHSTAX

XZHSTAX commented May 26, 2023

> Thank you for your very useful reply!
> My new submission file is {"submitted_image_uri": "937891341272.dkr.ecr.us-east-1.amazonaws.com/nuplan-planning-challenge-1856-participant-team-20999:6f8860cd-d1c7-4e80-92b6-5dde3889c687"}
> and I get this in Merged stderr:
> validation_challenge99.log
> 2023-05-24 17:15:08,181 : ERROR : Trajectory computation service failed!
> 2023-05-24 17:15:08,181 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED
> 2023-05-24 17:15:51,685 : ERROR : Trajectory computation service failed!
> 2023-05-24 17:15:51,686 : ERROR : Trajectory computation timed out: StatusCode.DEADLINE_EXCEEDED
>
> May I see the detailed error message?

I ran into this problem too. I had profiled my planner locally without issues, but still got this error.

Then I noticed #298. As I use nuplan-devkit v1.1, my Dockerfile.submission was still the older version. I followed #298 to update Dockerfile.submission manually, then submitted and got Finished.

So there must be some problem if you use the older version of Dockerfile.submission. You can fix it by following #298.
