TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op' #32

Closed
J-shel opened this issue Jul 21, 2022 · 6 comments
Labels
bug Something isn't working

Comments

J-shel commented Jul 21, 2022

Describe the bug
Hi,
Thank you for sharing your implementation of DGMR. I'm new to deep learning, but I'm very interested in it and am learning to use it in atmospheric science.
When I ran the code using run.py under the train directory, I got the following message:
...
...
98.3 M Trainable params
0 Non-trainable params
98.3 M Total params
393.086 Total estimated model params size (MB)

Sanity Checking: 0it [00:00, ?it/s]2022-07-21 01:24:47.641350: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.641881: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.641954: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.644718: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.646172: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.656873: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Traceback (most recent call last):
File "run.py", line 205, in <module>
trainer.fit(model, datamodule)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
self._run_sanity_check()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
val_loop.run()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
batch = next(data_fetcher)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
batch = next(iterator)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
return self._process_data(data)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
data.reraise()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/_utils.py", line 454, in reraise
raise self.exc_type(message=msg)
TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op'
...
...
Please see the attached run.log file for full log message.
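
One plausible reading of the last two frames: torch/_utils.py rebuilds the worker's exception with `self.exc_type(message=msg)`. If the original exception class requires extra positional arguments (here apparently `node_def` and `op`), that rebuild itself fails, and the resulting TypeError hides the real error from the DataLoader worker. The sketch below reproduces the effect with a hypothetical stand-in class, `FakeOpError`; it is not the actual TensorFlow or PyTorch code.

```python
class FakeOpError(Exception):
    """Hypothetical stand-in for an exception that needs extra constructor arguments."""

    def __init__(self, node_def, op, message):
        super().__init__(message)
        self.node_def = node_def
        self.op = op
        self.message = message


def reraise_from_message(exc_type, msg):
    """Rebuild an exception from its type and a message only (simplified assumption)."""
    raise exc_type(message=msg)


def demo():
    """Return the text of the TypeError that replaces the original error."""
    try:
        reraise_from_message(FakeOpError, "original error text from the worker")
    except TypeError as err:
        return str(err)
    return None


masked = demo()
print(masked)
```

If this reading is right, setting the DataLoader's `num_workers=0` would run data loading in the main process and should surface the original exception instead of the masked one.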

To Reproduce
Steps to reproduce the behavior:

  1. Go to the train directory.
  2. Edit run.py. Since I'm using a CPU, I changed the accelerator to "cpu":
    trainer = Trainer(
        max_epochs=1000,
        logger=wandb_logger,
        callbacks=[model_checkpoint],
        gpus=6,
        precision=32,
        accelerator="cpu",
    )
  3. Run "python run.py".

Expected behavior
I'm not sure whether I have set things up correctly to train the model using the radar data from the paper, or how to use multiple CPUs. The README.md file makes it very clear how to install the model and run it in a simple way. It would be nice to have a very small sample of radar train/val/test data with the code, or a link to download the train/val/test data manually, since it would be very helpful to see what the data really looks like and to understand the model.

Additional context
I attached the entire log file "run.log" and the packages I used just in case.
run.log
pip_list.txt

J-shel added the bug label Jul 21, 2022
@jacobbieker
Member

Hi,

Glad you like the repo! A small set of train/validation/test data is located at "gs://dm-nowcasting-example-data/datasets/nowcasting_open_source_osgb/nimrod_osgb_1000m_yearly_splits/radar/20200718" in GCP. This issue seems to come from being unable to access the sample dataset. The run script uses this HuggingFace dataset script https://huggingface.co/datasets/openclimatefix/nimrod-uk-1km/blob/main/nimrod-uk-1km.py to load and process the data into the format that DGMR expects. I don't think it should need any credentials, as it's a public GCP bucket, but you might have to supply something?
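
For anyone who wants to check access without gsutil: objects in a publicly readable GCS bucket can generally be fetched over plain HTTPS at `https://storage.googleapis.com/<bucket>/<object>`. Here is a small sketch that converts a `gs://` URI into that form. The helper name `gs_to_public_https` is made up for illustration, and the path quoted above is a prefix rather than a single object, so an actual download would need a full object name appended.

```python
from urllib.parse import quote, urlsplit


def gs_to_public_https(gs_uri):
    """Convert a gs://bucket/object URI to its storage.googleapis.com HTTPS form."""
    parts = urlsplit(gs_uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    # Percent-encode the object path but keep "/" separators readable.
    object_path = quote(parts.path.lstrip("/"))
    return f"https://storage.googleapis.com/{parts.netloc}/{object_path}"


prefix = (
    "gs://dm-nowcasting-example-data/datasets/nowcasting_open_source_osgb/"
    "nimrod_osgb_1000m_yearly_splits/radar/20200718"
)
print(gs_to_public_https(prefix))
```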

@J-shel
Author

J-shel commented Jul 21, 2022

Hi, I tried downloading the data from GCP simply using "gsutil cp -R gs://dm-nowcasting-example-data ." and it succeeded. It didn't ask for any credentials, so now it's kind of confusing. Could you please take a look at the screenshot I attached? As you say, the run script uses nimrod-uk-1km.py to load and process the data, but I didn't find nimrod-uk-1km.py in my directory. Am I missing something?
screenshot_1

@jacobbieker
Member

Yeah, nimrod-uk-1km.py is downloaded to the HuggingFace cache, usually somewhere under ~/.cache/huggingface/, and is loaded on the fly from HuggingFace, so it's not included in the repo.
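
A quick way to locate the cached script is to search the cache directory recursively. This is a minimal sketch: it assumes the default cache location mentioned above, honours the `HF_HOME` environment variable if it is set, and guesses the file name pattern `nimrod-uk-1km*.py`. The function names are illustrative only.

```python
import os
from pathlib import Path


def huggingface_cache_root():
    """Return HF_HOME if set, otherwise the default ~/.cache/huggingface."""
    hf_home = os.environ.get("HF_HOME")
    return Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"


def find_cached_files(root, pattern="nimrod-uk-1km*.py"):
    """Recursively list files under root whose names match pattern."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(pattern) if p.is_file())


for path in find_cached_files(huggingface_cache_root()):
    print(path)
```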

@J-shel
Author

J-shel commented Jul 21, 2022

Yes, I found it! I have no idea why that happened, but when I moved to a GPU machine, I didn't get that error anymore. However, I hit a new error, shown below.
screentshot2

@jacobbieker
Member

Yeah, sorry, I've been trying to get it to run on multiple GPUs, but there seems to be an issue with parameterized modules that currently doesn't allow that. So if you change gpus to 1 it should work. You will probably have to reduce the batch size as well.
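
As a rough sketch of that change, the Trainer arguments from the snippet earlier in the thread could be collected like this, with `gpus` set to 1. The helper names and the halving factor are illustrative assumptions, not values from run.py. The `gpus` argument matches the PyTorch Lightning version shown in the traceback; it was removed in Lightning 2.0, where `accelerator="gpu", devices=1` is used instead.

```python
def single_gpu_trainer_kwargs(max_epochs=1000, precision=32):
    """Trainer keyword arguments for a single-GPU run (illustrative helper)."""
    return {"max_epochs": max_epochs, "gpus": 1, "precision": precision}


def reduced_batch_size(current, factor=2):
    """Shrink the batch size by an assumed factor, never going below 1."""
    return max(1, current // factor)


kwargs = single_gpu_trainer_kwargs()
print(kwargs)
print(reduced_batch_size(8))

# In run.py this would be used roughly as:
# trainer = Trainer(logger=wandb_logger, callbacks=[model_checkpoint], **kwargs)
```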

@J-shel
Author

J-shel commented Jul 21, 2022

Got it! Thank you very very much! O(∩_∩)O
