TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op' #32

J-shel · 2022-07-21T07:19:08Z

Describe the bug
Hi,
Thank you for sharing your implementation of DGMR. I'm new to deep learning, but I'm very interested in it and learning to use it in atmospheric science.
When I run run.py from the train directory, I get the following message:
...
...
98.3 M Trainable params
0 Non-trainable params
98.3 M Total params
393.086 Total estimated model params size (MB)

Sanity Checking: 0it [00:00, ?it/s]2022-07-21 01:24:47.641350: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.641881: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.641954: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.644718: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.646172: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-07-21 01:24:47.656873: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Traceback (most recent call last):
File "run.py", line 205, in <module>
trainer.fit(model, datamodule)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
self._run_sanity_check()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
val_loop.run()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
batch = next(data_fetcher)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
batch = next(iterator)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
return self._process_data(data)
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
data.reraise()
File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/_utils.py", line 454, in reraise
raise self.exc_type(message=msg)
TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op'
...
...
Please see the attached run.log file for full log message.
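The last frame of the traceback shows the DataLoader rebuilding a worker's exception with only a message keyword. The sketch below reproduces that failure mode with the standard library only. It assumes the worker raised an exception type whose constructor also requires node_def and op, as TensorFlow's op errors do. FakeOpError and rebuild_from_message are hypothetical names used for illustration, not part of PyTorch or TensorFlow.

```python
class FakeOpError(Exception):
    """Hypothetical stand-in for an error type that needs node_def and op."""

    def __init__(self, node_def, op, message):
        super().__init__(message)
        self.message = message


def rebuild_from_message(exc_type, msg):
    """Rebuild an exception from its type and text only, as the traceback's last frame does."""
    return exc_type(message=msg)


def demo():
    """Return the text of the TypeError that hides the original worker error."""
    try:
        raise rebuild_from_message(FakeOpError, "Caught FakeOpError in DataLoader worker process 0.")
    except TypeError as err:
        # The constructor call fails before the real error can be re-raised.
        return str(err)


print(demo())
```

If this is what happened here, the TypeError hides the real data-loading error. Running the DataLoader with num_workers=0 may surface the original exception, since nothing is re-raised across processes in that case.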

To Reproduce
Steps to reproduce the behavior:

1. Go to the train directory.
2. Edit run.py. Since I'm using a CPU, I changed the accelerator to "cpu":
trainer = Trainer(
    max_epochs=1000,
    logger=wandb_logger,
    callbacks=[model_checkpoint],
    # gpus=6,
    precision=32,
    accelerator="cpu",
)
3. Run "python run.py".

Expected behavior
I'm not sure whether I'm training the model correctly with the radar data from the paper, or how to use multiple CPUs. The README.md file makes it very clear how to install the model and run it in a simple way. It would be nice to include a very small sample of radar train/val/test data with the code, or a link to download the data manually, since it would be very helpful to see what the data really looks like and to understand the model.

Additional context
I've attached the entire log file, "run.log", and a list of the packages I used, just in case.
run.log
pip_list.txt


jacobbieker · 2022-07-21T07:32:38Z

Hi,

Glad you like the repo! There is a small set of train/validation/test data located at "gs://dm-nowcasting-example-data/datasets/nowcasting_open_source_osgb/nimrod_osgb_1000m_yearly_splits/radar/20200718" in GCP. This issue seems to come from being unable to access the sample dataset. The run script uses this HuggingFace dataset script https://huggingface.co/datasets/openclimatefix/nimrod-uk-1km/blob/main/nimrod-uk-1km.py to load and process the data into the format that DGMR expects. I don't think it should need any credentials, as it's a public GCP bucket, but you might have to supply something?
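One way to check whether the bucket is reachable without credentials is to list a few objects over plain HTTPS. This is a minimal sketch using only the standard library and the Cloud Storage JSON API object-listing endpoint. The helper names listing_url and list_objects are made up for illustration, and the sketch assumes the bucket allows anonymous listing.

```python
import json
import urllib.parse
import urllib.request

BUCKET = "dm-nowcasting-example-data"
PREFIX = (
    "datasets/nowcasting_open_source_osgb/"
    "nimrod_osgb_1000m_yearly_splits/radar/20200718"
)


def listing_url(bucket: str, prefix: str, max_results: int = 5) -> str:
    """Build a Cloud Storage JSON API URL that lists objects under a prefix."""
    query = urllib.parse.urlencode({"prefix": prefix, "maxResults": max_results})
    return f"https://storage.googleapis.com/storage/v1/b/{bucket}/o?{query}"


def list_objects(bucket: str, prefix: str, max_results: int = 5) -> list:
    """Return object names under the prefix using an unauthenticated request.

    Needs network access; raises urllib.error.URLError if the host
    cannot be reached.
    """
    url = listing_url(bucket, prefix, max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    return [item["name"] for item in payload.get("items", [])]
```

Calling list_objects(BUCKET, PREFIX) from the machine that fails should either print a few object names or raise a network error. That helps separate a connectivity problem from a credentials problem.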

J-shel · 2022-07-21T08:25:47Z

Hi, I tried downloading the data from GCP simply using "gsutil cp -R gs://dm-nowcasting-example-data ." and it succeeded. It didn't ask for any credentials. Now it's kind of confusing. Could you please take a look at the screenshot I attached? As you say, the run script uses nimrod-uk-1km.py to load and process the data, but I can't find nimrod-uk-1km.py in my directory. Am I missing something?

jacobbieker · 2022-07-21T09:05:10Z

Yeah, the nimrod-uk-1km script is downloaded to the HuggingFace cache, usually somewhere under ~/.cache/huggingface/, and is loaded on the fly from HuggingFace, so it's not included in the repo.
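To locate the cached script, a recursive search of the cache directory is usually enough. The sketch below assumes the default cache root ~/.cache/huggingface, or the directory named by the HF_HOME environment variable if it is set. The function names are hypothetical helpers, not part of the HuggingFace libraries.

```python
import os
from pathlib import Path
from typing import List, Optional


def default_cache_root() -> Path:
    """Return HF_HOME if set, otherwise ~/.cache/huggingface."""
    return Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))


def find_cached_files(filename: str, cache_root: Optional[Path] = None) -> List[Path]:
    """Search the cache directory recursively for files with the given name."""
    root = Path(cache_root) if cache_root is not None else default_cache_root()
    if not root.is_dir():
        # Nothing has been cached yet, or the cache lives somewhere else.
        return []
    return sorted(path for path in root.rglob(filename) if path.is_file())


# Print every cached copy of the dataset script (prints nothing if none exist).
for path in find_cached_files("nimrod-uk-1km.py"):
    print(path)
```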

J-shel · 2022-07-21T09:54:00Z

Yes, I found it! I have no idea why that happened, but when I moved to a GPU machine, I didn't get that error any more. However, I hit a new error, shown below.

jacobbieker · 2022-07-21T09:57:49Z

Yeah, sorry, I've been trying to get it to run on multiple GPUs, but there seems to be an issue with parameterized modules that currently doesn't allow that. So if you change gpus to 1 it should work. You will probably have to reduce the batch size as well.
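The single-device setup suggested above could be sketched as follows. The helpers only build plain values and do not import PyTorch Lightning. The function names and the halving factor are assumptions for illustration, while max_epochs, precision, gpus and accelerator are the Trainer arguments already shown in this thread.

```python
def single_device_trainer_kwargs(use_gpu: bool) -> dict:
    """Return Trainer keyword arguments for one GPU or for CPU-only training."""
    kwargs = {"max_epochs": 1000, "precision": 32}
    if use_gpu:
        # A single GPU, since multi-GPU training is reported not to work yet.
        kwargs["gpus"] = 1
    else:
        kwargs["accelerator"] = "cpu"
    return kwargs


def reduced_batch_size(batch_size: int, factor: int = 2) -> int:
    """Shrink a batch size by an integer factor, never going below 1."""
    return max(1, batch_size // factor)
```

In run.py the result would then be passed along the lines of Trainer(logger=wandb_logger, callbacks=[model_checkpoint], **single_device_trainer_kwargs(use_gpu=True)). The batch size can be lowered step by step with reduced_batch_size until training fits in memory.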

J-shel · 2022-07-21T10:26:46Z

Got it! Thank you very very much! O(∩_∩)O

J-shel added the bug label Jul 21, 2022

jacobbieker closed this as completed Jul 23, 2022

ZHANGZ1YUE mentioned this issue Sep 6, 2022

Training on other dataset + Error on using run.py #33

Open
