Errors #8

Closed

lorrp1 opened this issue Aug 20, 2020 · 14 comments


lorrp1 commented Aug 20, 2020

Hello, I'm running stallion.py but I'm getting a few problems:

Using gpus=1 in the trainer returns:

models/temporal_fusion_transformer/__init__.py", line 793, in _log_interpretation dim=0

Using only the CPU it works, but it returns this error:

AttributeError: module 'tensorflow._api.v1.io.gfile' has no attribute 'get_filesystem'

Apparently this is due to a pytorch/tensorboard incompatibility, but I managed to install pytorch-forecasting without problems, so could you tell me which versions of pytorch/tensorboard you are using?

Thanks


jdb78 commented Aug 20, 2020

This is a known tensorboard issue. Uninstall tensorflow and then install tensorboard 2.2 (I think 2.3 has an unfixed bug).
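
A minimal sketch of the suggested fix, run from a shell (the exact version pin below is an assumption; adjust as needed):

    pip uninstall -y tensorflow          # remove tensorflow so tensorboard stops reaching for its gfile API
    pip install "tensorboard>=2.2,<2.3"  # stay on the 2.2 series, as suggested above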

On GPU: I have not tested that recently. If you know of any free CI with GPUs, I would be very keen to learn about it.


lorrp1 commented Aug 22, 2020

I managed to get it working by reinstalling everything.

> On GPU: I have not tested that recently. If you know of any free CI with GPUs, I would be very keen to learn about it.

If by CI you mean cloud, then Colab should be fine.

I'm having another "problem": I can't find a way to increase the number of training batches per epoch, which is stuck at 5 (even though the error says 4):

ValueError: val_check_interval (200) must be less than or equal to the number of the training batches (4).

I have tried reading the pytorch_lightning documentation, but there is only min_epochs or a way to limit batches using limit_train_batches.

The problem is that the loss stops getting lower after a few epochs (too early compared to other models). I think it may be caused by the low number of training batches, and the results seem almost random every time because of this.


jdb78 commented Aug 22, 2020

I guess there are 4 training batches because the training data loader drops the last, incomplete one.

Probably, your data is just super tiny or your batch size is huge. You should be able to fix your early stopping problem by increasing the patience on the pytorch lightning early stopping callback. Anyway, deep learning normally excels with larger data.
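
A minimal sketch of both knobs, assuming the usual pytorch-forecasting setup from the tutorials (training/validation are placeholder TimeSeriesDataSet objects, and the exact Trainer arguments depend on your pytorch-lightning version):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    # smaller batches -> more training batches per epoch
    # (the training dataloader drops the last, incomplete batch)
    train_dataloader = training.to_dataloader(train=True, batch_size=32, num_workers=0)
    val_dataloader = validation.to_dataloader(train=False, batch_size=32, num_workers=0)

    # more patience so early stopping does not end training after a few flat epochs
    early_stop = EarlyStopping(monitor="val_loss", patience=10, mode="min")
    trainer = Trainer(max_epochs=100, callbacks=[early_stop])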

I updated the docs, and there are now some better tutorials available.

BTW: I will run stuff on the GPU tomorrow and will fix the remaining issues. By CI, I mean a continuous integration system such as CircleCI to test PRs automatically. I might eventually settle for self-hosted GPUs and just fork out the money, but I would prefer a free method. Colab is great but not made for that.


lorrp1 commented Aug 22, 2020

It's a very simple periodic function, so it shouldn't have problems, but the length is definitely very small. I'll try increasing the length of the data frame or reducing the batch_size itself.


lorrp1 commented Aug 23, 2020

Using a much smaller batch size returns a much better result.
I have seen the "fix gpu" change; I'm going to check soon whether it works now.


jdb78 commented Aug 25, 2020

Have you checked that the training works on GPU for you? I am always grateful for feedback!


lorrp1 commented Aug 26, 2020

I haven't had time yet, but I'll try as soon as possible.


lorrp1 commented Aug 27, 2020

Not working with the latest update from pip:

Epoch 16: 100%|██████████| 27/Traceback (most recent call last):8.498, v_num=55]
  File "/home//Documents//TimeSeries/Multivariate/pytorchForecasting/randomTIme/stallion2.py", line 113, in <module>
    preds, index = tft.predict(val_dataloader, return_index=True, fast_dev_run=True)
  File "/home//.local/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py", line 505, in predict
    out = self(x)  # raw output is dictionary
  File "/home//.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home//.local/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py", line 447, in forward
    input_vectors[name] = emb(x_cat[..., self.hparams.x_categoricals.index(name)])
  File "/home//.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home//.local/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home//.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select


jdb78 commented Aug 27, 2020

Thanks for the bug report - I have not run into this yet. Interesting that this seems to happen after 16 epochs. It's definitely time to employ GPUs for testing - I will give this a shot next week.


jdb78 commented Aug 27, 2020

@lorrp1 Are you sure you are on the newest version? Line 447 in the traceback seems to be line 449 on master (https://github.com/jdb78/pytorch-forecasting/blob/master/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py#L449)


lorrp1 commented Aug 27, 2020

by "last from pip" i meant to write pypi.
with the last from the master (0.2.4) same:


Epoch 48: 100%|███████████| 27Traceback (most recent call last):1.221, v_num=56]
  File "/home//Documents//TimeSeries/Multivariate/pytorchForecasting/randomTIme/stallion2.py", line 113, in <module>
    preds, index = tft.predict(val_dataloader, return_index=True, fast_dev_run=True)
  File "/home//.local/lib/python3.7/site-packages/pytorch_forecasting/models/base_model.py", line 541, in predict
    out = self(x)  # raw output is dictionary
  File "/home//.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home//.local/lib/python3.7/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py", line 449, in forward
    input_vectors[name] = emb(x_cat[..., self.hparams.x_categoricals.index(name)])
  File "/home//.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home//.local/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home//.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Also, it's pretty strange that the gpus=0 version ended at epoch 32 (without errors).


jdb78 commented Aug 27, 2020

I think finishing without errors is due to early stopping, which is expected behaviour.

On the other error, I believe the root cause is this:

  1. There is no issue with training, because you fail at predicting the validation dataset after training (this explains why the epoch is greater than 0).
  2. The model lives on the GPU and you try to predict with a dataloader that outputs data on the CPU. There is a very easy manual fix: just call tft.to("cpu") before calling its predict function (see the sketch after this list).
  3. There is a chance you can fix the error by installing the latest version of pytorch-lightning. I had to manually move the model to the GPU to reproduce your error.
  4. This is actually a good catch, because there might be use cases where you want to also predict on the GPU. I will add this in the next version of the package so data is moved to the GPU before being passed on to the model in case it lives on the GPU.
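
A minimal sketch of the manual workaround from point 2 (tft and val_dataloader are whatever model and dataloader you already use; the predict arguments match the traceback above):

    # after training on the GPU, move the model back to the CPU so it matches the
    # CPU tensors coming out of the dataloader, then predict as usual
    tft.to("cpu")
    preds, index = tft.predict(val_dataloader, return_index=True, fast_dev_run=True)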

Let me know if this fixes it for you.


jdb78 commented Aug 30, 2020

Fixed in #27


lorrp1 commented Sep 3, 2020

It works after adding tft.to("cpu") before .predict.

jdb78 closed this as completed Sep 4, 2020