Tensor Dimension Error When Applying TFT to Multiple Groups in Own Data #85

Closed
AlexMRuch opened this issue Oct 8, 2020 · 12 comments · Fixed by #108
@AlexMRuch
AlexMRuch commented Oct 8, 2020

Hi @jdb78,

After getting the TFT model working well for one group of data based on our last convo (and updating to the latest version of the library), I'm getting an odd tensor dimension error when I try to train my model across multiple groups in my data. Specifically, on epoch 15 I get this error/trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-32-14fda4f79b4a> in <module>
      1 # Train model
----> 2 trainer.fit(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1071             self.accelerator_backend = GPUBackend(self)
   1072             model = self.accelerator_backend.setup(model)
-> 1073             results = self.accelerator_backend.train(model)
   1074 
   1075         elif self.use_tpu:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py in train(self, model)
     49 
     50     def train(self, model):
---> 51         results = self.trainer.run_pretrain_routine(model)
     52         return results
     53 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1237 
   1238         # CORE TRAINING LOOP
-> 1239         self.train()
   1240 
   1241     def _run_sanity_check(self, ref_model, model):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    392                 # RUN TNG EPOCH
    393                 # -----------------
--> 394                 self.run_training_epoch()
    395 
    396                 if self.max_steps and self.max_steps <= self.global_step:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    548 
    549         # process epoch outputs
--> 550         self.run_training_epoch_end(epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
    551 
    552         # checkpoint callback

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch_end(self, epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
    662             # run training_epoch_end
    663             # a list with a result per optimizer index
--> 664             epoch_output = model.training_epoch_end(epoch_output)
    665 
    666             if isinstance(epoch_output, Result):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in training_epoch_end(self, outputs)
    133 
    134     def training_epoch_end(self, outputs):
--> 135         log, _ = self.epoch_end(outputs, label="train")
    136         return log
    137 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in epoch_end(self, outputs, label)
    613         log, out = super().epoch_end(outputs, label=label)
    614         if self.log_interval(label == "train") > 0:
--> 615             self._log_interpretation(out, label=label)
    616         return log, out
    617 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in _log_interpretation(self, outputs, label)
    820         """
    821         # extract interpretations
--> 822         interpretation = {
    823             name: torch.stack([x["interpretation"][name] for x in outputs]).sum(0)
    824             for name in outputs[0]["interpretation"].keys()

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in <dictcomp>(.0)
    821         # extract interpretations
    822         interpretation = {
--> 823             name: torch.stack([x["interpretation"][name] for x in outputs]).sum(0)
    824             for name in outputs[0]["interpretation"].keys()
    825         }

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 6 and 7 in dimension 1 at /tmp/pip-req-build-8yht7tdu/aten/src/THC/generic/THCTensorMath.cu:71

At first I thought this could be because not every state has the same amount of data available. For example, for select states, the number of days for which data exist is:

state
CA    218
FL    218
GA    218
NY    218
TX    218
WA    260

So I tried just getting the last 218 observations for each state and resetting time_idx, but the same tensor error was raised.
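
For readers following along, that truncation step might look roughly like the sketch below. The toy DataFrame, the column names (`state`, `date`, `cases`), and the choice to rebuild `time_idx` from the date are all assumptions for illustration; they are not taken from the linked notebook.

```python
import pandas as pd

# Toy stand-in for the real data: two states with different history lengths.
df = pd.DataFrame({
    "state": ["CA"] * 4 + ["WA"] * 6,
    "date": list(pd.date_range("2020-01-03", periods=4))
            + list(pd.date_range("2020-01-01", periods=6)),
    "cases": list(range(10)),
})

N_LAST = 4  # 218 in the issue; small here so the toy data stays readable

# Keep only the most recent N_LAST rows of every state.
trimmed = (
    df.sort_values(["state", "date"])
      .groupby("state")
      .tail(N_LAST)
      .copy()
)

# Rebuild the integer time index from the date so that the same calendar
# day gets the same time_idx in every state (assumed meaning of "resetting").
trimmed["time_idx"] = (trimmed["date"] - trimmed["date"].min()).dt.days
```

After this, every state has the same number of rows and the same `time_idx` range, which is the situation in which the error was still raised.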

Right now I'm only including the state as a group variable and am doing univariate time series modeling, so no other data are in the TFT.

Also, the learning rate finder works; it's just the training that's failing. Really odd.

I've made my code available here: https://drive.google.com/file/d/1r1w2tHZJrr8iXVqw7U5_qL1x4iOUsauk/view?usp=sharing. The data I'm playing with are on COVID, and the notebook includes a pd.read_csv() call that reads in all the data for you from an online source, so you should be able to run it on your own without any issues.

Any thoughts? Again, I greatly appreciate your feedback as I'm new to deep learning on time series data.

Thanks in advance!

Best,
Alex

@jdb78 jdb78 added the help wanted Extra attention is needed label Oct 8, 2020
@AlexMRuch
Author

AlexMRuch commented Oct 8, 2020

After looking at the Stallion data, I presume there is no need to make sure all groups have the same length of data, given that the Stallion data grouped by agency has counts such as:

data.groupby("agency").count()
agency
agency
Agency_01    360
Agency_02    540
Agency_03    360
Agency_04    300
Agency_05    540
Agency_07    360
Agency_08    300
Agency_09    420
...

@AlexMRuch
Author

AlexMRuch commented Oct 8, 2020

On the other hand, the data all has the same max time_idx:

data.groupby("agency").max()["time_idx"]
agency
Agency_01    59
Agency_02    59
Agency_03    59
Agency_04    59
Agency_05    59
Agency_07    59
Agency_08    59
Agency_09    59
...

@AlexMRuch
Author

AlexMRuch commented Oct 8, 2020

Also odd because running with fast_dev_run=True is successful: the model is able to train/validate without any issue for an epoch, the model saves, I can reload it and apply it, etc. So I'm really not sure what's going on at Epoch 15.

That's what makes me think I'm doing something really subtle and stupid haha 😬

@AlexMRuch
Author

Ugh, when I add in all 50 states the model fails on the very first epoch with an error very similar to the one above:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 25 and 24 in dimension 1 at /tmp/pip-req-build-8yht7tdu/aten/src/THC/generic/THCTensorMath.cu:71

New notebook for all 50 states: https://drive.google.com/file/d/1N2GBs8PDksE-dGSs31hMZZOZ8oAZZZGJ/view?usp=sharing

@AlexMRuch
Author

I got past the "fail on epoch one" bit by setting min_encoder_length to 1 instead of equal to the MAX_ENCODER_LENGTH – so hopefully that's the issue 🤞

    min_encoder_length = 1,  # allow predictions without history
    #min_encoder_length = MAX_ENCODER_LENGTH,  # keep encoder length long (as it is in the validation set)

@AlexMRuch
Author

THAT SOLVED IT (at least I believe, as it worked for my example dataframe with a few states and the model with all 50 states is going past epoch 1)! 🎉

@AlexMRuch
Author

AlexMRuch commented Oct 9, 2020

Update: the model finished training with all 50 states 🎉

That being said, I'm wondering if some kind of assert should be included to prevent the model from training if min_encoder_length is not small enough? Is there some quick math that can be done to check this when the TimeSeriesDataSet class is instantiated?

Please feel free to close this, @jdb78, pending your review. Hopefully this helps others who hit this problem or leads to a note being added to the docs about lowering min_encoder_length as a possible solution if others hit the same error 😄

@jdb78
Collaborator

jdb78 commented Oct 9, 2020

Hm, this looks like a real bug to me, particularly because the issue seems to be in the interpretation. There is currently only an assert that throws an error if no timeseries are left to work with. An additional assert that already alerts when a single timeseries is removed would be useful though.
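
As a rough illustration of the kind of pre-flight check discussed here, the sketch below flags groups that have too few rows to yield even one encoder-plus-prediction window. `too_short_groups` is a hypothetical helper, not part of pytorch_forecasting, and it assumes each group has one row per time step with no gaps.

```python
from collections import Counter


def too_short_groups(group_labels, min_encoder_length, min_prediction_length=1):
    """Return {group: row_count} for groups that cannot fill one full window.

    Hypothetical helper: a group needs at least
    min_encoder_length + min_prediction_length rows to produce a sample.
    """
    required = min_encoder_length + min_prediction_length
    counts = Counter(group_labels)
    return {group: n for group, n in counts.items() if n < required}


# Toy group column: two long series and one made-up short series "XX".
labels = ["CA"] * 218 + ["WA"] * 260 + ["XX"] * 20
short = too_short_groups(labels, min_encoder_length=30, min_prediction_length=7)
```

A dataset constructor could warn or raise when the returned dictionary is non-empty, rather than only when every series has been dropped.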

@jdb78 jdb78 added the bug Something isn't working label Oct 9, 2020
@jdb78 jdb78 self-assigned this Oct 9, 2020
@AlexMRuch
Author

Wow, really happy to hear that I wasn't just being a dunce, lol. Really appreciate that feedback!

@jdb78
Collaborator

jdb78 commented Oct 17, 2020

The issue seems to be stacking variable length tensors - surprised it did not come up earlier. Fixing in #108 now. Performance of the TFT is pretty bad in the notebook. Might have a look and see if it can be fixed if I have time.
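
The failure mode can be reproduced in isolation with a minimal sketch. The 1-D tensors below are stand-ins for per-batch interpretation outputs of length 6 and 7 (the sizes from the first traceback). The zero-padding workaround is only one possible approach and is an assumption here, not necessarily what #108 implements.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Stand-ins for per-batch interpretation vectors with different lengths.
batch_outputs = [torch.ones(6), torch.ones(7)]

# torch.stack requires identical shapes, so mixed lengths raise a RuntimeError.
try:
    torch.stack(batch_outputs).sum(0)
    stack_failed = False
except RuntimeError:
    stack_failed = True

# Possible workaround (assumption): right-pad with zeros to the longest
# length, then sum over the batch dimension.
padded = pad_sequence(batch_outputs, batch_first=True)  # shape (2, 7)
summed = padded.sum(0)                                  # shape (7,)
```

With `min_encoder_length` below the maximum, different batches can produce different encoder lengths, which is when shapes like these can diverge.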

@jdb78
Collaborator

jdb78 commented Oct 17, 2020

Update: Easy fix for the poor performance: no covariates were given at all, not even the target variable as an unknown time-varying real variable.
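
A hedged sketch of what that configuration could look like follows. The keyword names are `TimeSeriesDataSet` arguments; the column names (`cases`, `state`) and the length constants are made up for illustration and do not come from the notebook.

```python
# Hypothetical window lengths; the notebook's actual values are not shown here.
MAX_ENCODER_LENGTH = 30
MAX_PREDICTION_LENGTH = 7

dataset_kwargs = dict(
    time_idx="time_idx",
    target="cases",                       # assumed target column name
    group_ids=["state"],
    min_encoder_length=1,                 # the setting that avoided the crash
    max_encoder_length=MAX_ENCODER_LENGTH,
    max_prediction_length=MAX_PREDICTION_LENGTH,
    # List the target as an unknown time-varying real so the model can use
    # its own history as an input.
    time_varying_unknown_reals=["cases"],
)

# With pytorch_forecasting installed and a prepared DataFrame `data`:
# from pytorch_forecasting import TimeSeriesDataSet
# training = TimeSeriesDataSet(data, **dataset_kwargs)
```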

@AlexMRuch
Author

Oh, gosh, yeah, 🤦 I updated the notebook I've been using for experimenting to include those covariates as well as a ton of other covariates from that epi data in addition to extending the model to run on each state (see below for better performance). Once I get the issues worked out from the other threads I can share!
[Two images attached to the original comment; not reproduced here]
