Tensor Dimension Error When Applying TFT to Multiple Groups in Own Data #85

AlexMRuch · 2020-10-08T03:35:05Z

After getting the TFT model working well for one group of data based on our last convo (and updating to the latest version of the library), I'm getting an odd tensor dimension error when I try to train my model across multiple groups on my data. Specifically on epoch 15 I get this error/trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-32-14fda4f79b4a> in <module>
      1 # Train model
----> 2 trainer.fit(
      3     tft,
      4     train_dataloader = train_dataloader,
      5     val_dataloaders = val_dataloader

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py in wrapped_fn(self, *args, **kwargs)
     46             if entering is not None:
     47                 self.state = entering
---> 48             result = fn(self, *args, **kwargs)
     49 
     50             # The INTERRUPTED state can be set inside the run function. To indicate that run was interrupted

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
   1071             self.accelerator_backend = GPUBackend(self)
   1072             model = self.accelerator_backend.setup(model)
-> 1073             results = self.accelerator_backend.train(model)
   1074 
   1075         elif self.use_tpu:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_backend.py in train(self, model)
     49 
     50     def train(self, model):
---> 51         results = self.trainer.run_pretrain_routine(model)
     52         return results
     53 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in run_pretrain_routine(self, model)
   1237 
   1238         # CORE TRAINING LOOP
-> 1239         self.train()
   1240 
   1241     def _run_sanity_check(self, ref_model, model):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in train(self)
    392                 # RUN TNG EPOCH
    393                 # -----------------
--> 394                 self.run_training_epoch()
    395 
    396                 if self.max_steps and self.max_steps <= self.global_step:

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    548 
    549         # process epoch outputs
--> 550         self.run_training_epoch_end(epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
    551 
    552         # checkpoint callback

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch_end(self, epoch_output, checkpoint_accumulator, early_stopping_accumulator, num_optimizers)
    662             # run training_epoch_end
    663             # a list with a result per optimizer index
--> 664             epoch_output = model.training_epoch_end(epoch_output)
    665 
    666             if isinstance(epoch_output, Result):

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/base_model.py in training_epoch_end(self, outputs)
    133 
    134     def training_epoch_end(self, outputs):
--> 135         log, _ = self.epoch_end(outputs, label="train")
    136         return log
    137 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in epoch_end(self, outputs, label)
    613         log, out = super().epoch_end(outputs, label=label)
    614         if self.log_interval(label == "train") > 0:
--> 615             self._log_interpretation(out, label=label)
    616         return log, out
    617 

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in _log_interpretation(self, outputs, label)
    820         """
    821         # extract interpretations
--> 822         interpretation = {
    823             name: torch.stack([x["interpretation"][name] for x in outputs]).sum(0)
    824             for name in outputs[0]["interpretation"].keys()

~/anaconda3/envs/forecasting/lib/python3.8/site-packages/pytorch_forecasting/models/temporal_fusion_transformer/__init__.py in <dictcomp>(.0)
    821         # extract interpretations
    822         interpretation = {
--> 823             name: torch.stack([x["interpretation"][name] for x in outputs]).sum(0)
    824             for name in outputs[0]["interpretation"].keys()
    825         }

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 6 and 7 in dimension 1 at /tmp/pip-req-build-8yht7tdu/aten/src/THC/generic/THCTensorMath.cu:71

At first I thought this could be because not every state has the same amount of data available. For example, for select states, the number of days for which data exist are:

state
CA    218
FL    218
GA    218
NY    218
TX    218
WA    260

So I tried just getting the last 218 observations for each state and resetting time_idx, but the same tensor error was raised.

Right now I'm only including the state as a group variable and am doing univariate time series modeling, so no other data are in the TFT.

Also, the learning rate finder works, this is just the training that's failing. Really odd.

I've made my code available here: https://drive.google.com/file/d/1r1w2tHZJrr8iXVqw7U5_qL1x4iOUsauk/view?usp=sharing. The data I'm playing with are on COVID, and the notebook includes a pd.read_csv() call that read in all the data for you from an online source, so you should be able to run it on your own without any issues.

Any thoughts? Again, I greatly appreciate your feedback as I'm new to deep learning on time series data.

Thanks in advance!

Best,
Alex

The text was updated successfully, but these errors were encountered:

AlexMRuch · 2020-10-08T18:28:32Z

After looking at the Stallion data, I presume there is no need to make sure all groups have the same length data, given the stallion data grouped by agency has counts such as

data.groupby("agency").count()
agency
agency
Agency_01    360
Agency_02    540
Agency_03    360
Agency_04    300
Agency_05    540
Agency_07    360
Agency_08    300
Agency_09    420
...

AlexMRuch · 2020-10-08T19:12:05Z

On the other hand, the data all has the same max time_idx:

data.groupby("agency").max()["time_idx"]
agency
Agency_01    59
Agency_02    59
Agency_03    59
Agency_04    59
Agency_05    59
Agency_07    59
Agency_08    59
Agency_09    59
...

AlexMRuch · 2020-10-08T19:25:09Z

Also odd because running with fast_dev_run=True is successful: the model is able to train/validate without any issue for an epoch, the model saves, I can reload it and apply it, etc. So I'm really not sure what's going on at Epoch 15.

That's what makes me thing I'm doing something really subtle and stupid haha 😬

AlexMRuch · 2020-10-08T20:08:59Z

Ugh, when I add in all 50 states the model fails on the very first epoch with a very similar error as above:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 25 and 24 in dimension 1 at /tmp/pip-req-build-8yht7tdu/aten/src/THC/generic/THCTensorMath.cu:71

New notebook for all 50 states: https://drive.google.com/file/d/1N2GBs8PDksE-dGSs31hMZZOZ8oAZZZGJ/view?usp=sharing

AlexMRuch · 2020-10-08T20:14:35Z

I got past the "fail on epoch one" bit by setting min_encoder_length to 1 instead of equal to the MAX_ENCODER_LENGTH – so hopefully that's the issue 🤞

    min_encoder_length = 1,  # allow predictions without history
    #min_encoder_length = MAX_ENCODER_LENGTH,  # keep encoder length long (as it is in the validation set)

AlexMRuch · 2020-10-08T20:22:51Z

THAT SOLVED IT (at least I believe, as it worked for my example dataframe with a few states and the model with all 50 states is going past epoch 1)! 🎉

AlexMRuch · 2020-10-09T14:01:42Z

Update: the model finished training with all 50 states 🎉

That being said, I'm wondering if some kind of assert should be included to prevent the model from training if min_encoder_length is not small enough? Is there some quick math that can be done to check this when the TimeSeriesDataSet class is initiated?

Please feel free to close this, @jdb78, pending your review. Hopefully this helps others who hit this problem or leads to a note being added to the docs about lowering min_encoder_length as a possible solution if others hit the same error 😄

jdb78 · 2020-10-09T15:54:25Z

Hm, this looks like a real bug to me - particularly because the issue seems to be in the interpretation. There is currently only an assert that will throw an error if no timeseries are left to work with. Another assert to alert already if one timeseries is removed would be useful though.

AlexMRuch · 2020-10-09T15:56:04Z

Wow, really happy to hear that I wasn't just being a dunce, lol. Really appreciate that feedback!

jdb78 · 2020-10-17T17:49:06Z

The issue seems to be stacking variable length tensors - surprised it did not come up earlier. Fixing in #108 now. Performance of the TFT is pretty bad in the notebook. Might have a look and see if can be fixed if I have time.

jdb78 · 2020-10-17T20:50:49Z

Update: Easy fix: there were no covariates given including the target variable as an unknown time varying real variable.

AlexMRuch · 2020-10-19T19:36:41Z

Oh, gosh, yeah, 🤦 I updated the notebook I've been using for experimenting to include those covariates as well as a ton of other covariates from that epi data in addition to extending the model to run on each state (see below for better performance). Once I get the issues worked out from the other threads I can share!

jdb78 added the help wanted Extra attention is needed label Oct 8, 2020

jdb78 added the bug Something isn't working label Oct 9, 2020

jdb78 self-assigned this Oct 9, 2020

AlexMRuch mentioned this issue Oct 15, 2020

Error when Using Distributed GPU Processing #103

Open

jdb78 linked a pull request Oct 17, 2020 that will close this issue

Enable stacking of variable lengths tensors #108

Merged

jdb78 closed this as completed in #108 Oct 17, 2020

TomSteenbergen mentioned this issue Apr 19, 2021

TemporalFusionTransformer predicting with mode "raw" results in RuntimeError: Sizes of tensors must match except in dimension 0 #449

Open

cristianegea mentioned this issue Sep 7, 2021

torch.cat(): Sizes of tensors must match except in dimension 0. Got 117 and 85 in dimension 3 (The offending index is 1) #666

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tensor Dimension Error When Applying TFT to Multiple Groups in Own Data #85

Tensor Dimension Error When Applying TFT to Multiple Groups in Own Data #85

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020

AlexMRuch commented Oct 8, 2020

AlexMRuch commented Oct 8, 2020

AlexMRuch commented Oct 9, 2020 •

edited

Loading

jdb78 commented Oct 9, 2020

AlexMRuch commented Oct 9, 2020

jdb78 commented Oct 17, 2020

jdb78 commented Oct 17, 2020

AlexMRuch commented Oct 19, 2020

Tensor Dimension Error When Applying TFT to Multiple Groups in Own Data #85

Tensor Dimension Error When Applying TFT to Multiple Groups in Own Data #85

Comments

AlexMRuch commented Oct 8, 2020 • edited Loading

AlexMRuch commented Oct 8, 2020 • edited Loading

AlexMRuch commented Oct 8, 2020 • edited Loading

AlexMRuch commented Oct 8, 2020 • edited Loading

AlexMRuch commented Oct 8, 2020

AlexMRuch commented Oct 8, 2020

AlexMRuch commented Oct 8, 2020

AlexMRuch commented Oct 9, 2020 • edited Loading

jdb78 commented Oct 9, 2020

AlexMRuch commented Oct 9, 2020

jdb78 commented Oct 17, 2020

jdb78 commented Oct 17, 2020

AlexMRuch commented Oct 19, 2020

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 8, 2020 •

edited

Loading

AlexMRuch commented Oct 9, 2020 •

edited

Loading