Prediction on unseen data #83

Closed
randomgitdude opened this issue Oct 7, 2020 · 10 comments
Labels
question Further information is requested

Comments

@randomgitdude

Hi,
Firstly - thank you for the time, work, and commitment that went into this package. It's all good stuff. Yet one thing I'm struggling with is how to check predictions on data that was not seen by the trainer class (from the documentation). I guess I should append it to the original data - but do you have any good practices you can share?

@AlexMRuch

Hi @randomgitdude, check out #67 and the updated tutorials at https://github.com/jdb78/pytorch-forecasting/blob/master/docs/source/tutorials/stallion.ipynb.

@jdb78 jdb78 added the question Further information is requested label Oct 8, 2020
@randomgitdude
Author

@AlexMRuch Thank you for pointing me to that updated tutorial. However, I have a few questions:

  1. encoder_data = data[lambda x: x.time_idx > x.time_idx.max() - max_encoder_length]
    This indeed creates a new data set - but what about scaling these features? AFAIK only the TimeSeriesDataSet class does that, and if we don't call that class on the new data up front, we are feeding the NN the unseen data "as-is", without the pre-processing it was initially trained with.

  2. I managed to get predictions with the following steps (consolidated into a sketch below):
    -> Loading the new data
    -> Standard pre-processing (assigning categorical variables)
    -> Assigning a date index, the same way as for the training set:
    bdays = pd.bdate_range("2006-01-01", "2030-01-01")
    date_map = dict(zip(bdays, np.arange(len(bdays)) + 1))
    df_valid["Date_Time"] = df_valid["Date_Time"].map(date_map).astype(int)
    -> Creating abc = TimeSeriesDataSet.from_dataset(training, df_valid, predict=True, stop_randomization=True)
    -> Then testing_sample = abc.to_dataloader(train=False)
    -> Finally raw_predictions = best_tft.predict(testing_sample)

But I wonder if this is the correct approach, or am I missing something?
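Pulling those steps together, a minimal sketch (assuming training is the TimeSeriesDataSet the model was fitted on, best_tft the trained model, and df_valid the pre-processed new data, all as named above):

```python
from pytorch_forecasting import TimeSeriesDataSet

# reuse the pre-processing (encoders, normalizers, scalers) fitted on `training`
abc = TimeSeriesDataSet.from_dataset(
    training, df_valid, predict=True, stop_randomization=True
)

# predict=True keeps only the last max_prediction_length points of each
# series for decoding; train=False disables shuffling
testing_sample = abc.to_dataloader(train=False, batch_size=64)
raw_predictions = best_tft.predict(testing_sample)
```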

@AlexMRuch

For point 1, do you mean scaling the future target and covariate data, or scaling the past historical data that the model was trained on? If you mean the latter, I'm not sure, as my time-series skills are not that great and @jdb78 may have more thoughts. For historical data, I think you can still do all the scaling you need on the data DataFrame before this step, as the lambda is only slicing off a section of the DataFrame.
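For illustration, a sketch of that ordering (assuming a hypothetical continuous column "volume" in data, and max_encoder_length as in the tutorial):

```python
from sklearn.preprocessing import StandardScaler

# scale the full DataFrame first ("volume" is a placeholder covariate column)
scaler = StandardScaler()
data["volume"] = scaler.fit_transform(data[["volume"]])

# ...then slice off the encoder window, as in the tutorial snippet
encoder_data = data[lambda x: x.time_idx > x.time_idx.max() - max_encoder_length]
```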

For point 2, I'm glad to hear you got the predictions working. I implemented the forecasting methods just as @jdb78 did in the tutorial, and the results have face validity with what I'd expect (and they do differ from the evaluation plots), so that's about all I can say on whether the approach is correct.

@randomgitdude
Author

> For point 1, do you mean scaling the future target and covariate data, or scaling the past historical data that the model was trained on?

Future target data.

> For historical data, I think you can still do all the scaling you need on the data DataFrame before this step,

I was referring to future data. As for the historical data - it is actually scaled inside the training dataset class, so my assumption is that its statistics should also be used to scale the future data. Why? Simply because you have to scale the future values according to the mean and std of the training data, not according to the mean and std of the future values themselves.
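To illustrate the point with a plain sklearn scaler (a generic sketch, not the package's internal mechanism; train_df, future_df, and the "target" column are hypothetical):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit on the training period only, so mean/std come from the training data
train_df["target"] = scaler.fit_transform(train_df[["target"]])
# reuse the training statistics on the future data - no refitting
future_df["target"] = scaler.transform(future_df[["target"]])
```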

As for point no. 2, maybe @jdb78 can elaborate?

@AlexMRuch

Ah, yeah, I definitely see your point and am curious to know what is best-practice as well! Thanks for clarifying!

@jdb78
Owner

jdb78 commented Oct 9, 2020

Issue #51 sheds some light on this. There are basically two approaches, and both are implemented in PyTorch Forecasting.
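For concreteness, the two approaches map onto the target_normalizer argument of TimeSeriesDataSet: fit the normalization on the training set (e.g. per series via GroupNormalizer), or normalize each sample on its own encoder sequence via EncoderNormalizer, which carries over naturally to unseen data. A sketch, with all column names and lengths as placeholders:

```python
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import EncoderNormalizer, GroupNormalizer

common = dict(
    time_idx="time_idx",
    target="target",
    group_ids=["series"],
    max_encoder_length=60,
    max_prediction_length=20,
)

# Approach 1: target statistics fitted on the training set, per series
training_a = TimeSeriesDataSet(
    data, target_normalizer=GroupNormalizer(groups=["series"]), **common
)

# Approach 2: target normalized on each encoder sequence individually
training_b = TimeSeriesDataSet(
    data, target_normalizer=EncoderNormalizer(), **common
)
```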

@randomgitdude
Author

The first option is off the table for various reasons - at least IMHO.
Now, the second: the EncoderNormalizer inherits from pytorch_forecasting.data.encoders.TorchNormalizer, which in turn inherits from sklearn's standard sklearn.base.BaseEstimator and sklearn.base.TransformerMixin. So far so good - but in that case, should I use it inside the TimeSeriesDataSet class or before feeding data into it? Because if I understand correctly, TimeSeriesDataSet only normalizes the target with the EncoderNormalizer.
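For what it's worth, the sklearn-style inheritance means the normalizer can also be exercised on its own, something like this sketch (assuming fit/transform/inverse_transform behave as the inheritance suggests):

```python
import torch
from pytorch_forecasting.data.encoders import EncoderNormalizer

normalizer = EncoderNormalizer()
encoder_target = torch.tensor([10.0, 12.0, 11.0, 13.0, 12.5])

normalizer.fit(encoder_target)                   # statistics from this encoder window only
scaled = normalizer.transform(encoder_target)    # normalized values
restored = normalizer.inverse_transform(scaled)  # back to the original scale
```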

@jdb78
Owner

jdb78 commented Oct 10, 2020

In practice, there should be minimal leakage from normalising on the entire training set instead of on the encoder sequence, provided the variable in question is not the target. Normalising something other than the target on the encoder sequence only would probably not work, because the normalisation would not be stable. If you want to contribute this feature, feel invited to raise a PR!
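If the goal is to normalise a covariate on the full training set, that can be wired in through the dataset's scalers argument, e.g. (a sketch; "discount" and the other column names are placeholders):

```python
from sklearn.preprocessing import StandardScaler
from pytorch_forecasting import TimeSeriesDataSet

training = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="target",
    group_ids=["series"],
    max_encoder_length=60,
    max_prediction_length=20,
    time_varying_unknown_reals=["discount"],
    # the covariate scaler is fitted on the whole training set,
    # not per encoder window
    scalers={"discount": StandardScaler()},
)
```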

@randomgitdude
Author

Ok - so a few questions:

  1. Why would the normalization not be stable?
  2. In the given case, does normalizing the target yield any benefit?
  3. Is abc = TimeSeriesDataSet.from_dataset(training, df_valid, predict=True, stop_randomization=True) a viable way of pre-processing an unseen dataset?

@jdb78
Owner

jdb78 commented Oct 12, 2020

Sure.

  1. Calculating variance when most of the values are constant is likely to be difficult (e.g. a price that is mostly constant). You can imagine the normalization changing vastly just by moving a few timesteps, which prevents learning useful information (see the sketch after this list).
  2. NNs have trouble outputting unnormalised numbers. It is possible, but you run into issues because all the non-linearities are built for values between roughly -2 and 2. Further, normalisation makes values across time series comparable, hence facilitating transfer learning.
  3. Yes, because it means that you copy the pre-processors from training to abc.
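A quick numeric illustration of point 1 (plain numpy, not package code): with a mostly-constant series, shifting the encoder window by a few timesteps changes the scale statistics dramatically.

```python
import numpy as np

# a mostly-constant price series with one recent jump
series = np.array([100.0] * 58 + [130.0, 131.0])

window_a = series[:30]    # encoder window before the jump
window_b = series[30:60]  # encoder window including the jump

print(window_a.std())  # 0.0  -> normalization is degenerate (zero scale)
print(window_b.std())  # ~7.6 -> a few timesteps later the scale is completely different
```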

Hope this is helpful.
