
Manage imbalancing in TFT #1040

Open
LuigiDarkSimeone opened this issue Jun 20, 2022 · 7 comments

@LuigiDarkSimeone

  • PyTorch-Forecasting version: 0.10.2
  • PyTorch version:
  • Python version: 3.8.5
  • Operating System: Windows

I have a dataset of several shops. For each shop I have a time series of sales.
The shops are spread unequally across the world (1,000 in the US, 100 in the EU), and I need to predict sales based on location and other variables.
However, such a dataset is imbalanced.
Is there a way to manage imbalance in TFT (upsampling, downsampling, applying sample weights as in sklearn, or forcing each batch to select an equal number of examples)?

@fnavruzov

Have you tried the "weight" argument when creating the dataset? You can create a column with weights to be used in training:

ds = TimeSeriesDataSet(
    data=data[train_data_filter],
    time_idx=time_idx_col,
    target=...,
    weight='weight',  # name of a weight column in your df holding per-sample weights
    group_ids=group_ids,
    ...
)
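
To build such a column, one option (just a sketch, assuming a hypothetical "region" column reflecting the 1,000 US / 100 EU split described above) is inverse-frequency weighting, so both regions contribute equally to the loss:

import pandas as pd

# Inverse-frequency weights: each region contributes equally overall
# (US rows get 1/1000, EU rows get 1/100)
region_counts = data["region"].value_counts()
data["weight"] = data["region"].map(1.0 / region_counts)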

@RonanFR

RonanFR commented Jun 20, 2022

Hi @LuigiDarkSimeone,

  1. As suggested by @fnavruzov, one way to "rebalance" the dataset could be to use the weight argument of TimeSeriesDataSet. This will generate a weight tensor in addition to the target tensor used while fitting the model.
    Note that in this case, the portion of the loss associated with each sample is weighted differently. This is similar to what is done in scikit-learn (the sample_weight argument of the .fit(...) method).

  2. You could also use the weights to alter the probability that a given sample is included in a mini-batch (sampling scheme). As indicated in the documentation, you can call the to_dataloader method with a custom sampler, for example an instance of torch's WeightedRandomSampler. You can find a small example here; see also the minimal sketch at the end of this comment.

  3. You can also combine both 1) and 2).

N.B.: The DeepAR paper empirically shows the benefit of method 2) compared to not using any weights. To the best of my knowledge, they do not present any results based on method 1). That being said, in their setting the main problem is the sheer size of the dataset: since the total number of samples is huge, it may not be possible to go over all samples several times during training, and they show that weighting the samples based on their "velocity" greatly improves performance.

See also: Weighted loss functions vs weighted sampling?
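
A minimal sketch of option 2), assuming `training` is the TimeSeriesDataSet and that to_dataloader forwards extra keyword arguments to torch's DataLoader (the uniform placeholder weights and the batch size are assumptions to replace with your own):

import torch
from torch.utils.data import WeightedRandomSampler

# One weight per sample in the dataset; replace the uniform placeholder
# with e.g. higher weights for the under-represented EU shops
sample_weights = torch.ones(len(training), dtype=torch.double)

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(training),
    replacement=True,
)

# shuffle must be off when a custom sampler is supplied
train_dataloader = training.to_dataloader(
    train=True, batch_size=64, sampler=sampler, shuffle=False
)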

@LuigiDarkSimeone
Author

First of all, thanks to @RonanFR and @fnavruzov for your replies.
Lately it has been quite hard to get answers in here.
I will have a look at your options and test them to see whether they are suitable for my case.

Given how much I have been struggling to get answers, and since you seem experienced, I would kindly ask you to have a look at this question I posted quite a few days ago (which I guess will never get an answer):

#1032

I know it is not good practice to post another question in a different issue, so I apologise in advance, but I cannot get past this problem, even after looking at the source code.
Hope to hear from you soon.

Many thanks,
Luigi

@FrancescoFondaco

Thanks @RonanFR, @fnavruzov.

I am trying to implement what you've suggested, using the "weight" argument of the TimeSeriesDataSet class in order to manage imbalances in my dataset:

training = TimeSeriesDataSet(
    myData,
    time_idx="Time_idx",
    target="TVPI",
    group_ids=["Fund"],
    min_encoder_length=8,
    max_encoder_length=80,
    min_prediction_length=1,
    max_prediction_length=30,
    weight="Weight",
    static_categoricals=...,
)

where the Weight column contains the weight associated with each sample:
[screenshot: dataframe showing the Weight column]

Unfortunately, the described implementation raises the error below:
[screenshot: error traceback]

Would you know how to solve it?
Thanks,
Francesco

@RonanFR

RonanFR commented Jul 7, 2022

Hi @FrancescoFondaco ,

Can you provide a detailed minimal reproducible example that raises this error (a small toy dataset of only a few lines)?
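
For reference, a skeleton for such an example (a sketch with made-up data, not the actual frame from the comment above) might look like:

import numpy as np
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# Two groups, 20 contiguous time steps each, constant unit weights
toy = pd.DataFrame({
    "Fund": np.repeat(["A", "B"], 20),
    "Time_idx": np.tile(np.arange(20), 2),
    "TVPI": np.random.rand(40),
    "Weight": 1.0,
})

ds = TimeSeriesDataSet(
    toy,
    time_idx="Time_idx",
    target="TVPI",
    group_ids=["Fund"],
    max_encoder_length=8,
    max_prediction_length=2,
    weight="Weight",
)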

@QijiaShao

> (quoting @FrancescoFondaco's comment above)

Have you figured out this issue? I am having the same issue after adding the "weight" parameter. Thx!

@terbed

terbed commented May 8, 2023

Dear @FrancescoFondaco and @QijiaShao,
I suspect the issue is related to the automatic forward-fill NaN mechanism: if your time index is not contiguous, the missing steps are filled in, but the weights are missing for those rows. So you should disable automatic filling if you are using weights. This is just a guess.
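
If that guess is right, a quick check for gaps per group before constructing the dataset could look like this (a sketch; the column names follow the snippet above, and the neutral fill value of 1.0 is an assumption):

import pandas as pd

def groups_with_gaps(df: pd.DataFrame, group_col: str, time_col: str) -> pd.Series:
    # True for every group whose time index skips at least one step
    return df.groupby(group_col)[time_col].apply(
        lambda s: s.sort_values().diff().dropna().gt(1).any()
    )

print(groups_with_gaps(myData, "Fund", "Time_idx"))

# If gaps exist, reindex each group to a contiguous time index and fill
# the Weight column yourself (e.g. with a neutral 1.0) before building
# the dataset, rather than relying on the automatic filling.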

Best wishes,
Daniel
