question about my dataset #29

Open
SSSUNSHINNING opened this issue Dec 18, 2022 · 5 comments

@SSSUNSHINNING

Thank you for sharing your code. There are some parts of it that I don't understand because of my limited experience; could you help me?

  1. The scenario you are modelling segments the data directly, but the scenario I am dealing with uses a sliding window to traverse the data dynamically. Where should I make changes if I want to load my own data?
  2. I still don't quite understand the difference between ImputationDataset and TransductionDataset; it seems that there is no difference between the classification and regression tasks in the unsupervised learning phase.

Looking forward to your reply; thanks for your help.
@SSSUNSHINNING
Author

I don't need the IDs, but without them the code can't run.

@gzerveas
Owner

Hello,

Regarding the ImputationDataset and TransductionDataset: these are PyTorch Dataset classes that implement the tasks of imputation (i.e. filling masked/missing values across all variables) and transduction (i.e. consistently predicting some entirely missing variables from some others which are always available). They can be used as pretraining objectives, but may also be the actual target tasks one is interested in.
Sample-level regression and classification are simply the two downstream tasks that are used to evaluate the architecture and pretraining capabilities of the multivariate time series transformer. They don't affect pretraining at all; unsupervised pretraining is done through imputation, and the model that is learned is ready to be fine-tuned for either sample-level ("extrinsic") regression or classification.

Regarding your own dataset: In your case, one way of implementing your scenario would be to define a new dataset in the data.py file, where you would do the sliding window segmentation in advance and define some "dummy IDs", similar to how it is done in the example PMUData class here.

Do you have a compelling reason why you wish to do the segmentation dynamically, instead of defining all samples in advance? If so, then what I would do is define a new ImputationDataset which would offer the same functionality with respect to masking data, but e.g. use the IterableDataset (see here) to get the data in chunks, instead of accessing a DataFrame by index, as it does now in __getitem__. I think everything else can remain the same. Your sample IDs (which you can define at will, like 0, 1, 2, ...) will correspond to your main time series samples, and the train/validation/test splits will be done based on those (because you don't want to be using parts of the same sample in the training set and parts in the other sets); the IterableDataset will then produce sub-samples (sliding windows) for each sample ID in your training and other sets.
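
For illustration, here is a rough sketch of the first option (pre-computing the sliding windows and assigning dummy IDs). The class name SlidingWindowData and the window/stride parameters are made up for this example; the attribute names follow the general pattern of the dataset classes in data.py, but you would need to adapt the details to your data:

```python
# Rough sketch (illustration only): pre-compute sliding windows over one long
# recording and assign consecutive "dummy" integer IDs, loosely following the
# pattern of the example PMUData class in data.py. SlidingWindowData, window
# and stride are made-up names, not part of the repository.
import numpy as np
import pandas as pd


class SlidingWindowData:
    def __init__(self, df, window=100, stride=10):
        # df: one long multivariate series (rows = time steps, columns = variables)
        segments, ids = [], []
        for sample_id, start in enumerate(range(0, len(df) - window + 1, stride)):
            segments.append(df.iloc[start:start + window])
            ids.extend([sample_id] * window)  # every row of a window shares the same dummy ID

        # all_df is indexed by the dummy sample ID, like the other dataset classes
        self.all_df = pd.concat(segments).set_index(pd.Index(ids, name="sample_id"))
        self.all_IDs = self.all_df.index.unique()  # 0, 1, 2, ...
        self.feature_names = list(df.columns)
        self.feature_df = self.all_df[self.feature_names]
        self.max_seq_len = window


# Toy usage with random data:
raw = pd.DataFrame(np.random.randn(1000, 3), columns=["var_a", "var_b", "var_c"])
data = SlidingWindowData(raw, window=100, stride=50)
print(len(data.all_IDs), "windows of length", data.max_seq_len)
```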

@SSSUNSHINNING
Author

Hello, thank you for your answer. If my data has no missing values, can I just pre-train with TransductionDataset? Also, I still don't quite understand the specific role of the padding mask, which only appears in the attention calculation; if it is not used, does it have any effect on the results?

@SSSUNSHINNING
Author

In my field there is no missing data, every sample has the same length, and there are no unequal-length sequences.

@gzerveas
Owner

gzerveas commented Jan 6, 2023

Regarding padding: If all the samples in your dataset are of the same length, then padding is not necessary in your case. You don't need to do anything yourself; the batching code in dataset.py will detect that all your samples are of equal length, and the padding masks which are generated will be "all-ones" tensors, meaning that no input time steps will be ignored; they can all potentially contribute to the calculation.
In the general case, if the samples had differing lengths, the batching code would pad sequences shorter than the longest sequence in the batch with some value (e.g. 0s), until they all have the same maximum length. This is done to allow efficient batch processing in tensors. The padding masks in this case would have 0s in the positions corresponding to the padding, such that in the attention layer these positions receive a really large negative value and thus, after softmax, a zero attention weight, meaning they will be completely ignored.
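To make the mechanism concrete, here is a simplified illustration of what the padding masks do; this is not the exact code from dataset.py, just the idea, and the function name pad_batch is made up:

```python
# Simplified illustration of padding masks (not the actual code in dataset.py).
import torch

def pad_batch(sequences):
    # sequences: list of (seq_len_i, num_vars) tensors, possibly of different lengths
    max_len = max(s.shape[0] for s in sequences)
    num_vars = sequences[0].shape[1]
    batch = torch.zeros(len(sequences), max_len, num_vars)
    padding_mask = torch.zeros(len(sequences), max_len, dtype=torch.bool)
    for i, s in enumerate(sequences):
        batch[i, :s.shape[0]] = s
        padding_mask[i, :s.shape[0]] = True  # 1 = real time step, 0 = padding
    return batch, padding_mask

# Equal-length samples -> "all-ones" masks, so nothing is ignored:
_, mask = pad_batch([torch.randn(50, 3), torch.randn(50, 3)])
print(mask.all())  # tensor(True)

# Unequal lengths -> padded key positions get a very large negative score,
# so after softmax their attention weight is (practically) zero:
scores = torch.randn(2, 60, 60)  # hypothetical (batch, query, key) attention scores
_, mask = pad_batch([torch.randn(60, 3), torch.randn(40, 3)])
scores = scores.masked_fill(~mask.unsqueeze(1), -1e9)
weights = torch.softmax(scores, dim=-1)  # ~0 weight on the padded positions
```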

Regarding the training objective you want to use, imputation vs transduction:
Your data doesn't need to originally have missing values to do imputation training. It is simply an artificial pre-training strategy/technique: we mask the input and ask the model to guess the missing values. That would make it an excellent objective for actual imputation, if we were really missing data, but that's not its only use; through this objective, the model learns much about the distribution of the time series variables, the co-dependencies between signals, and the dynamics of each variable. So you can apply it to your dataset for a variety of downstream tasks.
Another possible strategy for pre-training is to allow the model to consistently observe, for each sample, several variables completely (i.e. with no missing parts), and to try to predict some other, entirely (or partially) missing variables from the existing/observable ones. This is called transduction. It is a technique that can work well in some cases, and I have tested its effectiveness, but it was not the focus of this paper. That's why I have removed some of the relevant code from main.py. I can add it back if there is demand for it.
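To contrast the two objectives concretely, here is a simplified sketch of the two kinds of masks. The actual mask generation in the code is more elaborate (e.g. the imputation masks use geometrically distributed segment lengths rather than independent random positions), and the function names below are just for illustration:

```python
# Simplified contrast between the two masking strategies (the actual mask
# generation in the repository is more elaborate; this only illustrates the idea).
import torch

def imputation_mask(seq_len, num_vars, mask_ratio=0.15):
    # Imputation: hide random positions scattered across all variables.
    return torch.rand(seq_len, num_vars) > mask_ratio  # True = observed, False = to predict

def transduction_mask(seq_len, num_vars, hidden_vars=(2, 3)):
    # Transduction: hide some variables entirely; the rest stay fully observed.
    mask = torch.ones(seq_len, num_vars, dtype=torch.bool)
    mask[:, list(hidden_vars)] = False
    return mask

X = torch.randn(100, 4)                    # one sample: 100 time steps, 4 variables
imp = imputation_mask(100, 4)              # "holes" everywhere
trans = transduction_mask(100, 4, (2, 3))  # variables 2 and 3 are fully hidden
# In both cases the loss is computed only on the masked (False) positions:
imputation_targets = X[~imp]
transduction_targets = X[~trans]
```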
