
tabnet_pretrain accepts missings in predictors #68

Merged: 27 commits merged into main from feature/accept_missings_in_predictors on Dec 13, 2021

Conversation

cregouby (Collaborator) commented Nov 6, 2021

Fix #65 and add a corresponding vignette. At the same time:

  • add vip::vip() and tabnet_explain() support for tabnet_pretrain objects
  • improve random_obfuscator masking performance
    (image attached in the original PR)
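For context, here is a minimal sketch of how the features listed above could be exercised; the toy data and parameter values are illustrative assumptions, not code from this PR:

```r
library(tabnet)

# toy data with missing values injected at random into one predictor
data_with_na <- iris
data_with_na$Sepal.Length[sample(nrow(data_with_na), 20)] <- NA

# unsupervised pretraining now accepts NAs in the predictors
pretrained <- tabnet_pretrain(Species ~ ., data = data_with_na, epochs = 5)

# variable importance and attention masks now also work on tabnet_pretrain objects
vip::vip(pretrained)
explained <- tabnet_explain(pretrained, new_data = data_with_na)
```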

cregouby requested a review from dfalbel on November 6, 2021 21:41
cregouby marked this pull request as draft on November 8, 2021 09:07
cregouby marked this pull request as ready for review on November 13, 2021 10:58
cregouby (Collaborator, Author) commented Dec 3, 2021

Hello @dfalbel, any chance you could spend some time reviewing this PR?

dfalbel (Member) commented Dec 3, 2021

@cregouby! So sorry, I completely missed it. Will review today!

cregouby (Collaborator, Author) commented Dec 3, 2021

Thanks!

dfalbel (Member) commented Dec 3, 2021

Hey @cregouby! I couldn't get to it today, but will have a better look tomorrow.
Sorry for the delay.

cregouby (Collaborator, Author) commented Dec 4, 2021

No worries at all.

dfalbel (Member) left a comment

Hi @cregouby! Code-wise this looks great! I don't have any comments.

I'm a bit confused by the example, though.
It seems to me that masking the NA values is a good decision when missingness occurs at random, so that the model doesn't encode that information.

In the ames case, I'd say that in general we won't want to discard this information, as it will be important for capturing the 0 versus >0 relationship, for example whether a house has a pool or not.

Do I understand correctly, or am I missing something?

Sorry again for taking so long to review your PR.

cregouby (Collaborator, Author) commented Dec 8, 2021

You are right that this feature is primarily a solution for missing-at-random (MAR) datasets; maybe that is worth mentioning in the vignette. For not-missing-at-random (NMAR) data, like in Ames, the question is definitely valid and covers multiple topics:

Human perspective

This is about the information encoded in the variables and their semantics, not about the modeling aspects:

  • do we understand that the value pool surface = 0 m² is proxy information for an implicit column has_pool = FALSE? (I think yes...)
  • how should we understand a value like pool surface = 0.01 m² in the dataset? Since we encode it as a positive numerical value, it would be considered valid and could come out of any regression.
  • the same goes for the categorical pool feature: pool_condition = "no_pool" is also a proxy encoding, not an actual pool condition...

That's what I had in mind when I prepared the ames_missing dataset (a sketch of that transformation is shown below).
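A sketch of how such an ames_missing dataset could be derived, assuming the modeldata::ames column names Pool_Area and Pool_QC; this is a reconstruction of the idea, not necessarily the exact code used for the vignette:

```r
library(modeldata)
data("ames", package = "modeldata")

ames_missing <- ames
# a zero pool surface is a proxy for "has no pool": make the absence explicit
ames_missing$Pool_Area[ames_missing$Pool_Area == 0] <- NA
# "No_Pool" is a proxy level, not an actual pool condition: turn it into an explicit NA too
ames_missing$Pool_QC[ames_missing$Pool_QC == "No_Pool"] <- NA
```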

Model quality perspective

Now, is it beneficial for model performance?

  • is the model technically capable of forging an internal representation of the proxy variable has_pool from the interaction between pool_quality = "no_pool" and pool_surface = 0 m²? I think the answer is yes, as giving room to such features is the aim of the attention block.
  • does the model have enough data to do it with the ames dataset? I don't know.
  • will the pretrained model be better with explicit NAs or with encoded NAs for the downstream task of Ames price prediction? I don't know (and I dodged the question in the vignette by not going that far...). A comparison sketch follows this list.
  • will the pretrained model be better at capturing this interaction with explicit NAs or with encoded NAs? Maybe not, as you suggest, but at least the variable importance output is more trustworthy with explicit NAs (as the vignette highlights), and this is an interesting topic (that I only touch on for now).
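One possible way to compare the two variants on the downstream task, assuming tabnet_fit() can be warm-started from a pretrained model through its tabnet_model argument (dataset names and epoch counts are placeholders):

```r
library(tabnet)

# pretrain once with explicit NAs, once with the original proxy encodings
pre_explicit <- tabnet_pretrain(Sale_Price ~ ., data = ames_missing, epochs = 50)
pre_encoded  <- tabnet_pretrain(Sale_Price ~ ., data = ames, epochs = 50)

# fine-tune both on the complete data and compare held-out performance
fit_explicit <- tabnet_fit(Sale_Price ~ ., data = ames,
                           tabnet_model = pre_explicit, epochs = 50)
fit_encoded  <- tabnet_fit(Sale_Price ~ ., data = ames,
                           tabnet_model = pre_encoded, epochs = 50)
```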

I may have a chance to discuss the fundamental part of this with one of the Missing Data CRAN Task View authors. In the meantime, I don't know what's best...

dfalbel (Member) commented Dec 8, 2021

@cregouby OK! I'm now much more confident that I fully understand the approach for handling missing values in the training data.
I agree with you that there are many factors to discuss around that theme and there is no perfect solution.

I think for completeness it would be nice to add a paragraph in the motivation section of the vignette describing at a high level how we approach the problem in TabNet, and what the wins and possible drawbacks are. That way I feel users will be more confident relying on TabNet's missing-data handling.

What do you think?

cregouby (Collaborator, Author) commented
Sounds perfect to me!

cregouby (Collaborator, Author) commented Dec 12, 2021

Hello @dfalbel,
In the end, the presence or absence of missing data in the vip plot is not reproducible: it happened by chance and varies quite a bit between training runs. So I prefer to remove the vignette (which ended up reading more like a blog article, as you can see in https://github.com/mlverse/tabnet/blob/4f4404aac9dcc18e8b76d8fbd81fc7f72285e174/vignettes/Missing_data_predictors.Rmd).

We may revive it later on (when/if I can manage NAs in downstream tasks).

dfalbel (Member) commented Dec 13, 2021

@cregouby Sounds great! Feel free to merge this PR whenever you want.

cregouby merged commit 36dd73a into main on Dec 13, 2021
cregouby added a commit that referenced this pull request Dec 18, 2021
cregouby deleted the feature/accept_missings_in_predictors branch on January 16, 2022

Linked issue that merging this pull request may close: Allow missing values in predictors during pretraining through a static NA_mask (#65)