tabnet_pretrain accepts missings in predictors #68
Conversation
…est for na values
…uld mimic the embedded_x (WiP) and maybe be with no_autograd (TBC)
…ncludes an embedded_x_na_mask along embedded_x.
fix inconsistent logic between na_mask, add better error message for embedding dimension mismatch
improve boolean inversion performance, add embedded_x_na_mask support in validation
make vip consider tabnet_pretrain models
add tests for explain, improve test description
…increasing the dataset size, remove duplicated `tabnet_explain()` call
in order to differentiate `outcomes=` value
add and fix text of `tabnet_explain.tabnet_pretrain` for data.frame
lower tests footprint
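The commits above revolve around carrying an NA mask alongside the embedded predictors, and inverting it cheaply. A minimal base-R sketch of the idea (the variable names here are illustrative, not the package's internals):

```r
# Illustrative sketch only: how a missingness mask can travel with the data
# so that positions that were NA can be excluded downstream.
x <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

na_mask <- is.na(x)        # TRUE wherever a predictor value is missing
x_filled <- x
x_filled[na_mask] <- 0     # placeholder value before embedding

# Inverting the boolean mask marks the positions that should contribute
# to the reconstruction loss; `!na_mask` is the base-R analogue of the
# "boolean inversion" the commits refer to.
keep <- !na_mask
```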
Hello @dfalbel, any chance you could spend some time reviewing this PR?
@cregouby ! So sorry, I completely missed it. Will review today!
Thanks!
Hey @cregouby! I couldn't get to it today, but will have a better look tomorrow.
No worries at all. |
Hi @cregouby ! Codewise this looks great! I don't have any comments.
I'm a bit confused by the example though.
It seems to me that masking the NA values would be a good decision when missingness occurs randomly, since in that case we don't want the model to encode that information.
In the ames case, I'd say that in general we won't want to discard this information, as it will be important for finding the 0 versus >0 relationship (for example, whether a house has a pool or not).
Do I understand correctly, or maybe I am missing something?
Sorry again for the long time to review your PR.
You are right that this feature is a solution for missing-at-random (MAR) datasets. Maybe it is worth mentioning in the vignette... For not-missing-at-random (NMAR) data, like in Ames, the question is definitely valid and covers multiple topics:
- human perspective: this is related to the information encoded in the variables and their semantics, nothing related to the modeling aspects
- model quality perspective: now, is it beneficial for model performance?
I may have a slot to discuss the fundamental part of it with one of the …
@cregouby OK! I'm much more confident that I fully understand the approach for handling missings in the training data. I think for completeness it would be nice to add a paragraph in the motivation section of the vignette describing at a high level how we approach the problem in TabNet and what the wins and possible drawbacks are. That way I feel users will be more confident in relying on TabNet's missing data handling. What do you think?
It sounds perfect to me!
Hello @dfalbel, we may revive it later on (when/if I can manage NAs in downstream tasks).
@cregouby Sounds great! Feel free to merge this PR whenever you want.
Fix #65 and add the corresponding vignette.
At the same time:
- `vip::vip()` and `tabnet_explain()` support for `tabnet_pretrain` objects
- `random_obfuscator` masking performance
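A rough usage sketch of what this PR enables, assuming the interfaces referenced in the conversation (untested; argument names and the `ames` data source are assumptions, not confirmed by this PR):

```r
library(tabnet)
library(modeldata)   # assumed source of the `ames` dataset
data("ames")

# Pretrain directly on data whose predictors contain NAs;
# per this PR, missing cells are masked rather than causing an error.
pretrained <- tabnet_pretrain(Sale_Price ~ ., data = ames, epochs = 5)

# tabnet_explain() now also accepts tabnet_pretrain objects.
explained <- tabnet_explain(pretrained, new_data = ames)

# vip() can rank predictor importance for pretrained models too.
vip::vip(pretrained)
```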