
tabnet_pretrain accepts missings in predictors #68

Merged: 27 commits merged into main from feature/accept_missings_in_predictors on Dec 13, 2021

Conversation

cregouby (Collaborator) commented Nov 6, 2021

Fix #65 and add a corresponding vignette. At the same time:

  • add vip::vip() and tabnet_explain() support for tabnet_pretrain objects
  • improve random_obfuscator masking performance
    (image attached in the original PR)
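For context, here is a minimal sketch of how the features listed above could be exercised; the toy data and parameter values are illustrative assumptions, not code from this PR:

```r
library(tabnet)

# toy data with missing values injected at random into one predictor
data_with_na <- iris
data_with_na$Sepal.Length[sample(nrow(data_with_na), 20)] <- NA

# unsupervised pretraining now accepts NAs in the predictors
pretrained <- tabnet_pretrain(Species ~ ., data = data_with_na, epochs = 5)

# variable importance and attention masks now also work on tabnet_pretrain objects
vip::vip(pretrained)
explained <- tabnet_explain(pretrained, new_data = data_with_na)
```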

cregouby requested a review from dfalbel on November 6, 2021 21:41
cregouby marked this pull request as draft on November 8, 2021 09:07
cregouby marked this pull request as ready for review on November 13, 2021 10:58
cregouby (Collaborator, Author) commented Dec 3, 2021

Hello @dfalbel, any chance you could spend some time reviewing this PR?

dfalbel (Member) commented Dec 3, 2021

@cregouby! So sorry, I completely missed it. Will review today!

cregouby (Collaborator, Author) commented Dec 3, 2021

Thanks!

dfalbel (Member) commented Dec 3, 2021

Hey @cregouby! I couldn't get to it today, but will have a better look tomorrow.
Sorry for the delay.

cregouby (Collaborator, Author) commented Dec 4, 2021

No worries at all.

dfalbel (Member) left a comment

Hi @cregouby! Code-wise this looks great! I don't have any comments.

I'm a bit confused by the example, though.
It seems to me that masking the NA values is a good decision when missingness occurs at random, so that the model doesn't encode that information.

In the ames case, I'd say that in general we won't want to discard this information, as it will be important for capturing the 0 versus >0 relationship, for example whether a house has a pool or not.

Do I understand correctly, or am I missing something?

Sorry again for taking so long to review your PR.

cregouby (Collaborator, Author) commented Dec 8, 2021

You are right that this feature is primarily a solution for missing-at-random (MAR) datasets; maybe that is worth mentioning in the vignette. For not-missing-at-random (NMAR) data, like in Ames, the question is definitely valid and covers multiple topics:

Human perspective

This is about the information encoded in the variables and their semantics, not about the modeling aspects:

  • do we understand that the value pool surface = 0 m² is proxy information for an implicit column has_pool = FALSE? (I think yes...)
  • how should we understand a value like pool surface = 0.01 m² in the dataset? Since we encode it as a positive numerical value, it would be considered valid and could come out of any regression.
  • the same goes for the categorical pool feature: pool_condition = "no_pool" is also a proxy encoding, not an actual pool condition...

That's what I had in mind when I prepared the ames_missing dataset (a sketch of that transformation is shown below).
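A sketch of how such an ames_missing dataset could be derived, assuming the modeldata::ames column names Pool_Area and Pool_QC; this is a reconstruction of the idea, not necessarily the exact code used for the vignette:

```r
library(modeldata)
data("ames", package = "modeldata")

ames_missing <- ames
# a zero pool surface is a proxy for "has no pool": make the absence explicit
ames_missing$Pool_Area[ames_missing$Pool_Area == 0] <- NA
# "No_Pool" is a proxy level, not an actual pool condition: turn it into an explicit NA too
ames_missing$Pool_QC[ames_missing$Pool_QC == "No_Pool"] <- NA
```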

Model quality perspective

Now, is it beneficial for model performance?

  • is the model technically capable of forging an internal representation of the proxy variable has_pool from the interaction between pool_quality = "no_pool" and pool_surface = 0 m²? I think the answer is yes, as giving room to such features is the aim of the attention block.
  • does the model have enough data to do it with the ames dataset? I don't know.
  • will the pretrained model be better with explicit NAs or with encoded NAs for the downstream task of Ames price prediction? I don't know (and I dodged the question in the vignette by not going that far...). A comparison sketch follows this list.
  • will the pretrained model be better at capturing this interaction with explicit NAs or with encoded NAs? Maybe not, as you suggest, but at least the variable importance output is more trustworthy with explicit NAs (as the vignette highlights), and this is an interesting topic (that I only touch on for now).
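One possible way to compare the two variants on the downstream task, assuming tabnet_fit() can be warm-started from a pretrained model through its tabnet_model argument (dataset names and epoch counts are placeholders):

```r
library(tabnet)

# pretrain once with explicit NAs, once with the original proxy encodings
pre_explicit <- tabnet_pretrain(Sale_Price ~ ., data = ames_missing, epochs = 50)
pre_encoded  <- tabnet_pretrain(Sale_Price ~ ., data = ames, epochs = 50)

# fine-tune both on the complete data and compare held-out performance
fit_explicit <- tabnet_fit(Sale_Price ~ ., data = ames,
                           tabnet_model = pre_explicit, epochs = 50)
fit_encoded  <- tabnet_fit(Sale_Price ~ ., data = ames,
                           tabnet_model = pre_encoded, epochs = 50)
```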

I may have a chance to discuss the fundamental part of this with one of the Missing Data CRAN Task View authors. In the meantime, I don't know what's best...

dfalbel (Member) commented Dec 8, 2021

@cregouby OK! I'm now much more confident that I fully understand the approach for handling missing values in the training data.
I agree with you that there are many factors to discuss around that theme and there is no perfect solution.

I think for completeness it would be nice to add a paragraph in the motivation section of the vignette describing at a high level how we approach the problem in TabNet, and what the wins and possible drawbacks are. That way I feel users will be more confident relying on TabNet's missing-data handling.

What do you think?

cregouby (Collaborator, Author) commented
Sounds perfect to me!

cregouby (Collaborator, Author) commented Dec 12, 2021

Hello @dfalbel,
In the end, the presence or absence of missing data in the vip plot is not reproducible: it happened by chance and varies quite a bit between training runs. So I prefer to remove the vignette (which ended up reading more like a blog article, as you can see in https://github.com/mlverse/tabnet/blob/4f4404aac9dcc18e8b76d8fbd81fc7f72285e174/vignettes/Missing_data_predictors.Rmd).

We may revive it later on (when/if I can manage NAs in downstream tasks).

dfalbel (Member) commented Dec 13, 2021

@cregouby Sounds great! Feel free to merge this PR whenever you want.

cregouby merged commit 36dd73a into main on Dec 13, 2021
cregouby added a commit that referenced this pull request Dec 18, 2021
cregouby deleted the feature/accept_missings_in_predictors branch on January 16, 2022

Linked issue that merging this pull request may close: Allow missing values in predictors during pretraining through a static NA_mask (#65)