Add category onehot encoder for both ECD and GBM #3057

tgaddair · 2023-02-08T04:40:06Z

This PR enables GBMs to support any vector-like input feature (vector, sequence, text, timeseries, even image or audio) provided there is a fixed (non-trainable) encoder available.
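To make the idea of a fixed (non-trainable) encoder concrete, here is a minimal pure-Python sketch of one-hot encoding a category column into fixed-width vectors that a tree model could consume as plain numeric columns. The helper names `build_vocab` and `onehot_encode` are hypothetical and chosen for illustration; this is not Ludwig's implementation.

```python
from typing import Dict, List, Sequence


def build_vocab(values: Sequence[str]) -> Dict[str, int]:
    """Map each distinct category to an integer index (sorted for determinism)."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def onehot_encode(values: Sequence[str], vocab: Dict[str, int]) -> List[List[float]]:
    """Encode each category as a fixed-width 0/1 vector; nothing here is trainable."""
    width = len(vocab)
    rows = []
    for value in values:
        row = [0.0] * width
        row[vocab[value]] = 1.0  # raises KeyError for categories missing from vocab
        rows.append(row)
    return rows


column = ["red", "green", "blue", "green"]
vocab = build_vocab(column)             # {"blue": 0, "green": 1, "red": 2}
encoded = onehot_encode(column, vocab)  # one row of width 3 per input value
```

Because the output depends only on the vocabulary, the encoded vectors can be computed once during preprocessing, which is what makes such an encoder usable with a model that cannot backpropagate into it.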

Follow-up tasks (outside the scope of this PR):

- Deprecate the existing sparse encoder, as it is essentially a more expensive version of onehot.
- Add support for vector-like features with GBMs in the schema.

github-actions · 2023-02-08T06:01:16Z

Unit Test Results

6 files ±0, 6 suites ±0, duration 6h 46m 26s ⏱️ (+52m 3s)
4 037 tests +19: 3 994 ✔️ +19, 43 💤 ±0, 0 ❌ ±0
12 129 runs +54: 11 995 ✔️ +55, 134 💤 -1, 0 ❌ ±0

Results for commit 39d463d. ± Comparison against base commit bd139bb.

♻️ This comment has been updated with latest results.

w4nderlust · 2023-02-28T00:07:42Z

I would remove the current sparse embedding mechanism: it does the same thing functionally but is more expensive, as it keeps the matrix of binary values in memory.
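One way to read the point above is sketched below: a lookup into a stored binary (identity) table and a one-hot vector built on demand produce the same output, but the former holds `vocab_size * vocab_size` values in memory. This is an illustrative pure-Python comparison with hypothetical function names, not the code of either Ludwig encoder.

```python
from typing import List


def identity_table(vocab_size: int) -> List[List[float]]:
    """Materialize a vocab_size x vocab_size binary table (the stored-matrix approach)."""
    return [[1.0 if i == j else 0.0 for j in range(vocab_size)] for i in range(vocab_size)]


def lookup_encode(indices: List[int], table: List[List[float]]) -> List[List[float]]:
    """Encode by indexing rows of the stored table."""
    return [list(table[i]) for i in indices]


def onehot_on_the_fly(indices: List[int], vocab_size: int) -> List[List[float]]:
    """Encode by building each row on demand; no table is kept between calls."""
    return [[1.0 if j == i else 0.0 for j in range(vocab_size)] for i in indices]


vocab_size = 4
batch = [2, 0, 3]
table = identity_table(vocab_size)
stored_cells = vocab_size * vocab_size  # grows quadratically with the vocabulary
same_output = lookup_encode(batch, table) == onehot_on_the_fly(batch, vocab_size)
```

For a vocabulary of 10,000 categories the stored table would hold 100 million values, while the on-the-fly version only allocates the rows for the current batch.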

tgaddair · 2023-02-28T01:09:35Z

Good point, @w4nderlust. How would you want to implement that in practice, given backwards compatibility? We could, for example, mark it as deprecated similar to what we did for the resnet and vit legacy image encoders. Or did you have something else in mind?

w4nderlust · 2023-02-28T02:45:02Z

We could deprecate the representation parameter in the Embed encoder altogether (the behaviour would always be dense). It would also impact other feature types and encoders that use that module. I'm trying to think of cases where that would be a problem, though, and honestly I can't find any... maybe deprecating in 0.7.1 and then removing in 0.8 could be a good timeline.

tgaddair · 2023-02-28T05:46:03Z

@w4nderlust my main worry would be if the existing sparse encoder / embed module adds any parameters to the saved state dict of the model. If so, then removing the module could break older Ludwig models. If not, then we could transparently upgrade models trained with the sparse encoder to use the new onehot encoder.
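If the sparse encoder turns out to add nothing to the saved state dict, the transparent upgrade discussed here could look roughly like the sketch below: a config rewrite that swaps the encoder type and emits a deprecation warning. The function name `upgrade_sparse_encoder` and the config keys (`encoder`, `type`, and the values `"sparse"` and `"onehot"`) are assumptions made for illustration, not Ludwig's confirmed schema or upgrade code.

```python
import copy
import warnings
from typing import Any, Dict


def upgrade_sparse_encoder(feature: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a feature config with a sparse encoder rewritten to onehot.

    All key names and values here are illustrative assumptions.
    """
    upgraded = copy.deepcopy(feature)  # never mutate the caller's config
    encoder = upgraded.get("encoder", {})
    if encoder.get("type") == "sparse":
        warnings.warn(
            "The sparse encoder is deprecated; using the onehot encoder instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        encoder["type"] = "onehot"
    return upgraded


old_feature = {"name": "color", "type": "category", "encoder": {"type": "sparse"}}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the DeprecationWarning is recorded
    new_feature = upgrade_sparse_encoder(old_feature)
```

A rewrite like this is only safe under the condition raised above: if the old module did store parameters, loading an older checkpoint would also need those keys handled explicitly.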

jeffkinnison

LGTM!

tgaddair added 5 commits February 7, 2023 16:31

WIP onehot encoding

76ebec3

Added dataset

a173dd7

Tests

27f8e38

Fix

e68fc1b

Disable test

ff70f18

tgaddair requested review from justinxzhao and arnavgarg1 February 8, 2023 04:40

pre-commit-ci bot and others added 4 commits February 8, 2023 04:40

[pre-commit.ci] auto fixes from pre-commit.com hooks

41d6720

for more information, see https://pre-commit.ci

Fix

14d41de

Merge branch 'gbm-encoders' of https://github.com/ludwig-ai/ludwig in…

67f9293

…to gbm-encoders

[pre-commit.ci] auto fixes from pre-commit.com hooks

18ada15

for more information, see https://pre-commit.ci

tgaddair and others added 6 commits February 27, 2023 14:20

Merge

4c4a545

Fixed imports

84d0699

Merge

121348c

[pre-commit.ci] auto fixes from pre-commit.com hooks

d580bf6

for more information, see https://pre-commit.ci

Fixed api

9656541

Merge branch 'gbm-encoders' of https://github.com/ludwig-ai/ludwig in…

2a5357a

…to gbm-encoders

tgaddair requested review from jppgks and jeffkinnison February 27, 2023 22:57

Make cache_encoder_embeddings optional for ecd

45b0fd8

jeffkinnison reviewed Feb 28, 2023

Review threads: ludwig/models/gbm.py (outdated), ludwig/utils/dataframe_utils.py (one outdated, two current)

Addressed comments

8140a0e

Fixed can cache embeddings

7c463d6

tgaddair added 2 commits February 28, 2023 21:10

Added exception

4c56106

Explain with onehot

f28c09d

tgaddair and others added 19 commits February 28, 2023 21:13

Merge branch 'master' into gbm-encoders

23383d7

Merge branch 'master' into gbm-encoders

54dc75b

Merge branch 'master' into gbm-encoders

47044ab

Merge branch 'master' into gbm-encoders

2e7e89c

Merge branch 'master' into gbm-encoders

79277f5

Feature importance for vector features

4fcc851

Fixed torchscript

b193a45

Change default back to passthrough

f96b995

Reduced test resources

5c61da8

Merge branch 'master' into gbm-encoders

af88369

Fixed prepare_batch

e8e6e23

Fixed embedding layer

044081e

Fixed torchscript

9aaa466

[pre-commit.ci] auto fixes from pre-commit.com hooks

d7466f3

for more information, see https://pre-commit.ci

Added tests

60af2cc

Merge branch 'gbm-encoders' of https://github.com/ludwig-ai/ludwig in…

bdc425e

…to gbm-encoders

[pre-commit.ci] auto fixes from pre-commit.com hooks

7b4a8c7

for more information, see https://pre-commit.ci

Use tensor_extension_casting

2e0f52c

Merge branch 'gbm-encoders' of https://github.com/ludwig-ai/ludwig in…

2e367bf

…to gbm-encoders

jeffkinnison approved these changes Mar 5, 2023

tgaddair changed the title ~~Add onehot encoding and make it the default GBM category encoder~~ Add category onehot encoder for both ECD and GBM Mar 5, 2023

tgaddair added 2 commits March 6, 2023 10:16

Move from_ray_dataset into context manager

5e6f2cc

Bump timeouts

39d463d

tgaddair merged commit fe0dca8 into master Mar 7, 2023

tgaddair deleted the gbm-encoders branch March 7, 2023 00:43
