Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add category onehot encoder for both ECD and GBM #3057

Merged
merged 41 commits into from
Mar 7, 2023
Merged

Conversation

tgaddair
Copy link
Collaborator

@tgaddair tgaddair commented Feb 8, 2023

This PR enables GBMs to support any vector-like input feature (vector, sequence, text, timeseries, even image or audio) provided there is a fixed (non-trainable) encoder available.

Follow-up tasks (outside the scope of this PR):

  • Deprecate existing sparse encoder, as it is essentially a more expensive version of onehot.
  • Add support for vector-like features with GBMs in the schema.

@github-actions
Copy link

github-actions bot commented Feb 8, 2023

Unit Test Results

         6 files  ±  0           6 suites  ±0   6h 46m 26s ⏱️ + 52m 3s
  4 037 tests +19    3 994 ✔️ +19    43 💤 ±0  0 ±0 
12 129 runs  +54  11 995 ✔️ +55  134 💤  - 1  0 ±0 

Results for commit 39d463d. ± Comparison against base commit bd139bb.

♻️ This comment has been updated with latest results.

@w4nderlust
Copy link
Collaborator

I would remove the current mechanism for sparse embedding as it does the same thing functionally but is more expensive (as it keeps the matrix of binary values in memory)

ludwig/models/gbm.py Outdated Show resolved Hide resolved
ludwig/utils/dataframe_utils.py Outdated Show resolved Hide resolved
ludwig/utils/dataframe_utils.py Show resolved Hide resolved
ludwig/utils/dataframe_utils.py Show resolved Hide resolved
@tgaddair
Copy link
Collaborator Author

Good point, @w4nderlust. How would you want to implement that in practice, given backwards compatibility? We could, for example, mark it as deprecated similar to what we did for the resnet and vit legacy image encoders. Or did you have something else in mind?

@w4nderlust
Copy link
Collaborator

We could deprecate the representation parameter i nthe Embed encoder alltogether (the behaviour would be always dense. It would impact also other feature types and encoders that use that module. I'm trying to think of cases where that would be a problem though, and I can't find an honestly... maybe deprecating in 0.7.1 and then removing in 0.8 could be a good timeline.

@tgaddair
Copy link
Collaborator Author

@w4nderlust my main worry would be if the existing sparse encoder / embed module adds any parameters to the saved state dict of the model. If so, then removing the module could break older Ludwig models. If not, then we could transparently upgrade models trained with the sparse encoder to use the new onehot encoder.

Copy link
Contributor

@jeffkinnison jeffkinnison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgaddair tgaddair changed the title Add onehot encoding and make it the default GBM category encoder Add category onehot encoder for both ECD and GBM Mar 5, 2023
@tgaddair tgaddair merged commit fe0dca8 into master Mar 7, 2023
@tgaddair tgaddair deleted the gbm-encoders branch March 7, 2023 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants