
Implements sequence_length param #3221

Merged
merged 17 commits into from
Mar 10, 2023
Conversation

geoffreyangus
Collaborator

@geoffreyangus geoffreyangus commented Mar 7, 2023

This PR implements a new config param for text and sequence features, sequence_length, a nullable positive integer.

sequence_length defaults to None. This means that the sequence length for a given feature will be inferred from the dataset. The inferred sequence length will be capped at max_sequence_length, which defaults to 256.

If sequence_length is not None, then the sequence length for a given feature will be the specified value, and samples will be padded or truncated as needed. If the specified sequence_length is greater than max_sequence_length, then the experiment will fail fast with an error message recommending that you set max_sequence_length to a value greater than or equal to sequence_length.
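The resolution behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not Ludwig's actual internals; the function name and signature are hypothetical.

```python
from typing import Optional


def resolve_sequence_length(
    sequence_length: Optional[int],
    max_sequence_length: int,
    dataset_max_len: int,
) -> int:
    """Resolve the effective sequence length for a text/sequence feature.

    Hypothetical sketch of the behavior described in this PR; names are
    illustrative only.
    """
    if sequence_length is not None:
        if sequence_length > max_sequence_length:
            raise ValueError(
                f"sequence_length ({sequence_length}) exceeds "
                f"max_sequence_length ({max_sequence_length}). Set "
                "max_sequence_length to a value >= sequence_length."
            )
        # Shorter samples are padded; longer samples are truncated.
        return sequence_length
    # Default: infer from the dataset, capped at max_sequence_length.
    return min(dataset_max_len, max_sequence_length)
```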

@geoffreyangus geoffreyangus changed the title First draft of sequence length logic Implements sequence_length param Mar 7, 2023
@geoffreyangus geoffreyangus marked this pull request as ready for review March 7, 2023 22:18
Collaborator

@justinxzhao justinxzhao left a comment


It seems like a potentially awkward experience for a user to set sequence_length=300, get a config validation error, and then have to manually update a second parameter to max_sequence_length=300.

What do you think about going with a dynamic update instead? That would mean adding a function to the ModelConfig's post_init (example).
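The dynamic-update idea could look something like the following sketch. The class and field names here are simplified, hypothetical stand-ins for Ludwig's config classes, shown only to illustrate the post_init approach.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextPreprocessingConfig:
    # Hypothetical, simplified stand-in for a feature's preprocessing config.
    sequence_length: Optional[int] = None
    max_sequence_length: int = 256


@dataclass
class ModelConfig:
    # Hypothetical stand-in for Ludwig's ModelConfig.
    preprocessing: TextPreprocessingConfig = field(
        default_factory=TextPreprocessingConfig
    )

    def __post_init__(self):
        # Dynamic update: instead of failing validation, raise
        # max_sequence_length to accommodate an explicit sequence_length.
        p = self.preprocessing
        if p.sequence_length is not None and p.sequence_length > p.max_sequence_length:
            p.max_sequence_length = p.sequence_length
```

With this hook, setting sequence_length=300 would transparently bump max_sequence_length to 300 instead of producing a validation error.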

@geoffreyangus
Collaborator Author

Good point @justinxzhao – updated the PR!

@github-actions

github-actions bot commented Mar 8, 2023

Unit Test Results

6 files ±0 · 6 suites ±0 · 7h 3m 16s ⏱️ +52m 2s
4,062 tests +11 · 4,019 ✔️ +11 · 43 💤 ±0 · 0 ±0
12,207 runs +33 · 12,072 ✔️ +33 · 135 💤 ±0 · 0 ±0

Results for commit 7e97726. ± Comparison against base commit 32d305e.

♻️ This comment has been updated with latest results.

Collaborator

@justinxzhao justinxzhao left a comment


Nice!

@w4nderlust
Collaborator

w4nderlust commented Mar 8, 2023

The logic looks fine to me.
One thing that I don't see (please point me to it if I just missed it) is the actual solution of the original issue.
In my understanding this was needed because of tied encoders, and:

  1. I'd like to understand how this solves it. I believe it boils down to the user explicitly setting the same sequence_length for all tied features, but maybe I'm wrong.
  2. I see no test that actually verifies that this solution solves that problem.

We also need to specify the expected user behavior in a couple of places, for instance catching the error when tied is set, and adding to the docs description of tied something like "when using tied with text/sequence/... features you should make sure the sequence length is the same by setting the sequence_length parameter", or something along those lines.

@geoffreyangus
Collaborator Author

geoffreyangus commented Mar 9, 2023

@w4nderlust, thanks for the comments! I've added a new test that confirms this fix addresses the GitHub issue, and updated the schema so that the description of tied recommends setting sequence_length when using sequence and text features with a sequence combiner.

The unit test that validates the explain workflow is still in progress, since there seem to be nested errors having to do with tied weights. Using sequence_length resolves one of the errors but not the next one, so some more investigation is needed there. Let me know if you have any other thoughts here!

@w4nderlust
Collaborator

@tgaddair @justinxzhao FYI

Added a couple of minor comments. One thing I haven't checked, though, is backwards compatibility, so be mindful to check whether this change makes models trained with the old max_sequence_length parameter obsolete.

Overall this looks good for now, but I still believe that this could be solved differently, and potentially better, in the future. Writing here so that there's a record of it (maybe we can create an issue somewhere to capture it too).

A couple options:

  1. Creating another instance of the same encoder with all the same parameters except sequence length, and then manually setting its weights to point to the weights of the tied encoder (the weight shapes themselves shouldn't depend on the sequence length).
  2. Adding a global preprocessing parameter like concatenate_text_features that, when multiple text/sequence features are specified, creates a derived column that concatenates them and uses that (with a single encoder) instead of the original ones. It's arguable whether the default should be true or false. In most cases it would be simpler and cleaner to do this before Ludwig, but it would still solve the current issue.

Both options could complement the current solution rather than replace it. Moreover, I like the idea of an explicit sequence length parameter, as it's similar to what we do for image height and width, so even if we were to implement 1 and 2 in the future, this work is still valuable and still holds.
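Option 2 above could be sketched roughly as follows. The function name concatenate_text_features and the derived column name are hypothetical; a real implementation would live in Ludwig's preprocessing and operate on the dataset backend rather than plain dicts.

```python
def concatenate_text_features(rows, text_columns, sep=" "):
    """Derive a single concatenated text column from multiple text features.

    Hypothetical sketch of the proposed concatenate_text_features
    preprocessing option; the single derived column would then be fed to
    one encoder instead of the original columns.
    """
    for row in rows:
        row["concatenated_text"] = sep.join(row[c] for c in text_columns)
    return rows
```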

@geoffreyangus
Collaborator Author

Sounds good, thanks @w4nderlust! Regarding backwards compatibility, we should be in the clear. The max_sequence_length parameter still defaults to 256, and sequence_length here defaults to None. Given those two values, the derived max_sequence_length will equal min(preprocessing_parameters['max_sequence_length'], max_len), where max_len is the maximum length of the feature in the dataset, which is the behavior we had before this change.

Will address your comments and merge after tests pass. Thanks!

@geoffreyangus geoffreyangus merged commit 75b1941 into master Mar 10, 2023
@geoffreyangus geoffreyangus deleted the sequence-length-param branch March 10, 2023 21:18