
Implements sequence_length param #3221

Merged
merged 17 commits into from
Mar 10, 2023
Conversation

geoffreyangus
Collaborator

@geoffreyangus geoffreyangus commented Mar 7, 2023

This PR implements a new config param for text and sequence features, sequence_length, a nullable positive integer.

sequence_length defaults to None. This means that the sequence length for a given feature will be inferred from the dataset. The inferred sequence length will be capped at max_sequence_length, which defaults to 256.

If sequence_length is not None, then the sequence length for a given feature will be the specified value, and samples will be padded or truncated as needed. If the specified sequence_length is greater than max_sequence_length, then the experiment will fail fast with an error message recommending that you set max_sequence_length to a value greater than or equal to sequence_length.
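The resolution behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not Ludwig's actual internals; the function name and signature are hypothetical.

```python
from typing import Optional


def resolve_sequence_length(
    sequence_length: Optional[int],
    max_sequence_length: int,
    dataset_max_len: int,
) -> int:
    """Resolve the effective sequence length for a text/sequence feature.

    Hypothetical sketch of the behavior described in this PR; names are
    illustrative only.
    """
    if sequence_length is not None:
        if sequence_length > max_sequence_length:
            raise ValueError(
                f"sequence_length ({sequence_length}) exceeds "
                f"max_sequence_length ({max_sequence_length}). Set "
                "max_sequence_length to a value >= sequence_length."
            )
        # Shorter samples are padded; longer samples are truncated.
        return sequence_length
    # Default: infer from the dataset, capped at max_sequence_length.
    return min(dataset_max_len, max_sequence_length)
```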

@geoffreyangus geoffreyangus changed the title First draft of sequence length logic Implements sequence_length param Mar 7, 2023
@geoffreyangus geoffreyangus marked this pull request as ready for review March 7, 2023 22:18
Collaborator

@justinxzhao justinxzhao left a comment


It seems like a potentially awkward experience for a user to set sequence_length=300, get a config validation error, and then have to manually update a second parameter to max_sequence_length=300.

What do you think about going with a dynamic update instead? That would mean adding a function to the ModelConfig's post_init (example).
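The dynamic-update idea could look something like the following sketch. The class and field names here are simplified, hypothetical stand-ins for Ludwig's config classes, shown only to illustrate the post_init approach.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextPreprocessingConfig:
    # Hypothetical, simplified stand-in for a feature's preprocessing config.
    sequence_length: Optional[int] = None
    max_sequence_length: int = 256


@dataclass
class ModelConfig:
    # Hypothetical stand-in for Ludwig's ModelConfig.
    preprocessing: TextPreprocessingConfig = field(
        default_factory=TextPreprocessingConfig
    )

    def __post_init__(self):
        # Dynamic update: instead of failing validation, raise
        # max_sequence_length to accommodate an explicit sequence_length.
        p = self.preprocessing
        if p.sequence_length is not None and p.sequence_length > p.max_sequence_length:
            p.max_sequence_length = p.sequence_length
```

With this hook, setting sequence_length=300 would transparently bump max_sequence_length to 300 instead of producing a validation error.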

@geoffreyangus
Collaborator Author

Good point @justinxzhao – updated the PR!

@github-actions

github-actions bot commented Mar 8, 2023

Unit Test Results

6 files ±0 · 6 suites ±0 · 7h 3m 16s ⏱️ +52m 2s
4,062 tests +11 · 4,019 ✔️ +11 · 43 💤 ±0 · 0 ±0
12,207 runs +33 · 12,072 ✔️ +33 · 135 💤 ±0 · 0 ±0

Results for commit 7e97726. ± Comparison against base commit 32d305e.

♻️ This comment has been updated with latest results.

Collaborator

@justinxzhao justinxzhao left a comment


Nice!

@w4nderlust
Collaborator

w4nderlust commented Mar 8, 2023

The logic looks fine to me.
One thing that I don't see (please point me to it if I just missed it) is the actual solution of the original issue.
In my understanding this was needed because of tied encoders, and:

  1. I'd like to understand how this solves it. I believe it boils down to the user explicitly setting the same sequence_length for all tied features, but maybe I'm wrong.
  2. I see no test that actually verifies that this solution solves that problem.

We also need to specify the expected user behavior in a couple of places, for instance catching the error when tied is set, and adding to the docs description of tied something like "when using tied with text/sequence/... features you should make sure the sequence length is the same by setting the sequence_length parameter", or something along those lines.

@geoffreyangus
Collaborator Author

geoffreyangus commented Mar 9, 2023

@w4nderlust, thanks for the comments! I've added a new test that confirms this fix addresses the GitHub issue, and updated the schema so that the description of tied recommends setting sequence_length when using sequence and text features with a sequence combiner.

The unit test that validates the explain workflow is still in progress, since there seem to be nested errors having to do with tied weights. Using sequence_length resolves one of the errors but not the next one, so some more investigation is needed there. Let me know if you have any other thoughts here!

@w4nderlust
Collaborator

@tgaddair @justinxzhao FYI

Added a couple of minor comments. One thing I haven't checked, though, is backwards compatibility, so be mindful to check whether this change makes models trained with the old max_sequence_length parameter obsolete.

Overall this looks good for now, but I still believe that this could be solved differently, and potentially better, in the future. Writing here so that there's a record of it (maybe we can create an issue somewhere to capture it too).

A couple options:

  1. Creating another instance of the same encoder with all the same parameters except sequence length, and then manually setting its weights to point to the weights of the tied encoder (the weight shapes themselves shouldn't depend on the sequence length).
  2. Adding a global preprocessing parameter like concatenate_text_features that, when multiple text/sequence features are specified, creates a derived column that concatenates them and uses that (with a single encoder) instead of the original ones. It's arguable whether the default should be true or false. In most cases it would be simpler and cleaner to do this before Ludwig, but it would still solve the current issue.

Both options could complement the current solution rather than replace it. Moreover, I like the idea of an explicit sequence length parameter, as it's similar to what we do for image height and width, so even if we were to implement 1 and 2 in the future, this work is still valuable and still holds.
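Option 2 above could be sketched roughly as follows. The function name concatenate_text_features and the derived column name are hypothetical; a real implementation would live in Ludwig's preprocessing and operate on the dataset backend rather than plain dicts.

```python
def concatenate_text_features(rows, text_columns, sep=" "):
    """Derive a single concatenated text column from multiple text features.

    Hypothetical sketch of the proposed concatenate_text_features
    preprocessing option; the single derived column would then be fed to
    one encoder instead of the original columns.
    """
    for row in rows:
        row["concatenated_text"] = sep.join(row[c] for c in text_columns)
    return rows
```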

@geoffreyangus
Collaborator Author

Sounds good, thanks @w4nderlust! Regarding backwards compatibility, we should be in the clear. The max_sequence_length parameter still defaults to 256, and sequence_length here defaults to None. Given those two values, the derived max_sequence_length will equal min(preprocessing_parameters['max_sequence_length'], max_len), where max_len is the maximum length of the feature in the dataset, which is the behavior we had before this change.

Will address your comments and merge after tests pass. Thanks!

@geoffreyangus geoffreyangus merged commit 75b1941 into master Mar 10, 2023
@geoffreyangus geoffreyangus deleted the sequence-length-param branch March 10, 2023 21:18