Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check fill_with_const has fill_value for binary features #3278

Merged
merged 6 commits into from
Mar 27, 2023

Conversation

abidwael
Copy link
Contributor

@abidwael abidwael commented Mar 21, 2023

fix for the following failure
RuntimeError: Caught exception during model preprocessing: Invalid missing value strategy fill_with_const
using the following config

{'defaults': {'binary': {'preprocessing': {'missing_value_strategy': 'fill_with_const'}}},
 'input_features': [{'name': 'binary_1', 'type': 'binary'}],
 'model_type': 'ecd',
 'output_features': [{'decoder': {'idx2str': ['lwrVkpk',
                                              'ztHj',
                                              'dQxBJ',
                                              'mLFhKPHrTn',
                                              'iBjnr',
                                              'Pb',
                                              'qyiL',
                                              'VEQv',
                                              'zLol',
                                              'SBpN'],
                                  'vocab_size': 10},
                      'name': 'category_output_1',
                      'type': 'category'}],
 'trainer': {'train_steps': 1}}

@github-actions
Copy link

github-actions bot commented Mar 21, 2023

Unit Test Results

       6 files  ±       0         6 suites  ±0   2h 1m 30s ⏱️ + 1h 15m 38s
2 651 tests +   167  2 615 ✔️ +   140  36 💤 +27  0 ±0 
2 679 runs   - 2 301  2 637 ✔️  - 2 321  42 💤 +20  0 ±0 

Results for commit ed97d10. ± Comparison against base commit 02afe73.

♻️ This comment has been updated with latest results.

@arnavgarg1
Copy link
Contributor

Have a couple of questions about this PR:

  1. Is there no way to do this during schema validation itself? I thought we were already doing this, but I can't seem to find the piece of code where we condition fill_value when using fill_with_const.
  2. When we don't have a fill_value, we usually compute a computed_fill_value that's used as a fallback, so it should not raise an error even if a value isn't provided. Is that not happening?

@arnavgarg1
Copy link
Contributor

Actually, I took a look and I see the problem. It's in this line here: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1393

The fill_value and precomputed_fill_value will both be None, which will cause errors. So, my best guess is that this isn't actually limited to binary features, but is true for all features.

But after another look, we seem to have a default value for fill_value for most of our features except the following, so we'd run into the same issue for them as well:

  1. Binary
  2. Vector
  3. Audio
  4. Date
  5. Image

and maybe timeseries?

It would be great to add a parametrized test for all of these features to validate that this is the case, then update your function to raise a ConfiValidationError if it happens for any of these features. If you're also willing to make the extra effort, it would be super neat to have a meaningful resolution method for each feature type (like what to do for image features, or a vector feature)

@arnavgarg1 arnavgarg1 self-requested a review March 21, 2023 07:02
@abidwael
Copy link
Contributor Author

Thanks for the deep dive @arnavgarg1!

The only case where we'd encounter the same error (RuntimeError: Caught exception during model preprocessing: Invalid missing value strategy fill_with_const) is when the default missing_value_strategy is FILL_WITH_CONST and the fill_value is None, which is only the case for binary. Other features have combinations of these two parameters (e.g. bag will have FILL_WITH_CONST and UNKNOWN_SYMBOL).

The boolean default missing value strategy is FILL_WITH_FALSE. I initially thought about removing FILL_WITH_CONST altogether and replace it with FILL_WITH_TRUE as we shouldn't be accepting any other const anyways. WDYT?

@arnavgarg1
Copy link
Contributor

Thanks for the deep dive @arnavgarg1!

The only case where we'd encounter the same error (RuntimeError: Caught exception during model preprocessing: Invalid missing value strategy fill_with_const) is when the default missing_value_strategy is FILL_WITH_CONST and the fill_value is None, which is only the case for binary. Other features have combinations of these two parameters (e.g. bag will have FILL_WITH_CONST and UNKNOWN_SYMBOL).

Ah yeah good call, I didn't factor in the default missing value strategy itself. However, I think this seems to also be true for:

  1. Vector: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/schema/features/preprocessing/vector.py
  2. Date: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/schema/features/preprocessing/date.py

since they also default to fill_with_const and fill_value is empty.

The boolean default missing value strategy is FILL_WITH_FALSE. I initially thought about removing FILL_WITH_CONST altogether and replace it with FILL_WITH_TRUE as we shouldn't be accepting any other const anyways. WDYT?

Hmm, it is an interesting proposition, and to be honest, I do think it makes sense for binary feature types to just support one of FILL_WITH_TRUE or FILL_WITH_FALSE and drop FILL_WITH_CONST.

@arnavgarg1
Copy link
Contributor

arnavgarg1 commented Mar 21, 2023

Quite frankly, I'd expand the scope of this aux validation check to make sure there is a non-empty fill_value for any feature when the missing value strategy is fill_with_const. It won't do much in most cases since we have sensible defaults for a majority of features (barring the 3 mentioned above), but it is also useful just in case the user overrides the default but doesn't actually set a value or something like that.

@abidwael
Copy link
Contributor Author

I got rid of FILL_WITH_CONST for binary features and added FILL_WITH_TRUE as another option.

The error also does not happen with date and vector features as their fill values are "" and not None. Here's the line with the condition that won't be satisfied which will eventually cause the ValueError.

For these features, the default is FILL_WITH_CONST and it doesn't seem to me like it's a good idea to default to any other missing value strategy. But I'm curious why we have "" as a default fill value @tgaddair @justinxzhao.

@abidwael abidwael requested a review from tgaddair March 21, 2023 10:10
@@ -322,6 +322,21 @@ def check_tagger_decoder_requirements(config: "ModelConfig") -> None: # noqa: F
)


@register_config_check
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. No longer needed if FILL_WITH_CONST is removed as an option.

@@ -17,7 +26,7 @@ class BinaryPreprocessingConfig(BasePreprocessingConfig):
"""BinaryPreprocessingConfig is a dataclass that configures the parameters used for a binary input feature."""

missing_value_strategy: str = schema_utils.StringOptions(
MISSING_VALUE_STRATEGY_OPTIONS + [FILL_WITH_FALSE],
[FILL_WITH_MODE, BFILL, FFILL, DROP_ROW, FILL_WITH_FALSE, FILL_WITH_TRUE],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This will break older models that have missing_value_strategy: FILL_WITH_CONST. However, 1) the risk seems low since this wasn't the default value and 2) these models would be broken anyway unless they had also specified fill_value.

@abidwael abidwael merged commit 1e9266c into master Mar 27, 2023
@abidwael abidwael deleted the fill-with-const-fill-value branch March 27, 2023 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants