Employ a fallback str2bool mapping from the feature column's distinct values when the feature's values aren't boolean-like. #1469
In hyperopt, we infer a `BINARY` feature type if there are two distinct values. Currently, we use a rather limited whitelist to automatically map string values to booleans. This means that any binary feature column whose values aren't explicitly in our whitelist (high/low, good/bad, human/bot) would be mapped entirely to `False`.

This is the culprit behind the oddly perfect training curves, 0% loss, and 100% accuracy seen on multiple datasets on `staging` and `king` (see slack threads 1, 2). The `training_set_metadata` also reveals this issue, e.g. twitterbots:

The proposed solution is to use a fallback str2bool mapping derived from the feature column's distinct values when the feature's values aren't boolean-like, using the first distinct value (in alphabetical order) as the value for `True`. This appears to fix the training loss curves and accuracy metrics (no longer 100%) 👍
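The fallback described above can be sketched as follows. This is an illustrative sketch, not the actual Ludwig implementation; the whitelist contents and function names are assumptions for the example.

```python
# Illustrative boolean-like whitelist; the real one in Ludwig may differ.
BOOL_LIKE = {
    "true": True, "false": False,
    "yes": True, "no": False,
    "1": True, "0": False,
}

def build_str2bool(distinct_values):
    """Map a binary column's two distinct values to booleans.

    If both values are boolean-like, use the whitelist; otherwise fall
    back to a mapping derived from the values themselves, with the first
    distinct value in alphabetical order treated as True.
    """
    lowered = [str(v).lower() for v in distinct_values]
    if all(v in BOOL_LIKE for v in lowered):
        return {v: BOOL_LIKE[str(v).lower()] for v in distinct_values}
    # Fallback: alphabetical order, first value maps to True.
    first, second = sorted(distinct_values)
    return {first: True, second: False}

# Previously, neither "human" nor "bot" matched the whitelist, so both
# collapsed to False; with the fallback they map to distinct booleans.
print(build_str2bool(["human", "bot"]))  # {'bot': True, 'human': False}
```

With this fallback, a high/low or human/bot column no longer degenerates into an all-`False` target, which is what produced the trivially perfect metrics.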
There may still be some performance differences between representing the feature as a binary vs. category, which also use slightly different metrics for loss (BWCE vs. SoftmaxCrossEntropy) and accuracy (CategoryAccuracy vs. Accuracy). Binary could be an honest description of a feature with only two distinct possible values, but category may be more semantically correct for what the feature is supposed to represent. It’s also not impossible that some specific configuration of the model performs better or produces more useful metrics with the output feature represented as a binary, category, or textual representation.
We improve the default preprocessing behavior for binary features, but we still leave this configuration choice up to the user, who can set `preprocessing.fallback_true_label` to explicitly specify which label to treat as true.
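As a sketch of how the explicit override could interact with the alphabetical default (the function name and keyword are illustrative, not Ludwig's exact internal API):

```python
def build_str2bool(distinct_values, fallback_true_label=None):
    """Derive a str2bool mapping for a two-valued column.

    If fallback_true_label is given, that value maps to True; otherwise
    the first distinct value in alphabetical order is treated as True.
    """
    if fallback_true_label is not None:
        true_label = fallback_true_label
    else:
        # Default behavior proposed in this PR: alphabetical order.
        true_label = sorted(distinct_values)[0]
    return {v: (v == true_label) for v in distinct_values}

# With preprocessing.fallback_true_label set to "human", the user
# overrides the alphabetical default ("bot" would otherwise be True).
print(build_str2bool(["human", "bot"], fallback_true_label="human"))
```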