Employ a fallback str2bool mapping from the feature column's distinct values when the feature's values aren't boolean-like. #1469
In hyperopt, we infer a `BINARY` feature type if there are two distinct values. Currently, we use a rather limited whitelist to automatically map string values to booleans. This means that any binary feature column whose values aren't explicitly in our whitelist (high/low, good/bad, human/bot) would be mapped entirely to `False`.

This is the culprit behind the oddly perfect training curves, 0% loss, and 100% accuracy seen on multiple datasets on `staging` and `king` (see slack threads 1, 2). The `training_set_metadata` also reveals this issue, e.g. twitterbots:

The proposed solution is to use a fallback str2bool mapping derived from the feature column's distinct values when the feature's values aren't boolean-like, using the first distinct value (in alphabetical order) as the value for `True`. This appears to fix the training loss curves and accuracy metrics (no longer 100%) 👍
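The fallback described above can be sketched as follows. This is an illustrative sketch, not the actual Ludwig implementation; the whitelist contents and function names are assumptions for the example.

```python
# Illustrative boolean-like whitelist; the real one in Ludwig may differ.
BOOL_LIKE = {
    "true": True, "false": False,
    "yes": True, "no": False,
    "1": True, "0": False,
}

def build_str2bool(distinct_values):
    """Map a binary column's two distinct values to booleans.

    If both values are boolean-like, use the whitelist; otherwise fall
    back to a mapping derived from the values themselves, with the first
    distinct value in alphabetical order treated as True.
    """
    lowered = [str(v).lower() for v in distinct_values]
    if all(v in BOOL_LIKE for v in lowered):
        return {v: BOOL_LIKE[str(v).lower()] for v in distinct_values}
    # Fallback: alphabetical order, first value maps to True.
    first, second = sorted(distinct_values)
    return {first: True, second: False}

# Previously, neither "human" nor "bot" matched the whitelist, so both
# collapsed to False; with the fallback they map to distinct booleans.
print(build_str2bool(["human", "bot"]))  # {'bot': True, 'human': False}
```

With this fallback, a high/low or human/bot column no longer degenerates into an all-`False` target, which is what produced the trivially perfect metrics.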
There may still be some performance differences between representing the feature as a binary vs. category, which also use slightly different metrics for loss (BWCE vs. SoftmaxCrossEntropy) and accuracy (CategoryAccuracy vs. Accuracy). Binary could be an honest description of a feature with only two distinct possible values, but category may be more semantically correct for what the feature is supposed to represent. It’s also not impossible that some specific configuration of the model performs better or produces more useful metrics with the output feature represented as a binary, category, or textual representation.
We improve the default preprocessing behavior for binary features, but we still leave this configuration choice up to the user, who can set `preprocessing.fallback_true_label` to explicitly specify which label to treat as true.
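As a sketch of how the explicit override could interact with the alphabetical default (the function name and keyword are illustrative, not Ludwig's exact internal API):

```python
def build_str2bool(distinct_values, fallback_true_label=None):
    """Derive a str2bool mapping for a two-valued column.

    If fallback_true_label is given, that value maps to True; otherwise
    the first distinct value in alphabetical order is treated as True.
    """
    if fallback_true_label is not None:
        true_label = fallback_true_label
    else:
        # Default behavior proposed in this PR: alphabetical order.
        true_label = sorted(distinct_values)[0]
    return {v: (v == true_label) for v in distinct_values}

# With preprocessing.fallback_true_label set to "human", the user
# overrides the alphabetical default ("bot" would otherwise be True).
print(build_str2bool(["human", "bot"], fallback_true_label="human"))
```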