
Add in-training tooling to find a more optimal threshold for binary classification. #2181

Open
justinxzhao opened this issue Jun 22, 2022 · 2 comments
Labels
feature New feature or request release-0.6 Feature to be implemented in v0.6

@justinxzhao
Collaborator

Ludwig uses a default threshold of 0.5 to calculate accuracy for binary classification problems. However, it's quite possible, especially for imbalanced datasets, that a threshold of 0.5 is not the best threshold to use.

The AUC measures the performance of a binary classifier averaged across all possible decision thresholds; the underlying ROC (or precision-recall) curve is commonly used to select a threshold that strikes a better balance between precision and recall.
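For contrast with threshold-dependent metrics like accuracy, AUC itself is threshold-free. A minimal sketch of its rank-statistic formulation in plain numpy (an illustrative implementation, not Ludwig's; ties between scores are ignored for brevity):

```python
import numpy as np

def roc_auc(targets, probabilities):
    # Mann-Whitney formulation of ROC AUC: the probability that a randomly
    # chosen positive example is scored above a randomly chosen negative one.
    order = np.argsort(probabilities)
    ranks = np.empty(len(probabilities))
    ranks[order] = np.arange(1, len(probabilities) + 1)
    n_pos = np.sum(targets == 1)
    n_neg = np.sum(targets == 0)
    return (ranks[targets == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

targets = np.array([0, 0, 1, 1])
probabilities = np.array([0.1, 0.4, 0.35, 0.8])
score = roc_auc(targets, probabilities)  # 3 of 4 positive/negative pairs ranked correctly -> 0.75
```

No single threshold appears anywhere in this computation, which is exactly why a separate sweep is needed to pick one.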

One possible algorithmic outline, proposed by @geoffreyangus and @w4nderlust:

import numpy as np

def find_best_threshold(model, output_feature_name, dataset, metric, thresholds=np.arange(0.05, 1.0, 0.05)):
  probabilities = model.predict(dataset)[output_feature_name]['probabilities']
  scores = []
  for threshold in thresholds:
    preds = probabilities[:, 1] > threshold
    metric_score = metric(preds, targets)  # TODO: extract `targets` from `dataset`
    scores.append(metric_score)
  return thresholds[np.argmax(scores)]
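The same sweep can be exercised end-to-end without a Ludwig model. A self-contained sketch on synthetic data (the toy dataset and the plain-numpy F1 implementation are illustrative assumptions, not part of the proposal):

```python
import numpy as np

def f1(preds, targets):
    # Plain-numpy F1; any metric with this (preds, targets) signature works.
    tp = np.sum((preds == 1) & (targets == 1))
    fp = np.sum((preds == 1) & (targets == 0))
    fn = np.sum((preds == 0) & (targets == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sweep_thresholds(probabilities, targets, metric, thresholds=np.arange(0.05, 1.0, 0.05)):
    # `probabilities` holds positive-class probabilities, shape (n,).
    scores = [metric(probabilities > t, targets) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Imbalanced toy data (~10% positives) whose class scores separate around 0.4,
# so the F1-optimal threshold lands below the 0.5 default.
rng = np.random.default_rng(0)
targets = (rng.random(1000) < 0.1).astype(int)
probabilities = 0.35 * targets + 0.4 * rng.random(1000)
best_threshold, best_score = sweep_thresholds(probabilities, targets, f1)
```

Running the sweep on a validation split rather than the training data would avoid overfitting the chosen threshold.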

By default, the optimal threshold should be calculated at the end of the training phase.

It would also be useful to expose this as a standalone API.

@justinxzhao justinxzhao added feature New feature or request release-0.6 Feature to be implemented in v0.6 labels Jun 22, 2022
@justinxzhao justinxzhao added this to To do in AutoML Jun 23, 2022
@amholler
Collaborator

amholler commented Jun 23, 2022

An example that works on the current code is here:
https://github.com/ludwig-ai/experiments/blob/main/automl/heuristics/santander_customer_satisfaction/eval_util.py
with an example invocation here:
https://github.com/ludwig-ai/experiments/blob/main/automl/heuristics/santander_customer_satisfaction/train_tabnet_imbalance_ros.py

@justinxzhao
Collaborator Author

Largely a duplicate of #2158
