Analyzing different ML model comparison metrics
- AUROC_AUPRC: Matthew's original examination of beta distributions and their relationship to AUROC and AUPRC.
- 1_synthetic_experiments: Giovanni's extension of Matthew's work.
- Overleaf draft here
- Clean up the expressions for AUROC and AUPRC that involve taking expectations over p_+ of various quantities (candidate identities are sketched below).
- Verify the Au(log)ROC expression and explore whether it is useful.
- Fairness analyses: if there are two subgroups whose p_+ and p_- are each uniform distributions over intersecting ranges, in what settings can we say something about the fairness implications of choosing AUROC versus AUPRC? (A simulation sketch for this setting follows below.)
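For the expression clean-up above, one candidate clean form is the standard pair of identities below (a sketch assuming continuous scores with no ties; π is the positive prevalence and F_+, F_- are the CDFs of p_+ and p_-; to be checked against the Overleaf draft's definitions):

$$\mathrm{AUROC} = \Pr(S_+ > S_-) = \mathbb{E}_{t \sim p_+}\left[F_-(t)\right]$$

$$\mathrm{AUPRC} = \mathbb{E}_{t \sim p_+}\left[\mathrm{Prec}(t)\right], \qquad \mathrm{Prec}(t) = \frac{\pi\left(1 - F_+(t)\right)}{\pi\left(1 - F_+(t)\right) + (1-\pi)\left(1 - F_-(t)\right)}$$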
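For the two-subgroup uniform setting above, a minimal Monte Carlo sketch (the ranges, prevalences, and sample size are illustrative assumptions, not values from the draft) that compares AUROC and AUPRC per subgroup:

```python
# Sketch: per-subgroup AUROC/AUPRC when p_+ and p_- are uniform over
# intersecting ranges. Ranges, prevalences, and sample size are illustrative.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Per subgroup: score ranges for negatives/positives and positive prevalence (assumptions).
subgroups = {
    "A": dict(neg=(0.0, 0.6), pos=(0.4, 1.0), prevalence=0.30),
    "B": dict(neg=(0.0, 0.7), pos=(0.3, 1.0), prevalence=0.05),
}

n = 100_000
for name, cfg in subgroups.items():
    y = rng.random(n) < cfg["prevalence"]
    scores = np.where(
        y,
        rng.uniform(*cfg["pos"], size=n),
        rng.uniform(*cfg["neg"], size=n),
    )
    print(
        f"subgroup {name}: "
        f"AUROC={roc_auc_score(y, scores):.3f}, "
        f"AUPRC={average_precision_score(y, scores):.3f}"
    )
```

Sweeping the amount of overlap between the positive and negative ranges per subgroup would then show in which regimes the two metrics rank the subgroups (or models) differently.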
- Specify model configurations
- Specify metrics
- AUROC
- AUPRC
- Brier Score
- Best-F1
- Best-Accuracy
- Best-Precision
- Best-Recall
- Best-Sensitivity
- Best-Specificity
- Total expected deployment cost under a uniform sampling of independent FP, FN, TP, and TN cost/benefit ratios.
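How this cost metric is operationalized is not pinned down above; a minimal sketch under explicit assumptions (error costs and correct-decision benefits drawn i.i.d. from U(0, 1), with the cost-minimizing threshold picked on the dev set for each draw) could look like:

```python
# Sketch of the expected-deployment-cost metric: for each draw of independent
# FP/FN/TP/TN cost/benefit ratios (uniform prior is an assumption), pick the
# cost-minimizing threshold on dev, then record the realized cost on test.
import numpy as np


def cost_at_threshold(y, s, t, c_fp, c_fn, c_tp, c_tn):
    y = np.asarray(y, dtype=bool)
    pred = np.asarray(s) >= t
    return (
        c_fp * np.mean(pred & ~y)
        + c_fn * np.mean(~pred & y)
        + c_tp * np.mean(pred & y)
        + c_tn * np.mean(~pred & ~y)
    )


def expected_deployment_cost(y_dev, s_dev, y_test, s_test, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    thresholds = np.unique(s_dev)
    total = 0.0
    for _ in range(n_draws):
        # Positive costs for errors, negative costs (benefits) for correct calls: an assumption.
        c_fp, c_fn = rng.uniform(0.0, 1.0, size=2)
        c_tp, c_tn = -rng.uniform(0.0, 1.0, size=2)
        dev_costs = [cost_at_threshold(y_dev, s_dev, t, c_fp, c_fn, c_tp, c_tn) for t in thresholds]
        t_star = thresholds[int(np.argmin(dev_costs))]
        total += cost_at_threshold(y_test, s_test, t_star, c_fp, c_fn, c_tp, c_tn)
    return total / n_draws
```

Note on the per-draw threshold choice: if the threshold were fixed across draws, averaging over symmetric uniform costs would reduce to a fixed linear function of the confusion matrix, so letting each cost draw pick its own dev threshold is what makes the metric informative.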
- Specify subgroup analysis
- N subgroups
- Subgroup prevalences
- Positive label prevalences per subgroup
- Specify metrics
- For each metric and setup above: theoretically computed metric values (independent of dev and test set sizes).
- Goal: empirically computed metrics, with variances, over varying test and dev set sizes (the dev set is only used to pick thresholds for threshold-dependent metrics).
- For each combination of subgroup prevalences S, per-subgroup positive label prevalences R, test dataset size N, dev dataset size M, and model selection under decision rule Q... (a skeleton of this loop is sketched below)
- Expected true deployment cost of that metric under cost/benefit ratios V for each subgroup.
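A skeleton of this empirical loop (a sketch: the placeholder Beta score model, the best-F1 rule standing in for decision rule Q, and the metric set are illustrative assumptions):

```python
# Skeleton of the empirical protocol: for one grid cell (S, R, N, M), repeatedly
# sample dev/test splits, pick thresholds on dev only, and estimate per-subgroup
# metric means and variances on test. Score model and metric set are placeholders.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score


def sample_split(rng, size, subgroup_prevs, pos_prevs):
    """Draw subgroup ids, labels, and scores for one split (placeholder Beta score model)."""
    g = rng.choice(len(subgroup_prevs), size=size, p=subgroup_prevs)
    y = rng.random(size) < np.asarray(pos_prevs)[g]
    scores = np.where(y, rng.beta(3, 2, size), rng.beta(2, 3, size))
    return g, y, scores


def best_f1_threshold(y_dev, s_dev):
    """Threshold maximizing F1 on dev (simple sweep over observed dev scores)."""
    candidates = np.unique(s_dev)
    return candidates[np.argmax([f1_score(y_dev, s_dev >= t) for t in candidates])]


def run_cell(rng, S, R, N, M, n_repeats=20):
    """One (S, R, N, M) grid cell: per-subgroup metric means and variances over repeats."""
    rows = []
    for _ in range(n_repeats):
        _, y_dev, s_dev = sample_split(rng, M, S, R)
        g_te, y_te, s_te = sample_split(rng, N, S, R)
        t_star = best_f1_threshold(y_dev, s_dev)  # dev is used only to pick thresholds
        row = {}
        for k in range(len(S)):
            m = g_te == k  # assumes both classes appear in every subgroup slice
            row[f"auroc_g{k}"] = roc_auc_score(y_te[m], s_te[m])
            row[f"auprc_g{k}"] = average_precision_score(y_te[m], s_te[m])
            row[f"best_f1_g{k}"] = f1_score(y_te[m], s_te[m] >= t_star)
        rows.append(row)
    return {k: (np.mean([r[k] for r in rows]), np.var([r[k] for r in rows])) for k in rows[0]}
```

Looping run_cell over the (S, R, N, M) grid gives the empirical counterparts to the theoretical values above; the decision rule Q and the cost/benefit ratios V would slot in where the best-F1 rule and the F1-at-threshold metric are used here.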
- Need to do a mini literature review of studies claiming AUPRC is better for imbalanced datasets.
- See Figma sketch here.
- Basic integration of Matthew's code into the dashboard is done.
- To rediscuss
- Run SubPopBench, but integrate the extra metrics above and compute them per subgroup.
- CheXclusion