Code for How good is 85%? A survey tool to connect classifier evaluation to acceptability of accuracy
Matthew Kay, Shwetak N. Patel, and Julie A. Kientz
This repository contains analysis code from:
Kay, Matthew, Patel, Shwetak N., and Kientz, Julie A. How Good is 85%? A Survey Tool to Connect Classifier Evaluation to Acceptability of Accuracy. CHI 2015 (upcoming).
It is intended to allow others to adopt our tool for generating surveys and modelling acceptability of accuracy. It is currently a work-in-progress. If you have any questions, please email Matthew Kay (above). Also, if you've done something cool with this work, we'd love to hear from you!
Plus some helper functions in util.R:
Example code for fitting a model can be found in src/application_ui-regression.R. Since fitting the model takes some time, for the purposes of this example we will just load the fitted model, which has been saved in src/output/acceptability_ui-model-small-final.RData.
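A minimal sketch of that loading step, assuming the working directory is the repository root and that the .RData file holds the best_model_chain object used in the rest of this example. The file.exists guard and the fallback message are illustrative additions, not part of the original code:

```r
# Path taken from the text above; assumes the working directory is the repository root.
model_path = "src/output/acceptability_ui-model-small-final.RData"

if (file.exists(model_path)) {
    # load() restores the saved objects into the workspace and returns their names
    loaded_names = load(model_path)
    print(loaded_names)
} else {
    # Fall back gracefully when the saved model is not available
    loaded_names = character(0)
    message("Fitted model not found at ", model_path,
        "; fit it first with src/application_ui-regression.R")
}
```

Printing the names returned by load() is a quick way to check which objects the file actually contains before using them.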
Let's plot some posterior parameter estimates. First, we extract the sample estimates for b0, b, and alpha:
params = extract_samples(best_model_chain, cbind(b0, b, alpha)[application])
This gives us a pretty simple table of estimates. We can look at the first couple of entries to get an idea of its structure:
.sample | application | b0 | b | alpha |
1 | alarm_police | -15.65 | 18.49 | 0.5639 |
2 | alarm_police | -16.58 | 20.07 | 0.5377 |
3 | alarm_police | -17.05 | 19.83 | 0.4827 |
4 | alarm_police | -16.86 | 19.22 | 0.5002 |
5 | alarm_police | -16.2 | 18.64 | 0.4509 |
6 | alarm_police | -16.18 | 17.97 | 0.5414 |
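To make the long format concrete, here is a small base R sketch that rebuilds the six rows shown above and averages each parameter by application. The names params_head and param_means are hypothetical, and six samples are far too few for real inference; the same call should work on the full params table, assuming it is an ordinary data frame with these columns.

```r
# First six posterior samples for alarm_police, copied from the table above
params_head = data.frame(
    .sample = 1:6,
    application = "alarm_police",
    b0 = c(-15.65, -16.58, -17.05, -16.86, -16.20, -16.18),
    b = c(18.49, 20.07, 19.83, 19.22, 18.64, 17.97),
    alpha = c(0.5639, 0.5377, 0.4827, 0.5002, 0.4509, 0.5414)
)

# Mean of each parameter within each application (one row per application)
param_means = aggregate(cbind(b0, b, alpha) ~ application, data = params_head, FUN = mean)
print(param_means)
```

Because every row carries its application label, grouping by application needs no reshaping, which is what makes this format convenient for per-application plots.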
Now we'll plot each parameter in turn, with some useful reference lines:
ggposterior(params, aes(x=application, y=b)) +
    geom_hline(yintercept=0, lty="dashed")

ggposterior(params, aes(x=application, y=b0)) +
    geom_hline(yintercept=0, lty="dashed")

ggposterior(params, aes(x=application, y=alpha)) +
    geom_hline(yintercept=0.5, lty="dashed") +
    ylim(0, 1)
We can also examine the posterior difference in alpha between each condition. First we extract the samples for alpha, this time in a wide format to facilitate comparison:
alpha = extract_samples(best_model_chain, alpha[application] | application)
Again, let's see the first couple of entries to get an idea of its structure:
.sample | alarm_police | alarm_text_message | electricity | location |
1 | 0.5639 | 0.2988 | 0.5119 | 0.3859 |
2 | 0.5377 | 0.3567 | 0.4585 | 0.577 |
3 | 0.4827 | 0.3135 | 0.4513 | 0.4802 |
4 | 0.5002 | 0.3772 | 0.4969 | 0.5148 |
5 | 0.4509 | 0.2896 | 0.4851 | 0.5001 |
6 | 0.5414 | 0.3805 | 0.5025 | 0.5161 |
Now, for every pair of applications, let's get the posterior distribution of their difference in alpha:
alpha_comparisons = ldply(combn(levels(df$application), 2, simplify=FALSE),
    function(applications) {
        # one row per posterior sample: the pair label and the difference in alpha
        data.frame(
            applications = paste(applications, collapse=" - "),
            alpha_difference = alpha[[applications[[1]]]] - alpha[[applications[[2]]]]
        )
    })
Which looks like this:
applications | alpha_difference |
alarm_police - alarm_text_message | 0.2651 |
alarm_police - alarm_text_message | 0.181 |
alarm_police - alarm_text_message | 0.1692 |
alarm_police - alarm_text_message | 0.123 |
alarm_police - alarm_text_message | 0.1614 |
alarm_police - alarm_text_message | 0.1609 |
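The same pairwise comparison can be sketched in base R, without plyr, using only the six wide-format rows shown earlier. This is an illustrative variant rather than the repository's own code: alpha_head, application_pairs, and alpha_comparisons_head are hypothetical names, and do.call(rbind, lapply(...)) stands in for ldply.

```r
# First six posterior samples of alpha in wide format, copied from the table above
alpha_head = data.frame(
    alarm_police = c(0.5639, 0.5377, 0.4827, 0.5002, 0.4509, 0.5414),
    alarm_text_message = c(0.2988, 0.3567, 0.3135, 0.3772, 0.2896, 0.3805),
    electricity = c(0.5119, 0.4585, 0.4513, 0.4969, 0.4851, 0.5025),
    location = c(0.3859, 0.5770, 0.4802, 0.5148, 0.5001, 0.5161)
)

# Every unordered pair of application names
application_pairs = combn(names(alpha_head), 2, simplify = FALSE)

# One data frame of sample-wise differences per pair, stacked into a single long table
alpha_comparisons_head = do.call(rbind, lapply(application_pairs, function(pair) {
    data.frame(
        applications = paste(pair, collapse = " - "),
        alpha_difference = alpha_head[[pair[[1]]]] - alpha_head[[pair[[2]]]]
    )
}))

head(alpha_comparisons_head)
```

Subtracting within each sample, rather than subtracting summary statistics, is what makes the result a posterior distribution of the difference. Values computed from the rounded table entries can differ from the table above in the last digit.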
Finally, we can plot the estimated differences:
ggposterior(alpha_comparisons, aes(x=applications, y=alpha_difference)) +
    geom_hline(yintercept=0, lty="dashed") +
    ylim(-0.5, 0.5)
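As a numeric companion to the plot, the differences can also be summarized directly. The sketch below uses a hypothetical helper, summarize_difference, on the six differences listed above for alarm_police - alarm_text_message; with so few draws the interval is only illustrative, and in practice it would be applied to all samples for each pair.

```r
# Six posterior draws of the alpha difference, copied from the table above
alpha_difference = c(0.2651, 0.1810, 0.1692, 0.1230, 0.1614, 0.1609)

# Hypothetical helper: median, central 95% quantile interval,
# and the share of draws in which the difference is above zero
summarize_difference = function(x) {
    c(
        median = median(x),
        lower = unname(quantile(x, 0.025)),
        upper = unname(quantile(x, 0.975)),
        p_greater_than_0 = mean(x > 0)
    )
}

difference_summary = summarize_difference(alpha_difference)
print(difference_summary)
```

The share of draws above zero corresponds to how much of the plotted distribution lies on one side of the dashed reference line at 0.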
Please cite the CHI paper above.
Should you encounter any issues with this code, contact Matthew Kay. If you have found a bug, please file a bug report with minimal code to reproduce the issue.