Practical Machine Learning: Required Model Accuracy for Course Project

As students complete the course project for Practical Machine Learning, they tend to raise questions about the accuracy required to correctly predict all 20 cases in the test data set.

Going back to the probability theory concepts covered in Statistical Inference, assume each observation in the test data set is independent of the others. If a represents the accuracy of a machine learning model, that is, the probability that it predicts any single case correctly, then the probability of correctly predicting 20 out of 20 test cases is a^20, because the joint probability of independent events equals the product of their individual probabilities.
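The a^20 rule can be checked empirically with a short simulation. The sketch below is illustrative only: the seed, the number of trials, and the variable names are assumptions chosen for this example, and the accuracy of 0.99 is just one row of the table. It draws many simulated 20-case test sets and compares the share in which all 20 predictions are correct with the theoretical value.

```r
# Monte Carlo check of the a^20 rule (seed and trial count are arbitrary choices)
set.seed(1234)
a <- 0.99          # assumed accuracy on any single test case
nCases <- 20       # number of cases in the course project test set
nTrials <- 100000  # number of simulated 20-case test sets

# each draw is the number of correct predictions out of 20 independent cases
correct <- rbinom(nTrials, size = nCases, prob = a)

simulated <- mean(correct == nCases)  # share of trials with all 20 correct
theoretical <- a^nCases               # product of 20 independent probabilities

round(c(simulated = simulated, theoretical = theoretical), 4)
```

With a large number of trials the simulated share should land close to the theoretical value, which is what the independence assumption predicts.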

The following table shows the probability of correctly predicting all 20 test cases for a given model accuracy.



| Model Accuracy | Probability of Predicting 20 out of 20 Correctly |
|---:|---:|
| 0.800 | 0.0115 |
| 0.850 | 0.0388 |
| 0.900 | 0.1216 |
| 0.950 | 0.3585 |
| 0.990 | 0.8179 |
| 0.991 | 0.8346 |
| 0.992 | 0.8516 |
| 0.993 | 0.8689 |
| 0.994 | 0.8866 |
| 0.995 | 0.9046 |

Bottom Line: Submit your test cases for grading only after you've achieved a model accuracy of at least .99 on the training data set.

Appendix: Accuracy Required for 95% Confidence Across 20 Tests

In January 2018 a student posted an issue on my GitHub site, suggesting that a better way to calculate the required accuracy would be to use the formula (1 - .05)^(1/20). This approach applies the concept of familywise error rates across multiple comparisons of means, covered in the week 4 lectures of the Statistical Inference course. This specific calculation is known as the Šidák correction for multiple tests.

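When we compare the two approaches we find that they are the same calculation run in opposite directions: solving a^20 = 0.95 for a gives a = 0.95^(1/20), so the results agree up to rounding. To have 95% confidence that all 20 predictions will be correct, each individual prediction needs an accuracy of about .9974386, as illustrated below.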

  > mdlAccuracy <- c(.8,.85,.9,.95,.99,.995,.996,.997,.9974,0.9974386,.9975)
  > predAccuracy <- mdlAccuracy^20
  > data.frame(mdlAccuracy,predAccuracy)
     mdlAccuracy predAccuracy
  1    0.8000000   0.01152922
  2    0.8500000   0.03875953
  3    0.9000000   0.12157665
  4    0.9500000   0.35848592
  5    0.9900000   0.81790694
  6    0.9950000   0.90461048
  7    0.9960000   0.92296826
  8    0.9970000   0.94167961
  9    0.9974000   0.94926458
  10   0.9974386   0.94999960
  11   0.9975000   0.95116988
  >
  > # alternate approach: Šidák's correction of multiple tests
  > # generate 95% confidence familywise accuracy needed across 20 tests
  > (1 - .05)^(1/20)
  [1] 0.9974386
  >
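The same calculation generalizes to other confidence levels and test set sizes. The following is a minimal sketch, assuming independent test cases as above; `requiredAccuracy` is a hypothetical helper name introduced for this example, not part of the original transcript.

```r
# hypothetical helper: accuracy needed on each case so that all n independent
# predictions are correct with probability `confidence`
requiredAccuracy <- function(confidence = 0.95, n = 20) {
  stopifnot(confidence > 0, confidence < 1, n >= 1)
  confidence^(1 / n)
}

requiredAccuracy()               # same calculation as (1 - .05)^(1/20)
requiredAccuracy(0.95, 20)^20    # raising to the 20th power recovers 0.95
requiredAccuracy(0.90, 20)       # a looser target needs a lower accuracy
```

Because the function simply inverts a^n, raising its result to the nth power returns the requested confidence level, which is a quick way to sanity-check any value it produces.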