I believe it was a design decision not to test or benchmark our examples. (If this is correct, please remind me.)
However, some examples produce an accuracy, and I don't believe the accuracy produced by the examples varies much across different systems or hardware.
I wonder if we should have a special Jenkins test that benchmarks specific examples only when there's been a change to the examples or to the parts of the API they use. I think a good check would be whether the accuracy is close enough to that of some stable commit.
For example, @vivianwhite has provided pull request #1022. Right now I am testing the examples before and after Vivian's PR. Should I just look at the accuracy of one run for each possible argument, or should I be doing multiple runs and checking whether the results are within error bars? If so, how many runs?
It makes sense to me to pick a stable commit, run it a reasonable number of times (3-5) over all possible arguments, and check whether the new commit's mean accuracy (also over 3-5 runs) is within error bars for all possible arguments.
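A minimal sketch of the comparison being proposed, assuming each commit's example runs just yield a list of accuracies (the function name, the `k` cutoff, and the numbers below are all hypothetical):

```python
import statistics

def within_error_bars(baseline_accs, candidate_accs, k=2.0):
    """Return True if the candidate commit's mean accuracy lies within
    k sample standard deviations of the stable baseline's mean."""
    base_mean = statistics.mean(baseline_accs)
    base_std = statistics.stdev(baseline_accs)  # sample std over the 3-5 runs
    cand_mean = statistics.mean(candidate_accs)
    return abs(cand_mean - base_mean) <= k * base_std

# Hypothetical accuracies from 5 runs of one example, per commit:
baseline = [0.912, 0.915, 0.910, 0.914, 0.911]
candidate = [0.913, 0.916, 0.909, 0.912, 0.914]
print(within_error_bars(baseline, candidate))  # True for these numbers
```

This would be evaluated once per (example, argument) combination; any combination falling outside the bars fails the check.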
What do we think?
> I wonder if we should have a special Jenkins test that benchmarks specific examples only when there's been a change to the examples or to the parts of the API they use.
I don't know that this is possible to do. Also, there may be changes that don't affect the API yet still cause the examples to give different results.
One option is to have a workflow that is only triggered manually or on some other event (such as a beta release).
> It makes sense to me to pick a stable commit, run it a reasonable number of times (3-5) over all possible arguments, and check whether the new commit's mean accuracy (also over 3-5 runs) is within error bars for all possible arguments.
That sounds tricky. I think a safer bet here is to fix the seed(s). That way we can guarantee that the result is exactly the same and test for that.
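A sketch of what seed-fixing could look like, using only the stdlib (a real example would also seed NumPy, PyTorch, etc. the same way; the helper name is an assumption):

```python
import os
import random

def fix_seeds(seed: int = 0) -> None:
    """Pin the common sources of randomness so an example's result is
    exactly reproducible run-to-run. Stdlib only here; an actual example
    would also call e.g. numpy.random.seed(seed) as appropriate."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

fix_seeds(42)
first = [random.random() for _ in range(3)]
fix_seeds(42)
second = [random.random() for _ in range(3)]
assert first == second  # identical draws, so an exact-equality check is safe
```

With seeds pinned this way, a single run per argument suffices and the CI check becomes a plain equality test instead of a statistical one.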