I believe it was a design decision not to test or benchmark our examples. (If this is correct, please remind me.)
However, some examples produce an accuracy, and I don't believe the accuracy produced by the examples varies much across different systems or hardware.
I wonder if we should have a special Jenkins test that benchmarks specific examples only when there's been a change to the examples or to the parts of the API they use. I think a good check would be whether the accuracy is close enough to that of some stable commit.
For example, @vivianwhite has provided pull request #1022. Right now I am testing the examples before and after Vivian's PR. Should I just look at the accuracy of one run for each possible argument, or should I be doing multiple runs and checking whether the results are within error bars? If so, how many runs?
It makes sense to me to pick a stable commit, run it a reasonable number of times (3-5) over all possible arguments, and check whether the new commit's mean accuracy (also over 3-5 runs) is within error bars for all possible arguments.
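A minimal sketch of the comparison being proposed, assuming each commit's example runs just yield a list of accuracies (the function name, the `k` cutoff, and the numbers below are all hypothetical):

```python
import statistics

def within_error_bars(baseline_accs, candidate_accs, k=2.0):
    """Return True if the candidate commit's mean accuracy lies within
    k sample standard deviations of the stable baseline's mean."""
    base_mean = statistics.mean(baseline_accs)
    base_std = statistics.stdev(baseline_accs)  # sample std over the 3-5 runs
    cand_mean = statistics.mean(candidate_accs)
    return abs(cand_mean - base_mean) <= k * base_std

# Hypothetical accuracies from 5 runs of one example, per commit:
baseline = [0.912, 0.915, 0.910, 0.914, 0.911]
candidate = [0.913, 0.916, 0.909, 0.912, 0.914]
print(within_error_bars(baseline, candidate))  # True for these numbers
```

This would be evaluated once per (example, argument) combination; any combination falling outside the bars fails the check.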
What do we think?
> I wonder if we should have a special Jenkins test that benchmarks specific examples only when there's been a change to the examples or to the parts of the API they use.
I don't know that this is possible to do. Also, there may be changes that don't affect the API yet still cause the examples to give different results.
One option is to have a workflow that is only triggered manually or on some other event (such as a beta release).
> It makes sense to me to pick a stable commit, run it a reasonable number of times (3-5) over all possible arguments, and check whether the new commit's mean accuracy (also over 3-5 runs) is within error bars for all possible arguments.
That sounds tricky. I think a safer bet here is to fix the seed(s). That way we can guarantee that the result is exactly the same and test for that.
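A sketch of what seed-fixing could look like, using only the stdlib (a real example would also seed NumPy, PyTorch, etc. the same way; the helper name is an assumption):

```python
import os
import random

def fix_seeds(seed: int = 0) -> None:
    """Pin the common sources of randomness so an example's result is
    exactly reproducible run-to-run. Stdlib only here; an actual example
    would also call e.g. numpy.random.seed(seed) as appropriate."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

fix_seeds(42)
first = [random.random() for _ in range(3)]
fix_seeds(42)
second = [random.random() for _ in range(3)]
assert first == second  # identical draws, so an exact-equality check is safe
```

With seeds pinned this way, a single run per argument suffices and the CI check becomes a plain equality test instead of a statistical one.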