
Change --rebuild to --skip #114

Open
jeremyfowers opened this issue Feb 15, 2024 · 5 comments
Labels
breaking API breaking change that should get extra testing in canary cleanup Cleaning up old/unused code and tech debt p1 Medium priority

Comments

@jeremyfowers (Collaborator)

Problem statement:

  • Turnkey currently has a rebuild argument that gets passed into the Build Tool. It determines the build cache policy: rebuild the model always, never, or if_needed.
  • Turnkey needs to have a similar policy for benchmarking: whether a previously-attempted benchmark should be attempted again in a new evaluation.
  • These two policies could and should be combined to minimize argument bloat.

Proposed solution:

  • Eliminate --rebuild as a top-level turnkey argument. It will remain an argument to build_model(), but its value will typically be set internally by turnkey based on the value of the skip policy.
  • Introduce a new --skip top-level turnkey argument. It determines whether various aspects of a turnkey evaluation should be skipped. It has the following settings:
    • skip = attempted: skip any build or benchmark that has been previously attempted (regardless of outcome). Specifics:
      • if a model has a successful build, but no benchmark attempted, this would skip over building (load from cache) and then benchmark (same as rebuild = never)
      • if a model has a successful build and benchmark, this would simply print the benchmark result and exit.
      • if a model has a failed build or benchmark already, the model is skipped entirely
    • skip = successful: similar to attempted, except that only successful operations are skipped.
      • if a model has a successful build, but no benchmark attempted, this would skip over building (load from cache) and then benchmark (same as rebuild = never)
      • if a model has a successful build and benchmark, this would simply print the benchmark result and exit.
      • if a model has a failed build or benchmark already, the failed build/benchmark is attempted again
    • skip = failed: similar to attempted, except that only failed operations are skipped.
      • if a model has a successful build, but no benchmark attempted, this would repeat the build (wipe the cache) and then benchmark (same as rebuild = if_needed)
      • if a model has a successful build and benchmark, this would redo the build and benchmark (same as rebuild = if_needed)
      • if a model has a failed build or benchmark already, that model is skipped
    • skip = none: cached build and benchmarks are ignored and the build and benchmark are always performed (same behavior as rebuild = always)
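For illustration, the proposed --skip flag could be wired up with argparse roughly like this. This is a sketch, not turnkey's actual CLI code; the choice names come from the proposal above, and the successful default follows the later discussion in this thread.

```python
import argparse

# Hypothetical sketch of the proposed --skip flag.
parser = argparse.ArgumentParser(prog="turnkey")
parser.add_argument(
    "--skip",
    choices=["attempted", "successful", "failed", "none"],
    default="successful",  # the thread later proposes "successful" as the default
    help="Which previously-run builds/benchmarks to skip",
)

args = parser.parse_args(["--skip", "attempted"])
print(args.skip)  # -> attempted
```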

cc @danielholanda

@jeremyfowers jeremyfowers added cleanup Cleaning up old/unused code and tech debt p1 Medium priority breaking API breaking change that should get extra testing in canary labels Feb 15, 2024
@danielholanda (Collaborator)

Agree that skip and rebuild should be combined. This proposal, however, is still not very intuitive and potentially misses a few important scenarios:

Example:
What if a model compiled, it failed to execute the first time because the HW was in a bad state, and the user simply wants to reattempt execution? Is that not possible anymore without recompiling the model?

This is a very challenging UI decision.

@danielholanda (Collaborator)

We almost need something like:

turnkey build::if_needed -> Build if we never tried or failed (our current default behavior --build-only) [default]
turnkey build::always -> Build always (never benchmark)
turnkey build::if_not_attempted -> Build if we never tried (never benchmark)

turnkey benchmark::if_needed -> Benchmark if we never tried or failed benchmarking (never rebuild) [default]
turnkey benchmark::always -> Benchmark always (never rebuild)
turnkey benchmark::if_not_attempted -> Benchmark if we never tried benchmarking (never rebuild)

We can also allow any combination of the above

Example:
turnkey build::if_needed benchmark::if_needed (same as turnkey build benchmark) -> Same as our current default behavior
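The command::policy tokens above could be parsed with a small helper. In this sketch, split_token, DEFAULTS, and VALID are assumptions for illustration, not anything that exists in turnkey:

```python
# Defaults mirror the proposal: both build and benchmark default to if_needed.
DEFAULTS = {"build": "if_needed", "benchmark": "if_needed"}
VALID = {"if_needed", "always", "if_not_attempted"}

def split_token(token):
    """Split e.g. 'build::always' into ('build', 'always').

    A bare command like 'benchmark' falls back to its default policy.
    """
    command, _, policy = token.partition("::")
    policy = policy or DEFAULTS[command]
    if policy not in VALID:
        raise ValueError(f"unknown policy: {policy}")
    return command, policy

print(split_token("build::always"))  # -> ('build', 'always')
print(split_token("benchmark"))      # -> ('benchmark', 'if_needed')
```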

@jeremyfowers (Collaborator, Author) commented Feb 16, 2024

@danielholanda maybe not as bad as you thought? Posting a full "truth table" below. It's a big table, but it's intuitive to me what is supposed to happen in each case, and when I would use each setting.

Example: What if a model compiled, it failed to execute the first time because the HW was in a bad state, and the user simply wants to reattempt execution? Is that not possible anymore without recompiling the model?

In this specific example, the build state is S(uccessful), the benchmark state is F(ailed), and the user can apply skip=successful to load the build from cache and retry the benchmark. This would be the default behavior, since it is what is wanted in the general case.

Legend:

  • S = successful state
  • F = failed state
  • NA = action Not previously Attempted
| Skip Policy | Build State | Benchmark State | Build Action | Benchmark Action | Same as "rebuild=X" | Useful in this common scenario |
| --- | --- | --- | --- | --- | --- | --- |
| Successful (default) | S | S | Load | Load | - | Demos |
| Successful (default) | S | F | Load | Run | if_needed (default) | Debugging benchmarking |
| Successful (default) | F | NA | Run | Run | if_needed (default) | Debugging building |
| Successful (default) | NA | NA | Run | Run | if_needed (default) | Typical use |
| Successful (default) | S | NA | Load | Run | if_needed (default) | |
| Failed | S | S | Run | Run | always | |
| Failed | S | F | Load | Skip | - | |
| Failed | F | F | Skip | Skip | never | |
| Attempted | S | S | Load | Load | - | Benchmarking a cloud compile |
| Attempted | S | F | Load | Skip | - | Benchmarking a cloud compile |
| Attempted | F | NA | Skip | Skip | - | Benchmarking a cloud compile |
| Attempted | NA | NA | Run | Run | if_needed (default) | |
| Attempted | S | NA | Load | Run | if_needed (default) | Benchmarking a cloud compile |
| None | X | X | Run | Run | always | Debugging |
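The truth table above can be encoded directly as a lookup. This is an illustrative sketch only; the plan function and the S/F/NA state strings are hypothetical, not turnkey code:

```python
# (policy, build_state, bench_state) -> (build_action, bench_action),
# transcribed row-by-row from the truth table.
ACTIONS = {
    ("successful", "S", "S"):  ("Load", "Load"),
    ("successful", "S", "F"):  ("Load", "Run"),
    ("successful", "F", "NA"): ("Run", "Run"),
    ("successful", "NA", "NA"): ("Run", "Run"),
    ("successful", "S", "NA"): ("Load", "Run"),
    ("failed", "S", "S"):      ("Run", "Run"),
    ("failed", "S", "F"):      ("Load", "Skip"),
    ("failed", "F", "F"):      ("Skip", "Skip"),
    ("attempted", "S", "S"):   ("Load", "Load"),
    ("attempted", "S", "F"):   ("Load", "Skip"),
    ("attempted", "F", "NA"):  ("Skip", "Skip"),
    ("attempted", "NA", "NA"): ("Run", "Run"),
    ("attempted", "S", "NA"):  ("Load", "Run"),
}

def plan(policy, build_state, bench_state):
    """Return (build_action, bench_action) for a cached model."""
    if policy == "none":  # ignore the cache entirely (rebuild=always)
        return ("Run", "Run")
    return ACTIONS[(policy, build_state, bench_state)]

print(plan("successful", "S", "F"))  # -> ('Load', 'Run')
```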

P.S. A significant flaw with the above: there is no way to re-run a benchmark that already succeeded if the build also succeeded (i.e., the current behavior of turnkey benchmark). So granting fine-grained control, as you suggested, is ultimately the right thing.

@jeremyfowers (Collaborator, Author) commented Feb 16, 2024

We almost need something like:

turnkey build::if_needed -> Build if we never tried or failed (our current default behavior --build-only) [default]
turnkey build::always -> Build always (never benchmark)
turnkey build::if_not_attempted -> Build if we never tried (never benchmark)

turnkey benchmark::if_needed -> Benchmark if we never tried or failed benchmarking (never rebuild) [default]
turnkey benchmark::always -> Benchmark always (never rebuild)
turnkey benchmark::if_not_attempted -> Benchmark if we never tried benchmarking (never rebuild)

We can also allow any combination of the above

Example: turnkey build::if_needed benchmark::if_needed (same as turnkey build benchmark) -> Same as our current default behavior

This would work too, and I like that it grants more fine-grained control to the user. It's sort of related to #20, so it would be nice if we could kill two birds with one stone. However, I just tried to figure out how all the assumed commands would work (e.g., "discover" is always assumed unless it's explicit, and "build" is assumed when "benchmark" is used) and I got quite confused about the details of your turnkey benchmark behavior above.

I'm also not super happy with the if_needed and if_not_attempted terms: I don't think our users have ever intuitively understood if_needed, and if_not_attempted seems even less intuitive.

Requirements

Going back to first principles, the use cases we really want to support are:

  • Typical interactive use: cache as much as possible, require minimal user input / args
  • Debugging: grant the user fine grained control of what gets loaded from cache and what re-runs
  • Mass benchmarking: enable "resume" behavior (do not re-attempt anything)

In typical mode, commands like turnkey discover, turnkey build, and turnkey benchmark make a lot of sense to me. As in "I called turnkey benchmark because I want a benchmark and a tool named turnkey should just do whatever it takes to make that happen for me."

In debugging mode, something has gone wrong and I want control over everything so that I can quickly diagnose the problem. Something like turnkey build::load_cache benchmark::always makes a lot of sense here since it grants fine-grained control and is very explicit. The specific scenarios not covered by "typical use" are simply:

  • Retry a successful build
  • Retry a successful benchmark without rebuilding
  • Most other debugging scenarios, such as retrying a failed benchmark without rebuilding, are covered by typical use!

In mass evaluation mode, we need these behaviors:

  • Attempt to build and benchmark each model exactly once. If the command crashes, resume only the models that haven't been attempted yet.
  • Attempt to build each model exactly once. Same resume requirement as above.
  • Attempt to benchmark each [cloud] build exactly once. Same resume requirement as above.
  • Run debugging workflows on specific models (e.g., retry a failed benchmark to see if it fails deterministically).

And, finally, I never want to type discover except in the --analyze-only scenario.

Proposed solution

TBD...

@jeremyfowers (Collaborator, Author) commented Feb 16, 2024

Design Study

Eliminate: --build-only, --analyze-only, --rebuild, --skip, and turnkey cache benchmark. Then introduce the following:

The evaluation commands would be:

  • turnkey discover: discover the models in a file (using any operations necessary)
  • turnkey build: build the models in a file (using any operations necessary)
  • turnkey benchmark (default): benchmark the models in a file (using any operations necessary)

The big rule here is that if you ask for something, by default it always happens regardless of the cache state. turnkey build bert.py will always build BERT (i.e., rebuild=always) because you, the user, are asking for a build.

  • turnkey build INPUT: always build INPUT
  • turnkey benchmark INPUT: build INPUT if needed, always benchmark

Ok now we just need a way to change the default policy to achieve some of our desired scenarios from the requirements.

  • Typical use case: just call discover, build, or benchmark
    • Bonus: benchmark a potentially stale build (current behavior of --rebuild never)
  • Debugging use case:
    • Retry a successful or failed build: turnkey build INPUT
    • Retry a successful or failed benchmark, without rebuilding: turnkey benchmark INPUT
    • Retry a model that both successfully built and successfully benchmarked (this is not one of our scenarios... just for the sake of demonstrating flexibility):
      • turnkey cache delete BUILD would do it
      • turnkey build INPUT followed by turnkey benchmark INPUT would also do it
      • Perhaps some POLICY would be helpful
  • Mass evaluation use case:
    • Attempt to build and benchmark each model exactly once (even when resuming): needs a POLICY
    • Attempt to build each model exactly once (even when resuming): needs a POLICY
    • Attempt to benchmark each [cloud] build exactly once (even when resuming): needs a POLICY
    • Run debugging workflow on one of these models: the debugging commands above would work fine.

Ok so based on this, our POLICY needs to change the default behavior in two main ways:

  • Always re-attempt the specific action
  • Never re-attempt any action
  • Bonus: re-attempt both build and benchmark

Based on that requirement, the following seems reasonably intuitive to me (especially since it is only really relevant to mass-evaluation): a new flag named --retry which can have the following values:

  • --retry nothing: never re-attempt any action; this covers all the mass-evaluation scenarios
  • --retry benchmark: default behavior of turnkey benchmark (not available as a choice in other commands); loads build from cache if available
  • --retry build: default behavior of turnkey build; in turnkey benchmark it rebuilds and then re-benchmarks (bonus objective achieved!)
  • --retry no_builds: never build or rebuild; benchmark only if the build is cached (this is a bit confusing, but AFAIK this is a super niche use case. I would be happy deprecating this use case as well).
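A sketch of how --retry might hang off each subcommand in argparse, with the per-command defaults described above. The wiring and names here are illustrative assumptions, not turnkey's implementation:

```python
import argparse

# Hypothetical CLI skeleton: each evaluation command gets its own --retry
# default (build defaults to retrying builds, benchmark to retrying benchmarks).
parser = argparse.ArgumentParser(prog="turnkey")
sub = parser.add_subparsers(dest="command")
for command, default in [("discover", None), ("build", "build"), ("benchmark", "benchmark")]:
    p = sub.add_parser(command)
    p.add_argument("input", nargs="+")
    p.add_argument(
        "--retry",
        choices=["nothing", "benchmark", "build", "no_builds"],
        default=default,
    )

args = parser.parse_args(["benchmark", "bert.py", "--retry", "nothing"])
print(args.command, args.retry)  # -> benchmark nothing
```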

How it works in practice

I really like this because it feels like everything is in plain English, and the behavior is pretty obvious in all cases:

| I want to... | I use the command... | Under the hood, the default value of retry is... |
| --- | --- | --- |
| Discover the models in some files | turnkey discover FILES | - |
| Build the models in some files | turnkey build FILES | build |
| Benchmark the models in some files | turnkey benchmark FILES | benchmark |
| Benchmark the models in some files (with no risk of rebuilding) | turnkey benchmark FILES --retry no_builds | - |
| Debug: rebuild a model that already built | turnkey build FILES | build |
| Debug: rebenchmark a model that built but failed benchmarking (without rebuilding) | turnkey benchmark FILES | benchmark |
| Debug: rebenchmark a model that already built and benchmarked (without rebuilding) | turnkey benchmark FILES | benchmark |
| Debug: rebuild and rebenchmark a model that already succeeded | turnkey benchmark FILES --retry build | - |
| Mass evaluation: attempt to build each model exactly once (even when resuming) | turnkey build FILES --retry nothing | - |
| Mass evaluation: attempt to benchmark builds exactly once (even when resuming) | turnkey benchmark cache/*.tkb --retry nothing | - |
| Mass evaluation: attempt to build and benchmark each model exactly once (even when resuming) | turnkey benchmark FILES --retry nothing | - |
