
Change --rebuild to --skip #114

Open
jeremyfowers opened this issue Feb 15, 2024 · 5 comments
Labels
breaking API breaking change that should get extra testing in canary cleanup Cleaning up old/unused code and tech debt p1 Medium priority

Comments

@jeremyfowers (Collaborator)

Problem statement:

  • Turnkey currently has a rebuild argument that gets passed into the Build Tool. It determines the build cache policy: rebuild the model always, never, or if_needed.
  • Turnkey needs to have a similar policy for benchmarking: whether a previously-attempted benchmark should be attempted again in a new evaluation.
  • These two policies could and should be combined to minimize argument bloat.

Proposed solution:

  • Eliminate --rebuild as a top-level turnkey argument. It will remain an argument to build_model(), but its value will typically be set internally by turnkey based on the value of the skip policy.
  • Introduce a new --skip top-level turnkey argument. It determines whether various aspects of a turnkey evaluation should be skipped. It has the following settings:
    • skip = attempted: skip any build or benchmark that has been previously attempted (regardless of outcome). Specifics:
      • if a model has a successful build, but no benchmark attempted, this would skip over building (load from cache) and then benchmark (same as rebuild = never)
      • if a model has a successful build and benchmark, this would simply print the benchmark result and exit.
      • if a model has a failed build or benchmark already, the model is skipped entirely
    • skip = successful: similar to attempted, except that only successful operations are skipped.
      • if a model has a successful build, but no benchmark attempted, this would skip over building (load from cache) and then benchmark (same as rebuild = never)
      • if a model has a successful build and benchmark, this would simply print the benchmark result and exit.
      • if a model has a failed build or benchmark already, the failed build/benchmark is attempted again
    • skip = failed: similar to attempted, except that only failed operations are skipped.
      • if a model has a successful build, but no benchmark attempted, this would repeat the build (wipe the cache) and then benchmark (same as rebuild = if_needed)
      • if a model has a successful build and benchmark, this would redo the build and benchmark (same as rebuild = if_needed)
      • if a model has a failed build or benchmark already, that model is skipped
    • skip = none: cached build and benchmarks are ignored and the build and benchmark are always performed (same behavior as rebuild = always)
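For illustration, the proposed --skip flag could be wired up with argparse roughly like this. This is a sketch, not turnkey's actual CLI code; the choice names come from the proposal above, and the successful default follows the later discussion in this thread.

```python
import argparse

# Hypothetical sketch of the proposed --skip flag.
parser = argparse.ArgumentParser(prog="turnkey")
parser.add_argument(
    "--skip",
    choices=["attempted", "successful", "failed", "none"],
    default="successful",  # the thread later proposes "successful" as the default
    help="Which previously-run builds/benchmarks to skip",
)

args = parser.parse_args(["--skip", "attempted"])
print(args.skip)  # -> attempted
```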

cc @danielholanda

@jeremyfowers jeremyfowers added cleanup Cleaning up old/unused code and tech debt p1 Medium priority breaking API breaking change that should get extra testing in canary labels Feb 15, 2024
@danielholanda (Collaborator)

Agree that skip and rebuild should be combined. This proposal, however, is still not very intuitive and potentially misses a few important scenarios:

Example:
What if a model compiled, it failed to execute the first time because the HW was in a bad state, and the user simply wants to reattempt execution? Is that not possible anymore without recompiling the model?

This is a very challenging UI decision.

@danielholanda (Collaborator)

We almost need something like:

turnkey build::if_needed -> Build if we never tried or failed (our current default behavior --build-only) [default]
turnkey build::always -> Build always (never benchmark)
turnkey build::if_not_attempted -> Build if we never tried (never benchmark)

turnkey benchmark::if_needed -> Benchmark if we never tried or failed benchmarking (never rebuild) [default]
turnkey benchmark::always -> Benchmark always (never rebuild)
turnkey benchmark::if_not_attempted -> Benchmark if we never tried benchmarking (never rebuild)

We can also allow any combination of the above

Example:
turnkey build::if_needed benchmark::if_needed (same as turnkey build benchmark) -> Same as our current default behavior
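The command::policy tokens above could be parsed with a small helper. In this sketch, split_token, DEFAULTS, and VALID are assumptions for illustration, not anything that exists in turnkey:

```python
# Defaults mirror the proposal: both build and benchmark default to if_needed.
DEFAULTS = {"build": "if_needed", "benchmark": "if_needed"}
VALID = {"if_needed", "always", "if_not_attempted"}

def split_token(token):
    """Split e.g. 'build::always' into ('build', 'always').

    A bare command like 'benchmark' falls back to its default policy.
    """
    command, _, policy = token.partition("::")
    policy = policy or DEFAULTS[command]
    if policy not in VALID:
        raise ValueError(f"unknown policy: {policy}")
    return command, policy

print(split_token("build::always"))  # -> ('build', 'always')
print(split_token("benchmark"))      # -> ('benchmark', 'if_needed')
```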

@jeremyfowers (Collaborator, Author) commented Feb 16, 2024

@danielholanda maybe not as bad as you thought? Posting a full "truth table" below. It's a big table, but it's intuitive to me what is supposed to happen in each case, and when I would use each setting.

Example: What if a model compiled, it failed to execute the first time because the HW was in a bad state, and the user simply wants to reattempt execution? Is that not possible anymore without recompiling the model?

In this specific example, the build state is S(uccessful), the benchmark state is F(ailed), and the user can apply skip=successful to load the build from cache and retry the benchmark. This would be the default behavior, since it is what is wanted in the general case.

Legend:

  • S = successful state
  • F = failed state
  • NA = action Not previously Attempted
| Skip Policy | Build State | Benchmark State | Build Action | Benchmark Action | Same as "rebuild=X" | Useful in this common scenario |
| --- | --- | --- | --- | --- | --- | --- |
| Successful (default) | S | S | Load | Load | - | Demos |
| Successful (default) | S | F | Load | Run | if_needed (default) | Debugging benchmarking |
| Successful (default) | F | NA | Run | Run | if_needed (default) | Debugging building |
| Successful (default) | NA | NA | Run | Run | if_needed (default) | Typical use |
| Successful (default) | S | NA | Load | Run | if_needed (default) | |
| Failed | S | S | Run | Run | always | |
| Failed | S | F | Load | Skip | - | |
| Failed | F | F | Skip | Skip | never | |
| Attempted | S | S | Load | Load | - | Benchmarking a cloud compile |
| Attempted | S | F | Load | Skip | - | Benchmarking a cloud compile |
| Attempted | F | NA | Skip | Skip | - | Benchmarking a cloud compile |
| Attempted | NA | NA | Run | Run | if_needed (default) | |
| Attempted | S | NA | Load | Run | if_needed (default) | Benchmarking a cloud compile |
| None | X | X | Run | Run | always | Debugging |
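The truth table above can be encoded directly as a lookup. This is an illustrative sketch only; the plan function and the S/F/NA state strings are hypothetical, not turnkey code:

```python
# (policy, build_state, bench_state) -> (build_action, bench_action),
# transcribed row-by-row from the truth table.
ACTIONS = {
    ("successful", "S", "S"):  ("Load", "Load"),
    ("successful", "S", "F"):  ("Load", "Run"),
    ("successful", "F", "NA"): ("Run", "Run"),
    ("successful", "NA", "NA"): ("Run", "Run"),
    ("successful", "S", "NA"): ("Load", "Run"),
    ("failed", "S", "S"):      ("Run", "Run"),
    ("failed", "S", "F"):      ("Load", "Skip"),
    ("failed", "F", "F"):      ("Skip", "Skip"),
    ("attempted", "S", "S"):   ("Load", "Load"),
    ("attempted", "S", "F"):   ("Load", "Skip"),
    ("attempted", "F", "NA"):  ("Skip", "Skip"),
    ("attempted", "NA", "NA"): ("Run", "Run"),
    ("attempted", "S", "NA"):  ("Load", "Run"),
}

def plan(policy, build_state, bench_state):
    """Return (build_action, bench_action) for a cached model."""
    if policy == "none":  # ignore the cache entirely (rebuild=always)
        return ("Run", "Run")
    return ACTIONS[(policy, build_state, bench_state)]

print(plan("successful", "S", "F"))  # -> ('Load', 'Run')
```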

P.S. A significant flaw with the above: there is no way to re-run a benchmark that already succeeded if the build also succeeded (i.e., the current behavior of turnkey benchmark). So granting fine-grained control, as you suggested, is ultimately the right thing.

@jeremyfowers (Collaborator, Author) commented Feb 16, 2024

We almost need something like:

turnkey build::if_needed -> Build if we never tried or failed (our current default behavior --build-only) [default]
turnkey build::always -> Build always (never benchmark)
turnkey build::if_not_attempted -> Build if we never tried (never benchmark)

turnkey benchmark::if_needed -> Benchmark if we never tried or failed benchmarking (never rebuild) [default]
turnkey benchmark::always -> Benchmark always (never rebuild)
turnkey benchmark::if_not_attempted -> Benchmark if we never tried benchmarking (never rebuild)

We can also allow any combination of the above

Example: turnkey build::if_needed benchmark::if_needed (same as turnkey build benchmark) -> Same as our current default behavior

This would work too, and I like that it grants more fine-grained control to the user. It's sort of related to #20, so it would be nice if we could kill two birds with one stone. However, I just tried to figure out how all the assumed commands would work (e.g., "discover" is always assumed unless it's explicit, and "build" is assumed when "benchmark" is used) and I got quite confused about the details of your turnkey benchmark behavior above.

I'm also not super happy with the if_needed and if_not_attempted terms: I don't think our users have ever intuitively understood if_needed, and if_not_attempted seems even less intuitive.

Requirements

Going back to first principles, the use cases we really want to support are:

  • Typical interactive use: cache as much as possible, require minimal user input / args
  • Debugging: grant the user fine grained control of what gets loaded from cache and what re-runs
  • Mass benchmarking: enable "resume" behavior (do not re-attempt anything)

In typical mode, commands like turnkey discover, turnkey build, and turnkey benchmark make a lot of sense to me. As in "I called turnkey benchmark because I want a benchmark and a tool named turnkey should just do whatever it takes to make that happen for me."

In debugging mode, something has gone wrong and I want control over everything so that I can quickly diagnose the problem. Something like turnkey build::load_cache benchmark::always makes a lot of sense here since it grants fine-grained control and is very explicit. The specific scenarios not covered by "typical use" are simply:

  • Retry a successful build
  • Retry a successful benchmark without rebuilding
  • Most other debugging scenarios, such as retrying a failed benchmark without rebuilding, are covered by typical use!

In mass evaluation mode, we need these behaviors:

  • Attempt to build and benchmark each model exactly once. If the command crashes, resume only the models that haven't been attempted yet.
  • Attempt to build each model exactly once. Same resume requirement as above.
  • Attempt to benchmark each [cloud] build exactly once. Same resume requirement as above.
  • Run debugging workflows on specific models (e.g., retry a failed benchmark to see if it fails deterministically).

And, finally, I never want to type discover except in the --analyze-only scenario.

Proposed solution

TBD...

@jeremyfowers (Collaborator, Author) commented Feb 16, 2024

Design Study

Eliminate: --build-only, --analyze-only, --rebuild, --skip, and turnkey cache benchmark. Then introduce the following:

The evaluation commands would be:

  • turnkey discover: discover the models in a file (using any operations necessary)
  • turnkey build: build the models in a file (using any operations necessary)
  • turnkey benchmark (default): benchmark the models in a file (using any operations necessary)

The big rule here is that if you ask for something, by default it always happens regardless of the cache state. turnkey build bert.py will always build BERT (i.e., rebuild=always) because you, the user, are asking for a build.

  • turnkey build INPUT: always build INPUT
  • turnkey benchmark INPUT: build INPUT if needed, always benchmark

Ok now we just need a way to change the default policy to achieve some of our desired scenarios from the requirements.

  • Typical use case: just call discover, build, or benchmark
    • Bonus: benchmark a potentially stale build (current behavior of --rebuild never)
  • Debugging use case:
    • Retry a successful or failed build: turnkey build INPUT
    • Retry a successful or failed benchmark, without rebuilding: turnkey benchmark INPUT
    • Retry a model that both successfully built and successfully benchmarked (this is not one of our scenarios... just for the sake of demonstrating flexibility):
      • turnkey cache delete BUILD would do it
      • turnkey build INPUT followed by turnkey benchmark INPUT would also do it
      • Perhaps some POLICY would be helpful
  • Mass evaluation use case:
    • Attempt to build and benchmark each model exactly once (even when resuming): needs a POLICY
    • Attempt to build each model exactly once (even when resuming): needs a POLICY
    • Attempt to benchmark each [cloud] build exactly once (even when resuming): needs a POLICY
    • Run debugging workflow on one of these models: the debugging commands above would work fine.

Ok so based on this, our POLICY needs to change the default behavior in two main ways:

  • Always re-attempt the specific action
  • Never re-attempt any action
  • Bonus: re-attempt both build and benchmark

Based on that requirement, the following seems reasonably intuitive to me (especially since it is only really relevant to mass-evaluation): a new flag named --retry which can have the following values:

  • --retry nothing: never re-attempt any action; this covers all the mass-evaluation scenarios
  • --retry benchmark: default behavior of turnkey benchmark (not available as a choice in other commands); loads build from cache if available
  • --retry build: default behavior of turnkey build; in turnkey benchmark it rebuilds and then re-benchmarks (bonus objective achieved!)
  • --retry no_builds: never build or rebuild; benchmark only if the build is cached (this is a bit confusing, but AFAIK this is a super niche use case. I would be happy deprecating this use case as well).
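A sketch of how --retry might hang off each subcommand in argparse, with the per-command defaults described above. The wiring and names here are illustrative assumptions, not turnkey's implementation:

```python
import argparse

# Hypothetical CLI skeleton: each evaluation command gets its own --retry
# default (build defaults to retrying builds, benchmark to retrying benchmarks).
parser = argparse.ArgumentParser(prog="turnkey")
sub = parser.add_subparsers(dest="command")
for command, default in [("discover", None), ("build", "build"), ("benchmark", "benchmark")]:
    p = sub.add_parser(command)
    p.add_argument("input", nargs="+")
    p.add_argument(
        "--retry",
        choices=["nothing", "benchmark", "build", "no_builds"],
        default=default,
    )

args = parser.parse_args(["benchmark", "bert.py", "--retry", "nothing"])
print(args.command, args.retry)  # -> benchmark nothing
```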

How it works in practice

I really like this because it feels like everything is in plain English, and the behavior is pretty obvious in all cases:

| I want to... | I use the command... | Under the hood, the default value of retry is... |
| --- | --- | --- |
| Discover the models in some files | turnkey discover FILES | - |
| Build the models in some files | turnkey build FILES | build |
| Benchmark the models in some files | turnkey benchmark FILES | benchmark |
| Benchmark the models in some files (with no risk of rebuilding) | turnkey benchmark FILES --retry no_builds | - |
| Debug: rebuild a model that already built | turnkey build FILES | build |
| Debug: rebenchmark a model that built but failed benchmarking (without rebuilding) | turnkey benchmark FILES | benchmark |
| Debug: rebenchmark a model that already built and benchmarked (without rebuilding) | turnkey benchmark FILES | benchmark |
| Debug: rebuild and rebenchmark a model that already succeeded | turnkey benchmark FILES --retry build | - |
| Mass evaluation: attempt to build each model exactly once (even when resuming) | turnkey build FILES --retry nothing | - |
| Mass evaluation: attempt to benchmark builds exactly once (even when resuming) | turnkey benchmark cache/*.tkb --retry nothing | - |
| Mass evaluation: attempt to build and benchmark each model exactly once (even when resuming) | turnkey benchmark FILES --retry nothing | - |
