General Behavior CI #1634

Closed · 16 tasks · pitag-ha opened this issue Jun 26, 2023 · 2 comments

pitag-ha (Member) commented Jun 26, 2023

Merlin already has quite a few end-to-end tests (which are also run in CI). All of those are behavior-specific: we want to test one concrete behavior of Merlin, so we create specific input for it and test Merlin on that input. In addition to those tests, we now also want to test Merlin's behavior non-specifically on a large amount of randomized input. As with the benchmark CI, we can use merl-an for this as well.

Purpose of the CI

We want a large-scale CI that monitors Merlin's behavior on a large amount of randomized (but deterministic) input. A change in behavior isn't necessarily a regression. The CI serves as an alert tool for unexpected behavior changes that aren't covered by any of the concrete tests.

Usage of the CI

The large scale of this CI makes it different from normal behavior CIs in two ways:

  • It takes a lot longer, so we won't want to wait for it on every PR.
  • It has far more input, so changes can't always be analyzed on an individual basis.

Challenges/Questions

When would we run the CI? It takes too long to run on every PR. Running it in a cron job would make it really hard to distinguish wanted behavior changes from unwanted ones.

Running it opt-in on PRs

We could have it opt-in on PRs, e.g. enabled via a GitHub flag such as a label (a sketch of such a trigger follows the list below). The concrete workflow would then look as follows:

  • For a PR that consciously implements a behavior change, we'll run the CI and likely see the change in a large number of samples. We won't be able to go through all of those samples; we'll skim through them and then "promote". This is the first of the two differences from "normal end-to-end tests": we only skim through the changes.
  • For a PR that doesn't consciously implement a behavior change, we need to decide whether or not to run the CI. This is the second difference from "normal end-to-end tests": the tests aren't run on every PR.
    • If we consider the PR "risky", we run the CI. If it unconsciously changes the behavior, we check out some of the samples where it did and fix what we messed up (or consider it right and "promote").
    • If we consider the PR "risk-less", we don't run the CI on it. If we were wrong about it being "risk-less" and it did change the behavior, we'll miss it, find the behavior change sometime later when the CI runs, and then need to "bisect".
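As a concrete sketch of the opt-in trigger (everything here is an assumption for illustration: the label name, the workflow file name, and the existence of a workflow listening for them), opting in could be as light-weight as:

    # Hypothetical: attach a label that an (assumed) behavior-CI workflow
    # is configured to trigger on. The label name is an assumption.
    $ gh pr edit 1234 --add-label "run-behavior-ci"

    # Or trigger the (assumed) workflow manually on the PR branch:
    $ gh workflow run behavior-ci.yml --ref my-feature-branch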

Would that be useful?

Main Challenge with running it opt-in on PRs

On what basis would we decide when to opt in and run the CI? How would we make sure we don't forget?

Implementation

merl-an already has a command, merl-an behavior, for this General Behavior CI. What's missing is its integration into a CI workflow.
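At its core, the CI job would wrap merl-an behavior in a produce/compare cycle. A minimal sketch, assuming a stored baseline of previous results (the paths below are placeholders; only the merl-an behavior subcommand itself is confirmed):

    # Sketch of the CI job; paths are assumptions for illustration.
    $ merl-an behavior ...        # produce results/ with the current Merlin
    $ diff -r baseline/ results/  # compare against the stored baseline
    # A non-empty diff means the behavior changed: a human skims it and
    # either fixes the regression or "promotes" results/ as the new baseline.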

Data to take into account

Running merl-an behavior on Irmin with 1 sample per file, we get:

  • 50MB (we assume that it will scale linearly)
  • 15 mins (we assume that it won't scale linearly, will be faster; also, we could improve this in several ways)

So the "simple approach" of having the tests locally as dune cram tests won't scale in terms of size in the git repo.

Next action points

@3Rafal has already written a PoC for a similar CI. For some purposes of this CI, such as gathering more context and data, we can reuse that PoC 1:1, changing only one merl-an command. The next action points are:

  • Decide whether implementing this CI is worth the effort and inconveniences: Either decide "no" with our current state of knowledge or:
    • Gather more insight into what the CI would be capable of and what kind of sample set it would need:
      • Go through some past Merlin PRs and find the ones with behavior changes, both wanted and unwanted.
      • Write a PoC as local cram tests (literally copying @3Rafal's PoC of the Concrete-regression CI, only modifying the merl-an command and the promotion step).
      • Run the PoC on those PRs and find out whether they'd have caught the behavior change at all and whether the diff would have been human-readable.
    • Pick the lowest-hanging fruit to optimize the CI in terms of time:
      • On the CI-side: e.g. cache the set-up.
      • On the merl-an-side: Make sure we use the Merlin cache as much as possible (e.g. if the traversal is done in query -> file order, do it in file -> query order instead).
      • On the sample-side:
        • For the Merlin cache, it's better to have lots of samples in few files than few samples in lots of files.
        • Do we want to run it on all mli-files as well?
    • Possibly: After the optimization, gather more realistic data.
  • Write an MVP.
    • Either: Find out if git LFS could be a good solution for turning the PoC into an MVP.
    • Or: Store the last results on a Tarides server; fetch, compare, push (see the sketch after this list). Started here.
  • Likely: Fix merl-an to avoid OOM problems. See pitag-ha/merl-an#18 (Fix workflow of persisting the generated data).
  • With the optimized CI, find a good balance for the sample size.
  • Decide on a good workflow for the CI: Opt-in on PRs; or on all PRs but non-blocking for merge; or as cron jobs (see above).
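For the second MVP option, the fetch/compare/push cycle could look roughly like this (a sketch; RESULTS_SERVER and the archive layout are assumptions):

    # Sketch: fetch the last results, compare, push the new ones.
    # RESULTS_SERVER and the file layout are assumptions for illustration.
    $ curl -fsSL "$RESULTS_SERVER/latest.tar.gz" | tar xz          # unpacks baseline/
    $ merl-an behavior ...                                         # produces results/
    $ diff -r baseline/ results/ > behavior.diff || echo "behavior changed"
    $ tar czf results.tar.gz results/
    $ curl -fsS -T results.tar.gz "$RESULTS_SERVER/latest.tar.gz"  # push new baseline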

Benefit pay-off

Given the time and storage the CI consumes, we need to find the right balance between having enough samples and not taking too much time. The trade-off we'll likely end up with is:

  • The bigger the sample size, the less often we'll run the CI, so the later we'll notice changes and the more we'll have to "bisect" (see the sketch below).
  • The smaller the sample size, the more behavior changes the CI will miss.
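When a change is only noticed later, the "bisect" step itself can be automated with git bisect run. For example (check-behavior.sh is a hypothetical helper that runs merl-an behavior, diffs against the last good baseline, and exits non-zero on a behavior change):

    # Find the commit that changed the behavior between the last known-good
    # commit and HEAD; the test script is an assumed helper, not an existing one.
    $ git bisect start HEAD <last-known-good-commit>
    $ git bisect run ./check-behavior.sh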

How can we know whether a sample set is significant enough for the CI to be useful?

Main question

Do we think the CI will have such a big impact that it's worth all the effort and inconveniences (see first action point above)?

pitag-ha (Member, Author) commented:
@3Rafal, is there anything you'd add?

pitag-ha (Member, Author) commented:
Implemented in #1716
