Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use flatware to parallelize CI specs #30284

Merged
merged 3 commits into from
Jun 27, 2024

Conversation

mjankowski
Copy link
Contributor

Previous versions of this:

Original PR - #27663 - has the most history of performance improvements here, but seems stuck on it's codecov/gh-actions integration ... I'm guessing this may be a result of the codecov action having been version bumped while the PR was open, and am going to replace that PR with this one as a final attempt here.

At this point I do see the same coverage percentages locally on main and this branch, and as far as I can tell the codecov integration should be working.

@mjankowski
Copy link
Contributor Author

CodeCov still showing a drop here. This is sort of hard to debug given how slow it is to iterate and that I’m not 100% sure how Codecov processes the uploaded files. One theory is that the formatter used is impacting the differences in reported results, and/or the combining of results. Did a bunch of local trial and error, and see:

Style Formatter LOC Results Percentage
Plain HTML 24858 / 28046 88.63%
Plain LCOV 25065 of 27769 90.3%
Plain XML 25065 / 27769 90.26%
Plain JSON 25065 / 27769 90.26%
Flatware HTML 24859 / 28046 88.64%
Flatware LCOV 25065 of 27769 90.3%
Flatware XML 24190 / 27769 87.11%
Flatware JSON 25065 / 27769 90.26%

Notably here, the local results (for HTML, JSON, LCOV) do seem to be combining correctly, with the XML format having a small drop off. I also experimented with disabling/enabling the branch coverage, but that doesn’t seem to impact.

When I download the generated file which is sent from GH actions to Codecov and run genhtml on it locally, I don’t see the same results as what Codecov shows in the UI.

Not sure how to proceed.

@mjankowski
Copy link
Contributor Author

@nschonni I think you originally added the codecov integration here - #23868 - any ideas on what we need to do here to get this output aligned with whatever it expects?

@nschonni
Copy link
Contributor

Not really sure, but it seems like they don't like the HTML reporter https://docs.codecov.com/docs/supported-report-formats

@mjankowski
Copy link
Contributor Author

Did more digging here. When I download the file attached to this commit - https://app.codecov.io/github/mastodon/mastodon/commit/cba103e23348888aed92057a185f957b2708ce13 - which was uploaded from gh actions to codecov, and I run it through genhtml locally to see the summary of the lcov data...

  • I see results locally of "82.1% (23069 of 28114 lines)"
  • But the CodeCov report on the PR is showing 76.78% coverage and calls that a loss of -5.28% from the BASE
  • Add those together and you get back to 82.06, which is presumably what I see (rounded) locally

In the current commit history on the PR - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - you can my initial commit (using LCOV) dropped the current project coverage (from https://app.codecov.io/gh/mastodon/mastodon showing just over 85%) down to 76.78%; then switching to JSON formatter bumped it up to 82.06%; then reverting that and going back to LCOV was once again 76.78%

I also wanted to rule out that the specs which ONLY run on CI (triggered by PAM/OIDC/etc env vars) were not relevant. When I set all those env vars locally, I see (using default LCOV format):

  • Main - lines.......: 90.6% (25150 of 27769 lines)
  • Flatware - lines.......: 90.6% (25150 of 27769 lines)

So it doesn't seem to be that either ... and then I also have no idea how this local 90.6% turns into the reported ~85% on codecov. Hmm.

@mjankowski
Copy link
Contributor Author

Unrelated to this flatware branch, when I download the coverage data for the current latest commit main - from https://app.codecov.io/gh/mastodon/mastodon/commit/3a7aec2807089a004db90851c66db0a007a18a48 - and then generate data for it locally...

  • Locally I see "90.6% (25148 of 27769 lines)"
  • CodeCov shows "24495 of 28736 lines covered" and 85.24%

All of which is to say -- I don't know what process codecov uses to convert the uploaded coverage data into the report it shows, and thus I have no idea how to debug this further.

If this PR is contingent on getting codecov to match, I'll likely just abandon it.

If we'd like to get the parallelism speedups and are comfortable with either a) this one-off codecov change that we can't explain, b) removing codecov integration -- that probably works. Or I guess if someone else has ideas here.

@renchap
Copy link
Sponsor Member

renchap commented May 14, 2024

Maybe you can open an issue here: https://github.com/codecov/feedback/issues ?
They seem to be responsive

@mjankowski
Copy link
Contributor Author

Good idea, opened one (linked).

@mjankowski mjankowski force-pushed the flatware-parallel-specs-rebased branch 6 times, most recently from 24b1d63 to 81b9dc0 Compare May 14, 2024 19:32
@mjankowski
Copy link
Contributor Author

RE: the source of the local differences with the HTML formatter vs any of the other formatters -- I think this comes down to the eager_load setting in test.rb environment (usually controlled by presence of CI env var). Running in this branch and doing some override of that setting, I see:

  • With html formatter and eager_load disabled — 24958 / 28030 LOC (89.04%) covered
  • With html formatter eager_load enabled — 25154 / 27774 LOC (90.57%) covered
  • With lcov formatter and eager_load enabled -- lines.......: 90.6% (25154 of 27774 lines)
  • With lcov and eager_load disabled -- lines.......: 89.0% (24958 of 28030 lines)

All of these have the full set of CI env vars enabled, hence the slightly higher numbers than typical local runs.

This is not super relevant to the codecov issues, but I wanted to get a good baseline to explain those differences so that I could be sure anything else that was changing was actually from the config, and not from formatter differences.

@mjankowski
Copy link
Contributor Author

RE: the seeming disconnect between codecov and local, I think this came down to a load order / config loading thing, and that the current codecov from this branch (after many force pushes with various things tried) does seem to match what I see locally - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - granted that is now HIGHER than the previous codecov number, rather than lower.

I'm going to pull out a separate branch which will include both an experiment re: that coverage number, and also just a useful extraction from this PR to get in first before the flatware changes. Will link to that.

@mjankowski
Copy link
Contributor Author

Other random note - the presence of tmp/rspec/examples.txt is used when present to inform how flatware splits the suite. Because I have this locally from multiple runs but CI doesn't have it, I think this was making the results I saw in terms of coverage more consistent, regardless of whether merging was working correctly or not. Made things harder to sort out.

@mjankowski
Copy link
Contributor Author

Ok, this is rebased against main after the simplecov config merged. I believe at this point I understand final discrepancy in the codecov changes -- there are a bunch of open issues/PRs against simplecov (which, side note, appears barely maintained) around the interaction of parallel testing and re-adding branch coverage (as opposed to line coverage). The current change on codecov is there because we still have line coverage but have dropped branch coverage (the setting is still on, but the results are just not generated fully) and thus codecov shows an effective coverage bump vs main because the C0 line coverage looks better than the full branch coverage.

In the previous simplecov config change commit I added a monkeypatch which prevents a nil error from being raised, but does NOT restore the branch coverage data correctly. As far I can tell, none of the open PRs there do this, and I'd have to dig much further than I really want to into simplecov to know where to start.

Options here:

  • Merge this as-is (or first more explicitly disable the branch coverage instead of just having it silently not work) -- we'll get the speedups, we'll preserve our codecov integration, we'll have this one-time seeming coverage bump (from ~85% to ~90%) which we now have an explanation for, and then we can monitor the relevant issues on simplecov repo and re-enable branch coverage if those are fixed
  • Pursue further, open our own simplecov issues, try to dig deeper for better monkey patch, etc ... and hold out for full branch coverage

Any feedback or other ideas appreciated.

@mjankowski
Copy link
Contributor Author

Opened #30596 to disable branch coverage (for now) with hopes to rebase/merge this after that.

Short of that, I'm kind of out of ideas here and will prob abandon this path.

@mjankowski mjankowski force-pushed the flatware-parallel-specs-rebased branch 2 times, most recently from c12da10 to 98cdff9 Compare June 14, 2024 15:21
@mjankowski
Copy link
Contributor Author

Rebased this after recent CI config changes ... though most of those were not narrowly relevant to the rspec-run portion of the CI setup. That said, speedup here still meaningful ... recent main branch runs are in the 6-7 min range (for just the rspec portion), this PR branch is ~3mins, sometimes a little faster.

One possible followup here if we merge this - if we also cache the persistence file output between runs I suspect we can get this even more optimized (take previous run time data as input into how to allocate across parallel runs to minimize overall time).

@mjankowski mjankowski closed this Jun 18, 2024
@mjankowski mjankowski deleted the flatware-parallel-specs-rebased branch June 18, 2024 17:01
@mjankowski mjankowski restored the flatware-parallel-specs-rebased branch June 18, 2024 17:02
@mjankowski mjankowski reopened this Jun 18, 2024
@mjankowski mjankowski force-pushed the flatware-parallel-specs-rebased branch 3 times, most recently from b3a5ca8 to ccc33e2 Compare June 25, 2024 18:24
@mjankowski
Copy link
Contributor Author

A previous dying worker issue was fixed in latest release, so bumped back to latest stable here.

@renchap
Copy link
Sponsor Member

renchap commented Jun 25, 2024

I would be very happy to get this merged if it makes our tests go faster.

We could always revert it if it causes unexpected issues. I dont have an opinion either about the coverage, we do not use those metrics much at the moment, and I think the important part is that they go up?

@ClearlyClaire any opinion here?

@mjankowski
Copy link
Contributor Author

I would be very happy to get this merged if it makes our tests go faster.

That's the motivation here. The CI feedback loop is super slow right now. The biggest issues are 1) the rspec suite run time, 2) the JS compilation step, 3) the slowness and lack of cachability on parts of the ruby/JS setups.

This is attempting to chip away at the rspec portion of that.

We could always revert it if it causes unexpected issues.

I'd love to get it merged, and force a bunch of CI runs (could rebase ~50 PRs or something) and try to force out any issues that might exist. Would absolutely revert if it causes issues.

I dont have an opinion either about the coverage, we do not use those metrics much at the moment, and I think the important part is that they go up?

Yes - my impression is generally that I might be the only one in any way looking at these numbers. My current opinion is that the benefits of gaining parallelism across cores on CI runs is worth the (maybe temporary?) loss of branch coverage.

@mjankowski mjankowski force-pushed the flatware-parallel-specs-rebased branch from ccc33e2 to 1c93128 Compare June 26, 2024 16:11
@renchap renchap added this pull request to the merge queue Jun 27, 2024
Merged via the queue into mastodon:main with commit f6390c3 Jun 27, 2024
32 checks passed
@mjankowski mjankowski deleted the flatware-parallel-specs-rebased branch June 27, 2024 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Runtime performance testing Automated lint and test suites
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants