Use flatware to parallelize CI specs #30284

mjankowski · 2024-05-13T16:09:42Z

Previous versions of this:

Original PR - #27663 - has the most history of performance improvements here, but seems stuck on it's codecov/gh-actions integration ... I'm guessing this may be a result of the codecov action having been version bumped while the PR was open, and am going to replace that PR with this one as a final attempt here.

At this point I do see the same coverage percentages locally on main and this branch, and as far as I can tell the codecov integration should be working.

mjankowski · 2024-05-13T20:00:27Z

CodeCov still showing a drop here. This is sort of hard to debug given how slow it is to iterate and that I’m not 100% sure how Codecov processes the uploaded files. One theory is that the formatter used is impacting the differences in reported results, and/or the combining of results. Did a bunch of local trial and error, and see:

Style	Formatter	LOC Results	Percentage
Plain	HTML	24858 / 28046	88.63%
Plain	LCOV	25065 of 27769	90.3%
Plain	XML	25065 / 27769	90.26%
Plain	JSON	25065 / 27769	90.26%
Flatware	HTML	24859 / 28046	88.64%
Flatware	LCOV	25065 of 27769	90.3%
Flatware	XML	24190 / 27769	87.11%
Flatware	JSON	25065 / 27769	90.26%

Notably here, the local results (for HTML, JSON, LCOV) do seem to be combining correctly, with the XML format having a small drop off. I also experimented with disabling/enabling the branch coverage, but that doesn’t seem to impact.

When I download the generated file which is sent from GH actions to Codecov and run genhtml on it locally, I don’t see the same results as what Codecov shows in the UI.

Not sure how to proceed.

mjankowski · 2024-05-13T20:05:19Z

@nschonni I think you originally added the codecov integration here - #23868 - any ideas on what we need to do here to get this output aligned with whatever it expects?

nschonni · 2024-05-14T04:38:35Z

Not really sure, but it seems like they don't like the HTML reporter https://docs.codecov.com/docs/supported-report-formats

mjankowski · 2024-05-14T14:33:55Z

Did more digging here. When I download the file attached to this commit - https://app.codecov.io/github/mastodon/mastodon/commit/cba103e23348888aed92057a185f957b2708ce13 - which was uploaded from gh actions to codecov, and I run it through genhtml locally to see the summary of the lcov data...

I see results locally of "82.1% (23069 of 28114 lines)"
But the CodeCov report on the PR is showing 76.78% coverage and calls that a loss of -5.28% from the BASE
Add those together and you get back to 82.06, which is presumably what I see (rounded) locally

In the current commit history on the PR - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - you can my initial commit (using LCOV) dropped the current project coverage (from https://app.codecov.io/gh/mastodon/mastodon showing just over 85%) down to 76.78%; then switching to JSON formatter bumped it up to 82.06%; then reverting that and going back to LCOV was once again 76.78%

I also wanted to rule out that the specs which ONLY run on CI (triggered by PAM/OIDC/etc env vars) were not relevant. When I set all those env vars locally, I see (using default LCOV format):

Main - lines.......: 90.6% (25150 of 27769 lines)
Flatware - lines.......: 90.6% (25150 of 27769 lines)

So it doesn't seem to be that either ... and then I also have no idea how this local 90.6% turns into the reported ~85% on codecov. Hmm.

mjankowski · 2024-05-14T14:42:25Z

Unrelated to this flatware branch, when I download the coverage data for the current latest commit main - from https://app.codecov.io/gh/mastodon/mastodon/commit/3a7aec2807089a004db90851c66db0a007a18a48 - and then generate data for it locally...

Locally I see "90.6% (25148 of 27769 lines)"
CodeCov shows "24495 of 28736 lines covered" and 85.24%

All of which is to say -- I don't know what process codecov uses to convert the uploaded coverage data into the report it shows, and thus I have no idea how to debug this further.

If this PR is contingent on getting codecov to match, I'll likely just abandon it.

If we'd like to get the parallelism speedups and are comfortable with either a) this one-off codecov change that we can't explain, b) removing codecov integration -- that probably works. Or I guess if someone else has ideas here.

renchap · 2024-05-14T14:44:48Z

Maybe you can open an issue here: https://github.com/codecov/feedback/issues ?
They seem to be responsive

mjankowski · 2024-05-14T15:09:12Z

Good idea, opened one (linked).

mjankowski · 2024-05-14T22:33:31Z

RE: the source of the local differences with the HTML formatter vs any of the other formatters -- I think this comes down to the eager_load setting in test.rb environment (usually controlled by presence of CI env var). Running in this branch and doing some override of that setting, I see:

With html formatter and eager_load disabled — 24958 / 28030 LOC (89.04%) covered
With html formatter eager_load enabled — 25154 / 27774 LOC (90.57%) covered
With lcov formatter and eager_load enabled -- lines.......: 90.6% (25154 of 27774 lines)
With lcov and eager_load disabled -- lines.......: 89.0% (24958 of 28030 lines)

All of these have the full set of CI env vars enabled, hence the slightly higher numbers than typical local runs.

This is not super relevant to the codecov issues, but I wanted to get a good baseline to explain those differences so that I could be sure anything else that was changing was actually from the config, and not from formatter differences.

mjankowski · 2024-05-14T22:35:53Z

RE: the seeming disconnect between codecov and local, I think this came down to a load order / config loading thing, and that the current codecov from this branch (after many force pushes with various things tried) does seem to match what I see locally - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - granted that is now HIGHER than the previous codecov number, rather than lower.

I'm going to pull out a separate branch which will include both an experiment re: that coverage number, and also just a useful extraction from this PR to get in first before the flatware changes. Will link to that.

mjankowski · 2024-05-14T22:45:03Z

Other random note - the presence of tmp/rspec/examples.txt is used when present to inform how flatware splits the suite. Because I have this locally from multiple runs but CI doesn't have it, I think this was making the results I saw in terms of coverage more consistent, regardless of whether merging was working correctly or not. Made things harder to sort out.

mjankowski · 2024-05-15T14:01:06Z

Ok, this is rebased against main after the simplecov config merged. I believe at this point I understand final discrepancy in the codecov changes -- there are a bunch of open issues/PRs against simplecov (which, side note, appears barely maintained) around the interaction of parallel testing and re-adding branch coverage (as opposed to line coverage). The current change on codecov is there because we still have line coverage but have dropped branch coverage (the setting is still on, but the results are just not generated fully) and thus codecov shows an effective coverage bump vs main because the C0 line coverage looks better than the full branch coverage.

In the previous simplecov config change commit I added a monkeypatch which prevents a nil error from being raised, but does NOT restore the branch coverage data correctly. As far I can tell, none of the open PRs there do this, and I'd have to dig much further than I really want to into simplecov to know where to start.

Options here:

Merge this as-is (or first more explicitly disable the branch coverage instead of just having it silently not work) -- we'll get the speedups, we'll preserve our codecov integration, we'll have this one-time seeming coverage bump (from ~85% to ~90%) which we now have an explanation for, and then we can monitor the relevant issues on simplecov repo and re-enable branch coverage if those are fixed
Pursue further, open our own simplecov issues, try to dig deeper for better monkey patch, etc ... and hold out for full branch coverage

Any feedback or other ideas appreciated.

mjankowski · 2024-06-07T15:01:18Z

Opened #30596 to disable branch coverage (for now) with hopes to rebase/merge this after that.

Short of that, I'm kind of out of ideas here and will prob abandon this path.

mjankowski · 2024-06-14T15:38:11Z

Rebased this after recent CI config changes ... though most of those were not narrowly relevant to the rspec-run portion of the CI setup. That said, speedup here still meaningful ... recent main branch runs are in the 6-7 min range (for just the rspec portion), this PR branch is ~3mins, sometimes a little faster.

One possible followup here if we merge this - if we also cache the persistence file output between runs I suspect we can get this even more optimized (take previous run time data as input into how to allocate across parallel runs to minimize overall time).

mjankowski · 2024-06-25T18:49:20Z

A previous dying worker issue was fixed in latest release, so bumped back to latest stable here.

renchap · 2024-06-25T20:18:03Z

I would be very happy to get this merged if it makes our tests go faster.

We could always revert it if it causes unexpected issues. I dont have an opinion either about the coverage, we do not use those metrics much at the moment, and I think the important part is that they go up?

@ClearlyClaire any opinion here?

mjankowski · 2024-06-26T16:01:27Z

I would be very happy to get this merged if it makes our tests go faster.

That's the motivation here. The CI feedback loop is super slow right now. The biggest issues are 1) the rspec suite run time, 2) the JS compilation step, 3) the slowness and lack of cachability on parts of the ruby/JS setups.

This is attempting to chip away at the rspec portion of that.

We could always revert it if it causes unexpected issues.

I'd love to get it merged, and force a bunch of CI runs (could rebase ~50 PRs or something) and try to force out any issues that might exist. Would absolutely revert if it causes issues.

I dont have an opinion either about the coverage, we do not use those metrics much at the moment, and I think the important part is that they go up?

Yes - my impression is generally that I might be the only one in any way looking at these numbers. My current opinion is that the benefits of gaining parallelism across cores on CI runs is worth the (maybe temporary?) loss of branch coverage.

mjankowski mentioned this pull request May 13, 2024

Use flatware to parallelize CI specs #27663

Closed

mjankowski force-pushed the flatware-parallel-specs-rebased branch from fe8d506 to b896052 Compare May 13, 2024 19:09

mjankowski mentioned this pull request May 14, 2024

Disconnect between local results and codecov output codecov/feedback#376

Closed

mjankowski force-pushed the flatware-parallel-specs-rebased branch 6 times, most recently from 24b1d63 to 81b9dc0 Compare May 14, 2024 19:32

mjankowski mentioned this pull request May 14, 2024

Move simplecov config into rails_helper #30302

Merged

mjankowski force-pushed the flatware-parallel-specs-rebased branch from 81b9dc0 to e5e8744 Compare May 15, 2024 11:03

This was referenced May 15, 2024

Simplecov coverage not calculating briandunn/flatware#43

Closed

Combining branch data on parallel runs? simplecov-ruby/simplecov#1091

Open

mjankowski requested a review from a team May 24, 2024 14:40

mjankowski added performance Runtime performance testing Automated lint and test suites labels May 29, 2024

mjankowski force-pushed the flatware-parallel-specs-rebased branch from e5e8744 to f97f6b5 Compare May 31, 2024 13:40

mjankowski mentioned this pull request Jun 7, 2024

Disable simplecov branch coverage #30596

Open

mjankowski force-pushed the flatware-parallel-specs-rebased branch 2 times, most recently from c12da10 to 98cdff9 Compare June 14, 2024 15:21

mjankowski closed this Jun 18, 2024

mjankowski deleted the flatware-parallel-specs-rebased branch June 18, 2024 17:01

mjankowski restored the flatware-parallel-specs-rebased branch June 18, 2024 17:02

mjankowski reopened this Jun 18, 2024

mjankowski force-pushed the flatware-parallel-specs-rebased branch 3 times, most recently from b3a5ca8 to ccc33e2 Compare June 25, 2024 18:24

mjankowski added 3 commits June 26, 2024 11:59

Add flatware gem

f96a5ea

Run flatware on CI

10c0838

Update flatware-rspec to version 2.3.2

1c93128

mjankowski force-pushed the flatware-parallel-specs-rebased branch from ccc33e2 to 1c93128 Compare June 26, 2024 16:11

ClearlyClaire approved these changes Jun 27, 2024

View reviewed changes

renchap added this pull request to the merge queue Jun 27, 2024

Merged via the queue into mastodon:main with commit f6390c3 Jun 27, 2024
32 checks passed

mjankowski deleted the flatware-parallel-specs-rebased branch June 27, 2024 12:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use flatware to parallelize CI specs #30284

Use flatware to parallelize CI specs #30284

mjankowski commented May 13, 2024

mjankowski commented May 13, 2024

mjankowski commented May 13, 2024

nschonni commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

renchap commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 15, 2024

mjankowski commented Jun 7, 2024

mjankowski commented Jun 14, 2024

mjankowski commented Jun 25, 2024

renchap commented Jun 25, 2024

mjankowski commented Jun 26, 2024

Use flatware to parallelize CI specs #30284

Use flatware to parallelize CI specs #30284

Conversation

mjankowski commented May 13, 2024

mjankowski commented May 13, 2024

mjankowski commented May 13, 2024

nschonni commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

renchap commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 14, 2024

mjankowski commented May 15, 2024

mjankowski commented Jun 7, 2024

mjankowski commented Jun 14, 2024

mjankowski commented Jun 25, 2024

renchap commented Jun 25, 2024

mjankowski commented Jun 26, 2024