-
-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use flatware to parallelize CI specs #30284
Use flatware to parallelize CI specs #30284
Conversation
fe8d506
to
b896052
Compare
CodeCov still showing a drop here. This is sort of hard to debug given how slow it is to iterate and that I’m not 100% sure how Codecov processes the uploaded files. One theory is that the formatter used is impacting the differences in reported results, and/or the combining of results. Did a bunch of local trial and error, and see:
Notably here, the local results (for HTML, JSON, LCOV) do seem to be combining correctly, with the XML format having a small drop off. I also experimented with disabling/enabling the branch coverage, but that doesn’t seem to impact. When I download the generated file which is sent from GH actions to Codecov and run genhtml on it locally, I don’t see the same results as what Codecov shows in the UI. Not sure how to proceed. |
Not really sure, but it seems like they don't like the HTML reporter https://docs.codecov.com/docs/supported-report-formats |
Did more digging here. When I download the file attached to this commit - https://app.codecov.io/github/mastodon/mastodon/commit/cba103e23348888aed92057a185f957b2708ce13 - which was uploaded from gh actions to codecov, and I run it through
In the current commit history on the PR - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - you can my initial commit (using LCOV) dropped the current project coverage (from https://app.codecov.io/gh/mastodon/mastodon showing just over 85%) down to 76.78%; then switching to JSON formatter bumped it up to 82.06%; then reverting that and going back to LCOV was once again 76.78% I also wanted to rule out that the specs which ONLY run on CI (triggered by PAM/OIDC/etc env vars) were not relevant. When I set all those env vars locally, I see (using default LCOV format):
So it doesn't seem to be that either ... and then I also have no idea how this local 90.6% turns into the reported ~85% on codecov. Hmm. |
Unrelated to this flatware branch, when I download the coverage data for the current latest commit main - from https://app.codecov.io/gh/mastodon/mastodon/commit/3a7aec2807089a004db90851c66db0a007a18a48 - and then generate data for it locally...
All of which is to say -- I don't know what process codecov uses to convert the uploaded coverage data into the report it shows, and thus I have no idea how to debug this further. If this PR is contingent on getting codecov to match, I'll likely just abandon it. If we'd like to get the parallelism speedups and are comfortable with either a) this one-off codecov change that we can't explain, b) removing codecov integration -- that probably works. Or I guess if someone else has ideas here. |
Maybe you can open an issue here: https://github.com/codecov/feedback/issues ? |
Good idea, opened one (linked). |
24b1d63
to
81b9dc0
Compare
RE: the source of the local differences with the HTML formatter vs any of the other formatters -- I think this comes down to the
All of these have the full set of CI env vars enabled, hence the slightly higher numbers than typical local runs. This is not super relevant to the codecov issues, but I wanted to get a good baseline to explain those differences so that I could be sure anything else that was changing was actually from the config, and not from formatter differences. |
RE: the seeming disconnect between codecov and local, I think this came down to a load order / config loading thing, and that the current codecov from this branch (after many force pushes with various things tried) does seem to match what I see locally - https://app.codecov.io/gh/mastodon/mastodon/pull/30284/commits - granted that is now HIGHER than the previous codecov number, rather than lower. I'm going to pull out a separate branch which will include both an experiment re: that coverage number, and also just a useful extraction from this PR to get in first before the flatware changes. Will link to that. |
Other random note - the presence of |
81b9dc0
to
e5e8744
Compare
Ok, this is rebased against main after the simplecov config merged. I believe at this point I understand final discrepancy in the codecov changes -- there are a bunch of open issues/PRs against simplecov (which, side note, appears barely maintained) around the interaction of parallel testing and re-adding branch coverage (as opposed to line coverage). The current change on codecov is there because we still have line coverage but have dropped branch coverage (the setting is still on, but the results are just not generated fully) and thus codecov shows an effective coverage bump vs In the previous simplecov config change commit I added a monkeypatch which prevents a nil error from being raised, but does NOT restore the branch coverage data correctly. As far I can tell, none of the open PRs there do this, and I'd have to dig much further than I really want to into simplecov to know where to start. Options here:
Any feedback or other ideas appreciated. |
e5e8744
to
f97f6b5
Compare
Opened #30596 to disable branch coverage (for now) with hopes to rebase/merge this after that. Short of that, I'm kind of out of ideas here and will prob abandon this path. |
c12da10
to
98cdff9
Compare
Rebased this after recent CI config changes ... though most of those were not narrowly relevant to the rspec-run portion of the CI setup. That said, speedup here still meaningful ... recent main branch runs are in the 6-7 min range (for just the rspec portion), this PR branch is ~3mins, sometimes a little faster. One possible followup here if we merge this - if we also cache the persistence file output between runs I suspect we can get this even more optimized (take previous run time data as input into how to allocate across parallel runs to minimize overall time). |
b3a5ca8
to
ccc33e2
Compare
A previous dying worker issue was fixed in latest release, so bumped back to latest stable here. |
I would be very happy to get this merged if it makes our tests go faster. We could always revert it if it causes unexpected issues. I dont have an opinion either about the coverage, we do not use those metrics much at the moment, and I think the important part is that they go up? @ClearlyClaire any opinion here? |
That's the motivation here. The CI feedback loop is super slow right now. The biggest issues are 1) the rspec suite run time, 2) the JS compilation step, 3) the slowness and lack of cachability on parts of the ruby/JS setups. This is attempting to chip away at the rspec portion of that.
I'd love to get it merged, and force a bunch of CI runs (could rebase ~50 PRs or something) and try to force out any issues that might exist. Would absolutely revert if it causes issues.
Yes - my impression is generally that I might be the only one in any way looking at these numbers. My current opinion is that the benefits of gaining parallelism across cores on CI runs is worth the (maybe temporary?) loss of branch coverage. |
ccc33e2
to
1c93128
Compare
Previous versions of this:
Original PR - #27663 - has the most history of performance improvements here, but seems stuck on it's codecov/gh-actions integration ... I'm guessing this may be a result of the codecov action having been version bumped while the PR was open, and am going to replace that PR with this one as a final attempt here.
At this point I do see the same coverage percentages locally on main and this branch, and as far as I can tell the codecov integration should be working.