
Periodic test failures in GitHub Actions CI and CodeCov #9873

Closed
jywarren opened this issue Jun 29, 2021 · 49 comments · Fixed by #9905
Labels: bug, high-priority

Comments

@jywarren (Member)

I'm seeing occasional errors that seem to self-resolve when we re-run the tests. I'm not sure of the cause. Here's one log I found:

2021-06-29 20:46:08 +0000 Rack app ("GET /barnstar/givestar=basic&nid=1" - (127.0.0.1)): Mysql2::Error
=[Screenshot]: tmp/screenshots/failures_test_awarding_barnstar_functions_correctly.png
ERROR PostTest#test_awarding_barnstar_functions_correctly (873.23s)
Minitest::UnexpectedError:         Mysql2::Error: 
            app/models/tag.rb:430:in `update_tags_activity'
            app/models/node.rb:678:in `tag_activity'
            app/models/node.rb:671:in `add_comment'
            app/controllers/tag_controller.rb:284:in `barnstar'

These can sometimes happen due to timing issues -- because tests run all at once, with no time passing in between. We can sometimes resolve them by adding an order clause to the SQL, or by using Timecop to manipulate the time during tests. Let's collect more data to see which tests are inconsistent (paste them here before restarting them, please) and go from there. cc @publiclab/soc
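For reference, the Timecop approach might look roughly like this (a sketch, assuming the timecop gem in our test group; the tag names here are made up):

require 'timecop'

# Pin the clock, then advance it between records, so queries that sort
# or compare by timestamp see a real ordering instead of identical values.
Timecop.freeze do
  first = Tag.create!(name: 'sample-tag')          # created at time T
  Timecop.travel(Time.zone.now + 1.minute) do
    second = Tag.create!(name: 'sample-tag-two')   # created at T + 1 minute
  end
end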

@jywarren added the bug label Jun 29, 2021
@jywarren (Member Author)

Confirmed the exact same bug blocked #9848 - and is blocking lots of dependabot PRs too, so that's a good place to look for this happening!

@jywarren (Member Author)

#9864 as well! So, what is the actual error?

It seems related to this line from #9165 by @RuthNjeri --

Tag.where(tid: tids).update_all(activity_timestamp: DateTime.now, latest_activity_nid: activity_id)

No worries, Ruth, it seems super obscure. But we're doing an update_all -- could it be blocked by some other database lock? We are running tests in parallel, which kind of simulates how this could run in production with many users at once, you know?

Could we do this differently?
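One way that comes to mind (just a sketch, untested): update row by row, letting each row take its own lock with ActiveRecord's with_lock:

# Trades one bulk UPDATE for N single-row ones, but each row is locked,
# updated, and released inside its own small transaction.
Tag.where(tid: tids).find_each do |tag|
  tag.with_lock do
    tag.update(activity_timestamp: DateTime.now, latest_activity_nid: activity_id)
  end
end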

@daemon1024 (Member)

I wonder if introducing pessimistic locking will help us prevent this race condition.

https://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html

@jywarren (Member Author)

jywarren commented Jul 2, 2021

Oh, interesting! So how would that look? Like this?

Tag.transaction do
  Tag.lock
     .where(tid: tids)
     .update_all(activity_timestamp: DateTime.now, latest_activity_nid: activity_id)
end

@RuthNjeri @cesswairimu , what do you think? I've never done this before! 😅

@jywarren (Member Author)

jywarren commented Jul 3, 2021

@icarito, would it be OK to lock some tag records in this way while the rows are being updated together? Or would it have some adverse performance impact?

I'm wondering if:

  1. we should only lock in test mode, to avoid a performance issue in production, or
  2. we should just lock in general because we may have this kind of MySQL error happening in production?

We should be able to check for such errors in Sentry. I don't think I've been seeing any MySQL-related errors though, so I think this should be OK to leave as-is in production mode.
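For option 1, gating the lock on the environment could look roughly like this (just to illustrate -- a sketch, not necessarily what we'd ship):

Tag.transaction do
  # In test mode only: SELECT ... FOR UPDATE holds row locks on these tags until commit
  Tag.where(tid: tids).lock.to_a if Rails.env.test?
  Tag.where(tid: tids).update_all(activity_timestamp: DateTime.now, latest_activity_nid: activity_id)
end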

@jywarren (Member Author)

jywarren commented Jul 3, 2021

For reference, I think we're talking about locking about 6 rows in the term_data table (the tags table) while we update those 6 rows. So not a huge issue, and only when tags are added to a node:

See https://github.com/publiclab/plots2/pull/9165/files#diff-2845087add773fb20ddff83e51ab1ef70788ad7e522c7363e03458c6adab93b3R11

@jywarren (Member Author)

jywarren commented Jul 3, 2021

I've attempted the change irrespective of mode (so for both production and testing) in #9881 so let's see what happens there.

@cesswairimu (Collaborator)

> Oh, interesting! So how would that look? Like this?
>
> Tag.transaction do
>   Tag.lock
>      .where(tid: tids)
>      .update_all(activity_timestamp: DateTime.now, latest_activity_nid: activity_id)
> end
>
> @RuthNjeri @cesswairimu , what do you think? I've never done this before!

This is looking good @jywarren 🎉

@RuthNjeri (Contributor)

RuthNjeri commented Jul 5, 2021

Hi @jywarren and everyone 👋🏾
I'm having a look at this now -- locking the database transaction is interesting... I had thought that, since MySQL is relational, it would lock rows within a transaction on its own, per the ACID properties...

If the tag locking does not work, we could look into mocking this for the tests...
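A sketch of the mocking idea with Minitest's stub (this assumes update_tags_activity is a class method on Tag, per the stack trace above, and reuses the path from the failing test):

require 'minitest/mock'

# Swap the real UPDATE for a no-op within this block, so parallel test
# workers can't contend for the same term_data rows.
Tag.stub :update_tags_activity, true do
  post '/barnstar/givestar=basic&nid=1'
end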

jywarren added a commit that referenced this issue Jul 6, 2021
@jywarren (Member Author)

jywarren commented Jul 6, 2021

OK, we merged #9881 to try to address this. Let's keep watching -- I suspect (and @Manasa2850 noted something related to this too) that there's another issue sometimes causing test failures. We can close this if things have stabilized in a week or two!

I did notice that CodeCov failed twice for @17sushmita, in #9885 and #9879, for no apparent reason, misreading test coverage changes. So let's watch that too, and if you see that behavior please mention it and link to it here! Thanks, all!

@jywarren (Member Author)

jywarren commented Jul 6, 2021

Lots of CodeCov failures with Dependabot: #9887

@jywarren (Member Author)

jywarren commented Jul 6, 2021

See here for an example where I asked Dependabot to rebase: #9886 (comment)

17sushmita pushed a commit to 17sushmita/plots2 that referenced this issue Jul 7, 2021
@Tlazypanda (Collaborator)

So I might be shooting in the dark, but I found this issue, codecov/codecov-action#330, and this link, https://stackoverflow.com/questions/67861379/codecov-fails-in-github-actions, which suggests that moving to 1.5.2 should solve our issue. In our dependabot.yml file, the codecov dependency is under the ignore category -- so could it be that we aren't updating our codecov, and that's why we're facing this? Not sure if it gets updated in another way. As I said, throwing darts :p @jywarren @cesswairimu

@cesswairimu (Collaborator)

cesswairimu commented Jul 10, 2021

oh really nice!! @Tlazypanda checking this out. Thanks

@cesswairimu (Collaborator)

@jywarren I checked and couldn't find anywhere we could specify the codecov version. Any ideas? thanks

@jywarren (Member Author)

jywarren commented Jul 13, 2021

Great digging here, @Tlazypanda @cesswairimu !!!

Digging into when that ignore got added...

#6290 relevant... also #9756 cited the same CodeCov issue!

#9552 is maybe where CodeCov was last updated, on April 23, 2021:

plots2/Gemfile.lock, lines 89-90 at 6832a43:

    codecov (0.5.2)
      simplecov (>= 0.15, < 0.22)

This shows v0.5.2, so we are already on that version...

https://github.com/publiclab/plots2/pull/9583/files is where the newer Dependabot added the ignore?

So wait -- the ignore only ignores that specific version, not later ones. It's generated when we close a Dependabot PR for a specific version; it "remembers" that we don't want to update to that particular version. But we got past it, so we're now on v0.5.2:

  - dependency-name: codecov
    versions:
      - 0.4.2
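For contrast, a blanket ignore -- one that actually would block later versions -- would need a version range instead of an exact version. A sketch, assuming standard dependabot.yml ignore syntax (we do not have this):

ignore:
  - dependency-name: codecov
    versions:
      - ">= 0.4.2"   # a range: would skip every version from 0.4.2 onward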

@jywarren (Member Author)

Looking at a recent fail here for exact reference: #9894

@jywarren (Member Author)

So there seems to be some error for #9894 in CodeCov:

[screenshot of the CodeCov error]

@jywarren (Member Author)

I'm going to do 2 things to see if we get past it:

  1. read and implement https://docs.codecov.com/docs/error-reference#section-missing-base-commit to get past that issue
  2. get a new token, as codecov/codecov-action#330 recommends

@jywarren (Member Author)

Also noting some people are using a specific GitHub Actions step like this:

    - name: Upload to codecov
      uses: codecov/codecov-action@v1
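Combined with the new-token idea above, that step might look like this (a sketch; the secret name CODECOV_TOKEN is an assumption about our repo settings):

    - name: Upload to codecov
      uses: codecov/codecov-action@v1
      with:
        token: ${{ secrets.CODECOV_TOKEN }}  # repository secret
        fail_ci_if_error: true  # fail loudly on upload errors instead of passing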

@jywarren (Member Author)

Oh hmm, I see this message:

New! Install Codecov's GitHub App to improve application link. Comment posting will be on behalf of Codecov not a bot user. Install now

Sure! I'll try that!

@jywarren (Member Author)

OK, we had a slight hiccup where the merge-pr action was in the wrong folder. Moved it and adjusted in #9910

@jywarren (Member Author)

OK, I'm hopeful this will work: https://github.com/publiclab/plots2/actions/runs/1038211403

🤞

@jywarren (Member Author)

OMG, OK, this took a while. I think I got it. We now have this, which triggers after a merge to main:

https://github.com/publiclab/plots2/runs/3088995766?check_suite_focus=true

That is then used as a trigger for another run of the regular tests workflow, which, strangely, ran twice here:

https://github.com/publiclab/plots2/actions/runs/1038480174

https://github.com/publiclab/plots2/actions/runs/1038480867

One says "jywarren requested" and the other "jywarren completed", so maybe we should trigger it on a specific event:

OK done: #9916
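For reference, filtering the trigger to one event type looks roughly like this (a sketch; the upstream workflow name is an assumption):

# .github/workflows/tests.yml
on:
  workflow_run:
    workflows: ["merge-pr"]   # name of the triggering workflow (assumed)
    types: [completed]        # fire only on 'completed', not 'requested'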

@jywarren (Member Author)

actions/runner#950

@jywarren (Member Author)

Hmm, it's still running 2x. But the sequence is now:

  1. https://github.com/publiclab/plots2/actions/runs/1038489457 - the original PR run
  2. https://github.com/publiclab/plots2/actions/runs/1038493942 - triggers one more run after the PR is merged
  3. https://github.com/publiclab/plots2/actions/runs/1038493985 and https://github.com/publiclab/plots2/actions/runs/1038494629 - the two runs that result

This is good enough for now!

@jywarren (Member Author)

Now I'll open a new PR, and we'll see what it compares against!

@jywarren (Member Author)

I'm also hoping we'll see "Missing base report" on this page go away:

https://app.codecov.io/gh/publiclab/plots2/pulls?page=1&state=open&order=-pullid

[screenshot of the open pulls list]

@jywarren (Member Author)

Ooh! I think that's it!

[screenshot]

@jywarren (Member Author)

And #9909 | https://app.codecov.io/gh/publiclab/plots2/compare/9909 is now not showing "Missing base report" anymore! 🎉

@jywarren (Member Author)

OK -- the good news is that the comparison is against the correct base commit: "compared to 3d3de3f".

However, the check itself isn't passing, and is reporting 59.59% (-22.56%), which is not right; CodeCov itself reports 74.07% here: https://app.codecov.io/gh/publiclab/plots2/compare/9909/. I think the 59.59% is from one commit behind, i.e., from before it was really ready to report back. The "re-run" link doesn't work.

jywarren added a commit that referenced this issue Jul 16, 2021
@jywarren changed the title from "Periodic test failures in GitHub Actions CI" to "Periodic test failures in GitHub Actions CI and CodeCov" Jul 16, 2021
@jywarren (Member Author)

After looking through the documentation, I finally just reported the discrepancy to CodeCov in their forums: https://community.codecov.com/t/github-status-check-reported-too-early/3043

https://docs.codecov.com/docs/github-checks

https://docs.codecov.com/docs/commit-status

@cesswairimu (Collaborator)

thanks Jeff 🚀 🚀

jywarren added a commit that referenced this issue Jul 20, 2021
* add wait_for_ajax for barnstar errors (2nd attempt)

copy of #9909 for codeCov debugging cc #9873

* notify:         after_n_builds: 5
reginaalyssa pushed a commit to reginaalyssa/plots2 that referenced this issue Oct 16, 2021
billymoroney1 pushed a commit to billymoroney1/plots2 that referenced this issue Dec 28, 2021