-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI became slow #78
Comments
also, I noticed that the CI became unreliable and it fails frequently due to random errors - hence, I need to restart the CI 2-3 times for it to pass |
@edublancas @bryannho It looks like the intermittent failures are due to tests randomly reaching the timeout. I can think of a couple short-term solutions to get the CI passing reliably without addressing the underlying issue. We could increase the timeout in
It's 60s right now. This might not actually work if the tests that timeout would never actually finish even with a longer timeout. TBD. We could also leave the timeout at 60s, or even decrease it, and then somehow mark some subset of the tests to ignore that error. Regarding the underlying issue of the CI being slow, we added a bunch of tests that have this property that they randomly take a while and time out, and we might have also contributed the code that caused the tests to have that property. I have not been able to reproduce it locally, I don't get the timeouts. It looks like other people have had this issue: microsoft/playwright#13090 To give an example of the unreliable behavior, the test that timed out in the CI linked above as "commit" (CI, code change) which is
is like 52s. So I assume that's part of the 60s that it's using. But in this CI from yesterday https://github.com/ploomber/jupysql-plugin/actions/runs/6383729916/job/17324954125 the test doesn't fail and the time delta between those lines
is about 5s. Also there are tests that timed out in that CI from yesterday where the tests were written a month ago, so apparently they didn't used to have this property where they randomly time out. Can someone more familiar with the code look at the commit and see if issues commented in microsoft/playwright#13090 are being introduced? |
Thanks Marshall, I can take a look although I am also somewhat new to this code base- @edublancas let us know if you have specific notes on how to proceed |
@bryannho @marshallwhiteorg please work together on this. Sadly, I don't have much to say here: I'm new to javascript/typescript and this repository was mostly created from a template, so we're just using whatever configuration we inherited from the template |
@marshallwhiteorg I was able to look into this commit - it does display the same behaviors detailed in microsoft/playwright#13090. Interestingly, the commit already employs some of the fixes that other users mention in that issue - yet it still runs into the timeout error. Other users experienced the same issue with no major solution. I will continue investigating, but for now, I think it's worth it to try increasing the timeout in No guarantee this will work consistently given we can't locate the reason playwright is hanging in the first place. This also may cause the tests to run even slower. In the long run, we may need to restructure all of the tests to optimize runtime and avoid the timeout issue. |
@bryannho Okay, I'll make a PR to increase the timeout for now |
@bryannho It still timed out with the 2m timeout, so will need a different approach https://github.com/ploomber/jupysql-plugin/actions/runs/6401180754/job/17378192452 |
@marshallwhiteorg Okay, so we'll have to get to the bottom of why the tests are hanging. I noticed that usually the timeout occurs at I've been looking into this test to see if it might be one of the causes. Will do some more googling and CI runs to see if we can garner any more information. |
In addition to this error I've ran into a new one:
I got here by commenting out |
@marshallwhiteorg @edublancas After debugging the CI on my own fork I have not been able to come up with a viable solution. Some steps I took:
Isolating the problem has been difficult. While debugging, I've found that when one test succeeds/is commented out, a completely different test fails or a new error arises. Overall the tests are flaky and unreliable. This leads me to believe there is some race condition that is causing the UI tests to fall out of sync/components to hang but I haven't been able to get deeper than that. My best guess is a resource issue with the CI or that our tests are structured in a way that encourages these race conditions (incorrect use of For now, without getting to the root of the issue, it might be helpful to configure the CI/playwright to re-try failed tests 2-3 times each. Perhaps someone with more familiarity/bandwidth could review the testing strategies we use in detail. But to best utilize time I'll move onto other issues for now. Here were the most helpful resources I found:
|
@bryannho sure, I can assign a few other things. @marshallwhiteorg what's the conclusion on this? |
@edublancas @bryannho I suggest we reduce the timeout as much as we can by measuring the max duration of the successful tests and then implement the retry mechanism. How does that sound? |
yeah sure go ahead |
this commit increased the CI runtime (see this previous commit as a comparison).
we need to check what happened. I believe the commit only added new tests, so it'd be surprising that just adding a few extra tests would increase the runtime so much, but I don't see any configuration changes.
if adding new tests slowed the CI so much, we need to find a way to optimize it.
The text was updated successfully, but these errors were encountered: