Troubleshoot flaky cleanup in workbench rest tests #778
Conversation
+1 Yes, looks like there was some redundancy that could be removed, and logging is a good idea. I'm not familiar enough with the code to name a clear suspect for the random failures without it.
@Rikkola So the tests failed and confirmed my suspicion: the REST cleanup is timing out, because it sometimes takes more than 10s to delete a space. Please look at this filtered output of the console text and let me know what you think. Do you think it's acceptable? Most delete-space requests take about 1s to complete, but sometimes it takes more than 10s. See output of
jenkins retest this in the meantime to give us more data about this "sometimes it takes more than 10s to delete a space" issue.
jenkins retest this
So now we know that
@jhrcek Are the spaces that take longer to delete bigger than the others, or does it just randomly pick one that takes longer? Anyway, there might be a need to check that the task has finished before deleting the next one, so the tests might need some improving.
And another thing: is this picking up an actual problem? I could imagine this test setup being a simple use case of how the REST API is used, while the real use cases are heavier and would hit this issue more often.
@Rikkola I added a test to the PR which repeatedly creates a space with one project and then deletes it. Here are the data from the PR builder job, entered into a histogram (it did 240 cycles of space+project creation and deletion before the build was aborted). You can also take a look at the raw data (delete times in milliseconds):

1830 199 15557 875 1851 6048 1011 1014 31136 3019 1006 2010 1007 3015 1006 10041 12051 3014 1006 4018 2014 3013 2010 1006 7049 14055 5021 3013 3013 1006 8035 3017 1010 2010 1005 4016 7028 1006 2009 8030 3013 9035 1031 1005 1005 1007 14050 2009 3012 5019 4015 2009 1005 4015 5019 7027 5019 1006 6023 3012 2008 3011 6022 1005 1006 13046 4014 2008 6022 6021 1005 1005 2008 4015 2008 7029 3011 1004 6020 1004 4014 4014 2008 2008 1005 1006 1006 1005 2009 6021 9028 9028 2008 1004 2007 2008 2007 1004 5017 1004 2008 7024 3010 1004 5016 1005 6021 1004 2007 1004 2007 4013 2007 2007 1005 12039 3009 6018 8023 1004 1005 2007 1004 1004 1004 2007 3011 9028 1004 1004 1003 3009 2007 1004 1004 1004 1004 1004 1004 3009 1004 2007 2007 2006 1004 5015 2006 1004 5014 2007 1003 1004 1004 1004 3009 2006 1004 1004 3009 1004 2008 5014 2007 5014 1003 2006 1004 2006 3009 3010 6017 1004 1004 1004 1003 2006 1003 1004 4011 1003 1004 2006 4012 4012 6016 12032 4013 4011 2006 3008 1003 1003 1003 1003 2006 1003 7019 3009 2007 3010 1003 20053 4011 1003 6016 23060 1003 1003 3009 1003 2006 2006 3008 2006 2006 1003 1003 1003 6018 6019 11029 5013 5013 2006
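For anyone who wants to regenerate the histogram from the raw data, here is a small hypothetical sketch (the 1-second bucket size and the class name are my choices, not anything stated in this thread):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Buckets raw delete durations (in milliseconds) into 1-second bins and
// prints a count per bin -- a quick way to reproduce the histogram view.
public class DeleteTimeHistogram {
    public static void main(String[] args) {
        int[] millis = {1830, 199, 15557, 875, 1851, 6048, 1011 /* ... paste the rest of the raw data here ... */};
        Map<Integer, Long> histogram = Arrays.stream(millis)
                .boxed()
                .collect(Collectors.groupingBy(ms -> ms / 1000, TreeMap::new, Collectors.counting()));
        histogram.forEach((bucket, count) ->
                System.out.printf("%2d-%2ds: %d%n", bucket, bucket + 1, count));
    }
}
```

Run over the full data set, this shows the long tail the comment describes: most deletes land in the 1-2s buckets, with outliers above 10, 20, and even 30 seconds.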
The logic for waiting is already there; currently the timeout for most calls is set to 10 seconds. We COULD increase the timeout to 15 seconds, but even then it would occasionally fail. I think this needs some investigation, but I'm not sure where to start or where to look.
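To make the trade-off concrete, here is a minimal sketch of a wait-with-timeout loop of the kind the tests rely on (the helper name, poll interval, and API are hypothetical stand-ins, not the actual test code):

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a wait-until-done helper with a configurable
// timeout, similar in spirit to what the REST tests do. The real test
// code differs; this only illustrates why raising the limit is fragile.
public final class WaitUtil {

    /** Polls 'condition' every 500 ms until it is true or 'timeoutMillis' elapses. */
    public static boolean waitFor(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(500);
        }
        return false; // timed out -- with a 10s limit, the slow deletes fail here
    }
}
```

With most deletes finishing in about 1s but outliers above 20s in the data above, no fixed timeout value is really safe, which is why investigating the root cause looks preferable to just raising the limit.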
A couple more data points to get a wider picture. Just thinking out loud here: could this have something to do with the indexing changes that were done by Max B around that time? https://github.com/kiegroup/appformer/commits?author=mbarkley
Based on all those indexing tests I'm tempted to think so, but of course we would need some real data about the issue. @ederign You might be the one who knows most about deleting Spaces. Can you think of anything that could cause it to behave like this? Sometimes it is quite fast and then suddenly a lot slower.
jenkins retest this
This has been fixed upstream, trying again.
@Rikkola After the tests pass I consider this ready to merge.
I see that even setting a 30s REST client timeout didn't help, and now there's also a failure in the Selenium tests (the list of example projects is not loaded). This suggests that this issue should be investigated sooner rather than later!
Can one of the admins verify this patch?
jenkins retest this
Can't add myself as the reviewer, but giving this +1.
Got an inexplicable failure in the Selenium tests which can't be influenced by the changes in this PR.
Build finished. No test results found.
jenkins retest this
jenkins retest this
jenkins retest this
Jenkins please retest this.
Getting only one confusing failure in the Selenium tests: jenkins retest this
Yep. REST is not used internally, so login or anything else should not be affected.
I'm looking into this more deeply. Now that I know how to connect to the Jenkins slave while the tests are running, I'll try to investigate why it's sometimes so slow.
I connected to the Jenkins slave via SSH while the tests were running and repeatedly took stack trace dumps via jstack while the long "delete space" calls were in progress. I found that the following stack trace elements were present in most of these cases:
It seems quite plausible that this is the root cause of the slowness. I'm not sure what to do about it, though. Also @adrielparedes, there are tons of indexing-related exceptions that I mentioned about 2 weeks ago. Did you have any chance to look at those?
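For anyone repeating the thread-dump sampling, here is a rough sketch of the approach (the PID argument, output directory, and sampling window are assumptions; jstack from the JDK must be on the PATH):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Repeatedly captures jstack dumps of a running JVM so that slow
// "delete space" calls can be inspected afterwards. The target PID is
// passed as the first argument; the output directory is an assumption.
public class JstackSampler {
    public static void main(String[] args) throws IOException, InterruptedException {
        String pid = args[0];                       // PID of the workbench JVM
        Path outDir = Paths.get("thread-dumps");    // assumed output directory
        Files.createDirectories(outDir);
        for (int i = 0; i < 30; i++) {              // ~30s of samples, one per second
            Path dump = outDir.resolve("dump-" + i + ".txt");
            new ProcessBuilder("jstack", pid)
                    .redirectOutput(dump.toFile())
                    .start()
                    .waitFor();
            Thread.sleep(1000);
        }
    }
}
```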
For reference, here is how to reproduce this locally: check out this PR and
And you'll see tons of these errors.
jenkins retest this
jenkins retest this
jenkins retest this
Created a companion PR trying to fix an NPE that is maybe related to this.
Can one of the admins verify this PR? Comment with 'ok to test' to start the build.
Progress update: I closed the companion PR because it doesn't help the issue. The NPEs are just a side effect of a general problem that these tests make visible: when too many operations that manipulate multiple git filesystems are executed in a short time, there are a lot of indexing operations running in parallel, and sometimes the spaces/projects are deleted before the indexing finishes. I need to rerun these tests to evaluate the scope of this issue (how much the creation/deletion of spaces and the indexing of assets overlap).
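To make the described race concrete, here is a toy sketch, with all names invented and no relation to the real indexing engine: if deletion does not wait for in-flight indexing, the indexer can end up touching an already-deleted filesystem, which is exactly the kind of situation that surfaces as NPEs.

```java
import java.util.concurrent.Phaser;

// Toy illustration of the race: indexing tasks register with a Phaser,
// and deleteSpace() waits until all in-flight indexing has arrived
// before removing the underlying filesystem. All names are hypothetical.
public class SpaceCleanup {

    private final Phaser inFlightIndexing = new Phaser(1); // one self-registered "controller" party

    public void onIndexingStarted() {
        inFlightIndexing.register();
    }

    public void onIndexingFinished() {
        inFlightIndexing.arriveAndDeregister();
    }

    public void deleteSpace(String space) {
        // Wait for all currently registered indexing tasks to finish;
        // without this barrier the delete can race ahead of the indexer.
        inFlightIndexing.arriveAndAwaitAdvance();
        System.out.println("deleting space " + space); // stand-in for the real delete
    }
}
```

The Phaser here only illustrates the missing barrier; any real fix would have to live in the indexing/filesystem lifecycle code itself.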
Here is the summary of my findings:
I think it is a good idea to prioritize investigating and fixing these indexing issues before we merge multiple-branch support (CC @paulovmr) and Infinispan indexing (CC @adrielparedes), which will introduce more logic around indexing!
Testing stability of changes: 1 green
Testing stability of changes: 2 greens
Testing stability of changes: 3 greens
Testing stability of changes: 4 greens
@manstis I've got 5 greens in a row. I believe the REST tests are now stable enough and this can be merged. Summary of changes made:
Good stuff, well done.
Hello @Rikkola,
please check this. Troubleshooting instabilities in the REST tests as discussed in email. The @AfterClass cleanup has been failing very often recently with similar errors. I removed deleteAllProject because it was doing the same thing as deleteAllSpaces, which is called at the same places.
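For illustration, the consolidated cleanup could look roughly like this (class and helper names are hypothetical; the real tests live in the workbench REST test suite):

```java
import org.junit.AfterClass;
import org.junit.Test;

// Hypothetical shape of the consolidated cleanup after the change: one
// @AfterClass hook that deletes all spaces; since deleting a space also
// removes its projects, a separate deleteAllProject pass is redundant.
public class WorkbenchRestIT {

    @AfterClass
    public static void cleanupSpaces() {
        // deleteAllSpaces() stands in for the real REST-backed helper;
        // removing every test space removes its test projects with it.
        deleteAllSpaces();
    }

    private static void deleteAllSpaces() {
        // Stub for illustration -- the real helper issues DELETE requests
        // against the workbench REST API for each remaining space.
    }

    @Test
    public void someRestScenario() {
        // ... actual test scenarios elided ...
    }
}
```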