
Troubleshoot flaky cleanup in workbench rest tests #778

Merged: 1 commit merged into kiegroup:master on Aug 16, 2018

Conversation

@jhrcek commented Jul 9, 2018

Hello @Rikkola,
please check this. This troubleshoots the instabilities in the REST tests discussed in email. The @AfterClass cleanup has been failing very often recently with errors like this:

org.kie.wb.test.rest.client.NotSuccessException: APPROVED: null
	at org.kie.wb.test.rest.client.RestWorkbenchClient.waitUntilJobFinished(RestWorkbenchClient.java:333)
	at org.kie.wb.test.rest.client.RestWorkbenchClient.waitUntilJobFinished(RestWorkbenchClient.java:303)
	at org.kie.wb.test.rest.client.RestWorkbenchClient.deleteSpace(RestWorkbenchClient.java:255)
	at org.kie.wb.test.rest.RestTestBase.deleteAllProject(RestTestBase.java:120)
	at org.kie.wb.test.rest.RestTestBase.cleanUp(RestTestBase.java:73)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
  1. Removed deleteAllProject because it was doing the same thing as deleteAllSpaces, which is called in the same places.
  2. Added debug logging to find out why the @AfterClass cleanup is sometimes failing. By default there is a 10 second timeout in the test client. The debug logs are there to confirm that we are exceeding that 10s timeout when it fails, in which case we might tweak the timeout / discuss whether it is acceptable for space deletion to take that long. A sketch of the added logging is below.
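
For illustration, the added timing/debug logging is roughly shaped like the sketch below (an approximation only: Space and WorkbenchClient are hypothetical stand-ins for the real rest test client API, not the actual RestTestBase code):

    import java.util.Collection;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch only: Space and WorkbenchClient stand in for the real rest test client types.
    class DeleteAllSpacesSketch {

        interface Space { String getName(); }

        interface WorkbenchClient {
            Collection<Space> getSpaces();
            void deleteSpace(String name); // blocks until the server-side delete job finishes or the client times out
        }

        private static final Logger logger = LoggerFactory.getLogger(DeleteAllSpacesSketch.class);

        static void deleteAllSpaces(WorkbenchClient client) {
            Collection<Space> spaces = client.getSpaces();
            logger.info("Deleting {} spaces", spaces.size());
            spaces.forEach(space -> {
                long start = System.currentTimeMillis();
                client.deleteSpace(space.getName());
                long took = System.currentTimeMillis() - start;
                logger.debug("Deleting space '{}' took {} ms", space.getName(), took);
            });
        }
    }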

@Rikkola (Member) commented Jul 9, 2018

+1

Yes, it looks like there was some redundancy that could be removed, and logging is a good idea. I'm not familiar enough with the code to name a clear suspect for the random failures without it.

@jhrcek (Author) commented Jul 9, 2018

@Rikkola so the tests failed and confirmed my suspicion: the rest cleanup is timing out because it sometimes takes more than 10s to delete a space. Please look at this filtered output of the console text and let me know what you think. Do you think that's acceptable? Most delete-space requests take about 1s to complete, but sometimes one takes more than 10s. See

output of grep RestTestBase consoleText
09:08:20.359 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 0 spaces
09:08:22.619 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 2 spaces
09:08:30.735 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpace' took 8116 ms
09:08:31.748 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'jobTestSpace' took 1012 ms
09:08:45.618 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:08:48.642 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'testSpace' took 3023 ms
09:10:03.150 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:10:14.208 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'projectTestSpace' took 11057 ms
	at org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0(RestTestBase.java:124)
	at org.kie.wb.test.rest.RestTestBase.deleteAllSpaces(RestTestBase.java:120)
	at org.kie.wb.test.rest.RestTestBase.cleanUp(RestTestBase.java:78)
09:10:22.660 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 4 spaces
09:10:23.668 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getExistingSpace' took 1007 ms
09:10:24.675 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'minimalSpace' took 1007 ms
09:10:25.682 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'oneOfManySpace' took 1006 ms
09:10:26.690 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'spaceWithGroupId' took 1007 ms
09:10:33.526 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 4 spaces
09:10:34.536 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'createSpaceWithrestProject' took 1010 ms
09:10:35.545 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'deleteSpaceWithnoRest' took 1008 ms
09:10:36.555 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithnoRest' took 1009 ms
09:10:37.563 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithrestProject' took 1008 ms
09:10:41.661 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 4 spaces
09:10:42.671 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'deleteSpaceWithnoRest' took 1010 ms
09:10:43.681 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'deleteSpaceWithrestProject' took 1009 ms
09:10:44.689 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithnoRest' took 1007 ms
09:10:45.697 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithrestProject' took 1007 ms
09:10:51.848 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:10:54.867 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'repositoryAccessTestSpace' took 3018 ms
09:11:53.603 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:12:04.651 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'projectAccessTestSpace' took 11048 ms
	at org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0(RestTestBase.java:124)
	at org.kie.wb.test.rest.RestTestBase.deleteAllSpaces(RestTestBase.java:120)
	at org.kie.wb.test.rest.RestTestBase.cleanUp(RestTestBase.java:78)
  ProjectIntegrationTest>RestTestBase.cleanUp:78->RestTestBase.deleteAllSpaces:120->RestTestBase.lambda$deleteAllSpaces$0:124 » NotSuccess
  ProjectAccessIntegrationTest>RestTestBase.cleanUp:78->RestTestBase.deleteAllSpaces:120->RestTestBase.lambda$deleteAllSpaces$0:124 » NotSuccess
  

@jhrcek (Author) commented Jul 9, 2018

jenkins retest this in the meantime to give us more data about this "sometimes it takes more than 10s to delete a space" issue.

@jhrcek changed the title from "Troubleshoot flaky cleanup in workbench rest tests" to "DO NOT MERGE: Troubleshoot flaky cleanup in workbench rest tests" on Jul 10, 2018
@jhrcek (Author) commented Jul 10, 2018

jenkins retest this

@jhrcek (Author) commented Jul 10, 2018

So now we know that @AfterClass is failing regularly because the requests to delete the spaces one by one are timing out after 10 seconds. Just for research I increased the timeout to 30 seconds and added a test (not to be merged) that creates and deletes a lot of spaces, just to get more duration / timeout data; a rough sketch of that test is below.
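
The throwaway test is roughly shaped like this (a sketch only, not the actual test class added in this PR; WorkbenchClient and its methods are hypothetical stand-ins for the rest test client, and the per-space project creation is omitted):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: WorkbenchClient and its methods are hypothetical stand-ins for the rest test client.
    class SpaceDeleteDurationSketch {

        interface WorkbenchClient {
            void createSpace(String name, String owningGroup);
            void deleteSpace(String name); // blocks until the server-side delete job finishes or times out
        }

        // Creates and deletes spaces in a loop, collecting one delete duration (in ms) per cycle.
        static List<Long> measureDeleteDurations(WorkbenchClient client, int cycles) {
            List<Long> durations = new ArrayList<>();
            for (int i = 0; i < cycles; i++) {
                String name = "stressSpace" + i;
                client.createSpace(name, "com.example");
                long start = System.currentTimeMillis();
                client.deleteSpace(name);
                durations.add(System.currentTimeMillis() - start);
            }
            return durations;
        }
    }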

@Rikkola (Member) commented Jul 10, 2018

@jhrcek Are the spaces that take longer to delete bigger than the others, or does it just randomly pick one that takes longer?

Anyway, there might be a need to check that the task has finished before deleting the next one. So the tests might need some improving.

@Rikkola (Member) commented Jul 10, 2018

And another thing: is this picking up an actual problem? I could imagine this test setup being a simple use case of how the REST API is used, while the real use cases are heavier and would hit this issue more often.

@jhrcek (Author) commented Jul 10, 2018

@Rikkola I added a test to the PR which repeatedly creates a space with one project and then deletes it. Here are the data from the PR builder job plotted as a histogram (it did 240 cycles of space+project creation/deletion before the build was aborted).
The X axis shows the duration (in milliseconds) of the space deletion and the Y axis shows how many measurements took that long to complete. It seems that it just sometimes randomly takes a bit longer.

(histogram screenshot from 2018-07-10 12-02-13)

You can also take a look at the raw data:
1830
199
15557
875
1851
6048
1011
1014
31136
3019
1006
2010
1007
3015
1006
10041
12051
3014
1006
4018
2014
3013
2010
1006
7049
14055
5021
3013
3013
1006
8035
3017
1010
2010
1005
4016
7028
1006
2009
8030
3013
9035
1031
1005
1005
1007
14050
2009
3012
5019
4015
2009
1005
4015
5019
7027
5019
1006
6023
3012
2008
3011
6022
1005
1006
13046
4014
2008
6022
6021
1005
1005
2008
4015
2008
7029
3011
1004
6020
1004
4014
4014
2008
2008
1005
1006
1006
1005
2009
6021
9028
9028
2008
1004
2007
2008
2007
1004
5017
1004
2008
7024
3010
1004
5016
1005
6021
1004
2007
1004
2007
4013
2007
2007
1005
12039
3009
6018
8023
1004
1005
2007
1004
1004
1004
2007
3011
9028
1004
1004
1003
3009
2007
1004
1004
1004
1004
1004
1004
3009
1004
2007
2007
2006
1004
5015
2006
1004
5014
2007
1003
1004
1004
1004
3009
2006
1004
1004
3009
1004
2008
5014
2007
5014
1003
2006
1004
2006
3009
3010
6017
1004
1004
1004
1003
2006
1003
1004
4011
1003
1004
2006
4012
4012
6016
12032
4013
4011
2006
3008
1003
1003
1003
1003
2006
1003
7019
3009
2007
3010
1003
20053
4011
1003
6016
23060
1003
1003
3009
1003
2006
2006
3008
2006
2006
1003
1003
1003
6018
6019
11029
5013
5013
2006
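
The histogram above is essentially these raw numbers binned into 1-second buckets; a tiny standalone helper like the hypothetical one below (not part of the test suite) reproduces that binning from a file containing one duration per line:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical helper: bins durations (ms), one per input line, into 1-second buckets.
    class DurationHistogram {
        public static void main(String[] args) throws Exception {
            Map<Long, Integer> buckets = new TreeMap<>();
            for (String line : Files.readAllLines(Paths.get("durations.txt"))) {
                buckets.merge(Long.parseLong(line.trim()) / 1000, 1, Integer::sum);
            }
            buckets.forEach((seconds, count) -> System.out.println(seconds + "s bucket: " + count));
        }
    }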
  

> Anyway, there might be a need to check that the task has finished before deleting the next one. So the tests might need some improving.

The logic for waiting is already there; currently the timeout for most calls is set to 10 seconds. We COULD increase the timeout to 15 seconds, but even then it would occasionally fail. I think this needs some investigation, but I'm not sure where to start / where to look. (The waiting logic is roughly sketched below.)
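
For context, the waiting has roughly this shape (a simplified sketch; the real RestWorkbenchClient.waitUntilJobFinished throws NotSuccessException and differs in detail):

    import java.util.concurrent.TimeUnit;

    // Simplified sketch of wait-for-job-with-timeout; status values mirror the ones visible in the
    // failures ("APPROVED" = still queued/running). Not the real RestWorkbenchClient implementation.
    class JobPollingSketch {

        enum JobStatus { ACCEPTED, APPROVED, SUCCESS, FAIL }

        interface JobEndpoint {
            JobStatus getJobStatus(String jobId); // hypothetical status lookup via REST
        }

        static void waitUntilJobFinished(JobEndpoint endpoint, String jobId, int timeoutSeconds)
                throws InterruptedException {
            JobStatus last = null;
            for (int elapsed = 0; elapsed < timeoutSeconds; elapsed++) {
                last = endpoint.getJobStatus(jobId);
                if (last == JobStatus.SUCCESS) {
                    return;                    // job finished within the timeout
                }
                TimeUnit.SECONDS.sleep(1);     // poll roughly once per second
            }
            // The last non-success status (e.g. APPROVED) ends up in the failure message,
            // which is why the cleanup errors read "APPROVED: null".
            throw new IllegalStateException("Job " + jobId + " still " + last
                    + " after " + timeoutSeconds + " s");
        }
    }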

@jhrcek (Author) commented Jul 10, 2018

A couple more data points to get a wider picture:
Your PR that resurrected these tests (#693) was merged on 26th February.
And here is the spread of failure dates of these setup methods - the pattern seems to indicate that the failures started around mid March.

(screenshot: restfailurespattern - spread of failure dates of the setup methods)

Just thinking out loud here: could this have something to do with indexing changes that were done by Max B around that time? https://github.com/kiegroup/appformer/commits?author=mbarkley

@Rikkola (Member) commented Jul 10, 2018

Based on all those Indexing tests I'm tempted to think so, but of course we would need some real data about the issue.

@ederign You might be the one who knows most about deleting Spaces. Can you think of anything that could cause it to behave like this? Sometimes it is quite fast and then suddenly a lot slower.

@jhrcek (Author) commented Jul 11, 2018

jenkins retest this

@jhrcek (Author) commented Jul 11, 2018

[ERROR] Failed to execute goal on project kie-tomcat-integration: Could not resolve dependencies for project org.kie:kie-tomcat-integration:jar:7.9.0-SNAPSHOT: Could not find artifact org.apache.tomcat:catalina:jar:9.0.10 in jboss-public-repository-group (https://repository.jboss.org/nexus/content/groups/public/) -> [Help 1]

This has been fixed upstream, trying again.
jenkins retest this

@jhrcek changed the title from "DO NOT MERGE: Troubleshoot flaky cleanup in workbench rest tests" to "Troubleshoot flaky cleanup in workbench rest tests" on Jul 12, 2018
@jhrcek (Author) commented Jul 12, 2018

@Rikkola After the tests pass I consider this ready to merge.
What was done:

  • created AF-1310, which will need to be prioritized and investigated
  • bumped the workbench rest client timeout in tests from 10 to 30 seconds to reduce the frequency of random failures
  • added an @Ignore-d test that clearly demonstrates this issue

@jhrcek (Author) commented Jul 13, 2018

I see that even setting the 30s rest client timeout didn't help, and now there's also a failure in the selenium tests (the list of example projects is not loaded). This suggests that this issue should be investigated sooner rather than later!
jenkins retest this

@kie-qe-ci commented:

Can one of the admins verify this patch?

@jhrcek (Author) commented Jul 23, 2018

jenkins retest this

@Rikkola (Member) commented Jul 23, 2018

Can't add myself as the reviewer, but giving this +1.

@jhrcek (Author) commented Jul 23, 2018

Got an inexplicable failure in the selenium tests, which can't be influenced by the changes in this PR.
jenkins retest this

@kie-ci commented Jul 23, 2018

Build finished. No test results found.

@jhrcek (Author) commented Jul 23, 2018

00:02:08.129 ERROR: Error cloning remote repo 'origin'
00:02:08.130 hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/kiegroup/kie-wb-distributions.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:

jenkins retest this

@jhrcek (Author) commented Jul 24, 2018

jenkins retest this

@triceo requested a review from Rikkola on July 24, 2018 12:39
@jhrcek (Author) commented Jul 24, 2018

jenkins retest this

@jhrcek changed the title from "Troubleshoot flaky cleanup in workbench rest tests" to "[DO NOT MERGE] Troubleshoot flaky cleanup in workbench rest tests" on Jul 25, 2018
@manstis (Member) commented Jul 25, 2018

Jenkins please retest this.

@jhrcek (Author) commented Jul 26, 2018

Getting only one confusing failure in the selenium tests:
LoginIntegrationTest is failing because it sees this after opening the app in the browser. This PR should not have any influence on the UI, it's just rest test stabilization.

jenkins retest this

@Rikkola (Member) commented Jul 26, 2018

Yep. REST is not used internally, so login or anything else should not be affected.

@jhrcek (Author) commented Jul 27, 2018

I'm looking into this more deeply. Now that I know how to connect to the Jenkins slave while the tests are running, I'll try to investigate why it's sometimes so slow.
jenkins retest this

@jhrcek (Author) commented Jul 27, 2018

I connected to the Jenkins slave via ssh while the tests were running and repeatedly took stack trace dumps via jstack while the long "delete space" calls were in progress. The following stack trace elements were present in most of these dumps:

	at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:327)
	at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:472)

It seems quite plausible that this is the root cause of the slowness. Not sure what to do about it though.
@Rikkola @adrielparedes any ideas?

Also @adrielparedes there are tons of indexing related exceptions that I mentioned about 2 weeks ago. Did you get a chance to look at those?

[INFO] [talledLocalContainer] 03:38:32,938 ERROR [org.uberfire.ext.metadata.io.IndexerDispatcher] (Thread-100) Indexing error for indexer [id=org.kie.workbench.common.screens.datamodeller.backend.server.indexing.JavaFileIndexer]: java.lang.NullPointerException
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.backend.lucene.provider.LuceneIndexProvider.delete(LuceneIndexProvider.java:108)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.doDelete(MetadataIndexEngine.java:195)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.doAction(MetadataIndexEngine.java:134)
[INFO] [talledLocalContainer] 	at java.util.ArrayList.forEach(ArrayList.java:1257)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.doCommit(MetadataIndexEngine.java:231)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.commit(MetadataIndexEngine.java:220)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.IndexerDispatcher$IndexerJob.get(IndexerDispatcher.java:189)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.IndexerDispatcher$IndexerJob.get(IndexerDispatcher.java:159)
[INFO] [talledLocalContainer] 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
[INFO] [talledLocalContainer] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[INFO] [talledLocalContainer] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[INFO] [talledLocalContainer] 	at java.lang.Thread.run(Thread.java:748)
[INFO] [talledLocalContainer] 

For reference, to reproduce this locally: check out this PR and run

cd kie-wb-distributions/kie-wb-tests/kie-wb-tests-rest
mvn clean verify -Pkie-wb,wildfly11 -Dit.test=TroubleshootRestSetupIssueIntegrationTest

And you'll see tons of these errors.

@jhrcek (Author) commented Aug 1, 2018

jenkins retest this

@jhrcek (Author) commented Aug 2, 2018

jenkins retest this

@jhrcek (Author) commented Aug 2, 2018

stderr: fatal: unable to access 'https://github.com/kiegroup/kie-wb-distributions.git/': Could not resolve host: github.com; Unknown error

jenkins retest this

@jhrcek (Author) commented Aug 2, 2018

Created a companion PR trying to fix an NPE that may be related to this.
jenkins retest this

@kie-qe-ci commented:

Can one of the admins verify this PR? Comment with 'ok to test' to start the build.

@jhrcek (Author) commented Aug 9, 2018

Progress update: I closed the companion PR because it doesn't help the issue. The NPEs are just a side effect of a general problem that these tests make visible: when too many operations that manipulate multiple git filesystems are executed in a short time, there are a lot of indexing operations running in parallel, and sometimes the spaces / projects are deleted before the indexing finishes. I need to rerun these tests to evaluate the scope of this issue (how much the creating / deleting of spaces and the indexing of assets overlap).
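
To illustrate the shape of that race with a self-contained example (a generic sketch, not the actual appformer indexing code):

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    // Generic illustration of the race: an asynchronously dispatched indexing job looks up the
    // index for a space/project that a concurrent delete may already have removed. Whether the
    // NullPointerException happens depends purely on timing, hence the flakiness.
    class IndexingRaceSketch {

        static final Map<String, StringBuilder> indexes = new ConcurrentHashMap<>();

        public static void main(String[] args) {
            indexes.put("mySpace/myProject", new StringBuilder());

            // Indexing runs asynchronously, similar in spirit to IndexerDispatcher's CompletableFuture jobs.
            CompletableFuture<Void> indexingJob = CompletableFuture.runAsync(
                    () -> indexes.get("mySpace/myProject").append("indexed asset"));

            indexes.remove("mySpace/myProject"); // the space delete may win the race and drop the index

            indexingJob.join(); // if the delete won, this surfaces the NPE (wrapped in a CompletionException)
        }
    }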

@jhrcek (Author) commented Aug 10, 2018

Here is the summary of my findings:

  • Probable cause: disk IO (niogit & indexing operations) is comparatively slower on the jenkins slaves.
  • Reported 2 issues in indexing implementation:
    • Redundant indexing being done (reported in RHPAM-1455)
    • NPEs being thrown from LuceneIndexProvider (reported in RHPAM-1459)
  • I'd also like to disable the incremental build to make the tests less "IO intensive" (by supplying -Dbuild.enable-incremental=false when starting the wildfly that runs the workbench - see AppSetup), but this is currently impossible due to regression AF-1408.

I think it is a good idea to prioritize investigating and fixing these indexing issues before we merge multiple branch support (CC @paulovmr) and infinispan indexing (CC @adrielparedes), which will introduce more logic around indexing!

@jhrcek (Author) commented Aug 15, 2018

Testing stability of changes: 1 green
jenkins retest this

@jhrcek (Author) commented Aug 15, 2018

Testing stability of changes: 2 greens
jenkins retest this

@jhrcek (Author) commented Aug 16, 2018

Testing stability of changes: 3 greens
jenkins retest this

@jhrcek (Author) commented Aug 16, 2018

Testing stability of changes: 4 greens
jenkins retest this

@jhrcek (Author) commented Aug 16, 2018

@manstis I've got 5 greens in a row. I believe the rest tests are now stable enough and this can be merged.

Summary of changes made:

  • Increased the rest client timeout (just in the tests) to 60 seconds. Most of the time the operations are fast (~1 s), but occasionally it takes up to 30 seconds for a space deletion to finish. The duration of rest calls is logged so we can check it later.
  • Filed a couple of jiras to investigate issues around indexing (see above).
  • Improved the exception message for when the rest client timeout occurs in tests, so it's clear why a test failed.
  • Un-@Ignore-d the tests that were ignored here. With the changes in this PR they're passing.

@Rikkola (Member) commented Aug 16, 2018

Good stuff, well done.

@manstis changed the title from "[DO NOT MERGE] Troubleshoot flaky cleanup in workbench rest tests" to "Troubleshoot flaky cleanup in workbench rest tests" on Aug 16, 2018
@manstis merged commit a95d1a9 into kiegroup:master on Aug 16, 2018