
Troubleshoot flaky cleanup in workbench rest tests #778

Merged: 1 commit merged into kiegroup:master on Aug 16, 2018

Conversation

@jhrcek commented Jul 9, 2018

Hello @Rikkola,
please check this. This troubleshoots the instabilities in the REST tests discussed in email. The @AfterClass cleanup has been failing very often recently with errors like this:

org.kie.wb.test.rest.client.NotSuccessException: APPROVED: null
	at org.kie.wb.test.rest.client.RestWorkbenchClient.waitUntilJobFinished(RestWorkbenchClient.java:333)
	at org.kie.wb.test.rest.client.RestWorkbenchClient.waitUntilJobFinished(RestWorkbenchClient.java:303)
	at org.kie.wb.test.rest.client.RestWorkbenchClient.deleteSpace(RestWorkbenchClient.java:255)
	at org.kie.wb.test.rest.RestTestBase.deleteAllProject(RestTestBase.java:120)
	at org.kie.wb.test.rest.RestTestBase.cleanUp(RestTestBase.java:73)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
  1. Removed deleteAllProject because it was doing the same thing as deleteAllSpaces, which is called in the same places.
  2. Added debug logging to find out why the @AfterClass cleanup is sometimes failing. By default there is a 10 second timeout in the test client. The debug logs are there to confirm that we are exceeding that 10s timeout when it fails, in which case we might tweak the timeout / discuss whether it is acceptable for space deletion to take that long. A sketch of the added logging is below.
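
For illustration, the added timing/debug logging is roughly shaped like the sketch below (an approximation only: Space and WorkbenchClient are hypothetical stand-ins for the real rest test client API, not the actual RestTestBase code):

    import java.util.Collection;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch only: Space and WorkbenchClient stand in for the real rest test client types.
    class DeleteAllSpacesSketch {

        interface Space { String getName(); }

        interface WorkbenchClient {
            Collection<Space> getSpaces();
            void deleteSpace(String name); // blocks until the server-side delete job finishes or the client times out
        }

        private static final Logger logger = LoggerFactory.getLogger(DeleteAllSpacesSketch.class);

        static void deleteAllSpaces(WorkbenchClient client) {
            Collection<Space> spaces = client.getSpaces();
            logger.info("Deleting {} spaces", spaces.size());
            spaces.forEach(space -> {
                long start = System.currentTimeMillis();
                client.deleteSpace(space.getName());
                long took = System.currentTimeMillis() - start;
                logger.debug("Deleting space '{}' took {} ms", space.getName(), took);
            });
        }
    }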

@Rikkola (Member) commented Jul 9, 2018

+1

Yes, it looks like there was some redundancy that could be removed, and logging is a good idea. I'm not familiar enough with the code to name a clear suspect for the random failures without it.

@jhrcek (Author) commented Jul 9, 2018

@Rikkola so the tests failed and confirmed my suspicion: the rest cleanup is timing out because it sometimes takes more than 10s to delete a space. Please look at this filtered output of the console text and let me know what you think. Do you think that's acceptable? Most delete-space requests take about 1s to complete, but sometimes one takes more than 10s. See

output of grep RestTestBase consoleText
09:08:20.359 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 0 spaces
09:08:22.619 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 2 spaces
09:08:30.735 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpace' took 8116 ms
09:08:31.748 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'jobTestSpace' took 1012 ms
09:08:45.618 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:08:48.642 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'testSpace' took 3023 ms
09:10:03.150 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:10:14.208 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'projectTestSpace' took 11057 ms
	at org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0(RestTestBase.java:124)
	at org.kie.wb.test.rest.RestTestBase.deleteAllSpaces(RestTestBase.java:120)
	at org.kie.wb.test.rest.RestTestBase.cleanUp(RestTestBase.java:78)
09:10:22.660 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 4 spaces
09:10:23.668 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getExistingSpace' took 1007 ms
09:10:24.675 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'minimalSpace' took 1007 ms
09:10:25.682 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'oneOfManySpace' took 1006 ms
09:10:26.690 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'spaceWithGroupId' took 1007 ms
09:10:33.526 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 4 spaces
09:10:34.536 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'createSpaceWithrestProject' took 1010 ms
09:10:35.545 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'deleteSpaceWithnoRest' took 1008 ms
09:10:36.555 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithnoRest' took 1009 ms
09:10:37.563 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithrestProject' took 1008 ms
09:10:41.661 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 4 spaces
09:10:42.671 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'deleteSpaceWithnoRest' took 1010 ms
09:10:43.681 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'deleteSpaceWithrestProject' took 1009 ms
09:10:44.689 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithnoRest' took 1007 ms
09:10:45.697 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'getSpaceWithrestProject' took 1007 ms
09:10:51.848 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:10:54.867 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'repositoryAccessTestSpace' took 3018 ms
09:11:53.603 [main] INFO  org.kie.wb.test.rest.RestTestBase.deleteAllSpaces:119 - Deleting 1 spaces
09:12:04.651 [main] DEBUG org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0:127 - Deleting space 'projectAccessTestSpace' took 11048 ms
	at org.kie.wb.test.rest.RestTestBase.lambda$deleteAllSpaces$0(RestTestBase.java:124)
	at org.kie.wb.test.rest.RestTestBase.deleteAllSpaces(RestTestBase.java:120)
	at org.kie.wb.test.rest.RestTestBase.cleanUp(RestTestBase.java:78)
  ProjectIntegrationTest>RestTestBase.cleanUp:78->RestTestBase.deleteAllSpaces:120->RestTestBase.lambda$deleteAllSpaces$0:124 » NotSuccess
  ProjectAccessIntegrationTest>RestTestBase.cleanUp:78->RestTestBase.deleteAllSpaces:120->RestTestBase.lambda$deleteAllSpaces$0:124 » NotSuccess
  

@jhrcek (Author) commented Jul 9, 2018

jenkins retest this in the meantime to give us more data about this "sometimes it takes more than 10s to delete a space" issue.

@jhrcek changed the title from "Troubleshoot flaky cleanup in workbench rest tests" to "DO NOT MERGE: Troubleshoot flaky cleanup in workbench rest tests" on Jul 10, 2018
@jhrcek (Author) commented Jul 10, 2018

jenkins retest this

@jhrcek (Author) commented Jul 10, 2018

So now we know that @AfterClass is failing regularly because the requests to delete the spaces one by one are timing out after 10 seconds. Just for research I increased the timeout to 30 seconds and added a test (not to be merged) that creates and deletes a lot of spaces, just to get more duration / timeout data; a rough sketch of that test is below.
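
The throwaway test is roughly shaped like this (a sketch only, not the actual test class added in this PR; WorkbenchClient and its methods are hypothetical stand-ins for the rest test client, and the per-space project creation is omitted):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: WorkbenchClient and its methods are hypothetical stand-ins for the rest test client.
    class SpaceDeleteDurationSketch {

        interface WorkbenchClient {
            void createSpace(String name, String owningGroup);
            void deleteSpace(String name); // blocks until the server-side delete job finishes or times out
        }

        // Creates and deletes spaces in a loop, collecting one delete duration (in ms) per cycle.
        static List<Long> measureDeleteDurations(WorkbenchClient client, int cycles) {
            List<Long> durations = new ArrayList<>();
            for (int i = 0; i < cycles; i++) {
                String name = "stressSpace" + i;
                client.createSpace(name, "com.example");
                long start = System.currentTimeMillis();
                client.deleteSpace(name);
                durations.add(System.currentTimeMillis() - start);
            }
            return durations;
        }
    }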

@Rikkola (Member) commented Jul 10, 2018

@jhrcek Are the spaces that take longer to delete bigger than the others, or does it just randomly pick one that takes longer?

Anyway, there might be a need to check that the task has finished before deleting the next one. So the tests might need some improving.

@Rikkola (Member) commented Jul 10, 2018

And another thing: is this picking up an actual problem? I could imagine this test setup being a simple use case of how the REST API is used, while the real use cases are heavier and would hit this issue more often.

@jhrcek (Author) commented Jul 10, 2018

@Rikkola I added a test to the PR which repeatedly creates a space with one project and then deletes it. Here are the data from the PR builder job plotted as a histogram (it did 240 cycles of space+project creation/deletion before the build was aborted).
The X axis shows the duration (in milliseconds) of the space deletion and the Y axis shows how many measurements took that long to complete. It seems that it just sometimes randomly takes a bit longer.

(histogram screenshot from 2018-07-10 12-02-13)

You can also take a look at the raw data:
1830
199
15557
875
1851
6048
1011
1014
31136
3019
1006
2010
1007
3015
1006
10041
12051
3014
1006
4018
2014
3013
2010
1006
7049
14055
5021
3013
3013
1006
8035
3017
1010
2010
1005
4016
7028
1006
2009
8030
3013
9035
1031
1005
1005
1007
14050
2009
3012
5019
4015
2009
1005
4015
5019
7027
5019
1006
6023
3012
2008
3011
6022
1005
1006
13046
4014
2008
6022
6021
1005
1005
2008
4015
2008
7029
3011
1004
6020
1004
4014
4014
2008
2008
1005
1006
1006
1005
2009
6021
9028
9028
2008
1004
2007
2008
2007
1004
5017
1004
2008
7024
3010
1004
5016
1005
6021
1004
2007
1004
2007
4013
2007
2007
1005
12039
3009
6018
8023
1004
1005
2007
1004
1004
1004
2007
3011
9028
1004
1004
1003
3009
2007
1004
1004
1004
1004
1004
1004
3009
1004
2007
2007
2006
1004
5015
2006
1004
5014
2007
1003
1004
1004
1004
3009
2006
1004
1004
3009
1004
2008
5014
2007
5014
1003
2006
1004
2006
3009
3010
6017
1004
1004
1004
1003
2006
1003
1004
4011
1003
1004
2006
4012
4012
6016
12032
4013
4011
2006
3008
1003
1003
1003
1003
2006
1003
7019
3009
2007
3010
1003
20053
4011
1003
6016
23060
1003
1003
3009
1003
2006
2006
3008
2006
2006
1003
1003
1003
6018
6019
11029
5013
5013
2006
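
The histogram above is essentially these raw numbers binned into 1-second buckets; a tiny standalone helper like the hypothetical one below (not part of the test suite) reproduces that binning from a file containing one duration per line:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical helper: bins durations (ms), one per input line, into 1-second buckets.
    class DurationHistogram {
        public static void main(String[] args) throws Exception {
            Map<Long, Integer> buckets = new TreeMap<>();
            for (String line : Files.readAllLines(Paths.get("durations.txt"))) {
                buckets.merge(Long.parseLong(line.trim()) / 1000, 1, Integer::sum);
            }
            buckets.forEach((seconds, count) -> System.out.println(seconds + "s bucket: " + count));
        }
    }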
  

> Anyway, there might be a need to check that the task has finished before deleting the next one. So the tests might need some improving.

The logic for waiting is already there; currently the timeout for most calls is set to 10 seconds. We COULD increase the timeout to 15 seconds, but even then it would occasionally fail. I think this needs some investigation, but I'm not sure where to start / where to look. (The waiting logic is roughly sketched below.)
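
For context, the waiting has roughly this shape (a simplified sketch; the real RestWorkbenchClient.waitUntilJobFinished throws NotSuccessException and differs in detail):

    import java.util.concurrent.TimeUnit;

    // Simplified sketch of wait-for-job-with-timeout; status values mirror the ones visible in the
    // failures ("APPROVED" = still queued/running). Not the real RestWorkbenchClient implementation.
    class JobPollingSketch {

        enum JobStatus { ACCEPTED, APPROVED, SUCCESS, FAIL }

        interface JobEndpoint {
            JobStatus getJobStatus(String jobId); // hypothetical status lookup via REST
        }

        static void waitUntilJobFinished(JobEndpoint endpoint, String jobId, int timeoutSeconds)
                throws InterruptedException {
            JobStatus last = null;
            for (int elapsed = 0; elapsed < timeoutSeconds; elapsed++) {
                last = endpoint.getJobStatus(jobId);
                if (last == JobStatus.SUCCESS) {
                    return;                    // job finished within the timeout
                }
                TimeUnit.SECONDS.sleep(1);     // poll roughly once per second
            }
            // The last non-success status (e.g. APPROVED) ends up in the failure message,
            // which is why the cleanup errors read "APPROVED: null".
            throw new IllegalStateException("Job " + jobId + " still " + last
                    + " after " + timeoutSeconds + " s");
        }
    }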

@jhrcek (Author) commented Jul 10, 2018

A couple more data points to get a wider picture:
Your PR that resurrected these tests (#693) was merged on 26th February.
And here is the spread of failure dates of these setup methods - the pattern seems to indicate that the failures started around mid March.

(screenshot: restfailurespattern - spread of failure dates of the setup methods)

Just thinking out loud here: could this have something to do with indexing changes that were done by Max B around that time? https://github.com/kiegroup/appformer/commits?author=mbarkley

@Rikkola (Member) commented Jul 10, 2018

Based on all those Indexing tests I'm tempted to think so, but of course we would need some real data about the issue.

@ederign You might be the one who knows most about deleting Spaces. Can you think of anything that could cause it to behave like this? Sometimes it is quite fast and then suddenly a lot slower.

@jhrcek (Author) commented Jul 11, 2018

jenkins retest this

@jhrcek (Author) commented Jul 11, 2018

[ERROR] Failed to execute goal on project kie-tomcat-integration: Could not resolve dependencies for project org.kie:kie-tomcat-integration:jar:7.9.0-SNAPSHOT: Could not find artifact org.apache.tomcat:catalina:jar:9.0.10 in jboss-public-repository-group (https://repository.jboss.org/nexus/content/groups/public/) -> [Help 1]

This has been fixed upstream, trying again.
jenkins retest this

@jhrcek changed the title from "DO NOT MERGE: Troubleshoot flaky cleanup in workbench rest tests" to "Troubleshoot flaky cleanup in workbench rest tests" on Jul 12, 2018
@jhrcek (Author) commented Jul 12, 2018

@Rikkola After the tests pass I consider this ready to merge.
What was done:

  • created AF-1310, which will need to be prioritized and investigated
  • bumped the workbench rest client timeout in tests from 10 to 30 seconds to reduce the frequency of random failures
  • added an @Ignore-d test that clearly demonstrates this issue

@jhrcek (Author) commented Jul 13, 2018

I see that even setting the 30s rest client timeout didn't help, and now there's also a failure in the selenium tests (the list of example projects is not loaded). This suggests that this issue should be investigated sooner rather than later!
jenkins retest this

@kie-qe-ci commented:

Can one of the admins verify this patch?

@jhrcek (Author) commented Jul 23, 2018

jenkins retest this

@Rikkola (Member) commented Jul 23, 2018

Can't add myself as the reviewer, but giving this +1.

@jhrcek (Author) commented Jul 23, 2018

Got an inexplicable failure in the selenium tests, which can't be influenced by the changes in this PR.
jenkins retest this

@kie-ci commented Jul 23, 2018

Build finished. No test results found.

@jhrcek (Author) commented Jul 23, 2018

00:02:08.129 ERROR: Error cloning remote repo 'origin'
00:02:08.130 hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/kiegroup/kie-wb-distributions.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:

jenkins retest this

@jhrcek (Author) commented Jul 24, 2018

jenkins retest this

@triceo requested a review from Rikkola on July 24, 2018 12:39
@jhrcek (Author) commented Jul 24, 2018

jenkins retest this

@jhrcek changed the title from "Troubleshoot flaky cleanup in workbench rest tests" to "[DO NOT MERGE] Troubleshoot flaky cleanup in workbench rest tests" on Jul 25, 2018
@manstis (Member) commented Jul 25, 2018

Jenkins please retest this.

@jhrcek (Author) commented Jul 26, 2018

Getting only one confusing failure in the selenium tests:
LoginIntegrationTest is failing because it sees this after opening the app in the browser. This PR should not have any influence on the UI, it's just rest test stabilization.

jenkins retest this

@Rikkola (Member) commented Jul 26, 2018

Yep. REST is not used internally, so login or anything else should not be affected.

@jhrcek (Author) commented Jul 27, 2018

I'm looking into this more deeply. Now that I know how to connect to the Jenkins slave while the tests are running, I'll try to investigate why it's sometimes so slow.
jenkins retest this

@jhrcek (Author) commented Jul 27, 2018

I connected to the Jenkins slave via ssh while the tests were running and repeatedly took stack trace dumps via jstack while the long "delete space" calls were in progress. The following stack trace elements were present in most of these dumps:

	at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:327)
	at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:472)

It seems quite plausible that this is the root cause of the slowness. Not sure what to do about it though.
@Rikkola @adrielparedes any ideas?

Also @adrielparedes there are tons of indexing related exceptions that I mentioned about 2 weeks ago. Did you get a chance to look at those?

[INFO] [talledLocalContainer] 03:38:32,938 ERROR [org.uberfire.ext.metadata.io.IndexerDispatcher] (Thread-100) Indexing error for indexer [id=org.kie.workbench.common.screens.datamodeller.backend.server.indexing.JavaFileIndexer]: java.lang.NullPointerException
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.backend.lucene.provider.LuceneIndexProvider.delete(LuceneIndexProvider.java:108)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.doDelete(MetadataIndexEngine.java:195)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.doAction(MetadataIndexEngine.java:134)
[INFO] [talledLocalContainer] 	at java.util.ArrayList.forEach(ArrayList.java:1257)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.doCommit(MetadataIndexEngine.java:231)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.index.MetadataIndexEngine.commit(MetadataIndexEngine.java:220)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.IndexerDispatcher$IndexerJob.get(IndexerDispatcher.java:189)
[INFO] [talledLocalContainer] 	at org.uberfire.ext.metadata.io.IndexerDispatcher$IndexerJob.get(IndexerDispatcher.java:159)
[INFO] [talledLocalContainer] 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
[INFO] [talledLocalContainer] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[INFO] [talledLocalContainer] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[INFO] [talledLocalContainer] 	at java.lang.Thread.run(Thread.java:748)
[INFO] [talledLocalContainer] 

For reference, to reproduce this locally: check out this PR and run

cd kie-wb-distributions/kie-wb-tests/kie-wb-tests-rest
mvn clean verify -Pkie-wb,wildfly11 -Dit.test=TroubleshootRestSetupIssueIntegrationTest

And you'll see tons of these errors.

@jhrcek (Author) commented Aug 1, 2018

jenkins retest this

@jhrcek (Author) commented Aug 2, 2018

jenkins retest this

@jhrcek (Author) commented Aug 2, 2018

stderr: fatal: unable to access 'https://github.com/kiegroup/kie-wb-distributions.git/': Could not resolve host: github.com; Unknown error

jenkins retest this

@jhrcek (Author) commented Aug 2, 2018

Created a companion PR trying to fix an NPE that may be related to this.
jenkins retest this

@kie-qe-ci commented:

Can one of the admins verify this PR? Comment with 'ok to test' to start the build.

@jhrcek (Author) commented Aug 9, 2018

Progress update: I closed the companion PR because it doesn't help the issue. The NPEs are just a side effect of a general problem that these tests make visible: when too many operations that manipulate multiple git filesystems are executed in a short time, there are a lot of indexing operations running in parallel, and sometimes the spaces / projects are deleted before the indexing finishes. I need to rerun these tests to evaluate the scope of this issue (how much the creating / deleting of spaces and the indexing of assets overlap).
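
To illustrate the shape of that race with a self-contained example (a generic sketch, not the actual appformer indexing code):

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    // Generic illustration of the race: an asynchronously dispatched indexing job looks up the
    // index for a space/project that a concurrent delete may already have removed. Whether the
    // NullPointerException happens depends purely on timing, hence the flakiness.
    class IndexingRaceSketch {

        static final Map<String, StringBuilder> indexes = new ConcurrentHashMap<>();

        public static void main(String[] args) {
            indexes.put("mySpace/myProject", new StringBuilder());

            // Indexing runs asynchronously, similar in spirit to IndexerDispatcher's CompletableFuture jobs.
            CompletableFuture<Void> indexingJob = CompletableFuture.runAsync(
                    () -> indexes.get("mySpace/myProject").append("indexed asset"));

            indexes.remove("mySpace/myProject"); // the space delete may win the race and drop the index

            indexingJob.join(); // if the delete won, this surfaces the NPE (wrapped in a CompletionException)
        }
    }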

@jhrcek (Author) commented Aug 10, 2018

Here is the summary of my findings:

  • Probable cause: disk IO (niogit & indexing operations) is comparatively slower on the jenkins slaves.
  • Reported 2 issues in indexing implementation:
    • Redundant indexing being done (reported in RHPAM-1455)
    • NPEs being thrown from LuceneIndexProvider (reported in RHPAM-1459)
  • I'd also like to disable the incremental build to make the tests less "IO intensive" (by supplying -Dbuild.enable-incremental=false when starting the wildfly that runs the workbench - see AppSetup), but this is currently impossible due to regression AF-1408.

I think it is a good idea to prioritize investigating and fixing these indexing issues before we merge multiple branch support (CC @paulovmr) and infinispan indexing (CC @adrielparedes), which will introduce more logic around indexing!

@jhrcek (Author) commented Aug 15, 2018

Testing stability of changes: 1 green
jenkins retest this

@jhrcek (Author) commented Aug 15, 2018

Testing stability of changes: 2 greens
jenkins retest this

@jhrcek (Author) commented Aug 16, 2018

Testing stability of changes: 3 greens
jenkins retest this

@jhrcek (Author) commented Aug 16, 2018

Testing stability of changes: 4 greens
jenkins retest this

@jhrcek (Author) commented Aug 16, 2018

@manstis I've got 5 greens in a row. I believe the rest tests are now stable enough and this can be merged.

Summary of changes made:

  • Increased the rest client timeout (just in the tests) to 60 seconds. Most of the time the operations are fast (~1 s), but occasionally it takes up to 30 seconds for a space deletion to finish. The duration of rest calls is logged so we can check it later.
  • Filed a couple of jiras to investigate issues around indexing (see above).
  • Improved the exception message for when the rest client timeout occurs in tests, so it's clear why a test failed.
  • Un-@Ignore-d the tests that were ignored here. With the changes in this PR they're passing.

@Rikkola (Member) commented Aug 16, 2018

Good stuff, well done.

@manstis changed the title from "[DO NOT MERGE] Troubleshoot flaky cleanup in workbench rest tests" to "Troubleshoot flaky cleanup in workbench rest tests" on Aug 16, 2018
@manstis merged commit a95d1a9 into kiegroup:master on Aug 16, 2018