Job not getting stopped when number of available hosts < 'minHosts' property #579
Srinath, please test the fix on branch "issue579". Since it could be tricky for me to set up this exact scenario, I'm hoping you can try it in your setup and see if this is the right fix. Thanks to your thorough issue report here, I feel confident that I understand what caused the issue.
21:16:40.119 [pool-2-thread-1] INFO c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5, rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] hosts with forests for "WriteHostBatcher"
21:15:24.303 [main] INFO c.m.c.d.impl.WriteBatcherImpl - (withForestConfig) Using [rh7v-intel64-90-test-5, rh7v-intel64-90-test-6.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] hosts with forests for "WriteHostBatcher"
Keen observation, Srinath! I think it's worth its own issue. Could you log it separately and keep this issue focused on fixing the infinite retry scenario?
Sam, with the 'issue579' branch I still face the same issue. I've attached the logs with logging done the way you wanted (like the stress tests). One additional piece of information: the initial client object (used to obtain the instance of DataMovementManager) is connected to "rh7v-intel64-90-test-5" (which is one of the nodes shut down during the test).
Ok, looks like your keen observation may have hit the nail on the head. I was distracted by the idea "even after job is stopped", and that's what I fixed in the last commit. But now we're seeing a situation where the job is not getting stopped because, as you noticed, the hostname on the DatabaseClient doesn't match any of the preferred hostnames from ForestConfiguration, so we're not black-listing a host because we don't see it in the list of preferred hosts. Yet we're retrying the failed batch on the assumption that the black-listing has already occurred (sketched below).
I believe I have a fix. Please forgive me for asking you to test it again.
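For what it's worth, a minimal, self-contained sketch of that failure mode, assuming black-listing works by exact string match against the preferred-host set; the class and method names here are hypothetical, not the actual WriteBatcherImpl/HostAvailabilityListener code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of black-listing by exact string match followed by
// an unconditional retry; not the library's code.
public class RetryAssumptionSketch {

    private final Set<String> preferredHosts = ConcurrentHashMap.newKeySet();

    RetryAssumptionSketch(String... hosts) {
        for (String h : hosts) preferredHosts.add(h);
    }

    // Invoked when a request against 'failedHost' throws a connection error.
    void onHostUnavailable(String failedHost, Runnable retryBatch) {
        // Intended behavior: black-list the failed host so the retry lands elsewhere.
        boolean removed = preferredHosts.remove(failedHost);

        // If 'failedHost' is the DatabaseClient's short hostname and the set only
        // holds the names from ForestConfiguration, nothing is removed -- yet the
        // batch is retried as if the black-listing had taken effect.
        if (!removed) {
            System.out.println("WARN: " + failedHost + " not found in " + preferredHosts
                    + "; retrying against an unchanged host list");
        }
        retryBatch.run();
    }

    public static void main(String[] args) {
        RetryAssumptionSketch sketch = new RetryAssumptionSketch(
                "rh7v-intel64-90-test-4.marklogic.com",
                "rh7v-intel64-90-test-5.marklogic.com",
                "rh7v-intel64-90-test-6.marklogic.com");
        // The DatabaseClient was created with the short hostname, so it never matches.
        sketch.onHostUnavailable("rh7v-intel64-90-test-5",
                () -> System.out.println("retrying batch..."));
    }
}
```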
Please test on the issue579 branch.
Sam, I am still seeing the issue (job not getting stopped). I have attached the logs.
A. Initially all 3 nodes are up, with a forest on each of them (WriteHostBatcher-1/2/3 on rh7v-intel64-90-test-4/5/6.marklogic.com). It can be seen that if the db client object used to query the 'forestinfo' endpoint uses the hostname "rh7v-intel64-90-test-5" (as opposed to the FQDN), the Host is set to "rh7v-intel64-90-test-5.marklogic.com" and the alternate host is set to "rh7v-intel64-90-test-5".
B. Now, node rh7v-intel64-90-test-6.marklogic.com (hosting WriteHostBatcher-3) is stopped. The node is blacklisted and hence removed from the hashset, as expected.
C. This is when forest failover starts to occur. The forest on rh7v-intel64-90-test-6.marklogic.com is configured to fail over to rh7v-intel64-90-test-5.marklogic.com. As can be seen, when the 'forestinfo' endpoint is queried now, for 'WriteHostBatcher-3' it returns 'Host' as rh7v-intel64-90-test-6.marklogic.com (where it was configured to run originally) and 'alternateHost' as rh7v-intel64-90-test-5.marklogic.com (where the forest failed over to and is currently running). So the preferred host for this 'forest' object is now set to 'rh7v-intel64-90-test-5.marklogic.com', and the hashset now contains [rh7v-intel64-90-test-5, rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-4.marklogic.com].
D. Now, when host "rh7v-intel64-90-test-5" is stopped, "rh7v-intel64-90-test-5" is blacklisted. The job should stop as 2 hosts are down, but the hashset still contains [rh7v-intel64-90-test-5.marklogic.com, rh7v-intel64-90-test-4.marklogic.com] and hence it continues to run. Based on this, it looks like the hashset 'hosts' should be populated with the object 'host' (forestNode.get("host").asText()).
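To make steps A-D concrete, here is a minimal, self-contained sketch of the host-set accounting described above; the ForestInfo class and preferredHost() method are hypothetical stand-ins for the library's internals, and the host names mirror the hashset contents quoted in steps C and D:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the host-set accounting in steps A-D; ForestInfo is a
// stand-in for one entry of the 'forestinfo' response, not the library's types.
public class PreferredHostSketch {

    static class ForestInfo {
        final String host;          // host the forest is configured on
        final String alternateHost; // host the forest is currently running on, if failed over
        ForestInfo(String host, String alternateHost) {
            this.host = host;
            this.alternateHost = alternateHost;
        }
        String preferredHost() {
            // Preferred host = alternateHost when present, otherwise host.
            return alternateHost != null ? alternateHost : host;
        }
    }

    public static void main(String[] args) {
        // State after step C: WriteHostBatcher-3's forest has failed over from test-6 to test-5.
        ForestInfo[] forests = {
            new ForestInfo("rh7v-intel64-90-test-4.marklogic.com", null),
            new ForestInfo("rh7v-intel64-90-test-5.marklogic.com", "rh7v-intel64-90-test-5"),
            new ForestInfo("rh7v-intel64-90-test-6.marklogic.com", "rh7v-intel64-90-test-5.marklogic.com")
        };

        Set<String> hosts = new HashSet<>();
        for (ForestInfo f : forests) {
            hosts.add(f.preferredHost());
        }
        // hosts now holds three names but only two distinct machines:
        // [rh7v-intel64-90-test-5, rh7v-intel64-90-test-5.marklogic.com,
        //  rh7v-intel64-90-test-4.marklogic.com]
        System.out.println("hosts: " + hosts);

        // Step D: "rh7v-intel64-90-test-5" is stopped and black-listed under that
        // name only; its FQDN alias survives, so the set still reports 2 hosts
        // even though only one machine is actually up.
        hosts.remove("rh7v-intel64-90-test-5");
        System.out.println("after black-listing: " + hosts + " (size=" + hosts.size() + ")");
    }
}
```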
This issue still occurs in the develop branch. All the information mentioned in the previous comment still applies. The latest client log is attached (WriteHostBatcher-1/2/3 on rh7v-intel64-90-test-19/20/21.marklogic.com).
…ull host names; #579 - Job not stopped
The job stops after the fix.
This issue was observed with a specific forest configuration described below:
A. This test was run on a 3-node cluster (rh7v-intel64-90-test-4/5/6.marklogic.com) with a forest (WriteBatcher-1/2/3) on each node, all associated with a db. 'WriteBatcher-1' is not configured for failover. 'WriteBatcher-3' is configured to fail over to host 'rh7v-intel64-90-test-5.marklogic.com'. 'WriteBatcher-2' is configured to fail over to host 'rh7v-intel64-90-test-4.marklogic.com'.
B. Now, while the 'ihb2' WB job is executing, node rh7v-intel64-90-test-6.marklogic.com is stopped first.
21:16:24.885 [main] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-6.marklogic.com", black-listing it for PT15S
The forest fails over to 'rh7v-intel64-90-test-5.marklogic.com'. The writing of documents to the db resumes once the failover is complete.
C. Now 'rh7v-intel64-90-test-5.marklogic.com' is stopped. It gets blacklisted
21:17:02.508 [main] ERROR c.m.c.d.HostAvailabilityListener - ERROR: host unavailable "rh7v-intel64-90-test-5", black-listing it for PT15S
D. After that, the job is stopped as available hosts < minHosts
21:17:02.772 [pool-1-thread-1] ERROR c.m.c.d.HostAvailabilityListener - Encountered [com.sun.jersey.api.client.ClientHandlerException: org.apache.http.NoHttpResponseException: The target server failed to respond] on host "rh7v-intel64-90-test-5.marklogic.com" but black-listing it would drop job below minHosts (2), so stopping job "unnamed".
E. After that, retrying of failed batches keeps running infinitely (see the sketch at the end of this report).
21:17:02.550 [main] WARN c.m.c.d.HostAvailabilityListener - Retrying failed batch: 132, results so far: 2640, uris: [/local/ABC-2620, /local/ABC-2621, /local/ABC-2622, /local/ABC-2623, /local/ABC-2624, /local/ABC-2625, /local/ABC-2626, /local/ABC-2627, /local/ABC-2628, /local/ABC-2629, /local/ABC-2630, /local/ABC-2631, /local/ABC-2632, /local/ABC-2633, /local/ABC-2634, /local/ABC-2635, /local/ABC-2636, /local/ABC-2637, /local/ABC-2638, /local/ABC-2639]
F. The client process was killed after some time, and the client logs and stack trace have been attached.
Client log
Stack trace
Test:
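For reference, below is a minimal sketch of the minHosts check described in steps D and E; the names are hypothetical, and it is neither the attached test nor the actual HostAvailabilityListener implementation. The point it illustrates: the job should stop once black-listing a host would leave fewer than minHosts hosts, but if the host set still carries a stale alias for a machine that is already down, the count never crosses the threshold and the failed batch keeps being retried.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the minHosts decision from steps D and E; not the
// actual HostAvailabilityListener code.
public class MinHostsSketch {

    private final Set<String> availableHosts = new HashSet<>();
    private final int minHosts;
    private boolean jobStopped = false;

    MinHostsSketch(Set<String> hosts, int minHosts) {
        this.availableHosts.addAll(hosts);
        this.minHosts = minHosts;
    }

    // Invoked when a request against 'failedHost' throws a connection error.
    void onHostUnavailable(String failedHost, Runnable retryBatch) {
        if (availableHosts.contains(failedHost) && availableHosts.size() - 1 < minHosts) {
            // Expected behavior (step D): black-listing this host would drop the
            // job below minHosts, so stop the job and schedule no more retries.
            jobStopped = true;
            System.out.println("Stopping job: black-listing " + failedHost
                    + " would drop below minHosts (" + minHosts + ")");
            return;
        }
        // Otherwise black-list the host and retry the batch elsewhere. If the set
        // still holds a stale alias for a machine that is already down, this
        // branch keeps being taken and the retries never end -- the behavior in step E.
        availableHosts.remove(failedHost);
        if (!jobStopped) {
            retryBatch.run();
        }
    }

    public static void main(String[] args) {
        Set<String> hosts = new HashSet<>();
        hosts.add("rh7v-intel64-90-test-5.marklogic.com");
        hosts.add("rh7v-intel64-90-test-4.marklogic.com");
        MinHostsSketch sketch = new MinHostsSketch(hosts, 2);
        // The short hostname is not in the set, so the minHosts guard never fires
        // and the batch is retried.
        sketch.onHostUnavailable("rh7v-intel64-90-test-5",
                () -> System.out.println("Retrying failed batch..."));
    }
}
```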