Skip to content
This repository has been archived by the owner on Nov 9, 2022. It is now read-only.

replicating Security DB results in failed bootstrap #735

Closed
joecrean opened this issue Feb 9, 2017 · 26 comments
Closed

replicating Security DB results in failed bootstrap #735

joecrean opened this issue Feb 9, 2017 · 26 comments

Comments

@joecrean
Copy link

joecrean commented Feb 9, 2017

The issue

Short description of the problem:

When you try to replicate the Security database as part of a bootstrap you can see that when the Security db goes offline the bootstrap seems to continue and gets into a kind of death loop trying to lookup details in the Security db. Here is a sample of the output in the ML ErrorLog.txt.

2017-02-09 14:20:12.583 Debug: Retrying HTTPRequestTask::handleXDBCEval eval 8527616213787243855 Update 90 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:14.487 Debug: Retrying ForestReindexerThread::reindexFragments Last-Login 28 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:14.843 Debug: Retrying HTTPRequestTask::handleXDBCEval eval 8527616213787243855 Update 91 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:15.918 Debug: Retrying ForestReindexerThread::reindexFragments Modules 28 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:16.961 Debug: Retrying HTTPRequestTask::handleXDBCEval eval 8527616213787243855 Update 92 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:16.975 Debug: Retrying ForestReindexerThread::reindexFragments Meters 30 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:17.350 Debug: Retrying ForestReindexerThread::reindexFragments Schemas 29 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:17.819 Debug: Retrying ForestReindexerThread::reindexFragments Fab 25 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:18.728 Debug: Retrying ForestReindexerThread::reindexFragments App-Services 27 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:18.831 Debug: Retrying ForestReindexerThread::reindexFragments Documents 28 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:19.528 Debug: Retrying HTTPRequestTask::handleXDBCEval eval 8527616213787243855 Update 93 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:21.655 Debug: Retrying ForestReindexerThread::reindexFragments Triggers 27 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-02-09 14:20:21.798 Debug: Retrying HTTPRequestTask::handleXDBCEval eval 8527616213787243855 Update 94 because XDMP-FORESTNOT: Forest Security not available: middle closing
2

What are the steps to reproduce the problem?

Create a 3 node cluster using Docker , Red Hat SLES 6, ML 8.0-6.1. My host laptop is Mac OS X. Download and create a vanilla roxy project. Set systems-dbs=Security. Run the bootstrap as below , eventually you will see the error message below.

~/Workspace/Haufe/test-app -> ml local bootstrap --replicate-internals
Bootstrapping replicas for Security on localhost...
ERROR: 500 "Internal Server Error"
ERROR: {"errorResponse":{"statusCode":500, "status":"Internal Server Error", "messageCode":"XDMP-FORESTNOT", "message":"XDMP-FORESTNOT: Forest Security not available: middle closing", "messageDetail":{"messageTitle":"Forest not available"}}}

Here is my local.properties

test-content-db=${app-name}-content-test
test-modules-db=${app-name}-modules
test-port=8072
test-authentication-method=digest
content-forests-per-host=1
system-dbs=Security

Tech Specs

Which Operating System are you using?

SLES 6

Which version of MarkLogic are you using?

8.0-6.1

Which version of Roxy are you using (see version.txt)?

1.7.5

@joecrean
Copy link
Author

joecrean commented Feb 9, 2017

here is a zip of my test project

test-app.zip

@joecrean
Copy link
Author

joecrean commented Feb 9, 2017

this is a biggie because we cannot bootstrap into Production without replicating the Security.

I also forgot to mention that when you change the call to MarkLogic from v1/eval to qconsole evaler then it works (after you disable the CSRF validation - in the ML source xquery)

@RobertSzkutak RobertSzkutak self-assigned this Feb 9, 2017
@RobertSzkutak RobertSzkutak added this to the April 2017 milestone Feb 9, 2017
@RobertSzkutak
Copy link
Contributor

Any notes in your ErrorLog.txt to give us an idea of where/when/why setup.xqy failed? Thanks!

@joecrean
Copy link
Author

joecrean commented Feb 9, 2017

the annoying and no less mystifying thing about this is that there is no error written to the error log. All the "contention" messages on the security db are at debug level. I originally tracked the error down to the call of admin:forest-create on line 1615 of setup.xqy however that when I was creating App-Services forests. However i bet it is still going in there. I traced the flow into this api and it never seemed to come out. I had try/catch envelopes but it all didn't help. I have simply no idea how or where it failed.

@RobertSzkutak RobertSzkutak modified the milestones: July 2017, April 2017 May 5, 2017
@grtjn
Copy link
Contributor

grtjn commented May 28, 2017

I wonder what the essential difference between /v1/eval and QC eval is..

I did notice that QC forces different-transaction. Maybe /v1/eval doesn't by default? I'd have to dig around in /v1/eval sources.

@grtjn
Copy link
Contributor

grtjn commented May 29, 2017

The logic to apply replica forests has changed considerably. That may be the root cause..

@grtjn
Copy link
Contributor

grtjn commented May 29, 2017

Initial test doesn't look good. Bootstrap fails:

./ml local bootstrap --replicate-internals
Bootstrapping replicas for Security on ml8cl-ml1...
ERROR: 500 "Internal Server Error"
ERROR: {"errorResponse":{"statusCode":500, "status":"Internal Server Error", "messageCode":"XDMP-FORESTNOT", "message":"XDMP-FORESTNOT: Forest Security not available: middle closing", "messageDetail":{"messageTitle":"Forest not available"}}}

And after that Admin UI, and other things stop responding..

@grtjn
Copy link
Contributor

grtjn commented May 29, 2017

That is using latest dev..

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

I ran this against a 3 node cluster with 8.0-6.4, and I noticed the following ErrorLog messages. I first see these message appear:

2017-05-30 04:37:20.855 Info: Forest Security state changed from open to start closing because forest replicas changed
2017-05-30 04:37:20.855 Info: Database Security is offline
2017-05-30 04:37:21.057 Info: Forest Security state changed from start closing to middle closing because forest replicas changed
2017-05-30 04:37:22.839 Info: Mounted forest Security-rep1-on-002 remotely on ml8cl-ml2

after which it waits. At some point the bootstrap times out to fail with the earlier mentioned XDMP-FORESTNOT. After that ErrorLog suddenly continues with:

2017-05-30 04:40:24.800 Info: Forest Security state changed from middle closing to finish closing because forest replicas changed
2017-05-30 04:40:24.804 Info: Forest Security state changed from finish closing to mounted
2017-05-30 04:40:24.806 Info: Forest Security state changed from mounted to unmounted
2017-05-30 04:40:24.806 Info: Unmounted forest Security
2017-05-30 04:40:26.036 Info: Forest Security state changed from unmounted to mounted
2017-05-30 04:40:26.036 Info: Mounted forest Security locally on /var/opt/MarkLogic/Forests/Security read write (dm-0 xfs rw,seclabel,relatime,attr2,inode64,noquota)
2017-05-30 04:40:26.038 Info: Forest Security state changed from mounted to recovering
2017-05-30 04:40:26.039 Info: Forest Security state changed from recovering to wait replication because not yet sure who is master
2017-05-30 04:40:27.051 Info: Forest Security is in state wait replication because waiting for checkpoint
2017-05-30 04:40:27.062 Info: Forest Security assuming the role of master with old precise time 14960869159704380
2017-05-30 04:40:27.062 Info: Forest Security state changed from wait replication to open because was previous master
2017-05-30 04:40:28.037 Info: Database Security is online with 1 forest
2017-05-30 04:40:29.083 Info: Forest Security starting synchronization to domestic forest Security-rep1-on-002
2017-05-30 04:40:29.084 Info: Forest Security starting bulk replication to domestic forest Security-rep1-on-002 with effectiveVersion=0
2017-05-30 04:40:29.087 Info: Forest Security replicating 1297 keys to domestic forest Security-rep1-on-002
2017-05-30 04:40:29.108 Info: Forest Security needs to replicate 1297 fragments [0:0..0:0] to domestic forest Security-rep1-on-002
2017-05-30 04:40:29.499 Info: Forest Security finished bulk replicating to domestic forest Security-rep1-on-002
2017-05-30 04:40:29.509 Info: Forest Security finished synchronizing to domestic replica forest Security-rep1-on-002 at frame 1523

And once finished synchronizing, the cluster returns to a responsive state.

Note: I tried with 8.0-1.1 before, and that didn't complete successfully, leaving a unresponsive cluster, which I had to clear fully..

@joecrean
Copy link
Author

@grtjn so just to summarise: with 8.0-6.4, bootstrap fails but the deployment is eventually successful?

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

@joecrean still running tests. Ping me offline if you like..

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

I think I have been able to pinpoint the issue, though I still wonder why this never caused trouble in the past..

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

I have derived the following so far: the bootstrap gets passed admin:forest-create, and even to the point of admin:save-config-without-restart. At that stage the Security database starts closing, and gets stuck at middle closing. This is because bootstrap code has not finished yet for some reason. It used to complete without trouble in the past, but since MarkLogic 8 (and also with 9) the bootstrap code gets stuck because it tries to access Security database, but that is by that time in middle closing state. E.g. deadlock. Only when the evals timeout with an XDMP-FORESTNOT, the handles on Security db are released, and Security db can unmount, remount, and initialize replication.

I added a check to see if Security has been touched, and bail out from bootstrap. Doing that makes the bootstrap end normally and timely. I need to do some extra checks though as to why this is necessary..

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

Same issue occurs on MarkLogic 7, except that it doesn't wait with releasing handles, so one doesn't really notice there is a deadlock:

2017-05-30 15:35:57.269 Info: App-Services: INIT delete map file: /roxy/status/cleanup/delete-map-internal.xml
2017-05-30 15:35:57.269 Info: App-Services: INIT replicating map file: /roxy/status/cleanup/replicating-map-internal.xml
2017-05-30 15:35:57.292 Info: App-Services: Calling admin:forest-create(Security-rep1-on-002, 2014913018365232144)
2017-05-30 15:35:57.302 Info: App-Services: Done calling admin:forest-create(Security-rep1-on-002, 2014913018365232144)
2017-05-30 15:35:57.316 Info: App-Services: Calling admin:save-config
2017-05-30 15:35:57.333 Config: Wrote /var/opt/MarkLogic/assignments.xml
2017-05-30 15:35:57.333 Config: Loading /var/opt/MarkLogic/assignments.xml
2017-05-30 15:35:57.333 Info: [Event:id=Forest State] Forest Security state changed from open to start closing at 14961585573333630
2017-05-30 15:35:57.387 Info: App-Services: Done calling admin:save-config
2017-05-30 15:35:57.533 Info: [Event:id=Forest State] Forest Security state changed from start closing to middle closing at 14961585575338290
2017-05-30 15:35:57.902 Debug: Retrying AppRequestTask::handleEvalLocked 15655607047287356364 Query 1 because XDMP-FORESTNOT: Forest Security not available: middle closing
2017-05-30 15:35:57.934 Info: [Event:id=Forest State] Forest Security released all stand handles
2017-05-30 15:35:57.934 Info: [Event:id=Forest State] Forest Security state changed from middle closing to finish closing at 14961585579345520
2017-05-30 15:35:57.936 Debug: Closing journal /var/opt/MarkLogic/Forests/Security/Journals/Journal1-19700101000000-0-18446744073709551615-0 with fsns 0..1292

@jamsilvia
Copy link
Contributor

So I had seen this issue when I added the code to scale out the cluster. I had tested with an older version of Roxy at that point, and it had the same problem. So I didn't delve into it deeper (I just reran bootstrap and it would complete after 1 or 2 re-runs, typically). All of my runs were on ML8.

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

I did a lot of research, and discovered there are many xdmp:eval's agains the Security database, most of them irrelevant to --replicate-internals, except for those in setup:save-cleanup-state. Commenting this line helped a lot, but disables functionality important to setting up replica forests:

https://github.com/marklogic/roxy/blob/dev/deploy/lib/xquery/setup.xqy#L1344

I need to see if I can remove the xdmp:eval in that function, or execute it earlier somehow..

@jamsilvia
Copy link
Contributor

The update for the cleanup state should be done very late in the execution (after successfully creating the new replicas has been completed). The eval was used just so that we could run against the security DB to get some role information. I don't think we can change it to not run against that DB....but we should be able to move it earlier. Although at this point, I don't know that the evals in here run until everything is done.

@grtjn
Copy link
Contributor

grtjn commented May 30, 2017

I'll dig in and try some things..

@grtjn
Copy link
Contributor

grtjn commented May 31, 2017

Actually, I think you can remove the xdmp:eval in that function. It is unnecessary to give doc permissions to the admin role (admin users can always see everything), so no need to check if the current user has admin role either.

@grtjn
Copy link
Contributor

grtjn commented May 31, 2017

@joecrean I think I found a satisfying solution, see PR #771. I'd be happy to get your feedback..

@grtjn grtjn assigned grtjn and unassigned RobertSzkutak May 31, 2017
RobertSzkutak added a commit that referenced this issue May 31, 2017
Fixed #735: removed unnecessary eval against Security
@grtjn
Copy link
Contributor

grtjn commented May 31, 2017

@joecrean the fix was merged into the dev branch. You could run ./ml upgrade --branch=dev to run checks with it. Make a backup of your current Roxy first though.

Please close the ticket if you are satisfied by the fix..

@grtjn
Copy link
Contributor

grtjn commented Jun 1, 2017

One sec, in my last PR i moved amps to before db creation, but it depends on the db, which must exist before the amp can be created. I'll open a PR for that shortly..

grtjn added a commit to grtjn/roxy that referenced this issue Jun 1, 2017
grtjn added a commit that referenced this issue Jun 1, 2017
#735: accidental move of create-amps, breaking self-test
@joecrean
Copy link
Author

joecrean commented Jun 1, 2017

@grtjn
i tested the dev branch today and the bootstrap seems to be working fine. I will do it in 2 steps
ml env bootstrap followed by ml env bootstrap --replicate-internals.. in the corresponding props file i will set system-dbs=Security. The second step creates the Security replica

One thing - ml env wipe does not remove the Security replica.

@grtjn
Copy link
Contributor

grtjn commented Jun 1, 2017

To wipe the internal replicas you need to use ./ml env wipe --internal-replicas ;-)

@joecrean
Copy link
Author

joecrean commented Jun 1, 2017

that wiped the grin off my face... haha i crack me up

@grtjn
Copy link
Contributor

grtjn commented Jun 6, 2017

Fixed in dev

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants