
Added eloquent error message for MODCLUSTER-448. #140

Closed · wants to merge 1 commit

Conversation

@Karm (Member) commented Feb 23, 2015

Description

If a worker registers with the balancer, does not yet exist at that point, and yet shares its URL with another worker returned by ap_proxy_get_worker, we log the following error message:

[error] Apparently, there are workers that reported the same url ajp://127.0.0.1:8009. This will cause problems with failover.

In the aforementioned case, there were two workers with different jvmRoutes on two virtual machines, each reporting its hostname and port as 127.0.0.1:8009.
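A hedged sketch of the moment the message fires; worker_is_new() is a made-up stand-in for the module's "doesn't exist yet" check, and this is not the PR's actual code:

/* Illustrative only: the node is new to mod_cluster, yet ap_proxy_get_worker()
 * already returns a proxy_worker for the same URL, registered under a
 * different jvmRoute; the MODCLUSTER-448 situation. */
proxy_worker *worker = ap_proxy_get_worker(conf->pool, balancer, conf, url);
if (worker != NULL && worker_is_new(worker)) {
    ap_log_error(APLOG_MARK, APLOG_ERR, 0, server,
                 "Apparently, there are workers that reported the same url %s. "
                 "This will cause problems with failover.", url);
}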

Tested

  • Apache HTTP Server 2.2.26
  • Apache HTTP Server 2.4.6
  • Both on Fedora 20 x86_64

@modcluster-pull-request

Triggering build using a merge of 4335bf7 on branch master:
Private: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mod_cluster-1.3.x-pull-player-executor/


@Karm (Member Author) commented Mar 19, 2015

This is the offending line causing the core dump: apr_pool_destroy(worker->cp->pool);

According to the behaviour described in BZ 56314, could it be the case that we simply can't destroy the pool at this place for this worker, because the other worker is allocated there and is accepting requests?

IMHO it would be best to either remove_workers_node or set it to ERROR, and perhaps even collect worker->cp->res prior to destroying the pool. Gonna try that.
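A sketch of the "collect worker->cp->res first" idea, under the assumption that draining is even possible at this point; apr_reslist_destroy() runs the same reslist_cleanup, so it can only succeed once every resource has been given back:

/* Assumed approach, not tested code: only tear the pool down once nothing
 * is checked out; otherwise skip and let a later pass retry. */
if (apr_reslist_acquired_count(worker->cp->res) == 0) {
    apr_reslist_destroy(worker->cp->res);  /* drains and destroys idle connections */
    apr_pool_destroy(worker->cp->pool);
}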

@Karm (Member Author) commented Mar 19, 2015

(For my record)

i = apr_reslist_acquired_count(worker->cp->res);
ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, server, "KARM: remove_workers_node (reslist) %d %s", i, node->mess.JVMRoute);

The above, inserted just before the offending apr_pool_destroy(worker->cp->pool);, prints:

[debug] mod_proxy_cluster.c(522): KARM: remove_workers_node (helper) 0 4e6189af-0502-3305-8ff3-fad7fee8b516
httpd.worker: misc/apr_reslist.c:161: reslist_cleanup: Assertion `rl->ntotal == 0' failed.
[debug] mod_manager.c(2263): manager_trans STATUS (/)

Bewildering. Note: reslist_cleanup appears to be thread safe.

@@ -55,4 +55,32 @@ int (* proxy_node_isup)(request_rec *r, int id, int load);
int (* proxy_host_isup)(request_rec *r, char *scheme, char *host, char *port);
};
typedef struct balancer_method balancer_method;

/* From apr_reslist.c
@Karm (Member Author) commented:

It is consistent between httpd 2.4.6 and httpd 2.2.26. Is there a better way to avoid dereferencing a pointer to an incomplete type?
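For reference, a shortened sketch of what the copied definition looks like; the PR copies it verbatim from apr-util's apr_reslist.c, where the struct is private, and only the leading members matter for the ntotal/nidle check (their order and types must mirror the library's real layout exactly, otherwise the reads return garbage):

/* Shortened here; the real copy must match apr_reslist.c field for field. */
struct apr_reslist_t {
    apr_pool_t *pool;  /* pool used in constructor and destructor calls */
    int ntotal;        /* total number of resources managed by this list */
    int nidle;         /* number of available (idle) resources */
    /* ... remaining members omitted in this sketch ... */
};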

@modcluster-pull-request

Triggering build using a merge of 7907693 on branch master:
Private: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mod_cluster-1.3.x-pull-player-executor/

if (!worker->opaque)
    worker->opaque = (proxy_cluster_helper *) apr_pcalloc(conf->pool, sizeof(proxy_cluster_helper));
if (!worker->opaque)
    return APR_EGENERAL;
if (worker->cp->pool) {
    /* destroy and create a new one */
    ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, server,
                 "Reslist stats: worker->cp->res->ntotal:%d, worker->cp->res->nidle:%d",
                 worker->cp->res->ntotal, worker->cp->res->nidle);
    /* MODCLUSTER-448 */
    if (worker->cp->res->ntotal != worker->cp->res->nidle) {
@Karm (Member Author) commented:

If the number of idle and total resources is not the same, we can't simply destroy the pool for this worker at this time. It will be caught during the next iteration. This whole code branch is a mere corner case: it takes place when one tries to connect two workers with the same host and port at the same time. If you just remove and add 1 worker, you will never get ntotal != nidle.
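A hypothetical continuation of the hunk above (url and server are assumed to be in scope in create_worker; this shows the idea, not the PR's literal code):

/* With connections outstanding, destroying the pool would trip
 * reslist_cleanup's assert(rl->ntotal == 0) and core-dump httpd:
 * log the shared-URL problem and bail out, so the next
 * update_workers_node pass can retry once the resources are back. */
if (worker->cp->res->ntotal != worker->cp->res->nidle) {
    ap_log_error(APLOG_MARK, APLOG_ERR, 0, server,
                 "Apparently, there are workers that reported the same url %s. "
                 "This will cause problems with failover.", url);
    return APR_EGENERAL;  /* defer apr_pool_destroy(worker->cp->pool) */
}
apr_pool_destroy(worker->cp->pool);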

@Karm (Member Author) commented Mar 19, 2015

@jfclere WDYT?


@@ -101,6 +104,33 @@ static int enable_options = -1; /* Use OPTIONS * for CPING/CPONG */
#define TIMESESSIONID 300 /* after 5 minutes the sessionid have probably timeout */
#define TIMEDOMAIN 300 /* after 5 minutes the sessionid have probably timeout */

/* From apr_reslist.c
@Karm (Member Author) commented:

A declaration is not enough :( We need the full definition for our check on ntotal and nidle...
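A two-line illustration of the compiler error that forces the copy (total_of() is made up for the demo):

#include <apr_reslist.h>  /* declares only: typedef struct apr_reslist_t apr_reslist_t; */

static int total_of(apr_reslist_t *rl)
{
    return rl->ntotal;  /* error: dereferencing pointer to incomplete type */
}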

@modcluster-pull-request

Triggering build using a merge of 5398900 on branch master:
Private: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mod_cluster-1.3.x-pull-player-executor/


@jfclere (Member) commented Mar 20, 2015

Try with apr_reslist_acquired_count()
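For reference, the suggested helper is public apr-util API; its declaration in apr_reslist.h reads:

/* apr_reslist.h (apr-util 1.x): number of resources currently checked out */
APU_DECLARE(apr_uint32_t) apr_reslist_acquired_count(apr_reslist_t *reslist);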

@Karm (Member Author) commented Apr 5, 2015

Notes

Re apr_reslist_acquired_count(): that was the first thing I tried -- before my crazy re-definition of apr_reslist_t :) It does not help us:

[debug] mod_proxy_cluster.c(1108): update_workers_node done
[debug] mod_proxy_cluster.c(1096): update_workers_node starting
[error] Apparently, there are workers that reported the same url ajp://127.0.0.1:8009. This will cause problems with failover.
[debug] mod_proxy_cluster.c(523): Created: can't reuse worker as it for ajp://127.0.0.1:8009 cleaning...
[debug] mod_proxy_cluster.c(530): Reslist stats: worker->cp->res->ntotal:1, worker->cp->res->nidle:1, apr_reslist_acquired_count(worker->cp->res):0
[debug] mod_proxy_cluster.c(610): proxy: initialized worker 2 in child 26374 for (127.0.0.1) min=0 max=25 smax=25
[debug] mod_proxy_cluster.c(666): Created: reuse worker ajp://127.0.0.1:8009 of balancer://mycluster
[debug] mod_proxy_cluster.c(680): Created: worker for ajp://127.0.0.1:8009 2 (status): 1

Note in the snippet above that while apr_reslist_acquired_count(worker->cp->res) is 0, the evil worker->cp->res->nidle:1 is still there, causing the crash.
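Which is consistent arithmetically; paraphrasing apr_reslist.c (not verbatim), the acquired count is just the difference between created and idle resources:

/* Paraphrase of apr_reslist.c: with ntotal == 1 and nidle == 1 this
 * returns 1 - 1 == 0, so the counter cannot distinguish "one idle
 * connection still alive" from "empty list", which is exactly why the
 * check in this PR reads ntotal and nidle directly. */
apr_uint32_t apr_reslist_acquired_count(apr_reslist_t *reslist)
{
    return reslist->ntotal - reslist->nidle;
}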

I ran the preprocessor on our code, first with httpd-2.2.26 apxs and then with httpd-2.4.6 apxs, looking for differences in the create_worker function. Unfortunately, this code is so dependent on the mod_proxy version that the diff effectively shows two different functions (left httpd-2.2.26, right httpd-2.4.6).

I wonder whether httpd-2.2.26 isn't missing something.

It would be best if ap_proxy_get_worker(conf->pool, balancer, conf, url) didn't "guess" and instead looked workers up accurately :(
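The "guessing" refers to httpd matching workers by URL prefix; a hedged paraphrase of that behaviour follows (not httpd's actual code, and workers[i]->s->name assumes the 2.4 shared-memory worker layout):

#include <string.h>

static proxy_worker *lookup_by_prefix(proxy_worker **workers, int n, const char *url)
{
    /* Longest worker name that prefixes the requested url wins, so two
     * nodes that both register ajp://127.0.0.1:8009 are indistinguishable. */
    proxy_worker *best = NULL;
    apr_size_t best_len = 0;
    int i;
    for (i = 0; i < n; i++) {
        apr_size_t len = strlen(workers[i]->s->name);
        if (len > best_len && strncmp(workers[i]->s->name, url, len) == 0) {
            best = workers[i];
            best_len = len;
        }
    }
    return best;
}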

@Karm (Member Author) commented Aug 18, 2015

I'm gonna revisit this beauty and try to implement it without re-defining struct apr_reslist_t.

@Karm (Member Author) commented May 31, 2016

@jfclere Gonna do it differently and port it to 1.3.x, 1.2.x and 2.0 (master).

@Karm Karm closed this May 31, 2016