
Added eloquent error message for MODCLUSTER-448. #140

Closed · wants to merge 1 commit

Conversation

@Karm (Member) commented Feb 23, 2015

Description

If a worker registers with the balancer, does not yet exist at that point, and yet shares its URL with another worker returned by ap_proxy_get_worker, we log the following error message:

[error] Apparently, there are workers that reported the same url ajp://127.0.0.1:8009. This will cause problems with failover.

In the aforementioned case, there were two workers with different jvmRoutes on two virtual machines, each reporting its hostname and port as 127.0.0.1:8009.
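A hedged sketch of the moment the message fires; worker_is_new() is a made-up stand-in for the module's "doesn't exist yet" check, and this is not the PR's actual code:

/* Illustrative only: the node is new to mod_cluster, yet ap_proxy_get_worker()
 * already returns a proxy_worker for the same URL, registered under a
 * different jvmRoute; the MODCLUSTER-448 situation. */
proxy_worker *worker = ap_proxy_get_worker(conf->pool, balancer, conf, url);
if (worker != NULL && worker_is_new(worker)) {
    ap_log_error(APLOG_MARK, APLOG_ERR, 0, server,
                 "Apparently, there are workers that reported the same url %s. "
                 "This will cause problems with failover.", url);
}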

Tested

  • Apache HTTP Server 2.2.26
  • Apache HTTP Server 2.4.6
  • Both on Fedora 20 x86_64

@modcluster-pull-request

Triggering build using a merge of 4335bf7 on branch master:
Private: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mod_cluster-1.3.x-pull-player-executor/


@Karm (Member Author) commented Mar 19, 2015

This is the offending line causing the core dump: apr_pool_destroy(worker->cp->pool);

According to the behaviour described in BZ 56314, could it be the case that we simply can't destroy the pool at this place for this worker, because the other worker is allocated there and is accepting requests?

IMHO it would be best to either remove_workers_node or set it to ERROR, and perhaps even collect worker->cp->res prior to destroying the pool. Gonna try that.
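A sketch of the "collect worker->cp->res first" idea, under the assumption that draining is even possible at this point; apr_reslist_destroy() runs the same reslist_cleanup, so it can only succeed once every resource has been given back:

/* Assumed approach, not tested code: only tear the pool down once nothing
 * is checked out; otherwise skip and let a later pass retry. */
if (apr_reslist_acquired_count(worker->cp->res) == 0) {
    apr_reslist_destroy(worker->cp->res);  /* drains and destroys idle connections */
    apr_pool_destroy(worker->cp->pool);
}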

@Karm (Member Author) commented Mar 19, 2015

(For my record)

i = apr_reslist_acquired_count(worker->cp->res);
ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, server, "KARM: remove_workers_node (reslist) %d %s", i, node->mess.JVMRoute);

The above, inserted just before the offending apr_pool_destroy(worker->cp->pool);, prints:

[debug] mod_proxy_cluster.c(522): KARM: remove_workers_node (helper) 0 4e6189af-0502-3305-8ff3-fad7fee8b516
httpd.worker: misc/apr_reslist.c:161: reslist_cleanup: Assertion `rl->ntotal == 0' failed.
[debug] mod_manager.c(2263): manager_trans STATUS (/)

Bewildering. Note: reslist_cleanup appears to be thread safe.

@@ -55,4 +55,32 @@ int (* proxy_node_isup)(request_rec *r, int id, int load);
int (* proxy_host_isup)(request_rec *r, char *scheme, char *host, char *port);
};
typedef struct balancer_method balancer_method;

/* From apr_reslist.c
@Karm (Member Author) commented:

It is consistent between httpd 2.4.6 and httpd 2.2.26. Is there a better way to avoid dereferencing a pointer to an incomplete type?
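For reference, a shortened sketch of what the copied definition looks like; the PR copies it verbatim from apr-util's apr_reslist.c, where the struct is private, and only the leading members matter for the ntotal/nidle check (their order and types must mirror the library's real layout exactly, otherwise the reads return garbage):

/* Shortened here; the real copy must match apr_reslist.c field for field. */
struct apr_reslist_t {
    apr_pool_t *pool;  /* pool used in constructor and destructor calls */
    int ntotal;        /* total number of resources managed by this list */
    int nidle;         /* number of available (idle) resources */
    /* ... remaining members omitted in this sketch ... */
};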

@modcluster-pull-request

Triggering build using a merge of 7907693 on branch master:
Private: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mod_cluster-1.3.x-pull-player-executor/

if (!worker->opaque)
    worker->opaque = (proxy_cluster_helper *) apr_pcalloc(conf->pool, sizeof(proxy_cluster_helper));
if (!worker->opaque)
    return APR_EGENERAL;
if (worker->cp->pool) {
    /* destroy and create a new one */
    ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, server,
                 "Reslist stats: worker->cp->res->ntotal:%d, worker->cp->res->nidle:%d",
                 worker->cp->res->ntotal, worker->cp->res->nidle);
    /* MODCLUSTER-448 */
    if (worker->cp->res->ntotal != worker->cp->res->nidle) {
@Karm (Member Author) commented:

If the number of idle and total resources is not the same, we can't simply destroy the pool for this worker at this time. It will be caught during the next iteration. This whole code branch is a mere corner case: it takes place when one tries to connect two workers with the same host and port at the same time. If you just remove and add 1 worker, you will never get ntotal != nidle.
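A hypothetical continuation of the hunk above (url and server are assumed to be in scope in create_worker; this shows the idea, not the PR's literal code):

/* With connections outstanding, destroying the pool would trip
 * reslist_cleanup's assert(rl->ntotal == 0) and core-dump httpd:
 * log the shared-URL problem and bail out, so the next
 * update_workers_node pass can retry once the resources are back. */
if (worker->cp->res->ntotal != worker->cp->res->nidle) {
    ap_log_error(APLOG_MARK, APLOG_ERR, 0, server,
                 "Apparently, there are workers that reported the same url %s. "
                 "This will cause problems with failover.", url);
    return APR_EGENERAL;  /* defer apr_pool_destroy(worker->cp->pool) */
}
apr_pool_destroy(worker->cp->pool);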

@Karm (Member Author) commented Mar 19, 2015

@jfclere WDYT?


@@ -101,6 +104,33 @@ static int enable_options = -1; /* Use OPTIONS * for CPING/CPONG */
#define TIMESESSIONID 300 /* after 5 minutes the sessionid have probably timeout */
#define TIMEDOMAIN 300 /* after 5 minutes the sessionid have probably timeout */

/* From apr_reslist.c
@Karm (Member Author) commented:

A declaration is not enough :( We need the full definition for our check on ntotal and nidle...
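A two-line illustration of the compiler error that forces the copy (total_of() is made up for the demo):

#include <apr_reslist.h>  /* declares only: typedef struct apr_reslist_t apr_reslist_t; */

static int total_of(apr_reslist_t *rl)
{
    return rl->ntotal;  /* error: dereferencing pointer to incomplete type */
}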

@modcluster-pull-request

Triggering build using a merge of 5398900 on branch master:
Private: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mod_cluster-1.3.x-pull-player-executor/


@jfclere (Member) commented Mar 20, 2015

Try with apr_reslist_acquired_count()
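For reference, the suggested helper is public apr-util API; its declaration in apr_reslist.h reads:

/* apr_reslist.h (apr-util 1.x): number of resources currently checked out */
APU_DECLARE(apr_uint32_t) apr_reslist_acquired_count(apr_reslist_t *reslist);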

@Karm (Member Author) commented Apr 5, 2015

Notes

Re apr_reslist_acquired_count(): that was the first thing I tried -- before my crazy re-definition of apr_reslist_t :) It does not help us:

[debug] mod_proxy_cluster.c(1108): update_workers_node done
[debug] mod_proxy_cluster.c(1096): update_workers_node starting
[error] Apparently, there are workers that reported the same url ajp://127.0.0.1:8009. This will cause problems with failover.
[debug] mod_proxy_cluster.c(523): Created: can't reuse worker as it for ajp://127.0.0.1:8009 cleaning...
[debug] mod_proxy_cluster.c(530): Reslist stats: worker->cp->res->ntotal:1, worker->cp->res->nidle:1, apr_reslist_acquired_count(worker->cp->res):0
[debug] mod_proxy_cluster.c(610): proxy: initialized worker 2 in child 26374 for (127.0.0.1) min=0 max=25 smax=25
[debug] mod_proxy_cluster.c(666): Created: reuse worker ajp://127.0.0.1:8009 of balancer://mycluster
[debug] mod_proxy_cluster.c(680): Created: worker for ajp://127.0.0.1:8009 2 (status): 1

Note in the snippet above that while apr_reslist_acquired_count(worker->cp->res) is 0, the evil worker->cp->res->nidle:1 is still there, causing the crash.
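Which is consistent arithmetically; paraphrasing apr_reslist.c (not verbatim), the acquired count is just the difference between created and idle resources:

/* Paraphrase of apr_reslist.c: with ntotal == 1 and nidle == 1 this
 * returns 1 - 1 == 0, so the counter cannot distinguish "one idle
 * connection still alive" from "empty list", which is exactly why the
 * check in this PR reads ntotal and nidle directly. */
apr_uint32_t apr_reslist_acquired_count(apr_reslist_t *reslist)
{
    return reslist->ntotal - reslist->nidle;
}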

I ran the preprocessor on our code, first with httpd-2.2.26 apxs and then with httpd-2.4.6 apxs, looking for differences in the create_worker function. Unfortunately, this code is so dependent on the mod_proxy version that the diff effectively shows two different functions (left httpd-2.2.26, right httpd-2.4.6).

I wonder whether httpd-2.2.26 isn't missing something.

It would be best if ap_proxy_get_worker(conf->pool, balancer, conf, url) didn't "guess" and instead looked workers up accurately :(
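The "guessing" refers to httpd matching workers by URL prefix; a hedged paraphrase of that behaviour follows (not httpd's actual code, and workers[i]->s->name assumes the 2.4 shared-memory worker layout):

#include <string.h>

static proxy_worker *lookup_by_prefix(proxy_worker **workers, int n, const char *url)
{
    /* Longest worker name that prefixes the requested url wins, so two
     * nodes that both register ajp://127.0.0.1:8009 are indistinguishable. */
    proxy_worker *best = NULL;
    apr_size_t best_len = 0;
    int i;
    for (i = 0; i < n; i++) {
        apr_size_t len = strlen(workers[i]->s->name);
        if (len > best_len && strncmp(workers[i]->s->name, url, len) == 0) {
            best = workers[i];
            best_len = len;
        }
    }
    return best;
}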

@Karm (Member Author) commented Aug 18, 2015

I'm gonna revisit this beauty and try to implement it without re-defining struct apr_reslist_t.

@Karm (Member Author) commented May 31, 2016

@jfclere Gonna do it differently and port it to 1.3.x, 1.2.x and 2.0 (master).

@Karm Karm closed this May 31, 2016