Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*()#1

Merged
leonardvimond merged 1 commit into master from FTClient-FasterSelector on Feb 25, 2016

@leonardvimond
Owner

************ OVERVIEW ************
When a LocationForward has occurred and TAO is searching for a profile it can connect to, all new outgoing requests are blocked by a mutex in TAO_FT_Invocation_Endpoint_Selector::select_primary or TAO_FT_Invocation_Endpoint_Selector::select_secondary until the request in progress has found a profile.
It appears that each request, once it acquires the mutex, tries to connect to each profile of the IOGR as it was when the request arrived, and does not necessarily use the IOGR updated by the first request. If some profiles are unreachable, the connection attempts can take a long time, and all pending requests are delayed as a result.
If a Relative RoundTrip Timeout is configured, these requests may fail with TIMEOUT even though there would have been enough time to get a reply from the new primary.

************ ISSUE ************
My use case is an FT client sending many requests to an FT replicated server while the FT primary server is unplugged from the network.
We expect all requests to be forwarded to the new primary once the switchover is complete, but many requests get TIMEOUT instead.

For a disconnection of 10.100.14.96 at 16:50:01Z with a Relative RoundTrip Timeout (RTTT) of 20s (/var/log/messages-20160214:Feb 12 16:50:01 systint85 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Down), the TCP connection failure is detected after 6s (as expected, thanks to the TCP Keep Alive we have configured):
16:50:08.107683

TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <13602> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
...
16:50:23.041159

TAO_FT (20595|140665202775808) - Got a primary component

Then some reconnection attempts fail after 3s, in accordance with the TCP parameter tcp_retries2=3.
16:50:23.042602

TAO (20595|140665202775808) - IIOP_Connector::begin_connection, to 10.100.14.96:11063 which should block
...
16:50:26.481488

TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <0> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error

A very long time (15s here) elapses between the failure and the first reconnection attempt, which seems to be explained only by the time needed to acquire the mutex in FTSelector.
Every request pays the 3s cost of attempting the unreachable profile, and the last ones end with a TIMEOUT.
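
To make the cost concrete, here is a rough model of the queueing effect, not a measurement. It assumes every request spends exactly 3s on the unreachable profile and that the full 20s RTTT budget is still available, which is optimistic since part of it has already been consumed by the time the failure is detected. The function names are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Simplified model (assumption): each request spends attempt_ms trying the
// unreachable profile while holding the selector mutex, so requests queue up.
// position is the 1-based place of the request in that queue.
std::int64_t serialized_delay_ms(std::int64_t position, std::int64_t attempt_ms) {
    return position * attempt_ms;
}

// Same attempt cost, but with the mutex released before connecting: every
// request pays for its own attempt only, whatever its position.
std::int64_t concurrent_delay_ms(std::int64_t /*position*/, std::int64_t attempt_ms) {
    return attempt_ms;
}

// How many queued requests get through their failed attempt before the
// timeout budget is exhausted, in the serialized case.
std::int64_t requests_within_budget(std::int64_t budget_ms, std::int64_t attempt_ms) {
    return budget_ms / attempt_ms;
}
```

Under these assumptions, requests_within_budget(20000, 3000) is 6, so the seventh queued request has already spent 21s in the selector and cannot complete within the RTTT. Without the serialization, each request loses only its own 3s.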

************ FIX ************
Making a copy of the profiles and releasing the mutex immediately allows all requests to be processed at the same time: they all try to find the right profile concurrently.
This fix has been validated on the old TAO-V161; however, the relevant code appears to have been very stable since then, so it may work the same way in the latest releases.
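
The copy-then-release idea can be sketched as follows. This is a minimal illustration using standard C++ primitives, not the actual TAO code: Profile, EndpointSelector, update_profiles, select and lock_is_free are hypothetical names, and the real selector works on TAO's own profile and lock types.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one profile of the IOGR; not a TAO type.
struct Profile {
    std::string host;
    int port;
};

class EndpointSelector {
public:
    explicit EndpointSelector(std::vector<Profile> profiles)
        : profiles_(std::move(profiles)) {}

    // Called when a LocationForward delivers an updated profile list.
    void update_profiles(std::vector<Profile> profiles) {
        std::lock_guard<std::mutex> guard(lock_);
        profiles_ = std::move(profiles);
    }

    // Tries each profile in order and stores the first one for which
    // try_connect succeeds in chosen. Returns false if none is reachable.
    bool select(const std::function<bool(const Profile&)>& try_connect,
                Profile& chosen) {
        std::vector<Profile> snapshot;
        {
            // Hold the mutex only long enough to copy the profile list.
            std::lock_guard<std::mutex> guard(lock_);
            snapshot = profiles_;
        }
        // The slow connection attempts run on the private copy, without
        // the mutex, so other requests are not blocked behind this one.
        for (const Profile& p : snapshot) {
            if (try_connect(p)) {
                chosen = p;
                return true;
            }
        }
        return false;
    }

    // Exposed only so the example can check that the lock is not held.
    bool lock_is_free() {
        if (lock_.try_lock()) {
            lock_.unlock();
            return true;
        }
        return false;
    }

private:
    std::mutex lock_;
    std::vector<Profile> profiles_;
};
```

The trade-off is that each request works on a snapshot, so a request that copied the list before an update may still try a profile that another request has already found to be unreachable. It pays that cost in parallel with the others rather than in sequence.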

leonardvimond added a commit that referenced this pull request Feb 25, 2016
Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*()
@leonardvimond leonardvimond merged commit a7f0d24 into master Feb 25, 2016