Client throwing AuthenticationException: Invalid credentials after owner member left. #12038

Closed
tkountis opened this issue Dec 21, 2017 · 7 comments
Contributor

@tkountis tkountis commented Dec 21, 2017

Server version: 3.9.2-SNAPSHOT
Client version: 3.9.1

Cluster with 3 members and 2 clients connected to it.
The affected client's owner member is node 10.211.176.189.

Rebooted node 10.211.176.189.

Member logs for 10.211.176.189:
https://gist.githubusercontent.com/tkountis/75020b3540efcc06d9851af923f88e3f/raw/92be6a54258b60ab558635fa4e9984027941ecee/gistfile1.txt

Member logs for 10.211.176.190:

https://gist.githubusercontent.com/tkountis/3df660225d531344eda08f9a2a793025/raw/4c0d1e28aeff61394e164650b69bd36043ea0ae0/gistfile1.txt

Client:

https://gist.githubusercontent.com/tkountis/af694f625ff41ce50957dca5a48dabcc/raw/85c8e1b9d80d96c27f50a2f4f57cc1af89ad87a1/gistfile1.txt

This keeps happening for ~1 min on the client, until the client is able to resume and continue its writes.
The member-left detection window on the members' side is 3-4 seconds, which is usually the case for the client too (under the current configuration). However, when I reboot the owner member of that client, the connection takes ~10 s to close, and the client then spends another ~1 min dealing with the refused connection, which stalls it.

Member

@sancar sancar commented Dec 21, 2017

Here is what I have found so far.
Both 10.211.176.190 and 10.211.176.194 disconnect from the client. The logs do not show why; the only thing we know is that the disconnections were initiated by the servers.

The client's owner is:
Member [10.211.176.189]:1521 - ac5846c5-31cf-41a3-812c-655120d9ee0a

The client tries to open a connection back to 10.211.176.194.

10.211.176.194 rejects the client, because this member (10.211.176.194) says that 10.211.176.189 (the owner of the client) is not part of the cluster.
That means 10.211.176.189 is not in the member list of 10.211.176.194.

The client is trying 10.211.176.189 because it tries all configured members plus the members from the last member list.
Normally the client will try all addresses in the list one by one. Since the logs show only one instance of the exception, we cannot reason about that right now.
The client configuration is important here. If only 10.211.176.189 is configured in the client's address list, then the client could easily get stuck because it has no other member to retry.
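
To make the retry behaviour described above concrete, here is a minimal sketch in plain Java. It is not Hazelcast's actual implementation: the class `OwnerConnectionRetry`, both method names, and the `tryConnect` predicate (a stand-in for a real connect-and-authenticate call) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; all names here are illustrative, not Hazelcast API.
final class OwnerConnectionRetry {

    // Merge the configured addresses with the last known member list.
    // Duplicates are dropped and the configured addresses keep their order.
    static List<String> candidateAddresses(List<String> configured, List<String> lastKnownMembers) {
        Set<String> merged = new LinkedHashSet<>(configured);
        merged.addAll(lastKnownMembers);
        return new ArrayList<>(merged);
    }

    // Walk the candidate list one address at a time, up to attemptLimit rounds.
    // Returns the first address that accepts the connection, or null if none does.
    static String connectToOwner(List<String> candidates, int attemptLimit,
                                 long attemptIntervalMillis, Predicate<String> tryConnect) {
        for (int attempt = 1; attempt <= attemptLimit; attempt++) {
            for (String address : candidates) {
                if (tryConnect.test(address)) {
                    return address;
                }
            }
            if (attempt < attemptLimit) {
                try {
                    Thread.sleep(attemptIntervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return null;
                }
            }
        }
        return null;
    }
}
```

With a single configured address and an empty member list, every round retries the same dead member, which is the "stuck" case mentioned above.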

Contributor Author

@tkountis tkountis commented Dec 21, 2017

Client address config:

clientConfig.getNetworkConfig()
        .addAddress("10.211.176.190:1521", "10.211.176.189:1521", "10.211.176.194:1521");

configs.put("connection_timeout", "5000");
configs.put("retry_attempt_limit", "10");
configs.put("retry_attempt_interval", "1000");
Contributor Author

@tkountis tkountis commented Dec 21, 2017

Collecting logs, will upload shortly.

Member

@sancar sancar commented Dec 21, 2017

With the new logs, one possible explanation for this behaviour is as follows:

It seems that 10.211.176.189 was first separated from the rest of the cluster, because the other members could not ping it. After that, 10.211.176.189 shut down.
If the other members cannot reach this member, the client cannot either. While shutting down, this member logs that it is closing its connections to clients.

2017-12-21 04:24:27 [hz.ShutdownThread] INFO :: [10.211.176.189]:1521 [nonprod] [3.9.2-SNAPSHOT] Connection[id=16, /10.211.176.189:1521->sl73rskwsapd001.hostname.com/10.211.129.43:46651, endpoint=[10.211.129.43]:46651, alive=false, type=JAVA_CLIENT] closed. Reason: TcpIpConnectionManager is stopping

But the client does not receive the TCP EOF packet.
It can detect the loss only after the heartbeat timeout expires (60 seconds in this case), and then continue.
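
A minimal sketch of why a silently dropped connection is noticed only after the heartbeat timeout. This is plain Java, not Hazelcast's code; `HeartbeatMonitor` and its methods are hypothetical names. The 60-second value comes from this thread, while the 5-second check interval in the usage note is an assumption (in the 3.x client the timeout is, as far as I know, controlled by the `hazelcast.client.heartbeat.timeout` property).

```java
// Hypothetical sketch of timeout-based failure detection; not Hazelcast API.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastReadMillis;

    HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastReadMillis = nowMillis;
    }

    // Called whenever any data (including a heartbeat reply) arrives from the peer.
    void onDataReceived(long nowMillis) {
        lastReadMillis = nowMillis;
    }

    // Without a TCP EOF or reset, silence is the only signal: the peer counts
    // as alive until nothing has been read for the whole timeout.
    boolean isAlive(long nowMillis) {
        return nowMillis - lastReadMillis < timeoutMillis;
    }

    // Upper bound on detection delay when isAlive is polled every checkIntervalMillis.
    static long worstCaseDetectionMillis(long timeoutMillis, long checkIntervalMillis) {
        return timeoutMillis + checkIntervalMillis;
    }
}
```

With a 60 000 ms timeout and an assumed 5 000 ms check interval, the upper bound is 65 000 ms, the same order as the roughly one-minute stall reported here.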

We need to understand the test scenario. How can it be that no TCP packet to or from member 10.211.176.189 reaches anywhere before it shuts down?


@ManojaMishra ManojaMishra commented Dec 21, 2017

Hi Sancar,

Here is the test scenario.

Test Setup:
a. Server cluster size: 3 members (10.211.176.189, 10.211.176.190, 10.211.176.194)
b. 2 clients connected to the cluster
c. The clients use mutual SSL authentication and UsernamePasswordCredential to authenticate with the members via ClientLoginModule

Test scenario:

  1. Client-1 is connected to 10.211.176.189 as its owner member and performs write operations for a key. The partition of the key is owned by member node 10.211.176.189
  2. Client-2 is connected to 10.211.176.194 as its owner member and performs write operations for a key. The partition of the key is owned by member node 10.211.176.194
  3. Rebooted node 10.211.176.189 using the "reboot" command
  4. Observed that Client-1 stalled for ~1 min 10 sec before it successfully resumed its write operations
Member

@sancar sancar commented Dec 27, 2017

From the logs, it seems that the reboot command does not let the instance close gracefully.
Network-layer communication is cut before the instance gets a chance to say goodbye. That is why the client and the servers have to detect this via heartbeat (or ICMP).

Since the members already have ICMP ping, they were able to detect it within seconds.
The following PRs implement ICMP ping on the clients,
so that when a member is gone ungracefully, clients can detect it within seconds:
#12049
#12048
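
As a rough illustration of ping-based detection (not the code from the PRs above), the sketch below counts consecutive failed probes and suspects a host once a threshold is reached. `PingFailureDetector`, its methods, and the threshold are hypothetical. The default probe uses the JDK's `InetAddress.isReachable`, which sends an ICMP echo when the process has permission and otherwise falls back to a TCP connection attempt on port 7.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.function.Predicate;

// Hypothetical sketch of a ping-based failure detector; not Hazelcast API.
final class PingFailureDetector {
    private final int maxFailedAttempts;
    private final Predicate<String> probe;
    private int consecutiveFailures;

    PingFailureDetector(int maxFailedAttempts, Predicate<String> probe) {
        this.maxFailedAttempts = maxFailedAttempts;
        this.probe = probe;
    }

    // Probe backed by the JDK reachability check; any I/O error counts as unreachable.
    static Predicate<String> reachabilityProbe(int timeoutMillis) {
        return host -> {
            try {
                return InetAddress.getByName(host).isReachable(timeoutMillis);
            } catch (IOException e) {
                return false;
            }
        };
    }

    // Runs one probe. Returns true once the host has failed maxFailedAttempts
    // probes in a row; a single successful probe resets the counter.
    boolean pingAndCheckSuspected(String host) {
        if (probe.test(host)) {
            consecutiveFailures = 0;
        } else {
            consecutiveFailures++;
        }
        return consecutiveFailures >= maxFailedAttempts;
    }
}
```

Probing once per second with a threshold of 3, for example, would flag an unreachable member after about 3 seconds instead of waiting for the full heartbeat timeout; both numbers are illustrative rather than taken from this issue.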

Member

@sancar sancar commented Jan 3, 2018

Fixed by #12048 and #12049.
