redis-cluster attempting to connect to a stale redis server IP and not recovering #133
Hi, can someone please take a look at this one? We are a bit blocked by this currently.
That sounds strange.
Hi @bjosv Yes, the cluster nodes command was separately run on the redis-server when the issue happened, and the output (captured below) matches the kubectl output.
We don't set redisClusterSetOptionMaxRetry() explicitly, so the default value should be in effect, I guess. Also, I couldn't find any logs suggesting that we ran the cluster nodes command, and I suppose if it had run, it would have self-resolved. One more point to add: when hiredis calls back into the application during a tcp disconnect, the application doesn't do anything with the redisAsyncContext* pointer that is sent back (I mean no delete of the object or anything). I hope that is fine.
Detailed log set from the beginning. Log file top entries (beginning of the connection with redis):

{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584874][connectCallback][98] connectCallback invoked
{"severity":"debug","debug-string":"[2023-01-12T07:45:45.904981][connectCallback][98] connectCallback invoked
{"severity":"debug","debug-string":"[2023-01-12T07:47:30.263102][disconnectCallback][149]Error: Server closed the connection
{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185349][connectCallback][98] connectCallback invoked
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.186998][CommandCb][1480]Redis command timeout
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.189608][disconnectCallback][149]Error: Timeout
{"severity":"debug","debug-string":"[2023-01-12T07:48:28.191117][disconnectCallback][149]Error: Timeout
{"severity":"debug","debug-string":"[2023-01-12T07:48:34.186333][disconnectCallback][149]Error: Connection reset by peer
{"severity":"debug","debug-string":"[2023-01-12T08:39:36.675527][disconnectCallback][149]Error: Timeout
{"severity":"debug","debug-string":"[2023-01-12T08:53:23.599153][disconnectCallback][149]Error: Timeout

Once the connection is established, there are application commands that are triggered. We don't have prints showing when in this sequence they are triggered. There are application commands that time out in between (CommandCb). The other logs are all connect/disconnect related.
Hi @bjosv, do you have any thoughts on this?
One other thing I saw while going through the code is that the max_retry_count check before making a cluster commands call is done in the following methods.
But in the scenario mentioned in this ticket, I don't think we go through any of these methods. So it looks like max_retry_count never gets compared against, which could explain why we remain stuck in the same state.
I only see 2 CommandCb logs, which, if they are errors, would increase the internal failure counter. Normally the query from a client would get a MOVED response, which triggers the renewal of hiredis-cluster's slot distribution. In this case, in the async code, 5 send errors are allowed before the renewal. Each time the client tries to send the command it first attempts to connect to the redis node.
This is legacy behavior and a bit strange, and it would be nice to have it changed.
Ya, there are more than 5; I just pasted logs from the top of the logfile. If you see my initial post, it goes on for >2 hrs. But one thing to note here is that it's not actually a response coming from the server that the application is logging: we have configured a command_timeout of 2 sec, and since no response is seen within that period, a timeout happens which gets logged by the application with the CommandCb tag. And as I mentioned above, the places where we check max_retry_count are mostly skipped, as we never go through one of those flows. Edit: There is no response from the IP because that IP (192.168.228.26) is no longer valid. So obviously all the commands will eventually see a timeout.
Hi @bjosv, do you agree with the above observation?
Hi @bjosv Correcting my statement above:
But the issue here is that, if traffic is low and the key/slot never lands on the failing node, redis-cluster can just continue in this bad state for a long time (as seen here, approx. 2 hrs with only 3 callback errors) against the default value of 5.
One more query in the same context: I was also wondering why a connect() call would succeed in such a scenario. If the remote IP is not reachable, shouldn't the connect() itself fail? The fact that the application is able to print connectCallback-tagged messages is an indication that the call succeeded with REDIS_OK. Do you have any thoughts on this @bjosv?
Yes, I believe it was done this way to handle heavy-traffic scenarios, giving an imperfect behavior for low-traffic scenarios.
Since the async API is used, the connect will only register with the event system, which normally won't fail.
Ah, yes, I see now that you got calls to the disconnect callback, which should only be received when there has been a successful connect. Probably from 192.168.228.26. Do you have authentication enabled on Redis, using a password or username/password?
Hi @bjosv
Yes, I keep getting REDIS_OK in the connectCallbacks.
Isn't connect() supposed to be a blocking call? I assume you are talking about redisAsyncConnectWithOptions(), which then calls redisContextConnectBindTcp(), where I can see that after the connect() call it checks the status of connect().
But the IP is no longer present in the cluster. I was thinking there is some issue here due to which it's wrongly claiming it's connecting.
There is no password authentication/TLS used for now.
I would guess the status check is skipped for async. hiredis net.c
I think this can be indicated and viewed in the hiredis-cluster async tests, I will double-check.
Hmm. Yes, I guess that definitely looks like a possible gap. However, looking further into that flow, I see that once we come out of that else-if block, we also set the flag to REDIS_CONNECTED.
And when I searched for places where hiredis could call back into the onConnect() API with success, only one search result came up: https://github.com/redis/hiredis/blob/7583ebb1b271e7cb8e44f8211d02bd719d4a4bd1/async.c#L672 redisCheckConnectDone() internally does a connect() on the socket, and only if the return code is 0 should it invoke line 627. --> Will connect() continue in non-blocking mode here? Also, __redisAsyncHandleConnect() is called only if the flags don't have REDIS_CONNECTED set.
But if the errno == EINPROGRESS block had been hit, the flag would already have been set, and we would never invoke the onConnect() callback. I'm now thinking: is __redisAsyncHandleConnect() at fault here?
Just as a note, there is this in |
Ya, I suppose I missed noting this. :) |
Ok, still trying to understand where the REDIS_OK connect callback would be called from. I guess when there is a redisAsyncHandleWrite() or a redisAsyncHandleRead() operation, it will invoke __redisAsyncHandleConnect(), which in turn calls connect(), and connect() should return 0? This is the only place where it could invoke the connect callback with REDIS_OK, right?
Exactly what I think, |
Ya, this flapping went on for at least 2 hrs, so it is very unlikely that could be the case.
One more thing I was thinking: since the IP doesn't exist, wouldn't the number of connectCallback prints translate to the number of commands attempted towards the node? I can see it's definitely more than 5.
There should be a connectCallback (connect attempt) for each CommandCb that timed out towards IP x.26.
Yes, in the logfile I am looking at, I can see that in this time interval the application Timeout error occurred 7 times.
Just one other thing that I thought of: this would just be making sure they are all in sync, since hiredis-cluster can get its slot distribution from any master node, which might not be the one you send the command to.
Which operator are you using btw? |
Hi @bjosv The cluster nodes command is run as below on the redis-cluster kubernetes service FQDN:

redis-cli --cluster call redis-cluster.local:6379 cluster nodes

It is not run for all nodes separately. OK, I see you think the operator could also be involved in this. We are using the IBM redis-operator.
The root problem, that a timeout was not triggering a slotmap update, is corrected by #144.
Hi @bjosv,
We came across a case wherein hiredis-cluster seems to be trying to connect to a stale redis-server IP while issuing commands to the redis server. This connection flapping kept happening (approx. 2 hrs) until one more redeploy was done for the application image (which has redis-cluster integrated).
From the server-side logs, it looks like this IP was alive for some period of time in the past, but not for at least the previous couple of hours when the issue was seen. There was an image upgrade done on the redis server nodes which could have resulted in the change of IPs.
As part of the callbacks registered by the application with hiredis, the below prints were seen during the redis command send calls.
kubectl get pods output was collected when the issue was seen, and at that point the list of IPs is as captured below: 3 primary nodes with 2 replicas per primary (9 nodes in total). Here, the IP which redis-cluster is trying to connect to, 192.168.228.26, does not appear.
Do you have some clue on what could be resulting in this behaviour?