
Periodic IO stalls/pauses #1494

Closed
jchambers opened this issue Nov 3, 2020 · 25 comments
Labels: status: invalid (An issue that we don't feel is valid)

Comments

@jchambers
Contributor

jchambers commented Nov 3, 2020

We've observed behavior in our production systems where all Lettuce operations will (seemingly without provocation) grind to a halt, then resume just as abruptly. We've directly measured "stalls" between 3 and 5 seconds, and can infer the existence of stalls shorter than 3s and longer than 10s from other metrics. Stalls appear to happen on the order of once every host-week (which can add up quickly when there are lots of hosts in play). This generally results in a batch of RedisCommandTimeoutExceptions and a pileup of pending requests.

We've observed this behavior with both Lettuce 5.3.4 and 6.0.0 under Ubuntu 18.04. We have not yet tried 5.3.5 or 6.0.1, but (unless this is somehow a TCP_NODELAY issue) I haven't spotted anything in the changelog that seems relevant.

Our setup is that we have tens of application servers, each of which has a default ClientResources instance shared between three RedisClusterClient instances; a rough sketch of this wiring follows the list below. We have a default command timeout of 3 seconds. Our application servers have 8 physical cores, so my understanding is that our IO thread pools have 8 threads. Our application servers live in AWS EC2 and communicate with three Redis clusters hosted by AWS ElastiCache:

  • A general cache cluster with two shards (each with a leader and replica) that mostly receives simple GET/SET commands of relatively consistent sizes
  • A metrics cluster with two shards (each with a leader and replica) that also mostly receives simple GET/SET commands
  • A notification cluster with 8 shards (one leader/replica per shard) that receives a relatively high volume of messages of varying sizes and also notifies listeners of new messages via keyspace notifications

Between all three clusters, application instances issue somewhere between 5,000 and 10,000 Redis commands per second.
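
For concreteness, a minimal sketch of the wiring described above; the hostnames and the specific DefaultClientResources/setDefaultTimeout calls shown here are illustrative assumptions, not our actual configuration:

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import java.time.Duration;

public final class RedisClientSetupSketch {

    public static void main(String[] args) {
        // One default ClientResources instance (I/O and computation thread pools)
        // shared across all three cluster clients.
        ClientResources resources = DefaultClientResources.create();

        // Hypothetical ElastiCache configuration endpoints.
        RedisClusterClient cacheClient = RedisClusterClient.create(
                resources, RedisURI.create("redis://cache.example.internal"));
        RedisClusterClient metricsClient = RedisClusterClient.create(
                resources, RedisURI.create("redis://metrics.example.internal"));
        RedisClusterClient notificationClient = RedisClusterClient.create(
                resources, RedisURI.create("redis://notifications.example.internal"));

        // Default command timeout of 3 seconds, applied per client.
        Duration commandTimeout = Duration.ofSeconds(3);
        cacheClient.setDefaultTimeout(commandTimeout);
        metricsClient.setDefaultTimeout(commandTimeout);
        notificationClient.setDefaultTimeout(commandTimeout);
    }
}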

We've instrumented instances by logging from io.lettuce.core.protocol at DEBUG and by observing TCP traffic. When we experience a stall, the symptoms are:

  • All traffic from the affected instance to ALL shards in ALL Redis clusters stops completely at a TCP level
  • Other application servers are unaffected and continue communicating with Redis nodes normally
  • Lettuce continues to log attempts to write commands, though the write attempts don't resolve for several seconds and we see no corresponding outbound packets at a TCP level
  • Keyspace notifications from the notification cluster continue to arrive at the affected instance as expected and are ACKed at the TCP level even though they're not processed at the application layer; I'm afraid I can't tell from the Lettuce logs if they're "seen" by Lettuce when they arrive or if that's also delayed

A representative log line when attempting to write a command during a stall:

DEBUG [2020-10-30 19:19:48,265] io.lettuce.core.protocol.DefaultEndpoint: [channel=0x5154e1c6, /10.0.1.170:34672 -> /10.0.0.109:6379, epid=0x49] write() writeAndFlush command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]

…which will resolve several seconds later:

DEBUG [2020-10-30 19:19:50,184] io.lettuce.core.protocol.DefaultEndpoint: [channel=0x5154e1c6, /10.0.1.170:34672 -> /10.0.0.109:6379, epid=0x49] write() done

We do not see any reads reported by Lettuce during a stall, though we do see them before and after.

I have not yet found anything in our application logs, Lettuce's logs, or TCP events that predicts when one of these stalls will occur. We will often see TCP retransmissions or "zero window" packets during or after a stall, but have not yet seen a consistent pattern prior to the stall. We have not observed any reconnections or cluster topology changes in our own logs, and to the best of my knowledge, that's supported by Lettuce's own logging because channel IDs remain consistent before and after a stall.

I do usually see an increase in allocated direct memory in the minute or two before one of these stalls, but we also see increases in allocated direct memory that do not precede a stall.

I'm aware that similar issues have been reported in the past, but have generally been closed due to a lack of debug information. My next step is to capture some thread dumps when these stalls occur. In the meantime, I think we can address some hypotheses with reasonable (but not perfect) confidence:

  • I would be surprised if something in our application space were blocking IO threads because:
    • We do not issue any blocking Redis commands
    • With the exception of keyspace notifications, we do not call any commands that result in a callback from Lettuce into our code
    • Keyspace notifications are immediately dispatched to a separate ExecutorService to avoid blocking Lettuce/Netty IO threads (a sketch of that hand-off follows this list)
    • In keeping with our application's underlying framework, we use sync() commands for virtually everything
    • It seems unlikely that we'd manage to block all IO threads simultaneously
    • The duration of a stall is not closely correlated with our command timeout (i.e. a command timing out does not appear to unclog things)
  • This does not appear to be correlated with GC activity; GC time is consistent before, during, and after a stall
  • The direct memory allocation clue may point to us sending an abnormally large (several megabytes?) message through our notification cluster, but it's not clear to me how that would block all other traffic; I also can't find a corresponding increase in throughput at a TCP level (and that would certainly be at odds with the observed halt in outbound traffic), and it's also not clear why that would prevent reads, though I'm less familiar with Netty's internals on that point
  • This does not appear to be a problem with the cluster itself because traffic continues as normal to all other application servers
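
As referenced in the list above, a minimal sketch of the keyspace-notification hand-off; the listener and executor shown here are illustrative, not our actual code:

import io.lettuce.core.pubsub.RedisPubSubAdapter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class KeyspaceNotificationDispatchSketch extends RedisPubSubAdapter<String, String> {

    private final ExecutorService notificationExecutor = Executors.newFixedThreadPool(4);

    @Override
    public void message(String channel, String message) {
        // Called on a Lettuce/Netty I/O thread; do nothing here except enqueue.
        notificationExecutor.execute(() -> handleNotification(channel, message));
    }

    private void handleNotification(String channel, String message) {
        // Application-level processing happens here, on the executor's threads,
        // so the I/O thread returns to the event loop immediately.
    }
}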

I expect to have a thread dump in the next few days. In the meantime, while I can't provide the raw debug data in its entirety, I'd be happy to answer any questions about our debug data and go digging for potentially-relevant excerpts that I can share here.

@jchambers added the type: bug (A general bug) label on Nov 3, 2020
@jchambers
Contributor Author

jchambers commented Nov 3, 2020

The direct memory allocation clue may point to us sending an abnormally large (several megabytes?) message through our notification cluster…

I should also add that this hypothesis isn't supported by other application-level metrics; we have a pretty good idea of max notification sizes, and unless we have some inconveniently-missing data points, there's no indication that any unusually large payloads were in play at the time of these stalls.

@mp911de
Collaborator

mp911de commented Nov 4, 2020

Thanks a lot for the elaborate analysis. Looking at the log lines, the stall doesn't seem to be related to I/O since calls to DefaultEndpoint happen from the code path that submits a command into the write queue.

DefaultEndpoint.write(…) is the method that requires a bit more attention. Inside this method, we use a shared lock to ensure mutual exclusion with methods that must not run in parallel with a write. Specifically, notifyDrainQueuedCommands(…) and notifyChannelInactive(…) (both called when a channel is closed), cancelBufferedCommands(…) (called on close and reset), and notifyException(…) are mutually exclusive operations that would block a write.

SharedLock.incrementWriters(…) performs the wait. The log line write() writeAndFlush command happens just before writing the command to the netty channel. After the write, we attach a future listener.
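
For illustration, a much-simplified sketch of the pattern described above (this is not Lettuce's actual SharedLock implementation): writers may proceed concurrently, while an exclusive operation such as draining or cancelling queued commands blocks new writers and waits for in-flight writes to finish.

import java.util.concurrent.atomic.AtomicInteger;

final class SharedLockSketch {

    private final AtomicInteger writers = new AtomicInteger();
    private boolean exclusive;

    // Write path: the write() call increments the writer count and would block here
    // while an exclusive section (e.g. channel close handling) is running.
    synchronized void incrementWriters() throws InterruptedException {
        while (exclusive) {
            wait();
        }
        writers.incrementAndGet();
    }

    synchronized void decrementWriters() {
        if (writers.decrementAndGet() == 0) {
            notifyAll();
        }
    }

    // Exclusive path: close/reset-style operations wait for in-flight writes,
    // then block new writers for the duration of the action.
    synchronized void doExclusive(Runnable action) throws InterruptedException {
        while (writers.get() > 0) {
            wait();
        }
        exclusive = true;
        try {
            action.run();
        } finally {
            exclusive = false;
            notifyAll();
        }
    }
}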

Are you able to deploy a customized Lettuce version to your systems so we can continue the investigation? It would be helpful to identify which method causes the delay (channel.writeAndFlush(…) or channelFuture.addListener(…)).

It would also make sense to check which thread the debug log stems from. Being on the EventLoop thread causes a direct write in netty; being on a different thread enqueues the action to be invoked by an EventLoop thread.
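
For reference, a tiny sketch of the distinction being described; netty performs the equivalent check internally, so this is purely illustrative and not Lettuce code:

import io.netty.channel.Channel;

final class EventLoopWriteSketch {

    // A write issued from the channel's own event-loop thread runs inline; a write
    // from any other thread is queued as a task for that event loop. Logging the
    // calling thread next to the write() debug line therefore tells us which path
    // the stalled writes are taking.
    static void write(Channel channel, Object command) {
        if (channel.eventLoop().inEventLoop()) {
            channel.writeAndFlush(command);                                    // direct write
        } else {
            channel.eventLoop().execute(() -> channel.writeAndFlush(command)); // queued task
        }
    }
}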

@mp911de added the status: waiting-for-feedback (We need additional information before we can continue) label on Nov 4, 2020
@jchambers
Contributor Author

Thanks for the quick reply!

It would also make sense to check which thread the debug log stems from. Being on the EventLoop thread causes a direct write in netty; being on a different thread enqueues the action to be invoked by an EventLoop thread.

Sure; I think I can just tweak our log message format to include the calling thread. Did you have something else in mind?

Are you able to deploy a customized Lettuce version to your systems so we can continue the investigation? It would be helpful to identify which method causes the delay (channel.writeAndFlush(…) or channelFuture.addListener(…)).

Potentially yes! Do you have an estimate for when that build could be ready? If you expect it to be ready quickly, I can roll together the custom build, the thread dump logic, and the logging changes into a single deployment.

Cheers!

@mp911de
Collaborator

mp911de commented Nov 4, 2020

I'd prefer that you add the log lines to your fork of Lettuce, as that allows faster iteration/feedback cycles.

@jchambers
Contributor Author

Okay; I think I'm a little confused about next steps then. Is the idea that you will produce a branch with the specific information you're looking for, or are you asking me to instrument everything in my own branch?

@jchambers
Contributor Author

I'm sorry—I re-parsed your message and it's clear that you're asking me to put together the build. I can certainly do that; can you clarify what, specifically, you're hoping to learn? Just timings/threads around writeAndFlush and addListener?

@mp911de
Collaborator

mp911de commented Nov 4, 2020

Just timings/threads around writeAndFlush and addListener?

Yes, basically to narrow down where this comes from.
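
Such instrumentation might look roughly like the following sketch (illustrative only; the actual debug branch discussed below may differ, and the helper names are hypothetical):

import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import java.util.concurrent.TimeUnit;

final class WriteTimingSketch {

    // Bracket the two suspect calls with System.nanoTime() and log the elapsed time
    // together with the calling thread.
    static void timedWriteAndFlush(Channel channel, Object command) {
        long start = System.nanoTime();
        ChannelFuture future = channel.writeAndFlush(command);
        log("writeAndFlush", start);

        start = System.nanoTime();
        future.addListener((ChannelFutureListener) f -> { /* usual completion handling */ });
        log("addListener", start);
    }

    private static void log(String stage, long startNanos) {
        long micros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - startNanos);
        System.out.println(stage + " took " + micros + " µs on " + Thread.currentThread().getName());
    }
}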

@jchambers
Contributor Author

Understood. Will report back soon. It may take a couple days because America.

@jchambers
Contributor Author

Don't mean to press you for a detailed code review, but as an "are we on the same page?" check, does 6.0.x...jon-signal:debug_timing_logging look reasonable to you?

@mp911de
Collaborator

mp911de commented Nov 5, 2020

Looks good. Note that activation should happen only once per TCP connection.

@jchambers
Contributor Author

jchambers commented Nov 10, 2020

I've captured logs/thread dumps around a few instances of command timeouts. There's a LOT in here and I'll need a little time to sift through it all and extract the important pieces, but I wanted to share an update and be clear that things are moving forward.

@jchambers
Contributor Author

The logs I've captured so far are less clear-cut than the ones from previous captures, and I'm going to adjust our strategy a bit. One thing I HAVE noticed from the data we've collected so far is that DefaultEndpoint#channelWriteAndFlush(RedisCommand<?, ?, ?>) will sometimes take tens of milliseconds for trivial commands (like a simple GET), but that seems to happen more frequently in the 100-ms period after a command has timed out, not before. I haven't yet spotted any writeAndFlush calls that are taking more than 100ms. None of the other things we're measuring ever seem to get much above 1ms.

Analysis is still very much in progress (and a shift in data collection strategy may be in order on my part), but I wanted to share some early findings.

@jchambers
Contributor Author

I'm continuing to capture and analyze new data. It's still too soon to say for sure, but it looks like the nature of this issue has changed since I deployed our custom build of Lettuce. Notably, it's derived from Lettuce 6.0.1 instead of Lettuce 6.0.0; that was mostly an oversight on my part, but now I'm wondering if something in #1462 or #1476 wound up making a significant difference after all.

It may also be that there are multiple phenomena at work, and this is just weird sampling/luck on my part.

That said, what I'm seeing now is that, rather than all channels grinding to a halt simultaneously, only one will. And, significantly, this time, a stall begins exactly when we see a group of TCP retransmissions from Lettuce to the Redis server and resolves almost exactly 3 seconds (our command timeout) later.

Time               Source        Destination         Length    Info
15:56:29.369165    Lettuce       Redis server        336       34086 → 6379 [PSH, ACK] Seq=88610 Ack=160958 Win=4498 Len=268 TSval=1499824324 TSecr=3663545965
15:56:29.392541    Lettuce       Redis server        336       34086 → 6379 [PSH, ACK] Seq=88878 Ack=160958 Win=4498 Len=268 TSval=1499824347 TSecr=3663545965
15:56:29.398096    Lettuce       Redis server        336       34086 → 6379 [PSH, ACK] Seq=89146 Ack=160958 Win=4498 Len=268 TSval=1499824353 TSecr=3663545965
15:56:29.412261    Lettuce       Redis server        336       [TCP Retransmission] 34086 → 6379 [PSH, ACK] Seq=89146 Ack=160958 Win=4498 Len=268 TSval=1499824367 TSecr=3663545965
15:56:29.490672    Lettuce       Redis server        336       34086 → 6379 [PSH, ACK] Seq=89414 Ack=160958 Win=4498 Len=268 TSval=1499824445 TSecr=3663545965
15:56:29.498624    Lettuce       Redis server        336       34086 → 6379 [PSH, ACK] Seq=89682 Ack=160958 Win=4498 Len=268 TSval=1499824453 TSecr=3663545965
15:56:29.512264    Lettuce       Redis server        336       34086 → 6379 [PSH, ACK] Seq=89950 Ack=160958 Win=4498 Len=268 TSval=1499824467 TSecr=3663545965
15:56:29.724267    Lettuce       Redis server        1408      [TCP Retransmission] 34086 → 6379 [PSH, ACK] Seq=88610 Ack=160958 Win=4498 Len=1340 TSval=1499824679 TSecr=3663545965
15:56:30.156271    Lettuce       Redis server        1408      [TCP Retransmission] 34086 → 6379 [PSH, ACK] Seq=88610 Ack=160958 Win=4498 Len=1340 TSval=1499825111 TSecr=3663545965
15:56:30.988264    Lettuce       Redis server        1408      [TCP Retransmission] 34086 → 6379 [PSH, ACK] Seq=88610 Ack=160958 Win=4498 Len=1340 TSval=1499825943 TSecr=3663545965
15:56:32.652265    Lettuce       Redis server        1408      [TCP Retransmission] 34086 → 6379 [PSH, ACK] Seq=88610 Ack=160958 Win=4498 Len=1340 TSval=1499827607 TSecr=3663545965
15:56:32.653119    Redis server  Lettuce             80        [TCP Previous segment not captured] 6379 → 34086 [ACK] Seq=165087 Ack=90218 Win=1551 Len=0 TSval=3663549309 TSecr=1499824467 SLE=88610 SRE=89950
15:56:32.653128    Lettuce       Redis server        1516      34086 → 6379 [ACK] Seq=90218 Ack=160958 Win=4498 Len=1448 TSval=1499827608 TSecr=3663545965

The packets that are getting retransmitted are both UNSUBSCRIBE commands that look pretty normal to me. The full command fits inside a single packet and appears well-formed.

While this stall is happening, Lettuce's logs show Lettuce happily continuing to write and flush commands, but receiving no data back. As soon as the stall resolves, Lettuce reports that it's receiving data again (and has a backlog of commands on the stack).

So in this most recent batch of samples, it looks like Lettuce is mostly doing the right thing and the problem is happening deeper in the stack. At the same time, it's fishy to me that data starts flowing again as soon as the first command times out (not that I can think of anything Lettuce itself could be doing to produce the observed effect at the TCP level).

I'll certainly acknowledge the possibility that these stalls are more frequent than the ones I'm observing, and that it's just a coincidence that the stalls long enough to cause a command timeout last almost exactly 3s.

I'm going to continue to capture samples to see if this pattern holds.

@mp911de
Collaborator

mp911de commented Nov 12, 2020

Wow, that's getting pretty weird.

Can you correlate anything like keep-alive or send/receive buffer effects with the timeouts? At this point, maybe even something holds a monitor somewhere and the I/O thread isn't able to continue.

@jchambers
Contributor Author

I've just captured two more snapshots that show almost exactly the same behavior: we retransmit a TCP packet to the Redis server, then the connection essentially goes dead for three seconds. One of the snapshots happened while issuing the same kind of UNSUBSCRIBE command, which had me very suspicious. The second came from a different part of our codebase, though, and was for an EVALSHA command (so we can't isolate this to just pub/sub connections, unfortunately).

Can you correlate anything like keep alive or send/receive buffer effects with the timeouts?

Not yet, but I'll see what I can find.

At this point, maybe even something holds a monitor somewhere and the I/O Thread isn't able to continue.

I think we can test and refute that hypothesis with the information we already have (unless I'm misunderstanding which thread you mean). According to the Lettuce debug logs, the thread that handles the frozen channel (lettuce-nioEventLoop-4-4 in one of these examples) continues to process events for other channels while the stall is in effect, so I don't think that's our culprit.

@mp911de
Collaborator

mp911de commented Nov 12, 2020

Thanks, I missed that the thread continues to progress on other channels. If you see TCP retransmission, then this sounds a bit like packet loss.

@jchambers
Contributor Author

…refute that hypothesis with the information we already have…

I realize this may have come across as snarky, which wasn't my intent! I just mean that I checked the logs after you asked and didn't need to capture new data. I don't think the behavior of the IO thread was obvious from the prior discussion. Please pardon the poor phrasing on my part!

If you see TCP retransmission, then this sounds a bit like packet loss.

I agree, though the consistent three-second timing remains suspicious to me. I'll see what I can learn on the AWS side of things.

@jchambers
Contributor Author

jchambers commented Nov 13, 2020

Ah—an important clue: the first four default TCP retransmission delays (with exponential backoff) are 200ms, 400ms, 800ms, and 1600ms. That means four retransmissions take almost exactly three seconds. I'll try moving our timeout to 3.5 seconds to see how, or if, that changes things.
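
Spelling that out (assuming the retransmission timer starts at roughly the Linux minimum RTO of 200 ms and doubles on each attempt):

200 + 400 + 800 + 1600 ms = 3000 ms, i.e. cumulative retransmissions at roughly 0.2 s, 0.6 s, 1.4 s, and 3.0 s after the original send, which is right at our 3-second command timeout.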

@mp911de
Collaborator

mp911de commented Nov 13, 2020

No worries, I understand that getting to the root cause can be a pretty extensive undertaking. Thanks for keeping us posted. TIL that retransmission windows are so close together. I somehow expected multiple seconds (something above 10 or 30 seconds).

@jchambers
Contributor Author

I'm continuing to work with AWS on this issue, but don't have anything to report from that thread yet.

In thinking through what could be happening, I tried to more closely align some things in the Lettuce logs with our packet captures and noticed something curious. Here's another representative stall:

Time               Source          Destination    Length    Info
18:01:14.015652    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=268 TSval=1646985261 TSecr=3120642697
18:01:14.029754    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=86975 Ack=147196 Win=3591 Len=268 TSval=1646985275 TSecr=3120642697
18:01:14.042778    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=87243 Ack=147196 Win=3591 Len=268 TSval=1646985288 TSecr=3120642697
18:01:14.056651    Lettuce         Redis server    336      [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=87243 Ack=147196 Win=3591 Len=268 TSval=1646985302 TSecr=3120642697
18:01:14.099746    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=87511 Ack=147196 Win=3591 Len=268 TSval=1646985345 TSecr=3120642697
18:01:14.159235    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=87779 Ack=147196 Win=3591 Len=268 TSval=1646985405 TSecr=3120642697
18:01:14.169596    Lettuce         Redis server    333      51434 → 6379 [PSH, ACK] Seq=88047 Ack=147196 Win=3591 Len=265 TSval=1646985415 TSecr=3120642697
18:01:14.170204    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=88312 Ack=147196 Win=3591 Len=268 TSval=1646985416 TSecr=3120642697
18:01:14.188244    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=88580 Ack=147196 Win=3591 Len=268 TSval=1646985434 TSecr=3120642697
18:01:14.206560    Lettuce         Redis server    333      51434 → 6379 [PSH, ACK] Seq=88848 Ack=147196 Win=3591 Len=265 TSval=1646985452 TSecr=3120642697
18:01:14.226694    Lettuce         Redis server    336      51434 → 6379 [PSH, ACK] Seq=89113 Ack=147196 Win=3591 Len=268 TSval=1646985472 TSecr=3120642697
18:01:14.248654    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646985494 TSecr=3120642697
18:01:14.668652    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646985914 TSecr=3120642697
18:01:15.500657    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646986746 TSecr=3120642697
18:01:17.168646    Lettuce         Redis server    1408     [TCP Retransmission] 51434 → 6379 [PSH, ACK] Seq=86707 Ack=147196 Win=3591 Len=1340 TSval=1646988414 TSecr=3120642697
18:01:17.168816    Redis server    Lettuce         80       [TCP Previous segment not captured] 6379 → 51434 [ACK] Seq=151107 Ack=89381 Win=1543 Len=0 TSval=3120645861 TSecr=1646985452 SLE=86707 SRE=88047
18:01:17.168829    Lettuce         Redis server    1516     51434 → 6379 [ACK] Seq=89381 Ack=147196 Win=3591 Len=1448 TSval=1646988414 TSecr=3120642697
18:01:17.168832    Lettuce         Redis server    1516     51434 → 6379 [ACK] Seq=90829 Ack=147196 Win=3591 Len=1448 TSval=1646988414 TSecr=3120642697

I filtered the logs to see what the thread handling that connection was doing at the time:

DEBUG [2020-11-12 18:01:17,013] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.RedisStateMachine: Decode done, empty stack: true
DEBUG [2020-11-12 18:01:17,013] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x9be15e7f, /10.0.2.99:59220 -> /10.0.2.111:6379, chid=0x54] Completing command LatencyMeteredCommand [type=EVALSHA, output=NestedMultiOutput [output=[], error='null'], commandType=io.lettuce.core.cluster.ClusterCommand]
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandEncoder: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Received: 1865 bytes, 1 commands in the stack
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Stack contains: 1 commands
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.RedisStateMachine: Decode done, empty stack: true
DEBUG [2020-11-12 18:01:17,018] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Completing command LatencyMeteredCommand [type=GET, output=ValueOutput [output=REDACTED, error='null'], commandType=io.lettuce.core.cluster.ClusterCommand]
DEBUG [2020-11-12 18:01:17,020] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
DEBUG [2020-11-12 18:01:17,020] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandEncoder: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
DEBUG [2020-11-12 18:01:17,753] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Received: 2533 bytes, 1 commands in the stack
DEBUG [2020-11-12 18:01:17,872] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.CommandHandler: [channel=0x3719aead, /10.0.2.99:39148 -> /10.0.0.109:6379, chid=0x4c] Stack contains: 1 commands
DEBUG [2020-11-12 18:01:17,872] [lettuce-nioEventLoop-4-4] io.lettuce.core.protocol.RedisStateMachine: Decode done, empty stack: true

Even though none of those lines are related to the connection in question (on port 51434), there's a suspicious stall between 18:01:17,020 and 18:01:17,753, and that's right when traffic from the Redis host seems to wake back up abruptly. Checking our application logs, we see a batch of 46 timeouts starting at 18:01:17,017 and ending at 18:01:17,756.

I've attached a slightly longer excerpt from the Lettuce logs that shows the backlog getting cleared out. One thing that's curious to me is that, even though we can see packets arriving during that pause in the logs, Lettuce doesn't notice them for several hundred milliseconds, and then the packet logs and Lettuce logs seem to (temporarily?) lose agreement about what's happening when. My wager is that the IO thread is getting tied up populating stack traces for the RedisCommandTimeoutException instances, then has to catch up on the newly-arrived packets.
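
One cheap way to sanity-check that wager (illustrative only; the class and numbers below are hypothetical, and a real measurement would use JMH and the production call depth):

import java.util.concurrent.TimeUnit;

public final class TimeoutExceptionCostSketch {

    public static void main(String[] args) {
        final int batchSize = 46; // size of the timeout batch seen in our logs
        long start = System.nanoTime();
        for (int i = 0; i < batchSize; i++) {
            // RuntimeException stands in for RedisCommandTimeoutException; the cost
            // being measured is fillInStackTrace(), which runs in the constructor.
            new RuntimeException("Command timed out");
        }
        long micros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
        System.out.println(batchSize + " exceptions constructed in " + micros + " µs");
    }
}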

One thing I think this does tell us is that we're probably not overrunning the TCP receive buffer because we see packets arriving before Lettuce logs that it's processed them (unless there's some weird packet purgatory where something is taking packets from the buffer and holding them for a significant amount of time before Lettuce processes them). It still doesn't explain why the stall happened in the first place, though.

There may also be some new hints in the attached thread log that jump out to an experienced reader that aren't obvious to me.

Anyhow, I don't think there are any earth-shattering revelations in here, and I'll continue to work with the AWS folks. Still, I wanted to share updates as I have them.

Cheers!

@jchambers
Contributor Author

Can you correlate anything like keep alive or send/receive buffer effects with the timeouts?

To address this directly: I haven't seen any signs of buffer "events" anywhere. If we were overrunning our buffers, I'd expect to see zero-window/window-full packets, but those never seem to happen before a stall. Sometimes we'll wind up with a zero-window packet after a stall, when we seem to get a gigantic wave of deferred traffic, but nothing that seems to predict a stall before it happens.

@mp911de
Collaborator

mp911de commented Nov 30, 2020

Is there any update, specifically anything to be done on the Lettuce side?

@jchambers
Contributor Author

Good morning!

No update yet, but I'm actively gathering more information and hope to have an update to share soon. In the meantime, the status is:

  1. We're continuing to work with AWS support. My hope is that this will lead to a firm conclusion one way or the other and we'll just know what's up without having to do more investigation.
  2. I've changed our Lettuce timeout so it no longer coincides with TCP-level retransmission timings. As mentioned earlier, we expect TCP retransmissions with delays of 200ms, 400ms, 800ms, 1600ms, and then 3200ms; cumulatively, that puts the fourth and fifth retransmissions at roughly 3 seconds and 6.2 seconds after the original send. We've changed our Lettuce command timeout to 3.5 seconds, which should move us safely away from both of those thresholds and give us another clear signal about whether Lettuce is involved in the observed behavior (one way to express that change is sketched below).
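
One way to express the change in point 2 (illustrative; whether our deployment sets the timeout via TimeoutOptions, a per-client default, or the RedisURI is an implementation detail):

import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import java.time.Duration;

final class TimeoutAdjustmentSketch {

    // Moves the command timeout off the cumulative TCP retransmission boundaries
    // at ~3.0 s and ~6.2 s by applying a fixed 3.5-second timeout to commands.
    static void applyTimeout(RedisClusterClient client) {
        ClusterClientOptions.Builder builder = ClusterClientOptions.builder();
        builder.timeoutOptions(TimeoutOptions.enabled(Duration.ofMillis(3500)));
        client.setOptions(builder.build());
    }
}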

Thanks for your patience!

@mp911de
Collaborator

mp911de commented Dec 22, 2020

Since we haven't heard any update on this ticket, I'm going to close it. If you would like us to look at this issue, please provide additional information and we will re-open the issue.

@mp911de closed this as completed on Dec 22, 2020
@mp911de removed the status: waiting-for-feedback and type: bug labels on Dec 22, 2020
@mp911de added the status: invalid (An issue that we don't feel is valid) label on Dec 22, 2020
@jchambers
Contributor Author

I understand the decision to close this, but please accept my assurance that the investigation is both active and ongoing on our end. I'll post an update as soon as I have one.

draco1023 pushed a commit to draco1023/Signal-Server that referenced this issue May 6, 2021