RedisClusterClient unable to read partition information and results in ConcurrentModificationException #333

Closed
OutOfBrain opened this Issue Aug 11, 2016 · 8 comments


@OutOfBrain

Hey,

we are using the RedisClusterClient in our setup and have encountered several variations of the following two errors:

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 7558

see [stacktrace1]
see [stacktrace2]

and

Error: com.lambdaworks.redis.RedisException: java.util.ConcurrentModificationException

see [stacktrace3]

This is our configuration / connection code which seems to be the issue:

final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(60, TimeUnit.SECONDS) // this one breaks it
    .enableAllAdaptiveRefreshTriggers()
    .build();
final ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
    .autoReconnect(true)
    .maxRedirects(4)
    .topologyRefreshOptions(refreshOptions)
    .build();

String redisHost = context.getConfiguration().get(REDIS_HOST_KEY);
RedisClusterClient redisClusterClient = RedisClusterClient.create(RedisURI.Builder.redis(redisHost).build());

redisClusterClient.setOptions(clusterClientOptions);

connection = redisClusterClient.connect().sync();

We pinned it down to the marked line:

.enablePeriodicRefresh(60, TimeUnit.SECONDS) 

Without that line we don't experience any of the above exceptions.
The actual Redis command that triggers those exceptions is hkeys:

redis.hmset(playerKey, playerProfile);
// [...snip...]
redis.hkeys(playerKey);  // this triggers the error
// [...snip...]
redis.hdel(...)

Our use case is quite simple - we only use hmset, hkeys, hdel in that order.

tl;dr

For RedisClusterClient, setting ClusterTopologyRefreshOptions with enablePeriodicRefresh(60, TimeUnit.SECONDS) results in ConcurrentModificationException and "Cannot determine a partition to read for slot" exceptions.

Stacktraces

[stacktrace1]

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 1630 (Partitions: Partitions [RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.28', port=7001], nodeId='9fc072f1a3228f3fea613cbe7b50ab791310d625', connected=true, slaveOf='2d9f6f64b566c45f8caf0b8b0c3662185f23041a', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=33, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.24', port=7001], nodeId='8b2442dd5659702c59d1a6f3197b048f490d9727', connected=true, slaveOf='d30ffe3ed863cf028f6cad8f9a6babb97142a630', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=36, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.25', port=6379], nodeId='2494cbc3a5d0d8539d255ee0711f725655a2d06f', connected=true, slaveOf='cd836fbc2f6b9e19f9768702febb4a85759930fd', pingSentTimestamp=0, pongReceivedTimestamp=1470761451926, configEpoch=27, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.27', port=7001], nodeId='8d8f63487f1b1d8f562fee0efcd71d33c38ed612', connected=true, slaveOf='e59e15e90f131fead7cd13b43891144b14773519', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=32, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.25', port=7001], nodeId='2d9f6f64b566c45f8caf0b8b0c3662185f23041a', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452440, configEpoch=33, flags=[MASTER], slot count=3278], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.27', port=6379], nodeId='d61059d427a992748bd63d44f26fa963a823503b', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452956, configEpoch=34, flags=[MASTER], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.24', port=6379], nodeId='4c41014792825ff94a596e86da9fac4da9254585', connected=true, 
slaveOf='d61059d427a992748bd63d44f26fa963a823503b', pingSentTimestamp=0, pongReceivedTimestamp=1470761453466, configEpoch=34, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.23', port=6379], nodeId='d30ffe3ed863cf028f6cad8f9a6babb97142a630', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=0, configEpoch=36, flags=[MASTER, MYSELF], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.23', port=7001], nodeId='e59e15e90f131fead7cd13b43891144b14773519', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=32, flags=[MASTER], slot count=3278], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.28', port=6379], nodeId='cd836fbc2f6b9e19f9768702febb4a85759930fd', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452956, configEpoch=27, flags=[MASTER], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.26', port=6379], nodeId='8c6623b00a586bf9f16f1b88bac1baad218a6cd7', connected=false, slaveOf='null', pingSentTimestamp=1470761451407, pongReceivedTimestamp=0, configEpoch=0, flags=[HANDSHAKE], slot count=0]]) with setting com.lambdaworks.redis.ReadFromImpl$ReadFromMaster@3003697
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:130)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[stacktrace2]

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 7558 (Partitions: Partitions []) with setting com.lambdaworks.redis.ReadFromImpl$ReadFromMaster@408613cc
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:130)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[stacktrace3]

Error: com.lambdaworks.redis.RedisException: java.util.ConcurrentModificationException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:77)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:174)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
... 22 more
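The fail-fast behavior in stacktrace3 can be reproduced in isolation. This is not the library's code, just a minimal illustration of the mechanism: an ArrayList's iterator throws ConcurrentModificationException when the list is structurally modified mid-iteration. In the issue, the periodic topology refresh mutates the partition list on one thread while a command thread iterates it; the sketch below triggers the same check single-threaded for determinism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class ComodificationDemo {

    // Returns true if mutating the list during iteration triggers
    // ArrayList's fail-fast comodification check.
    static boolean demo() {
        List<String> partitions = new ArrayList<>(Arrays.asList("node-a", "node-b", "node-c"));
        try {
            for (String node : partitions) {
                // Stand-in for a topology refresh mutating the list a reader is iterating.
                if (node.equals("node-a")) {
                    partitions.add("node-d"); // structural modification during iteration
                }
            }
        } catch (ConcurrentModificationException e) {
            return true; // the iterator detects the change on its next next() call
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("failedFast=" + demo());
    }
}
```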
@mp911de mp911de added the bug label Aug 12, 2016
@mp911de mp911de added a commit that referenced this issue Aug 12, 2016
@mp911de Add synchronization to Partitions #333
All non-threadsafe access to Partitions is now synchronized. Functional blocks that change the state of Partitions are grouped into atomic blocks that prevent reads of stale data and reads of partially updated data.
4ab6942
@mp911de mp911de added this to the Lettuce 4.2.2 milestone Aug 12, 2016
@mp911de mp911de added a commit that referenced this issue Aug 12, 2016
@mp911de Add synchronization to Partitions #333
All non-threadsafe access to Partitions is now synchronized. Functional blocks that change the state of Partitions are grouped into atomic blocks that prevent reads of stale data and reads of partially updated data.
713c97d
@mp911de
Owner
mp911de commented Aug 12, 2016

Thanks for your bug report. In your case, the periodic topology refresh performs an update every 60 seconds. You also have the adaptive topology update configured, which determines from Redis responses whether to update the topology. In fact, the adaptive update is the better option, as it refreshes the topology only on demand. I propose raising the periodic update interval to 1-2 hours, or getting rid of it altogether.
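Applied to the configuration from the report above, that suggestion could look roughly like this (a sketch against the lettuce 4.x builder API already shown in this issue; the two-hour period is an illustrative value, not a recommendation from the library):

```java
import java.util.concurrent.TimeUnit;

import com.lambdaworks.redis.cluster.ClusterClientOptions;
import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions;

// Sketch: rely primarily on adaptive refresh triggers and either drop the
// periodic refresh or run it far less often, e.g. every 2 hours.
final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(2, TimeUnit.HOURS)   // or omit this line entirely
    .enableAllAdaptiveRefreshTriggers()
    .build();

final ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
    .autoReconnect(true)
    .maxRedirects(4)
    .topologyRefreshOptions(refreshOptions)
    .build();
```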

I deployed a fixed snapshot (4.2.2-SNAPSHOT) to Sonatype's OSS Snapshot Repository. Would you mind giving that snapshot a try with your current settings?

@OutOfBrain

Hey, we tested with 4.2.2-SNAPSHOT and got a NullPointerException:

Error: com.lambdaworks.redis.RedisException: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:77)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:100)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:178)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1797)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:436)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:321)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy18.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:175)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
... 22 more
@mp911de
Owner
mp911de commented Aug 12, 2016

Do you have a bit more of the stack trace? It shows that the NullPointerException is wrapped inside a RedisException, but not where the NPE occurred.

@OutOfBrain

That is the full length of the stack trace, minus three lines of our internal calls.
I think the last part of the stack trace is the interesting bit:

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:175)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
... 22 more

Looks like the NullPointerException originates from there:
https://github.com/mp911de/lettuce/blob/4.2.x/src/main/java/com/lambdaworks/redis/cluster/PooledClusterConnectionProvider.java#L175

@mp911de
Owner
mp911de commented Aug 12, 2016

Thanks for the update. I haven't been able to reproduce the issue so far (single- and multi-threaded connection usage), so it's somewhat cumbersome to nail down the cause. I'm not sure whether synchronizing access to Partitions will fix the issue.

@mp911de mp911de added a commit that referenced this issue Aug 12, 2016
@mp911de Use a read-view for consistent Partition usage during Partitions updates #333

Partitions now uses a cached read-view for query/read operations. This decouples reads from updates to Partitions. The most common case, iteration over Partitions, is now consistent with the state at the time the iterator was obtained. The iterator and its content do not change during partition updates.
279e483
@mp911de
Owner
mp911de commented Aug 12, 2016

I changed Partitions slightly to use a consistent read-view for query/read operations. I'm not fully sure that this will fix the issue, but it at least adds more consistent behavior. I also deployed a new 4.2.2-SNAPSHOT (lettuce-4.2.2-20160812.122340-2.jar) to Sonatype's OSS Snapshot Repository.
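The cached read-view idea can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual lettuce Partitions code: readers get an immutable snapshot, and a topology update builds a fresh list off to the side and publishes it in a single volatile write, so an in-flight iteration never observes a half-applied update.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a cached read-view: updates swap an immutable
// snapshot atomically instead of mutating the list readers iterate.
class PartitionView {
    private volatile List<String> readView = Collections.emptyList();

    void update(List<String> newPartitions) {
        // Build the new view separately, then publish it in one volatile write.
        readView = Collections.unmodifiableList(new ArrayList<>(newPartitions));
    }

    List<String> snapshot() {
        return readView; // iterating this list never observes later updates
    }
}

public class ReadViewDemo {
    public static void main(String[] args) {
        PartitionView view = new PartitionView();
        view.update(Arrays.asList("node-a", "node-b"));
        List<String> before = view.snapshot();
        view.update(Arrays.asList("node-a", "node-b", "node-c"));
        // The earlier snapshot is unaffected by the refresh.
        System.out.println("before=" + before.size() + " after=" + view.snapshot().size());
    }
}
```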

@OutOfBrain

Hey,

we just ran the initial config again with the new 4.2.2-SNAPSHOT and can confirm that no more exceptions are thrown. Seems fixed.
Thanks for the very fast reaction time 👍

@mp911de
Owner
mp911de commented Aug 12, 2016

Great to hear. Closing this ticket. The bugfix will ship with 4.2.2, scheduled for Aug 19.

@mp911de mp911de closed this Aug 12, 2016
@mp911de mp911de added a commit that referenced this issue Aug 16, 2016
@mp911de Use a read-view for consistent Partition usage during Partitions updates #333

Partitions now uses a cached read-view for query/read operations. This decouples reads from updates to Partitions. The most common case, iteration over Partitions, is now consistent with the state at the time the iterator was obtained. The iterator and its content do not change during partition updates.
aca1d34