RedisClusterClient unable to read partition information and results in ConcurrentModificationException #333

Closed
OutOfBrain opened this Issue Aug 11, 2016 · 8 comments


@OutOfBrain

Hey,

we are using the RedisClusterClient in our setup and have encountered several variations of the following two errors:

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 7558

see [stacktrace1]
see [stacktrace2]

and

Error: com.lambdaworks.redis.RedisException: java.util.ConcurrentModificationException

see [stacktrace3]

This is our configuration / connection code which seems to be the issue:

final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(60, TimeUnit.SECONDS) // this one breaks it
    .enableAllAdaptiveRefreshTriggers()
    .build();
final ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
    .autoReconnect(true)
    .maxRedirects(4)
    .topologyRefreshOptions(refreshOptions)
    .build();

String redisHost = context.getConfiguration().get(REDIS_HOST_KEY);
RedisClusterClient redisClusterClient = RedisClusterClient.create(RedisURI.Builder.redis(redisHost).build());

redisClusterClient.setOptions(clusterClientOptions);

connection = redisClusterClient.connect().sync();

We pinned it down to the marked line:

.enablePeriodicRefresh(60, TimeUnit.SECONDS) 

Without that line we don't experience any of the above exceptions.
The actual Redis command that triggers those exceptions is hkeys:

redis.hmset(playerKey, playerProfile);
// [...snip...]
redis.hkeys(playerKey);  // this triggers the error
// [...snip...]
redis.hdel(...)

Our use case is quite simple - we only use hmset, hkeys, hdel in that order.

tl;dr

For RedisClusterClient, setting ClusterTopologyRefreshOptions with enablePeriodicRefresh(60, TimeUnit.SECONDS) results in ConcurrentModificationException and "Cannot determine a partition to read for slot" exceptions.

Stacktraces

[stacktrace1]

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 1630 (Partitions: Partitions [RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.28', port=7001], nodeId='9fc072f1a3228f3fea613cbe7b50ab791310d625', connected=true, slaveOf='2d9f6f64b566c45f8caf0b8b0c3662185f23041a', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=33, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.24', port=7001], nodeId='8b2442dd5659702c59d1a6f3197b048f490d9727', connected=true, slaveOf='d30ffe3ed863cf028f6cad8f9a6babb97142a630', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=36, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.25', port=6379], nodeId='2494cbc3a5d0d8539d255ee0711f725655a2d06f', connected=true, slaveOf='cd836fbc2f6b9e19f9768702febb4a85759930fd', pingSentTimestamp=0, pongReceivedTimestamp=1470761451926, configEpoch=27, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.27', port=7001], nodeId='8d8f63487f1b1d8f562fee0efcd71d33c38ed612', connected=true, slaveOf='e59e15e90f131fead7cd13b43891144b14773519', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=32, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.25', port=7001], nodeId='2d9f6f64b566c45f8caf0b8b0c3662185f23041a', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452440, configEpoch=33, flags=[MASTER], slot count=3278], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.27', port=6379], nodeId='d61059d427a992748bd63d44f26fa963a823503b', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452956, configEpoch=34, flags=[MASTER], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.24', port=6379], nodeId='4c41014792825ff94a596e86da9fac4da9254585', connected=true, 
slaveOf='d61059d427a992748bd63d44f26fa963a823503b', pingSentTimestamp=0, pongReceivedTimestamp=1470761453466, configEpoch=34, flags=[SLAVE], slot count=0], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.23', port=6379], nodeId='d30ffe3ed863cf028f6cad8f9a6babb97142a630', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=0, configEpoch=36, flags=[MASTER, MYSELF], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.23', port=7001], nodeId='e59e15e90f131fead7cd13b43891144b14773519', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761453982, configEpoch=32, flags=[MASTER], slot count=3278], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.28', port=6379], nodeId='cd836fbc2f6b9e19f9768702febb4a85759930fd', connected=true, slaveOf='null', pingSentTimestamp=0, pongReceivedTimestamp=1470761452956, configEpoch=27, flags=[MASTER], slot count=3276], RedisClusterNodeSnapshot [uri=RedisURI [host='10.21.2.26', port=6379], nodeId='8c6623b00a586bf9f16f1b88bac1baad218a6cd7', connected=false, slaveOf='null', pingSentTimestamp=1470761451407, pongReceivedTimestamp=0, configEpoch=0, flags=[HANDSHAKE], slot count=0]]) with setting com.lambdaworks.redis.ReadFromImpl$ReadFromMaster@3003697
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:130)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[stacktrace2]

Error: com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 7558 (Partitions: Partitions []) with setting com.lambdaworks.redis.ReadFromImpl$ReadFromMaster@408613cc
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:130)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

[stacktrace3]

Error: com.lambdaworks.redis.RedisException: java.util.ConcurrentModificationException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:77)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:98)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:189)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1798)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:437)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:334)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy17.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
at java.util.ArrayList$Itr.next(ArrayList.java:851)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:174)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
... 22 more
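The fail-fast behavior in stacktrace3 can be reproduced in isolation. This is not the library's code, just a minimal illustration of the mechanism: an ArrayList's iterator throws ConcurrentModificationException when the list is structurally modified mid-iteration. In the issue, the periodic topology refresh mutates the partition list on one thread while a command thread iterates it; the sketch below triggers the same check single-threaded for determinism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class ComodificationDemo {

    // Returns true if mutating the list during iteration triggers
    // ArrayList's fail-fast comodification check.
    static boolean demo() {
        List<String> partitions = new ArrayList<>(Arrays.asList("node-a", "node-b", "node-c"));
        try {
            for (String node : partitions) {
                // Stand-in for a topology refresh mutating the list a reader is iterating.
                if (node.equals("node-a")) {
                    partitions.add("node-d"); // structural modification during iteration
                }
            }
        } catch (ConcurrentModificationException e) {
            return true; // the iterator detects the change on its next next() call
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("failedFast=" + demo());
    }
}
```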
@mp911de mp911de added the bug label Aug 12, 2016
@mp911de mp911de added a commit that referenced this issue Aug 12, 2016
@mp911de Add synchronization to Partitions #333
All non-threadsafe access to Partitions is now synchronized. Functional blocks that change the state of Partitions are grouped into atomic blocks that prevent reads of stale data and reads of partially updated data.
4ab6942
@mp911de mp911de added this to the Lettuce 4.2.2 milestone Aug 12, 2016
@mp911de mp911de added a commit that referenced this issue Aug 12, 2016
@mp911de Add synchronization to Partitions #333
All non-threadsafe access to Partitions is now synchronized. Functional blocks that change the state of Partitions are grouped into atomic blocks that prevent reads of stale data and reads of partially updated data.
713c97d
@mp911de
Owner
mp911de commented Aug 12, 2016

Thanks for your bug report. In your case, the periodic topology refresh performs an update every 60 seconds. You also have the adaptive topology update configured, which determines from Redis responses whether to update the topology. In fact, the adaptive update is the better option, as it refreshes the topology only on demand. I propose raising the periodic update interval to 1-2 hours, or getting rid of it altogether.
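Applied to the configuration from the report above, that suggestion could look roughly like this (a sketch against the lettuce 4.x builder API already shown in this issue; the two-hour period is an illustrative value, not a recommendation from the library):

```java
import java.util.concurrent.TimeUnit;

import com.lambdaworks.redis.cluster.ClusterClientOptions;
import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions;

// Sketch: rely primarily on adaptive refresh triggers and either drop the
// periodic refresh or run it far less often, e.g. every 2 hours.
final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
    .enablePeriodicRefresh(2, TimeUnit.HOURS)   // or omit this line entirely
    .enableAllAdaptiveRefreshTriggers()
    .build();

final ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
    .autoReconnect(true)
    .maxRedirects(4)
    .topologyRefreshOptions(refreshOptions)
    .build();
```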

I deployed a fixed snapshot (4.2.2-SNAPSHOT) to Sonatype's OSS Snapshot Repository. Would you mind giving that snapshot a try with your current settings?

@OutOfBrain

Hey, we tested with 4.2.2-SNAPSHOT and got a NullPointerException:

Error: com.lambdaworks.redis.RedisException: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:77)
at com.lambdaworks.redis.cluster.ClusterDistributionChannelWriter.write(ClusterDistributionChannelWriter.java:100)
at com.lambdaworks.redis.RedisChannelHandler.dispatch(RedisChannelHandler.java:118)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl.dispatch(StatefulRedisClusterConnectionImpl.java:178)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.dispatch(AbstractRedisAsyncCommands.java:1797)
at com.lambdaworks.redis.AbstractRedisAsyncCommands.hkeys(AbstractRedisAsyncCommands.java:436)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.lambdaworks.redis.cluster.StatefulRedisClusterConnectionImpl$ClusterFutureSyncInvocationHandler.handleInvocation(StatefulRedisClusterConnectionImpl.java:321)
at com.lambdaworks.redis.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:54)
at com.sun.proxy.$Proxy18.hkeys(Unknown Source)
[...snip...]
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:175)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
... 22 more
@mp911de
Owner
mp911de commented Aug 12, 2016

Do you have a bit more of the stack trace? It shows that the NullPointerException is wrapped inside a RedisException, but not where the NPE occurred.

@OutOfBrain

That is the full length of the stack trace, minus three lines of our internal calls.
I think the last part of the stack trace is the interesting bit:

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadCandidates(PooledClusterConnectionProvider.java:175)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getReadConnection(PooledClusterConnectionProvider.java:116)
at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.getConnection(PooledClusterConnectionProvider.java:69)
... 22 more

Looks like the NullPointerException originates from there:
https://github.com/mp911de/lettuce/blob/4.2.x/src/main/java/com/lambdaworks/redis/cluster/PooledClusterConnectionProvider.java#L175

@mp911de
Owner
mp911de commented Aug 12, 2016

Thanks for the update. I haven't been able to reproduce the issue so far (single- and multi-threaded connection usage), so it's somewhat cumbersome to nail down the cause. I'm not sure whether synchronizing access to Partitions will fix the issue.

@mp911de mp911de added a commit that referenced this issue Aug 12, 2016
@mp911de Use a read-view for consistent Partition usage during Partitions updates #333

Partitions now uses a cached read-view for query/read operations. This decouples reads from updates to Partitions. The most common case, iteration over Partitions, is now consistent with the state at the time the iterator was obtained. The iterator and its content do not change during partition updates.
279e483
@mp911de
Owner
mp911de commented Aug 12, 2016

I changed Partitions slightly to use a consistent read-view for query/read operations. I'm not fully sure that this will fix the issue, but it at least adds more consistent behavior. I also deployed a new 4.2.2-SNAPSHOT (lettuce-4.2.2-20160812.122340-2.jar) to Sonatype's OSS Snapshot Repository.
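The cached read-view idea can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual lettuce Partitions code: readers get an immutable snapshot, and a topology update builds a fresh list off to the side and publishes it in a single volatile write, so an in-flight iteration never observes a half-applied update.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a cached read-view: updates swap an immutable
// snapshot atomically instead of mutating the list readers iterate.
class PartitionView {
    private volatile List<String> readView = Collections.emptyList();

    void update(List<String> newPartitions) {
        // Build the new view separately, then publish it in one volatile write.
        readView = Collections.unmodifiableList(new ArrayList<>(newPartitions));
    }

    List<String> snapshot() {
        return readView; // iterating this list never observes later updates
    }
}

public class ReadViewDemo {
    public static void main(String[] args) {
        PartitionView view = new PartitionView();
        view.update(Arrays.asList("node-a", "node-b"));
        List<String> before = view.snapshot();
        view.update(Arrays.asList("node-a", "node-b", "node-c"));
        // The earlier snapshot is unaffected by the refresh.
        System.out.println("before=" + before.size() + " after=" + view.snapshot().size());
    }
}
```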

@OutOfBrain

Hey,

we just ran the initial config again with the new 4.2.2-SNAPSHOT and can confirm that no more exceptions are thrown. Seems fixed.
Thanks for the very fast reaction time 👍

@mp911de
Owner
mp911de commented Aug 12, 2016

Great to hear. Closing this ticket. The bugfix will ship with 4.2.2, scheduled for Aug 19.

@mp911de mp911de closed this Aug 12, 2016
@mp911de mp911de added a commit that referenced this issue Aug 16, 2016
@mp911de Use a read-view for consistent Partition usage during Partitions updates #333

Partitions now uses a cached read-view for query/read operations. This decouples reads from updates to Partitions. The most common case, iteration over Partitions, is now consistent with the state at the time the iterator was obtained. The iterator and its content do not change during partition updates.
aca1d34