Skip to content

Randomly getting an 'An error occurred at ReplicationManager.RetrieveStoreCheckpoint' error message when replicating the Garnet instance #1163

@wessisaac

Description

@wessisaac

Describe the bug

Randomly getting an 'An error occurred at ReplicationManager.RetrieveStoreCheckpoint' error message when replicating the Garnet instance.

When replicating a Garnet instance using Redis-cli or programmatically, I receive an 'OK' output, but replication does not occur. In the error log, I get the message: 'An error occurred at ReplicationManager.RetrieveStoreCheckpoint.'

Commands:
redis-cli -c -h 127.0.0.1 -p 6379 Cluster Replicate ff8f09782078336d22b570988dfaca602d097c6f
Ans: OK

redis-cli -h 127.0.0.1 -p 6380 cluster nodes
Ans: ff8f09782078336d22b570988dfaca602d097c6f 127.0.0.1:6380@16380,DUC-L155.ducontindia.local myself,master - 0 0 53 connected 0-16383
9dd6ee8dfc4817de91375c19522f682b2a77e1be 127.0.0.1:6379@16379,DUC-L155.ducontindia.local master - 638794782854197274 638794782854196345 90 connected

Error Log:

[000.2025-04-05 19:18:12.6720] (Trace) CLUSTER REPLICATE ff8f09782078336d22b570988dfaca602d097c6f
[000.2025-04-05 19:18:12.6721] (Trace) Success recover lock [ClusterReplicate]
[000.2025-04-05 19:18:12.6726] (Information) Initiating background checkpoint retrieval
[000.2025-04-05 19:18:12.6939] (Debug) [127.0.0.1:61344] [00981EB9] Disposing RespServerSession Id=20
[000.2025-04-05 19:18:12.6941] (Debug) [] [0040850F] Disposing RespServerSession Id=0
[000.2025-04-05 19:18:12.7099] (Information) <TsavoriteKV [main]> Index Checkpoint: b9b2d6f2-294c-4eff-bb9c-d00070ee2687
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> ******** Index Checkpoint Info for b9b2d6f2-294c-4eff-bb9c-d00070ee2687 ********
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Table Size: 2097152
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Main Table Size (in GB): 0.13421772799999998
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Overflow Table Size (in GB): 1.0239999999999999E-06
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Num Buckets: 16
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Start Logical Address: 64
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Final Logical Address: 64
[000.2025-04-05 19:18:12.7215] (Information) <TsavoriteKV [obj]> Index Checkpoint: b9b2d6f2-294c-4eff-bb9c-d00070ee2687
[000.2025-04-05 19:18:12.7215] (Information) <TsavoriteKV [obj]> ******** Index Checkpoint Info for b9b2d6f2-294c-4eff-bb9c-d00070ee2687 ********
[000.2025-04-05 19:18:12.7215] (Information) <TsavoriteKV [obj]> Table Size: 262144
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Main Table Size (in GB): 0.016777215999999998
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Overflow Table Size (in GB): 1.0239999999999999E-06
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Num Buckets: 16
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Start Logical Address: 72
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Final Logical Address: 72
[000.2025-04-05 19:18:12.7752] (Information) <TsavoriteLog [aof]> ******* Recovered HybridLog Stats *******
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Head Address: 64
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Safe Head Address: 64
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> ReadOnly Address: 3864
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Safe ReadOnly Address: 3864
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Tail Address: 3864
[000.2025-04-05 19:18:12.7753] (Information) Recovered AOF: begin address = 64, tail address = 3864
[000.2025-04-05 19:18:12.7753] (Information) InitiateReplicaSync: AOF BeginAddress:64 AOF TailAddress:3864
[000.2025-04-05 19:18:12.7771] (Debug) Accepted TCP connection from 127.0.0.1:61346
[000.2025-04-05 19:18:12.8055] (Error) An error occurred at ReplicationManager.RetrieveStoreCheckpoint
System.Exception: ReplicaSyncSession replicaAofTail 3864 < canServeFromAofAddress 10536
at Garnet.cluster.ReplicationManager.g__ReplicaSyncAttachTask|20_0() in /
/libs/cluster/Server/Replication/ReplicaOps/ReplicaReceiveCheckpoint.cs:line 143
[000.2025-04-05 19:18:12.8059] (Trace) EndRecovery [ClusterReplicate,NoRecovery]_

Garnet.conf

Garnet.txt

Steps to reproduce the bug

When replicating a Garnet instance using Redis-cli or programmatically, I receive an 'OK' output, but replication does not occur. In the error log, I get the message: 'An error occurred at ReplicationManager.RetrieveStoreCheckpoint.'

Commands:
redis-cli -c -h 127.0.0.1 -p 6379 Cluster Replicate ff8f09782078336d22b570988dfaca602d097c6f
Ans: OK

redis-cli -h 127.0.0.1 -p 6380 cluster nodes
Ans: ff8f09782078336d22b570988dfaca602d097c6f 127.0.0.1:6380@16380,DUC-L155.ducontindia.local myself,master - 0 0 53 connected 0-16383
9dd6ee8dfc4817de91375c19522f682b2a77e1be 127.0.0.1:6379@16379,DUC-L155.ducontindia.local master - 638794782854197274 638794782854196345 90 connected

Error Log:

[000.2025-04-05 19:18:12.6720] (Trace) CLUSTER REPLICATE ff8f09782078336d22b570988dfaca602d097c6f
[000.2025-04-05 19:18:12.6721] (Trace) Success recover lock [ClusterReplicate]
[000.2025-04-05 19:18:12.6726] (Information) Initiating background checkpoint retrieval
[000.2025-04-05 19:18:12.6939] (Debug) [127.0.0.1:61344] [00981EB9] Disposing RespServerSession Id=20
[000.2025-04-05 19:18:12.6941] (Debug) [] [0040850F] Disposing RespServerSession Id=0
[000.2025-04-05 19:18:12.7099] (Information) <TsavoriteKV [main]> Index Checkpoint: b9b2d6f2-294c-4eff-bb9c-d00070ee2687
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> ******** Index Checkpoint Info for b9b2d6f2-294c-4eff-bb9c-d00070ee2687 ********
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Table Size: 2097152
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Main Table Size (in GB): 0.13421772799999998
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Overflow Table Size (in GB): 1.0239999999999999E-06
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Num Buckets: 16
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Start Logical Address: 64
[000.2025-04-05 19:18:12.7101] (Information) <TsavoriteKV [main]> Final Logical Address: 64
[000.2025-04-05 19:18:12.7215] (Information) <TsavoriteKV [obj]> Index Checkpoint: b9b2d6f2-294c-4eff-bb9c-d00070ee2687
[000.2025-04-05 19:18:12.7215] (Information) <TsavoriteKV [obj]> ******** Index Checkpoint Info for b9b2d6f2-294c-4eff-bb9c-d00070ee2687 ********
[000.2025-04-05 19:18:12.7215] (Information) <TsavoriteKV [obj]> Table Size: 262144
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Main Table Size (in GB): 0.016777215999999998
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Overflow Table Size (in GB): 1.0239999999999999E-06
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Num Buckets: 16
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Start Logical Address: 72
[000.2025-04-05 19:18:12.7216] (Information) <TsavoriteKV [obj]> Final Logical Address: 72
[000.2025-04-05 19:18:12.7752] (Information) <TsavoriteLog [aof]> ******* Recovered HybridLog Stats *******
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Head Address: 64
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Safe Head Address: 64
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> ReadOnly Address: 3864
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Safe ReadOnly Address: 3864
[000.2025-04-05 19:18:12.7753] (Information) <TsavoriteLog [aof]> Tail Address: 3864
[000.2025-04-05 19:18:12.7753] (Information) Recovered AOF: begin address = 64, tail address = 3864
[000.2025-04-05 19:18:12.7753] (Information) InitiateReplicaSync: AOF BeginAddress:64 AOF TailAddress:3864
[000.2025-04-05 19:18:12.7771] (Debug) Accepted TCP connection from 127.0.0.1:61346
[000.2025-04-05 19:18:12.8055] (Error) An error occurred at ReplicationManager.RetrieveStoreCheckpoint
System.Exception: ReplicaSyncSession replicaAofTail 3864 < canServeFromAofAddress 10536
at Garnet.cluster.ReplicationManager.g__ReplicaSyncAttachTask|20_0() in /
/libs/cluster/Server/Replication/ReplicaOps/ReplicaReceiveCheckpoint.cs:line 143
[000.2025-04-05 19:18:12.8059] (Trace) EndRecovery [ClusterReplicate,NoRecovery]_

Garnet.conf

Garnet.txt

Expected behavior

No response

Screenshots

No response

Release version

No response

IDE

No response

OS version

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions