Replication does not work on 3 freshly started VMs with memgraph 2.15.0-1 #1774

Open

bradacina opened this issue Feb 29, 2024 · 15 comments

Labels: bug, community, Effort - Low, Frequency - EveryTime, high-availability, Priority - Later, Reach - Some, Severity - S2

Comments

@bradacina commented Feb 29, 2024

Memgraph version
memgraph/now 2.15.0-1 amd64
NOT USING ENTERPRISE LICENSE

Environment
3 Azure VMs, in the same subnet, Ubuntu 22.04.4, x86_64
One of the VMs is the MAIN, the other 2 VMs are Replicas
In my memgraph.conf for all 3 VMs I have:
--log-level=TRACE
--storage-mode=ON_DISK_TRANSACTIONAL
--replication-restore-state-on-startup=true
I have restarted the memgraph service on all 3 VMs
I have demoted 2 VMs to Replicas
I have registered the replicas on the MAIN VM
SHOW REPLICAS; displays the 2 registered replicas
I have restarted the memgraph service on all 3 VMs AGAIN
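For reference, the demotion and registration steps above used queries along these lines (a sketch based on the linked docs; the IP addresses and port 10000 are illustrative assumptions, not my exact values):

// On each of the 2 Replica VMs:
SET REPLICATION ROLE TO REPLICA WITH PORT 10000;

// On the MAIN VM:
REGISTER REPLICA Replica1 SYNC TO "10.0.0.2:10000";
REGISTER REPLICA Replica2 SYNC TO "10.0.0.3:10000";
SHOW REPLICAS;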

Describe the bug
I've followed the guide at https://memgraph.com/docs/configuration/replication#set-up-a-replication-cluster to set up a cluster, but replication doesn't actually happen. Creating a node on the MAIN does not replicate the data to the Replicas.

To Reproduce
Steps to reproduce the behavior:

  1. Create a node on the MAIN VM in mgconsole: CREATE (c:Customer {customerNumber:1}) RETURN c;
  2. Go to a Replica VM and run in mgconsole: MATCH (c:Customer) RETURN c;
  3. The node has not been replicated to the Replica.

Expected behavior
The node should be replicated on the Replica

Logs
On MAIN:

[2024-02-29 15:37:17.803] [memgraph_log] [debug] [Run - memgraph] 'SHOW STORAGE INFO'
[2024-02-29 15:37:17.863] [memgraph_log] [debug] [Run - memgraph] 'MATCH (APP_INTERNAL_EXEC_VAR) return COUNT(APP_INTERNAL_EXEC_VAR) as cnt'
[2024-02-29 15:37:17.863] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:17.865] [memgraph_log] [debug] [Run - memgraph] 'SHOW INDEX INFO'
[2024-02-29 15:37:17.865] [memgraph_log] [debug] [Run - memgraph] 'SHOW TRIGGERS'
[2024-02-29 15:37:17.865] [memgraph_log] [debug] [Run - memgraph] 'SHOW CONSTRAINT INFO'
[2024-02-29 15:37:17.865] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:17.865] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:17.865] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:18.643] [memgraph_log] [debug] Replica 'Replica2' can't respond or missing database 'memgraph' - '7a8142bd-da14-41db-9f1c-b53b6d5c743a'
[2024-02-29 15:37:19.643] [memgraph_log] [debug] Replica 'Replica2' can't respond or missing database 'memgraph' - '7a8142bd-da14-41db-9f1c-b53b6d5c743a'
[2024-02-29 15:37:20.643] [memgraph_log] [debug] Replica 'Replica2' can't respond or missing database 'memgraph' - '7a8142bd-da14-41db-9f1c-b53b6d5c743a'

On Replica:

[2024-02-29 15:38:55.641] [memgraph_log] [error] Received HeartbeatReq with main_id: 8c0e9475-9464-4fac-9379-897f382265fa != current_main_uuid:
[2024-02-29 15:38:56.641] [memgraph_log] [debug] Received FrequentHeartbeatRpc
[2024-02-29 15:38:56.641] [memgraph_log] [debug] Received SystemRecoveryRpc
[2024-02-29 15:38:56.641] [memgraph_log] [error] Received SystemRecoveryReq with main_id: 8c0e9475-9464-4fac-9379-897f382265fa != current_main_uuid:
[2024-02-29 15:38:56.641] [memgraph_log] [debug] Received HeartbeatRpc
[2024-02-29 15:38:56.641] [memgraph_log] [warning] No database with UUID "7a8142bd-da14-41db-9f1c-b53b6d5c743a" on replica!
[2024-02-29 15:38:56.641] [memgraph_log] [error] Received HeartbeatReq with main_id: 8c0e9475-9464-4fac-9379-897f382265fa != current_main_uuid:

Verification Environment
Once we fix it, what do you need to verify the fix?
Do you need:

  • plain memgraph package for Ubuntu 22.04

Thank you for your work/time!

@bradacina added the bug label Feb 29, 2024
@gitbuda added community, Effort - Unknown, Severity - S3, Frequency - Monthly, Reach - Some labels Mar 1, 2024
@nils-stefan-weiher commented Mar 1, 2024

I get a similar error on the replica, but the initial replication seemed to work:

[2024-03-01 14:18:44.953][Error]Handling SystemRecovery, an enterprise RPC message, without license.
[2024-03-01 14:18:44.955][Warning]No database with UUID "195f98a0-54e7-4d62-b76c-4592728532a9" on replica!
[2024-03-01 14:18:44.955][Warning]No database accessor

I have been using a setup with docker compose on one machine:

version: "3"
 
services:
  memgraph-main:
    image: memgraph/memgraph-mage:1.15-memgraph-2.15
    volumes:
      - mg_lib:/var/lib/memgraph
      - mg_log_main:/var/log/memgraph
      - mg_etc_main:/etc/memgraph
    container_name: memgraph-main
    ports:
      - "7687:7687"
      - "7444:7444"
    command: ["--replication-restore-state-on-startup=true"]
  memgraph-public:
    image: memgraph/memgraph-mage:1.15-memgraph-2.15
    volumes:
      - mg_log_public:/var/log/memgraph
      - mg_etc_public:/etc/memgraph
    container_name: memgraph-public
    ports:
      - "7689:7687"
      - "7445:7444"
    depends_on:
      - memgraph-main
    command: ["--replication-restore-state-on-startup=true"]
  lab:
    image: memgraph/lab:2.12.0
    container_name: memgraph-lab
    ports:
      - "3000:3000"
    depends_on:
      - memgraph-main
    environment:
      - QUICK_CONNECT_MG_HOST=memgraph-main
      - QUICK_CONNECT_MG_PORT=7687
volumes:
  mg_lib:
  mg_log_main:
  mg_log_public:
  mg_etc_main:
  mg_etc_public:

memgraph-main is the main node, and memgraph-public was set up as a REPLICA in SYNC mode.
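In other words, the setup was along these lines (a sketch; the replica name public_mirror is taken from the error log below, the hostnames follow the Compose file above):

// On memgraph-public (port 7689):
SET REPLICATION ROLE TO REPLICA WITH PORT 10000;

// On memgraph-main (port 7687):
REGISTER REPLICA public_mirror SYNC TO "memgraph-public";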

The error above started showing up after a restart of the Docker containers.

EDIT: After the initial import and a restart, it is no longer possible to create nodes on the main. The error in the log reads:

[2024-03-01 14:43:36.939][Error]Couldn't replicate data to public_mirror. For more details, visit https://memgr.ph/replication.

and the query response displayed is:

At least one SYNC replica has not confirmed committing last transaction.

@katarinasupe (Contributor) commented

Hi @nils-stefan-weiher, I managed to reproduce your issue, but it seems a bit different from the one reported above. I created a new issue from your comment, so you can track the progress there. Thank you for reporting this 🙏

@katarinasupe (Contributor) commented

After reading @bradacina's issue once again, I think these two are actually the same report. @bradacina, can you confirm whether you ran the queries after restarting the instances? (i.e., replication is not working after a restart)

@katarinasupe (Contributor) commented Mar 21, 2024

Following @bradacina's and @nils-stefan-weiher's comments, I managed to reproduce the issue.
Environment - Memgraph 2.15, Docker Compose, Mac M1

To Reproduce

  1. Run 2 Memgraph instances

  2. Set up replication
    Run query on replica (7689):
    SET REPLICATION ROLE TO REPLICA WITH PORT 10000;
    Run on MAIN (7687):
    REGISTER REPLICA my_replica ASYNC TO "memgraph-public";
    Check roles on both:
    SHOW REPLICATION ROLE;
    Check replicas on MAIN:
    SHOW REPLICAS;

  3. Test replication (run on MAIN):
    CREATE (c:Customer {customerNumber:1}) RETURN c;
    Data is properly replicated to the ASYNC replica.

  4. Stop and start the Memgraph containers.

  5. Check the logs and test replication again.

Expected behavior - Replication continues working after the restart.

Logs
Replication roles are persisted properly, and SHOW REPLICAS; on MAIN correctly shows the registered replica.
Here are the logs:

On MAIN:

[2024-03-21 10:24:17.157][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'
[2024-03-21 10:24:18.128][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'
[2024-03-21 10:24:19.137][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'

On REPLICA:

[2024-03-21 10:26:44.134][Debug]Received FrequentHeartbeatRpc
[2024-03-21 10:26:44.136][Debug]Received SystemRecoveryRpc
[2024-03-21 10:26:44.136][Error]Handling SystemRecovery, an enterprise RPC message, without license.
[2024-03-21 10:26:44.137][Info]Item with name "memgraph" already exists.
[2024-03-21 10:26:44.137][Debug]Trying to create db 'memgraph' on replica which already exists.
[2024-03-21 10:26:44.137][Debug]Different UUIDs
[2024-03-21 10:26:44.138][Debug]Default storage is not clean, cannot update UUID...
[2024-03-21 10:26:44.138][Debug]Received HeartbeatRpc
[2024-03-21 10:26:44.138][Warning]No database with UUID "012b1537-4826-4caf-9ce8-47372cf7c556" on replica!
[2024-03-21 10:26:44.139][Warning]No database accessor
[2024-03-21 10:26:44.139][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'

@katarinasupe added Frequency - EveryTime, Severity - S2 and removed Frequency - Monthly, Severity - S3 labels Mar 21, 2024
@andrejtonev (Contributor) commented

@bradacina
I see you set --storage-mode=ON_DISK_TRANSACTIONAL. Replication is currently only supported in the in-memory transactional storage mode.
Do you still have this issue when using IN_MEMORY_TRANSACTIONAL?
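In memgraph.conf, that would be, e.g. (IN_MEMORY_TRANSACTIONAL is Memgraph's default storage mode):

--storage-mode=IN_MEMORY_TRANSACTIONAL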

@katarinasupe (Contributor) commented

@andrejtonev That's why I thought these issues were not related (and they might not be; that is, solving one might not solve the other). Still, I reproduced the issue in in-memory transactional mode, just like @nils-stefan-weiher reported.

@andrejtonev (Contributor) commented Mar 21, 2024

@katarinasupe @nils-stefan-weiher
Can you try adding the --data-recovery-on-startup=true flag?
The --storage-recover-on-startup flag is deprecated and does not recover all the data needed in this case.
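In the Docker Compose setup above, that would mean passing both flags to each instance, e.g. (a sketch):

command: ["--replication-restore-state-on-startup=true", "--data-recovery-on-startup=true"]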

@antoniofilipovic (Contributor) commented

@bradacina hi, a question for you as well:

Since you are using --replication-restore-state-on-startup=true, I am not sure why there is an issue, but the logs should let us verify.

Can you check the logs for the following messages, which can tell us whether there is a bug on our side?
On REPLICA instances, try to search for:

Recovered main's uuid for replica {}

On MAIN:

Recovered uuid for main {}

If the {} placeholders are empty, the bug is in the recovery of the UUID. From the logs, it seems that the UUID was not recovered. My question is whether you also had --replication-restore-state-on-startup=true set when the instances were restarted.
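To check, something like this on each instance could help (assuming logs end up under /var/log/memgraph, as mounted in the Compose file above; the file name pattern may differ):

grep "Recovered main's uuid for replica" /var/log/memgraph/*.log   # on a REPLICA
grep "Recovered uuid for main" /var/log/memgraph/*.log             # on MAIN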

@katarinasupe (Contributor) commented

@andrejtonev it seems that --data-recovery-on-startup=true works well in my case (and hopefully in @nils-stefan-weiher's). Can you elaborate on why --replication-restore-state-on-startup=true wasn't enough and whether this is expected behavior?

@andrejtonev (Contributor) commented Mar 25, 2024

@katarinasupe --data-recovery-on-startup was originally a multi-tenancy flag used to recover all tenants and their data. Since the last release, this flag has been expanded to also restore information regarding replication.
Specifically, we need to recover a unique identifier that gets set at first connection. This protects us from overwriting data from a rogue MAIN.
Unfortunately, this change hasn't been communicated in the best way.

@antoniofilipovic (Contributor) commented

@andrejtonev correct me if I am wrong, but I checked the code, and @katarinasupe --replication-restore-state-on-startup=true should be enough if no new data is added. If data has been added or changed, --data-recovery-on-startup=true must also be set to true; otherwise we will not recover the correct identifier for each database, and it will look as if the two databases have diverged.

An additional problem is that our codebase still defaults to --storage-recover-on-startup=true, which can recover data from storage but not the unique identifier of each database; again, it then looks as if the databases have diverged.

To conclude: --replication-restore-state-on-startup=true should be enough if there was no data, but if data has been added, please use --data-recovery-on-startup=true, which recovers not just the data but also the database identifiers that are important for replication.

@nils-stefan-weiher commented

@andrejtonev it seems that --data-recovery-on-startup=true works well in my case (and hopefully in @nils-stefan-weiher's). Can you elaborate on why --replication-restore-state-on-startup=true wasn't enough and whether this is expected behavior?

Thanks for trying. I've only been back in the office since Monday; I will check whether this works in our case.

@nils-stefan-weiher commented

Using the suggested parameters, and after deleting the container volumes, I could enable replication (SYNC with one replica) again.
With versions 2.15.2 and 2.16.0 it is working now, and I could import the data again.

I only have one question about the upgrade process. While upgrading the containers from 2.15.2 to 2.16.0, I had to REGISTER the replica again and set the correct ROLE. Is this because I don't have a persistent volume for the data in /var/lib on the replica?

@antejavor (Contributor) commented Apr 18, 2024

I only have one question about the upgrade process. While upgrading the containers from 2.15.2 to 2.16.0, I had to REGISTER the replica again and set the correct ROLE. Is this because I don't have a persistent volume for the data in /var/lib on the replica?

Yeah, persistence lives in the /var/lib/memgraph directory; in the Docker case, it is tied to volumes.
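In the Compose file above, memgraph-public mounts only the log and etc volumes, so its /var/lib/memgraph is ephemeral and the replication state is lost when the container is re-created. A sketch of a fix (the volume name mg_lib_public is an assumption):

  memgraph-public:
    volumes:
      - mg_lib_public:/var/lib/memgraph
      - mg_log_public:/var/log/memgraph
      - mg_etc_public:/etc/memgraph
...
volumes:
  mg_lib_public: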

@antejavor reopened this Apr 18, 2024
@katarinasupe (Contributor) commented

Related to #2061

@antepusic removed the Effort - Unknown label Jun 10, 2024
@antepusic added Effort - High, Effort - Low and removed Effort - High labels Jun 10, 2024
@hal-eisen-MG added the Priority - Later label Jun 16, 2024