Replication does not work on 3 freshly started VMs with memgraph 2.15.0-1 #1774

Open

bradacina opened this issue Feb 29, 2024 · 15 comments

Labels: bug, community, Effort - Low, Frequency - EveryTime, high-availability, Priority - Later, Reach - Some, Severity - S2

Comments

@bradacina commented Feb 29, 2024

Memgraph version
memgraph/now 2.15.0-1 amd64
NOT USING ENTERPRISE LICENSE

Environment
3 Azure VMs, in the same subnet, Ubuntu 22.04.4, x86_64
One of the VMs is the MAIN, the other 2 VMs are Replicas
In my memgraph.conf for all 3 VMs I have:
--log-level=TRACE
--storage-mode=ON_DISK_TRANSACTIONAL
--replication-restore-state-on-startup=true
I have restarted the memgraph service on all 3 VMs
I have demoted 2 VMs to Replicas
I have registered the replicas on the MAIN VM
SHOW REPLICAS; displays the 2 registered replicas
I have restarted the memgraph service on all 3 VMs AGAIN
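For reference, the demotion and registration steps above used queries along these lines (a sketch based on the linked docs; the IP addresses and port 10000 are illustrative assumptions, not my exact values):

// On each of the 2 Replica VMs:
SET REPLICATION ROLE TO REPLICA WITH PORT 10000;

// On the MAIN VM:
REGISTER REPLICA Replica1 SYNC TO "10.0.0.2:10000";
REGISTER REPLICA Replica2 SYNC TO "10.0.0.3:10000";
SHOW REPLICAS;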

Describe the bug
I've followed the guide at https://memgraph.com/docs/configuration/replication#set-up-a-replication-cluster to set up a cluster, but replication doesn't actually happen. Creating a node on the MAIN does not replicate the data to the Replicas.

To Reproduce
Steps to reproduce the behavior:

  1. Create a node on the MAIN VM in mgconsole: CREATE (c:Customer {customerNumber:1}) RETURN c;
  2. Go to a Replica VM and run in mgconsole: MATCH (c:Customer) RETURN c;
  3. The node has not been replicated to the Replica.

Expected behavior
The node should be replicated on the Replica

Logs
On MAIN:

[2024-02-29 15:37:17.803] [memgraph_log] [debug] [Run - memgraph] 'SHOW STORAGE INFO'
[2024-02-29 15:37:17.863] [memgraph_log] [debug] [Run - memgraph] 'MATCH (APP_INTERNAL_EXEC_VAR) return COUNT(APP_INTERNAL_EXEC_VAR) as cnt'
[2024-02-29 15:37:17.863] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:17.865] [memgraph_log] [debug] [Run - memgraph] 'SHOW INDEX INFO'
[2024-02-29 15:37:17.865] [memgraph_log] [debug] [Run - memgraph] 'SHOW TRIGGERS'
[2024-02-29 15:37:17.865] [memgraph_log] [debug] [Run - memgraph] 'SHOW CONSTRAINT INFO'
[2024-02-29 15:37:17.865] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:17.865] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:17.865] [memgraph_log] [trace] rocksdb: Commit successful
[2024-02-29 15:37:18.643] [memgraph_log] [debug] Replica 'Replica2' can't respond or missing database 'memgraph' - '7a8142bd-da14-41db-9f1c-b53b6d5c743a'
[2024-02-29 15:37:19.643] [memgraph_log] [debug] Replica 'Replica2' can't respond or missing database 'memgraph' - '7a8142bd-da14-41db-9f1c-b53b6d5c743a'
[2024-02-29 15:37:20.643] [memgraph_log] [debug] Replica 'Replica2' can't respond or missing database 'memgraph' - '7a8142bd-da14-41db-9f1c-b53b6d5c743a'

On Replica:

[2024-02-29 15:38:55.641] [memgraph_log] [error] Received HeartbeatReq with main_id: 8c0e9475-9464-4fac-9379-897f382265fa != current_main_uuid:
[2024-02-29 15:38:56.641] [memgraph_log] [debug] Received FrequentHeartbeatRpc
[2024-02-29 15:38:56.641] [memgraph_log] [debug] Received SystemRecoveryRpc
[2024-02-29 15:38:56.641] [memgraph_log] [error] Received SystemRecoveryReq with main_id: 8c0e9475-9464-4fac-9379-897f382265fa != current_main_uuid:
[2024-02-29 15:38:56.641] [memgraph_log] [debug] Received HeartbeatRpc
[2024-02-29 15:38:56.641] [memgraph_log] [warning] No database with UUID "7a8142bd-da14-41db-9f1c-b53b6d5c743a" on replica!
[2024-02-29 15:38:56.641] [memgraph_log] [error] Received HeartbeatReq with main_id: 8c0e9475-9464-4fac-9379-897f382265fa != current_main_uuid:

Verification Environment
Once we fix it, what do you need to verify the fix?
Do you need:

  • plain memgraph package for Ubuntu 22.04

Thank you for your work/time!

@bradacina added the bug label Feb 29, 2024
@gitbuda added community, Effort - Unknown, Severity - S3, Frequency - Monthly, Reach - Some labels Mar 1, 2024
@nils-stefan-weiher commented Mar 1, 2024

I get a similar error on the replica, but the initial replication seemed to work:

[2024-03-01 14:18:44.953][Error]Handling SystemRecovery, an enterprise RPC message, without license.
[2024-03-01 14:18:44.955][Warning]No database with UUID "195f98a0-54e7-4d62-b76c-4592728532a9" on replica!
[2024-03-01 14:18:44.955][Warning]No database accessor

I have been using a setup with docker compose on one machine:

version: "3"
 
services:
  memgraph-main:
    image: memgraph/memgraph-mage:1.15-memgraph-2.15
    volumes:
      - mg_lib:/var/lib/memgraph
      - mg_log_main:/var/log/memgraph
      - mg_etc_main:/etc/memgraph
    container_name: memgraph-main
    ports:
      - "7687:7687"
      - "7444:7444"
    command: ["--replication-restore-state-on-startup=true"]
  memgraph-public:
    image: memgraph/memgraph-mage:1.15-memgraph-2.15
    volumes:
      - mg_log_public:/var/log/memgraph
      - mg_etc_public:/etc/memgraph
    container_name: memgraph-public
    ports:
      - "7689:7687"
      - "7445:7444"
    depends_on:
      - memgraph-main
    command: ["--replication-restore-state-on-startup=true"]
  lab:
    image: memgraph/lab:2.12.0
    container_name: memgraph-lab
    ports:
      - "3000:3000"
    depends_on:
      - memgraph-main
    environment:
      - QUICK_CONNECT_MG_HOST=memgraph-main
      - QUICK_CONNECT_MG_PORT=7687
volumes:
  mg_lib:
  mg_log_main:
  mg_log_public:
  mg_etc_main:
  mg_etc_public:

memgraph-main is the main node, and memgraph-public was set up as a REPLICA in SYNC mode.
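In other words, the setup was along these lines (a sketch; the replica name public_mirror is taken from the error log below, the hostnames follow the Compose file above):

// On memgraph-public (port 7689):
SET REPLICATION ROLE TO REPLICA WITH PORT 10000;

// On memgraph-main (port 7687):
REGISTER REPLICA public_mirror SYNC TO "memgraph-public";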

The error above started showing up after a restart of the Docker containers.

EDIT: After the initial import and a restart, it is no longer possible to create nodes on the main. The error in the log reads:

[2024-03-01 14:43:36.939][Error]Couldn't replicate data to public_mirror. For more details, visit https://memgr.ph/replication.

and the query response displayed is:

At least one SYNC replica has not confirmed committing last transaction.

@katarinasupe (Contributor) commented

Hi @nils-stefan-weiher, I managed to reproduce your issue, but it seems a bit different from the one reported above. I created a new issue from your comment, so you can track the progress there. Thank you for reporting this 🙏

@katarinasupe (Contributor) commented

After reading @bradacina's issue once again, I think these two are actually the same report. @bradacina, can you confirm whether you ran the queries after restarting the instances? (i.e., replication is not working after a restart)

@katarinasupe (Contributor) commented Mar 21, 2024

Following @bradacina's and @nils-stefan-weiher's comments, I managed to reproduce the issue.
Environment - Memgraph 2.15, Docker Compose, Mac M1

To Reproduce

  1. Run 2 Memgraph instances

  2. Set up replication
    Run query on replica (7689):
    SET REPLICATION ROLE TO REPLICA WITH PORT 10000;
    Run on MAIN (7687):
    REGISTER REPLICA my_replica ASYNC TO "memgraph-public";
    Check roles on both:
    SHOW REPLICATION ROLE;
    Check replicas on MAIN:
    SHOW REPLICAS;

  3. Test replication (run on MAIN):
    CREATE (c:Customer {customerNumber:1}) RETURN c;
    Data is properly replicated to the ASYNC replica.

  4. Stop and start the Memgraph containers.

  5. Check the logs and test replication again.

Expected behavior - Replication continues working after the restart.

Logs
Replication roles are persisted properly, and SHOW REPLICAS; on MAIN correctly shows the registered replica.
Here are the logs:

On MAIN:

[2024-03-21 10:24:17.157][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'
[2024-03-21 10:24:18.128][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'
[2024-03-21 10:24:19.137][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'

On REPLICA:

[2024-03-21 10:26:44.134][Debug]Received FrequentHeartbeatRpc
[2024-03-21 10:26:44.136][Debug]Received SystemRecoveryRpc
[2024-03-21 10:26:44.136][Error]Handling SystemRecovery, an enterprise RPC message, without license.
[2024-03-21 10:26:44.137][Info]Item with name "memgraph" already exists.
[2024-03-21 10:26:44.137][Debug]Trying to create db 'memgraph' on replica which already exists.
[2024-03-21 10:26:44.137][Debug]Different UUIDs
[2024-03-21 10:26:44.138][Debug]Default storage is not clean, cannot update UUID...
[2024-03-21 10:26:44.138][Debug]Received HeartbeatRpc
[2024-03-21 10:26:44.138][Warning]No database with UUID "012b1537-4826-4caf-9ce8-47372cf7c556" on replica!
[2024-03-21 10:26:44.139][Warning]No database accessor
[2024-03-21 10:26:44.139][Debug]Replica 'my_replica' can't respond or missing database 'memgraph' - '012b1537-4826-4caf-9ce8-47372cf7c556'

@katarinasupe added Frequency - EveryTime, Severity - S2 and removed Frequency - Monthly, Severity - S3 labels Mar 21, 2024
@andrejtonev (Contributor) commented

@bradacina
I see you set --storage-mode=ON_DISK_TRANSACTIONAL. Replication is currently only supported in the in-memory transactional storage mode.
Do you still have this issue when using IN_MEMORY_TRANSACTIONAL?
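In memgraph.conf, that would be, e.g. (IN_MEMORY_TRANSACTIONAL is Memgraph's default storage mode):

--storage-mode=IN_MEMORY_TRANSACTIONAL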

@katarinasupe (Contributor) commented

@andrejtonev That's why I thought these issues were not related (and they might not be; that is, solving one might not solve the other). Still, I reproduced the issue in in-memory transactional mode, just like @nils-stefan-weiher reported.

@andrejtonev (Contributor) commented Mar 21, 2024

@katarinasupe @nils-stefan-weiher
Can you try adding the --data-recovery-on-startup=true flag?
The --storage-recover-on-startup flag is deprecated and does not recover all the data needed in this case.
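In the Docker Compose setup above, that would mean passing both flags to each instance, e.g. (a sketch):

command: ["--replication-restore-state-on-startup=true", "--data-recovery-on-startup=true"]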

@antoniofilipovic (Contributor) commented

@bradacina hi, a question for you as well:

Since you are using --replication-restore-state-on-startup=true, I am not sure why there is an issue, but the logs should let us verify.

Can you check the logs for the following messages, which can tell us whether there is a bug on our side?
On REPLICA instances, try to search for:

Recovered main's uuid for replica {}

On MAIN:

Recovered uuid for main {}

If the {} placeholders are empty, the bug is in the recovery of the UUID. From the logs, it seems that the UUID was not recovered. My question is whether you also had --replication-restore-state-on-startup=true set when the instances were restarted.
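To check, something like this on each instance could help (assuming logs end up under /var/log/memgraph, as mounted in the Compose file above; the file name pattern may differ):

grep "Recovered main's uuid for replica" /var/log/memgraph/*.log   # on a REPLICA
grep "Recovered uuid for main" /var/log/memgraph/*.log             # on MAIN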

@katarinasupe (Contributor) commented

@andrejtonev it seems that --data-recovery-on-startup=true works well in my case (and hopefully in @nils-stefan-weiher's). Can you elaborate on why --replication-restore-state-on-startup=true wasn't enough and whether this is expected behavior?

@andrejtonev (Contributor) commented Mar 25, 2024

@katarinasupe --data-recovery-on-startup was originally a multi-tenancy flag used to recover all tenants and their data. Since the last release, this flag has been expanded to also restore information regarding replication.
Specifically, we need to recover a unique identifier that gets set at first connection. This protects us from overwriting data from a rogue MAIN.
Unfortunately, this change hasn't been communicated in the best way.

@antoniofilipovic (Contributor) commented

@andrejtonev correct me if I am wrong, but I checked the code, and @katarinasupe --replication-restore-state-on-startup=true should be enough if no new data is added. If data has been added or changed, --data-recovery-on-startup=true must also be set to true; otherwise we will not recover the correct identifier for each database, and it will look as if the two databases have diverged.

An additional problem is that our codebase still defaults to --storage-recover-on-startup=true, which can recover data from storage but not the unique identifier of each database; again, it then looks as if the databases have diverged.

To conclude: --replication-restore-state-on-startup=true should be enough if there was no data, but if data has been added, please use --data-recovery-on-startup=true, which recovers not just the data but also the database identifiers that are important for replication.

@nils-stefan-weiher commented

@andrejtonev it seems that --data-recovery-on-startup=true works well in my case (and hopefully in @nils-stefan-weiher's). Can you elaborate on why --replication-restore-state-on-startup=true wasn't enough and whether this is expected behavior?

Thanks for trying. I've only been back in the office since Monday; I will check whether this works in our case.

@nils-stefan-weiher commented

Using the suggested parameters, and after deleting the container volumes, I could enable replication (SYNC with one replica) again.
With versions 2.15.2 and 2.16.0 it is working now, and I could import the data again.

I only have one question about the upgrade process. While upgrading the containers from 2.15.2 to 2.16.0, I had to REGISTER the replica again and set the correct ROLE. Is this because I don't have a persistent volume for the data in /var/lib on the replica?

@antejavor (Contributor) commented Apr 18, 2024

I only have one question about the upgrade process. While upgrading the containers from 2.15.2 to 2.16.0, I had to REGISTER the replica again and set the correct ROLE. Is this because I don't have a persistent volume for the data in /var/lib on the replica?

Yeah, persistence lives in the /var/lib/memgraph directory; in the Docker case, it is tied to volumes.
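In the Compose file above, memgraph-public mounts only the log and etc volumes, so its /var/lib/memgraph is ephemeral and the replication state is lost when the container is re-created. A sketch of a fix (the volume name mg_lib_public is an assumption):

  memgraph-public:
    volumes:
      - mg_lib_public:/var/lib/memgraph
      - mg_log_public:/var/log/memgraph
      - mg_etc_public:/etc/memgraph
...
volumes:
  mg_lib_public: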

@antejavor reopened this Apr 18, 2024
@katarinasupe (Contributor) commented

Related to #2061

@antepusic removed the Effort - Unknown label Jun 10, 2024
@antepusic added Effort - High, Effort - Low and removed Effort - High labels Jun 10, 2024
@hal-eisen-MG added the Priority - Later label Jun 16, 2024