[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628

joshimoo · 2020-07-22T02:57:46Z

Describe the bug
We noticed engine integration test failures, during a restore operation.
The error is not dependent on the test method in question.
Potentially any restore operation in a test could trigger this.

https://drone-publish.rancher.io/longhorn/longhorn-engine/220/1/3
also noticed it on a dev build

To Reproduce

Unknown

Expected behavior
A clear and concise description of what you expected to happen.

Log
If applicable, add the Longhorn managers' log when the issue happens.

Environment:

Longhorn version: 1.0.1 (rc/master)
Kubernetes version: 1.17
Node OS type and version: ubuntu

Additional context

Test failure

=================================== FAILURES ===================================
--
24703 | ________________________ test_backup_incremental_logic _________________________
24704 |  
24705 | grpc_replica1 = <rpc.replica.replica_client.ReplicaClient object at 0x7fdb0811b0b8>
24706 | grpc_replica2 = <rpc.replica.replica_client.ReplicaClient object at 0x7fdb082e8940>
24707 | grpc_controller = <rpc.controller.controller_client.ControllerClient object at 0x7fdb08402470>
24708 | backup_targets = ['s3://backupbucket@us-east-1/s3-test', 'vfs:///data/backupbucket']
24709 |  
24710 | def test_backup_incremental_logic(grpc_replica1, grpc_replica2,  # NOQA
24711 | grpc_controller, backup_targets):  # NOQA
24712 | for backup_target in backup_targets:
24713 | dev = get_dev(grpc_replica1, grpc_replica2,
24714 | grpc_controller)
24715 | address = grpc_controller.address
24716 | volume_name = VOLUME_NAME
24717 | engine_name = ENGINE_NAME
24718 | offset = 0
24719 | length = 128
24720 |  
24721 | # initial backup
24722 | snap1_data = random_string(length)
24723 | verify_data(dev, offset, snap1_data)
24724 | snap1_checksum = checksum_dev(dev)
24725 | snap1 = cmd.snapshot_create(address)
24726 | backup1_info = create_backup(address, snap1, backup_target)
24727 | assert backup1_info["IsIncremental"] is False
24728 |  
24729 | # delta backup on top of initial backup
24730 | snap2_data = random_string(int(length / 2))
24731 | verify_data(dev, offset, snap2_data)
24732 | snap2 = cmd.snapshot_create(address)
24733 | backup2_info = create_backup(address, snap2, backup_target)
24734 | assert backup2_info["IsIncremental"] is True
24735 |  
24736 | # delete the volume
24737 | cmd.sync_agent_server_reset(address)
24738 | grpc_controller = cleanup_controller(grpc_controller)
24739 | grpc_replica1 = cleanup_replica(grpc_replica1)
24740 | grpc_replica2 = cleanup_replica(grpc_replica2)
24741 |  
24742 | # recreate the volume
24743 | dev = get_dev(grpc_replica1, grpc_replica2,
24744 | grpc_controller, clean_backup_dir=False)
24745 |  
24746 | # empty initial backup after volume recreation
24747 | snap3 = cmd.snapshot_create(address)
24748 | backup3_info = create_backup(address, snap3, backup_target)
24749 | assert backup3_info["VolumeName"] == volume_name
24750 | assert backup3_info["Size"] == '0'
24751 | assert backup3_info["IsIncremental"] is False
24752 |  
24753 | # write half of snap1 onto head
24754 | snap4_data = snap1_data[:int(length / 2)]
24755 | assert len(snap4_data) == int(length / 2)
24756 | verify_data(dev, offset, snap4_data)
24757 | snap4_checksum = checksum_dev(dev)
24758 | assert snap4_checksum != snap1_checksum
24759 | snap4 = cmd.snapshot_create(address)
24760 | backup4_info = create_backup(address, snap4, backup_target)
24761 | assert backup4_info["IsIncremental"] is True
24762 |  
24763 | # restore initial backup
24764 | restore_with_frontend(address, engine_name,
24765 | >                                 backup1_info["URL"])
24766 |  
24767 | data/test_backup.py:355:
24768 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
24769 | common/core.py:287: in restore_with_frontend
24770 | cmd.backup_restore(url, backup)
24771 | common/cmd.py:112: in backup_restore
24772 | return subprocess.check_output(cmd, encoding='utf-8').strip()
24773 | /usr/lib/python3.6/subprocess.py:356: in check_output
24774 | **kwargs).stdout
24775 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
24776 |  
24777 | input = None, timeout = None, check = True
24778 | popenargs = (['/go/src/github.com/longhorn/longhorn-engine/integration/common/../../bin/longhorn', '--url', 'localhost:30001', '--debug', 'backup', 'restore', ...],)
24779 | kwargs = {'encoding': 'utf-8', 'stdout': -1}
24780 | process = <subprocess.Popen object at 0x7fdb08109e48>
24781 | stdout = '{\n\t"ReplicaErrors": [\n\t\t{\n\t\t\t"Address": "tcp://localhost:10015",\n\t\t\t"Message": "can only restore backup from replica in mode RW, got ERR"\n\t\t}\n\t]\n}\n'
24782 | stderr = None, retcode = 1
24783 |  
24784 | def run(*popenargs, input=None, timeout=None, check=False, **kwargs):
24785 | """Run command with arguments and return a CompletedProcess instance.
24786 |  
24787 | The returned instance will have attributes args, returncode, stdout and
24788 | stderr. By default, stdout and stderr are not captured, and those attributes
24789 | will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.
24790 |  
24791 | If check is True and the exit code was non-zero, it raises a
24792 | CalledProcessError. The CalledProcessError object will have the return code
24793 | in the returncode attribute, and output & stderr attributes if those streams
24794 | were captured.
24795 |  
24796 | If timeout is given, and the process takes too long, a TimeoutExpired
24797 | exception will be raised.
24798 |  
24799 | There is an optional argument "input", allowing you to
24800 | pass a string to the subprocess's stdin.  If you use this argument
24801 | you may not also use the Popen constructor's "stdin" argument, as
24802 | it will be used internally.
24803 |  
24804 | The other arguments are the same as for the Popen constructor.
24805 |  
24806 | If universal_newlines=True is passed, the "input" argument must be a
24807 | string and stdout/stderr in the returned object will be strings rather than
24808 | bytes.
24809 | """
24810 | if input is not None:
24811 | if 'stdin' in kwargs:
24812 | raise ValueError('stdin and input arguments may not both be used.')
24813 | kwargs['stdin'] = PIPE
24814 |  
24815 | with Popen(*popenargs, **kwargs) as process:
24816 | try:
24817 | stdout, stderr = process.communicate(input, timeout=timeout)
24818 | except TimeoutExpired:
24819 | process.kill()
24820 | stdout, stderr = process.communicate()
24821 | raise TimeoutExpired(process.args, timeout, output=stdout,
24822 | stderr=stderr)
24823 | except:
24824 | process.kill()
24825 | process.wait()
24826 | raise
24827 | retcode = process.poll()
24828 | if check and retcode:
24829 | raise CalledProcessError(retcode, process.args,
24830 | >                                        output=stdout, stderr=stderr)
24831 | E               subprocess.CalledProcessError: Command '['/go/src/github.com/longhorn/longhorn-engine/integration/common/../../bin/longhorn', '--url', 'localhost:30001', '--debug', 'backup', 'restore', 'vfs:///data/backupbucket?backup=backup-9cdc65550f104abf&volume=test-9e3b6ea6-volume']' returned non-zero exit status 1.
24832 |  
24833 | /usr/lib/python3.6/subprocess.py:438: CalledProcessError
24834 | ----------------------------- Captured stderr call -----------------------------
24835 | time="2020-07-21T01:22:45Z" level=info msg="Backing up 5c643bd7-19bb-4580-b100-e1e56336277c on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
24836 | time="2020-07-21T01:22:46Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24837 | time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24838 | time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24839 | time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=start
24840 | time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=complete
24841 | time="2020-07-21T01:22:46Z" level=info msg="Backing up ce08bc7f-a6f4-457d-8cba-d2ca661627b7 on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
24842 | time="2020-07-21T01:22:47Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24843 | time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24844 | time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24845 | time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=start
24846 | time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=complete
24847 | time="2020-07-21T01:22:47Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10000"
24848 | time="2020-07-21T01:22:47Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10015"
24849 | time="2020-07-21T01:22:47Z" level=info msg="Successfully reset sync agent server"
24850 | time="2020-07-21T01:22:48Z" level=info msg="Backing up 055b3546-0223-470e-a640-fc866c08cd30 on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
24851 | time="2020-07-21T01:22:48Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24852 | time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24853 | time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24854 | time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
24855 | time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
24856 | time="2020-07-21T01:22:48Z" level=info msg="Backing up df143117-f9c8-473f-90f6-c16212d29bca on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
24857 | time="2020-07-21T01:22:49Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24858 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24859 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24860 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
24861 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
24862 | time="2020-07-21T01:22:49Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24863 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24864 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24865 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=start
24866 | time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=complete
24867 | time="2020-07-21T01:22:53Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24868 | time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24869 | time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24870 | time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
24871 | time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
24872 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24873 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=start
24874 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=complete
24875 | time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg on backupstore" pkg=backupstore
24876 | time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-c3893fe3ea764f10 for volume test-9e3b6ea6-volume" pkg=backupstore
24877 | time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
24878 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
24879 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
24880 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=start
24881 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=complete
24882 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
24883 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
24884 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24885 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24886 | time="2020-07-21T01:22:56Z" level=debug msg="deleted block adf74a9bac4f23fd4977af7f30c25967c38de58416f52465c8d465a145c46a24 for volume test-9e3b6ea6-volume" pkg=backupstore
24887 | time="2020-07-21T01:22:56Z" level=debug msg="Retained 2 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24888 | time="2020-07-21T01:22:56Z" level=debug msg="Removed 1 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24889 | time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
24890 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24891 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24892 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24893 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24894 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24895 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24896 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24897 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-c3893fe3ea764f10 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24898 | time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-c3893fe3ea764f10&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg in backupstore"
24899 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24900 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24901 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24902 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-c3893fe3ea764f10 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24903 | time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg in backupstore"
24904 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24905 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=start
24906 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=complete
24907 | time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg on backupstore" pkg=backupstore
24908 | time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-8aaf61823aaf40b5 for volume test-9e3b6ea6-volume" pkg=backupstore
24909 | time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
24910 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
24911 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
24912 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
24913 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
24914 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24915 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24916 | time="2020-07-21T01:22:56Z" level=debug msg="deleted block f3738654eb0c76985cbe563fc8dff05a067edba207ac5ecb8d880b293baa1e78 for volume test-9e3b6ea6-volume" pkg=backupstore
24917 | time="2020-07-21T01:22:56Z" level=debug msg="Retained 1 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24918 | time="2020-07-21T01:22:56Z" level=debug msg="Removed 1 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24919 | time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
24920 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24921 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24922 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24923 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24924 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24925 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24926 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24927 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-8aaf61823aaf40b5 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24928 | time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-8aaf61823aaf40b5&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg in backupstore"
24929 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24930 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24931 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24932 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-8aaf61823aaf40b5 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24933 | time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg in backupstore"
24934 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24935 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
24936 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
24937 | time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg on backupstore" pkg=backupstore
24938 | time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-9574a4747cda4250 for volume test-9e3b6ea6-volume" pkg=backupstore
24939 | time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
24940 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
24941 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
24942 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24943 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24944 | time="2020-07-21T01:22:56Z" level=debug msg="Retained 1 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24945 | time="2020-07-21T01:22:56Z" level=debug msg="Removed 0 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24946 | time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
24947 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24948 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24949 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24950 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24951 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24952 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24953 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24954 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-9574a4747cda4250 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24955 | time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-9574a4747cda4250&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg in backupstore"
24956 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24957 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24958 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24959 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-9574a4747cda4250 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24960 | time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg in backupstore"
24961 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24962 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
24963 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
24964 | time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg on backupstore" pkg=backupstore
24965 | time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-3dc94f4336cd48c1 for volume test-9e3b6ea6-volume" pkg=backupstore
24966 | time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
24967 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24968 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24969 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24970 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24971 | time="2020-07-21T01:22:56Z" level=debug msg="deleted block 34e2632f6a48a7264197c60a10bb71b987dd4d0c2f2b7615122d748ac16965e2 for volume test-9e3b6ea6-volume" pkg=backupstore
24972 | time="2020-07-21T01:22:56Z" level=debug msg="Retained 0 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24973 | time="2020-07-21T01:22:56Z" level=debug msg="Removed 1 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
24974 | time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
24975 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24976 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24977 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24978 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24979 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24980 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24981 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24982 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-3dc94f4336cd48c1 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24983 | time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-3dc94f4336cd48c1&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg in backupstore"
24984 | time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
24985 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
24986 | time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
24987 | time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-3dc94f4336cd48c1 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
24988 | time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg in backupstore"
24989 | time="2020-07-21T01:22:57Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10000"
24990 | time="2020-07-21T01:22:57Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10015"
24991 | time="2020-07-21T01:22:57Z" level=info msg="Successfully reset sync agent server"
24992 | time="2020-07-21T01:22:59Z" level=info msg="Backing up e65e1b6c-9c43-408e-ac12-142d5b60e200 on tcp://localhost:10000, to vfs:///data/backupbucket"
24993 | time="2020-07-21T01:23:00Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
24994 | time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
24995 | time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
24996 | time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=start
24997 | time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=complete
24998 | time="2020-07-21T01:23:00Z" level=info msg="Backing up 0abed94c-2be3-42c6-a5ed-12c5d6072d02 on tcp://localhost:10000, to vfs:///data/backupbucket"
24999 | time="2020-07-21T01:23:01Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
25000 | time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
25001 | time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
25002 | time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-126f3f8d692e4d61.cfg kind=vfs object=config pkg=backupstore reason=start
25003 | time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-126f3f8d692e4d61.cfg kind=vfs object=config pkg=backupstore reason=complete
25004 | time="2020-07-21T01:23:01Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10000"
25005 | time="2020-07-21T01:23:01Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10015"
25006 | time="2020-07-21T01:23:01Z" level=info msg="Successfully reset sync agent server"
25007 | time="2020-07-21T01:23:02Z" level=info msg="Backing up 35fd5bdc-c462-4f51-abad-3b485662d7aa on tcp://localhost:10000, to vfs:///data/backupbucket"
25008 | time="2020-07-21T01:23:02Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
25009 | time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
25010 | time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
25011 | time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-d833e3e537e84b84.cfg kind=vfs object=config pkg=backupstore reason=start
25012 | time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-d833e3e537e84b84.cfg kind=vfs object=config pkg=backupstore reason=complete
25013 | time="2020-07-21T01:23:02Z" level=info msg="Backing up df6bece8-c44c-42a5-98d5-a469eb8ef27a on tcp://localhost:10000, to vfs:///data/backupbucket"
25014 | time="2020-07-21T01:23:03Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
25015 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
25016 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
25017 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-fb7bcb032312474f.cfg kind=vfs object=config pkg=backupstore reason=start
25018 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-fb7bcb032312474f.cfg kind=vfs object=config pkg=backupstore reason=complete
25019 | time="2020-07-21T01:23:03Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
25020 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
25021 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
25022 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=start
25023 | time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=complete
25024 | time="2020-07-21T01:23:03Z" level=fatal msg="Error running restore backup command: tcp://localhost:10015: can only restore backup from replica in mode RW, got ERR"

The text was updated successfully, but these errors were encountered:

shuo-wu · 2020-07-22T03:50:27Z

This failure seems to be caused by a delay in the monitor stop. The monitor stop is best-effort, then it can happen after the volume recreation and lead to the replica mode `ERR, then leads to the restore command gets failed.

Here are the logs:

Typically, the replica close happens before the volume recreation:

[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=error msg="Backend tcp://localhost:10000 monitoring failed, mark as ERR: ping failure: replica state closed"
time="2020-07-21T01:23:01Z" level=info msg="Monitoring stopped tcp://localhost:10000"

The volume recreation:

[test-9e3b6ea6-replica-1-1] time="2020-07-21T01:23:01Z" level=info msg="New connection from: 127.0.0.1:39900"
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=info msg="Connecting to remote: localhost:10000"
time="2020-07-21T01:23:01Z" level=info msg="Opening: localhost:10000"
[test-9e3b6ea6-replica-1-1] time="2020-07-21T01:23:01Z" level=info msg="Opening volume /tmp/tmp4go56mb0, size 4194304/512"
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=info msg="Connecting to remote: localhost:10015"
[test-9e3b6ea6-replica-1-2] time="2020-07-21T01:23:01Z" level=info msg="New connection from: 127.0.0.1:54428"
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=info msg="Opening: localhost:10015"

But due to an unknown delay, another replica close happens after the volume recreation, which means the replica is added to the engine with mode RW then wrongly marked as ERROR:

time="2020-07-21T01:23:01Z" level=error msg="Backend tcp://localhost:10015 monitoring failed, mark as ERR: ping failure: replica state closed"

BTW, I don't know why the Captured stderr doesn't contain all of the related logs...

joshimoo · 2020-08-28T10:58:18Z

@shuo-wu Another instance of this test failure:
(after the restore refactor longhorn/longhorn-engine#521 & longhorn/longhorn-engine#529)
https://drone-pr.longhorn.io/longhorn/longhorn-engine/39/1/3

Looks like this test is still flaky, runs on local fails on remote.
context: longhorn/longhorn-engine#512

joshimoo · 2020-08-28T11:08:04Z

@shuo-wu Is the replica state something that would fix itself?
If so can we add a for loop just for the replica state checking here: https://github.com/longhorn/longhorn-engine/blob/91189263f90ce1b4c28b88ef7d4c49c8fecc0680/integration/common/core.py#L363

instead of needing to go through the whole reset-volume ramp down/up again?

shuo-wu · 2020-08-28T12:22:41Z

Let me explain the case with more details:

The engine controller tries to stop monitoring the old replica (address)
The engine controller removes the replica from the list without verifying/waiting for the above monitor stopping success.
A new replica with the same address is added to the controller.
the goroutine starts to close the replica then wrongly mark the new replica as ERROR

The key point is that we don't know what the progress is in this monitoring stop goroutine. Hence checking the replica state is meaningless.

shuo-wu · 2021-02-03T04:11:57Z

Maybe we can add a lock to prevent the replica from being marked as ERR during the backend removal.

yasker · 2021-06-08T19:12:41Z

We're hitting this issue more often now. Raise the priority.

yasker · 2021-06-25T18:51:59Z

@keithalucas remember to move the issue to Analysis and Design when you start working on it.

keithalucas · 2021-06-25T21:12:33Z

It looks like monitoring goroutine continues to run after the replica is deleted from the array of replicas in the controller. https://github.com/longhorn/longhorn-engine/blob/c101e7a2388ca41e84bfc0cad33366f9e22b279c/pkg/controller/control.go#L832. The monitoring function will set an address's Mode to ERR by looking up any element of the replicas array with the specified address. It is possible that:

We delete the replica with that address from the array of replicas, the replica's monitoring goroutine is still running
We add a new replica with the same address to the controller's array of replicas
The monitoring goroutine receives an error on monitorChan for the first replica with that address calls SetReplicaMode with the address.
SetReplicaMode looks for the element of the array with the address and sets its Mode to ERR.

We could create a mechanism to stop the monitoring goroutine for the replica when we delete the replica from the array of replicas and prevent any old goroutine from setting any replica with the original address to ERR.

longhorn-io-github-bot · 2021-07-02T14:13:22Z

Pre Ready-For-Testing Checklist

innobead · 2021-07-06T01:45:05Z

@keithalucas please update the checklist comment and also make sure the uncertain failed test cases got fixed before moving to ready for testing.

keithalucas · 2021-07-07T16:03:39Z

This issue is dependent on timing so this procedure to duplicate it doesn't always work.

Start two replicas.
Start a controller with two replicas.
The following program will get the list of replicas, delete one, and immediately recreate it. https://github.com/keithalucas/replica-grpc-error
1. clone repo
2. go build
3. ./replica-grpc-error longhorn-controller-ip:longhorn-controller-grpc-port

One the old codebase, it might not occur every time but, the controller will fail to recreate the replica because it has a mode of ERR.

With the new codebase, the replica will always be recreated.

keithalucas · 2021-07-09T00:40:30Z

I found another path that causes this problem. When the VolumeShutdown GRPC method is called on the controller, the replicas are closed without stopping the monitoring goroutine. The VolumeShutdown GRPC method is called frequently in the python integration tests via cleanup_controller. The output of the tests often shows Ping timeouts after the replica is closed. I was able to demonstrate this by repeatedly running dapper ./scripts/integration-test -- -s -x -k test_backup_hole_with_backing_file_raw until it failed because a replica's mode was ERR.

khushboo-rancher · 2021-07-27T18:50:26Z

For the validation, can we rely on the test results or a conscious testing is required with steps from #1628 (comment)?

khushboo-rancher · 2021-07-28T19:02:25Z

I checked the build from #392 till #431, the builds are no longer failing because of test failures related to replica being set to ERR mode

khushboo-rancher · 2021-07-28T19:29:03Z

@keithalucas Could you please help with backport PR for v1.1.3?

keithalucas · 2021-07-29T17:43:11Z

@keithalucas Could you please help with backport PR for v1.1.3?

I back ported with https://github.com/longhorn/longhorn-engine/pull/645/commits

khushboo-rancher · 2021-07-30T22:14:56Z

Closing this as the fix is validated and backported for v1.1.3

joshimoo added area/v1-data-engine v1 data engine (iSCSI tgt) kind/bug labels Jul 22, 2020

joshimoo added this to the v1.1.0 milestone Jul 22, 2020

yasker added the priority/1 Highly recommended to fix in this release (managed by PO) label Jul 22, 2020

shuo-wu mentioned this issue Aug 13, 2020

Restore refactor longhorn/longhorn-engine#521

Merged

joshimoo mentioned this issue Aug 28, 2020

Backupstore file locks longhorn/longhorn-engine#512

Merged

yasker assigned shuo-wu Oct 2, 2020

yasker modified the milestones: v1.1.0, v1.1.1 Oct 13, 2020

yasker assigned shuo-wu and joshimoo and unassigned shuo-wu and joshimoo Dec 22, 2020

yasker unassigned shuo-wu Jan 19, 2021

yasker modified the milestones: v1.1.1, v1.1.2 Jan 19, 2021

shuo-wu mentioned this issue Feb 3, 2021

Enable Backing image longhorn/longhorn-engine#557

Merged

jenting mentioned this issue Feb 24, 2021

Bump longhorn/backupstore version to add KubernetesStatus metadata to volume.cfg longhorn/longhorn-engine#563

Merged

innobead modified the milestones: v1.1.2, v1.1.1 Feb 26, 2021

innobead modified the milestones: v1.1.1, v1.1.2 Mar 16, 2021

innobead removed this from the v1.1.2 milestone Apr 29, 2021

yasker mentioned this issue May 27, 2021

Return a non-zero value of one of the WriteAts is successful. longhorn/longhorn-engine#621

Merged

joshimoo mentioned this issue Jun 8, 2021

Update sparse-tools to get replica rebuild enhancement longhorn/longhorn-engine#622

Merged

yasker assigned keithalucas Jun 8, 2021

yasker modified the milestones: v1.2.1, v1.2.0 Jun 8, 2021

joshimoo mentioned this issue Jun 28, 2021

Stop monitoring goroutine when removing replica longhorn/longhorn-engine#632

Merged

keithalucas mentioned this issue Jul 2, 2021

Add test for removing a replica and recreating it longhorn/longhorn-engine#634

Merged

This was referenced Jul 9, 2021

Make closing all the replicas stop all the replica monitors longhorn/longhorn-engine#637

Merged

[BUG] test related to backup with backing file fail inconsistently on with master longhorn-engine #2783

Closed

innobead added the backport/1.1.3 Require to backport to 1.1.3 release branch label Jul 16, 2021

keithalucas mentioned this issue Jul 29, 2021

1628 backport closing monitoring channel longhorn/longhorn-engine#645

Merged

khushboo-rancher closed this as completed Jul 30, 2021

innobead added backport-needed/1.1.x and removed backport/1.1.3 Require to backport to 1.1.3 release branch labels Oct 11, 2021

innobead added the backport/1.1.3 Require to backport to 1.1.3 release branch label Dec 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628

[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628

joshimoo commented Jul 22, 2020

shuo-wu commented Jul 22, 2020 •

edited

joshimoo commented Aug 28, 2020

joshimoo commented Aug 28, 2020

shuo-wu commented Aug 28, 2020

shuo-wu commented Feb 3, 2021

yasker commented Jun 8, 2021

yasker commented Jun 25, 2021

keithalucas commented Jun 25, 2021

longhorn-io-github-bot commented Jul 2, 2021 •

edited by keithalucas

innobead commented Jul 6, 2021

keithalucas commented Jul 7, 2021 •

edited

keithalucas commented Jul 9, 2021

khushboo-rancher commented Jul 27, 2021

khushboo-rancher commented Jul 28, 2021

khushboo-rancher commented Jul 28, 2021 •

edited

keithalucas commented Jul 29, 2021

khushboo-rancher commented Jul 30, 2021

[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628

[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628

Comments

joshimoo commented Jul 22, 2020

shuo-wu commented Jul 22, 2020 • edited

joshimoo commented Aug 28, 2020

joshimoo commented Aug 28, 2020

shuo-wu commented Aug 28, 2020

shuo-wu commented Feb 3, 2021

yasker commented Jun 8, 2021

yasker commented Jun 25, 2021

keithalucas commented Jun 25, 2021

longhorn-io-github-bot commented Jul 2, 2021 • edited by keithalucas

Pre Ready-For-Testing Checklist

innobead commented Jul 6, 2021

keithalucas commented Jul 7, 2021 • edited

keithalucas commented Jul 9, 2021

khushboo-rancher commented Jul 27, 2021

khushboo-rancher commented Jul 28, 2021

khushboo-rancher commented Jul 28, 2021 • edited

keithalucas commented Jul 29, 2021

khushboo-rancher commented Jul 30, 2021

shuo-wu commented Jul 22, 2020 •

edited

longhorn-io-github-bot commented Jul 2, 2021 •

edited by keithalucas

keithalucas commented Jul 7, 2021 •

edited

khushboo-rancher commented Jul 28, 2021 •

edited