[BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628

Closed
joshimoo opened this issue Jul 22, 2020 · 18 comments
Labels: area/v1-data-engine, backport/1.1.3, kind/bug, priority/1

@joshimoo
Contributor

Describe the bug
We noticed engine integration test failures during a restore operation.
The error does not depend on the specific test method; potentially, any restore operation in a test could trigger it.

To Reproduce

  • Unknown

Expected behavior
The restore should succeed; the replica should remain in mode RW instead of being marked ERR.


Environment:

  • Longhorn version: 1.0.1 (rc/master)
  • Kubernetes version: 1.17
  • Node OS type and version: ubuntu

Additional context

Test failure

=================================== FAILURES ===================================
________________________ test_backup_incremental_logic _________________________

grpc_replica1 = <rpc.replica.replica_client.ReplicaClient object at 0x7fdb0811b0b8>
grpc_replica2 = <rpc.replica.replica_client.ReplicaClient object at 0x7fdb082e8940>
grpc_controller = <rpc.controller.controller_client.ControllerClient object at 0x7fdb08402470>
backup_targets = ['s3://backupbucket@us-east-1/s3-test', 'vfs:///data/backupbucket']

    def test_backup_incremental_logic(grpc_replica1, grpc_replica2,  # NOQA
                                      grpc_controller, backup_targets):  # NOQA
        for backup_target in backup_targets:
            dev = get_dev(grpc_replica1, grpc_replica2,
                          grpc_controller)
            address = grpc_controller.address
            volume_name = VOLUME_NAME
            engine_name = ENGINE_NAME
            offset = 0
            length = 128

            # initial backup
            snap1_data = random_string(length)
            verify_data(dev, offset, snap1_data)
            snap1_checksum = checksum_dev(dev)
            snap1 = cmd.snapshot_create(address)
            backup1_info = create_backup(address, snap1, backup_target)
            assert backup1_info["IsIncremental"] is False

            # delta backup on top of initial backup
            snap2_data = random_string(int(length / 2))
            verify_data(dev, offset, snap2_data)
            snap2 = cmd.snapshot_create(address)
            backup2_info = create_backup(address, snap2, backup_target)
            assert backup2_info["IsIncremental"] is True

            # delete the volume
            cmd.sync_agent_server_reset(address)
            grpc_controller = cleanup_controller(grpc_controller)
            grpc_replica1 = cleanup_replica(grpc_replica1)
            grpc_replica2 = cleanup_replica(grpc_replica2)

            # recreate the volume
            dev = get_dev(grpc_replica1, grpc_replica2,
                          grpc_controller, clean_backup_dir=False)

            # empty initial backup after volume recreation
            snap3 = cmd.snapshot_create(address)
            backup3_info = create_backup(address, snap3, backup_target)
            assert backup3_info["VolumeName"] == volume_name
            assert backup3_info["Size"] == '0'
            assert backup3_info["IsIncremental"] is False

            # write half of snap1 onto head
            snap4_data = snap1_data[:int(length / 2)]
            assert len(snap4_data) == int(length / 2)
            verify_data(dev, offset, snap4_data)
            snap4_checksum = checksum_dev(dev)
            assert snap4_checksum != snap1_checksum
            snap4 = cmd.snapshot_create(address)
            backup4_info = create_backup(address, snap4, backup_target)
            assert backup4_info["IsIncremental"] is True

            # restore initial backup
            restore_with_frontend(address, engine_name,
>                                 backup1_info["URL"])

data/test_backup.py:355:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
common/core.py:287: in restore_with_frontend
    cmd.backup_restore(url, backup)
common/cmd.py:112: in backup_restore
    return subprocess.check_output(cmd, encoding='utf-8').strip()
/usr/lib/python3.6/subprocess.py:356: in check_output
    **kwargs).stdout
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

input = None, timeout = None, check = True
popenargs = (['/go/src/github.com/longhorn/longhorn-engine/integration/common/../../bin/longhorn', '--url', 'localhost:30001', '--debug', 'backup', 'restore', ...],)
kwargs = {'encoding': 'utf-8', 'stdout': -1}
process = <subprocess.Popen object at 0x7fdb08109e48>
stdout = '{\n\t"ReplicaErrors": [\n\t\t{\n\t\t\t"Address": "tcp://localhost:10015",\n\t\t\t"Message": "can only restore backup from replica in mode RW, got ERR"\n\t\t}\n\t]\n}\n'
stderr = None, retcode = 1

    def run(*popenargs, input=None, timeout=None, check=False, **kwargs):
        """Run command with arguments and return a CompletedProcess instance.

        The returned instance will have attributes args, returncode, stdout and
        stderr. By default, stdout and stderr are not captured, and those attributes
        will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.

        If check is True and the exit code was non-zero, it raises a
        CalledProcessError. The CalledProcessError object will have the return code
        in the returncode attribute, and output & stderr attributes if those streams
        were captured.

        If timeout is given, and the process takes too long, a TimeoutExpired
        exception will be raised.

        There is an optional argument "input", allowing you to
        pass a string to the subprocess's stdin.  If you use this argument
        you may not also use the Popen constructor's "stdin" argument, as
        it will be used internally.

        The other arguments are the same as for the Popen constructor.

        If universal_newlines=True is passed, the "input" argument must be a
        string and stdout/stderr in the returned object will be strings rather than
        bytes.
        """
        if input is not None:
            if 'stdin' in kwargs:
                raise ValueError('stdin and input arguments may not both be used.')
            kwargs['stdin'] = PIPE

        with Popen(*popenargs, **kwargs) as process:
            try:
                stdout, stderr = process.communicate(input, timeout=timeout)
            except TimeoutExpired:
                process.kill()
                stdout, stderr = process.communicate()
                raise TimeoutExpired(process.args, timeout, output=stdout,
                                     stderr=stderr)
            except:
                process.kill()
                process.wait()
                raise
            retcode = process.poll()
            if check and retcode:
                raise CalledProcessError(retcode, process.args,
>                                        output=stdout, stderr=stderr)
E               subprocess.CalledProcessError: Command '['/go/src/github.com/longhorn/longhorn-engine/integration/common/../../bin/longhorn', '--url', 'localhost:30001', '--debug', 'backup', 'restore', 'vfs:///data/backupbucket?backup=backup-9cdc65550f104abf&volume=test-9e3b6ea6-volume']' returned non-zero exit status 1.

/usr/lib/python3.6/subprocess.py:438: CalledProcessError
----------------------------- Captured stderr call -----------------------------
time="2020-07-21T01:22:45Z" level=info msg="Backing up 5c643bd7-19bb-4580-b100-e1e56336277c on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
time="2020-07-21T01:22:46Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:46Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:46Z" level=info msg="Backing up ce08bc7f-a6f4-457d-8cba-d2ca661627b7 on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
time="2020-07-21T01:22:47Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:47Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:47Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10000"
time="2020-07-21T01:22:47Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10015"
time="2020-07-21T01:22:47Z" level=info msg="Successfully reset sync agent server"
time="2020-07-21T01:22:48Z" level=info msg="Backing up 055b3546-0223-470e-a640-fc866c08cd30 on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
time="2020-07-21T01:22:48Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:48Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:48Z" level=info msg="Backing up df143117-f9c8-473f-90f6-c16212d29bca on tcp://localhost:10000, to s3://backupbucket@us-east-1/s3-test"
time="2020-07-21T01:22:49Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:49Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:49Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:53Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:53Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg on backupstore" pkg=backupstore
time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-c3893fe3ea764f10 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="deleted block adf74a9bac4f23fd4977af7f30c25967c38de58416f52465c8d465a145c46a24 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Retained 2 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Removed 1 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-c3893fe3ea764f10 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-c3893fe3ea764f10&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-c3893fe3ea764f10 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-c3893fe3ea764f10.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg on backupstore" pkg=backupstore
time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-8aaf61823aaf40b5 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="deleted block f3738654eb0c76985cbe563fc8dff05a067edba207ac5ecb8d880b293baa1e78 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Retained 1 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Removed 1 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-8aaf61823aaf40b5 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-8aaf61823aaf40b5&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-8aaf61823aaf40b5 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-8aaf61823aaf40b5.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg on backupstore" pkg=backupstore
time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-9574a4747cda4250 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Retained 1 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Removed 0 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-9574a4747cda4250 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-9574a4747cda4250&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-9574a4747cda4250 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9574a4747cda4250.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Removed backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg on backupstore" pkg=backupstore
time="2020-07-21T01:22:56Z" level=info msg="Removed backup backup-3dc94f4336cd48c1 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC started" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="deleted block 34e2632f6a48a7264197c60a10bb71b987dd4d0c2f2b7615122d748ac16965e2 for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Retained 0 blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="Removed 1 unused blocks for volume test-9e3b6ea6-volume" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug msg="GC completed" pkg=backupstore
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-3dc94f4336cd48c1 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=fatal msg="Error running restore backup command: no backups found with url s3://backupbucket@us-east-1/s3-test?backup=backup-3dc94f4336cd48c1&volume=test-9e3b6ea6-volume: cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg in backupstore"
time="2020-07-21T01:22:56Z" level=debug msg="Loaded driver for s3://backupbucket@us-east-1/s3-test" pkg=s3
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=start
time="2020-07-21T01:22:56Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=s3 object=config pkg=backupstore reason=complete
time="2020-07-21T01:22:56Z" level=info msg="Failed to load backup in backupstore" backup=backup-3dc94f4336cd48c1 event=list object=backup pkg=backupstore reason=fallback volume=test-9e3b6ea6-volume
time="2020-07-21T01:22:56Z" level=error msg="cannot find backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-3dc94f4336cd48c1.cfg in backupstore"
time="2020-07-21T01:22:57Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10000"
time="2020-07-21T01:22:57Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10015"
time="2020-07-21T01:22:57Z" level=info msg="Successfully reset sync agent server"
time="2020-07-21T01:22:59Z" level=info msg="Backing up e65e1b6c-9c43-408e-ac12-142d5b60e200 on tcp://localhost:10000, to vfs:///data/backupbucket"
time="2020-07-21T01:23:00Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:00Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:00Z" level=info msg="Backing up 0abed94c-2be3-42c6-a5ed-12c5d6072d02 on tcp://localhost:10000, to vfs:///data/backupbucket"
time="2020-07-21T01:23:01Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-126f3f8d692e4d61.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:01Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-126f3f8d692e4d61.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:01Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10000"
time="2020-07-21T01:23:01Z" level=info msg="Performing sync-agent-server-reset for replica tcp://localhost:10015"
time="2020-07-21T01:23:01Z" level=info msg="Successfully reset sync agent server"
time="2020-07-21T01:23:02Z" level=info msg="Backing up 35fd5bdc-c462-4f51-abad-3b485662d7aa on tcp://localhost:10000, to vfs:///data/backupbucket"
time="2020-07-21T01:23:02Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-d833e3e537e84b84.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:02Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-d833e3e537e84b84.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:02Z" level=info msg="Backing up df6bece8-c44c-42a5-98d5-a469eb8ef27a on tcp://localhost:10000, to vfs:///data/backupbucket"
time="2020-07-21T01:23:03Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-fb7bcb032312474f.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-fb7bcb032312474f.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:03Z" level=debug msg="Loaded driver for vfs:///data/backupbucket" pkg=vfs
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/volume.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=start
time="2020-07-21T01:23:03Z" level=debug filepath=backupstore/volumes/f4/43/test-9e3b6ea6-volume/backups/backup_backup-9cdc65550f104abf.cfg kind=vfs object=config pkg=backupstore reason=complete
time="2020-07-21T01:23:03Z" level=fatal msg="Error running restore backup command: tcp://localhost:10015: can only restore backup from replica in mode RW, got ERR"
@joshimoo joshimoo added area/v1-data-engine v1 data engine (iSCSI tgt) kind/bug labels Jul 22, 2020
@joshimoo joshimoo added this to the v1.1.0 milestone Jul 22, 2020
@shuo-wu
Contributor

shuo-wu commented Jul 22, 2020

This failure seems to be caused by a delay in the monitor stop. The monitor stop is best-effort, so it can happen after the volume recreation; this sets the replica mode to `ERR`, which in turn makes the restore command fail.

Here are the logs:

  1. Typically, the replica close happens before the volume recreation:
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=error msg="Backend tcp://localhost:10000 monitoring failed, mark as ERR: ping failure: replica state closed"
time="2020-07-21T01:23:01Z" level=info msg="Monitoring stopped tcp://localhost:10000"
  2. The volume recreation:
[test-9e3b6ea6-replica-1-1] time="2020-07-21T01:23:01Z" level=info msg="New connection from: 127.0.0.1:39900"
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=info msg="Connecting to remote: localhost:10000"
time="2020-07-21T01:23:01Z" level=info msg="Opening: localhost:10000"
[test-9e3b6ea6-replica-1-1] time="2020-07-21T01:23:01Z" level=info msg="Opening volume /tmp/tmp4go56mb0, size 4194304/512"
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=info msg="Connecting to remote: localhost:10015"
[test-9e3b6ea6-replica-1-2] time="2020-07-21T01:23:01Z" level=info msg="New connection from: 127.0.0.1:54428"
[test-9e3b6ea6-engine] time="2020-07-21T01:23:01Z" level=info msg="Opening: localhost:10015"
  3. But due to an unknown delay, another replica close happens after the volume recreation, which means the replica is added to the engine in mode RW and then wrongly marked as ERR:
time="2020-07-21T01:23:01Z" level=error msg="Backend tcp://localhost:10015 monitoring failed, mark as ERR: ping failure: replica state closed"

BTW, I don't know why the Captured stderr doesn't contain all of the related logs...

@yasker yasker added the priority/1 Highly recommended to fix in this release (managed by PO) label Jul 22, 2020
@joshimoo
Contributor Author

@shuo-wu Another instance of this test failure:
(after the restore refactor longhorn/longhorn-engine#521 & longhorn/longhorn-engine#529)
https://drone-pr.longhorn.io/longhorn/longhorn-engine/39/1/3

Looks like this test is still flaky: it passes locally but fails on the remote CI.
context: longhorn/longhorn-engine#512

@joshimoo
Contributor Author

@shuo-wu Is the replica state something that would fix itself?
If so, can we add a retry loop just for the replica state check here: https://github.com/longhorn/longhorn-engine/blob/91189263f90ce1b4c28b88ef7d4c49c8fecc0680/integration/common/core.py#L363

instead of needing to go through the whole reset-volume ramp down/up again?

@shuo-wu
Contributor

shuo-wu commented Aug 28, 2020

Let me explain the case in more detail:

  1. The engine controller tries to stop monitoring the old replica (address).
  2. The engine controller removes the replica from the list without verifying/waiting for the above monitor stop to complete.
  3. A new replica with the same address is added to the controller.
  4. The goroutine starts to close the old replica, then wrongly marks the new replica as ERR.

The key point is that we don't know how far along this monitor-stop goroutine is, so checking the replica state is meaningless.
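
To make the ordering concrete, here is a minimal, self-contained sketch of the race with hypothetical names (not the actual longhorn-engine code): because the stale monitor goroutine identifies the replica only by its address, it can mark a freshly re-added replica as ERR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type controller struct {
	mu       sync.Mutex
	replicas map[string]string // address -> mode ("RW" or "ERR")
}

// setReplicaMode looks the replica up by address only, so it cannot tell an
// old replica instance apart from one re-added at the same address.
func (c *controller) setReplicaMode(address, mode string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.replicas[address]; ok {
		c.replicas[address] = mode
	}
}

func main() {
	addr := "tcp://localhost:10015"
	c := &controller{replicas: map[string]string{addr: "RW"}}

	// Steps 1-2: the replica is removed, but its monitor goroutine lives on.
	c.mu.Lock()
	delete(c.replicas, addr)
	c.mu.Unlock()
	go func() {
		time.Sleep(10 * time.Millisecond) // the "unknown delay"
		// Step 4: the stale monitor sees a ping failure and marks the
		// address ERR -- but by now the address belongs to a new replica.
		c.setReplicaMode(addr, "ERR")
	}()

	// Step 3: a new replica with the same address is added in mode RW.
	c.mu.Lock()
	c.replicas[addr] = "RW"
	c.mu.Unlock()

	time.Sleep(50 * time.Millisecond)
	fmt.Println(c.replicas[addr]) // "ERR": the new replica is wrongly failed
}
```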

@yasker yasker modified the milestones: v1.1.0, v1.1.1 Oct 13, 2020
@yasker yasker assigned shuo-wu and joshimoo and unassigned shuo-wu and joshimoo Dec 22, 2020
@yasker yasker modified the milestones: v1.1.1, v1.1.2 Jan 19, 2021
@shuo-wu
Contributor

shuo-wu commented Feb 3, 2021

Maybe we can add a lock to prevent the replica from being marked as ERR during the backend removal.
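
A rough sketch of that locking idea, with hypothetical names (not the actual longhorn-engine code): removal and the monitor's error path take the same lock, and removal flips a per-replica flag, so a late ping failure can no longer mark a re-added replica ERR.

```go
package controller

import "sync"

type replica struct {
	mode    string
	removed bool
}

type Controller struct {
	mu       sync.Mutex
	replicas map[string]*replica
}

// RemoveReplica marks the entry as removed under the lock before dropping it,
// so a monitor goroutine that still holds the old *replica can detect this.
func (c *Controller) RemoveReplica(address string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.replicas[address]; ok {
		r.removed = true
		delete(c.replicas, address)
	}
}

// markErr is what the monitor goroutine would call on a ping failure. It
// holds the replica instance it was started with, not just the address, and
// takes the same lock, so it cannot touch a replica re-added at that address.
func (c *Controller) markErr(r *replica) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r.removed {
		return
	}
	r.mode = "ERR"
}
```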

@innobead innobead added investigation-needed Need to identify the case before estimating and starting the development and removed investigation-needed Need to identify the case before estimating and starting the development reprioritization-needed Need to reconsider to re-prioritize in another milestone instead of the current one labels May 25, 2021
@yasker yasker modified the milestones: v1.2.1, v1.2.0 Jun 8, 2021
@yasker
Member

yasker commented Jun 8, 2021

We're hitting this issue more often now. Raise the priority.

@yasker
Member

yasker commented Jun 25, 2021

@keithalucas remember to move the issue to Analysis and Design when you start working on it.

@keithalucas

It looks like the monitoring goroutine continues to run after the replica is deleted from the array of replicas in the controller. https://github.com/longhorn/longhorn-engine/blob/c101e7a2388ca41e84bfc0cad33366f9e22b279c/pkg/controller/control.go#L832. The monitoring function will set an address's Mode to ERR by looking up any element of the replicas array with the specified address. It is possible that:

  1. We delete the replica with that address from the array of replicas, the replica's monitoring goroutine is still running
  2. We add a new replica with the same address to the controller's array of replicas
  3. The monitoring goroutine receives an error on monitorChan for the first replica with that address and calls SetReplicaMode with the address.
  4. SetReplicaMode looks for the element of the array with the address and sets its Mode to ERR.

We could create a mechanism to stop the monitoring goroutine for the replica when we delete the replica from the array of replicas and prevent any old goroutine from setting any replica with the original address to ERR.
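
A minimal sketch of such a mechanism, with hypothetical names (not the code that was eventually merged): each replica carries its own stop channel that the removal path closes, the monitor goroutine selects on it next to monitorChan, and a pointer-identity check guards the final mode update.

```go
package controller

import "sync"

type Replica struct {
	Address     string
	Mode        string
	monitorChan chan error
	stopChan    chan struct{}
}

type Controller struct {
	mu       sync.Mutex
	replicas map[string]*Replica
}

// monitor exits as soon as stopChan is closed, instead of lingering until
// the next ping error and poisoning a re-added replica at the same address.
func (c *Controller) monitor(r *Replica) {
	for {
		select {
		case <-r.stopChan:
			return
		case err := <-r.monitorChan:
			if err != nil {
				c.setErrIfCurrent(r)
				return
			}
		}
	}
}

// RemoveReplica stops the monitor before the address slot can be reused.
func (c *Controller) RemoveReplica(address string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.replicas[address]; ok {
		close(r.stopChan)
		delete(c.replicas, address)
	}
}

// setErrIfCurrent only marks the replica ERR if the entry registered for the
// address is still the exact instance this monitor was started with.
func (c *Controller) setErrIfCurrent(r *Replica) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.replicas[r.Address]; ok && cur == r {
		cur.Mode = "ERR"
	}
}
```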

@longhorn-io-github-bot

longhorn-io-github-bot commented Jul 2, 2021

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: [BUG] Investigate engine integration test failure - restore_with_frontend: can only restore backup from replica in mode RW, got ERR #1628 (comment)

  • Is there a workaround for the issue? If so, where is it documented?

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at Stop monitoring goroutine when removing replica longhorn-engine#632 and
    Make closing all the replicas stop all the replica monitors longhorn-engine#637

  • Which areas/issues this PR might have potential impacts on?
    Area Engine, Data Path
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at Add test for removing a replica and recreating it longhorn-engine#634

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@innobead
Member

innobead commented Jul 6, 2021

@keithalucas please update the checklist comment, and make sure the uncertain failed test cases are fixed before moving this to Ready For Testing.

@keithalucas

keithalucas commented Jul 7, 2021

This issue is timing-dependent, so this procedure to reproduce it doesn't always work.

  1. Start two replicas.
  2. Start a controller with two replicas.
  3. The following program will get the list of replicas, delete one, and immediately recreate it. https://github.com/keithalucas/replica-grpc-error
    1. clone repo
    2. go build
    3. ./replica-grpc-error longhorn-controller-ip:longhorn-controller-grpc-port

On the old codebase it might not occur every time, but the controller will fail to recreate the replica because it has a mode of ERR.

With the new codebase, the replica will always be recreated.

@keithalucas

I found another path that causes this problem. When the `VolumeShutdown` gRPC method is called on the controller, the replicas are closed without stopping the monitoring goroutine. The `VolumeShutdown` gRPC method is called frequently in the python integration tests via `cleanup_controller`. The output of the tests often shows ping timeouts after the replica is closed. I was able to demonstrate this by repeatedly running `dapper ./scripts/integration-test -- -s -x -k test_backup_hole_with_backing_file_raw` until it failed because a replica's mode was ERR.

@khushboo-rancher
Contributor

For the validation, can we rely on the test results, or is conscious testing required with the steps from #1628 (comment)?

@khushboo-rancher
Contributor

I checked the builds from #392 through #431; they are no longer failing because of test failures related to the replica being set to ERR mode.

@khushboo-rancher
Contributor

khushboo-rancher commented Jul 28, 2021

@keithalucas Could you please help with backport PR for v1.1.3?

@keithalucas

> @keithalucas Could you please help with backport PR for v1.1.3?

I backported it with https://github.com/longhorn/longhorn-engine/pull/645/commits

@khushboo-rancher
Contributor

Closing this as the fix is validated and backported to v1.1.3.

@innobead innobead added backport-needed/1.1.x and removed backport/1.1.3 Require to backport to 1.1.3 release branch labels Oct 11, 2021
@innobead innobead added the backport/1.1.3 Require to backport to 1.1.3 release branch label Dec 10, 2021