
Request to clean up instance's previous state's virtual NIC from host's stack. #983

Open · markrattray opened this issue Jul 11, 2024 · 3 comments
Labels: Incomplete (Waiting on more information from reporter)
markrattray commented Jul 11, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: Ubuntu Server 22.04
  • The output of "incus info":
config:
  cluster.https_address: somenode.somedomain.com:8443
  core.https_address: 192.168.1.5:8443
  network.ovn.northbound_connection: tcp:192.168.1.5:6641,tcp:192.168.1.9:6641,tcp:192.168.1.11:6641,tcp:192.168.1.13:6641
  storage.backups_volume: isp03/isb_somenode1
  storage.images_volume: isp03/isi_somenode1
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: someadmin
auth_user_method: unix
environment:
  addresses:
  - 192.168.1.5:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    somcert
    -----END CERTIFICATE-----
  certificate_fingerprint: somefingerprint
  driver: lxc | qemu
  driver_version: 6.0.1 | 9.0.1
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 6.5.0-41-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: someproject
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: somenode1
  server_pid: 3302431
  server_version: "6.2"
  storage: zfs
  storage_version: 2.1.5-1ubuntu6~22.04.4
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0
    remote: false
  - name: lvmcluster
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0
    remote: true
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.4
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false

  • Storage backend in use: Incus-managed ZFS on local disks.

Issue description

At the moment we are still using macvlan-type NICs with Incus instances, until we're able to move to OVN.

Sometimes VMs (mostly Windows) are unable to start because the virtual NIC from the instance's previous state (recorded in volatile.eth0.host_name) is still bound to the parent NIC on the Incus node.
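For reference, the host-side interface name that Incus recorded for the instance can be read back with incus config get (the instance name here is a placeholder):

    incus config get someinstance volatile.eth0.host_name
    # prints the host-side device name, e.g. mace74a984e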

It's not limited to one Incus node, and the software installed on them is quite different:

  • incus node 01: MS SQL Server with an additional block device on enterprise-grade NVMe for data, root disk on enterprise grade
  • incus node 03: MS IIS server with no additional devices, root disk on enterprise-grade RAID10 SSDs
  • Both are running WS2022.
  • I tested another instance created from the same image, which has 4 additional block devices; after an update and reboot it came up fine.

This happened before under LXD and was partially resolved:

incus start {instance-name}
	Error: Failed to start device "eth0": Failed adding link: Failed to run: ip link add name mace74a984e link br0 address 00:16:3e:24:a3:7a allmulticast on up type macvtap mode bridge: exit status 2 (RTNETLINK answers: Address already in use)

The manual steps to get the VM back up and running are as follows (a scripted sketch of steps 2-3 follows the list):

  1. find the Incus node that the instance is running on
  2. on that node run: ip link show | grep -B 1 '{instance-mac-address}'
  3. delete the virtual NIC: sudo ip link delete mac0f01152c
  4. start the instance
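A minimal sketch that automates steps 2-3, assuming you already know the instance's MAC address (the script name and structure are mine, not an official tool; it relies only on standard iproute2 commands):

    #!/bin/sh
    # cleanup-stale-nic.sh <mac-address>
    # Deletes any host-side link bearing the given MAC, as in steps 2-3 above.
    mac="$1"
    # list matching interface names, stripping any "@parent" suffix and trailing colon
    ip -o link | awk -v m="$mac" 'tolower($0) ~ tolower(m) {gsub(/@.*/, "", $2); sub(/:$/, "", $2); print $2}' |
    while read -r dev; do
        echo "Deleting stale link $dev"
        sudo ip link delete "$dev"
    done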

My request is for Incus to do the following during the VM startup process (a rough shell approximation follows the list):

  1. get the instance's MAC address(es)
  2. check the Incus node's network stack for lingering virtual NICs with those MACs and delete them
  3. continue with the VM startup
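Something along these lines, expressed as shell (volatile.eth0.hwaddr is the standard volatile key holding the NIC's MAC; the instance name is a placeholder, and the maintainer's caveat below about duplicate MACs being legitimate in some environments applies):

    # pre-start cleanup sketch: remove any host link already holding the instance's MAC
    mac=$(incus config get someinstance volatile.eth0.hwaddr)
    ip -o link | awk -v m="$mac" 'tolower($0) ~ tolower(m) {gsub(/@.*/, "", $2); sub(/:$/, "", $2); print $2}' |
    while read -r dev; do sudo ip link delete "$dev"; done
    incus start someinstance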

I think the previous attempts at fixing this may have been too granular. Perhaps there is no need to fix each scenario where this can happen separately; it may be enough to clean up on every start, since I have observed that these virtual NICs change at every startup. That said, I might not be aware of other scenarios where what I'm suggesting would cause problems.

Thanks

Steps to reproduce

  1. A VM crashes or reboots (in this scenario, the 2 affected VMs rebooted after Windows updates).
  2. The VM doesn't come back up.
  3. Try to start it manually and observe the error: Address already in use.

Information to attach

  • Any relevant kernel output (dmesg)
    none
  • Container log (incus info NAME --show-log)
  • Container configuration (incus config show NAME --expanded)
  • Main daemon log (at /var/log/incus/incusd.log)
This is the only entry from around that time:
time="2024-07-11T00:44:15Z" level=error msg="Failed to cleanly stop instance" err="Failed to start device \"eth0\": Failed adding link: Failed to run: ip link add name mac450200c0 link br0 address 00:16:3e:24:a3:7a allmulticast on up type macvtap mode bridge: exit status 2 (RTNETLINK answers: Address already in use)" instance=someinstance instanceType=virtual-machine project=someproject

  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of incus monitor --pretty while reproducing the issue)
stgraber (Member) commented

Iterating over all the host interfaces to try to clean up potential conflicts shouldn't be needed, and may actually be dangerous: it's perfectly valid in some environments to have the same MAC on multiple interfaces, and starting to arbitrarily delete them may just cause a whole bunch of issues.

I spent around 30 minutes trying to reproduce the issue you're describing, both by killing QEMU to simulate a hard crash and by triggering reboots from within a VM (which seems to be the trigger for you), but I never managed to get the issue to happen here, so we're going to need some kind of somewhat reliable reproducer.

Looking at the macvlan NIC cleanup logic, I'm not seeing anything wrong in there. As soon as the VM comes down, it triggers the onStop action, which iterates over all the devices on the instance and calls their Stop command. In the macvlan case, this returns a function that deletes the host device. I also did a test build here to make sure that code path is properly being hit during an instance-initiated reboot, and it was.

stgraber added the Incomplete (Waiting on more information from reporter) label on Jul 11, 2024
stgraber (Member) commented

If you can reproduce this somewhat reliably with a VM, it'd be good to run incus monitor --pretty on the system the VM is running on, then reboot the VM and see it hit the issue (example below). That should show us a better trace of all the calls being made.
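For example (instance name hypothetical; run on the cluster member hosting the VM):

    # terminal 1: capture the event stream while reproducing
    incus monitor --pretty | tee monitor.log

    # terminal 2: trigger the reboot, or reboot from inside the guest to mirror the failure case
    incus restart someinstance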

Having the full incus config show --expanded output for an affected VM would also help as it's certainly possible that other devices or configuration are impacting this.

markrattray (Author) commented

Good morning. Sorry, I had a few emergencies so I've been away. Thank you for your efforts and for checking all of this out.

It's a bit random, unfortunately, and I've been rebooting VMs based on the same image regularly. The problematic ones did have a lot more workload than the ones I was rebooting. I'm working this Sunday, so I'll see if I can reproduce the scenario again.

It might have something to do with the network setup on these hosts. OVN wanted a dedicated NIC or a bridge, so to test OVN I deployed a bridge and then OVN on a single NIC, but we're still using macvlan NICs for instances due to a routing issue to/from external networks and routed OVN networks.
