Nesting (docker) in containers broken on Ubuntu 24.04 #791

Closed
mmanjos opened this issue Apr 26, 2024 · 7 comments

@mmanjos

mmanjos commented Apr 26, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: 24.04 LTS Final (not a development prerelease)
  • Incus version: 6.0.0
incus info
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - 1.2.3.4:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
  certificate_fingerprint:
  driver: lxc
  driver_version: 5.0.3
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.8.0-31-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "24.04"
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: test
  server_pid: 2874
  server_version: 6.0.0
  storage: btrfs
  storage_version: 6.6.3
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: false
  - name: lvmcluster
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: true
  - name: btrfs
    version: 6.6.3
    remote: false

Issue description

I'm trying to run docker inside a simple lxc container, using a config that worked on an early development release kernel of 24.04. I've rebuilt the system to 24.04 final and it appears something has broken container nesting on this latest kernel.

My incus container config has:

  security.nesting: "true"
  security.privileged: "true"

When I try to start any docker container from within this incus container, I get:
$ docker run -it alpine /bin/sh
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error jailing process inside rootfs: pivot_root .: permission denied: unknown.
ERRO[0000] error waiting for container:

And the parent host generates an audit message:

[ 2834.188006] audit: type=1400 audit(1714097921.411:1252): apparmor="DENIED" operation="pivotroot" class="mount" namespace="root//incus-test_<var-lib-incus>" profile="runc" name="/var/lib/docker/overlay2/38b1e498be70b0eff840bc92770eef7ebaa1c2b3caea9bf0f93bf5ff53088c28/merged/" pid=26921 comm="runc:[2:INIT]" srcname="/var/lib/docker/overlay2/38b1e498be70b0eff840bc92770eef7ebaa1c2b3caea9bf0f93bf5ff53088c28/merged/"

I've tried disabling the new Ubuntu-specific unprivileged user namespace restrictions by setting kernel.apparmor_restrict_unprivileged_userns=0, but it did not help.

@mmanjos
Author

mmanjos commented Apr 26, 2024

hmm... it looks like there's a similar issue opened up with LXD

@stgraber
Member

Never use security.privileged for this kind of stuff, it prevents the use of AppArmor namespaces on top of making your host system extremely vulnerable to attacks.

@mmanjos
Author

mmanjos commented Apr 26, 2024

> Never use security.privileged for this kind of stuff, it prevents the use of AppArmor namespaces on top of making your host system extremely vulnerable to attacks.

Thanks - that was just there for testing to see if it would make a difference. Using security.privileged: "false" doesn't fix this issue. (I was trying to disable anything security or apparmor related to try to find the source of this regression on a lab server)

@mmanjos
Author

mmanjos commented Apr 26, 2024

As a workaround, moving this environment to mainline kernel 6.8.0-060800-generic has temporarily solved this issue and I'm able to run docker inside of lxc containers created by incus again.

@stgraber
Member

Sounds like some new AppArmor feature that's only in the Ubuntu kernel. I'll have to take a look.

If that's the case, we'll be closing this issue as we have little interest in doing special handling for distro specific kernel experiments.

@mmanjos
Author

mmanjos commented Apr 26, 2024

Fair enough - thanks for taking a peek. It does feel like they introduced a change into 24.04 at the last minute (I've been testing nightlies of 24.04 for a while now and everything has worked great with Incus, right up until this recent upgrade with the final release of Noble)

@stgraber
Member

Tests so far:

  • Issue happens with Ubuntu 24.04 container on Ubuntu 24.04 host with stock kernel
  • No issue with Ubuntu 22.04 container on Ubuntu 24.04 host with stock kernel
  • No issue with Ubuntu 24.04 container on Ubuntu 24.04 host with my own kernel build (6.8.x)

The issue with this denial:

[  355.928870] audit: type=1400 audit(1714427935.547:331): apparmor="DENIED" operation="pivotroot" class="mount" namespace="root//incus-docker_<var-lib-incus>" profile="runc" name="/var/lib/docker/overlay2/9a3dc7e25b7fb1f6b0bfed217133862d024a62f2d84c6d70b527a5fa2567360f/merged/" pid=9480 comm="runc:[2:INIT]" srcname="/var/lib/docker/overlay2/9a3dc7e25b7fb1f6b0bfed217133862d024a62f2d84c6d70b527a5fa2567360f/merged/"

is that it apparently occurs within the generated runc profile rather than our own profile, so this isn't something that can be fixed on our end.
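For readers who want to check which profile produced a denial on their own system, here is a minimal Python sketch that splits an audit line into its key=value fields. The function name `parse_audit_fields` and the regex are illustrative assumptions, not part of Incus or AppArmor tooling; the sample line is the denial quoted above.

```python
import re

# Matches key="quoted value" or key=bare_token pairs in an audit line.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_audit_fields(line: str) -> dict:
    """Return the key=value fields of a kernel audit line as a dict.

    Hypothetical helper for illustration only.
    """
    fields = {}
    for key, quoted, bare in FIELD_RE.findall(line):
        # Exactly one of the two capture groups is non-empty per match.
        fields[key] = quoted if quoted else bare
    return fields


# The denial line from this thread, copied verbatim.
sample = (
    '[  355.928870] audit: type=1400 audit(1714427935.547:331): '
    'apparmor="DENIED" operation="pivotroot" class="mount" '
    'namespace="root//incus-docker_<var-lib-incus>" profile="runc" '
    'name="/var/lib/docker/overlay2/'
    '9a3dc7e25b7fb1f6b0bfed217133862d024a62f2d84c6d70b527a5fa2567360f/merged/" '
    'pid=9480 comm="runc:[2:INIT]" '
    'srcname="/var/lib/docker/overlay2/'
    '9a3dc7e25b7fb1f6b0bfed217133862d024a62f2d84c6d70b527a5fa2567360f/merged/"'
)

fields = parse_audit_fields(sample)
print(fields["namespace"], fields["profile"], fields["operation"])
```

The `profile` field is the one that matters here: it names the profile that issued the denial, while `namespace` shows that the denial happened inside the container's AppArmor namespace.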

The apparmor runc profile only exists on Ubuntu 24.04 and is there only to deal with Ubuntu's odd default of blocking unpriv userns unless a profile says otherwise.

One way to resolve the mess is to undo what Ubuntu did:

  • echo 0 > /proc/sys/kernel/apparmor_restrict_unprivileged_userns
  • ln -s /etc/apparmor.d/runc /etc/apparmor.d/disable/ (in the container)
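Before changing anything, it may help to check whether the restriction is currently active. The following is a rough, read-only Python sketch; the helper name `userns_restriction_enabled` is hypothetical, and it assumes the sysctl is exposed at the path used in the first command above and that any value other than 0 means the restriction is on.

```python
from pathlib import Path
from typing import Optional

# Path taken from the workaround above; only present on kernels that
# carry this sysctl.
SYSCTL_PATH = Path("/proc/sys/kernel/apparmor_restrict_unprivileged_userns")


def userns_restriction_enabled(path: Path = SYSCTL_PATH) -> Optional[bool]:
    """Return True/False for the sysctl state, or None if the file is absent.

    Hypothetical helper: it only reads the value and changes nothing.
    """
    try:
        value = path.read_text().strip()
    except FileNotFoundError:
        return None
    # Assumption: "0" means disabled, anything else means enabled.
    return value != "0"


if __name__ == "__main__":
    state = userns_restriction_enabled()
    if state is None:
        print("sysctl not present on this kernel")
    else:
        print("restriction enabled" if state else "restriction disabled")
```

A `None` result would be expected on kernels that do not ship this sysctl.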

Closing, as this whole thing stems from Ubuntu-specific changes that cause wide-ranging regressions (AppArmor is now required for anything to use userns) and from their attempted fix for that situation (an AppArmor profile for runc), which in turn runs into AppArmor bugs/issues.
