error when adding gpu device in a container #3950

Closed
sfabris opened this issue Oct 17, 2017 · 25 comments

@sfabris sfabris commented Oct 17, 2017

Required information

  • Distribution: Arch linux
  • The output of "lxc info":

    config:
      core.https_address: '[::]:8443'
    api_extensions:
    - storage_zfs_remove_snapshots
    - container_host_shutdown_timeout
    - container_syscall_filtering
    - auth_pki
    - container_last_used_at
    - etag
    - patch
    - usb_devices
    - https_allowed_credentials
    - image_compression_algorithm
    - directory_manipulation
    - container_cpu_time
    - storage_zfs_use_refquota
    - storage_lvm_mount_options
    - network
    - profile_usedby
    - container_push
    - container_exec_recording
    - certificate_update
    - container_exec_signal_handling
    - gpu_devices
    - container_image_properties
    - migration_progress
    - id_map
    - network_firewall_filtering
    - network_routes
    - storage
    - file_delete
    - file_append
    - network_dhcp_expiry
    - storage_lvm_vg_rename
    - storage_lvm_thinpool_rename
    - network_vlan
    - image_create_aliases
    - container_stateless_copy
    - container_only_migration
    - storage_zfs_clone_copy
    - unix_device_rename
    - storage_lvm_use_thinpool
    - storage_rsync_bwlimit
    - network_vxlan_interface
    - storage_btrfs_mount_options
    - entity_description
    - image_force_refresh
    - storage_lvm_lv_resizing
    - id_map_base
    - file_symlinks
    - container_push_target
    - network_vlan_physical
    - storage_images_delete
    - container_edit_metadata
    - container_snapshot_stateful_migration
    - storage_driver_ceph
    - storage_ceph_user_name
    - resource_limits
    - storage_volatile_initial_source
    - storage_ceph_force_osd_reuse
    - storage_block_filesystem_btrfs
    - resources
    - kernel_limits
    api_status: stable
    api_version: "1.0"
    auth: trusted
    public: false
    environment:
      addresses:
      - 192.168.10.30:8443
      - 10.32.94.1:8443
      - '[fd42:1eb7:5949:38cc::1]:8443'
      architectures:
      - x86_64
      - i686
      certificate: |
        -----BEGIN CERTIFICATE-----
        cut
        -----END CERTIFICATE-----
      certificate_fingerprint: dcb1fa0f6b41e3efe4445c53fc7fdc4ad31252466a35c85515a5d6aecb1dbf9d
      driver: lxc
      driver_version: 2.1.0
      kernel: Linux
      kernel_architecture: x86_64
      kernel_version: 4.13.5-1-userns
      server: lxd
      server_pid: 23073
      server_version: "2.18"
      storage: btrfs
      storage_version: "4.13"
  • Storage backend in use:

lxc storage list
+---------+-------------+--------+--------------------------------+---------+
|  NAME   | DESCRIPTION | DRIVER |             SOURCE             | USED BY |
+---------+-------------+--------+--------------------------------+---------+
| default |             | btrfs  | /var/lib/lxd/disks/default.img | 3       |
+---------+-------------+--------+--------------------------------+---------+

Issue description and steps to reproduce

Trying to add a gpu device to a fresh container fails with an error:

  1. lxc launch ubuntu:17.04 ubuntu
  2. lxc config device add ubuntu gpu gpu
    result --> error: strconv.Atoi: parsing "Force": invalid syntax

Information to attach

  • Container log (lxc info NAME --show-log)
    Name: ubuntu
    Remote: unix://
    Architecture: x86_64
    Created: 2017/10/17 18:43 UTC
    Status: Running
    Type: persistent
    Profiles: default
    Pid: 23364
    Ips:
      eth0: inet    10.32.94.39   veth1E6PCA
      eth0: inet6   fd42:1eb7:5949:38cc:216:3eff:fe39:d412   veth1E6PCA
      eth0: inet6   fe80::216:3eff:fe39:d412   veth1E6PCA
      lo:   inet    127.0.0.1
      lo:   inet6   ::1
    Resources:
      Processes: 31
      CPU usage:
        CPU usage (in seconds): 4
      Memory usage:
        Memory (current): 196.58MB
        Memory (peak): 305.61MB
      Network usage:
        eth0:
          Bytes received: 9.31kB
          Bytes sent: 2.96kB
          Packets received: 65
          Packets sent: 29
        lo:
          Bytes received: 588B
          Bytes sent: 588B
          Packets received: 6
          Packets sent: 6

Log:

        lxc 20171017184404.604 WARN     lxc_monitor - monitor.c:lxc_monitor_fifo_send:111 - Failed to open fifo to send message: No such file or directory.
        lxc 20171017184404.604 WARN     lxc_monitor - monitor.c:lxc_monitor_fifo_send:111 - Failed to open fifo to send message: No such file or directory.
        lxc 20171017184404.668 WARN     lxc_cgfsng - cgroups/cgfsng.c:chown_cgroup_wrapper:1492 - Error chmoding /sys/fs/cgroup/unified//lxc/ubuntu: No such file or directory
        lxc 20171017184404.738 WARN     lxc_monitor - monitor.c:lxc_monitor_fifo_send:111 - Failed to open fifo to send message: No such file or directory.
        lxc 20171017184404.738 WARN     lxc_monitor - monitor.c:lxc_monitor_fifo_send:111 - Failed to open fifo to send message: No such file or directory.
  • Container configuration
    architecture: x86_64
    config:
      image.architecture: amd64
      image.description: ubuntu 17.04 amd64 (release) (20171011)
      image.label: release
      image.os: ubuntu
      image.release: zesty
      image.serial: "20171011"
      image.version: "17.04"
      volatile.base_image: 4a38bd884a643a837d49ec376665fe45f1cdb0ea242762ba9b5017c0ba4a5774
      volatile.eth0.hwaddr: 00:16:3e:39:d4:12
      volatile.eth0.name: eth0
      volatile.idmap.base: "0"
      volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
      volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":65536}]'
      volatile.last_state.power: RUNNING
    devices:
      eth0:
        nictype: bridged
        parent: lxdbr0
        type: nic
      root:
        path: /
        pool: default
        type: disk
    ephemeral: false
    profiles:
    - default
    stateful: false
    description: ""

  • Output of the client with --debug
    lxc config device add ubuntu gpu gpu --debug

--- cut ---
DBUG[10-17|21:18:22] Got operation from LXD
DBUG[10-17|21:18:22]
{
    "id": "88bfc9a4-43d4-4f9a-9ca7-d415303f572c",
    "class": "task",
    "created_at": "2017-10-17T21:18:22.017934326+02:00",
    "updated_at": "2017-10-17T21:18:22.017934326+02:00",
    "status": "Running",
    "status_code": 103,
    "resources": {
        "containers": [
            "/1.0/containers/ubuntu"
        ]
    },
    "metadata": null,
    "may_cancel": false,
    "err": ""
}
DBUG[10-17|21:18:22] Sending request to LXD etag= method=GET url=http://unix.socket/1.0/operations/88bfc9a4-43d4-4f9a-9ca7-d415303f572c
DBUG[10-17|21:18:22] Got response struct from LXD
DBUG[10-17|21:18:22]
{
    "id": "88bfc9a4-43d4-4f9a-9ca7-d415303f572c",
    "class": "task",
    "created_at": "2017-10-17T21:18:22.017934326+02:00",
    "updated_at": "2017-10-17T21:18:22.017934326+02:00",
    "status": "Running",
    "status_code": 103,
    "resources": {
        "containers": [
            "/1.0/containers/ubuntu"
        ]
    },
    "metadata": null,
    "may_cancel": false,
    "err": ""
}
error: strconv.Atoi: parsing "Force": invalid syntax

  • Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)
    --- cut ---
    timestamp: 2017-10-17T21:22:30.825852583+02:00
    type: logging

    metadata:
      class: task
      created_at: 2017-10-17T21:22:30.816935982+02:00
      err: 'strconv.Atoi: parsing "Force": invalid syntax'
      id: fb6b355e-73f5-42dd-bde2-99044d4cec11
      may_cancel: false
      metadata: null
      resources:
        containers:
        - /1.0/containers/ubuntu
      status: Failure
      status_code: 400
      updated_at: 2017-10-17T21:22:30.816935982+02:00
    timestamp: 2017-10-17T21:22:30.826898669+02:00
    type: operation

@brauner brauner commented Oct 17, 2017

Can you please show:

ls -al /dev/dri/

@brauner brauner commented Oct 17, 2017

And, if you have an Nvidia card, please show:

cat /proc/driver/nvidia/gpus/<card-id>/information

for each <card-id>.

@brauner brauner added the Incomplete label Oct 17, 2017

@sfabris sfabris commented Oct 17, 2017

Here is the information:

ls -al /dev/dri/
[sf@pongo ~]$ ls -al /dev/dri/
totale 0
drwxr-xr-x   3 root root     80 17 ott 23.00 .
drwxr-xr-x  22 root root   3800 17 ott 23.00 ..
drwxr-xr-x   2 root root     60 17 ott 23.00 by-path
crw-rw-rw-+  1 root root 226, 0 17 ott 23.00 card0
[sf@pongo ~]$ cat /proc/driver/nvidia/gpus/0000\:04\:00.0/information
Model:           GeForce GTX 650
IRQ:             44
GPU UUID:        GPU-e3f822b9-8c5f-95a5-b379-8dfdeefd7758
Video BIOS:      80.07.35.00.80
Bus Type:        PCIe
DMA Size:        40 bits
DMA Mask:        0xffffffffff
Bus Location:    0000:04:00.0

@brauner brauner commented Oct 18, 2017

Uhm, yeah. So there's no easy way for LXD here to figure out the correspondence between the Nvidia card and the id under /dev/dri, since the Device Minor entry in the information file is missing. We can add some code to work around your specific case, as there's only one card under /dev/dri, but there's no guarantee that it isn't actually a different card. So this becomes guesswork.
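
For context, the lookup in question boils down to reading /proc/driver/nvidia/gpus/<card-id>/information and extracting the value of the "Device Minor:" line. A rough sketch of that parsing (not LXD's actual implementation), which scans the file line by line and fails explicitly when the entry is missing:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// deviceMinor extracts the "Device Minor:" value from an Nvidia information
// file such as /proc/driver/nvidia/gpus/0000:04:00.0/information.
func deviceMinor(path string) (int, error) {
	buf, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}

	for _, line := range strings.Split(string(buf), "\n") {
		if !strings.HasPrefix(line, "Device Minor:") {
			continue
		}
		value := strings.TrimSpace(strings.TrimPrefix(line, "Device Minor:"))
		return strconv.Atoi(value)
	}

	return -1, fmt.Errorf("no Device Minor entry in %s", path)
}

func main() {
	minor, err := deviceMinor("/proc/driver/nvidia/gpus/0000:04:00.0/information")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("device minor:", minor)
}

With older drivers that omit the Device Minor line, a parser like this can only report the entry as missing; mapping the card to a /dev/dri node would still be guesswork.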

@stgraber stgraber commented Oct 18, 2017

@brauner can't we figure it out based on the PCI path?

@sfabris sfabris commented Oct 18, 2017

Turns out it was related to the nvidia drivers...
Once updated to the most recent stable drivers, this is what I get:

cat /proc/driver/nvidia/gpus/0000\:04\:00.0/information
Model:           GeForce GTX 650
IRQ:             45
GPU UUID:        GPU-e3f822b9-8c5f-95a5-b379-8dfdeefd7758
Video BIOS:      80.07.35.00.80
Bus Type:        PCIe
DMA Size:        40 bits
DMA Mask:        0xffffffffff
Bus Location:    0000:04:00.0
Device Minor:    0

and now
lxc config device add ubuntu gpu gpu
works

@sfabris sfabris closed this Oct 18, 2017

@stgraber stgraber commented Oct 18, 2017

Ok, we should probably improve the error message in that case though :)

@sfabris sfabris commented Oct 18, 2017

@stgraber yes, that would be nice 👍

@castleguarders castleguarders commented Sep 27, 2018

This issue shows up in a different form in my configuration, and I suspect it will show up whenever the newer nvidia driver is used. It happens when I try to start my nvidia containers after upgrading to Ubuntu 18.04, using nvidia CUDA 10 with the packaged drivers. The "Device Minor" entry is present in the gpu information file, but there is a new entry after that line. It's almost as if the code expects the last line to be Device Minor, rather than being able to find it anywhere in the file?

lxc start dlcontainer
Error: strconv.Atoi: parsing "0\nBlacklisted:\t": invalid syntax
Try lxc info --show-log dlcontainer for more info

lxc --version
3.0.1

ls -l /dev/dri
total 0
drwxr-xr-x 2 root root 80 Sep 26 21:01 by-path
crw-rw----+ 1 root video 226, 0 Sep 26 21:01 card0
crw-rw----+ 1 root video 226, 128 Sep 26 21:01 renderD128

tree /dev/dri
/dev/dri
├── by-path
│   ├── pci-0000:03:00.0-card -> ../card0
│   └── pci-0000:03:00.0-render -> ../renderD128
├── card0
└── renderD128

$ cat /proc/driver/nvidia/gpus/0000:03:00.0/information
Model: GeForce GTX 1080 Ti
IRQ: 42
GPU UUID: GPU-8a377998-43e6-438c-2ead-a3eb05bff08d
Video BIOS: 86.02.39.00.90
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:03:00.0
Device Minor: 0
Blacklisted: No


cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 410.48 Thu Sep 6 06:36:33 CDT 2018
GCC version: gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)

@stgraber stgraber commented Sep 27, 2018

Ah, let me try to get our test VM on 410

@stgraber stgraber commented Sep 27, 2018

@castleguarders what LXD version?

@castleguarders castleguarders commented Sep 27, 2018

lxc --version
3.0.1
lxd --version
3.0.1

@stgraber stgraber commented Sep 27, 2018

Nevermind, you said 3.0.1, so I think it's a bug we fixed already.

@stgraber stgraber commented Sep 27, 2018

Because 410 is working fine with 3.5 here and I suspect 3.0.2 has the same fix

@stgraber stgraber commented Sep 27, 2018

@castleguarders Can you show lxc config show --expanded NAME for your container?

@castleguarders castleguarders commented Sep 27, 2018

It happens on several containers; here are two examples.
lxc config show --expanded dlcontainer

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20180617)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20180617"
  image.version: "18.04"
  security.privileged: "true"
  volatile.base_image: b190d5ec0c537468465e7bd122fe127d9f3509e3a09fb699ac33b0c5d4fe050f
  volatile.eth0.hwaddr: 00:16:3e:f3:04:1b
  volatile.eth0.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
devices:
  eth0:
    nictype: bridged
    parent: lxdbr0
    type: nic
  gpu:
    type: gpu
  nvidia-modeset:
    path: /dev/nvidia-modeset
    type: unix-char
  nvidia-uvm:
    path: /dev/nvidia-uvm
    type: unix-char
  nvidia0:
    path: /dev/nvidia0
    type: unix-char
  nvidiactl:
    path: /dev/nvidiactl
    type: unix-char
  root:
    path: /
    pool: default
    type: disk
  shareName:
    path: /sharedDownloads
    source: /home/ssss/Downloads
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

=======
lxc config show --expanded roscontainer1

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 16.04 LTS amd64 (release) (20180522)
  image.label: release
  image.os: ubuntu
  image.release: xenial
  image.serial: "20180522"
  image.version: "16.04"
  security.privileged: "true"
  volatile.base_image: 08bbf441bb737097586e9f313b239cecbba96222e58457881b3718c45c17e074
  volatile.eth0.hwaddr: 00:16:3e:7e:8a:2f
  volatile.eth0.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
devices:
  eth0:
    nictype: bridged
    parent: lxdbr0
    type: nic
  gpu:
    type: gpu
  nvidia-modeset:
    path: /dev/nvidia-modeset
    type: unix-char
  nvidia-uvm:
    path: /dev/nvidia-modeset
    type: unix-char
  nvidia-uvm-tools:
    path: /dev/nvidia-uvm-tools
    type: unix-char
  nvidia0:
    path: /dev/nvidia0
    type: unix-char
  nvidiactl:
    path: /dev/nvidiactl
    type: unix-char
  root:
    path: /
    pool: default
    type: disk
  shareName:
    path: /sharedDownloads
    source: /home/ssss/Downloads
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

@stgraber stgraber commented Sep 27, 2018

Ok, I'm looking at a potential related failure with LXD 3.5 now

@stgraber stgraber commented Sep 27, 2018

Tracked it down, so unfortunately 3.0.2 won't help you either...

@stgraber stgraber commented Sep 27, 2018

#5080 fixes this issue

@castleguarders castleguarders commented Sep 27, 2018

Looking at the current code, SplitN should have handled this case and parts[0] should have just been "0" instead of "0\nBlacklisted:\t". Unless the code I'm looking at is newer than 3.0.1...

idx := strings.Index(strBuf, "Device Minor:")
if idx != -1 {
	idx += len("Device Minor:")
	strBuf = strBuf[idx:]
	strBuf = strings.TrimSpace(strBuf)
	// With a count of 1, SplitN never splits, so parts[0] is the whole
	// remaining buffer ("0\nBlacklisted:\t..."), not just the minor number.
	parts := strings.SplitN(strBuf, "\n", 1)
	_, err = strconv.Atoi(parts[0])
	if err == nil {
		return parts[0], nil
	}
}

@stgraber stgraber commented Sep 27, 2018

The SplitN was wrong; the number of parts (the last argument) should be 2, not 1. Otherwise it effectively does nothing.
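
For reference, strings.SplitN with a count of 1 returns the input unchanged as a single element, while a count of 2 splits on the first separator only. A small standalone illustration, using a hypothetical string modeled on the error above:

package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "0\nBlacklisted:      No"

	// Count 1: no split happens, so parts[0] is the whole string,
	// which is what strconv.Atoi then chokes on.
	fmt.Printf("%q\n", strings.SplitN(s, "\n", 1))

	// Count 2: split on the first newline only, so parts[0] is just "0".
	fmt.Printf("%q\n", strings.SplitN(s, "\n", 2))
}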

@castleguarders castleguarders commented Sep 27, 2018

Ah, so the count of 1 in the code I was looking at is the problem; the fix replaces it with 2. That explains the issue.

parts := strings.SplitN(strBuf, "\n", 1) // broken
parts := strings.SplitN(strBuf, "\n", 2) // fixed

@castleguarders castleguarders commented Sep 27, 2018

Thanks, any suggested workarounds other than a local build of lxd? Waiting for an upstream fix to land here will probably take a while.

@stgraber stgraber commented Sep 27, 2018

Well, there is the ugly way out of this :)

  • cat /proc/driver/nvidia/gpus/0000:03:00.0/information | grep -v Blacklisted > /tmp/.nvidia-information
  • mount --bind /tmp/.nvidia-information /proc/driver/nvidia/gpus/0000:03:00.0/information
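
Note that the bind mount doesn't survive a reboot (and /tmp is typically cleared at boot), so the two commands above need to be re-run after each reboot until a fixed LXD build is in use.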

@RanMaosong RanMaosong commented Nov 13, 2019

Ah, so the count of 1 in the code I was looking at is the problem; the fix replaces it with 2. That explains the issue.

parts := strings.SplitN(strBuf, "\n", 1) // broken
parts := strings.SplitN(strBuf, "\n", 2) // fixed

Where is this file?
