Unable to configure user/dedicated workqueue with the "config-wq" command #11

Closed
ozhuraki opened this issue Oct 13, 2021 · 17 comments

Comments

@ozhuraki

# accel-config -v
3.4.2.git63991cc9
# uname -rv
5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021
# cat /proc/cmdline
[...] intel_iommu=on,sm_on
# id
uid=0(root) gid=0(root) groups=0(root)
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# accel-config disable-device dsa0
disabled 1 device(s) out of 1
# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 --threshold=15 dsa0/wq0.0
libaccfg: accfg_wq_set_threshold: wq0.0: write failed: Invalid argument
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 
# cat dsa0.conf
[
  {
    "dev":"dsa0",
    "token_limit":0,
    "groups":[
      {
        "dev":"group0.0",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.0",
            "mode":"dedicated",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.0",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.0",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group0.1",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.1",
            "mode":"dedicated",
            "size":16,
            "group_id":1,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.1",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.1",
            "group_id":1
          }
        ]
      },
      {
        "dev":"group0.2",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.2",
            "mode":"dedicated",
            "size":16,
            "group_id":2,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.2",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.2",
            "group_id":2
          }
        ]
      },
      {
        "dev":"group0.3",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.3",
            "mode":"dedicated",
            "size":16,
            "group_id":3,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.3",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.3",
            "group_id":3
          }
        ]
      }
    ]
  }
]
# 
@davejiang
Contributor

I think the 5.11 kernel may still have the driver bug that causes wq mode change to not happen. I believe that has been fixed in later kernels. Can you see if a 5.12 kernel works any better?

@ramesh-thomas
Contributor

"threshold" is only for shared wqs. Try without the "--threshold" option.
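A minimal sketch of how a configuration file could be checked for this before it is applied. It assumes the JSON layout of the dsa0.conf shown above; the function name `dedicated_wqs_with_threshold` is hypothetical and is not part of accel-config. Note that the thread's later listing after load-config reports "threshold": 0 for these queues, so the value appears to be dropped on that path, while "config-wq" rejects it.

```python
import json


def dedicated_wqs_with_threshold(config_text):
    """Return the names of dedicated work queues that set a non-zero threshold."""
    flagged = []
    for device in json.loads(config_text):
        for group in device.get("groups", []):
            for wq in group.get("grouped_workqueues", []):
                # Per the comment above, threshold only applies to shared wqs.
                if wq.get("mode") == "dedicated" and wq.get("threshold", 0):
                    flagged.append(wq["dev"])
    return flagged


# Trimmed-down stand-in for dsa0.conf (illustrative data only).
sample = """[{"dev": "dsa0", "groups": [{"dev": "group0.0", "grouped_workqueues": [
  {"dev": "wq0.0", "mode": "dedicated", "size": 16, "threshold": 15},
  {"dev": "wq0.1", "mode": "shared", "size": 16, "threshold": 15}]}]}]"""

print(dedicated_wqs_with_threshold(sample))
```

Only the dedicated queue is reported; the shared queue with the same threshold is left alone.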

@davejiang
Contributor

Thanks @ramesh-thomas. I misread the log and took the failure for an incorrect mode rather than a threshold error.

@ozhuraki
Author

@ramesh-thomas

Try without the "--threshold" option.

# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[      0x15] dsa0: Invalid group config: lack of wq or engines
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 

@davejiang

I think the 5.11 kernel may still have the driver bug that causes wq mode change to not happen. I believe that has been fixed in later kernels. Can you see if a 5.12 kernel works any better?

OK, thanks, we will try that. In principle, it works in 5.11 through "load-config".

@davejiang
Contributor

@ozhuraki you don't need to switch kernel. I misread your earlier log.

@ramesh-thomas
Contributor

Can you try rebooting or resetting by unloading and reloading idxd module? Those commands worked for me.

@ozhuraki
Author

@ramesh-thomas

Can you try rebooting or resetting by unloading and reloading idxd module? Those commands worked for me.

After rebooting, "config-wq" works, but only once.
After reloading the idxd module, it fails again:

# modprobe -r idxd
# modprobe idxd
# accel-config list
[
]
# accel-config config-wq --type=user --mode=dedicated --name="dsa0.0" --group-id=0 --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[      0x16] dsa0: Invalid group config: wq misconfigured
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 

@davejiang
Contributor

Can you try putting group id as the first parameter? I wonder if there's an ordering issue for whatever reason.
Something like:
accel-config config-wq --group-id=0 --mode=dedicated --wq-size=16 --type=user --name="mywq" --priority=10 --block-on-fault=1 dsa0/wq0.0
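As an illustration only, the suggested ordering can be captured in a small helper that assembles the argument list with the group id first and drops the threshold for dedicated queues. The helper name `config_wq_command` and its parameters are hypothetical; it uses only options that appear in this thread and it builds the command without running it.

```python
import shlex


def config_wq_command(wq_path, group_id, mode, size, wq_type, name,
                      priority, block_on_fault, threshold=None):
    """Build an accel-config config-wq argument list with --group-id first."""
    args = [
        "accel-config", "config-wq",
        f"--group-id={group_id}",
        f"--mode={mode}",
        f"--wq-size={size}",
        f"--type={wq_type}",
        f"--name={name}",
        f"--priority={priority}",
        f"--block-on-fault={block_on_fault}",
    ]
    # Threshold is only meaningful for shared work queues, so skip it otherwise.
    if mode == "shared" and threshold is not None:
        args.append(f"--threshold={threshold}")
    args.append(wq_path)
    return args


cmd = config_wq_command("dsa0/wq0.0", 0, "dedicated", 16, "user", "mywq", 10, 1,
                        threshold=15)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine with DSA hardware; whether option order actually matters to accel-config is exactly the open question here.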

@ozhuraki
Author

@davejiang

Can you try putting group id as the first parameter?

# accel-config list
[
]
# accel-config config-wq --group-id=0 --type=user --mode=dedicated --name="dsa0.0" --wq-size=16 --priority=10 --block-on-fault=1 dsa0/wq0.0
# accel-config config-engine --group-id=0 dsa0/engine0.0
# accel-config enable-device dsa0
failed in dsa0
enabled 0 device(s) out of 1
Error[      0x16] dsa0: Invalid group config: wq misconfigured
# accel-config list
[
]
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# 

@davejiang
Contributor

Can you attach the dsa0.conf? Also, given it's a dedicated wq, can you try the latest upstream kernel? 5.15-rc5 would be great. Thanks!

@ozhuraki
Author

@davejiang

Can you attach the dsa0.conf?

# cat dsa0.conf
[
  {
    "dev":"dsa0",
    "token_limit":0,
    "groups":[
      {
        "dev":"group0.0",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.0",
            "mode":"dedicated",
            "size":16,
            "group_id":0,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.0",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.0",
            "group_id":0
          }
        ]
      },
      {
        "dev":"group0.1",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.1",
            "mode":"dedicated",
            "size":16,
            "group_id":1,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.1",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.1",
            "group_id":1
          }
        ]
      },
      {
        "dev":"group0.2",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.2",
            "mode":"dedicated",
            "size":16,
            "group_id":2,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.2",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.2",
            "group_id":2
          }
        ]
      },
      {
        "dev":"group0.3",
        "tokens_reserved":0,
        "use_token_limit":0,
        "tokens_allowed":8,
        "grouped_workqueues":[
          {
            "dev":"wq0.3",
            "mode":"dedicated",
            "size":16,
            "group_id":3,
            "priority":10,
            "block_on_fault":1,
            "type":"user",
            "name":"dsa0.3",
            "threshold":15
          }
        ],
        "grouped_engines":[
          {
            "dev":"engine0.3",
            "group_id":3
          }
        ]
      }
    ]
  }
]
# 

@davejiang
Contributor

Don't you have a conf file that configures only a single wq, the same as the command line does?

Can you do a 'accel-config list -i' after you have configured with commandline? Curious what accel-config has configured so far after commandline.

@ozhuraki
Author

@davejiang

You don't have a conf file that only configures a single wq same as the commandline?

Reducing the conf to fewer than 3 workqueues doesn't work, i.e. such a configuration fails to load through "load-config".

Can you do a 'accel-config list -i' after you have configured with commandline?

# accel-config list
[
]
# accel-config list --idle | jq '.[].dev' | grep dsa
"dsa0"
"dsa1"
"dsa2"
"dsa3"
"dsa4"
"dsa5"
"dsa6"
"dsa7"
# accel-config list --idle | jq '.[0]'
{
  "dev": "dsa0",
  "token_limit": 0,
  "max_groups": 4,
  "max_work_queues": 8,
  "max_engines": 4,
  "work_queue_size": 128,
  "numa_node": 0,
  "op_cap": [
    "0x1003f03ff",
    "0",
    "0",
    "0"
  ],
  "gen_cap": "0x40915f010f",
  "version": "0x100",
  "state": "disabled",
  "max_tokens": 96,
  "max_batch_size": 1024,
  "max_transfer_size": 2147483648,
  "configurable": 1,
  "pasid_enabled": 1,
  "cdev_major": 234,
  "clients": 0,
  "groups": [
    {
      "dev": "group0.0",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_engines": [
        {
          "dev": "engine0.0",
          "group_id": 0
        }
      ]
    },
    {
      "dev": "group0.1",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_engines": [
        {
          "dev": "engine0.1",
          "group_id": 1
        }
      ]
    },
    {
      "dev": "group0.2",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_engines": [
        {
          "dev": "engine0.2",
          "group_id": 2
        }
      ]
    },
    {
      "dev": "group0.3",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_engines": [
        {
          "dev": "engine0.3",
          "group_id": 3
        }
      ]
    }
  ],
  "ungrouped workqueues": [
    {
      "dev": "wq0.0",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 1,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.1",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 1,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.2",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 1,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.3",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 1,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.4",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.5",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.6",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.7",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    }
  ]
}
# accel-config load-config -c dsa0.conf
# accel-config enable-device dsa0
enabled 1 device(s) out of 1
# accel-config enable-wq dsa0/wq0.0
enabled 1 wq(s) out of 1
# accel-config list --idle | jq '.[0]'
{
  "dev": "dsa0",
  "token_limit": 0,
  "max_groups": 4,
  "max_work_queues": 8,
  "max_engines": 4,
  "work_queue_size": 128,
  "numa_node": 0,
  "op_cap": [
    "0x1003f03ff",
    "0",
    "0",
    "0"
  ],
  "gen_cap": "0x40915f010f",
  "version": "0x100",
  "state": "enabled",
  "max_tokens": 96,
  "max_batch_size": 1024,
  "max_transfer_size": 2147483648,
  "configurable": 1,
  "pasid_enabled": 1,
  "cdev_major": 234,
  "clients": 0,
  "groups": [
    {
      "dev": "group0.0",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_workqueues": [
        {
          "dev": "wq0.0",
          "mode": "dedicated",
          "size": 16,
          "group_id": 0,
          "priority": 10,
          "block_on_fault": 1,
          "max_batch_size": 1024,
          "max_transfer_size": 2147483648,
          "cdev_minor": 0,
          "type": "user",
          "name": "dsa0.0",
          "threshold": 0,
          "ats_disable": 0,
          "state": "enabled",
          "clients": 0
        }
      ],
      "grouped_engines": [
        {
          "dev": "engine0.0",
          "group_id": 0
        }
      ]
    },
    {
      "dev": "group0.1",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_workqueues": [
        {
          "dev": "wq0.1",
          "mode": "dedicated",
          "size": 16,
          "group_id": 1,
          "priority": 10,
          "block_on_fault": 1,
          "max_batch_size": 1024,
          "max_transfer_size": 2147483648,
          "type": "user",
          "name": "dsa0.1",
          "threshold": 0,
          "ats_disable": 0,
          "state": "disabled",
          "clients": 0
        }
      ],
      "grouped_engines": [
        {
          "dev": "engine0.1",
          "group_id": 1
        }
      ]
    },
    {
      "dev": "group0.2",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_workqueues": [
        {
          "dev": "wq0.2",
          "mode": "dedicated",
          "size": 16,
          "group_id": 2,
          "priority": 10,
          "block_on_fault": 1,
          "max_batch_size": 1024,
          "max_transfer_size": 2147483648,
          "type": "user",
          "name": "dsa0.2",
          "threshold": 0,
          "ats_disable": 0,
          "state": "disabled",
          "clients": 0
        }
      ],
      "grouped_engines": [
        {
          "dev": "engine0.2",
          "group_id": 2
        }
      ]
    },
    {
      "dev": "group0.3",
      "tokens_reserved": 0,
      "use_token_limit": 0,
      "tokens_allowed": 8,
      "traffic_class_a": 0,
      "traffic_class_b": 1,
      "grouped_workqueues": [
        {
          "dev": "wq0.3",
          "mode": "dedicated",
          "size": 16,
          "group_id": 3,
          "priority": 10,
          "block_on_fault": 1,
          "max_batch_size": 1024,
          "max_transfer_size": 2147483648,
          "type": "user",
          "name": "dsa0.3",
          "threshold": 0,
          "ats_disable": 0,
          "state": "disabled",
          "clients": 0
        }
      ],
      "grouped_engines": [
        {
          "dev": "engine0.3",
          "group_id": 3
        }
      ]
    }
  ],
  "ungrouped workqueues": [
    {
      "dev": "wq0.4",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.5",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.6",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    },
    {
      "dev": "wq0.7",
      "mode": "shared",
      "size": 0,
      "priority": 0,
      "block_on_fault": 0,
      "max_batch_size": 1024,
      "max_transfer_size": 2147483648,
      "type": "none",
      "name": "",
      "threshold": 0,
      "ats_disable": 0,
      "state": "disabled",
      "clients": 0
    }
  ]
}
# 
# accel-config list --idle | jq '.[].dev' | grep dsa
"dsa0"
"dsa1"
"dsa2"
"dsa3"
"dsa4"
"dsa5"
"dsa6"
"dsa7"

@davejiang
Contributor

I find the engines all pre-assigned to each group to be strange.
Can you reboot, run that single accel-config config-wq, and then do the filtered accel-config list --idle please? Thanks!
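One way to inspect the `accel-config list --idle` output for this is sketched below. It assumes the JSON layout shown in the listings above and treats a group that has engines but no work queues (or the reverse) as a candidate for the "lack of wq or engines" error; that interpretation is an assumption, not something confirmed in this thread. The names `group_summary` and `incomplete_groups` are hypothetical.

```python
import json


def group_summary(listing_text, device="dsa0"):
    """Map each group of `device` to the work queues and engines assigned to it."""
    summary = {}
    for dev in json.loads(listing_text):
        if dev.get("dev") != device:
            continue
        for group in dev.get("groups", []):
            summary[group["dev"]] = {
                "wqs": [wq["dev"] for wq in group.get("grouped_workqueues", [])],
                "engines": [e["dev"] for e in group.get("grouped_engines", [])],
            }
    return summary


def incomplete_groups(summary):
    """Groups that have work queues without engines, or engines without work queues."""
    return sorted(name for name, g in summary.items()
                  if bool(g["wqs"]) != bool(g["engines"]))


# Trimmed-down stand-in for `accel-config list --idle` output (illustrative data).
listing = """[{"dev": "dsa0", "groups": [
  {"dev": "group0.0", "grouped_workqueues": [{"dev": "wq0.0"}],
   "grouped_engines": [{"dev": "engine0.0"}]},
  {"dev": "group0.1", "grouped_engines": [{"dev": "engine0.1"}]}]}]"""

print(incomplete_groups(group_summary(listing)))
```

On a real system the input would come from saving the output of `accel-config list --idle` right after the single `config-wq` call.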

@mythi

mythi commented Oct 19, 2021

@davejiang

Can you reboot

We find the need to reboot a bit strange. Would rmmod/modprobe of idxd be enough, as suggested by @ramesh-thomas earlier?

Can you try rebooting or resetting by unloading and reloading idxd module?

@ozhuraki
Author

@davejiang

Can you reboot

Unfortunately, the machine has multiple users, so rebooting is problematic. Resetting by unloading and reloading the idxd module was already tried, see #11 (comment). Are there any other ways to reset the DSA HW?

While investigating this, we observed earlier that "config-wq", "config-engine", "enable-device", "enable-wq", and "disable..." work only a limited number of times after a reboot. This was reproducible on multiple physical setups.

Since the identical configuration can be successfully loaded and enabled through "load-config", is the problem in the order in which accel-config sets the sysfs entries for "config-wq" / "config-engine" / "enable-device"?

@davejiang
Contributor

You can unload the module. But I really want a clean slate to see whether this is a real problem or whether something else caused it. Also, the 5.11 kernel is pretty old, considering 5.15 is about to be released. The 5.11 kernel probably has a lot of bugs that are fixed in later kernels. Unless you can reproduce the bug on the latest upstream kernel, there isn't much we can do. BTW, what silicon stepping are you using?


4 participants