
[storage]: Front-end Attach / Detach Namespace Proto-Bufs #177

Open
mkalderon opened this issue Oct 31, 2022 · 23 comments · May be fixed by #237
Labels
Storage APIs or code related to storage area

Comments

@mkalderon
Contributor

Need to add support for attaching / detaching controllers from a namespace.
We can take this; just raising it as an issue in case someone has prior thoughts on this.
@glimchb
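
For concreteness, here is a minimal sketch of the kind of attach/detach call being requested, written in Go against a purely hypothetical client interface (none of these names are existing OPI definitions):

```go
package frontend

import "context"

// AttachNamespaceRequest is a hypothetical request: make an existing
// namespace visible through one specific front-end controller.
type AttachNamespaceRequest struct {
	ControllerID string // front-end controller, e.g. one PF/VF
	NamespaceID  string // namespace already created in the subsystem
}

// FrontendNvmeClient is an assumed client shape for the proposed RPCs.
// Attach adds the namespace to the controller's active NS list and
// Detach removes it; both would raise a Namespace Attribute Changed
// async event so the host re-enumerates.
type FrontendNvmeClient interface {
	AttachNamespace(ctx context.Context, req *AttachNamespaceRequest) error
	DetachNamespace(ctx context.Context, req *AttachNamespaceRequest) error
}

// Example: expose namespace "ns-1" through controller "ctrl-vf3" only.
func attachExample(ctx context.Context, c FrontendNvmeClient) error {
	return c.AttachNamespace(ctx, &AttachNamespaceRequest{
		ControllerID: "ctrl-vf3",
		NamespaceID:  "ns-1",
	})
}
```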

@glimchb glimchb added the Storage APIs or code related to storage area label Oct 31, 2022
@glimchb
Member

glimchb commented Oct 31, 2022

Isn't NVMeNamespaceCreate the same as Attach? We can rename it if that helps...

@glimchb
Member

glimchb commented Oct 31, 2022

I opened #178 as well for enable/disable NS...

@mkalderon
Contributor Author

Not quite. When a namespace is created it isn't attached to a controller. We're providing APIs to create subsystems/namespaces/controllers, etc., so it's not like the back-end, where you get this information along with the attachment information.
We need a way to attach a namespace to a controller. The same namespace can also be attached to several controllers.

Having said that, looking at your additional issue with enable/disable, maybe it's somehow related and provides another way to make this attachment. In enable/disable, are you referring to enabling a namespace for a specific controller?

@glimchb
Member

glimchb commented Oct 31, 2022

In enable/disable, are you referring to enabling a namespace for a specific controller?

yes

@benlwalker

benlwalker commented Oct 31, 2022

A controller is a "view" or a "session" between a host and a subsystem. A subsystem contains the namespaces. I don't think it makes any sense in the NVMe model to attach a namespace to a controller. Can you outline the use case here? Is this for private namespaces? Or is this some dual-port device case where the PCI endpoints see different namespaces?

Also, it does make sense for NVMe over fabrics to set up visibility rules based on the hostnqn via a call we don't have in the API yet. But I don't think that's what this issue is talking about.

@glimchb
Member

glimchb commented Oct 31, 2022

If we have 1 subsystem with 1 namespace and 10 controllers, each on a separate PF/VF, that we want to pass through to 10 VMs...
how would you describe the operation of exposing/attaching a namespace to a controller (PF/VF)?

@benlwalker

The best approach is to create 1 subsystem containing 1 namespace for each PF/VF. You do not want the VMs to share state of any kind, and if they share a subsystem there is shared state in there, even if you filter out shared namespaces from view.
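
As a sketch of that layout (identifiers hypothetical), the 10-VM example above would become ten fully isolated triples rather than one shared subsystem:

```go
package frontend

// vfLayout pairs each PF/VF with its own subsystem and namespace, so
// no two VMs share any subsystem-level state (NSSR, NS management,
// NSID space).
type vfLayout struct {
	Subsystem  string
	Namespace  string
	Controller string // bound to exactly one PF/VF
}

// One entry per VM; each subsystem contains exactly one namespace,
// backed by the same (or a different) back-end volume.
var layout = []vfLayout{
	{Subsystem: "subsys-0", Namespace: "ns-0", Controller: "vf-0"},
	{Subsystem: "subsys-1", Namespace: "ns-1", Controller: "vf-1"},
	// ... through vf-9 for the 10 VMs in the example above
}
```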

@mkalderon
Contributor Author

There may be cases where you would want the host "view" to match a backend, with shared namespaces, etc.
I think the proto-bufs should provide this functionality; a prescribed "best approach" limits the administrator.

@benlwalker

Unfortunately, that's in direct opposition to the defined OPI view of separate "front", "middle", and "back" ends. The front-end is entirely virtualized and emulated, and back/middle-end volumes are added as namespaces. The front-end and the host never see anything about back-end subsystems or controllers, and in fact the back-end may not even be NVMe. Things like NVMe admin commands that come into the front-end are not simply passed through to some "middle" or "back" end object, because there isn't a middle/back-end object that corresponds to a controller/subsystem - only volumes.

In order to think through solutions, I think we'll need a lot more detail on the motivating use case for the host "view" to match some backend. Are you able to provide that?

@glimchb
Member

glimchb commented Oct 31, 2022

The best approach is to create 1 subsystem containing 1 namespace for each PF/VF. You do not want the VMs to share state of any kind, and if they share a subsystem there is shared state in there, even if you filter out shared namespaces from view.

I'm not arguing about which is the best approach. I think the API should allow sharing a subsystem and not limit the user who wants to do it. Even if we don't see the use case today, it is reasonable to allow this.

@sburla-marvell
Contributor

A subsystem can have namespaces that are not attached to any controller, and similarly a subsystem can have a namespace that is attached to multiple controllers. This is all still the front-end view (nothing to do with the backend). This means that there should be an ability to attach/detach a namespace to/from a controller. There is also the ability to attach or detach namespaces from NVMe controllers using an async event at runtime (Namespace Attribute Changed). This event causes the host to re-enumerate the namespaces. This is a feature most of our customers use.
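
For reference, the runtime mechanism referred to here is the standard NVMe Asynchronous Event Notification; a sketch of the relevant constants (values from the NVMe base specification):

```go
package frontend

// NVMe Asynchronous Event completion fields (NVMe base spec): the
// device posts a Notice-type event, the host reads the Changed
// Namespace List log page, then re-issues Identify to pick up the
// namespace attribute change.
const (
	aenTypeNotice        = 0x2  // Asynchronous Event Type: Notice
	aenInfoNsAttrChanged = 0x00 // Notice: Namespace Attribute Changed
	logPageChangedNsList = 0x04 // Changed Namespace List log page ID
)
```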

@benlwalker

I think there may be some mixing of very different concepts going on here. These API definitions are not doing the equivalent of the NVMe Namespace Attach/Detach commands. Those NVMe commands look at a list of namespaces already attached to a subsystem and then add a reference, via an NSID, to a controller's view. That's not OPI's job - OPI is the thing that put the namespaces into the subsystem to begin with.

So again, can you clarify what the actual use case is for this feature? An example flow and justification of some kind, even very high level? I'm certainly not advocating that we don't support all sorts of dynamic hot insert/removal of namespaces with all of the appropriate async events, nor am I advocating that we preclude support for any specific NVMe namespace management commands. OPI is definitely going to have to support all of that stuff. I'm really focused just on the need to set up mappings between controllers and namespaces during provisioning, based on the controller identity itself (as opposed to based on a hostnqn), and only in the front-end for NVMe devices. It's a very narrow thing I'm advocating against currently.

The reason I'd like to avoid adding support for this is relatively straightforward:

  • Anyone sharing subsystems between tenants has to be very careful not to expose shared state. You have to disable NSSR, likely disable namespace management (because the NSIDs are global and you can run into conflicts), and possibly other things. Especially dangerous are new things added to the spec later on.
  • There's an easy alternative design (make more subsystems) that's safer and better.
  • We should try to avoid adding APIs just because the spec supports it or we'll end up with 100+ calls and too much complexity. We should add calls whenever there's a real use case for an xPU that someone presents. I don't even propose we be very strict or scrutinize the use cases all that closely - just some reasonable justification is probably a good enough bar for me.

@sburla-marvell
Contributor

This is not about the optional namespace management support from the host side. I am not actually sure we should let the host do namespace management in this case anyway, as we want the DPU to be in control. This is about how the DPU provisions new namespaces to a controller at runtime. Unfortunately, the only way the spec supports this is through the host side (using namespace management commands). But the same mechanism lets the DPU side also do this by issuing an async event. The high-level use case for this feature is to provide additional storage to a tenant without the need to burn another VF/PF resource.

@benlwalker

Can't what you're describing be accomplished by adding a namespace to a subsystem instead of adding it to the controller? I believe it can if the default behavior of a controller is to automatically see all namespaces in the subsystem.

This is really the core of this discussion. What you outlined above can be done with the calls that already exist. What's the new/extra functionality needed that warrants an extra step to attach/detach from specific controllers?

@sburla-marvell
Contributor

A subsystem can have multiple controllers and multiple namespaces. Not every controller in a subsystem needs to have every namespace in the subsystem active in its CNS:02 list. Even if you added a namespace to a subsystem, the host is not able to see it right away. It would not be able to enumerate the new namespace until it issues the Identify (CNS:02) command again (which generally means a reboot/NVMe driver reload). The above API lets the host see a new namespace that is added to the subsystem become active for a controller at runtime without having to reload the driver.

@benlwalker

A subsystem can have multiple controllers and multiple namespaces. Not every controller in a subsystem needs to have every namespace in the subsystem active in its CNS:02 list.

My line of questioning isn't whether you can or cannot do this. My line of questioning has been focused on why you want to do this. I've provided some reasons above that you probably don't, and instead should have made two subsystems. What's the use case where this is the superior solution to using two subsystems?

Even if you added a namespace to a subsystem, the host is not able to see it right away. It would not be able to enumerate the new namespace until it issues the Identify (CNS:02) command again (which generally means a reboot/NVMe driver reload). The above API lets the host see a new namespace that is added to the subsystem become active for a controller at runtime without having to reload the driver.

This may be the core misconception. I'm advocating that all namespaces in a subsystem (for the front-end) are surfaced through all associated controllers immediately and automatically when they're added to or removed from the subsystem. So adding a namespace does in fact generate the asynchronous event notifications, which causes the driver to issue the identify namespace commands and see the new namespace without being reloaded. Is there some problem with this behavior at least being the default?
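
A sketch of that default behavior, using the same kind of hypothetical client as above (names assumed): the caller adds the namespace at the subsystem level, and the device raises the async event on every associated controller:

```go
package frontend

import "context"

// CreateNamespaceRequest is hypothetical: namespaces are scoped to a
// subsystem, never to an individual controller.
type CreateNamespaceRequest struct {
	SubsystemID string // front-end subsystem to add the namespace to
	VolumeID    string // back/middle-end volume to expose
}

type SubsystemClient interface {
	CreateNamespace(ctx context.Context, req *CreateNamespaceRequest) error
}

// addNamespace exposes a volume to every controller in the subsystem.
// The device is expected to raise a Namespace Attribute Changed event
// on each controller, so hosts re-enumerate without a driver reload.
func addNamespace(ctx context.Context, c SubsystemClient, subsys, volume string) error {
	return c.CreateNamespace(ctx, &CreateNamespaceRequest{
		SubsystemID: subsys,
		VolumeID:    volume,
	})
}
```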

@sburla-marvell
Contributor

I'm advocating that all namespaces in a subsystem (for the front-end) are surfaced through all associated controllers immediately and automatically when they're added to or removed from the subsystem. So adding a namespace does in fact generate the asynchronous event notifications, which causes the driver to issue the identify namespace commands and see the new namespace without being reloaded. Is there some problem with this behavior at least being the default?

Ok. In that case, the only comment I have is that it is more restrictive than what the spec allows for.

@glimchb
Member

glimchb commented Nov 2, 2022

I agree we can't / shouldn't implement the entire spec. If there is a real need, we can add new RPCs and new fields in existing RPCs. But I wouldn't want to go in the direction where we just add more things without clear value... it will be hard to support long term.

@sburla-marvell
Contributor

Also, I wonder how native multipathing would be supported for the front end if every host controller ends up with its own subsystem, as is being advocated here. If the same namespace is exposed to the host (considered the same due to reporting the same unique identifier in "Identify Namespace") through 2 different controllers which are not in the same subsystem, the host NVMe driver will end up creating 2 block devices, which defeats the purpose.

@benlwalker

benlwalker commented Nov 2, 2022

I'm not advocating that every controller has its own subsystem. I'm advocating that every controller associated with a given subsystem has the same view of the namespaces in that subsystem. This is a very minor narrowing of the flexibility which yields a large reduction in overall complexity. For example, in my proposed system, the common steps to configure a front-end device would be:

  1. Create N back-end/middle-end volumes
  2. Create a subsystem
  3. Add the volumes from step 1 to the subsystem from step 2.
  4. Create N controllers, one for each PF/VF that will see the subsystem created in step 2. Usually, there is just one controller.

But if we add the ability for each controller to see a different set of namespaces, the steps would become:

  1. Create N back-end/middle-end volumes
  2. Create a subsystem
  3. Create N controllers, one for each PF/VF that will see the subsystem created in step 2.
  4. For each controller, add the subset of volumes from step 1 that it should see. This subset can be different for every controller. Allocate the NSIDs such that they don't conflict with any other controller.

If N is anything but 1, is it clear how the second option suddenly becomes much more complex than the first? And I've provided a few reasons that the complexity isn't actually adding any value - there are other ways to accomplish almost every use case that maintain the simpler provisioning flow. A sketch of the first flow follows below.
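
As a sketch of that first (simpler) flow, again with hypothetical client calls rather than actual OPI RPCs:

```go
package frontend

import "context"

// ProvisionClient is an assumed client shape for the three resource
// types used in the simple flow.
type ProvisionClient interface {
	CreateSubsystem(ctx context.Context, nqn string) (subsysID string, err error)
	CreateNamespace(ctx context.Context, subsysID, volumeID string) error
	CreateController(ctx context.Context, subsysID, pfvf string) error
}

// provisionSimple follows steps 1-4 above: one subsystem, N volumes
// added as namespaces at the subsystem level, then one controller per
// PF/VF. No per-controller attach step is needed, because every
// controller sees all namespaces in its subsystem.
func provisionSimple(ctx context.Context, c ProvisionClient, volumes, pfvfs []string) error {
	subsys, err := c.CreateSubsystem(ctx, "nqn.2022-10.io.opi:subsys-0")
	if err != nil {
		return err
	}
	for _, vol := range volumes {
		if err := c.CreateNamespace(ctx, subsys, vol); err != nil {
			return err
		}
	}
	for _, pf := range pfvfs {
		if err := c.CreateController(ctx, subsys, pf); err != nil {
			return err
		}
	}
	return nil
}
```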

@glimchb
Member

glimchb commented Sep 12, 2023

Link to DOCA for reference: nvme_controller_attach_ns and nvme_controller_detach_ns
https://docs.nvidia.com/networking/display/bluefield3snaplatest/SNAP+RPC+Commands#SNAPRPCCommands-nvme_controller_attach_nsnvme_controller_attach_ns

@benlwalker

Almost a year later, my opinion remains that OPI should not support "private" namespaces. In other words, all controllers in a subsystem shall see the same set of namespaces. If a namespace is added to a subsystem, all controllers in the subsystem are notified. The normal/common case is one controller per subsystem (although we'd still allow multiple controllers for "dual-port" type setups).

This also means that the normal configuration for these DPU products is to create one subsystem per VF (as opposed to one giant subsystem for the whole device). This has other really important ramifications - it makes it much simpler to live migrate a controller between two subsystems if there is only one controller in it, for example. It also makes it clear what other components in the system are impacted when you request a subsystem reset (just your VF). I really do believe it's the only reasonable design choice to make, so OPI is safe to be opinionated here.

glimchb assigned benlwalker and unassigned mkalderon Sep 12, 2023
@sburla-marvell
Contributor

@mkalderon is on leave. As I stated earlier, I am OK with the restriction as long as there is a way for a subsystem to have multiple controllers and multipathing. The Marvell SDK also supports the attach/detach semantics.
