
[PSA] Document effect of multiple 'pipelines' on Register extern? #353

Closed · jafingerhut opened this issue Jul 19, 2017 · 55 comments

@jafingerhut (Contributor) commented Jul 19, 2017

I am assuming that the following will be true of some P4 programmable devices:

(a) If it has multiple match-action pipelines, then the physical storage accessed by Register externs will be separate for each such pipeline.

(b) There will typically be no kind of 'coherency' protocol implemented between these pipelines (that would be very expensive and probably application-dependent what you would want it to do -- I am not advocating that anyone implement such a feature).

If that is true, then it seems worth documenting this in the PSA and pointing out what effect it can have on a P4 program that uses the Register extern. E.g. it should be straightforward to have read-modify-writeback state where all packets going through the same physical match-action pipeline can see the writes of earlier packets, but they will not see any writes made by packets that went through a different pipeline.

Note: This kind of thing is typical in distributed chassis network devices, e.g. counters and meters are per-line-card, not global in the system. I am merely proposing that we document this explicitly in the PSA, so P4 program authors are aware of this, since the number of pipelines could vary from device to device.
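A minimal Python sketch (hypothetical, not any real PSA or target API) of the behavior described above -- each pipeline holding its own private copy of a Register, so a packet's write is visible only to later packets on the same pipeline:

```python
# Hypothetical simulation: two match-action pipelines, each with its
# own private copy of a Register array (no coherency between them).
class Pipeline:
    def __init__(self, size):
        self.register = [0] * size  # per-pipeline register storage

    def process_packet(self, index):
        # Read-modify-writeback: count packets seen by *this* pipeline.
        self.register[index] += 1
        return self.register[index]

pipes = [Pipeline(16), Pipeline(16)]

# Two packets assigned to different pipelines update different
# physical copies of "the same" register index.
pipes[0].process_packet(5)
pipes[1].process_packet(5)

# Each pipeline saw only its own write, not the other's:
assert pipes[0].register[5] == 1
assert pipes[1].register[5] == 1
```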

@mbudiu-vmw (Contributor) commented Jul 20, 2017

An extern instance should have its own resources.
A Register instance, in particular, should have its own private storage. If your architecture has multiple pipelines, it should have multiple control instances as arguments of the top-level package, and each of them will then use different Register instances. The language is rich enough to support describing how resources are allocated for various architectures.

@antoninbas (Member) commented Jul 20, 2017

I agree with @jafingerhut that it should probably be mentioned somewhere in the spec. This pretty much affects all stateful externs and you did mention counters and meters.
@mbudiu-vmw I expect that this kind of hardware will indeed come with its own architecture which will expose one programmable control per pipe. However, the whole idea of PSA is that it can be implemented across as many P4-programmable targets as possible to provide some measure of program portability. If these hardware targets are deemed incompatible with PSA, then the scope and usefulness of PSA is greatly reduced IMO.

@jafingerhut (Contributor, Author) commented Jul 20, 2017

@mbudiu-vmw I understand what you are suggesting, I believe, but for a chip that has 4 ingress and 4 egress pipelines, it would mean the top level package would have 'slots' to plug in 4 parsers, 4 ingress pipelines, 4 egress pipelines, and 4 deparsers. In practice I suspect most people would always want to 'plug in' the same parser into all 4 slots, the same ingress control block into all 4 slots, etc.

@vgurevich (Contributor) commented Jul 20, 2017

I think that PSA should stay with a single ingress and egress (in terms of the package). Otherwise we will end up with PSA-1, PSA-2, PSA-4, etc (depending on the number of pipes). Some architectures might be using a pipeline per port and that might get really ugly.

I suggest we simply state that the programmer needs to be aware of certain effects if the underlying target uses multiple independent pipelines. Specifically, state is local to the pipelines, which typically represent ports or groups of ports.

  1. Counters are actually OK, since their values cannot be accessed in the data plane and can be easily and safely aggregated by the control plane, provided that we do not mandate a set_value() method for counters and only keep clear()
  2. For meters, accurate metering will be possible only within the scope of a given pipeline
  3. For registers, all state is, again, localized to the pipeline scope (typically a group of ports)
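Point 1 can be sketched in a few lines of Python (hypothetical control-plane code, not a real driver API): per-pipeline counters stay private in the data plane, and the control plane simply sums them for a global view.

```python
# Hypothetical control-plane aggregation of per-pipeline counters.
# Safe because addition is order-independent and the data plane never
# reads the aggregate back (no set_value(), only clear()).
per_pipe_counters = {0: 120, 1: 87, 2: 0, 3: 341}  # packets per pipeline

def read_global(counters):
    return sum(counters.values())

def clear_all(counters):
    for pipe in counters:
        counters[pipe] = 0

assert read_global(per_pipe_counters) == 548
clear_all(per_pipe_counters)
assert read_global(per_pipe_counters) == 0
```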

@mbudiu-vmw (Contributor) commented Jul 20, 2017

I think that the more precise description of PSA is as being a parameterized family of architectures, and not an architecture. For example, the number of ports is a parameter.

One can think of the number of pipelines being another parameter.

But indeed, we don't have suitable language mechanisms to describe the various architectures parameterized by the number of pipelines in a nice way. Logically you have something like this:

control ingress<...>(...);
control egress<...>(...);

package switch1_1(..., ingress i1, egress e1);
package switch2_1(..., ingress i1, ingress i2, egress e1);
package switch2_2(..., ingress i1, ingress i2, egress e1, egress e2);
...

By using some pre-defined macros we could also enforce that all ingress pipelines use the same implementation:

#define switch2_2_impl(i, e) switch2_2(..., i, i, e, e);

This is not pretty, but it would describe the architecture accurately.

@mbudiu-vmw (Contributor) commented Jul 20, 2017

BTW: this will be the most important to get right for the control-plane API, because if your switch has multiple pipelines the control plane has to use different names for each of the stateful objects that it is accessing.

@vgurevich (Contributor) commented Jul 20, 2017

@jafingerhut It is not really clear to me what specific problem this complex definition will actually solve, except multiplying all objects by N=number_of_pipes.

As @mbudiu-vmw noticed, the main issue is on the control-plane side, and from the control-plane perspective, pipe-local objects are usually an exception rather than the rule; more often than not this is just an optimization that allows the target to achieve better table capacity. I do not think that forcing the control plane to explicitly program/query objects in each pipe, instead of being able to do "all pipes" by default, will be convenient for users.

@jafingerhut (Contributor, Author) commented Jul 20, 2017

@vgurevich In a 2-pipeline device, where the same P4 program is currently loaded into every pipeline, not only the registers, but all P4 tables have independent storage for each pipeline, e.g. an IPv4 longest-prefix-match table that is read-only from the P4 program is actually stored 2 times, once per pipeline.

Are you saying that you believe that the control plane API should be always restricted to adding identical entries into every pipeline of the same device?

Or do you think there would be interesting use cases where someone would want the option to add different table entries to different pipelines?

An IPv4 FIB table might not be a great example of a table where the control plane would want different contents in each pipeline. Ingress ACL (access control lists) are perhaps a better example -- one could fill them all identically in every pipeline, but if ingress pipelines are tied to particular subsets of physical ports on a device, then installing ingress ACL table entries in all pipelines identically can be a waste of table space, e.g. if some ingress ACL entries should only be matched on packets arriving on input port 7, and input port 7 is attached to pipeline #0, then if you also install those same entries in pipeline #1's ingress ACL table, they will never be matched by any packets. I have worked on control planes that intentionally installed different entries in different ingress pipelines in an ingress ACL table, to avoid wasting that space.
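The space-saving argument above can be sketched as follows (hypothetical control-plane code; `PORTS_PER_PIPE`, `install_acl_entry`, and the entry format are all illustrative assumptions, not a real API):

```python
# Hypothetical pipe-aware ACL install: if ingress pipelines own
# disjoint port ranges, a port-qualified entry only needs to live in
# the pipeline that owns that port; it can never match elsewhere.
PORTS_PER_PIPE = 16  # assumed: pipe 0 owns ports 0-15, pipe 1 owns 16-31

def pipe_for_port(port):
    return port // PORTS_PER_PIPE

def install_acl_entry(tables, entry):
    """tables: list of per-pipeline ACL tables (lists of entries)."""
    port = entry.get("in_port")
    if port is None:
        # No port qualifier: mirror the entry into every pipeline.
        for t in tables:
            t.append(entry)
    else:
        # Port-qualified: install only in the owning pipeline.
        tables[pipe_for_port(port)].append(entry)

tables = [[], []]
install_acl_entry(tables, {"in_port": 7, "action": "deny"})   # pipe 0 only
install_acl_entry(tables, {"action": "permit"})               # mirrored

assert len(tables[0]) == 2   # port-7 entry + mirrored entry
assert len(tables[1]) == 1   # mirrored entry only
```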

@vgurevich (Contributor) commented Jul 20, 2017

@jafingerhut ,

You are correct -- in a multi-pipeline device everything is typically pipe-local and that certainly includes tables.

An IPv4 LPM table is actually a good example of a table that is typically programmed the same way for all the pipes -- you never know from which port a packet destined to a given IP address will ingress. It might be possible to save some space if you use port-based VRFs and you happen to have a VRF that includes only ports from a certain pipe, but that is rarely the case. Note also that if you always program both copies identically, the behavior of the program will still be correct; you might be wasting some entries, but that's all.

Normally, only the tables that explicitly use ingress or egress port in the match key are programmed differently and the truth is:

  1. There are actually only a few of them in the grand scheme of things (and even fewer are big enough to be worth bothering about)
  2. The control plane knows very well which tables/objects need to be asymmetric and can request such an option from the driver. Alternatively, these tables can be annotated in the program, but in any case, only the control plane is affected.

Ingress-port-based access lists definitely fit in that category.

@jnfoster (Contributor) commented Jul 20, 2017

I haven't been following this discussion closely, but I would ask that whatever design we converge on be compatible with the atomic annotation, which was added to P4_16 (led by @anirudhSK).

@mbudiu-vmw (Contributor) commented Jul 20, 2017

If you have multiple pipelines and each of them has a separate register/counter, a control plane shim will be unable to provide even the standard semantics for a global register/counter (leaving @atomic aside). Consider the case where one of the counters in one pipeline overflows or saturates.

@jafingerhut (Contributor, Author) commented Jul 20, 2017

@jnfoster Thanks for reminding me of the @atomic annotation, and I wanted to ask a clarifying question on that.

It seems to me that in the PSA document, we should be reminding people that if they do a Register read() call, then later a write() call, and those operations are not enclosed inside a block with the @atomic annotation, then a PSA-compliant implementation is allowed to make it non-atomic. That is, the value read() for a packet might not be the value written by the most recently arriving packet that wrote it, but 2 packets ago, or 3, or more.

This might be perfectly acceptable for some features using the Register extern, but I would think that in most cases P4 program authors would want the atomic behavior, since otherwise they could be 'losing' the side effects of one or more packets writing to a register.
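The "lost side effect" failure mode can be illustrated with a few lines of Python (a hypothetical interleaving, not target code): if two packets' read-modify-writeback sequences overlap outside an @atomic block, one update silently disappears.

```python
# Hypothetical non-atomic read-modify-writeback: two packets both try
# to increment the same register, and their operations interleave.
register = {"v": 0}

def read(reg):
    return reg["v"]

def write(reg, value):
    reg["v"] = value

a = read(register)      # packet A reads 0
b = read(register)      # packet B reads 0 before A's write lands
write(register, a + 1)  # A writes 1
write(register, b + 1)  # B writes 1, overwriting A's update

# Two increments were attempted, but only one survived.
assert register["v"] == 1
```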

@jafingerhut (Contributor, Author) commented Jul 20, 2017

@mbudiu-vmw Multiple counter states in parallel pipelines are easily combinable via addition, and for pretty much any practical purposes I can imagine, that is functionally equivalent to a single counter state updated by all pipelines. You lose information in the wrap/saturate case regardless of whether there is one pipeline or more than one. The current recommendation I have written in a pull request for the PSA counter section is that it is assumed that some driver code is reading counter states with a high enough frequency that information is not lost, without specifying how this is to be implemented.
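The driver-side scheme alluded to above can be sketched like this (hypothetical code; the 8-bit width, `SoftCounter` class, and `poll` method are illustrative assumptions): each pipeline's narrow, wrapping hardware counter is polled often enough to fold every delta into a wide software accumulator, and the global value is the sum across pipelines.

```python
# Hypothetical driver-side accumulation of wrapping hardware counters.
HW_WIDTH = 8              # assumed narrow 8-bit hardware counter
HW_MOD = 1 << HW_WIDTH

class SoftCounter:
    def __init__(self):
        self.total = 0    # wide software accumulator
        self.last_hw = 0  # last raw hardware reading

    def poll(self, hw_value):
        # Correct as long as the hardware counter wrapped at most once
        # since the previous poll -- hence "high enough frequency".
        delta = (hw_value - self.last_hw) % HW_MOD
        self.total += delta
        self.last_hw = hw_value

# One software counter per pipeline; the global count is their sum.
pipes = [SoftCounter(), SoftCounter()]
pipes[0].poll(250)
pipes[0].poll(10)   # hardware wrapped: 250 -> 255 -> 0 -> 10, delta 16
pipes[1].poll(100)

assert pipes[0].total == 266
assert sum(p.total for p in pipes) == 366
```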

@anirudhSK (Contributor) commented Jul 21, 2017

@jnfoster I think multi-pipeline atomics are forbidden by the P4 grammar. The grammar snippet just under section 6.5 (https://p4lang.github.io/p4-spec/docs/P4-16-v1.0.0-spec.html) suggests that you cannot have an @atomic block that spans two control declarations, which would be needed to implement a multi-pipeline atomic counter---at least a counter implemented using an @atomic block like this:

@atomic {
    tmp = register.read();
    tmp = tmp + 1;
    register.write(tmp);
}

We could still have extern objects that are instantiated once at the top level, but then used within multiple control declarations, such as a multi-pipeline counter. But, with an extern object, it's up to the implementer of the extern to ensure that extern method calls are atomic (From 16.4.1: "Execution of a method call on an extern instance is atomic.")

For the PSA, it seems to me the simplest solution is to ensure that each globally allocated extern (we considered and rejected local externs earlier: #81) is still accessed from at most one pipeline. While this is kind of hard to ensure, it's just a formalized version of @vgurevich's three points above. It's relying on the programmer instead of having the language prevent such use cases.

Alternatively, we could revisit local externs, and make meters and registers local externs to reflect the reality of the underlying hardware. Counters could be global externs, provided they only support a count() method, which can be implemented by adding up pipeline-local counters. In addition, if they also support a read() method, which gets the latest value of the counter in the data plane, then we can't support a global extern counter. This is because the latest value of the counter is only available when the control plane eventually aggregates the counts, and not in the data plane.

@jafingerhut, I like your idea of warning programmers about non-atomic uses of a register in the PSA.

@mbudiu-vmw (Contributor) commented Jul 21, 2017

@anirudhSK, I believe that the problem here is that PSA wants people to write a P4 program for a switch with a single pipeline, but the program will be synthesized identically multiple times in multiple pipelines. The control plane is supposed to hide this fact by aggregating information from all pipelines.

I think that this is very fragile; maybe this can be made to work for some restricted scenarios, but in general the program written by the user and the program implemented by the system will not have the same semantics.

@jafingerhut (Contributor, Author) commented Jul 21, 2017

I am going to mention @antoninbas explicitly, since this comment would have a noticeable effect on the control plane API, if the idea is implemented.

I believe that the main motivation for wanting to try to 'hide' multiple pipelines from the control plane is performance -- if tables have the same contents (a common use case), it would be nice if the control plane API can send only one update, and have all pipelines be updated as a result. I am curious to hear if there are other motivations.

It seems that this performance concern could be addressed by making the existence of multiple pipelines explicit. In the control plane API, we could create something like a "mirroring" configuration, e.g. the control plane could configure table A in pipelines #0 thru #3 of a 4-pipeline P4 device to be mirrored, so that all updates to that "mirroring group" always apply to all physical copies of the table in that group.

If this is done, it seems worth having the option, in the same device at one time, for some tables being mirrored, and other tables having unique contents in different pipelines. The control plane would select which tables it wants to be mirrored, and which not.
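A hypothetical sketch of such a mirroring-group API (the `Device` class and method names are illustrative assumptions, not a proposed P4Runtime interface): mirrored tables take one update that fans out to every pipeline copy, while non-mirrored tables are programmed per pipeline.

```python
# Hypothetical control-plane "mirroring group" API.
class Device:
    def __init__(self, num_pipes):
        self.num_pipes = num_pipes
        self.tables = {}       # (table name, pipe) -> list of entries
        self.mirrored = set()  # tables whose pipeline copies stay identical

    def create_table(self, name, mirrored=False):
        for p in range(self.num_pipes):
            self.tables[(name, p)] = []
        if mirrored:
            self.mirrored.add(name)

    def add_entry(self, name, entry, pipe=None):
        if name in self.mirrored:
            # Single API call, applied to all physical copies.
            for p in range(self.num_pipes):
                self.tables[(name, p)].append(entry)
        else:
            # Non-mirrored: caller targets one specific pipeline.
            self.tables[(name, pipe)].append(entry)

dev = Device(num_pipes=4)
dev.create_table("ipv4_fib", mirrored=True)
dev.create_table("ingress_acl")  # unique contents per pipeline
dev.add_entry("ipv4_fib", "10.0.0.0/8 -> port 3")
dev.add_entry("ingress_acl", "in_port 7: deny", pipe=0)

assert all(len(dev.tables[("ipv4_fib", p)]) == 1 for p in range(4))
assert len(dev.tables[("ingress_acl", 0)]) == 1
assert len(dev.tables[("ingress_acl", 1)]) == 0
```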

@vgurevich (Contributor) commented Jul 21, 2017

@jafingerhut

Actually, the main reason is convenience, because having loops around 90% of the API calls (plus the proper error handling) is quite annoying. It's certainly possible to do (and sometimes might be needed in the most advanced cases), but I would not make it a default option.

But in any case, all this can be easily achieved at the control plane API level -- I do not think we need to complicate the actual P4 program with these details if the only thing we are doing is running the same program on multiple pipelines.

@jafingerhut (Contributor, Author) commented Jul 22, 2017

@vgurevich I don't think it has huge effects on the P4 code, but there are at least a few issues where a P4 programmer needs to be aware that the code they are writing would be running on a multi-pipeline device, vs. a single-pipeline device.

The draft PSA we have now mentions Ingress2Egress and Egress2Ingress as clone methods. If there are multiple ingress and multiple egress pipelines, are these cloning methods restricted to cloning within the same one of the multiple pipelines, or can they 'jump between' pipelines while cloning? It seems reasonable that Ingress2Ingress and Egress2Egress clones would be restricted to stay in the same pipeline.

The rest of this (long) message is on the effects multi-pipeline devices can have on the portability of P4 programs that use the Register extern. I assume we are talking about multi-pipeline devices where Register externs have a separate instance per pipeline, with no access to common Register state between different pipelines.

Suppose you want to write a feature that uses a Register to maintain some state that is modified by all packets that go out of output port X, with separate Register array elements holding the state for each output port.

Can you write a P4 program for this that updates that Register in the ingress control block? If the target device has only one pipeline, then yes. If it has multiple pipelines, then no -- you must access that Register in the egress pipeline. If you were to put it in the ingress control block in a 2-pipeline device, there would be 2 independent Register states for output port X, one in each ingress pipeline, and only a subset of packets will update each one.

Even that statement contains an assumption that affects whether P4 programs are portable across multi-pipeline devices -- that such a device would have each pipeline dedicated to some subset of the device's external ports. Call that a port-based multi-pipeline.

What if someone wanted to design a "hash-based" multi-pipeline device instead? That is, it calculates a hash of some packet header fields, and uses that hash value to select an ingress pipeline, and later does that again to select an egress pipeline. On such a device, it would be impossible to write a P4 program that accessed a common Register array element for all packets coming in from the same ingress port, or for all packets going out the same egress port. Should we say that such a device is PSA-compliant, or no?

On a hash-based multi-pipeline device, you can write P4 programs that have Register state accessed by all packets with the same packet header 5-tuple (or whatever the hash is calculated from).

But port-based can't maintain the kind of Register state that hash-based can do, and vice versa.
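The contrast can be made concrete with a small Python sketch (hypothetical pipeline-selection policies; the modulo functions stand in for whatever real mapping a device uses): per-port register state is only coherent if the policy keeps each port on one pipeline.

```python
# Hypothetical contrast of pipeline-selection policies.
NUM_PIPES = 2

def port_based_pipe(in_port, flow_hash):
    return in_port % NUM_PIPES    # a given port always maps to one pipe

def hash_based_pipe(in_port, flow_hash):
    return flow_hash % NUM_PIPES  # same port can land on any pipe

def pipes_seen(select, packets):
    """Which pipelines handled each ingress port's packets?"""
    seen = {}
    for in_port, flow_hash in packets:
        seen.setdefault(in_port, set()).add(select(in_port, flow_hash))
    return seen

packets = [(7, 12), (7, 13), (7, 200)]  # three flows, same ingress port

# Port-based: all of port 7's packets share one pipeline, so one copy
# of any per-port register state sees every packet.
assert len(pipes_seen(port_based_pipe, packets)[7]) == 1
# Hash-based: port 7's packets scatter across pipelines, so no single
# register copy sees them all.
assert len(pipes_seen(hash_based_pipe, packets)[7]) == 2
```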

If we say that both port-based and hash-based multi-pipeline devices are PSA-compliant, it restricts the kinds of program features that can be portable across PSA devices.

I realize that there are plenty of useful P4 programs that don't need Registers at all, but if they are going to be a standard PSA extern, writing portable P4 programs that use them takes a bit of care for multi-pipeline devices. I am personally content to say that a port-based multi-pipeline device is PSA-compliant, but a hash-based multi-pipeline device is not, since I would guess that hash-based devices would be less common (perhaps even non-existent).

@vgurevich (Contributor) commented Jul 22, 2017

@jafingerhut

I totally agree that there may be systems and programs, where multi-pipeline effects are significantly more pronounced and visible and all the examples you cite are totally possible and valid. I also totally agree that the users of the multi-pipeline device must be fully aware of the details of their particular architectures and differences vs. single-pipeline devices and act accordingly. I also totally agree that port-based pipelines are going to be more common, but hash- (or flow-) based ones are also plausible.

Here is what I am trying to say:

  1. It would be nice if the users of single-pipeline devices do not have to worry and do extra stuff just because other users have multi-pipeline devices.
  2. Generally speaking, multi-pipeline architectures are designed in a way to minimize visible differences between them and single-pipeline architectures, meaning that a lot of typical programs written for these devices would not know the difference. I think it would also be nice if those programs (and programmers who write them) did not have to pay extra price just because other people might write programs that do, indeed, require special care.
  3. No matter how we decide to solve this problem, be it at the P4 level (by requiring the program to specify the same control in the package multiple times) or at the control-plane level (by requiring the pipeline ID to be specified along with the device ID), fundamentally the device behavior, and the way people think about what the program is doing, is going to stay the same. Moreover, I do not see how one way is going to be more instrumental than the other in exposing these subtle behavioral differences, so in my mind it's mostly about convenience, and specifically about how much we want to inconvenience people who would rather not worry.

That's why I think that the control plane "path" might be easier, since things can be much more easily abstracted/covered in there for those who do not need/want to know, while still being fully exposed to those who care, but ultimately I am fully open to other solutions.

@jafingerhut (Contributor, Author) commented Jul 22, 2017

So in the PSA, we have choices on what to say about this. Below are two possibilities. I personally prefer #2 over #1.

  1. Say nothing. The disadvantage is that there will be P4 programs that we already know are non-portable between PSA-compliant devices, about which the document says nothing, leaving people to rediscover the issue themselves.

  2. Add an appendix describing these issues at least somewhat, perhaps at the level of detail in some of the comments earlier in this thread, and reference it from the PSA sections on the most-affected behavior, e.g. from the sections on the Register extern and the clone primitive.

@vgurevich (Contributor) commented Jul 23, 2017

@jafingerhut

Definitely +1 on number 2.

@jafingerhut (Contributor, Author) commented Jul 23, 2017

@anirudhSK You say "For the PSA, it seems to me the simplest solution is to ensure that each globally allocated extern (we considered and rejected local externs earlier: #81) is still accessed from at most one pipeline."

The P4_16 v1.0.0 spec does allow local externs instantiated at the top level of control blocks and parsers, at least. The grammar permits them in other places, but at least the 2017-Jul-19 version of p4test gives an error if you try to instantiate an extern within a parser state, action, or within the 'apply { ... }' block that is within a control block.

I agree with you that for the PSA, if an extern like a meter, counter, or register is instantiated globally (i.e. at the top level of the program), it should be restricted to be used within only one parser or control block. I also believe it would be a good idea for a PSA multi-pipeline device to explicitly say that any state associated with tables or externs are local to the one 'instance' of the parser or control block that is for that one pipeline.

@mbudiu-vmw @cole-barefoot I believe the P4_16 v1.0.0 language spec allows a global extern to have method calls for it in multiple parsers and control blocks. Does that match your recollection? I also believe the language spec has no way for a table to be used from multiple control blocks. Is that true?

@anirudhSK (Contributor) commented Jul 23, 2017

@jafingerhut
I don't think local externs are permitted by the grammar. When I said local extern, I meant the ability to declare an extern within a control block, the way table declarations are local to a control block.

https://github.com/p4lang/p4-spec/blob/master/p4-16/spec/P4-16-spec.mdk#L1348
suggests that externs (externDeclaration) can be declared at the same level as a parser or control, but not inside a parser or control. I don't see any other place in the grammar where an externDeclaration can occur. For instance, it can't be nested within a control block because the production rules of controlDeclaration do not contain externDeclaration (https://github.com/p4lang/p4-spec/blob/master/p4-16/spec/P4-16-spec.mdk#L4718)

When you say "it should be restricted to be used within only one parser or control block," my concern is how this restriction is enforced. One way is to document it in the PSA and say "don't do this." But that could result in undefined behavior if the programmer (willingly or not) ignores it. We could go one step further and forbid it in the language, by making these objects local externs that simply cannot be accessed from other pipelines.

I prefer the second approach because it doesn't assume the programmer will be careful. That said, it requires us to introduce local externs (assuming my interpretation of the grammar above is correct and local externs don't already exist), model multiple pipelines explicitly in the PSA, and then instantiate each of the problematic externs (meters, registers, etc.) locally in each pipeline.

But, as I understand it, the expectation is that the PSA is a simple single-pipeline device (for instance, as @mbudiu-vmw said earlier: "PSA wants people to write a p4 program for a switch with a single pipeline"). I am just worried that this single-pipeline view hides too much of the underlying architecture and is a recipe for the programmer to shoot themselves in the foot by assuming an extern is shared consistently across pipelines.

@jafingerhut (Contributor, Author) commented Jul 23, 2017

@anirudhSK I believe it is the plan of record that compilers for PSA-compliant devices are allowed to reject perfectly correct P4_16 programs, if those programs violate any of a list of additional restrictions documented in the forthcoming PSA spec.

If that is so, there is no reason to change the P4_16 spec to disallow things, if we wish to disallow them in PSA-portable programs.

@anirudhSK (Contributor) commented Jul 23, 2017

Rejecting non-compliant programs in the compiler is definitely helpful. It is much easier than changing the spec and certainly better than undefined behavior. But I have a somewhat naive question. How does the compiler even detect this unsafe access of an extern from multiple pipelines?

If the PSA looks like a single-pipeline device at the level of the P4 source, i.e., one control for ingress, one for egress, and one for parser, the P4 source will only have a single pipeline---even if it's implemented as a multi-pipeline device underneath. So, by definition, all extern method calls will happen only in a single pipeline as far as the P4 program is concerned, and the compiler will have nothing to check.

If the PSA looks like a multi-pipeline device at the level of the P4 source, then I am OK with either having the compiler check that global externs are only accessed within a single pipeline (or) introducing local externs. I will add that this is the second time a need for a local extern has come up (#81 talked about it in the context of parser checksums). We solved the previous problem using a different language mechanism, but there may be value to just introducing local externs because they seem to be recurring. I am curious to know what @jnfoster and @mbudiu-vmw think of this?

But, ultimately, at least for me, it's far more important that we check and warn against these accesses in some manner (either in the compiler or forbid them at the language level using local externs). It's less important how that's done (compiler or language).

@jafingerhut (Contributor, Author) commented Jul 23, 2017

@anirudhSK You point out the part of the grammar that has an externDeclaration. Maybe I'm way off base here, but I think an externDeclaration without an 'instantiation' will not create any extern state within a P4 program. It is the instantiations that create the state, and the instantiations can be global, but are also permitted within a parser or control block.

Hopefully someone else who knows more can support or refute my statements here.

@anirudhSK (Contributor) commented Jul 23, 2017

@jafingerhut You're right. Extern variables can be instantiated within a control (as my own comments at the top of #81 show :)). Here's another example, where Counter is declared here: https://github.com/p4lang/p4c/blob/428a38ea275c74d51fbe97cc4a81eeaab62bc016/testdata/p4_16_samples_outputs/rcp1-frontend.p4#L15 and instantiated here: https://github.com/p4lang/p4c/blob/428a38ea275c74d51fbe97cc4a81eeaab62bc016/testdata/p4_16_samples_outputs/rcp1-frontend.p4#L31

The "local" in #81 refers to thread-local lifetime, not local visibility, which is what we need here. Locally visible externs are supported in P4-16. Sorry for the confusion.

Now that I know we can actually support locally visible externs, my preference would be to make the existence of the multiple pipelines explicit in the PSA source code as pointed out in Mihai's code snippet earlier. For instance, this is a 2 ingress, 2 egress pipeline:

package switch2_2(..., ingress i1, ingress i2, egress e1, egress e2);

Otherwise, with a single-pipeline PSA, we run the risk that the programmer thinks there is a single meter/register, which is really implemented as two pipeline-local meters/registers. I don't know how we specify the semantics of a register in a single-pipeline PSA when it is really implemented as two registers.

@vgurevich (Contributor) commented Jul 23, 2017

@anirudhSK ,

I am still not clear how specifically creating a package with multiple, but identical ingress/egress control functions will help the compiler to notice that someone is trying to access an extern that "lives" in another pipeline. These externs will still be local to the corresponding controls, so there is no way one can even express that they are trying to access something else.

@anirudhSK (Contributor) commented Jul 23, 2017

I intended for the package to support multiple possibly distinct ingress and egress control functions. But in the common case, these control functions (as implemented by the P4 programmer) could carry out the same functionality in each pipeline, similar to how Andy talked about plugging in the same parser block into all four pipelines.

If the extern is instantiated locally (within a parser or control block), then as you say "there is no way one can even express that they are trying to access something else." But, this is exactly what we want. We don't want a programmer to be accessing anything except their own locally instantiated externs.

If the extern is instantiated globally, at the top-level scope, then the compiler must make sure it is accessed in exactly one pipeline, because it is visible in all pipelines. Here a PSA compiler can check that any extern that is instantiated at the top level is accessed in only one pipeline.

Neither of the two cases above can be handled if we abstracted a multi-pipeline switch into a single-pipeline PSA model. For both locally and globally instantiated externs, there is nothing for the compiler to check if the PSA had only a single pipeline. Also, for a single-pipeline PSA model that is in reality implemented as multiple pipelines, I don't see a way to clearly specify its semantics without invoking its multi-pipeline implementation, which seems awkward.

@vgurevich commented Jul 24, 2017

@anirudhSK ,

One way to solve this might be to have an array of N globally-instantiated externs (one for each pipeline), e.g.

Counter(4096, counter_type.packets_and_bytes) [4] my_counter;
Meter(8192, meter_type.packets) [4] my_meter;
Register<bit<32>>(16384) [4] my_register;

and also have a special value (e.g. intrinsic metadata), say my_pipe. Then, every time one wants to access these externs, they will have to specify something like

my_counter[my_pipe].count(index)

Personally, I feel that this medicine will be worse than the actual problem (not to mention the required changes to the language), and even then it is not clear whether it would help much, since the relationship between a given packet and my_pipe will still be murky. This would also lead to all externs being instantiated globally.

I feel that we are trying to solve an unsolvable problem here and a practical solution might be as follows:

  1. Formally define PSA for a single-pipeline device
  2. As @jafingerhut suggested, write an appendix that will state that in many circumstances a multi-pipeline device will also do, but here are the things programmers should be aware of
  3. While the compiler cannot formally verify the assertion, we can, nevertheless state that a truly PSA-compliant program should exhibit the same behavior regardless of the number of pipelines and therefore if there is an incompatibility, it will be a problem with the program, not with PSA :)

@jafingerhut commented Jul 24, 2017

@vgurevich I don't see any reason to go to an array of 4 globally accessible externs in order to support a 4-pipeline device.

It is sufficient to have a package with 4 instances of an ingress control, where each of those ingress control blocks is the same one in the source code, has its own locally instantiated externs, and always accesses only its local externs. No changes to the P4_16 language are needed for this, and the differences in the source code between a 4-pipeline device and a 1-pipeline device would be only the package declaration (unless I am missing some reason why this wouldn't be correct P4_16 code).

There would be no reason in the PSA to generalize to allowing different pipelines to be running different code.

@vgurevich commented Jul 24, 2017

@jafingerhut

I think we need to decide what we are trying to achieve. If the desire is to do things in a simple way and worry about complex issues later, I would suggest defining PSA for a single pipeline only and write an appendix with notes/warnings about multi-pipeline devices.

If the desire is to express the contents of that appendix in the language, then having a package with N instances of ingress/egress/etc. is not going to cut it, as @anirudhSK and I discussed above. Therefore, my question is "why bother if it is not going to help, but is going to make things more complicated?"

While I mentioned globally defined arrays of instances (and a special intrinsic metadata to identify the pipe), I actually think that even that explicit declaration will not help much, but offered it as at least a more expressive method.

@anirudhSK commented Jul 24, 2017

"If the desire is to express the contents of that appendix in the language, then having a package with N instances of ingress/egress/etc. is not going to cut it, as @anirudhSK and I discussed above. "

I think a package with N instances of ingress/egress should be sufficient to cover all the cases I can think of. For locally instantiated externs, the language itself will enforce that the externs cannot be accessed from more than one pipeline (#353 (comment) and #353 (comment)). For globally instantiated externs, it should be possible for the compiler to check that an extern instance is not accessed from more than one pipeline (#353 (comment)).

One downside is that the compiler check has to be conservative. As an example, if we access counter[x] from one pipeline and counter[y] from another, it's OK if x != y. However, determining x != y is hard if x and y are arbitrary expressions, so the compiler will have to reject the program assuming the worst. But I don't know how common globally instantiated extern arrays are, and a conservative check seems reasonable to me. I assume there is no other form of checking required, but @mbudiu-vmw or @jnfoster can correct me if there is the possibility of aliasing in the language.
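The conservative check described above could be sketched as follows (hypothetical Python, not from any real P4 compiler): per-pipeline index expressions into a top-level extern array are accepted only when they are compile-time constants that can be proven pairwise distinct; any non-constant index forces rejection, assuming the worst.

```python
def indices_provably_distinct(index_exprs):
    """index_exprs: one index expression per pipeline, where an expression is
    either an int literal or an opaque string standing for an arbitrary
    run-time expression. Returns True only when the compiler can prove that
    no two pipelines touch the same extern array slot."""
    if not all(isinstance(e, int) for e in index_exprs):
        return False  # cannot reason about arbitrary expressions: reject conservatively
    return len(set(index_exprs)) == len(index_exprs)

assert indices_provably_distinct([0, 1]) is True      # counter[0] vs counter[1]: OK
assert indices_provably_distinct([3, 3]) is False     # same slot from two pipes
assert indices_provably_distinct(["x", "y"]) is False # x != y unprovable: reject
```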

In general, I would prefer that things are enforced by the language first, then the compiler, relying on the programmer only as a last resort. The package-based approach with N pipelines has the least reliance on the programmer to do the right thing.

Finally, if the worry is that we are complicating a simple solution (the single pipeline PSA) to handle the uncommon case (state access), I think defining a family of architectures as Mihai suggested much earlier (#353 (comment)) is one way to go.

@jafingerhut commented Jul 24, 2017

@vgurevich @anirudhSK I have created a simple, short P4_16 program using today's draft psa.p4 #include file, and also a slightly modified version of those files that demonstrates one way to make multiple pipelines explicit, but most of the changes are limited to the psa.p4 file, with almost no changes required in any P4_16 programs that #include it. You can see that sample code here: https://github.com/jafingerhut/p4-guide/tree/master/multipipeline

You can clone the repository that contains it for easier access/diff-ing/etc.: https://github.com/jafingerhut/p4-guide Directory named "multipipeline" has a README.md and only the few P4 source files for this sample code.

@vgurevich commented Jul 24, 2017

@jafingerhut

Thanks a lot for the example. It looks very neat (and I love your repo).

My question, however, still stands: how will the user or the compiler know whether the updates to a register/counter/meter at a given index I, caused by packet A, will be visible to a packet B that uses the same index I? In my mind, that is the issue, isn't it?

@jafingerhut commented Jul 24, 2017

@vgurevich The compiler will know because of normal P4_16 rules for instantiating externs (if I understand them correctly) -- an extern instantiated inside of one control block will have its own unique state that is not accessible in any way from other control blocks.

A P4 programmer will know for reasons of understanding the facts above, and the docs for their PSA-compliant P4 target device should say something similar to "Hey, our device is PSA-compliant and contains 2 pipelines, where pipeline #0 handles ports 0-3, and pipeline #1 handles ports 4-7."

@vgurevich commented Jul 24, 2017

@jafingerhut ,

I figured that's why you used the port number as the index. More often than not, however, the index is derived first and then used. A typical piece of code looks like this:

action set_meter_index(bit<16> meter_index) {
    meta.meter_index = meter_index;
}

table meter_table {
    key = {
        istd.ingress_port    : ternary;
        hdr.ethernet.dstAddr : ternary;
        hdr.ethernet.srcAddr : ternary;
        hdr.ipv4.srcAddr     : ternary;
        hdr.ipv4.dstAddr     : ternary;
        /* etc */
    }
    actions = {
        set_meter_index;
    }
}

apply {
    MeterColor_t c1;

    meter_table.apply();
    c1 = meter1.execute(meta.meter_index, MeterColor_t.GREEN);
    if (c1 == MeterColor_t.RED) {
        pre.drop(); /* bde.drop() */
    }
}
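A hypothetical Python illustration of the concern above: when the meter index is derived from a lookup rather than tied to a port, one flow's packets can land on different pipelines, and each pipeline-local meter then sees only a fraction of the flow's true rate.

```python
RATE_LIMIT = 100  # packets allowed per interval, intended to apply per flow

def packets_admitted(per_pipe_packet_counts):
    """Each pipe meters the flow independently against the full limit,
    because the pipeline-local meters share no state."""
    return sum(min(c, RATE_LIMIT) for c in per_pipe_packet_counts)

# 160 packets of one flow, spread evenly over 2 pipes: every packet is
# admitted, even though a single global meter would have capped the flow
# at 100 packets.
assert packets_admitted([80, 80]) == 160
assert packets_admitted([160]) == 100  # single-pipeline behavior
```

The names and numbers are illustrative only; the point is that the admitted rate depends on how packets happen to be spread across pipes.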

@jafingerhut commented Jul 24, 2017

In my example I was focusing on the P4_16 language visibility issues, not on exactly what index is best to use to implement a particular feature.

In an example like the one I wrote, suppose the user instantiated a counter indexed by ingress port number, allocating as many entries as there are ports in the entire system (e.g. 8), but the target device had two pipelines, where pipe #0 only processed packets from ports 0-3 and pipe #1 only processed packets from ports 4-7. What would actually be instantiated is 2 arrays of 8 counters each: the pipe #0 instance would have entries 0-3 updated and entries 4-7 unused, and vice versa for the pipe #1 instance.
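A hypothetical Python model of this 8-port / 2-pipe arrangement: each pipe instantiates the full 8-entry counter, updates only its own ports, and the control plane recovers global per-port counts by summing across pipes.

```python
NUM_PORTS, NUM_PIPES = 8, 2
counters = [[0] * NUM_PORTS for _ in range(NUM_PIPES)]  # one full array per pipe

def pipe_for_port(port):
    return 0 if port < 4 else 1  # pipe 0 serves ports 0-3, pipe 1 serves 4-7

def count_packet(ingress_port):
    counters[pipe_for_port(ingress_port)][ingress_port] += 1

for port in [0, 0, 5, 7, 2]:
    count_packet(port)

# Entries 4-7 of pipe 0 (and 0-3 of pipe 1) are allocated but never updated.
assert counters[0][5] == 0 and counters[1][5] == 1

# Control-plane view: sum the per-pipe copies to get global per-port counts.
global_counts = [sum(c[p] for c in counters) for p in range(NUM_PORTS)]
assert global_counts == [2, 0, 1, 0, 0, 1, 0, 1]
```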

If you wanted to economize to exactly what was needed, then yes you would need to have a few more changes in your P4 program to minimize the sizes of tables that needed a number of entries proportional to the number of ports. Agreed that such economizing would be important, and is not addressed in my small example. It is certainly possible to address it, but I haven't gone through the exercise of seeing how small such changes would be.

@vgurevich commented Jul 24, 2017

@jafingerhut

Just to be clear -- I am not criticizing your example. It is totally valid and applicable. We can certainly limit PSA to just port-based externs, but I am afraid it will be too strong of a restriction.

In fact, I believe that the ability to derive counter/meter/register indexes in arbitrary ways is one of the main strengths of P4. However, if we agree with that, we will probably have to agree that we won't be able to express the multi-pipeline effects in our programs, at least not using the currently suggested mechanisms. And if so, then what's the point of complicating PSA?

@cc10512 commented Jul 24, 2017

Joining the discussion a bit late.
While it is true that devices have multiple pipelines, I would keep the PSA (at least the initial version until we clarify all its defined semantics) to one pipe. If a target wants to map this to a multi-pipe design, it is the task of the backend writer to map it. Complicating it with multiple pipes at this stage seems to be an unproductive use of our limited bandwidth. As @jafingerhut and @vgurevich seem to already agree that we should point it out in the PSA spec(#353 (comment)), let's do just that and leave it to the target implementers to figure out how to map the single pipe.

With respect to @atomic: I believe we agreed that each extern method is atomic from the point of view of the P4 program. To get atomicity across multiple operations, we use atomic blocks. Given the sequential execution semantics, any program that first writes and then reads a register should get the value it has written, if no other write has happened by a different thread. Therefore we need to guarantee that if the instance is declared globally, it is available across the controls defined by the PSA (one ingress, one egress). Should a target implementer decide to scatter the controls over multiple pipelines, they should take care of ensuring access to the correct instance. I may be wrong, but is this too onerous a restriction, one that will make the PSA unusable?

@anirudhSK commented Jul 24, 2017

@cc10512

Yes, my concern with a single pipeline PSA is that it might become unusable. Concretely, it might be hard to build a high-speed and semantic-preserving implementation of the PSA. One way around this is to remove most/all stateful externs from the PSA. This way, the PSA would not mandate any stateful behavior, making it easier to comply with the PSA.

Here's an example of my concern. Let's assume a single pipeline PSA has a single scalar meter instance defined in the ingress pipeline, which meters all packets headed for one particular output port O. Semantically, this meter processes all packets belonging to all input ports.

Say we have packets P1, P2, ..., P10 arriving at times t1, t2, ... t10. Let's say that no other packets arrived between t1 and t10. Then, because each extern method is atomic, it must appear as though the single meter instance's meter() call was executed on the packets in some permutation of P1, P2, ..., P10 (because we don't mandate a specific thread interleaving).

I can think of two approaches to implementing a single-pipeline PSA. Depending on which approach we take, we either give up semantics or speed.

In the high-speed approach, let's say the 10 packets belonged to 10 different hardware pipelines in the implementation. Then, there would be 10 different local meters, each having processed a single packet. It's hard to even compare the spec and the implementation's semantics: the spec is a single pipeline with a single meter instance, which "splits" into 10 meter instances in the implementation.

In the semantic-preserving approach, the implementation uses a global lock around a single meter instance. This is correct, but loses performance.
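A minimal sketch of the semantic-preserving approach (hypothetical, using a lock-protected packet counter as a stand-in for the shared meter state): every "pipeline" thread funnels its execute() calls through one lock, so atomicity is preserved, but the lock serializes all pipelines and gives up the N-fold speedup.

```python
import threading

class GlobalMeter:
    """One shared instance; the lock makes each read-modify-write atomic."""
    def __init__(self):
        self.lock = threading.Lock()
        self.packets = 0

    def execute(self):
        with self.lock:  # appears atomic to every pipeline, at the cost of serialization
            self.packets += 1
            return self.packets

meter = GlobalMeter()
threads = [threading.Thread(target=lambda: [meter.execute() for _ in range(1000)])
           for _ in range(4)]  # 4 "pipelines" sharing one meter instance
for t in threads:
    t.start()
for t in threads:
    t.join()
assert meter.packets == 4000  # no updates lost; correctness bought with the lock
```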

I understand multiple pipelines may not be the most pressing concern right now given the limited bandwidth available. I am happy if we can restrict the PSA to include only stateless functionality for compliance, add some notes about stateful externs, and make stateful externs target-dependent functionality for now. We could fold stateful externs into the PSA at a later stage if desired.

@mbudiu-vmw commented Jul 24, 2017

Even for a single pipeline, ensuring that atomic works may be non-trivial, especially in a PISA-like architecture. You may need to insert pipeline bubbles or interlocking if you cannot do an atomic operation in a single stage.

@cc10512 commented Jul 24, 2017

@anirudhSK in your high speed example, you have not discussed how you are going to implement the reduction that counts all 10 packets and makes them available as part of one global instance. When you add that, it is not necessarily high speed anymore. Sharing state across different pipelines requires a commitment from the hardware to put in place mechanisms for truly distributed atomics across pipelines. Pragmatically, this is a lot more than we can chew right now. Might be good material for a PhD thesis :)

Realistically, even sharing state across ingress and egress may be a challenge for some hardware implementations. I am not as pessimistic as @mbudiu-vmw, in the sense that it is relatively easy to implement high-performance atomic read-modify-update operations for registers as long as we constrain their scope. The main issues occur when we expand that scope beyond the local pipeline. So my proposal is to leave expanding to multiple pipes for future work.

@jafingerhut commented Jul 25, 2017

@anirudhSK There is a long history of multi-line-card systems with separate switching ASICs on separate LCs, or even multiple switching ASICs on a single LC. I know of one effort inside of Cisco to try to implement 'distributed meters' by having a protocol that shared meter update state across ASICs, but it didn't work out so well, and that was even allowing for many meter updates to be compressed into single inter-ASIC messages.

By far the most common approach is simply: meters / counters / etc. are unique per ASIC, and you can't implement a global version of any of these things. Yes, it is frustrating, but it is based on physics and cost.

The motivation for implementing a single switch ASIC with multiple pipelines is the same -- you make one pipeline run at the maximum packet rate you know how to achieve (e.g. 1 packet per clock cycle), and when you want to go N times faster than that, you build N pipelines. They do not share state, because it is too costly to do so. As @cc10512 mentioned, it would be a research project to figure out how to share state in a cost-effective way. Until and unless someone finds such a way, it isn't practical.

My reason for bringing up this thread isn't to try to solve that research problem (I personally doubt it has a cost-effective solution, and besides, useful systems can be built without solving it).

My reason is that I would like it to be clear to P4 program authors what the consequences are when their P4 programs run on a multi-pipeline device, i.e. it is almost as if their P4 program were running on 2N different devices side-by-side with no communication between them, except for a high speed packet buffer connecting them.

@anirudhSK commented Jul 25, 2017

@cc10512, @jafingerhut: I agree with you. It's not the PSA's job to solve this problem of consistent data-plane state updates. If anything, it's the target implementer's problem. I don't want the target implementer to be left with these unsolvable problems because the single-pipeline PSA implies semantics for stateful externs that are hard to achieve (as both of you have pointed out). If we go with a single-pipeline PSA, I suggest that we keep all stateful externs out of the PSA and not require any compliance on the stateful side.

@cc10512 commented Jul 25, 2017

@anirudhSK: Isn't that a bit extreme? Stateful local externs are quite useful in their own right, and with an appropriate rigging on the control plane, you can possibly get a distributed implementation for some functions.

@anirudhSK commented Jul 25, 2017

@cc10512 By stateful local extern, I assume you mean something local to the ingress/egress pipeline, but not shared by both. With a single-pipeline PSA, I see no particular advantage to a stateful local extern. It might be local to the pipeline at the level of the P4 source, but if it's implemented underneath using multiple hardware pipelines, it's shared/global in reality.

The only stateful local extern for which I can see a distributed implementation is a counter, and that too assuming the only extern method call it supports is a count(), and not a read() or a set(). That's because the count() method only needs to eventually have the correct value, because the latest value of the counter is never read in the data plane. If the latest value is read() or set() in the data plane, I don't see a way to implement all three extern method calls (count, read, and set) atomically if the target was a multi-pipeline switch.

Maybe I am worrying too much about the target implementation's semantics matching the PSA's, but I fear that it's a slippery slope, where soon the PSA's semantics will diverge in arbitrary ways from the target. To me, the cleaner solution with a single-pipeline PSA is to leave stateful externs out of the PSA, or include just a restricted form of counters, where the data plane can only call count() on it.
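The reason count() is the one call that survives a distributed implementation can be shown in a few lines of hypothetical Python: increments are commutative and associative, so pipeline-local partial counts can be merged (summed) by the control plane in any order and still yield the right total, whereas a data-plane read() or set() would need the merged value at packet time, which the per-pipe copies cannot provide.

```python
from functools import reduce

per_pipe_counts = [120, 87, 54, 39]  # hypothetical partial counts from 4 pipes

# Any merge order gives the same answer -- the property that makes the
# distributed count()-only implementation "eventually correct".
total_a = reduce(lambda a, b: a + b, per_pipe_counts)
total_b = reduce(lambda a, b: a + b, reversed(per_pipe_counts))
assert total_a == total_b == 300
```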

@jafingerhut commented Jul 25, 2017

@anirudhSK The current proposal is that the data plane can only call count() on counters.

I personally think that meters and registers are such useful things, even in a multi-pipeline device where the state is local to each ingress pipeline, and local to each egress pipeline, that they are worth including in the PSA. I will write up a draft of that appendix mentioned earlier in this discussion, and you can read over it to see if it helps address your concerns. I plan for it primarily to be a warning about issues of how P4 programs can be non-portable if they do not use those stateful features in a way corresponding to the nature of the multi-pipeline device.

For example, if you have port-based multi-pipeline device, it is perfectly correct to do ingress-port based metering in the ingress pipeline, and egress-port based metering in the egress pipeline. Similarly for other stateful features where you know that state X is accessed only by packets from a given ingress port, or to a given egress port. That covers an interesting set of features.

@jafingerhut jafingerhut self-assigned this Jul 25, 2017

@anirudhSK commented Jul 25, 2017

@jafingerhut OK. If the goal is to keep meters and registers in the PSA because of their widespread utility, I think your warning in the appendix about the non-portability of stateful features makes a lot of sense. I am fine with this approach. I would just add that some natural looking code fragments that need an extern accessible from multiple pipelines will be rejected by the compiler.

On a related note, the markdown version of the PSA (https://github.com/p4lang/p4-spec/blob/master/p4-16/psa/PSA.mdk) does not seem to formally define what it means for a target to {comply with/provide/implement} the PSA. I think this is important if we want target implementers to provide the same guarantees as the PSA.

Here's one definition I can think of, which might be useful to add to the PSA document. I am defining PSA compliance as a target's ability to compile a reasonable subset of P4 programs that are written against the interface provided by the top-level package in the PSA (https://github.com/jafingerhut/p4-guide/blob/master/multipipeline/psa-1pipeline.p4#L382). These programs written against the PSA's top-level interface are allowed to use a subset of the extern and non-extern data types available in the PSA without supplying an implementation for them.

The word reasonable here is important, because it's up to a target to define what is reasonable, given its performance goals. For instance, a PSA-compliant software router might have none of the constraints discussed here.

If the paragraph above is an agreeable definition of PSA compliance, then we could add multi-pipeline stateful externs as just another subset of programs that can be rejected by a PSA-compliant compiler. Here are some program classes that such a compiler would be free to reject:

  1. Stateful externs that need to be accessed by multiple hardware pipelines (Andy's warning from #353 (comment))
  2. Large atomic blocks (Mihai's concern from #353 (comment))
  3. Large tables that do not fit within the target's memory (the canonical example of why a P4 compiler rejects programs)

This way, the restrictions on stateful externs just become part of the subset of programs that the compiler is free to reject. This is a characteristic of P4 regardless of the target architecture, and not something particularly anomalous and unique to PSA-compliant target architectures.

@jafingerhut commented Jul 25, 2017

What it means for an implementation to be PSA compliant (and, as a result, what kinds of P4 programs are expected to be portable across PSA compliant implementations) is the subject of a recently written pull request. It should definitely be reviewed by lots of folks and refined before it becomes part of the PSA. That PR is here: #357

@anirudhSK commented Jul 26, 2017

PR #357 is exactly the kind of compliance definition I was looking for. We could change the wording of meter and register updates (https://github.com/jafingerhut/p4-spec-1/blob/70f4f0508983be8f24646c4e6fd9a3a47bafa9c0/p4-16/psa/PSA.mdk#L685 and https://github.com/jafingerhut/p4-spec-1/blob/70f4f0508983be8f24646c4e6fd9a3a47bafa9c0/p4-16/psa/PSA.mdk#L687) to only require compliance for meter/register updates that are pipeline-local (such as ingress-port based metering in the ingress pipeline).

@cc10512 commented Oct 20, 2017

@jafingerhut I think we need to bring up for discussion the issue of what do we say about mapping PSA to multiple pipelines. I think we have a lot of good discussion here, but to close we need to decide whether we go with your second option proposed in #353 (comment)

@jafingerhut commented Nov 12, 2017

Current status of this issue -- there are a few brief mentions of multiple-pipeline devices in the current draft PSA, as part of the rationale for the restrictions on where PSA externs can be instantiated (i.e. a PSA implementation is not required to support instantiating externs at the top level).

A more full discussion of this topic seems appropriate for an appendix of the PSA, but I have not written one yet, nor has anyone else that I am aware of.

@jafingerhut commented Jan 29, 2018

OK, I have created a PR with a proposed new appendix in the PSA spec that addresses this issue in more detail. You can find it here: #564

@jafingerhut commented Feb 1, 2018

Closing this issue as the proposed PR has now been merged in.

@jafingerhut jafingerhut closed this Feb 1, 2018
