
[design] Concurrency model for P4 #48

Closed
anirudhSK opened this issue Sep 10, 2016 · 23 comments

Comments

@anirudhSK
Contributor

commented Sep 10, 2016

P4 lacks a formal concurrency model. I can see at least two scenarios that demand such a model.

  1. Interactions between the control and data plane: What happens when the controller changes a table entry while a packet is being processed in the data plane?
  2. Interactions between multiple packet processors in the data plane: These packet processors could be match-action tables, pipelines, or cores. Further, these packet processors can share state. As an example, consider flowlet switching from the SIGCOMM 2015 P4 tutorial. The state stored in register flowlet_id is shared by two tables:
    • It is read in the action lookup_flowlet_map in the flowlet table
    • It is later written in the action update_flowlet_id in the new_flowlet table.

P4 doesn't clarify behavior for these scenarios. For instance, in scenario 1, is the table entry guaranteed to be either the old or the new entry, but not some muddled combination? In scenario 2, is the value of flowlet_id read by one packet in lookup_flowlet_map guaranteed to be the value of flowlet_id written by the previous packet in update_flowlet_id?
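
For concreteness, here is a minimal P4-16-style sketch of that flowlet example (the Register signature and the ingress_metadata fields are assumptions for illustration, not the tutorial's actual code):

extern Register<T> {
  Register(bit<32> size);
  void read(out T result, in bit<32> index);
  void write(in bit<32> index, in T value);
}

Register<bit<16>>(8192) flowlet_id;   // state shared by the two tables

action lookup_flowlet_map() {
  // Packet A reads the current flowlet id for its flow ...
  flowlet_id.read(ingress_metadata.flowlet_id, ingress_metadata.flowlet_map_index);
}

action update_flowlet_id() {
  // ... while packet B may concurrently write a new id at the same index.
  flowlet_id.write(ingress_metadata.flowlet_map_index, ingress_metadata.new_flowlet_id);
}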

I think we need a concurrency model for such scenarios. One conservative start is to forbid any state sharing and further guarantee that any state updated by a packet processor (table, pipeline, or core) is visible to the next packet, i.e., actions within a packet processor are atomic.

But to span networking devices that permit shared state, a more expansive model might be required: for instance, we could make an entire control flow block "atomic". Semantically, this atomic control flow block would process exactly one packet at a time. A compiler would then generate a pipelined implementation guaranteeing these semantics that processes multiple packets concurrently.

@gbrebner

Contributor

commented Sep 12, 2016

Pasting in what the draft spec currently says; it summarizes some prior email discussion that wasn't captured as a GitHub issue.

18.3.1 Concurrency model
[TODO: is this concurrency model suitable?]
In practice a network device may be processing multiple packets simultaneously:
• Packets may be received concurrently on different network interfaces
• Packet processing may be pipelined, with a new packet starting before the completion of the previous one
As long as the packet processing involves stateless elements and read-only state elements, there should be no difference in the results obtained from concurrent or purely sequential execution.
Since tables are read-only from the data-plane point of view, we can provide a very simple semantics for P4 programs written solely in the P4 core language: they should behave identically irrespective of the concurrent execution.
However, as soon as one is using any stateful extern constructs, the question arises with respect to the semantics of the program under concurrent execution. For example, given a set of counters that can be accessed by multiple actions, what is the interleaving of the execution of the counter methods when processing multiple packets? What is the interleaving of method invocations if the counter is accessed from different blocks (e.g., ingress and egress pipelines)?
The answer to this question is left partially to the discretion of the target architecture. An architecture could:
• Prescribe specific order
• Forbid resources that are shared between multiple blocks (e.g., each counter must be allocated in one pipeline exclusively, and it must be used only from actions that can appear within one single table)
• Prescribe an implementation-specific order
We suggest the following minimum constraints on any P4 implementation:
• The invocation of a table is atomic
• The execution of a parser is atomic
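
For illustration (not part of the quoted draft text), the counter question above could be made concrete with the following P4-16-style sketch, where the Counter signature, the controls, and standard_metadata are assumptions:

extern Counter {
  Counter(bit<32> size);
  void count(in bit<32> index);
}

Counter(256) port_pkts;   // one instance, visible to both blocks

control MyIngress() {
  apply { port_pkts.count((bit<32>)standard_metadata.ingress_port); }
}

control MyEgress() {
  apply { port_pkts.count((bit<32>)standard_metadata.egress_port); }
}

The open question is exactly which interleaving of the two count() calls is observed when several packets are in flight.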

@chkim4142

Contributor

commented Sep 26, 2016

Adding this to Chang's bucket.

@chkim4142 self-assigned this Sep 26, 2016

@anirudhSK

Contributor Author

commented Oct 5, 2016

draft.pdf
This is a preliminary draft of a proposal for atomics in P4. The Latex source is here: https://github.com/anirudhSK/p4-concurrency

This draft contains motivation, examples, and the concurrency model. It's too verbose to go directly into the spec, but should hopefully explain what we have in mind.

@chkim4142 added this to the P4_16 milestone Oct 6, 2016

@chkim4142

Contributor

commented Oct 6, 2016

This seems to deserve a discussion. Although I've just assigned the "P4_16" milestone to this issue, it could be considered "post P4_16" as well.

@gbrebner

Contributor

commented Oct 6, 2016

Anirudh, thanks for the thoughtful document; this is an important topic.

We've been encountering issues with unthinking concurrency in various real P4 examples, basically where people have written their "natural" algorithm while unconsciously imagining that each packet is handled completely before the next one. That is certainly not the case in our FPGA implementation, where there are actually three levels of pipelining, with multiple packets in flight at different pipeline stages.

I think there is a common issue behind your two suggested proposals: the target architecture has to supply some atomic operation combinations. For the first proposal (use registers), these are what the smarter compiler has to map to; for the second (more complex extern types), these are the types themselves. Really, the only difference is whether these combined operations are exposed to the P4 user. The question is how the target can supply some (small) set of widely useful operation combinations. One case that we have found recurring is a general read-update-write register operation, and it appears in your examples too. Maybe a natural set will emerge with more examples.

Another thing you identify is the extent of atomic blocks. This is related to the previous point, of course, since it's reflected in how generous the target can be in terms of atomicity. One case we have found is where a register is essentially being used as a working variable, accessed from relatively distant parts of the P4 program. A general solution has been to rewrite the register as metadata travelling with the packet. Another issue for heavily pipelined implementations is that too-generous atomicity extents can limit pipelining effectiveness.
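
As a hedged sketch of that rewrite (reusing the assumed Register read/write API from the earlier sketch and a hypothetical metadata struct meta):

// Before: a register used as a per-packet working variable. The state is
// shared, so packets in flight race on index 0 unless the span is atomic.
Register<bit<32>>(1) scratch;
action stash_value_reg(bit<32> v) { scratch.write(0, v); }
action reuse_value_reg()          { scratch.read(meta.tmp, 0); }

// After: the value travels with the packet as (thread-local) metadata,
// so no atomicity is needed and pipelining is unconstrained.
action stash_value_meta(bit<32> v) { meta.tmp = v; }
// ... later stages simply read meta.tmp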

At this stage, I don't have a comprehensive solution in mind, since there haven't been enough use cases yet. Being explicit with @atomic annotations might help people think about what they're writing in their programs, rather than, in effect, unconsciously placing such an annotation around a whole control block, for example.

@anirudhSK

Contributor Author

commented Oct 6, 2016

Gordon,

Thanks for these comments; they help illustrate the problem in at least one more context: an FPGA substrate.

  1. Like you said, I think we'll need more examples to be certain. For what it's worth, my sense based on the examples we used in the Domino work (http://dl.acm.org/citation.cfm?doid=2934872.2934900) is that @atomic should suffice for all of them because it corresponds directly to the packet transactions abstraction used there. I also think we can implement a compiler pass (Section 4.2 of http://dl.acm.org/citation.cfm?doid=2934872.2934900 has details) that can decompose a user-supplied @atomic into minimum-size @atomic blocks. This makes use of (among other things) the trick you just mentioned of reading a register into a metadata field for subsequent use.
  2. We also need to solve the hardware-centric problem of specifying what the atomic instructions even are. This is substrate-specific and different hardware atomic instructions might have different performance characteristics (as measured in packet processing rate). I think part 2 of the compiler, which would reside in an FPGA or ASIC backend, would take the minimal-size atomic blocks from part 1 and generate atomic instructions for them if possible. On an FPGA target, a more generous block will run slower; on an ASIC, it may not run at all. Either way, the code generator should catch this.

I am happy to take a stab at implementing part 1 in the P4-16 compiler, while part 2 would reside in a vendor's backend. This has the added benefit of allowing the vendor to keep their atomic instructions closed and hidden within the backend.

Anirudh

@gbrebner

Contributor

commented Oct 6, 2016

I think that the stateful extern that should be discussed and evolved first is register. There's been some discussion on #73 about whether P4 is drifting in a general-purpose direction. While I'm comfortable with the issues under discussion there, I think that register is actually the biggest danger, since it appears as a very generic stateful artifact, yet it conflicts with the overall P4 model, and especially its concurrency. So developing register further, whether through constraints or through defining more atomic operations, would be a wise initial focus for atomic concurrency.

@anirudhSK

Contributor Author

commented Oct 6, 2016

Yes, #73 is very pertinent here. I think it's useful for programmers to have a cost model of the hardware. At the same time, I agree that this is an implementation/target concern---not something to be mandated by the language.

For instance, I can imagine putting a few conservative checks into a target's compiler that limit the extent of an atomic block. You could measure "extent" either by counting the number of statements within an @atomic or by turning the @atomic into a DAG of primitive instructions and measuring its depth.

Such conservative checks may be useful for many of the problems @chkim4142 points out in #73, such as arbitrarily complicated action-body expressions, action-body statements, and control-block statements.

@mbudiu-vmw, @ChrisDodd : How difficult is it to implement such checks in the P4-16 compiler?

@mbudiu-vmw

Contributor

commented Oct 6, 2016

Most of these questions belong to the implementation, not to the spec.
@anirudhSK's own work has shown that just counting statements or depth is not enough; consider, e.g., CoDel and sqrt, or the use of special hardware widgets (think multiply-add).

@anirudhSK

Contributor Author

commented Oct 6, 2016

I can imagine a stateless extern capturing the more exotic hardware ops (sqrt, multiply-add). These then show up as method calls within an atomic block, as opposed to expressions. We might have to "count" such method calls differently from other primitive expressions in the @atomic block, and this might only be doable with an intimate knowledge of the target.
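
As a sketch (the extern function and all names here are invented for illustration, reusing the assumed Register API from earlier; the fragment would sit inside an apply block):

// Hypothetical target-supplied stateless extern for an exotic ALU op.
extern bit<32> approx_sqrt(in bit<32> x);

// Inside an atomic read-modify-write, the backend would have to cost the
// extern call differently from ordinary expressions.
@atomic {
  r.read(meta.val, idx);
  meta.val = approx_sqrt(meta.val) + 1;
  r.write(idx, meta.val);
}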

Even if it's target specific, I think it's useful to think through a compiler implementation pathway for @atomic. That includes both compiling @atomics and reporting sane diagnostics when rejecting them, which the programmer can then use to modify their code.

@anirudhSK

Contributor Author

commented Oct 7, 2016

Here's a first cut at a specification of atomics written into the concurrency model of the P4-16 draft: #80. It provides the language construct, some suggested compiler implementations, and notes on supporting reasonable diagnostics. Grateful for any feedback.

@mbudiu-vmw

Contributor

commented Oct 14, 2016

I have simplified @anirudhSK's text. Here is the text I am proposing to use to replace Sections 18.3 and 18.4. If you like this text I will do the replacement in the spec.

1.1 Dynamic evaluation
The dynamic evaluation of a P4 program is orchestrated by the target model. Each target model needs to specify the order and the conditions under which the various P4 component programs are dynamically executed. For example, in the Simple Switch the execution flow goes Parser -> Pipe -> Deparser.
Once a P4 execution block is invoked, its execution proceeds until termination according to the semantics defined in this document (the various abstract machines).
1.1.1 Concurrency model
A typical packet processing system needs to execute multiple simultaneous logical “threads”: at the very least, there is a thread executing the control plane, which can modify the contents of the tables. The data plane can exchange information with the control plane through extern method calls. Moreover, high-throughput packet processing systems may be processing multiple packets simultaneously, e.g., in a pipelined fashion, or concurrently parsing a first packet while performing match-action operations on a second packet. This section specifies the semantics of P4 programs with respect to such concurrent executions.
Each top-level parser or control block is executed as a separate thread when invoked by the target architecture. All the parameters of the block and all local variables are thread-local: i.e., each thread has a private copy of these resources. This applies to the packet_in and packet_out parameters of parsers and deparsers.
As long as a P4 block uses only thread-local storage (e.g., metadata, packet headers, or local variables), its behavior in the presence of concurrency is identical to its behavior in isolation, since any interleaving of statements from different threads must produce the same output.
In contrast, extern blocks instantiated by a P4 program are global, shared across all threads. If extern blocks mediate access to state (e.g., counters, registers), i.e., their methods read and write state, then these stateful operations are subject to data races. P4 mandates the following behaviors:
• Execution of an action is atomic, i.e., the other threads can “see” the state as it is either before the start of the action or after the completion of the action.
• Execution of a method call on an extern instance is atomic.
To allow users to express atomic execution of larger code blocks, P4 provides an @atomic annotation, which can be applied to block statements, parser states, control blocks or whole parsers.
Consider the following example:

extern Register { ... }
control ingress() {
  Register() r;
  table flowlet { /* reads the state of r in an action */ }
  table new_flowlet { /* writes the state of r in an action */ }
  apply {
    @atomic {
       flowlet.apply();
       if (ingress_metadata.flow_ipg > FLOWLET_INACTIVE_TIMEOUT) 
          new_flowlet.apply();
    }
  }
}

This program accesses an extern object r of type Register in actions invoked from tables flowlet (reading) and new_flowlet (writing). Without the @atomic annotation these two operations would not execute atomically: a second packet might read the state of r before the first packet has had a chance to update it.
A compiler backend must reject a program containing @atomic blocks if it cannot implement the atomic execution of the instruction sequence. In such cases, the compiler should provide reasonable diagnostics.

@gbrebner

Contributor

commented Oct 14, 2016

This looks reasonable to me, and clarifies an important aspect of P4 execution. As a small detail, it looks like this would just replace 18.4 ("Dynamic evaluation"), and not both 18.3 and 18.4 as you say.

In practical terms, for most targets, as Anirudh identified in his original proposal, @atomic blocks will have to be relatively short and local, or concurrency benefits start getting lost. Extreme uses, like putting @atomic around large components, will condemn systems to largely handling each packet to completion before taking the next packet. (Unless compilers are smart enough to discover that the user's broad-range @atomic block is in fact unnecessary and the actual concurrency dangers are much more local, or maybe non-existent.)

@mbudiu-vmw

Contributor

commented Oct 15, 2016

The reason it makes sense to label a whole control is that you could write control modules in a library which have to behave atomically; after inlining, these turn into blocks. Similarly, annotating a whole parser is the only way to make a multi-state parser code fragment atomic.
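
A sketch of the parser case (hypothetical names; the point is only that the annotation sits on the whole parser and therefore covers all of its states):

@atomic
parser TagParser(packet_in pkt, out headers_t hdr) {
  state start { pkt.extract(hdr.outer_tag); transition inner; }
  state inner { pkt.extract(hdr.inner_tag); transition accept; }
}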

@anirudhSK

Contributor Author

commented Oct 15, 2016

@mbudiu-vmw, thanks for writing this up. I think it's reasonable overall, but here are a few comments, which might clarify some aspects that confused me.

  1. "very least there is a thread executing the control plane, which can modify the contents of the tables." Personally, I would not bring up the control plane here because it is not written in P4, making it hard to specify its behavior in any way. But if we do bring it up, maybe we should specify some expected behavior, like guaranteeing that the match-action table has either the old rules or the new ones but not a strange mix.
  2. "The data plane can exchange information with the control plane through extern method calls." While this is true (I think you are referring to learning filters), this isn't the primary use case for externs as I understand it. For instance, registers, counters, and meters are externs that maintain data plane state and have nothing to do with the controller except for one-time configuration. I think something like "The data plane can store and manipulate state on a per-packet basis through extern method calls, e.g., registers and counters" would be a better way to introduce externs in this section.
  3. "concurrently parsing a first packet while performing match-action operations on a second packet." This is true, but not the focus of the concurrency model in this section. This section is discussing concurrency within a block (parser or control), not across different blocks as mandated by the target model, which is beyond P4.
  4. "In contrast, extern blocks instantiated by a P4 program are global". If #81 is adopted, then we should say "extern blocks are global by default" and include local externs as examples of thread-local storage.
  5. "is executed as a separate thread when invoked by the target architecture.". I would try and give examples of when it is invoked, e.g., packet arrival from the wire or a parsed packet arriving from another P4 program within the target model.
@anirudhSK

Contributor Author

commented Oct 15, 2016

@gbrebner:

Generous atomic extents are a problem with a simplistic compiler, and rejecting really large atomic blocks is the right place to start. That said, your final parenthetical remark "Unless compilers are smart enough ..." is the direction I hope P4 compilers will go towards in the future :)

@mbudiu-vmw

Contributor

commented Oct 16, 2016

Answers to @anirudhSK

  1. We have to mention the control plane in some way. I don't think we can promise atomic control-plane operations: we don't know in this spec what these operations are, and on many targets it may be impossible to respect this promise.
  2. Actually, counters are there exactly to be read and probably reset by the control plane. In this document we are not making any assumptions about what the various externs do or how they interact with the control plane, so we have to assume worst-case behavior.
  3. Actually we have to discuss concurrency between different P4 blocks too; the spec allows you to instantiate an extern block at top-level and pass references to it to multiple architectural blocks, e.g., both parsers and controls. Also, different extern blocks may communicate with each other through various hidden channels, e.g., the control-plane (consider a learning provider that communicates with a packet generator). @atomic must work even between different P4 architectural blocks.
  4. Given the current spec the user cannot construct thread-local externs, so I could not refer to them yet (only packet_in and packet_out are thread-local). If we adopt the proposal in issue #81 then this section should be amended as you describe. I have not attempted to address issue #81 yet, we should first discuss it, but I think the discussion first requires us to solve this issue.
  5. There is a short (and rather vague) example describing how possible invocations may occur in the previous sub-section, based on the Very Simple Switch (VSS) model. We can't be too specific here, because we don't know anything about the architecture. Take your two examples, "packet arrival from the wire or a parsed packet arriving from another P4 program within the target model": these don't even hold in the P4-14 spec model. After arrival on the wire, some hidden architectural block checks and removes the packet's Ethernet trailer checksum and may also drop the packet. Also, between parser and control there is some queueing, which could use the priorities computed by the parser. So P4 blocks in general are not invoked immediately one after another. The most precise description of how invocations occur is in the VSS architectural model description.
@anirudhSK

Contributor Author

commented Oct 16, 2016

@mbudiu-vmw: Ok with 2, 4, and 5. Comments on 1 and 3 below.

  1. I am ok with mentioning the control plane. Can the spec at least say that "the target architecture should provide some formal semantics about how the control and data planes interact"?

  3. I agree and understand your point now. I have one clarification, though. We could potentially manipulate the same extern instance from different P4 blocks. In this case, we want the method calls on that instance to appear atomic. But do we really need an @atomic spanning different P4 blocks, e.g., parser and control?

    I am concerned with externs that have "hidden channels". A simpler view would be to say that each extern instance is an independent entity with no hidden state that is shared between externs. I think this is equivalent to saying that operations on different extern instances commute. Would this be too strong?

@jnfoster

Contributor

commented Oct 16, 2016

Minor snark: Formal semantics is the reason that networking people find my papers unreadable :-) The "formal" in that phrase means having to do with "forms" or syntax, and few things about P4 have ever been fully specified via a formal semantics. I might just say "targets should specify [or perhaps just 'describe'] how the control and data planes interact."

@anirudhSK

Contributor Author

commented Oct 16, 2016

Fair enough :-) I like your wording much better.

@mbudiu-vmw

Contributor

commented Oct 18, 2016

I have added a line about point 1.

For 3: the atomic block does not span multiple P4 blocks; what happens is that the @atomic execution is visible as atomic everywhere. It is not only atomic in the control, it is also atomic for parsers and all other blocks in the P4 program: they can only see the state before or after the atomic block.

However, regarding 3, I think that we cannot do anything about externs that interact. All externs are visible to the control plane, and the control plane may have APIs to read and write state from an extern. So in principle all externs can communicate with each other through the control plane. The compiler front-end has to assume this.

We can perhaps add a series of annotations to give additional information to the front-end (and users). For instance, a @PrivateState annotation could indicate that an extern does not share state with other externs. This would imply that method calls on this extern can be reordered with respect to calls on different externs. But we may also do this in a later language revision.
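
For concreteness, such an annotation (purely hypothetical at this point) might sit on the extern declaration, here reusing the assumed Register signature from the earlier sketches:

// Hypothetical: promises that this extern type shares no hidden state with
// any other extern instance, so calls on distinct instances may be
// reordered by the compiler.
@PrivateState
extern Register<T> {
  Register(bit<32> size);
  void read(out T result, in bit<32> index);
  void write(in bit<32> index, in T value);
}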

@anirudhSK

Contributor Author

commented Oct 19, 2016

The @PrivateState annotation seems useful; it could even be on by default. That said, I agree we don't have to address that here and can consider it in a later language revision.

@chkim4142

Contributor

commented Oct 24, 2016

We agree with what's proposed. We might need to look into the BMv2 architecture and the compiler backend for BMv2 to see what's needed to realize this.
