Computer architecture for network engineers #18

Open
lukego opened this Issue Apr 12, 2017 · 17 comments


lukego commented Apr 12, 2017

This is a rough idea for a talk/tutorial. Critique welcome :).

Suppose you are a network engineer and you want to understand how modern x86 CPUs work under the hood. Cache-misses, out-of-order execution, pipelined execution, etc. One approach is to read a big heavy book like Hennessy and Patterson. However, there is also a short-cut.

CPUs are basically networks these days (#15) and their mechanisms all have direct analogues in TCP. In fact, if you have spent time troubleshooting TCP performance problems in wireshark, it's entirely likely that you have a more visceral intuition for CPU performance than most software people do.

Here is why CPUs are basically equivalent to TCP senders:

| TCP | CPU |
| --- | --- |
| TCP sends a stream of packets. | CPU issues a stream of instructions. |
| TCP packets are eventually acknowledged. | CPU instructions are eventually retired. |
| TCP sends multiple packets in series, without waiting for the first to be acknowledged, up to the window size. | CPU issues multiple instructions in series, without waiting for the first to be retired, up to the reservation station size. |
| TCP packets that are "in flight" all make progress towards their destination at the same time. | CPU instructions that are in flight all make progress towards completion at the same time in a pipelined architecture. |
| TCP incurs packet loss when a packet reaches an overloaded router. The main consequence of a packet loss is more latency between initial transmission and ultimate acknowledgement. (There are also a lot of complex state transitions.) | CPU incurs cache misses when instructions refer to memory addresses that are not cached. The main consequence of a cache miss is more latency between the initial issue of an instruction and its ultimate retirement. |
| The impact of a packet loss depends on the workload. Losing certain packets can cripple performance, for example a control packet like a TCP SYN or an HTTP GET, while losing certain other packets won't have a noticeable impact at all, like the 900th packet in an FTP transfer. The key is whether TCP can "keep the pipe full" with other data while it waits to recover the lost packet. | The impact of a cache miss depends on the workload. Certain cache misses can cripple performance, for example when fetching the next instruction to execute or chasing a long chain of pointer dereferences, while certain cache misses won't have a noticeable impact at all, like a long series of pipelined memory accesses that all go out to RAM in parallel. |
| TCP can use Selective ACK to work around hazards like packet loss and continue sending new packets beyond the slow one without waiting for it to be recovered and ACKed first. | CPU can use out-of-order execution to work around hazards like cache misses and continue executing new instructions beyond the slow one without waiting for it to be completed and retired first. |
| TCP can run multiple connections on the same link. This does not directly increase bandwidth, because they share the same network resources, but it does improve robustness. If one connection is blocked by a hazard, such as a packet loss, the other can still make progress, so the link is less likely to become idle (which would waste bandwidth). | CPU can run multiple hyperthreads on the same core. This does not directly increase performance, because they share the same computing resources, but it does improve robustness. If one hyperthread is blocked by a hazard, such as a cache miss, the other can still make progress, so the core is less likely to become idle (which would waste execution cycles). |

What do you think?

Have an idea for good analogs of branch prediction and dispatching instructions across multiple execution units?
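(The windowing rows of the analogy can be sketched with a toy discrete-time model. This is a hypothetical illustration, not a model of any real NIC or CPU: operations issue one per tick, each takes `latency` ticks to complete, and at most `window` may be in flight at once. The same function describes a TCP sender's window and a CPU's in-flight instructions.)

```python
def finish_time(n_ops, latency, window):
    """Ticks until n_ops operations all complete, issuing at most one
    per tick, each taking `latency` ticks, with at most `window` in
    flight at once. Purely illustrative numbers."""
    in_flight = []              # completion times of unretired operations
    t = 0
    issued = 0
    while issued < n_ops:
        # Retire (ACK) everything that has completed by now.
        in_flight = [c for c in in_flight if c > t]
        if len(in_flight) < window:
            in_flight.append(t + latency)   # issue (transmit) one more
            issued += 1
        t += 1
    return max(in_flight)

# window=1 is stop-and-wait: every operation pays full latency.
print(finish_time(10, latency=10, window=1))    # → 100
# A window as deep as the latency hides it almost entirely.
print(finish_time(10, latency=10, window=10))   # → 19
```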

@lukego lukego added the tech label Apr 12, 2017

hirenp commented Apr 12, 2017

Very nice write-up!
AFAIK, branch prediction and parallelism are not things TCP itself can do at the protocol level. QUIC (borrowed from HTTP/2?) does have multiple streams underneath a single connection (over UDP), which may provide what you are looking for wrt issuing multiple work requests. But if you are looking for a 'single source serving multiple targets', traditional sockets and the way we do 'multiple clients connecting to a single server' may provide that.
(from someone who knows very little about CPUs but does know a little bit about protocols. :-))

pkazmier commented Apr 13, 2017

As a network engineer I appreciate the analogy! Now I know more about computer architecture than I did moments ago.

mgersh commented Apr 13, 2017

This is also at a higher layer than TCP, but I can see an analogy between branch prediction and HTTP/2 server push. Without branch prediction/server push, the CPU/client has to wait for the first instruction/request to finish before it knows what to execute/request next. With it, the CPU/client already has the next instruction/response when it needs it.

njsmith commented Apr 13, 2017

Multiple execution engines are somewhat similar to multipath routing or (at a smaller scale) multiple rx/tx queues in network cards?

lukego commented Apr 13, 2017

I added hyperthreads which I think is a good fit.

Interesting that people are commenting and also finding analogies to more protocols than just TCP :).

I also reckon there is an analogy between packet size and micro-ops. Some packets are bigger than others (e.g. jumbo frames), and require more time to serialize over the link, while some instructions are also bigger than others (e.g. rep movsb memory copy), and require more cycles to execute all of their micro-ops.

On execution ports I am tempted to draw an analogy to the TCP congestion window (cwnd). Haswell has eight execution ports, so it can actively compute eight instructions (uops) at the same time, and so we can say it has a hardcoded cwnd=8. However, the original formulation above is a bit lacking by not mentioning the cwnd and making a distinction between packets/instructions that are actively progressing vs ones that are just waiting in the transmit-queue/reservation-station.

Plenty to think about :).
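(The cwnd=8 idea above can be sketched too. In this hypothetical model, a reservation station may hold many waiting uops, but only `ports` of them can actually execute in any given cycle; the rest queue, just as packets beyond cwnd sit in the transmit queue. Uops are assumed single-cycle purely for illustration.)

```python
def drain_cycles(queued_uops, ports=8):
    """Cycles needed to execute queued_uops when at most `ports`
    can execute per cycle (each uop assumed single-cycle)."""
    cycles = 0
    while queued_uops > 0:
        queued_uops -= min(ports, queued_uops)  # dispatch up to `ports`
        cycles += 1
    return cycles

# 24 queued uops through 8 "ports" (a hardcoded cwnd=8) take 3 cycles.
print(drain_cycles(24, ports=8))    # → 3
```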

anastop commented Apr 13, 2017

Nice write-up!

Is there any "thing" like a Re-Order Buffer in TCP? Perhaps TCP sequence numbers?

lukego commented Apr 13, 2017

> Is there any "thing" like a Re-Order Buffer in TCP?

Yep!

The TCP packet transmission queue seems to be like a Re-Order Buffer. TCP always frees packet buffers in FIFO sequential order, even if the packets may be delivered out of order due to loss/retransmission or reordering along the path. Modern TCPs will track a bunch of state for each packet in the queue e.g. do we believe that it is currently in flight / do we believe it's delivered because we got a preliminary Selective Acknowledgement / do we believe it's lost due to packets later in the sequence being acknowledged first.

(One major theme in TCP is needing to maintain heuristic estimates of which packets are in-flight/delivered/lost to accelerate decisions about when to transmit/retransmit a new packet. Sending too much wastes bandwidth by sending data that does not benefit the recipient e.g. because it is duplicate or because it is dropped at a bottleneck en route. Sending too little data wastes bandwidth due to idleness or causes starvation when competing with more aggressive TCP implementations on other hosts. This requires a bunch of state and clever algorithms that may well have direct parallels deeper in the CPU.)

Have to think about Reservation Station / Re-Order Buffer distinction a bit more carefully.
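(The in-order-free property described above can be sketched as a tiny buffer. This is a hypothetical illustration: entries may complete out of order, as with SACKed packets, but buffers are only freed from the head in FIFO order, like a re-order buffer retiring instructions.)

```python
class ReorderBuffer:
    """Entries complete in any order but are freed strictly in FIFO order."""

    def __init__(self):
        self.entries = []               # [seq, completed?] in issue order

    def issue(self, seq):
        self.entries.append([seq, False])

    def complete(self, seq):            # may arrive out of order (SACK-like)
        for entry in self.entries:
            if entry[0] == seq:
                entry[1] = True

    def retire(self):
        """Free entries from the head only while they are completed."""
        freed = []
        while self.entries and self.entries[0][1]:
            freed.append(self.entries.pop(0)[0])
        return freed

rob = ReorderBuffer()
for seq in (1, 2, 3):
    rob.issue(seq)
rob.complete(2)                         # out-of-order completion
print(rob.retire())                     # → [] (head not done, nothing frees)
rob.complete(1)
print(rob.retire())                     # → [1, 2] (head drains in order)
```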

lukego commented Apr 13, 2017

@hirenp I wonder if this whole model could be expressed in terms of HTTP instead of TCP too :).

> Have to think about Reservation Station / Re-Order Buffer distinction a bit more carefully.

Yes, it's kind of tricky. And besides its re-ordering role, the ROB's capacity is also the factor that limits how far ahead the processor can seek for instructions to fetch.

hirenp commented Apr 13, 2017

@lukego Look at QUIC. It has some really nifty features. That will probably be a better fit but I am not sure as I don't know much about CPUs. :-)

lukego commented Jun 21, 2017

Here is a dramatic conclusion from an excellent recent paper called Array layouts for comparison based searching by @pkhuong and Pat Morin:

> Our conclusion is contrary to our expectations and to previous findings and goes against conventional wisdom, which states that accesses to RAM are slow, and should be minimized. A more accurate statement, that accounts for our findings, is that accesses to RAM have high latency and this latency needs to be mitigated.

This makes intuitive sense if you think like a network engineer: your connection to RAM is a "long fat pipe."
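(Little's law makes the "long fat pipe" point quantitative for both columns of the analogy: to sustain a given throughput over a link or pipeline with a given latency, you need throughput × latency operations in flight. The figures below are illustrative examples, not measurements.)

```python
def required_window(throughput, latency):
    """Little's law: operations that must be in flight to sustain
    `throughput` (ops per time unit) when each takes `latency` time units."""
    return throughput * latency

# TCP: 10 Gbit/s (1.25e9 bytes/s) at a 100 ms RTT needs a 125 MB window.
print(required_window(1.25e9, 0.1))     # → 125000000.0
# CPU: retiring 4 instructions/cycle against a ~100-cycle trip to RAM
# needs on the order of 400 instructions in flight.
print(required_window(4, 100))          # → 400
```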

Olsonist commented Jun 29, 2017

If you are going to recommend Hennessy and Patterson then you should recommend Computer Organization and Design rather than Computer Architecture. Otherwise that would be like recommending Radia Perlman before TCP/IP Network Administration.

tarndt commented Jun 30, 2017

Nice! This is helpful for those more versed in computer architecture than TCP as well.

One nit: Hyperthreading is just Intel marketing speak for SMT (Simultaneous Multithreading), so hardware multithreading would be the actual jargon term you mean.

helloweishi commented Jun 30, 2017

From a protocol perspective, I guess you would want to compare network protocols with cache-coherency protocols; they have many similarities. A CPU focuses on execution and a network focuses on transmission, so there is not much to compare there.

ejames17 commented Jun 30, 2017

Great read, let's have a beer! It's the weekend!

melatonin355 commented Jul 1, 2017

Awesome work!

packerliu commented Dec 5, 2017

Very interesting analogy! Should we expand TCP/CPU to UDP/GPU or the like?
