# Breaking Pipelined Hardware AES on CW305 FPGA

What happens when you take a hardware AES implementation and pipeline it?

## Prerequisites
- [PA_HW_CW305_1-Attacking_AES_on_an_FPGA.ipynb](PA_HW_CW305_1-Attacking_AES_on_an_FPGA.ipynb) (run this first!)
- basic knowledge of Verilog
- basic knowledge of AES

## Requirements
- CW305 or CW312 A35 target
- CW-Lite, Pro, or Husky

## Background
Our ["normal" hardware AES implementation](https://github.com/newaetech/chipwhisperer/tree/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_verilog) performs one round of AES in one clock cycle; the 10 rounds of AES-128 are done in 11 consecutive clock cycles (the +1 will be covered later). This is pretty fast! It means that 128 bits are encrypted in 11 cycles. At 100 MHz, this is roughly 1.16 Gbps of encryption. For some applications -- think disk encryption, memory encryption, layer 2 or layer 3 encryption (MACsec/IPsec) -- this may not be fast enough.
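The throughput figure follows directly from the block size, the clock rate and the number of cycles per block. As a quick sanity check (the helper name `aes_throughput_bps` is made up for this illustration):

```python
def aes_throughput_bps(clock_hz, cycles_per_block, block_bits=128):
    """Throughput in bits/s when one block completes every `cycles_per_block` cycles."""
    return block_bits * clock_hz / cycles_per_block

# Round-per-cycle core from the text: 11 cycles per 128-bit block at 100 MHz.
iterative_bps = aes_throughput_bps(100e6, 11)
print(f"{iterative_bps / 1e9:.2f} Gbps")  # 1.16 Gbps
```

Changing `cycles_per_block` or `clock_hz` shows how far a given design is from the line rate an application needs.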

Some (but not all) [block cipher modes](https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation) of AES can be pipelined: the idea is to replicate the round encryption for each round. So in the case of AES-128, imagine having 10 round modules; let's label these RM1 through RM10. In the first clock cycle, RM1 performs the first round of encryption on the first AES block. On the second clock cycle, the output of RM1 is fed to RM2, which performs the second round of encryption on the first AES block, **while at the same time** RM1 begins encrypting the second AES block. Once the pipeline is filled, ten different AES plaintexts are at different stages of encryption, and a fully encrypted AES block is produced at each clock cycle. Relative to the 11-cycle design above, this boosts throughput by a factor of 11.

Pipelining trades off area and power for throughput: a factor of X increase in throughput can be obtained at the cost of increasing area and power by roughly X as well. A block cipher mode can be pipelined if the encryption or decryption of a block does not depend on the result of the previous encryption or decryption; examples include CTR, XTS, and GCM. An example of a mode which cannot be pipelined is CBC.
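To visualize how the pipeline fills and drains, here is a toy occupancy model. It is only a sketch: `pipeline_occupancy` is a hypothetical helper, and it assumes each block advances exactly one stage per clock cycle with no stalls.

```python
def pipeline_occupancy(num_blocks, num_stages=10):
    """Return, for each clock cycle, the block index held by each stage (None if idle)."""
    schedule = []
    total_cycles = num_blocks + num_stages - 1
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            block = cycle - stage  # block b reaches stage s on cycle b + s
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule

# Small example: 4 blocks through a 3-stage pipeline.
for cycle, row in enumerate(pipeline_occupancy(num_blocks=4, num_stages=3)):
    print(cycle, row)
```

With 4 blocks and 3 stages, every stage is busy only on cycles 2 and 3; the same ramp-up and ramp-down appears, at a larger scale, in a ten-stage pipeline.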

## The Target
[The target(s)](https://github.com/newaetech/chipwhisperer/tree/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined) for this demo were built quite straightforwardly by taking our ["normal" hardware AES implementation](https://github.com/newaetech/chipwhisperer/tree/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_verilog), instantiating multiple rounds, and tweaking the control logic to feed the rounds. The goal was to minimize modifications to the original design; this lets us quantify the effect of pipelining on side-channel attacks. We'll study 4 pipelined variants: one fully pipelined implementation, and one "half-pipelined" implementation done three different ways.

## Objectives
1. See that encrypting multiple AES rounds at the same time does not prevent CPA attacks.
2. De-mystify leakage models by matching different leakage models to different implementations.
3. As a bonus: see that Vivado can do unexpected things that have a huge impact on leakage.

## Platform Notes
This notebook was developed with the CW305 target and CW-Husky. Other combinations are possible as per the requirements above, but results may differ slightly, and some tweaks will be required; these will be noted throughout (i.e. don't blindly run through the cells and expect it to work; follow the instructions and make any required modifications).

For the CW312 A35 target, the gain settings in this notebook are chosen for the inductive shunt. If you're using a different shunt, you may need to adjust them.


In [None]:
TARGET_PLATFORM = 'CW305_A100'
#TARGET_PLATFORM = 'CW305_A35'
#TARGET_PLATFORM = 'CW312T_A35'

In [None]:
import chipwhisperer as cw
scope = cw.scope()

In [None]:
scope.default_setup()
scope.io.hs2 = "disabled"

In [None]:
import time
def program_target(half_pipe, target=None, force=True):
    try:
        if target is not None:
            target.dis()
    except Exception:
        # best-effort disconnect: ignore errors from a stale or already-closed target
        pass

    if TARGET_PLATFORM == 'CW312T_A35':
        scope.io.hs2 = 'clkgen'
        fpga_id = 'cw312t_a35'
        platform = 'ss2'
    else:
        platform = 'cw305'
        if TARGET_PLATFORM == 'CW305_A100':
            fpga_id = '100t'
        elif TARGET_PLATFORM == 'CW305_A35':
            fpga_id = '35t'
        else:
            raise ValueError("Unsupported TARGET_PLATFORM: " + TARGET_PLATFORM)

    target = cw.target(scope, cw.targets.CW305_AES_PIPELINED, force=force, fpga_id=fpga_id, platform=platform, version=half_pipe)
        
    time.sleep(0.5)
    scope.errors.clear()
    assert target.half_pipe == half_pipe
    return target

# Fully Pipelined Target
In this demo we'll look at a few different AES pipeline implementations.

We'll start with a "fully pipelined" implementation, so named because it has dedicated hardware resources for each AES round; this lets it process one 128-bit AES block per clock cycle.

In [None]:
target = program_target(half_pipe=0)

Next we set up the clocks: CW-Husky requires a different setup when the ADC clock is driven by the target. If using the CW312T-A35 target, the capture hardware needs to drive the target clock.

In [None]:
if 'CW305' in TARGET_PLATFORM:
    scope.adc.offset = 0
    target.vccint_set(1.0)
    # we only need PLL1:
    target.pll.pll_enable_set(True)
    target.pll.pll_outenable_set(False, 0)
    target.pll.pll_outenable_set(True, 1)
    target.pll.pll_outenable_set(False, 2)

    # run at 10 MHz:
    target.pll.pll_outfreq_set(10E6, 1)

    # 1ms is plenty of idling time
    target.clkusbautooff = True
    target.clksleeptime = 1
    
    if scope._is_husky:
        scope.clock.clkgen_freq = 40e6
        scope.clock.clkgen_src = 'extclk'
        scope.clock.adc_mul = 4
        # if the target PLL frequency is changed, the above must also be changed accordingly
    else:
        scope.clock.adc_src = "extclk_x4"

else:
    scope.adc.offset = 6
    scope.clock.clkgen_freq = 7.37e6
    scope.io.hs2 = 'clkgen'
    if scope._is_husky:
        scope.clock.clkgen_src = 'system'
        scope.clock.adc_mul = 4
        scope.clock.reset_dcms()
    else:
        scope.clock.adc_src = "clkgen_x4"
    import time
    time.sleep(0.1)
    target._ss2_test_echo()
    

Finally, ensure the ADC clock is locked:

In [None]:
import time
for i in range(5):
    scope.clock.reset_adc()
    time.sleep(1)
    if scope.clock.adc_locked:
        break 
assert (scope.clock.adc_locked), "ADC failed to lock"

Occasionally the ADC will fail to lock on the first try; when that happens, the above assertion will fail (and on the CW-Lite, the red LED will be on). Simply re-running the cell above should fix things.

## Trace Capture: one AES block at a time

To simplify controlling the target's AES pipeline and how busy it's kept, the target has a FIFO that's 128 bits wide and `target.fifo_depth` deep (on the CW305 this is 8192; on the CW312T_A35 it's 4096).
Plaintexts to be encrypted are written to the FIFO; then when the target receives the "go encrypt!" signal, it processes everything that's in the FIFO as fast as it can (i.e. it will encrypt at a rate of one block per cycle until the FIFO is empty).

To keep things simple as we get started, we'll start by having the target encrypt a single AES block at a time (e.g. write one block to the FIFO; make it "go"; capture the power trace; repeat).

This is similar to how our ["regular" AES FPGA target](PA_HW_CW305_1-Attacking_AES_on_an_FPGA.ipynb) works, and it doesn't take advantage of this pipelined implementation's higher potential throughput, but it will let us learn some important things about the leakage that's present here.

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_HALF" + str(target.half_pipe) + ".cwp"
project = cw.create_project(project_file, overwrite=True)

In [None]:
from tqdm.notebook import tnrange
import numpy as np
import time
from Crypto.Cipher import AES

def get_traces(project, N, gain=None, check=True):
    scope.adc.samples = 80
    if gain is None:
        if TARGET_PLATFORM == 'CW312T_A35':
            scope.gain.db = 42
        else:
            scope.gain.db = 25
    else:
        scope.gain.db = gain
    ktp = cw.ktp.Basic()
    key, text = ktp.next()
    # flush FIFO in case it's not empty:
    target.fifo_flush()
    for i in tnrange(N, desc='Capturing traces'):
        key, text = ktp.next()  # manual creation of a key, text pair can be substituted here
        ret = target.capture_trace(scope, text, key, pre_expand=False, check=check)
        if not ret:
            print("Failed capture")
            continue
        project.traces.append(ret)
        assert scope.adc.trig_count == 44
    project.save()

In [None]:
get_traces(project, 3000, check=True)

In [None]:
from bokeh.palettes import inferno
from bokeh.plotting import figure, show
from bokeh.resources import INLINE
from bokeh.io import output_notebook
from bokeh.models import Span, Legend, LegendItem
import itertools

In [None]:
numtraces = 100
output_notebook(INLINE)
B = figure(plot_width=1800)
colors = itertools.cycle(inferno(numtraces))
for i in range(numtraces):
    B.line(range(scope.adc.samples), project.traces[i].wave, color=next(colors))
show(B)

Traces look pretty similar to those of our regular AES FPGA target; you can clearly see the 10 AES rounds.

When used this way, the only difference between the pipelined and non-pipelined AES targets is that in the non-pipelined target, the same logic gates are used for each round; here in the pipelined version, each round is a distinct and separate set of gates.

Next let's look at what this means for side-channel attacks.

## CPA Attack

Let's see what we get from the same attack and leakage model as the [non-pipelined target](PA_HW_CW305_1-Attacking_AES_on_an_FPGA.ipynb):

In [None]:
import chipwhisperer.analyzer as cwa

In [None]:
attack = cwa.cpa(project, cwa.leakage_models.last_round_state_diff)
attack.trace_range=[0,1000]
cb = cwa.get_jupyter_callback(attack)
attack_results = attack.run(cb)

In [None]:
np.average(attack_results.pge)

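Recall that the PGE (partial guessing entropy) of a key byte is the rank of the correct value in the attack's sorted list of guesses: 0 means the top guess is correct, and larger values mean the correct value sits further down the list.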

The resulting average PGE of the above attack should be very close to 128, which is what you'd get if you randomly guessed the key.

Further, if you watched the table evolve during the attack, you may have noticed that some key bytes get close to being correctly guessed but then get further again.

These are all signs that we're not employing an appropriate or useful leakage model.
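To make the "average PGE near 128" baseline concrete, here is a small stand-alone sketch. `partial_guessing_entropy` is an illustrative helper, not the ChipWhisperer implementation; it ranks 256 candidate scores and reports where the correct byte lands.

```python
import random

def partial_guessing_entropy(scores, correct_key_byte):
    """Rank of the correct key byte when guesses are sorted from best to worst score."""
    ranking = sorted(range(len(scores)), key=lambda guess: scores[guess], reverse=True)
    return ranking.index(correct_key_byte)

rng = random.Random(0)
correct = 0x2B  # arbitrary key byte chosen for the example

# Uninformative scores: the correct byte lands at an essentially random rank.
random_pges = [
    partial_guessing_entropy([rng.random() for _ in range(256)], correct)
    for _ in range(2000)
]
print(sum(random_pges) / len(random_pges))  # hovers around the middle of 0..255

# Informative scores: the correct byte has the highest score, so its rank is 0.
scores = [rng.random() for _ in range(256)]
scores[correct] = 2.0
print(partial_guessing_entropy(scores, correct))  # 0
```

An attack whose average PGE stays near the middle of the range is doing no better than these random scores.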

If you follow the source code, you'll find that `cwa.leakage_models.last_round_state_diff` points to this leakage definition in [AES128_8bit.py](https://github.com/newaetech/chipwhisperer/blob/develop/software/chipwhisperer/analyzer/attacks/models/AES128_8bit.py):
```python
def leakage(self, pt, ct, key, bnum):
    # HD Leakage of AES State between 9th and 10th Round
    # Used to break SASEBO-GII / SAKURA-G
    st10 = ct[self.INVSHIFT_undo[bnum]]
    st9 = inv_sbox(ct[bnum] ^ key[bnum])
    return (st9 ^ st10)
```

(hot tip: in Jupyter, running `cwa.leakage_models.last_round_state_diff.model??` will show you the source code for that function)

Understanding why this leakage model works requires knowledge of the steps of AES and of the [AES implementation](https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/cw305_artixtarget/fpga/cryptosrc/aes_googlevault/aes_core.v), and it is not obvious at first glance. At a high level, it looks at one byte of the AES `state` register in `aes_core.v` and calculates the XOR of that byte after round 9 (`st9`) and that byte after round 10 (`st10`).

Because of how AES works, we can do this on a byte-per-byte basis if we know the final ciphertext (`ct`); this is what allows the attack to guess one key byte at a time.

Why then does this attack fail on the pipelined target? The Verilog code for the pipelined target is very similar to the non-pipelined target (compare [aes_core.v](https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/cw305_artixtarget/fpga/cryptosrc/aes_googlevault/aes_core.v) and [aes_round.v](https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined/hdl/aes_round.v)), with one key difference: to enable pipelining, there is now a "state" register (`data_o` in `aes_round.v`) in *every pipeline stage*.

Each of these state registers gets updated after processing a round; for each one of them, its content changes from what it held when it was used in processing the *previous plaintext* to what it holds now after processing the *current plaintext*. You can see how the core functions by running the very basic Verilog testbench provided [here](https://github.com/newaetech/chipwhisperer/tree/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined/sim) with `make HALF_PIPE=0 DUMP=1`

So, `cwa.leakage_models.last_round_state_diff` worked because there is a state register in `aes_core.v` which contains the output of round 9 and is then updated to the output of round 10, *for the same plaintext*. No such register exists in the pipelined `aes_round.v` because here, the output of each round goes to a register associated with only that round.

But we are not out of luck! In fact the solution has already been outlined: we simply need to pick a round register and define a leakage function based on the state register content on the previous plaintext versus the current plaintext:

```Python
def leakage(self, pt, ct, prev_ct, key, bnum):
    curr = inv_sbox(ct[bnum] ^ key[bnum])
    prev = inv_sbox(prev_ct[bnum] ^ key[bnum])
    return curr ^ prev
```

In the `leakage()` function above, `prev` represents the contents of the round 9 state register after it processed `prev_ct`, while `curr` represents the content of that same register when it was next updated, after it processed `ct`.
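The following self-contained sketch checks that reasoning on synthetic data. It is not the ChipWhisperer code: the S-box is rebuilt from its mathematical definition, `pipeline_diff_leakage` is a stand-alone copy of the model above, and the toy last round deliberately ignores ShiftRows (an assumption that keeps byte positions aligned).

```python
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_sbox():
    """AES S-box: multiplicative inverse in GF(2^8) followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value

def pipeline_diff_leakage(ct, prev_ct, key, bnum):
    curr = INV_SBOX[ct[bnum] ^ key[bnum]]
    prev = INV_SBOX[prev_ct[bnum] ^ key[bnum]]
    return curr ^ prev

rng = random.Random(0)
k10 = [rng.randrange(256) for _ in range(16)]        # last round key
reg_prev = [rng.randrange(256) for _ in range(16)]   # round 9 register, previous block
reg_curr = [rng.randrange(256) for _ in range(16)]   # round 9 register, current block
# Toy last round (SubBytes + AddRoundKey only; ShiftRows omitted for simplicity):
prev_ct = [SBOX[s] ^ k for s, k in zip(reg_prev, k10)]
ct = [SBOX[s] ^ k for s, k in zip(reg_curr, k10)]

matches = all(
    pipeline_diff_leakage(ct, prev_ct, k10, b) == (reg_curr[b] ^ reg_prev[b])
    for b in range(16)
)
print(matches)  # True
```

With the correct key, the model reproduces the bit flips in the round 9 register between consecutive blocks; the Hamming weight of that value is what the CPA attack correlates against the power samples.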

Let's try it:

In [None]:
attack_pipe = cwa.cpa(project, cwa.leakage_models.pipeline_diff)
attack_pipe.trace_range=[0,1000]
cb = cwa.get_jupyter_callback(attack_pipe)
attack_pipe_results = attack_pipe.run(cb)

In [None]:
np.average(attack_pipe_results.pge)

This isn't enough traces to retrieve the key, but we're quite close! You should see a PGE that's much closer to 0 than it is to 128 (with the CW312T_A35 you may need more traces to see this).

Before we continue, let's define our own CPA attack function; it will run a bit faster because it processes all traces at once, instead of incrementally, and it lets us focus on the PGE, which will be handy as we will evaluate different leakage models.

This uses the same approach as [Lab 4.2](<../courses/sca101/Lab 4_2 - CPA on Firmware Implementation of AES (MAIN).ipynb>):


In [None]:
HW = [bin(n).count("1") for n in range(0, 256)]

def mean(X):
    return np.sum(X, axis=0)/len(X)

def std_dev(X, X_bar):
    return np.sqrt(np.sum((X-X_bar)**2, axis=0))

def cov(X, X_bar, Y, Y_bar):
    return np.sum((X-X_bar)*(Y-Y_bar), axis=0)

def do_cpa(project, model, trace_range=None, point_range=None):
    short_traces = []

    if trace_range is None:
        tstart = 0
        tstop = len(project.traces)
    else:
        tstart = trace_range[0]
        tstop = trace_range[1]

    if point_range is None:
        pstart = 0
        pstop = len(project.traces[0].wave)
    else:
        pstart = point_range[0]
        pstop = point_range[1]

    # careful as this can lead to confusion! it means that ciphouts[i+1] is the ciphertext
    # corresponding to short_traces[i] (and so ciphouts[i] is the corresponding *previous* ciphertext)
    if tstart > 0:
        ciphouts = [project.traces[tstart-1].textout]
    else:
        ciphouts = [[0]*16]

    for i in range(tstart, tstop):
        short_traces.append(project.traces[i].wave[pstart:pstop])
        ciphouts.append(list(project.traces[i].textout))

    num_traces = tstop - tstart

    t_bar = np.sum(short_traces, axis=0)/num_traces
    o_t = np.sqrt(np.sum((short_traces - t_bar)**2, axis=0))

    cparefs = [0] * 16
    bestguess = [0] * 16
    bestguesses = []

    for bnum in tnrange(0, 16):
        maxcpa = [0] * 256
        klist = [0]*16
        for kguess in range(0, 256):
            klist[bnum] = kguess
            if model._has_prev:
                hws = np.array([[HW[model.modelobj.leakage(None, ciphouts[i+1], None, ciphouts[i], klist, bnum)] for i in range(num_traces)]]).transpose()
            else:
                hws = np.array([[HW[model.modelobj.leakage(None, ciphouts[i+1], klist, bnum)] for i in range(num_traces)]]).transpose()
            
            hws_bar = mean(hws)
            o_hws = std_dev(hws, hws_bar)
            correlation = cov(short_traces, t_bar, hws, hws_bar)
            cpaoutput = correlation/(o_t*o_hws)
            maxcpa[kguess] = max(abs(cpaoutput))
        bestguess[bnum] = np.argmax(maxcpa)
        bestguesses.append(np.argsort(maxcpa)[::-1])
        cparefs[bnum] = max(maxcpa)
    
    correct_recovered_key = model.modelobj.process_known_key(project.traces[0].key)
    scores = []
    for b in range(16):
        score = list(bestguesses[b]).index(correct_recovered_key[b])
        scores.append(score)
    print('Remaining PGE: %f' % np.average(scores))

    return bestguess, bestguesses

In [None]:
results = do_cpa(project, cwa.leakage_models.pipeline_diff, point_range=[44,54])

The result should be 0 or very close to it, which is much better than `attack_pipe_results` above because:
1. We're using all 3000 traces.
2. We're only looking at samples 44 to 54 from each power trace, which is where we know the leakage we're exploiting is occurring (because that's when the target round register gets updated).

(Again, if you're using the CW312T_A35 target, you may find you need more traces.)

Targeting the power samples more precisely will become more important for the next step, when we try the attack with the pipeline fully occupied.

You can experiment with different settings for `point_range`; if you move it around a bit you should find slightly worse results; if you exclude samples `[44,54]` completely, you'll get much worse results.
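One way to run that experiment systematically is to generate a series of candidate windows and try each in turn. The helper below is a sketch (`sliding_windows` is not part of ChipWhisperer); the step of 4 samples assumes the four-ADC-samples-per-target-clock setup used in this notebook.

```python
def sliding_windows(num_samples, width=10, step=4):
    """Yield [start, stop] sample windows that slide across a trace of `num_samples` points."""
    for start in range(0, num_samples - width + 1, step):
        yield [start, start + width]

windows = list(sliding_windows(80, width=10, step=4))
print(windows[:3], windows[-1])

# In the notebook, each window can then be handed to do_cpa(), for example:
# for w in windows:
#     print(w)
#     do_cpa(project, cwa.leakage_models.pipeline_diff, point_range=w)
```

The window `[44, 54]` is one of the generated candidates, so the sweep includes the setting used above.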

# Filling the Pipeline

We define a new trace capture function for this scenario.

Recall that filling the pipeline involves writing all our plaintexts to the target's input FIFO.

`target.capture_trace()` is a custom capture function for this target which takes care of this for us.

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_FILLED_HALF" + str(target.half_pipe) + ".cwp"
project_filled = cw.create_project(project_file, overwrite=True)

In [None]:
def get_traces_filled_pipeline(project, N, NPT, half_pipe=False, gain=None, check=True):
    """ Args:
        N: number of traces
        NPT: number of encryptions per trace
        half_pipe: set to True for half-pipelined targets, False for fully-pipelined targets
        gain: leave to None to use defaults, otherwise provide desired gain in dB
    """
    if half_pipe:
        mx = 2
    else:
        mx = 1
    scope.adc.samples = (NPT+100)*scope.clock.adc_mul*mx
    if gain is None:
        if TARGET_PLATFORM == 'CW312T_A35':
            scope.gain.db = 34
        else:
            scope.gain.db = 18
    else:
        scope.gain.db = gain
    ktp = cw.ktp.Basic()
    key, text = ktp.next()
    # flush FIFO in case it's not empty:
    target.fifo_flush()
    for i in tnrange(N, desc='Capturing traces'):
        texts = []
        for j in range(NPT):
            key, text = ktp.next()
            texts.append(list(text))
        ret = target.capture_trace(scope, texts, key, pre_expand=False, check=check)
        if not ret:
            print("Failed capture")
            continue
        project.traces.append(ret)
    project.save()

`get_traces_filled_pipeline()` has a new argument, `NPT`.

This is the number of input words that are written to the target FIFO before it is made to go. This can be repeated `N` times.

So, `get_traces_filled_pipeline()` collects `N` power traces, each of which contains `NPT` encryptions.

`NPT` could be anything, but to get interesting results it should be greater than the number of AES rounds (so that at least part of the power trace has the pipeline fully occupied).

For fun, let's see what traces with `NPT=20` look like:

In [None]:
get_traces_filled_pipeline(project_filled, N=10, NPT=20)

In [None]:
numtraces = len(project_filled.traces)
output_notebook(INLINE)
B = figure(plot_width=1800)
colors = itertools.cycle(inferno(numtraces))
for i in range(numtraces):
    B.line(range(scope.adc.samples), project_filled.traces[i].wave, color=next(colors))
show(B)

You can clearly see power consumption ramping up with each clock cycle as the AES pipeline fills at the start, then dropping back down as it empties at the end.

Since we're interested in the feasibility of a side-channel attack on an AES pipeline, let's maximize `NPT`.

Notes:
1. Each iteration involves sending `target.fifo_depth` plaintexts and retrieving the same number of ciphertexts, so it takes a while; don't be concerned if you don't see any progress for a while.
2. If you have a CW-Lite, you'll have to reduce `NPT` to 6000 so that you don't exceed the Lite's maximum capture size; you can increase `N` to 4 to capture roughly the same total number of encryptions.
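To see how `NPT` translates into capture length, the sample-count formula from `get_traces_filled_pipeline()` can be evaluated on its own. This is a sketch: `samples_needed` is a hypothetical helper, and `adc_mul=4` is assumed to match the clock setup earlier in the notebook.

```python
def samples_needed(npt, adc_mul=4, half_pipe=False):
    """ADC samples requested for one capture of `npt` encryptions (same formula as above)."""
    mx = 2 if half_pipe else 1
    return (npt + 100) * adc_mul * mx

print(samples_needed(20))    # 480
print(samples_needed(6000))  # 24400
print(samples_needed(8192))  # 33168
```

Compare the result against the maximum capture size of your scope before choosing `NPT`.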

`setN()` is a convenience function which adjusts N for different targets, in consideration for their differing FIFO depths, and the fact that the CW312 target requires more traces.

In [None]:
def setN(N):
    if 'A35' in TARGET_PLATFORM:
        N = N*2  # double N because `target.fifo_depth` is halved on A35 targets
    if 'CW312' in TARGET_PLATFORM:
        N = N*2  # double N again because more traces are required on this target
    return N

In [None]:
project_filled = cw.create_project(project_file, overwrite=True)

In [None]:
N = setN(3)
get_traces_filled_pipeline(project_filled, N=N, NPT=target.fifo_depth, check=False) # adjust N, NPT as per notes above if necessary

Now we have `N` long power traces, each containing `NPT` encryptions.

Our CPA attack needs to associate one power trace to each encryption. How do we do that? By splitting up each power trace into `NPT` segments! `split_traces()` is a convenience function that does this for us.

This is where the `point_range` from before comes in handy, because we need to tell `split_traces()` *where* to split up the power trace.
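Conceptually, the splitting amounts to sliding the same sample window along the long trace, one encryption at a time. The sketch below illustrates the idea only; it is not the implementation of `target.split_traces()`, and `split_long_trace`, `samples_per_block` and `skip` are names invented for this example. It assumes 4 ADC samples per encryption (fully pipelined target) and drops the first and last 10 segments, where the pipeline is not full.

```python
def split_long_trace(wave, num_blocks, start, stop, samples_per_block=4, skip=10):
    """Cut one long waveform into per-encryption windows.

    Block k gets wave[start + k*samples_per_block : stop + k*samples_per_block].
    The first and last `skip` blocks are dropped (pipeline filling or draining).
    """
    segments = []
    for k in range(skip, num_blocks - skip):
        offset = k * samples_per_block
        segments.append(wave[start + offset:stop + offset])
    return segments

# Synthetic "waveform" whose values equal their sample index, to make the cuts visible.
wave = list(range(500))
segments = split_long_trace(wave, num_blocks=100, start=44, stop=54)
print(len(segments), segments[0][0], segments[0][-1])  # 80 84 93
```

Each segment is 10 samples long, and consecutive segments overlap because the window is wider than one clock cycle.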

In [None]:
split_traces = target.split_traces(scope, project_filled.traces, 44, 54)

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_FILLED_SPLIT_HALF" + str(target.half_pipe) + ".cwp"
project_split = cw.create_project(project_file, overwrite=True)

for t in split_traces:
    project_split.traces.append(t)

Now we have our synthesized per-encryption traces in this new project.

`split_traces()` has also done the work of associating the plaintext and ciphertext that go with each power trace segment, so `project_split` is ready to be fed to `do_cpa()`.

(note that we no longer provide the `point_range` argument because it's already been used to build our set of power trace segments)

In [None]:
results = do_cpa(project_split, cwa.leakage_models.pipeline_diff, trace_range=[0, 2000])

2000 traces were sufficient (or almost) when the pipeline was processing only one encryption at a time. Now that it's fully loaded, the noise from all the active pipeline stages means the attack needs more traces to succeed:

In [None]:
results = do_cpa(project_split, cwa.leakage_models.pipeline_diff)

This should work! This used `N*(NPT-20)` traces (the first and last 10 segments of each large power trace are omitted because the pipeline isn't fully loaded for those).
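For reference, the `N*(NPT-20)` count works out as follows for the settings used in this notebook (`usable_segments` is just a helper name for this illustration):

```python
def usable_segments(n_traces, npt, edge_blocks=10):
    """Per-encryption segments left after dropping `edge_blocks` at each end of every trace."""
    return n_traces * (npt - 2 * edge_blocks)

print(usable_segments(3, 8192))   # CW305, N=3, FIFO depth 8192: 24516
print(usable_segments(12, 4096))  # CW312T_A35, N=setN(3)=12, FIFO depth 4096: 48912
```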

You can see just how many traces are required; it should be around 20000:

In [None]:
results = do_cpa(project_split, cwa.leakage_models.pipeline_diff, trace_range=[0, 20000])

So a ~10-fold increase in the noise that's present in the power traces (a very rough and simplistic approximation) leads to a ~10-fold increase in the number of traces required for the CPA attack.

(On the CW312T-A35, around 50K traces should do it.)

# Different Ways to Pipeline

Encryption is expensive; a 10-core AES pipeline is going to be about 10 times larger than the equivalent non-pipelined implementation, and if 128 bits/cycle throughput doesn't hit the sweet spot for a particular application, other landing points are possible.

You can go faster by implementing multiple pipelines in parallel; you can also go slower by reducing the number of pipeline stages.

Consider for example a five-stage pipeline, where each stage does two AES rounds (in two clock cycles) instead of just one. This will be half the size and half the throughput, since it will be able to take in one 16-byte plaintext every two cycles.

Changing the number of pipeline stages from ten to five impacts the leakage model (if you understood the leakage model for the ten-stage implementation you should immediately see why); it will also be interesting to see how the reduced number of stages affects the number of traces required.

In [None]:
target.dis()
target = program_target(half_pipe=1)

We start as before, with a single encryption at a time:

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_HALF" + str(target.half_pipe) + ".cwp"
project = cw.create_project(project_file, overwrite=True)

In [None]:
get_traces(project, 2000)

In [None]:
project.save()

In [None]:
results = do_cpa(project, cwa.leakage_models.pipeline_diff, point_range=[44,54])

Yep, the leakage model that worked for the fully-pipelined target is useless here!

To figure out what the leakage model should be, we again need to look at the implementation details.

In `aes_core.v`, the initial round key addition is essentially done as a distinct "round 0", in its own clock cycle.

This means that the full encryption is done in 11 cycles (not 10), so when we halve the pipeline, we have two implementation choices:

|pipe stage |version 1 rounds |version 2 rounds |
|-----------|----------|----------|
| 1         | 0        | 0, 1     |
| 2         | 1, 2     | 2, 3     |
| 3         | 3, 4     | 4, 5     |
| 4         | 5, 6     | 6, 7     |
| 5         | 7, 8     | 8, 9     |
| 6         | 9, 10    | 10       |

Our current target uses version 1. Again, you may find it helpful to run the testbench that's provided [here](https://github.com/newaetech/chipwhisperer/tree/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined/sim) to see the core in action: `make HALF_PIPE=1 DUMP=1`.
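The stage-to-round mapping in the table can also be generated programmatically, which makes it easy to check which stage register holds two particular rounds of the same block. This is a small sketch; `half_pipe_stages` is a hypothetical helper, not part of ChipWhisperer or the target's HDL.

```python
def half_pipe_stages(version, last_round=10):
    """Rounds handled by each pipeline stage, for the two half-pipelined layouts."""
    rounds = list(range(last_round + 1))  # round 0 is the initial AddRoundKey
    if version == 1:
        return [rounds[:1]] + [rounds[i:i + 2] for i in range(1, len(rounds), 2)]
    if version == 2:
        return [rounds[i:i + 2] for i in range(0, len(rounds), 2)]
    raise ValueError("version must be 1 or 2")

for version in (1, 2):
    stages = half_pipe_stages(version)
    # 1-based stage numbers whose register sees both round 9 and round 10
    shared = [n for n, group in enumerate(stages, start=1) if 9 in group and 10 in group]
    print(version, stages, shared)
```

Only version 1 has a stage (stage 6) that processes rounds 9 and 10 back to back.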

One thing you should see right away is that since the last stage processes rounds 9 and 10 for the same plaintext in sequence, the leakage model that we use for the non-pipelined version of this target should work here!

And it does:

In [None]:
results = do_cpa(project, cwa.leakage_models.last_round_state_diff, point_range=[48,58])

This targeted the pipeline stage 6 register as it goes from the round 9 output to the round 10 output (on the same plaintext).

But we can also target that same register as it goes from the round 10 output for the *previous* plaintext to the round 9 output for the *current* plaintext, using this leakage model:

```Python
def leakage(self, pt, ct, prev_ct, key, bnum):
    curr = inv_sbox(ct[bnum] ^ key[bnum])
    prev = prev_ct[self.INVSHIFT_undo[bnum]]
    return curr ^ prev
```

In [None]:
results = do_cpa(project, cwa.leakage_models.half_pipeline_diff, point_range=[48,58])

This also works, just not quite as well as `last_round_state_diff` (on the CW312T_A35 it seems to barely work; however if you use more traces you should see better results). Let's move on to fully loading the pipeline:

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_FILLED_HALF" + str(target.half_pipe) + ".cwp"
project_filled = cw.create_project(project_file, overwrite=True)

In [None]:
N=setN(3)
get_traces_filled_pipeline(project_filled, N=N, NPT=target.fifo_depth, half_pipe=True)

In [None]:
split_traces = target.split_traces(scope, project_filled.traces, 48, 58)

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_FILLED_SPLIT_HALF" + str(target.half_pipe) + ".cwp"
project_split = cw.create_project(project_file, overwrite=True)

for t in split_traces:
    project_split.traces.append(t)

In [None]:
results = do_cpa(project_split, cwa.leakage_models.last_round_state_diff, trace_range=[0,10000])

In [None]:
results = do_cpa(project_split, cwa.leakage_models.half_pipeline_diff, trace_range=[0,10000])

`last_round_state_diff` still outperforms, but the delta is much smaller.

What if we *combine* the two models?

In [None]:
def do_cpa2x(project, model1, model2, trace_range=None, point_range=None):
    short_traces = []

    if trace_range is None:
        tstart = 0
        tstop = len(project.traces)
    else:
        tstart = trace_range[0]
        tstop = trace_range[1]

    if point_range is None:
        pstart = 0
        pstop = len(project.traces[0].wave)
    else:
        pstart = point_range[0]
        pstop = point_range[1]

    # careful as this can lead to confusion! it means that ciphouts[i+1] is the ciphertext
    # corresponding to short_traces[i] (and so ciphouts[i] is the corresponding *previous* ciphertext)
    if tstart > 0:
        ciphouts = [project.traces[tstart-1].textout]
    else:
        ciphouts = [[0]*16]

    for i in range(tstart, tstop):
        short_traces.append(project.traces[i].wave[pstart:pstop])
        ciphouts.append(list(project.traces[i].textout))

    num_traces = tstop - tstart

    t_bar = np.sum(short_traces, axis=0)/num_traces
    o_t = np.sqrt(np.sum((short_traces - t_bar)**2, axis=0))

    cparefs = [0] * 16
    bestguess = [0] * 16
    bestguesses = []

    for bnum in tnrange(0, 16):
        maxcpa = [0] * 256
        klist = [0]*16
        for kguess in range(0, 256):
            klist[bnum] = kguess
            hwss = []
            for model in [model1, model2]:
                if model._has_prev:
                    hwss.append(np.array([[HW[model.modelobj.leakage(None, ciphouts[i+1], None, ciphouts[i], klist, bnum)] for i in range(num_traces)]]).transpose())
                else:
                    hwss.append(np.array([[HW[model.modelobj.leakage(None, ciphouts[i+1], klist, bnum)] for i in range(num_traces)]]).transpose())
            
            for hws in hwss:
                hws_bar = mean(hws)
                o_hws = std_dev(hws, hws_bar)
                correlation = cov(short_traces, t_bar, hws, hws_bar)
                cpaoutput = correlation/(o_t*o_hws)
                maxcpa[kguess] += max(abs(cpaoutput))        

        bestguess[bnum] = np.argmax(maxcpa)
        bestguesses.append(np.argsort(maxcpa)[::-1])
        cparefs[bnum] = max(maxcpa)        

    correct_recovered_key = model1.modelobj.process_known_key(project.traces[0].key)
    scores = []
    for b in range(16):
        score = list(bestguesses[b]).index(correct_recovered_key[b])
        scores.append(score)
    print('Remaining PGE: %f' % np.average(scores))

    return bestguess, bestguesses

In [None]:
results = do_cpa2x(project_split, cwa.leakage_models.half_pipeline_diff, cwa.leakage_models.last_round_state_diff, trace_range=[0,10000])

It looks like the combination is beneficial! (The exception is the CW312T_A35, where `last_round_state_diff` is so much better than `half_pipeline_diff` that combining them does not help.)

The number of traces needs to be bumped up to around 18000 for the attack to fully succeed. This is a modest reduction compared to the fully pipelined implementation.

In [None]:
results = do_cpa2x(project_split, cwa.leakage_models.half_pipeline_diff, cwa.leakage_models.last_round_state_diff, trace_range=[0,18000])

# Half-Pipeline Version 2

What if we use the second way to half-pipeline? The state registers are now updated as per the version 2 column of this table:

|pipe stage |version 1 rounds |version 2 rounds |
|-----------|----------|----------|
| 1         | 0        | 0, 1     |
| 2         | 1, 2     | 2, 3     |
| 3         | 3, 4     | 4, 5     |
| 4         | 5, 6     | 6, 7     |
| 5         | 7, 8     | 8, 9     |
| 6         | 9, 10    | 10       |
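The two groupings in this table can also be reproduced programmatically. The sketch below only restates the table; `stage_rounds` is a hypothetical helper written for this illustration and does not correspond to anything in ChipWhisperer or in the Verilog source.

```python
def stage_rounds(version, last_round=10):
    """Return one tuple of AES rounds per pipeline stage (stage 1 first)."""
    rounds = list(range(last_round + 1))  # rounds 0..10 for AES-128
    if version == 1:
        # Stage 1 holds round 0 alone; the remaining rounds are paired.
        groups = [rounds[:1]] + [rounds[i:i + 2] for i in range(1, len(rounds), 2)]
    elif version == 2:
        # Rounds are paired starting at round 0; the final round ends up alone.
        groups = [rounds[i:i + 2] for i in range(0, len(rounds), 2)]
    else:
        raise ValueError("version must be 1 or 2")
    return [tuple(g) for g in groups]

for version in (1, 2):
    for stage, group in enumerate(stage_rounds(version), start=1):
        print(f"version {version}, stage {stage}: rounds {group}")
```

Both versions use six pipeline stages; they differ only in which round is left unpaired (round 0 in version 1, round 10 in version 2).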

What leakage can we exploit here?

- The stage 6 register gets updated with the final ciphertext values for each encryption. If you code the leakage model for this, you'll find that the key is XORed twice, which takes it out of the leakage definition; this can't be exploited to recover the key.
- The stage 5 register first gets updated from the round 9 output of the previous ciphertext to the round 8 output of the next ciphertext. This is not easily exploitable because the round key touches multiple bytes between round 8 and round 9.
- The stage 5 register then gets updated from the round 8 output to the round 9 output. Again, this is not easily exploitable because the round key touches multiple bytes between round 8 and round 9.

What about the first pipeline stage? We've overlooked it until now. The "round 0" content is the initial round key addition. This can't be exploited across successive encryptions because the key would cancel out:

```Python
prev = prev_pt[bnum] ^ key[bnum]
curr = pt[bnum] ^ key[bnum]
curr ^ prev == pt[bnum] ^ prev_pt[bnum]
```

And the round 1 output has gone through the MixColumns step which diffuses the key across multiple bytes.
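As a quick sanity check of the round 0 cancellation, the following minimal sketch (using arbitrary made-up byte values; `round0_hd` is a hypothetical helper, not part of ChipWhisperer) confirms that the Hamming distance between successive round 0 values is identical for every possible key byte, so it gives CPA nothing to rank key guesses with:

```python
def round0_hd(prev_pt_byte, pt_byte, key_byte):
    """Hamming distance between successive round 0 (AddRoundKey) values."""
    prev = prev_pt_byte ^ key_byte
    curr = pt_byte ^ key_byte
    return bin(prev ^ curr).count("1")

# Arbitrary example plaintext bytes from two successive encryptions.
prev_pt_byte, pt_byte = 0x3A, 0xC5

# Collect the predicted leakage for all 256 key guesses.
predictions = {round0_hd(prev_pt_byte, pt_byte, k) for k in range(256)}

# A single distinct value means every key guess predicts the same leakage.
print(predictions)
```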

Does that mean version 2 can't succumb to this CPA attack? Let's find out...

In [None]:
target.dis()
target = program_target(half_pipe=2)

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_HALF" + str(target.half_pipe) + ".cwp"
project = cw.create_project(project_file, overwrite=True)

In [None]:
if 'CW305' in TARGET_PLATFORM:
    gain = 22
else:
    gain = 41

get_traces(project, 3000, gain=gain)

In [None]:
project.save()

In [None]:
results = do_cpa(project, cwa.leakage_models.half_pipeline_diff, point_range=[44,54])

In [None]:
results = do_cpa(project, cwa.leakage_models.last_round_state_diff, point_range=[44,54])

In [None]:
results = do_cpa(project, cwa.leakage_models.pipeline_diff, point_range=[44,54])

Surprisingly, two of the three leakage models that we've exploited for the previous targets work **very** well. **How can this be?!?**

Recall that `half_pipeline_diff` was developed for version 1, where the last stage register was updated from the round 10 value of the previous ciphertext to the round 9 value of the current ciphertext.

Here with version 2, the round 10 and round 9 values don't even go to the same physical state register.

**So why does it work here?** (Not only does it work, it appears to work at least as well as it does against version 1!)

To understand what seems to have happened, we'll have to dive into the Verilog and the actual FPGA implementation.

Looking at the AES round instantiations in [`aes_half_pipeline_top.v`](https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined/hdl/aes_half_pipeline_top.v), when `HALF_PIPE=2` we have 6 instances of `aes_two_rounds`; the last instance of `aes_two_rounds` (k = 10) is used to perform only one round (the last round).

This is lazy coding: the "correct" thing to do would be to instantiate `aes_round` separately for the last round, outside of the generate block, as is done for the initial round in the `HALF_PIPE=1` case. This really shouldn't matter -- the two cases should be logically equivalent -- but it appears Vivado had other ideas and did some unexpected "optimizations". The easiest way to see that something is amiss is to show the connectivity of the different round modules in Vivado's graphical device view. In the images below, the input FIFO logic is highlighted green and the output FIFO logic is highlighted red.

On the left-hand side, the white lines show the connectivity from the last round (`gen_half_rounds[10].U_aes_two_rounds`); as expected, there are many connections to the output FIFOs.

On the right-hand side, the white lines show the connectivity from the second-last round (`gen_half_rounds[8].U_aes_two_rounds`); what is *very* surprising is that this *also* has many connections to the output FIFOs!

<img src="img/aes_pipe_half2_last_round_connections.png" width="600"> <img src="img/aes_pipe_half2_2ndlast_round_connections.png" width="600">

(these are from the CW305 100t target; the other targets will look different but should show the same idea)

Logically, this should not be happening: `gen_half_rounds[8].U_aes_two_rounds` should only be connected to the previous and next round instances; it should have no direct connection to the output FIFO.

This can be verified in the [synthesized netlist](https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined/aes128_pipelined_half2_netlist.v). What we're looking for is an output of `gen_half_rounds[10].U_aes_two_rounds` connected to an input of `gen_half_rounds[8].U_aes_two_rounds`.

Unfortunately, Vivado mangles net names horribly so this is not as easy as it should be, but with a little bit of sweat and tears you can find that `gen_half_rounds[10].U_aes_two_rounds_n_39`, for example, is one such signal.

A shortcut (which unfortunately involves Tcl 😱) is to open the synthesized design in Vivado (for this you'll have to re-run the synthesis yourself) and query Vivado for outputs of `gen_half_rounds[10].U_aes_two_rounds` connected to inputs of `gen_half_rounds[8].U_aes_two_rounds`:
```tcl
set round10outs [get_nets -of [get_pins gen_half_pipe.U_aes_pipeline/gen_half_rounds[10].U_aes_two_rounds/* -filter {DIRECTION == OUT}]]
set round8ins [get_nets -of [get_pins gen_half_pipe.U_aes_pipeline/gen_half_rounds[8].U_aes_two_rounds/* -filter {DIRECTION == IN}]]
set intersect [list]
foreach elem $round8ins {if {$elem in $round10outs} {lappend intersect $elem}}
puts $intersect
```

(On the CW312T_A35 target, prepend `U_cw305_dut/` to the first argument of each `get_pins` call.)

In our implementation we find 134 such signals.

Let's confirm our hunch. Setting `HALF_PIPE` to 3 instantiates [`aes_half_pipeline_top_fixed.v`](https://github.com/newaetech/chipwhisperer/blob/develop/hardware/victims/cw305_artixtarget/fpga/vivado_examples/aes128_pipelined/hdl/aes_half_pipeline_top_fixed.v), which is identical to `aes_half_pipeline_top.v` except that the last round is done by an instance of `aes_round` instead of `aes_two_rounds`.

Before trying to attack this implementation, let's do a quick graphical sanity check (last round connections on the left; second-last round connections on the right):

<img src="img/aes_pipe_half3_last_round_connections.png" width="600"> <img src="img/aes_pipe_half3_2ndlast_round_connections.png" width="600">

This looks much better, and if you run the synthesis and the above Tcl commands, you should find that `$intersect` is now empty.

Let's now quickly try the attack with all three models:

In [None]:
target.dis()
target = program_target(half_pipe=3)

In [None]:
project_file = "projects/Tutorial_HW_CW305_AES_PIPELINED_HALF" + str(target.half_pipe) + ".cwp"
project = cw.create_project(project_file, overwrite=True)

In [None]:
get_traces(project, 3000, gain=gain)

In [None]:
results = do_cpa(project, cwa.leakage_models.half_pipeline_diff, point_range=[44,54])

In [None]:
results = do_cpa(project, cwa.leakage_models.last_round_state_diff, point_range=[44,54])

In [None]:
results = do_cpa(project, cwa.leakage_models.pipeline_diff, point_range=[44,54])

**Success!**

This doesn't mean that this implementation isn't vulnerable to CPA attacks (meet the [MixColumn attack](../courses/sca201/Lab%202_3%20-%20Attacking%20Across%20MixColumns.ipynb)!).

The objective here was to illustrate the impact of design choices, both intentional and not, on leakage. By luck, we stumbled into a wonderful illustration of how implementation tools can do strange and unexpected things with our source code that alter its side-channel leakage, and that this behaviour is not exclusive to software compilers.