# Breaking Hardware ECC on CW305 FPGA

## Background
To get the most out of this tutorial, some basic knowledge of elliptic curves, and in particular of point multiplication on elliptic curves, is required. A good overview is available here: https://cryptojedi.org/peter/data/eccss-20130911b.pdf.

The side-channel attack presented here targets the scalar multiplier ("k") in the elliptic curve point multiplication. Point multiplication is the most expensive operation in many (if not all?) cryptographic uses of elliptic curves. The secret scalar is not the private key, but learning the scalar used in an ECDSA signature (for example) allows the secret key to be trivially calculated.
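To make the last point concrete, here is a minimal sketch of the key-recovery arithmetic, assuming the standard ECDSA signing equation $s = k^{-1}(z + r d) \bmod n$. The function names and the values of `d`, `r` and `z` are made up for illustration. In a real signature `r` is derived from the point $kG$; the algebra holds for any nonzero `r`, so a stand-in value is used to avoid needing a curve implementation.

```python
# Order n of the NIST P-256 base point.
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def ecdsa_s(d, k, r, z, n=N):
    """Second half of an ECDSA signature: s = k^-1 * (z + r*d) mod n."""
    return (pow(k, -1, n) * (z + r * d)) % n

def recover_private_key(k, r, s, z, n=N):
    """Solve the signing equation for d once the scalar k is known."""
    return ((s * k - z) * pow(r, -1, n)) % n

# Illustrative values only (hypothetical key, stand-in r, toy message hash).
d = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF
k = 0x70a12c2db16845ed56ff68cfc21a472b3f04d7d6851bf6349f2d7d5b3452b38a
r = 0x0F0E0D0C0B0A09080706050403020100FFEEDDCCBBAA99887766554433221100
z = 0xDEADBEEF

s = ecdsa_s(d, k, r, z)
recovered = recover_private_key(k, r, s, z)
print(recovered == d)
```

The three-argument `pow` with a negative exponent computes a modular inverse and requires Python 3.8 or later.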

This attack is quite different from the AES side-channel attacks in our other tutorials. In most ECC point multiplication implementations (including the target used here), the secret scalar k is consumed one bit at a time. At a high level, the attack is very simple:
1. Identify when each bit of k is processed on the power trace.
2. Find how processing a '1' is different from processing a '0'.
3. Assemble the secret k, one bit at a time.

Since we are attacking k one bit at a time, its size has no impact on the difficulty of the attack. The curve used in this attack is the NIST P-256 curve; the same approach would work just as well with a larger curve.

Our attack requires multiple traces to be collected. The secret k remains constant for each trace, but a different point must be used for each trace. However, we require no knowledge whatsoever of what the points actually are. Furthermore, if the attacker is limited to collecting a single trace for a given value of k, we will show in the end that we can correctly guess most of k.

The target for this attack is the point multiplication submodule of the [Cryptech ecdsa256 core](https://wiki.cryptech.is/log/user/shatov/ecdsa256).
Refer to the README in your ChipWhisperer repository (`hardware/victims/cw305_artixtarget/fpga/cryptosrc/cryptech/ecdsa256-v1/README.md`) for details on the target and the modifications that were made to it for this attack.

## Series Overview

This tutorial is the first part of a 5-part series. In this tutorial, we develop a basic attack and demonstrate its viability.

Part 2 improves the attack and introduces new ways to measure its performance.

In part 3, we switch from the attacker chair to the defender chair: we propose and evaluate several countermeasures to resist our attack.

In part 4, we study one more countermeasure which yields new insights on the target's leakage.

Part 5 concludes by looking at what TVLA can tell us about the target's leakage, with the benefit of all that we learned in the previous sections.

If you only do part 1, you'll learn quite a bit about hardware ECC and how it may be attacked, and you could choose to stop there. Hopefully you'll find this sufficiently interesting to cover the remaining sections and learn even more!

## Capture Notes

Most of the capture settings used below are similar to the standard ChipWhisperer scope settings. Some important points to note:

- The full ECC operation takes over one million clock cycles, so it is best done with a ChipWhisperer-Pro.
- With a ChipWhisperer-Lite, every trace needs to be captured in several steps, using the sample offset feature (47 steps to be precise!), so trace acquisition is much slower: around 5 seconds/trace, versus 4 traces/second with the CW-pro. Be patient! Luckily, the attack doesn't require a large number of traces.
- We're using EXTCLK x1 for our ADC clock. This means that the FPGA is outputting a clock signal, and we aren't driving it.
- It's possible that better results would be obtained with x4 sampling, but that would make trace acquisition with the CW-lite *very* slow.


## Supported Setups

This tutorial requires a CW305 target, and either a CW-Lite (two-part), CW-Pro or CW-Husky.

The tutorial was developed with a CW-Pro with the 100t FPGA; the observations made in the attack's development should be accurate if you're using the same, but other combinations of CW-Pro / CW-Lite / CW-Husky / 100t / 35t may behave somewhat differently.

***If you're using a CW-lite, your results may differ slightly***, and what you see may not correspond exactly to the notebook's comments, especially in the first attempt when using the power measured on specific clock cycles. This is likely due to the 5x higher clock frequency being used, which is done to keep the trace acquisition time reasonable. In the end, the final attack works well with the CW-lite, although it tends to require a few more traces.

In [None]:
#PLATFORM = 'CWLITE'
PLATFORM = 'CWPRO'
#PLATFORM = 'CWHUSKY'

## Capture Setup

Setup is somewhat similar to other targets. This time, however, we'll be using an external clock (from the FPGA). We'll also do the rest of the setup manually:

In [None]:
import chipwhisperer as cw

scope = cw.scope()
scope.adc.offset = 0
scope.adc.basic_mode = "rising_edge"
scope.trigger.triggers = "tio4"
scope.io.tio1 = "serial_rx"
scope.io.tio2 = "serial_tx"
scope.io.hs2 = "disabled"

Next we'll connect to the CW305 board. Here we'll need to specify our bitstream file to load as well as the usual scope and target_type arguments.

Pick the correct bitfile for your CW305 board by setting the `fpga_id` argument to '100t' or '35t'. By setting `force=False`, the bitfile will only be programmed if the FPGA is uninitialized (e.g. after powering up). Change to `force=True` to always program the FPGA (e.g. if you have generated a new bitfile).

In [None]:
target = cw.target(scope, cw.targets.CW305_ECC, fpga_id='100t', force=False) # or fpga_id='35t', as appropriate

With the CW-Pro, each trace can be captured in one go with streaming mode. However this requires the target clock to be no more than 10 MHz.

The CW-Husky can also capture a full trace in streaming mode. It's able to stream faster so we use a 15 MHz clock.

With the CW-Lite, we can only capture 24.4K samples at a time, but on the other hand we can increase the target clock up to 50 MHz, which speeds up the capture.

In [None]:
if PLATFORM == 'CWPRO':
    scope.adc.stream_mode = True
    scope.adc.samples = 1200000
    target.pll.pll_outfreq_set(10E6, 1)
    target._clksleeptime = 150
    scope.gain.db = 20
elif PLATFORM == 'CWHUSKY':
    scope.adc.stream_mode = True
    scope.adc.samples = 1200000
    target.pll.pll_outfreq_set(15E6, 1)
    target._clksleeptime = 100
    scope.gain.db = 20
elif PLATFORM == 'CWLITE':
    scope.adc.samples = 24400
    target.pll.pll_outfreq_set(50E6, 1)
    target._clksleeptime = 30
    scope.gain.db = 30

Husky's ADC has higher latency than CW-Pro's and CW-Lite's, which means that the collected power samples are offset. In this attack, we look at the power consumed on precise clock cycles. In order to reference the same clock cycles for all capture hardware, we use the trigger offset feature to account for the different latency:

In [None]:
if PLATFORM == 'CWHUSKY':
    scope.adc.offset = 3
else:
    scope.adc.offset = 0

Sanity check: make sure we've loaded the right bitfile!

In [None]:
assert (target.get_fpga_buildtime() == '10/13/2020, 09:31' or
        target.get_fpga_buildtime() == '10/22/2020, 13:38')

Next we set all the PLLs. We enable CW305's PLL1; this clock will feed both the target and the CW ADC. As explained [here](http://wiki.newae.com/Tutorial_CW305-1_Building_a_Project#Capture_Setup), **make sure the DIP switches on the CW305 board are set as follows**:
- J16 = 0
- K16 = 1

In [None]:
target.vccint_set(1.0)
# we only need PLL1:
target.pll.pll_enable_set(True)
target.pll.pll_outenable_set(False, 0)
target.pll.pll_outenable_set(True, 1)
target.pll.pll_outenable_set(False, 2)

In [None]:
if PLATFORM == 'CWHUSKY':
    scope.clock.clkgen_freq = 15e6
    scope.clock.clkgen_src = 'extclk'
    scope.clock.adc_mul = 1
else:
    scope.clock.adc_src = "extclk_x1"

In [None]:
# ensure ADC is locked:
scope.clock.reset_adc()
assert (scope.clock.adc_locked), "ADC failed to lock"

Occasionally the ADC will fail to lock on the first try; when that happens, the above assertion will fail (and on the CW-Lite, the red LED will be on). Simply re-running the above cell should fix things.

## Trace Capture
Below is the capture loop. The main body of the loop loads some new multiplication parameters, arms the scope, then finally records and appends our new trace to the `traces[]` list.

Note that the multiplication result is read from the target and compared to the expected results, as a sanity check.

First let's pick a scalar for which we can very easily distinguish ones from zeros. Remember that k is the secret that we want to be able to retrieve with our side-channel attack.

In [None]:
k = 0xffffffffffffffffffffffffffffffff00000000000000000000000000000000

Define a platform-dependent method for capturing traces:

In [None]:
from chipwhisperer.common.traces import Trace
from tqdm import tnrange
import numpy as np
import time
import math 

def get_traces(N=50):
    traces = []
    if PLATFORM == 'CWPRO' or PLATFORM == 'CWHUSKY':
        for i in tnrange(N, desc='Capturing traces'):
            P = target.new_point() # every trace uses a different point
            ret = target.capture_trace(scope, Px=P.x, Py=P.y, k=k, check=True)
            if not ret:
                print("Failed capture")
                continue
            traces.append(ret)

    elif PLATFORM == 'CWLITE':
        #segments = math.ceil(target.pmul_cycles / scope.adc.samples)
        segments = 1
        for i in tnrange(N, desc='Capturing traces'):
            scope.adc.offset = 0
            wave = np.array([])
            for j in range(segments):
                P = target.new_point() # every trace uses a different point
                ret = target.capture_trace(scope, Px=P.x, Py=P.y, k=k)
                if not ret:
                    print("Failed capture")
                    continue
                wave = np.append(wave, ret.wave)
                scope.adc.offset += scope.adc.samples

            traces.append(Trace(wave[1:], ret.textin, ret.textout, None))

    return traces

We just need a single trace to start with:

In [None]:
traces = get_traces(1)

## Building the Attack: Trace Analysis

In the following, we build up the attack from scratch. In this way, while we are developing an attack which is very specific to our target, we show the methods you would use to build an attack for a different target.

Let's begin by looking at a single trace, starting with the first 20k cycles only (you can plot the full trace, but that will be very slow because the trace is long!).

In [None]:
from bokeh.plotting import figure, show
from bokeh.resources import INLINE
from bokeh.io import output_notebook

output_notebook(INLINE)

In [None]:
p = figure(plot_width=2000)

samples = 20000
#samples = len(traces[0].wave)
xrange = range(samples)
p.line(xrange, traces[0].wave[:samples], line_color="red")

In [None]:
show(p)

### Simulation

There seems to be a very strong periodicity to the trace. We can confirm this by simulating the target core and looking at what it's actually doing.

If you want to go through the whole process, install the [iverilog simulator](http://iverilog.icarus.com/) (Ubuntu: `apt-get install iverilog`) and follow along below; otherwise skip ahead to the next section, **"Finding Ones and Zeros"**.

The next step runs the simulation and takes several minutes; you can see that it's still alive by looking at the `make.log` file. Once it's done, you'll see its output here.

In [None]:
%%bash
cd ../../hardware/victims/cw305_artixtarget/fpga/vivado_examples/ecc_p256_pmul/sim/
make DUMP=1 WAVEFORMAT=vcd

This produces a simulation waveform `../../hardware/victims/cw305_artixtarget/fpga/vivado_examples/ecc_p256_pmul/sim/results/tb.fst` which you can look at with gtkwave.

What we're going to do is record at what times the multiplication core's internal `bit_counter` changes, which tells us when the core is processing which bit of the secret k scalar.

We can automatically extract these event times with the vcdvcd package (https://github.com/cirosantilli/vcdvcd). Unfortunately this step needs to ingest the full 2.7G file all at once, so it's also very slow (again if you're impatient you can skip ahead to the next section).

In [None]:
from vcdvcd import VCDVCD
vcd = VCDVCD('../../hardware/victims/cw305_artixtarget/fpga/vivado_examples/ecc_p256_pmul/sim/results/tb.fst')

Now that we've ingested the simulation waveform, extracting event times from it is almost instantaneous:

In [None]:
kbittimes = vcd['tb.U_dut.U_curve_mul_256.bit_counter[7:0]']
cyclecounts = vcd['tb.cycle_count[31:0]']

With a bit of Python magic we build up the `cycles` array, which contains the clock cycle number for when each bit of k is processed, relative to the start of the point multiplication operation:

In [None]:
cycles = []
deltas = []
for i in range(1,257):
    cycles.append(int(cyclecounts[kbittimes.tv[i][0]],2))
    if (i > 1):
        deltas.append(cycles[-1] - cycles[-2])

One thing we can see right away is that each bit takes *exactly* 4204 cycles to process. So, no timing attacks here: the operation is rock-solid time-constant.
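Because the per-bit duration is constant, the whole `cycles` array is determined by the start of the first bit. The sketch below rebuilds an array of that shape; `make_cycles` and the starting offset of 100 are hypothetical, and the real starting offset should be taken from the simulation.

```python
import numpy as np

BIT_CYCLES = 4204   # cycles per bit of k, as measured in the simulation
N_BITS = 256        # P-256 scalar size

def make_cycles(first_bit_start, n_bits=N_BITS, bit_cycles=BIT_CYCLES):
    """Start cycle of each bit, assuming a fixed number of cycles per bit."""
    return first_bit_start + bit_cycles * np.arange(n_bits)

# 100 is a made-up offset, used only to show the shape of the result.
cycles_sketch = make_cycles(first_bit_start=100)
print(cycles_sketch[:3], set(np.diff(cycles_sketch)))
```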

Go ahead and try with different values of k and P if you want; don't bother with the lengthy waveform generation and extraction, just look at the `trace.textout['cycles']` attribute to see how many clock cycles the job took (as measured by the scope looking at the target's trigger signal).

In [None]:
min(deltas), max(deltas)

In [None]:
traces[0].textout['cycles']

We only need to do this lengthy step once, so let's save the results:

In [None]:
import numpy as np
import os

cycles_file = 'data/ecc_cycles.npy'
# avoid overwriting:
if not os.path.exists(cycles_file):
    np.save(cycles_file, cycles)

### Finding Ones and Zeros:

The previously saved `data/ecc_cycles.npy` tells us at which clock cycle the target core is processing each bit of k. If you skipped over the previous section, carry on from here.

We begin by loading the array which tells us on which clock cycle processing begins for every bit of k:

In [None]:
import numpy as np
cycles = np.load('data/ecc_cycles.npy')

Let's overlay the power trace from a few different bits of k, including both ones and zeros:

In [None]:
r = figure(plot_width=2000)

samples = 4204
xrange = range(samples)
for i, color in zip([10, 20, 30, 200, 210, 220], ['red', 'green', 'blue', 'orange', 'purple', 'brown']):
    r.line(xrange, traces[0].wave[cycles[i]:cycles[i]+samples], line_color=color)

In [None]:
show(r)

The peaks line up *perfectly*, and the different bits appear indistinguishable.

Of course, side-channel attacks work by picking up the smallest of differences, so we're not done yet...

Our next step is to average the power trace for all k=1 bits and all k=0 bits **from a single multiplication trace** to see if we can spot any differences:

In [None]:
# pick any trace here:
trace = traces[0]

In [None]:
avg_trace = np.zeros(samples)

for start in cycles[1:]:
    avg_trace += trace.wave[start:start+samples]

avg_trace /= len(cycles[1:])

In [None]:
avg_ones = np.zeros(samples)

for start in cycles[1:128]:
    avg_ones += trace.wave[start:start+samples]

avg_ones /= len(cycles[1:128])  # 127 windows, since bit 0 is skipped

In [None]:
avg_zeros = np.zeros(samples)

for start in cycles[128:256]:
    avg_zeros += trace.wave[start:start+samples]

avg_zeros /= 128

In [None]:
s = figure(plot_width=2000)

xrange = range(len(avg_trace))
#s.line(xrange, avg_ones, line_color="red")
#s.line(xrange, avg_zeros, line_color="blue")
s.line(xrange, avg_ones - avg_zeros, line_color="orange")

In [None]:
show(s)

**Bingo!** We see substantial differences at the very start and very end of the bit processing. Zoom in around cycle 4202 to quantify the difference; it's not big, but it's there.

Now, remember that the difference we've found here is from the average of 128 measurements.
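To see why averaging helps, here is a small self-contained sketch of the same difference-of-means idea on synthetic data. Everything in it is invented for illustration: the window length, the Gaussian noise model, the leak position (`leak_sample`) and its amplitude. It is not a model of the real target.

```python
import numpy as np

rng = np.random.default_rng(0)

bit_len = 100        # samples per bit (the real target uses 4204)
n_bits = 256
leak_sample = 42     # hypothetical sample where a '1' draws more power
leak = 2.0           # leak amplitude, in units of the noise std deviation

bits = np.array([1] * 128 + [0] * 128)

# One row per bit of k: unit-variance noise, plus the leak for the ones.
windows = rng.normal(0.0, 1.0, size=(n_bits, bit_len))
windows[bits == 1, leak_sample] += leak

# Difference between the average '1' window and the average '0' window.
diff = windows[bits == 1].mean(axis=0) - windows[bits == 0].mean(axis=0)
found = int(np.argmax(np.abs(diff)))
print(found, diff[found])
```

Averaging 128 windows per class reduces the noise on each mean by a factor of about $\sqrt{128} \approx 11$, which is why a leak that is hard to see in a single window stands out in the difference.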

The question is: is the difference seen in the average consistently present for *individual* bits of k? Let's look at that with some interactive plotting.

First let's define a helper function to sum the power samples:

In [None]:
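def get_sums(no_traces):
    # For each bit of k, sum the signed points of interest (global `poi`)
    # over the first `no_traces` traces.
    sums = []
    for c in cycles:
        total = 0  # named `total` to avoid shadowing the built-in sum()
        for trace in traces[:no_traces]:
            for i in poi:
                power = trace.wave[c+abs(i)]
                if i < 0:
                    total -= power
                else:
                    total += power
        sums.append(total)
    return sums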

Then we acquire more traces:

In [None]:
traces = get_traces(50)

If you get a number of warnings stating that the operation took more clock cycles than expected (typically 1 more clock cycle), you may get slightly different results from what's described in this notebook (due to some samples being off by 1).
You should be able to resolve the issue by resetting the ADC DCM (`scope.clock.reset_adc()`) or restarting the notebook.

Let's set up an interactive plot which lets us see whether we can distinguish k bits that are ones from k bits that are zeros, and how many traces might be required to do so reliably:

In [None]:
def update_plot(no_traces):
    SS.data_source.data['y'] = get_sums(no_traces)
    push_notebook()

In [None]:
from ipywidgets import interact, Layout
from bokeh.io import push_notebook

# these are the clock cycles for which we sum the power measurement
# for positive numbers, we add the power measurement at that clock cycle;
# for negative numbers, we subtract the power measurement at abs(clock cycle).
poi = [4202, -6, 7]

# start with a single trace
no_traces = 1

S = figure(plot_width=2000)

xrange = range(len(cycles))
sums = get_sums(no_traces)
SS = S.line(xrange, sums)

In [None]:
show(S, notebook_handle=True)

In [None]:
interact(update_plot, no_traces=(1, len(traces)))

The x-axis of this plot is the index of the k bit being processed by the target; the y-axis is the metric which we hope to use to distinguish ones from zeros. With `poi = [4202, -6, 7]`, the metric is the power at cycles 4202 and 7 minus the power at cycle 6, summed over the selected traces.
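The signed points-of-interest convention can be checked on a toy array. The sketch below is a vectorised single-trace variant of the idea; `poi_metric` is a hypothetical helper, not part of the notebook or of ChipWhisperer. A positive entry adds the sample at that offset and a negative entry subtracts the sample at its absolute value.

```python
import numpy as np

def poi_metric(wave, starts, poi):
    """Signed sum of samples at the given offsets, for each start index."""
    wave = np.asarray(wave, dtype=float)
    starts = np.asarray(starts)
    metric = np.zeros(len(starts))
    for p in poi:
        sign = -1.0 if p < 0 else 1.0
        metric += sign * wave[starts + abs(p)]
    return metric

# Toy trace where sample i has value i, so the arithmetic is easy to follow:
# start 0 gives 5 - 2 + 3, start 10 gives 15 - 12 + 13.
wave = np.arange(100)
result = poi_metric(wave, [0, 10], [5, -2, 3])
print(result)
```

One limitation of encoding the sign in the offset is that offset 0 can never be subtracted.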

Recall that our secret scalar k was set to {128 ones, 128 zeros}. So if our distinguishing metric is good, we expect the first half of the plot to be distinguishable from the second half.

With a single trace, the results aren't great: the two halves are statistically different, but an attacker wouldn't be able to correctly guess all k bits.

But by the time the slider hits about 8 traces, the two halves no longer overlap. With over 15 traces, the two halves are very distinct. We may have a successful side-channel attack!

### Sanity check

Before we declare victory, let's check whether our 0/1 distinguisher still works when k is **not** made of very long strings of 0's and 1's:

In [None]:
k = 0xffffffffffffffff0000000000000000aaaa0000cccc00001111000033330000
traces = get_traces(50)

In [None]:
S2 = figure(plot_width=2000)
poi = [4202, -6, 7]
xrange = range(len(cycles))
sums = get_sums(len(traces))
SS = S2.line(xrange, sums)

In [None]:
show(S2)

**Uh-oh**: when the k bits alternate between 0 and 1 every bit (e.g. bits 128-143), it looks like we get a constant metric that's about halfway between what we get for long strings of ones and long strings of zeros (e.g. bits 0-63 and 64-127).

Let's plot the 3 components of our metric separately:

In [None]:
S3 = figure(plot_width=2000)

no_traces = len(traces)

poi = [4202]
sums = np.asarray(get_sums(no_traces)) + 2    # just a vertical shift for easier visualization 
S4202 = S3.line(xrange, sums, line_color='red')

poi = [-6]
sums = get_sums(no_traces)
S6 = S3.line(xrange, sums, line_color='green')

poi = [7]
sums = get_sums(no_traces)
S7 = S3.line(xrange, sums, line_color='blue')

In [None]:
show(S3)

**Ah-ha!** At bits 128-143, we see that the red curve is offset from the others by one cycle, so when k changes every bit, the changes tend to cancel each other out.

This plot also shows that the red curve appears to have a better signal-to-noise ratio. It also has a more regular behaviour at the very beginning.

Maybe we can proceed with this attack by using only the power measurement at cycle 4202.

### One more thing...

But what about those peaks seen in the first few cycles?

Let's do another sanity check, this time with k not starting with a long string of ones:

In [None]:
k = 0x0000ffffffffffff0000000000000000aaaa0000cccc00001111000033330000
traces = get_traces(50)

In [None]:
S4 = figure(plot_width=2000)

no_traces = len(traces)

poi = [4202]
sums = np.asarray(get_sums(no_traces)) + 2    # just a vertical shift for easier visualization 
S4202 = S4.line(xrange, sums, line_color='red')

poi = [-6]
sums = get_sums(no_traces)
S6 = S4.line(xrange, sums, line_color='green')

poi = [7]
sums = get_sums(no_traces)
S7 = S4.line(xrange, sums, line_color='blue')

In [None]:
show(S4)

Normally, zeros get the lowest score, but when they are at the beginning of k, they get the highest score with cycle 4202; the behaviour with cycles 6 and 7 is stranger still.

This is getting a bit messy: having to distinguish between 3 levels will require a higher SNR. Also, if you repeat the above test with `k=0x1000...`, `k=0x3000...` and similar values, you'll see that it's harder still to properly identify those first few bits.

We're **really** close to a working attack. In fact we could pretty much stop here: we can identify most bits of k, we just have some trouble with the first few; if we omit the unlikely cases where k starts with a very long string of zeros, we could simply and quickly brute force those first few bits.

But let's try something else (promise, this one's going to work): a slightly different approach which will give a cleaner attack, and which will also give us some insight into **why** the leakage is happening.

If you're a hardware designer, then what follows may be the most instructive part of this tutorial.

## ...To the Verilog!

If you're not too scared of a little Verilog, run the Verilog simulation as shown earlier in the notebook and open the simulation waveform in gtkwave. (If you are scared, just skip over to the **"Correlation Attack"** section.)

Then bring up `hardware/victims/cw305_artixtarget/fpga/cryptosrc/cryptech/ecdsa256-v1/rtl/curve/curve_mul_256.v` in a text editor.

Follow the `k_din` input. This is our secret k that we wish to retrieve with the side-channel attack.

`k_din` gets loaded into `k_din_reg`, and its most significant bit is assigned to `move_inhibit`, which in turn goes to `copy_t2r_int`. This last signal is used to enable the writing of intermediate results to the `bram_1rw_1ro_readfirst` memory instances. There are 3 such memories; one for each of the x, y, and z point coordinates.

### Cryptography Detour

To progress from here, a little bit of elliptic curve knowledge is required.

You may have noticed that the point that the target core is tasked to multiply is given with $(x,y)$ coordinates. Why is there a $z$ coordinate now in the source code? Without fully reversing the implementation, it's safe to assume that the target takes the given point from its *affine* $(x,y)$ coordinates and transforms it into *projective* $(x,y,z)$ coordinates. Many (most?) ECC implementations do this because point multiplication is faster in projective coordinates (https://www.nayuki.io/page/elliptic-curve-point-addition-in-projective-coordinates gives a good overview of this).
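As a rough illustration of the coordinate change, the sketch below assumes homogeneous projective coordinates, where $(X, Y, Z)$ represents the affine point $(X/Z, Y/Z)$. The target core may use a different system (Jacobian coordinates, for instance), and the helper names and sample values are made up. The `x` and `y` values are arbitrary field elements rather than an actual curve point, since only the coordinate arithmetic is shown.

```python
# NIST P-256 field prime.
P256_P = 2**256 - 2**224 + 2**192 + 2**96 - 1

def to_projective(x, y):
    """Affine (x, y) -> projective (X, Y, Z) with Z = 1."""
    return (x, y, 1)

def rescale(point, lam, p=P256_P):
    """Multiply all three coordinates by lam; the affine point is unchanged."""
    X, Y, Z = point
    return (X * lam % p, Y * lam % p, Z * lam % p)

def to_affine(point, p=P256_P):
    """Projective -> affine; this is the only step needing an inversion."""
    X, Y, Z = point
    z_inv = pow(Z, -1, p)  # modular inverse (Python 3.8+)
    return (X * z_inv % p, Y * z_inv % p)

x, y = 0x1234, 0x5678                     # arbitrary illustrative values
proj = rescale(to_projective(x, y), lam=0xABCDEF)
print(to_affine(proj) == (x, y))
```

The speed-up comes from deferring the costly modular inversion: intermediate doublings and additions work on $(X, Y, Z)$ without inverting, and a single inversion converts the final result back.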

Now let's load that simulation waveform and look at the write timing on those `bram_1rw_1ro_readfirst` instances by probing `bram_rx_wr_en`, `bram_ry_wr_en` and `bram_rz_wr_en`. If you stare at it for a few minutes you should recognize that the write timings are *identical* for every bit of k, except for the last set of writes which are blocked whenever `move_inhibit` is high.

We can now make a pretty safe guess that the multiplication algorithm used is **double and always add**. Point multiplication in general consists of repeated doublings and adds. In this implementation, for each bit of k, the intermediate result goes through a point doubling and a point add; the result of the point add is discarded if the addition is not required, which is dependent on the value of the k bit being processed. `move_inhibit` is the logic which controls this discarding. This multiplication algorithm is a simple way to achieve time-constant execution, and, depending on implementation details, make it harder for side-channel attacks to identify whether the secret bit being processed is a 1 or a 0.
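The structure of double and always add can be sketched in a few lines. This is a generic illustration and not a transcription of the Verilog: the group operation is passed in as `add`, and integers under addition modulo `M` stand in for curve points so that the result is easy to check. The `discards` list plays the role that the text above attributes to `move_inhibit`.

```python
def double_and_always_add(k, P, add, identity, n_bits):
    """Compute k*P, doing one doubling and one addition for every bit of k."""
    R = identity
    discards = []
    for i in reversed(range(n_bits)):    # most significant bit first
        R = add(R, R)                    # doubling: always performed
        T = add(R, P)                    # addition: always performed
        bit = (k >> i) & 1
        if bit:
            R = T                        # keep the addition result
        discards.append(bit == 0)        # True when the addition is dropped
    return R, discards

# Stand-in group: integers modulo M under addition, so k*P is plain
# multiplication and the answer can be verified directly.
M = 1000003
add_mod = lambda a, b: (a + b) % M

result, discards = double_and_always_add(0b1011, 7, add_mod, 0, 4)
print(result, discards)
```

The number of doublings and additions is the same for every k; only the final keep-or-discard step depends on the secret bit, and that step is what the attack targets.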

We now have a decent (and hopefully correct!) understanding of the implementation. For the purpose of side-channel attacks, we are now reasonably certain that the target does *exactly* the same thing independent of k, *except* for the storage operation which is masked via `move_inhibit`, depending on k.

The set of 8 writes that are blocked by `move_inhibit` occurs on clock cycles 4195 to 4203 (relative to the processing of each bit of k). *Hmm,* 4195-4203... do these numbers sound familiar? Recall that 4202 is the clock cycle where we noted a statistical difference between processing a 1 versus a 0!

We know from our first attempt that looking at only the last set of writes doesn't lead to a clean attack. But now that we understand what's happening at those clock cycles, we can try something else to leverage the leakage that we've found. On the simulation waveform, we can look at the next time that the `bram_1rw_1ro_readfirst` memory instances are read (after the possibly blocked memory write). Here's the idea: if `move_inhibit` was not set, then the next memory read will return what was written at cycles 4195-4203; otherwise, it will return something else. The correlation between the power samples at those two points in time might be able to tell us whether `move_inhibit` was set or not.

## Correlation Attack

First let's define the cycle offsets where the memory read and writes that we're interested in are occurring.

The x/y/z writes happen simultaneously (at cycle `rupdate_offset`), but the three coordinates are read at three different times (`r[x|y|z]read_offset`).

The correlation is computed over `rupdate_cycles = 8` clock cycles, because that's how many clock cycles it takes to read or write an intermediate $R_x$, $R_y$ or $R_z$ value (256-bit values into 32-bit wide memories).
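Before running this on real traces, it may help to see what the window correlation measures. The sketch below uses a hypothetical helper, `window_corr`, on a toy array: a read window that is a scaled and shifted copy of the write window correlates at +1, and an inverted copy at -1. The sample values and window positions are invented.

```python
import numpy as np

def window_corr(wave, a, b, n=8):
    """Pearson correlation between wave[a:a+n] and wave[b:b+n]."""
    return np.corrcoef(wave[a:a + n], wave[b:b + n])[0][1]

# 8 samples per window: a 256-bit value moved through a 32-bit wide memory.
write = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

wave = np.zeros(40)
wave[0:8] = write              # toy "write" window
wave[16:24] = 2 * write + 1    # read window with the same shape
wave[32:40] = -write           # read window with the inverted shape

c_same = window_corr(wave, 0, 16)
c_inverted = window_corr(wave, 0, 32)
print(c_same, c_inverted)
```

Pearson correlation ignores gain and offset, so it responds to the shape of the 8-sample pattern rather than its absolute level. On real traces the values will be far from ±1, which is why the scores are averaged over several traces.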

In [None]:
rupdate_offset = 4195
rupdate_cycles = 8
rxread_offset = 205
ryread_offset = 473
rzread_offset = 17

Now we compute the correlations:

In [None]:
t = len(traces)
corrsxonly = []
corrsyonly = []
corrszonly = []

for i in range (0, len(cycles)-1):
    corrx = 0
    corry = 0
    corrz = 0

    for trace in traces[:t]:
        corrx += np.corrcoef(trace.wave[cycles[i]+rupdate_offset:cycles[i]+rupdate_offset+rupdate_cycles], trace.wave[cycles[i+1]+rxread_offset:cycles[i+1]+rxread_offset+rupdate_cycles])[0][1]
        corry += np.corrcoef(trace.wave[cycles[i]+rupdate_offset:cycles[i]+rupdate_offset+rupdate_cycles], trace.wave[cycles[i+1]+ryread_offset:cycles[i+1]+ryread_offset+rupdate_cycles])[0][1]
        corrz += np.corrcoef(trace.wave[cycles[i]+rupdate_offset:cycles[i]+rupdate_offset+rupdate_cycles], trace.wave[cycles[i+1]+rzread_offset:cycles[i+1]+rzread_offset+rupdate_cycles])[0][1]

    corrsxonly.append(corrx/t)
    corrsyonly.append(corry/t)
    corrszonly.append(corrz/t)


In [None]:
C = figure(plot_width=2000)

xrange = range(len(corrsyonly))

C.line(xrange, corrsxonly, line_color="orange")
C.line(xrange, corrsyonly, line_color="purple", line_width=3)
C.line(xrange, corrszonly, line_color="brown")

In [None]:
show(C)

We appear to have good results with 50 traces! The correlation with the y-coordinate read is very good; the x-coordinate read has a strange peak at the start that we don't want to deal with, as well as a lower SNR, and the z-coordinate read appears to have no correlation whatsoever.

For the attack, we'll use correlation from the y-coordinate read only.

The image below illustrates how to define the thresholds that we'll use to identify ones and zeros. When $k$ starts with a leading zero, the correlation scores behave differently until the first one is encountered, so we use two thresholds to decide whether each bit is a one or a zero.

![Thresholds](img/ECC_threshold.png)
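The two-threshold rule can be tried on a handful of made-up scores. The sketch below mirrors the decision logic used later in this notebook; the function name `decide_bits` and the toy correlation values are invented for illustration.

```python
def decide_bits(corrs, initial_threshold, regular_threshold):
    """Classify each score; switch thresholds after the first '1' is seen."""
    threshold = initial_threshold
    guess = ''
    for c in corrs:
        if c > threshold:
            guess += '0'
        else:
            guess += '1'
            threshold = regular_threshold  # stays in effect from here on
    return guess

# Toy scores: -0.01 is above the initial threshold but below the regular one,
# so its classification depends on whether a '1' has already been seen.
toy_corrs = [0.1, 0.05, -0.3, -0.01, 0.2, -0.5]
bits = decide_bits(toy_corrs, initial_threshold=-0.02, regular_threshold=0)
print(bits)
```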

# The Attack

Finally: here is the attack in full. We start by repeating the trace acquisition, this time with a non-trivial value for k.

In [None]:
k = 0x70a12c2db16845ed56ff68cfc21a472b3f04d7d6851bf6349f2d7d5b3452b38a
#k = random_k()
traces = get_traces(30)

We'll repeat some previous cells that are needed here, so that you don't have to run through the whole notebook in order for this to work:

In [None]:
from tqdm import tnrange
import numpy as np
import time
cycles = np.load('data/ecc_cycles.npy')
rupdate_offset = 4195
rupdate_cycles = 8
ryread_offset = 473

### Define the decision thresholds:

If these parameters don't work for you, go back up to the last plot before this section and pick appropriate thresholds based on what you see on your own plot.

In [None]:
if PLATFORM == 'CWPRO':
    initial_threshold = -0.02
    regular_threshold = 0
elif PLATFORM == 'CWLITE':
    initial_threshold = -0.38
    regular_threshold = -0.23
elif PLATFORM == 'CWHUSKY':
    # 15 MHz:
    initial_threshold = -0.35
    regular_threshold = -0.39

### Compute the correlations:

In [None]:
corrs = []
attack_traces = len(traces)

for i in range (0, len(cycles)-1):
    corr = 0
    for trace in traces[:attack_traces]:
        corr += np.corrcoef(trace.wave[cycles[i]+rupdate_offset:cycles[i]+rupdate_offset+rupdate_cycles], trace.wave[cycles[i+1]+ryread_offset:cycles[i+1]+ryread_offset+rupdate_cycles])[0][1]
    corr /= attack_traces # normalize so that decision thresholds remain constant if we change attack_traces
    corrs.append(corr)


### Guess k one bit at a time:

In [None]:
threshold = initial_threshold
guess = ''
for kbit in range(255):
    if corrs[kbit] > threshold:
        guess += '0'
    else:
        guess += '1'
        threshold = regular_threshold

Since our decision metric for bit $i$ calculates correlation with events in processing bit $i+1$, we cannot use it to guess the last bit of $k$.

But since there are only two possibilities, we simply check which of the two is correct:

In [None]:
guesses = [int(guess + '0', 2), int(guess + '1', 2)]

if k in guesses:
    print('Guessed right!')
else:
    print('Attack failed.')
    print('Guesses: %s' % hex(guesses[0]))
    print('         %s' % hex(guesses[1]))
    print('Correct: %s' % hex(k))
    wrong_bits = []
    for kbit in range(255):
        if int(guess[kbit]) != ((k >> (255-kbit)) & 1):
            wrong_bits.append(255-kbit)
    print('%d wrong bits: %s' % (len(wrong_bits), wrong_bits))

Go back and reduce the number of traces used to see how many are required (`attack_traces` variable in the correlation calculation cell).

You should see the attack succeed with as few as 8 traces. What's perhaps surprising is that with just a single trace, a large percentage of the bits are guessed correctly. There are attacks which allow the full k to be recovered from partial knowledge of k (see for example https://link.springer.com/article/10.1023/A:1025436905711), but that's a lot more math-heavy and beyond the scope of this tutorial.
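When experimenting with fewer traces, a bit-error count is a convenient way to compare runs. Here is a minimal sketch; `count_wrong_bits` is a hypothetical helper, and the 8-bit values only keep the example readable.

```python
def count_wrong_bits(guess, k, n_bits=256):
    """Number of bit positions where guess and k differ (Hamming distance)."""
    mask = (1 << n_bits) - 1
    return bin((guess ^ k) & mask).count('1')

# 8-bit toy example: the two values differ in bit 5 and bit 0.
k_true = 0b10110010
k_guess = 0b10010011
wrong = count_wrong_bits(k_guess, k_true, n_bits=8)
print(wrong, 1 - wrong / 8)
```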

Go ahead and repeat the attack for different values of k. Note that k must be nonzero and must be less than the curve order; here is a function to generate a random valid k:

In [None]:
def random_k(bits=256, tries=100):
    import random
    for i in range(tries):
        k = random.getrandbits(bits)
        if k < target.curve.order and k > 0:
            return k
    raise ValueError("Failed to generate a valid random k after %d tries!" % tries)

# Epilogue: Black vs White Box

This attack was developed and presented as a white-box attack: we looked at the source code, we ran simulations, we extracted points of interest from simulation waveforms. From an educational point of view, this approach allowed us to better show how and why the attack works. It's the best case for offense, and the worst case for defense.

But it raises the question: would an adversary be successful in a black-box scenario? It can be quickly shown that the answer is yes.

Intuitively, the periodicity of the raw power trace means the attacker will be able to identify the processing times for each bit of k. These do not need to be precise; they only need to be consistent (i.e. a fixed number of clock cycles offset from what the real times are). This RSA notebook shows an example of how one might do this: `jupyter/archive/PA_SPA_2-RSA_on_XMEGA_8bit.ipynb`
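One possible way to find the per-bit period without access to the simulation is autocorrelation. The sketch below is an assumption about how this could be done and is not the method used in the RSA notebook; `estimate_period` and the synthetic trace are invented for illustration.

```python
import numpy as np

def estimate_period(wave, min_lag, max_lag):
    """Lag in [min_lag, max_lag) with the strongest autocorrelation."""
    x = np.asarray(wave, dtype=float)
    x = x - x.mean()
    # Keep only the non-negative lags of the full autocorrelation.
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    return int(np.argmax(ac[min_lag:max_lag])) + min_lag

# Synthetic trace: a random 50-sample pattern repeated 40 times, plus noise.
rng = np.random.default_rng(1)
period = 50
pattern = rng.normal(size=period)
wave = np.tile(pattern, 40) + rng.normal(scale=0.1, size=period * 40)

estimated = estimate_period(wave, min_lag=10, max_lag=80)
print(estimated)
```

The search range has to exclude lag 0 and multiples of the true period. `np.correlate` is quadratic in the trace length, so on a real trace of over a million samples it would be more practical to run this on a short excerpt spanning a few bit periods.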

With these times in hand, an attack could be as simple as computing the correlation across the different bits of k. The attacker doesn't need to identify exactly when `move_inhibit` is occurring: by computing the correlation over the full bit processing, that event will be included. It is likely that more traces will be required -- how many is left as an exercise to the reader.

# What's Next

You could choose to stop here, but hopefully you've found this fascinating enough that you'll carry on to part 2, where we improve the attack and introduce new ways to measure its performance.