# Speedup Analysis

This notebook **explores** how the following parameters affect the **performance** of the application:
 - **Number of threads**: the parallelization factor
 - **Batch size**: the number of events to gather before processing them in parallel

### Application interface
The `soccer-monitoring` application exposes the following command-line interface:

```bash
$ soccer-monitoring -h
DEBS 2013 - Soccer Monitoring tool:
  -h [ --help ]                   Print this message
  -T [ --time-units ] arg         Frequency of statistics (in seconds)
  -K [ --max-distance ] arg       Maximum distance for ball possession 
                                  eligibility
  -s [ --stream ] arg             Game stream file path
  -m [ --metadata ] arg           Metadata file path
  -t [ --threads ] arg (=0)       Number of threads
  -B [ --batch-size ] arg (=1500) Events batch size (default: auto)
  -o [ --output ] arg             Output file path (default: stdout)
```

### Design of experiment
To explore the parameter space, we measure the _average_ time required to process a batch of events as each parameter varies. For the **number of threads** we test $nb\_threads \in \left \{1, 2, 4, 8 \right \}$, and for the **batch size** we test $batch\_size \in \left \{1, 10, 10^2, \dots, 10^7 \right \}$. Each parameter is varied on its own: the thread sweep fixes the batch size at $10^6$, and the batch-size sweep fixes the number of threads at 4.

The other parameters are kept fixed:
 - Frequency of statistics = $60 s$
 - Maximum distance for ball possession eligibility = $1 m$
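
The two sweeps can be written down explicitly. The sketch below lists the (threads, batch size) pairs, assuming what the code cells of this notebook do: the thread sweep fixes the batch size at $10^6$, the batch-size sweep fixes the thread count at 4, and batch sizes run from $10^0$ to $10^7$. The names `FIXED_BATCH`, `FIXED_THREADS`, `thread_runs`, and `batch_runs` are introduced here for illustration only.

```python
# Illustrative names only; the values mirror the notebook's code cells.
nb_threads = [1, 2, 4, 8]
batch_size = [10 ** x for x in range(8)]

FIXED_BATCH = 10 ** 6   # batch size used while sweeping the thread count
FIXED_THREADS = 4       # thread count used while sweeping the batch size

# One-factor-at-a-time design: only one parameter changes per sweep.
thread_runs = [(t, FIXED_BATCH) for t in nb_threads]
batch_runs = [(FIXED_THREADS, b) for b in batch_size]

for threads, batch in thread_runs + batch_runs:
    print('-t', threads, '-B', batch)
```

This gives 4 + 8 = 12 runs rather than the 32 a full grid would need, at the cost of not observing interactions between the two parameters.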
 
### Equipment
MacBook Pro (Mid 2012) with an Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz (4 cores, 8 hardware threads).

In [31]:
import numpy as np
import pandas as pd
import re
import subprocess

%matplotlib ipympl

In [32]:
# Base command line: every parameter except threads and batch size stays fixed.
args = ['../cmake-build-release/soccer-monitoring',
        '-T', '60',
        '-K', '1',
        '-s', '../test/resources/game_data_start_10_1e7',
        '-m', '../datasets/preprocessed/metadata']
nb_threads = [1, 2, 4, 8]
batch_size = [10 ** x for x in range(0, 8)]
# Captures the elapsed time (in seconds) from each progress line of the tool.
pattern = re.compile(r'Processed \d+ seconds of the stream \(~ \d+ events\) in (\d+\.\d+) seconds')
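
As a quick check of what the pattern extracts, the sketch below applies it to a hand-written sample. The sample text is hypothetical: it was constructed to fit the regular expression, and the real output of `soccer-monitoring` may differ in detail.

```python
import re

pattern = re.compile(
    r'Processed \d+ seconds of the stream \(~ \d+ events\) in (\d+\.\d+) seconds'
)

# Hypothetical output, written by hand to fit the pattern above.
sample_stdout = (
    'Processed 60 seconds of the stream (~ 120000 events) in 0.532 seconds\n'
    'Processed 60 seconds of the stream (~ 118500 events) in 0.498 seconds\n'
)

# The single capture group holds the elapsed time of each line.
times = [float(m.group(1)) for m in pattern.finditer(sample_stdout)]
print(times)
```

Note that the capture group requires a decimal point, so a line reporting a whole number of seconds (for example `in 1 seconds`) would be skipped silently.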

## Number of threads analysis

In [39]:
# Sweep the thread count with the batch size fixed at 10^6 events.
args_t = args + ['-B', '1000000']
# One row per thread count; each run is expected to report 12 timings.
data_t = np.zeros((len(nb_threads), 12))
for i, threads in enumerate(nb_threads):
    out = subprocess.run(args=args_t + ['-t', str(threads)],
                         check=True,
                         capture_output=True,
                         text=True)
    ts = [float(m.group(1)) for m in pattern.finditer(out.stdout)]
    data_t[i] = ts

In [40]:
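def make_threads_df(data, nb_threads):
    # Trimmed mean: drop the fastest and slowest timing of each run.
    sorted_data = np.sort(data, axis=1)
    batch_time = np.mean(sorted_data[:, 1:-1], axis=1)
    columns = ['BatchTime']
    index = [str(t) for t in nb_threads]

    df = pd.DataFrame(data=batch_time, columns=columns, index=index)

    # Relative gain over the single-threaded run, in percent.
    speedup = ((df.loc['1'] / df) - 1) * 100
    # Small offset, presumably so the baseline bar (0 %) stays visible.
    speedup.loc['1'] += 0.2
    speedup = speedup.rename(columns={'BatchTime': 'Speedup'})
    return df, speedup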

exec_time_t, speedup_t = make_threads_df(data_t, nb_threads)

exec_time_t.plot.bar(y='BatchTime', legend=False, title='Batch time (seconds)', rot=0)
speedup_t.plot.bar(y='Speedup', legend=False, title='Speedup (%)', rot=0)

[Figures: bar charts of batch time (seconds) and speedup (%) by number of threads]
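
To make the metric concrete, here is a small worked example of the trimmed mean and of the speedup computed as a relative gain, $t_1 / t_p - 1$. It is a plain-Python sketch with made-up timings, not measurements from the application, and `trimmed_mean` is a helper named here for illustration.

```python
from statistics import mean

def trimmed_mean(times):
    """Average after dropping the single smallest and largest value."""
    return mean(sorted(times)[1:-1])

# Made-up timings in seconds, keyed by number of threads.
runs = {
    1: [4.0, 1.0, 2.0, 3.0, 10.0],
    2: [0.5, 1.0, 1.5, 2.0, 9.0],
}

batch_time = {threads: trimmed_mean(ts) for threads, ts in runs.items()}
# Relative gain over the 1-thread baseline: 0.0 means no gain, 1.0 means 100 %.
relative_gain = {threads: batch_time[1] / t - 1 for threads, t in batch_time.items()}

print(batch_time)
print(relative_gain)
```

With these numbers the outliers 10.0 and 9.0 are discarded, the trimmed means are 3.0 s and 1.5 s, and the two-thread run shows a relative gain of 1.0, that is, 100 %.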

## Batch size analysis

In [35]:
# Sweep the batch size with the thread count fixed at 4.
args_b = args + ['-t', '4']
# One row per batch size; each run is expected to report 12 timings.
data_b = np.zeros((len(batch_size), 12))
for i, size in enumerate(batch_size):
    out = subprocess.run(args=args_b + ['-B', str(size)],
                         check=True,
                         capture_output=True,
                         text=True)
    ts = [float(m.group(1)) for m in pattern.finditer(out.stdout)]
    data_b[i] = ts

In [37]:
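def make_batchsize_df(data):
    # Trimmed mean: drop the fastest and slowest timing of each run.
    sorted_data = np.sort(data, axis=1)
    batch_time = np.mean(sorted_data[:, 1:-1], axis=1)
    columns = ['BatchTime']
    # Labels 1e+0 .. 1e+7, matching the eight values in batch_size.
    index = ['1e+{}'.format(x) for x in range(0, 8)]

    df = pd.DataFrame(data=batch_time, columns=columns, index=index)

    # Relative gain over the batch-size-1 run, in percent.
    speedup = ((df.loc['1e+0'] / df) - 1) * 100
    # Small offset, presumably so the baseline bar (0 %) stays visible.
    speedup.loc['1e+0'] += 2.0
    speedup = speedup.rename(columns={'BatchTime': 'Speedup'})
    return df, speedup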

exec_time_b, speedup_b = make_batchsize_df(data_b)

exec_time_b.plot.bar(y='BatchTime', legend=False, title='Batch time (seconds)', rot=0)
speedup_b.plot.bar(y='Speedup', legend=False, title='Speedup (%)', rot=0)

[Figures: bar charts of batch time (seconds) and speedup (%) by batch size]
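
The row labels in `make_batchsize_df` are written out by hand and must stay in step with `batch_size`. One possible way to derive them instead is sketched below, assuming every batch size is an exact power of ten; `power_of_ten_label` is a hypothetical helper, not part of the notebook.

```python
import math

batch_size = [10 ** x for x in range(8)]

def power_of_ten_label(n):
    """Format an exact power of ten as '1e+<exponent>'."""
    exponent = round(math.log10(n))  # round() guards against float error
    return '1e+{}'.format(exponent)

labels = [power_of_ten_label(b) for b in batch_size]
print(labels)
```

Passing such labels as the `index` argument would keep the plot axis consistent if the list of batch sizes changes.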