# Benchmark Results Analysis

## Environment

This project was executed on an Ubuntu 22.04 container with the following specs:

- Host
    - Memory:
        - Capacity: 16GB
        - Speed: 3200MHz
        - SWAP: 2GB
    - CPU:
        - Model: i5-1135G7
        - BaseClock: 2.4GHz
        - Cores: 4 Cores and 8 Threads
        - Turbo: 4.2GHz
        - L1: 128KB
        - L2: 5MB
        - L3: 8MB
    - Disk:
        - Speed: 2.3Gb/s
        - Capacity: 500GB
    - System:
        - Distro: Ubuntu 22.04
- Container:
    - Memory:
        - Capacity: 6GB
        - SWAP: 2GB
    - CPU:
        - Cores: 4 Threads
    - System:
        - Distro: Ubuntu 22.04

## Keras NN vs Numpy BNN

### Data Explanation

In [1]:
import pandas as pd

In [2]:
results = pd.read_csv("data/benchmark.csv")
results.columns = results.columns.str.replace(r"\\", "", regex=False)
results

Unnamed: 0,name,case,time.min,time.max,time.median,time.mean,time.std,time.unit,memory.value,memory.unit,parameters_memory.value,parameters_memory.unit,benchmark.repeat
0,mnist,bit,0.000729,0.012377,0.000918,0.00099,0.000355,seconds,30.492188,kilobytes,53.234375,kilobytes,2000
1,mnist,keras,0.033221,0.123128,0.044749,0.04712,0.009216,seconds,88.515625,kilobytes,1590.609375,kilobytes,2000
2,mnist,comparison,45.584796,9.948382,48.751029,47.602551,25.976789,seconds,2.902895,kilobytes,29.879366,kilobytes,2000
3,housing,bit,0.00118,0.003951,0.001412,0.001521,0.000304,seconds,39.53125,kilobytes,3.59375,kilobytes,2000
4,housing,keras,0.034954,0.117986,0.040935,0.042585,0.007145,seconds,101.625,kilobytes,46.15625,kilobytes,2000
5,housing,comparison,29.626293,29.865567,28.987917,28.006006,23.504357,seconds,2.570751,kilobytes,12.843478,kilobytes,2000
6,iris-8,bit,0.001493,0.018302,0.001763,0.001972,0.000768,seconds,48.25,kilobytes,1.289062,kilobytes,2000
7,iris-8,keras,0.034915,0.175458,0.041313,0.043196,0.008345,seconds,115.539062,kilobytes,1.898438,kilobytes,2000
8,iris-8,comparison,23.387012,9.586824,23.434564,21.904484,10.863874,seconds,2.394592,kilobytes,1.472727,kilobytes,2000
9,iris-4,bit,0.001475,0.013956,0.001738,0.001873,0.000562,seconds,48.03125,kilobytes,1.195312,kilobytes,2000


The table above shows the time and memory consumption of a Keras model built from dense layers and of a custom NumPy implementation of the bit neural network, each benchmarked 2000 times.
This benchmark was executed only on the feed-forward step of the network.

Names refer to the different topologies defined below, each executed in both the bit and the Keras implementations, while `comparison` is the ratio of the Keras metric to the bit metric.
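As a minimal sketch of how a `comparison` row relates to the other two, the snippet below recomputes the ratio for one topology from the rounded medians shown in the table. The small DataFrame is rebuilt by hand for illustration and is not the author's benchmarking code; because the inputs are rounded, the result only approximates the value in the table.

```python
import pandas as pd

# Median times (seconds) for the mnist topology, copied from the table above.
rows = pd.DataFrame({
    "name": ["mnist", "mnist"],
    "case": ["bit", "keras"],
    "time.median": [0.000918, 0.044749],
})

# One row per topology, one column per implementation.
wide = rows.pivot(index="name", columns="case", values="time.median")

# Ratio of Keras to bit: values above 1 mean the bit network is faster.
wide["comparison"] = wide["keras"] / wide["bit"]
print(wide)
```

The same ratio applies to the memory columns, where it reads as a compression rate instead of a speedup.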

In [3]:
import yaml
import pprint
with open("../config/benchmark.yaml") as f:
    topologies = yaml.load(f, yaml.loader.FullLoader)
pprint.pprint(topologies["experiment"]["cases"])

{'all-together': {'inputs': 1024, 'units': [1024, 1024, 1024, 1024, 1024]},
 'housing': {'inputs': 13, 'units': [100, 100, 1]},
 'iris-4': {'inputs': 4, 'units': [4, 4, 4, 3]},
 'iris-8': {'inputs': 4, 'units': [8, 8, 8, 3]},
 'many-inputs': {'inputs': 1024, 'units': [32]},
 'many-layers': {'inputs': 10, 'units': [32, 32, 32, 32, 32]},
 'many-units': {'inputs': 10, 'units': [1024, 1024]},
 'mnist': {'inputs': 784, 'units': [512, 10]},
 'mnist-128': {'inputs': 784, 'units': [128, 10]}}


### Methodology

In the first table, the `max` and `min` values are sometimes far from the `median`. This makes the `mean` a poor metric, since it is pulled toward these outliers. For that reason, we use only the `median` when comparing time.

In [4]:
{
    "max": (results["time.median"] / results["time.max"]).mean(),
    "min": (results["time.median"] / results["time.min"]).mean(),
}

{'max': 0.9274996622624422, 'min': 1.1295207481333718}
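To make the outlier argument concrete, here is a small sketch with made-up timings (the numbers are hypothetical and are not taken from the benchmark). A single slow run more than doubles the mean, while the median stays at the typical value.

```python
import numpy as np

# Hypothetical timings in seconds: nine typical runs and one slow outlier.
timings = np.array([0.0010] * 9 + [0.0120])

mean_time = timings.mean()        # pulled upward by the single outlier
median_time = np.median(timings)  # stays at the typical value

print("mean:  ", mean_time)
print("median:", median_time)
```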

We can also see that the `memory` column differs greatly from the `parameters_memory` column. This happened because `memory` measured the size of the network `object`, while `parameters_memory` is the real size occupied by the network. For that reason, we consider only `parameters_memory`.

In [5]:
results = results[["name", "case", "time.median", "time.std", "parameters_memory.value"]]

## Analysis

In [6]:
results[results["case"] == "comparison"]

Unnamed: 0,name,case,time.median,time.std,parameters_memory.value
2,mnist,comparison,48.751029,25.976789,29.879366
5,housing,comparison,28.987917,23.504357,12.843478
8,iris-8,comparison,23.434564,10.863874,1.472727
11,iris-4,comparison,23.424221,12.975893,1.143791
14,mnist-128,comparison,41.184394,14.559084,28.920545
17,many-inputs,comparison,73.291715,44.967971,28.796848
20,many-layers,comparison,20.295817,16.240625,7.352239
23,many-units,comparison,37.99874,22.135037,29.483966
26,all-together,comparison,15.394573,12.956367,30.998464


In the comparison table above, the time columns show the **speedup**, while `parameters_memory` shows the **size compression rate** of the bit network relative to the Keras one.

### Time

Looking at the comparison table, the time speedup and the memory compression are not proportional to each other. The explanation lies in where the speedup actually comes from.

In [7]:
results[results["case"] == "keras"][["name", "case", "time.median"]]

Unnamed: 0,name,case,time.median
1,mnist,keras,0.044749
4,housing,keras,0.040935
7,iris-8,keras,0.041313
10,iris-4,keras,0.040703
13,mnist-128,keras,0.04055
16,many-inputs,keras,0.040206
19,many-layers,keras,0.042447
22,many-units,keras,0.041572
25,all-together,keras,0.042648


The table above shows that the Keras times do not vary with the size of the network. For instance, take iris-4 and mnist-128: iris-4 has `4*4 + 4*4 + 4*4 + 4*3 = 60` parameters, while mnist-128 has `784*128 + 128*10 = 101632` parameters, yet both take about the same time.
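The parameter counts used in this comparison follow from multiplying the sizes of consecutive layers. A minimal sketch is shown below; `count_weights` is a hypothetical helper introduced here for illustration, and it ignores bias terms, as the counts in the text do.

```python
def count_weights(inputs, units):
    """Number of weights in a dense network, ignoring biases."""
    sizes = [inputs, *units]
    # Each layer contributes fan_in * fan_out weights.
    return sum(fan_in * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

print(count_weights(784, [128, 10]))  # mnist-128 -> 101632
print(count_weights(784, [512, 10]))  # mnist     -> 406528
```

The `inputs` and `units` arguments match the fields of the topology configuration printed earlier, so any case from that dictionary can be passed in directly.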

This happens because the Keras backend (Tensorflow) performs many optimizations on the math operations. Some of the observed ones were:

- Multithreading: Watching the CPU usage while the Keras `predict` method runs shows that more than one core is in use at a time
- SIMD: When importing Keras or Tensorflow, a message reports that the binary uses SSE and AVX to vectorize operations
- Memory Manipulation: Because the Tensorflow operations are implemented with low-level APIs, the code is written with memory usage in mind, avoiding unnecessary allocations and making good use of the cache
- Loop: For the same reason, several transformations can be applied in a single pass over the data. In high-level APIs such as Numpy, every operation is a separate call with its own loop, so chaining multiple transformations means traversing the data once per call
- BLAS: Tensorflow uses mature, long-developed arithmetic libraries in the BLAS family, such as Intel MKL, ATLAS, and BLIS, which makes its operations much more efficient in terms of execution time
- Cycles Per Instruction: The number of CPU cycles required to execute an instruction can influence performance, depending on the hardware executing it

Taken together, these factors suggest that the real bottleneck in the Keras code is preparing the input and output, validation, and other internal logic, not the calculation itself.

In [46]:
import numpy as np
import timeit
a = np.zeros((1000,), dtype=np.uint32)
b = np.ones((1000, 1000), dtype=np.uint32)
time = lambda fn: print("Spent: %f" % (timeit.timeit(fn, number=1000) / 1000))
time(lambda: np.dot(a, b))
time(lambda: np.bitwise_xor(a, b))

Spent: 0.000839
Spent: 0.000366


Look at the times above. XOR is only one of the operations executed in the bit network's feed-forward step, and it takes roughly half the time of the Numpy dot product, which accounts for most of the compute-intensive work in the feed-forward step of a traditional network.
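For context on where XOR fits, the sketch below shows one common way to compute a dot product between two packed +1/-1 vectors with XOR followed by a bit count. This is an assumption about the general technique, not the author's implementation: the helper names are hypothetical, and it packs into bytes with `np.packbits` rather than into `uint32` words.

```python
import numpy as np

def pack_signs(v):
    """Pack a vector of +1/-1 values into bytes: +1 -> bit 1, -1 -> bit 0."""
    return np.packbits(v > 0)

def bit_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of original length n."""
    # XOR marks the positions where the two signs differ.
    diff = np.bitwise_xor(a_bits, b_bits)
    # Count the differing positions; padding bits are 0 in both inputs,
    # so they never show up as differences.
    mismatches = int(np.unpackbits(diff).sum())
    # Matching positions contribute +1 and differing ones -1.
    return n - 2 * mismatches

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=100)
b = rng.choice([-1, 1], size=100)

# The packed result agrees with the ordinary integer dot product.
assert bit_dot(pack_signs(a), pack_signs(b), a.size) == int(a @ b)
```

The point of the design is that one XOR processes many weights at once, which is why its cost matters when compared with the floating-point dot product timed above.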

In [19]:
keras_bit = results[results["case"].isin(["keras", "bit"])]
keras_bit = keras_bit[["name", "case", "time.median", "time.std"]]
keras_bit = keras_bit.assign(pct_std=keras_bit["time.std"] / keras_bit["time.median"])
keras_bit.groupby("case")["pct_std"].mean()

case
bit      0.331002
keras    0.190646
Name: pct_std, dtype: float64

Another interesting point is the standard deviation of the time spent by each network type. The time spent by the bit network varies by 33% of the median, while the Keras network varies by 19%. This may reflect a difference in memory usage, since memory access time is not constant.

### Memory

On the other hand, the memory comparison shows a large difference that grows with the size of the network. Because we used a uint32 as the weights container, the maximum improvement in compression is close to 32x.

In [48]:
results[results["case"] == "comparison"]

Unnamed: 0,name,case,time.median,time.std,parameters_memory.value
2,mnist,comparison,48.751029,25.976789,29.879366
5,housing,comparison,28.987917,23.504357,12.843478
8,iris-8,comparison,23.434564,10.863874,1.472727
11,iris-4,comparison,23.424221,12.975893,1.143791
14,mnist-128,comparison,41.184394,14.559084,28.920545
17,many-inputs,comparison,73.291715,44.967971,28.796848
20,many-layers,comparison,20.295817,16.240625,7.352239
23,many-units,comparison,37.99874,22.135037,29.483966
26,all-together,comparison,15.394573,12.956367,30.998464


You may be wondering why smaller networks have a lower compression rate than larger ones. This is explained by wasted bits. For instance, consider a network with 1 input and a layer with 100 units, which gives a weight matrix of size 100x1. Since we pack the weights into a uint32, i.e. 32 bits, each unit uses only 1 bit of its uint32 word and leaves the other 31 unused. In this specific network there is no compression at all, because the packed weight matrix still has size 100x1.

For smaller networks, using a uint8 or a uint16 would be a better idea if memory is a bottleneck; otherwise it is not worth worrying about. In larger networks, the bit waste problem is barely noticeable because you lose at most 31 bits per unit.
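The bit waste effect can be estimated with a few lines of arithmetic. The sketch below assumes that each unit's weights are packed along the input dimension into whole words and that the uncompressed baseline stores one 32-bit float per weight; `layer_compression` is a hypothetical helper, not part of the benchmarked code.

```python
def layer_compression(inputs, units, word_bits=32, float_bits=32):
    """Estimated compression of one dense layer's weights after bit packing."""
    dense_bits = inputs * units * float_bits
    # Each unit needs a whole number of words, so round up.
    words_per_unit = -(-inputs // word_bits)
    packed_bits = words_per_unit * units * word_bits
    return dense_bits / packed_bits

print(layer_compression(1, 100))               # 1.0: no compression
print(layer_compression(1, 100, word_bits=8))  # 4.0: a smaller word wastes less
print(layer_compression(784, 512))             # 31.36: close to the 32x limit
```

Under these assumptions, the wasted bits per unit are bounded by the word size, so their relative cost shrinks as the number of inputs grows.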

Another factor that keeps us from reaching 32x compression is the presence of a bias vector in the network, because the bias vector is not necessarily binarized along with the weights. This is the case in the networks presented above.
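A rough estimate of how an unbinarized bias lowers the overall rate is sketched below. It assumes one 32-bit bias value per unit that is stored unchanged in both networks, 32-bit float weights in the baseline, and weights packed into 32-bit words per unit. The helper name is hypothetical and the result is an estimate under those assumptions, not a reproduction of the measured values.

```python
def network_compression(inputs, units, bias=True, word_bits=32):
    """Estimated whole-network compression, counted in 32-bit words."""
    dense_words = 0
    packed_words = 0
    fan_in = inputs
    for fan_out in units:
        bias_words = fan_out if bias else 0            # bias stays uncompressed
        words_per_unit = -(-fan_in // word_bits)       # ceiling division
        dense_words += fan_in * fan_out + bias_words   # one word per float weight
        packed_words += words_per_unit * fan_out + bias_words
        fan_in = fan_out
    return dense_words / packed_words

# mnist topology: 784 inputs, layers of 512 and 10 units.
with_bias = network_compression(784, [512, 10], bias=True)
without_bias = network_compression(784, [512, 10], bias=False)
print(with_bias, without_bias)  # about 30.2 and about 31.4
```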

## CPP Float NN vs CPP BNN

### Data Explanation

In [4]:
cpp = pd.read_csv("./data/benchmark_cpp.csv")
cpp.columns = cpp.columns.str.replace(r"\\", "", regex=False)
cpp

Unnamed: 0,name,case,time.min,time.max,time.median,time.mean,time.std,time.unit,benchmark.repeat
0,mnist,bit,5.2e-05,0.000315,5.9e-05,6.4e-05,1.394052e-05,seconds,2000
1,mnist,float,0.001072,0.008807,0.001203,0.001246,0.0003027962,seconds,2000
2,mnist,comparison,20.615385,27.95873,20.389831,19.549087,21.72058,seconds,2000
3,housing,bit,4e-06,0.000109,6e-06,6e-06,3.695287e-06,seconds,2000
4,housing,float,2.9e-05,0.000152,3.4e-05,3.6e-05,7.130329e-06,seconds,2000
5,housing,comparison,7.25,1.394495,5.666667,5.658249,1.929574,seconds,2000
6,iris-8,bit,1e-06,1.4e-05,2e-06,2e-06,6.325433e-07,seconds,2000
7,iris-8,float,1e-06,0.0001,2e-06,2e-06,2.341802e-06,seconds,2000
8,iris-8,comparison,1.0,7.142857,1.0,0.889822,3.702201,seconds,2000
9,iris-4,bit,1e-06,0.00195,2e-06,3e-06,4.356069e-05,seconds,2000


This time we are comparing two custom implementations written in C++ for benchmarking. The bit case is the bit neural network, while the float case is a regular MLP implementation using float weights. The comparison is the ratio of the float network stats to the bit network stats.

The topologies used in this benchmark are the same as those used in the previous benchmark.

### Methodology

The same considerations from the previous section apply here.

## Analysis

### Time

In [5]:
cpp[cpp["case"] == "comparison"][["name", "case", "time.median"]]

Unnamed: 0,name,case,time.median
2,mnist,comparison,20.389831
5,housing,comparison,5.666667
8,iris-8,comparison,1.0
11,iris-4,comparison,0.5
14,mnist-128,comparison,18.882353
17,many-inputs,comparison,17.0
20,many-layers,comparison,3.0
23,many-units,comparison,19.044118
26,all-together,comparison,19.889974


The table above shows that the speedup grows with the size of the network. This is what we expect, since the SIMD nature of the bit neural network greatly reduces the number of operations performed.

In the smaller networks, the fewer the parameters, the smaller the time difference. This happens because the extra operations required by the bit dot product are not offset by the time saved on multiplying and summing the elements as in a traditional network. The problem is thus a by-product of the bit waste effect mentioned before.

### Memory

Memory was not evaluated here because the number of parameters is the same as in the previous experiment, so the results would not differ.

## Optimized CPP Float NN vs CPP BNN

### Data Explanation

In [8]:
opt = pd.read_csv("./data/benchmark_cpp_opt.csv")
opt.columns = opt.columns.str.replace(r"\\", "", regex=False)
opt

Unnamed: 0,name,case,time.min,time.max,time.median,time.mean,time.std,time.unit,benchmark.repeat
0,mnist,bit,9e-06,0.000123,1.1e-05,1.2e-05,4.602197e-06,seconds,2000
1,mnist,float,0.00056,0.00922,0.00064,0.000667,0.000268009,seconds,2000
2,mnist,comparison,62.222222,74.95935,58.181818,57.086895,58.235,seconds,2000
3,housing,bit,1e-06,0.000102,2e-06,2e-06,2.388848e-06,seconds,2000
4,housing,float,1.1e-05,0.000101,1.3e-05,1.4e-05,2.955354e-06,seconds,2000
5,housing,comparison,11.0,0.990196,6.5,7.188883,1.237146,seconds,2000
6,iris-8,bit,0.0,2.4e-05,1e-06,1e-06,7.822404e-07,seconds,2000
7,iris-8,float,0.0,3.2e-05,1e-06,1e-06,8.26844e-07,seconds,2000
8,iris-8,comparison,,1.333333,1.0,0.989076,1.05702,seconds,2000
9,iris-4,bit,0.0,8e-06,1e-06,1e-06,4.393606e-07,seconds,2000


The data presented in this section refers to the same C++ code, this time compiled with the flags `-O3` and `-march=tigerlake`. With these flags the compiler performs stronger optimizations, such as vectorization and better memory management, and builds code targeted at the specific hardware architecture.

### Time

In [10]:
opt[opt["case"] == "comparison"][["name", "case", "time.median"]]

Unnamed: 0,name,case,time.median
2,mnist,comparison,58.181818
5,housing,comparison,6.5
8,iris-8,comparison,1.0
11,iris-4,comparison,1.0
14,mnist-128,comparison,39.0
17,many-inputs,comparison,26.5
20,many-layers,comparison,1.5
23,many-units,comparison,58.896552
26,all-together,comparison,69.553719


The comparison above shows that larger networks tend to have larger speedups. The problem with smaller networks is the same as mentioned before. In this optimized code, however, having many units is the main driver of large speedups, and combining it with many inputs makes the speedup even higher. This means it is neither the number of inputs nor the number of units alone that determines the speedup, but the number of parameters: the more parameters, the more operations a traditional network would need to perform.
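One way to check the relation between parameter count and speedup is to sort the topologies by their number of weights and look at the speedup column. The sketch below does this with the topologies from the configuration and the median speedups from the optimized comparison table; `count_weights` is a hypothetical helper that ignores biases.

```python
import pandas as pd

def count_weights(inputs, units):
    """Number of weights in a dense network, ignoring biases."""
    sizes = [inputs, *units]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

# name: (inputs, units, median speedup from the optimized comparison table)
cases = {
    "mnist": (784, [512, 10], 58.181818),
    "housing": (13, [100, 100, 1], 6.5),
    "iris-8": (4, [8, 8, 8, 3], 1.0),
    "iris-4": (4, [4, 4, 4, 3], 1.0),
    "mnist-128": (784, [128, 10], 39.0),
    "many-inputs": (1024, [32], 26.5),
    "many-layers": (10, [32, 32, 32, 32, 32], 1.5),
    "many-units": (10, [1024, 1024], 58.896552),
    "all-together": (1024, [1024] * 5, 69.553719),
}

table = pd.DataFrame(
    [(name, count_weights(i, u), s) for name, (i, u, s) in cases.items()],
    columns=["name", "weights", "speedup"],
).sort_values("weights")

print(table.to_string(index=False))
# True when the speedup never decreases as the number of weights grows.
print(table["speedup"].is_monotonic_increasing)
```

With these nine cases the speedup never decreases as the weight count increases, which is consistent with the number of parameters being the deciding factor.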

## Sklearn MLP

### Data Explanation

In [11]:
sklearn = pd.read_csv("./data/benchmark_sklearn.csv")
sklearn.columns = sklearn.columns.str.replace("\\", "", regex=False)
sklearn

Unnamed: 0,name,time.min,time.max,time.median,time.mean,time.std,time.unit,memory.value,memory.unit,parameters_memory.value,parameters_memory.unit,benchmark.repeat
0,mnist,0.000113,0.013749,0.000162,0.000231,0.000611,seconds,9547.84375,kilobytes,3180.984375,kilobytes,2000
1,housing,6.7e-05,0.008214,9.9e-05,0.000124,0.000286,seconds,279.976562,kilobytes,91.71875,kilobytes,2000
2,iris-8,6e-05,0.007011,9.9e-05,0.0001,0.000156,seconds,14.0,kilobytes,3.0,kilobytes,2000
3,iris-4,6e-05,0.000373,6.6e-05,7.7e-05,2e-05,seconds,10.78125,kilobytes,1.9375,kilobytes,2000
4,mnist-128,7.2e-05,0.008321,7.9e-05,0.000121,0.000319,seconds,2392.84375,kilobytes,795.984375,kilobytes,2000
5,many-inputs,6.5e-05,0.007593,9.4e-05,0.000115,0.000284,seconds,776.101562,kilobytes,257.078125,kilobytes,2000
6,many-layers,6.4e-05,0.012837,0.000106,0.000121,0.000389,seconds,117.632812,kilobytes,37.578125,kilobytes,2000
7,many-units,0.000231,0.009055,0.000282,0.000333,0.000343,seconds,24895.34375,kilobytes,8296.828125,kilobytes,2000
8,all-together,0.001203,0.046884,0.001402,0.001587,0.001419,seconds,123033.820312,kilobytes,41009.578125,kilobytes,2000


This dataset contains only the benchmark of sklearn's implementation of a regular MLP. We ran this benchmark because most of sklearn's code is written on top of numpy operations and does not require as many preparation steps as Keras. We can therefore compare it with our implementation without worrying too much about the framework's internal logic.

### Time

In [20]:
bit = opt[opt["case"] == "bit"][["name", "time.median"]].rename(columns={"time.median": "time.median.bit"})
sklearn_bit = sklearn[["name", "time.median"]].rename(columns={"time.median": "time.median.skl"}).merge(bit, on="name")
sklearn_bit["time.median.comparison"] = sklearn_bit["time.median.skl"] / sklearn_bit["time.median.bit"]
sklearn_bit

Unnamed: 0,name,time.median.skl,time.median.bit,time.median.comparison
0,mnist,0.000162,1.1e-05,14.744863
1,housing,9.9e-05,2e-06,49.62925
2,iris-8,9.9e-05,1e-06,99.1335
3,iris-4,6.6e-05,1e-06,66.219498
4,mnist-128,7.9e-05,4e-06,19.874376
5,many-inputs,9.4e-05,2e-06,47.195499
6,many-layers,0.000106,2e-06,52.9315
7,many-units,0.000282,2.9e-05,9.71169
8,all-together,0.001402,0.000121,11.582694


In this case we can also see a speedup of the bit implementation over the sklearn one. However, it is not as significant as in the last example. This can be explained by looking at the smaller cases. Take iris-4 and many-inputs, for instance: the speedup over sklearn is not proportional to the bit waste problem or to the network size. This is explained by the data validation and other inner logic performed by the framework. Although sklearn has less pre-predict logic, it is still there and consumes time proportional to the size of the input. The all-together and many-units cases point to the same conclusion: when computation makes up most of the prediction time, the speedup is not as high as when we have many inputs.

The reason we do not see real speedups much greater than 10x is the same as the one given for Keras.