This notebook describes scalability in data storage using two approaches: scale-out and scale-up.

* Scale-out provides more volumes by adding more instances.
* Scale-up attaches more volumes to the same number of instances.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display, HTML
%matplotlib inline

### Experiment setup

* VM instance type: vm.denseio2.24
* terasort data size: 600 GB

# Scale out

Adding instances reduces total execution time, but efficiency does not improve: per-mount throughput falls from about 50.6 MB/s on 3 VMs to about 28.9 MB/s on 11 VMs.
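The gap between faster runs and lower efficiency can be quantified as a scaling efficiency. The sketch below is an illustration only: it uses the terasort times from this experiment, takes the 3-VM run as the baseline, and compares the observed speedup with the ideal linear speedup. The function name `scaling_efficiency` and the choice of baseline are assumptions made for this example.

```python
# terasort wall-clock times (seconds) from the scale-out runs, keyed by VM count.
terasort_seconds = {3: 987.929, 7: 579.816, 11: 472.102}


def scaling_efficiency(times, baseline_vms):
    """Return {vms: (speedup, efficiency)} relative to the baseline cluster size.

    speedup    = baseline time / time at this cluster size
    efficiency = speedup / (vms / baseline_vms); 1.0 means linear scaling
    """
    base_time = times[baseline_vms]
    result = {}
    for vms, seconds in sorted(times.items()):
        speedup = base_time / seconds
        ideal_speedup = vms / baseline_vms
        result[vms] = (speedup, speedup / ideal_speedup)
    return result


efficiency = scaling_efficiency(terasort_seconds, baseline_vms=3)
for vms, (speedup, eff) in efficiency.items():
    print(f"{vms:>2} VMs: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Under these assumptions the efficiency is roughly 73% at 7 VMs and 57% at 11 VMs, which matches the falling per-mount throughput.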

In [2]:
# Throughput (MB/s) = 600 GB / terasort seconds / number of NVMe mounts * 1000
data_o = {
    "12 local NVMe on 3 vms": {
        "teragen (second)": 160.052,
        "terasort (second)": 987.929,
        "Throughput (MB/s)": 600 / 987.929 / 12 * 1000,
    },
    "28 local NVMe on 7 vms": {
        "teragen (second)": 96.510,
        "terasort (second)": 579.816,
        "Throughput (MB/s)": 600 / 579.816 / 28 * 1000,
    },
    "44 local NVMe 11 vms": {
        "teragen (second)": 84.743,
        "terasort (second)": 472.102,
        "Throughput (MB/s)": 600 / 472.102 / 44 * 1000,
    },
}
df_o = pd.DataFrame.from_dict(data_o, orient='index')
df_o

Configuration,teragen (second),terasort (second),Throughput (MB/s)
12 local NVMe on 3 vms,160.052,987.929,50.610924
28 local NVMe on 7 vms,96.51,579.816,36.957537
44 local NVMe 11 vms,84.743,472.102,28.884359


* Throughput is the terasort throughput per HDFS mount (e.g. /dataN): 600 GB divided by the terasort time and by the number of NVMe mounts, expressed in MB/s.
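As a minimal sketch, the per-mount figure can be recomputed with a small helper. The helper name `per_mount_throughput_mb_s` is an assumption for this example, and 1 GB is taken as 1000 MB, as in the notebook's own formula.

```python
def per_mount_throughput_mb_s(data_gb, seconds, mounts):
    """Average throughput per HDFS mount in MB/s, taking 1 GB as 1000 MB."""
    return data_gb * 1000 / seconds / mounts


# (terasort seconds, number of local NVMe mounts) for each scale-out run.
scale_out_runs = {
    "3 vms": (987.929, 12),
    "7 vms": (579.816, 28),
    "11 vms": (472.102, 44),
}

throughput = {
    name: per_mount_throughput_mb_s(600, seconds, mounts)
    for name, (seconds, mounts) in scale_out_runs.items()
}
for name, mb_s in throughput.items():
    print(f"{name}: {mb_s:.2f} MB/s per mount")
```

The results should agree with the Throughput column of the table above.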

# Scale up

Scale-up would normally be done by moving to a larger server instance type, e.g. bm.denseio2.52. The current experiment instead attaches block volumes to provide more volumes per instance. Each block volume is provisioned at 667 GB, the size that provides the maximum performance, i.e. 25k IOPS and 320 MB/s throughput.

In [3]:
data_u = {
    "12 local NVMe + 18x 667GB blocks on 3 vms": {
        "teragen (second)": 170.017,
        "terasort (second)": 1025.434,
    },
    "28 local NVMe + 42x 667GB blocks on 7 vms": {
        "teragen (second)": 99.969,
        "terasort (second)": 508.182,
    },
    "44 local NVMe + 66x 667GB blocks on 11 vms": {
        "teragen (second)": None,  # no result recorded
        "terasort (second)": None,  # no result recorded
    },
}
df_u = pd.DataFrame.from_dict(data_u, orient='index')
df_u

Configuration,teragen (second),terasort (second)
12 local NVMe + 18x 667GB blocks on 3 vms,170.017,1025.434
28 local NVMe + 42x 667GB blocks on 7 vms,99.969,508.182
44 local NVMe + 66x 667GB blocks on 11 vms,,
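To give a sense of how much block storage each scale-up configuration adds, the sketch below sums the per-volume figures quoted in the text (667 GB, 320 MB/s, 25k IOPS). It assumes that per-volume limits add up linearly, which ignores instance-level network or attachment limits, so the totals are an optimistic ceiling and not a measurement. The function name `aggregate_limits` is hypothetical.

```python
# Per-volume figures quoted in the text for a 667 GB block volume.
VOLUME_SIZE_GB = 667
VOLUME_MAX_MB_S = 320
VOLUME_MAX_IOPS = 25_000

# Block volumes attached in each scale-up configuration, keyed by VM count.
block_volumes = {3: 18, 7: 42, 11: 66}


def aggregate_limits(volume_count):
    """Sum the per-volume figures (assumption: limits add linearly)."""
    return {
        "capacity_gb": volume_count * VOLUME_SIZE_GB,
        "max_mb_s": volume_count * VOLUME_MAX_MB_S,
        "max_iops": volume_count * VOLUME_MAX_IOPS,
    }


limits = {vms: aggregate_limits(count) for vms, count in block_volumes.items()}
for vms, count in block_volumes.items():
    agg = limits[vms]
    print(
        f"{vms:>2} VMs, {count} volumes ({count // vms} per VM): "
        f"{agg['capacity_gb']} GB, up to {agg['max_mb_s']} MB/s, "
        f"up to {agg['max_iops']} IOPS"
    )
```

Every configuration attaches 6 block volumes per VM, so the added capacity grows in step with the number of instances.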


# Scale up vs Scale out

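Additional block storage gives only a small improvement. On 7 VMs, terasort is about 12% faster (508.182 s vs 579.816 s); on 3 VMs it is about 4% slower (1025.434 s vs 987.929 s), and teragen is slightly slower in both cases. No results are recorded for the 11-VM scale-up configuration.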

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead.
df = pd.concat([df_o, df_u])
df = df.drop(columns="Throughput (MB/s)")
df

Configuration,teragen (second),terasort (second)
12 local NVMe on 3 vms,160.052,987.929
28 local NVMe on 7 vms,96.51,579.816
44 local NVMe 11 vms,84.743,472.102
12 local NVMe + 18x 667GB blocks on 3 vms,170.017,1025.434
28 local NVMe + 42x 667GB blocks on 7 vms,99.969,508.182
44 local NVMe + 66x 667GB blocks on 11 vms,,
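The comparison is easier to read as a relative change per cluster size. The following is a minimal sketch using the timings from the table above. The 11-VM row is left out because the scale-up run has no recorded result, and the helper `percent_change` is an assumption for this example.

```python
# Timings in seconds, keyed by VM count, for the configurations measured in both modes.
scale_out = {
    3: {"teragen": 160.052, "terasort": 987.929},
    7: {"teragen": 96.510, "terasort": 579.816},
}
scale_up = {
    3: {"teragen": 170.017, "terasort": 1025.434},
    7: {"teragen": 99.969, "terasort": 508.182},
}


def percent_change(before, after):
    """Relative change in percent; negative means the scale-up run was faster."""
    return (after - before) / before * 100


changes = {
    vms: {
        job: percent_change(scale_out[vms][job], scale_up[vms][job])
        for job in ("teragen", "terasort")
    }
    for vms in scale_out
}
for vms, per_job in changes.items():
    for job, pct in per_job.items():
        print(f"{vms} VMs, {job}: {pct:+.1f}%")
```

With these numbers, terasort changes by roughly -12.4% on 7 VMs and +3.8% on 3 VMs, while teragen is a few percent slower in both cases.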
