# DataSet Performance

This notebook shows the trade-off between inserting data into the database row-by-row and as binary blobs. Inserting the data row-by-row gives direct access to every value, so queries can be run directly on the values of the data. On the other hand, as we will see below, this is much slower than inserting the data as binary blobs.
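
To make the trade-off concrete, here is a minimal sketch using only the standard-library `sqlite3` and `array` modules. It is an illustration of the general idea, not of how QCoDeS lays out its tables internally; the table names `rows` and `blobs` and the sample values are assumptions made for this example.

```python
import sqlite3
from array import array

# Illustrative values only; these are not benchmark data.
values = [0.1 * i for i in range(5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (y REAL)")   # hypothetical table: one value per row
conn.execute("CREATE TABLE blobs (y BLOB)")  # hypothetical table: one array per row

# Row-by-row: one INSERT per value, so SQL can filter on the values.
conn.executemany("INSERT INTO rows VALUES (?)", [(v,) for v in values])

# Binary blob: the whole array is serialised into a single row.
conn.execute("INSERT INTO blobs VALUES (?)", (array("d", values).tobytes(),))

# A query on the values works directly against the row-wise table.
n_large = conn.execute("SELECT COUNT(*) FROM rows WHERE y > 0.25").fetchone()[0]

# The blob has to be read back and decoded before its values can be inspected.
raw = conn.execute("SELECT y FROM blobs").fetchone()[0]
decoded = array("d")
decoded.frombytes(raw)
n_large_blob = sum(1 for v in decoded if v > 0.25)

n_rows = conn.execute("SELECT COUNT(*) FROM rows").fetchone()[0]
n_blobs = conn.execute("SELECT COUNT(*) FROM blobs").fetchone()[0]
```

The row-wise table needs five inserts where the blob table needs one, which is the source of the speed difference, while only the row-wise table can answer the `WHERE y > 0.25` query without decoding.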

First, we choose a new location for the database to ensure that we don't add a bunch of benchmarking data to the default database.

In [1]:
import os
cwd = os.getcwd()
import qcodes as qc
qc.config["core"]["db_location"] = os.path.join(cwd, 'testing.db')


In [4]:
%matplotlib notebook
import time
import matplotlib.pyplot as plt
import numpy as np

import qcodes as qc
from qcodes.instrument.parameter import ManualParameter
from qcodes.dataset.experiment_container import (Experiment,
                                                 load_last_experiment,
                                                 new_experiment)
from qcodes.dataset.sqlite.database import initialise_database
from qcodes import load_or_create_experiment
from qcodes.dataset.measurements import Measurement

In [3]:
initialise_database()
exp = load_or_create_experiment(experiment_name='tutorial_exp', sample_name="no sample")

Upgrading database; v0 -> v1: : 0it [00:00, ?it/s]
Upgrading database; v1 -> v2: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 144.50it/s]
Upgrading database; v2 -> v3: : 0it [00:00, ?it/s]
Upgrading database; v3 -> v4: : 0it [00:00, ?it/s]
Upgrading database; v4 -> v5: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 169.94it/s]
Upgrading database; v5 -> v6: : 0it [00:00, ?it/s]
Upgrading database; v6 -> v7: 100%|██████████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.77it/s]
Upgrading database; v7 -> v8: 100%|█████████████████████████████████████████████████████| 1/1 [00:00<00:00, 132.53it/s]


Here, we define a simple function to benchmark the time it takes to insert n points with either numeric or array data type.
We will compare both the time used to call add_result and the time used for the full measurement.

In [5]:
def insert_data(paramtype, npoints, nreps=1):

    meas = Measurement(exp=exp)

    x1 = ManualParameter('x1')
    x2 = ManualParameter('x2')
    x3 = ManualParameter('x3')
    y1 = ManualParameter('y1')
    y2 = ManualParameter('y2')

    meas.register_parameter(x1, paramtype=paramtype)
    meas.register_parameter(x2, paramtype=paramtype)
    meas.register_parameter(x3, paramtype=paramtype)
    meas.register_parameter(y1, setpoints=[x1, x2, x3],
                            paramtype=paramtype)
    meas.register_parameter(y2, setpoints=[x1, x2, x3],
                            paramtype=paramtype)
    start = time.perf_counter()
    with meas.run() as datasaver:
        start_adding = time.perf_counter()
        for i in range(nreps):
            datasaver.add_result((x1, np.random.rand(npoints)),
                                 (x2, np.random.rand(npoints)),
                                 (x3, np.random.rand(npoints)),
                                 (y1, np.random.rand(npoints)),
                                 (y2, np.random.rand(npoints)))
        stop_adding = time.perf_counter()
        run_id = datasaver.run_id
    stop = time.perf_counter()
    tot_time = stop - start
    add_time = stop_adding - start_adding
    return tot_time, add_time, run_id

## Comparison between numeric/array data and binary blob

### Case 1: Short experiment time

In [6]:
sizes = [1,100,5000,7000,8000,10000,15000,20000]
t_numeric = []
t_numeric_add = []
t_array = []
t_array_add = []
for size in sizes:
    tn, tna, run_id_n =  insert_data('numeric', size)
    t_numeric.append(tn)
    t_numeric_add.append(tna)

    ta, taa, run_id_a =  insert_data('array', size)
    t_array.append(ta)
    t_array_add.append(taa)

Starting experimental run with id: 1
Starting experimental run with id: 2
Starting experimental run with id: 3
Starting experimental run with id: 4
Starting experimental run with id: 5
Starting experimental run with id: 6
Starting experimental run with id: 7
Starting experimental run with id: 8
Starting experimental run with id: 9
Starting experimental run with id: 10
Starting experimental run with id: 11
Starting experimental run with id: 12
Starting experimental run with id: 13
Starting experimental run with id: 14
Starting experimental run with id: 15
Starting experimental run with id: 16


In [9]:
fig, ax = plt.subplots(1,1)
ax.plot(sizes, t_numeric, 'o-', label='Inserting row-by-row')
ax.plot(sizes, t_numeric_add, 'o-', label='Inserting row-by-row: add_result only')
ax.plot(sizes, t_array, 'd-', label='Inserting as binary blob')
ax.plot(sizes, t_array_add, 'd-', label='Inserting as binary blob: add_result only')
ax.legend()
ax.set_xlabel('Array length')
ax.set_ylabel('Time (s)')
fig.tight_layout()

[Figure: time (s) against array length for row-by-row and binary-blob insertion, showing both the full measurement and add_result only.]

As we can see above, the time to set up and close the experiment is approximately 0.4 s. For small array sizes, the difference between inserting the data as arrays and inserting it row-by-row is relatively unimportant. At larger array sizes, i.e. above 10000 points, the cost of writing data as individual data points starts to become important.
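
The setup-and-close cost can be read off numerically as the difference between the total time and the time spent in add_result. A minimal sketch follows; `fixed_overhead` is a hypothetical helper name and the example timings are placeholders, not measured values.

```python
def fixed_overhead(total_times, add_times):
    """Return, per run, the time spent outside the add_result loop."""
    return [tot - add for tot, add in zip(total_times, add_times)]


# Placeholder timings in seconds, NOT measurements from this notebook.
example_total = [1.0, 2.5]
example_add = [0.5, 2.0]
example_overhead = fixed_overhead(example_total, example_add)
```

In the notebook this would be called as `fixed_overhead(t_numeric, t_numeric_add)` or `fixed_overhead(t_array, t_array_add)`.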


### Case 2: Long experiment time 

In [12]:
sizes = [1,100,5000,7000,8000,10000,15000,20000]
nreps = 100
t_numeric = []
t_numeric_add = []
t_numeric_run_ids = []
t_array = []
t_array_add = []
t_array_run_ids = []
for size in sizes:
    tn, tna, run_id_n =  insert_data('numeric', size, nreps=nreps)
    t_numeric.append(tn)
    t_numeric_add.append(tna)
    t_numeric_run_ids.append(run_id_n)

    ta, taa, run_id_a =  insert_data('array', size, nreps=nreps)
    t_array.append(ta)
    t_array_add.append(taa)
    t_array_run_ids.append(run_id_a)

Starting experimental run with id: 29
Starting experimental run with id: 30
Starting experimental run with id: 31
Starting experimental run with id: 32
Starting experimental run with id: 33
Starting experimental run with id: 34
Starting experimental run with id: 35
Starting experimental run with id: 36
Starting experimental run with id: 37
Starting experimental run with id: 38
Starting experimental run with id: 39
Starting experimental run with id: 40
Starting experimental run with id: 41
Starting experimental run with id: 42
Starting experimental run with id: 43
Starting experimental run with id: 44


In [23]:
fig, ax = plt.subplots(1,1)
ax.plot(sizes, t_numeric, 'o-', label='Inserting row-by-row')
ax.plot(sizes, t_numeric_add, 'o-', label='Inserting row-by-row: add_result only')
ax.plot(sizes, t_array, 'd-', label='Inserting as binary blob')
ax.plot(sizes, t_array_add, 'd-', label='Inserting as binary blob: add_result only')
ax.legend()
ax.set_xlabel('Array length')
ax.set_ylabel('Time (s)')
fig.tight_layout()

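[Figure: time (s) against array length for row-by-row and binary-blob insertion with 100 repetitions, showing both the full measurement and add_result only.]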

However, as we increase the length of the experiment, here by repeating the insertion 100 times, we see a big difference between inserting the data row-by-row and inserting it as a binary blob.
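
One way to put a number on that difference is the ratio of the two timings at each array length. The sketch below assumes two equally long lists of positive timings; `slowdown` is a hypothetical helper name and the example values are placeholders rather than results.

```python
def slowdown(row_times, blob_times):
    """Return how many times slower row-by-row insertion is, per array length."""
    return [row / blob for row, blob in zip(row_times, blob_times)]


# Placeholder timings in seconds, NOT measurements from this notebook.
example_ratio = slowdown([2.0, 9.0], [1.0, 3.0])
```

With the lists collected above, `slowdown(t_numeric, t_array)` would give one ratio per entry in `sizes`.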

## Loading the data 

In [16]:
from qcodes.dataset.data_set import load_by_id
from qcodes.dataset.data_export import get_data_by_id

As usual, you can load the data using load_by_id, but you will notice that the different storage methods
are reflected in the shape of the data as it is retrieved.

In [17]:
run_id_n = t_numeric_run_ids[0]
run_id_a = t_array_run_ids[0]

In [18]:
ds = load_by_id(run_id_n)
ds.get_data('x1')

[[0.848081065422738],
 [0.848081065422738],
 [0.374939332082798],
 [0.374939332082798],
 [0.191300263646959],
 [0.191300263646959],
 [0.379823709015795],
 [0.379823709015795],
 [0.607992770452508],
 [0.607992770452508],
 [0.0863347042712074],
 [0.0863347042712074],
 [0.852553555263995],
 [0.852553555263995],
 [0.78325951965642],
 [0.78325951965642],
 [0.957121068915867],
 [0.957121068915867],
 [0.846066388198381],
 [0.846066388198381],
 [0.932857226141523],
 [0.932857226141523],
 [0.21844305479826],
 [0.21844305479826],
 [0.668582002433123],
 [0.668582002433123],
 [0.140405606789395],
 [0.140405606789395],
 [0.631939072673613],
 [0.631939072673613],
 [0.12102689628114],
 [0.12102689628114],
 [0.61609498899417],
 [0.61609498899417],
 [0.473618803650297],
 [0.473618803650297],
 [0.47363984814937],
 [0.47363984814937],
 [0.126528470235682],
 [0.126528470235682],
 [0.854960582571513],
 [0.854960582571513],
 [0.0539765282606128],
 [0.0539765282606128],
 [0.793782838774629],
 [0.793782838774

And the same call on a dataset stored as binary arrays:

In [19]:
ds = load_by_id(run_id_a)
ds.get_data('x1')

[[array([0.05974817])],
 [array([0.05974817])],
 [array([0.86113333])],
 [array([0.86113333])],
 [array([0.16469597])],
 [array([0.16469597])],
 [array([0.24034662])],
 [array([0.24034662])],
 [array([0.87129534])],
 [array([0.87129534])],
 [array([0.56242041])],
 [array([0.56242041])],
 [array([0.85374062])],
 [array([0.85374062])],
 [array([0.70618267])],
 [array([0.70618267])],
 [array([0.45162257])],
 [array([0.45162257])],
 [array([0.55719672])],
 [array([0.55719672])],
 [array([0.82195885])],
 [array([0.82195885])],
 [array([0.27702417])],
 [array([0.27702417])],
 [array([0.39939586])],
 [array([0.39939586])],
 [array([0.08514211])],
 [array([0.08514211])],
 [array([0.98454372])],
 [array([0.98454372])],
 [array([0.61312945])],
 [array([0.61312945])],
 [array([0.97417853])],
 [array([0.97417853])],
 [array([0.84881694])],
 [array([0.84881694])],
 [array([0.7915288])],
 [array([0.7915288])],
 [array([0.15036228])],
 [array([0.15036228])],
 [array([0.21516162])],
 [array([0.2151616

This is probably more useful as a numpy array. Here we use squeeze to get rid of any singleton dimensions.

In [20]:
np.array(ds.get_data('x1')).squeeze()

array([0.05974817, 0.05974817, 0.86113333, 0.86113333, 0.16469597,
       0.16469597, 0.24034662, 0.24034662, 0.87129534, 0.87129534,
       0.56242041, 0.56242041, 0.85374062, 0.85374062, 0.70618267,
       0.70618267, 0.45162257, 0.45162257, 0.55719672, 0.55719672,
       0.82195885, 0.82195885, 0.27702417, 0.27702417, 0.39939586,
       0.39939586, 0.08514211, 0.08514211, 0.98454372, 0.98454372,
       0.61312945, 0.61312945, 0.97417853, 0.97417853, 0.84881694,
       0.84881694, 0.7915288 , 0.7915288 , 0.15036228, 0.15036228,
       0.21516162, 0.21516162, 0.76240888, 0.76240888, 0.36419401,
       0.36419401, 0.69709474, 0.69709474, 0.73316034, 0.73316034,
       0.36836762, 0.36836762, 0.49325713, 0.49325713, 0.39798175,
       0.39798175, 0.40305067, 0.40305067, 0.34283637, 0.34283637,
       0.09940271, 0.09940271, 0.17609265, 0.17609265, 0.38890329,
       0.38890329, 0.58489976, 0.58489976, 0.12307586, 0.12307586,
       0.57358834, 0.57358834, 0.91259653, 0.91259653, 0.02407
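
To see what squeeze does to the two nested layouts, the following sketch builds small stand-ins with the same nesting as the outputs above. The values are placeholders chosen for the example.

```python
import numpy as np

# Nesting in the style returned for a 'numeric' parameter.
numeric_style = [[0.1], [0.2], [0.3]]
# Nesting in the style returned for an 'array' parameter:
# each innermost entry is itself a length-1 array.
array_style = [[np.array([0.1])], [np.array([0.2])], [np.array([0.3])]]

numeric_shape = np.array(numeric_style).shape  # one singleton axis
array_shape = np.array(array_style).shape      # two singleton axes

# squeeze drops every axis of length 1, so both end up one-dimensional.
flat_numeric = np.array(numeric_style).squeeze()
flat_array = np.array(array_style).squeeze()

# Caveat: with a single point, squeeze removes all axes and leaves a 0-d array.
single_point = np.array([[0.1]]).squeeze()
```

If a one-dimensional result is needed even for a single point, `reshape(-1)` is a safer choice than `squeeze()`.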

A better solution may be to use get_data_by_id, which loads the data in a format that does not depend on the internal storage.

In [21]:
get_data_by_id(run_id_n)

[[{'name': 'x1',
   'data': array([0.84808107, 0.37493933, 0.19130026, 0.37982371, 0.60799277,
          0.0863347 , 0.85255356, 0.78325952, 0.95712107, 0.84606639,
          0.93285723, 0.21844305, 0.668582  , 0.14040561, 0.63193907,
          0.1210269 , 0.61609499, 0.4736188 , 0.47363985, 0.12652847,
          0.85496058, 0.05397653, 0.79378284, 0.99503526, 0.8469791 ,
          0.72169356, 0.33762012, 0.03410516, 0.12879331, 0.66897919,
          0.84815118, 0.21313188, 0.63150579, 0.79204728, 0.37317595,
          0.70728949, 0.3127348 , 0.17957656, 0.07511766, 0.21010386,
          0.70708057, 0.34685299, 0.85465829, 0.95867875, 0.44860838,
          0.55063505, 0.29047546, 0.43314081, 0.44833789, 0.04633969,
          0.13969814, 0.24004556, 0.40840132, 0.92769869, 0.57238298,
          0.99948322, 0.79696939, 0.63522599, 0.19945709, 0.42217928,
          0.50182945, 0.96245379, 0.84613534, 0.9933854 , 0.89936862,
          0.243496  , 0.42138022, 0.09558242, 0.11175637, 0.08203

In [22]:
get_data_by_id(run_id_a)

[[{'name': 'x1',
   'data': array([0.05974817, 0.86113333, 0.16469597, 0.24034662, 0.87129534,
          0.56242041, 0.85374062, 0.70618267, 0.45162257, 0.55719672,
          0.82195885, 0.27702417, 0.39939586, 0.08514211, 0.98454372,
          0.61312945, 0.97417853, 0.84881694, 0.7915288 , 0.15036228,
          0.21516162, 0.76240888, 0.36419401, 0.69709474, 0.73316034,
          0.36836762, 0.49325713, 0.39798175, 0.40305067, 0.34283637,
          0.09940271, 0.17609265, 0.38890329, 0.58489976, 0.12307586,
          0.57358834, 0.91259653, 0.02407137, 0.37055284, 0.24785948,
          0.58769948, 0.11053699, 0.99915081, 0.71884427, 0.562046  ,
          0.58611144, 0.60084174, 0.52368658, 0.51681636, 0.36062905,
          0.67965271, 0.87314048, 0.4473332 , 0.56224204, 0.51443103,
          0.79344838, 0.62506451, 0.01326266, 0.00517486, 0.46698835,
          0.11916117, 0.59078511, 0.65135246, 0.94339582, 0.13436913,
          0.68540483, 0.08819629, 0.64529788, 0.55131487, 0.66050
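
The truncated output above suggests a list of lists of dicts, each with at least a 'name' and a 'data' key. Under that assumption, a small helper can turn one inner list into a name-to-array mapping. `by_name` is a hypothetical helper, and `mock_result` is a hand-written stand-in with placeholder values, not real output from get_data_by_id.

```python
import numpy as np


def by_name(group):
    """Map parameter name to its data array for one inner list of dicts."""
    return {entry["name"]: entry["data"] for entry in group}


# Stand-in for the nested structure; the values are placeholders.
mock_result = [[{"name": "x1", "data": np.array([0.1, 0.2])},
                {"name": "y1", "data": np.array([1.0, 2.0])}]]

first_group = by_name(mock_result[0])
x1_values = first_group["x1"]
```

In the notebook, the same helper could be applied as `by_name(get_data_by_id(run_id_a)[0])`, provided the returned structure matches the assumption stated above.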