# DataSet Performance

This notebook shows the tradeoff between inserting data into the database row by row and inserting it as binary blobs.


Inserting the data row by row means that we have direct access to all the data and may perform queries directly on the values of the data. On the other hand, as we will see below, this is much slower than inserting the data directly as binary blobs.
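The difference in query access can be illustrated outside QCoDeS with the standard library alone. The following is a minimal sketch, assuming a plain in-memory SQLite database; the table names `numeric_rows` and `blob_rows` are hypothetical and do not reflect the actual QCoDeS schema.

```python
import sqlite3
from array import array

values = [0.1, 0.5, 0.9]

conn = sqlite3.connect(":memory:")
# Hypothetical tables for illustration only (not the QCoDeS schema).
conn.execute("CREATE TABLE numeric_rows (x REAL)")
conn.execute("CREATE TABLE blob_rows (x BLOB)")

# Row by row: every value becomes its own SQL row.
conn.executemany("INSERT INTO numeric_rows VALUES (?)", [(v,) for v in values])

# Binary blob: the whole array is serialized and stored in a single row.
conn.execute("INSERT INTO blob_rows VALUES (?)", (array("d", values).tobytes(),))

# Values stored as rows can be filtered directly in SQL.
above = [row[0] for row in
         conn.execute("SELECT x FROM numeric_rows WHERE x > 0.4 ORDER BY x")]

# A blob is opaque to SQL: it has to be fetched and decoded before filtering.
(raw,) = conn.execute("SELECT x FROM blob_rows").fetchone()
decoded = array("d")
decoded.frombytes(raw)
above_from_blob = [v for v in decoded if v > 0.4]

n_numeric_rows = conn.execute("SELECT COUNT(*) FROM numeric_rows").fetchone()[0]
n_blob_rows = conn.execute("SELECT COUNT(*) FROM blob_rows").fetchone()[0]
conn.close()
```

Both approaches recover the same values, but the blob table needs a single insert where the numeric table needs one per value, which is the source of the speed difference measured below.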

First, we choose a new location for the database to ensure that we don't add a bunch of benchmarking data to our existing database.

In [1]:
import os
cwd = os.getcwd()
import qcodes as qc
qc.config["core"]["db_location"] = os.path.join(cwd, 'testing.db')


In [2]:
%matplotlib notebook
import time
import matplotlib.pyplot as plt
import numpy as np

import qcodes as qc
from qcodes.instrument.parameter import ManualParameter
from qcodes.dataset.experiment_container import (Experiment,
                                                 load_last_experiment,
                                                 new_experiment)
from qcodes.dataset.database import initialise_database
from qcodes.dataset.measurements import Measurement

In [3]:
initialise_database()
new_experiment(name='tutorial_exp', sample_name="no sample")

tutorial_exp#no sample#3@C:\Users\jenielse\source\repos\Qcodes\docs\examples\DataSet\testing.db
-----------------------------------------------------------------------------------------------

Here we define a simple function to benchmark the time it takes to insert n points with either numeric or array data type.
We will compare both the time used to call add_result and the time used for the full measurement.

In [4]:
def insert_data(paramtype, npoints, nreps=1):

    meas = Measurement()

    x1 = ManualParameter('x1')
    x2 = ManualParameter('x2')
    x3 = ManualParameter('x3')
    y1 = ManualParameter('y1')
    y2 = ManualParameter('y2')

    meas.register_parameter(x1, paramtype=paramtype)
    meas.register_parameter(x2, paramtype=paramtype)
    meas.register_parameter(x3, paramtype=paramtype)
    meas.register_parameter(y1, setpoints=[x1, x2, x3],
                            paramtype=paramtype)
    meas.register_parameter(y2, setpoints=[x1, x2, x3],
                            paramtype=paramtype)
    start = time.perf_counter()
    with meas.run() as datasaver:
        start_adding = time.perf_counter()
        for i in range(nreps):
            datasaver.add_result((x1, np.random.rand(npoints)),
                                 (x2, np.random.rand(npoints)),
                                 (x3, np.random.rand(npoints)),
                                 (y1, np.random.rand(npoints)),
                                 (y2, np.random.rand(npoints)))
        stop_adding = time.perf_counter()
        run_id = datasaver.run_id
    stop = time.perf_counter()
    tot_time = stop - start
    add_time = stop_adding - start_adding
    return tot_time, add_time, run_id

We then perform a quick comparison of the two parameter types.

In [5]:
sizes = [1,100,5000,7000,8000,10000,15000,20000]
t_numeric = []
t_numeric_add = []
t_array = []
t_array_add = []
for size in sizes:
    tn, tna, run_id_n =  insert_data('numeric', size)
    t_numeric.append(tn)
    t_numeric_add.append(tna)

    ta, taa, run_id_a =  insert_data('array', size)
    t_array.append(ta)
    t_array_add.append(taa)

Starting experimental run with id: 201
Starting experimental run with id: 202
Starting experimental run with id: 203
Starting experimental run with id: 204
Starting experimental run with id: 205
Starting experimental run with id: 206
Starting experimental run with id: 207
Starting experimental run with id: 208
Starting experimental run with id: 209
Starting experimental run with id: 210
Starting experimental run with id: 211
Starting experimental run with id: 212
Starting experimental run with id: 213
Starting experimental run with id: 214
Starting experimental run with id: 215
Starting experimental run with id: 216


In [6]:
fig, ax = plt.subplots(1,1)
ax.plot(sizes, t_numeric, 'o-', label='Inserting row by row')
ax.plot(sizes, t_numeric_add, 'o-', label='Inserting row by row: add_result only')
ax.plot(sizes, t_array, 'd-', label='Inserting as binary blob')
ax.plot(sizes, t_array_add, 'd-', label='Inserting as binary blob: add_result only')
ax.legend()
ax.set_xlabel('Array length')
ax.set_ylabel('Time (s)')
fig.tight_layout()

[Figure: insertion time (s) versus array length, row by row and as binary blob, for the full measurement and for add_result only]

As we can see, the time to set up and close the experiment is approximately 0.4 s. At small array sizes, the difference between inserting as arrays and inserting row by row is therefore relatively unimportant. At larger array sizes, above 10000 points or so, the cost of writing data as individual data points starts to become important.

In [7]:
sizes = [1,10,100,1000,3000,5000]
nreps = 100
t_numeric = []
t_numeric_add = []
t_numeric_run_ids = []
t_array = []
t_array_add = []
t_array_run_ids = []
for size in sizes:
    tn, tna, run_id_n =  insert_data('numeric', size, nreps=nreps)
    t_numeric.append(tn)
    t_numeric_add.append(tna)
    t_numeric_run_ids.append(run_id_n)

    ta, taa, run_id_a =  insert_data('array', size, nreps=nreps)
    t_array.append(ta)
    t_array_add.append(taa)
    t_array_run_ids.append(run_id_a)

Starting experimental run with id: 217
Starting experimental run with id: 218
Starting experimental run with id: 219
Starting experimental run with id: 220
Starting experimental run with id: 221
Starting experimental run with id: 222
Starting experimental run with id: 223
Starting experimental run with id: 224
Starting experimental run with id: 225
Starting experimental run with id: 226
Starting experimental run with id: 227
Starting experimental run with id: 228


In [8]:
fig, ax = plt.subplots(1,1)
ax.plot(sizes, t_numeric, 'o-', label='Inserting row by row')
ax.plot(sizes, t_numeric_add, 'o-', label='Inserting row by row: add_result only')
ax.plot(sizes, t_array, 'd-', label='Inserting as binary blob')
ax.plot(sizes, t_array_add, 'd-', label='Inserting as binary blob: add_result only')
ax.legend()
ax.set_xlabel('Array length')
ax.set_ylabel('Time (s)')
fig.tight_layout()

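[Figure: insertion time (s) versus array length with 100 repetitions, row by row and as binary blob, for the full measurement and for add_result only]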

However, as we increase the length of the experiment, here by repeating the insertion 100 times, we see a big difference between inserting row by row and inserting as a binary blob.

In [9]:
from qcodes.dataset.data_set import load_by_id
from qcodes.dataset.data_export import get_data_by_id

As usual, you can load the data using load_by_id, but you will notice that the different storage methods
are reflected in the shape of the data as it is retrieved.

In [10]:
run_id_n = t_numeric_run_ids[0]
run_id_a = t_array_run_ids[0]

In [11]:
ds = load_by_id(run_id_n)
ds.get_data('x1')

[[0.959549330723474],
 [0.940563783772708],
 [0.108666922220762],
 [0.312563323357191],
 [0.812348265797183],
 [0.247980401329598],
 [0.0298323729336736],
 [0.831936209835814],
 [0.709002499701497],
 [0.87910772751015],
 [0.260502412509244],
 [0.777657647072767],
 [0.547481545892687],
 [0.498479574749812],
 [0.273476961165475],
 [0.826983407043849],
 [0.0395121237529028],
 [0.696974544325046],
 [0.101859604657765],
 [0.889143003114224],
 [0.514398328323729],
 [0.740741657953947],
 [0.801730313017923],
 [0.711663624917591],
 [0.45312146600675],
 [0.225330546223433],
 [0.204755159120075],
 [0.92889392506205],
 [0.532049832625766],
 [0.00936108419592929],
 [0.0571597859694355],
 [0.905093782756854],
 [0.870500141499757],
 [0.625676196565432],
 [0.52832637119628],
 [0.790332212783862],
 [0.148408173415533],
 [0.933984859324323],
 [0.194444807231993],
 [0.834871227854778],
 [0.306160565285259],
 [0.976533391830069],
 [0.403514230735721],
 [0.293730690704184],
 [0.215300415378065],
 [0.99475

And a dataset stored as binary arrays:

In [19]:
ds = load_by_id(run_id_a)
ds.get_data('x1')

[[array([0.94530173])],
 [array([0.32518326])],
 [array([0.11468476])],
 [array([0.08284734])],
 [array([0.56971591])],
 [array([0.54727455])],
 [array([0.79629953])],
 [array([0.31960544])],
 [array([0.96147543])],
 [array([0.86017149])],
 [array([0.29265343])],
 [array([0.69152178])],
 [array([0.97950319])],
 [array([0.61256864])],
 [array([0.36615283])],
 [array([0.23342426])],
 [array([0.535851])],
 [array([0.61921525])],
 [array([0.50592589])],
 [array([0.48363425])],
 [array([0.43771955])],
 [array([0.3752991])],
 [array([0.0784409])],
 [array([0.2220136])],
 [array([0.29096545])],
 [array([0.71920147])],
 [array([0.32633142])],
 [array([0.54561262])],
 [array([0.76110974])],
 [array([0.15487181])],
 [array([0.97766273])],
 [array([0.04568271])],
 [array([0.65503239])],
 [array([0.24678045])],
 [array([0.2030046])],
 [array([0.366546])],
 [array([0.79882029])],
 [array([0.18700458])],
 [array([0.55839833])],
 [array([0.42166372])],
 [array([0.26266574])],
 [array([0.7368249])],
 

This is probably more useful as a numpy array. Here we use squeeze to get rid of any singleton dimensions.

In [27]:
np.array(ds.get_data('x1')).squeeze()

array([0.94530173, 0.32518326, 0.11468476, 0.08284734, 0.56971591,
       0.54727455, 0.79629953, 0.31960544, 0.96147543, 0.86017149,
       0.29265343, 0.69152178, 0.97950319, 0.61256864, 0.36615283,
       0.23342426, 0.535851  , 0.61921525, 0.50592589, 0.48363425,
       0.43771955, 0.3752991 , 0.0784409 , 0.2220136 , 0.29096545,
       0.71920147, 0.32633142, 0.54561262, 0.76110974, 0.15487181,
       0.97766273, 0.04568271, 0.65503239, 0.24678045, 0.2030046 ,
       0.366546  , 0.79882029, 0.18700458, 0.55839833, 0.42166372,
       0.26266574, 0.7368249 , 0.35078069, 0.2998865 , 0.29609682,
       0.96898321, 0.02865802, 0.82216362, 0.90724283, 0.71641643,
       0.95702289, 0.12620523, 0.30632632, 0.64706647, 0.15560899,
       0.40872134, 0.390249  , 0.77369711, 0.80643844, 0.82856457,
       0.17213151, 0.20903373, 0.75592854, 0.52253505, 0.39111524,
       0.2139777 , 0.9477797 , 0.2395648 , 0.95018639, 0.93704701,
       0.60849806, 0.47158601, 0.76885545, 0.71172346, 0.51840
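To make the effect of squeeze concrete, the sketch below builds a small stand-in for the nested list returned above and follows its shape through the conversion. The values are made up for illustration and are not data from the run. It also shows one edge case worth keeping in mind: with a single row, squeeze removes every axis.

```python
import numpy as np

# Hypothetical stand-in for ds.get_data('x1') on an array-type run:
# one row per result, each row holding a length-1 array.
nested = [[np.array([0.1])], [np.array([0.2])], [np.array([0.3])]]

stacked = np.array(nested)   # shape (3, 1, 1)
flat = stacked.squeeze()     # singleton axes removed: shape (3,)

# With only one row, squeeze strips all axes and returns a 0-d array.
# reshape(-1) keeps a 1-D result in that case.
single = np.array([[np.array([0.5])]])
zero_d = single.squeeze()    # shape ()
one_d = single.reshape(-1)   # shape (1,)
```

If downstream code expects a 1-D array regardless of the number of rows, reshape(-1) may be the safer choice when each stored array has length one.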

A better solution may be to use get_data_by_id, which will load the data in a format that does not depend on the internal storage.

In [28]:
get_data_by_id(run_id_n)

[[{'name': 'x1',
   'label': 'x1',
   'unit': '',
   'data': array([0.95954933, 0.94056378, 0.10866692, 0.31256332, 0.81234827,
          0.2479804 , 0.02983237, 0.83193621, 0.7090025 , 0.87910773,
          0.26050241, 0.77765765, 0.54748155, 0.49847957, 0.27347696,
          0.82698341, 0.03951212, 0.69697454, 0.1018596 , 0.889143  ,
          0.51439833, 0.74074166, 0.80173031, 0.71166362, 0.45312147,
          0.22533055, 0.20475516, 0.92889393, 0.53204983, 0.00936108,
          0.05715979, 0.90509378, 0.87050014, 0.6256762 , 0.52832637,
          0.79033221, 0.14840817, 0.93398486, 0.19444481, 0.83487123,
          0.30616057, 0.97653339, 0.40351423, 0.29373069, 0.21530042,
          0.99475024, 0.63152644, 0.1957715 , 0.04773717, 0.34188268,
          0.01544931, 0.21583399, 0.77918986, 0.49699849, 0.04822403,
          0.21034273, 0.68393696, 0.76150984, 0.89750376, 0.75165336,
          0.73716954, 0.02094734, 0.81675833, 0.4895815 , 0.24799014,
          0.20031284, 0.34270411

In [29]:
get_data_by_id(run_id_a)

[[{'name': 'x1',
   'label': 'x1',
   'unit': '',
   'data': array([0.94530173, 0.32518326, 0.11468476, 0.08284734, 0.56971591,
          0.54727455, 0.79629953, 0.31960544, 0.96147543, 0.86017149,
          0.29265343, 0.69152178, 0.97950319, 0.61256864, 0.36615283,
          0.23342426, 0.535851  , 0.61921525, 0.50592589, 0.48363425,
          0.43771955, 0.3752991 , 0.0784409 , 0.2220136 , 0.29096545,
          0.71920147, 0.32633142, 0.54561262, 0.76110974, 0.15487181,
          0.97766273, 0.04568271, 0.65503239, 0.24678045, 0.2030046 ,
          0.366546  , 0.79882029, 0.18700458, 0.55839833, 0.42166372,
          0.26266574, 0.7368249 , 0.35078069, 0.2998865 , 0.29609682,
          0.96898321, 0.02865802, 0.82216362, 0.90724283, 0.71641643,
          0.95702289, 0.12620523, 0.30632632, 0.64706647, 0.15560899,
          0.40872134, 0.390249  , 0.77369711, 0.80643844, 0.82856457,
          0.17213151, 0.20903373, 0.75592854, 0.52253505, 0.39111524,
          0.2139777 , 0.9477797 
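Judging by the output above, get_data_by_id returns a nested list of dictionaries with the keys 'name', 'label', 'unit' and 'data'. The sketch below shows one way to pull out the data for a given parameter from such a structure. The helper `collect_by_name` is hypothetical (not part of QCoDeS), and the example structure uses short made-up lists in place of the numpy arrays a real call would return.

```python
def collect_by_name(nested, name):
    """Return the 'data' entry of every dict whose 'name' matches."""
    return [entry["data"]
            for group in nested
            for entry in group
            if entry["name"] == name]

# Made-up stand-in mimicking the nested list-of-dicts layout shown above.
example = [
    [{"name": "x1", "label": "x1", "unit": "", "data": [0.1, 0.2]},
     {"name": "y1", "label": "y1", "unit": "", "data": [1.0, 2.0]}],
    [{"name": "x1", "label": "x1", "unit": "", "data": [0.1, 0.2]},
     {"name": "y2", "label": "y2", "unit": "", "data": [3.0, 4.0]}],
]

y1_data = collect_by_name(example, "y1")
x1_data = collect_by_name(example, "x1")
```

The same helper should work unchanged on the result for either run, since the returned layout does not depend on whether the data was stored as numeric rows or as arrays.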