
Dataset context manager: allow insertion of data directly as arrays #1207

Merged (50 commits) on Aug 17, 2018

Conversation

@jenshnielsen (Collaborator) commented Jul 25, 2018

Inserting data directly as arrays is significantly faster than unrolling and inserting point by point. This in turn uncovered some other issues.

  • Correct insertion of multidimensional arrays in the context manager. Previously only the first
    axis was unrolled, resulting in inconsistent results.
  • Adapt the data exporter get_data_by_id to return full arrays of setpoints regardless of the way they are inserted.

TODO:

  • Add benchmarks of the difference vs insertion as single points
  • Reduce the unneeded code duplication in the tests
  • Add examples of array insertion vs single-point insertion
  • Naively extracting data with DataSet.get_data gives a different result for array-inserted data than for data inserted row by row, but the get_data_by_id exporter works as expected

Edit by William: this PR fixes #1108
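For orientation, here is a minimal sketch of what array insertion looks like from the user's side; the parameter names and sizes are illustrative and not taken from this PR's tests or notebooks:

import numpy as np
from qcodes.dataset.measurements import Measurement

meas = Measurement()
# register a setpoint parameter and an array-valued parameter, both stored as arrays
meas.register_custom_parameter('freq', paramtype='array')
meas.register_custom_parameter('signal', setpoints=('freq',), paramtype='array')

with meas.run() as datasaver:
    freqs = np.linspace(1e6, 1e9, 1000)
    signal = np.random.rand(1000)  # stand-in for an instrument acquisition
    # the whole arrays are inserted in a single call instead of point by point
    datasaver.add_result(('freq', freqs), ('signal', signal))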

@astafan8 (Contributor)

Could you expand a bit more in the description on what exactly you changed? Because I see quite a bit related to ArrayParameter, and the part about tuples of (qcodes_param, numpy_array_of_data) gets lost there.

@astafan8 (Contributor)

Also, 👍 (+1) for the new get_data_as_* functions! I hope there will soon be a PR that makes these also methods of some object that DataSaver can return (similar to Sohail's and Wolfgang's _DataExtractor).

self._results.append(res_dict)
param_spec = self.parameters[str(partial_result[0])]
if param_spec.type == 'array' and index == 0:
    res_dict[str(partial_result[0])] = partial_result[1]
Contributor

is res_dict.update(..) better? (see below)

Collaborator Author

It's significantly slower according to my benchmark:

mydict = {'A': 'Foo', 'B': 'Bar'}
%timeit mydict['C'] = 'FooBar'
35.4 ns ± 1.78 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit mydict['C'] = 'FooBar'
36 ns ± 0.393 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit mydict.update({'C': 'FooBar'})
159 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@astafan8 (Contributor) commented Aug 6, 2018

Hmm... I might not be right, but for the .update case, could it be that the creation of a new dictionary ({'C': 'FooBar'}) takes some time (while in the first two cases just two strings are created)? In other words, can you try the same but with {'C': 'FooBar'} created before the benchmark? And perhaps with 'C' and 'FooBar' created beforehand for the upper two cases as well?

Collaborator Author

It's not all of it:

mydict = {'A': 'Foo', 'B': 'Bar'}
myotherdict = {'C': 'FooBar'}
%timeit mydict['C'] = 'FooBar'
%timeit mydict.update({'C': 'FooBar'})
%timeit mydict.update(myotherdict)
34.9 ns ± 0.977 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
154 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
139 ns ± 4.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I don't think update is the right method here; it's really for merging dicts, and we only have one element to insert.

str)):
    value = cast(Union[Sequence, np.ndarray], value)
    if isinstance(value, np.ndarray):
        value = np.atleast_1d(value).ravel()
@astafan8 (Contributor) commented Jul 25, 2018

When I was profiling this, I did indeed see that, after the thing I had already fixed, atleast_1d takes quite some time together with the as_float data adapter (if I'm naming it correctly). And now, with the ... == 'array' logic, this is basically avoided.

Collaborator Author

AFAIK this is only there to handle the case of 0D numpy arrays (i.e. np.array(1)), which is not the same as a numpy scalar but will pass the isinstance check for an array and take this code path. We could rewrite it to make sure that it doesn't.

Contributor

Omg... and I guess writing an explicit if statement (as opposed to just the isinstance check) to cover that case is less efficient, hence the atleast_1d without any comment explaining the rationale.

Collaborator Author

Yes, atleast_1d is surprisingly slow, especially for arrays that are already 1D or higher. I will replace it with a single reshape where needed:

a = np.array((0))
a
Out[59]: array(0)
a.shape
Out[60]: ()
%%timeit
if a.ndim == 0:
    a.reshape(1)
565 ns ± 5.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = np.array((0,1,3))
%%timeit
if a.ndim == 0:
    a.reshape(1)
41 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
a = np.array((0))
%%timeit
np.atleast_1d(a)
1.15 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = np.array((0,1,3))
%%timeit
np.atleast_1d(a)
628 ns ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
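For reference, a minimal sketch of the faster pattern discussed above (the helper name ensure_1d is illustrative, not the code actually used in the PR):

import numpy as np

def ensure_1d(value: np.ndarray) -> np.ndarray:
    # np.array(1) has ndim == 0; reshape it to shape (1,) instead of
    # paying for np.atleast_1d on every call
    if value.ndim == 0:
        value = value.reshape(1)
    return value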

param_spec = self.parameters[str(partial_result[0])]
if param_spec.type == 'array' and index == 0:
    res_dict[str(partial_result[0])] = partial_result[1]
elif param_spec.type != 'array':
Contributor

Btw, shall we finally make an enum with these SQLite data types? Isn't there already one in the sqlite module?

Collaborator Author

Done by exposing the list in ParamSpec

@jenshnielsen (Collaborator Author)

@astafan8 There is basically no difference between the benchmarks before and after the changes. In both cases I am seeing runtimes between 400 and 500 ms. With the changes to avoid atleast_1d I see a consistent speedup.

I added more benchmarks using the store-as-array method, demonstrating that it's faster, along with a notebook showing the speedup.

]
# we are less interested in the CPU time used and more interested in
# the wall-clock time it takes to insert the data, so use a timer that
# measures wall-clock time
Contributor

wallclock time of this process, you mean, right?

Collaborator Author

No. Wall-clock time of a process is IMHO undefined. There is only one wall-clock time.
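To illustrate the distinction being made here, a generic sketch (not the PR's benchmark code) comparing a wall-clock timer with a CPU-time timer; do_insertion is a hypothetical stand-in for the benchmarked work:

import time

def do_insertion():
    # hypothetical stand-in for the work being benchmarked
    time.sleep(0.01)

start = time.perf_counter()      # wall-clock time: one global clock
do_insertion()
wall = time.perf_counter() - start

cpu_start = time.process_time()  # CPU time consumed by this process only
do_insertion()
cpu = time.process_time() - cpu_start

print(f"wall: {wall:.4f} s, cpu: {cpu:.4f} s")  # sleeping uses wall time but almost no CPU time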

Args:
- *params: string parameter names, QCoDeS Parameter objects, and
ParamSpec objects
- start:
- end:

Returns:
- list of parallel NumPy arrays, one array per parameter
- list of rows of data. Each row will contain a list of columns
Contributor

"each row will contain a list of columns"? what do you mean exactly here?

provided when the DataSet was created. The parameter list may contain
a mix of string parameter names, QCoDeS Parameter objects, and
The values are returned as a list of lists, rows by columns, e.g.
datapoint by parameter. The data type of each element is based on the
Contributor

I still don't understand the format. For an SQL table like this:

x y z
1 2 3
4 5 6

is it going to return:

[ [1, 2, 3],
  [4, 5, 6] ]

?

Collaborator Author

Yes. I rewrote this to hopefully make it clearer.

@@ -524,7 +582,7 @@ def register_custom_parameter(
depends_on, inf_from = self._registration_validation(name, sp_strings,
bs_strings)

-    parspec = ParamSpec(name=name, paramtype='numeric',
+    parspec = ParamSpec(name=name, paramtype=paramtype,
Contributor

no checks for paramtype here?

Contributor

ah, found it in the ParamSpec constructor.

"""
# input validation
if paramtype not in ParamSpec.allowed_types:
Contributor

Is this check necessary here, or can we instead wait until the ParamSpec object is instantiated? It seems good that the check is here, because the error message refers to the fact that a QCoDeS parameter is being passed in to the method...
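For illustration, the kind of early validation being discussed might look like the sketch below; the allowed types and the exact error type and message are assumptions, not the PR's actual code:

allowed_types = ('numeric', 'array', 'text')  # assumed to mirror ParamSpec.allowed_types

def validate_paramtype(paramtype: str) -> None:
    # fail early, while we still know a QCoDeS parameter is being registered
    if paramtype not in allowed_types:
        raise ValueError(
            f"Trying to register a parameter with paramtype {paramtype!r}; "
            f"allowed types are {allowed_types}")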

@@ -0,0 +1,2541 @@
{
Contributor

What is the way to convert the following into just a list of numbers? (see closer to the end of the notebook)

[[array([0.71390511])],
 [array([0.90052465])],
 [array([0.93697729])],
 [array([0.90260228])],

Collaborator Author

Added example of that to the notebook
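For reference, one way to flatten such a structure into a plain list of numbers (a generic numpy sketch, not necessarily the exact example added to the notebook):

import numpy as np

nested = [[np.array([0.71390511])],
          [np.array([0.90052465])],
          [np.array([0.93697729])]]

flat = np.concatenate([row[0] for row in nested]).tolist()
# flat == [0.71390511, 0.90052465, 0.93697729]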

@@ -0,0 +1,2541 @@
{
"cells": [
Contributor

Very nice notebook! Three comments:

  • could you use a log scale for the number of points to insert?
  • could you add 1,000, 10,000, and perhaps 100,000 points to the benchmarking as well?
  • could you set the database location to something else so that people's default .db files don't suddenly get filled with this test data?

Collaborator Author

  • could you use a log scale for the number of points to insert?

I chose a linear scale because the scaling is linear in the number of points.

  • could you add 1,000, 10,000, and perhaps 100,000 points to the benchmarking as well?

I added more sizes to the first example but not to the second one, to keep the notebook runtime down.

  • could you set the database location to something else so that people's default .db files don't suddenly get filled with this test data?

Done

@WilliamHPNielsen (Contributor)

The docstring for add_result should be updated regarding how arrays are handled.

Also, although it's all in the tests and the performance notebook, I think it would be good to have a dedicated tutorial notebook showing an example of storing and retrieving arrays. Perhaps we can extend existing notebooks? (I wouldn't mind doing that.)

@@ -794,10 +827,11 @@ def test_datasaver_arrayparams_tuples(experiment, SpectrumAnalyzer, DAC, N, M):
assert datasaver.points_written == N*M


-@settings(max_examples=5, deadline=None)
+@settings(max_examples=5, deadline=None, use_coverage=False)
Contributor

I think use_coverage should not be False?

ParamSpec objects. As long as they have a `name` field. If provided,
the start and end parameters select a range of results by result count
(index).
If the range is empty -- that is, if the end is less than or
equal to the start, or if start is after the current end of the
DataSet – then a list of empty arrays is returned.

For a more type independent and easier to work with view of the data
you may want to consider using
:py:meth:`qcodes.dataset.data_exporter.get_data_by_id`
Contributor

I believe there's a typo here, shouldn't it be qcodes.dataset.data_export.get_data_by_id? (no ER after export)

@WilliamHPNielsen (Contributor) commented Aug 9, 2018

I've added a high-level tutorial notebook. I think it's good to have, and if nothing else, writing it convinced me that this is a kick-ass PR! There is also a suggestion for an extension of the test in the notebook's version of the MultiDimSpectrum.

@WilliamHPNielsen (Contributor) left a comment

Nice stuff right here! Approved modulo the few small typos mentioned above.

@astafan8 (Contributor) left a comment

Great job! With the updates to the notebooks, it's an even greater PR.

@jenshnielsen (Collaborator Author)

Addressed @WilliamHPNielsen's comments. I think this is ready to land if the CI passes.

@jenshnielsen jenshnielsen merged commit bab7f6e into microsoft:master Aug 17, 2018
@jenshnielsen jenshnielsen deleted the dataset_context_add_array branch August 17, 2018 13:00
giulioungaretti pushed a commit that referenced this pull request Aug 17, 2018
Merge: c044524 26335a1
Author: Jens Hedegaard Nielsen <jenshnielsen@gmail.com>

    Merge pull request #1207 from jenshnielsen/dataset_context_add_array
Successfully merging this pull request may close these issues:

register_parameter always assumes 'numeric' type