
Dataset context manager: allow insertion of data directly as arrays #1207

Merged (50 commits) on Aug 17, 2018

Conversation

@jenshnielsen (Collaborator) commented Jul 25, 2018

Inserting data directly as arrays is significantly faster than unrolling and inserting point by point. This in turn uncovered some other issues.

  • Correct insertion of multidimensional arrays in the context manager. Previously only the first
    axis was unrolled, resulting in inconsistent results.
  • Adapt the data exporter get_data_by_id to return full arrays of setpoints regardless of the way they are inserted.

TODO:

  • Add benchmarks of the difference vs insertion as single points
  • Reduce the unneeded code duplication in the tests
  • Add examples of array insertion vs single-point insertion
  • Naively extracting data with DataSet.get_data gives a different result for array-inserted data than for data inserted row by row, but the get_data_by_id exporter works as expected

Edit by William: this PR fixes #1108
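For orientation, here is a minimal sketch of what array insertion looks like from the user's side; the parameter names and sizes are illustrative and not taken from this PR's tests or notebooks:

import numpy as np
from qcodes.dataset.measurements import Measurement

meas = Measurement()
# register a setpoint parameter and an array-valued parameter, both stored as arrays
meas.register_custom_parameter('freq', paramtype='array')
meas.register_custom_parameter('signal', setpoints=('freq',), paramtype='array')

with meas.run() as datasaver:
    freqs = np.linspace(1e6, 1e9, 1000)
    signal = np.random.rand(1000)  # stand-in for an instrument acquisition
    # the whole arrays are inserted in a single call instead of point by point
    datasaver.add_result(('freq', freqs), ('signal', signal))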

@astafan8 (Contributor)

Could you expand a bit more in the description on what exactly you changed? Because I see quite a bit related to ArrayParameter, and the part about tuples of (qcodes_param, numpy_array_of_data) gets lost there.

@astafan8 (Contributor)

Also, 👍 (+1) for the new get_data_as_* functions! I hope there will soon be a PR that makes these also methods of some object that DataSaver can return (similar to Sohail's and Wolfgang's _DataExtractor).

self._results.append(res_dict)
param_spec = self.parameters[str(partial_result[0])]
if param_spec.type == 'array' and index == 0:
    res_dict[str(partial_result[0])] = partial_result[1]
Contributor

is res_dict.update(..) better? (see below)

Collaborator Author

It's significantly slower according to my benchmark:

mydict = {'A': 'Foo', 'B': 'Bar'}
%timeit mydict['C'] = 'FooBar'
35.4 ns ± 1.78 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit mydict['C'] = 'FooBar'
36 ns ± 0.393 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit mydict.update({'C': 'FooBar'})
159 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@astafan8 (Contributor) commented Aug 6, 2018

Hmm... I might not be right, but for the .update case, could it be that the creation of a new dictionary ({'C': 'FooBar'}) takes some time (while in the first two cases just two strings are created)? In other words, can you try the same but with {'C': 'FooBar'} created before the benchmark? And perhaps with 'C' and 'FooBar' created beforehand for the upper two cases as well?

Collaborator Author

It's not all of it:

mydict = {'A': 'Foo', 'B': 'Bar'}
myotherdict = {'C': 'FooBar'}
%timeit mydict['C'] = 'FooBar'
%timeit mydict.update({'C': 'FooBar'})
%timeit mydict.update(myotherdict)
34.9 ns ± 0.977 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
154 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
139 ns ± 4.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I don't think update is the right method here; it's really for merging dicts, and we only have one element to insert.

str)):
    value = cast(Union[Sequence, np.ndarray], value)
    if isinstance(value, np.ndarray):
        value = np.atleast_1d(value).ravel()
@astafan8 (Contributor) commented Jul 25, 2018

When I was profiling this, I did indeed see that, after the thing I had already fixed, atleast_1d takes quite some time together with the as_float data adapter (if I'm naming it correctly). And now, with the ... == 'array' logic, this is basically avoided.

Collaborator Author

AFAIK this is only there to handle the case of 0D numpy arrays (i.e. np.array(1)), which is not the same as a numpy scalar but will pass the isinstance check for an array and take this code path. We could rewrite it to make sure that it doesn't.

Contributor

Omg... and I guess writing an explicit if statement (as opposed to just the isinstance check) to cover that case is less efficient, hence the atleast_1d without any comment explaining the rationale.

Collaborator Author

Yes, atleast_1d is surprisingly slow, especially for arrays that are already 1D or higher. I will replace it with a single reshape where needed:

a = np.array((0))
a
Out[59]: array(0)
a.shape
Out[60]: ()
%%timeit
if a.ndim == 0:
    a.reshape(1)
565 ns ± 5.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = np.array((0,1,3))
%%timeit
if a.ndim == 0:
    a.reshape(1)
41 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
a = np.array((0))
%%timeit
np.atleast_1d(a)
1.15 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = np.array((0,1,3))
%%timeit
np.atleast_1d(a)
628 ns ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
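For reference, a minimal sketch of the faster pattern discussed above (the helper name ensure_1d is illustrative, not the code actually used in the PR):

import numpy as np

def ensure_1d(value: np.ndarray) -> np.ndarray:
    # np.array(1) has ndim == 0; reshape it to shape (1,) instead of
    # paying for np.atleast_1d on every call
    if value.ndim == 0:
        value = value.reshape(1)
    return value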

param_spec = self.parameters[str(partial_result[0])]
if param_spec.type == 'array' and index == 0:
    res_dict[str(partial_result[0])] = partial_result[1]
elif param_spec.type != 'array':
Contributor

Btw, shall we finally make an enum with these SQLite data types? Isn't there already one in the sqlite module?

Collaborator Author

Done by exposing the list in ParamSpec

@jenshnielsen (Collaborator Author)

@astafan8 There is basically no difference between the benchmarks before and after the changes. In both cases I am seeing runtimes between 400 and 500 ms. With the changes to avoid atleast_1d I see a consistent speedup.

I added more benchmarks using the store-as-array method, demonstrating that it's faster, along with a notebook showing the speedup.

]
# we are less interested in the CPU time used and more interested in
# the wall-clock time it takes to insert the data, so use a timer that
# measures wall-clock time
Contributor

wallclock time of this process, you mean, right?

Collaborator Author

No. Wall-clock time of a process is IMHO undefined. There is only one wall-clock time.
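To illustrate the distinction being made here, a generic sketch (not the PR's benchmark code) comparing a wall-clock timer with a CPU-time timer; do_insertion is a hypothetical stand-in for the benchmarked work:

import time

def do_insertion():
    # hypothetical stand-in for the work being benchmarked
    time.sleep(0.01)

start = time.perf_counter()      # wall-clock time: one global clock
do_insertion()
wall = time.perf_counter() - start

cpu_start = time.process_time()  # CPU time consumed by this process only
do_insertion()
cpu = time.process_time() - cpu_start

print(f"wall: {wall:.4f} s, cpu: {cpu:.4f} s")  # sleeping uses wall time but almost no CPU time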

Args:
- *params: string parameter names, QCoDeS Parameter objects, and
ParamSpec objects
- start:
- end:

Returns:
- list of parallel NumPy arrays, one array per parameter
- list of rows of data. Each row will contain a list of columns
Contributor

"each row will contain a list of columns"? what do you mean exactly here?

provided when the DataSet was created. The parameter list may contain
a mix of string parameter names, QCoDeS Parameter objects, and
The values are returned as a list of lists, rows by columns, e.g.
datapoint by parameter. The data type of each element is based on the
Contributor

I still don't understand the format. For an SQL table like this:

x y z
1 2 3
4 5 6

is it going to return:

[ [1, 2, 3],
  [4, 5, 6] ]

?

Collaborator Author

Yes. I rewrote this to hopefully make it clearer.

@@ -524,7 +582,7 @@ def register_custom_parameter(
depends_on, inf_from = self._registration_validation(name, sp_strings,
bs_strings)

-    parspec = ParamSpec(name=name, paramtype='numeric',
+    parspec = ParamSpec(name=name, paramtype=paramtype,
Contributor

no checks for paramtype here?

Contributor

ah, found it in the ParamSpec constructor.

"""
# input validation
if paramtype not in ParamSpec.allowed_types:
Contributor

Is this check necessary here, or can we instead wait until the ParamSpec object is instantiated? It seems good that the check is here, because the error message refers to the fact that a QCoDeS parameter is being passed in to the method...
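For illustration, the kind of early validation being discussed might look like the sketch below; the allowed types and the exact error type and message are assumptions, not the PR's actual code:

allowed_types = ('numeric', 'array', 'text')  # assumed to mirror ParamSpec.allowed_types

def validate_paramtype(paramtype: str) -> None:
    # fail early, while we still know a QCoDeS parameter is being registered
    if paramtype not in allowed_types:
        raise ValueError(
            f"Trying to register a parameter with paramtype {paramtype!r}; "
            f"allowed types are {allowed_types}")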

@@ -0,0 +1,2541 @@
{
Contributor

What is the way to convert the following into just a list of numbers? (see closer to the end of the notebook)

[[array([0.71390511])],
 [array([0.90052465])],
 [array([0.93697729])],
 [array([0.90260228])],

Collaborator Author

Added example of that to the notebook
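For reference, one way to flatten such a structure into a plain list of numbers (a generic numpy sketch, not necessarily the exact example added to the notebook):

import numpy as np

nested = [[np.array([0.71390511])],
          [np.array([0.90052465])],
          [np.array([0.93697729])]]

flat = np.concatenate([row[0] for row in nested]).tolist()
# flat == [0.71390511, 0.90052465, 0.93697729]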

@@ -0,0 +1,2541 @@
{
"cells": [
Contributor

Very nice notebook! Three comments:

  • could you use a log scale for the number of points to insert?
  • could you add 1,000, 10,000, and perhaps 100,000 points to the benchmarking as well?
  • could you set the database location to something else so that people's default .db files don't suddenly get filled with this test data?

Collaborator Author

  • could you use a log scale for the number of points to insert?

I chose a linear scale because the scaling is linear in the number of points.

  • could you add 1,000, 10,000, and perhaps 100,000 points to the benchmarking as well?

I added more sizes to the first example but not to the second one, to keep the notebook runtime down.

  • could you set the database location to something else so that people's default .db files don't suddenly get filled with this test data?

Done

@WilliamHPNielsen (Contributor)

The docstring for add_result should be updated regarding how arrays are handled.

Also, although it's all in the tests and the performance notebook, I think it would be good to have a dedicated tutorial notebook showing an example of storing and retrieving arrays. Perhaps we can extend existing notebooks? (I wouldn't mind doing that.)

@@ -794,10 +827,11 @@ def test_datasaver_arrayparams_tuples(experiment, SpectrumAnalyzer, DAC, N, M):
assert datasaver.points_written == N*M


-@settings(max_examples=5, deadline=None)
+@settings(max_examples=5, deadline=None, use_coverage=False)
Contributor

I think use_coverage should not be False?

ParamSpec objects. As long as they have a `name` field. If provided,
the start and end parameters select a range of results by result count
(index).
If the range is empty -- that is, if the end is less than or
equal to the start, or if start is after the current end of the
DataSet – then a list of empty arrays is returned.

For a more type independent and easier to work with view of the data
you may want to consider using
:py:meth:`qcodes.dataset.data_exporter.get_data_by_id`
Contributor

I believe there's a typo here, shouldn't it be qcodes.dataset.data_export.get_data_by_id? (no ER after export)

@WilliamHPNielsen (Contributor) commented Aug 9, 2018

I've added a high-level tutorial notebook. I think it's good to have, and if nothing else, writing it convinced me that this is a kick-ass PR! There is also a suggestion for an extension of the test in the notebook's version of the MultiDimSpectrum.

@WilliamHPNielsen (Contributor) left a comment

Nice stuff right here! Approved modulo the few small typos mentioned above.

@astafan8 (Contributor) left a comment

Great job! With the updates to the notebooks, it's an even greater PR.

@jenshnielsen (Collaborator Author)

Addressed @WilliamHPNielsen's comments. I think this is ready to land if the CI passes.

@jenshnielsen jenshnielsen merged commit bab7f6e into microsoft:master Aug 17, 2018
@jenshnielsen jenshnielsen deleted the dataset_context_add_array branch August 17, 2018 13:00
giulioungaretti pushed a commit that referenced this pull request Aug 17, 2018
Merge: c044524 26335a1
Author: Jens Hedegaard Nielsen <jenshnielsen@gmail.com>

    Merge pull request #1207 from jenshnielsen/dataset_context_add_array
Successfully merging this pull request may close these issues:

register_parameter always assumes 'numeric' type