Dataset Context manager. Allow insertion of Data directly as arrays #1207
Conversation
A combination would be meaningless. Otherwise a multidimensional array is not inserted as individual data points but as a blob per point in the outermost dimension.
Could you expand a bit more in the description on what you changed exactly? Because I see quite some stuff related to ArrayParameter, and the stuff that is related to tuples.
Also, 👍 (+1) for new
self._results.append(res_dict)
param_spec = self.parameters[str(partial_result[0])]
if param_spec.type == 'array' and index == 0:
    res_dict[str(partial_result[0])] = partial_result[1]
is res_dict.update(..) better? (see below)
It's significantly slower by my benchmark:
mydict = {'A': 'Foo', 'B': 'Bar'}
%timeit mydict['C'] = 'FooBar'
35.4 ns ± 1.78 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit mydict['C'] = 'FooBar'
36 ns ± 0.393 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit mydict.update({'C': 'FooBar'})
159 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
mmmmmm..... i might not be right, but for the .update case, could it be that creation of a new dictionary ({'C': 'FooBar'}) takes some time (while in the first two cases, just two strings are created)? in other words, can you try the same but with {'C': 'FooBar'} being created before the benchmark? also perhaps, 'C' and 'FooBar' created before as well for the upper two cases?
It's not all of it
mydict = {'A': 'Foo', 'B': 'Bar'}
myotherdict = {'C': 'FooBar'}
%timeit mydict['C'] = 'FooBar'
%timeit mydict.update({'C': 'FooBar'})
%timeit mydict.update(myotherdict)
34.9 ns ± 0.977 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
154 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
139 ns ± 4.94 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I don't think update is the right method here. It's really for merging dicts; we only have one element to insert.
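For reference, the gap discussed above can be reproduced with the stdlib timeit module (a rough sketch; absolute numbers will vary by machine):

```python
import timeit

setup = "mydict = {'A': 'Foo', 'B': 'Bar'}; other = {'C': 'FooBar'}"
n = 200_000

# Single-key item assignment: one hash-and-store on the dict.
t_assign = timeit.timeit("mydict['C'] = 'FooBar'", setup=setup, number=n)

# update() with a dict literal: allocates a fresh dict, then merges it.
t_update_literal = timeit.timeit("mydict.update({'C': 'FooBar'})",
                                 setup=setup, number=n)

# update() with a pre-built dict: skips the allocation but still pays
# the method-call and merge machinery.
t_update_prebuilt = timeit.timeit("mydict.update(other)", setup=setup, number=n)

print(t_assign, t_update_literal, t_update_prebuilt)
```

This matches the numbers above: the dict-literal allocation is only part of the overhead, and plain item assignment stays the fastest.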
qcodes/dataset/measurements.py
Outdated
str)):
    value = cast(Union[Sequence, np.ndarray], value)
    if isinstance(value, np.ndarray):
        value = np.atleast_1d(value).ravel()
when I was profiling this, I saw indeed that after the thing that I've fixed already, atleast_1d takes quite some time together with the as_float data adapter (if i call it correctly). And now with the .. == 'array' logic, this is avoided basically.
Afaik this is only there to handle the case of 0-D numpy arrays (i.e. np.array(1)), which is not the same as a numpy scalar but will pass the isinstance check for an array and take this code path. We could rewrite it to make sure that it doesn't.
omg.... and i guess writing an explicit if statement (as opposed to just an isinstance check) to cover the case is less efficient, hence the atleast_1d without any comments around for the rationale.
Yes, atleast_1d is surprisingly slow, especially for already-1-D and higher arrays. I will replace it with a single reshape if needed:
a = np.array((0))
a
Out[59]: array(0)
a.shape
Out[60]: ()
%%timeit
if a.ndim == 0:
a.reshape(1)
565 ns ± 5.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = np.array((0,1,3))
%%timeit
if a.ndim == 0:
a.reshape(1)
41 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
a = np.array((0))
%%timeit
np.atleast_1d(a)
1.15 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a = np.array((0,1,3))
%%timeit
np.atleast_1d(a)
628 ns ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
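The reshape-based replacement benchmarked above could look like this as a small helper (the helper name is hypothetical, not the code in the PR):

```python
import numpy as np

def ensure_at_least_1d(value: np.ndarray) -> np.ndarray:
    # Only 0-D arrays (e.g. np.array(1)) need reshaping; already-1-D and
    # higher arrays are returned untouched, so the common path skips the
    # overhead that np.atleast_1d would add.
    if value.ndim == 0:
        return value.reshape(1)
    return value

print(ensure_at_least_1d(np.array(1)).shape)          # (1,)
print(ensure_at_least_1d(np.array([0, 1, 3])).shape)  # (3,)
```

Note that for arrays that are already 1-D or higher the helper returns the very same object, with no copy.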
param_spec = self.parameters[str(partial_result[0])]
if param_spec.type == 'array' and index == 0:
    res_dict[str(partial_result[0])] = partial_result[1]
elif param_spec.type != 'array':
btw, shall we finally make an enum with these SQLite data types? Isn't there already one in the sqlite module?
Done by exposing the list in ParamSpec
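The "expose the list in ParamSpec" approach can be sketched as follows (class layout and attribute names here are illustrative, not the actual qcodes implementation):

```python
class ParamSpec:
    # Allowed SQLite storage types, exposed as a class attribute so that
    # callers can validate a paramtype before constructing a ParamSpec.
    allowed_types = ('array', 'numeric', 'text')

    def __init__(self, name: str, paramtype: str) -> None:
        # Input validation happens once, in the constructor, so every
        # call site gets the same check and error message for free.
        if paramtype not in ParamSpec.allowed_types:
            raise ValueError(f"Illegal paramtype {paramtype!r}; "
                             f"must be one of {ParamSpec.allowed_types}")
        self.name = name
        self.type = paramtype
```

The enum suggested in the thread would serve the same purpose; a plain exposed tuple keeps the existing string-based comparisons (`param_spec.type == 'array'`) working unchanged.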
@astafan8 There is basically no difference between the benchmarks before and after the changes. In both cases I am seeing runtimes between 400 and 500 ms. With the changes to avoid atleast_1d I see a consistent speedup. I added more benchmarks using the store-as-array method demonstrating that it's faster, along with a notebook showing the speedup.
]
# we are less interested in the cpu time used and more interested in
# the wall clock time used to insert the data so use a timer that measures
# wallclock time
wallclock time of this process, you mean, right?
No. Wall-clock time of a process is IMHO undefined; there is only one wall-clock time.
qcodes/dataset/data_set.py
Outdated
Args:
    *params: string parameter names, QCoDeS Parameter objects, and
        ParamSpec objects
    start:
    end:

Returns:
    list of rows of data. Each row will contain a list of columns
"each row will contain a list of columns"? what do you mean exactly here?
qcodes/dataset/data_set.py
Outdated
provided when the DataSet was created. The parameter list may contain
a mix of string parameter names, QCoDeS Parameter objects, and
The values are returned as a list of lists, rows by columns, e.g.
datapoint by parameter. The data type of each element is based on the
I still did not understand the format. for an sql table like this:

x | y | z
--|---|--
1 | 2 | 3
4 | 5 | 6

is it going to return:

[[1, 2, 3],
 [4, 5, 6]]

?
Yes. I rewrote this to hopefully make it clearer.
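That rows-by-columns layout is what sqlite3 itself hands back from a query; a minimal self-contained illustration (the table here is made up for the example, not the real qcodes schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (x, y, z)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)",
                 [(1, 2, 3), (4, 5, 6)])

# fetchall-style iteration yields one tuple per row, i.e. one entry
# per datapoint, each containing one value per parameter (column).
rows = [list(r) for r in conn.execute("SELECT x, y, z FROM results")]
print(rows)  # [[1, 2, 3], [4, 5, 6]]
conn.close()
```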
@@ -524,7 +582,7 @@ def register_custom_parameter(
    depends_on, inf_from = self._registration_validation(name, sp_strings,
                                                         bs_strings)

-   parspec = ParamSpec(name=name, paramtype='numeric',
+   parspec = ParamSpec(name=name, paramtype=paramtype,
no checks for paramtype here?
ah, found it in the ParamSpec constructor.
"""
# input validation
if paramtype not in ParamSpec.allowed_types:
is this check necessary here, or, instead, can we wait until the ParamSpec object is instantiated? Seems like it's good that it's here, because the error message refers to the fact that a qcodes parameter is being passed in to the method...
@@ -0,0 +1,2541 @@
{
What is the way to convert the following into just list of numbers? (see closer to the end of the notebook)
[[array([0.71390511])],
[array([0.90052465])],
[array([0.93697729])],
[array([0.90260228])],
Added example of that to the notebook
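One way to turn such a nested list of one-element arrays into a flat list of numbers (a sketch; the notebook's own example may differ):

```python
import numpy as np

# Shape as returned above: each row is a list holding a 1-element array.
nested = [[np.array([0.71390511])],
          [np.array([0.90052465])],
          [np.array([0.93697729])],
          [np.array([0.90260228])]]

# Unwrap the inner list and concatenate the 1-element arrays into a
# single 1-D array, then convert to plain Python floats.
flat = np.concatenate([row[0] for row in nested]).tolist()
print(flat)
```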
@@ -0,0 +1,2541 @@
{
"cells": [
Very nice notebook! three comments:
- could you use log scale for the amount of numbers to insert?
- could you add 1.000, 10.000, perhaps 100.000 as well to the benchmarking?
- could you set the database location to smth else so that people's default .db files don't get suddenly filled with this test data?
> could you use log scale for the amount of numbers to insert?

I chose a linear scale because the scaling is linear in the number of points.

> could you add 1.000, 10.000, perhaps 100.000 as well to the benchmarking?

I added some more examples to the first example but not to the second one, to keep the notebook runtime down.

> could you set the database location to smth else so that people's default .db files don't get suddenly filled with this test data?

Done
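The isolation idea can be sketched with a throwaway database file (plain sqlite3 shown here for self-containment; in qcodes one would instead point the configured database location at a temporary path before running the benchmarks):

```python
import os
import sqlite3
import tempfile

# Create the benchmark database inside a temporary directory so the
# user's default .db file is never touched by the test data.
tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "benchmark.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE points (value REAL)")
conn.executemany("INSERT INTO points VALUES (?)",
                 [(float(i),) for i in range(100)])
conn.commit()
conn.close()

print(db_path)  # everything benchmark-related lives under tmpdir
```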
The docstring for … Also, although it's all in the tests and the performance notebook, I think it would be good to have a dedicated tutorial notebook showing an example of storing and retrieving arrays. Perhaps we can extend existing notebooks? (I wouldn't mind doing that)
@@ -794,10 +827,11 @@ def test_datasaver_arrayparams_tuples(experiment, SpectrumAnalyzer, DAC, N, M):
    assert datasaver.points_written == N*M

-@settings(max_examples=5, deadline=None)
+@settings(max_examples=5, deadline=None, use_coverage=False)
I think use_coverage should not be False?
qcodes/dataset/data_set.py
Outdated
ParamSpec objects, as long as they have a `name` field. If provided,
the start and end parameters select a range of results by result count
(index).
If the range is empty -- that is, if the end is less than or
equal to the start, or if start is after the current end of the
DataSet -- then a list of empty arrays is returned.

For a more type-independent and easier-to-work-with view of the data
you may want to consider using
:py:meth:`qcodes.dataset.data_exporter.get_data_by_id`
I believe there's a typo here; shouldn't it be qcodes.dataset.data_export.get_data_by_id? (no ER after export)
I've added a high-level tutorial notebook. I think it's good to have, and if nothing else, writing it convinced me that this is a kick-ass PR! There is also a suggestion for an extension of the test in the notebook's version of the
Nice stuff right here! Approved modulo the few small typos mentioned above.
Great job! With the updates to the notebooks, it's an even greater PR.
Addressed @WilliamHPNielsen's comments. I think this is ready to land if the CI passes.
This is significantly faster than unrolling and inserting point by point. This in turn uncovered some other issues:

- unroll the first axis, resulting in inconsistent results.
- get_data_by_id to return full arrays of setpoints regardless of the way they are inserted.

TODO:

- DataSet.get_data will return a different result from extracting the data inserted row by row, but the get_data_by_id exporter works as expected.

Edit by William: this PR fixes #1108