Non-existing event introduced when converting pmaps DataFrames to dictionaries #149
I have in my fork (gonzaponte/issue149) a demonstration of the failure. I had to modify the test file so it does not include event #0. Although I have written a new test to demonstrate the bug, all tests fail because of the bug. However, since the new test is more exhaustive, I will rewrite the previous ones to work like the new one.
Would it be possible/better to do this by adding another test file? How big a new data file would you need to be able to write this test?

Please do a whitespace cleanup.

You have removed the initializations for …

Please squash into two commits: one adding the tests in their final form, leaving the implementation alone; one implementing the solution, leaving the tests alone.

@jjgomezcadenas Please have a quick look at this, and comment on how this fits in the grand scheme of things. Also please comment on the desirability of adding a new test data file vs. modifying the existing one.
It is possible, but maybe redundant. It will not be very large (a few tens or hundreds of kB).
Ok.
No good reason; it is just not necessary, as they are always assigned when needed. Maybe initializing them to 0 can be misleading, but it is not a big deal. I will set them to 0.
Ok.
No. My bad. I looked at the code too hastily. You are initializing them before first use, so it's fine as it is. As you say, it might even be better this way.
This was bugging me, and it became clear that I wasn't going to get to sleep until I had fixed it ...

It turns out that, beautiful though it was, the `xxx` implementation based on Pandas `groupby` was two orders of magnitude slower than `s12df_to_s12l`. It looks like Pandas `groupby` is fundamentally slow. There may be a way of achieving the desired effect with `numpy.unique` or something similar, but, for now, `xxx` has been implemented in Cython, abandoning any attempt to make the result contain small sections of the original dataframe.

The new `xxx` now passes exactly the same tests as `s12df_to_s12l`, and the tests used to drive the development of `xxx` are now also being applied to `s12df_to_s12l`: all these tests are parametrized with both implementations.

As far as I can see so far, `xxx` has two advantages over `s12df_to_s12l` (and its Cython core `cdf_to_dict`):

+ it is legible
+ it is 15% faster

The next step is to see how `xxx` stacks up against next-exp#149, and how it compares to the proposed solution.

In the long term we want to look at whether we want the dictionaries which are produced to contain namedtuples or dataframes, rather than the current tuple of numpy arrays representing the times and energies in each peak.
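As speculated above, the grouping could perhaps be done with `numpy.unique` instead of `groupby`. A minimal sketch of the idea (the column names `event`/`time`/`ene` and the helper name are my assumptions, and it assumes rows are already sorted by event number):

```python
import numpy as np
import pandas as pd

def unique_to_dict(df):
    # Assumes the rows are already sorted by event number.
    events = df["event"].values
    uniq, starts = np.unique(events, return_index=True)
    times    = np.split(df["time"].values, starts[1:])
    energies = np.split(df["ene"].values,  starts[1:])
    return {ev: (t, e) for ev, t, e in zip(uniq, times, energies)}

df = pd.DataFrame({"event": [3, 3, 7],
                   "time":  [1000.0, 1025.0, 2000.0],
                   "ene":   [  10.0,   20.0,    5.0]})
d = unique_to_dict(df)
assert set(d) == {3, 7}                   # no spurious event #0
assert list(d[3][0]) == [1000.0, 1025.0]  # times of event 3
```

Because this makes a single pass over contiguous arrays, it avoids the per-group overhead that makes `groupby` slow for many small groups.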
I have merged (yes! merged, don't worry, it's only temporary: it serves the purpose of keeping track of the alternatives until we decide which one we want) the implementation of `xxx`.

It seems that `xxx` passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.
Are there any other problems or bugs pending with `s12df_to_s12l`?

As this is a function which is pivotal and has been problematic, we should make an effort to test it more thoroughly.

@neuslopez Could you please confirm that my timings make sense: I merely ran timeit on the s1 dataframe that comes out of …

Good night.
Well, I don't find it much more readable. In `s12df_to_s12l` the workflow is clearer, imho. But of course, this is a matter of interpretation. Concerning performance, `s12df_to_s12l` incurs a function call overhead because it is a Python function calling the Cython one. It can be written in the same fashion as yours (a pure Cython function). I guess this structure exists because we want a pure Python interface, @jjgomezcadenas?
I think it is ok!
I don't think there are more bugs, but we can extend testing to include those that came up in issue #146.
Using branch issue149, see below the new timings using the new xxx (same Kr file):
It does, but this experiment shows that a pure-Python function call can be performed in 86 nanoseconds, while the lookup of the function's name takes 14, leaving no more than about 70 nanoseconds for the overhead of calling a pure-Python function. This tiny …
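The experiment can be reproduced with `timeit` along these lines (a sketch; the exact numbers are machine- and interpreter-dependent):

```python
import timeit

def f():
    pass

# Time one million calls (which include the name lookup) and one million
# bare name lookups, taking the best of five repeats.
call   = min(timeit.repeat("f()", globals={"f": f}, number=1_000_000, repeat=5)) / 1e6
lookup = min(timeit.repeat("f",   globals={"f": f}, number=1_000_000, repeat=5)) / 1e6

print(f"call:          {call   * 1e9:5.0f} ns")
print(f"lookup:        {lookup * 1e9:5.0f} ns")
print(f"call - lookup: {(call - lookup) * 1e9:5.0f} ns")
```

The difference between the two timings bounds the overhead of the call itself, which is why a sub-100 ns cost is irrelevant for a function executed once per file.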
In this case I cannot see any advantages offered by the pure Python interface wrapper. It just needlessly complicates and obfuscates the issue. I'm always happy to be educated, though.

In any case, the current version of …

Tasks remaining:
Once that is done we could merge. Or we could discuss whether we really want dictionaries containing dictionaries, containing tuples, containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations though:
Oh, and for cross-referencing purposes, this issue is addressed in PR #162. (GitHub used to have the capability to turn issues into PRs. They got rid of it. A shame, as it seemed very useful.)
The only "advantage" I see is that gathering all functions in a single file (pmaps_functions.py) allows you to forget the different implementations (either python or cython). For example in your fork you have:
The following may be nicer to look at:
This can still be addressed by creating aliases or using a from statement in the python module.
Agreed. JJ wants to use DFs because of the statistical and correlation properties they provide. After using the current tuple-of-arrays structure in a small analysis I don't see any advantage in using the DF functionalities, but my imagination is limited and future analyses may need them.
I am not very sure that we are following a sensible approach here. To review:
1) I had written a function to read back pmaps and transform them to the desired format in transient memory.
@gonzaponte had identified a number of issues and was working in fixing them. He has worked out a re-imp of the function and new tests.
Was it really necessary that @Jacq and @neuslopez would write yet another implementation (xxx)? I have a memory of complaining about it and receiving an immediate reprimand (including being called names, e.g. "art guy"). The lecture included a number of reasons to substitute my original implementation with something that did not reinvent the wheel, but which turned out to be two orders of magnitude slower. Then a new re-reimplementation resurrected the wheel in Cython code. Now we have 3 implementations of the darned function. So my questions are:
1) Can we solve this? By all means get rid of my faulty, clumsy and artistic implementation. Use instead that of @gonzaponte, or xxx. Choose one and let's be done with it.
2) Can we agree on some scheme to share the work that does not prompt us to three re-implementations of the same function in the future? We have a lot of work and limited manpower. Everyone tries hard to go by the book. But we also need to keep in mind that data taking starts next week and we haven't yet managed to write corona. While totally agreeing with the boss about reduction of TD and such, I believe that we can do better than spinning around the re-implementation of a single function!
Sorry about top posting, writing from an iPhone on the train.
Sent from my iPhone
… On 8 Mar 2017, at 11:36, Gonzalo ***@***.***> wrote:
As far as I can see so far, xxx has two advantages over s12df_to_s12l (and it's cython core cdf_to_dict):
it is legible
it is 15% faster
Well, I don't find it much more readable. In s12df_to_s12l the workflow is clearer, imho. But of course, this is a matter of interpretation. Concerning performance, s12df_to_s12l contains a function call overload because it is a python function calling the cython one. It can by written in the same fashion as yours (pure cython function). I guess this structure comes because we want a pure python interface, @jjgomezcadenas?
In any case, I don't think performance is critical for this function, which will only be executed once per file.
It seems that xxx passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.
I think it is ok!
Are there any other problems or bugs pending with s12df_to_s12l?
I don't think there are more bugs, but we can extend testing to include those that came up in issue #146.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
Totally agreed. As far as I am concerned, all the Cython modules are low-level implementations of the Python modules, and I would like not to see them being imported directly in client code at all. (IIRC, @jjgomezcadenas disagrees.) But the solution to this is not to implement pure-Python wrappers (containing arbitrary subsets of the functionality, like in the case of …

… and we're done. No need to write a wrapper around it. And indeed, you seem to acknowledge this in your comment.
Seems jolly sensible to me. As long as the huge indexing overhead of dataframes doesn't prevent us from doing other things in a reasonable amount of time.
I really would like those to be named tuples, though. I'll implement that tomorrow morning, unless someone beats me to it. It should take all of 30 seconds. Hmm, I might just do it now ... OK, I've pushed it to my fork on the issue149 branch. (No test for accessing time and energy by name yet, though!)
My five cents
1) I like pure Python hiding the Cython implementation, basically for the same reasons offered by @gonzaponte.
2) tuple of arrays is fine for now.
J
Sent from my iPhone
… On 8 Mar 2017, at 13:56, Gonzalo ***@***.***> wrote:
In this case I cannot see any advantages offered by the pure python interface wrapper. It just needlessly complicates and obfuscates the issue
The only "advantage" I see is that gathering all functions in a single file (pmaps_functions.py) allows you to forget the different implementations (either python or cython). For example in your fork you have:
```python
import invisible_cities.reco.pmaps_functions   as pmapf
import invisible_cities.reco.pmaps_functions_c as pmapf_c

pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf_c.xxx))
```
The following may be nicer to look at:
```python
import invisible_cities.reco.pmaps_functions as pmapf

pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf.xxx))
```
This can still be addressed by creating aliases or using a from statement in the python module.
Once that is done we could merge. Or we could discuss whether we really want dictionaries containing, dictionaries, containing tuples, containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations though:
After my experience with writing xxx, I am less keen on the dataframes: indexing them is very slow! But perhaps this doesn't matter in our usage?
This issue probably needs to be discussed with reference to #157.
Agreed. JJ wants to use DFs bacause of the statistical and correlation properties they provide. After using the current tuple-of-arrays structure in a small analysis I don't see any advantage in using the DFs functionalities, but my imagination is limited and future analysis may need them.
Please don't confuse statements about specific things you say, with calling you names. (We've been here before, haven't we?). It is patently clear that you are not an ARTist, but the argument you used on that occasion was very ARTistic. I make no apology for pointing this out, and I would be grateful if you return the favour, if the situation is ever reversed.
While it was two orders of magnitude slower, it did have three rather important merits over the original.
The process demonstrated a clear performance problem with Pandas dataframes of which I, for one, was not previously aware. Oh, and it took less time to implement than these meta-discussions.
Yes, the ideas in the high-level imp were expressed at a lower level, and passed all tests at first go, beating all other existing implementations for speed.
That goes without saying: it contains bugs which neither of its successors do.
I submit to you that if you were pera-programming with us when we came up with `xxx`, you would have been an enthusiastic supporter of it. It took little time, and it was a very edifying experience.
This meta-discussion is taking more effort than the re-imp. The delays caused by the bugs in the original took up much more resources than the re-imp. The re-imp increased the bus number of the function in question by two. If your point is that we should talk less and do more, I agree. If your point is that we should not write code when trying to understand problematic code, then I respectfully and vehemently disagree.
No worries, GitHub hid the noise.
Stripped to the bare bones, we are talking about these two alternatives:
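The two alternatives (preserved in the quoted reply further down) can be sketched in runnable form; here the stdlib `math` module stands in for the Cython extension module (e.g. `pmaps_functions_c`), and `sqrt` stands in for the function being exposed:

```python
import math  # stands in for the Cython module, e.g. pmaps_functions_c

# Option 1: a pure-Python wrapper forwarding to the low-level function.
def sqrt_option1(x):
    return math.sqrt(x)

# Option 2: simply re-export the low-level function under the public name.
from math import sqrt as sqrt_option2

# Both expose identical behaviour; option 2 avoids the wrapper-call overhead
# and the need to keep the wrapper's signature in sync with the original.
assert sqrt_option1(4.0) == sqrt_option2(4.0) == 2.0
```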
Agreed?
Yes, agreed!
Sent from my iPhone
… On 8 Mar 2017, at 15:42, jacg ***@***.***> wrote:
i like pure pyhon hiding cython imp basically for the same reasons offered by @gonzaponte
Stripped to the bare bones, we are talking about these two alternatives:
Option 1
Inside python_implementation.py you write
```python
import cython_implementation

def feature():
    return cython_implementation.feature()
```
Option 2
Inside python_implementation.py you write
```python
from cython_implementation import feature
```
Agreed?
Good. Now, to me, ceteris paribus, one of those options is clearly better than the other. Agreed?
@neuslopez Could you please give us some idea of how long it takes …
Find comments inline (hopefully GitHub will still hide the extra noise).
On 8 Mar 2017, at 15:35, jacg ***@***.***> wrote:
Was it really necessary that @Jacq <https://github.com/jacq> and @neuslopez <https://github.com/neuslopez> would write yet another implementation?
Given that the original was write-only: yes.
Given that the brokenness of this function was getting in our way in everything we wanted to do, and re-implementing a new one was easier than understanding and fixing the origial: yes.
Given that it was the quickest way to understand what cdf_to_dict was supposed to do: yes.
Given that it demonstrates in 4 clear lines what 49 convoluted lines were failing to do correctly: yes.
Given that it increased the bus number of the feature considerably: yes.
Given that someone else (e.g. Gonzalo) was already looking at the problem and he was not around. Debatable, I think.
being called names, eg art guy
Please don't confuse statements about specific things you say, with calling you names. (We've been here before, haven't we?). It is patently clear that you are not an ARTist, but the argument you used on that occasion was very ARTistic. I make no apology for pointing this out, and I would be grateful if you return the favour, if the situation is ever reversed.
Oh, well, I will manage to recover from current low level of Karma, don’t worry.
did not reinvent the wheel but turned out to be two orders of magnitude slower
While it was two orders of magnitude slower, it did have three rather important merits over the original.
it was correct.
it was obviously correct ... not to mention tested.
it explained what the faulty one was attempting to do.
The process demonstrated a clear performance problem with Pandas dataframes of which I, for one, was not previously aware. Oh, and it took less time to implement than these meta-discussions.
Incidentally, the reason I wrote the write-only, faulty, nasty, ugly Cython implementation was because I had found long ago that Pandas DFs were slow. Alas, I failed to make that part of the public lore. Now it is clear.
Then, a new re-reimp resurrected the wheel in cython code.
Yes, the ideas in the high-level imp were expressed at a lower level, and passed all tests at first go, beating all other existing implementations for speed.
can we solve this? By all means get rid of my faulty, clumsy and artistic imp.
That goes without saying: it contains bugs which neither of its successors do.
can we agree in some scheme to share the work that does not prompt us to three reimps of the sane function in the future?
I submit to you that if you were pera-programming with us when we came up with xxx, you would have been an enthusiastic supporter of it. It took little time, and it was a very edifying experience.
I believe that we can do better than spinning around the reimp of a single function!
This meta-discussion is taking more effort that the re-imp. The delays caused by the bugs in the original, took up much more than resources than the re-imp. The re-imp increased the bus number of the function in question by two.
If your point is that we should talk less and do more, I agree. If your point is that we should not write code when trying to understand problematic code, then I respectfully and vehemently disagree.
My point is simple: There is life (I hope) beyond XXX. Please, PR XXX and friends and let’s be done with it.
… Sorry about top posting, writing from iPhone on the train. Sent from my iPhone
No worries, GitHub hid the noise.
Yes, option 2. Point taken.
… On 8 Mar 2017, at 16:07, jacg ***@***.***> wrote:
Good. Now, to me, ceteris paribus, one of those options is clearly better than the other.
Agreed?
The bus number of this thing is now 4. Without …
Your low level of Karma is but an illusion. Your Karma remains high.
That's what I was proposing before this meta-discussion got started. Repeating the task list I made at that point ... Tasks remaining:
Decide whether we prefer cdf_to_dict + s12df_to_s12l or xxx.
My vote goes for xxx. The only thing I don’t like about it is that I wasn’t around eating some peras and increasing my Karma for free. I suggest that we use xxx and my previous version in our docs to show the difference between read/write and write-only code.
If we choose the former, combine it into a single Cython function (i.e. option 2, above).
If we go for xxx we skip that step. What is your opinion, @gonzaponte?
I would just go and submit a PR with the existing stuff (it all works), so that we can incorporate this important stuff into our running release. Then we can get @gonzaponte and @neuslopez to eat peras and expand tests.
To me, …
I've just realized that the …
OK, I'll clean up the branch and submit before I retire for the night. Failing that, first thing tomorrow morning.
I think that last sentence is seriously raising your karma.
Well, yes, that did seem to be a fundamental feature of the data I had to go on. But I suppose you are right: it is an assumption which could be violated by the pmaps writer. While …

FWIW, here is my take on …
Perhaps we prefer this one to …

I will prepare the PR with …

One more thing: are we attached to the name …?

And another thing: do I have your permission to replace the tuple with a namedtuple? (I have the implementation ready and tested. It slightly degrades the performance, but I'm confident that this is completely irrelevant.)
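The tuple-to-namedtuple swap really is tiny. A sketch (the type name `Peak` and the field names `t`/`E` are my assumptions, not necessarily what was pushed):

```python
from collections import namedtuple

import numpy as np

# "Peak" and its field names are illustrative; the real code may differ.
Peak = namedtuple("Peak", ("t", "E"))

peak = Peak(t=np.array([1000.0, 1025.0, 1050.0]),
            E=np.array([  10.0,   20.0,   15.0]))

# Fields are accessible by name ...
assert peak.E[1] == 20.0
# ... while existing tuple-style unpacking keeps working unchanged.
t, E = peak
assert t is peak.t and E is peak.E
```

Because a namedtuple is a tuple subclass, all existing client code that indexes or unpacks the pair continues to work, which is why the change is low-risk.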
My five cents:
What a wonderful method!
Fine with me.
Since it is already in pmaps_functions.py, I would avoid naming it pmaps (simply df_to_dict?).
Me too.
The write-only, buggy function `s12df_to_s12l` (the bulk of whose implementation was in the Cython function `cdf_to_dict`) has been replaced with a single Cython function `df_to_pmaps_dict`. An alternative implementation replaces this one in the next commit. Note: if the order of events in the input dataframe is not monotonically increasing *and* an event number limit is set, silent errors are likely to occur. Extensive tests for the feature have been added.
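The monotonicity caveat can be illustrated with a toy reader (a hypothetical sketch, not the actual IC implementation): a loop that stops after `max_events` distinct events silently drops any later rows belonging to an already-seen event.

```python
# Hypothetical illustration of the failure mode: a reader that stops
# after `max_events` distinct events have been started.
def collect_events(event_column, max_events):
    result = {}
    for i, ev in enumerate(event_column):
        if ev not in result:
            if len(result) == max_events:
                break              # stops silently: no error is raised
            result[ev] = []
        result[ev].append(i)
    return result

# Monotonic events: every row of the first two events is collected.
assert collect_events([1, 1, 2, 2, 3], max_events=2) == {1: [0, 1], 2: [2, 3]}
# Non-monotonic events: row 3 belongs to event 1 but is silently dropped.
assert collect_events([1, 2, 3, 1], max_events=2) == {1: [0], 2: [1]}
```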
Even better, for this purpose, is …
The function `cdf_to_dict` in `pmaps_functions_c.pyx` always returns event #0, even when it is not present in the input DataFrame.
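A regression test for this could look roughly as follows (the `df_to_dict` helper and the column names are hypothetical stand-ins for the IC function and schema):

```python
import pandas as pd

# A pmaps-like dataframe that deliberately contains no event #0.
df = pd.DataFrame({"event": [3, 3, 7],
                   "time":  [1000.0, 1025.0, 2000.0],
                   "ene":   [  10.0,   20.0,    5.0]})

# Reference conversion used by the test; hypothetical, not the IC function.
def df_to_dict(df):
    return {ev: (g["time"].values, g["ene"].values)
            for ev, g in df.groupby("event")}

d = df_to_dict(df)
assert 0 not in d          # a correct conversion must not invent event #0
assert set(d) == {3, 7}
```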