Non-existing event introduced when converting pmaps DataFrames to dictionaries #149
I have in my fork (gonzaponte/issue149) a demonstration of the failure. I had to modify the test file so it does not include event #0. Although I have written a new test to demonstrate the bug, all tests fail because of the bug. However, since the new test is more exhaustive, I will rewrite the previous ones to work like the new one.
Would it be possible/better to do this by adding another test file? How big a new data file would you need to be able to write this test?

Please do a whitespace cleanup.

You have removed the initializations for …

Please squash into two commits: one adding the tests in their final form, leaving the implementation alone; one implementing the solution, leaving the tests alone.

@jjgomezcadenas Please have a quick look at this, and comment on how this fits in the grand scheme of things. Also please comment on the desirability of adding a new test data file vs. modifying the existing one.
It is possible, but maybe redundant. It will not be very large (a few tens or hundreds of kB).
Ok.
No good reason; it is just not necessary, as they are always assigned when needed. Maybe initializing them to 0 can be misleading, but it is not a big deal. I will set them to 0.
Ok.
No. My bad. I looked at the code too hastily. You are initializing them before first use, so it's fine as it is. As you say, it might even be better this way.
This was bugging me, and it became clear that I wasn't going to get to sleep until I had fixed it ...

It turns out that, beautiful though it was, the `xxx` implementation based on Pandas `groupby` was two orders of magnitude slower than `s12df_to_s12l`. It looks like Pandas `groupby` is fundamentally slow. There may be a way of achieving the desired effect with `numpy.unique` or something similar, but, for now, `xxx` has been implemented in Cython, abandoning any attempt to make the result contain small sections of the original dataframe.

The new `xxx` now passes exactly the same tests as `s12df_to_s12l`, and the tests used to drive the development of `xxx` are now also being applied to `s12df_to_s12l`: all these tests are parametrized with both implementations.

As far as I can see so far, `xxx` has two advantages over `s12df_to_s12l` (and its Cython core `cdf_to_dict`):

+ it is legible
+ it is 15% faster

The next step is to see how `xxx` stacks up against next-exp#149, and how it compares to the proposed solution.

In the long term we want to look at whether we want the dictionaries which are produced to contain namedtuples or dataframes, rather than the current tuple of numpy arrays representing the times and energies in each peak.
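As speculated above, the grouping could perhaps be done with `numpy.unique` instead of `groupby`. A minimal sketch of the idea (the column names `event`/`time`/`ene` and the helper name are my assumptions, and it assumes rows are already sorted by event number):

```python
import numpy as np
import pandas as pd

def unique_to_dict(df):
    # Assumes the rows are already sorted by event number.
    events = df["event"].values
    uniq, starts = np.unique(events, return_index=True)
    times    = np.split(df["time"].values, starts[1:])
    energies = np.split(df["ene"].values,  starts[1:])
    return {ev: (t, e) for ev, t, e in zip(uniq, times, energies)}

df = pd.DataFrame({"event": [3, 3, 7],
                   "time":  [1000.0, 1025.0, 2000.0],
                   "ene":   [  10.0,   20.0,    5.0]})
d = unique_to_dict(df)
assert set(d) == {3, 7}                   # no spurious event #0
assert list(d[3][0]) == [1000.0, 1025.0]  # times of event 3
```

Because this makes a single pass over contiguous arrays, it avoids the per-group overhead that makes `groupby` slow for many small groups.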
I have merged (yes! merged, don't worry, it's only temporary: it serves the purpose of keeping track of the alternatives until we decide which one we want) the implementation of `xxx`.

It seems that `xxx` passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.
Are there any other problems or bugs pending with `s12df_to_s12l`?

As this is a function which is pivotal and has been problematic, we should make an effort to test it more thoroughly.

@neuslopez Could you please confirm that my timings make sense: I merely ran timeit on the s1 dataframe that comes out of …

Good night.
Well, I don't find it much more readable. In `s12df_to_s12l` the workflow is clearer, imho. But of course, this is a matter of interpretation. Concerning performance, `s12df_to_s12l` incurs a function call overhead because it is a Python function calling the Cython one. It can be written in the same fashion as yours (a pure Cython function). I guess this structure exists because we want a pure Python interface, @jjgomezcadenas?
I think it is ok!
I don't think there are more bugs, but we can extend testing to include those that came up in issue #146.
Using branch issue149, see below the new timings using the new xxx (same Kr file):
It does, but this experiment shows that a pure-Python function call can be performed in 86 nanoseconds, while the lookup of the function's name takes 14, leaving no more than about 70 nanoseconds for the overhead of calling a pure-Python function. This tiny …
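The experiment can be reproduced with `timeit` along these lines (a sketch; the exact numbers are machine- and interpreter-dependent):

```python
import timeit

def f():
    pass

# Time one million calls (which include the name lookup) and one million
# bare name lookups, taking the best of five repeats.
call   = min(timeit.repeat("f()", globals={"f": f}, number=1_000_000, repeat=5)) / 1e6
lookup = min(timeit.repeat("f",   globals={"f": f}, number=1_000_000, repeat=5)) / 1e6

print(f"call:          {call   * 1e9:5.0f} ns")
print(f"lookup:        {lookup * 1e9:5.0f} ns")
print(f"call - lookup: {(call - lookup) * 1e9:5.0f} ns")
```

The difference between the two timings bounds the overhead of the call itself, which is why a sub-100 ns cost is irrelevant for a function executed once per file.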
In this case I cannot see any advantages offered by the pure Python interface wrapper. It just needlessly complicates and obfuscates the issue. I'm always happy to be educated, though.

In any case, the current version of …

Tasks remaining:
Once that is done we could merge. Or we could discuss whether we really want dictionaries containing dictionaries, containing tuples, containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations though:
Oh, and for cross-referencing purposes, this issue is addressed in PR #162. (GitHub used to have the capability to turn issues into PRs. They got rid of it. A shame, as it seemed very useful.)
The only "advantage" I see is that gathering all functions in a single file (pmaps_functions.py) allows you to forget the different implementations (either python or cython). For example in your fork you have:
The following may be nicer to look at:
This can still be addressed by creating aliases or using a from statement in the python module.
Agreed. JJ wants to use DFs because of the statistical and correlation properties they provide. After using the current tuple-of-arrays structure in a small analysis I don't see any advantage in using the DF functionalities, but my imagination is limited and future analyses may need them.
I am not very sure that we are following a sensible approach here. To review:
1) I had written a function to read back pmaps and transform them to the desired format in transient memory.
@gonzaponte had identified a number of issues and was working in fixing them. He has worked out a re-imp of the function and new tests.
Was it really necessary that @Jacq and @neuslopez would write yet another implementation (xxx)? I have a memory of complaining about it and receiving an immediate reprimand (including being called names, e.g. "art guy"). The lecture included a number of reasons to substitute my original implementation with something that did not reinvent the wheel, but which turned out to be two orders of magnitude slower. Then a new re-reimplementation resurrected the wheel in Cython code. Now we have 3 implementations of the darned function. So my questions are:
1) Can we solve this? By all means get rid of my faulty, clumsy and artistic implementation. Use instead that of @gonzaponte, or xxx. Choose one and let's be done with it.
2) Can we agree on some scheme to share the work that does not prompt us to three re-implementations of the same function in the future? We have a lot of work and limited manpower. Everyone tries hard to go by the book. But we also need to keep in mind that data taking starts next week and we haven't yet managed to write corona. While totally agreeing with the boss about reduction of TD and such, I believe that we can do better than spinning around the re-implementation of a single function!
Sorry about top posting, writing from an iPhone on the train.
Sent from my iPhone
… On 8 Mar 2017, at 11:36, Gonzalo ***@***.***> wrote:
As far as I can see so far, xxx has two advantages over s12df_to_s12l (and it's cython core cdf_to_dict):
it is legible
it is 15% faster
Well, I don't find it much more readable. In s12df_to_s12l the workflow is clearer, imho. But of course, this is a matter of interpretation. Concerning performance, s12df_to_s12l contains a function call overload because it is a python function calling the cython one. It can by written in the same fashion as yours (pure cython function). I guess this structure comes because we want a pure python interface, @jjgomezcadenas?
In any case, I don't think performance is critical for this function, which will only be executed once per file.
It seems that xxx passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.
I think it is ok!
Are there any other problems or bugs pending with s12df_to_s12l?
I don't think there are more bugs, but we can extend testing to include those that came up in issue #146.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
Totally agreed. As far as I am concerned, all the Cython modules are low-level implementations of the Python modules, and I would like not to see them being imported directly in client code at all. (IIRC, @jjgomezcadenas disagrees.) But the solution to this is not to implement pure-Python wrappers (containing arbitrary subsets of the functionality, like in the case of …

… and we're done. No need to write a wrapper around it. And indeed, you seem to acknowledge this in your comment.
Seems jolly sensible to me. As long as the huge indexing overhead of dataframes doesn't prevent us from doing other things in a reasonable amount of time.
I really would like those to be named tuples, though. I'll implement that tomorrow morning, unless someone beats me to it. It should take all of 30 seconds. Hmm, I might just do it now ... OK, I've pushed it to my fork on the issue149 branch. (No test for accessing time and energy by name yet, though!)
My five cents
1) I like pure Python hiding the Cython implementation, basically for the same reasons offered by @gonzaponte.
2) tuple of arrays is fine for now.
J
Sent from my iPhone
… On 8 Mar 2017, at 13:56, Gonzalo ***@***.***> wrote:
In this case I cannot see any advantages offered by the pure python interface wrapper. It just needlessly complicates and obfuscates the issue
The only "advantage" I see is that gathering all functions in a single file (pmaps_functions.py) allows you to forget the different implementations (either python or cython). For example in your fork you have:
```python
import invisible_cities.reco.pmaps_functions   as pmapf
import invisible_cities.reco.pmaps_functions_c as pmapf_c

pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf_c.xxx))
```
The following may be nicer to look at:
```python
import invisible_cities.reco.pmaps_functions as pmapf

pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf.xxx))
```
This can still be addressed by creating aliases or using a from statement in the python module.
Once that is done we could merge. Or we could discuss whether we really want dictionaries containing, dictionaries, containing tuples, containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations though:
After my experience with writing xxx, I am less keen on the dataframes: indexing them is very slow! But perhaps this doesn't matter in our usage?
This issue probably needs to be discussed with reference to #157.
Agreed. JJ wants to use DFs bacause of the statistical and correlation properties they provide. After using the current tuple-of-arrays structure in a small analysis I don't see any advantage in using the DFs functionalities, but my imagination is limited and future analysis may need them.
Please don't confuse statements about specific things you say, with calling you names. (We've been here before, haven't we?). It is patently clear that you are not an ARTist, but the argument you used on that occasion was very ARTistic. I make no apology for pointing this out, and I would be grateful if you return the favour, if the situation is ever reversed.
While it was two orders of magnitude slower, it did have three rather important merits over the original.
The process demonstrated a clear performance problem with Pandas dataframes of which I, for one, was not previously aware. Oh, and it took less time to implement than these meta-discussions.
Yes, the ideas in the high-level imp were expressed at a lower level, and passed all tests at first go, beating all other existing implementations for speed.
That goes without saying: it contains bugs which neither of its successors do.
I submit to you that if you were pera-programming with us when we came up with `xxx`, you would have been an enthusiastic supporter of it. It took little time, and it was a very edifying experience.
This meta-discussion is taking more effort than the re-imp. The delays caused by the bugs in the original took up much more resources than the re-imp. The re-imp increased the bus number of the function in question by two. If your point is that we should talk less and do more, I agree. If your point is that we should not write code when trying to understand problematic code, then I respectfully and vehemently disagree.
No worries, GitHub hid the noise.
Stripped to the bare bones, we are talking about these two alternatives:
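The two alternatives (preserved in the quoted reply further down) can be sketched in runnable form; here the stdlib `math` module stands in for the Cython extension module (e.g. `pmaps_functions_c`), and `sqrt` stands in for the function being exposed:

```python
import math  # stands in for the Cython module, e.g. pmaps_functions_c

# Option 1: a pure-Python wrapper forwarding to the low-level function.
def sqrt_option1(x):
    return math.sqrt(x)

# Option 2: simply re-export the low-level function under the public name.
from math import sqrt as sqrt_option2

# Both expose identical behaviour; option 2 avoids the wrapper-call overhead
# and the need to keep the wrapper's signature in sync with the original.
assert sqrt_option1(4.0) == sqrt_option2(4.0) == 2.0
```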
Agreed?
Yes, agreed!
Sent from my iPhone
… On 8 Mar 2017, at 15:42, jacg ***@***.***> wrote:
i like pure pyhon hiding cython imp basically for the same reasons offered by @gonzaponte
Stripped to the bare bones, we are talking about these two alternatives:
Option 1
Inside python_implementation.py you write
```python
import cython_implementation

def feature():
    return cython_implementation.feature()
```
Option 2
Inside python_implementation.py you write
```python
from cython_implementation import feature
```
Agreed?
Good. Now, to me, ceteris paribus, one of those options is clearly better than the other. Agreed?
@neuslopez Could you please give us some idea of how long it takes …
Find comments inline (hopefully GitHub will still hide the extra noise).
On 8 Mar 2017, at 15:35, jacg ***@***.***> wrote:
Was it really necessary that @Jacq <https://github.com/jacq> and @neuslopez <https://github.com/neuslopez> would write yet another implementation?
Given that the original was write-only: yes.
Given that the brokenness of this function was getting in our way in everything we wanted to do, and re-implementing a new one was easier than understanding and fixing the origial: yes.
Given that it was the quickest way to understand what cdf_to_dict was supposed to do: yes.
Given that it demonstrates in 4 clear lines what 49 convoluted lines were failing to do correctly: yes.
Given that it increased the bus number of the feature considerably: yes.
Given that someone else (e.g. Gonzalo) was already looking at the problem and he was not around. Debatable, I think.
being called names, eg art guy
Please don't confuse statements about specific things you say, with calling you names. (We've been here before, haven't we?). It is patently clear that you are not an ARTist, but the argument you used on that occasion was very ARTistic. I make no apology for pointing this out, and I would be grateful if you return the favour, if the situation is ever reversed.
Oh, well, I will manage to recover from current low level of Karma, don’t worry.
did not reinvent the wheel but turned out to be two orders of magnitude slower
While it was two orders of magnitude slower, it did have three rather important merits over the original.
it was correct.
it was obviously correct ... not to mention tested.
it explained what the faulty one was attempting to do.
The process demonstrated a clear performance problem with Pandas dataframes of which I, for one, was not previously aware. Oh, and it took less time to implement than these meta-discussions.
Incidentally, the reason I wrote the write-only, faulty, nasty, ugly Cython implementation was because I had found long ago that Pandas DFs were slow. Alas, I failed to make that part of the public lore. Now it is clear.
Then, a new re-reimp resurrected the wheel in cython code.
Yes, the ideas in the high-level imp were expressed at a lower level, and passed all tests at first go, beating all other existing implementations for speed.
can we solve this? By all means get rid of my faulty, clumsy and artistic imp.
That goes without saying: it contains bugs which neither of its successors do.
can we agree in some scheme to share the work that does not prompt us to three reimps of the sane function in the future?
I submit to you that if you were pera-programming with us when we came up with xxx, you would have been an enthusiastic supporter of it. It took little time, and it was a very edifying experience.
I believe that we can do better than spinning around the reimp of a single function!
This meta-discussion is taking more effort that the re-imp. The delays caused by the bugs in the original, took up much more than resources than the re-imp. The re-imp increased the bus number of the function in question by two.
If your point is that we should talk less and do more, I agree. If your point is that we should not write code when trying to understand problematic code, then I respectfully and vehemently disagree.
My point is simple: There is life (I hope) beyond XXX. Please, PR XXX and friends and let’s be done with it.
… Sorry about top posting, writing from iPhone on the train. Sent from my iPhone
No worries, GitHub hid the noise.
Yes, option 2. Point taken.
… On 8 Mar 2017, at 16:07, jacg ***@***.***> wrote:
Good. Now, to me, ceteris paribus, one of those options is clearly better than the other.
Agreed?
The bus number of this thing is now 4. Without …
Your low level of Karma is but an illusion. Your Karma remains high.
That's what I was proposing before this meta-discussion got started. Repeating the task list I made at that point ... Tasks remaining:
Decide whether we prefer cdf_to_dict + s12df_to_s12l or xxx.
My vote goes for xxx. The only thing I don’t like about it is that I wasn’t around eating some peras and increasing my Karma for free. I suggest that we use xxx and my previous version in our docs to show the difference between read/write and write-only code.
If we choose the former, combine it into a single Cython function (i.e. option 2, above).
If we go for xxx we skip that step. What is your opinion, @gonzaponte?
I would just go and submit a PR with the existing stuff (it all works), so that we can incorporate this important stuff into our running release. Then we can get @gonzaponte and @neuslopez to eat peras and expand tests.
To me, …
I've just realized that the …
OK, I'll clean up the branch and submit before I retire for the night. Failing that, first thing tomorrow morning.
I think that last sentence is seriously raising your karma.
Well, yes, that did seem to be a fundamental feature of the data I had to go on. But I suppose you are right: it is an assumption which could be violated by the pmaps writer. While …

FWIW, here is my take on …
Perhaps we prefer this one to …

I will prepare the PR with …

One more thing: are we attached to the name …?

And another thing: do I have your permission to replace the tuple with a namedtuple? (I have the implementation ready and tested. It slightly degrades the performance, but I'm confident that this is completely irrelevant.)
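The tuple-to-namedtuple swap really is tiny. A sketch (the type name `Peak` and the field names `t`/`E` are my assumptions, not necessarily what was pushed):

```python
from collections import namedtuple

import numpy as np

# "Peak" and its field names are illustrative; the real code may differ.
Peak = namedtuple("Peak", ("t", "E"))

peak = Peak(t=np.array([1000.0, 1025.0, 1050.0]),
            E=np.array([  10.0,   20.0,   15.0]))

# Fields are accessible by name ...
assert peak.E[1] == 20.0
# ... while existing tuple-style unpacking keeps working unchanged.
t, E = peak
assert t is peak.t and E is peak.E
```

Because a namedtuple is a tuple subclass, all existing client code that indexes or unpacks the pair continues to work, which is why the change is low-risk.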
My five cents:
What a wonderful method!
Fine with me.
Since it is already in pmaps_functions.py, I would avoid naming it pmaps (simply df_to_dict?).
Me too.
The write-only, buggy function `s12df_to_s12l` (the bulk of whose implementation was in the Cython function `cdf_to_dict`) has been replaced with a single Cython function `df_to_pmaps_dict`. An alternative implementation replaces this one in the next commit. Note: if the order of events in the input dataframe is not monotonically increasing *and* an event number limit is set, silent errors are likely to occur. Extensive tests for the feature have been added.
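The monotonicity caveat can be illustrated with a toy reader (a hypothetical sketch, not the actual IC implementation): a loop that stops after `max_events` distinct events silently drops any later rows belonging to an already-seen event.

```python
# Hypothetical illustration of the failure mode: a reader that stops
# after `max_events` distinct events have been started.
def collect_events(event_column, max_events):
    result = {}
    for i, ev in enumerate(event_column):
        if ev not in result:
            if len(result) == max_events:
                break              # stops silently: no error is raised
            result[ev] = []
        result[ev].append(i)
    return result

# Monotonic events: every row of the first two events is collected.
assert collect_events([1, 1, 2, 2, 3], max_events=2) == {1: [0, 1], 2: [2, 3]}
# Non-monotonic events: row 3 belongs to event 1 but is silently dropped.
assert collect_events([1, 2, 3, 1], max_events=2) == {1: [0], 2: [1]}
```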
Even better, for this purpose, is …
The function `cdf_to_dict` in `pmaps_functions_c.pyx` always returns event #0, even when it is not present in the input DataFrame.
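A regression test for this could look roughly as follows (the `df_to_dict` helper and the column names are hypothetical stand-ins for the IC function and schema):

```python
import pandas as pd

# A pmaps-like dataframe that deliberately contains no event #0.
df = pd.DataFrame({"event": [3, 3, 7],
                   "time":  [1000.0, 1025.0, 2000.0],
                   "ene":   [  10.0,   20.0,    5.0]})

# Reference conversion used by the test; hypothetical, not the IC function.
def df_to_dict(df):
    return {ev: (g["time"].values, g["ene"].values)
            for ev, g in df.groupby("event")}

d = df_to_dict(df)
assert 0 not in d          # a correct conversion must not invent event #0
assert set(d) == {3, 7}
```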