[general need] Currently, DataBrowser Bricks tag does not give at a glance the whole history of a data #236
Comments
I haven't thought about how to do this yet, but since we are currently working on tickets about iteration and indexing, I think we can handle this ticket at the same time. Indeed, we can imagine fixing this ticket by inheriting the Bricks tag from the input data and adding the brick the data passed through. Things may not be so simple in real life... for example, if there are several input data.
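A minimal sketch of this inheritance idea, assuming Bricks is modelled as a list of brick ids; the function name is hypothetical, not Mia's actual API:

```python
# Hypothetical sketch: an output's Bricks value is the union of its
# inputs' Bricks values plus the brick that just produced it.
def propagate_bricks(input_bricks_lists, current_brick_id):
    history = []
    for bricks in input_bricks_lists:
        for brick_id in bricks:
            if brick_id not in history:  # inputs may share ancestors
                history.append(brick_id)
    history.append(current_brick_id)
    return history

# With the BrickA/BrickB example from the issue description:
a_anat = propagate_bricks([[]], "BrickA")        # ['BrickA']
a_b_anat = propagate_bricks([a_anat], "BrickB")  # ['BrickA', 'BrickB']
```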
After a private discussion with @LStruber, it was decided:
Oh yes, you're right, I hadn't thought about what display to adopt depending on the "paths" of dependencies...
Hi,

@denisri, I think we agree on the main point, which is to offer the user a simple, useful solution that avoids redundancy. It is also true that the potential complexity of a pipeline (which you consider, in a general way, as a graph) makes a simple representation as a succession of bricks, in the special Bricks tag/field of the database, difficult, as we had imagined at the beginning... I am for the most efficient solution that can be realised as soon as possible (in this I tend to prefer an imperfect but existing solution over a perfect solution that does not exist), and I am always ready to take the best solution even if it was not the one I proposed. So I am not against following your proposal, which seems to meet the expectations. To make sure that we agree on the expectations: the main idea is to have, when I am interested in a data, all its history. This is useful when there is a problem with the data/pipeline, but not only then; it can simply be when we come back to a result several months later and want to see exactly how it was obtained. For point (1), I totally agree that the best solution would be to be able to access the pipeline as it is available in the editor, as you propose, in a different window. This would be a real improvement compared to what we had imagined at the beginning. However, I must admit that I don't completely see how we can implement this. I think @LStruber and I are ready to code in this direction, but we'll have to agree on and understand, at least minimally, how to proceed! Is the idea to allow opening a window reproducing the pipeline as we see it in the editor (with the possibility to open the pipeline to see the bricks in it, etc.), with the controller giving access to the parameters? I must admit that if this is the solution you are thinking about, I don't see a better one!
(1) I don't think (as far as I remember) that full pipelines are recorded in the history, but just individual processes (bricks), thus we cannot retrieve the exact full pipeline that produced the given data, but we can rebuild the part of the processing graph that led to it. In "standard" processing, we have to follow upstream data and the processes that produced them, and rebuild a pipeline from them. (2) will come for free since parameters are recorded in the history with process instances. We will get the full history of the data and its "ancestors", even if they were produced by different pipelines at different times. Up to now it's not very difficult.
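A hedged sketch of the rebuilding described in (1): starting from a target file, walk backwards through recorded bricks to collect the part of the processing graph that led to it. The record layout is illustrative, not the actual Mia database schema:

```python
# Walk upstream from a target file through recorded brick executions.
def rebuild_history(target_file, brick_records):
    """brick_records: dicts like {"id": ..., "inputs": [...], "outputs": [...]}."""
    history = set()
    to_visit = [target_file]
    while to_visit:
        current = to_visit.pop()
        for brick in brick_records:
            if current in brick["outputs"] and brick["id"] not in history:
                history.add(brick["id"])
                to_visit.extend(brick["inputs"])  # follow upstream data
    return history
```

This naive version ignores execution times, which is exactly where the ambiguities discussed later in the thread come from.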
Anyway, I can try to propose a pipeline rebuilding tool.
Yes on the master branch, but @LStruber has just started to work on the fix4#236 branch which, at the time of the initialisation/run (which are now launched automatically), recovers all the information (at this time we have all the information for the pipeline and parameters) in order to propose, in the database Bricks field, all the bricks traversed to arrive at the data in question; thanks to a combo box, only the last brick is shown by default, but the user can display all the bricks if he wishes. If you have not done it yet, you can test this fix4#236 branch, just to see what it gives (see the database Bricks field in the DataBrowser). Now there is a problem in the case of a non-linear pipeline, which is certainly the most common case, as raised by @LStruber earlier, and I won't come back to it. This way is not perfect, for all the reasons discussed earlier, but it has the advantage of allowing us to keep all the history at the time of the run and to save it in the database. Doing it afterwards while rebuilding will inevitably lead to loss of information and, in some cases, to the impossibility of rebuilding the history at all. I think that we should find a solution based on both points of view... What you propose (the possibility to represent exactly the pipeline as we see it in the editor, and visualisation of all the parameters of the bricks as in the controller) and what we had imagined by saving the information (the self.workflow object or something derived from it?) at runtime?
I haven't looked at the fix4#236 branch (yet; I can't switch right now because I have modifications in progress in my current branch). But I wonder how you overcome the difficulties that have been raised above:
Doing this, have you modified the structure of the database?
Currently in the master branch, when a pipeline is run, all executed bricks are added to the brick table of the database. What I did in the fix4#236 branch is that, instead of appending only the last brick that produced the file, I append to the bricks tag every brick that was used to produce the file. Then, to answer your questions:
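A toy contrast of the two behaviours, with a plain dict standing in for the database (names and calls here are illustrative only, not the real Mia API):

```python
db = {}

def set_last_brick(file_name, brick_id):
    """master branch behaviour: the Bricks value is replaced."""
    db[file_name] = [brick_id]

def append_bricks(file_name, brick_ids):
    """fix4#236 behaviour: brick ids accumulate, keeping the history."""
    db.setdefault(file_name, []).extend(brick_ids)

append_bricks("A_anat.nii", ["BrickA"])
append_bricks("A_B_anat.nii", ["BrickA", "BrickB"])
print(db["A_B_anat.nii"])  # ['BrickA', 'BrickB'] - the whole history is kept
```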
OK then, the only concern I still have is about erasing a file's history: if I understand correctly, you always append brick ids to the bricks tag of file data, and never remove anything from it?
Thanks @LStruber.
@denisri I'm not sure of the behavior in this case, I will check this asap. What I'm sure of is that the behavior regarding erasing history is the same in the master branch as in the fix4#236 branch. In both branches, new brick(s) is (are) appended to the history.
I just checked in MIA. Actually, the bricks tag seems to be erased each time a pipeline is run, so the bricks tag only keeps track of the last pipeline's history.
Currently the master branch crashes if we go to the DataBrowser while a pipeline is running. To avoid the crash, it is necessary to wait for the end of the calculation before going into the DataBrowser.
I don't observe this behaviour, but I have modified the code since this was reported, so I don't know if it's fixed or if I haven't tested it in the same situation.
I agree, but I already have a number of meetings these days. Can it wait for about a couple of weeks?
With master at the last commit, I always observe the issue.
Edit: tested with mia_processes smooth and spatial_preprocessing_1. Same result: crash.
No problem, we can continue to discuss in the tickets.
I have tried using spatial_preprocessings_1 and morphologist, and in each case, while it was running, I could switch to the DataBrowser tab without a problem, and even see the data history. So I must be doing it in a different way, but I don't know how to reproduce the problem.
Thanks for the branch. I see what the problem is with my branch; however, I don't see for now how to fix it without disambiguating things with execution times as you've done in your branch, and that is precisely what I'm trying to avoid, because it seems that "sometimes" it does not work. By sometimes, I mean that with your solution (based on testing we've done with @servoz a few days ago on the master branch), it may work or not work on the exact same pipeline (spatial preprocessing) with the same execution parameters (I didn't dig into your code to understand why...). Sometimes (most of the time!) the whole pipeline was displayed, sometimes only one brick, sometimes the brick that overrides the output (coregister) was missing... Unfortunately, I'm not sure it is actually possible to fix all the problems we have without saving the real pipeline/workflow (and the brick collection is not adequate for that).
I tried something to disambiguate things when several inputs/outputs have the same name: saving in the database (as a hidden list) the level of dependency of each brick in the workflow (a list of ints, 1 meaning direct dependency, 2 meaning there is one brick in between, etc.). Then I check links between the input of a brick and the outputs of previous bricks by checking that they have the same name, and if there are several possibilities I choose the one with the lowest dependency level. This corresponds to commit ca7fc20. It is working on the spatial preprocessing pipeline (but it was already working before :))... @denisri, could you give your pipeline another try? As I modified the database entries, you may have to create a new project...
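A sketch of the disambiguation rule just described; the record layout is hypothetical, not the hidden list actually stored in commit ca7fc20:

```python
# Among previous bricks whose outputs match the input name, keep the one
# with the lowest dependency level (1 = direct dependency, 2 = one brick
# in between, etc.).
def resolve_input_link(input_name, previous_bricks):
    """previous_bricks: dicts like {"id": ..., "outputs": [...], "level": int}."""
    candidates = [b for b in previous_bricks if input_name in b["outputs"]]
    if not candidates:
        return None  # input comes from outside the recorded history
    return min(candidates, key=lambda b: b["level"])  # closest producer wins

bricks = [{"id": "coregister", "outputs": ["anat.nii"], "level": 1},
          {"id": "import", "outputs": ["anat.nii"], "level": 3}]
print(resolve_input_link("anat.nii", bricks)["id"])  # coregister
```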
This time I'm not sure I understand what makes it not work... Either I made a mistake somewhere (not unlikely), or the "level of dependency" between bricks that I added is not enough to cover all cases (also not unlikely :)). Anyway, in view of the 30 seconds of calculation, the answer is not very important, since that alone rules the solution out... This delay is due to the fact that, to disambiguate things, I need to iterate over all previous bricks to check if there is a potential conflict, and this for each input (whereas before, I stopped iterating as soon as I found a match). I wanted to try a solution between @denisri's and mine by iterating over all previous bricks and disambiguating conflicts with execution time, but it's a waste of time since the delay would be the same. That leaves us with two solutions:
However, it won't solve the problem of deleted files/bricks, as you also stated.
I may be repeating what @denisri said a few weeks ago, who was surely more clairvoyant, but I was sure it was possible to recreate the pipeline from the bricks only (I still think it is possible, but, as mentioned before, not in a reasonable time).
What do you think?
I haven't thought about it yet, but I just want to note that the pickle format is not suitable for long-term storage. It is good for temporary files, but not more: the pickle format (or its sub-formats) evolves, it is not compatible across different versions of Python (in real life it's impossible to write a pickle in Python 2 and read it in Python 3 or vice versa, even if it is documented to be possible), and it loads modules, thus depends on modules and their versions... But we can save workflows in JSON format, and pipelines have an XML format, which we may (should) move to JSON.
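A minimal sketch of the JSON direction: serialize a plain-dict description of the workflow rather than pickling live objects. The dict layout below is purely illustrative, not soma-workflow's actual schema:

```python
import json

workflow_desc = {
    "pipeline": "spatial_preprocessing_1",
    "bricks": [
        {"id": "BrickA", "inputs": ["anat.nii"], "outputs": ["A_anat.nii"],
         "parameters": {"fwhm": [6, 6, 6]}},
        {"id": "BrickB", "inputs": ["A_anat.nii"], "outputs": ["A_B_anat.nii"],
         "parameters": {}},
    ],
}

with open("workflow_history.json", "w") as f:
    json.dump(workflow_desc, f, indent=2)

# Unlike pickle, reading this back imports no project module and works
# across Python versions:
with open("workflow_history.json") as f:
    restored = json.load(f)
```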
Maybe it would be useful, for testing purposes, to have a fake Morphologist pipeline, with the same nodes and parameters as the "real" one, and its completion system, but with the execution part of each process replaced with fake things (basically a kind of generic function which, for instance, sleeps a given time and writes output files as text files with fake contents, such as one line containing the name of the process), so that it does not depend on actual execution code, programs and libraries. I could create it automatically via a script and send it to you, or put it somewhere in some test pipelines data. I guess this would also be useful for test/sandbox projects such as https://github.com/brainvisa/use-cases
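A sketch of that generic fake-execution function; names and signature are illustrative, not the actual script being proposed:

```python
import time

# Mimic a process: sleep to stand in for computation time, then write each
# output as a one-line text file naming the process.
def fake_run(process_name, output_files, duration=0.1):
    time.sleep(duration)  # stand-in for the real computation
    for path in output_files:
        with open(path, "w") as f:
            f.write("fake output written by %s\n" % process_name)

# e.g. a fake bias-correction node would just do:
fake_run("BiasCorrection", ["nobias_anat.nii"])
```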
Yes, it would be a good thing to have a complex shared use case.
I'm quite struggling to build and save a fake pipeline which really has the same structure as our "real" Morphologist. Some details of the plugs and links properties are not entirely saved (like an optional parameter exported as mandatory, or the contrary), and some of them are tuned internally when building the pipelines, so for now I always get a pipeline with a disabled part. I need more time to finish it...
I think we all agree that it is necessary to save the pipeline or workflow object in the database (in JSON format) to cover all cases. To start investigating/coding in this direction, we need to make some decisions:
And finally, do we need a meeting to set things up, or should I start coding with the answers to the previous questions?
I don't have definitive answers for the other questions. I guess if we store workflows or pipelines in the database, it would be a good idea to have a means to see them somewhere (at least for debugging)?
- proceed backwards, using execution timestamps to avoid confusion when data have been written several times. Ambiguities still exist when several processes write the same data at the same time, but this should not happen often.
- temp files do not exist any longer and are not indexed in the database, thus they break the history graphs. Moreover, they all appear as "<temp>" in process histories, and several temp files cannot be distinguished.
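A sketch of the timestamp rule from the first point above; the record fields are hypothetical:

```python
# Among recorded bricks that wrote a given file, keep the latest one that
# finished before the downstream brick started.
def producer_of(file_name, bricks, before_time=None):
    """bricks: dicts like {"id": ..., "outputs": [...], "end_time": float}."""
    candidates = [b for b in bricks if file_name in b["outputs"]]
    if before_time is not None:
        candidates = [b for b in candidates if b["end_time"] <= before_time]
    if not candidates:
        return None
    # the latest writer wins; exact ties remain ambiguous, as noted above
    return max(candidates, key=lambda b: b["end_time"])
```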
Now the graphs seem quite complete, including when temp intermediate values are used (with a little risk of error, however). But I had to disable the history cleaning for data and orphan bricks for now, because the notion of obsolete/orphan has changed, and it was erasing things that are used in the indirect history of data.
The implementation doesn't use Process instances but a light wrapper, and a pipeline (with fake processes) is only built at the end, in data_history_pipeline(). The other functions are Process-agnostic. Docstrings have been added, as well as a simpler function, get_data_history_bricks(), which may be used in the history cleaning functions (which are currently disabled in pipeline runs).
In the DataBrowser, we have a special tag (`Bricks`) that has been created to provide the history of a data. Let's say we have this pipeline:

anat.nii -> |BrickA| -> A_anat.nii -> |BrickB| -> A_B_anat.nii

Currently, after launching this pipeline, we will find in the DataBrowser the `Bricks` tag with only the last brick the document has passed through. For this example:

- anat.nii: `Bricks` tag is empty
- A_anat.nii: `Bricks` tag = BrickA
- A_B_anat.nii: `Bricks` tag = BrickB

I think this is not optimal. Indeed, we may want to have at a glance the whole history of a data (rather than having to reconstruct this history, which may not be instantaneous). In this case we want in the DataBrowser:

- anat.nii: `Bricks` tag is empty
- A_anat.nii: `Bricks` tag = BrickA
- A_B_anat.nii: `Bricks` tag = BrickA BrickB