Remove unused input from previous experiments in suites #98

Open
akunft opened this issue Aug 19, 2016 · 5 comments

@akunft
Contributor

akunft commented Aug 19, 2016

Currently we only remove the output from previous experiments when executing a suite.

I recently ran into problems with suites that use many different input data sets on a fixed system and cluster setup. The runs failed with out-of-disk-space errors because the input of previous experiments had used up all the disk space.

I suggest we introduce a flag for suites which, when enabled, removes the input of previous experiments (unless it is needed by the current experiment).

I would be happy to implement this feature, but I'm not sure whether we should make it the default or enable it via a flag.

@akunft akunft added the feature label Aug 19, 2016
@aalexandrov
Member

> Currently we only remove the output from previous experiments when executing a suite.

I think this is just because we were too lazy to do it better from the outset.

> I would be up to integrate this feature but I'm not sure if we should make it the default or enable it by flag.

👍 IMHO the implementation should be based on some sort of graph analysis that determines whether an input will still be needed later in the suite.

@aalexandrov aalexandrov self-assigned this Aug 19, 2016
@aalexandrov
Member

Sorry, it is not clear to me whether we are talking about ExperimentOutput or DataSet beans.

@akunft
Contributor Author

akunft commented Aug 19, 2016

I mean the DataSet beans from the experiment inputs.

👍 for the graph analysis instead of only checking the previous experiment. Then we can also make it the default.

Do you think we should extend this to the output data as well, to support suites with dependencies on previous output?

@aalexandrov
Member

Here is an idea for the implementation:

  • construct the initial dependency graph
  • keep track of the initial set of input and output nodes
  • after each experiment
    • remove the experiment node from the graph
    • query each input and output for dependent nodes. If the result is empty, delete the data.

I suggest introducing flags --preserve-input and --preserve-output that disable this logic for input and output data, respectively.
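
For future reference, the steps above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the project's actual implementation: the `Experiment` class, the `run_suite` function, and the `delete` callback are hypothetical names, data is identified by plain path strings, experiment names are assumed to be unique, and the two boolean parameters only mirror the proposed --preserve-input and --preserve-output flags.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """Hypothetical stand-in for an experiment in a suite."""
    name: str
    inputs: frozenset = frozenset()
    outputs: frozenset = frozenset()


def run_suite(experiments, delete, preserve_input=False, preserve_output=False):
    """Walk the suite in order and delete data no remaining experiment depends on."""
    # Initial dependency graph: data path -> names of experiments that still read it.
    dependents = defaultdict(set)
    for exp in experiments:
        for path in exp.inputs:
            dependents[path].add(exp.name)

    # Paths produced by some experiment count as outputs; all others are inputs.
    output_nodes = {path for exp in experiments for path in exp.outputs}

    for exp in experiments:
        # ... the experiment itself would be executed here ...

        # Remove the finished experiment node from the graph.
        for path in exp.inputs:
            dependents[path].discard(exp.name)

        # Query each input and output of this experiment for dependent nodes.
        for path in sorted(exp.inputs | exp.outputs):
            if dependents[path]:
                continue  # still needed by a later experiment
            is_output = path in output_nodes
            if is_output and preserve_output:
                continue
            if not is_output and preserve_input:
                continue
            delete(path)
```

With a two-experiment suite where the second experiment reads one shared input and the first experiment's output, the shared data survives the first experiment and is only deleted after the second one has finished.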

@akunft
Contributor Author

akunft commented Sep 8, 2016

Actually, I already have a prototype for the inputs only, but I could extend it to the output sets as well. It's not based on the graph, though.

akunft@437f917

There are actually two problems regarding the configuration:

  1. We would have to set the config for all descendants of the experiment beforehand. We need this to resolve the paths of the DataSet beans.
  2. As DataSet beans for the same input path are currently not represented as the same object, we have to override equals and hashCode, which requires the config to be resolved beforehand.

In the prototype, I did the implementation on the unresolved paths, but IMO we have to do it on the resolved paths.
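
To illustrate why the resolved paths matter, here is a small sketch in Python, where `__eq__` and `__hash__` play the role of equals and hashCode. Everything in it is an assumption made for illustration: the `DataSet` class is a hypothetical stand-in for the bean, and the `${key}` placeholder substitution is only a simplified model of how the config resolves paths.

```python
import posixpath
from string import Template


class DataSet:
    """Hypothetical stand-in for a DataSet bean with a templated path."""

    def __init__(self, path_template, config):
        self.path_template = path_template
        self.config = config

    def resolved_path(self):
        # Substitute ${key} placeholders from the config, then normalize the result.
        raw = Template(self.path_template).substitute(self.config)
        return posixpath.normpath(raw)

    def __eq__(self, other):
        if not isinstance(other, DataSet):
            return NotImplemented
        # Compare on the resolved path, not on the unresolved template.
        return self.resolved_path() == other.resolved_path()

    def __hash__(self):
        return hash(self.resolved_path())


# Two different templates that resolve to the same location compare equal.
a = DataSet("${input_dir}/points.csv", {"input_dir": "/data/in"})
b = DataSet("${root}/in/points.csv", {"root": "/data"})
# The same template under a different config resolves to a different location.
c = DataSet("${input_dir}/points.csv", {"input_dir": "/other"})
```

In this sketch `a` and `b` collapse into one entry when placed in a set, while `a` and `c` stay distinct, which is the behavior a comparison on unresolved paths cannot provide.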

I'm just referencing this here for future discussion.
