CAUTION: Work-in-Progress!
This DataLad extension provides facilities to register arbitrary commands for files in git-annex,
which are then executed if datalad get
is called on those files (and they are not yet present).
This extension provides a new high-level datalad command called getexec
which can be used to register commands on files.
In the following we will assume that we have the extension installed and are inside a DataLad dataset.
As a simple example, we can register a command that writes "Hello World!" into a text file called "test.txt":
datalad getexec --path test.txt -- 'bash' '-c' 'printf "Hello World!" > "$1"' 'test-cmd'
As a result of this, we now have the file "test.txt" with it's expected content. Since we told git-annex that we can recreate this file with the specified bash call, we can now safely drop the file and then automatically get it recreated:
datalad drop test.txt
datalad get test.txt
Since our registered program might depend on some other annex'ed files we can specify those dependencies as well:
datalad getexec --path depends-on-test.txt -i test.txt -- 'bash' '-c' '(cat test.txt; printf "\nMore Text.") > "$1"' 'test-cmd'
This way,
if datalad get
is called on "depends-on-test.txt" git-annex will make sure,
that "test.txt" is present before executing the registered command.
Therefore,
the following will work:
datalad drop test.txt
datalad drop depends-on-test.txt
datalad get depends-on-test.txt
There are some limitations to what commands can be registered.
First of all,
there is no shell interpretation happening;
the command is essentially passed verbatim to python's subprocess.run
.
This is why the examples above look a bit more complex with the call to bash.
In the above examples,
each quoted part after --
becomes one element in the list passed to subprocess.run
.
In practice,
it would be a good idea to externalize the command into e.g. a shell script
and have a single argument in the getexec
call.
Second,
the command is expected to always produce a single output file,
the location of which is passed as the first (and only) argument to the command.
This is the $1
in the bash calls above.
Lastly,
since the command is executed in the context of a get
,
the resulting file is always expected to remain the same.
This means that two consecutive calls to the command need to produce files with identical checksums,
otherwise git-annex will complain.
Essentially,
the command is expected to behave somewhat like a pure function.
If this does not fit your use-case you are probably looking for DataLad's builtin run
and rerun
.
This extension works by implementing a new git-annex special remote which kind of abuses the URL handling of git-annex. The special remote takes responsibility of all URLs with a scheme of "getexec:"; encoded inside of these URLs then is the necessary information to re-execute the registered command. The DataLad part of the extension simply takes the user input, generates a matching URL for it and then registers the URL with git-annex.
When get
ing a file that is not currently present,
git-annex will do it's usual thing to determine from where to fetch the data.
If git-annex determines that the special remote of this extension should provide the data
then it will rerun the registered command.