DataLad allows us to add any type of data to our existing dataset and track it. So far, however, this has been a manual process: we added data and then ran `datalad save` to commit it to our repository. To modify existing files, we ran `datalad get` to download them and `datalad unlock` to make them modifiable, edited the files, and then ran `datalad save` again.

However, in most data analysis projects, we don't generate and modify data by hand. Instead, we use programs and scripts that do so. While we could transfer the manual approach to running scripts, this would be rather tedious: we would always have to make sure the required files are present and unlocked, and save the new files after the script has run.

To make life easier, DataLad provides a `run` command. This command can execute any shell command (e.g., running a Python script), make sure that inputs are available, and keep track of outputs. It can even rerun commands or whole pipelines to easily reproduce results.

## DataLad, Run!

The basic syntax of the run command looks like this: `datalad run "<command>"`.
Here, `<command>` can be any command that you could execute in a terminal, for example `"python script.py"`.
DataLad will automatically save any new files generated by the `run` command and write a commit message.

| Command | Description |
| --- | --- |
|`datalad run "python script.py"` | Run the Python script `script.py` |
|`datalad run -m "run script" "python script.py"` | Run the Python script `script.py` and add the commit message `"run script"`|
| `git log` | View the dataset's history stored in the `git log` |
| `git log -1` | View the last entry in `git log` |
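
The commands in the table can also be assembled from Python, for example when a notebook needs to build them programmatically. The sketch below is only an illustration: `build_run_command` is a hypothetical helper, not part of DataLad, and it merely constructs the argument list without executing anything.

```python
import shlex


def build_run_command(command, message=None):
    """Return the argument list for a `datalad run` call.

    Hypothetical helper for illustration; it is not part of DataLad.
    """
    argv = ["datalad", "run"]
    if message is not None:
        # -m attaches a custom commit message, as in the table above.
        argv += ["-m", message]
    # The wrapped shell command is passed as one single argument.
    argv.append(command)
    return argv


argv = build_run_command("python script.py", message="run script")
print(argv)
# Render the list as one shell-safe line that could be pasted into a terminal.
print(shlex.join(argv))
```

Inside a dataset, such a list could be handed to `subprocess.run(argv, check=True)`. Passing a list avoids having to think about nested quotes.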

Run the cell below to download the penguins dataset, change the directory to `penguins/` and print the dataset's contents. The data contains some `code/`, some `data/` (CSV files) and some `example/` images of different penguins.

In [None]:
!datalad clone https://gin.g-node.org/obi/penguins
%cd penguins
!ls **

**Example**: Use `datalad run` to run an `echo` command that writes the text `'Penguins are cool'` to `penguins.md`.

**NOTE**: When using quotations inside quotations, we must use different ones, e.g. double for the outer and single for the inner quotation: `" '' "`

In [None]:
!datalad run "echo 'Penguins are cool'>penguins.md"
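
To see why the inner and outer quotes have to differ, it can help to look at how a shell would split the line into arguments. The sketch below uses Python's `shlex.split`, which approximates POSIX shell tokenization. It is a stand-in for the real shell, not what DataLad does internally.

```python
import shlex

# Outer double quotes, inner single quotes: the whole echo command
# stays together as one argument for `datalad run`.
good = "datalad run \"echo 'Penguins are cool'>penguins.md\""
good_tokens = shlex.split(good)
print(good_tokens)

# Double quotes for both levels: the inner quotes close the outer ones
# early, so the command falls apart into several arguments.
bad = 'datalad run "echo "Penguins are cool">penguins.md"'
bad_tokens = shlex.split(bad)
print(bad_tokens)
```

In the first case `datalad run` receives exactly one command string. In the second case it receives several fragments, which is not what we intended.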

**Exercise**: Check the last entry in `git log` to see the message generated by the run command.

In [None]:
!git log -1

**Exercise**: The `echo` command below appends another line to `penguins.md`. Wrap it with `datalad run` and execute it. Then check the last entry in `git log` to see the commit message created by the run command.

In [None]:
!echo 'The Linux mascot is a penguin '>>penguins.md

In [None]:
!datalad run "echo 'The Linux mascot is a penguin '>>penguins.md"
!git log -1

**Exercise**: Use `datalad run` to execute an echo command that lists all penguin species in this dataset (adelie, chinstrap, gentoo) in the file `"species.txt"` and show the last entry in `git log`.

**BONUS**: Add a custom commit message

In [None]:
!datalad run -m "listing species" "echo 'adelie chinstrap gentoo' > species.txt"
!git log -1

**Exercise**: Try to use `datalad run` to execute the Python script `code/aggregate_culmen_data.py`. What error do you observe?

In [None]:
!datalad run "python code/aggregate_culmen_data.py"

## Handling Inputs and Outputs

In the previous exercise, we got an error because the `aggregate_culmen_data.py` script requires the CSV files in `data/` as input but we haven't downloaded those file contents yet. While we could simply do `datalad get data` and then rerun the command, there is a better way: we can give the run command the required `--input` and it will automatically get the content if required. We can also add the `--output` and the run command will automatically unlock the required files which allows us to overwrite them.

When specifying inputs and outputs, there is a tradeoff between verbosity and specificity. On the one hand, listing every single file can be very tedious; on the other hand, declaring a whole directory as input or output risks downloading or overwriting some files by accident.
Often, a good compromise is to use all files of a given type.
For example, `--input data/*.csv` means that any file in the `data/` folder whose name ends in `.csv` will be used as input.
This kind of wildcard is called a glob pattern. Globs are simpler than a [regular expression (regex)](https://www.regular-expressions.info/), but they still let you write concise and powerful queries.
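
To get a feeling for which files a wildcard pattern such as `data/*.csv` selects, here is a minimal sketch that uses Python's `pathlib` globbing as a stand-in. The file names are made up for the demonstration and are created in a temporary directory. DataLad expands the pattern itself, so details may differ.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical file names, created only for this demonstration.
    for name in ["data/a.csv", "data/b.csv", "data/notes.txt", "data/raw/c.csv"]:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # `*` matches any file name, but it does not descend into subfolders.
    matches = sorted(
        p.relative_to(root).as_posix() for p in root.glob("data/*.csv")
    )

print(matches)
```

Only the CSV files directly inside `data/` are selected. The text file and the CSV file in the nested folder are left out.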

| Command | Description |
| --- | --- |
|`datalad run --input "data.csv" "python script.py"` | Run `script.py` with input `"data.csv"` | 
|`datalad run --input "data/" "python script.py"` | Run `script.py` with the whole `"data/"` folder as input | 
|`datalad run --input "data/*.csv" "python script.py"` | Run `script.py` with every CSV file in `"data/"` as input | 
|`datalad run --output "figure.png" "python script.py"` | Run `script.py` with the output `"figure.png"`|
|`datalad run --input "data.csv" --output "figure.png" "python script.py"` | Run `script.py` with input `"data.csv"` and output `"figure.png"`|
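
As a mental model, a `datalad run` call with `--input` and `--output` roughly bundles the manual steps from the beginning of this chapter. The sketch below spells those steps out as strings. `manual_equivalent` is a hypothetical helper, and the listing is a simplification: the real command also writes the commit message for us and only fetches or unlocks content when that is actually required.

```python
def manual_equivalent(command, inputs=(), outputs=()):
    """List the manual steps that a `datalad run` call roughly replaces.

    Hypothetical helper for illustration; nothing is executed here.
    """
    # Inputs are fetched first so that the script finds their content.
    steps = [f"datalad get {path}" for path in inputs]
    # Existing outputs are unlocked so that the script may overwrite them.
    steps += [f"datalad unlock {path}" for path in outputs]
    # Then the actual command runs and the results are saved.
    steps.append(command)
    steps.append("datalad save")
    return steps


steps = manual_equivalent(
    "python script.py", inputs=["data.csv"], outputs=["figure.png"]
)
for step in steps:
    print(step)
```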


**Exercise**: Repeat the `datalad run` command from the previous exercise but add all CSV files in `data/` as `--input`.

In [None]:
!datalad run --input "data/*.csv" "python code/aggregate_culmen_data.py"

**Exercise**: Repeat the datalad command from the previous exercise - what does the error message tell you?

In [None]:
!datalad run --input "data" "python code/aggregate_culmen_data.py"

**Exercise**: Repeat the run command from the previous exercise but add `"results/penguin_culmens.csv"` as `--output`.

In [None]:
!datalad run --input "data" --output "results/penguin_culmens.csv" "python code/aggregate_culmen_data.py"

**Exercise**: Use `datalad run` to execute the Python script `code/plot_culmen_length_vs_depth.py` with `results/penguin_culmens.csv` as `--input` and `results/culmen_length_vs_depth.png` as `--output`. Then, execute the cell below to display the resulting plot.

In [None]:
!datalad run --input "results/penguin_culmens.csv" --output "results/culmen_length_vs_depth.png" "python code/plot_culmen_length_vs_depth.py"

In [None]:
from IPython.display import Image
Image("results/culmen_length_vs_depth.png", width=600)

## From Single Scripts to Analysis Pipelines

Another nice feature of DataLad is the ability to `rerun` certain commands. This allows you to quickly rerun an analysis step after making changes to a script without having to retype the whole run command.
You can also rerun all steps `--since` a certain commit.
So, if your analysis consists of a series of `datalad run` commands, you can reproduce the entire pipeline with a single command!

| Command | Description |
| --- | --- |
| `datalad rerun a268d8ca22b6` | Rerun the command from the `git log` with the checksum starting with `a268d8ca22b6e87959` |
| `datalad rerun --since a268d8ca22b6` | Rerun ALL commands `--since` the one with the checksum starting with `a268d8ca22b6e87959` |
| `git log -2` | View the last two entries in `git log` |
| `git log --oneline` | Get a compact view of the `git log` |

**Exercise**: View the last entry of the `git log` to see the message created by the last run command and note the commit hash (the first few characters are enough).

In [None]:
!git log -1

**Exercise**: Rerun the last `datalad run` command.

In [None]:
!datalad rerun 8a40afb

**Exercise**: Check the `git log`. Did rerunning the command create a new entry?

In [None]:
!git log --oneline -2

**Exercise**: Open `code/plot_culmen_length_vs_depth.py` and change the `dpi` in `fig.savefig()` to 150. Then, save the file and use `datalad save` to track the changes. Now rerun the same run command and check the `git log`. Did the rerun create a new commit this time?

In [None]:
!datalad save
!datalad rerun 8a40afb
!git log --oneline -3

**Exercise**: Find the commit hash of the entry just before the first run command and rerun everything `--since` that commit (i.e. the full "analysis" pipeline).

In [None]:
!git log --oneline

In [None]:
!datalad rerun --since 0e8aebb
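
Finding the right commit by eye gets harder as the history grows. The sketch below shows one way the lookup could be automated. It rests on two assumptions: the log text comes from `git log --oneline --no-decorate`, and commits created by `datalad run` carry the subject prefix `[DATALAD RUNCMD]`. The hashes and messages in `sample_log` are invented for the example and are not taken from the penguins dataset.

```python
# Assumption: subject prefix of commits created by `datalad run`.
RUN_PREFIX = "[DATALAD RUNCMD]"


def commit_before_first_run(oneline_log):
    """Return the hash of the commit just before the oldest run commit.

    Returns None if there is no run commit or nothing precedes it.
    """
    entries = [
        line.split(maxsplit=1)
        for line in oneline_log.splitlines()
        if line.strip()
    ]
    # git log lists the newest commit first, so start from the oldest.
    entries.reverse()
    previous_hash = None
    for commit_hash, *subject in entries:
        if subject and subject[0].startswith(RUN_PREFIX):
            return previous_hash
        previous_hash = commit_hash
    return None


# Invented log text in the `<hash> <subject>` layout of `git log --oneline`.
sample_log = """\
ccc3333 [DATALAD RUNCMD] plot culmen data
bbb2222 [DATALAD RUNCMD] aggregate culmen data
aaa1111 add analysis scripts
0000000 initial commit
"""

since = commit_before_first_run(sample_log)
print(f"datalad rerun --since {since}")
```

The returned hash is the one to pass to `datalad rerun --since`, so that every run command after it is executed again.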