# Data Management for Neuroimaging with DataLad

Welcome to this introduction to DataLad at the INCF Neuroinformatics 2023. You can follow the tutorial by executing the code blocks.

The tutorial is based on [Neurohackademy 2022: Data Management for Neuroimaging with DataLad](https://handbook.datalad.org/en/latest/code_from_chapters/neurohackademy.html) and was created with DataLad 0.19.3.

## Introduction and setup

To run the notebook, you need to have **DataLad** and its external dependencies, **Git** and **git-annex**, installed. Several installation methods are available; they are covered in the [DataLad Handbook](https://handbook.datalad.org/en/latest/intro/installation.html#install) and on [DataLad's website](https://www.datalad.org/#install). Additionally, you need two DataLad extensions: **datalad-next** and (optionally) **datalad-container** (both available as Python packages through pip). For our examples, we will also need the following Python packages: **black**, **nilearn** and **matplotlib**.

We strongly recommend running the tutorial in a virtual environment (either conda or virtualenv). Many users find conda the easiest option, because it can install DataLad, Git, and git-annex together. It is also possible to use virtualenv (virtualenvwrapper), install DataLad with pip, and use a system-wide installation of Git and git-annex.

Note to Windows users: this notebook uses a bash kernel. Although the workflow would, in principle, be the same on Windows, differences between bash, cmd, and PowerShell make this notebook incompatible with the native Windows shells.

In [None]:
# pip install datalad # OR conda install datalad
# pip install datalad-next datalad-container
# pip install black nilearn matplotlib

If you are unsure about your version of DataLad, you can check it using the following command:

In [None]:
datalad --version
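
Because the tutorial was created with DataLad 0.19.3, it can be useful to compare the installed version against that number. The following is a minimal sketch and not part of the original notebook: `version_ge` is a hypothetical helper, it relies on `sort -V` (available in GNU coreutils and recent BSD/macOS, but not in POSIX), and it assumes that `datalad --version` prints the version number as the last word of its output.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort), which is not part of POSIX.
version_ge() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Only query DataLad if it is installed.
if command -v datalad >/dev/null 2>&1; then
  # Assumption: the version number is the last word of the output.
  installed=$(datalad --version | awk '{print $NF}')
  if version_ge "$installed" "0.19.3"; then
    echo "DataLad $installed is at least 0.19.3"
  else
    echo "DataLad $installed is older than 0.19.3"
  fi
fi
```

A newer version is not guaranteed to behave identically, so this check only flags installations that are older than the one the tutorial was written with.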

### Git identity
If you are unsure whether you have already configured your Git identity, check whether your name and email are printed to the terminal when you run:

In [None]:
git config --get user.name
git config --get user.email

If nothing is returned, you need to configure your Git identity.

In [None]:
# git config --global --add user.name "Bob McBobface"
# git config --global --add user.email "bobmcbobface@uw.edu"
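
The check and the configuration step can be combined so that an existing identity is never overwritten. This is a sketch under assumptions: `ensure_git_identity` is a hypothetical helper name, and it sets the values with a plain `git config --global` so that repeated calls do not accumulate duplicate entries.

```shell
# Hypothetical helper: set a global Git identity only if none is configured yet.
# Usage: ensure_git_identity "Full Name" "email@example.org"
ensure_git_identity() {
  # `git config --get` prints nothing when the key is not set
  if [ -z "$(git config --get user.name)" ]; then
    git config --global user.name "$1"
  fi
  if [ -z "$(git config --get user.email)" ]; then
    git config --global user.email "$2"
  fi
}
```

Calling `ensure_git_identity "Bob McBobface" "bobmcbobface@uw.edu"` on a machine with a configured identity leaves that identity unchanged.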

## Using DataLad

DataLad is a command line tool with a Python API. It also has a GUI for basic commands ([datalad-gooey](http://docs.datalad.org/projects/gooey/en/latest/?badge=latest)).

We'll operate it through the command line, but this is how you would create a dataset with the Python API:

``` python
import datalad.api as dl
dl.create(path='mydataset')
```

In scripts using other programming languages, DataLad commands can be invoked via system calls. Here is an example with R:

``` R
system("datalad create mydataset")
```

In the command line, typical usage consists of the datalad main command, optionally parametrized with additional flags, followed by a subcommand and its own optional flags.

![image](https://handbook.datalad.org/en/latest/_images/command-structure.png)

In [None]:
# display short help for create
datalad create -h

In [None]:
datalad create --help

In [None]:
# Technical section: remove the dataset created in the previous run
if [ -f ./my-analysis/.datalad/config ]; then datalad drop --what all --reckless kill --recursive --dataset my-analysis; fi

In [None]:
# Technical section: print working directory (want to be in INCF_preclinical/datalad)
pwd

## DataLad datasets
Everything happens in or involves DataLad datasets, DataLad’s core data structure. `datalad create` only needs a name; it creates a new directory under this name and instructs DataLad to manage it.

In [None]:
datalad create my-analysis

In [None]:
cd my-analysis

DataLad uses two mechanisms, Git and git-annex, to manage files. In this tutorial, we explicitly state which files should not be handed to git-annex, so that they are stored in Git directly. We'll use some bash-fu to add two configuration lines to the `.gitattributes` file if they're not already present (normally we'd edit the file in a text editor):
```
README.md annex.largefiles=nothing
code/* annex.largefiles=nothing
```

In [None]:
grep -qF 'README.md' .gitattributes || echo "README.md annex.largefiles=nothing" >> .gitattributes
grep -qF 'code' .gitattributes || echo "code/* annex.largefiles=nothing" >> .gitattributes
cat .gitattributes
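
The "append only if missing" pattern used in the cell above can be wrapped in a small function. This is a sketch, not part of the original notebook: `append_once` is a hypothetical helper name, and unlike the `grep -qF 'code'` test above (which matches any line that merely contains the string), it uses `grep -x` to look for the exact line.

```shell
# Hypothetical helper: append a line to a file unless that exact line is already present.
# Usage: append_once LINE FILE
append_once() {
  line=$1
  file=$2
  # -x matches whole lines only, -F treats the pattern as a fixed string;
  # errors are silenced so that a missing file simply leads to the append
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

With this helper, `append_once "code/* annex.largefiles=nothing" .gitattributes` can be run any number of times and adds the line at most once.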

In [None]:
datalad save -m "Set large files configuration"

## Version control

Version controlling a file means recording its changes over time and associating those changes with an author, a date, and an identifier. This creates a lineage of file content and makes it possible to revert changes or restore previous file versions. DataLad datasets make use of two established version control tools, Git and git-annex, to version control files regardless of size or type.

Let’s start building a dataset for an analysis by adding a README. The command below writes a simple header into a new file README.md:

In [None]:
echo "# My example DataLad dataset" > README.md

In [None]:
datalad status

New revisions (a.k.a. versions, snapshots, commits) are created explicitly:

In [None]:
datalad save -m "Create a short README"

Let's add more text to the file:

In [None]:
echo "This dataset contains a toy data analysis" >> README.md

In [None]:
git diff

In [None]:
datalad save -m "Add information on the dataset contents to the README"

With each saved change, you build up your dataset’s revision history. Tools such as git log allow you to interrogate this history, and if you want to, you can use this history to find out what has been done in a dataset, reset it to previous states, and much more:

In [None]:
git log
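
As an illustration of interrogating the history, the sketch below wraps two standard Git commands. The helper names `file_history` and `file_at` are hypothetical. The sketch assumes it is run from the dataset's root directory and that the file is stored in Git itself (such as `README.md`, given the `.gitattributes` configuration above); for an annexed file, `git show` would print the git-annex pointer rather than the file content.

```shell
# Hypothetical helper: list the commits that touched a file, newest first.
# Usage: file_history PATH
file_history() {
  git log --oneline -- "$1"
}

# Hypothetical helper: print a file as it was recorded N commits before HEAD
# (default: 1). This only prints; the working tree is left untouched.
# Usage: file_at PATH [N]
file_at() {
  git show "HEAD~${2:-1}:$1"
}
```

After the two saves above, `file_at README.md` would print the README as it was before the second line was added.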

## DataLad containers
- associate container with a dataset
- simplify execution of commands within the container
- see https://github.com/repronim/containers for a curated dataset with containers ready to use
- see [Handbook section](https://handbook.datalad.org/en/latest/basics/101-133-containersrun.html) for introduction

Adding a container (warning: 300 MB download)

In [None]:
# datalad containers-add nilearn \
#     --url shub://adswa/nilearn-container:latest

In [None]:
# datalad containers-list

Expected output:
```
nilearn -> .datalad/environments/nilearn/image
```

## Digital provenance

- author / date
- origin of file
  - url
  - command output

Download a script without provenance information:

In [None]:
wget -P code/ \
   https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py

In [None]:
datalad status

In [None]:
datalad save --message "Adding a nilearn-based script for brain masking"

### Registering URLs

Download a "large" file with provenance tracking:

In [None]:
datalad download-url -m "Add a tutorial on nilearn" \
   -O docs/nilearn-tutorial.pdf \
   https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf

In [None]:
datalad status

In [None]:
git annex whereis docs/nilearn-tutorial.pdf

See also: `datalad addurls` command and [uncurl special remote](http://docs.datalad.org/projects/next/en/latest/generated/datalad_next.annexremotes.uncurl.html)

### Recording command execution
Reformat the code and record that fact:

In [None]:
datalad run -m "Reformat code with black" \
 "black code/get_brainmask.py"

In [None]:
git show

With provenance info we can re-run (reproduce)!

In [None]:
datalad rerun

## Data consumption and dataset nesting

- install datasets from local or remote sources
- clone datasets ‘as is’ as standalone data packages, or link datasets into one another in superdataset-subdataset hierarchies (“nesting”)
- link datasets as modular units together, and maximize the potential for reuse

Get input data for our analysis by cloning some BIDS-structured data; register a subdataset under the name input:

In [None]:
datalad clone -d . \
 https://gin.g-node.org/adswa/bids-data \
 input

How linkage works:

In [None]:
git show
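
A subdataset is registered as a Git submodule: the superdataset stores an entry in `.gitmodules` together with the exact commit of the subdataset. The sketch below reads that registration with plain `git config`. The helper names are hypothetical, and the sketch assumes it is run from the root of the superdataset, where `.gitmodules` lives.

```shell
# Hypothetical helper: print the clone URL recorded for a subdataset.
# Usage: subdataset_url NAME   (for example: subdataset_url input)
subdataset_url() {
  git config --file .gitmodules --get "submodule.$1.url"
}

# Hypothetical helper: list the paths of all registered subdatasets.
# Note: paths containing spaces are not handled by this simple version.
subdataset_paths() {
  git config --file .gitmodules --get-regexp '^submodule\..*\.path$' | awk '{print $2}'
}
```

For the dataset cloned above, `subdataset_url input` should print the GIN URL that was passed to `datalad clone`.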

In [None]:
datalad tree --include-files input

## Data transport

- cloning worked fast
- we don't have *large* file content yet
- we `get` them (files, directories) on demand

In [None]:
datalad get input/sub-02

Free up space

In [None]:
datalad drop input/sub-02

## Git and git-annex

In [None]:
man --pager="head -n 5" git

https://git-annex.branchable.com/how_it_works/
> - With git-annex, git is instead "a stupid filename and metadata tracker".
> - The contents of annexed files are not stored in git, only the names of the files and some other metadata remain there.

![git and git annex](https://handbook.datalad.org/en/latest/_images/publishing_gitvsannex.svg)
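
One practical consequence can be checked from the shell. In git-annex's default (locked) mode on Unix-like systems, an annexed file is a symbolic link into `.git/annex/objects`; when the content has not been retrieved or has been dropped, the link points to nothing. The sketch below uses this to report whether content is available locally. `content_state` is a hypothetical helper name, and the approach does not apply to unlocked files or to adjusted branches (as used on Windows), where pointer files replace the links.

```shell
# Hypothetical helper: report whether a file's content is available locally.
# Assumes git-annex's default locked mode, where annexed files are symlinks.
# Usage: content_state PATH
content_state() {
  if [ -L "$1" ] && [ ! -e "$1" ]; then
    echo "absent"    # dangling symlink: content not present locally
  elif [ -e "$1" ]; then
    echo "present"   # regular file, or symlink with an existing target
  else
    echo "missing"   # no such path at all
  fi
}
```

Under these assumptions, a file under `input/sub-02` would be reported as `absent` after `datalad drop` and as `present` after `datalad get`.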

## Computational reproducibility

- `datalad run` / `datalad containers-run`
- identical syntax
  - `--input` / `--output`
  - `containers-run` additionally needs to know which container to use
- which of the two is used in the tutorial depends on whether Docker / Singularity is installed
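
To make the parallel concrete, here is a sketch of the same brain-mask computation expressed with `containers-run`. It is wrapped in a hypothetical function, `brain_mask_in_container`, so that nothing is executed until the function is called. It assumes that the datalad-container extension is installed, that a container was registered under the name `nilearn` with `datalad containers-add`, and that a container runtime is available.

```shell
# Sketch: the brain-mask computation executed inside the "nilearn" container.
# Assumes datalad-container is installed and a container named "nilearn"
# was registered beforehand with `datalad containers-add`.
brain_mask_in_container() {
  datalad containers-run -m "Compute brain mask" \
    --container-name nilearn \
    --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
    --output "figures/*" \
    --output "sub-02*" \
    "python code/get_brainmask.py"
}
```

Apart from the subcommand name and the `--container-name` option, the call mirrors the `datalad run` syntax: inputs are retrieved before execution and outputs are saved afterwards.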

In [None]:
datalad run -m "Compute brain mask" \
  --input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
  --output "figures/*" \
  --output "sub-02*" \
  "python code/get_brainmask.py"

See files that changed in last commit

In [None]:
git diff --name-only HEAD^1 HEAD

Ask how an individual file came to be:

In [None]:
git log sub-02_brain-mask.nii.gz

Reproduce a computation

In [None]:
datalad rerun

In [None]:
git log -n 2 --pretty=oneline

## Data publication

- create *siblings* with `create-sibling-*` commands
- siblings can hold Git part, git-annex part, or both
- can interface with many locations (GitHub, GitLab, AWS S3, Nextcloud, ...)

We will use G-Node [GIN](https://gin.g-node.org/)

Note: this section depends on additional config (not shown):
- token
- ssh key (best described in [GitHub docs](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/about-ssh))

See the [Third party infrastructure](https://handbook.datalad.org/en/latest/basics/basics-thirdparty.html#chapter-thirdparty) chapter of the DataLad Handbook for a [GIN walkthrough](https://handbook.datalad.org/en/latest/basics/101-139-gin.html).

In [None]:
datalad create-sibling-gin \
  --name gin \
  --existing skip \
  incf01

In [None]:
datalad push --to gin

## Cleaning up

- The demo wouldn’t have the term “data management” in its title if we were to leave clutter in your home directory
- if you `rm` a file and save the deletion, the file can be brought back to life easily
- and an `rm -rf` on a dataset with annexed files will cause an explosion of permission errors
- `datalad drop` command is versatile

Remove local copies of *large* files

In [None]:
datalad drop input/sub-02

Uninstall datasets

In [None]:
datalad drop --what datasets input

In [None]:
datalad tree

DataLad's `get` and `drop` are counterparts:

In [None]:
datalad get --no-data input

Drop has protections built in. By default, it requires that at least one other copy of the file content exists elsewhere.

In [None]:
git annex whereis figures/sub-02_brainmask.png

In [None]:
datalad -c "annex.numcopies=2" drop figures/sub-02_brainmask.png

Restrictions can be bypassed:

In [None]:
datalad -c "annex.numcopies=2" drop --reckless availability figures/sub-02_brainmask.png

To remove a dataset entirely, without any checks

In [None]:
# cd ..
# datalad drop --what all --reckless kill --recursive my-analysis