# Data Management for Neuroimaging with DataLad
Alternative title: A practical introduction to DataLad

Welcome to this introduction to DataLad at INCF Neuroinformatics 2023.

## Introduction and setup

To run the notebook, you need **DataLad** and its external dependencies, **Git** and **git-annex**, installed. Several installation methods are available; they are covered in the [DataLad Handbook](https://handbook.datalad.org/en/latest/intro/installation.html#install) and on [DataLad's website](https://www.datalad.org/#install). Additionally, you need two DataLad extensions: **datalad-next** and **datalad-container** (both available as Python packages through pip). For our examples, we will also need...

We strongly recommend running the tutorial in a virtual environment (either conda or virtualenv). Many users find conda the easiest option, because DataLad, Git, and git-annex can all be installed with conda. Alternatively, you can use virtualenv (virtualenvwrapper), install DataLad with pip, and rely on a system-wide installation of Git and git-annex.

Note to Windows users: this notebook uses a bash kernel. Although the workflow would, in principle, be the same on Windows, differences between bash, cmd, and PowerShell mean that the cells in this notebook will not run unchanged in the native Windows shells.

If you are unsure about your version of DataLad, you can check it using the following command:

In [None]:
datalad --version
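
Because Git and git-annex are external dependencies, it can also help to confirm that all three executables are on your `PATH`. The following is a minimal sketch; the helper name `check_tool` is made up for this example and is not part of DataLad.

```shell
# Report whether an executable can be found on the PATH.
# `check_tool` is a hypothetical helper name used only in this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# DataLad and its two external dependencies, as listed above.
for tool in datalad git git-annex; do
    check_tool "$tool"
done
```

Any tool reported as missing needs to be installed (or the right environment activated) before continuing.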

### Git identity
If you are unsure whether you have already configured your Git identity, check whether your name and email are printed to the terminal when you run:

In [None]:
git config --get user.name
git config --get user.email

If nothing is returned, you need to configure your Git identity.

In [None]:
# git config --global --add user.name "Bob McBobface"
# git config --global --add user.email "bobmcbobface@uw.edu"
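
The check and the configuration step can be combined, so that an existing identity is never overwritten. The sketch below assumes a hypothetical helper named `set_git_config_if_unset`; it only uses `git config --get` and `git config --global`, and the calls are left commented out, as in the cell above, so that nothing changes until you fill in your own details.

```shell
# Set a global Git config key only when Git reports no value for it.
# `set_git_config_if_unset` is a hypothetical helper name for this sketch.
set_git_config_if_unset() {
    key=$1
    value=$2
    if git config --get "$key" >/dev/null 2>&1; then
        echo "$key is already set"
    else
        git config --global "$key" "$value"
        echo "$key set to $value"
    fi
}

# Uncomment and replace the placeholder values with your own details.
# set_git_config_if_unset user.name "Bob McBobface"
# set_git_config_if_unset user.email "bobmcbobface@uw.edu"
```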

## Using DataLad

DataLad is a command line tool with a Python API. It also has a GUI for basic commands ([datalad-gooey](http://docs.datalad.org/projects/gooey/en/latest/?badge=latest)).

We'll operate it through the command line, but this is how you would do the same with the Python API:

``` python
import datalad.api as dl
dl.create(path='mydataset')
```

In scripts using other programming languages, DataLad commands can be invoked via system calls. Here is an example with R:

``` R
system("datalad create mydataset")
```

On the command line, typical usage consists of the main `datalad` command, optionally parametrized with additional flags, followed by a subcommand and its own optional flags.

![image](https://handbook.datalad.org/en/latest/_images/command-structure.png)
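
To make the structure concrete, the sketch below keeps the three parts in separate variables before assembling them. The choice of `--log-level warning` as the main-command option is an assumption made for illustration; the subcommand and its `-S system` option are the ones used in this notebook.

```shell
# The three parts of a DataLad call, kept separate for illustration.
main_options="--log-level warning"   # options of the main `datalad` command
subcommand="wtf"                     # the subcommand to run
sub_options="-S system"              # options of the subcommand

# Print the assembled command line.
echo "datalad $main_options $subcommand $sub_options"

# Run it only where DataLad is installed; the unquoted variables are
# split into separate words on purpose.
if command -v datalad >/dev/null 2>&1; then
    datalad $main_options $subcommand $sub_options
fi
```

Options placed before the subcommand apply to `datalad` as a whole, while options placed after it belong to the subcommand.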

In [None]:
# print some information about the system to the terminal
datalad wtf -S system

In [None]:
# display short help
datalad wtf -h

In [None]:
datalad wtf --help

In [None]:
# Technical section: remove the dataset created in the previous run
if [ -f ./my-analysis/.datalad/config ]; then datalad drop --what all --reckless kill --recursive --dataset my-analysis; fi

## DataLad datasets
Everything happens in or involves DataLad datasets, DataLad’s core data structure. `datalad create` only needs a name; it then creates a new directory under this name and instructs DataLad to manage it.

In [None]:
datalad create my-analysis
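
One simple way to confirm that the directory is now a dataset is to look for its `.datalad/config` file, the same marker that the cleanup cell above tests for. This is a minimal sketch; `is_datalad_dataset` is a made-up helper name, and checking for this one file is a heuristic rather than an official DataLad check.

```shell
# Succeed if the given directory looks like a DataLad dataset, judged by
# the presence of its .datalad/config file.
# `is_datalad_dataset` is a hypothetical helper name for this sketch.
is_datalad_dataset() {
    [ -f "$1/.datalad/config" ]
}

if is_datalad_dataset my-analysis; then
    echo "my-analysis is a DataLad dataset"
else
    echo "my-analysis is not a DataLad dataset (yet)"
fi
```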

In [None]:
cd my-analysis

DataLad uses two mechanisms to manage files: Git and git-annex. In this tutorial, we explicitly specify which files we do not want to hand over to git-annex. We'll use some bash-fu to add the following two configuration lines to the `.gitattributes` file if they're not already present (normally we'd edit the file in a text editor).
```
README.md annex.largefiles=nothing
code/* annex.largefiles=nothing
```
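
The "add only if not already present" step can be written as a small reusable function. The sketch below assumes a hypothetical helper named `append_if_missing`; unlike a substring search, it uses `grep -x` to compare whole lines, so a line is skipped only when an identical one already exists.

```shell
# Append a line to a file unless an identical line is already there.
# `append_if_missing` is a hypothetical helper name for this sketch.
append_if_missing() {
    line=$1
    file=$2
    # -x matches whole lines only, -F treats the text literally,
    # -q suppresses output; a missing file counts as "not present".
    grep -qxF -e "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Usage inside the dataset (commented out so the sketch has no side effects):
# append_if_missing "README.md annex.largefiles=nothing" .gitattributes
# append_if_missing "code/* annex.largefiles=nothing" .gitattributes
```

Because the function is idempotent, re-running the notebook does not duplicate the configuration lines.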

In [None]:
grep -qF 'README.md' .gitattributes || echo "README.md annex.largefiles=nothing" >> .gitattributes
grep -qF 'code' .gitattributes || echo "code/* annex.largefiles=nothing" >> .gitattributes

In [None]:
cat .gitattributes

In [None]:
git status