# Version Control of Data with DataLad

## Contents

0. [Introduction & setup](#0)
1. [Creating a DataLad dataset](#1)
2. [Version control workflows](#2)
3. [Dataset consumption](#3)
4. [Dataset nesting](#4)
5. [More on data versioning, nesting, and a glimpse into a reproducible paper](#5)
6. [Full provenance capture and reproducibility](#6)

## About this notebook

You can execute all the code cells below in this Jupyter Notebook running in [MyBinder](https://mybinder.org/).
Alternatively, if you have DataLad and its dependencies installed, you can copy and paste the code into your computer's command line to follow along.

## Acknowledgements

This tutorial was initially created by Adina Wagner for the 2020 OHBM Brainhack Traintrack session on DataLad.
This notebook accompanies [this tutorial video](https://www.youtube.com/watch?v=L5A0MXqFrOY) by Adina Wagner.

<div id="0"></div>

## 0. Introduction & setup

[DataLad](https://www.datalad.org/) is a data management multitool that can assist you in handling the entire life cycle of digital objects.
It is a command-line tool, free and open source, and available for all major operating systems.
In the command line, all operations begin with the general `datalad` command.

In [None]:
# datalad

For example, I can type `datalad --help` to find out more about the available commands.

You can find more details about how to [install DataLad and its dependencies on all operating systems in the DataLad Handbook](https://handbook.datalad.org/en/latest/intro/installation.html).
This section also explains how to install DataLad on shared machines where you may not have administrative privileges (`sudo` rights), such as high-performance computing clusters.
If you already have DataLad installed, **make sure that it is a recent version**.
You can check the installed version using the `datalad --version` command:

In [None]:
datalad --version

### Configuring your Git identity

The first step, if you haven't done so already, is to configure your [Git](https://git-scm.com/) identity.
If you're new to Git, don't worry!
This configuration simply involves setting your name and email address, which will associate your changes in a project with you as the author.

Below, we provide instructions on how to configure your Git identity.
However, please note that a central Git configuration has already been set up for you during the `postBuild` process of Binder, so you don't need to perform this step manually.
That's why the code example below is commented out.

In [None]:
# git config --global --add user.name "Your name"
# git config --global --add user.email "Your email address"
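
Outside of Binder, it can be handy to confirm which identity Git will record. The following is only a sketch: it stores a made-up name and email address in a throwaway repository (repository-level configuration rather than `--global`, so your real settings stay untouched) and reads the values back with `git config --get`.

```shell
# Create a throwaway repository so that your real Git configuration is not touched.
demo=$(mktemp -d)
git init -q "$demo"

# Hypothetical identity, stored only in this repository's own configuration.
git -C "$demo" config user.name "Jane Doe"
git -C "$demo" config user.email "jane@example.com"

# Read the values back; this is the identity Git would attach to commits made here.
configured_name=$(git -C "$demo" config --get user.name)
configured_email=$(git -C "$demo" config --get user.email)
echo "$configured_name <$configured_email>"
```

To inspect the global identity instead, `git config --global --get user.name` prints the stored value, or nothing if it has not been set.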

<div id="1"></div>

## 1. Creating a DataLad dataset

Every command in DataLad affects or uses DataLad *datasets*, the core data structure of DataLad.
A dataset is a directory on a computer that DataLad manages.
You can create new, empty datasets from scratch and populate them, or transform existing directories into datasets.

<img src="https://handbook.datalad.org/en/latest/_images/dataset.svg" style="width: 400px;"/>

Let's start by creating a new DataLad dataset.
Creating a new dataset is accomplished using the `datalad create` command.
This command requires only a name for the dataset.
It will then create a new directory with that name and instruct DataLad to manage it.

Additionally, the command includes an option, `-c text2git`.
The `-c` option allows for specific configurations of the dataset at the time of creation.
You can find detailed information about the `text2git` configuration in the DataLad handbook, specifically in the [sections on configurations and procedures](https://handbook.datalad.org/en/latest/basics/basics-configuration.html).
Don't worry about configurations for now.
In general, this configuration serves as a very useful standard for datasets.

In [None]:
datalad create -c text2git DataLad-101

Right after dataset creation, there is a new directory on the computer called `DataLad-101`.
Let's navigate into this directory using the `cd` command and list the directory contents using `ls`.

In [None]:
cd DataLad-101

In [None]:
ls # ls does not show any output, because the dataset is empty

Datasets have the exciting feature of recording everything done within them.
They provide version control for all content managed by DataLad, regardless of its size.
Additionally, datasets maintain a complete history that you can interact with.
This history is already present, although it is quite short at this point in time.
Let’s take a look at it nonetheless.
This history exists thanks to Git.
You can access the history of a dataset using any tool that displays Git history.
For simplicity, we will use Git's built-in `git log` command.

In [None]:
git log

<div id="2"></div>

## 2. Version control workflows

Building on top of [Git](https://git-scm.com/) and [git-annex](https://git-annex.branchable.com/), DataLad allows you to version control arbitrarily large files in datasets.
You can keep track of revisions of data of any size, and view, interact with or restore any version of your dataset’s history.

<img src="https://handbook.datalad.org/en/latest/_images/local_wf.svg" style="width: 400px;"/>

Let's start by creating a `books` directory using the `mkdir` command.
Next, we will download two books from the internet.
Here, we are using the command line tool `curl` to accomplish this, allowing us to perform all actions from the command line.
However, if you prefer, you can also download the books manually and save them into the dataset using a file manager.
Remember, a dataset is simply a directory on your computer.

In [None]:
mkdir books

In [None]:
cd books

In [None]:
curl -L -o TLCL.pdf https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download
curl -L -o byte-of-python.pdf https://edisciplinas.usp.br/pluginfile.php/3252353/mod_resource/content/1/b_Swaroop_Byte_of_python.pdf

Let's navigate back to the dataset root (the `DataLad-101` folder) and run the `tree` command, which visualizes the directory hierarchy:

In [None]:
cd ../

In [None]:
tree

Use the `datalad status` command to find out what has happened in the dataset.
This command is very helpful as it reports on the current state of your dataset.
Any new or changed content will be highlighted.
If nothing has changed, the `datalad status` command will report what is known as a clean dataset state.
In general, it is very useful to maintain a clean dataset state.
If you know Git, you can think of `datalad status` as the `git status` of DataLad.

In [None]:
datalad status

Any content that we want DataLad to manage needs to be explicitly added to DataLad.
It is not enough to simply place it inside the dataset.
To give new or changed content to DataLad, we need to save it using `datalad save`.
This is the first time we need to specify a "commit message", which is done using the `-m` option of the command.
A "commit" is a snapshot of your project's files at a specific point in time.
The commit message is a short description of the changes that are being saved in a DataLad dataset.

In [None]:
datalad save -m "Add books on Python and Unix to read later"

With `git log -n 1` you can take a look at the most recent commit in the history:

In [None]:
git log -n 1

Without a path argument, `datalad save` saves all untracked and modified content in the dataset.
Sometimes, this can be inconvenient.
One significant advantage of a dataset's history is that it allows you to revert changes you are not happy with.
However, this is only easily possible at the level of single commits.
If one save commits several unrelated files or changes, it can be difficult to disentangle them if you ever want to revert some of those changes.
To address this, you can provide a path to the specific file you want to save, allowing you to specify more precisely what will be saved together.

Let's demonstrate this by adding another book from the internet:

In [None]:
cd books

In [None]:
curl -L -o progit.pdf https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf

In [None]:
cd ../

In [None]:
datalad status

Now when you run `datalad save`, attach a path to the command:

In [None]:
datalad save -m "Add reference book about Git" books/progit.pdf

Let's take a look at files that are frequently modified, such as code or text.
To demonstrate this, we will create a file and modify it.
We will use a [here doc](https://en.wikipedia.org/wiki/Here_document) for this, but you can also write the note using an editor of your choice.
If you execute this code snippet, make sure to copy and paste everything, starting with `cat` and ending with the second `EOT`.

In [None]:
cat << EOT > notes.txt
A DataLad dataset can be created with "datalad create PATH".
The dataset is created empty.

EOT
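
One detail of here docs matters as soon as the text contains shell syntax: with an unquoted delimiter such as `EOT`, the shell expands variables and command substitutions in the body before writing it, whereas a quoted delimiter (`'EOT'`) writes the body exactly as typed. The sketch below illustrates the difference. The variable and file names are made up, and the files are written to a temporary directory so that the dataset stays untouched.

```shell
# Work in a temporary directory; nothing here ends up in DataLad-101.
demo=$(mktemp -d)
name="expanded"

# Unquoted delimiter: the shell substitutes $name before writing the file.
cat << EOT > "$demo/unquoted.txt"
value is $name
EOT

# Quoted delimiter: the body is written literally, including the dollar sign.
cat << 'EOT' > "$demo/quoted.txt"
value is $name
EOT

cat "$demo/unquoted.txt" "$demo/quoted.txt"
```

The notes written in this tutorial contain no `$` characters or backticks, so the unquoted form is sufficient for them.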

`datalad status` will, as expected, report a new untracked file in the dataset:

In [None]:
datalad status

We can save the newly created `notes.txt` file with the `datalad save` command and a helpful commit message.
As this is the only change in the dataset, there is no need to provide a path:

In [None]:
datalad save -m "Add notes on datalad create"

Let’s now modify this file by adding another note:

In [None]:
cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file (modifications) to history.
Note to self: Always use informative and concise commit messages.

EOT

`datalad status` no longer reports the file as untracked.
Instead, because its content differs from the last saved state, it is reported as modified.

In [None]:
datalad status

Let’s save this:

In [None]:
datalad save -m "Add notes on datalad save"

If you take a look at the history with `git log`, the two most recent commits neatly summarize the changes made to this file:

In [None]:
git log -n 2
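
`git log -n 2` shows the two most recent commits of the whole dataset. To restrict the history to a single file, Git accepts a path after `--`. The sketch below demonstrates this in a throwaway repository created with plain Git; the file names, commit messages, and identity are made up for illustration.

```shell
# Build a throwaway Git repository so the example does not touch DataLad-101.
demo=$(mktemp -d)
git init -q "$demo"

# Helper: stage everything and commit it with a made-up identity.
commit_all() {
    git -C "$demo" add -A
    git -C "$demo" -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"
}

echo "first note" > "$demo/notes.txt"
commit_all "Add notes"
echo "unrelated content" > "$demo/other.txt"
commit_all "Add unrelated file"
echo "second note" >> "$demo/notes.txt"
commit_all "Extend notes"

# Only commits that touched notes.txt are listed; the unrelated commit is left out.
git -C "$demo" log --oneline -- notes.txt
notes_commits=$(git -C "$demo" log --oneline -- notes.txt | wc -l | tr -d ' ')
```

Inside `DataLad-101`, the equivalent call would be `git log --oneline -- notes.txt`.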

<div id="3"></div>

## 3. Dataset consumption

DataLad lets you consume datasets provided by others, and collaborate with them.
You can install existing datasets and update them from their sources, or create sibling datasets that you can publish updates to and pull updates from for collaboration and data sharing.

<img src="https://handbook.datalad.org/en/latest/_images/collaboration.svg" style="width: 600px;"/>

To demonstrate this, let's first create a new subdirectory to keep things organized:

In [None]:
mkdir recordings

Afterwards, let's install an existing dataset, which can be done from either a path or a URL.
The dataset we want to install in this example is hosted on GitHub, so we will provide its URL to the `datalad clone` command.
We will also specify a path where we want it to be installed.
Importantly, we are installing this dataset as a subdataset of `DataLad-101`, which means we will *nest* the two datasets inside each other.
This is accomplished using the `--dataset` flag.

In [None]:
datalad clone --dataset . https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow

There are new directories in the `DataLad-101` dataset.
Within these new directories, there are hundreds of MP3 files.

In [None]:
tree -d # we limit the output of the tree command to directories

Let's move into one of these directories and take a look at its contents:

In [None]:
cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking

In [None]:
ls

### Have access to more data than you have disk space: `get` and `drop`

Here is a crucial and incredibly handy feature of DataLad datasets:
After cloning, the dataset contains small files, such as the README, but larger files do not have any content yet.
The clone only retrieved what we can simplistically refer to as file availability metadata, which is displayed as the file hierarchy in the dataset.
While we can read the file names and determine what the dataset contains, we don’t have access to the file contents *yet*.
If we were to try to play one of the recordings using the Python `Audio` functionality, this would fail.

In [None]:
# Paths are relative to the current directory, Long_Now__Seminars_About_Long_term_Thinking
# vlc 2003_11_15__Brian_Eno__The_Long_Now.mp3
# from IPython.display import Audio
# Audio(filename="2003_11_15__Brian_Eno__The_Long_Now.mp3", autoplay=True)
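
On file systems that support symbolic links, git-annex represents an annexed file as a symlink pointing into `.git/annex/objects`, and as long as the content has not been retrieved, the link target does not exist. The sketch below imitates this situation with a hand-made dangling symlink in a temporary directory (the file name and link target are made up) and shows how the shell can tell the two states apart.

```shell
# Work in a temporary directory; the real dataset is not modified.
demo=$(mktemp -d)

# Imitate an annexed file without content: a symlink whose target is missing.
ln -s .git/annex/objects/placeholder-key "$demo/episode.mp3"

# -L is true for any symlink, -e only if the link target actually exists.
if [ -L "$demo/episode.mp3" ] && [ ! -e "$demo/episode.mp3" ]; then
    state="content missing"
else
    state="content present"
fi
echo "$state"
```

Applied to one of the MP3 files in the `longnow` clone, the same test indicates whether the content still has to be retrieved.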

This might seem like curious behavior, but there are many advantages to it.
One advantage is speed, and another is reduced disk usage.
Here is the total size of this dataset:

In [None]:
cd ../

In [None]:
du -sh  # Unix command to show size of contents

It is tiny!
However, we can also find out how large the dataset would be if we had all of its contents by using `datalad status` with the `--annex` flag.
In total, there are more than 15 GB of podcasts that you now have access to.

In [None]:
datalad status --annex

You can retrieve individual files, groups of files, directories, or entire datasets using the `datalad get` command.
This command fetches the content for you.

In [None]:
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

Content that is already present is not re-retrieved: in the following call, only the two new files are downloaded.

In [None]:
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

If you no longer need the data locally, you can drop the content from your dataset to save disk space.

In [None]:
datalad drop Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3

Afterwards, as long as DataLad knows where a file came from, its content can be retrieved again.

In [None]:
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3

<div id="4"></div>

## 4. Dataset nesting

Datasets can contain other datasets (subdatasets), nested arbitrarily deep.
Each dataset has an independent revision history, but can be registered at a precise version in higher-level datasets.
This allows you to combine datasets and to perform commands recursively across a hierarchy of datasets, and it is the basis for advanced provenance capture abilities.

<img src="https://handbook.datalad.org/en/latest/_images/linkage_subds.svg" style="width: 600px;"/>

Let’s take a look at the history of the `longnow` subdataset.
We can see that it has preserved its history completely.
This means that the data we retrieved retains all of its provenance.

In [None]:
git log --reverse

How does this look in the top-level dataset?
If we query the history of `DataLad-101`, there will be no commits related to MP3 files or any of the commits we have seen in the subdataset.
Instead, we can see that the superdataset recorded the `recordings/longnow` dataset as a subdataset.
This means it recorded where this dataset came from and what version it is in.

In [None]:
cd ../../

In [None]:
git log -n 1

The subproject commit registered the most recent commit of the subdataset, and thus
the subdataset version:

In [None]:
cd recordings/longnow

In [None]:
git log --oneline

In [None]:
cd ../../
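
To see what "registered at a precise version" means at the Git level, the sketch below imitates the situation with two throwaway repositories. It uses Git plumbing to record a commit of one repository as a subproject entry (mode `160000`) in the other. This is not how DataLad performs the registration (DataLad builds on Git's submodule mechanism); it only shows the kind of entry that ends up in the higher-level repository. All names and the identity are made up.

```shell
# Two throwaway repositories standing in for a subdataset and a superdataset.
sub=$(mktemp -d)
super=$(mktemp -d)
ident="-c user.name=Demo -c user.email=demo@example.com"

# Give the "subdataset" one commit and remember its hash.
git init -q "$sub"
echo "data" > "$sub/file.txt"
git -C "$sub" add file.txt
git -C "$sub" $ident commit -q -m "Add data"
sub_commit=$(git -C "$sub" rev-parse HEAD)

# Record that exact commit as a subproject entry at the path "sub".
git init -q "$super"
git -C "$super" update-index --add --cacheinfo 160000,"$sub_commit",sub
git -C "$super" $ident commit -q -m "Register subproject"

# The tree entry has the form: 160000 commit <hash> <path>
git -C "$super" ls-tree HEAD sub
registered=$(git -C "$super" ls-tree HEAD sub | awk '{print $3}')
```

In `DataLad-101`, the output of `git ls-tree HEAD recordings/longnow` can be compared in the same way with `git -C recordings/longnow rev-parse HEAD`.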

<div id="5"></div>

## 5. More on data versioning, nesting, and a glimpse into a reproducible paper

We'll clone a repository for a [paper](https://doi.org/10.3758/s13428-020-01428-x) that [shares manuscript, code, and data](https://github.com/psychoinformatics-de/paper-remodnav/):

In [None]:
cd ../

In [None]:
datalad clone https://github.com/psychoinformatics-de/paper-remodnav.git

The top-level dataset has many subdatasets.
One of them, `remodnav`, is a dataset that contains the source code for a Python package called `remodnav`, which is used in eye-tracking analyses:

In [None]:
cd paper-remodnav

In [None]:
datalad subdatasets

After cloning a dataset, its subdatasets will be recognized, but just as content is not yet retrieved for files in datasets, the subdatasets of datasets are not yet installed.
If we navigate into an uninstalled subdataset, it will appear as an empty directory.

In [None]:
cd remodnav

In [None]:
ls

In order to install a subdataset, we use `datalad get` with the `--recursive` flag:

In [None]:
datalad get --recursive --recursion-limit 2 -n .

In [None]:
ls

This command not only retrieves file contents, but it also installs subdatasets.
So, if you want to be really lazy, just run `datalad get --recursive -n .` in the root of a dataset to install all available subdatasets.
The `-n` option prevents `get` from downloading any file content, so only the subdatasets are installed.
Here, the depth of recursion is limited for two reasons.
For one, it would take a while to install all subdatasets.
For another, the raw eye-tracking dataset contains subject IDs that should not be shared, so this subdataset is not publicly available.
If you try to install all subdatasets, the attempt to install the source eye-tracking data will therefore throw an error.

Afterwards, you can see that the `remodnav` subdataset also contains further subdatasets.
In this case, these subdatasets contain data used for testing and validating software performance.

In [None]:
datalad subdatasets

One of the validation data subdatasets came from another lab that shared their data.
After the researchers were almost finished with their paper, they found another paper that reported a mistake in this data.
The mistake was still present in the data they were using.
By inspecting the history of this dataset, you can see that at one point, they contributed a fix that changed the data.

In [None]:
cd remodnav/tests/data/anderson_etal

In [None]:
git log -n 3

Because DataLad can link subdatasets to precise versions, it is possible to consciously decide and openly record which version of the data is used.
It is also possible to test how much results change by resetting the subdataset to an earlier state or updating the dataset to a more recent version.

<div id="6"></div>

## 6. Full provenance capture and reproducibility

DataLad allows you to capture full provenance (i.e., a record that describes the entities and processes that were involved in producing or influencing a digital resource):
the origin of datasets, the origin of files obtained from web sources, and complete machine-readable and automatically reproducible records of how files were created (including software environments).
You or your collaborators can thus reobtain or reproducibly recompute content with a single command, and make use of extensive provenance of dataset content (who created it, when, and how?).

<img src="https://handbook.datalad.org/en/latest/_images/reproducible_execution.svg" style="width: 600px;"/>

In [None]:
cd ../../../../

First, create a new dataset, in this case with the `yoda` configuration:

In [None]:
datalad create -c yoda myanalysis

This sets up a helpful structure for my dataset with a code directory and some README files,
and applies helpful configurations:

In [None]:
cd myanalysis

In [None]:
tree

Read more about the YODA principles and the YODA configuration in the [section on YODA](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html) in the DataLad Handbook.

Next, install the input data as a subdataset.
For this, the DataLad developers created a [DataLad dataset](https://github.com/datalad-handbook/iris_data.git) with the ["iris" data](https://en.wikipedia.org/wiki/Iris_flower_data_set) and published it on GitHub.
Here, we're installing it into a directory named `input`.

In [None]:
datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/

The last thing needed is code to run on the data and produce results.
For this, here is a k-nearest-neighbours classification analysis script written in Python.
You can find more details about this analysis in the [section on YODA-compliant data analysis projects](https://handbook.datalad.org/en/latest/basics/101-130-yodaproject.html).

In [None]:
cat << EOT > code/script.py

import pandas as pd
import seaborn as sns
import datalad.api as dl
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = "input/iris.csv"

# make sure that the data are obtained (get will also install linked sub-ds!):
dl.get(data)

# prepare the data as a pandas dataframe
df = pd.read_csv(data)
attributes = ["sepal_length", "sepal_width", "petal_length","petal_width", "class"]
df.columns = attributes

# create a pairplot to plot pairwise relationships in the dataset
plot = sns.pairplot(df, hue='class', palette='muted')
plot.savefig('pairwise_relationships.png')

# perform a K-nearest-neighbours classification with scikit-learn
# Step 1: split data in test and training dataset (20:80)
array = df.values
X = array[:,0:4]
Y = array[:,4]
test_size = 0.20
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y,
                                                                    test_size=test_size,
                                                                    random_state=seed)
# Step 2: Fit the model and make predictions on the test dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_test)

# Step 3: Save the classification report
report = classification_report(Y_test, predictions, output_dict=True)
pd.DataFrame(report).transpose().to_csv('prediction_report.csv')

EOT

So far the script is untracked:

In [None]:
datalad status

Let's save it with a `datalad save` command:

In [None]:
datalad save -m "Add script for kNN classification and plotting" code/script.py

### `datalad run`

The challenge that DataLad helps accomplish is running this script in a way that links the script to the results it produces and the data it was computed from.
We can do this with the `datalad run` command.
In principle, it is simple.
You start with a clean dataset:

In [None]:
datalad status

Then, give the command you would execute with `datalad run`, in this case `python code/script.py`.
DataLad will take the command, run it, and save all of the changes in the dataset under the commit message specified with the `-m` option.
Thus, it associates the script with the results.  

But it can be even more helpful.
Here, we also specify the input data that the command needs, and DataLad will retrieve the data beforehand.
We also specify the output of the command.
Specifying the outputs will allow us to rerun the command later and update any outdated results.

In [None]:
datalad run -m "Analyze iris data with classification analysis" \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"

In [None]:
datalad save pairwise_relationships.png

Let's take a look at the output:

<img src="../myanalysis/pairwise_relationships.png" style="width: 600px;"/>

DataLad creates a commit in the dataset history.
This commit includes my commit message as a human-readable summary of what was done.
It contains the produced output, and it has a machine-readable record that includes information on the input data, the results, and the command that was run to create this result.

In [None]:
git log -n 1
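
The machine-readable record is stored as a JSON block inside the commit message, framed by two marker lines. The sketch below extracts the recorded command from such a message using `sed`. The sample message is hand-written to resemble what `git log` shows after the `datalad run` call above; its exact layout is an assumption and the `dsid` value is a placeholder.

```shell
# Hand-written sample of a run commit message (layout assumed, dsid is a placeholder).
demo=$(mktemp -d)
cat << 'EOT' > "$demo/message.txt"
[DATALAD RUNCMD] Analyze iris data with classification analysis

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 code/script.py",
 "dsid": "00000000-0000-0000-0000-000000000000",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [
  "input/iris.csv"
 ],
 "outputs": [
  "prediction_report.csv",
  "pairwise_relationships.png"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOT

# Keep the lines between the two markers, then pull out the value of "cmd".
recorded_cmd=$(sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' "$demo/message.txt" \
    | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p')
echo "$recorded_cmd"
```

In the dataset itself, the same filter could be fed from `git log -n 1 --format=%B` instead of the sample file.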

### `datalad rerun`

This machine-readable record is particularly helpful, because we can now instruct DataLad to *rerun* this command.
This means we don't have to memorize what we did, and people that we share the dataset with don’t need to ask how this result was produced.
They can simply let DataLad tell them.

This is accomplished with the `datalad rerun` command.  
For this demonstration, Adina Wagner has prepared this analysis dataset and published it to GitHub at [github.com/adswa/myanalysis](https://github.com/adswa/myanalysis).

In [None]:
cd ../

In [None]:
git clone https://github.com/adswa/myanalysis.git analysis_clone

In [None]:
cd analysis_clone

Having cloned this repository, we can provide, for example, the commit hash of the `datalad run` commit to the `datalad rerun` command.
DataLad will read the machine-readable record of what was done and recompute the exact same thing.

In [None]:
datalad rerun 71cb8c5

This allows others to easily rerun my computations.
It also spares you the need to remember how you executed a script, and you can inquire about where the results came from.

In [None]:
git log pairwise_relationships.png

**Done! Thanks for coding along!**