# Working with DataLad Datasets

Over the past decade, the number of published neuroscience datasets and open science repositories has grown substantially. However, this open science ecosystem can be challenging to manage when repositories use different backends to share data. DataLad helps by providing a unified interface for managing data across many different services. In this notebook, we are going to download data from [OpenNeuro](https://openneuro.org), a platform for hosting neuroimaging datasets that uses DataLad on the backend. Execute the cell below to clone the dataset into the current directory. It will be stored in a folder called `ds004408/`.

**Note for Windows**: You will get a message that says "detected a crippled filesystem". Don't worry, this does not mean that anything is wrong with your computer. It just means that git-annex works slightly differently on Windows (more on this later).

In [None]:
!datalad clone https://github.com/OpenNeuroDatasets/ds004408.git

In the following section we will explore the content of this DataLad dataset and learn how to access and modify it. This will give you a deeper understanding of how DataLad works and equip you with the tools to reuse publicly available datasets.

## Understanding the Structure of a Dataset

### Background

DataLad is primarily used through the terminal, so it makes sense to explore a DataLad dataset with terminal commands like `ls` (Linux/macOS) or `dir` (Windows). In VSCode you can open the terminal via the menu bar by clicking **View > Terminal** or by pressing the **Ctrl+`** keyboard shortcut, and execute these commands there. On Windows you can also open a terminal from the file explorer: **Open Explorer** > **Navigate to the folder** > **Right Click** > **Open in Terminal**. If you prefer the Linux terminal commands over their Windows alternatives, you can use **Git Bash** as your terminal, which comes with the Git installation on Windows.

Alternatively, you can execute terminal commands in the code cells of this Jupyter notebook by prefacing them with `!`. The `!` prefix runs any shell command as an independent subprocess. Because such a subprocess can't modify the state of the notebook, a directory change made with `!cd` would be lost as soon as the command finishes. For this reason there is a special prefix for the `cd` (change directory) command: `%cd`. It changes the working directory of the notebook itself, so the change persists across cells.
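The reason a plain subprocess cannot change the notebook's working directory can be illustrated in pure Python. The following is a minimal sketch, not part of the original notebook: the helper name `cwd_after_child_chdir` is made up for illustration, and the system's temporary directory is used only as a convenient target.

```python
import os
import subprocess
import sys
import tempfile


def cwd_after_child_chdir(target):
    """Run a child process that changes its own working directory to `target`.

    Returns (parent_cwd_before, child_cwd, parent_cwd_after).
    """
    before = os.getcwd()
    # The child changes directory and reports where it ended up.
    child_code = "import os, sys; os.chdir(sys.argv[1]); print(os.getcwd())"
    result = subprocess.run(
        [sys.executable, "-c", child_code, target],
        capture_output=True,
        text=True,
        check=True,
    )
    after = os.getcwd()
    return before, result.stdout.strip(), after


before, child, after = cwd_after_child_chdir(tempfile.gettempdir())
print("parent before:", before)
print("child after its chdir:", child)
print("parent after:", after)  # unchanged: the child's chdir did not persist
```

The parent's directory is the same before and after the call, which mirrors what happens with `!cd`: only the short-lived child process moves.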

### Exercises

In the following exercises, we are going to explore the dataset we cloned at the beginning of the notebook. You can do this in the terminal or in the notebook using the `!` and `%` operators, or try both - however you prefer! Here are the commands you need to know:

| Linux/macOS | Windows | Description |
| --- | --- | --- |
| `ls` | `dir` | List the content of the current directory |
| `ls -a` | `dir /a` | List the content of the current directory (including hidden files) |
| `ls -a data` | `dir /a data` | List the content of the `data` directory |
| `cd code/` | `cd code/` | Move to the `code/` directory |
| `cd ..` | `cd ..` | Move to the parent of the current directory |
| `cat file.txt` | `type file.txt` | Display the content of `file.txt` |
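If you would rather not switch between `ls` and `dir`, the same listing and display operations can be approximated in Python, which behaves the same on all platforms. This is a small sketch under stated assumptions: `list_dir` and `read_text_file` are hypothetical helper names, and "hidden" here only means "name starts with a dot", so the separate hidden-file attribute on Windows is not checked. The demo runs on a throwaway temporary folder rather than on the dataset.

```python
import tempfile
from pathlib import Path


def list_dir(path=".", include_hidden=False):
    """Return the sorted names of the entries in `path`.

    Names starting with a dot are treated as hidden (Linux/macOS convention).
    """
    names = sorted(entry.name for entry in Path(path).iterdir())
    if not include_hidden:
        names = [name for name in names if not name.startswith(".")]
    return names


def read_text_file(path):
    """Return the content of a text file, similar to `cat` or `type`."""
    return Path(path).read_text(encoding="utf-8")


# Demo on a temporary folder with one visible and one hidden file.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "README").write_text("hello\n", encoding="utf-8")
    Path(demo_dir, ".hidden").write_text("", encoding="utf-8")
    visible = list_dir(demo_dir)                         # similar to `ls`
    everything = list_dir(demo_dir, include_hidden=True) # similar to `ls -a`
    readme_text = read_text_file(Path(demo_dir, "README"))

print(visible)
print(everything)
print(readme_text)
```

Inside the dataset you could call `list_dir(".", include_hidden=True)` instead of the demo folder.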

**Example**: Change the current directory to `ds004408/` (i.e., the root directory of the cloned dataset).

In [None]:
%cd ds004408/

**Example**: List the content of the current directory (i.e., `ds004408/`).

In [None]:
# Linux/macOS
!ls

In [None]:
# Windows
!dir

**Exercise**: Change the current working directory to the `stimuli/` folder and list the contents.

In [None]:
# Linux/macOS
%cd stimuli
!ls

In [None]:
# Windows
%cd stimuli
!dir

**Exercise**: Change the directory back to `ds004408/` (i.e., the parent directory of `stimuli/`).

In [None]:
%cd ..

**Exercise**: List the contents of `ds004408/` including all hidden files and folders.

In [None]:
# Linux/macOS
!ls -a

In [None]:
# Windows
!dir /a

**Exercise**: List the contents of the `.git/` folder.

In [None]:
# Linux/macOS
!ls .git

In [None]:
# Windows
!dir .git

**Exercise**: List the contents of the `.datalad/` folder.

In [None]:
# Linux/macOS
!ls .datalad

In [None]:
# Windows
!dir .datalad

**Example**: Display the content of `README`.

In [None]:
# Linux/macOS
!cat README

In [None]:
# Windows
!type README

**Exercise**: Display the file content of `participants.tsv`.

In [None]:
# Linux/macOS
!cat participants.tsv

In [None]:
# Windows
!type participants.tsv

**Exercise**: Display the file content of `.datalad/config`. This file contains a DataLad ID that uniquely identifies this dataset.

In [None]:
# Linux/macOS
!cat .datalad/config

In [None]:
# Windows
!type .datalad\config

## Managing File Content

### Background

You may have noticed that, even though the dataset contains lots of different folders, cloning it was really fast. This is because DataLad manages dataset structure and file content separately. When you cloned the dataset, you didn't actually download the file content. You merely downloaded tiny symbolic links that represent the files. To download the actual content of specific files, we have to use the `datalad get` command. This is very useful when we are working with large datasets. For example, you can clone the whole dataset to your laptop, download some sample files for testing a new analysis you are developing, and integrate your changes into the original dataset without having to move large amounts of data.

On Linux and macOS, the downloaded content is stored inside the hidden `.git/` folder (under `.git/annex/objects/`), and the files you can see in your file tree are symbolic links that point to it.
**On Windows this is different**. Because of limitations in the Windows filesystem, DataLad has to duplicate the data and store one copy in your working tree and one backup copy in the `.git/` folder.
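To see the difference between a link and its content, one can check whether a file is a symbolic link and whether the link's target exists. The sketch below assumes the symlink-based layout used on Linux/macOS, where a file without downloaded content is a link whose target is missing. It does not apply to Windows, and `content_status` is a hypothetical helper, not a DataLad function. The demo builds its own links in a temporary folder.

```python
import os
import tempfile


def content_status(path):
    """Classify a file in a symlink-based working tree.

    Assumption: a file whose content has not been downloaded is a
    symbolic link whose target does not exist yet.
    """
    if os.path.islink(path):
        # os.path.exists follows the link, so it is False for a dangling link.
        if os.path.exists(path):
            return "link, content present"
        return "link, content missing"
    if os.path.exists(path):
        return "regular file"
    return "not found"


# Demo: one real file, one link to it, and one link to a missing target.
# (Creating symlinks may require extra privileges on Windows.)
with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, "content.bin")
    present = os.path.join(demo_dir, "present.wav")
    missing = os.path.join(demo_dir, "missing.wav")
    with open(target, "wb") as f:
        f.write(b"data")
    os.symlink(target, present)
    os.symlink(os.path.join(demo_dir, "not-downloaded"), missing)
    statuses = [content_status(p) for p in (target, present, missing)]

print(statuses)
```

On Linux/macOS, calling `content_status("stimuli/audio01.wav")` from the dataset root before and after `datalad get` should show the status changing, provided the dataset uses this layout.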

### Exercises

In the following exercises, we are going to get the file content for some of the files in the dataset we just cloned, and we are also going to drop them again.
As we do that, we'll repeatedly check the disk usage (`du -sh` on Linux/macOS, `dir /s` on Windows) to see how the size of our dataset is changing.
Here are the commands you need to know - the commands for listing folders and checking disk usage are OS-specific while the DataLad commands are the same across all platforms:

**DataLad Commands**

| Command | Description |
| --- | --- |
| `datalad get data/` | Download the content of the directory `data/` |
| `datalad drop data/` | Delete the content of the directory `data/` |
| `datalad get data/example.txt` | Download the content of the file `data/example.txt` |
| `datalad get data/*.txt` | Download the content of all `.txt` files in `data/` |

**OS-specific commands**

| Linux/macOS | Windows | Description |
| --- | --- | --- |
| `du -sh .` | `dir /s` | Print the disk usage of the current directory |
| `du -sh data` | `dir /s data` | Print the disk usage of the `data/` directory |
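As a platform-independent alternative to `du -sh` and `dir /s`, the size of a folder can be estimated in Python. This is a rough sketch: `directory_size` and `human_readable` are hypothetical helpers, the function sums apparent file sizes rather than allocated disk blocks (so the numbers will not match `du` exactly), and symbolic links are counted as zero.

```python
import os
import tempfile


def directory_size(path="."):
    """Sum the apparent sizes (in bytes) of all regular files below `path`.

    Symbolic links are skipped; os.walk does not descend into linked
    directories by default.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def human_readable(n_bytes):
    """Format a byte count with binary units."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


# Demo on a temporary folder: 10 bytes at the top level, 5 in a subfolder.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "sub"))
    with open(os.path.join(demo_dir, "a.bin"), "wb") as f:
        f.write(b"0123456789")
    with open(os.path.join(demo_dir, "sub", "b.bin"), "wb") as f:
        f.write(b"01234")
    demo_total = directory_size(demo_dir)

print(demo_total, "bytes =", human_readable(demo_total))
```

In the dataset, `human_readable(directory_size("."))` can be called before and after `datalad get` or `datalad drop` to watch the size change.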

**Example**: Print the size of the current directory.

In [None]:
# Linux/macOS
!du -sh .

In [None]:
# Windows
!dir /s

**Example**: Get the data for the file `stimuli/audio01.wav`.

In [None]:
!datalad get stimuli/audio01.wav

**Exercise**: Check the disk usage of the current directory again.

In [None]:
# Linux/macOS
!du -sh

In [None]:
# Windows
!dir /s

**Exercise**: Get the data for `stimuli/audio02.wav`, then print the disk usage for the current directory.

In [None]:
# Linux/macOS
!datalad get stimuli/audio02.wav
!du -sh

In [None]:
# Windows
!datalad get stimuli/audio02.wav
!dir /s

**Exercise**: Drop the data of the whole stimulus folder, then print the disk usage of the current directory.

In [None]:
# Linux/macOS
!datalad drop stimuli/
!du -sh

In [None]:
# Windows
!datalad drop stimuli/
!dir /s

**Exercise**: Get the disk usage of the `stimuli/` folder.

In [None]:
# Linux/macOS
!du -sh stimuli

In [None]:
# Windows
!dir /s stimuli

**Exercise**: Get all `*.TextGrid` files in the `stimuli/` folder, then get the folder's disk usage again.

**NOTE (for Windows)**: The Windows command prompt does not expand `*` wildcards before passing them to a program, so this command may not work as expected. The easiest workaround is to get either the whole `stimuli/` folder (takes a while) or just a single file.

In [None]:
# Linux/macOS
!datalad get stimuli/*.TextGrid
!du -sh stimuli/
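Another possible workaround for the wildcard limitation mentioned in the Windows note is to expand the pattern in Python and pass the resulting file names to `datalad get` explicitly. The sketch below only builds the command as a list of arguments and does not run it. `build_get_command` is a hypothetical helper, and the file names in the demo are made up for illustration.

```python
import glob
import os
import tempfile


def build_get_command(pattern):
    """Expand `pattern` in Python and return a `datalad get` argument list.

    Returns None when nothing matches, so no command without paths is built.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        return None
    return ["datalad", "get", *paths]


# Demo on a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("audio01.TextGrid", "audio02.TextGrid", "audio01.wav"):
        open(os.path.join(demo_dir, name), "w").close()
    command = build_get_command(os.path.join(demo_dir, "*.TextGrid"))
    no_match = build_get_command(os.path.join(demo_dir, "*.csv"))

print(command)
print(no_match)
```

From the dataset root, the command returned by `build_get_command("stimuli/*.TextGrid")` could then be executed with `subprocess.run(command, check=True)`, assuming `datalad` is available on the PATH.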

**Exercise**: Get the size of the `.git/` folder.

In [None]:
# Linux/macOS
!du -sh .git

In [None]:
# Windows
!dir /s .git

## Inspecting File Identifiers

### Background

DataLad is a decentralized data management system, which means it does not rely on any central service to issue identifiers. This presents a challenge: how can files be unambiguously identified when an unknown number of DataLad datasets have been created independently? The answer is checksums. Checksums are alphanumeric strings that are generated from the file content via a hashing algorithm. Even the tiniest change in the file will result in a different checksum, which makes them practically unique identifiers of file content.
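The effect of a tiny change on a checksum can be demonstrated with Python's `hashlib`. SHA-256 is used here purely for illustration. The hashing algorithm configured for a particular dataset may be a different one, and the byte strings below are made-up examples.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Return the SHA-256 checksum of `content` as a hexadecimal string."""
    return hashlib.sha256(content).hexdigest()


original = b"subject-01 heard stimulus audio01"
modified = b"subject-01 heard stimulus audio02"  # only the last character differs

print(checksum(original))
print(checksum(modified))
# Identical content always yields the identical checksum.
print(checksum(original) == checksum(b"subject-01 heard stimulus audio01"))
```

Because the checksum depends only on the content, two independently created datasets that contain the same file arrive at the same identifier without having to coordinate.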

DataLad manages these file identifiers for us using git-annex under the hood. While most of the time we don't have to think about the git-annex operations, it can be useful to peek under the hood and use some git-annex commands directly to get more detailed information or configure the dataset's behavior.

### Exercises

In this section we are going to use `git annex` directly to get more detailed information on the files in our dataset, like their identifiers and storage locations. We'll also use `git annex` to configure how many copies of a given file we want to keep. Here are the commands you need to know:

| Code | Description |
| --- | --- |
| `git annex info` | Show the git-annex information for the whole dataset |
| `git annex info folder/image.png` | Show the git-annex information for the file `image.png` |
| `git annex whereis folder/image.png` | List the repositories that have the file content for `image.png` |
| `git annex numcopies 2` | Configure the dataset so that the required number of copies for a file is 2 |



**Example**: Get the git-annex `info` for the file `stimuli/audio01.wav`.

In [None]:
!git annex info stimuli/audio01.wav

**Exercise**: Get the file content for `stimuli/audio01.wav`, then print the git-annex `info` for that file again.

In [None]:
!datalad get stimuli/audio01.wav
!git annex info stimuli/audio01.wav

**Exercise**: List the repositories that contain the file content for `stimuli/audio01.wav`.

In [None]:
!git annex whereis stimuli/audio01.wav

**Exercise**: List the repositories that contain the file content for `stimuli/audio02.wav` - how is this different from the list of repositories in the previous exercise?

In [None]:
!git annex whereis stimuli/audio02.wav

**Exercise**: Set the number of required copies of a file to `3`.

In [None]:
!git annex numcopies 3

**Exercise**: Try to drop `stimuli/audio01.wav`. What does the error message say?

In [None]:
!datalad drop stimuli/audio01.wav

**Exercise**: Set the number of required copies of a file to 1 and drop `stimuli/audio01.wav`.

In [None]:
!git annex numcopies 1
!datalad drop stimuli/audio01.wav
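The behavior observed in the last two exercises can be summarized with a toy model: a local copy may only be dropped if enough other copies can be verified elsewhere. The function below is a deliberately simplified sketch of that rule, not git-annex's actual implementation. It ignores other settings that can influence the decision, and `can_drop` is a hypothetical name.

```python
def can_drop(verified_other_copies: int, numcopies: int) -> bool:
    """Toy model of the drop check.

    The local copy may be dropped only if at least `numcopies` copies
    can be verified in other repositories.
    """
    return verified_other_copies >= numcopies


# With a required number of 3 copies and only one verifiable remote copy,
# dropping is refused; lowering the requirement to 1 allows it.
print(can_drop(verified_other_copies=1, numcopies=3))
print(can_drop(verified_other_copies=1, numcopies=1))
```

The number of other copies in your case is what `git annex whereis` reported for the file.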

**Exercise**: Print the git-annex info for the whole dataset.

In [None]:
!git annex info

## Examining a New Dataset

Now you are equipped to consume any DataLad dataset that has been published online - let's try it out!
Search the [OpenNeuro database](https://openneuro.org/search?query={%22keywords%22:[]}) for a dataset that interests you and clone it. Then:
- print the `git annex info` of that dataset
- get some of the file contents and check the disk usage before and after
- drop the file contents and check the disk usage again