# Creating a DataLad Dataset from Scratch

DataLad is a highly flexible tool that can be easily integrated into every workflow because, in its essence, a DataLad dataset is just a regular folder on your machine (with some additional metadata in the `.git` and `.datalad` folders).

In this section we are going to create a new dataset from scratch using DataLad's `create` command.
Because every DataLad dataset is also a Git repository, this will initialize git automatically.
Once we create the dataset, we can add any kind of data.

We can even add other DataLad datasets as so-called subdatasets!
As we add data and make changes to our dataset, DataLad will keep track of everything in the `git log`.
This gives us a comprehensive history of our dataset which allows us (and anyone we share the dataset with) to understand what has been done and even restore older versions of files.

## Creating a new Dataset

### Background

Once we create a dataset, DataLad will watch out for changes to any file.
By using the `status` command, we can get a report on any files or changes in our dataset that are currently untracked.
When we run `datalad save`, the untracked changes will be committed into the dataset's history.
We can add a little message with the `-m` flag to describe what has been done, e.g., `-m "added raw recordings"`.
While this is not required, it is a good practice that will make the dataset's history more transparent to collaborators and your future self.

### Exercises

In this section we are going to create a new DataLad dataset. We are then going to add different kinds of content like text files and PDFs downloaded from the web and save them so DataLad keeps track of them. Here are the commands you need to know:

| Code | Description |
| --- | --- |
| `mkdir data/` | Create a new directory called `data/` |
| `cd data/` | Change the working directory to `data/` |
| `datalad create my-dataset` | Create a DataLad dataset in the new directory `my-dataset` |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save -m "adding data"` | Save all untracked changes in the current dataset with a commit message |
| `echo "hello" > file.txt` | Save the text `"hello"` to `file.txt` | 
| `curl -o file.txt <URL>` | Download content from the given URL and write it to `file.txt` |

**Example**: Create a new DataLad dataset called `my-dataset` in the current directory.

In [1]:
!datalad create my-dataset

[1;1mcreate[0m([1;32mok[0m): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/my-dataset ([1;35mdataset[0m)


**Exercise**: Create a new DataLad dataset called `learn-datalad` in the current directory.

In [2]:
!datalad create learn-datalad

[1;1mcreate[0m([1;32mok[0m): /home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad ([1;35mdataset[0m)


**Exercise**: Change the current directory to `learn-datalad` and print the dataset's `status`.

In [3]:
%cd learn-datalad
!datalad status

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad
nothing to save, working tree clean


**Example**: Create a new directory `code/` in `learn-datalad/`.

In [4]:
!mkdir audio

**Exercise**: Create a new directory `books/` in `learn-datalad/` and change the current directory to `books/`.

In [5]:
!mkdir books

Run the cell below to download [https://homepages.uc.edu/~becktl/byte_of_python.pdf](https://homepages.uc.edu/~becktl/byte_of_python.pdf) and write it to the output file `byte-of-python.pdf` in `books/`.

In [6]:
!curl -o books/byte-of-python.pdf https://homepages.uc.edu/~becktl/byte_of_python.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2630k  100 2630k    0     0   142k      0  0:00:18  0:00:18 --:--:--  151k


**Exercise**: Check the `status` of the dataset.

In [7]:
!datalad status

[1;31muntracked[0m: books ([1;35mdirectory[0m)


**Exercise**: `save` the untracked file and add the message `"add a book on Python"`. Then, check the `status` of the dataset again.

In [8]:
!datalad save -m "add a book on Python"
!datalad status

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                   | 0.00/2.69M [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): books/byte-of-python.pdf ([1;35mfile[0m)        [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Run the cell below to download [https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf](https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf) and write it to `books/progit.pdf`. Then, save the untracked file with a message `"add a book on Git"` and check the dataset's `status` to make sure there are no untracked changes.

In [9]:
!curl -o progit.pdf https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0


In [10]:
!datalad save -m "add a book on Git"
!datalad status

[1;1madd[0m([1;32mok[0m): progit.pdf ([1;35mfile[0m)                      
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Run the cell below to create a new file `README.md` with the text `"This is a DataLad dataset"`. Then, save the untracked file and check the dataset's status.

In [11]:
!echo "This is a DataLad dataset" > README.md

In [12]:
!datalad save -m "add README"
!datalad status

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                    | 0.00/26.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


## Modifying Content and Tracking Changes

### Background

While DataLad wraps many functions of Git, there are some instances where we need to use `git` directly.
Viewing the `git log` is one of those instances.
This log is a critical part of any Git repository and DataLad dataset because it contains a comprehensive history of our dataset and every time we run `datalad save`, a new entry is created.
Each commit has a unique hash and contains the commit's author and their email as well as the commit message.

Because DataLad wants to make sure that we don't accidentally overwrite our files once they are committed, it locks them to make them unmodifiable.
This is why we have to use `datalad unlock` before we can modify them.
When we run `datalad save` to save our changes, the file will be locked again.
Even though DataLad reports unlocking as a file modification, it will only create a new entry in the commit history if the file actually changed.

**Note for Windows**: The Windows file system does not support file locking in the same way that Linux/macOS does. Instead, Windows duplicates the data and keeps one copy in the working directory and one backup copy for safety in the `.git` folder. This has the advantage that you don't need to unlock files before modifying them, but it also makes your dataset twice as big!
Another consequence is that DataLad is creating a separate commit for this adjustment, so the most recent entry in your `git log` will always show the message *"git-annex adjusted branch"*. This means that, to get the most recent commit you made to the dataset, you have to look at the second-to-last entry in `git log`.

### Exercises

In the following exercises we are going to inspect the `git log` to view the history of our dataset. We are also going to modify existing files by unlocking them and saving the changes to the commit history. Here are the commands you need to know:

| Code | Description |
| --- | --- |
| `git log` | Display the commit history of the repository |
| `git log -2` | Display the last two entries in commit history |
| `git log --oneline` | Display a compact one-line view of the commit history |
| `datalad unlock data/` | Unlock the file content of the `data/` folder |
| `datalad unlock file.txt` | Unlock the file content of `file.txt` |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save` | Save untracked changes and lock unlocked file contents |
| `datalad save -m "message"` | Save untracked changes with a commit message |
| `echo "content" >> file.txt` | Append the text `"content"` to `file.txt` |

**Exercise**: Display the `git log` to view all commits you made to the `learn-datalad` dataset.

In [13]:
!git log

[33mcommit e50d7cbf9faf240a5889a05a7c8fe31081743f60[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:18 2025 +0100

    add README

[33mcommit 6c24174edfa3dbe873da2faa652290bb76f321f4[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:14 2025 +0100

    add a book on Git

[33mcommit 2d49e4ba0a6e2dd4f703e268239489f5be054282[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:10 2025 +0100

    add a book on Python

[33mcommit a77d0803f0ed92ee09e3b1e21ae50d7536f2fbea[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:15:34 2025 +0100

    [DATALAD] new dataset


**Exercise**: Display the `git log` in a compact one-line view.

In [14]:
!git log --oneline

[33me50d7cb[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m add README
[33m6c24174[m add a book on Git
[33m2d49e4b[m add a book on Python
[33ma77d080[m [DATALAD] new dataset


**Exercise**: Unlock the content of `README.md`. Then, check the dataset's status.

In [15]:
!datalad unlock README.md
!datalad status

[1;1munlock[0m([1;32mok[0m): README.md ([1;35mfile[0m)                    
 [1;31mmodified[0m: README.md ([1;35mfile[0m)                               


**Example**: Append the line `"It uses git and git-annex"` to `README.md`, either using your editor or the echo command. Then, `save` with a message and check the dataset's `status`.

In [16]:
!echo "It uses git and git-annex" >> README.md
!datalad save -m "add line"
!datalad status

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                    | 0.00/52.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Unlock `README.md` and append another line `"for decentralized version control"`. Then, `save` the changes and check the `status`.

In [17]:
!datalad unlock README.md
!echo "For decentralized version control" >> README.md
!datalad save -m "add another line"
!datalad status

[1;1munlock[0m([1;32mok[0m): README.md ([1;35mfile[0m)                    
Total: 0.00 datasets [00:00, ? datasets/s]                                      
Total:   0%|                                    | 0.00/86.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean


**Exercise**: Display the last two entries in the git history.

In [18]:
!git log -2

[33mcommit 2118200577c87d9f08471a5737e48de630881f4e[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:28 2025 +0100

    add another line

[33mcommit c6a173952ea4c48cd085b95e73200894028fe972[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:25 2025 +0100

    add line


**Exercise**: Unlock `README.md` and then, without making any changes, `save` with a message. Check the last two entries in the git history - did your `save` command create an entry?

In [19]:
!datalad unlock README.md
!datalad save -m "did nothing"
!git log -2

[1;1munlock[0m([1;32mok[0m): README.md ([1;35mfile[0m)                    
Total: 0.00 datasets [00:00, ? datasets/s]                                      
Total:   0%|                                    | 0.00/86.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       [A
action summary:                                                                 
  add (ok: 1)
  save (notneeded: 1)
[33mcommit 2118200577c87d9f08471a5737e48de630881f4e[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:28 2025 +0100

    add another line

[33mcommit c6a173952ea4c48cd085b95e73200894028fe972[m
Author: obi <ole.bialas@posteo.de>
Date:   Thu Dec 4 21:16:25 2025 +0100

    add line


## Installing Subdatasets

### Background

You can add any data to your DataLad dataset, including other datasets!
DataLad allows you to install datasets as submodules, which means that they are added to your repository while maintaining their own, independent git history.
This allows us to modularize research projects by, for example, creating subdatasets for different modalities, conditions, or analysis methods.
Modularizing the dataset often results in a cleaner history and easier-to-maintain project, and it also increases the reusability because it allows you and others to reuse only specific components of the data. 

Installing subdatasets is done via DataLad's `install` command.
This works similarly to `clone` but is more versatile and allows us to install a subdataset into a given path while automatically registering it into the superdataset's history.   

### Exercises

In the following exercises, we are going to install datasets from OpenNeuro as subdatasets into our new dataset. Here are the commands you need to know:

| Code | Description |
| --- | --- |
| `datalad install -d my-dataset <URL>` | Install the dataset from the given URL as a subdataset into the `my-dataset/` directory |
| `datalad install -d . <URL>` | Install the dataset from the given URL as a subdataset into the current directory |
| `datalad subdatasets` | List all subdatasets of the current directory |

**Example**: Install the dataset from the OpenNeuro URL [https://github.com/OpenNeuroDatasets/ds005131.git](https://github.com/OpenNeuroDatasets/ds005131.git) as a subdataset into the current dataset.

In [20]:
!datalad install -d . https://github.com/OpenNeuroDatasets/ds005131.git

Cloning:   0%|                             | 0.00/2.00 [00:00<?, ? candidates/s]
Enumerating: 0.00 Objects [00:00, ? Objects/s][A
                                              [A
Counting:   0%|                              | 0.00/3.05k [00:00<?, ? Objects/s][A
                                                                                [A
Compressing:   0%|                           | 0.00/1.81k [00:00<?, ? Objects/s][A
                                                                                [A
Receiving:   0%|                             | 0.00/3.05k [00:00<?, ? Objects/s][A
Receiving:   6%|█▎                    | 184/3.05k [00:00<00:01, 1.62k Objects/s][A
Receiving:  19%|████▌                   | 580/3.05k [00:00<00:02, 935 Objects/s][A
Receiving:  50%|██████████          | 1.53k/3.05k [00:00<00:00, 2.42k Objects/s][A
Receiving:  62%|████████████▍       | 1.89k/3.05k [00:00<00:00, 2.31k Objects/s][A
Receiving:  72%|██████████████▍     | 2.20k/3.05k [00:01<00:00,

**Exercise**: Change the directory to the root of the newly installed subdataset `ds005131/` and check its `git log`.

In [21]:
%cd ds005131
!git log --oneline

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad/ds005131
[33m51c1338[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmain[m[33m, [m[1;33mtag: [m[1;33m1.0.1[m[33m, [m[1;31morigin/master[m[33m, [m[1;31morigin/main[m[33m, [m[1;31morigin/HEAD[m[33m)[m [OpenNeuro] Recorded changes
[33m7bb0e92[m [OpenNeuro] Recorded changes
[33m579b3aa[m [OpenNeuro] Recorded changes
[33m95b3ce9[m [OpenNeuro] Recorded changes
[33m82286ff[m [OpenNeuro] Recorded changes
[33m577f003[m[33m ([m[1;33mtag: [m[1;33m1.0.0[m[33m)[m [OpenNeuro] Recorded changes
[33m2779065[m [OpenNeuro] Recorded changes
[33mde27cca[m [OpenNeuro] Recorded changes
[33m087aafd[m [OpenNeuro] Recorded changes
[33m86fb2d1[m [OpenNeuro] Recorded changes
[33m3e04d03[m [OpenNeuro] Recorded changes
[33m3a6eca0[m [OpenNeuro] Recorded changes
[33mf9191e6[m [OpenNeuro] Recorded changes
[33mbc72cc4[m [OpenNeuro]

**Exercise**: Change the directory back to the parent `learn-datalad/`. Then, browse the [OpenNeuro database](https://openneuro.org/search?query={%22keywords%22:[]}), choose a dataset and install it as another subdataset. 

In [22]:
%cd ..
!datalad install -d . https://github.com/OpenNeuroDatasets/ds003507.git

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad
Cloning:   0%|                             | 0.00/2.00 [00:00<?, ? candidates/s]
Enumerating: 0.00 Objects [00:00, ? Objects/s][A
                                              [A
Counting:   0%|                              | 0.00/2.63k [00:00<?, ? Objects/s][A
                                                                                [A
Compressing:   0%|                           | 0.00/1.74k [00:00<?, ? Objects/s][A
                                                                                [A
Receiving:   0%|                             | 0.00/2.63k [00:00<?, ? Objects/s][A
Receiving:   7%|█▌                    | 185/2.63k [00:00<00:01, 1.78k Objects/s][A
Receiving:  19%|████▏                 | 501/2.63k [00:00<00:01, 1.82k Objects/s][A
Receiving:  26%|█████▋                | 685/2.63k [00:00<00:01, 1.17k Objects/s][A
Receiving: 

**Exercise**: Change the directory to the newly installed subdataset and inspect its `git log`.

In [23]:
%cd ds003507
!git log --oneline

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad/ds003507
[33m8b8fad4[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m, [m[1;33mtag: [m[1;33m1.0.1[m[33m, [m[1;31morigin/master[m[33m, [m[1;31morigin/HEAD[m[33m)[m [DATALAD] Recorded changes
[33m29ce3cc[m [DATALAD] Recorded changes
[33mb82c86f[m [DATALAD] Recorded changes
[33me548886[m [DATALAD] Recorded changes
[33mf212034[m [DATALAD] Recorded changes
[33m540710b[m [DATALAD] Recorded changes
[33mea7a5e4[m [DATALAD] Recorded changes
[33m80eeffd[m [DATALAD] Recorded changes
[33ma821149[m [DATALAD] Recorded changes
[33mb461339[m [DATALAD] Recorded changes
[33m98e47d8[m [DATALAD] Recorded changes
[33ma9d5d59[m [DATALAD] Recorded changes
[33m5cb3c0b[m[33m ([m[1;33mtag: [m[1;33m1.0.0[m[33m)[m [DATALAD] Recorded changes
[33m2f692c4[m [DATALAD] Recorded changes
[33m7527e33[m [DATALAD] exclude paths

**Exercise**: Change the directory back to the parent `learn-datalad/` and list all `subdatasets`.

In [24]:
%cd ..
!datalad subdatasets

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch/learn-datalad
[1;1msubdataset[0m([1;32mok[0m): ds003507 ([1;35mdataset[0m)
[1;1msubdataset[0m([1;32mok[0m): ds005131 ([1;35mdataset[0m)


## Going Back and Forth in Time

### Background

Because DataLad keeps track of all changes to our dataset, we can restore any previous version of a given file. This can be very useful if we made a mistake and want to restore an older version of our project, or we simply want to check how the data looked previously. In this section, we are going to learn two ways of doing this: checking out to a specific commit and resetting the repository. The `checkout` is mostly useful if we want to look at an older state of our project without actually changing the current state of the repository, while the `reset` is used to modify the repository's state.

Note that a `checkout` creates a new branch of the dataset.
A branch is like a copy that can be modified independently of the original.
To switch back to the previous branch (i.e., main/master), use `git switch -`.

### Exercises

In the following exercises we are going to use the git history to look at old file versions and restore previous states of our dataset. Here are the commands you need to know:

| Code | Description |
| --- | --- |
| `git log --oneline` | Display a compact one-line view of the commit history |
| `git checkout d0e83f29` | `checkout` to the state of the repository at the commit with the hash `d0e83f29` |
| `git switch -` | Switch back to the previous branch |
| `git reset --mixed d0e83f29` | `reset` the state of the repository to the commit with the hash `d0e83f29` but keep the working directory as-is |
| `git reset --hard d0e83f29` | `reset` the state of the repository and delete files from the working directory |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save -m "message"` | Save untracked changes with a commit message |
| `cat file.txt` | Display the content of `file.txt` (Linux/macOS) |
| `type file.txt` | Display the content of `file.txt` (Windows) |


**Exercise**: Run the cell below to print the git history, identify the last commit before we made any changes to `README.md` and note its commit hash.

In [25]:
!git log --oneline

[33m07a4216[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m [DATALAD] Added subdataset
[33m4010f6f[m [DATALAD] Added subdataset
[33m2118200[m add another line
[33mc6a1739[m add line
[33me50d7cb[m add README
[33m6c24174[m add a book on Git
[33m2d49e4b[m add a book on Python
[33ma77d080[m [DATALAD] new dataset


**Exercise**: Use `checkout` to the commit hash you noted in the exercise above (this hash is going to be different for every repository).

In [26]:
!git checkout dbe3ff9

error: pathspec 'dbe3ff9' did not match any file(s) known to git


**Exercise**: Check the git commit history and inspect the content of `README.md`.

In [27]:
!git log --oneline

[33m07a4216[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m [DATALAD] Added subdataset
[33m4010f6f[m [DATALAD] Added subdataset
[33m2118200[m add another line
[33mc6a1739[m add line
[33me50d7cb[m add README
[33m6c24174[m add a book on Git
[33m2d49e4b[m add a book on Python
[33ma77d080[m [DATALAD] new dataset


In [28]:
# Linux/macOS
!cat README.md

This is a DataLad dataset
It uses git and git-annex
For decentralized version control


In [29]:
# Windows
!type README.md

README.md is /mnt/c/Users/olebi/AppData/Local/Microsoft/WinGet/Packages/equalsraf.win32yank_Microsoft.Winget.Source_8wekyb3d8bbwe/README.md


**Exercise**: Switch back to the previous (i.e. the main/master) branch.

In [30]:
!git switch -

fatal: invalid reference: @{-1}


**Exercise**: Identify the hash of the commit where we appended the first line to `README.md`. Then, `checkout` to that commit and inspect the content of `README.md`.

In [31]:
!git checkout 70ed602

error: pathspec '70ed602' did not match any file(s) known to git


In [32]:
# Linux/macOS
!cat README.md

This is a DataLad dataset
It uses git and git-annex
For decentralized version control


In [33]:
# Windows
!type README.md

README.md is /mnt/c/Users/olebi/AppData/Local/Microsoft/WinGet/Packages/equalsraf.win32yank_Microsoft.Winget.Source_8wekyb3d8bbwe/README.md


**Exercise**: Switch back to the master branch and inspect the content of `README.md` to make sure it was restored.

In [34]:
!git switch -

fatal: invalid reference: @{-1}


In [35]:
# Linux/macOS
!cat README.md

This is a DataLad dataset
It uses git and git-annex
For decentralized version control


In [None]:
# Windows
!type README.md

**Exercise**: Use `git reset --mixed` to reset the repository's state to the point before `README.md` was modified. Then, check the `git log` and the dataset's `status`.

**NOTE**: Using `--mixed` resets the repository's state but does not affect your working directory - commits that happened after the point of reset will appear as unstaged changes.

In [None]:
!git reset --mixed dbe3ff9
!git log --oneline
!datalad status

Unstaged changes after reset:
M	README.md
[33mdbe3ff9[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m add README
[33mfa67eb5[m add a book on Git
[33mbdd7b31[m add a book on Python
[33m417f3a3[m [DATALAD] new dataset
[1;31muntracked[0m: .gitmodules ([1;35mfile[0m)
[1;31muntracked[0m: ds003507 ([1;35mdirectory[0m)
[1;31muntracked[0m: ds005131 ([1;35mdirectory[0m)
 [1;31mmodified[0m: README.md ([1;35msymlink[0m)


**Exercise**: Save the unstaged changes to `README.md`. Then, check the content of `README.md` to make sure nothing got lost.

**NOTE**: Since you are adding what were multiple commits in a single operation, you may choose a different commit message.

In [None]:
!datalad save -m "adding info to README"

[1;1madd[0m([1;32mok[0m): ds003507 ([1;35mdataset[0m)                     
[1;1madd[0m([1;32mok[0m): ds005131 ([1;35mdataset[0m)                     
[1;1madd[0m([1;32mok[0m): .gitmodules ([1;35mfile[0m)                     
Total:   0%|                                 | 0.00/1.00 [00:00<?, ? datasets/s]
Total:   0%|                                     | 0.00/808 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 4)
  save (ok: 1)
This is a DataLad dataset
It uses git and git-annex
For decentralized version control


In [None]:
# Linux/macOS
!cat README.md

In [None]:
# Windows
!type README.md

**Exercise**: Use `git reset --hard` to reset the repository's state to the point before `README.md` was modified. Then, check the `git log` and the dataset's `status`.

**NOTE**: Using `--hard` modifies your working directory and all commits that happened after the point of reset will be gone (they can still be recovered if they haven't been deleted by git's garbage collector, which happens after 30 days by default). Also, this won't remove the installed subdatasets (you can simply remove them manually).

In [None]:
!git reset --hard dbe3ff9
!git log --oneline
!datalad status

HEAD is now at dbe3ff9 add README
[33mdbe3ff9[m[33m ([m[1;36mHEAD[m[33m -> [m[1;32mmaster[m[33m)[m add README
[33mfa67eb5[m add a book on Git
[33mbdd7b31[m add a book on Python
[33m417f3a3[m [DATALAD] new dataset
[1;31muntracked[0m: ds003507 ([1;35mdirectory[0m)
[1;31muntracked[0m: ds005131 ([1;35mdirectory[0m)


## Dataset Configurations: To Annex or not to Annex?

### Background

By default, DataLad will use `git-annex` to handle the content of every single file in your dataset. However, this is not always desirable. For example, you may not want to annex small text files like code to avoid having to unlock them for every edit. We can tell DataLad which files should be annexed by editing the `.gitattributes` file. Let's look at the default `.gitattributes` that was created when we initialized the dataset:

In [35]:
!cat .gitattributes

* annex.backend=MD5E
**/.git* annex.largefiles=nothing


There are two lines in this file:
- `* annex.backend=MD5E`: tells git-annex to use the `MD5E` backend for generating file hashes
- `**/.git* annex.largefiles=nothing`: tells git-annex to not annex the `.git` folder (because that folder is where the annexed contents are stored)

We usually don't want to edit these default values. Instead, we want to add lines to `.gitattributes` to specify which contents should and shouldn't be annexed. Note that changes in the configuration will not automatically be applied to files that are already tracked. Thus, it is best to configure `.gitattributes` right after initializing the dataset, before data is added.

### Exercises

In the following exercises, we are going to modify our dataset's `.gitattributes` to apply some custom configurations.
Here are the different commands and configuration options that you'll need:

| Code | Description |
| --- | --- |
| `* annex.largefiles=(mimeencoding=binary)` | Only annex files with a `binary` encoding |
| `myfile.pdf annex.largefiles=nothing` | Don't annex `myfile.pdf` |
| `* annex.largefiles=(largerthan=5kb)` | Only annex files whose size exceeds 5KB |
| `* annex.largefiles=((largerthan=5kb)or(mimeencoding=binary))` | Only annex binary files and files greater than 5KB |
| `git annex unannex <files>` | Unannex the content of the given files |
| `git annex whereis <files>` | Show the location of the annexed file content (empty if the file isn't annexed) |

**Exercise**: Get the location of the annexed file content of `books/byte_of_python.pdf`.


In [None]:
!git annex whereis books/byte_of_python.pdf

**Exercise**: Add the line below to `.gitattributes` to avoid annexing pdfs.

In [None]:
**/*.pdf annex.largefiles=nothing

**Example**: Unannex `books/byte_of_python.pdf` and save to apply the changed configuration.

In [None]:
!git annex unannex books/
!datalad save -m "unannex book"

**Exercise**: Get the location of the annexed file content of `books/byte_of_python.pdf` again. This should return nothing since the file isn't annexed anymore.

In [None]:
!git annex whereis books/byte_of_python.pdf

total 2632
-rw-r--r-- 1 olebi olebi 2693891 Dec  2 13:07 byte-of-python.pdf
-rw-r--r-- 1 olebi olebi       0 Dec  2 13:07 progit.pdf


**Exercise**: Change the last line in `.gitattributes`, so that only binary files will be annexed.

In [None]:
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=(mimeencoding=binary)

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                    | 0.00/90.0 [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): .gitattributes ([1;35mfile[0m)                  [A
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)


**Exercise**: Unannex `README.md` and save to apply the changed configuration. Now you should be able to edit `README.md` without having to unlock it.

In [None]:
!git annex unannex README.md
!datalad save -m "annex only binary"

                                                                                

**Exercise**: Change the last line in `.gitattributes` so that (non-binary) files greater than 5kb will also be annexed.

In [None]:
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb))

/home/olebi/projects/Introduction-to-Scientific-Data-Management-with-DataLad/notebooks/02_creating_a_dataset_from_scratch


**Exercise**: Execute the cell below to save a large text file. Then inspect `README.md` and the new file `test.txt`. Then, get the location of the annexed file content for `test.txt` and `README.md` - if you configured `.gitattributes` correctly in the exercise above, `test.txt` should be a symlink but `README.md` shouldn't.

In [47]:
open('test.txt', 'w').write('he' * 5000)
!datalad save -m "add large text file"

Total: 0.00 datasets [00:00, ? datasets/s]
Total:   0%|                                   | 0.00/10.2k [00:00<?, ? Bytes/s][A
[1;1madd[0m([1;32mok[0m): test.txt ([1;35mfile[0m)                        [A
[1;1madd[0m([1;32mok[0m): .gitattributes ([1;35mfile[0m)                  
[1;1madd[0m([1;32mok[0m): README.md ([1;35mfile[0m)                       
[1;1msave[0m([1;32mok[0m): . ([1;35mdataset[0m)                           
action summary:                                                                 
  add (ok: 3)
  save (ok: 1)


In [None]:
!git annex whereis test.txt
!git annex whereis README.md