Open Workflows (Software/process demo): Reproducible Research Objects with DataLad #25

Open
jsheunis opened this issue May 13, 2020 · 13 comments

Reproducible Research Objects with DataLad

By Adina Wagner, Institute for Neuroscience and Medicine, Brain and Behavior (INM-7), Juelich Research Centre

  • Theme: Open Workflows
  • Format: Software/process demo

Abstract

DataLad makes it easy to link code, arbitrary amounts of data, software environments, procedures used for computations, and results in a lightweight, easily shareable format that is provenance-tracked and version-controlled. This makes it possible to create reproducible research objects at any level of elaboration: from “only” joining data and code up to completely executable “reproducible paper”-type publications, hosted openly as public repositories on services such as GitHub, GitLab, or Gin.
In this demonstration, I will walk through a DataLad-centric analysis workflow using data from the Human Connectome Project (HCP), featuring

  • consuming HCP data with DataLad,
  • reproducible, re-executable, and provenance-tracked data analyses with DataLad,
  • and open dissemination of data, workflows, and results in a public repository.
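
The three steps above could map onto DataLad command-line calls roughly as follows. This is a minimal sketch, not the workflow shown in the talk: the dataset name `analysis`, the subdataset location `inputs/data`, the commit message, and all paths passed in are hypothetical placeholders. The helper only builds the command lines and does not execute anything.

```python
import shlex


def workflow_commands(source, input_path, output_path, analysis_cmd, sibling):
    """Build DataLad CLI calls for a create -> clone -> run -> push workflow.

    All argument values are placeholders chosen by the caller; nothing here
    is executed. Commands 2-4 are meant to be run inside the new dataset.
    """
    return [
        # 1. Create a new, empty analysis dataset (hypothetical name).
        ["datalad", "create", "analysis"],
        # 2. Register the input data as a subdataset of the analysis dataset.
        ["datalad", "clone", "-d", ".", source, "inputs/data"],
        # 3. Run the analysis and record inputs, outputs, and the command.
        ["datalad", "run", "-m", "Run analysis",
         "--input", input_path, "--output", output_path, analysis_cmd],
        # 4. Publish the dataset to a previously configured sibling.
        ["datalad", "push", "--to", sibling],
    ]


if __name__ == "__main__":
    commands = workflow_commands(
        source="https://example.org/some-dataset",  # placeholder URL
        input_path="inputs/data/sub-01",            # placeholder path
        output_path="results/summary.csv",          # placeholder path
        analysis_cmd="python code/analysis.py",     # placeholder script
        sibling="github",                           # placeholder sibling name
    )
    for argv in commands:
        print(shlex.join(argv))
```

Keeping the calls as argument lists makes it easy to print them for review first and only later pass them to something like `subprocess.run`.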

Useful Links

http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html

Tagging @adswa

adswa commented Jun 2, 2020

This is to confirm the talk :) (Sorry, I missed the follow-up e-mail...)

adswa commented Jun 16, 2020

The slides are available here, for anyone who is interested, and a write-up with further pointers is here.

Starborn commented Jun 17, 2020 via email

adswa commented Jun 17, 2020

Hi @Starborn, no, there is no relational database involved. It all builds on Git and git-annex. There is a hands-on introduction in the chapter on datasets in the DataLad handbook.

As I'm not very familiar with relational databases, I may be misunderstanding your questions, so please bear with me and re-ask if necessary ;-)

where is the data structure?

In the case of the HCP data, the original data comes from the Amazon S3 buckets of the HCP project. Once locally available, the data is stored in each dataset's "object tree", a key-value store of git-annex within .git/annex/objects of each dataset. There are details on this in this section and a technical overview in the git-annex documentation.
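
To make the "key-value store" idea more concrete, here is a toy sketch in Python. It is not git-annex's implementation: the key format is modeled on what I understand git-annex's SHA256E backend to produce (backend name, file size, checksum, file extension), and `ToyObjectStore` is a hypothetical class that stores objects flat in one directory. As far as I know, the real object tree nests each object in hashed subdirectories, and the files in the working tree are typically symlinks pointing into it.

```python
import hashlib
from pathlib import Path


def annex_style_key(content: bytes, filename: str) -> str:
    """Derive a content-based key, loosely modeled on git-annex's SHA256E keys."""
    digest = hashlib.sha256(content).hexdigest()
    extension = Path(filename).suffix  # simplified extension handling
    return f"SHA256E-s{len(content)}--{digest}{extension}"


class ToyObjectStore:
    """A flat, content-addressed key-value store on disk (illustration only)."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, content: bytes, filename: str) -> str:
        """Store content under its key and return the key."""
        key = annex_style_key(content, filename)
        (self.root / key).write_bytes(content)
        return key

    def get(self, key: str) -> bytes:
        """Return the content stored under a key."""
        return (self.root / key).read_bytes()
```

Because the key is derived from the content, two files with identical content and extension map to the same object, and a version of a file can be identified by its key even when the content itself is not present locally.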

Can the data/program/model be visualized in any other way than by using docker/code?

I'm a bit unsure what exactly you are referring to. :) Docker/Singularity is only required to attach a software environment to the data and code in the dataset, and to execute commands inside this software environment. It's not necessary to do this, and it is by no means necessary for visualizing any dataset contents. If your question is whether there is a GUI, then no, not for DataLad. Everything happens as command line calls or via the Python API.

I think it would be a bit easier for me to answer if I understood what you are interested in (the "data" aspect of it, i.e., getting HCP data? The reproducible execution aspect? The version control aspect? ...). In any case, the user documentation http://handbook.datalad.org/en/latest/index.html and the technical docs http://docs.datalad.org/en/stable/ may be good resources to browse.

Starborn commented Jun 17, 2020 via email

adswa commented Jun 17, 2020

I think I understand a bit better where you are coming from, thanks for clarifying :)

  • First of all, leave containers out of the equation for now. They're certainly useful to understand, but not the starting point.
  • "I guess in the first instance I need to understand where is the data": I would recommend reading the chapter http://handbook.datalad.org/en/latest/basics/basics-datasets.html to get a general idea, and taking a look at git-annex's documentation for more technical details. To put it simply: the data can be anywhere (a webtorrent, an S3 bucket, a Dropbox account, a private webserver, ...), but its location is registered in a dataset. On demand, it can be retrieved in a precise version from this location and is then available locally on your machine.
  • "So, does your system use each data file individually rather than a set of files/records in a database?": I guess it isn't wrong to phrase it like this. DataLad only knows about files and folders; everything happens at the level of individual files in a dataset. It is unrelated to any database-based approach.
  • "I want to build knowledge models and processes based on the data/workflows available, so that I can try to query the data to answer different questions": Everything that is done to the data in a dataset is stored in the Git history. Maybe this is a useful starting point for queries. The Git history can also be visualized with many existing GUIs. The development of a GUI for DataLad is not actively in progress at the moment. But the DataLad handbook is written in a way that you do not need to be familiar with Python or the command line to read it, so I'm hopeful that this resource can give a comprehensive understanding of the tool. :)
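
As one possible starting point for such queries: commands executed with `datalad run` leave a machine-readable record in the commit message. The sketch below assumes the record is a JSON block between two marker lines, which is the format I understand `datalad run` to write. The example message, its field values, and the helper name `parse_run_record` are made up for illustration.

```python
import json
import re

# Marker lines that (to my understanding) delimit the JSON run record
# in commit messages written by `datalad run`.
RUN_RECORD = re.compile(
    r"=== Do not change lines below ===\n"
    r"(.*?)\n"
    r"\^\^\^ Do not change lines above \^\^\^",
    re.DOTALL,
)


def parse_run_record(commit_message: str):
    """Return the run record as a dict, or None if the message has none."""
    match = RUN_RECORD.search(commit_message)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up commit message for illustration; all values are placeholders.
EXAMPLE_MESSAGE = """[DATALAD RUNCMD] Compute summary statistics

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/stats.py",
 "dsid": "00000000-0000-0000-0000-000000000000",
 "exit": 0,
 "extra_inputs": [],
 "inputs": ["inputs/data"],
 "outputs": ["results/stats.csv"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

if __name__ == "__main__":
    record = parse_run_record(EXAMPLE_MESSAGE)
    print(record["cmd"], "->", record["outputs"])
```

Real commit messages could be fed into this from the output of `git log --format=%B`, which would allow questions such as "which command produced this result file?" to be answered from the history alone.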

Starborn commented Jun 22, 2020 via email

adswa commented Jun 22, 2020

Ah! Maybe http://nidm.nidash.org/ is what you are looking for?

Starborn commented Jun 22, 2020 via email

adswa commented Jun 22, 2020

I'm not too knowledgeable in this domain, so I'd suggest you contact the team around NIDM (links and pointers on the website). This particular workflow of mine isn't concerned at all with NIDM. All the best!

Starborn commented Jun 22, 2020 via email

Starborn commented Jun 24, 2020 via email

adswa commented Jun 24, 2020

hihi, sure Paola!
