What information is needed to reuse code? #2

Open
kaythaney opened this Issue Feb 12, 2014 · 38 comments

@kaythaney

kaythaney commented Feb 12, 2014

(For more, see the full post on the blog: http://mozillascience.org/what-else-is-needed-for-code-reuse/)

When we first started discussions around our latest "Code as a research object" project, one of the main topics that arose was reuse. It's one thing for code and software to have an identifier that the community trusts so that it can be integrated into scholarly publishing systems. But what about the researchers looking to use that information to build or reuse that code in their own work? What information is needed for the code to be picked up, forked and run by someone else outside of their lab? What would the ideal README look like?

A few ideas:

  • What does the code do? For what field? (Short descriptor)
  • What's needed to get the code to run? Is it part of a larger codebase? Links to relevant repositories or tools used to run the code.
  • Contributors.
  • Link to the documentation on how the code is used.
  • License. We recommend MIT or BSD, but for more options/clarity, visit: http://choosealicense.com/.
@ctb

ctb commented Feb 12, 2014

A few off the cuff thoughts:

  • at least minimal automated tests so we can be sure the software runs properly;
  • an example data set with expected output;
  • a lot of the stuff that the Software Sustainability Institute puts up about sustainable projects;
  • how to cite the code.

@mr-c, what are we missing?
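
A minimal sketch of what "minimal automated tests" plus "example data set with expected output" could look like in Python. The function `mean_by_group` is a hypothetical stand-in for the real research code, and the example data and expected output are invented for illustration.

```python
from collections import defaultdict


def mean_by_group(rows):
    """Hypothetical stand-in for the research code: average value per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for group, value in rows:
        totals[group][0] += value
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Tiny example data set shipped with the code, plus the output it should produce.
EXAMPLE_INPUT = [("control", 1.0), ("control", 3.0), ("treated", 4.0)]
EXPECTED_OUTPUT = {"control": 2.0, "treated": 4.0}


def test_example_data_gives_expected_output():
    """Smoke test: the shipped example must reproduce the documented output."""
    assert mean_by_group(EXAMPLE_INPUT) == EXPECTED_OUTPUT


if __name__ == "__main__":
    test_example_data_gives_expected_output()
    print("example data reproduces the expected output")
```

A test runner such as pytest would pick up the `test_` function automatically, but running the file directly is enough for a first check.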

@gearmonkey

gearmonkey commented Feb 12, 2014

A quick one that I see missing from lots of research code (and that sometimes keeps me from using it):

  • Example use for core cases, i.e., how do I run this code?
@JensTimmerman

JensTimmerman commented Feb 12, 2014

  • ALL dependencies needed to compile the code.
  • Versions of these dependencies that are KNOWN to work (including compiler)
  • (unit) test cases
  • examples and their expected output
  • hash of the release, or git commit id, optionally signed by the release manager so the authenticity of the code can be checked.
  • optionally benchmark figures for the examples on a certain platform to get an idea of how long a run of the code will take.
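
As a rough illustration of the "hash of the release" item, the sketch below computes a SHA-256 digest of a downloaded file and compares it with the value the authors published. The file name and contents are placeholders, and signing the hash (for example with GPG) is not covered here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a downloaded file against the hash the authors published."""
    return sha256_of_file(path) == published_hex.strip().lower()


if __name__ == "__main__":
    # Stand-in for a downloaded release archive (placeholder content).
    with tempfile.TemporaryDirectory() as tmp:
        release = Path(tmp) / "release-1.0.tar.gz"
        release.write_bytes(b"pretend this is a release tarball")
        published = sha256_of_file(release)  # what the authors would list
        print(matches_published_hash(release, published))
```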

shameless plug:

  • I've been involved in a framework for building (mainly) scientific software. We often run into trouble when building scientific software, so we figure out how to do it and distribute the build procedure, as Python code, so other people can benefit from our work. More information: http://hpcugent.github.io/easybuild/
@manics

manics commented Feb 12, 2014

  • How does it compare to other similar projects (if any)?
  • Use cases
  • Situations that it's not designed for

These help users make a decision, and benefit the authors, since they reduce negative feedback from people expecting the code to do something it's not designed for.

@Carreau

Carreau commented Feb 12, 2014

I know this is more from the perspective of someone who wants to publish code, but I often miss a good tool that generates all this boilerplate and sets up the distribution of the software.

Being mostly a Python dev, I'm still highly annoyed when I need to publish Python code, even after all this time. I'm much more impressed by languages like Julia that have a built-in mechanism to generate a package in minutes (set up GitHub, unit tests, performance tests, register the name, and set up docs in less than a minute), and by npm (JavaScript).

Then, as a "user" of that code, I shouldn't need to know how to install the package; it should "just work". And this is true as a code author too: if I can't use the language's way of installing things on another computer in my lab in less than 10 minutes, I won't even try to make this work across computers.

This is (for me) more than 80% of what is needed to reuse code:

  • A quick way to install it and test it unmodified on my data.
  • A quick and simple way to publish it that doesn't ask me to read 10 pages on different websites.
@kaythaney

kaythaney commented Feb 12, 2014

Thanks, @ctb! My only concern: for researchers who don't identify as "computational scientists" and who may be doing something slightly more entry level, is that too onerous? I'm thinking of the bare minimum here for someone to be able to get up and running with the code ...

@kaythaney

kaythaney commented Feb 12, 2014

Great stuff, all. Keep it coming. :)

@khinsen

khinsen commented Feb 12, 2014

I think it's important to distinguish short-term goals (What can we do right now? What should we recommend as best practices?) from long-term goals (In which direction should we develop our infrastructures?) In between we have the category of "thin layers on top of our current toolstack that would make life easier". The comments I see are about short-term and mid-term tasks, so I'll tackle the long-term directions.

The number one long-term goal for me is a stable code representation for scientific software. This can't be source code: programming languages are a user interface, which we want ever better and adapted to our domain-specific needs. So we will always have multiple and evolving languages. Machine-level code can't be stable either, because hardware evolves as well. So a stable layer must be somewhere in the middle, at a level that no one cares about too much to reject it. JVM or CLR bytecode are at the right level, but not so well suited for scientific applications.

Why do we need a stable code representation? For two reasons:

  1. Language interoperability. A scientific library should be reusable from a different language, to make investments into high-quality domain-specific code possible.
  2. Long-term reusability. Sustainable software development is great, and should be a goal for big community codes, but it is not reasonable to require that every PhD student's exploratory code be maintained. Still, people should be able to read and run it, even years later.
@kaythaney

kaythaney commented Feb 13, 2014

Great point, @khinsen. For this, we're thinking more of the 5 fields you fill in to go along with your code as you push a release to, say, figshare, so that someone can meaningfully glean (without much pain) what the code does, how to run it, etc. in minimal time. There's of course a longer-term play here, but we're styling this along the lines of some of the standards listed in the blog post, looking at a) whittling it down to the basics for this first instance (they're easier to implement, with higher fill rates) and b) serving those who do not necessarily identify as "computational researchers". Stellar points.

@IanCal

IanCal commented Feb 13, 2014

[edit - this is all with regards to the building, distributing and running part]

Docker would be an obvious candidate for this kind of thing: https://www.docker.io/

A public repository with a Dockerfile allows people to build the software, and you can distribute built images easily and efficiently. It would make the explanation of how to run the system as simple as

docker pull IanCal/experiment_1
docker run -t IanCal/experiment_1 run.sh

All dependencies would be contained in the image, which could run optimised BLAS libraries or Python or whatever they want (as long as it works on Linux).

Converting something that already runs into a docker image is as simple as providing a build script (which should read pretty much like a good README setup section), no changing language or anything like that.

If the original researcher uses this to manage their dependencies & build process, then you also know it will actually run. Reviewers could also easily run the software without fighting build systems.

@Carreau

Carreau commented Feb 13, 2014

@IanCal For me this is reproducibility, not reuse. If you have 2 projects with Docker images A and B, how do you use A in Docker image B?

I agree that it might be helpful, but having something like that available should be an extra, not the main point.

@IanCal

IanCal commented Feb 13, 2014

@Carreau There's nothing stopping you from reusing the code within it. You'd publish your code, a script for building it, and a fully contained runnable system. This means you know all the dependencies and you know it'll build.

Many things may benefit from being structured such that they communicate over the network, in which case you can very simply link your docker container and theirs. This allows you to have isolated dependencies, so you're not stuck because you both require different, conflicting dependencies.

@mbjones

mbjones commented Feb 14, 2014

Here's a diagram of a subset of EML metadata proposed back around 2000 for software for reuse on eco and enviro data:
http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-software.png
Might be of use, could certainly be improved. I think @cboettig has this implemented for the R EML package from ROpenSci that is in progress.

@bobbledavidson

bobbledavidson commented Feb 14, 2014

It may help to consider what sort of endpoint could be achieved by truly re-usable code. To my mind, a good outcome would be to have a standardised system that could build complex software projects for you out of combinations of other software projects that were available in the unified, re-usable format.

There are a few examples of this type of thing that already exist:

Automatic installation systems for *nix systems e.g. MacPorts, HomeBrew, Yum, Apt-Get. There is a standardised 'make' format and a standardised meta-data format for storing your project in a repository that allows these systems to search for and install/build/utilise disparate and non-uniform projects.

Web documents and browsers. Each browser ought to be able to interpret the structure, layout and information present in billions of disparate projects. This is done via separating information from metadata and layout. Tags like 'document type definition' could be very useful for specifying programming language and version for non-web based projects.

Pipelining software such as Galaxy (galaxyproject.org) or other variants, e.g. Taverna. I've been using Galaxy to make my disparate conglomerate of R, Python and Matlab scripts and tools more accessible to the biologists in my lab. Each tool is 'wrapped' with an XML file that describes all possible inputs, outputs and command line instructions. By putting this uniform descriptor on each uniquely designed project, the projects can be lassoed together into one large complex project.
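
The wrapping idea can be sketched independently of any particular system. The descriptor below is a plain Python dictionary with invented field names; it does not follow Galaxy's actual XML schema, and the tool and script names are hypothetical.

```python
import shlex
from string import Template

# Hypothetical descriptor for one wrapped tool. The field names are invented
# for illustration and do not follow Galaxy's actual XML schema.
NORMALISE_TOOL = {
    "name": "normalise",
    "inputs": {"infile": "path to raw counts (CSV)", "method": "normalisation method"},
    "outputs": {"outfile": "path for normalised counts (CSV)"},
    "command": "Rscript normalise.R --in $infile --method $method --out $outfile",
}


def build_command(tool, **values):
    """Fill the tool's command template, refusing missing or unknown parameters."""
    expected = set(tool["inputs"]) | set(tool["outputs"])
    missing = expected - set(values)
    unknown = set(values) - expected
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    # Quote each value so file names with spaces survive the shell.
    quoted = {key: shlex.quote(str(value)) for key, value in values.items()}
    return Template(tool["command"]).substitute(quoted)


if __name__ == "__main__":
    print(build_command(NORMALISE_TOOL, infile="raw counts.csv",
                        method="quantile", outfile="norm.csv"))
```

Because every tool exposes the same three pieces of information (inputs, outputs, command), a pipeline runner only needs to understand the descriptor, not the tool itself.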

I honestly think that something like Galaxy is very close to where you want to go with this, but the next step would be to design a universal standard for this type of meta-data wrapper system so that a) it is optimised/advanced and b) anyone can design a Galaxy-style system for pulling projects together, without necessarily having to be a Galaxy expert first.

Finally, while that system is useful for merging command-line-callable standalone tools, I think it would ultimately be possible to have a system that built tools for you by wrapping libraries and packages in this way, along with 'include and calling methods' rather than 'command line instructions'.

@cboettig

cboettig commented Feb 14, 2014

@kaythaney I think we could use a bit more guidance here as to what problems this question seeks to address. I think you're asking what we should consider the minimal metadata fields for scientific software (e.g. title, version, maintainers, etc). Other things like code documentation, unit testing, functionalized code, good API, knowledge of the algorithms and their limitations, etc can all be necessary for reuse but aren't necessarily things we can treat as software metadata.

Regarding minimal metadata, I think it's instructive to consider what various software packaging systems have decided should be minimal metadata: e.g. R's DESCRIPTION files on CRAN, Perl's META.yml files on CPAN, Ruby's Gem::Specification, Debian's metadata for .deb packages, etc. (p.s. Python people, how do you guys handle this?)

While I don't think any of these provide an exhaustive list of what is needed to "re-use" the software, each has been developed and tested within its own ecosystem to be a reasonably reliable source of information to (a) facilitate installation by handling dependencies, etc., (b) provide enough information for users and developers to search for desired features (like "parse xml") within a repository, and (c) usually provide some indication of how to get documentation and/or support for the software. Notably too, none of these systems try to answer "can I re-use this software" or "can I trust the results" (questions best left to the user community to evaluate), and yet people build upon these systems all the time.

So if Debian etc have effectively answered what metadata they expect from package providers to help promote re-usable software on their platform, I suspect the question is what are the metadata elements (and format?) to promote reusable software in scientific research?

It seems to me that part of the answer is simply to meet whatever the platform / system specific standards are for software distribution (R software should use the package system with valid DESCRIPTION file, provided on CRAN, etc). Given that, we might then ask are there elements missing from any of these existing standards that we would expect should be part of the minimal metadata for a scientific package? (a DOI? Citation(s)? particular licenses? Bug tracker/mailing list/maintainer contact? Description? Keywords?) What fields should be optional, and what required?
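
One way to explore that question, sketched in Python under several assumptions: Python packages record their metadata as email-style headers, so the standard library can parse them. The example package, its values, and the choice of which "scientific extras" to look for are all invented for illustration and are not an agreed standard.

```python
from email.parser import HeaderParser

# Example metadata in the email-header style used by Python packages.
# The package name and all values are invented for illustration.
EXAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: example-periodogram
Version: 0.3.1
Summary: Period detection for unevenly sampled time series
License: MIT
Keywords: time series,astronomy
Project-URL: Bug Tracker, https://example.org/example-periodogram/issues
"""

# Fields a package index typically expects (an assumption for this sketch).
PACKAGING_FIELDS = ["Name", "Version", "Summary", "License"]


def has_url_labelled(meta, *words):
    """True if any Project-URL entry mentions one of the given words."""
    urls = meta.get_all("Project-URL", [])
    return any(word in url.lower() for url in urls for word in words)


def missing_fields(metadata_text):
    """List packaging fields and scientific extras absent from the metadata."""
    meta = HeaderParser().parsestr(metadata_text)
    missing = [field for field in PACKAGING_FIELDS if not meta.get(field)]
    if not has_url_labelled(meta, "doi", "citation"):
        missing.append("citation or DOI")
    if not has_url_labelled(meta, "bug", "issue"):
        missing.append("bug tracker")
    if not meta.get("Keywords"):
        missing.append("keywords")
    return missing


if __name__ == "__main__":
    print(missing_fields(EXAMPLE_METADATA))
```

In this example the ordinary packaging fields are all present and only the citation information is reported as missing, which is the kind of gap the paragraph above asks about.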

p.p.s. A related but different question is whether or not there should be a standard metadata format for scientific software, such as one might collect into a central searchable repository akin to the language-specific ones. Personally, I think success in so doing is hard and we should instead build on the language-specific repositories, but I could be convinced otherwise.

@pbulsink

pbulsink commented Feb 14, 2014

I think there needs to be a philosophical shift amongst scientists who code for their research as well. It's been discussed that the main reasons code isn't shared are that authors feel their code isn't clean enough to share, the code has poor documentation, researchers don't want to support their code when others have trouble with it, and authors don't want their code scrutinized if it could invalidate their results (see Shamir, L. et al. Astronomy and Computing 1 (2013) 54-58 http://dx.doi.org/10.1016/j.ascom.2013.04.001). These are all valid concerns, but if 87% of authors aren't publishing their code, these reasons must be overcome.

I'm not sure what that will look like. Obviously, if code is horrendous, and poor programming results in bad data, it could lead to paper amendments or retractions. But at the same time we have to express to authors that they don't need to maintain their code for years and years with installation and usage support or bugfixes/enhancements over time.

The purpose of code sharing from a scientific point of view would be to allow other people to run the data through the same software and get the same result as a validation of the work that was done. Code should be published so that new data can be compared to the old to support or refute hypotheses. If that means that code is attached as an addition to the supporting information of a paper, and nothing more, then so be it.

While the end goal may be for a centralized repository with one button setup and execution that anyone and everyone could use (as discussed well by @bobbledavidson, @IanCal, and others above) that won't happen without the acceptance first of sharing code in whatever ugly, unsupported fashion it may appear.

@cboettig

cboettig commented Feb 14, 2014

@pbulsink Great point. I think it is worth distinguishing between what researchers release as "software" with the intent for reuse (and usually described in dedicated "software papers"), vs code snippets and scripts that just document what the researchers actually did in a particular publication. I was thinking only of the former case. The very same concerns arose during the first phase of the Mozilla Code Review project (e.g. see http://carlboettiger.info/2013/09/25/mozilla-software-review.html)

I agree entirely with your perspective that the first step is simply getting people to publish the code or scripts they used, regardless of what they look like (as eloquently argued by Nick Barnes in Publish your computer code: it is good enough, and by Ince et al who provide a damning critique of why pseudocode isn't enough in The case for open computer programs).

Perhaps it is silly to try and distinguish between 'software intended for reuse' and 'code as supplemental methods documentation' or some such, I don't know. At the moment I'm in favor of treating them as different concepts and holding them to different standards.

Still, providing simple guidelines on how to increase the reusability of code that is otherwise ugly, un-abstracted, and intended only to show what a particular author did to get a particular result is not necessarily a bad thing, and need not discourage others from sharing in whatever way they see fit. Here, simple practices such as declaring dependencies with versions, dating the script, and providing it as a text file instead of a pdf image (yes, I've reviewed more than 1 paper in which code was included as a pdf file) could go a long way.

@karthik

karthik commented Feb 16, 2014

If we are prioritizing these needs, I'd say that what exactly the code does is the hardest challenge, especially for leveraging plenty of legacy code. This includes the short description and some meaningful documentation (both of which remain really short or as placeholders for a future time after all other work is done. This time never comes).

For example, I find tons of useful code but little to no documentation, no details on which implementation was used, and no examples of where the code was used. If I have to spend a significant amount of time understanding the ins and outs of code, I'm better off starting from scratch. My most recent research coding (just to distinguish from all the other coding I do) suffered from this exact problem. I found that many statisticians had written some implementation of Lomb-Scargle, but never in a form that I could easily reuse.

License: Often there is none. But this is one thing that can be solved with better training for scientists.

What's needed to get the code to run? Is it part of a larger codebase? Links to relevant repositories or tools used to run the code.

This is a challenge but it can go either way. Depending on the software and the packaging system, it can be really easy or super painful. But finding that out doesn't take much of a time investment (easy to see ones that are hard to use and move on).

One more not on the list:
unit tests and some sort of build integration. The absence of either is a red flag that I should proceed with caution.

@ctb

ctb commented Feb 16, 2014

@kaythaney, trying to get back to something concrete -- if these are blockers,

  1. at least minimal automated tests so we can be sure the software runs properly;
  2. example data set with expected output;
  3. how to cite the code;

then you are not ready to write code for yourself to use, much less anyone else. I'm even willing to equivocate on #1 ;).

(Yes, I removed the Software Sustainability Inst stuff -- too vague)

@gvwilson

gvwilson commented Feb 16, 2014

Building something other people can download, install, understand, and use is roughly 3X the effort of building something that works for you on your machine [1]. What's the incentive for the working scientist to put hours (days, weeks, ...) into reusability instead of using their software to produce another publishable result themselves?

  1. http://www.amazon.com/Facts-Fallacies-Software-Engineering-Robert/dp/0321117425/
@karthik

karthik commented Feb 16, 2014

👏 to what @gvwilson said. But is this particular discussion about incentives?

@cboettig

cboettig commented Feb 16, 2014

Well said @karthik. Without detracting from the importance of incentives, it's worth knowing just what needs to be incentivised. "Reuse" is too vague.

Nick Barnes and @pbulsink argue persuasively that it is just the publishing of whatever scripts the authors used. Others on this thread have argued just as persuasively that a lot more effort than that is needed for something to be really reusable. Only in very few specialized cases have the publishers offered clear guidance (e.g. Journal of Open Research Software, which has just a few simple additions beyond @ctb's list).

Without any guidelines, even well meaning folks will share code whose reuse is hampered by things that take minutes, not hours to weeks, to fix. Setting the bar too high will only cause trouble. I believe there would be great value in a community consensus middle-ground.

Consider this script as a typical-to-above-average example of things I see in my field where someone has bothered to share code. Using the imperfect criterion that the publication mentioned is indeed a valid example of both the intended purpose and intended output, I believe this could arguably meet JORS's criteria (quoted below) for software merely by (a) moving it to an established repository and (b) adding a license.

  • Is the software in a suitable repository?
  • Does the software have a suitable open licence?
  • Is the link in the form of a persistent identifier, e.g. a DOI? Can you download the software from this link?
  • If the Code repository section is filled out, does the identifier link to the appropriate place to download the source code? Can you download the source code from this link?
  • Is the software license included in the software in the repository? Is it included in the source code?
  • Is sample input and output data provided with the software?
  • Is the code adequately documented? Can a reader understand how to build/deploy/install/run the software, and identify whether the software is operating as expected?
  • Does the software run on the systems specified?
  • Is it obvious what the support mechanisms for the software are?

I'm not saying these are the right criteria, or that these criteria make this reproducible, that's all good stuff to argue over. I'm only trying to illustrate what a middle ground between the 3X effort and "publish your code, it's good enough" might look like. (Noting too that criteria may differ for different types of code, e.g. software vs snippets like the one above; and potentially also between languages, which face different challenges in certain issues like cross-platform compatibility).

@bobbledavidson

bobbledavidson commented Feb 17, 2014

Regarding @gvwilson and @karthik's comments about incentivising scientific programmers to add in these extra levels of work: I'd like to state that while many programmers would like to make a perfect code snippet for re-use by all and sundry, many will feel that they do not have time. I am often asked to 'just make up some code' to get something working or to try something out, but I'm never encouraged to develop that into a proper tool because the 'effort to reward ratio' is too low, or so I'm told. If re-usable snippets of code could be counted as examples of successful outputs in government research funding proposals etc., then the 'effort to reward' ratio would shift and I'd be encouraged by my boss(es) to add the test-data and meta-data and to make my work public.

With regards to @pbulsink's thoughts on encouraging people to share their work - I think that having a standardised format would actually help convince people to share their scrappy, unsupported work, IF there was a basic level that didn't require test-data, support etc.

A lot of programmers who know they haven't put enough effort into their code do not want to show this as an example of their work to the community. But if there were different levels of release in the same way that there are different types of GNU license, CC license etc. then someone could present their work in the "don't blame me, this is just a quickie" category and not expect any negative comeback.

For example, if there was a standard metafile format for sharing snippets of code (or whole projects) that at the lowest level only required putting details such as:

title
description
author(s)
reusable research code level: "don't blame me!"
language, version, and format (script, class, library, package)
dependencies
how to cite
actively developed or not
supported or not

then authors could happily add this tiny file to their e.g. GitHub repository and make the code public without worrying about being asked for help all the time (by saying 'not supported') or worrying that someone will think they do shoddy work (by stating that it's low level release).
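
A minimal sketch of such a metafile, assuming JSON as the format. The key spellings, the structure of the `language` entry, and the example values are assumptions made for illustration; only the list of fields comes from the proposal above.

```python
import json

# Field names follow the list above; the key spellings are assumptions.
REQUIRED_FIELDS = [
    "title", "description", "authors", "level", "language",
    "dependencies", "how_to_cite", "actively_developed", "supported",
]

# Invented example of a lowest-level, unsupported release.
example = {
    "title": "Peak finder",
    "description": "Quick script to find peaks in a spectrum",
    "authors": ["A. Researcher"],
    "level": "don't blame me!",
    "language": {"name": "Python", "version": "3", "format": "script"},
    "dependencies": [],
    "how_to_cite": "Please cite the repository URL",
    "actively_developed": False,
    "supported": False,
}


def validate(descriptor):
    """Return the list of required fields missing from a descriptor."""
    return [field for field in REQUIRED_FIELDS if field not in descriptor]


def dump(descriptor):
    """Serialise a descriptor to JSON text, refusing incomplete ones."""
    missing = validate(descriptor)
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return json.dumps(descriptor, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(dump(example))
```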

@F1000Research

F1000Research commented Feb 19, 2014

Code discovery and reuse would be aided by implementing standardised software metadata descriptor files. Human readable descriptor files could be developed alongside machine readable descriptor files. Machine readable descriptors would aid discovery of relevant code, human readable descriptors make it possible to evaluate the search results for relevance. If people can find relevant code easily, they are more likely to reuse it.

The contents of the descriptor files would need to be agreed by community consensus, could this be a viable goal for this group?

@gvwilson

gvwilson commented Feb 19, 2014

Code discovery and reuse would be aided by implementing standardised
software metadata descriptor files....

The contents of the descriptor files would need to be agreed by
community consensus, could this be a viable goal for this group?

The last two decades suggest not...

@bobbledavidson

bobbledavidson commented Feb 19, 2014

Admittedly, the last few decades have not shown this type of coordinated uptake of standards but the last two decades have seen programming methodology change first with Java and Object Oriented programming, then with all the advances in web technology, realisation of ubiquitous computing and now 'big data' in all its forms.

I think with the likes of github and other 'social' code sharing initiatives, alongside the general acceptance of open source, open access and open data, even at government level, that we have a better chance now than we did over the last few decades.

Personally I'm inclined to think that all it would take is to present a 'version 1.0' of some schema for how to implement this and then to open it up for people to take it up, reject it, or feed back. The ground is fertile for this type of thing to take off once the seed has been planted.

@IanCal

IanCal commented Feb 20, 2014

The contents of the descriptor files would need to be agreed by community consensus, could this be a viable goal for this group?

I'd advocate "ask for forgiveness, not permission" here. I doubt a consensus will be reached.

Rather than asking people to come to an agreement on what to do, if Mozilla were to pick something reasonable (like the list provided by @bobbledavidson) and a format (yaml, json, anything but xml ;) ) and run with it, we could find out what the actual problems are.

Simple proposal: specify the name and format of the file and start publicly listing all GitHub repos that have the file. People will start adding it because that's the way of getting on the list; then, because it's being used, we'll find out what to change. People will still argue about it, but they'll be arguing while an actual implementation exists rather than having nothing.
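
The listing step could start out very small. The sketch below scans a directory of local checkouts for a descriptor file and lists the ones that have it. The file name `research_code.json` is hypothetical (the thread never settles on one), and no GitHub API is used here.

```python
import tempfile
from pathlib import Path

# Hypothetical file name; the thread never settles on one.
DESCRIPTOR_NAME = "research_code.json"


def repos_with_descriptor(root):
    """Return sorted names of directories directly under root that contain the descriptor."""
    root = Path(root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / DESCRIPTOR_NAME).is_file()
    )


if __name__ == "__main__":
    # Build a throwaway directory of pretend checkouts to list.
    with tempfile.TemporaryDirectory() as tmp:
        for name, has_file in [("alpha", True), ("beta", False), ("gamma", True)]:
            repo = Path(tmp) / name
            repo.mkdir()
            if has_file:
                (repo / DESCRIPTOR_NAME).write_text("{}")
        print(repos_with_descriptor(tmp))
```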

If there's a name for the file and a format, I'll add it to my code today.

@bobbledavidson

bobbledavidson commented Feb 20, 2014

I completely agree with @IanCal. We should just come up with something minimalist but expandable and then present it as e.g. Mozilla Reusable Code Object Standard v1.0 and start making use of it. Then we can have feedback sessions and further congresses to develop the version upgrades and stratifications etc.

@rgaiacs

rgaiacs commented Feb 20, 2014

+1 for something is better than nothing.

@npch

npch commented Feb 20, 2014

Sorry for jumping into the conversation late; there have been some very good comments so far.

Here's my pragmatic view (personal opinion, may not be shared by others at the SSI!)

ABSOLUTE minimum:

  • code has a license that allows reuse (this might include non-Open Source licences)
  • code has been published somewhere such that people can find it (this could be as a tarball on a website)
  • code has some indication of what it is supposed to do (e.g. "This code finds variants in a XXX file")
  • code has some way of contacting the original author (in lieu of good documentation)

USEFUL minimum:

  • all of the above, plus
  • code has enough documentation to understand how to run it without contacting the original author. This would normally include sample input and output files.
  • code has a licence that allows modification as well as reuse
  • documentation says what combination of dependencies the author believes it runs on
  • code gives some way of citing/attributing the authors

PRAGMATIC TO STRIVE FOR minimum:

  • all of the above, plus
  • code should be in a code repository and commit messages should be minimally useful
  • code should provide some form of automatic testing framework but not necessarily have 100% test coverage
  • code should provide at least two "system tests" including input, output and parameter data which enable a user to run the software through a complete pipeline
  • code documentation should be about design and scientific purpose, not the mechanics of the code. Each major package / subroutine should have some documentation
  • code should be associated with some mailing list, issue tracker, etc. for raising and resolving issues
  • there is a DOI attached to the code / a paper about the code

IDEALISTIC minimum:

  • well, let's not get into that right now.
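
Since each tier builds on the one before, the tiers above can be read as a cumulative checklist. The sketch below encodes them that way; the short keys are invented paraphrases of the bullet points, not an established vocabulary.

```python
# Checklist items paraphrased from the tiers above; the short keys are
# invented for this sketch.
TIERS = [
    ("ABSOLUTE", {"reuse_license", "published", "purpose_stated", "author_contact"}),
    ("USEFUL", {"run_docs_with_samples", "modification_license",
                "dependencies_stated", "citation_info"}),
    ("PRAGMATIC", {"in_repository", "automated_tests", "two_system_tests",
                   "design_docs", "issue_channel", "doi"}),
]


def highest_tier(satisfied):
    """Return the highest tier whose items, and all lower tiers' items, are satisfied."""
    satisfied = set(satisfied)
    reached = None
    for name, items in TIERS:
        if not items <= satisfied:
            break  # a gap at this tier blocks every tier above it
        reached = name
    return reached


if __name__ == "__main__":
    print(highest_tier({"reuse_license", "published",
                        "purpose_stated", "author_contact", "doi"}))
```

Note that having a DOI alone does not lift a project above the first tier in this reading, because the USEFUL items are still missing.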

There's already a lot of good work on this. For the idealistic end, the NASA Reuse Readiness Levels are a good read. You might be interested in the description criteria we've used for the Journal of Open Research Software as well: http://openresearchsoftware.metajnl.com/about/editorialPolicies#peerReviewProcess - these are somewhere between my description of USEFUL and PRAGMATIC above.

I do like the @bobbledavidson list - my suggestion is that people should use the CRAPL (http://matt.might.net/articles/crapl/) for the "don't blame me" license :-) One thing though, for the categories of "actively developed or not" - my heart says this should be identified through the repository stats rather than the metadata file, though my head says that given the findings of http://firstmonday.org/ojs/index.php/fm/article/view/1477/1392 maybe it is the original author who decides this category.

@kaythaney

Member Author

kaythaney commented Feb 21, 2014

+1 to @npch 's comments. i don't disagree with your point about testing, @ctb, but for researchers who are doing a bit of data analysis and aren't accustomed to testing, we still want to nudge them in the right direction, even if it's not perfect practice out the gate. (and yes, agree that without testing, it's not always easy or advised to reuse, but well, we're not going to change everything overnight ... ;) )

@brainstorm

brainstorm commented Feb 25, 2014

@ctb

2. example data set with expected output;

If the dataset can be generated synthetically with a fixed seed, one can avoid shipping potentially large files (with the accompanying archival/infrastructure issues).

For example, see how we use simNGS to generate biggish fastq files:

https://github.com/SciLifeLab/facs/blob/master/tests/test_simngs.py#L39
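
As a rough illustration of the fixed-seed idea, here is a minimal sketch that uses only Python's standard library rather than simNGS. The function names, the seed value and the read sizes are assumptions made for the example; they are not taken from the linked FACS test.

```python
import hashlib
import random


def make_synthetic_reads(n_reads, read_length, seed=42):
    """Generate pseudo-random DNA reads that are reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, global random state untouched
    return [
        "".join(rng.choice("ACGT") for _ in range(read_length))
        for _ in range(n_reads)
    ]


def checksum(reads):
    """Fingerprint a data set so only the hash needs to be stored with the tests."""
    return hashlib.sha256("\n".join(reads).encode("ascii")).hexdigest()


# Regenerating with the same seed yields the same data, so the test suite
# can ship the seed (and optionally a checksum) instead of the data files.
reads_a = make_synthetic_reads(1000, 50)
reads_b = make_synthetic_reads(1000, 50)
print(checksum(reads_a) == checksum(reads_b))
```

One caveat: Python only documents the `random()` method as producing the same sequence across versions for a given seed, so a stored checksum is safest when the interpreter version is pinned as well.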

@codersquid

codersquid commented Apr 3, 2014

I'd like to see tooling that discovers dependencies and environment information to capture easy guesses to autopopulate metadata fields. Author experience is going to vary wildly and they may not know the answers to what you want to include, nor have the time to investigate the answers.

Sumatra is one project I've seen that attempts to solve this: https://pythonhosted.org/Sumatra/introduction.html

as much as possible should be recorded automatically. If it is left to the researcher to record critical details there is a risk that some details will be missed or left out, particularly under pressure of deadlines.

In addition to what has been discussed so far, Sumatra gathers information about the platform architecture the code is run on, and for supported languages (R, Python, MATLAB) it gathers dependencies.
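
To give a feel for how cheaply some of this can be captured automatically, here is a minimal sketch that records basic platform details with Python's standard `platform` and `sys` modules. It is not Sumatra's API or implementation; `capture_environment` and the choice of fields are assumptions made for the example.

```python
import json
import platform
import sys


def capture_environment():
    """Collect basic facts about the machine and interpreter running the code."""
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "machine": platform.machine(),   # e.g. CPU architecture
        "system": platform.system(),     # e.g. operating system name
        "release": platform.release(),
        "executable": sys.executable,
    }


record = capture_environment()
# A record like this could be written next to each set of results.
print(json.dumps(record, indent=2, sort_keys=True))
```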

@codersquid

codersquid commented Apr 3, 2014

I really want to emphasize tooling. Authors are not going to be able to spend time chasing down all of this useful information. It is a hard problem to get compliance for formal processes in the software industry from people who do this as a full time job.

Thus those of us who create tools should bake affordances into the tools so that all of the information is captured as a side effect of use.

Everyone should budget for UX development and testing in grant proposals. Reproducibility can be a side effect of usable design.

@khinsen

khinsen commented Apr 10, 2014

@codersquid I agree fully with your call for better tooling support. But let's not forget that tools such as Sumatra don't perform miracles: they obtain metadata from the software packages themselves, and thus rely on conventions and techniques that make it possible to discover them. So we need not only tools but also conventions. Sumatra works because the Python ecosystem has informal de-facto conventions that most packages follow, such as the module.__version__ attribute.
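
The dependence on conventions can be sketched in a few lines. The example below looks up the `__version__` attribute on a list of modules; it is an illustration only, not how Sumatra itself works. The helper `discover_versions` and the `demo_*` modules are hypothetical, created in memory so the example runs anywhere.

```python
import importlib
import sys
import types


def discover_versions(module_names):
    """Map each importable module to its __version__, or None if it has none."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed: nothing to record
        # This only works when the package follows the informal convention.
        versions[name] = getattr(module, "__version__", None)
    return versions


# Register two throwaway in-memory modules (hypothetical names).
versioned = types.ModuleType("demo_versioned")
versioned.__version__ = "1.2.3"
sys.modules["demo_versioned"] = versioned
sys.modules["demo_unversioned"] = types.ModuleType("demo_unversioned")

found = discover_versions(["demo_versioned", "demo_unversioned", "demo_not_installed"])
print(found)
```

The module that ignores the convention comes back as `None`: the tool can see that the dependency exists but cannot say which version was used.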

@cboettig

cboettig commented May 1, 2014

Just two further thoughts on this thread:

NSF's recent Dear Colleague letter includes ideas that are very much in this spirit of establishing metadata for code reuse, particularly along the lines of a study that might empirically demonstrate what elements contribute most to effective reuse; see http://www.nsf.gov/pubs/2014/nsf14059/nsf14059.jsp. The letter encourages applicants to apply for exploratory EAGER grants on this topic via the SciSIP or SI2 programs. I hope that Mozilla Science Lab and/or others on this thread might consider such an angle.

In a different vein, I don't believe it's been mentioned on this thread yet, so I thought I might bring up the Science Code Manifesto by @drj11 and others at the Climate Code Foundation as a related perspective on this question of minimal metadata. I would describe it as somewhere between @npch 's useful minimum and absolute minimum (archive in repository, state license, citation), though it is more of a cultural guideline than a technical one.

@pdurbin

pdurbin commented May 3, 2014

A couple thoughts. Folks might be interested in this lecture by @victoriastodden: Toward Reproducible Computational Science: Reliability, Re-Use, and Readability - http://www.ischool.berkeley.edu/newsandevents/events/20140409stodden (slides at http://www.stanford.edu/~vcs/talks/BerkeleyISchool-April92014-STODDEN.pdf )

There is also a discussion going on at http://forum.mozillascience.org/t/what-information-is-needed-to-reuse-code/20

@hofmockel

hofmockel commented May 13, 2014

The test for sufficient metadata would be:
That one can freely and immediately build a system (maybe including specific hardware) that can transform the raw data into figure #n in a peer-review published article. As well as produce an alternative approach and figure that suggests an alternative interpretation from the same raw data.

What is needed to reuse data? - there might be easier questions that should be answered first.
What is needed to reproduce the figure in x article from the raw data? - If we don't answer this, we can't be sure what was done to produce the figure the whole argument is built on.

Finding the correct metadata to make data usable is a great discussion, but the published interpretations, the figures, and the applications (transformative algorithms) in Python, R, or whatever must align. I believe this could be a faster approach to answering the question of what metadata I need to record.

I propose that the coordination of publications, data and the applications that transform the raw data into figures or other outputs must be repeatable as code. Attacking individual problems with this holistic approach will produce a publishing platform that respects text, code, and data, as well as all their implicit and semantic relationships. Our best platforms consider two of these at most. Let's get serious and recognize that these languages and their relationships cannot be thought of separately!
