How should software be cited? #12

Open
hubgit opened this Issue Mar 18, 2014 · 18 comments

@hubgit
Member

hubgit commented Mar 18, 2014

Here's an example of a software citation. Does it include all the appropriate information, and can it be improved? When a snapshot of the code has been archived somewhere, how should that be included in the citation?

Goddard TD, Kneller DG. 2007. SPARKY 3 (v3.114, Windows). San Francisco: University of California. Available from http://www.cgl.ucsf.edu/home/sparky/

The citation in JATS XML:

<element-citation publication-type="software">
  <person-group person-group-type="author">
    <name><surname>Goddard</surname><given-names>TD</given-names></name>
    <name><surname>Kneller</surname><given-names>DG</given-names></name>
  </person-group>
  <source>SPARKY 3</source>
  <edition designator="3.114">v3.114, Windows</edition>
  <year iso-8601-date="2007">2007</year>
  <publisher-loc>San Francisco</publisher-loc>
  <publisher-name>University of California</publisher-name>
  <comment>Available from <uri>http://www.cgl.ucsf.edu/home/sparky/</uri></comment>
</element-citation>
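As a rough check that the XML above carries everything in the plain-text reference, here is a minimal Python sketch that renders one from the other. The function name `render_software_citation` and the punctuation rules are assumptions made for illustration; they are not part of JATS or of any existing tool.

```python
import xml.etree.ElementTree as ET

# The element-citation from the comment above, copied verbatim.
JATS = """<element-citation publication-type="software">
  <person-group person-group-type="author">
    <name><surname>Goddard</surname><given-names>TD</given-names></name>
    <name><surname>Kneller</surname><given-names>DG</given-names></name>
  </person-group>
  <source>SPARKY 3</source>
  <edition designator="3.114">v3.114, Windows</edition>
  <year iso-8601-date="2007">2007</year>
  <publisher-loc>San Francisco</publisher-loc>
  <publisher-name>University of California</publisher-name>
  <comment>Available from <uri>http://www.cgl.ucsf.edu/home/sparky/</uri></comment>
</element-citation>"""


def render_software_citation(xml_text: str) -> str:
    """Render a software <element-citation> as a plain-text reference.

    Hypothetical helper: the output layout simply mirrors the example
    reference quoted in this thread.
    """
    cite = ET.fromstring(xml_text)
    # "Surname Initials" for each author, joined by commas.
    authors = ", ".join(
        f"{name.findtext('surname')} {name.findtext('given-names')}"
        for name in cite.iterfind("person-group/name")
    )
    return (
        f"{authors}. {cite.findtext('year')}. "
        f"{cite.findtext('source')} ({cite.findtext('edition')}). "
        f"{cite.findtext('publisher-loc')}: {cite.findtext('publisher-name')}. "
        f"Available from {cite.findtext('comment/uri')}"
    )


print(render_software_citation(JATS))
```

Every part of the rendered reference comes from a distinct element, except that the version and the platform share a single `<edition>` string.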

@hubgit hubgit added the question label Mar 18, 2014

@ghost

ghost commented Mar 18, 2014

We currently capture this on figshare based on the data citation principles as follows. We will follow the advice of the community here:

Sparks, Adam (2014): Global-Late-Blight-Modelling. figshare.
http://dx.doi.org/10.6084/m9.figshare.963593

@hubgit

Member

hubgit commented Mar 18, 2014

The DataCite metadata page for that code/dataset has a link for the XML that describes it:

<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-2.2"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://datacite.org/schema/kernel-2.2
    http://schema.datacite.org/meta/kernel-2.2/metadata.xsd">
  <identifier identifierType="DOI">10.6084/M9.FIGSHARE.963593</identifier>
  <creators>
    <creator>
      <creatorName>Adam Sparks</creatorName>
    </creator>
  </creators>
  <titles>
    <title>Global-Late-Blight-Modelling</title>
  </titles>
  <publisher>Figshare</publisher>
  <publicationYear>2014</publicationYear>
</resource>
@ldodds

ldodds commented Mar 18, 2014

Some comments:

  • A piece of software might be created by dozens, if not thousands, of contributors. Do all of them get cited? This seems like the classic "attribution stacking" problem. Perhaps it would be better to have variations, with or without authors and/or with a pointer to contributors.
  • Publisher name/location is similarly difficult. If I collaborate on a project with another developer and it's published on GitHub, then who is the publisher and which location do we use? Is it GitHub? Is it the repo owner (which might be an organisation)?
  • For software I'd be interested in getting a snapshot (for archival purposes), but I'm also interested in the main, live repo (if there is one), as that is likely to have more context, e.g. for identifying/reporting bugs, future collaboration, etc.
  • For edition, what if that's just a GitHub commit version?
@seinecle

seinecle commented Mar 18, 2014

It seems unclear whether we are discussing software (= a relatively stable, executable package) or a code base (a rapidly evolving, versioned object). While in many cases referring to the software is enough, in other cases there is actually no software, just an evolving code base.

@ldodds

ldodds commented Mar 18, 2014

@seinecle that's a good point, maybe it would be useful to identify what the citation is intended to support? E.g:

  • accessing a pre-built piece of software, e.g. in order to re-run some analysis, a simulation, or open some files
  • accessing a code base, e.g. in order to inspect and possibly build/run the application/code/analysis
  • finding a code base in order to perform new analyses with the same software

The final one is perhaps not a typical goal for scientific citations, but I think for both data and software citations, the "live"/current version ought to be discoverable from the citation.

@pbulsink

pbulsink commented Mar 18, 2014

With the fidgit project progressing in parallel to this, would it make sense to include a meta pointer to the DOI assigned by figshare? This would eliminate having to put a link to the code in a tag manually.

@ScottBGI

ScottBGI commented Mar 19, 2014

One comment from looking at the metadata, especially in light of the comments in a previous thread from @bobbledavidson, @npch & others about the minimal information we need to capture: is the other project metadata being captured anywhere? The intermediary page currently asks you to input language, platform, maintainer, description, and (probably most importantly) license, so was that input into the above example?

@ldodds RE citation etiquette, I don't know how many authors it can handle before the system breaks down (we've added >90 to some DataCite DOIs), but if you follow the practice of journals and the style of the Human Genome Project, you could just list a massive group of contributors as a single author: [x consortia] or [x community of developers]. DataCite metadata can describe research objects at different levels of granularity if you wanted to credit separate units of code. Its RelatedIdentifier field can precisely describe relationships to other research objects through values like IsSupplementTo/IsContinuedBy/IsNewVersionOf/IsDocumentedBy/IsCompiledBy/etc.
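To make the RelatedIdentifier idea concrete, here is a minimal Python sketch that builds such a block. The DOIs are hypothetical placeholders, the helper name `related_identifiers` is invented for illustration, and the DataCite namespace is left out for brevity, so treat the output as an outline rather than a schema-valid record.

```python
import xml.etree.ElementTree as ET


def related_identifiers(relations):
    """Build a <relatedIdentifiers> element from (relationType, DOI) pairs."""
    root = ET.Element("relatedIdentifiers")
    for relation_type, doi in relations:
        item = ET.SubElement(
            root,
            "relatedIdentifier",
            relatedIdentifierType="DOI",
            relationType=relation_type,
        )
        item.text = doi
    return root


# Hypothetical DOIs: a release pointing at its predecessor and its manual.
element = related_identifiers([
    ("IsNewVersionOf", "10.5555/example.v1"),
    ("IsDocumentedBy", "10.5555/example.manual"),
])
print(ET.tostring(element, encoding="unicode"))
```

Each release could carry its own DOI and point back at the previous one, which keeps individual versions citable while still linking them together.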

@hubgit

Member

hubgit commented Mar 19, 2014

@ldodds:

A piece of software might be created by dozens, if not thousands of contributors. Do all of them get cited? This seems like the classic "attribution stacking" problem. Perhaps it might be better to have variations, with or without authors and/or with a pointer to contributors

There's also the difference between the maintainer(s) of a project (who is/are currently responsible for it), and the contributors (those who have committed code to the project, or made other contributions). I imagine it would be the maintainers (and possibly also previous maintainers) that would be cited, though there are bound to be exceptions.

Publisher name/location is similarly difficult. If I collaborate on a project with another developer and it's published on GitHub, then who is the publisher and which location do we use? Is it GitHub? Is it the repo owner (which might be an organisation)?

I think this can be optional, but might be useful when software is produced solely by a specific university or software company. A better analogy to book publishers, though, might be the code hosting (e.g. "GitHub") or archiving (e.g. "figshare") service. The geographic location is probably irrelevant, unless it's necessary for distinguishing between multiple entities with the same name.

For software I'd be interested in getting a snapshot (for archival purposes) but I'm also interested in the main, live repo (if there is one) as that is likely to have more context, e.g. for identifying/reporting bugs, future collaboration, etc.

Yes, I think being able to cite the snapshot as well as providing details of the current codebase (even if it's just the equivalent of an "Available from" or "Accessed at" URL) needs to be in there.

For edition, what if that's just a github commit version?

I guess the version would be the hash, in that case, and it would be nice to add a URL for it…

@hubgit

Member

hubgit commented Mar 19, 2014

@pbulsink Yes, the DOI should definitely be in there, though there is perhaps ambiguity between whether it's an identifier for the snapshot, an identifier for the specific release, or an identifier for the software as a whole (which could be assigned a separate DOI, linked to DOIs for specific releases using versioning metadata).

@hubgit

Member

hubgit commented Mar 19, 2014

@ScottBGI Some of the metadata that could be attached to the project is most useful for discovery, rather than citation (which just needs to identify the software specifically enough that a reader could find it). "platform" should be in the citation, I think, and possibly "maintainer" (see the comment above) but the code language, description and license probably don't need to be.

@khinsen

khinsen commented Mar 20, 2014

It looks like we are discussing several different problems in parallel:

  1. How specifically should code be cited, i.e. what are the differences with respect to citing papers or datasets?
  2. How should work in progress be cited?
  3. How should work by a large and/or indefinite group of authors be cited?

As for 1), I see two distinct cases. Software cited for the scientific record (we used package X) should exist in an archive and be cited with a DOI. The reference should be to a precise version. Software cited as a recommendation for use (we implemented our algorithm in package X, ...) should be on a development site such as GitHub, and referenced there.

Point 2) is already a standard situation in citing Web resources such as Wikipedia. The habit is to state the date at which the resource was consulted.

Point 3) doesn't have a good solution in the academic tradition. We are very attached to citing specific people, maybe companies, but not communities.

@npch

npch commented Mar 20, 2014

@mikej888 did some work for the Software Sustainability Institute looking at citing software in traditional outputs: http://software.ac.uk/so-exactly-what-software-did-you-use

This includes a summary of what various journals ask for, as well as some software platforms like R.

@seinecle's and @khinsen's comments are very insightful - the "citation" metadata associated with a piece of software conflates a number of issues.

Taking @khinsen's points in order:

  1. There are many similarities to datasets, which also have a sense of "versions"; however, datasets typically have clearer boundaries / collection hierarchy and authorship (though they may also suffer from many authors). There's also not always a direct analogy for the publisher ("self published"?). However, I do think that if we disregard the question of "who was an author on this code and what contribution did they make", then code can be cited using the following metadata:

  • Author list
  • Code naming identifier (some human-readable name for "the code")
  • Code version identifier (e.g. a tag or a release version, ideally uniquely identifying a set of files which collectively form "the code")
  • Code location identifier (e.g. a DOI or URL that can be dereferenced to get to the code)
  • [optionally] A release date

Now this means that for 2) work in progress, citation is no different - the code version identifier and code location identifier will just point to a work in progress version. However by advertising that version through a citation, you're effectively identifying a new version of the code. Given that most repositories (like GitHub) enable some sort of hash identifier for each commit, you could simply use that as an (automatically generated) identifier.
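The metadata fields listed above, with a commit hash standing in as the version identifier, might be sketched as a small Python record. The class name, field names, rendering format, and the example project, hash, and URL are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SoftwareCitation:
    authors: List[str]                  # author list
    name: str                           # human-readable name for "the code"
    version: str                        # tag, release version, or commit hash
    location: str                       # DOI or URL that can be dereferenced
    release_date: Optional[str] = None  # optional release date

    def render(self) -> str:
        """Return a one-line reference; the date is omitted when unknown."""
        date = f" ({self.release_date})" if self.release_date else ""
        authors = ", ".join(self.authors)
        return f"{authors}{date}. {self.name}, version {self.version}. {self.location}"


# Hypothetical work-in-progress citation: the commit hash is the version.
wip = SoftwareCitation(
    authors=["Doe J", "Roe R"],
    name="example-tool",
    version="3f2a9c1",
    location="https://example.org/example-tool/commit/3f2a9c1",
)
print(wip.render())
```

A tagged release and a work-in-progress commit share the same shape here; only the version and location values differ.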

  3. Is always going to be an issue. Should an author drop off if all their contributions have been removed from the code base? I know that some projects insist on only naming the project, and then maintain the author list on the project website, but my issue with that approach is that it's not easily machine understandable.

I don't think that platform, language or potentially even license information should be part of the citation metadata (though I might be willing to budge on license). When we were undertaking the SoftwareHub project for Jisc, looking at creating "showcase catalogues" of software funded by Jisc, we quickly realised that things like platform or programming language were not useful either for citation or for first-level discovery. They are useful for categorisation and filtering, but they aren't as useful as they first appear.

@hubgit

Member

hubgit commented Apr 7, 2014

Another example:

The PERMANOVA+ add-on for PRIMER is often referenced as a book citation:

Anderson MJ, Gorley RN, Clarke KR. 2008. PERMANOVA+ for PRIMER: guide to software and statistical methods. PRIMER-E Ltd.

The makers of the software don't provide citation examples for that specific add-on, but do provide citation examples for the main software package, by citing the user manual:

Clarke, KR, Gorley, RN, 2006. PRIMER v6: User Manual/Tutorial. PRIMER-E, Plymouth.

The citation most commonly used for the PERMANOVA add-on doesn't include information about which version of the software was used, or on which platform - this is usually described in the Methods section instead.

@hubgit

Member

hubgit commented Apr 9, 2014

Users of R are also asked to cite the manual:

@Manual{,
  title        = {R: A Language and Environment for Statistical
                  Computing},
  author       = {{R Core Team}},
  organization = {R Foundation for Statistical Computing},
  address      = {Vienna, Austria},
  year         = 2013,
  url          = {http://www.R-project.org}
}
@hubgit

Member

hubgit commented Mar 31, 2016

JATS 1.1 provides <version> and <data-title> elements:

<element-citation publication-type="software">
  <person-group person-group-type="author">
    <name><surname>Goddard</surname><given-names>TD</given-names></name>
    <name><surname>Kneller</surname><given-names>DG</given-names></name>
  </person-group>
  <data-title>SPARKY 3</data-title><!-- could be "software-title"? -->
  <version designator="3.114">3.114, Windows</version><!-- needs a "platform" element or attribute? -->
  <!-- use a "source" element for the host, e.g. "GitHub"? -->
  <year iso-8601-date="2007">2007</year>
  <publisher-loc>San Francisco</publisher-loc>
  <publisher-name>University of California</publisher-name>
  <uri>http://www.cgl.ucsf.edu/home/sparky/</uri>
</element-citation>
@Melissa37

Melissa37 commented Jun 22, 2016

Does this still stand as the best source on this? I will recommend your suggestions to JATS4R.

@hubgit

Member

hubgit commented Jun 22, 2016

@Melissa37 This probably needs updating to take into account the Force11 Software Citation Principles.
