What are the degrees of reproducible data analysis? #11

jeromyanglim opened this Issue Jul 16, 2012 · 1 comment



I wanted to conceptualise reproducible data analysis in a broader context.

  • What are the different ways that reproducible data analysis can be achieved?
  • How do such degrees relate to achieving the aims of reproducible analysis?
  • What are the different aspects of reproducible data analysis?

I've previously provided a summary of thoughts on reproducible data analysis and terminology.

In terms of aspects of reproducibility, I wrote about different broad aims:

  • Reproducibility
    • Can the analyses easily be re-run to transform raw data into final report with the same results?
  • Correctness
    • Is the data analysis consistent with the intentions of the researcher?
    • Are the intentions of the researcher correct?
  • Openness
    • Transparency, accountability
      • Can others check and verify the accuracy of analyses performed?
    • Extensibility, modifiability
      • Can others modify, extend, reuse, and mash up the data, analyses, or both to create new research works?

Degrees of reproducibility

Language of reproducibility:

Are the methods used unambiguously communicated? This often involves the use of code, but mathematics is another relatively unambiguous language relevant to data analysis. Normal language can also be used, but for certain purposes it is often ambiguous (e.g., exactly which cluster analysis algorithm was used; exactly what was done with missing data; etc.).

A basic principle of quality systems is that processes are in some way documented.

Ease of reproducibility

The one-click build really appeals to me.

The quicker it is to reproduce a set of analyses, the easier it is to verify that the final result is consistent with the procedure.
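To make the appeal concrete, a one-click build can be as small as a Makefile that rebuilds the report whenever the raw data or the source changes. This is a minimal sketch, assuming a hypothetical knitr source file `report.Rnw` and data file `data/raw.csv`:

```makefile
# Typing `make` regenerates the entire report from the raw data.
report.pdf: report.Rnw data/raw.csv
	Rscript -e "knitr::knit2pdf('report.Rnw')"

clean:
	rm -f report.pdf report.tex
```

Because the PDF depends on both the source and the data, any change to either triggers a complete, consistent rebuild.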

Quality assurance

Above, under correctness, I distinguished between

  • verifying that the intended procedure was applied, and
  • verifying that the intended procedure was appropriate.

A first level of reproducible data analysis is to ensure that the intended procedure was applied.

Inferring analysis intention

Quality assurance is partially about ensuring that analyses were performed as intended.
But what are the different types of intended analyses?

  • Code: At a basic level, if the code is viewed as the intention, then as long as you run the code correctly, there will be agreement between intention and results. Bugs occupy a middle ground, where your script is correct but the program is not performing as specified.
  • Written description and code: When writing a report, analyses may be described in the methods or results section of a scientific report, or in the results or technical specification of a consulting report. In such cases, the written description of the analyses performed can be seen as the primary source for inferring analysis intentions, and the code is used to clarify any details not specified in the written description. Thus, code can clarify the written description, and it would be an error for the code to be inconsistent with it. There are also degrees to which the written description might be considered misleading or problematic, e.g., through the omission of key information.
  • Broader sense of intention: There is also a much broader sense of intention where the underlying aims of the analysis are considered. This adds a further layer where errors or questionable decisions can be raised regarding both the reported analyses and the code that implements them. This also starts to move from consistency between intention and analyses to whether the analyses were appropriate.

Verifying quality versus achieving quality

One click builds facilitate both verification of quality and achieving quality.

  • Testing can be explicitly incorporated.
  • The entire analyses can be re-run step-by-step.
  • The source of any strange results can be inspected, diagnosed, and then rectified.
  • They achieve at least a very literal form of consistency between code and output.
  • They remove the risk of inconsistencies entering analyses (data transformations cascade into output).
  • They ensure that there is a clear statement of intention rather than a fuzzy set of go-with-the-flow analyses and data transformations.
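On the first point, explicit testing can be as simple as assertions embedded in the analysis code, so that the one-click build fails loudly when an assumption about the data is violated. A minimal sketch in R, with hypothetical data and variable names:

```r
# Sanity checks embedded in the analysis script; stopifnot() halts
# the build if any assumption about the data is violated.
stopifnot(
  nrow(dat) == 120,                        # expected N after exclusions
  !anyNA(dat$age),                         # no missing ages after imputation
  all(dat$score >= 0 & dat$score <= 100)   # scores within the valid range
)
```

A manual workflow offers no equivalent guarantee: a violated assumption simply produces wrong output.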

However, adopting a one-click build approach (such as that facilitated by R and knitr) does not on its own ensure a high-quality product. And with effort it is possible to have a high-quality product using more manual approaches.

1. Worst case scenario

Data is loaded into GUI statistical software. The data is irreversibly transformed and manipulated in various ways. Graphs, tables, and results at the end of this process are copied and pasted into a document for reporting purposes. Further analyses and transformations of the data are performed, and then also incorporated. The analyst can't remember exactly what they did. The analyst performs a minimum of checks and balances to see whether what they are doing even makes sense. If a fatal transformation was performed, the analyst is unlikely to know.

2. SPSS + Syntax + Copying and Pasting

This involves saving the SPSS syntax used to perform analyses. Results are then typically copied and pasted into programs like Word, or perhaps into Excel for formatting and then into Word.

The syntax does permit a degree of reproducibility, although within this approach there is a wide range in the quality of syntax organisation and commentary. In better cases, the syntax will document all transformations of the data (e.g., removing cases, creating new variables, removal of outliers, any imputing of missing data, etc.). In worse cases, it is disorganised and incomplete.

I have observed this approach being applied a lot in psychology.

However, some analyses in SPSS are too complex to perform easily with syntax: e.g., you need to post-process some output, or you need to use some other software to compute a value. Some processes require moving between programs and applying multiple manual steps.

3. Reproducible report incorporated into static document

This process involves creating a reproducible report and then manually incorporating the results into a static document. This second step is often performed by another person.

For example, I once did an analysis for another academic where I analysed the data and produced a report using Sweave. I sent this to the academic and he requested additional analyses. Once this iterative process was complete, he then incorporated the analyses into the write-up of a report. The initial phase was reproducible, but there were a few manual steps in incorporating the graphs, tables, and textual results into the final document.

The process is often necessary where a collaborator is driving the project and they wish to use a document preparation system such as Microsoft Word. It can also be necessary where the publishing system requires an extensive set of stylistic elements that are difficult to produce with plain text formats.

It works reasonably well when the move from analysis to write-up is a sequential process. However, for most projects I find that analysis and write-up iterate extensively. Even once an article is submitted to a journal, reviewers may come back and request changes. Of course, you can try to keep track of whether any changes require previous analyses to be updated, but this can be error-prone.

This approach also works reasonably well where analyses play a fairly small part in the overall document.

4. Reproducible Document

In this situation the final product is produced using code. This includes inputting data, data transformations, analysis code, and code for incorporating the figures, tables, and text into the document.

Even in this case, there are degrees and limits to reproducibility. To take one limit, documents that report analyses (e.g., theses, journal articles, etc.) are highly interconnected documents. Much of what is written is either directly or indirectly dependent on the results of the analyses. In a direct sense, there are sentences that summarise the results in a table or a graph, or there are sentences that summarise the significance or direction of an effect. It would generally be too much work to have such text conditionally displayed based on the results of the study.

However, once you break away from reproducibility, there is always the risk that results will change as a result of a tweaking of preliminary analyses, and that the conditional text will need to be updated.
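knitr at least automates the direct dependencies: simple numeric statements can be written as inline code so that they update whenever the analyses change. A sketch of such a sentence in an R Markdown source, with hypothetical object names:

```markdown
The correlation between anxiety and performance was
`r round(cor_ap, 2)` (N = `r nrow(dat)`).
```

Interpretive sentences (e.g., "the effect was in the predicted direction") still need to be re-checked by hand whenever preliminary analyses are tweaked.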

There are several implications of this. It is good to

  • Be aware of the analysis dependencies in a document.
  • Simplify dependencies through the structure and placement of code. For example, code that transforms the data (e.g., creates new variables, removes problematic cases, merges data files) is often best placed in a single chunk at the beginning.
  • Be aware when a change to the analyses is likely to have a large cascading effect on subsequent analyses.
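For example, the chunk layout of a hypothetical knitr document might look like the following, with all data transformations isolated in one early chunk that every later chunk depends on (file and variable names are illustrative):

```r
## ---- prepare-data ----
dat <- read.csv("data/raw.csv")
dat <- subset(dat, complete == 1)      # remove problematic cases
dat$total <- dat$item1 + dat$item2     # create new variables

## ---- descriptives ----
summary(dat$total)

## ---- regression ----
summary(lm(total ~ age, data = dat))
```

With this structure, a change to the exclusion rule touches one visible place, and re-running the document propagates it everywhere.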

Summary of degrees of reproducibility

  • Reproducible data analysis is a matter of degrees.
  • The sharing of fully reproducible analyses has so many benefits, yet it is so far from mainstream in the psychological sciences that I struggle to point to a single document that implements it (although see this list of examples across disciplines). That said, I'm sure that for every analysis that is shared, many more are done but not shared.