Design rationale

klmr edited this page Dec 10, 2014 · 3 revisions

This document aims to answer the following questions regarding the design of the modules package. For questions regarding the implementation, refer to the wiki page on Specification.

Why? Why not use/write packages?

While using R for exploratory data analysis as well as for writing more robust analysis code, I have found R’s mechanism of clumsily source-ing lots of files to be a big hindrance. In fact, just adding a few helper functions to make using source less painful naturally evolved into an incomplete, ad hoc implementation of modules.

The standard answer to this problem is “write a package”. But in the humble opinion of this person, R packages fall short in several regards, which this package (the irony is not lost on me) strives to rectify.

i) Effort

Writing packages incurs a non-trivial overhead. Packages need to live in their own folder hierarchy (and, importantly, cannot be nested). They also require the specification of some meta information, much of which is simply irrelevant unless there is an immediate interest in publishing the package (such as the author name and contact details, and licensing information). While it’s all right to encourage publication in this way, realistically most code, even if reused internally, is never published.

Last but not least, packages need to be built and installed before they can be used in code. And this needs to be repeated every time a single line of code in the package changes. This is fine when developing a package in isolation; not so much when developing it in tandem with a bigger code base.

devtools improves this workflow, but, as a commenter on Stack Overflow has pointed out,

devtools […] reduces the packaging effort from X to X/5, but X/5 in R is still significant. In sensible interpreted languages X equals zero!

A direct consequence of this is that many people end up source-ing all their code and copying it between projects instead of putting their reusable code into a package. At best this is a lost opportunity. At worst you struggle to keep helper files in sync between different projects, which I’ve seen happen a lot.

ii) Not hierarchical

Modular code often naturally forms recursive hierarchies. Most languages recognise this and allow modules to be nested (just think of Python’s or Java’s packages). R is the only widely used modern language (that I can think of) which has a flat package hierarchy.

Allowing hierarchical nesting encourages users to organise project code into small, reusable modules from the outset. Even if these modules never get reused, they still improve the maintainability of the project.

iii) Low cohesion, tight coupling

R’s packaging mechanism encourages huge, monolithic packages chock full of unrelated functions. CRAN has plenty of such packages. Without pointing fingers, let me give, as an example, the otherwise tremendously helpful agricolae package, whose description reads

Statistical Procedures for Agricultural Research

… I know projects which use this package because it includes a function to generate a consensus tree via bootstrapping. The projects in question have no relation whatsoever to agricultural research, and yet they resort to using a package whose name hints at an entirely different purpose, simply because the package has low cohesion.

R’s packages fundamentally bias development towards bad software engineering practices.

iv) Name clashes

R packages provide namespaces and a mechanism for shielding client code from imports in the packages themselves. Nevertheless, there are situations where name clashes occur, because not all packages use namespaces (correctly). R 3.0.0 has allegedly solved this (by requiring use of namespaces) but I can still reproducibly generate a name clash with at least one package.

Why do I manually need to assign the loaded module to a variable?

In other words, why does import force the user to write

module = import('module')

where the module name is redundant, instead of

import('module')

with the latter call automatically defining the required variable in the calling code? R definitely makes this possible (reload does it). However, several reasons speak against it. It’s potentially destructive (inasmuch as it may inadvertently overwrite an existing variable), and it makes the function rely entirely on side effects, something R code should always be wary of. It also makes it less obvious how to define an alias for the imported module in user code. As it is, the user can alias a module simply by assigning it to a different name, e.g. m = import('module').
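To make the “potentially destructive” point concrete, here is a minimal sketch in base R. Both import_implicit and import_explicit are hypothetical stand-ins written for this illustration; neither is part of the modules package, and a plain environment plays the role of the module.

```r
# Hypothetical stand-in for an import that defines the variable itself.
# Not part of the modules package; a plain environment plays the module.
import_implicit = function (name) {
    module = new.env()
    assign('answer', 42, envir = module)
    # Side effect: silently (re)defines `name` in the calling scope.
    assign(name, module, envir = parent.frame())
    invisible(NULL)
}

# Hypothetical stand-in for an explicit import: it only returns the module.
import_explicit = function (name) {
    module = new.env()
    assign('answer', 42, envir = module)
    module
}

stats_table = data.frame(x = 1 : 3)  # an existing user variable
import_implicit('stats_table')       # the data frame has been overwritten
clobbered = is.environment(stats_table)

# With an explicit return value, the caller chooses the name (or an alias).
m = import_explicit('stats_table')
```

In the implicit variant the caller has no say in which name gets defined, so an unrelated variable of the same name is lost without warning. In the explicit variant, the assignment is visible at the call site and doubles as the aliasing mechanism.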

Granted, both unload and reload violate this. However, both are actually safe because they only change the variable explicitly passed to them, and they shouldn’t be used in most code anyway (their purpose is for use in interactive sessions while developing modules).

Why are nested names accessed via $?

Module objects are environments and, as such, allow any form of access that normal environments allow. This notably includes access of objects via the $ operator. This differs from R packages, where objects can be explicitly addressed with the package::object syntax. For now, this syntax is not supported for modules because it is ambiguous when a module name shadows a package.
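As a rough illustration of what “any form of access that normal environments allow” means, the following sketch uses a plain environment as a stand-in for a module object. The names module, helper and constant are made up for this example; a real module would come from import.

```r
# A plain environment stands in for a module object here.
module = new.env()
module$helper = function (x) x + 1
module$constant = 10

# The usual environment accessors all work.
via_dollar = module$helper(module$constant)
via_brackets = module[['helper']](1)
via_get = get('constant', envir = module)
names_in_module = ls(module)

# For comparison, package objects are addressed with `::`.
pkg_style = stats::sd(c(1, 2, 3))
```

The $ form reads closest to the package::object syntax, while relying on nothing beyond ordinary environment semantics.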