This document aims to answer the following questions regarding the design of the modules package. For questions regarding the implementation, refer to the wiki page on Specification.
- Why? Why not use/write packages?
- Why do I manually need to assign the loaded module to a variable?
- Why are nested names accessed via $?
Why? Why not use/write packages?
While using R for exploratory data analysis as well as writing more robust
analysis code, I have experienced the R mechanism of clumsily sourcing lots
of files to be a big hindrance. In fact, just adding a few helper functions to
make source() less painful naturally evolved into an incomplete ad-hoc
implementation of modules.
The standard answer to this problem is “write a package”. But in the humble opinion of this person, R packages fall short in several regards, which this package (the irony is not lost on me) strives to rectify.
i) Effort
Writing packages incurs a non-trivial overhead. Packages need to live in their own folder hierarchy (and, importantly, cannot be nested), and they require the specification of some meta information, a lot of which is simply irrelevant unless there is an immediate interest in publishing the package (such as the author name and contact, and licensing information). While it’s all right to thus encourage publication, realistically most code, even if reused internally, is never published.
Last but not least, packages, before they can be used in code, need to be built and installed. And this needs to be repeated every time a single line of code is changed in the package. This is fine when developing a package in isolation; not so much when developing it in tandem with a bigger code base.
devtools improves this workflow but, as a commenter on Stack Overflow
has pointed out,
devtools […] reduces the packaging effort from X to X/5, but X/5 in R is still significant. In sensible interpreted languages X equals zero!
A direct consequence of this is that many people do end up sourcing all
their code, and copying it between projects, and not putting their reusable code
into a package. At best this is a lost opportunity. At worst you struggle
keeping helper files between different projects in sync, which I’ve seen happen.
ii) Not hierarchical
Modular code often naturally forms recursive hierarchies. Most languages recognise this and allow modules to be nested (just think of Python’s or Java’s packages). R is the only widely used modern language (that I can think of) which has a flat package hierarchy.
Allowing hierarchical nesting encourages users to organise project code into small, reusable modules from the outset. Even if these modules never get reused, they still improve the maintainability of the project.
iii) Low cohesion, tight coupling
R’s packaging mechanism encourages huge, monolithic packages chock full of unrelated functions. CRAN has plenty of such packages. Without pointing fingers, let me give, as an example, the otherwise tremendously helpful agricolae package, whose description reads
Statistical Procedures for Agricultural Research
… I know projects which use this package because it includes a function to generate a consensus tree via bootstrapping. The projects in question have no relation whatsoever to agricultural research – and yet they resort to using a package whose name hints at its purpose, simply because of low cohesion.
R’s packages fundamentally bias development towards bad software engineering practices.
iv) Name clashes
R packages provide namespaces and a mechanism for shielding client code from imports in the packages themselves. Nevertheless, there are situations where name clashes occur, because not all packages use namespaces (correctly). R 3.0.0 has allegedly solved this (by requiring use of namespaces) but I can still reproducibly generate a name clash with at least one package.
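As a rough illustration of what a name clash looks like in practice, the base R sketch below attaches two toy "packages" to the search path. The names pkg_a, pkg_b and normalise are hypothetical, and plain lists stand in for real packages. It only shows search-path masking, not the specific namespace problem mentioned above.

```r
# Two toy "packages", each exporting a function called `normalise`.
# (Hypothetical names; plain lists stand in for real packages.)
pkg_a = list(normalise = function (x) x / sum(x))
pkg_b = list(normalise = function (x) x / max(x))

# Attaching puts the exported names on the search path.
# Conflict warnings are silenced to keep the output quiet.
attach(pkg_a, name = 'pkg_a', warn.conflicts = FALSE)
attach(pkg_b, name = 'pkg_b', warn.conflicts = FALSE)

x = c(1, 2, 3)

# Unqualified lookup finds whichever was attached last, here pkg_b.
masked = normalise(x)

# Keeping each "package" in its own variable avoids the clash entirely.
a = pkg_a$normalise(x)
b = pkg_b$normalise(x)

# Clean up the search path again.
detach('pkg_b', character.only = TRUE)
detach('pkg_a', character.only = TRUE)
```

With explicit objects, both functions stay reachable under distinct names, whereas the unqualified call silently depends on the order of attachment.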
Why do I manually need to assign the loaded module to a variable?
In other words, why does import force the user to write
module = import('module')
where the module name is redundant, instead of
import('module')
with the latter call automatically defining the required variable in the calling
code? R definitely makes this possible (reload does it). However, several
reasons speak against it. It’s potentially destructive (inasmuch as it may
inadvertently overwrite an existing variable), and it makes the function rely
entirely on side effects, something which R code should always be wary of. It
also makes it less obvious how to define an alias for the imported module in
user code. As it is, the user can simply alias a module by assigning it to a
different name, e.g.
m = import('module').
unload and reload violate this. However, both are actually
safe because they only change the variable explicitly passed to them, and they
shouldn’t be used in most code anyway (their purpose is for use in interactive
sessions while developing modules).
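The difference between returning a module and defining it as a side effect can be sketched in base R. This is not how the modules package is implemented: make_toy_module, import_returning and import_assigning are hypothetical stand-ins, and the "module" is just an environment with one function in it.

```r
# Toy stand-in for a loaded module: an environment holding one function.
# (Hypothetical helper, not part of the modules package.)
make_toy_module = function (name) {
    module = new.env()
    module$name = name
    module$hello = function () paste('hello from', name)
    module
}

# Variant 1: return the module; the caller chooses the variable name.
import_returning = function (name) {
    make_toy_module(name)
}

# Variant 2: define a variable named after the module in the calling
# environment, purely as a side effect.
import_assigning = function (name) {
    assign(name, make_toy_module(name), envir = parent.frame())
    invisible(NULL)
}

# An existing variable that happens to share the module's name.
tools = c(1, 2, 3)

# Explicit assignment: nothing is overwritten, and aliasing is trivial.
tl = import_returning('tools')
stopifnot(is.numeric(tools))
stopifnot(tl$hello() == 'hello from tools')

# Side-effect assignment: the numeric vector is silently replaced.
import_assigning('tools')
stopifnot(is.environment(tools))
```

The second variant gives no warning that tools used to hold data, which is the kind of inadvertent overwrite described above.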
Why are nested names accessed via $?
Module objects are environments and, as such, allow any form of access that
normal environments allow. This notably includes access of objects via the $
operator. This differs from R packages, where objects can be explicitly
addressed with the
package::object syntax. For now, this syntax is not
supported for modules because it is ambiguous when a module name shadows a