some edits for publication in The Political Methodologist.
kjhealy committed Mar 18, 2011
1 parent 263d1dc commit 8290a05
Showing 1 changed file with 56 additions and 59 deletions.
115 changes: 56 additions & 59 deletions workflow-apps.org
@@ -74,22 +74,21 @@ Two remarks at the outset. First, because this discussion is aimed at
** Just Make Sure You Know What You Did ** Just Make Sure You Know What You Did


For any kind of formal data analysis that leads to a scholarly paper, For any kind of formal data analysis that leads to a scholarly paper,
however you do it, there are basic principles that you will want to however you do it, there are some basic principles to adhere
adhere to. Perhaps the most important thing is to do your work in a to. Perhaps the most important thing is to do your work in a way that
way that leaves a coherent record of your actions. Instead of doing a leaves a coherent record of your actions. Instead of doing a bit of
bit of statistical work and then just keeping the resulting table of statistical work and then just keeping the resulting table of results
results or graphic that you produced, for instance, write down what or graphic that you produced, for instance, write down what you did as
you did as a documented piece of code. Rather than figuring out but a documented piece of code. Rather than figuring out but not recording
not recording a solution to a problem you might have again, write down a solution to a problem you might have again, write down the answer as
the answer as an explicit procedure. Instead of copying out some an explicit procedure. Instead of copying out some archival material
archival material without much context, file the source properly, or without much context, file the source properly, or at least a precise
at least a precise reference to it. reference to it.


Why should you bother to do any of this? Because when you inevitably Why should you bother to do any of this? Because when you inevitably
return to your table or figure or quotation nine months down the line, return to your table or figure or quotation nine months down the line,
your future self will have been saved hours spent wondering what it your future self will have been saved hours spent wondering what it
was you thought you were doing and where in the hell you got that was you thought you were doing and where you got that result from.
stuff from, anyway.
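
One way to leave that kind of record, as a minimal sketch rather than a prescribed method: whenever a script writes a table, have it also write a small sidecar file saying which script and which data produced it. Everything named here (=save_with_record=, =analysis.py=, =survey.csv=, the =.record.json= suffix) is hypothetical and chosen only for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_with_record(table_text, out_path, script_name, data_path):
    """Write a results table plus a sidecar file saying how it was made."""
    out_path = Path(out_path)
    out_path.write_text(table_text, encoding="utf-8")
    record = {
        "output": out_path.name,
        "script": script_name,  # hypothetical: the script that made the table
        "data_file": Path(data_path).name,
        # A hash of the input data lets you check later that it has not changed.
        "data_sha256": hashlib.sha256(Path(data_path).read_bytes()).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    record_path = out_path.with_name(out_path.name + ".record.json")
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record_path


# Usage with made-up data in a temporary directory.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "survey.csv"
data_file.write_text("tea,biscuits\n1,2\n2,4\n", encoding="utf-8")
record_file = save_with_record(
    "coef,estimate\ntea,2.0\n", workdir / "table1.csv", "analysis.py", data_file
)
record = json.loads(record_file.read_text(encoding="utf-8"))
```

Nine months later, the sidecar file next to =table1.csv= answers the "where did this come from" question without any detective work.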


A second principle is that a document, file or folder should always be A second principle is that a document, file or folder should always be
able to tell you what it is. Beyond making your work reproducible, you able to tell you what it is. Beyond making your work reproducible, you
@@ -164,9 +163,10 @@ this way:
running Windows is easy, and even catered to by Mac OS's Boot Camp running Windows is easy, and even catered to by Mac OS's Boot Camp
utility. Beyond installing OS X and Windows side-by-side, utility. Beyond installing OS X and Windows side-by-side,
third-party virtualization software is available (for about \$80 third-party virtualization software is available (for about \$80
from [[http://www.vmware.com/products/fusion/][VMWare]] or [[http://www.parallels.com/][Parallels]]) that allows you to run Windows or Linux from [[http://www.vmware.com/products/fusion/][VMWare]] or [[http://www.parallels.com/][Parallels]], or free from [[http://www.virtualbox.org/][VirtualBox]]) that allows you
seamlessly within OS X. Thus, Apple hardware is the only setup where to run Windows or Linux seamlessly within OS X. Thus, Apple hardware
you can easily try out each of the main desktop operating systems. is the only setup where you can easily try out each of the main
desktop operating systems.


- Linux is stable, secure, and free. User-oriented distributions such - Linux is stable, secure, and free. User-oriented distributions such
as [[http://www.ubuntu.com/][Ubuntu]] are much better-integrated and well-organized than in the as [[http://www.ubuntu.com/][Ubuntu]] are much better-integrated and well-organized than in the
@@ -183,9 +183,8 @@ this way:


These days, I use Mac OS X, and the discussion here reflects that These days, I use Mac OS X, and the discussion here reflects that
choice to some extent. But the other two options are also perfectly choice to some extent. But the other two options are also perfectly
viable alternatives. Rather than try to convince you to plump for one viable alternatives, and most of the applications I will discuss are
option or another, let's look at some applications that will run on freely available for all of these operating systems.
all of these operating systems.


The dissertation, book, or articles you write will generally consist The dissertation, book, or articles you write will generally consist
of the main text, the results of data analysis (perhaps presented in of the main text, the results of data analysis (perhaps presented in
@@ -196,11 +195,10 @@ data* and *minimize error*. In the next section I describe some
applications and tools designed to let you do this easily. They fit applications and tools designed to let you do this easily. They fit
together well (by design) and are all freely available for Windows, together well (by design) and are all freely available for Windows,
Linux and Mac OS X. They are not perfect, by any means --- in fact, Linux and Mac OS X. They are not perfect, by any means --- in fact,
some of them are kind of a pain in the ass to learn. (I'll discuss some of them can be awkward to learn. But graduate-level research and
some nicer alternatives, too.) But graduate-level research and writing writing can also be awkward to learn. Specialized tasks need
is also kind of a pain in the ass to learn. Specialized tasks need specialized tools and, unfortunately, although they are very good at
specialized tools and, unfortunately, even if they are very good at what they do, these tools don't always go out of their way to be
what they do these tools don't always go out of their way to be
friendly. friendly.


** Edit Text ** Edit Text
@@ -257,9 +255,9 @@ evolved in a much earlier era of computing (before decent graphical
displays, for instance, and possibly also fire), it doesn't share many displays, for instance, and possibly also fire), it doesn't share many
of the conventions of modern applications.[fn:emacs] Emacs offers many of the conventions of modern applications.[fn:emacs] Emacs offers many
opportunities to waste your time learning its particular conventions, opportunities to waste your time learning its particular conventions,
tweaking its settings, and generally customizing the bejaysus out of tweaking its settings, and generally customizing it. There are several
it. There are several good alternatives on each major platform, and I good alternatives on each major platform, and I discuss some of them
discuss some of them below. below.


[fn:emacs] One of the reasons that Emacs' keyboard shortcuts are so [fn:emacs] One of the reasons that Emacs' keyboard shortcuts are so
strange is that they have their roots in a model of computer that laid strange is that they have their roots in a model of computer that laid
@@ -274,7 +272,9 @@ good, in fact, that Emacs has recently become quite popular amongst a
set of software developers pretty much all of whom are much younger set of software developers pretty much all of whom are much younger
than Emacs itself. The upshot is that there has been a run of good, than Emacs itself. The upshot is that there has been a run of good,
new resources available for learning it and optimizing it easily. [[http://peepcode.com/products/meet-emacs][Meet new resources available for learning it and optimizing it easily. [[http://peepcode.com/products/meet-emacs][Meet
Emacs]], a screencast available for purchase from PeepCode, walks you through the basics of the application. Emacs itself also has a built-in tutorial. Emacs]], a screencast available for purchase from PeepCode, walks you
through the basics of the application. Emacs itself also has a
built-in tutorial.


If text editors like Emacs are not concerned with formatting your If text editors like Emacs are not concerned with formatting your
documents nicely, then how do you produce properly typeset papers? You documents nicely, then how do you produce properly typeset papers? You
@@ -346,9 +346,8 @@ color-coding the marked-up text to make it easier to read, providing
shortcuts to LaTeX's formatting commands, and helping you manage shortcuts to LaTeX's formatting commands, and helping you manage
references to Figures, Tables and bibliographic citations in the references to Figures, Tables and bibliographic citations in the
text. These packages could also be listed under the ``Minimize Error'' text. These packages could also be listed under the ``Minimize Error''
section below, because they help ensure that, e.g., your references section below, because they help ensure that your references and
and bibliography will be complete and consistently bibliography will be complete and consistently formatted.[fn:fonts]
formatted.[fn:fonts]


[fn:fonts] A note about fonts and LaTeX. It used to be that getting [fn:fonts] A note about fonts and LaTeX. It used to be that getting
LaTeX to use anything but a relatively small set of fonts was a very LaTeX to use anything but a relatively small set of fonts was a very
@@ -443,21 +442,20 @@ error. In particular, it is easy for a table of results to get
detached from the sequence of steps that produced it. Almost everyone detached from the sequence of steps that produced it. Almost everyone
who has written a quantitative paper has been confronted with the who has written a quantitative paper has been confronted with the
problem of reading an old draft containing results or figures that problem of reading an old draft containing results or figures that
need to be revisited or reproduced (as a result of the peer-review need to be revisited or reproduced (as a result of peer-review, say)
process, say) but which lack any information about the circumstances but which lack any information about the circumstances of their
of their creation. Academic papers take a long time to get through the creation. Academic papers take a long time to get through the cycle of
cycle of writing, review, revision, and publication, even when you're writing, review, revision, and publication, even when you're working
working hard the whole time. It is not uncommon to have to return to hard the whole time. It is not uncommon to have to return to something
something you did two years previously in order to answer some you did two years previously in order to answer some question or other
question or other from a reviewer. You do not want to have to do from a reviewer. You do not want to have to do everything over from
everything over from scratch in order to get the right answer. I am scratch in order to get the right answer. I am not exaggerating when I
not exaggerating when I say that, whatever the challenges of say that, whatever the challenges of replicating the results of
replicating the results of someone else's quantitative analysis, after someone else's quantitative analysis, after a fairly short period of
a fairly short period of time authors themselves find it hard to time authors themselves find it hard to replicate their /own/
replicate their /own/ work. Computer Science people have a term of art work. Computer Science people have a term of art for the inevitable
for the inevitable process of decay that overtakes a project simply in process of decay that overtakes a project simply in virtue of its
virtue of its being left alone on the hard drive for six months or being left alone on the hard drive for six months or more: bit--rot.
more: bit--rot.


*** Literate Programming with Sweave *** Literate Programming with Sweave
A first step toward closing this gap is to use *Sweave* when doing A first step toward closing this gap is to use *Sweave* when doing
@@ -531,13 +529,13 @@ peer-reviewed studies using Sweave, and the errors uncovered as a
result, see \textcite{hothorn11:_case_studies_reprod}. result, see \textcite{hothorn11:_case_studies_reprod}.


A weakness of the Sweave model is that when you make changes, you have A weakness of the Sweave model is that when you make changes, you have
to reprocess the all of the code to reproduce the final LaTeX file. If to reprocess all of the code to reproduce the final LaTeX file. If
your analysis is computationally intensive this can take a long your analysis is computationally intensive this can take a long
time. You can go a little ways toward working around this by designing time. You can go a little ways toward working around this by designing
projects so that they are relatively modular, which is good practice projects so that they are relatively modular, which is good practice
anyway. But for projects that are unavoidably large or computationally anyway. But for projects that are unavoidably large or computationally
intensive, the add-on package =cacheSweave=, available from the R intensive, the add-on package =cacheSweave=, available from the R
website, does a good job alleviating the problem. website, does a good job alleviating the problem.
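
The general idea behind caching expensive chunks can be sketched as follows. This is only an illustration of the technique, written in Python, and not a description of how =cacheSweave= itself is implemented: the function name =cached_chunk=, the use of a hash of the chunk's source text as the cache key, and the pickle files are all assumptions made for the example.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_chunk(source, compute, cache_dir):
    """Run compute() only if this chunk's source text has not been seen before."""
    # Key the cache on the chunk's source: edit the code and the key changes,
    # so the chunk is recomputed; leave it alone and the stored result is reused.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / (key + ".pkl")
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the expensive step really ran


def slow_model():
    """Stand-in for a computationally intensive analysis step."""
    calls.append(1)
    return sum(i * i for i in range(1000))


cache_dir = tempfile.mkdtemp()
first = cached_chunk("fit <- slow_model()", slow_model, cache_dir)
second = cached_chunk("fit <- slow_model()", slow_model, cache_dir)
```

The second call returns the stored result without rerunning =slow_model=. The obvious limitation of a sketch like this is that a chunk is only recomputed when its own text changes, not when something it depends on changes, which is one more reason to keep projects modular.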


*** Literate Programming with Org-mode *** Literate Programming with Org-mode
*[[http://orgmode.org/][Org-mode]]* is an Emacs mode originally designed to make it easier to *[[http://orgmode.org/][Org-mode]]* is an Emacs mode originally designed to make it easier to
@@ -582,7 +580,7 @@ directly. I don't show the code for this here, but you can look in the
#+ATTR_LaTeX: width=5in #+ATTR_LaTeX: width=5in
#+source: ggplot-example #+source: ggplot-example
#+begin_src R :results output graphics :file figures/ggplot-example.pdf :useDingbats FALSE :exports results #+begin_src R :results output graphics :file figures/ggplot-example.pdf :useDingbats FALSE :exports results
qplot(tea, biscuits) + geom_smooth(method="lm") + scale_x_continuous(name="Tea") + scale_y_continuous(name="Biscuits") qplot(tea, biscuits) + geom_smooth(method="lm") + scale_x_continuous(name="Tea") + scale_y_continuous(name="Biscuits") + theme_bw()
#+end_src #+end_src




@@ -611,9 +609,8 @@ control as a way to keep track of whole projects (not just individual
documents) in a much better-organized, comprehensive, and transparent documents) in a much better-organized, comprehensive, and transparent
fashion. Modern version control systems such as [[http://subversion.tigris.org/][Subversion]], [[http://www.selenic.com/mercurial/][Mercurial]] fashion. Modern version control systems such as [[http://subversion.tigris.org/][Subversion]], [[http://www.selenic.com/mercurial/][Mercurial]]
and [[http://git.or.cz/][Git]] can, if needed, manage very large projects with many branches and [[http://git.or.cz/][Git]] can, if needed, manage very large projects with many branches
spread across multiple users. As such, they require a little time to spread across multiple users. As such, you have to get used to some
get comfortable with, mostly because you have to get used to some new new concepts related to tracking your files, and then learn how your
concepts related to tracking your files, and then learn how your
version control system implements these concepts. Because of their version control system implements these concepts. Because of their
power, these tools might seem like overkill for individual power, these tools might seem like overkill for individual
users. (Again, though, many people find Word's ``Track Changes'' users. (Again, though, many people find Word's ``Track Changes''
@@ -624,7 +621,7 @@ with your text editor.[fn:magit] Moreover, you can meet these systems
half way. The excellent [[https://www.getdropbox.com/][DropBox]], for example, allows you to share half way. The excellent [[https://www.getdropbox.com/][DropBox]], for example, allows you to share
files between different computers you own, or with collaborators or files between different computers you own, or with collaborators or
general public. But it also automatically version-controls the general public. But it also automatically version-controls the
contents of these folders (using Subversion behind the scenes). contents of these folders.


[fn:magit] Emacs comes with support for a variety of VCS systems built [fn:magit] Emacs comes with support for a variety of VCS systems built
in. There's also a very good add-on package, [[http://philjackson.github.com/magit/][Magit]], devoted in. There's also a very good add-on package, [[http://philjackson.github.com/magit/][Magit]], devoted
@@ -686,13 +683,13 @@ up everything automatically to an external (or remote) hard disk
without you having to remember to do anything. On Macs, Apple's *Time without you having to remember to do anything. On Macs, Apple's *Time
Machine* software is built in to the operating system and makes Machine* software is built in to the operating system and makes
backups very easy. On Linux, you can use [[http://www.psychocats.net/ubuntu/backup][rsync]] for backups. It is also backups very easy. On Linux, you can use [[http://www.psychocats.net/ubuntu/backup][rsync]] for backups. It is also
worth looking into a secure, peer-to-peer or offsite backup service worth looking into a secure, peer-to-peer, or offsite backup service
like [[http://www.crashplan.com/][Crashplan]] or [[https://spideroak.com/][Spider Oak]]. Offsite backup means that in the event like [[http://www.crashplan.com/][Crashplan]], [[https://spideroak.com/][Spider Oak]], or [[http://www.backblaze.com/][Backblaze]]. Offsite backup means that in
(unlikely, but not unheard of) that your computer /and/ your local the event (unlikely, but not unheard of) that your computer /and/ your
backups are stolen or destroyed, you will still have copies of your local backups are stolen or destroyed, you will still have copies of
files.[fn:tornado] As Jamie Zawinski [[http://jwz.livejournal.com/801607.html][has remarked]], when it comes to your files.[fn:tornado] As Jamie Zawinski [[http://jwz.livejournal.com/801607.html][has remarked]], when it comes
losing your data ``The universe tends toward maximum irony. Don't push to losing your data ``The universe tends toward maximum irony. Don't
it.'' push it.''


[fn:tornado] I know of someone whose office building was hit by a [fn:tornado] I know of someone whose office building was hit by a
tornado. She returned to find her files and computer sitting in a foot tornado. She returned to find her files and computer sitting in a foot