Materials for the tutorial, "A survival guide to large-scale data analysis in R."
Large-scale data analysis in R is like the Super-G events in the Winter Olympics: it is about pushing the physical limits of your computer (or compute cluster). My first aim is to show you some techniques for pushing your R computing further. My second aim is to help you make more effective use of memory, the most precious commodity in computing, and to demonstrate how R sometimes uses it poorly. This presentation is intended to be hands-on: bring your laptop, and we will work through the examples together. This git repository contains the source code for running the demos.
---
In this tutorial I attempt to apply elements of the Software Carpentry approach. See also this article. Please also take a look at the Code of Conduct and the license information.
---
To generate PDFs of the slides from the R Markdown source, run

```
make slides.pdf
```

in the docs directory. For this to work, you will need to install the rmarkdown package in R, as well as any additional packages used in slides.Rmd. For more details, see the Makefile.
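If you prefer to work directly in R rather than through make, the equivalent steps can be sketched as follows. This is only a sketch: it assumes slides.Rmd declares a PDF output format (for example, beamer_presentation) in its YAML header, and that any additional packages it uses are already installed.

```r
# Install rmarkdown if it is not already available (run once).
install.packages("rmarkdown")

# Render the slides from the repository root; the output format is
# taken from the YAML header of slides.Rmd, and the PDF is written
# alongside the source file in the docs directory.
rmarkdown::render("docs/slides.Rmd")
```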
---

See also the instructor notes.
These materials were developed by Peter Carbonetto at the University of Chicago. Thank you to Matthew Stephens for his support and guidance. Thanks also to Gao Wang for sharing the Python script to profile memory usage, to David Gerard for sharing code that ultimately improved several of the examples, and to John Blischak, John Novembre and Stefano Allesina for providing great examples to learn from.