Repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial at Strata Hadoop World 2017

Strata Hadoop World 2017 Exploration with R Tutorial

Welcome to the repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial that will be given by Steve Elston of Quantia Analytics and Ryan Hafen at Strata Hadoop World on Tuesday, March 14, 2017, 9:00am–12:30pm.


To ensure that we can focus as much time as possible on content during the tutorial, please follow the installation instructions below prior to the day of the tutorial and run the final check to make sure your system is set up appropriately. The examples and exercises will be run locally on your own system, allowing you to continue to experiment with the techniques beyond the tutorial.

If you encounter any issues with installation, please file an issue in this repository that describes the problem and includes the output of running sessionInfo() in your R session. Any known issues and possible workarounds will be documented at the bottom of this README.
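
One way to get the sessionInfo() output into a form that is easy to paste into an issue is sketched below. It uses only base R; the file name and the use of a temporary directory are arbitrary choices for illustration, not something this repository requires.

```r
# Capture the sessionInfo() output as plain text lines.
info_lines <- capture.output(sessionInfo())

# Print it to the console so it can be copied directly...
cat(info_lines, sep = "\n")

# ...and save a copy to a file (hypothetical name, written to a temp directory).
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(info_lines, info_file)
message("Session info written to ", info_file)
```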

System prerequisites:

  • Latest version of R (3.3.2) (download and install from here)
  • Latest version of RStudio (download and install from here)
  • Java JDK (download and install from here)
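
A rough way to check the prerequisites above from an R session is sketched here. It uses only base R. The Java test only looks for a java executable on the PATH, which is an assumption made for this sketch: a JDK installed elsewhere (for example one found only through JAVA_HOME) would be reported as missing even though it may work.

```r
# Minimum R version, taken from the prerequisites list above.
required_r <- "3.3.2"
r_ok <- getRversion() >= required_r

# Sys.which() returns an empty string when the executable is not on the PATH.
java_path <- Sys.which("java")
java_ok <- nzchar(java_path)

cat("R version:", as.character(getRversion()),
    if (r_ok) "(ok)" else "(older than required)", "\n")
cat("Java on PATH:", if (java_ok) java_path else "not found", "\n")
```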

R packages

Install the required R packages with the following command:

install.packages(c("devtools", "tidyverse", "nycflights13", "sparklyr", "digest",
  "scales", "prettyunits", "httpuv", "xtable"))

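If some of these packages are already installed, or you want to confirm that the installation succeeded, the sketch below lists which of them are still missing. It uses only base R; the commented-out install.packages() call is optional and is left for you to enable.

```r
# The package list from the install command above.
pkgs <- c("devtools", "tidyverse", "nycflights13", "sparklyr", "digest",
  "scales", "prettyunits", "httpuv", "xtable")

# Packages from the list that are not found in any library on this machine.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install only the missing ones
}
```
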
Now we can install a local version of Spark with sparklyr's spark_install():

library(sparklyr)
spark_install(version = "1.6.2")
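
To confirm that the Spark download finished, you can list the versions sparklyr knows about. This is a minimal sketch that assumes spark_installed_versions() returns a data frame with a spark column holding version strings; the requireNamespace() guard is only there so the snippet does not fail when sparklyr itself is missing.

```r
# Stays FALSE if sparklyr is missing or Spark 1.6.2 is not installed locally.
has_spark <- FALSE

if (requireNamespace("sparklyr", quietly = TRUE)) {
  # Lists the local Spark installations that sparklyr manages.
  installed <- sparklyr::spark_installed_versions()
  print(installed)
  has_spark <- "1.6.2" %in% installed$spark
} else {
  message("sparklyr is not installed yet; install the R packages first.")
}

message("Spark 1.6.2 available: ", has_spark)
```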

Check the installation

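Make sure the following example now runs without errors:

library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local")
flights <- copy_to(sc, flights, "flights")
airlines <- copy_to(sc, airlines, "airlines")
filter(flights, dep_delay > 1000)  # prints the flights delayed by more than 1000 minutes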
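
If you would rather get a single pass/fail answer and have the Spark connection closed afterwards, the check can be wrapped in a function as sketched below. The function name run_install_check is hypothetical and not part of the tutorial material; the sketch assumes the packages and the local Spark install from the steps above, and it reports a failure instead of stopping when any of them is missing.

```r
# Hypothetical helper: runs the installation check and returns the number of
# flights with a departure delay above 1000 minutes.
run_install_check <- function() {
  sc <- sparklyr::spark_connect(master = "local")
  # Close the connection when the function exits, even after an error.
  on.exit(sparklyr::spark_disconnect(sc), add = TRUE)

  flights_tbl <- dplyr::copy_to(sc, nycflights13::flights, "flights",
                                overwrite = TRUE)
  # collect() pulls the filtered rows from Spark back into R.
  delayed <- dplyr::collect(dplyr::filter(flights_tbl, dep_delay > 1000))
  nrow(delayed)
}

# NA signals that the check could not be completed.
result <- tryCatch(
  run_install_check(),
  error = function(e) {
    message("Installation check failed: ", conditionMessage(e))
    NA_integer_
  }
)

if (!is.na(result)) message("Installation check passed.")
```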

Course material

This repository contains all of the resources needed for the course. You can either clone the repository or simply download the zip file for the repository and unzip it on your computer.

Once you have the repository, open BigDataVisualization.Rmd in RStudio and you are ready to go.


The following resources are worth browsing before the tutorial, as they introduce some of the concepts the tutorial builds on.

Installation issues

If you get an error like the following when loading the tidyverse package:

Error : object `as_factor' is not exported by 'namespace:forcats'
Error: package or namespace load failed for `tidyverse'

The following should fix it: