Strata Hadoop World 2017 Exploration with R Tutorial
Welcome to the repository for the "Exploration and visualization of large, complex datasets with R, Hadoop, and Spark" tutorial that will be given by Steve Elston of Quantia Analytics and Ryan Hafen at Strata Hadoop World on Tuesday, March 14, 2017, 9:00am–12:30pm.
To ensure that we can focus as much time as possible on content during the tutorial, please follow the installation instructions below prior to the day of the tutorial and run the final check to make sure your system is set up appropriately. The examples and exercises will be run locally on your own system, allowing you to continue to experiment with the techniques beyond the tutorial.
If you encounter any issues with installation, please file an issue in this repository detailing the problem and including the output of running sessionInfo() in your R session. Any known issues and possible workarounds will be documented more prominently at the bottom of this README.
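One convenient way to gather the sessionInfo() output for an issue report is to capture it as text. This is only a sketch; the file name session_info.txt is a hypothetical choice made for this example, not something the repository requires.

```r
# Capture the printed sessionInfo() output as a character vector.
info <- capture.output(sessionInfo())

# Save it to a file so it can be attached to, or pasted into, an issue.
# "session_info.txt" is a hypothetical file name used only for illustration.
writeLines(info, "session_info.txt")

# Also print it to the console for quick copy and paste.
cat(info, sep = "\n")
```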
- Latest version of R (3.3.2) (download and install from here)
- Latest version of RStudio (download and install from here)
- Java JDK (download and install from here)
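A quick way to confirm these prerequisites from within R is sketched below. It assumes that a usable Java installation exposes a java executable on the PATH, which is only a heuristic; some setups rely on JAVA_HOME instead, so a negative result here is not conclusive.

```r
# Report the running R version; this README lists 3.3.2 as the latest release.
r_version <- getRversion()
cat("R version:", as.character(r_version), "\n")
if (r_version < "3.3.2") {
  message("R is older than 3.3.2; consider upgrading before the tutorial.")
}

# Look for a java executable on the PATH (heuristic check only).
java_path <- Sys.which("java")
if (nzchar(java_path)) {
  cat("Java found at:", java_path, "\n")
} else {
  message("No java executable found on the PATH; check your JDK installation.")
}

# JAVA_HOME is printed for reference; it may legitimately be empty.
cat("JAVA_HOME:", Sys.getenv("JAVA_HOME"), "\n")
```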
Install the required R packages with the following commands:
    install.packages(c("devtools", "tidyverse", "nycflights13", "sparklyr", "digest", "scales", "prettyunits", "httpuv", "xtable"))
    devtools::install_github("hafen/trelliscopejs")
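To confirm that the installation step succeeded, the package list above can be compared against what is installed locally. The following is a minimal sketch; it only checks that each package is present in the library, not that it loads without error.

```r
# Packages named in the installation commands above.
pkgs <- c("devtools", "tidyverse", "nycflights13", "sparklyr", "digest",
          "scales", "prettyunits", "httpuv", "xtable", "trelliscopejs")

# Compare against the packages found in the local R libraries.
installed <- pkgs %in% rownames(installed.packages())
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```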
Now we can install a local version of Spark with sparklyr's spark_install() function:
    library(sparklyr)
    spark_install(version = "1.6.2")
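To see whether the Spark download completed, sparklyr can list the versions it has installed locally. The sketch below assumes that spark_installed_versions() returns a data frame with a spark column holding version strings, which matches sparklyr's documented behavior as far as I know; treat the column name as an assumption if your sparklyr version differs.

```r
# Guarded so the snippet also runs on a machine without sparklyr.
if (requireNamespace("sparklyr", quietly = TRUE)) {
  # List the Spark versions that sparklyr has installed locally.
  versions <- sparklyr::spark_installed_versions()
  print(versions)

  # Assumption: the version strings are in a column named "spark".
  has_162 <- "1.6.2" %in% versions$spark
  if (has_162) {
    message("Spark 1.6.2 is installed.")
  } else {
    message("Spark 1.6.2 was not found; try spark_install(version = \"1.6.2\") again.")
  }
} else {
  has_162 <- NA
  message("sparklyr is not installed; install it before checking Spark.")
}
```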
Check the installation
Ensure that this example will now run without any issues:
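    library(sparklyr)
    library(dplyr)
    library(nycflights13)
    library(ggplot2)
    # Connect to the local Spark installation
    sc <- spark_connect(master = "local")
    # Copy the example data frames into Spark
    flights <- copy_to(sc, flights, "flights")
    airlines <- copy_to(sc, airlines, "airlines")
    # List the tables now available in Spark
    src_tbls(sc)
    # Run a simple query against the Spark table
    filter(flights, dep_delay > 1000)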
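As a cross-check on the Spark query, the same filter can be run on the local nycflights13 data frame with base R; the rows returned by Spark should correspond to these. This is a sketch that is guarded so it also runs when nycflights13 is missing, and the name local_result is introduced here only for illustration.

```r
# Run the same filter locally, without Spark, to compare against the Spark result.
if (requireNamespace("nycflights13", quietly = TRUE)) {
  # subset() drops rows where dep_delay is NA, as the filter() call does.
  local_result <- subset(nycflights13::flights, dep_delay > 1000)
  print(local_result[, c("year", "month", "day", "carrier", "flight", "dep_delay")])
} else {
  local_result <- NULL
  message("nycflights13 is not installed; skipping the local cross-check.")
}
```

When you are finished with the check, spark_disconnect(sc) closes the connection opened by spark_connect().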
This repository contains all of the resources needed for the course. You can either clone the repository or simply download the zip file for the repository and unzip it on your computer.
Once you have the repository, you simply need to open up BigDataVisualization.Rmd in RStudio and you are ready to go.
The following resources are worth browsing before the tutorial for a better understanding of some of the concepts it builds upon.
If you get an error like the following when loading the tidyverse package:
    Error : object `as_factor' is not exported by 'namespace:forcats'
    Error: package or namespace load failed for `tidyverse'
The following should fix it:
    remove.packages("tidyverse")
    install.packages("forcats")
    install.packages("tidyverse")
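After reinstalling, one way to confirm the fix is to check that the installed forcats actually exports as_factor, since that is the object named in the error message. This is a minimal sketch; the variable exports_ok is introduced here only for illustration.

```r
# Guarded so the snippet also runs when forcats is not installed.
if (requireNamespace("forcats", quietly = TRUE)) {
  cat("forcats version:", as.character(packageVersion("forcats")), "\n")

  # The tidyverse load error complains that as_factor is not exported.
  exports_ok <- "as_factor" %in% getNamespaceExports("forcats")
  if (exports_ok) {
    message("forcats exports as_factor; loading tidyverse should now work.")
  } else {
    message("forcats still lacks as_factor; try reinstalling forcats.")
  }
} else {
  exports_ok <- NA
  message("forcats is not installed.")
}
```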