# RSpark Development Repository
This repo is used to build Docker images for R/RStudio, Postgres, Hadoop, Hive, and Spark. Building the images and deploying them as Docker containers can be done in several ways, as described below.
Requirements:

Operating systems:

- Mac OS 10.11 or greater
- Windows 10 Enterprise or Professional

Software versions:

- Java OpenJDK 8
- Hadoop 2.7.4
- Hive 2.1.1
- Postgres 2.4
- Spark 2.2.1 (for Hadoop 2.7.1 or greater)
## Building rspark from this repo

Clone the rspark repo using the following command:

```bash
git clone https://github.com/jharner/rspark.git
```
rspark can then be built in one of two ways on your local computer (assuming you meet the system requirements):

- Spark integrated into the `rstudio` container as a single node

  This single-node Spark environment can be built by running the corresponding bash script in the `rspark` directory (`cd` into it as necessary).
- Spark deployed as a small cluster

  This "cluster" Spark environment can be built by running the corresponding bash script (assuming you are in the `rspark` directory).
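The build scripts themselves are not reproduced here. As a rough sketch of what such a script typically wraps, assuming the containers are wired together with Docker Compose (the compose file name and this two-step layout are assumptions, not the repo's actual script):

```shell
# Hypothetical sketch, assuming a Docker Compose setup; the file name
# and commands are assumptions, not the repo's actual build script.
docker-compose -f docker-compose.yml build  # build the rstudio, hadoop, hive, and postgres images
docker-compose -f docker-compose.yml up     # launch the containers in the foreground
```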
Building the images and launching the containers will take time. Once complete, leave the shell script running in your terminal, i.e., do not quit or close the terminal window. Open a browser, enter `localhost:8787` as the URL, and log in with `rstudio` as both the user name and password.

Use Control-C to stop the containers.
To restart the containers, run the same script again.
### Deleting containers, images, and volumes
If you are modifying the Dockerfiles, which are used to build the images, and something goes wrong, or if you get into trouble for whatever reason, you often want to destroy the containers and images. To do this, run the corresponding script from the `rspark` directory.
Press the Return or Enter key to accept the interactive choices. Note that containers are running instances of images; therefore, you should stop and delete containers before deleting images and volumes.
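Equivalent cleanup can also be done directly with standard Docker commands. A rough sketch (these are generic Docker commands, not this repo's script, and they remove *all* containers, not just rspark's):

```shell
# Generic Docker cleanup, not the repo's own script.
docker stop $(docker ps -q)   # stop all running containers
docker rm $(docker ps -aq)    # remove all containers
docker image prune -a         # remove unused images (asks for confirmation)
docker volume prune           # remove unused volumes (asks for confirmation)
```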
At this point, you will need to rebuild rspark from scratch, i.e., execute the build script again.
You can ssh into any container and run a bash shell to debug issues within the container. First, you need to identify the name or ID of the container you wish to enter. The following Docker command provides this information:
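The command referred to is presumably the standard Docker container listing:

```shell
# List running containers with their IDs, image names, and container names
docker ps
```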
Then you can run an interactive bash shell via the `docker exec` command by specifying the container name and the program name `bash`. For example, to run bash inside the `rstudio` container, execute the following:

```bash
docker exec -it rspark_rstudio_1 bash
```

which gives you root access.
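You can also run a one-off command in a container without opening an interactive shell. For example (the `/home/rstudio` path is an assumption based on typical RStudio images, not confirmed for this repo):

```shell
# Run a single command inside the container and return immediately
docker exec rspark_rstudio_1 ls /home/rstudio
```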
## Running rspark from the DockerHub Images
The method presented in this section is the preferred way to run rspark; the approaches above are primarily for development purposes. However, using the pre-built images on DockerHub only allows the single-node version of Spark to be built. The "cluster" version must be built from scratch, at least for now.
The rspark Docker images, built from the rspark repo, are available on DockerHub. Go to DockerHub and search for `rspark`.
However, it is not necessary to manually download the tagged images from DockerHub; the rspark-docker repo will do this for you. The directions for building and launching the Docker containers are available in that repo's README file. The start shell script will download the Docker images from DockerHub and launch the containers; if any of the images have been upgraded, the newer versions will be used.
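As a rough sketch, a start script of this kind typically wraps standard Compose commands (assumed here, not taken from the rspark-docker repo):

```shell
# Hypothetical sketch of a start script; not the repo's actual commands.
docker-compose pull  # fetch the latest tagged images from DockerHub
docker-compose up    # launch the containers
```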
## rspark using Vagrant
## rspark on AWS
Amazon Web Services (AWS) allows users to create virtual machines in the AWS cloud, among other services. The cost of using AWS is based on the computing power required, the amount of storage, and the run time of the service; for the services needed in this course, the cost is minimal. Even so, to use rspark with AWS, you need to be prepared to pay for the cloud services you utilize unless you have received a free academic allocation.
- Amazon Web Services Account
- Modern Browser (Safari, Chrome, or Firefox)
Detailed instructions for running rspark on AWS are available here:
The pre-built image on AWS is called `rsparkbox`, and it contains the rspark-tutorial. Once you follow the steps in the README, connect to your rspark server through a web browser.
Note: If you are given an IP address for your instance, enter it into your browser's URL bar as `http://0.0.0.0:8787`, replacing `0.0.0.0` with the IP address of your instance. In this case you do not need an AWS account. Log into RStudio with the credentials:

username: rstudio
password: rstudio
IMPORTANT: When you have finished using rspark, you need to stop or terminate your EC2 instance. If you neglect to do this, Amazon will charge you for the entire time it is left running.
Building the Docker images needs to be more robust. In particular, the container dependencies can be violated depending on the processing power of the installation computer and the network speeds, among other issues not under the control of the developer.

The plan is to scale rspark up, e.g., the size of the Spark cluster, and to manage dependencies using Kubernetes. Kubernetes support is being built into Spark by the core team, and an external team is working on scaling up HDFS for distributed storage.