Data Science box (DSbox)

This is a Linux (Ubuntu) box deployed with Vagrant that includes the following Data Science apps: Spark, Jupyter, R+RStudio, Zeppelin, Python 2 & 3, Java, and Scala.

It has been successfully tested on both ubuntu/trusty32 and ubuntu/trusty64 systems.

Pre-deployment steps

To install the box, follow these steps:

  1. Install VirtualBox: if you use any other provider, you must change the provider parameter in the Vagrantfile.
  2. Install Vagrant.
  3. Install Git.
  4. Clone this repository to a specific folder:
$ git clone https://github.com/mcolebrook/dsbox.git <YOUR_BOX_FOLDER>

Remark: some Windows users have reported line-ending issues (see GitHub Help) after cloning and starting up the box. To avoid this problem, BEFORE CLONING the box, just type:

$ git config --global core.autocrlf input
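You can check the setting afterwards; when called without a value, git prints the current one:

$ git config --global core.autocrlf
input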

Config parameters

Go to <YOUR_BOX_FOLDER>, and edit the Vagrantfile to change the parameters:

| Parameter | Description | Default value |
|---|---|---|
| provider | VM provider | "virtualbox" |
| boxMaster | OS of the master node | "ubuntu/trusty32" |
| boxSlave | OS of the slave nodes | "ubuntu/trusty32" |
| masterRAM | Master's RAM in MB | 3072 |
| masterCPU | Master's CPU cores | 2 |
| masterName | Name of the master node, used in scripts/spark-env.sh | "spark-master" |
| masterIP | Private IP of the master node | "10.20.30.100" |
| slaves | Number of slaves (max. 9) | 2 |
| slaveRAM | Each slave's RAM in MB | 2048 |
| slaveCPU | Each slave's CPU cores | 2 |
| slaveName | Base name for slave nodes | "spark-slave" |
| slavesIP | Base private IP for slave nodes | "10.20.30.10" |
| IPythonPort | IPython/Jupyter port to forward (set in the Jupyter/IPython config file) | 8001 |
| SparkMasterPort | SPARK_MASTER_WEBUI_PORT (Spark master web UI) | 8080 |
| SparkWorkerPort | SPARK_WORKER_WEBUI_PORT (Spark worker web UI) | 8081 |
| SparkAppPort | Spark application web UI port | 4040 |
| RStudioPort | RStudio Server port | 8787 |
| ZeppelinPort | Zeppelin port (its default, 8080, conflicts with Spark) | 8888 |
| SlidesPort | Port used by jupyter-nbconvert <file.ipynb> --to slides --post serve | 8000 |
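As a minimal sketch (assuming the parameters appear in the Vagrantfile as plain name = value assignments; check your copy for the exact syntax), you could resize the cluster from the shell like this:

$ cd <YOUR_BOX_FOLDER>
# hypothetical example: raise the number of slaves from 2 to 3
$ sed -i 's/^slaves *= *2/slaves = 3/' Vagrantfile
# and give each slave 4 GB of RAM instead of 2 GB
$ sed -i 's/^slaveRAM *= *2048/slaveRAM = 4096/' Vagrantfile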

Starting up and shutting down the cluster

You have several ways to start up the cluster.

Deploy the master and all the slaves

To deploy the cluster with one master node and two slave nodes by default:

$ vagrant up
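While the machines come up (or at any later point), you can check their state from the host with vagrant's built-in status command:

$ vagrant status
# lists spark-master and each spark-slave-N together with its current state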

Bear in mind that the whole process (bringing master+slaves up and provisioning them) may take several minutes! On my Intel Core i7-4790 CPU (4 cores @ 3.60 GHz) with 32 GB RAM, I got the following times:

Master

==> spark-master: END provisioning 2016/**/** **:**:**
==> spark-master: TOTAL TIME: 788 seconds

Slaves

==> spark-slave-1: END provisioning 2016/**/** **:**:**
==> spark-slave-1: TOTAL TIME: 228 seconds

Deploy only the master

In case you only want to deploy the master node:

$ vagrant up spark-master
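Slaves can also be brought up individually; their machine names are built from slaveName plus an index, as in the provisioning output above:

$ vagrant up spark-slave-1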

Halt the cluster

To shutdown the whole cluster:

$ vagrant halt

Halt only the master node

If you only want to halt the master node:

$ vagrant halt spark-master

Delete the whole cluster (master + slaves)

In case you want to delete the whole cluster:

$ vagrant destroy
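vagrant destroy asks for confirmation for each machine; add the -f flag to skip the prompts:

$ vagrant destroy -f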

Start/Stop Spark

To start up the Spark cluster (master + slaves):

$ vagrant ssh spark-master
...
$ $SPARK_HOME/sbin/start-all.sh

You can also start the cluster up from the host machine by typing:

$ vagrant ssh spark-master -c "bash /opt/spark/sbin/start-all.sh"

To halt the cluster, just run stop-all.sh. Remember that you can access the Spark web UIs on the ports listed in the configuration table above.
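Mirroring the start command above, you can also stop Spark without logging into the box:

$ vagrant ssh spark-master -c "bash /opt/spark/sbin/stop-all.sh"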

Starting Jupyter

The best way to start the Jupyter notebook is the following:

$ vagrant ssh spark-master
...
$ cd /vagrant/jupyter-notebooks
$ jupyter-notebook

Inside the folder jupyter-notebooks you will find some sample notebooks. Then go to your favorite browser and type localhost:8001. You can also start the Jupyter notebook with pyspark as the default interpreter by using the script scripts/start-pyspark-notebook.sh.
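For instance, to launch that pyspark-enabled notebook from inside the master node (the scripts folder is shared at /vagrant/scripts):

$ vagrant ssh spark-master
...
$ bash /vagrant/scripts/start-pyspark-notebook.sh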

To stop the notebook, just press Ctrl+C.

Starting RStudio

The RStudio Server daemon should already be running in the background, so you only have to type localhost:8787 in your browser. In order to work with Spark, you have to run the commands inside the config.R script. You may find this RStudio cheat sheet helpful.
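If RStudio Server does not respond on that port, you can restart the daemon from the host (this assumes the standard rstudio-server service command that the server installs):

$ vagrant ssh spark-master -c "sudo rstudio-server restart"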

Installing Zeppelin

I recommend building Zeppelin separately from the provisioning of the master node, since the compilation takes a long time to complete. To do so, run the following lines and wait until all modules are built:

$ vagrant ssh spark-master
$ cd /vagrant/scripts
$ sudo ./60-zeppelin.sh

Once all the modules are compiled inside the spark-master node, you can start Zeppelin by typing:

$ sudo env "PATH=$PATH" /opt/zeppelin/bin/zeppelin-daemon.sh start

Remember to use the same command with stop to halt the daemon. Alternatively, you can run the script directly from the host machine:

$ vagrant ssh spark-master -c "bash /opt/zeppelin/bin/zeppelin-daemon.sh start"
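Similarly, to stop the daemon from the host machine:

$ vagrant ssh spark-master -c "bash /opt/zeppelin/bin/zeppelin-daemon.sh stop"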

Finally, to start working with Zeppelin you may use the notebooks inside the folder /vagrant/zeppelin_notebooks.

Installing scikit-learn and tensorflow

You may install these two libraries by running the following lines:

$ vagrant ssh spark-master
$ cd /vagrant/scripts
$ sudo ./61-scikit-learn-tensorflow.sh

Remember that TensorFlow is available for 64-bit systems only.
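A quick sanity check is to import both libraries from the command line (assuming the script installs them into the system Python):

$ vagrant ssh spark-master -c "python -c 'import sklearn, tensorflow'"
# no output means both imports succeeded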

License

GNU. Please refer to the LICENSE file in this repository.

Acknowledgements (in alphabetical order)

Thanks to the following people for sharing their projects: Adobe Research, Damián Avila, Dan Koch, Felix Cheung, Francisco Javier Pulido, Gustavo Arjones, IBM Cloud Emerging Technologies, Jee Vang, Jeffrey Thompson, José A. Dianes, Maloy Manna, NGUYEN Trong Khoa, and Peng Cheng.

Thanks also to the following people for pointing out some bugs: Carlos Pérez-González, Christos Iraklis Tsatsoulis.
