README

This is a Spark getting started project. It mainly aims to help understanding Spark fundamentals and the impact of using different algorithms when coding Spark-based applications.

What is this for?

With this project you will be able to visualize what means using coalesce(2) against using coalesce(4) in terms of performance.

You first clone it. Then do a commit using groupBy, then a commit but using reduceBy. When done with the versions of your app (versions that you want to compare against), then you execute batch.bash providing the target commits. The framework will run the application in its different versions and, once done, you will be able to compare the performances of all of them using the beautiful Spark UI portal (DAGS, memory usage, execution time, etc.).

The input data used will be automatically generated the first time you launch the application. It will remain the same for all runs.

This is the related Google Document where I log my conclusions from some basic experiments.

How to use it?

To get started you need to first download Spark (v1.4 or older). Then launch its history server:

cd $SPARK_HOME
./sbin/start-history-server.sh

Here you will be able to visualize the performances of the execution of different versions of your application. It will allow you to compare Spark performances as you change your algorithms.

The history server can be browsed here:

http://localhost:18080/

The history server will not show anything at the beginning. Now you can launch your tests using a range of commit: from a specified commit (through it's GIT commit ID) until now, using only the commits whose message begins with the string TESTME. Launch the batch test:

cd $THIS_APP_HOME
./batch.bash <from_commit_id>

How to run a single test?

To launch a single test run (for debugging purposes):

cd $THIS_APP_HOME
./run.bash

See localhost:4040 web interface.

Name		Name	Last commit message	Last commit date
Latest commit History 110 Commits
conf		conf
data/scripts		data/scripts
project		project
src		src
.gitignore		.gitignore
README.md		README.md
batch.bash		batch.bash
build.sbt		build.sbt
run.bash		run.bash

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

conf

conf

data/scripts

data/scripts

project

project

src

src

.gitignore

.gitignore

README.md

README.md

batch.bash

batch.bash

build.sbt

build.sbt

run.bash

run.bash

Repository files navigation

README

What is this for?

How to use it?

How to run a single test?

About

Releases

Packages

Languages

Navigation Menu

mauriciojost/spark-example

Folders and files

Latest commit

History

Repository files navigation

README

What is this for?

How to use it?

How to run a single test?

About

Resources

Stars

Watchers

Forks

Languages