Playground used to learn and experiment with Apache Spark using Scala. Do you want to learn Apache Spark? Try to solve the proposed exercises.
This repository contains a collection of exercises solved using Apache Spark and written in Scala. The solutions use public APIs or open datasets in order to experiment with the different Apache Spark APIs. The goal is to practice and learn. Inside this repository you will find RDD, Dataset, and DataFrame usage, Spark SQL queries, Spark Streaming examples, and some Machine Learning 😃.
The following table lists all the exercises solved in this repository, sorted by goal, with links to the solution and the specs.
# | Goal | Statement | Code | Tests |
---|---|---|---|---|
1 | Learn how to use `SparkContext` and some basic RDD methods. | El Quijote. | ElQuijote.scala | ElQuijoteSpec.scala |
2 | Learn how to parallelize Scala collections and work with them as RDDs. | Numerical series. | NumericalSeries.scala | NumericalSeriesSpec.scala |
3 | Learn how to use set transformations for RDDs. | Sets. | Sets.scala | SetsSpec.scala |
4 | Learn how to use pair RDDs. | Build executions. | BuildExecutions.scala | BuildExecutionsSpec.scala |
5 | Learn how to read and save data using different formats. | Read and write data. | ReadAndWrite.scala | ReadAndWriteSpec.scala |
6 | Learn how to use shared variables and numeric operations. | Movies. | Movies.scala | MoviesSpec.scala |
7 | Learn how to submit and execute Spark applications on a cluster. | RunningOnACluster. | - | - |
8 | Learn how to use Kryo serialization. | Kryo. | Kryo.scala | KryoSpec.scala |
9 | Learn how to use Spark SQL. | Fifa. | Fifa.scala | FifaSpec.scala |
10 | Learn how to use Spark Streaming. | Logs. | Logs.scala | - |
11 | Learn how to use Spark Machine Learning. | MachineLearning. | MachineLearning.scala | - |
12 | Learn how to use some less common Spark API transformations/actions. | Tweets. | Tweets.scala | TweetsSpec.scala |
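As a taste of what the first exercises cover, here is a minimal word-count sketch using `SparkContext` and basic RDD methods. It is illustrative only, not the repository's actual solution: the object name and sample lines are made up, and it assumes a local Spark 2.x setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Run Spark locally, using every available core.
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("en un lugar de la mancha", "de cuyo nombre"))
      val counts = lines
        .flatMap(_.split("\\s+")) // one element per word
        .map(word => (word, 1))   // pair RDD: (word, 1)
        .reduceByKey(_ + _)       // sum the counts for each word
        .collect()
        .toMap
      println(counts("de"))       // "de" appears twice in the sample lines
    } finally {
      sc.stop()
    }
  }
}
```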
To build and test this project you can execute `sbt test`. You can also use sbt interactive mode (just execute `sbt` in your terminal) and then use triggered execution to run your tests with the following commands inside the interactive mode:

```
~ test // Runs every test in your project
~ test-only *AnySpec // Runs specs matching the filter passed as a parameter
```
Spark applications are developed to run on a cluster. Before running your app you need to generate a `.jar` file you can submit to Spark to be executed. You can generate the `sparkPlayground.jar` file by executing `sbt assembly`. This will generate a binary file you can submit using the `spark-submit` command. Ensure your local Spark version is Spark `2.1.1`.
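For reference, a typical `spark-submit` invocation for the generated jar could look like the sketch below. The main class and the jar path are illustrative placeholders, not values taken from this repository; adjust them to the exercise you want to run and to where `sbt assembly` writes the jar in your build.

```shell
# Illustrative only: --class is a placeholder, and the jar path depends on your build.
spark-submit \
  --class playground.ElQuijote \
  --master "local[*]" \
  sparkPlayground.jar
```

The provided `submitToLocalSpark.sh` and `submitToDockerizedSpark.sh` scripts wrap this kind of invocation for you.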
You can submit this application to your local Spark installation by executing these commands:

```
sbt assembly
./submitToLocalSpark.sh
```
You can submit this application to a dockerized Spark cluster using these commands:

```
sbt assembly
cd docker
docker-compose up -d
cd ..
./submitToDockerizedSpark.sh
```
- Pedro Vicente Gómez Sánchez - pedrovicente.gomez@gmail.com
Copyright 2017 Pedro Vicente Gómez Sánchez
Licensed under the GNU General Public License, Version 3 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.gnu.org/licenses/gpl-3.0.en.html
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.