Data engineering using Spark-Scala - Hands-on
Tools used: Databricks, Zeppelin • Programming languages: Scala, Spark SQL
The following gist is intended for Data Engineers. It focuses on
If we want to handle real-time data processing, this gist is definitely worth looking into.
We'll learn how to install and use Scala on a Linux OS.
We'll learn the latest Spark 2.0 methods and updates to the MLlib library working with Spark SQL and DataFrames.
Please fork it if you find it relevant for your educational or professional path.
How this gist is structured
This gist is structured into 2 parts.
Part 1. Installation of JVM, Spark, Scala on a Linux OS
Related section: SCALA_SPARK_INSTALL
Part 2. Spark-Scala programming using Atom, Databricks, Zeppelin
Related sections: SPARK_SCALA_Programming, SPARK_SCALA_entry, SPARK_SCALA_intermediary
Notes related to Spark and Scala
Spark is one of the most powerful Big Data tools.
Spark runs programs up to 100x faster than Hadoop's MapReduce.
Spark can use data stored in
MapReduce requires files to be stored in HDFS. Spark does not.
Spark can be up to 100x faster than MapReduce because it keeps intermediate results in memory, whereas MapReduce writes them to disk.
MapReduce (Hadoop) writes most data to disk after each map and reduce step.
Spark keeps most of the data in memory after each transformation.
At the core of Spark there are Resilient Distributed Datasets, also known as RDDs.
An RDD has 4 main features:
- Distributed collection of data
- Fault tolerant (the "resilient" in the name)
- Parallel operations on partitioned data
- An RDD can use many data sources
RDDs are immutable, cacheable and lazily evaluated.
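To make "immutable" and "cacheable" concrete, here is a minimal sketch. It assumes Spark is on the classpath and runs in local mode; the object name `RddBasics`, the app name and the sample numbers are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object, not part of the gist's own code.
object RddBasics {

  // Returns (elements of the source RDD, elements of the derived RDD).
  def run(spark: SparkSession): (Seq[Int], Seq[Int]) = {
    val sc = spark.sparkContext

    // Distribute a small local collection over 2 partitions.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

    // map() does not modify `numbers`; it describes a new RDD.
    // cache() asks Spark to keep the computed partitions in memory.
    val doubled = numbers.map(_ * 2).cache()

    // The first action computes `doubled` and fills the cache;
    // later actions on `doubled` can reuse the cached partitions.
    doubled.count()

    (numbers.collect().toSeq, doubled.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")      // assumption: local mode with 2 threads
      .appName("rdd-basics")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

The source RDD still holds its original elements after the `map`, which is what immutability means in practice: every transformation yields a new RDD.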
There are 2 types of RDD operations:
- Transformations: recipes to follow; they describe a new RDD but compute nothing yet
- Actions: carry out the recipe's instructions and return a result
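The difference between the two kinds of operations can be observed with a small experiment. The sketch below assumes Spark is on the classpath and runs in local mode; `LazyOps` and the accumulator name are hypothetical. The accumulator is only there to count how often the function passed to `map` actually runs.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object, not part of the gist's own code.
object LazyOps {

  // Returns (map calls before the action, sum of squares, map calls after the action).
  def run(spark: SparkSession): (Long, Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map calls")

    // Transformation: only records the recipe, nothing is computed here.
    val squares = sc.parallelize(1 to 5).map { n =>
      calls.add(1)
      n * n
    }
    val beforeAction: Long = calls.value // no task has run yet

    // Action: triggers the computation and returns a value to the driver.
    val total = squares.reduce(_ + _)
    val afterAction: Long = calls.value  // one call per element, assuming no task retries

    (beforeAction, total.toLong, afterAction)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")      // assumption: local mode with 2 threads
      .appName("lazy-ops")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Counting with an accumulator inside a transformation is fine for a demonstration like this one, but Spark only guarantees exactly-once accumulator updates inside actions, so it should not be relied on for real metrics.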
Environment options for Scala and Spark:
- Text editors, such as Atom
- IDEs (Integrated Development Environments), such as
- Notebooks, such as Databricks and Zeppelin
Scala is a general-purpose programming language.
Scala was designed by Martin Odersky (Ecole Polytechnique Fédérale de Lausanne).
Scala source code is intended to be compiled to Java bytecode to run on a Java Virtual Machine (JVM).
Java libraries can be used directly in Scala.
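As a small illustration of that interoperability, the sketch below calls two classes from the Java standard library (`java.time.LocalDate` and `java.util.ArrayList`) from plain Scala, with no wrapper code. It assumes Java 8 or later; the object and method names are made up for this example.

```scala
import java.time.LocalDate
import java.util.ArrayList

// Hypothetical example object, not part of the gist's own code.
object JavaInterop {

  // Uses the Java date API to get the number of days in February of `year`.
  def daysInFebruary(year: Int): Int =
    LocalDate.of(year, 2, 1).lengthOfMonth()

  // Builds a plain java.util.ArrayList from Scala varargs.
  def javaListOf(items: String*): ArrayList[String] = {
    val list = new ArrayList[String]()
    items.foreach(item => list.add(item))
    list
  }

  def main(args: Array[String]): Unit = {
    println(daysInFebruary(2024))                 // leap year
    println(javaListOf("spark", "scala").size())
  }
}
```

Java classes are imported and instantiated exactly like Scala ones, which is why Spark's Scala API can sit on top of the wider JVM ecosystem.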
I've uploaded a .zip which contains useful slides
For storing datasets and granting access to them, I've used
- Isaac Arnault - AWS Cloud series - Related tags: #EC2 #TLS #AWSCLI #Linux