Data engineering using Spark-Scala - Hands-on

Tools used: Databricks, Zeppelin • Programming languages: Scala, Spark SQL

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

This gist is intended for Data Engineers. It focuses on Spark and Scala programming.
If you need to handle batch and real-time data processing, this gist is worth looking into.
We'll learn how to install and use Spark and Scala on a Linux system.
We'll learn the Spark 2.0 methods and the updates to the MLlib library for working with Spark SQL and DataFrames. Please fork it if you find it relevant for your educational or professional path.

How this gist is structured

This gist is structured into 2 parts.

Part 1. Installation of JVM, Spark, Scala on a Linux OS

Related section: SCALA_SPARK_INSTALL

Part 2. Spark-Scala programming using Atom, Databricks, Zeppelin

Related sections: SPARK_SCALA_Programming, SPARK_SCALA_entry, SPARK_SCALA_intermediary

Notes related to Spark and Scala

Spark

Spark is one of the most powerful Big Data tools.
Spark can run programs up to 100x faster than Hadoop's MapReduce.
Spark can use data stored in Cassandra, Amazon S3, Hadoop's HDFS, etc.
MapReduce requires files to be stored in HDFS; Spark does not.
Spark reaches this speedup on in-memory workloads because it keeps intermediate results in memory, whereas MapReduce writes them to disk between steps.

Data processing
MapReduce (Hadoop) writes most data to disk after each Map and Reduce operation.
Spark keeps most of the data in memory after each transformation.
At the core of Spark are Resilient Distributed Datasets, also known as RDDs.
An RDD has 4 main features:

  1. Distributed collection of data
  2. Fault-tolerant
  3. Partitioned, so that operations run in parallel
  4. An RDD can use many data sources

RDDs are immutable, cacheable and lazily evaluated. There are 2 types of RDD operations:

  1. Transformations: recipes to follow; they describe a new RDD without computing it
  2. Actions: carry out the recipe's instructions and return a result
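
The difference between transformations and actions can be sketched in a few lines. This is a minimal illustration, not code taken from the related sections: the object name `RddLazyDemo`, the helper `evenSquares` and the `local[*]` master are assumptions chosen for a local test run, and the Spark libraries are assumed to be on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddLazyDemo {
  // Transformations only: this builds the recipe (lineage) and computes nothing.
  // Hypothetical helper, named here for illustration.
  def evenSquares(numbers: RDD[Int]): RDD[Int] =
    numbers.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; "local[*]" uses all cores of this machine.
    val spark = SparkSession.builder()
      .appName("rdd-lazy-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Distribute a local collection across 4 partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Still lazy: Spark has only recorded what should be done.
    val squares = evenSquares(numbers)

    // Mark the RDD to be kept in memory once it has been computed.
    squares.cache()

    // Actions: trigger the computation and return results to the driver.
    val total = squares.sum()     // first action computes and caches the RDD
    val items = squares.collect() // second action reuses the cached partitions

    println(s"total = $total, items = ${items.mkString(", ")}")
    spark.stop()
  }
}
```

In a notebook (Zeppelin, Databricks) or in `spark-shell`, a `SparkContext` is usually already available as `sc`, so only the body of `main` after the session setup would be needed.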

Environment options for Scala and Spark

  1. Text editors, such as Sublime Text and Atom
  2. IDEs (Integrated Development Environments), such as IntelliJ and Eclipse
  3. Notebooks, such as Jupyter, Zeppelin and Databricks

Scala

Scala is a general purpose programming language.
Scala was designed by Martin Odersky (École Polytechnique Fédérale de Lausanne).
Scala source code is intended to be compiled to Java bytecode to run on a Java Virtual Machine (JVM).
Java libraries can be used directly in Scala.
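
As a small sketch of that interoperability, the example below calls classes from the JDK (`java.time` and `java.util.ArrayList`) from Scala without any wrapper. The object name `JavaInteropDemo` and the helper `daysBetween` are made up for illustration.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import java.util.ArrayList

object JavaInteropDemo {
  // Days between two ISO-8601 dates, computed with the JDK's java.time API.
  def daysBetween(start: String, end: String): Long =
    ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end))

  def main(args: Array[String]): Unit = {
    // A plain java.util.ArrayList, instantiated and used like any Scala class.
    val tools = new ArrayList[String]()
    tools.add("Databricks")
    tools.add("Zeppelin")
    println(s"${tools.size()} tools, first: ${tools.get(0)}")

    // 2020 is a leap year: 31 days in January + 29 in February = 60.
    println(daysBetween("2020-01-01", "2020-03-01"))
  }
}
```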

Knowledge base

I've uploaded a .zip which contains useful slides on Machine Learning, Spark and Scala.

Storing

For storing datasets and granting access to them, I've used AWS.

Author

  • Isaac Arnault - AWS Cloud series - Related tags: #EC2 #TLS #AWSCLI #Linux