Spark examples repo for getting started

Insight from following course

Spark Documentation

Setting it up (Pre-reqs)

Or start with a Docker image with Spark preinstalled


Starting Spark with Scala

Open a command prompt or shell and run spark-shell


All Spark jobs begin with sc (the SparkContext), which spark-shell supplies.

The shell creates two contexts:

  • sc - a special interpreter-aware SparkContext, already created for you in the Spark shell
  • sqlContext - the entry point for working with structured data (rows and columns) in Spark
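As a rough illustration of the two contexts, the sketch below uses each one on a tiny in-memory dataset. It assumes a Spark 1.x style shell where both sc and sqlContext are predefined; the names numbers, total, and people and the sample rows are made up for the example.

```scala
// Run inside spark-shell, where sc and sqlContext are assumed to be predefined.

// sc: distribute a local collection as an RDD and aggregate it.
val numbers = sc.parallelize(1 to 5)
val total = numbers.reduce(_ + _)   // 1 + 2 + 3 + 4 + 5 = 15

// sqlContext: build a DataFrame (rows and named columns) from local tuples.
// The sample rows are hypothetical.
val people = sqlContext
  .createDataFrame(Seq(("Ann", 34), ("Bob", 45)))
  .toDF("name", "age")

// Keep only rows where age is above 40 and print them as a table.
people.filter("age > 40").show()
```

In newer shells the SparkSession (spark) plays the role of sqlContext, so this sketch may need adjusting there.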

Using Spark shell

:help for help
Press Tab twice after a method name (e.g. sc.parallelize) to see its signature instead of pressing Return

Simple Example

Read in a text file and show its first line

val textFile = sc.textFile("file:///<SPARK_HOME>/")
textFile.first()

res0: String = # Apache Spark

Tokenize the file data on spaces

 val tokenizedFileData = textFile.flatMap(line=>line.split(" "))

This is the Map in Map/Reduce
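To see why flatMap rather than map is used for tokenizing, here is a small sketch on two hypothetical lines of text (the lines, and the names perLine and words, are made up for illustration). With map each line becomes one array; flatMap flattens those arrays into a single collection of words.

```scala
// Run inside spark-shell, where sc is assumed to be predefined.
val lines = sc.parallelize(Seq("to be or", "not to be"))

// map keeps one element per input line: an RDD of 2 arrays.
val perLine = lines.map(line => line.split(" "))

// flatMap flattens the arrays: an RDD of 6 individual words.
val words = lines.flatMap(line => line.split(" "))

println(perLine.count())   // 2
println(words.count())     // 6
```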

Count the instances of each Word

Here the word is the key and the value is the count

 val countPrep = tokenizedFileData.map(word=>(word,1))
 val counts = countPrep.reduceByKey((accumValue, newValue)=>accumValue + newValue)

This is the Reduce in Map/Reduce
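A small worked example of the reduce step, using a hypothetical five-word input instead of the file: every word is paired with 1, and reduceByKey adds up the 1s for each distinct key.

```scala
// Run inside spark-shell, where sc is assumed to be predefined.
// Hypothetical input: "a" appears three times, "b" and "c" once each.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(word => (word, 1))

// For each key, combine the values two at a time with the given function.
val totals = pairs.reduceByKey((accumValue, newValue) => accumValue + newValue)

// Collect to the driver as a Map for easy lookup (fine for tiny data only).
val totalsByWord = totals.collect().toMap
println(totalsByWord("a"))   // 3
```

The order in which keys come back from collect() is not guaranteed, which is why the result is looked up by key here rather than printed as a list.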

Sort in descending order (_2 is the second element of the tuple)

 val sortedCounts = counts.sortBy(kvPair=>kvPair._2, false)
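The same sort on a tiny hypothetical dataset, as a sketch: the second argument of sortBy is the ascending flag, so passing false (written here with its parameter name) puts the largest counts first.

```scala
// Run inside spark-shell, where sc is assumed to be predefined.
// Hypothetical (word, count) pairs.
val sample = sc.parallelize(Seq(("spark", 3), ("scala", 7), ("shell", 1)))

// Sort by the count (second tuple element), largest first.
val byCountDesc = sample.sortBy(kvPair => kvPair._2, ascending = false)

// take(n) returns the first n elements in sorted order.
val topTwo = byCountDesc.take(2)
topTwo.foreach(println)
// (scala,7)
// (spark,3)
```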

Write the file and check the output parts
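A minimal sketch of this step, assuming a small stand-in RDD and a temporary output directory (both hypothetical, as are the names sampleCounts, outputDir, and partFiles). saveAsTextFile writes a directory rather than a single file, with one part file per partition.

```scala
// Run inside spark-shell, where sc is assumed to be predefined.
import java.nio.file.Files

// Hypothetical stand-in for the sorted word counts.
val sampleCounts = sc.parallelize(Seq(("the", 21), ("spark", 14), ("to", 9)))

// saveAsTextFile refuses to overwrite an existing directory,
// so write to a fresh path under a new temporary directory.
val outputDir = Files.createTempDirectory("wordcount").resolve("output")
sampleCounts.saveAsTextFile(outputDir.toUri.toString)

// List the part-NNNNN files Spark produced (one per partition).
val partFiles = outputDir.toFile.list().filter(_.startsWith("part-")).sorted
partFiles.foreach(println)
```

Each line in a part file is the string form of one tuple, e.g. (the,21). Reading the directory back with sc.textFile returns all parts together.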


An even simpler way using the API