# Getting Started With Spark Using Scala

The Spark Context is created for you when the Toree kernel initializes, and is available as `sc`.

In [1]:
print(sc.version)
sc

2.3.3

org.apache.spark.SparkContext@4c339372

### The Scala API

Since Spark is written in Scala, it is a bit odd to think of Scala as just another API. But because Spark offers so many language APIs, it helps to think of Scala as the "best API", or simply as the native language. All of Spark's functionality is accessible from Scala, and you are strongly encouraged to use Scala for applications that need specialized code: you get more access to the low-level features and avoid the latencies associated with writing in other languages.

## Creating RDDs

For demonstration purposes, we create an RDD here by calling `sc.parallelize()`. The second argument is the number of partitions to split the data into.

In [2]:
val data = 1 to 30 
val xrangeRDD = sc.parallelize(data, 4)

data = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
xrangeRDD = ParallelCollectionRDD[5] at parallelize at <console>:30


ParallelCollectionRDD[5] at parallelize at <console>:30
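
To check how `parallelize` split the data, you can inspect the partitions directly. The following is a minimal sketch, not part of the original notebook. The names `session`, `ctx`, and `numbers` are made up for illustration, and the `local[*]` master is only an assumption that applies if no Spark session is already running (inside the Toree kernel the existing one is reused).

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption).
val session = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()
val ctx = session.sparkContext

// Same shape as the cell above: 30 integers spread over 4 partitions
val numbers = ctx.parallelize(1 to 30, 4)

// Number of partitions the RDD was created with
val numParts = numbers.getNumPartitions

// glom() turns each partition into one array, so mapping to length gives per-partition sizes
val partitionSizes = numbers.glom().map(_.length).collect()

println(s"partitions: $numParts, sizes: ${partitionSizes.mkString(", ")}")
```

Note that `glom()` followed by `collect()` is itself an action, so it runs a job. It is handy for small demonstration data but not something to call on a large RDD.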

## Transformations

A transformation is an operation on an RDD that results in a new RDD. The transformed RDD is created almost instantly because it is *lazily evaluated*: no calculation is carried out when the new RDD is defined. Instead, the RDD records a series of transformations, or computation instructions, that are only carried out when an action is called.

In [3]:
val subRDD = xrangeRDD.map(x => x - 1)
val filteredRDD = subRDD.filter(x => x < 10)

subRDD = MapPartitionsRDD[6] at map at <console>:32
filteredRDD = MapPartitionsRDD[7] at filter at <console>:33


MapPartitionsRDD[7] at filter at <console>:33
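
One way to see lazy evaluation in practice is to count how often the function passed to `map` actually runs. The sketch below is an illustration under assumptions, not part of the original notebook. It uses a `LongAccumulator` as the counter, and the names `session`, `ctx`, `mapCalls`, `shifted`, and `small` are hypothetical. The `local[*]` master only applies if no session exists yet.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption).
val session = SparkSession.builder().master("local[*]").appName("lazy-sketch").getOrCreate()
val ctx = session.sparkContext

// Accumulator that counts how many times the map function is invoked
val mapCalls = ctx.longAccumulator("mapCalls")

val shifted = ctx.parallelize(1 to 30, 4).map { x => mapCalls.add(1); x - 1 }
val small = shifted.filter(_ < 10)

// Only the lineage has been recorded so far; the map function has not run yet
val callsBeforeAction = mapCalls.sum
println(small.toDebugString)

// count() is an action, so it executes the recorded map and filter steps
val kept = small.count()
val callsAfterAction = mapCalls.sum

println(s"map calls before action: $callsBeforeAction, after: $callsAfterAction, rows kept: $kept")
```

`toDebugString` prints the chain of RDDs that would be computed, without running anything. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so treat the counter as a teaching aid rather than an exact measurement on a real cluster.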

## Actions 

An action triggers the pending computation and returns a result to the driver.

In [4]:
println("Count: " + filteredRDD.count())
filteredRDD.collect()

Count: 10


Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
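
`count()` and `collect()` are not the only actions. As a small sketch (not from the original notebook), here are a few others applied to the same pipeline. The names `session`, `ctx`, and `filtered` are illustrative, and the `local[*]` master is an assumption used only when no session is running.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption).
val session = SparkSession.builder().master("local[*]").appName("actions-sketch").getOrCreate()
val ctx = session.sparkContext

// Rebuild the pipeline from the cells above: the values 0 through 9 survive the filter
val filtered = ctx.parallelize(1 to 30, 4).map(_ - 1).filter(_ < 10)

// take(n) returns only the first n elements to the driver: Array(0, 1, 2)
val firstThree = filtered.take(3)

// reduce combines all elements with an associative function: 0 + 1 + ... + 9 = 45
val total = filtered.reduce(_ + _)

// min() returns the smallest element: 0
val smallest = filtered.min()

println(s"first three: ${firstThree.mkString(", ")}, total: $total, smallest: $smallest")
```

On large data, `take` is usually a safer way to peek at results than `collect`, which brings the entire RDD back to the driver.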

## Caching Data

This simple example shows how to create an RDD and cache it. The timing improvement looks small, mostly because the main latency here is in sending the results back to the driver. To see the actual computation time, browse to the Spark UI at host:4040. You'll see that the second calculation took much less time!

In [6]:
val test = sc.parallelize(1 to 50000, 2)
// mark this RDD for caching (nothing is stored until an action runs)
test.cache()

val t1 = System.nanoTime()
// the first count triggers evaluation of the RDD *and* populates the cache
test.count()
val dt1 = (System.nanoTime() - t1).toDouble / 1.0e9

val t2 = System.nanoTime()
// the second count reads the cached partitions
test.count()
val dt2 = (System.nanoTime() - t2).toDouble / 1.0e9

test = ParallelCollectionRDD[9] at parallelize at <console>:31
t1 = 300451456960206
dt1 = 0.083757667
t2 = 300451540722802
dt2 = 0.038629022


0.038629022
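
The timing pattern above can be wrapped in a small helper, and the cache can be inspected and released explicitly. This is a sketch under assumptions, not the notebook's own code: `timed`, `session`, `ctx`, and `numbers` are hypothetical names, and `persist(StorageLevel.MEMORY_ONLY)` is used because it is what `cache()` does for an RDD.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Reuse the running session if there is one; otherwise start a local one (assumption).
val session = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
val ctx = session.sparkContext

// Hypothetical helper: run a block and return its result with the elapsed seconds
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1.0e9)
}

// For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val numbers = ctx.parallelize(1 to 50000, 2).persist(StorageLevel.MEMORY_ONLY)

val (n1, secs1) = timed(numbers.count()) // computes the RDD and fills the cache
val (n2, secs2) = timed(numbers.count()) // served from the cached partitions

val levelWhileCached = numbers.getStorageLevel
println(s"first: $secs1 s, second: $secs2 s, storage: ${levelWhileCached.description}")

// Release the cached partitions once they are no longer needed
numbers.unpersist()
val levelAfter = numbers.getStorageLevel
```

Wall-clock timings on data this small are noisy, so the Spark UI remains the better place to compare the two jobs.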

## Spark SQL

To work with the extremely powerful SQL engine in Apache Spark, you will need to create a Spark Session.

In [7]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()


spark = org.apache.spark.sql.SparkSession@f76b9a9c


org.apache.spark.sql.SparkSession@f76b9a9c

### Create Your First DataFrame!

You can create a structured data set (much like a database table) in Spark. Once you have done that, you can use powerful SQL tools to query and join your DataFrames.

In [9]:
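import sys.process._

// Download the sample file; .! runs the command and returns its exit code
"wget https://raw.githubusercontent.com/nilmeier/DSatEnterpriseScale/master/people.json".!

val df = spark.read.json("people.json")
df.show
df.printSchema

// Register the DataFrame as a SQL temporary view
// (createOrReplaceTempView lets the cell be re-run without an "already exists" error)
df.createOrReplaceTempView("people")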

--2019-12-05 19:06:56--  https://raw.githubusercontent.com/nilmeier/DSatEnterpriseScale/master/people.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73 [text/plain]
Saving to: 'people.json'

     0K                                                       100% 13.0M=0s

2019-12-05 19:06:56 (13.0 MB/s) - 'people.json' saved [73/73]

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



df = [age: bigint, name: string]


lastException: Throwable = null


[age: bigint, name: string]
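
If the download is not possible in your environment, a DataFrame with the same shape can be built from an in-memory collection. This is a sketch, not the notebook's method: the rows are copied from the table printed above, and the names `session`, `inlineDF`, and the view name `people_inline` are made up for illustration. The `local[*]` master only applies when no session is already running.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption).
val session = SparkSession.builder().master("local[*]").appName("inline-df-sketch").getOrCreate()
import session.implicits._ // enables toDF and the $"column" syntax

// Option[Long] models the nullable age column; None becomes null in the DataFrame
val inlineDF = Seq[(Option[Long], String)](
  (None, "Michael"),
  (Some(30L), "Andy"),
  (Some(19L), "Justin")
).toDF("age", "name")

inlineDF.printSchema()

// Register under a separate, hypothetical view name so the "people" view is left untouched
inlineDF.createOrReplaceTempView("people_inline")

val over21 = session.sql("SELECT name FROM people_inline WHERE age > 21").count()
println(s"people over 21: $over21")
```

Using `Option` for the age keeps the missing value explicit in the Scala types instead of relying on a sentinel number.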

### Some More DataFrame Examples

In [10]:
// the DataFrame must be registered as a view (done above) to use (Hive-like) SQL
println("Query 1: select statements")
df.select("name").show
spark.sql("SELECT name FROM people").show

Query 1: select statements
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [11]:
println("Query 2: filter statements")
df.filter(df("age") > 21).show
df.filter($"age" > 21).show
spark.sql("SELECT age, name FROM people WHERE age > 21").show

Query 2: filter statements
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [12]:
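println("Query 3: group by statements")
df.groupBy("age").count().show
// COUNT(age) ignores nulls, so the null group reports 0 here, while groupBy().count() above counts rows
spark.sql("SELECT age, COUNT(age) as count FROM people GROUP BY age").show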

Query 3: group by statements
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    0|
|  30|    1|
+----+-----+
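
The two tables above differ in the `null` row because `groupBy("age").count()` counts rows, while `COUNT(age)` counts only non-null values of `age`. The sketch below puts both side by side in one query. It is an illustration under assumptions: the data is rebuilt in memory from the table shown earlier, and the names `session`, `people`, `people_sketch`, `row_count`, and `non_null_ages` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; otherwise start a local one (assumption).
val session = SparkSession.builder().master("local[*]").appName("count-sketch").getOrCreate()
import session.implicits._

// Same three rows as people.json, built in memory so the sketch is self-contained
val people = Seq[(Option[Long], String)](
  (None, "Michael"),
  (Some(30L), "Andy"),
  (Some(19L), "Justin")
).toDF("age", "name")
people.createOrReplaceTempView("people_sketch")

// COUNT(*) counts every row in the group; COUNT(age) skips rows where age is null
val counts = session.sql(
  """SELECT age, COUNT(*) AS row_count, COUNT(age) AS non_null_ages
    |FROM people_sketch
    |GROUP BY age""".stripMargin)
counts.show()

// The null group has one row but zero non-null ages
val nullGroup = counts.filter($"age".isNull).first()
val nullRowCount = nullGroup.getAs[Long]("row_count")
val nullNonNullAges = nullGroup.getAs[Long]("non_null_ages")
```

To make the SQL query match the DataFrame result, `COUNT(*)` is the equivalent of `groupBy(...).count()`.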

