# Getting Started With Spark using Scala

The SparkContext is created for you when the Toree kernel initializes and is available as `sc`.

In [1]:
print(sc.version)
sc

2.3.0

## Creating RDDs

For demonstration purposes, we create an RDD here by calling `sc.parallelize()` on a local range, asking Spark to split it into 4 partitions.

In [2]:
val data = 1 to 30 
val xrangeRDD = sc.parallelize(data, 4)

data = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
xrangeRDD = ParallelCollectionRDD[0] at parallelize at <console>:28


ParallelCollectionRDD[0] at parallelize at <console>:28
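The second argument to `parallelize` is the number of partitions. A minimal sketch of how to check the resulting layout follows. The names `session`, `ctx` and `numbers` are illustrative, and the `local[2]` master is only a fallback assumption for running outside the notebook; inside the Toree kernel `getOrCreate()` should hand back the existing session.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("partition-sketch").getOrCreate()
val ctx = session.sparkContext

val numbers = ctx.parallelize(1 to 30, 4)

// Number of partitions the RDD was split into
val partitionCount = numbers.getNumPartitions

// glom() turns each partition into an array, so its size can be inspected
val partitionSizes = numbers.glom().map(_.length).collect()

println(s"partitions: $partitionCount, sizes: ${partitionSizes.mkString(", ")}")
```

`glom()` collects whole partitions, so it is best kept for small demonstration data like this.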

### Applying Transformations

### The Scala API

In [3]:
val subRDD = xrangeRDD.map(x => x - 1)
val filteredRDD = subRDD.filter(x => x < 10)

subRDD = MapPartitionsRDD[1] at map at <console>:30
filteredRDD = MapPartitionsRDD[2] at filter at <console>:31


MapPartitionsRDD[2] at filter at <console>:31
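Note that the cell above only prints RDD descriptions: `map` and `filter` are lazy and do not process any data by themselves. One way to observe this is with an accumulator that counts how often the mapping function runs. This is a sketch under the assumption of a single local run without task retries (retries can increment an accumulator more than once); `session`, `ctx` and `calls` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("lazy-sketch").getOrCreate()
val ctx = session.sparkContext

// Counts invocations of the function passed to map
val calls = ctx.longAccumulator("map calls")

val mapped = ctx.parallelize(1 to 30, 4).map { x => calls.add(1); x - 1 }
val filtered = mapped.filter(_ < 10)

// No action has run yet, so the mapping function has not been called
val callsBeforeAction = calls.sum

// count() is an action: it triggers the map and the filter
val kept = filtered.count()
val callsAfterAction = calls.sum

println(s"before: $callsBeforeAction, after: $callsAfterAction, kept: $kept")
```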

## Actions 

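Transformations are lazy: they only describe a new RDD. An action, such as `count()` or `collect()`, triggers the computation and returns a result to the driver.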

In [4]:
println("Count: " + filteredRDD.count())
filteredRDD.collect()

Count: 10


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
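`count()` and `collect()` are not the only actions. The sketch below rebuilds the same filtered RDD and applies a few others (`take`, `reduce`, `max`). The names `session`, `ctx` and `small` are illustrative, and the `local[2]` master is an assumed fallback for running outside the notebook.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("actions-sketch").getOrCreate()
val ctx = session.sparkContext

// Same pipeline as above: the values 0 to 9
val small = ctx.parallelize(1 to 30, 4).map(_ - 1).filter(_ < 10)

// take(n) returns only the first n elements to the driver
val firstThree = small.take(3)   // Array(0, 1, 2)

// reduce combines all elements with an associative function
val total = small.reduce(_ + _)  // 45

// max() returns the largest element
val largest = small.max()        // 9

println(s"first three: ${firstThree.mkString(", ")}, total: $total, largest: $largest")
```

`take` is usually a safer way to peek at a large RDD than `collect()`, which brings every element to the driver.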

In [5]:
val test = sc.parallelize(1 to 50000,50)
// mark this RDD for caching (lazy: nothing is stored until an action runs)
test.cache()

val t1 = System.nanoTime()
// the first count triggers the computation *and* populates the cache
test.count
val dt1 = (System.nanoTime() - t1).toDouble/1.0e9

val t2 = System.nanoTime()
// the second count reads the data back from the cache
test.count
val dt2 = (System.nanoTime() - t2).toDouble/1.0e9

test = ParallelCollectionRDD[3] at parallelize at <console>:27
t1 = 852300669254662
dt1 = 0.242165945
t2 = 852300911432550
dt2 = 0.084373291


0.084373291
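A single timed run like the one above is noisy, since JVM warm-up and scheduling overhead also affect the numbers. The caching state can be checked directly through the RDD's storage level, and the cache released with `unpersist()`. A minimal sketch, with `session`, `ctx` and `cached` as illustrative names and the `local[2]` master as an assumed fallback:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("cache-sketch").getOrCreate()
val ctx = session.sparkContext

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = ctx.parallelize(1 to 50000, 50).cache()
val levelWhileCached = cached.getStorageLevel

// The first action computes the partitions and stores them
val n = cached.count()

// Release the cached partitions once they are no longer needed
cached.unpersist()
val levelAfterUnpersist = cached.getStorageLevel

println(s"count: $n, cached as MEMORY_ONLY: ${levelWhileCached == StorageLevel.MEMORY_ONLY}")
```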

In [5]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()


spark = org.apache.spark.sql.SparkSession@546731b4


In [6]:
val df = spark.read.json("people.json")
df.show
df.printSchema

// Register the DataFrame as a SQL temporary view
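// (createOrReplaceTempView lets the cell be re-run; createTempView fails if the view already exists)
df.createOrReplaceTempView("people")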

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



df = [age: bigint, name: string]


[age: bigint, name: string]
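The contents of `people.json` are not included in this notebook. As a sketch, assuming the file holds exactly the three rows printed above, the same DataFrame can be built in memory, which makes the remaining queries reproducible without the file. The names `session`, `people` and the view name `people_inline` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.LongType

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("inline-people-sketch").getOrCreate()
import session.implicits._

// Option[Long] models the nullable age column (Michael has no age)
val people = Seq[(Option[Long], String)](
  (None, "Michael"),
  (Some(30L), "Andy"),
  (Some(19L), "Justin")
).toDF("age", "name")

people.printSchema()

// Register under a separate, hypothetical view name so the original "people" view is untouched
people.createOrReplaceTempView("people_inline")
val rowCount = session.sql("SELECT * FROM people_inline").count()
val ageType = people.schema("age").dataType
```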

In [7]:
// the DataFrame needs to be registered as a temporary view to use (Hive-like) SQL
println("Query 1: select statements")
df.select("name").show
spark.sql("SELECT name FROM people").show

Query 1: select statements
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [8]:
println("Query 2: filter statements")
df.filter(df("age") > 21).show
df.filter($"age" > 21).show
spark.sql("SELECT age, name FROM people WHERE age > 21").show

Query 2: filter statements
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
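The `$"age"` syntax relies on the session's implicits being in scope (`import spark.implicits._`), which this notebook appears to get from the kernel. Two alternatives that do not need the `$` interpolator are `functions.col` and a SQL expression string. The sketch below uses an in-memory stand-in for the `people` data (an assumption based on the rows shown earlier); `session`, `people`, `viaCol` and `viaExpr` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("filter-sketch").getOrCreate()
import session.implicits._

// In-memory stand-in for the people data (assumed from the output shown earlier)
val people = Seq[(Option[Long], String)](
  (None, "Michael"),
  (Some(30L), "Andy"),
  (Some(19L), "Justin")
).toDF("age", "name")

// Column object built with functions.col
val viaCol = people.filter(col("age") > 21).select("name").as[String].collect()

// The same predicate written as a SQL expression string
val viaExpr = people.where("age > 21").select("name").as[String].collect()

println(viaCol.mkString(", "))
```

In each form the row with a null age is dropped, because `null > 21` does not evaluate to true.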



In [9]:
println("Query 3: group by statements")
df.groupBy("age").count().show
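// COUNT(age) skips NULL values, so the null group reports 0 here, while groupBy("age").count() counts rows and reports 1
spark.sql("SELECT age, COUNT(age) as count FROM people GROUP BY age").show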

Query 3: group by statements
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    0|
|  30|    1|
+----+-----+
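The two tables differ in the `null` row because `groupBy("age").count()` counts rows, whereas SQL `COUNT(age)` counts only non-null values of `age`. `COUNT(*)` is the SQL counterpart of the DataFrame call. A sketch comparing the two, using an in-memory stand-in for the data (assumed from the rows shown earlier) and a hypothetical view name `people_counts`:

```scala
import org.apache.spark.sql.SparkSession

// Reuse the running session if there is one; the local master is an assumed fallback
val session = SparkSession.builder().master("local[2]").appName("count-sketch").getOrCreate()
import session.implicits._

// In-memory stand-in for the people data (assumed from the output shown earlier)
val people = Seq[(Option[Long], String)](
  (None, "Michael"),
  (Some(30L), "Andy"),
  (Some(19L), "Justin")
).toDF("age", "name")
people.createOrReplaceTempView("people_counts")

// COUNT(*) counts rows per group; COUNT(age) counts non-null ages per group
val counts = session.sql(
  """SELECT age, COUNT(*) AS row_count, COUNT(age) AS non_null_count
    |FROM people_counts
    |GROUP BY age""".stripMargin)
counts.show()

// Pull out the group whose age is NULL to compare both counts
val nullGroup = counts.filter("age IS NULL").collect()(0)
val nullRowCount = nullGroup.getLong(1)      // 1
val nullNonNullCount = nullGroup.getLong(2)  // 0
```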

