In [2]:
import $file.Qa
import Qa._

# Scala: Spark Introduction

### What is Spark?
Apache Spark is a distributed computation and data-processing engine. It distributes data over a large number of nodes and performs parallel computations over that cluster.

### Spark Computation


Spark operations concern Resilient Distributed Dataset (RDD) objects: representations of the data partitioned across multiple nodes, which can be operated on in parallel.

In Spark 1, programmers work with RDDs directly. In Spark 2, the DataFrame and SQL APIs are strongly preferred, but RDDs are still available. Since a lot of legacy code still exists, and RDDs still have their uses in Spark 2, we start with RDDs. However, Databricks (the company founded by Spark's original authors) strongly recommends DataFrames for new projects, along with Datasets in Scala, which are strongly typed DataFrames.

The general approach with RDDs is:

1. Create an RDD representation of the data set, distributing data across the cluster
2. Perform a **transformation** on the RDD representation, producing a new distributed dataset
3. Perform an **action** to extract the final result from the cluster, in a non-distributed format 

Transformations are lazy: they are not computed until an action requires a result.
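
The three steps can be sketched as follows. This is a minimal illustration rather than part of the original notebook: it assumes Spark is on the classpath and builds a local session (in this notebook a `spark` value already exists), and the names `numbers`, `evenSquares` and `total` are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()
val sc = spark.sparkContext

// 1. Create an RDD by distributing a local collection.
val numbers = sc.parallelize(1 to 10)

// 2. Transformations: lazily describe a new RDD; nothing is computed yet.
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

// 3. Action: triggers the computation and returns a plain Int to the driver.
val total = evenSquares.reduce(_ + _)
println(total) // 220
```

Only the call to `reduce` starts a job; the `map` and `filter` lines return immediately.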

### Spark Languages

The native language for Spark is Scala, and for good reason: most Spark operations are transformations and actions on RDDs, and functional programming lends itself well to this style.

A Spark program typically consists of a sequence of transformations applied one after the other to an initial RDD, followed by an action.

However, Spark is not restricted to Scala; it also provides APIs for languages popular in data science, such as Python and R.


### Spark Clusters


Spark can be set up on a variety of clusters, including Hadoop clusters, in which case it integrates with Yet Another Resource Negotiator (YARN).

The main difference between a Spark RDD and a file on HDFS is that the RDD's partitions live in the memory of the nodes, while a file on HDFS lives on their hard drives.

Hence, while Hadoop MapReduce reads its input from files and writes its output back to files at every step, operations on Spark RDDs can be performed *in memory*. The Spark project has cited speedups of **up to 100x** over MapReduce for in-memory workloads; actual gains depend on the job and on whether the data fits in memory.
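
One way to see the in-memory model at work is to ask Spark to keep an RDD cached between actions. The sketch below is an illustration under assumptions (a local session, made-up names `squares`, `howMany` and `largest`), not a benchmark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()

val squares = spark.sparkContext.parallelize(1 to 1000).map(n => n.toLong * n)

// Mark the RDD for in-memory storage; like a transformation, this is lazy.
squares.cache()

val howMany = squares.count() // first action computes the partitions and caches them
val largest = squares.max()   // later actions can reuse the cached partitions

println(squares.getStorageLevel == StorageLevel.MEMORY_ONLY) // true
```

For an RDD, `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; other storage levels allow spilling to disk.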


## Spark Sessions

In [3]:
spark

res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2373eca9

## DataFrames

In [4]:
Map(
    "name" -> Array("Michael", "Kunal"),
    "age" -> Array("30", "20")
)

res3: Map[String, Array[String]] = Map(
  "name" -> Array("Michael", "Kunal"),
  "age" -> Array("30", "20")
)
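
A plain Scala `Map` like the one above is local data, not a DataFrame. As a rough sketch of how similar records could become a DataFrame, the example below uses `toDF` on a local `Seq` of tuples. It assumes a local session, stores the ages as integers rather than strings, and the name `people` is made up for the illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
import spark.implicits._ // enables .toDF on local Scala collections

// One tuple per row; the arguments to toDF name the columns.
val people = Seq(("Michael", 30), ("Kunal", 20)).toDF("name", "age")

people.printSchema()
people.show()
```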

In [2]:
val df = spark.range(18, 30).toDF("age")

df = [age: bigint]

* Transformations build a pipeline; they do not produce a result

In [12]:
val transform = df.where("age < 25")

transform = [age: bigint]

### Actions

* Actions trigger the computation and return results

In [11]:
val result = transform.count()

result = 7
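
The same split between describing work and running it can be sketched with more than one action. This is an illustration assuming a local session; `under25`, `howMany` and `ages` are names chosen for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("actions-sketch").getOrCreate()

val df = spark.range(18, 30).toDF("age") // ages 18 to 29 inclusive
val under25 = df.where("age < 25")       // transformation: returns immediately

val howMany = under25.count()                  // action: runs a job, returns a Long
val ages = under25.collect().map(_.getLong(0)) // action: brings the rows to the driver

println(howMany) // 7
```

`collect()` copies every row to the driver, so it is only appropriate for results known to be small.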

### Inferring Schema (Read-Time)

In [15]:
val crime = spark
    .read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("crime.csv")

crime = [M: int, So: int ... 14 more fields]

In [16]:
crime.columns

Array(M, So, Ed, Po1, Po2, LF, M.F, Pop, NW, U1, U2, GDP, Ineq, Prob, Time, y)
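
Inference costs an extra pass over the data, so a schema can also be supplied up front. The sketch below compares both approaches on a small temporary file. It assumes a local session reading from the local file system; the three-column sample file and the names `inferred`, `explicit` and `schema` are hypothetical stand-ins, not the real `crime.csv`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// Hypothetical three-column sample standing in for crime.csv.
val path = Files.createTempFile("crime-sample", ".csv")
Files.write(path, "M,So,Prob\n121,0,0.034201\n130,0,0.019099\n".getBytes(StandardCharsets.UTF_8))

// Inference: Spark scans the data to guess each column's type.
val inferred = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(path.toString)

// Explicit schema: the types are fixed by the caller instead of guessed.
val schema = StructType(Seq(
  StructField("M", IntegerType),
  StructField("So", IntegerType),
  StructField("Prob", DoubleType)
))
val explicit = spark.read.option("header", "true").schema(schema).csv(path.toString)

inferred.printSchema()
explicit.printSchema()
```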

In [9]:
crime.sort("Ineq").take(2).foreach { println }

[121,0,110,118,115,547,964,25,44,84,29,689,126,0.034201,20.9995,682]
[130,0,116,128,128,536,934,51,24,78,34,627,135,0.019099,24.9008,750]


### Explain & Spark Plans

In [4]:
crime.sort("Ineq").explain()

cmd4.sc:1: not found: value crime
val res4 = crime.sort("Ineq").explain()
           ^
Compilation Failed
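
The failure above says that `crime` was not defined in the session when the cell ran, so no plan was printed. As a sketch of what `explain` does, the example below calls it on a small self-contained DataFrame. It assumes a local session; the plan text varies between Spark versions, so none is reproduced here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("explain-sketch").getOrCreate()

val df = spark.range(18, 30).toDF("age")
val sorted = df.sort("age") // still lazy: only a plan exists at this point

sorted.explain()     // prints the physical plan
sorted.explain(true) // also prints the parsed, analyzed and optimized logical plans
```

Calling `explain` does not run the query; it only prints how Spark intends to run it.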

In [12]:
spark.conf.set("spark.sql.shuffle.partitions", "5")
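
The `spark.sql.shuffle.partitions` setting controls how many partitions Spark SQL uses when it shuffles data for joins and aggregations; the default is 200, which is usually more than a small local data set needs. A minimal sketch of setting the value and reading it back, assuming a local session:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("conf-sketch").getOrCreate()

// Runtime SQL settings can be changed on an existing session.
spark.conf.set("spark.sql.shuffle.partitions", "5")

println(spark.conf.get("spark.sql.shuffle.partitions")) // 5
```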

## SQL vs DataFrame

* Register the DataFrame with Spark as a temporary view

In [13]:
crime.createOrReplaceTempView("crime_data")

* Query the view through the `spark` session

In [19]:
spark.sql("""
    SELECT So AS SouthernState, AVG(Ineq) AS Inequality
    FROM crime_data
    GROUP BY So
""").show()

+-------------+------------------+
|SouthernState|        Inequality|
+-------------+------------------+
|            0|173.09677419354838|
|            1|             234.5|
+-------------+------------------+



* With the DataFrame API, there is no need to register anything
* Call methods on the DataFrame directly

In [23]:
crime.groupBy("So").avg("Ineq").show()

+---+------------------+
| So|         avg(Ineq)|
+---+------------------+
|  0|173.09677419354838|
|  1|             234.5|
+---+------------------+
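
To make the equivalence concrete, the sketch below runs the same aggregation both ways on a tiny DataFrame and names the output column in each. It assumes a local session; the three sample rows and the names `sample`, `sample_data`, `viaSql` and `viaApi` are hypothetical and are not taken from the crime data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("sql-vs-api-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the crime data.
val sample = Seq((0, 100.0), (0, 200.0), (1, 300.0)).toDF("So", "Ineq")
sample.createOrReplaceTempView("sample_data")

// SQL: needs the registered view.
val viaSql = spark.sql(
  "SELECT So, AVG(Ineq) AS Inequality FROM sample_data GROUP BY So ORDER BY So")

// DataFrame API: agg(...).as(...) names the column without a separate rename.
val viaApi = sample
  .groupBy("So")
  .agg(avg("Ineq").as("Inequality"))
  .orderBy("So")

viaSql.show()
viaApi.show()
```

Both forms are planned by the same engine, so the choice is mostly one of readability.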



In [11]:
object Person {
    def say() = {
        println("Hi")
        Person
    }
    def bye() = {
        println("Bye!")
        Person
    }
}

Person
    .say()
    .bye()
    .say()
    .bye()

Hi
Bye!
Hi
Bye!


defined object Person
res10_1: Person.type = ammonite.$sess.cmd10$Helper$Person$@7819f772

In [25]:
import org.apache.spark.sql.functions.desc

crime
  .groupBy("So")
  .avg("Ed")
  .withColumnRenamed("avg(Ed)", "EducationAverage")
  .sort(desc("EducationAverage"))
  .limit(1)
  .show()


+---+-----------------+
| So| EducationAverage|
+---+-----------------+
|  0|111.2258064516129|
+---+-----------------+



## Spark Applications

In [3]:
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession


object MyLib extends Serializable {
  @transient lazy val logger = Logger.getLogger(getClass.getName)

  def userDefinedFn(input: String): String = {
    logger.info(input)
    input.toUpperCase
  }

}

object SparkApplication extends Serializable {
  def main(args: Array[String]): Unit = {
    // Reuse the running session if there is one, otherwise create it.
    val spark = SparkSession
      .builder()
      .appName("Spark Application Example")
      .getOrCreate()

    // Import implicits from this session; needed for .toDF below.
    import spark.implicits._

    // Make the function callable by name from SQL expressions.
    spark.udf.register("userDefinedFn", MyLib.userDefinedFn _)

    spark
      .sparkContext
      .parallelize(Array("sample text", "some more"))
      .toDF("output")
      .selectExpr("split(output, ' ') as values")
      .selectExpr("userDefinedFn(values[0]) as first", "values[1] as second")
      .show()
  }
}

defined object MyLib
defined object SparkApplication

In [4]:
SparkApplication.main(Array())

19/07/25 12:40:55 INFO cmd2$Helper$MyLib$: sample
19/07/25 12:40:55 INFO cmd2$Helper$MyLib$: some


+------+------+
| first|second|
+------+------+
|SAMPLE|  text|
|  SOME|  more|
+------+------+
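
Registering a function by name is what makes it usable inside `selectExpr` strings. When the DataFrame API is used directly, a function can instead be wrapped with `udf` and applied to columns. The sketch below is an alternative illustration, not the application above: it assumes a local session, and the names `shout`, `words` and `shouted` are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Wrap an ordinary Scala function so it can be applied to a column.
val shout = udf((s: String) => s.toUpperCase)

val words = Seq("sample text", "some more").toDF("output")
val shouted = words.select(shout(col("output")).as("shouted"))

shouted.show()
```

Spark's built-in `upper` function would do the same job and is generally preferable to a UDF, which the optimizer treats as a black box.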

