# The Apache Spark Scala API

## 1. Introduction

This notebook shows how to connect a Jupyter notebook to a Spark cluster and process data using the Spark Scala API.

## 2. The Spark Cluster

### 2.1. Get Spark

Let's start by importing the Spark Kafka connector from the Maven repository (mind the **versions** in the artifact name: `2.12` is the Scala version and `3.0.0` is the Spark version).

In [81]:
import $ivy.`org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0`;

[32mimport [39m[36m$ivy.$                                                 ;[39m

We disable Spark's internal logs so that the output stays focused on the API.

In [82]:
import org.apache.log4j.{Level, Logger};
Logger.getLogger("org").setLevel(Level.OFF);

[32mimport [39m[36morg.apache.log4j.{Level, Logger};
[39m

### 2.2. Connection

To connect to the Spark cluster, create a SparkSession object with the following parameters:

+ **appName:** application name displayed in the [Spark Master Web UI](http://localhost:8080/);
+ **master:** Spark Master URL, the same one used by the Spark Workers;
+ **spark.executor.memory:** must be less than or equal to the Docker Compose SPARK_WORKER_MEMORY setting.

In [101]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.
            builder().
            appName("scala-spark-notebook").
            master("spark://spark-master:7077").
            config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0").
            config("spark.executor.memory", "512m").
            getOrCreate()

[32mimport [39m[36morg.apache.spark.sql.SparkSession
[39m
[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@323778fc

More settings for a SparkSession in standalone mode can be added with the **config** method. Check out the API docs [here](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html).
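As a minimal sketch of the **config** method, the snippet below sets two options and reads them back from the session's runtime configuration. The `local[*]` master, the `config-sketch` application name and the `spark.sql.shuffle.partitions` value are assumptions chosen so the example runs without the cluster; inside this notebook `getOrCreate()` would return the session that already exists.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local master, so the sketch does not depend on the Docker cluster.
val confSpark = SparkSession.builder()
  .appName("config-sketch")
  .master("local[*]")
  .config("spark.executor.memory", "512m")
  .config("spark.sql.shuffle.partitions", "4") // illustrative value only
  .getOrCreate()

// Read individual settings back from the session's runtime configuration.
val executorMemory = confSpark.conf.get("spark.executor.memory")
val shufflePartitions = confSpark.conf.get("spark.sql.shuffle.partitions")
println(s"executor memory = $executorMemory, shuffle partitions = $shufflePartitions")
```

Reading a setting back this way is a quick check that an option was actually picked up by the session.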

## 3. The Data

### 3.1. Introduction

We will be using the Spark Scala API to read, process and write data. Check out the API docs [here](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html).

### 3.2. Read

Let's read some UK macroeconomic data ([source](https://www.kaggle.com/bank-of-england/a-millennium-of-macroeconomic-data)) from the cluster's simulated **Hadoop distributed file system (HDFS)** into a Spark dataframe.

Let's then display some dataframe metadata, such as the number of rows and columns and its schema (column names and types).

In [134]:
import spark.implicits._
val df_streamed_raw = (spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9093")
        .option("subscribe", "topic_test")
        .load())

[32mimport [39m[36mspark.implicits._
[39m
[36mdf_streamed_raw[39m: [32mDataFrame[39m = [key: binary, value: binary ... 5 more fields]

In [135]:
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val df_streamed_kv = df_streamed_raw
  .withColumn("key", col("key").cast("STRING"))
  .withColumn("value", col("value").cast("STRING"))

[32mimport [39m[36mspark.implicits._
[39m
[32mimport [39m[36morg.apache.spark.sql.functions.{col, from_json}
[39m
[32mimport [39m[36morg.apache.spark.sql.types._

[39m
[36mdf_streamed_kv[39m: [32mDataFrame[39m = [key: string, value: string ... 5 more fields]

In [136]:
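// The "memory" sink registers an in-memory table named after queryName,
// which is what the spark.sql lookup in the next cell reads from.
val test1 = (df_streamed_kv
              .writeStream
              .format("memory")
              .outputMode("append")
              .queryName("test_query_table")
              .start())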

[36mtest1[39m: [32mstreaming[39m.[32mStreamingQuery[39m = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@2539b9a0

In [137]:
spark.sql("select * from test_query_table").show()

+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
+---+-----+-----+---------+------+---------+-------------+



In [143]:
val eventSchema = new StructType()
  .add("valid", StringType)
  .add("tmpf", StringType)
  .add("dwpf", StringType)
  .add("relh", StringType)
  .add("feel", StringType)
  .add("drct", StringType)
  .add("sped", StringType)
  .add("alti", StringType)
  .add("p01m", StringType)
  .add("vsby", StringType)
  .add("skyc1", StringType)
  .add("skyl1", StringType)
  .add("wxcodes", StringType)
  .add("station_encoded", StringType)
  .add("skyc1_encoded", StringType)


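// Parse the JSON payload and flatten its fields into top-level columns.
val personDF = df_streamed_kv.select(from_json(col("value"), eventSchema).as("data"))
  .select("data.*")

// Keep the Kafka metadata columns and replace "value" with the parsed struct;
// the dfFormatted cell further down selects from this dataframe.
val dfParsed = df_streamed_kv
  .withColumn("value", from_json(col("value"), eventSchema))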

[36meventSchema[39m: [32mStructType[39m = [33mStructType[39m(
  [33mStructField[39m([32m"valid"[39m, StringType, true, {}),
  [33mStructField[39m([32m"tmpf"[39m, StringType, true, {}),
  [33mStructField[39m([32m"dwpf"[39m, StringType, true, {}),
  [33mStructField[39m([32m"relh"[39m, StringType, true, {}),
  [33mStructField[39m([32m"feel"[39m, StringType, true, {}),
  [33mStructField[39m([32m"drct"[39m, StringType, true, {}),
  [33mStructField[39m([32m"sped"[39m, StringType, true, {}),
  [33mStructField[39m([32m"alti"[39m, StringType, true, {}),
  [33mStructField[39m([32m"p01m"[39m, StringType, true, {}),
  [33mStructField[39m([32m"vsby"[39m, StringType, true, {}),
  [33mStructField[39m([32m"skyc1"[39m, StringType, true, {}),
  [33mStructField[39m([32m"skyl1"[39m, StringType, true, {}),
  [33mStructField[39m([32m"wxcodes"[39m, StringType, true, {}),
  [33mStructField[39m([32m"station_encoded"[39m, StringType, true, {}),
  [3

In [144]:
personDF.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+----+----+----+----+----+----+----+----+----+-----+-----+-------+---------------+-------------+
|valid|tmpf|dwpf|relh|feel|drct|sped|alti|p01m|vsby|skyc1|skyl1|wxcodes|station_encoded|skyc1_encoded|
+-----+----+----+----+----+----+----+----+----+----+-----+-----+-------+---------------+-------------+
+-----+----+----+----+----+----+----+----+----+----+-----+-----+-------+---------------+-------------+




In [147]:
import org.apache.spark.sql.functions.{col, to_timestamp, unix_timestamp}
import org.apache.spark.sql.types.IntegerType

val dfFormatted = dfParsed.select(
    col("key").alias("event_key"),
    col("topic").alias("event_topic"),
    col("timestamp").alias("event_timestamp"),
    col("value.valid").alias("valid"),
    col("value.tmpf").alias("tmpf"),
    col("value.dwpf").alias("dwpf"),
    col("value.relh").alias("relh"),
    col("value.feel").alias("feel"),
    col("value.drct").alias("drct"),
    col("value.sped").alias("sped"),
    col("value.alti").alias("alti"),
    col("value.p01m").alias("p01m"),
    col("value.vsby").alias("vsby"),
    col("value.skyc1").alias("skyc1"),
    col("value.skyl1").alias("skyl1"),
    col("value.wxcodes").alias("wxcodes"),
    col("value.station_encoded").alias("station_encoded"),
    col("value.skyc1_encoded").alias("skyc1_encoded")
)


[32mimport [39m[36morg.apache.spark.sql.functions.{col, to_timestamp, unix_timestamp}
[39m
[32mimport [39m[36morg.apache.spark.sql.types.IntegerType

[39m
[36mdfFormatted[39m: [32mDataFrame[39m = [event_key: string, event_topic: string ... 16 more fields]

In [148]:
dfFormatted.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()

-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----------+---------------+-----+----+----+----+----+----+----+----+----+----+-----+-----+-------+---------------+-------------+
|event_key|event_topic|event_timestamp|valid|tmpf|dwpf|relh|feel|drct|sped|alti|p01m|vsby|skyc1|skyl1|wxcodes|station_encoded|skyc1_encoded|
+---------+-----------+---------------+-----+----+----+----+----+----+----+----+----+----+-----+-----+-------+---------------+-------------+
+---------+-----------+---------------+-----+----+----+----+----+----+----+----+----+----+-----+-----+-------+---------------+-------------+




In [132]:
import org.apache.spark.sql.streaming.Trigger

val query = dfFormatted.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .option("truncate", false)
  .start()

[32mimport [39m[36morg.apache.spark.sql.streaming.Trigger

[39m
[36mquery[39m: [32mstreaming[39m.[32mStreamingQuery[39m = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@acdd836

In [69]:
query.stop()
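Keeping a handle on the `StreamingQuery` and calling `stop()` avoids blocking a notebook cell the way `awaitTermination()` does. Querying a stream by its `queryName` through `spark.sql` works with the `memory` sink, which keeps results in an in-memory table. The sketch below shows both ideas together. It assumes a `local[*]` master and uses `MemoryStream`, a testing utility from Spark's execution package, as a stand-in for the Kafka source; the table name `events_sketch` is hypothetical.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Assumption: a local master, so the sketch runs without Kafka or the cluster.
val streamSpark = SparkSession.builder().appName("memory-sink-sketch").master("local[*]").getOrCreate()
import streamSpark.implicits._
implicit val sqlCtx: SQLContext = streamSpark.sqlContext

// In-memory streaming source used in place of the Kafka topic.
val input = MemoryStream[String]
input.addData("first event", "second event")

val memoryQuery = input.toDF()
  .writeStream
  .format("memory")             // results are kept in an in-memory table
  .queryName("events_sketch")   // hypothetical table name
  .outputMode("append")
  .start()

memoryQuery.processAllAvailable() // block until the data added so far is processed
val eventCount = streamSpark.sql("select * from events_sketch").count()
memoryQuery.stop()                // release the query explicitly

println(s"rows in events_sketch: $eventCount")
```

The memory sink collects all output on the driver, so it is best kept for debugging and small volumes.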

In [5]:
data.count

[36mres4[39m: [32mLong[39m = [32m841L[39m

In [6]:
data.columns.size

[36mres5[39m: [32mInt[39m = [32m77[39m

In [None]:
data.printSchema
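The cell that defines the `data` dataframe is not shown in this section. As a hedged illustration of a batch read followed by the same metadata calls, the sketch below writes a tiny CSV file and loads it. The file contents, the temporary path, the `local[*]` master and the name `sketchData` are all assumptions, not the notebook's actual dataset or location.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local master, so the temporary file is visible to the executors.
val readSpark = SparkSession.builder().appName("csv-read-sketch").master("local[*]").getOrCreate()

// Hypothetical file mimicking the layout described in this notebook:
// a header line, a units row, then one row per year.
val csvPath = Files.createTempFile("uk-macro-sketch", ".csv")
Files.write(csvPath, "Description,Unemployment rate\nUnits,%\n1855,3.73\n".getBytes("UTF-8"))

val sketchData = readSpark.read
  .format("csv")
  .option("sep", ",")
  .option("header", "true") // first line provides the column names
  .load(csvPath.toString)

val rowCount = sketchData.count()         // number of rows, header excluded
val columnCount = sketchData.columns.size // number of columns
sketchData.printSchema()                  // columns are read as strings unless inferSchema is enabled
```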

### 3.3. Process

In this example, we will get the UK's population and unemployment rate throughout the years. Let's start by selecting the relevant columns.

In [8]:
var unemployment = data.select("Description", "Population (GB+NI)", "Unemployment rate")

In [9]:
unemployment.show(10)

+-----------+------------------+-----------------+
|Description|Population (GB+NI)|Unemployment rate|
+-----------+------------------+-----------------+
|      Units|              000s|                %|
|       1209|              null|             null|
|       1210|              null|             null|
|       1211|              null|             null|
|       1212|              null|             null|
|       1213|              null|             null|
|       1214|              null|             null|
|       1215|              null|             null|
|       1216|              null|             null|
|       1217|              null|             null|
+-----------+------------------+-----------------+
only showing top 10 rows



We successfully selected the desired columns but two problems were found:
+ The first line contains no data but the unit of measurement of each column;
+ There are many years with missing population and unemployment data.

Let's then remove the first line.

In [10]:
val cols_description = unemployment.filter(unemployment("Description") === "Units")

[36mcols_description[39m: [32mDataset[39m[[32mRow[39m] = [Description: string, Population (GB+NI): string ... 1 more field]

In [11]:
cols_description.show()

+-----------+------------------+-----------------+
|Description|Population (GB+NI)|Unemployment rate|
+-----------+------------------+-----------------+
|      Units|              000s|                %|
+-----------+------------------+-----------------+



In [12]:
unemployment = unemployment.join(cols_description, unemployment("Description") === cols_description("Description"), "left_anti")

In [13]:
unemployment.show(10)

+-----------+------------------+-----------------+
|Description|Population (GB+NI)|Unemployment rate|
+-----------+------------------+-----------------+
|       1209|              null|             null|
|       1210|              null|             null|
|       1211|              null|             null|
|       1212|              null|             null|
|       1213|              null|             null|
|       1214|              null|             null|
|       1215|              null|             null|
|       1216|              null|             null|
|       1217|              null|             null|
|       1218|              null|             null|
+-----------+------------------+-----------------+
only showing top 10 rows
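A `left_anti` join keeps only the left-hand rows that have no match on the right, which is how the units row was removed. The sketch below shows the same idea on a small hypothetical dataframe, with explicit aliases to keep the two sides of the self-join unambiguous, and compares it with a plain filter that gives the same result here. The sample rows, the `local[*]` master and the variable names are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local master, so the sketch runs without the cluster.
val antiSpark = SparkSession.builder().appName("anti-join-sketch").master("local[*]").getOrCreate()
import antiSpark.implicits._

// Hypothetical sample: one units row followed by two data rows.
val sample = Seq(("Units", "%"), ("1855", "3.73"), ("1856", "3.52"))
  .toDF("Description", "Unemployment rate")

val unitsRow = sample.filter(col("Description") === "Units")

// left_anti: keep rows of the left side that have no match on the right side.
val viaAntiJoin = sample.as("u")
  .join(unitsRow.as("h"), col("u.Description") === col("h.Description"), "left_anti")

// Equivalent here: filter the unwanted row out directly.
val viaFilter = sample.filter(col("Description") =!= "Units")

val antiJoinYears = viaAntiJoin.collect().map(_.getString(0)).sorted.toSeq
val filterYears = viaFilter.collect().map(_.getString(0)).sorted.toSeq
println(antiJoinYears)
```

The filter is the shorter option when the unwanted rows can be described by a simple condition; the anti join is handy when they are defined by another dataframe.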



Nice! Now, let's drop the dataframe rows with missing data and rename its columns.

In [14]:
unemployment = unemployment.na.drop()

In [15]:
unemployment = unemployment.
                withColumnRenamed("Description", "year").
                withColumnRenamed("Population (GB+NI)", "population").
                withColumnRenamed("Unemployment rate", "unemployment_rate")

In [16]:
unemployment.show(10)

+----+----------+-----------------+
|year|population|unemployment_rate|
+----+----------+-----------------+
|1855|     23241|             3.73|
|1856|     23466|             3.52|
|1857|     23689|             3.95|
|1858|     23914|             5.23|
|1859|     24138|             3.27|
|1860|     24360|             2.94|
|1861|     24585|             3.72|
|1862|     24862|             4.68|
|1863|     25142|             4.15|
|1864|     25425|             2.99|
+----+----------+-----------------+
only showing top 10 rows
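The cleaning steps above can be sketched end to end on a small hypothetical dataframe. The sample rows and the `local[*]` master are assumptions, and the final casts to numeric types are an optional extra step that this notebook does not perform (its columns remain strings).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local master, so the sketch runs without the cluster.
val cleanSpark = SparkSession.builder().appName("cleaning-sketch").master("local[*]").getOrCreate()
import cleanSpark.implicits._

// Hypothetical sample: one year without data, one year with data.
val withGaps = Seq(
  ("1209", None: Option[String], None: Option[String]),
  ("1855", Some("23241"), Some("3.73"))
).toDF("Description", "Population (GB+NI)", "Unemployment rate")

val cleaned = withGaps
  .na.drop() // drop rows containing any null value
  .withColumnRenamed("Description", "year")
  .withColumnRenamed("Population (GB+NI)", "population")
  .withColumnRenamed("Unemployment rate", "unemployment_rate")
  // Optional extra step, not part of the notebook: cast strings to numeric types.
  .withColumn("year", col("year").cast("int"))
  .withColumn("unemployment_rate", col("unemployment_rate").cast("double"))

val cleanedRow = cleaned.head()
cleaned.printSchema()
```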



### 3.4. Write

Lastly, we persist the unemployment data into the cluster's simulated **HDFS**.

In [17]:
unemployment.repartition(1).write.format("csv").mode("overwrite").option("sep", ",").option("header", "true").save("data/uk-macroeconomic-unemployment-data.csv")
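Note that the path passed to `save` becomes a directory containing one part file per partition, not a single CSV file. The sketch below writes a small hypothetical dataframe with the same options and reads it back. The temporary output location, the sample rows and the `local[*]` master are assumptions.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local master and a local temporary directory instead of the cluster's storage.
val writeSpark = SparkSession.builder().appName("csv-write-sketch").master("local[*]").getOrCreate()
import writeSpark.implicits._

val toWrite = Seq(("1855", "23241", "3.73"), ("1856", "23466", "3.52"))
  .toDF("year", "population", "unemployment_rate")

// Hypothetical output location.
val outDir = Files.createTempDirectory("unemployment-sketch").resolve("out.csv").toString

toWrite
  .repartition(1) // a single partition yields a single part file
  .write
  .format("csv")
  .mode("overwrite")
  .option("sep", ",")
  .option("header", "true")
  .save(outDir)

// The target is a directory; the data lives in part-*.csv files inside it.
val partFiles = new File(outDir).listFiles()
  .map(_.getName)
  .filter(name => name.startsWith("part-") && name.endsWith(".csv"))

val readBack = writeSpark.read.option("header", "true").csv(outDir)
println(s"part files: ${partFiles.length}, rows read back: ${readBack.count()}")
```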