# Apache Spark's Structured APIs

## Spark: What's Underneath an RDD?

The RDD is the most basic abstraction in Spark. There are three vital characteristics to an RDD.
* Dependencies
* Partitions (with some locality information)
* Compute function: `Partition => Iterator[T]`

All three are integral to the simple RDD programming API model upon which all higher-level functionality is constructed.
1. A list of *dependencies* instructs Spark how an RDD is constructed from its inputs. When it needs to reproduce results, Spark can recreate an RDD from these dependencies and replay the operations on it. This gives RDDs their **resiliency**.
2. *Partitions* give Spark the ability to split the work and parallelize computation across executors. In some cases (for example, when reading from HDFS), Spark uses locality information to send work to executors close to the data, which reduces the data transmitted over the network.
3. An RDD also has a *compute function* that produces an `Iterator[T]` for the data that will be stored in the RDD.
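
As a rough illustration of how the three characteristics fit together, here is a toy model in plain Python. `ToyRDD` and its methods are hypothetical names invented for this sketch; it uses no Spark API and is not how Spark is implemented.

```python
class ToyRDD:
    """A toy stand-in for an RDD (illustrative only, not Spark's code)."""

    def __init__(self, partitions, compute, dependencies=None):
        self.partitions = partitions            # how the data is split up
        self.compute = compute                  # partition -> iterator of records
        self.dependencies = dependencies or []  # parent RDDs (the lineage)

    def map(self, fn):
        # The child reuses the parent's partitions, records the parent as
        # a dependency, and wraps the parent's compute function.
        return ToyRDD(
            partitions=self.partitions,
            compute=lambda part: (fn(x) for x in self.compute(part)),
            dependencies=[self],
        )

    def collect(self):
        # In Spark each partition could run on a different executor;
        # here the partitions are simply evaluated one after another.
        return [x for part in self.partitions for x in self.compute(part)]


# A source "RDD" with two partitions, each computed by iterating a list.
source = ToyRDD(partitions=[[1, 2], [3, 4]], compute=iter)
doubled = source.map(lambda x: x * 2)

print(doubled.collect())                  # [2, 4, 6, 8]
print(doubled.dependencies[0] is source)  # True
```

Because `doubled` stores only its parent and a function, calling `collect()` again recomputes the same records from the source partitions, which is the idea behind resiliency.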

Simple and elegant! A couple of problems arise with this model, though. Spark does not know *what* you are doing in the compute function; in other words, the compute function is **opaque** to Spark. Whether you perform a join, filter, select, or aggregation, Spark sees only a `lambda` expression, so it cannot inspect or optimize the computation.

A second problem arises with Python RDDs: the `Iterator[T]` data type is also opaque. Spark knows only that it is a generic Python object.
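
To make "opaque" concrete, the following plain-Python sketch contrasts a lambda with the same filter written as data. The dictionary format and the helper `apply_structured` are assumptions made up for this illustration; they are not a Spark API.

```python
import operator

people = [("Brooke", 20), ("Denny", 31), ("Jules", 30)]

# Opaque: an engine can only call this function. It cannot tell which
# field is read or which comparison is made.
opaque_filter = lambda row: row[1] > 25

# Structured: the same intent expressed as data that an engine could
# inspect, for example to learn that only the "age" column is needed.
structured_filter = {"column": "age", "op": ">", "value": 25}

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}
COLUMNS = {"name": 0, "age": 1}


def apply_structured(rows, expr):
    """Evaluate a structured filter expression against tuple rows."""
    idx = COLUMNS[expr["column"]]
    compare = OPS[expr["op"]]
    return [r for r in rows if compare(r[idx], expr["value"])]


opaque_result = [r for r in people if opaque_filter(r)]
structured_result = apply_structured(people, structured_filter)

print(opaque_result == structured_result)  # True
print(structured_filter["column"])         # age
```

Both versions select the same rows, but only the structured one exposes what the computation does.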

## Structuring Spark

There are a few key schemes for structuring Spark:
- Express computations using common patterns found in data analysis (averages, filters, selects, aggregates, and so on).
- Use a set of common operators in a DSL, available as APIs in a Spark-compatible language. These operators state specifically what you want to compute.
- Impose an order and structure scheme that arranges data in a tabular format, like a SQL table or a spreadsheet, with its own supported data types.

### Key Merits and Benefits

Structure yields a number of benefits, including better performance and space efficiency across Spark components. We will explore these benefits further when we talk about the use of the DataFrame and Dataset APIs shortly, but for now we’ll concentrate on the other advantages: expressivity, simplicity, composability, and uniformity.

In [2]:
import sys
!{sys.executable} -m pip install scala

Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement scala (from versions: none)
ERROR: No matching distribution found for scala

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

In [7]:
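# `sc` is predefined only in the pyspark shell; in a notebook,
# obtain the SparkContext from a SparkSession
sc = SparkSession.builder.getOrCreate().sparkContext

# Create an RDD of tuples (name, age)
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30),
                         ("TD", 35), ("Brooke", 25)])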
# Use map and reduceByKey transformations with their lambda
# expressions to aggregate and then compute average

agesRDD = (dataRDD
          .map(lambda x: (x[0], (x[1], 1)))
          .reduceByKey(lambda x, y: (x[0]+y[0], x[1] + y[1]))
          .map(lambda x: (x[0], x[1][0]/x[1][1])))
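
The RDD pipeline above is hard to read because it tells Spark *how* to compute the average rather than *what* to compute. As a sketch of what each step does, the same logic can be traced on plain Python lists; this uses no Spark API, and grouping with a dictionary is only a stand-in for what `reduceByKey` does across partitions.

```python
from collections import defaultdict
from functools import reduce

data = [("Brooke", 20), ("Denny", 31), ("Jules", 30),
        ("TD", 35), ("Brooke", 25)]

# Step 1 (map): pair each age with a count of 1.
pairs = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): group by name, then merge the (sum, count)
# pairs of each group with the same lambda used in the RDD code.
grouped = defaultdict(list)
for name, value in pairs:
    grouped[name].append(value)
totals = {name: reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), values)
          for name, values in grouped.items()}

# Step 3 (map): divide each sum by its count to get the average.
averages = {name: total / count for name, (total, count) in totals.items()}

print(averages)
# {'Brooke': 22.5, 'Denny': 31.0, 'Jules': 30.0, 'TD': 35.0}
```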

In [10]:
# Create (or reuse) a SparkSession
spark = (SparkSession
      .builder
      .appName("AuthorsAges")
      .getOrCreate())

# Create a DataFrame
data_df = spark.createDataFrame([("Brooke", 20), ("Denny", 31), ("Jules", 30),
                                 ("TD", 35), ("Brooke", 25)], ["name", "age"])

# Group the same names together, aggregate their ages, and compute an average 
avg_df = data_df.groupBy("name").agg(avg("age"))
# Show the results of the final execution
avg_df.show()

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Denny|    31.0|
| Jules|    30.0|
|    TD|    35.0|
+------+--------+



                                                                                