# Chapter 4. Structured API Overview

- The Structured APIs are a tool for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files.
- These APIs refer to three core types of distributed collection APIs:
  - Datasets.
  - DataFrames.
  - SQL tables and views.
- The Structured APIs make it simple to migrate from batch to streaming computation (or vice versa).
- The Structured APIs are the fundamental abstraction that you will use to write the majority of your data flows.

### DataFrames and Datasets
- DataFrames and Datasets are (distributed) table-like collections with well-defined rows and columns.
- Each column must have the same number of rows as all the other columns (use null to indicate the absence of a value).
- Each column has type information that must be consistent for every row in the collection.
- To Spark, they represent immutable, lazily evaluated plans that specify what operations to apply to data residing at a location to generate some output.
- Spark DataFrames represent plans for manipulating rows and columns.
- We instruct Spark to perform these transformations and return results.

### Note: Tables and views are basically the same thing as DataFrames. We just execute SQL against them instead of DataFrame code.



### Schemas
- A schema defines the column names and types of a DataFrame.
- You can define a schema manually or read it from a data source (often called schema on read).
- Schemas consist of types, meaning that you need a way of specifying what lies where.

### Overview of Structured Spark Types
- Spark uses an engine called Catalyst that maintains its own type information through the planning and processing of work.
- Catalyst enables execution optimizations for Spark types.
- Spark types map directly to the different language APIs (Scala, Java, Python, SQL, and R).
- Operations run on Spark types, not on language-specific types. For example, the addition in the code that follows is performed by Spark, not by Python.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Structured API Overview")\
    .getOrCreate()
```


```python
df = spark.range(500).toDF("number")
df.select(df["number"] + 10).show(5)
```

```
+-------------+
|(number + 10)|
+-------------+
|           10|
|           11|
|           12|
|           13|
|           14|
+-------------+
only showing top 5 rows
```



### DataFrames Versus Datasets
- Spark's Structured APIs include "untyped" DataFrames and "typed" Datasets.
- DataFrames do have types, but Spark maintains them internally and checks whether they line up with the schema only at runtime.
- Datasets check types at compile time and are available only in Scala and Java.
- DataFrames are actually Datasets of type Row.
- The Row type is Spark's internal representation of its optimized in-memory format for computation.
- In Python and R, everything is a DataFrame, so you always operate on that optimized format.
- Using DataFrames therefore gives you these efficiency gains in all of Spark's language APIs.
- If you require strict compile-time type checking, see Datasets in Chapter 11.

### Columns
- A column represents one of:
  - A simple type (such as an integer or string).
  - A complex type (such as an array or map).
  - A null value.
- Spark tracks column types and provides various transformation methods. You can think of Spark columns as columns in a table.

### Rows
- A row is a record of data.
- Each record in a DataFrame must be of type Row.
- These rows can be created from various sources:
  - SQL queries.
  - Resilient Distributed Datasets (RDDs).
  - External data sources.
  - Manual creation.

```python
spark.range(2).collect()
```

```
[Row(id=0), Row(id=1)]
```


### Spark Types
- Spark supports many internal type representations.
- Reference tables map language-specific types to the corresponding Spark types.

### Overview of Structured API Execution
This section walks through the execution of a single structured API query to show how code is actually executed across a cluster.
 1. Write DataFrame/Dataset/SQL code.
 2. If the code is valid, Spark converts it to a Logical Plan.
 3. Spark transforms this Logical Plan into a Physical Plan, checking for optimizations along the way.
 4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.

User Code -> Catalyst Optimizer -> Execution Plan -> Cluster Execution -> Result to User

### Logical Planning
- Unresolved Logical Plan: Represents a set of transformations without verifying the existence of referenced tables or columns.
- Catalog: A repository that holds information about all tables and DataFrames to help resolve references.
- Analyzer: A component that checks and resolves the references in the logical plan using the catalog.
- Catalyst Optimizer: A rule-based optimizer that refines the resolved logical plan into an optimized version.
- Optimized Logical Plan: The end product of the logical planning process, with the optimizer's rules applied, which becomes the input to physical planning.

### Physical Planning
- Physical Plan: Also called a Spark plan, it details how the logical plan will be executed on the cluster. Multiple physical plans may be generated and compared.
- Cost Model: Used to evaluate and compare physical plans based on factors like data size, partitioning, and computational costs to determine the most efficient plan.
- RDD Transformations: The physical plan results in a series of RDD transformations, which are the core operations in Spark for processing data across the cluster.
- Compiler Analogy: Spark acts like a compiler by taking high-level queries in DataFrames, Datasets, and SQL, and compiling them into low-level RDD transformations for execution.

### Execution
- RDDs: The fundamental data structure in Spark, representing an immutable distributed collection of objects that can be processed in parallel.
- Java Bytecode Generation: Spark generates native Java bytecode at runtime to optimize the execution process.
- Runtime Optimizations: Additional optimizations that Spark performs while executing the physical plan to improve efficiency and performance.
