#  Spark Structured

# Originally

The RDD has been the primary user-facing API in Spark since its inception.

At the core, an RDD is:

- an immutable distributed collection of elements of your data,

- partitioned across the nodes in your cluster so that it can be operated on in parallel,

- with a low-level API that offers transformations and actions.
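The three properties above can be sketched with a toy, single-process model. This is not Spark's implementation: `ToyRDD`, its round-robin partitioning, and the in-memory tuples are all assumptions made for illustration. It only shows how immutable partitions, lazy transformations, and actions fit together.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Stage = Callable[[Iterator], Iterator]


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, partitions: Sequence[Sequence], stages: Tuple[Stage, ...] = ()):
        # Partitions are frozen into tuples: the underlying data never changes.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Pending transformations, applied per partition only when an action runs.
        self._stages = stages

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int = 2) -> "ToyRDD":
        items = list(data)
        # Round-robin split: a simplification chosen for brevity.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, f: Callable) -> "ToyRDD":
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._partitions, self._stages + (lambda it: map(f, it),))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._partitions, self._stages + (lambda it: filter(pred, it),))

    def _compute(self, partition: tuple) -> List:
        it: Iterator = iter(partition)
        for stage in self._stages:
            it = stage(it)
        return list(it)

    def collect(self) -> List:
        # Action: runs the pending stages on every partition and gathers the results.
        return [x for p in self._partitions for x in self._compute(p)]

    def count(self) -> int:
        # Action: counts the elements that survive the pending stages.
        return sum(len(self._compute(p)) for p in self._partitions)


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(sorted(evens_squared.collect()))  # [0, 4, 16, 36, 64]
print(evens_squared.count())            # 5
print(sorted(numbers.collect()))        # the original collection is unchanged
```

In this sketch each partition is computed independently, which is the property a real cluster would exploit to run the work in parallel.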

# Is that enough?

![](https://www.mememaker.net/api/bucket?path=static/img/memes/full/2020/Apr/22/5/rdd-1683.png)

nics @https://www.mememaker.net/meme/rdd-1683

# When to use RDDs?
 
Consider using RDDs in these common scenarios:

**Pro**

- you want low-level transformations and actions, and fine-grained control over your dataset;

- your data is unstructured, such as media streams or streams of text;

- you want to manipulate your data with functional programming constructs rather than domain-specific expressions;

**Contra**

- you don’t care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column;

- you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
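The trade-off between functional constructs over schema-less records and access by column name can be illustrated in plain Python. This is only an analogy, not Spark code: the field order, the `MedalRecord` type, and the four sample rows (taken from the first rows of the demo output later in this deck) are assumptions for the sketch.

```python
from collections import Counter, namedtuple
from functools import reduce

# Sample records as bare tuples; the field order is an assumption of this sketch.
raw_rows = [
    ("1896", "Athens", "Swimming", "HUN", "Gold"),
    ("1896", "Athens", "Swimming", "AUT", "Silver"),
    ("1896", "Athens", "Swimming", "GRE", "Bronze"),
    ("1896", "Athens", "Swimming", "GRE", "Gold"),
]


def add_country(acc: Counter, row: tuple) -> Counter:
    # No schema: the reader must know that index 3 is the country.
    acc[row[3]] += 1
    return acc


# RDD-like style: functional constructs over positional fields.
golds_by_position = reduce(
    add_country,
    filter(lambda row: row[4] == "Gold", raw_rows),
    Counter(),
)

# Schema-like style: the same records with named fields.
MedalRecord = namedtuple(
    "MedalRecord", ["Year", "City", "Discipline", "Country", "Medal"]
)
records = [MedalRecord(*row) for row in raw_rows]
golds_by_name = Counter(r.Country for r in records if r.Medal == "Gold")

print(golds_by_position == golds_by_name)  # True
```

Both versions count the same gold medals per country. The second one reads closer to what a DataFrame expression over named columns would look like.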

![](https://databricks.com/wp-content/uploads/2016/07/memory-usage-when-caching-datasets-vs-rdds.png)

# We love types

![](../images/types.png)

Source to find

![](https://cdn2.hexlet.io/derivations/image/original/eyJpZCI6IjFkMDUwZmZhNGIwNGMxMzU3ZTI0M2UwMDlhYWI1ZmZmLnBuZyIsInN0b3JhZ2UiOiJjYWNoZSJ9?signature=356de17911b2b04657ed56dd6f6b884e5ad82e5def90d2adf8c72bdbe0b05213)

# Evolution

![](../images/human-evolution-monkey-modern-man-programmer-computer-user-isolated-white_33099-1593.jpg)

![](https://image.slidesharecdn.com/jumpstartintoapachesparkanddatabricks-160212150759/95/jump-start-into-apache-spark-and-databricks-13-638.jpg?cb=1463623478)

![](https://databricks.com/wp-content/uploads/2016/06/Unified-Apache-Spark-2.0-API-1.png)

# When should I use DataFrames or Datasets?

- If you want rich semantics, high-level abstractions, and domain-specific APIs, use DataFrame or Dataset.
- If your processing demands high-level expressions, filters, maps, aggregations, averages, sums, SQL queries, columnar access, and lambda functions on semi-structured data, use DataFrame or Dataset.
- If you want a higher degree of type safety at compile time, typed JVM objects, Catalyst optimization, and Tungsten’s efficient code generation, use Dataset.
- If you want unified, simplified APIs across Spark libraries, use DataFrame or Dataset.
- If you are an R user, use DataFrames.
- If you are a Python user, use DataFrames and fall back to RDDs if you need more control.

# A nice comparison

https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/

# Demo

In [15]:
import findspark
import pyspark
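findspark.find()  # returns the path of the local Spark installation; findspark.init() would also add pyspark to sys.path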
findspark

<module 'findspark' from '/home/nics/anaconda3/lib/python3.7/site-packages/findspark.py'>

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TapDataFrame").getOrCreate()

In [2]:
spark

In [16]:
file = "../dataset/olympic-games/summer.csv"  # Should be some file on your system
dataset = spark.read.csv(file, header=True)  # no schema and no inferSchema, so every column is read as a string
dataset

DataFrame[Year: string, City: string, Sport: string, Discipline: string, Athlete: string, Country: string, Gender: string, Event: string, Medal: string]

In [17]:
dataset.show()

+----+------+---------+----------+--------------------+-------+------+--------------------+------+
|Year|  City|    Sport|Discipline|             Athlete|Country|Gender|               Event| Medal|
+----+------+---------+----------+--------------------+-------+------+--------------------+------+
|1896|Athens| Aquatics|  Swimming|       HAJOS, Alfred|    HUN|   Men|      100M Freestyle|  Gold|
|1896|Athens| Aquatics|  Swimming|    HERSCHMANN, Otto|    AUT|   Men|      100M Freestyle|Silver|
|1896|Athens| Aquatics|  Swimming|   DRIVAS, Dimitrios|    GRE|   Men|100M Freestyle Fo...|Bronze|
|1896|Athens| Aquatics|  Swimming|  MALOKINIS, Ioannis|    GRE|   Men|100M Freestyle Fo...|  Gold|
|1896|Athens| Aquatics|  Swimming|  CHASAPIS, Spiridon|    GRE|   Men|100M Freestyle Fo...|Silver|
|1896|Athens| Aquatics|  Swimming|CHOROPHAS, Efstat...|    GRE|   Men|     1200M Freestyle|Bronze|
|1896|Athens| Aquatics|  Swimming|       HAJOS, Alfred|    HUN|   Men|     1200M Freestyle|  Gold|
|1896|Athe

In [18]:
dataset.dtypes

[('Year', 'string'),
 ('City', 'string'),
 ('Sport', 'string'),
 ('Discipline', 'string'),
 ('Athlete', 'string'),
 ('Country', 'string'),
 ('Gender', 'string'),
 ('Event', 'string'),
 ('Medal', 'string')]
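Every column above is a `string` because the CSV was read without a schema. In PySpark the usual remedies are `inferSchema=True` or an explicit `schema` on `spark.read.csv`, or casting a column afterwards. The sketch below shows the same idea with only the standard library, so it is an analogy rather than Spark code: `csv_text`, `schema`, and `apply_schema` are hypothetical names, and the two rows are taken from the demo output above.

```python
import csv
import io

# Two rows from the demo output, inlined so the sketch needs no file on disk.
csv_text = """Year,City,Sport,Medal
1896,Athens,Aquatics,Gold
1896,Athens,Aquatics,Silver
"""

# Like a header-only CSV read, csv.DictReader yields every value as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(type(rows[0]["Year"]).__name__)  # str

# Hypothetical schema: column name -> converter; unlisted columns stay strings.
schema = {"Year": int}


def apply_schema(row: dict, schema: dict) -> dict:
    """Return a copy of the row with each value converted per the schema."""
    return {name: schema.get(name, str)(value) for name, value in row.items()}


typed_rows = [apply_schema(row, schema) for row in rows]
print(type(typed_rows[0]["Year"]).__name__)  # int
```

With a typed `Year`, numeric comparisons and arithmetic behave as expected instead of falling back to string ordering.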

In [22]:
dataset.select('City','Discipline').head(10)

[Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming'),
 Row(City='Athens', Discipline='Swimming')]

In [24]:
dataset.groupBy('Year','Sport').count().collect()

[Row(Year='1896', Sport='Aquatics', count=11),
 Row(Year='1920', Sport='Weightlifting', count=15),
 Row(Year='1952', Sport='Modern Pentathlon', count=12),
 Row(Year='1984', Sport='Equestrian', count=47),
 Row(Year='1948', Sport='Canoe / Kayak', count=39),
 Row(Year='2004', Sport='Weightlifting', count=45),
 Row(Year='1932', Sport='Sailing', count=41),
 Row(Year='1996', Sport='Boxing', count=48),
 Row(Year='1920', Sport='Equestrian', count=42),
 Row(Year='2000', Sport='Badminton', count=24),
 Row(Year='2004', Sport='Fencing', count=61),
 Row(Year='1976', Sport='Rowing', count=162),
 Row(Year='1900', Sport='Basque Pelota', count=4),
 Row(Year='1980', Sport='Sailing', count=36),
 Row(Year='1996', Sport='Rowing', count=144),
 Row(Year='2012', Sport='Badminton', count=24),
 Row(Year='1920', Sport='Tug of War', count=24),
 Row(Year='1924', Sport='Boxing', count=24),
 Row(Year='1968', Sport='Fencing', count=72),
 Row(Year='1972', Sport='Handball', count=45),
 Row(Year='2000', Sport='Handball'

# Biblio
- https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
- https://medium.com/@ravi.g/sparks-structured-api-s-cdeb381f6407
- https://www.kdnuggets.com/2017/08/three-apache-spark-apis-rdds-dataframes-datasets.html
- https://www.slideshare.net/databricks/jump-start-into-apache-spark-and-databricks