# RDDs, DataFrames, and Datasets

## RDDs

Resilient Distributed Datasets (we talked about these!). Newer APIs have since been introduced so that people can take advantage of Spark's parallel execution framework and fault tolerance without repeating the same set of mistakes.

## DataFrames

- RDDs with named *untyped* columns.
- Columnar storage
  - Similar optimizations for OLAP queries as Vertica
- Memory Management (Tungsten)
  - direct control of data storage in memory
    - cpu cache, and read ahead
  - largest source of performance increase
- avoids Java serialization (and alternatives that are faster but still slow)
  - Kryo serialization
  - compression
- no garbage collection overhead
- Execution plans (Catalyst Optimizer)
  - primarily rule-based rather than cost-based (a cost-based optimizer was only added in Spark 2.2)
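
To make the columnar storage point concrete, here is a minimal plain-Python sketch (not Spark code) that contrasts a row layout with a column layout. The records and the helper names `avg_delay_row_layout` and `avg_delay_column_layout` are made up for illustration; the typed `array` only loosely stands in for the compact, non-object storage that Tungsten manages.

```python
from array import array

# Hypothetical toy records: (origin, delay_minutes, distance_miles)
rows = [
    ("SEA", 5, 954),
    ("SFO", -2, 337),
    ("SEA", 30, 2402),
    ("JFK", 12, 2586),
]


def avg_delay_row_layout(records):
    # Row layout: every tuple is visited even though only one field is needed.
    return sum(r[1] for r in records) / len(records)


# Column layout: one contiguous, typed array per numeric column.
columns = {
    "origin": [r[0] for r in rows],
    "delay": array("i", (r[1] for r in rows)),
    "distance": array("i", (r[2] for r in rows)),
}


def avg_delay_column_layout(cols):
    # Only the 'delay' column is scanned; the other columns are never touched.
    delay = cols["delay"]
    return sum(delay) / len(delay)


print(avg_delay_row_layout(rows))
print(avg_delay_column_layout(columns))
```

Both functions return the same average; the difference is how much data has to be read to get it, which is why OLAP-style aggregates over a few columns tend to favor the column layout.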

## Datasets

Adds to DataFrames:
- compile-time type safety
- API only available in Scala and Java (Python has no compile-time type checking)

Encoders act as the liaison between JVM objects and off-heap memory (the new formats introduced with Tungsten).
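
As a rough analogy for what an encoder does, the sketch below converts a typed object to a fixed binary layout and back using Python's `struct` module. This is not Spark's Encoder API; `Sighting`, `SightingEncoder`, and the field layout are hypothetical names chosen only for illustration.

```python
import struct
from dataclasses import dataclass


@dataclass
class Sighting:
    # Hypothetical record type, loosely modeled on the mallard data
    tag_id: int
    longitude: float
    latitude: float


class SightingEncoder:
    # Fixed binary layout: little-endian int64 followed by two float64 values
    _layout = struct.Struct("<qdd")

    def to_bytes(self, obj):
        # Object -> compact binary row
        return self._layout.pack(obj.tag_id, obj.longitude, obj.latitude)

    def from_bytes(self, buf):
        # Binary row -> object
        return Sighting(*self._layout.unpack(buf))

    def read_longitude(self, buf):
        # Read a single field at its known offset without rebuilding the object
        return struct.unpack_from("<d", buf, 8)[0]


encoder = SightingEncoder()
original = Sighting(tag_id=42, longitude=-122.3, latitude=47.6)
packed = encoder.to_bytes(original)
restored = encoder.from_bytes(packed)
print(len(packed), restored)
```

The useful property is the last method: because the layout is known, a single field can be read straight from the bytes, which is the kind of access a binary row format makes possible without deserializing whole objects.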

## Let's load a file

1. select 'Tables'
2. in new tab, select 'Create Table'
3. we could really select anything here (file upload, S3, DBFS, or JDBC), but for now we will upload 'mallard.csv' from the Vertica demo
  (https://s3-us-west-2.amazonaws.com/cse599c-sp17/mallard.csv)
4. select preview table
5. we can name the table, select our file delimiter, etc.
6. retrieve the DBFS path before creating the table
7. select 'create table'

In [3]:
# set file path (replace the placeholder with the DBFS path of the uploaded file)
mallardFilePath = 'PATH.csv'

In [4]:
mallard = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true', delimiter=',').load(mallardFilePath)
mallard.count()

In [5]:
sortMallard = mallard.sort("location-long")

In [6]:
from pyspark.sql.functions import countDistinct, max

countDistinctMallard = mallard.select("location-long", "location-lat")\
  .groupBy("location-long", "location-lat")\
  .agg(countDistinct("location-long").alias('c'))\
  .agg(max('c'))
  
countDistinctMallard.head()
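
One thing worth checking in the query above: `location-long` is itself a grouping column, so its distinct count inside each group cannot exceed 1. If the intent was to find how many rows share the most common coordinate pair (an assumption about the goal), a plain row count per group is what is needed. The plain-Python sketch below shows the difference on made-up points that are not taken from `mallard.csv`.

```python
from collections import Counter

# Made-up (location-long, location-lat) pairs, for illustration only
points = [(-122.3, 47.6), (-122.3, 47.6), (-121.9, 47.2), (-122.3, 47.6)]

# Distinct longitudes inside each (long, lat) group: always a single value
distinct_long_per_group = {}
for lon, lat in points:
    distinct_long_per_group.setdefault((lon, lat), set()).add(lon)
max_distinct = max(len(s) for s in distinct_long_per_group.values())

# Rows per (long, lat) group: how often each location was reported
rows_per_group = Counter(points)
max_rows = max(rows_per_group.values())

print(max_distinct)
print(max_rows)
```

In the DataFrame API the second variant would correspond to aggregating with `count` instead of `countDistinct`.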

# Use Case: On-Time Flight Performance

This notebook provides an analysis of on-time flight performance and departure delays.

Source Data: 
* [OpenFlights: Airport, airline and route data](http://openflights.org/data.html)
* [United States Department of Transportation: Bureau of Transportation Statistics (TranStats)](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time)
  * Note: the data used here was extracted from the US DOT BTS between 1/1/2014 and 3/31/2014.

References:
* [GraphFrames User Guide](http://graphframes.github.io/user-guide.html)
* [GraphFrames: DataFrame-based Graphs (GitHub)](https://github.com/graphframes/graphframes)
* [D3 Airports Example](http://mbostock.github.io/d3/talk/20111116/airports.html)

### Preparation
Extract the airports and departure delays information from S3 / DBFS.

In [9]:
# Set File Paths
tripdelaysFilePath = "/databricks-datasets/flights/departuredelays.csv"
airportsnaFilePath = "/databricks-datasets/flights/airport-codes-na.txt"

In [10]:
# Obtain airports dataset
airportsna = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true', delimiter='\t').load(airportsnaFilePath)
airportsna.registerTempTable("airports_na")

In [11]:
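# Obtain departure delays data
# (inferschema makes delay and distance numeric, so sorting and averaging behave numerically rather than as strings)
departureDelays = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load(tripdelaysFilePath)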
departureDelays.registerTempTable("departureDelays")
departureDelays.cache()

In [12]:
airportsna.printSchema()

In [13]:
departureDelays.printSchema()

In [14]:
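# the DataFrame API offers programmatic, SQL-like commands
# (import avg explicitly; a wildcard import would shadow Python built-ins such as max and sum)
from pyspark.sql.functions import avg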
sortDelays = departureDelays.sort("delay")
sortDelays.head(3)

In [15]:
# register the DataFrame as a temp table so that we can query it with SQL
departureDelays.registerTempTable("depature_delays")
sortDelays_sql = sqlContext.sql("SELECT * FROM depature_delays ORDER BY delay")
sortDelays_sql.head(3)

In [16]:
# We can also do more complex selections
longAvgDistByDest = departureDelays.groupBy("destination")\
  .agg(avg("distance").alias("avg_dist"))\
  .where("avg_dist > 1000")
longAvgDistByDest.head(3)

In [17]:
# and again with a declarative command
longAvgDistByDest_sql = sqlContext.sql("SELECT destination, avg(distance) AS avg_dist FROM depature_delays GROUP BY destination HAVING avg_dist > 1000")
longAvgDistByDest_sql.head(3)
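
To experiment with the shape of this declarative query without a cluster, the sketch below runs an equivalent statement against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here, the rows are invented, and `HAVING` repeats the aggregate expression instead of the alias to stay on the safe side of dialect differences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE depature_delays "
    "(origin TEXT, destination TEXT, distance REAL, delay REAL)"
)
# Invented rows, not taken from departuredelays.csv
conn.executemany(
    "INSERT INTO depature_delays VALUES (?, ?, ?, ?)",
    [
        ("SEA", "JFK", 2400, 10),
        ("SFO", "JFK", 2600, 20),
        ("SEA", "PDX", 130, 5),
        ("SFO", "PDX", 550, 0),
        ("SEA", "ORD", 1700, 15),
    ],
)

query = """
    SELECT destination, avg(distance) AS avg_dist
    FROM depature_delays
    GROUP BY destination
    HAVING avg(distance) > 1000
    ORDER BY destination
"""
long_avg = conn.execute(query).fetchall()
print(long_avg)
```

`GROUP BY` plays the role of `groupBy`, the aggregate in the select list the role of `agg`, and `HAVING` the role of the `where` applied after aggregation in the fluent version.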

In [18]:
# we can also use the Python fluent API to execute joins
from pyspark.sql.functions import col
delayedSeaDest = departureDelays.join(airportsna, departureDelays["destination"] == airportsna["IATA"], 'inner')\
  .filter(col("origin") == 'SEA')\
  .groupBy("destination")\
  .agg(avg("delay").alias("avg_delay"))\
  .orderBy(col("avg_delay").desc())
delayedSeaDest.head(3)

In [19]:
# and again with a declarative command
airportsna.registerTempTable("airports")
delayedSeaDest_sql = sqlContext.sql("SELECT destination, avg(delay) AS avg_delay FROM depature_delays, airports WHERE origin = 'SEA' AND destination = IATA GROUP BY destination ORDER BY avg_delay DESC")
delayedSeaDest_sql.head(3)

In [20]:
delayedSeaDest.explain()

In [21]:
delayedSeaDest_sql.explain()
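
Both `explain()` calls print the plan the optimizer settled on for each query. As a rough illustration of what a rule-based optimizer does, the toy sketch below applies a single rewrite rule, predicate pushdown, to a tiny plan tree. The classes `Scan`, `Join`, and `Filter`, the column lists, and the function names are all hypothetical; this is not how Catalyst is implemented, only the general idea of matching a plan pattern and replacing it with an equivalent plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Filter:
    column: str
    value: str
    child: object


def output_columns(plan):
    # Columns visible at the top of a (sub)plan
    if isinstance(plan, Scan):
        return set(plan.columns)
    if isinstance(plan, Filter):
        return output_columns(plan.child)
    return output_columns(plan.left) | output_columns(plan.right)


def push_filter_below_join(plan):
    # Rule: Filter(Join(l, r)) becomes Join(Filter(l), r) when only l has the column
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        in_left = plan.column in output_columns(join.left)
        in_right = plan.column in output_columns(join.right)
        if in_left and not in_right:
            return Join(Filter(plan.column, plan.value, join.left), join.right)
        if in_right and not in_left:
            return Join(join.left, Filter(plan.column, plan.value, join.right))
    # No match: the plan is returned unchanged
    return plan


# Hypothetical subsets of the columns in the two tables
delays = Scan("depature_delays", ("origin", "destination", "delay"))
airports = Scan("airports", ("IATA", "City"))

logical = Filter("origin", "SEA", Join(delays, airports))
optimized = push_filter_below_join(logical)
print(optimized)
```

After the rewrite the filter on `origin` sits directly above the scan of the delays table, so fewer rows reach the join. No cost estimates are involved: the rule fires whenever the pattern matches, which is what distinguishes a rule-based from a cost-based optimizer.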