## Parquet

Parquet is another data format that plays well with Spark. 
 
It's a "flat file" format, like JSON or CSV, but it contains extra information about types, allows for "predicate pushdown", is column-oriented, and has first-class support for nested columns!

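Predicate pushdown means that Spark doesn't have to read all the data from the disk! Parquet files store statistics (such as the minimum and maximum value of each column) for every chunk of rows, so when a query filters on a column, Spark can skip whole chunks that cannot contain matching rows.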
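To make the idea concrete, here is a toy model in plain Python. It is only a sketch of the principle, not how Parquet or Spark are actually implemented: the `RowGroup` class, the `count_matching` function, and the sample countries are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of rows that carries min/max statistics."""
    countries: list

    @property
    def min_country(self):
        return min(self.countries)

    @property
    def max_country(self):
        return max(self.countries)


def count_matching(row_groups, wanted):
    """Count rows equal to `wanted`, skipping chunks whose statistics rule it out."""
    matches = 0
    groups_read = 0
    for group in row_groups:
        # If `wanted` falls outside [min, max], no row in this chunk can match.
        if wanted < group.min_country or wanted > group.max_country:
            continue
        groups_read += 1
        matches += sum(1 for country in group.countries if country == wanted)
    return matches, groups_read


row_groups = [
    RowGroup(["Austria", "Belgium", "Belgium"]),
    RowGroup(["Canada", "Denmark"]),
    RowGroup(["France", "Germany"]),
]

matches, groups_read = count_matching(row_groups, "Belgium")
print(matches, groups_read)  # 2 matching rows, found by reading 1 of 3 chunks
```

Note that the one chunk that does get read still contains a row for Austria: statistics let the reader skip whole chunks, not individual rows, so the filter is still applied to whatever is read.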

In [None]:
orders = spark.read.parquet('data/orders')

In [None]:
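# Here, Spark can push the country filter down to the Parquet reader, so
# chunks of the files that cannot contain Belgium may be skipped entirely!
# NOTE: `customer.country` is a field inside a nested type!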

orders.createOrReplaceTempView('orders')

res = spark.sql("""
SELECT count(order_number)
FROM orders 
WHERE customer.country = 'Belgium'
""".strip())

res.show()

## Nested types in SQL!

How do we deal with these pesky nested types now? 

Spark SQL gives us built-in functions to deal with nested "Array" types!

1. TRANSFORM: this is a `map` operation.
2. AGGREGATE: a slightly more general form of `reduce`.

You can look at the documentation to see exactly how they work: 

https://spark.apache.org/docs/latest/api/sql/index.html
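As a rough mental model, the two functions behave like the plain-Python helpers below. This is a sketch of the semantics only, not Spark code: the helper names simply mirror the SQL functions, and the sample numbers are made up. In Spark SQL, AGGREGATE takes the array, a start value, a merge lambda, and an optional finish lambda.

```python
from functools import reduce


def transform(array, func):
    """Like TRANSFORM(array, x -> ...): apply `func` to every element."""
    return [func(x) for x in array]


def aggregate(array, start, merge, finish=lambda acc: acc):
    """Like AGGREGATE(array, start, (acc, x) -> ..., acc -> ...).

    Fold the array into one value beginning at `start`, then optionally
    post-process the accumulated result with `finish`.
    """
    return finish(reduce(merge, array, start))


prices = [3, 5, 10]  # made-up sample data

doubled = transform(prices, lambda x: x * 2)
total = aggregate(prices, 0, lambda acc, x: acc + x)
average = aggregate(
    prices, 0, lambda acc, x: acc + x, lambda acc: acc / len(prices)
)

print(doubled)  # [6, 10, 20]
print(total)    # 18
print(average)  # 6.0
```

The explicit start value is what makes AGGREGATE more general than a bare `reduce`: it fixes the type of the accumulator and gives a sensible result for an empty array.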

In [None]:
# Exercise 1:

# Try to reproduce what we did before (getting the total sales) in Spark SQL:
# use TRANSFORM and AGGREGATE on each row's "line_items", then sum over the
# rows to get the total sales.

# HINT: With AGGREGATE you need to get the types right: the start value and
# the result of the merge lambda must have the same type, which can be a bit
# confusing (bigint aka long, vs int aka integer).

# HINT2: You can alternatively use EXPLODE. Note that this works similarly to
# flatMap in RDD terms.