## Aggregate Functions

Let us see how to compute aggregations within each group while still projecting the raw rows that feed the aggregation.

 * We have functions such as `sum`, `avg`, `min`, and `max` that can be used to aggregate the data.
 * We need to create a `WindowSpec` object using `partitionBy` to get aggregations within each group.
 * Typically we don't need to sort the data to perform aggregations. However, to perform cumulative aggregations using `rowsBetween`, we have to sort the data within each partition by the column that defines the cumulative order.
 * Let us compute the total, minimum, maximum, and average departure delay for each airport on each day. We will ignore flights that departed early or on time.
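
The difference between a partition-only window and a sorted, cumulative window can be sketched on a tiny in-memory data set. This is a minimal illustration, not part of the original notebook: the sample rows, the `demoSpark` session, and the `local[1]` master are assumptions made so the snippet can run on its own (if a session is already running, `getOrCreate` reuses it).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Assumption: a local session, used only for this sketch
val demoSpark = SparkSession.
    builder.
    master("local[1]").
    appName("Window Sketch").
    getOrCreate()
import demoSpark.implicits._

// Hypothetical sample rows: (FlightDate, Origin, DepDelay)
val delays = Seq(
    ("20080101", "ATL", 10),
    ("20080101", "ATL", 25),
    ("20080101", "ATL", 40),
    ("20080101", "ORD", 5),
    ("20080101", "ORD", 15)
).toDF("FlightDate", "Origin", "DepDelay")

// Partition only: every row in a group sees the same group total
val groupSpec = Window.
    partitionBy("FlightDate", "Origin")

// Partition, sort, and frame from the first row up to the current row:
// each row sees a running total in DepDelay order
val runningSpec = Window.
    partitionBy("FlightDate", "Origin").
    orderBy("DepDelay").
    rowsBetween(Window.unboundedPreceding, Window.currentRow)

val result = delays.
    withColumn("DepDelaySum", sum("DepDelay").over(groupSpec)).
    withColumn("DepDelayRunningSum", sum("DepDelay").over(runningSpec)).
    orderBy("FlightDate", "Origin", "DepDelay")

result.show
```

For the three sample `ATL` rows, `DepDelaySum` is 75 on every row, while `DepDelayRunningSum` grows as 10, 35, 75.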

Let us start a Spark session for this notebook so that we can execute the code provided.

If you prefer to practice from a terminal, launch the Spark shell with the following command.

```
spark2-shell \
  --master yarn \
  --name "Windowing Functions" \
  --conf spark.ui.port=0
```

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Windowing Functions").
    master("yarn").
    getOrCreate()
```

```scala
spark.conf.set("spark.sql.shuffle.partitions", "2")
```

```scala
import spark.implicits._
```

```scala
val airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"
```

```scala
val airlines = spark.
  read.
  parquet(airlines_path)
```

```scala
import org.apache.spark.sql.functions.{col, lit, lpad, concat}
import org.apache.spark.sql.functions.{min, max, sum, avg, round}
import org.apache.spark.sql.expressions.Window
```

```scala
airlines.printSchema
```

```scala
val spec = Window.
    partitionBy("FlightDate", "Origin")
```

```scala
airlines.
    // Keep only delayed flights that were not cancelled
    filter("IsDepDelayed = 'YES' and Cancelled = 0").
    // Build a yyyyMMdd FlightDate key and project the raw columns
    select(concat($"Year",
                  lpad($"Month", 2, "0"),
                  lpad($"DayOfMonth", 2, "0")
                 ).alias("FlightDate"),
           $"Origin",
           $"UniqueCarrier",
           $"FlightNum",
           $"CRSDepTime",
           $"IsDepDelayed",
           $"DepDelay".cast("int").alias("DepDelay")
          ).
    // Attach the aggregates of each (FlightDate, Origin) group to every row
    withColumn("DepDelayMin", min("DepDelay").over(spec)).
    withColumn("DepDelayMax", max("DepDelay").over(spec)).
    withColumn("DepDelaySum", sum("DepDelay").over(spec)).
    withColumn("DepDelayAvg", round(avg("DepDelay").over(spec), 2)).
    orderBy("FlightDate", "Origin", "DepDelay").
    show
```
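
To check by hand what the window aggregates attach to each row, the same logic can be reproduced on plain Scala collections, with no Spark involved. This is only a model of the semantics, not how Spark executes the query. The `Flight` and `FlightWithStats` case classes, the `withGroupStats` helper, and the sample values are all hypothetical names introduced for this sketch.

```scala
// Hypothetical rows standing in for the projected flight data
case class Flight(flightDate: String, origin: String, depDelay: Int)

// A row together with the aggregates of its (flightDate, origin) group
case class FlightWithStats(flight: Flight,
                           minDelay: Int,
                           maxDelay: Int,
                           sumDelay: Int,
                           avgDelay: Double)

val flights = List(
  Flight("20080101", "ATL", 10),
  Flight("20080101", "ATL", 25),
  Flight("20080101", "ATL", 40),
  Flight("20080101", "ORD", 5),
  Flight("20080101", "ORD", 15)
)

def withGroupStats(rows: List[Flight]): List[FlightWithStats] = {
  // One set of aggregates per (flightDate, origin) key, like partitionBy
  val stats = rows.groupBy(f => (f.flightDate, f.origin)).map { case (key, group) =>
    val delays = group.map(_.depDelay)
    // Average rounded to 2 decimals, mirroring round(avg(...), 2) for positive values
    val avg = math.round(delays.sum.toDouble / delays.size * 100) / 100.0
    (key, (delays.min, delays.max, delays.sum, avg))
  }
  // Every input row is kept; the group aggregates are repeated on each row
  rows.map { f =>
    val (mn, mx, sm, av) = stats((f.flightDate, f.origin))
    FlightWithStats(f, mn, mx, sm, av)
  }
}

withGroupStats(flights).foreach(println)
```

The output has as many rows as the input, which is the key difference from a plain `groupBy` aggregation that returns one row per group.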