## Solutions - Problem 2

For each day, get the number of flights that departed, the number of flights delayed in departure, and the number of flights delayed in arrival.

* Output should contain 4 columns - **FlightDate**, **FlightCount**, **DepDelayedCount**, **ArrDelayedCount**
* **FlightDate** should be of **YYYY-MM-dd** format.
* Data should be **sorted** in ascending order by **FlightDate**.
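The **FlightDate** format can be checked on a few rows before running the full aggregation. The following is a minimal sketch, assuming the `Year`, `Month` and `DayOfMonth` columns are integers as in the airlines data; the `sample` DataFrame and its rows are hypothetical and only serve as illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, lpad}

// In the notebook `spark` already exists; getOrCreate simply returns it
val spark = SparkSession.builder.
    master("local[1]").
    appName("FlightDateSketch").
    getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the airlines data
val sample = Seq((2008, 1, 5), (2008, 12, 25)).
    toDF("Year", "Month", "DayOfMonth")

// lpad pads single-digit months and days with a leading zero
val flightDates = sample.
    select(concat(col("Year"), lit("-"),
                  lpad(col("Month"), 2, "0"), lit("-"),
                  lpad(col("DayOfMonth"), 2, "0")).
           alias("FlightDate")).
    as[String].
    collect().
    toSeq
```

Without `lpad`, a date such as January 5 would come out as `2008-1-5`, which neither matches the required format nor sorts correctly as a string.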

### Grouping Data by Flight Date


In [None]:
import org.apache.spark.sql.functions.{lit, concat, lpad}

In [None]:
import spark.implicits._

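* Example of `groupBy`. It has to be followed by an aggregation such as `agg`. **On its own, `groupBy` returns a `RelationalGroupedDataset` rather than a Data Frame, so the code below does not produce a result that can be displayed with `show`.**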

In [None]:
airlines.
  groupBy(concat($"Year", lit("-"), 
                 lpad($"Month", 2, "0"), lit("-"), 
                 lpad($"DayOfMonth", 2, "0")).
          alias("FlightDate"))

### Getting Counts by FlightDate

In [None]:
import org.apache.spark.sql.functions.{lit, concat, lpad, count}

airlines.
    groupBy(concat($"Year", lit("-"), 
                   lpad($"Month", 2, "0"), lit("-"), 
                   lpad($"DayOfMonth", 2, "0")).
            alias("FlightDate")).
    agg(count(lit(1)).alias("FlightCount")).
    show(31)

In [None]:
// Alternative: get the count without using agg
// The aggregated column is named "count"; an alias cannot be passed to count directly
import org.apache.spark.sql.functions.{lit, concat, lpad}

In [None]:
airlines.
    groupBy(concat($"Year", lit("-"), 
                   lpad($"Month", 2, "0"), lit("-"), 
                   lpad($"DayOfMonth", 2, "0")).
            alias("FlightDate")).
    count.
    show(31)
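Although `count` on grouped data does not accept an alias, the resulting column can still be renamed afterwards. The following is a minimal sketch using `withColumnRenamed`; the `sample` DataFrame is hypothetical and stands in for `airlines`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, lpad}

// In the notebook `spark` already exists; getOrCreate simply returns it
val spark = SparkSession.builder.
    master("local[1]").
    appName("CountRenameSketch").
    getOrCreate()
import spark.implicits._

// Hypothetical sample: two flights on Jan 5 and one on Jan 6
val sample = Seq((2008, 1, 5), (2008, 1, 5), (2008, 1, 6)).
    toDF("Year", "Month", "DayOfMonth")

val counts = sample.
    groupBy(concat(col("Year"), lit("-"),
                   lpad(col("Month"), 2, "0"), lit("-"),
                   lpad(col("DayOfMonth"), 2, "0")).
            alias("FlightDate")).
    count.
    // count always names its column "count"; rename it after the fact
    withColumnRenamed("count", "FlightCount").
    orderBy("FlightDate")

// Collect as (FlightDate, FlightCount) pairs for inspection
val rows = counts.
    collect().
    map(r => (r.getString(0), r.getLong(1))).
    toSeq
```

This keeps the shorter `count` call, but `agg` remains the better fit once more than one aggregate is needed.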

### Getting total as well as delayed counts for each day

In [None]:
import org.apache.spark.sql.functions.{lit, concat, lpad, count, sum, expr}

In [None]:
airlines.
    groupBy(concat($"Year", lit("-"), 
                   lpad($"Month", 2, "0"), lit("-"), 
                   lpad($"DayOfMonth", 2, "0")).
            alias("FlightDate")).
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ).
    show
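The conditional sums can also be written with the `when` and `otherwise` column functions instead of a SQL `CASE WHEN` string inside `expr`. The following is a minimal sketch of that alternative, not the approach used in this notebook; the `sample` DataFrame is hypothetical and assumes `IsDepDelayed` and `IsArrDelayed` hold the strings `YES` and `NO`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, count, lit, lpad, sum, when}

// In the notebook `spark` already exists; getOrCreate simply returns it
val spark = SparkSession.builder.
    master("local[1]").
    appName("DelayedCountSketch").
    getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the airlines data
val sample = Seq(
    (2008, 1, 5, "YES", "NO"),
    (2008, 1, 5, "NO", "NO"),
    (2008, 1, 6, "YES", "YES")
).toDF("Year", "Month", "DayOfMonth", "IsDepDelayed", "IsArrDelayed")

val daily = sample.
    groupBy(concat(col("Year"), lit("-"),
                   lpad(col("Month"), 2, "0"), lit("-"),
                   lpad(col("DayOfMonth"), 2, "0")).
            alias("FlightDate")).
    agg(count(lit(1)).alias("FlightCount"),
        // when/otherwise yields 1 for delayed flights and 0 for the rest
        sum(when(col("IsDepDelayed") === "YES", 1).otherwise(0)).alias("DepDelayedCount"),
        sum(when(col("IsArrDelayed") === "YES", 1).otherwise(0)).alias("ArrDelayedCount")
       ).
    orderBy("FlightDate")

// Collect as (FlightDate, FlightCount, DepDelayedCount, ArrDelayedCount)
val dailyRows = daily.
    collect().
    map(r => (r.getString(0), r.getLong(1), r.getLong(2), r.getLong(3))).
    toSeq
```

Both forms produce the same counts. With the column functions, a mistake in the condition is more likely to be caught at compile time, whereas the `expr` string is only parsed when the query is analyzed.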

### Sorting Data By FlightDate

In [None]:
import org.apache.spark.sql.functions.{lit, concat, lpad, count, sum, expr}

In [None]:
airlines.
    groupBy(concat($"Year", lit("-"), 
                   lpad($"Month", 2, "0"), lit("-"), 
                   lpad($"DayOfMonth", 2, "0")).
            alias("FlightDate")).
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ).
    orderBy("FlightDate").
    show(31)

### Sorting Data in descending order by count

In [None]:
import org.apache.spark.sql.functions.{lit, concat, lpad, count, sum, expr, col}

In [None]:
airlines.
    groupBy(concat($"Year", lit("-"), 
                   lpad($"Month", 2, "0"), lit("-"), 
                   lpad($"DayOfMonth", 2, "0")).
            alias("FlightDate")).
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ).
    orderBy(col("FlightCount").desc).
    show(31)
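When several days have the same **FlightCount**, sorting by count alone leaves their relative order unspecified. The following is a minimal sketch that adds **FlightDate** as a secondary sort key; the `dailyCounts` DataFrame is hypothetical and stands in for the aggregated result above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In the notebook `spark` already exists; getOrCreate simply returns it
val spark = SparkSession.builder.
    master("local[1]").
    appName("SortSketch").
    getOrCreate()
import spark.implicits._

// Hypothetical aggregated rows; Jan 5 and Jan 7 tie on FlightCount
val dailyCounts = Seq(
    ("2008-01-07", 3L),
    ("2008-01-06", 1L),
    ("2008-01-05", 3L)
).toDF("FlightDate", "FlightCount")

// Highest count first; ties are broken by ascending FlightDate
val sortedDates = dailyCounts.
    orderBy(col("FlightCount").desc, col("FlightDate")).
    select("FlightDate").
    as[String].
    collect().
    toSeq
```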