## Solutions - Problem 2

For each day, get the number of flights that departed, the number of flights delayed in departure, and the number of flights delayed in arrival.

* Output should contain 4 columns - **FlightDate**, **FlightCount**, **DepDelayedCount**, **ArrDelayedCount**
* **FlightDate** should be in **YYYY-MM-dd** format.
* Data should be **sorted** in ascending order by **FlightDate**.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL through one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

### Grouping Data by Flight Date

In [None]:
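from pyspark.sql.functions import lit, concat, lpad

# `airlines` is the airlines DataFrame, assumed to be loaded earlier in the notebook.
# Build FlightDate as YYYY-MM-dd by zero-padding Month and DayOfMonth to 2 characters.
airlines. \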
  groupBy(concat("Year", lit("-"), 
                 lpad("Month", 2, "0"), lit("-"), 
                 lpad("DayOfMonth", 2, "0")).
          alias("FlightDate"))
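
To see what the grouping key looks like, here is a minimal plain-Python sketch that mirrors the `concat` and `lpad` calls above on a couple of made-up values. It does not use Spark: `flight_date` is a hypothetical helper introduced only for illustration, and `str.rjust` stands in for `lpad` under the assumption that Month and DayOfMonth have at most 2 digits.

```python
def flight_date(year, month, day_of_month):
    """Build a YYYY-MM-dd string, left-padding month and day with zeros."""
    # str.rjust(2, "0") plays the role of lpad(col, 2, "0") for values of up to 2 digits
    return "-".join([
        str(year),
        str(month).rjust(2, "0"),
        str(day_of_month).rjust(2, "0"),
    ])


# Made-up sample values, not taken from the airlines data
print(flight_date(2008, 1, 5))    # 2008-01-05
print(flight_date(2008, 12, 31))  # 2008-12-31
```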

### Getting Counts by FlightDate

In [None]:
from pyspark.sql.functions import lit, concat, lpad, count

airlines. \
    groupBy(concat("Year", lit("-"), 
                   lpad("Month", 2, "0"), lit("-"), 
                   lpad("DayOfMonth", 2, "0")).
            alias("FlightDate")). \
    agg(count(lit(1)).alias("FlightCount")). \
    show(31)

In [None]:
# Alternative: get the count without using agg
# With this approach we cannot alias the aggregated field; it is named `count`
from pyspark.sql.functions import lit, concat, lpad

airlines. \
    groupBy(concat("Year", lit("-"), 
                   lpad("Month", 2, "0"), lit("-"), 
                   lpad("DayOfMonth", 2, "0")).
            alias("FlightDate")). \
    count(). \
    show()

### Getting Total as well as Delayed Counts for Each Day

In [None]:
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum in this notebook
from pyspark.sql.functions import lit, concat, lpad, count, sum, expr

airlines. \
    groupBy(concat("Year", lit("-"), 
                   lpad("Month", 2, "0"), lit("-"), 
                   lpad("DayOfMonth", 2, "0")).
            alias("FlightDate")). \
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ). \
    show()
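
The `sum` over a `CASE WHEN` expression is a conditional count: each row contributes 1 when the condition holds and 0 otherwise. The plain-Python sketch below illustrates the same idea on a few made-up records. It is not Spark code, and the sample tuples and the `daily_counts` name are assumptions made only for this example.

```python
from collections import defaultdict

# Made-up records: (FlightDate, IsDepDelayed, IsArrDelayed)
flights = [
    ("2008-01-01", "YES", "NO"),
    ("2008-01-01", "NO", "NO"),
    ("2008-01-01", "YES", "YES"),
    ("2008-01-02", "NO", "YES"),
]

# Each FlightDate maps to [FlightCount, DepDelayedCount, ArrDelayedCount]
counts = defaultdict(lambda: [0, 0, 0])
for flight_date, is_dep_delayed, is_arr_delayed in flights:
    row = counts[flight_date]
    row[0] += 1                                     # like count(lit(1))
    row[1] += 1 if is_dep_delayed == "YES" else 0   # like CASE WHEN ... THEN 1 ELSE 0 END
    row[2] += 1 if is_arr_delayed == "YES" else 0

daily_counts = dict(counts)
print(daily_counts)
```

Every record increments the first counter, while the other two only grow for delayed flights, which is why all three counts can be computed in a single pass over each group.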

### Sorting Data By FlightDate

In [None]:
help(airlines.sort)

In [None]:
help(airlines.orderBy)

In [None]:
from pyspark.sql.functions import lit, concat, lpad, count, sum, expr
airlines. \
    groupBy(concat("Year", lit("-"), 
                   lpad("Month", 2, "0"), lit("-"), 
                   lpad("DayOfMonth", 2, "0")).
            alias("FlightDate")). \
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ). \
    orderBy("FlightDate"). \
    show(31)
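
Because **FlightDate** is built with `concat`, it is a string, so sorting compares it as text. The plain-Python sketch below, using made-up dates, suggests why the zero padding matters: padded dates sort chronologically, while unpadded ones do not. The list names are assumptions for this illustration only.

```python
# Made-up date strings, with and without zero padding
padded = ["2008-01-10", "2008-01-02", "2008-01-01"]
unpadded = ["2008-1-10", "2008-1-2", "2008-1-1"]

# Text comparison goes character by character, left to right
sorted_padded = sorted(padded)
sorted_unpadded = sorted(unpadded)

print(sorted_padded)    # ['2008-01-01', '2008-01-02', '2008-01-10']
print(sorted_unpadded)  # ['2008-1-1', '2008-1-10', '2008-1-2']
```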

### Sorting Data in Descending Order by Count

In [None]:
from pyspark.sql.functions import lit, concat, lpad, count, sum, expr, col
airlines. \
    groupBy(concat("Year", lit("-"), 
                   lpad("Month", 2, "0"), lit("-"), 
                   lpad("DayOfMonth", 2, "0")).
            alias("FlightDate")). \
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ). \
    orderBy(col("FlightCount").desc()). \
    show()
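
As a rough plain-Python analogy for the descending sort above, the sketch below orders a few made-up (FlightDate, FlightCount) pairs by count, largest first. It does not use Spark, and the sample rows and variable names are assumptions for this example.

```python
# Made-up rows: (FlightDate, FlightCount)
daily_counts = [
    ("2008-01-01", 3),
    ("2008-01-02", 5),
    ("2008-01-03", 4),
]

# Sort on the count (second field), largest first
by_count_desc = sorted(daily_counts, key=lambda row: row[1], reverse=True)
print(by_count_desc)
```

In PySpark itself, `orderBy("FlightCount", ascending=False)` should give the same ordering as `orderBy(col("FlightCount").desc())`.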