## Solutions - Problem 1
Get the total number of flights, along with the number of flights delayed in departure and the number of flights delayed in arrival.
* Output should contain 3 columns - **FlightCount**, **DepDelayedCount**, **ArrDelayedCount**

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

# spark.ui.port is set to 0 so that Spark picks a free port for its UI,
# which avoids port conflicts on a shared cluster
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()

If you prefer to work from a CLI, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

### Reading airtraffic data

In [None]:
airtraffic_path = "/public/airtraffic_all/airtraffic-part/flightmonth=200801"

airtraffic = spark. \
    read. \
    parquet(airtraffic_path)

airtraffic.printSchema()

### Get flights with delayed arrival

In [None]:
# SQL Style
airtraffic.filter("IsArrDelayed = 'YES'").show()

In [None]:
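# Data Frame Style, referring to the column using bracket notation
airtraffic.filter(airtraffic["IsArrDelayed"] == 'YES').show()

In [None]:
# Data Frame Style, referring to the column using attribute notation
airtraffic.filter(airtraffic.IsArrDelayed == 'YES').show()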

### Get delayed counts

In [None]:
## Departure Delayed Count
airtraffic. \
    filter(airtraffic.IsDepDelayed == "YES"). \
    count()

In [None]:
## Arrival Delayed Count
airtraffic. \
    filter(airtraffic.IsArrDelayed == "YES"). \
    count()

In [None]:
## Flights delayed in departure or in arrival (or both)
airtraffic. \
    filter("IsDepDelayed = 'YES' OR IsArrDelayed = 'YES'"). \
    select('Year', 'Month', 'DayOfMonth',
           'FlightNum', 'IsDepDelayed', 'IsArrDelayed'
          ). \
    show()
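
The `OR` condition above keeps a flight if either flag is `YES`, so a flight delayed in both departure and arrival appears only once. As a minimal sketch of that behaviour in plain Python (no cluster needed), assuming a handful of hypothetical rows that are not taken from the airtraffic data set, the names `sample_flights` and `is_delayed` are made up for illustration:

```python
# Hypothetical sample rows, not taken from the airtraffic data set
sample_flights = [
    {"FlightNum": 101, "IsDepDelayed": "YES", "IsArrDelayed": "YES"},
    {"FlightNum": 102, "IsDepDelayed": "NO", "IsArrDelayed": "YES"},
    {"FlightNum": 103, "IsDepDelayed": "NO", "IsArrDelayed": "NO"},
    {"FlightNum": 104, "IsDepDelayed": "YES", "IsArrDelayed": "NO"},
    {"FlightNum": 105, "IsDepDelayed": "NO", "IsArrDelayed": "YES"},
]


def is_delayed(row, column):
    """Return True when the given flag column holds 'YES'."""
    return row[column] == "YES"


# Plain Python counterpart of IsDepDelayed = 'YES' OR IsArrDelayed = 'YES'
either_delayed = [
    row for row in sample_flights
    if is_delayed(row, "IsDepDelayed") or is_delayed(row, "IsArrDelayed")
]

# Flights delayed in both departure and arrival
both_delayed = [
    row for row in sample_flights
    if is_delayed(row, "IsDepDelayed") and is_delayed(row, "IsArrDelayed")
]

dep_count = len([r for r in sample_flights if is_delayed(r, "IsDepDelayed")])
arr_count = len([r for r in sample_flights if is_delayed(r, "IsArrDelayed")])

# The OR count equals dep + arr - both, as flights delayed in both
# would otherwise be counted twice
print(len(either_delayed), dep_count + arr_count - len(both_delayed))
```

This is why the departure and arrival delayed counts cannot simply be added up to get the number of delayed flights.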

In [None]:
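## Total flight count along with departure delayed and arrival delayed counts
## All 3 values are computed in a single aggregation over the Data Frame
## Note: sum imported here is the Spark function and shadows Python's built-in sum
from pyspark.sql.functions import lit, count, sum, expr
airtraffic. \
    agg(count(lit(1)).alias("FlightCount"),
        sum(expr("CASE WHEN IsDepDelayed = 'YES' THEN 1 ELSE 0 END")).alias("DepDelayedCount"),
        sum(expr("CASE WHEN IsArrDelayed = 'YES' THEN 1 ELSE 0 END")).alias("ArrDelayedCount")
       ). \
    show()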
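
To see what the `CASE WHEN ... THEN 1 ELSE 0 END` expressions contribute to each sum, here is a minimal plain Python sketch of the same conditional counting. It assumes a few hypothetical rows rather than the real airtraffic data, and `sample_flights` and `delayed_counts` are names made up for illustration. It does not use Spark at all, so it can be run anywhere to check the logic.

```python
# Hypothetical sample rows, not taken from the airtraffic data set
sample_flights = [
    {"FlightNum": 101, "IsDepDelayed": "YES", "IsArrDelayed": "YES"},
    {"FlightNum": 102, "IsDepDelayed": "NO", "IsArrDelayed": "YES"},
    {"FlightNum": 103, "IsDepDelayed": "NO", "IsArrDelayed": "NO"},
    {"FlightNum": 104, "IsDepDelayed": "YES", "IsArrDelayed": "NO"},
    {"FlightNum": 105, "IsDepDelayed": "NO", "IsArrDelayed": "YES"},
]


def delayed_counts(rows):
    """Compute all 3 counts in one pass over the rows.

    Each row adds 1 to FlightCount, and adds 1 or 0 to the delayed
    counts, which mirrors count(lit(1)) and the two CASE WHEN sums.
    """
    result = {"FlightCount": 0, "DepDelayedCount": 0, "ArrDelayedCount": 0}
    for row in rows:
        result["FlightCount"] += 1
        result["DepDelayedCount"] += 1 if row["IsDepDelayed"] == "YES" else 0
        result["ArrDelayedCount"] += 1 if row["IsArrDelayed"] == "YES" else 0
    return result


counts = delayed_counts(sample_flights)
print(counts)
```

With the 5 sample rows, the result has one entry per output column: 5 flights in total, 2 delayed in departure and 3 delayed in arrival. Computing everything in one pass avoids scanning the data once per count, which is the advantage of the `agg` approach over running separate `filter` and `count` calls.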