## Solutions - Problem 3
Get all the flights that departed late but arrived early (**IsArrDelayed is NO**).
* Output should contain - **FlightCRSDepTime**, **UniqueCarrier**, **FlightNum**, **Origin**, **Dest**, **DepDelay**, **ArrDelay**
* **FlightCRSDepTime** needs to be computed using **Year**, **Month**, **DayOfMonth**, **CRSDepTime**
* **FlightCRSDepTime** should be displayed using **yyyy-MM-dd HH:mm** format.
* Output should be sorted by **FlightCRSDepTime** and then by the difference between **DepDelay** and **ArrDelay**
* Also get the count of such flights
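
Before turning to Spark, the requirements above can be sketched in plain Python to make the expected output concrete. This is only an illustration and not the Spark solution: the helper `flight_crs_dep_time` and the sample rows are hypothetical, and it assumes **CRSDepTime** is an integer in `hhmm` form (for example `700` for 07:00).

```python
def flight_crs_dep_time(year, month, day_of_month, crs_dep_time):
    """Build a 'date HH:mm' string from the date parts and an hhmm integer."""
    # 700 -> (7, 0), 1855 -> (18, 55)
    hours, minutes = divmod(crs_dep_time, 100)
    return f"{year:04d}-{month:02d}-{day_of_month:02d} {hours:02d}:{minutes:02d}"


# Hypothetical sample rows, using the column names from the problem statement
flights = [
    {"Year": 2008, "Month": 1, "DayOfMonth": 23, "CRSDepTime": 700,
     "DepDelay": 12, "ArrDelay": -3, "IsDepDelayed": "YES", "IsArrDelayed": "NO"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 10, "CRSDepTime": 1855,
     "DepDelay": 5, "ArrDelay": -10, "IsDepDelayed": "YES", "IsArrDelayed": "NO"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 10, "CRSDepTime": 1855,
     "DepDelay": 3, "ArrDelay": 0, "IsDepDelayed": "YES", "IsArrDelayed": "NO"},
    {"Year": 2008, "Month": 1, "DayOfMonth": 10, "CRSDepTime": 600,
     "DepDelay": 20, "ArrDelay": 25, "IsDepDelayed": "YES", "IsArrDelayed": "YES"},
]

# Keep flights that departed late but did not arrive late
late_dep_early_arr = [
    f for f in flights
    if f["IsDepDelayed"] == "YES" and f["IsArrDelayed"] == "NO"
]

# Project the computed timestamp, then sort by it and by DepDelay - ArrDelay
result = sorted(
    (
        {
            "FlightCRSDepTime": flight_crs_dep_time(
                f["Year"], f["Month"], f["DayOfMonth"], f["CRSDepTime"]
            ),
            "DepDelay": f["DepDelay"],
            "ArrDelay": f["ArrDelay"],
        }
        for f in late_dep_early_arr
    ),
    key=lambda r: (r["FlightCRSDepTime"], r["DepDelay"] - r["ArrDelay"]),
)

flight_count = len(result)

for row in result:
    print(row)
print(flight_count)
```

Because every part of the timestamp is zero padded, sorting the formatted strings gives the same order as sorting the underlying timestamps. The Spark code in this notebook relies on the same property.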


Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [None]:
airtraffic.select('Year', 'Month', 'DayOfMonth', 'CRSDepTime').show()

In [None]:
l = [(2008, 1, 23, 700),
     (2008, 1, 10, 1855),
    ]

In [None]:
df = spark.createDataFrame(l, "Year INT, Month INT, DayOfMonth INT, DepTime INT")
df.show()

In [None]:
from pyspark.sql.functions import col, substring
df.select(substring(col('DepTime'), -2, 2)). \
    show()

In [None]:
from pyspark.sql.functions import date_format, lpad, to_timestamp
df.select("DepTime", date_format(to_timestamp(lpad('DepTime', 4, "0"), 'HHmm'), 'HH:mm')).show()

In [None]:
help(substring)

In [None]:
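from pyspark.sql.functions import expr
# Hours: all but the last two digits (expr, since substring needs integer pos/len)
df.select(expr("substring(CAST(DepTime AS STRING), 1, length(CAST(DepTime AS STRING)) - 2)")). \
    show()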

In [None]:
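from pyspark.sql.functions import lit, col, concat, lpad, substring

# Zero-pad CRSDepTime to 4 digits so that 700 becomes 0700
crsDepTime = lpad("CRSDepTime", 4, "0")

# Build FlightCRSDepTime as yyyy-MM-dd HH:mm and keep the DataFrame,
# rather than assigning the None returned by show()
flightsFiltered = airtraffic. \
    filter("IsDepDelayed = 'YES' AND IsArrDelayed = 'NO'"). \
    select(concat("Year", lit("-"),
                  lpad("Month", 2, "0"), lit("-"),
                  lpad("DayOfMonth", 2, "0"), lit(" "),
                  substring(crsDepTime, 1, 2), lit(":"),
                  substring(crsDepTime, 3, 2)
                 ).alias("FlightCRSDepTime"),
           "UniqueCarrier", "FlightNum", "Origin",
           "Dest", "DepDelay", "ArrDelay"
          ). \
    orderBy("FlightCRSDepTime", col("DepDelay") - col("ArrDelay"))

flightsFiltered.show()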

### Getting Count

In [None]:
# The projection does not change the number of rows,
# so the filter alone is enough to get the count
flightsFilteredCount = airtraffic. \
    filter("IsDepDelayed = 'YES' AND IsArrDelayed = 'NO'"). \
    count()

flightsFilteredCount