## Solutions - Problem 6

Get the total number of flights per origin airport for airports that have no entry in airport-codes.

* This is an example of an outer join.
* We need the number of flights in the January 2008 airlines data whose origin airports have no entry in airport-codes.
* An outer join can be left or right, depending on which side of the join the airlines data set is placed. We use airlines as the driving data set on the left side, hence a left outer join.
* We perform the join first, keep only the rows with no match in airport-codes, and then aggregate.
* The result is the total number of flights for each such airport.
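The steps above can be sketched on plain Scala collections before touching Spark. This is only an illustration of the logic, not the notebook's actual solution: the data is made up, and the names `flightOrigins`, `knownCodes`, `joined` and `unmatchedCounts` are hypothetical.

```scala
// Toy data (hypothetical): origins of five flights and a lookup of known airport codes
val flightOrigins = Seq("ATL", "XXX", "ATL", "YYY", "XXX")
val knownCodes = Map("ATL" -> "Atlanta", "ORD" -> "Chicago")

// Left outer join: keep every flight, paired with Some(city) or None when there is no match
val joined = flightOrigins.map(origin => (origin, knownCodes.get(origin)))

// Keep only the unmatched flights, then count them per origin, highest count first
val unmatchedCounts = joined.
    collect { case (origin, None) => origin }.
    groupBy(identity).
    map { case (origin, group) => (origin, group.size) }.
    toSeq.
    sortBy { case (_, flightCount) => -flightCount }

unmatchedCounts.foreach(println)
// prints (XXX,2) then (YYY,1)
```

The `None` case plays the role of the `IATA IS NULL` filter used later with Data Frames.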

Let us start the Spark session for this notebook so that we can execute the code provided.

If you prefer to practice in a terminal, use the following command.

```
spark2-shell \
  --master yarn \
  --name "Joining Data Sets" \
  --conf spark.ui.port=0
```

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Joining Data Sets").
    master("yarn").
    getOrCreate()

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "2")

In [None]:
import spark.implicits._

In [None]:
val airlinesPath = "/public/airlines_all/airlines-part/flightmonth=200801"

In [None]:
val airlines = spark.
    read.
    parquet(airlinesPath)

In [None]:
airlines.select("Year", "Month", "DayOfMonth", "Origin", "Dest", "CRSDepTime").show

In [None]:
airlines.count

In [None]:
val airportCodesPath = "/public/airlines_all/airport-codes"

In [None]:
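// Read the tab-separated airport codes with a header row, keep only USA airports,
// and exclude the entry where State is 'Hawaii' and IATA is 'Big'
def getValidAirportCodes(airportCodesPath: String) = {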
    val airportCodes = spark.
        read.
        option("sep", "\t").
        option("header", true).
        option("inferSchema", true).
        csv(airportCodesPath).
        filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country = 'USA'")
    airportCodes
}

In [None]:
val airportCodes = getValidAirportCodes(airportCodesPath)

In [None]:
airportCodes.show

In [None]:
airportCodes.count

In [None]:
// Preview flights whose Origin has no matching IATA code in airportCodes
airlines.
    join(airportCodes, airlines("Origin") === airportCodes("IATA"), "left").
    filter("IATA IS NULL").
    select(airlines("Year"), airlines("Month"), airlines("DayOfMonth"), 
           airlines("Origin"), airlines("Dest"), airlines("CRSDepTime"), 
           airportCodes("*")
          ).
    show

In [None]:
import org.apache.spark.sql.functions.{lit, count}

In [None]:
// Count the unmatched flights per Origin, highest count first
airlines.
    join(airportCodes, airlines("Origin") === airportCodes("IATA"), "left").
    filter("IATA IS NULL").
    groupBy("Origin").
    agg(count(lit(1)).alias("FlightCount")).
    orderBy($"FlightCount".desc).
    show
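As a possible alternative, Spark's `left_anti` join type keeps only the left-side rows that have no match on the right, so the explicit `IATA IS NULL` filter is not needed. The sketch below is not part of the original solution; the helper name `flightsWithoutAirportCode` is hypothetical, and it assumes the input Data Frames have `Origin` and `IATA` columns as in this notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical helper: counts flights per Origin for origins missing from codes
def flightsWithoutAirportCode(flights: DataFrame, codes: DataFrame): DataFrame = {
    flights.
        // left_anti returns only rows of flights with no matching row in codes
        join(codes, flights("Origin") === codes("IATA"), "left_anti").
        groupBy("Origin").
        agg(count(lit(1)).alias("FlightCount")).
        orderBy(col("FlightCount").desc)
}
```

In this notebook it could be called as `flightsWithoutAirportCode(airlines, airportCodes).show`, which should list the same airports and counts as the left outer join version.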