## Solutions - Problem 1

Get the number of flights that departed from each US airport in January 2008.

* We have to use the airport codes data to determine which airports are in the US.
* We need to use the airlines data to get the departure details.
* To solve this problem, we have to perform an inner join between the two data sets.

Let us start the Spark context for this notebook so that we can execute the code provided.

If you want to use the terminal for practice, here is the command to use.

```
spark2-shell \
  --master yarn \
  --name "Joining Data Sets" \
  --conf spark.ui.port=0
```

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Joining Data Sets").
    master("yarn").
    getOrCreate()

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "2")

In [None]:
import spark.implicits._

In [None]:
val airlinesPath = "/public/airlines_all/airlines-part/flightmonth=200801"

In [None]:
val airlines = spark.
    read.
    parquet(airlinesPath)

In [None]:
airlines.select("Year", "Month", "DayOfMonth", "Origin", "Dest", "CRSDepTime").show

In [None]:
airlines.count

In [None]:
val airportCodesPath = "/public/airlines_all/airport-codes"

In [None]:
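// Read the tab-separated airport codes file (with a header row) and keep only US airports.
// The filter also drops the row whose State is 'Hawaii' and whose IATA value is 'Big'.
def getValidAirportCodes(airportCodesPath: String) = {
    spark.
        read.
        option("sep", "\t").
        option("header", true).
        option("inferSchema", true).
        csv(airportCodesPath).
        filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country = 'USA'")
}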

In [None]:
val airportCodes = getValidAirportCodes(airportCodesPath)

In [None]:
airportCodes.count

In [None]:
import org.apache.spark.sql.functions.{col, lit, count}

In [None]:
airlines.
    join(airportCodes, airportCodes("IATA") === airlines("Origin")).
    select(col("Year"), col("Month"), col("DayOfMonth"), airportCodes("*"), col("CRSDepTime")).
    show

In [None]:
airlines.
    join(airportCodes, airportCodes("IATA") === airlines("Origin")).
    groupBy("Origin").
    agg(count(lit(1)).alias("FlightCount")).
    orderBy(col("FlightCount").desc).
    show
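
To see why an inner join is enough for this problem, the same join and aggregation can be tried on a tiny in-memory data set. This is only a sketch: the rows below are made up, and `demoSpark`, `toyFlights`, `toyAirports`, and `toyCounts` are hypothetical names that are not part of the course data. It assumes Spark is on the classpath and uses a `local[1]` master so that it does not depend on the cluster paths used above (if a session is already active, `getOrCreate` reuses it).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical session for the toy example only
val demoSpark = SparkSession.
    builder.
    master("local[1]").
    appName("Inner Join Sketch").
    getOrCreate()

import demoSpark.implicits._

// Made-up flights: one row per departure (Origin, CRSDepTime)
val toyFlights = Seq(
    ("ATL", 600), ("ATL", 730), ("ORD", 800), ("YYZ", 915)
).toDF("Origin", "CRSDepTime")

// Made-up airport codes, filtered down to US airports
val toyAirports = Seq(
    ("ATL", "USA"), ("ORD", "USA"), ("HNL", "USA"), ("YYZ", "Canada")
).toDF("IATA", "Country").
    filter("Country = 'USA'")

// The inner join keeps only flights whose Origin matches a US airport code,
// then the flights are counted per origin airport
val toyCounts = toyFlights.
    join(toyAirports, toyAirports("IATA") === toyFlights("Origin")).
    groupBy("Origin").
    agg(count(lit(1)).alias("FlightCount")).
    orderBy(col("FlightCount").desc)

toyCounts.show
```

In this toy data, the flight from `YYZ` is dropped because its airport was filtered out as non-US, and `HNL` does not appear in the result because it has no flights. Only airports present on both sides of the join survive, which is the behavior the problem asks for.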