## Analyze Datasets for Joins

Let us analyze the data sets that will be used for joins.
* We will use the January 2008 airlines data, which has all the relevant flight details.
* Let us read and review the airlines data quickly.

Let us start the Spark session for this notebook so that we can execute the code provided.

If you want to practice using the terminal instead, here is the command to launch the Spark shell.

```
spark2-shell \
  --master yarn \
  --name "Joining Data Sets" \
  --conf spark.ui.port=0
```

In [None]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    appName("Joining Data Sets").
    master("yarn").
    getOrCreate()

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "2")

In [None]:
import spark.implicits._

In [None]:
val airlines = spark.
    read.
    parquet("/public/airlines_all/airlines-part/flightmonth=200801")

In [None]:
airlines.printSchema

In [None]:
airlines.show

* We will use another data set to get details about airports, such as the state and city for a given airport code.
* Let us analyze the data set to confirm whether it has a header and how the data is structured.

In [None]:
val airportCodesPath = "/public/airlines_all/airport-codes"

In [None]:
spark.
    read.
    text(airportCodesPath).
    show(false)

* The data is tab separated.
* The data set has a header.
* The data set has 4 fields - **Country, State, City, IATA**

Create the DataFrame `airportCodes` by reading the file with the tab separator and the header, letting Spark infer the schema.


In [None]:
val airportCodesPath = "/public/airlines_all/airport-codes"

In [None]:
val airportCodes = spark.
    read.
    option("sep", "\t").
    option("header", true).
    option("inferSchema", true).
    csv(airportCodesPath)
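
Since the fields listed above are all text, an explicit schema is a possible alternative to `inferSchema`, which needs an extra pass over the data. The following is only a minimal sketch: `airportCodesSchema` and `readAirportCodes` are hypothetical names that are not part of the original notebook, and the field order is assumed to match the list above, so confirm it against the header row in the text preview.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Field order is an assumption taken from the field list above;
// confirm it against the header row shown by the text preview.
val airportCodesSchema = StructType(Seq(
    StructField("Country", StringType),
    StructField("State", StringType),
    StructField("City", StringType),
    StructField("IATA", StringType)
))

// Hypothetical helper (not part of the original notebook):
// reads the tab separated file using the explicit schema.
def readAirportCodes(spark: SparkSession, path: String): DataFrame =
    spark.
        read.
        option("sep", "\t").
        option("header", true).
        // Ask Spark to compare the header names with the schema instead of
        // applying the schema purely by position. This option exists in
        // Spark 2.4 and later; verify the behavior on your version.
        option("enforceSchema", false).
        schema(airportCodesSchema).
        csv(path)
```

It could be used as `val airportCodesTyped = readAirportCodes(spark, airportCodesPath)`, followed by `airportCodesTyped.printSchema` to confirm that all four columns are strings.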

* Preview and understand the data.

In [None]:
airportCodes.show

* Get the schema of **airportCodes**.

In [None]:
airportCodes.printSchema

* Get the count of records.

In [None]:
airportCodes.count

* Get the count of unique IATA codes and check whether it matches the total count.

In [None]:
airportCodes.
    select("IATA").
    distinct.
    count

* If the two counts differ, analyze the data and identify the IATA codes that occur more than once.

In [None]:
import org.apache.spark.sql.functions.{lit, count}

In [None]:
val duplicateIATACount = airportCodes.
    groupBy("IATA").
    agg(count(lit(1)).alias("iata_count")).
    filter("iata_count > 1")

In [None]:
duplicateIATACount.show

* Review the rows for each repeated code, keep the most appropriate one, and filter out the others.

In [None]:
airportCodes.
    filter("IATA = 'Big'").
    show

In [None]:
airportCodes.
    filter("!(State = 'Hawaii' AND IATA = 'Big')").
    show

In [None]:
airportCodes.
    filter("!(State = 'Hawaii' AND IATA = 'Big')").
    count
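
After filtering, it may be worth confirming that IATA is now unique, because a repeated key on the lookup side of a join produces one output row per match and so duplicates the matching flight records. The helper below is a minimal sketch of that check; `duplicateKeys` and the `key_count` alias are hypothetical names, not part of the original notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical helper (not part of the original notebook):
// returns the values of keyColumn that occur more than once in df.
def duplicateKeys(df: DataFrame, keyColumn: String): DataFrame =
    df.
        groupBy(keyColumn).
        agg(count(lit(1)).alias("key_count")).
        filter(col("key_count") > 1)
```

For example, `duplicateKeys(airportCodes.filter("!(State = 'Hawaii' AND IATA = 'Big')"), "IATA").count` should return 0 if the filtered row was the only source of repetition.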

* Get the number of airports (IATA codes) for each state in the US. Sort the data in descending order by count.

In [None]:
val airportCodesPath = "/public/airlines_all/airport-codes"

In [None]:
val airportCodes = spark.
    read.
    option("sep", "\t").
    option("header", true).
    option("inferSchema", true).
    csv(airportCodesPath).
    filter("!(State = 'Hawaii' AND IATA = 'Big') AND Country = 'USA'")

In [None]:
airportCodes.count

In [None]:
import org.apache.spark.sql.functions.{count, col, lit}

In [None]:
val airportCountByState = airportCodes.
    groupBy("Country", "State").
    agg(count(lit(1)).alias("IATACount")).
    orderBy(col("IATACount").desc)

In [None]:
airportCountByState.show(51)