## Overview of Filtering
Let us understand a few important details related to filtering before we get into the solutions.
* Filtering can be done by using either `filter` or `where`. `where` is an alias for `filter`.
* When it comes to the condition, we can pass it either in **SQL Style** or in **Data Frame Style**.
* Example for SQL Style - `airlines.filter("IsArrDelayed = 'YES'").show()` or `airlines.where("IsArrDelayed = 'YES'").show()`
* Example for Data Frame Style - `airlines.filter(airlines("IsArrDelayed") === "YES").show()` or `airlines.filter(col("IsArrDelayed") === "YES").show()`, with `col` imported from `org.apache.spark.sql.functions`. We can also use `where` instead of `filter`.
* Other operators we can use in filter conditions include `!=`, `>`, `<`, `>=`, `<=`, `LIKE`, and `BETWEEN` with `AND`. In Data Frame Style, the Scala equivalents of `=` and `!=` are `===` and `=!=`.
* If the condition involves multiple columns, we need to combine the individual conditions with boolean operators such as `AND` and `OR`.
* If we have to compare a column value against a list of values, we can use the `IN` operator.
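
The two styles can be compared side by side on a small in-memory Data Frame. The following is a minimal sketch and not part of the original tasks: the `sample` Data Frame and its three rows are made up for illustration, and only the column names `Origin`, `IsArrDelayed` and `ArrDelay` are borrowed from the airlines data. It assumes a Spark session can be created or reused through `SparkSession.builder`; the name `session` is chosen here to avoid clashing with the notebook's `spark`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuse the active session if there is one; otherwise start a local one.
val session = SparkSession.builder.
    master("local[*]").
    appName("filter-sketch").
    getOrCreate()

// Hypothetical sample rows; only the column names come from the airlines data.
val sample = session.createDataFrame(Seq(
    ("ORD", "YES", 25),
    ("SFO", "NO", -3),
    ("JFK", "YES", 70)
)).toDF("Origin", "IsArrDelayed", "ArrDelay")

// SQL Style: the condition is a string; string literals use single quotes.
val sqlStyle = sample.
    filter("IsArrDelayed = 'YES' AND Origin IN ('ORD', 'SFO')")

// Data Frame Style: the condition is a Column expression.
// Equality is ===, and conditions combine with && / || (or `and` / `or`).
val apiStyle = sample.
    where(col("IsArrDelayed") === "YES" && col("Origin").isin("ORD", "SFO"))

// BETWEEN is inclusive on both ends and is available in both styles.
val betweenSql = sample.filter("ArrDelay BETWEEN 20 AND 60")
val betweenApi = sample.filter(col("ArrDelay").between(20, 60))

sqlStyle.show()
apiStyle.show()
```

Both `sqlStyle` and `apiStyle` should keep only the ORD row, which suggests that picking a style is mostly a matter of readability.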
    

### Tasks

Let us perform some tasks to understand filtering in detail. Solve each problem by passing the conditions in both SQL Style and API Style.

* Read the data for January 2008.

In [None]:
val airlines_path = "/public/airlines_all/airlines-part/flightmonth=200801"

In [None]:
val airlines = spark.
    read.
    parquet(airlines_path)

In [None]:
airlines.printSchema

* Get the count of flights that departed late from the origin and reached the destination early or on time.


In [None]:
airlines.
    filter("IsDepDelayed = 'YES' AND IsArrDelayed = 'NO'").
    count

* API Style

In [None]:
import org.apache.spark.sql.functions.col

In [None]:
airlines.
    filter(col("IsDepDelayed") === "YES" and
           col("IsArrDelayed") === "NO"
          ).
    count

In [None]:
airlines.
    filter(airlines("IsDepDelayed") === "YES" and
           airlines("IsArrDelayed") === "NO"
          ).
    count

* Get the count of flights that departed late from the origin by more than 60 minutes.


In [None]:
airlines.
    filter("DepDelay > 60").
    count


* API Style

In [None]:
import org.apache.spark.sql.functions.col

In [None]:
airlines.
    filter(col("DepDelay") > 60).
    count

* Get the count of flights that departed early or on time but arrived late by at least 15 minutes.


In [None]:
airlines.
    filter("IsDepDelayed = 'NO' AND ArrDelay >= 15").
    count

* API Style

In [None]:
import org.apache.spark.sql.functions.col

In [None]:
airlines.
    filter(col("IsDepDelayed") === "NO" and col("ArrDelay") >= 15).
    count

* Get the count of flights that departed from the following major airports - ORD, DFW, ATL, LAX, SFO.

In [None]:
airlines.count

In [None]:
airlines.
    filter("Origin IN ('ORD', 'DFW', 'ATL', 'LAX', 'SFO')").
    count

* API Style

In [None]:
import org.apache.spark.sql.functions.col

In [None]:
airlines.
    filter(col("Origin").isin("ORD", "DFW", "ATL", "LAX", "SFO")).
    count

* Add a column FlightDate by using Year, Month and DayOfMonth. The format should be **yyyyMMdd**.


In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad}

In [None]:
airlines.
    withColumn("FlightDate",
                concat(col("Year"),
                       lpad(col("Month"), 2, "0"),
                       lpad(col("DayOfMonth"), 2, "0")
                      )
              ).
    show

* Get the count of flights that departed late between January 1st and January 9th, 2008, using FlightDate.


In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad}

In [None]:
airlines.
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ).
    filter("IsDepDelayed = 'YES' AND FlightDate LIKE '2008010%'").
    count

In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad}

In [None]:
airlines.
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ).
    filter("""
           IsDepDelayed = 'YES' AND
           FlightDate BETWEEN '20080101' AND '20080109'
          """).
    count

* API Style

In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad}

In [None]:
airlines.
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ).
    filter(col("IsDepDelayed") === "YES" and 
           (col("FlightDate") like ("2008010%"))
          ).
    count

In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad}

airlines.
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ).
    filter(col("IsDepDelayed") === "YES" and 
           (col("FlightDate") between ("20080101", "20080109"))
          ).
    count

* Get the number of flights that departed late on Sundays.

In [None]:
// toDF on a local collection needs the session's implicit conversions
import spark.implicits._

val l = List("X")
val df = l.toDF("dummy")

In [None]:
import org.apache.spark.sql.functions.current_date

In [None]:
df.select(current_date).show

In [None]:
import org.apache.spark.sql.functions.date_format

In [None]:
df.select(current_date, date_format(current_date, "EE")).show
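
Before applying this to the airlines data, it may help to see how a `yyyyMMdd` string is turned into a weekday name. The following is a minimal sketch under stated assumptions: the `dates` Data Frame and its two values are made up for illustration (January 6th, 2008 was a Sunday), the `ShortDay` and `FullDay` column names are hypothetical, and the weekday names are expected in English, which depends on the Spark version and locale in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, to_date}

// Reuse the active session if there is one; otherwise start a local one.
val session = SparkSession.builder.
    master("local[*]").
    appName("weekday-sketch").
    getOrCreate()

// Hypothetical sample dates in yyyyMMdd form.
val dates = session.createDataFrame(Seq(
    Tuple1("20080106"),
    Tuple1("20080107")
)).toDF("FlightDate")

// to_date parses the string with the given pattern;
// date_format then renders the weekday: "EE" abbreviated, "EEEE" in full.
val withDay = dates.
    withColumn("ShortDay",
               date_format(to_date(col("FlightDate"), "yyyyMMdd"), "EE")).
    withColumn("FullDay",
               date_format(to_date(col("FlightDate"), "yyyyMMdd"), "EEEE"))

withDay.show()
```

Filtering on the full name, as in `FullDay = 'Sunday'`, tends to read more clearly than matching the abbreviated form.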

In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad}

airlines.
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ).
    filter("""
           IsDepDelayed = 'YES' AND
           date_format(to_date(FlightDate, 'yyyyMMdd'), 'EEEE') = 'Sunday'
           """).
    count

* API Style

In [None]:
import spark.implicits._

In [None]:
import org.apache.spark.sql.functions.{col, concat, lpad, date_format, to_date}

airlines.
    withColumn("FlightDate",
               concat(col("Year"),
                      lpad(col("Month"), 2, "0"),
                      lpad(col("DayOfMonth"), 2, "0")
                     )
              ).
    filter(col("IsDepDelayed") === "YES" and
           date_format(
               to_date($"FlightDate", "yyyyMMdd"), "EEEE"
           ) === "Sunday"
          ).
    count