# Question 03
Find the greatest number of countries a passenger has been in without being in the UK. For example, if the countries a passenger was in were: UK -> FR -> US -> CN -> UK -> DE -> UK, the correct answer would be 3 countries.

## Assumptions
1. A run lies between two UK stops: it starts after a UK and ends with a UK. Do not count a run that does not both start and end with UK.
2. Data is cleaned and not erroneous.
3. Timezone consideration is not required.
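
As a reference point for assumption 1, here is a minimal sketch in plain Scala (no Spark) of what "longest closed run" means for a single passenger. The function name `longestClosedRun` and the input shape (an ordered list of country codes) are assumptions made for illustration. It counts stops, not distinct countries, which matches the example in the question.

```scala
// Longest run of non-UK stops that is enclosed by UK on both sides.
// `stops` is the ordered list of countries a passenger was in (hypothetical input shape).
def longestClosedRun(stops: Seq[String]): Int = {
  // State: (has a UK been seen?, length of the current run, best closed run so far)
  val (_, _, result) = stops.foldLeft((false, 0, 0)) {
    case ((opened, current, bestSoFar), country) =>
      if (country.equalsIgnoreCase("UK")) {
        // Reaching UK closes the current run, but only if an earlier UK opened it.
        val closedBest = if (opened) math.max(bestSoFar, current) else bestSoFar
        (true, 0, closedBest)
      } else {
        // Count the stop only when the run was opened by a UK.
        (opened, if (opened) current + 1 else 0, bestSoFar)
      }
  }
  result
}

val exampleResult = longestClosedRun(Seq("UK", "FR", "US", "CN", "UK", "DE", "UK")) // 3 (FR, US, CN)
val openRunsOnly  = longestClosedRun(Seq("FR", "US", "UK", "DE"))                   // 0, no run is enclosed by UK
```

A trailing run such as `UK -> DE` with no return is dropped because the best value is only updated when a UK closes the run.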

## Approaches

1. SQL window function (row_number)  
    a. Add row numbers over each passenger partition, ordered by date.  
    b. Mark rows that depart from UK (+1), rows that return to UK (-1), and all others (0).  
    c. Remove rows that are neither a departure nor a return, i.e. those marked 0.  
    d. Get the row number difference between each (depart, return) pair of rows. <br/><br/>  

2. RDD groupByKey & map  
    a. Generate a string of flight-run for each passengerId e.g. "JP KR **UK** FR US CN **UK** DE **UK** CR TH".   
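    b. Extract each run enclosed by UK with a non-greedy regexp that uses a lookahead, e.g. '**UK** (.+?) (?=**UK**)'. A greedy '(.+)' would span from the first UK to the last one.  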
    c. Find the longest one.   

groupByKey does not preserve order, so a row number within each passengerId partition must be carried in the (k, v) pair, where v is (row_number, country).
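
Approach 2 can be sketched with plain Scala collections standing in for the RDD (`groupBy` in place of `groupByKey`). This is an illustration under assumptions, not the notebook's implementation: `ClosedRun`, `longestRunPerPassenger`, and the sample pairs are hypothetical. The pattern matches non-UK tokens one at a time and uses a lookahead for the closing UK, so that the same UK can open the next run.

```scala
import scala.util.matching.Regex

// One or more non-UK tokens enclosed by UK on both sides.
// The lookahead (?=UK\b) leaves the closing UK unconsumed, so it can also open the next run.
val ClosedRun: Regex = """\bUK ((?:(?!UK\b)\w+ )+)(?=UK\b)""".r

def longestRunPerPassenger(pairs: Seq[(Int, (Int, String))]): Map[Int, Int] =
  pairs
    .groupBy { case (passengerId, _) => passengerId }
    .map { case (passengerId, rows) =>
      // Restore the travel order from the row number, then build the run string.
      val run = rows
        .map { case (_, numbered) => numbered }
        .sortBy { case (rowNumber, _) => rowNumber }
        .map { case (_, country) => country.toUpperCase }
        .mkString(" ")
      // Number of stops in each closed run; 0 when there is no closed run.
      val best = ClosedRun
        .findAllMatchIn(run)
        .map(m => m.group(1).trim.split(" ").length)
        .foldLeft(0)((a, b) => math.max(a, b))
      passengerId -> best
    }

// (passengerId, (rowNumber, country)), deliberately out of order (sample data is hypothetical).
val pairs: Seq[(Int, (Int, String))] = Seq(
  (1, (3, "UK")), (1, (1, "JP")), (1, (2, "KR")), (1, (7, "UK")),
  (1, (4, "FR")), (1, (6, "CN")), (1, (5, "US")), (1, (9, "UK")),
  (1, (8, "DE")), (1, (11, "TH")), (1, (10, "CR")),
  (2, (2, "UK")), (2, (1, "FR")), (2, (3, "DE"))
)

val longest = longestRunPerPassenger(pairs) // Map(1 -> 3, 2 -> 0)
```

Passenger 1 reproduces the run string "JP KR UK FR US CN UK DE UK CR TH"; the trailing "CR TH" is ignored because no UK closes it.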

# Setup

In [1]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import java.time.temporal.ChronoUnit
import java.time.{Period, LocalDate, Instant}
import java.sql.Timestamp

### Spark partition control based on core availability

In [2]:
val NUM_CORES = 4
val NUM_PARTITIONS = 3

lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("flight")
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", NUM_CORES * NUM_PARTITIONS)
spark.conf.set("spark.default.parallelism", NUM_CORES * NUM_PARTITIONS)

import spark.implicits._

NUM_CORES = 4
NUM_PARTITIONS = 3
spark = <lazy>


<lazy>

## Constants

In [3]:
val FLIGHTDATA_CSV_PATH = "../resources/flightData.csv"
val PASSENGER_CSV_PATH = "../resources/passengers.csv"
val RESULT_DIR = "results/longestRun"

FLIGHTDATA_CSV_PATH = ../resources/flightData.csv
PASSENGER_CSV_PATH = ../resources/passengers.csv
RESULT_DIR = results/longestRun


results/longestRun

# Tools

### Elapsed time profiler

In [4]:
val timing = new StringBuffer
def timed[T](label: String, code: => T): T = {
    val start = System.currentTimeMillis()
    val result = code
    val stop = System.currentTimeMillis()
    timing.append(s"Processing $label took ${stop - start} ms.\n")
    result
}

timing = 


timed: [T](label: String, code: => T)T


In [5]:
// To flush out error: missing argument list for method timed
println("")

<console>:46: error: missing argument list for method timed
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `timed _` or `timed(_,_)` instead of `timed`.
       timed
       ^
lastException: Throwable = null


### UDF
Get the number of months between the base date (2017-01-01) and a given timestamp.

In [6]:
//val BASE_TIMESTAMP = java.sql.Timestamp.valueOf("2017-01-01 00:00:00.0")
val BASE_LOCALDATE = LocalDate.parse("2017-01-01").withDayOfMonth(1)

def get_months_between(to: Timestamp): Short = {
    val monthsBetween = ChronoUnit.MONTHS.between(
        BASE_LOCALDATE,
        to.toLocalDateTime().toLocalDate().withDayOfMonth(1)
    )
    monthsBetween.toShort
}
val udf_months_between = udf((t:Timestamp) => get_months_between(t))

BASE_LOCALDATE = 2017-01-01
udf_months_between = UserDefinedFunction(<function1>,ShortType,Some(List(TimestampType)))


get_months_between: (to: java.sql.Timestamp)Short


UserDefinedFunction(<function1>,ShortType,Some(List(TimestampType)))
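
To make the month arithmetic concrete, the following sketch applies the same `ChronoUnit.MONTHS.between` call to a few plain values, without Spark. The helper name `monthsSinceBase` and the sample timestamps are assumptions for illustration; it returns `Long` rather than `Short` to keep the example short.

```scala
import java.sql.Timestamp
import java.time.LocalDate
import java.time.temporal.ChronoUnit

val base = LocalDate.parse("2017-01-01")

// Same arithmetic as get_months_between: truncate the timestamp to the first
// day of its month, then count whole months from the base date.
def monthsSinceBase(ts: Timestamp): Long =
  ChronoUnit.MONTHS.between(base, ts.toLocalDateTime.toLocalDate.withDayOfMonth(1))

val sameMonth  = monthsSinceBase(Timestamp.valueOf("2017-01-31 23:59:59")) // 0
val march      = monthsSinceBase(Timestamp.valueOf("2017-03-15 10:30:00")) // 2
val nextYear   = monthsSinceBase(Timestamp.valueOf("2018-01-20 00:00:00")) // 12
val beforeBase = monthsSinceBase(Timestamp.valueOf("2016-12-31 00:00:00")) // -1
```

Dates before the base month yield negative values, which is worth keeping in mind if the result is used as an index.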

# Main

## Base DataFrame
* Mark a departure from UK as +1, a return to UK as -1, and anything else as 0.  
* Sort by passenger and date.
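
The marking rule can be checked on plain values before it is applied to the DataFrame. The sketch below mirrors the `when`/`otherwise` chain used in the next cell; the function name `direction` and the sample values are chosen for illustration.

```scala
// Mirrors the when/otherwise chain: the "from" check comes first,
// so a UK -> UK row is classified as a departure (+1).
def direction(from: String, to: String): Int =
  if (from.toLowerCase == "uk") 1
  else if (to.toLowerCase == "uk") -1
  else 0

val depart   = direction("uk", "se") // 1
val back     = direction("se", "UK") // -1
val neither  = direction("tj", "fr") // 0
val domestic = direction("uk", "uk") // 1
```

The last case shows that a UK-to-UK flight counts as a departure under this rule, which matters for the error cases listed under Validations.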

In [7]:
// Transformations, no action yet
val flightData = spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd")
    .option("inferSchema", "true")
    .load(FLIGHTDATA_CSV_PATH)
    .withColumn(
        "direction", 
        when(lower(col("from")) === "uk", 1)
        .when(lower(col("to"))   === "uk", -1)
        .otherwise(0)
    )
    .withColumn(
        "count", lit(1)
    )
    .orderBy(asc("passengerId"), asc("date"))

flightData.printSchema()
flightData.createOrReplaceTempView("flightData")

root
 |-- passengerId: integer (nullable = true)
 |-- flightId: integer (nullable = true)
 |-- from: string (nullable = true)
 |-- to: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- direction: integer (nullable = false)
 |-- count: integer (nullable = false)



flightData = [passengerId: int, flightId: int ... 5 more fields]


[passengerId: int, flightId: int ... 5 more fields]

## Row number
Add a row number within each passengerId partition.

In [8]:
val querySequencedRun = """
SELECT 
    f.*,
    ROW_NUMBER() OVER (PARTITION BY passengerId ORDER BY passengerId, date) as seq 
FROM
    flightData f
ORDER BY 
    passengerId, date
"""

val sequencedRun = spark.sql(querySequencedRun)
sequencedRun.createOrReplaceTempView("sequencedRun")

querySequencedRun = 
sequencedRun = [passengerId: int, flightId: int ... 6 more fields]


"
SELECT
    f.*,
    ROW_NUMBER() OVER (PARTITION BY passengerId ORDER BY passengerId, date) as seq
FROM
    flightData f
ORDER BY
    passengerId, date
"


[passengerId: int, flightId: int ... 6 more fields]
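
For readers less familiar with window functions, here is a rough plain-collections analogue of `ROW_NUMBER() OVER (PARTITION BY passengerId ORDER BY date)`. The `Flight` case class and `withSeq` are hypothetical names used only in this sketch; the three sample rows are taken from passenger 53 in the flight data.

```scala
import java.time.LocalDate

// Hypothetical, trimmed-down row type for illustration.
final case class Flight(passengerId: Int, from: String, to: String, date: LocalDate)

def withSeq(flights: Seq[Flight]): Seq[(Flight, Int)] =
  flights
    .groupBy(_.passengerId)                          // PARTITION BY passengerId
    .toSeq
    .sortBy { case (passengerId, _) => passengerId }
    .flatMap { case (_, rows) =>
      rows
        .sortBy(_.date.toEpochDay)                   // ORDER BY date
        .zipWithIndex
        .map { case (flight, i) => (flight, i + 1) } // row numbers start at 1
    }

val flights = Seq(
  Flight(53, "uk", "se", LocalDate.parse("2017-04-10")),
  Flight(53, "ch", "uk", LocalDate.parse("2017-04-07")),
  Flight(53, "se", "uk", LocalDate.parse("2017-04-23"))
)

// (from, to, seq) in travel order: (ch, uk, 1), (uk, se, 2), (se, uk, 3)
val seqs = withSeq(flights).map { case (f, seq) => (f.from, f.to, seq) }
```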

## Longest run per passenger

In [9]:
val queryLongestRun = """
WITH 
    closedRun AS (
        SELECT 
            passengerId, 
            from, to, 
            direction, 
            seq,
            -------------------------------------------------------------------------------- 
            -- For a departure flight, take the seq num of the return flight, if there is one
            -------------------------------------------------------------------------------- 
            CASE 
                WHEN direction == 1
                THEN lead(seq) OVER (PARTITION BY passengerId ORDER BY seq)
            END AS return,
            -------------------------------------------------------------------------------- 
            -- For a departure flight, count the visiting countries, if returned.
            -------------------------------------------------------------------------------- 
            CASE 
                WHEN direction == 1
                THEN lead(seq) OVER (PARTITION BY passengerId ORDER BY seq) - seq
            END AS countries
        FROM sequencedRun s
        WHERE 
            -------------------------------------------------------------------------------- 
            -- Remove those without UK
            -------------------------------------------------------------------------------- 
            direction != 0
            -------------------------------------------------------------------------------- 
            -- Select passengers having both depart (+1) and return (-1), i.e. whose
            -- distinct direction count is 2.
            -------------------------------------------------------------------------------- 
            AND EXISTS (  
                SELECT passengerId
                FROM
                    sequencedRun
                WHERE 
                    direction != 0 AND
                    passengerId == s.passengerId
                GROUP BY
                    passengerId
                Having count(DISTINCT direction) == 2
            )
        ORDER BY 
            passengerId, seq
    )
    

SELECT 
    passengerId as `Passenger ID`,
    max(countries) as `Longest Run`
FROM closedRun
WHERE 
    countries IS NOT NULL
GROUP BY 
    passengerId
ORDER BY 
    max(countries) DESC
"""

queryLongestRun = 


"
WITH
    closedRun AS (
        SELECT
            passengerId,
            from, to,
            direction,
            seq,
            --------------------------------------------------------------------------------
            -- For a departure flight, take the seq num of the return flight, if there is one
            --------------------------------------------------------------------------------
            CASE
                WHEN direction == 1
                THEN lead(seq) OVER (PARTITION BY passengerId ORDER BY seq)
            END AS return,
            --------------------------------------------------------------------------------
            -- For a departure flight, count the visiting countries, if returned.
            --------------------------...


In [10]:
val longestRun = spark.sql(queryLongestRun)

timed(
    "Run longest closed run.",
    longestRun.show(5)
)
println(timing)
println(sequencedRun.rdd.toDebugString)
println(longestRun.rdd.toDebugString)

longestRun
    .coalesce(1)
    .write
    .format("csv")
    .mode(SaveMode.Overwrite)
    .option("header", "true")
    .save(RESULT_DIR)

+------------+-----------+
|Passenger ID|Longest Run|
+------------+-----------+
|        2975|         16|
|        2939|         15|
|        8562|         15|
|         760|         15|
|        3573|         15|
+------------+-----------+
only showing top 5 rows

Processing Run longest closed run. took 6119 ms.

(12) MapPartitionsRDD[78] at rdd at <console>:59 []
 |   MapPartitionsRDD[77] at rdd at <console>:59 []
 |   MapPartitionsRDD[76] at rdd at <console>:59 []
 |   ShuffledRowRDD[75] at rdd at <console>:59 []
 +-(12) MapPartitionsRDD[74] at rdd at <console>:59 []
    |   MapPartitionsRDD[70] at rdd at <console>:59 []
    |   MapPartitionsRDD[69] at rdd at <console>:59 []
    |   ShuffledRowRDD[68] at rdd at <console>:59 []
    +-(12) MapPartitionsRDD[67] at rdd at <console>:59 []
       |   MapPartitionsRDD[66] at rdd at <console>:59 []
       |   ShuffledRowRDD[65] at rdd at <console>:59 []
       +-(1) MapPartitionsRDD[64] at rdd at <console>:59 []
          |  MapPartitions

longestRun = [Passenger ID: int, Longest Run: int]


[Passenger ID: int, Longest Run: int]
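
The core of the query (filter out rows marked 0, pair each departure with the next UK-touching row via `lead(seq)`, take the largest difference) can be traced on a small input. The sketch below is a plain Scala approximation for a single passenger, not the SQL itself; `longestRun` and the `(seq, direction)` tuple shape are assumptions. The sample rows encode the flights UK->FR, FR->US, US->CN, CN->UK, UK->DE, DE->UK from the question.

```scala
// (seq, direction) rows for one passenger.
val rows: Seq[(Int, Int)] = Seq((1, 1), (2, 0), (3, 0), (4, -1), (5, 1), (6, -1))

def longestRun(rows: Seq[(Int, Int)]): Option[Int] = {
  // WHERE direction != 0
  val ukRows = rows.filter { case (_, direction) => direction != 0 }.sortBy(_._1)
  // EXISTS (... HAVING count(DISTINCT direction) == 2)
  val hasBoth = ukRows.map(_._2).distinct.size == 2
  // lead(seq) - seq, evaluated only for departure rows that have a following row
  val countries = ukRows.zip(ukRows.drop(1)).collect {
    case ((seq, 1), (nextSeq, _)) => nextSeq - seq
  }
  // max(countries) ... WHERE countries IS NOT NULL
  if (!hasBoth || countries.isEmpty) None else Some(countries.max)
}

val result = longestRun(rows) // Some(3)
```

The departure at seq 1 pairs with the return at seq 4 (difference 3), and the departure at seq 5 pairs with the return at seq 6 (difference 1), so the maximum is 3.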

# Validations
Simple tests.

## TBD 
Proper tests.

### Test cases
#### Normal cases
1. Passenger without UK -> .. -> UK  
    a. Run length is 1  
    b. Run length > 1    
<br/>

2. Passenger with UK  
    a. Run with depart-from UK only  
    b. Run with return-to UK only  
    c. Run with only one (return, depart)  
    d. Run with only one (depart, return)  
    e. Run with return, (depart, return)+  
    f. Run with (depart, return)+, depart  
    g. Run with (depart, return)+  

#### Error cases
TBD e.g. (return-to UK, return-to UK) without depart in-between.

```
$ awk -F, '/^53,/{print $3, $4, $5}' ../../../main/resources/flightData.csv | sort -k3
cg ir 2017-01-01
ir sg 2017-01-10
sg nl 2017-01-29
nl at 2017-02-13
at ch 2017-03-27
ch uk 2017-04-07 < Return
uk se 2017-04-10 > Depart
se uk 2017-04-23 < Return
uk tj 2017-05-26 > Depart
tj fr 2017-05-29
fr pk 2017-06-03
pk th 2017-06-04
th uk 2017-06-14 <---- Return

$ awk -F, '/^227,/{print $3, $4, $5}' ../../../main/resources/flightData.csv | sort -k3
ca cn 2017-01-01
cn at 2017-01-13
at pk 2017-01-17 
pk iq 2017-03-29
iq uk 2017-04-11 < (no matching depart)
uk uk 2017-05-11 > Depart
uk ca 2017-07-24 < Return
ca cn 2017-08-06
cn bm 2017-08-16
bm iq 2017-10-04
```
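
One possible check for the error cases, sketched here as an assumption since the tests are still TBD: flag any two consecutive UK-touching rows that share a direction, such as two departures with no return in between. The name `unmatchedPairs` is hypothetical, and the `(seq, direction)` pairs are derived by hand from the two listings above using the +1/-1 marking rule, with seq as the row position in date order.

```scala
// Returns (seq, nextSeq) for consecutive UK-touching rows with the same direction.
def unmatchedPairs(rows: Seq[(Int, Int)]): Seq[(Int, Int)] = {
  val ukRows = rows.filter { case (_, direction) => direction != 0 }.sortBy(_._1)
  ukRows.zip(ukRows.drop(1)).collect {
    case ((seq, d1), (nextSeq, d2)) if d1 == d2 => (seq, nextSeq)
  }
}

// Passenger 53: return, depart, return, depart, return (alternating, nothing flagged).
val passenger53 = Seq((6, -1), (7, 1), (8, -1), (9, 1), (13, -1))
// Passenger 227: "iq uk" is a return, then "uk uk" and "uk ca" are both marked +1.
val passenger227 = Seq((5, -1), (6, 1), (7, 1))

val flagged53  = unmatchedPairs(passenger53)  // empty
val flagged227 = unmatchedPairs(passenger227) // one pair: (6, 7)
```

Under this check the UK-to-UK flight of passenger 227 surfaces as a departure followed by another departure, which the main query would otherwise count as a run of length 1.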