## HADOOP PROJECT - MS-SIO-2019 - SNCF - TRANSILIEN API - PART II

#### SPARK STRUCTURED STREAMING (KAFKA CONSUMER)

P. Hamy,  N. Leclercq, L. Poncet - MS-SIO-2019

In [None]:
import os
import json
import time
import logging
from pyspark.sql import SparkSession
import pyspark.sql.types as st
import pyspark.sql.functions as sf
from pyspark.sql.window import Window as spark_window

In [None]:
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.ERROR, datefmt='%H:%M:%S')

Change the logging level to suppress the console noise generated by a recurring [_warning_](https://stackoverflow.com/questions/39351690/got-interruptedexception-while-executing-word-count-mapreduce-job).

In [None]:
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

#### Creating the Spark session associated with the Kafka stream

In [None]:
kafka_session = SparkSession.builder.appName("MS-SIO-HADOOP-PROJECT-STREAM-PART-II").getOrCreate()

Limit the number of shuffle partitions, and hence the number of tasks launched by Spark (a configuration tip found online for the most modest hardware setups; the default value is 200).

In [None]:
kafka_session.conf.set('spark.sql.shuffle.partitions', 4)

#### Creating the Kafka stream
We use a [Spark structured stream](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) backed by a Kafka source.

The source is specified through the address of the Kafka server and the name of the topic we want to subscribe to.

In [None]:
kafka_stream = kafka_session \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "sandbox-hdp.hortonworks.com:6667") \
                .option("subscribe", "transilien-02") \
                .option("startingOffsets", "earliest") \
                .option("kafkaConsumer.pollTimeoutMs", 512) \
                .load()

Data associated with the stations

In [None]:
stations_data = kafka_session \
                .read \
                .format("csv") \
                .option("sep", ",") \
                .option("inferSchema", "true") \
                .option("header", "true") \
                .load("file:/root/ms-sio-hdp/api-transilien/transilien_line_l_stations_by_code.csv")

In [None]:
stations_data.show()

#### Message deserialization schema
The messages injected into the Kafka stream are serialized and binary-encoded in the _value_ field of the dataframe (the generic format of dataframes produced by a Kafka stream).
```
kafka_stream.printSchema()
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
```
It is therefore necessary to specify the deserialization schema that will be passed to the **from_json** function.

In [None]:
json_schema = st.StructType(
    [
        st.StructField("station", st.IntegerType(), True),
        st.StructField("train", st.StringType(), True),
        st.StructField("timestamp", st.TimestampType(), True),
        st.StructField("mode", st.StringType(), True),
        st.StructField("mission", st.StringType(), True),
        st.StructField("terminus", st.IntegerType(), True)
    ]
)

Through the **json_options** variable, we also specify the format of the _timestamp_ field so that time values are interpreted correctly.

In [None]:
# 'SSS' denotes milliseconds; lowercase 'sss' would be read as seconds a second time
json_options = {"timestampFormat": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"}

Deserialization/reformatting of the messages.

In [None]:
df = kafka_stream \
    .select(sf.from_json(sf.col("value").cast("string"), json_schema, json_options).alias("departure")) \
    .select("departure.*")
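
To make the deserialization step concrete, here is a minimal plain-Python sketch of what happens to a single message: the binary _value_ is decoded, parsed as JSON, and its timestamp string is converted to a date/time value. The sample payload (train number, mission code, terminus code) is hypothetical and only mirrors the field names of `json_schema`; `parse_departure` is an illustrative helper, not part of the notebook. Note that Python's `strptime` pattern syntax differs from the Java-style pattern used by Spark.

```python
import json
from datetime import datetime, timezone

# Hypothetical message payload, shaped like the fields declared in json_schema.
raw_value = (
    b'{"station": 87381129, "train": "134567", '
    b'"timestamp": "2019-03-01T08:15:00.000Z", '
    b'"mode": "R", "mission": "PEBU", "terminus": 87382002}'
)


def parse_departure(value: bytes) -> dict:
    """Decode one Kafka 'value' and convert its timestamp to a datetime (UTC)."""
    record = json.loads(value.decode("utf-8"))
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return record


departure = parse_departure(raw_value)
print(departure["train"], departure["timestamp"].isoformat())
```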

After this operation, the dataframe has the following schema:
```
df.printSchema()
root
 |-- station: integer (nullable = true)
 |-- train: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- mode: string (nullable = true)
 |-- mission: string (nullable = true)
 |-- terminus: integer (nullable = true)
```

A train keeps appearing in the responses to SNCF API requests as long as its departure time is not in the past. We therefore drop the duplicates associated with each (train, departure time) pair. There is no need to add the station to the deduplication key because a train identifier is unique.

In [None]:
df = df.dropDuplicates(["train", "timestamp"])
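
The deduplication rule can be sketched in plain Python as follows. This is only an illustration of the key-based logic: the rows are hypothetical, `drop_duplicates` is an illustrative helper, and unlike this sketch Spark does not guarantee which of the duplicate rows is retained.

```python
def drop_duplicates(rows, keys=("train", "timestamp")):
    """Keep the first row seen for each combination of the key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the same departure returned by two successive API polls.
rows = [
    {"train": "134567", "timestamp": "2019-03-01T08:15:00", "station": 87381129},
    {"train": "134567", "timestamp": "2019-03-01T08:15:00", "station": 87381129},
    {"train": "134567", "timestamp": "2019-03-01T08:18:00", "station": 87381137},
]
unique_rows = drop_duplicates(rows)
print(len(unique_rows))
```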

The contiguous stations A and B (see the Part II assignment)

In [None]:
contiguous_stations = {
    'sa':87381129, # Station A: CLICHY LEVALLOIS
    'sb':87381137  # Station B: ASNIERES SUR SEINE
}

We filter on the 'mode' of each train's departure time: only departures whose announced mode is "R" (real time) are kept. The cell below applies the mode filter only; the stations listed in _contiguous_stations_ are not used to restrict the stream at this point.

In [None]:
df = df.filter("mode='R'")
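
For reference, a combined predicate on both the station and the mode could look like the plain-Python sketch below. It is an assumption about how such a filter might be expressed, not something the notebook applies; `keep_departure` and the sample rows (including the third station code) are hypothetical.

```python
contiguous_stations = {
    "sa": 87381129,  # Station A: CLICHY LEVALLOIS
    "sb": 87381137,  # Station B: ASNIERES SUR SEINE
}


def keep_departure(row, stations=frozenset(contiguous_stations.values())):
    """True when the departure is announced in real time ('R') from station A or B."""
    return row["mode"] == "R" and row["station"] in stations


# Hypothetical rows: one per filtering outcome.
rows = [
    {"train": "134567", "station": 87381129, "mode": "R"},  # kept
    {"train": "134568", "station": 87381137, "mode": "T"},  # wrong mode
    {"train": "134569", "station": 87381111, "mode": "R"},  # other station
]
kept = [row for row in rows if keep_departure(row)]
print([row["train"] for row in kept])
```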

Convert the departure time to a Unix timestamp (easier to manipulate), stored in a new "departure" column. The original "timestamp" column is kept, because the aggregation and ordering steps further down still reference it.

In [None]:
df = df.withColumn("departure", sf.unix_timestamp("timestamp"))

Select the trains whose departure time falls within the interval now +/- (time_window/2), expressed in Unix timestamp units (i.e. seconds).

In [None]:
df.printSchema()

In [None]:
time_window = 1800

In [None]:
df = df.where(sf.col("departure").between(sf.unix_timestamp(sf.current_timestamp()) - int(time_window/2.), 
                                          sf.unix_timestamp(sf.current_timestamp()) + int(time_window/2.)))
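
The window test amounts to a simple bounds check on integer seconds. The sketch below restates it in plain Python; `in_time_window` is an illustrative helper and the `now` parameter is added only so the check can be exercised with fixed values.

```python
import time


def in_time_window(departure, now=None, time_window=1800):
    """True when 'departure' lies within now +/- time_window/2 (all in seconds)."""
    now = int(time.time()) if now is None else now
    half = time_window // 2
    # Both bounds are inclusive, like Spark's Column.between
    return now - half <= departure <= now + half


print(in_time_window(1551428100, now=1551428100 + 600))
```

With the default `time_window` of 1800 s, a departure is kept when it is at most 15 minutes before or after the current time.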

Pseudo-aggregation to obtain our set of trains as a single _batch_: the idea is to be able to run a query in _complete_ mode on our stream. This is a trick that satisfies a constraint imposed by Spark: 'complete' mode only applies to aggregated data, i.e. data produced by one of Spark's aggregation functions.

In [None]:
df = df.groupBy("train", "departure", "timestamp", "station", "mission", "terminus").agg(sf.count("train").alias("tmp")).drop("tmp")
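
Grouping on every column, counting, then dropping the count leaves exactly one row per distinct combination of values. The plain-Python sketch below illustrates that effect with `collections.Counter` on hypothetical rows reduced to three columns; it says nothing about how Spark executes the aggregation.

```python
from collections import Counter

# Hypothetical rows as (train, departure, station) tuples.
rows = [
    ("134567", 1551428100, 87381129),
    ("134567", 1551428100, 87381129),  # same combination seen twice
    ("134567", 1551428280, 87381137),
]

# Equivalent of groupBy(<all columns>).agg(count(...)): one counter per combination
counts = Counter(rows)

# Equivalent of .drop("tmp"): discard the counts, keep the distinct combinations
distinct_rows = list(counts)
print(distinct_rows)
```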

In [None]:
df.printSchema()

In [None]:
df = df.orderBy("train", "departure", "timestamp", "station")

In [None]:
df.printSchema()

In [None]:
def forEachBatchCallback(batch, batch_number):

    if batch.rdd.isEmpty():
        print(f"ignoring empty batch #{batch_number}")
        return
    
    t = time.time()
    
    # create next_departure & next_station lead columns: departure & station columns up shifted by 1 row
    tmp = batch.withColumn('next_departure', sf.lead('departure').over(spark_window.partitionBy("train").orderBy("departure")))
    tmp = tmp.withColumn('next_station', sf.lead('station').over(spark_window.partitionBy("train").orderBy("departure")))
    
    tmp.show()
    
    # create human-readable columns for departure date/time (HH = 24-hour clock)
    tmp = tmp.withColumn("departure_date", sf.from_unixtime(tmp.departure, "HH:mm:ss"))
    tmp = tmp.withColumn("next_departure_date", sf.from_unixtime(tmp.next_departure, "HH:mm:ss"))
    
    # swap departure date/time (due to train direction) - this is just for readability & display 
    tmp = tmp.withColumn("temp_departure_date", tmp.departure_date)
    tmp = tmp.withColumn("departure_date", sf.when(tmp.departure < tmp.next_departure, tmp.departure_date).otherwise(tmp.next_departure_date))
    tmp = tmp.withColumn("next_departure_date", sf.when(tmp.departure < tmp.next_departure, tmp.next_departure_date).otherwise(tmp.temp_departure_date))
    tmp = tmp.drop("temp_departure_date")
    
    # tmp.show()
    
    # compute travel time between 'departure' and 'next_departure' - i.e. from one station to the next
    tmp = tmp.withColumn("dt", tmp.departure -  tmp.next_departure)
    
    # create column to store the current time (i.e. now)
    tmp = tmp.withColumn("now", sf.unix_timestamp(sf.current_timestamp()))
    
    # the travel (from one station to the next) can belong to the past, the future or can be in progress 
    tmp = tmp.withColumn("in_past", (tmp.now > tmp.departure) & (tmp.now > tmp.next_departure))
    tmp = tmp.withColumn("in_future", (tmp.now < tmp.departure) & (tmp.now < tmp.next_departure))
    tmp = tmp.withColumn("in_progress", (tmp.in_past != sf.lit(True)) & (tmp.in_future != sf.lit(True)))
    
    # tmp.show()
    
    # keep only 'in progress' travels - i.e. the ones not in past nor in the future
    # we also remove standby (i.e fake travel from one station to the same - train waiting for next departure)
    tmp = tmp.filter((~tmp.in_past & ~tmp.in_future) & (tmp.station != tmp.next_station) & (tmp.next_departure.isNotNull()))
    
    # tmp.show()
    
    # compute travel progression in %
    tmp = tmp.withColumn("progress", (100. * sf.abs((tmp.now - tmp.departure))) / sf.abs(tmp.dt))
    # clamp the progression to the [0, 100]% range: lower bound
    tmp = tmp.withColumn("progress", sf.when(tmp.progress < sf.lit(0.), sf.lit(0.)).otherwise(tmp.progress))
    # clamp the progression to the [0, 100]% range: upper bound
    tmp = tmp.withColumn("progress", sf.when(tmp.progress > sf.lit(100.), sf.lit(100.)).otherwise(tmp.progress))
    
    # round progress values to 1 decimal place (sf.round keeps the column numeric)
    tmp = tmp.withColumn("progress", sf.round(tmp.progress, 1))

    # tmp.show()
        
    # select the required columns
    tmp = tmp.select(tmp.train, 
                     tmp.departure_date.alias("departure"),
                     tmp.next_departure_date.alias("arrival"),
                     tmp.mission, 
                     tmp.station.alias("from_st"), 
                     tmp.next_station.alias("to_st"), 
                     tmp.progress.alias("prg")) 

    # from (departure location)
    tmp = tmp.join(stations_data, tmp.from_st == stations_data.station, how="left")
    tmp = tmp.withColumn("from_st_lt", tmp.latitude).drop("latitude")
    tmp = tmp.withColumn("from_st_lg", tmp.longitude).drop("longitude")
    tmp = tmp.withColumn("from_st_lb", tmp.label).drop("label")
    tmp = tmp.drop("station")

    # to (destination location) 
    tmp = tmp.join(stations_data, tmp.to_st == stations_data.station, how="left")
    tmp = tmp.withColumn("to_st_lt", tmp.latitude).drop("latitude")
    tmp = tmp.withColumn("to_st_lg", tmp.longitude).drop("longitude")
    tmp = tmp.withColumn("to_st_lb", tmp.label).drop("label")
    tmp = tmp.drop("station")

    # compute current train latitude & longitude
    tmp = tmp.withColumn("train_lt", tmp.from_st_lt + ((tmp.prg / 100.) * (tmp.to_st_lt - tmp.from_st_lt)))
    tmp = tmp.withColumn("train_lg", tmp.from_st_lg + ((tmp.prg / 100.) * (tmp.to_st_lg - tmp.from_st_lg)))

    # remove tmp data from table
    tmp = tmp.select("train",       # train identifier 
                     "departure",   # departure time
                     "arrival",     # arrival time
                     "mission",     # mission code
                     "from_st",     # departure station code
                     "to_st",       # arrival station code
                     "from_st_lb",  # departure station label
                     "to_st_lb",    # arrival station label
                     "prg",         # travel progress
                     "train_lt",    # current train location: latitude 
                     "train_lg")    # current train location: longitude 

    # log
    tmp.show()
    
    #kafka_session.createDataFrame(tmp.rdd).createOrReplaceTempView("train_progression")
    print(f"`-> took {round(time.time() - t, 2)} s")
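
The core arithmetic of the callback (pairing each stop with the next one of the same train, computing the progress in percent, and interpolating the position) can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: the helper names are illustrative, the rows and coordinates are hypothetical, and the last stop of each train is simply dropped, whereas `sf.lead` yields a null that the callback filters out.

```python
def lead_pairs(rows):
    """Pair each row with the next row of the same train, ordered by departure."""
    by_train = {}
    for row in sorted(rows, key=lambda r: (r["train"], r["departure"])):
        by_train.setdefault(row["train"], []).append(row)
    pairs = []
    for stops in by_train.values():
        pairs.extend(zip(stops, stops[1:]))
    return pairs


def progress_percent(now, departure, next_departure):
    """Travel progress in %, clamped to [0, 100] and rounded to 1 decimal place."""
    dt = abs(next_departure - departure)
    if dt == 0:
        return None  # same departure time at both stops: progress is undefined
    pct = 100.0 * abs(now - departure) / dt
    return round(min(max(pct, 0.0), 100.0), 1)


def interpolate(start, end, pct):
    """Linear interpolation between two coordinates for a progress given in %."""
    return start + (pct / 100.0) * (end - start)


# Hypothetical stops of one train (times in Unix seconds, made-up coordinates).
rows = [
    {"train": "134567", "departure": 1180, "station": 87381137, "lat": 49.0, "lon": 3.0},
    {"train": "134567", "departure": 1000, "station": 87381129, "lat": 48.0, "lon": 2.0},
]
now = 1090  # halfway between the two departures

positions = []
for current, following in lead_pairs(rows):
    pct = progress_percent(now, current["departure"], following["departure"])
    positions.append(
        (
            current["train"],
            pct,
            interpolate(current["lat"], following["lat"], pct),
            interpolate(current["lon"], following["lon"], pct),
        )
    )
print(positions)
```

The interpolation is linear in latitude and longitude, which matches the callback's `train_lt` and `train_lg` formulas and assumes a straight line between the two stations.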

In [None]:
query_2 = df \
    .writeStream \
    .foreachBatch(forEachBatchCallback) \
    .outputMode("complete") \
    .start()

In [None]:
query_2.stop()