# Predict delays for every scheduled connection

In this notebook, we attach an uncertainty parameter to every scheduled connection so that possible delays can be predicted. In other words, we join the static public transport network with the results of the delay analysis to obtain a predictive model. The notebook is structured as follows:

*   **[Start Spark](#spark)** 
*   **[Get relevant trips](#relevant)**  
*   **[Transform trips to connections](#raw_connections)** 
*   **[Add useful mapping information](#id_connections)**



<a id = 'spark'></a>
### 1. Start Spark

We use a Spark session to perform transformations and actions on DataFrames.

In [1]:
%%configure -f
{
    "conf": {
        "spark.app.name": "wow",
        "spark.driver.memory": "8g",
        "spark.driver.maxResultSize": "6g",
        "spark.executor.memory": "2g",
        "spark.executor.instances": "64"
    }
}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
8387,application_1589299642358_2919,pyspark,idle,Link,Link,
8390,application_1589299642358_2922,pyspark,idle,Link,Link,
8393,application_1589299642358_2925,pyspark,idle,Link,Link,
8397,application_1589299642358_2929,pyspark,idle,Link,Link,
8398,application_1589299642358_2930,pyspark,idle,Link,Link,
8400,application_1589299642358_2932,pyspark,idle,Link,Link,
8401,application_1589299642358_2933,pyspark,idle,Link,Link,
8403,application_1589299642358_2935,pyspark,idle,Link,Link,
8405,application_1589299642358_2937,pyspark,idle,Link,Link,
8407,application_1589299642358_2939,pyspark,idle,Link,Link,


In [2]:
spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
8421,application_1589299642358_2953,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


<pyspark.sql.session.SparkSession object at 0x7fa6ab43e290>

In [3]:
%%info

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
8387,application_1589299642358_2919,pyspark,idle,Link,Link,
8390,application_1589299642358_2922,pyspark,busy,Link,Link,
8393,application_1589299642358_2925,pyspark,idle,Link,Link,
8397,application_1589299642358_2929,pyspark,busy,Link,Link,
8398,application_1589299642358_2930,pyspark,idle,Link,Link,
8400,application_1589299642358_2932,pyspark,idle,Link,Link,
8401,application_1589299642358_2933,pyspark,idle,Link,Link,
8403,application_1589299642358_2935,pyspark,idle,Link,Link,
8405,application_1589299642358_2937,pyspark,idle,Link,Link,
8407,application_1589299642358_2939,pyspark,idle,Link,Link,


In [4]:
import pyspark.sql.functions as f
from pyspark.sql import Window

In [15]:
id_connections = spark.read.orc('/user/datavirus/id_connections_new.orc').repartition(150, 'id')

In [17]:
probability = spark.read.orc('/user/datavirus/probability.orc').repartition(150, 'id')

In [22]:
# we add the probability information to all connections

probability_connections = (
    id_connections
    .join(probability, ['id', 'station_id', 'arrival_time_minute'], how='left_outer')
    .select(
        id_connections.stop_sequence,
        id_connections.route_type,
        id_connections.arrival_time_hour,
        id_connections.produkt_id,
        id_connections.start_id,
        id_connections.start_time,
        id_connections.trip_id,
        probability.transport_type,
        probability.line_text,
        id_connections.stop_time,
        id_connections.stop_id,
        probability.delay_probability,
        probability.delay_parameter
    )
)
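
The join above is a left outer join: every row of `id_connections` is kept, and the columns taken from `probability` are null wherever no row matches the join keys. Since `line_text` is selected from `probability`, this is presumably also why it can appear as `null` in the final output. The sketch below illustrates these semantics in plain Python rather than Spark. The sample rows, the helper `left_outer_join`, and the assumption that the right side has at most one row per key are all made up for illustration.

```python
# Plain-Python illustration of left-outer-join semantics (not Spark code).
# All rows and values below are invented for the example.
JOIN_KEYS = ("id", "station_id", "arrival_time_minute")


def left_outer_join(left_rows, right_rows, keys, right_columns):
    """Keep every left row and copy right_columns from the matching right row.

    When no right row shares the join keys, the copied columns are None.
    Assumes at most one right row per key (a real join would emit one
    output row per match).
    """
    index = {tuple(row[k] for k in keys): row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys))
        extra = {c: (match[c] if match is not None else None) for c in right_columns}
        joined.append({**row, **extra})
    return joined


id_rows = [
    {"id": 1, "station_id": "A", "arrival_time_minute": 605, "trip_id": "t1"},
    {"id": 2, "station_id": "B", "arrival_time_minute": 610, "trip_id": "t2"},
]
probability_rows = [
    {"id": 1, "station_id": "A", "arrival_time_minute": 605,
     "delay_probability": 0.9, "delay_parameter": 0.01},
]

joined_rows = left_outer_join(
    id_rows, probability_rows, JOIN_KEYS, ("delay_probability", "delay_parameter")
)
for row in joined_rows:
    print(row["trip_id"], row["delay_probability"], row["delay_parameter"])
```

The second connection has no matching probability row, so its probability columns stay empty until a backup value is supplied in a later step.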

In [23]:
probability_connections.write.format('orc').save('/user/datavirus/probability_connections_new.orc', mode='overwrite')

In [24]:
probability_connections = spark.read.orc('/user/datavirus/probability_connections_new.orc').alias('probability_connections')

In [25]:
# for some connections we have no line specific probabilities 
# so we have to add transport probabilities as backup

transport_probability = spark.read.orc('/user/datavirus/transport_probability.orc').alias('transport_probability')
transport_probability = (
    transport_probability
    .withColumn('arrival_time_hour', transport_probability.ankunftszeit_hour)
)

In [27]:
# we get the final dataframe

connections = (
    probability_connections
    .join(transport_probability, ['produkt_id', 'arrival_time_hour'])
    .select(
        probability_connections.stop_sequence,
        probability_connections.route_type,
        probability_connections.start_id,
        probability_connections.start_time,
        probability_connections.trip_id,
        probability_connections.produkt_id.alias('transport_type'),
        probability_connections.line_text,
        probability_connections.stop_time,
        probability_connections.stop_id,
        # prefer the line-specific value, fall back to the transport-level one
        f.coalesce(
            f.col('probability_connections.delay_probability'),
            f.col('transport_probability.transport_delay_probability')
        ).alias('delay_probability'),
        f.coalesce(
            f.col('probability_connections.delay_parameter'),
            f.col('transport_probability.transport_delay_parameter')
        ).alias('delay_parameter')
    )
    .orderBy([probability_connections.stop_time.desc(), probability_connections.stop_sequence.desc()])
)
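
The fallback in the cell above can be pictured row by row: the line-specific value is used when it exists, and the transport-level value is used otherwise. Note that `join` without a `how` argument is an inner join in PySpark, so a connection with no matching `(produkt_id, arrival_time_hour)` entry in `transport_probability` would be dropped. The following is a minimal plain-Python sketch of that behaviour, not the Spark code itself. The helper names and all sample values are hypothetical.

```python
# Plain-Python sketch of the fallback logic (not Spark code); all values are invented.


def first_not_none(primary, backup):
    """Return primary unless it is None, in which case return backup."""
    return primary if primary is not None else backup


def add_fallback(connections, transport_table):
    """Inner-join connections with transport_table on (produkt_id, arrival_time_hour)
    and fill missing line-specific values from the transport-level entry."""
    result = []
    for conn in connections:
        backup = transport_table.get((conn["produkt_id"], conn["arrival_time_hour"]))
        if backup is None:
            continue  # inner join: connections without a backup entry are dropped
        result.append({
            "trip_id": conn["trip_id"],
            "delay_probability": first_not_none(
                conn["delay_probability"], backup["transport_delay_probability"]),
            "delay_parameter": first_not_none(
                conn["delay_parameter"], backup["transport_delay_parameter"]),
        })
    return result


transport_table = {
    ("bus", 10): {"transport_delay_probability": 0.92, "transport_delay_parameter": 0.012},
}
sample_connections = [
    {"trip_id": "t1", "produkt_id": "bus", "arrival_time_hour": 10,
     "delay_probability": 0.95, "delay_parameter": 0.02},
    {"trip_id": "t2", "produkt_id": "bus", "arrival_time_hour": 10,
     "delay_probability": None, "delay_parameter": None},
    {"trip_id": "t3", "produkt_id": "tram", "arrival_time_hour": 23,
     "delay_probability": None, "delay_parameter": None},
]

filled = add_fallback(sample_connections, transport_table)
for row in filled:
    print(row)
```

In this sketch `t1` keeps its own values, `t2` receives the transport-level backup, and `t3` disappears because no backup entry exists for its key. If such connections should be kept, a left join would be the alternative to consider.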

In [28]:
connections.write.format('orc').save('/user/datavirus/connections_new.orc', mode='overwrite')

In [11]:
connections = spark.read.orc('/user/datavirus/connections_new.orc')

In [31]:
connections.write.format('csv').save('/user/datavirus/connections.csv', mode='overwrite')

In [6]:
connections = spark.read.csv('/user/datavirus/connections.csv')

In [7]:
connections.show(10, False)

+---+---+-------+--------+-----------------------+---+----+--------+-------+------------------+--------------------+
|_c0|_c1|_c2    |_c3     |_c4                    |_c5|_c6 |_c7     |_c8    |_c9               |_c10                |
+---+---+-------+--------+-----------------------+---+----+--------+-------+------------------+--------------------+
|2  |700|8502209|10:05:00|9.TA.30-170-Y-j19-1.1.H|bus|null|10:05:00|8502209|0.9193866847990115|0.011589134124321832|
|2  |700|8502771|10:05:00|392.TA.26-235-j19-1.5.R|bus|235 |10:05:00|8502771|0.9159136766383016|0.013732196842403623|
|2  |700|8502771|10:05:00|493.TA.26-235-j19-1.5.R|bus|235 |10:05:00|8502771|0.9159136766383016|0.013732196842403623|
|2  |700|8502771|10:05:00|360.TA.26-235-j19-1.5.R|bus|235 |10:05:00|8502771|0.9159136766383016|0.013732196842403623|
|2  |700|8502771|10:05:00|345.TA.26-235-j19-1.5.R|bus|235 |10:05:00|8502771|0.9159136766383016|0.013732196842403623|
|2  |700|8502771|10:05:00|439.TA.26-235-j19-1.5.R|bus|235 |10:05

In [13]:
connections.write.csv('/user/datavirus/connections_test.csv', header=True, mode='overwrite')
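
The generic `_c0` ... `_c10` column names in the earlier output come from writing and reading the CSV without a header row. Writing with `header=True` and reading with `header=True` preserves the names. The sketch below reproduces the effect with Python's standard `csv` module instead of Spark. The helpers `write_csv` and `read_csv` and the sample rows are hypothetical, and the `_c{i}` naming only mimics Spark's default.

```python
import csv
import io

# Invented sample data; the column names are borrowed from the notebook for readability.
columns = ["stop_sequence", "trip_id", "delay_probability"]
rows = [[2, "trip-a", 0.92], [3, "trip-b", 0.88]]


def write_csv(rows, header=None):
    """Serialize rows to CSV text, optionally preceded by a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv(text, header):
    """Parse CSV text into dicts; without a header, fall back to _c0, _c1, ... names."""
    parsed = list(csv.reader(io.StringIO(text)))
    if header:
        names, data = parsed[0], parsed[1:]
    else:
        names = [f"_c{i}" for i in range(len(parsed[0]))]
        data = parsed
    return [dict(zip(names, record)) for record in data]


without_header = read_csv(write_csv(rows), header=False)
with_header = read_csv(write_csv(rows, header=columns), header=True)

print(without_header[0])
print(with_header[0])
```

In both cases the values come back as strings. Spark's CSV reader behaves the same way unless a schema is supplied or `inferSchema=True` is set, so numeric columns such as `delay_probability` need a cast after reading.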

In [16]:
c = spark.read.csv('/user/datavirus/connections_test.csv', header=True)

In [17]:
c.show(10, False)

+-------------+----------+--------+----------+-----------------------+--------------+---------+---------+-------+------------------+--------------------+
|stop_sequence|route_type|start_id|start_time|trip_id                |transport_type|line_text|stop_time|stop_id|delay_probability |delay_parameter     |
+-------------+----------+--------+----------+-----------------------+--------------+---------+---------+-------+------------------+--------------------+
|2            |700       |8502209 |10:05:00  |9.TA.30-170-Y-j19-1.1.H|bus           |null     |10:05:00 |8502209|0.9193866847990115|0.011589134124321832|
|2            |700       |8502771 |10:05:00  |392.TA.26-235-j19-1.5.R|bus           |235      |10:05:00 |8502771|0.9159136766383016|0.013732196842403623|
|2            |700       |8502771 |10:05:00  |493.TA.26-235-j19-1.5.R|bus           |235      |10:05:00 |8502771|0.9159136766383016|0.013732196842403623|
|2            |700       |8502771 |10:05:00  |360.TA.26-235-j19-1.5.R|bus   