# Overview:

In this notebook, we implement the validation part of our project. \
To validate a trip predicted by the route planning algorithm, we take as input its list of nodes (stops) and edges (connecting trips) and look up the same trips in the real-world SBB data. We then check which of the following real-world outcomes the predicted route would have led to:
1. SUCCESSFUL TRIP
2. MISSED CONNECTION
3. Connection not found (this occurs due to a slight mismatch between the timetable data used for route prediction and the SBB data)

We consider 50 random pairs of origin and destination stations and use 3 different algorithms to find a route for each pair. For each algorithm, we then compute the percentage of successful trips as $\frac{\#\,\text{SUCCESSFUL TRIPS}}{\#\,\text{SUCCESSFUL TRIPS} + \#\,\text{MISSED CONNECTIONS}}$ (we ignore the 'Connection not found' cases). \
We find the following results:

| Algorithm                                           | Percentage of successful trips |
|:---------------------------------------------------:|:------------------------------:|
| Shortest path algorithm (no delay considerations)   | 72.4%                          |
| Our algorithm (confidence 0.8)                      | 90.3%                          |
| Our algorithm (confidence 0.96)                     | 96.7%                          |
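
As a minimal sketch of the success-rate formula above: the helper name `success_rate` and the example flags are hypothetical and only serve as illustration. The flag values follow the 0/1/2 convention returned by `validate_trip` later in this notebook.

```python
# Hypothetical helper illustrating the success-rate formula; not part of the pipeline.
SUCCESS, MISSED, NOT_FOUND = 0, 1, 2  # same flag convention as validate_trip

def success_rate(flags):
    """Return #successful / (#successful + #missed), ignoring 'not found' cases.

    Returns None when no trip could be matched, to avoid dividing by zero.
    """
    successes = sum(1 for f in flags if f == SUCCESS)
    missed = sum(1 for f in flags if f == MISSED)
    matched = successes + missed
    if matched == 0:
        return None
    return successes / float(matched)

# Made-up flags for illustration: 3 successes, 1 missed connection, 2 unmatched trips
example_flags = [0, 2, 0, 1, 2, 0]
print(success_rate(example_flags))  # 3 / (3 + 1) = 0.75
```

Returning `None` instead of raising keeps a run in which every trip is unmatched from failing at the reporting step.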

In [1]:
%%local
import os
username = 'moiseev'
username = os.environ['JUPYTERHUB_USER'] # Use your own models and data; comment this line out to keep the default user above.
get_ipython().run_cell_magic('configure', line="-f", cell='{ "name":"%s-final", "executorMemory":"4G", "executorCores":4, "numExecutors":10, "driverMemory": "32G" }' % username)

ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
7192,application_1618324153128_6907,pyspark,idle,Link,Link,,
7196,application_1618324153128_6922,pyspark,idle,Link,Link,,
7197,application_1618324153128_6924,pyspark,idle,Link,Link,,
7198,application_1618324153128_6927,pyspark,idle,Link,Link,,
7199,application_1618324153128_6928,pyspark,idle,Link,Link,,
7201,application_1618324153128_6930,pyspark,busy,Link,Link,,
7203,application_1618324153128_6932,pyspark,idle,Link,Link,,
7204,application_1618324153128_6933,pyspark,idle,Link,Link,,
7205,application_1618324153128_6934,pyspark,idle,Link,Link,,


In [2]:
%%send_to_spark -i username -t str -n username 

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
7206,application_1618324153128_6935,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


Successfully passed 'username' as 'username' to Spark kernel

In [3]:
import pyspark.sql.functions as F
from pyspark import SparkConf, SparkContext
import pandas as pd
import math

In [4]:
# Some helper functions

# To correctly format the time as zero-padded "HH:MM"
def format_time(time):
    parts = str(time).split(":")
    return "{0}:{1}".format(parts[0].zfill(2), parts[1].zfill(2))

# To add minutes to a given time (the hour is not wrapped around at midnight)
def add_minutes(t, minutes):
    (hour_t, min_t) = tuple(map(int, t.split(':')))
    # divmod carries any overflow or underflow of the minutes into the hour
    hour_t, new_min = divmod(hour_t * 60 + min_t + minutes, 60)
    return str(hour_t) + ':' + str(new_min)

# To subtract minutes from a given time
def sub_minutes(t, minutes):
    return add_minutes(t, -minutes)

# Create a 5-min time bracket: t-2, t-1, t, t+1, t+2 (in minutes)
def create_time_bracket(t):
    return [format_time(add_minutes(t, offset)) for offset in range(-2, 3)]

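The string arithmetic in the helpers above can also be expressed with the standard library's `datetime` module. The following is a minimal sketch, not the notebook's implementation: the function name `time_bracket` and its `half_width` parameter are hypothetical. Unlike the helpers above, it wraps around midnight, which is an assumption about the desired behaviour.

```python
from datetime import datetime, timedelta

def time_bracket(t, half_width=2):
    """Return the "HH:MM" strings from t - half_width to t + half_width minutes.

    Hypothetical alternative to create_time_bracket; wraps around midnight.
    Expects an hour between 0 and 23.
    """
    hour, minute = (int(part) for part in str(t).split(":")[:2])
    # The date is an arbitrary placeholder; only the time of day is used.
    base = datetime(2000, 1, 1, hour, minute)
    return [(base + timedelta(minutes=offset)).strftime("%H:%M")
            for offset in range(-half_width, half_width + 1)]

print(time_bracket("8:01"))
# ['07:59', '08:00', '08:01', '08:02', '08:03']
```

Because the SBB data is filtered to a single date, a bracket that wraps past midnight would still only match rows from that date, so the wrap-around mainly avoids producing strings such as `24:01`.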

In [5]:
# Function to read relevant data for validation from hdfs
def read_data():
    sbb_connections = spark.read.orc('/data/sbb/orc/istdaten')
    sbb_connections = sbb_connections.selectExpr(
        "betriebstag as date",

        "fahrt_bezeichner as trip_id",

        "betreiber_id as operator_id",
        "betreiber_abk as operator_abbr",
        "betreiber_name as operator_name",

        "produkt_id as product_id",
        "linien_id as line_id",
        "linien_text as line_text",
        "umlauf_id as circulation_id",
        "verkehrsmittel_text as transportation_text",
        "zusatzfahrt_tf as is_extra",
        "faellt_aus_tf as is_cancelled",
        "haltestellen_name as stop_name",
        # The bpuic corresponds to the stop_id in the sbb_stops from the geostops file
        "bpuic as stop_id",

        "ankunftszeit as scheduled_arrival_time",
        "an_prognose as actual_arrival_time",
        "an_prognose_status as arrival_forecast_status", 

        "abfahrtszeit as scheduled_departure_time",
        "ab_prognose as actual_departure_time",
        "ab_prognose_status as departure_forecast_status",

        "durchfahrt_tf as is_transit"
    )
    

    stops = spark.read.parquet("/user/{}/stops.parquet".format(username))
    routes = spark.read.parquet("/user/{}/routes.parquet".format(username))
    trips = spark.read.parquet("/user/{}/trips.parquet".format(username))
    return sbb_connections, stops, routes, trips

In [6]:
# Function to validate the given trip
# Return values:
# 0: TRIP SUCCESSFUL
# 1: CONNECTION MISSED
# 2: Couldn't find connection

def validate_trip(nodes, edges, conns, stops, routes, trips, date_of_journey):

    flag = 0
    
    # Filter stops to retain only relevant stops in journey
    stops_filt = stops\
        .select("stop_id", "stop_name")\
        .filter(F.col("stop_id").isin(nodes))\
        .toPandas()\
        .set_index("stop_id")
    
    trip_ids = set()
    for e in edges:
        if e['trip_id'] != 'walk':
            trip_ids.add(e['trip_id'])
    trip_ids = list(trip_ids)
    
    # Obtain the routes and operators of the trips which compose the journey
    trip_to_route = routes\
        .join(trips, 'route_id', 'left')\
        .filter(F.col("trip_id").isin(trip_ids))\
        .select("route_desc", "route_short_name", "trip_id", "agency_id")\
        .toPandas().set_index('trip_id')
    
    # Filter SBB data to contain only the stops of interest on the date of the journey
    # Add columns for arrival/departure time, both scheduled and actual
    conns_filt = conns\
                .filter(F.col("stop_name").isin(list(stops_filt['stop_name'])))\
                .filter(F.col("date")==date_of_journey)\
                .withColumn("scheduled_arrival_time_ft", F.substring(F.col("scheduled_arrival_time"),12,5))\
                .withColumn("scheduled_departure_time_ft", F.substring(F.col("scheduled_departure_time"),12,5))\
                .withColumn("actual_arrival_time_ft", F.substring(F.col("actual_arrival_time"),12,8))\
                .withColumn("actual_departure_time_ft", F.substring(F.col("actual_departure_time"),12,8))\
                .withColumn("operator_id_ft", F.substring(F.col("operator_id"),4,12))\
                .select("operator_id_ft","stop_name","scheduled_arrival_time_ft", "scheduled_departure_time_ft", "actual_arrival_time_ft", "actual_departure_time_ft", "stop_id", "product_id", "line_text", "circulation_id")\
                .toPandas()
    
    curr_trip_id = edges[0]['trip_id']
    if curr_trip_id=='walk':
        walking_time = edges[0]['walking_time']
        scheduled_walking_start_time = pd.to_datetime(format_time(edges[0]['dep']), infer_datetime_format=True)
        actual_walking_start_time = pd.to_datetime(format_time(edges[0]['dep']), infer_datetime_format=True)
    else:
        walking_time = 0

    for e in range(len(edges)):
        stop_id = nodes[e]
        # If trip_id changes, ie a transfer occurs on the journey
        if(edges[e]['trip_id']!=curr_trip_id):
            if curr_trip_id=='walk':
                scheduled_arrival = scheduled_walking_start_time + pd.Timedelta(minutes=math.ceil(walking_time))
                actual_arrival = actual_walking_start_time + pd.Timedelta(minutes=math.ceil(walking_time))
            else:
                # Obtain the corresponding trip from SBB data which arrives at the desired stop 
                # using the same mode of transport, and in a 5-min time bracket
                arrival_edge = conns_filt[\
                                         (conns_filt['stop_name'].isin([stops_filt.loc[stop_id]['stop_name']])) &\
                                         (conns_filt['operator_id_ft']==str(trip_to_route.loc[curr_trip_id]['agency_id'])) &\
                                         (conns_filt['line_text'].str.contains(trip_to_route.loc[curr_trip_id]['route_short_name'])) &\
                                         (conns_filt['scheduled_arrival_time_ft'].isin(create_time_bracket(edges[e-1]['arr'])))\
                                        ]
                if (len(arrival_edge)==0):
                    print("Couldn't find connection!")
                    flag = 2
                    break
                scheduled_arrival = min(pd.to_datetime(arrival_edge['scheduled_arrival_time_ft'], infer_datetime_format=True))
                actual_arrival = min(pd.to_datetime(arrival_edge['actual_arrival_time_ft'], infer_datetime_format=True))

            if edges[e]['trip_id']=='walk':
                scheduled_departure = scheduled_arrival + pd.Timedelta(minutes=2)
                actual_departure = actual_arrival + pd.Timedelta(minutes=2)
                scheduled_walking_start_time = scheduled_arrival
                actual_walking_start_time = actual_arrival
            else:
                # Obtain the corresponding trip from SBB data which departs from the desired stop 
                # using the same mode of transport, and in a 5-min time bracket
                dep_edge = conns_filt[\
                                         (conns_filt['stop_name'].isin([stops_filt.loc[stop_id]['stop_name']])) &\
                                         (conns_filt['operator_id_ft']==str(trip_to_route.loc[edges[e]['trip_id']]['agency_id'])) &\
                                         (conns_filt['line_text'].str.contains(trip_to_route.loc[edges[e]['trip_id']]['route_short_name'])) &\
                                         (conns_filt['scheduled_departure_time_ft'].isin(create_time_bracket(edges[e]['dep'])))\
                                        ]
                if (len(dep_edge)==0):
                    print("Couldn't find connection!")
                    flag = 2
                    break
                scheduled_departure = max(pd.to_datetime(dep_edge['scheduled_departure_time_ft'], infer_datetime_format=True))
                actual_departure = max(pd.to_datetime(dep_edge['actual_departure_time_ft'], infer_datetime_format=True))

            if curr_trip_id!='walk':
                # If the actual departure from the stop is less than 2 minutes after the
                # actual arrival time, the connection was missed!
                if (actual_arrival+pd.Timedelta(minutes=2)>actual_departure):
                    print("Connection MISSED!")
                    flag = 1
                    break
            else:
                if (actual_arrival>actual_departure):
                    print("Connection MISSED!")
                    flag = 1
                    break
            curr_trip_id = edges[e]['trip_id']
    if flag==0:
        print("Trip SUCCESSFUL!")
    return flag

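The transfer rule at the end of `validate_trip` can be isolated as a small predicate. This is a sketch under the assumption that arrival and departure are comparable timestamps on the same day. The name `connection_missed` and its parameters are hypothetical, and the example times are made up. The 2-minute buffer after a ride and the zero buffer after a walk mirror the checks in the function above.

```python
from datetime import datetime, timedelta

def connection_missed(actual_arrival, actual_departure, arrived_on_foot=False,
                      min_transfer_minutes=2):
    """Return True if the onward departure leaves too early to be caught.

    After a ride, the traveller needs min_transfer_minutes to change;
    after a walk, no extra buffer is required (as in validate_trip).
    """
    if arrived_on_foot:
        buffer = timedelta(0)
    else:
        buffer = timedelta(minutes=min_transfer_minutes)
    return actual_arrival + buffer > actual_departure

# Made-up example: the inbound train arrives at 08:03, the onward bus leaves at 08:04
arrival = datetime(2019, 5, 16, 8, 3)
departure = datetime(2019, 5, 16, 8, 4)
print(connection_missed(arrival, departure))                        # True
print(connection_missed(arrival, departure, arrived_on_foot=True))  # False
```

The comparison is strict, so a departure exactly 2 minutes after the arrival still counts as caught, matching the `>` used in `validate_trip`.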

In [7]:
# Read relevant data
conns, stops, routes, trips = read_data()

In [8]:
%%local
import pickle
import pandas as pd

# Read saved trips from file 

# File A_08 contains route predictions using our algorithm with a confidence of 0.8
with open('../data/A_08.pickle', 'rb') as config_dictionary_file:
    ret = pickle.load(config_dictionary_file)
    
# File A_096 contains route predictions using our algorithm with a confidence of 0.96
with open('../data/A_096.pickle', 'rb') as config_dictionary_file:
    ret1 = pickle.load(config_dictionary_file)
    
# File A_base1 contains route predictions using the baseline shortest path algorithm,
# with no delay considerations
with open('../data/A_base1.pickle', 'rb') as config_dictionary_file:
    ret_base = pickle.load(config_dictionary_file)

nodes = [ret[i][0][0][0] for i in range(len(ret))]
edges = [pd.DataFrame(ret[i][0][0][1]) for i in range(len(ret))]
nodes1 = [ret1[i][0][0][0] for i in range(len(ret1))]
edges1 = [pd.DataFrame(ret1[i][0][0][1]) for i in range(len(ret1))]
nodes_b = [ret_base[i][0][0][0] for i in range(len(ret_base))]
edges_b = [pd.DataFrame(ret_base[i][0][0][1]) for i in range(len(ret_base))]

nodes = pd.DataFrame([nodes])
edges = pd.DataFrame([edges])

nodes1 = pd.DataFrame([nodes1])
edges1 = pd.DataFrame([edges1])

nodes_b = pd.DataFrame([nodes_b])
edges_b = pd.DataFrame([edges_b])


In [9]:
# Send the variables read above to Spark

In [10]:
%%send_to_spark -i nodes -t df

Successfully passed 'nodes' as 'nodes' to Spark kernel

In [11]:
%%send_to_spark -i edges -t df

Successfully passed 'edges' as 'edges' to Spark kernel

In [12]:
%%send_to_spark -i nodes_b -t df

Successfully passed 'nodes_b' as 'nodes_b' to Spark kernel

In [13]:
%%send_to_spark -i edges_b -t df

Successfully passed 'edges_b' as 'edges_b' to Spark kernel

In [14]:
%%send_to_spark -i nodes1 -t df

Successfully passed 'nodes1' as 'nodes1' to Spark kernel

In [15]:
%%send_to_spark -i edges1 -t df

Successfully passed 'edges1' as 'edges1' to Spark kernel

In [17]:
# Run validation on the routes predicted by the 3 different algorithms.
# We aim to find the fraction of trips which succeed.
# NOTE: We have saved 50 routes between randomly chosen origin and destination
# stations, and validating all 50 trips can take some time. To test on a
# smaller set of trips, lower NUM_TRIPS in the line below; set it to 50 to
# validate all saved routes.

NUM_TRIPS = 10

import ast
count_base = 0
success_base = 0
count = 0
success = 0
count1 = 0
success1 = 0
for i in range(NUM_TRIPS):
    nodes_i = nodes.toPandas()[str(i)].item()
    edges_i = ast.literal_eval(edges.toPandas()['_corrupt_record'].item())[str(i)]
    flag = validate_trip(nodes_i, edges_i, conns, stops, routes, trips, "16.05.2019")
    if flag==0:
        success+=1
        count+=1
    if flag==1:
        count+=1
    nodes_i = nodes1.toPandas()[str(i)].item()
    edges_i = ast.literal_eval(edges1.toPandas()['_corrupt_record'].item())[str(i)]
    flag = validate_trip(nodes_i, edges_i, conns, stops, routes, trips, "16.05.2019")
    if flag==0:
        success1+=1
        count1+=1
    if flag==1:
        count1+=1
    nodes_i = nodes_b.toPandas()[str(i)].item()
    edges_i = ast.literal_eval(edges_b.toPandas()['_corrupt_record'].item())[str(i)]
    flag = validate_trip(nodes_i, edges_i, conns, stops, routes, trips, "16.05.2019")
    if flag==0:
        success_base+=1
        count_base+=1
    if flag==1:
        count_base+=1

Couldn't find connection!
Trip SUCCESSFUL!
Connection MISSED!
Trip SUCCESSFUL!
Trip SUCCESSFUL!
Couldn't find connection!
Couldn't find connection!
Couldn't find connection!
Trip SUCCESSFUL!
Couldn't find connection!
Couldn't find connection!
Couldn't find connection!
Trip SUCCESSFUL!
Couldn't find connection!
Couldn't find connection!
Trip SUCCESSFUL!
Trip SUCCESSFUL!
Couldn't find connection!
Couldn't find connection!
Couldn't find connection!
Connection MISSED!
Couldn't find connection!
Couldn't find connection!
Couldn't find connection!
Trip SUCCESSFUL!
Trip SUCCESSFUL!
Trip SUCCESSFUL!
Couldn't find connection!
Couldn't find connection!
Trip SUCCESSFUL!

In [18]:
print("Fraction of successful trips using shortest path algo: ", success_base*1.0/count_base)
print("Fraction of successful trips using our algo, confidence = 0.80: ", success*1.0/count)
print("Fraction of successful trips using our algo, confidence = 0.96: ", success1*1.0/count1)

('Fraction of successful trips using shortest path algo: ', 0.6)
('Fraction of successful trips using our algo, confidence = 0.80: ', 1.0)
('Fraction of successful trips using our algo, confidence = 0.96: ', 1.0)