# Results Validation

This section validates our algorithm. We picked a number of routes and measured the actual probability of success whenever the user has to make a transfer. For each transfer, we extract from the historical data the total number of trips recorded at the transfer station for the scheduled arrival time (the validation cells filter on the stop and on `arrival_time`). The number of successful trips is estimated as the number of trips where the actual arrival time plus the walking time plus the transfer time (2 minutes) was less than the departure time of the next leg of the trip. Our algorithm is considered correct when the actual probability of success is greater than or equal to the confidence probability that the user picked.
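
The criterion can be sketched in plain Python, outside Spark. This is only an illustration: the helper names `transfer_succeeds` and `success_probability` and the observed arrival times are hypothetical, and only the 2-minute transfer time and the strict "less than" comparison come from the description above.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
TRANSFER_TIME = timedelta(minutes=2)  # fixed transfer time stated in the text


def transfer_succeeds(actual_arrival, next_departure, walking_minutes=0):
    """True when arrival + walking + transfer time is strictly before the next departure."""
    ready = (datetime.strptime(actual_arrival, FMT)
             + timedelta(minutes=walking_minutes)
             + TRANSFER_TIME)
    return ready < datetime.strptime(next_departure, FMT)


def success_probability(actual_arrivals, next_departure, walking_minutes=0):
    """Fraction of observed arrivals that still make the connection, or None without data."""
    if not actual_arrivals:
        return None
    ok = sum(transfer_succeeds(a, next_departure, walking_minutes) for a in actual_arrivals)
    return float(ok) / len(actual_arrivals)


# Hypothetical observed arrival times for one scheduled trip (not taken from the dataset).
observed = ["11:59:10", "12:00:30", "12:02:00", "11:58:45"]
p = success_probability(observed, "12:09:00", walking_minutes=6)
is_valid = p >= 0.80  # compare against the confidence level picked by the user
```

With these made-up observations, three of the four arrivals leave enough time for a 6-minute walk plus the 2-minute transfer, so the 80% threshold is not met.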

In [1]:
%%configure -f
{"driverMemory": "6g",
"executorMemory": "6g",
"conf": {"spark.app.name": "miaou_final"},
"kind": "pyspark"}


In [2]:
# Initialisation

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
8919,application_1589299642358_3451,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [3]:
#import necessary table
# df_final = spark.read.orc("/user/ellouz/df_final9.orc") #table with lognormal parameters
walking_table = spark.read.orc('/user/abourjei/walking_table.orc') #table that link stations within 500m of each other
edges=spark.read.parquet("/user/eckes/edges.parquet") #edges between stations in the same trip ID
sbb_filtered = spark.read.orc("/user/ellouz/sbb_filtered_final.orc")#historical data
important_columns = ['stop_id_s', 'trip_id', 'arrival_time', 'departure_time', 'next_stop_id_s', 'next_trip_id',\
                'next_arrival_time', 'next_departure_time', 'Travel_time', 'stop_name', 'arrival_time_80',\
               'arrival_time_85', 'arrival_time_90', 'arrival_time_95', 'arrival_time_99']

In [4]:
from pyspark.sql import Window
from pyspark.sql.functions import lit

#create edges in list format
edges_list=edges.rdd.map(lambda x : (x[0] , x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8], x[9], x[10], x[11], x[12], x[13], x[14])).collect()
walking_table=walking_table.toPandas()

In [5]:
import pandas as pd
import itertools
from collections import defaultdict
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from datetime import datetime, timedelta
import copy
pd.set_option('display.max_columns', None)

important_columns = ['stop_id_s', 'trip_id', 'arrival_time', 'departure_time', 'next_stop_id_s', 'next_trip_id',\
                'next_arrival_time', 'next_departure_time', 'Travel_time', 'stop_name', 'arrival_time_80',\
               'arrival_time_85', 'arrival_time_90', 'arrival_time_95', 'arrival_time_99']

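class Graph:
    """ Class representing the transport network as a graph given a schedule.
    Attributes:
        start_node: the station identifier of the starting point.
        goal_node: the station identifier of the destination.
        end_time: the desired arrival time defined by the user.
        confidence_proba: the confidence level picked by the user, in percent (80, 85, 90, 95 or 99).
        filter_edges_list: the edges kept after filtering on the time window and the confidence level.
        final_edges_list: the filtered edges plus the waiting and walking edges.
    """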
    def __init__(self, start_node, goal_node, edges_list, end_time, confidence_proba):
        self.start_node = start_node
        self.goal_node = goal_node
        self.end_time = end_time
        self.confidence_proba = confidence_proba
        self.filter_edges_list = self.filtering(edges_list, end_time)
        self.final_edges_list = self.new_edges(self.filter_edges_list)
        
    def __repr__(self):
        """ Build the SciPy sparse adjacency matrix of the graph and a dictionary mapping each node to its index in the matrix.
        """
        edges_list = self.filter_edges_list[:]
        length = len(edges_list)
        final_edges_list = self.final_edges_list[:]
        nodes = {}
        for i,edge in enumerate(edges_list):
            nodes[edge[:4]] = i
        
        col = []
        row = []
        data = []
        for edge in final_edges_list:
            if edge[4] is not None:
                try:
                    row.append(nodes[edge[4:8]])
                    col.append(nodes[edge[:4]])
                    data.append(edge[8])
                except KeyError:
                    pass
        
        adj_matrix = csr_matrix((data,(row,col)), shape=(length,length))
        
        return adj_matrix, nodes
    
    def filtering(self, edge_list, end_time):
        """ Keep the edges whose time at index 3 (departure_time in important_columns) falls within the 2 hours up to
        end_time, and keep only the arrival time column matching the confidence probability picked by the user.
        Parameters:
            edge_list: the existing list of edges between stations.
            end_time: the desired arrival time defined by the user.
        Return:
            The filtered edge list.
        """
        start_time = datetime.strftime(datetime.strptime(end_time, "%H:%M:%S")-timedelta(hours=2), "%H:%M:%S")
        keep_col = important_columns.index('arrival_time_'+str(self.confidence_proba))
        return [(edges[:9] + (edges[9], edges[keep_col])) for edges in edges_list if (edges[3]<=end_time)&(edges[3]>start_time)]
        
    def new_edges(self, edges_list):
        """ Create edges between nodes of same station and edges between stations that are 10 minutes of walking.
        Parameters: 
            edges_list: the existing list of edges between stations.
        Return:
            The edge list that is used for the graph representation.
        """
        final_edges = []
        d = defaultdict(list)
        for edge in self.filter_edges_list:
            d[edge[0]].append((edge[1:3] + (edge[3], edge[10])))
        
        # Create edges between nodes of same station
        for key, value in d.items():
            
            # Can't wait in the starting station. Need to take the latest possible transport that can arrive at destination.
            if (key!=self.start_node):

                value.sort(key=lambda tup: tup[3])
                comb = itertools.combinations(value, 2)
            
                for pair in comb:
                    if (datetime.strptime(pair[0][1], "%H:%M:%S"))+timedelta(minutes=2)<(datetime.strptime(pair[1][1], "%H:%M:%S")):
                        weight = ((datetime.strptime(pair[1][3], "%H:%M:%S")-datetime.strptime(pair[0][3], "%H:%M:%S")).seconds//60)%60
                        if weight > 2:
                            final_edges.append((key, pair[0][0], pair[0][1], pair[0][2], key, pair[1][0], pair[1][1], pair[1][2], weight, pair[0][3]))
                     
            # The wait at the arrival station is zero. Guarantee that the user can arrive before the end time and that no useless transport are taken. 
            if (key==self.goal_node):
                value.sort(key=lambda tup: tup[3])
                comb = itertools.combinations(value, 2)
            
                for pair in comb:
                    if (pair[0][1]<pair[1][1]):
                        final_edges.append((key, pair[0][0], pair[0][1], pair[0][2], key, pair[1][0], pair[1][1], pair[1][2], 0, pair[0][3]))
            
            # Create edges between nodes that are reachable by walking
            df_searchable = walking_table.set_index("stop_id_s")
            walk_edges = df_searchable[df_searchable.index==key]
            
            if not walk_edges.empty:
                for _, row in walk_edges.iterrows():
                    for time in value:
                        if row.close_stop_id_s in d:
                            close_time = list(d.get(row.close_stop_id_s))
                            close_time.sort(key=lambda tup: tup[3])
                            possible_time = datetime.strptime(time[3], "%H:%M:%S")+timedelta(minutes=row.time+2)
                        
                            while (close_time) and (datetime.strptime(close_time[0][3], "%H:%M:%S")<possible_time):
                                close_time.pop(0)
                        
                            if close_time:
                                i=0
                                while (i<len(close_time)) and (datetime.strptime(close_time[i][3], "%H:%M:%S")<possible_time+timedelta(minutes=2)):
                                    if (datetime.strptime(time[1], "%H:%M:%S")+timedelta(minutes=row.time+2))<(datetime.strptime(close_time[i][1], "%H:%M:%S")):
                                        weight = ((datetime.strptime(close_time[i][3], "%H:%M:%S")-datetime.strptime(time[3], "%H:%M:%S")).seconds//60)%60 
                                        final_edges.append((key, time[0], time[1], time[2], row.close_stop_id_s, close_time[i][0], close_time[i][1], close_time[i][2], weight, time[3]))
                                    i += 1
            
        return final_edges+self.filter_edges_list
            
    def make_itinary(self):
        """ Build the itinerary from the destination. The shortest path algorithm is run backwards.
        Return: 
             predecessors: the list of predecessors used to reconstruct the shortest path.
             stop_idx: the index of the destination node in the graph representation.
             nodes: dictionary mapping each node to its index in the matrix.
        """
        graph, nodes = self.__repr__()
        stop = [edge for edge in self.final_edges_list if edge[4]==self.goal_node]
        stop.sort(key=lambda tup: (tup[9], tup[6]), reverse=True)
        compt = 0
        while stop[compt][7] > self.end_time:
            compt += 1
        stop_idx = nodes[stop[compt][4:8]]

        dist, predecessors = shortest_path(csgraph=graph, indices=stop_idx, directed=True, unweighted=False, return_predecessors=True)
        
        return predecessors, stop_idx, nodes
    
    def show_itinary(self):
        """ DataFrame listing each intermediate station of the itinerary with its trip id and times.
        """
        df_itinary = pd.DataFrame(columns=['stop_id_s', 'stop_name', 'trip_id', 'arrival_time', 'departure_time'])
        predecessors, stop_idx, nodes = self.make_itinary()

        start = [edge for edge in self.final_edges_list if edge[0]==self.start_node]
        start.sort(key=lambda tup: (tup[3], tup[2]), reverse=True)
        start_idx = nodes[start[0][:4]]
        
        i=0
        while predecessors[start_idx]==-9999:
            i += 1
            start_idx = nodes[start[i][:4]]
        
        idx = start_idx
        compt = 0
        while (idx != stop_idx):
            info = self.filter_edges_list[idx]
            df_itinary.loc[compt] = [info[0], info[9], info[1], info[2], info[3]]
            idx = predecessors[idx]
            compt += 1
                
        info = self.filter_edges_list[idx]
        df_itinary.loc[compt] = [info[0], info[9], info[1], info[2], info[3]]
        
        return df_itinary
    
    def clean_itinary(self):
        """ DataFrame representing the itinerary with the stations, trip ids, times and the connection changes to be made.
        """
        
        df_clean_itinary = pd.DataFrame(columns=['stop_id_s', 'stop_name', 'trip_id', 'arrival_time', 'departure_time'])
        df_itinary = self.show_itinary()
        
        if (df_itinary.trip_id.loc[0]==df_itinary.trip_id.loc[1]):
            df_clean_itinary.loc[0] = [df_itinary["stop_id_s"].loc[0], df_itinary["stop_name"].loc[0], df_itinary["trip_id"].loc[0], None, df_itinary["departure_time"].loc[0]]
            compt = 1
        else:
            df_clean_itinary.loc[0] = [df_itinary["stop_id_s"].loc[0], df_itinary["stop_name"].loc[0], "walking", None, df_itinary["departure_time"].loc[0]]
            df_clean_itinary.loc[1] = [df_itinary["stop_id_s"].loc[1], df_itinary["stop_name"].loc[1], df_itinary["trip_id"].loc[1], None, df_itinary["departure_time"].loc[1]]
            compt = 2
        
        trip1 = list(df_itinary.trip_id)[1:-1]
        trip2 = list(df_itinary.trip_id)[2:]
        
        # When there is a change in the trip
        changes = [idx for idx, (t1, t2) in enumerate(zip(trip1, trip2)) if t1!=t2]

        for change in changes:
            # Case where the change occurs in the same station: this means a trip connection
            if (df_itinary["stop_id_s"].loc[change+1] == df_itinary["stop_id_s"].loc[change+2]):
                df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+1], df_itinary["stop_name"].loc[change+1], df_itinary["trip_id"].loc[change+1], df_itinary["arrival_time"].loc[change+1], None]
                compt += 1
                df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+2], df_itinary["stop_name"].loc[change+2], df_itinary["trip_id"].loc[change+2], None, df_itinary["departure_time"].loc[change+2]]
                compt += 1
            # Case where the change occurs between different stations: this means a walking connection between two stations
            else:
                if change+1 in changes:
                    if (df_itinary["stop_id_s"].loc[change] == df_itinary["stop_id_s"].loc[change+1]):
                        df_clean_itinary = df_clean_itinary[:-1]
                        compt -= 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+1], df_itinary["stop_name"].loc[change+1], df_itinary["trip_id"].loc[change], df_itinary["arrival_time"].loc[change], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+3], df_itinary["stop_name"].loc[change+3], "Walking", df_itinary["arrival_time"].loc[change+3], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+3], df_itinary["stop_name"].loc[change+3], df_itinary["trip_id"].loc[change+3], None, df_itinary["departure_time"].loc[change+2]]
                        compt += 1
                    else:
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+1], df_itinary["stop_name"].loc[change+1], df_itinary["trip_id"].loc[change+1], df_itinary["arrival_time"].loc[change+1], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+3], df_itinary["stop_name"].loc[change+3], "Walking", df_itinary["arrival_time"].loc[change+3], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+3], df_itinary["stop_name"].loc[change+3], df_itinary["trip_id"].loc[change+3], None, df_itinary["departure_time"].loc[change+3]]
                        compt += 1
                        
                    changes.remove(change+1)
                else:
                    if (df_itinary["stop_id_s"].loc[change] == df_itinary["stop_id_s"].loc[change+1]):
                        df_clean_itinary = df_clean_itinary[:-1]
                        compt -= 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+1], df_itinary["stop_name"].loc[change+1], df_itinary["trip_id"].loc[change], df_itinary["arrival_time"].loc[change], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+2], df_itinary["stop_name"].loc[change+2], "Walking", df_itinary["arrival_time"].loc[change+2], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+2], df_itinary["stop_name"].loc[change+2], df_itinary["trip_id"].loc[change+2], None, df_itinary["departure_time"].loc[change+2]]
                        compt += 1
                    else:
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+1], df_itinary["stop_name"].loc[change+1], df_itinary["trip_id"].loc[change+1], df_itinary["arrival_time"].loc[change+1], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+2], df_itinary["stop_name"].loc[change+2], "Walking", df_itinary["arrival_time"].loc[change+2], None]
                        compt += 1
                        df_clean_itinary.loc[compt] = [df_itinary["stop_id_s"].loc[change+2], df_itinary["stop_name"].loc[change+2], df_itinary["trip_id"].loc[change+2], None, df_itinary["departure_time"].loc[change+2]]
                        compt += 1
        
        if (df_itinary["stop_id_s"].loc[len(df_itinary)-1]==df_itinary["stop_id_s"].loc[len(df_itinary)-2]):
            i = len(df_itinary)-2
            while df_itinary["stop_id_s"].loc[len(df_itinary)-1]==df_itinary["stop_id_s"].loc[i]:
                if i==len(df_itinary)-2:
                    df_clean_itinary = df_clean_itinary[:-1]
                else:
                    df_clean_itinary = df_clean_itinary[:-1]
                    df_clean_itinary = df_clean_itinary[:-1]
                i -= 1
            
        else:
            df_clean_itinary.loc[len(df_itinary)-1] = [df_itinary["stop_id_s"].loc[len(df_itinary)-1], df_itinary["stop_name"].loc[len(df_itinary)-1], "Walking", df_itinary["arrival_time"].loc[len(df_itinary)-1], None]
                
        
        
        return df_clean_itinary

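
The waiting edges created in `new_edges` rely on two small rules: a connection needs more than 2 minutes between the two nodes, and the edge weight is the time difference in minutes. The standalone sketch below isolates these rules so they can be checked by hand. The function names `can_connect` and `wait_weight` are hypothetical; the arithmetic mirrors the expressions used in the class.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
MIN_TRANSFER = timedelta(minutes=2)  # same 2-minute margin as in new_edges


def can_connect(first_time, second_time):
    """Mirror of the same-station test: first time + 2 minutes must be strictly before the second."""
    return datetime.strptime(first_time, FMT) + MIN_TRANSFER < datetime.strptime(second_time, FMT)


def wait_weight(first_time, second_time):
    """Mirror of the weight expression: whole minutes between the two times, modulo 60."""
    delta = datetime.strptime(second_time, FMT) - datetime.strptime(first_time, FMT)
    return (delta.seconds // 60) % 60


short_wait = wait_weight("12:00:00", "12:10:00")  # 10 minutes
long_wait = wait_weight("12:00:00", "13:15:00")   # 75 minutes, wrapped by the modulo
```

Because of the `% 60`, a wait of 75 minutes gets the same weight as a wait of 15 minutes. With a 2-hour search window such waits can occur, so this is worth keeping in mind when reading the resulting itineraries.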

### Route 1 from Zürich HB to Zürich, Auzelg

On the SBB app, the fastest route that arrives before the selected time is not chosen by our algorithm, because its probability of success (12%) is much lower than the user selection (80%). Our algorithm therefore finds another route.

In [16]:
graph=Graph('8503000','8591049',edges_list,"12:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s                 stop_name  ... arrival_time departure_time
0   8503000                 Zürich HB  ...         None       11:55:00
1   8503006           Zürich Oerlikon  ...     11:59:00           None
2   8591382  Zürich, Sternen Oerlikon  ...     12:09:00           None
3   8591382  Zürich, Sternen Oerlikon  ...         None       12:09:00
4   8591049            Zürich, Auzelg  ...     12:17:00           None

[5 rows x 5 columns]

In [17]:
df_itinary['trip_id']

0     528.TA.26-8-A-j19-1.344.H
1     528.TA.26-8-A-j19-1.344.H
2                       Walking
3    1915.TA.26-11-A-j19-1.27.R
4    1915.TA.26-11-A-j19-1.27.R
Name: trip_id, dtype: object

In [93]:
stop= '8503006'
arrival_time="11:59:00"
departure_time="12:03:00" #departure time of next_train - walking time (6mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.99. Success: 1735, Total: 1750

Below is the best option on the SBB app (the one that arrives the latest). The algorithm doesn't pick it because its probability of success is lower than the user selection.

In [94]:
stop= '8503006'
arrival_time="12:11:00"
departure_time="12:11:00" #departure time of next_train - walking time (4mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<=df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.12. Success: 93, Total: 749

### Route 2 from Zürich HB to Herrliberg, Vogtei

Our algorithm selects the same option as the SBB app, and the probability of success (96%) is higher than the user selection (80%).

In [55]:
graph=Graph('8503000','8588010',edges_list,"12:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

   stop_id_s                       stop_name                   trip_id  \
0    8503000                       Zürich HB  101.TA.26-6-A-j19-1.34.R   
1    8503103           Herrliberg-Feldmeilen  101.TA.26-6-A-j19-1.34.R   
2    8590647  Herrliberg-Feldmeilen,Bhf West                   Walking   
3    8590647  Herrliberg-Feldmeilen,Bhf West    67.TA.26-972-j19-1.2.H   
12   8588010              Herrliberg, Vogtei                   Walking   

   arrival_time departure_time  
0          None       12:00:00  
1      12:18:00           None  
2      12:23:00           None  
3          None       12:23:00  
12     12:27:00           None

In [92]:
stop= '8503103'
arrival_time="12:19:00"
departure_time="12:21:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.96. Success: 243, Total: 254

### Route 3 from Birmensdorf ZH, Sternen/WSL to Zürich, Neumarkt

Here we wanted to test that the probability of success is higher than the user selection on both legs of the trip. Unfortunately, the historical data contains no records for the first leg. The second leg meets our selection (98% against 80%). The algorithm doesn't select the SBB option because it leaves less than 2 minutes for a transfer.

In [75]:
graph=Graph('8503578','8591287',edges_list,"14:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s                    stop_name                    trip_id  \
0   8503578  Birmensdorf ZH, Sternen/WSL     12.TA.26-350-j19-1.1.H   
1   8503610              Zürich, Triemli     12.TA.26-350-j19-1.1.H   
2   8503610              Zürich, Triemli  819.TA.26-14-A-j19-1.10.R   
3   8591381          Zürich, Stauffacher  819.TA.26-14-A-j19-1.10.R   
4   8591381          Zürich, Stauffacher    826.TA.26-3-A-j19-1.2.H   
5   8591287             Zürich, Neumarkt    826.TA.26-3-A-j19-1.2.H   

  arrival_time departure_time  
0         None       13:53:00  
1     13:59:00           None  
2         None       14:02:00  
3     14:13:00           None  
4         None       14:18:00  
5     14:27:00           None

In [15]:
stop= '8503610'
arrival_time="14:00:00"
departure_time="14:02:00" #departure time of next_train - walking time

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<=df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

An error was encountered:
float division by zero
Traceback (most recent call last):
ZeroDivisionError: float division by zero



The division by zero occurs because there is no historical data for the leg between Birmensdorf ZH, Sternen/WSL and Zürich, Triemli, so the total count is 0.
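
One way to avoid this error is to guard the division when a stop has no matching records. The sketch below is a suggestion rather than part of the original notebook: the helper names `success_ratio` and `describe` are hypothetical, and the formatting string is copied from the validation cells.

```python
def success_ratio(success, total):
    """Return the rounded share of successful transfers, or None when there is no data."""
    if total == 0:
        return None
    return round(float(success) / total, 2)


def describe(success, total):
    """Build the same message as the validation cells, with a fallback for empty results."""
    ratio = success_ratio(success, total)
    if ratio is None:
        return "No historical data for this stop and arrival time."
    return "Probability of success is {}. Success: {}, Total: {}".format(ratio, success, total)


empty_case = describe(0, 0)
second_leg = describe(588, 600)  # counts reported for Zürich, Stauffacher in this route
```

In the Spark cells, `success` and `total` would be the two `count()` results, so the guard only changes the last two lines of each validation cell.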

In [91]:
stop= '8591381'
arrival_time="14:13:00"
departure_time="14:16:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.98. Success: 588, Total: 600

### Route 4 from Zürich HB to Uster, Zentralstrasse

Our algorithm selects the same option as the SBB app, and the probability of success (760 of 763 trips, which rounds to 100%) is higher than the user selection (80%).

In [11]:
graph=Graph('8503000','8587952',edges_list,"13:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s              stop_name                  trip_id arrival_time  \
0   8503000              Zürich HB  261.TA.26-14-j19-1.41.H         None   
1   8503125                  Uster  261.TA.26-14-j19-1.41.H     13:05:00   
2   8573504         Uster, Bahnhof                  Walking     13:15:00   
3   8573504         Uster, Bahnhof   42.TA.26-842-j19-1.1.H         None   
4   8587952  Uster, Zentralstrasse   42.TA.26-842-j19-1.1.H     13:16:00   

  departure_time  
0       12:42:00  
1           None  
2           None  
3       13:15:00  
4           None

In [90]:
stop= '8503125'
arrival_time="13:05:00"
departure_time="13:10:00" #departure time of next_train - walking time (5mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 1.0. Success: 760, Total: 763

### Route 5 from Zürich HB to Zürich, Rote Fabrik

At a 90% confidence level, our algorithm doesn't suggest the more direct route shown on the SBB app; it selects another route that meets the user selection on all legs of the trip. At 80%, it selects the same route as SBB. When we calculate the actual probability for this more direct route, it is 67%, which is lower than the user selection of 80%. The algorithm therefore fails here, but that is expected in some cases, since we are using a distribution that is more general than the distribution at this particular stop.

In [18]:
graph=Graph('8503000','8587347',edges_list,"13:30:00",90)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s              stop_name                     trip_id arrival_time  \
0   8503000              Zürich HB    458.TA.26-12-j19-1.165.R         None   
1   8503003     Zürich Stadelhofen    458.TA.26-12-j19-1.165.R     12:49:00   
2   8503059  Zürich Stadelhofen FB                     Walking     13:04:00   
3   8503059  Zürich Stadelhofen FB  1887.TA.26-11-A-j19-1.27.R         None   
4   8591105    Zürich, Bürkliplatz  1887.TA.26-11-A-j19-1.27.R     13:08:00   
5   8591105    Zürich, Bürkliplatz      96.TA.26-165-j19-1.1.H         None   
6   8587347    Zürich, Rote Fabrik      96.TA.26-165-j19-1.1.H     13:20:00   

  departure_time  
0       12:46:00  
1           None  
2           None  
3       13:04:00  
4           None  
5       13:13:00  
6           None

In [95]:
stop= '8503003'
arrival_time="12:49:00"
departure_time="13:02:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 1.0. Success: 505, Total: 505

In [96]:
stop= '8503059'
arrival_time="13:04:00"
departure_time="13:06:00" #departure time of next_train - walking time (4mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.9. Success: 36, Total: 40

The probability of success for both transfers of the trip is at least 90% (100% and 90%).

In [120]:
graph=Graph('8503000','8587347',edges_list,"13:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s                      stop_name                    trip_id  \
0   8503000                      Zürich HB  667.TA.26-8-A-j19-1.357.R   
1   8503009             Zürich Wollishofen  667.TA.26-8-A-j19-1.357.R   
2   8591080  Zürich Wollishofen, Bhf (Bus)                    Walking   
3   8591080  Zürich Wollishofen, Bhf (Bus)     96.TA.26-165-j19-1.1.H   
4   8587347            Zürich, Rote Fabrik     96.TA.26-165-j19-1.1.H   

  arrival_time departure_time  
0         None       13:07:00  
1     13:14:00           None  
2     13:18:00           None  
3         None       13:18:00  
4     13:20:00           None

In [121]:
stop= '8503009'
arrival_time="13:14:00"
departure_time="13:15:00" #departure time of next_train - walking time (4mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.67. Success: 178, Total: 264

### Route 6 Zürich HB to Effretikon, Lindenwiese

Our algorithm doesn't suggest the more direct route from the SBB app at the 85% confidence level and instead selects another route. At 80%, it selects the same route as SBB. When we calculate the actual probability for this more direct route, it is 75%, which is lower than the user's selection of 80%. The algorithm therefore fails here, but this is expected in some cases, since we use a distribution that is more general than the distribution at this particular stop.
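
The pass/fail criterion used throughout this section can be sketched in plain Python. This is a minimal illustration and not the notebook's Spark code: the helper name `meets_confidence` is hypothetical, and the counts are the ones printed by the validation cell for this route.

```python
def meets_confidence(success, total, confidence_percent):
    """Return True if the observed success rate reaches the requested confidence.

    Integer arithmetic avoids floating-point surprises at the boundary
    (for example 36 of 40 against a 90% threshold).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return success * 100 >= confidence_percent * total


# Route 6, direct route: 196 successful transfers out of 262 observations.
print(meets_confidence(196, 262, 80))  # False: 196/262 is about 0.748
```

Note that the notebook rounds the ratio to two decimals before printing, so the displayed 0.75 slightly overstates the unrounded value.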

In [37]:
graph=Graph('8503000','8575921',edges_list,"15:30:00",85)
df_itinary = graph.clean_itinary()
df_itinary

   stop_id_s                  stop_name                  trip_id arrival_time  \
0    8503000                  Zürich HB                  walking         None   
1    8503088              Zürich HB SZU   78.TA.26-4-B-j19-1.2.H         None   
2    8503088              Zürich HB SZU   78.TA.26-4-B-j19-1.2.H     13:58:00   
3    8587348    Zürich, Bahnhofplatz/HB                  Walking     14:01:00   
4    8587348    Zürich, Bahnhofplatz/HB  829.TA.26-3-A-j19-1.2.H         None   
5    8591233          Zürich, Klusplatz  829.TA.26-3-A-j19-1.2.H     14:12:00   
6    8591233          Zürich, Klusplatz  264.TA.26-704-j19-1.9.H         None   
7    8576127  Schwerzenbach ZH, Bahnhof  264.TA.26-704-j19-1.9.H     14:37:00   
8    8576127  Schwerzenbach ZH, Bahnhof   26.TA.26-720-j19-1.3.R         None   
9    8575918        Effretikon, Bahnhof   26.TA.26-720-j19-1.3.R     15:05:00   
10   8575918        Effretikon, Bahnhof  538.TA.26-652-j19-1.9.R         None   
41   8575921    Effretikon, 

In [39]:
graph=Graph('8503000','8575921',edges_list,"15:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s                stop_name                  trip_id arrival_time  \
0   8503000                Zürich HB   407.TA.26-3-j19-1.37.R         None   
1   8503305               Effretikon   407.TA.26-3-j19-1.37.R     15:20:00   
2   8575918      Effretikon, Bahnhof                  Walking     15:24:00   
3   8575918      Effretikon, Bahnhof  538.TA.26-652-j19-1.9.R         None   
8   8575921  Effretikon, Lindenwiese                  Walking     15:27:00   

  departure_time  
0       15:04:00  
1           None  
2           None  
3       15:24:00  
8           None

In [98]:
stop= '8503305'
arrival_time="15:20:00"
departure_time="15:21:00" #departure time of next_train - walking time (3mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.75. Success: 196, Total: 262

### Route 7 Zürich HB to Zürich, Waldhaus Dolder

Our algorithm doesn't suggest the more direct route from the SBB app at the 95% confidence level and instead selects another route that meets the user's selection. At 80%, it selects the same route as SBB; the measured probability for that transfer is based on only one data point and is therefore not reliable. We consider that the algorithm passes the test.

In [70]:
graph=Graph('8503000','8591421',edges_list,"15:30:00",95)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s                stop_name                    trip_id arrival_time  \
0   8503000                Zürich HB   156.TA.26-9-A-j19-1.64.R         None   
1   8503003       Zürich Stadelhofen   156.TA.26-9-A-j19-1.64.R     15:00:00   
2   8503059    Zürich Stadelhofen FB                    Walking     15:07:00   
3   8503059    Zürich Stadelhofen FB  3694.TA.26-8-C-j19-1.27.H         None   
4   8503083         Zürich, Römerhof  3694.TA.26-8-C-j19-1.27.H     15:11:00   
5   8503083         Zürich, Römerhof    37.TA.26-25-A-j19-1.1.R         None   
9   8591421  Zürich, Waldhaus Dolder                    Walking     15:24:00   

  departure_time  
0       14:58:00  
1           None  
2           None  
3       15:07:00  
4           None  
5       15:21:00  
9           None

In [99]:
stop= '8503003'
arrival_time="15:00:00"
departure_time="15:05:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 1.0. Success: 502, Total: 502

In [101]:
stop= '8503083'
arrival_time="15:11:00"
departure_time="15:19:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 1.0. Success: 959, Total: 963

In [71]:
graph=Graph('8503000','8591421',edges_list,"15:30:00",80)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s                stop_name                  trip_id arrival_time  \
0   8503000                Zürich HB                  walking         None   
1   8588078          Zürich, Central  815.TA.26-3-A-j19-1.2.H         None   
2   8503083         Zürich, Römerhof  815.TA.26-3-A-j19-1.2.H     15:17:00   
3   8503083         Zürich, Römerhof  37.TA.26-25-A-j19-1.1.R         None   
8   8591421  Zürich, Waldhaus Dolder                  Walking     15:24:00   

  departure_time  
0       15:07:00  
1       15:11:00  
2           None  
3       15:21:00  
8           None

In [123]:
stop= '8503083'
arrival_time="15:17:00"
departure_time="15:19:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.0. Success: 0, Total: 1

Only one matching observation exists in the dataset, so this estimate is not reliable.
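
To quantify how unreliable a single observation is, one option is to attach a confidence interval to the estimated probability. The sketch below uses the Wilson score interval, which is not part of the original notebook; `wilson_interval` is a hypothetical helper and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(success, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed 95% level)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = success / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to absorb tiny floating-point overshoots.
    return max(0.0, centre - half), min(1.0, centre + half)


# One observation (0 of 1): the interval covers most of [0, 1].
print(wilson_interval(0, 1))
# Several hundred observations (502 of 502): the interval is narrow.
print(wilson_interval(502, 502))
```

With a single data point the interval runs from 0 to roughly 0.79, so the measured value says very little about whether the 80% selection is met.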

### Route 8 Zürich HB to Zürich, Germaniastrasse

Based on actual probabilities, the algorithm fails to meet the user's 95% selection on one leg of the trip: the transfer at Zürich, Bellevue succeeds only 78% of the time.

In [102]:
graph=Graph('8503000','8591156',edges_list,"15:30:00",95)
df_itinary = graph.clean_itinary()
df_itinary

   stop_id_s                stop_name                   trip_id arrival_time  \
0    8503000                Zürich HB  461.TA.26-12-j19-1.164.R         None   
1    8503003       Zürich Stadelhofen  461.TA.26-12-j19-1.164.R     14:49:00   
2    8503059    Zürich Stadelhofen FB                   Walking     15:00:00   
3    8503059    Zürich Stadelhofen FB  302.TA.26-15-A-j19-1.5.R         None   
4    8576193         Zürich, Bellevue  302.TA.26-15-A-j19-1.5.R     15:01:00   
5    8576193         Zürich, Bellevue   345.TA.26-9-B-j19-1.1.R         None   
6    8591255     Zürich, Letzistrasse   345.TA.26-9-B-j19-1.1.R     15:15:00   
7    8591255     Zürich, Letzistrasse      4.TA.26-39-j19-1.1.H         None   
18   8591156  Zürich, Germaniastrasse                   Walking     15:27:00   

   departure_time  
0        14:46:00  
1            None  
2            None  
3        15:00:00  
4            None  
5        15:05:00  
6            None  
7        15:21:00  
18           None

In [104]:
stop= '8503003'
arrival_time="14:49:00"
departure_time="14:58:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 1.0. Success: 503, Total: 503

In [117]:
stop= '8576193'
arrival_time="15:01:00"
departure_time="15:03:00" #departure time of next_train - walking time (2mins walk)
transport='Tram'
df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time)&(sbb_filtered.transport_type==transport))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.78. Success: 719, Total: 925

In [106]:
stop= '8591255'
arrival_time="15:15:00"
departure_time="15:19:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.98. Success: 974, Total: 991

### Route 9 Zürich HB to Zürich, Kantonsschule

Our algorithm doesn't pick the more direct route suggested by SBB because the transfer time is equal to the walking time, so the algorithm treats the connection as always missed. It therefore suggests another route that meets the criteria.
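
The feasibility reasoning above can be sketched as follows. The helper names are hypothetical, and the 2-minute buffer is taken from the success criterion stated in the introduction. The times come from the SBB-route check for this route: arrival at 15:11 and a 7-minute walk, with the 15:18 departure of the next vehicle inferred from that cell's comment.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def latest_allowed_arrival(next_departure, walk_minutes, buffer_minutes=2):
    """Latest arrival time that still leaves room for the walk and the buffer."""
    dep = datetime.strptime(next_departure, FMT)
    cutoff = dep - timedelta(minutes=walk_minutes + buffer_minutes)
    return cutoff.strftime(FMT)


def is_feasible_on_schedule(scheduled_arrival, next_departure,
                            walk_minutes, buffer_minutes=2):
    """True if a punctual arrival is strictly before the cutoff."""
    cutoff = latest_allowed_arrival(next_departure, walk_minutes, buffer_minutes)
    return datetime.strptime(scheduled_arrival, FMT) < datetime.strptime(cutoff, FMT)


# Transfer window equal to the walking time: even a punctual arrival is too late.
print(latest_allowed_arrival("15:18:00", 7, buffer_minutes=0))            # 15:11:00
print(is_feasible_on_schedule("15:11:00", "15:18:00", 7, buffer_minutes=0))  # False
```

The strict `<` mirrors the comparison used in most of the validation cells, so an arrival exactly at the cutoff counts as a missed connection.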

In [119]:
graph=Graph('8503000','8591220',edges_list,"15:30:00",90)
df_itinary = graph.clean_itinary()
df_itinary

  stop_id_s              stop_name                    trip_id arrival_time  \
0   8503000              Zürich HB   433.TA.26-11-j19-1.106.H         None   
1   8503003     Zürich Stadelhofen   433.TA.26-11-j19-1.106.H     15:04:00   
2   8503059  Zürich Stadelhofen FB                    Walking     15:11:00   
3   8503059  Zürich Stadelhofen FB    503.TA.26-18-j19-1.10.H     15:07:00   
4   8576193       Zürich, Bellevue                    Walking     15:18:00   
5   8576193       Zürich, Bellevue  1100.TA.26-5-B-j19-1.23.R         None   
6   8591220  Zürich, Kantonsschule  1100.TA.26-5-B-j19-1.23.R     15:21:00   

  departure_time  
0       15:01:00  
1           None  
2           None  
3           None  
4           None  
5       15:18:00  
6           None

In [125]:
stop= '8503003'
arrival_time="15:04:00"
departure_time="15:05:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.93. Success: 245, Total: 264

In [126]:
stop= '8576193'
arrival_time="15:07:00"
departure_time="15:16:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 1.0. Success: 1004, Total: 1005

The check below shows that the route selected by SBB fails the minimum confidence probability.

In [127]:
stop= '8503003'
arrival_time="15:11:00"
departure_time="15:11:00" #departure time of next_train - walking time (7mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.42. Success: 221, Total: 528

### Route 10 Zürich HB to Zürich, Riedbach

Our algorithm doesn't pick the more direct route suggested by SBB because the transfer time is less than the walking time, so the connection is always missed. It therefore suggests another route, which fails to meet the user's selection on one leg of the trip based on actual probabilities.

In [128]:
graph=Graph('8503000','8591318',edges_list,"17:30:00",90)
df_itinary = graph.clean_itinary()
df_itinary

   stop_id_s                 stop_name                     trip_id  \
0    8503000                 Zürich HB   399.TA.26-7-A-j19-1.108.R   
1    8503003        Zürich Stadelhofen   399.TA.26-7-A-j19-1.108.R   
2    8503059     Zürich Stadelhofen FB                     Walking   
3    8503059     Zürich Stadelhofen FB  1980.TA.26-11-A-j19-1.27.R   
4    8580449  Zürich Oerlikon, Bahnhof  1980.TA.26-11-A-j19-1.27.R   
5    8580449  Zürich Oerlikon, Bahnhof    2036.TA.26-781-j19-1.3.R   
25   8591318          Zürich, Riedbach                     Walking   

   arrival_time departure_time  
0          None       16:41:00  
1      16:44:00           None  
2      16:50:00           None  
3          None       16:50:00  
4      17:17:00           None  
5          None       17:21:00  
25     17:26:00           None

In [130]:
stop= '8503003'
arrival_time="16:44:00"
departure_time="16:48:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.97. Success: 986, Total: 1013

In [133]:
stop= '8580449'
arrival_time="17:17:00"
departure_time="17:19:00" #departure time of next_train - walking time (2mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<=df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.6. Success: 295, Total: 489

The check below shows that the route selected by SBB fails the minimum confidence probability.

In [129]:
stop= '8503006'
arrival_time="15:18:00"
departure_time="15:18:00" #departure time of next_train - walking time (6mins walk)

df_val=sbb_filtered.filter(((sbb_filtered.bpuic==stop)&(sbb_filtered.arrival_time==arrival_time))).withColumn('dep_time',lit(departure_time))
df_val=df_val.withColumn('max_arrival',df_val.dep_time.cast('timestamp')).drop(df_val.dep_time)
total=df_val.count()
df_val=df_val.filter(df_val.actual_arrival_time.cast('timestamp')<df_val.max_arrival)
success=df_val.count()
prob_success=round(float(success)/total,2)
print("Probability of success is {}. Success: {}, Total: {}".format(prob_success,success, total))

Probability of success is 0.0. Success: 0, Total: 503

Additional code used for checks

In [65]:
#walking time between two stops
walking_table[(walking_table.stop_id_s == '8503125') & (walking_table.close_stop_id_s == '8573504')]

     stop_id_s close_stop_id_s  time
2021   8503125         8573504     3

In [7]:
stops = spark.read.csv('/data/sbb/timetables/csv/stops/2019/05/14/stops.txt', header=True)
stops=stops.toPandas()

In [15]:
#stop name to check with SBB app
stops[(stops.stop_id == '8503709')]

      stop_id      stop_name          stop_lat          stop_lon  \
5331  8503709  Waldegg, Post  47.3683329937139  8.46200421344404   

     location_type parent_station  
5331          None           None