# Uncertainty-aware journey planner

For the final project of Lab in Data Science, we implement a journey planner that takes the delays of the different means of transportation into account when recommending itineraries.

In this notebook we present our approach.

Imports needed throughout the notebook

In [2]:
%matplotlib inline
import matplotlib.pylab as plt
plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')

In [3]:
import getpass
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.conf.SparkConf()
conf.setMaster('yarn')
conf.setAppName('final_proj-{0}'.format(getpass.getuser()))
conf.set('spark.executor.memory', '4g')
conf.set('spark.executor.instances', '6')
conf.set('spark.executor.cores', '2')
conf.set('spark.port.maxRetries', '100')
sc = pyspark.SparkContext.getOrCreate(conf)
conf = sc.getConf()
sc

In [77]:
import os
import pickle
import requests
import time

import pandas as pd 
import numpy as np
import pyspark.sql.functions as fct

from datetime import datetime
from ipywidgets import interact, interactive, fixed, interact_manual, widgets

from geopy.distance import distance as geo_dist

from pyspark.sql.functions import unix_timestamp, to_timestamp, hour, to_date, date_format, month
from pyspark.sql.types import FloatType, StringType, IntegerType, DoubleType, StructType
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.sql.functions import collect_list, struct, count, lit, when, col
from pyspark.sql.functions import udf

In [5]:
spark = SparkSession(sc)

In [7]:
df = spark.read.csv('/datasets/project/istdaten/*/*/*', sep=';', header=True)

First, we rename the columns to English:

In [8]:
columns = 'TripDate string, TripId string, OperatorId string, OperatorAbbrv string, OperatorName string, ProductId string, LineId string, LineType string, UmlaufId string, TransportType string, AdditionalTrip boolean, FailedTrip boolean, BPUIC string, StopName string, ArrivalTimeScheduled string, ArrivalTimeActual string, ArrivalTimeActualStatus string,     DepartureTimeScheduled string, DepartureTimeActual string, DepartureTimeActualStatus string, SkipStation boolean'
columns = list(map(lambda x: x.split()[0],columns.split(',')))

for old, new in zip(df.columns, columns):
    df = df.withColumnRenamed(old, new)

# Computing the quality of a transfer

## Assumptions: 

   * every time a traveler makes a transfer in a station, they need one minute to actually change transport.
   * even if a train always departs late from a specific station, the trip planner never relies on that fact, so we only take into consideration the early and the on-time departures. 

## Main idea: 

The quality of a specific transfer is computed from the *expected arrival hour* in the station, the *expected departure hour* from that same station, and some *extra information* about the trip before the transfer and the one after it:

   * First, we compute the **discrete distribution of arrival delays $\mathcal{D}_a$** in that station, given the information of the trip before the transfer.
   * Then, we compute the **discrete distribution of negative departure delays $\mathcal{D}_d$** in that station, given the information of the trip after the transfer.
   * Next, we compute the probability of successfully making the transfer by convolving the two distributions. Assuming that the scheduled transfer time is $k$ minutes, we compute:
      
      $\sum\limits_{t_a }\Pr[\mathcal{D}_a = t_a] \cdot \Pr[\mathcal{D}_d = k-1+t_a]$,
      
       where we have taken into consideration the minute needed by the traveler for changing the transport. 
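The steps above can be illustrated with a small, self-contained sketch. It is not the notebook's implementation: the function name, the toy distributions and the exact success condition (arrival delay plus walking time must fit within the scheduled slack, reduced by however early the connection leaves) are assumptions made for illustration.

```python
def transfer_success_probability(arrival_dist, departure_dist, slack_minutes, walk_minutes=1):
    """Probability of catching the connection (hypothetical helper).

    arrival_dist / departure_dist: lists of (delay_minutes, probability).
    A departure delay <= 0 means the connection leaves that many minutes early.
    """
    total = 0.0
    for arr_delay, arr_p in arrival_dist:
        for dep_delay, dep_p in departure_dist:
            # Assumed success condition: the traveler arrives and walks
            # before the (possibly early) connection leaves.
            if arr_delay + walk_minutes <= slack_minutes + dep_delay:
                total += arr_p * dep_p
    return total


# Toy distributions, not taken from the data.
arrival = [(0, 0.5), (1, 0.3), (3, 0.2)]
departure = [(-1, 0.1), (0, 0.9)]
probability = transfer_success_probability(arrival, departure, slack_minutes=2)
print(round(probability, 2))  # 0.77
```

With two minutes of scheduled slack, only arrivals that are at most one minute late leave enough time, and even those fail when the connection leaves a minute early.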
       
---
       
Therefore, we first need to decide which features determine the distributions of the delays. For that, we use a **Decision Tree Regressor**: we select several potentially important features from the data, and the target label is the delay of each datapoint, expressed in seconds. We then train the regressor on both the departures and the arrivals data, and look at which features are the most important in each case for predicting the delay. 

We chose this method because of the way Decision Trees rank features: the most important features are the ones whose values separate the delays best, i.e. the ones whose splits reduce the variance of the delays the most. 
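To make the variance argument concrete, here is a minimal sketch with made-up delays (in seconds). It uses a simplified multiway split per category value, whereas Spark's trees use binary splits, so it only illustrates the idea; `variance_reduction` is a hypothetical helper.

```python
from statistics import pvariance


def variance_reduction(delays_by_value):
    """Drop in delay variance obtained by grouping on one categorical feature."""
    all_delays = [d for group in delays_by_value.values() for d in group]
    total = len(all_delays)
    # Weighted average of the variance remaining inside each group.
    within = sum(len(g) / total * pvariance(g) for g in delays_by_value.values())
    return pvariance(all_delays) - within


# The same six toy delays, grouped once by hour and once by day.
by_hour = {"08": [120, 150, 180], "14": [0, 10, 20]}
by_day = {"Mon": [120, 0, 180], "Tue": [150, 10, 20]}

gain_hour = variance_reduction(by_hour)
gain_day = variance_reduction(by_day)
print(gain_hour > gain_day)  # True: the hour explains the delays better here
```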

After constructing the Decision Tree and deciding which features are the most important, we build the distributions of the delays from the **actual data**, by grouping the datapoints that share the same values for the decisive features and computing the distribution of delays within each group.
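As a rough illustration of this grouping step, the following plain-Python sketch builds one empirical distribution per feature combination. The records and the helper name are made up; the notebook performs the equivalent aggregation in Spark.

```python
from collections import Counter, defaultdict


def empirical_distributions(records):
    """Map (hour, line_id, stop_name) to a sorted list of (delay_minutes, probability)."""
    counts = defaultdict(Counter)
    for hour, line_id, stop_name, delay_minutes in records:
        counts[(hour, line_id, stop_name)][delay_minutes] += 1

    distributions = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        distributions[key] = [(delay, n / total) for delay, n in sorted(counter.items())]
    return distributions


# Toy records: (hour, line_id, stop_name, delay_minutes)
records = [
    (8, "L1", "Stop A", 0),
    (8, "L1", "Stop A", 0),
    (8, "L1", "Stop A", 1),
    (8, "L1", "Stop A", 2),
    (14, "L1", "Stop A", 0),
]
dists = empirical_distributions(records)
print(dists[(8, "L1", "Stop A")])  # [(0, 0.5), (1, 0.25), (2, 0.25)]
```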

We decided to use the actual data instead of modelling the delays with a fixed distribution family (e.g. Log-normal or Gamma distributions), because we consider the empirical distribution more relevant than a fitted estimator or the assumption that the delays follow a given family of distributions.

## Constructing the Decision Tree Regressor

The first step in constructing the Decision Tree Regressor is to derive some potentially important features from the given data, and to compute the delay for each datapoint:

In [None]:
DATE_FORMAT_SCHEDULED = 'dd.MM.yyyy HH:mm' 
DATE_FORMAT_ACTUAL = 'dd.MM.yyyy HH:mm:ss' # both formats are used

df_processed = df.withColumn('ArrivalTimeScheduledDate', to_timestamp(df.ArrivalTimeScheduled, DATE_FORMAT_SCHEDULED))
df_processed = df_processed.withColumn('DepartureTimeScheduledDate', to_timestamp(df_processed.DepartureTimeScheduled, DATE_FORMAT_SCHEDULED))

df_processed = df_processed.withColumn('ArrivalTimeScheduled', unix_timestamp(df_processed.ArrivalTimeScheduled, DATE_FORMAT_SCHEDULED))
df_processed = df_processed.withColumn('ArrivalTimeActual', unix_timestamp(df_processed.ArrivalTimeActual, DATE_FORMAT_ACTUAL))
df_processed = df_processed.withColumn('DepartureTimeScheduled', unix_timestamp(df_processed.DepartureTimeScheduled, DATE_FORMAT_SCHEDULED))
df_processed = df_processed.withColumn('DepartureTimeActual', unix_timestamp(df_processed.DepartureTimeActual, DATE_FORMAT_ACTUAL))

Let's see what the data looks like so far:

In [None]:
df_processed.head()

Next, we also add the hour of departure and of the arrival to the dataset:

In [None]:
df_to_classify = df_processed.select(
    df_processed.LineId.alias('line_id'), 
    df_processed.ProductId.alias('product_id'), 
    df_processed.StopName.alias('stop_name'),
    df_processed.AdditionalTrip.alias('additional_trip'), 
    hour(df_processed.ArrivalTimeScheduledDate).alias("arrival_hour").astype(StringType()),
    hour(df_processed.DepartureTimeScheduledDate).alias("departure_hour").astype(StringType()),
    date_format(to_date(df_processed.TripDate, 'dd.MM.yyyy'), 'u').alias("day_of_week"),
    ((df_processed.ArrivalTimeActual - df_processed.ArrivalTimeScheduled)).alias("delta_arrival").astype(FloatType()),
    ((df_processed.DepartureTimeActual - df_processed.DepartureTimeScheduled)).alias("delta_departure").astype(FloatType()))

df_to_classify.cache()

In [None]:
df_to_classify.head(5)

Next, because every feature is in fact categorical, we must index each of them with a *StringIndexer* before using the Decision Tree Regressor:

In [None]:
def transform_dataset(dataset, departure):
    '''
    Function that transforms a dataset, adding for each categorical feature a column, which represents the output of the 
    StringIndexer applied to that column. 
    
    Parameters:
        - dataset: the dataset to be processed
        - departure: True if the dataset is for departures, False otherwise
    '''
    
    line_id_indexer = StringIndexer(inputCol="line_id", outputCol="line_id_cat", handleInvalid='keep') # keep nulls 
    product_id_indexer = StringIndexer(inputCol="product_id", outputCol="product_id_cat", handleInvalid='skip')
    stop_name_indexer = StringIndexer(inputCol="stop_name", outputCol="stop_name_cat", handleInvalid='skip')
    additional_trip_indexer = StringIndexer(inputCol="additional_trip", outputCol="additional_trip_cat", handleInvalid='skip')
    day_of_week_indexer = StringIndexer(inputCol="day_of_week", outputCol="day_of_week_cat", handleInvalid='skip')
    departure_hour_indexer = StringIndexer(inputCol="departure_hour", outputCol="departure_hour_cat", handleInvalid='skip')
    arrival_hour_indexer = StringIndexer(inputCol="arrival_hour", outputCol="arrival_hour_cat", handleInvalid='skip')

    indexers = [line_id_indexer, product_id_indexer, stop_name_indexer, additional_trip_indexer,day_of_week_indexer]
    
    if departure:
        indexers.append(departure_hour_indexer)
    else:
        indexers.append(arrival_hour_indexer)

    indexed = dataset

    for indexer in indexers:
        indexed = indexer.fit(indexed).transform(indexed) # add columns to dataset
        
    return indexed
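To make the indexing step concrete, the sketch below mimics in plain Python what a `StringIndexer` does with its default ordering: the most frequent label receives index 0.0, the next one 1.0, and so on. The tie-breaking rule (alphabetical) is an assumption of this sketch, and `index_labels` is a hypothetical helper, not a Spark function.

```python
from collections import Counter


def index_labels(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


mapping = index_labels(["8", "17", "8", "12", "8", "17"])
print(mapping)  # {'8': 0.0, '17': 1.0, '12': 2.0}
```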

Next, we use the *VectorAssembler* to construct the column for features, which will be used by the Decision Tree:

In [None]:
def compute_features_column(dataset, is_departure):
    '''
    Function that computes the features column for the given dataset.
    
    Parameters:
        - dataset: the dataset to compute the features column for
        - is_departure: True if the dataset is used for departures, False otherwise.
    '''
    input_cols = ['line_id_cat', 'product_id_cat', 'stop_name_cat', 'additional_trip_cat', 'day_of_week_cat']
    
    if is_departure:
        input_cols.append('departure_hour_cat') # departure dataset
    else:
        input_cols.append('arrival_hour_cat') # arrival dataset
        
    vector_assembler = VectorAssembler(inputCols = input_cols, outputCol = 'features')
    dataset = transform_dataset(dataset, is_departure) # add categorical features
    
    df_features = vector_assembler.transform(dataset) # add features column
    # Use VectorIndexer to make sure that the added features are recognized as categorical
    
    featureIndexer = \
        VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=100000000).fit(df_features)
    
    df_features = featureIndexer.transform(df_features) # transform features to categorical
    
    if is_departure:
        df_final = df_features.select(df_features.indexedFeatures, df_features.delta_departure.alias("delta"))
    else:
        df_final = df_features.select(df_features.indexedFeatures, df_features.delta_arrival.alias("delta"))
    
    return df_final

Finally, we construct our datasets to input to the Decision Tree:

In [None]:
# Construct departures dataset
df_departure_to_regress = df_to_classify.filter(
    df_to_classify.departure_hour.isNotNull() & # filter only departures
    df_to_classify.delta_departure.isNotNull())

df_departure = compute_features_column(df_departure_to_regress, is_departure=True)

# Construct arrivals dataset
df_arrival_to_regress = df_to_classify.filter(
    df_to_classify.arrival_hour.isNotNull() & # filter only arrivals
    df_to_classify.delta_arrival.isNotNull())

df_arrival = compute_features_column(df_arrival_to_regress, is_departure=False)

Let's check the generated dataframes:

In [None]:
df_departure.head(5)

In [None]:
df_arrival.head(5)

Next, we write the function for training the Decision Tree Regressor:

In [None]:
def train_regressor(dataset):
    dt = DecisionTreeRegressor(featuresCol ='indexedFeatures', labelCol = 'delta', maxBins=100000000, maxDepth=3)
    dt_model = dt.fit(dataset)
    
    return dt_model

Finally, we train the decision trees for both datasets and we extract the most important features:

In [None]:
# Get most important features for departures dataset
regressor_departures = train_regressor(df_departure)
print("Feature importances departures: {}".format(regressor_departures.featureImportances))

# Get most important features for arrivals dataset
regressor_arrivals = train_regressor(df_arrival)
print("Feature importances arrivals: {}".format(regressor_arrivals.featureImportances))

We can see that the 3 most important features are, in both cases, the *hour*, the *line_id* and the *stop_name*. This makes sense: delays differ a lot between off-peak hours and rush hours, for example, and specific stops and routes usually have more delays than others.

Therefore, we continue by constructing the probability distributions for each possible value of the three most important features.

## Computing the probability distributions 

First, we keep only the three most important features from the two initial datasets. From now on, the unit of time is the minute instead of the second: 

In [None]:
df_best_feat_departures = df_departure_to_regress.select(
                df_departure_to_regress.departure_hour,
                df_departure_to_regress.stop_name,
                df_departure_to_regress.line_id,
                (df_departure_to_regress.delta_departure / 60).astype(IntegerType()).alias("delta_minutes"))

df_best_feat_departures = df_best_feat_departures.filter(df_best_feat_departures.delta_minutes <= 0) 
# only keep departures which left on time or earlier, we do not want to base our recommendation on assumption
# that a train or bus leaves with a delay.

df_best_feat_arrival = df_arrival_to_regress.select(
                df_arrival_to_regress.arrival_hour,
                df_arrival_to_regress.stop_name,
                df_arrival_to_regress.line_id,
                (df_arrival_to_regress.delta_arrival / 60).astype(IntegerType()).alias("delta_minutes"))

In [None]:
df_best_feat_departures.head(5)
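One detail of the conversion to minutes is worth illustrating. Assuming the integer cast truncates toward zero, as Python's `int()` does, any delay strictly between -60 and +60 seconds becomes 0 minutes. The helper below is a hypothetical plain-Python stand-in for the Spark expression, not part of the notebook.

```python
import math


def seconds_to_minutes(delta_seconds):
    # int() truncates toward zero: -59 s and +59 s both become 0 minutes.
    return int(delta_seconds / 60)


for seconds in (-90, -59, 0, 59, 61):
    print(seconds, seconds_to_minutes(seconds), math.floor(seconds / 60))
```

Under this assumption, a departure that leaves 59 seconds early is counted as on time, whereas flooring would count it as one minute early.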

Finally, we build the distribution of delays for each combination of feature values, for both departures and arrivals:

In [None]:
df_departures_grouped_count = df_best_feat_departures.groupby( 
                df_best_feat_departures.departure_hour,
                df_best_feat_departures.stop_name,
                df_best_feat_departures.line_id,
                df_best_feat_departures.delta_minutes).agg(count(lit(1)).alias("count_min")) # add a count for each possible value
        
df_departures_distribution = df_departures_grouped_count.\
                                    groupby('departure_hour', 'stop_name', 'line_id').\
                                    agg(collect_list(struct('delta_minutes', 'count_min')).alias('counts'))

# for each value of (departure_hour, stop_name, line_id), we have a list of the form [(delay_minutes, count)]

In [None]:
df_departures_grouped_count.head(3)

In [None]:
df_departures_distribution.show(10)

In [None]:
def compute_key_for_feature_values(hour, line_id, stop_name):
    return '{}#{}#{}'.format(hour, line_id, stop_name)

In [None]:
collected = df_departures_distribution.collect()

distribution_departures = {
    compute_key_for_feature_values(x.departure_hour, x.line_id, x.stop_name) : 
    list(sorted(x.counts, key=lambda y: y[0])) for x in collected}

We do the same now for the arrivals: 

In [None]:
df_arrivals_grouped_count = df_best_feat_arrival.groupby( 
                df_best_feat_arrival.arrival_hour,
                df_best_feat_arrival.stop_name,
                df_best_feat_arrival.line_id,
                df_best_feat_arrival.delta_minutes).agg(count(lit(1)).alias("count_min")) # add a count for each possible value
        
df_arrivals_distribution = df_arrivals_grouped_count.\
                                    groupby('arrival_hour', 'stop_name', 'line_id').\
                                    agg(collect_list(struct('delta_minutes', 'count_min')).alias('counts'))
        
collected = df_arrivals_distribution.collect()

distribution_arrivals = {
    compute_key_for_feature_values(x.arrival_hour, x.line_id, x.stop_name) : 
    list(sorted(x.counts, key=lambda y: y[0])) for x in collected}

We also want a default distribution for queries whose feature values were never encountered in the data. We compute it as the distribution over all the data:

In [None]:
df_default_distrib_departures = df_best_feat_departures.groupby('delta_minutes').agg(count(lit(1)).alias("count_min")) # add a count for each possible value
collected_default = df_default_distrib_departures.collect()
default_departures = list(sorted(collected_default, key=lambda x: x[0]))

df_default_distrib_arrivals = df_best_feat_arrival.groupby('delta_minutes').agg(count(lit(1)).alias("count_min")) # add a count for each possible value
collected_default = df_default_distrib_arrivals.collect()
default_arrivals = list(sorted(collected_default, key=lambda x: x[0]))

Next, we add the default values to the dictionary of distributions:

In [None]:
distribution_departures['default'] = default_departures
distribution_arrivals['default'] = default_arrivals
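The fallback can be expressed as a single dictionary lookup. The sketch below uses toy values and a hypothetical helper name; it only illustrates how a `'default'` entry serves feature combinations that are missing from the dictionary.

```python
def lookup_distribution(distributions, hour, line_id, stop_name):
    """Return the distribution for the given features, or the default one."""
    key = '{}#{}#{}'.format(hour, line_id, stop_name)
    return distributions.get(key, distributions['default'])


# Toy dictionary: one known feature combination plus the default entry.
toy_distributions = {
    '8#L1#Stop A': [(0, 0.9), (1, 0.1)],
    'default': [(0, 0.7), (1, 0.2), (2, 0.1)],
}
known = lookup_distribution(toy_distributions, 8, 'L1', 'Stop A')
unknown = lookup_distribution(toy_distributions, 3, 'L9', 'Stop Z')
print(known)
print(unknown)
```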

Finally, we transform the counts to probabilities, to be able to compute the final quality faster:

In [None]:
def transform_to_proba(counts_list):
    total_sum = 0
    final_proba = []
    
    for row in counts_list:
        total_sum += row.count_min
        
    for row in counts_list:
        final_proba.append((row.delta_minutes, row.count_min / total_sum))
        
    return final_proba

In [None]:
distribution_departures = {k : transform_to_proba(v) for k, v in distribution_departures.items()}
distribution_arrivals = {k : transform_to_proba(v) for k, v in distribution_arrivals.items()}
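As a quick sanity check of the normalization, here is a self-contained sketch on toy counts. `CountRow` is a hypothetical stand-in for the Spark rows, and `to_probabilities` mirrors the logic of `transform_to_proba` in a compact form.

```python
from collections import namedtuple

# Stand-in for the Spark Row objects (assumption of this sketch).
CountRow = namedtuple('CountRow', ['delta_minutes', 'count_min'])


def to_probabilities(counts_list):
    total = sum(row.count_min for row in counts_list)
    return [(row.delta_minutes, row.count_min / total) for row in counts_list]


toy_counts = [CountRow(-2, 1), CountRow(-1, 3), CountRow(0, 16)]
toy_probabilities = to_probabilities(toy_counts)
print(toy_probabilities)
print(sum(p for _, p in toy_probabilities))
```

The probabilities of each group should sum to 1, up to floating-point rounding.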

We finally write the computed dictionaries to file, to be able to load them later:

In [None]:
FILE_DISTRIBUTION_DEPARTURES = 'distrib_departures.pic'
FILE_DISTRIBUTION_ARRIVALS = 'distrib_arrivals.pic'

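# Use context managers so that the files are flushed and closed after writing
with open(FILE_DISTRIBUTION_DEPARTURES, 'wb') as f:
    pickle.dump(distribution_departures, f)
with open(FILE_DISTRIBUTION_ARRIVALS, 'wb') as f:
    pickle.dump(distribution_arrivals, f)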
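A minimal round-trip sketch, using a temporary directory and a toy dictionary rather than the real distributions, shows that the structure survives serialization unchanged. The file name is hypothetical.

```python
import os
import pickle
import tempfile

toy_distribution = {'default': [(-1, 0.2), (0, 0.8)]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'distrib_example.pic')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(toy_distribution, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_distribution)  # True
```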

## The exposed API for computing distributions

Finally, the last part is a function that receives the features of a specific transfer and returns the quality of the transfer, by performing the convolution of the corresponding distributions, using the formula:

$\sum\limits_{t_a }\Pr[\mathcal{D}_a = t_a] \cdot \Pr[\mathcal{D}_d = k-1+t_a]$,
      
where we have taken into consideration the minute needed by the traveler for changing the transport. 
       
Here, we considered $\mathcal{D}_a$ to be the distribution of arrivals and $\mathcal{D}_d$ the distribution of departures.


In [10]:
DATE_FORMAT = '%b %d %Y %H:%M:%S'

FILE_DISTRIBUTION_DEPARTURES = 'distrib_departures.pic'
FILE_DISTRIBUTION_ARRIVALS = 'distrib_arrivals.pic'

class TransferQualityComputer:
    def __init__(self):
        self.distribution_departures = pickle.load(open(FILE_DISTRIBUTION_DEPARTURES, 'rb'))
        self.distribution_arrivals = pickle.load(open(FILE_DISTRIBUTION_ARRIVALS, 'rb'))
        
    def compute_key_for_feature_values(self, hour, line_id, stop_name): # same as before
        return '{}#{}#{}'.format(hour, line_id, stop_name)
    
    def compute_quality(self, arrival_timestamp, departure_timestamp, departure_stop_name, arrival_stop_name, 
                        departure_line_id, arrival_line_id, walktime=1):
        # timestamp in the format: Dec 31 2017 20:40:49,01

        arrival_time = datetime.strptime(arrival_timestamp[:-3], DATE_FORMAT)
        departure_time = datetime.strptime(departure_timestamp[:-3], DATE_FORMAT)

        arrival_hour = arrival_time.hour
        departure_hour = departure_time.hour
        delta_minutes = int((time.mktime(departure_time.timetuple()) - time.mktime(arrival_time.timetuple())) / 60)

        if delta_minutes < 0:
            return 0 # impossible to complete the transfer

        departure_key = self.compute_key_for_feature_values(departure_hour, departure_line_id, departure_stop_name)
        if departure_key in self.distribution_departures:
            departure_dist = self.distribution_departures[departure_key]
        else: 
            departure_dist = self.distribution_departures['default'] # default distribution

        arrival_key = self.compute_key_for_feature_values(arrival_hour, arrival_line_id, arrival_stop_name)
        if arrival_key in self.distribution_arrivals:
            arrival_dist = self.distribution_arrivals[arrival_key]
        else: 
            arrival_dist = self.distribution_arrivals['default'] # default distribution

        total_proba = 0

        # delta_minutes is the scheduled transfer time computed above; recomputing it
        # modulo 60 here would wrap around for transfers of one hour or more
        for dep_delay, dep_proba in departure_dist:
            for arr_delay, arr_proba in arrival_dist:
                # consider also the walk time between stations; within the same station we assume 1 min
                if delta_minutes >= dep_delay + arr_delay + walktime: 
                    total_proba += (dep_proba * arr_proba)
                else:
                    break # arrival delays are sorted, so larger ones cannot succeed either

        return total_proba

Testing the code:

In [11]:
computer = TransferQualityComputer()

print(computer.compute_quality('Dec 31 2017 00:40:49,01', 'Dec 31 2017 00:41:58,01', 'Dietikon, Birmensdorferstrasse', 'Dietikon, Birmensdorferstrasse', '85:849:303', '85:849:303', walktime=1))

0.8007152739708147
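The timestamp handling can be checked in isolation. The sketch below parses the same timestamp format and computes the scheduled transfer time with `timedelta.total_seconds()`, which does not wrap around after an hour or a day. The helper name is hypothetical, and the month abbreviation assumes an English locale.

```python
from datetime import datetime

DATE_FORMAT = '%b %d %Y %H:%M:%S'


def scheduled_transfer_minutes(arrival_timestamp, departure_timestamp):
    # Drop the trailing ',01' style suffix before parsing.
    arrival = datetime.strptime(arrival_timestamp[:-3], DATE_FORMAT)
    departure = datetime.strptime(departure_timestamp[:-3], DATE_FORMAT)
    return int((departure - arrival).total_seconds() // 60)


short_gap = scheduled_transfer_minutes('Dec 31 2017 00:40:49,01', 'Dec 31 2017 00:41:58,01')
long_gap = scheduled_transfer_minutes('Dec 31 2017 00:40:00,01', 'Dec 31 2017 02:40:00,01')
print(short_gap, long_gap)  # 1 120
```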


### Metadata processing

We need to process the metadata provided in order to extract station names with their coordinates. These are needed later when we query the journey planner.

We first create a dataframe from the metadata in order to select the stop stations:

In [17]:
def create_meta_df(csv_path):
    #reading csv file
    df_meta = spark.read.csv(csv_path)
    
    #selecting columns
    df_meta = df_meta.select(fct.split(df_meta['_c0'], '  ')[1].alias('Long'), 
                         fct.split(fct.split(df_meta['_c0'], '  ')[2], ' ')[0].alias('Lat'), 
                         fct.split(df_meta['_c0'], '% ')[1].alias('StopName_Meta')) 
    
    #casting to Float
    df_meta = df_meta.withColumn("Long", df_meta["Long"].cast(FloatType()))
    df_meta = df_meta.withColumn("Lat", df_meta["Lat"].cast(FloatType()))
    
    #Keeping data points that are inside Switzerland                         
    df_meta = df_meta.filter(df_meta.Lat.between(45.490404, 47.485074))
    df_meta = df_meta.filter(df_meta.Long.between(5.572263, 10.2931))   
                             
    #Processing latitude and longitude fields
    slen = udf(lambda s: len(str(s).split('.')[1]), IntegerType())                         
    df_meta = df_meta.withColumn("lat_len", slen(df_meta.Lat))
    df_meta = df_meta.withColumn("lon_len", slen(df_meta.Long))
    
    #Verifying minimum precision
    print(df_meta.agg({"lat_len": "min"}).collect())
    print(df_meta.agg({"lon_len": "min"}).collect())
                             
    df_meta = df_meta.select('Long', 'Lat', 'StopName_Meta')
    
    round_6 = udf(lambda s: round(s, 6), DoubleType())
                             
    df_meta = df_meta.withColumn("Round_Long", round_6(df_meta.Long))
    df_meta = df_meta.withColumn("Round_Lat", round_6(df_meta.Lat))
                             
    return df_meta

In [18]:
df_meta = create_meta_df('/datasets/project/metadata')

[Row(min(lat_len)=6)]
[Row(min(lon_len)=6)]


The output above verifies the minimum precision in our dataset. We find a precision of 6 digits, which is sufficient for our work. 

Note that we only keep points in or near Switzerland. We do this by drawing a bounding box around the country and keeping the points that fall inside it. The extreme points of Switzerland are listed here: https://fr.wikipedia.org/wiki/Liste_de_points_extr%C3%AAmes_de_la_Suisse

In [16]:
# The arguments follow the order of the labels: stop names first, then coordinates
print("Number of distinct stopnames: {}, number of distinct coordinates: {}".format(df_meta.select('StopName_Meta').distinct().count(), df_meta.select('Round_Lat', 'Round_Long').distinct().count()))

Number of distinct stopnames: 5996, number of distinct coordinates: 22671


### Use another dataset to fill missing names

We can see that many names are duplicated with different coordinates. For example, Lausanne appears many times. After investigating, we understood that all the subway stations were simply named Lausanne. We decided to fix this problem by merging in another dataset.

We merge the two datasets on their coordinates; to do so, we round the coordinates so that they match. Rounding to 3 decimals changes the position by at most 135 m. For comparison, Google Maps uses 6 decimals.
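As a rough check of these orders of magnitude, the sketch below estimates the size of one rounding step on the ground. It assumes about 111.32 km per degree of latitude and a latitude of 46.8° for Switzerland; both values and the helper name are assumptions of this sketch. Under them, the 135 m figure appears consistent with the diagonal of a 0.001° cell.

```python
import math

METERS_PER_DEGREE = 111_320  # approximate length of one degree of latitude (assumption)


def grid_cell_meters(decimals, latitude_deg=46.8):
    """Approximate size of one rounding step at the given number of decimals."""
    step = 10 ** -decimals
    north_south = step * METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = step * METERS_PER_DEGREE * math.cos(math.radians(latitude_deg))
    return north_south, east_west, math.hypot(north_south, east_west)


for decimals in (3, 5, 6):
    ns, ew, diagonal = grid_cell_meters(decimals)
    print(decimals, round(ns, 3), round(ew, 3), round(diagonal, 3))
```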

In [19]:
def create_stops_df(txt_path):
    
    #reading .txt and processing
    with open(txt_path, 'r') as file: 
        one_splitted = file.readline().strip().split(",")
        file_lines = [line.strip().split('"') for line in file.readlines()]
    stop_names = [x[3] for x in file_lines]
    Lat = [float(x[5]) for x in file_lines]
    Long = [float(x[7]) for x in file_lines]

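    #create pandas dataframe; the columns are listed in the same order as the schema below,
    #because spark.createDataFrame matches pandas columns to the schema by position
    df_stop = pd.DataFrame({
            "Lat_stop": Lat, 
            "Long_stop": Long,   
            "StopName": stop_names, 
        }, columns=["Lat_stop", "Long_stop", "StopName"])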
    
    #Apply schema
    mySchema = StructType([ StructField("Lat_stop", DoubleType(), True)\
                        ,StructField("Long_stop", DoubleType(), True)\
                        ,StructField("StopName", StringType(), True) ])
    df_stop = spark.createDataFrame(df_stop, mySchema)
    
    #Processing latitude and longitude fields
    slen = udf(lambda s: len(str(s).split('.')[1]), IntegerType()) 
    df_stop = df_stop.withColumn("lat_len", slen(df_stop.Lat_stop))
    df_stop = df_stop.withColumn("lon_len", slen(df_stop.Long_stop))
    
    #Treating special case
    df_stop = df_stop.withColumn("Lat_stop", \
              when(df_stop["StopName"] == 'Isola Superiore', 45.901230).otherwise(df_stop["Lat_stop"]))
    df_stop = df_stop.withColumn("Long_stop", \
              when(df_stop["StopName"] == 'Isola Superiore', 8.520450).otherwise(df_stop["Long_stop"]))
    
    #Here we round all the coordinates to 6 decimals again, to stay consistent with the df_meta dataframe
    round_6 = udf(lambda s: round(s, 6), DoubleType())
    df_stop = df_stop.withColumn("Round_Long", round_6(df_stop.Long_stop))
    df_stop = df_stop.withColumn("Round_Lat", round_6(df_stop.Lat_stop))

    return df_stop

In [20]:
df_stop = create_stops_df('stops.txt')
df_stop.show(10)

+----------------+----------------+--------------------+-------+-------+----------+---------+
|        Lat_stop|       Long_stop|            StopName|lat_len|lon_len|Round_Long|Round_Lat|
+----------------+----------------+--------------------+-------+-------+----------+---------+
|45.9899010293845|8.34506152974108|      Anzola, chiesa|     13|     14|  8.345062|45.989901|
|46.1672513851495|  8.345807131427|            Altoggio|     13|     12|  8.345807|46.167251|
| 46.060121674738|8.11361957990831|        Antronapiana|     12|     14|   8.11362|46.060122|
|45.9898698225697|8.34571729989858|              Anzola|     13|     14|  8.345717| 45.98987|
|46.2614983591677|8.31925293162473|              Baceno|     13|     14|  8.319253|46.261498|
|46.0790618438814|8.29927439970313|Beura Cardezza, c...|     13|     14|  8.299274|46.079062|
|46.1222963432243|8.21077237789936|Bognanco, T. Vill...|     13|     14|  8.210772|46.122296|
|46.0656504576122|8.26113193273411|           Boschetto|    

Here the coordinates of Isola Superiore are wrong in the dataset, so we replaced them with the coordinates from Google Maps.

### Merge Dataframes

In [21]:
# Merging the two dataframes
Df_meta = df_meta.join(df_stop, on = ['Round_Lat', 'Round_Long'], how='outer') 

We can see that, for the example of Lausanne, we only recover 2 names out of about a hundred. 
After investigation, we found the coordinates of particular stations in both datasets: 
<br/>
<br/>Lausanne Malley: 46.524212 - 6.603306 -- 46.524211 - 6.603309
<br/>Lausanne Bourdonette: 46.523466 - 6.589805 -- 46.523465 - 6.589807
<br/>Lausanne Provence: 46.523384 - 6.608102 -- 46.523382 - 6.608106

We can see that each time the merge fails because of the last digit.

We try again, rounding to 5 digits, which is still a very good precision.

In [22]:
#Rerounding both dataframes
round_5 = udf(lambda s: round(s, 5), DoubleType())
df_meta = df_meta.withColumn("Round_Long", round_5(df_meta.Long))
df_meta = df_meta.withColumn("Round_Lat", round_5(df_meta.Lat))
df_stop = df_stop.withColumn("Round_Long", round_5(df_stop.Long_stop))
df_stop = df_stop.withColumn("Round_Lat", round_5(df_stop.Lat_stop))

#New merge
Df_meta = df_meta.join(df_stop, on = ['Round_Lat', 'Round_Long'], how='outer') 

print(df_meta.filter(df_meta['StopName_Meta'].like("Lausanne")).count())
print(Df_meta.filter(Df_meta['StopName_Meta'].like("Lausanne") & Df_meta['StopName'].isNull()).count())

188
65


We have now achieved a satisfactory result.
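Matching on rounded coordinates remains sensitive to pairs that straddle a rounding boundary. The sketch below uses two of the coordinate pairs quoted above as plain Python floats (the notebook works with Spark columns, so the exact values may differ) and compares rounding with a tolerance-based comparison. The tolerance approach is an alternative shown for illustration only; it is not what the notebook does.

```python
# (lat, long) as found in each of the two datasets
malley = ((46.524212, 6.603306), (46.524211, 6.603309))
provence = ((46.523384, 6.608102), (46.523382, 6.608106))


def match_by_rounding(a, b, decimals):
    return all(round(x, decimals) == round(y, decimals) for x, y in zip(a, b))


def match_by_tolerance(a, b, tolerance=1e-5):
    # Hypothetical alternative: accept coordinates closer than the tolerance.
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


for name, (first, second) in (('Malley', malley), ('Provence', provence)):
    print(name,
          match_by_rounding(first, second, 6),
          match_by_rounding(first, second, 5),
          match_by_tolerance(first, second))
```

With these values, Malley matches once rounded to 5 decimals, while Provence still does not because its longitudes round to 6.6081 and 6.60811. This is one possible explanation for the names that remain unmatched.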

In [23]:
Df_meta = Df_meta.select('Long', 'Lat', 'StopName_Meta', 'StopName')
Df_meta.show()

+--------+---------+----------------+--------------------+
|    Long|      Lat|   StopName_Meta|            StopName|
+--------+---------+----------------+--------------------+
|    null|     null|            null|Macugnaga, Pestarena|
|    null|     null|            null| Lugano, Via Ginevra|
|    null|     null|            null|      Gandria, Paese|
|    null|     null|            null|               Gozzi|
|8.943882|46.034714|        Cureglia|   Cureglia, Rotonda|
|    null|     null|            null|        Bogno, Paese|
|6.090986| 46.15237|           Perly|                null|
|6.044045|46.161507|        Laconnex|Laconnex, Chemin ...|
|8.912559|46.179436|         Agarone|                null|
|8.699336| 46.18245|        Cresmino|      Cresmino, Case|
|6.246757|46.183704|       Annemasse|Annemasse, Généra...|
|7.393176| 46.19771|Les Mayens-de-S.|Les Mayens-de-S.,...|
|6.167676| 46.20001|          Genève|  Genève, Amandolier|
|6.157857|46.203766|          Genève|                nul

In [24]:
#Drop null valued rows 
Df_meta = Df_meta.na.drop(subset=["Long", 'Lat'])
Df_meta = Df_meta.withColumn("StopName_Meta", \
              when(Df_meta["StopName"].isNotNull(), Df_meta["StopName"]).otherwise(Df_meta["StopName_Meta"]))
Df_meta = Df_meta.select('StopName_Meta', 'Lat', 'Long')

#Convert to pandas dataframe
Df_meta = Df_meta.toPandas()

### Deterministic journey planner request

In our approach we use a standalone classic journey planner that works with a GTFS timetable. This planner is used to compute routes between two points in a deterministic way, without taking delays into consideration.

It is implemented as a server which exposes a REST endpoint which allows GET requests to be made in order to obtain itineraries between two points.

Below, we implement the requests to this planner.

First, we need a utility function that, given a stop name, uses the metadata to return the coordinates of that stop. These coordinates are needed in the request.

In [26]:
## Get latitude and longitude from a stop name (string)
def get_lat_long(name): 
    tmp = Df_meta.loc[Df_meta['StopName_Meta'] == name][['Lat', 'Long']]
    
    assert len(tmp) != 0, "Problem with the location {}".format(name)
    
    tmp = tmp.iloc[0]
    return tmp['Lat'], tmp['Long']

Now we can implement the actual GET request.

In [27]:
## Make a request to the OTP server
def return_request(fromPlace, toPlace, departure, Months, Days, Hours, AM_PM, Minutes, Seconds, lat_long_from = False, lat_long_to = False):
    
    
    # Handle the case where the stopnames have an added 'stop' at the beginning
    if (fromPlace.split(' ')[0] == 'stop'):
        fromPlace = fromPlace[5:-1] # strip the prefix from fromPlace itself (not toPlace)
    if (toPlace.split(' ')[0] == 'stop'):
        toPlace = toPlace[5:-1]
        
    # If latitude and longitude are not specified, find them thanks to the stop name
    # Otherwise use their values straight away (safer from dataset mismatches)
    if lat_long_from == False:
        lat_from, long_from = get_lat_long(fromPlace)
    else:
        lat_from, long_from = lat_long_from[0], lat_long_from[1] 

    if lat_long_to == False:
        lat_to, long_to = get_lat_long(toPlace)
    else:
        lat_to, long_to = lat_long_to[0], lat_long_to[1]
    
    #Compose the url for the request
    url = 'http://10.90.38.21:8829/otp/routers/default/plan?fromPlace=stop+'
    url += '+'.join(fromPlace.split()) +  '+%3A%3A' + str(lat_from) + '%2C' + str(long_from)
    url += '&toPlace=stop+' +  '+'.join(toPlace.split()) +  '+%3A%3A' + str(lat_to) + '%2C' + str(long_to)
    url += '&time={}%3A{}{}&date={}-{}-2018&mode=TRANSIT%2CWALK&maxWalkDistance=804.672&arriveBy={}&wheelchair=false&locale=en&numItineraries=3'.format(Hours, Minutes, AM_PM, Months, Days, not(departure))

    #make the request and return the json
    r = requests.get(url)

    return r.json()
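
The URL above is assembled by hand, with the percent-escapes written out. As a hedged alternative, the sketch below builds the same kind of query string with `urllib.parse.urlencode`, which escapes accented stop names and separators automatically. The helper name `build_plan_query` is hypothetical and its parameters simply mirror those used in `return_request`; it is an illustration, not the code used in the rest of the notebook.

```python
from urllib.parse import urlencode, parse_qs

def build_plan_query(from_name, from_coords, to_name, to_coords,
                     hours, minutes, am_pm, month, day, departure=True):
    """Hypothetical helper: encode the same parameters as return_request."""
    params = {
        # Same "stop <name> ::<lat>,<lon>" layout as the hand-built URL
        'fromPlace': 'stop {} ::{},{}'.format(from_name, *from_coords),
        'toPlace': 'stop {} ::{},{}'.format(to_name, *to_coords),
        'time': '{}:{}{}'.format(hours, minutes, am_pm),
        'date': '{}-{}-2018'.format(month, day),
        'mode': 'TRANSIT,WALK',
        'maxWalkDistance': 804.672,
        'arriveBy': str(not departure),
        'wheelchair': 'false',
        'locale': 'en',
        'numItineraries': 3,
    }
    return urlencode(params)

query = build_plan_query('Zürich HB', (47.377941, 8.540141),
                         'Bern', (46.94972, 7.43944),
                         hours=6, minutes=20, am_pm='PM', month=2, day=4)
print(query)
# Decoding the query string gives back the readable values
print(parse_qs(query)['fromPlace'])
```

The resulting string could be appended after the `?` of the endpoint URL, or the same dictionary could be passed to `requests.get` through its `params` argument.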

### Processing request answer
The server responds to the GET request with a JSON structure containing a number of itineraries, with additional information about each of them.

We process this JSON to extract the data of interest and put it in a standard format that is used throughout the later code when dealing with itineraries.

The data of interest for each itinerary is mainly composed of details about its legs: the transportation type (e.g. BUS, RAIL, WALK), the coordinates of the stops, and the arrival and departure times for each of them.

In [43]:
def read_json_extract_itineraries(json_data, df_BT):
    #Instantiate the list of itineraries to be returned
    itinerary_list = []
    
    #Avoid error. Usually caused by an empty json
    if 'plan' not in json_data:
        return []
    
    #Iterate over the itineraries
    for route in json_data['plan']['itineraries']:
        #Instantiate an itinerary
        itinerary_ = []
        #Iterate over the trip (legs) in an itinerary
        for leg in route['legs']: 
            #Here handle relevant data for one trip of the itinerary   
            mode = leg['mode']
            from_ = leg['from']['name']
            lat_from = leg['from']['lat']
            lon_from = leg['from']['lon']
            to_ = leg['to']['name']
            lat_to = leg['to']['lat']
            lon_to = leg['to']['lon']
            
            start_time = str(leg['from']['departure'])
            departure_time = time.strftime("%b %d %Y %H:%M:%S,%M", time.localtime(float(start_time[:len(start_time)-3])))
            end_time = str(leg['endTime'])
            arrival_time = time.strftime("%b %d %Y %H:%M:%S,%M", time.localtime(float(end_time[:len(end_time)-3])))
            duration = str(leg['duration'])
            
            
                        
            #Not all trips have these attributes, so set default (meaning unknown) values if missing
            route_id = 0
            trip_id = 0
            agency_name = 'unknown'
            
            if ('routeShortName' in leg.keys()):
                route_id = leg['routeShortName']
            if('tripShortName' in leg.keys()):
                trip_id = leg['tripShortName']
            if('agencyName' in leg.keys()):
                agency_name = leg['agencyName']
                
            #For trains ('RAIL') the line_id is given by the trip_id
            line_id = trip_id
            
            #For Bus and Tram there is no single indicator for line_id. 
            #We combine product_id, operatorName and routeID to uniquely identify the line_id in a pre-created dataframe
            #In case of another mode of transport (product_id), also identify its line_id this way
            if mode != 'RAIL' and mode!='WALK':
                if mode == 'BUS' or mode == 'Bus':
                    tmp = df_BT.query('ProductId.str.lower() == "bus" & OperatorName == @agency_name & LineType == @route_id')[:1]
                elif mode =='TRAM' or mode == 'Tram':
                    tmp = df_BT.query('ProductId.str.lower() == "tram" & OperatorName == @agency_name & LineType == @route_id')[:1]
                else: 
                    tmp = df_BT.query('ProductId == @mode & OperatorName == @agency_name & LineType == @route_id')[:1]
                if len(tmp) == 0:
                    line_id = 'unknown'
                else:
                    line_id = tmp['LineId'].iloc[0]
                
           # print('Trip from {} at {} to {} at {} with {} and line_id: {}'.format(from_,departure_time,to_,arrival_time, mode, line_id))
        
            itinerary_.append({'product_id': mode, 'from': from_, 'lat_long_from': [lat_from, lon_from] ,'departure_time':departure_time,'to':to_, 'lat_long_to': [lat_to, lon_to], 'arrival_time':arrival_time, 'line_id': line_id})
        itinerary_list.append(itinerary_)
    return itinerary_list 
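
The leg times handled above appear to be epoch timestamps in milliseconds, which the function turns into seconds by dropping the last three digits of their string form. A minimal sketch of the same conversion done arithmetically is shown below. The helper name `epoch_ms_to_string` is hypothetical, and UTC is used only to make the example reproducible, whereas the notebook formats times in the local time zone.

```python
from datetime import datetime, timezone

def epoch_ms_to_string(epoch_ms, fmt="%b %d %Y %H:%M:%S"):
    """Hypothetical helper: format an epoch timestamp given in milliseconds."""
    # Integer division by 1000 is equivalent to dropping the last three digits
    seconds = int(epoch_ms) // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)

# 1517769480000 ms corresponds to Feb 04 2018 18:38:00 in UTC
print(epoch_ms_to_string(1517769480000))
```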

### Filtering itineraries by quality

As we said previously, we are using a deterministic journey planner in order to get itineraries between two stops.

On top of these itineraries we add the **confidence** of each transfer using the *TransferQualityComputer* functionality implemented above.

Below we implement a function that, given an itinerary containing multiple legs with transfers, computes its overall confidence using the assumptions described in the first part.

In [29]:
#Compute the overall quality of an itinerary
def comp_itinerary_quality(itinerary, quality_computer):
    #Quality start at 1 and decays with every transfer
    itinerary_quality = 1
    
    prev_leg = None
    crt_leg = None
    walking_time = 1
    #iterate over the trips of an itinerary and affect the overall itinerary quality
    for crt_leg in itinerary:
        #Walk is a special case of trip. It should not diminish quality, but influence the next trip.
        if crt_leg['product_id'] == 'WALK':
            walking_time += (datetime.strptime(crt_leg['arrival_time'][:-3], DATE_FORMAT) - \
                            datetime.strptime(crt_leg['departure_time'][:-3], DATE_FORMAT)).total_seconds()//60
        else:
            if prev_leg is not None:
                #Compute the quality of the transfer
                transfer_quality = quality_computer.compute_quality(
                    prev_leg['arrival_time'],
                    crt_leg['departure_time'],
                    prev_leg['to'],
                    prev_leg['line_id'],
                    crt_leg['line_id'],
                    walking_time
                )
                #Affect the itinerary
                itinerary_quality *= transfer_quality
                
            prev_leg = crt_leg
            walking_time = 1
            
    return itinerary_quality
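
As a worked illustration of the multiplication above: the itinerary quality is the product of the qualities of its transfers, which amounts to treating the transfers as independent events. The numbers below are made up for illustration, and `itinerary_quality_from_transfers` is a hypothetical helper, not part of the planner.

```python
from functools import reduce
from operator import mul

def itinerary_quality_from_transfers(transfer_qualities):
    """Hypothetical helper: multiply per-transfer success probabilities."""
    # An itinerary without transfers keeps the neutral quality of 1
    return reduce(mul, transfer_qualities, 1)

# Made-up example: two transfers with 98% and 95% chance of success
quality = itinerary_quality_from_transfers([0.98, 0.95])
print(round(quality, 3))
```

With a threshold of 0.94, this itinerary would be refused even though each transfer clears the threshold on its own.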

Another utility function that we need further down the process simply splits a list of itineraries into those whose quality is strictly above a specified threshold and those with a lower quality.

In [46]:
def split_with_quality(itinerary_list, Quality, quality_computer):
    #Compute the quality of every itinerary once
    itinerary_quality_ = [comp_itinerary_quality(it_, quality_computer) for it_ in itinerary_list]
    #Plain list comprehensions avoid building a NumPy array out of itineraries of different lengths
    itinerary_list_accepted = [it_ for it_, q_ in zip(itinerary_list, itinerary_quality_) if q_ > Quality]
    itinerary_list_refused = [it_ for it_, q_ in zip(itinerary_list, itinerary_quality_) if not(q_ > Quality)]
    return itinerary_list_accepted, itinerary_list_refused

### Explore itineraries "around" a too-low-quality itinerary TODO: make it an actual tree

In [32]:
def date_to_cells(date):
    dt = datetime.strptime(date[:-3], DATE_FORMAT)
    #12-hour clock: 0h is 12 AM (midnight) and 12h is 12 PM (noon)
    return dt.month,\
          dt.day,\
          (dt.hour % 12 or 12),\
          ('AM' if dt.hour < 12 else 'PM'), \
          dt.minute,\
          dt.second
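
The request takes the time on a 12-hour clock, which has two edge cases: midnight (0h) is 12 AM and noon (12h) is 12 PM. The sketch below shows one way to handle them; `to_12_hour` is a hypothetical helper used only for this illustration.

```python
def to_12_hour(hour_24):
    """Hypothetical helper: convert an hour in 0..23 to (hour, 'AM' or 'PM')."""
    # 0 and 12 both map to 12 on a 12-hour clock
    hour_12 = hour_24 % 12 or 12
    suffix = 'AM' if hour_24 < 12 else 'PM'
    return hour_12, suffix

for h in (0, 8, 12, 18):
    print(h, to_12_hour(h))
```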

In [50]:
def explore_itineraries(itinerary, df_BT, quality, quality_computer):
    itinerary_list = []
    for j_ in range(len(itinerary)-1):
        #leg_1 = itinerary[j_]
        if comp_itinerary_quality(itinerary[0:j_+1], quality_computer) < quality:
            continue
        arr_month, arr_day, arr_hour, arr_AM_PM, arr_minute, arr_second = date_to_cells(itinerary[j_]['arrival_time'])
        #new_semi_its = request_with_quality(fromPlace = leg_1['to'], toPlace = itinerary[-1]['to'], Months = arr_month, Days = arr_day, Hours = arr_hour, Minutes = arr_minute, Seconds = arr_second, AM_PM = arr_AM_PM, departure = True, Quality = quality, lat_long_from = leg_1['lat_long_from'], lat_long_to = itinerary[-1]['lat_long_to'] )
        temp_json = return_request(fromPlace = itinerary[j_]['to'], toPlace = itinerary[-1]['to'], Months = arr_month, Days = arr_day, Hours = arr_hour, Minutes = arr_minute, Seconds = arr_second, AM_PM = arr_AM_PM, departure = True, lat_long_from = itinerary[j_]['lat_long_to'], lat_long_to = itinerary[-1]['lat_long_to'])
        #print(temp_json)
        new_partial_its = read_json_extract_itineraries(temp_json, df_BT)
        new_itineraries = [np.append(itinerary[:j_+1],new_partial_its[k_]).tolist() for k_ in range(len(new_partial_its))]
        
        itinerary_list.extend(new_itineraries)
    return itinerary_list

### Get news from SBB

In [34]:
def display_info(date, stopName):
    url = 'https://data.sbb.ch/api/records/1.0/search/?dataset=rail-traffic-information&lang=en&rows=1000&sort=validityend&facet=validitybegin&facet=validityend&refine.validitybegin={}'.format(date[0])
    tmp = requests.get(url).json()
    infos = []
    for el in tmp['records']: 
        end = str(el['fields']['validityend'].split('T')[0]).split('-')
        if((int(end[0]) == int(date[0]) and int(end[1]) == int(date[1]) and int(end[2]) < int(date[2])) or (int(end[0]) == int(date[0]) and int(end[1]) < int(date[1])) or (int(end[0]) < int(date[0]))):
            break
        #print(end)
        title = el['fields']['title']
        if('End of announcement:' in title): 
            pass
        else:
            if(len(title.split(':')) > 1):
                title = str(title.split(':')[1])
            title = title.replace(' and', '-').replace('engineering work is in progress', '').replace(',','').replace('.', '').replace('Between', '').replace('In', '').replace(' station', '').replace('Work due to a disruption','').strip()
            #print(title.split('- '))
            for el_title in title.split('- '): 
                for el_stop in stopName: 
                    if(el_title.strip() == el_stop.strip()): 
                        print(el_title)
                        infos.append(el['fields']['description'])
    return infos
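
The long condition in `display_info` checks whether an announcement's validity ended before the queried date, comparing year, month and day by hand. A possible simplification, sketched below under the assumption that `validityend` is an ISO-style timestamp (as the `split('T')` in the code suggests), is to compare `datetime.date` objects. `ended_before` is a hypothetical helper name.

```python
from datetime import date

def ended_before(validity_end, query_date):
    """Hypothetical helper: True if an announcement ended before query_date.

    validity_end is a timestamp string whose date part precedes a 'T',
    query_date is a (year, month, day) sequence of strings or ints.
    """
    end = date(*(int(part) for part in validity_end.split('T')[0].split('-')))
    query = date(*(int(part) for part in query_date))
    # date objects compare chronologically: year first, then month, then day
    return end < query

print(ended_before('2018-02-03T23:59:00+01:00', ('2018', '02', '04')))
print(ended_before('2018-02-04T10:00:00+01:00', ('2018', '02', '04')))
```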

### Find best itineraries

Now that we are able to find itineraries given a specified confidence (quality) threshold, we need to expose a simple API that makes this functionality easy to use, hiding the complexity of the search and of the uncertainty distribution behind a single function call.

First we initialize a TransferQualityComputer object.

In [35]:
quality_computer = TransferQualityComputer()

Then, because the line ids differ between the timetable and the line data, we need a dataframe containing the ProductId, LineType and OperatorName for buses and trams.

In [37]:
#Create Dataframe to find LineId from ProductId, LineType and OperatorName. Relevant for Bus and Tram
df_BT = df.where(col('ProductId') != 'Zug').select('ProductId','LineType','OperatorName','LineId').distinct().toPandas()

Now we can implement the actual function that searches for an itinerary given a quality threshold.

In [54]:
def find_itinerary_with_quality(fromPlace , toPlace, Months, Days, Hours, AM_PM, Minutes, Seconds, departure, quality):
    #Get the json of quickest itineraries from local OTP server
    test_json = return_request(fromPlace=fromPlace , toPlace= toPlace ,Months = Months, Days= Days, Hours= Hours, AM_PM = AM_PM, Minutes = Minutes, Seconds = Seconds, departure = departure)
    
    #Read json and create itinerary list of dicts
    itinerary_first_list = read_json_extract_itineraries(test_json, df_BT)
    itinerary_acc, itinerary_refu = split_with_quality(itinerary_first_list, quality, quality_computer)
    itinerary_searched = []
    iter_=0

    ## sort bad quality itineraries by arrival time
    sorter_ids = np.argsort([itinerary_refu[i_][-1]['arrival_time'] for i_ in range(len(itinerary_refu))])
    itinerary_refu = np.array(itinerary_refu)[sorter_ids].tolist()
    
    while len(itinerary_refu) != 0:
        if len(itinerary_acc)>=3:
            break
        iter_+=1
        
        itinerary_searched_ = itinerary_refu.pop(0)
        itinerary_searched.append(itinerary_searched_)
        
        itinerary_test_list_explored = explore_itineraries(itinerary_searched_, df_BT, quality, quality_computer)
        itinerary_acc_explored, itinerary_refu_explored = split_with_quality(itinerary_test_list_explored, quality, quality_computer)
        
        for iti_refu in itinerary_refu_explored:
            if not(any([(iti_refu[:-1] == iti[:-1]) for iti in itinerary_refu+itinerary_searched])):
                itinerary_refu.append(iti_refu)
        for iti_acc in itinerary_acc_explored:
            if not(any([(iti_acc == iti) for iti in itinerary_acc])):
                itinerary_acc.append(iti_acc)

        ## sort by arrival time
        sorter_ids = np.argsort([itinerary_refu[i_][-1]['arrival_time'] for i_ in range(len(itinerary_refu))])
        itinerary_refu = np.array(itinerary_refu)[sorter_ids].tolist()
    
    
    ## sort selected itineraries by arrival time
    sorter_ids = np.argsort([itinerary_acc[i_][-1]['arrival_time'] for i_ in range(len(itinerary_acc))])
    itinerary_acc = np.array(itinerary_acc)[sorter_ids].tolist()

    return itinerary_acc, itinerary_first_list
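
The loop above repeatedly expands the refused itinerary with the earliest arrival time and re-sorts the whole list after every expansion. A hedged sketch of the same best-first idea using a priority queue (`heapq`) is given below. It is a swapped-in simplification, not the notebook's implementation: itineraries are reduced to `(arrival_time, quality, label)` tuples with made-up values, and `expand` is a hypothetical stand-in for `explore_itineraries`.

```python
import heapq

def best_first_search(initial, expand, threshold, wanted=3):
    """Sketch: collect candidates whose quality exceeds the threshold.

    Candidates are (arrival_time, quality, label) tuples; `expand` is a
    hypothetical callback returning alternative candidates for a refused one.
    """
    accepted = [c for c in initial if c[1] > threshold]
    # Refused candidates are kept in a heap ordered by arrival time
    refused = [c for c in initial if not c[1] > threshold]
    heapq.heapify(refused)
    seen = set(initial)

    while refused and len(accepted) < wanted:
        candidate = heapq.heappop(refused)  # earliest-arriving refused candidate
        for alternative in expand(candidate):
            if alternative in seen:
                continue
            seen.add(alternative)
            if alternative[1] > threshold:
                accepted.append(alternative)
            else:
                heapq.heappush(refused, alternative)

    return sorted(accepted)

# Toy data: arrival times as 'HH:MM' strings, made-up qualities
initial = [('21:36', 0.97, 'A'), ('21:39', 0.87, 'B')]
alternatives = {'B': [('21:51', 0.97, 'B1'), ('21:45', 0.80, 'B2')],
                'B2': [('21:58', 0.96, 'B3')]}

def expand(candidate):
    return alternatives.get(candidate[2], [])

result = best_first_search(initial, expand, threshold=0.90)
print([label for _, _, label in result])
```

The heap keeps the earliest-arriving refused candidate at the front, so no explicit re-sorting is needed after each expansion.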

Now we can test it on a sample query:

In [55]:
quality_ = 0.90
fromPlace_ = "Zürich, Zürichbergstrasse"
toPlace_ = 'Lausanne'
Months_ = 2
Days_ = 4
Hours_ = 6
AM_PM_ = 'PM' 
Minutes_ = 20
Seconds_ = 1
departure_ = True
quality_itins, classic_itins = find_itinerary_with_quality(fromPlace_ , toPlace_, Months_, Days_, Hours_, AM_PM_, Minutes_, Seconds_, departure_, quality_)

In [56]:
for leg in quality_itins[0]:
    print('Take {} from {} at {} to {} arriving at {}'.format(leg['product_id'],leg['from'], leg['departure_time'], leg['to'], leg['arrival_time']))

Take TRAM from Zürich, Zürichbergstrasse at Feb 04 2018 18:38:00,38 to Zürich, Haldenegg arriving at Feb 04 2018 18:48:00,48
Take WALK from Zürich, Haldenegg at Feb 04 2018 18:48:00,48 to Zürich HB arriving at Feb 04 2018 18:52:11,52
Take RAIL from Zürich HB at Feb 04 2018 19:02:00,02 to Bern arriving at Feb 04 2018 19:58:00,58
Take WALK from Bern at Feb 04 2018 19:58:00,58 to Bern arriving at Feb 04 2018 19:58:59,58
Take RAIL from Bern at Feb 04 2018 20:04:00,04 to Lausanne arriving at Feb 04 2018 21:16:00,16
Take WALK from Lausanne at Feb 04 2018 21:16:00,16 to Lausanne, gare arriving at Feb 04 2018 21:17:00,17
Take BUS from Lausanne, gare at Feb 04 2018 21:24:00,24 to Lausanne, casernes arriving at Feb 04 2018 21:36:00,36


To further validate our itineraries, we compare the itineraries computed in the deterministic way with our quality-constrained itineraries.

First we implement a utility function that counts the number of legs in an itinerary, without counting walking legs.

In [65]:
def compute_length_no_walking(itinerary):
    length = 0
    for leg in itinerary:
        if leg['product_id'] != 'WALK':
            length += 1
    return length

In [66]:

print('\n Fastest itineraries without quality constraint:')
itinerary_initial_quality_ = [0,]*len(classic_itins)
for i_ in range(len(classic_itins)):
    itinerary_initial_quality_[i_] = comp_itinerary_quality(classic_itins[i_], quality_computer)
    print('Itinerary number: {}, quality: {}, dpt: {}, arr: {}, transfers: {}'\
                  .format(i_,
                          itinerary_initial_quality_[i_],
                          classic_itins[i_][0]['departure_time'],
                          classic_itins[i_][-1]['arrival_time'],
                          compute_length_no_walking(classic_itins[i_])))

print('\n Fastest itineraries with quality constraint:')
## Print out the arrival time and quality of the three selected paths
itinerary_selected_quality_ = [0,]*len(quality_itins)
for i_ in range(len(quality_itins)):
    itinerary_selected_quality_[i_] = comp_itinerary_quality(quality_itins[i_], quality_computer)
    print('Itinerary number: {}, quality: {}, dpt: {}, arr: {}, legs: {}'\
                  .format(i_,
                          itinerary_selected_quality_[i_],
                          quality_itins[i_][0]['departure_time'],
                          quality_itins[i_][-1]['arrival_time'],
                          compute_length_no_walking(quality_itins[i_])))


 Fastest itineraries without quality constraint:
Itinerary number: 0, quality: 0.9666478981827387, dpt: Feb 04 2018 18:38:00,38, arr: Feb 04 2018 21:36:00,36, transfers: 4
Itinerary number: 1, quality: 0.8746300614112293, dpt: Feb 04 2018 18:28:00,28, arr: Feb 04 2018 21:39:00,39, transfers: 4
Itinerary number: 2, quality: 0.9656806142174605, dpt: Feb 04 2018 19:08:00,08, arr: Feb 04 2018 21:54:00,54, transfers: 3

 Fastest itineraries with quality constraint:
Itinerary number: 0, quality: 0.9666478981827387, dpt: Feb 04 2018 18:38:00,38, arr: Feb 04 2018 21:36:00,36, legs: 4
Itinerary number: 1, quality: 0.9680733167379645, dpt: Feb 04 2018 18:28:00,28, arr: Feb 04 2018 21:36:00,36, legs: 4
Itinerary number: 2, quality: 0.9680733167379645, dpt: Feb 04 2018 18:28:00,28, arr: Feb 04 2018 21:36:00,36, legs: 4
Itinerary number: 3, quality: 0.9733572131043425, dpt: Feb 04 2018 18:28:00,28, arr: Feb 04 2018 21:51:00,51, legs: 4
Itinerary number: 4, quality: 0.9733572131043425, dpt: Feb 04 

## Visualizing confidence of trips

One of the validation methods we can use is an isochrone map, showing how far one can hypothetically go in a fixed number of minutes.

On top of this, our visualization also conveys how often (in % of the time) these trips are successful.

We are interested in the area within a 10 km radius of Zurich HB. Because we did not want to add functionality for this to the core of our route planning algorithm, we compute the data for the map by querying the route planner from Zurich HB to every other station within that radius.

For each station, we plot a circle centered on it whose radius is directly proportional to the walking time left until the time limit. We assume an average walking speed of 5 km/h and, using the time left, compute the distance around the station that can be walked.

For each station we also get the certainty of arriving there, i.e. the fraction of times we would actually be able to make the trip. This value in [0, 1] is mapped to a color scale. With the step color map used below, red covers values below 0.25, blue values between 0.25 and 0.75, and green values above 0.75.

In [74]:
import branca.colormap as cm
import folium

ZURICH_HB_COORDS = [47.377941, 8.540141]

AVERAGE_WALKING_SPEED_PER_SECOND = 1.38889 # 5kph but in meters per second
MINIMUM_CIRCLE_RADIUS = 30
# LINEAR_CM = cm.LinearColormap(
#     ['blue', 'red'],
#     vmin=0, vmax=1,
# )
LINEAR_CM = cm.StepColormap(
    ['red', 'blue','green'],
    index=[0, 0.25, 0.75, 1]
)
LINEAR_CM.caption = 'Quality of trip'


def add_circle(m, coords, quality, time_left_in_seconds, popup_data):
    radius = time_left_in_seconds * AVERAGE_WALKING_SPEED_PER_SECOND / 10
    folium.Circle(
        coords,
        radius,
        fill=True,
        fill_color=LINEAR_CM(quality),
        fill_opacity=0.2,
        stroke=False,
        fill_rule='nonzero',
        popup="Arrival time: {}<br/>Q : {:.3f}<br/>Time left: {} mins<br/>Legs: {}"\
                    .format(popup_data['arrival_time'], quality, (time_left_in_seconds//60), popup_data['nb_transfers'])
    ).add_to(m)

    
def create_map_with_quality(source_name, source_coord, stations_data):
    m = folium.Map(source_coord, zoom_start=13, tiles='Stamen toner') 
    m.add_child(LINEAR_CM)
    popup_data = {}
    for data in stations_data:
        popup_data['station_name'] = data[0]
        popup_data['arrival_time'] = data[3]
        popup_data['nb_transfers'] = compute_length_no_walking(data[5])
        add_circle(m, data[1], data[2], data[4]*60, popup_data)
    return m
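
To make the circle size concrete: the radius is based on the distance that can still be walked in the time left, at the assumed 5 km/h. The sketch below computes that distance in meters and reproduces the bins of the step color map defined above. `walkable_radius_m` and `quality_bin` are hypothetical helpers; values that fall exactly on a bin boundary may be assigned differently by the plotting library. Note that `add_circle` additionally divides the radius by 10, presumably to keep the circles readable on the map.

```python
WALKING_SPEED_M_PER_S = 5 * 1000 / 3600  # 5 km/h expressed in meters per second

def walkable_radius_m(minutes_left):
    """Hypothetical helper: distance in meters walkable in the remaining time."""
    return max(minutes_left, 0) * 60 * WALKING_SPEED_M_PER_S

def quality_bin(quality):
    """Hypothetical helper mirroring the bins [0, 0.25), [0.25, 0.75), [0.75, 1]."""
    if quality < 0.25:
        return 'red'
    if quality < 0.75:
        return 'blue'
    return 'green'

# 12 minutes at 5 km/h correspond to one kilometer
print(round(walkable_radius_m(12)))
print(quality_bin(0.96))
```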

First, we select the stations that are at most 10 km from Zurich HB.

For this, we compute the distance from Zurich HB for every stop name in Df_meta. Then we keep the stops that are less than 10 km from Zurich HB.

In [75]:
#Coordinates of the main station of Zürich
Lat_zu = 47.377941
Long_ZU = 8.540141
def dist_to_ZU(lat, long): 
    #geopy distances expose their value in kilometers through the .km attribute
    return round(geo_dist((lat, long), (Lat_zu, Long_ZU)).km, 1)

In [78]:
Df_meta['Dist in km'] = Df_meta.apply(lambda x: dist_to_ZU(x['Lat'], x['Long']), axis=1)
stops_zurich = Df_meta[Df_meta['Dist in km'] < 10]
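
For intuition about the 10 km filter, the great-circle distance can also be computed directly with the haversine formula. The sketch below is a stand-in for the `geopy` call, not a replacement: `geopy` works on an ellipsoidal model of the Earth, so its values can differ slightly. `haversine_km` and the 6371 km mean Earth radius are assumptions of this example; the coordinates are the ones used in the notebook.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch

def haversine_km(coord_a, coord_b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*coord_a, *coord_b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

ZURICH_HB = (47.377941, 8.540141)
BERN = (46.94972, 7.43944)

# Bern is far outside the 10 km radius around Zurich HB
print(round(haversine_km(ZURICH_HB, BERN), 1))
print(haversine_km(ZURICH_HB, BERN) < 10)
```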

Now that we have in **stops_zurich** the stops that are in a 10km radius from Zurich, we extract the names as we need them for the route query.

In [79]:
zurich_stations = stops_zurich['StopName_Meta'].tolist()

If we take a look at the station names, we can see there are duplicates, which we drop because they carry no information about additional stops.

In [80]:
zurich_stations[:10]

['Zürich',
 'Wettswil a.A., Heidenchilen',
 'Wallisellen, Florastrasse',
 'Weiningen ZH, Aegelsee',
 'Rümlang, Heuelstrasse',
 'Kilchberg',
 'Kilchberg',
 'Neue Forch',
 'Schlieren, Wagonsfabrik',
 'Schlieren, Bahnhof']

In [81]:
zurich_stations = list(set(zurich_stations))
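
Note that `list(set(...))` removes duplicates but does not keep the original order of the stops. If a stable order matters (for example, to make the progress counter comparable between runs), one option is `dict.fromkeys`, sketched here on a few of the names shown above.

```python
stations = ['Zürich', 'Kilchberg', 'Kilchberg', 'Neue Forch']

# dict keys are unique and keep insertion order (Python 3.7+)
unique_stations = list(dict.fromkeys(stations))
print(unique_stations)
```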

## Computing the travel times to stops close to Zurich

The next step in the visualization process is to compute the **arrival time** and **quality** for each stop of interest.

To do this, we query our route planner for routes from Zurich HB to every stop within 10 km. Of the itineraries obtained, we are interested only in the arrival time at the final stop and in the quality.

Two parameters shape our visualization:

1. the start time we set for the trips - a parameter required to make the queries
2. the maximum duration of the trips - used to filter out the destinations whose travel time exceeds this value

We first implement some utility functions:

In [82]:
def parse_datetime_string(dt):
    return datetime.strptime(dt, '%b %d %Y %H:%M:%S,%f')

def compute_remaining_travel_time(itinerary, departure_datetime, max_travel_minutes):
    '''
    Function that computes travel minutes left from the quota specified by max_travel_minutes
    '''
    last_step = itinerary[-1]
    arrival_datetime = parse_datetime_string(last_step['arrival_time'])
    travel_time_minutes = (arrival_datetime - departure_datetime).total_seconds()/60
    return max_travel_minutes - travel_time_minutes
    

def compute_itineraries_deterministic(station, source_station, departure_datetime, max_travel_minutes):
    '''
    Function that obtains, for the specified parameters, itineraries in the deterministic way, i.e. without taking
    into consideration the confidence of transfers
    '''
    request_json = return_request(source_station,
                                  station,
                                  True,
                                  departure_datetime.month,
                                  departure_datetime.day,
                                  #12-hour clock: 0h is 12 AM (midnight) and 12h is 12 PM (noon)
                                  departure_datetime.hour % 12 or 12,
                                  'AM' if departure_datetime.hour < 12 else 'PM',
                                  departure_datetime.minute,
                                  departure_datetime.second,)

    itineraries = read_json_extract_itineraries(request_json, df_BT)
    
    return itineraries

def compute_itineraries_given_confidence(station, source_station, departure_datetime, max_travel_minutes, quality_computer, quality=0.95):
    '''
    Function that, for the given parameters, obtains the itineraries that respect the confidence(quality)
    threshold specified as a parameter
    '''
    #find_itinerary_with_quality uses the global quality_computer and returns
    #(accepted itineraries, initial deterministic itineraries); keep only the accepted ones
    itineraries, _ = find_itinerary_with_quality(source_station,
                                                 station,
                                                 departure_datetime.month,
                                                 departure_datetime.day,
                                                 #12-hour clock: 0h is 12 AM (midnight) and 12h is 12 PM (noon)
                                                 departure_datetime.hour % 12 or 12,
                                                 'AM' if departure_datetime.hour < 12 else 'PM',
                                                 departure_datetime.minute,
                                                 departure_datetime.second,
                                                 departure=True,
                                                 quality=quality
                                                 )
    return itineraries
    

In [83]:

def get_stop_plot_data(station, source_station, departure_datetime, max_travel_minutes, quality_computer, with_confidence=True, quality=0.95):
    '''
    Function that computes, for the given station, provided it is within max_travel_minutes
    of source_station, the coords, quality and time left from max_travel_minutes after arriving there
    '''
    if with_confidence:
        itineraries = compute_itineraries_given_confidence(station, source_station, departure_datetime, max_travel_minutes,
                                                          quality_computer, quality=quality)
    else:
        itineraries = compute_itineraries_deterministic(station, source_station, departure_datetime, max_travel_minutes)
    
    if len(itineraries) == 0:
         return None

    fastest_itinerary = itineraries[0]
    remaining_travel_minutes = compute_remaining_travel_time(fastest_itinerary, departure_datetime, max_travel_minutes)

    if remaining_travel_minutes > 0:
        quality = comp_itinerary_quality(fastest_itinerary, quality_computer)
        plot_data = (
            station,
            fastest_itinerary[-1]['lat_long_to'],
            quality,
            fastest_itinerary[-1]['arrival_time'],
            remaining_travel_minutes,
            fastest_itinerary
        )
        return plot_data
    else:
        return None
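
As a small worked example of the time budget used above: with a departure at 08:30:01 and a 60-minute limit, an itinerary arriving at 09:05:00 leaves roughly 25 minutes for walking. The sketch below recomputes this with the notebook's time-stamp format; the arrival string is made up for illustration and `minutes_left` is a hypothetical helper.

```python
from datetime import datetime

def minutes_left(arrival_string, departure_datetime, max_travel_minutes):
    """Hypothetical helper: minutes remaining from the travel-time budget."""
    # Same format as the itinerary time stamps, e.g. 'Apr 06 2018 09:05:00,05'
    arrival = datetime.strptime(arrival_string, '%b %d %Y %H:%M:%S,%f')
    # Drop the fractional part, which only repeats the minutes
    arrival = arrival.replace(microsecond=0)
    travelled = (arrival - departure_datetime).total_seconds() / 60
    return max_travel_minutes - travelled

departure = datetime(2018, 4, 6, 8, 30, 1)
left = minutes_left('Apr 06 2018 09:05:00,05', departure, 60)
print(round(left, 2))
```

A negative result means the stop cannot be reached within the limit, which is why such stops are dropped from the map data.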
    


In [84]:
# Parameters of the visualization
ZURICH_HB_NAME = 'Zürich HB'
BERN_NAME = 'Bern'

Months_ = 4
Days_ = 6
Hours_ = 8
AM_PM_ = 'AM'
Minutes_ = 30
Seconds_ = 1

departure_ = True
Max_travel_time_ = 60 # in minutes
Hours_24 = Hours_%12 if AM_PM_ == 'AM' else (Hours_%12)+12
departure_datetime = datetime(2018, Months_, Days_, Hours_24, Minutes_, Seconds_)

In [85]:
quality_plot_data = []
for i, station in enumerate(zurich_stations):
    station_data = get_stop_plot_data(station, ZURICH_HB_NAME, departure_datetime, Max_travel_time_, quality_computer,
                                     with_confidence=False)
    quality_plot_data.append(station_data)
    if i%100 == 0:
        print("Done with {}".format(i))

quality_plot_data = [data for data in quality_plot_data if data is not None]

In [541]:
PLOT_DATA_FILENAME = 'zurich-830am-60mins_quality_computed_itineraries.pkl'
with open(PLOT_DATA_FILENAME, 'wb') as f:
    pickle.dump(quality_plot_data, f)

In [543]:
quality_map = create_map_with_quality('Zurich HB', ZURICH_HB_COORDS, quality_plot_data)
quality_map

In [388]:
no_rush_map = create_map_with_quality("Zurich HB", ZURICH_HB_COORDS, plot_data)
no_rush_map

In [377]:
am_map = create_map_with_quality("Zurich HB", ZURICH_HB_COORDS, am_plot_data)
am_map

In [376]:
am_map.save('zurich_830am-60mins-map.html')

In [378]:
pm_map = create_map_with_quality("Zurich HB", ZURICH_HB_COORDS, plot_data)
pm_map

In [379]:
pm_map.save('zurich_6pm-60mins-map.html')

### Comparison with Bern

To have a baseline for comparison, we build the same isochrone map for Bern railway station.

In [380]:
#Coordinates of the main station of Bern 
Lat_B = 46.94972
Long_B = 7.43944
def dist_to_BN(lat, long): 
    #geopy distances expose their value in kilometers through the .km attribute
    return round(geo_dist((lat, long), (Lat_B, Long_B)).km, 1)

In [381]:
Df_meta['km_to_bern'] = Df_meta.apply(lambda x: dist_to_BN(x['Lat'], x['Long']), axis=1)
stops_bern = Df_meta[Df_meta['km_to_bern'] < 10]
bern_stations = stops_bern['StopName_Meta'].tolist()
bern_stations = list(set(bern_stations))

In [382]:
len(bern_stations)

413

In [409]:
BERN_NAME = 'Bern'
BERN_COORDS = (46.94972, 7.43944)

bern_plot_data = []
for i, station in enumerate(bern_stations):
    station_data = get_stop_plot_data(station, BERN_NAME, departure_datetime, Max_travel_time_, quality_computer)
    bern_plot_data.append(station_data)
    if i%10 == 0:
        print("Done with {}".format(i))

bern_plot_data = [data for data in bern_plot_data if data is not None]

Done with 0
Done with 10
...
Done with 410


In [410]:
PLOT_DATA_FILENAME = 'bern-830am-60mins.pkl'
with open(PLOT_DATA_FILENAME, 'wb') as f:
    pickle.dump(bern_plot_data, f)

In [411]:
bern_map_am = create_map_with_quality("Bern", BERN_COORDS, bern_plot_data)
bern_map_am

In [412]:
bern_map_am.save('bern-830am-60mins-map.html')

In [406]:
bern_map = create_map_with_quality("Bern", BERN_COORDS, bern_plot_data)
bern_map

In [407]:
bern_map.save('bern-11am-60min-map.html')