# Create walking edges:

This notebook computes the time it takes to walk between stations that are within 500 m of each other and saves the result as a pickle file. When the `transfers` file on the HDFS filesystem contains a transfer time for a pair of stations, that time is used; otherwise the walking time is computed as 2 min + 1 min per 50 m.

In [None]:
%%configure
{"conf": {
    "spark.app.name": "dslab-group_final"
}}

In [None]:
stops = spark.read.format('orc').load('/data/sbb/timetables/orc/stops/000000_0')
transfers = spark.read.format('orc').load('/data/sbb/timetables/orc/transfers/000000_0')

### Imports:

In [None]:
from geopy.distance import distance as geo_distance
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.functions import udf

Define helpers that compute the distance (as the crow flies) between stops, so that we can later select the pairs that are within 500 m of each other.

In [None]:
def zurich_distance(x, y):
    """zurich_distance: returns the distance of a station to Zurich HB
    @input: x, y: latitude and longitude of a station
    @output: distance in km to Zurich HB
    """
    # position calculated in creat_edge_and_nodes.ipynb
    zurich_pos = (47.3781762039461, 8.54019357578468)
    return geo_distance(zurich_pos, (x,y)).km

In [None]:
@udf("float")
def compute_distance(x1, y1, x2, y2):
    """
    Take the latitude and longitude of two different stops, (x1, y1) and
    (x2, y2), and return the distance between those two stops in meters.
    """
    return geo_distance((x1, y1), (x2,y2)).m
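As a rough sanity check on these distances, the great-circle (haversine) formula can be evaluated with the standard library alone. This is only a sketch and not what the notebook uses: the notebook calls geopy's `distance`, which works on an ellipsoidal Earth model by default, whereas the hypothetical `haversine_m` below assumes a spherical Earth with a mean radius of 6371 km, so the two will differ slightly.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# One degree of latitude along a meridian, roughly 111 km on this sphere.
one_degree_lat = haversine_m(0.0, 0.0, 1.0, 0.0)
```

For stops a few hundred meters apart, the difference between the spherical and ellipsoidal models should be small compared with the 500 m threshold, but the notebook's results come from geopy, not from this approximation.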

Use latitude and longitude to keep only the stops that are within 15 km of Zurich HB.

In [None]:
stops_distance = stops.rdd.map(lambda x: (x['stop_id'], zurich_distance(x['stop_lat'], x['stop_lon'])))
stops_distance = spark.createDataFrame(stops_distance.map(lambda r: Row(stop_id=r[0], 
                                                                        zurich_distance=r[1])))
stops_distance = stops_distance.filter(col('zurich_distance') <= 15)

Create a DataFrame containing all possible pairs of stops with their latitude and longitude, and then compute the distance between all pairs of stations.

In [None]:
stops_pos = stops.join(stops_distance, 'stop_id').select(col('stop_id'), 
                                                         col('stop_lat'), col('stop_lon'))
stops_pos = stops_pos.select(col('stop_id').alias('stop_id_1'), 
                             col('stop_lat').alias('stop_lat_1'), 
                             col('stop_lon').alias('stop_lon_1'))
stops_pos = stops_pos.crossJoin(stops_pos.select(col('stop_id_1').alias('stop_id_2'), 
                                                 col('stop_lat_1').alias('stop_lat_2'), col('stop_lon_1').alias('stop_lon_2')))
stops_pos = stops_pos.where(col('stop_id_1') != col('stop_id_2'))
stops_pos_dist = stops_pos.withColumn('distance', 
                                      compute_distance(col('stop_lat_1'), 
                                                       col('stop_lon_1'), 
                                                       col('stop_lat_2'), 
                                                       col('stop_lon_2')))
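The cross join followed by the `stop_id_1 != stop_id_2` filter yields ordered pairs, so each pair of stops appears once in each direction. Assuming stop ids are unique, n stops give n(n-1) rows. The sketch below uses `itertools.permutations` on hypothetical stop ids in place of Spark, only to illustrate the shape of the result:

```python
from itertools import permutations

stop_ids = ["A", "B", "C"]  # hypothetical stop ids, for illustration only

# Ordered pairs of distinct stops: same result as a cross join minus the diagonal.
pairs = list(permutations(stop_ids, 2))
```

Keeping both directions means the resulting walking edges can be used directly as directed edges, at the cost of computing each distance twice.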

Keep only the pairs for which the distance is at most 500 meters and compute the walking duration in minutes, using a speed of 50 meters per minute.

In [None]:
walking_edges = stops_pos_dist.select(col('stop_id_1').alias('source'), col('stop_id_2').alias('target'), 
                                      col('distance'))\
                                        .where(col('distance') <= 500)\
                                        .withColumn('duration', col('distance')/50)
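Note that the `duration` column above holds only the distance-based part, `distance / 50`. The fixed 2 min from the introduction is added in the last cell, and only for pairs without an entry in `transfers`. A minimal plain-Python sketch of the full fallback formula, with hypothetical names (`walk_minutes` and the constants are not part of the notebook), could look like this:

```python
WALK_SPEED_M_PER_MIN = 50  # walking speed used in the cell above
OVERHEAD_MIN = 2           # fixed overhead from the introduction (added in the last cell)
MAX_WALK_M = 500           # distance cutoff used in the cell above

def walk_minutes(distance_m):
    """Estimated walking time in minutes, or None beyond the 500 m cutoff."""
    if distance_m > MAX_WALK_M:
        return None
    return OVERHEAD_MIN + distance_m / WALK_SPEED_M_PER_MIN
```

For example, two stops 250 m apart would get 2 + 250 / 50 = 7 minutes.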

In [None]:
transfers = transfers.select('from_stop_id', 'to_stop_id', 'min_transfer_time')

In [None]:
%%spark -o transfers -n 30000

In [None]:
%%spark -o walking_edges -n 30000

If the `transfers` table contains a transfer time for a pair of stations, we use it as the walking duration. Otherwise we fall back to the distance-based duration plus 2 minutes.

In [None]:
%%local
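# Right join: keep every walking edge; transfer columns are NaN when no transfer exists
merged = transfers.merge(walking_edges,
                         left_on=['from_stop_id', 'to_stop_id'],
                         right_on=['source', 'target'], how='right')

# True for the walking edges that have a matching entry in transfers
mask_transfers = merged.to_stop_id.notnull() & merged.from_stop_id.notnull()

# Matched edges: use the transfer time, divided by 60 to express it in minutes
merged.loc[mask_transfers, 'walk_duration'] = merged.loc[mask_transfers, 'min_transfer_time'] / 60
# Unmatched edges: distance-based duration plus the fixed 2 min
merged.loc[~mask_transfers, 'walk_duration'] = merged.loc[~mask_transfers, 'duration'] + 2
merged[['source', 'target', 'walk_duration']].to_pickle('walking_edges.pickle')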
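The fallback rule applied by the merge can be sketched without pandas. The data below is hypothetical, and the sketch assumes that `min_transfer_time` is given in seconds, as the division by 60 suggests:

```python
# Hypothetical inputs: transfer times in seconds, distance-based durations in minutes.
transfer_seconds = {("A", "B"): 180}
edge_minutes = {("A", "B"): 4.0, ("B", "C"): 6.0}

def walk_duration(edge):
    """Prefer the transfer time when one exists; otherwise add the 2 min overhead."""
    if edge in transfer_seconds:
        return transfer_seconds[edge] / 60
    return edge_minutes[edge] + 2

# Only walking edges are kept, mirroring the right join on walking_edges.
result = {edge: walk_duration(edge) for edge in edge_minutes}
```

Here the edge A to B takes its 3 minutes from the transfer entry, while B to C falls back to 6 + 2 = 8 minutes.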