# Get Zurich stations and corresponding walking connections

In this notebook, we filter the station list data from two different data sources to obtain a dataframe containing the IDs (and additional information) of every station within 15 km from Zurich HB. The notebook is structured as follows: 

*   **[Start Spark](#spark)** 
*   **[Get station list from BFKOORD_GEO](#stations)**  
*   **[Get station list from time-table stops](#stops)** 
*   **[Compare station lists](#difference)**  
*   **[Compute walking times](#walking)** 
*   **[Download CSV files](#download)** 



<a id = 'spark'></a>
### 1. Start Spark


We will be using a Spark Session for performing didfferent transformations and actions on the raw dataframes

In [None]:
%%configure
{"conf": {
    "spark.app.name": "datavirus_final"
}}

In [2]:
# Initialization
spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
8627,application_1589299642358_3152,pyspark,idle,Link,Link,✔


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

SparkSession available as 'spark'.


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<pyspark.sql.session.SparkSession object at 0x7f605a918710>

<a id = 'stations'></a>
### 2. Get station list from [BFKOORD_GEO](https://opentransportdata.swiss/en/cookbook/hafas-rohdaten-format-hrdf/#Abgrenzung)


We load the station list provided by the Open Data Switzerland platform

In [3]:
metadata = spark.read.csv('/data/sbb/stations/bfkoordgeo.csv', header=True)
metadata.show(5)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+---------+---------+---------+------+----------------+
|StationID|Longitude| Latitude|Height|          Remark|
+---------+---------+---------+------+----------------+
|  0000002|26.074412|44.446770|     0|       Bucuresti|
|  0000003| 1.811446|50.901549|     0|          Calais|
|  0000004| 1.075329|51.284212|     0|      Canterbury|
|  0000005|-3.543547|50.729172|     0|          Exeter|
|  0000007| 9.733756|46.922368|   744|Fideris, Bahnhof|
+---------+---------+---------+------+----------------+
only showing top 5 rows

We compute the distance from stations using the Haversine formula, expressed in terms of a two-argument inverse tangent function to calculate the great circle distance between two points on the Earth:

$$ a = sin^2(\Delta{\phi}/2) + cos(\phi_1) * cos(\phi_2) * sin^2(\Delta{\lambda}/2) $$

$$ distance =  2R * atan2(\sqrt{a},\sqrt{1-a}) $$

with $R$ being the mean Earth radius (6371 km)</li>, $\phi$ the latitude and $\lambda$ the longitude
       



In [15]:
import pyspark.sql.functions as F
import math
df_stations =(
    metadata
            .withColumn("dlon", F.radians(F.col("Longitude")) - math.radians(8.540192)) 
            .withColumn("dlat", F.radians(F.col("Latitude")) - math.radians(47.378177)) 
            .withColumn("a", F.sin(F.col("dlat") / 2) ** 2 + math.cos(math.radians(47.378177)) 
                        *F.cos(F.radians(F.col("Latitude")))* F.sin(F.col("dlon") / 2) ** 2)
            .withColumn("Distance_from_Zurich", 1000*6371*2*F.atan2( F.sqrt(F.col("a")),F.sqrt( 1- F.col("a"))))             
            .drop("dlon", "dlat","a") \
           .filter(F.col("Distance_from_Zurich")<15000)
)
df_stations.show(5)
df_stations.select("StationID").write.format('csv').mode('overwrite').save("../data/zurich_stations_ids.csv")
df_stations.write.format('csv').mode('overwrite').save("../data/zurich_stations_info.csv",header=True)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+---------+---------+---------+------+--------------------+--------------------+
|StationID|Longitude| Latitude|Height|              Remark|Distance_from_Zurich|
+---------+---------+---------+------+--------------------+--------------------+
|  0000065| 8.595545|47.409209|   430|  Wallisellen, Glatt|   5409.956757262041|
|  0000066| 8.595545|47.409209|   430|Wallisellen, Zent...|   5409.956757262041|
|  0000176| 8.521961|47.351679|     0|Zimmerberg-Basist...|   3250.669988842345|
|  8502186| 8.398942|47.393407|   428|Dietikon Stoffelbach|  10768.073179887586|
|  8502187| 8.377032|47.364740|   502|Rudolfstetten Hof...|  12377.426176789762|
+---------+---------+---------+------+--------------------+--------------------+
only showing top 5 rows

<a id = 'stops'></a>
### 3. Get station list from time-table stops


Another alternative is to get the stations from the stops appearing in the time table. We will try to obtain the Zurich stations from this datset and compared them with the stations obtained from the BFKOORD_GEO dataset

In [5]:
stop_metadata = spark.read.orc("hdfs:///data/sbb/timetables/orc/stops")
stop_metadata.show(5)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------------------+----------------+----------------+-------------+--------------+
|stop_id|           stop_name|        stop_lat|        stop_lon|location_type|parent_station|
+-------+--------------------+----------------+----------------+-------------+--------------+
|1322000|            Altoggio|46.1672513851495|  8.345807131427|             |              |
|1322001|        Antronapiana| 46.060121674738|8.11361957990831|             |              |
|1322002|              Anzola|45.9898698225697|8.34571729989858|             |              |
|1322003|              Baceno|46.2614983591677|8.31925293162473|             |              |
|1322004|Beura Cardezza, c...|46.0790618438814|8.29927439970313|             |              |
+-------+--------------------+----------------+----------------+-------------+--------------+
only showing top 5 rows

In [6]:
df_stops =(
    stop_metadata
            .withColumn("dlon", F.radians(F.col("stop_lon")) - math.radians(8.540192)) 
            .withColumn("dlat", F.radians(F.col("stop_lat")) - math.radians(47.378177)) 
            .withColumn("a", F.sin(F.col("dlat") / 2) ** 2 + math.cos(math.radians(47.378177)) 
                        *F.cos(F.radians(F.col("stop_lat")))* F.sin(F.col("dlon") / 2) ** 2)
            .withColumn("Distance_from_Zurich", 1000*6371*2*F.atan2( F.sqrt(F.col("a")),F.sqrt( 1- F.col("a"))))             
            .drop("dlon", "dlat","a") \
           .filter(F.col("Distance_from_Zurich")<15000)
)

df_stops.show(5)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-----------+--------------------+----------------+----------------+-------------+--------------+--------------------+
|    stop_id|           stop_name|        stop_lat|        stop_lon|location_type|parent_station|Distance_from_Zurich|
+-----------+--------------------+----------------+----------------+-------------+--------------+--------------------+
|    8500926|Oetwil a.d.L., Sc...|47.4236270123012| 8.4031825286317|             |              |  11483.706414892196|
|    8502186|Dietikon Stoffelbach|47.3934058321612|8.39894248049007|             |      8502186P|  10768.017150422354|
|8502186:0:1|Dietikon Stoffelbach|47.3934666445388|8.39894248049007|             |      8502186P|  10769.076553178611|
|8502186:0:2|Dietikon Stoffelbach|47.3935274568464|8.39894248049007|             |      8502186P|  10770.140096033407|
|   8502186P|Dietikon Stoffelbach|47.3934058321612|8.39894248049007|            1|              |  10768.017150422354|
+-----------+--------------------+--------------

We observe that the stop_ids contain more information than in the previous dataframe because in this case they also distiguish between different platfroms inside the same station. There are 

In [7]:
df_stops.where(F.col('stop_name')=='Zürich HB').show(5) 

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+------------+---------+----------------+----------------+-------------+--------------+--------------------+
|     stop_id|stop_name|        stop_lat|        stop_lon|location_type|parent_station|Distance_from_Zurich|
+------------+---------+----------------+----------------+-------------+--------------+--------------------+
|     8503000|Zürich HB|47.3781762039461|8.54019357578468|             |      8503000P| 0.14803143371142663|
|8503000:0:10|Zürich HB|47.3794536181612|8.54019357578468|             |      8503000P|   141.9535123729413|
|8503000:0:11|Zürich HB|47.3795144466376|8.54019357578468|             |      8503000P|   148.7173280882197|
|8503000:0:12|Zürich HB|47.3786020121232|8.54019357578468|             |      8503000P|   47.25934080565981|
|8503000:0:13|Zürich HB|47.3785411825942|8.54019357578468|             |      8503000P|  40.495430668280754|
+------------+---------+----------------+----------------+-------------+--------------+--------------------+
only showing top 5 

<a id = 'difference'></a>
### 4. Compare both dataframes to see which stations are missing


Nonetheless, if we don't make a distinction between the patforms and only take into account the first 7 characters of the station ID (corresponding to the parent station) we see that the first dataframe is more complete than the second one, so we decide to stick to the filtered station list obtained from the BFKOORD_GEO dataset

In [8]:
only_names_stations = df_stations.select('StationID').distinct()
only_names_stops =  df_stops.where(F.length('stop_id')==7).select('stop_id')
missing_ids = only_names_stations.subtract(only_names_stops)
missing_stations = df_stations.join(missing_ids,on =['StationID'])
missing_stations.show()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+---------+---------+---------+------+--------------------+--------------------+
|StationID|Longitude| Latitude|Height|              Remark|Distance_from_Zurich|
+---------+---------+---------+------+--------------------+--------------------+
|  0000065| 8.595545|47.409209|   430|  Wallisellen, Glatt|   5409.956757262041|
|  0000066| 8.595545|47.409209|   430|Wallisellen, Zent...|   5409.956757262041|
|  0000176| 8.521961|47.351679|     0|Zimmerberg-Basist...|   3250.669988842345|
|  8502229| 8.430330|47.380971|   456|   Urdorf Weihermatt|   8277.819213423541|
|  8502273| 8.346555|47.351473|   386|          Bremgarten|  14883.063708751546|
|  8502276| 8.366793|47.362187|   550|       Berikon-Widen|  13178.777609530798|
|  8502758| 8.532976|47.244746|   617|Hausen am Albis, ...|  14846.820829250355|
|  8503001| 8.488940|47.391481|   399|   Zürich Altstetten|   4132.461995948391|
|  8503006| 8.544115|47.411529|   442|     Zürich Oerlikon|   3720.310973238964|
|  8503007| 8.544636|47.4187

<a id = 'walking'></a>
### 5. Get the walking transfer times between different stations


Once we have a dataframe containing only the relevant stations, we can compute the distance between each of them, thanks to the latitude, longitude and height data. Using this distance value, we can estimate the walking time between nearby stations (maximal Haversine distance $\Delta X$ of 500 m) using the following formulas: 

$$ distance(m) = \sqrt{\Delta X^2 + \Delta h^2} $$

$$ speed (m/min) = 50 (m/min) - 0.01*\Delta h $$

$$ Time (sec) = 120 s + 60*\frac{distance (m)}{speed (m/min)} $$

We have taken into account both the Harvesine distance $\Delta X$ and the height difference $\Delta h$, because it is not the same thing to go downhill than to go uphill ;)

In [10]:
station_pairs = df_stations.select('StationID','Longitude','Latitude','Height')
joinedDF = station_pairs.crossJoin(station_pairs).toDF('id1','lon1','lat1','h1','id2','lon2','lat2','h2')
joinedDF.show(5)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------+---------+---+-------+--------+---------+---+
|    id1|    lon1|     lat1| h1|    id2|    lon2|     lat2| h2|
+-------+--------+---------+---+-------+--------+---------+---+
|0000065|8.595545|47.409209|430|0000065|8.595545|47.409209|430|
|0000065|8.595545|47.409209|430|0000066|8.595545|47.409209|430|
|0000065|8.595545|47.409209|430|0000176|8.521961|47.351679|  0|
|0000065|8.595545|47.409209|430|8502186|8.398942|47.393407|428|
|0000065|8.595545|47.409209|430|8502187|8.377032|47.364740|502|
+-------+--------+---------+---+-------+--------+---------+---+
only showing top 5 rows

In [14]:
from pyspark.sql.types import IntegerType

distance =(
    joinedDF
            .withColumn("dlon", F.radians(F.col("lon1")) -F.radians(F.col("lon2"))) 
            .withColumn("dlat", F.radians(F.col("lat1")) - F.radians(F.col("lat2"))) 
            .withColumn("a", F.sin(F.col("dlat") / 2) ** 2 + F.radians(F.col("lat2"))
                        *F.cos(F.radians(F.col("lat1")))* F.sin(F.col("dlon") / 2) ** 2)
            .withColumn("Distance", 1000*6371*2*F.atan2( F.sqrt(F.col("a")),F.sqrt( 1- F.col("a"))))             
            .drop("dlon", "dlat","a") \
            .filter(F.col("Distance")<500)
            .withColumn("dh", F.col("h2")-F.col("h1"))
            .withColumn("Distance(m)", F.sqrt(F.pow(F.col("Distance"),2)+F.pow(F.col("dh"),2)))
            .withColumn("speed",50-F.col("dh")/100)
            .withColumn("Transfer_time (s)", F.round(60*(2+(F.col("Distance(m)")/F.col("speed")).cast(FloatType()))))
            .drop("dlon", "dlat","a","Distance",'dh')
            .toDF('ID1','Lon1','Lat1','H1','ID2','Lon2','Lat2','H2','Distance (m)','Speed (m/min)','Transfer time (sec)')
)

distance.write.format('csv').mode('overwrite').save("../data/zurich_walking_connections.csv",header=True)
distance.where(F.abs(F.col('dh'))>30).show(10)

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-------+--------+---------+---+-------+--------+---------+---+------------------+-------------+-------------------+
|    ID1|    Lon1|     Lat1| H1|    ID2|    Lon2|     Lat2| H2|      Distance (m)|Speed (m/min)|Transfer time (sec)|
+-------+--------+---------+---+-------+--------+---------+---+------------------+-------------+-------------------+
|0000176|8.521961|47.351679|  0|8503086|8.526232|47.352124|422| 553.9113633018651|        45.78|              846.0|
|8502188|8.354599|47.355907|445|8502268|8.359234|47.357579|523|435.18144490895634|        49.22|              650.0|
|8502188|8.354599|47.355907|445|8502274|8.354713|47.352524|410|377.91621690174395|        50.35|              570.0|
|8502188|8.354599|47.355907|445|8517377|8.350274|47.353792|400|432.23423336871673|        50.45|              634.0|
|8502188|8.354599|47.355907|445|8580847|8.349997|47.353839|386| 450.5219233405746|        50.59|              654.0|
|8502208|8.589802|47.258748|484|8573553|8.589041|47.261463|409| 

<a id = 'download'></a>
### 6. (Optional) Download data in CSV format


In [44]:
%%spark -o df_stations -n -1

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [45]:
%%local
df_stations.to_csv("../data/Zurich_Stations.csv", index=False)

In [46]:
%%spark -o distance -n -1

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

In [47]:
%%local
distance.to_csv("../data/Zurich_WalkingConnections.csv", index=False)