# Create edges and nodes: 

This notebook creates the nodes from all stations within a 15 km radius of Zürich HB and the edges between those stations for services departing during the day (08:00 to 19:59). It then writes both tables to the HDFS home directory of the user running the notebook.

In [None]:
%%configure
{"conf": {
    "spark.app.name": "dslab-group_final"
}}

#### Imports:

In [None]:
from geopy.distance import distance as geo_distance
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

#### Load data:

In [None]:
# Load data: these are snapshots of all the available timetable data.
# Calendar and trips would be useful to filter the other dataframes by day, but are not loaded here.

stop_times = spark.read.format('orc').load('/data/sbb/timetables/orc/stop_times/000000_0')
stops = spark.read.format('orc').load('/data/sbb/timetables/orc/stops/000000_0')

#### Criterion 1: Stop times during the day

We only consider journeys at reasonable hours of the day, so we keep only stop times whose departure falls in the daytime window from 08:00 to 19:59.

In [None]:
# Filter stop_times to be only in 08:00-19:59:
stop_times = stop_times.where((col('departure_time') >= '08:00:00') 
                              & (col('departure_time') <= '19:59:59'))
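
The filter above compares the times as strings. A minimal plain-Python sketch of why this works, assuming `departure_time` is always a zero-padded `HH:MM:SS` string (the helper `in_daytime_window` and the sample times are hypothetical, for illustration only):

```python
# Zero-padded 'HH:MM:SS' strings sort lexicographically in the same order
# as the times they represent, so plain string comparison is enough.
def in_daytime_window(departure_time, start="08:00:00", end="19:59:59"):
    """Return True if departure_time lies in [start, end] (string comparison)."""
    return start <= departure_time <= end

# Hypothetical sample values, not taken from the timetable data
times = ["07:59:59", "08:00:00", "12:30:00", "19:59:59", "20:00:00"]
kept = [t for t in times if in_daytime_window(t)]
print(kept)
```

Times past midnight written as, for example, `25:10:00` also fall outside the window under this comparison. Times without zero padding (such as `8:05:00`) would not compare correctly.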

#### Criterion 2: Stations around Zürich HB

Only consider stations in a 15km radius of Zürich's train station (Zürich HB). 

First we get the geolocation of Zürich Hauptbahnhof to be able to calculate the distance of the other stations to the Hauptbahnhof. 

In [None]:
zurich_pos = stops.where(col('stop_name') == 'Zürich HB').select('stop_lat', 'stop_lon').collect()
zurich_pos = (zurich_pos[0][0], zurich_pos[0][1])
print('Location of Zürich Hauptbahnhof (lat, lon) :'+str(zurich_pos))

In [None]:
def zurich_distance(x, y):
    """zurich_distance: returns the distance of a station to Zurich HB
    @input: x, y: latitude and longitude of a station
    @output: distance in km to Zurich HB
    """
    return geo_distance(zurich_pos, (x,y)).km
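
To illustrate the radius criterion without Spark or geopy, here is a rough sketch using the haversine formula on a spherical Earth. This is a swapped-in approximation, not the method used by `geo_distance`, so results may differ slightly. The names `haversine_km`, `within_radius` and the coordinates in `HB_APPROX` are assumptions for illustration; the notebook reads the real position from `stops`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate, illustrative position of Zürich HB (assumption, not from the data)
HB_APPROX = (47.378, 8.540)

def within_radius(lat, lon, radius_km=15):
    """True if the point lies within radius_km of HB_APPROX."""
    return haversine_km(HB_APPROX[0], HB_APPROX[1], lat, lon) <= radius_km

# A point 0.1 degrees north is roughly 11 km away, 0.2 degrees roughly 22 km
print(within_radius(47.478, 8.540), within_radius(47.578, 8.540))
```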

Then we create a dataframe `stops_zurich` of the stations, with an added column for the distance to Zurich HB. In that dataframe, we keep only the stations within a 15 km radius of the HB. The same filter is applied to the `stop_times` dataframe mentioned above, giving `stop_times_zurich`.

In [None]:
# filter stops:
stops_distance = stops.rdd.map(lambda x: (x['stop_id'], zurich_distance(x['stop_lat'], x['stop_lon'])))
stops_distance = spark.createDataFrame(stops_distance.map(lambda r: Row(stop_id=r[0], 
                                                                        zurich_distance=r[1])))

stops_distance = stops_distance.filter(col('zurich_distance') <= 15)

# add distance to HB to stops info and keep only in radius of 15km
stops_zurich = stops_distance.join(stops, on='stop_id')

# keep only stop times in radius of 15km of Zurich
stop_times_zurich = stop_times.join(stops_distance.select('stop_id'), on='stop_id')

In [None]:
# Cache it to save time:
stop_times_zurich.cache()

### Have a look at the data we have so far: 

#### Stop times in Zurich: 
Arrival and departure times at stops in the 15km radius of Zurich HB. 

In [None]:
stop_times_zurich.show(3)

#### Stops in Zurich:
Information about stops in the 15km radius of Zurich HB.

In [None]:
stops_zurich.show(3)

## Create network data:

From the pre-processed data, we would like to create a directed network where each node is a station and each edge between two nodes corresponds to a possible trip. 

A node will have the following attributes:
- stop_name: name of the station (e.g. Zurich HB)
- latitude
- longitude

A directed edge will have the following attributes:
- stop_id: the id of the stop the (directed) edge points from
- next_stop: the id of the stop the edge points to
- duration: the duration of the trip from stop_id to next_stop
- departure_time: the time at which the service departs from stop_id

(Further attributes, such as the train type and mean/standard-deviation statistics, will be added in other notebooks; these are just the basic edges.)
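
As a rough picture of the structure described above, the following plain-Python sketch stores several edges per ordered pair of stations. The station ids, coordinates and times are hypothetical, and the dictionary layout is only one possible representation, not the one used in the other notebooks.

```python
from collections import defaultdict

# Hypothetical nodes: stop_id -> attributes
nodes = {
    "A": {"stop_name": "Station A", "latitude": 47.38, "longitude": 8.54},
    "B": {"stop_name": "Station B", "latitude": 47.39, "longitude": 8.52},
    "C": {"stop_name": "Station C", "latitude": 47.41, "longitude": 8.50},
}

# Hypothetical edges: (stop_id, next_stop, departure_time, duration), times in minutes
edge_rows = [
    ("A", "B", 481, 5),
    ("A", "B", 511, 5),  # a later service on the same pair of stations
    ("B", "C", 487, 8),
]

# Multigraph: each ordered pair of stations maps to a list of parallel edges
graph = defaultdict(list)
for stop_id, next_stop, departure_time, duration in edge_rows:
    graph[(stop_id, next_stop)].append(
        {"departure_time": departure_time, "duration": duration}
    )

# Earliest departure from A to B among the parallel edges
earliest = min(e["departure_time"] for e in graph[("A", "B")])
print(len(graph[("A", "B")]), earliest)
```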

### Nodes:

The network is a **multigraph** (i.e. more than one edge is allowed between two nodes) with the stations as nodes. Here we write the station table to HDFS as the node list.

In [None]:
%%local
import os
username = os.environ['JUPYTERHUB_USER']

In [None]:
%%send_to_spark -i username -t str -n username

In [None]:
stops_zurich.write.format("orc").mode('overwrite').save("/user/{}/nodes.orc".format(username))

### Edges:

For basic information about edges, we only want a table with a station, the next station on its trip, the departure time and the duration of the trip. 

In [None]:
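@udf(returnType='int')
def convertToMinute(s):
    # Convert an 'HH:MM:SS' string to minutes since midnight (seconds are dropped).
    # Without an explicit return type, the udf would return strings.
    h, m, _ = s.split(':')
    h, m = int(h), int(m)

    return h*60+m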
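
The same conversion can be checked outside Spark. A minimal sketch, where `to_minutes` is a hypothetical plain-Python counterpart of the udf above:

```python
def to_minutes(s):
    """Convert an 'HH:MM:SS' string to whole minutes since midnight.

    The seconds field is ignored, as in the udf above.
    """
    h, m, _ = s.split(':')
    return int(h) * 60 + int(m)

print(to_minutes("08:15:00"))  # 8 * 60 + 15 = 495
print(to_minutes("19:59:59"))  # 19 * 60 + 59 = 1199
```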

In [None]:
# Convert time information to minutes elapsed since midnight
stop_times_zurich = stop_times_zurich.withColumn('arrival_time', 
                                                 convertToMinute(col('arrival_time')))
stop_times_zurich = stop_times_zurich.withColumn('departure_time', 
                                                 convertToMinute(col('departure_time')))
stop_times_zurich.show(3)

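Then we want a dataframe that has the trip duration from the current stop to the next stop on the trip. For that, we first create a table holding each stop's id and arrival time, keyed by the preceding `stop_sequence` (i.e. `stop_sequence - 1`), so that each row lines up with the stop before it on the same trip.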

In [None]:
stop_times_zurich_2 = (stop_times_zurich.withColumn('stop_sequence_prev', col('stop_sequence')-1)
                       .select('trip_id',
                               col('stop_id').alias('next_stop'),
                               col('stop_sequence_prev').alias('stop_sequence'),
                               col('arrival_time').alias('next_arrival_time')))

stop_times_zurich_2.show(2)

Then we join this to the `stop_times_zurich` table to have trip duration (in minutes) and next stop information. 

In [None]:
# Add trip duration and next stop: 
stop_times_zurich = stop_times_zurich.join(stop_times_zurich_2, 
                                           on=['trip_id', 'stop_sequence']).orderBy('trip_id', 'stop_sequence')
stop_times_zurich = stop_times_zurich.withColumn('trip_duration', 
                                                 col('next_arrival_time')-col('departure_time'))
stop_times_zurich = stop_times_zurich.select('trip_id', 
                                             'stop_id', 'arrival_time', 'departure_time', 
                                             'next_stop', 'trip_duration').cache()
stop_times_zurich.show(2)
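
The shift-and-join step can be sketched on a toy trip in plain Python. The rows below are hypothetical, and the dictionary lookup stands in for the inner join on `trip_id` and `stop_sequence`:

```python
# Hypothetical stop times for one trip, times already in minutes since midnight
rows = [
    {"trip_id": "t1", "stop_sequence": 1, "stop_id": "A",
     "arrival_time": 480, "departure_time": 481},
    {"trip_id": "t1", "stop_sequence": 2, "stop_id": "B",
     "arrival_time": 486, "departure_time": 487},
    {"trip_id": "t1", "stop_sequence": 3, "stop_id": "C",
     "arrival_time": 495, "departure_time": 496},
]

# Shifted table: key each stop by the sequence number of the stop before it
next_by_key = {
    (r["trip_id"], r["stop_sequence"] - 1): (r["stop_id"], r["arrival_time"])
    for r in rows
}

# Inner join: a row without a successor (the last stop) produces no edge
edges = []
for r in rows:
    key = (r["trip_id"], r["stop_sequence"])
    if key in next_by_key:
        next_stop, next_arrival = next_by_key[key]
        edges.append({
            "trip_id": r["trip_id"],
            "stop_id": r["stop_id"],
            "next_stop": next_stop,
            "departure_time": r["departure_time"],
            "trip_duration": next_arrival - r["departure_time"],
        })

for e in edges:
    print(e)
```

Because the join is an inner join, a stop whose successor was removed by the radius or time filter also produces no edge.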

##### Save stop_times information to HDFS:

In [None]:
stop_times_zurich.write.format("orc").mode('overwrite').save("/user/{}/edges.orc".format(username))