# Add statistical information to edges:

In this notebook, the delay statistics computed in `compute_mean_std.ipynb` and saved in `delay_distribution_percentiles.orc` are loaded and joined onto the edges of our network. The result is written to `edges_with_mean_and_std_sec.orc` in the current user's HDFS home directory and is later used to build the network.

## Set up:

In [None]:
%%configure
{"conf": {
    "spark.app.name": "dslab-group_final"
}}

In [None]:
from pyspark.sql.functions import col, udf, lit
from pyspark.sql.types import IntegerType

### Load data:

In [None]:
edges_df = spark.read.orc("/user/liseli/edges.orc")
delays = spark.read.orc("/user/liseli/delay_distribution_percentiles.orc")
trips = spark.read.format('orc').load('/data/sbb/timetables/orc/trips/000000_0')
routes = spark.read.format('orc').load('/data/sbb/timetables/orc/routes/000000_0')

edges_with_route = trips.join(routes, 'route_id').select(col('trip_id'), col('route_desc')).distinct()\
                        .join(edges_df, 'trip_id')
edges_with_route.show()

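### Add transportation type, mean and std:

Create a dictionary that maps each `route_desc` value to the transport-type code it is compared against in the delay statistics (`verkehrsmittel_text`):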

In [None]:
translate_route_desc = {
    'TGV': 'TGV',
    'Eurocity': 'EC',
    'tandseilbahn': 'AT',
    'Regionalzug': 'R',
    'RegioExpress': 'RE',
    'S-Bahn': 'S',
    'Luftseilbahn': '',
    'Sesselbahn': '',
    'Taxi': '',
    'Fähre': '',
    'Tram': 'Tram',
    'ICE': 'ICE',
    'Bus': 'Bus',
    'Gondelbahn': '',
    'Nacht-Zug': '',
    'Standseilbahn': 'AT',
    'Autoreisezug': 'ARZ',
    'Eurostar': 'EC',
    'Schiff': '',
    'Schnellzug': 'TGV',
    'Intercity': 'IC',
    'InterRegio': 'IR',
    'Extrazug': 'EXT',
    'Metro': 'Metro'
}

In [None]:
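@udf("string")
def translate_dict(text):
    # Map a route_desc value to its transport-type code.
    # Unknown or missing values map to '' instead of raising a KeyError.
    return translate_route_desc.get(text, '')

@udf('string')
def truncate_stop_id_column(s):
    # Keep the part of the stop id before the first ':'; nulls pass through.
    return s.split(':')[0] if s is not None else None

@udf('string')
def truncate_stop_id_len(s):
    # Keep the first 7 characters of the stop id; nulls pass through.
    return str(s)[:7] if s is not None else None

@udf('long')
def leng(s):
    # Length of the string form of s (not used further in this notebook).
    return len(str(s))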
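The two truncation helpers bring stop ids from the timetable and from the delay statistics into a common form so they can be compared in the join. A minimal plain-Python sketch of what each one does follows; the function names and the sample ids are hypothetical and chosen only for illustration, they are not taken from the data.

```python
def truncate_at_colon(stop_id):
    """Keep the part before the first ':' (same logic as truncate_stop_id_column)."""
    return stop_id.split(':')[0]

def truncate_to_seven(stop_id):
    """Keep the first 7 characters of the string form (same logic as truncate_stop_id_len)."""
    return str(stop_id)[:7]

# Hypothetical sample ids, for illustration only
timetable_id = '8503000:0:5'
statistics_id = 850300012

print(truncate_at_colon(timetable_id))   # 8503000
print(truncate_to_seven(statistics_id))  # 8503000
```

Both calls reduce their input to the same 7-character prefix, which is what makes the equality test on `truncated_stop_id` possible.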

In [None]:
edges_with_route = edges_with_route.withColumn('route_desc_translated', translate_dict(col('route_desc')))\
                                       .withColumn('hour', (col('arrival_time')/60).cast(IntegerType()))\
                                       .withColumn('truncated_stop_id', truncate_stop_id_column(col('stop_id'))).cache()
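The `hour` column buckets each edge by the hour of its arrival so that it can be matched against the per-hour delay statistics. A minimal plain-Python sketch of that bucketing, assuming `arrival_time` is expressed in minutes since midnight (the column name `hour` suggests this, but the notebook does not state the unit); `hour_bucket` is a hypothetical helper name.

```python
def hour_bucket(arrival_time_minutes):
    """Integer hour containing the given time, mirroring (arrival_time / 60).cast(IntegerType())."""
    # int() truncates toward zero, as Spark's cast from double to int does
    return int(arrival_time_minutes / 60)

print(hour_bucket(495))   # 08:15 -> 8
print(hour_bucket(1439))  # 23:59 -> 23
```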

In [None]:
# Divide every statistic by 60 and rename the join columns of the delay table
stat_cols = ['mean', 'std'] + ['p_{}'.format(p) for p in range(90, 100)]
delays = delays.select([(col(c)/60).alias(c) for c in stat_cols] +
                       [col('hour').alias('hour_2'), col('stop_id').alias('stop_id_2'), col('verkehrsmittel_text')])\
               .withColumn('truncated_stop_id', truncate_stop_id_len(col('stop_id_2')))

In [None]:
edges_final = edges_with_route.join(delays, (edges_with_route.hour == delays.hour_2) &\
                                            (edges_with_route.truncated_stop_id == delays.truncated_stop_id) &\
                                            (edges_with_route.route_desc_translated == delays.verkehrsmittel_text), how='left')
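The join above is a left join on three conditions (hour, truncated stop id, transport type), so every edge is kept and edges without matching statistics get nulls. The semantics can be sketched in plain Python as below. This is an illustration rather than the Spark implementation: it assumes at most one statistics row per key (a real left join would duplicate the edge for each match), and the field names and sample rows are hypothetical.

```python
def left_join_on_key(edges, stats):
    """Keep every edge; attach mean/std when (hour, stop, type) matches, else None."""
    # Index the statistics by the composite join key
    stats_by_key = {(s['hour'], s['stop'], s['type']): s for s in stats}
    joined = []
    for e in edges:
        match = stats_by_key.get((e['hour'], e['stop'], e['type']))
        joined.append({**e,
                       'mean': match['mean'] if match else None,
                       'std': match['std'] if match else None})
    return joined

# Hypothetical rows, for illustration only
edges = [
    {'trip_id': 'a', 'hour': 8, 'stop': '8503000', 'type': 'IC'},
    {'trip_id': 'b', 'hour': 9, 'stop': '8503000', 'type': 'Bus'},
]
stats = [{'hour': 8, 'stop': '8503000', 'type': 'IC', 'mean': 1.5, 'std': 0.7}]

for row in left_join_on_key(edges, stats):
    print(row['trip_id'], row['mean'], row['std'])
```

Edge `a` picks up its statistics, while edge `b` is kept with `None` for both values, which corresponds to the null rows counted later in the notebook.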

Create the edges dataframe with the following columns:
 - trip_id
 - stop_id
 - train_type
 - arrival_time
 - departure_time
 - next_stop
 - trip_duration
 - mean
 - std
 - p_90 to p_99

Compared with the original edges dataframe (from `edges.orc`), this adds the mean, the std, the delay percentiles and the transport type.

In [None]:
edges_final = edges_final.select('trip_id', 'stop_id', col('route_desc').alias('train_type'),
                                 'arrival_time', 'departure_time', 'next_stop', 'trip_duration', 'mean', 'std',
                                 'p_90', 'p_91', 'p_92', 'p_93', 'p_94', 'p_95', 'p_96', 'p_97', 'p_98', 'p_99').cache()

#### Check how many edges have no mean or std information:

In [None]:
# Count the rows once and reuse the total for both proportions
total = float(edges_final.count())
print('Proportion of null values:\n\tMean: {:.2f}%'
      .format(edges_final.filter(col('mean').isNull()).count() / total * 100))
print('\tStd: {:.2f}%'.format(edges_final.filter(col('std').isNull()).count() / total * 100))

Most edges have a mean and a std. For edges where these values are missing, later computations replace the mean with the duration of the trip and the std with 0.
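The fallback for missing statistics is applied outside this notebook, so its actual implementation is not shown here. A minimal plain-Python sketch of the rule as stated, with a hypothetical helper name and hypothetical sample rows:

```python
def fill_missing_stats(edge):
    """Return a copy of the edge with null mean/std replaced by the stated defaults."""
    filled = dict(edge)
    if filled.get('mean') is None:
        # Stated rule: a missing mean falls back to the trip duration
        filled['mean'] = filled['trip_duration']
    if filled.get('std') is None:
        # Stated rule: a missing std falls back to 0
        filled['std'] = 0
    return filled

# Hypothetical rows, for illustration only
with_stats = {'trip_id': 'a', 'trip_duration': 4, 'mean': 1.5, 'std': 0.7}
without_stats = {'trip_id': 'b', 'trip_duration': 6, 'mean': None, 'std': None}

print(fill_missing_stats(with_stats))
print(fill_missing_stats(without_stats))
```

On a Spark dataframe the same rule could presumably be expressed with `coalesce` from `pyspark.sql.functions`, though that is an assumption about the downstream code rather than something this notebook confirms.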

#### Write the edges to orc:

In [None]:
%%local
import os
username = os.environ['JUPYTERHUB_USER']

In [None]:
%%send_to_spark -i username -t str -n username

In [None]:
edges_final.write.format("orc").mode('overwrite').save("/user/{}/edges_with_mean_and_std_sec.orc".format(username))