# Prepare a Daily Schedule
The schedule datasets contain far more data than we need to analyse a single day's delays. Removing the excess up front makes the later analysis much simpler.

By the end of this notebook, we will have two tables: the trips, and the stop times for each trip. Both will be ready to have delay information merged in once it has been analysed.

## Accessing the archive
This project archives each day's schedule information in a zip file named after the date.

In [1]:
import zipfile
file_name = '20190124.zip'
# ZipFile can open the archive by path directly; the context manager closes it afterwards
with zipfile.ZipFile(file_name) as z:
    z.extractall()
print("Extracted file " + file_name)

Extracted file 20190124.zip


The archive contains the scheduled timetables as well as the real-time delay information, which we will deal with later.

## The schedules
The schedules are a set of plain text files in CSV format.

In [2]:
import glob
data_path = 'home/pi/sydney-transport-tracker/data/raw/20190124/'
timetable_files = glob.glob(data_path + '*.txt')
print('The timetable files are:\n' + '\n'.join(timetable_files))

The timetable files are:
home/pi/sydney-transport-tracker/data/raw/20190124/stop_times.txt
home/pi/sydney-transport-tracker/data/raw/20190124/shapes.txt
home/pi/sydney-transport-tracker/data/raw/20190124/stops.txt
home/pi/sydney-transport-tracker/data/raw/20190124/calendar.txt
home/pi/sydney-transport-tracker/data/raw/20190124/trips.txt
home/pi/sydney-transport-tracker/data/raw/20190124/agency.txt
home/pi/sydney-transport-tracker/data/raw/20190124/routes.txt


This is the data returned by the Transport for NSW Open Data [schedules API](https://opendata.transport.nsw.gov.au/dataset/public-transport-timetables-realtime).

### Trips
`trips.txt` lists every trip that will run over a period of more than one day. Trains running on the same "line" share a `route_id`, and each trip has a unique `trip_id`. Each trip also has a `service_id`, which determines the days on which it runs.

The downloaded schedule files may cover a week or more. We want to filter the trips file so that it contains only the trips we are interested in: those that ran on 24/01/2019 (and on the city network only, which we will get to next).

To determine which trips ran on that day, we look at `calendar.txt`. Each row gives a `service_id` (matching those in `trips.txt`), a flag for each day of the week, and the `start_date` and `end_date` between which the service applies.

In [3]:
import datetime
import csv

day_of_analysis = 'thursday'
date_of_analysis = datetime.datetime.strptime('20190124', "%Y%m%d").date()
todays_services = []

with open(data_path + 'calendar.txt', mode='r', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        if row[day_of_analysis] == '1':
            start_date = datetime.datetime.strptime(row['start_date'], "%Y%m%d").date()
            end_date = datetime.datetime.strptime(row['end_date'], "%Y%m%d").date()
            if start_date <= date_of_analysis <= end_date:
                todays_services.append(row['service_id'])

print("Todays services:\n" + '\n'.join(todays_services))

Todays services:
1260.122.100
1260.122.104
1260.122.108
1260.122.112
1260.122.116
1260.122.120
1260.122.124
1260.122.32
1260.122.36
1260.122.40
1260.122.44
1260.122.48
1260.122.52
1260.122.56
1260.122.60
1260.122.96
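
The same calendar filter can also be sketched with pandas. This is a minimal illustration rather than the notebook's own method: the `services_running_on` helper, the `WEEKDAY_COLUMNS` list and the sample rows are all hypothetical, and the sample is not taken from the real feed. Deriving the weekday column from the date avoids having to keep a separate day name in sync with it.

```python
import datetime
import io

import pandas as pd

# Column names for each weekday, in the order used by date.weekday() (Monday is 0)
WEEKDAY_COLUMNS = ["monday", "tuesday", "wednesday", "thursday",
                   "friday", "saturday", "sunday"]

# Hypothetical calendar rows, for illustration only (not taken from the feed)
SAMPLE_CALENDAR = (
    "service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date\n"
    "svc.weekday,1,1,1,1,1,0,0,20190121,20190127\n"
    "svc.weekend,0,0,0,0,0,1,1,20190121,20190127\n"
    "svc.expired,1,1,1,1,1,0,0,20190101,20190120\n"
)


def services_running_on(calendar_file, date):
    """Return the service_id values whose weekday flag and date range cover `date`."""
    df = pd.read_csv(calendar_file,
                     dtype={"service_id": str, "start_date": str, "end_date": str})
    weekday = WEEKDAY_COLUMNS[date.weekday()]
    start = pd.to_datetime(df["start_date"], format="%Y%m%d")
    end = pd.to_datetime(df["end_date"], format="%Y%m%d")
    day = pd.Timestamp(date)
    # Keep rows flagged for this weekday whose date range includes the day
    active = (df[weekday] == 1) & (start <= day) & (day <= end)
    return df.loc[active, "service_id"].tolist()


print(services_running_on(io.StringIO(SAMPLE_CALENDAR), datetime.date(2019, 1, 24)))
```

With the real data, the first argument would be the path to `calendar.txt` instead of the in-memory sample.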


With this information, we can filter out all the trips from `trips.txt` that do not run on this day.

In [4]:
import pandas as pd
df_trips = pd.read_csv(data_path + 'trips.txt',
                       header=0,
                       encoding='utf-8-sig',
                       usecols=["route_id", "service_id", "trip_id"])
df_trips = df_trips[df_trips['service_id'].isin(todays_services)]
pd.options.display.max_rows = 10
df_trips

| | route_id | service_id | trip_id |
|---|---|---|---|
| 1 | BNK_2a | 1260.122.48 | 1--A.1260.122.48.M.8.55188157 |
| 19 | APS_1a | 1260.122.48 | 1--B.1260.122.48.M.8.55188160 |
| 37 | APS_2a | 1260.122.48 | 1--C.1260.122.48.M.8.55188159 |
| 55 | APS_1a | 1260.122.48 | 1--D.1260.122.48.M.8.55188306 |
| 73 | APS_2a | 1260.122.48 | 1--E.1260.122.48.M.8.55188307 |
| ... | ... | ... | ... |
| 57224 | BMT_2 | 1260.122.56 | WN12.1260.122.56.N.2.55188260 |
| 57244 | BMT_1 | 1260.122.56 | WN17.1260.122.56.N.2.55187512 |
| 57259 | BMT_2 | 1260.122.56 | WN18.1260.122.56.N.2.55187511 |
| 57281 | CTY_W1a | 1260.122.60 | WT27.1260.122.60.X.5.55187038 |


If you were to peruse the entire dataset, you would find that some `route_id` values belong to non-commuter trains (see page 4 of the [Sydney Trains Realtime GTFS & GTFS-R Technical Document](https://opendata.transport.nsw.gov.au/sites/default/files/Real-Time_Train_Technical_Document_v2.5.pdf)). There are also interstate and regional services that we don't want to consider when analysing delays, as they are long journeys rather than daily commutes.
Let's filter those out.

In [5]:
ROUTES_TO_IGNORE = ["CTY_NC1", "CTY_NC1a", "CTY_NC2", 
                    "CTY_NW1a", "CTY_NW1b", "CTY_NW1c", "CTY_NW1d", "CTY_NW2a", "CTY_NW2b", 
                    "CTY_S1a", "CTY_S1b", "CTY_S1c", "CTY_S1d", "CTY_S1e", "CTY_S1f", 
                    "CTY_S1g", "CTY_S1h", "CTY_S1i", 
                    "CTY_S2a", "CTY_S2b", "CTY_S2c", "CTY_S2d", "CTY_S2e", "CTY_S2f", 
                    "CTY_S2g", "CTY_S2h", "CTY_S2i", 
                    "CTY_W1a", "CTY_W1b", "CTY_W2a", "CTY_W2b", 
                    "HUN_1a", "HUN_1b", "HUN_2a", "HUN_2b", 
                    "RTTA_DEF", "RTTA_REV"]
df_trips = df_trips[~df_trips['route_id'].isin(ROUTES_TO_IGNORE)]
df_trips

| | route_id | service_id | trip_id |
|---|---|---|---|
| 1 | BNK_2a | 1260.122.48 | 1--A.1260.122.48.M.8.55188157 |
| 19 | APS_1a | 1260.122.48 | 1--B.1260.122.48.M.8.55188160 |
| 37 | APS_2a | 1260.122.48 | 1--C.1260.122.48.M.8.55188159 |
| 55 | APS_1a | 1260.122.48 | 1--D.1260.122.48.M.8.55188306 |
| 73 | APS_2a | 1260.122.48 | 1--E.1260.122.48.M.8.55188307 |
| ... | ... | ... | ... |
| 56917 | BMT_1 | 1260.122.32 | W597.1260.122.32.V.4.55188855 |
| 57214 | BMT_1 | 1260.122.56 | WN11.1260.122.56.N.2.55190142 |
| 57224 | BMT_2 | 1260.122.56 | WN12.1260.122.56.N.2.55188260 |
| 57244 | BMT_1 | 1260.122.56 | WN17.1260.122.56.N.2.55187512 |


In [6]:
df_trips.to_pickle('trips.pickle')

Filtering out those routes removed hundreds of trips from consideration.
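
One way to check how many trips a route filter removes is to count the rows matched by the `isin` mask before dropping them. The sketch below uses a small hypothetical table and a two-entry subset of the ignore list, so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical trips table standing in for the one read from trips.txt
trips = pd.DataFrame({
    "route_id": ["BNK_2a", "CTY_S1a", "APS_1a", "RTTA_DEF"],
    "trip_id": ["t1", "t2", "t3", "t4"],
})
ignored_routes = ["CTY_S1a", "RTTA_DEF"]  # small subset, for illustration only

# Boolean mask of trips on ignored routes; summing it counts the True values
is_ignored = trips["route_id"].isin(ignored_routes)
kept = trips[~is_ignored]

print(f"Removed {is_ignored.sum()} of {len(trips)} trips, {len(kept)} remain")
```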

### Stops
`stop_times.txt`, like `trips.txt`, contains information for journeys that occur on other days and on routes we are not interested in.
Now that we have a table containing every `trip_id` under consideration, we can filter out all of the stop times that don't matter.

In [7]:
df_stop_times = pd.read_csv(data_path + 'stop_times.txt', header=0,
                            encoding='utf-8-sig',
                            dtype={'stop_id': str},
                            usecols=["trip_id", "arrival_time", "departure_time", "stop_id"],
                            parse_dates=['arrival_time', 'departure_time'])

# remove any trips from stop_times that did NOT happen on this date, using the trips dataset
df_stop_times = df_stop_times[df_stop_times['trip_id'].isin(df_trips['trip_id'])]
df_stop_times

| | trip_id | arrival_time | departure_time | stop_id |
|---|---|---|---|---|
| 10 | 1--A.1260.122.48.M.8.55188157 | 03:52:00 | 03:52:00 | 2144243 |
| 11 | 1--A.1260.122.48.M.8.55188157 | 03:54:12 | 03:55:00 | 2141313 |
| 12 | 1--A.1260.122.48.M.8.55188157 | 03:57:30 | 03:57:30 | 214063 |
| 13 | 1--A.1260.122.48.M.8.55188157 | 03:58:42 | 03:58:42 | 214074 |
| 14 | 1--A.1260.122.48.M.8.55188157 | 04:01:24 | 04:01:24 | 2135234 |
| ... | ... | ... | ... | ... |
| 1032152 | WN18.1260.122.56.N.2.55187511 | 22:23:00 | 22:23:00 | 279536 |
| 1032153 | WN18.1260.122.56.N.2.55187511 | 22:49:00 | 22:49:00 | 27874 |
| 1032154 | WN18.1260.122.56.N.2.55187511 | 23:04:30 | 23:04:30 | 2790154 |
| 1032155 | WN18.1260.122.56.N.2.55187511 | 23:12:00 | 23:12:00 | 284515 |
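
GTFS allows `arrival_time` and `departure_time` values past `24:00:00` for trips that run over midnight of the service day, and such values cannot be parsed as clock times. Whether this particular feed contains any is not shown here. If it does, one option is to treat the strings as offsets from the start of the service day, as in this sketch (the values and variable names are hypothetical):

```python
import pandas as pd

# Hypothetical GTFS time strings; the second runs past midnight
times = pd.Series(["23:58:00", "25:10:00"])

# Interpret each HH:MM:SS string as an offset from the start of the service day
offsets = pd.to_timedelta(times)

# Add the offsets to the service day to get real timestamps
service_day = pd.Timestamp("2019-01-24")
timestamps = service_day + offsets

print(timestamps.tolist())
```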


In [8]:
df_stop_times.to_pickle('stop_times.pickle')

Now that we are dealing with only a subset of the original data dump from the Open Data API, we can start merging in the real-time delay data in the next notebook.
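
A later notebook could reload the two pickled tables with `pd.read_pickle` and confirm that every stop time still refers to a known trip. The sketch below round-trips two small hypothetical tables through a temporary directory so that it runs on its own. With the real files, only the `read_pickle` calls and the check would be needed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the two tables saved by this notebook
trips = pd.DataFrame({"route_id": ["BNK_2a"],
                      "service_id": ["1260.122.48"],
                      "trip_id": ["t1"]})
stop_times = pd.DataFrame({"trip_id": ["t1", "t1"],
                           "arrival_time": ["03:52:00", "03:54:12"],
                           "departure_time": ["03:52:00", "03:55:00"],
                           "stop_id": ["2144243", "2141313"]})

with tempfile.TemporaryDirectory() as tmp:
    trips_path = os.path.join(tmp, "trips.pickle")
    stop_times_path = os.path.join(tmp, "stop_times.pickle")
    trips.to_pickle(trips_path)
    stop_times.to_pickle(stop_times_path)

    # Reload both tables, as the next notebook would
    loaded_trips = pd.read_pickle(trips_path)
    loaded_stop_times = pd.read_pickle(stop_times_path)

# Stop times whose trip_id is missing from the trips table
orphans = ~loaded_stop_times["trip_id"].isin(loaded_trips["trip_id"])
print(f"{orphans.sum()} stop times refer to an unknown trip")
```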