# Dublin Buses - Download Data

This notebook downloads and consolidates the 31 Dublin Bus daily datasets into a single parquet file. The dataset is freely provided by the [Smart Dublin](https://data.smartdublin.ie/dataset/dublin-bus-gps-sample-data-from-dublin-city-council-insight-project) website. It contains the GPS traces of the Dublin buses for January 2013.

Notes:
* Please install the `pyarrow` package so you can use the parquet file format
* The final DataFrame is quite large and may impose some memory pressure on your machine
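
As a convenience, the `pyarrow` requirement can be checked up front rather than discovered at the very end of a long run. The snippet below is a minimal sketch; the helper name `has_package` is hypothetical and not part of the original notebook.

```python
import importlib.util


def has_package(name):
    """Return True if the top-level package `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# Warn early instead of failing when the parquet file is written.
if not has_package("pyarrow"):
    print("pyarrow is missing; install it with: pip install pyarrow")
```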

In [1]:
import numpy as np
import pandas as pd
import zipfile
import urllib.request
import os
from ipywidgets import interact, interact_manual
from tqdm.notebook import tqdm

Prepare the file names for later use. The `data` folder may not yet exist in your project, but it will be created later in the notebook.

In [2]:
zip_file_name = "data/sir010113-310113.zip"
parquet_file_name = "data/sir010113-310113.parquet"

In [3]:
def download_zip():
    url = "http://opendata.dublincity.ie/TrafficOpenData/sir010113-310113.zip"
    urllib.request.urlretrieve(url, zip_file_name)
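
Because the notebook later skips the download whenever the zip file already exists, an interrupted download could leave a truncated file behind that is then mistaken for a complete one. The following is a sketch of one possible alternative, not the notebook's own method: it streams the response to a temporary `.part` file and renames it only on success. The function name `download_to_file` and the `.part` suffix are assumptions made for illustration.

```python
import os
import shutil
import urllib.request


def download_to_file(url, target_path, chunk_size=1024 * 1024):
    """Stream `url` to `target_path`, renaming into place only when complete."""
    part_path = target_path + ".part"  # hypothetical temporary name
    with urllib.request.urlopen(url) as response, open(part_path, "wb") as out:
        # Copy in fixed-size chunks so the whole file is never held in memory.
        shutil.copyfileobj(response, out, chunk_size)
    # The target only appears once the full body has been written.
    os.replace(part_path, target_path)
    return target_path
```

It could be called as `download_to_file(url, zip_file_name)` in place of `urlretrieve`.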

The `read_data_frame` function reads one daily dataset into a Pandas DataFrame. The resulting DataFrames are later concatenated into a single DataFrame and saved to storage.

In [4]:
def read_data_frame(filename):
    header = ['Timestamp', 'LineID', 'Direction', 'PatternID', 'TimeFrame', 
              'JourneyID', 'Operator', 'Congestion', 'Lon', 'Lat', 
              'Delay', 'BlockID', 'VehicleID', 'StopID', 'AtStop']   
    types = {'Timestamp': np.int64,
             'JourneyID': np.int32,
             'Congestion': np.int8,
             'Lon': np.float64,
             'Lat': np.float64,
             'Delay': np.int8,
             'VehicleID': np.int32,
             'AtStop': np.int8}
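    # The files have no header row, so column names are supplied explicitly.
    # `infer_datetime_format` is deprecated in pandas 2.x and no longer needed.
    df = pd.read_csv(filename, header=None, names=header, dtype=types,
                     parse_dates=['TimeFrame'])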
    return df
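
To see how the `header=None`, `names` and `dtype` arguments work together, here is a small sketch that parses an in-memory CSV with a reduced set of columns. The two rows are made-up values for illustration only and are not taken from the dataset.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows for illustration only; they are not real dataset records.
sample_csv = io.StringIO(
    "1356998400000000,2013-01-01,-6.26,53.35,1\n"
    "1356998402000000,2013-01-01,-6.27,53.36,0\n"
)

# A reduced column list, following the same pattern as read_data_frame.
columns = ['Timestamp', 'TimeFrame', 'Lon', 'Lat', 'AtStop']
types = {'Timestamp': np.int64,
         'Lon': np.float64,
         'Lat': np.float64,
         'AtStop': np.int8}

# header=None tells pandas the first line is data, not column names.
sample_df = pd.read_csv(sample_csv, header=None, names=columns, dtype=types,
                        parse_dates=['TimeFrame'])
print(sample_df.dtypes)
```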

The `prepare_data_frame` function operates on the final DataFrame to make `NaN` replacements and final type changes that were not possible during reading.

In [5]:
def prepare_data_frame(df): 
    null_replacements = {'LineID': 0, 'StopID': 0}
    df = df.fillna(value=null_replacements)
    df['LineID'] = df['LineID'].astype(np.int32)
    df['StopID'] = df['StopID'].astype(np.int32)
    df['DateTime'] = pd.to_datetime(df['Timestamp'], unit='us')
    return df
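
The reason these conversions cannot happen while reading is that a column containing `NaN` has no plain integer representation. The sketch below illustrates this on a tiny made-up DataFrame (the values are illustrative only) and applies the same fill, cast and timestamp steps.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    'Timestamp': [1356998400000000, 1356998402000000],
    'StopID': [768.0, np.nan],
})

# A direct cast fails because NaN cannot be stored in an integer column.
try:
    sample['StopID'].astype(np.int32)
except ValueError as error:
    print("Direct cast failed:", error)

# Replace the missing value first, then cast.
sample = sample.fillna(value={'StopID': 0})
sample['StopID'] = sample['StopID'].astype(np.int32)

# The timestamps are interpreted as microseconds since the Unix epoch.
sample['DateTime'] = pd.to_datetime(sample['Timestamp'], unit='us')
print(sample)
```

One consequence of this approach is that `0` becomes a sentinel for "missing" in `LineID` and `StopID`, which is worth remembering when filtering later.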

The `read_zip_file` function extracts every file in the archive and reads them all into a single DataFrame. Passing a generator expression to `pd.concat` keeps the concatenation concise and avoids the need for intermediate variables. As mentioned earlier, this function may put some pressure on your machine's memory.

In [6]:
def read_zip_file(filename):
    file_names = []
    
    print("Unzipping:")
    with zipfile.ZipFile(filename) as z:
        files = z.infolist()
        for f in tqdm(files):
            z.extract(f, path='data')
            file_names.append("data/" + f.filename)
    
    print("Concatenating:")
    df = pd.concat((read_data_frame(file) for file in tqdm(file_names)), ignore_index=True)
    
    print("Deleting:")
    for file in tqdm(file_names):
        os.remove(file)
        
    print("Terminating...")
    df = prepare_data_frame(df)
    return df
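
If writing the extracted files to disk is undesirable, the archive members can in principle be read straight from the zip. The following is a sketch of that alternative, not the notebook's method, and it has not been verified against the real archive. The name `read_zip_members` is hypothetical, and the keyword arguments are simply forwarded to `pd.read_csv`.

```python
import io
import zipfile

import pandas as pd


def read_zip_members(zip_source, **read_csv_kwargs):
    """Concatenate every CSV member of a zip archive without extracting it."""

    def read_member(archive, info):
        # Open the member as a file-like object and let pandas parse it.
        with archive.open(info) as member:
            return pd.read_csv(member, **read_csv_kwargs)

    with zipfile.ZipFile(zip_source) as archive:
        frames = (read_member(archive, info)
                  for info in archive.infolist() if not info.is_dir())
        # The generator is consumed here, while the archive is still open.
        return pd.concat(frames, ignore_index=True)


# Usage with a tiny in-memory archive (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("day1.csv", "1,2\n3,4\n")
    archive.writestr("day2.csv", "5,6\n")
buffer.seek(0)

demo_df = read_zip_members(buffer, header=None, names=['x', 'y'])
print(demo_df)
```

This removes the separate extraction and deletion steps, but the peak memory use of the concatenation itself is unchanged.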

Create the `data` directory if it does not yet exist.

In [7]:
os.makedirs("data", exist_ok=True)

Conditionally download and process the data file. The download process can take a *long time*, so please make sure to retain the zip file for future use.

In [8]:
if not os.path.exists(zip_file_name):
    download_zip()

In [9]:
df = read_zip_file(zip_file_name)

Unzipping:
Concatenating:
Deleting:
Terminating...


Save the consolidated DataFrame to a parquet-formatted file. This is where you need the `pyarrow` package.

In [10]:
if not os.path.exists(parquet_file_name):
    df.to_parquet(parquet_file_name, index=False)

In [11]:
df = None

We are done. Please proceed to the next notebook to clean up the data.