# Dublin Buses - Clean Data

Prerequisites: `00-download-data.ipynb`

Before running this notebook, download and concatenate all the original per-day data files into a single Parquet file. The prerequisite notebook listed above does this.

In [1]:
import numpy as np
import pandas as pd
import ipywidgets as widgets
import osmnx as ox
from ipywidgets import interact, interact_manual
from tqdm import tqdm_notebook as tqdm
import folium
import multiprocessing
import collections
import matplotlib.pyplot as plt

from sklearn.neighbors import BallTree

from geo.geomath import vec_haversine, num_haversine
from geo.df import DataCleaner
from par.allel import parallel_process
from geo.df import mem_usage, categorize_columns

Read the Parquet file generated by the first step. Note that only a subset of the columns is read.

In [2]:
columns_to_read = ['Timestamp', 'LineID', 'Direction', 'PatternID', 
                   'JourneyID', 'Congestion', 'Lon', 'Lat', 
                   'Delay', 'BlockID', 'VehicleID', 'StopID', 'AtStop']
df = pd.read_parquet("data/sir010113-310113.parquet", columns=columns_to_read)

In [3]:
mem_usage(df)

' 4997.05 MB'

In [4]:
df = categorize_columns(df, ['PatternID'])
mem_usage(df)

' 2586.19 MB'
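The drop in memory above comes from storing a repetitive column more compactly. As a rough sketch of the idea, and assuming that `categorize_columns` converts the listed columns to the pandas `category` dtype (an assumption, since the helper lives in the local `geo.df` module), the effect can be reproduced on a toy column. The `PatternID` values below are made up for illustration.

```python
import pandas as pd

# Toy column with only two distinct values repeated many times (hypothetical values).
toy = pd.DataFrame({'PatternID': ['00010001', '00010002', '00010001'] * 10_000})

# Memory used by the column before conversion.
before = toy['PatternID'].memory_usage(deep=True)

# Assumption: this is what categorize_columns does for each listed column.
toy['PatternID'] = toy['PatternID'].astype('category')

# Memory used after conversion: small integer codes plus one copy of each label.
after = toy['PatternID'].memory_usage(deep=True)

print(before, after)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.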

In [5]:
journeys = df.JourneyID.unique()

In [6]:
journeys.shape

(18614,)

In [7]:
cleaner = DataCleaner()

In [8]:
vehicles = df['VehicleID']

In [9]:
unique_vehicles = df['VehicleID'].unique()

In [10]:
def zero_runs(a):
    # Source: https://stackoverflow.com/questions/24885092/finding-the-consecutive-zeros-in-a-numpy-array
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges
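To make the return value of `zero_runs` concrete, here is a small self-contained check on a hand-made array. The function is repeated so the snippet runs on its own, and the input values are illustrative only.

```python
import numpy as np

def zero_runs(a):
    # Same logic as the cell above: mark zeros, pad both ends, look for 0/1 transitions.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    return np.where(absdiff == 1)[0].reshape(-1, 2)

samples = np.array([1, 0, 0, 1, 0])
runs = zero_runs(samples)

# Each row is a half-open [start, end) index range of consecutive zeros.
for start, end in runs:
    print(start, end, samples[start:end])
```

Each row can be used directly as a slice, so `samples[start:end]` contains only zeros.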

The `get_top_whisker_speed` function calculates the speed corresponding to the top whisker of a box-and-whiskers plot, using Tukey's formulation: the third quartile plus 1.5 times the interquartile range. The notebook uses this value as the most likely top speed for each vehicle when fixing the type-2 anomalies.

In [11]:
def get_top_whisker_speed(df):
    # First and third quartiles of the speed column 'v'.
    q = df['v'].quantile([.25, .75])
    iqr = q.loc[0.75] - q.loc[0.25]
    # Tukey's top whisker: Q3 + 1.5 * IQR.
    return q.loc[0.75] + 1.5 * iqr
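As a worked example of the formula, the sketch below applies the same arithmetic to a handful of made-up speeds. The numbers are illustrative, not taken from the data set, and the quartiles use the default linear interpolation in pandas.

```python
import pandas as pd

# Hypothetical speeds for one vehicle; the last value is an obvious outlier.
speeds = pd.DataFrame({'v': [10.0, 20.0, 30.0, 40.0, 50.0, 200.0]})

q1 = speeds['v'].quantile(0.25)   # 22.5
q3 = speeds['v'].quantile(0.75)   # 47.5
iqr = q3 - q1                     # 25.0

# Tukey's top whisker: third quartile plus 1.5 times the interquartile range.
top_whisker = q3 + 1.5 * iqr      # 85.0

# Speeds above the whisker are the candidates for correction.
outliers = speeds[speeds['v'] > top_whisker]
print(top_whisker, outliers['v'].tolist())
```

Here only the 200.0 sample exceeds the whisker, which is the kind of point the per-vehicle threshold is meant to flag.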

The `process_vehicle` function fixes all anomalies on a vehicle partition of the data. After fixing the anomalies, the function returns a dictionary containing the vehicle identifier, the cleaned-up DataFrame, the type 2 anomalies DataFrame, and the _top whisker speed_ value for the vehicle.

In [12]:
def process_vehicle(v, df):
    df = cleaner.calculate_derived_columns(df)
    df = cleaner.fix_type1_anomalies(df)
    max_v = get_top_whisker_speed(df)
    df, anomalies = cleaner.fix_type2_anomalies(df, max_speed=max_v)
    return {'v': v, 'df': df, 'anom': anomalies, 'max_v': max_v }

Create a list with the input data for the parallel anomaly correction process. Each list element is a dictionary with the keyword arguments for one call to the `process_vehicle` function. Splitting the data by vehicle makes the work easy to parallelize: each process looks at a single partition of the data, which avoids concurrent access to shared data.
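The per-vehicle partitioning can be illustrated on a toy frame. The sketch below uses `groupby`, which splits the data in a single pass instead of building one boolean mask per vehicle. It is an alternative formulation shown for illustration, not the code this notebook runs, and the column values are made up.

```python
import pandas as pd

# Toy data: two vehicles with interleaved, unsorted samples (hypothetical values).
toy = pd.DataFrame({
    'VehicleID': [2, 1, 2, 1],
    'Timestamp': [30, 20, 10, 40],
    'Lat': [53.1, 53.2, 53.3, 53.4],
})

# One dictionary per vehicle, with that vehicle's rows sorted by time.
partitions = [
    {'v': v, 'df': part.sort_values(by='Timestamp')}
    for v, part in toy.groupby('VehicleID')
]

for p in partitions:
    print(p['v'], p['df']['Timestamp'].tolist())
```

Note that `groupby` yields the vehicles in sorted key order, which may differ from the order returned by `unique()`.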

In [13]:
vehicle_data = [{'v': v, 'df': df[df['VehicleID'] == v].copy().sort_values(by='Timestamp')} for v in tqdm(unique_vehicles)]

We do not need the main DataFrame anymore, so we can do away with it and save some precious memory in the process.

In [14]:
df = None

Now run the parallel process that fixes all the anomalies. Note that this piece of code can take a long time to run, depending on your hardware. The more cores, the better!

In [15]:
fixed_data = parallel_process(vehicle_data, process_vehicle, use_kwargs=True, tqdm=tqdm)

Create the subfolders in the `data` folder to receive the fixed data and the anomaly data _per vehicle_.

In [16]:
import os

# exist_ok=True makes the calls safe to re-run when the folders already exist.
os.makedirs("data/fixed", exist_ok=True)
os.makedirs("data/anomaly", exist_ok=True)

Save both the fixed per-vehicle DataFrames and the anomaly points. These might still be of use in the future, who knows?

In [17]:
for vd in tqdm(fixed_data):
    if isinstance(vd, dict):
        v = vd['v']
        df = vd['df']
        anom = vd['anom']
        df.to_parquet("data/fixed/v_{0}.parquet".format(v), index=False)
        
        if anom is not None:
            anom.to_parquet("data/anomaly/a_{0}.parquet".format(v), index=False)

Calculate the bounding box of the samples and save it to a JSON text file for later use.

In [27]:
import sys
import json

min_lat = sys.float_info.max
max_lat = -sys.float_info.max
min_lon = sys.float_info.max
max_lon = -sys.float_info.max

for vd in tqdm(fixed_data):
    if isinstance(vd, dict):
        df = vd['df']
        # Widen the bounding box with this vehicle's extreme coordinates.
        min_lat = min(min_lat, df['Lat'].min())
        max_lat = max(max_lat, df['Lat'].max())
        min_lon = min(min_lon, df['Lon'].min())
        max_lon = max(max_lon, df['Lon'].max())

bbox = {'west': min_lon, 'east': max_lon, 'north': max_lat, 'south': min_lat}

with open('data/bbox.txt', 'w') as json_file:
    json.dump(bbox, json_file)
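The same reduction can be written more compactly and checked on toy data. The sketch below uses two small hypothetical frames in place of the per-vehicle DataFrames and round-trips the result through JSON in memory rather than through `data/bbox.txt`. The explicit `float` conversion is a precaution so that the values are plain Python numbers before serialization.

```python
import json
import pandas as pd

# Two hypothetical per-vehicle frames with coordinates around Dublin.
frames = [
    pd.DataFrame({'Lat': [53.30, 53.35], 'Lon': [-6.30, -6.25]}),
    pd.DataFrame({'Lat': [53.40, 53.28], 'Lon': [-6.20, -6.40]}),
]

# The bounding box is the extreme latitude and longitude over all frames.
bbox = {
    'west': min(f['Lon'].min() for f in frames),
    'east': max(f['Lon'].max() for f in frames),
    'north': max(f['Lat'].max() for f in frames),
    'south': min(f['Lat'].min() for f in frames),
}
bbox = {key: float(value) for key, value in bbox.items()}

# Serialize and parse again, as a later notebook would after reading the file.
restored = json.loads(json.dumps(bbox))
print(restored)
```

A later step can load the saved file the same way, with `json.load` on the opened `data/bbox.txt`.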
