# Bike Availability Preprocessing

## Data Dictionary

The raw data contains the following fields per station per reading:

* Id - String - API Resource Id
* Name - String - The common name of the station
* PlaceType - String - ?
* TerminalName - String - ?
* NbBikes - Integer - The number of available bikes
* NbDocks - Integer - The total number of docking spaces
* NbEmptyDocks - Integer - The number of available empty docking spaces
* Timestamp - DateTime - The moment this reading was captured
* InstallDate - DateTime - Date when the station was installed
* RemovalDate - DateTime - Date when the station was removed
* Installed - Boolean - If the station is installed or not
* Locked - Boolean - ?
* Temporary - Boolean - If the station is temporary or not (TfL adds temporary stations to cope with demand.)
* Latitude - Float - Latitude Coordinate
* Longitude - Float - Longitude Coordinate

The following variables will be derived from the raw data.

* NbUnusableDocks - Integer - The number of non-working docking spaces. Computed with NbUnusableDocks = NbDocks - (NbBikes + NbEmptyDocks)
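
As a quick sanity check of the formula, the sketch below applies it to a single reading. The numbers come from the first `BikePoints_1` row shown in the Quick Data View section; the helper name `unusable_docks` is only illustrative and is not part of the project code.

```python
def unusable_docks(nb_docks, nb_bikes, nb_empty_docks):
    """Docks that neither hold a bike nor are free to receive one."""
    return nb_docks - (nb_bikes + nb_empty_docks)

# BikePoints_1 reading: 19 docks, 11 bikes, 7 empty docks
print(unusable_docks(19, 11, 7))  # 1
```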

## Set up

### Imports

In [4]:
%matplotlib inline

import logging
import itertools
import json
import os
import pickle
import folium
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from mpl_toolkits.basemap import Basemap
from datetime import datetime
from os import listdir
from os.path import isfile, join
from IPython.display import Image
from datetime import date

from src.data.parse_dataset import parse_dir, parse_json_files, get_file_list
from src.data.string_format import format_name, to_short_name
from src.data.visualization import lon_min_longitude, lon_min_latitude, lon_max_longitude, lon_max_latitude, lon_center_latitude, lon_center_longitude, create_london_map

logger = logging.getLogger()
logger.setLevel(logging.INFO)

## Parse Raw Data 

### Define the Parsing Functions

In [52]:
def parse_cycles(json_obj):
    """Parses TfL's BikePoint JSON response"""

    return [parse_station(element) for element in json_obj]

def parse_station(element):
    """Parses a JSON bicycle station object to a dictionary"""

    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }

    for p in element['additionalProperties']:
        obj[p['key']] = p['value']

        # all properties of one reading are expected to share the same timestamp
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('The properties\' timestamps for station %s do not match: %s != %s' % (
                obj['Id'], obj['Timestamp'], p['modified']))

    return obj
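
To make the expected input shape concrete, here is a minimal sketch that runs the same flattening logic on one hand-written element. The sample payload is an assumption modelled on the keys the parser reads (`id`, `commonName`, `lat`, `lon`, `placeType`, `additionalProperties`), with property values assumed to arrive as strings; it is not a captured API response. `flatten_station` is a hypothetical stand-in for `parse_station`.

```python
def flatten_station(element):
    """Flatten one station element into a single dictionary."""
    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }
    for p in element['additionalProperties']:
        obj[p['key']] = p['value']
        # every property of a reading should carry the same timestamp
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('Timestamps for station %s do not match' % obj['Id'])
    return obj

# hand-written sample (assumed shape, values borrowed from the BikePoints_1 row)
sample = {
    'id': 'BikePoints_1',
    'commonName': 'River Street , Clerkenwell',
    'lat': 51.529163,
    'lon': -0.10997,
    'placeType': 'BikePoint',
    'additionalProperties': [
        {'key': 'NbBikes', 'value': '11', 'modified': '2016-05-16T06:26:24.037'},
        {'key': 'NbDocks', 'value': '19', 'modified': '2016-05-16T06:26:24.037'},
        {'key': 'NbEmptyDocks', 'value': '7', 'modified': '2016-05-16T06:26:24.037'},
    ],
}

station = flatten_station(sample)
print(station['NbBikes'], station['Timestamp'])
```

Each entry of `additionalProperties` becomes a column, which is why the resulting table has one row per station per reading.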

In [53]:
def bike_file_date_fn(file_name):
    """Gets the file's date"""

    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')

def create_between_dates_filter(file_date_fn, date_start, date_end):
    def filter_fn(file_name):
        file_date = file_date_fn(file_name)
        return file_date >= date_start and file_date <= date_end
    
    return filter_fn
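
The following sketch shows how a date filter of this kind behaves on a few file names. The names are hypothetical and simply follow the `BIKE-%Y-%m-%d:%H:%M:%S.json` pattern parsed above; `file_date` and `between` are compact stand-ins for the two functions in this cell.

```python
import os
from datetime import datetime

def file_date(file_name):
    """Parse the capture time encoded in a BIKE-*.json file name."""
    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')

def between(date_start, date_end):
    """Build a predicate that keeps files captured within [date_start, date_end]."""
    return lambda file_name: date_start <= file_date(file_name) <= date_end

# hypothetical file names following the pattern above
names = [
    'data/raw/cycles/BIKE-2016-05-15:23:55:00.json',
    'data/raw/cycles/BIKE-2016-05-16:07:00:00.json',
    'data/raw/cycles/BIKE-2016-05-16:12:30:10.json',
]

keep = between(datetime(2016, 5, 16, 7, 0, 0), datetime(2016, 5, 16, 23, 59, 59))
selected = [name for name in names if keep(name)]
print(selected)
```

Both bounds are inclusive, so a file captured exactly at the start time is kept.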

### Quick Data View

#### Load Single Day Data

In [54]:
filter_fn = create_between_dates_filter(bike_file_date_fn, 
                                       datetime(2016, 5, 16, 7, 0, 0),
                                       datetime(2016, 5, 16, 23, 59, 59))

records = parse_dir('/home/jfconavarrete/Documents/Work/Dissertation/spts-uoe/data/raw/cycles', 
                    parse_cycles, sort_fn=bike_file_date_fn, filter_fn=filter_fn)

# records is a list of lists of dicts
df = pd.DataFrame(list(itertools.chain.from_iterable(records))) 

#### All Station View

In [55]:
df.head()

Unnamed: 0,Id,InstallDate,Installed,Latitude,Locked,Longitude,Name,NbBikes,NbDocks,NbEmptyDocks,PlaceType,RemovalDate,Temporary,TerminalName,Timestamp
0,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",11,19,7,BikePoint,,False,1023,2016-05-16T06:26:24.037
1,BikePoints_2,1278585780000,True,51.499606,False,-0.197574,"Phillimore Gardens, Kensington",12,37,25,BikePoint,,False,1018,2016-05-16T06:26:24.037
2,BikePoints_3,1278240360000,True,51.521283,False,-0.084605,"Christopher Street, Liverpool Street",6,32,26,BikePoint,,False,1012,2016-05-16T06:51:27.5
3,BikePoints_4,1278241080000,True,51.530059,False,-0.120973,"St. Chad's Street, King's Cross",14,23,9,BikePoint,,False,1013,2016-05-16T06:51:27.5
4,BikePoints_5,1278241440000,True,51.49313,False,-0.156876,"Sedding Street, Sloane Square",27,27,0,BikePoint,,False,3420,2016-05-16T06:46:27.237


#### Single Station View

In [56]:
df[df['Id'] == 'BikePoints_1'].head()

Unnamed: 0,Id,InstallDate,Installed,Latitude,Locked,Longitude,Name,NbBikes,NbDocks,NbEmptyDocks,PlaceType,RemovalDate,Temporary,TerminalName,Timestamp
0,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",11,19,7,BikePoint,,False,1023,2016-05-16T06:26:24.037
762,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",11,19,7,BikePoint,,False,1023,2016-05-16T06:26:24.037
1524,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",10,19,8,BikePoint,,False,1023,2016-05-16T07:01:29.163
2286,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",8,19,10,BikePoint,,False,1023,2016-05-16T07:11:30.433
3048,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",8,19,10,BikePoint,,False,1023,2016-05-16T07:11:30.433


#### Observations

* There are some duplicate rows <- remove duplicates
* RemovalDate may contain a lot of nulls <- remove if not helpful
* Locked and Installed might be constant <- remove if not helpful

### Build Dataset

#### Work with Chunks

Due to memory constraints we'll parse the files in chunks. After each chunk we'll split the records into a stations table and a readings table and drop duplicate rows, so that station attributes are not repeated for every reading.

In [57]:
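def chunker(seq, size):
    """Yields consecutive slices of seq with at most size items each"""
    # range works in both Python 2 and 3; xrange exists only in Python 2
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))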
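
A small, self-contained illustration of the chunking behaviour is given below. The file names are made up, and `range` is used here so the sketch runs on Python 3.

```python
def chunker(seq, size):
    """Yield consecutive slices of seq with at most size items each."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

files = ['f%d.json' % i for i in range(7)]  # hypothetical file names
batches = list(chunker(files, 3))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

The last batch is simply shorter when the number of files is not a multiple of the chunk size.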

#### Tables

We will have two different tables: one for the stations and one for the availability readings.

In [58]:
def split_data(parsed_data):
    master_df = pd.DataFrame(list(itertools.chain.from_iterable(parsed_data)))
    
    readings_df = pd.DataFrame(master_df, columns=['Id', 'Timestamp', 'NbBikes', 'NbDocks', 'NbEmptyDocks'])
    stations_df = pd.DataFrame(master_df, columns=['Id', 'Name', 'TerminalName' , 'PlaceType', 'Latitude', 
                                                   'Longitude', 'Installed', 'Temporary', 'Locked',
                                                   'RemovalDate', 'InstallDate'])
    
    return (readings_df, stations_df)

#### Build the Dataset

In [59]:
# get the files to parse
five_weekdays_filter = create_between_dates_filter(bike_file_date_fn, 
                                                   datetime(2016, 6, 19, 0, 0, 0), 
                                                   datetime(2016, 6, 27, 23, 59, 59))

files = get_file_list('data/raw/cycles', filter_fn=None, sort_fn=bike_file_date_fn)

# process the files in chunks
files_batches = chunker(files, 500)

In [60]:
# start with an empty dataset
readings_dataset = pd.DataFrame()
stations_dataset = pd.DataFrame()

# append each chunk to the datasets while removing duplicates
for batch in files_batches:
    parsed_data = parse_json_files(batch, parse_cycles)
    
    # split the data into station data and readings data
    readings_df, stations_df = split_data(parsed_data)
    
    # append the datasets
    readings_dataset = pd.concat([readings_dataset, readings_df])
    stations_dataset = pd.concat([stations_dataset, stations_df])
    
    # remove duplicated rows
    readings_dataset.drop_duplicates(inplace=True)
    stations_dataset.drop_duplicates(inplace=True)

In [61]:
# put the parsed data in pickle files
pickle.dump(readings_dataset, open("data/parsed/readings_dataset_raw.p", "wb"))
pickle.dump(stations_dataset, open("data/parsed/stations_dataset_raw.p", "wb"))

## Read the Parsed Data

In [1]:
# import pickle here so this cell also runs after a kernel restart
import pickle

stations_dataset = pickle.load(open('data/parsed/stations_dataset_raw.p', 'rb'))
readings_dataset = pickle.load(open('data/parsed/readings_dataset_raw.p', 'rb'))

## Technically Correct Data

The data is said to be technically correct if it:

1. can be directly recognized as belonging to a certain variable
2. is stored in a data type that represents the value domain of the real-world variable.
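
In plain Python terms, making a single raw record technically correct could look like the sketch below. It assumes that the parsed property values arrive as strings (`'true'`, `'11'`, a millisecond epoch); the helper names are illustrative only.

```python
from datetime import datetime, timezone

# one raw record, with values assumed to be strings as produced by the parser
raw = {'Installed': 'true', 'NbBikes': '11', 'InstallDate': '1278947280000'}

def to_bool(value):
    """Map the textual flags 'true'/'false' to real booleans."""
    return {'true': True, 'false': False}[value]

def from_epoch_ms(value):
    """Interpret a millisecond epoch string as a UTC datetime."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)

clean = {
    'Installed': to_bool(raw['Installed']),
    'NbBikes': int(raw['NbBikes']),
    'InstallDate': from_epoch_ms(raw['InstallDate']),
}
print(clean['InstallDate'].isoformat())  # 2010-07-12T15:08:00+00:00
```

Each value now has a type that matches the real-world variable: a boolean flag, a count and a point in time.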

In [None]:
# convert columns to their appropriate datatypes
stations_dataset['InstallDate'] = pd.to_numeric(stations_dataset['InstallDate'], errors='raise')
stations_dataset['RemovalDate'] = pd.to_numeric(stations_dataset['RemovalDate'], errors='raise')

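# assign the result back instead of replacing in place on a column selection
stations_dataset['Installed'] = stations_dataset['Installed'].replace({'true': True, 'false': False})
stations_dataset['Temporary'] = stations_dataset['Temporary'].replace({'true': True, 'false': False})
stations_dataset['Locked'] = stations_dataset['Locked'].replace({'true': True, 'false': False})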

readings_dataset['NbBikes'] = readings_dataset['NbBikes'].astype('uint16')
readings_dataset['NbDocks'] = readings_dataset['NbDocks'].astype('uint16')
readings_dataset['NbEmptyDocks'] = readings_dataset['NbEmptyDocks'].astype('uint16')

In [None]:
# format station name
stations_dataset['Name'] = stations_dataset['Name'].apply(format_name)

In [None]:
# convert string timestamp to datetime
stations_dataset['InstallDate'] = pd.to_datetime(stations_dataset['InstallDate'], unit='ms', errors='raise')
stations_dataset['RemovalDate'] = pd.to_datetime(stations_dataset['RemovalDate'], unit='ms', errors='raise')

readings_dataset['Timestamp'] =  pd.to_datetime(readings_dataset['Timestamp'], format='%Y-%m-%dT%H:%M:%S.%f', errors='raise')

In [None]:
# sort the datasets
stations_dataset.sort_values(by=['Id'], ascending=True, inplace=True)

readings_dataset.sort_values(by=['Timestamp'], ascending=True, inplace=True)

## Derive Data

In [None]:
stations_dataset['ShortName'] = stations_dataset['Name'].apply(to_short_name)

readings_dataset['NbUnusableDocks'] = readings_dataset['NbDocks'] - (readings_dataset['NbBikes'] + readings_dataset['NbEmptyDocks'])

### Add Station Priority Column
Priorities downloaded from https://www.whatdotheyknow.com/request/tfl_boris_bike_statistics?unfold=1

In [None]:
stations_priorities = pd.read_csv('data/raw/priorities/station_priorities.csv', encoding='latin-1')
stations_priorities['Site'] = stations_priorities['Site'].apply(format_name)

In [None]:
stations_dataset = pd.merge(stations_dataset, stations_priorities, how='left', left_on='ShortName', right_on='Site')
stations_dataset['Priority'] = stations_dataset['Priority'].replace({'One': '1', 'Two': '2', 'Long Term Suspended': np.nan, 'Long term suspension': np.nan})
stations_dataset.drop(['Site'], axis=1, inplace=True)
stations_dataset.drop(['Borough'], axis=1, inplace=True)

In [None]:
stations_dataset

## Consistent Data

### Stations Analysis

#### Overview

In [None]:
stations_dataset.shape

In [None]:
stations_dataset.info(memory_usage='deep')

In [None]:
stations_dataset.head()

In [None]:
stations_dataset.describe()

In [None]:
stations_dataset.apply(lambda x:x.nunique())

In [None]:
stations_dataset.isnull().sum()

#### Observations:
* Id, Name and TerminalName seem to be candidate keys
* The minimum latitude and the maximum longitude are 0
* Some stations have the same latitude or longitude
* Id, TerminalName and Name have different numbers of unique values
* PlaceType, Installed, Temporary and Locked appear to be constant
* Some stations do not have an install date
* Some stations have a removal date (very sparse)

#### Remove Duplicate Stations

In [None]:
def find_duplicate_ids(df):
    """Find the rows whose Id appears in more than one distinct row"""
    
    df = df.drop_duplicates()
    # number of non-null values per column for each Id
    value_counts_grouped_by_id = df.groupby('Id').count()    
    is_duplicate_id = (value_counts_grouped_by_id > 1).any(axis=1)
    duplicate_ids = value_counts_grouped_by_id[is_duplicate_id].index.values
    return df[df['Id'].isin(duplicate_ids)]

duplicate_ids = find_duplicate_ids(stations_dataset)
duplicate_ids

Given these records have the same location and Id but different Name or TerminalName, we'll assume the station changed name and remove the first entries.

In [None]:
# remove the one not in merchant street
stations_dataset.drop(417, inplace=True)

# remove the one with the shortest name
stations_dataset.drop(726, inplace=True)

# remove the one that is not in kings cross (as the name of the station implies)
stations_dataset.drop(745, inplace=True)

# remove the duplicated entries 
stations_dataset.drop([747, 743, 151, 754, 765, 768],  inplace=True)

In [None]:
# make sure there are no repeated ids 
assert len(find_duplicate_ids(stations_dataset)) == 0
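
The idea behind the duplicate check, flagging any Id that maps to more than one distinct row, can be sketched without pandas as follows. The rows are hypothetical and only mimic a station that changed its name.

```python
from collections import defaultdict

# hypothetical (Id, Name, TerminalName) rows; BikePoints_2 appears under two names
rows = [
    ('BikePoints_1', 'River Street, Clerkenwell', '1023'),
    ('BikePoints_1', 'River Street, Clerkenwell', '1023'),  # exact duplicate, harmless
    ('BikePoints_2', 'Old Name, Kensington', '1018'),
    ('BikePoints_2', 'New Name, Kensington', '1018'),
]

# collect the distinct rows seen for each Id
distinct_rows = defaultdict(set)
for row in rows:
    distinct_rows[row[0]].add(row)

duplicate_ids = sorted(i for i, variants in distinct_rows.items() if len(variants) > 1)
print(duplicate_ids)  # ['BikePoints_2']
```

Exact duplicates collapse into one entry, so only Ids with conflicting attributes are reported.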

#### Check Locations

Let's have a closer look at the station locations. All of them should be in Greater London.

In [None]:
def find_locations_outside_box(locations, min_longitude, min_latitude, max_longitude, max_latitude):
    # negate the whole range test; ~ binds tighter than &, so the extra parentheses are required
    latitude_check = ~((locations['Latitude'] >= min_latitude) & (locations['Latitude'] <= max_latitude))
    longitude_check = ~((locations['Longitude'] >= min_longitude) & (locations['Longitude'] <= max_longitude))
    return locations[(latitude_check | longitude_check)]

outlier_locations_df = find_locations_outside_box(stations_dataset, lon_min_longitude, lon_min_latitude, 
                                                  lon_max_longitude, lon_max_latitude)
outlier_locations_df

This station looks like a test station, so we'll remove it.

In [None]:
outlier_locations_idx = outlier_locations_df.index.values

stations_dataset.drop(outlier_locations_idx, inplace=True)

In [None]:
# make sure there are no stations outside London
assert len(find_locations_outside_box(stations_dataset, lon_min_longitude, lon_min_latitude, 
                                      lon_max_longitude, lon_max_latitude)) == 0
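
For a single coordinate pair, the bounding-box test used above reduces to the sketch below. The box limits are rough, assumed values for Greater London and are not the constants imported from `src.data.visualization`.

```python
# rough Greater London bounding box (assumed values, for illustration only)
MIN_LON, MAX_LON = -0.6, 0.4
MIN_LAT, MAX_LAT = 51.2, 51.8

def outside_box(lat, lon):
    """True when the point falls outside the box on either axis."""
    inside = (MIN_LAT <= lat <= MAX_LAT) and (MIN_LON <= lon <= MAX_LON)
    return not inside

print(outside_box(51.529163, -0.10997))  # False, BikePoints_1 is inside
print(outside_box(0.0, 0.0))             # True, a (0, 0) placeholder is outside
```

A point is an outlier as soon as one of its two coordinates leaves the allowed range.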

We will investigate the fact that there are stations with duplicate latitude or longitude values.

In [None]:
# find stations with duplicate longitude
id_counts_groupedby_longitude = stations_dataset.groupby('Longitude')['Id'].count()
nonunique_longitudes = id_counts_groupedby_longitude[id_counts_groupedby_longitude != 1].index.values
nonunique_longitude_stations = stations_dataset[stations_dataset['Longitude'].isin(nonunique_longitudes)].sort_values(by=['Longitude'])

id_counts_groupedby_latitude = stations_dataset.groupby('Latitude')['Id'].count()
nonunique_latitudes = id_counts_groupedby_latitude[id_counts_groupedby_latitude != 1].index.values
nonunique_latitudes_stations = stations_dataset[stations_dataset['Latitude'].isin(nonunique_latitudes)].sort_values(by=['Latitude'])

nonunique_coordinates_stations = pd.concat([nonunique_longitude_stations, nonunique_latitudes_stations])
nonunique_coordinates_stations

In [None]:
def draw_stations_map(stations_df):    
    stations_map = create_london_map()

    for index, station in stations_df.iterrows():        
        folium.Marker([station['Latitude'],station['Longitude']], popup=station['Name']).add_to(stations_map)
    
    return stations_map

In [None]:
draw_stations_map(nonunique_coordinates_stations)

We can observe that the stations are different and that sharing the same latitude or longitude is just a coincidence.

Let's plot the stations on a map (capped at `MAX_RECORDS` markers) to see how they look.

In [None]:
london_longitude = -0.127722
london_latitude = 51.507981

MAX_RECORDS = 100

stations_map = create_london_map()

for index, station in stations_dataset[0:MAX_RECORDS].iterrows():
    folium.Marker([station['Latitude'],station['Longitude']], popup=station['Name']).add_to(stations_map)
    
stations_map

#folium.Map.save(stations_map, 'reports/maps/stations_map.html')

### Readings Analysis

#### Overview

In [None]:
readings_dataset.shape

In [None]:
readings_dataset.info(memory_usage='deep')

In [None]:
readings_dataset.head()

In [None]:
readings_dataset.describe()

In [None]:
readings_dataset.apply(lambda x:x.nunique())

In [None]:
readings_dataset.isnull().sum()

In [None]:
timestamps = readings_dataset['Timestamp']
ax = timestamps.groupby([timestamps.dt.year, timestamps.dt.month, timestamps.dt.day]).count().plot(kind="bar")
ax.set_xlabel('Days')
ax.set_title('Readings per Day')

#### Observations:
* The number of readings in each day varies widely

#### Discard Out of Range Data

In [None]:
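# use pandas timestamps so they compare cleanly with the datetime64 column
start_date = pd.Timestamp(2016, 5, 16)
end_date = pd.Timestamp(2016, 6, 27)
# one entry per day in [start_date, end_date); slicing off the last day works across pandas versions
days = set(pd.date_range(start=start_date, end=end_date)[:-1])

timestamps = readings_dataset['Timestamp']
readings_dataset = readings_dataset[(timestamps >= start_date) & (timestamps < end_date)]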

#### Readings Consistency Through Days
Let's get some insight into which stations have no readings for an entire day.

In [None]:
# get a subview of the readings dataset
id_timestamp_view = readings_dataset.loc[:,['Id','Timestamp']]

# remove the time component of the timestamp
id_timestamp_view['Timestamp'] = id_timestamp_view['Timestamp'].dt.normalize()

# compute the days of readings per stations
days_readings = id_timestamp_view.groupby('Id').aggregate(lambda x: set(x))
days_readings['MissingDays'] = days_readings['Timestamp'].apply(lambda x: list(days - x))
days_readings['MissingDaysCount'] = days_readings['MissingDays'].apply(lambda x: len(x))
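
The missing-day computation is a set difference between the days we expect and the days on which a station actually reported. A minimal sketch with a short, made-up window and hypothetical per-station data:

```python
from datetime import date, timedelta

# short, made-up window; the end date is excluded
start, end = date(2016, 5, 16), date(2016, 5, 20)
expected_days = {start + timedelta(days=i) for i in range((end - start).days)}

# hypothetical days on which each station reported at least one reading
seen = {
    'BikePoints_1': {date(2016, 5, 16), date(2016, 5, 17), date(2016, 5, 18), date(2016, 5, 19)},
    'BikePoints_2': {date(2016, 5, 16), date(2016, 5, 19)},
}

missing = {station: sorted(expected_days - reported) for station, reported in seen.items()}
print(missing['BikePoints_2'])
```

Stations with an empty list reported on every day of the window.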

In [None]:
def expand_datetime(df, datetime_col):
    df['Weekday'] = df[datetime_col].apply(lambda x: x.weekday())
    return df

In [None]:
# get the stations with missing readings only
missing_days_readings = days_readings[days_readings['MissingDaysCount'] != 0]
missing_days_readings = missing_days_readings['MissingDays'].apply(lambda x: pd.Series(x)).unstack().dropna()
missing_days_readings.index = missing_days_readings.index.droplevel()

# sort and format in their own DF
missing_days_readings = pd.DataFrame(missing_days_readings, columns=['MissingDay'], index=None).reset_index().sort_values(by=['Id', 'MissingDay'])

# expand the missing day date
expand_datetime(missing_days_readings, 'MissingDay')

In [None]:
missing_days_readings['Id'].nunique()

In [None]:
# plot the missing readings days 
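# use a separate name so the `days` set defined earlier is not overwritten
missing_days = missing_days_readings['MissingDay']
missing_days_counts = missing_days.groupby([missing_days.dt.year, missing_days.dt.month, missing_days.dt.day]).count()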
ax = missing_days_counts.plot(kind="bar")
ax.set_xlabel('Days')
ax.set_title('Missing Readings Days Count')

Stations with no readings in at least one day

In [None]:
missing_days_readings_stations = stations_dataset[stations_dataset['Id'].isin(missing_days_readings['Id'].unique())]
draw_stations_map(missing_days_readings_stations)

Stations with no readings in at least one day during the weekend

In [None]:
weekend_readings = missing_days_readings[missing_days_readings['Weekday'] > 4]
missing_dayreadings_stn = stations_dataset[stations_dataset['Id'].isin(weekend_readings['Id'].unique())]
draw_stations_map(missing_dayreadings_stn)

Stations with no readings in at least one day during weekdays

In [None]:
weekday_readings = missing_days_readings[missing_days_readings['Weekday'] < 5]
missing_dayreadings_stn = stations_dataset[stations_dataset['Id'].isin(weekday_readings['Id'].unique())]
draw_stations_map(missing_dayreadings_stn)

#### Observations:
* There are 29 stations that do not have readings in at least one day
* There were more stations without readings during May than in June
* Other than that, there is no visible pattern

#### Discard Non Relevant Data

In [None]:
hour = readings_dataset['Timestamp'].apply(lambda x: x.hour)
selector = (hour < 7) | (hour > 22)
#readings_dataset.drop(readings_dataset[selector].index, inplace=True)
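
The selector above marks readings taken outside the 07:00 to 22:59 window; note that the `drop` call is commented out, so nothing is removed yet. The sketch below applies the same rule to three timestamps in plain Python. The first two come from the `BikePoints_1` rows shown earlier and the third is made up.

```python
from datetime import datetime

def is_off_hours(ts):
    """True for readings taken before 07:00 or from 23:00 onwards."""
    return ts.hour < 7 or ts.hour > 22

fmt = '%Y-%m-%dT%H:%M:%S.%f'  # timestamp format used by the readings
stamps = [
    '2016-05-16T06:26:24.037',
    '2016-05-16T07:01:29.163',
    '2016-05-16T23:05:00.000',  # made-up late reading
]
flags = [is_off_hours(datetime.strptime(s, fmt)) for s in stamps]
print(flags)  # [True, False, True]
```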

## Build Datasets
### Readings

In [None]:
readings_dataset.reset_index(inplace=True, drop=True)

In [None]:
readings_dataset.head()

In [None]:
readings_dataset.describe()

In [None]:
readings_dataset.info(memory_usage='deep')

In [None]:
pickle.dump(readings_dataset, open("data/parsed/readings_dataset_final.p", "wb"))

### Stations

In [None]:
stations_dataset.reset_index(inplace=True, drop=True)

In [None]:
stations_dataset.head()

In [None]:
stations_dataset.describe()

In [None]:
stations_dataset.info(memory_usage='deep')

In [None]:
pickle.dump(stations_dataset, open("data/parsed/stations_dataset_final.p", "wb"))