# Data Cleaning

## Data Dictionary

The raw data contains the following fields per station per reading:

* Id - String - API Resource Id
* Name - String - The common name of the station
* PlaceType - String - ?
* TerminalName - String - ?
* NbBikes - Integer - The number of available bikes
* NbDocks - Integer - The total number of docking spaces
* NbEmptyDocks - Integer - The number of available empty docking spaces
* Timestamp - DateTime - The moment this reading was captured
* InstallDate - DateTime - Date when the station was installed
* RemovalDate - DateTime - Date when the station was removed
* Installed - Boolean - If the station is installed or not
* Locked - Boolean - ?
* Temporary - Boolean - If the station is temporary or not (TfL adds temporary stations to cope with demand.)
* Latitude - Float - Latitude Coordinate
* Longitude - Float - Longitude Coordinate

The following variables will be derived from the raw data.

* NbUnusableDocks - Integer - The number of non-working docking spaces. Computed with NbUnusableDocks = NbDocks - (NbBikes + NbEmptyDocks)

## Set up

### Imports

In [28]:
import logging
import itertools
import json
import os
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import folium

from mpl_toolkits.basemap import Basemap
from datetime import datetime
from os import listdir
from os.path import isfile, join
from src.data.parse_dataset import parse_dir, parse_json_files, get_file_list
from IPython.display import Image

logger = logging.getLogger()
logger.setLevel(logging.INFO)

## Parse Raw Data 

### Define the Parsing Functions

In [2]:
def parse_cycles(json_obj):
    """Parses TfL's BikePoint JSON response"""

    return [parse_station(element) for element in json_obj]

def parse_station(element):
    """Parses a JSON bicycle station object to a dictionary"""

    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }

    for p in element['additionalProperties']:
        obj[p['key']] = p['value']

        # every property of a station should share the same 'modified' timestamp
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('The properties\' timestamps for station %s do not match: %s != %s' % (
                obj['Id'], obj['Timestamp'], p['modified']))

    return obj
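
To make the flattening concrete, here is a minimal sketch that runs a condensed restatement of the parser on a single element. The sample element is hypothetical and trimmed down: its field names follow the keys the parser reads, but the property values (assumed here to arrive as strings) are illustrative only, and `parse_station_sketch` is a name introduced just for this example.

```python
def parse_station_sketch(element):
    """Condensed restatement of the station parser, for illustration only."""
    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }
    for p in element['additionalProperties']:
        # each additional property becomes one flat key/value pair
        obj[p['key']] = p['value']
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('Timestamps do not match for station %s' % obj['Id'])
    return obj


# hypothetical element (assumption: values are delivered as strings)
sample_element = {
    'id': 'BikePoints_1',
    'commonName': 'River Street , Clerkenwell',
    'lat': 51.529163,
    'lon': -0.10997,
    'placeType': 'BikePoint',
    'additionalProperties': [
        {'key': 'NbBikes', 'value': '11', 'modified': '2016-05-16T06:26:24.037'},
        {'key': 'NbDocks', 'value': '19', 'modified': '2016-05-16T06:26:24.037'},
    ],
}

record = parse_station_sketch(sample_element)
print(sorted(record.keys()))
```

The result is one flat dictionary per station, which is the shape `pd.DataFrame` expects when it is given a list of records.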

In [3]:
def bike_file_date_fn(file_name):
    """Gets the file's date"""

    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')

def create_between_dates_filter(file_date_fn, date_start, date_end):
    def filter_fn(file_name):
        file_date = file_date_fn(file_name)
        return file_date >= date_start and file_date <= date_end
    
    return filter_fn
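
As a quick check of how the two helpers combine, the sketch below applies the filter to a handful of file names. The file names are hypothetical; they only follow the `BIKE-<date>:<time>.json` pattern that `bike_file_date_fn` expects. The helpers are repeated so the block runs on its own.

```python
import os
from datetime import datetime


def bike_file_date_fn(file_name):
    """Gets the file's date from its name"""
    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')


def create_between_dates_filter(file_date_fn, date_start, date_end):
    def filter_fn(file_name):
        # both bounds are inclusive
        return date_start <= file_date_fn(file_name) <= date_end

    return filter_fn


# hypothetical file names, for illustration only
file_names = [
    'data/raw/BIKE-2016-05-16:06:55:00.json',
    'data/raw/BIKE-2016-05-16:07:00:00.json',
    'data/raw/BIKE-2016-05-17:00:00:00.json',
]

day_filter = create_between_dates_filter(bike_file_date_fn,
                                         datetime(2016, 5, 16, 7, 0, 0),
                                         datetime(2016, 5, 16, 23, 59, 59))

selected = [name for name in file_names if day_filter(name)]
print(selected)
```

Only the 07:00:00 file falls inside the window; the earlier file and the one from the next day are skipped.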

### Quick Data View

#### Load Single Day Data

In [4]:
filter_fn = create_between_dates_filter(bike_file_date_fn, 
                                       datetime(2016, 5, 16, 7, 0, 0),
                                       datetime(2016, 5, 16, 23, 59, 59))

records = parse_dir('/home/jfconavarrete/Documents/Work/Dissertation/spts-uoe/data/raw', 
                    parse_cycles, sort_fn=bike_file_date_fn, filter_fn=filter_fn)

# records is a list of lists of dicts
df = pd.DataFrame(list(itertools.chain.from_iterable(records))) 

####  All Station View

In [5]:
df.head()

Unnamed: 0,Id,InstallDate,Installed,Latitude,Locked,Longitude,Name,NbBikes,NbDocks,NbEmptyDocks,PlaceType,RemovalDate,Temporary,TerminalName,Timestamp
0,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",11,19,7,BikePoint,,False,1023,2016-05-16T06:26:24.037
1,BikePoints_2,1278585780000,True,51.499606,False,-0.197574,"Phillimore Gardens, Kensington",12,37,25,BikePoint,,False,1018,2016-05-16T06:26:24.037
2,BikePoints_3,1278240360000,True,51.521283,False,-0.084605,"Christopher Street, Liverpool Street",6,32,26,BikePoint,,False,1012,2016-05-16T06:51:27.5
3,BikePoints_4,1278241080000,True,51.530059,False,-0.120973,"St. Chad's Street, King's Cross",14,23,9,BikePoint,,False,1013,2016-05-16T06:51:27.5
4,BikePoints_5,1278241440000,True,51.49313,False,-0.156876,"Sedding Street, Sloane Square",27,27,0,BikePoint,,False,3420,2016-05-16T06:46:27.237


####  Single Station View

In [6]:
df[df['Id'] == 'BikePoints_1'].head()

Unnamed: 0,Id,InstallDate,Installed,Latitude,Locked,Longitude,Name,NbBikes,NbDocks,NbEmptyDocks,PlaceType,RemovalDate,Temporary,TerminalName,Timestamp
0,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",11,19,7,BikePoint,,False,1023,2016-05-16T06:26:24.037
762,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",11,19,7,BikePoint,,False,1023,2016-05-16T06:26:24.037
1524,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",10,19,8,BikePoint,,False,1023,2016-05-16T07:01:29.163
2286,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",8,19,10,BikePoint,,False,1023,2016-05-16T07:11:30.433
3048,BikePoints_1,1278947280000,True,51.529163,False,-0.10997,"River Street , Clerkenwell",8,19,10,BikePoint,,False,1023,2016-05-16T07:11:30.433


#### Observations

* There are some duplicate rows <- remove duplicates
* RemovalDate may contain a lot of nulls <- remove if not helpful
* Locked and Installed might be constant <- remove if not helpful

### Build Dataset

#### Work with Chunks

Due to memory constraints we'll parse the data in chunks. In each chunk we'll remove the redundant candidate keys and also duplicate rows.

In [7]:
def chunker(seq, size):
    # range works on both Python 2 and Python 3
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
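
A small usage sketch shows what the chunker yields. The file names are made up, and `range` is used so that the block runs on Python 3.

```python
def chunker(seq, size):
    """Yield consecutive slices of seq with at most size items each."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))


# hypothetical file names, for illustration only
files = ['file_%d.json' % i for i in range(7)]

batches = list(chunker(files, 3))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

The last batch simply holds the remaining items. Note that `chunker` returns a generator, so it can be iterated only once; wrap it in `list(...)` if the batches are needed again.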

#### Tables

We will have two tables: one for the stations and one for the availability readings.

In [8]:
def split_data(parsed_data):
    master_df = pd.DataFrame(list(itertools.chain.from_iterable(parsed_data)))
    
    readings_df = pd.DataFrame(master_df, columns=['Id', 'Timestamp', 'NbBikes', 'NbDocks', 'NbEmptyDocks'])
    stations_df = pd.DataFrame(master_df, columns=['Id', 'Name', 'TerminalName' , 'PlaceType', 'Latitude', 
                                                   'Longitude', 'Installed', 'Temporary', 'Locked',
                                                   'RemovalDate', 'InstallDate'])
    
    return (readings_df, stations_df)

#### Build the Dataset

In [9]:
# get the files to parse
five_weekdays_filter = create_between_dates_filter(bike_file_date_fn, 
                                                   datetime(2016, 5, 15, 11, 0, 0), 
                                                   datetime(2016, 6, 7, 8, 0, 0))
files = get_file_list('data/raw', filter_fn=None, sort_fn=bike_file_date_fn)

# process the files in chunks
files_batches = chunker(files, 100)

In [10]:
# start with an empty dataset
readings_dataset = pd.DataFrame()
stations_dataset = pd.DataFrame()

# append each chunk to the datasets while removing duplicates
for batch in files_batches:
    parsed_data = parse_json_files(batch, parse_cycles)
    
    # split the data into station data and readings data
    readings_df, stations_df = split_data(parsed_data)
    
    # append the datasets
    readings_dataset = pd.concat([readings_dataset, readings_df])
    stations_dataset = pd.concat([stations_dataset, stations_df])
    
    # remove duplicated rows
    readings_dataset.drop_duplicates(inplace=True)
    stations_dataset.drop_duplicates(inplace=True)
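
One detail worth keeping in mind with this pattern: each batch's frame starts its index at 0, so concatenating without `ignore_index=True` leaves repeated index labels, and a later label-based `drop` would then remove every row that carries the label. The sketch below uses two tiny made-up chunks to show a variant that keeps the index unique. It is an alternative under these assumptions, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# two hypothetical chunks with one overlapping row
chunk_1 = pd.DataFrame({'Id': ['BikePoints_1', 'BikePoints_2'], 'NbBikes': [11, 12]})
chunk_2 = pd.DataFrame({'Id': ['BikePoints_1', 'BikePoints_3'], 'NbBikes': [11, 6]})

# plain concatenation repeats the labels 0 and 1
plain_index_is_unique = pd.concat([chunk_1, chunk_2]).index.is_unique

dataset = None
for chunk in (chunk_1, chunk_2):
    # append the chunk, renumbering the rows from 0
    if dataset is None:
        dataset = chunk
    else:
        dataset = pd.concat([dataset, chunk], ignore_index=True)

    # drop duplicated rows and renumber again so that labels stay unique
    dataset = dataset.drop_duplicates().reset_index(drop=True)

print(dataset)
```

Deduplicating after every chunk keeps the running dataset small, which matters most for the stations table, where the same rows repeat in every file.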

## Technically Correct Data

The data is said to be technically correct if it:

1. can be directly recognized as belonging to a certain variable
2. is stored in a data type that represents the value domain of the real-world variable.

In [11]:
# convert columns to their appropriate datatypes
stations_dataset['InstallDate'] = pd.to_numeric(stations_dataset['InstallDate'], errors='raise')
stations_dataset['Installed'] = stations_dataset['Installed'].astype('bool_')
stations_dataset['Temporary'] = stations_dataset['Temporary'].astype('bool_')
stations_dataset['Locked'] = stations_dataset['Locked'].astype('bool_')

readings_dataset['NbBikes'] = readings_dataset['NbBikes'].astype('uint16')
readings_dataset['NbDocks'] = readings_dataset['NbDocks'].astype('uint16')
readings_dataset['NbEmptyDocks'] = readings_dataset['NbEmptyDocks'].astype('uint16')

# convert string timestamp to datetime
stations_dataset['InstallDate'] = pd.to_datetime(stations_dataset['InstallDate'], unit='ms')

readings_dataset['Timestamp'] =  pd.to_datetime(readings_dataset['Timestamp'], format='%Y-%m-%dT%H:%M:%S.%f', errors='raise')
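
A caveat about the boolean conversion: casting a column of strings with `astype` to a boolean type marks every non-empty string as `True`, including `'false'`. In the tables shown in this notebook, Locked and Temporary read False in the raw view and True after the conversion, which suggests the raw values may be strings. The sketch below assumes the flags arrive as the strings `'true'` and `'false'` (an assumption, not confirmed by the data) and compares the cast with an explicit mapping.

```python
import pandas as pd

# hypothetical raw flags, assumed to be strings
raw_flags = pd.Series(['true', 'false', 'false'], dtype=object)

# a plain cast treats every non-empty string as True
naive = raw_flags.astype('bool')

# an explicit mapping keeps the intended meaning;
# values missing from the mapping would become NaN
parsed = raw_flags.map({'true': True, 'false': False})

print(naive.tolist())
print(parsed.tolist())
```

If the mapping approach is adopted, any later check on Installed, Locked or Temporary should be revisited, since those columns would no longer be constant `True`.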

In [12]:
# sort the datasets
stations_dataset.sort_values(by=['Id'], ascending=True, inplace=True)

readings_dataset.sort_values(by=['Timestamp'], ascending=True, inplace=True)

## Derive Data

In [13]:
readings_dataset['NbUnusableDocks'] = readings_dataset['NbDocks'] - (readings_dataset['NbBikes'] + readings_dataset['NbEmptyDocks'])
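
Because the three count columns were cast to `uint16`, this subtraction is done in unsigned arithmetic. If a reading ever reports more bikes plus empty docks than docks, the result wraps around to a large positive number instead of going negative. The sketch below uses one consistent row (taken from the first reading shown earlier) and one hypothetical inconsistent row to show the effect, along with a signed alternative.

```python
import pandas as pd

readings = pd.DataFrame({
    'NbBikes': [11, 12],       # second row is a hypothetical inconsistent reading
    'NbDocks': [19, 19],
    'NbEmptyDocks': [7, 8],
}).astype('uint16')

# unsigned arithmetic: 19 - 20 wraps around instead of giving -1
wrapped = readings['NbDocks'] - (readings['NbBikes'] + readings['NbEmptyDocks'])

# casting to a signed type first keeps negative results visible
signed = readings.astype('int32')
safe = signed['NbDocks'] - (signed['NbBikes'] + signed['NbEmptyDocks'])

print(wrapped.tolist())
print(safe.tolist())
```

Negative values in the signed version are an easy way to flag readings whose counts do not add up.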

## Consistent Data

### Stations Analysis

#### Overview

In [14]:
stations_dataset.shape

(778, 11)

In [15]:
stations_dataset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 778 entries, 0 to 94
Data columns (total 11 columns):
Id              778 non-null object
Name            778 non-null object
TerminalName    778 non-null object
PlaceType       778 non-null object
Latitude        778 non-null float64
Longitude       778 non-null float64
Installed       778 non-null bool
Temporary       778 non-null bool
Locked          778 non-null bool
RemovalDate     778 non-null object
InstallDate     683 non-null datetime64[ns]
dtypes: bool(3), datetime64[ns](1), float64(2), object(5)
memory usage: 424.7 KB


In [16]:
stations_dataset.head()

Unnamed: 0,Id,Name,TerminalName,PlaceType,Latitude,Longitude,Installed,Temporary,Locked,RemovalDate,InstallDate
0,BikePoints_1,"River Street , Clerkenwell",1023,BikePoint,51.529163,-0.10997,True,True,True,,2010-07-12 15:08:00
9,BikePoints_10,"Park Street, Bankside",1024,BikePoint,51.505974,-0.092754,True,True,True,,2010-07-04 11:21:00
95,BikePoints_100,"Albert Embankment, Vauxhall",1059,BikePoint,51.490435,-0.122806,True,True,True,,2010-07-14 09:31:00
96,BikePoints_101,"Queen Street 1, Bank",999,BikePoint,51.511553,-0.09294,True,True,True,,2010-07-14 10:18:00
97,BikePoints_102,"Jewry Street, Aldgate",1045,BikePoint,51.513406,-0.076793,True,True,True,,2010-07-14 10:21:00


In [17]:
stations_dataset.describe()

Unnamed: 0,Latitude,Longitude
count,778.0,778.0
mean,51.439607,-0.128938
std,1.846681,0.056126
min,0.0,-0.236769
25%,51.493134,-0.172954
50%,51.509123,-0.132102
75%,51.520686,-0.09294
max,51.549369,0.122299


In [18]:
stations_dataset.apply(lambda x:x.nunique())

Id              773
Name            774
TerminalName    773
PlaceType         1
Latitude        770
Longitude       770
Installed         1
Temporary         1
Locked            1
RemovalDate       4
InstallDate     682
dtype: int64

#### Observations:
* Id, Name and TerminalName seem to be candidate keys
* The minimum latitude is 0.0 and the maximum longitude is 0.122299; both look like outliers compared to the quartiles
* Some stations share a latitude or longitude value
* Id and TerminalName have 773 unique values each and Name has 774, while there are 778 rows, so some stations appear more than once
* PlaceType, Installed, Temporary and Locked appear to be constant
* Some stations do not have an install date
* Some stations have a removal date (very sparse)

#### Remove Duplicate Stations

In [19]:
def find_names_changes(df):
    """Find Ids that appear in more than one row, e.g. because Name or TerminalName changed"""

    # count the non-null Name and TerminalName entries recorded for each Id
    names_per_id_count = df.groupby('Id')[['Name', 'TerminalName']].count()
    ids_with_several_names = names_per_id_count[(names_per_id_count['Name'] != 1) | (names_per_id_count['TerminalName'] != 1)]
    return df[df['Id'].isin(ids_with_several_names.index.values)]

ids_with_several_names_df = find_names_changes(stations_dataset)
ids_with_several_names_df

Unnamed: 0,Id,Name,TerminalName,PlaceType,Latitude,Longitude,Installed,Temporary,Locked,RemovalDate,InstallDate
56333,BikePoints_798,"Birkenhead Street, King's Cross",300212,BikePoint,51.530199,-0.122299,True,True,True,,NaT
41855,BikePoints_798,"Birkenhead Street, King's Cross",300212,BikePoint,51.530199,0.122299,True,True,True,,NaT
71111,BikePoints_799,"Kings Gate House, Westminster",300202,BikePoint,51.497698,-0.137598,True,True,True,,NaT
67978,BikePoints_799,"Kings Gate House, Westminster",300202,BikePoint,51.497698,-0.137598,True,True,True,,2016-06-02 14:08:00
44409,BikePoints_802,"Albert Square, Stockwell",300209,BikePoint,51.47659,-0.118256,True,True,True,,NaT
40439,BikePoints_802,"Albert Square, Stockwell",300209,BikePoint,51.47659,-0.118256,True,True,True,,2016-06-02 11:05:00
52069,BikePoints_814,"Clapham Road, Lingham Street, Stockwell",300245,BikePoint,51.471433,-0.12367,True,True,True,,NaT
52672,BikePoints_814,"Clapham Road, Lingham Street, Stockwell",300245,BikePoint,51.471433,-0.12367,True,True,True,,2016-06-02 12:21:00
39676,BikePoints_818,"One Tower Bridge, Southwark",300249,BikePoint,51.503127,-0.078655,True,True,True,,NaT
43547,BikePoints_818,"One Tower Bridge, Bermondsey",300249,BikePoint,51.503127,-0.078655,True,True,True,,NaT


These records share an Id but differ in Name, Longitude or InstallDate, so we'll assume the station's details were updated and keep only the last entry for each Id.

In [20]:
# get the index of the first repeated entries
is_duplicated_id = ids_with_several_names_df.duplicated(['Id'], keep='last')
duplicated_id_idx = is_duplicated_id[is_duplicated_id == True].index

# drop entries using the index
stations_dataset.drop(duplicated_id_idx, inplace=True)

In [21]:
# make sure there are no repeated ids 
assert len(find_names_changes(stations_dataset)) == 0
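
For comparison, the same "keep the last entry per Id" rule can be sketched with `drop_duplicates` and its `subset` argument, which avoids dropping rows by index label. The frame below is a made-up three-row excerpt that reuses station names from the table above.

```python
import pandas as pd

# small illustrative excerpt, not the full stations table
stations = pd.DataFrame({
    'Id': ['BikePoints_818', 'BikePoints_818', 'BikePoints_1'],
    'Name': ['One Tower Bridge, Southwark',
             'One Tower Bridge, Bermondsey',
             'River Street , Clerkenwell'],
})

# keep only the last row recorded for each Id
deduplicated = stations.drop_duplicates(subset='Id', keep='last')

print(deduplicated)
```

This relies on the rows already being in the order in which they should be trusted, since "last" refers to position in the frame rather than to any timestamp.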

#### Check Locations

Let's have a closer look at the station locations. All of them should be in Greater London.

In [22]:
# bounding box for Greater London
min_longitude = -0.489
min_latitude = 51.28
max_longitude = 0.236
max_latitude = 51.686

def find_locations_outside_box(locations, min_longitude, min_latitude, max_longitude, max_latitude):
    # negate the whole range test; a bare ~ would only apply to the first comparison
    latitude_check = ~((locations['Latitude'] >= min_latitude) & (locations['Latitude'] <= max_latitude))
    longitude_check = ~((locations['Longitude'] >= min_longitude) & (locations['Longitude'] <= max_longitude))
    return locations[(latitude_check | longitude_check)]

outlier_locations_df = find_locations_outside_box(stations_dataset, min_longitude, min_latitude, max_longitude, max_latitude)
outlier_locations_df

Unnamed: 0,Id,Name,TerminalName,PlaceType,Latitude,Longitude,Installed,Temporary,Locked,RemovalDate,InstallDate
750,BikePoints_791,Test Desktop,666666,BikePoint,0.0,0.0,True,True,True,,2016-01-15 12:39:00


This station looks like a test station, so we'll remove it.

In [23]:
outlier_locations_idx = outlier_locations_df.index.values

stations_dataset.drop(outlier_locations_idx, inplace=True)

In [24]:
# make sure there are no stations outside London
assert len(find_locations_outside_box(stations_dataset, min_longitude, min_latitude, max_longitude, max_latitude)) == 0
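
The bounding-box test can also be written with `Series.between`, which is inclusive on both ends by default and sidesteps operator-precedence questions around `~` and `&`. The sketch below uses the same box limits as above on three sample points; the third point and the helper name `outside_box` are hypothetical and exist only to exercise the upper latitude bound.

```python
import pandas as pd


def outside_box(locations, min_longitude, min_latitude, max_longitude, max_latitude):
    """Return the rows whose coordinates fall outside the bounding box."""
    inside_latitude = locations['Latitude'].between(min_latitude, max_latitude)
    inside_longitude = locations['Longitude'].between(min_longitude, max_longitude)
    return locations[~(inside_latitude & inside_longitude)]


locations = pd.DataFrame({
    'Name': ['River Street , Clerkenwell', 'Test Desktop', 'Hypothetical northern point'],
    'Latitude': [51.529163, 0.0, 52.5],
    'Longitude': [-0.10997, 0.0, -0.1],
})

outliers = outside_box(locations, -0.489, 51.28, 0.236, 51.686)
print(outliers['Name'].tolist())
```

Both the point at (0, 0) and the point north of the box are reported, while the Clerkenwell station is kept.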

Next, we investigate the stations that share a latitude or longitude value with another station.

In [25]:
# find stations with duplicate longitude
id_counts_groupedby_longitude = stations_dataset.groupby('Longitude')['Id'].count()
nonunique_longitudes = id_counts_groupedby_longitude[id_counts_groupedby_longitude != 1].index.values
nonunique_longitude_stations = stations_dataset[stations_dataset['Longitude'].isin(nonunique_longitudes)].sort_values(by=['Longitude'])

# find stations with duplicate latitude
id_counts_groupedby_latitude = stations_dataset.groupby('Latitude')['Id'].count()
nonunique_latitudes = id_counts_groupedby_latitude[id_counts_groupedby_latitude != 1].index.values
nonunique_latitudes_stations = stations_dataset[stations_dataset['Latitude'].isin(nonunique_latitudes)].sort_values(by=['Latitude'])

nonunique_coordinates_stations = pd.concat([nonunique_longitude_stations, nonunique_latitudes_stations])
nonunique_coordinates_stations

Unnamed: 0,Id,Name,TerminalName,PlaceType,Latitude,Longitude,Installed,Temporary,Locked,RemovalDate,InstallDate
208,BikePoints_216,"Old Brompton Road, South Kensington",3479,BikePoint,51.490945,-0.18119,True,True,True,,2010-07-19 11:12:00
538,BikePoints_573,"Limerston Street, West Chelsea",200001,BikePoint,51.485587,-0.18119,True,True,True,,2012-03-15 07:21:00
20,BikePoints_21,"Hampstead Road (Cartmel), Euston",3426,BikePoint,51.530078,-0.138846,True,True,True,,2010-07-06 14:49:00
304,BikePoints_318,"Sackville Street, Mayfair",1197,BikePoint,51.510048,-0.138846,True,True,True,,2010-07-23 11:42:00
103,BikePoints_108,"Abbey Orchard Street, Westminster",3429,BikePoint,51.498125,-0.132102,True,True,True,,2010-07-14 11:42:00
586,BikePoints_624,"Courland Grove, Wandsworth Road",200173,BikePoint,51.472918,-0.132102,True,True,True,,2013-10-08 09:24:00
96,BikePoints_101,"Queen Street 1, Bank",999,BikePoint,51.511553,-0.09294,True,True,True,,2010-07-14 10:18:00
401,BikePoints_427,"Cheapside, Bank",22180,BikePoint,51.51397,-0.09294,True,True,True,,2011-07-15 10:28:00
103,BikePoints_108,"Abbey Orchard Street, Westminster",3429,BikePoint,51.498125,-0.132102,True,True,True,,2010-07-14 11:42:00
444,BikePoints_474,"Castalia Square, Cubitt Town",200155,BikePoint,51.498125,-0.011457,True,True,True,,2012-01-17 17:56:00


In [29]:
def draw_stations_map(stations_df):
    london_longitude = -0.127722
    london_latitude = 51.507981
    
    stations_map = folium.Map(location=[london_latitude, london_longitude], zoom_start=12,
                      min_lat=min_latitude, max_lat=max_latitude,
                      min_lon=min_longitude, max_lon=max_longitude)

    for index, station in stations_df.iterrows():
        folium.Marker([station['Latitude'],station['Longitude']], popup=station['Name']).add_to(stations_map)
    
    return stations_map

In [30]:
draw_stations_map(nonunique_coordinates_stations)

We can observe that the stations are different and that sharing a latitude or longitude value is just a coincidence.
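
The same conclusion can be checked numerically. As a rough sketch, the haversine formula below estimates the great-circle distance between the first pair in the table, which shares a longitude (BikePoints_216 and BikePoints_573). The Earth radius is the usual mean-radius approximation, and `haversine_m` is a helper name introduced for this example.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in metres between two points."""
    earth_radius_m = 6371000.0  # mean Earth radius, an approximation
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


# Old Brompton Road (BikePoints_216) vs Limerston Street (BikePoints_573)
distance = haversine_m(51.490945, -0.18119, 51.485587, -0.18119)
print(round(distance))
```

The two stations are several hundred metres apart, so they are clearly distinct docking points that happen to lie on the same meridian.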

Let's plot the stations on a map to see how they are distributed. Only the first MAX_RECORDS stations are drawn.

In [31]:
MAX_RECORDS = 100

# reuse the helper defined above; only the first MAX_RECORDS stations are drawn
stations_map = draw_stations_map(stations_dataset[0:MAX_RECORDS])
stations_map

# stations_map.save('reports/maps/stations_map.html')

#### Check Station Status

We must make sure that all stations in our dataset are installed and locked.

In [32]:
# make sure all stations in our dataset are installed and locked
assert len(stations_dataset[stations_dataset['Installed'] == False]) == 0
assert len(stations_dataset[stations_dataset['Locked'] == False]) == 0

## Data Summary

In [None]:
pd.concat([a, pd.DataFrame()])

### Import into Pandas

In [None]:
dataset = pd.DataFrame(list(itertools.chain.from_iterable(records)))

dataset.shape

In [None]:
dataset.head()

In [None]:
# drop the columns that hold a single constant value
nuniques = dataset.apply(lambda x: x.nunique())
constant_cols = nuniques[nuniques == 1].index
print('Constant columns: %s' % list(constant_cols))
dataset = dataset.drop(constant_cols, axis=1)

### Convert to Appropriate DataTypes

### Derive Variables

### Data Description

In [None]:
dataset.info(memory_usage='deep')

## Consistent Correct Data

In [None]:
dataset.describe()

### Missing Values



### Outliers

### Errors

### Consistency

## Exploratory Data Analysis

### Visual Representation

### Examine Variable Relationships

### Analyze Variable Over Time

## Conclusions