# Data Cleaning

### Data Dictionary

The raw data contains the following data per station per reading:

* Id - String - API Resource Id
* Name - String - The common name of the station
* PlaceType - String - ?
* TerminalName - String - ?
* NbBikes - Integer - The number of available bikes
* NbDocks - Integer - The total number of docking spaces
* NbEmptyDocks - Integer - The number of available empty docking spaces
* Timestamp - DateTime - The moment this reading was captured
* InstallDate - DateTime - Date when the station was installed
* RemovalDate - DateTime - Date when the station was removed
* LastUpdated - DateTime - ?
* Installed - Boolean - If the station is installed or not
* Locked - Boolean - If the station is locked or not
* Temporary - Boolean - If the station is temporary or not (TfL adds temporary stations to cope with demand.)
* Latitude - Float - Latitude Coordinate
* Longitude - Float - Longitude Coordinate

The following variables will be derived from the raw data.

* NbUnusableDocks - Integer - The number of non-working docking spaces. Computed with NbUnusableDocks = NbDocks - (NbBikes + NbEmptyDocks)

## Set up

### Imports

In [1]:
import logging
import itertools
import json
import os
import pandas as pd
import pickle

from datetime import datetime
from os import listdir
from os.path import isfile, join
from src.data.parse_dataset import parse_dir, parse_json

logger = logging.getLogger()
logger.setLevel(logging.INFO)

### Parse Raw Data 

#### Define the Parsing Functions

In [5]:
def parse_cycles(json_obj):
    """Parses TfL's BikePoint JSON response"""

    return [parse_station(element) for element in json_obj]

def parse_station(element):
    """Parses a JSON bicycle station object to a dictionary"""

    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }

    for p in element['additionalProperties']:
        obj[p['key']] = p['value']

        # All properties of a station are expected to share one modification time
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('The properties\' timestamps for station %s do not match: %s != %s' % (
                obj['Id'], obj['Timestamp'], p['modified']))

    return obj
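
To make the expected input shape concrete, the sketch below runs a version of `parse_station` that keys its timestamp check on `'Timestamp'` against a hand-written element. The field names come from the parser; the values in `sample_element` are invented for illustration and are not taken from a real TfL response.

```python
def parse_station(element):
    """Flattens one station element into a single dictionary."""
    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }

    for p in element['additionalProperties']:
        obj[p['key']] = p['value']

        # The first property sets the timestamp; later ones must agree with it
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('Timestamps for station %s do not match: %s != %s' % (
                obj['Id'], obj['Timestamp'], p['modified']))

    return obj


# Hypothetical element: values are made up for illustration
sample_element = {
    'id': 'BikePoints_1',
    'commonName': 'Example Street',
    'lat': 51.5,
    'lon': -0.12,
    'placeType': 'BikePoint',
    'additionalProperties': [
        {'key': 'NbBikes', 'value': '7', 'modified': '2016-05-16T10:00:00.000'},
        {'key': 'NbDocks', 'value': '20', 'modified': '2016-05-16T10:00:00.000'},
    ],
}

station = parse_station(sample_element)
print(station['NbBikes'], station['NbDocks'], station['Timestamp'])
```

Note that the property values stay as strings at this stage; type conversion happens later in the notebook.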

In [6]:
def bike_file_date_fn(file_name):
    """Gets the file's date"""

    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')

def create_between_dates_filter(file_date_fn, date_start, date_end):
    def filter_fn(file_name):
        file_date = file_date_fn(file_name)
        return date_start <= file_date <= date_end
    
    return filter_fn
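
As a quick check of the two helpers, the sketch below applies the filter to a few file names. The names are hypothetical; they only follow the `BIKE-%Y-%m-%d:%H:%M:%S.json` pattern that `bike_file_date_fn` expects.

```python
import os
from datetime import datetime


def bike_file_date_fn(file_name):
    """Gets the file's date from its name."""
    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')


def create_between_dates_filter(file_date_fn, date_start, date_end):
    """Builds a predicate that keeps files dated within [date_start, date_end]."""
    def filter_fn(file_name):
        return date_start <= file_date_fn(file_name) <= date_end

    return filter_fn


# Hypothetical file names, one before, one inside and one after the window
file_names = [
    'dev_json/BIKE-2016-05-15:23:59:59.json',
    'dev_json/BIKE-2016-05-16:08:30:00.json',
    'dev_json/BIKE-2016-05-17:00:00:00.json',
]

keep = create_between_dates_filter(bike_file_date_fn,
                                   datetime(2016, 5, 16, 0, 0, 0),
                                   datetime(2016, 5, 16, 23, 59, 59))

selected = [name for name in file_names if keep(name)]
print(selected)
```

Both bounds are inclusive, so a file stamped exactly at `date_start` or `date_end` is kept.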

In [7]:
filter_fn = create_between_dates_filter(bike_file_date_fn, 
                                       datetime(2016, 5, 16, 0, 0, 0),
                                       datetime(2016, 5, 16, 23, 59, 59))

#records = parse_dir('/home/jfconavarrete/Documents/Work/Dissertation/spts-uoe/data/dev_json', 
#                    parse_cycles, sort_fn=bike_file_date_fn, filter_fn=filter_fn)

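# Load the previously parsed records; the context manager closes the file afterwards
with open('data/parsed/dataset.pickle', 'rb') as pickle_file:
    records = pickle.load(pickle_file)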

# records is a list of lists of dicts
df = pd.DataFrame(list(itertools.chain.from_iterable(records))) 

df = remove_redundant_candidate_keys(df)

#### Duplicate Removal

We suspect that Id, Name and TerminalName map one-to-one across records, meaning any one of them identifies a station and they can be used interchangeably. If this is the case, we keep Id and drop Name and TerminalName to reduce memory usage.

In [59]:
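def remove_redundant_candidate_keys(df):
    """Drops Name and TerminalName if they carry no information beyond Id"""

    # number of distinct (Id, Name, TerminalName) combinations
    nunique_col_vals = len(pd.DataFrame(df, columns=['Id', 'Name', 'TerminalName']).drop_duplicates())
    
    # compare against the frame passed in, not a global variable
    if nunique_col_vals == df['Id'].nunique():
        return df.drop(['Name', 'TerminalName'], axis=1)
    else:
        raise ValueError('Id, Name and TerminalName are not interchangeable')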
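
Comparing the number of distinct (Id, Name, TerminalName) combinations with the number of distinct Ids confirms that each Id has a single Name and TerminalName, which is enough to justify dropping the other two columns. It does not rule out two Ids sharing a Name. If strict interchangeability matters, a stricter check could look like the sketch below; `is_one_to_one` is a hypothetical helper and the toy frames are invented.

```python
import pandas as pd


def is_one_to_one(df, cols):
    """True if each column in cols has as many distinct values as there are distinct combinations."""
    n_combinations = len(df[cols].drop_duplicates())
    return all(df[col].nunique() == n_combinations for col in cols)


key_cols = ['Id', 'Name', 'TerminalName']

# Toy data: two stations, each with its own name and terminal
toy = pd.DataFrame({
    'Id': ['A', 'A', 'B'],
    'Name': ['Station A', 'Station A', 'Station B'],
    'TerminalName': ['001', '001', '002'],
})

# Same data, but both stations now share one name
shared_name = toy.assign(Name='Same Name')

print(is_one_to_one(toy, key_cols))
print(is_one_to_one(shared_name, key_cols))
```

In the second frame the combination count still equals the number of distinct Ids, so the simpler check passes even though Name alone no longer identifies a station.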

Station readings are not updated frequently, so the data contains many duplicate rows. Therefore, we'll remove rows where Id, Timestamp, NbBikes, NbDocks and NbEmptyDocks all have the same values.

In [61]:
def drop_duplicate_rows(df):
    return df.drop_duplicates(['Id', 'Timestamp', 'NbBikes', 'NbDocks', 'NbEmptyDocks'])
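
A small illustration of the effect, using invented readings: rows that agree on all five columns collapse to the first occurrence, while a reading with a new timestamp is kept.

```python
import pandas as pd


def drop_duplicate_rows(df):
    """Keeps the first row for each distinct reading of a station."""
    return df.drop_duplicates(subset=['Id', 'Timestamp', 'NbBikes', 'NbDocks', 'NbEmptyDocks'])


# Toy readings: the first two rows are the same reading captured twice
readings = pd.DataFrame({
    'Id': ['A', 'A', 'A'],
    'Timestamp': ['2016-05-16T10:00:00.000', '2016-05-16T10:00:00.000', '2016-05-16T10:05:00.000'],
    'NbBikes': [5, 5, 4],
    'NbDocks': [20, 20, 20],
    'NbEmptyDocks': [15, 15, 16],
})

deduplicated = drop_duplicate_rows(readings)
print(len(readings), len(deduplicated))
```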


#### Parse Slices of the JSON files

In [8]:
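# Cache the parsed records so the JSON files do not need to be parsed again
with open('data/parsed/dataset.pickle', 'wb') as pickle_file:
    pickle.dump(records, pickle_file)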
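
The dump and load cells together act as a cache for the slow JSON parsing step. One possible way to fold them into a single call is sketched below. `load_or_parse` and `fake_parse` are hypothetical names introduced here, and the temporary directory stands in for the notebook's `data/parsed` folder.

```python
import os
import pickle
import tempfile


def load_or_parse(cache_path, parse_fn):
    """Loads records from cache_path if it exists, otherwise parses and caches them."""
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as cache_file:
            return pickle.load(cache_file)

    records = parse_fn()
    with open(cache_path, 'wb') as cache_file:
        pickle.dump(records, cache_file)
    return records


calls = []


def fake_parse():
    """Stand-in for the real parsing step; records how often it runs."""
    calls.append(1)
    return [[{'Id': 'BikePoints_1', 'NbBikes': '3'}]]


cache_path = os.path.join(tempfile.mkdtemp(), 'dataset.pickle')
first = load_or_parse(cache_path, fake_parse)   # parses and writes the cache
second = load_or_parse(cache_path, fake_parse)  # reads the cache
print(len(calls), first == second)
```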

### Import into Pandas

In [5]:
dataset = pd.DataFrame(list(itertools.chain.from_iterable(records)))
dataset = dataset.sort_values(by=['Timestamp'], ascending=True)

dataset.shape

(219635, 15)

In [5]:

dataset.head()

| | Id | InstallDate | Installed | Latitude | Locked | Longitude | Name | NbBikes | NbDocks | NbEmptyDocks | PlaceType | RemovalDate | Temporary | TerminalName | Timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44184 | BikePoints_791 | 1452861540000 | False | 0.0 | False | 0.0 | Test Desktop | 0 | 0 | 0 | BikePoint | | False | 666666 | 2016-05-10T15:34:07.137 |
| 91428 | BikePoints_791 | 1452861540000 | False | 0.0 | False | 0.0 | Test Desktop | 0 | 0 | 0 | BikePoint | | False | 666666 | 2016-05-10T15:34:07.137 |
| 148607 | BikePoints_791 | 1452861540000 | False | 0.0 | False | 0.0 | Test Desktop | 0 | 0 | 0 | BikePoint | | False | 666666 | 2016-05-10T15:34:07.137 |
| 34278 | BikePoints_791 | 1452861540000 | False | 0.0 | False | 0.0 | Test Desktop | 0 | 0 | 0 | BikePoint | | False | 666666 | 2016-05-10T15:34:07.137 |
| 149370 | BikePoints_791 | 1452861540000 | False | 0.0 | False | 0.0 | Test Desktop | 0 | 0 | 0 | BikePoint | | False | 666666 | 2016-05-10T15:34:07.137 |


In [8]:

nuniques = dataset.apply(lambda x:x.nunique())
constant_cols = nuniques[nuniques == 1].index
print('Constant columns: %s' % constant_cols)
dataset = dataset.drop(constant_cols, axis=1)

Constant columns: Index([u'LastUpdated', u'Locked', u'PlaceType', u'Temporary'], dtype='object')
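
One detail worth keeping in mind with this approach: `nunique()` ignores missing values by default, so a column that is entirely missing has zero unique values and is not caught by the `== 1` test. The sketch below shows the difference on an invented frame; whether any column in this dataset is affected is not shown in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one varying column, one constant column, one all-missing column
toy = pd.DataFrame({
    'Id': ['A', 'B', 'C'],
    'PlaceType': ['BikePoint'] * 3,
    'LastUpdated': [np.nan] * 3,
})

# Default behaviour: NaN is not counted as a value
nuniques_default = toy.nunique()
constant_default = list(nuniques_default[nuniques_default == 1].index)

# Counting NaN as a value also flags the all-missing column
nuniques_with_nan = toy.nunique(dropna=False)
constant_with_nan = list(nuniques_with_nan[nuniques_with_nan == 1].index)

print(constant_default)
print(constant_with_nan)
```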


## Technically Correct Data

The data is said to be technically correct if it:

1. can be directly recognized as belonging to a certain variable
2. is stored in a data type that represents the value domain of the real-world variable.
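
The second criterion can be checked mechanically by comparing each column's dtype with the one the data dictionary implies. The following is a minimal sketch under that assumption. `EXPECTED_DTYPES` and `dtype_mismatches` are hypothetical names, and the mapping covers only a few columns for illustration.

```python
import pandas as pd

# Assumed target dtypes for a few columns (illustrative, not exhaustive)
EXPECTED_DTYPES = {
    'NbBikes': 'uint16',
    'NbDocks': 'uint16',
    'Installed': 'bool',
}


def dtype_mismatches(df, expected):
    """Returns {column: (actual dtype, expected dtype)} for columns that differ."""
    return {
        col: (str(df[col].dtype), dtype)
        for col, dtype in expected.items()
        if col in df.columns and str(df[col].dtype) != dtype
    }


# Toy frame holding everything as strings, as it would come out of parsing
raw = pd.DataFrame({'NbBikes': ['3'], 'NbDocks': ['10'], 'Installed': ['true']})
before = dtype_mismatches(raw, EXPECTED_DTYPES)

# After converting one column, it drops out of the report
raw['NbBikes'] = raw['NbBikes'].astype('uint16')
after = dtype_mismatches(raw, EXPECTED_DTYPES)

print(sorted(before))
print(sorted(after))
```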

### Convert to Appropriate DataTypes

In [9]:
# convert columns to their appropriate datatypes
# (Locked and Temporary were dropped above as constant columns, so they are not converted here)
dataset['InstallDate'] = pd.to_numeric(dataset['InstallDate'], errors='raise')
dataset['Installed'] = dataset['Installed'].astype('bool_')
dataset['NbBikes'] = dataset['NbBikes'].astype('uint16')
dataset['NbDocks'] = dataset['NbDocks'].astype('uint16')
dataset['NbEmptyDocks'] = dataset['NbEmptyDocks'].astype('uint16')

# convert string timestamp to datetime
dataset['Timestamp'] = pd.to_datetime(dataset['Timestamp'], format='%Y-%m-%dT%H:%M:%S.%f')
# InstallDate holds milliseconds since the Unix epoch
dataset['InstallDate'] = pd.to_datetime(dataset['InstallDate'], unit='ms')
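
A caveat on the boolean conversion: if a flag column arrives as strings such as `'true'` and `'false'` (an assumption, since the raw values are not shown in this notebook), casting to bool marks every non-empty string as True. An explicit mapping avoids that. The sketch below uses invented values.

```python
import pandas as pd

# Hypothetical raw flags stored as strings
raw_flags = pd.Series(['true', 'false', 'true'], dtype=object)

# Casting checks only whether each string is non-empty
naive = raw_flags.astype('bool')

# Mapping translates the text into the intended boolean
mapped = raw_flags.map({'true': True, 'false': False})

print(list(naive))
print(list(mapped))
```

Values missing from the mapping become NaN, which makes unexpected spellings visible instead of silently converting them.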

### Derive Variables

In [10]:
dataset['NbUnusableDocks'] = dataset['NbDocks'] - (dataset['NbBikes'] + dataset['NbEmptyDocks'])
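
Because the three inputs are stored as `uint16`, the subtraction wraps around instead of going negative when a reading reports more bikes and empty docks than docks. The summary statistics below do not suggest this happens in the current data, but a guard is cheap. The following sketch, on invented values, does the arithmetic in a signed type so inconsistent readings show up as negative numbers.

```python
import pandas as pd

# Toy readings: the second row reports more bikes + empty docks than docks
toy = pd.DataFrame({
    'NbDocks': [20, 10],
    'NbBikes': [12, 6],
    'NbEmptyDocks': [7, 5],
}).astype('uint16')

# Unsigned arithmetic wraps around silently
wrapped = toy['NbDocks'] - (toy['NbBikes'] + toy['NbEmptyDocks'])

# Signed arithmetic keeps the sign, so bad rows can be flagged
signed = toy.astype('int32')
unusable = signed['NbDocks'] - (signed['NbBikes'] + signed['NbEmptyDocks'])
inconsistent = unusable < 0

print(list(wrapped))
print(list(unusable))
print(list(inconsistent))
```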

### Data Description

In [11]:
dataset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100894 entries, 176838 to 654026
Data columns (total 13 columns):
Id                 100894 non-null object
InstallDate        90556 non-null datetime64[ns]
Installed          100894 non-null bool
Latitude           100894 non-null float64
Longitude          100894 non-null float64
Name               100894 non-null object
NbBikes            100894 non-null uint16
NbDocks            100894 non-null uint16
NbEmptyDocks       100894 non-null uint16
RemovalDate        100894 non-null object
TerminalName       100894 non-null object
Timestamp          100894 non-null datetime64[ns]
NbUnusableDocks    100894 non-null uint16
dtypes: bool(1), datetime64[ns](2), float64(2), object(4), uint16(4)
memory usage: 45.9 MB


## Consistent Correct Data

In [12]:
dataset.describe()

| | Latitude | Longitude | NbBikes | NbDocks | NbEmptyDocks | NbUnusableDocks |
|---|---|---|---|---|---|---|
| count | 100894.0 | 100894.0 | 100894.0 | 100894.0 | 100894.0 | 100894.0 |
| mean | 51.507157 | -0.127769 | 12.926923 | 27.334906 | 13.725217 | 0.682766 |
| std | 0.163196 | 0.050149 | 9.248575 | 9.547941 | 9.666593 | 1.02508 |
| min | 0.0 | -0.236769 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 51.494881 | -0.166878 | 6.0 | 20.0 | 6.0 | 0.0 |
| 50% | 51.510701 | -0.126021 | 12.0 | 25.0 | 12.0 | 0.0 |
| 75% | 51.521113 | -0.09294 | 18.0 | 33.0 | 19.0 | 1.0 |
| max | 51.549369 | 0.0 | 61.0 | 63.0 | 63.0 | 11.0 |
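
The minimum Latitude of 0.0 and the maximum Longitude of 0.0 line up with the `Test Desktop` rows shown earlier, which sit at (0.0, 0.0) and are not installed. One possible way to flag such rows is sketched below on invented data. The second station and its coordinates are hypothetical, and whether flagged rows should be dropped is a decision this notebook has not made.

```python
import pandas as pd

# Toy frame: one placeholder station at (0.0, 0.0) and one plausible station
toy = pd.DataFrame({
    'Id': ['BikePoints_791', 'BikePoints_1'],
    'Latitude': [0.0, 51.5],
    'Longitude': [0.0, -0.12],
    'Installed': [False, True],
})

# Flag rows whose coordinates are both exactly zero
suspect = (toy['Latitude'] == 0.0) & (toy['Longitude'] == 0.0)

flagged_ids = list(toy.loc[suspect, 'Id'])
cleaned = toy[~suspect]

print(flagged_ids)
print(len(cleaned))
```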


### Missing Values



### Outliers

### Errors

### Consistency

## Exploratory Data Analysis

### Visual Representation

### Examine Variable Relationships

### Analyze Variable Over Time

## Conclusions