# Fair Weather Pedalers
**Weather's effect on bike share ridership on the Hubway system in Boston and its environs.**  
Kevin Burek &lt;<kburek@fas.harvard.edu>>, Joshua Mclellan &lt;<jvl.mclellan@gmail.com>>  
Harvard AM 207 Spring 2016 Final Project  

## Abstract
The Hubway bicycle share system serves riders in the urban core and near suburbs of Boston.  We'd like to learn how the riders who use the system respond to the weather.  In 2013, the service providers released a dataset of ridership logs, which we analyze against historical weather data in order to explore the validity of different modeling hypotheses.  In brief, we believe that ridership may be composed of different constituencies, such as commuters and joyriders, that are affected differently by varying weather conditions.

## Tools
For analysis, we are using SciPy, NumPy, Pandas, PyMC, and Matplotlib.  
Some other core Python libraries are coming along for the ride, too.

In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style("white")

import time
import timeit

import scipy.stats 
import pandas as pd
import pymc as pm

import re

import string

## Data
The good folks at [Hubway](http://www.thehubway.com/) published the corpus of ridership logs from the system's inception in July 2011 through the end of the cycling season in 2013<sup>[[1]][@hubwaydatachallenge_zip]</sup>. For weather data, we turn to NOAA, whose [National Centers for Environmental Information](https://www.ncdc.noaa.gov/) provides the service of publishing [Climate Data Online](https://www.ncdc.noaa.gov/cdo-web/). We requested and received a data set including daily observed weather conditions for the relevant time period, for the Boston metropolitan area<sup>[[2]][@ncdc_boston]</sup>.


[@hubwaydatachallenge_zip]: http://files.hubwaydatachallenge.org/hubway_2011_07_through_2013_11.zip "Hubway Data Challenge. 'hubway_2011_07_through_2013_11'. Zip Archive. Retrieved 16 March 2016."

[@ncdc_boston]: http://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/locations/CITY:US250002/detail "NOAA National Centers for Environmental Information. 'Daily Summaries Location Details: Boston, MA'"

In [38]:
# Set a flag for whether to load the whole data set, or just a portion.
load_all = False
nb_root = ""
# Read ridership data (the "fewer_" file is the reduced sample)
rides_path = "data/hubway_2011_07_through_2013_11/%shubway_trips.csv"
rides_file = rides_path % ("" if load_all else "fewer_")
raw_rides = pd.read_csv(nb_root + rides_file)
# Read weather data
weather_na = ("unknown", "9999", "-9999")
raw_weather = pd.read_csv(nb_root + "data/ncdc-2013.csv", na_values=weather_na)

raw_weather.head()

,STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,DATETIME,PRCP,PRCP Measurement Flag,PRCP Quality Flag,...,TMIN,TMIN Measurement Flag,TMIN Quality Flag,TMIN Source Flag,TMIN Time of Observation,TOBS,TOBS Measurement Flag,TOBS Quality Flag,TOBS Source Flag,TOBS Time of Observation
0,GHCND:USW00054704,NORWOOD MEMORIAL AIRPORT MA US,15.2,42.19083,-71.17361,20100101,2010-01-01T12:00,0,T,,...,-38,,,W,,,,,,
1,GHCND:USW00054704,NORWOOD MEMORIAL AIRPORT MA US,15.2,42.19083,-71.17361,20100102,2010-01-02T12:00,23,,,...,-60,,,W,,,,,,
2,GHCND:USW00054704,NORWOOD MEMORIAL AIRPORT MA US,15.2,42.19083,-71.17361,20100103,2010-01-03T12:00,5,,,...,-93,,,W,,,,,,
3,GHCND:USW00054704,NORWOOD MEMORIAL AIRPORT MA US,15.2,42.19083,-71.17361,20100104,2010-01-04T12:00,0,,,...,-88,,,W,,,,,,
4,GHCND:USW00054704,NORWOOD MEMORIAL AIRPORT MA US,15.2,42.19083,-71.17361,20100105,2010-01-05T12:00,0,,,...,-93,,,W,,,,,,


In [37]:
# Build a slimmer weather frame indexed by station and date.
weather = pd.DataFrame()
weather['STATION'] = raw_weather['STATION']
weather['DATE'] = raw_weather['DATE']
weather.set_index(['STATION', 'DATE'], drop=True, inplace=True)
# DATE has the form YYYYMMDD; slice its string form into year, month, and day.
date_strings = raw_weather['DATE'].astype(str).tolist()
weather['YEAR'] = [d[0:4] for d in date_strings]
weather['MONTH'] = [d[4:6] for d in date_strings]
weather['DAY'] = [d[6:] for d in date_strings]
weather.head()


STATION,DATE,YEAR,MONTH,DAY
GHCND:USW00054704,20100101,2010,1,1
GHCND:USW00054704,20100102,2010,1,2
GHCND:USW00054704,20100103,2010,1,3
GHCND:USW00054704,20100104,2010,1,4
GHCND:USW00054704,20100105,2010,1,5


## Cleaning
Here we clean the data for analysis.  

### Augmenting ridership data
The ridership data has some properties that aren't well expressed in its raw form. We calculate and add columns for the following derived properties:  
* Start & End day of week (0/Monday - 6/Sunday) 
* Start & End time of day (00:00 - 23:59)
* ... Duration, rider age, ...
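As a rough illustration of the first two derived properties, here is a minimal sketch on a toy table. The column names `start_date` and `end_date`, the helper `add_time_features`, and the sample timestamps are assumptions made for the example; they are not taken from the Hubway files.

```python
import pandas as pd

# Toy stand-in for the trips table. The column names and values are
# hypothetical and only serve to make the sketch runnable.
trips = pd.DataFrame({
    "start_date": ["2011-07-28 10:12:00", "2011-07-31 23:50:00"],
    "end_date": ["2011-07-28 10:21:00", "2011-08-01 00:05:00"],
})


def add_time_features(df, prefix):
    """Return a copy of df with day-of-week and time-of-day columns
    derived from the `<prefix>_date` column."""
    stamps = pd.to_datetime(df[prefix + "_date"])
    out = df.copy()
    out[prefix + "_dow"] = stamps.dt.dayofweek           # 0 = Monday ... 6 = Sunday
    out[prefix + "_time"] = stamps.dt.strftime("%H:%M")  # "00:00" through "23:59"
    return out


trips = add_time_features(trips, "start")
trips = add_time_features(trips, "end")
print(trips)
```

Keeping the time of day as an `HH:MM` string matches the range given in the list above; a numeric form (minutes since midnight, say) may be more convenient for modeling.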

### Augmenting weather data
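The raw weather data needs its flag columns (measurement, quality, source, and time of observation) renamed so that each one is associated with the measurement dimension it describes, such as `PRCP` or `TMIN`.  Special placeholder values, such as `unknown`, `9999`, and `-9999`, need to be treated as missing.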
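The following sketch shows one way these two steps might look. It assumes a raw header in which each measurement column is followed by generically named flag columns; that layout, the helper `prefix_flag_columns`, and the inline sample data are all assumptions for illustration. The placeholder values are the ones passed as `na_values` when the weather file is loaded earlier in this notebook.

```python
import io

import pandas as pd

# Hypothetical raw export: generic flag names repeat after each measurement.
RAW_CSV = """PRCP,Measurement Flag,Quality Flag,TMIN,Measurement Flag,Source Flag
0,T,unknown,-38,,W
9999,,,-9999,,W
"""

FLAG_NAMES = {"Measurement Flag", "Quality Flag", "Source Flag",
              "Time of Observation"}


def prefix_flag_columns(columns, flag_names):
    """Prefix each flag column with the nearest measurement column before it."""
    renamed, current = [], None
    for col in columns:
        if col in flag_names and current is not None:
            renamed.append(current + " " + col)
        else:
            if col not in flag_names:
                current = col
            renamed.append(col)
    return renamed


header = RAW_CSV.splitlines()[0].split(",")
names = prefix_flag_columns(header, FLAG_NAMES)

# Skip the original header row and treat the placeholder values as missing.
clean = pd.read_csv(io.StringIO(RAW_CSV), names=names, skiprows=1,
                    na_values=("unknown", "9999", "-9999"))
print(clean)
```

Renaming from the header before parsing avoids relying on the suffixes pandas appends to duplicate column names.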

