# Visualizing Seattle Bicycle Counts Hourly Data Set
## Chapter 3: Data Manipulation with Pandas
### Python Data Science / Page 202

**Note:** An excellent way of doing data mining in Jupyter. This notebook demonstrates how to mine data for patterns and how to visualise those patterns using Jupyter and Pandas.

**Links:**
1. __[Is Seattle Really Seeing an Uptick In Cycling?](https://jakevdp.github.io/blog/2014/06/10/is-seattle-really-seeing-an-uptick-in-cycling/)__
1. __[A statistical analysis of biking on the Fremont Bridge, Part 1: Overview](https://www.seattlebikeblog.com/2014/06/09/a-statistical-analysis-of-biking-on-the-fremont-bridge-part-1-overview/)__
1. __[A statistical analysis of biking on the Fremont Bridge, Part 2: Rain](https://www.seattlebikeblog.com/2014/06/10/a-statistical-analysis-of-biking-on-the-fremont-bridge-part-2-rain/)__
1. __[A statistical analysis of biking on the Fremont Bridge, Part 3: Bike Month](https://www.seattlebikeblog.com/2014/06/11/a-statistical-analysis-of-biking-on-the-fremont-bridge-part-3-bike-month/)__
1. __[A statistical analysis of biking on the Fremont Bridge, Part 4: Are more people biking?](https://www.seattlebikeblog.com/2014/06/12/a-statistical-analysis-of-biking-on-the-fremont-bridge-part-4-are-more-people-biking/)__
1. __[City Showdown: How do Cambridge Cyclers Compare to Seattle Cyclers?](http://nbviewer.jupyter.org/gist/lindabli/ee7aed9d875a698526fd)__

**TODO:**
1. Add support for other variables like weather and temperature. See first link.

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn; seaborn.set()
import numpy as np

%matplotlib inline

## Load the data:

In [None]:
data_set_home = %env DATA_SETS_HOME
date_format_daily = "%m/%d/%Y"
date_format_hourly = "%m/%d/%Y %I:%M:%S %p"
data = pd.read_csv(
    "{0}/Miscellaneous/Fremont_Bridge_Hourly_Bicycle_Counts.csv".format(data_set_home),
    index_col=0,
)
# Parse the index explicitly: pd.datetime was removed in pandas 2.0 and the
# date_parser argument of read_csv is deprecated.
data.index = pd.to_datetime(data.index, format=date_format_hourly)
data.sort_index(inplace=True)
data.head()

## Change the name of columns and add total column:

In [None]:
data.columns = ["West", "East"]
data["Total"] = data.eval("West + East")
data.head()

## Describe the data:

In [None]:
data.dropna().describe()

## Plot the total data:

In [None]:
data.plot()
plt.ylabel("Hourly Bicycle Count")

## Plot a subset of the data:

In [None]:
data["2017":"2018"].plot()
plt.ylabel("Hourly Bicycle Count")

## Resample by day:

In [None]:
daily = data["2017":"2018"].resample("D").sum()
daily.plot(style=[":", "--", "-"])
plt.ylabel("Daily Bicycle Count")

## Resample and group the data by month:

The following will resample the data set summing all days into months, like:

```
2012/01/31 | 2000 # Uses the last day of the month as day.
2012/02/29 | 2400 # Uses the last day of the month as day (2012 is a leap year).
...
```

After which the resulting data can be grouped by month over all years in the data set and averaged.
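
The two steps can be tried on a small synthetic data set first. The sketch below is only an illustration: the `toy` frame and its values are made up, and it uses the month-start alias `"MS"` (an assumption chosen because it behaves the same across pandas versions), so each month is labelled by its first day rather than its last.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bridge data: two full years of daily counts,
# where every day in month m has the value m (1 for January, 12 for December).
idx = pd.date_range("2013-01-01", "2014-12-31", freq="D")
toy = pd.DataFrame({"Total": np.asarray(idx.month, dtype=float)}, index=idx)

# Step 1: one row per calendar month, holding the sum of that month's days.
per_month = toy.resample("MS").sum()

# Step 2: average the same calendar month over all years in the data.
monthly_mean = per_month.groupby(per_month.index.month).mean()

print(per_month.head(3))
print(monthly_mean)
```

After step 1 there is one row per month of each year (24 rows here); after step 2 there are always 12 rows, one per calendar month.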

**Observation:** There is a clear seasonal pattern: cyclists cross the bridge more often during the summer months, most likely because the weather is much better for a bike ride.

In [None]:
# "M" is the month-end alias; pandas 2.2+ deprecates it in favour of "ME".
segment = data["2013":"2017"].resample("M").sum()
monthly = segment.groupby(segment.index.month).mean()
monthly.plot(style=[":", "--", "-"])
plt.ylabel("Monthly Bicycle Mean")

## Resample by day and compute a 30-day window rolling mean:

In [None]:
daily = data.resample("D").sum()
daily.rolling(30, center=True).mean().plot(style=[":", "--", "-"])
plt.ylabel("Mean Daily Count")
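
To see what `center=True` changes, here is a minimal sketch on a made-up five-point series with a window of 3 instead of 30. The series `s` is hypothetical and only serves to make the alignment visible.

```python
import pandas as pd

# Made-up series with a single spike in the middle.
s = pd.Series([1.0, 2.0, 6.0, 2.0, 1.0])

# Default: the mean is written at the right edge of each window, so the
# smoothed curve lags behind the data.
trailing = s.rolling(3).mean()

# center=True: the mean is written at the middle of each window, so the
# smoothed peak stays aligned with the spike in the data.
centered = s.rolling(3, center=True).mean()

print(pd.DataFrame({"raw": s, "trailing": trailing, "centered": centered}))
```

In both cases positions without a full window are left as `NaN`; centering moves one of those gaps from the start of the series to the end.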

## Compute a Gaussian rolling window:

In [None]:
daily.rolling(30, center=True, win_type="gaussian").mean(std=10).plot(style=[":", "--", "-"])
plt.ylabel("Mean Daily Count")
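
Note that `win_type` requires SciPy to be installed. The NumPy-only sketch below illustrates the idea behind a Gaussian window: values near the middle of the window get more weight than values at the edges. It is an assumption-laden illustration, not pandas' actual implementation; `gaussian_weights` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def gaussian_weights(window, std):
    """Bell-shaped weights centred on the middle of the window (hypothetical helper)."""
    offsets = np.arange(window) - (window - 1) / 2.0
    return np.exp(-0.5 * (offsets / std) ** 2)

# Same window size and standard deviation as in the cell above.
weights = gaussian_weights(30, 10)

# Made-up window contents: a straight line from 0 to 29.
window_values = np.arange(30, dtype=float)

# Weighted mean of one window: each value counts in proportion to its weight.
smoothed = np.sum(weights * window_values) / np.sum(weights)

print(weights.round(3))
print(smoothed)
```

Because the weights are symmetric, the weighted mean of a straight line equals the value at the centre of the window, just as with a plain mean; the two differ when the data contain short spikes, which the Gaussian window suppresses more smoothly.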

## Resample and group the data by hour:

In this case there is no need to resample the data set, because each entry is already separated by hour of the day.

The result of the grouping will be averaged and plotted.
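
As a quick sanity check of the grouping, the following sketch uses three made-up days of hourly values. The `toy` frame is hypothetical: each value is the hour of the day plus 10 times the day number, so the mean for a given hour should be that hour plus 10.

```python
from datetime import time

import numpy as np
import pandas as pd

# Three synthetic days of hourly data.
idx = pd.date_range("2017-01-01", periods=3 * 24, freq=pd.Timedelta(hours=1))
day_number = np.repeat(np.arange(3), 24)
values = np.tile(np.arange(24), 3) + 10.0 * day_number
toy = pd.DataFrame({"Total": values}, index=idx)

# Group rows that share the same time of day, regardless of the date.
by_time_toy = toy.groupby(toy.index.time).mean()

print(by_time_toy.loc[time(8, 0)])
```

The resulting index holds `datetime.time` objects, one per hour of the day.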

In [None]:
by_time = data.groupby(data.index.time).mean()
by_time.head()

### Plot the data set grouped by hour:

Each tick marks a 3-hour interval.
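
The tick positions used in the cell below are plain numbers. Assuming, as the arithmetic suggests, that time-of-day values are placed on the x-axis as seconds since midnight, this short sketch converts the positions back into readable clock times.

```python
from datetime import time

import numpy as np

# Eight ticks, 3 hours (10800 seconds) apart, starting at midnight.
hourly_ticks = (3 * 60 * 60) * np.arange(8)

# Convert each position in seconds back into a clock time for inspection.
tick_times = [time(hour=int(seconds) // 3600) for seconds in hourly_ticks]
labels = [t.strftime("%H:%M") for t in tick_times]

print(hourly_ticks)
print(labels)
```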

**Observation:** Bike activity is highest during **06:00-09:00** and **15:00-18:00**, which probably reflects people commuting by bike.

In [None]:
hourly_ticks = (3 * 60 * 60) * np.arange(8)
by_time.plot(xticks=hourly_ticks, style=[":", "--", "-"])
plt.ylabel("Hourly Bicycle Mean")

## Resample and group the data by week day:

The following will resample the data set summing all hours into days, like:

```
2012/01/01 | 200
2012/01/02 | 240
...
```

After which the resulting data can be grouped by week day over all months and years in the data set and averaged.
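
The same two steps on a small synthetic data set, as a sketch only. The `toy` frame is made up: two weeks of hourly data starting on a Monday, with 10 crossings per hour on business days and 4 per hour at weekends.

```python
import numpy as np
import pandas as pd

# Two synthetic weeks of hourly data; 2017-01-02 is a Monday.
idx = pd.date_range("2017-01-02", periods=14 * 24, freq=pd.Timedelta(hours=1))
is_business_day = np.asarray(idx.dayofweek) < 5
toy = pd.DataFrame({"Total": np.where(is_business_day, 10.0, 4.0)}, index=idx)

# Step 1: sum the 24 hourly values of each day.
daily_toy = toy.resample("D").sum()

# Step 2: average each day of the week (0 = Monday ... 6 = Sunday).
weekday_toy = daily_toy.groupby(daily_toy.index.dayofweek).mean()

print(weekday_toy)
```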

**Observation:** This matches the commuting pattern seen in the hourly plot: activity is higher on business days and drops at weekends.

In [None]:
segment = data["2013":"2017"].resample("D").sum()
weekday = segment.groupby(segment.index.dayofweek).mean()
weekday.plot(style=[":", "--", "-"])
plt.ylabel("Week Day Bicycle Mean")

## Look for hourly patterns by doing a compound grouping Weekday vs Weekend:

**Observation:** An interesting pattern emerges here. As seen before, on weekdays the highest counts correspond to the start and end of the business day, which implies commuting by bike. At weekends, however, the highest counts occur in the afternoon, between **12:00-18:00**, which implies biking as a leisure activity. The result is a **bimodal** pattern on weekdays and a **unimodal** pattern at weekends.
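
A compound grouping like this can be checked on synthetic data where the answer is known in advance. In the sketch below the `toy` frame is made up: weekdays get peaks at 08:00 and 17:00, weekends get a single peak at 14:00, and the grouping should recover exactly those shapes.

```python
from datetime import time

import numpy as np
import pandas as pd

# Two synthetic weeks of hourly data; 2017-01-02 is a Monday.
idx = pd.date_range("2017-01-02", periods=14 * 24, freq=pd.Timedelta(hours=1))
hours = np.asarray(idx.hour)
is_weekday = np.asarray(idx.dayofweek) < 5

# Made-up counts: two commuting peaks on weekdays, one afternoon peak at weekends.
weekday_counts = np.where(np.isin(hours, [8, 17]), 100.0, 10.0)
weekend_counts = np.where(hours == 14, 50.0, 5.0)
toy = pd.DataFrame({"Total": np.where(is_weekday, weekday_counts, weekend_counts)}, index=idx)

# Group by two keys at once: the day type and the time of day.
day_type = np.where(is_weekday, "Weekday", "Weekend")
compound_toy = toy.groupby([day_type, toy.index.time]).mean()

print(compound_toy.loc["Weekday"].loc[time(8, 0)])
print(compound_toy.loc["Weekend"].loc[time(14, 0)])
```

The result has a two-level index, so `compound_toy.loc["Weekday"]` returns a frame indexed by time of day that can be plotted on its own.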

In [None]:
segment = data["2013":"2017"]
# Label every timestamp in the index as "Weekday" (Monday-Friday, weekday < 5)
# or "Weekend" (Saturday-Sunday). np.where is vectorized, so it evaluates the
# whole index at once and returns an ndarray with one label per row.
weekend = np.where(segment.index.weekday < 5, "Weekday", "Weekend")
# Group by day type and time of day; segment.index.time is an ndarray as well.
compound = segment.groupby([weekend, segment.index.time]).mean()
# plot
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
compound.loc["Weekday"].plot(
    ax=ax[0], 
    title="Weekdays", 
    xticks=hourly_ticks, 
    style=[":", "--", "-"]
)
compound.loc["Weekend"].plot(
    ax=ax[1], 
    title="Weekends", 
    xticks=hourly_ticks, 
    style=[":", "--", "-"]
)
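# plt.ylabel only labels the last (right-hand) axes, so label both explicitly.
for axis in ax:
    axis.set_ylabel("Hourly Bicycle Mean")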

# debug
# print(segment.index.time.shape)
# print(weekend.shape)
# compound.head(40)