### Example: Analyzing Citibike Station Activity using Pandas

We are going to use a database of status snapshots of Citibike stations.

In [None]:
%matplotlib inline
import pandas as pd
import MySQLdb as mdb
import matplotlib 
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])
matplotlib.rcParams['figure.figsize'] = (20,10)

We will first fetch the data from the database.

In [None]:
con = mdb.connect(host = 'localhost', 
                  user = 'root', 
                  passwd = 'dwdstudent2015', 
                  charset='utf8', 
                  use_unicode=True, 
                  database='citibike');

If we try to retrieve all the data, we will see that we have way too many data points (more than 10 million). 

In [None]:
cur = con.cursor(mdb.cursors.DictCursor)
cur.execute("SELECT COUNT(*) AS cnt FROM citibike.Docks_Status")
result = list(cur.fetchall())
cur.close()

result

Retrieving millions of data points from the database will take a long time and may cause memory errors.

#### Pushing part of the computation down to the database

The goal of our analysis is to see how bike usage varies over time. Therefore, we can reduce the amount of retrieved data by asking to get back only averages over a period of, say, 60 minutes. 

Unfortunately, SQL does not provide elegant tools for handling time series, so we are going to resort to a few "hacks". We are going to truncate the `last_communication_time` field in the database to the hour, and then compute the average level of "fullness" of each bike station (defined as the number of bikes divided by the number of docks in the station) within each hour.

* The command `DATE_FORMAT(last_communication_time, '%Y-%m-%d %H:00:00')` truncates each timestamp to the start of its hour.
* We limit our query to data from February 13 to March 13.
* We also limit our query to statuses where the station was operating and reported back a proper status.
* We `GROUP BY` timestamp and station, and we compute the average fullness level of the station over that time.

 *(Note: The `DATE_FORMAT` approach works for truncating the timestamp to the hour. The following, more complicated, expression works for arbitrary time periods. For example, to get 900-second intervals (i.e., 15 minutes), we can use `CONCAT(DATE(last_communication_time), ' ',  SEC_TO_TIME((TIME_TO_SEC(last_communication_time) DIV 900) * 900))`.)*
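
As a rough sketch, the bullets above might translate into a query like the one below. Only `citibike.Docks_Status` and `last_communication_time` appear in this notebook. The other column names (`station_id`, `available_bikes`, `total_docks`, `status_value`), the `'In Service'` value, and the year 2017 (taken from the plot limits used later) are assumptions for illustration. The helper `floor_to_bucket` is also hypothetical: it mirrors the `DIV 900` arithmetic from the note in plain Python, so that the bucketing logic can be checked without a database.

```python
from datetime import datetime, timedelta

# Sketch only: every column name except last_communication_time is assumed.
hourly_query = '''
SELECT station_id,
       AVG(available_bikes / total_docks) AS fullness,
       DATE_FORMAT(last_communication_time, '%Y-%m-%d %H:00:00') AS ts
FROM citibike.Docks_Status
WHERE total_docks > 0
  AND status_value = 'In Service'
  AND last_communication_time >= '2017-02-13'
  AND last_communication_time <  '2017-03-14'
GROUP BY station_id, ts
'''

def floor_to_bucket(ts, seconds=900):
    """Round a datetime down to the start of its bucket (bucket length in seconds)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    return midnight + timedelta(seconds=(elapsed // seconds) * seconds)

sample = datetime(2017, 2, 15, 8, 47, 12)
print(floor_to_bucket(sample))        # start of the 15-minute bucket
print(floor_to_bucket(sample, 3600))  # start of the hour
```

The three selected columns are in the same order as the `['id', 'bikes', 'timestamp']` column names used when the DataFrame is built.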

In [None]:
query = '''
SELECT * 
FROM citibike.stations
'''


In [None]:
cur = con.cursor()
cur.execute(query)
df = pd.DataFrame(list(cur.fetchall()), columns=['id', 'bikes', 'timestamp'])
cur.close()
# We retrieved the data in memory, so we do not need the database connection anymore.
con.close()

In [None]:
len(df)

So, we have now reduced our dataset from more than 10 million data points to around half a million. That will give us a big speedup in our subsequent operations, and the data can easily be handled in memory by Pandas.

In [None]:
df

In [None]:
df.dtypes

Let's convert the columns to proper data types.

*Note: We use `astype` as opposed to `pd.to_numeric` for converting the `bikes` column, because the `bikes` values that come back from MySQL are of the Decimal data type, and `pd.to_numeric` seems to have problems converting Decimal values. We can use the technique from http://stackoverflow.com/questions/7483363/python-mysqldb-returns-datetime-date-and-decimal if we want to get back floats instead of Decimals from MySQL.*

*Note2: We use the 'downcast' option to reduce the size of the variables. This reduces memory needs, and can (slightly) improve execution time.*

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['bikes'] = df['bikes'].astype(float)
df['bikes'] = pd.to_numeric(df['bikes'], downcast='float')
df['id'] = pd.to_numeric(df['id'], downcast='unsigned')

df.dtypes
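
To see what each conversion step does in isolation, here is a small sketch on made-up rows that imitate what the MySQL driver returns (the `toy` frame and its values are invented for illustration):

```python
from decimal import Decimal
import pandas as pd

# Toy rows standing in for the query result; the values are made up.
toy = pd.DataFrame({
    'id': [72, 161, 3260],
    'bikes': [Decimal('0.25'), Decimal('0.5'), Decimal('0.75')],
    'timestamp': ['2017-02-15 08:00:00'] * 3,
})

toy['timestamp'] = pd.to_datetime(toy['timestamp'])           # strings -> datetime64
toy['bikes'] = toy['bikes'].astype(float)                     # Decimal objects -> float64
toy['bikes'] = pd.to_numeric(toy['bikes'], downcast='float')  # float64 -> float32
toy['id'] = pd.to_numeric(toy['id'], downcast='unsigned')     # int64 -> smallest unsigned type that fits

print(toy.dtypes)
```

Downcasting to `float32` keeps roughly seven significant digits, which is plenty for a fullness ratio between 0 and 1.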

In [None]:
df.head(10)

### Exploratory Analysis

As a first step, let's see how the status of the bike stations evolves over time. We compute the average "fullness" of all the bike stations over time. We can use the `groupby` function of pandas, and compute the `mean()` for the groups.

In [None]:
# Notice that this also returns an average of the station IDs, which is meaningless.
# We will eliminate it next.
df.groupby('timestamp').mean()

In [None]:
df.groupby('timestamp').mean()['bikes']
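
A small variation, sketched here on made-up rows: selecting the `bikes` column before calling `mean()` produces the same series, but never computes the meaningless average of the station IDs.

```python
import pandas as pd

# Made-up rows: two stations observed at two timestamps.
toy = pd.DataFrame({
    'id': [72, 161, 72, 161],
    'bikes': [0.2, 0.4, 0.6, 0.8],
    'timestamp': pd.to_datetime(['2017-02-15 08:00', '2017-02-15 08:00',
                                 '2017-02-15 09:00', '2017-02-15 09:00']),
})

# Select the column first, then aggregate per timestamp.
hourly_mean = toy.groupby('timestamp')['bikes'].mean()
print(hourly_mean)
```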

Now let's plot the activity over time. We can see that the percentage of bikes in the stations falls from 35% overnight to 30% during the morning and evening commute times, while the average availability during the day is around 31%.

In [None]:
df.groupby('timestamp').mean()['bikes'].plot(
    figsize=(20,10), grid=True
)

Let's also run a seasonal decomposition, to separate the trend from the periodic component.

In [None]:
!sudo pip3 install statsmodels

In [None]:
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

time_series = df.groupby('timestamp').mean()['bikes']

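# Our data is hourly, so a period of 168 (= 24 * 7) assumes a weekly periodicity,
# which also captures the daily cycle.
# (Older statsmodels versions name this argument `freq` instead of `period`.)
decomposition = seasonal_decompose(time_series, period=168)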

seasonal = decomposition.plot()  

### Examining Time Series per Station

We now create a pivot table, to examine the time series for individual stations.

Notice that we use the `interpolate` method with `method='time'`, which fills the cells where we do not have values by interpolating linearly, based on the timestamps, between the surrounding non-missing values.

In [None]:
df2 = df.pivot_table(
    index='timestamp', 
    columns='id', 
    values='bikes', 
    aggfunc='mean'  # the string name avoids the warning that newer pandas emits for np.mean
).interpolate(method='time') 
df2
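
To make the interpolation step concrete, here is a minimal sketch on made-up rows where station 72 has no reading at 09:00. The `toy`, `wide`, and `filled` names and all values are invented for illustration.

```python
import pandas as pd

# Made-up rows: station 72 is missing its 09:00 reading.
toy = pd.DataFrame({
    'id': [72, 161, 161, 72, 161],
    'bikes': [0.2, 0.4, 0.5, 0.8, 0.6],
    'timestamp': pd.to_datetime(['2017-02-15 08:00', '2017-02-15 08:00',
                                 '2017-02-15 09:00',
                                 '2017-02-15 11:00', '2017-02-15 11:00']),
})

# One row per timestamp, one column per station; the gap shows up as NaN.
wide = toy.pivot_table(index='timestamp', columns='id', values='bikes', aggfunc='mean')

# Fill the gap linearly, weighting by the time elapsed between observations.
filled = wide.interpolate(method='time')
print(filled)
```

The missing value sits one hour into a three-hour gap between 0.2 and 0.8, so it is filled with 0.2 + (1/3) * 0.6 = 0.4.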

Let's plot the time series for *all* bike stations, for a couple of days in February.

In [None]:
df2.plot(
    alpha=0.05, 
    color='b', 
    legend=False, 
    figsize=(20,10), 
    xlim=('2017-02-15','2017-02-17')
)

Let's limit our plot to just two stations:
* Station 3260 at "Mercer St & Bleecker St"
* Station 161 at "LaGuardia Pl & W 3 St"

which are nearby and tend to exhibit similar behavior. Remember that the list of stations is [available as JSON](https://feeds.citibikenyc.com/stations/stations.json).

In [None]:
# Plot only the two stations discussed above
df2[[161, 3260]].plot(
    alpha=0.5,  
    legend=False, 
    figsize=(20,10), 
    xlim=('2017-02-15','2017-02-27')
)

### Finding Bike Stations with Similar Behavior

For our next analysis, we are going to try to find bike stations that have similar behaviors over time. A very simple technique that we can use to find similar time series is to treat the time series as vectors, and compute their correlation. Pandas provides the `corr` function that can be used to calculate the correlation of columns. (If we want to compute the correlation of rows, we can just take the transpose of the dataframe using the `transpose()` function, and compute the correlations there.)

In [None]:
similarities = df2.corr(method='pearson')
similarities

Let's see the similarities of the two stations that we examined above.

In [None]:
stations = [161, 3260]

similarities[stations].loc[stations]

For bookkeeping purposes, we are going to drop the rows that contain NaN values, as we cannot use such similarity values.

In [None]:
similarities.dropna(axis=0, how='any', inplace=True)
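
One caveat worth checking on your own data: if a single station has undefined correlations (for example, because its series is constant), it puts a NaN in every row, and dropping rows with `how='any'` would then remove everything. The sketch below, on made-up data, shows one possible alternative that removes such a station from both axes so the matrix stays square. The station ID `9999` and all values are invented.

```python
import pandas as pd

# Made-up series; station 9999 is constant, so its correlations are undefined.
toy = pd.DataFrame({
    72:   [0.2, 0.4, 0.6, 0.8],
    161:  [0.3, 0.5, 0.6, 0.9],
    9999: [0.5, 0.5, 0.5, 0.5],
})

sim = toy.corr(method='pearson')

# Stations whose correlation with every station is NaN
all_nan = sim.isna().all(axis=0)
bad = all_nan[all_nan].index

# Drop them as both rows and columns, keeping the matrix square
sim_clean = sim.drop(index=bad, columns=bad)
print(sim_clean)
```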

We are now going to convert our similarities into distances that are non-negative and bounded between 0 and 1.

* If two stations have correlation 1, they behave identically, and therefore have distance 0.
* If two stations have correlation -1, they have exactly opposite behaviors, and therefore we want them to have distance 1 (the maximum).

In [None]:
distances = ((.5*(1-similarities))**2)
distances
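
As a quick sanity check of this transformation, here is a sketch on three made-up series: `b` moves exactly with `a`, and `c` moves exactly against it. Uncorrelated series (correlation 0) would land at (0.5 * 1)^2 = 0.25.

```python
import pandas as pd

# Made-up series for illustration.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with a
    'c': [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with a
})

sim = toy.corr(method='pearson')
dist = (0.5 * (1 - sim)) ** 2    # same formula as in the cell above
print(dist.round(3))
```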

### Clustering Based on Distances

Without explaining too much about clustering, we are going to use a clustering technique and cluster together bike stations that are "nearby" according to our similarity analysis. The code is very simple:

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=2)
cluster.fit(distances.values)

We will now take the results of the clustering and associate each station with its cluster.

In [None]:
labels = pd.DataFrame(list(zip(distances.index.values.tolist(), cluster.labels_)), columns = ["id", "cluster"])
labels.head(10)

Let's see how many stations are in each cluster.

In [None]:
labels.groupby('cluster').count()

### Visualizing the Time Series Clusters

We will start by assigning a color to each cluster, so that we can plot each station timeline with the cluster color. (We use a long list of colors, so that we can play with the number of clusters in the earlier code and still get nicely colored results.)

In [None]:
colors = list(['r','b', 'g', 'm', 'y', 'k', 'w', 'c'])
labels['color'] = labels['cluster'].apply(lambda cluster_id : colors[cluster_id]) 
labels.head(10)

In [None]:
stations_plot = df2.plot(
    alpha=0.02, 
    legend=False, 
    figsize=(20,5), 
    color=labels["color"],
    #xlim=('2017-02-15','2017-02-17')
)

The plot still looks messy. Let's try to plot instead a single line for each cluster. To represent the cluster, we are going to use the _median_ fullness value across all stations that belong to a cluster, for each timestamp. For that, we can again use a pivot table: we define the `timestamp` as one dimension of the table, and `cluster` as the other dimension, and we use the `percentile` function to compute the median. 

For that, we first _join_ our original dataframe with the results of the clustering, using the `merge` command, which adds an extra column holding the cluster ID of each station. Then we compute the pivot table.

In [None]:
import numpy as np

median_cluster = df.merge(
    labels, 
    how='inner', 
    on='id'
).pivot_table(
    index='timestamp', 
    columns='cluster', 
    values='bikes', 
    aggfunc=lambda x: np.percentile(x, 50) # median
)

median_cluster
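
The same merge-then-pivot pattern can be checked by hand on a few made-up rows. This sketch uses the built-in `'median'` aggregation, which is equivalent to taking the 50th percentile. The `toy` frames and the cluster assignments are invented for illustration.

```python
import pandas as pd

# Made-up readings for three stations at two timestamps.
toy = pd.DataFrame({
    'id': [1, 2, 3, 1, 2, 3],
    'bikes': [0.1, 0.2, 0.9, 0.3, 0.5, 0.7],
    'timestamp': pd.to_datetime(['2017-02-15 08:00'] * 3 + ['2017-02-15 09:00'] * 3),
})

# Made-up cluster assignment: stations 1 and 2 in cluster 0, station 3 in cluster 1.
toy_labels = pd.DataFrame({'id': [1, 2, 3], 'cluster': [0, 0, 1]})

toy_median = toy.merge(toy_labels, how='inner', on='id').pivot_table(
    index='timestamp', columns='cluster', values='bikes', aggfunc='median'
)
print(toy_median)
```

At 08:00, cluster 0 contains the values 0.1 and 0.2, so its median is 0.15. Cluster 1 has a single station, so its median is simply that station's value.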

Now, we can plot the medians for the two clusters.

In [None]:
median_plot = median_cluster.plot(
        figsize=(20,5), 
        linewidth = 2, 
        alpha = 0.75,
        color=colors,
        ylim = (0,0.75),
        grid = True
    )

And just for fun and for visual decoration, let's put the two plots together. We are going to fade the individual station time series heavily (by setting `alpha=0.01`) and make the median lines more prominent by increasing their line width. We will limit our plot to one week's worth of data:

In [None]:
stations_plot = df2.plot(
    alpha=0.01, 
    legend=False, 
    figsize=(20,5), 
    color=labels["color"]
)

median_cluster.plot(
    figsize=(20,5), 
    linewidth = 3, 
    alpha = 0.5,
    color=colors, 
    xlim=('2017-02-13','2017-02-20'),
    ax = stations_plot
)