# Divvy Exploratory Data Analysis

Exploratory data analysis for Divvy bikesharing data from April 2020 to May 2023.

### Preliminaries

In [1]:
import pandas as pd
import os
from datetime import datetime
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import plotly.express as px

In [2]:
colours = ['#D5573B','#074F57','#077187','#9ECE9A','#D7CEB2']
sns.set_palette(colours)

In [3]:
dtypes = {
    'ride_id': str,
    'rideable_type': str,
    'started_at': str,
    'ended_at': str,
    'start_station_name': str,
    'start_station_id': str,
    'end_station_name': str,
    'end_station_id': str,
    'start_lat': float,
    'start_lng': float,
    'end_lat': float,
    'end_lng': float,
    'member_casual': str,
    'time': float,
    'distance': float,
}

In [4]:
data_path = os.path.join(os.getcwd(), '..', 'data', 'data_dist_time.csv')
data = pd.read_csv(data_path, dtype=dtypes, index_col=0)

In [5]:
data['started_at'] = pd.to_datetime(data['started_at'])
data['ended_at'] = pd.to_datetime(data['ended_at'])
# Convert ride duration from seconds to minutes
data['time'] = data['time'].div(60)

In [6]:
data.info()

In [7]:
data['year'] = data['started_at'].dt.year
data['month'] = data['started_at'].dt.month
data['hour'] = data['started_at'].dt.hour

data.year = data.year.astype('category')
data.month = data.month.astype('category')
data.hour = data.hour.astype('category')

In [8]:
data.head()

### Further Cleaning

In [9]:
data[['time', 'distance']].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

We can remove some outliers, particularly rides where the time is negative or absurdly high. The maximum time is 58,720 minutes, which is over 40 days! Let's filter out the top and bottom 1% of the data in terms of time.

In [10]:
lower = np.percentile(data['time'], 1)
upper = np.percentile(data['time'], 99)
cleaned = data[data.time.between(lower, upper)]
cleaned[['time', 'distance']].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

Looks much better. However, the maximum distance of 6,100 miles is also a bit fishy, especially considering the maximum time is now just over 2 hours. That's nearly 3,000 mph, more than five times faster than a commercial plane. Impressive, but probably not realistic. Chicago is about 25 miles across at its greatest extent, so let's also filter out data points where the distance is 25 miles or more.

In [11]:
cleaned = cleaned[cleaned.distance < 25]
cleaned[['time', 'distance']].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))
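
The speed argument above can also be checked directly. The following is a minimal sketch on toy data, not part of the original cleaning: it computes an implied average speed from `time` (minutes) and `distance` (miles) and flags rides above a threshold. The `mph` column, the `MAX_PLAUSIBLE_MPH` name, and the 30 mph value are all assumptions made for illustration.

```python
import pandas as pd

# Toy rides; 'time' is in minutes and 'distance' in miles, as in the notebook.
rides = pd.DataFrame({
    'time': [12.0, 30.0, 125.0],
    'distance': [1.5, 4.0, 6100.0],
})

# Implied average speed in miles per hour.
rides['mph'] = rides['distance'] / (rides['time'] / 60)

# Hypothetical threshold: 30 mph is an assumed upper bound for a bike ride.
MAX_PLAUSIBLE_MPH = 30
implausible = rides[rides['mph'] > MAX_PLAUSIBLE_MPH]
print(implausible)
```

On the real data, rides with a time of zero would produce an infinite speed, so they would need to be handled before applying a filter like this.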

In [12]:
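# Fold the 'docked_bike' label into 'classic_bike' in the rideable_type column
cleaned2 = cleaned.assign(rideable_type=cleaned['rideable_type'].replace('docked_bike', 'classic_bike'))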

In [13]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.kdeplot(data=cleaned2, x='time', ax=axes[0])
sns.kdeplot(data=cleaned2, x='distance', ax=axes[1])

### Analysis

I will perform the following analyses:

<u> Casual vs member analysis </u>
1. Usage time
    - Average time
2. Usage (by hour of day)
3. Usage (by month of year)
4. Distance ridden
    - Average distance
5. Popular stations (starting and ending)
6. Bike type
7. Trip count

<u> Inter-year analysis (zoom in on 2021 vs 2022 quarters) </u>
1. Usage time
    - Casual vs member
    - Bike type
2. Popular stations
3. Distance ridden
    - Casual vs member
    - Bike type
4. Trip count

<u> Inter-station analysis </u>
1. Casual vs member
2. Peak hours
3. Peak months
4. Trip count

Much of the analysis will concentrate on time rather than distance, as the distance travelled can be somewhat misleading. Riders may pick up and dock a bike at the same station, and the distance metric does not capture this behaviour accurately. For the same reason, I did not impose a lower bound on distance when cleaning the dataset in the previous section.
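
The same-station concern can be illustrated with a small sketch on toy data. It assumes that `distance` is measured between the start and end points of a ride, so a round trip records a distance of zero; the `round_trip` column and the station labels are hypothetical.

```python
import pandas as pd

# Toy rides with hypothetical station labels.
rides = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B', 'C'],
    'end_station_name': ['A', 'B', 'B', 'A'],
    'time': [35.0, 8.0, 50.0, 12.0],
    'distance': [0.0, 1.1, 0.0, 2.3],
})

# A round trip starts and ends at the same station, so its recorded
# distance is zero however far the rider actually went.
rides['round_trip'] = rides['start_station_name'] == rides['end_station_name']

# Share of rides that are round trips, and mean time/distance for each kind.
round_trip_share = rides['round_trip'].mean()
summary = rides.groupby('round_trip')[['time', 'distance']].mean()
print(round_trip_share)
print(summary)
```

Rides with a missing station name on either end compare as unequal here, so they are not counted as round trips.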

#### Casual vs member analysis

In [14]:
sns.countplot(data=cleaned2, y='year', hue='member_casual')

In every year of the data, there have been more member rides than casual rides, which is largely unsurprising.

In [15]:
cleaned2.groupby('member_casual')['time'].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

In [16]:
cleaned2.groupby('member_casual')['distance'].describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

In [17]:
sns.kdeplot(data=cleaned2, x='time', hue='member_casual', fill=True)

On average, casual riders ride 10 minutes longer than members, averaging 22 minutes per ride compared with the members' average of 12 minutes. This is apparent in the plot as well, with casual riders taking a greater proportion of rides over 20 minutes. However, this extra time is not a result of riding further: the average distances ridden by member and casual riders differ by a mere 0.04 miles.
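
One way to combine the two observations is an overall pace in minutes per mile for each rider type. This is a rough sketch on toy numbers rather than the notebook's results; the `pace` name is an assumption, and totals are divided rather than per-ride values so that zero-distance rides do not cause a division by zero.

```python
import pandas as pd

# Toy rides; 'time' in minutes, 'distance' in miles.
rides = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'member', 'member'],
    'time': [20.0, 30.0, 10.0, 14.0],
    'distance': [1.0, 1.5, 1.0, 1.0],
})

# Total minutes divided by total miles gives an overall pace per rider type.
totals = rides.groupby('member_casual')[['time', 'distance']].sum()
pace = totals['time'] / totals['distance']  # minutes per mile
print(pace)
```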

Let's look at the times of day and times of the year where member and casual riders take their rides:

In [18]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(data=cleaned2[cleaned2['year'].isin([2021, 2022])], y='hour', hue='member_casual', ax=axes[0])
sns.countplot(data=cleaned2, y='month', hue='member_casual', ax=axes[1])

Casual riders tend to be more active in the early hours of the morning and the late hours of the evening. During more typical waking hours (4 am to 9 pm), a greater proportion of the rides are from member riders. Because member riders make up a larger proportion of rides overall, it is unsurprising that member rides outnumber casual rides for most months of the year. However, it is interesting to note that in July, casual rides exceed member rides. This is likely because, as the weather becomes warmer in the summer months, nonmembers are more inclined to bike around the city for transit, touring, and leisure.
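
Count plots show volumes, so proportions have to be read off by eye. As a minimal sketch on toy data, the share of rides by rider type within each hour can be computed with `pd.crosstab` and `normalize='index'`; the `share_by_hour` name is an assumption.

```python
import pandas as pd

# Toy rides: start hour and rider type.
rides = pd.DataFrame({
    'hour': [2, 2, 2, 8, 8, 8, 8, 17, 17, 17],
    'member_casual': ['casual', 'casual', 'member',
                      'member', 'member', 'member', 'casual',
                      'member', 'member', 'casual'],
})

# normalize='index' turns each row (hour) into proportions that sum to 1.
share_by_hour = pd.crosstab(rides['hour'], rides['member_casual'], normalize='index')
print(share_by_hour)
```

The same call with `rides['month']` in place of `rides['hour']` would give the monthly split.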

Let's also compare the differences between stations for casual and member riders:

In [19]:
member_starting_count = cleaned2[cleaned2['member_casual']=='member']['start_station_name'].value_counts()
member_starting_count.sort_values(ascending=False)

In [20]:
casual_starting_count = cleaned2[cleaned2['member_casual']=='casual']['start_station_name'].value_counts()
casual_starting_count.sort_values(ascending=False)

In [21]:
station_counts = cleaned2['start_station_name'].value_counts()
top_10_stations = cleaned2[cleaned2['start_station_name'].isin(station_counts.head(10).index)]
sns.countplot(data=top_10_stations, y='start_station_name', hue='member_casual')

In [22]:
station_coords = cleaned2.drop_duplicates(subset=['start_station_name'], keep='first')[['start_station_name', 'start_lat', 'start_lng']].set_index('start_station_name')
station_coords.head()

In [23]:
member_starting_count = member_starting_count.to_frame().join(station_coords)
casual_starting_count = casual_starting_count.to_frame().join(station_coords)

In [24]:
# px.scatter_geo(member_starting_count, lat='start_lat', lon='start_lng', hover_name=member_starting_count.index, size='count',title='member starting stations')

Most casual riders begin their rides at stations near lakeside tourist attractions, whereas member riders typically begin their rides in the city. For instance, Streeter Dr & Grand Ave (near Navy Pier), Millennium Park (near, well, Millennium Park), and Michigan Ave & Oak St (near Oak Street Beach) are the most-visited start stations for casual riders. On the other hand, the top four most-visited stations for member riders are in the Near North Side, closer to commercial districts than tourist attractions.
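
The comparison of top start stations can be sketched as follows, using toy data with made-up station labels. The names `top_n`, `top_stations`, and `shared` are assumptions for illustration.

```python
import pandas as pd

# Toy rides with hypothetical station labels.
rides = pd.DataFrame({
    'member_casual': ['casual'] * 6 + ['member'] * 6,
    'start_station_name': ['Pier', 'Pier', 'Pier', 'Park', 'Park', 'Beach',
                           'Office', 'Office', 'Office', 'Park', 'Park', 'Pier'],
})

top_n = 2

# value_counts sorts by frequency, so head(top_n) gives the busiest stations
# for each rider type.
top_stations = {
    rider_type: group['start_station_name'].value_counts().head(top_n)
    for rider_type, group in rides.groupby('member_casual')
}

# Stations that appear in both top-N lists.
shared = set(top_stations['casual'].index) & set(top_stations['member'].index)
print(top_stations['casual'])
print(top_stations['member'])
print(shared)
```

A small overlap between the two lists would be consistent with the split between tourist-oriented and commercial stations described above.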

Let's do the same for end stations:

In [25]:
station_counts = cleaned2['end_station_name'].value_counts()
top_10_stations = cleaned2[cleaned2['end_station_name'].isin(station_counts.head(10).index)]
sns.countplot(data=top_10_stations, y='end_station_name', hue='member_casual')

The results are broadly similar to the starting station data. However, note that this station analysis may exclude some data from electric bikes, as electric bikes need not start or end at an actual dock. Let's compare classic and electric bike usage between casual and member riders for each year:

In [26]:
fig, axes = plt.subplots(1, 4, figsize=(12, 4), sharey=True)

member_casual_palette = {'member': '#D5573B', 'casual': '#074F57'}
base_year = 2020
for i in range(4):
    sns.countplot(data=cleaned2[cleaned2['year']==base_year+i].sort_values('member_casual', ascending=False), x='rideable_type', hue='member_casual', palette=member_casual_palette, order=['classic_bike','electric_bike'], ax=axes[i])
    axes[i].set_title(str(base_year+i))
    # axes[i].tick_params(axis='x', labelrotation=45)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].legend_.remove()

fig.text(0.5, -0.02, 'rideable_type', ha='center', fontsize=12)
fig.text(-0.02, 0.5, 'count', va='center', rotation='vertical', fontsize=12)
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper right', bbox_to_anchor=(0.983, 0.9))
fig.tight_layout()

There are some noteworthy trends here, particularly in the ratio of classic bike usage to electric bike usage. There is a large shift from classic to electric bikes between 2021 and 2022 for both member and casual riders. In fact, casual users took more e-bike rides than classic bike rides in 2022. In the 2023 data up to May, this shift toward e-bike usage appears to persist, with even member riders using electric bikes more often than classic bikes. Electric bikes were first introduced to the fleet at the end of July 2020, so it is interesting to observe how their popularity has changed over the years. I will further examine this trend in the Inter-year analysis below.
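
The shift can also be expressed as a share rather than read from bar heights. This is a minimal sketch on toy data; the `is_electric` column and the `ebike_share` name are assumptions, and the values are not the notebook's results.

```python
import pandas as pd

# Toy rides across two years.
rides = pd.DataFrame({
    'year': [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    'member_casual': ['casual', 'casual', 'member', 'member'] * 2,
    'rideable_type': ['classic_bike', 'electric_bike', 'classic_bike', 'classic_bike',
                      'electric_bike', 'electric_bike', 'classic_bike', 'electric_bike'],
})

# The mean of a boolean column is the fraction of True values, i.e. the
# share of rides taken on electric bikes.
rides['is_electric'] = rides['rideable_type'] == 'electric_bike'
ebike_share = rides.groupby(['year', 'member_casual'])['is_electric'].mean().unstack()
print(ebike_share)
```

If `year` is stored as a categorical, as it is earlier in this notebook, passing `observed=True` to `groupby` avoids rows for unused categories.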

#### Inter-year analysis

Divvy announced a change in its pricing structure in May 2022. Most notably, electric bikes, which were once free of charge to members for 45 minutes within certain zones, are now only free to unlock and cost 15 cents for each subsequent minute. Additionally, a $1 fee is assessed for docking the e-bike outside of a station. Given these changes, it would be interesting to see whether members' behaviours have shifted, which we will examine in this section.
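
To get a feel for the new member pricing, here is a rough sketch that uses only the two figures quoted above. It assumes the 15-cent charge applies to every minute of the ride and ignores any other conditions of the actual pricing scheme; the function name and parameters are hypothetical.

```python
# Figures taken from the text above, in cents to avoid floating-point rounding.
PER_MINUTE_CENTS = 15
OUT_OF_STATION_FEE_CENTS = 100


def member_ebike_cost_cents(minutes: int, parked_outside_station: bool = False) -> int:
    """Estimate a member's e-bike ride cost in cents (simplified assumption)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Unlocking is free; each minute is billed at the per-minute rate.
    cost = minutes * PER_MINUTE_CENTS
    # Flat fee when the ride does not end at a station.
    if parked_outside_station:
        cost += OUT_OF_STATION_FEE_CENTS
    return cost


print(member_ebike_cost_cents(20))
print(member_ebike_cost_cents(20, parked_outside_station=True))
```

Under these assumptions, a 20-minute ride that was previously free within the covered zones would cost $3.00, or $4.00 if it ends away from a station.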

In [28]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(data=cleaned2[cleaned2['year']==2021], y='month', hue='member_casual', ax=axes[0])
sns.countplot(data=cleaned2[cleaned2['year']==2022], y='month', hue='member_casual', ax=axes[1])
axes[0].set_title('2021')
axes[1].set_title('2022')