# User Behavior Analysis for Divvy Bike Share System from September to November 2023 in Chicago

# Introduction

Divvy is the leading bike-sharing system in the Chicagoland area, serving both local residents and tourists in Chicago and Evanston. According to the City of Chicago, Divvy, a convenient, enjoyable, and cost-effective mode of transportation, reached a new milestone in 2022 by surpassing 5.6 million bike trips, an increase of over 60 percent compared to 2019. Divvy is expected to break this record again in 2023.
Divvy offers an annual membership priced at USD 10.91 per month. The membership includes unlimited 45-minute rides on classic bikes, a 60% discount on ebikes for faster travel, and 5 free unlocks for visitors. For casual riders, a classic bike costs USD 1 for the first 30 minutes, followed by USD 0.17 per additional minute. For electric bikes, a casual rider is charged USD 1 to unlock the bike and then USD 0.42 per minute, whereas members pay a reduced rate of USD 0.17 per minute. Compared to classic bikes, electric bikes cost more because of the pedal-assist motor, which reduces the effort of riding.
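
To make the casual-rider pricing above concrete, here is a minimal sketch that turns the quoted rates into per-ride costs. The function names are hypothetical, and billing partial minutes as whole minutes is an assumption of this sketch, not something the pricing description states.

```python
import math

# Rates quoted in the text above (USD).
UNLOCK_FEE = 1.00
CLASSIC_INCLUDED_MINUTES = 30
CLASSIC_EXTRA_RATE = 0.17
EBIKE_CASUAL_RATE = 0.42


def casual_classic_cost(minutes: float) -> float:
    """Casual rider, classic bike: the 1.00 fee covers the first 30 minutes."""
    # Assumption: partial minutes are billed as whole minutes.
    extra_minutes = max(0, math.ceil(minutes) - CLASSIC_INCLUDED_MINUTES)
    return round(UNLOCK_FEE + CLASSIC_EXTRA_RATE * extra_minutes, 2)


def casual_ebike_cost(minutes: float) -> float:
    """Casual rider, electric bike: unlock fee plus a per-minute rate."""
    return round(UNLOCK_FEE + EBIKE_CASUAL_RATE * math.ceil(minutes), 2)


for minutes in (20, 45):
    print(minutes, casual_classic_cost(minutes), casual_ebike_cost(minutes))
```

Under these assumptions, the gap between the two bike types grows with ride length, since the electric bike is billed per minute from the start.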

The objective of this study is to analyze user behavior, focusing on the differences between casual riders and member riders from September to November 2023, in order to understand how the behaviors, preferences, and usage patterns of the two groups differ. To that end, the study addresses a series of business questions covering total rides, average ride durations, and preferred bicycle types for each group. The analysis is descriptive, with Python as the primary tool for data manipulation and visualization. The findings highlight existing trends and may offer insights for improving user experience and boosting engagement.

# Data processing

First, I import the Python packages needed for this project.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import datetime

Next, I load the raw monthly datasets obtained from Divvy.

In [2]:
# Load datasets for September, October, and November
data_sept = pd.read_csv(r"C:\Users\mkkhanh\Documents\Downloads\202309-divvy-tripdata.csv")
data_oct = pd.read_csv(r"C:\Users\mkkhanh\Documents\Downloads\202310-divvy-tripdata.csv")
data_nov = pd.read_csv(r"C:\Users\mkkhanh\Documents\Downloads\202311-divvy-tripdata.csv")

Before combining the data, I examine the three dataframes to check whether their columns match, to see the data type of each column, and to identify any missing values.

In [12]:
data_sept.info()
data_oct.info()
data_nov.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 666371 entries, 0 to 666370
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             666371 non-null  object 
 1   rideable_type       666371 non-null  object 
 2   started_at          666371 non-null  object 
 3   ended_at            666371 non-null  object 
 4   start_station_name  565059 non-null  object 
 5   start_station_id    565059 non-null  object 
 6   end_station_name    559080 non-null  object 
 7   end_station_id      559080 non-null  object 
 8   start_lat           666371 non-null  float64
 9   start_lng           666371 non-null  float64
 10  end_lat             665533 non-null  float64
 11  end_lng             665533 non-null  float64
 12  member_casual       666371 non-null  object 
dtypes: float64(4), object(9)
memory usage: 66.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537113 entries, 0 to 537112
Data col
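
The column comparison and missing-value check described above can also be done programmatically rather than by reading the `info()` output. The following is a minimal sketch on toy stand-ins for the three monthly dataframes; the values and the station names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three monthly DataFrames (hypothetical values).
frames = {
    "sept": pd.DataFrame({"ride_id": ["a1"], "start_station_name": [None]}),
    "oct": pd.DataFrame({"ride_id": ["b1"], "start_station_name": ["Station A"]}),
    "nov": pd.DataFrame({"ride_id": ["c1"], "start_station_name": ["Station B"]}),
}

# Check that every frame has the same columns in the same order.
reference_columns = list(frames["sept"].columns)
columns_match = all(list(df.columns) == reference_columns for df in frames.values())

# Count missing values per column, one column of the result per month.
null_counts = pd.DataFrame({name: df.isna().sum() for name, df in frames.items()})

print(columns_match)
print(null_counts)
```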

Upon examining the data, it is apparent that all the column names are identical. However, there are null values in the start station and end station columns. Additionally, the start time and end time columns are stored as object type, so I will convert them to datetime format to facilitate comparison and manipulation of dates and times. While reviewing the values in each column, I also found a station named "OH - BONFIRE - TESTING", which likely corresponds to a testing dock station. I will therefore first merge the three dataframes and then remove erroneous data as follows:
- Remove any records that have the station name "OH - BONFIRE - TESTING".
- Remove any records that have a null value for the station name.
- Remove any records where the start time is not earlier than the end time.

In [4]:
# Combine datasets
data = pd.concat([data_sept, data_oct, data_nov], ignore_index=True)

# Convert start time and end time to datetime
data['start_time'] = pd.to_datetime(data['started_at'])
data['end_time'] = pd.to_datetime(data['ended_at'])

# Drop records with null station names, then keep only rides that start strictly before they end
data = data.dropna(subset=['start_station_name', 'end_station_name'])
data = data[data['start_time'] < data['end_time']]

# Add new columns for ride duration and day of the week
data['ride_duration'] = data['end_time'] - data['start_time']
data['duration_in_minutes'] = data['ride_duration'].dt.total_seconds() / 60
data['day_of_week'] = data['start_time'].dt.day_name()

# Remove test data (rides that start at the test station 'OH - BONFIRE - TESTING')
data = data[data['start_station_name'] != 'OH - BONFIRE - TESTING']
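
As a possible complement to the cleaning cell above, the sketch below counts how many rows each filter removes, which makes it easier to see which rule is responsible for most of the dropped data. It runs on a hypothetical four-row toy frame; "Station A" and "Station B" are made-up names, and the filters are applied in the same order as above.

```python
import pandas as pd

# Hypothetical toy rides; column names follow the Divvy export used above.
toy = pd.DataFrame({
    "started_at": ["2023-09-01 08:00:00", "2023-09-01 09:00:00",
                   "2023-09-01 10:00:00", "2023-09-01 11:00:00"],
    "ended_at": ["2023-09-01 08:15:00", "2023-09-01 08:50:00",
                 "2023-09-01 10:20:00", "2023-09-01 11:05:00"],
    "start_station_name": ["Station A", "Station B", None, "OH - BONFIRE - TESTING"],
    "end_station_name": ["Station B", "Station A", "Station A", "Station A"],
})
toy["start_time"] = pd.to_datetime(toy["started_at"])
toy["end_time"] = pd.to_datetime(toy["ended_at"])

# Each step is a (label, filter function) pair.
steps = [
    ("null station name",
     lambda df: df.dropna(subset=["start_station_name", "end_station_name"])),
    ("start not before end",
     lambda df: df[df["start_time"] < df["end_time"]]),
    ("test station",
     lambda df: df[df["start_station_name"] != "OH - BONFIRE - TESTING"]),
]

removed = {}
cleaned = toy
for label, step in steps:
    rows_before = len(cleaned)
    cleaned = step(cleaned)
    removed[label] = rows_before - len(cleaned)

print(removed)
print(len(cleaned))
```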

# Exploratory Data Analysis

1. Total rides between casual and member riders

In [32]:
import plotly.express as px

# Assuming 'data' is your DataFrame and 'member_casual' is the column of interest
# Count rides by user type (member vs casual)
ride_counts = data['member_casual'].value_counts()

# Specify the exact colors
colors = ['#0056b3', '#ff7f0e']

# Plot the total rides with specific colors
fig = px.pie(values=ride_counts.values, names=ride_counts.index, 
             title='Total Rides Between Casual and Member Riders',
             color_discrete_sequence=colors)

# Update traces to show only percentage inside the pie slices
fig.update_traces(textposition='inside', textinfo='percent')

fig.show()

From September 2023 to November 2023, there were a total of 1,185,173 rides. Members accounted for about two-thirds of them, with 785,837 rides, while casual users made 399,336 trips. Evidently, members as a group ride considerably more than non-registered users.
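
The shares shown in the pie chart can be reproduced directly from the two ride counts reported above. This is a small arithmetic sketch that uses only those two numbers.

```python
import pandas as pd

# Ride counts reported in the text above.
ride_counts = pd.Series({"member": 785_837, "casual": 399_336})

total_rides = int(ride_counts.sum())
share_pct = (ride_counts / total_rides * 100).round(1)

print(total_rides)
print(share_pct)
```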

2. Total rides by start time between casual and member riders

In [33]:
import pandas as pd
import plotly.express as px

# Assuming data is loaded into DataFrame 'data'
# Example DataFrame columns: 'start_time', 'member_casual'

# Ensure the 'start_time' column is in datetime format
data['start_time'] = pd.to_datetime(data['start_time'])

# Extract the hour from the start time
data['start_hour'] = data['start_time'].dt.hour

# Group by member type (casual or member) and start hour
rides_by_hour = data.groupby(['member_casual', 'start_hour']).size().reset_index(name='ride_count')

# Specify the exact colors for the lines
colors = {
    'casual': '#ff7f0e',  # Vibrant orange
    'member': '#0056b3'   # Darker blue
}

# Plot rides by start time for casual and member riders
fig = px.line(rides_by_hour, x='start_hour', y='ride_count', color='member_casual', 
              color_discrete_map=colors,  # Using the custom color map
              title='Total Rides by Start Time')
fig.update_layout(
    xaxis_title="Hour of the Day",
    yaxis_title="Total Rides",
    legend_title="User Type"
)
fig.show()


The analysis of total rides by start time reveals distinct patterns for member and casual riders. Both groups show minimal bike usage from late night to early morning, particularly between 3:00 and 4:00, with approximately 1,000 rides per group.

For member riders, usage rises sharply from 5:00 and peaks at nearly 55,000 rides at 8:00. It then declines gradually until 10:00 and increases steadily again, reaching its highest point at 17:00 with about 85,000 rides. Casual riders show a different trend. Their highest peak is also at 17:00, with about 38,000 rides, but the pattern lacks the morning fluctuation seen among members. Instead, usage rises steadily until 17:00 and then decreases gradually towards midnight.
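
Peak hours like the ones described above can be read off the grouped table instead of the chart. The sketch below uses hypothetical hourly counts shaped like `rides_by_hour`; the helper name `peak_hour` is made up for illustration.

```python
import pandas as pd

# Hypothetical hourly counts shaped like `rides_by_hour` above.
rides_by_hour = pd.DataFrame({
    "member_casual": ["member"] * 4 + ["casual"] * 4,
    "start_hour": [7, 8, 12, 17, 7, 8, 12, 17],
    "ride_count": [300, 550, 400, 850, 60, 120, 250, 380],
})


def peak_hour(df: pd.DataFrame) -> pd.Series:
    """Return the start hour with the most rides for each user type."""
    peak_rows = df.loc[df.groupby("member_casual")["ride_count"].idxmax()]
    return peak_rows.set_index("member_casual")["start_hour"]


overall_peak = peak_hour(rides_by_hour)
# Restrict to hours before noon to find the morning peak.
morning_peak = peak_hour(rides_by_hour[rides_by_hour["start_hour"] < 12])

print(overall_peak)
print(morning_peak)
```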

3. Total Rides by Start Time Between Casual and Member Riders in each month

In [34]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Ensure the 'start_time' column is in datetime format
data['start_time'] = pd.to_datetime(data['started_at'])

# Extract the month and hour from 'start_time'
data['month'] = data['start_time'].dt.month
data['start_hour'] = data['start_time'].dt.hour

# Define colors for the lines based on user type
colors = {
    'casual': '#ff7f0e',  # Vibrant orange for casual
    'member': '#0056b3'   # Darker blue for member
}

# Create subplots: 1 row, 3 columns
fig = make_subplots(rows=1, cols=3, subplot_titles=("September", "October", "November"))

# Function to update subplots for each month
def add_rides_by_hour_subplot(month, month_name, col_num):
    # Filter the data for the specific month
    month_data = data[data['month'] == month]
    
    # Group by user type (member_casual) and start hour
    rides_by_hour = month_data.groupby(['member_casual', 'start_hour']).size().reset_index(name='ride_count')
    
    # Create a line plot for this month
    for user_type in rides_by_hour['member_casual'].unique():
        subset = rides_by_hour[rides_by_hour['member_casual'] == user_type]
        fig.add_trace(
            go.Scatter(x=subset['start_hour'], y=subset['ride_count'], mode='lines',
                       name=f'{user_type} - {month_name}', line=dict(color=colors[user_type])),
            row=1, col=col_num
        )

# Add plots for each month to the subplots
add_rides_by_hour_subplot(9, 'September', 1)
add_rides_by_hour_subplot(10, 'October', 2)
add_rides_by_hour_subplot(11, 'November', 3)

# Update layout
fig.update_layout(title_text="Total Rides by Start Time Between Casual and Member Riders (Sep, Oct, Nov)",
                  height=600, width=1200, showlegend=True)

# Show the plot
fig.show()

The monthly comparison suggests the following:
- Member riders show a marked increase in usage in the late afternoon, in contrast to the steadier pattern observed among casual riders throughout the day.
- The overall shape of the hourly trend remains consistent over the three-month period, indicating stable riding patterns.
- In November, there is a noticeable increase in demand at 16:00, which calls for further investigation into the factors that may be causing it.
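
Because the months differ in total ride volume, one way to compare hourly patterns such as the 16:00 increase is to express each hour as a share of its month's rides. The following sketch uses hypothetical counts; the column name `share_of_month` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical counts: rides per (month, start_hour).
counts = pd.DataFrame({
    "month": [9, 9, 9, 11, 11, 11],
    "start_hour": [8, 16, 17, 8, 16, 17],
    "ride_count": [200, 300, 500, 100, 200, 200],
})

# Divide each hourly count by the total for its month.
month_totals = counts.groupby("month")["ride_count"].transform("sum")
counts["share_of_month"] = counts["ride_count"] / month_totals

# One row per hour, one column per month.
shares = counts.pivot(index="start_hour", columns="month", values="share_of_month")
print(shares.round(2))
```

With shares instead of raw counts, a shift of demand towards 16:00 stands out even when the month has fewer rides overall.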

4. Total casual and member rides by day of the week

In [35]:
import plotly.express as px
import pandas as pd

# Ensure 'start_time' is in datetime format and add 'day_of_week'
data['start_time'] = pd.to_datetime(data['started_at'])
data['day_of_week'] = data['start_time'].dt.day_name()

# Group by day of the week and user type (member_casual), then calculate ride counts
rides_by_day = data.groupby(['day_of_week', 'member_casual']).size().reset_index(name='Total Rides')

# Reorder the days of the week to match typical calendar representation
category_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
rides_by_day['day_of_week'] = pd.Categorical(rides_by_day['day_of_week'], categories=category_order, ordered=True)

# Define custom colors for casual and member
color_map = {
    'casual': '#ff7f0e',  # Vibrant orange for casual
    'member': '#0056b3'   # Darker blue for member
}

# Plotting the graph
fig = px.bar(rides_by_day, x='day_of_week', y='Total Rides', color='member_casual',
             barmode='group', title='Total Casual and Member Rides by Day of the Week',
             text='Total Rides', text_auto=True,
             color_discrete_map=color_map)  # Apply custom colors

fig.update_layout(xaxis_title="Day of the Week",
                  yaxis_title="Total Rides",
                  legend_title="User Type",
                  xaxis={'categoryorder':'array', 'categoryarray': category_order})
fig.show()

Member riders ride more on weekdays than on weekends, and on weekdays they take almost three times as many rides as casual riders. In contrast, casual riders prefer weekends, taking nearly twice as many rides on weekends as on weekdays. This contrast highlights the different usage patterns of the two groups on weekdays and weekends.
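
Since a week has five weekdays but only two weekend days, comparing the two categories is easier with average rides per calendar day than with raw totals. This is a minimal sketch on hypothetical rides; the `day_kind` and `date` columns are introduced here for illustration.

```python
import pandas as pd

# Hypothetical rides: one row per ride, with its start timestamp.
rides = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2023-09-04 08:00", "2023-09-04 17:30",                      # Monday
        "2023-09-05 08:10",                                          # Tuesday
        "2023-09-09 13:00", "2023-09-09 14:00", "2023-09-09 15:00",  # Saturday
    ]),
})
rides["date"] = rides["start_time"].dt.date
# dayofweek is 0 for Monday through 6 for Sunday.
rides["day_kind"] = rides["start_time"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Rides on each calendar date, then the mean over dates within each kind.
rides_per_date = rides.groupby(["day_kind", "date"]).size()
avg_rides_per_day = rides_per_date.groupby(level="day_kind").mean()
print(avg_rides_per_day)
```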

5. Average Ride Duration by day of the week between casual and member riders

In [37]:
import plotly.express as px
import pandas as pd

# Ensure 'start_time' and 'end_time' are in datetime format
data['start_time'] = pd.to_datetime(data['started_at'])
data['end_time'] = pd.to_datetime(data['ended_at'])

# Calculate ride duration in minutes
data['ride_duration_minutes'] = (data['end_time'] - data['start_time']).dt.total_seconds() / 60

# Add 'day_of_week'
data['day_of_week'] = data['start_time'].dt.day_name()

# Group by day of the week and user type (member_casual), calculate average ride duration
avg_ride_duration = data.groupby(['day_of_week', 'member_casual'])['ride_duration_minutes'].mean().reset_index()

# Reorder the days of the week to match typical calendar representation
category_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
avg_ride_duration['day_of_week'] = pd.Categorical(avg_ride_duration['day_of_week'], categories=category_order, ordered=True)

# Define custom colors for the plot
color_discrete_map={'casual': '#ff7f0e', 'member': '#0056b3'}  # Orange for casual, blue for member

# Update the plotting line to include custom colors
fig = px.bar(avg_ride_duration, x='day_of_week', y='ride_duration_minutes', color='member_casual',
             barmode='group', title='Average Ride Duration by Day of the Week between Casual and Member Riders',
             text='ride_duration_minutes', text_auto=True, color_discrete_map=color_discrete_map)

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(xaxis_title="Day of the Week",
                  yaxis_title="Average Ride Duration (minutes)",
                  legend_title="User Type",
                  xaxis={'categoryorder':'array', 'categoryarray': category_order},
                  yaxis=dict(range=[0, 30]))  # Adjust the y-axis range if necessary

fig.show()

The plot shows a consistent average ride length for member riders on weekdays, with a slight increase on weekends. The average ride length for members is considerably lower than that of casual riders on every day. Casual riders record their highest average ride lengths on Saturdays and Sundays, at 24.86 and 26.12 minutes respectively, and their averages vary more across the week. However, relying solely on mean values can be misleading.

6. Mean vs Median ride duration by day of the week between casual and member riders

In [38]:
import pandas as pd
import plotly.express as px

# Assume data has been loaded into DataFrame 'data'
# Ensure 'start_time' and 'end_time' are in datetime format
data['start_time'] = pd.to_datetime(data['started_at'])
data['end_time'] = pd.to_datetime(data['ended_at'])

# Calculate ride duration in minutes
data['ride_duration_minutes'] = (data['end_time'] - data['start_time']).dt.total_seconds() / 60

# Add 'day_of_week'
data['day_of_week'] = data['start_time'].dt.day_name()

# Order the days starting from Monday
days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
data['day_of_week'] = pd.Categorical(data['day_of_week'], categories=days_order, ordered=True)

# Group by day of the week and user type (member_casual), calculate average and median ride duration
summary_stats = data.groupby(['day_of_week', 'member_casual'])['ride_duration_minutes'].agg(['mean', 'median']).reset_index()

# Create separate dataframes for average and median for easier plotting
avg_data = summary_stats[['day_of_week', 'member_casual', 'mean']]
median_data = summary_stats[['day_of_week', 'member_casual', 'median']]

# Define custom colors
color_discrete_map = {'casual': '#ff7f0e', 'member': '#0056b3'}  # Orange for casual, blue for member

# Plotting Average Ride Duration
fig_avg = px.bar(avg_data, x='day_of_week', y='mean', color='member_casual', barmode='group',
                 title='Average Ride Duration by Day of the Week', labels={'mean': 'Average Duration (minutes)'},
                 text='mean', color_discrete_map=color_discrete_map)
fig_avg.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig_avg.update_layout(xaxis_title='Day of the Week', yaxis_title='Average Duration (minutes)', legend_title='User Type')
fig_avg.show()

# Plotting Median Ride Duration
fig_median = px.bar(median_data, x='day_of_week', y='median', color='member_casual', barmode='group',
                    title='Median Ride Duration by Day of the Week', labels={'median': 'Median Duration (minutes)'},
                    text='median', color_discrete_map=color_discrete_map)
fig_median.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig_median.update_layout(xaxis_title='Day of the Week', yaxis_title='Median Duration (minutes)', legend_title='User Type')
fig_median.show()

The difference between the mean and median ride durations is substantial. The mean is strongly influenced by outliers (very long trips), particularly for casual riders. Most casual rides are therefore not twice as long as member rides, contrary to what the mean alone suggests, and the majority of rides are shorter than the average ride duration.
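
The effect of outliers on the mean can be illustrated with a handful of hypothetical ride durations in minutes, where a single very long trip pulls the mean far above the median.

```python
import pandas as pd

# Hypothetical durations in minutes; the last ride is an outlier.
durations = pd.Series([8, 10, 12, 9, 11, 600], name="ride_duration_minutes")

mean_duration = durations.mean()
median_duration = durations.median()
# Fraction of rides that are shorter than the mean.
share_below_mean = (durations < mean_duration).mean()

print(round(mean_duration, 2), median_duration, round(share_below_mean, 2))
```

In this toy example, five of the six rides are shorter than the mean, which is why the median is the more representative summary for skewed durations.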

7. Stations frequencies between casual and member riders

In [41]:
import pandas as pd
import plotly.express as px

# Assuming data is loaded into DataFrame 'data'
# Example DataFrame columns: 'start_station_name', 'end_station_name', 'member_casual'

# Filter for casual riders
casual_data = data[data['member_casual'] == 'casual']

# Calculate top 5 start stations for casual riders and sort in ascending order
top_start_stations_casual = casual_data['start_station_name'].value_counts().nlargest(5).reset_index()
top_start_stations_casual.columns = ['Station', 'Total Rides']
top_start_stations_casual = top_start_stations_casual.sort_values(by='Total Rides', ascending=True)

# Calculate top 5 end stations for casual riders and sort in ascending order
top_end_stations_casual = casual_data['end_station_name'].value_counts().nlargest(5).reset_index()
top_end_stations_casual.columns = ['Station', 'Total Rides']
top_end_stations_casual = top_end_stations_casual.sort_values(by='Total Rides', ascending=True)

# Define custom color for casual riders
custom_color = ['#ff7f0e']  # Vibrant orange color

# Plotting for casual riders - Top Start Stations
fig_casual_start = px.bar(top_start_stations_casual, y='Station', x='Total Rides', orientation='h',
                          title='Top 5 Start Stations Frequencies for Casual Riders', text='Total Rides',
                          color_discrete_sequence=custom_color)
fig_casual_start.update_layout(xaxis_title="Total Rides", yaxis_title="Station")
fig_casual_start.update_traces(texttemplate='%{x}', textposition='outside')
fig_casual_start.show()

# Plotting for casual riders - Top End Stations
fig_casual_end = px.bar(top_end_stations_casual, y='Station', x='Total Rides', orientation='h',
                        title='Top 5 End Stations Frequencies for Casual Riders', text='Total Rides',
                        color_discrete_sequence=custom_color)
fig_casual_end.update_layout(xaxis_title="Total Rides", yaxis_title="Station")
fig_casual_end.update_traces(texttemplate='%{x}', textposition='outside')
fig_casual_end.show()

The analysis reveals that among 1,281 stations, Streeter Dr & Grand Ave stands out as the most popular station for both pick-ups and returns among casual riders. It recorded far more rentals and returns than the other stations, with 10,319 rides starting and 11,418 ending there. These figures are roughly double the usage observed at each of the bottom three of the top five stations. Research into these stations indicates their proximity to popular attractions and recreational sites like Navy Pier, Millennium Park, Shedd Aquarium, and North Avenue Beach, suggesting a tourist-centric preference among casual riders for these locations.

Member riders, however, show a different station usage pattern. Ride counts fall off only slightly across the members' top five stations, and those stations differ starkly from the ones favored by casual riders.

In [43]:
import pandas as pd
import plotly.express as px

# Assuming data is loaded into DataFrame 'data'
# Example DataFrame columns: 'start_station_name', 'end_station_name', 'member_casual'

# Filter for member riders
member_data = data[data['member_casual'] == 'member']

# Calculate top 5 start stations for member riders and sort in ascending order
top_start_stations_member = member_data['start_station_name'].value_counts().nlargest(5).reset_index()
top_start_stations_member.columns = ['Station', 'Total Rides']
top_start_stations_member = top_start_stations_member.sort_values(by='Total Rides', ascending=True)

# Calculate top 5 end stations for member riders and sort in ascending order
top_end_stations_member = member_data['end_station_name'].value_counts().nlargest(5).reset_index()
top_end_stations_member.columns = ['Station', 'Total Rides']
top_end_stations_member = top_end_stations_member.sort_values(by='Total Rides', ascending=True)

# Define custom color for member riders
custom_color = ['#0056b3']  # Darker blue color for member riders

# Plotting for member riders - Top Start Stations
fig_member_start = px.bar(top_start_stations_member, y='Station', x='Total Rides', orientation='h',
                          title='Top 5 Start Stations Frequencies for Member Riders', text='Total Rides',
                          color_discrete_sequence=custom_color)
fig_member_start.update_layout(xaxis_title="Total Rides", yaxis_title="Station")
fig_member_start.update_traces(texttemplate='%{x}', textposition='outside')
fig_member_start.show()

# Plotting for member riders - Top End Stations
fig_member_end = px.bar(top_end_stations_member, y='Station', x='Total Rides', orientation='h',
                        title='Top 5 End Stations Frequencies for Member Riders', text='Total Rides',
                        color_discrete_sequence=custom_color)
fig_member_end.update_layout(xaxis_title="Total Rides", yaxis_title="Station")
fig_member_end.update_traces(texttemplate='%{x}', textposition='outside')
fig_member_end.show()

Further investigation into these stations reveals their proximity to residential and commercial areas. This suggests a usage pattern more in line with the needs of residents commuting to nearby workplaces, local businesses, or educational institutions.
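
The casual and member cells above repeat the same steps for start and end stations. One possible way to reduce the duplication is a small helper, sketched below with the hypothetical name `top_stations` and toy data; apart from Streeter Dr & Grand Ave, the station names are made up.

```python
import pandas as pd


def top_stations(df: pd.DataFrame, rider_type: str, column: str, n: int = 5) -> pd.DataFrame:
    """Top-n stations in `column` for one rider type, sorted ascending for a horizontal bar plot."""
    subset = df[df["member_casual"] == rider_type]
    top = subset[column].value_counts().nlargest(n)
    return (
        top.rename_axis("Station")
        .reset_index(name="Total Rides")
        .sort_values("Total Rides", ascending=True)
    )


# Hypothetical toy rides.
rides = pd.DataFrame({
    "member_casual": ["casual", "casual", "casual", "member", "member"],
    "start_station_name": ["Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                           "Station X", "Station Y", "Station Y"],
})

result = top_stations(rides, "casual", "start_station_name", n=2)
print(result)
```

The returned frame has the same `Station` and `Total Rides` columns that the plotting calls above expect.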

8. Total rides by bike type between casual and member riders

In [45]:
import pandas as pd
import plotly.express as px

# Assuming data is loaded into DataFrame 'data'
# Example DataFrame columns: 'rideable_type', 'member_casual'

# Count the total rides by bike type and user type
ride_counts = data.groupby(['rideable_type', 'member_casual']).size().reset_index(name='Total Rides')

# Define custom colors
color_discrete_map = {'casual': '#ff7f0e', 'member': '#0056b3'}  # Orange for casual, blue for member

# Plotting
fig = px.bar(ride_counts, x='rideable_type', y='Total Rides', color='member_casual', barmode='group',
             title='Total Rides by Bike Type between Casual and Member Riders',
             text='Total Rides', color_discrete_map=color_discrete_map)
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(xaxis_title="Rideable Type",
                  yaxis_title="Total Rides",
                  legend_title="User Type",
                  xaxis_tickangle=-45)
fig.show()

Members use classic bikes far more than casual riders do, with 533,382 rides compared to 256,463, more than double. Both groups use electric bikes notably less than classic bikes, with 252,455 electric rides for members and 142,873 for casual riders.
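
Raw counts favor members simply because they take more rides overall, so it can also help to look at the bike-type split within each user type. The sketch below uses the four counts reported above; the `rideable_type` labels `classic_bike` and `electric_bike` are assumed here, since the actual label values are not shown in this notebook.

```python
import pandas as pd

# Counts reported in the text above; the rideable_type labels are assumed.
ride_counts = pd.DataFrame({
    "rideable_type": ["classic_bike", "classic_bike", "electric_bike", "electric_bike"],
    "member_casual": ["member", "casual", "member", "casual"],
    "Total Rides": [533_382, 256_463, 252_455, 142_873],
})

# One row per user type, one column per bike type.
table = ride_counts.pivot(index="member_casual", columns="rideable_type", values="Total Rides")

# Share of each bike type within a user type, in percent.
share_pct = table.div(table.sum(axis=1), axis=0).mul(100).round(1)
print(share_pct)
```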

# Reference

- Bike share in the Chicago area | Divvy. (n.d.). https://account.divvybikes.com/access-plans
- Divvy Data. (n.d.). https://divvybikes.com/system-data
- Jaggia, S., Kelly, A., Lertwachara, K., & Chen, L. (2022, January 1). ISE Business Analytics. http://books.google.ie/books?id=Y1TUzwEACAAJ&dq=%5BLertwachara,+Kevin/Chen,+Leida/Jaggia,+Sanjiv.+Business+Analytics+ISE%5D&hl=&cd=1&source=gbs_api
- New Divvy Stations and Bikes Coming to Chicago as Part of Continued Bikeshare Expansion. (n.d.). https://www.chicago.gov/city/en/depts/cdot/provdrs/bike/news/2023/october/new-divvystations-and-bikes-coming-to-chicago-as-part-ofcontin.html#:~:text=The%20Divvy%20system%20hit%20a,devices%20(scooters%20and%20bikes)
- Single Ride | Divvy Bikes. (n.d.). https://divvybikes.com/pricing/single-ride