# **Milestone** | Capital Bikeshare Ride Analysis

<div style="text-align: center;">
<img src="https://cdn.lyft.com/static/bikesharefe/logo/CapitalBikeshare-main.svg" alt="Capital Bikeshare Logo" width="320"/>
</div>


## Introduction
In this Milestone, you'll take on the role of a junior data analyst at Capital Bikeshare, the public bicycle-sharing system in Washington, D.C. Your job is to help city planners understand how people used the system in 2024. The city wants to make data-driven decisions to improve bike availability, reduce maintenance downtime, and better serve high-demand areas.

Your manager has asked you to analyze ride data to identify patterns in usage volume, trip distances, and which stations are most frequently used. This information will inform where to allocate bikes, prioritize maintenance resources, and promote underused locations.

You will use the `2024_capitol_bikeshare.csv` dataset to complete your analysis. Each row represents one completed bike ride.

To start, import the pandas library, so that you can load the data into a DataFrame.

In [2]:
# import the pandas library
import pandas as pd
# load the data into a dataframe called bike_rides
bike_rides = pd.read_csv('datasets/2024_capitol_bikeshare.csv')
bike_rides.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,trip_duration_min,start_station_name,start_station_id,end_station_name,end_station_id,member_type
0,4E026D43FD09E59C,classic_bike,2024-03-11 17:46:29,2024-03-11 18:03:07,17.0,22nd & H St NW,31127.0,15th & W St NW,31125.0,member
1,AB210B914033D41B,classic_bike,2024-03-17 19:31:24,2024-03-17 19:39:34,8.0,Crystal Dr & 15th St S,31003.0,Pentagon City Metro / 12th St & S Hayes St,31005.0,member
2,3B328C72BC05FDAB,classic_bike,2024-03-07 14:32:34,2024-03-07 15:19:45,47.0,Crystal Dr & 15th St S,31003.0,Crystal Dr & 15th St S,31003.0,casual
3,A2FD150593E11106,classic_bike,2024-03-29 18:44:08,2024-03-29 18:49:59,6.0,Columbia Rd & Belmont St NW,31113.0,Massachusetts Ave & Dupont Circle NW,31200.0,member
4,4E18243CAADD3542,classic_bike,2024-03-24 11:18:00,2024-03-24 11:24:28,6.0,Columbia Rd & Belmont St NW,31113.0,Massachusetts Ave & Dupont Circle NW,31200.0,member


## Task 1: Exploring The Data

This task will help you build a foundational understanding of the dataset: What kind of data are you working with? How much of it is there? Are there any potential issues, such as missing values? Before any analysis, it’s important to get familiar with the structure so you can make informed decisions later.

In [5]:
# preview the first 10 rows of the dataset
bike_rides.head(10)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,trip_duration_min,start_station_name,start_station_id,end_station_name,end_station_id,member_type
0,4E026D43FD09E59C,classic_bike,2024-03-11 17:46:29,2024-03-11 18:03:07,17.0,22nd & H St NW,31127.0,15th & W St NW,31125.0,member
1,AB210B914033D41B,classic_bike,2024-03-17 19:31:24,2024-03-17 19:39:34,8.0,Crystal Dr & 15th St S,31003.0,Pentagon City Metro / 12th St & S Hayes St,31005.0,member
2,3B328C72BC05FDAB,classic_bike,2024-03-07 14:32:34,2024-03-07 15:19:45,47.0,Crystal Dr & 15th St S,31003.0,Crystal Dr & 15th St S,31003.0,casual
3,A2FD150593E11106,classic_bike,2024-03-29 18:44:08,2024-03-29 18:49:59,6.0,Columbia Rd & Belmont St NW,31113.0,Massachusetts Ave & Dupont Circle NW,31200.0,member
4,4E18243CAADD3542,classic_bike,2024-03-24 11:18:00,2024-03-24 11:24:28,6.0,Columbia Rd & Belmont St NW,31113.0,Massachusetts Ave & Dupont Circle NW,31200.0,member
5,610AAB1B374BFEB6,classic_bike,2024-03-13 11:04:09,2024-03-13 11:40:47,37.0,Long Bridge Aquatic Center,31950.0,17th & K St NW,31213.0,casual
6,D51D47977A0903CE,classic_bike,2024-03-22 13:21:09,2024-03-22 13:40:02,19.0,22nd & H St NW,31127.0,Wisconsin Ave & O St NW,31312.0,member
7,6A6D04EA0D480F85,electric_bike,2024-03-14 12:55:53,2024-03-14 13:39:56,44.0,Columbia Pike & S Walter Reed Dr,31067.0,N. Beauregard St. & Berkley St.,31928.0,member
8,D94ACBEFABC62625,classic_bike,2024-03-06 16:46:36,2024-03-06 16:52:14,6.0,15th & M St NW,31298.0,23rd & M St NW,31128.0,member
9,5962545649EED75E,electric_bike,2024-03-31 10:35:10,2024-03-31 10:50:02,15.0,5th & K St NW,31600.0,14th & Irving St NW,31124.0,member


In [6]:
# How many rows and columns are in the data?
bike_rides.shape

(862444, 10)

In [8]:
# What kinds of data are in each column? Are there any missing values?
bike_rides.info()
bike_rides.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 862444 entries, 0 to 862443
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             862444 non-null  object 
 1   rideable_type       862444 non-null  object 
 2   started_at          862444 non-null  object 
 3   ended_at            862444 non-null  object 
 4   trip_duration_min   862444 non-null  float64
 5   start_station_name  736408 non-null  object 
 6   start_station_id    736408 non-null  float64
 7   end_station_name    731170 non-null  object 
 8   end_station_id      731004 non-null  float64
 9   member_type         862444 non-null  object 
dtypes: float64(3), object(7)
memory usage: 65.8+ MB


ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
trip_duration_min          0
start_station_name    126036
start_station_id      126036
end_station_name      131274
end_station_id        131440
member_type                0
dtype: int64

Four columns have missing values: `start_station_name`, `start_station_id`, `end_station_name`, and `end_station_id`. What might be some reasons for that? How could those missing values affect your analysis?


<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> I noticed the following columns in my dataset on bikeshare rides contain null values: <i>[list the columns]</i> Why might those values be missing? How should I decide whether to ignore them, fill them in, or drop those rows? What factors should I consider before taking action?
  </span>
</div>
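One way to start answering the question yourself is to check whether the missing station names cluster in one kind of ride. The sketch below assumes a hypothesis that the dataset does not confirm: that some bike types end up without a recorded station more often than others. It runs on a tiny made-up sample (`sample` and its values are illustrative only); swap in `bike_rides` to run it on the real data.

```python
import pandas as pd

# Tiny made-up sample standing in for bike_rides (values are illustrative only)
sample = pd.DataFrame({
    "rideable_type": ["classic_bike", "electric_bike", "electric_bike", "classic_bike"],
    "start_station_name": ["22nd & H St NW", None, None, "15th & M St NW"],
})

# Flag rides with no start station, then average the flag within each bike type.
# The mean of a True/False column is the share of True values.
missing_share = (
    sample["start_station_name"].isnull()
    .groupby(sample["rideable_type"])
    .mean()
)
print(missing_share)
```

If the shares differ sharply between groups, the values are probably not missing at random, which matters when deciding whether to drop those rows.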


In [9]:
# What is the average trip duration in minutes?
bike_rides['trip_duration_min'].mean()

18.283749437644648

## Task 2: Station Usage Analysis

This task explores how riders interact with the bike network — where trips are starting and ending, and which stations are most or least popular. Understanding station usage helps identify hotspots, gaps in service, and opportunities to optimize bike and dock placement.

How many unique starting stations are there in the data? Print the answer to the screen.

In [10]:
# Number of unique starting stations
bike_rides['start_station_name'].nunique()

771

What are the five most common stations where rides begin?

<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
<strong>HINT: </strong> After using <span style="font-family: monospace; color: #222;">.value_counts()</span> to rank the stations, you can add <span style="font-family: monospace; color: #222;">.head(5)</span> at the end to show only the top five most common ones!
</span>
</div>


In [11]:
# Top 5 most common starting stations
bike_rides['start_station_name'].value_counts().head(5)

Columbus Circle / Union Station    8996
New Hampshire Ave & T St NW        7792
Lincoln Memorial                   6490
Jefferson Memorial                 6476
15th & P St NW                     6468
Name: start_station_name, dtype: int64

What are the five most common ride destinations?

In [12]:
# Top 5 most common ending stations
bike_rides['end_station_name'].value_counts().head(5)

Columbus Circle / Union Station    8960
New Hampshire Ave & T St NW        7542
15th & P St NW                     6494
Jefferson Memorial                 6444
Jefferson Dr & 14th St SW          6426
Name: end_station_name, dtype: int64
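The two rankings above overlap but are not identical, which hints that some stations receive more bikes than they send out. A possible follow-up, sketched here on a made-up three-ride sample rather than the real data, is to subtract departures from arrivals for each station. The name `net_flow` is introduced only for this illustration.

```python
import pandas as pd

# Made-up sample of three rides (illustrative only); replace with bike_rides
sample = pd.DataFrame({
    "start_station_name": ["Lincoln Memorial", "Lincoln Memorial", "15th & P St NW"],
    "end_station_name": ["15th & P St NW", "15th & P St NW", "Lincoln Memorial"],
})

departures = sample["start_station_name"].value_counts()
arrivals = sample["end_station_name"].value_counts()

# Arrivals minus departures; fill_value=0 covers stations that appear on only one side
net_flow = arrivals.sub(departures, fill_value=0).sort_values()
print(net_flow)
```

A negative value means the station loses bikes over time and may need restocking; a positive value means bikes accumulate there.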

## Task 3: Member Type Analysis

The `member_type` column indicates whether the rider was a "member" (Annual Member, 30-Day Member or Day Key Member) or a "casual" rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass). Knowing how much each group contributes to overall ridership can inform service improvements, membership incentives, and marketing strategies.

How many rides were taken by "members" versus "casual" riders?
<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
    <strong>HINT: </strong> Check the <strong>count</strong> of each <strong>value</strong> in the <span style="font-family: monospace; color: #222;">member_type</span> column!
</span>
</div>

In [13]:
# Count of values in member_type
bike_rides['member_type'].value_counts()

member    549216
casual    313228
Name: member_type, dtype: int64
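Raw counts are easier to compare as shares of the total. As a small sketch, the snippet below rebuilds the two counts from the output above and divides by their sum. On the full DataFrame, `bike_rides['member_type'].value_counts(normalize=True)` should give the same shares in one step.

```python
import pandas as pd

# Ride counts copied from the value_counts() output above
counts = pd.Series({"member": 549216, "casual": 313228})

# Each group's share of all rides
shares = counts / counts.sum()
print(shares.round(3))
```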

Did members or casual riders take more rides in March and April in Washington, D.C.? Members took more rides: 549,216 compared with 313,228 for casual riders.
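The question above takes for granted that the data covers March and April, but `started_at` is stored as text (dtype `object`), so the months cannot be read off directly. A minimal way to check, assuming the timestamps all follow the format shown in the preview, is to parse the column and count rides per month. The sample below is illustrative, and its April timestamp is made up.

```python
import pandas as pd

# Illustrative sample; the April timestamp is made up for the example
sample = pd.DataFrame({
    "started_at": ["2024-03-11 17:46:29", "2024-03-31 10:35:10", "2024-04-02 08:15:00"],
})

# Parse the text timestamps, then count rides per calendar month
started = pd.to_datetime(sample["started_at"])
rides_per_month = started.dt.month.value_counts().sort_index()
print(rides_per_month)
```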


Find the longest and shortest rides in the entire dataset. Print both to the screen. Think about what these extreme values might tell you about the overall trip behavior of both members and casual riders.


In [22]:
# min trip_duration
print(f'Shortest ride duration is {bike_rides["trip_duration_min"].min()}')
# max trip_duration
print(f'Longest ride duration is {bike_rides["trip_duration_min"].max()}')



Shortest ride duration is 1.0
Longest ride duration is 1560.0


What is the median trip duration in minutes, across all users? Print to the screen.

In [23]:
# median trip_duration
print(f'Median trip duration is {bike_rides["trip_duration_min"].median()} minutes')

Median trip duration is 10.0 minutes


In Task 1, you found the mean trip duration. Why might the median be more useful than the mean in this case?

<div style="border: 3px solid #30EE99; background-color: #f0fff4; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
  <span style="font-size: 10pt;">
    <strong>Try This AI Prompt:</strong> I found that the median trip duration was XX minutes, but the mean was YY minutes. Why might the median be a more reliable metric in this context?
  </span>
</div>

I found that the median trip duration was 10.0 minutes, but the mean was 18.28 minutes. The mean is pulled up by unusually long rides (like the max of 1,560 minutes), which are likely outliers. The median better reflects the "typical" trip duration for most users because it isn't affected by these extreme values. In this context, the median provides a more accurate picture of central tendency for the majority of riders.
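The effect is easy to reproduce on a handful of numbers. The durations below are an illustrative sample, not a real slice of the dataset: five ordinary trips plus one trip at the observed maximum of 1,560 minutes.

```python
import pandas as pd

# Five ordinary trip durations plus one extreme value (illustrative sample)
durations = pd.Series([6.0, 8.0, 10.0, 12.0, 17.0, 1560.0])

# The single long ride drags the mean far above any typical trip
print(f"Mean with outlier:   {durations.mean():.2f}")
print(f"Median with outlier: {durations.median():.2f}")

# Dropping the extreme value barely moves the median but collapses the mean
trimmed = durations[durations < 1560.0]
print(f"Mean without outlier:   {trimmed.mean():.2f}")
print(f"Median without outlier: {trimmed.median():.2f}")
```

One ride out of six is enough to move the mean from about 10.6 to about 268.8 minutes, while the median only shifts from 10 to 11.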

## Task 4: Identifying Underused Resources

Finally, you're looking for stations with low engagement to uncover inefficiencies in the system. Spotting underused stations helps inform marketing strategies, relocation plans, or targeted service improvements to boost ridership in overlooked areas.



Identify the **top 25** stations with the *fewest* ride departures.

In [28]:
# The 25 least-used starting stations
least_used_stations = bike_rides['start_station_name'].value_counts().tail(25)
least_used_stations




Monroe St & Monroe Pl                        10
Division Ave & Foote St NE                    8
Fair Woods Pkwy & Fairfax Blvd                8
GMU/Rappahannock River Ln                     8
Lake Newport Rd and Autumn Ridge Cir          8
Ridge Rd Community Center                     8
The Shoppes @ Burnt Mills                     8
Fairfax Village                               8
Key West Ave & Diamondback Dr                 6
37th & Ely Pl SE                              6
Ridge Rd & Southern Ave SE                    6
United Medical Center                         6
White House                                   6
Green Range Dr and Glade Dr                   6
Becontree Ln & Goldenrain Ct                  6
Medical Center Dr & Key West Ave              4
GMU/Patriot Cir & York Dr                     4
Key West Ave & Great Seneca Hwy               4
GMU/Horizon Hall & Harris Theater             4
New Hampshire & Lockwood                      4
Joliet St & MLK Ave SW/Bald Eagle Rec Ct

What proportion of all rides started from these 25 least-used stations?

<div style="border: 3px solid #b67ae5; background-color: #f9f1ff; padding: 15px; border-radius: 8px; color: #222; display: flex; align-items: center;">
<span style="font-size: 10pt;">
<strong>HINT: </strong> There are several ways to go about this one! Here's one way: first find the total number of rides in the dataset using <span style="font-family: monospace; color: #222;">bike_rides.shape</span> from Task 1. Then, sum the number of rides that started from the 25 least-used stations and divide that by the total number of rides. You could also use ChatGPT for help here!
    </span>
</div>



In [29]:
# proportion of rides from the least-used stations
total_rides = bike_rides.shape[0]
unpopular_ride_count = least_used_stations.sum()
proportion_unpopular = unpopular_ride_count / total_rides
print(f'Proportion of all rides from the 25 least-used stations: {proportion_unpopular}')


Proportion of all rides from the 25 least-used stations: 0.0001669673625186099
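The result depends on the denominator. Dividing by `bike_rides.shape[0]` counts every ride, including the 126,036 with no recorded start station, whereas `value_counts(normalize=True)` skips missing values and divides by the 736,408 rides with a known station. The sketch below shows the difference on a made-up sample with hypothetical station names `A`, `B`, and `C`, using only the single least-used station.

```python
import pandas as pd

# Made-up sample: 6 rides with a known start station, 2 without (illustrative only)
sample = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C", None, None],
})

# value_counts() skips missing values, so these shares are relative to the 6 known rides
shares = sample["start_station_name"].value_counts(normalize=True)
share_known = shares.tail(1).sum()

# Dividing by the row count keeps rides with no start station in the denominator
counts = sample["start_station_name"].value_counts()
share_all = counts.tail(1).sum() / len(sample)

print(f"Share of rides with a known station: {share_known:.4f}")
print(f"Share of all rides:                  {share_all:.4f}")
```

Either choice is defensible as long as it is stated; the answer above uses all rides as the denominator.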


Based on your findings, do you think low-usage stations are underperforming due to location or lack of awareness? What would you recommend Capital Bikeshare do to increase usage in those areas? Don't forget you can use ChatGPT as a teammate here in crafting your recommendation!


In [None]:
The 25 least-used starting stations make up less than 0.02% of all rides, indicating extremely low engagement. These stations may be underperforming due to inconvenient locations, such as being far from transit hubs or areas with high residential density, or they may suffer from lack of visibility and limited promotion. I recommend that Capital Bikeshare audit the physical condition and signage at these locations, explore partnerships with nearby businesses or community centers to raise awareness, and consider relocating or consolidating stations that show persistently low usage, especially when operational costs outweigh potential benefits. Additionally, analyzing nearby infrastructure such as bike lanes, public transit access, and tourist attractions could help identify which underused stations have room for growth.