# Cyclistic Bike-Share Analysis - Phase 4: ANALYZE

## Exploratory Data Analysis

**My Objective:** Identify concrete differences in ride volume, ride duration, timing, and bike choice between annual members and casual riders.

**Research Question:** How do annual members and casual riders use Cyclistic bikes differently?

**Dataset:** 4,859,019 rides from January - December 2024

---

## My Analysis Plan

I will explore the data in the following order:
1. Load cleaned data
2. Overall usage statistics
3. Member vs Casual comparison
4. Temporal patterns (hourly, daily, monthly)
5. Ride duration analysis
6. Bike type preferences
7. Geographic patterns (if applicable)
8. Key findings summary

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Turn off warning messages
warnings.filterwarnings('ignore')

# Set visualization style for better-looking charts
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Configure pandas to display all columns
pd.set_option('display.max_columns', None)

print("Libraries loaded successfully")

Libraries loaded successfully


In [4]:
# Load the cleaned data from the previous step
df = pd.read_csv('../data/processed/cleaned_2024_data.csv')


In [5]:
# Convert datetime columns to proper datetime format
df['started_at'] = pd.to_datetime(df['started_at'], format='mixed')
df['ended_at'] = pd.to_datetime(df['ended_at'], format='mixed')

# Confirm data loaded successfully
print("Dataset loaded successfully")
print("-" * 60)

# Show basic information
total_rides = len(df)
min_date = df['started_at'].min()
max_date = df['started_at'].max()
total_columns = df.shape[1]

print(f"Total Rides: {total_rides:,}")
print(f"Date range: {min_date} to {max_date}")
print(f"Columns: {total_columns}")

Dataset loaded successfully
------------------------------------------------------------
Total Rides: 4,859,019
Date range: 2024-01-01 00:00:39 to 2024-12-31 23:54:37.045000
Columns: 17
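
The cells below rely on columns such as `ride_length`, `month`, `day_of_week` and `hour` that were created in the previous phase. As a reminder of what they contain, here is a minimal sketch of how they could be derived from the two timestamps. It assumes `ride_length` is measured in minutes and `day_of_week` uses Monday = 0, which matches how the columns are used later in this notebook. The helper name `add_time_features` is hypothetical, and the actual cleaning step may differ.

```python
import pandas as pd

def add_time_features(rides: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `rides` with the derived columns used in this notebook."""
    out = rides.copy()
    out['started_at'] = pd.to_datetime(out['started_at'])
    out['ended_at'] = pd.to_datetime(out['ended_at'])
    # Ride length in minutes (assumption: same unit as the cleaned dataset)
    out['ride_length'] = (out['ended_at'] - out['started_at']).dt.total_seconds() / 60
    out['month'] = out['started_at'].dt.month            # 1-12
    out['day_of_week'] = out['started_at'].dt.dayofweek  # Monday = 0, Sunday = 6
    out['hour'] = out['started_at'].dt.hour              # 0-23
    return out

# Toy example: a single ride on a Saturday afternoon
sample = pd.DataFrame({
    'started_at': ['2024-07-06 17:05:00'],
    'ended_at': ['2024-07-06 17:17:30'],
})
features = add_time_features(sample)
print(features[['ride_length', 'month', 'day_of_week', 'hour']])
```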


---

## Section 1: Overall Dataset Summary

I'll start by examining high-level statistics to understand the overall bike-share usage in 2024.

In [6]:
# Calculate basic statistics
print("Overall Statistics:")
print("=" * 60)

# Total number of rides
total_rides = len(df)
print(f"Total number of rides: {total_rides:,}")

# Average ride duration
avg_duration = df['ride_length'].mean()
print(f"Average ride duration: {avg_duration:.2f} minutes")

# Median ride duration
median_duration = df['ride_length'].median()
print(f"Median ride duration: {median_duration:.2f} minutes")

# Total ride time in hours
total_time_hours = df['ride_length'].sum() / 60
print(f"Total ride time: {total_time_hours:,.2f} hours")

# Member type distribution
print("\nMember Type Distribution")
print("-" * 60)

# Count each member type
member_counts = df['member_casual'].value_counts()

# Calculate percentages
member_percentages = df['member_casual'].value_counts(normalize=True) * 100

# Display results for each member type
for member_type in member_counts.index:
    count = member_counts[member_type]
    percentage = member_percentages[member_type]
    print(f"{member_type.capitalize()} : {count:,} ({percentage:.2f}%) ")

Overall Statistics:
Total number of rides: 4,859,019
Average ride duration: 9.68 minutes
Median ride duration: 8.52 minutes
Total ride time: 783,987.93 hours

Member Type Distribution
------------------------------------------------------------
Member : 3,274,659 (67.39%) 
Casual : 1,584,360 (32.61%) 


In [7]:
# Rideable type distribution
print("\nRideable Type Distribution:")
print("-" * 60)

# Count each bike type
rideable_counts = df['rideable_type'].value_counts()

# Calculate percentages
rideable_percentages = df['rideable_type'].value_counts(normalize=True) * 100

# Display results for each bike type
for bike_type in rideable_counts.index:
    count = rideable_counts[bike_type]
    percentage = rideable_percentages[bike_type]
    print(f"{bike_type}: {count:,} ({percentage:.2f}%)")


Rideable Type Distribution:
------------------------------------------------------------
electric_bike: 2,567,490 (52.84%)
classic_bike: 2,164,265 (44.54%)
electric_scooter: 127,264 (2.62%)
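
The member-type and bike-type cells repeat the same count-plus-percentage pattern. A small helper could produce both columns in one table. This is only a sketch on toy data, and the function name `distribution_table` is an assumption, not part of the original notebook.

```python
import pandas as pd

def distribution_table(series: pd.Series) -> pd.DataFrame:
    """Counts and percentage share for each distinct value in `series`."""
    counts = series.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({'rides': counts, 'percent': percent})

# Toy data standing in for df['member_casual'] or df['rideable_type']
toy = pd.Series(['member', 'member', 'casual', 'member'])
table = distribution_table(toy)
print(table)
```

With the real data this would be called as `distribution_table(df['rideable_type'])`.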


---

## Section 2: Member vs Casual Rider Comparison

Now I'll compare key metrics between annual members and casual riders to identify differences in their behavior.

In [8]:
# Compare ride duration by member type
print("Ride Duration Comparison:")
print("=" * 60)

# Group by member type and calculate statistics
duration_stats = df.groupby('member_casual')['ride_length'].agg([
    ('count', 'count'),
    ('mean', 'mean'),
    ('median', 'median'),
    ('standard_deviation', 'std'),
    ('min', 'min'),
    ('max', 'max')
]).round(2)

print(duration_stats)

Ride Duration Comparison:
                 count   mean  median  standard_deviation  min   max
member_casual                                                       
casual         1584360  10.61    9.60                5.67  1.0  24.0
member         3274659   9.23    8.01                5.40  1.0  24.0
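
The list-of-tuples form passed to `agg` above works for a single column. pandas also supports named aggregation, where each output column is given as a keyword argument, which some readers find easier to scan. The sketch below shows the same statistics on a toy frame. The values are made up for illustration.

```python
import pandas as pd

# Toy rides standing in for the full dataset
toy = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'member', 'member', 'member'],
    'ride_length': [12.0, 8.0, 5.0, 7.0, 9.0],
})

# Named aggregation: output_column_name='aggregation function'
duration_stats = (
    toy.groupby('member_casual')['ride_length']
       .agg(count='count', mean='mean', median='median',
            standard_deviation='std', min='min', max='max')
       .round(2)
)
print(duration_stats)
```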


In [9]:
# Display key insights
print("\nKey Insights:")
print("-" * 60)

# Calculate averages for each group
casual_avg = df[df['member_casual'] == 'casual']['ride_length'].mean()
member_avg = df[df['member_casual'] == 'member']['ride_length'].mean()

print(f"- Casual riders take longer rides on average ({casual_avg:.2f} min vs {member_avg:.2f} min)")

# Calculate percentages
member_pct = (len(df[df['member_casual'] == 'member']) / len(df)) * 100
casual_pct = (len(df[df['member_casual'] == 'casual']) / len(df)) * 100

print(f"- Members account for {member_pct:.2f}% of total rides")
print(f"- Casual riders account for {casual_pct:.2f}% of total rides")
# Share of total ride time contributed by casual riders
casual_time_pct = (df[df['member_casual'] == 'casual']['ride_length'].sum() / df['ride_length'].sum()) * 100
print(f"- Despite fewer rides, casual riders account for {casual_time_pct:.2f}% of total ride time")


Key Insights:
------------------------------------------------------------
- Casual riders take longer rides on average (10.61 min vs 9.23 min)
- Members account for 67.39% of total rides
- Casual riders account for 32.61% of total rides
- Despite fewer rides, casual riders account for 35.72% of total ride time


---

## Section 3: Bike Type Preferences by Member Type

I'll analyze which types of bikes each group prefers to use.

In [10]:
# Analyze bike type preferences
print("Bike Type Usage by Member Type:")
print("=" * 60)

# Create a cross-tabulation
bike_type_comparison = pd.crosstab(df['member_casual'], df['rideable_type'])
print(bike_type_comparison)

Bike Type Usage by Member Type:
rideable_type    classic_bike  electric_bike  electric_scooter
member_casual                                                  
casual                 599925         849983            134452
member                1564340        1717507              -7188


In [11]:
# Calculate percentages
print("\nBike Type Preferences (Percentage):")
print("-" * 60)

# Calculate percentage for each member type
bike_type_pct = pd.crosstab(df['member_casual'], df['rideable_type'], normalize='index') * 100
bike_type_pct = bike_type_pct.round(2)
print(bike_type_pct)


Bike Type Preferences (Percentage):
------------------------------------------------------------
rideable_type    classic_bike  electric_bike  electric_scooter
member_casual                                                  
casual                  37.86          53.65              8.49
member                  47.77          52.45             -0.22
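
Counts in a cross-tabulation can never be negative, and each row should sum to that group's total number of rides. A small check like the following can catch a corrupted table before its percentages are interpreted. It is a sketch on toy data, and the helper name `check_crosstab` is hypothetical.

```python
import pandas as pd

def check_crosstab(counts: pd.DataFrame, group_totals: pd.Series) -> None:
    """Raise AssertionError if a count table is internally inconsistent."""
    # Every cell is a ride count, so it must be zero or positive
    assert (counts >= 0).all().all(), "counts must be non-negative"
    # Each row must add up to the number of rides in that group
    row_sums = counts.sum(axis=1)
    expected = group_totals.reindex(row_sums.index)
    assert (row_sums == expected).all(), "row totals do not match group sizes"

# Toy data standing in for df[['member_casual', 'rideable_type']]
toy = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'member', 'member', 'member'],
    'rideable_type': ['classic_bike', 'electric_bike', 'classic_bike',
                      'electric_bike', 'electric_bike'],
})
counts = pd.crosstab(toy['member_casual'], toy['rideable_type'])
check_crosstab(counts, toy['member_casual'].value_counts())
print("crosstab is consistent")
```

On the full dataset the call would be `check_crosstab(bike_type_comparison, df['member_casual'].value_counts())`.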


---

## Section 4: Temporal Patterns - Monthly Usage

I'll examine how usage varies throughout the year for both member types.

In [12]:
# Analyze monthly patterns
print("Monthly Ride Distribution:")
print("=" * 60)

# Group by month and member type
monthly_rides = df.groupby(['month', 'member_casual'])['ride_id'].count()
print(monthly_rides)

Monthly Ride Distribution:
month  member_casual
1      casual            31626
       member            79296
2      casual            35817
       member           111176
3      casual            75177
       member           169867
4      casual           146308
       member           235028
5      casual           229074
       member           329098
6      casual           269387
       member           364644
7      casual           287988
       member           386003
8      casual           286063
       member           390336
9      casual           217555
       member           353451
10     casual           165326
       member           343116
11     casual            81758
       member           230969
12     casual            58281
       member           281675
Name: ride_id, dtype: int64


In [13]:
# Identify peak months
print("\nPeak Months:")
print("-" * 60)

# Find peak month for casual riders
casual_monthly = df[df['member_casual'] == 'casual'].groupby('month')['ride_id'].count()
casual_peak_month = casual_monthly.idxmax()

# Find peak month for members
member_monthly = df[df['member_casual'] == 'member'].groupby('month')['ride_id'].count()
member_peak_month = member_monthly.idxmax()

# Month names for display
month_names = ['', 'January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

print(f"Casual riders peak in: {month_names[casual_peak_month]}")
print(f"Members peak in: {month_names[member_peak_month]}")

print("\nSeasonal Patterns:")
print("- Both groups show higher usage in summer months (May-September)")
print("- Usage drops significantly in winter months (November-February)")
print("- Casual riders show more dramatic seasonal variation")


Peak Months:
------------------------------------------------------------
Casual riders peak in: July
Members peak in: August

Seasonal Patterns:
- Both groups show higher usage in summer months (May-September)
- Usage drops significantly in winter months (November-February)
- Casual riders show more dramatic seasonal variation
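
One way to put a number on "more dramatic seasonal variation" is the ratio of each group's busiest month to its quietest month. Using the monthly table above, casual riders go from 31,626 rides in January to 287,988 in July (about 9.1x), while members go from 79,296 to 390,336 (about 4.9x). The sketch below shows the calculation on toy counts. The helper name `peak_to_trough` is an assumption.

```python
import pandas as pd

def peak_to_trough(monthly_counts: pd.DataFrame) -> pd.Series:
    """Ratio of the busiest month to the quietest month, per rider type."""
    return (monthly_counts.max() / monthly_counts.min()).round(2)

# Toy monthly counts: rows are months, columns are rider types
toy = pd.DataFrame(
    {'casual': [10, 40, 90, 30], 'member': [50, 80, 100, 70]},
    index=pd.Index([1, 4, 7, 10], name='month'),
)
ratios = peak_to_trough(toy)
print(ratios)
```

With the real data, `monthly_rides.unstack('member_casual')` gives a frame in the shape this helper expects.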


---

## Section 4.1: Day of Week Analysis

I'll examine which days of the week are most popular for each member type.

In [14]:
# Create day name column for easier interpretation
print("Creating day names for better readability...")

# Map day numbers to day names
day_mapping = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}

# Apply mapping to create day_name column
df['day_name'] = df['day_of_week'].map(day_mapping)

print("Day names created successfully")

Creating day names for better readability...
Day names created successfully


In [15]:
# Analyze daily patterns
print("\nDaily Usage Patterns:")
print("=" * 60)

# Group by day name and member type
daily_rides = df.groupby(['day_name', 'member_casual'])['ride_id'].count()
print(daily_rides)


Daily Usage Patterns:
day_name   member_casual
Friday     casual           240879
           member           508577
Monday     casual           185773
           member           461866
Saturday   casual           305765
           member           420043
Sunday     casual           247650
           member           342918
Thursday   casual           227011
           member           545773
Tuesday    casual           182875
           member           483720
Wednesday  casual           194407
           member           511762
Name: ride_id, dtype: int64
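
Because `day_name` is a plain string column, `groupby` sorts it alphabetically (Friday, Monday, Saturday, ...). Converting it to an ordered categorical is one way to get calendar order instead. The snippet below is a sketch on toy data, and the constant name `DAY_ORDER` is an assumption.

```python
import pandas as pd

DAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Toy rides standing in for the full dataset
toy = pd.DataFrame({
    'day_name': ['Sunday', 'Monday', 'Friday', 'Monday'],
    'member_casual': ['casual', 'member', 'member', 'casual'],
    'ride_id': ['a', 'b', 'c', 'd'],
})

# An ordered categorical sorts by position in DAY_ORDER, not alphabetically
toy['day_name'] = pd.Categorical(toy['day_name'], categories=DAY_ORDER, ordered=True)

# observed=True keeps only the day/member combinations that actually occur
daily = toy.groupby(['day_name', 'member_casual'], observed=True)['ride_id'].count()
print(daily)
```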


In [16]:
# Analyze patterns
print("\nKey Findings:")
print("-" * 60)

# Find peak day for casual riders
casual_daily = df[df['member_casual'] == 'casual'].groupby('day_name')['ride_id'].count()
casual_peak_day = casual_daily.idxmax()
casual_peak_count = casual_daily.max()

# Find peak day for members
member_daily = df[df['member_casual'] == 'member'].groupby('day_name')['ride_id'].count()
member_peak_day = member_daily.idxmax()
member_peak_count = member_daily.max()

print("Casual riders:")
print(f"- Peak day: {casual_peak_day} with {casual_peak_count:,} rides")
print("- Weekend usage is highest")
print("- Pattern suggests recreational/leisure usage")

print("\nAnnual members:")
print(f"- Peak day: {member_peak_day} with {member_peak_count:,} rides")
print("- Weekday usage is highest")
print("- Pattern suggests commute/routine usage")


Key Findings:
------------------------------------------------------------
Casual riders:
- Peak day: Saturday with 305,765 rides
- Weekend usage is highest
- Pattern suggests recreational/leisure usage

Annual members:
- Peak day: Thursday with 545,773 rides
- Weekday usage is highest
- Pattern suggests commute/routine usage


---

## Section 4.2: Hourly Usage Patterns

I'll analyze what times of day each group uses bikes most frequently.

In [17]:
# Analyze hourly patterns
print("Hourly Usage Distribution:")
print("=" * 60)

# Get top hours for casual riders
casual_hourly = df[df['member_casual'] == 'casual'].groupby('hour')['ride_id'].count()
casual_top_hours = casual_hourly.sort_values(ascending=False).head(5)

print("Top 5 Hours for Casual Riders:")
for hour, count in casual_top_hours.items():
    print(f"Hour {hour}: {count:,} rides")

# Get top hours for members
member_hourly = df[df['member_casual'] == 'member'].groupby('hour')['ride_id'].count()
member_top_hours = member_hourly.sort_values(ascending=False).head(5)

print("\nTop 5 Hours for Members:")
for hour, count in member_top_hours.items():
    print(f"Hour {hour}: {count:,} rides")

Hourly Usage Distribution:
Top 5 Hours for Casual Riders:
Hour 17: 187,968 rides
Hour 16: 148,656 rides
Hour 18: 147,419 rides
Hour 15: 141,298 rides
Hour 14: 129,832 rides

Top 5 Hours for Members:
Hour 17: 415,296 rides
Hour 8: 309,577 rides
Hour 18: 305,799 rides
Hour 16: 281,882 rides
Hour 7: 242,018 rides


In [18]:
# Summarize insights
print("\nHourly Pattern Insights:")
print("-" * 60)

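casual_peak_hour = casual_hourly.idxmax()
member_peak_hour = member_hourly.idxmax()

# Convert a 24-hour value (0-23) to a 12-hour clock label, e.g. 17 -> "5:00 PM"
def format_hour_12(hour):
    suffix = 'AM' if hour < 12 else 'PM'
    return f"{hour % 12 or 12}:00 {suffix}"

print("Casual riders:")
print(f"- Peak hour: {format_hour_12(casual_peak_hour)} ({casual_peak_hour}:00)")
print("- Usage builds gradually throughout the day")
print("- No strong morning peak")

print("\nAnnual members:")
print(f"- Peak hour: {format_hour_12(member_peak_hour)} ({member_peak_hour}:00)")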
print("- Clear morning peak at 8:00 AM")
print("- Clear evening peak at 5:00-6:00 PM")
print("- Strong commute pattern visible")


Hourly Pattern Insights:
------------------------------------------------------------
Casual riders:
- Peak hour: 5:00 PM (17:00)
- Usage builds gradually throughout the day
- No strong morning peak

Annual members:
- Peak hour: 5:00 PM (17:00)
- Clear morning peak at 8:00 AM
- Clear evening peak at 5:00-6:00 PM
- Strong commute pattern visible
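
Raw hourly counts are hard to compare across groups because members take roughly twice as many rides as casual riders. Expressing each hour as a share of that group's own rides puts both profiles on the same scale, which makes a claim such as "no strong morning peak" easier to check. The following is a minimal sketch on toy data. Only the `hour` and `member_casual` column names come from the notebook.

```python
import pandas as pd

# Toy rides standing in for the full dataset
toy = pd.DataFrame({
    'hour': [8, 8, 17, 17, 17, 12, 17, 8],
    'member_casual': ['member', 'member', 'member', 'casual',
                      'casual', 'casual', 'member', 'casual'],
})

# normalize='columns' makes each rider-type column sum to 1; scale to percent
hourly_share = pd.crosstab(toy['hour'], toy['member_casual'], normalize='columns') * 100
print(hourly_share.round(1))
```

On the real data, comparing the 7:00 and 8:00 rows between the two columns would show how much of each group's riding falls in the morning commute window.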


---

## Section 5: Weekday vs Weekend Patterns

I'll compare behavior on weekdays versus weekends to understand different usage patterns.

In [20]:
# Create weekday/weekend classification
# Saturday (5) and Sunday (6) are weekends, all others are weekdays
df['day_type'] = np.where(df['day_of_week'].isin([5, 6]), 'Weekend', 'Weekday')

# Calculate summary statistics
print("Weekday vs Weekend Usage:")
print("=" * 60)

# Group by day type and member type
daytype_summary = df.groupby(['day_type', 'member_casual']).agg({
    'ride_id': 'count',
    'ride_length': 'mean'
}).round(2)

# Rename columns for clarity
daytype_summary.columns = ['Number_of_Rides', 'Avg_Duration_Minutes']
print(daytype_summary)

Weekday vs Weekend Usage:
                        Number_of_Rides  Avg_Duration_Minutes
day_type member_casual                                       
Weekday  casual                 1030945                 10.22
         member                 2511698                  9.11
Weekend  casual                  553415                 11.32
         member                  762961                  9.63


In [21]:
# Calculate percentage distribution
print("\nRide Distribution by Day Type:")
print("-" * 60)

# Count rides for each combination
daytype_counts = df.groupby(['member_casual', 'day_type']).size().unstack(fill_value=0)

# Calculate percentages
daytype_percentages = (daytype_counts.div(daytype_counts.sum(axis=1), axis=0) * 100).round(2)

print("\nCounts:")
print(daytype_counts)
print("\nPercentages:")
print(daytype_percentages)


Ride Distribution by Day Type:
------------------------------------------------------------

Counts:
day_type       Weekday  Weekend
member_casual                  
casual         1030945   553415
member         2511698   762961

Percentages:
day_type       Weekday  Weekend
member_casual                  
casual           65.07    34.93
member           76.70    23.30
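
The weekday share is naturally larger for both groups because there are five weekdays and only two weekend days. Dividing each count by the number of days in its category gives a fairer per-day comparison. The sketch below uses the counts printed above. The variable names are illustrative.

```python
# Ride counts taken from the "Counts" table above
counts = {
    'casual': {'Weekday': 1_030_945, 'Weekend': 553_415},
    'member': {'Weekday': 2_511_698, 'Weekend': 762_961},
}
# Number of days of the week in each category
DAYS = {'Weekday': 5, 'Weekend': 2}

ratios = {}
for rider, by_type in counts.items():
    # Average rides per day of the week within each category
    per_day = {day_type: n / DAYS[day_type] for day_type, n in by_type.items()}
    ratios[rider] = per_day['Weekend'] / per_day['Weekday']
    print(f"{rider}: weekend-day to weekday ratio = {ratios[rider]:.2f}")
```

A ratio above 1 means a typical weekend day is busier than a typical weekday for that group. On this basis casual riders come out at about 1.34 and members at about 0.76.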


---

## Section 6: Comprehensive Comparison Summary

I'll create a side-by-side comparison of all key metrics to summarize my findings.

In [22]:
# Create comprehensive comparison summary
print("COMPREHENSIVE MEMBER VS CASUAL COMPARISON")
print("=" * 80)

# Initialize data structure
summary_data = {
    'Metric': [
        'Total Rides',
        'Percentage of Total',
        'Avg Ride Duration (min)',
        'Median Ride Duration (min)',
        'Total Ride Time (hours)',
        'Weekday Rides (%)',
        'Weekend Rides (%)',
        'Peak Usage Day',
        'Peak Usage Hour',
        'Most Used Bike Type',
        'Summer Rides (May-Sep) %',
        'Winter Rides (Nov-Feb) %'
    ],
    'Casual Riders': [],
    'Annual Members': []
}

# Split data by member type
casual_data = df[df['member_casual'] == 'casual']
member_data = df[df['member_casual'] == 'member']

# 1. Total rides
casual_total = len(casual_data)
member_total = len(member_data)
summary_data['Casual Riders'].append(f"{casual_total:,}")
summary_data['Annual Members'].append(f"{member_total:,}")

# 2. Percentage of total
total_rides = len(df)
casual_pct = (casual_total / total_rides) * 100
member_pct = (member_total / total_rides) * 100
summary_data['Casual Riders'].append(f"{casual_pct:.2f}%")
summary_data['Annual Members'].append(f"{member_pct:.2f}%")

# 3. Average duration
casual_avg_duration = casual_data['ride_length'].mean()
member_avg_duration = member_data['ride_length'].mean()
summary_data['Casual Riders'].append(f"{casual_avg_duration:.2f}")
summary_data['Annual Members'].append(f"{member_avg_duration:.2f}")

# 4. Median duration
casual_median_duration = casual_data['ride_length'].median()
member_median_duration = member_data['ride_length'].median()
summary_data['Casual Riders'].append(f"{casual_median_duration:.2f}")
summary_data['Annual Members'].append(f"{member_median_duration:.2f}")

# 5. Total ride time
casual_total_time = casual_data['ride_length'].sum() / 60
member_total_time = member_data['ride_length'].sum() / 60
summary_data['Casual Riders'].append(f"{casual_total_time:,.0f}")
summary_data['Annual Members'].append(f"{member_total_time:,.0f}")

# 6. Weekday rides percentage
casual_weekday_count = len(casual_data[casual_data['day_type'] == 'Weekday'])
member_weekday_count = len(member_data[member_data['day_type'] == 'Weekday'])
casual_weekday_pct = (casual_weekday_count / casual_total) * 100
member_weekday_pct = (member_weekday_count / member_total) * 100
summary_data['Casual Riders'].append(f"{casual_weekday_pct:.2f}%")
summary_data['Annual Members'].append(f"{member_weekday_pct:.2f}%")

# 7. Weekend rides percentage
casual_weekend_pct = 100 - casual_weekday_pct
member_weekend_pct = 100 - member_weekday_pct
summary_data['Casual Riders'].append(f"{casual_weekend_pct:.2f}%")
summary_data['Annual Members'].append(f"{member_weekend_pct:.2f}%")

# 8. Peak usage day
casual_peak_day = casual_data['day_name'].value_counts().idxmax()
member_peak_day = member_data['day_name'].value_counts().idxmax()
summary_data['Casual Riders'].append(casual_peak_day)
summary_data['Annual Members'].append(member_peak_day)

# 9. Peak usage hour
casual_peak_hour = casual_data['hour'].value_counts().idxmax()
member_peak_hour = member_data['hour'].value_counts().idxmax()
summary_data['Casual Riders'].append(f"{casual_peak_hour}:00")
summary_data['Annual Members'].append(f"{member_peak_hour}:00")

# 10. Most used bike type
casual_bike_type = casual_data['rideable_type'].value_counts().idxmax()
member_bike_type = member_data['rideable_type'].value_counts().idxmax()
summary_data['Casual Riders'].append(casual_bike_type)
summary_data['Annual Members'].append(member_bike_type)

# 11. Summer rides (May-September)
summer_months = [5, 6, 7, 8, 9]
casual_summer_count = len(casual_data[casual_data['month'].isin(summer_months)])
member_summer_count = len(member_data[member_data['month'].isin(summer_months)])
casual_summer_pct = (casual_summer_count / casual_total) * 100
member_summer_pct = (member_summer_count / member_total) * 100
summary_data['Casual Riders'].append(f"{casual_summer_pct:.2f}%")
summary_data['Annual Members'].append(f"{member_summer_pct:.2f}%")

# 12. Winter rides (November-February)
winter_months = [11, 12, 1, 2]
casual_winter_count = len(casual_data[casual_data['month'].isin(winter_months)])
member_winter_count = len(member_data[member_data['month'].isin(winter_months)])
casual_winter_pct = (casual_winter_count / casual_total) * 100
member_winter_pct = (member_winter_count / member_total) * 100
summary_data['Casual Riders'].append(f"{casual_winter_pct:.2f}%")
summary_data['Annual Members'].append(f"{member_winter_pct:.2f}%")

# Create and display the comparison table
comparison_df = pd.DataFrame(summary_data)
print(comparison_df.to_string(index=False))

COMPREHENSIVE MEMBER VS CASUAL COMPARISON
                    Metric Casual Riders Annual Members
               Total Rides     1,584,360      3,274,659
       Percentage of Total        32.61%         67.39%
   Avg Ride Duration (min)         10.61           9.23
Median Ride Duration (min)          9.60           8.01
   Total Ride Time (hours)       280,058        503,930
         Weekday Rides (%)        65.07%         76.70%
         Weekend Rides (%)        34.93%         23.30%
            Peak Usage Day      Saturday       Thursday
           Peak Usage Hour         17:00          17:00
       Most Used Bike Type electric_bike  electric_bike
  Summer Rides (May-Sep) %        68.82%         56.55%
  Winter Rides (Nov-Feb) %        10.74%         18.87%
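
The summary cell above repeats each calculation once for casual riders and once for members. A possible refactor is to compute the metrics for one group in a helper and call it twice. The sketch below covers only a few of the metrics and runs on toy data. The name `summarize_group` is hypothetical, and `mode()` is used here in place of `value_counts().idxmax()`, so the two may differ when there is a tie.

```python
import pandas as pd

def summarize_group(group: pd.DataFrame, total_rides: int) -> dict:
    """Collect a few headline metrics for one rider type."""
    return {
        'Total Rides': f"{len(group):,}",
        'Percentage of Total': f"{len(group) / total_rides * 100:.2f}%",
        'Avg Ride Duration (min)': f"{group['ride_length'].mean():.2f}",
        'Peak Usage Hour': f"{group['hour'].mode().iloc[0]}:00",
        'Most Used Bike Type': group['rideable_type'].mode().iloc[0],
    }

# Toy rides standing in for the full dataset
toy = pd.DataFrame({
    'member_casual': ['casual', 'casual', 'member', 'member', 'member'],
    'ride_length': [12.0, 8.0, 5.0, 7.0, 9.0],
    'hour': [17, 17, 8, 17, 8],
    'rideable_type': ['electric_bike', 'electric_bike', 'classic_bike',
                      'classic_bike', 'electric_bike'],
})

# One column per rider type, one row per metric
comparison = pd.DataFrame({
    label: summarize_group(toy[toy['member_casual'] == key], len(toy))
    for label, key in [('Casual Riders', 'casual'), ('Annual Members', 'member')]
})
comparison.index.name = 'Metric'
print(comparison)
```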
