# Cyclistic Bike-Share Analysis

## Phase 1: ASK

## Business Problem Statement

**Company:** Cyclistic - A bike-share company in Chicago with 5,800+ bikes and 600 docking stations

**Challenge:** Annual members are more profitable than casual riders. I need to help convert casual riders into annual members.

**My Task:** I will analyze how annual members and casual riders use Cyclistic bikes differently to identify patterns that can inform marketing strategies.

## Key Definitions

- **Casual Riders:** Customers who purchase single-ride or full-day passes
- **Annual Members:** Customers who purchase annual memberships

## Success Metrics

My analysis should provide:
1. Clear differences in usage patterns between member types
2. Data-driven insights for marketing strategy
3. Actionable recommendations backed by data visualizations

## Deliverables

I will deliver:
1. Clear statement of business task
2. Description of data sources
3. Documentation of data cleaning process
4. Summary of analysis
5. Supporting visualizations
6. Top 3 recommendations





---

## Phase 2: PREPARE - Data Collection

For this analysis, I am using publicly available data from Divvy Bikes (Cyclistic).

**Data Source:** Divvy Bikes (Cyclistic) Historical Trip Data  
**License:** Public data from Motivate International Inc.  
**Time Period:** January 2024 - December 2024 (12 months)  
**Privacy Note:** No personally identifiable information (PII) included

### Dataset Information:
- **12 CSV files** (one per month)
- **Source URL:** https://divvy-tripdata.s3.amazonaws.com/index.html
- **File naming convention:** YYYYMM-divvy-tripdata.csv

### Files I Downloaded:
1. 202401-divvy-tripdata.csv (January 2024)
2. 202402-divvy-tripdata.csv (February 2024)
3. 202403-divvy-tripdata.csv (March 2024)
4. 202404-divvy-tripdata.csv (April 2024)
5. 202405-divvy-tripdata.csv (May 2024)
6. 202406-divvy-tripdata.csv (June 2024)
7. 202407-divvy-tripdata.csv (July 2024)
8. 202408-divvy-tripdata.csv (August 2024)
9. 202409-divvy-tripdata.csv (September 2024)
10. 202410-divvy-tripdata.csv (October 2024)
11. 202411-divvy-tripdata.csv (November 2024)
12. 202412-divvy-tripdata.csv (December 2024)
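
Because the files follow the `YYYYMM-divvy-tripdata.csv` convention, the expected names can be generated and compared against whatever is in the download folder. This is a minimal sketch: `find_missing` is a hypothetical helper, and the folder listing is simulated with a list that deliberately omits March.

```python
# Build the 12 expected file names for 2024 from the YYYYMM naming convention
expected_files = [f"2024{month:02d}-divvy-tripdata.csv" for month in range(1, 13)]


def find_missing(expected, present):
    """Return the expected file names that are absent from `present`, in order."""
    present_set = set(present)
    return [name for name in expected if name not in present_set]


# Hypothetical folder listing with the March file left out on purpose
present_files = [name for name in expected_files if not name.startswith("202403")]

print(find_missing(expected_files, present_files))
```

In practice, `present_files` would come from listing the raw data folder; an empty result means all 12 months are present.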

In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

# Turn off warning messages to keep output clean
warnings.filterwarnings('ignore')

# Configure pandas to display all columns and more rows for better data inspection
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Confirm libraries loaded successfully
print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.5


---

## Data Exploration: Understanding My Dataset

Before combining all the data, I need to explore the structure and contents of the data files to understand what I'm working with.

In [3]:
import os

# Set the path where my raw data files are stored
data_path = '../data/raw'

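# Get a list of the monthly trip-data CSV files in that folder.
# Matching on the '-divvy-tripdata.csv' suffix keeps any other CSV stored
# in the same folder (such as a combined export) out of the list on a rerun.
csv_files = []
for file in os.listdir(data_path):
    if file.endswith('-divvy-tripdata.csv'):
        csv_files.append(file)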

# Sort the files in alphabetical order
csv_files.sort()

# Display how many CSV files I found
print(f"Total number of CSV files: {len(csv_files)}\n")



Total number of CSV files: 12



### Loading Sample Data (January 2024)

I'll examine one month of data first to understand the structure before processing all files. This helps me understand what columns exist, what data types they have, and if there are any issues I need to address.
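
One data-type issue this kind of inspection tends to surface is timestamps loading as plain strings (the `info()` output for this dataset lists `started_at` and `ended_at` as `object`). As a small sketch, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the sample rows, `parse_dates` can convert them at load time. The in-memory CSV below holds two rows copied from the January sample with a reduced set of columns; `peek_df` is a name I made up for illustration.

```python
import io

import pandas as pd

# Two rows from the January sample, reduced to four columns for brevity
csv_text = """ride_id,started_at,ended_at,member_casual
C1D650626C8C899A,2024-01-12 15:30:27,2024-01-12 15:37:59,member
EECD38BDB25BFCB0,2024-01-08 15:45:46,2024-01-08 15:52:59,member
"""

# parse_dates asks pandas to convert the listed columns to datetime on load
peek_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["started_at", "ended_at"],
)

# The two timestamp columns should now report a datetime64 dtype
print(peek_df.dtypes)
```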

In [4]:
# Load the first CSV file (January 2024) as a sample
sample_file_path = '../data/raw/202401-divvy-tripdata.csv'
sample_df = pd.read_csv(sample_file_path)

# Confirm the file loaded successfully
print("Sample data loaded successfully!\n")

# Check how many rows and columns are in this file
number_of_rows = sample_df.shape[0]
number_of_columns = sample_df.shape[1]
print(f"Dataset Shape: {number_of_rows:,} rows × {number_of_columns} columns")
print("=" * 60)

# Look at the first 3 rows to see what the data looks like
print("First 3 rows of the dataset:\n")
print(sample_df.head(3))
print("=" * 60)

# Get detailed information about the columns
print("\nColumn Information:\n")
print(sample_df.info())
print("=" * 60)

# Show basic statistics for numerical columns
print("\n📊 Basic Statistics:\n")
print(sample_df.describe())

Sample data loaded successfully!

Dataset Shape: 144,873 rows × 13 columns
First 3 rows of the dataset:

            ride_id  rideable_type           started_at             ended_at  \
0  C1D650626C8C899A  electric_bike  2024-01-12 15:30:27  2024-01-12 15:37:59   
1  EECD38BDB25BFCB0  electric_bike  2024-01-08 15:45:46  2024-01-08 15:52:59   
2  F4A9CE78061F17F7  electric_bike  2024-01-27 12:27:19  2024-01-27 12:35:19   

  start_station_name start_station_id          end_station_name  \
0  Wells St & Elm St     KA1504000135  Kingsbury St & Kinzie St   
1  Wells St & Elm St     KA1504000135  Kingsbury St & Kinzie St   
2  Wells St & Elm St     KA1504000135  Kingsbury St & Kinzie St   

  end_station_id  start_lat  start_lng    end_lat    end_lng member_casual  
0   KA1503000043  41.903267 -87.634737  41.889177 -87.638506        member  
1   KA1503000043  41.902937 -87.634440  41.889177 -87.638506        member  
2   KA1503000043  41.902951 -87.634470  41.889177 -87.638506        membe

### Combining All Monthly Files

Now that I understand the structure of one file, I'll combine all 12 monthly files into one complete dataset for the entire year of 2024. This will give me a comprehensive view of the bike-share usage patterns.

In [13]:
# Create an empty list to store each month's data
all_dataframes = []

# Loop through each CSV file, counting from 1 for the progress message
total_files = len(csv_files)
for file_number, file_name in enumerate(csv_files, start=1):
    # Build the full path to the file
    file_path = os.path.join(data_path, file_name)

    # Read the CSV file into a dataframe
    monthly_df = pd.read_csv(file_path)

    # Add this month's data to our list
    all_dataframes.append(monthly_df)

    # Show progress
    row_count = len(monthly_df)
    print(f"✅ Processing file {file_number}/{total_files}: {file_name} - {row_count:,} rows")

# Combine all monthly dataframes into one large dataframe
df = pd.concat(all_dataframes, ignore_index=True)

# Show summary of combined data
print("=" * 60)
print("✅ All files combined successfully!")
print(f"Total combined rows: {len(df):,}")
print(f"Total columns: {len(df.columns)}")

✅ Processing file 1/12: 202401-divvy-tripdata.csv - 144,873 rows
✅ Processing file 2/12: 202402-divvy-tripdata.csv - 186,125 rows
✅ Processing file 3/12: 202403-divvy-tripdata.csv - 284,042 rows
✅ Processing file 4/12: 202404-divvy-tripdata.csv - 426,590 rows
✅ Processing file 5/12: 202405-divvy-tripdata.csv - 604,827 rows
✅ Processing file 6/12: 202406-divvy-tripdata.csv - 719,618 rows
✅ Processing file 7/12: 202407-divvy-tripdata.csv - 754,698 rows
✅ Processing file 8/12: 202408-divvy-tripdata.csv - 755,266 rows
✅ Processing file 9/12: 202409-divvy-tripdata.csv - 694,587 rows
✅ Processing file 10/12: 202410-divvy-tripdata.csv - 562,932 rows
✅ Processing file 11/12: 202411-divvy-tripdata.csv - 368,026 rows
✅ Processing file 12/12: 202412-divvy-tripdata.csv - 358,984 rows
✅ All files combined successfully!
Total combined rows: 5,860,568
Total columns: 13
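
A quick way to confirm that no rows were lost or duplicated during the merge is to add up the per-file counts and compare them with the combined total. The sketch below copies the numbers from the processing log above; `monthly_rows` and `reported_total` are names chosen for illustration.

```python
# Row counts per monthly file, copied from the processing log above
monthly_rows = {
    "202401": 144_873,
    "202402": 186_125,
    "202403": 284_042,
    "202404": 426_590,
    "202405": 604_827,
    "202406": 719_618,
    "202407": 754_698,
    "202408": 755_266,
    "202409": 694_587,
    "202410": 562_932,
    "202411": 368_026,
    "202412": 358_984,
}

# Total reported after pd.concat
reported_total = 5_860_568

computed_total = sum(monthly_rows.values())
print(f"Sum of monthly rows: {computed_total:,}")
print(f"Matches combined total: {computed_total == reported_total}")
```

Inside the notebook itself, the same check could compare `sum(len(d) for d in all_dataframes)` with `len(df)` instead of hard-coded numbers.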


### Examining the Combined Dataset

Now that all 12 months are combined, I'll examine the complete dataset to identify any data quality issues, missing values, and understand the overall structure before cleaning.

In [13]:
# Display overall information about the dataset
print("Dataset Information:")
print('=' * 60)
df.info()

# Check what data type each column is
print("\nColumn Data Types:")
print("-" * 60)
print(df.dtypes)

# Look at the first 3 rows
print("\nFirst 3 rows:")
print(df.head(3))

# Look at the last 3 rows
print("\nLast 3 rows:")
print(df.tail(3))

# Check for missing values in each column
print("Missing Values Analysis:")
print("-" * 60)

# Create a list to store missing value information
missing_info = []

# Loop through each column
for column_name in df.columns:
    # Count how many values are missing
    missing_count = df[column_name].isnull().sum()
    
    # Calculate percentage of missing values
    total_rows = len(df)
    missing_percentage = (missing_count / total_rows) * 100
    missing_percentage = round(missing_percentage, 2)
    
    # Add to our list
    missing_info.append({
        'Column': column_name,
        'Missing_Count': missing_count,
        'Missing_Percentage': missing_percentage
    })

# Convert to dataframe
missing_data = pd.DataFrame(missing_info)

# Keep only columns that have missing values
missing_data = missing_data[missing_data['Missing_Count'] > 0]

# Sort by missing count (highest first)
missing_data = missing_data.sort_values('Missing_Count', ascending=False)

# Display the results
print(missing_data.to_string(index=False))

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5860568 entries, 0 to 5860567
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 581.3+ MB

Column Data Types:
------------------------------------------------------------
ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         obj
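
The column-by-column loop above can also be written with vectorized pandas calls. This is only a sketch on a toy frame: the column names come from the dataset, but the values and the name `toy_df` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names match the dataset
toy_df = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3", "D4"],
    "start_station_name": ["Wells St & Elm St", None, "Wells St & Elm St", None],
    "end_lat": [41.889177, np.nan, 41.889177, 41.889177],
})

# isna().sum() counts missing values per column; isna().mean() gives the fraction
missing_count = toy_df.isna().sum()
missing_pct = (toy_df.isna().mean() * 100).round(2)

# Keep only columns with at least one missing value, highest count first
missing_summary = (
    pd.DataFrame({"Missing_Count": missing_count, "Missing_Percentage": missing_pct})
    .query("Missing_Count > 0")
    .sort_values("Missing_Count", ascending=False)
)

print(missing_summary)
```

Both versions produce the same kind of table; the vectorized form avoids the explicit loop and is usually faster on a frame with millions of rows.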

In [14]:
# Analyze the distribution of member types
print("\nMember Type Distribution:")
print("-" * 60)

# Count how many members vs casual riders
member_counts = df['member_casual'].value_counts()

# Calculate each type's share of all rides (a fraction between 0 and 1, not yet multiplied by 100)
member_percent = df['member_casual'].value_counts(normalize=True)

# Combine into one table
distribution = pd.DataFrame({
    'Count' : member_counts,
    'Percentage' : member_percent
})

print(distribution)


Member Type Distribution:
------------------------------------------------------------
                 Count  Percentage
member_casual                     
member         3708910    0.632858
casual         2151658    0.367142
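
The `Percentage` column above holds fractions (0.63 rather than 63%). A small sketch of the conversion, using the two counts reported in that table; `share_pct` and `distribution_pct` are illustrative names.

```python
import pandas as pd

# Ride counts per rider type, copied from the distribution table above
counts = pd.Series({"member": 3_708_910, "casual": 2_151_658}, name="Count")

# Convert counts to percentages of all rides, rounded to two decimals
share_pct = (counts / counts.sum() * 100).round(2)

distribution_pct = pd.DataFrame({"Count": counts, "Percentage": share_pct})
print(distribution_pct)
```

Members account for roughly 63% of rides and casual riders for roughly 37%, which matches the fractions shown in the output.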


In [15]:
# Analyze the distribution of bike types
print("\nRideable Type Distribution:")
print("-" * 60)

# Count how many of each bike type
rideable_counts = df['rideable_type'].value_counts()
print(rideable_counts)


Rideable Type Distribution:
------------------------------------------------------------
rideable_type
electric_bike       2980595
classic_bike        2735636
electric_scooter     144337
Name: count, dtype: int64
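
Since the business task is about how members and casual riders differ, a natural next step is to break the bike types down by rider type. The sketch below uses `pd.crosstab` on a toy frame with made-up rows; it shows the shape of the calculation, not results from the real data.

```python
import pandas as pd

# Toy rides with hypothetical values; the column names and categories match the dataset
toy_rides = pd.DataFrame({
    "member_casual": ["member", "member", "member", "casual", "casual", "casual"],
    "rideable_type": [
        "classic_bike", "electric_bike", "electric_bike",
        "electric_scooter", "classic_bike", "electric_bike",
    ],
})

# Ride counts for each combination of rider type and bike type
type_by_rider = pd.crosstab(toy_rides["member_casual"], toy_rides["rideable_type"])

# The same table as row shares: each rider type's rides sum to 1
row_share = pd.crosstab(
    toy_rides["member_casual"], toy_rides["rideable_type"], normalize="index"
)

print(type_by_rider)
print(row_share.round(2))
```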


### Saving Combined Dataset

I'm saving the combined raw dataset before any cleaning. This preserves the original data for reference and ensures I can reproduce my work if needed.

In [16]:
# Define where to save the combined file
output_path = '../data/raw/combined_2024_raw.csv'

# Save the dataframe to CSV without the index column
df.to_csv(output_path, index=False)

# Confirm the file was saved
print(f'Combined raw data saved to: {output_path}')

Combined raw data saved to: ../data/raw/combined_2024_raw.csv
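
To confirm that a saved file can reproduce the dataframe, it can be read back and compared with the original. This is a minimal sketch on a toy frame written to a temporary folder; the file name `combined_toy.csv` and the frame contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with hypothetical values standing in for the combined dataset
toy_df = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3"],
    "member_casual": ["member", "casual", "member"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "combined_toy.csv")

    # Save without the index column, then read the file straight back
    toy_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The round trip should preserve the shape and the column order
same_shape = reloaded.shape == toy_df.shape
same_columns = list(reloaded.columns) == list(toy_df.columns)
print(f"Same shape: {same_shape}, same columns: {same_columns}")
```

On the full 5.8-million-row file a complete re-read is slow, so comparing only the row count and column names may be enough.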
