# Cyclistic Bike-Share Analysis

## Phase 3: PROCESS

## Data Cleaning and Transformation

In this notebook, I document my systematic approach to cleaning and preparing the combined 2024 dataset. This step is critical because the analysis can only be as accurate and reliable as the data behind it.

**Input:** combined_2024_raw.csv (5,860,568 rows x 13 columns)  
**Objective:** Create a clean, analysis-ready dataset

## My Cleaning Strategy

I will follow these steps:
1. Load the combined raw data
2. Remove duplicate ride IDs
3. Fix data types (convert datetime columns)
4. Create calculated fields (ride_length, day_of_week, month, hour)
5. Analyze and handle missing values
6. Remove invalid records
7. Verify and save the cleaned dataset

---

### Step 1: Load Combined Dataset

I'll start by loading the raw combined dataset that I created in the previous notebook.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Turn off warning messages
warnings.filterwarnings('ignore')

# Load the combined raw data from the previous step
df = pd.read_csv('../data/raw/combined_2024_raw.csv')

# Confirm the file loaded successfully
print('Dataset loaded successfully')
print('_' * 60)

# Display the shape of the data
total_rows = df.shape[0]
total_columns = df.shape[1]
print(f"Shape: {total_rows:,} rows x {total_columns} columns")


Dataset loaded successfully
____________________________________________________________
Shape: 5,860,568 rows x 13 columns
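
With nearly 5.9 million rows, memory use can start to matter. As a minimal sketch (not what the cell above does), `read_csv` can be asked to store the two low-cardinality text columns as `category`. The inline CSV below is made-up data standing in for the real file, and the choice of columns is my assumption.

```python
import io
import pandas as pd

# Made-up three-row sample standing in for combined_2024_raw.csv
sample_csv = io.StringIO(
    "ride_id,rideable_type,started_at,member_casual\n"
    "A1,electric_bike,2024-01-12 15:30:27,member\n"
    "B2,classic_bike,2024-01-08 15:45:46,casual\n"
    "C3,electric_bike,2024-01-27 12:27:19,member\n"
)

# Declare low-cardinality text columns as 'category' to reduce memory use
sample = pd.read_csv(
    sample_csv,
    dtype={"rideable_type": "category", "member_casual": "category"},
)

print(sample.dtypes)
print(f"Shape: {sample.shape[0]:,} rows x {sample.shape[1]} columns")
```

Whether this is worth doing depends on how much memory is available; the rest of the notebook works the same either way.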


In [2]:
# Display all column names
print("\nColumn Names:")
print("-" * 60)

# Print each column name with its 1-based position
for column_number, column_name in enumerate(df.columns, start=1):
    print(f'{column_number}. {column_name}')

# Show a sample of the data
print("\nSample Data:")
print("-" * 60)
print(df.head(3))


Column Names:
------------------------------------------------------------
1. ride_id
2. rideable_type
3. started_at
4. ended_at
5. start_station_name
6. start_station_id
7. end_station_name
8. end_station_id
9. start_lat
10. start_lng
11. end_lat
12. end_lng
13. member_casual

Sample Data:
------------------------------------------------------------
            ride_id  rideable_type           started_at             ended_at  \
0  C1D650626C8C899A  electric_bike  2024-01-12 15:30:27  2024-01-12 15:37:59   
1  EECD38BDB25BFCB0  electric_bike  2024-01-08 15:45:46  2024-01-08 15:52:59   
2  F4A9CE78061F17F7  electric_bike  2024-01-27 12:27:19  2024-01-27 12:35:19   

  start_station_name start_station_id          end_station_name  \
0  Wells St & Elm St     KA1504000135  Kingsbury St & Kinzie St   
1  Wells St & Elm St     KA1504000135  Kingsbury St & Kinzie St   
2  Wells St & Elm St     KA1504000135  Kingsbury St & Kinzie St   

  end_station_id  start_lat  start_lng    end_lat    end

---

### Step 2: Check and Remove Duplicates

I need to check if there are any duplicate ride IDs in the dataset. Each ride should have a unique ID, so duplicates need to be removed.

In [3]:
# Check for duplicate ride_ids
print("Checking for duplicates...")
print("-" * 60)

# Count how many duplicate ride_ids exist
duplicate_count = df['ride_id'].duplicated().sum()
print(f"Duplicate ride_ids found: {duplicate_count}")

# If duplicates exist, remove them
if duplicate_count > 0:
    print(f"Removing {duplicate_count} duplicate records....")
    # Keep the first occurrence and remove the rest
    df = df.drop_duplicates(subset=['ride_id'], keep='first')
    print(f"New shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
else:
    print("No duplicates found. Data integrity confirmed.")

Checking for duplicates...
------------------------------------------------------------
Duplicate ride_ids found: 211
Removing 211 duplicate records....
New shape: 5,860,357 rows x 13 columns
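
To make the semantics of `duplicated()` and `drop_duplicates()` concrete, here is a small sketch on made-up rides. It also shows `keep=False`, which is not used above but can help when inspecting every copy of a duplicated ID before deciding which one to keep.

```python
import pandas as pd

# Made-up rides; ride_id "A1" appears twice
rides = pd.DataFrame({
    "ride_id": ["A1", "B2", "A1", "C3"],
    "member_casual": ["member", "casual", "member", "member"],
})

# duplicated() flags only the second and later occurrences by default
duplicate_count = rides["ride_id"].duplicated().sum()

# keep=False flags every copy, which is handy for inspecting them side by side
all_copies = rides[rides["ride_id"].duplicated(keep=False)]

# Drop the repeats, keeping the first occurrence of each ride_id
deduped = rides.drop_duplicates(subset=["ride_id"], keep="first")

print(duplicate_count)
print(all_copies)
print(len(deduped))
```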


---

### Step 3: Fix Data Types - Convert Datetime Columns

The started_at and ended_at columns are currently stored as text strings. I need to convert them to datetime format so I can perform time-based calculations and analysis.

In [4]:
# Convert started_at and ended_at columns from text to datetime format
print("Converting datetime columns...")
print("-" * 60)

# Convert started_at column to datetime
df['started_at'] = pd.to_datetime(df['started_at'], format='mixed')

# Convert ended_at column to datetime
df['ended_at'] = pd.to_datetime(df['ended_at'], format='mixed')

# Confirm conversion was successful
print("Datetime conversion complete")
print("\nData types after conversion:")
print(df[['started_at', 'ended_at']].dtypes)

Converting datetime columns...
------------------------------------------------------------
Datetime conversion complete

Data types after conversion:
started_at    datetime64[ns]
ended_at      datetime64[ns]
dtype: object
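
The `format='mixed'` option (available in pandas 2.0 and later) matters when some timestamps carry fractional seconds and others do not. The sketch below uses made-up strings to illustrate this. The `errors='coerce'` argument is my addition and is not used in the cell above; it turns unparseable values into `NaT` so they can be counted rather than raising an error.

```python
import pandas as pd

# Made-up timestamps: with and without fractional seconds, plus one bad value
raw = pd.Series([
    "2024-01-12 15:30:27",
    "2024-12-31 23:54:37.045",
    "not a timestamp",
])

# format='mixed' infers the format for each element individually;
# errors='coerce' converts unparseable strings to NaT instead of raising
parsed = pd.to_datetime(raw, format="mixed", errors="coerce")

print(parsed)
print(f"Unparseable values: {parsed.isna().sum()}")
```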


---

### Step 4: Create Calculated Fields

I'm creating new columns that will be useful for my analysis:
- **ride_length:** Duration of each ride in minutes
- **day_of_week:** Day when ride started (Monday=0, Sunday=6)
- **month:** Month when ride started
- **hour:** Hour when ride started

In [5]:
# Create calculated fields
print("Creating calculated fields...")
print("-" * 60)

# Calculate ride_length in minutes
# First, calculate the time difference
time_difference = df['ended_at'] - df['started_at']

# Convert to total seconds
time_in_seconds = time_difference.dt.total_seconds()

# Convert seconds to minutes
df['ride_length'] = time_in_seconds / 60

# Extract day of week (0=Monday, 6=Sunday)
df['day_of_week'] = df['started_at'].dt.dayofweek

# Extract month
df['month'] = df['started_at'].dt.month

# Extract hour
df['hour'] = df['started_at'].dt.hour

# Confirm new columns were created
print("Calculated fields created successfully")
print("\nNew columns:")
print(df[['started_at', 'ended_at', 'ride_length', 'day_of_week', 'month', 'hour']].head())

Creating calculated fields...
------------------------------------------------------------
Calculated fields created successfully

New columns:
           started_at            ended_at  ride_length  day_of_week  month  \
0 2024-01-12 15:30:27 2024-01-12 15:37:59     7.533333            4      1   
1 2024-01-08 15:45:46 2024-01-08 15:52:59     7.216667            0      1   
2 2024-01-27 12:27:19 2024-01-27 12:35:19     8.000000            5      1   
3 2024-01-29 16:26:17 2024-01-29 16:56:06    29.816667            0      1   
4 2024-01-31 05:43:23 2024-01-31 06:09:35    26.200000            2      1   

   hour  
0    15  
1    15  
2    12  
3    16  
4     5  
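
As a quick sanity check, the first sample ride can be recomputed by hand. The sketch below repeats the same arithmetic on scalar timestamps; `day_name()` is included only to make the Monday=0 numbering easier to read.

```python
import pandas as pd

# First sample ride from the output above
started_at = pd.Timestamp("2024-01-12 15:30:27")
ended_at = pd.Timestamp("2024-01-12 15:37:59")

# 7 min 32 s = 452 s, and 452 / 60 = 7.5333... minutes
ride_length = (ended_at - started_at).total_seconds() / 60

# dayofweek counts from Monday=0, so a Friday is 4
day_of_week = started_at.dayofweek
day_name = started_at.day_name()

print(f"{ride_length:.6f}", day_of_week, day_name)
```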


---

### Step 5: Analyze Missing Values

Now I need to understand which columns have missing values and decide how to handle them.

In [6]:
# Check for missing values in each column
print("Missing Values Analysis:")
print("-" * 60)

# Create a list to store missing value information
missing_info = []

# Total row count (constant, so computed once outside the loop)
total_rows = len(df)

# Loop through each column
for column_name in df.columns:
    # Count missing values
    missing_count = df[column_name].isnull().sum()
    
    # Calculate percentage, rounded to 2 decimals
    missing_percentage = round((missing_count / total_rows) * 100, 2)
    
    # Store the information
    missing_info.append({
        'Column': column_name,
        'Missing_Count': missing_count,
        'Missing_Percentage': missing_percentage
    })

# Convert to dataframe
missing_data = pd.DataFrame(missing_info)

# Show only columns with missing values
missing_data = missing_data[missing_data['Missing_Count'] > 0]

# Sort by missing count
missing_data = missing_data.sort_values('Missing_Count', ascending=False)

print(missing_data.to_string(index=False))

Missing Values Analysis:
------------------------------------------------------------
            Column  Missing_Count  Missing_Percentage
    end_station_id        1104579               18.85
  end_station_name        1104579               18.85
  start_station_id        1073884               18.32
start_station_name        1073884               18.32
           end_lat           7213                0.12
           end_lng           7213                0.12
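
The same summary can also be built without an explicit loop. This is a sketch of an alternative on made-up data, not a replacement for the cell above: `isna().sum()` counts gaps per column, and `isna().mean()` gives the fraction missing.

```python
import pandas as pd

# Made-up rides with gaps in a station column and a coordinate column
rides = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3", "D4"],
    "start_station_name": ["Wells St & Elm St", None, None, "Wells St & Elm St"],
    "end_lat": [41.89, 41.90, None, 41.88],
})

# Count and percentage of missing values for every column at once
missing_data = pd.DataFrame({
    "Missing_Count": rides.isna().sum(),
    "Missing_Percentage": (rides.isna().mean() * 100).round(2),
})

# Keep only columns that have gaps, largest first
missing_data = missing_data[missing_data["Missing_Count"] > 0]
missing_data = missing_data.sort_values("Missing_Count", ascending=False)

print(missing_data)
```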


---

### Step 6: Handle Missing Values

Based on my analysis, I'll keep rows with missing station names or IDs (roughly 18-19% of rides) because they still have GPS coordinates. However, I'll remove the 7,213 rows (0.12%) that are missing end coordinates (end_lat and end_lng), since these lack critical location data.

In [7]:
# Check current size
print("Before removing missing coordinates:")
print("-" * 60)
print(f"Total rows: {len(df):,}")

# Store original count
original_count = len(df)

# Remove rows where end_lat or end_lng is missing
print("\nRemoving rows with missing end coordinates...")
print("-" * 60)
df = df.dropna(subset=['end_lat', 'end_lng'])

# Show results
print("After removing missing coordinates:")
print(f"Total rows: {len(df):,}")
print(f"Rows removed: {original_count - len(df):,}")

Before removing missing coordinates:
------------------------------------------------------------
Total rows: 5,860,357

Removing rows with missing end coordinates...
------------------------------------------------------------
After removing missing coordinates:
Total rows: 5,853,144
Rows removed: 7,213
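
The `subset` argument is what lets rows with a missing station name survive this step. A small sketch on made-up rides shows the difference:

```python
import pandas as pd

# Made-up rides: B2 lacks a station name, C3 lacks end coordinates
rides = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3"],
    "end_station_name": ["Kingsbury St & Kinzie St", None, None],
    "end_lat": [41.889, 41.901, None],
    "end_lng": [-87.639, -87.634, None],
})

# subset limits the check to the coordinate columns, so a missing
# station name alone does not cause a row to be dropped
kept = rides.dropna(subset=["end_lat", "end_lng"])

print(kept["ride_id"].tolist())
```

Calling `dropna()` with no `subset` would have dropped B2 as well, which is not the intent here.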


---

### Step 7: Analyze Ride Length Distribution

Before filtering, I want to understand the distribution of ride lengths to identify any anomalies or outliers.

In [8]:
# Display statistics for ride_length
print("Ride Length Statistics:")
print("-" * 60)
print(df['ride_length'].describe())

# Display key metrics
mean_minutes = df['ride_length'].mean()
median_minutes = df['ride_length'].median()
min_minutes = df['ride_length'].min()
max_minutes = df['ride_length'].max()

print(f"\nMean: {mean_minutes:.2f} minutes")
print(f"Median: {median_minutes:.2f} minutes")
print(f"Min: {min_minutes:.2f} minutes")
print(f"Max: {max_minutes:.2f} minutes")

# Check for problematic values
negative_or_zero = (df['ride_length'] <= 0).sum()
over_24_hours = (df['ride_length'] > 1440).sum()  # 1440 minutes = 24 hours

print(f"\nNegative or zero ride lengths: {negative_or_zero:,}")
print(f"Rides over 24 hours (1440 min): {over_24_hours:,}")

Ride Length Statistics:
------------------------------------------------------------
count    5.853144e+06
mean     1.446152e+01
std      1.317905e+03
min     -6.864000e+02
25%      5.016667e+00
50%      8.316667e+00
75%      1.408333e+01
max      1.054819e+06
Name: ride_length, dtype: float64

Mean: 14.46 minutes
Median: 8.32 minutes
Min: -686.40 minutes
Max: 1054819.33 minutes

Negative or zero ride lengths: 105
Rides over 24 hours (1440 min): 8,096
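
The gap between the mean (14.46) and the median (8.32), together with a standard deviation above 1,300 minutes, points to a small number of extreme values. The sketch below, on made-up ride lengths, illustrates how a single extreme ride moves the mean and standard deviation while the median barely changes.

```python
import pandas as pd

# Made-up ride lengths in minutes, then the same rides plus one extreme value
typical = pd.Series([5.0, 7.5, 8.0, 9.0, 14.0])
with_outlier = pd.concat([typical, pd.Series([1_000_000.0])], ignore_index=True)

# The mean and standard deviation jump; the median barely moves
print(f"Mean:   {typical.mean():.2f} -> {with_outlier.mean():.2f}")
print(f"Median: {typical.median():.2f} -> {with_outlier.median():.2f}")
print(f"Std:    {typical.std():.2f} -> {with_outlier.std():.2f}")
```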


---

### Step 8: Remove Invalid Ride Records

I need to filter out rides that don't make sense:
- Rides with negative or zero duration
- Rides shorter than 1 minute (likely false starts)
- Rides longer than 24 hours (likely unreturned bikes or data errors)

In [9]:
# Filter out invalid rides
print("Filtering invalid rides...")
print("-" * 60)
print(f"Before filtering: {len(df):,} rides")

# Store original count
before_filter = len(df)

# Apply filters
print("\nApplying filters:")
print("- Removing rides < 1 minute")
print("- Removing rides > 1440 minutes (24 hours)")

# Keep only rides between 1 minute and 1440 minutes (24 hours)
df = df[(df['ride_length'] >= 1) & (df['ride_length'] <= 1440)]

# Show results
after_filter = len(df)
rides_removed = before_filter - after_filter
percentage_retained = (after_filter / before_filter) * 100

print(f"\nAfter filtering: {after_filter:,} rides")
print(f"Rides removed: {rides_removed:,}")
print(f"Percentage retained: {percentage_retained:.2f}%")

Filtering invalid rides...
------------------------------------------------------------
Before filtering: 5,853,144 rides

Applying filters:
- Removing rides < 1 minute
- Removing rides > 1440 minutes (24 hours)

After filtering: 4,859,019 rides
Rides removed: 994,125
Percentage retained: 83.02%
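
Because this filter is the largest single reduction in the pipeline, it can help to count what each rule removes before applying it, so the total can be reconciled against the anomaly counts from Step 7. The following is a minimal sketch on made-up ride lengths; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up ride lengths in minutes covering each case
rides = pd.DataFrame({"ride_length": [-3.0, 0.0, 0.5, 1.0, 12.0, 1440.0, 2000.0]})

# One boolean mask per rule; the two masks cannot overlap
too_short = rides["ride_length"] < 1      # includes negative and zero durations
too_long = rides["ride_length"] > 1440    # longer than 24 hours

print(f"Under 1 minute: {too_short.sum():,}")
print(f"Over 24 hours:  {too_long.sum():,}")

# Keep rides between 1 and 1440 minutes, inclusive at both ends
filtered = rides[~too_short & ~too_long]

# The per-rule counts should add up to the total number removed
removed = len(rides) - len(filtered)
assert removed == too_short.sum() + too_long.sum()
print(f"Removed: {removed:,} of {len(rides):,}")
```

If the per-rule counts and the total removed ever disagree with the earlier anomaly counts, that is a signal to re-check the bounds used in the filter.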


---

### Step 9: Verify Cleaned Data Quality

Now I'll verify that the filtering worked correctly and examine the final statistics.

In [10]:
# Check cleaned ride length statistics
print("Cleaned Ride Length Statistics:")
print("-" * 60)
print(df['ride_length'].describe())

# Display key metrics
mean_value = df['ride_length'].mean()
median_value = df['ride_length'].median()
min_value = df['ride_length'].min()
max_value = df['ride_length'].max()

print(f"\nMean: {mean_value:.2f} minutes")
print(f"Median: {median_value:.2f} minutes")
print(f"Min: {min_value:.2f} minutes")
print(f"Max: {max_value:.2f} minutes")

Cleaned Ride Length Statistics:
------------------------------------------------------------
count    4.859019e+06
mean     9.680817e+00
std      5.524413e+00
min      1.000000e+00
25%      5.250000e+00
50%      8.516617e+00
75%      1.324084e+01
max      2.400000e+01
Name: ride_length, dtype: float64

Mean: 9.68 minutes
Median: 8.52 minutes
Min: 1.00 minutes
Max: 24.00 minutes


In [11]:
# Show percentiles for better understanding
print("\nPercentiles:")
print("-" * 60)

percentiles_list = [25, 50, 75, 90, 95, 99]

for p in percentiles_list:
    # Calculate the percentile value
    percentile_value = p / 100
    value = df['ride_length'].quantile(percentile_value)
    print(f"{p}th percentile: {value:.2f} minutes")


Percentiles:
------------------------------------------------------------
25th percentile: 5.25 minutes
50th percentile: 8.52 minutes
75th percentile: 13.24 minutes
90th percentile: 18.19 minutes
95th percentile: 20.68 minutes
99th percentile: 23.25 minutes
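
As an alternative to the loop, `quantile()` also accepts a list of fractions and returns all percentiles in one call. A sketch on made-up ride lengths:

```python
import pandas as pd

# Made-up ride lengths in minutes: 1.0, 2.0, ..., 101.0
ride_length = pd.Series([float(m) for m in range(1, 102)])

# quantile() with a list returns a Series indexed by the requested fractions
percentiles = ride_length.quantile([0.25, 0.50, 0.75, 0.90, 0.95, 0.99])

for fraction, value in percentiles.items():
    print(f"{fraction * 100:.0f}th percentile: {value:.2f} minutes")
```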


In [12]:
# Analyze distribution by duration range
print("\nRide Length Distribution:")
print("-" * 60)

# Define bins and labels for grouping
bins = [1, 5, 10, 15, 30, 60, 120, 240, 480, 1440]
labels = ['1-5 min', '5-10 min', '10-15 min', '15-30 min', '30-60 min', 
          '1-2 hrs', '2-4 hrs', '4-8 hrs', '8-24 hrs']

# Create duration range categories
df['duration_range'] = pd.cut(df['ride_length'], bins=bins, labels=labels, include_lowest=True)

# Count rides in each category
distribution = df['duration_range'].value_counts().sort_index()

# Calculate percentages
total_rides = len(df)
distribution_pct = (distribution / total_rides) * 100
distribution_pct = distribution_pct.round(2)

# Combine into one table
result = pd.DataFrame({
    'Count': distribution,
    'Percentage': distribution_pct
})

print(result)

# Remove the temporary duration_range column
df = df.drop('duration_range', axis=1)


Ride Length Distribution:
------------------------------------------------------------
                  Count  Percentage
duration_range                     
1-5 min         1113574       22.92
5-10 min        1767371       36.37
10-15 min       1070173       22.02
15-30 min        907901       18.68
30-60 min             0        0.00
1-2 hrs               0        0.00
2-4 hrs               0        0.00
4-8 hrs               0        0.00
8-24 hrs              0        0.00
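
The bin edges deserve a closer look, because `pd.cut` uses right-closed intervals by default. The sketch below reuses the same bins and labels on a few made-up values to show where edge cases land, and what happens to a value outside the bin range.

```python
import pandas as pd

# Made-up ride lengths, including bin edges and one value outside the range
ride_length = pd.Series([1.0, 5.0, 5.01, 24.0, 1440.0, 2000.0])

bins = [1, 5, 10, 15, 30, 60, 120, 240, 480, 1440]
labels = ['1-5 min', '5-10 min', '10-15 min', '15-30 min', '30-60 min',
          '1-2 hrs', '2-4 hrs', '4-8 hrs', '8-24 hrs']

# Intervals are right-closed: (1, 5], (5, 10], ...; include_lowest=True
# also admits the left edge of the first bin, so 1.0 is not lost
duration_range = pd.cut(ride_length, bins=bins, labels=labels, include_lowest=True)

print(duration_range.tolist())
```

A ride of exactly 5 minutes is counted in '1-5 min', not '5-10 min', and the 2000-minute value falls outside every bin and becomes NaN.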


---

### Step 10: Save Cleaned Dataset

Now that the data is clean and ready for analysis, I'll save it to a new file in the processed data folder.

In [13]:
# Save the cleaned data
output_path = '../data/processed/cleaned_2024_data.csv'
df.to_csv(output_path, index=False)

# Confirm save was successful
print("Cleaned dataset saved successfully")
print("-" * 60)
print(f"File path: {output_path}")
print(f"Records: {len(df):,}")

# Show date range of the cleaned data
min_date = df['started_at'].min()
max_date = df['started_at'].max()
print(f"Date range: {min_date} to {max_date}")

Cleaned dataset saved successfully
------------------------------------------------------------
File path: ../data/processed/cleaned_2024_data.csv
Records: 4,859,019
Date range: 2024-01-01 00:00:39 to 2024-12-31 23:54:37.045000
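
One thing to keep in mind for the next notebook: CSV files store no type information, so the datetime columns come back as text unless they are parsed again on load. The sketch below demonstrates the round trip on made-up rides, using an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# Made-up cleaned rides with a datetime column
cleaned = pd.DataFrame({
    "ride_id": ["A1", "B2"],
    "started_at": pd.to_datetime(["2024-01-12 15:30:27", "2024-01-08 15:45:46"]),
    "ride_length": [7.533333, 7.216667],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates, started_at would come back as plain text
reloaded = pd.read_csv(buffer, parse_dates=["started_at"])

print(reloaded.dtypes)
```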
