# New York City Bike Share Analysis (Refactored)

This notebook analyzes Citi Bike trip data from New York City and Jersey City using PySpark. Tables are written in Delta Lake format, and the Spark session also loads Apache Iceberg.

## Project Structure
- `/data`: Raw and processed data files
  - `/raw`: Original downloaded files
  - `/processed`: Cleaned and transformed data
- `/conf`: Spark and analysis configurations
- `/notebooks`: Jupyter notebooks
- `/warehouse`: Iceberg tables location

## Data Sources
The Citi Bike dataset is available from the Citi Bike System Data S3 bucket:
- URL: https://s3.amazonaws.com/tripdata/index.html
- File formats:
  - Jersey City: `JC-YYYYMM-citibike-tripdata.csv.zip`
  - NYC: `YYYYMM-citibike-tripdata.zip`
    - 2013-2023: Annual files
    - 2024+: Monthly files
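
As a rough illustration of the naming patterns listed above, the sketch below builds the monthly archive name and URL for either city. The helper names `tripdata_filename` and `tripdata_url` and the `"nyc"`/`"jc"` labels are made up for this example, and the annual NYC archives (2013-2023) are deliberately not handled.

```python
BASE_URL = "https://s3.amazonaws.com/tripdata"  # bucket root of the index URL above


def tripdata_filename(city: str, year: int, month: int) -> str:
    """Return the monthly archive name for 'nyc' or 'jc' (hypothetical labels)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    stamp = f"{year}{month:02d}"
    if city == "jc":
        return f"JC-{stamp}-citibike-tripdata.csv.zip"
    if city == "nyc":
        return f"{stamp}-citibike-tripdata.zip"
    raise ValueError(f"unknown city: {city!r}")


def tripdata_url(city: str, year: int, month: int) -> str:
    """Join the bucket root and the archive name."""
    return f"{BASE_URL}/{tripdata_filename(city, year, month)}"


print(tripdata_url("nyc", 2025, 1))
print(tripdata_url("jc", 2025, 1))
```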

## Analysis Overview
1. Data Acquisition & Preparation
2. Data Processing & Cleaning
3. Feature Engineering
4. Analysis & Insights
5. Visualization & Reporting

## Configuration and Setup

Import libraries, set up file and console logging, and configure plotting and pandas display defaults.

In [1]:
# Standard libraries
import os
import urllib.request
import zipfile
import glob
from pathlib import Path
from datetime import datetime
import time
import logging

# Data processing and display
import tqdm
from IPython.display import display, Markdown
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Spark and Delta
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Use the notebooks directory for logs (which has write permissions)
notebook_dir = Path('/home/aldamiz/notebooks')
LOG_DIR = notebook_dir / 'logs'
os.makedirs(LOG_DIR, exist_ok=True)

# Configure logging format
log_format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
date_format = '%Y-%m-%d %H:%M:%S'

def setup_logger(name, log_file):
    """Set up a logger that writes to both file and console"""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    
    # Prevent duplicate handlers
    if logger.handlers:
        logger.handlers.clear()
    
    # Create file handler
    log_path = LOG_DIR / log_file
    fh = logging.FileHandler(str(log_path))  # Convert Path to string
    fh.setLevel(logging.INFO)
    fh.setFormatter(logging.Formatter(log_format, date_format))
    
    # Create console handler
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    ch.setFormatter(logging.Formatter(log_format, date_format))
    
    # Add handlers
    logger.addHandler(fh)
    logger.addHandler(ch)
    
    return logger

# Create loggers
etl_logger = setup_logger('etl', 'etl.log')
data_logger = setup_logger('data', 'data_quality.log')
analysis_logger = setup_logger('analysis', 'analysis.log')

# Helper functions for logging
def log_performance_metrics(start_time, operation_name):
    duration = time.time() - start_time
    etl_logger.info(f"Performance metric - {operation_name}: {duration:.2f} seconds")

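def log_data_quality(df, stage):
    """Log row, per-column null, and duplicate counts (each one triggers a Spark job)."""
    data_logger.info(f"Data quality metrics for {stage}:")
    row_count = df.count()  # Computed once and reused for the duplicate check
    data_logger.info(f"- Row count: {row_count}")
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).collect()[0].asDict()
    data_logger.info(f"- Null counts: {null_counts}")
    data_logger.info(f"- Duplicate count: {row_count - df.distinct().count()}")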

def log_error_details(error, context):
    etl_logger.error(f"Error in {context}: {str(error)}", exc_info=True)
    etl_logger.error(f"Error context: {context}")

# Configure plotting
%matplotlib inline
sns.set_theme(style="whitegrid", context="notebook")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
# Parenthesize the format choice: without it, values below 1000 render as the literal '%.3f'
pd.set_option('display.float_format', lambda x: ('%.3f' if abs(x) < 1000 else '%.0f') % x)

# Configure plot sizes and styles
plt.rcParams.update({
    'figure.figsize': (12, 6),
    'axes.titlesize': 14,
    'axes.labelsize': 12
})

# Test logging
etl_logger.info("Starting NYCBS data processing pipeline")
etl_logger.info(f"Logs will be written to: {LOG_DIR}")

display(Markdown(f"✅ All libraries and loggers initialized successfully\n- Log directory: {LOG_DIR}"))

2025-03-05 00:00:53 - etl - INFO - Starting NYCBS data processing pipeline
2025-03-05 00:00:53 - etl - INFO - Logs will be written to: /home/aldamiz/notebooks/logs


✅ All libraries and loggers initialized successfully
- Log directory: /home/aldamiz/notebooks/logs

## Configuration and Setup

Before diving into the analysis, we need to set up our environment and configure Spark. This includes:
1. Defining project paths
2. Setting up Spark with proper configurations
3. Configuring storage locations for different data formats

In [2]:
# Cell 2: Path Configuration
# Paths are hardcoded to match the container layout
HOME_DIR = '/home/aldamiz'  # Fixed in container
DATA_DIR = f"{HOME_DIR}/data"
WAREHOUSE_DIR = f"{HOME_DIR}/warehouse"

# Data paths
LANDING_PATH = f"{DATA_DIR}/landing"
BRONZE_PATH = f"{DATA_DIR}/bronze"
SILVER_PATH = f"{DATA_DIR}/silver"
GOLD_PATH = f"{DATA_DIR}/gold"

# Base URL and time parameters
BASE_URL = "https://s3.amazonaws.com/tripdata"
YEAR, MONTH = 2025, 1

etl_logger.info(f"Using data paths: {DATA_DIR}")
display(Markdown("✅ Paths configured"))

2025-03-05 00:00:53 - etl - INFO - Using data paths: /home/aldamiz/data


✅ Paths configured

## Spark Configuration

The Spark session picks up its settings from `spark-defaults.conf`, which enables both Delta Lake and Apache Iceberg. Key configurations include:
- Package dependencies for table formats
- SQL extensions for advanced features
- Catalog configurations for data management
- Performance optimizations
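
The contents of `spark-defaults.conf` are not shown in this notebook, so here is a hedged sketch of what such a file typically holds for this combination. The package versions come from the JAR names in the session log; the Iceberg catalog name `local`, its `hadoop` type, and the `iceberg` warehouse subdirectory are assumptions, as is the helper `render_spark_defaults`.

```python
# Assumed settings; the real spark-defaults.conf is not part of this notebook.
WAREHOUSE_DIR = "/home/aldamiz/warehouse"  # same location the notebook uses

SPARK_DEFAULTS = {
    # Table-format packages (versions match the JARs listed in the session log)
    "spark.jars.packages": ",".join([
        "io.delta:delta-core_2.12:2.4.0",
        "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1",
    ]),
    # SQL extensions for both formats
    "spark.sql.extensions": ",".join([
        "io.delta.sql.DeltaSparkSessionExtension",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    ]),
    # Delta takes over the default session catalog
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    # Hypothetical Iceberg catalog named "local", backed by a directory
    "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.local.type": "hadoop",
    "spark.sql.catalog.local.warehouse": f"{WAREHOUSE_DIR}/iceberg",
}


def render_spark_defaults(conf: dict) -> str:
    """Render settings as whitespace-separated key/value lines."""
    width = max(len(key) for key in conf)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in conf.items())


print(render_spark_defaults(SPARK_DEFAULTS))
```

The same dictionary could instead be applied in code by calling `SparkSession.builder.config(key, value)` for each entry before `getOrCreate()`.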

In [3]:
# Cell 3: Spark Initialization
from pyspark.sql import SparkSession

# Create Spark session (configurations from spark-defaults.conf)
spark = (SparkSession.builder
        .appName("CitiBike-Processing")
        .getOrCreate())

# Verify Delta Lake
try:
    # Writing a one-row table fails fast if the Delta Lake JARs or extensions are missing
    (spark.range(1)
        .write
        .format("delta")
        .mode("overwrite")
        .save(f"{WAREHOUSE_DIR}/delta/test_delta"))
    etl_logger.info("✓ Delta Lake integration verified")
except Exception as e:
    etl_logger.error(f"Error verifying Delta Lake: {str(e)}", exc_info=True)
    raise

# Log configuration for verification
etl_logger.info("Spark Session created with configurations:")
for k, v in spark.sparkContext.getConf().getAll():
    etl_logger.info(f"- {k}: {v}")

display(Markdown("✅ Spark initialized with Delta Lake and Iceberg support"))

2025-03-05 00:01:15 - etl - INFO - ✓ Delta Lake integration verified
2025-03-05 00:01:15 - etl - INFO - Spark Session created with configurations:
2025-03-05 00:01:16 - etl - INFO - - spark.app.initial.jar.urls: spark://0.0.0.0:44849/jars/io.delta_delta-core_2.12-2.4.0.jar,spark://0.0.0.0:44849/jars/org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.3.1.jar,spark://0.0.0.0:44849/jars/io.delta_delta-storage-2.4.0.jar,spark://0.0.0.0:44849/jars/org.antlr_antlr4-runtime-4.9.3.jar
2025-03-05 00:01:16 - etl - INFO - - spark.eventLog.enabled: true
2025-03-05 00:01:16 - etl - INFO - - spark.history.fs.cleaner.maxAge: 7d
2025-03-05 00:01:16 - etl - INFO - - spark.jars: file:///home/aldamiz/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar,file:///home/aldamiz/.ivy2/jars/io.delta_delta-storage-2.4.0.jar,file:///home/aldamiz/.ivy2/jars/org.apache.iceberg_iceberg-spark-runtime-3.4_2.12-1.3.1.jar,file:///home/aldamiz/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar
2025-03-05 00:01:16 - etl - INFO - - s

✅ Spark initialized with Delta Lake and Iceberg support

## Data Acquisition

Now we'll set up the data acquisition process for both NYC and Jersey City bike share data. The data follows these patterns:

### File Naming Conventions:
- NYC (2024 onwards): `YYYYMM-citibike-tripdata.zip`
- Jersey City: `JC-YYYYMM-citibike-tripdata.csv.zip`

We'll start with the January 2025 data. The cells below handle the NYC archive.
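
Because the download cell skips any file that already exists, an interrupted transfer could leave a partial archive that is never retried. One possible safeguard, sketched here under that assumption, is to download to a temporary name and rename it only once the transfer finishes. The helper name `download_once` and the `.part` suffix are illustrative and not part of the notebook.

```python
import os
import urllib.request
from pathlib import Path


def download_once(url: str, dest: Path) -> bool:
    """Download url to dest unless dest exists. Return True if a download happened."""
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    try:
        urllib.request.urlretrieve(url, tmp)
        os.replace(tmp, dest)  # rename within one directory, so dest is never half-written
    finally:
        if tmp.exists():
            tmp.unlink()  # remove the partial file after a failed transfer
    return True
```

With this pattern, the presence of the final file name is a reliable signal that the archive is complete.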

In [4]:
# NYC Data Download
display(Markdown("## Downloading NYC Data"))

# Setup paths for NYC data
nyc_filename = f"{YEAR}{MONTH:02d}-citibike-tripdata.zip"
nyc_url = f"{BASE_URL}/{nyc_filename}"
nyc_output_dir = os.path.join(LANDING_PATH, "nyc", str(YEAR), f"{MONTH:02d}")
os.makedirs(nyc_output_dir, exist_ok=True)
nyc_output_path = os.path.join(nyc_output_dir, nyc_filename)

# Download if file doesn't exist
if not os.path.exists(nyc_output_path):
    etl_logger.info(f"Downloading {nyc_filename}")
    display(Markdown(f"📥 Downloading `{nyc_filename}`..."))
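    # urlretrieve reports the total size itself, so no separate urlopen request is needed
    with tqdm.tqdm(unit='B', unit_scale=True) as pbar:
        def report_hook(count, block_size, total_size):
            # urlretrieve first calls the hook with count == 0, before any data arrives
            if count == 0:
                pbar.total = total_size if total_size > 0 else None
                return
            # The final block is usually partial, so cap the update at the remaining bytes
            remaining = total_size - pbar.n if total_size > 0 else block_size
            pbar.update(min(block_size, remaining))
        urllib.request.urlretrieve(nyc_url, nyc_output_path, reporthook=report_hook)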
    etl_logger.info(f"Successfully downloaded {nyc_filename}")
    display(Markdown(f"✅ Successfully downloaded `{nyc_filename}`"))
else:
    etl_logger.info(f"File already exists: {nyc_filename}")
    display(Markdown(f"ℹ️ File already exists: `{nyc_filename}`"))

## Downloading NYC Data

2025-03-05 00:01:16 - etl - INFO - Downloading 202501-citibike-tripdata.zip


📥 Downloading `202501-citibike-tripdata.zip`...

414MB [00:31, 13.2MB/s]                              
2025-03-05 00:01:48 - etl - INFO - Successfully downloaded 202501-citibike-tripdata.zip


✅ Successfully downloaded `202501-citibike-tripdata.zip`

In [5]:
# Control Table Setup
display(Markdown("## Control Table Setup"))

from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, LongType, IntegerType

# Define control table path and schema
control_table_path = os.path.join(BRONZE_PATH, "control_table")
control_schema = StructType([
    StructField("source_file", StringType(), False),
    StructField("file_path", StringType(), False),
    StructField("processing_date", TimestampType(), False),
    StructField("record_count", LongType(), False),
    StructField("status", StringType(), False),
    StructField("year", IntegerType(), False),
    StructField("month", IntegerType(), False)
])

# Load or create control table
try:
    control_df = spark.read.format("delta").load(control_table_path)
    etl_logger.info("Control table loaded successfully")
    display(Markdown("ℹ️ Using existing control table"))
except Exception:  # Raised when the Delta table does not exist yet
    etl_logger.info("Creating new control table")
    empty_control_df = spark.createDataFrame([], schema=control_schema)
    (empty_control_df.write
        .format("delta")
        .mode("errorIfExists")  # This ensures we don't accidentally overwrite
        .save(control_table_path))
    control_df = spark.read.format("delta").load(control_table_path)
    display(Markdown("✅ Created new control table"))

# Display current control table status
display(Markdown("## Current Control Table Status"))
display(control_df.orderBy(F.col("processing_date").desc()))

## Control Table Setup

2025-03-05 00:01:48 - etl - INFO - Creating new control table


✅ Created new control table

## Current Control Table Status

DataFrame[source_file: string, file_path: string, processing_date: timestamp, record_count: bigint, status: string, year: int, month: int]

In [6]:
# Check File Processing Status
display(Markdown("## Checking File Status"))

# Define file paths
nyc_filename = f"{YEAR}{MONTH:02d}-citibike-tripdata.zip"
nyc_output_path = os.path.join(LANDING_PATH, "nyc", str(YEAR), f"{MONTH:02d}", nyc_filename)
bronze_output_path = os.path.join(BRONZE_PATH, "rides_nyc")

# Check if source file exists
if not os.path.exists(nyc_output_path):
    etl_logger.error(f"Source file not found: {nyc_output_path}")
    display(Markdown("❌ Source file not found"))
    raise FileNotFoundError(f"Source file not found: {nyc_output_path}")

# Check if this file has already been processed successfully
processed_files = control_df.filter(
    (F.col("source_file") == nyc_filename) & 
    (F.col("year") == YEAR) & 
    (F.col("month") == MONTH) &
    (F.col("status") == "COMPLETED")
)

if processed_files.count() > 0:
    last_processed = processed_files.orderBy(F.col("processing_date").desc()).first()
    etl_logger.info(f"File already processed on {last_processed.processing_date}")
    display(Markdown(f"ℹ️ File was already processed on {last_processed.processing_date}"))
    display(Markdown("⏭️ Skipping processing"))
    should_process = False
else:
    etl_logger.info("File needs processing")
    display(Markdown("✅ File needs processing"))
    should_process = True

# Store the processing flag for next cell
%store should_process

## Checking File Status

2025-03-05 00:01:52 - etl - INFO - File needs processing


✅ File needs processing

Stored 'should_process' (bool)
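
The check above amounts to a simple rule: skip a file only if the control table holds a `COMPLETED` row for the same file, year, and month, so earlier `FAILED` attempts do not block a retry. Here is a plain-Python sketch of that rule, with the control table modeled as a list of dicts and the sample rows invented purely for illustration:

```python
from datetime import datetime


def needs_processing(control_rows, source_file, year, month):
    """Return True unless a COMPLETED row exists for this file and period."""
    return not any(
        row["source_file"] == source_file
        and row["year"] == year
        and row["month"] == month
        and row["status"] == "COMPLETED"
        for row in control_rows
    )


# Made-up control rows: one failed attempt, then a successful one
rows = [{
    "source_file": "202501-citibike-tripdata.zip",
    "year": 2025,
    "month": 1,
    "status": "FAILED",
    "processing_date": datetime(2025, 3, 4, 23, 0),
}]
print(needs_processing(rows, "202501-citibike-tripdata.zip", 2025, 1))  # True

rows.append({**rows[0], "status": "COMPLETED"})
print(needs_processing(rows, "202501-citibike-tripdata.zip", 2025, 1))  # False
```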


In [7]:
# Retrieve the processing flag
%store -r should_process

if not should_process:
    etl_logger.info("Skipping processing as file was already processed")
    display(Markdown("⏭️ Processing skipped"))
else:
    display(Markdown("## Processing Data"))
    
    try:
        # Create temporary processing directory
        extract_dir = os.path.join(nyc_output_dir, "extracted")
        os.makedirs(extract_dir, exist_ok=True)

        # Extract files if needed
        if not any(fname.endswith('.csv') for fname in os.listdir(extract_dir)):
            etl_logger.info("Extracting ZIP file")
            display(Markdown("📂 Extracting data..."))
            with zipfile.ZipFile(nyc_output_path, 'r') as zip_ref:
                zip_ref.extractall(extract_dir)

        # Process CSV files
        csv_files = glob.glob(os.path.join(extract_dir, "*.csv"))
        if not csv_files:
            raise ValueError("No CSV files found in archive")

        # Read and combine all CSV files
        all_dfs = []
        total_records = 0

        for csv_path in csv_files:
            etl_logger.info(f"Reading {os.path.basename(csv_path)}")
            df = spark.read.csv(csv_path, header=True)
            
            # Add metadata columns
            df = df.withColumn("ingestion_date", F.current_timestamp()) \
                   .withColumn("source_file", F.lit(nyc_filename)) \
                   .withColumn("city", F.lit("nyc")) \
                   .withColumn("year", F.lit(YEAR).cast(IntegerType())) \
                   .withColumn("month", F.lit(MONTH).cast(IntegerType()))
            
            all_dfs.append(df)
            total_records += df.count()

        # Combine all DataFrames
        if not all_dfs:
            raise ValueError("No data was read from CSV files")
            
        combined_df = all_dfs[0]
        for df in all_dfs[1:]:
            combined_df = combined_df.unionByName(df)  # Match columns by name, not position

        # Write data ONLY if the bronze table doesn't exist or if this partition doesn't exist
        bronze_exists = False
        try:
            existing_df = spark.read.format("delta").load(bronze_output_path)
            bronze_exists = True
            
            # Check if this partition exists
            partition_exists = existing_df.filter(
                (F.col("year") == YEAR) & 
                (F.col("month") == MONTH)
            ).count() > 0
            
            if partition_exists:
                raise ValueError(f"Data for {YEAR}-{MONTH} already exists in bronze layer")
                
        except Exception as e:
            if "Path does not exist" not in str(e):
                raise e

        # Write data
        write_mode = "append" if bronze_exists else "errorIfExists"
        
        (combined_df.write
            .format("delta")
            .partitionBy("year", "month")
            .mode(write_mode)
            .save(bronze_output_path))

        # Record successful processing in control table
        control_record = spark.createDataFrame([(
            nyc_filename,
            nyc_output_path,
            datetime.now(),
            total_records,  # Python int, stored as LongType
            "COMPLETED",
            int(YEAR),
            int(MONTH)
        )], schema=control_schema)

        (control_record.write
            .format("delta")
            .mode("append")
            .save(control_table_path))

        etl_logger.info(f"Successfully processed {total_records} records")
        display(Markdown(f"✅ Processed {total_records:,} records"))

    except Exception as e:
        # Record failure in control table
        error_record = spark.createDataFrame([(
            nyc_filename,
            nyc_output_path,
            datetime.now(),
            0,  # No records loaded on failure
            "FAILED",
            int(YEAR),
            int(MONTH)
        )], schema=control_schema)

        (error_record.write
            .format("delta")
            .mode("append")
            .save(control_table_path))

        etl_logger.error(f"Processing failed: {str(e)}")
        display(Markdown(f"❌ Processing failed: {str(e)}"))
        raise

# Display final status
display(Markdown("## Final Status"))
display(spark.read.format("delta").load(control_table_path).orderBy(F.col("processing_date").desc()))

## Processing Data

2025-03-05 00:01:53 - etl - INFO - Extracting ZIP file


📂 Extracting data...

2025-03-05 00:01:54 - etl - INFO - Reading 202501-citibike-tripdata_2.csv
2025-03-05 00:01:55 - etl - INFO - Reading 202501-citibike-tripdata_3.csv
2025-03-05 00:01:55 - etl - INFO - Reading 202501-citibike-tripdata_1.csv
2025-03-05 00:02:06 - etl - INFO - Successfully processed 2124475 records


✅ Processed 2,124,475 records

## Final Status

DataFrame[source_file: string, file_path: string, processing_date: timestamp, record_count: bigint, status: string, year: int, month: int]

## Silver Layer Transformations

In this section, we'll enrich our bronze data with:
1. Ride duration calculations
2. Great-circle distance calculations (the code uses the spherical law of cosines rather than the Haversine formula)
3. Time-based features (part of day, weekends, seasons)
4. Speed calculations
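
To make the feature definitions concrete, here is a plain-Python sketch of the same calculations for a single ride. It mirrors the Spark expressions in the cell below, with a few stated differences: the function names are made up, the cosine is clamped to [-1, 1] before `acos` (the Spark expression does not do this), speed uses unrounded inputs, and a Haversine version is included only for comparison.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # same mean radius as the Spark expression


def great_circle_km(lat1, lng1, lat2, lng2):
    """Distance via the spherical law of cosines, cosine clamped to [-1, 1]."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlng = math.radians(lng1 - lng2)
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(dlng))
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, cos_angle)))


def haversine_km(lat1, lng1, lat2, lng2):
    """Same distance via the Haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlng = p2 - p1, math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlng / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def part_of_day(hour):
    if 5 <= hour <= 11:
        return "morning"
    if 12 <= hour <= 16:
        return "afternoon"
    if 17 <= hour <= 21:
        return "evening"
    return "night"


def season(month):
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


def enrich(started_at, ended_at, start, end):
    """start and end are (lat, lng) tuples in degrees."""
    minutes = (ended_at - started_at).total_seconds() / 60
    km = great_circle_km(*start, *end)
    return {
        "ride_duration_minutes": round(minutes, 2),
        "ride_distance_km": round(km, 2),
        "speed_kmh": round(km / (minutes / 60), 2) if minutes else None,
        "part_of_day": part_of_day(started_at.hour),
        # Spark's dayofweek() counts 1 = Sunday and 7 = Saturday;
        # isoweekday() uses 6 = Saturday and 7 = Sunday
        "is_weekend": started_at.isoweekday() >= 6,
        "season": season(started_at.month),
    }


# Made-up ride between two Manhattan coordinates
print(enrich(datetime(2025, 1, 25, 18, 58), datetime(2025, 1, 25, 19, 8),
             (40.7128, -74.0060), (40.7580, -73.9855)))
```

The two distance functions agree closely at city scale; the clamp matters mainly for rides that start and end at the same coordinates, where rounding can push the cosine marginally above 1.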

In [8]:
# Silver Layer Processing
display(Markdown("## Silver Layer Transformations"))
etl_logger.info("Starting silver layer transformations")

try:
    # Set time parser policy
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    # Read from bronze
    bronze_df = spark.read.format("delta").load(os.path.join(BRONZE_PATH, "rides_nyc"))
    etl_logger.info(f"Read {bronze_df.count():,} records from bronze layer")

    # Create a new DataFrame with proper types instead of modifying existing ones
    silver_df = bronze_df.select(
        "*",  # Keep all original columns
        F.to_timestamp("started_at").alias("started_at_clean"),
        F.to_timestamp("ended_at").alias("ended_at_clean"),
        # Cast coordinates to double for calculations but keep original string columns
        F.col("start_lat").cast("double").alias("start_lat_double"),
        F.col("start_lng").cast("double").alias("start_lng_double"),
        F.col("end_lat").cast("double").alias("end_lat_double"),
        F.col("end_lng").cast("double").alias("end_lng_double")
    )

    # Calculate enriched features using the double-type columns
    silver_df = silver_df.withColumn(
        "ride_duration_minutes", 
        F.round(
            (F.unix_timestamp("ended_at_clean") - F.unix_timestamp("started_at_clean")) / 60, 
            2
        )
    ).withColumn(
        "ride_distance_km", 
        F.round(
            F.acos(
                F.sin(F.radians("start_lat_double")) * F.sin(F.radians("end_lat_double")) + 
                F.cos(F.radians("start_lat_double")) * F.cos(F.radians("end_lat_double")) * 
                F.cos(F.radians("start_lng_double") - F.radians("end_lng_double"))
            ) * 6371,
            2
        )
    ).withColumn(
        "speed_kmh",
        F.round(F.col("ride_distance_km") / (F.col("ride_duration_minutes") / 60), 2)
    ).withColumn(
        "part_of_day",
        F.when(F.hour("started_at_clean").between(5, 11), "morning")
         .when(F.hour("started_at_clean").between(12, 16), "afternoon")
         .when(F.hour("started_at_clean").between(17, 21), "evening")
         .otherwise("night")
    ).withColumn(
        "is_weekend",
        F.dayofweek("started_at_clean").isin([1, 7])
    ).withColumn(
        "season",
        F.when(F.month("started_at_clean").isin([12, 1, 2]), "winter")
         .when(F.month("started_at_clean").isin([3, 4, 5]), "spring")
         .when(F.month("started_at_clean").isin([6, 7, 8]), "summer")
         .otherwise("fall")
    ).withColumn(
        "extra_time_charge",
        F.when(
            (F.col("ride_duration_minutes") > 30) & 
            (F.col("ride_duration_minutes").isNotNull()), 
            0.2
        ).otherwise(0.0)
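    ).withColumn(
        "distance_bucket",
        # Conditions are evaluated in order and between() is inclusive, so a boundary
        # value such as 4.0 lands in the first bucket that matches ("1-4 km").
        # Note: rides between 9 and 10 km fall through to the "10+ km" bucket.
        F.when(F.col("ride_distance_km").isNull(), "unknown")
         .when(F.col("ride_distance_km") < 1, "0-1 km")
         .when(F.col("ride_distance_km").between(1, 4), "1-4 km")
         .when(F.col("ride_distance_km").between(4, 9), "4-9 km")
         .otherwise("10+ km")
    )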

    # Drop temporary columns
    silver_df = silver_df.drop(
        "started_at_clean", "ended_at_clean",
        "start_lat_double", "start_lng_double",
        "end_lat_double", "end_lng_double"
    )

    # Cache the DataFrame for multiple actions
    silver_df.cache()

    # Log data quality metrics
    log_data_quality(silver_df, "silver_transformation")
    
    # Calculate revenue metrics
    display(Markdown("### Revenue Metrics"))
    silver_df.agg(
        F.format_number(F.sum("extra_time_charge"), 2).alias("total_revenue_usd"),
        F.count(F.when(F.col("ride_duration_minutes") > 30, True)).alias("charged_rides"),
        F.count("*").alias("total_rides")
    ).show()

    # Calculate distance distribution
    display(Markdown("### Distance Distribution"))
    silver_df.groupBy("distance_bucket").agg(
        F.count("*").alias("ride_count"),
        F.format_number(F.avg("ride_duration_minutes"), 2).alias("avg_duration_minutes"),
        F.format_number(F.sum("extra_time_charge"), 2).alias("revenue_usd")
    ).orderBy("distance_bucket").show()

    # Write to silver layer with schema evolution
    silver_path = os.path.join(SILVER_PATH, "rides_enriched")
    start_time = time.time()

    (silver_df.write
        .format("delta")
        .mode("overwrite")
        .option("mergeSchema", "true")
        .partitionBy("year", "month")
        .save(silver_path))

    # Uncache the DataFrame
    silver_df.unpersist()

    log_performance_metrics(start_time, "silver_write")
    etl_logger.info(f"Silver layer written to: {silver_path}")
    display(Markdown("✅ Silver layer processing complete"))

    # Display sample of processed data
    display(Markdown("### Sample of Processed Data"))
    spark.read.format("delta").load(silver_path).select(
        "started_at", "ended_at", 
        "ride_duration_minutes", "ride_distance_km", 
        "speed_kmh", "part_of_day", "is_weekend", "season",
        "extra_time_charge", "distance_bucket"
    ).limit(5).show()

except Exception as e:
    etl_logger.error(f"Error in silver transformation: {str(e)}", exc_info=True)
    display(Markdown(f"❌ Error: {str(e)}"))
    raise

finally:
    # Ensure DataFrame is uncached in case of error
    if 'silver_df' in locals():
        try:
            silver_df.unpersist()
        except Exception:
            pass

## Silver Layer Transformations

2025-03-05 00:02:06 - etl - INFO - Starting silver layer transformations
2025-03-05 00:02:07 - etl - INFO - Read 2,124,475 records from bronze layer
2025-03-05 00:02:08 - data - INFO - Data quality metrics for silver_transformation:
2025-03-05 00:02:17 - data - INFO - - Row count: 2124475
2025-03-05 00:02:16 - data - INFO - - Null counts: {'ride_id': 0, 'rideable_type': 0, 'started_at': 0, 'ended_at': 0, 'start_station_name': 564, 'start_station_id': 564, 'end_station_name': 3975, 'end_station_id': 4322, 'start_lat': 0, 'start_lng': 0, 'end_lat': 269, 'end_lng': 269, 'member_casual': 0, 'ingestion_date': 0, 'source_file': 0, 'city': 0, 'year': 0, 'month': 0, 'ride_duration_minutes': 0, 'ride_distance_km': 269, 'speed_kmh': 269, 'part_of_day': 0, 'is_weekend': 0, 'season': 0, 'extra_time_charge': 0, 'distance_bucket': 0}
2025-03-05 00:02:28 - data - INFO - - Duplicate count: 0


### Revenue Metrics

+-----------------+-------------+-----------+
|total_revenue_usd|charged_rides|total_rides|
+-----------------+-------------+-----------+
|        11,741.60|        58708|    2124475|
+-----------------+-------------+-----------+



### Distance Distribution

+---------------+----------+--------------------+-----------+
|distance_bucket|ride_count|avg_duration_minutes|revenue_usd|
+---------------+----------+--------------------+-----------+
|         0-1 km|    769615|                5.32|   2,105.60|
|         1-4 km|   1199440|               10.32|   2,760.40|
|         10+ km|      8408|               38.55|   1,311.40|
|         4-9 km|    146743|               24.22|   5,510.60|
|        unknown|       269|            1,494.26|      53.60|
+---------------+----------+--------------------+-----------+



2025-03-05 00:02:37 - etl - INFO - Performance metric - silver_write: 7.62 seconds
2025-03-05 00:02:37 - etl - INFO - Silver layer written to: /home/aldamiz/data/silver/rides_enriched


✅ Silver layer processing complete

### Sample of Processed Data

+--------------------+--------------------+---------------------+----------------+---------+-----------+----------+------+-----------------+---------------+
|          started_at|            ended_at|ride_duration_minutes|ride_distance_km|speed_kmh|part_of_day|is_weekend|season|extra_time_charge|distance_bucket|
+--------------------+--------------------+---------------------+----------------+---------+-----------+----------+------+-----------------+---------------+
|2025-01-23 13:34:...|2025-01-23 13:42:...|                  7.4|            1.35|    10.95|  afternoon|     false|winter|              0.0|         1-4 km|
|2025-01-25 18:58:...|2025-01-25 19:01:...|                 3.12|            0.46|     8.85|    evening|      true|winter|              0.0|         0-1 km|
|2025-01-16 09:33:...|2025-01-16 09:43:...|                 9.57|             1.7|    10.66|    morning|     false|winter|              0.0|         1-4 km|
|2025-01-16 08:32:...|2025-01-16 08:39:...|               

## Gold Layer Analytics

We'll create several analytical views:
1. Station Analysis - Usage patterns and metrics per station
2. Temporal Analysis - Time-based patterns and trends
3. Popular Routes Analysis - Most frequent routes and their characteristics
4. Seasonal Patterns - Ride volume, duration, and distance by season and part of day
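
As a small illustration of the first view, the sketch below performs the station aggregation in plain Python on a handful of invented rides. The function name `station_metrics_py` and the sample data are assumptions for this example; the notebook itself computes these metrics with Spark `groupBy`/`agg`.

```python
from collections import defaultdict


def station_metrics_py(rides):
    """Aggregate ride dicts per start station, busiest station first."""
    groups = defaultdict(list)
    for ride in rides:
        groups[ride["start_station_name"]].append(ride)

    metrics = []
    for station, items in groups.items():
        destinations = {
            r["end_station_name"] for r in items
            if r["end_station_name"] is not None  # distinct counts skip nulls
        }
        metrics.append({
            "start_station_name": station,
            "total_starts": len(items),
            "avg_ride_duration": sum(r["ride_duration_minutes"] for r in items) / len(items),
            "unique_destinations": len(destinations),
        })
    return sorted(metrics, key=lambda m: m["total_starts"], reverse=True)


# Invented sample rides
rides = [
    {"start_station_name": "A", "end_station_name": "B", "ride_duration_minutes": 10.0},
    {"start_station_name": "A", "end_station_name": "C", "ride_duration_minutes": 20.0},
    {"start_station_name": "A", "end_station_name": None, "ride_duration_minutes": 30.0},
    {"start_station_name": "B", "end_station_name": "A", "ride_duration_minutes": 5.0},
]
for row in station_metrics_py(rides):
    print(row)
```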

In [9]:
# Gold Layer - Station Analysis
display(Markdown("## Gold Layer - Station Analysis"))
etl_logger.info("Starting gold layer - station analysis")
start_time = time.time()

station_metrics = (silver_df
    .groupBy("start_station_name", "start_lat", "start_lng")
    .agg(
        F.count("*").alias("total_starts"),
        F.avg("ride_duration_minutes").alias("avg_ride_duration"),
        F.avg("ride_distance_km").alias("avg_ride_distance"),
        F.countDistinct("end_station_name").alias("unique_destinations")
    )
    .orderBy(F.desc("total_starts")))

# Write station metrics
gold_base_path = os.path.join(GOLD_PATH, "analytics")
(station_metrics.write
    .format("delta")
    .mode("overwrite")
    .save(f"{gold_base_path}/station_metrics"))

log_performance_metrics(start_time, "station_metrics_creation")
log_data_quality(station_metrics, "station_metrics")

## Gold Layer - Station Analysis

2025-03-05 00:02:37 - etl - INFO - Starting gold layer - station analysis
2025-03-05 00:02:42 - etl - INFO - Performance metric - station_metrics_creation: 5.13 seconds
2025-03-05 00:02:42 - data - INFO - Data quality metrics for station_metrics:
2025-03-05 00:02:43 - data - INFO - - Row count: 2320
2025-03-05 00:02:43 - data - INFO - - Null counts: {'start_station_name': 145, 'start_lat': 0, 'start_lng': 0, 'total_starts': 0, 'avg_ride_duration': 0, 'avg_ride_distance': 0, 'unique_destinations': 0}
2025-03-05 00:02:43 - data - INFO - - Duplicate count: 0


In [10]:
# Gold Layer - Temporal Analysis
display(Markdown("## Gold Layer - Temporal Analysis"))
etl_logger.info("Starting gold layer - temporal analysis")
start_time = time.time()

temporal_metrics = (silver_df
    .groupBy("year", "month", "part_of_day", "is_weekend")
    .agg(
        F.count("*").alias("total_rides"),
        F.avg("ride_duration_minutes").alias("avg_duration"),
        F.avg("ride_distance_km").alias("avg_distance"),
        F.avg("speed_kmh").alias("avg_speed")
    ))

# Write temporal metrics
(temporal_metrics.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month")
    .save(f"{gold_base_path}/temporal_metrics"))

log_performance_metrics(start_time, "temporal_metrics_creation")
log_data_quality(temporal_metrics, "temporal_metrics")

## Gold Layer - Temporal Analysis

2025-03-05 00:02:44 - etl - INFO - Starting gold layer - temporal analysis
2025-03-05 00:02:46 - etl - INFO - Performance metric - temporal_metrics_creation: 2.89 seconds
2025-03-05 00:02:46 - data - INFO - Data quality metrics for temporal_metrics:
2025-03-05 00:02:47 - data - INFO - - Row count: 8
2025-03-05 00:02:49 - data - INFO - - Null counts: {'year': 0, 'month': 0, 'part_of_day': 0, 'is_weekend': 0, 'total_rides': 0, 'avg_duration': 0, 'avg_distance': 0, 'avg_speed': 0}
2025-03-05 00:02:50 - data - INFO - - Duplicate count: 0


In [11]:
# Gold Layer - Route Analysis
display(Markdown("## Gold Layer - Route Analysis"))
etl_logger.info("Starting gold layer - route analysis")
start_time = time.time()

popular_routes = (silver_df
    .groupBy("start_station_name", "end_station_name")
    .agg(
        F.count("*").alias("route_count"),
        F.avg("ride_distance_km").alias("avg_distance"),
        F.avg("ride_duration_minutes").alias("avg_duration")
    )
    .orderBy(F.desc("route_count")))

# Write popular routes
(popular_routes.write
    .format("delta")
    .mode("overwrite")
    .save(f"{gold_base_path}/popular_routes"))

log_performance_metrics(start_time, "route_analysis_creation")
log_data_quality(popular_routes, "popular_routes")

## Gold Layer - Route Analysis

2025-03-05 00:02:50 - etl - INFO - Starting gold layer - route analysis
2025-03-05 00:02:54 - etl - INFO - Performance metric - route_analysis_creation: 4.08 seconds
2025-03-05 00:02:54 - data - INFO - Data quality metrics for popular_routes:
2025-03-05 00:02:55 - data - INFO - - Row count: 388392
2025-03-05 00:02:58 - data - INFO - - Null counts: {'start_station_name': 344, 'end_station_name': 1255, 'route_count': 0, 'avg_distance': 67, 'avg_duration': 0}
2025-03-05 00:02:59 - data - INFO - - Duplicate count: 0


In [12]:
# Gold Layer - Seasonal Analysis
display(Markdown("## Gold Layer - Seasonal Analysis"))
etl_logger.info("Starting gold layer - seasonal analysis")
start_time = time.time()

seasonal_patterns = (silver_df
    .groupBy("season", "part_of_day")
    .agg(
        F.count("*").alias("total_rides"),
        F.avg("ride_duration_minutes").alias("avg_duration"),
        F.avg("ride_distance_km").alias("avg_distance")
    ))

# Write seasonal patterns
(seasonal_patterns.write
    .format("delta")
    .mode("overwrite")
    .save(f"{gold_base_path}/seasonal_patterns"))

log_performance_metrics(start_time, "seasonal_analysis_creation")
log_data_quality(seasonal_patterns, "seasonal_patterns")

etl_logger.info("Gold layer processing complete")

## Gold Layer - Seasonal Analysis

2025-03-05 00:02:59 - etl - INFO - Starting gold layer - seasonal analysis
2025-03-05 00:03:01 - etl - INFO - Performance metric - seasonal_analysis_creation: 2.71 seconds
2025-03-05 00:03:01 - data - INFO - Data quality metrics for seasonal_patterns:
2025-03-05 00:03:02 - data - INFO - - Row count: 4
2025-03-05 00:03:04 - data - INFO - - Null counts: {'season': 0, 'part_of_day': 0, 'total_rides': 0, 'avg_duration': 0, 'avg_distance': 0}
2025-03-05 00:03:05 - data - INFO - - Duplicate count: 0
2025-03-05 00:03:05 - etl - INFO - Gold layer processing complete
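
A row count of 4 is what a single season crossed with four parts of the day would produce, which suggests the loaded data covers only one season. A coverage check along the following lines could make that explicit. This is a minimal sketch: `"winter"` and the four `part_of_day` labels are taken from the notebook's seasonal table, while the other season names and the `missing_combinations` helper are assumptions for illustration.

```python
from itertools import product

# "winter" and the four part_of_day labels come from the notebook's output;
# the remaining season names are assumptions about how the column is labeled.
EXPECTED_SEASONS = ["winter", "spring", "summer", "autumn"]
EXPECTED_PARTS = ["morning", "afternoon", "evening", "night"]

def missing_combinations(observed):
    """Return the expected (season, part_of_day) pairs that are absent from `observed`."""
    expected = set(product(EXPECTED_SEASONS, EXPECTED_PARTS))
    return sorted(expected - set(observed))

# Pairs as they would be read back from seasonal_patterns in this run
observed = [("winter", part) for part in EXPECTED_PARTS]
missing = missing_combinations(observed)

print(f"{len(missing)} of {len(EXPECTED_SEASONS) * len(EXPECTED_PARTS)} combinations have no rides")
```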


## Key Insights Visualization

The next cell reads the gold-layer Delta tables back and displays four views of the results:
1. Busiest stations and their characteristics
2. Most popular routes
3. Seasonal patterns
4. Daily usage patterns

In [13]:
# Display Key Insights
display(Markdown("## Key Insights"))

try:
    # 1. Top 10 Busiest Stations
    display(Markdown("### Top 10 Busiest Stations"))
    station_metrics = spark.read.format("delta").load(f"{gold_base_path}/station_metrics")
    station_metrics.select(
        "start_station_name",
        F.round("start_lat", 4).alias("latitude"),
        F.round("start_lng", 4).alias("longitude"),
        "total_starts",
        F.round("avg_ride_duration", 2).alias("avg_duration_min"),
        F.round("avg_ride_distance", 2).alias("avg_distance_km"),
        "unique_destinations"
    ).orderBy(F.desc("total_starts")).limit(10).show()

    # 2. Popular Routes
    display(Markdown("### Top 10 Most Popular Routes"))
    popular_routes = spark.read.format("delta").load(f"{gold_base_path}/popular_routes")
    popular_routes.select(
        "start_station_name",
        "end_station_name",
        "route_count",
        F.round("avg_distance", 2).alias("avg_distance_km"),
        F.round("avg_duration", 2).alias("avg_duration_min")
    ).orderBy(F.desc("route_count")).limit(10).show()

    # 3. Seasonal Patterns
    display(Markdown("### Seasonal Riding Patterns"))
    seasonal_patterns = spark.read.format("delta").load(f"{gold_base_path}/seasonal_patterns")
    seasonal_patterns.select(
        "season",
        "part_of_day",
        "total_rides",
        F.round("avg_duration", 2).alias("avg_duration_min"),
        F.round("avg_distance", 2).alias("avg_distance_km")
    ).orderBy("season", "part_of_day").show()

    # 4. Daily Pattern Analysis
    display(Markdown("### Daily Pattern Analysis"))
    temporal_metrics = spark.read.format("delta").load(f"{gold_base_path}/temporal_metrics")
    daily_patterns = temporal_metrics.groupBy("part_of_day").agg(
        F.round(F.avg("avg_duration"), 2).alias("avg_duration_minutes"),
        F.round(F.avg("avg_distance"), 2).alias("avg_distance_km"),
        F.round(F.avg("avg_speed"), 2).alias("avg_speed_kmh")
    ).orderBy("part_of_day")
    daily_patterns.show()

    # Summary statistics, collected in a single aggregation pass.
    # Note: avg_duration and avg_distance are unweighted means of the
    # per-group averages in temporal_metrics, not ride-weighted means.
    summary_row = temporal_metrics.agg(
        F.sum("total_rides").alias("total_rides"),
        F.avg("avg_duration").alias("avg_duration"),
        F.avg("avg_distance").alias("avg_distance")
    ).collect()[0]
    total_rides = summary_row["total_rides"]
    avg_duration = summary_row["avg_duration"]
    avg_distance = summary_row["avg_distance"]

    import math

    # Format a statistic to fixed decimals. None and NaN are rendered as
    # "0.00", so a zero in the summary can also mean "value unavailable".
    def format_stat(value, decimal_places=2):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return "0.00"
        return f"{value:.{decimal_places}f}"

    # The heading is part of summary_text, so it is displayed only once
    summary_text = "\n".join([
        "### Summary Statistics",
        f"- Total number of rides: {total_rides:,}",
        f"- Average ride duration: {format_stat(avg_duration)} minutes",
        f"- Average ride distance: {format_stat(avg_distance)} km"
    ])
    display(Markdown(summary_text))

except Exception as e:
    display(Markdown(f"❌ Error: {str(e)}"))
    analysis_logger.error(f"Error in key insights generation: {str(e)}", exc_info=True)
    raise

analysis_logger.info("Key insights generated and displayed")

## Key Insights

### Top 10 Busiest Stations

+--------------------+--------+---------+------------+----------------+---------------+-------------------+
|  start_station_name|latitude|longitude|total_starts|avg_duration_min|avg_distance_km|unique_destinations|
+--------------------+--------+---------+------------+----------------+---------------+-------------------+
|     W 21 St & 6 Ave| 40.7417| -73.9942|        8943|            7.89|           1.35|                474|
|     W 31 St & 7 Ave| 40.7492| -73.9916|        7557|            9.02|           1.61|                486|
|Pier 61 at Chelse...| 40.7469| -74.0082|        7526|           10.18|            2.0|                459|
|Lafayette St & E ...| 40.7302|  -73.991|        7334|            8.19|           1.36|                506|
|    11 Ave & W 41 St| 40.7603| -73.9988|        6932|            9.51|            1.8|                457|
|  Broadway & E 14 St| 40.7345| -73.9907|        6626|            8.18|           1.39|                496|
|     Ave A & E 14 St| 40.73

### Top 10 Most Popular Routes

+--------------------+--------------------+-----------+---------------+----------------+
|  start_station_name|    end_station_name|route_count|avg_distance_km|avg_duration_min|
+--------------------+--------------------+-----------+---------------+----------------+
|Norfolk St & Broo...| Henry St & Grand St|        504|           0.67|            3.25|
|     W 21 St & 6 Ave|     9 Ave & W 22 St|        438|           0.78|            4.34|
|North Moore St & ...|Vesey St & Greenw...|        411|           0.85|            4.03|
| Henry St & Grand St|Norfolk St & Broo...|        401|           0.67|            3.79|
|     E 77 St & 1 Ave|     E 77 St & 3 Ave|        401|           0.49|            3.03|
|N 6 St & Bedford Ave|  S 4 St & Wythe Ave|        375|           0.81|            4.21|
|     E 77 St & 1 Ave|     2 Ave & E 72 St|        373|           0.48|            3.64|
|55 Ave & Center Blvd|Vernon Blvd & 50 Ave|        373|            0.6|            3.25|
|North Moore St & ...

### Seasonal Riding Patterns

+------+-----------+-----------+----------------+---------------+
|season|part_of_day|total_rides|avg_duration_min|avg_distance_km|
+------+-----------+-----------+----------------+---------------+
|winter|  afternoon|     671015|           10.32|            NaN|
|winter|    evening|     627970|            9.68|            NaN|
|winter|    morning|     671714|            9.23|            NaN|
|winter|      night|     153776|           10.13|            NaN|
+------+-----------+-----------+----------------+---------------+



### Daily Pattern Analysis

+-----------+--------------------+---------------+-------------+
|part_of_day|avg_duration_minutes|avg_distance_km|avg_speed_kmh|
+-----------+--------------------+---------------+-------------+
|  afternoon|               10.43|            NaN|          NaN|
|    evening|                9.81|            NaN|          NaN|
|    morning|                9.32|            NaN|          NaN|
|      night|               10.22|            NaN|          NaN|
+-----------+--------------------+---------------+-------------+
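
The NaN entries in `avg_distance_km` and `avg_speed_kmh` above are likely caused by NaN values in the underlying ride distances: Spark's `avg` skips nulls but treats NaN as an ordinary double, so one NaN in a group makes the whole average NaN. The plain-Python sketch below shows the same effect on hypothetical distances (the sample values and the `mean_skipping_missing` helper are illustrative, not taken from the notebook).

```python
import math

def mean_skipping_missing(values):
    """Average the values that are neither None nor NaN; return None if none remain."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    return sum(valid) / len(valid) if valid else None

# Hypothetical ride distances in km for one group, with one NaN and one null
distances = [1.2, 0.8, float("nan"), None, 2.0]

# Skipping only None mirrors an aggregate that ignores nulls but not NaN
non_null = [v for v in distances if v is not None]
naive_mean = sum(non_null) / len(non_null)
safe_mean = mean_skipping_missing(distances)

print("ignoring nulls only:", naive_mean)
print("ignoring nulls and NaN:", safe_mean)
```

In PySpark, the same idea could be expressed as `F.avg(F.when(~F.isnan(F.col("ride_distance_km")), F.col("ride_distance_km")))`, which turns NaN into null before averaging. This has not been run against this dataset, and it would be worth finding out why the distances are NaN in the first place.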



### Summary Statistics
- Total number of rides: 2,124,475
- Average ride duration: 9.94 minutes
- Average ride distance: 0.00 km

2025-03-05 00:03:07 - analysis - INFO - Key insights generated and displayed
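
The `0.00 km` in the summary comes from `format_stat`, which maps `None` and NaN to `"0.00"` even though the tables above report the average distance as NaN. A variant that keeps missing values visible might look like the sketch below; the `missing` parameter and its `"N/A"` default are assumptions added for illustration.

```python
import math

def format_stat(value, decimal_places=2, missing="N/A"):
    """Format a number to fixed decimals; render None or NaN as `missing`."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing
    return f"{value:.{decimal_places}f}"

print(f"- Average ride duration: {format_stat(9.94)} minutes")
print(f"- Average ride distance: {format_stat(float('nan'))} km")
```

With this variant, a reader can tell a true zero from an unavailable value, and passing `missing="0.00"` restores the original behavior.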
