# Exercise 2.6 - Interactive Charts with Plotly

## Citi Bike NYC Expansion Dashboard - Chart Development

**Author:** Saurabh Singh  
**Exercise:** Achievement 2, Exercise 2.6  
**Date:** February 2026

---

## Project Overview

### What are we doing?

This notebook develops interactive Plotly charts that will be integrated into a Streamlit dashboard. We're converting the static matplotlib and seaborn visualizations from previous exercises into interactive Plotly versions.

### Why plotly?

- Hover tooltips show exact values automatically
- Zoom and pan functionality built in
- Responsive design (desktop, tablet, mobile)
- Easy integration with Streamlit dashboards

### Charts to create:

1. **Bar chart** - Top 20 most popular stations
2. **Dual-axis line chart** - Weather and ridership correlation

### Code structure:

Functions are defined for reuse in `st_dashboard.py`, keeping the dashboard code clean and maintainable.

---

## 1. Import Libraries

In [None]:
import pandas as pd
from pathlib import Path
from plotly.subplots import make_subplots
import plotly.graph_objects as go

---

## 2. Define File Paths

Centralizing paths avoids repetition and makes the notebook portable across machines.

In [None]:
# Centralized file paths (both are Path objects, built from one root)
OUT_DIR = Path('outputs')
DATA_PATH = OUT_DIR / 'merged_citibike_weather_2022.csv'

# Ensure outputs folder exists
OUT_DIR.mkdir(exist_ok=True)
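
As a small illustration of the path handling above, here is a minimal sketch that builds paths with the `/` operator and calls `mkdir(exist_ok=True)` twice. The temporary directory is only a stand-in for the project folder (an assumption made so the example leaves no files behind).

```python
import tempfile
from pathlib import Path

# Temporary directory used as a stand-in for the project root (assumption)
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'outputs'

    # exist_ok=True makes the call idempotent: no error on the second run
    out_dir.mkdir(exist_ok=True)
    out_dir.mkdir(exist_ok=True)

    # The / operator joins path segments using the OS-appropriate separator
    chart_path = out_dir / 'top20_stations.html'
    folder_created = out_dir.is_dir()
    chart_name = chart_path.name

print(folder_created, chart_name)
```

Because the paths are relative to one root, moving the project only requires changing that root.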

---

## 3. Define Helper Functions

Wrapping logic into functions allows reuse in `st_dashboard.py` without duplicating code.

Each function has a single responsibility, making the notebook easier to read and debug.

In [None]:
def load_data(path):
    """
    Load the merged Citi Bike + weather dataset.
    Validates required columns are present before returning.
    """
    required_columns = ['start_station_name', 'date', 'avgTemp']
    
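    merged_df = pd.read_csv(path)
    
    # Validate required columns before using any of them, so a missing
    # 'date' column produces this message instead of a bare KeyError
    missing = [col for col in required_columns if col not in merged_df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    
    # Parse dates only after the column is known to exist
    merged_df['date'] = pd.to_datetime(merged_df['date'])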
    
    # Drop rows with missing station names
    merged_df = merged_df.dropna(subset=['start_station_name'])
    
    return merged_df
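
To make the column check concrete, the sketch below runs the same list comprehension against a toy CSV held in memory. The helper name `find_missing_columns` and the data are hypothetical and exist only for this illustration.

```python
import io

import pandas as pd


def find_missing_columns(df, required):
    """Return the required column names absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Made-up CSV that lacks the 'avgTemp' column
csv_text = "start_station_name,date\nStation A,2022-01-01\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

missing = find_missing_columns(toy_df, ['start_station_name', 'date', 'avgTemp'])
print(missing)
```

An empty list means the file is safe to process; anything else names exactly which columns to look for.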

In [None]:
def prepare_top_stations(merged_df, n=20):
    """
    Count trips per station and return the top n stations.
    Uses size() for clean counting without mutating the source dataframe.
    """
    station_counts = (
        merged_df
        .groupby('start_station_name')
        .size()
        .reset_index(name='trip_count')
        .nlargest(n, 'trip_count')
        .sort_values('trip_count', ascending=True)  # Ascending for horizontal bar chart
    )
    return station_counts
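
The counting chain above can be traced on a handful of made-up trips. This is only a sketch with toy station names; it uses `n=2` so the result is easy to check by eye.

```python
import pandas as pd

# Made-up trips: one row per trip, as in the merged dataset
trips = pd.DataFrame({
    'start_station_name': ['A', 'B', 'A', 'C', 'A', 'B']
})

top2 = (
    trips
    .groupby('start_station_name')
    .size()                           # number of rows per station
    .reset_index(name='trip_count')   # Series -> DataFrame with a named count
    .nlargest(2, 'trip_count')        # keep the 2 busiest stations
    .sort_values('trip_count', ascending=True)
)
print(top2)
```

The source frame still has its single original column afterwards, which is the "no mutation" property the docstring refers to.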

In [None]:
def prepare_daily(merged_df):
    """
    Aggregate trips by day and compute mean daily temperature.
    Uses mean() for temperature (more stable than first()).
    Drops rows with missing temperature before plotting.
    """
    daily_df = (
        merged_df
        .groupby(merged_df['date'].dt.date)
        .agg(
            bike_rides_daily=('start_station_name', 'count'),
            avgTemp=('avgTemp', 'mean')
        )
        .reset_index()  # The group key is already named 'date'
    )
    
    # Drop rows with missing temperature
    daily_df = daily_df.dropna(subset=['avgTemp'])
    
    return daily_df
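
A toy run of the same aggregation may help show what each output column holds. The rows below are made up: two trips on 1 January at different times of day, and one trip on 2 January.

```python
import pandas as pd

# Made-up rows with a time-of-day component in 'date'
toy = pd.DataFrame({
    'start_station_name': ['A', 'B', 'A'],
    'date': pd.to_datetime(
        ['2022-01-01 08:15', '2022-01-01 17:40', '2022-01-02 09:00']
    ),
    'avgTemp': [1.0, 3.0, 5.0],
})

daily = (
    toy
    .groupby(toy['date'].dt.date)   # drop the time-of-day component
    .agg(
        bike_rides_daily=('start_station_name', 'count'),  # trips per day
        avgTemp=('avgTemp', 'mean'),                       # mean temperature per day
    )
    .reset_index()
)
print(daily)
```

Both 1 January trips collapse into a single row even though their timestamps differ.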

In [None]:
def plot_top_stations(station_counts):
    """
    Create a polished horizontal bar chart of top stations.
    Sorted ascending so highest bar appears at the top.
    """
    fig_top20 = go.Figure(go.Bar(
        x=station_counts['trip_count'],
        y=station_counts['start_station_name'],
        orientation='h',
        marker={'color': station_counts['trip_count'], 'colorscale': 'Blues'},
        hovertemplate='<b>%{y}</b><br>Trips: %{x:,}<extra></extra>'
    ))
    
    fig_top20.update_layout(
        title=f'Top {len(station_counts)} Most Popular Bike Stations in NYC (2022)',  # Follows n
        xaxis_title='Number of Trips',
        yaxis_title='Start Station',
        template='plotly_white',
        width=900,
        height=600,
        margin=dict(l=250, r=40, t=60, b=60)  # Extra left margin for station names
    )
    
    return fig_top20

In [None]:
def plot_daily_vs_temp(daily_df):
    """
    Create a dual-axis line chart showing daily bike rides vs temperature.
    """
    fig_temp_rides = make_subplots(specs=[[{'secondary_y': True}]])
    
    fig_temp_rides.add_trace(
        go.Scatter(
            x=daily_df['date'],
            y=daily_df['bike_rides_daily'],
            name='Daily Bike Rides',
            line={'color': 'blue'},  # Style the line itself
            hovertemplate='Date: %{x}<br>Rides: %{y:,}<extra></extra>'
        ),
        secondary_y=False
    )
    
    fig_temp_rides.add_trace(
        go.Scatter(
            x=daily_df['date'],
            y=daily_df['avgTemp'],
            name='Avg Temperature (°C)',
            line={'color': 'red'},  # Style the line itself
            hovertemplate='Date: %{x}<br>Temp: %{y:.1f}°C<extra></extra>'
        ),
        secondary_y=True
    )
    
    fig_temp_rides.update_layout(
        title='Daily Bike Rides and Temperature Correlation - 2022',
        xaxis_title='Date',
        template='plotly_white',
        height=600
    )
    
    fig_temp_rides.update_yaxes(title_text='Number of Bike Rides', secondary_y=False)
    fig_temp_rides.update_yaxes(title_text='Temperature (°C)', secondary_y=True)
    
    return fig_temp_rides

---

## 4. Load Data

The merged dataset is loaded once. No re-merging is needed because `merged_citibike_weather_2022.csv` already contains the weather columns from Exercise 2.2.

In [None]:
# Load data once using helper function
merged_df = load_data(DATA_PATH)

print(f"Rows: {merged_df.shape[0]:,}")
print(f"Columns: {merged_df.columns.tolist()}")

In [None]:
merged_df.head()

---

## 5. Bar Chart - Top 20 Stations

### Purpose:

Identify the most popular starting stations to inform capacity expansion decisions.

### Design choices:

- **Horizontal orientation**: Better readability for long station names
- **Sorted ascending**: Highest bar appears at the top
- **Blues colorscale**: Consistent with previous exercises
- **Custom hover template**: Clean, formatted tooltip
- **Left margin**: Prevents station names from being cut off

In [None]:
# Prepare top stations data
station_counts = prepare_top_stations(merged_df, n=20)
station_counts.tail(5)  # Sorted ascending, so the tail holds the 5 busiest stations

In [None]:
# Create and display bar chart
fig_top20 = plot_top_stations(station_counts)
fig_top20.show()

In [None]:
# Save for GitHub viewing and dashboard use
fig_top20.write_html(str(OUT_DIR / 'top20_stations.html'))
station_counts.to_csv(OUT_DIR / 'top20.csv', index=False)
print('Saved: outputs/top20_stations.html')
print('Saved: outputs/top20.csv')

---

## 6. Dual-Axis Line Chart

### Purpose:

Show the correlation between temperature and bike ridership to demonstrate seasonal demand patterns.

### Design choices:

- **Dual axis**: Different scales (thousands of trips vs. degrees Celsius)
- **Mean temperature**: More stable than `first()` if multiple rows exist per date
- **Date grouping by day**: Grouping on `date.dt.date` drops any time-of-day component, so each calendar day yields exactly one row
- **Custom hover templates**: Clean, informative tooltips
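
The `mean()` versus `first()` choice can be illustrated with made-up readings in which one day appears twice with slightly different temperatures. This is only a sketch; whether the real dataset contains such duplicates is not established here.

```python
import pandas as pd

# Made-up rows: the same day appears with two different readings
toy = pd.DataFrame({
    'date': ['2022-07-01', '2022-07-01', '2022-07-02'],
    'avgTemp': [24.0, 26.0, 28.0],
})
reversed_toy = toy.iloc[::-1]  # same rows, opposite order

# first() returns whichever row happens to come first in each group
first_original = toy.groupby('date')['avgTemp'].first().tolist()
first_reversed = reversed_toy.groupby('date')['avgTemp'].first().tolist()

# mean() gives the same answer regardless of row order
mean_original = toy.groupby('date')['avgTemp'].mean().tolist()
mean_reversed = reversed_toy.groupby('date')['avgTemp'].mean().tolist()

print(first_original, first_reversed)
print(mean_original, mean_reversed)
```

When every row for a day carries the same temperature, both aggregations agree, so `mean()` costs nothing in that case.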

In [None]:
# Prepare daily aggregation
daily_df = prepare_daily(merged_df)

print(f"Daily rows: {len(daily_df)}")
daily_df.head()

In [None]:
# Create and display dual-axis chart
fig_temp_rides = plot_daily_vs_temp(daily_df)
fig_temp_rides.show()

In [None]:
# Save for GitHub viewing and dashboard use
fig_temp_rides.write_html(str(OUT_DIR / 'rides_vs_temp.html'))
daily_df.to_csv(OUT_DIR / 'daily_data.csv', index=False)
print('Saved: outputs/rides_vs_temp.html')
print('Saved: outputs/daily_data.csv')
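
The chart shows the relationship visually. To put a number on it, one option is Pearson's correlation coefficient computed on the daily table. The sketch below uses made-up values in the same shape as `daily_df`, so the resulting figure says nothing about the real data.

```python
import pandas as pd

# Made-up daily values in the same shape as daily_df
toy_daily = pd.DataFrame({
    'bike_rides_daily': [20000, 35000, 60000, 90000, 110000],
    'avgTemp': [-2.0, 4.0, 12.0, 21.0, 27.0],
})

# Series.corr defaults to Pearson's r, which ranges from -1 to 1
r = toy_daily['bike_rides_daily'].corr(toy_daily['avgTemp'])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 would indicate that ridership rises with temperature; it would not by itself show that temperature causes the change.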

---

## Summary

### Charts created:

1. ✅ **Bar chart**: Top 20 stations, sorted and polished with custom hover
2. ✅ **Dual-axis line chart**: Weather-ridership correlation with clean tooltips

### Files saved to `outputs/`:

| File | Purpose |
|------|---------|
| `top20_stations.html` | Interactive chart for GitHub viewers |
| `rides_vs_temp.html` | Interactive chart for GitHub viewers |
| `top20.csv` | Station data for Streamlit dashboard |
| `daily_data.csv` | Daily aggregated data for Streamlit dashboard |
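
When the dashboard reads these CSV files back, the `date` column does not come back as a datetime unless it is parsed again. The following sketch demonstrates this with a made-up table written to a temporary directory; how `st_dashboard.py` actually loads the files is not shown in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up daily table in the same shape as daily_data.csv
daily = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
    'bike_rides_daily': [52000, 61000],
    'avgTemp': [1.5, 4.0],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'daily_data.csv'
    daily.to_csv(csv_path, index=False)

    # Without parse_dates, 'date' is read back as text
    raw = pd.read_csv(csv_path)
    # With parse_dates, 'date' is restored to a datetime column
    parsed = pd.read_csv(csv_path, parse_dates=['date'])

print(raw['date'].dtype, parsed['date'].dtype)
```

Passing `parse_dates=['date']` on load keeps the x-axis of the line chart a true time axis.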

### Code improvements over initial version:

- **Functions**: All logic wrapped for reuse in `st_dashboard.py`
- **No mutation**: Used `size()` instead of creating dummy `value` column
- **Stable aggregation**: `mean()` for temperature instead of `first()`
- **Validation**: Checks for required columns and missing values
- **Centralized paths**: `DATA_PATH` and `OUT_DIR` constants
- **Descriptive names**: `merged_df`, `daily_df`, `station_counts`, `fig_top20`, `fig_temp_rides`
- **Chart polish**: Custom hover templates, white template, adjusted margins