# Oslo Bike Share System Analysis

Oslo's bike-sharing system, *Bysykkel*, is a popular way to get around the city. Historical ride data is freely available — a paradise for any data scientist. In this project, I explore how topography influences bike movement patterns, identify critical stations in the network, and examine rebalancing needs and temporal usage dynamics.

## Project Goal

To discover actionable insights about Oslo’s bike-sharing system by analyzing patterns driven by topography, network structure, and temporal dynamics — with the aim of supporting operational improvements and user experience.

## Key Questions

1. **How does topography affect the flow of bikes?**
2. **Which stations are most important to the network?**
3. **What temporal patterns drive rebalancing needs?**
4. **How can the system be optimized for better efficiency and reliability?**

**Data Source:** [Oslo Bysykkel Historical Data](https://oslobysykkel.no/apne-data/historisk) (using all data from 2024)

## Project Overview

1. **Data Exploration**
2. **Cleaning in SQL**
3. **Topographical Flow Analysis**
4. **Network Structure Analysis**
5. **Temporal Flow Analysis**
6. **Rebalancing Insights & Optimization**

---

In [None]:
# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import duckdb
import glob
import requests
import os
from datetime import datetime as dt
import seaborn as sns
import folium
import math

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')  # 'seaborn-whitegrid' was renamed in matplotlib 3.6 and removed in 3.8
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 11,
    'axes.titlesize': 12,
    'axes.labelsize': 11,
    'grid.alpha': 0.3
})

# Folium parameters
oslo_coordinates = [59.92381785337289, 10.746284281064217]
zoom = 14.45

## 1. Data Exploration
In this phase, we'll explore the dataset in order to find data quality issues. The findings will be used to determine what cleaning steps are needed.
 
### 1.1 Loading the data
This step reads all monthly CSV files from the `../data/` folder and loads them into a DuckDB database file (`../db/bysykkel_2024.duckdb`). If the database doesn't exist, it will be created.

In [None]:
# Load all CSVs into DuckDB table 'trips_raw'
con = duckdb.connect("../db/bysykkel_2024.duckdb")

# csv_files = glob.glob("../data/??.csv")
# for i, file in enumerate(csv_files):
#     if i == 0:
#         con.execute(f"CREATE OR REPLACE TABLE trips_raw AS SELECT * FROM read_csv_auto('{file}')")
#     else:
#         con.execute(f"INSERT INTO trips_raw SELECT * FROM read_csv_auto('{file}')")

# con.execute("CHECKPOINT")
# print("Loaded trips_raw data into DuckDB")

### 1.2 Exploring Dataset and Columns
In this step we'll explore all the columns to get familiar with the dataset and detect data quality issues. 

In [None]:
trips = con.execute('SELECT * FROM trips_raw').df()
trips.info()

The dataset contains 13 columns, and each row represents one ride. Only `start_station_description` and `end_station_description` contain null values. These columns add no value to the project and will be discarded.  
  
❗ Drop start_station_description and end_station_description columns

#### 1.2.1 started_at & ended_at
These columns state the times at which a ride was started and ended. 

In [None]:
# Sanity check: ended_at should never be earlier than started_at
all_valid = (trips['ended_at'] >= trips['started_at']).all()
print(f"Are all return times after the trip was started? {all_valid}")

In [None]:
# Plot usage frequency

import matplotlib.dates as mdates

# Group by day
daily = trips.groupby(trips['started_at'].dt.date).size()

# Convert index to datetime (from date)
daily.index = pd.to_datetime(daily.index)

# Plot
plt.figure()
plt.plot(daily.index, daily.values, color='royalblue')

# Format x-axis with months
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%b'))

plt.title("Daily Ride Count (Binned by Day)")
plt.xlabel("Month [2024]")
plt.ylabel("Number of Rides")
plt.grid(True)
plt.tight_layout()
plt.savefig('../outputs/daily_ride_count.png')
plt.show()

✅ No problems detected in this column.

#### 1.2.2 Duration
The `duration` column states the duration of the ride in seconds. 

In [None]:
# Plot duration distribution, log scale helps
plt.figure()
sns.histplot(data=trips, x='duration', bins=100, log_scale=True)
plt.title("Ride Duration")
plt.xlabel("Ride Duration [sec]")
plt.axvline(7200, c='grey', linewidth=0.5, linestyle='--', label="2h")
plt.legend(frameon=False)
plt.ylabel("Number of Rides")
plt.tight_layout()
plt.show()

In [None]:
# Count rows with a boolean mask; DataFrame.size would return rows x columns
n_long_rides = (trips['duration'] > 7200).sum()
print(f"""
There are {n_long_rides} rides that exceed 2 hours. It is likely that these are not actual rides but bikes that were unsuccessfully returned. We will remove these from the dataset.  
""")

✅ No missing values.  
❗ Remove rides longer than 2 hours. 
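
When counting rides like this, it is worth remembering that a DataFrame's `.size` counts cells (rows × columns), not rows. The sketch below uses a made-up toy table (not the real `trips` data) to show the difference between `.size`, `len()` and summing a boolean mask.

```python
import pandas as pd

# Toy data standing in for the real `trips` table (values are made up)
toy_trips = pd.DataFrame({
    "duration": [300, 9000, 7200, 12000],
    "start_station_id": [1, 2, 3, 4],
})

long_rides = toy_trips[toy_trips["duration"] > 7200]

n_cells = long_rides.size                                 # rows * columns
n_rows = len(long_rides)                                  # number of rides
n_rows_mask = int((toy_trips["duration"] > 7200).sum())   # same count, no subset needed

print(n_cells, n_rows, n_rows_mask)
```

For a row count, `len()` or the boolean sum is the safer choice, since `.size` grows with the number of columns.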

#### 1.2.3 start_station_id & end_station_id
These columns state a unique id for each station. 

In [None]:
print(f"Number of start stations: {len(set(trips['start_station_id']))}")
print(f"Number of end stations: {len(set(trips['end_station_id']))}")

In [None]:
start_ids = set(trips['start_station_id'].unique())
end_ids = set(trips['end_station_id'].unique())

only_start = start_ids - end_ids
only_end = end_ids - start_ids

print(f"Start-only stations: {len(only_start)}")
print(f"End-only stations: {len(only_end)}")

Check how many trips are **loops**, i.e. trips where the start and end stations are identical. These might distort the picture. 

In [None]:
print(f"There are {len(trips[trips['start_station_id']==trips['end_station_id']])} trips with identical start and end point. These will be discarded.")

✅ No missing values.  
❗ Remove loops

#### 1.2.4 start_station_name & end_station_name

In [None]:
print(f"Number of start station names: {len(set(trips['start_station_name']))}")
print(f"Number of end station names: {len(set(trips['end_station_name']))}")

There are two more unique station names than station IDs. 

In [None]:
start_names = set(trips['start_station_name'].unique())
end_names = set(trips['end_station_name'].unique())

only_start_names = start_names - end_names
only_end_names = end_names - start_names

print(f"Start-only station names: {len(only_start_names)}")
print(f"End-only station names: {len(only_end_names)}")

In [None]:
start_name_map = (
    trips[['start_station_id', 'start_station_name']]
    .drop_duplicates()
    .groupby('start_station_id')['start_station_name']
    .nunique()
)

# IDs with >1 unique name
inconsistent_ids = start_name_map[start_name_map > 1]
print(f"Start station IDs with multiple names: {len(inconsistent_ids)}")
print(inconsistent_ids.head())

In [None]:
end_name_map = (
    trips[['end_station_id', 'end_station_name']]
    .drop_duplicates()
    .groupby('end_station_id')['end_station_name']
    .nunique()
)

# IDs with >1 unique name
inconsistent_ids = end_name_map[end_name_map > 1]
print(f"End station IDs with multiple names: {len(inconsistent_ids)}")
print(inconsistent_ids.head())

In [None]:
df = trips[trips['start_station_id']==608]
df['start_station_name'].unique()

In [None]:
df = trips[trips['start_station_id']==1101]
df['start_station_name'].unique()

❗ Station IDs 608 and 1101 each map to more than one name. The alternative names make up only a small share (~1%). When creating a station table, it is important to use `station_id` as the unique identifier.
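
One way to settle on a single name per ID is to keep the most frequent one. The sketch below illustrates the idea on invented toy data; the column names `station_id` and `station_name` and the helper `most_common_name` are assumptions made for illustration, not part of the pipeline in this notebook.

```python
import pandas as pd

# Toy data; the IDs and names below are invented for illustration
toy = pd.DataFrame({
    "station_id":   [10, 10, 10, 20, 20],
    "station_name": ["Main Square", "Main Square", "Main Sq.", "Harbour", "Harbour"],
})

def most_common_name(names: pd.Series) -> str:
    # value_counts() tallies each name; idxmax() returns the most frequent one
    return names.value_counts().idxmax()

canonical_names = toy.groupby("station_id")["station_name"].agg(most_common_name)
print(canonical_names.to_dict())
```

Because the lookup is keyed on the ID, rare spelling variants never create a second station entry.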

#### 1.2.5 start_station_latitude, end_station_latitude, start_station_longitude, end_station_longitude

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2)

trips['start_station_latitude'].hist(ax=axs[0, 0], bins=50, color='skyblue')
axs[0, 0].set_title('Start Station Latitude')

trips['end_station_latitude'].hist(ax=axs[0, 1], bins=50, color='lightgreen')
axs[0, 1].set_title('End Station Latitude')

trips['start_station_longitude'].hist(ax=axs[1, 0], bins=50, color='skyblue')
axs[1, 0].set_title('Start Station Longitude')

trips['end_station_longitude'].hist(ax=axs[1, 1], bins=50, color='lightgreen')
axs[1, 1].set_title('End Station Longitude')

plt.tight_layout()
plt.show()

### Column Audit Summary

| Column                      | Notes                                                                 |
|-----------------------------|-----------------------------------------------------------------------|
| `started_at`, `ended_at`    | ✅ Valid timestamps, no missing values. |
| `duration`                  | ⚠️ No missing values, but contains outliers > 2 hours → to be removed in SQL. |
| `start_station_id`          | ✅ Valid IDs. However, 30,188 trips are loops (start = end) → to be removed. |
| `end_station_id`            | ✅ Valid IDs. Same note as above.                                    |
| `start_station_name`        | ⚠️ Mostly consistent. Two IDs (608 and 1101) map to multiple names. Will use the most common. |
| `end_station_name`          | ⚠️ Same as above. No major action needed if we trust IDs.            |
| `start_station_description` | ❌ Incomplete. Will be dropped in cleaning step.                 |
| `end_station_description`   | ❌ Same as above.                                                    |
| `start_station_latitude`    | ✅ All values present. Range appears valid.                          |
| `end_station_latitude`      | ✅ Same as above.                                                    |
| `start_station_longitude`   | ✅ All values present. Range appears valid.                          |
| `end_station_longitude`     | ✅ Same as above.                                                    |


### Cleaning Actions to Apply in SQL

- Remove trips longer than 2 hours  
- Remove loops (start and end station ID are the same)  
- Drop `start_station_description` and `end_station_description`  

---
## 2. Cleaning in SQL
In this step we'll clean the dataset and address the problems identified in our exploration phase.

The goal is to create two tables:
1. `trips_clean`: A filtered dataset without outliers and unnecessary columns, ready for analysis in pandas
2. `stations`: A normalized table listing all stations in the system, containing consistent names, station IDs, and geographic coordinates

### 2.1 trips_clean

In [None]:
# Remove rides longer than 2 hours (= 7200 sec)
# Remove loops
# Drop station descriptions

con.execute("""
CREATE OR REPLACE TABLE trips_clean AS
SELECT 
    * EXCLUDE (start_station_description, end_station_description)
FROM trips_raw
WHERE
    duration <= 7200 AND
    start_station_id != end_station_id
""")
con.execute("CHECKPOINT")
print("Cleaned trips saved to DuckDB as 'trips_clean'");

### 2.2 Extract Unique Stations  
Build a `stations` table by combining all distinct start and end stations from the `trips_clean` table. This gives the full list of physical bike stations to enrich with elevation later.

In [None]:
con.execute("""
CREATE OR REPLACE TABLE stations AS
SELECT DISTINCT
    station_id,
    station_name,
    ROUND(lat, 5) AS lat,
    ROUND(lon, 5) AS lon
FROM (
    SELECT
        start_station_id AS station_id,
        start_station_name AS station_name,
        start_station_latitude AS lat,
        start_station_longitude AS lon
    FROM trips_clean

    UNION ALL

    SELECT
        end_station_id AS station_id,
        end_station_name AS station_name,
        end_station_latitude AS lat,
        end_station_longitude AS lon
    FROM trips_clean
)
ORDER BY station_id
""")

con.execute("CHECKPOINT")
print("Extracted and saved 'stations' table")

---
## 3. Topographical Analysis
In this phase of the project, we'll investigate whether Oslo's topography influences bike usage patterns. Oslo has a natural gradient from sea level upwards, which suggests that cyclists might prefer downhill over uphill routes.  

**Questions we'll answer:**
1. Do cyclists prefer downhill routes? If so, by what margin?
2. How does elevation affect station imbalance?
3. Which specific stations require the most urgent daily rebalancing?

### 3.1 Enhancing Data with Elevation Information
Before analyzing the effect of the terrain, we need to add elevation information based on the geographic location of each bike station in the system. We do this using the free elevation API from open-meteo.com.
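
Looking up elevations one station at a time means one HTTP request per station. As far as we can tell from the Open-Meteo documentation, the elevation endpoint also accepts several comma-separated coordinates per request (up to 100 at the time of writing), so the lookups could be batched. The sketch below is a hedged illustration under that assumption; `chunk`, `build_query`, `fetch_elevations` and `BATCH_SIZE` are hypothetical names, and only the offline query construction is exercised here.

```python
import json
import urllib.parse
import urllib.request

ELEVATION_URL = "https://api.open-meteo.com/v1/elevation"
BATCH_SIZE = 100  # assumed per-request limit; check the current API docs

def chunk(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_query(coords):
    """Build the query string for one batch of (lat, lon) pairs."""
    lats = ",".join(str(lat) for lat, _ in coords)
    lons = ",".join(str(lon) for _, lon in coords)
    # keep commas unescaped so the coordinates stay readable in the URL
    return urllib.parse.urlencode({"latitude": lats, "longitude": lons}, safe=",")

def fetch_elevations(coords, timeout=10):
    """Hypothetical helper: return one elevation per coordinate, in input order."""
    elevations = []
    for batch in chunk(coords, BATCH_SIZE):
        url = f"{ELEVATION_URL}?{build_query(batch)}"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            elevations.extend(json.load(response)["elevation"])
    return elevations

# Offline check of the request construction (no network call is made here)
example = [(59.92382, 10.74628), (59.91, 10.75)]
print(build_query(example))
```

Batching would reduce a few hundred requests to a handful, which is also friendlier to a free service.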

In [None]:
# Read stations table and add elevation data
csv_path = "../db/stations_with_elevation.csv"

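def get_elevation(row):
    """Look up the elevation (in metres) of one station via the Open-Meteo API."""
    lat = row['lat']
    lon = row['lon']
    
    response = requests.get(
        f"https://api.open-meteo.com/v1/elevation?latitude={lat}&longitude={lon}",
        timeout=10,  # don't hang forever on a stalled connection
    )
    response.raise_for_status()  # fail loudly instead of caching a bad response
    data = response.json()
    return data['elevation'][0]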

if os.path.exists(csv_path):
    print("Found existing data. Loading from csv.")
    stations = pd.read_csv(csv_path)
else:
    print("No csv found. Fetching elevation data from API.")
    stations = con.execute("SELECT * FROM stations").df()
    stations = stations.drop_duplicates(subset='station_id')
    stations['elevation'] = stations.apply(get_elevation, axis=1)
    stations.to_csv(csv_path, index=False)

print("Elevation received for all stations.")

In [None]:
# Save elevation data to database
con.register("stations_df", stations)
con.execute("CREATE OR REPLACE TABLE stations AS SELECT * FROM stations_df")
con.unregister("stations_df");

In [None]:
# Enrich trips with elevation data (overwriting original)
con.execute("""
CREATE OR REPLACE TABLE trips_clean AS
SELECT
    t.*,   
    s_start.elevation AS start_elevation,
    s_end.elevation AS end_elevation,
    s_end.elevation - s_start.elevation AS elevation_diff
FROM trips_clean t
JOIN stations s_start ON t.start_station_id = s_start.station_id
JOIN stations s_end ON t.end_station_id = s_end.station_id
""")

con.execute("CHECKPOINT");

### 3.2 Loading Analysis-Ready Data into pandas
Now that we have clean data with elevation information, we'll load it into pandas for the remainder of the analysis.

In [None]:
# Load enriched data into pandas for analysis
con = duckdb.connect("../db/bysykkel_2024.duckdb")
trips = con.execute("SELECT * FROM trips_clean").df()
stations = con.execute("SELECT * FROM stations").df()
con.close()

print(f"Loaded {len(trips):,} clean trips and {len(stations)} stations")
print(f"Data spans from {trips['started_at'].min()} to {trips['started_at'].max()}")

# Quick check of our elevation data
print(f"\nElevation range: {stations['elevation'].min():.1f}m to {stations['elevation'].max():.1f}m")

In [None]:
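# Merge stations that belong to the same location into one logical station.
# Each key is the group name; the values are the station IDs to merge.
station_groups = {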
    'Tjuvholmen': [534, 479],
    'Aker Brygge': [1755, 2358, 558, 2357],
    'Vippetangen': [452, 441],
    'Oslo S': [443, 392, 599],
    'Jernbanetorget': [478, 2328], 
    'Torggata': [437, 489],
    'Alexander Kiellands Plass': [421, 444, 617],
    'Arkaden': [545, 577],
    'Brugata / Vaterlandsparken': [491, 495],
    'Schous Plass': [401, 423, 463],

}

# Map every station ID in a group to the group's representative ID
id_mapping = {}
for group_name, station_ids in station_groups.items():
    new_id = station_ids[0]  # use the first ID for all stations in the group
    for station_id in station_ids:  # avoid shadowing the built-in `id`
        id_mapping[station_id] = new_id


# Create grouped dataframe
stations_grouped = []

for group_name, station_ids in station_groups.items():
    station_cluster = stations[stations['station_id'].isin(station_ids)]
    stations_grouped.append({
        'station_id': station_ids[0],
        'station_name': group_name,
        'lat': station_cluster['lat'].mean(),
        'lon': station_cluster['lon'].mean(),
        'elevation': station_cluster['elevation'].mean()
        
    })


# Stations that are not part of any group are kept as they are
all_grouped_ids = [station_id for ids in station_groups.values() for station_id in ids]
ungrouped_stations = stations[~stations['station_id'].isin(all_grouped_ids)]

for _, station in ungrouped_stations.iterrows():
    stations_grouped.append({
        'station_id': station['station_id'],
        'station_name': station['station_name'],
        'lat': station['lat'],
        'lon': station['lon'],
        'elevation': station['elevation']
    })
stations = pd.DataFrame(stations_grouped)

# Create lookup tables
name_lookup = stations.set_index('station_id')['station_name'].to_dict()
lat_lookup = stations.set_index('station_id')['lat'].to_dict()
lon_lookup = stations.set_index('station_id')['lon'].to_dict()



trips['start_station_id'] = trips['start_station_id'].map(lambda x: id_mapping.get(x, x))
trips['end_station_id'] = trips['end_station_id'].map(lambda x: id_mapping.get(x, x))

trips['start_station_name'] = trips['start_station_id'].map(name_lookup)
trips['end_station_name'] = trips['end_station_id'].map(name_lookup)
trips['start_station_latitude'] = trips['start_station_id'].map(lat_lookup)
trips['start_station_longitude'] = trips['start_station_id'].map(lon_lookup)
trips['end_station_latitude'] = trips['end_station_id'].map(lat_lookup)
trips['end_station_longitude'] = trips['end_station_id'].map(lon_lookup)

# Remove loops
trips = trips[trips['start_station_id']!=trips['end_station_id']]

### 3.3 Calculating Trip Distance and Gradient
To understand cyclist preferences, we need to calculate two key metrics for each trip:  
  
- **Distance**: The physical distance between start and end stations using the haversine formula
- **Gradient**: The slope percentage (elevation change divided by distance)  
  
Using these metrics, we can categorize each trip as uphill, downhill, or flat. This helps us to analyze route preferences and quantify any bias toward downhill travel.
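
To make the two metrics concrete, here is a minimal pure-Python sketch of the haversine distance and the gradient calculation. It assumes a mean Earth radius of about 6,371 km; the function names `haversine_m` and `gradient_percent` and the example trip are made up for illustration and are not the `haversine` package used in this notebook.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def gradient_percent(elevation_diff_m, distance_m):
    """Slope in percent; undefined (NaN) when the distance is zero."""
    if distance_m == 0:
        return math.nan
    return elevation_diff_m / distance_m * 100

# Made-up trip: 0.01 degrees due north, ending 20 m lower than it started
d = haversine_m(59.92, 10.75, 59.93, 10.75)
g = gradient_percent(-20, d)
print(f"distance = {d:.0f} m, gradient = {g:.2f}%")
```

Note that this is the straight-line distance between stations, so the real route is usually longer and the true average slope correspondingly gentler.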

In [None]:
# Compute travel distance and gradient

from haversine import haversine, Unit

def compute_distance(row):
    start = (row['start_station_latitude'], row['start_station_longitude'])
    end = (row['end_station_latitude'], row['end_station_longitude'])
    return haversine(start, end, unit=Unit.METERS)

trips["distance"] = trips.apply(compute_distance, axis=1)

def compute_gradient(row):
    if row['distance'] == 0:
        return np.nan
    return row['elevation_diff'] / row['distance'] * 100

trips["gradient"] = trips.apply(compute_gradient, axis=1)

### 3.4 Do Cyclists Prefer Downhill Routes?
Now we're ready to analyze whether there's a preference for downhill vs. uphill travel.

#### 3.4.1 Gradient Distribution Analysis
To understand the overall terrain preferences, we'll first examine the distribution of trip gradients across all rides.

In [None]:
plt.figure()
n, bins, patches = plt.hist(trips["gradient"], 
                            bins=200, 
                            edgecolor="black", alpha=0.7, linewidth=0.2)

for i, patch in enumerate(patches):
    if bins[i] < 0:
        patch.set_facecolor("green")
    elif bins[i] > 0:
        patch.set_facecolor("brown")
    else:
        patch.set_facecolor("grey")

plt.axvline(x=0, color="black", linestyle="--", alpha=0.7)
plt.xlim([-10, 10])
plt.title("Gradient Distribution")
plt.xlabel("Gradient [%]")
plt.ylabel("Frequency of Rides")

from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor="green", label="downhill"),
    Patch(facecolor="grey", label="flat"),
    Patch(facecolor="brown", label="uphill")
]
plt.legend(handles=legend_elements)
plt.tight_layout()
plt.show()

In [None]:
mean_gradient = trips["gradient"].mean()
n_uphill = len(trips[trips['gradient']>0])
n_downhill = len(trips[trips['gradient']<0])
n_flat = len(trips[trips['gradient']==0])
n_total = len(trips)

percent_uphill = n_uphill / n_total * 100
percent_downhill = n_downhill / n_total * 100
percent_flat = n_flat / n_total * 100
net_bias = (n_downhill - n_uphill) / n_total * 100
relative_increase = (n_downhill - n_uphill) / n_uphill * 100

summary = f"""
gradient mean = {mean_gradient:.2f}%
downhill trips: {percent_downhill:.1f}%
uphill trips: {percent_uphill:.1f}%
flat trips: {percent_flat:.1f}%
net bias towards downhill: {net_bias:.1f}%
relative increase: {relative_increase:.1f}%
"""
print(summary)

Looking at the histogram, we can observe several key patterns in Oslo's bike-sharing usage:  
  
**Visual Observations:**
- Downhill rides (negative gradients) are clearly more common than uphill rides.
- There's a pronounced peak at gradient 0%, likely representing popular rides along the flat waterfront paths. 
- The distribution shows a clear skew towards negative gradients, indicating a preference for downhill travel.  
  
**Quantitative Analysis:**  
Our analysis of all trips reveals the following results:  
  
- **Downhill trips**: 59.5% of all rides (gradient < 0%)
- **Uphill trips**: 38.8% of all rides (gradient > 0%) 
- **Flat trips**: 1.7% of all rides (gradient = 0%)
- **Mean gradient**: -0.44% (indicating an overall downhill bias)  
  
The data shows that **downhill trips are 53% more common than uphill trips** (59.5% vs 38.8%). This represents a strong preference for downhill routes with important implications for bike rebalancing operations.
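
The two ways of expressing the gap are easy to mix up: the difference in shares is measured in percentage points, while "53% more common" is a relative increase. A short worked example using the shares reported above:

```python
downhill_share = 59.5  # % of all rides
uphill_share = 38.8    # % of all rides

# Absolute gap between the two shares, in percentage points
net_bias_pp = downhill_share - uphill_share

# Relative increase of downhill over uphill, in percent
relative_increase = (downhill_share / uphill_share - 1) * 100

print(f"net bias: {net_bias_pp:.1f} percentage points")
print(f"relative increase: {relative_increase:.1f}%")
```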

#### 3.4.2 Distance and Duration Analysis
Counting trips alone might be misleading. We'll also examine how much total distance and time cyclists spend going uphill vs. downhill.

In [None]:
# Add a column for the slope type
def categorize_slope(slope):
    if slope < 0:
        return "downhill"
    elif slope > 0:
        return "uphill"
    else:
        return "flat"
    
trips['slope_type'] = trips['gradient'].apply(categorize_slope)

In [None]:
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(6,10))

# Duration histogram
sns.histplot(data=trips, x='duration', hue='slope_type', 
             hue_order=['flat', 'uphill', 'downhill'], 
             bins=50, kde=True, log_scale=True, ax=ax1, legend=True,
             palette=['lightblue', 'salmon', 'green'])
ax1.set_title("Trip Duration by Gradient Direction")
ax1.set_xlabel("Duration [s]")
ax1.set_ylabel("Frequency of Rides")
ax1.legend(title="", labels=["flat", "uphill", "downhill"])


# Distance histogram  
sns.histplot(data=trips, x='distance', hue='slope_type',
             hue_order=['flat', 'uphill', 'downhill'], 
             bins=80, kde=True, log_scale=True, ax=ax2, legend=True,
             palette=['lightblue', 'salmon', 'green'])
ax2.set_title("Trip Distance by Gradient Direction") 
ax2.set_xlabel("Distance [m]")
ax2.set_ylabel("Frequency of Rides")
ax2.legend(title="", labels=["flat", "uphill", "downhill"])
plt.xlim([50, 5000])

plt.tight_layout()
plt.show()

In [None]:
# Calculate statistics
usage_summary = trips.groupby('slope_type').agg({
    'duration': ['sum', 'mean', 'count'],
    'distance': ['sum', 'mean']}).round(1)
usage_summary.columns = ['Total Duration [s]', 'Average Duration [s]', 'Number of Rides', 'Total Distance [m]', 'Average Distance [m]']
print(usage_summary.to_string())

# Calculate comparisons (downhill vs. uphill)
duration_up = usage_summary.loc['uphill', 'Total Duration [s]']
duration_down = usage_summary.loc['downhill', 'Total Duration [s]']
distance_up = usage_summary.loc['uphill', 'Total Distance [m]']
distance_down = usage_summary.loc['downhill', 'Total Distance [m]']

# Calculate percentage differences
duration_diff_pct = (duration_down - duration_up) / duration_up * 100
distance_diff_pct = (distance_down - distance_up) / distance_up * 100

# Per-trip comparison
avg_duration_up = usage_summary.loc['uphill', 'Average Duration [s]']
avg_duration_down = usage_summary.loc['downhill', 'Average Duration [s]']
avg_distance_up = usage_summary.loc['uphill', 'Average Distance [m]']
avg_distance_down = usage_summary.loc['downhill', 'Average Distance [m]']

# Extra time per uphill trip, relative to the average downhill trip
time_per_trip_diff = (avg_duration_up - avg_duration_down) / avg_duration_down * 100
# Extra distance per downhill trip, relative to the average uphill trip
distance_per_trip_diff = (avg_distance_down - avg_distance_up) / avg_distance_up * 100

print("\nKey Comparisons")
print(f"""
      - Downhill trips account for {duration_diff_pct:.1f}% more total ride time.
      - Downhill trips cover {distance_diff_pct:.1f}% more total distance.
      - Uphill trips take on average {time_per_trip_diff:.1f}% longer per trip.
      - Downhill trips cover {distance_per_trip_diff:.1f}% more distance per trip.
      """)


#### 3.4.3 Summary: Evidence of Downhill Preference
Our analysis provides strong evidence that cyclists prefer downhill routes across multiple metrics:

**Trip Distribution:**
- 59.5% of trips go downhill vs. 38.8% uphill (downhill is 53% more common)
- A mean gradient of -0.44% indicates an overall bias towards downhill travel

**Total System Usage:**
- Downhill trips account for 37.6% more total ride time
- Downhill trips cover 68.2% more total distance.

**Individual Trip Properties:**
- Uphill trips take on average 11.5% longer.
- Downhill trips cover on average 9.6% more distance.

**Answer to Question 1:** Yes, cyclists show a strong preference towards downhill routes. 

### 3.5 How does elevation affect station imbalance?

Having established that customers have a strong preference for downhill trips, we'll now investigate how that affects the net flow of bikes.  

#### 3.5.1 Compute net flux per station
We will start out by computing the **net flux** per station.  
net flux = arriving bikes - departing bikes  
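
As a quick illustration of the definition, the sketch below computes net flux for a handful of invented trips (the station names are made up). Each trip counts as one departure and one arrival, so the net flux summed over all stations is always zero.

```python
from collections import Counter

# Made-up trips as (start_station, end_station) pairs
toy_trips = [("hill", "harbour"), ("hill", "harbour"), ("harbour", "hill"), ("hill", "park")]

departures = Counter(start for start, _ in toy_trips)
arrivals = Counter(end for _, end in toy_trips)

# Include stations that only ever appear as a start or as an end
station_names = sorted(set(departures) | set(arrivals))
toy_net_flux = {name: arrivals[name] - departures[name] for name in station_names}

print(toy_net_flux)
```

A negative value marks a station that loses bikes (an exporter), a positive value one that accumulates them (an importer).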

In [None]:
# Compute the net flux per station
flux_out = trips.groupby('start_station_id').size().rename('departures')
flux_in = trips.groupby('end_station_id').size().rename('arrivals')

station_flux = flux_out.to_frame().join(flux_in.to_frame(), how='outer').fillna(0)
station_flux['net_flux'] = station_flux['arrivals'] - station_flux['departures']
station_flux['total_usage'] = station_flux['arrivals'] + station_flux['departures']
station_flux['net_flux_daily'] = station_flux['net_flux'] / 365
station_flux['total_usage_daily'] = station_flux['total_usage'] / 365
station_flux['departure_share'] = station_flux['departures'] / station_flux['total_usage']

station_flux = stations.join(station_flux, how="outer", on="station_id").fillna(0)

#### 3.5.2 Elevation's Impact on Station Balance

In [None]:
scatter = plt.scatter(station_flux['elevation'], station_flux['net_flux_daily'],
                      c=station_flux['total_usage_daily'],
                      s=60, alpha=0.7, cmap='RdYlBu_r',
                      edgecolors='black', linewidths=0.5)

cbar = plt.colorbar(scatter)
cbar.set_label('Total Station Usage', rotation=270, labelpad=15)

plt.axhline(0, color='black', linestyle='--', linewidth=2, alpha=0.8)
plt.axvline(35, color='orange', linestyle=':', linewidth=2, alpha=0.8, label='Elevation threshold (~35m)')

plt.xlabel('Elevation [m]')
plt.ylabel('Net Flux per Day (Arrivals - Departures)')
plt.title('Station Imbalance vs. Elevation')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
print(f"The correlation between elevation and net flux is {station_flux['elevation'].corr(station_flux['net_flux']):.2f}")

This figure shows how closely station imbalance tracks elevation in Oslo's bike-sharing system.  

**Key Observations:**
- **Above ~35m elevation**: Stations predominantly export bikes (negative net flux), losing around 5 to 10 bikes per day.
- **Below ~35m elevation**: Stations predominantly import bikes (positive net flux), gaining up to 20 bikes per day. 
- **Correlation**: The Pearson correlation between elevation and net flux is -0.61, a fairly strong negative relationship: the higher a station sits, the more bikes it tends to lose.
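
For readers who want to see what the correlation coefficient measures, here is a minimal sketch of the Pearson formula on made-up station values (not the real data). As a rule of thumb for a simple linear fit, the squared coefficient gives the share of variance explained, so r = -0.61 corresponds to roughly 37%.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up stations: elevation in metres and net flux in bikes per day
toy_elevation = [2, 10, 35, 60, 110]
toy_flux = [12, 6, 0, -5, -8]

r = pearson_r(toy_elevation, toy_flux)
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")
```

Elevation is therefore an important driver of imbalance in this picture, but clearly not the only one.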

**Operational Interpretation:**  
Bikes naturally "flow" downhill through the system and preferentially accumulate near sea level. This creates a systematic daily imbalance, where:
- Higher elevation stations consistently need bike restocking.
- Lower elevation stations consistently need bikes removed to create docking space.
- The ~35m elevation mark represents a "watershed" for bike flow in the system. 

#### 3.5.3 Geographic Distribution of Imbalances
Having established that there's a strong relation between elevation and station imbalance, let's now visualize these imbalances on a map.  
  
  The interactive map below shows all bike stations with:  
  - **Marker size**: Proportional to total daily usage (larger circles = busier stations)
  - **Marker color**: Indicates the direction and magnitude of the station's imbalance
    - **Red stations**: Net exporters (lose bikes daily, need restocking)
    - **Green stations**: Net importers (gain bikes daily, need bike removal)
    - **Color intensity**: Darker colors indicate larger daily imbalances  

This visualization shows the geographic pattern of Oslo's bike flow.

In [None]:
# Get top 10 exporters and importers
top_exporters = station_flux.nsmallest(10, 'net_flux_daily')
top_importers = station_flux.nlargest(10, 'net_flux_daily')
critical_ids = set(top_exporters['station_id'].tolist() + top_importers['station_id'].tolist())

In [None]:
# Create interactive Folium map of Oslo to spatially demonstrate the behaviour where bikes are flowing

def get_color(flux_value):
    if flux_value > 2.7:
        return '#1a9850'     # Dark green - Heavy importer
    elif flux_value > 0.5:
        return 'yellowgreen' # Light green - Light importer  
    elif flux_value >= -0.5:
        return 'gray'        # Gray - Balanced (±0.5 bikes/day)
    elif flux_value > -2.7:
        return '#fc8d59'     # Orange - Light exporter
    else:
        return '#d73027'     # Red - Heavy exporter
    
def get_radius(total_usage):
    # Normalize usage
    return np.interp(total_usage,
                     (station_flux['total_usage_daily'].min(), station_flux['total_usage_daily'].max()),
                     (2, 17))

flux_map = folium.Map(location=oslo_coordinates, zoom_start=zoom, tiles="CartoDB positron")

# First, add all non-critical stations
for index, row in station_flux.iterrows():
    if row['station_id'] not in critical_ids:  # Only non-critical stations
        popup_content = f"""
        Station: {row['station_name']} <br>
        ID: {row['station_id']} <br>
        Elevation: {row['elevation']:.0f}m <br>
        Departures: {row['departures']/365:.0f}/day <br>
        Arrivals: {row['arrivals']/365:.0f}/day <br>
        Total Usage: {row['total_usage_daily']:.0f}/day <br>
        Net Flux: {row['arrivals']/365 - row['departures']/365:+.1f}/day
        """
        
        # Add marker
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            radius=get_radius(row['total_usage_daily']),
            color=get_color(row['net_flux_daily']),
            fill=True,
            fill_color=get_color(row['net_flux_daily']),
            fill_opacity=0.8,
            weight=1,
            popup=folium.Popup(popup_content, max_width=300)
        ).add_to(flux_map)

# Then add critical stations on top
for index, row in station_flux.iterrows():
    if row['station_id'] in critical_ids:  # Only critical stations
        popup_content = f"""
        <b>⚠️ CRITICAL STATION</b><br>
        Station: {row['station_name']} <br>
        ID: {row['station_id']} <br>
        Elevation: {row['elevation']:.0f}m <br>
        Departures: {row['departures']/365:.0f}/day <br>
        Arrivals: {row['arrivals']/365:.0f}/day <br>
        Total Usage: {row['total_usage_daily']:.0f}/day <br>
        <b>Net Flux: {row['arrivals']/365 - row['departures']/365:+.1f}/day</b>
        """
        
        # Add marker
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            radius=get_radius(row['total_usage_daily']),
            color='black',
            fill=True,
            fill_color=get_color(row['net_flux_daily']),
            fill_opacity=0.9,  # Slightly higher opacity
            weight=3,
            popup=folium.Popup(popup_content, max_width=300)
        ).add_to(flux_map)

# Updated legend to include critical stations
legend_html = '''
<div style="position: fixed; bottom: 50px; left: 50px; width: 175px; height: 170px; 
            background-color:white; border:2px solid grey; z-index:9999; 
            font-size:12px; padding: 8px">
<b>Net Flux</b><br>
<span style="color:#d73027">●</span> Heavy Export<br>
<span style="color:#fc8d59">●</span> Light Export<br>
<span style="color:gray">●</span> Balanced<br>
<span style="color:yellowgreen">●</span> Light Import<br>
<span style="color:#1a9850">●</span> Heavy Import<br>
<hr style="margin: 5px 0;">
<b>⚫</b> = Top 10 critical station
</div>
'''

flux_map.get_root().html.add_child(folium.Element(legend_html))

flux_map.save('../outputs/oslo_flux_map.html')
flux_map

### 3.5.4 Summary: Geographic imbalance

The interactive map above shows the geographic pattern of the imbalanced flow of bikes in the bike sharing system.  

**Clear Spatial Patterns:**
- **Northern/Outer areas** (higher elevation): Dominated by red stations that consistently export bikes - *bike sources*
- **Central/southern areas** (lower elevation): Dominated by green stations that consistently import bikes - *bike sinks*
- **Intermediate elevation areas**: Most stations are balanced (gray), forming a buffer between exporter and importer areas - *transition zone*

**Geographic "Watershed":**
There is a clear three-zone pattern: a green core of heavy importers, surrounded by gray balanced stations at intermediate elevations, surrounded by red exporter stations at higher elevations. There is a daily flow of bikes from higher elevation down into the city center and lower elevation regions.  

**Important Note**: This analysis represents a yearly average across all seasons and times of day. The actual rebalancing challenges may be significantly more pronounced during peak times (rush hours, summer months, weekends).  

This visualization demonstrates that bike rebalancing in Oslo is not random maintenance, but follows a clear geographical pattern. 

**Answer to Question 2:** Elevation systematically affects station imbalance by creating predictable daily bike flows from higher to lower elevations, with stations above ~35m consistently exporting bikes while those below consistently import them.

### 3.6 Which Stations Require the Most Urgent Daily Rebalancing?
Now that we've seen how elevation causes imbalances, let's identify the specific stations that create the most critical challenges.

In [None]:
# Display the top 10 heaviest importer and exporter stations
print("TOP 10 EXPORTER STATIONS (Need Daily Restocking)")
print("="*70)
exporters_display = top_exporters[['station_name', 'elevation', 'net_flux_daily', 'total_usage_daily']].copy()
exporters_display.columns = ['Station', 'Elevation [m]', 'Bikes Lost Daily', 'Total Daily Usage']
exporters_display['Bikes Lost Daily'] = exporters_display['Bikes Lost Daily'].round(1).abs()
exporters_display['Total Daily Usage'] = exporters_display['Total Daily Usage'].round(1)
print(exporters_display.to_string(index=False))
print(f"\nTotal daily restocking need: {top_exporters['net_flux_daily'].abs().sum():.0f} bikes")
print(f"Average elevation: {top_exporters['elevation'].mean():.1f}m")

print("\nTOP 10 IMPORTER STATIONS (Need Daily Removal)")
print("="*70)
importers_display = top_importers[['station_name', 'elevation', 'net_flux_daily', 'total_usage_daily']].copy()
importers_display.columns = ['Station', 'Elevation [m]', 'Bikes Gained Daily', 'Total Daily Usage']
importers_display['Bikes Gained Daily'] = importers_display['Bikes Gained Daily'].round(1)
importers_display['Total Daily Usage'] = importers_display['Total Daily Usage'].round(1)
print(importers_display.to_string(index=False))
print(f"\nTotal daily removal need: {top_importers['net_flux_daily'].abs().sum():.0f} bikes")
print(f"Average elevation: {top_importers['elevation'].mean():.1f}m")

# Impact summary
total_critical_flux = abs(top_exporters['net_flux_daily'].sum()) + top_importers['net_flux_daily'].sum()
pct_of_stations = 20 / len(stations) * 100

print(f"""
OPERATIONAL IMPACT:
- These 20 stations ({pct_of_stations:.1f}% of network) require moving ~{total_critical_flux:.0f} bikes daily
- Elevation difference between critical exporters and importers: {top_exporters['elevation'].mean() - top_importers['elevation'].mean():.0f}m
- This represents the minimum daily rebalancing operation just to maintain these 20 stations. 
""")

### 3.7 Summary: Topography's Impact on Oslo's Bike Sharing System

Our topographical analysis reveals that Oslo's terrain creates fundamental operational challenges:

**Key Finding 1: Strong Downhill Preference**
- Cyclists are 53% more likely to choose downhill routes (59.5% vs 38.8% of trips)
- Downhill trips cover 68% more total distance and 38% more ride time
- This creates a system-wide "gravity bias" with a mean gradient of -0.44%

**Key Finding 2: Elevation-Driven Station Imbalances**
- Strong negative correlation (-0.61) between elevation and daily net flux
- Critical elevation threshold at ~35m separates bike exporters from importers
- Geographic pattern: red (exporter) periphery → gray (balanced) middle → green (importer) core

**Key Finding 3: Concentrated Rebalancing Needs**
- Top 10 exporters (avg 54m elevation) lose 75 bikes daily combined
- Top 10 importers (avg 17m elevation) gain 92 bikes daily combined  
- Just 7.5% of stations drive the majority of rebalancing requirements

**Operational Implications:**

This analysis proves that Oslo's bike rebalancing isn't random maintenance but a predictable daily battle against gravity. The system experiences a continuous "downhill tide" requiring strategic intervention:

1. **Predictable Patterns**: Rebalancing routes can be optimized based on elevation zones
2. **Resource Allocation**: Focus on the 20 critical stations for maximum impact
3. **Capacity Planning**: Low-elevation stations need more docks, high-elevation stations need more bikes
4. **Pricing Strategy**: Consider incentives for uphill trips to naturally counteract gravity

The topographical analysis has revealed WHERE bikes flow. Next, we'll examine the network structure to understand HOW bikes move through the system and which routes are most critical for connectivity.

---
## 3. Network Structure Analysis
Having established that Oslo's topography creates a predictable bike flow from higher to lower elevations, we now examine *how* bikes move through the city. While elevation tells us the direction of the flow, this analysis will reveal the pathways and the most critical destinations.  
  
Network analysis treats the bike sharing system as a complex web of interconnected stations, where each trip creates a weighted "link" between locations. This approach allows us to understand which stations serve as bike magnets, which routes form the main arteries of travel, and how the system can be split into different functional zones. 
  
Key questions we'll answer:  
1. Which stations are most important in the network?
2. What are the most important pathways for bike travel?
3. How is the bike network organized across Oslo?

### 3.1 Building the Network Graph
We start by defining our network graph where stations are nodes and trips create weighted edges between them.
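
In an undirected graph, trips from A to B and from B to A land on the same edge, so their counts have to be added together rather than written one after the other. Below is a minimal standard-library sketch of that bookkeeping. The station ids and counts are made up, and `directed_counts` and `undirected_weights` are illustrative names that do not appear elsewhere in this notebook.

```python
from collections import Counter

# Hypothetical (start_station_id, end_station_id, trip_count) rows
directed_counts = [
    (1, 2, 120),  # station 1 -> station 2
    (2, 1, 45),   # station 2 -> station 1 (same undirected edge)
    (2, 3, 30),
    (3, 3, 5),    # round trip that starts and ends at station 3
]


def undirected_weights(rows):
    """Sum trip counts over both directions of each station pair."""
    weights = Counter()
    for start, end, count in rows:
        key = tuple(sorted((start, end)))  # (1, 2) and (2, 1) share a key
        weights[key] += count
    return dict(weights)


edge_weights = undirected_weights(directed_counts)
print(edge_weights)
```

With these toy rows, the pair (1, 2) ends up with a single weight covering both directions. If direction matters for a later step, a directed graph would be the alternative design.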

In [None]:
import networkx as nx

# Create undirected graph (one edge per station pair, regardless of trip direction)
G = nx.Graph()

# Add stations as nodes to graph
for _, station in stations.iterrows():
    G.add_node(station['station_id'],
               name=station['station_name'],
               lat=station['lat'],
               lon=station['lon'],
               elevation=station['elevation'])

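# Add trips between stations as edges, with frequency of route as weight.
# The graph is undirected, so A->B and B->A trips are summed on one edge
# (calling add_edge a second time would overwrite the first direction's count).
trip_counts = trips.groupby(['start_station_id', 'end_station_id']).size().reset_index(name='weight')
trip_counts = trip_counts.sort_values('weight', ascending=False)

for _, trip in trip_counts.iterrows():
    u, v, w = trip['start_station_id'], trip['end_station_id'], trip['weight']
    if G.has_edge(u, v):
        G[u][v]['weight'] += w
    else:
        G.add_edge(u, v, weight=w)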

print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")

### 3.2 Which stations are most important to the network?
To understand station importance, we calculate several network centrality metrics:  
- **PageRank**: Identifies stations that are well-connected to other well-connected stations ("hubs"). This metric correlates very well with total station usage.
- **Degree Centrality**: Counts how many direct connections a station has. Has a large plateau, since most stations are connected to most other stations over the year.
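
To make the two metrics concrete, here is a small pure-Python sketch of weighted PageRank by power iteration, together with a plain neighbour count. It is an illustration on a made-up four-station network, not the `networkx` implementation, which also handles details such as dangling nodes and a convergence tolerance. The function name `weighted_pagerank` and the toy edges are assumptions for this example.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, weighted edge list.

    Returns (rank, degree), where degree counts distinct neighbours.
    """
    # Build a symmetric adjacency map: neighbours[u][v] = weight
    neighbours = {}
    for u, v, w in edges:
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w

    nodes = list(neighbours)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives the same small "teleport" share
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for u in nodes:
            total = sum(neighbours[u].values())
            for v, w in neighbours[u].items():
                # u passes on its rank in proportion to edge weight
                new_rank[v] += damping * rank[u] * w / total
        rank = new_rank

    degree = {node: len(neighbours[node]) for node in nodes}
    return rank, degree


# Hypothetical toy network: "Hub" is linked to every other station
toy_edges = [
    ("Hub", "A", 50),
    ("Hub", "B", 40),
    ("Hub", "C", 30),
    ("A", "B", 5),
]
ranks, degrees = weighted_pagerank(toy_edges)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{name}: pagerank={ranks[name]:.3f}, degree={degrees[name]}")
```

In this toy network the hub ranks first on both metrics. Degree only separates stations by how many neighbours they have, while PageRank also reflects how much traffic those links carry.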

In [None]:
# Calculate metrics
pagerank = nx.pagerank(G, weight='weight')
degree_centrality = dict(G.degree())  # Number of unique connections

# Write to stations table
station_flux['pagerank'] = station_flux['station_id'].map(pagerank)
station_flux['degree_centrality'] = station_flux['station_id'].map(degree_centrality)

In [None]:
# Find top n stations
n = 10
top_pagerank_stations = station_flux.nlargest(n, 'pagerank')
top_centrality_stations = station_flux.nlargest(n, 'degree_centrality')
print("TOP 10 MOST IMPORTANT STATIONS (by PageRank)")
print("=" * 60)

for i, (idx, station) in enumerate(top_pagerank_stations.iterrows(), 1):
    print(f"{i:2d}. {station['station_name']}")
    print(f"    PageRank: {station['pagerank']:.4f}")
    print()

### 3.3 What are the most critical pathways?

In [None]:
# Get all edges with their weights (trip counts)
all_edges = [(start, end, data['weight']) for start, end, data in G.edges(data=True)]

# Sort by weight (trip count) descending
popular_routes = sorted(all_edges, key=lambda x: x[2], reverse=True)

# Convert to readable format
top_routes = []
for start, end, weight in popular_routes[:20]:  # Top 20
    route_info = {
        'from': G.nodes[start]['name'],
        'to': G.nodes[end]['name'],
        'trip_count': weight,
        'from_elevation': G.nodes[start]['elevation'],
        'to_elevation': G.nodes[end]['elevation'],
        'gradient': 'downhill' if G.nodes[end]['elevation'] < G.nodes[start]['elevation'] else 'uphill'
    }
    top_routes.append(route_info)

# Display as DataFrame
routes_df = pd.DataFrame(top_routes)
print("TOP 15 MOST POPULAR CONNECTIONS")
print("=" * 70)

for i, (idx, route) in enumerate(routes_df.head(15).iterrows(), 1):
    print(f"{i:2d}. {route['from']} ↔ {route['to']}")
    print(f"    {route['trip_count']:,} trips")
    print()

### 3.4 Network Zones: Core vs. Feeder System
Our network analysis reveals a two-zone structure that explains how Oslo's bike system actually works:

In [None]:
# Visualize network in Folium
n_edges = 800

network_map = folium.Map(location=oslo_coordinates, zoom_start=zoom)#, tiles='CartoDB positron')

# Add edges to map
top_edges = sorted(
    G.edges(data=True),
    key=lambda x: x[2]['weight'],
    reverse=True
    )[:n_edges]

# Compute weights for scaling
weights = [w['weight'] for _, _, w in top_edges]
min_weight, max_weight = min(weights), max(weights)

import branca.colormap as cm
from branca.element import Element
colormap = cm.LinearColormap(['blue', 'darkblue', 'navy'],
                              vmin=min_weight, vmax=max_weight)


def scale(weight, val_min, val_max, scale_min, scale_max):
    return np.interp(
        weight,
        (val_min, val_max),
        (scale_min, scale_max))

    
for e1, e2, data in top_edges:
    lat1, lon1 = G.nodes[e1]['lat'], G.nodes[e1]['lon']
    lat2, lon2 = G.nodes[e2]['lat'], G.nodes[e2]['lon']
    thickness = scale(data['weight'], min_weight, max_weight, 1, 26)

    folium.PolyLine(
        locations=[[lat1, lon1], [lat2, lon2]],
        weight=thickness,
        opacity=scale(data['weight'], min_weight, max_weight, 0.2, 0.3),
        color=colormap(data['weight']),#'blue',
        dash_array='5, 5' if thickness < 2 else None
        
    ).add_to(network_map)


# Add nodes to map
for node, data in G.nodes(data=True):
    popup_content =  f"""
    Station: {data['name']} <br>
    Elevation: {data['elevation']:.0f}m <br>
    """

    folium.CircleMarker(
        location=[data['lat'], data['lon']],
        color='black',
        radius=2,
        fill=True,
        fill_opacity=0.3,
        popup=folium.Popup(popup_content, max_width=300)
    ).add_to(network_map)


# Visualize top hubs
top_nodes = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:21]

for node, score in top_nodes:
    lat, lon = G.nodes[node]['lat'], G.nodes[node]['lon']
    name, elevation = G.nodes[node]['name'], G.nodes[node]['elevation']
    size = scale(score, min(pagerank.values()), max(pagerank.values()), 3, 10)

    popup_content =  f"""
    Station: {name} <br>
    Elevation: {elevation:.0f}m <br>
    """
    marker = folium.CircleMarker(
        location=[lat, lon],
        radius=size,
        color='#DAA520',
        fill=True,
        fill_opacity=0.8,
        popup=folium.Popup(popup_content, max_width=300),
    ).add_to(network_map)

    folium.Tooltip(
        name,
        permanent=True,  # Always visible
        # sticky=True,
        direction='top',  # Position: 'top', 'bottom', 'left', 'right'
        # offset=[0, -10]  # Fine-tune position
        class_name='transparent-label',
    ).add_to(marker)

# Add the label CSS once, after the loop, instead of once per hub
transparent_css = """
<style>
.transparent-label {
    background-color: #DAA520 !important;  /* Goldenrod, matches the hub markers */
    color: black !important;
    border: black !important;
    box-shadow: black !important;
    font-size: 12px;
    padding: 1px 3px;
    border-radius: 4px;
    font-weight: 600;
}

/* Kill the tooltip arrow */
.transparent-label:before,
.transparent-label:after {
    border: none !important;
    background: transparent !important;
    box-shadow: none !important;
    content: none !important;
}
</style>
"""
network_map.get_root().html.add_child(Element(transparent_css))
network_map.save('../outputs/oslo_network_map.html')

network_map

In [None]:
# Money plot: Combine network map and importer/exporter map
# Oslo Bike Network: Core vs Feeder Stations

n_edges = 600
network_flux_map = folium.Map(location=oslo_coordinates, zoom_start=zoom, tiles='CartoDB positron')

top_edges = sorted(
    G.edges(data=True),
    key=lambda x: x[2]['weight'],
    reverse=True,
)[:n_edges]


# Function for scaling variables
def scale(weight, val_min, val_max, scale_min, scale_max):
    return np.interp(
        weight,
        (val_min, val_max),
        (scale_min, scale_max)
    )

# Plot network
for start, end, data in top_edges:
    lat1, lon1 = G.nodes[start]['lat'], G.nodes[start]['lon']
    lat2, lon2 = G.nodes[end]['lat'], G.nodes[end]['lon']
    weight = data['weight']

    # Thickness and opacity scale linearly with trip count
    thickness = 0.5 + (weight / max_weight) * 10
    opacity = 0.2 + (weight / max_weight) * 0.3

    folium.PolyLine(
        locations=[[lat1, lon1], [lat2, lon2]],
        weight=thickness,
        opacity=opacity,
        color='darkblue',
    ).add_to(network_flux_map)



# Prepare station list
# station_flux['pagerank'] = station_flux['station_id'].map(pagerank)

# Classify stations
def classify_station(net_flux_daily):
    if net_flux_daily > 2:
        return 'heavy_importer'
    elif net_flux_daily > 0.5:
        return 'light_importer'
    elif net_flux_daily < -2:
        return 'heavy_exporter'
    elif net_flux_daily < -0.5:
        return 'light_exporter'
    else:
        return 'balanced'
    
station_flux['station_type'] = station_flux['net_flux_daily'].apply(classify_station)

# Find top stations for labelling
top_pagerank = station_flux.nlargest(24, 'pagerank')['station_id'].tolist()
top_exporters = station_flux.nsmallest(10, 'net_flux_daily')['station_id'].tolist()
top_importers = station_flux.nlargest(10, 'net_flux_daily')['station_id'].tolist()

stations_to_label = set(top_pagerank + top_exporters + top_importers)

# Color mapping
color_map = {
    'heavy_exporter': '#d73027',
    'light_exporter': '#fc8d59',
    'balanced': '#ffffbf',
    'light_importer': '#91cf60',
    'heavy_importer': '#1a9850',
}

label_css_template = """
<style>
.label-{unique_id} {{
    background-color: rgba(255, 255, 255, 0.9);
    color: black;
    border: 2px solid {label_color};
    box-shadow: 2px 2px 2px rgba(0,0,0,0.3);
    font-size: 12px;
    padding: 1px 3px;
    border-radius: 4px;
    font-weight: 600;
}}

/* Kill the tooltip arrow */
.label-{unique_id}:before,
.label-{unique_id}:after {{
    border: none !important;
    background: transparent !important;
    box-shadow: none !important;
    content: none !important;
}}
</style>
"""

def get_radius(total_usage):
    # Normalize usage
    return np.interp(total_usage,
                     (station_flux['total_usage_daily'].min(), station_flux['total_usage_daily'].max()),
                     (3, 17))

for _, station in station_flux.iterrows():
    marker_size = get_radius(station['total_usage_daily'])

    # Check if station is top exporter or importer
    is_top_exporter = station['station_id'] in top_exporters
    is_top_importer = station['station_id'] in top_importers

    # Define border style for top stations
    if is_top_exporter or is_top_importer:
        border_weight = 3
        border_color = 'black'
    else:
        border_weight = 1
        border_color='darkgray'

    # Station color
    fill_color = color_map[station['station_type']]

    # Create popup
    popup_content = f"""
        <b>{station['station_name']}</b><br>
        Type: {station['station_type'].replace('_', ' ').title()}<br>
        Net Flux: {station['net_flux_daily']:+.1f} bikes/day<br>
        Total Usage: {station['total_usage_daily']:.0f} trips/day<br>
        Elevation: {station['elevation']:.0f}m
    """

    # Draw station marker
    marker = folium.CircleMarker(
        location=[station['lat'], station['lon']],
        radius=marker_size,
        color=border_color,
        weight=border_weight,
        fill_color=fill_color,
        fill=True,
        fill_opacity=0.8,
        popup=folium.Popup(popup_content, max_width=300)
    ).add_to(network_flux_map)

    # Add label to important stations
    if station['station_id'] in stations_to_label:
        if 'exporter' in station['station_type']:
            label_color = '#d73027'
        elif 'importer' in station['station_type']:
            label_color = '#1a9850'
        else:
            label_color = 'black'

        # Create a unique CSS class name for this station's label
        unique_id = f"station_{station['station_id']}"
        label_css = label_css_template.format(
            unique_id=unique_id,
            label_color=label_color
        )

        # Add the CSS for this specific label
        network_flux_map.get_root().html.add_child(folium.Element(label_css))

        # Create the tooltip with the unique class
        folium.Tooltip(
            station['station_name'],
            permanent=True,
            direction='top',
            offset=(0, -marker_size+4),
            class_name=f'label-{unique_id}',
        ).add_to(marker)

# Add legend
legend_html = '''
<div style="position: fixed; 
            bottom: 50px; left: 50px; width: 200px;
            background-color: white; border: 2px solid grey;
            z-index: 9999; font-size: 11px; padding: 10px">
    <b>Oslo Bike Network Analysis</b><br><br>
    <b>Station Types:</b><br>
    <span style="color: #d73027;">●</span> Heavy Exporter (>2 bikes/day)<br>
    <span style="color: #fc8d59;">●</span> Light Exporter<br>
    <span style="color: #ffffbf;">●</span> Balanced (±0.5 bikes/day)<br>
    <span style="color: #91cf60;">●</span> Light Importer<br>
    <span style="color: #1a9850;">●</span> Heavy Importer (>2 bikes/day)<br>
    <br>
    <b>Visual Elements:</b><br>
    • Size = Total daily usage<br>
    • Thick border = Top 10 exporter/importer<br>
    • Blue lines = Trip frequency<br>
    • Labels = Top stations by PageRank<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;+ Top 10 exporters/importers
</div>
'''
network_flux_map.get_root().html.add_child(folium.Element(legend_html)) 

# Add title
title_html = '''
<div style="position: fixed; 
            top: 10px; left: 50%; transform: translateX(-50%);
            background-color: white; border: 2px solid grey;
            z-index: 9999; font-size: 16px; font-weight: bold;
            text-align: center; padding: 8px 15px;
            border-radius: 5px;">
    Oslo Bike Network: Core vs Feeder Stations
</div>
'''
network_flux_map.get_root().html.add_child(folium.Element(title_html))

network_flux_map.save('../outputs/network_flux_map.html')
network_flux_map

The network can be split into two distinct areas:  
  
**The core network (dense central area)**  
- High traffic density with many connections
- Net bike importers (accumulate bikes throughout the day)
- Short frequent trips between nearby stations
- High intra-city circulation  
  
This dense network serves daily urban activities such as shopping, errands and meetings. Bikes circulate actively within this zone as people move between neighborhoods.  
  
**The feeder network (peripheral stations)**
- Lower traffic density, fewer connections
- Net bike exporters (lose bikes throughout the day)
- More isolated from each other
  
These stations act as "park and ride" points for commuters. People living in elevated residential areas use bikes for one-way trips into the city center, then rely on other transport to return home. 

---
## 4. Temporal Analysis
Having identified Oslo's dual-zone network structure (central importers vs. peripheral feeders), we now examine **when** these patterns emerge throughout the day.  
  
This analysis will show how the gravitational flow we discovered earlier plays out hour by hour. This is critical information to plan when trucks have to go out and restock stations or remove bikes from fully stocked ones so that users can park their bikes.  
  
**Key questions to answer:**  
1. When do the peripheral stations feed the central ones?  
2. How do different station types behave throughout the day?  


### 4.1 Data preparation: timezone correction
Before analyzing temporal patterns, we need to fix an issue with the timestamps in the dataset:  
  
The original dataset shows an unusual rush-hour pattern with peaks at 04-07 and 13-16, clearly outside normal commuting hours. In addition, some trips fall outside the official opening hours of the bike sharing system, which run from 05:00 until 01:00.  
  
**Problem**: The data appears to be in UTC, while Oslo is on CET/CEST (Central European Time / Central European Summer Time).  
**Solution**: We need to convert the timezone to the appropriate one while automatically handling daylight saving time.  
**Validation**: After time zone conversion, the bike usage hours align perfectly with Oslo Bysykkel's official operating hours (05:00 - 01:00), and the rush hour patterns appear at normal commuting times.
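
The effect of the conversion can be checked on a single timestamp with the standard library before it is applied to the whole table. The sketch below assumes the stored times are naive UTC values. The helper name `utc_to_oslo` and the two example dates are made up for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

OSLO = ZoneInfo("Europe/Oslo")


def utc_to_oslo(naive_utc):
    """Interpret a naive timestamp as UTC and convert it to Oslo local time."""
    return naive_utc.replace(tzinfo=timezone.utc).astimezone(OSLO)


# Same UTC clock time, different local hour depending on the season
winter = utc_to_oslo(datetime(2024, 1, 15, 6, 30))  # CET, UTC+1
summer = utc_to_oslo(datetime(2024, 7, 15, 6, 30))  # CEST, UTC+2
print(winter.hour, summer.hour)
```

A fixed one-hour shift would therefore be wrong for half the year, which is why a named time zone is used rather than a constant offset.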

In [None]:
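# Fix timestamps: interpret stored times as UTC, then convert to Oslo local time (handles DST).
# utc=True works for both naive and already timezone-aware input.
trips['started_at'] = pd.to_datetime(trips['started_at'], utc=True).dt.tz_convert('Europe/Oslo')
trips['ended_at'] = pd.to_datetime(trips['ended_at'], utc=True).dt.tz_convert('Europe/Oslo')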

In [None]:
# Add hour to trips table
trips['hour'] = trips['started_at'].dt.hour

# One bin per hour (edges 0, 1, ..., 24); bins=23 would merge hours 22 and 23
plt.hist(trips['hour'], bins=range(25), color='skyblue', edgecolor='black', alpha=0.7, linewidth=0.5)
plt.axvspan(7, 9, alpha=0.3, color='gray', label='Morning Rush')
plt.axvspan(15, 18, alpha=0.3, color='gray', label='Evening Rush')

plt.xticks(range(24))
plt.xlabel('Hour of Day')
plt.ylabel('Number of Trips')
plt.title('System-wide bike usage throughout the day')
plt.grid(True, alpha=0.3)
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()

### 4.2 When do peripheral stations feed the central ones? Extreme station patterns
Let's first look at how the flux of bikes develops throughout the day. We will start with the most extreme stations, which are the top exporting and importing stations.  
  
To get an overview of the daily pattern in the flux of bikes, we group the dataset by hour of day and count the hourly arrivals and departures. This gives the average flux behaviour over the entire year, smoothing out seasonal variations (winter vs. summer) and weekly variations (weekends vs. weekdays), and retaining only the variation caused by the time of day.  
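
The grouping step can be sketched without pandas to show what is being counted. The trips below are made up, and `hourly_net_flux` here is a stand-alone illustrative helper rather than the notebook's DataFrame of the same name. It assumes each trip is a `(start_station_id, end_station_id, hour)` tuple.

```python
from collections import Counter

# Hypothetical trips: (start_station_id, end_station_id, hour_of_day)
toy_trips = [
    (10, 20, 8),
    (10, 20, 8),
    (30, 20, 8),
    (20, 10, 16),
]


def hourly_net_flux(trip_rows, n_days=1):
    """Return {(station, hour): (arrivals - departures) / n_days}."""
    arrivals = Counter((end, hour) for _, end, hour in trip_rows)
    departures = Counter((start, hour) for start, _, hour in trip_rows)
    keys = set(arrivals) | set(departures)
    # Counter returns 0 for missing keys, so a station with only arrivals
    # (or only departures) in a given hour is still handled correctly
    return {key: (arrivals[key] - departures[key]) / n_days for key in keys}


flux = hourly_net_flux(toy_trips)
for (station, hour), value in sorted(flux.items()):
    print(f"station {station}, hour {hour:02d}: {value:+.1f}")
```

Every trip adds one arrival and one departure, so the net flux summed over all stations and hours is zero. That makes a quick sanity check for the real table as well.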
  
#### 4.2.1 Top exporters and importers

In [None]:
# Calculate hourly flux for each station
hourly_arrivals = trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
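# fill_value=0 keeps stations/hours that appear in only one of the two tables
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / 365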
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Get top n exporters and importers
n = 5
top_importers = station_flux.nlargest(n, 'net_flux_daily')
top_exporters = station_flux.nsmallest(n, 'net_flux_daily')

importer_stations = top_importers['station_id'].tolist()
exporter_stations = top_exporters['station_id'].tolist()

# Plot top exporters and importers
plt.figure()
# Plot exporters
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    # color=plt.cm.rainbow(0.6 + 0.4 * (i / n))
    plt.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])
    
# Plot importers
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    # color=plt.cm.rainbow(0.6 + 0.4 * (i / n))
    plt.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])
    
plt.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
plt.xticks(range(24))
plt.xlabel('Hour of Day')
plt.ylabel('Daily Net Flux (Arrivals - Departures)')
plt.title('Extreme Stations: Top Exporters vs. Top Importers')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("../outputs/imbalanced_stations.png")
plt.show()

**Key findings**:  
  
**Exporter station behaviour (blue lines - feeder network)**:  
- **Consistent daily export**: All exporter stations remain predominantly negative throughout the day, confirming that bikes mainly flow one way.  
- **Extended morning rush (05-09)**  
- **Notable exceptions reveal local patterns:**
    - **Lindern (Ullevål Hospital)**: Brief import spikes at 06-07 and two dramatic export spikes; one at 15:00 (likely hospital day shift ending) and another at 22:00 (evening shift ending).  
    - **BI Nydalen**: Brief import spike at 07-08, probably students and employees arriving at the BI campus. 
- **No evening return flow**: Confirms that people don't bike back uphill, they use other means of transportation.  
  
**Importer station behavior (red lines - core network)**:  
- **Massive morning influx (07-09)**: All downtown stations show dramatic positive spikes, with peaks reaching 6+ bikes per hour net inflow.
- **Sustained high import levels throughout the day**.
- **Divergent afternoon patterns (14-17)**:  
    - **Oslo S and Aker Brygge:** Import fluxes spike again, likely due to their roles as a major transport hub and a place for leisure activities.  
    - **Other core stations**: Show more variable patterns, with some even briefly exporting bikes.  
- **Evening activities:**  
    - **Torggata emerges as top evening importer:** This area serves as a destination for evening entertainment, dining and drinks. People bike there after work for social activities. 
  
**Note on temporal variation**: These patterns represent the system behaviour averaged across the entire year. Individual patterns may vary significantly based on seasonal effects (summer/winter), weather conditions (temperature, rain) and day of week (weekend or weekday). Future analysis sections will explore these variations in detail. 

#### 4.2.2 Balanced stations
Let's now turn our attention to bike stations at intermediate elevations where the bike flow is more balanced.

In [None]:
# Get top balanced stations
balanced_stations = station_flux[
    (station_flux['net_flux'].abs() < 500) & # Small imbalance
    (station_flux['total_usage'] > 5000)     # High usage
].nlargest(n,'total_usage')

# Plot top balanced stations
plt.figure()
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    plt.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])
    
# Add rush hour shading
plt.axvspan(6, 9, alpha=0.1, color='gray', label='Morning Rush')
plt.axvspan(14.5, 17.5, alpha=0.1, color='gray', label='Afternoon Rush')
    
plt.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
plt.xticks(range(24))
plt.xlabel('Hour of Day')
plt.ylabel('Daily Net Flux (Arrivals - Departures)')
plt.title('Balanced Stations: Line Crosses Zero Throughout the Day')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("../outputs/balanced_stations.png")
plt.show()

**Key findings:**  
  
**Zero-crossing behaviour:**  
- Lines cross the `y=0` axis, indicating a two-way flow.
- No persistent directional bias: bikes flow both in and out throughout the day.  
  
**Clear temporal patterns**:
- **Morning outflow (06-09):** These stations show negative spikes, suggesting people leave these areas for work.  
- **Afternoon influx (15-18):** Unlike at the importer and exporter stations, commuters here appear to return home by bike, which causes the large influx spike in the afternoon.  
- **Alexander Kiellands Plass** shows an extreme pattern: dramatic morning exports (-7 bikes/hour) followed by strong afternoon imports (6+ bikes/hour). 

### 4.3 Weekday vs. weekend patterns

#### 4.3.1 Trip count and mean duration

In [None]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
trips['weekday'] = trips['started_at'].dt.day_name()

weekday_stats = trips.groupby('weekday').size() / 52

plt.figure()

colors = {
    'Monday': 'steelblue',
    'Tuesday': 'steelblue', 
    'Wednesday': 'steelblue', 
    'Thursday': 'steelblue', 
    'Friday': 'steelblue', 
    'Saturday': 'lightsteelblue', 
    'Sunday': 'lightsteelblue'}

bars = plt.bar(day_order, [weekday_stats[day] for day in day_order],
               color=[colors[day] for day in day_order],
            edgecolor='black', linewidth=0.5, alpha=0.8)

for bar in bars:
    height = bar.get_height()
    x = bar.get_x()
    width = bar.get_width()
    plt.text(x+width/2., height+50, f'{height:.0f}', ha='center',
             va='bottom', fontsize=10)

plt.title("Average daily bike usage: weekdays dominate")
plt.xlabel('Day of week')
plt.ylabel('Average trips per day')
plt.tight_layout()
plt.savefig('../outputs/weekly_usage_count.png')
plt.show()

In [None]:
plt.figure()
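# Select the numeric column first; a bare .mean() fails on non-numeric columns in recent pandas
weekday_stats = trips.groupby('weekday')['duration'].mean() / 60  # seconds to minutes

bars = plt.bar(day_order, [weekday_stats[day] for day in day_order],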
               color=[colors[day] for day in day_order],
            edgecolor='black', linewidth=0.5, alpha=0.8)

for bar in bars:
    height = bar.get_height()
    x = bar.get_x()
    width = bar.get_width()
    plt.text(x+width/2., height+0.2, f'{height:.0f}', ha='center',
             va='bottom', fontsize=10)
    
plt.title('Average daily trip duration: weekend trips longer')
plt.xlabel('Day of week')
plt.ylabel('Average daily trip duration [min]')
plt.tight_layout()
plt.savefig('../outputs/weekly_usage_duration.png')
plt.show()

#### 4.3.2 Flux analysis

In [None]:
weekend_trips = trips[trips['weekday'].isin(['Saturday', 'Sunday'])]
weekday_trips = trips[~trips['weekday'].isin(['Saturday', 'Sunday'])]

# Create subplots with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# WEEKDAY PLOT (Left)
# Calculate hourly flux for weekdays
hourly_arrivals = weekday_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekday_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
hourly_net_flux = (hourly_arrivals - hourly_departures)/365
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot exporters on weekday subplot
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# Plot importers on weekday subplot
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])

ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax1.set_xticks(range(24))
ax1.set_xlabel('Hour of day')
ax1.set_ylabel('Daily Net Flux (Arrivals - Departures)')
ax1.set_title('Weekdays: strong commuter patterns')
ax1.grid(True, alpha=0.3)
ax1.set_ylim([-3, 7])


# WEEKEND PLOT (Right)
# Calculate hourly flux for weekends
hourly_arrivals = weekend_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekend_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
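# Average over the weekend days present in the data (not 365)
n_weekend_days = weekend_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_weekend_days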
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot exporters on weekend subplot
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# Plot importers on weekend subplot
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])

ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax2.set_xticks(range(24))
ax2.set_xlabel('Hour of day')
ax2.set_title('Weekends: gentler leisure patterns')
ax2.grid(True, alpha=0.3)

# Add single legend to the right
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig('../outputs/weekday_vs_weekend.png', bbox_inches='tight')
plt.show()
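
The cell above builds the same net-flux table twice, once per subset. A minimal sketch of how that step could be factored into a helper, assuming the `trips` columns used in this notebook (`started_at`, `start_station_id`, `end_station_id`, `hour`). The function name `hourly_net_flux_per_day` and the toy data are hypothetical, and dividing by the number of distinct dates in the subset is one possible normalization, not necessarily the one intended above.

```python
import pandas as pd


def hourly_net_flux_per_day(trips: pd.DataFrame) -> pd.DataFrame:
    """Average net flux (arrivals - departures) per station and hour, per observed day."""
    arrivals = trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
    departures = trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)

    # Number of distinct calendar dates that appear in this subset
    n_days = trips['started_at'].dt.normalize().nunique()

    # sub() with fill_value keeps stations/hours that occur in only one table;
    # fillna(0) covers cells that are missing from both
    flux = arrivals.sub(departures, fill_value=0).fillna(0) / n_days
    return flux.reindex(columns=range(24), fill_value=0)


# Toy data (hypothetical): two morning trips 1 -> 2, one evening trip 2 -> 1
toy = pd.DataFrame({
    'started_at': pd.to_datetime(['2024-06-03 08:05', '2024-06-03 08:40', '2024-06-04 17:10']),
    'start_station_id': [1, 1, 2],
    'end_station_id': [2, 2, 1],
    'hour': [8, 8, 17],
})
flux = hourly_net_flux_per_day(toy)
print(flux.loc[:, [8, 17]])
```

With a helper like this, each panel reduces to one call, e.g. `hourly_net_flux_per_day(weekday_trips)`, and the weekday and weekend curves are both expressed per day of their own subset.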

In [None]:
# Create subplots with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# WEEKDAY PLOT (Left)
# Calculate hourly flux for weekdays
hourly_arrivals = weekday_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekday_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
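# Average over the weekdays present in the data (not 365)
n_weekdays = weekday_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_weekdays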
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot balanced stations on weekday subplot
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax1.set_xticks(range(24))
ax1.set_xlabel('Hour of day')
ax1.set_ylabel('Daily Net Flux (Arrivals - Departures)')
ax1.set_title('Weekdays: strong commuter patterns')
ax1.grid(True, alpha=0.3)
ax1.set_ylim([-8, 6])


# WEEKEND PLOT (Right)
# Calculate hourly flux for weekends
hourly_arrivals = weekend_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekend_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
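# Average over the weekend days present in the data (not 365)
n_weekend_days = weekend_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_weekend_days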
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot balanced stations on weekend subplot
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])



ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax2.set_xticks(range(24))
ax2.set_xlabel('Hour of day')
ax2.set_title('Weekends: gentler leisure patterns')
ax2.grid(True, alpha=0.3)

# Add single legend to the right
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig('../outputs/weekday_vs_weekend_balanced.png', bbox_inches='tight')
plt.show()

#### 4.3.3 Total usage analysis

In [None]:
# Quick sanity check of the weekend subset
weekend_trips.head()

In [None]:
# Create subplots with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# WEEKDAY PLOT (Left)
# Calculate hourly total usage for weekdays
hourly_arrivals = weekday_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekday_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
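# Average over the weekdays present in the data (not 365);
# add(..., fill_value=0) keeps stations or hours missing from one of the two tables
n_weekdays = weekday_trips['started_at'].dt.normalize().nunique()
hourly_total_usage = hourly_arrivals.add(hourly_departures, fill_value=0).fillna(0) / n_weekdays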
hourly_total_usage = hourly_total_usage.reindex(columns=range(24), fill_value=0)

# Plot exporters on weekday subplot
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_total_usage.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# Plot importers on weekday subplot
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_total_usage.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])

# ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax1.set_xticks(range(24))
ax1.set_xlabel('Hour of day')
ax1.set_ylabel('Daily total usage (arrivals + departures)')
ax1.set_title('Weekdays: strong commuter patterns')
ax1.grid(True, alpha=0.3)
# ax1.set_ylim([-3, 7])


# WEEKEND PLOT (Right)
# Calculate hourly total usage for weekends
hourly_arrivals = weekend_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekend_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
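# Average over the weekend days present in the data (not 365)
n_weekend_days = weekend_trips['started_at'].dt.normalize().nunique()
hourly_total_usage = hourly_arrivals.add(hourly_departures, fill_value=0).fillna(0) / n_weekend_days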
hourly_total_usage = hourly_total_usage.reindex(columns=range(24), fill_value=0)

# Plot exporters on weekend subplot
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_total_usage.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# Plot importers on weekend subplot
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_total_usage.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])

# ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax2.set_xticks(range(24))
ax2.set_xlabel('Hour of day')
ax2.set_title('Weekends: gentler leisure patterns')
ax2.grid(True, alpha=0.3)

# Add single legend to the right
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig('../outputs/weekday_vs_weekend_total_usage.png', bbox_inches='tight')
plt.show()

In [None]:

# Create subplots with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# WEEKDAY PLOT (Left)
# Calculate hourly total usage for weekdays
hourly_arrivals = weekday_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekday_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
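# Average over the weekdays present in the data (not 365)
n_weekdays = weekday_trips['started_at'].dt.normalize().nunique()
hourly_total_usage = hourly_arrivals.add(hourly_departures, fill_value=0).fillna(0) / n_weekdays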
hourly_total_usage = hourly_total_usage.reindex(columns=range(24), fill_value=0)

# Plot balanced stations on weekday subplot
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_total_usage.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax1.set_xticks(range(24))
ax1.set_xlabel('Hour of day')
ax1.set_ylabel('Daily total usage (arrivals + departures)')
ax1.set_title('Weekdays: strong commuter patterns')
ax1.grid(True, alpha=0.3)
# ax1.set_ylim([-8, 6])


# WEEKEND PLOT (Right)
# Calculate hourly total usage for weekends
hourly_arrivals = weekend_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = weekend_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
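# Average over the weekend days present in the data (not 365)
n_weekend_days = weekend_trips['started_at'].dt.normalize().nunique()
hourly_total_usage = hourly_arrivals.add(hourly_departures, fill_value=0).fillna(0) / n_weekend_days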
hourly_total_usage = hourly_total_usage.reindex(columns=range(24), fill_value=0)

# Plot balanced stations on weekend subplot
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_total_usage.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])



# ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax2.set_xticks(range(24))
ax2.set_xlabel('Hour of day')
ax2.set_title('Weekends: gentler leisure patterns')
ax2.grid(True, alpha=0.3)

# Add single legend to the right
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig('../outputs/weekday_vs_weekend_total_usage_balanced.png', bbox_inches='tight')
plt.show()

TODO: These colors are difficult to tell apart. Find a better solution, e.g. give each station a different dash style.
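
One way to address this note is to vary line style and marker per station instead of relying on color shade alone. The sketch below is only a suggestion: the helper name `station_styles` and the two style pools are hypothetical, and the strings are standard matplotlib linestyle and marker codes.

```python
from itertools import cycle, islice

# Hypothetical style pools; any valid matplotlib linestyle/marker strings work here
LINESTYLES = ['-', '--', '-.', ':']
MARKERS = ['o', 's', '^', 'D', 'v']


def station_styles(n_stations):
    """Return one dict of plot keyword arguments per station.

    The pools have 4 and 5 entries, so the (linestyle, marker) pairs
    only start repeating after 20 stations.
    """
    pairs = zip(cycle(LINESTYLES), cycle(MARKERS))
    return [{'linestyle': ls, 'marker': m} for ls, m in islice(pairs, n_stations)]


styles = station_styles(5)
for i, style in enumerate(styles):
    print(i, style)
```

Inside the plotting loops, the fixed `linestyle='-'` and `marker=...` arguments could then be replaced by `**styles[i]`, for example `ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, label=station['station_name'], **styles[i])`.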

### 4.4 Summer vs. year-round patterns

In [None]:
trips['month'] = trips['started_at'].dt.month
summer_trips = trips[trips['month'].isin([6, 7, 8])]
nonsummer_trips = trips[~trips['month'].isin([6, 7, 8])]

# Create subplots with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# SUMMER PLOT (Left)
# Calculate hourly flux for summer
hourly_arrivals = summer_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = summer_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
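# Average over the summer days present in the data (June to August has 92 days, not 90)
n_summer_days = summer_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_summer_days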
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot exporters in summer subplot
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# Plot importers in summer subplot
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])

ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax1.set_xticks(range(24))
ax1.set_xlabel('Hour of day')
ax1.set_ylabel('Daily Net Flux (Arrivals - Departures)')
ax1.set_title('Summer')
ax1.grid(True, alpha=0.3)
ax1.set_ylim([-6, 12])


# REST OF YEAR PLOT (Right)
# Calculate hourly flux for rest of year
hourly_arrivals = nonsummer_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = nonsummer_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
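# Average over the non-summer days present in the data rather than a fixed 270
n_nonsummer_days = nonsummer_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_nonsummer_days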
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot exporters rest of year subplot
for i, (_, station) in enumerate(top_exporters.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Blues(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

# Plot importers rest of year subplot
for i, (_, station) in enumerate(top_importers.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Reds(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='s', label=station['station_name'])

ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax2.set_xticks(range(24))
ax2.set_xlabel('Hour of day')
ax2.set_title('Rest of year')
ax2.grid(True, alpha=0.3)

# Add single legend to the right
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig('../outputs/summer_vs_rest_of_year.png', bbox_inches='tight')
plt.show()
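
Per-day averages depend on how many days each subset covers. As a rough cross-check, the calendar-day counts for 2024 can be computed as below. This is a sketch only: these are calendar days, not necessarily days with recorded trips, and the dictionary keys are made up for illustration.

```python
import pandas as pd

# Every calendar day of 2024 (a leap year)
days = pd.date_range('2024-01-01', '2024-12-31', freq='D')

is_weekend = days.dayofweek >= 5          # Monday=0 ... Sunday=6
is_summer = days.month.isin([6, 7, 8])    # June, July, August

day_counts = {
    'weekdays': int((~is_weekend).sum()),
    'weekend_days': int(is_weekend.sum()),
    'summer_days': int(is_summer.sum()),
    'rest_of_year_days': int((~is_summer).sum()),
}
print(day_counts)
```

For 2024 this gives 262 weekdays, 104 weekend days, 92 summer days and 274 days in the rest of the year. If the system was closed for part of the year, the number of distinct dates in `trips['started_at']` is the more relevant denominator.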

In [None]:
# Create subplots with shared y-axis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# SUMMER PLOT (Left)
# Calculate hourly flux for summer
hourly_arrivals = summer_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = summer_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
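# Average over the summer days present in the data (June to August has 92 days, not 90)
n_summer_days = summer_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_summer_days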
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot balanced stations in summer subplot
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    ax1.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])

ax1.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax1.set_xticks(range(24))
ax1.set_xlabel('Hour of day')
ax1.set_ylabel('Daily Net Flux (Arrivals - Departures)')
ax1.set_title('Summer')
ax1.grid(True, alpha=0.3)
ax1.set_ylim([-12, 9])


# REST OF YEAR PLOT (Right)
# Calculate hourly flux for rest of year
hourly_arrivals = nonsummer_trips.groupby(['end_station_id', 'hour']).size().unstack(fill_value=0)
hourly_departures = nonsummer_trips.groupby(['start_station_id', 'hour']).size().unstack(fill_value=0)
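# Average over the non-summer days present in the data rather than a fixed 270
n_nonsummer_days = nonsummer_trips['started_at'].dt.normalize().nunique()
hourly_net_flux = hourly_arrivals.sub(hourly_departures, fill_value=0).fillna(0) / n_nonsummer_days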
hourly_net_flux = hourly_net_flux.reindex(columns=range(24), fill_value=0)

# Plot balanced stations for rest of year subplot
for i, (_, station) in enumerate(balanced_stations.iterrows()):
    station_id = station['station_id']
    color = plt.cm.Greens(0.6 + 0.4 * (i / n))
    ax2.plot(range(24), hourly_net_flux.loc[station_id], c=color, 
             linestyle='-', marker='o', label=station['station_name'])



ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.7)
ax2.set_xticks(range(24))
ax2.set_xlabel('Hour of day')
ax2.set_title('Rest of year')
ax2.grid(True, alpha=0.3)

# Add single legend to the right
ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.savefig('../outputs/summer_vs_rest_of_year_balanced.png', bbox_inches='tight')
plt.show()