# Blue Bikes Demand Prediction - Linear Regression Model

This notebook builds and evaluates a linear regression model to predict daily trip counts at Blue Bikes stations.

## Contents:
1. **Load and Prepare Data** - Import processed dataset
2. **Train/Test Split** - Split by date (before 2024-01-01 train, 2024 onward test)
3. **Build Linear Regression** - Train model on selected features
4. **Evaluate Performance** - Calculate RMSE and MAE
5. **Analyze Results** - Feature importance and error analysis
6. **Conclusions** - Summary of findings

---

**Goal:** Predict daily trip count per station using weather and temporal features.

**Model:** Linear Regression (baseline model for comparison with future advanced models)

**Evaluation:** RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R^2 on the held-out test data (2024 onward)

## 1. Load and Prepare Data

Import the libraries used throughout the notebook and set the plot style. The processed dataset itself is loaded in the next section.

In [None]:
# Imports
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Plot style
sns.set(style="whitegrid", context="notebook")
plt.rcParams["figure.figsize"] = (10, 5)

print("Libraries imported.")

## 2. Load Processed Dataset

Load the processed daily dataset from the first candidate path that exists. The dataset must contain a `date` column and a numeric target column such as `trip_count`.
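
As a minimal sketch of these checks, the snippet below parses a tiny in-memory CSV (made-up rows, not Blue Bikes data), converts `date` to datetime, and picks the first matching target column. The helper name `find_target` is hypothetical, and `errors="coerce"` is shown only as one option: it turns unparseable dates into `NaT` instead of raising.

```python
import io
import pandas as pd

# Made-up rows for illustration only; not real Blue Bikes data.
csv_text = """date,station_id,trip_count,temp_max
2023-01-01,A,12,3.5
2023-01-02,A,,1.0
not-a-date,B,7,2.2
"""

def find_target(df, candidates=("trip_count", "count", "rides", "n_trips")):
    """Return the first candidate column present in df, or None (hypothetical helper)."""
    return next((c for c in candidates if c in df.columns), None)

demo = pd.read_csv(io.StringIO(csv_text))
# errors="coerce" converts unparseable dates to NaT instead of raising.
demo["date"] = pd.to_datetime(demo["date"], errors="coerce")
target = find_target(demo)
# Drop rows where either the date or the target is missing.
clean = demo.dropna(subset=["date", target]).reset_index(drop=True)
print(target, len(clean))
```

Whether to coerce or to fail loudly on bad dates is a judgment call; failing loudly is usually safer for a processed dataset that is expected to be clean.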

In [None]:
# Robustly resolve processed CSV path(s)
base = Path.cwd()
# Notebook is likely in notebooks/; processed data likely in ../data/processed/
candidates = [
    base / "../data/processed/merged_data.csv",
    base / "../data/processed/sample_merged_data.csv",
]

path = None
for c in candidates:
    if c.exists():
        path = c.resolve()
        break

if path is None:
    raise FileNotFoundError("Could not find processed data. Expected one of: " + ", ".join(map(str, candidates)))

print(f"Loading: {path}")

df = pd.read_csv(path)

# Ensure date column exists and is datetime
if "date" in df.columns:
    df["date"] = pd.to_datetime(df["date"])  # raises on unparseable dates; pass errors="coerce" to get NaT instead
else:
    raise KeyError("Expected a 'date' column in processed dataset.")

# Ensure target exists; default to 'trip_count' if present
candidate_targets = ["trip_count", "count", "rides", "n_trips"]
y_col = next((c for c in candidate_targets if c in df.columns), None)
if y_col is None:
    raise KeyError(f"Could not find target column in {candidate_targets}.")

# Drop rows with missing target, then sort by date
if df[y_col].isna().any():
    print(f"Dropping {df[y_col].isna().sum()} rows with NaN {y_col}.")
    df = df.dropna(subset=[y_col])

df = df.sort_values("date").reset_index(drop=True)

# Convenience year column if missing
if "year" not in df.columns:
    df["year"] = df["date"].dt.year

print("Loaded rows:", len(df))
df.head(3)

## 3. Train/Test Split (temporal)

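Use a time-based split to avoid leakage: rows dated before 2024-01-01 form the training set, and rows from 2024-01-01 onward form the test set. If no 2024+ rows are available, the last 20% of the time-ordered rows are held out instead.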
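
The same idea can be sketched as a small reusable function. `temporal_split` is a hypothetical helper written for illustration, and the data is synthetic. It falls back to a positional split whenever either side of the date split would be empty.

```python
import pandas as pd

def temporal_split(df, split_date, date_col="date", fallback_frac=0.8):
    """Split rows by date; fall back to a positional split if either side is empty.

    Hypothetical helper for illustration, not part of any library.
    """
    df = df.sort_values(date_col).reset_index(drop=True)
    train = df[df[date_col] < split_date]
    test = df[df[date_col] >= split_date]
    if len(train) == 0 or len(test) == 0:
        # Hold out the last (1 - fallback_frac) share of time-ordered rows.
        cutoff = int(len(df) * fallback_frac)
        train, test = df.iloc[:cutoff], df.iloc[cutoff:]
    return train.copy(), test.copy()

# Synthetic daily data spanning the split date (made-up counts).
demo = pd.DataFrame({
    "date": pd.date_range("2023-12-28", periods=8, freq="D"),
    "trip_count": [5, 7, 6, 9, 4, 8, 10, 3],
})
train_demo, test_demo = temporal_split(demo, pd.Timestamp("2024-01-01"))
print(len(train_demo), len(test_demo))
```

Because every training date precedes every test date, the model is never evaluated on days it could have seen during fitting.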

In [None]:
split_date = pd.Timestamp("2024-01-01")

train_df = df[df["date"] < split_date].copy()
test_df = df[df["date"] >= split_date].copy()

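# Fall back to a positional split if either side of the date split is empty
if len(train_df) == 0 or len(test_df) == 0:
    print("Date split left one side empty; using last 20% of time-ordered data as test set.")
    cutoff = int(len(df) * 0.8)
    train_df = df.iloc[:cutoff].copy()
    test_df = df.iloc[cutoff:].copy()

# Report the sizes of the split that is actually used
print(f"Train rows: {len(train_df)}; Test rows: {len(test_df)}")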

train_df[["date", y_col]].head(3)

## 4. Feature Selection

Pick candidate features from weather and calendar; use only those that exist in the data.
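
A minimal sketch of this step on synthetic data: derive calendar columns from `date`, then keep only the candidate features that are actually present. `add_calendar_features` is a hypothetical helper, and the values are made up.

```python
import pandas as pd

def add_calendar_features(df, date_col="date"):
    """Return a copy with calendar columns derived from date_col (hypothetical helper)."""
    out = df.copy()
    out["month"] = out[date_col].dt.month
    out["day_of_week"] = out[date_col].dt.dayofweek  # Monday=0 ... Sunday=6
    out["day_of_year"] = out[date_col].dt.dayofyear
    out["is_weekend"] = out["day_of_week"].isin([5, 6]).astype(int)
    return out

demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"]),  # Fri, Sat, Sun
    "temp_max": [2.0, 4.5, 1.0],  # made-up values
})
demo = add_calendar_features(demo)

candidate_features = ["temp_max", "precipitation", "is_weekend", "month", "day_of_week"]
# Keep only the candidates that actually exist, preserving the candidate order.
available = [c for c in candidate_features if c in demo.columns]
print(available)
print(demo["is_weekend"].tolist())
```

Here `precipitation` is silently skipped because the demo frame lacks it, which is the behavior the intersection is meant to provide.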

In [None]:
# Candidate features; we'll intersect with available columns
candidate_features = [
    # weather
    "temp_max", "temp_min", "windspeed_max", "precipitation", "snowfall", "snow_depth",
    # calendar/temporal
    "is_weekend", "is_holiday", "month", "day_of_week", "day_of_year",
]

# Ensure basic temporal features exist; derive missing ones from `date` on every frame
frames = (df, train_df, test_df)
derived = {
    "month": lambda d: d["date"].dt.month,
    "day_of_week": lambda d: d["date"].dt.dayofweek,  # Monday=0 ... Sunday=6
    "day_of_year": lambda d: d["date"].dt.dayofyear,
    # Must come after day_of_week, which it depends on
    "is_weekend": lambda d: d["day_of_week"].isin([5, 6]).astype(int),
}
for name, make in derived.items():
    if name not in df.columns:
        for frame in frames:
            frame[name] = make(frame)

# Booleans to ints if present
for b in ["is_weekend", "is_holiday"]:
    if b in df.columns and df[b].dtype == bool:
        for frame in frames:
            frame[b] = frame[b].astype(int)

available_features = [c for c in candidate_features if c in df.columns]
print("Using features:", available_features)

# Build matrices
X_train = train_df[available_features].copy()
X_test = test_df[available_features].copy()
y_train = train_df[y_col].astype(float).values
y_test = test_df[y_col].astype(float).values

X_train.shape, X_test.shape

## 5. Train Linear Regression

Fit a simple baseline linear regression to available features.
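
To illustrate how the fitted coefficients should be read, here is a minimal sketch on noise-free synthetic data (not Blue Bikes data) where the true relationship is known. Each coefficient is the change in the prediction per unit increase in that feature with the others held fixed, so its size depends on the feature's units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data: y = 3*x1 - 2*x2 + 5 (made-up, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)

# With no noise, ordinary least squares recovers the generating coefficients.
print(model.coef_, model.intercept_)
```

On real data the recovered coefficients are estimates, and correlated inputs such as `temp_max` and `temp_min` can make individual coefficients unstable even when predictions are reasonable.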

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
train_pred = lr.predict(X_train)
test_pred = lr.predict(X_test)

print("Coefficients:")
for name, coef in zip(available_features, lr.coef_):
    print(f"  {name:>15}: {coef: .4f}")
print(f"Intercept: {lr.intercept_: .4f}")

## 6. Evaluate Model

Report MAE, RMSE, and R^2 on train and test splits.
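
For reference, the three metrics can be computed directly from their definitions. The sketch below uses NumPy only and made-up numbers. `regression_metrics` is a hypothetical helper that should agree with the scikit-learn functions on the same inputs.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))          # average absolute error
    rmse = np.sqrt(np.mean(err ** 2))   # penalizes large errors more than MAE
    ss_res = np.sum(err ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up values for illustration.
m = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(m)
```

MAE and RMSE are in the same units as the target (trips per day), while R^2 is unitless and can be negative on a test set when the model does worse than predicting the test mean.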

In [None]:
def compute_metrics(y_true, y_pred, label=""):
    mae = mean_absolute_error(y_true, y_pred)
    # Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    print(f"{label} MAE:  {mae:,.2f}")
    print(f"{label} RMSE: {rmse:,.2f}")
    print(f"{label} R^2:  {r2:,.4f}")
    return {"mae": mae, "rmse": rmse, "r2": r2}

print("Train performance:")
train_metrics = compute_metrics(y_train, train_pred, label="Train")
print("\nTest performance:")
test_metrics = compute_metrics(y_test, test_pred, label="Test")