# 02 - Feature Engineering

Goal of this notebook:

- Take the merged hourly dataset (`base_hourly_2015_2024.parquet`)
- Create time-series features (lags, rolling means, diffs)
- Create simple weather & event features
- Output a model-ready dataset


In [99]:
# Imports and config
import pandas as pd
import numpy as np
from pathlib import Path

DATA_DIR = Path("../data/processed")
INPUT_PATH = DATA_DIR / "base_hourly_2015_2024.parquet"
OUTPUT_PATH = DATA_DIR / "model_ready_hourly.parquet"

pd.set_option("display.max_columns", 80)


In [100]:
# load base table
df = pd.read_parquet(INPUT_PATH)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values("timestamp").reset_index(drop=True)
print(df.head())

            timestamp  rides  temperature_2m  precipitation  windspeed_10m  \
0 2015-01-01 00:00:00  28312            -3.8            0.0           14.1   
1 2015-01-01 01:00:00  31707            -3.9            0.0           14.1   
2 2015-01-01 02:00:00  28068            -4.0            0.0           14.7   
3 2015-01-01 03:00:00  24288            -4.0            0.0           15.6   
4 2015-01-01 04:00:00  17081            -4.0            0.0           16.1   

   is_rain  event_count  has_event  heavy_event  hour  dayofweek  is_weekend  \
0        0          4.0        1.0            1     0          3           0   
1        0          8.0        1.0            1     1          3           0   
2        0          7.0        1.0            1     2          3           0   
3        0          4.0        1.0            1     3          3           0   
4        0          4.0        1.0            1     4          3           0   

   is_holiday  year  month  day  
0           1  2

In [101]:
# Quick checks
print("Shape:", df.shape)
print("Time range:", df["timestamp"].min(), "-", df["timestamp"].max())
print("\nColumns:\n", df.columns.tolist())

df[["rides", "temperature_2m", "precipitation", "windspeed_10m", "event_count"]].describe()


Shape: (87666, 16)
Time range: 2015-01-01 00:00:00 - 2024-12-31 23:00:00

Columns:
 ['timestamp', 'rides', 'temperature_2m', 'precipitation', 'windspeed_10m', 'is_rain', 'event_count', 'has_event', 'heavy_event', 'hour', 'dayofweek', 'is_weekend', 'is_holiday', 'year', 'month', 'day']


| | rides | temperature_2m | precipitation | windspeed_10m | event_count |
|---|---|---|---|---|---|
| count | 87666.0 | 87666.0 | 87666.0 | 87666.0 | 87666.0 |
| mean | 8587.456197 | 12.417274 | 0.148167 | 13.018612 | 2216.033742 |
| std | 6887.654832 | 9.926503 | 0.686577 | 6.814905 | 3241.007137 |
| min | 2.0 | -20.6 | 0.0 | 0.0 | 0.0 |
| 25% | 2889.25 | 4.5 | 0.0 | 7.9 | 0.0 |
| 50% | 6403.0 | 12.6 | 0.0 | 11.7 | 121.0 |
| 75% | 14224.0 | 20.8 | 0.0 | 16.9 | 3873.0 |
| max | 45849.0 | 37.4 | 33.8 | 60.4 | 29537.0 |


In [102]:
# Recompute calendar parts from timestamp (these columns already exist in the base table)
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day

## 1. Time-based lags and rolling features

For each hour `t`, I use information from previous hours (lags, rolling stats) to predict `rides_t`.

I’ll create:

- `lag_1`, `lag_24`, `lag_168` (1h, 1 day, 1 week)
- rolling means over 3h and 24h
- optional simple differences


In [103]:
# Create lag features for rides
df = df.sort_values("timestamp").reset_index(drop=True)

df["lag_1"] = df["rides"].shift(1)
df["lag_24"] = df["rides"].shift(24)
df["lag_168"] = df["rides"].shift(168) # 1 week
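
Note that `shift(n)` moves values by rows, not by hours, so `lag_24` is a true 24-hour lag only if the table is a gap-free hourly grid. The quick check above reports 87,666 rows, while a complete hourly range from 2015-01-01 00:00 to 2024-12-31 23:00 would have 3,653 days × 24 = 87,672 rows, so a handful of hours may be missing. A minimal sketch on toy data (the `toy` frame and its values are made up for illustration) of how one might detect gaps and compute a lag by timestamp instead:

```python
import pandas as pd

one_hour = pd.Timedelta(hours=1)

# Toy hourly series with one hour (02:00) removed to simulate a gap.
ts = pd.date_range("2015-01-01", periods=6, freq=one_hour)
toy = pd.DataFrame({"timestamp": ts, "rides": [10, 20, 30, 40, 50, 60]})
toy = toy.drop(index=2).reset_index(drop=True)

# Compare against the complete hourly grid to find missing timestamps.
full_index = pd.date_range(toy["timestamp"].min(), toy["timestamp"].max(), freq=one_hour)
missing = full_index.difference(pd.DatetimeIndex(toy["timestamp"]))
print("Missing hours:", list(missing))

# Row-based lag: at 03:00 this silently picks up the 01:00 value.
toy["lag_1_rows"] = toy["rides"].shift(1)

# Time-based lag: reindex to the full grid, shift, then map back to the original rows.
on_grid = toy.set_index("timestamp")["rides"].reindex(full_index)
toy["lag_1_time"] = on_grid.shift(1).reindex(pd.DatetimeIndex(toy["timestamp"])).to_numpy()

print(toy)
```

With the time-based version, a lag that would reach into a missing hour becomes NaN instead of quietly borrowing a value from a different hour.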

In [104]:
# Rolling means (using past values only)
df["roll_mean_3"] = df["rides"].shift(1).rolling(window=3).mean()
df["roll_mean_24"] = df["rides"].shift(1).rolling(window=24).mean()
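
To see why the `shift(1)` before `rolling` matters, here is a small sketch on made-up numbers: without the shift, the window ending at row `t` includes `rides_t` itself.

```python
import pandas as pd

rides = pd.Series([10, 20, 30, 40, 50])

# Window over rows t-2..t: includes the current value (leaks the target).
roll_with_current = rides.rolling(window=3).mean()

# Window over rows t-3..t-1: past values only, as used in this notebook.
roll_past_only = rides.shift(1).rolling(window=3).mean()

print(pd.DataFrame({
    "rides": rides,
    "roll_with_current": roll_with_current,
    "roll_past_only": roll_past_only,
}))
```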

In [105]:
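# Simple difference features (past values only).
# Subtracting from the current `rides` would leak the target: lag_1 + diff_1 == rides.
df["diff_1"] = df["lag_1"] - df["rides"].shift(2)    # change from t-2 to t-1
df["diff_24"] = df["lag_1"] - df["rides"].shift(25)  # change from t-25 to t-1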
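
A difference feature is only usable at prediction time if it is built from past values. As a sketch on made-up numbers: a difference that subtracts from the current `rides` lets `lag_1 + diff` reconstruct the target exactly, whereas the difference between the two most recent past hours does not. The column names `diff_current` and `diff_past` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"rides": [10, 25, 15, 40, 30]})
toy["lag_1"] = toy["rides"].shift(1)

# Uses rides_t itself: together with lag_1 this reveals the target.
toy["diff_current"] = toy["rides"] - toy["lag_1"]

# Uses only rides_{t-1} and rides_{t-2}: both are known before hour t.
toy["diff_past"] = toy["lag_1"] - toy["rides"].shift(2)

valid = toy.dropna()
leaks = bool((valid["lag_1"] + valid["diff_current"] == valid["rides"]).all())
safe = not bool((valid["lag_1"] + valid["diff_past"] == valid["rides"]).all())

print("lag_1 + diff_current reproduces rides:", leaks)
print("lag_1 + diff_past does not reproduce rides:", safe)
print(valid)
```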

## 2. Weather-derived features

I already have:

- `temperature_2m`
- `precipitation`
- `windspeed_10m`

I’ll now derive:

- `is_rain` (binary; already present in the base table, recomputed here as `precipitation > 0`)
- simple temperature buckets (cold / mild / hot) – optional categorical


In [106]:
# Weather binary / categorical features
df["is_rain"] = (df["precipitation"] > 0).astype(int)

def temp_bucket(x):
    if x <= 0:
        return "cold"
    elif x <= 20:
        return "mild"
    else:
        return "hot"

df["temp_bucket"] = df["temperature_2m"].apply(temp_bucket)

df["temp_bucket"].value_counts()


temp_bucket
mild    53004
hot     24267
cold    10395
Name: count, dtype: int64
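
The same thresholds can also be applied without a row-wise `apply`. The sketch below uses `pd.cut` on made-up temperatures; with `right=True` each bin includes its upper edge, which matches the `<= 0` and `<= 20` comparisons above. Since `temp_bucket` holds strings, it would need encoding (for example with `pd.get_dummies`) before most models could use it; whether to include it at all is left open here.

```python
import numpy as np
import pandas as pd

# Made-up temperatures chosen to sit on and around the bucket edges.
temps = pd.Series([-5.0, 0.0, 0.1, 20.0, 20.1, 35.0], name="temperature_2m")

# Bins are (-inf, 0], (0, 20], (20, inf), mirroring temp_bucket().
buckets = pd.cut(
    temps,
    bins=[-np.inf, 0, 20, np.inf],
    labels=["cold", "mild", "hot"],
    right=True,
)
print(buckets.tolist())

# One-hot encode the categorical bucket into indicator columns.
dummies = pd.get_dummies(buckets, prefix="temp")
print(dummies.columns.tolist())
```

One behavioral difference: `pd.cut` leaves a NaN temperature as NaN, while `temp_bucket` would label it "hot" because both comparisons evaluate to False.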

## 3. Event features

I already have:

- `event_count`
- `has_event`

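I can now derive:

- `log_event_count`
- `heavy_event`: a separate flag for “heavy event hours” (3+ events). This column already exists in the base table and is recomputed here. Note that the median `event_count` in the summary above is 121, so a threshold of 3 flags most hours that have any event; a higher or quantile-based cutoff may separate heavy hours better.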


In [107]:
# Flag hours with 3 or more events
df["heavy_event"] = (df["event_count"] >= 3).astype(int)

# event_count ranges from 0 to ~29,500, so log-transform it; log1p(x) = log(1 + x) handles zeros
df["log_event_count"] = np.log1p(df["event_count"])
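
For reference, `np.log1p(x)` computes `log(1 + x)`, so hours with zero events map to 0, and `np.expm1` inverts the transform. A quick sketch using the minimum, median, 75th percentile, and maximum of `event_count` from the `describe()` output above:

```python
import numpy as np

# min, median, 75th percentile, and max of event_count from describe().
event_count = np.array([0.0, 121.0, 3873.0, 29537.0])

log_event_count = np.log1p(event_count)
print(np.round(log_event_count, 2))

# expm1 recovers the original counts (up to floating-point error).
recovered = np.expm1(log_event_count)
print(np.allclose(recovered, event_count))
```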


## 4. Calendar features (verify)

I expect from the ETL:

- `year`, `month`, `day`, `hour`
- `dayofweek` (0=Mon, …, 6=Sun)
- `is_weekend`
- `is_holiday`

I’ll just check they exist and look reasonable.


In [108]:
calendar_cols = ["year", "month", "day", "hour", "dayofweek", "is_weekend", "is_holiday"]
print({c: (c in df.columns) for c in calendar_cols})

df[calendar_cols].head()


{'year': True, 'month': True, 'day': True, 'hour': True, 'dayofweek': True, 'is_weekend': True, 'is_holiday': True}


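| | year | month | day | hour | dayofweek | is_weekend | is_holiday |
|---|---|---|---|---|---|---|---|
| 0 | 2015 | 1 | 1 | 0 | 3 | 0 | 1 |
| 1 | 2015 | 1 | 1 | 1 | 3 | 0 | 1 |
| 2 | 2015 | 1 | 1 | 2 | 3 | 0 | 1 |
| 3 | 2015 | 1 | 1 | 3 | 3 | 0 | 1 |
| 4 | 2015 | 1 | 1 | 4 | 3 | 0 | 1 |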
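
Beyond checking that the columns exist, the calendar fields can be recomputed from `timestamp` and compared. A minimal sketch on a toy frame; it assumes `is_weekend` means Saturday or Sunday (`dayofweek >= 5`), which this notebook does not state explicitly. `is_holiday` is not checked because it depends on the holiday calendar used in the ETL.

```python
import pandas as pd

# Toy frame standing in for df; 2015-01-01 was a Thursday (dayofweek == 3).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-03 12:00", "2015-01-04 23:00"]
    ),
    "hour": [0, 12, 23],
    "dayofweek": [3, 5, 6],
    "is_weekend": [0, 1, 1],
})

expected_weekend = (toy["timestamp"].dt.dayofweek >= 5).astype(int)

checks = {
    "hour": bool((toy["hour"] == toy["timestamp"].dt.hour).all()),
    "dayofweek": bool((toy["dayofweek"] == toy["timestamp"].dt.dayofweek).all()),
    # Assumption: weekend = Saturday (5) or Sunday (6).
    "is_weekend": bool((toy["is_weekend"] == expected_weekend).all()),
}
print(checks)
```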


## 5. Drop rows with NaNs introduced by lags/rollings

Because the longest lag is 168 hours, the first 168 rows have a NaN in at least one lag or rolling feature.
I drop them so that every remaining row has a complete history and the NaNs do not cause training issues.


In [109]:
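# Drop only rows where a lag/rolling/diff feature is missing, so the count below is accurate
lag_cols = ["lag_1", "lag_24", "lag_168", "roll_mean_3", "roll_mean_24", "diff_1", "diff_24"]

before = df.shape[0]
df = df.dropna(subset=lag_cols).reset_index(drop=True)
after = df.shape[0]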

print(f"Dropped {before - after} rows due to lag/rolling NaNs.")
print("New shape:", df.shape)


Dropped 168 rows due to lag/rolling NaNs.
New shape: (87498, 25)
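
The count of 168 follows from the longest lag: `shift(k)` leaves the first `k` rows empty, and `shift(1).rolling(window=w)` leaves the first `w` rows empty, so the number of incomplete rows is set by the largest of these. A sketch on a synthetic series of 200 rows:

```python
import numpy as np
import pandas as pd

rides = pd.Series(np.arange(200, dtype=float))

features = pd.DataFrame({
    "lag_1": rides.shift(1),
    "lag_24": rides.shift(24),
    "lag_168": rides.shift(168),
    "roll_mean_24": rides.shift(1).rolling(window=24).mean(),
})

# NaNs per column: 1, 24, 168, and 24 (1 from the shift + 23 from the window).
nan_counts = features.isna().sum()
print(nan_counts)

# Rows with at least one NaN: governed by the longest lag.
n_incomplete = int(features.isna().any(axis=1).sum())
print("Rows dropped by dropna():", n_incomplete)
```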


## 6. Define feature columns and target

Target:
- `y = rides`

Features (`X`), initial set:

- lags: `lag_1`, `lag_24`, `lag_168`
- rolling: `roll_mean_3`, `roll_mean_24`
- differences: `diff_1`, `diff_24`
- weather: `temperature_2m`, `precipitation`, `windspeed_10m`, `is_rain`
- events: `event_count`, `has_event`, `heavy_event`, `log_event_count`
- calendar: `hour`, `dayofweek`, `is_weekend`, `is_holiday`

I’ll keep `timestamp` in the DataFrame for convenience, but exclude it from the feature list in the modeling notebook. `temp_bucket`, `year`, `month`, and `day` are left out of this initial feature list.


In [110]:
target_col = "rides"

feature_cols = [
    "lag_1", "lag_24", "lag_168",
    "roll_mean_3", "roll_mean_24",
    "diff_1", "diff_24",
    "temperature_2m", "precipitation", "windspeed_10m", "is_rain",
    "event_count", "has_event", "heavy_event", "log_event_count",
    "hour", "dayofweek", "is_weekend", "is_holiday",
]

print("Number of features:", len(feature_cols))
feature_cols


Number of features: 19


['lag_1',
 'lag_24',
 'lag_168',
 'roll_mean_3',
 'roll_mean_24',
 'diff_1',
 'diff_24',
 'temperature_2m',
 'precipitation',
 'windspeed_10m',
 'is_rain',
 'event_count',
 'has_event',
 'heavy_event',
 'log_event_count',
 'hour',
 'dayofweek',
 'is_weekend',
 'is_holiday']

In [111]:
# Check for remaining missing values in features/target
subset = df[feature_cols + [target_col]]
subset.isna().sum()


lag_1              0
lag_24             0
lag_168            0
roll_mean_3        0
roll_mean_24       0
diff_1             0
diff_24            0
temperature_2m     0
precipitation      0
windspeed_10m      0
is_rain            0
event_count        0
has_event          0
heavy_event        0
log_event_count    0
hour               0
dayofweek          0
is_weekend         0
is_holiday         0
rides              0
dtype: int64

## 7. Save model-ready dataset

I keep:

- `timestamp` (for plotting & time-based splits)
- the 19 columns in `feature_cols`
- `rides` as target

Columns outside this list (`year`, `month`, `day`, `temp_bucket`) are not saved.


In [112]:
cols_to_keep = ["timestamp", target_col] + feature_cols
df_model = df[cols_to_keep].copy()

df_model.to_parquet(OUTPUT_PATH, index=False)
print(f"Saved dataset to: {OUTPUT_PATH}")
print("Shape:", df_model.shape)


Saved dataset to: ..\data\processed\model_ready_hourly.parquet
Shape: (87498, 21)


## 8. Summary

---

- I created lag, rolling, diff, weather, and event features, and verified the calendar features from the ETL.
- The dataset is now prepared for supervised learning:
  - `y = rides`
  - `X = the columns in feature_cols`
- Next notebook (`03_modeling.ipynb`) will:
  - split data into train/val/test by time
  - implement naive and seasonal naive baselines
  - train XGBoost / other models
  - compare metrics and plot predictions vs actual values
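
As a rough preview of the first two steps, here is a sketch of a chronological split and the two naive baselines on synthetic data. The cutoff dates, the synthetic `rides` series, and the `mae` helper are placeholders for illustration, not choices made in this project.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data with a daily cycle, standing in for the saved dataset.
ts = pd.date_range("2024-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
hours = ts.hour.to_numpy()
rides = 100 + 50 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, len(ts))

toy = pd.DataFrame({"timestamp": ts, "rides": rides})
toy["lag_1"] = toy["rides"].shift(1)
toy["lag_24"] = toy["rides"].shift(24)
toy = toy.dropna().reset_index(drop=True)

# Chronological split: no shuffling, later periods are held out.
val_start = pd.Timestamp("2024-02-01")   # placeholder cutoff
test_start = pd.Timestamp("2024-02-15")  # placeholder cutoff
train = toy[toy["timestamp"] < val_start]
val = toy[(toy["timestamp"] >= val_start) & (toy["timestamp"] < test_start)]
test = toy[toy["timestamp"] >= test_start]


def mae(y_true, y_pred):
    """Mean absolute error (hypothetical helper for this sketch)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Naive baseline: previous hour. Seasonal naive: same hour on the previous day.
print("naive MAE:", mae(test["rides"], test["lag_1"]))
print("seasonal naive MAE:", mae(test["rides"], test["lag_24"]))
```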
