# Lecture 5: Programming Example - Advanced Feature Engineering for Transportation Data

## Introduction: Transforming Clean Data into ML-Ready Features

Today you'll learn about advanced preprocessing and feature engineering techniques that transform your clean bike-sharing data into optimized inputs for machine learning models. We'll apply categorical encoding first, then implement scaling strategies, and finally create time-based features that capture transportation demand patterns.

Every feature we create serves a specific purpose: enabling machine learning algorithms to better understand and predict transportation demand patterns.

> **🚀 Interactive Learning Alert**
> 
> This is a hands-on feature engineering tutorial with detective work and problem-solving. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your feature engineering skills
> - **Think like a consultant** - every decision impacts client trust

---

## Step 1: Categorical Feature Engineering with One-Hot Encoding

Let's begin by transforming categorical variables into machine learning-compatible formats using the most fundamental technique: one-hot encoding for nominal categories.

We'll use pandas' `get_dummies()` function to convert categorical weather conditions into binary columns that machine learning algorithms can process. This function creates a separate binary column for each unique category, where 1 indicates the presence of that category and 0 indicates its absence.

Let's transform weather conditions into one-hot encoded features:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Create weather_condition column from numeric weather codes
if 'weather' in df.columns:
    weather_map = {1: 'Clear', 2: 'Misty', 3: 'Light Rain', 4: 'Heavy Rain'}
    df['weather_condition'] = df['weather'].map(weather_map)
    print("Weather condition categories:")
    print(df['weather_condition'].value_counts())
else:
    print("Weather column not found - creating synthetic weather conditions")
    df['weather_condition'] = 'Clear'

# Apply one-hot encoding
weather_encoded = pd.get_dummies(df['weather_condition'], prefix='weather')
print("\nOne-hot encoded weather columns:")
print(weather_encoded.head())

# Add encoded columns to main dataframe
df = pd.concat([df, weather_encoded], axis=1)
print(f"\nOriginal 1 column → {len(weather_encoded.columns)} binary columns")

**What this accomplishes:**
- Converts text weather labels into numerical format algorithms can process
- Creates one binary column per weather type present in the data (4 if all four weather codes occur)
- Preserves all category information without implying false orderings

**Business value:**
Enables the model to learn that "Clear" weather drives high recreational demand while "Light Rain" reduces casual ridership but maintains commuter patterns.
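
To make the output shape of `get_dummies()` concrete, here is a minimal sketch on a hypothetical toy column (not the course dataset). It also shows the optional `drop_first=True` argument, which drops one category to serve as an implicit baseline; whether that is desirable depends on the model, so treat it as an option rather than a recommendation.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"weather_condition": ["Clear", "Misty", "Light Rain", "Clear"]})

# Full one-hot encoding: one column per category that actually appears
full = pd.get_dummies(toy["weather_condition"], prefix="weather", dtype=int)

# drop_first=True removes the first category in sorted order ("Clear" here),
# which then acts as the baseline encoded by all-zero rows
reduced = pd.get_dummies(
    toy["weather_condition"], prefix="weather", drop_first=True, dtype=int
)

print(full)
print(reduced)
```

Note that only three columns appear in `full`, because "Heavy Rain" never occurs in the toy data: `get_dummies()` creates columns only for categories it sees.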

### Challenge 1: Create Day Type Categories with One-Hot Encoding
Create a `day_type` feature that categorizes days as 'weekday', 'weekend', or 'holiday', then apply one-hot encoding.

In [None]:
# Your code here - create day_type categories and apply one-hot encoding

# Step 1: Extract day of week from datetime (0=Monday, 6=Sunday)
df['day_of_week'] = df['datetime'].dt._____  # Fill in: dayofweek

# Step 2: Create day_type categories
df['day_type'] = '_____'  # Fill in: 'weekday' as default
df.loc[df['day_of_week'].isin([5, 6]), 'day_type'] = '_____'  # Fill in: 'weekend' for Sat/Sun
df.loc[df['holiday'] == 1, 'day_type'] = '_____'  # Fill in: 'holiday'

# Step 3: Apply one-hot encoding
day_type_encoded = pd.get_dummies(df['_____'], prefix='daytype')  # Fill in: 'day_type'
df = pd.concat([df, day_type_encoded], axis=1)

print("Day type distribution:")
print(df['day_type'].value_counts())
print(f"\nCreated {len(day_type_encoded.columns)} binary day type features")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Think about this in three steps: First, extract day of week using `.dt.dayofweek` (returns 0-6 where 0=Monday, 6=Sunday). Second, create a default 'weekday' label, then override for weekends and holidays. Third, apply `pd.get_dummies()` just like we did for weather conditions. Watch out: `.dt.dayofweek` uses 0=Monday, so weekends are 5 and 6 (not 6 and 7). Also, order matters when setting categories - set holidays AFTER weekends if you want holidays to take priority. Remember `.isin([5, 6])` for checking multiple values - using `== 5 or == 6` won't work in pandas. Your client needs different strategies for weekdays (commute-focused) vs weekends (recreation) vs holidays (special events).

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Step 1: Extract day of week from datetime (0=Monday, 6=Sunday)
df['day_of_week'] = df['datetime'].dt.dayofweek

# Step 2: Create day_type categories
df['day_type'] = 'weekday'
df.loc[df['day_of_week'].isin([5, 6]), 'day_type'] = 'weekend'
df.loc[df['holiday'] == 1, 'day_type'] = 'holiday'

# Step 3: Apply one-hot encoding
day_type_encoded = pd.get_dummies(df['day_type'], prefix='daytype')
df = pd.concat([df, day_type_encoded], axis=1)

print("Day type distribution:")
print(df['day_type'].value_counts())
print(f"\nCreated {len(day_type_encoded.columns)} binary day type features")
```
</details>

---

## Step 2: Binary Encoding for Business-Specific Conditions

Beyond standard categorical encoding, transportation consultants create focused binary indicators (0 or 1) that flag business-critical conditions. These features encode domain expertise directly into your data.

We'll create boolean conditions using pandas' `.isin()` method, then convert True/False values to 1/0 integers with `.astype(int)` for machine learning compatibility.

Let's build business-relevant binary features:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Extract hour and day of week from datetime
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek

# Binary rush hour indicator (morning 7-9, evening 17-19)
rush_hours = [7, 8, 9, 17, 18, 19]
df['is_rush_hour'] = df['hour'].isin(rush_hours).astype(int)

# Binary weekend indicator
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Binary good weather indicator (Clear or Misty conditions)
if 'weather' in df.columns:
    df['is_good_weather'] = df['weather'].isin([1, 2]).astype(int)
else:
    df['is_good_weather'] = 1  # Default to good weather if data unavailable

print("Binary feature distributions:")
print(f"Rush hour periods: {df['is_rush_hour'].mean()*100:.1f}% of hours")
print(f"Weekend hours: {df['is_weekend'].mean()*100:.1f}% of records")
print(f"Good weather: {df['is_good_weather'].mean()*100:.1f}% of time")

print("\nSample of binary features:")
print(df[['datetime', 'is_rush_hour', 'is_weekend', 'is_good_weather']].head(10))

**What this accomplishes:**
- Creates focused indicators for operationally critical conditions
- Encodes transportation domain expertise into features
- Provides clear yes/no signals that models can easily learn from

**Business value:**
These flags help models learn distinct patterns like "weekend + good weather = high recreational demand" or "rush hour + weekday = commuter surge."
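
The combinations described above can also be written out as explicit interaction flags. The following is a small sketch on hypothetical toy rows; the column names `is_weekend_good_weather` and `is_weekday_rush` are illustrative assumptions, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "hour": [8, 14, 18, 23],
    "day_of_week": [0, 5, 2, 6],  # 0=Monday ... 6=Sunday
    "weather": [1, 2, 3, 1],      # numeric weather codes as used in the lesson
})

# Same binary flags as in the step above
toy["is_rush_hour"] = toy["hour"].isin([7, 8, 9, 17, 18, 19]).astype(int)
toy["is_weekend"] = toy["day_of_week"].isin([5, 6]).astype(int)
toy["is_good_weather"] = toy["weather"].isin([1, 2]).astype(int)

# The product of two 0/1 flags is 1 only when both conditions hold
toy["is_weekend_good_weather"] = toy["is_weekend"] * toy["is_good_weather"]
toy["is_weekday_rush"] = (1 - toy["is_weekend"]) * toy["is_rush_hour"]

print(toy)
```

Tree-based models can often discover such combinations on their own, so explicit interaction flags tend to matter most for linear models.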

### Challenge 2: Create Season Binary Indicators
Create binary indicators for each season to help the model learn seasonal demand patterns.

In [None]:
# Your code here - create binary indicators for each season

# Step 1: Extract month from datetime
df['month'] = df['datetime'].dt._____  # Fill in: month

# Step 2: Create binary indicators for each season
# Spring: March, April, May (months 3, 4, 5)
df['is_spring'] = df['month'].isin([_____, _____, _____]).astype(int)  # Fill in: 3, 4, 5

# Summer: June, July, August (months 6, 7, 8)
df['is_summer'] = df['month'].isin([_____, _____, _____]).astype(int)  # Fill in: 6, 7, 8

# Fall: September, October, November (months 9, 10, 11)
df['is_fall'] = df['month'].isin([_____, _____, _____]).astype(int)  # Fill in: 9, 10, 11

# Winter: December, January, February (months 12, 1, 2)
df['is_winter'] = df['month'].isin([_____, _____, _____]).astype(int)  # Fill in: 12, 1, 2

print("Seasonal distribution:")
for season in ['spring', 'summer', 'fall', 'winter']:
    pct = df[f'is_{season}'].mean() * 100
    print(f"{season.capitalize()}: {pct:.1f}% of observations")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Extract month numbers using `.dt.month` (returns 1-12, not month names). For each season, use `.isin()` to check if the month is in that season's list, then `.astype(int)` to convert True/False to 1/0. Watch out: winter spans the year boundary, so you need `[12, 1, 2]` not consecutive numbers. Remember to use `.astype(int)` so the indicators are stored as explicit 1/0 integers, consistent with your other numeric features. Bike-sharing demand changes dramatically by season - summer weekend afternoons need 3x more bikes than winter weekday evenings, so these season indicators help your model learn those patterns.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Step 1: Extract month from datetime
df['month'] = df['datetime'].dt.month

# Step 2: Create binary indicators for each season
# Spring: March, April, May (months 3, 4, 5)
df['is_spring'] = df['month'].isin([3, 4, 5]).astype(int)

# Summer: June, July, August (months 6, 7, 8)
df['is_summer'] = df['month'].isin([6, 7, 8]).astype(int)

# Fall: September, October, November (months 9, 10, 11)
df['is_fall'] = df['month'].isin([9, 10, 11]).astype(int)

# Winter: December, January, February (months 12, 1, 2)
df['is_winter'] = df['month'].isin([12, 1, 2]).astype(int)

print("Seasonal distribution:")
for season in ['spring', 'summer', 'fall', 'winter']:
    pct = df[f'is_{season}'].mean() * 100
    print(f"{season.capitalize()}: {pct:.1f}% of observations")
```
</details>

---

## Step 3: Feature Scaling with StandardScaler

Now that our categorical features are in numerical form, we need to address a critical issue: our features exist on wildly different scales. Temperature ranges from 0-40°C, humidity spans 0-100%, and bike counts vary from 1 to 1000+. Without scaling, scale-sensitive algorithms (such as k-nearest neighbors, regularized linear models, and neural networks) can give undue weight to features with larger numeric ranges. Tree-based models are largely unaffected by feature scale.

We'll apply `StandardScaler` to normalize our features - transforming them to have mean=0 and standard deviation=1. This Z-score normalization ensures all features contribute equally regardless of their original measurement scales.

`StandardScaler` uses a consistent three-step workflow:

1. **Create the scaler**: `scaler = StandardScaler()`
2. **Fit to data**: `scaler.fit_transform(df[columns])` - calculates mean/std from data AND transforms it
3. **Result**: All features now have mean ≈ 0, std ≈ 1
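
As a rough illustration of what the workflow above computes, the sketch below standardizes three hypothetical temperature readings and reproduces the result by hand with z = (x - mean) / std. It also shows `transform()` reusing the stored statistics on a later value, which is the usual way to scale new or held-out data; the arrays `train` and `new` are made-up examples.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical temperature readings in °C (illustration only)
train = np.array([[10.0], [20.0], [30.0]])
new = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then transforms

# Same result by hand, using the population standard deviation (ddof=0)
manual = (train - train.mean()) / train.std()

# Later data reuses the stored mean/std via transform(), not fit_transform()
new_scaled = scaler.transform(new)

print(train_scaled.ravel())
print(new_scaled.ravel())
```

`StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas' `.std()` defaults to the sample version (ddof=1), so a pandas check on scaled data reports a value very slightly above 1.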

Let's apply `StandardScaler` to weather variables:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Identify weather features for scaling
weather_features = ['temp', 'atemp', 'humidity', 'windspeed']

# Check original scales
print("Original feature scales:")
print(df[weather_features].describe().round(1))

# Create and apply StandardScaler
scaler = StandardScaler()
df[weather_features] = scaler.fit_transform(df[weather_features])

# Check scaled results
print("\nScaled feature statistics:")
print(df[weather_features].describe().round(3))
print("\nNotice: mean ≈ 0, std ≈ 1 for all features")

**What this accomplishes:**
- All weather features now exist on the same statistical scale
- Features with originally larger ranges (like humidity 0-100) no longer dominate
- Algorithms can fairly compare the importance of temperature vs. humidity vs. windspeed

**Business value:**
Ensures your model doesn't overweight humidity simply because it's measured in larger numbers (0-100%) compared to temperature (0-40°C).

### Challenge 3: Verify Scaling Preserved Relationships
Check that StandardScaler transformed the scales without breaking the relationships between variables.

In [None]:
# Your code here - verify scaling results

# Check that correlations are preserved (scaling shouldn't change relationships)
if 'temp' in df.columns and 'count' in df.columns:
    correlation = df['temp'].corr(df['_____'])  # Fill in: 'count'
    print(f"Temperature-Count correlation after scaling: {correlation:.3f}")
    print("(This should be similar to original correlation)")

# Verify scaled features have mean ≈ 0, std ≈ 1
print("\nScaled feature validation:")
for feature in weather_features:
    mean_val = df[feature]._____()  # Fill in: mean
    std_val = df[feature]._____()   # Fill in: std
    print(f"{feature}: mean={mean_val:.3f}, std={std_val:.3f}")

    # Check if properly standardized (mean close to 0, std close to 1)
    if abs(mean_val) < 0.01 and abs(std_val - 1) < 0.01:
        print(f"  ✓ {feature} properly standardized")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


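Scaling validation has two key checks: First, verify relationships are preserved by calculating correlations - Pearson correlation is unchanged by standardization, so the value should match the original up to floating-point rounding. Second, check that mean ≈ 0 and std ≈ 1 using `.mean()` and `.std()`. Don't expect exact 0.000 and 1.000: floating-point rounding is normal, and pandas' `.std()` uses the sample formula (ddof=1) while `StandardScaler` uses the population formula (ddof=0), so the reported std sits very slightly above 1 (which is why we check `abs(mean) < 0.01` rather than equality). Think of scaling like translating a book: the story (relationships) stays the same, only the language (scale) changes. If correlations change significantly, something went wrong in your scaling step!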

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Check that correlations are preserved (scaling shouldn't change relationships)
if 'temp' in df.columns and 'count' in df.columns:
    correlation = df['temp'].corr(df['count'])
    print(f"Temperature-Count correlation after scaling: {correlation:.3f}")
    print("(This should be similar to original correlation)")

# Verify scaled features have mean ≈ 0, std ≈ 1
print("\nScaled feature validation:")
for feature in weather_features:
    mean_val = df[feature].mean()
    std_val = df[feature].std()
    print(f"{feature}: mean={mean_val:.3f}, std={std_val:.3f}")

    # Check if properly standardized (mean close to 0, std close to 1)
    if abs(mean_val) < 0.01 and abs(std_val - 1) < 0.01:
        print(f"  ✓ {feature} properly standardized")
```
</details>

---

## Step 4: Range Normalization with MinMaxScaler

`StandardScaler` works well for roughly normally distributed features, but some features naturally have fixed bounds. For these, `MinMaxScaler` provides an alternative scaling strategy that compresses values into a defined range, typically 0 to 1.

`MinMaxScaler` transforms features using a linear mapping: the minimum value becomes 0, the maximum becomes 1, and all other values scale proportionally between these bounds. The process follows three steps:

1. **Initialize**: `scaler = MinMaxScaler(feature_range=(0, 1))`
2. **Transform**: `scaler.fit_transform(df[columns])` - calculates min/max values and applies the transformation
3. **Output**: Features compressed to 0-1 range while preserving relative distances
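
The linear mapping described above can be checked by hand with (x - min) / (max - min). The following is a minimal sketch on a few hypothetical hour values, not the course dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical hour-of-day values (illustration only)
hours = np.array([[0.0], [6.0], [12.0], [23.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(hours)

# Same mapping by hand: minimum -> 0, maximum -> 1, linear in between
manual = (hours - hours.min()) / (hours.max() - hours.min())

print(scaled.ravel())
```

The minimum and maximum are taken from the data the scaler was fitted on, so if a column only contained hours 5 to 20, hour 5 would map to 0 and hour 20 to 1.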

Let's apply `MinMaxScaler` to temporal features:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Extract temporal features
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Identify temporal features for MinMax scaling
temporal_features = ['hour', 'day_of_week', 'month']

# Check original ranges
print("Original temporal feature ranges:")
print(df[temporal_features].agg(['min', 'max']))

# Create and apply MinMaxScaler
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
df[temporal_features] = minmax_scaler.fit_transform(df[temporal_features])

# Check scaled results
print("\nScaled temporal feature ranges:")
print(df[temporal_features].agg(['min', 'max']).round(4))
print("\nNotice: All features now range from 0 to 1")

**What this accomplishes:**
- Temporal features compressed to consistent 0-1 range
- Boundary properties preserved (midnight still maps to 0, values stay bounded)
- Compatible scale with other normalized features

**Business value:**
Ensures hour-of-day (0-23) contributes proportionally to demand predictions alongside day-of-week (0-6), preventing arbitrary scale differences from biasing the model.

### Challenge 4: Compare Scaling Methods
Create a comparison showing how StandardScaler and MinMaxScaler transform the same data differently.

In [None]:
# Your code here - compare scaling methods on temperature data

# Get fresh temperature data (before scaling)
df_fresh = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
original_temp = df_fresh['temp'].head(10)

# Apply StandardScaler
from sklearn.preprocessing import StandardScaler, MinMaxScaler

standard_scaler = StandardScaler()
temp_standard = standard_scaler.fit_transform(original_temp.values.reshape(-1, 1)).flatten()

# Apply MinMaxScaler
minmax_scaler = _____()  # Fill in: MinMaxScaler
temp_minmax = minmax_scaler.fit_transform(original_temp.values.reshape(-1, 1)).flatten()

# Compare results
comparison = pd.DataFrame({
    'Original': original_temp.values,
    'StandardScaler': temp_standard,
    'MinMaxScaler': temp_minmax
})
print("Scaling method comparison (first 10 temperatures):")
print(comparison.round(3))

print("\nKey differences:")
print(f"StandardScaler range: {temp_standard.min():.2f} to {temp_standard.max():.2f}")
print(f"MinMaxScaler range: {temp_minmax.min():.2f} to {temp_minmax.max():.2f}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Load fresh unscaled data, then apply both StandardScaler and MinMaxScaler to the same values to see their different behaviors. Remember `.reshape(-1, 1)` when scaling a single column because sklearn scalers expect 2D arrays, then use `.flatten()` to convert back to 1D after scaling. Choose StandardScaler for unbounded features like temperature (approximately normal distribution) and MinMaxScaler for bounded features like hour-of-day (uniform distribution). Wrong choice won't break models but may slow learning!

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Get fresh temperature data (before scaling)
df_fresh = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
original_temp = df_fresh['temp'].head(10)

# Apply StandardScaler
from sklearn.preprocessing import StandardScaler, MinMaxScaler

standard_scaler = StandardScaler()
temp_standard = standard_scaler.fit_transform(original_temp.values.reshape(-1, 1)).flatten()

# Apply MinMaxScaler
minmax_scaler = MinMaxScaler()
temp_minmax = minmax_scaler.fit_transform(original_temp.values.reshape(-1, 1)).flatten()

# Compare results
comparison = pd.DataFrame({
    'Original': original_temp.values,
    'StandardScaler': temp_standard,
    'MinMaxScaler': temp_minmax
})
print("Scaling method comparison (first 10 temperatures):")
print(comparison.round(3))

print("\nKey differences:")
print(f"StandardScaler range: {temp_standard.min():.2f} to {temp_standard.max():.2f}")
print(f"MinMaxScaler range: {temp_minmax.min():.2f} to {temp_minmax.max():.2f}")
```
</details>

---

## Step 5: Cyclical Encoding for Continuous Time

Now that you've mastered categorical encoding and scaling, we turn to the most critical dimension in transportation data: time. Raw time values like hour=23 and hour=0 appear numerically distant, but they represent adjacent hours on a clock. Time-based feature engineering solves this problem through cyclical encoding.

Let's implement cyclical encoding for hours:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Extract hour from datetime
df['hour'] = df['datetime'].dt.hour

# Create cyclical encoding for 24-hour cycle
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Compare linear hour vs. cyclical encoding
print("Hour encoding comparison:")
comparison = df[['hour', 'hour_sin', 'hour_cos']].drop_duplicates().sort_values('hour')
print(comparison.head(12))

# Verify hour 23 and hour 0 are close
hour_23 = df[df['hour'] == 23][['hour_sin', 'hour_cos']].iloc[0]
hour_0 = df[df['hour'] == 0][['hour_sin', 'hour_cos']].iloc[0]
print(f"\nHour 23: sin={hour_23['hour_sin']:.3f}, cos={hour_23['hour_cos']:.3f}")
print(f"Hour 0:  sin={hour_0['hour_sin']:.3f}, cos={hour_0['hour_cos']:.3f}")
print("Notice how close these values are despite hours being 23 apart numerically!")

**What this accomplishes:**
- Hours now represented as circular coordinates instead of linear numbers
- Machine learning algorithms can recognize that 11 PM and midnight are adjacent
- Continuous temporal patterns preserved without artificial breaks

**Business value:**
Enables models to learn that late-night demand (11 PM) transitions smoothly into early-morning demand (midnight-1 AM), rather than treating them as disconnected time periods.
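
One way to see the effect is to measure distances between encoded hours. The sketch below uses two hypothetical helper functions, `encode_hour` and `cyclical_distance`, which are not part of the lesson code; it compares the straight-line distance between the (sin, cos) points for a few hour pairs.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

d_23_0 = cyclical_distance(23, 0)   # across midnight
d_0_1 = cyclical_distance(0, 1)     # ordinary neighbouring hours
d_0_12 = cyclical_distance(0, 12)   # opposite sides of the clock

print(f"23 -> 0:  {d_23_0:.3f}")
print(f"0  -> 1:  {d_0_1:.3f}")
print(f"0  -> 12: {d_0_12:.3f}")
```

Any two hours that are one step apart end up the same distance from each other, including the pair that straddles midnight, while hours twelve apart sit at the maximum distance of 2.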

### Challenge 5: Create Cyclical Encoding for Day of Week
Apply cyclical encoding to the 7-day week cycle to capture how Monday transitions into Tuesday, and Sunday loops back to Monday.

In [None]:
# Your code here - create cyclical encoding for day of week (7-day cycle)

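# Extract day of week (0=Monday, 6=Sunday); the Step 5 cell reloaded df without it
df['day_of_week'] = df['datetime'].dt.dayofweek

# Create sine and cosine features for 7-day cycle
df['dow_sin'] = np._____(2 * np.pi * df['day_of_week'] / _____)  # Fill in: sin, 7
df['dow_cos'] = np._____(2 * np.pi * df['day_of_week'] / _____)  # Fill in: cos, 7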

# Verify Sunday (6) and Monday (0) are close
print("Day of week cyclical encoding:")
dow_comparison = df[['day_of_week', 'dow_sin', 'dow_cos']].drop_duplicates().sort_values('day_of_week')
print(dow_comparison)

# Calculate how close Sunday and Monday are in cyclical space
sunday = df[df['day_of_week'] == 6][['dow_sin', 'dow_cos']].iloc[0]
monday = df[df['day_of_week'] == 0][['dow_sin', 'dow_cos']].iloc[0]
print(f"\nSunday: sin={sunday['dow_sin']:.3f}, cos={sunday['dow_cos']:.3f}")
print(f"Monday: sin={monday['dow_sin']:.3f}, cos={monday['dow_cos']:.3f}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Follow the exact same pattern as hour encoding, just change the cycle length to 7 days instead of 24 hours. Apply both `np.sin()` and `np.cos()` with the formula: `2 * np.pi * day_of_week / 7`. You need BOTH sine and cosine to uniquely identify each day position - using only one won't work. Bike-sharing demand on Sunday evening (preparing for work week) should be more similar to Monday morning than to Saturday evening (weekend recreation), and cyclical encoding captures this weekly rhythm.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
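# Extract day of week (0=Monday, 6=Sunday); the Step 5 cell reloaded df without it
df['day_of_week'] = df['datetime'].dt.dayofweek

# Create sine and cosine features for 7-day cycle
df['dow_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)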

# Verify Sunday (6) and Monday (0) are close
print("Day of week cyclical encoding:")
dow_comparison = df[['day_of_week', 'dow_sin', 'dow_cos']].drop_duplicates().sort_values('day_of_week')
print(dow_comparison)

# Calculate how close Sunday and Monday are in cyclical space
sunday = df[df['day_of_week'] == 6][['dow_sin', 'dow_cos']].iloc[0]
monday = df[df['day_of_week'] == 0][['dow_sin', 'dow_cos']].iloc[0]
print(f"\nSunday: sin={sunday['dow_sin']:.3f}, cos={sunday['dow_cos']:.3f}")
print(f"Monday: sin={monday['dow_sin']:.3f}, cos={monday['dow_cos']:.3f}")
```
</details>

---

## Step 6: Lag Features for Sequential Patterns

You've learned how cyclical encoding helps models understand that midnight follows 11 PM. Now we'll introduce lag features, which explicitly provide historical context by using past values as predictors for current observations.

Lag features use pandas' `.shift()` method to access historical values as predictors. The `.shift(n)` function moves each value down by `n` rows. On hourly data with no missing timestamps, `shift(1)` gives you the previous hour's value, `shift(24)` gives you the same hour yesterday, and so on. If hours are missing, a row-based shift reaches further back in time than its name suggests.

**Critical requirement:** Always sort your data by datetime before creating lag features. Without proper sorting, `.shift()` will create meaningless relationships between unrelated time periods.
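
Because `.shift(n)` counts rows rather than hours, it can help to compare it against a lag that is looked up by timestamp. The sketch below uses a hypothetical toy series with the 02:00 row deliberately missing; the column names `lag_row` and `lag_time` are illustrative. It is one possible approach, assuming unique timestamps, not the lesson's prescribed method.

```python
import pandas as pd

# Hypothetical hourly data with the 02:00 observation missing
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00",
        "2011-01-01 03:00", "2011-01-01 04:00",
    ]),
    "count": [10, 20, 40, 50],
}).sort_values("datetime").reset_index(drop=True)

# Row-based lag: the previous row, however long ago it was recorded
toy["lag_row"] = toy["count"].shift(1)

# Time-based lag: move each timestamp forward by one hour, then look up
# the value whose shifted timestamp equals the current row's timestamp
series = toy.set_index("datetime")["count"]
lagged = series.shift(freq=pd.Timedelta(hours=1))
toy["lag_time"] = lagged.reindex(toy["datetime"]).to_numpy()

print(toy)
```

For the 03:00 row, `lag_row` returns the 01:00 value (two hours old), while `lag_time` is NaN because no 02:00 observation exists.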

Let's create lag features at multiple time scales:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# CRITICAL: Sort by datetime before creating lag features
df = df.sort_values('datetime').reset_index(drop=True)

# Create lag features at different time scales
df['count_lag_1h'] = df['count'].shift(1)      # 1 hour ago (immediate momentum)
df['count_lag_24h'] = df['count'].shift(24)    # Same hour yesterday (daily cycle)
df['count_lag_168h'] = df['count'].shift(168)  # Same hour last week (weekly cycle)

# Display lag features
print("Lag feature examples (first 30 rows):")
print(df[['datetime', 'count', 'count_lag_1h', 'count_lag_24h', 'count_lag_168h']].head(30))

# Check how many NaN values were created
print("\nMissing values created by lag features:")
print(f"1-hour lag: {df['count_lag_1h'].isnull().sum()} NaN values")
print(f"24-hour lag: {df['count_lag_24h'].isnull().sum()} NaN values")
print(f"168-hour lag: {df['count_lag_168h'].isnull().sum()} NaN values")

**What this accomplishes:**
- Historical demand patterns explicitly available as features
- Three time scales capture immediate momentum, daily cycles, and weekly patterns
- Model can learn "if demand was high 24 hours ago, it's likely high now"

**Business value:**
Lag features are among the strongest predictors for time series. If 200 bikes were rented last hour, demand probably remains high this hour - lag features encode this critical pattern.

### Challenge 6: Create Temperature Lag Features
Create lag features for temperature to help the model understand temperature trends.

In [None]:
# Your code here - create temperature lag features

# Create 1-hour and 6-hour temperature lags
df['temp_lag_1h'] = df['temp']._____(_____)   # Fill in: shift, 1
df['temp_lag_6h'] = df['temp']._____(_____)   # Fill in: shift, 6

# Calculate temperature change over past hour
df['temp_change_1h'] = df['temp'] - df['_____']  # Fill in: temp_lag_1h

# Display temperature trends
print("Temperature lag features (rows 10-20):")
print(df[['datetime', 'temp', 'temp_lag_1h', 'temp_lag_6h', 'temp_change_1h']].iloc[10:20])

# Identify periods with rapid temperature changes
rapid_change = df[abs(df['temp_change_1h']) > 0.5]  # More than 0.5°C change per hour
print(f"\nPeriods with rapid temperature change: {len(rapid_change)}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Use the same `.shift()` pattern as before: `df['temp'].shift(1)` for 1-hour lag and `df['temp'].shift(6)` for 6-hour lag. Calculate temperature change by subtracting lag from current: `df['temp'] - df['temp_lag_1h']` (this gives positive values when temperature rises). Use `abs()` when checking for rapid changes because we care about magnitude regardless of direction - both +0.6 and -0.6 are rapid changes. Rapid temperature swings affect bike demand differently than gradual changes - a sudden 5-degree warming at 2 PM might trigger recreational trips that wouldn't happen if the same temperature was reached gradually.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Create 1-hour and 6-hour temperature lags
df['temp_lag_1h'] = df['temp'].shift(1)
df['temp_lag_6h'] = df['temp'].shift(6)

# Calculate temperature change over past hour
df['temp_change_1h'] = df['temp'] - df['temp_lag_1h']

# Display temperature trends
print("Temperature lag features (rows 10-20):")
print(df[['datetime', 'temp', 'temp_lag_1h', 'temp_lag_6h', 'temp_change_1h']].iloc[10:20])

# Identify periods with rapid temperature changes
rapid_change = df[abs(df['temp_change_1h']) > 0.5]
print(f"\nPeriods with rapid temperature change: {len(rapid_change)}")
```
</details>

---

## Step 8: Rolling Window Features for Trend Detection

Lag features provide point-in-time historical values, but transportation demand also depends on recent trends. Is demand rising or falling? Has weather been stable or volatile? Rolling window features answer these questions by aggregating recent history.

We'll use pandas' `.rolling()` method to create rolling window features. The `.rolling(window=n)` method creates a moving window that slides through your data, calculating statistics over the most recent n observations.

**Critical:** Always use `.shift(1)` before `.rolling()` to prevent data leakage. Without `.shift(1)`, the rolling average includes the current value you're trying to predict - that's cheating!
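
To see the leakage concretely, here is a minimal sketch on a hypothetical five-row series comparing a rolling mean with and without `.shift(1)`. The column names `leaky_mean_3` and `safe_mean_3` are illustrative.

```python
import pandas as pd

# Hypothetical demand values (illustration only)
toy = pd.DataFrame({"count": [10, 20, 30, 40, 50]})

# Without shift: the window ends at the current row, so it includes the target
toy["leaky_mean_3"] = toy["count"].rolling(window=3).mean()

# With shift(1): the window covers only the three rows before the current one
toy["safe_mean_3"] = toy["count"].shift(1).rolling(window=3).mean()

print(toy)
```

At the row where `count` is 40, the leaky version averages 20, 30, and 40, while the shifted version averages 10, 20, and 30. The shifted version also produces one extra NaN at the start, since it needs three earlier rows.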

Let's create rolling window features:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Sort by datetime before creating rolling features
df = df.sort_values('datetime').reset_index(drop=True)

# Create rolling window features (with proper shift to avoid data leakage)
df['count_rolling_3h'] = df['count'].shift(1).rolling(window=3).mean()    # Average past 3 hours
df['count_rolling_24h'] = df['count'].shift(1).rolling(window=24).mean()  # Average past 24 hours

# Temperature volatility over past 6 hours
df['temp_rolling_std_6h'] = df['temp'].shift(1).rolling(window=6).std()

# Display rolling features
print("Rolling window features (rows 30-40):")
print(df[['datetime', 'count', 'count_rolling_3h', 'count_rolling_24h', 'temp_rolling_std_6h']].iloc[30:40])

# Check missing values (first few rows won't have full windows)
print(f"\n3-hour rolling: {df['count_rolling_3h'].isnull().sum()} NaN values")
print(f"24-hour rolling: {df['count_rolling_24h'].isnull().sum()} NaN values")

**What this accomplishes:**
- 3-hour average captures immediate demand trends (accelerating or declining)
- 24-hour average provides daily baseline context
- Temperature volatility indicates weather stability (affects user confidence)

**Business value:**
If 3-hour average is 150 bikes/hour but 24-hour average is 100, demand is surging above normal - signal operations to prioritize bike rebalancing!

### Challenge 8: Create Weather Stability Rolling Features
Create rolling features that measure weather stability, which affects user willingness to plan bike trips.

In [None]:
# Your code here - create weather stability rolling features

# Humidity volatility over past 6 hours (standard deviation)
df['humidity_rolling_std_6h'] = df['humidity'].shift(1)._____(window=_____).std()  # Fill in: rolling, 6

# Maximum windspeed over past 3 hours
df['windspeed_rolling_max_3h'] = df['windspeed'].shift(1)._____(window=_____).max()  # Fill in: rolling, 3

# Temperature range over past 12 hours (max - min)
temp_rolling_12h = df['temp'].shift(1).rolling(window=12)
df['temp_range_12h'] = temp_rolling_12h.max() - temp_rolling_12h._____()  # Fill in: min

# Display weather stability features
print("Weather stability rolling features (rows 40-50):")
print(df[['datetime', 'humidity_rolling_std_6h', 'windspeed_rolling_max_3h', 'temp_range_12h']].iloc[40:50])

# Identify periods with unstable weather (high humidity volatility)
unstable = df[df['humidity_rolling_std_6h'] > 15]  # High humidity variation
print(f"\nPeriods with unstable weather: {len(unstable)}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Apply different aggregations to rolling windows:

- Use `.std()` for volatility, `.max()` for peak values, and `.max() - .min()` for range.
- Always include `.shift(1)` before `.rolling()` to avoid data leakage; otherwise the current value ends up in your "past" window.
- For range calculations, create the rolling object once, then apply both `.max()` and `.min()` to it.
- Watch your window sizes: 6-hour for humidity, 3-hour for windspeed, 12-hour for temperature.

Stable weather (low volatility) encourages casual riders planning ahead, while chaotic weather (gusts, swings) reduces recreational demand but maintains commuter patterns.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Humidity volatility over past 6 hours (standard deviation)
df['humidity_rolling_std_6h'] = df['humidity'].shift(1).rolling(window=6).std()

# Maximum windspeed over past 3 hours
df['windspeed_rolling_max_3h'] = df['windspeed'].shift(1).rolling(window=3).max()

# Temperature range over past 12 hours (max - min)
temp_rolling_12h = df['temp'].shift(1).rolling(window=12)
df['temp_range_12h'] = temp_rolling_12h.max() - temp_rolling_12h.min()

# Display weather stability features
print("Weather stability rolling features (rows 40-50):")
print(df[['datetime', 'humidity_rolling_std_6h', 'windspeed_rolling_max_3h', 'temp_range_12h']].iloc[40:50])

# Identify periods with unstable weather (high humidity volatility)
unstable = df[df['humidity_rolling_std_6h'] > 15]
print(f"\nPeriods with unstable weather: {len(unstable)}")
```
</details>
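
All the rolling features above count windows in rows, which equals hours only if no hourly readings are missing. If your data has gaps, a time-based window is one alternative. The sketch below uses a made-up toy frame with the 03:00 reading missing; the column names `rows_3` and `hours_3` are illustrative assumptions.

```python
import pandas as pd

# Toy data with the 03:00 reading missing (values are made up)
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00",
        "2011-01-01 04:00", "2011-01-01 05:00",
    ]),
    "count": [10, 20, 30, 50, 60],
}).set_index("datetime")

# Row-based: the previous 3 rows, whatever time span they cover
toy["rows_3"] = toy["count"].shift(1).rolling(window=3).mean()

# Time-based: rows within the 3 hours before each timestamp.
# closed="left" excludes the current row, so no shift is needed.
toy["hours_3"] = toy["count"].rolling("3h", closed="left").mean()
print(toy)
```

At 04:00 the row-based window still reaches back to the 00:00 reading, while the time-based window only uses the 01:00 and 02:00 readings. Note that time-based windows default to `min_periods=1`, so they return a value even when only part of the window has data.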

---

## Summary: Professional Feature Engineering Pipeline Completed

**What We've Accomplished**:
- Encoded categorical variables with one-hot encoding for nominal categories and binary indicators for business-critical conditions
- Scaled features with StandardScaler for statistical normalization and MinMaxScaler for range-bounded temporal variables
- Applied cyclical encoding to hours and days so that circular time patterns are represented correctly
- Built lag features that capture sequential dependencies across time scales, from immediate momentum to weekly cycles
- Built rolling window features for trend detection and volatility measurement in demand and weather stability
- Validated that scaling operations preserved the underlying data relationships and statistical properties

**Key Technical Skills Mastered**:
- Converting categorical weather and day-type variables into binary columns with one-hot encoding
- Creating binary indicators for operationally significant conditions such as rush hours, weekends, seasons, and favorable weather
- Standardizing features to zero mean and unit variance (z-scores) with StandardScaler
- Normalizing bounded temporal features into the zero-to-one interval with MinMaxScaler
- Encoding time-based variables with sine and cosine so that circular continuity is preserved
- Creating lag features with shift operations to encode sequential patterns and temporal dependencies
- Computing rolling averages, volatility measures, and trend indicators while preventing data leakage

**Next Steps**: We'll move on to exploratory data analysis and visualization: examining the distributions of the engineered features, checking how well they capture temporal patterns, and generating business insights that inform model selection and hyperparameter tuning.

Your bike-sharing client now has a feature engineering pipeline that transforms raw operational data into machine learning inputs. It demonstrates the systematic preprocessing and feature construction that consulting firms expect for transportation demand forecasting.