# Lecture 5: Advanced Preprocessing & Feature Engineering - Optimizing Data for Machine Learning

## Learning Objectives

By the end of this lecture, you will be able to:
- Design and implement time-based features for transportation demand prediction
- Apply categorical encoding techniques for machine learning compatibility
- Implement data scaling and normalization strategies for optimal model performance
- Create domain-specific features that capture transportation business logic

---

## 1. From Clean Data to Machine Learning Ready Features

### The Bridge Between Data and Models

You now have clean, reliable data from your bike-sharing client, but clean data isn't the same as machine-learning-ready data. Raw variables often need transformation, combination, and optimization before machine learning algorithms can use them effectively.

In Lecture 4, you learned to identify and handle data quality issues - missing values, outliers, and inconsistencies. That cleaning work ensures you're starting with reliable, trustworthy data. Now, we move to the next essential step: transforming that clean data into optimized features for machine learning.

Think of this stage like preparing ingredients for a sophisticated recipe. Fresh, quality ingredients (clean data) are essential, but you still need to chop, season, and combine them in specific ways to create the final dish (the predictive model). Feature engineering is that preparation step.

For transportation consulting, this means creating features that capture the complex temporal, spatial, and operational patterns that drive demand. Simple variables like "temperature" and "hour" become sophisticated features like "temperature deviation from seasonal average" and "rush hour intensity" that enable more accurate predictions.

### Understanding Feature Engineering in Transportation Context

Transportation systems exhibit complex patterns that require specialized feature engineering approaches:

**Temporal Complexity**: Transportation demand follows nested temporal cycles (hourly, daily, weekly, seasonal) that interact in sophisticated ways. Rush hour patterns differ between weekdays and weekends, seasonal effects vary by time of day, and special events create temporary pattern disruptions.

**Environmental Sensitivity**: Weather affects transportation differently depending on trip purpose, time of day, and seasonal context. A 25°C day feels warm in January but cool in July, creating different demand responses that simple temperature variables can't capture.

**Network Effects**: Transportation systems are networks where demand at one location affects nearby locations through user behavior, capacity constraints, and operational interventions like bike rebalancing.

**Operational Interdependencies**: User types (casual vs. registered) respond differently to environmental and temporal factors, requiring features that capture these interaction effects.

Professional feature engineering transforms these complexities into variables that machine learning algorithms can effectively use to predict demand patterns.
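
As a hedged illustration of the "temperature deviation from seasonal average" idea mentioned earlier, the sketch below builds a hypothetical toy DataFrame (`toy`, with synthetic temperatures) and uses the calendar-month mean as the seasonal baseline. The baseline choice and every column name other than `datetime` and `temp` are assumptions made for illustration, not part of the client dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one synthetic temperature reading per day for a year.
rng = np.random.default_rng(0)
dates = pd.date_range("2011-01-01", "2011-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
seasonal_curve = 15 + 12 * np.sin(2 * np.pi * (day_of_year - 105) / 365)
toy = pd.DataFrame({
    "datetime": dates,
    "temp": seasonal_curve + rng.normal(0, 3, len(dates)),
})

# Seasonal baseline (assumption): mean temperature of each calendar month.
toy["month"] = toy["datetime"].dt.month
toy["temp_monthly_avg"] = toy.groupby("month")["temp"].transform("mean")

# Deviation: positive means warmer than usual for that month.
toy["temp_deviation"] = toy["temp"] - toy["temp_monthly_avg"]

print(toy[["datetime", "temp", "temp_monthly_avg", "temp_deviation"]].head())
```

In a real project the monthly baseline would normally be computed on the training period only and then applied to later data, so the feature does not peek at future temperatures.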

## 2. Time-Based Feature Engineering for Transportation

### Extracting Temporal Intelligence from Timestamps

Time is the most important dimension in transportation data, but raw timestamps contain hidden patterns that must be extracted and transformed to be useful for machine learning.

**Cyclical Time Components**:
Raw time values (like hour 23 followed by hour 0) create artificial discontinuities for machine learning algorithms. Professional time-based feature engineering transforms these linear representations into cyclical features that reflect the true cyclical nature of time.

**Multi-Scale Temporal Patterns**:
Transportation demand operates simultaneously at multiple temporal scales:
- **Intraday cycles**: Rush hours, lunch periods, evening entertainment
- **Weekly cycles**: Weekday vs. weekend patterns, Monday effects, Friday departures
- **Seasonal cycles**: Weather-driven variations, holiday periods, academic calendars
- **Long-term trends**: Year-over-year growth, infrastructure changes, demographic shifts

**Temporal Interaction Effects**:
Simple temporal features miss important interaction effects. For example, the "hour" effect differs dramatically between weekdays and weekends, requiring interaction features that capture these conditional relationships.
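
One possible way to express such a conditional relationship is to give the hour its own set of features per day type. The sketch below is a minimal illustration on a hypothetical two-day toy frame (a Friday and a Saturday); the column names `hour_sin_weekday`, `hour_sin_weekend`, and so on are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 48 hourly rows, Friday 2011-01-07 and Saturday 2011-01-08.
toy = pd.DataFrame({"datetime": pd.date_range("2011-01-07", periods=48, freq="h")})
toy["hour"] = toy["datetime"].dt.hour
toy["is_weekend"] = (toy["datetime"].dt.dayofweek >= 5).astype(int)

# Cyclical hour encoding (sine and cosine of the hour angle).
hour_sin = np.sin(2 * np.pi * toy["hour"] / 24)
hour_cos = np.cos(2 * np.pi * toy["hour"] / 24)

# Interaction features: one copy of the hour encoding per day type.
# Each copy is zero on the other day type, so a model can fit a
# separate hourly profile for weekdays and for weekends.
toy["hour_sin_weekday"] = hour_sin * (1 - toy["is_weekend"])
toy["hour_cos_weekday"] = hour_cos * (1 - toy["is_weekend"])
toy["hour_sin_weekend"] = hour_sin * toy["is_weekend"]
toy["hour_cos_weekend"] = hour_cos * toy["is_weekend"]

print(toy.iloc[[6, 30]])  # 06:00 on Friday vs. 06:00 on Saturday
```

Linear models benefit most from explicit interactions like these; tree-based models can often discover the same split on their own, so whether the extra columns help is an empirical question.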

### Advanced Temporal Feature Engineering Techniques

In this section, we will explore advanced methods for creating **time-based features** that improve prediction models in urban mobility. Transportation demand is strongly influenced by **daily cycles, weekly rhythms, and seasonal patterns**, so carefully engineered temporal features often make the difference between a weak model and a powerful one.

We will cover four main techniques:

1. **Cyclical encoding for continuous time**
2. **Time-since features**
3. **Temporal aggregation features**
4. **Lag features for sequential patterns**

Each of these methods helps capture different aspects of how time influences bike-sharing demand.

**1. Cyclical Encoding for Continuous Time**

Time is cyclical: after 23:00 comes 00:00, and after December comes January. But if we use raw numeric values (e.g., `hour = 23` vs. `hour = 0`), many models treat these as very far apart. This creates an **artificial break** in the data.

**Definition**: Cyclical encoding maps linear time values (like hours of the day or months of the year) onto a circle using sine and cosine functions. This way, times that are neighbors in reality remain neighbors in the feature space.

**Purpose**: This encoding allows models to recognize smooth, circular patterns in demand, such as rush-hour peaks or seasonal shifts.

**Example in Python: Cyclical Encoding of Hour, Day of Week, and Month**

Let's implement a comprehensive example that creates cyclical encoding for multiple time periods, including hours, days of week, and months:

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import dataset
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

# Example: Create cyclical encoding for bike-sharing temporal features
# Assuming we have a datetime column called 'datetime'
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['datetime'].dt.month

# Step 1: Hour cyclical encoding (24-hour cycle)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Step 2: Day of week cyclical encoding (7-day cycle) 
df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

# Step 3: Month cyclical encoding (12-month cycle)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

print("Cyclical encoding completed!")
print(f"Hour 23: sin={df[df['hour']==23]['hour_sin'].iloc[0]:.3f}, cos={df[df['hour']==23]['hour_cos'].iloc[0]:.3f}")
print(f"Hour 0:  sin={df[df['hour']==0]['hour_sin'].iloc[0]:.3f}, cos={df[df['hour']==0]['hour_cos'].iloc[0]:.3f}")

Cyclical encoding completed!
Hour 23: sin=-0.259, cos=0.966
Hour 0:  sin=0.000, cos=1.000


**Step-by-Step Explanation:**

1. **Extract temporal components**: We first extract hour, day of week, and month from the datetime column to work with integer values.

2. **Apply sine and cosine transformation**: For each temporal component, we use both sine and cosine functions:
   - **Sine component**: Captures the position on one axis of the circle
   - **Cosine component**: Captures the position on the perpendicular axis
   - Together, they uniquely identify any point on the circle

3. **Scale by the period**: We divide by the total period (24 for hours, 7 for days, 12 for months) to ensure one complete cycle maps to one full circle (2π radians).

4. **Mathematical insight**: The formula `2 * π * value / period` converts linear time to radians, where:
   - Hour 0 and hour 24 both map to the same point: (sin=0, cos=1)
   - Hour 6 maps to: (sin=1, cos=0) 
   - Hour 12 maps to: (sin=0, cos=-1)
   - Hour 18 maps to: (sin=-1, cos=0)

With this representation, **23:00 and 00:00 map to neighboring points** on the circle, exactly as close to each other as any other pair of consecutive hours. The Euclidean distance between encoded points therefore reflects temporal proximity, unlike raw numeric encoding, where 23 and 0 appear far apart.
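
The distance claim can be checked numerically. The following is a small self-contained sketch; `encode_hour` and `encoded_distance` are helper names introduced here for illustration only.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

print(f"raw |23 - 0|     = {abs(23 - 0)}")                   # 23
print(f"encoded 23 vs 0  = {encoded_distance(23, 0):.3f}")   # 0.261
print(f"encoded 11 vs 12 = {encoded_distance(11, 12):.3f}")  # 0.261
print(f"encoded 0 vs 12  = {encoded_distance(0, 12):.3f}")   # 2.000
```

Consecutive hours are always the same distance apart, and the largest possible distance (2.0) separates hours that are 12 hours apart.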

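This cyclical encoding can be used with many model types (linear regression, tree-based models, kNN, neural networks), since it embeds the true temporal geometry directly into the data. It matters most for linear, distance-based, and neural models; tree-based models split on one feature at a time and often cope reasonably well with the raw integer hour too. For bike-sharing systems, it helps models learn that late-night hours share similarities with early-morning hours, both typically showing low demand patterns.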

**2. Time-Since Features**

Sometimes what matters most is not the absolute time, but **how much time has passed since an important event**.

**Definition**: A time-since feature measures the elapsed time between the current observation and a meaningful reference point.

**Purpose**: These features allow models to capture temporal effects related to holidays, weekends, or unusual conditions.

**Examples**:

* Days since the start of the current season
* Hours since the last weekend ended
* Time since the most recent holiday period
* Days since a major weather change

**Example in Python: Days Since Last Weekend**

Let's implement time-since features that capture important temporal transitions in bike-sharing demand:

In [32]:
import numpy as np
import pandas as pd

# Example: Create time-since features for bike-sharing demand prediction
# Assuming we have a datetime column and related features already created

def create_time_since_features(df):
    """
    Create time-since features that measure elapsed time from important events.
    These features help capture recovery patterns and transition effects.
    """
    
    # Ensure datetime column is properly formatted
    df['datetime'] = pd.to_datetime(df['datetime'])
    df = df.sort_values('datetime')  # Ensure chronological order
    
    # Step 1: Days since last weekend ended
    # Weekend ends on Sunday evening (day 6), new week starts Monday
    df['is_weekend'] = df['datetime'].dt.dayofweek.isin([5, 6])  # Saturday=5, Sunday=6
    
    # Find all weekend end points (rows on Sunday with hour == 23, i.e. Sunday 23:00)
    weekend_ends = df[(df['datetime'].dt.dayofweek == 6) & 
                      (df['datetime'].dt.hour == 23)]['datetime']
    
    # Calculate days since last weekend for each observation
    df['days_since_weekend'] = df['datetime'].apply(
        lambda x: (x - weekend_ends[weekend_ends <= x].max()).days 
        if len(weekend_ends[weekend_ends <= x]) > 0 else 7
    )
    
    # Step 2: Hours since the most recent earlier holiday
    # Assuming we have a 'holiday' column (1 for holiday, 0 for normal day).
    # Elapsed time is counted from midnight at the start of that holiday
    # and capped at 168 hours (one week), which is also the default when none is found.
    holiday_dates = df[df['holiday'] == 1]['datetime'].dt.date.unique()
    
    df['hours_since_holiday'] = df['datetime'].apply(
        lambda x: min([(x.date() - holiday_date).days * 24 + x.hour 
                      for holiday_date in holiday_dates if holiday_date < x.date()] + [168])
    )
    
    # Step 3: Days since significant weather change
    # Define a significant change as a temperature jump > 10°C between consecutive rows
    # (with hourly data, diff() compares each hour to the previous hour, not the previous day)
    df['temp_change'] = df['temp'].diff().abs()
    df['significant_weather_change'] = df['temp_change'] > 10
    
    # Timestamps of rows with a significant weather change
    weather_change_dates = df[df['significant_weather_change']]['datetime']
    
    df['days_since_weather_change'] = df['datetime'].apply(
        lambda x: (x - weather_change_dates[weather_change_dates <= x].max()).days 
        if len(weather_change_dates[weather_change_dates <= x]) > 0 else 30
    )
    
    return df

# Apply time-since feature engineering
df = create_time_since_features(df)

print("Time-since features created:")
print(f"Days since weekend - Range: {df['days_since_weekend'].min()} to {df['days_since_weekend'].max()}")
print(f"Hours since holiday - Average: {df['hours_since_holiday'].mean():.1f}")
print(f"Days since weather change - Average: {df['days_since_weather_change'].mean():.1f}")

Time-since features created:
Days since weekend - Range: 0 to 20
Hours since holiday - Average: 156.7
Days since weather change - Average: 33.4


**Step-by-Step Explanation:**

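1. **Weekend Recovery Pattern**: `days_since_weekend` counts whole days since the most recent Sunday 23:00 row, so it tracks how far the workweek has progressed. Most Monday hours get 0 and most Friday hours get 4, which lets the model separate early-week from late-week behavior. Values above 6 appear when a Sunday 23:00 row is missing from the data, which explains the maximum of 20 in the output above.

2. **Holiday Recovery Effects**: `hours_since_holiday` approximates recovery time from holiday disruptions. As coded, it counts hours from the start of the most recent earlier holiday and is capped at 168 (one week). Bike demand often takes 1-3 days to return to normal patterns after holidays.

3. **Weather Adaptation Period**: `days_since_weather_change` captures how users adapt to significant weather changes, defined here as a temperature jump of more than 10°C between consecutive rows. After a major temperature drop, it might take several days for demand patterns to stabilize.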

4. **Boundary Handling**: The code includes default values (7 days, 168 hours, 30 days) for cases where no previous event is found, ensuring robust feature creation.

In bike-sharing systems, "days since last weekend" might reveal that Tuesday demand is typically higher than Monday as people settle back into commuting routines. Similarly, "days since last snowfall" could be a strong predictor of demand recovery, as ridership gradually returns to normal after weather disruptions.
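
The row-by-row `apply` calls above rescan the event list for every observation, which becomes slow on large datasets. A possible vectorized alternative, sketched here on hypothetical toy data, keeps the timestamp only on event rows and carries it forward. The helper name `days_since_last_event` and its `default` parameter are assumptions introduced for this example.

```python
import pandas as pd

def days_since_last_event(timestamps, is_event, default=None):
    """Whole days since the most recent event row at or before each row.

    Assumes `timestamps` is a datetime Series sorted in ascending order
    and `is_event` is a boolean Series aligned with it.
    """
    # Keep the timestamp on event rows only (NaT elsewhere), then carry it forward.
    last_event = timestamps.where(is_event).ffill()
    days = (timestamps - last_event).dt.days
    # Rows before the first event have no reference point.
    return days if default is None else days.fillna(default)

# Hypothetical toy data: 30 hourly rows starting Sunday 2011-01-02 at 20:00.
toy = pd.DataFrame({"datetime": pd.date_range("2011-01-02 20:00", periods=30, freq="h")})

# Event definition mirrors the weekend-end rule used above: Sunday, hour 23.
is_sunday_23 = (toy["datetime"].dt.dayofweek == 6) & (toy["datetime"].dt.hour == 23)
toy["days_since_weekend"] = days_since_last_event(toy["datetime"], is_sunday_23, default=7)

print(toy.iloc[[0, 3, 26, 27]])
```

In this toy frame, rows before the first Sunday 23:00 fall back to the default of 7, rows from Sunday 23:00 through Monday 22:00 get 0, and rows from Monday 23:00 onward get 1.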

**3. Temporal Aggregation Features**

Transportation demand often depends on **recent history**, not just single time points.

**Definition**: Temporal aggregation features summarize past values of a variable over a defined time window.

**Purpose**: They provide context about recent conditions, such as whether demand has been steadily rising or if the weather has been unusually stable.

**Examples**:

* Average demand in the past 3 hours
* Maximum temperature in the past 24 hours
* Weather stability over the past week
* Growth rate of ridership over the past month

**Example in Python: Rolling Weather and Demand Aggregations**

Let's implement temporal aggregation features that provide context about recent conditions and trends:

In [33]:
import numpy as np
import pandas as pd

def create_temporal_aggregation_features(df):
    """
    Create temporal aggregation features that summarize recent history
    to provide context for current predictions.
    """
    
    # Ensure data is sorted chronologically for rolling calculations
    df = df.sort_values('datetime')
    
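    # Step 1: Rolling demand statistics (short-term momentum)
    # Note: these windows include the current row. If 'count' is the prediction target,
    # apply .shift(1) before .rolling(...) so each feature uses past values only.
    # 3-hour rolling average captures immediate demand trends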
    df['demand_3h_avg'] = df['count'].rolling(window=3, min_periods=1).mean()
    
    # 3-hour rolling maximum shows peak demand in recent period
    df['demand_3h_max'] = df['count'].rolling(window=3, min_periods=1).max()
    
    # 3-hour demand volatility (standard deviation) indicates stability
    df['demand_3h_volatility'] = df['count'].rolling(window=3, min_periods=1).std()
    
    # Step 2: Weather stability indicators (24-hour windows)
    # Temperature stability: how much temperature has varied recently
    df['temp_24h_stability'] = 1 / (1 + df['temp'].rolling(window=24, min_periods=1).std())
    
    # Maximum temperature in past 24 hours
    df['temp_24h_max'] = df['temp'].rolling(window=24, min_periods=1).max()
    
    # Minimum temperature in past 24 hours
    df['temp_24h_min'] = df['temp'].rolling(window=24, min_periods=1).min()
    
    # Temperature range (daily temperature swing)
    df['temp_24h_range'] = df['temp_24h_max'] - df['temp_24h_min']
    
    # Step 3: Weekly trend indicators (168-hour = 7-day windows)
    # 7-day rolling average for trend analysis
    df['demand_7d_avg'] = df['count'].rolling(window=168, min_periods=24).mean()
    
    # Growth rate: current vs. 7-day average (positive = growing demand)
    df['demand_growth_rate'] = (df['count'] - df['demand_7d_avg']) / (df['demand_7d_avg'] + 1)
    
    # Weather trend: is temperature trending up or down this week?
    df['temp_7d_trend'] = df['temp'].rolling(window=168, min_periods=24).apply(
        lambda x: np.polyfit(range(len(x)), x, 1)[0] if len(x) > 1 else 0
    )
    
    # Step 4: Advanced aggregations for business insights
    # Rush hour intensity in past 3 days (for pattern detection)
    rush_hours = df['hour'].isin([7, 8, 9, 17, 18, 19])
    df['rush_hour_demand'] = np.where(rush_hours, df['count'], np.nan)
    df['rush_demand_3d_avg'] = df['rush_hour_demand'].rolling(window=72, min_periods=6).mean()
    
    # Weekend vs weekday demand ratio (past 4 weeks)
    df['is_weekend'] = df['datetime'].dt.dayofweek.isin([5, 6])
    df['weekend_demand'] = np.where(df['is_weekend'], df['count'], np.nan)
    df['weekday_demand'] = np.where(~df['is_weekend'], df['count'], np.nan)
    
    weekend_avg = df['weekend_demand'].rolling(window=672, min_periods=48).mean()  # 4 weeks
    weekday_avg = df['weekday_demand'].rolling(window=672, min_periods=120).mean()
    df['weekend_weekday_ratio'] = weekend_avg / (weekday_avg + 1)
    
    return df

# Apply temporal aggregation features
df = create_temporal_aggregation_features(df)

print("Temporal aggregation features created:")
print(f"3-hour demand volatility - Average: {df['demand_3h_volatility'].mean():.2f}")
print(f"Temperature stability - Average: {df['temp_24h_stability'].mean():.3f}")
print(f"Demand growth rate - Range: {df['demand_growth_rate'].min():.3f} to {df['demand_growth_rate'].max():.3f}")
print(f"Weekend/weekday ratio - Average: {df['weekend_weekday_ratio'].mean():.2f}")

Temporal aggregation features created:
3-hour demand volatility - Average: 63.08
Temperature stability - Average: 0.315
Demand growth rate - Range: -0.993 to 3.307
Weekend/weekday ratio - Average: 0.96


**Step-by-Step Explanation:**

1. **Short-term momentum (3-hour windows)**: These features capture immediate trends and volatility in demand. High volatility might indicate special events or system disruptions.

2. **Daily weather context (24-hour windows)**: Temperature stability indicates whether weather has been consistent, which affects user behavior predictability. A high daily temperature range suggests variable conditions.

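3. **Weekly trend analysis (7-day windows)**: The growth rate compares current demand to the recent average, revealing whether the system is experiencing increasing or decreasing usage. Because it is computed from the current `count`, it is suitable for describing trends but cannot serve as a predictor of that same `count` without first shifting it into the past. Temperature trend identifies sustained weather patterns.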

4. **Business-specific aggregations**: Rush hour averages help identify commuting pattern changes, while weekend/weekday ratios reveal seasonal shifts in usage patterns.

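5. **Robust handling**: The `min_periods` parameter lets a window produce a value as soon as it holds that many non-missing observations, so features are available near the start of the series and across missing values. Adding 1 to each denominator prevents division by zero.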

In consulting practice, showing a client the "rolling 7-day average demand" can highlight whether usage trends are growing, stabilizing, or declining. The temperature stability feature might reveal that unstable weather periods consistently reduce demand by 15-20%, providing actionable insights for capacity planning.
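
The rolling windows above are row-based and include the current observation. As a hedged sketch of an alternative, assuming the data is indexed by a sorted `DatetimeIndex`: pandas also accepts a time offset such as `"3h"` as the window, and `closed="left"` leaves the current row out, which matters when `count` is the value being predicted. The toy series and the column names `avg_3rows` and `avg_prev_3h` are hypothetical.

```python
import pandas as pd

# Hypothetical toy series with a gap: no rows between 03:00 and 09:00.
times = pd.to_datetime([
    "2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00",
    "2011-01-01 03:00", "2011-01-01 09:00", "2011-01-01 10:00",
])
toy = pd.DataFrame({"count": [10, 20, 30, 40, 50, 60]}, index=times)

# Row-based window of 3 rows, current row included (as in the example above).
toy["avg_3rows"] = toy["count"].rolling(window=3, min_periods=1).mean()

# Time-based window covering the previous 3 hours, current row excluded.
toy["avg_prev_3h"] = toy["count"].rolling("3h", closed="left").mean()

print(toy)
```

At 09:00 the row-based average still mixes in values from 02:00 and 03:00, while the time-based version returns NaN because nothing was observed in the preceding three hours. Which behavior is preferable depends on how gaps should be treated in the model.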

**4. Lag Features for Sequential Patterns**

Urban mobility demand is not random: it shows **strong sequential dependencies**.

**Definition**: A lag feature uses the value of a variable from a previous time step as a predictor for the current step.

**Purpose**: These features explicitly introduce historical demand patterns into the model, helping it learn recurring cycles.

**Examples**:

* Demand 1 hour ago (captures immediate momentum)
* Demand 24 hours ago (captures same-time-of-day effects)
* Demand 7 days ago (captures weekly repetition)
* Average demand at this hour over the past month

**Example in Python: Multiple Lag Features for Sequential Pattern Recognition**

Let's implement lag features that capture different types of sequential dependencies in bike-sharing demand:

In [34]:
import numpy as np
import pandas as pd

def create_lag_features(df):
    """
    Create lag features that capture sequential dependencies and recurring patterns
    in bike-sharing demand data.
    """
    
    # Ensure data is sorted chronologically for proper lag calculation
    df = df.sort_values('datetime').reset_index(drop=True)
    
    # Step 1: Basic lag features for immediate patterns
    # Immediate momentum: demand 1 hour ago
    df['demand_lag_1h'] = df['count'].shift(1)
    
    # Recent pattern: demand 2 and 3 hours ago
    df['demand_lag_2h'] = df['count'].shift(2)
    df['demand_lag_3h'] = df['count'].shift(3)
    
    # Step 2: Same-time-of-day patterns (24-hour lags)
    # Yesterday same hour (captures daily repetition)
    df['demand_lag_24h'] = df['count'].shift(24)
    
    # Day before yesterday same hour (validates daily pattern)
    df['demand_lag_48h'] = df['count'].shift(48)
    
    # Step 3: Weekly recurring patterns (7-day lags)
    # Same day-hour last week (captures weekly cycles)
    df['demand_lag_7d'] = df['count'].shift(24 * 7)  # 168 hours
    
    # Same day-hour 2 weeks ago (validates weekly pattern)
    df['demand_lag_14d'] = df['count'].shift(24 * 14)  # 336 hours
    
    # Step 4: Advanced lag combinations for complex patterns
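    # Smoothed demand around the same time yesterday: mean of the 7 consecutive hourly
    # values ending 24 hours ago. Note: despite the column name, this is not a same-hour
    # average across days; that would average shift(24), shift(48), ..., shift(168).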
    df['demand_same_hour_avg_7d'] = df['count'].shift(24).rolling(window=7, min_periods=3).mean()
    
    # Trend in same-time-of-day demand (is this time slot growing or declining?)
    same_time_values = []
    for i in range(len(df)):
        if i >= 24 * 7:  # Need at least 1 week of data
            # Get demand at same hour for past 7 days
            same_hour_demands = [df.iloc[i - 24 * (d+1)]['count'] for d in range(7)]
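            # Calculate linear trend. The list runs from yesterday (x=0) back to 7 days ago (x=6),
            # so a positive slope means demand was higher further in the past (declining);
            # reverse the list to get the usual sign (positive = growing).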
            if all(pd.notna(same_hour_demands)):
                trend = np.polyfit(range(7), same_hour_demands, 1)[0]
            else:
                trend = 0
        else:
            trend = 0
        same_time_values.append(trend)
    
    df['demand_same_hour_trend'] = same_time_values
    
    # Step 5: Weather lag features (weather affects demand with some delay)
    # Temperature 1 and 3 hours ago (weather decision lag)
    df['temp_lag_1h'] = df['temp'].shift(1)
    df['temp_lag_3h'] = df['temp'].shift(3)
    
    # Weather condition lag (people may check weather hours before trip)
    if 'weather' in df.columns:
        df['weather_lag_2h'] = df['weather'].shift(2)
    
    # Step 6: User type lag patterns (different lag behaviors)
    # Casual users (more spontaneous, shorter lags)
    df['casual_lag_1h'] = df['casual'].shift(1)
    df['casual_lag_2h'] = df['casual'].shift(2)
    
    # Registered users (more routine-driven, longer predictable lags)
    df['registered_lag_24h'] = df['registered'].shift(24)
    df['registered_lag_7d'] = df['registered'].shift(24 * 7)
    
    # Step 7: Interaction lag features (capture combined effects)
    # Weather-demand interaction from previous day
    df['temp_demand_interaction_24h'] = (df['temp'].shift(24) * df['count'].shift(24))
    
    # Weekend effect lag: Mondays look back 48 rows (Saturday); all other days use the previous day
    df['weekend_effect_lag'] = np.where(
        df['datetime'].dt.dayofweek == 0,  # Monday
        df['count'].shift(24 * 2),  # Saturday demand affects Monday
        df['count'].shift(24)  # Previous day for other days
    )
    
    return df

# Apply lag feature engineering
df = create_lag_features(df)

print("Lag features created:")
print(f"1-hour lag correlation with current demand: {df['count'].corr(df['demand_lag_1h']):.3f}")
print(f"24-hour lag correlation with current demand: {df['count'].corr(df['demand_lag_24h']):.3f}")
print(f"7-day lag correlation with current demand: {df['count'].corr(df['demand_lag_7d']):.3f}")
print(f"Same hour trend - Range: {df['demand_same_hour_trend'].min():.3f} to {df['demand_same_hour_trend'].max():.3f}")

# Display sample of lag features for business understanding
print("\nSample lag feature values (first 10 non-null rows):")
lag_cols = ['demand_lag_1h', 'demand_lag_24h', 'demand_lag_7d', 'demand_same_hour_avg_7d']
sample_data = df[['datetime', 'count'] + lag_cols].dropna().head(10)
print(sample_data.to_string(index=False))

Lag features created:
1-hour lag correlation with current demand: 0.842
24-hour lag correlation with current demand: 0.811
7-day lag correlation with current demand: 0.786
Same hour trend - Range: -126.107 to 125.857

Sample lag feature values (first 10 non-null rows):
           datetime  count  demand_lag_1h  demand_lag_24h  demand_lag_7d  demand_same_hour_avg_7d
2011-01-08 07:00:00      9            2.0            84.0           16.0                21.285714
2011-01-08 08:00:00     15            9.0           210.0           40.0                48.857143
2011-01-08 09:00:00     20           15.0           134.0           32.0                67.000000
2011-01-08 10:00:00     61           20.0            63.0           13.0                75.857143
2011-01-08 11:00:00     62           61.0            67.0            1.0                85.285714
2011-01-08 12:00:00     98           62.0            59.0            1.0                93.000000
2011-01-08 13:00:00    102           98.0   

**Step-by-Step Explanation:**

1. **Immediate momentum (1-3 hour lags)**: These features capture short-term demand trends. If demand was high 1-2 hours ago, it might continue to be high, indicating system momentum or sustained favorable conditions.

2. **Daily repetition patterns (24-48 hour lags)**: These features leverage the strong daily cyclical nature of transportation. Demand at 8 AM today often resembles demand at 8 AM yesterday, especially for commuting patterns.

3. **Weekly recurring cycles (7-14 day lags)**: These capture weekly rhythms where Monday 8 AM patterns repeat week after week. This is especially powerful for commuting and routine trip prediction.

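4. **Advanced temporal aggregations**: Instead of relying on single lag points, we add smoothed and trend-based features. As coded, `demand_same_hour_avg_7d` averages the 7 consecutive hourly values ending 24 hours earlier (not the same hour across 7 days), and `demand_same_hour_trend` fits a line through the same-hour demand of the past 7 days ordered from most recent to oldest, so a positive slope indicates declining demand for that time slot.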

5. **Weather lag effects**: Weather influences decisions with some delay - people may check the weather forecast hours before deciding to bike, creating lag effects between weather changes and demand response.

6. **User-specific lag patterns**: Casual and registered users show different lag behaviors. Casual users are more spontaneous (shorter lags matter), while registered users are more routine-driven (longer lags are predictive).

7. **Interaction lag features**: These capture complex relationships, like how Friday's weather might affect weekend recreational biking, or how Saturday's high demand might predict Monday's commuter rebound.
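
A same-hour average across days needs one lag per day (24, 48, ..., 168 hours back) rather than a rolling window over consecutive rows. The sketch below shows one way to build it, assuming gap-free hourly data; the toy frame and the column name `same_hour_avg_7d` are hypothetical, and the minimum of 3 available days is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 days of gap-free hourly rows. Demand depends only
# on the day number (day 0 -> 0, day 1 -> 10, ...), so results are easy to check.
toy = pd.DataFrame({"datetime": pd.date_range("2011-01-01", periods=240, freq="h")})
toy["count"] = (np.arange(240) // 24) * 10

# One column per daily lag: the same hour 1, 2, ..., 7 days earlier.
daily_lags = pd.concat(
    {d: toy["count"].shift(24 * d) for d in range(1, 8)}, axis=1
)

# Average across the available same-hour values, requiring at least 3 of them.
enough_history = daily_lags.notna().sum(axis=1) >= 3
toy["same_hour_avg_7d"] = daily_lags.mean(axis=1).where(enough_history)

print(toy.iloc[[48, 72, 168, 239]])
```

On day 7 the feature averages the values of days 0 to 6 (mean 30), on day 3 it averages days 0 to 2 (mean 10), and on day 2 it stays NaN because only two earlier days exist.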

For bike-sharing systems, lag features help the model learn that **Monday 8 AM demand this week will likely resemble Monday 8 AM last week**, adjusted for seasonal effects. The correlation values show which time horizons are most predictive: in the output above, the 1-hour lag has the highest correlation with current demand (0.842), followed by the 24-hour lag (0.811) and the 7-day lag (0.786).
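
Because `shift(n)` moves values by rows rather than by clock time, `shift(24)` equals "24 hours ago" only when the data has exactly one row per hour with no gaps. A minimal sketch of a timestamp-based lag follows, assuming unique timestamps; the helper name `lag_by_time` and the toy data are hypothetical.

```python
import pandas as pd

def lag_by_time(df, column, hours, time_col="datetime"):
    """Value of `column` exactly `hours` earlier, or NaN if that timestamp is absent.

    Assumes the timestamps in `time_col` are unique.
    """
    by_time = df.set_index(time_col)[column]
    lookup = df[time_col] - pd.Timedelta(hours=hours)
    # reindex returns NaN for timestamps that do not exist in the data.
    return by_time.reindex(lookup).to_numpy()

# Hypothetical toy data: hourly rows with 02:00 missing.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00",
        "2011-01-01 03:00", "2011-01-01 04:00",
    ]),
    "count": [10, 20, 40, 50],
})
toy["lag_1_row"] = toy["count"].shift(1)                 # previous row
toy["lag_1_hour"] = lag_by_time(toy, "count", hours=1)   # previous hour

print(toy)
```

At 03:00 the row-based lag silently returns the 01:00 value (20), whereas the timestamp-based lag returns NaN because no observation exists for 02:00.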

### Business-Relevant Temporal Features

So far, we have looked at general temporal transformations. But in real-world consulting, models are most valuable when they reflect **business logic and human behavior**. This section shows how to design temporal features that go beyond raw time values, directly encoding **transportation-specific patterns** such as rush hours, weekday/weekend differences, and gradual seasonal shifts.

We’ll focus on three key types:

1. **Rush Hour Intensity**
2. **Workday vs. Weekend Context**
3. **Seasonal Progression Features**

**1. Rush Hour Intensity**

Bike-sharing demand is heavily shaped by commuting patterns. Instead of treating “rush hour” as a simple binary (yes/no), we can model it as a **continuous intensity measure**.

**Definition**: Rush hour intensity features capture the strength of peak commuting periods on a smooth scale, rather than an abrupt cutoff.

**Purpose**: This approach better reflects reality: demand gradually builds before 8 AM, peaks during the commute, and slowly declines afterward.

**Example in Python: Advanced Rush Hour Modeling**

Let's implement comprehensive rush hour features that capture different aspects of commuting patterns:

In [35]:
import numpy as np
import pandas as pd

def create_rush_hour_features(df):
    """
    Create sophisticated rush hour features that capture commuting patterns
    with smooth intensity measures rather than binary indicators.
    """
    
    # Step 1: Basic rush hour intensity using Gaussian curves
    # Morning rush centered at 8 AM with 1.5-hour standard deviation
    rush_morning = np.exp(-((df['hour'] - 8)**2) / (2 * 1.5**2))
    
    # Evening rush centered at 5 PM with 1.5-hour standard deviation  
    rush_evening = np.exp(-((df['hour'] - 17)**2) / (2 * 1.5**2))
    
    # Combined rush intensity (takes maximum of morning and evening)
    df['rush_intensity'] = np.maximum(rush_morning, rush_evening)
    
    # Step 2: Separate morning and evening intensities
    # Sometimes models benefit from knowing which rush period is active
    df['morning_rush_intensity'] = rush_morning
    df['evening_rush_intensity'] = rush_evening
    
    # Step 3: Advanced rush hour variations
    # Lunch hour intensity (centered at noon with tighter spread)
    df['lunch_intensity'] = np.exp(-((df['hour'] - 12)**2) / (2 * 1.0**2))
    
    # Late evening activity (centered at 9 PM for entertainment/dining)
    df['evening_activity'] = np.exp(-((df['hour'] - 21)**2) / (2 * 2.0**2))
    
    # Step 4: Business-day adjusted rush hours
    # Rush hours only matter on working days, so multiply by workingday indicator
    df['workday_rush_intensity'] = df['rush_intensity'] * df['workingday']
    df['workday_morning_rush'] = df['morning_rush_intensity'] * df['workingday']
    df['workday_evening_rush'] = df['evening_rush_intensity'] * df['workingday']
    
    # Step 5: Time-to-rush features (how close are we to peak times?)
    # Distance to nearest rush hour peak (useful for anticipating demand)
    morning_distance = np.abs(df['hour'] - 8)
    evening_distance = np.abs(df['hour'] - 17)
    
    # Minimum time to either rush hour (accounting for 24-hour cycle)
    df['hours_to_rush'] = np.minimum(
        np.minimum(morning_distance, 24 - morning_distance),
        np.minimum(evening_distance, 24 - evening_distance)
    )
    
    # Step 6: Rush hour context indicators
    # Pre-rush period (1-2 hours before peak)
    df['pre_morning_rush'] = ((df['hour'] >= 6) & (df['hour'] <= 7)).astype(int)
    df['pre_evening_rush'] = ((df['hour'] >= 15) & (df['hour'] <= 16)).astype(int)
    
    # Post-rush period (1-2 hours after peak)
    df['post_morning_rush'] = ((df['hour'] >= 9) & (df['hour'] <= 10)).astype(int)
    df['post_evening_rush'] = ((df['hour'] >= 18) & (df['hour'] <= 19)).astype(int)
    
    return df

# Apply rush hour feature engineering
df = create_rush_hour_features(df)

print("Rush hour features created:")
print(f"Rush intensity at 8 AM: {df[df['hour']==8]['rush_intensity'].iloc[0]:.3f}")
print(f"Rush intensity at 12 PM: {df[df['hour']==12]['rush_intensity'].iloc[0]:.3f}")  
print(f"Rush intensity at 5 PM: {df[df['hour']==17]['rush_intensity'].iloc[0]:.3f}")
print(f"Average hours to rush: {df['hours_to_rush'].mean():.1f}")

# Show how features vary throughout the day
print("\nRush hour intensity by hour (sample):")
hourly_rush = df.groupby('hour')[['rush_intensity', 'lunch_intensity', 'evening_activity']].mean()
print(hourly_rush.round(3))

Rush hour features created:
Rush intensity at 8 AM: 1.000
Rush intensity at 12 PM: 0.029
Rush intensity at 5 PM: 1.000
Average hours to rush: 3.2

Rush hour intensity by hour (sample):
      rush_intensity  lunch_intensity  evening_activity
hour                                                   
0              0.000            0.000             0.000
1              0.000            0.000             0.000
2              0.000            0.000             0.000
3              0.004            0.000             0.000
4              0.029            0.000             0.000
5              0.135            0.000             0.000
6              0.411            0.000             0.000
7              0.801            0.000             0.000
8              1.000            0.000             0.000
9              0.801            0.011             0.000
10             0.411            0.135             0.000
11             0.135            0.607             0.000
12             0.029           

**Step-by-Step Explanation:**

1. **Gaussian Rush Hour Curves**: The formula `np.exp(-((hour - peak)**2) / (2 * std**2))` creates smooth bell curves centered at rush hour times. This is much more realistic than binary on/off indicators, as demand gradually builds and declines around peak times.

2. **Separate Morning and Evening Features**: While combined rush intensity is useful, separate features allow models to learn that morning and evening rush patterns may differ (e.g., morning rush might be more predictable than evening rush).

3. **Additional Time Peaks**: Beyond traditional rush hours, we capture lunch hour patterns and evening entertainment periods, which also drive bike-sharing demand in urban areas.

4. **Working Day Adjustment**: Rush hours only apply to working days, so we multiply by the workingday indicator to create features that are zero on weekends and holidays.

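5. **Time-to-Rush Distance**: `hours_to_rush` is the distance in hours to the nearest rush-hour peak, whether that peak is still ahead or has already passed. It helps models capture demand that builds before and fades after a peak, but it does not distinguish the two cases by itself; the pre-rush and post-rush indicators in step 6 do that.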

6. **Rush Hour Context Periods**: Binary indicators for pre-rush and post-rush periods help models understand transitional demand patterns.
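
The wrap-around distance behind the time-to-rush feature in step 5 can be checked in isolation. This is a minimal sketch, assuming integer hours 0-23 and the peak hours 8 and 17 used in the code above; `circular_hour_distance` is a hypothetical helper name, not part of the lecture's code.

```python
import numpy as np

def circular_hour_distance(hour, peak):
    """Shortest distance in hours between `hour` and `peak` on a 24-hour clock."""
    diff = np.abs(np.asarray(hour) - peak)
    return np.minimum(diff, 24 - diff)

hours = np.arange(24)
# Distance to the nearest of the two assumed peaks (8 AM and 5 PM)
hours_to_rush = np.minimum(circular_hour_distance(hours, 8),
                           circular_hour_distance(hours, 17))

print(hours_to_rush[[0, 8, 12, 17, 23]])  # -> [7 0 4 0 6]
```

Midnight is 7 hours from the 5 PM peak (going backwards across the day boundary) rather than 8 hours from the 8 AM peak, which is why the wrap-around term `24 - diff` matters.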

> **Note**: The Gaussian function creates smooth transitions where:
> - Peak rush hours (8 AM, 5 PM) get intensity = 1.0
> - Hours 1 standard deviation away get intensity ≈ 0.61
> - Hours 2 standard deviations away get intensity ≈ 0.14
> - Hours 3+ standard deviations away get intensity ≈ 0.01 or less

For a bike-sharing operator, this feature encodes how strongly each time of day aligns with commuting demand, giving the model a better understanding of **urban mobility rhythms**. The smooth intensity measure is more realistic than binary rush/non-rush categories and allows models to predict gradual demand transitions.
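
To see how quickly the Gaussian intensity falls off, the formula from step 1 can be evaluated at whole multiples of the standard deviation. This is a small sketch; the width of 1.5 hours is an assumption inferred from the printed table (0.801 one hour from the peak), and `gaussian_intensity` is a hypothetical helper name.

```python
import numpy as np

def gaussian_intensity(hour, peak, std):
    """Bell-curve intensity: 1.0 at `peak`, decaying with distance measured in `std` units."""
    hour = np.asarray(hour, dtype=float)
    return np.exp(-((hour - peak) ** 2) / (2 * std ** 2))

std = 1.5  # assumed width in hours, inferred from the output table above
for k in [0, 1, 2, 3]:
    value = gaussian_intensity(8 + k * std, peak=8, std=std)
    print(f"{k} std from peak: {value:.3f}")
```

Under this assumption the loop prints 1.000, 0.607, 0.135 and 0.011, so a wider `std` spreads the rush-hour effect over more hours while a narrower one approaches a binary indicator.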

**2. Workday vs. Weekend Context**

Not all hours are equal—**the same 8 AM can mean very different things** depending on the day of the week.

**Definition**: Workday/weekend features separate time into business-relevant categories based on how people use transportation.

**Purpose**: They help models distinguish between commuting-driven patterns (weekday mornings and evenings) and leisure-driven usage (weekend afternoons).

**Examples of features**:

* Working hours vs. leisure hours
* Commuting periods vs. recreational periods
* Business district active hours vs. residential area activity

**Example in Python: Comprehensive Workday vs Weekend Context**

Let's implement features that distinguish between different types of time periods based on their behavioral implications:

In [36]:
import numpy as np
import pandas as pd

def create_workday_weekend_features(df):
    """
    Create features that capture how the same time periods have different meanings
    on workdays vs weekends, reflecting different user behaviors and motivations.
    """
    
    # Basic day type indicators (building blocks for more complex features)
    df['is_weekend'] = df['datetime'].dt.dayofweek.isin([5, 6]).astype(int)  # Sat=5, Sun=6
    df['is_weekday'] = (~df['datetime'].dt.dayofweek.isin([5, 6])).astype(int)
    df['is_friday'] = (df['datetime'].dt.dayofweek == 4).astype(int)  # Friday has unique patterns
    df['is_monday'] = (df['datetime'].dt.dayofweek == 0).astype(int)   # Monday has unique patterns
    
    # Step 1: Time period context features
    # Same hours mean different things on different day types
    
    # Morning commute context (6-10 AM)
    morning_commute_hours = ((df['hour'] >= 6) & (df['hour'] <= 10)).astype(int)
    df['weekday_morning_commute'] = morning_commute_hours * df['is_weekday']
    df['weekend_morning_leisure'] = morning_commute_hours * df['is_weekend']
    
    # Lunch period context (11 AM - 2 PM)
    lunch_hours = ((df['hour'] >= 11) & (df['hour'] <= 14)).astype(int)
    df['weekday_lunch_break'] = lunch_hours * df['is_weekday']
    df['weekend_afternoon_activity'] = lunch_hours * df['is_weekend']
    
    # Evening context (5-9 PM) 
    evening_hours = ((df['hour'] >= 17) & (df['hour'] <= 21)).astype(int)
    df['weekday_evening_commute'] = evening_hours * df['is_weekday']  
    df['weekend_evening_recreation'] = evening_hours * df['is_weekend']
    
    # Step 2: Business vs leisure hour classification
    # Working hours: 8 AM - 6:59 PM on weekdays (hour 18 is included, so it overlaps with leisure below)
    business_hours = ((df['hour'] >= 8) & (df['hour'] <= 18)).astype(int)
    df['business_hours_active'] = business_hours * df['is_weekday']
    
    # Leisure hours: Evenings and weekends
    leisure_weekday_evening = ((df['hour'] >= 18) | (df['hour'] <= 7)) * df['is_weekday']
    leisure_weekend_all = df['is_weekend']  # All weekend hours are leisure
    df['leisure_hours_active'] = np.maximum(leisure_weekday_evening, leisure_weekend_all)
    
    # Step 3: Advanced temporal-contextual interactions
    # Friday evening vs other evenings (different social patterns)
    df['friday_evening_social'] = evening_hours * df['is_friday']
    
    # Monday morning return-to-work effect
    df['monday_morning_return'] = morning_commute_hours * df['is_monday']
    
    # Weekend vs weekday rush hour comparison
    # This shows how "rush hour" times perform differently on weekends
    traditional_rush = ((df['hour'].isin([7, 8, 9])) | (df['hour'].isin([17, 18, 19]))).astype(int)
    df['weekday_traditional_rush'] = traditional_rush * df['is_weekday']
    df['weekend_during_rush_hours'] = traditional_rush * df['is_weekend']
    
    # Step 4: Day transition effects
    # Weekend preparation (Friday afternoon/evening)
    df['weekend_preparation'] = ((df['hour'] >= 15) & (df['is_friday'] == 1)).astype(int)
    
    # Weekend recovery (Sunday evening - people preparing for Monday)
    df['weekend_recovery'] = ((df['hour'] >= 18) & (df['datetime'].dt.dayofweek == 6)).astype(int)  # Sunday evening
    
    # Step 5: Activity type probability indicators
    # Probability that current time slot is used for commuting vs recreation
    
    # Commuting probability (high on weekday rush hours)
    commute_time_weekday = ((df['hour'].isin([7, 8, 9, 17, 18, 19])) & (df['is_weekday'] == 1)).astype(float)
    commute_time_weekend = df['is_weekend'] * 0.1  # Low but non-zero weekend commuting
    df['commuting_probability'] = np.maximum(commute_time_weekday, commute_time_weekend)
    
    # Recreation probability (high on weekends, evenings, lunch breaks)
    recreation_weekend = df['is_weekend'] * 0.8  # High baseline recreation on weekends
    recreation_weekday_evening = ((df['hour'] >= 18) & (df['is_weekday'] == 1)) * 0.6
    recreation_lunch = ((df['hour'] >= 11) & (df['hour'] <= 14)) * 0.3
    df['recreation_probability'] = np.maximum.reduce([recreation_weekend, 
                                                     recreation_weekday_evening, 
                                                     recreation_lunch])
    
    # Step 6: User behavior context features
    # Different user types (casual vs registered) have different day-type sensitivities
    
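    # Weekend premium for casual users (weekends see higher casual usage)
    # Note: this is is_weekend multiplied by a constant, so it carries the same information as is_weekend
    df['weekend_casual_boost'] = df['is_weekend'] * 1.5
    
    # Workday routine for registered users (consistent weekday patterns)
    # Note: business_hours_active is already zero on weekends, so this equals business_hours_active
    df['workday_registered_routine'] = df['is_weekday'] * df['business_hours_active']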
    
    return df

# Apply workday/weekend context features
df = create_workday_weekend_features(df)

print("Workday/Weekend context features created:")
print(f"Business hours active - Weekend avg: {df[df['is_weekend']==1]['business_hours_active'].mean():.3f}")
print(f"Business hours active - Weekday avg: {df[df['is_weekday']==1]['business_hours_active'].mean():.3f}")
print(f"Recreation probability - Weekend avg: {df[df['is_weekend']==1]['recreation_probability'].mean():.3f}")
print(f"Commuting probability - Weekday rush avg: {df[(df['is_weekday']==1) & (df['hour'].isin([8,17]))]['commuting_probability'].mean():.3f}")

# Show context differences by day type and hour
print("\nContext feature comparison (8 AM):")
morning_comparison = df[df['hour']==8].groupby('is_weekend')[
    ['weekday_morning_commute', 'weekend_morning_leisure', 'commuting_probability', 'recreation_probability']
].mean()
print(morning_comparison.round(3))

Workday/Weekend context features created:
Business hours active - Weekend avg: 0.000
Business hours active - Weekday avg: 0.461
Recreation probability - Weekend avg: 0.800
Commuting probability - Weekday rush avg: 1.000

Context feature comparison (8 AM):
            weekday_morning_commute  weekend_morning_leisure  \
is_weekend                                                     
0                               1.0                      0.0   
1                               0.0                      1.0   

            commuting_probability  recreation_probability  
is_weekend                                                 
0                             1.0                     0.0  
1                             0.1                     0.8  


**Step-by-Step Explanation:**

1. **Basic Day Type Classification**: We create fundamental indicators for weekdays, weekends, and special days (Friday, Monday) that have unique behavioral patterns in urban transportation.

2. **Time Period Context Mapping**: The same hour (e.g., 8 AM) gets different contextual features depending on day type. 8 AM on weekday = commuting context; 8 AM on weekend = leisure context.

3. **Business vs Leisure Hour Segmentation**: We classify time periods based on typical activity types rather than just day/time, recognizing that leisure can happen on weekday evenings too.

4. **Special Day Transition Effects**: Friday evenings and Monday mornings have unique patterns as people transition between work and leisure modes.

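5. **Behavioral Probability Indicators**: Instead of binary classifications, we assign hand-chosen scores between 0 and 1 to different activity types (for example 0.1 for weekend commuting and 0.8 for weekend recreation), acknowledging that some commuting happens on weekends and some recreation happens on weekdays. These are heuristic weights, not probabilities estimated from the data.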

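6. **User Type Interactions**: Different user segments (casual vs registered) respond differently to day-type contexts. The two features in this step mark where those differences are expected. As written, `weekend_casual_boost` is a rescaled copy of `is_weekend` and `workday_registered_routine` equals `business_hours_active`, so they add no new information until they are combined with other variables such as weather.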

A consulting client may want to know why Saturday afternoons have higher casual ridership compared to Monday mornings. These contextual features allow the model to reflect such differences clearly by distinguishing between commuting-driven demand (predictable, routine) and recreation-driven demand (weather-sensitive, discretionary).
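
The hand-set weights in step 5 can be compared against a data-driven alternative: the observed share of casual riders per day type and hour. The sketch below uses a tiny made-up table and assumes `casual` and `registered` columns like those in the bike-sharing dataset; the `casual_share` name is introduced here for illustration.

```python
import pandas as pd

# Tiny made-up sample: two weekday rows and two weekend rows at the same hours
sample = pd.DataFrame({
    "is_weekend": [0, 0, 1, 1],
    "hour":       [8, 14, 8, 14],
    "casual":     [10, 30, 40, 120],
    "registered": [190, 70, 60, 80],
})

# Share of rides made by casual users in each (day type, hour) cell
totals = sample.groupby(["is_weekend", "hour"])[["casual", "registered"]].sum()
casual_share = (totals["casual"] / (totals["casual"] + totals["registered"])).rename("casual_share")

# Attach the estimated share back to every row as a feature
sample = sample.merge(casual_share.reset_index(), on=["is_weekend", "hour"], how="left")
print(sample)
```

Because `casual` and `registered` are components of the demand being predicted, shares like these should be estimated on training data only and then applied to new rows.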

**3. Seasonal Progression Features**

Seasons affect demand, but changes are **gradual, not sudden**.

**Definition**: Seasonal progression features capture smooth transitions across the year, instead of treating each season as a fixed category.

**Purpose**: They help account for gradual shifts such as increasing daylight, warming temperatures, or the transition between early and late parts of a season.

**Examples**:

* Days since the **winter solstice** (captures changing daylight availability).
* **Temperature trend** over the past week (captures warming/cooling).
* **Daylight hours** available each day.
* Early spring vs. late spring (within-season progression).

**Example in Python: Comprehensive Seasonal Progression Features**

Let's implement features that capture gradual seasonal changes rather than discrete seasonal categories:

In [37]:
import numpy as np
import pandas as pd
from datetime import datetime, date

def create_seasonal_progression_features(df):
    """
    Create features that capture gradual seasonal transitions and progression
    rather than treating seasons as discrete categories.
    """
    
    # Ensure datetime is properly parsed
    df['datetime'] = pd.to_datetime(df['datetime'])
    df['day_of_year'] = df['datetime'].dt.dayofyear
    
    # Step 1: Solar position and daylight features
    # Days since winter solstice (Dec 21, around day 355)
    # This captures the solar year cycle affecting daylight and temperature
    winter_solstice_day = 355  # December 21st is typically day 355
    
    df['days_since_winter_solstice'] = df['day_of_year'].apply(
        lambda x: x - winter_solstice_day if x >= winter_solstice_day 
        else x + (365 - winter_solstice_day)
    )
    
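    # Circular distance in days to the summer solstice (June 21, around day 172).
    # Despite the column name, this is symmetric: the value is the same before and after the solstice.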
    summer_solstice_day = 172  # June 21st is typically day 172
    df['days_since_summer_solstice'] = df['day_of_year'].apply(
        lambda x: min(abs(x - summer_solstice_day), 365 - abs(x - summer_solstice_day))
    )
    
    # Step 2: Seasonal progression within each season
    # Spring progression (March 20 - June 20, days ~79-172)
    spring_start, spring_end = 79, 172
    spring_mask = (df['day_of_year'] >= spring_start) & (df['day_of_year'] <= spring_end)
    df['spring_progression'] = np.where(
        spring_mask,
        (df['day_of_year'] - spring_start) / (spring_end - spring_start),
        np.nan
    )
    
    # Summer progression (June 21 - September 22, days ~172-265) 
    summer_start, summer_end = 172, 265
    summer_mask = (df['day_of_year'] >= summer_start) & (df['day_of_year'] <= summer_end)
    df['summer_progression'] = np.where(
        summer_mask,
        (df['day_of_year'] - summer_start) / (summer_end - summer_start),
        np.nan
    )
    
    # Fall progression (September 23 - December 20, days ~265-355)
    fall_start, fall_end = 265, 355
    fall_mask = (df['day_of_year'] >= fall_start) & (df['day_of_year'] <= fall_end)
    df['fall_progression'] = np.where(
        fall_mask,
        (df['day_of_year'] - fall_start) / (fall_end - fall_start),
        np.nan
    )
    
    # Winter progression (December 21 - March 19, days 355+ and 1-79)
    winter_mask = (df['day_of_year'] >= 355) | (df['day_of_year'] <= 79)
    df['winter_progression'] = np.where(
        winter_mask,
        df['days_since_winter_solstice'] / 90,  # ~90 days in winter
        np.nan
    )
    
    # Step 3: Temperature trend and deviation features
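    # 7-day temperature trend (is it getting warmer or colder?)
    # The window is 7*24 rows, which equals 7 days only when there is one row per hour with no gaps.
    # The fitted slope is in degrees per row (per hour for gap-free hourly data).
    df = df.sort_values('datetime')  # Ensure chronological order
    df['temp_7d_trend'] = df['temp'].rolling(window=7*24, min_periods=24).apply(
        lambda x: np.polyfit(np.arange(len(x)), x, 1)[0],
        raw=True  # pass NumPy arrays to the function, which is faster than passing Series
    )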
    
    # Temperature deviation from seasonal average
    # Calculate seasonal temperature norms
    seasonal_temp_avg = df.groupby([df['datetime'].dt.month, df['hour']])['temp'].transform('mean')
    df['temp_seasonal_deviation'] = df['temp'] - seasonal_temp_avg
    
    # Step 4: Daylight and solar energy features
    # Approximate daylight hours based on day of year and latitude (Washington DC ≈ 39°N)
    def calculate_daylight_hours(day_of_year, latitude=39.0):
        # Simplified daylight calculation
        declination = 23.45 * np.sin(np.radians(360 * (284 + day_of_year) / 365))
        lat_rad = np.radians(latitude)
        decl_rad = np.radians(declination)
        
        hour_angle = np.arccos(-np.tan(lat_rad) * np.tan(decl_rad))
        daylight = 2 * hour_angle * 12 / np.pi
        return np.clip(daylight, 0, 24)  # Ensure reasonable bounds
    
    df['daylight_hours'] = df['day_of_year'].apply(calculate_daylight_hours)
    
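    # Daylight change rate (how fast are days getting longer/shorter?)
    # Computed as today's day length minus yesterday's. A plain .diff() on hourly rows
    # would be zero for every row except the first hour of each day.
    df['daylight_change_rate'] = (
        df['daylight_hours'] - (df['day_of_year'] - 1).apply(calculate_daylight_hours)
    )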
    
    # Step 5: Seasonal transition periods (change points)
    # Distance in days to the nearest seasonal transition point (previous or upcoming),
    # wrapping around the end of the year
    spring_equinox, summer_solstice = 79, 172
    fall_equinox, winter_solstice = 265, 355
    
    transitions = [spring_equinox, summer_solstice, fall_equinox, winter_solstice]
    
    df['days_to_next_transition'] = df['day_of_year'].apply(
        lambda day: min([abs(day - t) if abs(day - t) <= 183 else 365 - abs(day - t) 
                        for t in transitions])
    )
    
    # Step 6: Advanced seasonal interaction features
    # Seasonal temperature surprise (how unusual is current temp for this time of year?)
    monthly_temp_std = df.groupby(df['datetime'].dt.month)['temp'].transform('std')
    df['seasonal_temp_surprise'] = abs(df['temp_seasonal_deviation']) / (monthly_temp_std + 1)
    
    # Early/late season indicators
    df['early_season'] = ((df['spring_progression'] <= 0.3) | 
                         (df['summer_progression'] <= 0.3) |
                         (df['fall_progression'] <= 0.3) | 
                         (df['winter_progression'] <= 0.3)).astype(int)
    
    df['late_season'] = ((df['spring_progression'] >= 0.7) | 
                        (df['summer_progression'] >= 0.7) |
                        (df['fall_progression'] >= 0.7) | 
                        (df['winter_progression'] >= 0.7)).astype(int)
    
    # Step 7: Seasonal user behavior patterns
    # Different seasons drive different usage patterns
    df['growing_season'] = ((df['day_of_year'] >= 90) & (df['day_of_year'] <= 280)).astype(int)  # ~Mar 31 to Oct 7
    df['dormant_season'] = ((df['day_of_year'] < 90) | (df['day_of_year'] > 280)).astype(int)   # ~Oct 8 to Mar 30
    
    return df

# Apply seasonal progression features
df = create_seasonal_progression_features(df)

print("Seasonal progression features created:")
print(f"Days since winter solstice - Range: {df['days_since_winter_solstice'].min():.0f} to {df['days_since_winter_solstice'].max():.0f}")
print(f"Daylight hours - Range: {df['daylight_hours'].min():.1f} to {df['daylight_hours'].max():.1f}")
print(f"Temp seasonal deviation - Std: {df['temp_seasonal_deviation'].std():.2f}")
print(f"Days to next transition - Average: {df['days_to_next_transition'].mean():.1f}")

# Show seasonal progression for different times of year
print("\nSeasonal features by month (sample):")
seasonal_sample = df.groupby(df['datetime'].dt.month)[
    ['days_since_winter_solstice', 'daylight_hours', 'temp_seasonal_deviation']
].mean()
print(seasonal_sample.round(2))

Seasonal progression features created:
Days since winter solstice - Range: 11 to 364
Daylight hours - Range: 9.3 to 14.7
Temp seasonal deviation - Std: 3.26
Days to next transition - Average: 23.3

Seasonal features by month (sample):
          days_since_winter_solstice  daylight_hours  temp_seasonal_deviation
datetime                                                                     
1                              19.92            9.46                     -0.0
2                              51.01           10.35                     -0.0
3                              79.48           11.50                     -0.0
4                             110.51           12.84                      0.0
5                             140.50           13.98                     -0.0
6                             171.50           14.67                      0.0
7                             201.50           14.56                     -0.0
8                             232.50           13.69           

**Step-by-Step Explanation:**

1. **Solar Position Features**: Days since solstices capture the fundamental solar cycle driving temperature and daylight patterns. This creates smooth, continuous features rather than categorical seasons.

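2. **Within-Season Progression**: Instead of treating "spring" as uniform, we track progress from early spring (0.0) to late spring (1.0), capturing gradual changes within each season. Each progression column is `NaN` outside its own season, so these columns need imputation (or a model that tolerates missing values) before training.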

3. **Temperature Trend Analysis**: 7-day temperature trends identify sustained warming or cooling periods, while seasonal deviation shows when current conditions differ from historical norms.

4. **Daylight Calculations**: We approximate actual daylight hours for Washington DC's latitude, capturing the solar influence on bike usage patterns.

5. **Seasonal Transitions**: `days_to_next_transition` is the distance in days to the nearest equinox or solstice, whether it has just passed or is coming up. It identifies periods of rapid change when user behavior might be most sensitive to conditions.

6. **Advanced Interactions**: Seasonal temperature surprise quantifies how unusual current weather is, while early/late season indicators capture different behavioral patterns within seasons.

7. **User Behavior Context**: Growing vs dormant season features reflect the fundamental rhythm of bike-sharing usage throughout the year.

A city may see steady growth in bike demand as spring progresses—not just because of "spring" as a category, but because days are getting warmer and longer. Encoding progression allows the model to learn these subtler patterns, such as the gradual increase in recreational usage as daylight hours expand or the rapid change in commuting patterns during seasonal transitions.
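
The daylight feature and its day-over-day change can be computed on a one-row-per-day calendar, where `diff()` has a clear meaning. This is a sketch that reuses the simplified declination formula from the lecture code; the latitude of 39°N is the same assumption, and the clipping step is an addition that only matters at polar latitudes.

```python
import numpy as np
import pandas as pd

def daylight_hours(day_of_year, latitude=39.0):
    """Approximate day length in hours using the simplified formula from the lecture code."""
    day = np.asarray(day_of_year, dtype=float)
    declination = 23.45 * np.sin(np.radians(360 * (284 + day) / 365))
    cos_hour_angle = -np.tan(np.radians(latitude)) * np.tan(np.radians(declination))
    # Clip so arccos stays defined at latitudes with polar day or polar night
    hour_angle = np.arccos(np.clip(cos_hour_angle, -1.0, 1.0))
    return 24 * hour_angle / np.pi

days = pd.DataFrame({"day_of_year": np.arange(1, 366)})
days["daylight_hours"] = daylight_hours(days["day_of_year"])
# One row per calendar day, so diff() really is the day-over-day change
days["daylight_change"] = days["daylight_hours"].diff()

print(days.loc[days["day_of_year"].isin([79, 172, 355])].round(2))
```

The daily table can then be merged onto hourly data by `day_of_year`, so every hour of a day receives the same daylight length and change rate.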

## 3. Categorical Encoding Strategies

### 3.1. Understanding Categorical Variables in Transportation

In this section, we’ll explore why categorical variables matter in transportation datasets and how to transform them into useful numerical representations.

Most machine learning algorithms rely on numerical operations such as addition, multiplication, and comparison. But transportation data often contains text-based categories like *“Rainy”* or *“Clear.”* Algorithms cannot process these directly — we need to **convert them into numbers** while preserving their meaning for prediction.

The challenge lies in doing this transformation in a way that **respects the type of category** and the relationships it carries. Not all categorical variables behave the same way: some have natural orderings, while others are just labels.

#### Examples of Categorical Variables in Bike-Sharing:

* **Weather Conditions**: Clear, misty, light rain, heavy rain
* **Day Types**: Weekday, weekend, holiday
* **Seasons**: Spring, summer, fall, winter
* **Time Periods**: Rush hour, off-peak, late night
* **Events**: Normal, special event, maintenance period

Each requires a different encoding strategy depending on whether categories are *unordered labels*, *ranked scales*, or *business-specific conditions*.

### 3.2. Encoding Strategy Selection Framework

This framework introduces four widely used approaches to categorical encoding. Each method fits different types of variables.

**1. One-Hot Encoding for Nominal Categories**

**Definition**: One-hot encoding creates binary columns for each category, with values 1 (present) or 0 (absent).

**When to use**: For categories with **no inherent order** where all options should be treated equally.

In [38]:
# Example: One-hot encoding weather conditions
# Ensure 'weather_condition' exists; derive it from numeric 'weather' if needed
if 'weather_condition' not in df.columns:
    if 'weather' in df.columns:
        _weather_map = {1: 'Clear', 2: 'Misty', 3: 'Light Rain', 4: 'Heavy Rain'}
        df['weather_condition'] = df['weather'].map(_weather_map).astype('category')
    else:
        raise KeyError("Neither 'weather_condition' nor 'weather' columns are present in df.")

print("Original weather column (as categories):")
print(df['weather_condition'].value_counts())

weather_encoded = pd.get_dummies(df['weather_condition'], prefix='weather')
print("\nOne-hot encoded columns:")
print(weather_encoded.head())
print(f"\nOriginal 1 column → {len(weather_encoded.columns)} binary columns")

Original weather column (as categories):
weather_condition
Clear         7192
Misty         2834
Light Rain     859
Heavy Rain       1
Name: count, dtype: int64

One-hot encoded columns:
   weather_Clear  weather_Heavy Rain  weather_Light Rain  weather_Misty
0           True               False               False          False
1           True               False               False          False
2           True               False               False          False
3           True               False               False          False
4           True               False               False          False

Original 1 column → 4 binary columns


This transformation creates separate binary columns for each weather condition. Where we previously had one column with text values ('Clear', 'Misty', 'Light Rain', 'Heavy Rain'), we now have four binary columns (weather_Clear, weather_Heavy Rain, weather_Light Rain, weather_Misty) that machine learning algorithms can process directly.

Each row has exactly one `True` (equivalent to 1) and the rest `False` (0), preserving the original information in a numerical format. For bike-sharing demand prediction, this allows the model to learn different demand patterns for each weather type - for example, Clear days might show high recreational usage while Light Rain days might see reduced casual ridership but maintained commuter patterns.

The key advantage: "Clear" and "Misty" are treated as equally valid categories without implying any ordering or hierarchy between them.
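
Two `pd.get_dummies` options are worth knowing when the encoded columns feed a model. The sketch below uses a small made-up series: `dtype=int` produces 0/1 columns instead of booleans, and `drop_first=True` removes one category, which avoids perfectly collinear columns in linear models.

```python
import pandas as pd

weather = pd.Series(["Clear", "Misty", "Light Rain", "Clear"], name="weather_condition")

# dtype=int gives 0/1 columns instead of the default True/False
full = pd.get_dummies(weather, prefix="weather", dtype=int)

# drop_first=True drops the first category (alphabetically "Clear" here),
# which then acts as the implicit baseline
reduced = pd.get_dummies(weather, prefix="weather", dtype=int, drop_first=True)

print(full)
print(reduced.columns.tolist())  # -> ['weather_Light Rain', 'weather_Misty']
```

Tree-based models generally work fine with the full set of columns; dropping one is mainly relevant for linear and other models that are sensitive to redundant inputs.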

**2. Ordinal Encoding for Ordered Categories**

**Definition**: Ordinal encoding maps categories to numbers that reflect their natural order.

**When to use**: For categories where **order matters**, such as severity or ranking.

In [39]:
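# Encoding weather severity by order
weather_severity_map = {'Clear': 1, 'Misty': 2, 'Light Rain': 3, 'Heavy Rain': 4}
# 'weather_condition' may be a categorical column, in which case map() can return a categorical;
# astype(int) makes the result numeric (every row has a weather value here)
df['weather_severity'] = df['weather_condition'].map(weather_severity_map).astype(int)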

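This encoding tells the model that *Heavy Rain > Light Rain* in terms of severity. Unlike one-hot encoding, it treats the categories as points on a numeric scale, which implicitly assumes that each step (Clear to Misty, Misty to Light Rain, and so on) is equally large.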
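
An alternative to a hand-written dictionary is to declare the order once with a pandas ordered categorical and read off the integer codes. This is a sketch on made-up values; the severity order is the same assumption as in the mapping above.

```python
import pandas as pd

conditions = pd.Series(["Clear", "Heavy Rain", "Misty", "Light Rain", "Clear"])

# Declare the order explicitly, mildest to most severe
severity_order = ["Clear", "Misty", "Light Rain", "Heavy Rain"]
ordered = pd.Categorical(conditions, categories=severity_order, ordered=True)

# .codes are 0-based positions in severity_order; add 1 to get a 1-4 scale
severity = pd.Series(ordered.codes + 1, index=conditions.index, name="weather_severity")
print(severity.tolist())  # -> [1, 4, 2, 3, 1]
```

Values that are not in `severity_order` receive code -1 (0 after the shift), which makes unexpected categories easy to spot.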

**3. Target-Based Encoding for High-Cardinality Variables**

**Definition**: Replaces each category with a statistic of the target (e.g., mean demand for that category).

**When to use**: For variables with **many categories** (e.g., hundreds of stations).

In [40]:
# Target (mean) encoding on high-cardinality category
# Use 'station_id' if available; otherwise, fall back to 'hour' as a demo
if 'station_id' in df.columns:
    station_demand = df.groupby('station_id')['count'].mean()
    df['station_avg_demand'] = df['station_id'].map(station_demand)
    print("Applied target encoding on 'station_id' -> 'station_avg_demand'")
else:
    hour_demand = df.groupby('hour')['count'].mean()
    df['hour_avg_demand'] = df['hour'].map(hour_demand)
    print("No 'station_id' column. Applied target encoding on 'hour' -> 'hour_avg_demand'")

No 'station_id' column. Applied target encoding on 'hour' -> 'hour_avg_demand'


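Target encoding captures each category's relationship to demand in a single numeric column, avoiding hundreds of dummy variables for something like station IDs. In this run the dataset has no `station_id` column, so the code falls back to `hour` purely as a demonstration. Because the encoded value is computed from the target itself, the category means should be estimated on training data only (or out-of-fold); otherwise each row's own target leaks into its feature.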
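
One common way to limit target leakage in mean encoding is to compute each row's value from the other folds only. The following is a minimal sketch of that idea on made-up data; `out_of_fold_target_encode` and the `station_avg_demand_oof` column are hypothetical names, and the random fold assignment is a simplification (time-ordered data would normally call for a chronological split).

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Mean-encode `cat_col` so that each row's value is computed without its own target."""
    rng = np.random.default_rng(seed)
    fold_id = rng.integers(0, n_splits, size=len(df))  # random fold label per row
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)

    for fold in range(n_splits):
        in_fold = fold_id == fold
        if not in_fold.any():
            continue  # small samples can leave a fold empty
        # Category means estimated only from rows outside the current fold
        means = df.loc[~in_fold].groupby(cat_col)[target_col].mean()
        encoded.loc[in_fold] = df.loc[in_fold, cat_col].map(means).to_numpy()

    # Categories unseen in the other folds fall back to the overall mean
    return encoded.fillna(global_mean)

demo = pd.DataFrame({
    "station_id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "count":      [10, 12, 14, 100, 110, 90, 50, 55],
})
demo["station_avg_demand_oof"] = out_of_fold_target_encode(demo, "station_id", "count", n_splits=4)
print(demo)
```

For rows to be scored later (validation, test, or production), the category means would be computed once from the full training set and applied with a plain `map`.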

**4. Binary Encoding for Business-Specific Conditions**

**Definition**: Creates simple 0/1 features for key conditions.

**When to use**: When certain conditions are **especially important for the business**.

Examples in bike-sharing:

* `is_holiday`: 1 if holiday, else 0
* `is_weekend`: 1 if Saturday/Sunday, else 0
* `is_rush_hour`: 1 if within rush hour, else 0
* `is_good_weather`: 1 if clear or mild conditions, else 0

This approach highlights critical patterns without overcomplicating the dataset.
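
The four example flags can be derived in a few lines. This sketch runs on three made-up rows and assumes a 0/1 `holiday` column and a numeric `weather` code (1 = Clear through 4 = Heavy Rain); the rush-hour window and the "good weather" threshold are illustrative choices, not definitions from the dataset.

```python
import pandas as pd

rides = pd.DataFrame({
    "datetime": pd.to_datetime(["2012-07-04 08:00", "2012-07-07 14:00", "2012-07-09 17:00"]),
    "holiday":  [1, 0, 0],   # assumed 0/1 holiday column
    "weather":  [1, 3, 2],   # assumed codes: 1=Clear, 2=Misty, 3=Light Rain, 4=Heavy Rain
})

hour = rides["datetime"].dt.hour
rides["is_holiday"] = (rides["holiday"] == 1).astype(int)
rides["is_weekend"] = (rides["datetime"].dt.dayofweek >= 5).astype(int)  # Sat=5, Sun=6
# Rush-hour window (7-9 AM, 5-7 PM) is an illustrative choice
rides["is_rush_hour"] = hour.isin([7, 8, 9, 17, 18, 19]).astype(int)
# "Good weather" is assumed to mean Clear or Misty
rides["is_good_weather"] = (rides["weather"] <= 2).astype(int)

print(rides[["is_holiday", "is_weekend", "is_rush_hour", "is_good_weather"]])
```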

### 3.3. Advanced Categorical Feature Engineering

Basic encodings treat each category independently. But in real-world transportation, **combinations of conditions** often drive demand. Advanced techniques help us capture these richer relationships.

**1. Interaction Encoding**

**Definition**: Creates new features that represent combinations of categories.

**Purpose**: Reveals demand patterns that only emerge when two conditions overlap.

Examples:

* `weekend_good_weather`: 1 if weekend AND good weather
* `rush_hour_weekday`: 1 if weekday AND rush hour
* `holiday_winter`: 1 if holiday AND winter season

For instance, **weekend + good weather** often signals high recreational demand.
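
For 0/1 indicators, an interaction feature is simply the product of the two flags. The sketch below assumes the flag columns already exist and uses four made-up rows.

```python
import pandas as pd

flags = pd.DataFrame({
    "is_weekend":      [1, 1, 0, 0],
    "is_good_weather": [1, 0, 1, 0],
    "is_rush_hour":    [0, 0, 1, 1],
})

# For 0/1 indicators, multiplication acts as a logical AND
flags["weekend_good_weather"] = flags["is_weekend"] * flags["is_good_weather"]
flags["rush_hour_weekday"] = (1 - flags["is_weekend"]) * flags["is_rush_hour"]

print(flags)
```

Only the first row is both a weekend and good weather, so only it gets `weekend_good_weather = 1`; the two weekday rush-hour rows get `rush_hour_weekday = 1`.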

**2. Frequency Encoding**

**Definition**: Encodes categories based on how often they appear in the dataset.

**Purpose**: Uses **rarity vs. commonness** as a predictive signal.

In [41]:
# Frequency encoding for weather
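weather_frequency = df['weather_condition'].value_counts()
# Cast to int so the result is numeric even when 'weather_condition' is a categorical column
df['weather_frequency'] = df['weather_condition'].map(weather_frequency).astype(int)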

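Rare weather (e.g., storms) may influence demand differently than common conditions like “Clear,” so frequency can itself be an informative feature. Note that categories with identical counts receive the same encoded value and become indistinguishable.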
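
Raw counts depend on the size of the dataset, so a relative frequency is often easier to compare across samples. This sketch, on a made-up series of ten observations, uses `value_counts(normalize=True)` to encode each category by its share.

```python
import pandas as pd

weather = pd.Series(["Clear"] * 6 + ["Misty"] * 3 + ["Light Rain"], name="weather_condition")

# normalize=True returns shares between 0 and 1 instead of raw counts
share = weather.value_counts(normalize=True)
weather_share = weather.map(share)

print(share)
print(weather_share.iloc[[0, 6, 9]].round(2).tolist())  # -> [0.6, 0.3, 0.1]
```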

**3. Hierarchical Encoding**

**Definition**: Breaks down categorical data into multiple levels of detail.

**Purpose**: Provides both broad and granular views of categorical patterns.

Transportation hierarchies often include:

* **Time**: Season → Month → Week → Day → Hour
* **Weather**: Type → Severity → Duration
* **Location**: City → District → Station

Example:

* Broad level: `season` (spring, summer, fall, winter)
* Medium level: `month` (January–December)
* Specific level: `week_of_year` (1–52, or 53 in some years)

This multi-level encoding helps capture both long-term trends (seasonality) and short-term fluctuations (week-specific effects).
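
All three levels of the time hierarchy can be derived from a single timestamp column. The sketch below uses four made-up timestamps; the month-to-season mapping (meteorological seasons) is an assumption for illustration and may differ from the `season` column shipped with a given dataset.

```python
import pandas as pd

ts = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-15 08:00", "2011-04-10 12:00", "2011-07-19 17:00", "2011-11-05 09:00"])})

ts["month"] = ts["datetime"].dt.month
# ISO week numbers run from 1 to 52 or 53
ts["week_of_year"] = ts["datetime"].dt.isocalendar().week.astype(int)
# Meteorological seasons by month (illustrative assumption)
month_to_season = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "fall", 10: "fall", 11: "fall"}
ts["season"] = ts["month"].map(month_to_season)

print(ts[["season", "month", "week_of_year"]])
```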

### 3.4. Why These Techniques Matter

By combining these strategies, we give our models **a richer understanding of categorical influences**. Instead of treating “holiday,” “rainy,” or “weekend” as isolated labels, advanced encodings capture how these factors **interact and shape demand together**.

For bike-sharing consultants, this means producing **more accurate demand forecasts**, supporting better operational decisions, and ultimately helping clients serve riders more effectively.

## 4. Scaling and Normalization for Optimal Performance

In this section, you’ll tackle a challenge that often makes or breaks the success of bike-sharing demand prediction: **how to prepare features so that they contribute fairly to your machine learning models**. This process is called *scaling and normalization*.

We’ll begin by understanding why scaling matters specifically for transportation datasets, then compare three essential techniques used by every consultant. Finally, we’ll move into advanced scaling strategies tailored for time-varying data. By the end, you’ll be equipped to prepare features in a way that boosts model performance while preserving the meaningful patterns in bike-sharing demand.

### 4.1. Why Scaling Matters for Transportation Data

Before diving into methods, let’s first define what scaling and normalization mean.

* **Scaling** adjusts the range of feature values so they can be compared on equal footing.
* **Normalization** reshapes distributions to fit within defined ranges or scales.
* Importantly, these transformations do *not* change the underlying patterns in your data—they only adjust the numerical representation.

Think of this as putting all features on a level playing field, where temperature, humidity, and demand counts each get a fair chance to influence the model.

**The Transportation Scaling Challenge**

Your Washington D.C. dataset contains features with very different scales:

* **Temperature**: –10 to 40 °C
* **Humidity**: 0–100%
* **Bike counts**: 1–1000+ rentals/hour
* **Hour of day**: 0–23

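If left unscaled, models that rely on distances or gradient-based optimization (for example k-nearest neighbors, support vector machines, regularized linear models, and neural networks) tend to give more weight to variables with larger ranges. In this case, rental counts in the hundreds will overshadow temperature in the tens, even if temperature is more predictive of demand. Tree-based models such as random forests are largely insensitive to feature scale.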

Imagine your client asks for a prediction for next Tuesday at 3 PM. If your features aren’t scaled properly, the model might overemphasize past rental counts and underweight the fact that the forecast predicts an unusually cold day. The result? An inflated demand forecast and wasted resources.

Proper scaling ensures every feature contributes based on *predictive value* rather than raw magnitude. This is the consultant’s way of making sure models stay business-relevant.
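
To make the magnitude effect concrete, here is a minimal sketch using three made-up hours (hypothetical values, not rows from the dataset). It measures Euclidean distance, the quantity a nearest-neighbor model would rely on, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical hours: [temperature in °C, registered rentals]
hours = np.array([
    [5.0, 200.0],   # cold hour
    [30.0, 210.0],  # hot hour with similar rentals
    [6.0, 400.0],   # cold hour with many more rentals
])

def distances_from_first(X):
    """Euclidean distance from the first row to each of the other rows."""
    return np.linalg.norm(X[1:] - X[0], axis=1)

raw_dist = distances_from_first(hours)
scaled_dist = distances_from_first(StandardScaler().fit_transform(hours))

print("Raw distances:   ", raw_dist.round(2))
print("Scaled distances:", scaled_dist.round(2))
```

On the raw values, the 25-degree temperature gap barely registers: the third hour looks more than seven times farther away than the second, purely because rentals are measured in larger numbers. After standardization the two distances are nearly equal, so both differences count.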

### 4.2. Essential Scaling Techniques for Bike-Sharing Analysis

Now that you know *why* scaling matters, let’s explore the three core methods every transportation consultant should master. Each one transforms the data differently and is best suited for particular feature types.

**1. StandardScaler: Statistical Normalization**

**Definition:** StandardScaler (Z-score normalization) standardizes features so they have mean 0 and standard deviation 1. This preserves the distribution shape but makes features comparable on a common statistical scale.

**Formula:**
$$scaled\_value = \frac{original\_value - mean}{standard\_deviation}$$

**Python Implementation:**

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerical_columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered']

# Show before/after comparison
print("Original values (first 5 rows):")
print(df[numerical_columns].head())

# fit_transform returns a NumPy array, so wrap it in a DataFrame for display
df_scaled = scaler.fit_transform(df[numerical_columns])
df_scaled_display = pd.DataFrame(df_scaled, columns=numerical_columns)

print("\nScaled values (first 5 rows):")
print(df_scaled_display.head())

print("\nScaled statistics:")
print(f"Mean: {df_scaled.mean(axis=0).round(10)}")  # Should be ~0
print(f"Std: {df_scaled.std(axis=0).round(3)}")     # Should be ~1
```

```
Original values (first 5 rows):
   temp   atemp  humidity  windspeed  casual  registered
0  9.84  14.395        81        0.0       3          13
1  9.02  13.635        80        0.0       8          32
2  9.02  13.635        80        0.0       5          27
3  9.84  14.395        75        0.0       3          10
4  9.84  14.395        75        0.0       0           1

Scaled values (first 5 rows):
       temp     atemp  humidity  windspeed    casual  registered
0 -1.333661 -1.092737  0.993213  -1.567754 -0.660992   -0.943854
1 -1.438907 -1.182421  0.941249  -1.567754 -0.560908   -0.818052
2 -1.438907 -1.182421  0.941249  -1.567754 -0.620958   -0.851158
3 -1.333661 -1.092737  0.681430  -1.567754 -0.660992   -0.963717
4 -1.333661 -1.092737  0.681430  -1.567754 -0.721042   -1.023307

Scaled statistics:
Mean: [-0.  0.  0. -0.  0. -0.]
Std: [1. 1. 1. 1. 1. 1.]
```


Notice how StandardScaler transforms the features to have mean ≈ 0 and standard deviation ≈ 1. Temperature values that ranged from about 1 to 41 °C now center around 0, with most values falling between -2 and +2. This puts temperature, humidity, and bike counts on the same statistical scale, so scale-sensitive models no longer over-weight variables with larger raw values.

The transformation preserves relationships within each feature while making them directly comparable. For bike-sharing prediction, this means a one-standard-deviation change in humidity (about 19 points in this dataset) carries the same numerical weight as a one-standard-deviation change in temperature (about 8 °C), allowing the model to learn which features are truly most predictive.

**When to Use:**

* Weather variables with near-normal distributions
* Regularized linear regression, neural networks, and other scale-sensitive models
* Situations without extreme outliers
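
One practical caveat when moving from demonstration to modeling: the scaler's mean and standard deviation should come from the training rows only, otherwise information from the evaluation period leaks into the features. The sketch below illustrates this on synthetic data; the `demo` frame and its values are assumptions made for the example, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two weather columns (assumed values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'temp': rng.normal(20, 8, size=500),
    'humidity': rng.uniform(20, 100, size=500),
})

# shuffle=False keeps row order, so the last 20% plays the role of "the future"
train, test = train_test_split(demo, test_size=0.2, shuffle=False)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training rows only
test_scaled = scaler.transform(test)        # reuse those statistics on held-out rows

print("Train means:", train_scaled.mean(axis=0).round(3))
print("Test means: ", test_scaled.mean(axis=0).round(3))
```

The test means will generally not be exactly zero, and that is expected: the held-out rows are expressed in units learned from the training period.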

**2. MinMaxScaler: Bounded Range Normalization**

**Definition:** MinMaxScaler rescales values into a defined range, usually 0–1. This ensures all values fit within a predictable bound.

**Formula:**
$$scaled\_value = \frac{original\_value - min}{max - min}$$

**Python Implementation:**

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
numerical_columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered']

# Show before
print("Original min/max (per column):")
print(df[numerical_columns].agg(['min','max']).T)

# Fit & transform
mm_scaled = scaler.fit_transform(df[numerical_columns])
mm_scaled_df = pd.DataFrame(mm_scaled, columns=numerical_columns)

# Show after
print("\nScaled min/max (should be ~0 and ~1):")
print(mm_scaled_df.agg(['min','max']).round(4).T)

# Optional: invert back to original scale (sanity check)
mm_inverted = scaler.inverse_transform(mm_scaled_df.head())
print("\nInverse-transformed (first 5 rows) matches original scale:")
print(pd.DataFrame(mm_inverted, columns=numerical_columns).round(3).head())
```

```
Original min/max (per column):
             min       max
temp        0.82   41.0000
atemp       0.76   45.4550
humidity    0.00  100.0000
windspeed   0.00   56.9969
casual      0.00  367.0000
registered  0.00  886.0000

Scaled min/max (should be ~0 and ~1):
            min  max
temp        0.0  1.0
atemp       0.0  1.0
humidity    0.0  1.0
windspeed   0.0  1.0
casual      0.0  1.0
registered  0.0  1.0

Inverse-transformed (first 5 rows) matches original scale:
   temp   atemp  humidity  windspeed  casual  registered
0  9.84  14.395      81.0        0.0     3.0        13.0
1  9.02  13.635      80.0        0.0     8.0        32.0
2  9.02  13.635      80.0        0.0     5.0        27.0
3  9.84  14.395      75.0        0.0     3.0        10.0
4  9.84  14.395      75.0        0.0     0.0         1.0
```


**When to Use:**

* Ordinal time features such as hour of day, when they are not already cyclically encoded
* Variables with natural bounds (e.g., percentages)
* Algorithms that work best with inputs in a fixed range (e.g., many neural networks)
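
The 0–1 bound only holds for values inside the range seen during fitting. The sketch below shows what happens when a later reading falls outside it, and how scikit-learn's `clip=True` option caps such values. The "future" temperatures are hypothetical, and the training values are a three-point stand-in for the real column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in training column spanning the dataset's temperature range (0.82 to 41.0)
train_temp = np.array([[0.82], [20.5], [41.0]])
# Hypothetical later readings that fall outside the training range
new_temp = np.array([[-5.0], [45.0]])

plain = MinMaxScaler().fit(train_temp)
clipped = MinMaxScaler(clip=True).fit(train_temp)

plain_new = plain.transform(new_temp)      # can leave the [0, 1] interval
clipped_new = clipped.transform(new_temp)  # capped at the interval's ends

print("Without clip:", plain_new.ravel().round(3))
print("With clip:   ", clipped_new.ravel().round(3))
```

Whether clipping is appropriate is a judgment call: it keeps inputs bounded, but it also hides the fact that conditions were more extreme than anything in the training data.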

**3. RobustScaler: Outlier-Resistant Scaling**

**Definition:** RobustScaler uses medians and interquartile ranges, making it less sensitive to extreme values. This is critical for transportation datasets where spikes in demand are real and meaningful.

**Formula:**
$$scaled\_value = \frac{original\_value - median}{IQR}$$

**Python Implementation:**

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # uses median and IQR by default
numerical_columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered']

# Show original robust stats
orig_stats = df[numerical_columns].quantile([0.25, 0.5, 0.75]).T
orig_stats.columns = ['Q1','Median','Q3']
orig_stats['IQR'] = orig_stats['Q3'] - orig_stats['Q1']
print("Original robust statistics:")
print(orig_stats.round(3))

# Fit & transform
rb_scaled = scaler.fit_transform(df[numerical_columns])
rb_scaled_df = pd.DataFrame(rb_scaled, columns=numerical_columns)

# Check median ≈ 0 and IQR ≈ 1 after scaling
rb_qs = rb_scaled_df.quantile([0.25, 0.5, 0.75]).T
rb_qs.columns = ['Q1','Median','Q3']
rb_qs['IQR'] = rb_qs['Q3'] - rb_qs['Q1']

print("\nScaled robust statistics (Median ~ 0, IQR ~ 1):")
print(rb_qs.round(3))
```

```
Original robust statistics:
                Q1   Median       Q3      IQR
temp        13.940   20.500   26.240   12.300
atemp       16.665   24.240   31.060   14.395
humidity    47.000   62.000   77.000   30.000
windspeed    7.002   12.998   16.998    9.996
casual       4.000   17.000   49.000   45.000
registered  36.000  118.000  222.000  186.000

Scaled robust statistics (Median ~ 0, IQR ~ 1):
               Q1  Median     Q3  IQR
temp       -0.533     0.0  0.467  1.0
atemp      -0.526     0.0  0.474  1.0
humidity   -0.500     0.0  0.500  1.0
windspeed  -0.600     0.0  0.400  1.0
casual     -0.289     0.0  0.711  1.0
registered -0.441     0.0  0.559  1.0
```


**When to Use:**

* Bike count variables with spikes during events
* Skewed distributions or heavy-tailed data
* Data with legitimate, business-relevant outliers
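
To see why the choice of scaler matters when spikes are present, the sketch below scales a small, made-up series of hourly rentals containing one event-driven spike. The numbers are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical hourly rentals: seven typical hours plus one event-driven spike
rentals = np.array([40, 45, 50, 55, 60, 65, 70, 900], dtype=float).reshape(-1, 1)

standard = StandardScaler().fit_transform(rentals).ravel()
robust = RobustScaler().fit_transform(rentals).ravel()

# How much of the scaled range is left for the seven typical hours?
standard_spread = standard[:-1].max() - standard[:-1].min()
robust_spread = robust[:-1].max() - robust[:-1].min()

print("Typical-hour spread (StandardScaler):", round(standard_spread, 3))
print("Typical-hour spread (RobustScaler):  ", round(robust_spread, 3))
print("Spike after scaling (standard, robust):", round(standard[-1], 2), round(robust[-1], 2))
```

With StandardScaler, the single spike inflates the standard deviation and squeezes the typical hours into a narrow band. RobustScaler keeps those hours well separated and leaves the spike as a clearly extreme value, which fits the case where the spike is real and worth preserving.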

## 5. Domain-Specific Feature Engineering for Urban Mobility

In this section, we’ll move from generic feature engineering to **transportation-specific expertise**. As a consultant, your value comes from creating features that capture how people actually use bike-sharing systems and how operators manage them. By the end, you’ll be able to design features that combine **statistical power with deep domain knowledge**, ensuring your models reflect both human behavior and operational realities.

### 5.1. Understanding Transportation Business Logic Features

Before we dive into code, let’s clarify what makes transportation features different from generic machine learning features.

**Definition**: Business logic features are variables designed from **domain expertise**, encoding operational rules, system constraints, and user motivations. They go beyond transformations of raw data and capture how the transportation system really works.

**Why They Matter**:
Transportation demand is shaped by more than just numbers—it reflects:

* Human choices (commuting, recreation, errands)
* Environmental comfort thresholds (temperature, wind, humidity)
* System limits (bike availability, station capacity)
* Business rules (rebalancing, working days, holidays)

**Consultant Insight**: When you create these features, you’re embedding years of transportation consulting experience into your model—knowledge your client would otherwise have to discover the hard way.

### 5.2. Essential Transportation Business Logic Features

This section covers **three essential categories of domain-specific features**:

1. **Weather comfort indicators** that combine multiple environmental factors
2. **Trip purpose indicators** that estimate why users are riding
3. **Operational capacity features** that capture system constraints

Let’s explore each in turn.

**1. Weather Comfort Index: Capturing Cycling Conditions**

**Definition**: A weather comfort index summarizes how favorable current weather is for cycling by combining temperature, humidity, and wind into a single score.

**Why It Matters**:
Bike-sharing users don’t think in terms of raw weather metrics—they ask, *“Does it feel like a good day to ride?”* By creating this index, your model learns to approximate human decision-making.

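```python
import numpy as np

def create_weather_comfort_index(df):
    """
    Create a weather comfort index specifically for bike-sharing demand prediction.
    Higher values = more comfortable cycling conditions.
    """
    # Temperature: best at 17.5 °C, losing 1/30 of a point per degree away from it
    temp_comfort = 1 - abs(df['temp'] - 17.5) / 30
    # Humidity: drier air scores higher (0% -> 1.0, 100% -> 0.0)
    humidity_comfort = (100 - df['humidity']) / 100
    # Wind: calm air scores 1.0, falling to 0 at a windspeed of 15 and above
    wind_comfort = np.maximum(0, (15 - df['windspeed']) / 15)

    # Heuristic weights (they sum to 1): temperature counts most, wind least
    weather_comfort = (
        temp_comfort * 0.5 +
        humidity_comfort * 0.3 +
        wind_comfort * 0.2
    )

    return weather_comfort

df['weather_comfort'] = create_weather_comfort_index(df)
```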
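
A quick way to sanity-check an index like this is to score a few hand-picked conditions and confirm the ranking matches intuition. The sketch below repeats the formula so it runs on its own; the three scenarios are hypothetical.

```python
import numpy as np
import pandas as pd

def create_weather_comfort_index(df):
    """Same formula as above, repeated so this block is self-contained."""
    temp_comfort = 1 - abs(df['temp'] - 17.5) / 30
    humidity_comfort = (100 - df['humidity']) / 100
    wind_comfort = np.maximum(0, (15 - df['windspeed']) / 15)
    return temp_comfort * 0.5 + humidity_comfort * 0.3 + wind_comfort * 0.2

# Hypothetical conditions chosen by hand for illustration
scenarios = pd.DataFrame(
    {
        'temp': [17.5, 35.0, 2.0],
        'humidity': [50, 90, 60],
        'windspeed': [0.0, 20.0, 10.0],
    },
    index=['mild and calm', 'hot, humid, windy', 'cold and breezy'],
)
scenarios['weather_comfort'] = create_weather_comfort_index(scenarios)
print(scenarios.round(3))
```

The mild, calm hour scores 0.85, while the hot, humid, windy hour drops to about 0.24. Note that even ideal temperature and wind cannot reach 1.0 unless humidity is 0%, which is a property of the chosen weights rather than a claim about real riders.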

**2. Trip Purpose Indicators: Understanding User Motivations**

**Definition**: Trip purpose indicators are heuristic scores (not calibrated probabilities) that estimate how likely a ride is to be for commuting, recreation, or errands, based on **time of day, day type, and weather**.

**Why It Matters**:
Not all rides are equal. A morning commuter reacts differently to rain than a weekend tourist. Encoding trip purpose helps your model capture *why* people ride, not just *when*.

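```python
def create_trip_purpose_indicators(df):
    """Add heuristic trip-purpose scores. Requires 'hour', 'workingday', and 'weather_comfort'."""
    # Commute score: full weight in the 7-9 and 17-19 peaks, zeroed on non-working days
    commute_time_score = np.where(
        ((df['hour'] >= 7) & (df['hour'] <= 9)) | ((df['hour'] >= 17) & (df['hour'] <= 19)),
        1.0, 0.3
    )
    df['commute_probability'] = commute_time_score * df['workingday']

    # Recreation score: non-working days, evenings, and mornings get full weight;
    # squaring weather_comfort makes the score fall quickly in poor weather
    recreation_time_score = np.where(
        (df['workingday'] == 0) | (df['hour'] >= 18) | (df['hour'] <= 10),
        1.0, 0.2
    )
    df['recreation_probability'] = recreation_time_score * (df['weather_comfort'] ** 2)

    # Errand score: midday hours get full weight;
    # the square root softens the weather effect
    errand_time_score = np.where(
        (df['hour'] >= 10) & (df['hour'] <= 16),
        1.0, 0.4
    )
    df['errand_probability'] = errand_time_score * np.sqrt(df['weather_comfort'])

    return df

df = create_trip_purpose_indicators(df)
```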
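
To check the commute rule in isolation, the sketch below applies it to four hand-made rows. It uses `Series.between`, which is inclusive on both ends by default, as a more compact way to write the same hour windows. The `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: three working-day hours and one weekend hour
sample = pd.DataFrame({
    'hour':       [8, 13, 18, 8],
    'workingday': [1, 1, 1, 0],
})

# between() includes both endpoints, matching the >= and <= comparisons above
is_peak = sample['hour'].between(7, 9) | sample['hour'].between(17, 19)
commute_time_score = np.where(is_peak, 1.0, 0.3)
sample['commute_probability'] = commute_time_score * sample['workingday']

print(sample)
```

The weekend 8 AM row scores 0 even though it falls in a peak window, because multiplying by `workingday` switches the commute signal off entirely on non-working days.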

**3. Operational Features: Capturing System Constraints**

**Definition**: Operational features quantify how **system availability and management decisions** affect observed demand.

**Why It Matters**:
A spike in demand may never appear in the data if the system was already out of bikes. These features ensure the model doesn’t confuse **supply limits with lack of interest**.

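```python
def create_operational_features(df):
    # Caution: current-hour rentals are part of the demand target; for forecasting,
    # build these features from lagged counts instead, e.g. total.shift(1)
    total = df['casual'] + df['registered']
    # Assumes chronological hourly rows; min_periods=1 avoids NaN in the first 23 rows
    rolling_peak = total.rolling(24, min_periods=1).max()
    df['system_utilization'] = (total / rolling_peak.replace(0, np.nan)).fillna(0)
    df['system_stress'] = df['commute_probability'] * df['system_utilization']
    df['weekend_pressure'] = (1 - df['workingday']) * df['recreation_probability'] * df['weather_comfort']
    df['rebalancing_pressure'] = abs(df['casual'] - df['registered']) / (total + 1)
    return df

df = create_operational_features(df)  # run after create_trip_purpose_indicators(df)
```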
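
Because utilization is computed from rental counts, using it to predict those same counts would leak the answer into the features. One way around this, sketched here under the assumption that rows are hourly and in chronological order, is to shift the series by one hour so that each feature only sees the past. The `total` values and the short window are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly rental totals in chronological order
total = pd.Series([50, 20, 80, 40, 10, 60], dtype=float)

# shift(1) makes every window end at the previous hour, so the current
# hour's demand never contributes to its own feature
previous = total.shift(1)
window = 3  # short window for the demo; the function above uses 24
trailing_peak = previous.rolling(window, min_periods=1).max()

# A zero or missing peak yields 0 rather than NaN or infinity
lagged_utilization = (previous / trailing_peak.replace(0, np.nan)).fillna(0)

print(pd.DataFrame({'total': total, 'lagged_utilization': lagged_utilization}))
```

The first row has no history and is set to 0 here; dropping such rows is an equally reasonable choice.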

### 5.3. Advanced Interaction Features

So far, we’ve looked at features individually. But transportation demand is rarely shaped by single factors alone—it’s the **interactions** that matter. In this section, you’ll create:

1. **Weather-time interactions** to capture context-dependent effects
2. **User-type interactions** to reflect differences between casual and registered users

**1. Weather-Time Interactions**

**Concept**: The same weather condition has different consequences depending on **when** it happens. Rain at 8 AM is a crisis for commuters; rain at 8 PM mainly reduces leisure trips.

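```python
def create_weather_time_interactions(df):
    # Requires weather_comfort and the trip-purpose scores created earlier.
    # Temperature relative to the seasonal mean: the same reading can be
    # warm for one season and cool for another
    df['temp_seasonal_deviation'] = df['temp'] - df.groupby('season')['temp'].transform('mean')
    # Weather codes 3 and 4 correspond to rain or snow in this dataset
    rain_indicator = np.where(df['weather'] >= 3, 1.0, 0.0)

    # Heuristic multipliers: rain is assumed to hit recreation hardest and errands least
    df['rain_commute_impact'] = rain_indicator * df['commute_probability'] * 2.0
    df['rain_recreation_impact'] = rain_indicator * df['recreation_probability'] * 3.0
    df['rain_errand_impact'] = rain_indicator * df['errand_probability'] * 1.5
    # Good weather is assumed to matter more on days off
    df['weekend_weather_premium'] = (1 - df['workingday']) * df['weather_comfort'] ** 2
    df['holiday_weather_boost'] = df['holiday'] * df['weather_comfort'] * 1.5

    return df

df = create_weather_time_interactions(df)
```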
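
The seasonal deviation relies on `groupby(...).transform('mean')`, which returns one value per row rather than one per group. The sketch below shows this on four hypothetical rows.

```python
import pandas as pd

# Hypothetical rows from a cold season (1) and a hot season (3)
demo = pd.DataFrame({
    'season': [1, 1, 3, 3],
    'temp':   [5.0, 9.0, 28.0, 32.0],
})

# transform('mean') broadcasts each season's mean back onto its rows,
# so the result lines up with the original index
season_mean = demo.groupby('season')['temp'].transform('mean')
demo['temp_seasonal_deviation'] = demo['temp'] - season_mean

print(demo)
```

Here 9 °C in the cold season and 32 °C in the hot season both come out as +2, so the feature captures "warmer than usual for this time of year" instead of absolute temperature.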

**2. User Type Interactions**

**Concept**: Casual and registered users react differently to the same conditions. Encoding this difference helps your model capture demand composition.

* Casual riders: more sensitive to weather, more recreational
* Registered riders: more predictable, commute-driven

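```python
def create_user_interaction_features(df):
    # Requires weather_comfort and the trip-purpose scores created earlier.
    # Caution: casual and registered are components of the demand target, so
    # use lagged counts here when these features feed a forecasting model.
    total_users = df['casual'] + df['registered'] + 1  # +1 avoids division by zero
    df['casual_ratio'] = df['casual'] / total_users
    df['registered_ratio'] = df['registered'] / total_users

    # Squaring the discomfort term gives casual riders a non-linear weather response
    df['weather_sensitivity_casual'] = df['casual_ratio'] * (1 - df['weather_comfort']) ** 2
    df['weather_sensitivity_registered'] = df['registered_ratio'] * (1 - df['weather_comfort'])
    df['time_sensitivity_registered'] = df['registered_ratio'] * df['commute_probability']
    df['time_sensitivity_casual'] = df['casual_ratio'] * df['recreation_probability']
    # Heuristic multipliers: holidays are assumed to raise casual and lower registered use
    df['holiday_casual_boost'] = df['holiday'] * df['casual_ratio'] * 2.0
    df['holiday_registered_reduction'] = df['holiday'] * df['registered_ratio'] * 0.5

    return df

df = create_user_interaction_features(df)
```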

### 5.4. Wrapping Up

By now, you’ve seen how **domain-specific feature engineering** turns raw bike-sharing data into rich behavioral insights. Your models can now distinguish between *what* happened and *why* it happened—making predictions more reliable and business recommendations more actionable.

---

## Summary and Transition to Exploratory Data Analysis

You've mastered advanced preprocessing and feature engineering techniques: cyclical time encoding, lag features, categorical encoding strategies, scaling methods, and domain-specific features like weather comfort indices and trip purpose indicators. These skills transform clean transportation data into machine learning-ready inputs that capture temporal, behavioral, and operational patterns.

Your ability to create sophisticated features from raw data - extracting temporal intelligence, encoding business logic, and building interaction effects - prepares you to work with complex transportation prediction challenges while maintaining the data quality essential for accurate models.

In the next module, you'll learn how to explore and visualize these engineered features to generate business insights and validate that your preprocessing pipeline creates data that reflects real-world transportation patterns.