# Lecture 5: Advanced Preprocessing & Feature Engineering - Optimizing Data for Machine Learning

## Learning Objectives

By the end of this lecture, you will be able to:

- Apply categorical encoding techniques for machine learning compatibility
- Implement data scaling and normalization strategies for optimal model performance
- Design and implement time-based features for transportation demand prediction

---

## 1. From Clean Data to Machine Learning Ready Features

### 1.1. The Bridge Between Data and Models

You now have **clean, reliable data** from your bike-sharing client - but **clean data isn't the same as machine learning ready data**. Raw data variables often need **transformation, combination, and optimization** before machine learning algorithms can use them effectively.

In the previous lecture, you learned to identify and handle data quality issues - missing values, outliers, and inconsistencies. That cleaning work ensures you're starting with reliable, trustworthy data. Now, we move to the next essential step: **transforming that clean data into optimized features for machine learning**.

Think of this stage like preparing ingredients for a sophisticated recipe. Having fresh, quality ingredients (clean data) is essential, but you still need to chop, season, and combine them in specific ways to create the final dish (predictive model). This preparation involves two complementary activities:

**Data preprocessing** adjusts existing variables to formats and scales that algorithms can process efficiently - ensuring numerical features exist on compatible scales so algorithm mechanics don't introduce artificial biases. **Feature engineering** creates new variables that capture patterns and relationships hidden in raw data - transforming simple measurements into sophisticated representations that expose the underlying structure driving predictions.

The boundaries between **data cleaning**, **preprocessing**, and **feature engineering** are not always clear-cut. In practice, these steps often overlap, and the exact labels matter less than the outcome: **data that is trustworthy, ready for algorithms, and enriched with predictive signal**.

In this lecture, we will focus on three essential transformation techniques:

1. **Categorical encoding**, which converts qualitative categories into numerical representations,
2. **Numerical scaling and normalization**, which adjust numerical values either to standardized distributions or to a fixed range,
3. **Temporal feature creation**, which extracts meaningful time-based patterns such as seasonality, trends, or hour-of-day effects.

These methods form the backbone of most machine learning workflows, especially in time-sensitive transportation contexts like bike-sharing demand. While many other transformation strategies exist in the broader ML toolkit, mastering these three gives you a strong foundation for preparing real-world datasets.

### 1.2. Understanding Data Preprocessing: Making Data Algorithm-Ready

Machine learning algorithms operate on numerical matrices through mathematical operations like multiplication, addition, and distance calculations. However, transportation data arrives in formats that algorithms cannot process directly - text labels like "Rainy" or "Clear," and numerical values spanning wildly different scales. Data preprocessing transforms this raw data into algorithm-compatible formats while preserving the information needed for accurate predictions.

Transportation datasets present specific preprocessing challenges that require careful handling:

- **Mixed Data Types**: Transportation systems generate both categorical variables (weather conditions, day types, station locations) and continuous numerical measurements (temperature, humidity, bike counts). Each type requires different preprocessing strategies to become machine learning compatible.
- **Categorical Relationships**: Not all categories are equal. Weather severity has natural ordering (Clear < Misty < Light Rain < Heavy Rain), while seasons are cyclical labels without inherent hierarchy. Choosing the wrong encoding strategy can introduce false relationships or miss important orderings.
- **Scale Disparities**: Raw features exist on incompatible numerical scales - temperature ranges from -10°C to 40°C, humidity spans 0-100%, while hourly bike rentals can vary from 1 to 1000+. Without normalization, algorithms incorrectly prioritize features with larger numerical ranges over potentially more predictive features with smaller scales.

Professional preprocessing ensures that algorithm limitations don't distort the transportation patterns you're trying to predict.
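The mixed-type challenge above can be sketched with scikit-learn's `ColumnTransformer`, which routes categorical and numerical columns to different transformers. This is a minimal illustration on a small made-up sample, not the lecture dataset; the column names and values are assumptions chosen to resemble it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small made-up sample; column names and values are assumptions for illustration
sample = pd.DataFrame({
    "weather_condition": ["Clear", "Misty", "Light Rain", "Clear"],
    "temp": [9.8, 14.2, 21.5, 30.1],
    "humidity": [81, 75, 60, 40],
})

categorical_cols = ["weather_condition"]
numerical_cols = ["temp", "humidity"]

preprocessor = ColumnTransformer(
    transformers=[
        # Text categories become 0/1 columns; unseen categories are ignored at transform time
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Numerical columns are standardized to mean 0 and standard deviation 1
        ("num", StandardScaler(), numerical_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocessor.fit_transform(sample)
print(features.shape)  # 3 one-hot columns + 2 scaled columns
```

Keeping both steps in one object means the same transformations can later be applied to new data with `preprocessor.transform(...)`.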

### 1.3. Understanding Time-Based Feature Engineering in Transportation

Once data is preprocessed into algorithm-compatible formats, feature engineering creates new variables that capture hidden patterns and relationships in the raw data. While preprocessing ensures algorithms can process your data, feature engineering determines what insights they can extract from it. For transportation demand prediction, time is one of the most critical dimensions requiring sophisticated feature engineering.

Transportation demand exhibits temporal complexities that raw timestamps cannot capture directly:

- **Cyclical Time Patterns**: Transportation operates on repeating cycles - hourly rush patterns, daily commute rhythms, weekly work schedules, and seasonal vacation periods. Raw timestamps treat these cycles as linear sequences, failing to recognize that 11 PM and midnight are adjacent hours, or that December and January are consecutive months. Machine learning algorithms need explicit encoding to understand temporal continuity.
- **Multi-Scale Temporal Dependencies**: Demand at any moment depends on patterns at multiple time scales simultaneously. Current hour demand relates to yesterday's same hour (daily cycle), last week's same hour (weekly cycle), and recent hours (momentum). Simple timestamp features cannot represent these layered temporal relationships that drive transportation behavior.
- **Temporal Context and Transitions**: The meaning of any time point depends on its position within larger temporal structures. Friday evening demand differs fundamentally from Monday evening demand despite identical clock times. Similarly, demand evolves systematically as the work week progresses, requiring features that capture position within weekly and seasonal cycles.
- **Sequential Momentum Effects**: Transportation demand exhibits inertia - high demand periods tend to persist, and transitions between demand states follow predictable patterns. Raw timestamps provide no information about recent demand history or emerging trends that influence near-future predictions.

Professional time-based feature engineering transforms simple timestamps into sophisticated temporal features that capture these cyclical, multi-scale, contextual, and sequential patterns essential for accurate transportation demand forecasting.

## 2. Categorical Encoding Strategies

### 2.1. Understanding Categorical Variables in Transportation

We will start by exploring why categorical variables matter in transportation datasets and how to transform them into useful numerical representations.

Most machine learning algorithms rely on numerical operations such as addition, multiplication, and comparison. But transportation data often contains text-based categories like *“Rainy”* or *“Clear.”* Algorithms cannot process these directly — we need to **convert them into numbers** while preserving their meaning for prediction.

The challenge lies in doing this transformation in a way that **respects the type of category** and the relationships it carries. Not all categorical variables behave the same way: some have natural orderings, while others are just labels.

**Examples of Categorical Variables in Bike-Sharing:**

* **Weather Conditions**: Clear, misty, light rain, heavy rain
* **Day Types**: Weekday, weekend, holiday
* **Seasons**: Spring, summer, fall, winter
* **Time Periods**: Rush hour, off-peak, late night
* **Events**: Normal, special event, maintenance period

Each requires a different encoding strategy depending on whether categories are *unordered labels*, *ranked scales*, or *business-specific conditions*.

Here we introduce three widely used approaches to categorical encoding: one-hot encoding, ordinal encoding, and binary encoding. Each method fits different types of variables.

### 2.2. One-Hot Encoding for Nominal Categories

**Definition**: One-hot encoding creates binary columns for each category, with values 1 (present) or 0 (absent).

**When to use**: For categories with **no inherent order** where all options should be treated equally.

**Python Example:**

```python
import pandas as pd

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

# Ensure 'weather_condition' exists; derive it from numeric 'weather' if needed
if 'weather_condition' not in df.columns:
    if 'weather' in df.columns:
        _weather_map = {1: 'Clear', 2: 'Misty', 3: 'Light Rain', 4: 'Heavy Rain'}
        df['weather_condition'] = df['weather'].map(_weather_map).astype('category')
    else:
        raise KeyError("Neither 'weather_condition' nor 'weather' columns are present in df.")

print("Original weather column (as categories):")
print(df['weather_condition'].value_counts())

# One-hot encode weather conditions
weather_encoded = pd.get_dummies(df['weather_condition'], prefix='weather')
print("\nOne-hot encoded columns:")
print(weather_encoded.head())
print(f"\nOriginal 1 column → {len(weather_encoded.columns)} binary columns")
```

```text
Original weather column (as categories):
weather_condition
Clear         7192
Misty         2834
Light Rain     859
Heavy Rain       1
Name: count, dtype: int64

One-hot encoded columns:
   weather_Clear  weather_Heavy Rain  weather_Light Rain  weather_Misty
0           True               False               False          False
1           True               False               False          False
2           True               False               False          False
3           True               False               False          False
4           True               False               False          False

Original 1 column → 4 binary columns
```


This transformation creates separate binary columns for each weather condition. Where we previously had one column with text values ('Clear', 'Misty', 'Light Rain', 'Heavy Rain'), we now have four binary columns (weather_Clear, weather_Misty, weather_Light Rain, weather_Heavy Rain) that machine learning algorithms can process directly.

Each row has exactly one '1' (`True`) and the rest '0s' (`False`), preserving the original information in a numerical format. For bike-sharing demand prediction, this allows the model to learn different demand patterns for each weather type - for example, Clear days might show high recreational usage while Light Rain days might see reduced casual ridership but maintained commuter patterns.

The key advantage: "Clear" and "Misty" are treated as equally valid categories without implying any ordering or hierarchy between them.
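One practical detail when applying one-hot encoding is that new data may not contain every category seen during training. The sketch below, on made-up values, aligns the columns of new data to the training columns with `reindex`. It also shows the optional `drop_first=True` variant, which removes one redundant column and is sometimes preferred for linear models. The frames `train` and `new` are hypothetical.

```python
import pandas as pd

# Hypothetical training data and new data (no 'Light Rain' in the new data)
train = pd.DataFrame({"weather_condition": ["Clear", "Misty", "Light Rain", "Clear"]})
new = pd.DataFrame({"weather_condition": ["Misty", "Clear"]})

train_encoded = pd.get_dummies(train["weather_condition"], prefix="weather", dtype=int)
new_encoded = pd.get_dummies(new["weather_condition"], prefix="weather", dtype=int)

# Align new data to the training columns; categories absent from
# the new data become all-zero columns
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

# Optional: drop the first category, which is implied when all other columns are 0
train_reduced = pd.get_dummies(
    train["weather_condition"], prefix="weather", dtype=int, drop_first=True
)

print(new_encoded)
print(train_reduced.columns.tolist())
```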

### 2.3. Ordinal Encoding for Ordered Categories

**Definition**: Ordinal encoding maps categories to numbers that reflect their natural order.

**When to use**: For categories where **order matters**, such as severity or ranking.

**Python Example:**

```python
import pandas as pd

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

# Ensure 'weather_condition' exists; derive it from numeric 'weather' if needed
if 'weather_condition' not in df.columns:
    if 'weather' in df.columns:
        _weather_map = {1: 'Clear', 2: 'Misty', 3: 'Light Rain', 4: 'Heavy Rain'}
        df['weather_condition'] = df['weather'].map(_weather_map).astype('category')
    else:
        raise KeyError("Neither 'weather_condition' nor 'weather' columns are present in df.")

# Encoding weather severity by order
weather_severity_map = {'Clear': 1, 'Misty': 2, 'Light Rain': 3, 'Heavy Rain': 4}
df['weather_severity'] = df['weather_condition'].map(weather_severity_map)
print("Weather severity column (as ordered numbers):")
print(df['weather_severity'].value_counts())
```

```text
Weather severity column (as ordered numbers):
weather_severity
1    7192
2    2834
3     859
4       1
Name: count, dtype: int64
```


This transformation converts text weather conditions into ordered numerical values that preserve their severity relationship. Where we previously had one column with text values ('Clear', 'Misty', 'Light Rain', 'Heavy Rain'), we now have a single numerical column with ordered values (1, 2, 3, 4) that machine learning algorithms can process directly while understanding the inherent ordering.

Each weather condition maps to a specific number representing its severity level: Clear conditions (most favorable, severity 1) appear 7,192 times in the dataset, Misty conditions (severity 2) appear 2,834 times, Light Rain (severity 3) occurs 859 times, and Heavy Rain (most severe, severity 4) is rare with only 1 occurrence. For bike-sharing demand prediction, this lets the model represent that deteriorating weather conditions progressively reduce ridership. Note that the encoding also assumes every step (e.g., from Clear to Misty, or from Light Rain to Heavy Rain) is equally large, which is a modeling simplification rather than a property of the weather itself.

The key advantage: the model learns that Heavy Rain > Light Rain > Misty > Clear in terms of severity, capturing the natural ordering that affects transportation behavior, unlike one-hot encoding which would treat these conditions as unrelated categories.
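An alternative to a hand-written mapping dictionary is a pandas ordered categorical, which stores the order explicitly and allows comparisons between categories. The following is a small sketch on made-up values; the names `severity_order`, `severity`, and `is_rainy` are illustrative and not part of the lecture dataset.

```python
import pandas as pd

# Explicit severity order, from most to least favorable
severity_order = ["Clear", "Misty", "Light Rain", "Heavy Rain"]

# Made-up observations
weather = pd.Series(["Misty", "Clear", "Heavy Rain", "Light Rain", "Clear"])

ordered = pd.Categorical(weather, categories=severity_order, ordered=True)

# .codes start at 0, so add 1 to match the 1-4 scale used above.
# Values not in severity_order would get code -1 (0 after the shift).
severity = pd.Series(ordered.codes + 1, index=weather.index)

# Ordered categoricals also support comparisons against a category
is_rainy = pd.Series(ordered) >= "Light Rain"

print(severity.tolist())
print(is_rainy.tolist())
```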

### 2.4. Binary Encoding for Business-Specific Conditions

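**Definition**: Binary encoding, as used here, creates simple 0/1 indicator features (flags) for key conditions. It should not be confused with the category-compression technique of the same name, which represents category indices as binary digits.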

**When to use**: When certain conditions are **especially important for the business**.

Examples in bike-sharing:

* `is_holiday`: 1 if holiday, else 0
* `is_weekend`: 1 if Saturday/Sunday, else 0
* `is_rush_hour`: 1 if within rush hour, else 0
* `is_good_weather`: 1 if clear or mild conditions, else 0

**Python Example:**

```python
import pandas as pd
import numpy as np

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

# Binary indicators for business-relevant conditions
df['is_holiday'] = (df['holiday'] == 1).astype(int)

# Weekend: Saturday=5, Sunday=6
df['is_weekend'] = df['datetime'].dt.dayofweek.isin([5, 6]).astype(int)

# Rush hour windows (commuting peaks)
rush_hours = [7, 8, 9, 17, 18, 19]
df['is_rush_hour'] = df['datetime'].dt.hour.isin(rush_hours).astype(int)

# Good weather: Clear/Misty considered favorable; support either numeric or text weather columns
if 'weather' in df.columns:
    df['is_good_weather'] = df['weather'].isin([1, 2]).astype(int)
elif 'weather_condition' in df.columns:
    df['is_good_weather'] = df['weather_condition'].isin(['Clear', 'Misty']).astype(int)
else:
    df['is_good_weather'] = 0  # fallback if weather info is unavailable

print("\nSample binary business-condition features (first 10 rows):")
print(df[['datetime', 'is_holiday', 'is_weekend', 'is_rush_hour', 'is_good_weather']].head(10).to_string(index=False))
```

```text
Sample binary business-condition features (first 10 rows):
           datetime  is_holiday  is_weekend  is_rush_hour  is_good_weather
2011-01-01 00:00:00           0           1             0                1
2011-01-01 01:00:00           0           1             0                1
2011-01-01 02:00:00           0           1             0                1
2011-01-01 03:00:00           0           1             0                1
2011-01-01 04:00:00           0           1             0                1
2011-01-01 05:00:00           0           1             0                1
2011-01-01 06:00:00           0           1             0                1
2011-01-01 07:00:00           0           1             1                1
2011-01-01 08:00:00           0           1             1                1
2011-01-01 09:00:00           0           1             1                1
```


This transformation creates focused binary indicators that flag business-critical conditions. Where we previously needed to check multiple columns and apply business logic (e.g., "Is this Saturday or Sunday?" or "Does the hour fall in 7-9 or 17-19?"), we now have simple 0/1 flags (is_weekend, is_rush_hour, is_good_weather, is_holiday) that machine learning algorithms can process directly.

Examining January 1, 2011 (the dataset start): this Saturday is correctly flagged as weekend (is_weekend=1) throughout the day, is not marked as a holiday in the `holiday` column (is_holiday=0), and shows good weather conditions (is_good_weather=1). The rush hour indicator switches on at 7 AM, 8 AM, and 9 AM (is_rush_hour=1) even on this weekend day, because the flag depends only on the clock hour. The model can still separate weekday commuting peaks from weekend mornings by combining is_rush_hour with is_weekend. For bike-sharing demand prediction, this allows the model to learn distinct patterns for each condition combination - for example, weekend mornings with good weather may drive recreational usage, while weekday rush hours with poor weather may concentrate demand around transit stations.

The key advantage: binary flags make complex business rules immediately accessible to the model without requiring it to rediscover these domain-specific condition definitions, allowing faster learning of operationally relevant patterns.
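Because `is_rush_hour` is defined by clock hour alone, it can be useful to give the model an explicit combined flag. The sketch below builds a hypothetical `is_weekday_rush_hour` feature on a few hand-picked timestamps. The feature name and the rush-hour window are assumptions carried over from the example above, not a fixed business definition.

```python
import pandas as pd

# A few hand-picked timestamps (2011-01-01 is a Saturday, 2011-01-03 a Monday)
timestamps = pd.to_datetime([
    "2011-01-01 08:00",  # Saturday morning
    "2011-01-03 08:00",  # Monday morning
    "2011-01-03 13:00",  # Monday midday
    "2011-01-03 18:00",  # Monday evening
])
df_demo = pd.DataFrame({"datetime": timestamps})

df_demo["is_weekend"] = df_demo["datetime"].dt.dayofweek.isin([5, 6]).astype(int)
df_demo["is_rush_hour"] = df_demo["datetime"].dt.hour.isin([7, 8, 9, 17, 18, 19]).astype(int)

# Combined flag: rush-hour clock time AND not a weekend
df_demo["is_weekday_rush_hour"] = (
    (df_demo["is_rush_hour"] == 1) & (df_demo["is_weekend"] == 0)
).astype(int)

print(df_demo.to_string(index=False))
```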

## 3. Scaling and Normalization for Optimal Performance

You may have sophisticated engineered features from your bike-sharing data - but having great features isn't enough if they can't work together effectively. Raw feature values often exist on completely different scales, creating a hidden problem that can sabotage your machine learning models.

Consider the magnitude differences in your Washington D.C. bike-sharing dataset:

- **Temperature**: roughly 1 to 41°C (range of about 40 units)
- **Humidity**: 0 to 100% (range of 100 units)
- **Bike demand**: 1 to 1000+ rentals per hour (range of 1000+ units)
- **Hour of day**: 0 to 23 (24 distinct values)

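Without proper scaling, many machine learning algorithms, especially distance-based methods (such as k-nearest neighbors) and models trained with gradient descent or regularization, effectively give more weight to features with larger numerical ranges. In this case, bike count values in the hundreds will overshadow temperature changes in the tens - even when temperature might be more predictive of future demand. Tree-based models, by contrast, are largely insensitive to feature scale.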
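The effect is easiest to see in a distance calculation. The sketch below compares three made-up observations, each with a temperature and a rental count, before and after dividing every feature by an assumed range (40 for temperature, 1000 for rentals). All numbers are illustrative and not taken from the dataset.

```python
import numpy as np

# Made-up observations: [temperature in °C, rentals in the previous hour]
a = np.array([10.0, 300.0])
b = np.array([35.0, 320.0])   # very different temperature, similar rentals
c = np.array([11.0, 420.0])   # similar temperature, different rentals

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Raw scale: the rental count dominates the distance
d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Rescaled by assumed feature ranges so both features span roughly 0-1
ranges = np.array([40.0, 1000.0])
d_ab_s = euclidean(a / ranges, b / ranges)
d_ac_s = euclidean(a / ranges, c / ranges)

print(f"raw:    d(a,b)={d_ab:.1f}, d(a,c)={d_ac:.1f}")
print(f"scaled: d(a,b)={d_ab_s:.3f}, d(a,c)={d_ac_s:.3f}")
```

On the raw scale, `a` looks closer to `b` despite a 25-degree temperature gap. After rescaling, the ordering flips and `c` becomes the nearer neighbor.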

### 3.1. Scaling and Normalization Fundamentals

Before exploring specific techniques, let's establish the main definitions:

- **Normalization** reshapes distributions to fit standardized ranges or statistical properties  
- **Scaling** adjusts feature value ranges to enable fair comparison across variables

A key principle here is that **these transformations preserve underlying data patterns while standardizing numerical representation**.

Professional scaling ensures features contribute based on *predictive importance* rather than arbitrary numerical scale, keeping your models aligned with real transportation dynamics. Consider a Tuesday 3 PM demand prediction request from your client. Without proper scaling, your model might overweight historical bike counts (values in the hundreds) while underweighting a forecasted 15-degree temperature drop (numerically smaller change). The result: inflated demand predictions and misallocated bike fleet resources.

Now that you know *why* scaling matters, let's explore the two core methods every transportation consultant should master: StandardScaler and MinMaxScaler. Each one transforms the data differently and is best suited for particular feature types.

### 3.2. StandardScaler: Statistical Normalization

**Definition:** StandardScaler (Z-score normalization) standardizes features so they have mean 0 and standard deviation 1. This preserves the distribution shape but makes features comparable on a common statistical scale.

**Formula:**
$$scaled\_value = \frac{original\_value - mean}{standard\_deviation}$$

**Python Example:**

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

scaler = StandardScaler()
numerical_columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered']

# Show before/after comparison
print("Original values (first 5 rows):")
print(df[numerical_columns].head())

df_scaled = scaler.fit_transform(df[numerical_columns])
df_scaled_display = pd.DataFrame(df_scaled, columns=numerical_columns)

print("\nScaled values (first 5 rows):")
print(df_scaled_display.head())

print("\nScaled statistics:")
print(f"Mean: {df_scaled.mean(axis=0).round(10)}")  # Should be ~0
print(f"Std: {df_scaled.std(axis=0).round(3)}")     # Should be ~1
```

```text
Original values (first 5 rows):
   temp   atemp  humidity  windspeed  casual  registered
0  9.84  14.395        81        0.0       3          13
1  9.02  13.635        80        0.0       8          32
2  9.02  13.635        80        0.0       5          27
3  9.84  14.395        75        0.0       3          10
4  9.84  14.395        75        0.0       0           1

Scaled values (first 5 rows):
       temp     atemp  humidity  windspeed    casual  registered
0 -1.333661 -1.092737  0.993213  -1.567754 -0.660992   -0.943854
1 -1.438907 -1.182421  0.941249  -1.567754 -0.560908   -0.818052
2 -1.438907 -1.182421  0.941249  -1.567754 -0.620958   -0.851158
3 -1.333661 -1.092737  0.681430  -1.567754 -0.660992   -0.963717
4 -1.333661 -1.092737  0.681430  -1.567754 -0.721042   -1.023307

Scaled statistics:
Mean: [-0.  0.  0. -0.  0. -0.]
Std: [1. 1. 1. 1. 1. 1.]
```


Notice how StandardScaler transforms the features to have mean ≈ 0 and standard deviation ≈ 1. Temperature values that ranged from roughly 1°C to 41°C now center around 0, with most values falling between -2 and +2. This puts temperature, humidity, and bike counts on the same statistical scale, ensuring the model treats them fairly rather than over-weighting variables with larger raw values.

The transformation preserves relationships within each feature while making them directly comparable. For bike-sharing prediction, this means that a one-standard-deviation change in humidity carries the same numerical weight as a one-standard-deviation change in temperature. Differences in raw units no longer decide which feature dominates, allowing the model to learn which features are truly most predictive.

**When to Use:**

* Weather variables with near-normal distributions
* Linear regression or neural networks
* Situations without extreme outliers
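The formula can be checked by hand against scikit-learn. The sketch below uses a tiny made-up temperature series to confirm that `StandardScaler` matches the manual z-score (using the population standard deviation). It also illustrates a common practice: fitting the scaler on the training portion only and reusing its statistics for later data. The split point and values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up temperatures in chronological order (one feature, six observations)
temps = np.array([[2.0], [8.0], [14.0], [20.0], [26.0], [32.0]])
train, test = temps[:4], temps[4:]   # earlier rows for training, later rows held out

# Fit on training data only, then apply the same mean/std to both parts
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Manual z-score with the training mean and population standard deviation (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print("matches manual formula:", np.allclose(manual, train_scaled))
print("held-out values on the training scale:", test_scaled.ravel().round(3))
```

Held-out values can fall well outside the range seen in training. That is expected, since they are expressed in units of the training distribution.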

### 3.3. MinMaxScaler: Bounded Range Normalization

**Definition:** MinMaxScaler rescales values into a defined range, usually 0–1. This ensures all values fit within a predictable bound.

**Formula:**
$$scaled\_value = \frac{original\_value - min}{max - min}$$

**Python Example:**

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

scaler = MinMaxScaler(feature_range=(0, 1))
numerical_columns = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered']

# Show before
print("Original min/max (per column):")
print(df[numerical_columns].agg(['min','max']).T)

# Fit & transform
mm_scaled = scaler.fit_transform(df[numerical_columns])
mm_scaled_df = pd.DataFrame(mm_scaled, columns=numerical_columns)

# Show after
print("\nScaled min/max (should be ~0 and ~1):")
print(mm_scaled_df.agg(['min','max']).round(4).T)

# Optional: invert back to original scale (sanity check)
mm_inverted = scaler.inverse_transform(mm_scaled_df.head())
print("\nInverse-transformed (first 5 rows) matches original scale:")
print(pd.DataFrame(mm_inverted, columns=numerical_columns).round(3).head())
```

```text
Original min/max (per column):
             min       max
temp        0.82   41.0000
atemp       0.76   45.4550
humidity    0.00  100.0000
windspeed   0.00   56.9969
casual      0.00  367.0000
registered  0.00  886.0000

Scaled min/max (should be ~0 and ~1):
            min  max
temp        0.0  1.0
atemp       0.0  1.0
humidity    0.0  1.0
windspeed   0.0  1.0
casual      0.0  1.0
registered  0.0  1.0

Inverse-transformed (first 5 rows) matches original scale:
   temp   atemp  humidity  windspeed  casual  registered
0  9.84  14.395      81.0        0.0     3.0        13.0
1  9.02  13.635      80.0        0.0     8.0        32.0
2  9.02  13.635      80.0        0.0     5.0        27.0
3  9.84  14.395      75.0        0.0     3.0        10.0
4  9.84  14.395      75.0        0.0     0.0         1.0
```


Notice how MinMaxScaler compresses all values into the 0–1 range while preserving the relative distances between data points. Temperature values that originally ranged from 0.82°C to 41°C are now squeezed between 0 and 1, with the minimum temperature mapping to 0 and the maximum to 1. This bounded transformation is particularly useful when you need predictable value ranges.

The transformation maintains the original distribution shape and outlier patterns, simply rescaling them to fit the target range. Because the range is set by the observed minimum and maximum, a single extreme value can compress all other values into a narrow band. For bike-sharing prediction, features with different natural scales, such as temperature and humidity (0-100%), are all mapped to the same 0-1 range, preventing any single feature from dominating due to its larger numerical scale. The `inverse_transform` method converts scaled values back to the original scale for interpretation.

**When to Use:**

* Time-based features (e.g., hour of day)
* Variables with natural bounds (e.g., percentages)
* Algorithms sensitive to bounded inputs
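Two properties of min-max scaling are worth seeing in numbers: new values outside the fitted range fall outside 0-1, and one extreme value squeezes everything else toward 0. The sketch below uses made-up humidity and rental values to show both, and checks the manual formula against `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training humidity values (one feature)
humidity_train = np.array([[20.0], [40.0], [60.0], [80.0]])
scaler = MinMaxScaler().fit(humidity_train)

# Manual formula: (x - min) / (max - min)
manual = (humidity_train - humidity_train.min()) / (humidity_train.max() - humidity_train.min())
print("matches manual formula:", np.allclose(manual, scaler.transform(humidity_train)))

# A later value above the training maximum is mapped above 1
new_scaled = scaler.transform(np.array([[100.0]]))
print("humidity 100 on the fitted scale:", new_scaled.ravel().round(3))

# One extreme value compresses the remaining values near 0
casual = np.array([[0.0], [5.0], [10.0], [1000.0]])
casual_scaled = MinMaxScaler().fit_transform(casual)
print("with an outlier:", casual_scaled.ravel().round(3))
```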

## 4. Time-Based Feature Engineering for Transportation

### 4.1. Extracting Temporal Intelligence from Timestamps

Time is the most important dimension in transportation data, but raw timestamps contain hidden patterns that must be extracted and transformed to be useful for machine learning.

Think about how a clock works - after 11 PM comes midnight (12 AM), but to a computer, these look like completely different numbers (23 and 0). This creates a problem: the computer thinks 11 PM and midnight are far apart, when they're actually next to each other on the clock. Similarly, transportation demand operates simultaneously at multiple temporal scales - hourly rush patterns, weekly commute cycles, and seasonal variations - all requiring specialized feature engineering approaches.

In this section, we explore four essential temporal feature engineering techniques that transform raw timestamps into predictive features for bike-sharing demand forecasting:

1. **Cyclical encoding for continuous time**
2. **Time-since features**
3. **Temporal aggregation features**
4. **Lag features for sequential patterns**

Each technique captures different temporal patterns that drive transportation demand.

### 4.2. Cyclical Encoding for Continuous Time

**Definition**: Cyclical encoding transforms linear time values (hours, days, months) into circular representations using sine and cosine functions, ensuring that adjacent time points remain close in the feature space.

**Purpose**: This technique prevents artificial breaks in temporal data - hour 23 and hour 0 are neighbors on a clock, and models should treat them as such.

**Python Example**:

```python
import numpy as np
import pandas as pd

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])

# Extract hour from timestamp
df['hour'] = df['datetime'].dt.hour

# Create cyclical encoding for 24-hour cycle
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Compare hour 23 vs hour 0
print("Hour 23 encoding:", f"sin={df[df['hour']==23]['hour_sin'].iloc[0]:.3f}, cos={df[df['hour']==23]['hour_cos'].iloc[0]:.3f}")
print("Hour 0 encoding: ", f"sin={df[df['hour']==0]['hour_sin'].iloc[0]:.3f}, cos={df[df['hour']==0]['hour_cos'].iloc[0]:.3f}")
```

```text
Hour 23 encoding: sin=-0.259, cos=0.966
Hour 0 encoding:  sin=0.000, cos=1.000
```


The formula `2 * π * hour / 24` maps the 24-hour cycle onto a circle. We use both sine and cosine because together they uniquely identify any point on the circle. Hour 0 maps to (sin=0, cos=1), hour 6 to (sin=1, cos=0), hour 12 to (sin=0, cos=-1), and hour 18 to (sin=-1, cos=0). Critically, hour 23 maps to approximately (sin=-0.259, cos=0.966), very close to hour 0's position.

For bike-sharing demand prediction, this encoding allows the model to learn that late-night hours (22-23) and early-morning hours (0-1) share similar low-demand patterns. Without cyclical encoding, a model would see hour 23 and hour 0 as 23 units apart, the largest possible gap on the raw scale, even though they are only one hour apart on the clock, and would miss the continuous flow of nighttime demand patterns.
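The closeness of neighboring hours can be verified directly by measuring distances in the (sin, cos) space. The helper names `encode_hour` and `cyclical_distance` below are illustrative, not part of any library.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to its (sin, cos) position on the 24-hour circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

raw_gap = abs(23 - 0)                 # gap on the raw hour scale
d_23_0 = cyclical_distance(23, 0)     # across midnight
d_0_1 = cyclical_distance(0, 1)       # an ordinary one-hour step
d_0_12 = cyclical_distance(0, 12)     # opposite sides of the clock

print(f"raw gap 23 vs 0: {raw_gap}")
print(f"encoded distance 23 vs 0: {d_23_0:.3f}")
print(f"encoded distance 0 vs 1:  {d_0_1:.3f}")
print(f"encoded distance 0 vs 12: {d_0_12:.3f}")
```

In the encoded space, every one-hour step has the same length, including the step across midnight. The largest distance is between hours twelve apart.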

### 4.3. Time-Since Features

**Definition**: Time-since features measure the elapsed time between the current observation and a meaningful reference point, such as the last weekend, holiday, or weather event.

**Purpose**: These features capture recovery patterns and transition effects that influence transportation demand after significant events.

**Python Example**:

```python
import pandas as pd

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])
df = df.sort_values('datetime')

# Calculate days since last weekend (Monday=0, Sunday=6)
df['day_of_week'] = df['datetime'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Simple calculation: Monday=1, Tuesday=2, ..., Friday=5, Sat/Sun=0
df['days_since_weekend'] = df['day_of_week'].apply(
    lambda x: x + 1 if x < 5 else 0
)

# Show demand pattern by days since weekend
print("\nAverage demand by days since weekend:")
print(df.groupby('days_since_weekend')['count'].mean().round(1))
```

```text
Average demand by days since weekend:
days_since_weekend
0    188.8
1    190.4
2    189.7
3    188.4
4    197.3
5    197.8
Name: count, dtype: float64
```


We created a feature that tracks how many days have passed since the last weekend ended. Monday gets value 1, Tuesday gets 2, and so on through Friday (value 5), while weekends themselves get 0. This captures the weekly rhythm where demand patterns evolve as the work week progresses.

The averages show a step up late in the work week rather than a steady build-up. Weekends (0) and Monday through Wednesday (1-3 days since weekend) sit at a similar baseline of about 188-190 rentals per hour. Demand is higher on Thursday and Friday (4-5 days), at nearly 198 rentals per hour, roughly 4-5% above the earlier weekdays. One possible explanation is that people make more trips toward the end of the work week, combining commutes with after-work activities or weekend preparation, though these averages alone cannot confirm it. If the pattern holds, operators may want to plan for somewhat higher bike availability on Thursdays and Fridays.
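The same idea extends to reference points that do not follow a fixed weekly pattern, such as holidays. One way to compute "days since last holiday" is to keep the date only on holiday rows and carry it forward. The sketch below is a hedged illustration on a made-up daily table; the dates and the `holiday` flags are assumptions.

```python
import pandas as pd

# Made-up daily table with one holiday (4 July) flagged
daily = pd.DataFrame({
    "date": pd.date_range("2011-07-01", periods=8, freq="D"),
    "holiday": [0, 0, 0, 1, 0, 0, 0, 0],
})

# Keep the date only on holiday rows, then carry the most recent holiday forward
last_holiday = daily["date"].where(daily["holiday"] == 1).ffill()

# Elapsed days since that holiday; rows before the first holiday stay missing (NaN)
daily["days_since_holiday"] = (daily["date"] - last_holiday).dt.days

print(daily.to_string(index=False))
```

Rows before the first known holiday have no reference point, so the feature is missing there and needs an explicit choice (for example, a large placeholder value or dropping those rows).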

### 4.4. Temporal Aggregation Features

**Definition**: Temporal aggregation features summarize past values of a variable over a defined time window, providing context about recent trends and stability.

**Purpose**: They help models understand whether demand has been rising or falling, and whether conditions have been stable or volatile.

**Python Example**:

In [53]:
import pandas as pd

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])
df = df.sort_values('datetime')

# Create 3-hour rolling average demand
df['demand_3h_avg'] = df['count'].rolling(window=3, min_periods=1).mean()

# Create 24-hour rolling average demand
df['demand_24h_avg'] = df['count'].rolling(window=24, min_periods=1).mean()

# Compare current demand to 24-hour average (momentum indicator)
df['demand_momentum'] = df['count'] - df['demand_24h_avg']

# Show example
print("\nSample temporal aggregation features:")
print(df[['datetime', 'count', 'demand_3h_avg', 'demand_24h_avg', 'demand_momentum']].head(30).to_string(index=False))


Sample temporal aggregation features:
           datetime  count  demand_3h_avg  demand_24h_avg  demand_momentum
2011-01-01 00:00:00     16      16.000000       16.000000         0.000000
2011-01-01 01:00:00     40      28.000000       28.000000        12.000000
2011-01-01 02:00:00     32      29.333333       29.333333         2.666667
2011-01-01 03:00:00     13      28.333333       25.250000       -12.250000
2011-01-01 04:00:00      1      15.333333       20.400000       -19.400000
2011-01-01 05:00:00      1       5.000000       17.166667       -16.166667
2011-01-01 06:00:00      2       1.333333       15.000000       -13.000000
2011-01-01 07:00:00      3       2.000000       13.500000       -10.500000
2011-01-01 08:00:00      8       4.333333       12.888889        -4.888889
2011-01-01 09:00:00     14       8.333333       13.000000         1.000000
2011-01-01 10:00:00     36      19.333333       15.090909        20.909091
2011-01-01 11:00:00     56      35.333333       18.500000        37.500000

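We created rolling window features that summarize recent demand history. The 3-hour average captures immediate trends: if demand has been building over the past few hours, this average rises. The 24-hour average provides daily context by smoothing out hourly fluctuations. The momentum indicator (current demand minus the 24-hour average) shows whether current demand is above or below the average level of the past day. Two details matter when reading the output. First, each window is counted in rows and includes the current row, and with `min_periods=1` the earliest rows average over fewer observations (the first row's average is its own value, so its momentum is 0). Second, because `count` is also the prediction target, a window that includes the current row would leak the answer into a forecasting model; in that setting the window should cover only earlier rows.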
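As a minimal sketch of how window definitions differ, the example below compares three variants on a toy series: a row-based window that includes the current row (as in the lecture code), a row-based window restricted to past rows, and a time-based window over the preceding three hours. The toy values, the deliberately missing 03:00 record, and the column names `avg_3rows_incl`, `avg_3rows_past`, and `avg_3h_past` are assumptions made for illustration, not part of the lecture dataset.

```python
import pandas as pd

# Toy hourly series (hypothetical values); the 03:00 record is missing on purpose
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00",
        "2011-01-01 04:00", "2011-01-01 05:00",
    ]),
    "count": [16, 40, 32, 1, 1],
}).sort_values("datetime")

# Row-based window that includes the current row
toy["avg_3rows_incl"] = toy["count"].rolling(window=3, min_periods=1).mean()

# Shift by one row first, so the window only sees earlier observations
toy["avg_3rows_past"] = toy["count"].shift(1).rolling(window=3, min_periods=1).mean()

# Time-based window: observations in the 3 hours before the current timestamp
# (closed="left" excludes the current row)
by_time = toy.set_index("datetime")["count"]
toy["avg_3h_past"] = by_time.rolling("3h", closed="left").mean().to_numpy()

print(toy.to_string(index=False))
```

At 04:00 the row-based past window still averages three rows (00:00 to 02:00), while the time-based window only uses the two records that fall within the previous three hours. Which behaviour is preferable depends on how gaps in the data should be treated.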

These aggregation features help distinguish between sustained demand trends and temporary spikes. For example, if current demand is 200 bikes/hour but the 3-hour average is 150, demand is accelerating, perhaps because the weather improved or an event started. Conversely, if current demand is 100 but the 3-hour average is 150, demand is declining and bike rebalancing can be delayed. The 24-hour average is particularly useful for detecting anomalies: when current demand deviates strongly from it (for example, an absolute momentum above 50 bikes per hour, which is an illustrative threshold rather than a calibrated one), it signals special conditions that may require operational attention.
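The anomaly rule described above can be sketched on a short toy series. The counts are hypothetical, the 4-row window stands in for the 24-hour window to keep the example small, and the threshold of 50 is the illustrative value from the text; `volatility` is an added rolling standard deviation, one possible way to express whether recent demand has been stable.

```python
import pandas as pd

# Hypothetical hourly counts, chosen only to illustrate the flags
counts = pd.Series([100, 110, 105, 95, 100, 210, 220, 90], name="count")

# Rolling mean over the last 4 rows (stand-in for the 24-hour average)
rolling_avg = counts.rolling(window=4, min_periods=1).mean()

# Rolling standard deviation: low values mean stable demand, high values volatile demand
volatility = counts.rolling(window=4, min_periods=2).std()

# Momentum and an anomaly flag based on its absolute size
momentum = counts - rolling_avg
is_anomaly = momentum.abs() > 50  # illustrative threshold

summary = pd.DataFrame({
    "count": counts,
    "rolling_avg": rolling_avg,
    "volatility": volatility.round(1),
    "momentum": momentum,
    "is_anomaly": is_anomaly,
})
print(summary.to_string(index=False))
```

Using the absolute value flags both surges and sudden drops. In practice the threshold would need tuning, since ordinary rush-hour peaks can also sit well above a daily average.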

### 4.5. Lag Features for Sequential Patterns

**Definition**: Lag features use values from previous time steps as predictors for the current observation, explicitly introducing historical patterns into the model.

**Purpose**: They capture the sequential dependencies and recurring cycles that characterize transportation demand.

**Python Example**:

import pandas as pd

data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv"
df = pd.read_csv(data_path, parse_dates=['datetime'])
df = df.sort_values('datetime')

# Create lag features at different time scales
df['demand_lag_1h'] = df['count'].shift(1)      # 1 hour ago
df['demand_lag_24h'] = df['count'].shift(24)    # Same hour yesterday
df['demand_lag_7d'] = df['count'].shift(24 * 7) # Same hour last week

# Calculate correlations to understand predictive power
print("Lag feature correlations with current demand:")
print(f"1-hour lag:  {df['count'].corr(df['demand_lag_1h']):.3f}")
print(f"24-hour lag: {df['count'].corr(df['demand_lag_24h']):.3f}")
print(f"7-day lag:   {df['count'].corr(df['demand_lag_7d']):.3f}")

# Show example for Saturday morning
print("\nSaturday 7 AM demand patterns:")
saturday_7am = df[(df['datetime'].dt.dayofweek == 5) & (df['datetime'].dt.hour == 7)]
print(saturday_7am[['datetime', 'count', 'demand_lag_1h', 'demand_lag_24h', 'demand_lag_7d']].head(5).to_string(index=False))

Lag feature correlations with current demand:
1-hour lag:  0.842
24-hour lag: 0.811
7-day lag:   0.786

Saturday 7 AM demand patterns:
           datetime  count  demand_lag_1h  demand_lag_24h  demand_lag_7d
2011-01-01 07:00:00      3            2.0             NaN            NaN
2011-01-08 07:00:00      9            2.0            84.0           16.0
2011-01-15 07:00:00     10            3.0            70.0           16.0
2011-02-05 07:00:00      4            4.0            87.0          113.0
2011-02-12 07:00:00     11            2.0            74.0           11.0


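We created three lag features at different time scales. The 1-hour lag captures immediate momentum: high demand often persists for several hours. The 24-hour lag captures daily repetition: 8 AM today resembles 8 AM yesterday. The 7-day lag captures weekly cycles: Monday patterns repeat week after week. The `.shift(n)` method moves each value down by `n` rows, so every row receives the demand recorded `n` rows earlier, and rows without enough history are filled with `NaN` (as in the first row of the Saturday table). Because the shift counts rows rather than hours, the labels "24 hours" and "7 days" are exact only when no hourly records are missing.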
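The row-wise behaviour of `.shift()` is easiest to see on a very small example. The sketch below uses six hypothetical counts and illustrative column names (`lag_1`, `lag_3`); dropping incomplete rows is shown as one possible way to handle the resulting `NaN` values, not the only one.

```python
import pandas as pd

# Hypothetical counts for six consecutive hours
toy = pd.DataFrame({"count": [16, 40, 32, 13, 1, 1]})

toy["lag_1"] = toy["count"].shift(1)  # value from one row earlier
toy["lag_3"] = toy["count"].shift(3)  # value from three rows earlier

# Rows without full history contain NaN; one option is to drop them before modeling
model_ready = toy.dropna(subset=["lag_1", "lag_3"])

print(toy.to_string(index=False))
print(f"\nRows with complete lag history: {len(model_ready)} of {len(toy)}")
```

The longer the lag, the more rows at the start of the series lose their history, which matters for short datasets.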

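The correlation analysis shows that the 1-hour lag (correlation ≈ 0.84) is the strongest predictor of bike-sharing demand, as demand in consecutive hours tends to be very similar. The 24-hour lag (correlation ≈ 0.81) is nearly as strong, reflecting the dominant daily cycle in urban transportation, and the 7-day lag (correlation ≈ 0.79) follows closely. The Saturday 7 AM sample illustrates why a weekly reference matters for weekends: on 2011-01-08, demand (9 bikes) is far below the 24-hour lag value (84 bikes) and much closer to the demand recorded at 7 AM on the previous Saturday (3 bikes, first row of the table).

The same sample also exposes a limitation of row-based lags. The `demand_lag_7d` value on 2011-01-08 is 16, not the 3 bikes actually recorded at 7 AM on 2011-01-01, and the table jumps from 2011-01-15 to 2011-02-05. Both observations indicate that the dataset has missing hours and days, so `.shift(24 * 7)` steps back 168 rows rather than exactly 168 hours. The lag columns and their correlations should therefore be read as approximate unless the lags are aligned on the timestamp.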
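One way to align a lag on the timestamp rather than on row position is to look up the value recorded exactly 24 hours earlier. The sketch below is a minimal illustration on toy data, not the lecture's pipeline: `count` is simply the number of hours since the start (hypothetical values that make misalignment easy to spot), two records are removed to simulate gaps, and the column names `lag_24h_rows` and `lag_24h_time` are illustrative.

```python
import pandas as pd

# Toy hourly data over two days; "count" equals the hours elapsed since the start
start = pd.Timestamp("2011-01-01 00:00")
toy = pd.DataFrame({"count": list(range(48))})
toy["datetime"] = start + pd.to_timedelta(toy["count"], unit="h")

# Simulate missing records: drop 05:00 and 06:00 on the first day
toy = toy[~toy["count"].isin([5, 6])].reset_index(drop=True)

# Row-based lag: 24 rows back, whatever timestamp that row carries
toy["lag_24h_rows"] = toy["count"].shift(24)

# Time-aligned lag: look up the count recorded exactly 24 hours earlier;
# timestamps with no matching record yield NaN
lookup = toy.set_index("datetime")["count"]
target_times = toy["datetime"] - pd.Timedelta(hours=24)
toy["lag_24h_time"] = lookup.reindex(target_times).to_numpy()

# Inspect the hours around the gap on the second day
print(toy[toy["datetime"] >= "2011-01-02 03:00"].head(5).to_string(index=False))
```

In a complete series the two columns would agree and the time-aligned value would always be `count - 24`. Here the row-based lag drifts by two hours for the rows affected by the gap, while the time-aligned lag returns either the correct value or `NaN` when yesterday's record is missing.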

---

## Summary and Transition to Exploratory Data Analysis

You've mastered advanced preprocessing and feature engineering techniques: categorical encoding strategies, scaling methods, cyclical time encoding, rolling aggregates, and lag features. These skills transform clean transportation data into machine learning-ready inputs.

Your ability to create new features from raw data prepares you to work with complex transportation prediction challenges while maintaining the data quality essential for accurate models.

In the next module, you'll learn how to explore and visualize these engineered features to generate business insights and validate that your preprocessing pipeline creates data that reflects real-world transportation patterns.