# Temporal and Lag Features

Engineering Time-Aware Signals Without Leakage


## Objective

This notebook provides a rigorous treatment of **temporal and lag-based feature engineering**, covering:

- Absolute vs relative time features
- Lag features and rolling windows
- Cumulative and trend-based signals
- Leakage traps in temporal data
- Time-aware feature engineering inside pipelines

It answers:

**How do we engineer temporal features that reflect reality without using future information?**


## Why Temporal Features Matter

Time is often one of the strongest predictors in real systems.

Temporal features:
- Capture behavior evolution
- Enable trend detection
- Reflect customer lifecycle stages

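But time is also one of the **most common sources of leakage**: any feature computed from data recorded after the prediction point carries future information into training.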

## Imports and Dataset



In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv("../datasets/03_feature_engineering/synthetic_customer_churn_temporal.csv",
                parse_dates=["event_date"]
)


df.sort_values(["customer_id", "event_date"], inplace=True)
df.head()


| | customer_id | event_date | avg_monthly_usage | churned |
|---|---|---|---|---|
| 0 | 1 | 2022-01-01 | 61.99 | 0 |
| 1 | 1 | 2022-02-01 | 85.53 | 0 |
| 2 | 1 | 2022-03-01 | 71.94 | 0 |
| 3 | 1 | 2022-04-01 | 67.69 | 0 |
| 4 | 1 | 2022-05-01 | 69.28 | 0 |


## Step 1 – Temporal Data Sanity Checks


In [3]:
df[["customer_id", "event_date"]].isnull().sum()


customer_id    0
event_date     0
dtype: int64
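
Null checks alone do not guarantee a usable timeline. The following is a minimal sketch of two further checks, run on a small hypothetical frame (`toy` is illustrative and not the notebook's dataset; only the column names are borrowed from it): duplicate `(customer_id, event_date)` pairs, and whether dates are in order within each customer.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-03-01", "2022-02-01", "2022-01-01", "2022-01-01"]
    ),
})

# Duplicate (customer, date) pairs make "the previous value" ambiguous.
n_duplicates = int(toy.duplicated(subset=["customer_id", "event_date"]).sum())

# Dates should be non-decreasing within each customer before shifting or rolling.
is_ordered = toy.groupby("customer_id")["event_date"].apply(
    lambda s: s.is_monotonic_increasing
)

print(n_duplicates)
print(is_ordered.to_dict())
```

In this toy frame, customer 1 is out of order and customer 2 has a duplicated date, so both checks flag a problem that `isnull()` would miss.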

## Step 2 – Absolute Time Features

Absolute time captures **seasonality and calendar effects**.


In [4]:
df["year"] = df["event_date"].dt.year
df["month"] = df["event_date"].dt.month
df["day_of_week"] = df["event_date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

df[["event_date", "month", "day_of_week", "is_weekend"]].head()


| | event_date | month | day_of_week | is_weekend |
|---|---|---|---|---|
| 0 | 2022-01-01 | 1 | 5 | 1 |
| 1 | 2022-02-01 | 2 | 1 | 0 |
| 2 | 2022-03-01 | 3 | 1 | 0 |
| 3 | 2022-04-01 | 4 | 4 | 0 |
| 4 | 2022-05-01 | 5 | 6 | 1 |


## Step 3 – Relative Time Features

Relative time captures **position within lifecycle**.


In [5]:
df["first_event"] = df.groupby("customer_id")["event_date"].transform("min")

df["days_since_first_event"] = (
    df["event_date"] - df["first_event"]
).dt.days
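
To make the arithmetic concrete, here is a small sketch on hypothetical data (the `toy` frame and the `days_since_prev_event` column are illustrative assumptions, not part of the notebook). Both quantities look only backward: the first event and the previous event are already known at the time of each row.

```python
import pandas as pd

# Hypothetical toy data, already sorted by customer and date.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-03-01", "2022-02-01", "2022-02-15"]
    ),
})

# Days elapsed since each customer's first observed event.
first_event = toy.groupby("customer_id")["event_date"].transform("min")
toy["days_since_first_event"] = (toy["event_date"] - first_event).dt.days

# Days elapsed since the same customer's previous event (NaN for the first row).
toy["days_since_prev_event"] = (
    toy.groupby("customer_id")["event_date"].diff().dt.days
)

print(toy)
```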


## Step 4 – Lag Features

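Lag features use **past values only**. Rows without enough history (the first row of each customer for lag 1, the first three rows for lag 3) are `NaN`.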


In [6]:
df["usage_lag_1"] = (
    df.groupby("customer_id")["avg_monthly_usage"]
    .shift(1)
)

df["usage_lag_3"] = (
    df.groupby("customer_id")["avg_monthly_usage"]
    .shift(3)
)
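
The role of `groupby` is easiest to see on a tiny example. The sketch below uses hypothetical values (the `toy` frame and the `usage_lag_1_ungrouped` column are assumptions for illustration) and compares a grouped lag with an ungrouped one.

```python
import pandas as pd

# Hypothetical toy data, sorted by customer and then by time.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "avg_monthly_usage": [10.0, 20.0, 30.0, 100.0, 200.0],
})

# Grouped lag: each customer's first row has no predecessor, so it is NaN.
toy["usage_lag_1"] = toy.groupby("customer_id")["avg_monthly_usage"].shift(1)

# Ungrouped lag: customer 2's first row picks up customer 1's last value.
toy["usage_lag_1_ungrouped"] = toy["avg_monthly_usage"].shift(1)

print(toy)
```

Row 3 is the first observation of customer 2. The grouped lag is `NaN` there, while the ungrouped lag silently carries over 30.0 from customer 1.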


## Leakage Warning

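Using future values as features invalidates models: offline metrics look better than anything the model can achieve once it is deployed.

Rules:
- Lags must be strictly backward (positive `shift` only)
- Sort by time first, because `shift` works on row order, not on dates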
- Group by entity
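
One way to test these rules is a truncation check: a backward-looking feature computed for rows up to a cutoff must not change when all later rows are removed. The sketch below is one possible form of that check under stated assumptions; `add_lag`, `is_leak_free`, and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def add_lag(frame: pd.DataFrame, periods: int) -> pd.Series:
    """Lag of usage within each customer; positive periods look backward."""
    ordered = frame.sort_values(["customer_id", "event_date"])
    return ordered.groupby("customer_id")["avg_monthly_usage"].shift(periods)

# Hypothetical toy data for a single customer.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01"]
    ),
    "avg_monthly_usage": [10.0, 20.0, 30.0, 40.0],
})

cutoff = pd.Timestamp("2022-02-01")
past_only = toy[toy["event_date"] <= cutoff]

def is_leak_free(periods: int) -> bool:
    """True if features up to the cutoff are unchanged when later rows are removed."""
    full = add_lag(toy, periods).loc[past_only.index]
    truncated = add_lag(past_only, periods)
    return full.equals(truncated)

print(is_leak_free(1))   # backward lag
print(is_leak_free(-1))  # forward "lag" reads the next row
```

The backward lag passes the check; the negative shift fails because the value at the cutoff depends on a row that lies after it.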


## Step 5 – Rolling Window Features

Rolling windows summarize **recent behavior** within each customer. With `window=3` and the default `min_periods`, the first two rows of each customer are `NaN` because the window is not yet full.


In [7]:
df["usage_roll_mean_3"] = (
    df.groupby("customer_id")["avg_monthly_usage"]
    .rolling(window=3)
    .mean()
    .reset_index(level=0, drop=True)
)

df["usage_roll_std_3"] = (
    df.groupby("customer_id")["avg_monthly_usage"]
    .rolling(window=3)
    .std()
    .reset_index(level=0, drop=True)
)
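
Note that `rolling(window=3)` as used above includes the current row in each window. That is fine when the current period's usage is known at prediction time. If it is not, a common variant shifts by one step before rolling so that the window covers strictly earlier rows. The sketch below compares both on hypothetical data; the `toy` frame and the column names `roll_mean_3` and `roll_mean_3_past` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, sorted by customer and then by time.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2],
    "avg_monthly_usage": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0],
})

grouped = toy.groupby("customer_id")["avg_monthly_usage"]

# Window includes the current row (same pattern as the cell above).
toy["roll_mean_3"] = (
    grouped.rolling(window=3).mean().reset_index(level=0, drop=True)
)

# Window covers the three rows strictly before the current one.
toy["roll_mean_3_past"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=3).mean()
)

print(toy)
```

For customer 1's fourth row, the first version averages 20, 30, and 40, while the shifted version averages 10, 20, and 30.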


## Step 6 – Cumulative Features

Cumulative features reflect long-term behavior.


In [8]:
df["usage_cumulative"] = (
    df.groupby("customer_id")["avg_monthly_usage"]
    .cumsum()
)
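
A raw cumulative sum grows with the number of observed periods, so it partly encodes customer tenure. One possible refinement, sketched here on hypothetical data, divides by the number of periods observed so far to obtain a running mean; `periods_observed` and `usage_mean_to_date` are illustrative names, not columns from the notebook.

```python
import pandas as pd

# Hypothetical toy data, sorted by customer and then by time.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "avg_monthly_usage": [10.0, 20.0, 30.0, 100.0, 200.0],
})

grouped = toy.groupby("customer_id")["avg_monthly_usage"]

# Running total per customer, as in the cell above.
toy["usage_cumulative"] = grouped.cumsum()

# Number of rows seen so far for the customer (1-based).
toy["periods_observed"] = grouped.cumcount() + 1

# Running mean: cumulative usage normalized by tenure in periods.
toy["usage_mean_to_date"] = toy["usage_cumulative"] / toy["periods_observed"]

print(toy)
```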


## Step 7 – Trend Features

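Trends indicate growth or decay. The first difference (`diff`) measures the period-over-period change in usage and is `NaN` for each customer's first row.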


In [9]:
df["usage_delta"] = (
    df.groupby("customer_id")["avg_monthly_usage"]
    .diff()
)
df["usage_delta"]

0         NaN
1       23.54
2      -13.59
3       -4.25
4        1.59
        ...  
6995     8.01
6996   -13.47
6997    11.78
6998    -4.64
6999    13.33
Name: usage_delta, Length: 7000, dtype: float64

## Step 8 – Time-Aware Validation Requirement

Random splits break temporal integrity: rows from later periods end up in the training set while earlier rows are held out for evaluation.

Correct strategies:
- Forward chaining
- TimeSeriesSplit
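
With panel data (many customers, one row per period), scikit-learn's `TimeSeriesSplit` splits by row position, so it is usually applied to the ordered unique dates rather than to the rows themselves. The following is a sketch under that assumption; the `toy` panel and the `folds` list are hypothetical and only illustrate the forward-chaining pattern.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical panel: 3 customers observed monthly for 6 months.
dates = pd.date_range("2022-01-01", periods=6, freq="MS")
toy = pd.DataFrame({
    "customer_id": np.repeat([1, 2, 3], len(dates)),
    "event_date": np.tile(dates, 3),
})

# Split on the ordered unique dates, not on rows, so that a given month
# is never shared between training and validation.
unique_dates = pd.DatetimeIndex(toy["event_date"].unique()).sort_values()
splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, valid_idx in splitter.split(np.arange(len(unique_dates))):
    train_mask = toy["event_date"].isin(unique_dates[train_idx])
    valid_mask = toy["event_date"].isin(unique_dates[valid_idx])
    folds.append((toy[train_mask], toy[valid_mask]))
    print(
        toy.loc[train_mask, "event_date"].max().date(),
        "<",
        toy.loc[valid_mask, "event_date"].min().date(),
    )
```

Each fold trains on an expanding block of earlier months and validates on the months that follow, which is the forward-chaining scheme listed above.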

## Step 9 – Temporal Features Inside Pipelines

Temporal transformations must:
- Occur after sorting
- Respect entity boundaries
- Be reproducible
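
These three requirements can be bundled into a single function that sorts, groups, and then derives features, so the result does not depend on the order in which rows arrive. The sketch below is one possible shape for such a step; `add_temporal_features` and the `toy` frame are hypothetical, and it assumes `(customer_id, event_date)` pairs are unique.

```python
import pandas as pd

def add_temporal_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a sorted copy with backward-looking usage features per customer."""
    # Sort first, then reset the index so the output is independent of input order.
    out = frame.sort_values(["customer_id", "event_date"]).reset_index(drop=True)
    usage = out.groupby("customer_id")["avg_monthly_usage"]
    out["usage_lag_1"] = usage.shift(1)        # previous period's usage
    out["usage_delta"] = usage.diff()          # change versus previous period
    out["usage_cumulative"] = usage.cumsum()   # running total to date
    return out

# Hypothetical toy data.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01", "2022-01-01", "2022-02-01"]
    ),
    "avg_monthly_usage": [10.0, 20.0, 30.0, 100.0, 200.0],
})

# The same rows in a different order must produce identical features.
shuffled = toy.sample(frac=1.0, random_state=0)
features = add_temporal_features(toy)
features_from_shuffled = add_temporal_features(shuffled)

print(features.equals(features_from_shuffled))
```

A function of this kind could, for example, be wrapped in scikit-learn's `FunctionTransformer`, with the caveat that it reorders rows, so the target must be aligned to the same order.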


## Model Sensitivity to Temporal Features

| Model | Benefit from engineered temporal features |
|------|--------|
| Linear | Moderate |
| Tree-Based | High |
| Boosting | Very High |
| RNN / LSTM | Native |


## Common Mistakes (Avoided)

- Random train/test split
- Using future aggregates
- Ignoring entity grouping
- Global rolling windows

## Summary Table

| Feature Type | Purpose |
|-------------|---------|
| Lag | Past behavior |
| Rolling | Recent trend |
| Cumulative | Long-term |
| Delta | Change rate |


## Key Takeaways

- Temporal features are high-signal
- Leakage is the primary risk
- Always group and sort
- Time-aware validation is mandatory
- Pipelines must respect time


## Next Notebook

03_Feature_Engineering/

└── [08_feature_stability_and_drift_checks.ipynb](08_feature_stability_and_drift_checks.ipynb)





# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

- [00_Data_Generation_and_Simulation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/00_Data_Generation_and_Simulation)


- [01_Exploratory_Data_Analysis_(EDA)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/01_Exploratory_Data_Analysis_(EDA))


- [02_Data_Preprocessing](https://github.com/lei-soares/Data-Science-Techniques/tree/main/02_Data_Preprocessing)


- [03_Feature_Engineering](https://github.com/lei-soares/Data-Science-Techniques/tree/main/03_Feature_Engineering)


- [04_Supervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/04_Supervised_Learning)


- [05_Unsupervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/05_Unsupervised_Learning)


- [06_Model_Evaluation_and_Validation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/06_Model_Evaluation_and_Validation)


- [07_Model_Tuning_and_Optimization](https://github.com/lei-soares/Data-Science-Techniques/tree/main/07_Model_Tuning_and_Optimization)


- [08_Interpretability_and_Explainability](https://github.com/lei-soares/Data-Science-Techniques/tree/main/08_Interpretability_and_Explainability)


- [09_Pipelines_and_Workflows](https://github.com/lei-soares/Data-Science-Techniques/tree/main/09_Pipelines_and_Workflows)


- [10_Natural_Language_Processing_(NLP)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/10_Natural_Language_Processing_(NLP))


- [11_Time_Series](https://github.com/lei-soares/Data-Science-Techniques/tree/main/11_Time_Series)


- [12_Anomaly_and_Fraud_Detection](https://github.com/lei-soares/Data-Science-Techniques/tree/main/12_Anomaly_and_Fraud_Detection)


- [13_Imbalanced_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/13_Imbalanced_Learning)


- [14_Deployment_and_Production_Concepts](https://github.com/lei-soares/Data-Science-Techniques/tree/main/14_Deployment_and_Production_Concepts)


- [15_Business_and_Experimental_Design](https://github.com/lei-soares/Data-Science-Techniques/tree/main/15_Business_and_Experimental_Design)