# Feature Store and Reusability
     Turning Feature Engineering into a Reproducible Asset


## Objective

This notebook provides a conceptual and practical introduction to **feature stores and feature reusability**, covering:

- Why feature engineering must be centralized
- Online vs offline features
- Feature definitions and versioning
- Training–serving consistency
- Lightweight feature store patterns in Python

It answers:

    How do we prevent duplicated, inconsistent, and non-reproducible feature engineering?


## Why Feature Stores Matter

Without feature reuse:

- Teams re-implement the same feature with inconsistent logic
- Features are computed differently in training and in serving (training–serving skew)
- Models silently break in production
- Feature logic is lost over time

A feature store treats features as **first-class assets**.



## Imports and Dataset


In [2]:
import numpy as np
import pandas as pd
from datetime import datetime

In [3]:
df = pd.read_csv(
    "../datasets/03_feature_engineering/synthetic_customer_churn_temporal.csv",
    parse_dates=["event_date"],
)

# Sort so that each customer's events are in chronological order.
df.sort_values(["customer_id", "event_date"], inplace=True)
df.head()

Unnamed: 0,customer_id,event_date,avg_monthly_usage,churned,customer_segment,income,tenure_years
0,1,2022-01-01,61.99,0,segment_20,32133.0,-1
1,1,2022-02-01,85.53,0,segment_12,17875.0,8
2,1,2022-03-01,71.94,0,segment_11,26139.0,10
3,1,2022-04-01,67.69,0,segment_9,54872.0,16
4,1,2022-05-01,69.28,0,segment_10,48679.0,11


## Step 1 – What Is a Feature?

A production feature must have:

- Clear definition
- Deterministic logic
- Time reference
- Ownership
- Version
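
The five properties above can be captured in a small metadata record that travels with the feature logic. The following is a minimal sketch: `FeatureSpec`, its field names, and the `key` format are hypothetical and not taken from any feature store library.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical metadata record for one production feature."""
    name: str
    description: str                              # clear definition
    compute: Callable[[pd.DataFrame], pd.Series]  # deterministic logic
    time_column: str                              # time reference
    owner: str                                    # ownership
    version: str                                  # version

    @property
    def key(self) -> str:
        """Versioned identifier, e.g. 'tenure_days:1.0'."""
        return f"{self.name}:{self.version}"


def tenure_in_days(df: pd.DataFrame) -> pd.Series:
    first_event = df.groupby("customer_id")["event_date"].transform("min")
    return (df["event_date"] - first_event).dt.days


spec = FeatureSpec(
    name="tenure_days",
    description="Days since the customer's first recorded event",
    compute=tenure_in_days,
    time_column="event_date",
    owner="data_science",
    version="1.0",
)

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-01-15"]),
})
values = spec.compute(toy)
```

Making the record frozen means a feature's metadata cannot be edited in place; a change requires a new record, which fits naturally with a new version.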


## Step 2 – Feature Definition as Code

Feature logic should be **centralized and reusable**.


In [4]:
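def avg_usage_last_3_months(df):
    """Rolling mean of `avg_monthly_usage` over each customer's last 3 rows.

    The window counts rows, not calendar months, so it equals "3 months"
    only when every customer has exactly one event per month.
    """
    return (
        df.sort_values("event_date")
        .groupby("customer_id")["avg_monthly_usage"]
        .rolling(3)  # NaN until a customer has 3 events
        .mean()
        .reset_index(level=0, drop=True)  # drop customer_id level to realign with df.index
    )


def tenure_in_days(df):
    """Days elapsed since each customer's first recorded event."""
    first_event = df.groupby("customer_id")["event_date"].transform("min")
    return (df["event_date"] - first_event).dt.days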
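
A definition is only unambiguous if the window semantics are spelled out. `rolling(3)` counts rows, so a customer with missing months gets a window that spans more than three calendar months. As a sketch of one alternative (not the notebook's method), the hypothetical `avg_usage_last_90_days` below uses a time-based window instead; the toy data is invented for illustration.

```python
import pandas as pd


def avg_usage_last_90_days(df):
    """Mean usage over the 90 days up to and including each event."""
    parts = []
    for _, group in df.sort_values("event_date").groupby("customer_id"):
        rolled = (
            group.set_index("event_date")["avg_monthly_usage"]
            .rolling("90D")  # time-based window; min_periods defaults to 1
            .mean()
        )
        # Put the values back on the original row labels.
        parts.append(pd.Series(rolled.to_numpy(), index=group.index))
    return pd.concat(parts).reindex(df.index)


toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    # Note the gap: no events in April or May.
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01", "2022-06-01"]
    ),
    "avg_monthly_usage": [60.0, 80.0, 70.0, 100.0],
})

row_window = toy.groupby("customer_id")["avg_monthly_usage"].rolling(3).mean()
time_window = avg_usage_last_90_days(toy)
```

For the June event, the row-based window still averages February, March, and June, while the 90-day window only sees June. Either choice can be valid; what matters is that the definition states which one is meant.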


## Step 3 – Feature Registry (Lightweight Pattern)


In [5]:
FEATURE_REGISTRY = {
    "avg_usage_last_3m": {
        "description": "Rolling mean of usage over last 3 periods",
        "owner": "data_science",
        "version": "1.0",
        "function": avg_usage_last_3_months
    },
    "tenure_days": {
        "description": "Customer tenure in days",
        "owner": "data_science",
        "version": "1.0",
        "function": tenure_in_days
    }
}
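
A plain dictionary is easy to read, but nothing stops two functions from being registered under the same name. One possible refinement is a decorator that registers a function next to its definition and rejects duplicates. This is a sketch only: `register_feature` and `REGISTRY` are hypothetical names, not part of any library.

```python
from typing import Callable, Dict

import pandas as pd

# Hypothetical registry keyed by (name, version).
REGISTRY: Dict[tuple, dict] = {}


def register_feature(name: str, version: str, owner: str, description: str):
    """Decorator that records a feature function plus its metadata."""
    def decorator(func: Callable[[pd.DataFrame], pd.Series]):
        key = (name, version)
        if key in REGISTRY:
            raise ValueError(f"Feature {name!r} v{version} is already registered")
        REGISTRY[key] = {
            "description": description,
            "owner": owner,
            "version": version,
            "function": func,
        }
        return func  # the function stays usable on its own
    return decorator


@register_feature("tenure_days", "1.0", "data_science", "Customer tenure in days")
def tenure_in_days(df):
    first_event = df.groupby("customer_id")["event_date"].transform("min")
    return (df["event_date"] - first_event).dt.days
```

Keying on `(name, version)` lets several versions of one feature coexist while still catching accidental overwrites.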


## Step 4 – Feature Materialization (Offline)


In [6]:
for feature_name, meta in FEATURE_REGISTRY.items():
    df[feature_name] = meta["function"](df)

df[["customer_id", "avg_usage_last_3m", "tenure_days"]].head()


Unnamed: 0,customer_id,avg_usage_last_3m,tenure_days
0,1,,0
1,1,,31
2,1,73.153333,59
3,1,75.053333,90
4,1,69.636667,120


## Step 5 – Feature Versioning

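Versioning protects models from silent feature changes: once a model is trained on a given version, that version's logic and values must not change underneath it. The simplest form is a version suffix on the materialized column.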


In [8]:
df["avg_usage_last_3m_v1"] = df["avg_usage_last_3m"]
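
Copying a column under a `_v1` suffix freezes the values but not the logic that produced them. A slightly fuller sketch keeps each version as its own function so that old and new definitions can be materialized side by side. The `min_periods=1` change in v2 is a hypothetical example of a logic change, and `VERSIONED` is an illustrative name.

```python
import pandas as pd


def avg_usage_v1(df):
    # v1: mean over the last 3 rows per customer; NaN until 3 rows exist.
    return (
        df.groupby("customer_id")["avg_monthly_usage"]
        .rolling(3).mean()
        .reset_index(level=0, drop=True)
    )


def avg_usage_v2(df):
    # v2 (hypothetical change): allow partial windows instead of NaN.
    return (
        df.groupby("customer_id")["avg_monthly_usage"]
        .rolling(3, min_periods=1).mean()
        .reset_index(level=0, drop=True)
    )


VERSIONED = {
    ("avg_usage_last_3m", "1.0"): avg_usage_v1,
    ("avg_usage_last_3m", "2.0"): avg_usage_v2,
}

# Toy data, already sorted by customer and date.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "event_date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01"]),
    "avg_monthly_usage": [60.0, 80.0, 70.0],
})

# Materialize every version under its own column name.
for (name, version), func in VERSIONED.items():
    major = version.split(".")[0]
    toy[f"{name}_v{major}"] = func(toy)
```

A model trained on `avg_usage_last_3m_v1` keeps reading that column, while new models can opt in to `avg_usage_last_3m_v2`.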


## Step 6 – Training vs Serving Consistency

Features must be computed **identically** in training and serving.
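
One way to enforce this is to route both paths through the same function and test that they agree. The sketch below assumes the serving path has access to a customer's recent events; `serve_features` is a hypothetical helper and the toy data is invented.

```python
import pandas as pd


def avg_usage_last_3_months(df):
    """Single source of truth used by both training and serving."""
    return (
        df.sort_values("event_date")
        .groupby("customer_id")["avg_monthly_usage"]
        .rolling(3).mean()
        .reset_index(level=0, drop=True)
    )


history = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "event_date": pd.to_datetime(
        ["2022-01-01", "2022-02-01", "2022-03-01"] * 2
    ),
    "avg_monthly_usage": [60.0, 80.0, 70.0, 10.0, 20.0, 30.0],
})

# Training path: compute the feature for every row in batch.
train = history.assign(avg_usage_last_3m=avg_usage_last_3_months(history))


def serve_features(customer_id, events):
    """Hypothetical serving path: same function, one customer's events."""
    rows = events[events["customer_id"] == customer_id]
    latest = avg_usage_last_3_months(rows).loc[rows["event_date"].idxmax()]
    return {"avg_usage_last_3m": float(latest)}


online_value = serve_features(2, history)["avg_usage_last_3m"]
offline_value = train.loc[
    (train["customer_id"] == 2)
    & (train["event_date"] == pd.Timestamp("2022-03-01")),
    "avg_usage_last_3m",
].iloc[0]

# A parity check like this can run as a unit test.
assert abs(online_value - offline_value) < 1e-9
```

If the serving path reimplemented the rolling mean by hand, this parity check is where a mismatch would surface.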

## Offline vs Online Features

| Feature Type | Use |
|-------------|-----|
| Offline | Training, batch inference |
| Online | Real-time inference |

## Step 7 – Point-in-Time Correctness

A feature must use **only data available at prediction time**.
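
In pandas, a point-in-time join can be sketched with `pd.merge_asof`, which matches each prediction request to the most recent feature row at or before the prediction time. The `requests` frame and its `prediction_time` column are hypothetical; the feature values reuse the first three rows of the dataset preview.

```python
import pandas as pd

features = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "event_date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01"]),
    "avg_monthly_usage": [61.99, 85.53, 71.94],
})

# Hypothetical prediction requests: customer 1 is scored on two dates.
requests = pd.DataFrame({
    "customer_id": [1, 1],
    "prediction_time": pd.to_datetime(["2022-02-15", "2022-03-20"]),
})

# merge_asof needs both frames sorted by their time keys.
joined = pd.merge_asof(
    requests.sort_values("prediction_time"),
    features.sort_values("event_date"),
    left_on="prediction_time",
    right_on="event_date",
    by="customer_id",
    direction="backward",  # latest row at or before prediction_time
)
```

The request on 2022-02-15 picks up the February row, never the March one, so no future information leaks into the training example.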

## Step 8 – Feature Store Architecture

Typical components:
- Source data
- Feature definitions
- Offline store
- Online store
- Metadata store
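
To make the roles of these components concrete, here is a deliberately small in-memory sketch. `MiniFeatureStore` and all of its method names are hypothetical; a real system would back the offline store with a warehouse or file store and the online store with a low-latency key-value store.

```python
from dataclasses import dataclass, field
from typing import Optional

import pandas as pd


@dataclass
class MiniFeatureStore:
    """Hypothetical in-memory stand-in that wires the components together."""
    definitions: dict = field(default_factory=dict)  # feature definitions
    metadata: dict = field(default_factory=dict)     # metadata store
    offline: Optional[pd.DataFrame] = None           # offline store
    online: dict = field(default_factory=dict)       # online store

    def register(self, name, func, owner, version):
        self.definitions[name] = func
        self.metadata[name] = {"owner": owner, "version": version}

    def materialize(self, source, keys=("customer_id", "event_date")):
        """Compute every registered feature from the source data."""
        table = source[list(keys)].copy()
        for name, func in self.definitions.items():
            table[name] = func(source)
        self.offline = table
        return table

    def publish_online(self, entity="customer_id", time_col="event_date"):
        """Copy the latest offline row per entity into the online store."""
        latest = self.offline.sort_values(time_col).groupby(entity).tail(1)
        self.online = latest.set_index(entity).to_dict(orient="index")


def tenure_in_days(df):
    first_event = df.groupby("customer_id")["event_date"].transform("min")
    return (df["event_date"] - first_event).dt.days


# Source data (toy values for illustration).
source = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-01-15"]),
})

store = MiniFeatureStore()
store.register("tenure_days", tenure_in_days, owner="data_science", version="1.0")
store.materialize(source)
store.publish_online()
```

After `publish_online()`, `store.online[1]` holds only the most recent values for customer 1, while `store.offline` keeps the full history for training.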

## Example: Feature Store Table Schema

| customer_id | event_time | avg_usage_last_3m | tenure_days |
|-------------|------------|-------------------|-------------|
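
A schema like this implies two checks that are cheap to automate: the expected columns exist, and each `(customer_id, event_time)` pair appears once. The sketch below assumes those are the entity and time keys; `validate_feature_table` is a hypothetical helper, and the sample values come from the materialized output shown earlier.

```python
import pandas as pd

# Column names taken from the schema above.
KEY_COLUMNS = ["customer_id", "event_time"]
FEATURE_COLUMNS = ["avg_usage_last_3m", "tenure_days"]


def validate_feature_table(table: pd.DataFrame) -> None:
    """Hypothetical check: required columns exist and keys are unique."""
    missing = set(KEY_COLUMNS + FEATURE_COLUMNS) - set(table.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    if table.duplicated(subset=KEY_COLUMNS).any():
        raise ValueError("Duplicate (customer_id, event_time) rows found")


table = pd.DataFrame({
    "customer_id": [1, 1],
    "event_time": pd.to_datetime(["2022-03-01", "2022-04-01"]),
    "avg_usage_last_3m": [73.153333, 75.053333],
    "tenure_days": [59, 90],
})
validate_feature_table(table)  # passes silently when the table is valid
```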

## Step 9 – Feature Reuse Across Models

Well-defined features can be reused by:
- Churn models
- Uplift models
- Fraud models
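
In practice, reuse can be as simple as each model declaring which shared features it consumes. A small sketch, where `MODEL_FEATURES`, the model names, and `training_frame` are hypothetical:

```python
import pandas as pd

# One shared feature table (toy values for illustration).
feature_table = pd.DataFrame({
    "customer_id": [1, 2],
    "event_time": pd.to_datetime(["2022-03-01", "2022-03-01"]),
    "avg_usage_last_3m": [73.15, 40.0],
    "tenure_days": [59, 59],
})

# Hypothetical per-model feature lists drawn from the same table.
MODEL_FEATURES = {
    "churn_model": ["avg_usage_last_3m", "tenure_days"],
    "uplift_model": ["tenure_days"],
}


def training_frame(model_name):
    """Select the keys plus the features a given model has declared."""
    columns = ["customer_id", "event_time"] + MODEL_FEATURES[model_name]
    return feature_table[columns]


churn_X = training_frame("churn_model")
uplift_X = training_frame("uplift_model")
```

Both models read `tenure_days` from the same column, so a fix to its definition reaches every consumer at once.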


## Common Mistakes (Avoided)

- Feature logic that lives only inside notebooks
- No versioning
- Training–serving skew
- Recomputing features ad hoc


## Summary Table

| Concept | Purpose |
|-------|---------|
| Feature registry | Governance |
| Versioning | Stability |
| Point-in-time | No leakage |
| Reusability | Scalability |


## Key Takeaways

- Features are long-lived assets
- Centralization prevents chaos
- Versioning is mandatory
- Training–serving skew must be eliminated


## Next Section

04_Supervised_Learning/

└── [01_regression_models.ipynb](01_regression_models/)





# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

- [00_Data_Generation_and_Simulation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/00_Data_Generation_and_Simulation)


- [01_Exploratory_Data_Analysis_(EDA)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/01_Exploratory_Data_Analysis_(EDA))


- [02_Data_Preprocessing](https://github.com/lei-soares/Data-Science-Techniques/tree/main/02_Data_Preprocessing)


- [03_Feature_Engineering](https://github.com/lei-soares/Data-Science-Techniques/tree/main/03_Feature_Engineering)


- [04_Supervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/04_Supervised_Learning)


- [05_Unsupervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/05_Unsupervised_Learning)


- [06_Model_Evaluation_and_Validation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/06_Model_Evaluation_and_Validation)


- [07_Model_Tuning_and_Optimization](https://github.com/lei-soares/Data-Science-Techniques/tree/main/07_Model_Tuning_and_Optimization)


- [08_Interpretability_and_Explainability](https://github.com/lei-soares/Data-Science-Techniques/tree/main/08_Interpretability_and_Explainability)


- [09_Pipelines_and_Workflows](https://github.com/lei-soares/Data-Science-Techniques/tree/main/09_Pipelines_and_Workflows)


- [10_Natural_Language_Processing_(NLP)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/10_Natural_Language_Processing_(NLP))


- [11_Time_Series](https://github.com/lei-soares/Data-Science-Techniques/tree/main/11_Time_Series)


- [12_Anomaly_and_Fraud_Detection](https://github.com/lei-soares/Data-Science-Techniques/tree/main/12_Anomaly_and_Fraud_Detection)


- [13_Imbalanced_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/13_Imbalanced_Learning)


- [14_Deployment_and_Production_Concepts](https://github.com/lei-soares/Data-Science-Techniques/tree/main/14_Deployment_and_Production_Concepts)


- [15_Business_and_Experimental_Design](https://github.com/lei-soares/Data-Science-Techniques/tree/main/15_Business_and_Experimental_Design)