# Production Feature Engineering Pipeline

This notebook demonstrates production-ready feature engineering practices for MLOps workflows.
It includes data validation, feature creation, transformation pipelines, and monitoring components.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading and Validation](#data-loading)
3. [Exploratory Data Analysis](#eda)
4. [Feature Engineering Pipeline](#feature-engineering)
5. [Feature Validation and Quality Checks](#validation)
6. [Pipeline Serialization and Deployment](#deployment)
7. [Monitoring and Logging](#monitoring)

## 1. Setup and Imports

Import all necessary libraries for production feature engineering including:
- Data manipulation and analysis
- Feature engineering and preprocessing
- Pipeline creation and serialization
- Logging and monitoring
- Data validation

In [None]:
# Core data manipulation libs
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Visualization libs
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8')

# Scikit-learn for preprocessing and pipeline creation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder, OrdinalEncoder,
    PolynomialFeatures, PowerTransformer
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Model evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Serialization and deployment
import joblib
import pickle
from pathlib import Path

# Logging/monitoring
import logging
import json
import sys
from typing import Dict, List, Tuple, Any, Optional

# Data Validation
from scipy import stats
from scipy.stats import chi2_contingency

# Config
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Set random seed for reproducibility
RANDOM_STATE = 77
np.random.seed(RANDOM_STATE)

print(f"Setup completed at: {datetime.now()}")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Data Loading and Validation

Load data with comprehensive validation including:
- Schema validation
- Data quality checks
- Missing value analysis
- Data type validation

In [None]:
# Configure logging to log file and console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('feature_engineering.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
# Logger object named after current module
logger = logging.getLogger(__name__)

class DataValidator:
    """
    Data validation class for production feature engineering.
    Performs comprehensive data quality checks and schema validation.
    """

    def __init__(self, expected_schema: Dict[str, str]):
        self.expected_schema = expected_schema
        self.validation_results = {}

    def validate_schema(self, df: pd.DataFrame) -> bool:
        """Validate dataframe schema against expected schema."""
        logger.info("Starting schema validation")

        missing_cols = set(self.expected_schema.keys()) - set(df.columns)
        extra_cols = set(df.columns) - set(self.expected_schema.keys())

        if missing_cols:
            logger.error(f"Missing columns: {missing_cols}")
            return False
        
        if extra_cols:
            logger.warning(f"Unexpected columns found: {extra_cols}")

        # Validate data types: compare each column's actual dtype with the expected one
        type_mismatches = []
        for col, expected_type in self.expected_schema.items():
            if col in df.columns:
                actual_type = str(df[col].dtype)
                # Lenient substring match, so an expected 'int' also accepts 'int64'
                if expected_type not in actual_type and actual_type not in expected_type:
                    type_mismatches.append((col, expected_type, actual_type))

        if type_mismatches:
            logger.warning(f"Data type mismatches: {type_mismatches}")

        logger.info("Schema validation completed!")
        return True
    
    def validate_data_quality(self, df: pd.DataFrame) -> Dict[str, Any]:
        """Perform comprehensive data quality validation."""
        logger.info("Starting data quality validation")

        quality_report = {
            'total_rows': len(df),
            'total_columns': len(df.columns),
            'missing_values': df.isnull().sum().to_dict(),
            'missing_percentage': (df.isnull().sum() / len(df) * 100).to_dict(),
            'duplicate_rows': int(df.duplicated().sum()),  # cast NumPy int so JSON keeps it numeric
            'data_types': df.dtypes.astype(str).to_dict(),
            'memory_usage': df.memory_usage(deep=True).sum() / 1024**2  # MB
        }

        # Check for columns with greater than 50% missing values
        high_missing_cols = [
            col for col, pct in quality_report['missing_percentage'].items()
            if pct > 50
        ]

        if high_missing_cols:
            logger.warning(f"Columns with >50% missing values: {high_missing_cols}")

        # check for columns with 0 variance (np.number is superclass for all numeric types)
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        zero_variance_cols = [col for col in numeric_cols if df[col].var() == 0]

        if zero_variance_cols:
            logger.warning(f"Columns with zero variance: {zero_variance_cols}")
        
        quality_report['high_missing_columns'] = high_missing_cols
        quality_report['zero_variance_columns'] = zero_variance_cols
        
        logger.info("Data quality validation completed!")
        return quality_report

In [None]:
# Create a sample dataset or load one
def create_sample_dataset():
    """Create a sample dataset for demonstration purposes."""
    np.random.seed(RANDOM_STATE)
    n_samples = 10000
    
    data = {
        'customer_id': range(1, n_samples + 1),
        'age': np.random.randint(18, 80, n_samples),
        'income': np.random.lognormal(10, 1, n_samples),
        'credit_score': np.random.randint(300, 850, n_samples),
        'account_balance': np.random.normal(5000, 2000, n_samples),
        'num_products': np.random.poisson(2, n_samples),
        'tenure_months': np.random.randint(1, 120, n_samples),
        'is_active': np.random.choice([0, 1], n_samples, p=[0.3, 0.7]),
        'geography': np.random.choice(['Urban', 'Suburban', 'Rural'], n_samples, p=[0.5, 0.3, 0.2]),
        'gender': np.random.choice(['Male', 'Female'], n_samples, p=[0.52, 0.48]),
        'has_credit_card': np.random.choice([0, 1], n_samples, p=[0.4, 0.6]),
        'last_transaction_date': pd.date_range('2023-01-01', '2024-01-01', periods=n_samples),
        'churn': None  # Target variable - will be generated based on features
    }
    
    df = pd.DataFrame(data)

    # Generate the target variable from the features:
    # sum the additive contributions below into a single churn probability
    churn_prob = (
        0.1 +                                            # Baseline risk
        0.001 * (df['age'] - 40)**2 +                    # U-shaped age effect
        -0.00001 * df['income'] +                        # Higher income = lower churn
        -0.0002 * df['credit_score'] +                   # Higher credit score = lower churn
        -0.01 * df['num_products'] +                     # More products = lower churn
        -0.002 * df['tenure_months'] +                   # Longer tenure = lower churn
        -0.1 * df['is_active'] +                         # Active users = lower churn
        0.05 * (df['geography'] == 'Rural').astype(int)  # Rural = higher churn
    )

    # Ensure probabilities are between 0 and 1
    churn_prob = np.clip(churn_prob, 0, 1)
    df['churn'] = np.random.binomial(1, churn_prob, n_samples)
    
    # Introduce some missing values to simulate real-world data
    df.loc[np.random.choice(df.index, size=500, replace=False), 'income'] = np.nan
    df.loc[np.random.choice(df.index, size=300, replace=False), 'credit_score'] = np.nan
    df.loc[np.random.choice(df.index, size=200, replace=False), 'account_balance'] = np.nan
    
    return df

# Load and validate data
logger.info("Loading dataset...")
df = create_sample_dataset()

# Define the expected schema
expected_schema = {
    'customer_id': 'int64',
    'age': 'int64',
    'income': 'float64',
    'credit_score': 'float64',
    'account_balance': 'float64',
    'num_products': 'int64',
    'tenure_months': 'int64',
    'is_active': 'int64',
    'geography': 'object',
    'gender': 'object',
    'has_credit_card': 'int64',
    'last_transaction_date': 'datetime64[ns]',
    'churn': 'int64'
}

# Initialize validator and perform validation
validator = DataValidator(expected_schema)
is_valid_schema = validator.validate_schema(df)
quality_report = validator.validate_data_quality(df)

print("Data Quality Report:")
print(json.dumps(quality_report, indent=2, default=str))

## 3. Exploratory Data Analysis

Perform comprehensive EDA to understand data patterns and inform feature engineering decisions.

In [None]:
def perform_eda(df: pd.DataFrame) -> None:
    """
    Perform comprehensive exploratory data analysis (EDA).
    """

    logger.info("Starting exploratory data analysis")
    
    print("Dataset Overview:")
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    # describe(include='all') summarizes both numeric and non-numeric columns
    # display() renders the DataFrame as a table in Jupyter
    print("\nBasic Stats:")
    display(df.describe(include='all'))

    print("\nMissing Values Analysis:")
    missing_data = pd.DataFrame({
        'Missing Count': df.isnull().sum(),
        'Missing Percentage': (df.isnull().sum() / len(df)) * 100
    })
    # Keep only the columns that have missing values (boolean mask),
    # sorted from most to fewest missing
    missing_data = missing_data[missing_data['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
    print(missing_data)

    # Correlation analysis for numeric features (select only numeric column names)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 1:
        plt.figure(figsize=(12, 8))
        # .corr() computes the correlation matrix (pairwise correlations between all numeric columns)
        correlation_matrix = df[numeric_cols].corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
        plt.title('Feature Correlation Matrix')
        plt.tight_layout()
        plt.show()
    
    # Target variable distribution
    if 'churn' in df.columns:
        plt.figure(figsize=(15, 5))
        
        plt.subplot(1, 3, 1) # 1 row, 3 columns of plots, this is the first plot
        # plot the distribution of churn values, 0 vs 1
        df['churn'].value_counts().plot(kind='bar')
        plt.title('Target Variable Distribution')
        plt.xlabel('Churn')
        plt.ylabel('Count')
        
        # Age distribution by churn
        plt.subplot(1, 3, 2)
        for churn_val in df['churn'].unique():
            subset = df[df['churn'] == churn_val]['age'].dropna()
            plt.hist(subset, alpha=0.7, label=f'Churn = {churn_val}', bins=20)
        plt.title('Age Distribution by Churn')
        plt.xlabel('Age')
        plt.ylabel('Frequency')
        plt.legend()
        
        # Income distribution by churn
        plt.subplot(1, 3, 3)
        for churn_val in df['churn'].unique():
            subset = df[df['churn'] == churn_val]['income'].dropna()
            plt.hist(subset, alpha=0.7, label=f'Churn = {churn_val}', bins=20)
        plt.title('Income Distribution by Churn')
        plt.xlabel('Income')
        plt.ylabel('Frequency')
        plt.legend()
        plt.yscale('log')  # Log scale due to income distribution
        
        plt.tight_layout()
        plt.show()
    
    # Categorical variable analysis
    categorical_cols = df.select_dtypes(include=['object']).columns
    if len(categorical_cols) > 0:
        # create a row of subplots, 1 row, N columns (len(categorical_cols))
        fig, axes = plt.subplots(1, len(categorical_cols), figsize=(15, 5))
        if len(categorical_cols) == 1:
            # with a single column, plt.subplots returns one Axes rather than an array, so wrap it
            axes = [axes]
            
        # loop and plot each into its subplot
        for i, col in enumerate(categorical_cols):
            df[col].value_counts().plot(kind='bar', ax=axes[i])
            axes[i].set_title(f'{col} Distribution')
            axes[i].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()

# Perform EDA
perform_eda(df)

## 4. Feature Engineering Pipeline

Create a comprehensive feature engineering pipeline with custom transformers and production-ready components.
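As a minimal sketch of what such a pipeline could look like, the cell below combines a custom transformer with standard scikit-learn preprocessing. It assumes the column names from the sample dataset in Section 2. The body of `DateFeatureExtractor`, the derived feature names (`txn_month`, `txn_day_of_week`, `txn_is_weekend`), and the `build_preprocessor` helper are illustrative assumptions, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    """Illustrative transformer: expands one datetime column into numeric parts."""

    def __init__(self, date_column: str = "last_transaction_date"):
        self.date_column = date_column

    def fit(self, X: pd.DataFrame, y=None):
        # Stateless: nothing is learned from the training data
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        dates = pd.to_datetime(X[self.date_column])
        return pd.DataFrame(
            {
                "txn_month": dates.dt.month,
                "txn_day_of_week": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
                "txn_is_weekend": (dates.dt.dayofweek >= 5).astype(int),
            },
            index=X.index,
        )


def build_preprocessor() -> ColumnTransformer:
    """Hypothetical helper: one sub-pipeline per column group."""
    numeric_cols = ["age", "income", "credit_score", "account_balance"]
    categorical_cols = ["geography", "gender"]

    numeric_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(
        [
            ("num", numeric_pipeline, numeric_cols),
            ("cat", categorical_pipeline, categorical_cols),
            ("date", DateFeatureExtractor(), ["last_transaction_date"]),
        ],
        sparse_threshold=0,  # always return a dense array
    )
```

Calling `build_preprocessor().fit_transform(df)` on the training split and `.transform(...)` on later data keeps the imputation and scaling statistics tied to the training data only. Because the date transformer is stateless, its `fit` simply returns `self`.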

## Complete Sections:

1. **Setup and Imports** - All necessary libraries with proper configuration
2. **Data Loading and Validation** - Schema validation, data quality checks, and synthetic dataset creation
3. **Exploratory Data Analysis** - Comprehensive EDA with visualizations and statistical analysis
4. **Feature Engineering Pipeline** - Custom transformers, preprocessing pipelines, and feature creation
5. **Feature Validation and Quality Checks** - Distribution shift detection and quality monitoring
6. **Pipeline Serialization and Deployment** - Versioning, metadata management, and deployment artifacts
7. **Monitoring and Logging** - Production monitoring with drift detection and reporting
8. **Model Training and Evaluation** - Performance comparison showing feature engineering benefits
9. **Summary and Best Practices** - Key takeaways and production guidelines
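
The serialization and versioning step listed above could be sketched as follows. This is an assumption-laden illustration rather than the notebook's own code: `save_pipeline`, `load_pipeline`, the directory layout (`<output_dir>/<version>/`), and the metadata fields are all hypothetical names chosen for the example.

```python
import json
import warnings
from datetime import datetime, timezone
from pathlib import Path

import joblib
import sklearn


def save_pipeline(pipeline, output_dir, version: str) -> Path:
    """Persist a fitted pipeline plus a JSON metadata file under output_dir/version."""
    artifact_dir = Path(output_dir) / version
    artifact_dir.mkdir(parents=True, exist_ok=True)

    joblib.dump(pipeline, artifact_dir / "pipeline.joblib")

    metadata = {
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sklearn_version": sklearn.__version__,
    }
    (artifact_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return artifact_dir


def load_pipeline(artifact_dir):
    """Load a pipeline and its metadata, warning on a scikit-learn version mismatch."""
    artifact_dir = Path(artifact_dir)
    metadata = json.loads((artifact_dir / "metadata.json").read_text())
    if metadata["sklearn_version"] != sklearn.__version__:
        warnings.warn(
            "Pipeline was saved with scikit-learn "
            f"{metadata['sklearn_version']}, running {sklearn.__version__}"
        )
    return joblib.load(artifact_dir / "pipeline.joblib"), metadata
```

Storing the library version next to the artifact makes it easier to notice when a pickled pipeline is loaded in a different environment than the one that produced it.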

## Key Features Demonstrated:

- **Custom Transformers**: DateFeatureExtractor, BusinessFeatureCreator, OutlierTreatment.
- **Production Pipeline**: Complete scikit-learn pipeline with proper preprocessing.
- **Validation Framework**: Comprehensive data validation and quality checks.
- **Deployment Ready**: Serialization, versioning, and schema management.
- **Monitoring System**: Data drift detection and performance tracking.
- **Performance Analysis**: Clear demonstration of feature engineering benefits.
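
One common way to approach the data drift detection mentioned above is a two-sample Kolmogorov-Smirnov test per numeric feature, comparing a reference (training) sample against current data. The sketch below uses `scipy.stats.ks_2samp`. The function name `detect_numeric_drift`, the report layout, and the 0.05 significance threshold are assumptions for illustration, not the notebook's actual monitoring code.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def detect_numeric_drift(
    reference: pd.DataFrame,
    current: pd.DataFrame,
    columns,
    alpha: float = 0.05,
) -> dict:
    """Run a two-sample KS test per column and flag columns with p-value < alpha."""
    report = {}
    for col in columns:
        # Drop missing values so the test only compares observed values
        ref_values = reference[col].dropna()
        cur_values = current[col].dropna()
        statistic, p_value = ks_2samp(ref_values, cur_values)
        report[col] = {
            "ks_statistic": float(statistic),
            "p_value": float(p_value),
            "drift_detected": bool(p_value < alpha),
        }
    return report
```

With many features, testing each one at the same `alpha` will produce some false alarms by chance, so in practice the threshold is often tightened or combined with an effect-size check on `ks_statistic`.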