# Procurement KPI Analytics - Feature Engineering

**Objective**: Derive temporal, supplier, financial, quality, and risk features from the cleaned procurement data for downstream KPI analysis and modeling.

**Key Feature Categories**:
- Temporal features (seasonality, trends, lead time patterns)
- Supplier performance metrics (reliability, risk, efficiency)
- Financial engineering (cost optimization, savings patterns)
- Quality indicators (defect patterns, improvement trends)
- Category-based features (procurement patterns, category performance)
- Risk and compliance indicators
- Aggregated performance metrics

**Input**: Clean procurement dataset from data cleaning phase

**Output**: Feature-rich dataset ready for KPI analysis and modeling

---

## 1. Setup and Data Loading

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Any
import calendar

# Optional imports with fallback
try:
    from scipy import stats
    SCIPY_AVAILABLE = True
except ImportError:
    print("Warning: scipy not available. Some statistical features will be limited.")
    SCIPY_AVAILABLE = False

try:
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    SKLEARN_AVAILABLE = True
except ImportError:
    print("Warning: scikit-learn not available. Some feature scaling will be limited.")
    SKLEARN_AVAILABLE = False

# Configure display
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.3f}'.format)
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')

print("Feature engineering environment initialized")
print(f"Analysis timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Feature engineering environment initialized
Analysis timestamp: 2025-07-09 07:52:40


In [2]:
# Load cleaned dataset
data_path = '../data/processed/procurement_data_clean.csv'
try:
    df = pd.read_csv(data_path)
    print("Cleaned dataset loaded successfully")
    print(f"Dataset shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
except FileNotFoundError:
    print("Error: Cleaned dataset not found. Please run the data cleaning notebook first.")
    print(f"Expected file: '{data_path}'")
    raise  # Stop here: every later cell depends on df

# Convert date columns if needed
date_columns = ['Order_Date', 'Delivery_Date']
for col in date_columns:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

print(f"\nDataset date range: {df['Order_Date'].min()} to {df['Order_Date'].max()}")
print(f"Data types: {df.dtypes.value_counts().to_dict()}")

Cleaned dataset loaded successfully
Dataset shape: 777 rows x 19 columns

Dataset date range: 2022-01-01 00:00:00 to 2024-01-01 00:00:00
Data types: {dtype('float64'): 10, dtype('O'): 5, dtype('<M8[ns]'): 2, dtype('int64'): 2}
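
The feature functions in later cells read several derived columns (`lead_time_days`, `defect_rate`, `savings_percentage`, `total_negotiated_value`) and skip or fail when one is absent. A minimal sketch of an up-front column check follows; the helper `missing_columns` and the `REQUIRED_COLUMNS` list are illustrative assumptions rather than part of the original notebook, and the list only contains column names that appear in the cells of this notebook.

```python
import pandas as pd

# Assumed list: columns that later cells in this notebook read from the cleaned data.
REQUIRED_COLUMNS = [
    "PO_ID", "Supplier", "Order_Date", "Quantity", "Unit_Price",
    "Negotiated_Price", "Defective_Units", "Compliance", "lead_time_days",
    "defect_rate", "savings_percentage", "total_negotiated_value",
]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required columns that are absent from frame, in listed order."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for the cleaned dataset.
sample = pd.DataFrame({"PO_ID": ["PO-1"], "Supplier": ["A"], "Quantity": [10]})
print(missing_columns(sample, required=["PO_ID", "Supplier", "lead_time_days"]))
```

Running the check right after loading surfaces a schema problem once, instead of as a silently skipped feature group several cells later.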


## 2. Temporal Feature Engineering

In [3]:
# Create comprehensive temporal features
def create_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    df_temp = df.copy()
    
    print("Creating Temporal Features:")
    print("=" * 40)
    
    # Order Date Features
    if 'Order_Date' in df_temp.columns:
        print("Processing Order_Date features...")
        
        # Basic date components
        df_temp['order_year'] = df_temp['Order_Date'].dt.year
        df_temp['order_month'] = df_temp['Order_Date'].dt.month
        df_temp['order_quarter'] = df_temp['Order_Date'].dt.quarter
        df_temp['order_day_of_week'] = df_temp['Order_Date'].dt.dayofweek
        df_temp['order_day_of_month'] = df_temp['Order_Date'].dt.day
        df_temp['order_week_of_year'] = df_temp['Order_Date'].dt.isocalendar().week
        
        # Month and day names for interpretability
        df_temp['order_month_name'] = df_temp['Order_Date'].dt.month_name()
        df_temp['order_day_name'] = df_temp['Order_Date'].dt.day_name()
        
        # Seasonal indicators
        season_map = {
            12: 'Winter', 1: 'Winter', 2: 'Winter',
            3: 'Spring', 4: 'Spring', 5: 'Spring',
            6: 'Summer', 7: 'Summer', 8: 'Summer',
            9: 'Fall', 10: 'Fall', 11: 'Fall'
        }
        df_temp['order_season'] = df_temp['order_month'].map(season_map)
        
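        # Weekend and period-end indicators
        df_temp['order_is_weekend'] = df_temp['order_day_of_week'].isin([5, 6])  # Saturday=5, Sunday=6
        # Approximation: flags the 28th onward, not only the last calendar day
        df_temp['order_is_month_end'] = df_temp['order_day_of_month'] >= 28
        # Flags any order placed in a quarter-closing month (Mar, Jun, Sep, Dec)
        df_temp['order_is_quarter_end'] = df_temp['order_month'].isin([3, 6, 9, 12])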
        
        print(f"  Created 12 order date features")
    
    # Lead time features
    if 'lead_time_days' in df_temp.columns:
        print("Processing lead time features...")
        
        # Lead time categories
        df_temp['lead_time_category'] = pd.cut(df_temp['lead_time_days'], 
                                              bins=[-np.inf, 7, 14, 30, np.inf],
                                              labels=['Express', 'Standard', 'Extended', 'Long'])
        
        # Lead time efficiency indicators
        median_lead_time = df_temp['lead_time_days'].median()
        df_temp['lead_time_vs_median'] = df_temp['lead_time_days'] - median_lead_time
        df_temp['is_fast_delivery'] = df_temp['lead_time_days'] <= 7  # matches the 'Express' bin
        df_temp['is_slow_delivery'] = df_temp['lead_time_days'] > 30  # matches the 'Long' bin
        
        print(f"  Created 4 lead time features")
    
    return df_temp

# Apply temporal feature engineering
df_features = create_temporal_features(df)

# Display sample of temporal features
temporal_cols = [col for col in df_features.columns if any(x in col for x in ['order_', 'delivery_', 'lead_time_'])]
print(f"\nTemporal features created: {len(temporal_cols)}")

Creating Temporal Features:
Processing Order_Date features...
  Created 12 order date features
Processing lead time features...
  Created 4 lead time features

Temporal features created: 15
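
The period-end flags above are heuristics: `order_is_month_end` is true from the 28th onward, and `order_is_quarter_end` is true for the whole of March, June, September, and December. If exact calendar boundaries are wanted instead, pandas exposes them through the `.dt` accessor. The sketch below compares both definitions on a few made-up dates; the column names in `flags` are illustrative only.

```python
import pandas as pd

# Made-up order dates chosen to sit on or near period boundaries.
order_dates = pd.Series(
    pd.to_datetime(["2023-01-28", "2023-02-28", "2023-03-31", "2023-06-15"])
)

flags = pd.DataFrame({
    "order_date": order_dates,
    # Heuristics used in the cell above
    "day_ge_28": order_dates.dt.day >= 28,
    "quarter_end_month": order_dates.dt.month.isin([3, 6, 9, 12]),
    # Exact calendar boundaries
    "exact_month_end": order_dates.dt.is_month_end,
    "exact_quarter_end": order_dates.dt.is_quarter_end,
})
print(flags)
```

Here 28 January counts as month end under the heuristic but not under `is_month_end`, and 15 June falls in a quarter-end month without being the last day of the quarter. Which definition is appropriate depends on whether the analysis targets end-of-period ordering pressure (heuristic) or exact closing dates.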


## 3. Supplier Performance Features

In [4]:
# Create supplier performance features
def create_supplier_features(df: pd.DataFrame) -> pd.DataFrame:
    df_supplier = df.copy()
    
    print("Creating Supplier Performance Features:")
    print("=" * 40)
    
    if 'Supplier' not in df_supplier.columns:
        print("Warning: Supplier column not found")
        return df_supplier
    
    # Basic supplier statistics
    print("Calculating basic supplier statistics...")
    
    # Order volume features
    supplier_stats = df_supplier.groupby('Supplier').agg({
        'PO_ID': 'count',
        'Quantity': ['sum', 'mean'],
        'total_negotiated_value': ['sum', 'mean'],
        'lead_time_days': ['mean', 'std'],
        'defect_rate': ['mean', 'std'],
        'savings_percentage': ['mean', 'std']
    }).round(3)
    
    # Flatten column names
    supplier_stats.columns = ['_'.join(col).strip() if col[1] else col[0] for col in supplier_stats.columns]
    
    # Rename for clarity
    column_mapping = {
        'PO_ID_count': 'supplier_total_orders',
        'Quantity_sum': 'supplier_total_quantity',
        'Quantity_mean': 'supplier_avg_order_quantity',
        'total_negotiated_value_sum': 'supplier_total_spend',
        'total_negotiated_value_mean': 'supplier_avg_order_value',
        'lead_time_days_mean': 'supplier_avg_lead_time',
        'lead_time_days_std': 'supplier_lead_time_consistency',
        'defect_rate_mean': 'supplier_avg_defect_rate',
        'defect_rate_std': 'supplier_quality_consistency',
        'savings_percentage_mean': 'supplier_avg_savings_rate',
        'savings_percentage_std': 'supplier_savings_consistency'
    }
    
    supplier_stats = supplier_stats.rename(columns=column_mapping)
    
    # Merge back to main dataframe
    df_supplier = df_supplier.merge(supplier_stats, left_on='Supplier', right_index=True, how='left')
    
    # Create supplier performance scores
    print("Creating supplier performance scores...")
    
    # Delivery performance score (lower lead time is better).
    # Note: the percentile rank is taken over order rows after the merge, so
    # suppliers with many orders carry more weight in the ranking.
    if 'supplier_avg_lead_time' in df_supplier.columns:
        lead_time_percentile = df_supplier['supplier_avg_lead_time'].rank(pct=True, ascending=False)
        df_supplier['supplier_delivery_score'] = (lead_time_percentile * 100).round(1)
    
    # Quality performance score (lower defect rate is better)
    if 'supplier_avg_defect_rate' in df_supplier.columns:
        quality_percentile = df_supplier['supplier_avg_defect_rate'].rank(pct=True, ascending=False)
        df_supplier['supplier_quality_score'] = (quality_percentile * 100).round(1)
        # Suppliers with no defect data receive the top score
        df_supplier['supplier_quality_score'] = df_supplier['supplier_quality_score'].fillna(100)
    
    # Overall supplier score (composite)
    score_columns = ['supplier_delivery_score', 'supplier_quality_score']
    available_scores = [col for col in score_columns if col in df_supplier.columns]
    
    if available_scores:
        df_supplier['supplier_overall_score'] = df_supplier[available_scores].mean(axis=1).round(1)
    
    # Create supplier tier classification
    if 'supplier_overall_score' in df_supplier.columns:
        df_supplier['supplier_tier'] = pd.cut(df_supplier['supplier_overall_score'],
                                            bins=[0, 60, 80, 90, 100],
                                            labels=['Poor', 'Average', 'Good', 'Excellent'])
    
    print(f"  Created supplier performance features")
    
    return df_supplier

# Apply supplier feature engineering
df_features = create_supplier_features(df_features)

# Display supplier feature summary
supplier_cols = [col for col in df_features.columns if 'supplier_' in col]
print(f"\nSupplier features created: {len(supplier_cols)}")

Creating Supplier Performance Features:
Calculating basic supplier statistics...
Creating supplier performance scores...
  Created supplier performance features

Supplier features created: 15
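
The delivery and quality scores above are percentile ranks computed on the merged, order-level frame, so a supplier's score depends on how many orders every supplier has. One alternative is to rank once per supplier and then map the score back to the orders. The sketch below contrasts the two on toy data; the supplier names and lead times are invented for illustration.

```python
import pandas as pd

# Toy orders: supplier A has three orders, B and C have one each.
orders = pd.DataFrame({
    "Supplier": ["A", "A", "A", "B", "C"],
    "lead_time_days": [5, 7, 6, 12, 20],
})

supplier_avg = orders.groupby("Supplier")["lead_time_days"].mean()

# Supplier-level ranking: one rank per supplier.
# ascending=False gives the longest average lead time the lowest percentile.
supplier_level = (supplier_avg.rank(pct=True, ascending=False) * 100).round(1)

# Row-level ranking, as in the cell above: ranks the per-order copies of the average.
row_level = (
    orders["Supplier"].map(supplier_avg).rank(pct=True, ascending=False) * 100
).round(1)

orders["score_supplier_level"] = orders["Supplier"].map(supplier_level)
orders["score_row_level"] = row_level
print(orders)
```

In this toy case supplier A scores 100.0 when ranked among suppliers but 80.0 when ranked among rows, because its three tied rows share an averaged rank. Neither is wrong, but the tier thresholds (60, 80, 90) mean different things under each definition.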


## 4. Financial Features

In [5]:
# Create financial features
def create_financial_features(df: pd.DataFrame) -> pd.DataFrame:
    df_financial = df.copy()
    
    print("Creating Financial Features:")
    print("=" * 40)
    
    # Price analysis features
    if all(col in df_financial.columns for col in ['Unit_Price', 'Negotiated_Price', 'savings_percentage']):
        print("Creating price analysis features...")
        
        # Price change metrics (positive absolute change = saving per unit)
        df_financial['price_change_absolute'] = df_financial['Unit_Price'] - df_financial['Negotiated_Price']
        # Treat a zero list price as missing to avoid division by zero
        df_financial['price_change_ratio'] = df_financial['Negotiated_Price'] / df_financial['Unit_Price'].replace(0, np.nan)
        
        # Negotiation effectiveness
        df_financial['negotiation_effectiveness'] = np.where(
            df_financial['savings_percentage'] > 0, 'Successful',
            np.where(df_financial['savings_percentage'] < 0, 'Price_Increase', 'No_Change')
        )
        
        print(f"  Created 3 price analysis features")
    
    # Order value analysis
    if 'total_negotiated_value' in df_financial.columns:
        print("Creating order value features...")
        
        # Order size categories (bottom quartile, middle half, top quartile)
        q25 = df_financial['total_negotiated_value'].quantile(0.25)
        q75 = df_financial['total_negotiated_value'].quantile(0.75)
        
        # include_lowest=True keeps zero-value orders in the 'Small' bin
        df_financial['order_size_category'] = pd.cut(df_financial['total_negotiated_value'],
                                                   bins=[0, q25, q75, np.inf],
                                                   labels=['Small', 'Medium', 'Large'],
                                                   include_lowest=True)
        
        # High-value order indicator
        value_95th = df_financial['total_negotiated_value'].quantile(0.95)
        df_financial['is_high_value_order'] = df_financial['total_negotiated_value'] >= value_95th
        
        print(f"  Created 2 order value features")
    
    return df_financial

# Apply financial feature engineering
df_features = create_financial_features(df_features)

# Display financial feature summary
financial_cols = [col for col in df_features.columns if any(x in col for x in ['price_', 'order_size_', 'negotiation_'])]
print(f"\nFinancial features created: {len(financial_cols)}")

Creating Financial Features:
Creating price analysis features...
  Created 3 price analysis features
Creating order value features...
  Created 2 order value features

Financial features created: 4
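
The nested `np.where` used for `negotiation_effectiveness` sends every row that is neither positive nor negative to `'No_Change'`, which includes rows where `savings_percentage` is missing. A minimal sketch of a variant that keeps missing values separate is shown below using `np.select`; the `'Unknown'` label is an assumption introduced here and does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy savings values, including a missing one.
savings = pd.Series([12.5, 0.0, -3.0, np.nan], name="savings_percentage")

conditions = [savings > 0, savings < 0, savings == 0]
choices = ["Successful", "Price_Increase", "No_Change"]

# Rows matching no condition (NaN compares False everywhere) fall to the default.
effectiveness = pd.Series(
    np.select(conditions, choices, default="Unknown"),  # 'Unknown' is a hypothetical label
    index=savings.index,
    name="negotiation_effectiveness",
)
print(effectiveness.tolist())
```

Keeping missing savings out of `'No_Change'` avoids overstating the share of orders where negotiation had no effect.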


## 5. Quality Features

In [6]:
# Create quality features
def create_quality_features(df: pd.DataFrame) -> pd.DataFrame:
    df_quality = df.copy()
    
    print("Creating Quality Features:")
    print("=" * 40)
    
    # Enhanced defect analysis
    if all(col in df_quality.columns for col in ['Defective_Units', 'Quantity', 'defect_rate']):
        print("Creating defect analysis features...")
        
        # Defect severity categories; include_lowest=True keeps a 0% defect
        # rate in the first bin instead of leaving it unclassified (NaN)
        df_quality['defect_severity'] = pd.cut(df_quality['defect_rate'],
                                             bins=[0, 1, 5, 10, 100],
                                             labels=['Excellent', 'Good', 'Poor', 'Critical'],
                                             include_lowest=True)
        
        # Perfect order indicator
        df_quality['is_perfect_order'] = (df_quality['Defective_Units'] == 0)
        
        print(f"  Created 2 defect analysis features")
    
    # Compliance features
    if 'Compliance' in df_quality.columns:
        print("Creating compliance features...")
        
        # Compliance scoring
        compliance_map = {
            'Compliant': 100,
            'Non-Compliant': 0,
            'Unknown': 50
        }
        df_quality['compliance_score'] = df_quality['Compliance'].map(compliance_map)
        
        print(f"  Created 1 compliance feature")
    
    return df_quality

# Apply quality feature engineering
df_features = create_quality_features(df_features)

# Display quality feature summary
quality_cols = [col for col in df_features.columns if any(x in col for x in ['defect_', 'compliance_', 'perfect_'])]
print(f"\nQuality features created: {len(quality_cols)}")

Creating Quality Features:
Creating defect analysis features...
  Created 2 defect analysis features
Creating compliance features...
  Created 1 compliance feature

Quality features created: 5
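
Bin edges matter for the defect severity feature. `pd.cut` builds right-closed intervals by default, so with a first edge of 0 the first bin is `(0, 1]` and a defect rate of exactly 0 falls outside every bin unless `include_lowest=True` is passed. The sketch below shows the difference on a handful of made-up rates, assuming `defect_rate` is expressed in percent as the 0 to 100 bin range suggests.

```python
import pandas as pd

# Made-up defect rates in percent, including a defect-free order.
defect_rate = pd.Series([0.0, 0.5, 1.0, 4.0, 12.0])
bins = [0, 1, 5, 10, 100]
labels = ["Excellent", "Good", "Poor", "Critical"]

# Default: first interval is (0, 1], so 0.0 is left unclassified (NaN).
default_cut = pd.cut(defect_rate, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1].
inclusive_cut = pd.cut(defect_rate, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "defect_rate": defect_rate,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

Since defect-free orders are also the ones flagged by `is_perfect_order`, leaving them unclassified would remove the best-performing orders from any analysis grouped by `defect_severity`.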


## 6. Risk Assessment Features

In [7]:
# Create risk assessment features
def create_risk_features(df: pd.DataFrame) -> pd.DataFrame:
    df_risk = df.copy()
    
    print("Creating Risk Assessment Features:")
    print("=" * 40)
    
    # Overall performance risk score
    print("Creating performance risk scores...")
    
    # Initialize risk components
    risk_components = {}
    
    # Delivery risk
    if 'lead_time_days' in df_risk.columns:
        lead_time_95th = df_risk['lead_time_days'].quantile(0.95)
        risk_components['delivery_risk'] = (df_risk['lead_time_days'] > lead_time_95th).astype(int)
    
    # Quality risk
    if 'defect_rate' in df_risk.columns:
        risk_components['quality_risk'] = (df_risk['defect_rate'] > 5).astype(int)
    
    # Compliance risk
    if 'Compliance' in df_risk.columns:
        risk_components['compliance_risk'] = (df_risk['Compliance'] == 'Non-Compliant').astype(int)
    
    # Add risk components to dataframe
    for risk_name, risk_values in risk_components.items():
        df_risk[risk_name] = risk_values
    
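    # Calculate composite risk score (number of risk flags raised per order)
    if risk_components:
        df_risk['composite_risk_score'] = sum(risk_components.values())
        # Open-ended top bin keeps the edges strictly increasing even when
        # fewer than three risk components are available
        df_risk['risk_level'] = pd.cut(df_risk['composite_risk_score'],
                                     bins=[-1, 0, 1, 2, np.inf],
                                     labels=['Low', 'Medium', 'High', 'Critical'])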
    
    print(f"  Created {len(risk_components) + 2} risk scoring features")
    
    return df_risk

# Apply risk feature engineering
df_features = create_risk_features(df_features)

# Display risk feature summary
risk_cols = [col for col in df_features.columns if 'risk' in col]
print(f"\nRisk features created: {len(risk_cols)}")

Creating Risk Assessment Features:
Creating performance risk scores...
  Created 5 risk scoring features

Risk features created: 5
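
The composite risk score is a count of binary flags, and the risk level maps that count to a label. A small worked example makes the mapping concrete. The data below is invented, and the delivery cutoff is fixed at 30 days purely for readability, whereas the cell above derives it from the 95th percentile of `lead_time_days`.

```python
import numpy as np
import pandas as pd

# Invented orders covering every possible flag count from 0 to 3.
orders = pd.DataFrame({
    "lead_time_days": [5, 9, 40, 45],
    "defect_rate": [0.0, 6.0, 2.0, 8.0],
    "Compliance": ["Compliant", "Compliant", "Non-Compliant", "Non-Compliant"],
})

lead_time_cutoff = 30  # assumption for this toy example only

flags = pd.DataFrame({
    "delivery_risk": (orders["lead_time_days"] > lead_time_cutoff).astype(int),
    "quality_risk": (orders["defect_rate"] > 5).astype(int),
    "compliance_risk": (orders["Compliance"] == "Non-Compliant").astype(int),
})

# One point per raised flag.
orders["composite_risk_score"] = flags.sum(axis=1)

# Score 0 -> Low, 1 -> Medium, 2 -> High, 3 or more -> Critical.
orders["risk_level"] = pd.cut(
    orders["composite_risk_score"],
    bins=[-1, 0, 1, 2, np.inf],
    labels=["Low", "Medium", "High", "Critical"],
)
print(orders[["composite_risk_score", "risk_level"]])
```

Each flag carries equal weight in this scheme, so a single non-compliant order and a single late order both land in `Medium`.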


## 7. Feature Summary

In [8]:
# Comprehensive feature summary
print("Feature Engineering Summary:")
print("=" * 50)

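# Count features by category.
# Note: categories are matched by substring, so they overlap (for example,
# 'supplier_delivery_score' counts as both Temporal and Supplier) and pick up
# original columns such as 'lead_time_days'. The counts do not sum to the total.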
feature_categories = {
    'Original Features': [col for col in df.columns],
    'Temporal Features': [col for col in df_features.columns if any(x in col for x in ['order_', 'delivery_', 'lead_time_'])],
    'Supplier Features': [col for col in df_features.columns if 'supplier_' in col],
    'Financial Features': [col for col in df_features.columns if any(x in col for x in ['price_', 'order_size_', 'negotiation_'])],
    'Quality Features': [col for col in df_features.columns if any(x in col for x in ['defect_', 'compliance_', 'perfect_'])],
    'Risk Features': [col for col in df_features.columns if 'risk' in col]
}

total_original = len(feature_categories['Original Features'])
total_current = len(df_features.columns)
total_new = total_current - total_original

print(f"Original features: {total_original}")
print(f"Total features after engineering: {total_current}")
print(f"New features created: {total_new}")
print(f"Feature expansion ratio: {total_current/total_original:.2f}x")

print("\nFeatures by Category:")
for category, features in feature_categories.items():
    print(f"  {category}: {len(features)} features")

Feature Engineering Summary:
Original features: 19
Total features after engineering: 63
New features created: 44
Feature expansion ratio: 3.32x

Features by Category:
  Original Features: 19 features
  Temporal Features: 21 features
  Supplier Features: 15 features
  Financial Features: 4 features
  Quality Features: 6 features
  Risk Features: 5 features
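
The per-category counts above add up to more than the 63 columns in the dataset because substring matching lets one column fall into several categories. If disjoint counts are needed, one option is to record which columns each engineering step adds. The sketch below does this with a hypothetical helper, `apply_and_track`, and two toy step functions standing in for the notebook's `create_*_features` functions.

```python
import pandas as pd


def apply_and_track(frame, steps):
    """Run each (name, function) step in order and record the columns it added."""
    added = {}
    for name, step in steps:
        before = set(frame.columns)
        frame = step(frame)
        added[name] = [col for col in frame.columns if col not in before]
    return frame, added


# Toy steps; the real notebook functions would be passed in the same way.
def add_temporal(frame):
    out = frame.copy()
    out["order_month"] = out["Order_Date"].dt.month
    return out


def add_supplier(frame):
    out = frame.copy()
    out["supplier_avg_order_value"] = out.groupby("Supplier")["value"].transform("mean")
    return out


toy = pd.DataFrame({
    "Supplier": ["A", "A", "B"],
    "Order_Date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-02-11"]),
    "value": [10.0, 30.0, 5.0],
})

toy_features, added_by_step = apply_and_track(
    toy, [("Temporal", add_temporal), ("Supplier", add_supplier)]
)
print(added_by_step)
```

With this bookkeeping `supplier_avg_order_value` is attributed only to the supplier step, even though its name contains `order_` and would also match the temporal substring filter.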


## 8. Export Engineered Features

In [9]:
# Export engineered dataset
import os

# Ensure output directory exists
os.makedirs('../data/processed', exist_ok=True)

print("Exporting Feature Dataset:")
print("=" * 50)

# Export main feature dataset
output_path = '../data/processed/procurement_features_engineered.csv'
df_features.to_csv(output_path, index=False)
print(f"Feature dataset exported to: {output_path}")
print(f"Records: {len(df_features):,} | Features: {len(df_features.columns)}")

# Create and export feature metadata
feature_metadata = pd.DataFrame({
    'Feature_Name': df_features.columns,
    'Data_Type': df_features.dtypes,
    'Non_Null_Count': df_features.count(),
    'Null_Count': df_features.isnull().sum(),
    'Null_Percentage': (df_features.isnull().sum() / len(df_features) * 100).round(2)
})

metadata_path = '../data/processed/feature_metadata.csv'
feature_metadata.to_csv(metadata_path, index=False)
print(f"Feature metadata exported to: {metadata_path}")

# Export feature summary
summary_path = '../data/processed/feature_engineering_summary.txt'
with open(summary_path, 'w') as f:
    f.write("PROCUREMENT FEATURE ENGINEERING SUMMARY\n")
    f.write("=" * 50 + "\n")
    f.write(f"Engineering Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write(f"Original Features: {total_original}\n")
    f.write(f"Engineered Features: {total_current}\n")
    f.write(f"New Features Created: {total_new}\n")
    f.write(f"Feature Expansion Ratio: {total_current/total_original:.2f}x\n")
    
    f.write("\nFEATURE CATEGORIES:\n")
    for category, features in feature_categories.items():
        f.write(f"  {category}: {len(features)} features\n")

print(f"Feature summary exported to: {summary_path}")

print(f"\nFeature Engineering Complete!")
print(f"Dataset enhanced from {total_original} to {total_current} features")
print(f"Ready for KPI analysis and modeling")

Exporting Feature Dataset:
Feature dataset exported to: ../data/processed/procurement_features_engineered.csv
Records: 777 | Features: 63
Feature metadata exported to: ../data/processed/feature_metadata.csv
Feature summary exported to: ../data/processed/feature_engineering_summary.txt

Feature Engineering Complete!
Dataset enhanced from 19 to 63 features
Ready for KPI analysis and modeling
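
CSV stores no type information, so datetime columns and the ordered categories produced by `pd.cut` come back as plain text when the exported file is read in a later notebook. A minimal sketch of restoring them on load follows; it round-trips a toy frame through an in-memory buffer rather than the real export path, and the category order is copied from the `lead_time_category` labels defined earlier.

```python
import io

import pandas as pd

lead_time_order = ["Express", "Standard", "Extended", "Long"]

# Toy frame with one datetime column and one ordered categorical column.
features = pd.DataFrame({
    "Order_Date": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "lead_time_category": pd.Categorical(
        ["Express", "Long"], categories=lead_time_order, ordered=True
    ),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
features.to_csv(buffer, index=False)

# Naive reload: both columns come back as text.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload with explicit date parsing, then rebuild the ordered categorical.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Order_Date"])
restored["lead_time_category"] = pd.Categorical(
    restored["lead_time_category"], categories=lead_time_order, ordered=True
)

print(naive.dtypes)
print(restored.dtypes)
```

The `Data_Type` column in `feature_metadata.csv` can serve as the reference for which columns need this treatment. A typed format such as Parquet would avoid the issue, at the cost of an extra dependency.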


---

## Feature Engineering Complete!

**Major Accomplishments:**
- Created comprehensive temporal features for seasonality and trend analysis
- Developed supplier performance scoring and assessment metrics
- Engineered financial and cost optimization features
- Built quality and compliance monitoring indicators
- Implemented risk assessment scoring systems

**Key Feature Categories Created:**
- **Temporal Features**: Seasonality, business cycles, lead time patterns
- **Supplier Features**: Performance scores, tier classifications
- **Financial Features**: Price analysis, order categorization, negotiation effectiveness
- **Quality Features**: Defect severity, compliance scoring, perfect order tracking
- **Risk Features**: Composite risk scoring, performance risk assessment

**Ready for Next Phase:**
- **KPI Analysis** (Notebook 04) - Calculate and visualize key performance indicators
- **Supplier Performance Analysis** (Notebook 05) - Deep dive into supplier metrics
- **Predictive Modeling** (Notebook 06) - Build forecasting and prediction models

**Files Generated:**
- `procurement_features_engineered.csv` - Enhanced dataset with engineered features
- `feature_metadata.csv` - Per-feature data type, non-null count, and null percentage
- `feature_engineering_summary.txt` - Plain-text summary of feature counts by category

---